ARG (Age, Race, Gender) Detection Using Transfer Learning in TensorFlow
With the advent of deep networks, detecting attributes from input face images has attracted much attention; however, existing results often lack accuracy because elaborate architectures are expensive to train and tend to settle on sub-optimal weights. In this project, we explore the problem of predicting three attributes of a human face (age, race, and gender) from a pre-trained FaceNet model using transfer learning. Leveraging a pre-trained model makes it comparatively easy to reach satisfying performance with deep learning in the TensorFlow framework. We achieved gender, race, and age recognition accuracies of 93\%, 84\%, and 73\%, respectively. Finally, we developed an online prediction module that automatically detects a face in an image using YOLO and passes it to the network to generate predictions in real time. TensorBoard histograms are rendered to showcase and compare the results.
Methodology
- Data-set preparation
The image dataset with gender, race, and age labels used in this project was UTKFace. It contains 23,708 face images with an even distribution of gender but uneven distributions of race and age. Each file contains a single face with its corresponding labels. We divided the dataset into 80% training, 10% validation, and 10% test data.
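The 80/10/10 split described above can be sketched as follows; the seed and filename pattern are illustrative, not taken from the project code:

```python
import random

def split_dataset(paths, seed=0):
    """Shuffle file paths and split them 80/10/10 into train/val/test."""
    paths = list(paths)
    random.Random(seed).shuffle(paths)
    n = len(paths)
    n_train = int(0.8 * n)
    n_val = int(0.1 * n)
    return (paths[:n_train],
            paths[n_train:n_train + n_val],
            paths[n_train + n_val:])

# With the 23,708 UTKFace images this yields 18,966 / 2,370 / 2,372 files.
train, val, test = split_dataset([f"img_{i}.jpg" for i in range(23708)])
```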
- Data-set Augmentation
The dataset as given has an uneven distribution over the race and age classes. This can adversely affect learning and skew the model toward the classes with more data points. To alleviate this issue, we performed data augmentation on the four under-represented race categories (excluding the class white) and on 20 of the age categories for un-binned age prediction. Binned age prediction did not require augmentation because the bins were chosen to produce a uniform distribution. Augmentation applied one of four random adjustments: rotating the image, flipping it vertically, adding Gaussian noise, or adding Gaussian noise and flipping vertically. After augmentation, the dataset contains approximately 40,000 images.
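The four adjustments above can be sketched with NumPy; the noise standard deviation is an assumed value, and a 90-degree rotation stands in for the pipeline's rotation to keep the sketch dependency-free:

```python
import numpy as np

def augment(img, choice, rng):
    """Apply one of the four adjustments described above.
    choice: 0 = rotation, 1 = vertical flip,
    2 = Gaussian noise, 3 = Gaussian noise + vertical flip."""
    if choice == 0:
        # The real pipeline would rotate by a small random angle.
        return np.rot90(img)
    if choice == 1:
        return np.flipud(img)
    # Additive Gaussian noise, clipped back to the valid pixel range.
    noisy = np.clip(img + rng.normal(0.0, 10.0, img.shape), 0, 255)
    return noisy if choice == 2 else np.flipud(noisy)
```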
- Image alignment
The alignment module uses dlib to detect a face and extract landmarks; the landmarks are then adjusted to make the input faces uniform. The dataset we used (UTKFace) already has most faces cropped with the facial features centred in the picture. However, raw images and frames captured during a live session need to be cropped and aligned for accurate predictions.
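One core step of landmark-based alignment is levelling the eyes. A minimal sketch, assuming the eye positions have already been obtained from a landmark detector such as dlib's:

```python
import numpy as np

def eye_roll_angle(left_eye, right_eye):
    """Angle (in degrees) by which to rotate the face so that both
    eyes lie on a horizontal line. Eye positions are (x, y) pixel
    coordinates, e.g. averaged from dlib's 68-point landmarks."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return np.degrees(np.arctan2(dy, dx))

# A face whose right eye sits 10 px lower than the left needs a
# rotation of roughly 9.5 degrees to be levelled.
angle = eye_roll_angle((100, 120), (160, 130))
```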
- Extracting labels
The network was run under TensorFlow-GPU 1.12; a base code from \cite{GR} was acquired and modified. Each image embeds its labels in the filename, so age, race, and gender were extracted from the image name and written, together with the augmented image, into TensorFlow records, which provide a seamless data pipeline. The TensorFlow graph was modified by adding age prediction to the base code.
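Label extraction from filenames can be sketched as below, assuming the standard UTKFace naming scheme `[age]_[gender]_[race]_[date&time].jpg` (gender: 0 = male, 1 = female); the example path is hypothetical:

```python
import os

def parse_utkface_labels(path):
    """Extract (age, gender, race) from a UTKFace-style filename of the
    form [age]_[gender]_[race]_[date&time].jpg."""
    stem = os.path.basename(path).split(".")[0]
    age, gender, race = stem.split("_")[:3]
    return int(age), int(gender), int(race)

# e.g. a 25-year-old male of race class 2:
labels = parse_utkface_labels("data/25_0_2_20170116174525125.jpg")
```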
- Overall structure
For this project we used the weights of FaceNet trained on the VGGFace2 dataset as the baseline for transfer learning, combined with a multi-task learning approach. The network has shared layers (FaceNet in our project) that identify common features, followed by three separate auxiliary layers for age, race, and gender detection respectively. All three output layers use softmax cross-entropy as the loss function, and the shared layers' weights are optimized by minimizing the weighted sum of the three losses. Each auxiliary layer has 128 neurons with linear activation.
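The combined objective can be sketched in NumPy; the task weights here are illustrative placeholders, not the values used in the project:

```python
import numpy as np

def softmax_xent(logits, label):
    """Softmax cross-entropy for one example (numerically stable)."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def total_loss(age_logits, race_logits, gender_logits, labels,
               w=(1.0, 1.0, 1.0)):
    """Weighted sum of the three task losses; the shared FaceNet
    layers are optimized against this combined objective."""
    losses = [softmax_xent(l, y) for l, y in
              zip((age_logits, race_logits, gender_logits), labels)]
    return sum(wi * li for wi, li in zip(w, losses))
```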
- Multi-task learning and variations of the model
We took the network from \cite{GR}, previously designed to predict race and gender, and modified its structure to add another task (age detection). Our dataset covers ages 0 to 116, and since the network performs classification, we had to define age classes. We tried two variants. In the first, we used one class per age up to 62 and a single class for ages 63 and older. We trained this network for 150 epochs ($\approx 7$ hours) and generated the results (explained in more detail in the results section), assuming a learning rate close to the one chosen for gender. The results make clear that this learning rate should be tuned separately, because race and gender converged much faster than age; however, such tuning would have required long rounds of retraining that our limited time frame did not allow, and it was not guaranteed to raise the results to our standards, since the mismatch in the number of classes between the tasks also degraded performance. We therefore switched to a binning method for age classification, using the age classes defined in the objective section. With an overall number of 9 age classes, we retrained the network. This second approach performed much better after one third of the training epochs of the first, i.e. 50 epochs ($\approx 2.5$ hours).
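Mapping an age to one of the 9 bins reduces to a sorted-edge lookup. The bin edges below are purely illustrative; the actual bins are those defined in the objective section to give a roughly uniform class distribution:

```python
import numpy as np

# Illustrative edges only -- 8 edges partition ages 0..116 into 9 classes.
BIN_EDGES = [3, 8, 14, 22, 30, 40, 52, 65]

def age_to_bin(age):
    """Map an integer age (0-116) to one of 9 class indices."""
    return int(np.searchsorted(BIN_EDGES, age, side="right"))
```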
Both variants of the network were trained on an RTX 2080Ti. During training, progress-tracking variables were recorded with tf.summary and saved, which allowed training progress to be visualized through TensorBoard. The final results of the trained networks can be seen in the results section; the charts were produced by TensorBoard after training was complete. The trained model was later used for the purpose of online prediction.