Heterogeneous Face Recognition from Facial Sketches

Introduction
Generative adversarial networks (GANs) [2] have gained enormous popularity in recent years thanks to their ability to generate photo-realistic results and to capture important image details in image-to-image translation tasks. In this paper, a novel approach called X-Bridge is presented. X-Bridge is designed specifically as a cross-modal bridge for the heterogeneous face recognition task. X-Bridge is a supervised method whose structure is based on a conditional GAN; however, it also assumes a shared latent space across the two different domains. To fully demonstrate the abilities of the X-Bridge approach, we test it on the arguably very challenging task of facial sketch-to-image translation using the CUFSF dataset [3].

X-Bridge method
X-Bridge contains two main paths: a translation path and a reconstruction path. These paths can be imagined as two separate GANs. Each has its own generator and discriminator, while both share one encoder. Each path also has its own specific task. The task of the translation path, based on a conditional GAN, is to translate an input image from its domain into the other domain. The task of the reconstruction path, based on a vanilla GAN, is to reconstruct the original input. Through this process, the reconstruction path encourages the shared encoder to preserve important features, to generalize better, and to learn important regularities. To further improve feature propagation through the networks, skip connections in the form of channel concatenation are added between the last four layers of the encoder and the first four layers of the generators. Both paths utilize the traditional adversarial loss, which can be expressed for the domain translation problem as follows:

\mathcal{L}_{adv}(EG, D) = \mathbb{E}_{x,\hat{x}}\left[\log D(x, \hat{x})\right] + \mathbb{E}_{x,z}\left[\log\left(1 - D(x, EG(x, z))\right)\right],

where x is a real image from the first domain, \hat{x} is a real image from the second domain, the encoder and the generator together are denoted as EG, D stands for the discriminator, and z is a vector from the shared latent space.
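To make the two-path layout concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: the layer counts, channel widths, normalization choices, and input resolution (128x128 here) are assumptions, and the skip wiring follows standard U-Net practice of concatenating the encoder activation at the matching resolution. Only the shared encoder, the two per-path generators, and the channel-concatenation skips follow the description above.

```python
import torch
import torch.nn as nn

def down(in_ch, out_ch):
    # Stride-2 conv block: halves the spatial resolution.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 4, stride=2, padding=1),
        nn.InstanceNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True),
    )

def up(in_ch, out_ch):
    # Stride-2 transposed-conv block: doubles the spatial resolution.
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1),
        nn.InstanceNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class SharedEncoder(nn.Module):
    """E: image -> latent code z in the shared space, plus skip features."""
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList([
            down(3, 64), down(64, 128), down(128, 256),
            down(256, 512), down(512, 512),
        ])

    def forward(self, x):
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x)
        return x, feats  # x (== feats[-1]) is the latent code z

class Generator(nn.Module):
    """G: latent code z -> image; encoder activations are concatenated
    channel-wise into the early decoding stages (skip connections)."""
    def __init__(self):
        super().__init__()
        self.up1 = up(512, 512)
        self.up2 = up(512 + 512, 256)   # + skip from encoder block 4
        self.up3 = up(256 + 256, 128)   # + skip from encoder block 3
        self.up4 = up(128 + 128, 64)    # + skip from encoder block 2
        self.out = nn.Sequential(       # + skip from encoder block 1
            nn.ConvTranspose2d(64 + 64, 3, 4, stride=2, padding=1),
            nn.Tanh(),
        )

    def forward(self, z, feats):
        h = self.up1(z)
        h = self.up2(torch.cat([h, feats[3]], dim=1))
        h = self.up3(torch.cat([h, feats[2]], dim=1))
        h = self.up4(torch.cat([h, feats[1]], dim=1))
        return self.out(torch.cat([h, feats[0]], dim=1))

# One shared encoder feeds both paths; each path has its own generator
# (and, not shown here, its own discriminator).
E = SharedEncoder()
G_translation, G_reconstruction = Generator(), Generator()

x = torch.randn(1, 3, 128, 128)          # e.g. a facial photo
z, feats = E(x)
sketch_hat = G_translation(z, feats)     # photo -> sketch (translation path)
photo_hat = G_reconstruction(z, feats)   # photo -> photo (reconstruction path)
```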
Moreover, both paths utilize the L1 distance, defined as follows:

\mathcal{L}_{L1}(EG) = \mathbb{E}_{x,\hat{x},z}\left[\left\lVert \hat{x} - EG(x, z) \right\rVert_1\right].

The final loss is then defined as the sum of both losses over both paths.
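The per-path training objective built from the two formulas above can be sketched as follows, reusing the interfaces from the previous sketch. This is an illustration, not the authors' code: the helper `path_losses` is hypothetical, and the pair-conditioning via channel stacking is an assumption. The abstract describes the reconstruction path as a vanilla GAN, whose discriminator would score the generated image alone; the conditional form is used for both paths here only for brevity.

```python
import torch
import torch.nn.functional as F

def path_losses(encoder, generator, disc, x, target):
    # Losses for one X-Bridge path. `target` is the paired image from
    # the other domain (translation path) or `x` itself (reconstruction
    # path); `disc` is that path's discriminator, assumed here to score
    # an (input, candidate-output) pair stacked along the channel axis.
    z, feats = encoder(x)
    fake = generator(z, feats)

    # Adversarial loss: real pairs -> 1, generated pairs -> 0.
    pred_real = disc(torch.cat([x, target], dim=1))
    pred_fake = disc(torch.cat([x, fake.detach()], dim=1))
    d_loss = (
        F.binary_cross_entropy_with_logits(pred_real, torch.ones_like(pred_real))
        + F.binary_cross_entropy_with_logits(pred_fake, torch.zeros_like(pred_fake))
    )

    # Generator side: fool the discriminator, plus the L1 distance.
    pred_gen = disc(torch.cat([x, fake], dim=1))
    g_loss = (
        F.binary_cross_entropy_with_logits(pred_gen, torch.ones_like(pred_gen))
        + F.l1_loss(fake, target)
    )
    return d_loss, g_loss

# Final objective: the plain sum over both paths, as stated above; the
# shared encoder receives gradients from both paths.
# d_t, g_t = path_losses(E, G_translation, D_translation, photo, sketch)
# d_r, g_r = path_losses(E, G_reconstruction, D_reconstruction, photo, photo)
# generator_loss = g_t + g_r
# discriminator_loss = d_t + d_r
```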

Experiments
Several deep generative models have been proposed for image-to-image translation in recent years. Most existing approaches are based on supervised learning; however, models based on unsupervised learning have become very popular lately. To benchmark X-Bridge, we decided to use two significant methods, one from each group: the Pix2pix approach [4] and the MUNIT approach [5]. Both methods provide state-of-the-art results in image-to-image translation tasks; Pix2pix is a supervised method like X-Bridge, whereas MUNIT is unsupervised. All methods were trained and tested on the CUFSF dataset, containing 1194 facial photo-sketch pairs.
In the first experiment, we test the translation of frontal views, see Fig. 1. All methods provide very realistic and precise results; however, we argue that both supervised models outperform the MUNIT approach, which has problems generating sharp images and therefore also fine details. Pix2pix and X-Bridge reach comparable results. In the second experiment, where we test non-frontal cases, X-Bridge outperformed the other methods in terms of generalization, precision, and preservation of facial features, see Fig. 2. All methods were also tested on the IIIT-D Sketch dataset with comparable results.

Conclusion
We argue that the qualitative results provided by X-Bridge surpass those of the other tested methods. In our future research, we would first like to propose a suitable metric to objectively compare the performance of methods on image-to-image translation tasks, for which no established metric exists to this day. Secondly, we would like to address the ambiguity issue, which is a critical problem in the heterogeneous face recognition task.

Figure 1: Cross-domain translation comparison. Input images (first column), corresponding sketches translated by Pix2pix (second column), MUNIT (third column), X-Bridge (fourth column), and ground-truth outputs (last column).

Figure 2: Cross-domain translation comparison for the non-frontal view. The order of methods is the same as in Fig. 1.