Attribute-conditioned facial image generation through contrastive learning
Δημιουργία εικόνων προσώπων με την χρήση χαρακτηριστικών μέσω αντιθετικής μάθησης
Master's Thesis
Author
Stavrianoudakis, Vasileios
Σταυριανουδάκης, Βασίλειος
Date
2025
Keywords
Attributes ; Encoders ; StyleGAN2 ; GAN ; Contrastive learning ; Image generation ; Guided generation ; Dataset creation ; Facial image generation ; Mid-level semantics
Abstract
The growing complexity of deep learning applications demands advanced representation learning techniques capable of capturing semantic information across multiple scales. Traditional deep learning systems often focus on either high-level semantics, such as class labels in supervised learning, or low-level semantics, like pixel-level details in unsupervised reconstruction tasks. However, modern applications increasingly require representations that also encompass mid-level semantics, which bridge the gap between global structure and fine-grained details. These mid-level representations are particularly valuable in tasks like guided image generation, where maintaining a balance between structural coherence and detailed precision is essential. This thesis introduces a novel framework leveraging contrastive learning to develop continuous and expressive attribute-related encoders specifically designed to capture these mid-level semantics.
The proposed framework comprises three key components: (i) a pipeline for constructing attribute datasets that effectively represent mid-level semantics, (ii) the application of contrastive learning techniques to train attribute encoders, and (iii) a methodology for conditioning facial image generation on these attribute encodings.
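As a rough orientation only, the following skeleton sketches how these three components could be wired together. Every name in it (build_attribute_dataset, train_attribute_encoder, AcGanGenerator, the attribute dimensionality, the placeholder bodies) is an illustrative assumption rather than the thesis codebase; the placeholder bodies merely keep the sketch runnable, and the actual stages are detailed below.

    # Hypothetical three-stage skeleton; all names and shapes are illustrative
    # placeholders, not the thesis implementation.
    from typing import List, Tuple
    import torch

    def build_attribute_dataset(frames: List[torch.Tensor]) -> List[Tuple[torch.Tensor, torch.Tensor]]:
        """Stage (i): preprocess frames and attach continuous attribute
        annotations; random values stand in for pre-trained estimators."""
        return [(f, torch.rand(4)) for f in frames]   # 4 placeholder attributes

    def train_attribute_encoder(dataset) -> torch.nn.Module:
        """Stage (ii): an attribute encoder trained with a rank-based
        contrastive objective (training loop omitted in this skeleton)."""
        return torch.nn.Linear(4, 64)                 # placeholder encoder

    class AcGanGenerator(torch.nn.Module):
        """Stage (iii): a generator conditioned on attribute encodings by
        concatenating them with the latent code (placeholder architecture)."""
        def __init__(self, latent_dim: int = 128, attr_dim: int = 64):
            super().__init__()
            self.net = torch.nn.Linear(latent_dim + attr_dim, 3 * 64 * 64)

        def forward(self, z, attr_code):
            x = self.net(torch.cat([z, attr_code], dim=1))
            return x.view(-1, 3, 64, 64)

    if __name__ == "__main__":
        frames = [torch.rand(3, 64, 64) for _ in range(8)]
        dataset = build_attribute_dataset(frames)
        encoder = train_attribute_encoder(dataset)
        generator = AcGanGenerator()
        z = torch.randn(2, 128)
        attrs = torch.stack([dataset[0][1], dataset[1][1]])   # (2, 4)
        images = generator(z, encoder(attrs))                 # (2, 3, 64, 64)
        print(images.shape)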
In the first stage, the VoxCeleb2 dataset is preprocessed to enhance image quality, and state-of-the-art pre-trained models are employed to infer attribute information at scale, eliminating the need for manual annotation. The second stage introduces the Rank-N-Contrast (RNC) loss, an extension of contrastive learning that accommodates real-valued, continuous annotations. This approach enables the encoders to learn representation spaces in which attributes vary continuously, yielding effective and interpretable attribute descriptors. In the third stage, these attribute vectors are integrated into a conditional facial image generation pipeline, referred to as the attribute-conditioned GAN (ac-GAN). Experimental results demonstrate that ac-GAN significantly outperforms conventional guided generation approaches based on class labels, producing higher-quality facial images and superior attribute-driven generation results.
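To make the ranking idea concrete, the following is a minimal PyTorch sketch of a rank-based contrastive objective in the spirit of RNC, assuming the commonly cited formulation: for each anchor i and positive candidate j, the candidate is contrasted only against samples whose attribute-label distance to the anchor is at least as large as that of j, so embedding similarity decreases with attribute distance. The function name, the temperature value, and the naive double loop are illustrative choices, not the thesis implementation.

    import torch
    import torch.nn.functional as F

    def rank_n_contrast_loss(features: torch.Tensor,
                             labels: torch.Tensor,
                             temperature: float = 0.1) -> torch.Tensor:
        """Naive sketch of a Rank-N-Contrast-style objective.

        features: (N, D) encoder outputs for a batch.
        labels:   (N,) continuous attribute annotations.
        For anchor i and candidate j, j is contrasted only against samples k
        whose label distance to i is at least as large as that of j.
        """
        features = F.normalize(features, dim=1)
        sim = features @ features.T / temperature                  # (N, N) similarities
        dist = (labels.view(-1, 1) - labels.view(1, -1)).abs()     # label distances

        n = features.size(0)
        losses = []
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                # S_{i,j}: candidates at least as far from anchor i as j is
                mask = dist[i] >= dist[i, j]
                mask[i] = False
                log_denom = torch.logsumexp(sim[i][mask], dim=0)
                losses.append(log_denom - sim[i, j])               # -log softmax term
        return torch.stack(losses).mean()

    # Tiny usage example with random data (illustrative only).
    if __name__ == "__main__":
        feats = torch.randn(16, 128, requires_grad=True)
        ages = torch.rand(16) * 60 + 10      # e.g. a continuous "age" attribute
        loss = rank_n_contrast_loss(feats, ages)
        loss.backward()
        print(float(loss))

Practical implementations vectorize the inner loops; the quadratic form above is kept only to make the ranking constraint explicit.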
Each stage of the proposed framework builds upon and validates the preceding one. Through carefully selected quantitative metrics assessing image quality and attribute fidelity, alongside qualitative visual evaluations, this thesis highlights the effectiveness of the RNC-trained encoders and the ac-GAN pipeline. The findings pave the way for broader applications in attribute-driven image generation and editing, including tasks such as face reenactment and talking head generation.