PhD defence of Rezvan Sherkati – Saliency Prediction for Groups Using Generative Models
Abstract
Human visual attention has long been a topic of interest for researchers and scientists. Visual attention is composed of fixations, periods during which the eyes remain relatively still to process visual information, and rapid eye movements between these fixations, known as saccades. Fixation points often carry substantial information, such as key events within a scene, and can offer insights into a viewer’s personality traits. Consequently, predicting visual attention, a task referred to as saliency prediction, has been a longstanding and significant research problem.
In the area of saliency prediction, most existing methods focus on universal saliency prediction, i.e., predicting attention for an average viewer. These methods fail to capture the inter-individual variability in attention. To address this, methods have been proposed for personalized saliency prediction, which predict saliency for individual viewers by considering their personal features. While these methods account for individual differences, they face limitations due to the challenges of large-scale data collection, noisy data, and privacy concerns.
To address the issues associated with universal and personalized saliency prediction, this thesis presents methods for saliency prediction in groups, referred to as group saliency prediction. We propose grouping viewers based on similarities in demographics, interests, visual attention, and other available data. Based on these identified groups, we design architectures for predicting saliency specific to each viewer group.
Our first method is an image saliency prediction technique called Clustered Saliency Prediction. This method groups viewers into clusters based on their personal features and known saliency maps, assigning selected importance weights to the personal-feature factors. Building on these clusters, we introduce the Multi-Domain Saliency Translation (MDST) model, an image saliency prediction framework based on Generative Adversarial Networks (GANs) and conditioned on cluster labels. The MDST model generates saliency maps tailored to each identified group of viewers. We evaluate our approach on a public dataset of personalized saliency maps and show that our method outperforms state-of-the-art universal saliency prediction models. We also demonstrate the effectiveness of our clustering method by comparing results obtained with our clusters against those from baseline clustering methods. Finally, we propose an approach for assigning new individuals to their most appropriate cluster and demonstrate its applicability through a series of experiments.
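To make the clustering stage concrete, the following is a minimal illustrative sketch, not the thesis implementation: personal-feature dimensions are scaled by importance weights before k-means clustering, and a new viewer is assigned to the nearest centroid. All feature names, weight values, and the cluster count here are hypothetical assumptions.

```python
# Illustrative sketch only: weighted k-means clustering of viewers and
# assignment of a new viewer to the nearest cluster centroid.
# Feature names, weights, and the number of clusters are hypothetical,
# not the values used in the thesis.
import numpy as np
from sklearn.cluster import KMeans

# Each row: one viewer's personal-feature vector concatenated with a
# flattened summary of their known saliency maps (toy random data here).
rng = np.random.default_rng(0)
personal_features = rng.random((30, 4))   # e.g., age, interests, ...
saliency_summary = rng.random((30, 16))   # e.g., downsampled mean saliency map

# Hypothetical importance weights for the personal-feature factors.
weights = np.array([2.0, 1.0, 1.5, 0.5])

X = np.hstack([personal_features * weights, saliency_summary])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster labels:", kmeans.labels_)

# Assigning a new viewer: apply the same weighting, then pick the nearest
# centroid. New viewers may lack saliency maps; here we substitute the
# mean saliency summary as a stand-in.
new_personal = rng.random(4) * weights
new_vec = np.hstack([new_personal, saliency_summary.mean(axis=0)])
print("assigned cluster:", kmeans.predict(new_vec[None, :])[0])
```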
We additionally introduce a novel set of generative neural networks designed for saliency prediction tailored to viewer groups. These models are built on a generative framework that leverages style-transfer techniques to transform universal saliency maps into group-specific predictions. We evaluate their performance on personalized saliency map datasets and investigate the impact of data augmentation strategies. We also analyze the strengths and limitations of each model and conduct ablation studies to further justify our design decisions.
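As a rough sketch of the group-conditioned generation idea, assuming PyTorch, the module below takes a universal saliency map and a group label and outputs a group-specific map. The architecture, layer sizes, and conditioning scheme are illustrative assumptions, not the models proposed in the thesis.

```python
# Minimal sketch of a group-conditioned saliency "translator": it maps a
# universal saliency map plus a group label to a group-specific map.
# All layer sizes and the conditioning scheme are illustrative only.
import torch
import torch.nn as nn

class GroupSaliencyTranslator(nn.Module):  # hypothetical module name
    def __init__(self, num_groups: int, emb_dim: int = 8):
        super().__init__()
        self.group_emb = nn.Embedding(num_groups, emb_dim)
        # Input channels: 1 (universal map) + emb_dim (broadcast group code).
        self.net = nn.Sequential(
            nn.Conv2d(1 + emb_dim, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid(),  # saliency in [0, 1]
        )

    def forward(self, universal_map: torch.Tensor, group: torch.Tensor):
        b, _, h, w = universal_map.shape
        # Broadcast the group embedding to every spatial location.
        code = self.group_emb(group).view(b, -1, 1, 1).expand(b, -1, h, w)
        return self.net(torch.cat([universal_map, code], dim=1))

model = GroupSaliencyTranslator(num_groups=2)
fake_universal = torch.rand(4, 1, 64, 64)  # batch of universal saliency maps
groups = torch.tensor([0, 1, 0, 1])        # cluster labels
group_maps = model(fake_universal, groups)
print(group_maps.shape)  # torch.Size([4, 1, 64, 64])
```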
Lastly, we apply our group saliency prediction methods to a new egocentric video and eye-tracking dataset that we acquired in a convenience store. This dataset comprises 108 first-person videos of 36 shoppers searching for three products: orange juice, KitKat chocolate bars, and canned tuna, along with eye fixation data for each video frame. It also includes demographic information about each participant, collected through an 11-question survey. Using the survey responses, our clustering method identifies two distinct viewer groups. We train our group saliency prediction models on the fixation data from the store videos, and the results show improved saliency prediction performance on this real-world dataset compared to leading universal models.
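The abstract does not name the evaluation metrics used to compare against universal models; as one standard choice in saliency evaluation, the sketch below computes Normalized Scanpath Saliency (NSS), which scores a predicted map at ground-truth fixation points. The function and toy data are illustrative assumptions.

```python
# Illustrative sketch of one standard saliency metric, Normalized Scanpath
# Saliency (NSS): the mean of the z-normalized predicted map at fixation
# points. Higher is better. Not necessarily the metric used in the thesis.
import numpy as np

def nss(saliency_map: np.ndarray, fixations: np.ndarray) -> float:
    """saliency_map: HxW prediction; fixations: HxW binary fixation mask."""
    s = (saliency_map - saliency_map.mean()) / (saliency_map.std() + 1e-8)
    return float(s[fixations.astype(bool)].mean())

pred = np.random.rand(64, 64)   # toy predicted saliency map
fix = np.zeros((64, 64))
fix[10, 20] = fix[40, 50] = 1   # two toy fixation points
print(f"NSS = {nss(pred, fix):.3f}")
```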