PhD defence of Farzaneh Askari – Skeleton-Based Human Interaction Understanding from Real-World Videos: Applications in Sports and Retail
Abstract
Video understanding has become a fundamental research area in computer vision due to its wide range of applications, including surveillance, healthcare, entertainment, and sports analytics. With the advent of deep learning and the growing availability of large-scale video data from social media, broadcasting, and online platforms, remarkable progress has been achieved in recognizing human actions and interactions from videos. However, real-world environments remain challenging due to dynamic motion, visual clutter, occlusions, and viewpoint variations.
This thesis addresses the problem of Human Interaction Recognition from Videos (HIRV) under real-world conditions. The work is divided into two main parts. The first part focuses on structured environments through the study of sports videos, specifically ice hockey, as a representative and demanding real-world domain. Hockey broadcast videos feature fast player motion, frequent occlusions, complex multi-person interactions, and low inter-class visual variance in penalty scenes. We propose a series of skeleton pose–based methods for recognizing penalties and player interactions. These methods address several key challenges: limited dataset size, efficient interaction recognition through a custom architecture, and action localization in crowded scenes. In addition, we introduce a hockey-specific pose dataset designed to evaluate and improve pose-based human interaction understanding in challenging broadcast conditions.
The second part of the thesis extends the study to open-world environments by investigating human interactions in retail spaces, where individuals interact not only with each other but also with their surroundings. Unlike structured sports scenes, retail environments involve longer and overlapping activities and complex human–object interactions influenced by environmental factors such as product placement, stock availability, and store layout. In this study, using the skeleton pose representation, we analyze customer behavior and decision-making by modeling both person–person and person–object interactions over extended temporal sequences, and we examine the factors that affect customers' decisions.
Overall, this thesis advances human interaction understanding in real-world videos through the use of skeleton-based representations, domain-specific dataset construction, and frameworks capable of reasoning about complex multi-person interactions. The proposed approaches demonstrate the potential of structured pose features to enhance robustness, interpretability, and privacy in video understanding across both structured and unconstrained environments.