PhD defence of Bahar Nikpour – Hard Attention Finding using Reinforcement Learning
Abstract
In machine learning, attention is an effective method that mimics human cognitive attention. It aims to enhance the influence of some parts of the data and reduce that of others. Across various fields, attention has shown promising potential for improving learning models by identifying the salient portions of the input data. In this thesis, hard attention finding is explored in two vision domains: human activity recognition and few-shot learning.
Finding attention can benefit human activity recognition (HAR), which is a challenging research field. Current methods in skeleton-based activity recognition primarily develop deep learning architectures to identify key features from the 2D or 3D coordinates of human body joints. These approaches typically treat all joints as equally important, which may not be accurate, as the relevance of joints varies both within an activity and between activities. Likewise, not all video frames contribute equally to recognizing an activity. Our research introduces a method that simultaneously finds both temporal (key frames) and spatial (key joints) attention, potentially enhancing baseline classifier performance and reducing computational load. We first propose a method consisting of two agents, one temporal and one spatial, which are trained by interacting with each other. The temporal agent finds the key frames, and the spatial agent looks for the key joints. Then, since the benchmark datasets mainly contain short video sequences, we removed the temporal agent in order to investigate and improve the performance of the spatial agent alone. We therefore propose a spatial hard attention-finding method that aims to discard the irrelevant and misleading joints and preserve the most discriminative ones, per frame. In both approaches, we formulate the frame selection and joint selection problems as Markov decision processes and use deep reinforcement learning to solve them. The proposed methods are general frameworks that can be applied to existing HAR models to improve their performance. We achieve very competitive results on the widely used human activity datasets in this field. We have published our results in the Pattern Recognition journal (Elsevier), IEEE Transactions on Systems, Man, and Cybernetics: Systems, and the 2021 IEEE International Conference on Systems, Man, and Cybernetics (SMC).
We also conducted a survey of reinforcement learning-based HAR techniques, published in IEEE Transactions on Neural Networks and Learning Systems.
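As a rough illustration of what hard attention over frames and joints means for skeleton data, the sketch below applies binary selection masks to a short sequence. It is not the thesis implementation: the function name apply_hard_attention, the mask layout, and the choice to zero out discarded joints are assumptions made for this example, and the masks are written by hand, whereas in the thesis the selections are made by agents trained with deep reinforcement learning.

```python
def apply_hard_attention(sequence, frame_mask, joint_masks):
    """Keep selected frames and zero out unselected joints in each kept frame.

    sequence:    list of frames; each frame is a list of (x, y, z) joint tuples
    frame_mask:  list of 0/1 values, one per frame (temporal selection)
    joint_masks: list of per-frame lists of 0/1 values, one per joint
                 (spatial selection)

    Zeroing a discarded joint is an assumption of this sketch; other ways of
    discarding a joint are possible.
    """
    attended = []
    for frame, keep_frame, joint_mask in zip(sequence, frame_mask, joint_masks):
        if not keep_frame:
            continue  # hard temporal attention: the frame is dropped entirely
        attended.append([
            joint if keep_joint else (0.0, 0.0, 0.0)
            for joint, keep_joint in zip(frame, joint_mask)
        ])
    return attended


# Toy sequence: 3 frames, 2 joints per frame (values are illustrative only).
sequence = [
    [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)],
    [(7.0, 8.0, 9.0), (1.5, 2.5, 3.5)],
    [(0.1, 0.2, 0.3), (0.4, 0.5, 0.6)],
]
frame_mask = [1, 0, 1]                    # hand-written: keep frames 0 and 2
joint_masks = [[1, 0], [1, 1], [0, 1]]    # hand-written per-frame joint choices

attended = apply_hard_attention(sequence, frame_mask, joint_masks)
```

Because the selection is binary and acts directly on the input, the attended sequence can be passed to an existing classifier without changing that classifier, which is the sense in which such a method can wrap an existing HAR model.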
Attention finding is particularly valuable in scenarios where only limited training samples are accessible, which is the case most of the time because of the challenges of data collection and labeling. Learning from a few labeled samples is referred to as few-shot learning, and we further explored the idea of hard attention finding in this area. Attention mechanisms help a model focus on the relevant parts of the data, which matters most when training data is scarce. By attending to the most informative features or regions in the input, the model can make better decisions and generalize more effectively from the few examples it has been exposed to. In situations with few training samples, existing studies struggle to locate such informative regions because of the large number of trainable parameters, which cannot be learned effectively from the limited samples available. In this work, we introduce a novel framework for explainable hard attention finding, specifically adapted to few-shot learning scenarios, called FewXAT. Our approach employs deep reinforcement learning to implement the concept of hard attention, which acts directly on the raw input data and thus makes the process interpretable to humans. Through extensive experimentation across benchmark datasets, we demonstrate the efficacy of our proposed method. The results of this work have been submitted to the 2024 European Conference on Computer Vision (ECCV 2024).