PhD defence of Samrudhdhi Rangrej – Visual Hard Attention Models Under Partial Observability

Tuesday, June 13, 2023, 14:00 to 16:00
McConnell Engineering Building Room 603, 3480 rue University, Montreal, QC, H3A 0E9, CA



Existing state-of-the-art recognition models achieve impressive performance but require a complete scene, which may not always be available. For example, sensing a complete scene at once is infeasible in applications such as aerial imaging. Further, in applications such as disaster recovery, imaging devices should be light, inexpensive, and energy-efficient; thus, they are often built using small field-of-view cameras that capture only a part of a scene at a time. In such cases, the imaging devices must scan the area sequentially. Moreover, they must prioritize the scanning of informative subregions for timely recognition.

Many attention models have been developed that recognize a scene by observing it through small informative subregions called glimpses. However, most models locate informative glimpses by glancing at a low-resolution gist of the complete scene, which is unavailable in practice. In this thesis, we develop sequential recognition models that locate and attend to informative glimpses without accessing a complete scene. Our sequential attention models predict the location of the next glimpse based solely on past glimpses. Our models achieve effective attention policies under partial observability by selecting subsequent glimpses that, combined with past glimpses, help the most in reasoning about the complete scene.

We present three attention models, two for spatial and one for spatiotemporal recognition. The first is the Probabilistic Attention Model (PAM). PAM uses Bayesian optimal experiment design to attend to the glimpse with maximum expected information gain (EIG). It synthesizes features of the complete scene from past glimpses to estimate the EIG for yet-unobserved regions. The second is the Sequential Transformers Attention Model (STAM), which employs the one-step actor-critic algorithm to attend to a sequence of glimpses that produce a class distribution consistent with the one produced using a complete scene. The third is the Glimpse Transformer (GliTr). GliTr learns an effective attention mechanism for online action recognition by selecting glimpses whose features and class distributions are consistent with those of the corresponding complete video frames.
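To make the EIG-based selection concrete, the sketch below illustrates the general principle behind attending to the glimpse with maximum expected information gain: the gain of a candidate location is the current class-belief entropy minus the expected entropy after observing that location. This is a generic toy illustration, not PAM's actual implementation; the distributions, candidate locations, and function names are all hypothetical.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a categorical distribution."""
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p))

def expected_information_gain(p_class, p_obs, p_class_given_obs):
    """EIG of glimpsing at one candidate location.

    p_class:           current belief p(y | past glimpses), shape (C,)
    p_obs:             predicted distribution over possible observations
                       at this location, shape (K,)
    p_class_given_obs: updated class belief for each possible
                       observation, shape (K, C)
    """
    prior_h = entropy(p_class)
    expected_posterior_h = sum(
        po * entropy(py) for po, py in zip(p_obs, p_class_given_obs)
    )
    return prior_h - expected_posterior_h

# Toy 2-class problem with a uniform current belief.
p_class = np.array([0.5, 0.5])

# Location A: the observation would be highly informative about the class.
eig_a = expected_information_gain(
    p_class,
    p_obs=np.array([0.5, 0.5]),
    p_class_given_obs=np.array([[0.95, 0.05], [0.05, 0.95]]),
)

# Location B: the observation would barely change the belief.
eig_b = expected_information_gain(
    p_class,
    p_obs=np.array([0.5, 0.5]),
    p_class_given_obs=np.array([[0.55, 0.45], [0.45, 0.55]]),
)

# The attention policy glimpses at the location with the larger EIG.
best = "A" if eig_a > eig_b else "B"
```

In PAM, the key difficulty is that `p_class_given_obs` cannot be computed for unobserved regions directly; the model instead synthesizes complete-scene features from past glimpses to estimate these quantities.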

Throughout the thesis, we evaluate our models on multiple datasets and compare them with existing models. Our two key findings are as follows. First, reasoning about the complete scene from partial observations helps in learning an effective attention policy under partial observability. Second, while reducing the amount of sensing required for recognition, our glimpse-based models achieve performance comparable to or higher than that of existing models that require complete scenes. The key takeaway is that one can attain good performance even with low-cost sensing devices and non-ideal imaging by automating the sensing process and compelling the recognition model to fill in the missing information.
