Publications

in reverse chronological order, generated by jekyll-scholar.

2025

  1. Seeing Eye to AI: Comparing Human Gaze and Model Attention in Video Memorability
    In Winter Conference on Applications of Computer Vision (WACV), Mar 2025
    New!

2024

  1. Major Entity Identification: A Generalizable Alternative to Coreference Resolution
    In Empirical Methods in Natural Language Processing (EMNLP), Nov 2024
    New!
  2. VELOCITI: Can Video-Language Models Bind Semantic Concepts through Time?
    2024
    New preprint!
  3. No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning
    Manu Gaur, Darshan Singh S, and Makarand Tapaswi
    2024
    New preprint!
  4. Detect, Describe, Discriminate: Moving Beyond VQA for MLLM Evaluation
    Manu Gaur, Darshan Singh S, and Makarand Tapaswi
    In ECCV Workshop on Emergent Visual Abilities and Limits of Foundation Models (EVAL-FoMo), Sep 2024
  5. Localizing Auditory Concepts in CNNs
    Pratyaksh Gautam, Makarand Tapaswi*, and Vinoo Alluri*
    In ICML Workshop on Mechanistic Interpretability (ICMLW-MI), Jul 2024
  6. "Previously on ..." From Recaps to Story Summarization
    Aditya Kumar Singh, Dhruv Srivastava, and Makarand Tapaswi
    In Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2024
    Media: Talk at Twelve Labs  
  7. MICap: A Unified Model for Identity-aware Movie Descriptions
    In Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2024
    Media: RSIP Vision, Talk at Twelve Labs
  8. NurtureNet: A Multi-task Video-based Approach for Newborn Anthropometry
    Yash Khandelwal, Mayur Arvind, Sriram Kumar, Ashish Gupta, Sachin Kumar Danisetty, Piyush Bagad, Anish Madan, Mayank Lunayach, Aditya Annavajjala, Abhishek Maiti, Sansiddh Jain, Aman Dalmia, Namrata Deka, Jerome White, Jigar Doshi, Angjoo Kanazawa, Rahul Panicker, Alpan Raval, Srinivas Rana, and Makarand Tapaswi
    In CVPR Workshop on Computer Vision for Physiological Measurements (CVPM), Jun 2024
    Best Paper Award
  9. FiGCLIP: Fine-Grained CLIP Adaptation via Densely Annotated Videos
    Darshan Singh S, Zeeshan Khan, and Makarand Tapaswi
    2024
    Preprint

2023

  1. How you feelin’? Learning Emotions and Mental States in Movie Scenes
    Dhruv Srivastava, Aditya Kumar Singh, and Makarand Tapaswi
    In Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2023
    Media: IIIT-H Blog, Times of India, Hindu Businessline, Telangana Today, Deccan Chronicle
  2. Test of Time: Instilling Video-Language Models with a Sense of Time
    Piyush Bagad, Makarand Tapaswi, and Cees G M Snoek
    In Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2023
  3. GrapeQA: GRaph Augmentation and Pruning to Enhance Question-Answering
    Dhaval Taunk, Lakshya Khanna, Pavan Kandru, Vasudeva Varma, Charu Sharma, and Makarand Tapaswi
    In WWW Workshop on Natural Language Processing for Knowledge Graph Construction (NLP4KGc), May 2023
  4. Unsupervised Audio-Visual Lecture Segmentation
    Darshan Singh S, Anchit Gupta, C V Jawahar, and Makarand Tapaswi
    In Winter Conference on Applications of Computer Vision (WACV), Jan 2023

2022

  1. Sonus Texere! Automated Dense Soundtrack Construction for Books using Movie Adaptations
    Jaidev Shriram, Makarand Tapaswi, and Vinoo Alluri
    In International Society for Music Information Retrieval Conference (ISMIR), Dec 2022
    Brave New Idea Award!
    Media: IIIT-H Blog, Hindu Businessline, Eenadu (Telugu), Times of India
  2. Grounded Video Situation Recognition
    Zeeshan Khan, C V Jawahar, and Makarand Tapaswi
    In Neural Information Processing Systems (NeurIPS), Dec 2022
  3. Language Conditioned Spatial Relation Reasoning for 3D Object Grounding
    In Neural Information Processing Systems (NeurIPS), Dec 2022
  4. Can we Adopt Self-supervised Pretraining for Chest X-Rays?
    Arsh Verma and Makarand Tapaswi
    In Machine Learning for Healthcare (ML4H) (Extended Abstract), Nov 2022
  5. Instruction-driven History-aware Policies for Robotic Manipulations
    Pierre-Louis Guhur, Shizhe Chen, Ricardo Garcia Pinel, Makarand Tapaswi, Ivan Laptev, and Cordelia Schmid
    In Conference on Robot Learning (CoRL), Dec 2022
  6. Learning from Unlabeled 3D Environments for Vision-and-Language Navigation
    In European Conference on Computer Vision (ECCV), Oct 2022
  7. Learning Object Manipulation Skills from Video via Approximate Differentiable Physics
    In International Conference on Intelligent Robots and Systems (IROS), Oct 2022
  8. Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation
    In Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2022

2021

  1. Long term Spatio-Temporal Modeling for Action Detection
    Makarand Tapaswi*, Vijay Kumar*, and Ivan Laptev
    Computer Vision and Image Understanding (CVIU), 2021
  2. Feature Generation for Long-tail Classification
    In Indian Conference on Computer Vision, Graphics, and Image Processing (ICVGIP), Dec 2021
  3. Airbert: In-domain Pretraining for Vision-and-Language Navigation
    In International Conference on Computer Vision (ICCV), Oct 2021

2020

  1. Learning Object Manipulation Skills via Approximate State Estimation from Real Videos
    Vladimir Petrik*, Makarand Tapaswi*, Ivan Laptev, and Josef Sivic
    In Conference on Robot Learning (CoRL), Nov 2020
  2. Learning Interactions and Relationships between Movie Characters
    Anna Kukleva, Makarand Tapaswi, and Ivan Laptev
    In Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2020
  3. Clustering based Contrastive Learning for Improving Face Representations
    Vivek Sharma, Makarand Tapaswi, Saquib Sarfraz, and Rainer Stiefelhagen
    In IEEE International Conference on Automatic Face and Gesture Recognition (FG), May 2020
  4. Video Face Clustering with Self-Supervised Representation Learning
    Vivek Sharma, Makarand Tapaswi, M. Saquib Sarfraz, and Rainer Stiefelhagen
    IEEE Transactions on Biometrics (T-BIOM), 2020

2019

  1. Video Face Clustering with Unknown Number of Clusters
    Makarand Tapaswi, Marc T. Law, and Sanja Fidler
    In International Conference on Computer Vision (ICCV), Oct 2019
  2. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
    In International Conference on Computer Vision (ICCV), Oct 2019
    Media: Data Skeptic Podcast  
  3. Self-Supervised Learning of Face Representations for Video Face Clustering
    Vivek Sharma, Makarand Tapaswi, Saquib Sarfraz, and Rainer Stiefelhagen
    In IEEE International Conference on Automatic Face and Gesture Recognition (FG), May 2019
    Best Paper Award!
  4. Visual Reasoning by Progressive Module Networks
    Seung Wook Kim, Makarand Tapaswi, and Sanja Fidler
    In International Conference on Learning Representations (ICLR), May 2019
  5. The Shmoop Corpus: A Dataset of Stories with Loosely Aligned Summaries
    Atef Chaudhury, Makarand Tapaswi, Seung Wook Kim, and Sanja Fidler
    arXiv:1912.13082, 2019
  6. Deep Multimodal Feature Encoding for Video Ordering
    Vivek Sharma, Makarand Tapaswi, and Rainer Stiefelhagen
    In ICCV Workshop on Large Scale Holistic Video Understanding, 2019

2018

  1. MovieGraphs: Towards Understanding Human-Centric Situations from Videos
    Paul Vicol, Makarand Tapaswi, Lluis Castrejon, and Sanja Fidler
    In Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2018
    Media: UofT News, phys.org
  2. Now You Shake Me: Towards Automatic 4D Cinema
    Yuhao Zhou, Makarand Tapaswi, and Sanja Fidler
    In Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2018
    Media: UofT News, Inquisitr, CBC Radio

2017

  1. Situation Recognition with Graph Neural Networks
    Ruiyu Li, Makarand Tapaswi, Renjie Liao, Jiaya Jia, Raquel Urtasun, and Sanja Fidler
    In International Conference on Computer Vision (ICCV), Oct 2017

2016

  1. MovieQA: Understanding Stories in Movies through Question-Answering
    Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, and Sanja Fidler
    In Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2016
    Media: MIT Tech Review, NVidia Developer News
  2. Recovering the Missing Link: Predicting Class-Attribute Associations for Unsupervised Zero-Shot Learning
    Ziad Al-Halah, Makarand Tapaswi, and Rainer Stiefelhagen
    In Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2016
  3. Naming TV Characters by Watching and Analyzing Dialogs
    In Winter Conference on Applications of Computer Vision (WACV), Mar 2016
  4. A Closed-form Gradient for the 1D Earth Mover’s Distance for Spectral Deep Learning on Biological Data
    Manuel Martinez, Makarand Tapaswi, and Rainer Stiefelhagen
    In ICML Workshop on Computational Biology (CompBio-ICML16), Jun 2016

2015

  1. Accio: A Data Set for Face Track Retrieval in Movies Across Age
    Esam Ghaleb, Makarand Tapaswi, Ziad Al-Halah, Hazım Kemal Ekenel, and Rainer Stiefelhagen
    In International Conference on Multimedia Retrieval (ICMR), Jun 2015
  2. Book2Movie: Aligning Video scenes with Book chapters
    Makarand Tapaswi, Martin Bäuml, and Rainer Stiefelhagen
    In Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2015
  3. Aligning Plot Synopses to Videos for Story-based Retrieval
    Makarand Tapaswi, Martin Bäuml, and Rainer Stiefelhagen
    International Journal of Multimedia Information Retrieval (IJMIR), 2015
  4. Improved Weak Labels using Contextual Cues for Person Identification in Videos
    Makarand Tapaswi, Martin Bäuml, and Rainer Stiefelhagen
    In International Conference on Automatic Face and Gesture Recognition (FG), May 2015
  5. KIT at MediaEval 2015 – Evaluating Visual Cues for Affective Impact of Movies Task
    Marin Vlastelica Pogančić, Sergey Hayrapetyan, Makarand Tapaswi, and Rainer Stiefelhagen
    In MediaEval2015 Multimedia Benchmark Workshop (MediaEval2015), Sep 2015

2014

  1. Total Cluster: A person agnostic clustering method for broadcast videos
    Makarand Tapaswi, Omkar M. Parkhi, Esa Rahtu, Eric Sommerlade, Rainer Stiefelhagen, and Andrew Zisserman
    In Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP), Dec 2014
  2. Cleaning up after a Face Tracker: False Positive Removal
    Makarand Tapaswi, Cemal Çağrı Çörez, Martin Bäuml, Hazım Kemal Ekenel, and Rainer Stiefelhagen
    In International Conference on Image Processing (ICIP), Oct 2014
  3. A Time Pooled Track Kernel for Person Identification
    Martin Bäuml, Makarand Tapaswi, and Rainer Stiefelhagen
    In Conference on Advanced Video and Signal-based Surveillance (AVSS), Aug 2014
  4. StoryGraphs: Visualizing Character Interactions as a Timeline
    Makarand Tapaswi, Martin Bäuml, and Rainer Stiefelhagen
    In Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2014
  5. Story-based Video Retrieval in TV series using Plot Synopses
    Makarand Tapaswi, Martin Bäuml, and Rainer Stiefelhagen
    In International Conference on Multimedia Retrieval (ICMR), Apr 2014

2013

  1. Semi-supervised Learning with Constraints for Person Identification in Multimedia Data
    Martin Bäuml, Makarand Tapaswi, and Rainer Stiefelhagen
    In Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2013
  2. QCompere @ Repere 2013
    Hervé Bredin, Johann Poignant, Guillaume Fortier, Makarand Tapaswi, Viet Bac Le, Anindya Roy, Claude Barras, Sophie Rosset, Achintya Sarkar, Hua Gao, Alexis Mignon, Jakob Verbeek, Laurent Besacier, Georges Quénot, Hazım Kemal Ekenel, and Rainer Stiefelhagen
    In Workshop on Speech, Language and Audio in Multimedia (SLAM), Aug 2013

2012

  1. Contextual Constraints for Person Retrieval in Camera Networks
    Martin Bäuml, Makarand Tapaswi, Arne Schumann, and Rainer Stiefelhagen
    In Conference on Advanced Video and Signal-based Surveillance (AVSS), Sep 2012
  2. “Knock! Knock! Who is it?” Probabilistic Person Identification in TV series
    Makarand Tapaswi, Martin Bäuml, and Rainer Stiefelhagen
    In Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2012
    Media: ITWorld  
  3. Fusion of Speech, Faces and Text for Person Identification in TV Broadcast
    Hervé Bredin, Johann Poignant, Makarand Tapaswi, Guillaume Fortier, Viet Bac Le, Thibault Napoleon, Hua Gao, Claude Barras, Sophie Rosset, Laurent Besacier, Jakob Verbeek, Georges Quénot, Frédéric Jurie, and Hazım Kemal Ekenel
    In Workshop on Information Fusion in Computer Vision for Concept Recognition (held with ECCV 2012) (IFCVCR), Oct 2012
  4. KIT at MediaEval2012 - Content-based Genre Classification with Visual Cues
    Tomas Semela, Makarand Tapaswi, Hazım Kemal Ekenel, and Rainer Stiefelhagen
    In MediaEval2012 Multimedia Benchmark Workshop (MediaEval2012), Oct 2012

2010

  1. Direct modeling of spoken passwords for text-dependent speaker recognition by compressed time-feature representations
    Amitava Das and Makarand Tapaswi
    In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Mar 2010

2008

  1. Audio-Visual Person Authentication with Multiple Visualized-Speech Features and Multiple Face Profiles
    Amitava Das, Ohil K. Manyam, and Makarand Tapaswi
    In Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP), Dec 2008
  2. Multilingual spoken-password based user authentication in emerging economies using cellular phone networks
    Amitava Das, Ohil K. Manyam, Makarand Tapaswi, and Veeresh Taranalli
    In Workshop on Spoken Language Technology (SLT), Dec 2008

Disclaimer

This publication material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.