
ViNet++: Minimalistic Video Saliency Prediction via Efficient Decoder & Spatio Temporal Action Cues



Accepted at the 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2025).


Figure: Comparing ground truth with the predicted saliency maps of our models and STSANet on three datasets: DHF1K, UCF-Sports, and DIEM.

Abstract

This paper introduces ViNet-S, a 36 MB model based on the ViNet architecture with a U-Net design, featuring a lightweight decoder that significantly reduces model size and parameter count without compromising performance. Additionally, ViNet-A (148 MB) incorporates spatio-temporal action localization (STAL) features, in contrast to traditional video saliency models that use action-classification backbones. Our studies show that an ensemble of ViNet-S and ViNet-A, formed by averaging their predicted saliency maps, achieves state-of-the-art performance on three visual-only and six audio-visual saliency datasets, outperforming transformer-based models in both parameter efficiency and real-time performance, with ViNet-S running at over 1000 fps.
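
The ensemble is formed by simply averaging the two models' predicted saliency maps. Below is a minimal sketch of that step, assuming both models are standard PyTorch modules (ViNetS and ViNetA are placeholder names) that map a video clip tensor to a single-channel saliency map:

import torch

@torch.no_grad()
def ensemble_saliency(model_s, model_a, clip):
    # clip: video clip tensor of shape (B, C, T, H, W).
    # Both models are assumed to return a saliency map of shape
    # (B, 1, H, W) with values in [0, 1].
    model_s.eval()
    model_a.eval()
    # Per-pixel mean of the two predicted saliency maps.
    return 0.5 * (model_s(clip) + model_a(clip))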

Architecture

Figure: ViNet architecture.

Checkpoints

Link to the checkpoints
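
A minimal sketch of restoring one of the released checkpoints, assuming they are standard PyTorch state_dict files (the model class, import path, and file name below are placeholders; substitute the actual ones from the repository and the checkpoint link above):

import torch
from model import ViNetS  # placeholder import; use the repository's actual model class

model = ViNetS()
state = torch.load("checkpoints/vinet_s.pt", map_location="cpu")  # placeholder path
model.load_state_dict(state)
model.eval()  # switch to inference mode for saliency prediction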

Cite

@INPROCEEDINGS{vinet2025,
  author={Girmaji, Rohit and Jain, Siddharth and Beri, Bhav and Bansal, Sarthak and Gandhi, Vineet},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
  title={Minimalistic Video Saliency Prediction via Efficient Decoder & Spatio Temporal Action Cues}, 
  year={2025},
  doi={10.1109/ICASSP49660.2025.10888852},
  url={https://ieeexplore.ieee.org/abstract/document/10888852}
}

Contact

For any questions, please contact rohit.girmaji@research.iiit.ac.in or bhav.beri@research.iiit.ac.in, or open an issue in the public issues section of this repository.


This work © 2025 by the authors of the paper is licensed under CC BY-NC-SA 4.0.