Accepted at the 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2025).
Comparison of ground truth with the predicted saliency maps of our models and STSANet on three datasets: DHF1K, UCF-Sports, and DIEM.
This paper introduces ViNet-S, a 36 MB model based on the ViNet architecture with a U-Net design, featuring a lightweight decoder that significantly reduces model size and parameter count without compromising performance. Additionally, ViNet-A (148 MB) incorporates spatio-temporal action localization (STAL) features, departing from traditional video saliency models that use action classification backbones. Our studies show that an ensemble of ViNet-S and ViNet-A, obtained by averaging their predicted saliency maps, achieves state-of-the-art performance on three visual-only and six audio-visual saliency datasets. It outperforms transformer-based models in both parameter efficiency and real-time performance, with ViNet-S running at over 1000 fps.
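The ensemble is a simple prediction-level average of the two networks. Below is a minimal sketch of that step, assuming `vinet_s` and `vinet_a` are already-loaded PyTorch models that map a video clip tensor to per-frame saliency maps of identical shape; the function name, tensor layout, and output range are illustrative assumptions, not the repository's actual API.

```python
import torch

@torch.no_grad()
def ensemble_saliency(vinet_s, vinet_a, clip):
    """Average the saliency maps predicted by ViNet-S and ViNet-A.

    Assumptions (illustrative, not the repository's actual API):
    - `clip` is a (B, C, T, H, W) float tensor of video frames;
    - both models return saliency maps of the same shape, in [0, 1].
    """
    sal_s = vinet_s(clip)         # lightweight visual-only prediction
    sal_a = vinet_a(clip)         # STAL-feature-based prediction
    return 0.5 * (sal_s + sal_a)  # pixel-wise mean of the two maps
```

Averaging at the map level keeps the two networks independent, so either model can also be deployed alone when memory or latency budgets are tight.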
Link to the checkpoints
```bibtex
@INPROCEEDINGS{vinet2025,
  author={Girmaji, Rohit and Jain, Siddharth and Beri, Bhav and Bansal, Sarthak and Gandhi, Vineet},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Minimalistic Video Saliency Prediction via Efficient Decoder \& Spatio Temporal Action Cues},
  year={2025},
  doi={10.1109/ICASSP49660.2025.10888852},
  url={https://ieeexplore.ieee.org/abstract/document/10888852}
}
```

For any queries or questions, please contact rohit.girmaji@research.iiit.ac.in or bhav.beri@research.iiit.ac.in, or use the public issues section of this repository.
This work © 2025 by the authors of the paper is licensed under CC BY-NC-SA 4.0.

