Jiun Tian Hoe1, Weipeng Hu1,2, Xudong Jiang1, Yap-Peng Tan1,4, Chee Seng Chan3
1Nanyang Technological University 2Sun Yat-sen University 3Universiti Malaya 4VinUniversity
CVPR 2026 (Main)
OneHOI unifies Human-Object Interaction (HOI) generation and editing in a single, versatile model. It excels at challenging HOI editing, from text-guided changes to novel layout-guided control and novel multi-HOI edits. For generation, OneHOI synthesises scenes from text, layouts, arbitrary shapes, or mixed conditions, offering fine-grained control over the relational structure of generated images.
- [2026/02] 🎉 OneHOI is accepted to CVPR 2026!
- [2026/02] 🌐 Project page is live!
- Release paper on arXiv
- Release inference code and pretrained models
- Release HOI-Edit-44K dataset
- Release training code
Human-Object Interaction (HOI) modelling captures how humans act upon and relate to objects, typically expressed as ⟨person, action, object⟩ triplets. Existing approaches split into two disjoint families: HOI generation synthesises scenes from structured triplets and layout, but fails to integrate mixed conditions such as HOI and object-only entities; HOI editing modifies interactions via text, yet struggles to decouple pose from physical contact and to scale to multiple interactions. We introduce OneHOI, a unified diffusion transformer framework that consolidates HOI generation and editing into a single conditional denoising process driven by shared structured interaction representations. At its core, the Relational Diffusion Transformer (R-DiT) models verb-mediated relations through role- and instance-aware HOI tokens, layout-based spatial Action Grounding, Structured HOI Attention that enforces interaction topology, and HOI RoPE that disentangles multi-HOI scenes. Trained jointly with modality dropout on our HOI-Edit-44K dataset together with existing HOI and object-centric datasets, OneHOI supports layout-guided, layout-free, arbitrary-mask, and mixed-condition control, achieving state-of-the-art results across both HOI generation and editing.
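The paper's exact formulation of Structured HOI Attention is not yet released (see the TODO list above), but the idea of enforcing interaction topology via an attention mask can be illustrated with a minimal sketch. The token layout, role names, and function below are our assumptions for illustration only: each HOI instance contributes a ⟨human, verb, object⟩ token block, intra-triplet attention is allowed, cross-instance attention between HOI tokens is blocked, and image tokens attend globally.

```python
import numpy as np

def build_hoi_attention_mask(num_hois: int, tokens_per_role: int = 1,
                             num_image_tokens: int = 4) -> np.ndarray:
    """Boolean attention mask (True = may attend), illustrative only.

    Assumed token layout: [image tokens | HOI_0 (h, v, o) | HOI_1 (h, v, o) | ...].
    Each HOI triplet attends only within its own block plus the image tokens,
    so tokens of different HOI instances cannot mix directly.
    """
    roles = 3  # human, verb, object
    hoi_len = roles * tokens_per_role
    n = num_image_tokens + num_hois * hoi_len
    mask = np.zeros((n, n), dtype=bool)
    # Image tokens attend to, and are attended by, every token.
    mask[:num_image_tokens, :] = True
    mask[:, :num_image_tokens] = True
    # Each HOI triplet forms a diagonal block: no cross-instance attention.
    for i in range(num_hois):
        s = num_image_tokens + i * hoi_len
        mask[s:s + hoi_len, s:s + hoi_len] = True
    return mask
```

Such a mask would be passed to the transformer's attention layers (e.g. as `attn_mask` in PyTorch's scaled-dot-product attention) so that only topologically related tokens interact.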
- InteractDiffusion (CVPR 2024): Interaction Control in Text-to-Image Diffusion Models
- InteractEdit: Zero-Shot Editing of Human-Object Interactions in Images (IEBench dataset)
- FLUX.1 Kontext
- EliGen: Entity-Level Controlled Image Generation with Regional Attention
If you find our code useful, feel free to ⭐ star this repo!
If you use our work in your research, please cite:
@inproceedings{hoe2026onehoi,
  title={OneHOI: Unifying Human-Object Interaction Generation and Editing},
  author={Hoe, Jiun Tian and Hu, Weipeng and Jiang, Xudong and Tan, Yap-Peng and Chan, Chee Seng},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}