InterDiff: Generating 3D Human-Object Interactions with Physics-Informed Diffusion (2023-08-31)

TL;DR

This paper proposes InterDiff, a framework comprising two key steps: interaction diffusion, where a diffusion model is leveraged to encode the distribution of future human-object interactions; and interaction correction, where a physics-informed predictor is introduced to correct denoised HOIs in a diffusion step.

Abstract

This paper addresses a novel task of anticipating 3D human-object interactions (HOIs). Most existing research on HOI synthesis lacks comprehensive whole-body interactions with dynamic objects, e.g., it is often limited to manipulating small or static objects. Our task is significantly more challenging, as it requires modeling dynamic objects with various shapes, capturing whole-body motion, and ensuring physically valid interactions. To this end, we propose InterDiff, a framework comprising two key steps: (i) interaction diffusion, where we leverage a diffusion model to encode the distribution of future human-object interactions; (ii) interaction correction, where we introduce a physics-informed predictor to correct denoised HOIs in a diffusion step. Our key insight is to inject the prior knowledge that interactions, when expressed in a reference frame relative to the contact points, follow a simple pattern and are easily predictable. Experiments on multiple human-object interaction datasets demonstrate the effectiveness of our method for this task: it produces realistic, vivid, and remarkably long-term 3D HOI predictions.
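The key insight about contact-relative reference frames can be illustrated with a toy example. The sketch below is not the paper's method: the hand trajectory, the fixed hand-to-object offset, and all function names are hypothetical, and only translation is considered (rotation is ignored). It shows that an object carried by a hand traces a complicated path in world coordinates but stays nearly constant when expressed relative to the contact point.

```python
import math

# Assumed fixed hand-to-object offset while the object is being carried.
OFFSET = (0.0, 0.3, -0.1)


def hand_position(t):
    """Hypothetical world-frame hand trajectory: moving forward while swaying."""
    return (0.5 * t, 0.1 * math.sin(3.0 * t), 1.0 + 0.05 * math.cos(6.0 * t))


def object_position(t):
    """World-frame object position: rigidly attached to the hand (toy assumption)."""
    h = hand_position(t)
    return tuple(h[i] + OFFSET[i] for i in range(3))


def relative_to_contact(obj, contact):
    """Express the object position relative to the contact point (translation only)."""
    return tuple(o - c for o, c in zip(obj, contact))


def spread(points):
    """Largest per-axis range over a list of 3D points."""
    return max(
        max(p[i] for p in points) - min(p[i] for p in points) for i in range(3)
    )


times = [0.1 * k for k in range(20)]
world = [object_position(t) for t in times]
relative = [relative_to_contact(object_position(t), hand_position(t)) for t in times]

world_spread = spread(world)        # large: the object moves through the scene
relative_spread = spread(relative)  # close to zero: the pattern is trivial here
```

In this toy setting, a predictor working in the contact-relative frame only has to output a near-constant offset, which is the kind of simple, easily predictable pattern the abstract refers to.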

Authors

Sirui Xu

Zhengyu Li

Yu-Xiong Wang

