DAViD: Modeling Dynamic Affordance of 3D Objects using Pre-trained Video Diffusion Models

Seoul National University
[Teaser figure]

Given a 3D object mesh, we generate 4D Human-Object Interaction (HOI) samples and learn a new type of affordance, called Dynamic Affordance, which models the dynamic patterns of both the human and the object during HOI.


Abstract

Understanding how humans use objects is crucial for AI systems that aim to improve daily life. Existing studies of this ability focus on human-object patterns (e.g., contact, spatial relation, orientation) in static situations, while learning Human-Object Interaction (HOI) patterns over time (i.e., the movement of human and object) remains relatively unexplored. In this paper, we introduce a novel type of affordance named Dynamic Affordance. For a given input 3D object mesh, we learn the dynamic affordance to model the distribution of both (1) human motion and (2) human-guided object pose during interactions. As a core idea, we present a method to learn 3D dynamic affordance from synthetically generated 2D videos, leveraging a pre-trained video diffusion model. Specifically, we propose a pipeline that first generates 2D HOI videos from the 3D object and then lifts them into 4D HOI samples. Once diverse 4D HOI samples are generated for various target objects, we train our model, DAViD, which combines a Low-Rank Adaptation (LoRA) module for a pre-trained human motion diffusion model (MDM) with an object pose diffusion model guided by human pose. Our motion diffusion model further extends to multi-object interactions, demonstrating the advantage of our LoRA-based pipeline for combining learned concepts of object usage. Through extensive experiments, we demonstrate that DAViD outperforms baselines in generating human motion with HOIs.
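
To make the LoRA component concrete, below is a minimal PyTorch sketch of how a low-rank adapter is typically attached to the linear layers of a frozen, pre-trained diffusion backbone such as MDM. The class and function names (`LoRALinear`, `inject_lora`) and the hyperparameters are our own illustrative choices, not code from the paper.

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * up(down(x))."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # pre-trained weights stay frozen
        self.down = nn.Linear(base.in_features, r, bias=False)
        self.up = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)       # adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

def inject_lora(module: nn.Module, r: int = 8):
    """Recursively replace every nn.Linear (e.g., a motion diffusion
    model's attention and feed-forward projections) with a LoRA copy."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, LoRALinear(child, r=r))
        else:
            inject_lora(child, r=r)
```

Because only the small `down`/`up` projections are trained, each object's usage concept stays compact, which is what makes combining several adapters for multi-object interactions plausible.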


Method Overview

[Method overview figure]

Our method consists of two parts: (1) 4D HOI sample generation, shown in the upper box, and (2) learning dynamic affordance from the generated 4D HOI samples, shown in the lower box. First, we create a 2D HOI video using structure guidance from object renderings, and generate 4D HOI samples with our uplifting pipeline. The generated 4D HOI samples are then used to train DAViD, which learns the patterns of human motion and of object pose conditioned on the human pose; see the sketch below.
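
As an illustration of the second part, here is a minimal, self-contained PyTorch sketch of an object-pose denoiser conditioned on the human pose, trained with the standard noise-prediction diffusion loss. All dimensions, the toy cosine noise schedule, and the module names are assumptions for illustration, not the DAViD architecture itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectPoseDenoiser(nn.Module):
    """Toy denoiser for an object-pose diffusion model: predicts the
    noise added to an object pose, conditioned on the diffusion timestep
    and the concurrent human pose (the "human pose guidance" above).
    A 9-D pose (3-D translation + 6-D rotation) and a 263-D human-pose
    feature are illustrative choices, not the paper's."""
    def __init__(self, pose_dim: int = 9, human_dim: int = 263, hidden: int = 512):
        super().__init__()
        self.time_embed = nn.Sequential(nn.Linear(1, hidden), nn.SiLU())
        self.net = nn.Sequential(
            nn.Linear(pose_dim + human_dim + hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, pose_dim),
        )

    def forward(self, noisy_pose, t, human_pose):
        temb = self.time_embed(t.float().unsqueeze(-1))
        return self.net(torch.cat([noisy_pose, human_pose, temb], dim=-1))

# One training step of the standard eps-prediction loss on stand-in data.
model = ObjectPoseDenoiser()
clean_pose = torch.randn(16, 9)          # object poses from generated 4D samples
human_pose = torch.randn(16, 263)        # paired human-pose condition
t = torch.randint(0, 1000, (16,))
alpha_bar = torch.cos(t.float() / 1000 * torch.pi / 2) ** 2   # toy schedule
eps = torch.randn_like(clean_pose)
noisy_pose = alpha_bar.sqrt()[:, None] * clean_pose \
           + (1 - alpha_bar).sqrt()[:, None] * eps
loss = F.mse_loss(model(noisy_pose, t, human_pose), eps)
loss.backward()
```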

Video

Results


BibTeX

Coming soon!