Beyond the Contact: Discovering Comprehensive Affordance for 3D Objects from Pre-trained 2D Diffusion Models

Hyeonwoo Kim¹*, Sookwan Han¹*, Patrick Kwon², Hanbyul Joo¹

Seoul National University¹, Naver Webtoon AI²
*Indicates equal contribution

ECCV 2024 (Oral)



Given a 3D object mesh, we generate numerous 3D human-object interaction samples and learn a novel affordance representation called Comprehensive Affordance (ComA), which models both contact and non-contact patterns.

Abstract

Understanding the inherent human knowledge in interacting with a given environment (e.g., affordance) is essential for improving AI to better assist humans. While existing approaches primarily focus on human-object contacts during interactions, such affordance representation cannot fully address other important aspects of human-object interactions (HOIs), i.e., patterns of relative positions and orientations. In this paper, we introduce a novel affordance representation, named Comprehensive Affordance (ComA). Given a 3D object mesh, ComA models the distribution of relative orientation and proximity of vertices in interacting human meshes, capturing plausible patterns of contact, relative orientations, and spatial relationships. To construct the distribution, we present a novel pipeline that synthesizes diverse and realistic 3D HOI samples given any 3D target object mesh. The pipeline leverages a pre-trained 2D inpainting diffusion model to generate HOI images from object renderings and lifts them into 3D. To avoid the generation of false affordances, we propose a new inpainting framework, Adaptive Mask Inpainting. Since ComA is built on synthetic samples, it can extend to any object in an unbounded manner. Through extensive experiments, we demonstrate that ComA outperforms competitors that rely on human annotations in modeling contact-based affordance. Importantly, we also showcase the potential of ComA to reconstruct human-object interactions in 3D through an optimization framework, highlighting its advantage in incorporating both contact and non-contact properties.

Key Takeaways

Problem Description

Traditional affordances focus on contact in human-object interactions. However, important patterns like relative orientations and positions cannot be expressed through contact alone. These overlooked aspects are crucial for further understanding affordance.

Proposed Solution

Comprehensive Affordance (ComA) is the first to capture both high-resolution contact and non-contact interactions, offering a complete view of object affordances.

We present a scalable method to learn ComA for any 3D object. In a nutshell, (1) we leverage a pre-trained diffusion model to generate large-scale samples of 3D humans interacting with the given object, and (2) we use the generated dataset to learn ComA.
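To make step (2) more concrete, here is a minimal sketch of how a proximity distribution could be accumulated from generated samples. It is an illustration under assumptions, not the paper's formulation: the function name `proximity_histogram`, the fixed distance bins, and the idea of tracking a single human vertex index against a single object point are all hypothetical simplifications. It assumes every generated human mesh shares the same vertex ordering.

```python
import math

def proximity_histogram(object_point, human_samples, vertex_index, bin_edges):
    """Normalized histogram of the distance between one object point and one
    human vertex, collected over all generated interaction samples.

    object_point : (x, y, z) on the object surface
    human_samples: list of human meshes, each a list of (x, y, z) vertices
                   with a shared vertex ordering (assumption)
    vertex_index : which human vertex to track (hypothetical parameter)
    bin_edges    : increasing distance thresholds defining the bins
    """
    counts = [0] * (len(bin_edges) - 1)
    for vertices in human_samples:
        d = math.dist(object_point, vertices[vertex_index])
        for i in range(len(counts)):
            if bin_edges[i] <= d < bin_edges[i + 1]:
                counts[i] += 1
                break  # distances outside all bins are simply ignored
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)
```

In this toy setting, the mass in the lowest bin behaves like a contact probability, while the remaining bins describe non-contact proximity patterns for the same point pair.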

Overview


Our 3D human-object interaction sample generation pipeline (1) renders the 3D object from multiple viewpoints, (2) inserts humans interacting with the object into these renderings using pre-trained inpainting diffusion models, and (3) lifts the inferred humans back into 3D space, resolving depth ambiguities through our specialized optimization pipeline.
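The first step, rendering from multiple viewpoints, needs a set of camera positions around the object. The sketch below shows one simple way to place them: evenly spaced azimuths on a circle at a fixed elevation. The function `orbit_cameras`, the z-up convention, and the default elevation are assumptions made for illustration; the page does not specify how viewpoints are sampled.

```python
import math

def orbit_cameras(num_views, radius, elevation_deg=20.0):
    """Camera positions on a circle around the origin (z is up, by assumption).

    Every camera sits at the same distance `radius` from the object center
    and would be oriented to look at the origin.
    """
    elev = math.radians(elevation_deg)
    cameras = []
    for k in range(num_views):
        azim = 2.0 * math.pi * k / num_views
        x = radius * math.cos(elev) * math.cos(azim)
        y = radius * math.cos(elev) * math.sin(azim)
        z = radius * math.sin(elev)
        cameras.append((x, y, z))
    return cameras

# Example: eight viewpoints, two units away from the object center.
views = orbit_cameras(num_views=8, radius=2.0)
```

Each position would then be passed to a renderer of choice to produce the object images used for inpainting.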

Adaptive Mask Inpainting


When humans are inserted into an image via inpainting, the object geometry and details inside the mask region are not preserved, which results in false affordances. Our Adaptive Mask Inpainting alleviates this by progressively specifying the inpainting region over diffusion timesteps.
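One way to picture a mask that is progressively specified over timesteps is sketched below. It is an assumption about the mechanism rather than the authors' implementation: the mask starts wide and is tightened around an estimate of where the human is, with a safety margin that shrinks linearly as denoising proceeds. The helpers `dilate` and `adaptive_mask`, the linear margin schedule, and the externally supplied `human_estimate` are all hypothetical.

```python
def dilate(mask, margin):
    """Grow a binary mask (list of rows of bools) by `margin` pixels."""
    h, w = len(mask), len(mask[0])
    out = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for yy in range(max(0, y - margin), min(h, y + margin + 1)):
                    for xx in range(max(0, x - margin), min(w, x + margin + 1)):
                        out[yy][xx] = True
    return out

def adaptive_mask(initial_mask, human_estimate, step, num_steps, max_margin):
    """Inpainting mask for denoising step `step` (0 = noisiest).

    The estimated human region is dilated by a margin that decays from
    `max_margin` to 0, then clipped to the initial mask so that pixels
    outside the original inpainting region are never modified.
    """
    progress = step / (num_steps - 1) if num_steps > 1 else 1.0
    margin = round(max_margin * (1.0 - progress))
    grown = dilate(human_estimate, margin)
    return [[a and b for a, b in zip(row_init, row_grown)]
            for row_init, row_grown in zip(initial_mask, grown)]
```

Because the mask only shrinks toward the human region, object pixels that the human does not occlude are left to the original rendering at later steps, which is the intuition behind preserving object geometry.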

Depth Optimization using Weak Auxiliary Cues


For each provided image, we find similar images with relevant human poses from different viewpoints using RANSAC, based on joint reprojection error. We utilize these images as weak auxiliary cues to optimize depth and resolve ambiguities in 3D space.
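A toy version of this RANSAC-style selection can be sketched as follows. It is a simplified illustration under stated assumptions, not the paper's procedure: cameras are orthographic and differ only by a known azimuth around the vertical axis, each view carries its own 3D pose hypothesis, and a single randomly drawn view forms the minimal sample. All names (`project`, `reprojection_error`, `ransac_similar_views`) and the dictionary keys are hypothetical.

```python
import math
import random

def project(joints3d, azimuth):
    """Orthographic projection after rotating about the vertical (y) axis."""
    c, s = math.cos(azimuth), math.sin(azimuth)
    return [(c * x + s * z, y) for x, y, z in joints3d]

def reprojection_error(joints3d, joints2d, azimuth):
    """Mean 2D distance between projected 3D joints and observed 2D joints."""
    projected = project(joints3d, azimuth)
    return sum(math.dist(p, q) for p, q in zip(projected, joints2d)) / len(joints2d)

def ransac_similar_views(views, threshold, iterations=200, seed=0):
    """Indices of the largest set of views that agree on one 3D pose.

    views: list of dicts with keys 'azimuth', 'joints2d', 'joints3d'.
    Each iteration draws one view, treats its 3D pose as the hypothesis,
    and counts the views whose joint reprojection error is below `threshold`.
    """
    rng = random.Random(seed)
    best = []
    for _ in range(iterations):
        hypothesis = rng.choice(views)["joints3d"]
        inliers = [i for i, v in enumerate(views)
                   if reprojection_error(hypothesis, v["joints2d"], v["azimuth"]) < threshold]
        if len(inliers) > len(best):
            best = inliers
    return best
```

The returned inlier views play the role of the weak auxiliary cues: they show a similar pose from other viewpoints and can therefore constrain the depth that a single image leaves ambiguous.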

ComA enables diverse applications, including reconstructing human-object interactions in 3D. These applications can be adapted to any 3D object using our dataset generation method.

Video

Results

Contact-based Affordance

Motorcycle

Keyboard

Skateboard

Soccer Ball

Suitcase

Tennis Racket

Orientational Affordance

Stool

Chair

Spatial Affordance

Input

Full Body

Hand

Face

BibTeX

@misc{coma,
      title={Beyond the Contact: Discovering Comprehensive Affordance for 3D Objects from Pre-trained 2D Diffusion Models}, 
      author={Hyeonwoo Kim and Sookwan Han and Patrick Kwon and Hanbyul Joo},
      year={2024},
      eprint={2401.12978},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2401.12978},
}