Chupa: Carving 3D Clothed Humans
from Skinned Shape Priors
using 2D Diffusion Probabilistic Models

Byungjun Kim1*, Patrick Kwon2*, Kwangho Lee2, Myunggi Lee2, Sookwan Han1, Daesik Kim2, Hanbyul Joo1
1Seoul National University, 2Naver Webtoon AI
*Indicates Equal Contribution
ICCV 2023 (Oral)

We propose Chupa, a 3D human generation pipeline that combines the generative power of diffusion models and neural rendering techniques to create diverse, realistic 3D humans. Our pipeline generalizes readily to unseen human poses and produces realistic results.

Chupa generates diverse, high-quality human meshes from an SMPL-X mesh.


We propose a 3D generation pipeline that uses diffusion models to generate realistic human digital avatars. Due to the wide variety of human identities, poses, and stochastic details, the generation of 3D human meshes has been a challenging problem. To address this, we decompose the problem into 2D normal map generation and normal-map-based 3D reconstruction. Specifically, we first simultaneously generate realistic normal maps for the front and back of a clothed human using pose-conditional diffusion models. For 3D reconstruction, we "carve" the prior SMPL-X mesh into a detailed 3D mesh according to the normal maps through mesh optimization. To further enhance high-frequency details, we present a diffusion resampling scheme on both body and facial regions, thus encouraging the generation of realistic digital avatars. We also seamlessly incorporate a recent text-to-image diffusion model to support text-based human identity control. Our method, Chupa, is capable of generating realistic 3D clothed humans with better perceptual quality and identity variety.
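The page does not spell out how the diffusion resampling scheme works. A common way to refine an existing image with a diffusion model is to perturb it with forward-process noise up to an intermediate timestep and then run the reverse process back to step 0. The sketch below illustrates that generic idea for a rendered normal map, under the assumption of a standard linear DDPM noise schedule. It is not Chupa's implementation: `make_alpha_bar`, `perturb`, `resample`, and the `denoise_step` callable are hypothetical names introduced here, and the schedule values are illustrative defaults.

```python
import numpy as np


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for a linear noise schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def perturb(x0, t, alpha_bar, rng):
    """Sample x_t from q(x_t | x_0) = N(sqrt(a_t) * x_0, (1 - a_t) * I)."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise


def resample(normal_map, t_start, alpha_bar, denoise_step, rng):
    """Noise a rendered normal map to step t_start, then denoise back to step 0.

    denoise_step(x_t, t) is a HYPOTHETICAL callable standing in for one reverse
    step of a trained, pose-conditional diffusion model.
    """
    x = perturb(normal_map, t_start, alpha_bar, rng)
    for t in range(t_start, -1, -1):
        x = denoise_step(x, t)
    return x
```

A smaller `t_start` keeps the result close to the rendered input, while a larger one gives the model more freedom to resynthesize detail. The value to use is a tuning choice that this page does not specify.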



Chupa takes a posed SMPL-X mesh and its frontal normal map as input. In the first stage, our diffusion model generates frontal and backside normal maps, which we call a dual normal map, conditioned on the SMPL-X frontal normal map. The dual normal map is then used to "carve" the input SMPL-X mesh into a clothed human mesh through normal-map-based mesh optimization with a differentiable rasterizer. To further improve quality, we refine the normal maps rendered from the full-body and facial regions of the optimized mesh through a resampling procedure, then run a second optimization with the refined normal maps to create the final mesh. Chupa also supports text-guided generation by leveraging a text-to-image diffusion model.
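The order of the stages described above can be summarized as a short control-flow sketch. Everything here is hypothetical scaffolding rather than the released code: `ChupaStages`, `carve`, and the four stage callables are names invented for illustration, and each stage is passed in as a function so the sketch stays independent of any particular diffusion model or rasterizer. Two simplifications are assumptions: body and face refinement are folded into a single `resample` callable, and the second optimization starts from the first-stage mesh.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple

Mesh = Any                          # placeholder type for a triangle mesh
NormalMap = Any                     # placeholder type for an image of normals
Dual = Tuple[NormalMap, NormalMap]  # (front, back) normal maps


@dataclass
class ChupaStages:
    """Hypothetical bundle of the stage functions named in the overview."""
    generate_dual: Callable[[NormalMap], Dual]   # SMPL-X front normals -> dual normal map
    optimize_mesh: Callable[[Mesh, Dual], Mesh]  # normal-map-based mesh optimization
    render_normals: Callable[[Mesh], Dual]       # render front/back normals of a mesh
    resample: Callable[[Dual], Dual]             # refine rendered normals (body and face)


def carve(smplx_mesh: Mesh, smplx_front_normal: NormalMap, stages: ChupaStages) -> Mesh:
    # Stage 1: generate the dual normal map conditioned on the SMPL-X front normals.
    dual = stages.generate_dual(smplx_front_normal)
    # Carve the SMPL-X prior toward the generated normal maps.
    coarse = stages.optimize_mesh(smplx_mesh, dual)
    # Stage 2: render normals of the optimized mesh and refine them by resampling.
    refined = stages.resample(stages.render_normals(coarse))
    # Second optimization (assumed here to start from the first-stage mesh).
    return stages.optimize_mesh(coarse, refined)
```

Passing the stages as callables keeps the ordering explicit and lets each component be swapped or stubbed out for testing.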



    author    = {Kim, Byungjun and Kwon, Patrick and Lee, Kwangho and Lee, Myunggi and Han, Sookwan and Kim, Daesik and Joo, Hanbyul},
    title     = {Chupa: Carving 3D Clothed Humans from Skinned Shape Priors using 2D Diffusion Probabilistic Models},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {15965-15976}