SMPLitex: A Generative Model and Dataset for 3D Human Texture Estimation from Single Image

Dan Casas and Marc Comino-Trinidad
British Machine Vision Conference (BMVC), 2023



Abstract

We propose SMPLitex, a method for estimating and manipulating the complete 3D appearance of humans captured from a single image. SMPLitex builds upon recently proposed generative models for 2D images, and extends their use to the 3D domain through pixel-to-surface correspondences computed on the input image. To this end, we first train a generative model for complete 3D human appearance, and then fit it to the input image by conditioning the generative model on the visible parts of the subject. Furthermore, we propose a new dataset of high-quality human textures built by sampling SMPLitex conditioned on subject descriptions and images. We quantitatively and qualitatively evaluate our method on 3 publicly available datasets, demonstrating that SMPLitex significantly outperforms existing methods for human texture estimation while allowing for a wider variety of tasks such as editing, synthesis, and manipulation.


Citation

@inproceedings{casas2023smplitex,
    title = {{SMPLitex: A Generative Model and Dataset for 3D Human Texture Estimation from Single Image}},
    author = {Casas, Dan and Comino-Trinidad, Marc},
    booktitle = {British Machine Vision Conference (BMVC)},
    year = {2023}
}

Description

From a single image where a human is partly visible, SMPLitex automatically estimates a complete 3D texture map that can be applied to SMPL body mesh sequences.
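
As a quick illustration of how an estimated texture map can be applied to an SMPL mesh, the snippet below attaches a texture through the mesh's UV coordinates using trimesh. This is only a sketch: the file names are placeholders, and it assumes the OBJ already carries the SMPL UV layout.

import trimesh
from PIL import Image

# Load an SMPL template mesh exported with its UV layout (placeholder file names).
mesh = trimesh.load("smpl_uv_template.obj", process=False)
texture = Image.open("estimated_texturemap.png")

# Attach the estimated texture map via the mesh's existing UV coordinates.
mesh.visual = trimesh.visual.TextureVisuals(uv=mesh.visual.uv, image=texture)
mesh.show()  # interactive preview of the textured avatar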

Under the hood, SMPLitex leverages a fine-tuned latent diffusion model (LDM) trained to generate texture maps for photorealistic SMPL avatars. Our key intuition is that, to enable the estimation of 3D human appearance from a single image, we can condition the synthesis of an LDM for human appearance on the visible parts of the subject in the input image.

To this end, given an input image, we estimate pixel-to-surface correspondences and project the pixels of the input image that have assigned surface correspondences onto a partial UV map. We then use the partial UV map as a conditioning signal to sample a fine-tuned diffusion model for human texture maps.
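
The following sketch illustrates the partial UV map construction; it is not the released implementation. The per-pixel UV coordinates are assumed to come from a DensePose-style predictor, and the commented completion step uses a placeholder checkpoint id with a standard diffusers inpainting pipeline.

import numpy as np

def build_partial_uv_map(image, uv_coords, mask, size=512):
    """Scatter visible image pixels into UV space.

    image:     (H, W, 3) uint8 input photo
    uv_coords: (H, W, 2) per-pixel UV surface coordinates in [0, 1]
               (e.g. from a DensePose-style predictor)
    mask:      (H, W) bool, True where a body surface point is visible
    Returns the partial texture map and a boolean mask of filled texels.
    """
    uv_map = np.zeros((size, size, 3), dtype=np.uint8)
    filled = np.zeros((size, size), dtype=bool)
    ys, xs = np.nonzero(mask)
    u = np.clip((uv_coords[ys, xs, 0] * (size - 1)).round().astype(int), 0, size - 1)
    # flip V so that v = 0 maps to the bottom row of the texture image
    v = np.clip(((1.0 - uv_coords[ys, xs, 1]) * (size - 1)).round().astype(int), 0, size - 1)
    uv_map[v, u] = image[ys, xs]
    filled[v, u] = True
    return uv_map, filled

# Hypothetical completion step with a diffusers inpainting pipeline
# (placeholder checkpoint id; the actual conditioning used in the paper may differ):
#
#   from PIL import Image
#   from diffusers import StableDiffusionInpaintPipeline
#   pipe = StableDiffusionInpaintPipeline.from_pretrained("path/to/smplitex-finetuned")
#   full_tex = pipe(prompt="a person, full-body SMPL texture map",
#                   image=Image.fromarray(uv_map),
#                   mask_image=Image.fromarray((~filled).astype("uint8") * 255)).images[0]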

Results

Here we show results using input images from the DeepFashion-MultiModal dataset. Each triplet shows the input image, the estimated texture map, and a 3D render.




SMPLitex samples

SMPLitex, our generative model used to estimate human textures from images, can also be sampled with text prompts. We leverage this capability to build a dataset of high-quality textures by simply sampling the latent space. Below we showcase a few of the SMPLitex samples. Notice that none of these were used to fine-tune SMPLitex.
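
A minimal sketch of how such a dataset could be generated, assuming the fine-tuned model is packaged as a standard Stable Diffusion checkpoint loadable with the diffusers library; the checkpoint id and prompt template below are placeholders, not the released ones.

import torch
from diffusers import StableDiffusionPipeline

# Placeholder checkpoint id for the fine-tuned texture-map model.
pipe = StableDiffusionPipeline.from_pretrained(
    "path/to/smplitex-finetuned", torch_dtype=torch.float16
).to("cuda")

descriptions = [
    "a woman in a red summer dress",
    "a man wearing a blue business suit",
    "a hiker in outdoor clothing",
]

# Sample one texture map per subject description.
for i, desc in enumerate(descriptions):
    prompt = f"a SMPL texture map of {desc}"   # hypothetical prompt template
    image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
    image.save(f"smplitex_sample_{i:03d}.png")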



Comparison with state-of-the-art

Below we qualitatively compare SMPLitex to state-of-the-art methods for texture map estimation from a single image. SMPLitex outputs higher-quality textures, including face details and garment wrinkles. See the main paper for a quantitative analysis.


Acknowledgments

This project has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No 899739 (H2020-FETOPEN-2018-2020 CrowdDNA project).

Contact

Dan Casas – dan.casas@urjc.es