[Paper Review] PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization

Byoungsung Lim
4 min read · May 3, 2021

S. Saito, Z. Huang, R. Natsume, S. Morishima, A. Kanazawa, and H. Li, “PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization,” ICCV, 2019, pp. 2304–2314.

University of Southern California
USC Institute for Creative Technologies
Waseda University
University of California, Berkeley
Pinscreen

Overview

This paper proposes a novel implicit function that reconstructs a mesh surface with RGB texture in 3D space from a single-view image or multi-view images. I believe the key contributions of this work are two ideas. First, the design of an implicit function that decodes pixel-aligned feature vectors to capture the ambiguous shape of a clothed human body, which has been a challenging task for widely used parametric models. Second, feeding the feature vector learned for the shape into the texture decoder to recover texture for unseen regions.

Goal

Introduce the Pixel-aligned Implicit Function (PIFu), an end-to-end deep learning method for reconstructing highly detailed clothed humans with texture in 3D from a single image or multiple views.

Contributions

  • Introduced a novel pixel-aligned implicit function that spatially aligns the pixel-level information of the input image
  • Preserved high-frequency details present in the image while capturing unseen regions’ shape and texture
  • Generated textured 3D surfaces of a clothed person from a single RGB camera image

Related Work

Single-View 3D Human Digitization

  • Parametric models of human body shape such as SMPL are widely used because reconstructing the ambiguous figure of a human body requires strong priors. However, they only produce a naked body, ignoring complex topology such as dresses, skirts, and long hair.
  • Template-free methods such as BodyNet resolve this issue by directly generating voxel representations. However, they often miss fine-scale details because of the high memory cost of voxel grids.
  • Silhouette-based methods such as SiCloPe are more memory efficient. However, they struggle to infer concave regions and produce lower-quality geometry.

Multi-View 3D Human Digitization

  • Early attempts based on visual hulls use silhouettes from multiple views to carve out the visible portion of a capture volume. They require a large number of cameras for reasonable quality, and concavities remain difficult to recover.
  • Multi-view stereo techniques and methods using controlled illumination show better results, but they are significantly less flexible and harder to deploy in practice. These methods also rely on memory-intensive voxel representations.

PIFu: Pixel-Aligned Implicit Function

An implicit function defines a surface as a level set of a function f. The proposed pixel-aligned implicit function consists of a fully convolutional image encoder g and a continuous implicit function f represented by multi-layer perceptrons (MLPs), where the surface is defined as a level set of f(F(x), z(X)) = s, s ∈ ℝ. Here, for a 3D point X, x = π(X) is its 2D projection, z(X) is its depth in camera coordinates, and F(x) = g(I(x)) is the pixel-aligned image feature sampled at x (with bilinear interpolation, since x is continuous).
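As a rough sketch of this query (not the authors' code), the following NumPy snippet bilinearly samples a feature map at the projected pixel, appends the depth, and decodes it with a toy random-weight MLP standing in for the trained network:

```python
import numpy as np

def bilinear_sample(feat_map, x, y):
    """Bilinearly sample a (C, H, W) feature map at continuous pixel (x, y)."""
    C, H, W = feat_map.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * feat_map[:, y0, x0]
            + wx * (1 - wy) * feat_map[:, y0, x1]
            + (1 - wx) * wy * feat_map[:, y1, x0]
            + wx * wy * feat_map[:, y1, x1])

def make_mlp(dims, rng):
    """Toy MLP with random weights, standing in for the trained decoder f."""
    Ws = [rng.standard_normal((dims[i + 1], dims[i])) * 0.1
          for i in range(len(dims) - 1)]
    def mlp(v):
        for i, W in enumerate(Ws):
            v = W @ v
            if i < len(Ws) - 1:
                v = np.maximum(v, 0.0)       # ReLU on hidden layers
        return 1.0 / (1.0 + np.exp(-v[0]))   # sigmoid -> occupancy in (0, 1)
    return mlp

def pifu_query(feat_map, X, mlp):
    """f(F(x), z(X)): project X, sample the pixel-aligned feature, append depth,
    and decode an occupancy value. Assumes an orthographic camera, so the
    projection of (x, y, z) is simply the pixel (x, y)."""
    x, y, z = X
    F_x = bilinear_sample(feat_map, x, y)
    return mlp(np.concatenate([F_x, [z]]))

rng = np.random.default_rng(0)
feat_map = rng.standard_normal((8, 16, 16))   # stand-in for g(I)
mlp = make_mlp([9, 16, 1], rng)               # 8 feature dims + 1 depth
s = pifu_query(feat_map, X=(4.3, 7.8, 0.2), mlp=mlp)
```

Because the function is queried per point, the surface can later be extracted at any resolution (e.g. with Marching Cubes) without storing a dense voxel grid.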

Single-view Surface Reconstruction

PIFu is trained by minimizing the mean squared error (MSE) between the implicit function's value at sampled 3D points and the ground-truth occupancy (1 inside the mesh, 0 outside); the surface is then extracted as the 0.5 level set.
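A minimal sketch of this loss, using a unit sphere as a stand-in for the ground-truth mesh (illustrative only, not the authors' implementation):

```python
import numpy as np

def occupancy_loss(pred, X_samples, is_inside):
    """MSE between predicted occupancy f(X) and the 0/1 ground truth.
    At test time the surface is extracted as the 0.5 level set."""
    gt = is_inside(X_samples).astype(float)  # 1 inside the mesh, 0 outside
    return np.mean((pred - gt) ** 2)

# Toy example: the "mesh" is a unit sphere centered at the origin.
is_inside = lambda X: np.linalg.norm(X, axis=1) < 1.0
X = np.array([[0.0, 0.0, 0.0],   # inside  -> ground truth 1
              [2.0, 0.0, 0.0]])  # outside -> ground truth 0
pred = np.array([0.9, 0.2])      # model predictions at those points
loss = occupancy_loss(pred, X, is_inside)  # mean of (0.1^2, 0.2^2) = 0.025
```

In the paper, the sampled points are a mix of points near the surface and uniform samples in the bounding volume, which balances detail against spurious geometry.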

In essence, the MLP learns to decide whether a given point in 3D space lies inside or outside the surface, given the encoded feature at the pixel it projects to and its depth.

Texture Inference

Training an implicit function for texture works the same way, with a few changes. First, the ground truth is the RGB color at surface points rather than occupancy. Second, the color decoder is conditioned on the image features learned for the shape, which regularizes training (avoiding overfitting to surface colors) and allows it to predict plausible colors for unseen regions.
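The paper trains the color decoder with an L1 loss on sampled surface colors; a toy version of that loss (illustrative values only):

```python
import numpy as np

def texture_loss(pred_rgb, gt_rgb):
    """L1 loss between predicted and ground-truth RGB at surface points.
    In the paper, the color decoder also consumes the shape features,
    which is what lets it hallucinate colors for occluded regions."""
    return np.mean(np.abs(pred_rgb - gt_rgb))

pred = np.array([[0.8, 0.1, 0.1],
                 [0.2, 0.6, 0.3]])   # predicted colors at two surface points
gt   = np.array([[1.0, 0.0, 0.0],
                 [0.0, 0.5, 0.5]])   # ground-truth colors
loss = texture_loss(pred, gt)        # mean absolute error = 0.15
```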

Multi-View Stereo

In the multi-view setting, each view produces its own latent feature for the same 3D point; instead of decoding a single view's feature, the network merges the features by average pooling and decodes the pooled vector into one surface prediction. This fusion scheme requires projecting each sampled 3D point into every view, which I believe assumes known camera parameters for each image.
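The pooling step itself is simple; as a sketch (with made-up per-view embeddings):

```python
import numpy as np

def fuse_views(per_view_embeddings):
    """Average-pool the per-view latent features for the same 3D point X.
    Each view projects X with its own camera, samples its own pixel-aligned
    feature, and embeds it; the pooled vector is then decoded by a shared
    MLP into a single occupancy value."""
    return np.mean(per_view_embeddings, axis=0)

# Hypothetical 2-D embeddings of one 3D point seen from three views.
phis = [np.array([0.2, 0.4]),
        np.array([0.6, 0.0]),
        np.array([0.4, 0.2])]
fused = fuse_views(phis)   # -> [0.4, 0.2]
```

Because mean pooling is permutation-invariant and works for any number of inputs, the same trained network handles an arbitrary number of views.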

Results

Qualitatively, the paper shows high-quality mesh detail on ambiguously clothed humans and plausible texture in occluded regions.

Quantitatively, it reports state-of-the-art performance on three evaluation metrics.

Evaluation Metrics

  • Average point-to-surface Euclidean distance (P2S) in cm from the vertices on the reconstructed surface to the ground truth
  • Chamfer distance between the reconstructed and the ground truth surfaces
  • L2 error between the reconstructed and the ground truth surfaces’ normal maps
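The first two metrics can be sketched with a brute-force nearest-neighbor computation over point samples (a simplification; the paper measures distance to the actual surface):

```python
import numpy as np

def p2s(points, surface_points):
    """Average point-to-surface distance: for each reconstructed vertex,
    the distance to its nearest ground-truth surface sample."""
    d = np.linalg.norm(points[:, None, :] - surface_points[None, :, :], axis=2)
    return d.min(axis=1).mean()

def chamfer(A, B):
    """Symmetric Chamfer distance between two point sets (one common variant:
    the sum of the two directed average distances)."""
    return p2s(A, B) + p2s(B, A)

# Toy point sets standing in for reconstructed and ground-truth surfaces.
recon = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
gt    = np.array([[0.0, 0.1, 0.0], [1.0, 0.0, 0.0]])
d_p2s = p2s(recon, gt)        # 0.05
d_cham = chamfer(recon, gt)   # 0.1
```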

Discussion

The paper introduced a novel pixel-aligned implicit function that captures fine detail of ambiguously clothed human bodies while simultaneously inferring plausible textures for unseen regions.

It opens up promising directions for 3D digitization of ambiguous figures and provides an end-to-end framework for training and inference.


Byoungsung Lim

Pursuing master's degree in Artificial Intelligence at Korea University. Self-motivated team player to bring positive impact on the world.