This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination. IEEE TRANSACTIONS ON CYBERNETICS


Template Deformation-Based 3-D Reconstruction of Full Human Body Scans From Low-Cost Depth Cameras

Zhenbao Liu, Member, IEEE, Jinxin Huang, Shuhui Bu, Junwei Han, Senior Member, IEEE, Xiaojun Tang, and Xuelong Li, Fellow, IEEE

Abstract—Full human body shape scans provide valuable data for a variety of applications including anthropometric surveying, clothing design, human-factors engineering, health, and entertainment. However, the high price, large volume, and difficult operation of professional 3-D scanners preclude their use in home entertainment. Recently, portable low-cost red green blue-depth cameras such as the Kinect have become popular for computer vision tasks. However, the infrared mechanism of this type of camera leads to noisy and incomplete depth images. We construct a stereo full-body scanning environment composed of multiple depth cameras and propose a novel registration algorithm. Our algorithm determines a segment constrained correspondence for two neighboring views and integrates them using rigid transformation. Furthermore, it aligns all of the views based on a uniform error distribution. The generated 3-D mesh model is typically sparse and noisy, and often contains holes, which causes a loss of surface detail. To address this, we introduce a geometric and topological fitting prior in the form of a professionally designed high-resolution template model. We formulate a template deformation optimization problem to fit the high-resolution model to the low-quality scan. Its solution overcomes the obstacles posed by different poses, varying body details, and surface noise. The entire process is free of body and template markers, fully automatic, and achieves satisfactory reconstruction results.

Manuscript received November 11, 2015; accepted January 28, 2016. This work was supported in part by the National Natural Science Foundation of China under Grant 61003137, Grant 61473231, Grant 61573284, and Grant 61522207, in part by the NorthWestern Polytechnical University Basic Research Fund under Grant 310201401-JCQ01009 and Grant 310201401-JCQ01012, in part by the Open Fund of the State Key Laboratory of Computer Aided Design & Computer Graphics in Zhejiang University under Grant A1509, in part by the Open Research Foundation of the State Key Laboratory of Digital Manufacturing Equipment and Technology in the Huazhong University of Science and Technology under Grant DMETKF2015009, in part by the Fund of the National Engineering and Research Center for Commercial Aircraft Manufacturing under Grant SAMC14-JS-15-045, and in part by the Shaanxi Natural Science Fund under Grant 2015JM6344. This paper was recommended by Associate Editor Q. Ji. (Corresponding authors: Shuhui Bu and Junwei Han.) Z. Liu, J. Huang, S. Bu, J. Han, and X. Tang are with Northwestern Polytechnical University, Xi’an 710072, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]). X. Li is with the Center for OPTical IMagery Analysis and Learning, State Key Laboratory of Transient Optics and Photonics, Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi’an 710119, China (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCYB.2016.2524406

Index Terms—3-D reconstruction, deformation optimization, depth cameras, full human body, template deformation.

I. INTRODUCTION

THE last decade has witnessed the emergence of human body scanners used to capture the entire shape and appearance of the human body. Human body scans are widely used in anthropometric surveying, clothing design, gaming, virtual reality, augmented reality, animation, television production, human-factors engineering, medicine, and health (e.g., sport training and remote medical diagnosis). This wide range of applications underscores the importance of human shape-capturing technology. 3-D scanning devices provide a straightforward approach to obtaining the 3-D human shape for many applications. For example, professional 3-D human body scanners are extensively used in the health care industry, providing high-quality point cloud data for orthotics and prosthetics. However, these scanners are impractical for home entertainment because of their high price, large volume, and difficult operation. In recent years, portable, low-cost, and easy-to-operate red green blue-depth (RGB-D) cameras, for example, the Kinect [1], have become highly appealing for many applications including sign language recognition [2], comics [3], face recognition [4], and scene classification [5]. Readers are referred to recent comprehensive surveys on Kinect sensor applications [6], [7]. Unfortunately, this type of camera generates low-quality depth images because of the fundamental mechanism of infrared sensors, rendering the scanned point cloud noisy and incomplete [8]. This is the main impediment to providing high-quality immersive and virtual applications. To obtain a full human body scan using RGB-D cameras, we construct a stereo scanning environment by placing multiple depth cameras in a circle and simultaneously capturing different views with as many overlapping points as possible. This avoids capturing the involuntary body movements that occur between multiple scans when a single camera is rotated and translated.
Unfortunately, automatically registering these different views and generating a mesh model poses a challenge. We propose a novel registration algorithm that determines a segment constrained correspondence between two neighboring partial point clouds and integrates
2168-2267 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.



them through rigid transformation. Global registration is used to distribute registration errors evenly among all of the views. The generated 3-D mesh model is typically sparse, noisy, and incomplete, leading to a loss of surface detail and making it unsuitable for home or industrial applications. We therefore introduce a professionally designed high-resolution template model to fit the low-quality human body scan. The template model is essential as a geometric and topological prior for the robust reconstruction of the human body in the face of incomplete and noisy data acquisition. This method is expected to overcome several challenging issues: different poses, varying body details, and surface noise. We formulate a template deformation optimization problem to fit the low-quality scanned mesh with a high-resolution template model. The solution to this problem lies in establishing a nonrigid point-to-point correspondence between two mesh surfaces with the same overall shape but with variations in details. The deformable template-based reconstruction does not require body markers and is accomplished in a global-to-global manner. Furthermore, it does not require preassigned control points for a given template. The entire process is fully automatic and easy to operate. A suite of experiments with different users demonstrates that this solution achieves satisfactory reconstruction results.

A. Applications

Our technique can assist in various modeling tasks in the fields of clothing design, anthropometric surveying, and entertainment, which will benefit from high-quality reconstruction and affordable depth cameras. Everyday users can provide their 3-D human body models to online clothing stores for accurate clothing customization. Moreover, these models can be easily and conveniently scanned with low-cost cameras at home.
This technique can also be applied in virtual fitting rooms, allowing customers to observe the three-dimensional effect of trying on clothes from the latest collections. In home games, characters can be customized according to users' scanned body types using our algorithm, enhancing the user experience.

II. RELATED WORKS

3-D human body scanning, modeling, and reconstruction have been well researched in the fields of multimedia, virtual reality, and computer graphics in recent years. Readers are referred to a comprehensive survey for details on recent advances [9]. We focus on recent related techniques involving human body reconstruction with depth cameras, human body detection and tracking with depth cameras, rigid and nonrigid registration, and template-based reconstruction.

A. Human Body Reconstruction With Depth Cameras

KinectFusion [10] enables a user holding and moving a standard Kinect camera to rapidly create a detailed 3-D reconstruction of an indoor scene. Only the depth data from the Kinect are used to track the 3-D pose of the sensor and reconstruct geometrically precise 3-D models of the physical scene in real

time. Tong et al. [11] presented a novel scanning system for capturing different parts of a human body at close distance and then reconstructing the whole 3-D human body via multiple Kinects. This system requires external calibration and an accurate capture space, which are not available in home applications. Barmpoutis [12] proposed real-time human reconstruction employing a novel parameterization of cylindrical-type objects using Cartesian tensor and B-spline bases; the tensor body is fitted to the input Kinect data. Alexiadis et al. [13] proposed real-time full 3-D reconstruction of moving foreground objects from depth cameras. Weiss et al. [14] combined low-resolution image silhouettes with coarse range data to estimate a parametric model of the body. Accurate 3-D shape estimates are obtained by combining multiple monocular views of a person moving in front of the sensor with the shape completion and animation of people (SCAPE) body model. Chen et al. [15] adopted training data, including 3-D meshes for multiple users with different poses as obtained from SCAPE, to model existing 3-D human body meshes. Xu et al. [16] measured accurate parameters of the human body with large-scale motion from a Kinect sensor by adopting space-time analysis across posture variations to recover the human body with clothes. The methods mentioned above aim to reconstruct a real human body and commonly suffer from high sensor noise, making the reconstruction quality difficult to improve, which limits their application in home gaming and industrial virtual reality.

B. Human Body Detection and Tracking With Depth Cameras

Shotton et al. [17] described a new approach to human pose estimation that can quickly and accurately predict the 3-D positions of body joints from a single depth image without any temporal information.
The key to the approach is the use of a large, realistic, and highly varied synthetic set of training images, which allows the learned models to be largely invariant to factors such as pose, body shape, field-of-view cropping, and clothing. Gedik and Alatan [18] proposed an automated 3-D tracking algorithm for rigid objects based on the fusion of vision and depth sensors through an extended Kalman filter. They used optical-flow estimation based on intensity and shape-index maps of the 3-D point cloud to enhance tracking performance. Alexiadis et al. [19] built a real-time automatic system for dance performance evaluation using a Kinect RGB-D sensor, providing visual feedback for beginners in a 3-D virtual scene. Liu et al. [20] used depth sensor data from a Kinect to track human movement and evaluate player energy consumption, investigating how much energy is expended while playing in a virtual environment. Helten et al. [21] presented a sensor fusion approach for real-time full-body tracking that combines a generative tracker with a discriminative tracker retrieving the closest poses in a motion database. Both trackers employ data from a small number of inexpensive body-worn inertial sensors providing reliable and complementary information. Shum et al. [22] proposed a posture reconstruction framework using a natural posture database to reconstruct a valid posture from incomplete and noisy Kinect data in real


time, followed by optimization for a posture that corrects the unreliable body parts from the Kinect. Ni et al. [23] integrated depth information into a multilevel processing pipeline for detecting human activities and improved the accuracy of human key pose detection. Zhang et al. [24] introduced a single-pass, decision directed acyclic graph-based framework and addressed the task of multiple human detection and tracking in complex, dynamic, indoor environments, in realistic and diverse settings, using an RGB-D camera on a mobile robot.

C. Rigid and Nonrigid Registration

In early work, Besl and McKay [25] proposed the classic iterative closest point (ICP) algorithm, which iteratively minimizes the mean-square distance for rigid registration. There have been many variations of ICP for object reconstruction, and Pomerleau et al. [26] recently compared these ICP variants comprehensively on real-world data sets. Zhuang and Sudhakar [27] proposed a single-stage linear method to simultaneously fit rotation and translation (pose) parameters given two sets of 3-D point measurements. Chui and Rangarajan [28] introduced the thin-plate spline (TPS) for nonrigid alignment, which fits a mapping function between two corresponding point sets. Myronenko and Song [29] modeled rigid and nonrigid alignment with motion coherence theory, viewing the registration between two point sets as a temporal motion process in which one point set moves coherently to align with the other. The motion coherence constraint penalizes derivatives of all orders of the velocity field, whereas TPS penalizes only the second-order derivative. Cui et al. [30], [31] described a method for 3-D object scanning by aligning depth scans taken from around an object with a time-of-flight camera. They overcome the sensor's substantial random noise and nontrivial systematic bias. Their algorithm is based on a new combination of a 3-D superresolution method with a probabilistic scan alignment approach that explicitly takes into account the noise characteristics of the sensor. Cui and Stricker [32] proposed a global alignment method for 3-D shape closure, enabling the combination of scans taken from around an object with a Kinect. Their method is mainly based on correspondence of two frames using pairwise ICP registration, and an exponential transform for each frame under a conformal geometric algebra using an energy function. Chang and Zwicker [33] presented a global registration algorithm for articulated shapes that optimizes the registration simultaneously over multiple frames. They express the surface motion in terms of a reduced deformable model and solve for joints and skinning weights. Most methods achieve highly accurate alignments for subtle warps but are not suitable for large-scale deformations such as a bending arm, which are common in home applications of depth sensors.

D. Template-Based Reconstruction

A popular approach to reconstructing sequences of multiple range scans is to design a template shape and fit it


to the scanned data. Allen et al. [34] collected 250 human whole-body 3-D laser scans with anthropometric markers from the CAESAR data set, which are used to characterize and explore the space of probable body shapes. All of the shapes are required to be in similar poses. They parameterize high-resolution template meshes and fit them to detailed human body range scans with sparse 3-D markers. The system proposed by Anguelov et al. [35] deforms a data-driven SCAPE model to fit scanned data given a limited set of markers specifying the target shape. Using a single static scan and markers for motion capture, their system constructs a high-quality animated surface model of a moving person with realistic muscle deformation. Kakadiaris and Metaxas [36] used spatiotemporal analysis of the deforming apparent contour to obtain a 3-D human model. Hilton et al. [37] presented an approach for automatic reconstruction of recognizable avatars from a set of low-cost color images of a person taken from four orthogonal views. Seo and Magnenat-Thalmann [38] constructed a 3-D human model by driving template deformation with a skeleton; the human template couples a standard human skeleton structure to a skin surface. Given a stream of synchronized video images that record a human performance from multiple viewpoints and an articulated template of the performer, Vlasic et al. [39] performed nonrigid deformation of the template to fit the silhouettes by integrating the visual hull with linear blend skinning. Starck and Hilton [40] proposed a novel multiple-resolution model-fitting technique and combined multiple shape cues from camera images for coarse-to-fine model-based reconstruction with preservation of the animation structure. These methods rely on tracked marker locations to fit the template, and poses are required to be similar to that of the template, whereas our approach is not so restricted. It is notable that a more recent work by Zhang et al.
[41] solves a similar problem, performing full-body motion capture from data captured by three depth cameras and a pair of pressure-sensing shoes. They can accurately reconstruct both full-body kinematics and dynamics data, and the whole process is nonintrusive and fully automatic. During the tracking stage, 3-D skeletal poses are reconstructed from the captured data and updated incrementally. A significant contribution is that they successfully integrate depth data from multiple depth cameras, foot pressure data, detailed full-body geometry, and environmental contact constraints into a unified framework. The objective of this paper differs from that of this representative method: we mainly focus on generating an accurate and high-quality full human body using data captured by multiple depth cameras. Another difference is that the human body model used in [41] is based on statistical analysis of a database of preregistered 3-D full-body scans, and principal component analysis is applied to hundreds of aligned body scans to construct a low-dimensional parametric model for human body representation. Our method depends on only one high-resolution template with an arbitrary pose to fit the point cloud. This template, designed by an artist, is regarded as a geometric and topological prior for the robust reconstruction of the human body in the face of missing depth data, strong sensor noise, varied clothing, and different poses.


Fig. 1. Flowchart of deformable template-based 3-D human body reconstruction using multiple low-cost depth sensors.

III. OVERVIEW

We propose a novel pipeline for obtaining a high-resolution 3-D human body model using multiple low-cost depth cameras. To address the shortcomings of the scanned point cloud, including sparseness, discontinuity, noise, and incompleteness, we introduce a deformable high-resolution and high-quality template that automatically fits the point cloud. Our technique comprises three main steps: 1) mesh initialization of the scanned human body; 2) template deformation optimization; and 3) high-resolution template deformation. Each depth image from a single camera is converted to a partial point cloud, and two neighboring partial point clouds are then registered by virtue of segment constrained correspondence. All of the partial point clouds are aligned such that pairwise registration errors are distributed evenly among all of the views. An initial mesh is reconstructed from the entire point cloud of the human body and then simplified and smoothed. Next, an intermediate template is used to fit the mesh of the human body by deformation. The deformation step is performed iteratively with nonrigid feature correspondence, energy minimization for deformation optimization, and correspondence fine-tuning. Additionally, we apply Green coordinates to linearly move the vertices of the high-resolution template to positions consistent with the deformed intermediate template. Finally, the high-resolution template is deformed to fit the scanned human body, alleviating the problems caused by different poses and varied body details. The flowchart of our algorithm is shown in Fig. 1.

A. Contributions

We have implemented a prototype system for the entire reconstruction pipeline for a whole human body scanned by multiple low-cost depth cameras. We demonstrate its effectiveness, efficiency, and robustness with a group of users with

varied poses. Our approach is made possible by the following three technical contributions.
1) We propose a high-resolution deformable template to fit the low-quality human body scan with different poses. The reconstruction is accomplished in a global-to-global manner, without the use of body markers, and does not require preassigned control points for a given template. The entire process is fully automatic, easy to operate, and has potential for home applications.
2) To obtain a robust and accurate whole mesh from sparse and noisy point clouds, we introduce a novel registration algorithm. It first determines the segment constrained correspondence for two neighboring partial point clouds and then integrates them through rigid transformation. Global registration is used to distribute registration errors evenly among all views.
3) We formulate a template deformation optimization problem to fit the low-quality scanned mesh with a high-resolution template model. The solution to this problem lies in establishing a nonrigid point-to-point correspondence between two mesh surfaces with the same overall shape but varying details. The deformation from the template to the scanned mesh is performed iteratively and mainly includes feature correspondence, energy minimization, and correspondence fine-tuning. This allows the approach to address several challenging issues, including different poses, varied body details, and low-quality scans.

IV. MESH INITIALIZATION OF SCANNED HUMAN BODY

We first generate an initial, coarse, low-resolution mesh of the human body scanned using multiple depth cameras, which is then fit by a high-resolution deformable template.
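The three-stage pipeline described above can be summarized as the following skeleton. This is purely illustrative scaffolding: all function names and the placeholder logic are ours, standing in for the segment constrained registration (Section IV) and the template deformation optimization (Section V).

```python
# Illustrative skeleton of the three-stage pipeline; the placeholder
# bodies stand in for the actual algorithms described in the paper.

def register_views(partial_clouds):
    """Stage 1: pairwise + global registration (placeholder: concatenation)."""
    return [p for cloud in partial_clouds for p in cloud]

def initialize_mesh(point_cloud):
    """Stage 1 (cont.): triangulate, simplify, and smooth (placeholder)."""
    return {"vertices": point_cloud, "faces": []}

def deform_intermediate_template(template, scan_mesh, iterations=3):
    """Stage 2: iterative correspondence + energy minimization (placeholder)."""
    for _ in range(iterations):
        pass  # correspondence -> energy minimization -> fine-tuning
    return template

def lift_to_high_resolution(intermediate, high_res_template):
    """Stage 3: transfer the deformation via Green coordinates (placeholder)."""
    return high_res_template

views = [[(0.0, 0.0, 1.0)], [(0.1, 0.0, 1.0)]]      # toy partial scans
scan = initialize_mesh(register_views(views))
result = lift_to_high_resolution(
    deform_intermediate_template({"vertices": []}, scan), {"vertices": []})
print(len(scan["vertices"]))  # -> 2
```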



A. Segment Constrained Correspondence

Graph-based matching and learning have been extensively researched in recent years. For example, graph-based approaches have been used for image indexing [42], video annotation [43], and saliency detection [44], [45]. In this paper, we propose a feature correspondence for two partial 3-D views based on the influential work by Leordeanu and Hebert [46], which employs the spectral properties of the weighted adjacency matrix W of a graph. Each potential point pair is regarded as a node in the graph. Both the intersimilarity and the inner geometric constraint of each point pair are considered for correspondence. In our segment constrained correspondence, which differs from the method described above, only a pair of points belonging to corresponding segments in the two views qualifies as a candidate pair. This greatly reduces the number of candidate pairs. Moreover, pairs with similar descriptions, but which are evidently incorrect, are filtered out. Therefore, the partial 3-D data are presegmented before correspondence, and the segmented parts serve as prior information during the correspondence stage. Since we always scan the subject in an upright position, spatially segmented parts obtained through height clustering are added as strong constraints for spectral matching. Specifically, each point correspondence pair u should belong to the same part P. Hence, we incorporate a part indicator function p(u) into the adjacency matrix W = {w} as follows:

$$
p(u) = \begin{cases} 1, & u = (s_i, s_i') \in P \\ \infty, & u = (s_i, s_i') \notin P \end{cases} \qquad (1)
$$

where s_i and s_i' denote two points in the scans of different views. The entries of W are defined as

$$
w(u, v) = \begin{cases} \exp\left(-\dfrac{d^2(u)}{2\sigma^2}\right) + \alpha\, p(u), & u = (s_i, s_i') = v \\[6pt] \exp\left(-\dfrac{(e - e')^2}{2\sigma^2}\right), & u = (s_i, s_i') \neq v = (s_j, s_j') \end{cases} \qquad (2)
$$

where d(u) is the feature distance between the point pair across the two views, while e and e' denote the spatial Euclidean distances between the two points (s_i, s_j) and (s_i', s_j'), respectively, within the same 3-D view. α is a tradeoff parameter and σ denotes the standard deviation; both values are determined experimentally. The best correspondence is selected from a large number of candidate pairs. The latest advances in feature correspondence have mainly focused on the following topics: feature extraction based on a multitopic model [47] and a computational model [48], feature comparison using Hausdorff distance learning [49] and the inner distance [50], feature learning based on human perception [51], and multifeature fusion [52]. We use feature similarity to determine these candidates. Describing the feature of a local 3-D structure and differentiating it from other points depends heavily on a descriptor with discriminative power [53]. While state-of-the-art approaches in computer vision build a visual vocabulary [54], [55] based solely on the visual statistics of local image patches, this cannot be

Fig. 2. Feature descriptors for a particular point (in red) in the 3-D partial view: SHOT feature (left) and SHOT feature with color information (right).
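Candidate pairs for the correspondence stage are generated by comparing such per-point descriptors. A minimal sketch of descriptor-distance candidate generation, in which plain histogram vectors stand in for SHOT descriptors (all names and the toy data are illustrative, not the authors' implementation):

```python
import math

def feature_distance(desc_a, desc_b):
    """Euclidean distance between two descriptor histograms (SHOT stand-in)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(desc_a, desc_b)))

def candidate_pairs(descs_view1, descs_view2, k=2):
    """For each point in view 1, keep the k most similar points in view 2."""
    pairs = []
    for i, da in enumerate(descs_view1):
        ranked = sorted(range(len(descs_view2)),
                        key=lambda j: feature_distance(da, descs_view2[j]))
        pairs.extend((i, j) for j in ranked[:k])
    return pairs

# Toy 4-bin descriptors for two partial views.
view1 = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
view2 = [[0.9, 0.1, 0.0, 0.0], [0.0, 0.9, 0.1, 0.0], [0.0, 0.0, 1.0, 0.0]]
print(candidate_pairs(view1, view2, k=1))  # -> [(0, 0), (1, 1)]
```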

Fig. 3. (a) Spectral matching correspondence. (b) Segment constrained correspondence. The latter clearly reduces correspondence errors.
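The segment constrained spectral matching of (1)–(3) can be sketched as follows: build W over candidate pairs, drop cross-segment candidates (a hard filter standing in for the infinite penalty of (1)), extract the leading eigenvector by power iteration, and greedily binarize it. This is a simplified toy sketch; the paper solves (3) with the integer projected fixed point method [57], and all names here are ours.

```python
import math

def spectral_match(cands, feat_dist, same_segment, coords1, coords2,
                   sigma=0.5, alpha=0.1):
    """Toy segment constrained spectral matching over candidate pairs.

    cands        : list of (i, j) candidate assignments
    feat_dist    : dict (i, j) -> feature distance d(u)
    same_segment : dict (i, j) -> True if i, j lie in corresponding segments
    coords1/2    : 3-D coordinates of the points in the two views
    """
    # Hard segment filter (stands in for the infinite penalty in Eq. (1)).
    cands = [u for u in cands if same_segment[u]]
    n = len(cands)
    W = [[0.0] * n for _ in range(n)]
    for a, (i, j) in enumerate(cands):
        # Diagonal: feature similarity plus segment bonus, as in Eq. (2).
        W[a][a] = math.exp(-feat_dist[(i, j)] ** 2 / (2 * sigma ** 2)) + alpha
        for b, (k, l) in enumerate(cands):
            if a != b and i != k and j != l:
                e = math.dist(coords1[i], coords1[k])   # distance in view 1
                ep = math.dist(coords2[j], coords2[l])  # distance in view 2
                W[a][b] = math.exp(-(e - ep) ** 2 / (2 * sigma ** 2))
    # Leading eigenvector by power iteration.
    x = [1.0] * n
    for _ in range(50):
        y = [sum(W[a][b] * x[b] for b in range(n)) for a in range(n)]
        norm = math.sqrt(sum(v * v for v in y)) or 1.0
        x = [v / norm for v in y]
    # Greedy binarization: highest-scoring, conflict-free assignments.
    chosen, used_i, used_j = [], set(), set()
    for a in sorted(range(n), key=lambda a: -x[a]):
        i, j = cands[a]
        if i not in used_i and j not in used_j:
            chosen.append((i, j)); used_i.add(i); used_j.add(j)
    return chosen

coords1 = [(0, 0, 0), (1, 0, 0)]
coords2 = [(0, 0, 0.1), (1, 0, 0.1)]
cands = [(0, 0), (0, 1), (1, 0), (1, 1)]
fd = {c: 0.1 if c[0] == c[1] else 0.8 for c in cands}
seg = {c: True for c in cands}
print(spectral_match(cands, fd, seg, coords1, coords2))  # -> [(0, 0), (1, 1)]
```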

applied to solve our problem owing to the lack of training data sets for human reconstruction. Moreover, a 3-D feature descriptor that is robust to incomplete data, clutter, and sensor noise is essential. We theoretically investigate the performance of several representative local descriptors, such as 3-D Harris, 3-D speeded-up robust features (SURF), and the signature of histograms of orientations (SHOT), based on their adopted neighborhood, embedding and mapping space, and analysis tools. Emphasis is placed on their ability to deal with rotation, scale, and topological or geometrical noise. They are all rotation invariant and can be uniformly scaled during feature extraction. However, 3-D Harris and 3-D SURF depend heavily on local connectivity, and holes or gaps may cause large fluctuations in their values. The SHOT descriptor is based on a repeatable local reference frame and an eigenvalue decomposition around a given input point, which are robust against spatial discontinuity. This analysis guided the selection of the SHOT descriptor [56] as our feature descriptor. Fig. 2 shows two types of feature descriptors for a particular point (shown in red) in a 3-D partial view. Given the initial correspondence solution U, the optimal solution U* is obtained by solving the following quadratic optimization problem:

$$
U^* = \arg\max_{U} \; U^{T} W U, \quad \text{s.t. } U \in \{0, 1\}^{n}. \qquad (3)
$$

The problem can be effectively solved using the integer projected fixed point method [57]. Fig. 3 shows a comparison between our method and spectral matching [46]. Since the initial correspondence yields only a coarse rigid transformation, it is refined by considering the spatial Euclidean distance between each pair in the subsequent pairwise registration.

B. Point Cloud Registration

Once the initial correspondence is established, we generate an accurate alignment of the human body and significantly reduce the influence of noisy data and outliers. To obtain



Fig. 4. Two partial 3-D views (a) without any correspondence and (b) aligned by initial feature correspondence. (c) Final pairwise registration.

Fig. 5. (a) Template simplification and (b) smoothing from a high-resolution template to an intermediate template.
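Laplacian smoothing of the kind shown in Fig. 5(b), and used again in Section IV-C to smooth the scanned mesh, can be sketched with the umbrella (graph-Laplacian) operator. This minimal sketch folds β and T into a single step size, which is our simplification; the vertex data and names are illustrative.

```python
def smooth(vertices, neighbors, beta=0.5, iterations=10):
    """Forward-difference graph-Laplacian smoothing:
    s_new = s + beta * (mean of neighbor positions - s)."""
    verts = [list(v) for v in vertices]
    for _ in range(iterations):
        new = []
        for i, v in enumerate(verts):
            nbrs = neighbors[i]
            mean = [sum(verts[j][d] for j in nbrs) / len(nbrs)
                    for d in range(3)]
            new.append([v[d] + beta * (mean[d] - v[d]) for d in range(3)])
        verts = new
    return verts

# Toy "mesh": a noisy vertex (z = 0.5) surrounded by four coplanar
# neighbors arranged on a cycle.
vertices = [(0.0, 0.0, 0.5), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
            (-1.0, 0.0, 0.0), (0.0, -1.0, 0.0)]
neighbors = {0: [1, 2, 3, 4], 1: [0, 2, 4], 2: [0, 1, 3],
             3: [0, 2, 4], 4: [0, 1, 3]}
smoothed = smooth(vertices, neighbors)
print(smoothed[0][2])  # the noisy z-coordinate shrinks toward its neighbors
```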

robust registration of two partial 3-D views, we follow the idea of coherent point drift [29]. Fig. 4 shows a pairwise registration result. We introduce global registration to obtain a consensus registration for all partial views by diffusing pairwise registration errors. This avoids the situation in which gradual error accumulation during sequential pairwise alignment prevents the registration loop from being closed. The algorithm proceeds as follows. The current view is aligned with all of its neighboring views to obtain a registration error. If the error exceeds a threshold, another view is selected from the neighbors to reduce the error. The newly chosen view is successively aligned with its neighboring views to compute the current registration error. The view with the most matching pairs is iteratively selected, and the process ensures that larger pairwise registration errors are reduced while smaller errors grow slightly. Finally, the pairwise registration errors are distributed evenly among all views.
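Given a filtered correspondence set, the rigid transformation between two views can be estimated in closed form from the matched pairs. Below is a minimal 2-D illustration of this least-squares step (the 3-D case is typically solved with an SVD-based Kabsch solution, and the paper's nonrigid registration itself follows coherent point drift [29], which this sketch does not implement):

```python
import math

def fit_rigid_2d(src, dst):
    """Least-squares rotation + translation mapping src points onto dst."""
    n = len(src)
    cs = [sum(p[d] for p in src) / n for d in range(2)]  # centroid of src
    cd = [sum(p[d] for p in dst) / n for d in range(2)]  # centroid of dst
    # Cross-covariance terms of the centered point sets.
    sxx = sum((s[0]-cs[0])*(d[0]-cd[0]) + (s[1]-cs[1])*(d[1]-cd[1])
              for s, d in zip(src, dst))
    sxy = sum((s[0]-cs[0])*(d[1]-cd[1]) - (s[1]-cs[1])*(d[0]-cd[0])
              for s, d in zip(src, dst))
    theta = math.atan2(sxy, sxx)          # optimal rotation angle
    c, s_ = math.cos(theta), math.sin(theta)
    tx = cd[0] - (c*cs[0] - s_*cs[1])     # translation after rotation
    ty = cd[1] - (s_*cs[0] + c*cs[1])
    return theta, (tx, ty)

# Toy check: rotate a triangle by 30 degrees, shift it, then recover the pose.
ang = math.radians(30)
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
dst = [(math.cos(ang)*x - math.sin(ang)*y + 0.3,
        math.sin(ang)*x + math.cos(ang)*y - 0.7) for x, y in src]
theta, t = fit_rigid_2d(src, dst)
print(round(math.degrees(theta), 1))  # -> 30.0
```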

C. Mesh Generation

1) Mesh Reconstruction: The scanned point cloud is organized into a mesh structure. We triangulate the point cloud using a two-stage solution. In the first stage, we use Delaunay triangulation [58] to triangulate the unorganized point cloud and connect these granular points with their spatial neighbors. In the second stage, we overcome the problem of holes and burrs by casting it as a Poisson surface reconstruction problem [59].
2) Mesh Simplification: The generated mesh contains too many vertices to efficiently build an accurate per-vertex correspondence with the template model. We first simplify the initial mesh of the human body scan and remove superfluous local details. To adequately preserve the global shape, we reduce the shape size using an edge collapse strategy based on a quadric contraction cost, thereby producing a low-resolution polygonal mesh.
3) Surface Smoothing: The scanned human body mesh differs significantly from the modeled template with respect to local geometrical details. It is thus nontrivial to establish the deformable correspondence between these two shapes, which requires the elimination of local surface details prior to building the correspondence. Inspired by a skeleton extraction method [60], which uses Laplacian smoothing to extract similar representations and eliminate surface-shape differences, we apply the Laplacian operator L to obtain a smoothed mesh of the scanned human body. This requires solving the forward difference equation

$$
s^{l+1} = (I + \beta T L)\, s^{l} \qquad (4)
$$

where s is a vertex of the scanned mesh, I is the identity matrix, and β is a scalar controlling the diffusion speed of the temperature variation T; l denotes the iteration number. In contrast to many smoothing methods that adopt the conformal weight [61] for the Laplacian, our algorithm attempts to obtain a similar surface with fewer geometrical details and hence applies the graph Laplacian to generate a smoothed model.

V. TEMPLATE DEFORMATION OPTIMIZATION

We formulate a template deformation optimization problem to fit the low-quality scanned mesh with a high-resolution template model. The solution to this problem is to establish a point-to-point correspondence between two mesh surfaces with the same overall shape but with variations in details. The deformation from the template to the scanned mesh is performed iteratively and mainly involves feature correspondence, energy minimization, and correspondence fine-tuning. This addresses several challenging issues including different poses, varied body details, and low-quality scans.

A. Intermediate Template Modeling

We assume that a high-resolution human body template model is provided. This template can then be deformed to approximate the scanned 3-D human body mesh. A high-resolution template typically has rich geometrical details, whereas the scanned mesh lacks local characteristics because it is formed from low-quality data acquired by Kinect cameras. These local dissimilarities make it difficult to directly build the correspondence between the high-resolution template and the low-quality scan without body markers and preassigned control nodes on the template. Moreover, there are too many vertices on the man-made template surface to determine their counterparts in the sparse scan and deform them efficiently. To alleviate these issues, we automatically generate an intermediate template to facilitate an effective correspondence search. Fig.
5 shows the resulting intermediate template compared with the original high-resolution template.

B. Correspondence Construction

The correspondence between the vertices of the intermediate template and those of the scanned mesh is determined iteratively. This is followed by energy minimization and


correspondence fine-tuning. The objective of correspondence construction at each iteration is to provide a relatively accurate initial solution to the energy minimization.

1) Feature Computation - Scale-Invariant Heat Kernel Signature: To build robust nonrigid correspondence, we use the heat kernel signature as the point feature. This feature is derived from a heat diffusion equation using the Laplace–Beltrami operator on surfaces [62]. Given an initial heat distribution in the form of the Dirac delta function δ_s(s'), the heat kernel h_m(s, s') denotes the amount of heat transferred from the source s to the target s' in time m, given a unit heat source at s. It can be computed from the eigen-decomposition of the Laplace–Beltrami operator:

h_m(s, s') = Σ_{i=0}^{∞} e^{−λ_i m} φ_i(s) φ_i(s')    (5)

where λ_i and φ_i are the ith eigenvalue and eigenfunction of the Laplace–Beltrami operator, respectively. Sun et al. [63] represent 3-D shapes using the heat kernel signature h_m(s, s') under a set of diffusion times {m}, setting different diffusion times to obtain local neighborhood information at an easily controlled scale. Furthermore, to make the signature scale invariant, the Fourier transform of heat kernel signatures was proposed in [64]. We adopt this improved feature to construct a correspondence that is robust to rigid transformation, deformation, and scale variations.

2) Feature Correspondence: We denote T = {t_j | j = 1, ..., N_t} as the vertices of the template and S = {s_i | i = 1, ..., N_s} as the vertices of the human body scan. An initial correspondence (t_i, s_i) is built between points that are close in the feature space. To improve the current correspondence locally, we further search for a possibly better point s'_i corresponding with t_i in a local region around the scanned mesh vertex s_i within a fixed geodesic distance. The region is limited to the k-nearest neighbor graph on the geodesic surface; k is set to 16 empirically. The retrieved point is required to be closest to the template vertex in the feature space. Similarly, given the selected point s_j, we search for its optimal counterpart t'_j in the local region of the template vertex t_j.

3) Correspondence Filtering: Given one pair of vertices u = (t_i, t_j) on the surface of the template, we define a geometric consistency function f to verify its consistency with another pair of vertices v = (s_i, s_j) on the scanned mesh. The geometric consistency function is composed of two terms, a pairwise orientation consistency term and a pairwise distance consistency term, weighted by two parameters:

f(u, v) = λ_o f_o(u, v) + λ_d f_d(u, v).    (6)

a) Pairwise orientation consistency term: The normal of each vertex is computed, and the angle between the normal vectors describes the relative direction. The cosine difference of the angles between one pair u on the template and another pair v on the scanned mesh measures the orientation consistency degree f_o(u, v).

b) Pairwise distance consistency term: This term describes whether the distance between the vertices of pair u is consistent with the distance between the vertices of pair v, where distances are geodesic distances on the surface. The difference between the pairwise distances d_u and d_v constitutes the pairwise distance consistency term f_d(u, v), defined as

f_d(u, v) = min( |d_u − d_v| / d_u, 1 ).    (7)

Spectral matching then proceeds by building a consistency matrix whose entries are values of the geometric consistency function. The set of initial correspondences is filtered by optimizing spectral matching: a large set of inconsistent correspondences is removed, and the selected correspondences C = {(t_c, s_c) | c = 1, ..., |C|} with high confidence are robust against isometric deformation.
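As an illustration of the filtering step, the sketch below (hypothetical Python; the function names are ours, and for brevity only the distance consistency term (7) is used, omitting the orientation term) builds a consistency matrix over candidate matches and keeps the candidates favored by its principal eigenvector, in the spirit of spectral matching:

```python
import numpy as np

def distance_consistency(d_u, d_v):
    # penalty in [0, 1] for inconsistent pairwise distances, as in (7)
    return min(abs(d_u - d_v) / d_u, 1.0)

def spectral_filter(cands, dist_t, dist_s, thresh=0.5):
    """Keep a mutually consistent subset of candidate matches.
    cands: list of (template index, scan index) pairs.
    dist_t / dist_s: pairwise distance matrices on template and scan."""
    n = len(cands)
    M = np.zeros((n, n))
    for a in range(n):
        for b in range(n):
            if a == b:
                continue
            ta, sa = cands[a]
            tb, sb = cands[b]
            d_u, d_v = dist_t[ta, tb], dist_s[sa, sb]
            if d_u > 0:
                # affinity rewards consistent pairs (low penalty)
                M[a, b] = 1.0 - distance_consistency(d_u, d_v)
    # the principal eigenvector scores each candidate's global consistency
    w, V = np.linalg.eigh(M)
    x = np.abs(V[:, -1])
    keep = []
    order = np.argsort(-x)
    for a in order:
        if x[a] < thresh * x[order[0]]:
            break  # remaining candidates are too inconsistent
        # greedy one-to-one assignment: no shared template or scan vertex
        if all(cands[a][0] != cands[b][0] and cands[a][1] != cands[b][1]
               for b in keep):
            keep.append(a)
    return [cands[a] for a in keep]
```

With the orientation term added and geodesic distances substituted for the Euclidean distances used here, this mirrors the filtering described above.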

C. Energy Minimization

In order to obtain a global optimization of the correspondences under isometric deformation, we introduce a joint energy minimization that considers both the maximum likelihood of the mapping after rotation, translation, and scaling, and the distance difference between correspondences of the deformed template vertices and the scanned mesh vertices. After optimization, the positions of the deformed template vertices are updated for better nonrigid alignment with the scanned mesh. The energy function involves two terms: 1) one measuring the nonrigid registration error and 2) another containing a correspondence term weighted by λ_cor, which ensures smooth movement around correspondences with high confidence. The energy function is defined as

E = E_non + λ_cor E_cor.    (8)

The first term E_non models the registration of the template to the scanned mesh as a maximum likelihood estimation problem. We choose the scanned 3-D point cloud as the reference 3-D point set. A multivariate Gaussian is centered on each point of the template, and all of the Gaussians share the same isotropic covariance σ². The local motion of the template T is toward the scanned point cloud S. The point set of the scanned point cloud can then be considered as samples from a Gaussian mixture model with probability density

p(s_i) = (1/N_t) Σ_{j=1}^{N_t} p(s_i | t_j).    (9)
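As a toy illustration of this mixture model, the following sketch (hypothetical code; the transformation is simplified to one global translation, whereas the method above estimates local per-vertex rotations, translations, and a scale) aligns a template to a scan by alternating posterior estimation and parameter updates:

```python
import numpy as np

def em_translation_fit(template, scan, sigma2=1.0, iters=20):
    """Toy E-M sketch of the registration term: template points are GMM
    centroids with shared isotropic covariance sigma2, scan points are
    samples; only a single global translation t is estimated here."""
    t = np.zeros(scan.shape[1])
    for _ in range(iters):
        # E-step: posterior P(j | i) of each template centroid for each scan point
        diff = scan[:, None, :] - (template[None, :, :] + t)   # (Ns, Nt, d)
        logp = -np.sum(diff ** 2, axis=2) / (2.0 * sigma2)
        p = np.exp(logp - logp.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        # M-step: translation minimizing the expected negative log-likelihood
        t = t + np.einsum('ij,ijk->k', p, diff) / p.sum()
    return t
```

The same alternation carries over to the full model; only the M-step becomes a richer least-squares problem over the local transformation parameters.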

The nonrigid alignment of template T to scan S can be performed based on this term. A set of transformation parameters representing the local deformation is estimated by minimizing the negative log-likelihood function

E_non = − Σ_{i=1}^{N_s} log Σ_{j=1}^{N_t} exp( −||s R_j t_j + T_j − s_i||² / (2σ²) )    (10)

where the scale of template s is assumed to be isotropic. The local rotation and translation are denoted by Rj and Tj , respectively. We want the correspondences built in the previous step to contribute significantly during the energy minimization since they exhibit relatively high confidence not only in



Fig. 6. Intermediate template deformation. (a) Template and scanned human mesh. (b) Template deformation after energy minimization. (c) Deformation after correspondence fine-tuning.

the Euclidean space but also in the nonrigid feature space. Therefore, we rely on these local correspondences to control the global motion of the template vertices by adding the term

E_cor = Σ_{c=1}^{|C|} [ λ_p ||t_c − s_c||² + λ_f ((t_c − s_c)^T n_c)² ]    (11)

where n_c is the normal of the point s_c on the scanned surface. The two weights λ_p and λ_f balance the contributions of the point-to-point distance and the point-to-plane distance, respectively. We use expectation maximization to find a maximum likelihood solution to the whole energy minimization. During the E-step, the current set of transformation parameters is used to compute an estimate of the posterior of the Gaussian mixture components using Bayes' rule. During the M-step, new transformation parameter values are obtained by minimizing the whole energy function. The two steps are performed alternately until the solution converges. Fig. 6(b) shows the deformed result after energy minimization.

D. Correspondence Fine-Tuning

The optimized correspondence can be further fine-tuned locally. We attempt to adjust the optimized correspondences (t_j, s_j) in the nearest neighbor set N(s_j) based on two aspects: the template vertex t_j should be close to its counterpart in the geometrical space, and the two should be consistent in orientation. We search for a refined vertex s'_j in the local region of s_j on the scanned human mesh such that s'_j corresponds more accurately to the given template vertex t_j. The search is performed by minimizing the following function:

s'_j = argmin_{s_k ∈ N(s_j)} ||t_j − s_k||² exp( (1 + (1 − ⟨n(t_j), n(s_k)⟩)/r)^a )    (12)

where ⟨n(t_j), n(s_k)⟩ denotes the dot product of the two normal vectors, representing the degree of orientation consistency, and r defines the inconsistency tolerance range. a is the order of attenuation; the higher the order, the stronger the consistency condition. This adjustment alleviates the potential geometrical distortion resulting from correspondence errors. Fig. 6(c) shows the fine-tuning result.

VI. HIGH-RESOLUTION TEMPLATE DEFORMATION

Once the intermediate template has been deformed to align with the scanned human mesh, the remaining step is to deform the high-resolution template to achieve the desired effect. Given the intermediate template and its deformation, we consider the high-resolution template to be a deformable body and apply Green coordinates [65] to linearly move the vertices of the high-resolution template to positions consistent with the deformed intermediate template. The intermediate template is enlarged to envelop the high-resolution template. We introduce this envelope to transfer the known deformation of the intermediate template to the high-resolution template. When the high-resolution template is enveloped before deformation, all the corresponding relations between the vertices of the intermediate template and those of the high-resolution template are fixed. After deformation of the intermediate template, the local rigid transformations of vertices on the intermediate template can be similarly imposed on the vertices of the high-resolution template by virtue of envelope-based coordinate operators. As the intermediate template is manipulated to achieve its deformed shape, the enveloped high-resolution template follows the known movements and is thus repositioned with a similar sequence of movements. The movement is motivated by Green's third integral identity, and Green coordinates represent the shape vertices in a linear expression considering the vertices and face orientations of the template. Green coordinates are detail-preserving envelope-based coordinate operators and introduce appropriate rotations into the space deformation to allow shape preservation similar to conformal mapping. A vertex t_h of the high-resolution template within the envelope is expressed as an affine sum of the vertices of the intermediate template. The Green coordinates are derived by representing t_h as the linear combination

t_h = L(t_h; Q) = Σ_{i∈T} φ_i(t_h) t_i + Σ_{j∈F} ψ_j(t_h) n(f_j)    (13)

where φ_i(·) and ψ_j(·) are referred to as Green coordinates, defined with respect to the exterior envelope, that is, the intermediate template Q = {T, F}; t_i ∈ T is a vertex element and f_j ∈ F is a face element with normal n(f_j). After deformation, the deformed envelope is denoted by Q' = {T', F'}. The deformed vertices t'_h of the high-resolution template are induced by the linear combination of the known Green coordinates φ_i(·) and ψ_j(·) with the deformed envelope Q':

t_h → t'_h = L(t_h; Q') = Σ_{i∈T'} φ_i(t_h) t'_i + Σ_{j∈F'} ψ_j(t_h) n(f'_j).    (14)

The above equation signifies that the high-resolution template follows the movement of the envelope, and the new vertices are derived from the spatial positions of Q' with the help of the Green coordinates. Finally, the deformed high-resolution template is obtained, as shown in Fig. 7. Fig. 8 demonstrates the deformation effect from different viewpoints.
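Assuming the coordinates φ_i and ψ_j have already been computed on the undeformed envelope, the deformation transfer in (14) reduces to two matrix products. The sketch below (hypothetical names; the per-face normal scaling factors used in the full Green coordinates formulation [65] are omitted for brevity) illustrates this application step:

```python
import numpy as np

def face_normals(verts, faces):
    """Unit normals of the triangular cage faces."""
    v0, v1, v2 = verts[faces[:, 0]], verts[faces[:, 1]], verts[faces[:, 2]]
    n = np.cross(v1 - v0, v2 - v0)
    return n / np.linalg.norm(n, axis=1, keepdims=True)

def deform_with_coordinates(phi, psi, cage_verts_def, cage_faces):
    """Apply (14): each high-resolution vertex is a linear combination of
    the deformed cage vertices and deformed face normals, using the
    coordinates phi (H x |T|) and psi (H x |F|) precomputed on the
    undeformed cage."""
    n_def = face_normals(cage_verts_def, cage_faces)
    return phi @ cage_verts_def + psi @ n_def
```

Because the coordinates are fixed once the envelope is built, re-posing the high-resolution template after any new intermediate-template deformation costs only these two products.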


Fig. 7. High-resolution template deformation. (a) High-resolution template and its envelope. (b) Deformed intermediate template and high-resolution template.


Fig. 9. Template deformation-based reconstruction results. Each high-resolution template is deformed to align with a scan from multiple RGB-D cameras.

TABLE I
PARAMETERS USED IN OUR ALGORITHM

Fig. 8. Deformation effect of high-resolution template demonstrated from different viewpoints.

VII. EXPERIMENTAL RESULTS

A. Experimental Environment

In order to verify the entire scanning and template deformation procedure, we build an experimental hardware system composed of six low-cost depth cameras (Kinect), their brackets, a dark blanket, six USB cables connecting the depth cameras to a computer, and a desktop computer with an i7 CPU and 16 GB of memory. The upper left image of Fig. 1 illustrates the experimental environment, where the six depth cameras are placed in a circle to capture different views with as many overlapping points as possible. A uniform interval between the cameras is not required, but they should be oriented toward the human. These depth cameras simultaneously capture RGB-D images of the human. Each depth camera has been intrinsically calibrated with a checkerboard using standard calibration techniques [66], [67]. Similarly, we performed extrinsic calibration between the color and depth cameras of each Kinect. The dark blanket is used to prevent reflection from the floor.

B. Template Deformation-Based Reconstruction Results

Our algorithm is tested on a set of complete artist-generated high-resolution template meshes and scanned individuals. Given each scanned human body, we perform deformable-template-based reconstruction, which deforms a high-resolution template to align with the scanned human. This improves the human body reconstruction results obtained from low-cost RGB-D scans. The high-resolution template can be easily rendered for many applications, including gaming, virtual reality, augmented reality, virtual fitting, and anthropometry. We show a group of reconstruction experiments in Fig. 9. Each high-resolution template is modeled by a professional 3-D designer using 3-D modeling software. Each person is scanned in our hardware environment composed of multiple RGB-D cameras. Fig. 9

demonstrates that the template deformation-based reconstruction results are relatively accurate. Moreover, we test templates with arbitrary postures to verify the adaptability of our algorithm, in which case significant deformations appear between the template and the scanned human. The results in Fig. 9 show that the template poses and local scaling are correctly adjusted according to the scanned individuals. The surface details of the template are perfectly preserved, demonstrating the effectiveness of our algorithm. The entire process is automatic and requires no preassigned markers on the template or the scan.

C. Parameter Setting

All the parameters in our tests, their actual values, and their locations are listed in Table I. We focus on three groups of important parameters and discuss the considerations behind their choice. The first group includes two correspondence parameters used in mesh initialization (Section IV): the correspondence tradeoff parameter and the diffusion speed, which affect the correspondence errors and the mesh smoothing quality. If the tradeoff parameter is set to a small value (e.g., less than 0.1), the segment constraint becomes weak and wrong correspondences between similar features on different body parts increase. If it is too large (e.g., more than 5), the rigid transformation cannot reach an optimal solution and the correspondence errors cannot be reduced further, even though correct correspondence pairs are selected. The diffusion speed controls the quality of surface details; a large value leads to an over-smoothed mesh. The second group is composed of three parameters used to build the correct deformation from template to scan (Section V-B). The number of vertex correspondences is set relatively low compared with the number of vertices: only correspondences with high confidence are chosen to provide the best initial values for the deformation. If their number becomes large, e.g., 500, the deformation optimization may not converge, with the result that the template and scan do not overlap correctly.
The sum of two consistency term


Fig. 10. (a) Real human reconstruction. (b) Template-based reconstruction. Mesh distortion and missing detail are inevitable when reconstructing the real human directly.

weights is naturally set to 1. In the geometric consistency function, orientation consistency is regarded as more important than distance consistency because the orientation variation associated with local deformation is generally significant on the surface of the human body. The two weights are empirically set to 0.8 and 0.2, respectively. The third group includes three weights used in our optimization framework (Section V-C). The energy weight λ_cor favors smooth movement around correspondences with high confidence, and hence its value is set to a moderate magnitude, e.g., 2. Relying on it too heavily may cause unnatural deformation between two similar but distant points. The distance term weights λ_p and λ_f play an important role in balancing the contributions of the point-to-point distance and the point-to-plane distance; their sum is set to 1. The point-to-plane distance makes the distance computation more robust against noisy points. We set its weight to 30% of the gross weight, since a larger value would reduce the discriminative power of distances between neighbors.

D. Comparison Between Real Human Reconstruction and Template-Based Reconstruction

We also compare the template-based reconstruction result with real human reconstruction, which directly generates a human mesh from the point cloud, as described in Section IV-C. The resulting reconstruction after texture mapping is illustrated in Fig. 10(a). Compared with the template-based reconstruction result in Fig. 10(b), real human reconstruction is easily affected by sensor noise and data acquisition conditions, which precludes capturing accurate surface details. Moreover, real human reconstruction may cause part distortion, resulting in a low-quality 3-D human model. The proposed solution exploits the high-resolution template to preserve surface details and ensure the mesh quality of all parts.

E. Effect of Laplacian Smoothing

Fig. 11. Influence of omitting Laplacian smoothing. An unsmoothed template, the scanned mesh, and its deformed result are shown.

We investigate whether template deformation is affected if Laplacian smoothing is neglected. In the previous experiments, both the template and the scanned mesh are smoothed to facilitate the construction of the correspondence. We test the influence of omitting Laplacian smoothing using a group of deformation results, as illustrated in Fig. 11. The group includes an unsmoothed template and scanned mesh, and the template is deformed to align with the scanned mesh. We see that unsmoothed surface noise impacts the correspondence and the final deformation. The local surface of the template is significantly corrupted after deformation because many faces on the deformed template lack correct correspondences and their nonrigid deformation becomes confused.

F. Intermediate Template Setting

Fig. 12. Deformation in the cases of different intermediate template sizes. (a) Deformed intermediate template with 2500 faces. (b) Deformed intermediate template with 5000 faces.

To efficiently build the correspondence between the template and the scanned mesh, an intermediate template with reduced size is modeled. We investigate the influence of the size of the intermediate template and find that the deformation looks unnatural in terms of part transition for small template sizes. For example, when the template is simplified to only 2500 faces, it cannot nonrigidly align with the scanned mesh accurately, eliminating the benefit of fast correspondence optimization. When the size increases to 5000 faces, the deformation quality shows obvious improvement. Fig. 12 shows the comparison between these two cases. Nevertheless, it becomes difficult and inefficient to build the correspondence if the size of the intermediate template is too large.

G. Tolerance of Correspondence Fine-Tuning

Fig. 13. Tolerance of correspondence fine-tuning. (a) A large tolerance value results in distorted deformation. (b) A small tolerance leads to incorrect correspondence. (c) Normal setting.

We demonstrate the influence of the tolerance during correspondence fine-tuning. In this paper, we set r to 30 and a to 10, where the tolerance value r restricts the inconsistency range and the order a determines the attenuation speed. A larger tolerance permits some correspondence inconsistency and may lead to incorrect deformation. In contrast, if the tolerance is set to the low value of 3, better correspondences cannot be selected from the large set of candidate pairs, resulting in disordered correspondence. Therefore, the tolerance value must be determined experimentally. Fig. 13 shows a large, small, and appropriate setting of this parameter.
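The effect of the tolerance can be illustrated numerically. Assuming the exponential weighting form of (12), the following sketch (hypothetical helper name) compares the relative penalty assigned to a mildly inconsistent normal orientation under a small and a large tolerance:

```python
import math

def finetune_weight(cos_nn, r, a):
    """Penalty factor from the weighting in Eq. (12): grows with the
    orientation inconsistency (1 - cos_nn), attenuated by the tolerance r
    and sharpened by the order a."""
    return math.exp((1.0 + (1.0 - cos_nn) / r) ** a)

# relative penalty of a slightly inconsistent normal (cos = 0.9)
# versus a perfectly consistent one (cos = 1.0)
strict = finetune_weight(0.9, r=3, a=10) / finetune_weight(1.0, r=3, a=10)
loose = finetune_weight(0.9, r=30, a=10) / finetune_weight(1.0, r=30, a=10)
```

With r = 3, the same orientation deviation is penalized noticeably more strongly than with r = 30, which matches the observation that a low tolerance enforces a stricter consistency condition.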


Fig. 14. Template reusability. The same template can be used for two different users.

Fig. 15. Human reconstruction comparison between (b) KinectFusion and (c) the proposed method for the same scan (a).

H. Template Reusability

We test the reusability of a single template for multiple users. Modeling a high-resolution human template is time-consuming and requires expertise, so it is desirable that a finished template can be used repeatedly for different users; a high level of reusability increases its practical applicability. We analyze reusability with two different users who are scanned in the same environment. The same template is deformed to align with the two users, as illustrated in Fig. 14. Both deformed results are satisfactory, demonstrating the robustness of our algorithm when reusing the same template for different human bodies.

I. Comparison With KinectFusion

KinectFusion [10] is a state-of-the-art reconstruction technique that enables a user holding and moving a standard Kinect camera to rapidly create detailed 3-D reconstructions of objects in an indoor scene. Only the depth data from the Kinect are used to track the 3-D pose of the sensor and to reconstruct geometrically precise 3-D models of physical objects in real time. The technique can be applied to human reconstruction. We compare our algorithm to KinectFusion with respect to three aspects: 1) reconstruction quality; 2) reconstruction time; and 3) the human-computer interface. Fig. 15 shows the mesh reconstruction quality of KinectFusion and the proposed algorithm. The merged scans produced by KinectFusion remain incomplete and noisy because of unavoidable occlusions, limited scan angles, and involuntary body movements, whereas our reconstructed mesh is smooth, free of holes and surface noise, and fully inherits the high quality of the template. Moreover, although KinectFusion runs in real time, only a small surface patch can be scanned at a time at close distance, e.g., part of a leg, and all the patches are gradually fused into a whole human body during the scan. The current patch should have substantial overlap with previously scanned patches so that it can be registered easily. The whole reconstruction time of KinectFusion is therefore relatively long: a full-body scan commonly takes 5-10 min on average in our experiments, during which we have to move the Kinect camera slowly up and down, left and right, around the scanned human. The proposed method takes only several seconds to finish the whole process because the full body is scanned all at once. Finally, in terms of the human-computer interface, we find that it is not easy for beginners to capture a full human body with KinectFusion: the Kinect camera must be moved very carefully, and the scanned subject must remain static for a long time; otherwise, the scanned human cannot be continuously tracked and some body parts are lost. The proposed algorithm does not demand such expert experience and is fully automatic.

J. User Study

Fig. 16. User study based on a visual comparison between the real human and the reconstructed human.

To demonstrate the effectiveness of the algorithm, we conduct a user study in which we invite five participants unaffiliated with this work as human models and scan them to generate a high-quality digital model for each. These male volunteers, aged 18-25, come from different regions and have various body types, for example, fat and thin, tall and short. They are required to wear swimming trunks so that the body shape can be clearly observed after reconstruction. They stand in the center of the scan system, and their poses remain as similar as possible to ease comparison of reconstruction quality. A high-quality deformable template designed by an artist is fitted to each human body scan using our algorithm. Fig. 16 shows every original scan in the upper row and the corresponding generated human body in the bottom row. Each 3-D digital shape is smooth and accurately reflects the characteristics of the human body. Each volunteer is invited to carefully observe his reconstructed body parts, including the head, neck, torso, arms, and legs.
Each of them agrees that the reconstructed digital model is consistent with his real body proportions and details, and each is satisfied with the reconstruction quality.

K. Comparison With Real Measurements

We investigate the quality of the reconstructed human body quantitatively and design another user study in which the body


TABLE II
USER STUDY BASED ON QUANTITATIVE COMPARISON BETWEEN REAL MEASUREMENTS AND 3-D DIGITAL MODELS

TABLE III
QUANTITATIVE COMPARISON OF MEASUREMENT ERRORS BETWEEN THE PROPOSED METHOD AND KINECTFUSION

parameters of five volunteers unaffiliated with this research are measured using our algorithm. Before the test, we asked each volunteer to keep approximately the same pose, as shown in Fig. 16. 3-D digital human shapes are obtained for these volunteers. Manual anthropometric measurements are performed by an invited professional dressmaker on the real persons and on the generated 3-D human shapes, respectively. The body parameters include height, waist length, thigh length, shank length, forearm length, and upper arm length. Each parameter is measured with a soft ruler three times and the measurements are averaged. The corresponding body parameters of the digital human are estimated in the 3-D global coordinate system by computing the geodesic distance on the surface between two joints predefined when modeling the template. Table II shows the comparison between the real sizes and the modeled sizes. According to current clothing standards, a 5% error is considered allowable. We find some minor errors, most of which do not exceed the 5% limit, although two parameters differ significantly from the real values. All five volunteers consider the reconstructed full human body numerically reliable. In addition, we compare the reconstruction accuracy of the proposed method and KinectFusion. The actual sizes of the human body are contrasted with the measured sizes of the reconstructed full human body. The measurement errors are summarized in Table III. Most measurement errors of the proposed method are smaller than those of KinectFusion, except for two body parameters. The reconstruction quality of KinectFusion may affect measurement accuracy, because holes, missing parts, and surface noise potentially cause measurement errors.

L. Computational Time

Table IV shows the statistics and timing for the five volunteers from scan start to template deformation, including the three main steps: mesh initialization, deformation optimization, and high-resolution template generation. The program runs on a 4-core 4 GHz Intel i7 PC with 16 GB memory and an Nvidia GTX 750 GPU. We observe that deformation optimization takes the majority of the time, mainly because the optimization process is relatively slow. During mesh initialization, point cloud registration

TABLE IV
RUN TIME (IN SECONDS) OF OUR ALGORITHM

of partial views takes a long time to handle the transformation from source points to target points; this cannot be reduced, since large matrix operations are unavoidable. Only the high-resolution template stage is computationally efficient, which benefits from the linear combination of vertices. The average time of the whole process over the five scans is 9.28 s, which is acceptable given that the human model needs to be reconstructed only once for each user.

M. Human Motion Tracking

One advantage of template-based reconstruction is that human motion is easy to track with the Kinect camera, because each template is designed by an artist and its skeleton has been manually specified in advance. Our algorithm only changes the template geometry according to different body types, for example, tall and short, or fat and thin, while the topology of the template is kept invariant. Hence, the predefined skeleton is still available after the template geometry is deformed. We test template motion while tracking a human with the Kinect camera. The template is driven by mapping the skeleton extracted by OpenNI to the template skeleton, and an online video1 is provided to show this tracking experiment. We can observe that the template motion follows the human motion correctly and that the tracking process runs in real time.

VIII. CONCLUSION

In this paper, we present an automatic approach to obtain a high-resolution model of a human body scanned with multiple consumer-level RGB-D cameras. Our method and traditional skeleton-based deformation target different application cases. Our method is based on human data captured by depth cameras. It is difficult to directly obtain an accurate skeleton of the scanned human from an incomplete and noisy point cloud, so traditional skeleton-based deformation is not applicable; in this case, the template should be deformed by our algorithm.
1 http://www.liuzhenbao.com/sourcedata/template

Nevertheless, traditional skeleton-based deformation is frequently used in the animation industry while tracking


the skeleton of an actor's body with active or passive markers. The skeleton is easily inferred from these markers and makes the character deform naturally. Because the topological relation between the skeleton and the 3-D model is predefined by the artist, and their relative positions are also built manually in advance, deforming the skeleton commonly results in an ideal 3-D model deformation. Our method, in contrast, can handle low-quality data captured by depth cameras and deform a high-quality template to fit the data. We investigated the time cost of our algorithm and of traditional skeleton-based deformation. Our algorithm takes 9.26 s on average from human scan to template deformation, whereas traditional skeleton-based deformation is very fast and costs only 0.01 s. The traditional skeleton-based method is clearly superior to our algorithm in terms of efficiency.

Limitations: Due to the considerable diversity of human body styles, we have not found a robust way to scan a human body with complex structure and deform a richly detailed template according to the scan. For example, it is still challenging to scan and model girls with long hair, such as curly hairstyles, or with loose clothes, such as skirts, because these geometrical details are topologically discontinuous, so the designed template may have to change its topology for fitting; this would cause unexpected cluttered deformation, and the template may be destroyed. Moreover, the proposed method is only capable of generating a 3-D high-quality mesh suitable for clothing design and games, rather than yielding a fully realistic human. Some geometrical details, such as the face and soft folds of fabric, cannot be captured by low-cost depth cameras at a distance while ensuring that the whole human body is captured simultaneously by cameras placed at different locations on a large circle.


Zhenbao Liu (M’11) received the bachelor’s and master’s degrees from Northwestern Polytechnical University, Xi’an, China, in 2001 and 2004, respectively, and the Ph.D. degree from the College of Systems and Information Engineering, University of Tsukuba, Tsukuba, Japan, in 2009. He is currently an Associate Professor with Northwestern Polytechnical University. He was a Visiting Scholar with Simon Fraser University, Burnaby, BC, Canada, in 2012. He has published approximately 50 papers in major international journals and conferences. His current research interests include computer graphics, computer vision, and shape analysis.

Jinxin Huang was born in Hubei Province, China, in 1992. She received the bachelor’s degree in electrical engineering and automation from Northwestern Polytechnical University, Xi’an, China, in 2014, where she is currently pursuing the master’s degree in transportation tools and applications. Her current research interests include human–computer interaction, including 3-D human reconstruction.

Shuhui Bu received the master’s and Ph.D. degrees from the College of Systems and Information Engineering, University of Tsukuba, Tsukuba, Japan, in 2006 and 2009, respectively. He was an Assistant Professor with Kyoto University, Kyoto, Japan, from 2009 to 2011. He is currently an Associate Professor with Northwestern Polytechnical University, Xi’an, China. He has published approximately 40 papers in major international journals and conferences. His current research interests include computer vision and robotics.

Junwei Han (M’12–SM’15) received the Ph.D. degree in pattern recognition and intelligent systems from the School of Automation, Northwestern Polytechnical University, Xi’an, China, in 2003. He is currently a Professor with Northwestern Polytechnical University. His current research interests include multimedia processing and brain imaging analysis. Prof. Han is an Associate Editor of the IEEE TRANSACTIONS ON HUMAN–MACHINE SYSTEMS, Neurocomputing, and Multidimensional Systems and Signal Processing.

Xiaojun Tang received the B.S., M.S., and Ph.D. degrees from Northwestern Polytechnical University, Xi’an, China, in 2002, 2005, and 2010, respectively. Since 2005, he has been with Northwestern Polytechnical University. He is currently a Visiting Scholar with the Applied Control and Information Processing Laboratory, University of Victoria, Victoria, BC, Canada. His current research interests include computer vision, optimal control, and state estimation.

Xuelong Li (M’02–SM’07–F’12) is a Full Professor with the Center for OPTical IMagery Analysis and Learning, State Key Laboratory of Transient Optics and Photonics, Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi’an, China.
