Feature Article

A Video-Based System for Hand-Driven Stop-Motion Animation

Xiaoguang Han and Hongbo Fu ■ City University of Hong Kong
Hanlin Zheng ■ Zhejiang University
Ligang Liu ■ University of Science and Technology of China
Jue Wang ■ Adobe Research

A video-based system animates everyday objects in stop-motion style, flexibly and intuitively. Animators can perform and capture motions continuously instead of breaking them into increments and shooting one still picture per increment. The system permits direct hand manipulation without needing rigs, achieving more natural object control for beginners.

Stop-motion is a popular animation technique to make a physical object (for example, a puppet) appear to move on its own. Traditional stop-motion is laborious, requiring manually moving the object in small increments and shooting one still picture of the object per increment. This process should avoid capturing the hands or other tools that drive the movement. Because the object's motion must be broken up into increments, animators must concentrate on subtle object movement across neighboring frames rather than the desired continuous motion. This makes the process unintuitive, especially for novices.

Another limitation of traditional stop-motion is that in many scenarios, such as animating a flying puppet, simply holding the object up with your hands would cause large occlusion regions. Recovering these regions is difficult, even with state-of-the-art image completion methods (for more on these methods, see the sidebar). A common solution is to use a rig that securely supports the puppet. The rig causes only tiny occluded areas; animators can digitally remove it in postproduction.


However, most professional rigs are expensive, and homemade rigs require suitable mechanical components (for example, a heavy plate and a long strand of wire) and some degree of manual dexterity. Furthermore, attaching the rig to the animated object can be tricky, especially when the object is heavy, fragile, or smooth.

To address these limitations, we present a video-based stop-motion production system. Our key contribution is a two-phase workflow (see Figure 1). In the first phase, the animator holds the object with his or her hands and performs the planned motion in a continuous fashion. In the second phase, the animator performs a similar motion, with the hands holding the object at different positions. Powered by computer vision techniques, our system aligns the images captured in the two phases and combines them to create a composite with the hands removed and the occluded object parts recovered. This workflow not only lets the animator focus on the desired global motion without breaking it down into small increments but also replaces rigs with the hands, achieving more flexible and intuitive control of object motion. So, even novices can use our system to turn a story idea into a high-quality stop-motion animation.

The Workflow

Our system requires a video camera connected to a computer. Because our main intended users are amateurs, our current setup uses an inexpensive USB webcam (the Logitech HD Pro Webcam C910) for its simplicity and affordability (see Figure 2a).




Figure 1. Two-phase keyframe-based capturing and processing. First, the animator holds the object with his or her hands and performs the planned motion in a continuous fashion. Second, the animator holds the object at different positions to align it with the object in the keyframes of the captured video. The system combines the images captured in the two phases to create a composite with the hands removed and the occluded object parts recovered.


Figure 2. Creating a stop-motion animation of a Rubik's Cube. (a) Our setup employs an inexpensive USB webcam (the Logitech HD Pro Webcam C910) connected to a computer. (b) First, the animator directly manipulated the cube. (c) Our system semiautomatically removed the animator's hands. Our system is complementary to, and can be used with, traditional stop-motion production (for example, for rotating the individual cube faces in the first and last frames).

Although traditional stop-motion production might also use video cameras, it uses them mainly for capturing still pictures. Because we concentrate on object motion, we fixed the camera on a tripod in our experiments. We discuss possible solutions for using moving cameras later.
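To make this setup concrete, the following is a minimal capture sketch (our own illustration, not the authors' code) using OpenCV in Python: it grabs the clean background plate required at the start of phase 1 and then records the continuous performance. The device index, resolution request, frame rate, and file names are placeholder assumptions.

```python
# Minimal phase-1 capture sketch (not the authors' code): grab a clean
# background plate, then record the continuous performance from a fixed webcam.
import cv2

cap = cv2.VideoCapture(0)                        # device index is an assumption
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)          # requested resolution (placeholder)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)

ok, background = cap.read()                      # clean plate: no object, no animator
assert ok, "camera not available"
cv2.imwrite("background_plate.png", background)

h, w = background.shape[:2]
writer = cv2.VideoWriter("phase1.avi",
                         cv2.VideoWriter_fourcc(*"MJPG"), 30.0, (w, h))
while True:
    ok, frame = cap.read()
    if not ok:
        break
    writer.write(frame)                          # continuous phase-1 recording
    cv2.imshow("phase 1 capture", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):        # press 'q' to stop recording
        break

writer.release()
cap.release()
cv2.destroyAllWindows()
```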

Capturing, Phase 1

This phase first captures a clean background plate without the object or animator. As we mentioned before, the animator then holds the object and performs the desired motion by continuously moving the object in front of the camera (see Figure 2b). The animator can change hand placement during capturing, although this might cause the captured motion to be less fluid, requiring additional motion editing in postproduction.

For maximum efficiency, the animator must hold the object only at its nearly rigid parts and avoid occluding nonrigid parts. This is crucial to the success of the computer vision techniques our system employs. The whole object doesn't need to be nearly rigid. To ease hand removal, there are several rules of thumb for hand placement, listed here in roughly descending order of importance:

■ Minimize the total area of the occluded regions.
■ Avoid unnecessary overlap between the animator and the object—for example, the overlap caused by standing right behind the object.
■ Set up proper lighting to reduce the shadow caused by the animator's hands or other body parts.

Figure 3 shows one positive and two negative examples of hand placement.

In practice, performing exactly the same motion twice is difficult, and our system doesn't require the animator to do that.



Related Work in Computer Animation and Image Processing

Here we look briefly at research in several areas related to the development of our stop-motion animation system (see the main article).

Stop-Motion Production

Various software systems exist for producing stop-motion animation, for both professionals (for example, Dragonframe) and amateurs (for example, Stop Motion Pro and iStopMotion). However, the underlying technology hasn't advanced much over time. Previous systems constantly stop and start the camera to allow for slight object movement between two consecutive shots. Although our video-based system seldom stops the camera, we still regard it as a stop-motion technique because it generates stop-motion-style output.

Traditional stop-motion production often employs puppets with highly articulated armatures or clay figures, owing to their ease of repositioning. However, making them requires sculpting and specialized skills such as brazing. In contrast, we focus on animating everyday physical objects, enabling easy production of stop-motion animation, even for beginners.

Video-Based Animation

Many systems create a nonphotorealistic or cartoon animation from a video sequence by stylizing the video's appearance. Stop-motion, on the other hand, requires maintaining the object's original appearance and motion while removing whatever drove its movement (for example, hands) in the final animation.

Connelly Barnes and his colleagues presented a video-based interface for 2D-cutout animation (one type of stop-motion).1 Their system tracks 2D paper puppets being physically moved by an animator. To remove the animator's hands, it renders the puppets using their corresponding images in the puppet database. However, it can handle only planar objects without 3D rotation or self-occlusion, owing to the limited number of images in the database. Our system doesn't use such an example-driven approach. So, it can handle more complicated 3D shapes and motions, as well as the interaction between the object and its environment (for example, the dynamic specular reflection in Figure 2 in the main article).

Recently, Robert Held and his colleagues presented a Kinect-based system for producing 3D animations using physical objects as input.2 Their system outputs computer-generated animations, rendered using the reconstructed 3D models of the physical objects. In contrast, we aim to create animations with videos and pictures of real objects.

Video Segmentation and Completion

A variety of interactive rotoscoping3 and video object cutout methods4 exist. They aim for accurate object segmentation and generally require considerable user


interaction to achieve that accuracy. Given that we want to reconstruct the areas occluded by the hands, a rough segmentation is sufficient in our case (see Figure 6 in the main article). This significantly simplifies the problem and reduces the amount of user intervention. Researchers have recently proposed many video completion methods.5 These methods typically use 3D patches as primitives for maintaining spatiotemporal coherence. They usually fill in a missing region with content for that region from the same input video. Using 3D patches can effectively fill holes in the background region. However, it’s less applicable for dynamic foreground objects whose motion is complex and whose topology might therefore change from frame to frame. Furthermore, the data needed for completion might not exist in other parts of the video. Our keyframe-based recapturing and completion avoids these problems and significantly eases video object completion.

Image Completion and Enhancement Using Multiple Inputs

Our research heavily relies on the success of data-driven image completion, which completes large holes in an image by borrowing suitable patches from other images. Aseem Agarwala and his colleagues presented a digital photomontage framework for combining parts of different photographs into a single composite for a variety of applications.6 Our system uses a similar approach to combine two images to create a final composite. However, the photomontage system deals only with prealigned still images. We align images taken at different times and propagate the image completion results on keyframes to all other frames in a temporally coherent way.

References

1. C. Barnes et al., "Video Puppetry: A Performative Interface for Cutout Animation," ACM Trans. Graphics, vol. 27, no. 5, 2008, article 124.
2. R. Held et al., "3D Puppetry: A Kinect-Based Interface for 3D Animation," Proc. 25th Ann. ACM Symp. User Interface Software and Technology (UIST 12), ACM, 2012, pp. 423–434.
3. A. Agarwala et al., "Keyframe-Based Tracking for Rotoscoping and Animation," ACM Trans. Graphics, vol. 23, no. 3, 2004, pp. 584–591.
4. X. Bai et al., "Video SnapCut: Robust Video Object Cutout Using Localized Classifiers," ACM Trans. Graphics, vol. 28, no. 3, 2009, article 70.
5. M.V. Venkatesh, S. Cheung, and J. Zhao, "Efficient Object-Based Video Inpainting," Pattern Recognition Letters, vol. 30, no. 2, 2009, pp. 168–179.
6. A. Agarwala et al., "Interactive Digital Photomontage," ACM Trans. Graphics, vol. 23, no. 3, 2004, pp. 294–302.


Figure 3. Hand placement. (a) A positive example, in which the occluded region’s total area is minimized. (b) A negative example in which the hand seriously occludes the object. (c) Another negative example in which the overlap between the hand and the object is unnecessary.


Figure 4. The user interface for keyframe-based alignment: onion skinning (a) with snapping and (b) without snapping, and (c) superimposed edge maps. When the object is close enough to its position and orientation in the keyframe, our system automatically warps the current live frame to better align with the keyframe.

Instead, for phase 2, we employ keyframe-based capturing and processing using the first video as a reference. Before phase 2, the animator must identify a set of keyframes and the regions of interest (ROIs)—for example, the orange rectangles in Figure 1. Each ROI includes one or more occluded parts undergoing the same nearly rigid transformations. For each ROI, the animator creates a tracking window by specifying a bounding box loosely enclosing the ROI on the first frame. The windows then automatically propagate to the other frames, using motion tracking. The animator then examines, and corrects if necessary, each frame's bounding boxes to ensure they enclose the object's occluded areas (not necessarily the entire object).

Next, on the basis of motion analysis, our system automatically suggests a series of keyframes, which the animator can refine in phase 2. Generally, the system will generate dense keyframes if the video involves changing hand-object configurations or complex object movement such as 3D rotation.
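The window propagation could be sketched as follows (our own simplification, not the authors' implementation): the bounding box is shifted by the median dense-flow vector inside it, with OpenCV's Farneback flow standing in for the flow method cited later in the article.

```python
# Sketch: propagate an ROI tracking window to the next frame by following
# the median optical-flow vector inside the current box.
import cv2
import numpy as np

def propagate_box(prev_bgr, next_bgr, box):
    """box = (x, y, w, h) on prev_bgr; returns the box shifted onto next_bgr."""
    x, y, w, h = box
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    roi_flow = flow[y:y + h, x:x + w]            # flow vectors inside the window
    dx = float(np.median(roi_flow[..., 0]))      # median is robust to the hands' motion
    dy = float(np.median(roi_flow[..., 1]))
    return (int(round(x + dx)), int(round(y + dy)), w, h)
```

A manual correction on any frame would simply re-seed the box there and rerun this step forward.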

Capturing, Phase 2

Here, the system sequentially processes the selected keyframes, using a specially designed user interface. Starting from the first keyframe, the animator physically moves the object to align it with the previously captured object in each keyframe (see Figure 1). This time, the hand position must be different so that the previously occluded parts are clearly visible in the newly captured image. For more reliable computation, the animator should place his or her hands as far from the previously occluded parts as possible.

To facilitate keyframe-based capturing, we base the interface on onion skinning. The interface displays a semitransparent overlay of the current keyframe and the live frame (see Figure 4a). Such visual feedback greatly helps the animator examine the alignment's accuracy. The interface includes a snapping feature: when the object is close enough to its position and orientation in the keyframe, our system automatically warps the current live frame to better align with the keyframe. This feature can greatly relieve the animator from the burden of precise alignment.




Figure 5. Hand removal on a keyframe. (a) The keyframe captured in phase 1, with the orange rectangle indicating the tracked region of interest. (b) The corresponding image captured in phase 2. (c) The automatically computed composite. (d) User-specified strokes to correct errors. (e) The final composite. (f) The final label map: red for L(x) = 0 (using the colors in Figure 5a) and green for L(x) = 1 (using the colors in Figure 5b), where L(x) is the binary label for pixel x.

The interface also displays a window of superimposed edge maps (see Figure 4c), as we discuss later.

Traditional stop-motion-animation and digital-animation software such as Adobe Flash also provide onion skinning. However, those programs use it to help animators estimate the change across frames (to ensure that the movement across frames is neither too small nor too radical), rather than to examine two frames' alignment.

We recommend that the animator perform interactive object alignment on the same side as the camera, with the display on the opposite side. Such a scenario is similar to a docking task that's often used to evaluate six-degree-of-freedom (DOF) input devices.1 From this viewpoint, the object serves essentially as a physical input controller with three DOF for translation and three DOF for rotation. Because the animator benefits from kinesthetic feedback about the physical object's position and orientation, our alignment interface is intuitive and efficient.

Optionally, our system can automatically decide whether an alignment is good enough, to enable a more fluid capturing experience. The animator still has full control of the process. He or she can manually signal a good alignment, and the system will capture the corresponding image and pair it with the current keyframe. Our system also allows back-and-forth keyframe toggling in case the animator isn't satisfied with the automatic alignment.

Interactive Hand Removal

The capturing process produces a video with the set of selected keyframes from phase 1 and a well-aligned image for each keyframe from phase 2. Given the pair of images for each keyframe, our system generates a composite by combining parts of the images to remove the hands (see Figure 5). The system also produces an object mask and a hand mask based on the image completion result. The system automatically propagates the masks and hand-removal results to the intermediate frames to restore the occluded regions. This results in a complete stop-motion animation video (see Figures 1 and 2c).

If the composite contains errors, the animator can refine it, using a paintbrush tool, as we describe later. The composite and masks update instantly to give the animator rapid visual feedback after drawing each brush stroke. To generate visually pleasing results, our system requires a faithful reconstruction of the occluded parts but only a rough segmentation of the hands and object. In fact, a hand mask that's slightly larger than the ground truth is preferable, to reduce artifacts caused by the hands' shadows on the object. A rough segmentation of the object is also sufficient for temporal propagation, as we explain later. So, most examples in this article needed only a small number of animator strokes to achieve satisfactory results.

Finally, the animator employs the paintbrush to mark the background regions that are outside the keyframe ROIs but are occluded by hands or other body parts. If the animator avoids unnecessary overlap with the object during phase 1, he or she can easily remove the occluded regions outside the ROIs and complete them using the clean background plate. The system slightly expands the resulting occlusion masks and projects them to the intermediate frames to automatically complete the occluded regions there.

The Algorithm and Implementation

Here, we describe the algorithm and implementation details that support the workflow.


ROI and Keyframe Specification

From the algorithm viewpoint, we're interested in only the local region where the animator's hands occlude the object. For simplicity, we assume a single occluded rigid part—that is, a single ROI. Multiple ROIs can be processed sequentially using the same pipeline. As we mentioned before, we let the animator specify an ROI—that is, a bounding box that loosely encloses this region (see Figure 5a).

The system then automatically tracks the bounding box throughout the rest of the sequence. As we explain later, we assume that the object's motion is relatively slow in the captured video, so the system can accurately track the bounding box by computing the optical flow between adjacent frames.2 The animator can also adjust the bounding box on any frame, which will reinitiate tracking, starting from that frame.

After the animator specifies the ROI, our system automatically estimates object motion within the ROI to determine the initial set of keyframes. A keyframe should be placed when the object's appearance is changing significantly. To detect this, we again use the optical flow for motion estimation because it satisfactorily captures complex motions such as deformation and 3D rotation, which can greatly affect the object's appearance. We compute the optical flow only inside each frame's ROI with respect to the corresponding region in the next frame. We assign a new keyframe when the accumulated average flow velocity since the current keyframe rises above a predefined threshold. Specifically, let $|v_t|$ be the average magnitude of the flow field from frame $t-1$ to $t$, and let $k_i$ be the current keyframe. We select $k_{i+1}$ when



$$\sum_{t = 1 + k_i}^{k_{i+1}} |v_t| > T_v,$$

where $T_v$ is a predefined threshold ($T_v = 20$ by default in our implementation). Keyframe extraction is a well-studied problem for video summarization.3 However, in that process, keyframes serve as a short summary of the video, whereas our keyframes guide temporal interpolation for hand removal.
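A sketch of this selection rule, assuming the per-ROI flow field of every frame is already available; the default threshold mirrors the $T_v = 20$ value stated above.

```python
# Sketch of keyframe selection: start a new keyframe once the accumulated
# average flow magnitude since the last keyframe exceeds T_v.
import numpy as np

def select_keyframes(roi_flows, t_v=20.0):
    """roi_flows[t] is the (H x W x 2) flow inside the ROI from frame t-1 to t."""
    keyframes = [0]                       # the first frame is always a keyframe
    accumulated = 0.0
    for t, flow in enumerate(roi_flows, start=1):
        magnitude = np.linalg.norm(flow, axis=2)      # per-pixel |v|
        accumulated += float(magnitude.mean())        # average flow speed |v_t|
        if accumulated > t_v:
            keyframes.append(t)
            accumulated = 0.0             # restart accumulation at the new keyframe
    return keyframes
```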

Onion Skinning and Snapping

A basic onion-skinning interface is straightforward to implement as a semitransparent superimposition of the live frame and the current keyframe with adjustable relative opacity (see Figures 4a and 4b). Although such a semitransparent overlay provides instant feedback on alignment, features from the individual images might be difficult to distinguish. We found that an additional window of superimposed edge maps (using the Sobel operator) can greatly help the animator identify a satisfactory alignment (see Figure 4c).

Our system requires only a rough alignment between the object in the live frame and its counterpart in the keyframe. This is an important feature for our capturing workflow because in practice it's

difficult for the animator to manually reach a perfect alignment (see Figure 4b). To ease the animator’s burden, we compute the optical flow between the keyframe ROI and the corresponding region in the live frame on the fly while the animator is using the onion-skinning interface. The flow is weighted by the pixel-wise difference between the two regions to reduce the disturbance of the hands, because the hands are at different positions in the two images. When the objects in the two images are close enough in position and orientation (thus satisfying the small-movement assumption of optical flow), the ROI in the live frame is warped by the flow and snapped onto the ROI in the keyframe to create a more accurate alignment. To detect good alignment, our system currently detects and matches SIFT (scale-invariant feature transform) features in the ROI of both images. A match is good if the average distance between the matched feature points is below a threshold (three pixels in our system). This criterion works well if many SIFT features can be detected (for example, for the flipping-book example we describe later). However, for textureless objects without enough features for reliable feature matching, the system disables this feature. To augment SIFT matching for automatic alignment evaluation, the system could employ other matching techniques, such as matching object contours.
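The alignment test could look roughly like the following (our sketch, using OpenCV's SIFT implementation). The three-pixel threshold is the one quoted above; the ratio test and the minimum match count are our own additions.

```python
# Sketch of the snapping trigger: the live ROI is considered aligned with the
# keyframe ROI when matched SIFT features are, on average, within a few pixels.
import cv2
import numpy as np

def roughly_aligned(keyframe_roi, live_roi, max_mean_dist=3.0):
    sift = cv2.SIFT_create()                             # OpenCV >= 4.4
    gray1 = cv2.cvtColor(keyframe_roi, cv2.COLOR_BGR2GRAY)
    gray2 = cv2.cvtColor(live_roi, cv2.COLOR_BGR2GRAY)
    kp1, des1 = sift.detectAndCompute(gray1, None)
    kp2, des2 = sift.detectAndCompute(gray2, None)
    if des1 is None or des2 is None:
        return False                                     # textureless: check disabled
    matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2)
    dists = []
    for pair in matches:
        if len(pair) < 2:
            continue
        m, n = pair
        if m.distance < 0.75 * n.distance:               # Lowe's ratio test (our addition)
            p1 = np.array(kp1[m.queryIdx].pt)
            p2 = np.array(kp2[m.trainIdx].pt)
            dists.append(np.linalg.norm(p1 - p2))
    # Require a handful of good matches before trusting the average distance.
    return len(dists) >= 10 and float(np.mean(dists)) < max_mean_dist
```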

Hand Removal

Here, we focus on hand removal within the ROI because removing the animator's hands or other body parts outside the ROI is straightforward (that is, by cloning the corresponding pixels in the clean background plate).

Semiautomatic hand removal on keyframes. Given $R_i$, the ROI on keyframe $k_i$ (see Figure 5a), and $R_i^l$, the corresponding ROI on the live frame (see Figure 5b), we want to remove the regions occluded by the animator's hands and recover the complete object in $R_i$ with the help of $R_i^l$. For simplicity, we temporarily drop $i$ from the notation. Assuming that $R^l$ is already warped by the optical flow to align with $R$, our system treats hand removal as an image-compositing problem that assigns each pixel $C_x$ in the final composite $C$ one of two possible colors: $R_x$ or $R_x^l$.

Generally, assuming $R^l$ and $R$ are aligned well, if $x$ is occluded in neither $R$ nor $R^l$, then $R_x \approx R_x^l$, and we prefer to assign $C_x = R_x$ to minimize the changes made to $R$. On the other hand, if $x$ is occluded in either $R$ or $R^l$, then $R_x$ differs significantly from $R_x^l$.



In this case, we use a background color model to further determine in which image the occlusion happens. If the occlusion happens in $R$, we assign $C_x = R_x^l$ to reconstruct the object color at $x$. Specifically, for $R_x$, we compute two color probabilities:

■ $p_f(x)$, the probability that $x$ isn't occluded, and
■ $p_o(x)$, the probability that $x$ is either occluded or a background pixel.

To compute $p_f(x)$, we select a set of high-confidence pixels for which $\|R_x - R_x^l\| < \delta$, where $\delta$ is a small constant that we set as 0.05. We then train a Gaussian mixture model (GMM) $G_f$ using these pixels' colors. Finally, we compute $p_f(x)$ as $G_f(R_x)$. Similarly, for $p_o(x)$, we use all pixels on the border of $R$ to train a GMM color model $G_o$, and we compute $p_o(x)$ as $G_o(R_x)$. $G_o$ contains both the hand and background colors.

To ensure the image labeling is spatially coherent and robust to image noise and color shift, we use an optimization approach under the photomontage framework (for more on this, see the section on image completion in the sidebar). We minimize this objective function:

$$E(L) = \sum_{x} E_d\bigl(x, L(x)\bigr) + \lambda \sum_{x,y} E_s\bigl(x, y, L(x), L(y)\bigr), \tag{1}$$

where $L(x)$ is the binary label for pixel $x$. $L(x) = 0$ means $C_x = R_x$; $L(x) = 1$ means $C_x = R_x^l$. $\lambda$ is the balancing weight between the two terms (0.5 in our system). $E_d$ is the data term:

$$E_d\bigl(x, L(x)\bigr) = \begin{cases} p_o(x) \,/\, \bigl(p_f(x) + p_o(x)\bigr) & L(x) = 0 \\ p_f(x) \,/\, \bigl(p_f(x) + p_o(x)\bigr) & L(x) = 1 \end{cases}.$$

$E_s$, the smoothness term, ensures a smooth transition between two different labels. We adopt the matching-color criterion from the digital-photomontage framework and define it as

$$E_s\bigl(x, y, L(x), L(y)\bigr) = \bigl\|R_x - R_y^l\bigr\| + \bigl\|R_y - R_x^l\bigr\|$$

when $L(x) \neq L(y)$, and 0 otherwise. We then minimize the total energy by using graph cut optimization,4 yielding a composite that's presented to the animator for review. If the animator is satisfied with the result, the system proceeds to the next keyframe. Otherwise, the animator can refine the composite by adding brush strokes as a hard constraint, as we mentioned before.

Figure 5c shows an example in which the automatically computed composite had a small error. To fix it, the animator added some red ($L(x) = 0$) strokes as a hard constraint (see Figure 5d). At an interactive rate, our system recomputed the composite, which was artifact-free (see Figure 5e).

The unoccluded background pixels in $R$ will be included for training both $G_f$ and $G_o$. So, $p_f(x)$ and $p_o(x)$ will both be high for these pixels, and $E_d$ doesn't favor either label. This isn't a problem in our application because these pixels don't affect hand removal; we let $E_s$ play the dominant role for them to create a seamless composite. As Figure 5f shows, although the labeling function doesn't accurately segment either the object or the hand, the final composite is good enough for hand removal.
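The color models and the data term could be sketched as follows (our illustration, using scikit-learn's GaussianMixture). For brevity, the sketch labels each pixel by the data term alone; the actual system also enforces the smoothness term $E_s$ through the graph cut, and animator strokes would override the labels as hard constraints.

```python
# Sketch of the keyframe compositing step: train G_f ("not occluded") and
# G_o ("occluded or background"), then label each ROI pixel.  Only the data
# term of Eq. 1 is used here; a graph cut would add the smoothness term.
import numpy as np
from sklearn.mixture import GaussianMixture

def composite_roi(R, Rl, delta=0.05, n_components=5):
    """R, Rl: aligned ROIs in [0, 1], shape (H, W, 3).  Returns (composite, labels)."""
    diff = np.linalg.norm(R - Rl, axis=2)
    confident = R[diff < delta]                                 # pixels unoccluded in both images
    border = np.concatenate([R[0], R[-1], R[:, 0], R[:, -1]])   # hand + background colors

    g_f = GaussianMixture(n_components).fit(confident)
    g_o = GaussianMixture(n_components).fit(border)

    pixels = R.reshape(-1, 3)
    p_f = np.exp(g_f.score_samples(pixels))                # likelihood under G_f
    p_o = np.exp(g_o.score_samples(pixels))                # likelihood under G_o

    # Data term: keeping R is costly where R looks occluded, so take Rl there.
    labels = (p_o > p_f).reshape(R.shape[:2])              # True -> L(x) = 1 -> use Rl
    composite = np.where(labels[..., None], Rl, R)
    return composite, labels
```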

Automatic temporal propagation. Once the animator has achieved satisfactory hand-removal results on adjacent keyframes $k_1$ and $k_2$ (we denote their image completion results as $C_{k_1}$ and $C_{k_2}$), our system employs automatic temporal interpolation for hand removal on all in-between frames $j$, $k_1 < j < k_2$ (see Figure 6). For temporal propagation, we further process $k_1$ to generate a hand mask $M_{k_1}^{hand}$ and an unoccluded object mask $M_{k_1}^{obj}$, with the help of the clean background plate $B$. We compute $M_{k_1}^{hand}$ as

$$M_{k_1}^{hand}(x) = T_h\bigl(\|R_{k_1}(x) - B(x)\|\bigr) \,\&\, T_h\bigl(\|R_{k_1}(x) - C_{k_1}(x)\|\bigr), \tag{2}$$

where $T_h(e)$ is a step function that is 1 if $e > T_c$ and 0 otherwise. $T_c$ is a predefined color threshold, which we set as 0.1 in our system. We then compute $M_{k_1}^{obj}$ as

$$M_{k_1}^{obj}(x) = T_h\bigl(\|R_{k_1}(x) - B(x)\|\bigr) - M_{k_1}^{hand}(x). \tag{3}$$

Figures 6b and 6d show two examples of $M^{obj}$ and $M^{hand}$. In Equations 2 and 3, we compute the masks as per-pixel operations. In theory, we could improve the mask calculation's robustness by using another graph cut optimization process. However, this isn't necessary because we don't require accurate object and hand masks. We use the unoccluded object mask mainly for motion estimation, so small errors don't affect its performance.

We then compute the optical flow from $R_{k_1}$ to $R_j$, using only the pixels inside $M_{k_1}^{obj}$ (as the data term), which allows more accurate estimation of object motion. We then extrapolate the masked flow field and use it to warp $C_{k_1}$ and $M_{k_1}^{hand}$ to frame $j$, denoted as $C'_{k_1}$ and $M'^{hand}_{k_1}$. Similarly, we compute $C'_{k_2}$ and $M'^{hand}_{k_2}$ from $k_2$.


Figure 6. Temporal propagation. (a) The image completion result for frame 100, an initial keyframe. (b) The object mask (red) and hand mask (green) for frame 100. (c) The image completion result for frame 115, the next keyframe. (d) The object and occlusion masks for frame 115. The system automatically propagated the results to the in-between frames (e) 102, (f) 104, (g) 107, (h) 109, (i) 111, and (j) 113.

We then linearly blend $C'_{k_1}$ and $C'_{k_2}$ to create a reference object image $C_j$ on which the object is complete, and we compute a final hand mask for $j$: $M_j^{hand} = M'^{hand}_{k_1} \cup M'^{hand}_{k_2}$. Finally, to remove the hands and complete the occluded object regions in $R_j$, we apply Poisson blending5 to seamlessly blend $R_j$ with $C_j$ in the region $M_j^{hand}$ (see Figures 6e–6j). We intentionally dilate $M_j^{hand}$ to remove shadow artifacts around the hand regions.

The per-pixel operations give the animator fast visual feedback, which we found more important for a fluid workflow. If the object casts strong shadows on the scene (for example, the Rubik's Cube in Figure 2) that would appear in $M^{obj}$ and possibly make the optical flow less robust, the animator can roughly exclude the object shadow regions using the paintbrush.
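A simplified sketch of propagating one keyframe's result to an in-between frame $j$ (our illustration): it warps with the raw dense flow instead of the masked, extrapolated flow described above, omits the blend with the second keyframe $k_2$, and uses OpenCV's seamlessClone as the Poisson-blending step.

```python
# Sketch: warp the keyframe's completed ROI and hand mask to frame j along
# the optical flow, dilate the mask, and Poisson-blend the warped content in.
import cv2
import numpy as np

def propagate_to_frame(R_k, C_k, hand_mask_k, R_j):
    """8-bit BGR images R_k, C_k, R_j; hand_mask_k: 8-bit mask, 255 inside the hand."""
    h, w = R_k.shape[:2]
    flow = cv2.calcOpticalFlowFarneback(
        cv2.cvtColor(R_k, cv2.COLOR_BGR2GRAY),
        cv2.cvtColor(R_j, cv2.COLOR_BGR2GRAY),
        None, 0.5, 3, 15, 3, 5, 1.2, 0)
    # Backward-map frame j's pixels into the keyframe (no masked extrapolation here).
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x - flow[..., 0]).astype(np.float32)
    map_y = (grid_y - flow[..., 1]).astype(np.float32)
    C_warp = cv2.remap(C_k, map_x, map_y, cv2.INTER_LINEAR)
    M_warp = cv2.remap(hand_mask_k, map_x, map_y, cv2.INTER_NEAREST)
    # Dilate the mask so the hands' shadow fringes are replaced as well.
    M_warp = cv2.dilate(M_warp, np.ones((9, 9), np.uint8))
    ys, xs = np.nonzero(M_warp)
    if xs.size == 0:
        return R_j                                   # nothing to replace in this frame
    center = (int(xs.mean()), int(ys.mean()))
    # Poisson-blend the completed content into frame j inside the hand mask.
    return cv2.seamlessClone(C_warp, R_j, M_warp, center, cv2.NORMAL_CLONE)
```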

Video Speed Adjustment

Because we use a video camera to capture the animation, motion blur is inevitable when the object moves too fast. Motion blur usually doesn't exist in traditional stop-motion animation. To avoid it, our system requires the animator to move the object slowly. This will also benefit motion tracking and image completion and greatly reduce the amount of required animator assistance.

Traditional stop-motion animation has a sense of motion discontinuity between adjacent frames. Simply playing our processed video at its original speed might not produce that effect. So, we sample the video every five or so frames to create a more pleasing animation. During postproduction, the animator can also fine-tune the sampling rate at different parts of the video for more personalized results.
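The speed adjustment itself is just a subsampling of the processed frames; the sketch below (ours) lets the sampling rate vary per segment so pacing can be fine-tuned in postproduction.

```python
# Sketch: resample the processed video to restore a stepped, stop-motion feel.
def stop_motion_sample(frames, segments):
    """segments: list of (start, end, step) index ranges over the processed frames."""
    out = []
    for start, end, step in segments:
        out.extend(frames[start:end:step])
    return out

# For example, keep every 5th frame overall but every 8th in a slower passage:
# clip = stop_motion_sample(frames, [(0, 200, 5), (200, 350, 8), (350, len(frames), 5)])
```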

Results and Discussion

All the animations in this article and the accompanying video (available at http://sweb.cityu.edu.hk/hongbofu/projects/StopMotion_CGA/StopMotion_CGA.mov) were created by amateurs with little experience in digital or stop-motion animation. They created these examples in an everyday working environment, not a professional studio with special lighting and blue or green screens for background replacement.

Our workflow essentially separates motion performance (phase 1), which might require an experienced animator, from keyframe-based interactive rendering (phase 2), which requires some animator input but can be done by amateurs. In other words, the same animator doesn't have to be involved in both phases. On the other hand, our system's flexibility comes at the cost of a lack of precise control (for example, for achieving carefully crafted ease-in and ease-out effects), possibly requiring motion editing in postproduction.

Example Animations

With our tool, users could naturally produce an animation of pouring water from a cup (see Figure 7a). In contrast, with traditional stop-motion production, such an animation is difficult even using complicated rigs or other supporting tools. We asked two professional animators who teach stop-motion courses at a prestigious art school to make a similar animation with whatever hardware and software they liked. Although they managed to do this using DragonFrame and hardware tools such as transparent adhesive tape and wires, they failed to reproduce the motion of water in their still shots. They spent around two hours producing an animation with only 30 still shots (with our tool, a single amateur produced the animation with 512 frames in 2.5 hours).




Figure 7. Stop-motion animations produced by our system. (a) Pouring water. (b) A walking stool. (c) Plant pruning. (d) A flipping book. These animations were created by amateurs with little experience in digital or stop-motion animation. (Videos of animations made with our system are at http://sweb.cityu.edu.hk/hongbofu/projects/StopMotion_CGA/StopMotion_CGA.mov.)

With direct hand manipulation, animators can easily animate relatively heavy objects such as the stool in Figure 7b. The professional animators also commented that animations involving complex 3D rotation, such as the walking stool, are challenging even with rigs because they must frequently change the rig-object configurations to accommodate complex motions.

Figure 7c shows an animation of scissors pruning a plant, which involves multibody motion. During phase 1, two hands simultaneously manipulated and animated two parts of the scissors. So, this example required two ROIs, which the system processed sequentially in phase 2. Figure 2 shows another example involving multibody motion. This example also shows easy integration of our system with traditional stop-motion production for the rotation of the cube's individual faces.

The flipping book in Figure 7d involves continuous rotation around a fixed axis. Given that such motions have only one DOF, we can obtain exactly the same trajectory during phase 2, although the moving speed might be different. So, although our keyframe-based alignment and capturing still works, a more efficient approach might be to first capture the motion in phase 2 as a video as well.6 Then, the animator would synchronize the two


videos by establishing a temporal correspondence between their frames.

Hand Placement

Generally, the bigger the object, the more flexible the hand placement can be because the hands occlude only a small portion of the object. Conversely, our system might become less reliable when the object is small and most of it is occluded. In this case, because motion tracking is difficult, reconstructing a large portion of the object might introduce noticeable artifacts. To animate small objects, we could use tools to drive the animation that can still be removed with our system (for example, using scissors in a typing-keyboard animation).

The Number of Keyframes

Table 1 shows the number of keyframes that five of our examples used. On average, we used one keyframe for every 12 frames. Objects that are hard to track owing to the lack of features or fast motion (for example, the walking stool) will require more keyframes to generate satisfactory interpolation results. The worst cases, such as animating an almost transparent object that's difficult to track even with the best computer vision techniques, will require a keyframe for every two or three frames. Such cases might need as much animator interaction as traditional stop-motion production. However, we believe that some animators might still prefer our system, given its flexibility of direct hand manipulation.

Table 1. The results for creating some of our animations.

Animation                  No. of frames   No. of keyframes   Approximate production time (hrs.)
Rubik's Cube               503             33                 3.0
Flipping book              201             21                 0.5
Pouring water              512             43                 2.5
Moving a computer mouse    341             21                 1.0
Walking stool              310             70                 6.0

Timings

As Table 1 shows, producing our animations took from one to six hours, including motion planning and postproduction. The total animator time depended largely on the number of extracted keyframes. Object alignment took from dozens of seconds to one or two minutes per keyframe. Interactive hand removal needed only a small number of strokes on a few keyframes to achieve reasonable results. On average, the animator spent about half a minute on each keyframe for interactive hand removal. For temporal interpolation, our unoptimized implementation achieved a few seconds per frame (depending on the ROI size); we can further accelerate this for more rapid visual feedback.

To create animations similar to ours, traditional stop-motion techniques would typically take much longer. This was confirmed by the two professional animators we mentioned earlier. For instance, they expected that even professionals would need at least two days to complete an animation similar to our walking stool (which took six hours with our system).

Limitations

We speculate that the number of keyframes used is far from optimal. A better solution might be to start with a sparse set of keyframes and let the animator recursively insert a new keyframe between two keyframes that produce an unsatisfactory temporal interpolation. This requires faster temporal propagation for real-time feedback.

Of course, our system can't animate every possible object and has important limitations. It assumes that occluded object parts in phase 1 are available in phase 2. However, this isn't true for highly deformable objects such as clothes, which are easily distorted by direct hand manipulation. Our system might have difficulties with such objects.

Our automatic temporal propagation uses the optical flow to estimate the unoccluded parts' motion and then extrapolates the flow to the occluded regions as a warp field. So, problems due to either the optical flow or flow extrapolation will cause artifacts. Optical flow is weak at estimating out-of-plane rotation and performs poorly for regions with low texture (see Figure 8). Our smoothness-based extrapolation might fail to respect objects' features—for example, the straight edges in Figure 8. In our current system, we can alleviate these problems by inserting more keyframes.


A better solution might be to impose rigid constraints on the formulation of the optical flow or extrapolation—for example, in the sense of structure-preserving image deformation.7 An alternative is to employ roughly reconstructed 3D models. We could do this by using an RGB-D camera such as the Microsoft Kinect instead of our traditional webcam. (For more on such an approach, see the section on video-based animation in the sidebar.) This approach will afford much more robust (3D) object tracking, leading to more accurate warp fields.

Camera movement or a dynamic background can further complicate the problem and make the system less efficient. In this case, because a clean background plate isn't available, animators must provide more manual input for ROI tracking and keyframe segmentation and use denser keyframes to avoid temporal interpolation errors. Furthermore, the system must employ video hole-filling to fill in the background regions occluded by the animator's hands. (For more on this, see the section on video segmentation and completion in the sidebar.) For a moving camera with a static background, another possible solution is to first roughly reposition the camera in phase 2 with respect to the camera movement in the first video. Then, the system could employ advanced video-matching algorithms that can bring into spatiotemporal alignment two videos following nearly identical spatial trajectories.6

Under changing lighting, images captured in phase 2 might appear significantly different from the video captured in phase 1. This will make a number of system components that rely on computing color difference fail. A possible solution is to first examine how the lighting changes in common regions of the scene captured in the two phases. Then we could apply relighting to cancel out the lighting differences' effect before applying our system.

Even with onion skinning, keyframe-based object alignment becomes more difficult when the desired object motion involves more DOF. In particular, alignment is trickiest when an object is completely floating in the air. It then involves both 3D translation and rotation (six DOF) because there's no physical reference for alignment.




Figure 8. A failure case. (a) A starting keyframe, an intermediate frame, and an ending keyframe. (b) The completed versions of the three frames. Our smoothness-based extrapolation failed to respect the moved object's straight edges (see the second image and the inset in Figure 8b).

To alleviate this problem, animators should continuously and sequentially perform object alignment keyframe by keyframe. Because object movement across adjacent keyframes is typically small, only a slight movement of the object is needed to achieve alignment. However, this can easily cause fatigue for heavy objects such as the walking stool. Furthermore, object alignment is less intuitive when the animator is in front of the camera (see Figure 1) rather than on the same side as the camera. The animator could horizontally flip the live frame to get a mirrored effect, which he or she might be more familiar with. However, controlling rotation or translation toward or away from the camera is rather counterintuitive.

The results demonstrate that our system is useful and efficient for creating stop-motion animations for a variety of objects. We believe it has also advanced the state of the art in low-budget stop-motion production. We plan to address the limitations we just mentioned to build a more robust system.

Acknowledgments

We thank the reviewers for their constructive comments, Michael Brown for video narration, Lok Man


Fung and Wai Yue Pang for experimenting with traditional stop-motion production, and Tamas Pal Waliczky and Hiu Ming Eddie Leung for their comments. This research was supported partly by the Research Grants Council of the Hong Kong Special Administrative Region (grants CityU113610 and CityU113513), the City University of Hong Kong (grants 7002925 and 7002776), the National Natural Science Foundation of China (grant 61222206), and the National Basic Research Program of China (grant 2011CB302400).

References

1. S. Zhai, "Human Performance in Six Degree of Freedom Input Control," PhD dissertation, Graduate Dept. of Industrial Eng., Univ. of Toronto, 1995.
2. C. Liu, "Beyond Pixels: Exploring New Representations and Applications for Motion Analysis," PhD dissertation, Dept. of Electrical Eng. and Computer Science, MIT, 2009.
3. B.T. Truong and S. Venkatesh, "Video Abstraction: A Systematic Review and Classification," ACM Trans. Multimedia Computing, Communications, and Applications, vol. 3, no. 1, 2007, article 3.
4. Y. Boykov and O. Veksler, "Graph Cuts in Vision and Graphics: Theories and Applications," Handbook of Mathematical Models in Computer Vision, N. Paragios, Y. Chen, and O. Faugeras, eds., Springer, 2006, chapter 5.

5. P. Pérez, M. Gangnet, and A. Blake, "Poisson Image Editing," ACM Trans. Graphics, vol. 22, no. 3, 2003, pp. 313–318.
6. P. Sand and S. Teller, "Video Matching," ACM Trans. Graphics, vol. 23, no. 3, 2004, pp. 592–599.
7. Q.-X. Huang, R. Mech, and N. Carr, "Optimizing Structure Preserving Embedded Deformation for Resizing Images and Vector Art," Computer Graphics Forum, vol. 28, no. 7, 2009, pp. 1887–1896.

Xiaoguang Han is a research associate at the City University of Hong Kong's School of Creative Media. His research interests include computer graphics and computer vision. Han received a BS in information and computer science from the Nanjing University of Aeronautics and Astronautics and an MSc in applied mathematics from Zhejiang University. Contact him at [email protected].

Hongbo Fu is an assistant professor in the City University of Hong Kong's School of Creative Media. His primary research interest is computer graphics, particularly digital geometry processing. Fu received a PhD in computer science from the Hong Kong University of Science and Technology. Contact him at [email protected].

Hanlin Zheng is a PhD candidate in applied mathematics at Zhejiang University. His research interests include computer graphics and digital geometry processing. Zheng received a BSc in applied mathematics from Zhejiang University. Contact him at [email protected].

Ligang Liu is a professor at the Graphics & Geometric Computing Laboratory at the University of Science and Technology of China. His research interests include digital geometry processing, computer graphics, and image processing. Liu received a PhD in computer graphics from Zhejiang University. Contact him at [email protected].

Jue Wang is a senior research scientist at Adobe Research. His research interests include image and video processing, computational photography, and computer graphics and vision. Wang received a PhD in electrical engineering from the University of Washington. He's a senior member of IEEE and a member of ACM. Contact him at [email protected].


