Neural Capture & Synthesis

The main theme of my work is capturing and (re-)synthesizing the real world using commodity hardware. This includes modeling the human body, tracking, and reconstructing and interacting with the environment. Such digitization is needed for various applications in AR/VR as well as in movie (post-)production. Teleconferencing and working in VR are of high interest for many companies, ranging from social media platforms to car manufacturers, as they enable remote interaction in VR, e.g., the inspection of 3D content such as CAD models or scans of real objects. A realistic reproduction of appearance and motion is key for such applications. Capturing natural motions and expressions, as well as photorealistic rendering of images from novel views, remain challenging. With the rise of deep learning methods and, especially, neural rendering, we have seen immense progress on these challenges.

The goal of my work is to develop methods for AI-based image synthesis of humans, together with the underlying representations of appearance, geometry, and motion that allow for explicit and implicit control over the synthesis process. My work on 3D reconstruction, tracking, and rendering does not focus exclusively on humans but also covers the environment and the objects we interact with, enabling applications like 3D telepresence or collaborative working in VR. In both areas, reconstruction and rendering, hybrid approaches that combine novel findings in machine learning with classical computer graphics and computer vision techniques show promising results. Nevertheless, these methods still suffer from limitations regarding generalizability, controllability, and editability, which I tackle in my ongoing and future work.


Stable Video Portraits

Stable Video Portraits is a novel hybrid 2D/3D generation method that outputs photorealistic videos of talking faces leveraging a large pre-trained text-to-image prior (2D), controlled via a 3DMM (3D). It is based on a personalized image diffusion prior which allows us to generate new videos of the subject, and also to edit the appearance by blending the personalized image prior with a general text-conditioned model.
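
As a rough illustration of how two diffusion priors could be combined at sampling time, the sketch below mixes the noise predictions of a personalized model (conditioned on 3DMM renderings) with those of a generic text-conditioned model. The function names, conditioning inputs, and the linear blend weight `alpha` are illustrative assumptions, not the exact mechanism used in the paper.

```python
import torch

@torch.no_grad()
def blended_eps(personal_unet, generic_unet, x_t, t, cond_3dmm, cond_text, alpha=0.7):
    """Mix the noise predictions of a personalized and a generic diffusion model.

    Illustrative sketch: `personal_unet` is assumed to be conditioned on 3DMM
    renderings, `generic_unet` on a text embedding; `alpha` trades off identity
    fidelity against text-driven appearance edits.
    """
    eps_personal = personal_unet(x_t, t, cond_3dmm)   # identity-specific prior
    eps_generic = generic_unet(x_t, t, cond_text)     # general text-to-image prior
    return alpha * eps_personal + (1.0 - alpha) * eps_generic
```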

[Paper]  [Bibtex] 

TeSMo: Generating Human Interaction Motions in Scenes with Text Control

TeSMo is a method for text-controlled scene-aware motion generation based on denoising diffusion models. Specifically, we pre-train a scene-agnostic text-to-motion diffusion model, emphasizing goal-reaching constraints on large-scale motion-capture datasets. Then, we enhance this model with a scene-aware component, fine-tuned using data augmented with detailed scene information, including ground plane and object shapes.
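
The two-stage recipe (scene-agnostic pre-training, then scene-aware fine-tuning) could be organized roughly as in the sketch below; the toy denoiser, the floor-map scene encoder, and the choice of which parameters to freeze are assumptions for illustration rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class MotionDenoiser(nn.Module):
    """Toy text- and goal-conditioned motion diffusion denoiser (illustrative sizes)."""
    def __init__(self, motion_dim=66, cond_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(motion_dim + cond_dim + 1, 512), nn.SiLU(),
                                 nn.Linear(512, 512), nn.SiLU(),
                                 nn.Linear(512, motion_dim))

    def forward(self, x_t, t, cond):
        # x_t: (B, T, motion_dim) noisy motion, t: (B,) diffusion step, cond: (B, cond_dim)
        B, T, _ = x_t.shape
        t_emb = t.float().view(B, 1, 1).expand(B, T, 1)
        c = cond.unsqueeze(1).expand(B, T, -1)
        return self.net(torch.cat([x_t, c, t_emb], dim=-1))

# Stage 1: pre-train on large mocap data with text + goal-location conditioning (scene-agnostic).
denoiser = MotionDenoiser()

# Stage 2: add a scene-aware component (here, a toy encoder of a 2D floor/occupancy map),
# fuse its embedding into the condition, and fine-tune on scene-augmented data while
# keeping most of the pre-trained weights frozen.
scene_encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 512))
for p in denoiser.net[:-1].parameters():
    p.requires_grad_(False)

text_goal_cond = torch.randn(2, 512)                   # stand-in for text + goal embedding
scene_cond = scene_encoder(torch.rand(2, 1, 64, 64))   # (B, 512) scene embedding
eps = denoiser(torch.randn(2, 120, 66), torch.randint(0, 1000, (2,)),
               text_goal_cond + scene_cond)            # simple additive fusion
```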

[Paper]  [Video]  [Bibtex] 

FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models

We introduce FaceTalk, a novel generative approach for synthesizing high-fidelity 3D motion sequences of talking human heads from an input audio signal. To capture the expressive, detailed nature of human heads, including hair, ears, and finer-scale eye movements, we propose to couple the speech signal with the latent space of neural parametric head models to create high-fidelity, temporally coherent motion sequences.
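
A minimal sketch of the idea, under the assumption of a transformer denoiser operating on per-frame head-model latents conditioned on per-frame audio features (e.g., from a wav2vec-style encoder); the dimensions and the way the diffusion step is injected are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class LatentMotionDenoiser(nn.Module):
    """Transformer that denoises a sequence of head-model latents conditioned on audio."""
    def __init__(self, latent_dim=128, audio_dim=768, d_model=256):
        super().__init__()
        self.in_proj = nn.Linear(latent_dim + audio_dim + 1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.out_proj = nn.Linear(d_model, latent_dim)

    def forward(self, z_t, t, audio_feats):
        # z_t: (B, T, latent_dim) noisy expression latents of the parametric head model
        # t:   (B,) diffusion step, audio_feats: (B, T, audio_dim) per-frame audio features
        B, T, _ = z_t.shape
        t_emb = t.float().view(B, 1, 1).expand(B, T, 1)
        h = self.in_proj(torch.cat([z_t, audio_feats, t_emb], dim=-1))
        return self.out_proj(self.backbone(h))   # predicted noise (or clean latents)

denoiser = LatentMotionDenoiser()
eps = denoiser(torch.randn(1, 60, 128), torch.randint(0, 1000, (1,)), torch.randn(1, 60, 768))
```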

[Paper]  [Video]  [Bibtex] 

TeCH: Text-guided Reconstruction of Lifelike Clothed Humans

TeCH reconstructs a lifelike 3D clothed human by leveraging 1) descriptive text prompts (e.g., garments, colors, hairstyles), which are automatically generated via a garment parsing model and Visual Question Answering (VQA), and 2) a personalized, fine-tuned text-to-image diffusion model (T2I) that learns the "indescribable" appearance.

[Paper]  [Video]  [Bibtex] 

GAN-Avatar: Controllable Personalized GAN-based Human Head Avatars

We propose to learn person-specific animatable avatars from images without assuming access to precise facial expression tracking. At the core of our method, we leverage a 3D-aware generative model that is trained to reproduce the distribution of facial expressions in the training data.

[Paper]  [Video]  [Bibtex] 

Imitator: Personalized Speech-driven 3D Facial Animation

We present Imitator, a speech-driven facial expression synthesis method, which learns identity-specific details from a short input video and produces novel facial expressions matching the identity-specific speaking style and facial idiosyncrasies of the target actor. Specifically, we train a style-agnostic transformer on a large facial expression dataset which we use as a prior for audio-driven facial expressions. Based on this prior, we optimize for identity-specific speaking style based on a short reference video.
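
The split between a style-agnostic prior and identity-specific adaptation could look roughly like the sketch below: a generic audio-to-expression transformer is frozen, and only a small style code is optimized against expressions tracked in the short reference video. Sizes, the style-injection scheme, and the adaptation target are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AudioToExpression(nn.Module):
    """Style-agnostic audio-to-expression prior with a small per-identity style code."""
    def __init__(self, audio_dim=768, expr_dim=53, style_dim=16, d_model=256):
        super().__init__()
        self.in_proj = nn.Linear(audio_dim + style_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.out_proj = nn.Linear(d_model, expr_dim)

    def forward(self, audio_feats, style):
        # audio_feats: (B, T, audio_dim), style: (B, style_dim) identity code
        s = style.unsqueeze(1).expand(-1, audio_feats.shape[1], -1)
        h = self.in_proj(torch.cat([audio_feats, s], dim=-1))
        return self.out_proj(self.backbone(h))        # (B, T, expr_dim) expression weights

# Adaptation: freeze the pre-trained prior and optimize only the style code against
# expressions tracked from a short reference video of the target actor.
model = AudioToExpression()
for p in model.parameters():
    p.requires_grad_(False)
style = torch.zeros(1, 16, requires_grad=True)
opt = torch.optim.Adam([style], lr=1e-2)

ref_audio, ref_expr = torch.randn(1, 100, 768), torch.randn(1, 100, 53)  # placeholder data
for _ in range(50):
    loss = torch.nn.functional.mse_loss(model(ref_audio, style), ref_expr)
    opt.zero_grad(); loss.backward(); opt.step()
```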

[Paper]  [Video]  [Bibtex] 

CaPhy: Capturing Physical Properties for Animatable Human Avatars

We present CaPhy, a novel method for reconstructing animatable human avatars with realistic dynamic properties of the clothing. Specifically, we aim to capture the geometric and physical properties of the clothing from real observations. This allows us to apply novel poses to the human avatar with physically correct deformations and wrinkles of the clothing.
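
As a toy example of a physics-inspired supervision term (only a stand-in; CaPhy optimizes actual cloth material properties such as stretching and bending behavior), the snippet below penalizes edge-length changes of a garment mesh relative to its rest state.

```python
import torch

def stretching_loss(verts, edges, rest_lengths, stiffness=1.0):
    """Toy stretching energy: penalize deviation of edge lengths from rest lengths.

    verts:        (V, 3) deformed garment vertices
    edges:        (E, 2) long tensor of vertex index pairs
    rest_lengths: (E,) rest-state edge lengths
    """
    d = verts[edges[:, 0]] - verts[edges[:, 1]]
    lengths = d.norm(dim=-1)
    return stiffness * ((lengths - rest_lengths) ** 2).mean()
```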

[Paper]  [Video]  [Bibtex] 

MICA: Towards Metrical Reconstruction of Human Faces

Face reconstruction and tracking is a building block of numerous applications in AR/VR, human-machine interaction, and medicine. Most of these applications rely on a metrically correct prediction of the shape, especially when the reconstructed subject is put into a metrical context. Thus, we present MICA, a novel metrical face reconstruction method that combines face recognition with supervised face shape learning.
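
A minimal sketch of the core idea under simplifying assumptions: a frozen face recognition encoder (e.g., an ArcFace-style network, not included here) produces an identity embedding, and a small regressor maps it to the shape coefficients of a 3D morphable model, supervised with metric 3D scans. The embedding size and parameter count are illustrative.

```python
import torch
import torch.nn as nn

class ShapeRegressor(nn.Module):
    """Maps a face-recognition identity embedding to morphable-model shape parameters."""
    def __init__(self, id_dim=512, n_shape_params=300):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(id_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, n_shape_params),
        )

    def forward(self, id_embedding):
        return self.mlp(id_embedding)   # shape coefficients in metric scale

regressor = ShapeRegressor()
shape = regressor(torch.randn(1, 512))   # identity embedding -> metrical face shape
```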

[Paper]  [Video]  [Bibtex] 

Texturify: Generating Textures on 3D Shape Surfaces

Texturify learns to generate geometry-aware textures for untextured collections of 3D objects. Our method trains from only a collection of images and a collection of untextured shapes, which are both often available, without requiring any explicit 3D color supervision or shape-image correspondence. Textures are created directly on the surface of a given 3D shape, enabling generation of high-quality, compelling textured 3D shapes.

[Paper]  [Video]  [Bibtex] 

Neural Head Avatars from Monocular RGB Videos

We present Neural Head Avatars, a novel neural representation that explicitly models the surface geometry and appearance of an animatable human avatar using a deep neural network. Specifically, we propose a hybrid representation consisting of a morphable model for the coarse shape and expressions of the face, and two feed-forward networks, predicting vertex offsets of the underlying mesh as well as a view- and expression-dependent texture.
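
The hybrid representation could be sketched as below: a morphable model provides coarse, posed vertices (not shown), one network refines them with per-vertex offsets, and a second network predicts a view- and expression-dependent texture for rasterized surface points. The condition sizes and layer widths are assumptions.

```python
import torch
import torch.nn as nn

class OffsetNet(nn.Module):
    """Predicts per-vertex offsets from an expression/pose code (illustrative sizes)."""
    def __init__(self, cond_dim=103, n_verts=5023):
        super().__init__()
        self.n_verts = n_verts
        self.net = nn.Sequential(nn.Linear(cond_dim, 256), nn.ReLU(),
                                 nn.Linear(256, n_verts * 3))

    def forward(self, cond):
        return self.net(cond).view(-1, self.n_verts, 3)

class TextureNet(nn.Module):
    """Predicts view- and expression-dependent color for surface points (uv coordinates)."""
    def __init__(self, cond_dim=103):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 + 3 + cond_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 256), nn.ReLU(),
                                 nn.Linear(256, 3), nn.Sigmoid())

    def forward(self, uv, view_dir, cond):
        return self.net(torch.cat([uv, view_dir, cond], dim=-1))

cond = torch.randn(1, 103)                    # expression + pose code
offsets = OffsetNet()(cond)                   # refine coarse morphable-model vertices
rgb = TextureNet()(torch.rand(1, 2), torch.randn(1, 3), cond)   # shade a surface point
```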

[Paper]  [Video]  [Bibtex] 

Mover: Human-Aware Object Placement for Visual Environment Reconstruction

We demonstrate that human-scene interactions (HSIs) can be leveraged to improve the 3D reconstruction of a scene from a monocular RGB video. Our key idea is that, as a person moves through a scene and interacts with it, we accumulate HSIs across multiple input images, and optimize the 3D scene to reconstruct a consistent, physically plausible and functional 3D scene layout.
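
To make the idea of HSI constraints concrete, the sketch below shows two toy objective terms that could drive the scene optimization: contact vertices accumulated over the video should touch a scene surface, and body vertices should not penetrate scene geometry. The tensor layouts and the SDF-based penetration term are assumptions, not the paper's exact formulation.

```python
import torch

def hsi_losses(contact_verts, scene_surface_pts, scene_sdf_vals):
    """Toy human-scene-interaction terms.

    contact_verts:     (N, 3) body vertices labeled as in contact, accumulated over the video
    scene_surface_pts: (M, 3) points sampled on the current scene/object surfaces
    scene_sdf_vals:    (N,) signed distances of body vertices to the scene (negative = inside)
    """
    # Contact: contact vertices should lie close to some scene surface.
    d = torch.cdist(contact_verts, scene_surface_pts)     # (N, M) pairwise distances
    contact_loss = d.min(dim=1).values.mean()
    # Non-penetration: body vertices should not be inside scene geometry.
    collision_loss = torch.clamp(-scene_sdf_vals, min=0.0).pow(2).mean()
    return contact_loss, collision_loss
```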

[Paper]  [Video]  [Bibtex] 

Advances in Neural Rendering

This state-of-the-art report on advances in neural rendering focuses on methods that combine classical rendering principles with learned 3D scene representations, often now referred to as neural scene representations. A key advantage of these methods is that they are 3D-consistent by design, enabling applications such as novel viewpoint synthesis of a captured scene.

[Paper]  [Bibtex] 

3DV 2021: Tutorial on the Advances in Neural Rendering

In this tutorial, we will talk about the advances in neural rendering, especially the underlying 2D and 3D representations that allow for novel viewpoint synthesis, controllability and editability. Specifically, we will discuss neural rendering methods based on 2D GANs, techniques using 3D Neural Radiance Fields or learnable sphere proxies. Besides methods that handle static content, we will talk about dynamic content as well.

[Video] 

SIGGRAPH 2021: Course on the Advances in Neural Rendering

This course covers the advances in neural rendering over the years 2020-2021.

[Video]  [Bibtex] 

TransformerFusion: Monocular RGB Scene Reconstruction using Transformers

We introduce TransformerFusion, a transformer-based 3D scene reconstruction approach. From an input monocular RGB video, the video frames are processed by a transformer network that fuses the observations into a volumetric feature grid representing the scene; this feature grid is then decoded into an implicit 3D scene representation.
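
A stripped-down sketch of the fusion step, assuming per-voxel image features have already been gathered from the frames that observe each voxel; attention fuses the observations and a small MLP decodes occupancy. The feature sizes, pooling, and decoder are illustrative.

```python
import torch
import torch.nn as nn

class VoxelFusion(nn.Module):
    """Per-voxel attention over multi-view features, followed by an occupancy decoder."""
    def __init__(self, feat_dim=64, d_model=128):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.fuse = nn.TransformerEncoder(layer, num_layers=2)
        self.occupancy = nn.Sequential(nn.Linear(d_model, 128), nn.ReLU(),
                                       nn.Linear(128, 1))

    def forward(self, view_feats):
        # view_feats: (num_voxels, num_views, feat_dim) image features projected
        # into each voxel from the RGB frames that observe it
        h = self.fuse(self.proj(view_feats))       # attend across observations
        fused = h.mean(dim=1)                      # pooled per-voxel feature
        return self.occupancy(fused)               # (num_voxels, 1) occupancy logit

logits = VoxelFusion()(torch.randn(1024, 8, 64))   # 1024 voxels, 8 observing frames each
```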

[Paper]  [Video]  [Bibtex] 

NerFACE: Dynamic Neural Radiance Fields for Monocular 4D Facial Avatar Reconstruction

We present dynamic neural radiance fields for modeling the appearance and dynamics of a human face. To handle the dynamics of the face, we combine our scene representation network with a low-dimensional morphable model which provides explicit control over pose and expressions. We use volumetric rendering to generate images from this hybrid representation and demonstrate that such a dynamic neural scene representation can be learned from monocular input data only, without the need of a specialized capture setup.
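
A minimal sketch of an expression-conditioned radiance field and the standard volume rendering step, assuming the morphable-model expression coefficients are simply concatenated to the sample inputs; positional encoding and the paper's exact conditioning are omitted.

```python
import torch
import torch.nn as nn

class DynamicRadianceField(nn.Module):
    """Radiance field conditioned on morphable-model expression coefficients (simplified)."""
    def __init__(self, expr_dim=76, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3 + 3 + expr_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 4))      # (r, g, b, density)

    def forward(self, xyz, view_dir, expr):
        out = self.net(torch.cat([xyz, view_dir, expr], dim=-1))
        return torch.sigmoid(out[..., :3]), torch.relu(out[..., 3])

def volume_render(rgb, sigma, deltas):
    """Composite samples along one ray: rgb (S, 3), sigma (S,), deltas (S,)."""
    alpha = 1.0 - torch.exp(-sigma * deltas)
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10], dim=0), dim=0)[:-1]
    weights = alpha * trans                                  # contribution of each sample
    return (weights.unsqueeze(-1) * rgb).sum(dim=0)          # final pixel color

field = DynamicRadianceField()
rgb, sigma = field(torch.randn(64, 3), torch.randn(64, 3), torch.randn(64, 76))
color = volume_render(rgb, sigma, torch.full((64,), 0.01))
```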

[Paper]  [Video]  [Bibtex] 

Neural Deformation Graphs for Globally-consistent Non-rigid Reconstruction

We introduce Neural Deformation Graphs for globally-consistent deformation tracking and 3D reconstruction of non-rigid objects. Specifically, we implicitly model a deformation graph via a deep neural network. This neural deformation graph does not rely on any object-specific structure and, thus, can be applied to general non-rigid deformation tracking.

[Paper]  [Video]  [Bibtex] 

Neural Non-Rigid Tracking

We introduce a novel, end-to-end learnable, differentiable non-rigid tracker that enables state-of-the-art non-rigid reconstruction. By enabling gradient back-propagation through a non-rigid as-rigid-as-possible optimization solver, we are able to learn correspondences in an end-to-end manner such that they are optimal for the task of non-rigid tracking.
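
For intuition, the snippet below writes down an as-rigid-as-possible residual over a deformation graph in plain PyTorch; since every operation is differentiable, gradients flow through the energy, which is the property the end-to-end training relies on. The actual method embeds such terms in a differentiable Gauss-Newton solver together with learned correspondences.

```python
import torch

def arap_energy(rest_nodes, deformed_nodes, rotations, edges):
    """As-rigid-as-possible residuals over a deformation graph (illustrative sketch).

    rest_nodes:     (N, 3) graph node positions in the canonical frame
    deformed_nodes: (N, 3) current node positions
    rotations:      (N, 3, 3) per-node rotation matrices
    edges:          (E, 2) long tensor of connected node indices
    """
    i, j = edges[:, 0], edges[:, 1]
    rest_edge = rest_nodes[j] - rest_nodes[i]                        # (E, 3)
    rotated = torch.einsum('eab,eb->ea', rotations[i], rest_edge)    # rotate rest edges
    residual = rotated - (deformed_nodes[j] - deformed_nodes[i])
    return (residual ** 2).sum()   # fully differentiable w.r.t. all inputs
```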

[Paper]  [Video]  [Bibtex] 

CVPR 2020: Tutorial on Neural Rendering

Neural rendering is a new and rapidly emerging field that combines generative machine learning techniques with physical knowledge from computer graphics, e.g., by the integration of differentiable rendering into network training. This tutorial summarizes the recent trends and applications of neural rendering.

[Paper]  [Video]  [Bibtex] 

State of the Art on Neural Rendering

Neural rendering is a new and rapidly emerging field that combines generative machine learning techniques with physical knowledge from computer graphics, e.g., by the integration of differentiable rendering into network training. This state-of-the-art report summarizes the recent trends and applications of neural rendering.

[Paper]  [Bibtex] 

Adversarial Texture Optimization from RGB-D Scans

We present a novel approach for color texture generation using a conditional adversarial loss obtained from weakly-supervised views. Specifically, we propose an approach to produce photorealistic textures for approximate surfaces, even from misaligned images, by learning an objective function that is robust to these errors.

[Paper]  [Video]  [Bibtex] 

DeepVoxels: Learning Persistent 3D Feature Embeddings

In this work, we address the lack of 3D understanding of generative neural networks by introducing a persistent 3D feature embedding for view synthesis. To this end, we propose DeepVoxels, a learned representation that encodes the view-dependent appearance of a 3D object without having to explicitly model its geometry.
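
The persistent part of the representation can be pictured as a learnable 3D feature grid that is queried at 3D points, as in the sketch below; the resolution and channel count are illustrative, and the full method additionally lifts image features into the volume and decodes queried features with a rendering network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PersistentFeatureVolume(nn.Module):
    """Learnable 3D feature grid queried at arbitrary 3D points."""
    def __init__(self, channels=32, res=32):
        super().__init__()
        self.volume = nn.Parameter(torch.zeros(1, channels, res, res, res))

    def forward(self, points):
        # points: (N, 3) in [-1, 1]^3 (normalized volume coordinates)
        grid = points.view(1, -1, 1, 1, 3)                           # (1, N, 1, 1, 3)
        feats = F.grid_sample(self.volume, grid, align_corners=True)  # (1, C, N, 1, 1)
        return feats.view(self.volume.shape[1], -1).t()              # (N, channels)

feats = PersistentFeatureVolume()(torch.rand(100, 3) * 2 - 1)        # query 100 points
```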

[Paper]  [Video]  [Bibtex] 

Research Highlight: Face2Face

Research highlight of the Face2Face approach, featured on the cover of the Communications of the ACM in January 2019. Face2Face is an approach for real-time facial reenactment of a monocular target video. The method had significant impact in the research community and far beyond: it won several awards, e.g., the SIGGRAPH ETech Best in Show Award, was featured in countless media articles (e.g., NYT, WSJ, Spiegel), and reached millions of views on social media.

[Paper]  [Video]  [Bibtex] 

Deep Video Portraits

Our novel approach enables photo-realistic re-animation of portrait videos using only an input video. The core of our approach is a generative neural network with a novel space-time architecture. The network takes as input synthetic renderings of a parametric face model, based on which it predicts photo-realistic video frames for a given target actor.

[Paper]  [Video]  [Bibtex] 

HeadOn: Real-time Reenactment of Human Portrait Videos

HeadOn is the first real-time reenactment approach for complete human portrait videos that enables the transfer of torso and head motion, facial expression, and eye gaze. Given a short RGB-D video of the target actor, we automatically construct a personalized geometry proxy that embeds a parametric head, eye, and kinematic torso model. A novel reenactment algorithm employs this proxy to map the captured motion from the source to the target actor.

[Paper]  [Video]  [Bibtex] 

InverseFaceNet: Deep Monocular Inverse Face Rendering

We introduce InverseFaceNet, a deep convolutional inverse rendering framework for faces that jointly estimates facial pose, shape, expression, reflectance and illumination from a single input image. This enables advanced real-time editing of facial imagery, such as appearance editing and relighting.
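
Structurally, such a regressor can be pictured as a shared convolutional backbone with one regression head per parameter group, as in the sketch below; the backbone, head dimensions, and the 27-dimensional spherical-harmonics illumination are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class InverseRenderingNet(nn.Module):
    """Single-image inverse rendering regressor with one head per face-model parameter group."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        heads = {"pose": 6, "shape": 80, "expression": 64,
                 "reflectance": 80, "illumination": 27}   # 27 = 9 SH coefficients x RGB
        self.heads = nn.ModuleDict({k: nn.Linear(64, d) for k, d in heads.items()})

    def forward(self, image):
        f = self.backbone(image)                          # (B, 64) shared feature
        return {k: head(f) for k, head in self.heads.items()}

params = InverseRenderingNet()(torch.randn(1, 3, 128, 128))   # dict of parameter tensors
```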

[Paper]  [Video]  [Bibtex] 

State of the Art on Monocular 3D Face Reconstruction, Tracking, and Applications

This report summarizes recent trends in monocular facial performance capture and discusses its applications, which range from performance-based animation to real-time facial reenactment. We focus on methods where the central task is to recover and track a three-dimensional model of the human face using optimization-based reconstruction algorithms.

[Paper]  [Bibtex] 

Eurographics 2018: State of the Art on Monocular 3D Face Reconstruction, Tracking, and Applications

This Eurographics 2018 state-of-the-art report presentation summarizes recent trends in monocular facial performance capture and discusses its applications, which range from performance-based animation to real-time facial reenactment. We focus on methods where the central task is to recover and track a three-dimensional model of the human face using optimization-based reconstruction algorithms.

[Paper]  [Bibtex] 

Dissertation: Face2Face - Facial Reenactment

This dissertation summarizes my work in the field of markerless motion tracking, face reconstruction, and their applications. In particular, it presents real-time facial reenactment, which enables the transfer of facial expressions from one video to another.

[Paper]  [Bibtex] 

SIGGRAPH Emerging Technologies: Demo of FaceVR

We present a demo of FaceVR, a method for real-time gaze-aware facial reenactment in virtual reality. FaceVR captures the facial performance and eye gaze of a user wearing a head-mounted display and photo-realistically re-renders them in a target video, enabling applications such as self-reenactment for VR teleconferencing.

[Paper]  [Video]  [Bibtex] 

SIGGRAPH Emerging Technologies: Real-time Face Capture and Reenactment of RGB Videos

We show a demo of real-time facial reenactment of a monocular target video sequence (e.g., a YouTube video). Our goal is to animate the facial expressions of the target video by a source actor and re-render the manipulated output video in a photo-realistic fashion.

[Paper]  [Video]  [Bibtex] 

Face2Face: Real-time Face Capture and Reenactment of RGB Videos

We present a novel approach for real-time facial reenactment of a monocular target video sequence (e.g., a YouTube video). The source sequence is also a monocular video stream, captured live with a commodity webcam. Our goal is to animate the facial expressions of the target video by a source actor and re-render the manipulated output video in a photo-realistic fashion.
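
At the level of the parametric face model, reenactment boils down to keeping the target's identity, pose, and illumination while driving its expression with the source, as in the simplified sketch below (the actual method performs deformation transfer in a reduced expression sub-space and additionally synthesizes the mouth interior from the target video). The parameter names and dimensions are illustrative.

```python
import numpy as np

def transfer_expression(source_params, target_params):
    """Keep the target's identity, pose, and illumination; drive expression with the source."""
    reenacted = dict(target_params)                          # shallow per-frame copy
    reenacted["expression"] = source_params["expression"].copy()
    return reenacted

# Hypothetical per-frame parameter sets of a blendshape face model.
source = {"identity": np.zeros(80), "expression": np.random.rand(76),
          "pose": np.zeros(6), "illumination": np.zeros(27)}
target = {"identity": np.ones(80), "expression": np.zeros(76),
          "pose": np.zeros(6), "illumination": np.zeros(27)}
frame_params = transfer_expression(source, target)           # then re-render photo-realistically
```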

[Paper]  [Video]  [Bibtex] 

Real-Time Pixel Luminance Optimization for Dynamic Multi-Projection Mapping

Using projection mapping enables us to bring virtual worlds into shared physical spaces. In this paper, we present a novel, adaptable and real-time projection mapping system, which supports multiple projectors and high quality rendering of dynamic content on surfaces of complex geometrical shape. Our system allows for smooth blending across multiple projectors using a new optimization framework that simulates the diffuse direct light transport of the physical world to continuously adapt the color output of each projector pixel.
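
A toy version of the per-pixel optimization is sketched below: given a (here tiny and dense) diffuse light-transport matrix relating projector pixels to surface samples, we solve for bounded projector intensities that reproduce a target luminance via projected gradient descent. The real system uses a tailored real-time GPU solver and handles color, multiple projectors, and dynamic geometry; this is only a small offline stand-in.

```python
import torch

def optimize_projector_pixels(transport, target, iters=200, lr=0.5):
    """Find projector intensities x in [0, 1] such that transport @ x matches the target.

    transport: (S, P) diffuse light-transport matrix (surface samples x projector pixels)
    target:    (S,) desired luminance at the surface samples
    """
    x = torch.full((transport.shape[1],), 0.5, requires_grad=True)
    opt = torch.optim.SGD([x], lr=lr)
    for _ in range(iters):
        loss = ((transport @ x - target) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():
            x.clamp_(0.0, 1.0)          # projector output is physically bounded
    return x.detach()

# Two toy projectors with 64 pixels each, overlapping on 256 surface samples.
weights = optimize_projector_pixels(torch.rand(256, 2 * 64), torch.rand(256))
```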

[Paper]  [Video]  [Bibtex] 

Interactive Model-based Reconstruction of the Human Head using an RGB-D Sensor

We present a novel method for the interactive markerless reconstruction of human heads using a single commodity RGB-D sensor. Our entire reconstruction pipeline is implemented on the graphics processing unit and allows us to obtain high-quality reconstructions of the human head using an interactive and intuitive reconstruction paradigm.

[Paper]  [Video]  [Bibtex]