The main theme of my work is to capture and (re-)synthesize the real world using commodity hardware. It includes the modeling of the human body, tracking, as well as the reconstruction of and interaction with the environment. The digitization is needed for various applications in AR/VR as well as in movie (post-)production. Teleconferencing and working in VR are of high interest to many companies, ranging from social media platforms to car manufacturers. It enables remote interaction in VR, e.g., the inspection of 3D content like CAD models or scans of real objects. A realistic reproduction of appearances and motions is key for such applications. Thus, my work is closely related to photo-realistic video synthesis and editing. The development of algorithms for photo-realistic creation or editing of image content comes with a certain responsibility, since the generation of photo-realistic imagery can be misused. That's why I'm also working on the detection of synthetic or manipulated images and videos (Digital Multimedia Forensics).


FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models

We introduce FaceTalk, a novel generative approach designed for synthesizing high-fidelity 3D motion sequences of talking human heads from an input audio signal. To capture the expressive, detailed nature of human heads, including hair, ears, and finer-scale eye movements, we propose to couple the speech signal with the latent space of neural parametric head models to create high-fidelity, temporally coherent motion sequences.

[Paper]  [Video]  [Bibtex] 

GAN-Avatar: Controllable Personalized GAN-based Human Head Avatars

We propose to learn person-specific animatable avatars from images without assuming access to precise facial expression tracking. At the core of our method, we leverage a 3D-aware generative model that is trained to reproduce the distribution of facial expressions from the training data.

[Paper]  [Video]  [Bibtex] 


Imitator: Personalized Speech-driven 3D Facial Animation

We present Imitator, a speech-driven facial expression synthesis method, which learns identity-specific details from a short input video and produces novel facial expressions matching the identity-specific speaking style and facial idiosyncrasies of the target actor. Specifically, we train a style-agnostic transformer on a large facial expression dataset which we use as a prior for audio-driven facial expressions. Based on this prior, we optimize for identity-specific speaking style based on a short reference video.

[Paper]  [Video]  [Bibtex] 

CaPhy: Capturing Physical Properties for Animatable Human Avatars

We present CaPhy, a novel method for reconstructing animatable human avatars with realistic dynamic properties for clothing. Specifically, we aim to capture the geometric and physical properties of the clothing from real observations. This allows us to apply novel poses to the human avatar with physically correct deformations and wrinkles of the clothing.

[Paper]  [Video]  [Bibtex] 

TeCH: Text-guided Reconstruction of Lifelike Clothed Humans

TeCH reconstructs the 3D human by leveraging 1) descriptive text prompts (e.g., garments, colors, hairstyles) which are automatically generated via a garment parsing model and Visual Question Answering (VQA), and 2) a personalized fine-tuned Text-to-Image diffusion model (T2I) which learns the "indescribable" appearance.

[Paper]  [Video]  [Bibtex] 


MICA: Towards Metrical Reconstruction of Human Faces

Face reconstruction and tracking are building blocks of numerous applications in AR/VR, human-machine interaction, as well as medical applications. Most of these applications rely on a metrically correct prediction of the shape, especially when the reconstructed subject is put into a metrical context. Thus, we present MICA, a novel metrical face reconstruction method that combines face recognition with supervised face shape learning.

[Paper]  [Video]  [Bibtex] 

Texturify: Generating Textures on 3D Shape Surfaces

Texturify learns to generate geometry-aware textures for untextured collections of 3D objects. Our method trains from only a collection of images and a collection of untextured shapes, which are both often available, without requiring any explicit 3D color supervision or shape-image correspondence. Textures are created directly on the surface of a given 3D shape, enabling generation of high-quality, compelling textured 3D shapes.

[Paper]  [Video]  [Bibtex] 

Neural Head Avatars from Monocular RGB Videos

We present Neural Head Avatars, a novel neural representation that explicitly models the surface geometry and appearance of an animatable human avatar using a deep neural network. Specifically, we propose a hybrid representation consisting of a morphable model for the coarse shape and expressions of the face, and two feed-forward networks, predicting vertex offsets of the underlying mesh as well as a view- and expression-dependent texture.

[Paper]  [Video]  [Bibtex] 

Mover: Human-Aware Object Placement for Visual Environment Reconstruction

We demonstrate that human-scene interactions (HSIs) can be leveraged to improve the 3D reconstruction of a scene from a monocular RGB video. Our key idea is that, as a person moves through a scene and interacts with it, we accumulate HSIs across multiple input images, and optimize the 3D scene to reconstruct a consistent, physically plausible and functional 3D scene layout.

[Paper]  [Video]  [Bibtex] 

Advances in Neural Rendering

This state-of-the-art report on advances in neural rendering focuses on methods that combine classical rendering principles with learned 3D scene representations, often now referred to as neural scene representations. A key advantage of these methods is that they are 3D-consistent by design, enabling applications such as novel viewpoint synthesis of a captured scene.

[Paper]  [Bibtex] 


TransformerFusion: Monocular RGB Scene Reconstruction using Transformers

We introduce TransformerFusion, a transformer-based 3D scene reconstruction approach. From an input monocular RGB video, the video frames are processed by a transformer network that fuses the observations into a volumetric feature grid representing the scene; this feature grid is then decoded into an implicit 3D scene representation.

[Paper]  [Video]  [Bibtex] 

NerFACE: Dynamic Neural Radiance Fields for Monocular 4D Facial Avatar Reconstruction

We present dynamic neural radiance fields for modeling the appearance and dynamics of a human face. To handle the dynamics of the face, we combine our scene representation network with a low-dimensional morphable model which provides explicit control over pose and expressions. We use volumetric rendering to generate images from this hybrid representation and demonstrate that such a dynamic neural scene representation can be learned from monocular input data only, without the need for a specialized capture setup.

[Paper]  [Video]  [Bibtex] 
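
As a rough illustration of the volumetric rendering step mentioned in the NerFACE summary above, the sketch below alpha-composites samples along a single camera ray using the standard radiance-field quadrature. It is a minimal sketch, not the paper's implementation: the function name `composite_ray` and its arguments are hypothetical, and the densities and colors are assumed to come from some scene representation network (in NerFACE, one conditioned on pose and expression parameters).

```python
import math


def composite_ray(densities, colors, deltas):
    """Alpha-composite samples along one ray (hypothetical helper).

    densities: per-sample volume density (sigma), ordered front to back
    colors:    per-sample RGB triples
    deltas:    distance between consecutive samples along the ray
    """
    rgb = [0.0, 0.0, 0.0]
    weights = []
    transmittance = 1.0  # fraction of light that has not been absorbed yet
    for sigma, color, delta in zip(densities, colors, deltas):
        # Opacity contributed by this ray segment.
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        weights.append(weight)
        for c in range(3):
            rgb[c] += weight * color[c]
        # Light remaining for the samples behind this one.
        transmittance *= 1.0 - alpha
    return rgb, weights
```

Because the compositing weights are differentiable in the densities and colors, an image loss on the rendered pixel can in principle be back-propagated to whatever network produced them.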

Neural Deformation Graphs for Globally-consistent Non-rigid Reconstruction

We introduce Neural Deformation Graphs for globally-consistent deformation tracking and 3D reconstruction of non-rigid objects. Specifically, we implicitly model a deformation graph via a deep neural network. This neural deformation graph does not rely on any object-specific structure and, thus, can be applied to general non-rigid deformation tracking.

[Paper]  [Video]  [Bibtex] 


Neural Non-Rigid Tracking

We introduce a novel, end-to-end learnable, differentiable non-rigid tracker that enables state-of-the-art non-rigid reconstruction. By enabling gradient back-propagation through a non-rigid as-rigid-as-possible optimization solver, we are able to learn correspondences in an end-to-end manner such that they are optimal for the task of non-rigid tracking.

[Paper]  [Video]  [Bibtex] 
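
To make the as-rigid-as-possible (ARAP) term mentioned in the summary above more concrete, the sketch below evaluates a classic ARAP energy on a small deformation graph. It is a simplified illustration under stated assumptions, not the paper's solver: it works in 2D with one rotation angle per node, the function name `arap_energy` and its arguments are hypothetical, and it only evaluates the energy rather than differentiating through an optimizer.

```python
import math


def arap_energy(nodes, edges, angles, translations):
    """ARAP energy of a 2D deformation graph (hypothetical helper).

    nodes:        rest positions (x, y) of the graph nodes
    edges:        directed (i, j) pairs of neighboring nodes
    angles:       per-node rotation angle in radians
    translations: per-node translation (tx, ty)
    """
    energy = 0.0
    for i, j in edges:
        c, s = math.cos(angles[i]), math.sin(angles[i])
        ex = nodes[j][0] - nodes[i][0]
        ey = nodes[j][1] - nodes[i][1]
        # Where node i's local rigid motion predicts node j should land.
        px = c * ex - s * ey + nodes[i][0] + translations[i][0]
        py = s * ex + c * ey + nodes[i][1] + translations[i][1]
        # Where node j actually lands under its own translation.
        qx = nodes[j][0] + translations[j][0]
        qy = nodes[j][1] + translations[j][1]
        energy += (px - qx) ** 2 + (py - qy) ** 2
    return energy
```

The energy is zero for any globally rigid motion and grows as neighboring nodes disagree about the local transformation, which is why it is commonly used as a regularizer in non-rigid tracking.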

State of the Art on Neural Rendering

Neural rendering is a new and rapidly emerging field that combines generative machine learning techniques with physical knowledge from computer graphics, e.g., by the integration of differentiable rendering into network training. This state-of-the-art report summarizes the recent trends and applications of neural rendering.

[Paper]  [Bibtex] 

Adversarial Texture Optimization from RGB-D Scans

We present a novel approach for color texture generation using a conditional adversarial loss obtained from weakly-supervised views. Specifically, we propose an approach to produce photorealistic textures for approximate surfaces, even from misaligned images, by learning an objective function that is robust to these errors.

[Paper]  [Video]  [Bibtex] 


Learning to Detect Manipulated Facial Images

In this paper, we examine the realism of state-of-the-art facial image manipulation methods, and how difficult it is to detect them, either automatically or by humans. In particular, we create a dataset that is focused on DeepFakes, Face2Face, FaceSwap, and Neural Textures as prominent representatives for facial manipulations.

[Paper]  [Video]  [Bibtex] 

Deferred Neural Rendering: Image Synthesis using Neural Textures

Deferred Neural Rendering is a new paradigm for image synthesis that combines the traditional graphics pipeline with learnable Neural Textures. Both the neural textures and the deferred neural renderer are trained end-to-end, enabling us to synthesize photo-realistic images even when the original 3D content was imperfect.

[Paper]  [Video]  [Bibtex] 

DeepVoxels: Learning Persistent 3D Feature Embeddings

In this work, we address the lack of 3D understanding of generative neural networks by introducing a persistent 3D feature embedding for view synthesis. To this end, we propose DeepVoxels, a learned representation that encodes the view-dependent appearance of a 3D object without having to explicitly model its geometry.

[Paper]  [Video]  [Bibtex] 

Research Highlight: Face2Face

Research highlight of the Face2Face approach featured on the cover of Communications of the ACM in January 2019. Face2Face is an approach for real-time facial reenactment of a monocular target video. The method had significant impact in the research community and far beyond: it won several awards, e.g., the SIGGRAPH ETech Best in Show Award; it was featured in countless media articles, e.g., NYT, WSJ, Spiegel, etc.; and it had a massive reach on social media with millions of views.

[Paper]  [Video]  [Bibtex] 


ForensicTransfer: Weakly-supervised Domain Adaptation for Forgery Detection

ForensicTransfer tackles two challenges in multimedia forensics. First, we devise a learning-based forensic detector which adapts well to new domains, i.e., novel manipulation methods. Second, we handle scenarios where only a handful of fake examples are available during training.

[Paper]  [Bibtex] 

Deep Video Portraits

Our novel approach enables photo-realistic re-animation of portrait videos using only an input video. The core of our approach is a generative neural network with a novel space-time architecture. The network takes as input synthetic renderings of a parametric face model, based on which it predicts photo-realistic video frames for a given target actor.

[Paper]  [Video]  [Bibtex] 

HeadOn: Real-time Reenactment of Human Portrait Videos

HeadOn is the first real-time reenactment approach for complete human portrait videos that enables transfer of torso and head motion, facial expression, and eye gaze. Given a short RGB-D video of the target actor, we automatically construct a personalized geometry proxy that embeds a parametric head, eye, and kinematic torso model. A novel reenactment algorithm employs this proxy to map the captured motion from the source to the target actor.

[Paper]  [Video]  [Bibtex]