Cafca: High-quality Novel View Synthesis of Expressive Faces from Casual Few-shot Face Captures

We compare variants of our prior model trained on a synthetic dataset, a real dataset, and a mixed dataset that combines real and synthetic images in a 50/50 split. All models are trained with the same total number of multiview frames (N=19,500).
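As a rough illustration of this setup, the sketch below builds the three training sets under one fixed frame budget. It is not the authors' data pipeline: the pool contents, the `build_training_set` helper, and the assumption that the 50/50 split is made per frame (rather than per subject) are all hypothetical.

```python
import random

TOTAL_FRAMES = 19_500  # frame budget stated in the text


def build_training_set(synthetic, real, synthetic_fraction,
                       total=TOTAL_FRAMES, seed=0):
    """Sample `total` frames, a given fraction of them from the synthetic pool.

    Assumption: the split is applied per frame; the source does not say
    whether it is per frame or per subject.
    """
    rng = random.Random(seed)
    n_syn = round(total * synthetic_fraction)
    n_real = total - n_syn
    if n_syn > len(synthetic) or n_real > len(real):
        raise ValueError("pool too small for the requested split")
    frames = rng.sample(synthetic, n_syn) + rng.sample(real, n_real)
    rng.shuffle(frames)  # interleave the two sources
    return frames


# Hypothetical frame pools; each entry stands in for one multiview frame.
synthetic_pool = [("syn", i) for i in range(30_000)]
real_pool = [("real", i) for i in range(30_000)]

variants = {
    "synthetic": build_training_set(synthetic_pool, real_pool, 1.0),
    "real": build_training_set(synthetic_pool, real_pool, 0.0),
    "mixed": build_training_set(synthetic_pool, real_pool, 0.5),
}
```

Holding the total fixed means that differences between the variants reflect the data source rather than the amount of data.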

The first row shows examples of the synthetic images used to train the prior model. The second row shows the initialization after warm-up and before finetuning (left) and the finetuned result (right). All finetuned results are generated from three inputs.

Ablations

Synthetic vs. Real Training Data

Synthetic

Real

Mixed (50/50)