Zero-to-Hero: Enhancing Zero-Shot Novel View Synthesis via Attention Map Filtering

¹Technion, ²UC Berkeley, ³NVIDIA

We introduce Zero-to-Hero, a test-time filtering method for attention maps that significantly enhances image-conditioned diffusion models.


Our method, focused on single-image novel view synthesis, not only outperforms strong baselines but also proves to be highly applicable to other tasks, including multi-view generation, and pose- and segmentation-conditioned text-to-image synthesis.



Abstract

Generating realistic images from arbitrary views based on a single source image remains a significant challenge in computer vision, with broad applications ranging from e-commerce to immersive virtual experiences. Recent diffusion models, particularly Zero-1-to-3, have been widely adopted for generating plausible views, videos, and 3D models. However, these models still struggle with inconsistency and implausibility in novel view generation, especially for challenging changes in viewpoint. In this work, we propose Zero-to-Hero, a novel test-time approach that enhances view synthesis by manipulating attention maps during the denoising process of Zero-1-to-3. By drawing an analogy between the denoising process and stochastic gradient descent (SGD), we implement a filtering mechanism that aggregates attention maps, enhancing generation reliability and authenticity. This process improves geometric consistency without requiring retraining or significant computational resources. Additionally, we modify the self-attention mechanism to integrate information from the source view, reducing shape distortions. These processes are further supported by a specialized sampling schedule. Experimental results demonstrate substantial improvements in fidelity and consistency, validated on a diverse set of out-of-distribution objects.

Attention Maps Need Denoising Too!

Attention layers are critical in shaping the structure and appearance of generated images. Noise present in the latent induces attention map noise that accumulates throughout the denoising process, resulting in visual artifacts.


Which Attention Maps Should We Denoise? (Spoiler: Self-Attention)

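The cross-attention of Zero-1-to-3 is conditioned on a single embedding, which yields a single key-value pair. Consequently, the softmax, normalizing over just one key, maps every score to 1. The resulting all-ones attention map carries no spatial information, so self-attention is the one worth denoising.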
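The degenerate cross-attention case can be checked numerically. The snippet below is a minimal NumPy sketch, not code from the Zero-1-to-3 model: the helper names, the 16 query positions, and the feature dimension of 8 are illustrative assumptions. It shows that a softmax taken over a single key returns a weight of 1 for every query, whatever the scores are.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Numerically stable softmax along the given axis.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def cross_attention_weights(queries, keys):
    # Scaled dot-product attention weights, normalized over the keys.
    dim = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(dim)
    return softmax(scores, axis=-1)

rng = np.random.default_rng(0)
queries = rng.normal(size=(16, 8))    # hypothetical: 16 spatial positions, dim 8
single_key = rng.normal(size=(1, 8))  # one conditioning embedding -> one key

weights = cross_attention_weights(queries, single_key)
# Each row is normalized over a single entry, so every weight equals 1.
```

Because every query receives the same weight, the cross-attention output is identical at all spatial positions, which is why this layer has no spatial structure to filter.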


Introducing: Attention Maps Filtering

Robustifying attention maps can reduce generation artifacts. Inspired by gradient aggregation and weight averaging in SGD, we view the denoising as an unrolled optimization, with attention maps as parameters in a score prediction model.


We collect a set of attention maps for each timestep through resampling, then aggregate the maps both within and across timesteps to refine the predictions. This process is training-free, resulting in more accurate maps and a more faithful generation.
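The two aggregation stages described above might be sketched as follows. This is an illustrative sketch only: the page does not specify the aggregation operators, so the plain mean within a timestep, the exponential moving average across timesteps, the `momentum` value, and all function names are assumptions. Random row-stochastic matrices stand in for real self-attention maps.

```python
import numpy as np

def aggregate_within_timestep(maps):
    # Assumption: average the maps obtained from several resampled passes
    # at the same timestep, then renormalize each row to sum to 1.
    mean_map = np.stack(maps, axis=0).mean(axis=0)
    return mean_map / mean_map.sum(axis=-1, keepdims=True)

def aggregate_across_timesteps(previous, current, momentum=0.5):
    # Assumption: blend with the running map from earlier timesteps using
    # an exponential moving average (in the spirit of weight averaging).
    if previous is None:
        return current
    blended = momentum * previous + (1.0 - momentum) * current
    return blended / blended.sum(axis=-1, keepdims=True)

def random_attention_map(rng, n_tokens):
    # Stand-in for a noisy self-attention map: softmax of random logits.
    logits = rng.normal(size=(n_tokens, n_tokens))
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
filtered = None
for timestep in range(3):
    # Hypothetical: 5 resampled passes per timestep, 4 tokens each.
    resampled = [random_attention_map(rng, 4) for _ in range(5)]
    per_step = aggregate_within_timestep(resampled)
    filtered = aggregate_across_timesteps(filtered, per_step)
```

Both stages are convex combinations of valid attention maps, so the filtered map stays non-negative with rows that sum to 1 and can be substituted for the original map without any retraining.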


Enforcing Appearance Consistency Between Source and Target Views

Using "Mutual Self-Attention" during the early stages of denoising, we propagate information from the input view to the generated view. Our entire pipeline is shown in the figure.
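One way to picture this step is the sketch below. It rests on assumptions that the page does not confirm: here the target queries attend to the target keys and values concatenated with those of the source view, and the switch back to ordinary self-attention is governed by a hypothetical `early_steps` threshold. Token counts and dimensions are arbitrary.

```python
import numpy as np

def softmax(scores, axis=-1):
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Standard scaled dot-product attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def mutual_self_attention(q_tgt, k_tgt, v_tgt, k_src, v_src, step, early_steps):
    # Assumption: during the first `early_steps` denoising steps, target
    # queries also see the source view's keys and values.
    if step < early_steps:
        k = np.concatenate([k_tgt, k_src], axis=0)
        v = np.concatenate([v_tgt, v_src], axis=0)
    else:
        k, v = k_tgt, v_tgt
    return attention(q_tgt, k, v)

rng = np.random.default_rng(0)
q_tgt, k_tgt, v_tgt = (rng.normal(size=(6, 8)) for _ in range(3))
k_src, v_src = (rng.normal(size=(6, 8)) for _ in range(2))

early = mutual_self_attention(q_tgt, k_tgt, v_tgt, k_src, v_src, step=0, early_steps=5)
late = mutual_self_attention(q_tgt, k_tgt, v_tgt, k_src, v_src, step=10, early_steps=5)
```

In this sketch the output shape is unchanged by the extra source tokens, so the modified layer remains a drop-in replacement; only the set of tokens each target position can draw from grows during the early steps.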


Zero-to-Hero is a Universal Generation Enhancement Tool 🔨

We applied Zero-to-Hero to three additional tasks: image generation conditioned on (1) pose and (2) segmentation maps, and (3) multi-view generation. In all three cases, it yields a significant performance boost.


BibTeX

@article{sobol2024zero2hero,
  author    = {Ido Sobol and Chenfeng Xu and Or Litany},
  title     = {Zero-to-Hero: Enhancing Zero-Shot Novel View Synthesis via Attention Map Filtering},
  journal   = {NeurIPS},
  year      = {2024},
}