Scale Space Diffusion

University of Maryland, College Park

TL;DR: Why use more pixels when less can do the trick? 💡

Left: Our proposed Scale Space Diffusion (SSD) fuses scale spaces into diffusion models. Along the y-axis, increasing diffusion noise removes fine facial details; along the x-axis, decreasing resolution in a Gaussian pyramid causes similar information loss. Right: Visualization of the inference process using SSD (Flexi-UNet, 6L).

Abstract

Diffusion models degrade images through noise, and reversing this process reveals an information hierarchy across timesteps. Scale-space theory exhibits a similar hierarchy via low-pass filtering. We formalize this connection and show that highly noisy diffusion states contain no more information than small, downsampled images, raising the question of why they must be processed at full resolution. To address this, we fuse scale spaces into the diffusion process by formulating a family of diffusion models with generalized linear degradations and practical implementations. Using downsampling as the degradation yields our proposed Scale Space Diffusion. To support Scale Space Diffusion, we introduce Flexi-UNet, a UNet variant that performs resolution-preserving and resolution-increasing denoising using only the necessary parts of the network. We evaluate our framework on CelebA and ImageNet and analyze its scaling behavior across resolutions and network depths.

Evolution of \(x_t\) over diffusion steps \(t\). We compare SSD (top row), which begins sampling at lower resolutions and progressively increases resolution, against DDPM (bottom row), which operates at the highest resolution throughout the sampling process. As shown above, for \( 64 \times 64 \) generation, SSD (Flexi-UNet, 4L) achieves a \( 1.58 \times \) inference speedup compared to DDPM.

Scale Spaces vis-à-vis Diffusion Timesteps

Scale spaces vis-a-vis diffusion timesteps
Information Analysis. (a) Amount of information present in a diffusion state as diffusion step \(t\) changes. (b) Amount of information present in images at various resolutions (scales).

SSD Overview

Scale Space Diffusion overview
Overview. During training, noisy states \( x_t \) at resolution \( r(t) \) are sampled using the forward process, and our model is trained to predict the clean image \( x_{0,\theta}^{r(t-1)} \). Our Scale Space Diffusion model processes both resolution-preserving and resolution-changing steps across multiple scales using only partial network activations.

Method

Scale Space Diffusion Formulation

Marginal distribution illustration

We map \(a_t\), a scalar value, to a matrix-based linear operator \(M_{1:t}\). \(M_t\) can be any linear operator; here we show \(M_t\) as a resize operator.

Choice of M_t illustration
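The forward process with a generalized linear degradation can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it assumes \(M_t\) is a \(2\times\) average-pooling resize, composes \(t\) of them to form \(M_{1:t}\), and adds Gaussian noise with a hypothetical fixed scale `sigma` (the paper's actual noise schedule and operator choice may differ).

```python
import numpy as np

def downsample2x(x):
    """Average-pool a square image by 2: one choice of the linear operator M_t."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def forward_state(x0, t, sigma=0.1, rng=None):
    """Sketch of a linear-degradation forward process:
    x_t = M_{1:t} x_0 + noise, where M_{1:t} composes t resize operators.
    Illustrative only; the schedule here is an assumption."""
    rng = np.random.default_rng(0) if rng is None else rng
    xt = x0
    for _ in range(t):
        xt = downsample2x(xt)          # apply M_t at this step
    return xt + sigma * rng.standard_normal(xt.shape)

x0 = np.random.default_rng(1).standard_normal((64, 64))
x2 = forward_state(x0, t=2)
print(x2.shape)  # each step halves the resolution: (16, 16)
```

With the standard scalar schedule, \(a_t\) would simply rescale `xt` in place of the `downsample2x` call, which is the mapping from scalars to matrix operators that the figure above illustrates.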

Architecture: Flexi-UNet

Flexi-UNet vs FullUnet
Architecture: Flexi-UNet vs Full UNet. While the Full UNet processes all timesteps through a fixed symmetric encoder-decoder path with equal input and output resolutions, Flexi-UNet dynamically activates resolution-dependent subsets of layers. Low-resolution inputs pass only through the deeper blocks, and resolution transitions are handled via asymmetric pathways.
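The "resolution-dependent subsets of layers" idea can be made concrete with a toy sketch. Everything below is illustrative: the class name, the level resolutions, and the selection rule are assumptions, not the actual Flexi-UNet code.

```python
class FlexiUNetSketch:
    """Toy sketch of resolution-dependent layer activation.
    The full UNet has encoder levels at resolutions 64, 32, 16, 8;
    a low-resolution input skips the shallower levels and enters
    the network at its matching depth (hypothetical structure)."""

    def __init__(self, levels=(64, 32, 16, 8)):
        self.levels = levels

    def active_levels(self, input_res):
        # Only levels at or below the input resolution are activated;
        # a 16x16 input never touches the 64- or 32-resolution blocks.
        return [r for r in self.levels if r <= input_res]

net = FlexiUNetSketch()
print(net.active_levels(64))  # full path: [64, 32, 16, 8]
print(net.active_levels(16))  # partial path for a 16x16 input: [16, 8]
```

This is why early, low-resolution sampling steps are cheap: most of the network's compute lives in the high-resolution blocks, which stay inactive until the state grows to match them.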

Results

Training time results
Training time results

Left: DDPM-\(x_0\) is a special case of SSD: SSD(1L). The table shows that SSD can save substantial training compute while achieving comparable FID. Right: SSD achieves substantially lower training time, with efficiency gains increasing at higher resolutions and with more levels.

Visual samples
Visual samples. Top: ImageNet-64 unconditional generation. For the top-most sample we also show model predictions at various scales (8, 16, 32, 64) during SSD. Bottom: CelebA-256 unconditional generation. For the top-most sample we also show model predictions at various scales (8, 16, 32, 64, 128, 256).

Visual results for SSD (Flexi-UNet, 3L) on CelebA-256

xt progression
Progression of noisy states \( x_t\) during generation using SSD (Flexi-UNet, 3L) on CelebA-256.
x0 progression
Progression of predicted clean images \( x_{0,\theta}^{r(t-1)} \) during generation using SSD (Flexi-UNet, 3L) on CelebA-256.
256 resolution generations
Generated samples using SSD (Flexi-UNet, 3L) on CelebA-256.

Visual results for SSD on CelebA-32 with alternate degradation

CelebA-32 generations
Animation of the predicted clean image \(x_{0,\theta}^{r(t-1)}\) over the generation process for a gradual-downsizing degradation operator (instead of stepwise \(2\times\) downsizing) in the SSD framework.
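The difference between stepwise and gradual downsizing comes down to the resolution schedule. The sketch below is an assumption about what such a schedule could look like (the paper's actual operator is not shown here): a geometric interpolation between start and end resolutions, with a nearest-neighbor resize standing in for a generic resize operator.

```python
import numpy as np

def resize_nn(x, out_h, out_w):
    """Nearest-neighbor resize: a simple stand-in for a generic
    (non power-of-two) resize operator."""
    h, w = x.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return x[np.ix_(rows, cols)]

def gradual_schedule(start=32, end=8, steps=8):
    """Geometric resolution schedule: many small downsizing steps
    instead of stepwise 2x jumps (illustrative values)."""
    sizes = np.geomspace(start, end, steps + 1)
    return [int(round(s)) for s in sizes]

print(gradual_schedule())
# → [32, 27, 23, 19, 16, 13, 11, 10, 8]
```

A stepwise \(2\times\) schedule would instead visit only 32, 16, and 8; the gradual variant changes resolution a little at every step, which matches the smoother animation shown above.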