TL;DR: Why use more pixels when fewer can do the trick? 💡
Left: Our proposed Scale Space Diffusion (SSD) fuses scale spaces into diffusion models. Along the y-axis, increasing diffusion noise removes fine facial details; along the x-axis, decreasing resolution in a Gaussian pyramid causes similar information loss. Right: Visualization of the inference process using SSD (Flexi-UNet, 6L).
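The x-axis of the figure traces a Gaussian pyramid: repeatedly low-pass filtering and subsampling an image discards fine detail at each level, mirroring the information loss from diffusion noise. A minimal sketch of that pyramid, using 2x2 average pooling as a simple low-pass filter (an illustrative stand-in, not the exact filter used in the paper):

```python
import numpy as np

def downsample2(img):
    """Halve resolution by 2x2 average pooling (a simple low-pass + subsample)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def gaussian_pyramid(img, levels):
    """Return [img, img/2, img/4, ...]; each level discards finer detail."""
    pyr = [img]
    for _ in range(levels - 1):
        pyr.append(downsample2(pyr[-1]))
    return pyr

rng = np.random.default_rng(0)
img = rng.random((64, 64))
pyr = gaussian_pyramid(img, 4)
print([p.shape for p in pyr])  # [(64, 64), (32, 32), (16, 16), (8, 8)]
```

Each level holds a quarter as many pixels as the one before it, which is exactly the redundancy SSD exploits for highly noisy timesteps.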
Diffusion models degrade images through noise, and reversing this process reveals an information hierarchy across timesteps. Scale-space theory exhibits a similar hierarchy via low-pass filtering. We formalize this connection and show that highly noisy diffusion states carry no more information than small, downsampled images, raising the question of why they must be processed at full resolution. To address this, we fuse scale spaces into the diffusion process by formulating a family of diffusion models with generalized linear degradations and practical implementations. Using downsampling as the degradation yields our proposed Scale Space Diffusion. To support Scale Space Diffusion, we introduce Flexi-UNet, a UNet variant that performs resolution-preserving and resolution-increasing denoising using only the necessary parts of the network. We evaluate our framework on CelebA and ImageNet and analyze its scaling behavior across resolutions and network depths.
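The generalized forward process described above combines additive noise with a cumulative linear degradation: at each timestep the state is degraded by an operator \(M_t\) and then noised. A minimal sketch, assuming \(M_t\) downsamples by 2x2 average pooling at a chosen subset of timesteps and is the identity otherwise (the step schedule and noise levels here are illustrative placeholders, not the paper's actual schedule):

```python
import numpy as np

def apply_M(x, t, downsample_steps):
    """M_t: 2x2 average pooling at selected timesteps, identity otherwise."""
    if t in downsample_steps:
        h, w = x.shape
        x = x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    return x

def forward(x0, T, sigmas, downsample_steps, rng):
    """Sample x_t = M_{1:t} x_0 + sigma_t * eps for t = 1..T."""
    x = x0
    traj = []
    for t in range(1, T + 1):
        x = apply_M(x, t, downsample_steps)  # accumulate M_{1:t}
        traj.append(x + sigmas[t - 1] * rng.standard_normal(x.shape))
    return traj

rng = np.random.default_rng(0)
x0 = rng.random((16, 16))
traj = forward(x0, T=4, sigmas=[0.1, 0.2, 0.4, 0.8], downsample_steps={2, 4}, rng=rng)
print([x.shape for x in traj])  # [(16, 16), (8, 8), (8, 8), (4, 4)]
```

Setting `downsample_steps` to the empty set recovers a standard noise-only forward process at fixed resolution; with downsampling enabled, late (noisy) timesteps live on small grids, which is where the compute savings come from.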
We map the scalar \(a_t\) to a matrix-based linear operator \(M_{1:t}\). \(M_t\) can be any linear operator; here we instantiate \(M_t\) as a resize operator.
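To make the scalar-to-operator generalization concrete, a resize operator can be written explicitly as a matrix, and the cumulative operator composes as \(M_{1:t} = M_t \cdots M_1\). A small 1D sketch (an illustrative construction; the paper's actual resize operator and its implementation may differ):

```python
import numpy as np

def avg_pool_matrix(n):
    """Matrix form of 1D 2x average pooling: maps a length-n signal to length n//2."""
    M = np.zeros((n // 2, n))
    for i in range(n // 2):
        M[i, 2 * i] = 0.5
        M[i, 2 * i + 1] = 0.5
    return M

# Compose M_{1:3} = M_3 @ M_2 @ M_1, each M_t halving the resolution.
n, size, Ms = 8, 8, []
for _ in range(3):
    Ms.append(avg_pool_matrix(size))
    size //= 2
M_1t = Ms[2] @ Ms[1] @ Ms[0]

x0 = np.arange(n, dtype=float)
print(M_1t.shape)     # (1, 8)
print(M_1t @ x0)      # [3.5] -- three halvings of 8 samples reduce to their mean
```

In the scalar DDPM case, \(M_t = a_t I\) and the composition collapses back to multiplication by \(\prod_s a_s\); swapping in a non-square matrix is what lets the state change resolution over time.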
Left: DDPM-\(x_0\) is a special case of SSD, namely SSD(1L). The table shows that SSD saves substantial training compute while achieving comparable FID. Right: SSD trains substantially faster, with efficiency gains that grow at higher resolutions and with more levels.