Are Diffusion Models Really Superior to GANs on Image Super Resolution?
Author(s): Valerii Startsev
Originally published on Towards AI.
Introduction
For over half a decade (2014-2020), generative adversarial networks (GANs) dominated generative modeling, including image super-resolution (ISR).
Their adversarial training framework, in which a generator and a discriminator network compete, excelled at generating high-resolution images from low-resolution counterparts by optimizing for perceptual quality.
However, their dominance started to fade in mid-2020, when denoising diffusion models (DDMs) began gaining traction thanks to a robust framework for image generation that could also model multi-modal data distributions, something GANs struggled with.
Moreover, by early 2021, diffusion models had become a dominant source of state-of-the-art methods in generative modeling, including ISR ([a] [b] [c]).
For instance, a comparison of GANs and diffusion models on two low-resolution images is depicted below, and more such results can be found in the literature.
While most studies indeed present diffusion models as the new gold standard in generative modeling (particularly for ISR), it's essential to scrutinize these claims closely.
The apparent success of diffusion models in outperforming GANs may not solely be attributed to their inherent strengths.
Instead, it could be a consequence of the increased scale in model architecture, extended training durations, and larger datasets used in recent research.
This raises a critical question:
Are diffusion models truly better suited for ISR, or are they simply benefiting from more extensive resources?
This is precisely what the Yandex Research team explored in a recent paper, and I want to share our findings in this blog.
This article is structured as follows:
- First, we'll cover ISR and the questions we want to answer.
- Next, we shall dive into the experimental setup and training methodologies we utilized in our paper.
- Finally, we shall look at the results, and I'll share our takeaways from each set of results.
TL;DR: We show that the prevailing narrative of diffusion superiority has been shaped by studies in which diffusion-based ISR models were given more extensive training and larger network architectures than their GAN counterparts ([a] [b] [c]), which raises questions about the source of their perceived advantage. We demonstrate that GAN-based models can achieve results on par with diffusion-based models under controlled settings, where both approaches are matched in architecture, model and dataset size, and computational budget.
Let's dive in!
Understanding ISR
As the name suggests, ISR aims to upscale low-resolution images to higher resolutions, recovering fine details and textures.
As discussed earlier, GANs have been the go-to method for this task for half a decade, leveraging adversarial training to produce sharp and visually appealing results.
However, GANs often struggle with complex multi-modal data and are difficult to train.
For more context, multi-modal refers to data with multiple distinct modes or types of distribution. In the context of image generation, this could mean producing various but equally plausible images from a single low-resolution input.
For instance, a single low-resolution image of a blurred landscape could be realistically interpreted in many ways (varying seasons, times of day, or weather conditions), each representing a different mode.
GANs, despite their strength in generating high-quality images, often struggle with this multi-modal nature.
Diffusion models are particularly well-suited to address the challenges posed by multi-modal data in image super-resolution (ISR).
How?
Unlike GANs, which rely on a single-shot generation process, diffusion models generate images through a sequential denoising process, as depicted below:
As shown above, in both models, the generation process begins with a random noise image.
- GANs attempt to generate the image in a single hop.
- Diffusion models, however, iteratively refine over multiple steps until a high-resolution image is produced. Each step incrementally improves the image quality by removing a small amount of noise, guided by the underlying data distribution learned during training.
The iterative approach provides a robust framework for image generation, which we also found to handle multi-modal data distributions reliably. Moreover, its denoising objective leads to a stable end-to-end training procedure.
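To make the contrast concrete, here is a toy sketch of the two generation regimes. The `Generator` and `Denoiser` networks and the 50-step update rule are placeholders for illustration, not the architectures or sampler used in the paper.

```python
# A toy sketch (not the paper's code): single-shot GAN generation vs. iterative
# diffusion denoising. The networks and the 50-step update rule are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Generator(nn.Module):
    """Single-shot GAN generator: low-res image in, high-res image out."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 3, kernel_size=3, padding=1)

    def forward(self, lr):
        up = F.interpolate(lr, scale_factor=4, mode="bilinear", align_corners=False)
        return self.net(up)

class Denoiser(nn.Module):
    """Diffusion denoiser: predicts the noise in a corrupted high-res image,
    conditioned on the (upsampled) low-res input concatenated along channels."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(6, 3, kernel_size=3, padding=1)

    def forward(self, x_t, lr_up, t):
        # t is ignored in this toy model; a real denoiser embeds the timestep.
        return self.net(torch.cat([x_t, lr_up], dim=1))

lr = torch.randn(1, 3, 64, 64)                        # low-resolution input

# GAN: a single forward pass produces the high-resolution estimate.
hr_gan = Generator()(lr)

# Diffusion: start from pure noise and refine over many small denoising steps.
denoiser = Denoiser()
lr_up = F.interpolate(lr, scale_factor=4, mode="bilinear", align_corners=False)
x = torch.randn(1, 3, 256, 256)
for t in reversed(range(50)):
    eps_hat = denoiser(x, lr_up, t)
    x = x - 0.02 * eps_hat   # toy update; a real sampler follows the learned noise schedule
hr_diffusion = x
```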
The problem
While it is evident that image generation with diffusion models is more computationally intensive, experiments suggest that it results in more diverse and high-quality outputs.
However, as we studied the literature closely, we began to wonder whether the comparisons underlying the claim that diffusion models outperform GANs were actually fair.
While diffusion models dominate state-of-the-art generative modeling, including ISR, there hasnβt been a comprehensive study that compares GAN and Diffusion models for ISR under controlled conditions.
The superiority of diffusion models may result from the increased scale and computational resources used to train them, rather than from the iterative denoising process itself.
Experimental Setup and Methodology
Our paper "Does Diffusion Beat GAN in Image Super Resolution?" sought to determine whether the superiority of diffusion models in ISR is due to their inherent capabilities or simply a matter of more extensive training and larger model sizes.
To understand this, we conducted a controlled comparison between GAN and diffusion models, ensuring both were matched in architecture size, dataset size, and computational budget.
Architecture details
In this study, we utilized an Efficient U-Net architecture, initially introduced in the Imagen model, for both GAN and diffusion-based super-resolution (SR) models.
The architecture maintains consistency across different resolutions using the same number of channels and residual blocks, as seen in the 256×256 to 1024×1024 super-resolution tasks.
The primary distinction between the GAN and diffusion SR models lies in the use of timestep embedding and noise concatenation, which are present in the diffusion model but absent in the GAN (due to the inherent design of diffusion models).
For experiments involving text-facilitated generations, we processed image captions through a text encoder in text-conditional models. Then, we integrated the text information into the SR model using scale-shift modulation and cross-attention. This process enabled the model to generate images that align with the provided text, making the outputs more contextually relevant.
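To illustrate the two injection mechanisms (scale-shift modulation and cross-attention), here is a rough sketch. Module names, dimensions, and the mean-pooling of text tokens are assumptions for illustration, not details from our implementation.

```python
# An illustrative sketch of the two injection mechanisms. Module names, dimensions,
# and mean-pooling of the text tokens are assumptions, not details from our code.
import torch
import torch.nn as nn

class TextConditionedBlock(nn.Module):
    def __init__(self, channels=128, text_dim=512, heads=4):
        super().__init__()
        # Scale-shift modulation: a pooled text embedding predicts per-channel scale/shift.
        self.to_scale_shift = nn.Linear(text_dim, 2 * channels)
        self.norm = nn.GroupNorm(8, channels)
        # Cross-attention: image tokens attend over the sequence of text tokens.
        self.attn = nn.MultiheadAttention(channels, heads, kdim=text_dim, vdim=text_dim,
                                          batch_first=True)

    def forward(self, feats, text_tokens):
        # feats: (B, C, H, W) image features; text_tokens: (B, L, text_dim) encoder outputs
        b, c, h, w = feats.shape
        scale, shift = self.to_scale_shift(text_tokens.mean(dim=1)).chunk(2, dim=-1)
        x = self.norm(feats) * (1 + scale[:, :, None, None]) + shift[:, :, None, None]

        tokens = x.flatten(2).transpose(1, 2)          # (B, H*W, C)
        attended, _ = self.attn(tokens, text_tokens, text_tokens)
        return (tokens + attended).transpose(1, 2).reshape(b, c, h, w)

block = TextConditionedBlock()
out = block(torch.randn(2, 128, 32, 32), torch.randn(2, 16, 512))
```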
Notably, the text-unconditional GAN model had 614 million trainable parameters, while the diffusion model, when conditioned on UMT5 embeddings, reached 696 million parameters, indicating that both models exhibited a similar scale.
In terms of the predictive process:
- The GAN-based model directly predicted high-resolution images from low-resolution ones.
- The Diffusion model iteratively predicted the noise to be applied to the corrupted high-resolution images, conditioned on the corresponding low-resolution inputs. This means that starting with a low-resolution image and Gaussian noise shaped like a high-resolution image, we can iteratively generate a high-resolution image.
Dataset preparation
We began with a vast proprietary dataset comprising billions of image-text pairs, initially collected for training image-text models. Next, we filtered this massive pool into a dataset suitable for training competitive super-resolution models in multiple steps:
- First, only images with a height or width of exactly 1024 pixels and no less than 1024 pixels in the other dimension were considered.
- A center crop was applied for non-square images to achieve a uniform 1024×1024 pixel resolution, balancing data quality with quantity.
Using a few more rigorous ML-based filtering processes, the dataset was reduced to a high-quality subset of 17 million images, each paired with a corresponding English caption.
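As a hypothetical illustration of the first two steps (the resolution filter and the center crop), here is a minimal sketch; the ML-based quality filters that followed are not shown.

```python
# A hypothetical version of the resolution filter and center crop described above;
# the ML-based quality filters that followed are not shown.
from PIL import Image

def passes_resolution_filter(img: Image.Image) -> bool:
    w, h = img.size
    # Keep images where one side is exactly 1024 px and the other is at least 1024 px.
    return (w == 1024 and h >= 1024) or (h == 1024 and w >= 1024)

def center_crop_1024(img: Image.Image) -> Image.Image:
    w, h = img.size
    left, top = (w - 1024) // 2, (h - 1024) // 2
    return img.crop((left, top, left + 1024, top + 1024))

# Usage (hypothetical file name):
# img = Image.open("sample.jpg")
# if passes_resolution_filter(img):
#     img = center_crop_1024(img)
```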
Training details
I want to highlight a few exciting details about the training procedure we utilized in our experimentation.
The generator of the GAN-based super-resolution model was initially pretrained with L1 loss only. We found this essential, since training from scratch with adversarial loss typically yields artifacts.
After pretraining, adversarial training was applied using non-saturating adversarial loss, which helps produce images with sharp edges and high-frequency details.
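For concreteness, here is a minimal sketch of the two training stages, with placeholder `generator` and `discriminator` networks; it shows the standard non-saturating formulation rather than our exact training code.

```python
# A minimal sketch of the two training stages, with placeholder `generator` and
# `discriminator` networks; it shows the standard non-saturating formulation,
# not our exact training code.
import torch.nn.functional as F

def l1_pretraining_loss(generator, lr, hr):
    # Stage 1: plain L1 regression; training with adversarial loss from scratch
    # typically yields artifacts, so this stage comes first.
    return F.l1_loss(generator(lr), hr)

def generator_adv_loss(discriminator, fake_hr):
    # Stage 2 (generator side): non-saturating GAN loss, -log D(G(x)).
    return F.softplus(-discriminator(fake_hr)).mean()

def discriminator_loss(discriminator, real_hr, fake_hr):
    # Stage 2 (discriminator side): classify real vs. generated high-res images.
    return (F.softplus(-discriminator(real_hr)).mean()
            + F.softplus(discriminator(fake_hr.detach())).mean())
```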
The diffusion model was trained with the ε-prediction objective, with timesteps sampled uniformly from [0, 1].
The training involved predicting noise added to high-resolution images, with the low-resolution images used as conditions.
This method relies on a variance-preserving noise schedule to ensure gradual refinement of the images.
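A single training step can be sketched as follows. The cosine-style variance-preserving schedule below is a common choice and an assumption on my part; it is not necessarily the exact schedule from the paper.

```python
# A sketch of one epsilon-prediction training step under a variance-preserving
# schedule with continuous timesteps in [0, 1]. The cosine-style schedule below
# is a common choice and an assumption, not necessarily the paper's exact one.
import math
import torch
import torch.nn.functional as F

def vp_alpha_sigma(t):
    # Variance-preserving: alpha(t)^2 + sigma(t)^2 = 1 for every t.
    return torch.cos(0.5 * math.pi * t), torch.sin(0.5 * math.pi * t)

def diffusion_training_step(denoiser, hr, lr_up):
    b = hr.shape[0]
    t = torch.rand(b, device=hr.device)                  # t ~ U[0, 1]
    alpha, sigma = vp_alpha_sigma(t)
    alpha, sigma = alpha.view(b, 1, 1, 1), sigma.view(b, 1, 1, 1)
    eps = torch.randn_like(hr)
    x_t = alpha * hr + sigma * eps                       # corrupt the high-res image
    eps_hat = denoiser(x_t, lr_up, t)                    # conditioned on the low-res input
    return F.mse_loss(eps_hat, eps)                      # epsilon-prediction objective
```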
Findings and Analysis
Published in our paper, these findings reveal several vital insights that challenge the prevailing narrative in the field.
As discussed earlier, while diffusion models have been dominant in their ability to handle complex, multi-modal data and produce high-quality outputs, the controlled experiments conducted in this study shed new light on the capabilities of GANs, particularly when scaled and trained under comparable conditions.
As experimental findings will reveal shortly in the article, GANs can achieve results comparable to Diffusion models across various ISR tasks when scaled appropriately.
Despite the perception that Diffusion models are superior, GANs performed similarly in terms of critical metrics like PSNR (Peak Signal-to-Noise Ratio), SSIM (Structural Similarity Index), LPIPS (Learned Perceptual Image Patch Similarity), and a recent no-reference CLIP-IQA metric.
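For reference, here is a rough sketch of how the full-reference metrics are typically computed with standard tooling (scikit-image and the `lpips` package); this is not the paper's evaluation code, and the no-reference CLIP-IQA metric is omitted.

```python
# A rough sketch of how the full-reference metrics are typically computed with
# standard tooling (scikit-image and the `lpips` package); this is not the
# paper's evaluation code, and the no-reference CLIP-IQA metric is omitted.
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_model = lpips.LPIPS(net="alex")  # perceptual metric backed by AlexNet features

def evaluate_pair(pred: np.ndarray, target: np.ndarray):
    # pred / target: HxWx3 uint8 arrays of the reconstruction and the ground truth.
    psnr = peak_signal_noise_ratio(target, pred, data_range=255)
    ssim = structural_similarity(target, pred, channel_axis=-1, data_range=255)

    # LPIPS expects NCHW float tensors scaled to [-1, 1].
    to_tensor = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    lpips_score = lpips_model(to_tensor(pred), to_tensor(target)).item()
    return psnr, ssim, lpips_score
```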
Let's dive in!
1) Training time and convergence
Notably (and here's the first key difference between the GAN and diffusion models), the pretraining and adversarial training for the GAN-based SR models required approximately three days to complete 140,000 iterations, while training the diffusion models, with the same resources, took around two weeks.
Of course, we optimized the memory consumption and facilitated working with larger batch sizes using techniques like Fully Sharded Data Parallel (FSDP).
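For context, wrapping a model with PyTorch's FSDP looks roughly like the snippet below; it shows the standard torch API, not our exact training configuration.

```python
# A minimal sketch of FSDP wrapping with the standard PyTorch API; this is not
# our exact training configuration.
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def wrap_model(model: nn.Module) -> FSDP:
    # Assumes the process group was already initialised, e.g.
    # dist.init_process_group("nccl") under torchrun.
    assert dist.is_initialized()
    # Each rank holds only a shard of parameters, gradients, and optimizer state,
    # freeing memory for larger batch sizes.
    return FSDP(model)
```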
However, despite this optimization, we observed that GANs converge faster than diffusion models in all stages (the L1 pretraining and the adversarial training stage).
The following table depicts this:
- GAN models stabilize after roughly 40k iterations.
- Diffusion models, in contrast, require about 620k iterations of L2 pretraining on cropped images to fully converge.
Takeaway
This is a revealing observation, indicating that GANs built at scales similar to diffusion models take less time to train and converge faster.
While this is good, we still wanted to verify whether faster convergence resulted in similar performance.
Subsequent sections discuss this.
2) Quantitative comparison
The following table from the paper presents a quantitative comparison between the GAN and diffusion SR models and the current state of the art on the 4x image super-resolution task:
From the above table, it is clear that in 3 out of 4 cases, GANs outperform diffusion models trained on a similar scale and computation budget.
Takeaway
Yet again, the insights are pretty revealing.
- GANs, when scaled appropriately and trained under comparable conditions to diffusion models, can not only match but, in many cases, exceed the performance of diffusion models in image super-resolution tasks.
- This finding, coupled with the training time and convergence takeaway, suggests that the faster convergence of GANs does not come at the cost of performance. It often leads to better or equivalent image quality compared to diffusion models.
Moreover, these findings challenge the growing perception that diffusion models are (somewhat) universally superior for all generative tasks.
3) Visual comparison
The figure below presents a visual comparison between the GAN and diffusion SR model from our work and the baselines on SR (the third column is for the diffusion model, and the fourth column is for GAN):
Zooming in on some specific instances, it is noticeable that the GAN (second column below) produces a slightly finer (or similar) level of detail compared to the diffusion model (first column below):
4) Text conditioning
Lastly, we also explored the impact of textual conditioning by integrating text captions into the models using two types of text encoders:
- Our internal CLIP-like model, XL
- A UMT5 encoder
The figure below depicts these results, where each bar corresponds to one of the three human annotators and shows how many times that annotator preferred the text-conditional model (XL, in green), the unconditional model (in blue), or marked both as equal (in orange).
While the above plot is for our internal CLIP-like model (XL), the figure below is a similar graphic for the UMT5 encoder:
Takeaway
This suggests that global-level semantic information from captions may be less beneficial for the ISR task.
In other words, additional text-conditioning does not noticeably improve image quality (as perceived by our annotators).
Conclusion
With this, we come to the end of this article distilling our recent paper, which presents a thorough, controlled comparison of GANs and diffusion models for ISR.
Our research suggests that GANs remain highly competitive for Image Super Resolution tasks thanks to faster training times and single-step inference capabilities.
Diffusion models, although powerful, require significantly more computational resources and training time to achieve similar performance levels.
These insights are crucial for practitioners and researchers.
They highlight the importance of fair and controlled comparisons when evaluating new methodologies.
As computational resources and data availability continue to grow, understanding the true strengths and limitations of different approaches becomes increasingly vital.
As always, thanks for reading!
References
[a] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement, 2021.
[b] Haoying Li, Yifan Yang, Meng Chang, Huajun Feng, Zhihai Xu, Qi Li, and Yueting Chen. SRDiff: Single image super-resolution with diffusion probabilistic models, 2021.
[c] Zongsheng Yue, Jianyi Wang, and Chen Change Loy. ResShift: Efficient diffusion model for image super-resolution by residual shifting, 2023.