Are Diffusion Models Really Superior to GANs on Image Super Resolution?
Author(s): Valerii Startsev
Originally published on Towards AI.
Introduction
For over half a decade (2014-2020), generative adversarial networks (GANs) dominated generative modeling, including image super-resolution (ISR).
Their adversarial training framework, in which a generator and a discriminator network compete, excelled at generating high-resolution images from low-resolution counterparts by optimizing for perceptual quality.
However, their dominance started to fade in mid-2020, when denoising diffusion models (DDMs) began gaining traction thanks to a robust framework for image generation that could also model multi-modal data distributions, something GANs struggled with.
Moreover, by early 2021, diffusion models had become a dominant source of state-of-the-art methods in generative modeling, including ISR ([a] [b] [c]).
For instance, a comparison of GANs and diffusion models on two low-resolution images is depicted below, and more such results can be found in the literature.
While most studies indeed present diffusion models as the new gold standard in generative modeling (particularly for ISR), it's essential to scrutinize these claims closely.
The apparent success of diffusion models in outperforming GANs may not solely be attributed to their inherent strengths.
Instead, it could be a consequence of the increased scale in model architecture, extended training durations, and larger datasets used in recent research.
This raises a critical question:
Are diffusion models truly better suited for ISR, or are they simply benefiting from more extensive resources?
This is precisely what the Yandex Research team explored in a recent paper, and I want to share our findings in this blog.
This article is structured as follows:
- First, we'll cover ISR and the questions we want to answer.
- Next, we shall dive into the experimental setup and training methodologies we utilized in our paper.
- Finally, we shall look at the results, and I'll share our takeaways from each set of results.
TL;DR: We show that the prevailing narrative of diffusion superiority has been shaped by studies in which diffusion-based ISR models were given more extensive training and larger network architectures than their GAN counterparts ([a] [b] [c]), which raises questions about the source of their perceived advantage. We demonstrate that GAN-based models can achieve results on par with diffusion-based models under controlled settings, where both approaches are matched in architecture, model and dataset size, and computational budget.
Let's dive in!
Understanding ISR
As the name suggests, ISR aims to upscale low-resolution images to higher resolutions, recovering fine details and textures.
As discussed earlier, GANs have been the go-to method for this task for half a decade, leveraging adversarial training to produce sharp and visually appealing results.
However, GANs often struggle with complex multi-modal data and are difficult to train.
For more context, multi-modal refers to data with multiple distinct modes or types of distribution. In the context of image generation, this could mean producing various but equally plausible images from a single low-resolution input.
For instance, a single low-resolution image of a blurred landscape could be realistically interpreted in many ways (varying seasons, times of day, or weather conditions), each representing a different mode.
GANs, despite their strength in generating high-quality images, often struggle with this multi-modal nature.
Diffusion models are particularly well-suited to address the challenges posed by multi-modal data in image super-resolution (ISR).
How?
Unlike GANs, which rely on a single-shot generation process, diffusion models generate images through a sequential denoising process, as depicted below:
As shown above, in both models, the generation process begins with a random noise image.
- GANs attempt to generate the image in a single hop.
- Diffusion models, however, iteratively refine over multiple steps until a high-resolution image is produced. Each step incrementally improves the image quality by removing a small amount of noise, guided by the underlying data distribution learned during training.
The iterative approach provides a robust framework for image generation, which we also found to handle multi-modal data distributions reliably. Moreover, its denoising objective leads to a stable end-to-end training procedure.
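To make the contrast concrete, here is a toy sketch of the two generation regimes. The `Generator` and `Denoiser` networks and the 50-step update rule are placeholders for illustration, not the architectures or sampler used in the paper.

```python
# A toy sketch (not the paper's code): single-shot GAN generation vs. iterative
# diffusion denoising. The networks and the 50-step update rule are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Generator(nn.Module):
    """Single-shot GAN generator: low-res image in, high-res image out."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 3, kernel_size=3, padding=1)

    def forward(self, lr):
        up = F.interpolate(lr, scale_factor=4, mode="bilinear", align_corners=False)
        return self.net(up)

class Denoiser(nn.Module):
    """Diffusion denoiser: predicts the noise in a corrupted high-res image,
    conditioned on the (upsampled) low-res input concatenated along channels."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(6, 3, kernel_size=3, padding=1)

    def forward(self, x_t, lr_up, t):
        # t is ignored in this toy model; a real denoiser embeds the timestep.
        return self.net(torch.cat([x_t, lr_up], dim=1))

lr = torch.randn(1, 3, 64, 64)                        # low-resolution input

# GAN: a single forward pass produces the high-resolution estimate.
hr_gan = Generator()(lr)

# Diffusion: start from pure noise and refine over many small denoising steps.
denoiser = Denoiser()
lr_up = F.interpolate(lr, scale_factor=4, mode="bilinear", align_corners=False)
x = torch.randn(1, 3, 256, 256)
for t in reversed(range(50)):
    eps_hat = denoiser(x, lr_up, t)
    x = x - 0.02 * eps_hat   # toy update; a real sampler follows the learned noise schedule
hr_diffusion = x
```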
The problem
While it is evident that image generation with diffusion models is more computationally intensive, experiments suggest that it results in more diverse and high-quality outputs.
However, as we studied the literature closely, we began to wonder whether the comparisons underlying the claim that diffusion models outperform GANs were actually fair.
While diffusion models dominate state-of-the-art generative modeling, including ISR, there hasnβt been a comprehensive study that compares GAN and Diffusion models for ISR under controlled conditions.
The superiority of diffusion models may result from the increased scale and computational resources used to train them, rather than from the iterative denoising process itself.
Experimental Setup and Methodology
Our paper "Does Diffusion Beat GAN in Image Super Resolution?" sought to determine whether the superiority of diffusion models in ISR is due to their inherent capabilities or simply a matter of more extensive training and larger model sizes.
To understand this, we conducted a controlled comparison between GAN and diffusion models, ensuring both were matched in architecture size, dataset size, and computational budget.
Architecture details
In this study, we utilized an Efficient U-Net architecture, initially introduced in the Imagen model, for both GAN and diffusion-based super-resolution (SR) models.
The architecture maintains consistency across different resolutions using the same number of channels and residual blocks, as seen in the 256×256 to 1024×1024 super-resolution tasks.
The primary distinction between the GAN and diffusion SR models lies in the use of timestep embedding and noise concatenation, which are present in the diffusion model but absent in the GAN (due to the inherent design of diffusion models).
For experiments involving text-facilitated generations, we processed image captions through a text encoder in text-conditional models. Then, we integrated the text information into the SR model using scale-shift modulation and cross-attention. This process enabled the model to generate images that align with the provided text, making the outputs more contextually relevant.
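To illustrate the two injection mechanisms (scale-shift modulation and cross-attention), here is a rough sketch. Module names, dimensions, and the mean-pooling of text tokens are assumptions for illustration, not details from our implementation.

```python
# An illustrative sketch of the two injection mechanisms. Module names, dimensions,
# and mean-pooling of the text tokens are assumptions, not details from our code.
import torch
import torch.nn as nn

class TextConditionedBlock(nn.Module):
    def __init__(self, channels=128, text_dim=512, heads=4):
        super().__init__()
        # Scale-shift modulation: a pooled text embedding predicts per-channel scale/shift.
        self.to_scale_shift = nn.Linear(text_dim, 2 * channels)
        self.norm = nn.GroupNorm(8, channels)
        # Cross-attention: image tokens attend over the sequence of text tokens.
        self.attn = nn.MultiheadAttention(channels, heads, kdim=text_dim, vdim=text_dim,
                                          batch_first=True)

    def forward(self, feats, text_tokens):
        # feats: (B, C, H, W) image features; text_tokens: (B, L, text_dim) encoder outputs
        b, c, h, w = feats.shape
        scale, shift = self.to_scale_shift(text_tokens.mean(dim=1)).chunk(2, dim=-1)
        x = self.norm(feats) * (1 + scale[:, :, None, None]) + shift[:, :, None, None]

        tokens = x.flatten(2).transpose(1, 2)          # (B, H*W, C)
        attended, _ = self.attn(tokens, text_tokens, text_tokens)
        return (tokens + attended).transpose(1, 2).reshape(b, c, h, w)

block = TextConditionedBlock()
out = block(torch.randn(2, 128, 32, 32), torch.randn(2, 16, 512))
```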
Notably, the text-unconditional GAN model had 614 million trainable parameters, while the diffusion model, when conditioned on UMT5 embeddings, reached 696 million parameters, indicating that both models exhibited a similar scale.
In terms of the predictive process:
- The GAN-based model directly predicted high-resolution images from low-resolution ones.
- The Diffusion model iteratively predicted the noise to be applied to the corrupted high-resolution images, conditioned on the corresponding low-resolution inputs. This means that starting with a low-resolution image and Gaussian noise shaped like a high-resolution image, we can iteratively generate a high-resolution image.
Dataset preparation
We began with a vast proprietary dataset comprising billions of image-text pairs, initially collected for training image-text models. Next, we filtered this massive pool into a dataset suitable for training competitive super-resolution models in multiple steps:
- First, only images with a height or width of exactly 1024 pixels and no less than 1024 pixels in the other dimension were considered.
- A center crop was applied for non-square images to achieve a uniform 1024×1024 pixel resolution, balancing data quality with quantity.
Using a few more rigorous ML-based filtering processes, the dataset was reduced to a high-quality subset of 17 million images, each paired with a corresponding English caption.
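As a hypothetical illustration of the first two steps (the resolution filter and the center crop), here is a minimal sketch; the ML-based quality filters that followed are not shown.

```python
# A hypothetical version of the resolution filter and center crop described above;
# the ML-based quality filters that followed are not shown.
from PIL import Image

def passes_resolution_filter(img: Image.Image) -> bool:
    w, h = img.size
    # Keep images where one side is exactly 1024 px and the other is at least 1024 px.
    return (w == 1024 and h >= 1024) or (h == 1024 and w >= 1024)

def center_crop_1024(img: Image.Image) -> Image.Image:
    w, h = img.size
    left, top = (w - 1024) // 2, (h - 1024) // 2
    return img.crop((left, top, left + 1024, top + 1024))

# Usage (hypothetical file name):
# img = Image.open("sample.jpg")
# if passes_resolution_filter(img):
#     img = center_crop_1024(img)
```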
Training details
I want to highlight a few exciting details about the training procedure we utilized in our experimentation.
The generator of the GAN-based super-resolution model was initially pretrained with L1 loss only. We found this essential, since training from scratch with adversarial loss typically yields artifacts.
After pretraining, adversarial training was applied using non-saturating adversarial loss, which helps produce images with sharp edges and high-frequency details.
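For concreteness, here is a minimal sketch of the two training stages, with placeholder `generator` and `discriminator` networks; it shows the standard non-saturating formulation rather than our exact training code.

```python
# A minimal sketch of the two training stages, with placeholder `generator` and
# `discriminator` networks; it shows the standard non-saturating formulation,
# not our exact training code.
import torch.nn.functional as F

def l1_pretraining_loss(generator, lr, hr):
    # Stage 1: plain L1 regression; training with adversarial loss from scratch
    # typically yields artifacts, so this stage comes first.
    return F.l1_loss(generator(lr), hr)

def generator_adv_loss(discriminator, fake_hr):
    # Stage 2 (generator side): non-saturating GAN loss, -log D(G(x)).
    return F.softplus(-discriminator(fake_hr)).mean()

def discriminator_loss(discriminator, real_hr, fake_hr):
    # Stage 2 (discriminator side): classify real vs. generated high-res images.
    return (F.softplus(-discriminator(real_hr)).mean()
            + F.softplus(discriminator(fake_hr.detach())).mean())
```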
The diffusion model was trained with the ε-prediction objective, with timesteps sampled uniformly from [0, 1].
The training involved predicting noise added to high-resolution images, with the low-resolution images used as conditions.
This method relies on a variance-preserving noise schedule to ensure gradual refinement of the images.
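A single training step can be sketched as follows. The cosine-style variance-preserving schedule below is a common choice and an assumption on my part; it is not necessarily the exact schedule from the paper.

```python
# A sketch of one epsilon-prediction training step under a variance-preserving
# schedule with continuous timesteps in [0, 1]. The cosine-style schedule below
# is a common choice and an assumption, not necessarily the paper's exact one.
import math
import torch
import torch.nn.functional as F

def vp_alpha_sigma(t):
    # Variance-preserving: alpha(t)^2 + sigma(t)^2 = 1 for every t.
    return torch.cos(0.5 * math.pi * t), torch.sin(0.5 * math.pi * t)

def diffusion_training_step(denoiser, hr, lr_up):
    b = hr.shape[0]
    t = torch.rand(b, device=hr.device)                  # t ~ U[0, 1]
    alpha, sigma = vp_alpha_sigma(t)
    alpha, sigma = alpha.view(b, 1, 1, 1), sigma.view(b, 1, 1, 1)
    eps = torch.randn_like(hr)
    x_t = alpha * hr + sigma * eps                       # corrupt the high-res image
    eps_hat = denoiser(x_t, lr_up, t)                    # conditioned on the low-res input
    return F.mse_loss(eps_hat, eps)                      # epsilon-prediction objective
```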
Findings and Analysis
Published in our paper, these findings reveal several vital insights that challenge the prevailing narrative in the field.
As discussed earlier, while diffusion models have been dominant in their ability to handle complex, multi-modal data and produce high-quality outputs, the controlled experiments conducted in this study shed new light on the capabilities of GANs, particularly when scaled and trained under comparable conditions.
As experimental findings will reveal shortly in the article, GANs can achieve results comparable to Diffusion models across various ISR tasks when scaled appropriately.
Despite the perception that Diffusion models are superior, GANs performed similarly in terms of critical metrics like PSNR (Peak Signal-to-Noise Ratio), SSIM (Structural Similarity Index), LPIPS (Learned Perceptual Image Patch Similarity), and a recent no-reference CLIP-IQA metric.
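For reference, here is a rough sketch of how the full-reference metrics are typically computed with standard tooling (scikit-image and the `lpips` package); this is not the paper's evaluation code, and the no-reference CLIP-IQA metric is omitted.

```python
# A rough sketch of how the full-reference metrics are typically computed with
# standard tooling (scikit-image and the `lpips` package); this is not the
# paper's evaluation code, and the no-reference CLIP-IQA metric is omitted.
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_model = lpips.LPIPS(net="alex")  # perceptual metric backed by AlexNet features

def evaluate_pair(pred: np.ndarray, target: np.ndarray):
    # pred / target: HxWx3 uint8 arrays of the reconstruction and the ground truth.
    psnr = peak_signal_noise_ratio(target, pred, data_range=255)
    ssim = structural_similarity(target, pred, channel_axis=-1, data_range=255)

    # LPIPS expects NCHW float tensors scaled to [-1, 1].
    to_tensor = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    lpips_score = lpips_model(to_tensor(pred), to_tensor(target)).item()
    return psnr, ssim, lpips_score
```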
Let's dive in!
1) Training time and convergence
Notably (and here's the first key difference between the GAN and diffusion models), the pretraining and adversarial training for the GAN-based SR models required approximately three days to complete 140,000 iterations, while training the diffusion models, with the same resources, took around two weeks.
Of course, we optimized the memory consumption and facilitated working with larger batch sizes using techniques like Fully Sharded Data Parallel (FSDP).
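For context, wrapping a model with PyTorch's FSDP looks roughly like the snippet below; it shows the standard torch API, not our exact training configuration.

```python
# A minimal sketch of FSDP wrapping with the standard PyTorch API; this is not
# our exact training configuration.
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def wrap_model(model: nn.Module) -> FSDP:
    # Assumes the process group was already initialised, e.g.
    # dist.init_process_group("nccl") under torchrun.
    assert dist.is_initialized()
    # Each rank holds only a shard of parameters, gradients, and optimizer state,
    # freeing memory for larger batch sizes.
    return FSDP(model)
```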
However, despite this optimization, we observed that GANs converge faster than diffusion models in all stages (the L1 pretraining and the adversarial training stage).
The following table depicts this:
- GAN models stabilize after roughly 40k iterations.
- Diffusion models, in contrast, require about 620k iterations of L2 pretraining on cropped images to fully converge.
Takeaway
This is a revealing observation, indicating that GANs built at scales similar to diffusion models take less time to train and converge faster.
While this is good, we still wanted to verify whether faster convergence resulted in similar performance.
Subsequent sections discuss this.
2) Quantitative comparison
The following table from the paper presents a quantitative comparison between the GAN and diffusion SR models and the current state of the art on the 4x image super-resolution task:
From the above table, it is clear that in 3 out of 4 cases, GANs outperform diffusion models trained on a similar scale and computation budget.
Takeaway
Yet again, the insights are pretty revealing.
- GANs, when scaled appropriately and trained under comparable conditions to diffusion models, can not only match but, in many cases, exceed the performance of diffusion models in image super-resolution tasks.
- This finding, coupled with the training time and convergence takeaway, suggests that the faster convergence of GANs does not come at the cost of performance. It often leads to better or equivalent image quality compared to diffusion models.
Moreover, these findings challenge the growing perception that diffusion models are (somewhat) universally superior for all generative tasks.
3) Visual comparison
The figure below presents a visual comparison between the GAN and diffusion SR model from our work and the baselines on SR (the third column is for the diffusion model, and the fourth column is for GAN):
Zooming in on some specific instances, it is noticeable that the GAN (second column below) produces a slightly finer (or similar) level of detail compared to the diffusion model (first column below):
4) Text conditioning
Lastly, we also explored the impact of textual conditioning by integrating text captions into the models using two types of text encoders:
- Our internal CLIP-like model, XL
- A UMT5 encoder
The figure below depicts these results, where each bar corresponds to one of the three human annotators and shows how many times that annotator preferred the text-conditional model (XL, in green), the unconditional model (in blue), or marked both as equal (in orange).
While the above plot is for our internal CLIP-like model (XL), the figure below is a similar graphic for the UMT5 encoder:
Takeaway
This suggests that global-level semantic information from captions may be less beneficial for the ISR task.
In other words, additional text-conditioning does not noticeably improve image quality (as perceived by our annotators).
Conclusion
With this, we come to the end of this article distilling our recent paper, which presents a thorough, controlled comparison of GANs and diffusion models for ISR.
Our research suggests that GANs remain highly competitive for Image Super Resolution tasks thanks to faster training times and single-step inference capabilities.
Diffusion models, although powerful, require significantly more computational resources and training time to achieve similar performance levels.
These insights are crucial for practitioners and researchers.
They highlight the importance of fair and controlled comparisons when evaluating new methodologies.
As computational resources and data availability continue to grow, understanding the true strengths and limitations of different approaches becomes increasingly vital.
As always, thanks for reading!
References
[a] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement, 2021.
[b] Haoying Li, Yifan Yang, Meng Chang, Huajun Feng, Zhihai Xu, Qi Li, and Yueting Chen. SRDiff: Single image super-resolution with diffusion probabilistic models, 2021.
[c] Zongsheng Yue, Jianyi Wang, and Chen Change Loy. ResShift: Efficient diffusion model for image super-resolution by residual shifting, 2023.