Last Updated on July 26, 2023 by Editorial Team
Author(s): Luhui Hu
Originally published on Towards AI.
GAN, GPT-3, DALL·E 2, and what’s next
The past ten years have been a golden decade for AI, but meaningful AI has only just begun. Computer vision (CV) currently leads in industry adoption, natural language processing (NLP) remains the crown jewel of AI, and reinforcement learning (RL) still awaits validation in Level 4/Level 5 autonomous driving on the road. Artificial General Intelligence (AGI) remains the future.
Generative AI is a nascent but creative approach, and one of the most successful to emerge from the past decade of deep learning. It uses unsupervised or semi-supervised machine learning to create new content, including but not limited to digital images, video, audio, text, and code. So far, there are two prominent frameworks of generative AI: Generative Adversarial Network (GAN) and Generative Pre-trained Transformer (GPT).
Generative Adversarial Network (GAN)
A GAN pits two neural networks against each other (hence “adversarial”): a generator that creates new synthetic data instances, and a discriminator that tries to tell them apart from real data. As both networks improve, the generated instances become good enough to pass for real data. GANs learn through a competitive zero-sum game, in which one network’s gain is the other’s loss. They are widely used in image, video, and voice generation.
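There are three main steps in training the discriminator of a GAN; the generator is then updated using the discriminator’s feedback, and the cycle repeats: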
1. Select several real images from the training set.
2. Generate several fake images by sampling random noise vectors and creating images from them using the generator.
3. Train the discriminator for one or more epochs using fake and real images.
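The steps above can be sketched on a toy problem. This is a minimal illustration, not a production GAN: it assumes 1-D numbers instead of images, a hypothetical affine generator g(z) = a*z + b, and a logistic-regression discriminator, with gradients written out by hand so that only the Python standard library is needed. The generator update at the end is the usual companion step and is not one of the three steps listed above.

```python
import math
import random

def sigmoid(t):
    # Numerically safe logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def sample_real(rng, n):
    # Step 1: draw "real" samples (a stand-in for real training images).
    return [rng.gauss(4.0, 0.5) for _ in range(n)]

def generate_fake(rng, gen, n):
    # Step 2: sample random noise and map it through the generator.
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return noise, [gen["a"] * z + gen["b"] for z in noise]

def discriminator_loss(disc, real, fake):
    # Binary cross-entropy: real samples are labeled 1, fake samples 0.
    eps = 1e-12
    total = 0.0
    for x in real:
        total -= math.log(sigmoid(disc["w"] * x + disc["c"]) + eps)
    for x in fake:
        total -= math.log(1.0 - sigmoid(disc["w"] * x + disc["c"]) + eps)
    return total / (len(real) + len(fake))

def train_discriminator(disc, real, fake, lr):
    # Step 3: one gradient-descent step on the discriminator loss.
    grad_w = grad_c = 0.0
    for x in real:
        p = sigmoid(disc["w"] * x + disc["c"])
        grad_w -= (1.0 - p) * x
        grad_c -= (1.0 - p)
    for x in fake:
        p = sigmoid(disc["w"] * x + disc["c"])
        grad_w += p * x
        grad_c += p
    n = len(real) + len(fake)
    disc["w"] -= lr * grad_w / n
    disc["c"] -= lr * grad_c / n

def train_generator(gen, disc, noise, lr):
    # Companion step: push the generator to make D(G(z)) larger,
    # using the non-saturating loss -log D(G(z)).
    grad_a = grad_b = 0.0
    for z in noise:
        x = gen["a"] * z + gen["b"]
        p = sigmoid(disc["w"] * x + disc["c"])
        dx = -(1.0 - p) * disc["w"]  # d(loss)/d(x)
        grad_a += dx * z
        grad_b += dx
    gen["a"] -= lr * grad_a / len(noise)
    gen["b"] -= lr * grad_b / len(noise)

rng = random.Random(0)
disc = {"w": 0.1, "c": 0.0}   # hypothetical initial parameters
gen = {"a": 1.0, "b": 0.0}
for _ in range(200):
    real = sample_real(rng, 32)
    noise, fake = generate_fake(rng, gen, 32)
    train_discriminator(disc, real, fake, lr=0.05)
    train_generator(gen, disc, noise, lr=0.05)
```

Alternating the two updates is what makes the game adversarial: the discriminator is rewarded for separating real from fake, and the generator is rewarded for undoing that separation.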
Generative Pre-trained Transformer (GPT)
GPT is an autoregressive language model based on the transformer architecture, pre-trained in a generative and unsupervised manner, that shows decent performance in zero/one/few-shot multitask settings.
The original transformer is an encoder-decoder architecture built on a self-attention mechanism. Unlike an LSTM, which processes tokens one at a time, each transformer layer can access the state vectors of every input token, uses only information about other tokens from lower layers, and can therefore be computed for all tokens in parallel. This yields significantly improved accuracy and training performance. Transformer language models have evolved from BERT (Bidirectional Encoder Representations from Transformers) through RoBERTa, GPT-2, T5, and Turing-NLG to GPT-3. BERT started with about 110 million parameters, whereas GPT-3 has 175 billion parameters and 96 attention layers, and was trained with a batch size of 3.2 million tokens on a dataset of about 499 billion tokens. It cost an estimated $4.6M to train. In return, there are many exciting stories about GPT-3 use cases.
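To make the self-attention idea concrete, here is a minimal sketch of single-head scaled dot-product attention in plain Python. The helper names and the tiny 2-dimensional example are illustrative assumptions; real models use learned projection matrices, many heads, and optimized tensor libraries. The optional causal mask shows how an autoregressive model such as GPT prevents a token from attending to later positions.

```python
import math

def matmul(a, b):
    # Multiply two matrices stored as lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    # Subtract the max for numerical stability.
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v, causal=False):
    # x holds one row per token; w_q, w_k, w_v are projection matrices.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(k[0]))
    scores = [[s / scale for s in row] for row in matmul(q, transpose(k))]
    if causal:
        # Token i may only attend to positions j <= i.
        scores = [
            [s if j <= i else float("-inf") for j, s in enumerate(row)]
            for i, row in enumerate(scores)
        ]
    weights = [softmax(row) for row in scores]
    return weights, matmul(weights, v)

# Three tokens with 2-dimensional embeddings; identity projections
# are used here only to keep the example easy to follow.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, output = self_attention(tokens, identity, identity, identity, causal=True)
```

Every row of `weights` is computed independently of the others, which is why attention can be evaluated for all tokens in parallel, in contrast to the step-by-step recurrence of an LSTM.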
Transformer applications include but are not limited to:
1. Text generation
2. Text summarization
3. Text classification (e.g., sentiment analysis)
4. Language translation
5. Question answering
6. Named-entity recognition
DALL·E 2 is a remarkable text-to-image generative AI system. It mainly employs two techniques: CLIP (Contrastive Language-Image Pre-training) and diffusion models. CLIP connects text descriptions to image content by embedding both in a shared representation space. Diffusion models are generative models that learn to reverse a gradual noising process; DALL·E 2 uses one to generate an image from a CLIP embedding, whereas the original DALL·E used a version of GPT-3 modified to generate images. DALL·E 2 can combine concepts, attributes, and styles to generate more realistic images at higher resolutions than DALL·E.
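As a rough illustration of the diffusion idea, the sketch below shows only the forward (noising) half of a diffusion process: data is gradually mixed with Gaussian noise until almost no signal remains, and a diffusion model is trained to reverse that corruption. The linear schedule and its default values are common choices from the diffusion literature, assumed here for illustration; they are not DALL·E 2’s actual settings, and the function names are hypothetical.

```python
import math
import random

def linear_beta_schedule(steps, beta_start=1e-4, beta_end=0.02):
    # Per-step noise variances, increasing linearly (assumed defaults).
    return [beta_start + (beta_end - beta_start) * t / (steps - 1) for t in range(steps)]

def cumulative_alphas(betas):
    # alpha_bar[t] = product of (1 - beta) up to step t:
    # the fraction of the original signal variance that survives.
    result, product = [], 1.0
    for beta in betas:
        product *= 1.0 - beta
        result.append(product)
    return result

def add_noise(x0, alpha_bar_t, rng):
    # Sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise.
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in x0]

rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
pixels = [0.5, -0.25, 1.0]                      # stand-in for image pixels
slightly_noisy = add_noise(pixels, alpha_bars[10], rng)
almost_pure_noise = add_noise(pixels, alpha_bars[-1], rng)
```

Generation runs the process in the other direction: starting from pure noise, a trained network removes a little noise at each step, optionally guided by a conditioning signal such as a text or image embedding.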
DALL·E’s model is a multimodal implementation of GPT-3 with 12 billion parameters, trained on text-image pairs from the Internet. DALL·E 2 uses 3.5 billion parameters, fewer than its predecessor, yet it generates more realistic images. This suggests there are opportunities to achieve better results with fewer parameters.
Unification By Transformer
There are many applications of deep learning, but language and vision are the two primary branches. Both are fundamental domains of cognitive learning, yet they diverged into two different DL model families: recurrent neural networks (RNNs) for language and convolutional neural networks (CNNs) for vision. Because of the complexity and differing architectures of these models, ML scientists had to research and develop the two fields independently, which made it hard for them to share advances and evolve together.
The transformer changes the game. Not only has it succeeded in language modeling, but it has also demonstrated promise in computer vision (CV). Vision Transformer (ViT) implementations are available for both PyTorch and TensorFlow. Further, transformer-based GANs and GAN-like transformers have been explored successfully for generative vision AI.
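One reason the same architecture can serve both fields is that a Vision Transformer turns an image into a sequence of tokens, much like words in a sentence, by cutting it into fixed-size patches. The sketch below shows only that patch-splitting step on a tiny grayscale grid; `patchify` is a hypothetical helper, and the learned linear projection and position embeddings that a real ViT applies afterwards are omitted.

```python
def patchify(image, patch_size):
    # Split a 2-D image (list of rows) into flattened square patches,
    # scanning left to right, top to bottom.
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ])
    return patches

# A 4x4 "image" whose pixel values are 0..15, split into 2x2 patches.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = patchify(image, 2)
# patches[0] is the top-left patch: [0, 1, 4, 5]
```

Each flattened patch plays the role of one token, so the resulting sequence can be fed to the same attention layers used for text.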
Large Model and What’s Next
We are thrilled about the success of GPT-3 and the transformer, but they are very large models that require big data and supercomputing power. Prof. Ion Stoica illustrated the growth in ML compute demand by extending OpenAI’s study.
ML compute demand is growing almost 17.5 times faster than the famous Moore’s Law, and the increase affects both processing and memory. So how can we handle this explosive demand, given the known challenges to Moore’s Law itself? Should we keep pursuing large models?
Large models are not a problem in themselves from the perspective of ML accuracy and performance, but we have to optimize and innovate in a few practical ways:
- Data-centric or big data: a data-centric ML methodology can deliver high-quality data, not just big data.
- Hardware infrastructure: GPUs, TPUs, FPGAs, and other accelerators remain at the core of growth in computing power, and distributed cloud deployments of them can scale out compute and memory capacity.
- Model architecture and algorithm: GPT-4 and GPT-5 are expected to follow GPT-3, but it is critical to optimize model architectures and keep inventing better models.
- Framework design: this is the key to optimizing ML training and serving implementations. For instance, Ray is an open-source framework for productionizing and scaling Python ML workloads.
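The scale-out idea in the hardware and framework points above can be illustrated with a toy data-parallel training loop: the data is split into shards, each worker computes a gradient on its shard, and the results are averaged. This sketch deliberately uses Python’s standard-library `ThreadPoolExecutor` as a stand-in; it is not Ray’s API, and the one-parameter linear model and function names are assumptions chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def shard_gradient(w, shard):
    # Gradient of mean squared error for the model y = w * x on one shard.
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_gradient(w, data, num_workers):
    # Assumes len(data) is divisible by num_workers, so that averaging
    # the per-shard gradients equals the full-batch gradient.
    size = len(data) // num_workers
    shards = [data[i * size:(i + 1) * size] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        grads = list(pool.map(lambda shard: shard_gradient(w, shard), shards))
    return sum(grads) / len(grads)

# Toy dataset generated from y = 3x; training should recover w close to 3.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
w = 0.0
for _ in range(100):
    w -= 0.01 * data_parallel_gradient(w, data, num_workers=4)
```

A distributed framework applies the same split, compute, and combine pattern across processes and machines rather than threads, and additionally handles scheduling, data movement, and failures.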
In a Nutshell
Generative AI is an emerging and innovative technology for digital content generation. GAN and GPT are two proven ML frameworks for vision and language, respectively. Transformers are changing the game by unifying two DL fields that were previously split between CNNs and RNNs, and this unification also applies to generative AI. Autoregressive transformers can provide a unified architecture for both vision and language generative solutions.
There are many meaningful generative applications for digital images, video, audio, text, and code. Before long, generative AI may be extended to the metaverse and web3, which need ever more automatically generated digital content.
1. Generative Adversarial Networks: https://arxiv.org/abs/1406.2661
2. Attention Is All You Need: https://arxiv.org/pdf/1706.03762.pdf
3. DALL·E 2 details: https://openai.com/dall-e-2/
4. Ion Stoica — Ray: A Universal Framework for Distributed Systems: https://youtu.be/tgB671SFS4w