
Cross Modal AI — Text2X Tasks
Last Updated on April 22, 2025 by Editorial Team
Author(s): Sarvesh Khetan
Originally published on Towards AI.

Text2Image (TTI) Task / In-Painting / Out-Painting

Idea
The idea is to add text conditioning to Image2Image models. As seen in the blog, any generative model (Autoencoders, GANs, DDPMs, DDIMs, and so on) can perform the Image2Image task. In the Text2Image task, however, there is no input image at inference time, only input text. The model therefore has to turn noise into an image, and it has to be trained that way too. This restricts us to noise-to-image models such as GANs and diffusion models.
Below I discuss text-conditioned diffusion models, since they generally give better results than text-conditioned GANs.

Text Conditioned DDPMs Based Models
Text conditioning in DDPMs can be done in several ways:
a. Vanilla Method to do text conditioning
We have already seen this here, where I show a diffusion model (CNN-based or Transformer-based) conditioned on time information. To also condition it on text, we simply add the text information to the time information. Below I show how to add text information to a DiT-based model (self-attention), but similar logic applies to the other types of diffusion models discussed in this blog.
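To make the "add text to time" idea concrete, here is a minimal sketch in plain Python. The mean pooling of token embeddings, the function names, and the toy numbers are assumptions made for illustration, not the exact recipe of any model mentioned here; in a real DiT both embeddings would typically pass through learned layers before being summed.

```python
import math

def timestep_embedding(t, dim):
    """Sinusoidal embedding of an integer diffusion timestep t (dim must be even)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

def pool_text_embedding(token_embeddings):
    """Mean-pool per-token text embeddings into one vector (an assumed, simple choice)."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[j] for tok in token_embeddings) / n for j in range(dim)]

def conditioning_vector(t, token_embeddings):
    """Vanilla conditioning: add the text vector to the time vector element-wise."""
    text_vec = pool_text_embedding(token_embeddings)
    time_vec = timestep_embedding(t, len(text_vec))
    return [a + b for a, b in zip(time_vec, text_vec)]

# Toy input: 3 "tokens" with 4-dimensional embeddings (made-up numbers).
tokens = [[0.1, 0.2, 0.3, 0.4],
          [0.5, 0.6, 0.7, 0.8],
          [0.9, 1.0, 1.1, 1.2]]
cond = conditioning_vector(t=0, token_embeddings=tokens)
print(cond)
```

The resulting vector is what the denoiser would receive wherever it previously received the time embedding alone, so the rest of the network does not need to change.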

b. Classifier Based Guidance to do text conditioning
c. Classifier Free Guidance (works best) to do text conditioning
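As a rough sketch of classifier-free guidance: during training the text prompt is randomly replaced by an empty prompt, so a single model learns both a conditional and an unconditional noise prediction; at sampling time the two predictions are combined. The function names, the drop probability, and the toy numbers below are assumptions for illustration, and the combination rule is written in the form commonly used in implementations (some papers parameterize the scale slightly differently).

```python
import random

def maybe_drop_text(prompt, drop_prob, rng):
    """Training-time step: swap the prompt for an empty one with probability drop_prob."""
    return "" if rng.random() < drop_prob else prompt

def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Sampling-time step: move from the unconditional prediction toward
    (and past) the text-conditional one.
    guidance_scale = 0 gives the unconditional prediction,
    guidance_scale = 1 gives the conditional prediction,
    guidance_scale > 1 strengthens the effect of the text."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Training side: roughly 10% of prompts are dropped (assumed value).
rng = random.Random(0)
train_prompt = maybe_drop_text("a red bicycle", drop_prob=0.1, rng=rng)

# Sampling side: toy noise predictions from the same model, without and with text.
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]
guided = classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=2.0)
print(guided)  # [2.0, 5.0]
```

Unlike classifier-based guidance, this needs no separate classifier trained on noisy images, which is one reason it is the more widely used option.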

Prominent Models
The following are some successful models trained using the above methods:
- Guided Language to Image Diffusion for Generation and Editing (GLIDE) by OpenAI: it had the following architecture

- [better than GLIDE] DALL-E 2 (or unCLIP) by OpenAI: its architecture is broadly similar to GLIDE's, except that it uses the CLIP model to convert images and text into embeddings, and it uses a U-Net-style autoencoder-based diffusion model.
- [better than DALL-E 2] Imagen by Google: its architecture is also broadly similar to GLIDE's, except that it uses the pretrained, frozen T5-XXL model to encode text.

Text Conditioned Latent DDPMs Based Models (works better)
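These are the same as text-conditioned DDPMs, except that the diffusion process runs in a latent space instead of pixel space: an autoencoder compresses the image into a smaller latent representation, the diffusion model denoises that latent, and the decoder maps the result back to pixels.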
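A quick back-of-the-envelope calculation shows why working in latent space is cheaper. The numbers below (a 512x512 RGB image, an 8x spatial downsampling factor, and 4 latent channels) are assumed here because they match the commonly cited Stable Diffusion v1 configuration; other latent diffusion models may use different values.

```python
def count_values(height, width, channels):
    """Number of scalar values in a tensor of the given shape."""
    return height * width * channels

# Assumed configuration (matches the commonly cited Stable Diffusion v1 setup).
image_h, image_w, image_c = 512, 512, 3
downsample_factor = 8
latent_channels = 4

latent_h = image_h // downsample_factor
latent_w = image_w // downsample_factor

pixel_values = count_values(image_h, image_w, image_c)
latent_values = count_values(latent_h, latent_w, latent_channels)
reduction = pixel_values / latent_values

print(f"pixel space : {image_h}x{image_w}x{image_c} = {pixel_values} values")
print(f"latent space: {latent_h}x{latent_w}x{latent_channels} = {latent_values} values")
print(f"the denoiser processes {reduction:.0f}x fewer values per step")
```

Under these assumptions the denoiser sees 48 times fewer values at every diffusion step, which is where most of the speed and memory savings come from.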

Some famous models based on the above architecture are:
- Stable Diffusion
- Titan Image Generator (AWS)
- Midjourney
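Of these, Stable Diffusion has openly released weights that can be downloaded using Hugging Face's diffusers library, example code available here. Titan Image Generator and Midjourney are proprietary services, so their weights are not publicly downloadable.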
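As a rough sketch of loading openly released Stable Diffusion weights with diffusers, something like the following should work. The model id, the prompt, the step count, and the guidance scale are assumptions chosen for illustration; the helper name `generate_image` is hypothetical, and running it requires the `torch` and `diffusers` packages plus a multi-gigabyte download on first use.

```python
def generate_image(prompt,
                   model_id="CompVis/stable-diffusion-v1-4",  # assumed model id
                   steps=30,
                   guidance_scale=7.5,
                   device="cuda"):
    """Generate one image from a text prompt with a Stable Diffusion pipeline."""
    # Imported lazily so this file can be imported without the heavy dependencies.
    import torch
    from diffusers import StableDiffusionPipeline

    # Half precision saves GPU memory; fall back to float32 on CPU.
    dtype = torch.float16 if device == "cuda" else torch.float32
    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=dtype)
    pipe = pipe.to(device)

    # guidance_scale is the classifier-free guidance weight.
    result = pipe(prompt,
                  num_inference_steps=steps,
                  guidance_scale=guidance_scale)
    return result.images[0]  # a PIL image

# Example usage (downloads the weights on first run):
# image = generate_image("a watercolor painting of a lighthouse at dawn")
# image.save("lighthouse.png")
```

Passing `device="cpu"` also works in principle, but generation is then much slower.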

Text2Speech Task (TTS) / Speech Synthesis
- Idea: we perform text conditioning on models that can do the Speech2Speech task.
- The Speech2Speech task can be solved using any generative model, such as
a. GANs
b. Diffusion (works best)
- Some prominent models in this space are:
a. AudioGen by Facebook (generates general audio from text rather than speech)
b. MusicGen by Facebook (generates music from text)
c. Indic-TTS, developed by AI4Bharat, which can generate speech in multiple Indian languages.

Text2Video Task
- Idea: we perform text conditioning on models that can do the Video2Video task.
- The Video2Video task can be solved using any generative model, such as
a. GANs
b. Diffusion (works best)
- Some prominent models in this space are:
a. [2022] Make-A-Video by Meta (here is a blog explaining this paper)
b. [2023] Stable Video Diffusion
c. [early 2024] Lumiere by Google
d. [late 2024] Sora by OpenAI
e. [early 2025] Veo 2 by Google