Cross Modal AI — Text2X Tasks
Last Updated on April 22, 2025 by Editorial Team
Author(s): Sarvesh Khetan
Originally published on Towards AI.
Text2Image (TTI) Task / In-Painting / Out-Painting

Idea
We will apply text conditioning to Image2Image models. As seen in the blog, we can use any generative model (autoencoders, GANs, DDPMs, DDIMs, etc.) to perform the Image2Image task. But in the Text2Image task there is no input image at inference time, only an input text, so during training we must use a model that converts noise to an image for the Image2Image task. Hence we can only use models like GANs and diffusion models for the Text2Image task!
Below I have discussed text-conditioned diffusion-based models, since they give better results than text-conditioned GAN-based models.
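To see why diffusion models fit this setting, here is a minimal NumPy sketch of the DDPM forward process, assuming a standard linear beta schedule (the schedule endpoints and step count below are common defaults, not values from this article). At the final step the sample is essentially pure Gaussian noise, which is why at inference the model can start from noise alone, with no input image:

```python
import numpy as np

def forward_diffuse(x0, t, alpha_bar):
    """Closed-form DDPM forward process: q(x_t | x_0) = N(sqrt(abar_t) x0, (1 - abar_t) I)."""
    eps = np.random.randn(*x0.shape)          # Gaussian noise
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

# toy linear noise schedule over T steps (common default endpoints)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

x0 = np.random.rand(64, 64, 3)                # a toy "image"
xT, _ = forward_diffuse(x0, T - 1, alpha_bar) # sample at the last step

print(alpha_bar[-1])                          # ~0, so x_T is essentially pure noise
```

Reversing this process step by step, starting from pure noise, is what the trained denoiser does at inference.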

Text Conditioned DDPMs Based Models
Now, you can do text conditioning in DDPMs in several ways:
a. Vanilla Method to do text conditioning
We have already seen this here, where I showed a diffusion model (CNN-based / Transformer-based) conditioned on time information. The idea is that to condition it on text, we simply add the text information to the time information! Below I have shown how to add text information to a DiT-based model (self-attention), but you can use similar logic to add text information to the other types of diffusion models discussed in this blog.
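As a toy NumPy illustration of this addition trick (everything here is a stand-in: the width, the pooled text embedding, and the adaLN-style scale/shift are hypothetical placeholders, not actual DiT weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128                                   # model width (hypothetical)

# embeddings coming out of (hypothetical) timestep and text encoders
time_emb = rng.standard_normal(d)         # e.g. sinusoidal timestep embedding + MLP
text_emb = rng.standard_normal(d)         # e.g. pooled text-encoder embedding

# the vanilla trick: merge text into the conditioning signal by simple addition
cond = time_emb + text_emb                # shape (d,)

# in a DiT block this conditioning vector is typically mapped to scale/shift
# parameters (adaLN-style) that modulate the image patch tokens
tokens = rng.standard_normal((256, d))    # 256 image patch tokens
scale, shift = np.tanh(cond), 0.1 * cond  # stand-ins for learned MLP outputs
modulated = tokens * (1 + scale) + shift

print(modulated.shape)                    # (256, 128)
```

Because the text embedding rides along the same pathway as the time embedding, no architectural change is needed beyond the extra encoder.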

b. Classifier-Based Guidance to do text conditioning
c. Classifier-Free Guidance (works best) to do text conditioning
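Classifier-free guidance trains the model with the text condition randomly dropped, then at sampling time extrapolates from the unconditional noise prediction toward the text-conditioned one. The combination rule is just a weighted difference of the two predictions, sketched here in NumPy with made-up numbers:

```python
import numpy as np

def cfg(eps_cond, eps_uncond, w):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the text-conditioned one by guidance scale w."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_c = np.array([1.0, 2.0])   # noise predicted with the text prompt
eps_u = np.array([0.5, 1.0])   # noise predicted with an empty prompt

print(cfg(eps_c, eps_u, 1.0))  # w = 1 recovers the conditional prediction
print(cfg(eps_c, eps_u, 7.5))  # w > 1 pushes harder toward the prompt
```

Larger guidance scales give samples that follow the prompt more closely, at the cost of sample diversity.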

Prominent Models
Following are some successful models that have been trained using the above methods:
- GLIDE (Guided Language-to-Image Diffusion for Generation and Editing) by OpenAI: it had the following architecture

- [better than GLIDE] DALL·E 2 (or unCLIP) by OpenAI: almost the same architecture as GLIDE, except that it uses the CLIP model to convert image and text into embeddings, and it uses a U-Net-style autoencoder-based diffusion model.
- [better than DALL·E 2] Imagen by Google: almost the same architecture as GLIDE, except that it uses the pretrained, frozen T5-XXL model to encode the text.

Text Conditioned Latent DDPMs Based Models (works better)
Same as text-conditioned DDPMs, except that here we operate in latent space instead of pixel space.
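The payoff is mostly computational. Here is a shape-only NumPy sketch (the encoder is a random stand-in, not a real VAE; the 8x downsampling and 4 latent channels are the numbers commonly used by Stable Diffusion, stated here as an assumption):

```python
import numpy as np

# Latent diffusion: run the diffusion in a VAE's latent space, not pixel space.
# Common Stable Diffusion shapes: 512x512x3 pixels -> 64x64x4 latents.
image = np.random.rand(512, 512, 3)

def toy_encode(img, factor=8, channels=4):
    """Stand-in for a VAE encoder: only demonstrates the shape reduction."""
    h, w = img.shape[0] // factor, img.shape[1] // factor
    return np.random.randn(h, w, channels)

latent = toy_encode(image)
print(latent.shape)                 # (64, 64, 4)

# the denoiser now works on far fewer values than it would in pixel space
print(image.size / latent.size)     # 48x fewer
```

The denoising network therefore sees inputs roughly 48x smaller, which is a large part of why latent diffusion trains and samples faster than pixel-space DDPMs.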

Some famous models based on the above architecture are:
- Stable Diffusion
- Titan Image Generator (AWS)
- Midjourney
All these models' weights can be downloaded using Hugging Face's diffusers library.
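A sketch of the standard diffusers text-to-image flow is below; the checkpoint id, prompt, and file name are illustrative, and actually running it requires the diffusers, transformers, and torch packages, a GPU, and downloading the model weights:

```python
def generate(prompt: str):
    """Sketch: load a Stable Diffusion checkpoint and run text-to-image.
    Imports are inside the function so the sketch stays self-contained."""
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",   # illustrative checkpoint id
        torch_dtype=torch.float16,
    ).to("cuda")
    return pipe(prompt).images[0]

# usage (needs GPU + downloaded weights):
# generate("a watercolor painting of a red fox").save("fox.png")
```

Swapping the checkpoint id is usually all it takes to try a different latent diffusion model hosted on the Hugging Face Hub.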

Text2Speech Task (TTS) / Speech Synthesis
- Idea: we will perform text conditioning on models that can perform the Speech2Speech task.
- Now you can solve the Speech2Speech task using any generative model, like
a. GANs
b. Diffusion (works best). Some prominent models in this space are:
   - AudioGen by Facebook
   - MusicGen by Facebook
   - Indic-TTS, developed by AI4Bharat, which can generate speech in multiple Indian languages.

Text2Video Task
- Idea: we will perform text conditioning on models that can perform the Video2Video task.
- Now you can solve the Video2Video task using any generative model, like
a. GANs
b. Diffusion (works best). Some prominent models in this space are:
   - [2022] Make-A-Video by Meta
   - [2023] Stable Video Diffusion
   - [early 2024] Lumiere by Google
   - [late 2024] Sora by OpenAI
   - [early 2025] Veo 2 by Google