Cross Modal AI — Text2X Tasks
Last Updated on April 22, 2025 by Editorial Team
Author(s): Sarvesh Khetan
Originally published on Towards AI.
Text2Image (TTI) Task / In-Painting / Out-Painting

Idea
We will apply text conditioning to Image2Image models. As seen in the blog, we can use any generative model (autoencoders, GANs, DDPMs, DDIMs, etc.) to perform the Image2Image task. But in the Text2Image task there is no input image at inference time, only an input text, so during training we must use a model that converts noise to an image for the Image2Image task. Hence we can only use models like GANs and diffusion models for the Text2Image task!
Below I have discussed text-conditioned diffusion-based models, since they give better results than text-conditioned GAN-based models.
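To see why diffusion models fit this setting, here is a minimal NumPy sketch of the DDPM forward process, assuming a standard linear beta schedule (the schedule endpoints and step count below are common defaults, not values from this article). At the final step the sample is essentially pure Gaussian noise, which is why at inference the model can start from noise alone, with no input image:

```python
import numpy as np

def forward_diffuse(x0, t, alpha_bar):
    """Closed-form DDPM forward process: q(x_t | x_0) = N(sqrt(abar_t) x0, (1 - abar_t) I)."""
    eps = np.random.randn(*x0.shape)          # Gaussian noise
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

# toy linear noise schedule over T steps (common default endpoints)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

x0 = np.random.rand(64, 64, 3)                # a toy "image"
xT, _ = forward_diffuse(x0, T - 1, alpha_bar) # sample at the last step

print(alpha_bar[-1])                          # ~0, so x_T is essentially pure noise
```

Reversing this process step by step, starting from pure noise, is what the trained denoiser does at inference.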

Text Conditioned DDPMs Based Models
Now, you can do text conditioning in DDPMs in several ways:
a. Vanilla Method to do text conditioning
We have already seen this here, where I showed a diffusion model (CNN-based / Transformer-based) conditioned on time information. The idea is that to condition it on text, we simply add the text information to the time information! Below I have shown how to add text information to a DiT-based model (self-attention), but you can use similar logic to add text information to the other types of diffusion models discussed in this blog.
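As a toy NumPy illustration of this addition trick (everything here is a stand-in: the width, the pooled text embedding, and the adaLN-style scale/shift are hypothetical placeholders, not actual DiT weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128                                   # model width (hypothetical)

# embeddings coming out of (hypothetical) timestep and text encoders
time_emb = rng.standard_normal(d)         # e.g. sinusoidal timestep embedding + MLP
text_emb = rng.standard_normal(d)         # e.g. pooled text-encoder embedding

# the vanilla trick: merge text into the conditioning signal by simple addition
cond = time_emb + text_emb                # shape (d,)

# in a DiT block this conditioning vector is typically mapped to scale/shift
# parameters (adaLN-style) that modulate the image patch tokens
tokens = rng.standard_normal((256, d))    # 256 image patch tokens
scale, shift = np.tanh(cond), 0.1 * cond  # stand-ins for learned MLP outputs
modulated = tokens * (1 + scale) + shift

print(modulated.shape)                    # (256, 128)
```

Because the text embedding rides along the same pathway as the time embedding, no architectural change is needed beyond the extra encoder.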

b. Classifier-Based Guidance to do text conditioning
c. Classifier-Free Guidance (works best) to do text conditioning
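Classifier-free guidance trains the model with the text condition randomly dropped, then at sampling time extrapolates from the unconditional noise prediction toward the text-conditioned one. The combination rule is just a weighted difference of the two predictions, sketched here in NumPy with made-up numbers:

```python
import numpy as np

def cfg(eps_cond, eps_uncond, w):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the text-conditioned one by guidance scale w."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_c = np.array([1.0, 2.0])   # noise predicted with the text prompt
eps_u = np.array([0.5, 1.0])   # noise predicted with an empty prompt

print(cfg(eps_c, eps_u, 1.0))  # w = 1 recovers the conditional prediction
print(cfg(eps_c, eps_u, 7.5))  # w > 1 pushes harder toward the prompt
```

Larger guidance scales give samples that follow the prompt more closely, at the cost of sample diversity.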

Prominent Models
Following are some successful models that have been trained using the above methods:
- GLIDE (Guided Language-to-Image Diffusion for Generation and Editing) by OpenAI: it had the following architecture

- [better than GLIDE] DALL·E 2 (or unCLIP) by OpenAI: almost the same architecture as GLIDE, except that it uses the CLIP model to convert image and text into embeddings, and it uses a U-Net-style autoencoder-based diffusion model.
- [better than DALL·E 2] Imagen by Google: almost the same architecture as GLIDE, except that it uses the pretrained, frozen T5-XXL model to encode the text.

Text Conditioned Latent DDPMs Based Models (works better)
Same as text-conditioned DDPMs, except that here we operate in latent space instead of pixel space.
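The payoff is mostly computational. Here is a shape-only NumPy sketch (the encoder is a random stand-in, not a real VAE; the 8x downsampling and 4 latent channels are the numbers commonly used by Stable Diffusion, stated here as an assumption):

```python
import numpy as np

# Latent diffusion: run the diffusion in a VAE's latent space, not pixel space.
# Common Stable Diffusion shapes: 512x512x3 pixels -> 64x64x4 latents.
image = np.random.rand(512, 512, 3)

def toy_encode(img, factor=8, channels=4):
    """Stand-in for a VAE encoder: only demonstrates the shape reduction."""
    h, w = img.shape[0] // factor, img.shape[1] // factor
    return np.random.randn(h, w, channels)

latent = toy_encode(image)
print(latent.shape)                 # (64, 64, 4)

# the denoiser now works on far fewer values than it would in pixel space
print(image.size / latent.size)     # 48x fewer
```

The denoising network therefore sees inputs roughly 48x smaller, which is a large part of why latent diffusion trains and samples faster than pixel-space DDPMs.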

Some famous models based on the above architecture are:
- Stable Diffusion
- Titan Image Generator (AWS)
- Midjourney
All these models' weights can be downloaded using Hugging Face's diffusers library.
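A sketch of the standard diffusers text-to-image flow is below; the checkpoint id, prompt, and file name are illustrative, and actually running it requires the diffusers, transformers, and torch packages, a GPU, and downloading the model weights:

```python
def generate(prompt: str):
    """Sketch: load a Stable Diffusion checkpoint and run text-to-image.
    Imports are inside the function so the sketch stays self-contained."""
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",   # illustrative checkpoint id
        torch_dtype=torch.float16,
    ).to("cuda")
    return pipe(prompt).images[0]

# usage (needs GPU + downloaded weights):
# generate("a watercolor painting of a red fox").save("fox.png")
```

Swapping the checkpoint id is usually all it takes to try a different latent diffusion model hosted on the Hugging Face Hub.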

Text2Speech Task (TTS) / Speech Synthesis
- Idea: we will perform text conditioning on models that can perform the Speech2Speech task.
- Now you can solve the Speech2Speech task using any generative model, like
a. GANs
b. Diffusion (works best). Some prominent models in this space are:
   - AudioGen by Facebook
   - MusicGen by Facebook
   - Indic-TTS, developed by AI4Bharat, which can generate speech in multiple Indian languages.

Text2Video Task
- Idea: we will perform text conditioning on models that can perform the Video2Video task.
- Now you can solve the Video2Video task using any generative model, like
a. GANs
b. Diffusion (works best). Some prominent models in this space are:
   - [2022] Make-A-Video by Meta
   - [2023] Stable Video Diffusion
   - [early 2024] Lumiere by Google
   - [late 2024] Sora by OpenAI
   - [early 2025] Veo 2 by Google