Cross Modal AI — Text2X Tasks

Last Updated on April 22, 2025 by Editorial Team

Author(s): Sarvesh Khetan

Originally published on Towards AI.

Text2Image (TTI) Task / In-Painting / Out-Painting

Idea

The idea is to add text conditioning to Image2Image models. As seen in the blog, any generative model (Autoencoders / GAN … / DDPMs / DDIMs) can perform the Image2Image task. In the Text2Image task, however, there is no input image at inference time, only an input text. During training we therefore also have to use a model that converts noise into an image. Hence only models such as GANs and diffusion models can be used for the Text2Image task.

Below I discuss text-conditioned diffusion-based models, since they give better results than text-conditioned GAN-based models.

Text Conditioned DDPMs Based Models

Text conditioning can be done in DDPMs in several ways:

a. Vanilla Method to do text conditioning

We have already seen this here, where I show a diffusion model (CNN based / Transformer based) conditioned on time information. To also condition it on text, we simply add the text information to the time information. This is illustrated for a DiT-based model (self-attention), but similar logic can be used to add text information to the other types of diffusion models discussed in this blog.
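
As a rough illustration of "adding text information to time information", the sketch below builds a sinusoidal timestep embedding, mean-pools some per-token text embeddings, and sums the two into one conditioning vector. The function names and the mean-pooling step are assumptions made for illustration, not the author's implementation; in a real DiT the resulting vector would typically be fed to the adaptive layer-norm layers of each block.

```python
import math

def timestep_embedding(t, dim):
    """Sinusoidal embedding of a scalar diffusion timestep t (dim must be even)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

def pool_text_embedding(token_embs):
    """Mean-pool per-token text embeddings into a single vector.

    token_embs is a hypothetical stand-in for the output of a text encoder:
    a list of equal-length vectors, one per token.
    """
    n = len(token_embs)
    return [sum(column) / n for column in zip(*token_embs)]

def conditioning_vector(t, token_embs):
    """Vanilla text conditioning: element-wise sum of time and text embeddings."""
    dim = len(token_embs[0])
    t_emb = timestep_embedding(t, dim)
    c_emb = pool_text_embedding(token_embs)
    return [a + b for a, b in zip(t_emb, c_emb)]

# Toy example: two tokens, embedding dimension 4, timestep 0.
cond = conditioning_vector(0, [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]])
print(cond)
```

Summing keeps the conditioning vector the same size as the time embedding, so the rest of the network does not need to change; concatenation or cross-attention over the token embeddings are alternative designs.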

b. Classifier Based Guidance to do text conditioning
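
A minimal sketch of the classifier guidance update, assuming the commonly used formulation in which the predicted noise is shifted by the gradient of a separately trained model's log-probability that the noisy image matches the condition (here, the text). The gradient is passed in as a plain list because computing it requires a real classifier; all names below are hypothetical.

```python
import math

def classifier_guided_eps(eps, grad_log_prob, alpha_bar_t, scale):
    """Shift the predicted noise using the classifier gradient.

    eps           : noise predicted by the unconditional diffusion model
    grad_log_prob : gradient of log p(text | x_t) with respect to x_t
                    (assumed to come from a separate classifier on noisy images)
    alpha_bar_t   : cumulative product of the noise schedule at step t
    scale         : guidance strength (larger pushes harder toward the text)
    """
    coeff = math.sqrt(1.0 - alpha_bar_t) * scale
    return [e - coeff * g for e, g in zip(eps, grad_log_prob)]

# Toy numbers: sqrt(1 - 0.75) * 2.0 = 1.0, so the gradient is subtracted as-is.
guided = classifier_guided_eps([1.0, 2.0], [0.5, -0.5], alpha_bar_t=0.75, scale=2.0)
print(guided)
```

The drawback of this approach is the extra classifier, which has to be trained on noisy images at every noise level.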

c. Classifier Free Guidance (works best) to do text conditioning
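
A minimal sketch of classifier-free guidance, under the usual assumptions: during training the prompt is randomly replaced by an empty prompt so that one network learns both conditional and unconditional noise prediction, and during sampling the two predictions are extrapolated. The convention below (unconditional plus scale times the difference) is one of several in use; the helper names are hypothetical.

```python
import random

NULL_TEXT = ""  # hypothetical placeholder for the "no text" condition

def maybe_drop_text(prompt, p_drop, rng=random):
    """Training time: replace the prompt with the null prompt with probability p_drop."""
    return NULL_TEXT if rng.random() < p_drop else prompt

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Sampling time: move from the unconditional prediction toward the conditional one.

    guidance_scale = 0 gives the unconditional prediction, 1 gives the
    conditional prediction, and values above 1 amplify the effect of the text.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Toy example with two-element "noise predictions".
guided = cfg_combine([0.0, 1.0], [1.0, 3.0], guidance_scale=2.0)
print(guided)
```

No extra classifier is needed, which is one reason this variant is widely used in practice.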

Prominent Models

The following are some successful models trained using the above methods:

  • Guided Language to Image Diffusion for Generation and Editing (GLIDE) by OpenAI: it has the following architecture
Figure: GLIDE by OpenAI architecture
  • [better than GLIDE] DALL-E 2 (or unCLIP) by OpenAI: its architecture is very similar to GLIDE's, except that it uses the CLIP model to convert images and text into embeddings, together with a U-Net-style autoencoder-based diffusion model.
  • [better than DALL-E 2] Imagen by Google: its architecture is very similar to GLIDE's, except that it uses the pretrained, frozen T5-XXL model to encode text.

Text Conditioned Latent DDPMs Based Models (works better)

These are the same as text-conditioned DDPMs, except that the diffusion process operates in a latent space instead of pixel space.
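
To see why working in latent space helps, the arithmetic below compares the number of values the diffusion model has to process per image. The defaults (8x spatial downsampling, 4 latent channels) match the autoencoder of the original Stable Diffusion models and are used here only as an illustrative assumption; other latent diffusion models may use different values.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape (channels, height, width) of the latent for a given image size."""
    return (latent_channels, height // downsample, width // downsample)

def num_values(shape):
    """Total number of scalar values in a tensor of the given shape."""
    n = 1
    for d in shape:
        n *= d
    return n

pixel = (3, 512, 512)            # RGB image in pixel space
latent = latent_shape(512, 512)  # what the diffusion model actually denoises
ratio = num_values(pixel) / num_values(latent)

print(latent, ratio)
```

Under these assumptions the denoiser works on 48 times fewer values per image, which makes both training and sampling considerably cheaper.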

Some famous models based on the above architecture are:

  • Stable Diffusion
  • Titan Image Generator (AWS)
  • Midjourney

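Openly released model weights, such as those of Stable Diffusion, can be downloaded using Hugging Face's diffusers library (example code available here). Titan Image Generator and Midjourney are proprietary and are only accessible through their providers' services.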
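
A minimal sketch of loading an open latent diffusion model with the diffusers library. The model ID, step count, and guidance scale are assumptions chosen for illustration; running it requires the `torch` and `diffusers` packages, a multi-gigabyte weight download, and ideally a GPU. The imports sit inside the function so that the snippet can be loaded without those packages installed.

```python
def generate_image(prompt,
                   model_id="stabilityai/stable-diffusion-xl-base-1.0",  # assumed model ID
                   steps=30,
                   guidance_scale=7.5,
                   device="cuda"):
    """Download the pipeline for model_id and generate one image for the prompt."""
    import torch
    from diffusers import DiffusionPipeline

    # Half precision keeps GPU memory use manageable; use float32 on CPU.
    dtype = torch.float16 if device == "cuda" else torch.float32
    pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=dtype)
    pipe = pipe.to(device)

    # guidance_scale is the classifier-free guidance strength.
    result = pipe(prompt, num_inference_steps=steps, guidance_scale=guidance_scale)
    return result.images[0]  # a PIL image

# Example usage (not executed here because it downloads the model weights):
# image = generate_image("an astronaut riding a horse, watercolor")
# image.save("astronaut.png")
```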

Text2Speech Task (TTS) / Speech Synthesis

  • Idea: we perform text conditioning on models that can do the Speech2Speech task.
  • The Speech2Speech task can be solved using any generative model, such as
    a. GANs
    b. Diffusion (works best)
  • Some prominent models in this space are:
    a. AudioGen by Meta (Facebook)
    b. MusicGen by Meta (Facebook)
    c. Indic-TTS, developed by AI4Bharat, which can generate speech in multiple Indian languages.

Text2Video Task

  • Idea: we perform text conditioning on models that can do the Video2Video task.
  • The Video2Video task can be solved using any generative model, such as
    a. GANs
    b. Diffusion (works best)
  • Some prominent models in this space are:
    a. [2022] Make-A-Video by Meta => Here is a blog explaining this paper
    b. [2023] Stable Video Diffusion
    c. [early 2024] Lumiere by Google
    d. [late 2024] Sora by OpenAI
    e. [early 2025] Veo 2 by Google

Published via Towards AI

Note: Content contains the views of the contributing authors and not Towards AI.