Name: Towards AI Legal Name: Towards AI, Inc. Description: Towards AI is the world's leading artificial intelligence (AI) and technology publication. Read by thought-leaders and decision-makers around the world. Phone Number: +1-650-246-9381 Email: [email protected]
228 Park Avenue South New York, NY 10003 United States
Website: Publisher: https://towardsai.net/#publisher Diversity Policy: https://towardsai.net/about Ethics Policy: https://towardsai.net/about Masthead: https://towardsai.net/about
Name: Towards AI Legal Name: Towards AI, Inc. Description: Towards AI is the world's leading artificial intelligence (AI) and technology publication. Founders: Roberto Iriondo, , Job Title: Co-founder and Advisor Works for: Towards AI, Inc. Follow Roberto: X, LinkedIn, GitHub, Google Scholar, Towards AI Profile, Medium, ML@CMU, FreeCodeCamp, Crunchbase, Bloomberg, Roberto Iriondo, Generative AI Lab, Generative AI Lab Denis Piffaretti, Job Title: Co-founder Works for: Towards AI, Inc. Louie Peters, Job Title: Co-founder Works for: Towards AI, Inc. Louis-François Bouchard, Job Title: Co-founder Works for: Towards AI, Inc. Cover:
Towards AI Cover
Logo:
Towards AI Logo
Areas Served: Worldwide Alternate Name: Towards AI, Inc. Alternate Name: Towards AI Co. Alternate Name: towards ai Alternate Name: towardsai Alternate Name: towards.ai Alternate Name: tai Alternate Name: toward ai Alternate Name: toward.ai Alternate Name: Towards AI, Inc. Alternate Name: towardsai.net Alternate Name: pub.towardsai.net
5 stars – based on 497 reviews

Frequently Used, Contextual References

TODO: Remember to copy unique IDs whenever it needs used. i.e., URL: 304b2e42315e

Resources

Take our 85+ lesson From Beginner to Advanced LLM Developer Certification: From choosing a project to deploying a working product this is the most comprehensive and practical LLM course out there!

Publication

A Unified Approach To Multimodal Learning👀
Latest   Machine Learning

A Unified Approach To Multimodal Learning👀

Last Updated on February 17, 2025 by Editorial Team

Author(s): Yash Thube

Originally published on Towards AI.

A Unified Approach To Multimodal Learning👀

We live in a world of multimodal data. Think about it: a restaurant review isn’t just text; it’s accompanied by images of the food, the ambiance, and maybe even the menu. This combination of text, images, and ratings is multimodal data, and it’s everywhere. Traditional machine learning models often struggle with this kind of data because each β€œmode” (text, image, rating) has its unique structure and characteristics. A picture is high-dimensional, text is sequential, and a rating is just a number. How do we effectively combine these disparate data types to make better predictions?

The paper β€œGenerative Distribution Prediction: A Unified Approach to Multimodal Learning” introduces a new framework called Generative Distribution Prediction (GDP) to tackle this challenge. GDP’s core idea is to use generative models to understand the underlying process that creates this multimodal data. Imagine an artist trying to recreate a scene. They don’t just copy it; they understand the relationships between the objects, the lighting, and the overall composition. Similarly, generative models learn the underlying structure of the data, allowing them to create synthetic data that resembles the real thing. This synthetic data, as we’ll see, is key to improving predictions.

📌The Problem: Multimodal Data is Messy

It offers a clever solution: it uses generative models to create synthetic data that captures the combined information from all modalities. Think of it as the model learning to β€œimagine” new restaurant reviews, complete with pictures and ratings, based on what it has already seen. By learning this generative process, the model gains a deeper understanding of the relationships between the different modes. This understanding then allows it to make better predictions on real data.

📌How GDP Works: Two-Step Process

It works in two main steps:

  • Constructing a Conditional Generator: This step focuses on building a generative model that can create synthetic data conditioned on specific input values. For example, the model might generate a synthetic restaurant review (text, image, rating) given a specific cuisine type and price range. This often involves transfer learning, where a pre-trained generative model is fine-tuned on the specific multimodal data. A key component here is the use of dual-level shared embeddings (DSE). Embeddings are a way of representing data as vectors of numbers, capturing semantic meaning. DSE creates shared embeddings at two levels, helping the model to learn relationships between different modalities and also adapt to new, unseen data (a process called domain adaptation).
  • Using Synthetic Data for Point Prediction: Once the conditional generator is trained, it can be used to create synthetic data for any given input. This synthetic data represents the possible responses associated with that input. The model then makes a prediction by finding the response that minimizes the prediction error on this synthetic data. This is like the model saying, β€œBased on what I’ve learned about how reviews are generated, this is the most likely rating for this restaurant.”

📌Why is it Better?

  • Unified Framework: It handles multimodal data within a single generative modeling framework, eliminating the need for separate models for each modality.
  • Mixed Data Types: It can handle different data types (text, images, tabular data) seamlessly, modeling the conditional distribution of the variables of interest.
  • Robustness and Generalizability: By training on synthetic data, GDP becomes more robust to noise and variations in the real data, improving its ability to generalize to new, unseen examples.

📌Key Contributions and Theoretical Foundations

The paper makes several important contributions:

  • GDP Framework: Introduces the GDP framework for multimodal supervised learning using generative models.
  • Theoretical Foundation: Provides theoretical guarantees for GDP’s predictive accuracy, especially when using diffusion models as the generative backbone. It analyzes two key factors: generation error (how different the synthetic data is from the real data) and synthetic sampling error (the error introduced by using a finite sample of synthetic data).
  • Domain Adaptation: Proposes a novel domain adaptation strategy using DSE to bridge the gap between different data distributions.

📌Multimodal Diffusion Models: The Generative Engine

A crucial component of GDP is the use of diffusion models as the generative engine. Diffusion models are a powerful type of generative model that works by gradually adding noise to data until it becomes pure noise, and then learning to reverse this process to generate data from noise. The paper introduces a specialized diffusion model for multimodal data, integrating structured tabular data with unstructured data like text and images through shared embeddings and a shared encoder-decoder architecture.

📌Numerical Examples and Results

The paper evaluates GDP on a variety of tasks, including:

  • Domain adaptation for Yelp reviews
  • Image captioning
  • Question answering
  • Adaptive quantile regression

The results consistently show that GDP outperforms traditional models and state-of-the-art methods in terms of predictive accuracy, robustness, and adaptability.

In simple terms, It is Like a master chef. Imagine a master chef who has tasted thousands of dishes. They don’t just memorize the recipes; they understand the complex interplay of flavors, textures, and ingredients. GDP is like that chef. It learns the underlying β€œrecipe” for multimodal data, allowing it to generate new β€œdishes” (synthetic data) and, more importantly, make better predictions about the real dishes it encounters. By understanding the generative process it unlocks a reasonable potential of multimodal data, leading to more accurate and robust predictions across a wide range of applications.

The future directions involve making GDP more computationally efficient, applying it to a wider range of problems, and developing a deeper theoretical understanding of its properties with various generative models.

Stay Curious☺️….See you in the next one!

Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming aΒ sponsor.

Published via Towards AI

Feedback ↓