
Inside World Models and V-JEPA: Building AI That Predicts Reality

Last Updated on September 29, 2025 by Editorial Team

Author(s): Abinaya Subramaniam

Originally published on Towards AI.

Artificial intelligence has dazzled the world with its ability to generate text, images, and even music. Large Language Models (LLMs) like GPT and multimodal systems that combine text, vision, and audio have pushed the boundaries of what machines can do. Yet, despite these advances, today’s most powerful AI still lacks one fundamental capability: understanding and predicting how the world evolves over time.

Thumbnail — Image by Author

This missing ingredient is where world models step in. They are not just another type of model. They represent a deeper shift in AI research, bringing us closer to systems that can reason, imagine, and plan like humans.

What Exactly Are World Models?

A world model is an AI system that builds an internal representation of the environment. Instead of only reacting to raw inputs, it creates a kind of mental simulator that can answer questions like:

  • If I take this action, what will happen next?
  • What might the world look like a few seconds from now?
  • Which choice leads me to my goal most effectively?

Humans rely on this constantly. When you picture a ball rolling across the floor, you can predict it will slow down and stop without needing to physically test it. In the same way, a world model gives machines the ability to simulate the consequences of actions before taking them.

How Do World Models Differ from LLMs and Multimodal LLMs?

To understand the leap that world models represent, let’s compare them to the AI systems dominating headlines today.

Image by Author
  • LLMs are trained to predict the next word in a sequence of text. Their power lies in recognizing vast patterns in language.
  • Multimodal LLMs extend this to multiple domains (text, images, sometimes audio and video), mapping between them, for instance turning a caption into an image or describing a picture in words.

But both are still essentially pattern recognizers. They do not simulate the physics of reality or reason about cause and effect.

By contrast, world models are built around dynamics: they care about how states change over time. Rather than saying, “That’s a ball,” a world model asks, “What will the ball do if it’s pushed?” This difference makes world models essential for AI agents that must act in and adapt to the real world, such as robots, autonomous cars, and interactive assistants.

How Do World Models Function?

The operation of a world model can be broken down into three main components:

  1. Perception (Encoding the World)
    The system takes raw sensory inputs such as video frames or sensor data and compresses them into a simplified, abstract representation.
  2. Prediction (Modeling Dynamics)
    It learns how the environment evolves by predicting the next state given the current one, and sometimes an action. This is where imagination happens. The system simulates possible futures internally.
  3. Planning and Control
    Using these predictions, an agent can explore multiple imagined futures, compare them, and decide on the best course of action without trial and error in the real world.

This approach makes world models both data-efficient and safer, since much of the learning happens in imagination.
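
To make the loop concrete, here is a minimal sketch of how these three components could fit together. The names `encode`, `predict_next`, and `reward` are hypothetical placeholders for a learned encoder, a learned dynamics model, and a task objective, not any particular system’s API:

```python
import numpy as np

def encode(observation):
    """Perception: compress a raw observation into a compact latent state."""
    return np.asarray(observation, dtype=np.float32)  # placeholder encoder

def predict_next(state, action):
    """Prediction: learned dynamics mapping (state, action) -> next state."""
    return state + 0.1 * action  # toy linear dynamics for illustration

def reward(state, goal):
    """Task objective: negative distance to a goal state."""
    return -np.linalg.norm(state - goal)

def plan(observation, goal, candidate_actions, horizon=5):
    """Planning: roll each candidate action sequence out in imagination,
    then return the first action of the best imagined future."""
    state0 = encode(observation)
    best_action, best_return = None, -np.inf
    for actions in candidate_actions:            # each is a sequence of actions
        state, total = state0, 0.0
        for action in actions[:horizon]:
            state = predict_next(state, action)  # simulated, no real-world step
            total += reward(state, goal)
        if total > best_return:
            best_action, best_return = actions[0], total
    return best_action

# Example: pick among three constant-action plans toward goal [1, 0].
plans = [[np.array([1.0, 0.0])] * 5,
         [np.array([0.0, 1.0])] * 5,
         [np.array([-1.0, 0.0])] * 5]
print(plan(observation=[0.0, 0.0], goal=np.array([1.0, 0.0]), candidate_actions=plans))
```

With a learned dynamics model in place of the toy one, the same loop lets an agent compare imagined futures without ever touching the real environment.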

The Internal Architecture of a World Model

While designs vary, most world models share a common architecture:

  • An encoder transforms raw inputs (like images or video patches) into compact embeddings.
  • A dynamics module predicts how these embeddings evolve over time.
  • A decoder reconstructs predicted states when needed.
  • A controller learns to use the predictions for decision-making.

A landmark example is the 2018 “World Models” paper by Ha and Schmidhuber, where an agent learned to play games by building a compact visual encoder, combining it with a recurrent network for prediction, and training a simple controller, all inside the agent’s imagined environment.
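
As a rough sketch of that decomposition (layer choices and sizes here are illustrative assumptions, not the paper’s exact configuration), the three modules might look like this in PyTorch:

```python
import torch
import torch.nn as nn

class VisionEncoder(nn.Module):      # "V": compresses frames into latents
    def __init__(self, latent_dim=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.fc = nn.LazyLinear(latent_dim)

    def forward(self, frame):
        return self.fc(self.conv(frame))

class DynamicsModel(nn.Module):      # "M": predicts how latents evolve
    def __init__(self, latent_dim=32, action_dim=3, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(latent_dim + action_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, latent_dim)

    def forward(self, latents, actions, state=None):
        out, state = self.rnn(torch.cat([latents, actions], dim=-1), state)
        return self.head(out), state  # predicted next latents

class Controller(nn.Module):         # "C": small policy over latent + memory
    def __init__(self, latent_dim=32, hidden=256, action_dim=3):
        super().__init__()
        self.fc = nn.Linear(latent_dim + hidden, action_dim)

    def forward(self, latent, hidden):
        return torch.tanh(self.fc(torch.cat([latent, hidden], dim=-1)))
```

The key design choice is that the controller is kept tiny: almost all the learning capacity lives in the encoder and the dynamics model, so control can be trained cheaply inside the imagined environment.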

V-JEPA: Meta’s Leap into Predictive World Models

One of the most exciting modern developments is V-JEPA (Video Joint Embedding Predictive Architecture), introduced by Meta AI. Unlike models that focus on labeling objects or predicting the next pixel, V-JEPA is trained to predict missing parts of a video at an abstract level.

Overview of the V-JEPA 2 architecture. Adapted from Assran et al., 2025

This makes it a true step toward learning the underlying rules of the world, not by supervision, but by observation and prediction.

How V-JEPA Works: A Step-by-Step Look

The mechanics of V-JEPA can be understood as a pipeline:

  1. Breaking Down Video into Frames
    The model begins with raw video input, which is divided into individual frames.
  2. Dividing Frames into Grids
    Each frame is partitioned into small patches, similar to how transformers treat words in text. These patches form the “tokens” of the visual world.
  3. Masking Regions
    Some patches are deliberately hidden (masked), either across space (parts of a frame) or time (entire frames). The challenge is to predict the missing information.
  4. Grouping Context and Targets
  • Context patches: The visible, unmasked regions.
  • Target patches: The masked parts that must be reconstructed.
Architecture — Adapted from Assran et al., 2025.

  5. Feature Extraction
    Both sets of patches are embedded into feature vectors, often with convolutional layers that capture local patterns.
  6. Adding Temporal and Positional Cues
    Information about when and where each patch belongs is included so the system understands spatial layout and motion.
  7. Encoding Context
    The unmasked patches are passed through an encoder, producing a condensed representation of the observed world.
  8. Prediction
    A predictor network uses the context encoding to generate embeddings for the masked target patches. Importantly, it predicts embeddings, not pixels, which keeps the model focused on abstract world dynamics rather than surface detail.
  9. Target Encoder for Validation
    The actual masked patches are processed through a separate encoder to generate ground-truth embeddings.
  10. Comparison and Learning
    The model’s predicted embeddings are compared to the true ones. Differences are used to adjust the network weights, improving future predictions.

Through this loop, V-JEPA learns to build a strong internal model of how the world unfolds over time.
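
The sketch below compresses steps 1–10 into a single training step. It is a deliberately simplified stand-in: tiny MLPs replace V-JEPA’s vision transformers, the masking scheme, dimensions, and pooled predictor input are assumptions rather than Meta’s recipe, but the context-encoder / predictor / target-encoder flow and the embedding-space loss mirror the pipeline above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

patch_dim, embed_dim, num_patches = 192, 128, 64  # illustrative sizes

context_encoder = nn.Sequential(nn.Linear(patch_dim, embed_dim), nn.GELU(),
                                nn.Linear(embed_dim, embed_dim))
predictor = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.GELU(),
                          nn.Linear(embed_dim, embed_dim))
# The target encoder mirrors the context encoder; in practice its weights are
# an exponential moving average of the context encoder and receive no gradients.
target_encoder = nn.Sequential(nn.Linear(patch_dim, embed_dim), nn.GELU(),
                               nn.Linear(embed_dim, embed_dim))
for p in target_encoder.parameters():
    p.requires_grad = False

def training_step(video_patches, mask_ratio=0.5):
    """video_patches: (batch, num_patches, patch_dim) flattened spacetime patches."""
    # Steps 1-4: split patches into visible context and hidden targets.
    num_masked = int(mask_ratio * num_patches)
    perm = torch.randperm(num_patches)
    target_idx, context_idx = perm[:num_masked], perm[num_masked:]

    # Step 7: encode only the visible context patches.
    ctx = context_encoder(video_patches[:, context_idx])

    # Step 8: predict embeddings (not pixels) for the masked patches.
    # (The real predictor attends over the context with target-position queries;
    # mean-pooling here is a crude simplification.)
    pred = predictor(ctx.mean(dim=1, keepdim=True)).expand(-1, num_masked, -1)

    # Step 9: ground-truth embeddings from the frozen target encoder.
    with torch.no_grad():
        tgt = target_encoder(video_patches[:, target_idx])

    # Step 10: compare in embedding space and learn from the difference.
    return F.mse_loss(pred, tgt)

loss = training_step(torch.randn(8, num_patches, patch_dim))
loss.backward()
```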

Why Predict Embeddings Instead of Pixels?

Predicting pixels often leads to blurry outputs and wasted effort on irrelevant detail (like background textures). By working in embedding space, V-JEPA avoids this trap, focusing instead on meaningful structure and causality: how objects move, interact, and change.
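
A toy numerical illustration of why this matters (a purely hypothetical setup): two frames that show the same ball position but different random background textures are far apart in pixel space, yet identical under an embedding that keeps only the task-relevant structure.

```python
import numpy as np

def frame(ball_x, noise_seed):
    """A 32x32 frame: random background texture plus one bright 'ball' pixel."""
    rng = np.random.default_rng(noise_seed)
    img = rng.normal(0.0, 1.0, (32, 32))  # irrelevant background texture
    img[16, ball_x] = 10.0                # the one pixel that matters
    return img

a = frame(ball_x=8, noise_seed=1)
b = frame(ball_x=8, noise_seed=2)

# A stand-in "embedding" that keeps only task-relevant structure: ball position.
embed = lambda img: np.array([float(np.argmax(img[16]))])

print(np.mean((a - b) ** 2))                # large: pixel loss punishes texture
print(np.mean((embed(a) - embed(b)) ** 2))  # zero: the embeddings agree
```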

Use Cases of World Models and V-JEPA

The potential applications are vast:

  • Robotics: Robots can predict how objects will behave before physically interacting with them, reducing accidents and improving learning efficiency.
  • Autonomous Driving: Cars can foresee pedestrian movement or traffic patterns seconds into the future.
  • Gaming and Virtual Worlds: AI players can simulate countless possible futures before choosing a move, achieving superhuman strategy.
  • Healthcare: World models could help simulate disease progression or drug interactions, offering safer medical insights.
  • Personal AI Assistants: Beyond answering queries, assistants could simulate consequences: “If you reschedule your meeting, here’s how your evening will shift.”

Final Thoughts: Toward AI with Imagination

World models mark a shift from recognition to simulation. Where LLMs are fluent storytellers, world models are imaginative planners. By learning to predict the unseen and simulate futures, they bring AI closer to the way humans learn and reason.

Projects like V-JEPA show that self-supervised prediction on video is a powerful way to build these mental simulators. The road ahead likely combines the best of both worlds: LLMs for language and knowledge, and world models for reasoning and planning.

Together, they could form the foundation of the next generation of AI systems that not only talk about the world but also understand, imagine, and act within it.

Published via Towards AI

