Inside OpenAI Sora: Five Key Technical Details We Learned About the Amazing Video Generation Model
Last Updated on February 21, 2024 by Editorial Team
Author(s): Jesus Rodriguez
Originally published on Towards AI.
I recently started an AI-focused educational newsletter that already has over 165,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing below:
TheSequence | Jesus Rodriguez | Substack
The best source to stay up-to-date with the developments in machine learning, artificial intelligence, and data…
thesequence.substack.com
Last week, OpenAI unveiled its latest work on generative video models with Sora, a remarkable text-to-video model that is able to generate up to a minute of high-quality video. The release took the generative AI world by storm, with extensive debate on X and in media publications.
After all, the videos generated by Sora are shockingly impressive.
From the technical side, OpenAI hasn't published many details, but some key points were highlighted as part of the release. Let's review some of them.
Inside Sora
Breaking Away from Tradition
In text-to-video models, researchers traditionally have explored various techniques, including recurrent networks, generative adversarial networks, autoregressive transformers, and diffusion models. These methods often target specific types of visual content, focusing on shorter clips or videos of a standard size. OpenAI Sora, however, breaks away from these limitations by being a versatile model capable of producing videos and images across a wide range of durations, aspect ratios, and resolutions, achieving up to a minute of high-definition video.
The Model
OpenAI Sora operates as a diffusion model, beginning the video generation process with an initial state resembling static noise and progressively refining this by diminishing the noise through numerous steps. Sora boasts the capability to create videos in their entirety in a single process or to augment existing videos, enhancing their length. This innovation addresses the complex challenge of maintaining consistent subjects in a video, even when they momentarily disappear from the frame.
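To make the denoising idea concrete, here is a minimal, purely illustrative sketch of a reverse-diffusion sampling loop in Python. The `denoiser` model, the number of steps, and the update rule are placeholder assumptions for illustration, not OpenAI's actual implementation (text conditioning is omitted for brevity).

```python
import torch

def sample_video_latents(denoiser, shape, num_steps=50, device="cpu"):
    """Illustrative reverse-diffusion loop: start from pure noise and
    progressively remove it over `num_steps` denoising steps.
    `denoiser` is a hypothetical model that predicts the noise present
    in the current latent given the current step index."""
    x = torch.randn(shape, device=device)          # initial state: static noise
    for step in reversed(range(num_steps)):
        t = torch.full((shape[0],), step, device=device)
        predicted_noise = denoiser(x, t)           # model estimates the remaining noise
        x = x - predicted_noise / num_steps        # crude update toward a clean latent
    return x                                       # denoised video latents
```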
Echoing the design principles of GPT models, Sora is built upon a transformer architecture, which facilitates remarkable scalability in its performance. The model treats videos and images as assemblies of smaller data segments, referred to as patches. These patches are comparable to the tokens used in GPT models, enabling a unified approach to data representation. This strategy allows Sora to be trained on a more diverse set of visual data, covering a broad spectrum of durations, resolutions, and aspect ratios.
Sora draws inspiration from the foundational work of DALL·E and GPT models, incorporating the recaptioning technique from DALL·E 3. This method involves creating detailed captions for visual training data, enhancing the model's ability to adhere closely to the textual directions provided by users in the videos it generates.
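As a rough illustration of what recaptioning-style prompt expansion could look like at inference time, here is a hedged sketch. The `llm` callable and the instruction text are hypothetical; OpenAI has not published the exact prompts it uses.

```python
def expand_user_prompt(llm, short_prompt: str) -> str:
    """Hypothetical helper mirroring the DALL·E 3 recaptioning idea at inference:
    an LLM rewrites a short user request into a longer, highly descriptive caption,
    matching the detailed captions the model saw during training."""
    instruction = (
        "Rewrite the following video request as a single detailed caption, "
        "describing subjects, setting, camera motion, and lighting:\n"
    )
    return llm(instruction + short_prompt)
```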
Some Key Contributions
Below, we list a few of Sora's key contributions to text-to-video modeling techniques. The model applies a series of techniques that seem to be the cornerstone of its high-quality outputs.
1) Transforming Visual Data into Patches
Drawing inspiration from the advancements in large language models (LLMs), which have gained generalist capabilities through training on vast amounts of internet data, the developers of Sora applied a similar principle to visual data. Just as LLMs utilize tokens to process diverse forms of text, Sora employs visual patches. This approach has been found to be a scalable and effective way to train generative models on a variety of video and image types.
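The patch idea can be sketched in a few lines. The snippet below splits a video tensor into non-overlapping spatial patches per frame and flattens them into a token-like sequence; the patch size and tensor layout are assumptions for illustration, not Sora's actual preprocessing.

```python
import torch

def video_to_patches(video, patch_size=16):
    """Illustrative patchification: split a video tensor of shape
    (frames, channels, height, width) into non-overlapping spatial patches,
    flattened into a sequence of 'visual tokens' analogous to text tokens."""
    f, c, h, w = video.shape
    p = patch_size
    patches = video.unfold(2, p, p).unfold(3, p, p)       # (f, c, h//p, w//p, p, p)
    patches = patches.permute(0, 2, 3, 1, 4, 5)           # group values by patch location
    return patches.reshape(f * (h // p) * (w // p), c * p * p)   # (num_patches, patch_dim)
```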
2) Video Compression Network
To manage the complexity of visual data, Sora includes a network designed to compress the data both temporally and spatially. This process involves converting raw video into a latent representation, which Sora is then trained to generate. A decoder model is also developed to transform these latent representations back into visual form, enabling the creation of detailed images and videos.
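Below is a minimal sketch of what such a compression network could look like: a small 3D-convolutional autoencoder that downsamples both time and space into a latent, plus a decoder that maps latents back to pixels. The layer choices and channel counts are placeholders; OpenAI has not disclosed the real architecture.

```python
import torch.nn as nn

class VideoAutoencoder(nn.Module):
    """Minimal sketch of a spatio-temporal compression network: a 3D-conv
    encoder maps raw video (batch, channels, frames, height, width) to a
    smaller latent, and a decoder maps latents back to pixel space."""
    def __init__(self, latent_channels=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=2, padding=1),   # downsample time + space
            nn.SiLU(),
            nn.Conv3d(32, latent_channels, kernel_size=3, stride=2, padding=1),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_channels, 32, kernel_size=4, stride=2, padding=1),
            nn.SiLU(),
            nn.ConvTranspose3d(32, 3, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, video):
        latent = self.encoder(video)       # compressed latent representation
        recon = self.decoder(latent)       # decoded back to pixels
        return latent, recon
```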
3) Spacetime Latent Patches
Sora treats compressed video data by extracting sequences of spacetime patches, similar to how transformers use tokens. This method is applicable to images as well, considering them as single-frame videos. By using patches, Sora can handle training data with varying resolutions, durations, and aspect ratios. During the generation process, the size of the output video can be adjusted by organizing the patches in a grid of the desired dimensions.
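A tiny sketch of the grid idea: the requested output size simply determines how many spacetime patches are laid out along the time, height, and width axes. The patch dimensions below are arbitrary placeholders.

```python
def patch_grid_shape(num_frames, height, width, patch=(4, 16, 16)):
    """Illustrative helper: the size of the output video is controlled by how
    many spacetime patches are arranged along each axis."""
    pf, ph, pw = patch
    return (num_frames // pf, height // ph, width // pw)   # (time, rows, cols) of patches

# A widescreen clip and a vertical clip are just different grid shapes
# over the same kind of patch sequence:
print(patch_grid_shape(64, 1080, 1920))   # (16, 67, 120)
print(patch_grid_shape(64, 1920, 1080))   # (16, 120, 67)
```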
4) Scaling Transformers for Video Generation
At its core, Sora incorporates a diffusion model approach within a transformer architecture. This setup allows Sora to start with noisy patches and, through training, learn to predict their original, unaltered state. The use of transformers enables Sora to scale effectively across different types of visual data generation tasks.
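The training objective can be sketched as follows: corrupt the clean spacetime patches with noise at a random step and train the transformer to recover the originals. The `transformer` signature and the simple linear noise schedule are simplifying assumptions, not the actual recipe.

```python
import torch

def diffusion_training_step(transformer, clean_patches, text_emb, steps=1000):
    """Illustrative diffusion-transformer training step: mix noise into the
    clean patches and regress the model's output toward the originals.
    `transformer` is a hypothetical model taking (noisy_patches, step, text_emb)."""
    t = torch.randint(0, steps, (clean_patches.shape[0],))
    noise = torch.randn_like(clean_patches)
    alpha = (1.0 - t.float() / steps).view(-1, 1, 1)       # simple linear schedule
    noisy = alpha * clean_patches + (1 - alpha) * noise    # corrupted patches
    pred = transformer(noisy, t, text_emb)                 # predict the clean patches
    loss = torch.mean((pred - clean_patches) ** 2)         # regression to the original
    return loss
```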
5) Flexibility in Output
Unlike previous models that required standardizing video dimensions, Sora benefits from training on data in its native size. This flexibility allows for the generation of videos in a wide range of sizes and aspect ratios, from widescreen formats to vertical orientations, making it suitable for various devices and platforms. Additionally, Sora supports rapid prototyping at lower resolutions before generating content at full resolution, all within the same framework.
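One common way to train on data at native size, sketched below as an assumption rather than OpenAI's confirmed approach, is to batch clips that already share the same shape so nothing needs to be cropped or resized.

```python
from collections import defaultdict

def bucket_by_shape(clips):
    """Illustrative batching strategy for native-size training: group clips
    that share a (frames, height, width) shape so each batch keeps its
    original duration and aspect ratio. `clips` is a list of array-like
    objects (e.g. torch tensors)."""
    buckets = defaultdict(list)
    for clip in clips:
        buckets[tuple(clip.shape)].append(clip)
    return buckets   # each bucket can form a batch without resizing
```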
These are just a few high-level technical details about Sora. OpenAI should be publishing a more detailed technical report soon that should shed more light on the magic behind the video generation model.
Published via Towards AI