TAI 129: Huge Week for Gen AI With o1, Sora, Gemini-1206, Genie 2, ChatGPT Pro and More!
Last Updated on December 10, 2024 by Editorial Team
Author(s): Towards AI Editorial Team
Originally published on Towards AI.
What happened this week in AI by Louie
This was an extremely busy week for generative AI model releases. In OpenAI's 12 days of Christmas, the company has so far launched a new $200 per month ChatGPT Pro subscription, its o1 and o1-Pro reasoning models, Sora Turbo (a text-to-video model), and a new LLM customization technique: reinforcement fine-tuning. Meanwhile, Google updated its Gemini model with gemini-exp-1206, which made large leaps on benchmarks (e.g., Livebench at 63.6 vs. Sonnet 3.5 at 58.5 and gemini-exp-1121 at 56.7) and reached the top of lmarena. Amazon launched its new "Nova" family of LLMs, Meta made large improvements to Llama with Llama 3.3 (the new 70B model matches the prior 405B model on many measures), and xAI released its first in-house text-to-image model. Outside of LLMs, Google also launched Genie 2, a model that can create playable 3D worlds from a single image, with hopes that this type of model could enable future agents to be trained and evaluated in virtual environments.
OpenAI moved its o1 reasoning model out of preview to mixed reception. We think the company made a strategic decision to release a smaller, faster, and more affordable model to launch, which means it has somewhat mixed benchmark results relative to o1-preview. The company also launched o1 pro mode, exclusively available to ChatGPT Pro subscribers, which uses additional inference time compute to achieve even greater reliability and precision. In testing, o1 pro demonstrated superior performance on rigorous benchmarks, including competition-level coding and advanced mathematical problem-solving, consistently producing correct answers across multiple attempts.
Sora Turbo, OpenAI's advanced video generation model, is now available to ChatGPT Plus and Pro users as a standalone product. The tool allows users to generate high-quality videos up to 1080p resolution, up to 20 seconds long, in multiple aspect ratios. ChatGPT Plus users can generate up to 50 videos at 480p resolution or fewer videos at 720p each month. It supports creative workflows by enabling users to extend, remix, and blend existing assets or create entirely new videos from text. Safety remains a priority and was likely a key reason for the long delay since the early demos, with visible watermarks, metadata for content verification, and strict moderation to prevent misuse, particularly in cases involving deepfakes.
Together with its new models, OpenAI introduced ChatGPT Pro, a premium plan costing $200 per month, much higher than its $20 per month ChatGPT Plus offer. Subscribers gain unlimited access to OpenAI's most advanced models, including OpenAI o1, o1-mini, GPT-4o, and Advanced Voice. Additionally, it offers the exclusive o1 pro mode, which uses enhanced compute power to deliver highly reliable and accurate responses, particularly for intricate tasks like competition-level coding, science, and advanced mathematics. Pro users also benefit from 10x higher limits for Sora, higher resolutions, and longer video durations.
Why should you care?
It is very hard to know which model releases this week will prove the most impactful in the long term! We think reinforcement fine-tuning (essentially customizing o1-like models to reason on specific tasks) is likely to become a very powerful new tool in the LLM Developer toolkit, and we are sure to add it to our comprehensive LLM Developer Conversion course once it is widely available. We also think Genie 2 could be the beginning of a very important new model series for agents and robotics. Gemini's recent huge incremental progress is making it an extremely compelling LLM, particularly on price vs. performance, while Amazon's new Nova models are a signal of intent that it also aims to be a competitor in the LLM field.
OpenAI's 12 days of OpenAI took the most press attention, however, with some justification, as o1 and Sora are both truly impressive models in different fields. These models are also getting more compute-intensive at the same time as OpenAI is more eager to scale revenue and monetization, and we're not sure which played the biggest part in OpenAI's launch of the $200 per month ChatGPT Pro option. These latest models are also not yet available via API, which further encourages monetization via ChatGPT. We hope this is not the beginning of the de-prioritization of the API! We think many people will be able to get $200 per month of value from ChatGPT Pro if they are willing to experiment, adapt their workflows to LLMs, and get educated on how to use them. This is particularly valuable for professionals working on complicated mathematics, science, finance, or legal analysis, or for creatives who could benefit from rapidly demoing new video ideas, but we think many, many people across the economy could figure out a use case that benefits them with some experimentation! However, with increasingly intense competition in the GenAI field (particularly with Google Gemini now a top-tier LLM and Amazon entering the arena, as well as rapid open-source progress at Meta and in China), we are not sure if OpenAI will be able to sustain this price point. It remains a key unanswered question who will be able to capture most of the huge value unlocked by the now inevitable GenAI-powered creativity and productivity gains: GPU companies, foundation model companies, customized LLM pipeline builders, enterprise adopters who unlock new products and productivity gains, or early individual users who gain an edge in their field?
Louie Peters, Towards AI Co-founder and CEO
Hottest News
1. OpenAI Introduced ChatGPT Pro
OpenAI has added ChatGPT Pro, a $200 monthly plan that enables scaled access to the best of OpenAI's models and tools. This plan includes unlimited access to OpenAI o1, o1-mini, GPT-4o, and Advanced Voice. It also includes o1 pro mode, a version of o1 that uses more compute to think harder and provide even better answers to the hardest problems.
2. OpenAI Introduced Sora Turbo
OpenAI has developed a new version of Sora, Sora Turbo, which is significantly faster than the model previewed in February. Users can generate videos up to 1080p resolution, up to 20 seconds long, and in widescreen, vertical, or square aspect ratios. Users can also bring their own assets to extend, remix, and blend or generate entirely new content from text.
3. Meta Announced the Release of Llama 3.3
Meta has announced the newest addition to its Llama family of generative AI models: Llama 3.3 70B. The Llama 3.3 instruction-tuned text-only model is optimized for multilingual dialogue use cases. Meta claims Llama 3.3 70B outperforms Google's Gemini 1.5 Pro, OpenAI's GPT-4o, and Amazon's newly released Nova Pro on several industry benchmarks, including MMLU.
4. OpenAI Unveils Reinforcement Fine-Tuning To Build Specialized AI Models for Complex Domains
OpenAI is expanding its custom AI training offerings with a new method called Reinforcement Fine-Tuning (RFT). This technique reinforces how the model reasons through similar problems and improves its accuracy on specific tasks in that domain. OpenAI aims to create specialized o1 models that can perform complex technical tasks with minimal training examples.
5. Google DeepMind Introduces Genie 2
DeepMind has unveiled a model that can generate an "endless" variety of playable 3D worlds. Genie 2, the successor to DeepMind's Genie, released earlier this year, can generate an interactive, real-time scene from a single image and text description. DeepMind claims that Genie 2 can generate a "vast diversity of rich 3D worlds," including worlds in which users can take actions like jumping and swimming by using a mouse or keyboard.
6. Amazon Launches Nova AI Model Family for Generating Text, Images, and Videos
AWS unveiled Amazon Nova, a new family of multimodal generative AI models designed for versatility and scale. Nova includes six models: four for text generation/understanding (Micro, Lite, Pro, Premier) and two for creative tasks (Canvas, Reel). Users can experiment, evaluate, and deploy Nova models on Bedrock.
7. Microsoft Copilot Vision Is Here
Microsoft is starting to test its new Copilot Vision feature. Initially unveiled in October, Copilot Vision allows Microsoft's AI companion to see what users see on an Edge webpage they're browsing. Users can then ask questions about the text, images, and content they're viewing or use it to assist them. Copilot Vision is now in testing for a limited number of Copilot Pro subscribers in the US.
8. Google Says AI Weather Model Masters 15-Day Forecast
DeepMind's GenCast AI system surpasses traditional weather models, notably the European Centre's ensemble, in forecast accuracy beyond one week. By merging diffusion models and ensemble forecasting, GenCast maintains high resolution while reducing compute demands. It excels in predicting extreme weather and improves wind power output forecasting, suggesting a promising hybrid approach to weather prediction.
9. Google's New Gemini Model Now Holds the #1 Spot in the Chatbot Arena Across All Domains
Google DeepMind's new Gemini-exp-1206 model has reclaimed the top spot on the Chatbot Arena leaderboard, surpassing OpenAI across multiple benchmarks while remaining completely free to use. The model excels in math, writing, and visuals and processes video with a 2M token window.
Five 5-minute reads/videos to keep you learning
1. How Good Are LLMs at Fixing Their Mistakes?
In this article, the author tests how effectively LLMs fix their mistakes when those mistakes are pointed out to them. It presents the process and findings of the experiment, which was built with Gradio on Spaces and uses Keras, JAX, and TPUs.
2. Agentic Design Patterns Part 1
In this article, Andrew Ng shares a framework for categorizing design patterns for building agents. It also sheds light on the evolution of AI that writes code, analyzed by several research teams. It focuses on an algorithm's ability to do well on the widely used HumanEval coding benchmark.
3. You Could Have Designed State of the Art Positional Encoding
This post walks you through the step-by-step discovery of state-of-the-art positional encoding in transformer models. It iteratively improves the approach to encoding position, arriving at Rotary Positional Encoding (RoPE), which is used in the latest Llama 3.2 release and most modern transformers. This post intends to limit the mathematical knowledge required to follow along, but some basic linear algebra, trigonometry, and an understanding of self-attention are expected.
This article breaks down the research findings from "Scaling Laws for Precision," a collaborative effort from leading institutions including Harvard University, Stanford University, and MIT, which has sparked significant discussion in the artificial intelligence community. The article dives into key concepts, technical implementation, theoretical framework, experimental design, etc.
5. Reward Hacking in Reinforcement Learning
Reward hacking in reinforcement learning arises when agents exploit reward function flaws to gain high rewards without achieving intended tasks. It's a major challenge, especially with language models using reinforcement learning from human feedback. This article presents a deep dive into reward hacking and recommends mitigation strategies.
6. A System of Agents Brings Service-As-Software to Life
Software is no longer merely assisting humans. It acts as an autonomous worker capable of understanding and evolving beyond human limitations. This article breaks down what it means to translate human services into AI-powered software, how the service-as-software idea gets implemented, and how the software will evolve from simple workflow automation to a System of Agents.
Repositories & Tools
- Papers in 100 Lines of Code contains the implementation of papers in 100 lines of code.
- Daytona is an open-source dev environment manager.
- Florence-VL is a family of multimodal large language models (MLLMs) with enriched visual representations produced by Florence-2, a generative vision foundation model.
Top Papers of The Week
1. Densing Law of LLMs
This paper introduces capacity density and shows that LLMs' capacity density grows exponentially over time. Capacity density provides a unified framework for assessing both model effectiveness and efficiency. Using some widely used benchmarks for evaluation, the capacity density of LLMs doubles approximately every three months.
2. GenCast: Diffusion-based ensemble forecasting for medium-range weather
This paper introduces GenCast, a probabilistic weather model with greater skill and speed than the world's top operational medium-range weather forecast. GenCast is a machine learning weather prediction (MLWP) method trained on decades of reanalysis data. GenCast generates stochastic 15-day global forecasts at 12-hour steps and 0.25-degree latitude-longitude resolution for over 80 surface and atmospheric variables in 8 minutes.
3. PaliGemma 2: A Family of Versatile VLMs for Transfer
Google DeepMind published PaliGemma 2, an enhanced Vision-Language Model (VLM) based on Gemma 2 models, integrating the SigLIP-So400m vision encoder. Trained at multiple resolutions, these models excel in transfer tasks, including OCR-related tasks, generating long descriptions, and achieving state-of-the-art results across diverse domains.
4. SNOOPI: Supercharged One-step Diffusion Distillation with Proper Guidance
SNOOPI introduces a robust framework for one-step diffusion models, enhancing training stability with Proper Guidance-SwiftBrush and supporting negative prompt guidance through Negative-Away Steer Attention. These advancements significantly improve performance across metrics and set a new state-of-the-art HPSv2 score of 31.08, addressing previous instabilities and expanding practical image generation capabilities.
5. Evaluating Language Models as Synthetic Data Generators
This paper proposes AgoraBench, a benchmark that provides standardized settings and metrics to evaluate LMs' data generation abilities. The research finds that LMs exhibit distinct strengths, that an LM's data generation ability doesn't necessarily correlate with its problem-solving ability, and that strategic choices in output format and cost-conscious model selection significantly impact data generation effectiveness.
6. VisionZip: Longer is Better but Not Necessary in Vision Language Models
This paper introduces VisionZip, a method that selects a set of informative tokens for input to the language model, reducing visual token redundancy and improving efficiency while maintaining model performance. It can be widely applied to image and video understanding tasks and is well-suited for multi-turn dialogues in real-world scenarios where previous methods tend to underperform.
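As a quick back-of-the-envelope check on the capacity density paper in the list above: if a quantity doubles roughly every three months, the implied growth over a longer window follows from simple compounding. The sketch below is only illustrative arithmetic; the function name `density_growth_factor` and the fixed three-month doubling period are our assumptions for the example, not something taken from the paper's code.

```python
def density_growth_factor(months: float, doubling_period_months: float = 3.0) -> float:
    """Multiplicative growth after `months`, assuming a constant doubling period.

    The 3-month default mirrors the "doubles approximately every three months"
    summary above; it is an illustrative assumption, not a fitted value.
    """
    return 2.0 ** (months / doubling_period_months)


if __name__ == "__main__":
    # Print the implied growth for a few time horizons.
    for months in (3, 6, 12, 24):
        print(f"{months:>2} months: x{density_growth_factor(months):g}")
```

Under this assumption, one year corresponds to four doublings, so the trend compounds quickly, which is why small changes in the estimated doubling period matter a lot for longer-range projections.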
Quick Links
1. Perplexity AI witnessed a surge in user engagement, serving 20 million daily queries, a significant leap from the 2.5 million daily queries recorded at the start of the year. As Perplexity co-founder and CEO Aravind Srinivas highlighted, "Slowly but surely, a new consumer habit is emerging: Plexing."
2. ChatGPT now has over 300 million weekly users. OpenAI CEO Sam Altman revealed the milestone during The New York Times' DealBook Summit. In a Twitter post, OpenAI also shared that 1B user messages were sent on ChatGPT every day and that 1.3M devs have built on OpenAI in the US.
3. Alibaba Speech Lab has introduced ClearerVoice-Studio, a comprehensive voice processing framework. It combines advanced features such as speech enhancement, separation, and audio-video speaker extraction. The FRCRN model is one of its standout components, recognized for its exceptional ability to enhance speech by removing background noise while preserving the natural quality of the audio.
Who's Hiring in AI
Senior Product Manager, Conversational AI Experiences @Moveworks (Mountain View, CA, USA)
Sr. Data Scientist / Machine Learning Engineer - GenAI & LLM @Databricks (Remote)
Python Developer (GenAI Acceleration Team) @Procter & Gamble (Warsaw, Poland)
Research Scientist Intern, PyTorch Core (PhD) @Meta (Menlo Park, CA, USA)
GenAI Platform Engineer, Applied Machine Learning @Apple (Sunnyvale, CA, USA)
Senior AI Test and Evaluation Engineer @Leidos (Remote/USA)
AI & GenAI Data Scientist-Senior Associate @PwC (Multiple Locations)
Interested in sharing a job opportunity here? Contact [email protected].
Think a friend would enjoy this too? Share the newsletter and let them join the conversation.
Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI, from research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.
Published via Towards AI