TAI 130: DeepMind Responds to OpenAI With Gemini Flash 2.0 and Veo 2

Last Updated on December 18, 2024 by Editorial Team

Author(s): Towards AI Editorial Team

Originally published on Towards AI.

What happened this week in AI by Louie

AI model releases remained very busy in the run-up to Christmas, with DeepMind taking center stage this week with a very strong Gemini Flash 2.0 release and its Veo 2 video model. The Flash 2.0 model illustrates the progress made in inference efficiency and model distillation over the past year, together with Gemini’s progress in competing at the top of the leaderboards. For example, Flash 2.0’s MMMU image understanding score of 70.7% compares to 59.4% achieved by the far larger and more expensive Gemini 1.0 ultra almost exactly one year before. We also saw a strong update to Grok-2 this week, together with free access to everything on x.com. Microsoft also delivered an impressive update with Phi-4 — its model family focussed on pushing synthetic data generation to its limits. The 14bn parameter Phi-4 model achieved an MMLU Pro score of 70.4 vs Phi-3 14B at 51.3 and even beat the recently upgraded Llama 3.3 70B model at 64.4. OpenAI also continued its 12 days of announcements with focus on ChatGPT including features such as Canvas, Projects, video input in advanced voice mode and integration with iPhones.

Gemini 2.0 Flash Experimental is an updated multimodal model designed for agentic applications, capable of processing and generating text, images, and audio natively. In benchmark comparisons, it shows strong progress over its predecessors. For example, on the MMLU-Pro test of general understanding, Gemini 2.0 Flash Experimental achieves a score of 76.4%, a slight improvement over Gemini 1.5 Pro’s 75.8% (despite being a smaller and faster model) and a substantial gain compared to Gemini 1.5 Flash’s 67.3%. Similarly, on the MMMU image understanding test, Gemini 2.0 Flash Experimental reaches 70.7%, surpassing Gemini 1.5 Pro’s 65.9% and Gemini 1.5 Flash’s 62.3%.

Gemini 2.0 Flash Experimental supports a range of input/output modalities, offers structured outputs, and integrates tool use, including code execution and search. It can handle large input lengths (up to 1 million tokens) and produce outputs with up to 8,192 tokens while maintaining a high request throughput. The model’s native tool use and code execution features are intended to enhance reliability and adaptiveness, though current feedback shows some inconsistencies in accuracy and voice naturalness. Gemini also released a new Multimodal Live API with real-time audio and video-streaming input.

In a busy week at Google Deepmind, the company also announced Deep Research (a tool for researching complex topics within Gemini advanced), Veo 2 (text to video model) and Imagen 3 (text to image). Veo 2 is a video generation model capable of producing realistic motion and high-quality outputs, including 4K resolution video with reduced artifacts. It interprets and follows both simple and complex textual instructions accurately, simulating real-world physics in a variety of visual styles. Veo 2 supports a range of camera control options and maintains fidelity across diverse scenes and shot types, enhancing both realism and dynamic motion representation. In human evaluations on the MovieGenBench dataset, Veo 2 outperformed other top models in terms of overall preference and prompt-following capability.

Why should you care?

As the first release from the Gemini 2.0 family, Flash 2.0 may be the first glimpse we have of the next generation of LLMs using larger compute clusters (TPUs in this case) and compute budgets. This model likely benefits from model distillation from larger models in the 2.0 family and shows the huge progress made in inference costs this year. This new model aligns with a strategy focused on agentic experiences and interoperability with various inputs and tools. Gemini noted how it fits into agentic research prototypes like Project Astra, which examines the use of video input AI assistants in mobile and potential wearable devices, and Project Mariner, which explores browser-based agents. The strong capability now possible in low latency and low cost smaller tier models is particularly valuable for these agentic applications where many tokens may be used in large chains of prompts and where real-time responses can be key. These low costs are also important for reasoning models that scale inference time compute; this is now the key area where Gemini still lags behind OpenAI, and we expect to hear more from Gemini here in the future.

— Louie Peters — Towards AI Co-founder and CEO

Hottest News

1. Google Launched Gemini 2.0, Its New AI Model for Practically Everything

Google released Gemini 2.0 Flash, a multilingual and multimodal AI model capable of real-time conversation and image analysis. In addition to advances in multimodality — like native image and audio output, it allows native tool use, enabling developers to build new AI agents.

2. OpenAI Brings Video to ChatGPT Advanced Voice Mode

OpenAI’s ChatGPT Advanced Voice Mode now supports video and screenshare features, enabling users to interact visually through a phone camera. This update, previously audio-only, demonstrates ChatGPT’s ability to identify objects and guide tasks. It is currently available to ChatGPT Plus and Pro users.

3. Microsoft Launches Phi-4, a New Generative AI Model, in Research Preview

Microsoft introduced Phi-4, a 14B parameter small language model (SLM) that excels at complex reasoning in areas such as math and conventional language processing. It surpasses larger models, excelling in mathematics and outperforming GPT-4 in science and tech queries. Available soon on HuggingFace, Phi-4 achieved 91.8% on AMC tests, leading all models but showing practical limitations despite strong benchmarks.

4. Apple Releases Apple Intelligence and ChatGPT Integration in Siri

Apple’s iOS 18.2 update enhances iPhones, iPads, and Macs with Apple Intelligence features. The new update brings a whole host of Apple Intelligence features, including ChatGPT integration with Siri, Genmoji, Image Playground, and Visual Intelligence to the iPhone. It also adds language support for other regions, such as the UK and Australia, officially launching Apple’s AI in those countries.

5. Cohere AI Releases Command R7B

Command R7B is the smallest, fastest, and final model in the R Series. It is a versatile tool that supports a range of NLP tasks, including text summarization and semantic search. Its efficient architecture enables enterprises to integrate advanced language processing without the resource demands typically associated with larger models.

6. Google Unveiled Willow, a Quantum Computing Chip

Google announced Willow, a new quantum chip that outperformed even the world’s best supercomputer on an advanced test. The new chip can complete a complex computation in five minutes that would take the most powerful supercomputer 10 septillion years — more than the estimated age of the universe. Google researchers were also able to prove for the first time that the chip’s errors did not increase proportionately as the number of qubits rose.

7. OpenAI Launches ChatGPT Projects, Letting You Organize Files, Chats in Groups

OpenAI is rolling out a feature called “Projects” to ChatGPT. It’s a folder system that makes it easier to organize things you’re working on while using the AI chatbot. Projects keep chats, files, and custom instructions in one place.

8. Grok Is Now Free for All X Users

Grok is now available to free users on X. Several users noticed the change on Friday, which gives non-premium subscribers the ability to send up to 10 messages to Grok every two hours. TechCrunch reported last month that Musk’s xAI started testing a free version of Grok in certain regions. Making Grok more widely available might help it compete with the already-free chatbots like OpenAI’s ChatGPT, Google Gemini, Microsoft Copilot, and Anthropic’s Claude.

9. OpenAI Released the First Version of Sora

OpenAI is releasing Sora as a standalone product at Sora.com to ChatGPT Plus and Pro users. Sora, OpenAI’s text-to-video AI, enables users to create 1080p videos up to 20 seconds long. Sora features include video remixing and storyboards. However, videos carry watermarks.

Five 5-minute reads/videos to keep you learning

1. The Epic History of Large Language Models (LLMs)

This article breaks the evolution of RNN architecture into five stages: traditional encoder-decoder architecture, addition of attention mechanism in our traditional encoder-decoder architecture, transformers architecture, addition of techniques like transfer learning into the NLP domain, and finally, large language models (like ChatGPT).

2. Building Multimodal RAG Application #5: Multimodal Retrieval From Vector Stores

This article dives into the essentials of setting up multimodal retrieval using vector stores. It covers installing and configuring the LanceDB vector database, demonstrates how to ingest both text and image data into LanceDB using LangChain, and concludes with a practical walkthrough of performing multimodal retrieval, enabling efficient searches across both text and image data.

3. How To Build a Truly Useful AI Product

The traditional laws of “startup physics” — like solving the biggest pain points first or supporting users getting cheaper at scale — don’t fully apply when building AI products. And if your intuitions were trained on regular startup physics, you’ll need to develop some new ones in AI. This article shares a set of four principles for building AI products that every app-layer founder needs to know.

4. Run Gemini Using the OpenAI API

Google confirmed that its Gemini large language model is now mostly compatible with the OpenAI API framework. There are some limitations with features such as structured outputs and image uploading, but chat completions, function calls, streaming, regular question/response, and embeddings, work just fine. This article provides examples of Python code to show how it works.

5. AI Tooling for Software Engineers in 2024: Reality Check (Part 1)

A survey asked software engineers and engineering managers about their hands-on experience with AI tooling. This article provides an overview of the survey, popular software engineering AI tools, AI-assisted software engineering workflows, what’s changed since last year, and more.

Repositories & Tools

MarkItDown is a Python tool for converting files and office documents to Markdown.
HunyuanVideo is a systematic framework for a large video generation model.
DeepSeek-VL2 is an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models.
TEN Agent is a conversational AI powered by TEN, integrating Gemini 2.0 Multimodal Live API, OpenAI Realtime API, RTC, and more.

Top Papers of The Week

1. Phi-4 Technical Report

This is the technical report for phi-4, a 14-billion-parameter language model. By strategically integrating synthetic data during training, it excels in STEM-focused QA capabilities. Despite retaining the phi-3 architecture, it outperforms its predecessors due to enhanced data quality, a refined training curriculum, and advanced post-training innovations. It surpasses GPT-4, particularly in reasoning-focused benchmarks.

2. ReFT: Representation Finetuning for Language Models

This research develops a family of Representation Finetuning (ReFT) methods. ReFT methods operate on a frozen base model and learn task-specific interventions on hidden representations. The research also defines a strong instance of the ReFT family, Low-rank Linear Subspace ReFT (LoReFT). Both are drop-in replacements for existing PEFTs and learn interventions that are 15x — 65x more parameter-efficient than LoRA.

3. Training Large Language Models To Reason in a Continuous Latent Space

This paper introduces Coconut, a novel reasoning paradigm for LLMs that operates in a continuous latent space. Coconut enhances reasoning by utilizing the last hidden state as a continuous thought, enabling advanced reasoning patterns like breadth-first search. It outperforms traditional chain-of-thought approaches in logical tasks with substantial backtracking, demonstrating the promise of latent reasoning.

4. GenEx: Generating an Explorable World

This paper introduces GenEx, a system for 3D world exploration that uses generative imagination to generate high-quality, 360-degree environments from minimal inputs like a single RGB image. GenEx enables AI agents to perform complex tasks with predictive expectations by simulating outcomes and refining beliefs. By advancing embodied AI in imaginative spaces with real-world applications, GenEx advances embodied AI in imaginative spaces.

5. FlashAttention on a Napkin: A Diagrammatic Approach to Deep Learning IO-Awareness

This paper proposes a diagrammatic approach to optimizing deep learning algorithms with IO-awareness, achieving up to sixfold performance improvements like FlashAttention. By efficiently managing data transfers and harnessing GPU features, their method generates pseudocode for Ampere and Hopper architectures. It enhances energy efficiency and performance by reducing GPU energy costs from transfer bandwidth, which currently consumes 46%.

Quick Links

1. Harvard and Google to release 1 million public-domain books as AI training datasets. This dataset includes 1 million public-domain books spanning genres, languages, and authors, including Dickens, Dante, and Shakespeare, which are no longer copyright-protected due to age.

2. Meta is releasing an AI model called Meta Motivo, which could control the movements of a human-like digital agent, potentially enhancing the Metaverse experience. The company said that Meta Motivo addresses body control problems commonly seen in digital avatars, enabling them to perform more realistic and human-like movements.

3. Pika Labs has launched Pika 2.0, the advanced AI video model that is a new step towards creative AI video production. This forward-looking release combines crisp text alignment with freshly introduced Scene Ingredients in the Pika Labs‘ web application. Compared to earlier versions, it adds deeper flexibility and sharper detail.

Who’s Hiring in AI

Machine Learning & Computer Vision Engineer @Corning Incorporated (Remote)

Research Instructor @University of Colorado (Hybrid/Colorado, USA)

Artificial Intelligence Engineer @Fortive Corporation (Hybrid/Bengaluru, India)

Sr. AI Linguist @LinkedIn (Hybrid/Mountain View, CA, USA)

Lead AI Engineer @Capital One Services, LLC (Multiple US Locations)

Senior Generative AI Data Scientist, Amazon SageMaker @Amazon (Seattle, WA, USA)

Machine Learning Research Engineer Intern @Texas Instruments (Dallas, TX, USA)

Software Engineer, Generative AI Engineering (Internship) @Woven by Toyota (Tokyo, Japan)

Interested in sharing a job opportunity here? Contact [email protected].

Think a friend would enjoy this too? Share the newsletter and let them join the conversation.

Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.

Published via Towards AI

Frequently Used, Contextual References

Resources

Publication

TAI 130: DeepMind Responds to OpenAI With Gemini Flash 2.0 and Veo 2

Author(s): Towards AI Editorial Team

What happened this week in AI by Louie

Why should you care?

Hottest News

Five 5-minute reads/videos to keep you learning

Repositories & Tools

Top Papers of The Week

Quick Links

Who’s Hiring in AI

Feedback ↓ Cancel reply

Popular posts

Best Laptops for Deep Learning, Machine Learning (ML), and Data Science for 2023

Best Workstations for Deep Learning, Data Science, and Machine Learning (ML) for 2022

Descriptive Statistics for Data-driven Decision Making with Python

Best Machine Learning (ML) Books - Free and Paid - Editorial Recommendations for 2022

Best Data Science Books - Free and Paid - Editorial Recommendations for 2022

Updates

Recent Posts

The Top 10 AI Research Papers of 2024: Key Takeaways and How You Can Apply Them

The Top 10 AI Research Papers of 2024: Key Takeaways and How You Can Apply Them

The Top 10 AI Research Papers of 2024: Key Takeaways and How You Can Apply Them

The Top 10 AI Research Papers of 2024: Key Takeaways and How You Can Apply Them

The Top 10 AI Research Papers of 2024: Key Takeaways and How You Can Apply Them

The World’s Leading AI and Technology Publication.

Company

CONTACT US

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Frequently Used, Contextual References

Resources

Publication

TAI 130: DeepMind Responds to OpenAI With Gemini Flash 2.0 and Veo 2

Author(s): Towards AI Editorial Team

What happened this week in AI by Louie

Why should you care?

Hottest News

Five 5-minute reads/videos to keep you learning

Repositories & Tools

Top Papers of The Week

Quick Links

Who’s Hiring in AI

Related posts

Feedback ↓ Cancel reply

Popular posts

Updates

Recent Posts

The World’s Leading AI and Technology Publication.

Company

CONTACT US

GDPR CCPA Statement