TAI #109: Cost and Capability Leaders Switching Places With GPT-4o Mini and Llama 3.1?
Author(s): Towards AI Editorial Team
Originally published on Towards AI.
What happened this week in AI by Louie
This was another huge week for foundation LLMs, with the release of GPT-4o mini, the leak of Llama 3.1 model scorecards, new models from Mistral, and a 2.5T-token high-quality open dataset released alongside Apple's DCLM-7B model. In a reversal of fortunes that would have been unexpected at least six months ago, the closed-API 4o mini has overtaken Llama 3.0 on cost-effectiveness, while, according to the leaked model scorecards, Llama 3.1 405B actually beats the leading closed API models on capability (at least on some metrics, such as MMLU-Pro). Elon Musk also signaled that xAI has truly entered the race: Grok 2 is due in August, and a data center with 100k H100 GPUs on a single RDMA fabric began training Grok 3 this week, aiming for a December release.
OpenAI's new GPT-4o mini release was a highlight this week, bringing much faster speeds and a 29x lower cost relative to GPT-4o and, surprisingly, a price almost 3x lower than GPT-3.5 Turbo (note: these prices assume a 3:1 input-to-output token ratio). GPT-3.5 Turbo had almost no use cases left relative to cheaper, smarter models elsewhere (both open and closed), so a new low-price tier from OpenAI was long overdue. The new model also supports up to 16k output tokens, a very significant upgrade in our view that helps use cases such as translation and code or style conversion. This new model is extremely attractive for many applications, although GPT-4o is still likely to make more sense for complex, coding, and image tasks. Sam Altman stated the new model is already processing 200 billion tokens per day.
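As a back-of-the-envelope check on these ratios, here is a minimal worked example. It assumes the per-million-token API prices published at the time of writing (treat them as assumptions, since pricing changes frequently) and the 3:1 input-to-output blend noted above:

```python
# Blended price per million tokens, assuming a 3:1 input-to-output token ratio.
# Prices are USD per 1M tokens (input, output) as published at the time of
# writing; treat them as assumptions, since API pricing changes frequently.
PRICES = {
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (5.00, 15.00),
    "gpt-3.5-turbo": (0.50, 1.50),
    "gpt-4 (launch, 2023)": (30.00, 60.00),
}

def blended_price(input_price: float, output_price: float, ratio: float = 3.0) -> float:
    """Average cost per 1M tokens when `ratio` input tokens accompany each output token."""
    return (ratio * input_price + output_price) / (ratio + 1)

mini = blended_price(*PRICES["gpt-4o-mini"])
for name, (inp, out) in PRICES.items():
    p = blended_price(inp, out)
    print(f"{name:22s} ${p:7.4f}/M blended  ({p / mini:5.1f}x GPT-4o mini)")
# GPT-4o comes out ~29x, GPT-3.5 Turbo ~3x, and launch-day GPT-4 ~143x,
# matching the ~140x figure discussed further below.
```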
The new GPT-4o mini pricing also makes it 3-4 times cheaper than Llama 3.0 70B (via Together AI) while delivering mostly superior metrics. Although this challenges open-source model adoption, open-source models still offer data security and privacy advantages, as well as greater flexibility for fine-tuning and adaptation. There are also easy wins on inference cost, such as quantization from FP16 to FP8 or INT4, some of which were adopted and rolled out for Llama 3.0 at Together.ai this week. Groq also released new Llama 3.0 options with an open-source Tool Use fine-tune that reached the #1 position on BFCL (the Berkeley Function Calling Leaderboard), beating all other models.
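For readers who want to try this kind of quantization win themselves, here is a minimal sketch using the Hugging Face transformers and bitsandbytes stack to load a Llama 3 model with INT4 (NF4) weight quantization. This illustrates the general technique, not Together.ai's production pipeline, and the model ID assumes access to Meta's gated Llama 3 weights:

```python
# Minimal INT4 (NF4) weight-quantized load of a Llama 3 model via bitsandbytes.
# Illustrative sketch only; this is not Together.ai's production inference stack.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumes gated-weights access

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for quality
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

inputs = tokenizer("Quantization cuts inference cost because", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=40)[0], skip_special_tokens=True))
```

Quantizing weights to 4-bit roughly quarters memory versus FP16, which is often the difference between needing multiple GPUs and fitting on one.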
Meta is fighting back this week with the release of its Llama 3.1 models. According to the leaks, the instruction-tuned 405B model reportedly scores 73.3% on MMLU-Pro, ahead of GPT-4o (72.55%) and Claude 3.5 Sonnet (72.83%). The 3.1 series also brings significant improvements to the 8B and 70B models (we assume via distillation from the 405B model), particularly on coding. We look forward to more details from the full model release and to seeing how capable the models really are on real-world use cases. While the leading closed models have moved to inference-efficient Mixture of Experts architectures, we believe the Llama 3.1 models remain dense. This poses some challenges for inference of the 405B model, and we expect it will be expensive, relatively slow, and mostly used via quantized versions.
Competition at the leading edge of LLMs is certainly heating up, and it is only getting easier to train LLMs now that large H100 clusters are available at many companies, open datasets are being released, and many techniques, best practices, and frameworks have been discovered and shared. However, it is not all one-way traffic, and the success of LLMs is not without consequences. A new study from the Data Provenance Initiative found that as much as 25% of high-quality data from popular AI training sets has been restricted in the past year, with publishers updating robots.txt and changing terms of service to prevent AI scraping. This response is likely partly intended to protect against reduced traffic as more users get their information directly from LLMs. However, we expect it is also a consequence of OpenAI and others increasingly bidding large sums for access to datasets from companies like the Associated Press. This creates a spiraling incentive structure in which companies close off their websites to scraping and the open internet in order to monetize access via large contracts with AI labs.
Why should you care?
Continuing the 2024 trend of rapid LLM cost reduction, OpenAI's GPT-4o mini averages about 140x cheaper than GPT-4 was at its release just 16 months ago, while also performing better on most benchmarks. It is also 230x cheaper and vastly better than GPT-3 davinci-002, released in August 2022 and the best model available at the time. Matt Holden noted on X/Twitter that in cloud storage's first decade (2006-2016), Amazon S3's cost per GB dropped 86% (or ~97% including Glacier). The speed of AI cost reduction is dramatically faster, potentially enabling much more rapid adoption relative to cloud computing.
With GPT-4o mini, essentially anybody now has access to unlimited interns, each of whom can read ~70,000 words and write up to ~16,000 tokens (roughly 12,000 words) for about $0.02 in under one minute. This is an incredibly valuable resource for helping people perform existing work tasks and for enabling new ones. However, it still takes imagination to figure out how and where you can use LLMs, as well as effort to understand their strengths and weaknesses and diligence in providing clear instructions and checking answers. Many tasks also still require significant work on data preparation, prompting, fine-tuning, RAG, tool use, and the surrounding software and UI/UX to get LLMs to a sufficient level of reliability. We think LLM adoption is only getting started, and the increased level of competition among both open and closed foundation models will only accelerate it further!
- Louie Peters, Towards AI Co-founder and CEO
This issue is brought to you thanks to Brilliant:
AI is getting smarter. Are you?
Understand the concepts powering the technology shaping our world, from LLMs like ChatGPT to quantum computers, through interactive, bite-sized lessons. Or feed your curiosity by exploring thousands of lessons on topics from AI to black holes to going viral on Twitter. Brilliant's first-principles approach helps you build understanding from the ground up. Meanwhile, each lesson is packed with hands-on problem-solving that lets you play with concepts.
Join 10 million learners worldwide and start your 30-day free trial today! Plus, Towards AI readers get a special 20% off a premium annual subscription.
Hottest News
1. OpenAI Introduced GPT-4o Mini
OpenAI introduced GPT-4o mini, its latest small AI model. The company says GPT-4o mini is cheaper and faster than OpenAI's current cutting-edge AI models. GPT-4o mini scores 82% on MMLU and currently outperforms GPT-4 on chat preferences on the LMSYS leaderboard. The model is available to developers and to consumers through the ChatGPT web and mobile app.
2. Progress at xAI on Grok-2 and a 100k Training Cluster for Grok-3
Elon Musk announced that xAI's Grok-2 finished training in June using approximately 15,000 H100s (we think on Oracle Cloud). The team is now fine-tuning and debugging, with plans to release the model in August, hoping to match GPT-4's capabilities. Musk also revealed that xAI's new data center in Memphis has begun training following surprisingly rapid progress in construction and installation. This center will feature a training cluster with 100k liquid-cooled H100s on a single RDMA fabric (we understand 32k are already operational). Musk claims it is the most powerful AI training cluster in the world. While several leading AI companies are also building 100k+ clusters, it's possible that xAI is the first to have 100k GPUs on a single fabric. The company has started training Grok-3 on the new cluster, aiming to complete training in 3-4 months and release the model in December.
3. Mistral AI and NVIDIA Unveil Mistral NeMo 12B, a Cutting-Edge Enterprise AI Model
Mistral AI and NVIDIA released Mistral NeMo 12B, a state-of-the-art 12B model with 128k context length. Mistral NeMo uses a new tokenizer, Tekken, based on Tiktoken and trained on more than 100 languages. It compresses natural language text and source code more efficiently than the SentencePiece tokenizer used in previous Mistral models. Weights for the base and instruct models are hosted on HuggingFace.
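If you want to check Tekken's compression claims for yourself, a quick way is to compare token counts against an older, SentencePiece-based Mistral tokenizer via Hugging Face. A minimal sketch, assuming both model repo IDs below are accessible (they are assumptions worth verifying on the Hub):

```python
# Compare how many tokens two tokenizers need for the same text; fewer tokens
# means better compression (and more effective context). Repo IDs are assumptions.
from transformers import AutoTokenizer

nemo = AutoTokenizer.from_pretrained("mistralai/Mistral-Nemo-Instruct-2407")   # Tekken
mistral = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")  # SentencePiece-based

text = "def fibonacci(n):\n    return n if n < 2 else fibonacci(n - 1) + fibonacci(n - 2)"
for name, tok in [("Tekken (NeMo)", nemo), ("Mistral 7B v0.3", mistral)]:
    print(f"{name}: {len(tok.encode(text))} tokens")
```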
4. Groq Introduces Two New Open-Source Models Specifically Designed for Tool Use
Groq released two new open-source models specifically designed for tool use: Llama-3-Groq-70B-Tool-Use and Llama-3-Groq-8B-Tool-Use, built with Meta Llama-3. They used full fine-tuning and Direct Preference Optimization (DPO) to achieve state-of-the-art tool use performance. Per the report, no user data was used in the training process.
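As a rough illustration of the recipe (not Groq's actual training code), here is a minimal DPO fine-tuning sketch using the trl library. The dataset name is a hypothetical placeholder; you would substitute a preference dataset with prompt/chosen/rejected columns, and the exact trainer arguments can vary between trl versions:

```python
# Minimal DPO fine-tuning sketch with trl (not Groq's actual recipe).
# DPO trains the model to prefer "chosen" over "rejected" responses.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumes gated-weights access
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Hypothetical placeholder: any dataset with "prompt", "chosen", "rejected" columns.
train_dataset = load_dataset("your-org/tool-use-preferences", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="llama3-tool-use-dpo", beta=0.1),
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```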
5. Andrej Karpathy Introduced Eureka Labs
Andrej Karpathy, former head of AI at Tesla and researcher at OpenAI, is launching Eureka Labs, an "AI-native" education platform. The company is starting with a more traditional approach to teaching with its AI course LLM101n.
6. Microsoft Unveiled SpreadsheetLLM
Microsoft has unveiled "SpreadsheetLLM," a new AI model designed to understand and work with spreadsheets. It addresses the challenges of applying AI to the widely used but complex spreadsheet format by combining the power of LLMs with the structured data in spreadsheets.
7. A Survey on the Adoption of ChatGPT
A large-scale survey experiment in Denmark involving 100,000 workers from 11 occupations examined the adoption of ChatGPT. The study found it could halve working times in 37% of the job tasks for the typical worker. ChatGPT is widely used in exposed occupations, with adoption rates ranging from 79% for software developers to 34% for financial advisors. However, significant inequalities have emerged: women are 20 percent less likely to use ChatGPT than men, and ChatGPT users already had higher earnings before its arrival.
8. Apple Released DCLM-7B, a 7B Open-Source LLM Trained on the DCLM-Baseline Dataset
Apple released DCLM-7B, a 7B open-source LLM with fully open weights, training code, and dataset. The model was trained on 2.5T tokens from open datasets. While the model's benchmark performance falls behind existing open-source alternatives, this release is significant because it is the first time a cleaned, high-quality dataset of this scale has been made fully open. We think this can enable easier experimentation on new model architectures. However, training a model on a dataset of this scale still requires significant funding!
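For those who want to inspect the data without downloading terabytes, the dataset can be streamed from the Hugging Face Hub. A minimal sketch, assuming the mlfoundations/dclm-baseline-1.0 repo ID and a "text" column (both assumptions worth verifying on the Hub):

```python
# Stream a few documents from the DCLM baseline dataset without a full download.
# The repo ID and column name are assumptions; check the Hugging Face Hub.
from datasets import load_dataset

ds = load_dataset("mlfoundations/dclm-baseline-1.0", split="train", streaming=True)
for i, doc in enumerate(ds):
    print(doc["text"][:200].replace("\n", " "), "...")
    if i == 2:
        break
```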
9. Together AI Announced Inference Engine 2.0, a New Inference Stack
Together AI released a new inference stack that provides 4x faster decoding throughput than the open-source vLLM. The Together Inference Engine achieves over 400 tokens per second on Meta Llama 3 8B. They also introduced Together Turbo and Together Lite endpoints, which offer flexibility across performance, quality, and price, and are available for Llama 3 models.
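Trying the new endpoints requires only the standard Together Python SDK. A minimal sketch, assuming a TOGETHER_API_KEY environment variable and a Turbo model ID (the exact Turbo/Lite names are assumptions to check in Together's model catalog):

```python
# Query a Together Turbo endpoint; the model ID is an assumption and the exact
# Turbo/Lite names should be verified in Together's model catalog.
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct-Turbo",
    messages=[{"role": "user", "content": "In one sentence, what is speculative decoding?"}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```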
Five 5-minute reads/videos to keep you learning
1. To achieve optimal model performance in machine learning projects, it is important to define the problem, understand the context, and analyze the dataset in detail. This article outlines five actionable tips essential for training machine learning models.
2. Train a Llama Model From Scratch
This quick and easy-to-follow tutorial explains the process of training a language model using the Llama model architecture and the Transformers library.
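The core of that approach is instantiating a randomly initialized Llama architecture from a config rather than loading pretrained weights. A minimal sketch with a toy-sized config (all hyperparameter values here are illustrative, not the tutorial's exact settings):

```python
# Build a small, randomly initialized Llama-architecture model from scratch.
# All sizes are illustrative toy values, not the tutorial's exact settings.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=32000,
    hidden_size=512,
    intermediate_size=1408,
    num_hidden_layers=8,
    num_attention_heads=8,
    num_key_value_heads=4,        # grouped-query attention, as in Llama
    max_position_embeddings=2048,
)
model = LlamaForCausalLM(config)
print(f"{model.num_parameters() / 1e6:.1f}M parameters")
# From here, training proceeds with the usual Trainer/data-collator pipeline.
```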
3. Step-by-Step Tutorial To Create a Coding Agent With GPT-4o-Mini
This is a step-by-step video tutorial for creating a coding agent with GPT-4o mini. It shows how you can use Claude Engineer to create your own custom version with GPT-4o mini as the model.
4. LLMs always "make stuff up"; we call it a hallucination when the output is noticeably incorrect or inappropriate. This article identifies two main categories of hallucinations, factuality and faithfulness hallucinations, and highlights their impact, mitigation strategies, best practices, and more.
5. This article series will teach you how to build a multi-agent AI app. In the first part, the author starts with a single agent as a proof of concept (PoC): a function-calling agent in which each function is responsible for a specific data retrieval algorithm, leveraging existing tools like AWS Bedrock and Slack to streamline knowledge sharing within their organization.
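The single-agent pattern described there boils down to a model choosing among retrieval functions. Here is a minimal sketch of that loop using the OpenAI tools API rather than the article's AWS Bedrock setup; the search_docs function name and its logic are hypothetical placeholders:

```python
# Minimal function-calling agent loop: the model picks a retrieval tool, we run
# it, then feed the result back. Uses the OpenAI SDK rather than AWS Bedrock;
# search_docs is a hypothetical placeholder for a real retrieval algorithm.
import json
from openai import OpenAI

client = OpenAI()

def search_docs(query: str) -> str:
    return f"(stub) top documents for: {query}"

tools = [{
    "type": "function",
    "function": {
        "name": "search_docs",
        "description": "Search internal knowledge-base documents.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

messages = [{"role": "user", "content": "Where is our VPN setup guide?"}]
first = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
msg = first.choices[0].message

if msg.tool_calls:  # the model decided to call a retrieval function
    call = msg.tool_calls[0]
    result = search_docs(**json.loads(call.function.arguments))
    messages += [msg, {"role": "tool", "tool_call_id": call.id, "content": result}]
    final = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    print(final.choices[0].message.content)
```

In the multi-agent follow-ups, each retrieval algorithm graduates from a function into its own agent, but the request/tool-result loop stays the same.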
6. Long Context
This guide explores the basics of the context window, how developers should think about long context, various real-world use cases for long context, and ways to optimize the usage of long context.
Repositories & Tools
- Mem0 provides a self-improving memory layer for LLMs, enabling personalized AI experiences across applications.
- The Cradle framework enables nascent foundation models to perform complex computer tasks using the same unified interface humans use.
- CosyVoice is a large multilingual voice generation model with multilingual voice generation, cross-lingual voice cloning, and instruction-following capabilities.
- Open GPU Kernel Modules repository is the source release of the NVIDIA Linux open GPU kernel modules.
- E5-V is a new framework that adapts Multimodal Large Language Models (MLLMs) to create universal multimodal embeddings.
Top Papers of The Week
1. Scaling Diffusion Transformers to 16 Billion Parameters
This paper presents DiT-MoE, a sparse version of the diffusion Transformer that is competitive with dense networks while exhibiting highly optimized inference. It includes two simple designs, shared expert routing and an expert-level balance loss, to capture common knowledge and reduce redundancy among the routed experts.
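The expert-level balance loss builds on a familiar idea from Switch-Transformer-style MoE training: penalize routers whose token assignments and probabilities concentrate on a few experts. A minimal PyTorch sketch of that standard formulation (the paper's exact variant may differ):

```python
# Standard Switch-Transformer-style load-balancing auxiliary loss for MoE
# routing; DiT-MoE's expert-level balance loss follows this general idea,
# though its exact formulation may differ.
import torch
import torch.nn.functional as F

def balance_loss(router_logits: torch.Tensor, num_experts: int) -> torch.Tensor:
    """router_logits: (num_tokens, num_experts) pre-softmax routing scores."""
    probs = F.softmax(router_logits, dim=-1)            # routing probabilities
    assignments = probs.argmax(dim=-1)                  # top-1 expert per token
    # f_i: fraction of tokens routed to expert i; p_i: mean routing prob for i.
    f = F.one_hot(assignments, num_experts).float().mean(dim=0)
    p = probs.mean(dim=0)
    return num_experts * torch.sum(f * p)               # minimized when uniform

loss = balance_loss(torch.randn(1024, 8), num_experts=8)
print(loss.item())  # ~1.0 when routing is balanced, larger when skewed
```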
2. GraphFM: A Scalable Framework for Multi-Graph Pretraining
This paper introduces the Graph Foundation Model, trained on 152 datasets with over 7.4 million nodes and 189 million edges spanning diverse domains. This work shows that multi-graph pretraining can significantly reduce the burden imposed by the current graph training paradigm by creating a single generalist model that performs competitively across a wide range of datasets and tasks.
3. Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models
This paper presents Reliable and Efficient Concept Erasure (RECE), a novel method that erases inappropriate content from diffusion models in just 3 seconds without needing extra fine-tuning. RECE leverages a closed-form solution to derive new target embeddings, which can regenerate erased concepts within the unlearned model.
4. Fully Non-Linear Neuromorphic Computing with Linear Wave Scattering
This paper presents a new way of implementing a neural network with an optical system, which could make machine learning more sustainable. It relies on linear wave scattering and yet achieves non-linear processing with a high expressivity. The key idea is to inject the input via physical parameters that affect the scattering processes.
5. Prover-Verifier Games Improve Legibility of LLM Outputs
In this paper, researchers trained strong language models to produce text that is easy for weak language models to verify and found that this training also made the text easier for humans to evaluate. When the problem-solving process of strong models is optimized only for getting the correct answer, the results can become harder to understand. This finding highlights a trade-off between raw correctness on the one hand and clarity and ease of verification on the other in AI-generated text.
6. Robotic Control via Embodied Chain-of-Thought Reasoning
This paper introduces Embodied Chain-of-Thought Reasoning (ECoT) for vision-language-action models. ECoT enhances the decision-making capabilities of robot control systems by enabling them to reason about tasks, sub-tasks, and their environment before taking action.
Quick Links
1. Towards AI has partnered with O'Reilly to make our latest resources more accessible on their learning platform. Through this partnership, our latest book, "Building LLMs for Production," and two exclusive "shortcut" video series on LLMs and Generative AI research are now available on the O'Reilly learning platform.
2. Google, OpenAI, Microsoft, Amazon, and others are joining the Coalition for Secure AI (CoSAI). The initiative addresses a "fragmented landscape of AI security" by providing access to open-source methodologies, frameworks, and tools. Other companies joining CoSAI include IBM, PayPal, Cisco, and Anthropic.
3. Fei-Fei Li, the renowned computer scientist known as the "godmother of AI," has created a startup dubbed World Labs. It's valued at more than $1 billion in just four months. World Labs hopes to use human-like visual data processing to make AI capable of advanced reasoning.
4. In a new funding round, Cohere was valued at $5.5 billion, making it one of the world's most valuable artificial intelligence companies and one of the largest startups in Canada. The company has also raised $500 million in Series D funding.
5. Nvidia is preparing a version of its new flagship AI chip for the Chinese market. Nvidia will work with Inspur, one of its major distributor partners in China, on the launch and distribution of the chip, tentatively named the "B20."
6. In a new study, OpenAI researchers found that prover-verifier games improve the legibility of language model outputs. They explored training strong language models to produce text that weak language models can easily verify and found that this training also made the text easier for humans to evaluate.
Whoβs Hiring in AI
Senior DevOps Engineer @NVIDIA (Santa Clara, CA, USA)
Research Engineer, Horizons @Anthropic (Remote/USA)
Machine Learning Researcher @Lambda (Remote/USA/Canada)
Freelance AI Artists @Klick (Remote)
Prompt Engineering Fellow @Khan Academy (Remote/USA/Canada)
Senior Full Stack Engineer, Growth @dbt Labs (Remote/Brazil)
Interested in sharing a job opportunity here? Contact [email protected].
Think a friend would enjoy this too? Share the newsletter and let them join the conversation.
Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI, from research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.
Published via Towards AI