TAI #117: Do OpenAI’s o1 Models Unlock a Full “Moore’s Law” Feedback Loop for LLM Inference Tokens?
Last Updated on October 2, 2024 by Editorial Team
Author(s): Towards AI Editorial Team
Originally published on Towards AI.
What happened this week in AI by Louie
OpenAI’s new o1 series of “reasoning” models took clear center stage this week. These models use an advanced form of search and reasoning during inference. This system performs multiple steps of thought before arriving at an answer, drawing on reinforcement learning (RL) to refine its reasoning process. These models have been highly anticipated since press leaks of OpenAI’s “Q*” breakthrough over a year ago.
As is typical, reception to the model has spanned from claims of unlocking AGI to dismissive claims that OpenAI is just applying “Chain of Thought” prompting. We think this model is a huge breakthrough for some tasks, but it is not a plug-and-play upgrade to existing models; you cannot simply reuse an existing LLM pipeline and prompt and expect better results. While technical details were scarce, the model is clearly more than just prompting: OpenAI made a significant investment in a new “large-scale reinforcement learning algorithm” that “teaches the model how to think productively using its chain of thought.” Perhaps the model started as GPT-4o, but we think post-training compute investment has likely led to substantially different model weights, and some architectural adaptations were likely needed to achieve this reasoning search process. There was likely also substantial investment in compiling new post-training data, where we expect experienced scientists and coders were asked to break down the full details of the internal reasoning they use to solve challenging problems. The final model can perform some tasks that are completely out of reach of existing LLMs, albeit at a substantially higher cost.
The performance jumps on some benchmarks (science, math, code, and reasoning-focused ones) are substantial; for example, on PhD-level science questions (GPQA Diamond), GPT-4o achieved 53.6%, o1-mini 60.0%, o1-preview 73.3%, and the still-unreleased o1 77.3%. The downside is, of course, cost and latency: the models often spend 10–60 seconds “thinking” using hidden reasoning tokens, which are also billed. While the per-token price of o1-preview is 6x higher than GPT-4o’s, factoring in these thinking tokens means the cost can often reach as much as 30x higher per visible output token. o1-mini is priced 5x lower than o1-preview and is even more tailored to math and coding problems, where it can actually achieve better results.
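For intuition, here is a rough back-of-the-envelope sketch of how hidden reasoning tokens inflate effective cost. The token counts below are illustrative assumptions, not published figures; only the ~6x and ~30x multipliers come from the discussion above.

```python
# Rough estimate of o1-preview's effective cost per *visible* output token
# relative to GPT-4o. Token counts below are illustrative assumptions.
O1_PRICE_MULTIPLIER = 6.0            # ~6x higher per-token price (from the text)

visible_output_tokens = 500          # tokens the user actually sees (assumed)
hidden_reasoning_tokens = 2_000      # hidden "thinking" tokens, also billed (assumed)

# Spread the cost of the hidden tokens over the visible output.
effective_multiplier = O1_PRICE_MULTIPLIER * (
    (visible_output_tokens + hidden_reasoning_tokens) / visible_output_tokens
)
print(f"Effective cost vs. GPT-4o per visible output token: ~{effective_multiplier:.0f}x")
# With these assumptions: 6 * (2,500 / 500) = ~30x, consistent with the figure above.
```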
The biggest highlight from OpenAI’s o1 report for us was its disclosure of non-plateauing scaling laws for capability relative to “test-time” (or “inference-time”) compute. While capability still scales logarithmically with compute (and hence gets expensive), the fact that you can simply spend more on inference to achieve greater performance, instead of needing to train a more capable model, is very significant. It speaks to the success of OpenAI’s RL search model that the reasoning steps do not just get lost and stuck after heading in the wrong direction but can, in fact, keep progressing toward the correct answer with more inference compute. While much refinement is still needed here, it opens the possibility of simply leaving o1 models to work for a day or a week to solve the hardest problems. Of course, this is also all very convenient for OpenAI’s business model!
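To make the shape of that relationship concrete, here is a purely illustrative toy model. OpenAI’s plots show accuracy improving roughly linearly in the logarithm of test-time compute, but the exact functional form and parameters are not published; the values below are assumptions.

```python
# Toy illustration of log-linear test-time compute scaling (parameters assumed).
import numpy as np

def toy_accuracy(compute, a=0.40, b=0.08):
    """Hypothetical accuracy as a log-linear function of relative inference compute."""
    return float(np.clip(a + b * np.log10(compute), 0.0, 1.0))

for c in [1, 10, 100, 1_000, 10_000]:   # relative units of test-time compute
    print(f"compute x{c:>6}: accuracy ~ {toy_accuracy(c):.2f}")
# Each 10x increase in inference compute buys a similar absolute gain, so
# performance keeps improving, but costs grow exponentially for linear gains.
```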
Why should you care?
Besides the out-of-the-box capability unlock on some tasks (we found it particularly valuable for brainstorming tasks so far, and agent pipelines have also become much easier), we think the real story here is the beginning of a new paradigm of integrating RL-based “reasoning step” search with LLMs and scaling inference-time compute to reach greater capability.
Many people argue that LLMs alone will never be able to truly reason and generalize; they only memorize statistical features of their training data distribution. This may or may not be true, but we think a key reason LLMs perform poorly on reasoning-like tasks is that there is very little reasoning data on the internet. Humans skip to the key points when writing up their ideas and don’t write down their full inner monologue with every thinking step, so LLMs learned that they were supposed to guess at these seemingly random leaps from token to token. To some extent, I think we have so far been actively training LLMs NOT to reason: they were punished during training for attempting the necessary intermediate calculations and thinking steps instead of skipping straight to mimicking the next word as it appears in their internet training data. For this reason, I think we will find a lot of easy wins with models developed in the direction of o1.
For some time, we have been highlighting the rapid price reduction in LLM inference tokens. For example, cached input tokens with DeepSeek V2 are priced roughly 4,000x lower than Davinci-002 (GPT-3) tokens were two years earlier. At the peak of Moore’s Law, the cost per transistor fell around 4,000x over the first 14 years, up to 1982. We think “Moore’s Law” is an increasingly apt analogy. Despite the huge reduction in LLM inference token prices, until now one key element of the feedback loop was still missing.
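For a sense of scale, here is a quick worked comparison of the annualized decline rates implied by those two ~4,000x figures, using only the timespans quoted above:

```python
# Annual cost-decline rate implied by a total reduction factor over a period.
def annual_decline(total_factor: float, years: float) -> float:
    return 1 - (1 / total_factor) ** (1 / years)

print(f"LLM tokens:  ~4,000x over ~2 years -> ~{annual_decline(4000, 2):.0%}/year")
print(f"Transistors: ~4,000x over 14 years -> ~{annual_decline(4000, 14):.0%}/year")
# Roughly 98% per year for LLM tokens vs. roughly 45% per year for early transistors.
```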
While Moore’s Law itself was just a prophecy, I think it actually consists of three very real components that led to sustained feedback loops and the progress we have seen:
1) Learning Rates/Wright’s Law: More cumulative production of a product leads to lower costs due to A) R&D budgets scaling with revenue, B) process improvements driven by companies’ experience with their cost of goods sold and growing staff expertise, and C) economies of scale. (A minimal sketch of Wright’s Law follows this list.)
2) Volume unlocked by price: Lower costs lead to a wider set of applications becoming economically viable, which in turn leads to more cumulative production and lower costs.
3) Volume unlocked by capability: More transistors used together lead to higher capability, which in turn leads to more applications becoming possible, more production, lower costs, and so on.
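As referenced in point 1, here is a minimal sketch of Wright’s Law: unit cost falls by a fixed percentage (the “learning rate”) with every doubling of cumulative production. The 20% learning rate below is an illustrative assumption, not an estimate for chips or LLM tokens.

```python
import math

def wrights_law_cost(cumulative_units: float,
                     first_unit_cost: float = 1.0,
                     learning_rate: float = 0.20) -> float:
    """Unit cost after `cumulative_units`, falling by `learning_rate` per doubling."""
    b = math.log2(1.0 / (1.0 - learning_rate))   # cost ~ cumulative_units ** (-b)
    return first_unit_cost * cumulative_units ** (-b)

for n in [1, 10, 100, 10_000, 1_000_000]:
    print(f"{n:>9,} cumulative units -> unit cost {wrights_law_cost(n):.4f}")
```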
Until now, cumulative growth in generated LLM inference tokens has been leading to rapid breakthroughs in reducing cost and unlocking new economic applications. However, more tokens could not easily be applied to the same task to unlock more capability (analogous to fitting more transistors on one chip). These new o1 models demonstrate test-time compute scaling laws and clear capability gains from applying more “thinking tokens” to a problem, potentially across orders of magnitude more inference tokens. I expect this to lead to a much higher cost per LLM task in some cases (where it makes sense to invest a large number of tokens or a long thinking time), but this increased token volume should drive Wright’s Law to lower the cost per LLM token even faster.
Interestingly, it is actually LLMs that are reinvigorating the third component of Moore’s Law. The capability gains from more inference tokens per task look more clear-cut than gains from more transistors per chip, as transistors saturated those benefits long ago. Now, however, huge LLM training clusters mean we really are unlocking large new capabilities again by setting more and more transistors to one task (for example, 100k H100 GPU training runs). Given the scale of the chip industry, though, it will still take a long time for this to meaningfully change cumulative historic transistor production and, hence, to drive a faster reduction in price per transistor.
— Louie Peters — Towards AI Co-founder and CEO
This issue is brought to you thanks to Semrush:
Create unlimited SEO-ready content with ContentShake AI
Looking for an AI tool to help you rank in search?
Try ContentShake AI by Semrush! This smart AI content tool generates SEO-ready content set to rank for your target keywords.
Simply choose a content idea and get your blog post in minutes. Then, enhance it using the intuitive blog editor and publish it directly to your WordPress site.
The best part? You can also update your existing content and create unlimited new articles at no extra cost.
Hottest News
1. OpenAI’s New o1-Preview and o1-Mini Models
OpenAI introduced o1-preview, the first in a new series of reasoning models that are significantly better at complex tasks in science, coding, and math. These models outperform their predecessors by employing advanced reasoning before responding, with test performance comparable to PhD students in rigorous fields. Despite lacking some GPT-4o features, o1-preview excels at specialized reasoning tasks, promising substantial AI advancements.
2. Google AI Introduces DataGemma: A Set of Open Models
Google announced DataGemma, the first open models designed to connect LLMs with extensive real-world data. Available on Hugging Face for academic and research use, both new models build on the existing Gemma family of open models and use real-world data from the Google-created Data Commons platform to ground their answers. The public platform provides an open knowledge graph with over 240 billion data points sourced from trusted organizations across economic, scientific, health, and other sectors.
3. Mistral Releases Pixtral 12B, Its First Multimodal Model
Mistral has released its first model that can process images and text, called Pixtral 12B. Built on one of Mistral’s text models, Nemo 12B, the new model can answer questions about an arbitrary number of images of an arbitrary size given URLs or images encoded using base64, the binary-to-text encoding scheme.
4. OpenAI Could Shake Up Its Nonprofit Structure Next Year
Reports earlier this week suggested that the AI company was in talks to raise $6.5 billion at a $150 billion pre-money valuation. Now Reuters says the deal is contingent on whether OpenAI can restructure and remove a profit cap for investors. In fact, according to Fortune, co-founder and CEO Sam Altman told employees at a company-wide meeting that OpenAI’s structure will likely change next year, bringing it closer to a traditional for-profit business. OpenAI is currently structured so that its for-profit arm is controlled by a non-profit.
5. Jina AI Announced Reader-LM, Small Language Models for Cleaning and Converting HTML to Markdown
Reader-LM by Jina AI is a compact language model for efficient HTML-to-Markdown conversion, surpassing traditional methods like readability scripts and regex. Despite its small size, it performs exceptionally well against larger models, supports long token contexts, and is optimized for GPUs.
6. Oracle Unveils World’s First Zettascale AI Supercomputer With 131K NVIDIA Blackwell GPUs
Oracle launched the world’s first zettascale cloud computing clusters, powered by NVIDIA Blackwell GPUs. The clusters offer up to 131,072 GPUs and deliver 2.4 zettaFLOPS of peak performance. The offering supports advanced AI research and development while ensuring regional data sovereignty, a crucial factor for industries like healthcare and for collaboration platforms such as Zoom and WideLabs.
7. AMD Is Turning Its Back on Flagship Gaming GPUs To Chase AI First
AMD prioritizes AI development over flagship gaming GPUs to achieve a larger market share and attract developer support. According to Jack Huynh, the goal is to reach a 40% market share to compete with Nvidia and optimize AMD platforms for developers before potentially re-focusing on gaming GPUs.
Six 5-minute reads/videos to keep you learning
1. Top RAG Techniques You Should Know (Wang et al., 2024)
This article explores the best Retrieval-Augmented Generation (RAG) stack based on the study by Wang et al., 2024. It goes over the best components and how they work so you can also make your RAG system top-tier.
2. Using GPT-4o for Web Scraping
The article examines the use of GPT-4o for AI-assisted web scraping, highlighting its ability to parse structured data from HTML. Through OpenAI’s API, the author tests its efficacy on simple and complex tables, addressing challenges with merged rows and accurate XPath generation. The author finds that combining data extraction with subsequent XPath generation is the more effective approach (a minimal sketch of the extraction step follows this list).
3. Why Small Language Models Are the Next Big Thing in AI
SLMs are poised to democratize AI access and drive innovation across industries by enabling cost-effective, targeted solutions. This article explores the potential of small language models, including the advantages of faster development cycles, improved efficiency, and the ability to tailor models to specific needs.
4. Build and Deploy a FastAPI App That Generates Video Descriptions With AI
This article provides a step-by-step guide to building a FastAPI application that takes a video URL as input and generates a description using AI. It also shows how to containerize the app using Docker and deploy it to Azure Web Apps.
5. Fine-Tuning Florence-2 — Microsoft’s Cutting-Edge Vision Language Models
Florence-2, released by Microsoft in June 2024, is a foundation vision-language model. The authors report that Florence-2 can perform visual question answering (VQA), but the released models don’t include that capability, so this post walks through fine-tuning Florence-2 on the DocVQA dataset.
6. Will the “AI Scientist” Bring Anything to Science?
An international team developed an AI system designed to mimic a novice Ph.D. student in generating hypotheses and conducting computer science experiments. While promising for advancing automated scientific discovery, it frequently produced incoherent and unreliable results similar to premature scientific guesswork. This article explores the capabilities and inner workings of the AI Scientist.
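As referenced in item 2 above, here is a minimal sketch of the general AI-assisted scraping approach: hand raw HTML to GPT-4o and ask for structured JSON rather than hand-writing parsing rules. The prompt wording and output schema are illustrative assumptions, not the article’s own code.

```python
# Minimal sketch: extract a structured table from raw HTML with GPT-4o.
# Prompt and JSON schema are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def extract_table(html: str) -> list[dict]:
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": ("Extract the main data table from the HTML and return JSON "
                         'shaped as {"rows": [{"<column name>": "<value>", ...}]}.')},
            {"role": "user", "content": html},
        ],
    )
    return json.loads(response.choices[0].message.content)["rows"]
```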
Repositories & Tools
- AlphaFold 3 is Ligo’s open-source implementation of AlphaFold3, an ongoing research project to advance open-source biomolecular structure prediction.
- Llama Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve GPT-4o-level speech capabilities.
- iText2KG is a Python package designed to construct consistent knowledge graphs with resolved entities and relations incrementally.
- MiniCPM is an edge-side LLM that surpasses GPT-3.5-Turbo.
- Aider is an AI pair-programming tool that runs in your terminal.
- GPT Pilot is an AI developer companion.
- Taipy turns data and AI algorithms into production-ready web applications.
Top Papers of The Week
1. Planning In Natural Language Improves LLM Search For Code Generation
Research indicates that using natural language planning boosts the effectiveness of LLMs in code generation. The PLANSEARCH algorithm, which creates diverse natural language plans, significantly improves solution diversity and performance, achieving a pass@200 of 77.0% on LiveCodeBench. This approach highlights a direct correlation between the diversity of generated ideas and performance gains, proposing a new paradigm in computational problem-solving.
2. Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
This paper introduces Mini-Omni, an audio-based, end-to-end conversational model capable of real-time speech interaction. It proposes a text-instructed speech generation method and batch-parallel inference strategies to boost performance. The method retains the original model’s language capabilities with minimal degradation, making it easier for other work to add real-time interaction capabilities.
3. MemoRAG: Moving towards Next-Gen RAG Via Memory-Inspired Knowledge Discovery
This paper proposes MemoRAG, a novel RAG approach with long-term memory. MemoRAG adopts a dual-system architecture. On the one hand, it employs a light but long-range LLM to form the global memory of the database. On the other hand, it leverages an expensive but expressive LLM, which generates the ultimate answer based on the retrieved information.
4. Configurable Foundation Models: Building LLMs from a Modular Perspective
This paper offers a comprehensive overview and investigation of the construction, utilization, and limitations of configurable foundation models. Overall, it provides a fresh, modular perspective on existing LLM research and inspires the future creation of more efficient and scalable foundation models.
5. Imitating Language via Scalable Inverse Reinforcement Learning
The paper explores the use of Inverse Reinforcement Learning (IRL) in fine-tuning language models, traditionally reliant on Maximum Likelihood Estimation (MLE). IRL enhances performance, output diversity, and robustness. Combining IRL with MLE offers a promising alternative for refining large language models.
6. PaperQA: Retrieval-Augmented Generative Agent for Scientific Research
This paper presents PaperQA, a RAG agent that answers questions about the scientific literature. It performs information retrieval across full-text scientific articles, assesses the relevance of sources and passages, and uses RAG to provide answers.
7. Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers
This research establishes an experimental design to evaluate the research idea generation capability of LLMs. It performs the first head-to-head comparison between expert NLP researchers and an LLM ideation agent. The blind reviews of both LLM and human ideas show that LLM-generated ideas are more novel (p < 0.05) than human expert ideas but not as feasible.
Quick Links
1. Google is adding an “Audio Overview” feature to NotebookLM, its AI note-taking and research app. Audio Overview will give users another way to process and understand the information in the documents uploaded to the app, such as course readings or legal briefs.
2. Klarna CEO Sebastian Siemiatkowski announced that the company will end its service provider relationships with Salesforce and Workday as part of a major internal overhaul driven by AI initiatives.
3. Arcee AI launched SuperNova, a 70 billion parameter language model designed for enterprise deployment, featuring advanced instruction-following capabilities and full customization options.
4. Salesforce unveiled Agentforce, a suite of autonomous AI agents that augment employees and handle tasks in service, sales, marketing, and commerce, driving efficiency and customer satisfaction.
5. Fei-Fei Li has raised $230 million for her new startup, World Labs, from backers including Andreessen Horowitz, NEA, and Radical Ventures. World Labs is valued at over $1 billion, and the capital was raised over two rounds spaced a couple of months apart.
Who’s Hiring in AI
Our Towards AI Jobs Search Platform is gaining momentum! We received 500,000 Google search impressions in August and are now listing 30,000 live AI jobs. Our LLM-enhanced pipeline continuously searches for jobs that meet our AI criteria and removes expired jobs. We make it much easier to search for and filter by specific AI skills, and we also allow you to set email alerts for jobs with specific skills or from specific companies. All for free! We hope our platform will make it much quicker to find and apply for jobs that truly match your AI skills and experience.
Contract, AI Engineer @Cytokinetics (Freelance/San Francisco, CA, USA)
AIML — Machine Learning Researcher, MLR @Apple (Cupertino, CA, USA)
Data Scientist (IV) — Generative AI @HP Inc. (Spring, Texas, USA)
Member of Technical Staff — Machine Learning @Microsoft Corporation (Mountain View, CA, USA)
Applied AI Engineer @Valence (Remote)
AI Software Engineer @Ataccama (Remote/Prague)
NLP LLM Operations Architect & AWS Engineer, Healthcare & Life Sciences @Norstella (Remote/USA)
Interested in sharing a job opportunity here? Contact [email protected].
Think a friend would enjoy this too? Share the newsletter and let them join the conversation.
Join thousands of data leaders on the AI newsletter. With over 80,000 subscribers, it keeps you up to date with the latest developments in AI, from research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.
Published via Towards AI