TAI #118: Open source LLMs progress with Qwen 2.5 and Pixtral 12B
Last Updated on October 2, 2024 by Editorial Team
Author(s): Towards AI Editorial Team
Originally published on Towards AI.
What happened this week in AI by Louie
This week, several strong new open-source LLMs were released. Following OpenAI's major progress with its o1 "reasoning" model family last week, it was encouraging to see progress in open source again, albeit still behind the leading closed-source LLMs. Qwen 2.5 takes the lead in the open-source world for general language tasks, Pixtral 12B is a very powerful new small open-source multimodal model, and GRIN-MoE is now a competitor in the smallest-inference-compute LLM category.
Qwen2.5 is the latest release in the Qwen family of foundation models from Alibaba in China. The models generally take the lead on language benchmarks among open-source models in their size categories (up to 72B parameters) and even beat the much larger Llama 3.1 405B in some cases. The new models, including Qwen2.5, Qwen2.5-Coder, and Qwen2.5-Math, bring significant improvements in areas like instruction following, coding, and mathematics, outperforming many comparable models on key benchmarks. Trained on an 18-trillion-token dataset, they support over 29 languages and handle long-text generation of up to 8K tokens. The models are open source, with most available under the Apache 2.0 license. A stronger model, Qwen-Plus, is also available via API.
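As a quick reference, here is a minimal sketch of running one of the instruction-tuned checkpoints with Hugging Face transformers; the Qwen/Qwen2.5-7B-Instruct repo name follows the public release, while the prompt and generation settings are placeholder choices:

```python
# Minimal sketch: chat with Qwen2.5-7B-Instruct via Hugging Face transformers.
# Assumes a GPU with enough memory; adjust the model size to your hardware.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the key ideas behind mixture-of-experts models."},
]
# Apply the model's chat template, then generate a reply.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```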
Pixtral 12B is Mistral AI's first multimodal model, featuring a 12-billion-parameter decoder and a 400-million-parameter vision encoder designed to process both images and text natively. The model now competes with the larger LLaVA-OneVision 72B for the title of strongest open-source multimodal model, while generally beating open models in its price category. It excels at multimodal tasks like chart understanding, document question answering, and reasoning, while also maintaining strong performance on text-only benchmarks such as coding and math. It can handle variable image sizes and process multiple images within a long context window of 128K tokens. It is open-sourced under the Apache 2.0 license and available through various platforms.
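For comparison, below is a minimal sketch of querying Pixtral with an image through Mistral's hosted API using the official mistralai Python client; the pixtral-12b-2409 model name and message format follow Mistral's published docs at the time of writing, and the image URL is a placeholder:

```python
# Minimal sketch: ask Pixtral 12B a question about an image via Mistral's API.
import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

response = client.chat.complete(
    model="pixtral-12b-2409",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What does this chart show?"},
                # Placeholder URL: replace with your own image.
                {"type": "image_url", "image_url": "https://example.com/chart.png"},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```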
GRIN MoE is a 16×3.8B parameter mixture-of-experts (MoE) LLM from Microsoft with 6.6B active parameters. It employs SparseMixer-v2 to estimate gradients for expert routing and avoids the conventional need for expert parallelism or token dropping, allowing for efficient scaling in memory- and compute-constrained environments. The model performs very well across various benchmarks, particularly given its low active parameter count, but we haven't yet seen feedback from real-world usage. The model is available under an MIT license.
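To make the active-versus-total parameter distinction concrete, here is a toy top-2 sparse MoE layer in PyTorch. This is a generic illustration of expert routing, not GRIN MoE's actual SparseMixer-v2 routing or its gradient estimator:

```python
# Toy sparse MoE layer: 16 experts, but only the top-2 run per token,
# so only a fraction of the total parameters are "active" for any input.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=16, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
             for _ in range(n_experts)]
        )
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, d_model)
        scores = self.router(x)                        # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1) # pick top-2 experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e               # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

y = ToyMoE()(torch.randn(4, 512))  # only 2 of 16 expert MLPs run per token
```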
Why should you care?
We think OpenAI's new o1 models and these strong new open-source options only increase the need to carefully choose the best LLM for your task. Key factors include cost, latency, capabilities on different categories of tasks, and flexibility for further adaptation via fine-tuning or techniques such as model distillation; all of these vary significantly across models. It is very easy to use a far too expensive model when a cheaper one is sufficient, or even better, at your specific problem category. Generally, we expect this to lead to heavy use of model routers in most advanced LLM pipelines. For example, you might direct queries that need advanced general reasoning and planning to o1 models, queries with very long input context to Gemini 1.5 Pro, and general advanced coding and multimodal tasks to Claude 3.5 Sonnet. You will likely also use open-source models in your stack (for cost, adaptability, or privacy and security reasons); here, you might use fine-tuned Qwen 2.5 models for specialized language tasks or Pixtral 12B for specialized multimodal tasks.
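As a rough sketch of the routing idea, the snippet below maps query categories to models matching the examples above; classify_query is a hypothetical placeholder for your own routing logic (keyword rules, a small classifier, or a cheap LLM call), and the model identifiers are illustrative:

```python
# Minimal sketch of a model router: pick a model per query category.
# classify_query is a hypothetical placeholder, not a real library function.

ROUTES = {
    "reasoning": "o1-preview",          # multi-step planning and hard reasoning
    "long_context": "gemini-1.5-pro",   # very long input documents
    "coding": "claude-3-5-sonnet",      # general advanced coding / multimodal
    "specialized": "qwen2.5-72b-instruct-finetuned",  # self-hosted, fine-tuned
}

def classify_query(query: str, context_tokens: int) -> str:
    """Hypothetical heuristic router; replace with your own classifier."""
    if context_tokens > 100_000:
        return "long_context"
    if any(kw in query.lower() for kw in ("prove", "plan", "step by step")):
        return "reasoning"
    if "def " in query or "class " in query or "traceback" in query.lower():
        return "coding"
    return "specialized"

def route(query: str, context_tokens: int = 0) -> str:
    return ROUTES[classify_query(query, context_tokens)]

print(route("Write a step by step plan to migrate our database."))  # -> o1-preview
```

In practice, the routing step itself can be another small LLM call, and the right trade-off between router accuracy and added latency depends on your workload.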
We think there is room for multiple foundation model families to provide value across open and closed-source business models. However, pre-training and post-training for these foundation models are getting extremely expensive, and we expect only a limited number of companies to compete here. Most of the LLM ecosystem will likely focus on additional post-training steps and building advanced LLM pipelines on top of foundation LLMs.
Louie Peters, Towards AI Co-founder and CEO
In collaboration with BrightData:
The Future of AI is Powered by Web Data 🌍
As AI continues to evolve, the need for dynamic, real-time web data has never been more critical. Traditional static datasets can't keep pace with the nuanced, ever-changing data requirements of today's advanced AI models, particularly LLMs.
Access to real-time, unstructured web data is key to helping these models stay relevant, improve contextual understanding, and deliver more accurate insights.
Bright Data enables:
- Seamless data access: providing businesses with organized, real-time insights from a vast array of sources.
- Flexibility: a scalable, adaptive platform that evolves with your data needs.
- Transparency: adhering to strict ethical and compliance standards for responsible data collection.
Learn how real-time web data is shaping the future of AI and LLMs
Hottest News
1. Microsoft Wants Three Mile Island To Fuel Its AI Power Needs
Microsoft just signed a 20-year deal to exclusively access 835 megawatts of energy from the shuttered Three Mile Island nuclear power plant. If approved by regulators, the software maker would have exclusive rights to 100 percent of the output for its AI data center needs.
2. Anthropic Introduced Contextual Retrieval
Anthropic introduced a method called "Contextual Retrieval" that uses two sub-techniques: Contextual Embeddings and Contextual BM25. This method can reduce the number of failed retrievals by 49% and, when combined with reranking, by 67%. These are significant improvements in retrieval accuracy that translate directly to better performance in downstream tasks. Users can now deploy a Contextual Retrieval solution with Claude, as sketched below.
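A rough sketch of the preprocessing step behind the idea: generate a short, document-aware context for each chunk with Claude and prepend it before embedding or BM25 indexing. The prompt is a paraphrase of Anthropic's published example, and the model choice and helper name are assumptions:

```python
# Sketch of Contextual Retrieval's preprocessing step: generate a short,
# document-aware context for each chunk and prepend it before indexing.
# Model choice and prompt wording are illustrative assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def contextualize_chunk(document: str, chunk: str) -> str:
    prompt = (
        "<document>\n" + document + "\n</document>\n"
        "Here is a chunk from the document:\n<chunk>\n" + chunk + "\n</chunk>\n"
        "Give a short context situating this chunk within the document, "
        "to improve search retrieval. Answer with the context only."
    )
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}],
    )
    context = response.content[0].text
    # The contextualized chunk is what gets embedded and BM25-indexed.
    return context + "\n" + chunk
```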
3. Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries
This paper introduced Michelangelo, a new approach for evaluating how well language models can understand and reason over long context windows. Michelangelo aims to move beyond simple "needle in the haystack" evaluations and design more challenging evaluation tasks that require the model to extract and leverage the latent semantic relationships within the text.
4. We Are in the Intelligence Age by Sam Altman
In a victory lap blog, Sam declares that deep learning works and gets predictably better with scale. As AI progresses, we will soon be able to work with capable AI, which will help us accomplish much more than we could ever have without it. However, the dawn of the Intelligence Age is a momentous development with very complex and extremely high-stakes challenges.
Five 5-minute reads/videos to keep you learning
1. The Open Source Project Maintainer's Guide
This post shares a list of mistakes to avoid if you are looking for contributors for your project. It also highlights how making it easier for people to contribute makes them more likely to do so.
2. AI vs. Human Engineers: Benchmarking Coding Skills Head-to-Head
CodeSignalβs latest report compares top AI models with human engineers using real-world coding assessments. These assessments evaluate general coding abilities and edge-case thinking, providing practical insights that help inform the design of AI-co-piloted assessments.
3. How Streaming LLM APIs Work
This guide explains how the HTTP streaming APIs from the various hosted LLM providers work. This article investigates three APIs: OpenAI, Anthropic Claude, and Google Gemini.
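For a flavor of what consuming such a stream looks like, here is a minimal sketch using OpenAI's official Python SDK; the other providers covered in the article follow similar server-sent-event patterns, and the model name is just an example:

```python
# Minimal sketch: stream tokens from OpenAI's chat completions API.
# The server sends server-sent events; the SDK yields them as chunks.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain HTTP streaming in one paragraph."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```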
4. How I Deal With Hallucinations at an AI Startup
This article shares the key principles to focus on when designing solutions that may be prone to hallucinations. It also highlights the difference between weak and strong grounding.
5. Fine-Tuning LLMs to 1.58bit: Extreme Quantization Made Easy
BitNet is a special transformer architecture that offers extreme quantization of just 1.58 bits per parameter. However, it requires training a model from scratch. While the results are impressive, not everybody has the budget to pre-train an LLM. This article explores a few techniques for fine-tuning an existing model to 1.58 bits.
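To make the "1.58 bits" figure concrete, below is a simplified sketch of BitNet b1.58-style absmean quantization, which rounds each weight to the ternary set {-1, 0, +1} with a per-tensor scale; it illustrates the quantization step only, not the article's fine-tuning recipe:

```python
# Simplified BitNet b1.58-style weight quantization: scale by the mean
# absolute weight, then round and clip to the ternary set {-1, 0, +1}.
# log2(3) ≈ 1.58 bits of information per weight.
import torch

def absmean_ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    scale = w.abs().mean().clamp(min=eps)      # per-tensor scale
    w_q = (w / scale).round().clamp(-1, 1)     # ternary weights in {-1, 0, +1}
    return w_q, scale                          # forward pass uses w_q * scale

w = torch.randn(4, 4)
w_q, scale = absmean_ternary_quantize(w)
print(w_q)             # entries are -1.0, 0.0, or 1.0
print(w_q * scale)     # dequantized approximation of w
```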
Repositories & Tools
- JavaScript Algorithms contains JavaScript-based examples of many popular algorithms and data structures.
- optillm is an OpenAI API-compatible optimizing inference proxy that implements several techniques to improve LLMs' accuracy and performance.
- Solidroad is an AI-first training and assessment platform.
- Agent Zero is a personal and organic AI framework for tasks.
Top Papers of The Week
1. Training Language Models to Self-Correct via Reinforcement Learning
This paper developed SCoRe, a multi-turn online reinforcement learning approach that significantly improves an LLM's self-correction ability using entirely self-generated data. When applied to Gemini 1.0 Pro and 1.5 Flash models, SCoRe improved the base models' self-correction by 15.6% and 9.1%, respectively, on the MATH and HumanEval benchmarks.
2. OneGen: Efficient One-Pass Unified Generation and Retrieval for LLMs
This paper introduces the One-pass Generation and retrieval framework (OneGen). It is designed to improve LLMs' performance on tasks requiring generation and retrieval. The framework incorporates retrieval tokens generated autoregressively. This enables a single LLM to handle both tasks simultaneously in a unified forward pass.
3. Eureka: Evaluating and understanding progress in AI
This paper presents Eureka, an open-source framework for standardizing evaluations of large foundation models beyond single-score reporting and rankings. It also introduces Eureka-Bench as an extensible collection of benchmark testing capabilities. It analyzes 12 state-of-the-art models, providing in-depth insights into failure understanding and model comparison that can be leveraged to plan targeted improvements.
4. Prompts as Programs: A Structure-Aware Approach to Efficient Compile-Time Prompt Optimization
This paper introduces SAMMO, a framework for performing symbolic prompt program search for compile-time optimization of prompt programs. SAMMO generalizes previous methods and improves the performance of complex prompts on instruction tuning, RAG pipeline tuning, and prompt compression across several different LLMs.
5. Neptune: The Long Orbit to Benchmarking Long Video Understanding
This paper introduces Neptune, an evaluation benchmark that includes tough multiple-choice and open-ended questions for videos of variable lengths up to 15 minutes long. Neptuneβs questions are designed to require reasoning over multiple modalities (visual and spoken content) and long time horizons, challenging the abilities of current large multimodal models.
6. Wings: Learning Multimodal LLMs without Text-only Forgetting
This paper presents Wings, an MLLM that excels in text-only dialogues and multimodal comprehension. The experimental results demonstrate that Wings outperforms equally-scaled MLLMs in text-only and visual question-answering tasks.
Quick Links
1. Google Quantum AI demonstrated a quantum memory system that greatly reduces error rates. The quantum computer uses multiple physical qubits to form one logical qubit, and the researchers apply an error-correction scheme known as the "surface code" to correct errors.
2. Former Apple design chief Jony Ive has confirmed that he's working with OpenAI CEO Sam Altman on an AI hardware project. There aren't many details on the project. Ive reportedly met Altman through Brian Chesky, the CEO of Airbnb, and the venture is being funded by Ive and Laurene Powell Jobs' company.
Whoβs Hiring in AI
AI Market Lead - Defense & Intel @Accenture (Arlington, TX, USA)
Lead Research Engineer - Prompt Engineering @GE Vernova (Bangalore, India)
Software Engineer III, Machine Learning, Google Research @Google (Zurich, Switzerland)
Senior AI Research Scientist - LLM Agent @Bosch Group (Sunnyvale, CA, USA)
Senior Technical Support Engineer @Salesforce (Japan/Remote)
Data Analytics Manager @Sei Foundation (Remote)
Interested in sharing a job opportunity here? Contact [email protected].
Think a friend would enjoy this too? Share the newsletter and let them join the conversation.
Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI, from research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.
Published via Towards AI