


TAI #105: Claude Sonnet 3.5; price alone is progress.

Author(s): Towards AI Editorial Team

Originally published on Towards AI.

What happened this week in AI by Louie

AI news this week was dominated by the surprise release of a new model from Anthropic, which now tops most LLM benchmarks. Claude 3.5 Sonnet’s price is 80% lower than that of Anthropic’s Opus 3.0 model (launched just three months ago), and it is also far better and faster. Despite this better, faster, cheaper trifecta, some people are still vocally disappointed and claim that LLM progress is stalling! In most industries, price reduction alone is the biggest sign of technological progress. For example, solar energy costs are currently falling by around 80% every 10 years, and this is due to lots of hard work, large amounts of investment, and many new inventions and innovations. Open-source AI has also rapidly contributed to even more affordable LLMs (such as DeepSeek-Coder-V2).
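To put the two price figures side by side, here is a minimal arithmetic sketch. It assumes a constant compounding rate of decline for solar (a simplification we introduce here, not a claim from the source) and simply converts "80% lower" into a price multiple.

```python
def annualized_decline(total_decline: float, years: float) -> float:
    """Constant yearly decline that compounds to `total_decline` over `years`."""
    remaining = 1.0 - total_decline          # fraction of the original price left
    return 1.0 - remaining ** (1.0 / years)  # yearly drop that compounds to it


# Solar: roughly 80% cheaper every 10 years (figure quoted in the text).
solar_per_year = annualized_decline(0.80, 10)

# LLM: a price that is 80% lower means the older model costs 1 / (1 - 0.80) times more.
price_multiple = 1.0 / (1.0 - 0.80)

print(f"Solar: about {solar_per_year:.1%} cheaper per year")    # about 14.9%
print(f"Older model: about {price_multiple:.1f}x the new price")  # about 5.0x
```

Under that assumption, an 80% drop per decade works out to roughly 15% per year, while the 80% drop between the two models arrived within about three months.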

I spent a lot of time with Claude 3.5 this week, and on the capability side, it has clearly made great progress, particularly in coding, both for developers and for non-technical users (who get a new no-code way to interact with generated code). For developers, the new Sonnet 3.5 model reached a 64% success rate on complex agentic coding tasks, compared with just 38% for the slower and 5x more expensive Opus 3.0 model released three months ago. This was on an internal Anthropic benchmark where the agent had to search, view, and edit multiple files (mostly 3 or 4, up to 20!) to fully resolve pull requests.
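Price and success rate compound, which a rough calculation makes visible. The sketch below assumes both models consume the same number of tokens per attempt and uses relative price units (1 for Sonnet 3.5, 5 for Opus 3.0); those assumptions are ours for illustration, and only the 64%, 38%, and 5x figures come from the text.

```python
def cost_per_solved_task(cost_per_attempt: float, success_rate: float) -> float:
    """Total spend divided by tasks solved, for one attempt per task."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


# Relative units: one Sonnet 3.5 attempt costs 1, one Opus 3.0 attempt costs 5.
sonnet = cost_per_solved_task(1.0, 0.64)
opus = cost_per_solved_task(5.0, 0.38)
advantage = opus / sonnet

print(f"Sonnet 3.5: {sonnet:.2f} units per solved task")  # 1.56
print(f"Opus 3.0:   {opus:.2f} units per solved task")    # 13.16
print(f"Advantage:  {advantage:.1f}x")                    # 8.4x
```

Under these assumptions, the cost of each solved task falls by more than 8x, not just the 5x implied by the price alone.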

For non-developers, Claude’s “Artifacts” feature now allows anyone to test and use LLM coding skills and build simple projects. It runs generated code directly in the browser, so you don’t need to copy and paste it into a separate code environment. While it only supports a limited set of libraries and languages, Claude chooses from a powerful toolkit to fulfill your requests: HTML for basic web pages, SVG for graphics, Mermaid for diagrams, React for interactive applications, Markdown for formatted text, and a code artifact for snippets in any programming language. This is particularly powerful because you can continue the conversation with the AI to iterate on its code, correct mistakes, and add features, all in natural language. OpenAI’s code interpreter also makes data analysis and testing LLM-generated Python code easier for non-technical users, but its front-end capabilities are narrower.

Why should you care?

In AI, maybe we are getting a bit spoiled if we are only impressed by huge generational leaps in capability. Practical utility and economic use cases depend a lot on pricing and latency, and faster, cheaper models also allow more complicated RAG, fine-tuning, and agent pipelines to be built to enhance performance further. Of course, many open-source models already provide great capabilities at much lower prices than models from closed labs, though they do lag in capability, particularly in multimodal ability.

There has, in fact, been some stalling in scaling up LLM training clusters and scaling models to new levels of compute budget (layers, dimensions, and training data). I think getting clusters beyond ~30k H100 GPUs has taken longer than planned. But in the meantime, there has been progress on better datasets, new algorithms, and new efficiency techniques, often helped by innovation in the open-source arena. That’s not to say there is no merit in skepticism about LLMs’ ability to keep progressing indefinitely without integrating more new ideas. It is true that we haven’t yet seen a model with a GPT-5 class training cost, nor a capability jump on the scale of the one GPT-4 delivered. And there are still large classes of reasoning problems where LLMs fail completely.

Regarding Claude Artifacts: at Towards AI, our focus is on making AI accessible, so it’s great to see the coding capabilities of LLMs increasingly opened up for non-technical people to experience and experiment with. We recently released our 470-page book on building LLM apps with Python and will soon launch a new advanced, practical, full-pipeline LLM & RAG course, and we will also release courses this year to help non-technical people start building with LLMs.

Louie Peters — Towards AI Co-founder and CEO

This issue is brought to you thanks to AIPort:

Ever wondered how large language models like GPT-4 handle their immense computational demands? AIport’s latest article dives into the world of LLMs, exploring the challenges of model training and the innovative solutions that make it possible. Discover the top nine shortlisted tools that revolutionize LLM training, scaling, evaluation, and deployment.

From understanding model parallelism and detecting hallucinations to robust logging mechanisms and community support, this comprehensive guide is a must-read. Available for free, it’s perfect for both AI enthusiasts keen to transform their understanding of AI development and experienced engineers and entrepreneurs looking for ways to cut costs.

Read the complete article here!

Hottest News

1. Anthropic Introduced Claude 3.5 Sonnet

Anthropic has launched its newest model, Claude 3.5 Sonnet, which it says can equal or better OpenAI’s GPT-4o or Google’s Gemini across various tasks. The new model is already available to Claude users on the web and iOS, and Anthropic is also making it available to developers.

2. Runway Introduced Gen-3 Alpha: A New Frontier for Video Generation

Runway has launched Gen-3 Alpha, an advanced AI capable of generating videos and images from text and images. It features control modes for detailed manipulations and promises future enhancements in structure, style, and motion control.

3. Major Record Labels Sue AI Company Behind ‘BBL Drizzy’

A group of record labels, including the big three (Universal Music Group (UMG), Sony Music Entertainment, and Warner Records), is suing Suno and Udio, two of the top names in generative AI music making, alleging the companies violated their copyright “en masse.” Suno generates music using a transformer model similar to ChatGPT, and we have had great fun making music in many genres and using different artificial voices. However, the models can sometimes memorize parts of their training data, which, in this case, is owned by music companies with a history of heavy copyright protection. And, of course, the legal details of whether companies are allowed to train on this data in the first place are still murky.

4. Apple and Meta Have Discussed an AI Partnership

Apple is reportedly in talks with Meta about integrating the social media giant’s generative AI models into Apple Intelligence. The rumored deal structure would allow Meta and other AI partners to offer premium subscriptions through Apple devices, with Apple taking a revenue cut.

5. Ilya Sutskever, OpenAI’s Former Chief Scientist, Launches a New AI Company

Ilya Sutskever, alongside Daniel Gross and Daniel Levy, has established Safe Superintelligence Inc. (SSI), a new AI venture based in Palo Alto and Tel Aviv dedicated to creating superintelligent AI with a strong emphasis on safety. SSI is poised to integrate AI advancements with robust safety measures, prioritizing long-term security over immediate profits. It is anticipated to attract substantial investment due to Ilya’s legendary role in many key AI breakthroughs. However, will it be difficult to raise money to compete with companies (and charities) that promise investors more near-term profits?

6. Google’s Gemini API Introduces Context Caching To Optimize AI Workflows

Google’s Gemini API has recently launched context caching, which can make long context windows more economical to use, particularly as part of a RAG tech stack or potentially as a fine-tuning alternative that uses more examples for in-context learning. This feature lets developers store the tokens of an already-processed input context in a dedicated cache for applications where the same input is frequently reused. Later requests can then reference these cached tokens, eliminating the need to repeatedly pass the same set of tokens to the model.
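To make the idea concrete, here is a toy sketch of context caching in plain Python. It is not the Gemini API: the `ContextCache` class, the `tokens_uploaded` helper, and the token lists are hypothetical names and data we made up for illustration, and it only counts tokens sent by the client. How a provider actually bills cached tokens and cache storage is not modeled.

```python
import hashlib


class ContextCache:
    """Hypothetical in-memory stand-in for a provider-side context cache."""

    def __init__(self):
        self._entries = {}

    def store(self, tokens):
        # Derive a short handle from the content so it can be referenced later.
        digest = hashlib.sha256("\x1f".join(tokens).encode("utf-8")).hexdigest()
        handle = digest[:16]
        self._entries[handle] = list(tokens)
        return handle

    def resolve(self, handle):
        return self._entries[handle]


def tokens_uploaded(context, questions, cache=None):
    """Count tokens the client sends, with or without a cache."""
    if cache is None:
        # Every request carries the full context plus the question.
        return sum(len(context) + len(q) for q in questions)
    handle = cache.store(context)  # the context is uploaded once
    uploaded = len(context)
    for q in questions:
        cache.resolve(handle)      # each request references the handle only
        uploaded += len(q)
    return uploaded


context = ["tok"] * 1000  # stand-in for a long document
questions = [["what", "is", "in", "section", str(i)] for i in range(3)]

without_cache = tokens_uploaded(context, questions)
with_cache = tokens_uploaded(context, questions, ContextCache())
print(without_cache, with_cache)  # 3015 1015
```

The saving grows with the number of requests that reuse the same context, which is why the feature is aimed at workloads that reuse the same input frequently.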

According to a survey by Bain & Company, 2024 is the year for GenAI to deliver results and generate real business value. The focus is on quality, capabilities, realistic expectations, and proprietary AI solutions. Another survey from CB Insights observed a similar trend, with companies focusing on cost savings and productivity.

Five 5-minute reads/videos to keep you learning

1. Building a Personalized Code Assistant With Open-Source LLMs Using RAG Fine-Tuning

In this study, the authors experimented with fine-tuning Mistral 7B Instruct v0.2 on the Together AI Platform. They conducted experiments on five different codebases: Axolotl, Deepspeed, vLLM, Mapbox, and WandB. The article contains the results, generated examples, in-depth insights on RAG, and more.

2. Extracting Concepts From LLMs: Anthropic’s Recent Discoveries

Anthropic has advanced the interpretability of LLMs by integrating Sparse AutoEncoders (SAEs) with models like Claude-3-Sonnet to extract interpretable features across multiple languages. However, OpenAI cautions that excessive dependence on SAE-extracted features can hinder performance. This research represents substantial progress in decoding LLMs, but achieving full understanding is still elusive.

3. Understanding Mamba and Selective State Space Models (SSMs)

This article explores the Mamba architecture. If SSM-driven architectures like Mamba can consistently perform as well as or better than Transformers — at a fraction of the training and inference cost — they will quickly become the norm.

4. Linear Algebra 101 for AI/ML — Part 1

This is an introductory guide to linear algebra. It covers the basics of vectors, matrices, their operations, and how to work with them in PyTorch. It compresses 6+ months of learning into a digestible article and also includes interactive question and quiz modules.

5. Sycophancy to Subterfuge: Investigating Reward Tampering in Language Models

The article discusses how AI models using reinforcement learning may exhibit “specification gaming” and “reward tampering,” leading to manipulative behaviors aimed at maximizing rewards, which can include deceitful tactics and untrained modifications of their reward functions. The studies show that such issues persist despite attempts to prevent them.

6. A Visual Walkthrough of DeepSeek’s Multi-Head Latent Attention (MLA)

This article discusses the bottlenecks that transformer-based LLMs encounter during training and inference and explains how DeepSeek’s Multi-Head Latent Attention (MLA) addresses them. It primarily covers the GPU processing bottleneck and the MLA mechanism itself.

Repositories & Tools

  1. DeepSeek-Coder-V2 is an open-source language model specialized in coding and mathematics. Its authors report that it outperforms closed models such as GPT-4 Turbo on coding and math benchmarks.
  2. Maestro is a framework for Claude Opus to break down an objective into sub-tasks, execute each sub-task, and refine the results into a cohesive final output.
  3. Tokencost estimates the USD cost of prompts and completions for major LLM APIs.
  4. ReaLHF is a new approach for RLHF Training of LLMs with Parameter Reallocation.
  5. DiffSynth Studio is a Diffusion engine with restructured architectures, including Text Encoder, UNet, and VAE.
  6. GraphRAG is an open-source tool that combines RAG with knowledge graphs to solve critical LLM issues like hallucination and lack of domain-specific context.

Top Papers of The Week

1. Transformers Can Do Arithmetic with the Right Embeddings

This paper aims to improve the performance of transformers on arithmetic tasks by adding an embedding to each digit that encodes its position relative to the start of the number. This enables architectural modifications such as input injection and recurrent layers to improve performance further. It achieves up to 99% accuracy on 100-digit addition problems.
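The core idea of a per-digit position signal can be sketched in a few lines. The example below assumes one token per character and only computes the position index of each digit within its number; in a model, each index would select a learned embedding vector that is added to the digit's token embedding. This is our illustration with a hypothetical function name, not the paper's code.

```python
def digit_positions(tokens):
    """Return each token's 1-based index within its run of digits (0 for non-digits)."""
    positions = []
    run = 0
    for tok in tokens:
        if tok.isdigit():
            run += 1  # one step further into the current number
        else:
            run = 0   # operators and separators reset the counter
        positions.append(run)
    return positions


tokens = list("123+45=")
print(digit_positions(tokens))  # [1, 2, 3, 0, 1, 2, 0]
```

Because the index restarts at every number, digits of the same significance in different operands receive the same position signal wherever the numbers appear in the sequence.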

2. Meta Learning Text-to-Speech Synthesis in over 7000 Languages

This work builds a single text-to-speech synthesis system capable of generating speech in over 7000 languages. The authors combine multilingual pretraining with meta-learning to approximate language representations. This approach enables zero-shot speech synthesis in languages without any available data.

3. Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models

A new benchmark called MultiModal Needle-in-a-haystack (MMNeedle) has been introduced to evaluate the long-context handling capabilities of Multimodal Large Language Models (MLLMs). This benchmark tests MLLMs by requiring them to identify specific components within multi-image inputs, measuring their visual context processing.

4. XLand-100B: A Large-Scale Multi-Task Dataset for In-Context Reinforcement Learning

XLand-100B is a large-scale dataset for in-context reinforcement learning, featuring 100 billion transitions from 2.5 billion episodes across approximately 30,000 tasks. Built on the XLand-MiniGrid framework, it was created with 50,000 GPU hours to enhance research in the field.

5. HelpSteer2: Open-source Dataset for Training Top Performing Reward Models

HelpSteer2 is an open-source dataset licensed under CC-BY-4.0 designed to improve reward model training in LLMs by aligning with human preferences. With fewer data pairs than competitors, it has achieved a record 92.0% on the Reward Bench.

Quick Links

1. OpenAI acquired Rockset, a search and database analytics startup. OpenAI said Rockset’s tech will be integrated across its products to enhance its retrieval infrastructure.

2. Stability AI appointed Prem Akkaraju as CEO. According to a report from The Information, alongside the new CEO, Stability AI will receive a cash infusion from an investor group led by Sean Parker.

3. A new Bloomberg report examined data from thousands of data centers worldwide and found that demand for AI systems is significantly increasing the energy demand and consumption of these data centers.

Upcoming Events

Reuters Events is delighted to extend a special invitation to you: a complimentary guest pass to the MOMENTUM AI San Jose Business Summit, happening on July 16–17. This is a must-attend event for those looking to stay ahead in the rapidly evolving field of AI.

Why Should You Attend?

  • Engage with top-tier executives, such as CTOs, CIOs, and CDOs, from various industries as they discuss implementing and scaling AI across business functions.
  • Benchmark your AI strategies against some of the foremost innovators in the retail, healthcare, finance, logistics, travel, and manufacturing sectors.
  • Access exclusive networking opportunities tailored for senior leaders. Join the confirmed attendees from Tyson Foods, Netflix, AbbVie, Eli Lilly, VISA, and many more.

Limited Availability! Get complimentary passes here, and don’t miss your chance to be part of this pivotal event.

Who’s Hiring in AI

Principal Machine Learning Engineer @Twilio (USA/Remote)

Senior Software Engineer (Contract) @Wikimedia Foundation (Remote)

AI Project Manager @XA Group (Dubai)

Intern — Backend @DAZN (Katowice, Poland)

Natural Language Processing Researcher @Kitware Inc. (USA/Remote)

AI/ML Developer @Novapulse AI (Remote)

AI and Software Coding Monster, Intern @Pixona.io (USA/Remote)

Interested in sharing a job opportunity here? Contact [email protected].

If you are preparing for your next machine learning interview, don’t hesitate to check out our leading interview preparation website, confetti!

Think a friend would enjoy this too? Share the newsletter and let them join the conversation.

Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.

Published via Towards AI
