TAI #119 New LLM audio capabilities with NotebookLM and ChatGPT Advanced Voice
Last Updated on October 5, 2024 by Editorial Team
Author(s): Towards AI Editorial Team
Originally published on Towards AI.
What happened this week in AI by Louie
This week, we focused on new voice capabilities for LLMs, with Google's recently released audio features in NotebookLM and OpenAI's move to roll out ChatGPT's Advanced Voice Mode, the fully multimodal version of GPT-4o, more widely. We also saw some great new LLM options released, with Llama 3.2 (the first multimodal models in the family) and Gemini 1.5 Pro-002 bringing strong benchmark improvements and lower prices.
NotebookLM, an experimental tool from Google, helps users organize, analyze, and synthesize information from their own documents. It acts as a virtual research assistant and allows users to "ground" the language model in their materials, such as Google Docs and PDFs, to help them get insights and potentially generate new ideas. It recently introduced the "Audio Overviews" feature for turning research into audio summaries or short podcasts, which has led to many great demos of people turning their content directly into podcasts. Over 2,000 people have written for our Towards AI publication, and we think this opens up great potential for them to easily expand their audience through new content mediums!
Meanwhile, OpenAI's wider rollout of Advanced Voice Mode, part of GPT-4o, brought natural conversation capabilities to ChatGPT. It allows real-time voice interactions and is also supposed to detect non-verbal cues like tone and speed, making responses more emotionally tuned and human-like. Users can interrupt and guide conversations without losing context, a capability that sets it apart from traditional voice assistants like Siri or Alexa. OpenAI's voice mode also has an "Audio Overview" feature, which lets users listen to synthesized summaries of their documents in a conversational podcast format. While still experimental, the feature is rolling out across the mobile apps, though it remains restricted in regions such as the EU due to regulatory concerns around AI's ability to detect emotions. OpenAI's voice mode also remains heavily constrained relative to its underlying capabilities in order to manage safety risks, such as preventing it from mimicking voices; work on these safety measures accounted for much of the delay in its release.
Why should you care?
These new models are important because advancements in voice-enabled AI tools like Google's NotebookLM and OpenAI's ChatGPT are making AI more practical and accessible. More natural, low-latency voice chatbots improve the quality of real-time conversations, allowing for smoother, more responsive interactions than older and less intelligent voice assistants like Siri or Alexa. They also allow for much more natural conversations than older chatbots, which had to chain together separate speech-to-text, LLM, and text-to-speech models.
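To see the difference concretely, here is a minimal sketch of that older cascaded approach using OpenAI's separate speech-to-text, chat, and text-to-speech endpoints (the file names and model choices are illustrative, and an API key is assumed); GPT-4o's Advanced Voice Mode replaces all three hops with a single audio-native model.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# 1) Speech-to-text: transcribe the user's spoken question.
with open("question.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio_file)

# 2) LLM: generate a text reply from the transcript.
reply = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": transcript.text}],
)
answer = reply.choices[0].message.content

# 3) Text-to-speech: synthesize the reply back into audio.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer)
with open("answer.mp3", "wb") as f:
    f.write(speech.content)
```

Each hop adds latency and discards non-verbal cues such as tone, which is exactly what an end-to-end audio model avoids.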
For professionals, researchers, and content creators, these tools offer new ways to handle and distribute their content. Features like "Audio Overviews" turn written content into audio summaries, making it easier to review materials or generate new ideas. This can be especially useful for multitasking or accessing information when reading isn't convenient. Additionally, these tools offer significant benefits for people with visual impairments by converting text into high-quality audio, making more content accessible.
– Louie Peters, Towards AI Co-founder and CEO
This issue is brought to you thanks to Nebius:
Don't feel GPU-less. We have H100 and L40S GPUs on demand, and H200s are coming soon and can be pre-ordered right now!
Our platform lets you scale from a single GPU to thousands, with a dedicated support engineer for multi-host training.
So when you think there's no GPU at the end of the tunnel… just visit.
Hottest News
1. Meta Unveils Llama 3.2, Edge AI and Vision With Open, Customizable Models
Llama 3.2 features advanced AI models optimized for edge and mobile devices, including vision LLMs (11B and 90B) and lightweight text-only models (1B and 3B). These models excel in tasks such as summarization and image understanding and support context lengths of up to 128K tokens.
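If you want to try the smaller text models locally, here is a minimal sketch using Hugging Face transformers; it assumes you have been granted access to the gated meta-llama/Llama-3.2-3B-Instruct checkpoint and have recent transformers and accelerate releases installed.

```python
from transformers import pipeline

# Load the 3B instruct model; device_map="auto" places it on a GPU if one is available.
generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-3B-Instruct",
    device_map="auto",
)

messages = [{"role": "user", "content": "Summarize the key features of Llama 3.2 in two sentences."}]
output = generator(messages, max_new_tokens=128)

# The pipeline returns the full chat, with the assistant reply as the last message.
print(output[0]["generated_text"][-1]["content"])
```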
2. Googleβs New Gemini 1.5 AI Models Offer More Power and Speed at Lower Costs
Google has released two updated Gemini AI models that promise more power, speed, and lower costs. The new versions, Gemini-1.5-Pro-002 and Gemini-1.5-Flash-002, offer significant improvements over their predecessors, according to Google, showing gains across a range of benchmarks, particularly in maths, long context and visual tasks. In addition, the company has reduced the price of input and output tokens for Gemini 1.5 Pro by more than 50%, increased rate limits for both models and reduced latency.
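The new checkpoints are available by model name through the Gemini API; a minimal sketch with the google-generativeai Python SDK (assuming you have an API key) looks like this:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # or read the key from an environment variable

# Point at the updated checkpoint by name; "gemini-1.5-flash-002" works the same way.
model = genai.GenerativeModel("gemini-1.5-pro-002")
response = model.generate_content("Explain the trade-off between context length and latency in one paragraph.")
print(response.text)
```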
3. OpenAI CTO Mira Murati Is Leaving
Mira Murati, CTO of OpenAI, is leaving the company after over six years to pursue personal interests. Her departure comes as OpenAI prepares for DevDay and undergoes significant changes, including CEO Sam Altman's growing influence and a potential $150 billion funding round. Murati played a key role in developing major AI projects like ChatGPT.
4. OpenAI Might Raise the Price of ChatGPT to $44 by 2029
The New York Times, citing internal OpenAI docs, reports that OpenAI plans to raise the price of individual ChatGPT subscriptions from $20/month to $22/month by the end of the year. A steeper increase will come over the next five years; by 2029, OpenAI expects to charge $44 per month for ChatGPT Plus.
5. Microsoft Re-Launches "Privacy Nightmare" AI Screenshot Tool
Microsoft's Recall, labeled a potential "privacy nightmare" by critics, will be relaunched in November on its new Copilot+ PCs. Some of its more controversial features have been stripped out; for example, it will be opt-in, whereas the original version was turned on by default.
6. Google Unveils AlphaChip, Its Reinforcement Learning Method for Chip Design
Google unveiled its AlphaChip reinforcement learning method for designing chip layouts. AlphaChip promises to substantially speed up the design of chip floorplans and make them more optimal in terms of performance, power, and area. The reinforcement learning method, now shared with the public, has been instrumental in designing Google's Tensor Processing Units (TPUs).
7. Meta's New AI-Made Posts Open a Pandora's Box
Meta plans to generate synthetic content tailored to individual users. Meta said it will generate some images based on a user's interests and others that feature their likeness. Users will have the option to take that content in a new direction or swipe to see more content imagined for them in real time.
Five 5-minute reads/videos to keep you learning
1. Llama Can Now See and Run on Your Device β Welcome Llama 3.2
Llama 3.2 introduces advanced multimodal and text-only models, including 11B and 90B Vision models and smaller 1B and 3B text models for on-device use. Enhancements feature visual reasoning and multilingual support, though EU users face licensing restrictions on multimodal models.
2. Converting A From-Scratch GPT Architecture to Llama 2
The article outlines the process of converting a GPT model to a Llama 2 model, highlighting key modifications such as replacing LayerNorm with RMSNorm, GELU with SiLU activation, and incorporating rotary position embeddings (RoPE). It also details updates to the MultiHeadAttention and TransformerBlock modules to support these changes.
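To make one of those swaps concrete, here is a minimal PyTorch sketch of the RMSNorm layer that replaces LayerNorm in Llama-style models; unlike LayerNorm, it skips mean subtraction and the bias term and rescales by the root mean square of the activations (the eps value here is a typical choice, not necessarily the exact one from the article).

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable per-feature scale, no bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the root mean square over the feature dimension (no mean subtraction).
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)
```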
3. ChatGPT-o1 vs Claude 3.5 Coding Performance Compared
This is a comparative analysis of OpenAI o1 and Claude 3.5 using Cursor AI. It sheds light on their respective strengths and limitations in coding tasks. While Claude 3.5 demonstrated superior performance in the tested scenarios, the true potential of OpenAI o1's advanced reasoning capabilities remains to be fully explored.
4. OpenAI's Advice on Prompting
The o1-preview and o1-mini models excel in scientific reasoning and programming, showing strong performance in competitive programming and academic benchmarks. Ideal for deep reasoning applications, they currently support text-only inputs and have limitations in their beta phase, such as a lack of image input support and slower response times.
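In practice, calling the o1 models looks like any other Chat Completions request, subject to the beta limitations the advice mentions; this sketch assumes the OpenAI Python SDK and an API key with o1 access.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# o1-preview is text-only in beta: no image inputs, and sampling parameters such as
# temperature are fixed, so the request is simply a plain user message.
response = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": "Prove that the sum of two even integers is even."}],
)
print(response.choices[0].message.content)
```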
5. Convolutional networks have been around for a long time, but their performance has been limited by the size of the available training sets and the size of the networks under consideration. This article highlights newer techniques that produce classification outputs drawing on many layers, allowing effective localization and simultaneous use of context.
6. Top Generative AI Use Cases in 2024
This guide highlights some of the top generative AI use cases across different fields, demonstrating how it revolutionizes areas like healthcare and finance.
7. Devs Gaining Little (if Anything) From AI Coding Assistants
A study from Uplevel found that AI coding assistants like GitHub Copilot do not significantly improve developer productivity as measured by pull request cycle time and throughput, contradicting anecdotal claims.
Repositories & Tools
- Crawl4AI simplifies asynchronous web crawling and data extraction, making it accessible for LLMs and AI applications.
- Kotaemon is an open-source RAG-based tool for chatting with your documents.
- Exo allows you to run your own AI cluster at home with everyday devices.
- Kestra is a universal open-source orchestrator that makes scheduled and event-driven workflows easy.
- Count Token Optimization presents an iterative optimization method for text-to-image diffusion models to enhance object counting accuracy.
Top Papers of The Week
1. Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts
This paper introduces Time-MoE, a scalable and unified architecture designed to pre-train larger, more capable forecasting foundation models while reducing inference costs. By leveraging a sparse mixture-of-experts (MoE) design, Time-MoE enhances computational efficiency by activating only a subset of networks for each prediction, reducing computational load while maintaining high model capacity.
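As an illustration of the general idea (not Time-MoE's exact architecture), here is a minimal top-k routed mixture-of-experts layer in PyTorch: the router sends each token to only k experts, so most of the parameters stay idle on any given forward pass.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, dim: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, num_experts)  # gating network that scores each expert
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))
             for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (num_tokens, dim)
        scores = self.router(x)                            # (num_tokens, num_experts)
        weights, idx = scores.topk(self.k, dim=-1)         # keep only the top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                   # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```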
2. HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models
HelloBench is a benchmark designed to assess the long text generation abilities of Large Language Models, addressing their difficulties in producing texts over 4000 words with consistent quality. It categorizes tasks into five groups and introduces HelloEval, an evaluation method that closely aligns with human judgment.
3. A Controlled Study on Long Context Extension and Generalization in LLMs
This controlled study on extending language models for long textual contexts establishes a standardized evaluation protocol. Key findings highlight perplexity as a reliable performance metric, the underperformance of approximate attention methods, and the effectiveness of exact fine-tuning methods within their extension range.
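Since the study leans on perplexity as its headline metric, here is a quick reminder of what it measures: the exponential of the average negative log-likelihood per token, so lower is better. A minimal sketch:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token); lower means
    the model assigns higher probability to the observed text."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Example: three tokens with log-probabilities of -1.2, -0.3, and -2.0 nats.
print(perplexity([-1.2, -0.3, -2.0]))  # ≈ 3.21
```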
4. LOTUS: Enabling Semantic Queries with LLMs Over Tables of Unstructured and Structured Data
This paper introduces semantic operators, a declarative programming interface that extends the relational model with composable AI-based operations for semantic queries over datasets. Each operator can be implemented and optimized in multiple ways, opening a space for execution plans similar to relational operators.
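To give a flavor of the semantic-operator idea, here is a hypothetical, self-contained sketch; the operator and helper names are illustrative rather than the paper's actual API, and a stub stands in for the LLM so the example runs offline. The point is that LLM-backed operations compose declaratively over a table, just like relational operators.

```python
import pandas as pd

def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call so this sketch runs without an API key."""
    return "time-series forecasting" if "Time-MoE" in prompt else "long-form text generation"

def sem_map(df: pd.DataFrame, template: str, out: str) -> pd.DataFrame:
    """Hypothetical semantic operator: adds an LLM-produced column for each row."""
    result = df.copy()
    result[out] = [fake_llm(template.format(**row)) for row in df.to_dict("records")]
    return result

papers = pd.DataFrame({
    "title": ["Time-MoE", "HelloBench"],
    "abstract": ["Sparse mixture-of-experts models for forecasting.",
                 "A benchmark for long text generation."],
})

# Compose like relational operators: project columns, then semantically map over rows.
print(sem_map(papers[["title", "abstract"]], "What topic is {title} about? {abstract}", out="topic"))
```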
5. SciAgents: Automating Scientific Discovery Through Multi-Agent Intelligent Graph Reasoning
This paper presents SciAgents, an approach that leverages large-scale ontological knowledge graphs to organize and interconnect diverse scientific concepts. The framework autonomously generates and refines research hypotheses, elucidating underlying mechanisms, design principles, and unexpected material properties.
Quick Links
1. Bright Data offers web data solutions for AI and LLM developers. These solutions make gathering, managing, and integrating web data into AI models easier, streamlining the development process. Two standout offerings are its Dataset Marketplace and Web Scraper APIs, designed to make data collection more accessible and efficient.
2. California Governor Gavin Newsom vetoed the Safe and Secure Innovation for Frontier Artificial Intelligence Models Act (SB 1047). In his veto message, Governor Newsom cited multiple factors in his decision, including the burden the bill would have placed on AI companies, California's lead in the space, and a critique that the bill may be too broad.
3. Airtable launched an enterprise-grade AI platform. It includes App Library, which allows companies to create standardized AI-powered applications that can be customized across an organization, and HyperDB, which enables integration of massive datasets of over 100 million records.
Whoβs Hiring in AI
Data Engineer (AWS, Snowflake, dbt) - R2843-6334 @Bcidaho (USA/Remote)
TECH Program Associate - Data Platforms @Spectrum (Madison, WI, USA)
Staff Engineer - AI/Machine Learning @LinkedIn (Sunnyvale, CA, USA)
Data Science and Analytics, Product @Anthropic (San Francisco, CA, USA)
Senior Machine Learning Engineer @webAI (USA/Remote)
Software Engineer @JPMorgan Chase (Columbus, IN, USA)
Interested in sharing a job opportunity here? Contact [email protected].
Think a friend would enjoy this too? Share the newsletter and let them join the conversation.
Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI, from research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.
Published via Towards AI