
Revolutionizing Web Accessibility with the Hey AI Browser-Native Copilot

Last Updated on January 6, 2025 by Editorial Team

Author(s): Ido Salomon

Originally published on Towards AI.

How on-device AI can reshape the browsing experience

Browser-native copilot (Image generated by author using DALL-E)

Motivation

The internet is the world’s equalizer, or at least it should be. While it has revolutionized how we learn, work, and connect, it still challenges many users. Consider the millions who find the online world taxing rather than liberating due to motor impairments, vision differences, difficulty navigating traditional interfaces, or challenges discerning reliable information in a sea of content. Traditional interfaces assume uniformity that doesn’t align with our diverse society.

Artificial intelligence (AI) has long promised to bridge this gap. Voice assistants, screen readers, and browser extensions have tried to remove barriers to access. Yet these solutions have relied on cloud-based services, introducing privacy concerns over personal data sent to the cloud, cost barriers tied to cloud computing, unpredictable network latency and availability, and a one-size-fits-all approach that rarely adapts to individual users’ needs.

Fortunately, the last two years have rewritten the rules of what’s possible. Compact and efficient open-source models can now run locally on consumer devices instead of being restricted to specialized hardware. At the same time, modern browser technologies like WebGPU and increasingly optimized AI runtimes such as MediaPipe and ONNX enable accelerated on-device inference. Together, these advancements support a new class of AI-powered experiences that run entirely within the browser, respecting users’ privacy, responding in real time, and adapting to individual needs.
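
As a concrete starting point, here is a minimal TypeScript sketch of the feature detection such an experience begins with: prefer WebGPU acceleration when the browser exposes it, and fall back to a WASM/CPU path otherwise.

```typescript
// Minimal sketch: prefer WebGPU for accelerated on-device inference and
// fall back to a WASM/CPU backend when it is unavailable.
async function pickBackend(): Promise<"webgpu" | "wasm"> {
  // navigator.gpu is only defined in WebGPU-capable browsers.
  const gpu = (navigator as { gpu?: { requestAdapter(): Promise<unknown> } }).gpu;
  if (gpu) {
    try {
      const adapter = await gpu.requestAdapter();
      if (adapter) return "webgpu";
    } catch {
      // Fall through to the CPU path.
    }
  }
  return "wasm";
}

pickBackend().then((backend) => console.log(`Running inference on ${backend}`));
```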

This article introduces Hey AI, the first browser extension built on these principles. It blends AI-powered local voice transcription, language and content understanding, and real-time interaction into a fully browser-native copilot experience. The extension addresses the immediate challenge of voice-driven accessibility and lays the groundwork for future interaction modes and experiences, such as eye-based control. The result is a proof of concept that serves as a stepping stone toward a genuinely inclusive human-centric web.

Introducing the Hey AI copilot

See it in action

Imagine starting your day with the news by saying: “Open Google News. Click Search, type Presidential race, and Submit. What is the bottom line?” With Hey AI, local AI detects the commands, transcribes them, executes them, processes the prompt with an LLM, and synthesizes a concise voice summary on the spot. Everything happens within your browser, with no per-request cost and no risk of sending sensitive personal data to third parties.

The possibilities aren’t limited to non-standard input methods. Any user can benefit from critical content inspection (e.g., “Is this a scam?”), which protects against misinformation and social engineering attacks, or from summarization and focused content extraction. We can make the web more navigable, trustworthy, helpful, and accommodating to individual needs.

Why browser-native?

The browser-native approach doesn’t aim to replicate cloud-based capabilities. Instead, it seeks to improve on them:

  • Privacy: all data stays on your device. No personal information, such as recordings or content, ever leaves it.
  • Responsiveness: commands are carried out locally without network latency, making the copilot feel like an integrated feature rather than a remote add-on.
  • Cost and availability: local inference eliminates per-request fees and dependency on remote service uptime. It works anywhere, anytime, regardless of service availability or connectivity.
  • Personalization: cloud-based tools cater to the lowest common denominator. Since the browser-native copilot runs locally, you can fine-tune it to your voice, interests, and habits, molding it to your preferences. Moreover, you can customize its capabilities to your needs, fueled by an open-source community.

Under the hood

Copilots are complex beasts. They must capture user input, interpret it, apply the correct context, execute corresponding actions, and then provide feedback, all while running performantly in the browser. To understand what makes this possible, we’ll review the entire flow outlined in the architecture diagram.

Browser-native copilot architecture diagram (Image by author)
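
At a high level, the flow boils down to a loop over a handful of stages. The sketch below uses assumed TypeScript signatures purely for orientation; each stage is covered in detail in the sections that follow.

```typescript
// Assumed shapes for illustration only; not Hey AI's actual interfaces.
type Command = { action: string; args?: Record<string, string> };

async function copilotLoop(stages: {
  capture: () => Promise<Float32Array>;                  // microphone + VAD
  transcribe: (audio: Float32Array) => Promise<string>;  // speech-to-text
  interpret: (text: string) => Promise<Command[]>;       // NLP engine / LLM
  execute: (commands: Command[]) => Promise<string>;     // browser + page actions
  feedback: (result: string) => Promise<void>;           // TTS, toasts, audio cues
}): Promise<never> {
  while (true) {
    const audio = await stages.capture();
    const text = await stages.transcribe(audio);
    const commands = await stages.interpret(text);
    const result = await stages.execute(commands);
    await stages.feedback(result);
  }
}
```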

Capturing user input

All interactions start with the user’s input, which Hey AI continuously listens for through the microphone. However, raw audio is a messy stream of mostly long silences and background noise. Without careful filtering and processing, the copilot wastes resources and responds slowly to genuine user input.

To mitigate these concerns, the copilot employs the Silero VAD (v5) voice activity detection model through the vad-web library, built on Transformers.js. VAD pinpoints when someone is speaking, filtering out the excess noise and yielding clean audio segments that are likely to contain speech.
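
A hedged sketch of this capture stage with vad-web (the callbacks follow vad-web’s public API; handleSpeech is a placeholder for the transcription step below):

```typescript
import { MicVAD } from "@ricky0123/vad-web";

// Silero VAD listens continuously and only surfaces segments likely to
// contain speech; silence and background noise never reach later stages.
const vad = await MicVAD.new({
  onSpeechStart: () => console.log("Speech detected"),
  // onSpeechEnd delivers a Float32Array containing only the speech segment.
  onSpeechEnd: (audio: Float32Array) => handleSpeech(audio),
});

vad.start(); // begin listening on the microphone

async function handleSpeech(audio: Float32Array): Promise<void> {
  // Hand the clean segment to the speech-to-text stage (next section).
}
```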

Unfortunately, we can’t process these audio segments directly since the browser’s audio processing ecosystem is severely lacking. Hence, we must first transcribe the captured voice segments into text to unlock the rich text-based ecosystem of NLP and LLMs.

Transcription is performed by a small variant of the Whisper speech-to-text (STT) model. It runs via ONNX Runtime with multi-threading, delivering dozens to hundreds of tokens per second and ensuring real-time transcription that doesn’t leave the user waiting.
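
As an illustration, a small Whisper checkpoint can be loaded once and reused for every speech segment; Transformers.js (which runs ONNX models in the browser) and the tiny English checkpoint below are assumed choices, not necessarily Hey AI’s exact setup.

```typescript
import { pipeline } from "@xenova/transformers";

// Load a small Whisper variant once; subsequent calls reuse the session.
const transcriber = await pipeline(
  "automatic-speech-recognition",
  "Xenova/whisper-tiny.en",
);

async function transcribe(audio: Float32Array): Promise<string> {
  // The ASR pipeline accepts raw audio samples (16 kHz mono).
  const result = await transcriber(audio);
  return (result as { text: string }).text;
}
```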

Once speech is detected and transcribed, we must identify whether or not it’s directed at the copilot. Hey AI solves this with wake words (default or custom): their presence in a transcription indicates that the following text should be processed for intent recognition.
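
A minimal wake-word gate over the transcription can be as simple as the following (the wake phrases are illustrative; Hey AI ships defaults and supports custom ones):

```typescript
const WAKE_WORDS = ["hey ai", "ok copilot"]; // illustrative examples

function extractCommand(transcript: string): string | null {
  const lower = transcript.toLowerCase();
  for (const wake of WAKE_WORDS) {
    const index = lower.indexOf(wake);
    if (index !== -1) {
      // Everything after the wake word moves on to intent recognition.
      return transcript.slice(index + wake.length).trim();
    }
  }
  return null; // not addressed to the copilot
}
```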

Intent recognition

Intent recognition is challenging, particularly when the copilot supports over 20 command types (e.g., navigation, clicks, completions, question answering, etc.), and users often chain multiple commands within the same prompt. When coupled with occasional mistranscriptions, simple rule-based matching falls short.

To tackle this complexity and understand the meaning behind the words, Hey AI utilizes a two-layered approach:

  • NLP engine (based on Compromise): quickly identifies simple commands (like “Open YouTube”), shortening response times and conserving resources for more complex tasks.
  • LLM (Gemma 2 2B via WebLLM): handles commands that the NLP engine cannot confidently identify. It can understand ambiguous requests, tolerate minor parsing errors, and ask for clarification if the intent is unclear.

This approach pairs speed (from the fast NLP engine) with robustness (from the LLM), ensuring Hey AI handles everything from simple instructions to intricate multi-step tasks, as sketched below.
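
Here is a hedged sketch of the two layers working together; the Compromise pattern, prompt, and WebLLM model id are illustrative choices rather than Hey AI’s exact configuration.

```typescript
import nlp from "compromise";
import { CreateMLCEngine } from "@mlc-ai/web-llm";

type Command = { action: string; args?: Record<string, string> };

// Gemma 2 2B, quantized, loaded once through WebLLM (model id assumed).
const engine = await CreateMLCEngine("gemma-2-2b-it-q4f16_1-MLC");

async function recognizeIntent(text: string): Promise<Command[]> {
  // Fast path: the NLP engine catches simple, unambiguous commands.
  if (nlp(text).has("open .")) {
    const target = text.replace(/^.*?\bopen\b\s*/i, "").trim();
    return [{ action: "navigate", args: { target } }];
  }

  // Fallback: the local LLM resolves ambiguous or multi-step requests.
  const reply = await engine.chat.completions.create({
    messages: [
      { role: "system", content: "Extract browser commands as a JSON array." },
      { role: "user", content: text },
    ],
  });
  return JSON.parse(reply.choices[0].message.content ?? "[]");
}
```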

Controlling the browser

Understanding what the user wants is only half the battle. Once the copilot knows the user’s intent (scroll through the page, zoom in, and so on), it must collect any additional required context and dispatch the commands to the browser on the user’s behalf. Browser-level actions rely on direct extension APIs (such as opening new tabs, muting them, etc.). Page-level actions are mainly carried out by injected scripts, each with a dedicated responsibility (e.g., DOM manipulation or simulating keyboard- and mouse-based interaction).
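
For illustration, dispatching a few such commands from the extension’s background service worker could look like this (Manifest V3 APIs; the Command shape and action names are assumed):

```typescript
type Command = { action: string; args?: Record<string, string> };

async function executeCommand(cmd: Command, tabId: number): Promise<void> {
  switch (cmd.action) {
    case "navigate":
      // Browser-level action through a direct extension API; the URL is
      // assumed to have been resolved by the intent-recognition stage.
      await chrome.tabs.create({ url: cmd.args?.url });
      break;
    case "mute":
      await chrome.tabs.update(tabId, { muted: true });
      break;
    case "scroll":
      // Page-level action carried out by an injected script.
      await chrome.scripting.executeScript({
        target: { tabId },
        func: () => window.scrollBy({ top: window.innerHeight, behavior: "smooth" }),
      });
      break;
  }
}
```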

AI-driven commands, such as question answering and completion, are fulfilled by the LLM. As the most computationally demanding component, it relies on WebGPU to maintain near real-time performance (roughly 200 tokens per second for encoding and 30 for decoding).

Providing feedback

Feedback transforms the copilot from a one-way command executor to a two-way conversational partner:

  • Output from AI-driven commands is synthesized into speech. Hey AI utilizes the Piper text-to-speech (TTS) model via sherpa-onnx, which offers a selection of 923 human-like voices for users to choose from, fitting the copilot’s persona to the user.
  • Status reports are communicated with small, non-intrusive toast overlays (e.g., “Successfully executed switch tab”), as sketched after this list.
  • Mode changes (e.g., “waiting for command”) are indicated with subtle audio cues.
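
The toast overlay, for example, can be a few lines of injected DOM code; styling and timing below are illustrative.

```typescript
// Minimal status toast injected into the page by a content script.
function showToast(message: string, durationMs = 3000): void {
  const toast = document.createElement("div");
  toast.textContent = message;
  toast.setAttribute("role", "status"); // announced politely by screen readers
  Object.assign(toast.style, {
    position: "fixed",
    bottom: "16px",
    right: "16px",
    padding: "8px 12px",
    background: "rgba(0, 0, 0, 0.8)",
    color: "#fff",
    borderRadius: "6px",
    zIndex: "2147483647",
  });
  document.body.appendChild(toast);
  setTimeout(() => toast.remove(), durationMs);
}

showToast("Successfully executed switch tab");
```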

Challenges and lessons learned

Building copilots, especially when constrained to browser-native tools and runtimes, is far from simple and introduces many previously unexplored challenges.

On-device AI

The most significant hurdle compared to cloud-based solutions is the resource constraints of the users’ devices.

  • Potency: cloud LLMs, such as ChatGPT and Gemini, boast impressive capabilities powered by immense computing resources. Commodity user hardware can’t compete, constraining browser-native solutions to much smaller and more focused models that rely on optimization techniques like quantization. Luckily, these models are consistently improving, with the latest releases exhibiting capabilities previously reserved for their much larger peers.
  • Performance: although hardware is cheaper and more accessible than ever, the system requirements for small LLMs are still relatively high, generally targeting top-tier devices to reach adequate performance. Gradually, optimizations in AI runtimes, particularly GPU support and WASM multi-threading, are enabling more complex use cases. Hey AI takes advantage of both and distributes the load between CPU and GPU (roughly an even 3 GB each), extending the range of supported devices. In addition, to maximize perceived performance, the extension runs processing ahead of time so inference results are available the moment they are needed, for example by buffering synthesized segments during TTS (see the sketch after this list).
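
The buffer-ahead pattern boils down to starting the next synthesis before the current segment finishes playing. A minimal sketch, with synthesize and play standing in for the actual Piper/sherpa-onnx calls:

```typescript
// Synthesize segment i+1 while segment i is playing, so playback never
// stalls waiting on inference. synthesize() and play() are assumed stand-ins.
async function speakSentences(
  sentences: string[],
  synthesize: (text: string) => Promise<AudioBuffer>,
  play: (audio: AudioBuffer) => Promise<void>,
): Promise<void> {
  if (sentences.length === 0) return;
  let next = synthesize(sentences[0]); // kick off the first synthesis
  for (let i = 0; i < sentences.length; i++) {
    const current = await next;
    // Start the following segment before playing the current one.
    if (i + 1 < sentences.length) next = synthesize(sentences[i + 1]);
    await play(current);
  }
}
```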

Individuality

Unlike standardized mouse clicks or keystrokes, non-traditional inputs such as voice and gaze vary considerably. Users differ in language, accent, pacing, connotation, and thought process. These differences fall into two main categories:

  • Technical: with differences in pacing, for example, how long should the copilot wait before deciding that speech has ended? Any fixed threshold trades inclusivity against longer response times and potentially higher error rates from the extra noise it admits.
  • Intent: a transcription can mean different things for different users depending on context. For instance, “search for it” may refer to finding the word “it” on the current page, to something the user is looking at, or to a Google search.

Overall, accessibility is not one-size-fits-all. Personalizing the parameters for each user, whether through static configuration or a dynamic algorithm, is critical to tackling these challenges effectively, as sketched below.
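
A sketch of what that personalization surface might look like; the parameter names are hypothetical, not Hey AI’s actual configuration.

```typescript
// Hypothetical per-user tuning profile for illustration only.
interface UserProfile {
  endOfSpeechSilenceMs: number; // how long to wait before treating speech as finished
  wakeWords: string[];          // default or custom wake words
  speechRate: number;           // TTS speaking rate for spoken feedback
}

const defaults: UserProfile = {
  endOfSpeechSilenceMs: 800,
  wakeWords: ["hey ai"],
  speechRate: 1.0,
};

// A user who pauses longer mid-sentence raises the threshold instead of
// being cut off by a one-size-fits-all constant.
const deliberateSpeaker: UserProfile = { ...defaults, endOfSpeechSilenceMs: 1500 };
```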

Web heterogeneity

The web is an ever-changing Wild West. Websites have hundreds of ways to implement the same functionality, each potentially requiring a unique user interaction. Unfortunately, this means copilots can’t support all websites equally. To extend support, copilots must be robust to unpredictable page structures and DOM interaction models.
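
One way to stay robust is to resolve spoken targets against page semantics rather than site-specific markup, for example by checking accessible names and ARIA roles. A hedged sketch:

```typescript
// Find a clickable element for a spoken label (e.g., "click Search") by
// scanning semantic elements and ARIA attributes instead of fixed selectors.
function findClickable(label: string): HTMLElement | null {
  const wanted = label.trim().toLowerCase();
  const candidates = document.querySelectorAll<HTMLElement>(
    'button, a[href], input[type="submit"], [role="button"], [aria-label]',
  );
  for (const el of candidates) {
    const name = (el.getAttribute("aria-label") ?? el.textContent ?? "")
      .trim()
      .toLowerCase();
    if (name === wanted || name.includes(wanted)) return el;
  }
  return null;
}

findClickable("Search")?.click();
```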

Safety

AI safety, especially when acting on web data, is a never-ending chase. AI-based solutions must do their best to avoid unintended or harmful behavior, such as navigating to malicious pages or falling prey to adversarial LLM poisoning. Hey AI keeps the user in the driver’s seat by focusing on granular instructions rather than amorphous tasks.

Conclusion

Merely two years ago, browser-native AI seemed more of an experiment than a practical solution. Since then, leaps in AI efficiency, browser capabilities, runtime optimizations, and a growing open-source ecosystem have unlocked complex on-device use cases previously thought impossible.

This paradigm shift requires a new perspective on the use of local AI. Striving to match the potency of cloud-based systems is a distraction, as consumer devices will always lag behind cloud computing. Instead, we should focus on the cases where browser-native AI offers distinct value: privacy protection, lower cost, higher availability, and personalized experiences.

While challenges remain, Hey AI demonstrates that browser-native AI is a viable tool for reshaping the browsing experience. AI-powered accessibility and inclusion are already within reach. However, this is only the beginning: new modalities (like eye tracking), progressively more intelligent features, and a more robust framework are on the horizon.

Thanks for reading! I’ll soon release the source code for Hey AI on GitHub. Contributions and feedback are always welcome. Stay tuned for upcoming releases in this project, such as gaze-based control.

I’m eager to connect with individuals or organizations that could benefit from this accessibility tool or its future developments. Please reach out!


Published via Towards AI
