While Google and OpenAI Fight for the AI Bone, the Open Source Community Is Running Away with It
Last Updated on May 16, 2023 by Editorial Team
Author(s): Massimiliano Costacurta
Originally published on Towards AI.
"Hey, did you hear? They say Google and OpenAI don't have a competitive advantage in LLMs."
"Yeah, sure… who said it?"
"Google."
"Wait a minute…"
One week ago, SemiAnalysis released a real shocker when they made public a leaked document from Google titled "We Have No Moat, And Neither Does OpenAI." While we can't be sure the document is legit, it brings up some thought-provoking points about the real struggle in the world of large language models (LLMs). It's not Google vs. OpenAI; it's more like open-source LLMs taking on their closed-source counterparts.
This leaked document hints that both Google and OpenAI might be losing their edge against the ever-growing open-source LLM community. The reason is pretty simple: open-source projects are moving at lightning speed, faster than large corporations or corporate-backed companies can match, especially since open-source projects don't face many reputational risks. Apparently written by a Google researcher, the document emphasizes that even though Google and OpenAI have been working hard to create the most powerful language models, the open-source community is catching up at an astonishing pace. Open-source models are quicker, more adaptable, and more portable. They've managed to achieve great results with far fewer resources, while Google is grappling with bigger budgets and more complex models.
What's more, having tons of researchers working together in the open makes it tougher for companies like Google and OpenAI to stay ahead of the game in terms of technology. The report says that keeping a competitive edge in tech is getting even more difficult now that cutting-edge LLM research is within reach. Research institutions around the globe are building on each other's work, exploring the solution space in a way that's way beyond what any single company can do. Turns out, being able to train huge models from scratch on pricey hardware isn't the game changer it used to be, which means pretty much anyone with a cool idea can create an LLM and share it.
Alright, we've seen open-source projects trying to outdo their corporate counterparts before, but let's dig a bit deeper to see if this is a genuine threat in the AI world.
Ups and downs in the world of open collaboration
Open-source software has always seen its ups and downs. Some projects, like BIND, WordPress, and Firefox, have done really well, showing that they can stand up against big-name enterprise products. On the flip side, projects like OpenOffice, GIMP, and OpenSolaris struggled and rapidly lost ground. Regardless, open-source software is still popular, with many websites running on Apache web servers, BIND name servers, and MySQL databases.
Now, the problem is that keeping open-source projects funded and maintained can be tricky. It takes solid planning, the right resources, and a real connection with users. If a project has a dedicated user base and passionate developers, it's more likely to stay on top of its game and keep getting better. Back in 2018, OpenAI faced some of these hurdles and decided it was time for a change. They started looking for capital and eventually became a capped-profit company. That means they could take investments and offer investors a return capped at 100x their initial stake.
OpenAI said it needed these changes to fund research, support enterprise customers, and keep things safe. So, you could argue they did what they had to do to dodge the usual open-source pitfalls. But that didn't come for free: while OpenAI has made impressive progress in AI development, its increasing secrecy, lack of transparency, and limited customization options have alienated the very community it once aimed to serve.
On the other hand, Google is really into open-source software, and they are involved in quite a few open-source projects. Just look at Android, their mobile operating system. It's built on the Linux kernel and has been a game-changer in making open-source software popular in the smartphone world. Today, most smartphones run on Android. Another awesome open-source project from Google is Kubernetes, which has become the top choice for container orchestration. It helps developers automate things like deployment, scaling, and managing containerized applications. Last but not least, let's not forget Chromium. Google's Chrome is built on the open-source Chromium project, and it has become super popular since its launch.
By being part of open-source projects like these, Google shows they are really into transparency, openness, and working together to create innovative and flexible software solutions. They are dedicated to making the tech world more inclusive, diverse, and accessible for everyone. For this reason, I wouldn't be too shocked if Google decided to make their next big language model an open-source project. It could be a clever move because they'd have all their brand, marketing, and developer muscle behind it, giving OpenAI some serious competition. Of course, that's assuming the model's quality would be on par, which hasn't been the case so far. What's even more crucial is that someone else might snag that spot first. As we'll see next, there's a long list of newbies waiting in line.
A guided journey through the open-source LLM boom
One of the coolest aspects of the SemiAnalysis document is the timeline highlighting key recent milestones in the open-source community, particularly in the area of large language models (LLMs). It all starts with what might be considered the "big bang" of recent open-source LLM advancements: the release of LLaMA by Meta on February 24, 2023. LLaMA is a family of LLMs ranging from 7B to 65B parameters, claimed to require far less computing power than comparable models, making it ideal for testing new approaches. Strictly speaking, it was not released as an open-source model, but one week after the release, LLaMA's model weights leaked to the public, and everyone got a chance to play around with it. That's when things started snowballing.
Here is a quick summary of the milestones described in the document:
- Artem Andreenko's Raspberry Pi implementation of LLaMA (March 12, 2023)
- Stanford's Alpaca release (March 13, 2023), an instruction-tuned version of LLaMA
- Georgi Gerganov's 4-bit quantization of LLaMA (March 18, 2023), letting it run on a MacBook CPU without a GPU (see the quantization sketch after this list)
- Vicuna's 13B model release (March 19, 2023), trained for just around $300
- Cerebras's open-source models built on the GPT-3 architecture (March 28, 2023), which outshine existing GPT-3 clones
- LLaMA-Adapter (March 28, 2023), setting a new record on multimodal ScienceQA with just 1.2M learnable parameters via a Parameter-Efficient Fine-Tuning (PEFT) technique (a PEFT sketch also follows the list)
- UC Berkeley's Koala (April 3, 2023), a dialogue model trained entirely on freely available data, costing about $100 to train and scoring over 50% user preference when compared against ChatGPT
- Open Assistant's release (April 15, 2023), a complete open stack for running RLHF (reinforcement learning from human feedback) models, with a 48.3% human-preference rating
And the list goes on…
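As a side note on what that quantization milestone actually involves, here is a minimal sketch of blockwise 4-bit weight quantization in Python. It illustrates the general idea only; it is not Gerganov's llama.cpp code, and the block size, value range, and packing details are assumptions made for the example.

```python
import numpy as np

def quantize_4bit(weights: np.ndarray, block_size: int = 32):
    """Quantize a float32 vector to 4-bit integer codes, one scale per block."""
    blocks = weights.reshape(-1, block_size)
    # One scale per block: map the largest magnitude onto the int range [-7, 7].
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0  # guard against all-zero blocks
    codes = np.clip(np.round(blocks / scales), -7, 7).astype(np.int8)
    return codes, scales

def dequantize_4bit(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover approximate float32 weights from codes and per-block scales."""
    return (codes.astype(np.float32) * scales).reshape(-1)

w = np.random.randn(1024).astype(np.float32)  # stand-in for one weight tensor
codes, scales = quantize_4bit(w)
w_hat = dequantize_4bit(codes, scales)
print("mean abs error:", float(np.abs(w - w_hat).mean()))  # small reconstruction error
# Each weight now costs 4 bits (two codes per byte when packed) plus a shared
# float32 scale per 32 weights: roughly a 6x memory cut versus float32, which
# is what lets a 7B model squeeze into a laptop's RAM.
```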
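The PEFT idea behind entries like LLaMA-Adapter can likewise be shown in a few lines: freeze the pretrained weights and train only a tiny add-on module. The sketch below uses a toy base network and a LoRA-style bottleneck purely for illustration; LLaMA-Adapter's actual mechanism (learnable prompts with zero-initialized attention) differs, but the parameter accounting is the same kind of trick.

```python
import torch
import torch.nn as nn

# Stand-in for a frozen pretrained backbone (sizes are made up for the demo).
base = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
for p in base.parameters():
    p.requires_grad = False  # the pretrained weights stay untouched

class Adapter(nn.Module):
    """A small residual bottleneck trained on top of the frozen base."""
    def __init__(self, dim: int = 512, r: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, r, bias=False)
        self.up = nn.Linear(r, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # identity at init, so training starts stable

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.down(x))  # low-rank residual update

adapter = Adapter()
frozen = sum(p.numel() for p in base.parameters())
trainable = sum(p.numel() for p in adapter.parameters())
print(f"frozen: {frozen:,}  trainable: {trainable:,}")  # ~525k frozen vs. ~8k trainable

# Only the adapter's parameters ever reach the optimizer.
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-3)
out = adapter(base(torch.randn(4, 512)))  # forward pass: frozen base, then adapter
```

The zero-initialized up-projection mirrors a trick LLaMA-Adapter also relies on: the adapter starts as a no-op, so early training can't wreck the frozen model's behavior. Scale the same ratio up to a 7B frozen backbone and you get the headline figure of just 1.2M learnable parameters.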
Yes, you got it right: all of this happened in just over two months, proving how lively the open-source scene is. However, most of these open-source models are likely known only to insiders and haven't hit the mainstream (I have to admit that I learned about the majority of them through the leaked document too). But just a few days ago, on May 5, a possible game-changer arrived: MosaicML released MPT-7B, setting the bar high for open-source competitors (and maybe even OpenAI). What's more, it's licensed for commercial use, unlike LLaMA. The MPT series also supports very long inputs, with context lengths of up to 84k tokens at inference.
MosaicML put the MPT series through rigorous tests on various benchmarks, showing it can match LLaMA-7B's high-quality standards. The base MPT-7B model is a decoder-style transformer with 6.7B parameters, trained on 1T tokens of text and code. MosaicML also released three fine-tuned versions: MPT-7B-StoryWriter-65k+ for super long context lengths in fiction; MPT-7B-Instruct for short-form instruction following; and MPT-7B-Chat, a chatbot-like model for dialogue generation.
MPT-7B was trained on the MosaicML platform in just 9.5 days using 440 GPUs, with no human intervention. The perks of these improvements are clear in the numbers. For example, MosaicML claims that the MPT-7B model delivers competitive performance with only 7 billion parameters, 25 times fewer than OpenAI's GPT-3 with its 175 billion. This size reduction means big cost savings, since fewer resources are needed for both training and deployment. Plus, the smaller MPT-7B model is more portable, making it easier to incorporate into various applications and platforms.
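For anyone who wants to poke at it, MosaicML published the MPT-7B checkpoints on the Hugging Face Hub. A minimal loading sketch with the transformers library might look like the following (the prompt and generation settings are arbitrary, and you'll need a machine with enough memory):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "mosaicml/mpt-7b"  # "-instruct", "-chat", and "-storywriter" variants also exist

tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name,
    torch_dtype=torch.bfloat16,  # ~13 GB of weights instead of ~27 GB at float32
    trust_remote_code=True,      # MPT ships its own model code inside the repo
)

inputs = tokenizer("Open-source LLMs are", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```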
So, the million-dollar question is: can MPT-7B match the quality level we've come to expect from ChatGPT? We'll see. But one thing is for sure: the open-source world of LLMs is buzzing with excitement and innovation, and it won't be long before we find out.
So, do we have a clear winner?
No, actually, we don't. Many of the concerns raised in the leaked document might not be well-founded. However, this isn't exactly great news for Google either, as OpenAI still remains by far the undisputed leader in the LLM market. OpenAI made a clever and bold move in November 2022 by launching ChatGPT for public use, totally free and perhaps not yet fully secured. The move garnered massive traction, making ChatGPT the fastest product of its time to reach one million users (in just five days) and a whopping 100 million by the end of January 2023.
Impressive numbers aside, there's an important point to be made here: OpenAI is collecting a huge (and I mean HUGE) amount of user data. While the leaked document claims that faster, cheaper algorithms provide a competitive advantage, that's only part of the story. In the realm of AI, what truly matters to users is the quality of information that models offer. To make better inferences, more data and feedback are needed, and that's exactly what OpenAI is collecting.
Additionally, it's worth noting that OpenAI has Microsoft's backing, granting them access to a massive cloud of user data. When your AI can whip up a stunning PowerPoint presentation, a comprehensive Excel spreadsheet, or the perfect LinkedIn profile just by describing them in words, the algorithm used to accomplish that becomes irrelevant. But when you ask questions and receive incorrect responses, knowing the algorithm was trained on a mere $300 budget is hardly comforting.