The Choice for Businesses Between Open-Source and Proprietary Models To Deploy Generative AI
Last Updated on November 8, 2023 by Editorial Team
Author(s): Faizan Ahmad, PhD
Originally published on Towards AI.
The surge of interest in Generative AI had given rise to more than 350 companies in the field by mid-2023 [1], with value propositions that span from foundational models to specific use cases. With so many vendors to choose from, businesses looking to implement this nascent technology need to make a thoroughly informed decision, using criteria that go beyond brand positioning or relative pricing. This article looks at one dimension of this multi-factor approach: adopting an open-source versus a proprietary LLM (Large Language Model).
Figure 1 shows key players in the Generative AI market, divided between open-source and closed-source (i.e., proprietary) offerings. Among big tech, Google, Microsoft (OpenAI), and Amazon have proprietary products, while Meta (Facebook) and NVIDIA offer open-source models. Businesses that are already large consumers of services from tech giants, such as cloud storage or analytics products, may decide to stick with their current provider in order to benefit from the scalable, seamless integration of Generative AI into the existing ecosystem. In the rest of the competitive landscape, the closed-source space is dominated by the likes of Anthropic, Inflection, and Cohere, while Hugging Face, Mistral AI, and Stability AI lead the open-source side.
Criteria for choosing between open-source and proprietary Gen AI models
Every business needs to take a nuanced approach to calculating its own ROI (return on investment) when considering a vendor for deploying Generative AI. The differences to take into account exist not only between open-source and closed-source offerings but also within each of the two categories. Figure 2 provides a summary of the relevant factors.
Pricing
In essence, open-source models are free to access, but there might be fees for additional licenses or services that are not part of the core offering. The pricing policies of closed-source providers vary greatly, as the market is still learning about the value generated. The most prevalent pricing structure is based on the number of input and output tokens (fundamentally, the length of the text). Another approach is to charge by the number of times the model is called, irrespective of the length of the text. Google uses the former, while Microsoft has a more complex, hybrid methodology. Amazon has not yet disclosed its pricing structure in detail.
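To make the difference between the two pricing structures concrete, the following is a minimal sketch in Python. All rates and usage volumes are hypothetical placeholders chosen for illustration; they are not the actual prices of any vendor named in this article.

```python
from dataclasses import dataclass


@dataclass
class TokenPricing:
    """Token-based pricing: pay per 1,000 input and output tokens.

    Both rates are hypothetical, not taken from any real price list.
    """
    usd_per_1k_input: float
    usd_per_1k_output: float

    def monthly_cost(self, calls: int, input_tokens: int, output_tokens: int) -> float:
        # Cost of one call scales with the length of the prompt and the response.
        per_call = (input_tokens / 1000) * self.usd_per_1k_input \
            + (output_tokens / 1000) * self.usd_per_1k_output
        return calls * per_call


@dataclass
class PerCallPricing:
    """Per-call pricing: flat fee per request, regardless of text length."""
    usd_per_call: float

    def monthly_cost(self, calls: int, input_tokens: int, output_tokens: int) -> float:
        # Token counts are accepted for a uniform interface but do not affect the cost.
        return calls * self.usd_per_call


# Hypothetical rates and workloads, for illustration only.
token_plan = TokenPricing(usd_per_1k_input=0.001, usd_per_1k_output=0.002)
call_plan = PerCallPricing(usd_per_call=0.002)

workloads = {
    "short prompts": (100_000, 200, 100),     # calls, input tokens, output tokens
    "long documents": (100_000, 3000, 1000),
}

for name, (calls, tokens_in, tokens_out) in workloads.items():
    by_token = token_plan.monthly_cost(calls, tokens_in, tokens_out)
    by_call = call_plan.monthly_cost(calls, tokens_in, tokens_out)
    cheaper = "token-based" if by_token < by_call else "per-call"
    print(f"{name}: token-based ${by_token:,.2f}, per-call ${by_call:,.2f} -> {cheaper}")
```

Under these assumed numbers, token-based pricing favors workloads with short texts, while per-call pricing favors long ones. The practical takeaway is that a business should estimate its own typical prompt and response lengths before comparing vendors.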
Flexibility
The consideration of flexibility is two-fold. The first aspect is the level of customization possible, where open source wins because it is up to the user how to utilize the model. Closed-source offerings can differ here; for example, Amazon and Microsoft are deemed to have more diversity in their foundational models for enterprise use than Google at present. The second aspect is vendor lock-in. While open-access models should be easy to migrate from one source to another, as there are no contractual limitations, there is not yet any clarity on how customers could switch between closed-source vendors.
Transparency
Open-source models are naturally more transparent, as the scrutiny of their performance is crowdsourced. Information on potential vulnerabilities is also quickly picked up and widely shared, whereas such data is unlikely to be made available for proprietary models. For instance, among the tech giants, Amazon has so far provided the least information on how its models perform relative to others.
Talent
The savings from paying no access fee for open-source models may be offset by greater people costs. More talent, both in number and in degree of specialization, would be required to deploy open-source models. Firstly, such skills are not readily available, as the technology itself is still in its infancy and the demand is unprecedented. Secondly, these jobs sit at the high end of the salary range and are therefore expensive to hire for and retain. On the other hand, a smaller Data Science and Developer team with generalized knowledge of AI may suffice for customers of proprietary products.
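The trade-off between access fees and people costs can be sketched as a simple annual cost comparison. The team sizes, salaries, and fees below are hypothetical assumptions used only to illustrate the arithmetic; they are not figures from this article's sources.

```python
def annual_cost(access_fees_usd: int, team_size: int, avg_salary_usd: int) -> int:
    """Simplified annual cost: vendor access fees plus salaries.

    Deliberately ignores infrastructure, hiring, and training costs.
    """
    return access_fees_usd + team_size * avg_salary_usd


# Hypothetical scenario: open source needs a larger, more specialized team.
open_source = annual_cost(access_fees_usd=0, team_size=6, avg_salary_usd=200_000)

# Hypothetical scenario: proprietary needs a smaller, generalist team plus fees.
proprietary = annual_cost(access_fees_usd=300_000, team_size=3, avg_salary_usd=150_000)

# Access fee at which both options cost the same, given the assumed teams:
# it equals the gap between the two salary bills.
break_even_fee = 6 * 200_000 - 3 * 150_000

print(f"Open source:    ${open_source:,}")
print(f"Proprietary:    ${proprietary:,}")
print(f"Break-even fee: ${break_even_fee:,}")
```

With these assumed inputs, the proprietary option stays cheaper until its annual access fees exceed the break-even value. A real comparison would replace every number here with the business's own estimates.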
Support
Development and maintenance of the code and underlying infrastructure are more streamlined for closed-source models and are packaged as part of the offering to businesses. Dedicated customer service, offering help with troubleshooting and similar issues, is also likely to be a feature of closed-source providers, something that the open-source option would generally lack.
Speed-to-market
While the models themselves are quickly accessible for open source, the deployment speed might be lower than that for the closed-source case because of the latter's neatly packaged, user-friendly interfaces. This, compounded by the time-consuming process of hiring, may mean a slower overall go-to-market for open source.
Performance
On average, proprietary models are deemed to perform better than open-source ones, though this gap is shrinking over time. The difference arises primarily because open-source providers may not have the huge resources needed to gain such a competitive advantage through an iterative approach: training LLMs is expensive, requiring large storage and intensive computation. In fact, by Q3 2023, the ~$670M in funding for the top five open-source start-ups had been dwarfed by the ~$20B raised by their closed-source counterparts [2].
Two other points to consider are privacy and IP rights. Open source is less likely to suffer from data privacy and leakage issues, as it is deployed in-house. However, most closed-source providers offer to ring-fence enterprise data so that it is not used for further training of their models. Privacy therefore depends more on the contractual terms of the particular vendor than on which of the two categories it belongs to.
Given the novelty of the technology, regulation around the IP rights of the data used to train LLMs has not yet been laid out. Though open source would see a higher risk from regulatory factors as it is trained exclusively on public data, closed-source providers may also have to detail their inputs if required by law in the future. How customers of Gen AI providers are affected again depends on the provisions put in place by each player, irrespective of whether it is open or closed source.
As the decision between open source and proprietary models could significantly impact a business, it is imperative to weigh the pros and cons holistically and promptly.
Sources: [1] Dealroom, [2] CB Insights
Disclaimer: The opinions and views expressed in this personal blog are solely those of the author and do not represent the views of any organizations or companies. No private or proprietary information is included.
As this is original work, please let me know of any errors or omissions.