What’s In/Out for LLMs 2024: A new year’s guide for app developers
Last Updated on January 11, 2024 by Editorial Team
Author(s): Ollie
Originally published on Towards AI.
If the pace of AI innovation is anything like last year, we’re all in for a wild ride in 2024. Here is our guide for AI app developers to toss out the “old” way of doing things (wayyyyy back, like, four months ago) and go boldly into the future.
Out: Defaulting to ChatGPT
Last year, ChatGPT took the world by storm and quickly became the first choice for developers building text-gen applications. But open source models like Llama2 and Mixtral have emerged as real alternatives, and more are in store for 2024. Locking into OpenAI (or competitors like Anthropic and Cohere) leaves you on the sidelines of the most exciting advances in open source AI.
In: Open Source Experimentation
Until December 2023, open-source mixture-of-experts (MoE) models didn’t exist. In the two weeks since Mistral launched the first, Mixtral 8x7B, new MoE models have popped up everywhere, occupying the top spots on the HuggingFace open LLM leaderboard as of Jan 3, 2024 (probably more by the time you read this).
What this really highlights is the rate of change, and how vital it is to stay open and flexible to new, emerging models. The MoE approach, which entered mainstream open LLM efforts only a matter of weeks ago, is already the dominant approach driving progress. Just a few months back, it was Llama2, and six months ago, it was Falcon. In a matter of months, we have seen smaller models, new architectures, and new models leapfrog incumbent approaches and fundamentally raise the bar.
AI developers who want to stay ahead of the curve must position themselves to evaluate new models as they emerge, so they can stay at the forefront and tap into the momentum and progress in this space.
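To make that concrete, here is a minimal sketch of side-by-side model evaluation using Hugging Face transformers; the model IDs, prompt, and the compare helper are illustrative placeholders, not a prescription:

```python
# Minimal sketch: running the same prompt through interchangeable
# open-source models. Model IDs and prompt are illustrative; swap in
# whatever candidates top the leaderboard this week.
from transformers import pipeline

CANDIDATE_MODELS = [
    "mistralai/Mixtral-8x7B-Instruct-v0.1",  # the MoE newcomer
    "meta-llama/Llama-2-7b-chat-hf",         # last season's favorite
]

def compare(prompt: str) -> dict[str, str]:
    """Collect each candidate model's answer to the same prompt."""
    results = {}
    for model_id in CANDIDATE_MODELS:
        generator = pipeline("text-generation", model=model_id, device_map="auto")
        output = generator(prompt, max_new_tokens=200, do_sample=False)
        results[model_id] = output[0]["generated_text"]
    return results

if __name__ == "__main__":
    for model_id, answer in compare("Explain mixture-of-experts in two sentences.").items():
        print(f"--- {model_id} ---\n{answer}\n")
```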
Out: Mega-Model Wrappers
Building a demo against one proprietary model can be an easy starting point, but you can quickly run into friction as you add production features. Customers often realize that the “walled garden” nature of adjacent tools (e.g., moderation models), prompt engineering work (e.g., pre-packaged completion templates and automation), and programmatic interfaces (e.g., SDKs and APIs) can limit the extensibility and scalability of these early projects.
Plus, you may not need the biggest, baddest (most expensive) model to get certain jobs done. In many cases, a smaller, fine-tuned, open-source LLM will work great and cost much less.
In: Adaptable Model Pipelines
The alternative is to create pipelines that mix models with different capabilities and strengths. These integrated architectures can deliver value that is far greater than the sum of the parts.
The key is to build in flexibility early in the adoption process. While this may not be a day 0 priority for many projects, the earlier you consciously consider and prioritize flexibility, the easier and more agile the journey will be as you and your team build on generative AI.
Technologies like Langchain, LlamaIndex, Pinecone, and Unstructured are making it easier to construct these flexible pipelines. Building on best-of-breed components in the ecosystem can accelerate development and reduce the internal time spent maintaining and upgrading the plumbing your application and experiences need.
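As a rough illustration of what that flexibility can look like, here is a framework-agnostic sketch where application code depends only on a tiny generate() interface, so a hosted API and a local open-source model become interchangeable. The class and function names here are ours, not from any particular library:

```python
# Sketch of an adaptable pipeline: application code targets a small
# interface, so the model backend can be swapped without rewrites.
from typing import Protocol

class TextGenerator(Protocol):
    def generate(self, prompt: str) -> str: ...

class OpenAIBackend:
    """Proprietary API backend (needs the `openai` package and an API key)."""
    def __init__(self, model: str = "gpt-4"):
        from openai import OpenAI
        self.client, self.model = OpenAI(), model

    def generate(self, prompt: str) -> str:
        response = self.client.chat.completions.create(
            model=self.model, messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content

class LocalHFBackend:
    """Open-source backend running locally via Hugging Face transformers."""
    def __init__(self, model_id: str = "mistralai/Mistral-7B-Instruct-v0.1"):
        from transformers import pipeline
        self.pipe = pipeline("text-generation", model=model_id, device_map="auto")

    def generate(self, prompt: str) -> str:
        return self.pipe(prompt, max_new_tokens=256)[0]["generated_text"]

def summarize_ticket(llm: TextGenerator, ticket: str) -> str:
    """Application code never knows (or cares) which backend is plugged in."""
    return llm.generate(f"Summarize this support ticket in two sentences:\n{ticket}")
```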
Equally important are the AI/ML systems, platforms, and orchestration frameworks that serve these models. These choices determine several important attributes: the ability to support traffic at scale, the ability to port models to new hardware or clouds, and the SLAs you can deliver to your customers. Understanding and building on these components also lets you choose which parts you want to control, and reduce the time and effort spent on components you need but that aren’t your differentiation.
Out: Vanilla LLMs
Even the largest, most complex LLMs are trained for generalized tasks — they don’t have expertise in a given domain, and they most certainly don’t have access to your company’s data, which makes them of limited value for most business use cases. Likewise, enforcing a specific, consistent writing style in a vanilla (read: not fine-tuned) LLM application is difficult and requires heavy-handed prompt engineering that jams up the context window and slows performance.
In: LLMs with Context
Retrieval Augmented Generation (RAG) is the easiest way to “augment” an LLM’s knowledge with external data (think: product docs), making it much more useful for most applications outside of general-purpose chat or summarization. By supplying relevant context and factual information to the LLM, RAG makes for more accurate responses (even allowing it to cite sources), improves auditability and transparency, and enables end users to access the same source documents the LLM used during answer creation.
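Here is a bare-bones sketch of the retrieval half of RAG, using sentence-transformers for embeddings and a plain cosine-similarity lookup standing in for a real vector database like Pinecone; the document snippets, model name, and helper names are placeholders:

```python
# Bare-bones RAG sketch: embed documents, retrieve the most relevant one,
# and prepend it to the prompt so the LLM answers from your data.
import numpy as np
from sentence_transformers import SentenceTransformer

DOCS = [
    "Our Pro plan includes 24/7 support and a 99.9% uptime SLA.",
    "Refunds are available within 30 days of purchase.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = embedder.encode(DOCS, normalize_embeddings=True)

def retrieve(question: str, k: int = 1) -> list[str]:
    """Return the k document chunks most similar to the question."""
    q = embedder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vectors @ q  # cosine similarity (vectors are normalized)
    return [DOCS[i] for i in np.argsort(scores)[::-1][:k]]

def build_prompt(question: str) -> str:
    """Stuff the retrieved context into the prompt sent to the LLM."""
    context = "\n".join(retrieve(question))
    return (
        "Answer using only the context below, and cite it.\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

print(build_prompt("What is the refund policy?"))
```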
In: LLMs with Style
Where RAG excels at grounding LLMs in facts, fine-tuning excels at applying a specific style (e.g., making an LLM write like a lawyer). Fine-tuning your own LLM might not be high on your January bucket list, but leveraging existing fine-tunes from the community is a practical entry point. Model hubs are loaded with OSS fine-tunes you can build on to mimic real-world experiences (like chatting with a real, live customer service rep).
Recent studies have also shown that fine-tuning smaller models (like Mistral 7B) can even deliver superior quality to larger models when applied to the right use case with the right data sets. This can result in order-of-magnitude improvements in performance and cost, both crucial as applications scale to address broader needs and higher volumes. Even applications that are fundamentally augmenting data using RAG benefit from fine-tuning, with approaches like question decomposition improving the overall quality and effectiveness of using context data in a generation.
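For a feel of what fine-tuning a smaller model involves, here is a heavily abbreviated LoRA sketch built on Hugging Face transformers, peft, and trl; the dataset file, hyperparameters, and output paths are placeholders, not a tuned recipe:

```python
# Abbreviated LoRA fine-tuning sketch for a small open-source model.
# Dataset and hyperparameters are placeholders, not a tuned recipe.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# LoRA trains small low-rank adapter matrices instead of all 7B weights.
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# Expects a JSONL file with a "text" column of formatted style examples.
dataset = load_dataset("json", data_files="style_examples.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    args=TrainingArguments(output_dir="mistral-7b-style",
                           num_train_epochs=1,
                           per_device_train_batch_size=1),
)
trainer.train()
model.save_pretrained("mistral-7b-style")  # saves just the LoRA adapter
```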
In: Multimodal Model Cocktails
Individual models are becoming more powerful by the day (not to mention more capable thanks to RAG and fine-tuning), but you can make multimodal magic when you combine them to mix the perfect “model cocktail.”
One of our favorite use cases is building an image-to-image pipeline that sneakily leverages LLMs to enhance image output: the LLM generates detailed prompts that automatically produce unique SDXL images. LLMs like Llama2 and Mixtral are great for image prompting at scale, because they think fast on their feet (relative to us humans).
Typically, these are prompts a user would have to think up on their own and then manually enter into SDXL. By relying on an LLM instead, you can quickly and easily explore a range of generation ideas from a single keyword. The whole process of creating a gallery is drastically accelerated, with minimal user intervention.
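Here is a rough sketch of that flow, with an open LLM expanding a single keyword into detailed prompts and the diffusers SDXL pipeline rendering them; the model choices, prompt template, and naive parsing are all illustrative:

```python
# Sketch: one keyword -> several detailed prompts (via an LLM) -> SDXL images.
import torch
from diffusers import StableDiffusionXLPipeline
from transformers import pipeline

llm = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.1",
               device_map="auto")
sdxl = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

keyword = "lighthouse"
instruction = (f"Write 3 vivid, detailed Stable Diffusion prompts about "
               f"'{keyword}', one per line, each under 30 words.")
raw = llm(instruction, max_new_tokens=200)[0]["generated_text"]

# Naive parsing: drop the echoed instruction, keep non-empty lines.
prompts = [line.strip() for line in raw.replace(instruction, "").splitlines()
           if line.strip()]

for i, prompt in enumerate(prompts[:3]):
    image = sdxl(prompt=prompt).images[0]
    image.save(f"{keyword}_{i}.png")
```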
Since every January tech blog is legally obligated to include predictions for the coming year, here are some things to be excited about in 2024:
- The growing utility of smaller, fine-tuned models
- More open-source mixture-of-experts models
- Function-calling enhancing real-world LLM applications
- The potential for local LLMs with projects like MLC-LLM
- Indemnified models hitting the market
- Private LLMs — bring your GenAI secret sauce in-house for privacy, security and speed
Published via Towards AI