
Embrace AI, Optimize Later

Author(s): Suyash Damle

Originally published on Towards AI.

As technologies gain traction, the ecosystem responds and targeted optimization tools emerge. Enabling widespread experimentation and lowering barriers to entry are more critical for accelerating progress than achieving peak performance from day one.

Recent advances in AI have made rapid development and prototyping of software solutions possible. But there are also valid concerns about the extreme energy usage and potential waste when novice coders make LLM calls for trivial use cases and build products around them.

As someone working on advanced research and engineering in the field, I regularly meet other researchers, engineers, budding tech founders, and builders. I've had philosophical and technical discussions with these brilliant and well-informed people, skilled in different aspects of the technology and with very different perspectives and incentives. I'd like to set down some informed opinions and ideas from both sides of the LLM adoption debate. I also propose a framework, formulated after hours of deep discussions with startup founders and experts, for developers and founders to build with AI responsibly.

(1) It's important to acknowledge that this technology is power-hungry. Using it as a silver bullet in every problem scenario, and for trivial use cases, is a massive waste of resources.

(2) Still, it cannot be denied that it's extremely easy to use and empowers many more people to build things quickly, things that were previously out of reach for small teams or required a lot of time and investment. No more custom ML modeling, no data collection, no parameter tuning for days. Those hard jobs are handled by companies with huge resources. For most use cases, adding a few samples to the context is enough to make a pretrained LLM work just fine ("few-shot" or "in-context learning"), as sketched below.
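Here is a minimal sketch of few-shot / in-context learning, assuming an OpenAI-style chat API; the client, the model name, and the sentiment task are illustrative choices, not part of any specific product.

```python
# A minimal sketch of few-shot / in-context learning, assuming an OpenAI-style
# chat API. The model name ("gpt-4o-mini") and the task are illustrative;
# swap in whichever provider and model you actually use.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

FEW_SHOT_PROMPT = """Classify the sentiment of each review as positive or negative.

Review: "The battery lasts all day and the screen is gorgeous."
Sentiment: positive

Review: "Stopped working after a week and support never replied."
Sentiment: negative

Review: "{review}"
Sentiment:"""

def classify(review: str) -> str:
    # No training, no labelled dataset: the two examples in the prompt are
    # enough for a pretrained model to pick up the task.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": FEW_SHOT_PROMPT.format(review=review)}],
        max_tokens=5,
    )
    return response.choices[0].message.content.strip()

print(classify("Great value for the price."))
```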

How do we get the best of both worlds? My opinion: Let’s embrace AI first, and optimize as we go.

Let me elaborate.

Early Computing: Scarcity vs. Abundance

Looking back at the history of software development, the definition of "efficiency" evolves as we move through different phases of adoption and hardware maturity. For example, even during the high-stakes Apollo program, engineers were under extreme resource constraints. The Apollo Guidance Computer was a marvel of its time, and yet it was severely limited: a few KB of RAM and a few tens of KB of storage. In that situation of scarcity, efficiency was the top priority. Scientists and engineers spent hours optimizing every line of code, often writing directly in low-level assembly language. Minimizing every byte and CPU cycle was required, and the cost was manual optimization: it took a great deal of skill and time to code relatively simple things (by today's standards). (Ref: [7]) The story was similar for other software packages and games of the era. Every single CPU operation, every byte of the binary program, had to be justified.

Apollo Guidance Computer. Source: Wikipedia

Over time, as demand from the software stack kept rising, the hardware side had to catch up. Memory and processor technology improved. Higher-level languages and frameworks abstract away hardware complexities and simplify coding tasks, which increases developer productivity. We now prioritize rapid development, iteration, and broader participation instead of hyper-optimization.

The resulting software is usually far from hyper-optimized in its disk/RAM/CPU footprint. Modern applications, games, and web scripts contain a lot more sub-optimal code than the programs of the past. But we also have fewer constraints now. With lower complexity, many more developers find it possible to learn and contribute. And today's programs do a lot more, and are more immersive.

The Python Revolution: Prioritizing Accessibility Over Raw Speed

The widespread adoption of Python offers another parallel to the current debate around LLM inefficiency. Python, an interpreted language, is up to 100x slower than C++ for certain tasks, and the latter also gives developers more granular control over memory usage and hardware resources. (Ref: [1], [2]) But Python is exceptionally easy to read and write. As a high-level language, it abstracts away a lot of detail and offers a friendlier coding experience. Even though the executed code isn't the most optimal, this has resulted in a remarkably low barrier to entry for new programmers.

This accessibility was important for the field of machine learning itself. The language's rich ecosystem, and libraries like NumPy, Pandas, PyTorch, and TensorFlow, made it the default language for ML research and applications.

And the performance deficit was a temporary condition anyway. As Python gained traction, a community emerged to develop sophisticated optimizations: Just-In-Time (JIT) compilers such as Numba; Cython, for static compilation of Python code to C; and custom C extensions, offering finer control and higher performance. Alternative interpreters such as PyPy emerged, with substantial speed improvements of up to 22x. Newer packages like Polars and Dask were developed to overcome Pandas' limitations, offering parallel and distributed computing behind the same convenient function signatures and abstractions. (Ref: [3], [4], [5], [6])
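As a small sketch of the kind of optimization this ecosystem produced, Numba's JIT compiler turns a plain numeric Python loop into machine code with a single decorator; the pairwise-distance function below is just a toy example.

```python
# A sketch of ecosystem-driven optimization: Numba compiles a numeric Python
# loop to machine code with one decorator, no rewrite in C required.
import numpy as np
from numba import njit

@njit
def pairwise_distances(points):
    n, d = points.shape
    out = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(d):
                diff = points[i, k] - points[j, k]
                acc += diff * diff
            out[i, j] = acc ** 0.5
    return out

points = np.random.rand(500, 3)
pairwise_distances(points)  # first call compiles; later calls run at near-C speed
```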

Pressing on with software adoption, and the demand that results from it, creates the right incentives for hardware and infrastructure to keep improving.

LLMs: Trading Inefficiency for Innovation

LLMs and AI agents are very useful for tasks that involve unpredictable inputs, natural language interactions, or other scenarios requiring dynamic decision-making. They can handle complex orchestration of external tools and APIs. The ability to interpret plain English commands makes them ideal for human-in-the-loop and unpredictable scenarios.

The Cost of Generality

LLMs can tackle problems that don't fit neatly into traditional, narrowly defined algorithmic solutions. But since they are general-purpose models, they are extremely inefficient when used as a silver bullet. Using a 65-billion-parameter model for a simple task, like sentiment analysis, parsing a web page, or generating small, straightforward code snippets, is extremely wasteful. These models require significant computational power for both training and inference.

They are also wasteful for highly specific, constrained problems that could be solved with far less computational overhead by smaller, specialized models or (better still) deterministic, algorithmic systems.

Still, the solution is not to reject LLMs. We should use them for what they're great at, quick prototyping, and then develop systems to track wasteful LLM calls and optimize them.

Strategies for Responsible LLM Adoption and Optimization

Embracing the full potential of LLMs requires acknowledging their current limitations and planning for them strategically. I propose a phased development methodology built on intelligent system design and continuous optimization:

Optimize your software in steps.

1. Prototype / Proof-of-Concept

It is useful to begin by prototyping any app/service/workflow idea with LLMs to quickly validate it and assess the quality of outputs. This initial phase focuses on clearly defining the use case, specifying input and output requirements, and establishing preliminary model performance criteria such as desired latency, accuracy, and cost constraints. In the startup world, this can be the v0: something to take to early customers and see whether you're building everything they need.

This rapid prototyping allows developers to "fail fast" and gain crucial insights into the LLM's capabilities and limitations for the domain. LLMs are bad at certain tasks, including arithmetic and recalling factual information. Use the simplest possible ways to get around this: write simple tools for database lookups, calculations, web search, and so on, as in the sketch below.
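A framework-agnostic sketch of such "simple tools" follows: deterministic functions for arithmetic and exact lookups that the agent calls instead of asking the LLM to compute or remember things. The tool names, the dispatch dict, and the "orders.db" database are all hypothetical.

```python
# Hypothetical tools an agent can call so the LLM never has to do arithmetic
# or recall exact records itself.
import ast
import operator
import sqlite3

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculate(expression: str) -> str:
    """Evaluate a basic arithmetic expression exactly, instead of trusting the LLM."""
    def ev(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp):
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return str(ev(ast.parse(expression, mode="eval").body))

def lookup_order(order_id: str) -> str:
    """Exact database lookup instead of asking the LLM to recall data."""
    with sqlite3.connect("orders.db") as conn:  # hypothetical database
        row = conn.execute("SELECT status FROM orders WHERE id = ?", (order_id,)).fetchone()
    return row[0] if row else "not found"

TOOLS = {"calculate": calculate, "lookup_order": lookup_order}

def run_tool(name: str, argument: str) -> str:
    # The LLM only decides *which* tool to call and with what argument;
    # the actual work happens here, cheaply and deterministically.
    return TOOLS[name](argument)

print(run_tool("calculate", "17 * 23 + 4"))
```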

2. Iterative Optimization on the Developer's Side

Inference latency and other costs become a significant issue after the product takes off, when there are many users and many incoming requests. Now is the time to thoughtfully identify pieces of the pipeline that do a very predictable job: something that can be hard-coded, or that a simpler, smaller neural net can handle. Replace pieces one at a time. For example, if you're using LLM calls to parse predictable inputs (like forms) or to do sentiment analysis, replace those pieces.

The quality numbers from the prototyping stage can be used here to iterate on your efficient code/model until it reaches the desired quality. The end result is that some stages of your pipeline become extremely cheap and efficient.
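A hedged sketch of such a replacement: swapping an LLM call for a small, specialized model once the task has proven narrow and predictable. This uses the Hugging Face `transformers` sentiment pipeline; the distilled checkpoint named below is simply an example of a model that is orders of magnitude smaller than a general-purpose LLM.

```python
# Replace an LLM-based sentiment step with a small distilled classifier.
from transformers import pipeline

sentiment = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

def analyze(text: str) -> str:
    # Runs locally and quickly, at a tiny fraction of an LLM call's cost.
    return sentiment(text)[0]["label"]  # "POSITIVE" or "NEGATIVE"

print(analyze("The onboarding flow was painless."))
```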

3. Optimization at (or just before) the LLM Endpoint

If agents in your system perform different operations on a request, keeping a log of all the operations and decisions they make is very helpful. This is what I'll call "profiling".

Say you have a database for which an LLM "writes" SQL: you can profile your logs to find the most common queries it has to generate, and provide those as hard-coded tools. One less LLM call. Your agent can call such a tool directly when it matches the current request; if not, it can still ask the LLM to generate the query. Caching responses is another way to do this automatically, as in the sketch below.
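Here is a minimal sketch of this kind of profiling and caching. `llm_generate` is a placeholder for whatever client call your stack makes; the log and cache are plain in-memory structures here, but could just as well live in a database.

```python
# Profile and cache LLM calls so repeated, predictable requests stop hitting
# the expensive endpoint.
import hashlib
import time
from collections import Counter

call_log = Counter()   # profile: how often each kind of request occurs
cache = {}             # prompt hash -> cached response

def cached_llm_call(task: str, prompt: str, llm_generate) -> str:
    call_log[task] += 1
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in cache:                    # identical prompt seen before: skip the LLM
        return cache[key]
    start = time.time()
    response = llm_generate(prompt)     # the expensive call
    cache[key] = response
    print(f"[{task}] LLM call took {time.time() - start:.2f}s")
    return response

# Reviewing call_log.most_common() later shows which tasks are frequent and
# predictable enough to replace with a hard-coded tool or a smaller model.
```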

Or you find that your agent often fetches web results, and mostly tries to extract information from Wikipedia for different tasks. It fetches the Wikipedia HTML, puts the whole thing in the context window, and makes an LLM call to extract some information from it. In this case, you can simply make a Wikipedia API "tool" for the agents to use. The Wikipedia API can return a short summary, so the context the LLM has to deal with stays small. Less computation, more predictability, and savings on the token limit.
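A hedged sketch of that Wikipedia tool: fetch only the short summary via Wikipedia's public REST endpoint instead of the full page. The URL shape is an assumption here; check the current Wikimedia API documentation before relying on it.

```python
# Fetch a short Wikipedia summary instead of stuffing a whole HTML page into
# the LLM's context window. (Endpoint shape assumed; verify against the docs.)
import requests

def wikipedia_summary(title: str) -> str:
    url = "https://en.wikipedia.org/api/rest_v1/page/summary/" + title.replace(" ", "_")
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.json().get("extract", "")  # a few sentences, not a whole article

# The agent now passes a short paragraph to the LLM instead of a full page.
print(wikipedia_summary("Apollo Guidance Computer")[:200])
```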

Similarly, as LLM endpoint providers see what users demand and what they use LLMs for, they'll likely start providing efficiency techniques to developers: more agentic tools and smaller models that developers can integrate into their workflows directly, or features built into their APIs to save developer effort.

I believe we'll also soon see a rapid rise in auto-profiling tools, where the agentic workflow keeps track of its LLM calls and adapts smartly: proposing new tools to the developer, or caching certain results.

4. Model Optimizations

For self-hosted LLMs, here are some very useful techniques for saving resources and reducing latency.

Quantization: reducing the number of bits used for the model's weights/activations. This can drastically reduce memory usage and speed up inference, without much impact on accuracy.

Pruning: simplifying a large model by selectively removing less important parameters.

Knowledge distillation: training a smaller, more efficient model ("student") that matches the performance of a larger, more complex model ("teacher"). Deploy the smaller model.

Apart from this, inference frameworks play a crucial role in production systems. They optimize the model's computations at the compilation/hardware level. Examples include ONNX Runtime and NVIDIA's TensorRT.
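As a hedged illustration of one of these techniques, the sketch below applies PyTorch's built-in dynamic quantization to a toy network. The tiny Sequential model stands in for a much larger one; real LLM deployments typically use dedicated toolchains (bitsandbytes, GPTQ, TensorRT), but the idea is the same: fewer bits per weight means less memory and faster inference.

```python
# Dynamic quantization: convert Linear layers to int8 weights for CPU inference.
import torch
import torch.nn as nn

model = nn.Sequential(        # stand-in for a much larger network
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Linear(4096, 4096),
)

quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 4096)
with torch.no_grad():
    _ = quantized(x)          # same interface, roughly a quarter of the weight memory
```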

Hardware has been improving rapidly too, and will continue to do so. It's possible that at some point it gets good and fast enough that we stop caring about a few redundant LLM calls.

Disclaimer: The opinions and projections expressed here are all mine. This is written in my personal capacity, to put forward a perspective that I felt is important to consider.

References

  1. https://www.stxnext.com/blog/python-vs-c-plus-plus-comparison
  2. https://capaciteam.com/c-plus-plus-vs-python/
  3. https://vivasoftltd.com/why-python-used-for-machine-learning/
  4. https://www.geeksforgeeks.org/numba-vs-cython-a-technical-comparison/
  5. https://discuss.python.org/t/news-faster-cpython-jit-and-3-14/85326
  6. https://docs.pola.rs/user-guide/misc/comparison/
  7. https://www.sciencefocus.com/space/what-tech-would-the-apollo-11-mission-have-today


Published via Towards AI
