The Design Shift: Building Applications in the Era of Large Language Models
Last Updated on March 17, 2024 by Editorial Team
Author(s): Jun Li
Originally published on Towards AI.
A new trend has recently reshaped our approach to building software applications: the rise of large language models (LLMs) and their integration into software development. LLMs are much more than just another new tool in our toolkit; they represent a significant shift towards creating applications that can understand, interact with, and respond to human language in remarkably intuitive ways.
Luckily, as technology evolves, it becomes much easier for engineers to access new technologies as black boxes. We don't need to know the complex algorithms and logic inside in order to apply them to our applications.
In this article, I aim to share my experience as a software engineer using a question-driven approach to designing an LLM-powered application. The design is similar to that of a traditional application but takes into account the characteristics and components specific to LLM-powered applications. Let's look at the characteristics of LLM-powered applications first.
LLM-powered application characteristics
I'll compare the characteristics of traditional and LLM-powered applications from the input and output perspectives, since turning inputs into outputs is the core problem a software application essentially deals with.
Generally, a traditional application has structured inputs that require users to input data in a predefined format, such as filling out forms or selecting options. It usually has limited context understanding and struggles to interpret inputs outside its programmed scope, leading to rigid interaction flows.
An LLM-powered application can process inputs in natural language, including text and voice, allowing for more intuitive user interactions. It can understand inputs with a broader context, adapt to user preferences, and handle ambiguities in natural language. It can additionally gather inputs from various sources beyond direct user interaction, such as documents and web content, interpreting them to generate actionable insights.
In terms of output characteristics, a traditional application usually has structured outputs that deliver information or responses in a fixed format, which can limit the flexibility in presenting data and engaging users. Its personalization is limited, often rule-based, and requires predefined user segments or profiles. The outputs and user flows are designed and programmed in advance, offering limited adaptability to user behaviour or preferences.
An application powered by an LLM generates natural language outputs that can be customized to suit the context of the interaction, user preferences, and query complexity. It can dynamically personalize responses and content based on ongoing interactions, user history, and inferred user needs. Additionally, it can provide response formats like JSON, according to the requirements expressed in users' queries, which can be used for further actions such as function calling. Furthermore, it can create flexible user flows that adapt in real time to the user's input, questions, or actions, making the interaction experience more intuitive.
LLM-powered application core structure
From these characteristics, we can conclude that an LLM-powered application essentially collects inputs in natural language and lets the LLMs generate outputs; we use various techniques to improve the inputs so that the LLMs generate better outputs, and we transform the outputs as needed to meet our business requirements. So we can abstract a general LLM-powered application into a core structure comprising four components: "Inputs", "Input Processing", "LLM Integration/Orchestration", and "Output Processing and Formatting", as the diagram below shows.
Inputs
Inputs are expressed in natural language, usually as a question, instruction, query, or command that specifies what the user wants from the model. These inputs are called prompts and are designed for LLMs like GPT to generate responses.
Input processing
This step processes and transforms the inputs, formats the requests into more structured inputs, and crafts the prompts using prompt engineering practices to guide the model in generating more accurate and meaningful responses. In some applications, it also retrieves relevant content from external data sources and supplies it alongside the original queries so the models can generate more precise, domain-targeted responses, a technique called Retrieval-Augmented Generation (RAG).
LLM integration/orchestration
It involves the technical integration with the LLM, managing the interaction, and orchestrating the flow of information to and from the model. It sends processed inputs to the LLM and handles the modelβs outputs. It controls the routes to the LLMs and chains one or multiple models to generate optimized responses.
Output processing and formatting
This stage involves refining the LLMβs raw outputs into a presentable and useful format for the end-user, which may include formatting text, summarizing information, or converting outputs into actionable insights.
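To make this core structure concrete, here is a minimal sketch in Python that wires the four stages together around a single LLM call. The call_llm helper, prompt wording, and output shape are assumptions for illustration only, not part of any specific framework.

```python
# A minimal sketch of the four-stage core structure, assuming a
# hypothetical call_llm() helper that wraps whatever LLM API is used.

def process_input(raw_query: str) -> str:
    """Input Processing: normalize the query and craft a prompt."""
    query = raw_query.strip()
    return f"You are a helpful engineering assistant.\nQuestion: {query}\nAnswer:"

def call_llm(prompt: str) -> str:
    """LLM Integration/Orchestration: send the prompt to the model (placeholder)."""
    raise NotImplementedError("Wire this to your LLM provider of choice.")

def format_output(raw_output: str) -> dict:
    """Output Processing and Formatting: shape the raw text for the UI."""
    return {"answer": raw_output.strip()}

def handle_request(raw_query: str) -> dict:
    prompt = process_input(raw_query)   # Inputs -> Input Processing
    raw_output = call_llm(prompt)       # LLM Integration/Orchestration
    return format_output(raw_output)    # Output Processing and Formatting
```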
Designing the LLM-powered application
When designing the LLM-powered application, we can expand the core structure described above with corresponding components and sub-applications/systems to meet complex real-world requirements. The application can typically be a chatbot that supports interactive conversations, a copilot that offers assistance within other applications, or an agent acting autonomously or semi-autonomously to perform tasks, make decisions, or provide recommendations based on interpreting large volumes of data.
When I design an application, I prefer to use a set of questions that help me create a design that meets the business requirements. I plan to use the same approach to design the LLM-powered application while keeping the core structure in mind. To illustrate this, I will walk you through the design of an example application called the "Smart Engineering Knowledge Assistant". This application aims to help engineers and developers query extensive technical knowledge in natural language, including code examples, API usage, and documentation from outside or within an organization such as a corporation. Additionally, it will offer the ability to generate code and interact with APIs based on the insights gained.
Now, let's get started on the journey.
Q1: How can the end users interact with the application?
This question maps to the "Inputs" component in the core structure. Based on the application requirements, the application will provide a conversational UI, like a chatbot, that makes it easy for engineers to interact using natural language. We can offer predefined keywords as patterns the application recognizes to generate corresponding prompts for the LLMs or to select a specific model. The users will see responses generated by the LLMs in the UI and can continue the conversation based on those responses. The inputs from the users are the raw queries, which have not yet been processed.
Q2: What do we need to refine the user inputs to get more accurate and high-quality responses?
This question maps to the "Input Processing" component in the core structure. We need to rewrite users' queries to normalize them into a standardized format and extract the keywords, key terms, and requirements from the inputs. After that, the extracted keywords are enriched with additional context or semantics to improve the search and retrieval phase.
It's not sufficient to just process the inputs with rewriting. To make input processing more effective, we need to design a prompt engineering system, including a prompt pipeline, to normalize the queries with prompt templates. The system can dynamically craft the users' prompts, which involves specifying the expected output format, guiding the model's focus, or embedding additional instructions using different prompt strategies such as few-shot prompting.
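As a small illustration, here is a minimal sketch of how a prompt template with a few-shot strategy might be filled in. The template wording and example Q&A pairs are assumptions, not a prescribed format.

```python
# A minimal prompt-template sketch with a few-shot strategy.
# The template text and the example Q&A pairs are illustrative assumptions.

FEW_SHOT_EXAMPLES = [
    {"q": "How do I list files in Python?", "a": "Use os.listdir() or pathlib.Path.iterdir()."},
    {"q": "How do I parse JSON in Python?", "a": "Use json.loads() from the standard library."},
]

PROMPT_TEMPLATE = (
    "You are an engineering knowledge assistant. "
    "Answer concisely and include code where helpful.\n\n"
    "{examples}\n"
    "Q: {query}\nA:"
)

def build_prompt(normalized_query: str) -> str:
    """Fill the template with few-shot examples and the normalized query."""
    examples = "\n".join(f"Q: {ex['q']}\nA: {ex['a']}" for ex in FEW_SHOT_EXAMPLES)
    return PROMPT_TEMPLATE.format(examples=examples, query=normalized_query)
```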
Q3: How can the LLM responses draw on the organization's knowledge base?
This question still maps to the "Input Processing" component in the core structure, since the context retrieved from the organization's knowledge base is sent to the LLMs alongside the original queries for processing.
One popular technology is RAG (Retrieval-Augmented Generation), which will be used in the "Smart Engineering Knowledge Assistant" application. We can initially design a basic version of the RAG system to serve the application and then iteratively improve it.
The RAG system loads different data sources from the organization's knowledge base, such as documents like PDFs, code repos, and structured and unstructured data from databases or APIs. These documents are then chunked, vectorized, and stored in a vector database such as FAISS or Chroma. The processed inputs are embedded with an embedding model from OpenAI, Hugging Face, or another provider. Then a similarity search is carried out over the indexed chunks to retrieve the top-k relevant chunks, where k is a configurable number; these chunks serve as the enhanced context. Finally, the prompt engineering system combines the enhanced context with the original queries to form the structured prompts for the LLM to process.
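To make the retrieval step concrete, here is a minimal sketch of chunking and top-k cosine-similarity search. The embed_texts function is a stand-in for whichever embedding model the application uses; a production system would use a vector database such as FAISS or Chroma rather than the in-memory search shown here.

```python
import numpy as np

def embed_texts(texts: list[str]) -> np.ndarray:
    """Placeholder for an embedding model from OpenAI, Hugging Face, etc."""
    raise NotImplementedError("Call the chosen embedding provider here.")

def chunk_document(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Naive fixed-size chunking with overlap; real loaders are format-aware."""
    chunks, step = [], chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks

def top_k_chunks(query: str, chunks: list[str], k: int = 4) -> list[str]:
    """Cosine-similarity search over chunk embeddings; k is configurable."""
    vectors = embed_texts(chunks)                  # shape (n_chunks, dim)
    q = embed_texts([query])[0]                    # shape (dim,)
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    best = np.argsort(sims)[::-1][:k]
    return [chunks[i] for i in best]
```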
Q4: How can the models be organized and selected appropriately when queries and prompts are processed?
This question relates to the core structure's "LLM Integration/Orchestration" component. We can design a model registry to maintain a catalog of the models available in the application. We can then create a model and query router that acts as a brain, determining how queries are routed to the models registered in the model registry.
For simple cases, the router selects the model specified in the query; otherwise, it uses the default or the appropriate one according to the query classification and other factors such as accuracy, response time, cost, etc.
The router can chain the models for complex cases. For example, the router can put multiple LLM calls in a sequential chain, with the output of one model becoming the input of another, and the outputs are then combined to form the final output. The router can also organize the model chains with conditional logic, using intermediate results to decide which model chain processes the data next. Another use case is that the router can divide a query into multiple sub-queries that are processed in parallel by models, summarized individually, and then aggregated into the final output.
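Here is a minimal sketch of a model registry and router along these lines. The registry entries, the classification rule, and the call_model helper are assumptions for illustration, not a particular framework's API.

```python
# A minimal model-registry and router sketch. Model names, the
# classification rule, and call_model() are illustrative assumptions.

MODEL_REGISTRY = {
    "default": {"name": "general-purpose-llm"},
    "code":    {"name": "code-specialized-llm"},
}

def call_model(model_name: str, prompt: str) -> str:
    raise NotImplementedError("Wire this to the chosen provider's API.")

def classify_query(query: str) -> str:
    """Very rough classification; a real router might use a classifier or an LLM."""
    return "code" if any(w in query.lower() for w in ("code", "api", "function")) else "default"

def route(query: str, prompt: str, requested_model: str | None = None) -> str:
    """Use the model named in the query if registered, otherwise classify and route."""
    key = requested_model if requested_model in MODEL_REGISTRY else classify_query(query)
    return call_model(MODEL_REGISTRY[key]["name"], prompt)

def sequential_chain(prompt: str) -> str:
    """Chain two models: the first drafts an answer, the second refines it."""
    draft = call_model(MODEL_REGISTRY["default"]["name"], prompt)
    return call_model(MODEL_REGISTRY["code"]["name"], f"Refine this answer:\n{draft}")
```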
Q5: How can we handle the LLMs' responses, such as producing the format expected by the queries and calling functions?
This question relates to the core structure's "Output Processing and Formatting" component. We can design two components to process the raw outputs generated by the LLMs. One is the format parsers, which parse the output into a structured, user-friendly format or into different data formats such as CSV, JSON, YAML, etc. The other is the function tools, which contain tools that can invoke functions, call external APIs, or query databases according to the outputs from the LLMs. The outputs are converted into properly formatted arguments that the tools can parse and consume.
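As an illustration, here is a minimal sketch of a format parser and a function-tool dispatcher. The expected JSON shape and the example tool are assumptions, not a fixed contract.

```python
import json

# Example tool; its name and signature are illustrative assumptions.
def search_code_repo(keyword: str) -> list[str]:
    return [f"result for {keyword}"]

TOOLS = {"search_code_repo": search_code_repo}

def parse_llm_output(raw_output: str) -> dict:
    """Format parser: expect a JSON object, fall back to plain text."""
    try:
        return json.loads(raw_output)
    except json.JSONDecodeError:
        return {"text": raw_output}

def dispatch_tool_call(parsed: dict):
    """Function tools: invoke the named tool with the parsed arguments."""
    name, args = parsed.get("tool"), parsed.get("arguments", {})
    if name in TOOLS:
        return TOOLS[name](**args)
    return parsed  # no tool requested; return the formatted output as-is
```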
Now that we have used five questions to cover all the components of the core structure, should we consider anything else in designing the "Smart Engineering Knowledge Assistant" application? Let's move on to find out.
Q6: How can we refer to the information introduced earlier in the conversation?
Since LLMs are stateless by design, they don't remember the contexts or information they have interacted with. We need to design a memory component to emulate state so that the application can refer to information from the chat history. The memory component has a memory handler that manages read and write operations, using Redis for low-latency reads and writes. Before the processed queries are sent to the LLMs, they are written to the memory store as chat history candidates for the next interaction. The processed queries are then sent to the LLMs for processing together with the existing chat history fetched from the memory store. Lastly, the outputs generated by the LLMs are written into the memory store. The memory handler can configure the number of history records fetched from the memory store.
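Here is a minimal sketch of such a memory handler backed by Redis, using the redis-py client. The key scheme and the history window size are assumptions for illustration.

```python
import json
import redis  # redis-py client

class MemoryHandler:
    """Stores chat history per conversation in a Redis list (illustrative key scheme)."""

    def __init__(self, host: str = "localhost", port: int = 6379, history_size: int = 10):
        self.client = redis.Redis(host=host, port=port, decode_responses=True)
        self.history_size = history_size  # number of records fetched per read

    def write(self, conversation_id: str, role: str, content: str) -> None:
        """Append a message (query or model output) to the conversation's history."""
        record = json.dumps({"role": role, "content": content})
        self.client.rpush(f"chat:{conversation_id}", record)

    def read(self, conversation_id: str) -> list[dict]:
        """Fetch the most recent history records for the next interaction."""
        raw = self.client.lrange(f"chat:{conversation_id}", -self.history_size, -1)
        return [json.loads(item) for item in raw]
```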
Q7: How can we protect the data and make sure the application is secure?
Security is critical for the application, and a security layer is applied across the entire application where needed. In the design, we can consider the key places below:
- Users are authenticated before they can use the application.
- Inputs/queries are validated and sanitized before they are passed on for processing.
- In the RAG system, all data sources from the organization's knowledge base should be encrypted at rest. The data loaders must be authorized before loading data from the data sources. Embeddings are encrypted using property-preserving encryption before they are stored in the vector database, and the same encryption key is used to generate the encrypted query embedding so it can be matched against the stored data when the similarity search is carried out.
- All tokens and API keys for the models available in the application are stored in HashiCorp Vault.
- Authorizations are checked before calling models to process queries, as models may not be open to everyone.
- In the memory system, authorizations are checked to read and write data from Redis, and data is encrypted at rest.
- For data that is not generally public within the organization, such as confidential documents and code snippets from private repos, only authorized users can see the outputs from the models.
- Authorizations are checked before API/function calls are executed based on the models' outputs (a minimal sketch follows this list).
- Logging is implemented to record access, changes, and activities.
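As a small illustration of the authorization checks around function execution, here is a sketch of a guard that runs before any tool call derived from model outputs. The permission table and function names are assumptions for illustration only.

```python
# A minimal authorization guard before executing tool/function calls
# derived from model outputs. The permission model is an illustrative assumption.

class ToolAuthorizationError(Exception):
    pass

USER_PERMISSIONS = {
    "alice": {"search_code_repo"},  # illustrative permission table
    "bob": set(),
}

def execute_tool_authorized(user: str, tool_name: str, tools: dict, **kwargs):
    """Check the user's permissions before executing the requested tool."""
    allowed = USER_PERMISSIONS.get(user, set())
    if tool_name not in allowed:
        # Denied attempts should also be logged to support the audit requirement above.
        raise ToolAuthorizationError(f"{user} is not allowed to call {tool_name}")
    return tools[tool_name](**kwargs)
```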
Q8: How can the application integrate with external services and be consumed by other services?
Like other traditional software applications, even though the "Smart Engineering Knowledge Assistant" is a conversation-based application leveraging LLMs, we can still expose it as an API service so that partner external services can query the knowledge content to enhance their content discovery features. Additionally, we can integrate the application with other external API services to extend the application's capabilities and functionality, such as connecting to the GitHub APIs to get code snippets from public repos.
A deep-dive design of the API integrations is out of scope for this article; we only consider the most basic elements of this component here. The API service needs REST API endpoints that third-party services can call, plus an API access registry to catalog the third-party services that need access to the APIs. The external API manager manages and maintains the configuration of the external API services used for integration. It also needs API clients that can call the external APIs; API tokens or keys are stored in HashiCorp Vault and fetched at runtime, and a data parser is used to parse the API responses and data for integration with the application.
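To illustrate exposing the assistant as an API, here is a minimal FastAPI sketch of a query endpoint with a simple API-key check against an access registry. The endpoint path, request shape, header name, and registry contents are assumptions, not a prescribed interface.

```python
from fastapi import FastAPI, Header, HTTPException
from pydantic import BaseModel

app = FastAPI()

# Illustrative access registry; in the design this would be backed by a real store.
API_ACCESS_REGISTRY = {"partner-key-123": "partner-service-a"}

class QueryRequest(BaseModel):
    query: str

@app.post("/v1/query")
def query_assistant(request: QueryRequest, x_api_key: str = Header(default="")):
    # Reject callers that are not in the API access registry.
    if x_api_key not in API_ACCESS_REGISTRY:
        raise HTTPException(status_code=401, detail="Unknown API key")
    # The core request-handling pipeline would plug in here.
    return {"answer": f"(placeholder answer for: {request.query})"}
```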
We have now completed the design by answering the full set of questions. But that doesn't mean this is the final version of the design. In practice, the design evolves iteratively as business requirements change and our engineering and technology understanding improves.
Challenges and further considerations
We still need to face some challenges common to LLM-powered applications like the "Smart Engineering Knowledge Assistant".
Hallucination and bias
LLMs can sometimes generate incorrect or nonsensical information, and the training data may lead to skewed or inappropriate responses. In an engineering context, this could affect the reliability of information. Organizations that don't have sufficient capability to fine-tune the models rely more on RAG to mitigate hallucination and bias. We need to ensure the knowledge base is kept up to date with high-quality engineering information and that outdated or incorrect content is removed. We can also implement a human-in-the-loop (HITL) system where domain experts review critical outputs to catch and correct hallucinated or biased content, and create a feedback loop that guides how queries are structured or how the RAG system prioritizes specific sources to improve model performance.
API/Function call security
Even though it is handy to directly execute API/function calls from the model outputs, it still raises security concerns. An LLM may make inappropriate decisions or generate incorrect data that affects the results of the APIs or functions and potentially introduces security holes. Additionally, if unauthorized parties trigger these executions, it may lead to data breaches or system incidents. In practice, it is better to keep the executions away from the production environment, ideally in an isolated environment like a sandbox. If execution in the production environment is required, compliance and monitoring, with review and approval by authorized persons, are necessary.
Design improvements
Inspired by GitHub Copilot Chat, we can implement a dynamic UI that asks follow-up questions to clarify ambiguous queries or gather more context to improve information accuracy. We can also consider using a user service to identify users' preferences, domains, frequently used models, custom-defined queries, or prompt templates to provide more personalized services. Another option is to leverage instant messaging applications like Microsoft Teams, which are widely used in corporations.
We can set up a mechanism in the prompt engineering system that continuously collects user feedback to refine input processing, routing logic, and prompt strategy.
From the architecture perspective, we can adopt a microservices architecture when scalability, resilience, and decoupling are required. The different components of the application are decoupled and operated as independent services, such as the UI, the prompt engineering system, the RAG system, LLM processing, etc. Each service can scale independently based on demand.
Conclusion
This article presents an example application that uses a question-driven approach to explore the design of LLM-powered applications from a software engineer's perspective. It breaks down each component of the core LLM-powered application structure through a series of questions and then expands to considerations that any software application should have, such as security. However, there are various ways to design an LLM-powered application, depending on business requirements and circumstances; there is no single approach that suits all cases, and the design itself evolves iteratively. For software engineers, adaptation is a core idea that allows us to reuse prior experience and knowledge to design an application powered by a new technology that shares the essential structures and methodologies but takes a different form.
Resources
Here are some articles that helped me better understand the technologies and concepts to support the contents of my article. I hope they can help you too.
- "Prompt pipeline", by Cobus Greyling, https://cobusgreyling.medium.com/prompt-pipelines-de48e25de224
- "RAG 101: Demystifying Retrieval-Augmented Generation Pipelines", by Hayden Wolff, https://developer.nvidia.com/blog/rag-101-demystifying-retrieval-augmented-generation-pipelines
- "Advanced RAG Techniques: an Illustrated Overview", by Ivan Ilin, https://pub.towardsai.net/advanced-rag-techniques-an-illustrated-overview-04d193d8fec6
- "A Comprehensive Guide to Using Chains in Langchain", by Babina Banjara, https://www.analyticsvidhya.com/blog/2023/10/a-comprehensive-guide-to-using-chains-in-langchain
- LangChain official documentation about memory, https://python.langchain.com/docs/modules/memory/
- OpenAI official documentation about function calling, https://platform.openai.com/docs/guides/function-calling
- "Encrypting Vector Databases: A Must-Read for IT and IT Security Professionals", by Martin Connell, https://www.linkedin.com/pulse/encrypting-vector-databases-must-read-security-martin-connell
- "Security of AI embeddings explained", https://ironcorelabs.com/ai-encryption/
- "Mitigating LLM Hallucinations in GenAI: Unleashing the Team of Digital Ghostbusters", by Ranjeeta Borah, https://blog.gramener.com/llm-hallucinations/
- "Creating Trustworthy LLMs: Dealing with Hallucinations in Healthcare AI", by Muhammad Aurangzeb Ahmad, Ilker Yaramis, Taposh Dutta Roy, https://arxiv.org/pdf/2311.01463.pdf