

The Design Shift: Building Applications in the Era of Large Language Models

Last Updated on March 17, 2024 by Editorial Team

Author(s): Jun Li

Originally published on Towards AI.

Photo by Martin Martz on Unsplash

A new trend has recently reshaped our approach to building software applications: the rise of large language models (LLMs) and their integration into software development. LLMs are much more than just another new tool in our toolkit; they represent a significant shift towards creating applications that can understand, interact with, and respond to human language in remarkably intuitive ways.

Luckily, as technology evolves, it becomes much easier for engineers to use new technologies as black boxes. We don’t need to understand every detail of the complex algorithms and logic inside them to apply them in our applications.

In this article, I aim to share my experience as a software engineer using a question-driven approach to designing an LLM-powered application. The design is similar to that of a traditional application but takes into account the characteristics and components specific to LLM-powered applications. Let’s look at those characteristics first.

LLM-powered application characteristics

I’ll compare the characteristics of traditional and LLM-powered applications from the input and output perspectives, since turning inputs into outputs is the core problem a software application deals with.

Generally, a traditional application has structured inputs that require users to enter data in a predefined format, such as filling out forms or selecting options. It usually lacks contextual understanding and struggles to interpret inputs outside its programmed scope, leading to rigid interaction flows.

An LLM-powered application can process inputs in natural language, including text and voice, allowing for more intuitive user interactions. It can understand inputs with a broader context, adapt to user preferences, and handle ambiguities in natural language. It can additionally gather inputs from various sources beyond direct user interaction, such as documents and web content, interpreting them to generate actionable insights.

On the output side, a traditional application usually has structured outputs that deliver information or responses in a fixed format, which can limit flexibility in presenting data and engaging users. Its personalization is limited, often rule-based, and requires predefined user segments or profiles. The outputs and user flows are designed and programmed in advance, offering limited adaptability to user behaviour or preferences.

An LLM-powered application generates natural language outputs that can be customized to suit the context of the interaction, user preferences, and query complexity. It can dynamically personalize responses and content based on ongoing interactions, user history, and inferred user needs. It can also return structured formats such as JSON when a query asks for them, which can be used for further actions such as function calling. Furthermore, it can create flexible user flows that adapt in real time to the user’s input, questions, or actions, making the interaction experience more intuitive.

LLM-powered application core structure

From these characteristics, we can conclude that an LLM-powered application essentially collects inputs in natural language and lets the LLMs generate outputs. We improve the inputs in various ways so that the LLMs generate better outputs, and we transform the outputs as needed to meet our business requirements. So we can abstract a general LLM-powered application into a core structure with four components: ‘Inputs’, ‘Input Processing’, ‘LLM Integration/Orchestration’, and ‘Output Processing and Formatting’, as the diagram below shows.

LLM-powered application core structure — Diagram by the author
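To make the four components concrete, here is a minimal sketch of the core structure as a single pipeline. It is only an illustration: `LLMApp`, its method names, and `fake_llm` (a stub that stands in for a real model client) are my own assumptions, not part of any library.

```python
from dataclasses import dataclass
from typing import Callable


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM client: it only echoes the prompt."""
    return f"ANSWER for: {prompt}"


@dataclass
class LLMApp:
    llm: Callable[[str], str]

    def process_input(self, raw_query: str) -> str:
        # Input processing: normalize whitespace and wrap the query in a prompt.
        query = " ".join(raw_query.split())
        return f"You are an engineering assistant. Question: {query}"

    def call_model(self, prompt: str) -> str:
        # LLM integration/orchestration: a single model call in this sketch.
        return self.llm(prompt)

    def format_output(self, raw_output: str) -> dict:
        # Output processing and formatting: return a structured response.
        return {"answer": raw_output.strip()}

    def run(self, raw_query: str) -> dict:
        # Inputs -> input processing -> model call -> output formatting.
        prompt = self.process_input(raw_query)
        return self.format_output(self.call_model(prompt))


app = LLMApp(llm=fake_llm)
result = app.run("  How do I   paginate the API? ")
```

Each method maps to one component, so a real application could replace any single stage (for example, swapping `fake_llm` for an actual client) without touching the others.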


Inputs

Inputs are given in natural language, usually as a question, instruction, query, or command that specifies what the user wants from the model. These inputs are called prompts and are designed for LLMs like GPT to generate responses.

Input processing

This step processes and transforms the inputs, formats the requests into more structured inputs, and crafts the prompts using prompt engineering practices to guide the model in generating more accurate and meaningful responses. In some applications, it also retrieves relevant content from external data sources and sends it alongside the original query so that the model can generate more precise, domain-specific responses. This technique is called Retrieval-Augmented Generation (RAG).

LLM integration/orchestration

This component handles the technical integration with the LLM: it manages the interaction and orchestrates the flow of information to and from the model. It sends processed inputs to the LLM and handles the model’s outputs. It also controls routing to the LLMs and can chain one or more models to generate optimized responses.

Output processing and formatting

This stage involves refining the LLM’s raw outputs into a presentable and useful format for the end-user, which may include formatting text, summarizing information, or converting outputs into actionable insights.

Design LLM-powered application

When designing an LLM-powered application, we can expand the core structure described above with the corresponding components and sub-systems to meet complex real-world requirements. The application is usually a chatbot that supports interactive conversations, a copilot that offers assistance within other applications, or an agent that acts autonomously or semi-autonomously to perform tasks, make decisions, or provide recommendations based on interpreting large volumes of data.

When I design an application, I prefer to use a set of questions that help me create a design that meets the business requirements. I will use the same approach to design the LLM-powered application while keeping the core structure in mind. To illustrate this, I will walk you through the design of an example application called the “Smart Engineering Knowledge Assistant”. This application aims to answer engineers’ and developers’ natural language queries more accurately by drawing on extensive technical knowledge, including code examples, API usage, and documentation from inside or outside an organization such as a corporation. Additionally, it will offer the ability to generate code and interact with APIs based on the insights gained.

Now, let’s get started on the journey.

Q1: How can the end users interact with the application?

This question maps to the “Inputs” component in the core structure. Based on the application requirements, the application will provide a conversational UI like a chatbot that is easy for engineers to interact with using natural language. We can offer some predefined keywords as different patterns the application can recognize to generate corresponding prompts for LLMs or select a specific model. The users will see responses generated from LLMs in the UI and can continue the conversation based on responses. The inputs from the users are called the raw queries that have not yet been processed.

UI for engineering queries/prompts — Diagram by the author

Q2: What do we need to refine the user inputs to get more accurate and high-quality responses?

This question maps to the “Input Processing” component in the core structure. We need to rewrite users’ queries to normalize them into a standardized format and extract the keywords, key terms, and requirements from the inputs. After that, the extracted keywords are enriched with additional context or semantics to improve the search and retrieval phase.

Rewriting alone is not sufficient. To make input processing more effective, we need to design a prompt engineering system, including a prompt pipeline, that normalizes the queries with prompt templates. The system can dynamically craft the users’ prompts, which involves specifying the expected output format, guiding the model’s focus, or embedding additional instructions using prompt strategies such as few-shot prompting.

Process inputs and get the processed queries and prompts — Diagram by the author
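The input processing described above can be sketched with the standard library alone. This is a simplified illustration, not a full prompt engineering system: the template text, the few-shot examples, the stop-word list, and the function names are all hypothetical.

```python
import re
from string import Template

# Hypothetical prompt template; a real system would manage many of these.
PROMPT_TEMPLATE = Template(
    "You are an engineering knowledge assistant.\n"
    "Answer in $output_format.\n\n"
    "$examples\n"
    "Question: $query\n"
    "Answer:"
)

# Hypothetical few-shot examples that show the model the expected style.
FEW_SHOT_EXAMPLES = [
    ("How do I list pods?", "Use kubectl get pods."),
    ("How do I undo the last commit?", "Use git reset --soft HEAD~1."),
]

STOP_WORDS = {"how", "do", "i", "the", "a", "an", "to", "in", "of", "is", "what"}


def normalize(raw_query: str) -> str:
    """Collapse whitespace and make sure the query ends with a question mark."""
    query = " ".join(raw_query.split())
    return query if query.endswith("?") else query + "?"


def extract_keywords(query: str) -> list[str]:
    """Naive keyword extraction: lowercase words minus stop words."""
    words = re.findall(r"[a-z0-9_]+", query.lower())
    return [w for w in words if w not in STOP_WORDS]


def build_prompt(raw_query: str, output_format: str = "plain text") -> str:
    """Fill the template with few-shot examples and the normalized query."""
    examples = "\n".join(f"Question: {q}\nAnswer: {a}\n" for q, a in FEW_SHOT_EXAMPLES)
    return PROMPT_TEMPLATE.substitute(
        output_format=output_format,
        examples=examples,
        query=normalize(raw_query),
    )
```

For example, `build_prompt("rotate keys", "JSON")` yields a prompt that states the expected output format, lists the two worked examples, and ends with the normalized question. The keywords from `extract_keywords` could then feed the search and retrieval phase.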

Q3: How can the LLM response include the organization’s knowledge base?

This question still maps to the “Input Processing” component in the core structure, since the context retrieved from the organization’s knowledge base is sent to the LLMs alongside the original query.

One popular technology is RAG (Retrieval-Augmented Generation), which will be used in the “Smart Engineering Knowledge Assistant” application. We can initially design a basic version of the RAG system to serve the application and then iteratively improve it.

The RAG system loads data from the organization’s knowledge base: documents such as PDFs, code repos, and structured and unstructured data from databases or APIs. These documents are chunked, vectorized, and stored in a vector store such as FAISS or Chroma. At query time, the processed input is embedded with an embedding model from OpenAI, HuggingFace, or another provider (the same model used for indexing), and a similarity search returns the top k most relevant indexed chunks, where k is configurable, as the enhanced context. Finally, the prompt engineering system combines the enhanced context with the original query to form the structured prompt for the LLM to process.

Input processing with RAG — Diagram by the author
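The retrieval flow above (chunk, embed, store, top-k search, prompt assembly) can be sketched as a toy example. Please read it as an illustration under loud assumptions: `embed` uses plain word counts instead of a real embedding model, and `ToyVectorStore` is an in-memory stand-in for a vector store such as FAISS or Chroma. None of these names come from a real library.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 12) -> list[str]:
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """In-memory stand-in for a vector store (assumption, not a real client)."""

    def __init__(self) -> None:
        self.items: list[tuple[str, Counter]] = []

    def add(self, document: str) -> None:
        # Chunk the document and index each chunk with its embedding.
        for piece in chunk(document):
            self.items.append((piece, embed(piece)))

    def top_k(self, query: str, k: int = 2) -> list[str]:
        # Rank all indexed chunks by similarity to the query embedding.
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


def build_rag_prompt(query: str, store: ToyVectorStore, k: int = 2) -> str:
    """Combine the retrieved chunks with the original query."""
    context = "\n".join(f"- {c}" for c in store.top_k(query, k))
    return f"Use only this context:\n{context}\n\nQuestion: {query}"
```

A basic version like this is enough to wire the pipeline end to end; the embedding function and the store can then be replaced with real components as the RAG system is improved iteratively.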

Q4: How can the models be organized and selected appropriately when queries and prompts are processed?

This question relates to the core structure's “LLM Integration/Orchestration” component. We can design a model registry to maintain a catalog of available models in the application. We can then create a model and query router that acts as a brain to determine the query and model routings from the available models registered in the model registry.

For simple cases, the router selects the model specified in the query; otherwise, it uses the default or the appropriate one according to the query classification and other factors such as accuracy, response time, cost, etc.

The router can chain the models for complex cases. For example, the router can put multiple LLM calls in a sequential chain, with the output of one model being the input of the next, so that the last model in the chain produces the final output. The router can also organize the model chains with conditional logic, using intermediate results to decide which model chain should process the data next. In another use case, the router divides a query into multiple sub-queries that are processed in parallel by models, summarised individually, and then aggregated into the final output.

Model integration and orchestration — Diagram by the author
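Here is a small sketch of the model registry and router described above. The model names, costs, tags, and the keyword-based `classify` method are hypothetical, and the models are stub functions rather than real LLM clients; a production router would weigh accuracy and response time as well.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ModelEntry:
    name: str
    call: Callable[[str], str]      # stub in place of a real LLM client
    cost: float                     # hypothetical relative cost
    tags: frozenset = frozenset()   # query classes this model is suited for


class ModelRegistry:
    """Catalog of the models available in the application."""

    def __init__(self) -> None:
        self._models: dict[str, ModelEntry] = {}

    def register(self, entry: ModelEntry) -> None:
        self._models[entry.name] = entry

    def get(self, name: str) -> ModelEntry:
        return self._models[name]

    def with_tag(self, tag: str) -> list[ModelEntry]:
        return [m for m in self._models.values() if tag in m.tags]


class Router:
    def __init__(self, registry: ModelRegistry, default: str) -> None:
        self.registry = registry
        self.default = default

    def classify(self, query: str) -> str:
        # Naive keyword check; a real router might use a classifier model.
        hints = ("code", "function", "snippet")
        return "code" if any(h in query.lower() for h in hints) else "general"

    def select(self, query: str, requested: Optional[str] = None) -> ModelEntry:
        if requested:  # simple case: the query names the model
            return self.registry.get(requested)
        candidates = self.registry.with_tag(self.classify(query))
        if not candidates:
            return self.registry.get(self.default)
        return min(candidates, key=lambda m: m.cost)  # cheapest suitable model

    def run_chain(self, query: str, model_names: list[str]) -> str:
        # Sequential chain: each model's output is the next model's input.
        text = query
        for name in model_names:
            text = self.registry.get(name).call(text)
        return text


registry = ModelRegistry()
registry.register(ModelEntry("general-small", lambda p: f"[general] {p}", 1.0, frozenset({"general"})))
registry.register(ModelEntry("code-large", lambda p: f"[code] {p}", 5.0, frozenset({"code"})))
registry.register(ModelEntry("summarizer", lambda p: f"[summary] {p}", 0.5))
router = Router(registry, default="general-small")
```

Conditional and parallel chains can be built on the same registry; I left them out to keep the sketch short.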

Q5: How can we handle the LLMs’ responses, such as the expected format from queries and calling functions?

This question relates to the core structure’s “Output Processing and Formatting” component. We can design two components to process the raw outputs generated from LLMs. One is format parsers, which parse the output into a structured, user-friendly format or into data formats such as CSV, JSON, or YAML. The other is function tools, which can invoke functions, call external APIs, or query databases according to the outputs from the LLMs. For this, the outputs are converted into correctly formatted arguments that the tools can parse and consume.

Output processing and format — Diagram by the author
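The two output components can be illustrated with a short sketch: a format parser that pulls a JSON object out of the raw model text, and a function tool registry that dispatches the parsed call. The JSON shape (`tool` and `arguments` keys), the `get_service_owner` tool, and its data are assumptions made up for this example.

```python
import json
import re

# Hypothetical lookup data for the example tool.
OWNERS = {"billing": "team-payments"}


def get_service_owner(service: str) -> str:
    """Hypothetical function tool: look up the owning team of a service."""
    return OWNERS.get(service, "unknown")


# Registry of the function tools the application is willing to call.
TOOLS = {"get_service_owner": get_service_owner}


def parse_json_output(raw: str) -> dict:
    """Format parser: extract a JSON object from text around it."""
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if not match:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))


def dispatch(parsed: dict) -> object:
    """Function tools: call the named tool with the parsed arguments."""
    name = parsed.get("tool")
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**parsed.get("arguments", {}))


# A raw model response with prose around the structured part (made up).
raw_output = (
    'Here is the call: {"tool": "get_service_owner", '
    '"arguments": {"service": "billing"}} Let me know if you need more.'
)
owner = dispatch(parse_json_output(raw_output))
```

The parser is deliberately forgiving about text around the JSON, because models often wrap structured output in explanatory sentences.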

Now that we have used five questions to cover all the components of the core structure, should we consider anything else in designing the “Smart Engineering Knowledge Assistant” application? Let’s move on to find out.

Q6: How can we refer to the information introduced earlier in the conversation?

Since LLMs are stateless by design, they don’t remember the context or information from earlier interactions. We need to design a memory component that emulates state so that the application can refer to information from the chat history. The memory component has a memory handler that performs read and write operations against Redis, chosen for its low-latency reads and writes. Before the processed queries are sent to LLMs, they are written to the memory store as chat history candidates for the next interaction. Then the processed queries are sent to LLMs for processing together with the existing chat history fetched from the memory store. Lastly, the outputs generated from LLMs are written into the memory store. The number of history records fetched from the memory store is configurable in the memory handler.

Application with memory to refer to chat history — Diagram by the author
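The memory flow can be sketched as follows. To keep the example self-contained, `InMemoryStore` is a plain dictionary that stands in for Redis, and `fake_llm` is a stub; both are assumptions for illustration. The sketch also reads the existing history just before writing the new query, so the fetched history does not contain the current query twice.

```python
from typing import Callable


class InMemoryStore:
    """Stand-in for Redis: one list of messages per session (assumption)."""

    def __init__(self) -> None:
        self._data: dict[str, list[dict]] = {}

    def append(self, key: str, message: dict) -> None:
        self._data.setdefault(key, []).append(message)

    def last(self, key: str, n: int) -> list[dict]:
        # Guard n <= 0, because a [-0:] slice would return the whole list.
        return self._data.get(key, [])[-n:] if n > 0 else []


class MemoryHandler:
    """Handles reads and writes, with a configurable history window."""

    def __init__(self, store: InMemoryStore, history_size: int = 4) -> None:
        self.store = store
        self.history_size = history_size

    def write(self, session_id: str, role: str, content: str) -> None:
        self.store.append(session_id, {"role": role, "content": content})

    def read(self, session_id: str) -> list[dict]:
        return self.store.last(session_id, self.history_size)


def chat(handler: MemoryHandler, llm: Callable[[str], str], session_id: str, query: str) -> str:
    history = handler.read(session_id)        # existing chat history
    handler.write(session_id, "user", query)  # candidate for the next turn
    lines = [f'{m["role"]}: {m["content"]}' for m in history] + [f"user: {query}"]
    answer = llm("\n".join(lines))            # query plus history go to the model
    handler.write(session_id, "assistant", answer)
    return answer


def fake_llm(prompt: str) -> str:
    """Stub model that reports how many prompt lines it received."""
    return f"(reply after {len(prompt.splitlines())} prompt lines)"


handler = MemoryHandler(InMemoryStore(), history_size=2)
first = chat(handler, fake_llm, "session-1", "What is RAG?")
second = chat(handler, fake_llm, "session-1", "How are documents chunked?")
```

The second call sees three prompt lines: the previous question, the previous answer, and the new question, which is how the stateless model appears to remember the conversation.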

Q7: How can we protect the data and make sure the application is secure?

Security is critical for the application, and the security layer is applied across the entire application wherever it is needed. In the design, the key places to consider are:

  • Users are authenticated before they can use the application.
  • Inputs/queries are validated and sanitized before they are passed on for processing.
  • In the RAG system, all data sources from the organization’s knowledge base should be encrypted at rest. The data loaders must be authorized before loading data from data sources. Embeddings are encrypted using property-preserving encryption before they are stored in the vector database, and the encryption key is used to generate the encrypted query embedding so that it can match the stored data during the similarity search.
  • All tokens and API keys for available models in the application are stored in HashiCorp Vault.
  • Authorizations are checked before calling models to process queries, as models may not be open to everyone.
  • In the memory system, authorizations are checked before reading and writing data in Redis, and data is encrypted at rest.
  • For data that is not public within the organization, such as confidential documents and code snippets from private repos, only authorized users can see the model outputs derived from it.
  • Authorizations are checked before API/function calls are executed based on the models’ outputs.
  • Logging is implemented to record access, changes, and activities.

Security layer across the application — Diagram by the author
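Two of the checks in the list above, input sanitization and authorization before a model call, can be sketched in a few lines. The length limit, the role names, and the permission table are hypothetical; in a real application they would come from the organization’s identity and policy systems.

```python
import re
from typing import Callable

MAX_QUERY_LENGTH = 2000  # hypothetical limit

# Hypothetical mapping of model name to the roles allowed to call it.
MODEL_PERMISSIONS = {
    "general-small": {"engineer", "contractor"},
    "code-large": {"engineer"},
}


class AuthorizationError(Exception):
    """Raised when a role may not use the requested model."""


def sanitize_query(raw: str) -> str:
    """Strip control characters, collapse whitespace, and enforce size limits."""
    cleaned = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]", "", raw)
    cleaned = " ".join(cleaned.split())
    if not cleaned:
        raise ValueError("empty query")
    if len(cleaned) > MAX_QUERY_LENGTH:
        raise ValueError("query too long")
    return cleaned


def authorize_model(role: str, model_name: str) -> None:
    """Check that the role may call the model; unknown models are denied."""
    if role not in MODEL_PERMISSIONS.get(model_name, set()):
        raise AuthorizationError(f"{role} may not call {model_name}")


def guarded_call(role: str, model_name: str, raw_query: str, llm: Callable[[str], str]) -> str:
    """Validate the input and check authorization before the model is called."""
    query = sanitize_query(raw_query)
    authorize_model(role, model_name)
    return llm(query)
```

Denying by default (an unknown model has no allowed roles) keeps a newly registered model closed until someone grants access explicitly.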

Q8: How can the application integrate with external services and be consumed by other services?

Even though the “Smart Engineering Knowledge Assistant” is a conversational application leveraging LLMs, we can still expose it as an API service, as we would with a traditional software application. External services can then query the knowledge contents to enhance their own content discovery features. Additionally, we can integrate the application with external API services to extend its capabilities, such as connecting to GitHub APIs to get code snippets from public repos.

The deep-dive design of API integrations is out of the scope of this article, so we only consider the very basic elements of this component here. The API service needs REST API endpoints that third-party services can call and an API access registry that catalogs the third-party services allowed to access the APIs. The external API manager manages and maintains the configuration of the external API services used for integration. It also needs API clients that can call the external APIs, with API tokens or API keys stored in HashiCorp Vault and fetched at runtime, and a data parser that parses the API responses so the data can be integrated with the application.

API integration — Diagram by the author
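As a rough sketch of the API access registry, the example below catalogs third-party clients with a hashed key and a set of scopes, and checks them inside a handler function. The scope name `knowledge:query`, the client ID, and the handler shape are hypothetical; a real service would sit behind a web framework and load its secrets from the vault.

```python
import hashlib
import hmac
from typing import Callable


def _hash_key(api_key: str) -> str:
    # Store only a digest of the key, never the key itself.
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


class ApiAccessRegistry:
    """Catalog of third-party services allowed to call the API."""

    def __init__(self) -> None:
        self._clients: dict[str, dict] = {}

    def register(self, client_id: str, api_key: str, scopes: list[str]) -> None:
        self._clients[client_id] = {"key_hash": _hash_key(api_key), "scopes": set(scopes)}

    def is_allowed(self, client_id: str, api_key: str, scope: str) -> bool:
        client = self._clients.get(client_id)
        if client is None:
            return False
        # Constant-time comparison avoids leaking information through timing.
        key_ok = hmac.compare_digest(_hash_key(api_key), client["key_hash"])
        return key_ok and scope in client["scopes"]


def handle_query_request(
    registry: ApiAccessRegistry,
    client_id: str,
    api_key: str,
    query: str,
    answer_fn: Callable[[str], str],
) -> tuple[int, dict]:
    """Body of a hypothetical REST endpoint: returns (status code, payload)."""
    if not registry.is_allowed(client_id, api_key, "knowledge:query"):
        return 403, {"error": "forbidden"}
    return 200, {"answer": answer_fn(query)}


api_registry = ApiAccessRegistry()
api_registry.register("docs-portal", "secret-key", ["knowledge:query"])
```

Here `answer_fn` is whatever function runs the rest of the application pipeline for a query.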

We have now completed the design by answering the full set of questions. But this doesn’t mean it is the final version of the design. In practice, the design evolves iteratively as business requirements change and as our understanding of the engineering and technology improves.

Challenges and further considerations

LLM-powered applications such as the “Smart Engineering Knowledge Assistant” still face some challenges.

Hallucination and bias

LLMs can sometimes generate incorrect or nonsensical information, and biases in the training data may lead to skewed or inappropriate responses. In an engineering context, this could affect the reliability of information. Organizations that don’t have sufficient capability to fine-tune the models rely more on RAG to mitigate hallucination and bias. We need to ensure the knowledge base is kept up to date with high-quality engineering information and that outdated or incorrect content is removed. We can also implement a human-in-the-loop (HITL) system in which domain experts review critical outputs to catch and correct hallucinated and biased content. Their corrections create a feedback loop that guides how queries are structured or how the RAG system prioritizes specific sources, which improves the quality of the responses.

API/Function call security

Even though it is handy to execute API/function calls directly from the model outputs, doing so raises security concerns. An LLM may make inappropriate decisions or generate incorrect data that affects the results of the APIs or functions and potentially introduces security holes. Additionally, if unauthorized parties can trigger the executions, it may lead to data breaches or system incidents. In practice, it is better to keep the executions out of the production environment, ideally in an isolated environment like a sandbox. If execution in the production environment is required, compliance checks and monitoring, with review and approval by authorized persons, are necessary.
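One way to express these safeguards in code is an allowlist of tools, a check on the arguments the model produced, and an approval gate for production. The sketch below assumes a hypothetical `restart_service` tool and the environment names `sandbox` and `production`; it illustrates the idea and is not a complete security control.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ToolPolicy:
    func: Callable[..., object]
    allowed_args: frozenset
    needs_approval_in_prod: bool = True


def restart_service(name: str) -> str:
    """Hypothetical tool with side effects."""
    return f"restarted {name}"


# Only tools listed here may ever be executed from model outputs.
POLICIES = {"restart_service": ToolPolicy(restart_service, frozenset({"name"}))}


def execute_tool_call(call: dict, environment: str, approved_by: Optional[str] = None) -> object:
    policy = POLICIES.get(call.get("tool"))
    if policy is None:
        raise PermissionError("tool is not on the allowlist")

    # Reject any argument the model invented that the tool does not accept.
    args = call.get("arguments", {})
    unexpected = set(args) - policy.allowed_args
    if unexpected:
        raise ValueError(f"unexpected arguments: {sorted(unexpected)}")

    # In production, a named approver is required before execution.
    if environment == "production" and policy.needs_approval_in_prod and not approved_by:
        raise PermissionError("approval required in production")

    return policy.func(**args)
```

With this gate, a call such as `execute_tool_call({"tool": "restart_service", "arguments": {"name": "billing"}}, "sandbox")` runs freely in the sandbox, while the same call in production is rejected until `approved_by` is supplied.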

Design improvements

Inspired by GitHub Copilot Chat, we can implement a dynamic UI that asks follow-up questions to clarify ambiguous queries or gather more context to improve information accuracy. We can also consider using a user service to identify users’ preferences, domains, frequently used models, custom-defined queries, or prompt templates to provide more personalized services. Another option is to leverage instant messaging applications like Microsoft Teams, widely used in corporations.

We can set up a mechanism in the prompt engineering system that continuously collects user feedback to refine input processing, routing logic, and prompt strategy.

From the architecture perspective, we can adopt a microservices architecture when scalability, resilience, and decoupling are required. The different components of the application, such as the UI, prompt engineering system, RAG system, and LLM processing, are decoupled and operated as independent services. Each service can scale independently based on demand.


This article presents an example application that uses a question-driven approach to explore the design of LLM-powered applications from a software engineer’s perspective. It breaks down each component of the core LLM-powered application structure through a series of questions and then expands to considerations that any software application should have, such as security. However, there are various ways to design an LLM-powered application, depending on business requirements and circumstances. No single approach suits all cases, and the design itself evolves iteratively. For software engineers, adaptation is a core idea: it allows us to reuse prior experience and knowledge to design applications powered by new technology, which share the same essential structures and methodologies but take different forms.


