Build a Company Brain With AI and RAG
Author(s): Igor Novikov
Originally published on Towards AI.
The most common use case businesses have for AI right now is searching for answers in the data they already have in order to make decisions, or producing beautifully crafted but completely useless reports so that top management can justify their gigantic bonuses. All very important and legitimate applications 🧐.
The problem is that your data analytics and AI department, if you have one, probably speaks voodoo and can't give a coherent answer to why you have so much data but still can't answer pretty trivial questions.
I feel your pain, so let's find out what they are talking about.
Before the generative AI boom, the main ways to search documents and company knowledge bases were full-text search engines like Apache Lucene, Elasticsearch, and Solr, similar features in database engines like Microsoft SQL Server Full-Text Search, primitive keyword-based search, or hardcore ML/data-science work. Not anymore.
Those approaches do work, but their capabilities are limited, or the implementation costs about as much as half a Boeing 747, and hence the systems built on them were very limited as well. They can't handle complex or fuzzy questions, where the system needs to understand the meaning of the question and the context of the conversation rather than the individual words.
An example from my practice: patent search. It's a difficult task that old methods do not solve well, because similar things are described in different patents in very different ways.
Now it has become possible, and the main culprit is Large Language Models, or LLMs. One can argue whether they are truly intelligent or not, but one thing is clear: they are very good at understanding the intricacies of human language. LLM technology is so impressive that it has almost become a synonym for Artificial Intelligence.
The LLM Era and the Retriever
LLMs help in two ways here:
- They help understand user questions and translate them into something a computer can handle. This might not sound too impressive, but it is huge. They can take whatever gibberish some of your users feed them and translate it into something meaningful.
- They generate a meaningful answer based on the data available and the conversation context.
There is one item in between that LLMs do not handle: retrieving the data to answer the question. Technically, an LLM may know enough (that is, have enough data) to answer the question if it was trained on data in that domain, or if the question is very generic. But usually, for a company that works in a specific domain, that is not the case. The LLM probably doesn't know your company's financials, or the name and phone number of your head of marketing, so you can call him late at night with very urgent questions like when they are going to change the font of that marketing report. Not very useful.
So there is one more layer in between, called the retriever. It searches for relevant information in company databases, documents, CRMs, and so on, and provides it to the LLM. The very popular term RAG stands for Retrieval-Augmented Generation, which means just that: a system that uses an LLM and a retriever to provide answers.
It is quite likely you will need multiple retrievers that search for data in the different silos your company has, so at the end of the retrieval process you will need to combine their search results, as sketched below. This is called Retriever Assembling.
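As a toy sketch of what combining results might look like (the retriever objects and their scoring scheme here are made up for illustration, not any particular library's API):

```python
# A minimal sketch: merge results from several hypothetical retrievers.
# Each retriever.search(query) is assumed to return (doc_id, text, score) tuples.

def combine_results(query, retrievers, top_k=10):
    merged = {}
    for retriever in retrievers:
        for doc_id, text, score in retriever.search(query):
            # Deduplicate: keep the best score seen for each document
            if doc_id not in merged or score > merged[doc_id][1]:
                merged[doc_id] = (text, score)
    # Sort by score and keep the top_k candidates for the next stage
    ranked = sorted(merged.items(), key=lambda item: item[1][1], reverse=True)
    return [(doc_id, text) for doc_id, (text, score) in ranked[:top_k]]
```

In practice, scores coming from different retrievers are not directly comparable, which is one more reason the combined candidate list usually gets re-ranked afterwards (more on re-ranking below).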
So, have RAG systems finally solved document search? Well, not completely.
RAG has its own limitations, the main ones coming from the way LLMs work. Specifically, an LLM can handle only a limited amount of text per request, so we can't send it all the available information and let it figure out the answer. Your company probably has hundreds, if not thousands, of documents, database records, and other data. There is no way to magically transmit all of that to the LLM to look through. So if we have lots of documents, we have to filter them, and if the documents themselves are large, we have to chunk them into small pieces.
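To make the chunking step concrete, here is a naive sketch (the chunk size and overlap are arbitrary numbers, not recommendations; real systems usually split on sentences or headings instead of raw characters):

```python
# Naive fixed-size chunking with overlap, so that text cut at a chunk
# boundary still appears intact in the neighbouring chunk.

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back to create the overlap
    return chunks
```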
Additionally, each token we send to the LLM costs money, so we don't want to send a lot of useless and irrelevant data and pay for it. And as you know, most documents are watered down with tons of garbage. So we really want to send only the information we think is relevant to the answer.
This brings its own set of challenges: if we chunk information and store it across different chunks, how can we make sure that all the needed chunks are used during retrieval? That can be very difficult if you think about complex documents with lots of tables and so on, where the explanation of a table can be in a different part of the document than the table itself. Or in a different document. Additionally, there could be different versions of the same document in your system, with conflicting information…
One of the ways to solve this problem is Re-Ranking.
Why do we need re-ranking?
In default RAG, documents are found using a semantic search in your data storage, which looks for documents similar to the user question. For example, if you ask, "Why is my cat plotting world domination?" the search uses vector embeddings to find documents related to cats' mischief. These documents, along with the question, are then sent to the LLM to produce the final (and hopefully non-alarming) answer.
Unfortunately, semantic search algorithms are not able to find perfect matches in many cases.
This happens for various reasons: documents often contain information not on one but on several topics, they can contain conflicting or outdated information, and they are stored in vector databases not in full but in chunks.
So the semantic search can return documents like this:
- Cats: the history of humanity enslavement
- Would my cat eat me if I stopped feeding it?
- The Cats musical dominates the box office this year
- …
- How to win cats and influence people
As you can see, item number 3 is not a relevant document here.
There is another problem, which I'll explain with an analogy. Say you know a very knowledgeable professor. He has read all sorts of books and technically knows a lot of stuff. You ask him about pelicans on the Galapagos Islands, and he knows all about that. You ask him whether Trump is better than Biden, and he knows a lot about that too. Now you think this guy is very smart! I can ask him anything! I have this problem at work, let's use this guy to solve it! So you ask him: under the Solvency II Directive, if an insurance company were to hypothetically insure a colony of hyper-intelligent squirrels that operate their own hedge fund, how would their risk-based capital requirements be adjusted for the inherent unpredictability of nut futures?
And??? Of course, he knows that too! And he gives you the answer. The problem is, it is complete rubbish. He was right on the previous topics, but once you go into specific industry domains, he produces believable but inaccurate answers. The reason is that LLMs were trained on a large corpus of internet data, but it is very general data. It is not domain-specific, so they produce general answers.
Re-ranking helps to solve this problem by using re-ranking models trained on domain-specific texts, so the system can provide the LLM with the most relevant context. So in the previous example, if we ask a re-ranker to reorder the semantic search output by relevance, it will give us this:
- Cats: the history of humanity enslavement
- Would my cat eat me if I stopped feeding it?
- How to win cats and influence people
- …
- The Cats musical dominates the box office this year
Now, you can take the first three, most relevant, documents and send them to our very smart professor (the LLM) to produce the answer.
There are many ways to re-rank, some of them covered here. Typically, production systems use a combination of several methods that best fit the current domain.
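One popular method is a cross-encoder re-ranker that scores each (question, document) pair directly. A rough sketch, assuming the sentence-transformers library and one of its publicly available cross-encoder checkpoints (in production you would rather fine-tune a re-ranker on your own domain data):

```python
from sentence_transformers import CrossEncoder

# A general-purpose public re-ranking model; swap in a domain-tuned one for real use
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

question = "Why is my cat plotting world domination?"
candidates = [
    "Cats: the history of humanity enslavement",
    "Would my cat eat me if I stopped feeding it?",
    "The Cats musical dominates the box office this year",
    "How to win cats and influence people",
]

# Score every (question, document) pair and sort by relevance, best first
scores = reranker.predict([(question, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
print(reranked)
```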
To re-rank data, you first need to retrieve it from storage. There are several options for that.
Vector Storage
For RAG to work, the retriever can technically search the data anywhere: for example, directly in the company SQL database or in your emails, or through the full-text search systems you already have in the company. The thing is, this is not practical, as it will be slow and inaccurate. The preferred (nowadays) way to store data for RAG retrieval is a vector database, or vector store.
In a vector database (or, more precisely, in vector space), concepts that are semantically similar are located close to each other, regardless of how they are written. For example, the words "dog" and "bulldog" will be close, whereas the words "lock" (as in a door lock) and "lock" (as in a lock of hair) will be far apart.
Additionally, the vector database stores metadata, that is, data about the data: things like permissions (who can access the data), where the data came from, how it was stored, its type and category, and so on.
Therefore, vector databases are well suited for semantic data search.
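To make the "close in vector space" idea concrete, here is a small sketch using the sentence-transformers library (the model name is just one common public embedding checkpoint, not a recommendation):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Embed a few terms into the same vector space
terms = ["dog", "bulldog", "door lock"]
embeddings = model.encode(terms, normalize_embeddings=True)

# Cosine similarity: semantically related terms score noticeably higher
print(util.cos_sim(embeddings[0], embeddings[1]))  # dog vs. bulldog
print(util.cos_sim(embeddings[0], embeddings[2]))  # dog vs. door lock
```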
There are many options, from the open-source Qdrant to the enterprise Pinecone. Most classic and NoSQL databases now have vector options too.
Popular Vector Databases (as of now):
- Qdrant: an open-source database; I prefer this one for most tasks
- Pinecone: a cloud-native (i.e., they will charge you a lot) database that supports a lot of enterprise features
- Chroma: another open-source database (Apache 2.0 license)
- Weaviate: open source under the BSD-3-Clause license
- Milvus: open source under the Apache 2.0 license
- Marqo: an open-source platform and a fully managed cloud service
- FAISS: a separate beast, not a database but a similarity-search library from Meta
- pgvector for PostgreSQL
- Atlas Vector Search for MongoDB
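As a minimal sketch of what storing and searching chunks could look like with Qdrant and the qdrant-client library (the collection name, payload fields, and embedding model are my own placeholders):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings
client = QdrantClient(":memory:")  # in-memory instance, good enough for experiments

client.create_collection(
    collection_name="company_docs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# Store a chunk together with its metadata (source, access roles, and so on)
client.upsert(
    collection_name="company_docs",
    points=[
        PointStruct(
            id=1,
            vector=encoder.encode("Q1 sales report for North America").tolist(),
            payload={"source": "finance_wiki", "allowed_roles": ["finance"]},
        )
    ],
)

# Semantic search: embed the question and find the closest chunks
hits = client.search(
    collection_name="company_docs",
    query_vector=encoder.encode("What were Q1 sales in North America?").tolist(),
    limit=3,
)
for hit in hits:
    print(hit.score, hit.payload)
```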
LlamaIndex
The most common framework (and my personal favorite) used to build RAG systems like this is LlamaIndex. It is a Python library that provides the following tools:
- Data ingestion: your data is probably spread across different locations and formats, like PDF documents, SQL databases, the company wiki, a CRM, and so on. An LLM can't work with that directly, so we need to get the data out of these places. LlamaIndex provides data connectors for that.
- Data Indexes: your data, once we get it from wherever it is stored, will probably be a mess no supercomputer can navigate. So we need to structure it into a format an LLM can understand.
- Query Engines: once we have the data in a consumable format, we need a tool to query it. LlamaIndex provides tools to query data via various methods and integrate it with applications. A minimal end-to-end example follows this list.
- Data Agents: these are LLM-based workers that can do complex tasks with data, like calling an API, augmenting the data, or calling helper functions. They are used to build workflows.
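To give a feel for how these pieces fit together, here is roughly what the LlamaIndex "hello world" looks like (assuming a recent llama-index release, an OpenAI key in the environment for the default LLM and embeddings, and a ./data folder standing in for your documents):

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Data ingestion: load everything from a local folder (PDFs, text files, etc.)
documents = SimpleDirectoryReader("./data").load_data()

# Data index: chunk and embed the documents into a vector index
index = VectorStoreIndex.from_documents(documents)

# Query engine: retrieve the most relevant chunks and let the LLM answer on top of them
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("What is our refund policy for enterprise customers?")
print(response)
```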
There are alternatives to LlamaIndex: for example, if you use the Microsoft stack, you can choose Semantic Kernel, although it is still pretty buggy in my opinion. Another popular alternative is Haystack.
There is one more option, which I would suggest avoiding:
LangChain
LangChain is another framework used to build RAG systems, although it is more generic and not specifically built for that. It is widely used to build LLM-powered applications of all sorts, like chatbots, and is supposed to speed up development by providing a lot of building blocks that you chain together into flows.
You can even use LangChain and LlamaIndex together. One thing I want to add here is that the LangChain devs change things so often and so much that they quite frequently break things, and the framework is not very stable, which makes it very annoying to use. I sometimes start to doubt whether it is suitable for production systems at all…
Access Control
Here is another issue. In your current systems, you likely have some sort of access control in place. There is no access control at the LLM level, and trying to implement it at the LLM level is futile. Whatever instructions you give to the LLM can be broken, and basically any LLM can be talked into pretty much anything with enough effort.
There is a funny game by Lakera, one of the leaders in the LLM security industry, that shows this very well.
I'm not saying you shouldn't have guardrails for your LLM input and output, just that you can't rely on them for access control or to protect secret data.
An important implication of this is that you also can't train your LLM on that sort of data, because eventually it will be happy to reveal it to someone who is not supposed to know it.
So really, the way to make this work is to feed the LLM the necessary data during the retrieval process, at the level of the user's request, and only data that this user can access, which should be enforced at the data storage level, for example.
Permissions and categories are a form of metadata, and for this whole system to work, this metadata must be preserved at the data ingestion stage, in the vector database or wherever else the data lives. Correspondingly, when searching the vector database, it is necessary to check whether the roles or other access attributes of the found documents match what is available to the user. Some (especially commercial enterprise) vector databases already have this functionality as standard.
You can find a good example here.
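Continuing the earlier Qdrant sketch, the permission check can be pushed into the vector search itself by filtering on the metadata payload (the allowed_roles field is my own made-up convention from that sketch, not a Qdrant built-in):

```python
from qdrant_client.models import Filter, FieldCondition, MatchAny

# Roles of the user making the request, taken from your auth system
user_roles = ["marketing", "employee"]

# Only chunks whose payload lists one of the user's roles are even considered
hits = client.search(
    collection_name="company_docs",
    query_vector=encoder.encode("What were Q1 sales in North America?").tolist(),
    query_filter=Filter(
        must=[FieldCondition(key="allowed_roles", match=MatchAny(any=user_roles))]
    ),
    limit=3,
)
```

This reuses the client and encoder objects from the vector storage sketch above.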
Knowledge Graphs
Knowledge graphs are not a mandatory part of a RAG system, but you will likely need one if you have a lot of data, or if the data is stored in a particular structure that is important for understanding it.
For example, think about the folder with photos on your computer. If you were to randomly rename all the subfolders, how easy would it be to find anything in there, even though the data itself (the photos) didn't change?
We humans are kind of similar. Without structure, information doesn't mean much to us, so we keep those knowledge graphs somewhere in our heads. We understand what data means from the conventions of how it is stored (i.e., from the structure or metadata) and use that to work with it effectively. For LLMs, this problem is solved with knowledge graphs with metadata (also known as knowledge maps), which means the LLM has not only the raw data but also information about the storage structure and the connections between different data entities. This approach is also known as Graph Retrieval-Augmented Generation (GraphRAG).
Graphs are excellent for representing and storing heterogeneous and interconnected information in a structured form, easily capturing complex relationships and attributes among different types of data, which vector databases struggle with.
A vanilla RAG pipeline looks something like this: the user question is embedded, a semantic search over the vector store returns the most similar chunks, and those chunks plus the question go to the LLM to generate the answer.
The modified process adds a knowledge graph: alongside the vector search, the retriever pulls the relevant entities and their relationships from the graph, and both kinds of context are combined before being sent to the LLM.
So, in fact, this is an ensemble of a vector database and a knowledge graph. As I mentioned in the section on ensembles in the previous article, ensembles generally improve accuracy, and they often also include a search through a regular database or by keywords (e.g., Elasticsearch).
There are already many graph storage solutions (though companies often make their own versions):
- Nebula: open source under the Apache 2.0 license, with good performance and a scalable architecture; my personal preference
- Neo4j: popular, supports the Cypher query language, making it accessible for various applications; there is a Community Edition under GPL v3, but for anything serious you will need the Enterprise Edition
- OrientDB: a multi-model database that combines graph and document models. Open source (Apache 2.0) and has been around for a while. It could be an option if you need a hybrid document-graph model; otherwise, you are better off with Nebula
- Memgraph: an in-memory graph database designed for real-time graph analytics and use cases requiring low-latency performance. If you need real-time processing, you will very likely have to go in-memory
- ArangoDB: a multi-model database supporting graph, document, and key/value data models within a single engine. It is not solely focused on graphs, so if you want graphs only, again, Nebula is better
- JanusGraph: open source under the Apache 2.0 license, supports various storage backends, including Apache Cassandra, HBase, and BerkeleyDB. It uses the Gremlin graph traversal language, part of the Apache TinkerPop framework, so if you already work with that, it could be a good choice
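To give a flavor of graph retrieval, here is a hedged sketch using the official neo4j Python driver and Neo4j's Cypher language (the node labels, relationship pattern, and credentials are invented for illustration; a real schema would come from your own data model):

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def get_related_context(entity_name: str) -> list[str]:
    """Pull an entity and its direct neighbours to hand to the LLM as extra context."""
    cypher = (
        "MATCH (e:Entity {name: $name})-[r]-(neighbour) "
        "RETURN e.name AS entity, type(r) AS relation, neighbour.name AS other"
    )
    with driver.session() as session:
        records = session.run(cypher, name=entity_name)
        return [f"{rec['entity']} -{rec['relation']}-> {rec['other']}" for rec in records]

# These triples can be appended to the retrieved text chunks before calling the LLM
print(get_related_context("Solvency II Directive"))
```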
Ingestion
This is probably the most difficult problem. No matter how you build your RAG system, if it stores garbage, it will produce garbage. Garbage in, garbage out.
It is difficult to ingest only clean data, and particularly difficult if the source is unstructured crap like PDF documents. Honestly, somebody should pass a law prohibiting this format at the planetary level.
LlamaParse and similar frameworks help to handle that, to a degree. But it is not only the format: there is also the problem of document versioning, where you have multiple versions of similar documents, or where information in one domain contradicts information in another… What do people do in this case? I guess it varies, and there is not always one correct answer.
This is an inherently difficult problem and not completely solvable until we have a superintelligence. There are methods like the Maximal Marginal Relevance (MMR) algorithm, which takes into account both the relevance and the novelty of the information when filtering and re-ranking documents.
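A simplified sketch of the classic MMR selection step, which trades off relevance to the query against redundancy with what has already been picked (the similarity function is a placeholder for, e.g., cosine similarity of embeddings):

```python
def mmr_select(query_vec, doc_vecs, similarity, k=5, lambda_=0.7):
    """Pick k documents, balancing relevance to the query against redundancy
    with already-selected documents (Maximal Marginal Relevance)."""
    selected, remaining = [], list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            relevance = similarity(query_vec, doc_vecs[i])
            redundancy = max(
                (similarity(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_ * relevance - (1 - lambda_) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected  # indices of the chosen documents, most useful first
```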
What's important is this:
- The system should be able to recognize that it has conflicting information and, if it is not sure, defer to the user to figure it out.
- For this, you have to store document versioning metadata and decide how versions and precedence are resolved in case of conflict, using algorithms like MMR and other types of re-ranking and relevance scoring.
There is no silver bullet here, as it largely depends on how you store your data in your systems, but it definitely helps if structure and conflict-resolution policies are already in place. In that case, you have a chance to formalize them and explain them to an LLM agent, or use a similar approach.
As part of your data, you might have images, or you may want users to query the system with images. It is possible now with the help of Vision Transformers. It's a big separate topic, though.
Hallucinations
Let's talk about the elephant in the room. Everyone knows LLMs hallucinate. In my opinion, this makes them unusable at certain points in critical systems, like healthcare decision-making or financial advising. In those areas, an LLM can act as a helper to a human professional, and even then with great caution.
Why do models hallucinate? The model is certainly not going to benefit from it in any way, so it has no selfish intentions; rather, there are several major reasons:
Training Data
Large language models are trained on huge and diverse datasets from many sources. This training is unsupervised, making it hard to ensure the data is fair, unbiased, or accurate. Language models cannot inherently distinguish between truth and fiction. They rely on patterns in the training data, which may include conflicting or subjective information. This can lead to outputs that may seem logical but are factually incorrect.
Lack of Accuracy in Specific Areas
Large language models like GPT are designed for general tasks and may hallucinate or produce incorrect results when applied to specific fields like medicine, law, or finance. These hallucinations occur because the models try to create coherent responses without having domain-specific knowledge.
Bad Prompting
Users provide instructions through prompts. Writing clear and precise prompts is crucial, much like coding. If a prompt lacks enough detail or context, the model might generate irrelevant or incorrect responses. To get correct answers it is very important to ask the right questions.
Intentional misalignment
Or, simply speaking, the model was trained to lie. That is especially true of publicly available models like ChatGPT or Gemini. They are trained with loads of questionable policies in mind, to comply with all sorts of laws and stupid regulations, so they give ridiculous answers about sensitive topics.
Context loss
The model has a limited context window, or short-term memory. If a conversation is too long, or the amount of data you are trying to feed it is too big, it will start losing the context details and produce nonsensical results. I think this one is going to be solved fairly soon, as context windows get larger; in a way, it is a function of available compute power. But as of now, it is a problem, and it is very important to architect your RAG system so that it optimizes the context size.
Is it possible to eliminate hallucinations? The short answer is no. But there are ways to lower their probability:
- Fine-tune the model on relevant domain data.
- Use RAG to provide valid context while avoiding context overflow.
- Use prompt engineering to create prompts that specifically instruct the model not to add information it is not certain about.
- Use in-context learning with few-shot prompting.
- Use LLM self-checking strategies like Tree of Thoughts to validate its reasoning process and outputs.
- You can ask the model to return the distribution of tokens with their probabilities and see how confident it really is in what it has concocted (calculating token-level uncertainty). If the probabilities in the distribution are low (what counts as low depends on the task), the model has most likely started to fabricate (hallucinate) and is not at all confident in its response. This can be used to evaluate the answer and to return an honest "I don't know" to the user; see the sketch after this list.
- You can use other models and oracles to validate the truthfulness of the output.
- Use post-processing and filtering mechanisms to censor outputs that do not comply with company policies or could otherwise be harmful.
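As a rough sketch of the token-probability check from the list above, using the OpenAI Python client (the model name and the confidence threshold are arbitrary choices; other providers expose log-probabilities in similar ways):

```python
import math
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What was our Q1 revenue in North America?"}],
    logprobs=True,
)

# Average per-token probability as a crude confidence signal
token_logprobs = response.choices[0].logprobs.content
avg_prob = sum(math.exp(t.logprob) for t in token_logprobs) / len(token_logprobs)

# Arbitrary threshold: below it, prefer an honest "I don't know"
if avg_prob < 0.6:
    print("I don't know.")
else:
    print(response.choices[0].message.content)
```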
But, as I mentioned, none of that can guarantee that there will be no hallucinations.
Calculations
What was said before largely relates to text data. But what if you have a lot of numbers and tables and need to do calculations as part of the answer? Things get tricky here, since LLMs can't count well. Even the best ones are terrible at simple math, let alone anything complex. The most common way to do calculations is to give the LLM a tool for that: for example, good old Elasticsearch or a SQL database.
So what you do is use the LLM to generate a query for Elasticsearch, run that query to do the calculations, and then use the results to generate the final output.
For example, a user question might be: "What is the total sales revenue in Q1 for North America?" The LLM will generate an Elasticsearch query like this one:
```json
{
  "query": {
    "bool": {
      "must": [
        { "term": { "region": "North America" } },
        { "range": { "date": { "gte": "2024-01-01", "lte": "2024-03-31" } } }
      ]
    }
  },
  "aggs": {
    "total_revenue": {
      "sum": {
        "field": "sales_revenue"
      }
    }
  }
}
```
The Elasticsearch response could be:
```json
{
  "aggregations": {
    "total_revenue": {
      "value": 4520000.0
    }
  }
}
```
And the LLM output will be: "The total sales revenue in Q1 for North America is $4,520,000."
Elasticsearch is fine for moderately complex computations; if you need something really complex, you might have to delegate it to a NumPy or Pandas script.
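For completeness, here is a rough sketch of the glue code around this flow, assuming the elasticsearch Python client and a hypothetical sales index (in a real system, the query dict would come from the LLM rather than being hard-coded, and the final sentence would also be phrased by the LLM):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# The query the LLM generated from the user's question (see the JSON above)
llm_generated = {
    "query": {
        "bool": {
            "must": [
                {"term": {"region": "North America"}},
                {"range": {"date": {"gte": "2024-01-01", "lte": "2024-03-31"}}},
            ]
        }
    },
    "aggs": {"total_revenue": {"sum": {"field": "sales_revenue"}}},
}

# Elasticsearch does the actual arithmetic via the aggregation
result = es.search(
    index="sales", size=0, query=llm_generated["query"], aggs=llm_generated["aggs"]
)
total = result["aggregations"]["total_revenue"]["value"]

# The number is then handed back to the LLM to phrase the final answer
print(f"The total sales revenue in Q1 for North America is ${total:,.0f}.")
```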
Have fun!
Published via Towards AI