What are Vector Databases?
Last Updated on June 4, 2024 by Editorial Team
Author(s): Ayo Akinkugbe
Originally published on Towards AI.
Introduction
Vector databases are databases designed specifically for storing vector embeddings. If a vector is a data representation having magnitude and direction, what then are vector embeddings?
Vector embeddings are lists of numbers that capture patterns and relationships in unstructured data such as images, text and audio. By common estimates, over 80% of data is unstructured: data that doesn't easily fit into rows and columns. This includes documents, images, text, audio and video. Recommendation engines and advanced AI applications ingest, process and store large amounts of unstructured data. For use cases such as semantic search and text summarization, data is often converted to embeddings, which are points in a vector space. There are various methods and models for generating embeddings, including features hand-crafted with subject matter expertise, Word2vec and the OpenAI API. Embeddings are then stored and queried. Using traditional databases for storing and querying vector embeddings can pose some challenges. Vector stores, or vector databases, are designed to cater for the types of operations performed on high-dimensional vector embeddings.
How Vector Databases Work
A vector database indexes and stores vector embeddings for fast retrieval and similarity search. Embeddings allow unstructured data to be easily searched and compared. This is possible because vectors have magnitude and direction in space, so the closeness of two embeddings can be measured. Similarity measures such as cosine similarity, and distance metrics such as Euclidean distance, can be used to compare vector embeddings and find how alike they are.
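To make the two measures named above concrete, here is a minimal sketch in plain Python. The three-dimensional vectors and the words attached to them are made-up values for illustration only; real embeddings come from a model and typically have hundreds or thousands of dimensions.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Length (magnitude) of a vector."""
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1 means same direction, 0 means orthogonal."""
    return dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    """Straight-line distance between two points: 0 means identical."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy "embeddings" (hypothetical values, not produced by any real model).
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.82, 0.15]
apple = [0.1, 0.2, 0.9]

print("cosine(king, queen):", cosine_similarity(king, queen))
print("cosine(king, apple):", cosine_similarity(king, apple))
print("distance(king, queen):", euclidean_distance(king, queen))
print("distance(king, apple):", euclidean_distance(king, apple))
```

Cosine similarity looks only at direction, while Euclidean distance is also affected by magnitude, so the two can rank neighbors differently unless the vectors are normalized to the same length.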
The figure below shows the derivation of embeddings for four words. They are then visualized in space to determine likeness and similarity.
For example, a PDF document can be converted to vector embeddings, stored in a database and queried using this technique. Each query is converted into a vector embedding, and a similarity measure such as cosine similarity is calculated between it and the stored vectors to retrieve the closest matches from the database. The content associated with the retrieved vectors is then returned as the results of the query.
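The query flow just described can be sketched as follows. `ToyVectorStore` is a hypothetical in-memory class written for this illustration, not the API of any real vector database, and the vectors are hand-picked stand-ins for what an embedding model would produce. It compares the query against every stored vector, which is the simplest possible strategy.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class ToyVectorStore:
    """Hypothetical in-memory store: keeps (id, vector, text) and scans all records."""

    def __init__(self):
        self.records = []

    def add(self, doc_id, vector, text):
        self.records.append((doc_id, vector, text))

    def query(self, query_vector, top_k=2):
        # Score every stored vector against the query, then keep the best top_k.
        scored = [
            (cosine_similarity(query_vector, vector), doc_id, text)
            for doc_id, vector, text in self.records
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:top_k]

store = ToyVectorStore()
# Pretend these vectors are embeddings of passages from a PDF (made-up values).
store.add("p1", [0.9, 0.1, 0.0], "Refund policy for online orders")
store.add("p2", [0.8, 0.3, 0.1], "How to return a damaged item")
store.add("p3", [0.0, 0.2, 0.9], "Office opening hours")

# Pretend embedding of the question "Can I get my money back?" (made-up values).
query_vector = [0.85, 0.2, 0.05]
results = store.query(query_vector, top_k=2)
for score, doc_id, text in results:
    print(f"{doc_id}  {score:.3f}  {text}")
```

The two passages about refunds and returns score highest because their vectors point in nearly the same direction as the query, even though they share no keywords with it.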
Using a relational database for the above example can be compared to using a bread knife as a steak knife in the kitchen. The process takes longer and can consume significant compute resources. If a linear search is used for such a query in a relational database, the query embedding has to be compared with every stored embedding (which can number in the millions) to find the closest matches. Vector stores offer a better approach, as they are designed specifically to store and operate on vectors.
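Vector databases leverage Approximate Nearest Neighbor (ANN) search algorithms, which speed up retrieval through techniques such as hashing and quantization. Rather than comparing a query against every stored vector, an ANN index narrows the search to a small set of likely candidates, trading a small amount of accuracy for much faster queries. This makes the above process of querying embeddings faster and more efficient.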
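As a rough illustration of the hashing idea, the sketch below implements one simple form of locality-sensitive hashing based on random hyperplanes. The class name `LSHIndex`, the parameter choices and the toy vectors are assumptions made for this example; production systems use several hash tables, re-rank candidates by true similarity, and tune these parameters carefully.

```python
import random
from collections import defaultdict

def make_hyperplanes(num_planes, dim, seed=0):
    """Draw random hyperplanes through the origin (seeded for repeatability)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]

def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    bits = []
    for plane in hyperplanes:
        projection = sum(p * v for p, v in zip(plane, vector))
        bits.append(1 if projection >= 0 else 0)
    return tuple(bits)

class LSHIndex:
    """Hypothetical single-table LSH index: vectors with equal signatures share a bucket."""

    def __init__(self, dim, num_planes=4, seed=0):
        self.hyperplanes = make_hyperplanes(num_planes, dim, seed)
        self.buckets = defaultdict(list)

    def add(self, item_id, vector):
        signature = lsh_signature(vector, self.hyperplanes)
        self.buckets[signature].append((item_id, vector))

    def candidates(self, query_vector):
        """Return only the items in the query's bucket instead of scanning everything."""
        signature = lsh_signature(query_vector, self.hyperplanes)
        return self.buckets.get(signature, [])

index = LSHIndex(dim=3)
index.add("a", [0.9, 0.1, 0.0])
index.add("b", [0.0, 0.2, 0.9])
index.add("c", [-0.7, -0.1, 0.1])

# This query points in the same direction as "a", so it hashes to the same bucket.
candidate_ids = [item_id for item_id, _ in index.candidates([1.8, 0.2, 0.0])]
print(candidate_ids)
```

Vectors separated by a small angle are likely, though not guaranteed, to land in the same bucket, which is why the search is approximate: a true neighbor can occasionally be missed in exchange for examining far fewer vectors.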
Using a relational database for a vector database use case can be compared to using a bread knife as a steak knife in the kitchen.
Vector Databases vs. Traditional Databases
Choosing a database for an application isn't always an either-or situation, as databases are designed for different purposes and can be complementary. However, there are distinctions between traditional databases and vector databases that highlight the pain points vector databases address.
For example, a relational database management system (RDBMS) stores data in tables, rows and columns. The data is queried or retrieved using SQL (Structured Query Language) or a variation of SQL. Though relational databases have many use cases in traditional software applications, such as building data warehouses and online transaction processing, they can be laborious and computationally expensive to use in AI/ML applications built on Large Language Models (LLMs) and Retrieval Augmented Generation (RAG), which involve operations like summarization and semantic search.
How do traditional databases compare to vector databases? What do they do differently?
Purpose and Strengths
- Relational Database: Designed to store relational data. Organizes data into tables, rows and columns using a schema. Great for data that needs to maintain a particular structure, e.g. transactional data.
- Document Database: Designed to store large volumes of unstructured data. Organizes data in documents. Great for data with a changing format or schema, e.g. real-time web application data and IoT data.
- Graph Database: Stores data using a graph structure. Great for representing complex relationships in data, e.g. social network data.
- Vector Database: Designed to store and operate on vector data. Images, video, audio and text are converted to vector embeddings for advanced AI/ML processes like summarization and semantic search. Vector databases are designed specifically to store vector embeddings and speed up these types of operations.
Data storage
- Relational Database: Stores data in rows and columns. Data requires lots of structure.
- Document Database: Stores data in documents. Data structure is malleable and can change.
- Graph Database: Stores data using a graph structure. Data entities are represented as vertices/nodes and relationships are represented as edges.
- Vector Database: Stores and organizes data as vector embeddings, or points in a vast multi-dimensional space.
Querying and Data Retrieval
- Relational Database: Uses SQL for data retrieval, storage and update.
- Document Database: Uses custom query languages like MongoDB Query Language (MongoDB), Mango Query Language (CouchDB) and Firestore Query Language (Firebase).
- Graph Database: Uses graph query languages such as Cypher and Gremlin.
- Vector Database: Uses similarity metrics such as cosine similarity, or distance metrics such as Euclidean distance, for data retrieval.
Indexing
Indexing allows fast retrieval of data in databases. It is used to maximize query efficiency. When a database is not indexed, it is queried linearly, i.e. a query has to search every row to find those matching its condition. This can be resource-intensive. Indexing is implemented differently in traditional and vector databases.
- Relational Database: Indexing is based on exact matching. Indexing techniques for fast data retrieval include B-trees, hash, bitmap and full-text indexes.
- Document Database: Indexing techniques include B-trees, compound indexes, multikey indexes, geospatial indexes and text indexes.
- Graph Database: Indexing techniques include node and relationship indexes, label indexes, full-text indexes and spatial indexes.
- Vector Database: Indexing is based on similarity score and distance. Leverages hashing, quantization, trees or graphs. Indexing techniques include LSH (Locality-Sensitive Hashing), K-D Trees, VP Trees (Vantage-Point Trees) and HNSW (Hierarchical Navigable Small World graphs).
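The contrast between exact-match indexing and similarity-based indexing can be seen in a small sketch. A Python dictionary stands in here for a hash index; the rows, field names and vectors are invented for illustration.

```python
rows = [
    {"id": 1, "email": "ada@example.com"},
    {"id": 2, "email": "alan@example.com"},
    {"id": 3, "email": "grace@example.com"},
]

def linear_scan(rows, email):
    """No index: check every row until one matches."""
    for row in rows:
        if row["email"] == email:
            return row
    return None

# Exact-match index: built once, after which each lookup is a single hash probe.
email_index = {row["email"]: row for row in rows}

print(linear_scan(rows, "grace@example.com"))
print(email_index["grace@example.com"])

# The same trick does not carry over to embeddings: a query vector is almost
# never exactly equal to a stored one, so an exact-match lookup misses it.
vector_index = {(0.90, 0.10, 0.00): "doc-1"}
near_query = (0.91, 0.09, 0.00)
print(vector_index.get(near_query))  # None, although the vectors are nearly identical
```

This is why vector indexes are organized around closeness rather than equality: the useful question is which stored vectors are nearest to the query, not which one matches it exactly.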
Though traditional databases like relational databases can be used to store and interact with embeddings, they are not designed for those types of operations. Vector databases leverage specialized techniques, such as hashing, for indexing and storing data.
Benefits of Vector Databases
- Scalability: Similarity search, for example by cosine similarity, scales poorly in traditional databases. With vector databases, these operations can be computed at scale.
- Flexibility: Vector databases allow operations such as semantic search that might be hard to perform using traditional databases. They allow a variety of processes to be performed on unstructured data.
- Ultra Low Latency: Using ANN search, vector databases are optimized for high-dimensional similarity search between vector embeddings.
- High Performance: For operations like nearest neighbor search, vector databases can outperform relational databases by orders of magnitude on large collections. Though NoSQL databases can handle large volumes of data, they typically lack the specialized indexing mechanisms, such as HNSW and IVF (inverted file index), used in vector databases to speed up similarity searches.
- Optimized Storage and Memory: Vector databases keep frequently accessed vectors in memory and optimize disk I/O for less frequently accessed data. They also leverage specialized indexing techniques for data storage.
Unlike other databases, vector databases are built to store, index and query vector embeddings using similarity scores at ultra low latency.
Use Cases
- Long-Term Memory for LLMs: vector databases can be used to equip LLMs with long-term memory.
- Recommendation Engine: suggests items similar to a customer's past purchases.
- Semantic Search: search based on meaning or context rather than exact string or keyword matching, implemented without keywords and tags, for text, image, audio or video data.
- Retrieval Augmented Generation: a method used to improve the domain-specific responses of large language models.
- Advanced Chatbots: chatbots that leverage artificial intelligence (AI), machine learning (ML) and natural language processing (NLP) to understand and respond to user inputs.
- Context Match: understanding and interpreting the meaning of words, phrases or data based on the surrounding information or context.
Vector databases are useful in systems that leverage operations such as search, clustering, classification and recommendations.
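As one illustration of the recommendation use case, the sketch below averages the embeddings of a customer's past purchases into a profile vector and ranks the remaining items by cosine similarity to it. Averaging is only one simple strategy, chosen here as an assumption, and the catalog and its vectors are made up.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def recommend(catalog, purchased_ids, top_k=2):
    """Rank items the customer has not bought by similarity to their average purchase."""
    profile = mean_vector([catalog[item_id] for item_id in purchased_ids])
    scored = [
        (cosine_similarity(profile, vector), item_id)
        for item_id, vector in catalog.items()
        if item_id not in purchased_ids
    ]
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:top_k]]

# Hypothetical product embeddings (made-up values).
catalog = {
    "running shoes": [0.9, 0.1, 0.0],
    "trail shoes": [0.8, 0.2, 0.1],
    "sports socks": [0.7, 0.3, 0.0],
    "coffee maker": [0.0, 0.1, 0.9],
    "espresso cups": [0.1, 0.0, 0.8],
}

recommendations = recommend(catalog, purchased_ids={"running shoes", "trail shoes"}, top_k=1)
print(recommendations)
```

In a real system the ranking step would be delegated to the vector database's similarity search rather than computed over the whole catalog in application code.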
Examples of Popular Vector Databases
Below are some examples of commonly searched and referenced vector databases. Some of these are open source.
1. Pinecone
The vector database to build knowledgeable AI | Pinecone
Search through billions of items for similar matches to any object, in milliseconds. It's the next generation of...
www.pinecone.io
2. ChromaDB
the AI-native open-source embedding database
www.trychroma.com
3. Weaviate
Welcome | Weaviate – Vector Database
Welcome to Weaviate
weaviate.io
4. Qdrant
Qdrant – Vector Database
Qdrant is an Open-Source Vector Database and Vector Search Engine written in Rust. It provides fast and scalable vector...
qdrant.tech
5. Milvus
Vector database – Milvus
Milvus is the world's most advanced open-source vector database, built for developing and maintaining AI applications.
milvus.io
Conclusion
Vector databases are an evolving technology and are just getting started. According to Polaris research, the global vector database market was valued at USD 1,781.54 million in 2023 and is expected to grow at a CAGR of 21.7%. With advances in generative AI, innovative solutions for storing and processing vector embeddings are bound to follow.
In a few years, vector databases might themselves be compared to a newer technology, just as traditional databases were in this overview.
For more on vectors and vector operations such as distance calculation, check out this earlier post:
Guide: Creating and Computing Vectors Using Python
A quick guide on implementing vector operations in Python
python.plainenglish.io