HNSW — Small World, Yes! But how in the world is it Navigable?
Last Updated on December 13, 2024 by Editorial Team
Author(s): Allohvk
Originally published on Towards AI.
The psychologist Stanley Milgram carried out a curious social experiment in the 1960s. Envelopes addressed to an individual in Boston were given to volunteers in far-away cities. Each volunteer was requested to pass the envelope to anybody they knew who could get it closer to the intended Boston recipient. Instructions in the envelope asked each person who received it to repeat the same process. Around that time, there were roughly 200 million people in the US. The internet was yet to be invented. Mobile phones were decades away. Social networks were limited to local friends. Yet amazingly, these envelopes reached their Boston destination in about 6 hops!
Milgram demonstrated that we are all separated by very few degrees of separation. Pick any 2 random people on the planet and you could establish a short path between them consisting of fewer than 6 connected people on average. That meant people were well connected even in those days! 4 decades later, with 6 billion+ people on the planet, the possibility of numerous connections between them may have influenced a young student studying at Harvard to take this whole experience online. Imagine 6.5 billion points on a circle and the potential to connect each one of them to any other in < 6 connections. How many friend requests would that translate to?
But let us go back to Milgram’s study. Something is odd there. The presence of short paths was surely a surprise, yet understandable. Maybe people were more connected than was previously thought. What is odd is that these short paths are apparently discoverable! Think of it — no Google or LinkedIn — if you got the envelope to forward, you would have had to analyze your circle of friends, or your friends’ friends (if you knew them), & make a best judgement on whom to forward the envelope to. You didn’t have a complete end-to-end path to the Boston individual. You just had local information. Somehow, it seemed that the short paths naturally manifested themselves as people made their choices. This has useful applications, like in Internet packet routing…
So how did a typical volunteer choose whom to forward the envelope next? One strategy is to simply pass the message to a friend that appears closest to the destination given the limited view one had. In fact, this simple greedy search strategy is surprisingly effective and is used by most of the vector search algorithms today. Milgram’s study formally highlighted the so-called small world phenomenon. Meta’s 2016 paper showed that the number of hops had reduced to 3.6 for the 1.6 billion people connected on Facebook! So, two random people on 2 extreme corners of the earth can be connected in less than 4 hops. This low diameter property is a hallmark of small-world networks!
Graph Theory — An intuitive explanation
Why should there exist short paths linking random pairs of strangers? Can we mathematically model this curious phenomenon? Graph theory seems a perfect choice to do so! It may sound complicated, but in reality you just take a piece of paper and draw some circles representing each person in the network. Next, you draw lines between circles if a relationship exists between them. We can use these lines to travel thru’ the graph from person to person. In the graph below, can you tell me the path from Nora Bing, the romantic novelist in NY to David, the scientist doing research in Minsk?
Now, instead of 6 people, imagine you have a billion vectors. In order to search this data, you need to build relationships between them. Vectors that lie close to each other are similar in nature & can be considered friends. A line can be drawn between them. We do this during the graph construction phase. If we do a good job drawing these lines, then searching the vector DB becomes a breeze. Semantic search becomes possible at a scale of billions!
Of course, we use the official name nodes instead of circles and edges instead of lines, and we would have to use data structures like sets to hold all data instead of drawing them on paper. We would also have to come up with hacks to ensure construction and searches are possible at billion+ scale, but apart from these small inconveniences, it is pretty much the same thing. Euler invented Graph theory to solve an interesting puzzle — the story is charmingly captured in Vaidehi’s article. One last point — Some prefer the term Vertex instead of node. In this article, I stick to “people” or sometimes “Nodes”. But if in any paper, you ever happen to see a set named V holding all nodes & wonder why the set was not named N, you now know why 🙂
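If you want to trace such a path on a computer rather than on paper, here is a tiny sketch. The names and friendships below are made up for illustration (the article’s figure isn’t reproduced here): a dict of sets serves as the adjacency list, and a breadth-first search recovers a shortest path between two people.

```python
from collections import deque

# A tiny social graph as an adjacency list (a dict of sets).
# The names and friendships here are made up for illustration.
friends = {
    "Nora Bing": {"Joey", "Rachel"},
    "Joey": {"Nora Bing", "Chandler"},
    "Rachel": {"Nora Bing", "Monica"},
    "Chandler": {"Joey", "David"},
    "Monica": {"Rachel", "Chandler"},
    "David": {"Chandler"},
}

def shortest_path(graph, start, goal):
    """Breadth-first search: returns one shortest path from start to goal."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph[node]:
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None  # no path exists

print(shortest_path(friends, "Nora Bing", "David"))
# ['Nora Bing', 'Joey', 'Chandler', 'David']
```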
Even before Milgram’s experiments, Erdos & Renyi had proposed random graphs — which rather literally (as the name suggests) involved “connecting people randomly”! Now, granted, this kind of graph is not practical, but it did exhibit the low diameter behaviour that small-world networks possess. Drawing edges at random between nodes allowed for long-hop connections between people, and these long hops made the search path shorter. Of course, this kind of graph cannot be used to model real world connections. If you have 2 friends X & Y, it is likely that X & Y themselves are friends. That relationship would never be captured in a pure random graph. A more realistic model would consist of several local clusters of friends all across the graph. But this model would not exhibit low diameter behaviour. Guess why?
Take an extreme case where every individual on the planet has 2 friends. Imagine a straight line of people standing next to each other holding hands — everyone is connected to the one person on each side. Now imagine traversing from one (random) node to another (random) node. Such a structure would never have the low diameter property that Milgram’s experiments demanded. You would have to literally traverse from person to person in a straight line to get to your destination. No long hops! Now imagine increasing the number of friends to 10. This network is still going to be highly clustered, since all the edges are local. The only way to travel is to go in short hops. The diameter is still going to be large. A graph dominated by local clusters cannot be used to model a small world network!
The clustering effect breaks symmetry in the above model. Counter-intuitive as it may seem, Erdos & Renyi’s random graphs model had an inherent symmetry. All nodes had an equal chance of getting a connection. With sufficiently large networks, almost all nodes will approximately have the same number of edges — a world of perfect symmetry. Of course the real world is more chaotic…
Modelling “small worlds” with graph theory
It seems obvious in hindsight, but it was Watts & Strogatz who in their seminal 1998 paper proposed a model that walked in between these 2 extremes. The model begins with a simple structure. Each node (white circle in the fig below) is placed along the circumference of a circle, and is connected to its k nearest neighbors. This creates the local clusters we just discussed. Then, with probability p, each edge is ‘rewired’ i.e. the old connection is broken and a new one is added, with the new endpoint chosen uniformly at random — creating long-range contacts.
At p = 0, the network exhibits local cluster properties and at p = 1 the network is completely random. As p is increased from 0, the model begins to show small-world characteristics.
Watts & Strogatz’s model had a balance of order & chaos. They showed that increasing p reduces the average shortest-path length without affecting the clustering coefficient too much. Their model came out almost 3 decades ago at a time when the world was not connected online. But even in those times, it was becoming clear that people’s circle of acquaintances was not limited to those who live next door, as conventional models would imply! They applied their model to explain a few real world situations — from the neuro-circuitry of a worm to the power grid connections of the Western U.S.
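If you would like to see this balance for yourself, here is a rough sketch using the networkx library (assuming it is installed; any graph library would do). It builds Watts & Strogatz graphs for a few values of p and prints the average shortest-path length and clustering coefficient.

```python
# A rough sketch of the Watts-Strogatz effect, assuming the networkx library is
# installed. As the rewiring probability p grows, the average shortest-path
# length drops sharply while the clustering coefficient stays high for a while.
import networkx as nx

n, k = 1000, 10   # 1000 nodes, each initially wired to its 10 nearest ring neighbours
for p in [0.0, 0.01, 0.1, 1.0]:
    G = nx.connected_watts_strogatz_graph(n, k, p, seed=42)
    L = nx.average_shortest_path_length(G)
    C = nx.average_clustering(G)
    print(f"p={p:<5} avg path length={L:6.2f}  clustering={C:.3f}")
```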
The model was not perfect though. For example, it could not fully explain the World Wide Web. Let us try modelling the WWW as a graph — each webpage can be a node and every hyperlink in that webpage can be an edge connecting to the corresponding webpage. Barabasi et al estimated that despite the sheer number of web pages, the diameter was about 19. In other words, if we start at a random page, we can reach any other (chosen) page by clicking about 19 links on average (just like Milgram’s results, these results have limitations but that is not our focus). But this modelling had a major flaw! Can you guess what it was?
In every article I write, I give references to links that helped me. Since I get these via a Google Search (typically I scan up to 100 results for every concept I need to deep dive into), anything that ranks lower than 100th in the search results will never get a chance to show up as a link on my page. So an already popular web page will just get more popular! This is the power law property. Small changes to the model were needed to account for this (for example, by adjusting the probability that a given node will get a new edge to make it proportional to the share of the total count of edges that the node already has).
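For the curious, here is a bare-bones sketch of that fix, usually called preferential attachment. The numbers are made up and it is not meant to model the real web; it only shows how a “rich get richer” rule produces a handful of heavily linked hubs.

```python
# A bare-bones sketch of preferential attachment: each new page links to an
# existing page with probability proportional to how many links that page
# already has. Rich pages get richer.
import random
from collections import Counter

random.seed(0)
edges = [(0, 1)]                 # start with two pages linking to each other
targets = [0, 1]                 # every endpoint appears once per edge it has

for new_page in range(2, 10_000):
    # picking uniformly from `targets` == picking proportional to degree
    chosen = random.choice(targets)
    edges.append((new_page, chosen))
    targets.extend([new_page, chosen])

degree = Counter(t for e in edges for t in e)
print("Most linked pages:", degree.most_common(5))   # a handful of hubs dominate
```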
With scientists having created successful small world models & proven the existence of short paths, the focus shifted to finding these short paths using local information. To some extent, we already know a decent strategy — the one used by Milgram’s volunteers — a greedy search! It was Jon Kleinberg who researched this further & showed how to make small world models navigable.
Navigating thru’ small worlds
Jon Kleinberg’s model shows it is possible to navigate between nodes with very little knowledge of the full graph. Such graphs — where short paths can be found without access to global information — were termed navigable by him. Of course, these paths should be traversable in a (poly) logarithmic number of steps. He refined earlier models by adding shortcuts not uniformly but according to a harmonic distribution: the number of outgoing edges per node is fixed, & the probability of a shortcut between two nodes decays with the distance between them. So the greater the distance, the lower the chance of a connection. He then showed that for such a model, the greedy search strategy works & the network becomes navigable using local information only!
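To make this concrete, here is a toy sketch (a 1-D ring rather than Kleinberg’s 2-D lattice, with illustrative names throughout) of drawing one long-range contact per node with probability proportional to 1/distance^r.

```python
# A toy sketch of Kleinberg's idea on a 1-D ring of n people: each person gets
# one long-range contact, chosen with probability proportional to 1/distance**r.
# Kleinberg showed that setting r equal to the lattice dimension is the
# navigable sweet spot; here the "lattice" is a simple ring, so r = 1.
import random

random.seed(0)

def long_range_contact(node, n, r=1.0):
    others = [v for v in range(n) if v != node]
    dist = lambda u, v: min(abs(u - v), n - abs(u - v))   # ring distance
    weights = [dist(node, v) ** -r for v in others]
    return random.choices(others, weights=weights, k=1)[0]

print(long_range_contact(0, n=1000))   # usually lands fairly close, occasionally far
```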
Navigable Small World networks (NSW) are cool. Creating one in a 2D space is one thing, but how do we create the model needed for our high dimensional vector search? The stage is set for a paper by Malkov et al that is the backbone of most searches today.
NSW paper — A simple explanation
Let us revisit our Greedy Search algo. It takes a query and an entry point as parameters. The query is the node we want to find and the entry point is the current node — the node on which we are standing now.
- Starting from this entry point, compute the distance from the query to each node from the friend list of the current node.
- Select the node with the least distance. If this distance is smaller than the distance between the query & the current node, then hop to the selected node which becomes the new current node.
- Rinse. Repeat. Stop when a local minimum is reached — a node whose friends are not closer to the query than itself. (A minimal sketch of this loop in code follows below.)
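Here is that loop as a minimal sketch. The helper names (graph, vectors, distance) are my own, purely for illustration, and graph is assumed to map every node to its friend list.

```python
# A minimal sketch of the greedy search loop described above. `graph` maps each
# node id to its friend list, `vectors` maps node ids to their coordinates, and
# `distance` is any metric (Euclidean here). All names are illustrative.
import math

def distance(a, b):
    return math.dist(a, b)

def greedy_search(graph, vectors, query, entry_point):
    current = entry_point
    while True:
        # look only at the friends of the current node (local information!)
        best_friend = min(graph[current], key=lambda f: distance(vectors[f], query))
        if distance(vectors[best_friend], query) < distance(vectors[current], query):
            current = best_friend          # hop closer to the query
        else:
            return current                 # local minimum reached
```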
Here is the catch — What is this local minimum stuff? Why not a true minimum?
If you were a zoologist tasked with creating a model to explain animal territories, you would likely use a ‘Voronoi’ diagram. This involves dividing a surface into areas of influence. Say we have n T-Rex’s living in a forest. In our quest to model T-Rex behaviour, we are likely to start out by partitioning the forest into n areas such that a random point in one of the partitioned areas is guaranteed to be closest to the T-Rex associated with that area and not to any other T-Rex. Turns out, the resulting areas have polygonal shape — edges of the polygons being the perpendicular bisectors of lines joining neighboring T-Rex’s nesting sites. This kind of modelling has many applications — say solving Knuth’s Post Office Problem… If we had a Voronoi diagram of all post offices, we can find the closest post office to any given house by simply looking up the partition the house falls in.
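As an aside, in low dimensions this lookup is essentially a solved problem. A short sketch using scipy’s KDTree (assuming scipy is installed) answers the post-office question without ever drawing the polygons explicitly:

```python
# The post office problem in a few lines, assuming scipy is installed. A k-d
# tree answers "which post office is closest to this house?" without building
# the Voronoi diagram explicitly; the implicit partition is the same.
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
post_offices = rng.random((50, 2))        # 50 post offices on a unit square
tree = KDTree(post_offices)

house = np.array([0.3, 0.7])
dist, idx = tree.query(house)             # exact nearest neighbour in 2-D
print(f"Closest post office is #{idx}, {dist:.3f} away")
```

The catch, as we will see, is that such exact structures do not survive the jump to billions of nodes in 1000+ dimensions.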
Let us go back to our graph theory. If we carve out the n-dimensional vector space into a neat Voronoi diagram based on the positions of each node, then we are guaranteed that greedy search algo works perfectly! In other words, if every person’s friend list had all their Voronoi neighbors, then greedy search algorithm would never return false results. Of course, this is not practical…
Consider this situation! You got Milgram’s envelope to forward, the target destination being Jennifer Aniston. Resisting the urge to immediately fly & hand over the envelope personally to Aniston, you reluctantly look at your friends list and choose the best option — a friend working for a S/W company in Los Angeles. This way you ensure it physically reaches somewhere near Hollywood & hope that it goes thru’ more local hops to reach Aniston. What you didn’t know was that another friend in your list, living in far-away Florida, had an uncle working as a hair stylist in Hollywood. Possibly that friend was a better choice, but of course you couldn’t have seen beyond the choices in front of you. That is the disadvantage of greedy search (& more philosophically, of life itself!). It is possible that there is a path that does not seem optimal given your limited view, but is actually a better one.
If we want greedy search to return perfect results, we would have to build Voronoi diagrams (or their dual — the Delaunay graph). While this is possible in 2D for carving out territories of a few T-Rex’s, the situation changes when talking of billion+ nodes, each represented by a 1000+ dimensional vector! But hey, we are OK with results that are off occasionally. Such an Approximate Nearest Neighbour Search (ANNS) does not require the entire Delaunay graph. We can live with an approx Delaunay graph. This gives false minima sometimes — the node that we think is closest to our query may not really be the closest. This is OK, but we can still try to minimize these instances.
Instead of a single greedy search, Malkov performs m independent greedy searches from different random entry points, and gets multiple candidates for the nearest neighbour to the query. He simply chooses the closest one. Malkov also brings in interesting changes to the model construction. He constructs the network one randomly chosen node at a time and connects this node to the k-nearest neighbours among the nodes already inserted. It is obvious that such a network would have approximate Voronoi properties — after all, we are connecting each node to its immediate neighbours. This brings in a degree of navigability. But how does the low diameter property that is the hallmark of small world models come in?
We observed earlier in Erdos & Renyi’s work how the power of random connections creates long-range paths which make the network small world. We also saw how Kleinberg augmented the local neighbour connections of nodes with random long-range connections to make his network navigable. Well, the beauty of Malkov’s algorithm is that this happens naturally!!
Recall how nodes are inserted one by one randomly into Malkov’s model during construction. Obviously, in the initial phase of construction, the nodes chosen (at a billion+ scale) are likely to be far apart. They therefore form the long-hop connections needed for a small world. As we continue to insert more nodes, connections get shorter but some of those old long-range connections still hold. The short connections approximate the Delaunay graph required by the Greedy Search while the long connections help in logarithmic scaling of the Greedy Search. The model thus becomes small world & navigable.
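Putting the last few paragraphs together, here is a deliberately simplified, illustrative sketch of NSW-style construction and search. It reuses the greedy_search helper sketched earlier and finds neighbours by brute force during construction purely for readability; the actual paper reuses the graph search itself while inserting.

```python
# A deliberately simplified NSW-style sketch: insert nodes one by one, connect
# each to its k nearest already-inserted nodes, then answer queries with m
# independent greedy searches from random entry points.
# (Assumes greedy_search from the earlier sketch is in scope.)
import math, random

def build_nsw(vectors, k=5):
    graph = {}
    order = list(vectors)
    random.shuffle(order)                        # random insertion order
    for node in order:
        existing = list(graph)
        nearest = sorted(existing,
                         key=lambda n: math.dist(vectors[n], vectors[node]))[:k]
        graph[node] = set(nearest)
        for n in nearest:                        # edges are bidirectional
            graph[n].add(node)
    return graph

def nsw_search(graph, vectors, query, m=3):
    entries = random.sample(list(graph), m)      # m random entry points
    results = [greedy_search(graph, vectors, query, e) for e in entries]
    return min(results, key=lambda n: math.dist(vectors[n], query))
```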
If you are with me so far, then moving to HNSW will be a breeze. HNSW is also designed by Malkov so the core logic is the same. But first a game!
Let us play guess-the-number — We have to guess a number (query) chosen by the organizer from 1–1000. A simple binary search algo works well in this game. We first guess 500, then branch off based on the organizer’s response and keep halving until we narrow our search to the required number. But say, the organizer has allowed the use of 10 choices the first time.
We naturally suggest the 10 numbers — 100, 200 & so on till 1000. We ask the organizer to point out the number closest to the query from these ten. If the number-to-be-guessed is (say) 240, the organizer points to 200. We now know that the number to be guessed falls between 150 & 250. We now use the regular algorithm & guess the number. This approach significantly speeds up our search. We just created a hierarchical model. We first search in the upper hierarchy having a limited number of (appropriately-chosen) numbers for a rough match & then search in the lower hierarchy for a closer match. This is precisely the algorithm at work in H(ierarchical) NSW.
HNSW — At last
HNSW is a natural evolution of NSW & is inspired by the game above. Actually, it is inspired by an interesting data structure called skip lists which uses the same trick as in the game above. In HNSW, the greedy search initially happens at the uppermost level. This level holds very few nodes. Once a local minimum node is found, we move into the next layer. The entry point into the next layer is the node found in the layer above. We keep going lower & lower till we reach layer 0 which has all the nodes. We do one final greedy search in this layer using the regular algorithms. Even though the whole search spans multiple levels, it is faster just like the game above!
We get an effect of zooming-in more and more as the search progresses across levels. Naturally nodes in the higher levels are few & low-degree (degree is the official name for number of edges a node has). By their very design, these nodes hold the longest edges. At the lower levels, the number of nodes increase and so does the degree. The key here is to balance the degree of nodes at different levels in order to ensure a good accuracy as well as a good speed. This is all there is to HNSW. The rest are minor implementation details. Let us see how the HNSW network is formally constructed. The construction process involves a search first.
- As before, the network is built by inserting nodes one-by-one. The node being inserted is called the query. We have to insert this node in the right place. For that we need to run the search algorithm. The search algorithm starts with the topmost layer & greedily travels across edges, finding the nearest neighbors to the query.
- After finding the local minimum node in the topmost layer, it moves down to the next layer. The entry point in this next layer is the local minimum node found in the layer above. (Note that if a node is present at a certain layer, it must be present in every layer below. Layer 0 has all the nodes. Higher layers have fewer & fewer nodes as we go up. It must be noted that nodes are not split up between layers. Imagine we have 1000 nodes and 3 layers. Then we can structure them as 10 nodes in layer 2, 100 nodes in layer 1 and all 1000 nodes in layer 0.)
- The search algo repeats layer after layer until we reach the ‘chosen’ insertion layer for this node. The query node needs to be inserted at this layer and every layer below! We will discuss how we arrive upon this chosen layer shortly but first let us formalize the search algo we just discussed.
A list of ef nearest elements to the query is kept during the greedy search. Initially this list is just filled with the entry point nodes, all marked as unevaluated. The list is then updated at each step by evaluating all friends of the closest non-evaluated node in the list, until every node in the list has been evaluated. What does ‘evaluating a node’ mean? It simply means checking its distance to the query and inserting it into the list if the distance is close. Of course, we set a size limit on the list & move out nodes if the list starts to overflow. ef is set to 1 by the authors during construction. The algo ends when all nodes in the list have been evaluated. The final list returned contains the “candidate nodes”, from which we pick the nodes closest to the query.
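Here is that bookkeeping as a sketch, written in the spirit of the paper’s per-layer search routine. The names and details are my simplifications, not the reference implementation.

```python
# A sketch of the bookkeeping described above. `candidates` holds not-yet-
# evaluated nodes and `best` holds the ef closest nodes seen so far, both as
# (distance, node) pairs. Simplified for readability.
import heapq, math

def search_layer(graph, vectors, query, entry_points, ef):
    visited = set(entry_points)
    candidates = [(math.dist(vectors[p], query), p) for p in entry_points]
    heapq.heapify(candidates)                      # closest unevaluated node first
    best = sorted(candidates)[:ef]                 # current list of ef nearest nodes

    while candidates:
        d, node = heapq.heappop(candidates)        # evaluate the closest pending node
        if len(best) >= ef and d > best[-1][0]:    # it cannot improve the list: stop
            break
        for friend in graph[node]:                 # check every friend of that node
            if friend in visited:
                continue
            visited.add(friend)
            d_f = math.dist(vectors[friend], query)
            if len(best) < ef or d_f < best[-1][0]:
                heapq.heappush(candidates, (d_f, friend))
                best = sorted(best + [(d_f, friend)])[:ef]   # keep only ef closest
    return [node for _, node in best]              # the "candidate nodes"
```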
- OK, we now discuss how the chosen layer of the node is decided. Every node is randomly assigned an integer l which is its chosen layer. l is selected randomly from an exponentially decaying probability distribution normalized by a parameter m_L (a small sketch of this draw follows after the next point). Note that m_L = 0 results in a single layer & gives Malkov’s original NSW. Typically, the m_L value is set such that the majority of nodes have l=0, i.e. most of the nodes are present only on the lowest layer. Larger values of m_L increase the probability of more nodes appearing on higher layers. This makes the search faster at the cost of more memory.
- Just like in NSW, the nearest neighbour search during insertion ensures that the short paths that approximate the Delaunay graph are built incrementally. In contrast to NSW, the HNSW construction algo does not require the elements to be shuffled before insertion — the randomness needed for long paths is achieved by the level randomization described above.
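Here is the small sketch of the layer draw promised above, assuming the exponentially decaying form l = floor(-ln(U) * m_L) with U uniform in (0, 1], and M = 16 as an illustrative value.

```python
# How the "chosen layer" l can be drawn: l = floor(-ln(U) * m_L) with U uniform
# in (0, 1]. With m_L = 1/ln(M) and M = 16 (illustrative), the vast majority of
# nodes land on layer 0 and each higher layer holds roughly a factor of M fewer.
import math, random
from collections import Counter

M = 16
m_L = 1 / math.log(M)

def draw_level():
    u = 1.0 - random.random()          # uniform in (0, 1], avoids log(0)
    return int(-math.log(u) * m_L)

random.seed(0)
print(Counter(draw_level() for _ in range(1_000_000)).most_common())
# roughly: layer 0 ~ 94% of nodes, layer 1 ~ 6%, layer 2 ~ 0.4%, ...
```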
At this point, we have reached the layer that we want to insert into. Phase 2 of the construction starts now:
- In this destination layer, the search process is the same except that instead of keeping ef neighbours, it greedily searches for the efConstruction nearest neighbours. efConstruction is just another model parameter. It is usually far greater than ef. Higher efConstruction implies longer construction time but better search quality, since more nearest neighbours are considered.
- M out of efConstruction neighbours are chosen & edges to these nodes are built from the inserted query node. M is another hyperparameter. While the reader may expect that M nearest neighbours would be directly chosen for building edges, there is a slight twist to the algo which is explained in the appendix to this article. Also there is a max limit M_max to how many edges a node can have. M_max0 is the same limit in layer 0.
- The algo next descends to the layer below. Remember — once we insert a node into a particular layer, we must insert that node in every layer below (in the appropriate place). Each of the efConstruction nodes found above acts as a separate entry point into the next layer. Greedy searches are done as outlined above & edges are drawn. The cycle repeats layer after layer & the algo terminates after the query node is inserted & connected in layer 0!
HNSW usage & hyper-parameter choices
The M_max setting is important to avoid hitting a local minimum and stopping too early during the zoom-out phase. It is usually set to M, and M_max0 can be set to 2*M. An optimal value of m_L is 1/ln(M). But what about M itself? We see a lot of params depending on the value of M. Recall that higher M implies more potential connections, so better results. Naturally this comes at a higher computation/memory cost. Values between 5–50 work well. This is the single most impactful parameter to tune!
efConstruction & efSearch are other params which may be tuned. We encountered efConstruction earlier. If the default setting gives a recall lower than 0.9, we can increase it, though beyond a limit it stops having an impact. But what is efSearch? Well, this is used in the search algorithm of HNSW after construction. We never discussed this search algorithm separately, but the good news is that it is exactly the same as the search algo used during construction. Essentially, the efSearch setting is a tradeoff between search quality & speed.
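To see where these knobs sit in practice, here is a minimal usage sketch with the hnswlib library (one popular open-source HNSW implementation). The parameter values are illustrative, not recommendations.

```python
# A minimal usage sketch with hnswlib, showing where M, ef_construction and ef
# (the search-time parameter) plug in. Assumes `pip install hnswlib numpy`.
import hnswlib
import numpy as np

dim, num_elements = 128, 10_000
data = np.random.rand(num_elements, dim).astype(np.float32)

index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=num_elements, M=16, ef_construction=200)
index.add_items(data)

index.set_ef(50)                         # efSearch: higher = better recall, slower
labels, distances = index.knn_query(data[:5], k=3)
print(labels)
```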
Most modern Vector DBs support HNSW indexing. It is the preferred choice for quick & accurate searches involving high dimensional data. Its weakness is that it needs a fair amount of memory. It can be used in combination with Product Quantization (PQ) to compress vectors & reduce memory usage. IVF can be added to further improve search speeds. These concepts are explained in plain English in my Vector DB origins & Vamana article.
This is the 5th in a 12-series article titled My LLM diaries. Others are:
- Quantization in plain English
- LoRA & its newer variants — Explained like never before
- In-Context learning — The greatest magic show in the kingdom of LLMs
- RAG in plain English — Summary of a 1000 papers
- RAG in plain English — Summary of a 1000 papers — Part 2
- LLMs on the laptop — A peek into the Silicon
- Taming LLMs — A study of few popular techniques
- Semantic search, Vector DBs & Vamana — Explained like never before
- Story of HNSW — The most popular vector search algorithm (current article)
- Agents in plain English
- LLMops in plain English — Operationalizing trained models
- Taking a step back — On model sentience, conscientiousness & other philosophical aspects
Appendix
During insertion, a naïve approach is to take the M closest candidates. However, this is not always the optimal choice. In the author’s words, if a candidate is closer to one of the newly connected candidates than it is to the inserted vector, we skip over it without connecting. Can you think of a scenario for this logic?
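Here is that selection rule as a sketch, my own simplified rendering of the heuristic rather than the paper’s exact pseudocode:

```python
# A simplified sketch of the twist described above: walk the candidates from
# nearest to farthest and keep one only if it is closer to the inserted vector
# than to any candidate already kept.
import math

def select_neighbours(query_vec, candidates, vectors, M):
    kept = []
    for c in sorted(candidates, key=lambda c: math.dist(vectors[c], query_vec)):
        d_to_query = math.dist(vectors[c], query_vec)
        if all(d_to_query < math.dist(vectors[c], vectors[k]) for k in kept):
            kept.append(c)
        if len(kept) == M:
            break
    return kept
```

The intuition: if two candidates sit in the same tight cluster, one edge into that cluster is enough; skipping the redundant candidate frees an edge for a different direction, which helps keep separated clusters mutually reachable.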
Published via Towards AI