7 Techniques to Enhance Graph Data Ingestion with Python in ArangoDB
Author(s): Saloni Gupta
Originally published on Towards AI.
ArangoDB stands out as one of the most versatile graph database solutions available today. My introduction to ArangoDB came during a project transition from Neo4j, where I discovered its unique query language, ArangoDB Query Language (AQL), reminiscent of Cypher for Neo4j. If you're curious about the comparison between ArangoDB and Neo4j, this detailed resource might interest you:
ArangoDB vs Neo4j – What you can't do with Neo4j
Neo4j is a single-model graph database. ArangoDB offers the same functionality as Neo4j with more than competitive…
arangodb.com
In the course of this project, I set up a local instance of ArangoDB using Docker and employed the ArangoDB Python driver, python-arango, to develop data ingestion scripts. The scale of data I dealt with reached millions of nodes and relations, which posed a significant challenge in terms of performance optimization.
Assuming you're actively engaged in an ArangoDB graph project, the purpose of this article is to share the optimization strategies I utilized to enhance data ingestion performance, empowering you to implement similar practices. Let's delve into each optimization:
Optimizations Applied for Efficient Data Ingestion:
- Utilising import_bulk with a batch size for faster import
- Leveraging Pandas & import_bulk for Relation Creation Over AQL Queries
- Harnessing Batch API for Streamlined Relation Queries
- Enhancing Maximum Memory Map Configuration
- Batch Iteration Through Collections to Mitigate Timeout and Reconnect Issues
- Increasing Timeout of ArangoDB Client and TTL Value for Queries to Avoid HTTP and Cursor Timeouts Respectively
- Defining additional indexes
Let us look at each one in more detail.
1. Utilising import_bulk with a batch size for faster import
Employing import_bulk with a specified batch size proves to be the most efficient method for inserting multiple documents into a collection. Here's how you can leverage it:
a. For vertex collections:
# import_bulk expects an iterable of dicts; if df_vertex_data is a pandas DataFrame,
# convert it first, e.g. df_vertex_data = df.to_dict('records').
vc1 = db.collection(vertex_collection_name)
vc1.import_bulk(df_vertex_data, batch_size=50000)
b. For edge collections, a similar approach can be adopted. This is elaborated in the subsequent point 2.
# from_prefix/to_prefix prepend the source and target collection names to the
# _from and _to values, so the edge documents only need to carry document keys.
ec1 = db.collection(edge_collection_name)
ec1.import_bulk(df_relation_data, from_prefix=source, to_prefix=target, batch_size=50000)
Reference: https://docs.python-arango.com/en/main/specs.html#arango.collection.Collection.import_bulk
2. Leveraging Pandas & import_bulk for Relation Creation Over AQL Queries
There are multiple ways to insert relations, i.e., edge documents, into an ArangoDB graph database. An AQL query could do the job using the INSERT operation. While AQL queries serve well for simple operations, complex queries involving multiple joins and conditions can significantly slow down the insertion of edges into the edge collection. In such cases, performing merges, filtering, and transformations in Pandas, followed by bulk insertion with import_bulk as explained in point 1, proves to be more efficient.
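As a rough sketch of this pattern (the database name, collection names, and DataFrames below are hypothetical, not taken from the original project):
import pandas as pd
from arango import ArangoClient

# Hypothetical input data; in practice this would come from your source files.
df_orders = pd.DataFrame({"order_id": ["o1", "o2"], "customer_id": ["c1", "c2"], "status": ["completed", "open"]})
df_customers = pd.DataFrame({"customer_id": ["c1", "c2"], "country": ["DE", "US"]})

client = ArangoClient()
db = client.db("my_db", username="root", password="")  # adjust to your setup

# Join and filter in Pandas instead of expressing the same logic in AQL.
df_edges = df_orders.merge(df_customers, on="customer_id", how="inner")
df_edges = df_edges[df_edges["status"] == "completed"]

# Build edge documents; with from_prefix/to_prefix only the document keys are needed
# in _from and _to (the collection names are prepended automatically).
edge_docs = [
    {"_from": row.order_id, "_to": row.customer_id, "status": row.status}
    for row in df_edges.itertuples(index=False)
]

edge_collection = db.collection("placed_by")
edge_collection.import_bulk(edge_docs, from_prefix="orders", to_prefix="customers", batch_size=50000)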
3. Harnessing Batch API for Streamlined Relation Queries
Opt for batch execution when a query touches large volumes of data. Methods such as insert_many(), update_many(), replace_many(), and delete_many() let you send multiple documents in a single request. These methods replace the batch request API, which has been deprecated since ArangoDB 3.8.0.
Reference: https://docs.python-arango.com/en/main/batch.html
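As a minimal sketch (the database and collection names are placeholders, not from the original project), the *_many methods can be used like this:
from arango import ArangoClient

client = ArangoClient()
db = client.db("my_db", username="root", password="")  # adjust to your setup
users = db.collection("users")

# Insert many documents with a single request instead of one insert() per document.
docs = [{"_key": str(i), "value": i} for i in range(10000)]
users.insert_many(docs)

# Update many documents at once; each dict must carry the _key of its target document.
updates = [{"_key": str(i), "value": i * 2} for i in range(10000)]
users.update_many(updates)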
4. Enhancing Maximum Memory Map Configuration
Increasing vm.max_map_count (the maximum number of memory map areas a process may use) can significantly enhance performance for memory-intensive operations, such as those characteristic of database workloads. Adjustments to this parameter are essential, especially in Docker environments.
Here's how you can adjust it on Windows using PowerShell (the first command drops you into the Docker Desktop WSL distribution):
wsl -d docker-desktop
sudo sysctl -w "vm.max_map_count=2048000"
Note: You may have to tune this parameter based on your requirements. This is the minimum value that I needed for my data volume to be successfully ingested.
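If ArangoDB runs on a Linux host instead of Docker Desktop, the same value can be set there; a common way to make it persistent across reboots (an assumption about your setup, not part of the original write-up) is:
# Apply the setting immediately on the Docker host
sudo sysctl -w vm.max_map_count=2048000
# Persist it across reboots
echo "vm.max_map_count=2048000" | sudo tee -a /etc/sysctl.conf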
5. Batch Iteration Through Collections to Mitigate Timeout and Reconnect Issues
When retrieving an entire collection during data ingestion, consider iterating through the collection in batches, especially if it's large. This prevents timeout and reconnect issues. Below is a sample code snippet illustrating this approach.
def get_full_collection(db, collection_name, aql):
    # Get the total number of documents in the collection.
    # entries = db.collection(collection_name).count()  # slower alternative
    entries = list(aql.execute("RETURN LENGTH(" + collection_name + ")", ttl=1000000))[0]
    print(entries)

    limit = 50000  # block size you want to request per round trip
    docs = []  # final output
    for x in range(int(entries / limit) + 1):
        block = db.collection(collection_name).all(skip=x * limit, limit=limit)
        docs.extend(block)
    return docs
6. Increasing Timeout of ArangoDB Client and TTL Value for Queries to Avoid HTTP and Cursor Timeouts Respectively
Initialize the ArangoDB client with a custom DefaultHTTPClient and set a higher value for request_timeout to accommodate larger datasets. This is how you can do it:
from arango import ArangoClient
from arango.http import DefaultHTTPClient

# Initialize the ArangoDB client with a large HTTP request timeout (in seconds).
client = ArangoClient(http_client=DefaultHTTPClient(request_timeout=1000000))
Similarly, adjust the ttl (time to live) parameter when executing long-running AQL queries so that the server-side cursor is not discarded before you finish consuming it. An example of increasing the ttl parameter in aql.execute() is shared below:
aql.execute("RETURN LENGTH(" + collection_name + ")", ttl=1000000)
7. Defining additional indexes
Don't overlook the importance of adding indexes to collections that are frequently queried during the ingestion process. Indexing significantly boosts the performance of those lookup queries and thus expedites data ingestion.
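As a minimal sketch (the database, collection, and field names are placeholders), a persistent index can be added with python-arango like this:
from arango import ArangoClient

client = ArangoClient()
db = client.db("my_db", username="root", password="")  # adjust to your setup

# Add a persistent index on the field(s) you filter or join on during ingestion.
users = db.collection("users")
users.add_persistent_index(fields=["email"], unique=True, in_background=True)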
These optimization techniques, consolidated here, should help you speed up data ingestion in your own ArangoDB projects.
Please feel free to share your experiences with these strategies and their impact on your projects!
Published via Towards AI