
7 Techniques to Enhance Graph Data Ingestion with Python in ArangoDB

Author(s): Saloni Gupta

Originally published on Towards AI.

Photo by Alina Grubnyak on Unsplash

ArangoDB stands out as one of the most versatile graph database solutions available today. My introduction to it came during a project migration from Neo4j, where I discovered its query language, ArangoDB Query Language (AQL), which is reminiscent of Neo4j's Cypher. If you're curious about how ArangoDB compares with Neo4j, this detailed resource might interest you:

ArangoDB vs Neo4j — What you can’t do with Neo4j

Neo4j is a single-model graph database. ArangoDB offers the same functionality as Neo4j with more than competitive…

arangodb.com

In the course of this project, I set up a local instance of ArangoDB using Docker and used the ArangoDB Python driver, python-arango, to develop the data ingestion scripts. The data ran to millions of nodes and relations, which made performance optimization a significant challenge.

If you are working on an ArangoDB graph project, this article shares the optimization strategies I used to improve data ingestion performance so that you can apply similar practices. Let's delve into each optimization:

Optimizations Applied for Efficient Data Ingestion:

  1. Utilising import_bulk with a batch size for faster import
  2. Leveraging Pandas & import_bulk for Relation Creation Over AQL Queries
  3. Harnessing Batch API for Streamlined Relation Queries
  4. Enhancing Maximum Memory Map Configuration
  5. Batch Iteration Through Collections to Mitigate Timeout and Reconnect Issues
  6. Increasing Timeout of ArangoDB Client and TTL Value for Queries to Avoid HTTP and Cursor Timeouts Respectively
  7. Defining additional indexes

Let us look at each one in more detail.

1. Utilising import_bulk with a batch size for faster import

Employing import_bulk with a specified batch size proves to be the most efficient method for inserting multiple documents into a collection. Here's how you can leverage it:

a. For vertex collections:

# import_bulk expects an iterable of dicts; a pandas DataFrame can be converted with .to_dict("records")
vc1 = db.collection(vertex_collection_name)
vc1.import_bulk(df_vertex_data.to_dict("records"), batch_size=50000)

b. For edge collections, a similar approach can be adopted; this is elaborated further in point 2 below.

# source and target are the names of the vertex collections referenced by _from and _to
ec1 = db.collection(edge_collection_name)
ec1.import_bulk(df_relation_data.to_dict("records"), from_prefix=source, to_prefix=target, batch_size=50000)

Reference: https://docs.python-arango.com/en/main/specs.html#arango.collection.Collection.import_bulk
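For completeness, the snippets above assume an existing database handle db. A minimal connection sketch (the host, credentials, and database name below are placeholders, not values from the original project) could look like this:

from arango import ArangoClient

# Connect to a local ArangoDB instance (URL, credentials, and database name are illustrative).
client = ArangoClient(hosts="http://localhost:8529")
db = client.db("my_graph_db", username="root", password="passwd")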

2. Leveraging Pandas & import_bulk for Relation Creation Over AQL Queries

There are multiple ways to insert relations (i.e., an edge collection) into an ArangoDB graph database. An AQL query could do the job using the INSERT operation. While AQL queries serve well for simple operations, complex queries involving multiple joins and conditions can significantly slow down the insertion of edges into the edge collection. In such cases, performing the merges, filtering, and transformations in Pandas and then bulk-inserting the result with import_bulk (explained in point 1) proves more efficient, as the sketch below shows.
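A rough sketch of this pattern, assuming db is the database handle from the connection snippet above and using made-up frame, column, and collection names:

import pandas as pd

# Hypothetical source frames; in practice these come from your raw data.
orders = pd.DataFrame({"order_key": ["o1", "o2"], "customer_key": ["c1", "c2"]})
customers = pd.DataFrame({"customer_key": ["c1", "c2"], "active": [True, False]})

# Do the joins, filters, and transformations in Pandas instead of in an AQL INSERT query.
edges = orders.merge(customers, on="customer_key")
edges = edges[edges["active"]]
edges = edges.rename(columns={"customer_key": "_from", "order_key": "_to"})

# Bulk-insert the prepared edges; from_prefix/to_prefix prepend the vertex collection names to _from/_to.
ec = db.collection("customer_placed_order")
ec.import_bulk(
    edges[["_from", "_to"]].to_dict("records"),
    from_prefix="customers",
    to_prefix="orders",
    batch_size=50000,
)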

3. Harnessing Batch API for Streamlined Relation Queries

Prefer batched execution for operations that touch large volumes of documents. Methods such as insert_many(), update_many(), replace_many(), and delete_many() let you send many documents in a single request. These methods replace the batch request API, which was deprecated in ArangoDB 3.8.0.
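A short sketch of the chunked insert_many() pattern (the collection name, document shape, and chunk size are illustrative; db is again an assumed database handle):

# db: assumed database handle; "products" is an illustrative collection name.
collection = db.collection("products")

docs = [{"_key": str(i), "value": i} for i in range(100000)]

# Send documents in fixed-size chunks instead of one HTTP request per document.
chunk_size = 10000
for start in range(0, len(docs), chunk_size):
    collection.insert_many(docs[start:start + chunk_size], overwrite=True)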

Reference: https://docs.python-arango.com/en/main/batch.html

4. Enhancing Maximum Memory Map Configuration

Increasing vm.max_map_count (the maximum number of memory map areas a process may have) can significantly improve performance for memory-intensive workloads such as database operations. Adjusting this kernel parameter is especially important in Docker environments.

Here's how you can adjust it on Windows, where Docker Desktop runs inside WSL, using PowerShell:

wsl -d docker-desktop
sudo sysctl -w "vm.max_map_count=2048000"

Note: You may have to tune this parameter based on your requirements. This is the minimum value that I needed for my data volume to be successfully ingested.

5. Batch Iteration Through Collections to Mitigate Timeout and Reconnect Issues

When retrieving an entire collection during data ingestion, consider iterating through the collection in batches, especially if it’s large. This prevents timeout and reconnect issues. Below is a sample code snippet illustrating this approach.

def get_full_collection(db, collection_name, aql):
    # Get the total number of documents in the collection.
    # entries = db.collection(collection_name).count()  # slower alternative
    entries = list(aql.execute("RETURN LENGTH(" + collection_name + ")", ttl=1000000))[0]
    print(entries)

    limit = 50000  # block size to request per batch
    docs = []  # final output
    for x in range(int(entries / limit) + 1):
        block = db.collection(collection_name).all(skip=x * limit, limit=limit)
        docs.extend(block)

    return docs

6. Increasing Timeout of ArangoDB Client and TTL Value for Queries to Avoid HTTP and Cursor Timeouts Respectively

Initialize the ArangoDB client with a custom DefaultHTTPClient and set a higher value for request_timeout to accommodate larger datasets. This is how you can do it:

from arango import ArangoClient
from arango.http import DefaultHTTPClient

# Initialize the ArangoDB client with a longer HTTP request timeout (in seconds).
client = ArangoClient(http_client=DefaultHTTPClient(request_timeout=1000000))

Similarly, adjust the ttl (time-to-live) parameter when executing longer-running AQL queries so that the server-side cursor is not discarded before you finish iterating over it. An example of increasing the ttl parameter in aql.execute() is shown below:

aql.execute("RETURN LENGTH(" + collection_name + ")", ttl=1000000)

7. Defining additional indexes

Don’t overlook the importance of adding indexes to frequently used collections in the data ingestion process. Indexing significantly boosts query performance, thus expediting data ingestion procedures.
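For example, with python-arango you can define a persistent index on the fields your ingestion queries filter or join on (the collection and field names below are illustrative):

# db: assumed database handle; collection and field names are illustrative.
customers = db.collection("customers")

# Persistent index on a field used for lookups during ingestion;
# in_background=True avoids locking the collection while the index is built.
customers.add_persistent_index(fields=["customer_key"], unique=True, in_background=True)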

I hope these optimization techniques, consolidated here, help speed up data ingestion in your own ArangoDB projects.

Please feel free to share your experiences with these strategies and their impact on your projects!


Published via Towards AI
