7 Techniques to Enhance Graph Data Ingestion with Python in ArangoDB

Author(s): Saloni Gupta

Originally published on Towards AI.

Photo by Alina Grubnyak on Unsplash

ArangoDB stands out as one of the most versatile graph database solutions available today. My introduction to ArangoDB came during a project transition from Neo4j, where I discovered its query language, the ArangoDB Query Language (AQL), which is reminiscent of Neo4j’s Cypher. If you’re curious about how ArangoDB compares to Neo4j, this detailed resource might interest you:

ArangoDB vs Neo4j – What you can’t do with Neo4j

Neo4j is a single-model graph database. ArangoDB offers the same functionality as Neo4j with more than competitive…

arangodb.com

In the course of this project, I set up a local instance of ArangoDB using Docker and employed the ArangoDB Python driver, python-arango, to develop data ingestion scripts. The data I dealt with reached millions of nodes and relations, which made performance optimization a significant challenge.

Assuming you’re actively working on an ArangoDB graph project, this article shares the optimization strategies I used to improve data ingestion performance so that you can apply similar practices. Let’s delve into each optimization:

Optimizations Applied for Efficient Data Ingestion:

  1. Utilising import_bulk with a batch size for faster import
  2. Leveraging Pandas & import_bulk for Relation Creation Over AQL Queries
  3. Harnessing Batch API for Streamlined Relation Queries
  4. Increasing the Maximum Memory Map Count (vm.max_map_count)
  5. Batch Iteration Through Collections to Mitigate Timeout and Reconnect Issues
  6. Increasing the Timeout of the ArangoDB Client and the TTL Value for Queries to Avoid HTTP and Cursor Timeouts Respectively
  7. Defining additional indexes

Let us look at each one in more detail.
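
All the snippets below assume a connected database handle. Here is a minimal connection sketch, with placeholder host, credentials, and database name:

from arango import ArangoClient

# Connect to a local ArangoDB instance (placeholder credentials).
client = ArangoClient(hosts="http://localhost:8529")
db = client.db("my_database", username="root", password="password")
aql = db.aql  # AQL executor used in later snippets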

1. Utilising import_bulk with a batch size for faster import

Employing import_bulk with a specified batch size proves to be the most efficient method for inserting multiple documents into a collection. Here’s how you can leverage it:

a. For vertex collections:

# import_bulk expects an iterable of dicts; if df_vertex_data is a Pandas
# DataFrame, convert it to records first.
vc1 = db.collection(vertex_collection_name)
vc1.import_bulk(df_vertex_data.to_dict("records"), batch_size=50000)

b. For edge collections, a similar approach can be adopted; point 2 below elaborates on when this is preferable.

# from_prefix/to_prefix prepend the source and target collection names to
# the _from/_to values of each edge document.
ec1 = db.collection(edge_collection_name)
ec1.import_bulk(df_relation_data.to_dict("records"), from_prefix=source, to_prefix=target, batch_size=50000)

Reference: https://docs.python-arango.com/en/main/specs.html#arango.collection.Collection.import_bulk

2. Leveraging Pandas & import_bulk for Relation Creation Over AQL Queries

There are multiple ways to insert relations, i.e., edge documents, into an ArangoDB graph database. An AQL query could do the job using the ‘INSERT’ operation. While AQL queries serve well for simple operations, complex queries involving multiple joins and conditions can significantly slow down the insertion of edges into the edge collection. In such cases, performing the merges, filtering, and transformations in Pandas, followed by bulk insertion with import_bulk (explained in point 1), proves to be more efficient, as sketched below.
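
A hedged sketch of this pattern, where the DataFrame columns and collection names are hypothetical:

import pandas as pd

# Hypothetical raw relation data with source/target keys plus an attribute.
df_relation_data = pd.DataFrame(
    {"src_key": ["1", "2"], "dst_key": ["10", "20"], "weight": [0.5, 0.9]}
)

# Do the joins/filters/transformations in Pandas, then shape the edge docs.
df_edges = df_relation_data.rename(columns={"src_key": "_from", "dst_key": "_to"})

# Bulk-insert; from_prefix/to_prefix expand the keys into full document IDs
# such as "customers/1" and "products/10".
ec1 = db.collection("bought")  # hypothetical edge collection
ec1.import_bulk(df_edges.to_dict("records"), from_prefix="customers", to_prefix="products", batch_size=50000)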

3. Harnessing Batch API for Streamlined Relation Queries

Opt for batch execution for queries that access large volumes of data. Methods such as insert_many(), update_many(), replace_many(), and delete_many() accept multiple documents at once. In python-arango, these methods replace the old batch request API for ArangoDB versions after 3.8.0.
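
A minimal sketch of insert_many, with a hypothetical collection and documents:

# Insert many documents in a single request instead of one call per document.
users = db.collection("users")  # hypothetical collection
users.insert_many(
    [{"_key": str(i), "name": f"user_{i}"} for i in range(10000)],
    overwrite_mode="ignore",  # skip documents whose _key already exists
)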

Reference: https://docs.python-arango.com/en/main/batch.html

4. Increasing the Maximum Memory Map Count (vm.max_map_count)

Increasing vm.max_map_count (the kernel’s maximum number of memory map areas per process) can significantly enhance performance for memory-intensive workloads such as database operations. Adjusting this parameter is essential, especially in Docker environments, where the default is often too low.

Here's how you can adjust it on Windows with Docker Desktop, using PowerShell to enter the WSL VM:

# Enter the Docker Desktop WSL distribution...
wsl -d docker-desktop
# ...then raise the limit for the current session.
sudo sysctl -w "vm.max_map_count=2048000"

Note: You may have to tune this parameter based on your requirements. This is the minimum value that I needed for my data volume to be successfully ingested.

5. Batch Iteration Through Collections to Mitigate Timeout and Reconnect Issues

When retrieving an entire collection during data ingestion, consider iterating through the collection in batches, especially if it’s large. This prevents timeout and reconnect issues. Below is a sample code snippet illustrating this approach.

def get_full_collection(db, collection_name, aql):
    # Get the total number of documents in the collection.
    # db.collection(collection_name).count() also works, but is slow.
    entries = list(aql.execute("RETURN LENGTH(" + collection_name + ")", ttl=1000000))[0]
    print(entries)
    limit = 50000  # block size to request per iteration
    docs = []  # final output
    for x in range(int(entries / limit) + 1):
        block = db.collection(collection_name).all(skip=x * limit, limit=limit)
        docs.extend(block)
    return docs
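
For example, assuming the db and aql handles from the setup sketch above and a hypothetical users collection:

docs = get_full_collection(db, "users", aql)
print(f"Retrieved {len(docs)} documents in blocks of 50,000")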

6. Increasing the Timeout of the ArangoDB Client and the TTL Value for Queries to Avoid HTTP and Cursor Timeouts Respectively

Initialize the ArangoDB client with a custom DefaultHTTPClient and set a higher value for request_timeout to accommodate larger datasets. This is how you can do it:

from arango import ArangoClient
from arango.http import DefaultHTTPClient

# Initialize the ArangoDB client with a large request timeout (in seconds).
client = ArangoClient(http_client=DefaultHTTPClient(request_timeout=1000000))

Similarly, adjust the ttl (time to live) parameter when executing longer-running AQL queries. An example of increasing the ttl parameter in aql.execute() is shared below:

aql.execute("RETURN LENGTH("+collection_name+")", ttl=1000000))

7. Defining additional indexes

Don’t overlook the importance of adding indexes to collections that are queried frequently during data ingestion. Indexing significantly boosts query performance, thus expediting data ingestion procedures; a short sketch follows.
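
As a minimal sketch, assuming a hypothetical users collection and an email attribute that ingestion queries filter on, a persistent index can be added with python-arango like this:

# Add a persistent index on a frequently filtered attribute so lookups
# during ingestion don't fall back to full collection scans.
users = db.collection("users")  # hypothetical collection name
users.add_persistent_index(fields=["email"], unique=False, in_background=True)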

These optimization techniques, consolidated here, should help speed up your own ArangoDB ingestion experiments.

Please feel free to share your experiences with these strategies and their impact on your projects!


Published via Towards AI
