

Embedchain In Action.

Last Updated on November 5, 2023 by Editorial Team

Author(s): Iva Vrtaric

Originally published on Towards AI.

Image created with Midjourney


You’ve likely come across countless articles discussing the creation of intelligent chatbots that sift through data, documents, and links utilizing popular vector databases. Among these, Langchain and LlamaIndex often emerge as the favored solutions for knowledge extraction, adeptly indexing and querying databases.

I have a substantial grounding in this area, having contributed to a course on Langchain and vector databases for production. In this piece, I plan to share the insights gained from grappling with the most challenging aspects of working with these frameworks and tools.

Among the most common issues were:

- How should I chunk the data? What is a meaningful chunk size?
- How should I create embeddings for each chunk? Which embedding model should I use?
- How should I store the chunks in a vector database? Which vector database should I use?
- Should I store metadata along with the embeddings?
- How should I find similar documents for a query? Which ranking model should I use?

A few months back, I stumbled upon Embedchain. I was immediately excited to test its capabilities, and it has since become an integral tool for my personal projects.

So, what makes Embedchain a game-changer in building apps and chatbots?

Let’s peel back the layers of Embedchain to comprehend its methodical approach to crafting chatbots that can sift through any dataset.

The Bot Creation Blueprint

Data Detection and Loading: Embedchain is flexible when it comes to handling different data types. Be it a YouTube clip, a digital book in PDF, an insightful blog post, any web link, or data stored locally, it recognizes and processes them with ease. This eliminates the hassle of managing data loaders or selecting the appropriate format. A straightforward .add command is all it takes to integrate the data into the chatbot or query app.

Data Chunking: Once loaded, data is broken down into meaningful segments. Chunking sets up everything that follows, laying the groundwork for more intelligent and responsive chatbots.

This stage has often been a source of headaches, forcing me into bouts of trial and error. The fine line between retaining vital information and creating manageable chunk sizes was a common issue. Needless to say, I found myself immersed for hours in a bid to avoid the information loss that comes with inappropriate chunking.

Embedding Creation: Each segmented piece undergoes a transformation. Chunks are converted into embeddings — turning raw data into machine-friendly vectors, prepared for storage and retrieval.

The choice of model and method can significantly impact the outcomes. Personally, I’ve leaned on renowned OpenAI models for augmented retrieval, further fueled by a vector database. Embedchain offers an array of options for easy indexing, storing, and retrieving vectors tailored to varying tasks and budgets.

image by author

For those who might be budget-conscious or experimenting, there’s the possibility of utilizing the sentence transformer or even custom models available through Huggingface, which come at no cost. This flexibility means that, with the right configurations and choices, one can effectively run their entire chatbot application without incurring any expenses.
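As a sketch of such a zero-cost setup, here is a configuration dictionary assuming the config schema described in Embedchain's docs at the time of writing; the model names are examples, not recommendations:

```python
# Free-tier configuration: a Hugging Face LLM plus a sentence-transformers
# embedder, so no paid API is involved. Schema and model names are assumptions
# based on the Embedchain docs -- check your installed version.
config = {
    "llm": {
        "provider": "huggingface",
        "config": {"model": "google/flan-t5-large"},
    },
    "embedder": {
        "provider": "huggingface",
        "config": {"model": "sentence-transformers/all-MiniLM-L6-v2"},
    },
}

# Creating the app from this config (models are downloaded on first use):
# from embedchain import App
# app = App.from_config(config=config)
```

With the right configuration, the whole pipeline (embedding, storage, and generation) runs without incurring any API charges.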

Storage: Finally, these machine-readable vectors, encapsulating the essence of the original data, find a home in a vector database, awaiting their call to action during future queries. Embedchain leans on the Chroma vector database — a commendable choice in my experience, and as a bonus, it’s available at no cost.

Functionality: Embedchain offers functionalities designed for intuitive interactions. There’s the Query Interface, which operates as a question-answering bot. By employing the .query() function, it delivers direct answers without holding onto past conversation context.

The Chat Interface provides a more connected experience. Through the .chat function, it recalls the previous conversations, ensuring context-aware responses.

The Dry Run option exists to test prompts without dispatching them to the LLM. And for real-time response, the Stream Response capability comes in handy.

App Example

The best way to demonstrate this is with an example. I created a quick app to query documentation that contains numerous tables, benchmarks, new concepts, and algorithms. I was asked to write a whitepaper for a company about which I knew little. I gathered all the information I could from their website and a few additional sources.

The workflow: I collected the data sources and loaded them. They were then automatically chunked, embedded, and stored in the database, allowing me to query them quickly. Here is the detailed code explaining how I accomplished this:

First, assuming you’ve installed both OpenAI and Embedchain using pip, you can import the OpenAI API key. Make sure to store this key in a separate file to prevent public exposure:
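A minimal sketch of that step, assuming a plain `KEY=VALUE` `.env` file (the `python-dotenv` package does the same thing if you prefer a library):

```python
import os


def load_openai_key(env_path=".env"):
    """Read OPENAI_API_KEY (and any other KEY=VALUE pairs) into os.environ.

    Keeping the key in a separate .env file -- never committed to version
    control -- prevents public exposure of the credential.
    """
    with open(env_path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())
    return os.environ.get("OPENAI_API_KEY")
```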

You can name your bot and assign that name to the App() function. Next, you'll begin the data ingestion process. In my case, I needed information available on the website. Embedchain provides a way to incorporate an entire website, but I only needed specific pages for my investigation:

As you can see, adding links to feed the bot is straightforward with the .add command. When you run the cell, you should see output confirming that the data has been vectorized and stored.


Next, I sought data from YouTube that would give me insights into benchmarking and additional details about the company:

Behind this approach is a model that processes the transcripts, along with a ‘yt_loader’ that fetches the links and incorporates them into the bot. The bot’s like a sponge — the more you throw at it, the juicier it gets. So, I tossed in an ArXiv paper on benchmarking to spice things up.

In this code snippet, I’ve pulled in data from Notion, Google Docs, and GitHub. Because hey, why limit your bot’s worldly wisdom, right?

To get answers from my bot, all I have to do is strike up a casual conversation. It dives into its database, digs up relevant gems, and spits back an eloquent response:

The Output:

To submit benchmarks on the DataPerf platform, 
participants can use the online platform called Dynabench.
They can submit their solutions for evaluation by following
the guidelines provided on the platform. The Dynabench platform
hosts the DataPerf benchmarks, evaluation tools, leaderboards,
and documentation. Participants can submit, evaluate, and compare
their solutions for all data-centric benchmarks defined in the
DataPerf suite. The platform also supports a wide variety of submission
artifacts, such as training subsets, priority values/orderings, and
purchase strategies. If you need more specific instructions on
how to submit benchmarks, it is recommended to refer to the documentation
or guidelines provided on the Dynabench platform.

I can just casually chat with it, and it rifles through its database like a librarian on caffeine, serving up answers generated from all that curated knowledge 🙂

Experiment No. 2

Here’s some added value: I’m including a brief experiment with a Discord bot. What’s remarkable is that I ventured into this with no prior experience in Discord bot development. This showcases how easy it is to explore new areas when using Embedchain.

The docs

The Embedchain documentation is a masterclass in clarity, to be honest. In the labyrinth that is tech documentation, it’s easy to lose sight of what you initially set out to find. With Embedchain, that’s a non-issue. They’ve implemented an automated bot that guides you through the documentation, providing not just answers but also code snippets and resource suggestions. It’s a level of assistance rarely seen in new modules or frameworks, and I’m all for making this the new normal in tech docs.

Discord bot in 5 mins

To set up a bot, start by putting your OpenAI API key in a separate .env file. Next, visit Discord’s developer site to create a new bot application.

There, guided by the documentation, you’ll fill in the bot’s details and adjust a few settings — like turning on all three options under ‘Privileged Gateway Intents’. Don’t forget to save.

Once that’s done, you’ll need to reset and copy the Discord token value for later use. After this, go back to the bot settings, find the OAuth2 section, and update a few more settings, mainly around what the bot can do in terms of sending messages.

You’ll then generate a URL that will let you add your bot to a Discord server. Copy that URL, paste it into a web browser, select the server you want your bot in, and give it the thumbs-up.

Here’s a pro tip:

You can keep your Discord bot online almost constantly and use its embedchain capability to fetch new data on the fly. With the setup I’ve described, you can get your bot functioning quickly and keep it updated just by feeding it links. So, if you’re ever in a situation where you need information immediately, your bot is just a quick query away. No more scrambling to find that elusive piece of data; your bot’s got you covered.

Install the requirements:
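Assuming the package names current at the time of writing (the Discord client library ships on PyPI as discord.py):

```shell
pip install embedchain discord.py python-dotenv
```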


You’re also free to clone the Embedchain repo from GitHub, which comes with functional examples. I’m just here to show you how a little code-tweaking can help the bot align with your particular interests.

For example, I crafted a dog bot influenced by memes that deliver the information I need on cue. In a playful move, I made a small code adjustment to ‘dogsplain’ — a term that allows me to fetch information in my own distinctive chat style with the bot:

Tinkering with the code, you’ll likely uncover numerous ways to design the bot’s style to your preferences. Plus, the experience serves as a learning playground.

caption by author

To feed the bot new data sources, use the /add slash command; to ask questions, use /query or /chat:

/add <data_type> <url_or_text>

/query <question>

/chat <question>

caption by author

Voila! Run the activation code, and your bot springs to life, showing up online instantly.


So, as illustrated in the examples above, crafting a fully functional bot capable of instantly retrieving information in an array of formats — be it text, YouTube videos, PDFs, web pages, sitemaps, Docx, Google Docs, Notion, CSV, MDX, or Q&A pairs — is remarkably straightforward.

I highly recommend checking out the various examples in the community showcase section. You’ll likely find them inspiring and come away with new ideas of your own 😎


Published via Towards AI
