
Talk to your documents as PDFs, txts, and even web pages

Last Updated on September 1, 2023 by Editorial Team

Author(s): Damian Gil

Originally published on Towards AI.

A complete guide to creating a web app, and the intelligence behind it, that lets you ask questions of documents such as PDFs, TXTs, and even web pages using LLMs.

Content Table

· Introduction
· How does it work
· Steps (part 1) 👣
· Break: Let’s recap (part 1) 🌪️
· Steps (part 2) 👣
· Break: Let’s recap (part 2) 🌪️
· Web App
· For the impatient one (code)
· Conclusions
· References

Introduction

We have all had to wade through endless documents just to find the two sentences containing the information we need.

Ever wished you could extract the key information from a document without getting lost in a sea of words?

Look no further! Welcome to the project “Talking to Documents Using LLM”. It’s like recovering treasure from PDFs, TXT files, or web pages without the brain fatigue. And don’t worry, you don’t need a tech wizarding degree to have fun. We’ve created a user-friendly interface with Streamlit, so even your non-tech-savvy friends (marketing assistants, for example) can join in.

This article dives into the theory and code behind the application’s intelligence, shedding light on how the web application works. To give you a quick overview of the technologies we’ll be using, check out the following image, which shows the four main tools in action.

Key Technologies Utilized in This Project (Image by Author).

And of course, you can find the code in my GitHub repository, or if you want to test the code directly, you can jump to the “For the impatient one (code)” section.

How does it work

Let’s take a peek behind the veil of “Talking to Documents Using LLM”. This project is an interesting duo: a smart reader (back-end) and a user-friendly website (front-end).

  • Smart reader: The AI brain that reads and understands documents like a super-smart friend, handling PDFs, TXTs, and web content with ease.
  • User-friendly website: A web interface to configure and interact with the model. It consists of two main pages, each with a specific purpose: Step 1️⃣ Create Data Base and Step 2️⃣ Ask to the document.

Our project’s outline looks something like this:

Pipeline flow of how the project works (Image by Author)

Although it may seem complicated, we will break down every essential step to make it easy to follow. These steps are implemented by the Python TalkDocument class, and we will demonstrate each one with the corresponding code to bring it to life.

Steps (part 1) 👣

Get ready to explore every step of the way as we uncover the magic behind the scenes! 🚀🔍🎩

1- Import documents

This step is obvious, right? This is where you decide what kind of source material you provide. Whether it’s a PDF, plain text, a web URL, or even a raw string, it’s all on the menu.

# Inside the __init__ function, I have commented out the variables
# that we are not interested in at the moment.
# Assumed import for the `dl` alias used below:
# from langchain import document_loaders as dl

def __init__(self, HF_API_TOKEN,
             data_source_path=None,
             data_text=None,
             OPENAI_KEY=None) -> None:

    # You can enter the path of the file.
    self.data_source_path = data_source_path

    # You can enter the file content as a string directly.
    self.data_text = data_text

    self.document = None
    # self.document_splited = None
    # self.embedding_model = None
    # self.embedding_type = None
    # self.OPENAI_KEY = OPENAI_KEY
    # self.HF_API_TOKEN = HF_API_TOKEN
    # self.db = None
    # self.llm = None
    # self.chain = None
    # self.repo_id = None

def get_document(self, data_source_type="TXT"):
    # DS_TYPE_LIST = ["WEB", "PDF", "TXT"]
    # Fall back to the first option if an unknown type is given.
    data_source_type = data_source_type.upper() if data_source_type.upper() in DS_TYPE_LIST else DS_TYPE_LIST[0]

    if data_source_type == "TXT":
        if self.data_text:
            self.document = self.data_text
        elif self.data_source_path:
            loader = dl.TextLoader(self.data_source_path)
            self.document = loader.load()

    elif data_source_type == "PDF":
        if self.data_text:
            self.document = self.data_text
        elif self.data_source_path:
            loader = dl.PyPDFLoader(self.data_source_path)
            self.document = loader.load()

    elif data_source_type == "WEB":
        loader = dl.WebBaseLoader(self.data_source_path)
        self.document = loader.load()

    return self.document

Depending on the type of file you provide, the document is loaded by a different method. An interesting improvement would be to add automatic format detection, as sketched below.
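As a taste of what that improvement might look like, here is a minimal sketch, assuming we only need to distinguish URLs, PDFs, and plain text; the detect_source_type helper is hypothetical and not part of the TalkDocument class:

import os

def detect_source_type(data_source_path=None, data_text=None):
    # Hypothetical helper: guess the data_source_type argument
    # for get_document() from the input itself.
    if data_text is not None:
        return "TXT"  # raw string input
    if data_source_path and data_source_path.startswith(("http://", "https://")):
        return "WEB"
    if data_source_path and os.path.splitext(data_source_path)[1].lower() == ".pdf":
        return "PDF"
    return "TXT"  # default fallback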

2- Split Type

Now, you may be wondering why the word “type” appears here. Well, in the app, you can choose the splitting method. But before diving into that, let’s explain what document splitting is.

Think of it like this: Just as humans need chapters, paragraphs, and sentences to structure information (imagine reading a book made of one endless paragraph!), machines need structure too. We need to cut the document into several smaller parts to understand it better. This splitting can be done by character or by token.

# SPLIT_TYPE_LIST = ["CHARACTER", "TOKEN"]
# Assumed import for the `ts` alias used below:
# from langchain import text_splitter as ts

def get_split(self, split_type="character", chunk_size=200, chunk_overlap=10):

    # Fall back to the first option if an unknown type is given.
    split_type = split_type.upper() if split_type.upper() in SPLIT_TYPE_LIST else SPLIT_TYPE_LIST[0]

    if self.document:

        if split_type == "CHARACTER":
            text_splitter = ts.RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
        elif split_type == "TOKEN":
            text_splitter = ts.TokenTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

        # If you input a string as a document, we'll perform a split_text.
        if self.data_text:
            try:
                self.document_splited = text_splitter.split_text(text=self.document)
            except Exception as error:
                print(error)

        # If you upload a document, we'll do a split_documents.
        elif self.data_source_path:
            try:
                self.document_splited = text_splitter.split_documents(documents=self.document)
            except Exception as error:
                print(error)

    return self.document_splited

3- Embedding Type

We humans understand words and pictures easily, but machines need a little more guidance. This becomes obvious when:

  • We convert categorical variables in a dataset into numbers
  • We feed images into neural networks. For example, before an image enters a neural network model, it is transformed into a numerical tensor.

As we can see, mathematical models speak the language of numbers. The same phenomenon appears in the field of NLP, which is where the concept of word embeddings comes in.

In essence, what we’re doing in this step is converting the splits from the previous stage (document chunks) into numerical vectors.

This conversion or encoding is done using specialized algorithms. It is important to note that this process converts sentences into numerical vectors, and the encoding is not random; it follows a structured method.

The code is very simple:
We instantiate the object responsible for the embedding. It’s important to note that at this point, we’re just creating the embedding object. The actual conversion happens in the next steps.

# EMBEDDING_TYPE_LIST = ["HF", "OPENAI"]
# Assumed import for the `embeddings` alias used below:
# from langchain import embeddings

def get_embedding(self, embedding_type="HF", OPENAI_KEY=None):

    if not self.embedding_model:

        embedding_type = embedding_type.upper() if embedding_type.upper() in EMBEDDING_TYPE_LIST else EMBEDDING_TYPE_LIST[0]

        # If we choose to use the Hugging Face model for embedding
        if embedding_type == "HF":
            self.embedding_model = embeddings.HuggingFaceEmbeddings()

        # If we opt for the OpenAI model for embedding
        elif embedding_type == "OPENAI":
            self.OPENAI_KEY = self.OPENAI_KEY if self.OPENAI_KEY else OPENAI_KEY
            if self.OPENAI_KEY:
                self.embedding_model = embeddings.OpenAIEmbeddings(openai_api_key=self.OPENAI_KEY)
            else:
                print("You need to provide an OpenAI API key")

        # Remember which embedding type this instance uses.
        self.embedding_type = embedding_type

    return self.embedding_model

Break: Let’s recap (part 1) 🌪️

To grasp the initial three steps, let’s consider the following example:

The initial three steps of the process (Image by Author). Note: In this example, the vectors are shown with 6 dimensions and were randomly generated for illustration.

We start with an input text.

  1. We split the text based on the number of characters (about 50 characters in this example).
  2. We embed the chunks, turning the text snippets into numerical vectors.
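To make the recap concrete, here is a minimal usage sketch of these first three steps; the file name and token are placeholders, and the class is assumed to be imported from the repository:

# Hypothetical usage of steps 1-3 with the TalkDocument class.
td = TalkDocument(HF_API_TOKEN="your_huggingface_token",
                  data_source_path="annual_report.pdf")

document = td.get_document(data_source_type="PDF")        # step 1: load
splits = td.get_split(split_type="character",             # step 2: split
                      chunk_size=200, chunk_overlap=10)
embedding_model = td.get_embedding(embedding_type="HF")   # step 3: embedding object

print(f"{len(splits)} chunks ready to be embedded")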

How wonderful! 🚀 Now, let’s continue the thrilling journey through the remaining steps. Fasten your seat belt, because we’re about to unlock some real tech magic! 🔥🔓

Steps (part 2) 👣

Let’s continue with the remaining steps.

4- Vector Store Type

Now that we’ve turned our text into vectors (embeddings), we need a place to store them. This is where the concept of the “vector store” comes in. It’s like a smart library of these vectors, making it easy to find and retrieve similar ones when a question is asked.

Think of it as a neat storage space that allows you to quickly get back to what you need!

The creation of this type of database is managed by specialized algorithms designed for this purpose, such as FAISS (Facebook AI Similarity Search). There are other options as well, and currently, this class supports CHROMA and SVM.

This step and the next step share code. In this step, you choose the type of vector repository you want to create, while the next step is where the actual creation takes place.

5- Vector Store (Creation)

This type of database handles two main aspects:

  • Vector storage: It stores the vectors generated by the embedding.
  • Similarity calculation: It calculates the similarity between vectors.

But what exactly is the similarity between these vectors and why does it matter?

Well, remember I mentioned that the embedding is not random? It is designed so that words or phrases with similar meanings have similar vectors. This way, we can calculate the distance between vectors (for example, using Euclidean distance), which gives us a measure of their “similarity”.

To visualize this with an example, imagine we have three sentences.

Concept of embedding and similarity. Note that this representation simplifies the actual process. For a more in-depth understanding, refer to the article Introduction to Facebook AI Similarity Search (Faiss). (Image by Author)

Two are related to recipes, while the third is about motorcycles. By representing them as vectors (thanks to the embeddings), we can calculate the distance between these points or sentences. This distance serves as a measure of their similarity.
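To make the notion of distance concrete, here is a tiny NumPy sketch with made-up 3-dimensional vectors standing in for real embeddings:

import numpy as np

# Toy vectors: two "recipe" sentences and one "motorcycle" sentence.
recipe_1 = np.array([0.9, 0.1, 0.0])
recipe_2 = np.array([0.8, 0.2, 0.1])
motorbike = np.array([0.1, 0.9, 0.7])

# Euclidean distance: the smaller the value, the more similar the sentences.
print(np.linalg.norm(recipe_1 - recipe_2))   # small distance (similar topics)
print(np.linalg.norm(recipe_1 - motorbike))  # large distance (different topics)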

In terms of code, let’s take a look at what this step requires:

  1. The split text
  2. The embedding type
  3. The vector store model
# VECTORSTORE_TYPE_LIST = ["FAISS", "CHROMA", "SVM"]
# Assumed imports for the `vs` and `retrievers` aliases used below:
# from langchain import vectorstores as vs
# from langchain import retrievers

def get_storage(self,
                vectorstore_type="FAISS",
                embedding_type="HF",
                OPENAI_KEY=None):

    self.embedding_type = self.embedding_type if self.embedding_type else embedding_type
    vectorstore_type = vectorstore_type.upper() if vectorstore_type.upper() in VECTORSTORE_TYPE_LIST else VECTORSTORE_TYPE_LIST[0]

    # Here we call the algorithm that performs
    # the embedding and create the embedding object.
    self.get_embedding(embedding_type=self.embedding_type, OPENAI_KEY=OPENAI_KEY)

    # Here we choose the type of vector store that we want to use.
    if vectorstore_type == "FAISS":
        model_vectorstore = vs.FAISS
    elif vectorstore_type == "CHROMA":
        model_vectorstore = vs.Chroma
    elif vectorstore_type == "SVM":
        model_vectorstore = retrievers.SVMRetriever

    # Here we create the vector store. In this case,
    # the document comes from raw text.
    if self.data_text:
        try:
            self.db = model_vectorstore.from_texts(self.document_splited,
                                                   self.embedding_model)
        except Exception as error:
            print(error)

    # Here we create the vector store. In this case,
    # the document comes from a file like PDF, TXT...
    elif self.data_source_path:
        try:
            self.db = model_vectorstore.from_documents(self.document_splited,
                                                       self.embedding_model)
        except Exception as error:
            print(error)

    return self.db
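Continuing the earlier usage sketch, creating the vector store from the loaded and split document could look like this (again, a hypothetical call, not a fixed recipe):

# Hypothetical usage, continuing from the td object above.
db = td.get_storage(vectorstore_type="FAISS", embedding_type="HF")
print(type(db))  # the vector store now holding the embedded chunks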

To illustrate what happens in this step, we can visualize the following image. It shows how the embedded text fragments are stored in the vector store, allowing us to calculate the distance/similarity between vectors/points.

Visual representation of the creation of the vector store (Image by Author). Note: values are not real.

Break: Let’s recap (part 2) 🌪️

Great! We now have a database capable of storing our documents and calculating the similarity between embedded text chunks. Imagine a situation in which we embed an external sentence and place it in our vector store. That lets us calculate the distance between the new vector and the document splits. (Note that the same embedding model must be used as when the vector store was created.) We can visualize this in the following image.

Example of calculating distances between points/texts. Note that this representation simplifies the actual process. For a more in-depth understanding, refer to the article Introduction to Facebook AI Similarity Search (Faiss). (Image by Author).

Remember the previous picture: we have two questions, and we turn them into numbers using the embedding. We then measure the distance and find the sentences closest to our question. Those sentences are perfect matches! It’s like finding the best puzzle piece in an instant! 🚀🧩🔍

6- Question

Remember that our goal is to ask a question about a document and get an answer. In this step, we collect the question as input from the user.

7 & 8- Relevant Splits

This is where we send our question to the vector store, embedding it to turn the sentence into a numerical vector. We then calculate the distance between our question and the parts of the document, determining which parts are closest to it. The code is:

# Depending on the type of vector store we've built,
# we'll use a specific function. All of them return
# a list of the most relevant splits.

def get_search(self, question, with_score=False):

    relevant_docs = None

    # FAISS and Chroma expose the similarity_search interface.
    if self.db and "SVM" not in str(type(self.db)):
        if with_score:
            relevant_docs = self.db.similarity_search_with_relevance_scores(question)
        else:
            relevant_docs = self.db.similarity_search(question)

    # The SVM retriever uses the retriever interface instead.
    elif self.db:
        relevant_docs = self.db.get_relevant_documents(question)

    return relevant_docs

Give it a try with a snippet like the one below. The answer comes back as a list of 4 items, and when you look inside, each one is a chunk of the document you provided. It’s like the app presenting you with the 4 puzzle pieces that best fit your question! 🧩💬
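A minimal sketch of that check, continuing with the same td object from the earlier examples (the question is a placeholder):

# Hypothetical usage, continuing from the td object above.
relevant_docs = td.get_search("What is the main topic of the document?")
print(len(relevant_docs))  # typically 4 results by default
print(relevant_docs[0])    # the chunk most similar to the question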

9- Response (Natural Language)

Great, now we have the most relevant text for our question. But we can’t just hand those chunks over to the user and call it a day. We need a concise and precise answer to the question. And that’s where our Large Language Model (LLM) comes into play! There are many LLM flavors; in the code, we set “flan-alpaca-large” as the default. Don’t hesitate to choose whichever one you like best! 🚀🎉

Here is the plan:

  1. We retrieve the most relevant chunks (splits) related to the question.
  2. We prepare a prompt that includes the question, how we want the answer, and those text chunks.
  3. We pass this prompt to our Large Language Model (LLM). It sees the question and the information contained in those chunks and gives us a natural-language answer.

This last part is shown in the following image: 🖼️

Pipeline for Obtaining the Answer 🌐📜🧠 (Image by Author).

Sure enough, this exact flow is what gets executed in the code. You’ll notice there’s an extra step involved in creating a “prompt”. There are several strategies for combining the retrieved chunks with the LLM, but for simplicity we use the simplest one: “stuff”. This means all the relevant chunks are stuffed into a single prompt and the LLM answers in one pass. No need to get lost in theory here! 🌟

# Assumed imports:
# from langchain import HuggingFaceHub, PromptTemplate
# from langchain.chains.question_answering import load_qa_chain

def do_question(self,
                question,
                repo_id="declare-lab/flan-alpaca-large",
                chain_type="stuff",
                relevant_docs=None,
                with_score=False,
                temperature=0,
                max_length=300):

    # We get the most relevant splits.
    relevant_docs = self.get_search(question, with_score=with_score)

    # We define the LLM that we want to use.
    # We must provide the repo id, since we are using Hugging Face.
    self.repo_id = self.repo_id if self.repo_id is not None else repo_id
    # CHAIN_TYPE_LIST contains the supported chain types; "stuff" is used here.
    chain_type = chain_type.lower() if chain_type.lower() in CHAIN_TYPE_LIST else CHAIN_TYPE_LIST[0]

    # This check is necessary since we can call the function several times,
    # but it would not make sense to create an LLM on every call.
    # So it checks whether an LLM already exists inside the class
    # or whether the repo_id (the type of LLM) has changed.
    if (self.repo_id != repo_id) or (self.llm is None):
        self.repo_id = repo_id

        # We create the LLM.
        self.llm = HuggingFaceHub(repo_id=self.repo_id,
                                  huggingfacehub_api_token=self.HF_API_TOKEN,
                                  model_kwargs={"temperature": temperature,
                                                "max_length": max_length})

    # We create the prompt.
    prompt_template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
If the question is similar to [Talk me about the document],
the response should be a summary commenting on the most important points about the document


{context}
Question: {question}
"""

    PROMPT = PromptTemplate(
        template=prompt_template, input_variables=["context", "question"]
    )

    # We check whether a chain is already defined;
    # if it does not exist, it is created (chain_type="stuff").
    self.chain = self.chain if self.chain is not None else load_qa_chain(self.llm,
                                                                         chain_type=chain_type,
                                                                         prompt=PROMPT)

    # We make the query to the LLM using the prompt.
    response = self.chain({"input_documents": relevant_docs, "question": question},
                          return_only_outputs=True)

    return response
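Putting everything together, an end-to-end run might look like the sketch below; the file name, token, and question are placeholders, and the "output_text" key follows LangChain’s QA chain output format:

# Hypothetical end-to-end usage of the TalkDocument class.
td = TalkDocument(HF_API_TOKEN="your_huggingface_token",
                  data_source_path="annual_report.pdf")
td.get_document(data_source_type="PDF")
td.get_split(split_type="character", chunk_size=200, chunk_overlap=10)
td.get_storage(vectorstore_type="FAISS", embedding_type="HF")

response = td.do_question("What were the key results this year?")
print(response["output_text"])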

You’ve rocked it, reaching the end of how the brain of our web app works. High-five if you’re still here! Now, let’s dive into the final stretch of the article, where we’ll explore the web app’s pages. 🎉🕵️‍♂️

Web App

This interface is designed for people without technical knowledge, so you can get the most out of this technology without any problems. 🚀👩‍💻

Using the site is extremely simple yet powerful. Basically, the user just needs to provide a document. If you don’t feel like tuning the additional parameters, you can start asking questions right away in the next step. 🌟🤖

Let’s see the steps to follow:

  1. Provide the document or web link, on the page Step 1️⃣ Create Data Base.
  2. Set up the configuration (optional), on the page Step 1️⃣ Create Data Base.
  3. Ask your questions and get answers, on the page Step 2️⃣ Ask to the document! 📚🔍🚀

1. Provide the document or web link.

In this step, you add the document by uploading it to the web app. Remember that if you don’t have the Hugging Face API key set as an environment variable, the “home” tab will ask you to enter it. 📂🔑

Page where you provide the document or URL: 📁🌐 (Image by Author). Real image of the web application on the page “Step 1️⃣ Create Data Base”.

Once you’ve attached the document and set up your preferences, a summary table of your configuration is displayed. The “Create Database” button is then unlocked. When you click the button, the database or vector store will be created, and you’ll be redirected to the question section. 📑🔒🚀

Summary Table: 📊📋 (Image by Author). Real image of the web application on the page “Step 1️⃣ Create Data Base”.

2. Set up the configuration (optional).

As I mentioned earlier, we can configure the vector store according to our preferences. There are different methods for splitting documents, embedding, and more. This tab allows you to customize it as you like. It comes with a default configuration for faster setup. ⚙️🛠️

Vector Store and LLM Configuration: ⚙️🔧🤖 (Image by Author). Real image of the web application.

3. Ask your questions and get answers!

At this point, you can start making all the queries you want. Now it’s time to enjoy and interact with the tool to your heart’s content! 🤗🔍💬

Page for Asking Questions: 💬🔍 (Image by Author). Real image of the web application on the page “Step 2️⃣ Ask to the document”.

For the impatient one (code)

For those who are eager to dive in, you can directly grab the TalkDocument class and paste it into a Jupyter notebook to start tinkering. You might need to install a few dependencies, but I’m sure that won’t be a challenge for you. Have a great time exploring and experimenting! Happy coding! 🚀📚😄

The TalkDocument class to play with (Code by Author)

Conclusions

Congratulations on making it this far in our exciting journey! We’ve delved into how the intelligence behind this web application works. From uploading documents to obtaining answers, you’ve covered a lot of ground! If you’re feeling inspired, the code is available in my GitHub repository. Feel free to collaborate and reach out to me on LinkedIn for any questions or suggestions! Have fun exploring and experimenting with this powerful tool!

If you want, you can check out my GitHub: github.com/damiangilgonzalez1995

References

· Introduction to Facebook AI Similarity Search (Faiss)

Published via Towards AI
