Next-Gen Search powered by Jina
Last Updated on May 3, 2021 by Editorial Team
Author(s): Shubham Saboo
Since the inception of online search, the world has changed dramatically, but the "curiosity" that fuels the business remains constant...
What is Neural Semantic Search?
Neural search is an intelligent approach to retrieving contextual and semantically relevant information. Instead of giving a machine a set of hand-written rules for working out what each piece of data is, neural search does the same job with a pre-trained neural network. This means developers don't have to write every little rule, which saves them time and headaches, and the system trains itself to get better as it goes along.
Conventional Search vs. Neural Search
Conventional search
- Conventional search is symbolic and keyword-driven, so it lacks the necessary context.
- Conventional search is fragile because of its hard-coded rule engines.
- Conventional search requires its rules to be updated whenever new data is added, making it time-consuming and hard to scale.
- Conventional search requires a level of domain knowledge to implement.
Neural/Semantic Search
- Neural search is context-driven, enabling it to find semantically relevant information.
- Neural search is flexible in adapting to corner cases and resilient to noise.
- Neural search can train itself on new data using past inferences and context, making it highly scalable and efficient.
- Neural search requires little to no domain knowledge to implement.
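The contrast above can be illustrated with a minimal sketch in plain Python. It does not use Jina, and the three-dimensional word vectors are hypothetical values hand-picked for this illustration; a real neural search system would obtain its embeddings from a pre-trained network.

```python
import math

# Toy corpus: neither sentence contains the word "computer".
DOCS = [
    "The laptop booted after the processor was replaced.",
    "The chef seasoned the soup with fresh basil.",
]

# Hypothetical 3-d word vectors (axes: technology, food, other),
# hand-picked for illustration only.
WORD_VECTORS = {
    "computer": (0.9, 0.0, 0.1),
    "laptop": (0.8, 0.0, 0.2),
    "processor": (0.9, 0.1, 0.0),
    "chef": (0.0, 0.9, 0.1),
    "soup": (0.0, 0.8, 0.0),
    "basil": (0.1, 0.9, 0.0),
}


def tokenize(text):
    """Lowercase the text and strip basic punctuation from each word."""
    return [w.strip(".,").lower() for w in text.split()]


def keyword_search(query, docs):
    """Conventional search: return docs sharing at least one exact word."""
    query_words = set(tokenize(query))
    return [d for d in docs if query_words & set(tokenize(d))]


def embed(text):
    """Average the vectors of the known words in the text."""
    vecs = [WORD_VECTORS[w] for w in tokenize(text) if w in WORD_VECTORS]
    if not vecs:
        return (0.0, 0.0, 0.0)
    return tuple(sum(axis) / len(vecs) for axis in zip(*vecs))


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_search(query, docs):
    """Embedding-based search: rank docs by similarity to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)


print(keyword_search("computer", DOCS))      # no exact match exists
print(semantic_search("computer", DOCS)[0])  # the laptop sentence ranks first
```

The keyword search comes back empty because no document contains the literal query word, while the embedding-based ranking still surfaces the sentence about the laptop.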
What is Jina?
Jina is a cloud-native neural search platform. It can be deployed in containers, in the cloud, or on on-prem servers. It offers anything-to-anything search, ranging from text-to-text, image-to-image, and video-to-video to any other data type that you can feed as input to the engine. Jina operates on its own primitive data type, known as a document. Documents are the pieces of data in any dataset you want to search, as well as the input queries you use to find what you want.
In other words, documents are the input and output data for Jina search workflows. Jina core comprises two main flows, which are the heart and soul of the semantic search engine:
- Indexing Flow: An indexing flow makes the whole corpus searchable by sentence. It prepares and pre-processes the data to be searched: the input documents are fed in and processed, and the output at the other end is stored as searchable indexes.
- Querying Flow: A querying flow takes the user query as an input document (the primitive Jina data type) and returns a list of ranked matches based on the similarity scores between the embeddings.
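The two flows can be sketched in a few lines of plain Python. This is a minimal illustration, not Jina's API: the function names `index_flow` and `query_flow` are hypothetical, and a bag-of-words count vector stands in for the neural encoder.

```python
import math
from collections import Counter


def embed(sentence):
    """Stand-in encoder: a bag-of-words count vector, not a neural embedding."""
    return Counter(sentence.lower().replace(".", "").split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def index_flow(corpus):
    """Indexing: split the corpus into sentences, encode each, store the pairs."""
    sentences = [s.strip() for s in corpus.split(".") if s.strip()]
    return [(s, embed(s)) for s in sentences]


def query_flow(index, query, top_k=2):
    """Querying: encode the query and return the top_k (score, sentence) matches."""
    q = embed(query)
    scored = [(cosine(q, vec), sentence) for sentence, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


corpus = ("Jina is a neural search framework. "
          "Paris is the capital of France. "
          "Search engines rank documents.")
index = index_flow(corpus)
results = query_flow(index, "neural search")
for score, sentence in results:
    print(round(score, 3), sentence)
```

The index is built once and reused for every query, which is why indexing and querying are kept as separate flows.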
Jina Components
A flow represents a high-level task, e.g. indexing, searching, or training. It consists of a group of pods and orchestrates them to accomplish one task. A pod is a group of executors sharing the same properties; it allows parallel execution of multiple executors and adds context and control to them.
An executor represents an algorithmic unit in Jina. Algorithms such as encoding images into vectors, storing vectors on disk, and ranking results can all be formulated as executors. The executor provides useful interfaces, allowing AI developers and engineers to focus on the algorithm. Some common executors are as follows:
- Crafter: The crafter pre-processes the documents and splits them into chunks.
- Encoder: The encoder takes the pre-processed chunks of documents from the crafter and encodes them into embedding vectors.
- Indexer: The indexer takes the encoded vectors as input, then indexes and stores them in a key-value fashion.
- Ranker: The ranker runs on the indexed storage and sorts the results according to a ranking criterion.
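To make the division of labour concrete, here is a minimal sketch of the four executor roles as plain Python classes. The class and method names are hypothetical and are not Jina's own interfaces; the encoder is a simple letter-frequency vector standing in for a neural model, and the ranking criterion is assumed to be cosine similarity.

```python
import math


class Crafter:
    """Pre-processes a document and splits it into sentence chunks."""

    def craft(self, document):
        return [c.strip().lower() for c in document.split(".") if c.strip()]


class Encoder:
    """Stand-in encoder: a 26-d letter-frequency vector per chunk."""

    def encode(self, chunk):
        vec = [0.0] * 26
        for ch in chunk:
            if "a" <= ch <= "z":
                vec[ord(ch) - ord("a")] += 1.0
        return vec


class Indexer:
    """Stores chunks and vectors in key-value fashion: id -> (chunk, vector)."""

    def __init__(self):
        self.store = {}

    def add(self, chunk, vector):
        self.store[len(self.store)] = (chunk, vector)


class Ranker:
    """Sorts stored chunks by cosine similarity to the query vector."""

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def rank(self, query_vector, store):
        scored = [(self._cosine(query_vector, vec), chunk)
                  for chunk, vec in store.values()]
        return sorted(scored, reverse=True)


crafter, encoder, indexer, ranker = Crafter(), Encoder(), Indexer(), Ranker()

# Index time: craft -> encode -> index.
for chunk in crafter.craft("Cats purr. Dogs bark."):
    indexer.add(chunk, encoder.encode(chunk))

# Query time: craft -> encode -> rank against the stored vectors.
query_chunk = crafter.craft("Cats purr")[0]
ranked = ranker.rank(encoder.encode(query_chunk), indexer.store)
print(ranked[0][1])
```

Because each role sits behind its own small interface, the letter-frequency encoder could be swapped for a real model without touching the crafter, indexer, or ranker.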
Search Modalities
Jina is a data-type-agnostic framework that lets you work with any type of data and run cross-modal and multi-modal search flows.
- Single Modality Search: The input and the output are of the same type, as in text-to-text, image-to-image, or audio-to-audio search. Because such a search is designed around a single data type, it is less flexible and fragile when given input of a different data type.
- Cross Modality Search: It enables you to find relevant documents of modality A (let's say "image") by querying with documents from modality B (let's say "text").
- Multi-Modality Search: It enables you to project documents of different modalities into a common embedding space and find relevant documents with respect to the fusion of multiple modalities. In other words, the query merges information from different modalities, for example a fused (text + image) input, and the output can vary depending on how the model interprets it.
Support for different types of modalities unlocks a lot of powerful patterns and makes Jina fully flexible and agnostic to what can be searched.
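The cross-modal and multi-modal cases can be sketched with a toy shared embedding space. Everything here is an assumption made for illustration: the file names, the captions, and the two-dimensional vectors are hypothetical, and averaging is only one simple way to fuse modalities. In practice the vectors would come from encoders trained to place text and images in the same space.

```python
import math

# Hypothetical embeddings in a shared 2-d space
# (axes: "animal-ness", "vehicle-ness"), hand-picked for illustration.
IMAGE_INDEX = {
    "cat.jpg": (0.9, 0.1),
    "truck.jpg": (0.1, 0.9),
    "horse_cart.jpg": (0.6, 0.6),
}
TEXT_EMBEDDINGS = {
    "a sleeping kitten": (0.95, 0.05),
    "a delivery van": (0.05, 0.95),
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vector, index):
    """Return the key of the indexed item most similar to the query vector."""
    return max(index, key=lambda name: cosine(query_vector, index[name]))


def fuse(*vectors):
    """Multi-modal fusion by simple averaging (one possible choice)."""
    return tuple(sum(axis) / len(vectors) for axis in zip(*vectors))


# Cross-modal: a text query retrieves an image.
cross_modal_hit = search(TEXT_EMBEDDINGS["a sleeping kitten"], IMAGE_INDEX)

# Multi-modal: a text query fused with an image query.
fused_query = fuse(TEXT_EMBEDDINGS["a sleeping kitten"], IMAGE_INDEX["truck.jpg"])
multi_modal_hit = search(fused_query, IMAGE_INDEX)

print(cross_modal_hit)   # cat.jpg
print(multi_modal_hit)   # horse_cart.jpg
```

The fused query lands between the "animal" and "vehicle" directions, so the best match is the image that combines both.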
Jina in Action
To showcase a live demo, I have designed a simple neural semantic search for textual data. The model is trained on data taken from a random Wikipedia page. Jina takes the input document and runs it through its internal flows (indexing followed by querying) to produce a search engine.
Frameworks/Tools Used:
- Jina Core: It enables the indexing and querying workflows for the respective application.
- Language Model: The language model used here comes from the BERT (Bidirectional Encoder Representations from Transformers) family. Here we have used "distilbert-base-cased" to understand the context in the querying flow of Jina.
- Jina Box: Jina Box is an easy-to-use, lightweight, customizable front-end web component for data-type-agnostic search (be it text, audio, video, etc.). It can be easily connected to the Jina backend, giving the user a simple and efficient interface for interacting with the search engine.
- Python 3.7: It is used as the development environment for the Jina application.
Example: Here we search for "computer" in the search box and get the following results. Interestingly, the exact word "computer" is not mentioned anywhere in the indexed document, yet the model still finds the sentences that are contextually or semantically related to computers.
If you would like to learn more or want me to write more on this subject, feel free to reach out.
Next-Gen Search powered by Jina was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
Published via Towards AI