Professional-Grade Semantic Search Optimization: Accurate Result Re-Ranking with Cohere AI, BGE Re-Ranker, and Jina Reranker
1. Introduction
1.1 RAG
Before we talk about re-ranking tools, we need a little background on RAG.
Retrieval Augmented Generation (RAG) is an emerging AI technology stack that enhances the capabilities of large language models (LLMs) by providing them with additional "up-to-date knowledge".
The basic RAG application consists of four key technical components:
- Embedding model: converts external documents and user queries into embedding vectors
- Vector database: stores the embedding vectors and performs vector similarity search (retrieving the Top-K most relevant pieces of information)
- Prompt engineering: combines the user's question and the retrieved context into the input for the large model
- Large Language Model (LLM): generates the response
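To make the division of labor concrete, here is a minimal sketch of how these four components fit together in a basic RAG query flow. The embed_model, vector_db, and llm objects are hypothetical placeholders rather than a specific library API:

# Minimal sketch of a basic RAG query flow (component names are illustrative placeholders)
def answer(question: str, embed_model, vector_db, llm, top_k: int = 5) -> str:
    # 1. Embedding model: convert the user query into a vector
    query_vector = embed_model.encode(question)

    # 2. Vector database: retrieve the Top-K most similar document chunks
    hits = vector_db.search(query_vector, limit=top_k)
    context = "\n\n".join(hit.text for hit in hits)

    # 3. Prompt engineering: combine the question and the retrieved context
    prompt = f"Answer the question using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

    # 4. LLM: generate the final response
    return llm.generate(prompt)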
The basic RAG architecture described above can effectively address the problem of LLMs "hallucinating" and generating unreliable content. However, some enterprise users demand higher contextual relevance and Q&A accuracy, which requires a more sophisticated architecture. A proven and popular approach is to integrate a Reranker into the RAG application.
Semantic search provides search functionality based on the contextual meaning of text passages. It addresses the limitations of alternative methods such as keyword search.
For example, take the query "places to eat". A semantic search model can automatically associate it with "restaurant" because the two have similar meanings. This is not possible with keyword search, where the results are limited to matches on keywords such as "place", "go", and "eat".
It's like having a conversation with a search engine that understands not only what you're asking, but why you're asking. This is where natural language processing, artificial intelligence, and machine learning come in. They make a concerted effort to understand the user's query, the context of the query, and the user's intent. Semantic search examines the relationships between words or the meaning of words to provide more accurate and relevant search results than traditional keyword searches.
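As a small illustration of the "places to eat" vs. "restaurant" example, the snippet below uses the open-source sentence-transformers library purely for demonstration; it is not one of the rerankers discussed later:

from sentence_transformers import SentenceTransformer, util

# Illustrative only: a small open-source embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

query = "places to eat"
candidates = ["restaurant", "hardware store", "parking lot"]

# Encode the query and candidates into dense vectors
embeddings = model.encode([query] + candidates)

# Cosine similarity between the query and each candidate
scores = util.cos_sim(embeddings[0], embeddings[1:])
print(scores)  # "restaurant" should score noticeably higher than the others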
1.2 Reranker
The Reranker is an important part of the Information Retrieval (IR) ecosystem: it evaluates search results and re-orders them to improve query relevance. In RAG applications, a Reranker is mainly used after the vector (ANN) retrieval step: it determines the semantic relevance between the retrieved documents and the query more precisely, re-orders the results at a finer granularity, and ultimately improves search quality.
Currently, there are two main types of Reranker: statistical Rerankers and Rerankers based on deep learning models:
- A statistical Reranker aggregates the candidate results from multiple sources and recalculates a score for every result, either with a weighted combination of the multi-way recall scores or with the Reciprocal Rank Fusion (RRF) algorithm, then re-ranks all candidates uniformly (a minimal RRF sketch follows this list). This type of Reranker is computationally simple and efficient, and is therefore widely used in traditional, latency-sensitive search systems.
- Rerankers based on deep learning models, often called Cross-encoder Rerankers, are specially trained neural networks that jointly encode the query and each document and can therefore model their correlation very well. These Rerankers score the semantic relevance between a query and a document. Because this score generally depends only on the textual content of the query and the document, and not on the scores or relative positions of the documents in the recall results, this type of Reranker is suitable for both single-way and multi-way recall.
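As referenced above, here is a minimal sketch of Reciprocal Rank Fusion. It assumes each recall channel returns an ordered list of document IDs; the constant k = 60 is the commonly used default:

from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k: int = 60):
    """Fuse several ranked lists of document IDs into one re-ranked list."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            # Each list contributes 1 / (k + rank) for every document it recalled
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a dense-vector recall with a keyword (BM25) recall
dense_recall = ["doc3", "doc1", "doc7"]
keyword_recall = ["doc1", "doc9", "doc3"]
print(reciprocal_rank_fusion([dense_recall, keyword_recall]))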
1.3 Reranker's role in the RAG
Integrating Reranker into a RAG application can significantly improve the accuracy of the answers generated, as Reranker is able to select the closest documents to the question among single or multiple recalls. Additionally, expanding the richness of the search results (e.g., multiple recalls) combined with fine-grained filtering of the most relevant results (Reranker) can further improve the quality of the final results. Using Reranker, the first level of recall can exclude content that is not very relevant to the problem, further narrowing the context input to the larger model to a small set of the most relevant documents. By shortening the context, the LLM is able to pay more "attention" to everything in the context, avoiding neglecting key content and saving inference costs.
The figure above shows the architecture of the RAG application with the addition of Reranker. It can be seen that this retrieval system consists of two phases:
- Retrieve the Top-K related documents from the vector database; sparse embeddings can also be used alongside this to cover full-text retrieval.
- The Reranker then scores and re-orders the retrieved documents according to their relevance to the query. The top results are selected as the Context in the Prompt and passed to the LLM to generate a higher-quality, more relevant answer (a minimal sketch of this two-stage flow is shown after this list).
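A minimal sketch of this two-stage retrieve-then-rerank flow. The vector_db object is a hypothetical placeholder, and the open-source cross-encoder from sentence-transformers is used purely as an illustration; any of the rerankers discussed later could take its place:

from sentence_transformers import CrossEncoder

# Illustrative cross-encoder; Cohere Rerank, BGE Re-Ranker, or Jina Reranker could be used instead
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(query, query_vector, vector_db, top_k=20, top_n=5):
    # Stage 1: fast ANN search in the vector database returns a coarse Top-K candidate set
    candidates = vector_db.search(query_vector, limit=top_k)

    # Stage 2: the cross-encoder scores each (query, document) pair individually
    pairs = [(query, doc.text) for doc in candidates]
    scores = reranker.predict(pairs)

    # Keep only the top-N most relevant documents as the context for the LLM
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]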
However, it is important to note that adding a Reranker also poses some challenges and increases the cost of use compared to a RAG with a vector-only retrieval infrastructure.
1.4 Cost of using Reranker
The cost of using Reranker to improve retrieval relevance needs to be emphasized. This cost includes two aspects: the impact of increased latency on business and the increase in service cost due to increased computation. We recommend that you evaluate whether you need to use Reranker according to your own business needs, weighing search quality, search latency, and cost of using Reranker.
- Reranker significantly increases search latency
Without Reranker, RAG applications simply perform a low latency Vector Approximate Nearest Neighbor (ANN) search to obtain Top-K related documents. For example, the Milvus vector database implements efficient vector indexing such as HNSW, which achieves millisecond search latency. If you use Zilliz Cloud, you can further improve the search performance with the more powerful Cardinal index.
However, with the addition of a Reranker, especially a Cross-encoder Reranker, the RAG application needs to process all the documents returned from the vector retrieval through a deep learning model, which can lead to a significant increase in latency. Depending on the model size and hardware performance, the latency can increase to hundreds of milliseconds or even to several seconds compared to the millisecond latency of vector retrieval!
- Reranker will significantly increase the cost of computation.
In basic RAG, vector retrieval does require pre-processing documents with a deep learning model, but this heavier computation is cleverly moved offline. With offline indexing (Embedding model inference), each online query only needs a vector search at very low computational cost. In contrast, using a Reranker significantly increases the computational cost of every online query, because re-ranking requires expensive model inference for each candidate document. Unlike vector retrieval, which reuses the offline index for every query, the Reranker must run inference for each online query and its results cannot be reused, leading to repeated overhead. This makes it a poor fit for high-traffic information retrieval systems such as web search and e-commerce search.
Let's do some simple math and look at the cost of using Reranker.
According to VectorDBBench, the cost of using a vector database that can handle 200 queries per second is only $100 per month, which translates to a cost of $0.0000002 per query. With Reranker, assuming that the first stage of the vector search returns top-100 documents, the cost of rearranging these documents is $0.001. That is, adding Reranker is 5,000 times more expensive than performing a vector search alone.
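Written out explicitly (the $100/month and $0.001-per-rerank figures come from the text above; the per-query vector cost assumes the database is kept busy at the full 200 QPS all month):

# Cost per vector query: $100 per month at a sustained 200 queries/second
queries_per_month = 200 * 60 * 60 * 24 * 30        # = 518,400,000
vector_cost_per_query = 100 / queries_per_month     # ≈ $0.0000002

# Cost to rerank the top-100 candidates of one query (figure from the text)
rerank_cost_per_query = 0.001

print(rerank_cost_per_query / vector_cost_per_query)  # ≈ 5,000x more expensive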
Although in many real-world situations only a small number of results may be re-ranked (e.g., 10 to 20), the cost of using a Cross-encoder reranker is still much higher than the cost of simply performing a vector search.
From another perspective, using a Reranker is equivalent to shifting the expensive model-inference cost of offline indexing onto every query. The cost of inference depends on the size of the input (the number of tokens) and the size of the model itself; typical Embedding and Reranker models range from a few hundred MB to several GB. Assuming the two models are of similar size, and noting that documents are generally much longer than queries (so the cost of running inference on the query itself is negligible), re-ranking the top-10 documents for each query costs roughly 10 times as much as computing the Embedding of a single document offline. If the query load is high, the computational and usage costs may be prohibitive. For low-load scenarios, such as high-value, low-frequency internal knowledge-base Q&A, this cost may be perfectly acceptable.
1.5 Cost Comparison: Vector Retrieval vs. Cross-encoder Reranker vs. LLM Generation
While the cost of using Reranker is much higher than the cost of using vector retrieval alone, it is still less expensive than using LLM to generate answers for the same number of documents. In the RAG architecture, Reranker can filter the initial results of the vector search and discard documents with low relevance to the query, thus effectively preventing the LLM from processing irrelevant information, which greatly reduces the time-consuming and costly part of the generation process as compared to sending all the results returned from the vector search to the LLM.
To give a close practical example: in the first stage of retrieval, a vector search engine can quickly filter out the 20 documents with the highest semantic closeness among millions of vectors, but the relative order of these documents can be further optimized using Reranker. Although it incurs some cost, Reranker can further pick out the best top-5 results from the top-20 results. The relatively more expensive LLM would then only need to analyze these top-5 results, eliminating the higher cost and "laxity" of dealing with 20 documents. In this way, we can balance latency, answer quality, and cost of use with this composite solution.
1.6 Scenarios of use of Reranker
Reranker is particularly suitable for scenarios that demand high precision and relevance, such as specialized knowledge bases, customer-service systems, and similar applications. Because queries in these applications carry high business value, improving answer accuracy is prioritized over system performance and cost control. Using a Reranker generates more accurate answers and improves the user experience.
However, in scenarios such as web search and e-commerce search, response speed and cost are critical, making the expensive Cross-encoder Reranker less suitable. These applications are better served by vector search combined with a lighter score-based Reranker, which preserves responsiveness and reduces overhead while still improving search quality.
Compared with vector search alone, using Reranker can improve the accuracy and relevance of answers in the retrieval augmentation generation (RAG) and search system by further refining the ordering of the first layer of search results. However, using Reranker increases latency and cost of ownership, and is therefore not suitable for high-frequency, high-concurrency applications. When considering whether to use Reranker, there is a trade-off between answer accuracy, responsiveness, and cost of use.
Rerankers increase latency and computational cost while improving retrieval relevance, so the choice of tool should follow the same trade-off between retrieval quality, search latency, and cost of use. There are not many re-ranking tools to choose from at the moment; here are three: Cohere Rerank, BGE Re-Ranker, and Jina Reranker.
2. Cohere AI
- Company Profile
Aidan Gomez (CEO), Nick Frosst and Ivan Zhu founded Cohere in 2019.
One of them, Aidan Gomez, co-authored the paper "Attention Is All You Need" in June 2017; its importance is well known and needs no elaboration here. In early 2023, former YouTube CFO Martin Kon joined the team as President and COO.
- Objective: to build infrastructure for large models
Gomez: "In the beginning, we didn't really know what product we wanted to build... we were just focused on building the infrastructure to train large language models on supercomputers using whatever computation we could get our hands on. Soon after we launched Cohere, GPT-3 came along, and it was a huge breakthrough moment that was very effective and gave us [an indication] that we were on the right track."
2.1 Product Functions
Cohere trains Large Language Models (LLMs) for a variety of reading and writing tasks, such as summarization, content creation, and sentiment analysis.
Its language model is optimized for three main use cases:
- Retrieving text
  - Embed
  - Semantic Search
  - Rerank (re-ranking)
- Generating text
  - Summarize
  - Generate
  - Command model: follows user commands from the business application
- Classifying text
Depending on your privacy/security requirements, there are multiple ways to access Cohere:
- Cohere's API: This is the easiest option, just get an API key from the dashboard and start using Cohere-hosted models.
- Cloud AI Platforms: This option provides a balance of ease of use and security. You can access Cohere on a variety of cloud AI platforms, such as Oracle's GenAI service, AWS's Bedrock and Sagemaker platforms, Google Cloud, and Azure's AML service.
- Private Cloud Deployments: Cohere's models can be privately deployed in most virtual private cloud (VPC) environments, providing enhanced security and the highest degree of customization. For information, please contact sales.
2.2 Business models
Cohere bears the substantial upfront cost of building each model and the ongoing cost of inference.
It recovers costs through usage-based pricing and offers three different pricing tiers:
- Free: rate-limited access to all Cohere API endpoints, for learning and prototyping.
- Production: higher rate limits on all Cohere API endpoints, enhanced customer support, and the ability to train custom models on the data you provide. Cohere charges by the number of tokens (a token is roughly a word fragment, number, or symbol) across all of its API endpoints, with prices ranging from $0.0000004 per token (embedding) to $0.001 per token (re-ranking).
- Enterprise: dedicated model instances, the highest level of support, and custom deployment options. Enterprise pricing is not disclosed.
2.3 Cohere Rerank 3
Cohere Rerank is a widely used rearrangement tool in the industry, often integrated into the LangChain and LlamaIndex frameworks, and is relatively simple to use.
The company behind it, Cohere, has a strong pedigree. Founded in 2019 by researchers and engineers who worked on Google Brain and Cortex, one of its co-founders, Aidan Gomez, is one of the authors of the Transformer architecture.
According to incomplete statistics, Cohere has raised more than $445 million in cumulative funding. In March of this year, it was also reported that Cohere was in advanced negotiations for a new funding round of more than $500 million, at an expected valuation of $5 billion.
In April of this year, Cohere released Rerank 3, which is much improved in every way, including:
- A 4K context length that significantly improves search quality for longer documents
- The ability to search multi-aspect and semi-structured data such as emails, invoices, JSON documents, code, and tables
- Coverage of more than 100 languages
- Improved latency and a lower total cost of ownership (TCO)
However, it is commercially closed-source. Originally, it cost users $1 per 1000 searches, and after upgrading to Rerank 3, it costs $2 per 1000 searches.
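For reference, a minimal sketch of calling Rerank through Cohere's Python SDK. The model name rerank-english-v3.0 and the response fields shown are based on Cohere's public documentation and may change, so treat this as an outline rather than a definitive integration:

import cohere

co = cohere.Client("<YOUR COHERE API KEY>")

query = "What is the capital of the United States?"
documents = [
    "Carson City is the capital city of the American state of Nevada.",
    "Washington, D.C. is the capital of the United States.",
    "Capital punishment has existed in the United States since colonial times.",
]

# Ask the Rerank endpoint to score and re-order the candidate documents
response = co.rerank(
    model="rerank-english-v3.0",
    query=query,
    documents=documents,
    top_n=2,
)

for result in response.results:
    print(result.index, result.relevance_score)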
2.4 Using Cohere
Cohere trains large-scale language models (LLMs) for a variety of reading and writing tasks, such as summarization, content creation, and sentiment analysis. Its language models are optimized for three main use cases: retrieving text, generating text, and classifying text.
Cohere provides API endpoints for organizations to leverage its LLMs and many deployment options that enable them to securely store data through cloud partners like AWS or Cohere's hosted cloud. To help its LLMs customers more efficiently, Cohere also offers customized model training services.
- Official website :
- github : /cohere-ai/cohere-python
- Documentation:
2.5 Implementing Semantic Search with Cohere AI
Preparation
1. Install the library
pip install cohere
2. Obtain an API key from the Cohere dashboard
import cohere
import numpy as np
import re
import pandas as pd
from tqdm import tqdm
from datasets import load_dataset
import umap
import altair as alt
from sklearn.metrics.pairwise import cosine_similarity
from annoy import AnnoyIndex
import warnings

warnings.filterwarnings('ignore')
pd.set_option('display.max_colwidth', None)

api_key = ''
co = cohere.Client(api_key)
- Load the question-classification dataset
This will be demonstrated using the trec dataset, which consists of questions and their categories.
# Get the dataset
dataset = load_dataset("trec", split="train")
# Import it into a pandas dataframe, taking only the first 1000 rows
df = pd.DataFrame(dataset)[:1000]
# Preview the data to make sure it has been loaded correctly
df.head(10)
- Embed the documents
The question text can be embedded using Cohere.
Questions are embedded with the embed function of the Cohere library. Generating a thousand embeddings of this length takes about 15 seconds.
# Get the embeddings
embeds = co.embed(texts=list(df['text']), model="large", truncate="RIGHT").embeddings
# Check the dimensions of the embeddings
embeds = np.array(embeds)
embeds.shape
- Build an index and search with nearest neighbors
We use the AnnoyIndex function of the annoy library, a way of storing embeddings that is optimized for fast search.
The optimization problem of finding the point nearest (most similar) to a given point in a given set is known as nearest-neighbor search.
This approach works well for large amounts of text (other options include Faiss, ScaNN, and PyNNDescent). After constructing the index, we can use it to retrieve the nearest neighbors of existing questions, or embed new questions and find their nearest neighbors.
# Create the search index, passing in the embedding dimension
search_index = AnnoyIndex(embeds.shape[1], 'angular')
# Add all vectors to the search index
for i in range(len(embeds)):
    search_index.add_item(i, embeds[i])
search_index.build(10)  # 10 trees
search_index.save('')
- Find the neighbors of an example from the dataset
If we are only interested in the distances between questions within the dataset (no external queries), a simple approach is to compute the similarity between each pair of embeddings we have.
# Select an example (we will retrieve other examples similar to it)
example_id = 7
# Retrieve the nearest neighbors
similar_item_ids = search_index.get_nns_by_item(example_id, 10, include_distances=True)
# Format and print the text and distances
results = pd.DataFrame(data={'texts': df.iloc[similar_item_ids[0]]['text'],
                             'distance': similar_item_ids[1]}).drop(example_id)
print(f"Question: '{df.iloc[example_id]['text']}'\nNearest neighbors:")
results
- Find neighbors for a user query
We can use the same embedding technique to find the nearest neighbors of a user query.
By embedding the query, we can measure its similarity to the items in the dataset and identify the nearest neighbors.
query = "What is the tallest mountain in the world?"
# Embed the query
query_embed = co.embed(texts=[query], model="large", truncate="RIGHT").embeddings
# Retrieve the nearest neighbors
similar_item_ids = search_index.get_nns_by_vector(query_embed[0], 10, include_distances=True)
# Format the results
results = pd.DataFrame(data={'texts': df.iloc[similar_item_ids[0]]['text'],
                             'distance': similar_item_ids[1]})
print(f"Question: '{query}'\nNearest neighbors:")
results
3. BGE Re-Ranker v2.0
Recently, the BAAI (Beijing Academy of Artificial Intelligence) team introduced a new generation of retrieval ranking model, BGE Re-Ranker v2.0, and extended the "text + image" hybrid retrieval capability of the BGE vector models.
- BGE Re-Ranker v2.0 supports more languages and longer text, and achieves state-of-the-art results on mainstream benchmarks such as the English retrieval benchmark MTEB, the Chinese retrieval benchmark C-MTEB, the multilingual retrieval benchmark MIRACL, and the LLaMA-Index Evaluation.
- BGE Re-Ranker v2.0 further optimizes inference efficiency with a hierarchical self-distillation strategy, trading a modest amount of overhead for significant performance gains.
- BGE-v1.5 and BGE-M3 add "text + image" hybrid retrieval capability by incorporating visual tokens, while maintaining their excellent text retrieval performance.
The above models are now available through Hugging Face, GitHub, and other platforms under a free, commercially usable open-source license:
/FlagOpen/FlagEmbedding
/BAAI
3.1 Technical Highlights
Figure 1: RAG pipeline
As shown in Figure 1, retrieval ranking models are an important part of the information-retrieval and RAG pipeline. Compared with vector models and sparse retrieval models, a retrieval ranking model uses a more complex decision function to obtain finer-grained relevance. Usually, the system first obtains coarse-grained candidates from the vector database and the inverted index with the help of a vector model (BGE-M3-Dense) and a sparse retrieval model (BGE-M3-Sparse), respectively. The system then uses the ranking model (BGE Re-Ranker) to further filter the candidate set, producing the fine-grained candidates that support the downstream large language model in the retrieval-augmented generation (RAG) task.
Figure 2
- The BGE Re-Ranker v2.0 series of ranking models is built on two different sizes of model base:
  - BGE Re-Ranker v2-LLM (Figure 2A): based on high-performance, lightweight large language models such as MiniCPM-2B and Gemma-2B.
  - BGE Re-Ranker v2-M3 (Figure 2B): based on the high-performance BGE-M3-0.5B, with a smaller parameter count and faster inference.
- All models are trained on multilingual data and support multilingual retrieval. For example, BGE Re-Ranker v2-MiniCPM-2B dramatically improves retrieval in English and Chinese, while BGE Re-Ranker v2-Gemma-2B and BGE Re-Ranker v2-M3 achieve the best results on multilingual retrieval tasks (note: see the GitHub repository for the training data ratios of the BGE Re-Ranker v2.0 series).
- To further improve inference efficiency, BGE Re-Ranker v2.0 adopts a hierarchical self-distillation training strategy (Figure 2C). Specifically, the model's final ranking score S(0) is used as a teacher signal, and through knowledge distillation each intermediate layer of the model also learns ranking capability. In practice, users can flexibly choose how many layers of the ranking model to run based on the compute budget and latency constraints of their scenario (a usage sketch is shown after this list).
- The BGE series of vector models extends to "text + image" hybrid retrieval. By introducing visual tokens generated by the CLIP model, BGE gains the ability to model "text + image" hybrids. Notably, the visual-token training only updates the visual tokenizer, while the parameters of the original BGE models (BGE v1.5, BGE M3) remain unchanged, so the excellent text retrieval capability of the BGE models is fully preserved while the hybrid modeling capability is gained.
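A minimal usage sketch based on the FlagEmbedding library's documented interface. The class names (FlagReranker, LayerWiseFlagLLMReranker) and the cutoff_layers argument follow the FlagEmbedding README; check the repository linked above for the current API:

from FlagEmbedding import FlagReranker, LayerWiseFlagLLMReranker

# Standard cross-encoder reranker (BGE Re-Ranker v2-M3)
reranker = FlagReranker('BAAI/bge-reranker-v2-m3', use_fp16=True)
score = reranker.compute_score(['what is panda?', 'The giant panda is a bear species endemic to China.'])
print(score)

# Layer-wise LLM-based reranker: choose how many layers to run
# to trade accuracy against latency (hierarchical self-distillation)
lw_reranker = LayerWiseFlagLLMReranker('BAAI/bge-reranker-v2-minicpm-layerwise', use_fp16=True)
score_28 = lw_reranker.compute_score(
    ['what is panda?', 'The giant panda is a bear species endemic to China.'],
    cutoff_layers=[28],  # exit at layer 28 instead of the final layer
)
print(score_28)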
3.2 Performance Review
The retrieval performance evaluation results of BGE Re-Ranker v2.0 series models in English, Chinese, and multilingual mainstream benchmarks are as follows:
1. Benchmarks for English search assessment
The English evaluation results on MTEB/Retrieval are as follows (Table 1):
BGE Re-Ranker v2 first re-ranks the top-100 candidate set from BGE-v1.5-large. The experimental results show that BGE Re-Ranker v2-Gemma-2B achieves the best results, improving retrieval accuracy by a significant 6%. Meanwhile, the intermediate-layer ranking results obtained with the hierarchical self-distillation strategy (BGE Re-Ranker v2-MiniCPM, layer 28 vs. layer 40) largely preserve the retrieval accuracy of the final layer. In addition, after switching to the more powerful embedding model E5-Mistral-7B (still re-ranking its top-100), retrieval accuracy improves further, with the average retrieval score (NDCG@10) reaching 60.4. This is the best result on the BEIR benchmark, an improvement of almost 4% over the embedding-only result of 56.85 [1][2].
2. Chinese search assessment benchmarks
In the Chinese evaluation C-MTEB/Retrieval, BGE Re-Ranker v2 likewise re-ranks the top-100 candidate set from BGE-v1.5-large. Similar to the English results, BGE Re-Ranker v2-MiniCPM-2B achieves the best retrieval quality, and the intermediate-layer ranking result (BGE Re-Ranker v2-MiniCPM-2B, layer 28) still fully preserves the retrieval accuracy of the final layer.
3. Benchmarks for multilingual search assessment
In the multilingual evaluation MIRACL (Table 3), BGE Re-Ranker v2 re-ranks the top-100 candidate set of BGE-M3. Unlike the previous results, BGE Re-Ranker v2-Gemma-2B tops the combined results, while BGE Re-Ranker v2-M3 achieves similar results with a smaller model size (0.5B). The above results also reflect the performance differences between the individual pre-trained model bases in different languages.
4. RAG benchmarking
In the RAG evaluation benchmark provided by LlamaIndex [3], we use BGE Re-Ranker v2 and various baseline re-rankers to re-rank the recall results of different embedding models (bge-v1.5-large, bge-m3, openai-te3, mxbai-embedding). As shown in the table below (Table 4), BGE Re-Ranker v2 significantly improves the accuracy of every embedding model in RAG scenarios, and BGE Re-Ranker v2 combined with bge-m3 achieves the best end-to-end retrieval quality.
5. "Text + Picture" mixed assessment benchmarks
Finally, on the "text + image" hybrid retrieval task, Visualized BGE achieves a significant advantage over the CLIP baseline on five commonly used evaluation benchmarks: WebQA, CIRR, FashionIQ, OVEN-QS, and ReMuQ.
3.3 BGE Community Ecology
Thanks to BGE's outstanding performance and good generalization, the mainstream vector databases in the industry have integrated the various BGE model versions. The popular BGE-M3 model has been integrated by Vespa, Milvus, and other frameworks, making it convenient for community users to quickly build a "three-in-one" retrieval pipeline (dense retrieval, sparse retrieval, re-ranking).
1. Examples of Vespa usage (see [4] for details)
2. Example of Milvus usage (see [5] for details; a minimal sketch is shown below)
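As referenced above, a minimal sketch of the Milvus-side integration, assuming the pymilvus model extras (pip install "pymilvus[model]") and its BGERerankFunction wrapper; see [5] and the Milvus documentation for the exact, current API:

from pymilvus.model.reranker import BGERerankFunction

# Wrap the BGE re-ranker; model name and device are illustrative
bge_rf = BGERerankFunction(model_name="BAAI/bge-reranker-v2-m3", device="cpu")

query = "What is the capital of France?"
documents = [
    "Paris is the capital and most populous city of France.",
    "Berlin is the capital of Germany.",
    "The Eiffel Tower is located in Paris.",
]

# Re-rank the documents retrieved from Milvus and keep the top 2
results = bge_rf(query, documents, top_k=2)
for r in results:
    print(r.index, r.score, r.text)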
References:
[1] MTEB Leaderboard, /spaces/mteb/leaderboard
[2] SFR-Embedding-Mistral, /sfr-embedded-mistral/
[3] Llama-Index Evaluation, /en/latest/optimizing/evaluation/
[4] Vespa for BGE M3, /vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/
[5] Zilliz for BGE, /FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3
4. Jina Reranker
Website: /reranker/
Jina Reranker is developed by Jina AI, a neural search company. In 2022, Jina AI closed its Series A funding round, bringing total funding to more than 200 million RMB. Its neural search framework, Jina, has topped GitHub's global trending chart several times.
Jina Reranker v2 was released in June of this year with significant improvements in speed, multi-language support, and functionality, especially for Retrieval Augmented Generation (RAG) scenarios.
4.1 Key benefits of Jina Reranker v2:
- Multilingual support: delivers more relevant search results in more than 100 languages and outperforms bge-reranker-v2-m3.
- Agentic capabilities: state-of-the-art function calling and text-to-SQL capabilities for agentic RAG scenarios.
- Code retrieval: best performance on code retrieval tasks.
- Extreme speed: inference is 6x faster than the previous generation and 15x faster than the comparable bge-reranker-v2-m3.
Recall@3 scores are reported for different Reranker models on the ToolBench dataset. As shown, Jina Reranker v2 essentially reaches state-of-the-art performance levels and is nearly 15 times faster with half the model size.
4.2 Features of Jina Reranker v2:
- Motivation: compensates for the limited retrieval accuracy of embedding models.
- Multilingual support: excellent performance on benchmarks such as MKQA, BEIR, and AirBench.
- Application scenarios: structured data querying, function calling, and code retrieval.
- Inference speed: smaller model size and Flash Attention 2.
- Training process: carried out in four stages, including pre-training on English data, adding cross-lingual data, and fine-tuning.
4.3 Ways to use Jina Reranker v2:
- Via the Reranker API: the quickest way to use Jina Reranker v2 is through its API, which makes it easy to improve search relevance and RAG accuracy without deploying a model.
- Via RAG/LLM frameworks: Jina Reranker integrates with existing LLM and RAG orchestration frameworks; simply reference the model name for quick integration.
- Hugging Face: Jina AI provides access (under CC-BY-NC-4.0) to the jina-reranker-v2-base-multilingual model on Hugging Face for research and evaluation purposes.
- Private cloud deployment: pre-built private deployment packages for Jina Reranker v2 will soon be available on AWS Marketplace and Azure Marketplace for easy deployment by AWS and Azure users.
Jina Reranker also charges a fee, but the first 1 million tokens are free, 1 billion tokens are $20, and 11 billion tokens are $200.
4.4 Quick use of Jina Reranker
a. Through the Reranker API
The quickest way to use Jina Reranker v2 is through its API, which makes it easy to improve the relevance of your searches and the accuracy of your RAG without having to deploy a model.
Example 1: Ranking Function Calls
To rank the most relevant external functions or tools, organize the query and the documents (function schemas) in the following format.
curl -X 'POST' \
'/v1/rerank' \
-H 'accept: application/json' \
-H 'Authorization: Bearer <YOUR JINA AI TOKEN HERE>' \
-H 'Content-Type: application/json' \
-d '{
"model": "jina-reranker-v2-base-multilingual",
"query": "I am planning a road trip from Berlin to Munich in my Volkswagen VII. Can you calculate the carbon footprint of this trip?",
"documents": [
"{'\''Name'\'': '\''getWeather'\'', '\''Specification'\'': '\''Provides current weather information for a specified city'\'', '\''spec'\'': '\''/data/2.5/weather?q={city}&appid={API_KEY}'\'', '\''example'\'': '\''/data/2.5/weather?q=Berlin&appid=YOUR_API_KEY'\''}",
"{'\''Name'\'': '\''calculateDistance'\'', '\''Specification'\'': '\''Calculates the driving distance and time between multiple locations'\'', '\''spec'\'': '\''/maps/api/distancematrix/json?origins={startCity}&destinations={endCity}&key={API_KEY}'\'', '\''example'\'': '\''/maps/api/distancematrix/json?origins=Berlin&destinations=Munich&key=YOUR_API_KEY'\''}",
"{'\''Name'\'': '\''calculateCarbonFootprint'\'', '\''Specification'\'': '\''Estimates the carbon footprint for various activities, including transportation'\'', '\''spec'\'': '\''/api/v1/estimates'\'', '\''example'\'': '\''{type: vehicle, distance: distance, vehicle_model_id: car}'\''}"
]
}'
The expected results are as follows:
{
"model": "jina-reranker-v2-base-multilingual",
"usage": {
"total_tokens": 383,
"prompt_tokens": 383
},
"results": [
{
"index": 2,
"document": {
"text": "{'Name': 'calculateCarbonFootprint', 'Specification': 'Estimates the carbon footprint for various activities, including transportation', 'spec': '/api/v1/estimates', 'example': '{type: vehicle, distance: distance, vehicle_model_id: car}'}"
},
"relevance_score": 0.5422876477241516
},
{
"index": 1,
"document": {
"text": "{'Name': 'calculateDistance', 'Specification': 'Calculates the driving distance and time between multiple locations', 'spec': '/maps/api/distancematrix/json?origins={startCity}&destinations={endCity}&key={API_KEY}', 'example': '/maps/api/distancematrix/json?origins=Berlin&destinations=Munich&key=YOUR_API_KEY'}"
},
"relevance_score": 0.23283305764198303
},
{
"index": 0,
"document": {
"text": "{'Name': 'getWeather', 'Specification': 'Provides current weather information for a specified city', 'spec': '/data/2.5/weather?q={city}&appid={API_KEY}', 'example': '/data/2.5/weather?q=Berlin&appid=YOUR_API_KEY'}"
},
"relevance_score": 0.05033063143491745
}
]
}
Example 2: Ranking SQL Queries
To get the relevance score of a user's query to a database table structure, you can use the following sample API call.
curl -X 'POST' \
'/v1/rerank' \
-H 'accept: application/json' \
-H 'Authorization: Bearer <YOUR JINA AI TOKEN HERE>' \
-H 'Content-Type: application/json' \
-d '{
"model": "jina-reranker-v2-base-multilingual",
"query": "which customers bought a summer outfit in the past 7 days?",
"documents": [
"CREATE TABLE customer_personal_info (customer_id INT PRIMARY KEY, first_name VARCHAR(50), last_name VARCHAR(50));",
"CREATE TABLE supplier_company_info (supplier_id INT PRIMARY KEY, company_name VARCHAR(100), contact_name VARCHAR(50));",
"CREATE TABLE transactions (transaction_id INT PRIMARY KEY, customer_id INT, purchase_date DATE, FOREIGN KEY (customer_id) REFERENCES customer_personal_info(customer_id), product_id INT, FOREIGN KEY (product_id) REFERENCES products(product_id));",
"CREATE TABLE products (product_id INT PRIMARY KEY, product_name VARCHAR(100), season VARCHAR(50), supplier_id INT, FOREIGN KEY (supplier_id) REFERENCES supplier_company_info(supplier_id));"
]
}'
The expected response is:
{
"model": "jina-reranker-v2-base-multilingual",
"usage": {
"total_tokens": 253,
"prompt_tokens": 253
},
"results": [
{
"index": 2,
"document": {
"text": "CREATE TABLE transactions (transaction_id INT PRIMARY KEY, customer_id INT, purchase_date DATE, FOREIGN KEY (customer_id) REFERENCES customer_personal_info(customer_id), product_id INT, FOREIGN KEY (product_id) REFERENCES products(product_id));"
},
"relevance_score": 0.2789437472820282
},
{
"index": 0,
"document": {
"text": "CREATE TABLE customer_personal_info (customer_id INT PRIMARY KEY, first_name VARCHAR(50), last_name VARCHAR(50));"
},
"relevance_score": 0.06477169692516327
},
{
"index": 3,
"document": {
"text": "CREATE TABLE products (product_id INT PRIMARY KEY, product_name VARCHAR(100), season VARCHAR(50), supplier_id INT, FOREIGN KEY (supplier_id) REFERENCES supplier_company_info(supplier_id));"
},
"relevance_score": 0.027742892503738403
},
{
"index": 1,
"document": {
"text": "CREATE TABLE supplier_company_info (supplier_id INT PRIMARY KEY, company_name VARCHAR(100), contact_name VARCHAR(50));"
},
"relevance_score": 0.025516605004668236
}
]
}
b. Integration through the RAG/LLM framework
Jina Reranker integrates with existing LLM and RAG orchestration frameworks and can be quickly integrated using only the model name. For more information on how to integrate Jina Reranker v2, please visit the respective documentation page: jina-reranker-v2-base-multilingual.
Haystack by deepset: In Haystack, Jina Reranker v2 can be integrated via the JinaRanker class: /docs/jinaranker
from haystack import Document
from haystack_integrations.components.rankers.jina import JinaRanker

docs = [Document(content="Paris"), Document(content="Berlin")]

ranker = JinaRanker(model="jina-reranker-v2-base-multilingual", api_key="<YOUR JINA AI API KEY HERE>")

ranker.run(query="City in France", documents=docs, top_k=1)
LlamaIndex: Jina Reranker v2 is available as the JinaRerank node post-processor module: /en/stable/examples/node_postprocessor/JinaRerank/
import os
from llama_index.postprocessor.jinaai_rerank import JinaRerank
jina_rerank = JinaRerank(model="jina-reranker-v2-base-multilingual", api_key="<YOUR JINA AI API KEY HERE>", top_n=1)
Langchain: To integrate Jina Reranker v2 into an existing application, use the JinaRerank module and initialize it with the correct model name. See: /v0.2/docs/integrations/document_transformers/jina_rerank/
from langchain_community.document_compressors import JinaRerank
reranker = JinaRerank(model="jina-reranker-v2-base-multilingual", jina_api_key="<YOUR JINA AI API KEY HERE>")
Huggingface
We also open access (under CC-BY-NC-4.0) to the jina-reranker-v2-base-multilingual model on Hugging Face for research and evaluation purposes.
To download and run models from Hugging Face, install transformers and einops:
pip install transformers einops
pip install ninja
pip install flash-attn --no-build-isolation
Log in to your Hugging Face account using the Hugging Face token:
huggingface-cli login --token <"HF-Access-Token">
Download the pre-trained model:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(
'jinaai/jina-reranker-v2-base-multilingual',
torch_dtype="auto",
trust_remote_code=True,
)
model.to('cuda')  # or 'cpu' if no GPU is available
model.eval()
Define the query and the documents to be re-ranked:
query = "Organic skincare products for sensitive skin"
documents = [
"Organic skincare for sensitive skin with aloe vera and chamomile.",
"New makeup trends focus on bold colors and innovative techniques",
"Bio-Hautpflege für empfindliche Haut mit Aloe Vera und Kamille",
"Neue Make-up-Trends setzen auf kräftige Farben und innovative Techniken",
"Cuidado de la piel orgánico para piel sensible con aloe vera y manzanilla",
"Las nuevas tendencias de maquillaje se centran en colores vivos y técnicas innovadoras",
"Natural and organic skin care products specially designed for sensitive skin.",
"New makeup trends focus on vibrant colors and innovative techniques",
"sensitive skinのためにSpecialにDesignされたNatural OrganicスキンケアProducts",
"meso- (chemistry)しいメイクのトレンドはfreshやかなcolorと革meso- (chemistry)的なTechnologyにfocal pointを(coll.) fail (a student)てています",
]
Construct sentence pairs and calculate relevance scores:
sentence_pairs = [[query, doc] for doc in documents]
scores = model.compute_score(sentence_pairs, max_length=1024)
The resulting scores will be presented as a list of floating point numbers, where each value corresponds to the degree of relevance between a document and the query, with larger values indicating greater relevance.
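For example, to pair each document with its score and list the most relevant candidates (a small illustrative snippet, not part of the official example):

# Sort documents from most to least relevant using the scores above
ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
for doc, score in ranked[:3]:
    print(f"{score:.4f}  {doc}")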
The rerank function can split a long document into several smaller chunks, score each chunk against the query individually, and then combine the chunk scores into a complete re-ranked output. The max_query_length and max_length parameters control the length and granularity of this text splitting.
results = model.rerank(
query,
documents,
max_query_length=512,
max_length=1024,
top_n=3
)
This function returns not only the relevance score of each document, but also their content and position in the original document list.