Keyword-aware retrieval

This article describes the advantages of keyword-aware vector retrieval in the vector retrieval service, gives a usage example, and introduces the Sparse Vector generation tool.

Background

Keyword search and its limitations

In information retrieval, the "traditional" approach is keyword-based retrieval, which roughly works as follows:

Extract keywords from the raw corpus (e.g. web pages).
Build a mapping between the keywords and the raw corpus. Common methods include inverted indexing, TF-IDF, and BM25; TF-IDF and BM25 usually use a Sparse Vector to represent word frequency.
At retrieval time, extract keywords from the query and recall the TopK most relevant raw corpus entries through the mapping built in the previous step.
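
The three steps above can be sketched as follows. This is a minimal illustration, not the implementation of any particular search engine: the toy corpus, the whitespace tokenizer, and the function names `extract_keywords` and `search` are all assumptions made for the example, and the scoring is a simple TF-IDF variant rather than BM25.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus; a real system would index web pages or documents.
corpus = {
    "doc1": "rice irrigation methods for paddy fields",
    "doc2": "hospital opening hours and registration",
    "doc3": "irrigation equipment for wheat and rice",
}


def extract_keywords(text):
    """Naive keyword extraction: lowercase and split on whitespace."""
    return text.lower().split()


# Inverted index: keyword -> {doc_id: term frequency in that document}.
inverted_index = defaultdict(dict)
for doc_id, text in corpus.items():
    for word, tf in Counter(extract_keywords(text)).items():
        inverted_index[word][doc_id] = tf


def search(query, top_k=2):
    """Score documents by summed TF-IDF over the query keywords."""
    scores = defaultdict(float)
    n_docs = len(corpus)
    for word in extract_keywords(query):
        postings = inverted_index.get(word, {})
        if not postings:
            continue  # keyword never seen in the corpus
        idf = math.log(1 + n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


results = search("rice irrigation")
```

A query whose keywords never occur in the corpus returns an empty list, which is the failure mode discussed next.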

However, keyword search cannot understand semantics. For example, the query "Zheyi Hospital" becomes "Zheyi" and "hospital" after word segmentation, and neither keyword effectively hits the user's intended target, "the First Affiliated Hospital of Zhejiang University School of Medicine".

Semantic-based Vector Retrieval

With the rapid development of artificial intelligence and the growing semantic understanding ability of Embedding models, recalling related information through Embedding-based semantic vector retrieval has gradually become mainstream. The general process is as follows:

The Embedding model converts the raw corpus (e.g. web pages) into vectors, also known as Dense Vectors.
The vectors are inserted into the vector retrieval system.
At retrieval time, the query is also converted into a vector by the Embedding model, and that vector is used to recall the TopK nearest raw corpus entries from the vector retrieval system.
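
The process above can be sketched with a brute-force nearest-neighbor search. The 3-dimensional vectors below are hypothetical stand-ins for the output of an Embedding model, and `nearest` is an illustrative helper; a production vector retrieval system would typically use an approximate index rather than scanning every vector.

```python
import math

# Hypothetical embeddings standing in for the output of an Embedding model.
corpus_vectors = {
    "doc1": [0.9, 0.1, 0.0],
    "doc2": [0.0, 0.2, 0.9],
    "doc3": [0.7, 0.6, 0.1],
}


def cosine(a, b):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def nearest(query_vector, top_k=2):
    """Return the top_k (doc_id, similarity) pairs, most similar first."""
    scored = [(doc_id, cosine(query_vector, vec))
              for doc_id, vec in corpus_vectors.items()]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)[:top_k]


# The query vector would also come from the Embedding model.
top = nearest([1.0, 0.2, 0.0])
```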

However, recalling information through semantic vector retrieval undeniably has limitations: the Embedding model's semantic understanding must be continuously optimized to achieve good results. For example, if the model cannot understand that "rice irrigation" and "irrigated rice" are semantically close, a query for "rice irrigation" will fail to recall the corpus related to "irrigated rice". In this case, keyword retrieval plays to its strengths and effectively recalls the relevant corpus through the keywords "rice" and "irrigation".

Keyword Search + Semantic Search

To address these problems, businesses and systems gradually evolved a "two-way recall, integrated sorting" approach, whose results also surpass those of pure keyword search or pure semantic search, as shown in the figure below:

But the drawbacks of this approach are obvious:

Increased system complexity.
Increased overhead of hardware resources (memory, CPU, disk, etc.).
Reduced maintainability.
......
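
For reference, the "two-way recall, integrated sorting" approach can be sketched as merging two independently produced ranked lists. The article does not say which merging method is used, so the sketch below assumes reciprocal rank fusion (RRF) purely for illustration; the document IDs and the function name are hypothetical.

```python
def reciprocal_rank_fusion(ranked_lists, k=60, top_k=3):
    """Merge several ranked lists of doc IDs with reciprocal rank fusion.

    Each document receives 1 / (k + rank) from every list it appears in,
    and the summed scores decide the final order.
    """
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical results from the two separate recall paths.
keyword_hits = ["doc1", "doc3", "doc4"]   # from the keyword index
semantic_hits = ["doc3", "doc2", "doc1"]  # from the vector index

merged = reciprocal_rank_fusion([keyword_hits, semantic_hits])
```

Even in this small form, the sketch shows where the drawbacks come from: two indexes must be built, queried, and kept consistent before the merge step can run.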

Semantic Search with Keyword Awareness

The vector retrieval service DashVector supports both Dense Vector, which is used for the model's high-dimensional feature (Embedding) representation, and Sparse Vector, which is used to represent keyword and word-frequency information. DashVector can perform keyword-aware vector retrieval, i.e., hybrid retrieval combining Dense Vector and Sparse Vector.

With keyword-aware vector retrieval, DashVector keeps the advantages of the "two-way recall, integrated sorting" approach without its shortcomings. System complexity and resource overhead are significantly reduced, while keyword retrieval, vector retrieval, and hybrid keyword + vector retrieval are all available, which meets the needs of the vast majority of business scenarios.

Note

A Sparse Vector is a vector in which most elements are 0 and only a few are non-zero. In DashVector, a Sparse Vector can be used to represent information such as word frequency. For example, {1:0.4, 10000:0.6, 222222:0.8} is a sparse vector whose 1st, 10000th, and 222222nd elements (representing three keywords) have non-zero values (representing the keyword weights), and all other elements are 0.
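
A minimal sketch of why this representation is convenient: storing only the non-zero entries as an index-to-weight mapping means an inner product touches only the indices that both vectors share. The second vector and the helper name `sparse_dot` are assumptions made for the example.

```python
# The sparse vector from the text: only non-zero elements are stored.
a = {1: 0.4, 10000: 0.6, 222222: 0.8}
# A second, hypothetical sparse vector sharing two indices with `a`.
b = {1: 0.5, 7: 0.9, 222222: 0.5}


def sparse_dot(u, v):
    """Inner product of two sparse vectors stored as {index: weight}."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the smaller mapping
    return sum(weight * v[index] for index, weight in u.items() if index in v)


# Only indices 1 and 222222 contribute: 0.4 * 0.5 + 0.8 * 0.5.
overlap = sparse_dot(a, b)
```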

Usage example

Prerequisites

A Cluster has been created
An API-KEY has been obtained
The latest version of the SDK has been installed

Step 1. Create a Collection that supports Sparse Vector

Note

For the code to run correctly, replace YOUR_API_KEY in the following example with your API-KEY and YOUR_CLUSTER_ENDPOINT with your Cluster Endpoint.
This example only demonstrates the Sparse Vector functionality; for simplicity, the dimension of the vector (Dense Vector) is set to 4.

Python example:

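import dashvector

# Create a client with your API-KEY and Cluster Endpoint.
client = dashvector.Client(
    api_key='YOUR_API_KEY',
    endpoint='YOUR_CLUSTER_ENDPOINT'
)

# Sparse Vector requires the inner product metric.
ret = client.create('hybrid_collection', dimension=4, metric='dotproduct')

# Fetch a handle to the new Collection.
collection = client.get('hybrid_collection')
assert collection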

Important

Only the inner product metric (metric='dotproduct') supports the Sparse Vector function.

Step 2. Insert a Doc with Sparse Vector

Python example:

from dashvector import Doc

# `collection` is the Collection handle obtained in the previous step.
collection.insert(Doc(
    id='A',
    vector=[0.1, 0.2, 0.3, 0.4],
    sparse_vector={1: 0.3, 10: 0.4, 100: 0.3}
))

Note

The vector retrieval service DashVector recommends following the Quick Start to generate Sparse Vectors.

Step 3. Vector search with Sparse Vector

Python example:

# Query with both a Dense Vector and a Sparse Vector.
docs = collection.query(
    vector=[0.1, 0.1, 0.1, 0.1],
    sparse_vector={1: 0.3, 20: 0.7}
)
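
To build intuition for how the Sparse Vector affects ranking, the following local sketch scores the Doc inserted earlier against this query. It does not call DashVector. It assumes that, under the inner product metric, the hybrid score behaves like the sum of the dense inner product and the sparse inner product; the service's actual scoring is not described in this article and may differ. The helper names are hypothetical.

```python
def dense_dot(a, b):
    """Inner product of two equal-length dense vectors."""
    return sum(x * y for x, y in zip(a, b))


def sparse_dot(u, v):
    """Inner product of two sparse vectors stored as {index: weight}."""
    return sum(weight * v[index] for index, weight in u.items() if index in v)


def hybrid_score(doc, query):
    """Assumed hybrid score: dense inner product plus sparse inner product."""
    return (dense_dot(doc["vector"], query["vector"])
            + sparse_dot(doc["sparse_vector"], query["sparse_vector"]))


# The Doc inserted earlier and the query used above.
doc_a = {"vector": [0.1, 0.2, 0.3, 0.4],
         "sparse_vector": {1: 0.3, 10: 0.4, 100: 0.3}}
query = {"vector": [0.1, 0.1, 0.1, 0.1],
         "sparse_vector": {1: 0.3, 20: 0.7}}

# Dense part is about 0.10; only keyword index 1 is shared, adding about 0.09.
score = hybrid_score(doc_a, query)
```

Under this assumption, a Doc that shares keywords with the query gains score on top of its semantic similarity, which is the keyword-aware behavior described earlier.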

Sparse Vector Generation Tool

DashText is the SparseVectorEncoder recommended by the vector retrieval service DashVector.
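
To illustrate what a sparse vector encoder does in general, the sketch below maps each token to an integer index and uses normalized term frequency as its weight. This is not DashText's algorithm or API: the whitespace tokenizer, the CRC32-based index assignment, and the function name `encode_sparse` are all assumptions made for the example.

```python
import zlib
from collections import Counter


def encode_sparse(text, index_space=2**31):
    """Encode text as {index: weight} using normalized term frequency.

    Each token is mapped to a stable integer index with CRC32 (an
    illustrative choice); tokens that collide share one index.
    """
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    sparse = {}
    for token, count in counts.items():
        index = zlib.crc32(token.encode("utf-8")) % index_space
        sparse[index] = sparse.get(index, 0.0) + count / total
    return sparse


# "rice" appears twice and "irrigation" once, so their weights are 2/3 and 1/3.
vec = encode_sparse("rice irrigation rice")
```

The resulting mapping has the same shape as the sparse_vector argument used in the steps above.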