1. Vector search
In vector retrieval, KNN (K-Nearest Neighbors) and ANN (Approximate Nearest Neighbor) are the two most common methods. Both find similar data points based on feature vectors, but they differ in accuracy and efficiency.
KNN is a basic classification and regression method that predicts the class of a sample based on the classes of its K nearest neighbors in the feature space; for regression problems, it predicts a value based on the values of the K nearest neighbors. In vector retrieval, the KNN algorithm performs brute-force search: it computes the distance (or similarity, depending on the metric) between the query vector and every vector in the pool to find the K most similar vectors. The advantage of KNN is that it is simple, easy to understand, and requires no training; its disadvantage is that it is computationally expensive, especially on large datasets, because it must compute the distance between the query and every vector in the dataset.
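To make this concrete, below is a minimal NumPy sketch of brute-force KNN retrieval (Euclidean distance is assumed; the function name and data are illustrative only):

```python
import numpy as np

def knn_search(pool: np.ndarray, query: np.ndarray, k: int = 10) -> np.ndarray:
    """Brute-force KNN: compute the distance from the query to every vector
    in the pool and return the indices of the k closest ones."""
    dists = np.linalg.norm(pool - query, axis=1)   # distance to all N vectors
    topk = np.argpartition(dists, k)[:k]           # unordered top-k
    return topk[np.argsort(dists[topk])]           # sorted by distance

# Example: 100k vectors of 128 dimensions
rng = np.random.default_rng(0)
pool = rng.normal(size=(100_000, 128)).astype(np.float32)
query = rng.normal(size=128).astype(np.float32)
print(knn_search(pool, query, k=10))
```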
To address KNN's shortcomings on large-scale data, ANN goes a step further. It is a method for quickly finding approximate nearest neighbors of a given query vector in a large-scale dataset. Unlike KNN, ANN does not always return the exact nearest neighbors; it returns approximate nearest neighbors within some error margin in exchange for much faster retrieval. The steps of an ANN algorithm usually include:
- Index construction: construct an index over the dataset to speed up the search. Commonly used indexing methods include IVF (Inverted File) and HNSW (Hierarchical Navigable Small World).
- Query processing: takes a query vector and uses an index to quickly find a set of candidate nearest neighbors.
- Validation and refinement: exact distance computation of candidate nearest neighbors to determine the final nearest neighbor.
The advantage of ANN is that it is fast and suitable for large-scale datasets; its disadvantages are that the results may not be exact and that building an index requires additional storage. ANN is useful in many applications that tolerate a certain degree of approximation but require fast responses, such as semantic retrieval, image retrieval, and recommender systems.
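To illustrate the index-build and query workflow above, here is a minimal sketch assuming the faiss library (an IVF index is used here; OpenSearch's HNSW-based k-NN would follow the same build-then-query pattern):

```python
import faiss
import numpy as np

d, nb, nq, k = 128, 100_000, 5, 10
rng = np.random.default_rng(0)
xb = rng.normal(size=(nb, d)).astype(np.float32)   # the vector pool
xq = rng.normal(size=(nq, d)).astype(np.float32)   # query vectors

# Index construction: an IVF index with 1024 coarse clusters (inverted lists)
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, 1024)
index.train(xb)     # learn the coarse clustering
index.add(xb)       # assign every vector to its inverted list

# Query processing: probe only a few lists to gather candidates,
# then rank them by exact distance within those lists
index.nprobe = 16
distances, ids = index.search(xq, k)
print(ids)          # approximate top-k ids for each query
```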
1.1. Resource consumption in large-scale ANN vector retrieval
In large-scale vector retrieval, although ANN gives us fast TopK recall, the resource consumption is also huge. Taking ANN retrieval in OpenSearch as an example, under the HNSW algorithm implemented with nmslib, retrieving 1 billion vectors of 128 dimensions requires 704 GB of storage to build the HNSW index[1]. If a replica is added for higher data availability, the required storage doubles to 1408 GB. At query time the index must also be loaded into memory, i.e. 1408 GB of memory (provided by multiple nodes, not a single node) is needed to hold the HNSW index.
Although the resource consumption of HNSW indexing is huge, it also yields high performance: given sufficient memory, top-10 retrieval over 1 billion vectors achieves 99% recall with a p50 latency of only about 23 ms.
2. Quantization methods
To address the large resource requirements of ANN in large-scale vector retrieval, various quantization methods have been proposed that quantize the vectors to save resources while still supporting ANN retrieval. Representative methods include Product Quantization (PQ), Binary Quantization (BQ), and Scalar Quantization (SQ). The following mainly introduces PQ and BQ.
3. Product Quantization
Product Quantization (PQ) is a method used to compress high-dimensional vectors in order to reduce the memory footprint and increase the speed of nearest neighbor search. Its core idea is to decompose the original high-dimensional vector space into the Cartesian product of a number of low-dimensional vector spaces, and quantize each of these low-dimensional vector spaces.
The core idea of product quantization is clustering. Index construction is divided into 2 main steps, Cluster and Assign. The process is illustrated below:
PQ provides a parameter m_split (also referred to as m in some implementations), which controls the number of segments the vector is sliced into. The Cluster process is:
- Assume the vector pool has N vectors, each of 128 dimensions, and that each vector is divided into 4 "segments"; this yields N × 4 sub-vectors of 32 dimensions each
- Cluster the sub-vectors of each segment (e.g., with k-means). Assuming 256 clusters per segment, each segment (N sub-vectors of 32 dimensions) is divided into 256 clusters, producing 256 "centroids" (the cluster center vectors), so the total number of clusters is 4 * 256
The Assign process is:
- First, each "cluster" is encoded, there are 4 * 256 "clusters" and the corresponding code, because there are 256 "clusters", so only 8-bit code id can represent A cluster.
- Each existing vector is first cut into 4 segments; for each segment's sub-vector, the nearest "cluster" is found (by computing the distance to its "centroid"), and the sub-vector is encoded as that cluster's id
- In this way each vector is represented by 4 ids, each ranging from 0 to 255, i.e. the whole vector can be stored in just 4 bytes. This compresses the vectors and saves memory. A minimal sketch of the Cluster and Assign steps is given below.
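Below is a minimal sketch of the Cluster and Assign steps described above, assuming scikit-learn's KMeans for the per-segment clustering; the pool here is random data and deliberately small so the example runs quickly:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
N, d, m, k = 2000, 128, 4, 256          # pool size, dims, segments, centroids per segment
sub = d // m                             # 32 dims per segment
vectors = rng.normal(size=(N, d)).astype(np.float32)

codebooks = np.empty((m, k, sub), dtype=np.float32)
codes = np.empty((N, m), dtype=np.uint8)         # 4 bytes per vector after encoding

for j in range(m):
    seg = vectors[:, j * sub:(j + 1) * sub]      # all N sub-vectors of segment j
    km = KMeans(n_clusters=k, random_state=0).fit(seg)   # Cluster step
    codebooks[j] = km.cluster_centers_
    codes[:, j] = km.labels_                     # Assign step: nearest centroid id (0..255)

print(codes[:3])    # each row: 4 ids in [0, 255] -> 4 bytes instead of 512
```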
Next is the query:
When querying with a vector, it is first cut into 4 segments just as during training; then the distance between each query sub-vector and the corresponding 256 cluster centroids is computed and stored in a distance table. The distance between the query vector and any encoded vector can then be computed by table lookup: the per-segment distances are summed, giving the approximate distance between the query and every vector in the pool, and finally a TopK is taken.
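Below is a minimal sketch of this table-lookup query (often called asymmetric distance computation); the codebooks and codes are random stand-ins for the output of the Cluster and Assign steps, just to keep the example self-contained:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, k, n = 128, 4, 256, 10_000
sub = d // m

# Stand-ins for the codebooks and codes produced by the Cluster/Assign steps.
codebooks = rng.normal(size=(m, k, sub)).astype(np.float32)
codes = rng.integers(0, k, size=(n, m), dtype=np.uint8)

query = rng.normal(size=d).astype(np.float32)

# 1) Distance table: squared L2 distance from each query sub-vector to the
#    256 centroids of its segment -> shape (m, k).
q_segs = query.reshape(m, sub)
table = ((codebooks - q_segs[:, None, :]) ** 2).sum(axis=-1)

# 2) Approximate distance to every encoded vector via table lookups + sum.
approx_dist = table[np.arange(m), codes].sum(axis=1)

# 3) TopK of the approximate distances.
top10 = np.argpartition(approx_dist, 10)[:10]
print(top10[np.argsort(approx_dist[top10])])
```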
This method greatly reduces the amount of computation. However, product quantization is a lossy compression method: it may reduce accuracy, but it dramatically compresses high-dimensional vectors to cut memory usage while keeping enough information for efficient similarity search. Taking the same example of querying one billion 128-dimensional vectors in OpenSearch, with IVF+PQ only 114 GB of memory is needed to build the ANN index, with a top-10 recall of 0.66 and a p50 query latency of around 117 ms; the query cost is greatly reduced, but at the price of a significant loss in recall.
4. Binary Quantization
Binary Quantization has received more attention recently (this article was written in early November 2024). OpenSearch released the Binary Quantization (BQ) feature in its recent 2.17 version[2], compressing vectors so that they can be stored and retrieved more cost-effectively. Meanwhile, Elasticsearch's recent 8.16 release[3] introduced the Better Binary Quantization (BBQ) feature, which further improves on the original BQ approach.
The idea of BQ is very straightforward: convert embedding vectors represented as floating-point numbers into binary form (i.e., a sequence of 0s and 1s) to achieve data compression and accelerated retrieval. Specifically, each dimension of the vector is kept as 1 if the original value is positive and 0 otherwise. This is shown in the figure below:
When computing similarity, BQ uses the Hamming distance. For example, suppose there are two vectors [0.1, 0.3, -0.1, 0.2, -0.23, 0.12] and [0.2, -0.1, 0.21, 0.11, -0.32, 0.01]. After BQ encoding they become [1, 1, 0, 1, 0, 1] and [1, 0, 1, 1, 0, 1]. XOR-ing the two codes bit by bit and counting the 1s gives a distance of 2. When measuring distance we expect closer vectors to yield lower values, and indeed the closer the directions of two vectors, the lower their Hamming distance.
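A minimal sketch of this binarization and Hamming-distance computation, reproducing the numbers above:

```python
import numpy as np

v1 = np.array([0.1, 0.3, -0.1, 0.2, -0.23, 0.12])
v2 = np.array([0.2, -0.1, 0.21, 0.11, -0.32, 0.01])

# Binary quantization: 1 where the value is positive, 0 otherwise.
b1 = (v1 > 0).astype(np.uint8)
b2 = (v2 > 0).astype(np.uint8)

# Hamming distance: XOR the codes bit by bit and count the 1s.
hamming = int(np.count_nonzero(b1 ^ b2))
print(b1, b2, hamming)   # [1 1 0 1 0 1] [1 0 1 1 0 1] 2
```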
BQ greatly reduces memory usage (storing each float32 dimension as a single bit cuts memory by a factor of 32) and also greatly reduces computational complexity: the Hamming distance involves only XOR operations and bit counting, which are very efficient and lead to a significant improvement in retrieval speed. It is also easy to see that BQ is a lossy encoding: from a vector perspective it preserves only the direction (the sign of each dimension), loses the length information (how far the vector extends in each dimension), and the original values cannot be recovered from the code. Although this sounds like a lot of information is lost, it has been shown to work very well in high-dimensional vector spaces.
4.1. Limitations of BQ in low dimensional space
Before explaining why BQ works very well in high-dimensional vector spaces, let's look at why BQ works less well in low-dimensional spaces.
Taking the two-dimensional space as an example, suppose the original vector distribution is shown in the left part of the figure below, and the distribution after BQ is shown in the right part of the figure below:
Fig: BQ with 2 dimensional vectors[3]
It can be clearly seen that in this example, all points in the 2D space are mapped into 4 "regions" after BQ. Points that fall into the same region collide, and the distance relationships between them are erased. For example, in the upper-right region, every vector has a positive value in both dimensions, so all of them are mapped to the same code [1, 1].
This problem is more pronounced for vector pools with some specific distributions, such as the one shown below:
Fig: Displays some counter-intuitive facts about hamming distances between 2d binary vectors[3]
Intuitively, the red points and yellow points are very close in distance, but after BQ, the yellow points are all quantized as [0,1] points and the red points are all quantized as [1,0] points. At this point after the Hamming distance calculation, they become the farthest points instead.
As these two examples show, BQ performs very poorly in low-dimensional vector spaces, mainly because the collision probability after BQ is too high. For example, in a two-dimensional space, the probability that any 2 points collide after BQ is 1/4.
4.2. BQ in higher dimensional spaces
The problem of vector conflicts in low dimensional spaces is greatly alleviated as the dimensional space increases. As the dimensionality increases, the number of "regions" mentioned above grows exponentially, reducing the probability of conflicts between vector representations.
In higher dimensional spaces, such as 756 dimensions, the number of regions becomes extremely large (2^756), which makes it very unlikely that even if there are billions or trillions of vectors, there will be a collision between them. In 1.5K dimensions, the number of regions is large enough to hold any practical number of vectors without a single collision. That's what makes BQ in high-dimensional spaces achieve very good results.
Note that in practice, in addition to 1-bit quantization, 2-bit and 4-bit quantization modes can also be used. Quantization with more bits generally achieves better accuracy, but also requires more memory for querying.
4.3. Rerank
In practice, when retrieving with BQ (the search can still be based on the HNSW algorithm), an additional rerank step is added to improve quality. Specifically, the BQ search is performed first (using the BQ representation of the query vector against the BQ-represented vectors in the pool) to retrieve topK * rescore_multiplier candidate vectors. rescore_multiplier is a tunable parameter: the higher its value, the more candidates are considered and the higher the recall is likely to be, but retrieval also takes longer. After these candidates are obtained (found via their BQ representations), the original float32 representation of the query vector is used to rerank them by dot product.
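A minimal sketch of this two-stage retrieve-then-rerank flow, using a brute-force Hamming scan as a stand-in for the graph search over BQ codes (rescore_multiplier and the data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, top_k, rescore_multiplier = 100_000, 128, 10, 4

vectors = rng.normal(size=(n, d)).astype(np.float32)   # full-precision pool
codes = vectors > 0                                     # 1-bit BQ codes
query = rng.normal(size=d).astype(np.float32)
q_code = query > 0

# Stage 1: coarse search in binary space (brute-force Hamming here; in practice
# this would run over an HNSW graph built on the BQ codes).
hamming = np.count_nonzero(codes ^ q_code, axis=1)
n_candidates = top_k * rescore_multiplier
candidates = np.argpartition(hamming, n_candidates)[:n_candidates]

# Stage 2: rerank the candidates against the float32 query by dot product.
scores = vectors[candidates] @ query
final = candidates[np.argsort(-scores)][:top_k]
print(final)
```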
Binary quantization has applications in several fields, including but not limited to text retrieval and image recognition. It improves the efficiency and speed of data processing by reducing the storage size and computational complexity of the data.
5. Better Binary Quantization
Better Binary Quantization (BBQ) was introduced in Elasticsearch 8.16 and Lucene. It is an algorithm inspired by the RaBitQ technique proposed by researchers at Nanyang Technological University in Singapore.
In the documentation published by Elasticsearch (ES), it is stated[4] that simple binary quantization leads to a large amount of information loss; to achieve sufficient recall, it is necessary to retrieve 10 or even 100 times more neighbors for reranking, which is not ideal. To address this limitation, ES introduced BBQ. Its significant differences from simple BQ are:
- All vectors are normalized around a centroid: this unlocks a number of properties that facilitate quantization
- Multiple error corrections are stored: some for the centroid normalization, some for the quantization
- Asymmetric quantization: the stored vectors are quantized to single-bit values, while the query vector is only quantized to int4. This significantly improves search quality without increasing storage cost
- Fast searches using bitwise operations: query vectors are quantized and transformed to enable efficient bitwise operations
Below we use the example given in the official ES blog[5] to explain the BBQ process in detail.
5.1 Constructing vectors
When quantizing raw vectors into bit-level vectors, our goal is to convert these vectors into smaller representations, while also:
- Provide a reasonable approximation for a quick estimate of distance
- Provide some guarantee on how the vectors are distributed in space, so that the number of data vectors that need to be examined to recall the true nearest neighbors can be better controlled
Both of these can be accomplished in the following ways:
- Project each vector onto a hypersphere. For example, for two-dimensional vectors this hypersphere is the unit circle
- Fix each vector to a representative point within each region of the circle
- The correction factor is retained to better approximate the distance between each vector in the vector pool and the query vector
Next we break down these steps step by step.
5.2 Standardization around a central point
To quantize each dimension, we need to choose a center point. To keep things simple, a single point is chosen to transform all the data vectors.
Taking the two-dimensional space as an example, suppose we have 3 vectors, respectively:
v1: [0.56, 0.82], v2: [1.23, 0.71], v3: [-3.28, 2.13]
The center point c is obtained by averaging each dimension:
X-dimension: (0.56 + 1.23 + (-3.28)) / 3 = -0.49
Y dimension: (0.82 + 0.71 + 2.13) / 3 = 1.22
Therefore, the centroid vector c is: [-0.49, 1.22] as shown below:
The original vectors are then normalized, i.e. the operation (v - c) / ||v - c|| is applied. From the formula we can see that the result is a unit vector pointing in the same direction as v - c but with length 1.
Take the calculation of v_c1 as an example:
- Calculate v1 - c = [0.56, 0.82] - [-0.49, 1.22] = [1.05, -0.39]
- Calculate ||v1-c|| = 1.13
- Calculate v_c1 = (v1 - c)/ ||v1-c|| = [0.94, -0.35]
You end up getting 3 points after doing the normalization around the center point:
v_c1=[0.94, -0.35],v_c2= [0.96, -0.28],v_c3=[-0.95, 0.31]
So far, the original points have been projected onto a "circle" centered at (0, 0) with radius 1.
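A minimal sketch of this centering-and-normalization step, reproducing the numbers above (up to rounding):

```python
import numpy as np

vectors = np.array([[0.56, 0.82],
                    [1.23, 0.71],
                    [-3.28, 2.13]])

c = vectors.mean(axis=0)                    # centroid ~ [-0.49, 1.22]
centered = vectors - c
norms = np.linalg.norm(centered, axis=1, keepdims=True)
v_c = centered / norms                      # unit vectors on the "circle"
print(c, norms.ravel(), v_c)
```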
5.3 Binary quantization
After normalization, the next step is to apply BQ to each vector to obtain the region each of the 3 vectors belongs to: r1 = [1, 1], r2 = [1, 1], r3 = [0, 0]
Next, quantization is completed by snapping each vector to a representative point within its region. Specifically, the point on the unit circle that is equidistant from the coordinate axes is chosen, i.e. the point whose coordinates are ±1/√2 (more generally, ±1/√d in d dimensions).
The quantized vectors are denoted v1_q, v2_q, and v3_q; each of the 3 normalized points is projected onto the representative point of its own region. Taking v1 as an example, each coordinate of v1_q becomes ±1/√2 ≈ ±0.71, with the sign determined by the region v1 falls in.
At this point we have a BQ approximation of each vector; it is rough, but it can be used for distance comparison. We can also see that v1 and v2 become the same point after BQ, as explained earlier; the probability of such a "collision" drops sharply as the dimensionality increases.
5.4 Storing error correction values
Like the original BQ, this encoding loses a lot of information, so we need some additional information to compensate for this loss and to correct the distance estimate
To restore precision, we store 2 float32 values:
- Distance from each raw vector to the center point
- The dot product between the normalized vector and its quantized form
The Euclidean distance from each vector to the centroid is straightforward; we already calculated it when quantizing each vector:
||v1-c|| = 1.13
||v2-c|| = 1.79
||v3-c|| = 2.92
Pre-computing the distance from each vector to the centroid allows recovering the vector to do the centering transformation. Similarly, we compute the distance from the query vector to the centroid. Intuitively, the centroid acts as an intermediary here, rather than directly calculating the distance between the query vector and the data vector.
The dot product of the normalized vector with its quantized vector is as follows:
v_c1 * v1_q = 0.90
v_c2 * v2_q = 0.95
v_c3 * v3_q = 0.89
The dot product between the quantized vectors and the original normalized vectors serves as the second correction factor, reflecting how much the quantized vectors deviate from their original positions.
It is important to note that our goal in performing this transformation is to reduce the total size of the data vectors and to reduce the cost of vector comparisons. These correction factors, while appearing large in our 2D example (requiring 1 float for storage), become negligible in their effect as the vector dimensions increase. For example, a 1024-dimensional vector that would take 4096 bytes to store as float32 would take only 136 bytes (1024 bit = 128 bytes, plus 8 bytes for 2 floats) with this bit quantization and correction factor.
5.5 Querying
Assuming the query vector q = [0.68, -1.72], it first needs to be quantized following the same steps when performing the query:
- Calculate q - c = [0.68- (-0.49), -1.72-1.22] = [1.17, -2.95]
- Calculate ||q-c|| = 3.17
- Calculate the vector q_c = (q - c)/ ||q-c|| = [0.37, -0.92] after projecting to the "circle".
Next, we perform scalar quantization (SQ) on the query vector, quantizing it to 4 bits; we call the result q_scalar. It is worth noting that we do not quantize the query down to a bit representation; instead we keep an int4 scalar quantization, i.e. q_scalar is stored as an array of int4 values used to estimate distances. This asymmetric quantization retains more information without increasing storage cost. Scalar quantization is not covered in detail here; interested readers can refer to the documentation[6].
The steps to perform scalar quantization (SQ) are:
- Get the range of values (i.e., the lower and upper scalar values) of the q_c vector: since it is only 2-dimensional, lower = -0.92 and upper = 0.37
- Calculate the quantization step: width = (upper - lower) / (2^4 - 1) = (0.37 - (-0.92)) / 15 ≈ 0.086
- Get the SQ result: q_scalar = ⌊(q_c - lower) / width⌋ = ⌊([0.37, -0.92] - [-0.92, -0.92]) / 0.086⌋ = [15, 0]
Since we only have two dimensions, the quantized query vector now consists of the upper and lower values of the int4 range.
When computing distances, each vector in the pool is compared against this query vector. This is done by summing the entries of the quantized query vector at the dimensions where the given data vector's bit is set. Essentially this is an ordinary dot product, just carried out on bits and bytes (see the sketch below).
For example, to compute the distance between the query vector q and v1 in the vector pool, the first thing used is the dot product of the quantized q_scalar and the quantized r1:
q_scalar * r1 = [15, 0] * [1, 1] = 15
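Below is a minimal sketch of the query-side scalar quantization and this bitwise dot product, reproducing the numbers above (np.round is used for bucketing here; the exact rounding rule is an implementation detail):

```python
import numpy as np

q = np.array([0.68, -1.72])
c = np.array([-0.49, 1.22])                      # centroid from above

# Project the query onto the unit circle around the centroid
q_c = (q - c) / np.linalg.norm(q - c)            # ~[0.37, -0.92]

# int4 scalar quantization of the query (asymmetric: data vectors stay 1-bit)
lower, upper = q_c.min(), q_c.max()
width = (upper - lower) / (2 ** 4 - 1)
q_scalar = np.round((q_c - lower) / width).astype(np.int8)   # ~[15, 0]

# Distance estimation starts from a bitwise dot product with a data vector's code
r1 = np.array([1, 1], dtype=np.uint8)
print(q_scalar, int(q_scalar @ r1))              # [15 0] 15
```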
Correction factors are then applied to rescale this quantized result so that it more accurately reflects the estimated distance. One of the correction factors is the distance from the query to the centroid, which was calculated earlier: ||q - c|| = 3.17.
5.6 Estimated distances
So far, we have quantized the query vector and obtained the correction factors. Now the approximate distance between the query vector q and the vector v1 can be computed. Using the Euclidean distance formula and expanding it:
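Expanding the squared Euclidean distance around the centroid c, one consistent way to write it in terms of the quantities defined above is:

||q - v1||^2 = ||(q - c) - (v1 - c)||^2 = ||q - c||^2 + ||v1 - c||^2 - 2 * ||q - c|| * ||v1 - c|| * (q_c · v_c1)

Here ||q - c|| and ||v1 - c|| are the correction factors already computed, and q_c · v_c1 is the remaining term to be estimated from the quantized codes.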
In this equation, most of the values have already been computed, e.g. ||v1 - c||. We still need values such as q_c · v_c1, which can be estimated quickly and reasonably from the correction factors obtained earlier together with the quantized binary distance computation. The subsequent calculations are more involved; interested readers can refer to the original article[5]. Eventually the approximate distances are obtained:
est_dist(v1, q) = 2.02
est_dist(v2, q) = 1.15
est_dist(v3, q) = 6.15
Compared with the real distances between the points before quantization:
real_dist(v1, q) = 2.55
real_dist(v2, q) = 2.50
real_dist(v3, q) = 5.52
It can be seen that the ranking produced by the quantized distance estimates is consistent with the ranking by the original distances.
5.7. Rerank
These estimated distances are, as the name suggests, only estimates. Even with the additional correction factors, the distances computed from binary-quantized vectors are only approximations of the true distances. In ES's experiments, high recall is achieved through a multi-stage process: to obtain high-quality results, a reasonably sized set of candidates returned by the binary quantization search is reranked using exact distance computations. These candidate sets can generally be small; on large datasets (>1,000,000 vectors), a recall above 95% can usually be achieved with 100 or fewer candidates.
In RaBitQ, results are continuously reordered during the search operation. In the ES experiments, we separated the reordering step from the search in order to achieve more scalable binary quantization. While RaBitQ is able to maintain a better list of top N candidates by reordering during the search process, it does so at the cost of constantly loading the full float32 vector. This is impractical for some of the larger production-type datasets.
6. ES official test of BBQ performance
ES officially tested the performance of BBQ in Lucene and Elasticsearch.
Lucene Benchmarking
The tests cover three datasets: E5-small, CohereV3, and CohereV2. For each dataset, the results show recall@100 at different oversampling rates (1, 1.5, 2, 3, 4, 5).
E5-small
- 500k vectors, based on the quora dataset.
- The BBQ quantization method outperforms the 4-bit and 7-bit quantization methods in terms of indexing time and memory requirements, and the recall performs well.
| Quantization method | Indexing time | Force merge time | Memory requirements |
| --- | --- | --- | --- |
| bbq | 161.84 | 42.37 | 57.6MB |
| 4 bit | 215.16 | 59.98 | 123.2MB |
| 7 bit | 267.13 | 89.99 | 219.6MB |
| raw | 249.26 | 77.81 | 793.5MB |
A recall of 74% is achieved using only single-bit precision. Due to the small number of dimensions, BBQ's distance calculation is not much faster compared to the optimized int4.
CohereV3
- 1M 1024-dimensional vectors using the CohereV3 model.
- BBQ equally outperforms the 4-bit and 7-bit quantization methods in terms of indexing time and memory requirements, and has a recall rate of over 90%.
| Quantization method | Indexing time | Force merge time | Memory requirements |
| --- | --- | --- | --- |
| bbq | 338.97 | 342.61 | 208MB |
| 4 bit | 398.71 | 480.78 | 578MB |
| 7 bit | 437.63 | 744.12 | 1094MB |
| raw | 408.75 | 798.11 | 4162MB |
In this case, 1-bit quantization combined with HNSW achieves over 90% recall with only 3-fold oversampling.
CohereV2
- 1M 768-dimensional vectors using the CohereV2 model.
- BBQ is comparable to the 4-bit quantization method in terms of indexing time and memory requirements, and has a good recall performance.
| Quantization method | Indexing time | Force merge time | Memory requirements |
| --- | --- | --- | --- |
| bbq | 395.18 | 411.67 | 175.9MB |
| 4 bit | 463.43 | 573.63 | 439.7MB |
| 7 bit | 500.59 | 820.53 | 833.9MB |
| raw | 493.44 | 792.04 | 3132.8MB |
In this benchmark, BBQ and int4 perform almost identically. Moreover, BBQ achieves this high recall on inner-product similarity with only 3-fold oversampling.
7. Summary
In large-scale vector retrieval, the traditional full-precision HNSW algorithm consumes a lot of memory; conversely, if the index is not loaded into memory, query latency becomes high. Following product quantization, both OpenSearch and Elasticsearch have introduced binary quantization methods that enable large-scale vector retrieval at lower cost while still guaranteeing good query latency and recall, addressing the recall loss seen with product quantization. Better Binary Quantization in particular, through vector normalization, error correction, asymmetric quantization, and other techniques, achieves a significant reduction in memory usage and an improvement in retrieval speed while maintaining high recall.
Elasticsearch's official test results show the strong performance of BBQ on different datasets: whether in recall, indexing time, or memory requirements, BBQ performs beyond expectations. These tests not only validate the practicality of BBQ, but also provide a reference for choosing a suitable vector retrieval technique in practical applications.
Overall, BBQ, as a new quantization technique, has great potential for its application in large-scale vector retrieval. It can not only help us process and retrieve massive data efficiently, but also significantly reduce resource consumption while maintaining high efficiency. With the continuous progress and optimization of the technology, we can foresee that quantization will play an increasingly important role in the field of data analysis and retrieval in the future.
References
[1] Choose the k-NN algorithm for your billion-scale use case with OpenSearch: /blogs/big-data/choose-the-k-nn-algorithm-for-your-billion-scale-use-case-with-opensearch/
[2] OpenSearch Binary Quantization: /docs/latest/search-plugins/knn/knn-vector-quantization/#binary-quantization
[3] 32x Reduced Memory Usage With Binary Quantization: /blog/binary-quantization#the-importance-of-data-distribution-for-bq
[4] Better Binary Quantization (BBQ) in Lucene and Elasticsearch: /search-labs/blog/better-binary-quantization-lucene-elasticsearch
[5] Scalar quantization 101: /search-labs/blog/scalar-quantization-101
[6] Better Binary Quantization (BBQ) in Lucene and Elasticsearch: /search-labs/blog/better-binary-quantization-lucene-elasticsearch#alright,-show-me-the-numbers