
ByteHouse High-Performance Vector Retrieval in Practice: "Search by Image"

Published 2024-08-02 18:19:46
 

 

With the development of LLM technology, vector retrieval and vector databases have drawn sustained industry attention. They can serve as an external memory for an LLM, supplying content related to the question and to historical answers so that the model returns more accurate responses. Beyond LLMs, vector retrieval has long been used in OLAP engines to improve the analysis and retrieval of unstructured data.
 
As the OLAP engine of Volcano Engine, ByteHouse now offers high-performance vector retrieval. This article describes how ByteHouse built this capability, using "search by image" as an example to explain how an OLAP engine applies vector retrieval in a concrete scenario.
 

ByteHouse vector search capability


Vector retrieval aims to find the k results most similar to a given vector, and is widely used in image search and recommendation systems. In practice the dataset often contains millions or even billions of vectors, while queries are expected to return within a few milliseconds to about a hundred milliseconds, so specially structured vector retrieval indexes are typically used.
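To make "find the k most similar results" concrete, here is a minimal exact top-k search in numpy. This is the brute-force baseline that index structures such as HNSW or IVF exist to avoid at billion-row scale; the data, dimensions, and L2 metric are illustrative assumptions, not ByteHouse internals.

```python
import numpy as np

def brute_force_topk(query: np.ndarray, vectors: np.ndarray, k: int):
    """Exact nearest-neighbor search: O(n * d) per query.

    Returns (indices, distances) of the k vectors closest to `query`
    by L2 distance. A full scan like this is what dedicated vector
    indexes accelerate.
    """
    dists = np.linalg.norm(vectors - query, axis=1)   # distance to every row
    idx = np.argpartition(dists, k)[:k]               # unordered top-k
    idx = idx[np.argsort(dists[idx])]                 # sort the k winners
    return idx, dists[idx]

# Toy data: 10k vectors of dimension 16.
rng = np.random.default_rng(0)
base = rng.standard_normal((10_000, 16)).astype(np.float32)
q = base[42] + 0.01 * rng.standard_normal(16).astype(np.float32)

ids, dists = brute_force_topk(q, base, k=5)
```

Even at only 10k rows this scans every vector per query; at the 1.2-billion scale discussed below, an approximate index becomes mandatory.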
 
  • Limitations of existing solutions
Popular vector indexing algorithms today include HNSW and Faiss IVF. Index-based vector retrieval workloads typically have long build times and high resource consumption, so several problems must be addressed: how to efficiently manage the resources an index-building task needs, how to keep retrieval computation in memory, and how to combine vector search with filter predicates without excessive resource consumption.
 
ByteHouse is derived from ClickHouse, but ClickHouse suffers from problems such as repeated loading of vector indexes and redundant similarity computation, which make it a poor fit for vector retrieval scenarios that demand low latency and high concurrency.
 
  • Optimized overall architecture
Based on this analysis, ByteHouse comprehensively reworked its vector retrieval capability. Following a vector-centric design, the team rebuilt an efficient vector retrieval execution path and combined it with mechanisms such as index caching and storage-layer filtering to achieve a further leap in ByteHouse's performance.
[Figure: ByteHouse vector search overall functional architecture]
 
Below, we take "search by image", a typical vector search application, as an example to explain in detail how ByteHouse implements and optimizes high-performance retrieval.
 

Vector Search Capabilities for the Image Search Scenario


In the field of public opinion monitoring, a leading company applies ByteHouse's vector search capability in a "search by image" scenario.
 
To provide better public opinion monitoring services, the company has kept expanding its monitoring scope across the whole network; the dataset has grown to 1.2 billion records, and the company expects sub-second search on 64 CPU cores and 256 GB of RAM.
 
Where comparable products in the industry report single-query latencies of several seconds to over ten seconds, ByteHouse reached 700-800 milliseconds before optimization; after a series of optimizations it can now retrieve 1,000 images and their similarity scores from this large-scale dataset in roughly 150-200 milliseconds.
 
So how does ByteHouse achieve this? Its optimizations fall mainly into three areas: vector retrieval computation pushdown, filtering optimization, and mitigation of the cold-read problem for vector data. In addition, because resources were limited, limits on index-building resources were also introduced.
 

Optimization 1: Computation pushdown

In a typical OLAP scenario, data arrives in batches and generates many Data Parts. In the original operator execution, Vector Search and attribute-column reads were performed for every Part, followed by a per-Part TopK. When there are many projection columns, this wastes considerable resources.
 
In response, the ByteHouse team made the following optimizations.
  • First, run Vector Search alone for each Part; this effectively splits the original operator into three, with Vector Search performed first.
  • Next, sort the Vector Search results globally, still without reading any scalar columns.
  • Finally, execute the Read Task only on the globally sorted result to produce the final output.
[Figure: computation pushdown optimization]
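The three-stage plan above can be sketched as follows. The Part layout, names, and label column here are illustrative assumptions rather than ByteHouse internals; the point is that the projection column is read only for the k global winners, not for every Part's candidates.

```python
import heapq
import numpy as np

def topk_per_part(query, part, k):
    """Stage 1: vector search inside one Data Part. Only distances and
    row ids are produced here; no scalar columns are read yet."""
    d = np.linalg.norm(part["vectors"] - query, axis=1)
    order = np.argsort(d)[:k]
    return [(float(d[i]), part["name"], int(i)) for i in order]

def search(query, parts, k):
    # Stage 2: global sort over the per-part candidates.
    candidates = []
    for part in parts:
        candidates.extend(topk_per_part(query, part, k))
    winners = heapq.nsmallest(k, candidates)
    # Stage 3: read the projection column only for the k global winners.
    by_name = {p["name"]: p for p in parts}
    return [(dist, by_name[name]["labels"][row]) for dist, name, row in winners]

rng = np.random.default_rng(1)
parts = [
    {"name": f"part_{i}",
     "vectors": rng.standard_normal((1000, 8)).astype(np.float32),
     "labels": [f"img_{i}_{j}" for j in range(1000)]}
    for i in range(4)
]
q = parts[2]["vectors"][7]          # query identical to one stored vector
hits = search(q, parts, k=3)
```

With hundreds of Parts, reading wide attribute columns for only k rows instead of k rows per Part is where the >2x latency gain reported below comes from.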
 
Data is typically written in batches, each batch producing a Part, and the number of Parts is usually well above 100. In real-scenario testing after this optimization, latency improved by more than 2x.
 

Optimization 2: Filtering operation optimization

When optimizing mixed scalar and vector query scenarios, the work focused on the following three points.
  • Range lookup based on the scalar primary key
The scalar here is a sort key, similar to a primary key in MySQL. A search usually filters on the scalar first to obtain the rows matching the predicate; evaluated row by row in the usual way, this is expensive.
 
Because the scalar column is ordered, it suffices to read and filter only the data at the two boundaries and build a row-id bitmap of the matching rows. For example, for a predicate such as "timestamp greater than some value", only the rows at the start and end positions need to be examined, since everything in between is guaranteed to match.
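On sorted data, "find the boundary, match everything past it" is just a binary search. A minimal sketch with Python's stdlib `bisect`; the timestamp column and predicate are illustrative assumptions.

```python
import bisect

# Rows are stored sorted by the scalar key (here: a timestamp column),
# so a predicate like `ts > t0` maps to one contiguous row-id range and
# needs no per-row evaluation.
timestamps = [100, 120, 120, 135, 150, 170, 200, 240]  # sorted on disk

def rows_matching_gt(ts_column, t0):
    """Binary-search the boundary; every row after it matches."""
    first = bisect.bisect_right(ts_column, t0)
    return range(first, len(ts_column))   # contiguous row-id range

matched = rows_matching_gt(timestamps, 135)
```

The resulting range (or bitmap built from it) can then be intersected with the vector search candidates instead of re-evaluating the predicate per row.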
 
  • Accelerated scalar column pruning
Data pruning physically arranges data partitions by key according to the actual workload, including partitioning via primary and secondary indexes, so that queries can skip irrelevant data.
 
  • Storage Layer Filtering
Optimization at the storage level pushes filter conditions down into storage to minimize IO. For both OLAP and OLTP databases, the low-level query path can carry high computational overhead, so ByteHouse performs more filtering at the bottom layer for typical scenarios, reducing CPU and memory usage. In addition, hot data rows are accelerated with a cache.
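The gain from pushing the filter into the scan can be sketched as follows: the cheap key column is checked first, and the wide payload column is materialized only for granules that can contain matches. The granule layout and column names are illustrative assumptions, not the actual storage format.

```python
def scan_with_pushdown(granules, predicate):
    """Evaluate the filter against the small key column of each granule
    first; the wide payload column is materialized only for granules
    that may contain matches, cutting both IO and CPU."""
    for granule in granules:
        keys = granule["keys"]                       # cheap, always read
        if not any(predicate(k) for k in keys):      # whole granule pruned
            continue
        payload = granule["payload"]                 # expensive column read
        for k, row in zip(keys, payload):
            if predicate(k):
                yield row

granules = [
    {"keys": [1, 2, 3], "payload": ["a", "b", "c"]},
    {"keys": [10, 11, 12], "payload": ["d", "e", "f"]},
]
rows = list(scan_with_pushdown(granules, lambda k: k >= 11))
```

Here the first granule's payload is never touched; at scale, skipping whole granules this way is what keeps IO proportional to the result rather than the table.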
 

Optimization 3: Vector data cold-read optimization

  • Using an index requires the index structure to be fully loaded into memory.
For vector indexing, ByteHouse integrates two popular retrieval libraries, hnswlib and Faiss, and supports many common index types such as HNSW, IVF_PQ, and IVF_PQ_FS. In addition, because vector retrieval must run in memory, a vector index caching mechanism was added to keep the indexes of the Data Parts involved in a query resident in memory, enabling low-latency vector retrieval.
 
When vector retrieval must sustain high QPS, cold reads can become the performance bottleneck: the currently supported vector indexes must be loaded into memory before high-performance retrieval can run, and optimization at this level mainly concerns how the index structures of different Parts are loaded into and kept in memory.
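The caching idea can be illustrated with a toy LRU cache over per-Part indexes. This is a simplified sketch of the general mechanism, not ByteHouse's actual cache; the class and the disk-load stand-in are assumptions for illustration.

```python
from collections import OrderedDict

class VectorIndexCache:
    """Toy LRU cache keeping recently used per-Part vector indexes
    resident in memory, so hot queries skip the expensive cold load."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._entries = OrderedDict()
        self.loads = 0                       # counts cold loads

    def get(self, part_name: str):
        if part_name in self._entries:
            self._entries.move_to_end(part_name)      # mark as hot
            return self._entries[part_name]
        index = self._load_from_disk(part_name)       # cold path
        self._entries[part_name] = index
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)         # evict the coldest
        return index

    def _load_from_disk(self, part_name: str):
        self.loads += 1
        return f"index({part_name})"   # stand-in for a real index load

cache = VectorIndexCache(capacity=2)
cache.get("part_0"); cache.get("part_1")
cache.get("part_0")          # hit: no extra load
cache.get("part_2")          # capacity exceeded: evicts part_1
```

A real implementation must also handle the preload and GC concerns described next: warming the cache at startup and dropping entries for Parts removed by background merges.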
 
  • Cache Preload & Auto GC
Problems can arise on the path from index build to memory, for example after a database restart, or when newly imported data is still cold. The OLAP storage uses an LSM-style format with data merges and an automatic GC process. At service startup and on data writes during operation, ByteHouse can automatically load new indexes into memory and ensure that expired entries in the cache are reclaimed automatically.
 

Optimization 4: Index Building Resource Limits

  • Vector database workload characteristics
Resource usage is a common concern with vector retrieval, which consumes substantial CPU and memory. To avoid excessive resource overhead, the ByteHouse team imposed limits at the resource-usage level.
 
  • Concurrency limits
Limit the thread counts of index construction and of background merge tasks when resources are scarce. However, throttling resources also introduces the problem of slower retrieval, so a balance must be struck between resource usage and latency.
 
  • Memory Optimization
To reduce resource overhead, the ByteHouse team mainly took the following steps.
 

Use PQ and SQ compression to cut vector storage to 1/4 or 1/3 of the original size. For example, when the precision requirement is not too high, float32 data can be compressed to INT8, shrinking each value from 4 bytes to 1 byte and reducing storage space.
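The float32-to-INT8 case mentioned above corresponds to scalar quantization (SQ). A minimal symmetric-scale sketch in numpy, with an illustrative scale choice; production SQ implementations typically train per-dimension ranges instead.

```python
import numpy as np

def sq_int8_encode(vectors: np.ndarray):
    """Symmetric scalar quantization: map float32 values onto int8,
    shrinking storage 4x at the cost of some precision."""
    scale = np.abs(vectors).max() / 127.0
    codes = np.clip(np.round(vectors / scale), -127, 127).astype(np.int8)
    return codes, scale

def sq_int8_decode(codes: np.ndarray, scale: float) -> np.ndarray:
    return codes.astype(np.float32) * scale

rng = np.random.default_rng(2)
vecs = rng.standard_normal((1000, 32)).astype(np.float32)
codes, scale = sq_int8_encode(vecs)
restored = sq_int8_decode(codes, scale)

ratio = vecs.nbytes / codes.nbytes          # float32 -> int8: 4x smaller
max_err = float(np.abs(vecs - restored).max())
```

The reconstruction error is bounded by half the quantization step, which is why this trade-off is acceptable when similarity ranking, not exact distances, is what matters.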

 

Support incremental training during index training. For the IVF family, indexes need not be memory-resident while being built and can be spilled to disk.

 
After these optimizations, ByteHouse achieves remarkable resource savings, supporting large-scale data with modest resources in scenarios like the "search by image" case above.
 

ByteHouse: a full-scenario engine, not just a vector search engine


Currently, ByteHouse supports multiple vector retrieval algorithms and an efficient execution path through its vector engine, and can handle large-scale vector retrieval scenarios at millisecond query latency.
 
Under the concept of "one copy of data, diversified engines", ByteHouse is committed to full-scenario engine coverage to maximize overall data performance. Besides the vector engine for vector retrieval, ByteHouse also offers a full-text search engine, a GIS engine, and other full-scenario engines, providing users with integrated data analysis services.
 
As a cloud data warehouse product designed for high performance and extreme analytical capability, ByteHouse's deployment inside ByteDance had already exceeded 18,000 nodes by February 2022, with a single cluster of more than 2,400 nodes. Going forward, it will continue to support enterprise data analysis and help drive digital-intelligence transformation.
 