
ClickHouse's vector processing capabilities

Published: 2024-07-30 20:38:04

Introduction

In the past, unstructured data (e.g., text, images, audio, video) was considered difficult to use directly in databases because of its diversity and complexity. With the development of embedding techniques, however, unstructured data can now be converted into vectors for efficient vector-related processing, such as similarity computation and search. This capability is important for applications such as recommender systems, image semantic search, and natural language processing. Some application scenarios are listed below:

  • Recommender systems: Particularly suitable for e-commerce sites, where relevant products are found through vector searches that also encode page views and past purchases.
  • Question answering systems: Encode different synonyms as vectors to improve the accuracy of a question answering system.
  • Image and video search: Use multimodal models to search images and videos by text, e.g., for music or movie recommendation systems.
  • Fraud detection: Encode user behavior or login patterns to detect anomalous behavior and prevent fraud.
  • Multi-language search: Cross-language search, where the same concept is encoded as the same vector in different languages.

The vector processing capability of a database is the ability of a database system to efficiently process and manipulate high-dimensional vector data. It includes:

  1. Representation of vector data
  2. Calculation of vector data
  3. Search of vector data
  4. Indexing of vector data
  5. Storage of vector data

By transforming unstructured data into vectors, a general-purpose OLAP database with vector processing capabilities will be able to process and analyze both structured and unstructured data in a unified and efficient manner. Combining structured and unstructured data on a single platform has the following benefits:

  1. Simplification of data management and database infrastructure
  2. Integrated analysis of structured and unstructured data
  3. Reuse of other features of the existing database system (e.g., filtering, full-text search, and analytics)

Generation of vectors

Vectors can have very high dimensionality and carry a large amount of information; they can represent words, sentences, documents, pictures, videos, genes, actions, and more. Vector data is generated from unstructured data through various techniques and algorithms. Here are some common approaches:

  1. Text data

    • Word embeddings: Convert words or phrases into vectors using models such as Word2Vec and GloVe.
    • Sentence and document embeddings: Convert sentences or documents into vector representations using pre-trained models such as BERT and GPT.
  2. Image data

    • Convolutional neural networks (CNNs): Extract an image's feature vector with a pre-trained image classification model (e.g., ResNet, Inception).
  3. Audio data

    • MFCC (Mel-frequency cepstral coefficients): Extract features of the audio signal.
    • Pre-trained models: Models such as VGGish convert audio clips into vectors.
  4. Video data

    • Frame-level embeddings: Extract image features for each frame of the video, then aggregate them.
    • Temporal models: Process video sequences with 3D convolutional networks or recurrent neural networks (RNNs) to generate video embeddings.

Once unstructured data is converted into vectors, it can be stored, retrieved, and computed efficiently in a database.

Users can download off-the-shelf models and further train or fine-tune them, and can download or generate datasets for experimentation. Vectors can be generated by running a model locally or by calling a public cloud service. Once you have a dataset, you need to decide where to store the vector data: a dedicated vector database, or a general-purpose database with vector capabilities. Here we choose ClickHouse, a general-purpose OLAP database, and examine its vector-related capabilities.

ClickHouse's vector capabilities

Following the dimensions listed above, we study and practice ClickHouse's vector capabilities.

Representation of vector data

In ClickHouse, vectors are represented as Array(Float32). Take process analysis as an example: a process variant is one execution path of a process (e.g., Apply->Return->Reapply->Approve->Execute). We create a table for process variants and add a column variant_embedding to store the embedding vectors of the variants. A sample table creation SQL is shown below:

create table t_process_variants(
    id Int64,
    activities Array(String),
    variant_embedding Array(Float32)
) engine = MergeTree()
order by id;

Embeddings can be generated by running the CLIP model locally, or by calling a public AI cloud service. Here I generate embedding vectors for the process variants via AliCloud's model service Lingzhi; each vector has 1536 dimensions.

Calculation of vector data

Common calculations on vector data include similarity calculations and mathematical operations on vectors.

Similarity calculation between vectors

One of the most common operations on vector data is searching for similar content, which requires calculating the distance between vectors. Two vectors separated by a small distance represent conceptually similar content. Commonly used distance metrics include:

  1. cosineDistance

    • Cosine distance (Cosine similarity - Wikipedia)
    • Measures the angle between two vectors; the distance is one minus the cosine similarity.
    • Calculation formula:
      [ \text{cosineDistance}(\mathbf{x}, \mathbf{y}) = 1 - \frac{\mathbf{x} \cdot \mathbf{y}}{||\mathbf{x}||_2 \cdot ||\mathbf{y}||_2} ]
    • Suitable for measuring similarity in the direction of vectors rather than absolute magnitude.
  2. L2Distance

    • Euclidean distance (Euclidean distance - Wikipedia)
    • Measures the straight-line distance between two vectors in space.
    • Calculation formula:
      [ \text{L2Distance}(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} ]
    • Commonly used for actual distance calculations.
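These two metrics can be sketched in plain Python (an illustration only; in ClickHouse they are built-in functions applied to Array(Float32) columns):

```python
import math

def cosine_distance(x, y):
    # 1 minus cosine similarity, matching ClickHouse's cosineDistance
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1 - dot / (norm_x * norm_y)

def l2_distance(x, y):
    # Euclidean distance, matching ClickHouse's L2Distance
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(cosine_distance([1, 0], [0, 1]))  # orthogonal vectors -> 1.0
print(l2_distance([1, 2], [2, 3]))      # sqrt(2)
```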

The following SQL query searches for similar process variants:

with [...] 
	as search_var
select activities,
    L2Distance(
        search_var,
        variant_embedding
    ) as dist
from t_process_variants
order by dist asc
limit 5

Or use cosine distance.

with [...] 
	as search_var
select activities,
    cosineDistance(
        search_var,
        variant_embedding
    ) as dist
from t_process_variants
order by dist asc
limit 5

In addition, ClickHouse offers other distance functions:

  1. L1Distance
    Manhattan distance.
  2. L2SquaredDistance
    The square of the Euclidean distance (i.e., the square root is omitted).
  3. LpDistance
    The p-norm distance.
  4. LinfDistance
    The L∞ distance (also known as the Chebyshev distance), the limiting case of the p-norm distance as p tends to infinity.
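A quick Python sketch of these additional metrics (illustrative only; the function names mirror the ClickHouse built-ins):

```python
def l1_distance(x, y):
    # Manhattan distance: sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

def l2_squared_distance(x, y):
    # Squared Euclidean distance: no square root taken
    return sum((a - b) ** 2 for a, b in zip(x, y))

def linf_distance(x, y):
    # Chebyshev distance: largest absolute coordinate difference
    return max(abs(a - b) for a, b in zip(x, y))

print(l1_distance([1, 2], [4, 6]))          # 7
print(l2_squared_distance([1, 2], [4, 6]))  # 25
print(linf_distance([1, 2], [4, 6]))        # 4
```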

Mathematical operations with vectors

Addition and subtraction of vectors can be done directly with the + and - operators:

SELECT [1, 2] + [2, 3]

Query id: 0cd21c01-b110-4278-8899-bfbce608724a

   ┌─plus([1, 2], [2, 3])─┐
1. │ [3,5]                │
   └──────────────────────┘
SELECT [1, 2] - [2, 3]

Query id: 4ceb3180-f690-4b22-8416-479a3772bf11

   ┌─minus([1, 2], [2, 3])─┐
1. │ [-1,-1]               │
   └───────────────────────┘

The dot product of two vectors can be computed with the dotProduct function.

ClickHouse does not provide a built-in cross product of vectors.

Other simple vector calculations, such as [1,2,3] / 2, can be done with arrayMap.
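For reference, these element-wise operations can be sketched in Python; in ClickHouse the division would be written as arrayMap(x -> x / 2, [1, 2, 3]):

```python
def dot_product(x, y):
    # Equivalent of ClickHouse's dotProduct([1, 2], [2, 3])
    return sum(a * b for a, b in zip(x, y))

def scalar_divide(vec, divisor):
    # Equivalent of arrayMap(x -> x / 2, [1, 2, 3]) in ClickHouse
    return [a / divisor for a in vec]

print(dot_product([1, 2], [2, 3]))  # 8
print(scalar_divide([1, 2, 3], 2))  # [0.5, 1.0, 1.5]
```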

Search of vector data

Vector searches typically combine a vector distance function and a threshold with conditions on other ordinary columns to produce the desired results. For example: find process variants that are similar to a given variant and in which John Doe participated.

with [...]
	as search_var
select activities,
    cosineDistance(
        search_var,
        variant_embedding
    ) as dist
from t_process_variants
where dist < 0.5
	and has(operators, 'John Doe') -- the operators column contains the names of all participants
order by dist asc
limit 3

Indexing of vector data

The vector search above is linear: computing the vector distances one by one gives a time complexity of O(n). Adding a vector index reduces the time complexity at the expense of accuracy, a trade-off that is tolerable in many scenarios.

Approximate Nearest Neighbor Search Index (Annoy Index)

The Annoy (Approximate Nearest Neighbors Oh Yeah) index is a data structure designed to improve the efficiency of large-scale nearest-neighbor vector search. While still experimental in ClickHouse, it strikes a balance between accuracy and computational efficiency. The Annoy index performs approximate nearest-neighbor search in high-dimensional space: it divides the space into small regions with random hyperplanes, each region containing a subset of the data points. These regions form a binary tree, with internal nodes representing hyperplanes, child nodes representing subregions, and leaf nodes containing the actual data points. Balance and efficiency of the tree are ensured by randomized insertions and heuristic algorithms.

The Annoy index currently supports two distance functions:

  1. cosineDistance
  2. L2Distance

How the Annoy index works:

  • Build: The Annoy index constructs its trees by repeatedly splitting the space with random hyperplanes. An index is created by specifying a distance function (DistanceName) and a number of trees (NumTree). DistanceName can only be L2Distance or cosineDistance, and defaults to L2Distance. NumTree is the number of trees the algorithm creates: more trees make index construction and queries slower but improve accuracy. By default, NumTree is 100.
  • Search: At query time, distances are estimated by comparing the query vector against the hyperplanes to decide which child node to explore, finally yielding an approximate set of nearest neighbors. This is much faster than a linear scan, with time complexity O(log n).
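The hyperplane-splitting idea can be illustrated with a toy Python sketch. This is not the production Annoy algorithm (which, among other refinements, chooses each hyperplane from sampled data points and searches multiple trees with a priority queue); it only demonstrates how random splits narrow a query down to a small candidate set:

```python
import random

def build_tree(points, leaf_size=4, max_depth=12):
    # Toy Annoy-style tree: recursively split points with a random
    # hyperplane through the origin until leaves are small enough.
    if len(points) <= leaf_size or max_depth == 0:
        return {"leaf": points}
    dim = len(points[0])
    normal = [random.gauss(0, 1) for _ in range(dim)]
    left = [p for p in points if sum(a * b for a, b in zip(p, normal)) < 0]
    right = [p for p in points if sum(a * b for a, b in zip(p, normal)) >= 0]
    if not left or not right:  # degenerate split, stop recursing
        return {"leaf": points}
    return {"normal": normal,
            "left": build_tree(left, leaf_size, max_depth - 1),
            "right": build_tree(right, leaf_size, max_depth - 1)}

def search(tree, query):
    # Follow the query's side of each hyperplane down to one leaf;
    # the leaf's points are the approximate-neighbor candidates.
    while "leaf" not in tree:
        side = sum(a * b for a, b in zip(query, tree["normal"]))
        tree = tree["left"] if side < 0 else tree["right"]
    return tree["leaf"]

random.seed(42)
pts = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(200)]
tree = build_tree(pts)
candidates = search(tree, [0.1, 0.2, 0.3])
print(len(candidates))  # a small subset of the 200 points
```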

You can create an Annoy index when creating or altering a table by specifying the index type as annoy in the INDEX clause:

INDEX {ann_index_name} {vector column} TYPE annoy([{distance function}[, {number of trees}]])

Example:

alter table t_process_variants
add index variant_embedding_idx variant_embedding type annoy('cosineDistance', 100)

Usage notes:

  1. The Annoy index applies to queries that use ORDER BY DistanceFunction(Column, vector) or WHERE DistanceFunction(Column, Point) < MaxDistance, and a LIMIT clause must be used to control the number of results returned.
  2. The setting allow_experimental_annoy_index must be turned on for the Annoy index to take effect; it is off by default.

Storage of vector data

Storing vectors as Array(Float32) is a bit of a luxury, because in many cases 16-bit floating-point precision is enough. The industry has a dedicated type for exactly this, bfloat16, which is used in many datasets and trained models.

But ClickHouse does not currently have a Float16-like type, so we still have to use Array(Float32). However, a bfloat16 value is equivalent to a Float32 whose low 16 bits are zero, so we can clear the low 16 bits of each Float32 component of a vector before storing it. The zeroed low bits are then squeezed out by the compression algorithm when stored.
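The truncation can be sketched in Python with the standard struct module; truncate_to_bfloat16 is a hypothetical helper name, and the same bit manipulation would be applied to each vector component before insertion:

```python
import struct

def truncate_to_bfloat16(x: float) -> float:
    # Reinterpret the Float32 bit pattern, zero its low 16 bits,
    # and reinterpret back -- this keeps only bfloat16 precision.
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return struct.unpack('<f', struct.pack('<I', bits & 0xFFFF0000))[0]

print(truncate_to_bfloat16(3.14159265))  # 3.140625 (low mantissa bits dropped)
print(truncate_to_bfloat16(1.0))         # 1.0 (already representable)
```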

Summary

ClickHouse has powerful vector processing capabilities for efficiently handling high-dimensional vector data. By converting unstructured data (e.g., text, images, audio, and video) into vectors, ClickHouse supports a wide range of applications such as recommender systems, Q&A systems, and image and video search. Vector data is represented in ClickHouse as Array(Float32). ClickHouse supports several similarity metrics, such as cosine distance and Euclidean distance, as well as an approximate nearest-neighbor index (the Annoy index), which improves search efficiency by splitting the space with random hyperplanes and is suitable for large-scale vector search. In addition, storage space can be saved by truncating vectors to bfloat16 precision, enabling efficient compression and storage.

Overall, by efficiently processing and searching high-dimensional vector data, ClickHouse's vector database capabilities deliver significant value across many application areas, making it a good fit for recommender systems, Q&A systems, image and video search, and more.