An old hand walks you through vector databases

Introduction

With the development of artificial intelligence and big data technology, traditional databases struggle to meet the needs of certain complex application scenarios. This is especially true for unstructured data such as images, speech, and text, where the traditional approach of exact matching falls short. In recent years, vector databases have become a hot topic. Their core advantage is that, by transforming data into high-dimensional vectors, they can retrieve the closest results through similarity computation instead of relying on the exact matching used in traditional databases. More and more application scenarios, especially image search and speech recognition, now rely on vector databases. Numerous vector database products and libraries have also sprung up, such as Faiss, Milvus, Pinecone, Weaviate, and Vespa.

This article walks you through the principles and applications of vector databases in a relaxed, conversational way, and compares them with traditional databases to show what makes the underlying technology interesting.

Interesting dialog

newbie: I've heard of traditional databases, but lately I keep hearing people talk about "vector databases". What does that mean? I'm a database noob and I'm totally confused... 😂

knowledgeable person: Haha, don't panic, I'll explain it in plain language. Let's start with traditional databases. You've heard of them, right?

newbie: Well, I know a little. It's the kind of thing that stores data and lets you query it, isn't it?

knowledgeable person: Right! Traditional database search is built mainly on indexing and sorting structures such as B-trees, LSM trees, and hash indexes, and text search adds ranking methods you may have heard of, such as BM25 and TF-IDF. Simply put, it relies on matching the literal text to find data. 😬

newbie: Oh, so I type in a keyword and it goes to the database and finds the exact same thing, right?

knowledgeable person: That's right! For example, suppose you search for the nickname "provincial children's hospital". The database may not store that phrase at all. What it actually stores is "Children's Hospital of Zhejiang University School of Medicine". Because the two strings are different, a traditional database is simply stumped and finds nothing. 😂
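For readers who like to see things in code, here is a minimal Python sketch of that failure. It assumes a plain dictionary as a stand-in for an exact-match index; the record and the query strings are made up for illustration.

```python
# Toy "traditional" lookup: a record is found only when the query string
# matches the stored key exactly. The record below is hypothetical.
records = {
    "Children's Hospital of Zhejiang University School of Medicine": "record #1 (hypothetical)",
}

def exact_lookup(query):
    """Return the stored record for an exact key match, or None."""
    return records.get(query)

# The full official name matches the stored key character for character.
full_name_hit = exact_lookup("Children's Hospital of Zhejiang University School of Medicine")

# A nickname-style query refers to the same place but is a different string.
nickname_hit = exact_lookup("provincial children's hospital")

print(full_name_hit)
print(nickname_hit)
```

The second lookup comes back empty even though a person would recognize both queries as the same hospital.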

newbie: Haha, that's too dumb! You're obviously talking about the same place, but it can't find it?

knowledgeable person: Right! That's the thing about traditional databases: they're good at exact keyword matches, but they're really not up to the task for semantically related content like this. That's why we have vector databases! ✌️

newbie: Oh? How does a vector database solve this problem?

knowledgeable person: What's great about a vector database is that it doesn't look at the surface form of the words. Instead, it uses mathematical methods to "understand" that "provincial children's hospital" and "Children's Hospital of Zhejiang University School of Medicine" are semantically related. Each word or paragraph is transformed into a set of numbers, the so-called "high-dimensional vector", and the database then compares how similar those numbers are. That way, even if you don't enter exactly the same words, it will find related results. ✌️✌️
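To make the idea of comparing vectors for similarity concrete, here is a small sketch using cosine similarity, one common similarity measure. The vectors are invented by hand for illustration; in practice an embedding model would produce them, and they would typically have hundreds or thousands of dimensions rather than four.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors (assumptions, not real embeddings).
nickname = [0.9, 0.8, 0.1, 0.0]    # stands in for the hospital's nickname
full_name = [0.8, 0.9, 0.2, 0.1]   # stands in for the hospital's full name
unrelated = [0.0, 0.1, 0.9, 0.8]   # stands in for some unrelated text

sim_related = cosine_similarity(nickname, full_name)
sim_unrelated = cosine_similarity(nickname, unrelated)

print(f"nickname vs full name: {sim_related:.3f}")
print(f"nickname vs unrelated: {sim_unrelated:.3f}")
```

The two vectors that point in nearly the same direction score close to 1, while the unrelated one scores close to 0, even though no strings were compared at all.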

newbie: Wow, so how does it do that? Is it just guessing?

knowledgeable person: You could put it that way. A model is trained on a lot of data to pick out the features that similar words, sentences, or pictures have in common. It's like looking at two pictures: if the colors and shapes are similar, you'll think the pictures are similar, right? Vector databases work on the same principle, breaking the data down into features along many dimensions and then comparing them. 😬

newbie: I see! So is it just for these "fuzzy searches"?

knowledgeable person: Right! Remember the "search by image" feature that Taobao and Baidu launched a few years ago? That kind of scenario is almost impossible to handle with a traditional database, but a vector database can break an image down into values along various dimensions and then find the entries in the database whose values are closest. Put plainly, it's much better at searching unstructured data like images and speech. 😬
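A rough sketch of what "find the closest entries" can look like, assuming each image has already been reduced to a short feature vector. The file names and numbers are hypothetical, and the search below is a brute-force scan using Euclidean distance; real vector databases usually rely on approximate nearest-neighbor indexes instead of comparing against every entry.

```python
import heapq
import math

# Hypothetical image feature vectors (think rough color and shape scores).
image_index = {
    "red_sneaker.jpg": [0.9, 0.1, 0.3],
    "red_boot.jpg": [0.8, 0.2, 0.4],
    "blue_teapot.jpg": [0.1, 0.9, 0.7],
    "green_backpack.jpg": [0.2, 0.6, 0.1],
}

def search_by_image(query_vector, index, k=2):
    """Return the names of the k stored vectors closest to the query (nearest first)."""
    return heapq.nsmallest(
        k, index, key=lambda name: math.dist(query_vector, index[name])
    )

# Feature vector of the photo the user uploads (also made up).
query = [0.88, 0.12, 0.3]
top_matches = search_by_image(query, image_index, k=2)
print(top_matches)
```

Nothing in the index is identical to the query, yet the two red shoes come out on top because their vectors sit closest to it.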

newbie: Sounds pretty fancy! I'm still a bit confused though. What exactly is the difference between a vector database and a traditional one?

knowledgeable person: Let me sum it up. Traditional databases rely on exact lookups: they are good at finding data that matches exactly, and they have mature indexes and algorithms for it. Vector databases, on the other hand, do approximate lookups: they don't look for an identical match, they look for results that are "good enough". In theory, they can also do very accurate lookups if the feature dimensions of the object are made fine-grained enough. 😂

newbie: Wouldn't that mean that vector databases are smarter and can solve all problems?

knowledgeable person: That's the dream, but reality is harsher. Vector databases are powerful, but the more dimensions they handle, the heavier the load on the system and the higher the computational cost. So they are generally used only for approximate queries, and they cannot replace every database application.
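A back-of-the-envelope sketch of why more dimensions mean more load, assuming the simplest possible strategy: compare the query against every stored vector with a dot product. The collection size and the dimension counts are illustrative numbers, not measurements of any product.

```python
def brute_force_multiply_adds(num_vectors, dimensions):
    """Rough count of multiply-add operations for one query that is
    compared against every stored vector with a dot product."""
    return num_vectors * dimensions

# Same hypothetical collection of one million vectors, two vector sizes.
low_dim = brute_force_multiply_adds(1_000_000, 128)
high_dim = brute_force_multiply_adds(1_000_000, 1536)

print(f"128 dimensions:  {low_dim:,} multiply-adds per query")
print(f"1536 dimensions: {high_dim:,} multiply-adds per query")
print(f"ratio: {high_dim // low_dim}x")
```

Real systems use approximate indexes to avoid scanning everything, but the cost of each individual comparison still grows with the number of dimensions.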

newbie: Got it! It's meant for the approximate lookups that traditional databases can't do, especially for images, speech, or anything with semantic associations.

knowledgeable person: Exactly! A vector database is like giving the database eyes that "understand" the data, so it can find similarities instead of just rigidly matching keywords. So, do you know a bit more about vector databases now? 😂

newbie: Haha, now I get it! This is pretty interesting, I'll have to look into it some more!

knowledgeable person: That's right, welcome to the world of databases! ✌️

This article is copyrighted and may not be reproduced without the author's permission.