
Miscellaneous Thoughts on the Technical Architecture of RAG / Large Model / Unstructured Data Knowledge Base Products


1. Preamble

At the RAG session of the Rare Earth Nuggets (Juejin) Developer Conference on June 28-29, our company's CEO represented TorchV and shared our experience with the difficulties and innovations of landing RAG in enterprise applications.

The talk ended with two takeaways:

  • AI applications that land in real scenarios share three characteristics: small in function, high in quality, and high in value.

  • If building a product is about getting the horizontal right, then taking it to an enterprise as a landing service is the vertical: from requirements and solution design, to POC, to final delivery.

Regarding the three characteristics of AI applications, we have in fact run into quite a few problems during delivery. Anyone who has used large models or AI products knows that most current C-end applications built on large models feel like relatively small tools: with their help, people can handle tedious daily work and learning tasks much more efficiently. Examples include AI translation and web-summary plug-ins. These products are C-end oriented; thanks to the Internet ecosystem and open-source technology, as long as the functionality and interaction meet user expectations, C-end users can quickly be persuaded to try them or even pay.

B-side products are a different story. The whole delivery process differs from the C-end: beyond the product itself being powerful enough, AI landing and delivery also involves private customization, client training, private deployment, software and hardware adaptation, and other tedious work, and the overall delivery cycle is much longer. This echoes the second point above: only an integrated "products + services" approach can serve B-side customers well.

Drawing on our company's considerations when delivering RAG / large-model application products to B-side customers, this article summarizes my thinking on the technical architecture of knowledge-base products in real-world scenarios.

2. Business functions / technical components: breakdown and abstraction

 

Figure 3 - Business Architecture

In the title of this article I have already marked out the scope: RAG, large models, and unstructured data.

Starting from these three aspects, how do we think about these new technical terms at the software level, break them down from a technical and product-functionality point of view, and implement the corresponding functionality to deliver to our customers?

In terms of business functional requirements, there are several main areas:

  • Knowledge base: customers need to collect and process business data in a unified way, forming a knowledge base that the LLM can consume.
  • Application center: B-side customers need out-of-the-box applications that solve real business problems.
  • User permissions: the system must provide flexible, controllable, enterprise-grade permission management, so that enterprise customers can manage and authorize in a unified way.
  • Multi-tenancy: a multi-tenant architecture is essential, with data isolated at the schema level to guarantee data security and to flexibly support upper-layer applications (a minimal routing sketch follows this list).
  • ...
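
As an illustration of what schema-level isolation can look like, here is a minimal Java sketch that binds every database connection to one tenant's schema before running queries. It assumes a PostgreSQL-style database with one schema per tenant; the DataSource wiring, the tenant-id naming convention, and the table name are placeholders for the example, not our actual implementation.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

public class TenantAwareRepository {

    private final DataSource dataSource;

    public TenantAwareRepository(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    /** Opens a connection bound to the tenant's schema before any query runs. */
    public Connection openConnectionFor(String tenantId) throws SQLException {
        Connection conn = dataSource.getConnection();
        // Schema-per-tenant: unqualified table references now resolve
        // inside this tenant's schema only.
        conn.setSchema("tenant_" + tenantId);
        return conn;
    }

    /** Example query that cannot leak rows across tenants. */
    public int countDocuments(String tenantId) throws SQLException {
        try (Connection conn = openConnectionFor(tenantId);
             ResultSet rs = conn.createStatement()
                     .executeQuery("SELECT COUNT(*) FROM knowledge_document")) {
            return rs.next() ? rs.getInt(1) : 0;
        }
    }
}
```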

From the technical side, engineers need to pay attention to:

  • Unstructured data processing: the platform needs to support extraction and processing of a wide range of unstructured data, chunking the full document content and embedding it into the database so that it can be retrieved.
    • Breadth of document types: supporting extraction from many unstructured document formats (PDF, PPT, Word, etc.) is a strong selling point for B-side customers.
    • Parsing accuracy: documents such as PDF, PPT, and Word are hard to parse; pushing parsing quality further reduces, at the root, the model's hallucination problem when answering from known data.
    • Task scheduling: data processing relies on a stable task-scheduling platform to ensure jobs are eventually executed in an orderly way.
  • Model services: LLMs, reranker models, embedding models, OCR models, vision models, and more. Their outputs should be served in a stable, idempotent way, providing reliable support for upper-layer applications.
    • LLM: offers a range of agent services, so the upper business layer can flexibly invoke the large model and get satisfactory results.
    • Reranker: reranking models are a key tool for improving accuracy in two-stage retrieval for Q&A and should not be ignored.
    • Embedding model: vectorized embedding extracts the vector representations that characterize knowledge texts and cannot be neglected.
    • OCR / vision models: when rule-based extraction falls short, OCR and vision models step in to improve the quality of unstructured data extraction.
  • VectorDB: the choice of vector database has to weigh actual business requirements in terms of performance, storage footprint, and ecosystem, among other factors.

Broken down from the technical point of view, engineers actually care about a great many points; each of these jobs could be a standalone middleware product in its own right, and integrating them all into one piece is not an easy task.
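
To make the "parse, chunk, embed, store" part of that work concrete, here is a minimal Java sketch of a document ingestion flow. The DocumentParser, EmbeddingClient, and VectorStore interfaces are hypothetical seams for whatever parser, embedding service, and vector database a team actually uses; only the fixed-size overlapping chunker is fully implemented, and real systems usually chunk by sentence or section instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical seams around real components: a parser (PDFBox/Tika/...),
// an embedding service, and a vector database client.
interface DocumentParser { String extractText(byte[] rawFile); }
interface EmbeddingClient { float[] embed(String chunk); }
interface VectorStore { void upsert(String tenantId, String chunk, float[] vector); }

public class IngestionPipeline {

    private final DocumentParser parser;
    private final EmbeddingClient embedder;
    private final VectorStore store;

    public IngestionPipeline(DocumentParser parser, EmbeddingClient embedder, VectorStore store) {
        this.parser = parser;
        this.embedder = embedder;
        this.store = store;
    }

    /** Parse -> chunk -> embed -> store, for one uploaded file of one tenant. */
    public void ingest(String tenantId, byte[] rawFile) {
        String fullText = parser.extractText(rawFile);
        for (String chunk : chunk(fullText, 500, 50)) {
            store.upsert(tenantId, chunk, embedder.embed(chunk));
        }
    }

    /** Fixed-size character chunking with overlap; illustrative only. */
    static List<String> chunk(String text, int size, int overlap) {
        List<String> chunks = new ArrayList<>();
        int step = size - overlap;
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + size, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) break;
        }
        return chunks;
    }
}
```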

3. Microservices / distributed / cloud native?

Anyone who has written Java is probably familiar with all three of these terms. I remember people saying a long time ago that if you didn't know microservices you couldn't pass an interview (PS: now it seems you can't get a job no matter what you know 😂).

For AI applications, the software ecosystem is driven more by Python; tool libraries such as LangChain and LlamaIndex are Python. Java is not absent, with components such as LangChain4j and Spring AI, but as latecomers they do lag a fair way behind in overall ecosystem maturity and stability.

But many people who have used LangChain and similar frameworks probably share a consensus: as tools they are fine, but in production? Too many problems. My main points:

  • LangChain is over-encapsulated. At the application layer, whether it is an agent or RAG, the work is actually quite simple: you are mainly wiring up the large model's API. But look at LangChain's source code and the whole call chain is wrapped in layers of abstraction that are extremely complex and hard to change.
  • Upper-layer business requirements change too much, and sometimes they have to be handled according to your own company's actual situation. In that case it is faster to write it yourself; the call chain really is not complicated (see the sketch after this list).
  • In terms of stability, transactions, and data consistency, is Python really the right main language for enterprise service interfaces?
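
To show how thin that call chain can be, here is a minimal Java sketch that calls an OpenAI-compatible chat-completion endpoint directly with the JDK's built-in HttpClient. The base URL, API key environment variable, and model name are placeholder assumptions, and a real service would build and parse the JSON with a proper library rather than string literals.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DirectLlmCall {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint/key/model: swap in whatever OpenAI-compatible service you run.
        String baseUrl = "https://example-llm-gateway.local/v1/chat/completions";
        String apiKey = System.getenv("LLM_API_KEY");

        String body = """
                {"model":"your-model-name",
                 "messages":[{"role":"user","content":"Summarize RAG in one sentence."}]}""";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(baseUrl))
                .header("Content-Type", "application/json")
                .header("Authorization", "Bearer " + apiKey)
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // Raw JSON response; extract the assistant message with your JSON library of choice.
        System.out.println(response.body());
    }
}
```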

What we are discussing today is the technical architecture of the whole product. In the section above on business functions and technical components we already split out the functional and technical points; from a technical standpoint this is already an integrated solution composed of many services. At the application level, do we still need to roll out a full microservices architecture, as before, to build the business functions?

My personal opinion: depending on how the team is staffed, microservices are optional. But the application must be distributed by nature, support horizontally scaled clusters, and be elastically scalable.

In the current environment, if a project goes all-in on microservices, the eventual dilemma may be that one person ends up responsible for all of the services, writing service after service, wiring up RPC calls, and also having to worry about circuit breaking, availability, and so on. For a small team I don't think it is worth the churn!

Key points to consider:

1. Efficiency gains in massive unstructured data processing

For RAG products, unstructured data not only needs to be parsed quickly, the text also needs to be vectorized, and the technical architecture has to be able to process these files fast: a pipeline ultimately lands the unstructured data in the vector database. Traditionally this is where message middleware (MQ) comes in, and at the application level the consumer side can scale elastically, adding consumer nodes to improve overall processing throughput.
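
Here is a minimal in-process Java sketch of that consumer pattern, using a BlockingQueue and a fixed worker pool as a stand-in for a real message broker (RocketMQ, RabbitMQ, Kafka, and so on). "Scaling out" then simply means raising the worker count, or, with a real broker, adding consumer instances; the FileTask record and the process() body are illustrative placeholders.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class ChunkingConsumers {

    record FileTask(String tenantId, String fileId) {}

    public static void main(String[] args) {
        // Stand-in for an MQ topic: uploaded-file tasks waiting to be parsed/embedded.
        BlockingQueue<FileTask> queue = new LinkedBlockingQueue<>();

        // "Elastic scaling" in miniature: raise this number (or, with a real broker,
        // add consumer instances) to increase processing throughput.
        int consumerCount = 4;
        ExecutorService consumers = Executors.newFixedThreadPool(consumerCount);
        for (int i = 0; i < consumerCount; i++) {
            consumers.submit(() -> {
                // In a real service these consumers run for the lifetime of the process.
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        FileTask task = queue.take();
                        process(task); // parse -> chunk -> embed -> store
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }

        // Producer side: the upload API just enqueues and returns immediately.
        queue.offer(new FileTask("tenant-001", "file-abc"));
    }

    static void process(FileTask task) {
        System.out.println("processing " + task.fileId() + " for " + task.tenantId());
    }
}
```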

2. Storage/computation recall efficiency for massive vector data

After the unstructured data is extracted, it is chunked and passed through the embedding model for vectorization, so the underlying storage and computation of vector data inevitably requires a careful, well-rounded choice of vector database middleware, covering recall performance, data storage and backup, schema-level multi-tenant data permissions, and so on.
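
To show what the recall computation a vector database optimizes actually is, here is a brute-force Java sketch of cosine-similarity top-k search over in-memory vectors. A real VectorDB replaces this with ANN indexes (HNSW, IVF, and so on) plus tenant-permission filtering, but the semantics of the recall stage are the same.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class BruteForceRecall {

    /** Cosine similarity between a query vector and a chunk vector. */
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB) + 1e-12);
    }

    /** Returns the ids of the top-k most similar chunks; O(n*d), fine only for small corpora. */
    static List<String> topK(float[] query, Map<String, float[]> chunkVectors, int k) {
        return chunkVectors.entrySet().stream()
                .sorted(Comparator.comparingDouble(
                        (Map.Entry<String, float[]> e) -> -cosine(query, e.getValue())))
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```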

3. Final consistency of data

With embedding jobs, large-model invocation and quota deduction, caching, and the rest, the system is already split across numerous service components. I think the data-processing flow as a whole needs to guarantee eventual consistency of the data, and in a distributed, multi-node setting this deserves special attention.
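
One concrete ingredient of eventual consistency is making every processing step idempotent and safely retryable. The sketch below is a toy Java version that uses an in-memory map as a stand-in for a task table; a real system would do the same compare-and-set against the database (for example an UPDATE guarded by the expected status) and rely on the MQ to redeliver failed steps.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class TaskStateStore {

    enum Status { PENDING, EMBEDDING, DONE, FAILED }

    // Stand-in for a persistent task table keyed by task id.
    private final ConcurrentMap<String, Status> tasks = new ConcurrentHashMap<>();

    public void register(String taskId) {
        tasks.putIfAbsent(taskId, Status.PENDING);
    }

    /**
     * Idempotent transition: only succeeds if the task is still in the expected state,
     * so a redelivered MQ message or a retried worker cannot apply the same step twice.
     */
    public boolean transition(String taskId, Status expected, Status next) {
        return tasks.replace(taskId, expected, next);
    }

    public void processWithRetry(String taskId, Runnable embedStep, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (!transition(taskId, Status.PENDING, Status.EMBEDDING)) {
                return; // another node already picked it up (or it is finished): safe to skip
            }
            try {
                embedStep.run();
                transition(taskId, Status.EMBEDDING, Status.DONE);
                return;
            } catch (RuntimeException e) {
                transition(taskId, Status.EMBEDDING, Status.PENDING); // roll back for the next attempt
            }
        }
        transition(taskId, Status.PENDING, Status.FAILED);
    }
}
```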

4. Atomicity of application functions (cloud native)

Every function at the application layer, I think, needs to maintain independence and guarantee stability. In my experience this matters most in the private deployment and delivery stage: if you are an ops engineer or lead developer, you will appreciate the convenience when deploying into a fully intranet-isolated environment.

In short, I think application-level services should be what server-side software is supposed to be: minimal configuration, lightweight, and stable.

4. Programming language/middleware selection?

Our team currently develops in a combination of Java + Python, with roughly this division of responsibilities:

  • Java: API interfaces for upper-layer business applications, task scheduling, data processing, etc.
  • Python: model serving, data processing, NLP, and other related tasks, exposed as interfaces; these APIs are stateless, and all data-state flow is handled on the Java side.

Many developers may have concerns about choosing Java here: is it still a good fit in the current RAG / large-model field? The biggest source of confusion is unstructured data processing. Many developers see that the popular open-source components and platforms are mostly Python stacks and conclude that Java cannot handle it, which is a complete misunderstanding. For the hardest case, PDF extraction, Apache PDFBox is definitely a component worth digging into (a short extraction sketch appears below). Of course, Python is inherently strong at data processing and analysis, so the choice can depend on how the team is staffed. These are the main points I would consider:

1. Team staffing

Make the architecture selection and decisions based on the team's current mainstream programming language; there is no absolute answer as to which language must be primary. Java, Python, Go, NodeJS, TypeScript, and others can all work.

2. Software ecology & technology maturity

For upper-layer application product development, the first consideration is surely which mature middleware and components can be used to cover all of these needs; you cannot reinvent every wheel from 0 to 1. Building your own wheels certainly sharpens developers' skills, but with AI developing as fast as it is today, finding PMF for the company's product as early as possible is the top priority, so this has to be weighed comprehensively.

I can't speak for other programming languages, but for parsing unstructured data, both Python and Java offer relatively rich and stable options.

Good options on the Java side include Apache PDFBox, POI, and Tika (see the sketch below).

On the Python side: PyMuPDF, pdfplumber, pypdf, camelot, python-docx, etc.
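
As a taste of the Java side, here is a minimal text-extraction sketch using Apache PDFBox (the 2.x API, with PDDocument.load, is assumed); real pipelines layer table and layout handling, plus an OCR fallback, on top of this.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfTextExtractor {

    /** Extracts plain text from a PDF file using PDFBox 2.x. */
    public static String extract(File pdfFile) throws IOException {
        try (PDDocument document = PDDocument.load(pdfFile)) {
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setSortByPosition(true); // helps keep reading order on multi-column layouts
            return stripper.getText(document);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(extract(new File("sample.pdf")));
    }
}
```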

3. Stability/clustering/high availability

Well, there is no "high concurrency" item here, because there simply isn't that kind of traffic 😂

There is not much difference between a large model product and a traditional business in this respect. Stability/clustering and other features are also required and should be taken into account by technicians when choosing middleware.

Examples include MQ messaging middleware, Redis for caching, and so on.

4. Deployment implementation/delivery

Yes, the final deployment and delivery step also has to be considered. Docker does bring great convenience, but the cost needs to be weighed too: a fully packaged Python image nowadays easily starts at 2-3 GB compressed, which is quite a headache. And if you deploy with K8s scheduling, pulling a 10 GB image is not exactly fast 😂

5. Finally

AI applications call for rapid trial and error and for breaking through by being really strong at one specific point; the technical architecture should likewise take overall development efficiency, the ecosystem, and so on into account.

This reminds me of jQuery's classic slogan from a decade or so ago, which many developers loved when it first came out:

Write Less, Do More!!!

As large models grow ever more robust and capable, shouldn't our technical architecture slim down too?

If you are also interested in large models and RAG (retrieval-augmented generation), you are welcome to follow me so we can explore, learn, and grow together!

Figure 10 - WeChat official account "八一菜刀"