
GraphRAG+Document Architecture: Building a High Performance Entity Traceability Solution


Author: Chen Zikang

As we all know, after GraphRAG extracts document content into knowledge graph triples, it retains only the relational knowledge and inevitably loses some of the original text's details. This is undesirable in business scenarios that demand strict data integrity, such as finance, healthcare, and insurance. To address this demand, we introduce document structure information into the GraphRAG pipeline to solve the loss of original-text information after knowledge extraction. We also optimized the GraphRAG pipeline end to end, dramatically improving the performance of knowledge graph construction and retrieval.

1. Summary

GraphRAG is an innovative knowledge retrieval and Q&A enhancement framework that combines graph database technology with retrieval-augmented generation (RAG). GraphRAG tends to achieve better results than traditional RAG on tasks involving complex data relationships, and it is one of the popular engineering directions in the LLM field today.

As an important component of the DB-GPT open source project, Ant's self-developed GraphRAG has recently gained significant performance improvements: on top of the original community summary enhancement and hybrid retrieval support, it adds a new Document Structure index, which further improves the richness of the knowledge graph and the completeness of knowledge recall. It also remains compatible with the knowledge graph rendering based on the AntV G6 engine and the TuGraph storage engine, making the complex data relationships in documents clear at a glance.

This article introduces the latest technical improvements of the DB-GPT GraphRAG system in detail, focusing on the implementation of document structure indexing, the efficiency optimization of knowledge graph construction, and the gains brought by the multi-dimensional retrieval mechanism. Comparison tests against an industry benchmark verify that the optimized solution reduces resource consumption while maintaining the integrity and accuracy of the knowledge representation. The article also walks through practical examples, providing readers with an intuitive technical reference.

2. Knowledge graph construction

2.1 Document Structure Graph

2.1.1 Document parsing

To improve the accuracy of document parsing, we implemented enhanced support for Markdown files. The system slices a document into multiple independent content blocks (chunks) by recognizing the header-level symbols (e.g., "#", "##") in a standard-formatted document. This structured parsing not only retains the integrity of the original content but, more importantly, accurately captures the hierarchical relationships between sections.

Based on these hierarchical relationships, we construct a directed graph structure:

  • Each node represents a chunk of text.
  • There are two types of directed edges:
    • next: connects neighboring text blocks at the same level, indicating the order of the content.
    • include: connects text blocks at different levels, indicating a hierarchical containment relationship.

To visualize how the document structure is parsed, consider the following Markdown document fragment:

# Machine Learning Fundamentals
## Supervised Learning
### Classification Algorithms
### Regression Algorithms
## Unsupervised Learning
### Clustering Algorithms

This document will be sliced into the following blocks of text (with [...] to indicate specific content):

  • chunk1: "# Machine Learning Fundamentals [...]"
  • chunk2: "## Supervised Learning [...]"
  • chunk3: "### Classification Algorithms [...]"
  • chunk4: "### Regression Algorithms [...]"
  • chunk5: "## Unsupervised Learning [...]"
  • chunk6: "### Clustering Algorithms [...]"

At the same time, two kinds of directed relationships (include and next) hold between these chunks:

  • include edges: chunk1 contains chunk2 and chunk5; chunk2 contains chunk3 and chunk4; chunk5 contains chunk6.
  • next edges: chunk2 connects to chunk5; chunk3 connects to chunk4.
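
For reference, here is a minimal sketch of how such a parser might work, assuming plain Python with no DB-GPT dependencies; the Chunk class and parse_markdown function are illustrative names, not the actual DB-GPT API:

    import re
    from dataclasses import dataclass

    @dataclass
    class Chunk:
        id: str
        header: str
        level: int        # 1 for "#", 2 for "##", ...
        content: str = ""

    def parse_markdown(text: str):
        """Slice a Markdown document into chunks by header level and derive
        include (parent -> child) and next (sibling order) edges."""
        chunks, include_edges, next_edges = [], [], []
        stack = []            # currently open ancestor chunks, shallowest first
        last_sibling = {}     # (parent_id, level) -> last chunk seen there
        for line in text.splitlines():
            m = re.match(r"^(#{1,6})\s+(.*)", line)
            if not m:         # body text belongs to the most recent chunk
                if chunks:
                    chunks[-1].content += line + "\n"
                continue
            level, header = len(m.group(1)), m.group(2).strip()
            chunk = Chunk(id=f"chunk{len(chunks) + 1}", header=header, level=level)
            while stack and stack[-1].level >= level:  # equal/deeper headers cannot be parents
                stack.pop()
            parent_id = stack[-1].id if stack else None
            if parent_id:
                include_edges.append((parent_id, chunk.id))
            prev = last_sibling.get((parent_id, level))
            if prev:                                   # consecutive siblings get a next edge
                next_edges.append((prev.id, chunk.id))
            last_sibling[(parent_id, level)] = chunk
            chunks.append(chunk)
            stack.append(chunk)
        return chunks, include_edges, next_edges

Run over the fragment above, this yields exactly the include edges (chunk1→chunk2, chunk2→chunk3, chunk2→chunk4, chunk1→chunk5, chunk5→chunk6) and next edges (chunk3→chunk4, chunk2→chunk5) listed earlier.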

2.1.2 Building the document structure graph

Naturally, thanks to the directed relations between text chunks, the document structure can be organized as a directed graph, which is then written into the knowledge graph (built on TuGraph). As shown in the figure, in the upper part of the graph a node can be a content slice/text chunk of the document (chunk, purple node), and the edges represent the structural relationships between chunks in the original document.

In layman's terms, Document 1 is like the main table of contents of a tutorial, connected to its chapters (Chunks 1, 2, and 3) through include relationships. Each chapter can in turn include more specific knowledge points (entities). Meanwhile, the next relationships make the recommended order of reading clear.

2.2 Triplets Graph

After completing the construction of the document structure graph, the next key step is to extract the knowledge entities and relationships in the document through semantic understanding to construct a structured knowledge graph. We adopt an innovative "context enhancement" approach to improve the accuracy of knowledge extraction through multi-dimensional information fusion.

2.2.1 Similar Text Block Search

  • For each text chunk to be processed, we first find the top-k semantically similar text chunks via vector similarity search.
  • These similar chunks serve as supplementary context that helps the LLM better understand the current content.

2.2.2 Triplet knowledge extraction

  • The current text block, along with the associated context, is sent to the LLM service as a request.
  • With this richer context, the LLM can recognize and extract knowledge triplets more accurately.
  • Extraction follows the standard format [ENTITY] -(RELATION)-> [ENTITY].
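
As an illustration, the context-enhanced extraction step could look like the sketch below; the prompt wording and the call_llm callable (any function mapping a prompt string to a completion string) are assumptions, not DB-GPT's actual prompt or API:

    import re

    TRIPLET_PROMPT = """Extract knowledge triplets from the TEXT only, one per line,
    in the format: [ENTITY] -(RELATION)-> [ENTITY]

    CONTEXT (similar chunks, for disambiguation only):
    {context}

    TEXT:
    {text}
    """

    def extract_triplets(chunk, similar_chunks, call_llm):
        """Send the chunk plus its top-k similar chunks to the LLM and
        parse the returned triplet lines into (head, relation, tail) tuples."""
        prompt = TRIPLET_PROMPT.format(context="\n---\n".join(similar_chunks), text=chunk)
        triplets = []
        for line in call_llm(prompt).splitlines():
            m = re.match(r"\[(.+?)\]\s*-\((.+?)\)->\s*\[(.+?)\]", line.strip())
            if m:
                triplets.append(m.groups())
        return triplets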

2.2.3 Concurrent Extraction Optimization

  1. Triplet extraction over text blocks is a largely independent process, so we execute it concurrently in batches (see the sketch after this list).
  2. Consider the characteristics of the triplet extraction task:
    • The processing of each text block is independent of each other.
    • The computational process has no shared state.
    • Moderate task granularity.
  3. We implemented a concurrent processing mechanism based on text blocks:
    • Batch processing balances the number of concurrent requests against LLM call resources.
    • This strikes an optimal balance between processing speed and resource utilization.
  4. This concurrency optimization yields a significant performance improvement: task processing time drops to 20% of the original elapsed time. This five-fold speedup demonstrates the effectiveness of the concurrency strategy in large-scale knowledge extraction scenarios. We also observed two major performance bottlenecks in real-world operation:
    • LLM Concurrent Request Limit
      • APIs for large language models often have strict limits on the number of concurrent requests.
      • Even if the system is able to handle more tasks in parallel, it will be limited by the API call quota.
    • Token Generation Rate Limit
      • The LLM service has an explicit upper limit on the total number of tokens generated per minute.
      • This limitation directly affects the maximum processing throughput of the system.
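
The batching mechanism itself is simple; a minimal asyncio sketch (illustrative only, not DB-GPT's actual implementation) uses a semaphore to cap in-flight LLM calls so the batch respects the API's concurrency quota:

    import asyncio

    async def extract_all(chunks, extract_fn, max_concurrency: int = 8):
        """Run triplet extraction over all chunks concurrently, capping the
        number of simultaneous LLM requests to respect API rate limits."""
        semaphore = asyncio.Semaphore(max_concurrency)

        async def worker(chunk):
            async with semaphore:   # at most max_concurrency calls in flight
                return await extract_fn(chunk)

        return await asyncio.gather(*(worker(c) for c in chunks))

    # usage, assuming extract_one is an async LLM-backed extraction function:
    # results = asyncio.run(extract_all(chunks, extract_one, max_concurrency=8))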

These limitations are dictated by the infrastructure of the LLM service and are beyond the scope of optimization of our GraphRAG framework. However, this points us in the direction of future optimizations:

- Explore elastic scaling schemes for model deployment.
- Investigate intelligent scheduling strategies for request batch processing.
- Optimize token usage efficiency (e.g., streamline system prompts, etc.).

The advantages of the "context enhancement" approach are:

  • The LLM gets a more global view of the knowledge, no longer limited to the local information of a single text block.
  • The added context enables more accurate entity relationships to be established.
  • The resulting knowledge graph is of higher quality, with more accurate and complete relationships.

For example, when processing a chunk about "deep learning", retrieving similar content lets the LLM also see semantically similar chunks such as "neural network structure" and "backpropagation algorithm" (found via the system's similarity search), helping it extract triplet/entity relationships more accurately, e.g. [Deep Learning] -(uses)-> [Backpropagation Algorithm]; [Neural Network] -(is a)-> [Deep Learning Model].

This, together with the previously mentioned document structure, extends the scope of GraphRAG's definition of a graph to two parts:

  • Triplets Graph: captures semantic relationships between entities.
  • Document Structure Graph: maintains the knowledge hierarchy.

This graph structure not only draws a "knowledge map" but also makes it possible for the LLM to trace from an entity back to the original text chunk. In the future, we may build a more complex graph covering more comprehensive information to support more sophisticated retrieval algorithms.

2.3 Community Summary

To better understand and organize the knowledge in the graph from a macro, global perspective, we introduced a community-level mechanism for summarizing and generalizing knowledge. The process has three key steps:

2.3.1 Community discovery

GraphRAG uses the Leiden algorithm for community detection by default. Known for its efficiency and accuracy, the algorithm can:

  • Automatically discover tightly linked clusters of knowledge entities.
  • Keep ties within communities strong while keeping the boundaries between communities clear.
  • Adaptively determine appropriate community sizes.
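
For illustration, community detection with Leiden takes only a few lines using the igraph and leidenalg libraries (how DB-GPT invokes the algorithm internally may differ):

    import igraph as ig
    import leidenalg  # pip install igraph leidenalg

    def detect_communities(edges, n_nodes):
        """Partition the entity graph (given as a list of (src, dst) index
        pairs) with the Leiden algorithm; returns node -> community_id."""
        g = ig.Graph(n=n_nodes, edges=edges)
        partition = leidenalg.find_partition(g, leidenalg.ModularityVertexPartition)
        return {node: cid for cid, community in enumerate(partition) for node in community}

The resulting community_id can then be written back onto the graph nodes, which is exactly what the optional community_id attribute in the graph model (Section 2.4) is for.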

2.3.2 Community textualization

For each identified community, we perform systematic information extraction (i.e., a graph-unfolding process):

  • Collect the attribute information of all entities in the community and extract the relationship descriptions between entities.
  • Convert the graph-structure information into a structured text representation.

2.3.3 Community summaries

Based on the textualized community information, we generate community summaries as follows:

  • Call the LLM to analyze the community's core themes and key concepts and generate a condensed description of them (community summaries are also generated concurrently).
  • Persist the summary information for subsequent retrieval and analysis.
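
Putting 2.3.2 and 2.3.3 together, a minimal sketch of the textualize-then-summarize step might look as follows; the node/edge field names and the call_llm callable are illustrative assumptions:

    def summarize_community(nodes, edges, call_llm):
        """Flatten a community subgraph into text, then ask the LLM for a
        condensed thematic summary. nodes carry (name, description);
        edges carry (src, relation, dst, description)."""
        lines = [f"Entity: {n.name} - {n.description}" for n in nodes]
        lines += [f"Relation: {e.src} -({e.relation})-> {e.dst} - {e.description}"
                  for e in edges]
        prompt = ("Summarize the core themes and key concepts of the following "
                  "knowledge community in a few sentences:\n" + "\n".join(lines))
        return call_llm(prompt)   # the returned summary is persisted for retrieval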

Thanks to the document structure and triplet graphs, together with community summarization, the new version of GraphRAG not only provides a local view of the knowledge graph that helps people discover close connections between knowledge points, but also lays the foundation for higher-level semantic understanding in knowledge graph applications.

2.4 Graph data modeling

When designing GraphRAG's data persistence, we chose TuGraph as the underlying storage engine. On top of the original graph model (graph schema), we designed a complete new schema that enables the graph to express, store, and index the document structure. The new graph model has the following characteristics:

  1. The hierarchical structure of the document is preserved intact.
  2. Clearly expresses the relationships between blocks of text.
  3. Supports the associative representation of triplet knowledge entities.

2.4.1 Node types

GraphRAG defines three basic types of nodes:

  1. document: represents a document object
    • Core attributes: id, name.
    • Optional attribute: community_id (for community partitioning).
  2. chunk: represents a text block of the document
    • Core attributes: id, name.
    • Optional attributes: community_id (for community partitioning), content (the block's content).
  3. entity: represents a knowledge entity
    • Core attributes: id, name.
    • Optional attributes: community_id (for community partitioning), description (entity description).

2.4.2 Edge types

In order to accurately represent the relationships between nodes, GraphRAG defines five specific edge types:

  1. Containment edges:
    • document_include_chunk: a document contains text blocks.
    • chunk_include_chunk: hierarchical containment between text blocks.
    • chunk_include_entity: a text block contains knowledge entities.
  2. Sequence edges:
    • chunk_next_chunk: the sequential relationship between text blocks.
  3. Semantic relation edges:
    • relation: semantic relationships between triplet entities.

Each edge type carries basic attributes (id, name) and an optional description. For inter-entity relation edges, GraphRAG additionally records a chunk_id, which is used to track the source context of the relationship.
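
To make the model concrete, the node and edge types can be summarized as plain Python data (a paraphrase of the schema described above, not TuGraph's actual schema definition format):

    # Node types: label -> (core attributes, optional attributes)
    NODE_TYPES = {
        "document": (["id", "name"], ["community_id"]),
        "chunk":    (["id", "name"], ["community_id", "content"]),
        "entity":   (["id", "name"], ["community_id", "description"]),
    }

    # Edge types: label -> (source label, target label)
    EDGE_TYPES = {
        "document_include_chunk": ("document", "chunk"),
        "chunk_include_chunk":    ("chunk", "chunk"),
        "chunk_include_entity":   ("chunk", "entity"),
        "chunk_next_chunk":       ("chunk", "chunk"),
        "relation":               ("entity", "entity"),  # also records chunk_id for provenance
    }
    # Every edge carries id and name, plus an optional description attribute.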

Based on this modeling approach, we not only achieve structured storage of documents, but also lay the foundation for subsequent knowledge graph construction and semantic analysis.

3. Knowledge graph retrieval

The construction of knowledge graph lays the foundation for intelligent retrieval, but how to efficiently retrieve and utilize this structured knowledge still faces many challenges. We design a multi-level retrieval framework to achieve accurate and comprehensive information access by fusing different dimensions of knowledge representation.

3.1 Keyword extraction

First, GraphRAG needs to accurately understand the user's query intent. Query parsing is done by calling the LLM to map the user's input query to keywords; for example, key search terms such as "machine learning", "beginner", "basics", and "tutorial" are extracted.
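
A minimal sketch of this step (the prompt wording and the call_llm callable are illustrative assumptions):

    KEYWORD_PROMPT = ("Extract the key search terms from the user's question "
                      "as a comma-separated list.\nQuestion: {query}\nKeywords:")

    def extract_keywords(query, call_llm):
        """Map a natural-language query to retrieval keywords via the LLM."""
        completion = call_llm(KEYWORD_PROMPT.format(query=query))
        return [kw.strip() for kw in completion.split(",") if kw.strip()]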

3.2 Localized search: triplets graph search

Based on the extracted keywords, we first locate the related knowledge entities and chunks through simple graph query statements. Subsequently, GraphRAG explores the associations between the entities by multi-hop traversal of the graph to obtain more relevant information.

To better understand multi-hop traversal, consider an example. In a knowledge graph, multi-hop traversal works like this:

  • You first notice an AI-related knowledge point and find that it is related to "deep learning" (one hop).
  • Following this thread, you find the connection between "deep learning" and "neural networks" (two hops).
  • Exploring further, you find that "neural networks" are in turn closely related to "backpropagation" (further hops).
  • In this way, GraphRAG not only finds the most directly relevant answers but, with the help of multi-hop exploration, obtains a subgraph that provides more comprehensive and richer information.
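
For illustration, such a multi-hop lookup can be expressed as a Cypher-style graph query issued from Python; TuGraph supports Cypher-compatible syntax, but the exact statement and the graph_store.query helper here are assumptions rather than DB-GPT's actual code:

    # Entities matching the keywords, expanded 1..2 hops along relation edges.
    MULTI_HOP_QUERY = """
    MATCH p = (e:entity)-[:relation*1..2]-(n:entity)
    WHERE e.name IN $keywords
    RETURN p LIMIT 100
    """

    def local_search(keywords, graph_store):
        """Return the subgraph reachable within two hops of any keyword entity.
        graph_store.query is a hypothetical driver method."""
        return graph_store.query(MULTI_HOP_QUERY, params={"keywords": keywords})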

3.3 Global search: community summary search

With the community detection algorithm Leiden, GraphRAG is able to obtain an overview of the knowledge community associated with a query, providing a more macro view of knowledge.

3.4 Original text search: document structure search

For a retrieved text block, the system walks back "up" through the include relationships in the document structure until it reaches the document node. This mechanism ensures that knowledge is traceable and also supplies the original context that helps users understand where the information comes from. The author tentatively calls this "knowledge traceability".

  • Let's walk through the knowledge tracing process using a TuGraph-related question as an example:

Suppose a user asks, "How well is TuGraph used in financial risk control?"

Step 1: Locate the relevant text block. The system first finds this key chunk:

In terms of financial risk control, TuGraph has helped financial institutions improve anti-fraud and anti-money laundering efficiency, for example, a bank has improved the efficiency of anti-money laundering risk event analysis by 790% and improved the differentiation rate of risk control coverage in its credit platform by 713%.

Step 2: Trace upward. The system walks back through the include relationships, layer by layer:

1. This text block is found to belong to the "Application Cases" section.
2. "Application Cases" in turn belongs to TuGraph's overall introduction document.
3. Finally, the complete document structure is located:
TuGraph Introduction
├── Main Features
│   ├── Efficient
│   ├── Distributed Architecture
│   └── Versatility
└── Application Cases
    └── Financial Risk Control Applications  <- current content location

Going from the "current content location" back up to the TuGraph introduction is what we call "upward traceability".
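
In code, upward traceability is just a walk along the reverse include edges until the document node is reached; a hedged sketch, where graph_store.get_parent is a hypothetical accessor returning the node that includes the given one (or None at the root):

    def trace_to_document(chunk_id, graph_store):
        """Walk include edges upward from a chunk and return the path
        chunk -> ... -> document, e.g.
        ["Financial Risk Control Applications", "Application Cases",
         "TuGraph Introduction"]."""
        path = [chunk_id]
        parent = graph_store.get_parent(chunk_id)  # follows *_include_* edges in reverse
        while parent is not None:
            path.append(parent)
            parent = graph_store.get_parent(parent)
        return path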

3.5 Generating Answers

Finally, GraphRAG integrates the results of the three retrieval paths to form a complete knowledge collection:

  • Localized searches provide specific knowledge points and associative relationships.
  • Global search provides domain overview and knowledge clustering.
  • Original text retrieval provides a traceable basis for documentation.

This multi-dimensional knowledge is fed into the LLM, and the model combines it with the user's query intent to generate a comprehensive, accurate, and coherent answer, which is returned to the user. Everything GraphRAG does is designed to ensure the accuracy of the answer and to provide the user with in-depth knowledge insights.

4. Demonstration of effects

4.1 Visualization

Users can now quickly experience this functionality in the front-end application of DB-GPT v0.6.2, which provides clear and intuitive knowledge graph visualizations rendered by the AntV G6 engine.

4.2 Examples of questions and answers

The user asks for basic information about "TuGraph". The system accurately locates the relevant knowledge node through keyword extraction. This verifies the effectiveness of the query comprehension mechanism discussed by the author earlier.

Of particular note is the citation section at the bottom of the answer (blue citation box), which is a manifestation of the "knowledge traceability" mechanism discussed earlier:

  • With the implementation of the document structure, GraphRAG not only answers user questions, but also indicates the exact source of the information.
  • By citing the original text, GraphRAG ensures the credibility of the information and facilitates further exploration by the user.

5. Performance testing

5.1 Test results

For comparison, we used the same test text to benchmark our version of GraphRAG against Microsoft's GraphRAG (the September release). In real-world testing with the same document input size (42,631 tokens), our scheme achieved the following results:

  1. Total token consumption is 42.9% of Microsoft GraphRAG's (417,565 vs 972,220).
  2. In particular, the number of generated tokens is only 18.4% of Microsoft GraphRAG's (41,797 vs 227,230).
  3. The time to build the knowledge graph is 80.1% of Microsoft GraphRAG's (170s vs 210s).
  4. We support a document structure graph; Microsoft GraphRAG does not.
  5. Meanwhile, our graph structure maintains comparable complexity (734 nodes/1,064 edges vs 779 nodes/967 edges), ensuring that the completeness and coverage of the knowledge representation stay at roughly the same level.
                         GraphRAG (DB-GPT)        GraphRAG (Microsoft)
    Doc Tokens           42,631                   42,631
    Triplets Graph       734 nodes, 1,064 edges   779 nodes, 967 edges
    Doc Structure Graph  76 nodes, 1,090 edges    N/A
    Prompt Tokens        375,768                  744,990
    Completion Tokens    41,797                   227,230
    Total Tokens         417,565                  972,220
    Indexing Time        170s                     210s

  • Global search: a significant performance improvement.

              GraphRAG (DB-GPT)   GraphRAG (Microsoft)
    Time      8s                  40s
    Tokens    7,432               63,317

  • Localized search: performance is essentially equivalent.

              GraphRAG (DB-GPT)   GraphRAG (Microsoft)
    Time      15s                 15s
    Tokens    9,230               11,619

5.2 Analysis of results

Microsoft GraphRAG splits community reports into text blocks of a predefined size in the Map step. Each block is used to generate an intermediate response containing a list of points, each with a numerical score indicating its usefulness (on a 0-100 scale, with the LLM as "judge"). Then, in the Reduce step, the most useful points from the intermediate responses are filtered out and aggregated as the context for generating the final answer. Microsoft GraphRAG consumes more tokens and time because of the many LLM calls made to obtain a "usefulness" score for each point during map-reduce. Moreover, these intermediate answers lack a global view, which biases the LLM when it evaluates their usefulness.
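
Schematically, that map-reduce flow can be rendered as follows; this is a paraphrase for clarity rather than Microsoft's actual code, and parse_points_with_scores is a hypothetical parser returning objects with text and score fields:

    def map_reduce_answer(community_blocks, query, call_llm, score_threshold=50):
        """Map: score each block's points with the LLM (0-100 usefulness).
        Reduce: keep the useful points and synthesize the final answer."""
        scored_points = []
        for block in community_blocks:        # Map step: one LLM call per block
            response = call_llm(
                f"Answer '{query}' from this text; for each point give a "
                f"usefulness score 0-100.\n{block}"
            )
            scored_points.extend(parse_points_with_scores(response))
        useful = [p for p in scored_points if p.score >= score_threshold]
        context = "\n".join(p.text for p in sorted(useful, key=lambda p: -p.score))
        return call_llm(f"Using these points, answer: {query}\n{context}")

Each Map call scores points without seeing the other blocks, which is precisely where the lack of a global view enters, as the following example shows.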

  • To better understand why this bias arises, take the Romance of the Three Kingdoms as an example. User question: "Why did Cao Cao lose his rule over Jingzhou?"
  • Community summary blocks and intermediate answers:
    • Fragment 1: "Jingzhou was originally governed by Liu Biao. In the thirteenth year of Jian'an (208), Liu Biao died of illness, and his son Liu Cong surrendered to Cao Cao. Cao Cao sent troops south and occupied Jingzhou."
      • Intermediate answer: Jingzhou initially fell into Cao Cao's hands through Liu Cong's surrender.
      • LLM score: 85 (high, because it directly answers how Cao Cao gained Jingzhou).
    • Fragment 2: "Zhou Yu and Zhuge Liang defeated Cao Cao's army at the Battle of Red Cliffs with a fire attack. Cao Cao was forced to retreat to the north."
      • Intermediate answer: Cao Cao was defeated at the Battle of Red Cliffs and had to withdraw.
      • LLM score: 70 (medium, because Cao Cao retreating to the north while Jingzhou lies in the south suggests he could hardly retain dominion).
    • Fragment 3: "Liu Bei sent Guan Yu to guard Jingzhou. Guan Yu made good use of the navy, and with the support of the local people, he made Jingzhou's foundation stable."
      • Intermediate answer: Guan Yu took over the defense of Jingzhou.
      • LLM score: 30 (low, because it appears to be a mere personnel arrangement with nothing to do with Cao Cao).
  • Bias analysis:
    • The LLM rated Fragment 3 low because, taken in isolation, it looks like a simple personnel transfer. In fact, Guan Yu's defense of Jingzhou is a crucial piece of information: it demonstrates the actual control held by Liu Bei's forces, completes the account of how Jingzhou changed hands, and reflects the importance of popular support. Yet Fragment 3 would be filtered out by Microsoft GraphRAG's algorithm. This is how bias arises.

Further, as the graph grows, i.e., as the number of nodes and edges increases, we propose computing the cohesion of communities directly with the community detection algorithm and using it to define the importance of discovered subgraphs to the user's question, instead of relying on the usefulness-scoring mechanism. In my opinion, providing such "global information" is one of GraphRAG's responsibilities in the global indexing dimension. On large-scale graphs, how well does Microsoft GraphRAG's map-reduce method actually work, and does its indexing performance hold up? More experiments are needed to test these ideas.

Microsoft GraphRAG employs a parallel processing strategy to batch-process community summary blocks into intermediate answers. This approach improves processing speed but introduces extra token and time overhead, and generating intermediate answers brings a degree of information loss and bias:

(1) Loss of information integrity: cross-block information correlations are missing.

(2) Semantic understanding bias: local processing distorts global judgment.

Our version takes these pitfalls and biases into account. Of course, Microsoft GraphRAG has achieved great success in the open source community and is something the author has studied and learned from.

Overall, we achieved good results: when building a knowledge graph of the same scale, our construction takes less time (~80%) and consumes fewer tokens (~40%). Moreover, when answering user questions that require global search, our version of GraphRAG holds the advantage in both time and token consumption according to the test results. In addition, thanks to document structure support, our GraphRAG can retrieve the original text and return it to the user as part of the reference material, giving the user access to more reliable source information.