
Ant Graph Team's GraphRAG Adds Community Summaries, Cutting Token Overhead 50% Compared to Microsoft's

Published: 2024-10-28 01:25:21

In May of this year, with the DB-GPT v0.5.6 release, we open-sourced Ant's first GraphRAG framework, which supports a variety of knowledge-base indexing backends; see the article "Vector | Graph: Design Interpretation of Ant's First Open Source GraphRAG Framework". In July, Microsoft officially open-sourced its GraphRAG project, which introduces graph community summarization to improve answer quality on QFS (Query-Focused Summarization) tasks, at the cost of more expensive graph index construction. In September, DB-GPT v0.6.0 was officially released at the Bund Conference; the Ant graph team, together with the community, further improved the GraphRAG framework, adding capabilities such as graph community summarization and hybrid retrieval while dramatically reducing the token overhead of graph index construction.

1. Naive GraphRAG

Recall the previous version of GraphRAG, which we call the naive GraphRAG implementation. Compared to vector-database-based RAG, its core improvement is to build the graph index and extract query keywords with the help of an LLM, then enhance Q&A by recalling knowledge-graph subgraphs matching those keywords. Since the knowledge graph stores deterministic knowledge, it can provide a more deterministic context than the vector-database approach. However, this design relies heavily on keyword information in the query and is powerless against summarization queries, which often end in the embarrassing reply "The current knowledge base is insufficient to answer your question".
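The naive query path above can be sketched in a few lines; extract_keywords, explore, and ask_llm below are hypothetical placeholders for the LLM keyword extractor, the graph-store traversal, and the answering model, not DB-GPT APIs.

```python
# Hedged sketch of the naive GraphRAG query path: keyword extraction,
# subgraph recall, then LLM answering over the recalled subgraph.
# All three callables are hypothetical placeholders.
def naive_graphrag_answer(question, extract_keywords, explore, ask_llm):
    keywords = extract_keywords(question)   # an LLM call in practice
    subgraph = explore(keywords)            # knowledge-graph traversal
    if not subgraph:
        # exactly the failure mode summarization queries run into
        return "The current knowledge base is insufficient to answer your question."
    return ask_llm(question, context=subgraph)
```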

The naive GraphRAG pipeline

There are several viable ideas for improving answer quality on summarization queries:

  • Hybrid RAG: Combine the advantages of vector indexes and graph indexes through multi-path recall to improve overall Q&A quality. The HybridRAG paper takes exactly this approach, improving performance by several percentage points overall, but with a regression in context precision, mainly because knowledge recalled from multiple paths cannot be aligned; this is an inherent problem of hybrid retrieval across multiple systems.
  • Fused index: Integrate the vector index directly into the graph database to provide vector search over the knowledge graph, enabling similarity retrieval of knowledge-graph subgraphs or even original documents and avoiding the data-inconsistency problem of multi-system knowledge recall, e.g. Neo4jVector. Vector indexing support is also coming in the next TuGraph release.
  • Community summaries: Use a graph community algorithm to divide the knowledge graph into a number of community subgraphs and extract summaries of them; summarization queries are then answered based on the community summaries. This is the key design in Microsoft GraphRAG.

2. Community-Summary-Enhanced Pipeline

In DB-GPT v0.6.0 we introduced graph community summarization into GraphRAG to improve answer quality on summarization queries, and also optimized the overall pipeline in three ways:

  • Text-chunk memory: The knowledge extraction phase performs graph structure extraction and element (vertex/edge) summarization in one pass, and introduces a text-chunk memory capability to resolve references that span text chunks.
  • Graph community summary: Partition the knowledge graph using a graph community discovery algorithm, extract community summaries with the help of the LLM, and support similarity recall over those summaries.
  • Multi-path retrieval recall: Do not distinguish between global and local search; instead provide both summary-level and detail-level context relevant to the query through multi-path retrieval.

The community-summary-enhanced GraphRAG pipeline

2.1 Text-Chunk Memory

Text-chunk memory can be implemented most simply with a vector database. The core goal is that, when processing a text chunk, the extractor can accurately identify contextually related information from earlier chunks and thus extract knowledge precisely.

Knowledge extraction with text-chunk memory

In the current version, the text-chunk memory _chunk_history still uses the default vector-storage implementation VectorStoreBase; an intermediate abstraction layer will be introduced later to support more sophisticated memory implementations, such as agents or third-party memory components (e.g. Mem0). The code-level implementation is also fairly simple: before the actual knowledge extraction on a text chunk, recall similar text chunks from the vector store as prompt context, and after extraction save the current chunk back to the vector store. See GraphExtractor#extract for the implementation:

async def extract(self, text: str, limit: Optional[int] = None) -> List:
    # load similar chunks
    chunks = await self._chunk_history.asimilar_search_with_scores(
        text, self._topk, self._score_threshold
    )
    history = [
        f"Section {i + 1}:\n{chunk.content}" for i, chunk in enumerate(chunks)
    ]
    context = "\n".join(history) if history else ""

    try:
        # extract with chunk history
        return await super()._extract(text, context, limit)

    finally:
        # save chunk to history
        await self._chunk_history.aload_document_with_limit(
            [Chunk(content=text, metadata={"relevant_cnt": len(history)})],
            self._max_chunks_once_load,
            self._max_threads,
        )

Graph structure extraction and element summarization are performed in a single pass over each text chunk to reduce the number of LLM calls. This of course raises the bar for LLM capability, and the number of associated text chunks must be controlled via configuration parameters to avoid overflowing the context window. The knowledge extraction prompt describes the entity/relationship extraction steps and how to use the associated context in detail, and gives a one-shot example. For reference:

## Role
You are a knowledge graph engineering expert who is very good at accurately extracting the entities (subjects, objects) and relationships of a knowledge graph from text, and can provide appropriate summary descriptions of the meaning of the entities and relationships.

## Skills
### Skill 1: Entity Extraction
--- Please follow the steps below to extract entities ---
1. accurately identify entity information in the text, typically nouns, pronouns, etc.
2. accurately identify modifying descriptions of entities, typically as qualifiers that complement the entity's characteristics.
3. for entities with the same concept (synonyms, aliases, pronouns), merge them into a single concise entity name and merge their descriptive information.
4. provide a concise, appropriate, and coherent summary of the merged entity description information.

### Skill 2: Relationship Extraction
--- Please follow the steps below to extract relationships ---
1. accurately identify the association information between entities in the text, typically verbs, pronouns, etc.
2. accurately identify modifying descriptions of relationships, typically as gerunds that complement relationship features.
3. merge relationships with the same concept (synonyms, aliases, pronouns) into a single concise relationship name and merge their descriptive information.
4. provide a concise, appropriate, and coherent summary of the combined descriptive information about the relationship.

### Skill 3: Associative Context
- Associative context is derived from the content of the preceding paragraph that is relevant to the current text to be extracted, and can be used to supplement the knowledge extraction.
- Make good use of the contextual information provided, as references that appear during the knowledge extraction process may come from the contextual context.
- Do not do knowledge extraction on the content of the associated context, but only as a reference of the associated information.
- Associated contexts are optional and may be empty.

## Constraints
- If the text already provides data in graph-structure format, convert it directly to the output format and return it; do not modify entity or ID names.
- Generate as much information as possible about the entities and relationships mentioned in the text, but do not invent entities and relationships that do not exist.
- Ensure that you write in the third person, describing entity names, relationship names, and their summarized descriptions from an objective point of view.
- It is important to enrich the content of entities and relationships with as much information from the context of association as possible.
- If the summarizing description of an entity or relationship is empty, do not provide summarizing descriptive information and do not generate irrelevant descriptive information.
- If conflicting description information is provided, resolve the conflict and provide a single, coherent description.
- When the # and : characters appear in the names or description text of entities and relationships, replace them with the _ character and do not modify other characters.
- Avoid the use of stop words and overly common words.

## Output Format
Entities:
(Entity Name#Entity Summary)
...

Relationships:
(Source Entity Name#Relationship Name#Target Entity Name#Relationship Summary)
...

## Reference Cases
--- The case below is only to help you understand the input and output format of the prompt; do not use it in your answer. ---
Input:
```
[Context]:
Section 1:
Phil Jaber's oldest son is named Jacob Jaber.
Section 2:
Phil Jaber's youngest son is named Bill Jaber.
...
[Text]:
Fields Coffee was founded by Phil Jaber in 1978 in Berkeley, California. Known for its unique coffee blends, Fields has expanded to multiple locations across the United States. His oldest son became CEO in 2005 and has led the company to significant growth.
```

Output:
```
Entities:
(Phil Jaber#Founder of Fields Coffee)
(Fields Coffee#Coffee brand founded in Berkeley, California)
(Jacob Jaber#Eldest son of Phil Jaber)
(Multiple locations in the U.S.#Fields Coffee expansion area)

Relationships:
(Phil Jaber#founded#Fields Coffee#in 1978 in Berkeley, California)
(Fields Coffee#located in#Berkeley, California#where Fields Coffee was founded)
(Phil Jaber#father of#Jacob Jaber#Jacob is the eldest son of Phil Jaber)
(Jacob Jaber#manages#Fields Coffee#became CEO in 2005)
(Fields Coffee#expanded to#multiple locations in the U.S.#Fields Coffee's expansion)
```

----

Please extract the entity and relationship data from the [Text] below, based on the information provided in the [Context], following the requirements above.

[Context]:
{history}

[Text]:
{text}

[Result]:
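Downstream of this prompt, the hash-delimited (Entity Name#Entity Summary) and (Source#Relation#Target#Summary) lines have to be parsed back into vertices and edges. A minimal parser for this output format might look like the following (a sketch of the format only, not DB-GPT's actual parser):

```python
def parse_extraction(output: str):
    """Parse (entity#summary) and (src#rel#tgt#summary) lines produced
    by the extraction prompt into vertex and edge tuples."""
    vertices, edges = [], []
    section = None
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("Entities"):
            section = "entities"
        elif line.startswith("Relationships"):
            section = "relationships"
        elif line.startswith("(") and line.endswith(")"):
            fields = [f.strip() for f in line[1:-1].split("#")]
            if section == "entities" and fields:
                name = fields[0]
                summary = fields[1] if len(fields) > 1 else ""
                vertices.append((name, summary))
            elif section == "relationships" and len(fields) >= 3:
                src, rel, tgt = fields[0], fields[1], fields[2]
                summary = fields[3] if len(fields) > 3 else ""
                edges.append((src, rel, tgt, summary))
    return vertices, edges
```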


2.2 Graph Community Summarization

Graph community summarization is the core upgrade of this release and is divided into three main phases:

  • Community discovery: Partition the knowledge graph into communities with the help of the graph database's community discovery algorithms, logically slicing the graph into multiple independent subgraphs. Commonly used graph community algorithms include LPA, Louvain, Leiden, etc. The Leiden algorithm can additionally compute community hierarchies (supporting insights into the knowledge graph at different levels) and is the algorithm used by Microsoft GraphRAG.
  • Community summarization: Fetch the community subgraph data (vertices, edges, and their attributes) and hand it to the LLM for holistic summarization. The challenge here is guiding the LLM to retain as much key community information as possible so that global retrieval obtains comprehensive summaries; beyond guiding the LLM to understand the graph data in the prompt, graph algorithms (e.g. PageRank) can mark the importance of graph elements and help the LLM grasp community topics. Another challenge is that community subgraph size is inherently unbounded and communities are often updated locally (e.g. on document updates), which stresses the LLM context window and inference performance; streaming fetch plus incremental inference are possible optimizations.
  • Summary storage: Community summaries are kept in what we call the community metadata store, CommunityMetastore, which provides summary storage and retrieval and uses a vector database as the default storage backend.

Graph community discovery and summarization
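To make the community-discovery phase concrete, here is a plain-Python sketch of label propagation (LPA), the simplest of the algorithms listed above; production systems would instead call the graph database's built-in Leiden/Louvain implementations.

```python
import random
from collections import Counter, defaultdict

def label_propagation(edges, max_iter=20, seed=42):
    """Toy LPA sketch: every vertex repeatedly adopts the most common
    label among its neighbors (ties broken by smallest label) until
    labels stabilize; vertices sharing a label form a community."""
    rng = random.Random(seed)
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    labels = {v: v for v in adj}  # start with one label per vertex
    for _ in range(max_iter):
        changed = False
        order = list(adj)
        rng.shuffle(order)
        for v in order:
            counts = Counter(labels[n] for n in adj[v])
            best = max(counts.values())
            new = sorted(l for l, c in counts.items() if c == best)[0]
            if labels[v] != new:
                labels[v], changed = new, True
        if not changed:
            break
    # group vertices by final label -> community vertex sets
    communities = defaultdict(set)
    for v, l in labels.items():
        communities[l].add(v)
    return list(communities.values())
```

Each returned vertex set corresponds to one community subgraph, whose vertices, edges, and attributes would then be handed to the LLM for summarization.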

See CommunityStore#build_communities for the core implementation of graph community summarization. The adapter _community_store_adapter abstracts over different graph databases, including the community discovery entry point discover_communities and the community detail lookup get_community. The community summarizer _community_summarizer is responsible for calling the LLM to summarize community subgraphs, and the community metadata store _meta_store implements summary storage and retrieval on top of the vector database. The current version still rebuilds all community summaries on update; this will be upgraded to incremental updates to reduce extra LLM call overhead.

async def build_communities(self):
    # discover communities
    community_ids = await self._community_store_adapter.discover_communities()

    # summarize communities
    communities = []
    for community_id in community_ids:
        community = await self._community_store_adapter.get_community(community_id)
        graph = community.data.format()
        if not graph:
            break

        community.summary = await self._community_summarizer.summarize(graph=graph)
        communities.append(community)
        logger.info(
            f"Summarize community {community_id}: {community.summary[:50]}..."
        )

    # truncate then save new summaries
    await self._meta_store.truncate()
    await self._meta_store.save(communities)

The community summary prompt tries to guide the LLM to understand graph data structures (we found that LLMs' native understanding of graph structures is still far from ideal) and to summarize concisely. For reference:

## Role
You are very good at summarizing information in knowledge graphs, and are able to fully and appropriately provide a summary description of knowledge graph subgraph information based on the names and descriptive information of entities and relationships in a given knowledge graph, without losing key information.

## Skills
### Skill 1: Entity Recognition
- Accurately identify entity information in the [Entities:] section, including entity names, and entity description information.
- The general format of entity information is:
(entity name)
(Entity Name:Entity Description)
(Entity Name:Entity Attribute Table)

### Skill 2: Relationship Recognition
- Accurately identify relationship information in the [Relationships:] section, including source entity name, relationship name, target entity name, relationship description information, and the entity name may also be a document ID, catalog ID, or text block ID.
- The general format of relationship information is:
(source entity name)-[relationship name]->(target entity name)
(Source Entity Name)-[Relationship Name:Relationship Description]->(Target Entity Name)
(Source entity name)-[Relationship name: Relationship attribute table]->(Target entity name)

### Skill 3: Graph Structure Understanding
--- Please follow the steps below to understand the graph structure ---
1. correctly associate the source entity name in the relationship information with the entity information.
2. correctly associate the target entity name in the relationship information with the entity information.
3. restore the graph structure from the relationship information provided.

### Skill 4: Knowledge Graph Summarization
--- Please summarize the knowledge graph by following the steps below ---.
1. identify the theme or topic that the knowledge graph expresses, highlighting key entities and relationships.
2. summarize the information expressed in the graph structure using accurate, appropriate, and concise language; do not generate information that is not relevant to the information in the graph structure.

## Constraints
- Do not describe your thought process in your answer, give a direct answer to the user's question, and do not generate irrelevant information.
- Ensure that you write in the third person and give a summarized description of the information expressed in the knowledge graph from an objective point of view.
- If the descriptive information for an entity or relationship is empty and does not contribute to the final summarized information, do not generate irrelevant information.
- If conflicting descriptive information is provided, resolve the conflict and provide a single, coherent description.
- Avoid the use of stop words and overly common vocabulary.

## Reference Case
--- The case below is only to help you understand the input and output format of the prompt; do not use it in your answer. ---
Input:
```
Entities:
(Phil Jaber#Founder of Fields Coffee)
(Fields Coffee#Coffee brand founded in Berkeley, California)
(Jacob Jaber#Son of Phil Jaber)
(Multiple locations in the U.S.#Fields Coffee expansion)

Relationships:
(Phil Jaber#founded#Fields Coffee#in 1978 in Berkeley, California)
(Fields Coffee#located in#Berkeley, California#where Fields Coffee was founded)
(Phil Jaber#father of#Jacob Jaber#Jacob is the son of Phil Jaber)
(Jacob Jaber#manages#Fields Coffee#became CEO in 2005)
(Fields Coffee#expanded to#multiple locations in the U.S.#Fields Coffee's expansion)
```

Output:
```
Fields Coffee is a coffee brand founded by Phil Jaber in Berkeley, California in 1978. Phil Jaber's son Jacob Jaber took over as CEO in 2005 and led the company's expansion to multiple locations across the United States, further solidifying the market position of Fields Coffee, the coffee brand founded in Berkeley, California.
```

----

Based on the [Knowledge Graph] provided below, please summarize the information it expresses, following the requirements above.

[Knowledge Graph]:
{graph}

[Summary]:


2.3 Multi-Path Retrieval Recall

Compared with Microsoft GraphRAG, we adjusted and optimized the query pipeline logic.

  • Global search: Since graph community summaries are stored directly in the community metadata store, the global search strategy simplifies to a similarity search over that store, instead of the full-scan-plus-secondary-summarization (MapReduce-style) approach. This greatly reduces the token overhead and query latency of global search; any impact on search quality can be addressed by continuously refining the global search strategy.
  • Local search: Local search still works as in naive GraphRAG, i.e. extracting keywords and then traversing the relevant knowledge-graph subgraphs. This preserves future extensibility toward vector indexes, full-text indexes, NL2GQL, and other capabilities.
  • Search strategy choice: For a better user experience, we wanted to integrate global and local search rather than expose separate entry points.
    • Intent-based routing: Use the LLM to understand the query intent, classify the query as global/local/unknown, and route accordingly. The biggest challenge is that LLM intent recognition is still not precise enough (and depends heavily on context); combining agent memory and reflection capabilities may do better in the future, but to stay conservative we did not adopt this approach.
    • Hybrid-retrieval-based: Since a good routing strategy is not yet attainable, keep it simple: use a hybrid retrieval strategy that performs both global and local recall via multi-path search. A favorable premise is that global search does not depend strongly on the LLM service (local search needs the LLM for keyword extraction), so at worst a query degrades to global search only.

Unified context based on hybrid retrieval

See CommunitySummaryKnowledgeGraph#asimilar_search_with_scores for the hybrid retrieval implementation. The community store _community_store provides a unified entry point for graph community operations, including community discovery, summarization, and search; global search goes through the _community_store#search_communities interface. Local search is still completed by _keyword_extractor#extract working together with _graph_store#explore.

async def asimilar_search_with_scores(
    self,
    text,
    topk,
    score_threshold: float,
    filters: Optional[MetadataFilters] = None,
) -> List[Chunk]:
    # global search: retrieve relevant community summaries
    communities = await self._community_store.search_communities(text)
    summaries = [
        f"Section {i + 1}:\n{community.summary}"
        for i, community in enumerate(communities)
    ]
    context = "\n".join(summaries) if summaries else ""

    # local search: extract keywords and explore subgraph
    keywords = await self._keyword_extractor.extract(text)
    subgraph = self._graph_store.explore(keywords, limit=topk).format()
    logger.info(f"Search subgraph from {len(keywords)} keywords")

    if not summaries and not subgraph:
        return []

    # merge search results into context
    content = HYBRID_SEARCH_PT_CN.format(context=context, graph=subgraph)
    return [Chunk(content=content)]
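The "at worst degrades to global search" behavior is not shown in the snippet above; here is a hedged sketch of how local search could fail soft when the LLM-backed keyword extractor is unavailable (safe_local_context and its error handling are illustrative, not DB-GPT's actual code):

```python
import asyncio

# Hypothetical helper: if keyword extraction (an LLM call) fails,
# return an empty local context so the query degrades to global
# (community-summary) search only.
async def safe_local_context(keyword_extractor, graph_store, text, topk):
    try:
        keywords = await keyword_extractor.extract(text)
    except Exception:
        return ""  # degrade gracefully: rely on global search results
    return graph_store.explore(keywords, limit=topk).format()
```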

The final assembled GraphRAG prompt is as follows, containing instructions for understanding both the global context and the graph structure, to guide the LLM to generate better answers.

## Role
You are very good at combining the [contextual] information provided by the prompt word templates with the [knowledge graph] information to accurately and appropriately answer the user's question and ensure that you don't output information that is irrelevant to the context and knowledge graph.

## Skills
### Skill 1: Context Understanding
- Accurately understand the information provided by the [context], which may be split into multiple chapters.
- The content of each section of the context starts with [Section] and is numbered as needed.
- Contextual information provides the most relevant summary description of the user's problem, so use them wisely.

### Skill 2: Knowledge Graph Understanding
- Accurately recognize entity information in the [Entities:] section and relationship information in the [Relationships:] section provided in the [Knowledge Graph], the general format of entity and relationship information is:
```
* Entity information format:
- (Entity Name)
- (Entity Name:Entity Description)
- (Entity Name:Entity Attribute Table)

* Relationship information format:
- (Source Entity Name)-[Relationship Name]->(Target Entity Name)
- (Source Entity Name)-[Relationship Name:Relationship Description]->(Target Entity Name)
- (Source Entity Name)-[Relationship Name:Relationship Attribute Table]->(Target Entity Name)
```
- Correctly associate the entity names/IDs in the relationship information with the entity information to restore the graph structure.
- Use the information expressed in the graph structure as a detailed context for user questions to assist in generating better answers.

## Constraints
- Don't describe your thinking process in the answer, give the answer to the user's question directly, and don't generate irrelevant information.
- If the [Knowledge Graph] does not provide information, then the question should be answered based on the information provided by the [Context].
- Ensure that you write in the third person and answer the question from an objective point of view by combining the information expressed in [Context] and [Knowledge Graph].
- If conflicting information is provided, resolve the conflict and provide a single, coherent description.
- Avoid the use of stop words and overly common vocabulary.

## Reference Case
```
[Context]:
Section 1:
Phil Jaber's oldest son is named Jacob Jaber.
Section 2:
Phil Jaber's youngest son is named Bill Jaber.
[Knowledge Graph]:
Entities:
(Phil Jaber#Founder of Fields Coffee)
(Fields Coffee#Coffee brand founded in Berkeley, California)
(Jacob Jaber#Son of Phil Jaber)
(Multiple locations in the United States#Fields Coffee expansion areas)

Relationships:
(Phil Jaber#founded#Fields Coffee#in 1978 in Berkeley, California)
(Fields Coffee#located in#Berkeley, California#where Fields Coffee was founded)
(Phil Jaber#father of#Jacob Jaber#Jacob is the son of Phil Jaber)
(Jacob Jaber#manages#Fields Coffee#became CEO in 2005)
(Fields Coffee#expanded to#multiple locations in the United States#Fields Coffee's expansion)
```

----

The [Context] and [Knowledge Graph] information below can help you answer the user's question better.

[Context]:
{context}

[Knowledge Graph]:
{graph}


3. Experience and testing

The GraphRAG pipeline with the above improvements has been released in DB-GPT v0.6.0 and can be experienced by following the GraphRAG User's Manual.

3.1 Environment initialization

Please refer to the Quick Start document to start DB-GPT, then execute the following commands to start the TuGraph image (version 4.3.2 is recommended; the algorithm plugin needs to be enabled):

docker pull tugraph/tugraph-runtime-centos7:4.3.2
docker run -d -p 7070:7070  -p 7687:7687 -p 9090:9090 --name tugraph tugraph/tugraph-runtime-centos7:4.3.2 lgraph_server -d run --enable_plugin true

3.2 Creating a Knowledge Graph

Visit port 5670 on the local machine to reach the DB-GPT home page, then create a knowledge base under "Application Management - Knowledge Base", selecting the "Knowledge Graph" type.

Creating a knowledge base

Upload the test documents (path: DB-GPT/examples/test_files) and wait for chunking to complete.

Uploading documents

The knowledge graph preview now supports community structure and has been optimized using the AntV G6 component.

Knowledge graph preview

3.3 Knowledge base questions and answers

The created knowledge base can be tested directly in a dialog.

Knowledge base Q&A

3.4 Performance testing

Based on the GraphRAG knowledge base built from the above test documents, we measured the relevant performance metrics. The basic conclusions:

  • Indexing performance: Benefiting from the optimizations in the knowledge extraction and community summarization phases, the token overhead of DB-GPT GraphRAG's indexing phase is only about half that of Microsoft's solution.
  • Query performance: Local search performance does not differ much from Microsoft's scheme, but global search performance is significantly improved, thanks to similarity recall over community summaries instead of a full MapReduce pass.

DB-GPT GraphRAG performance report

4. Continuous improvement

Enhancing the GraphRAG pipeline with community summaries is just one specific optimization; there is still much room for future improvement. Some valuable directions are shared here.

4.1 Introducing Document Structure

The general GraphRAG pipeline processes the corpus by first splitting documents into text chunks and extracting entity and relationship information from each chunk. However, this loses the associations between entities and the document structure. The document structure itself contains important hierarchical relationships that can provide valuable context for knowledge-graph retrieval. In addition, preserving document structure helps with data traceability and provides a more reliable basis for answers.

Knowledge graph with document structure

Furthermore, if the data sources in the knowledge graph need finer granularity, the specific source document IDs and text chunk IDs should be kept on the relationships. During the retrieval phase, the document and text-chunk details referenced by the relationship edges in the recalled subgraphs can then be provided to the LLM context together, avoiding the loss of document detail caused by the knowledge extraction process.
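As an illustration of what retaining provenance on relationship edges could look like (the Edge type and its field names here are hypothetical, not DB-GPT's schema):

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Edge:
    """A knowledge-graph edge carrying its source document and
    text-chunk IDs, so retrieval can trace a triple back to the
    original text and supply that text to the LLM context."""
    src: str
    rel: str
    tgt: str
    summary: str = ""
    provenance: Dict[str, str] = field(default_factory=dict)

# Hypothetical IDs, for illustration only
e = Edge("Phil Jaber", "founded", "Fields Coffee",
         summary="in 1978 in Berkeley, California",
         provenance={"doc_id": "doc-001", "chunk_id": "chunk-007"})
```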

4.2 Improved knowledge extraction

Besides making domain-specific knowledge extraction more efficient with fine-tuned extraction models as mentioned in earlier articles (e.g. OneKE), extraction accuracy can be further improved by introducing agent-based memory and reflection mechanisms. For example, the AgentRE framework addresses the problems relationship extraction faces in complex scenarios, such as diverse relationship types and ambiguous relationships between entities.

The AgentRE framework

4.3 Using High-Dimensional Graph Features

Limited by LLMs' native ability to understand graph structure, answering directly over the extracted knowledge graph does not necessarily produce reliable answers. To make knowledge-graph data easier for the LLM to understand, techniques from the graph computing field can endow the knowledge graph with more diverse high-dimensional graph features, assisting the LLM's understanding of graph data and further improving answer quality. Compared with LLMs, graph algorithms have clear advantages in performance and reliability.

Specific means include, but are not limited to:

  • Two-hop graph features: The most direct way to compute graph features describing a node's neighborhood, such as common neighbors and neighbor aggregation metrics.
  • Path features: Characterize connectivity between nodes with path algorithms such as shortest path, DFS/BFS, and random walks.
  • Community features: Aggregate sets of similar nodes to characterize their homogeneity and further provide community summaries, e.g. LPA, Louvain, Leiden.
  • Centrality features: Describe the importance of nodes and help extract key information, e.g. PageRank, clustering coefficients.
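As a concrete example of the last point, PageRank scores can be precomputed and attached to vertices so the summarization prompt can flag important elements; here is a minimal power-iteration sketch (illustrative, not how a graph database would implement it):

```python
def pagerank(edges, damping=0.85, iters=50):
    """Minimal power-iteration PageRank over a directed edge list,
    illustrating how element importance can be precomputed to help
    the LLM focus on key vertices."""
    nodes = {n for e in edges for n in e}
    out = {n: [] for n in nodes}
    for u, v in edges:
        out[u].append(v)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        nxt = {n: (1 - damping) / len(nodes) for n in nodes}
        for u in nodes:
            if out[u]:
                share = damping * rank[u] / len(out[u])
                for v in out[u]:
                    nxt[v] += share
            else:  # dangling node: spread its rank uniformly
                for v in nodes:
                    nxt[v] += damping * rank[u] / len(nodes)
        rank = nxt
    return rank
```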

4.4 Enhanced storage formats

As mentioned earlier, a fused index is one technical option for improving QFS answer quality. Fused indexing has gradually become an important technical direction for database and big-data products; it can effectively bridge the big-data and large-model domains, providing diverse query and analysis support on a single data store.

The dominant indexing formats include, but are not limited to:

  • Table index: Provides traditional relational query and analysis capabilities, enabling filtering, analysis, and aggregation over tabular data.
  • Graph index: Provides associated-data analysis capabilities and iterative graph algorithms, enabling high-dimensional analysis and insight over graph data.
  • Vector index: Provides vectorized storage and similarity query capabilities, extending the diversity of data retrieval.
  • Full-text index: Provides keyword-based document query capabilities, extending the diversity of data retrieval.
  • Others: e.g. indexes over multimodal data such as images, audio, and video.

4.5 Natural Language Queries

Keyword-based knowledge-graph recall over natural language queries can only do coarse-grained retrieval; it can neither precisely exploit conditions, aggregation dimensions, and other information in the query text, nor answer generalized queries that contain no specific keywords. Correctly understanding the intent of the user's question and generating an accurate graph query statement is therefore essential. Intent recognition and graph query generation ultimately require agent-based solutions: in most cases we need to combine the dialog's environment and context, or even call external tools and perform multi-step reasoning, to generate the ideal graph query statement.

TuGraph currently provides a complete Text2GQL solution in the DB-GPT-Hub project, where GQL (tugraph-analytics) and Cypher (tugraph-db) corpora were used to fine-tune the CodeLlama-7b-instruct model, achieving text similarity and grammatical correctness above 92%. This capability will gradually be integrated into the GraphRAG framework.

Language                 Dataset                         Model                    Method  Similarity  Grammar
Cypher (tugraph-db)      TuGraph-DB Cypher dataset       CodeLlama-7b-Cypher-hf   base    0.769       0.703
                                                                                  lora    0.928       0.946
GQL (tugraph-analytics)  TuGraph-Analytics GQL dataset   CodeLlama-7b-GQL-hf      base    0.493       0.002
                                                                                  lora    0.935       0.984

5. Summary

The research and industrial practice of GraphRAG are still iterating and being explored. Since LlamaIndex released the first version of GraphRAG, vendors such as Ant, Microsoft, and Neo4j, along with a large number of AI agent framework products, have followed up with support. Enhancing GraphRAG with community summaries is just a starting point. We hope to start here and, together with community developers, research teams, internal businesses, and external enterprises, explore the technologies and application scenarios of combining graph computing with large models. We look forward to collaborating and building with you.

6. References

  1. DB-GPT v0.5.6:/eosphoros-ai/DB-GPT/releases/tag/v0.5.6
  2. Design interpretation of Ant's first open source GraphRAG framework:/p/703735293
  3. Microsoft GraphRAG:/microsoft/graphrag
  4. DB-GPT v0.6.0:/eosphoros-ai/DB-GPT/releases/tag/v0.6.0
  5. HybridRAG:/abs/2408.04948
  6. Neo4jVector:/docs/cypher-manual/current/indexes/semantic-indexes/vector-indexes/
  7. TuGraph DB version:/TuGraph-family/tugraph-db/releases
  8. Mem0:/mem0ai/mem0
  9. LPA:/wiki/Label_propagation_algorithm
  10. Louvain:/abs/0803.0476
  11. Leiden:/abs/1810.08473
  12. PageRank:/abs/1407.5107
  13. GraphRAG User's Manual:/docs/cookbook/rag/graph_rag_app_develop/
  14. DB-GPT Quick Start:/eosphoros/dbgpt-docs/ew0kf1plm0bru2ga
  15. TuGraph Mirror:/r/tugraph/tugraph-runtime-centos7/tags
  16. AntV G6:/antvis/G6
  17. OneKE:/
  18. AgentRE:/abs/2409.01854
  19. DB-GPT-Hub:/eosphoros-ai/DB-GPT-Hub
  20. Text2GQL:/eosphoros-ai/DB-GPT-Hub/blob/main/src/dbgpt-hub-gql/
  21. tugraph-db:/TuGraph-family/tugraph-db
  22. TuGraph-DB Cypher dataset:/tugraph/datasets/text2gql/tugraph-db/
  23. CodeLlama-7b-Cypher-hf:/tugraph/CodeLlama-7b-Cypher-hf/tree/1.0
  24. tugraph-analytics:/TuGraph-family/tugraph-analytics
  25. TuGraph-Analytics GQL dataset:/tugraph/datasets/text2gql/tugraph-analytics/
  26. CodeLlama-7b-GQL-hf:/tugraph/CodeLlama-7b-GQL-hf/tree/1.1