
Demystifying Prompts Series 41: Is GraphRAG Really a Silver Bullet?

Popularity: 290 ℃ / 2024-10-27 18:51:30

In this installment we introduce the GraphRAG paradigm; counting the time, it is about graphs' turn anyway. Every time a new wave of NLP models comes out, people first study fine-tuning, then the various pre-training schemes, then mull over the data with active learning and semi-supervised, weakly supervised, and unsupervised methods, and after that it is graphs and adversarial learning~

A while ago the Graph RAG wind was blowing hard, and people kept asking: are you doing Graph RAG too? However, good as Graph RAG is, it is not a silver bullet for RAG. It has specific problems and scenarios it suits, and it works best as one recall route in a RAG system, handling entity-dense questions that depend on global relational information. So in this chapter we will walk through GraphRAG's implementation and the problems it actually solves.

Comparison of the effects of Graph RAG and Naive RAG

  • /

Let's look at a few result comparisons based on a Graph RAG demo provided by Deepset. The demo builds a graph from US-stock quarterly reports, then compares side by side how GPT-4o answers the same question using the graph versus using plain RAG. Here are three question types where Graph RAG shows a clear advantage.

Question 1: Which companies bought GPUs, write one line summary for each company

(image: demo answers for question 1, Graph RAG vs naive RAG)

Question 2: Compare Tesla and Apple Inc, answer the question in a structured and concise way

(image: demo answers for question 2, Graph RAG vs naive RAG)

Question 3: What do these reports talk about?

(image: demo answers for question 3, Graph RAG vs naive RAG)

The three demos above show Graph RAG's core benefits and the scenarios it suits; from broadest to narrowest they are

  • Dataset Global Info: relies on global structured information about the data, answering "what are they" and "the best, most, top" type questions.
  • Subgroup Abstract Info: relies on grouping the global data and abstracting local information, answering derived questions like "a certain class, a certain topic" and "how many classes".
  • Entity and Relationship Info: relies on entities and entity-relationship information, answering questions like "aspects of A" and "A vs B" comparisons. This also extends from entities to documents, covering multi-document and inter-document relationship information.

Static graph: Microsoft's GraphRAG implementation

  • /microsoft/graphrag
  • GRAPH Retrieval-Augmented Generation: A Survey
  • From Local to Global: A Graph RAG Approach to Query-Focused Summarization

Ali's Graph RAG survey categorizes the GraphRAG process: Graph RAG essentially adds graph recall to the content recalled by RAG to improve the answer, and it mainly consists of three parts: graph construction, graph data recall, and graph-enhanced answering. The differences between Graph RAG papers therefore lie mainly in how these three parts are implemented and combined. Below we look at Microsoft GraphRAG's concrete implementation.

Step 1. Graph construction

The first step is chunking the document. Chunking matters because the large model's entity-extraction recall drops when the context is too long; the paper compares chunk sizes from 600 to 2400 tokens and finds that as the chunk grows, the number of entities detected in the passage gradually decreases.
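The chunking step can be sketched as follows. This is a minimal word-based approximation (`chunk_text` is a hypothetical helper, not the GraphRAG API; the real pipeline chunks by tokens):

```python
# Minimal sketch of overlapping chunking, assuming whitespace words as a
# stand-in for tokens. chunk_size/overlap mirror the 600-2400 range discussed.

def chunk_text(text: str, chunk_size: int = 600, overlap: int = 100) -> list[str]:
    """Split `text` into overlapping word-based chunks."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

The overlap keeps entities that straddle a chunk boundary visible in at least one chunk.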

The second step is to use the large model to extract entities from the chunked content. For general domains, the prompt below is used with unspecified entity types to extract entities (entity, type, description) and relationship triples (source, target, relation); vertical domains will need few-shot examples to improve extraction quality. The paper also has the model run multiple rounds of reflection, asking "Are there any unrecognized entities in the extraction results?"; if so, extraction continues. This improves the recall of the entity-triple extraction, which is the most critical input for building the graph.

GRAPH_EXTRACTION_PROMPT = """
-Goal-
Given a text document that is potentially relevant to this activity and a list of entity types, identify all entities of those types from the text and all relationships among the identified entities.

-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, capitalized
- entity_type: One of the following types: [{entity_types}]
- entity_description: Comprehensive description of the entity's attributes and activities
Format each entity as ("entity"{{tuple_delimiter}}<entity_name>{{tuple_delimiter}}<entity_type>{{tuple_delimiter}}<entity_description>)

2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.
For each pair of related entities, extract the following information:
- source_entity: name of the source entity, as identified in step 1
- target_entity: name of the target entity, as identified in step 1
- relationship_description: explanation as to why you think the source entity and the target entity are related to each other
- relationship_strength: an integer score between 1 to 10, indicating strength of the relationship between the source entity and target entity
Format each relationship as ("relationship"{{tuple_delimiter}}<source_entity>{{tuple_delimiter}}<target_entity>{{tuple_delimiter}}<relationship_description>{{tuple_delimiter}}<relationship_strength>)

3. Return output in {language} as a single list of all the entities and relationships identified in steps 1 and 2. Use **{{record_delimiter}}** as the list delimiter.

4. If you have to translate into {language}, just translate the descriptions, nothing else!

5. When finished, output {{completion_delimiter}}.

-Examples-
######################
{examples}

-Real Data-
######################
entity_types: [{entity_types}]
text: {{input_text}}
######################
output:"""

The entities extracted using the above Prompt look as follows

(image: sample entity-extraction output)
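The multi-round reflection ("gleaning") loop described above can be sketched as follows; `call_llm` is a hypothetical stand-in for a real LLM client, and the prompt strings are illustrative:

```python
# Sketch of the gleaning loop: after the first extraction pass, ask the model
# whether entities were missed and, if so, keep extracting (assumed prompts).

def extract_with_gleaning(chunk, call_llm, max_gleanings=2):
    """Return the concatenated extraction output over all gleaning rounds."""
    outputs = [call_llm("Extract entities and relationships from:\n" + chunk)]
    for _ in range(max_gleanings):
        verdict = call_llm("Were any entities missed in the last extraction? "
                           "Answer YES or NO.")
        if not verdict.strip().upper().startswith("YES"):
            break
        outputs.append(call_llm("Continue extracting the missed entities, "
                                "using the same format."))
    return "\n".join(outputs)
```

Capping the number of gleaning rounds bounds the extra model cost per chunk.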

With the entity triples, graph construction is straightforward: graphrag uses NetworkX directly to build an undirected graph.
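Parsing the extraction records into a graph can be sketched as below. GraphRAG itself builds a NetworkX graph; here plain dicts keep the sketch dependency-free, and the `##` / `|` delimiters are illustrative stand-ins for `{record_delimiter}` / `{tuple_delimiter}`:

```python
# Sketch: turn ("entity"|...) and ("relationship"|...) records into nodes/edges.

def build_graph(raw: str, rec_delim: str = "##", tup_delim: str = "|"):
    nodes, edges = {}, []
    for record in raw.split(rec_delim):
        record = record.strip().strip("()")
        if not record:
            continue
        parts = [p.strip().strip('"') for p in record.split(tup_delim)]
        if parts[0] == "entity" and len(parts) == 4:
            _, name, etype, desc = parts
            nodes[name] = {"type": etype, "description": desc}
        elif parts[0] == "relationship" and len(parts) == 5:
            _, src, tgt, desc, strength = parts
            edges.append((src, tgt, {"description": desc,
                                     "weight": float(strength)}))
    return nodes, edges
```

With NetworkX, the same `nodes` and `edges` would simply be fed into `nx.Graph()` via `add_node` and `add_edge`.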

Step 2. Graph partitioning and community summarization

With the graph in hand, the next step is how to describe its information. Before large models, we mostly used templates to turn entity and entity-relationship information into text; in the LLM era there are more possibilities. Microsoft adds an extra step here: partition the graph, then describe each part.

Graph partitioning is also called community detection. The reason for doing it goes back to the problems GraphRAG targets: global-topic, summary, and association questions. These are answered by first partitioning the data hierarchically and pre-summarizing every local topic (community). The resulting summaries contain both local structural information, such as the handful of companies whose main business is GPUs, and local semantic information, such as abstract topics and concepts.

There are many community-detection algorithms: some based on modularity, some on hierarchical clustering, and various ones based on random walks. The paper chooses Leiden, an optimization of the Louvain algorithm that also belongs to the modularity family; it produces multiple mutually exclusive subgraphs.
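To show what a partition looks like without pulling in the Leiden packages, here is a tiny deterministic label-propagation pass over an adjacency dict. This is a much weaker algorithm than Leiden and is only an illustration of the input/output shape, not GraphRAG's method:

```python
# Illustrative community detection by label propagation (NOT Leiden):
# each node repeatedly adopts the most common label among its neighbours.

def label_propagation(adj: dict[str, set[str]], rounds: int = 5) -> dict[str, str]:
    labels = {n: n for n in adj}            # start with one label per node
    for _ in range(rounds):
        changed = False
        for node in sorted(adj):            # fixed order keeps runs deterministic
            counts = {}
            for nb in adj[node]:
                counts[labels[nb]] = counts.get(labels[nb], 0) + 1
            # most common neighbour label; ties broken by max label, for determinism
            best = max(counts, key=lambda l: (counts[l], l))
            if labels[node] != best:
                labels[node], changed = best, True
        if not changed:
            break
    return labels
```

On a graph made of two dense clusters joined by a single edge, this converges to two labels, i.e. two communities.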

For each subgraph, the Prompt below asks the model to generate a community summary. A subgraph in effect corresponds to a topic, which can be an event, all the information about a subject, a class of themes, and so on. Besides the summary, the Prompt also generates findings, which correspond to the abstraction of subject- and topic-level information mentioned earlier.

COMMUNITY_REPORT_SUMMARIZATION_PROMPT = """
{persona}

# Goal
Write a comprehensive assessment report of a community taking on the role of a {role}. The content of this report includes an overview of the community's key entities and relationships.

# Report Structure
The report should include the following sections:
- TITLE: community's name that represents its key entities - title should be short but specific. When possible, include representative named entities in the title.
- SUMMARY: An executive summary of the community's overall structure, how its entities are related to each other, and significant points associated with its entities.
- REPORT RATING: {report_rating_description}
- RATING EXPLANATION: Give a single sentence explanation of the rating.
- DETAILED FINDINGS: A list of 5-10 key insights about the community. Each insight should have a short summary followed by multiple paragraphs of explanatory text grounded according to the grounding rules below. Be comprehensive.

.....(output format and few-shot examples omitted)
"""

The structured summary produced for each community looks as follows; the spliced text of title + summary + findings['summary'] + findings['explanation'] is later used as the summary of that local information. This is actually the most model-intensive part of the whole pipeline, and although the paper does not mention it, when new nodes enter the graph the reports also need incremental updates: the changed subgraphs must be identified and their reports regenerated.

(image: sample community report)
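The splicing just described can be sketched in a few lines; the report dict shape mirrors the structured output above (`report_to_text` is a hypothetical helper):

```python
# Sketch: splice a community report into the text used as the local summary,
# i.e. title + summary + each finding's summary and explanation.

def report_to_text(report: dict) -> str:
    parts = [report["title"], report["summary"]]
    for finding in report.get("findings", []):
        parts.append(finding["summary"])
        parts.append(finding["explanation"])
    return "\n".join(parts)
```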

Step 3. Answering with graph information

Finally, how is all this used? When a user query comes in, Microsoft's paper uses all the generated reports for answering and then filters the answers. In practice you could add a recall step first, retrieving only the subgraph reports relevant to the query; there will be some loss, but latency and cost drop significantly. Microsoft's implementation is

  • Concatenate: split all the reports and splice them into multiple chunks, each chunk serving as one piece of context.
  • Map: use each context chunk plus the user query to produce an intermediate answer, then have the large model score every intermediate answer from 0 to 100.
  • Reduce: sort the scores in descending order, keep as many answers as fit within the context-window limit, splice them as context, and use the large model to generate the final response.
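The three steps above can be sketched as follows; `call_llm` and `score_answer` are hypothetical stand-ins for the real LLM calls, and the word-count budget approximates the context-window limit:

```python
# Sketch of the map/reduce answering flow: answer per chunk, score each
# intermediate answer, then keep the highest-scoring ones within a budget.

def map_reduce_answer(query, report_chunks, call_llm, score_answer,
                      budget_words=1000):
    # Map: one intermediate answer per context chunk, scored 0-100
    scored = []
    for chunk in report_chunks:
        answer = call_llm(f"Context:\n{chunk}\n\nQuestion: {query}")
        scored.append((score_answer(query, answer), answer))
    # Reduce: keep highest-scoring answers that fit the window budget
    scored.sort(key=lambda x: x[0], reverse=True)
    kept, used = [], 0
    for score, answer in scored:
        n = len(answer.split())
        if used + n > budget_words:
            break
        kept.append(answer)
        used += n
    return call_llm("Combine into a final answer:\n" + "\n---\n".join(kept))
```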

For evaluation, the paper compares reports generated from subgraphs at different levels (C0-C3) as context on a podcast dataset and a news-article dataset, plus direct map-reduce summarization over the source texts (TS), against naive RAG (SS). Below are the pairwise win rates, with a large model judging comprehensiveness, diversity, empowerment, and directness of the responses. All configurations using graph information beat naive RAG~

(image: pairwise win rates across the four evaluation metrics)

LightRAG

  • /HKUDS/LightRAG
  • LightRAG: Simple and Fast Retrieval-Augmented Generation

LightRAG is a new RAG framework just released by HKU. Compared with Microsoft's GraphRAG implementation, it is further optimized at the information-recall level. Here we only look at the core differences between LightRAG and GraphRAG: how the graph index is built and how graph information is recalled.

(image: LightRAG framework overview)

The two are basically the same at the graph-construction stage. The difference is that, to build its recall index, LightRAG adds a high-level keyword-extraction instruction to GraphRAG's entity-and-relation extraction prompt. These keywords describe the local, abstract information of the graph and serve directly as an index for topic- and concept-style questions. Compared with Microsoft's subgraph reports for describing local information, LightRAG's extraction-time keywords are more lightweight, but weaker at describing larger-scope subgraphs.

GRAPH_EXTRACTION_PROMPT = """
...same as above

- relationship_keywords: one or more high-level key words that summarize the overarching nature of the relationship, focusing on concepts or themes rather than specific details
Format each relationship as ("relationship"{tuple_delimiter}<source_entity>{tuple_delimiter}<target_entity>{tuple_delimiter}<relationship_description>{tuple_delimiter}<relationship_keywords>{tuple_delimiter}<relationship_strength>)

...same as above
"""

In the retrieval phase, LightRAG adds a two-level graph-information recall

  • low level: used to answer detail-oriented questions focused on specific entities and relationships, e.g., who wrote Pride and Prejudice
  • high level: used to answer global, conceptual questions that require global, abstract information, e.g., how artificial intelligence is impacting contemporary education

For these two perspectives, LightRAG prompts the large model to generate two types of search keywords from the query: one type for searching specific entities, and one for searching topics and concepts, matching the high-level keywords generated during entity extraction above. The prompt and few-shot examples are as follows

(image: keyword-extraction prompt and few-shot examples)

The two keyword types drive two separate recalls, whose results are then merged

  • Low level: use the low-level keywords generated from the query to retrieve entities, since the low level targets entity-oriented detail. The paper vectorizes each entity by running entity name + description through the embedding model.
  • High level: use the high-level keywords generated from the query to retrieve relations. Because local abstract keywords were extracted for each relation earlier, and relation vectors are built from those keywords plus the relation description, topic-style local recall can be achieved by recalling relations.

(image: LightRAG dual-level retrieval)
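The two-level recall and merge can be sketched as below. Plain substring matching stands in for the embedding-based retrieval of the real system, and the node/edge structures (with a `keywords` field on relations) are illustrative assumptions:

```python
# Sketch of LightRAG-style dual-level recall: low-level keywords match
# entities, high-level keywords match relation keywords; results are merged.

def dual_level_retrieve(low_kw, high_kw, nodes, edges):
    # low level: entity-oriented detail recall over node names
    low_hits = [name for name in nodes
                if any(kw.lower() in name.lower() for kw in low_kw)]
    # high level: topic/concept recall over relation keywords
    high_hits = [(src, tgt) for src, tgt, attrs in edges
                 if any(kw.lower() in k.lower()
                        for kw in high_kw
                        for k in attrs.get("keywords", []))]
    return low_hits, high_hits
```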

The remaining context splicing and large-model answering are much the same as above, so we will not go into detail; interested readers can go read the code~

Want to see more? Full large-model papers, fine-tuning and pre-training data, open-source frameworks, AIGC applications >> DecryPrompt