
GraphRAG

Published: 2024-07-24

GraphRAG is a graph-based retrieval-augmentation method developed and open-sourced by Microsoft. It extracts structured data from unstructured text by combining LLMs with graph machine learning, constructing a knowledge graph to support application scenarios such as Q&A and summarization. GraphRAG uses graph machine learning algorithms for semantic aggregation and hierarchical analysis, enabling it to answer high-level abstract or summarization questions, which are a weak point of conventional RAG systems.

The core of GraphRAG lies in its processing flow, which consists of two stages: Indexing and Querying.

  • In the Indexing phase, GraphRAG splits the input text into multiple analyzable units (called TextUnits). Entities, relationships, and key claims are extracted using an LLM. The graph is then divided into communities by hierarchical clustering (e.g., the Leiden algorithm), and a summary is generated for each community.
  • In the Querying phase, these structures provide the material used as context for the LLM to answer questions. Query modes include global search and local search:
    • Global search: reasons about questions concerning the corpus as a whole by using the community summaries
    • Local search: reasons about a specific entity by fanning out to its neighbors and related concepts

GraphRAG demonstrates significant performance gains over traditional RAG methods on private datasets. By constructing knowledge graphs and community hierarchies and applying graph machine learning techniques, it improves question answering over complex information, especially when a large dataset or the semantics of a single large document must be understood as a whole.

According to Microsoft's official blog post [1], Microsoft used an LLM rater to assess the performance of GraphRAG against Baseline RAG on a set of metrics including comprehensiveness (completeness of the extracted contextual content, including implicit information), human enfranchisement (provision of source material or other contextual information), and diversity (provision of different perspectives or angles on the question). Preliminary evaluation results show that GraphRAG outperforms Baseline RAG on all of these metrics.

In addition, Microsoft used SelfCheckGPT for an absolute measurement of faithfulness, to ensure the results are factual and consistent with the source material. The results show that GraphRAG achieves a level of faithfulness similar to Baseline RAG. Microsoft is also developing an evaluation framework to further measure performance on the question types above.

 

1. Testing of GraphRAG

Note that by default, GraphRAG uses models from OpenAI. To use models provided by AWS Bedrock instead, you can use the Bedrock Access Gateway solution provided by AWS [2], which exposes Bedrock models behind an OpenAI-compatible API.
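If GraphRAG is routed through such a gateway, the LLM section of the settings.yaml generated during initialization can be pointed at the gateway's OpenAI-compatible endpoint. The sketch below is illustrative only; the model name and gateway URL are placeholders, not verified values:

```yaml
# settings.yaml (excerpt) -- illustrative values, not a verified configuration
llm:
  api_key: ${GRAPHRAG_API_KEY}   # read from the .env file created at init time
  type: openai_chat
  model: gpt-4-turbo-preview
  # To route requests through Bedrock Access Gateway instead of OpenAI,
  # point api_base at the gateway's OpenAI-compatible endpoint (placeholder URL):
  # api_base: http://<your-gateway-host>/api/v1
```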

The deployment process is as follows:

# Create conda environment

conda create -n grag python=3.10

conda activate grag

# Install graphrag

pip3 install graphrag

# Setting up the environment

mkdir gragdemo

cd gragdemo/

# Initialize the workspace, creating the .env and settings.yaml files with default configuration

python3 -m graphrag.index --init --root ./ragtest

# Edit settings.yaml and the .env file to configure the LLM to be used

# Load sample data

mkdir -p ./ragtest/input

curl https://www.gutenberg.org/cache/epub/24022/pg24022.txt > ./ragtest/input/book.txt

# Generate index

python -m graphrag.index --root ./ragtest

# The corresponding Indexing section outputs

⠴ GraphRAG Indexer

├── Loading Input (text) - 1 files loaded

├── create_base_text_units

├── create_base_extracted_entities

├── create_summarized_entities

├── create_base_entity_graph

├── create_final_entities

├── create_final_nodes

├── create_final_communities

├── join_text_units_to_entity_ids

├── create_final_relationships

├── join_text_units_to_relationship_ids

├── create_final_community_reports

├── create_final_text_units

├── create_base_documents

└── create_final_documents

🚀 All workflows completed successfully.

# Execute queries

python -m graphrag.query \

--root ./ragtest \

--method local \

"Who is Scrooge, and what are his main relationships?"

# Query Returns

SUCCESS: Local Search Response: # Ebenezer Scrooge and His Key Relationships

## Ebenezer Scrooge: The Miserly Central Character

## Scrooge's Past Relationship with Belle

## Scrooge's Deceased Business Partner: Jacob Marley

## Scrooge's Relationship with Bob Cratchit and His Family

## Scrooge's Nephew and Niece

Throughout the story, Ebenezer Scrooge's relationships with these characters serve as catalysts for his personal growth and redemption, as he is confronted with the consequences of his actions and the importance of kindness, generosity, and embracing the Christmas spirit.

The full response is not reproduced here because the query returned a lot of content. The structure of the response shows that, under GraphRAG, the model has a much more comprehensive grasp of the characters' interpersonal relationships, an effect that is very difficult to achieve with standard RAG.

 

2. Process analysis

The execution of GraphRAG (e.g., the multi-step Indexing process printed above) is organized as a data pipeline whose goal is to extract meaningful information from unstructured data with the help of an LLM and save it as structured data. After the Indexing process completes, you will find a set of parquet files in the output/xxx/artifacts/ directory; these store the extracted (or LLM-processed) information.

3. Indexing process

When indexing the document earlier, we saw a series of steps, including:

├── Loading Input (text) - 1 files loaded

├── create_base_text_units

├── create_base_extracted_entities

├── create_summarized_entities

├── create_base_entity_graph

├── create_final_entities

├── create_final_nodes

├── create_final_communities

├── join_text_units_to_entity_ids

├── create_final_relationships

├── join_text_units_to_relationship_ids

├── create_final_community_reports

├── create_final_text_units

├── create_base_documents

└── create_final_documents

The Indexing process of GraphRAG consists of a set of workflows, tasks, prompt templates, and input/output adapters. The default standard pipeline in the test above is:

  1. Slice documents into chunks, compute embeddings, and generate TextUnits
  2. Extract entities, relationships, and claims from the text
  3. Perform community detection over the entities
  4. Generate community summaries and reports at multiple levels of granularity
  5. Embed entities into a graph vector space
  6. Embed text chunks into a textual vector space

Below, following the official documentation [3], we walk through the default Indexing workflow, which is divided into six stages.

 

Stage 1: Building TextUnits

The first stage of the workflow processes the input documents and turns them into TextUnits. A TextUnit is the basic unit (i.e., a chunk) on which graph extraction operates, and it also serves as the basic unit for source referencing.

The chunk size is user-specified and defaults to 300 tokens (the overlap defaults to 100). Using a larger chunk size speeds up processing (and, in the author's tests, a chunk size of 1200 gave better results), but it also lowers the fidelity of the output and makes the reference text less meaningful.

By default, one document maps to multiple chunks, but you can also configure a many-to-many relationship between documents and chunks. This suits scenarios with very short documents, where several documents are needed to form a meaningful unit of analysis, such as chat logs or tweets.
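The chunking scheme just described can be sketched in a few lines of Python. This is a simplified illustration: real GraphRAG counts model tokens with a tokenizer, while here `tokens` is any pre-tokenized list.

```python
def chunk_tokens(tokens, size=300, overlap=100):
    """Split a token sequence into overlapping chunks, mimicking
    GraphRAG's defaults (chunk size 300, overlap 100)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap  # how far the window advances each time
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]
```

With the defaults, a 700-token document yields four chunks, with 100 tokens shared between each chunk and the tail of the previous one.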

Each text unit then undergoes a text-embedding operation and is passed to the next stage of the processing pipeline.

The first stage of the process is shown below:

TextUnits actual data

As mentioned earlier, after building the index you can find the generated parquet files in the output artifacts directory. Among them is create_base_text_units.parquet, which has the following contents:

You can see that id is the chunk_id. document_ids is the id of the original document. chunk is the original content of the slice.

 

Stage 2: Graph Extraction

In this phase, the text unit is analyzed and the basic units that make up the graph are extracted: Entities, Relationships and Claims.

The flowchart is:

Entities & Relationship extraction

In the first step, graph extraction, an LLM is used to extract the entities and relationships from each text unit. The output is one sub-graph per text unit, containing a set of entities with their names, types, and descriptions, and a set of relationships with their sources, targets, and descriptions.

The sub-graphs are then merged. Any entities with the same name and type are considered the same entity, and their descriptions are merged into an array. Similarly, any relationships with the same source and target are considered the same relationship, and their descriptions are merged into an array.
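A minimal sketch of this merge step, assuming each per-chunk sub-graph is a dict of entity and relationship records (the field names are illustrative, not GraphRAG's internal schema):

```python
def merge_subgraphs(subgraphs):
    """Merge per-text-unit sub-graphs: entities sharing (name, type) and
    relationships sharing (source, target) collapse into single records
    whose descriptions are collected into lists."""
    entities, relationships = {}, {}
    for sg in subgraphs:
        for e in sg["entities"]:
            rec = entities.setdefault(
                (e["name"], e["type"]),
                {"name": e["name"], "type": e["type"], "descriptions": []})
            rec["descriptions"].append(e["description"])
        for r in sg["relationships"]:
            rec = relationships.setdefault(
                (r["source"], r["target"]),
                {"source": r["source"], "target": r["target"], "descriptions": []})
            rec["descriptions"].append(r["description"])
    return list(entities.values()), list(relationships.values())
```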

Entity & Relationship Summarization

Now that we have a graph of entities and relationships, each with a list of descriptions, we can summarize these lists into individual descriptions for each entity and relationship (by making a summary of all descriptions via LLM). In this way, all entities and relationships can have a concise description.

Entity parsing (not enabled by default)

The final step of graph extraction is to resolve cases where entities with different names in the same world (or space) are actually the same entity, which is accomplished through the LLM. Other entity-resolution techniques are being explored in the hope of finding a more conservative, non-destructive approach.

Claim Extraction & Emission

Finally, claims are extracted from the original text units. These claims represent positive factual statements with an evaluated status and time range; they are saved as covariates (statements about entities that may be time-bound) and emitted.
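The shape of such a claim record might look like the following sketch. The field names are a guess at the idea just described, not GraphRAG's actual schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Claim:
    """Illustrative shape of a claim/covariate record: a factual
    statement about an entity, with an evaluated status and an
    optional time range (field names are hypothetical)."""
    subject: str                      # entity the claim is about
    statement: str                    # the factual assertion itself
    status: str                       # evaluated status, e.g. "TRUE" / "SUSPECTED"
    start_date: Optional[str] = None  # time range, if the claim is time-bound
    end_date: Optional[str] = None
    source_text_unit_ids: list = field(default_factory=list)
```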

Entity Graph actual data

The file generated after the extraction of the raw entities and relationships is create_base_extracted_entities.parquet. It contains only one row, corresponding to entity_graph, which is a graph representation in GraphML format.

Since it is in GraphML format, we can visualize it. The corresponding visualization (generated with Gephi; the raw output may not be a well-formed file and may require post-processing) is:

From the extracted entity relationships, we can see the entities of the key characters. For example, the entity BOB (its ID and LABEL are BOB) has the type PERSON (its description is a list containing the description of BOB from each related chunk), and the entities related to it include TINY TIM, PETER, BOB'S CHILD, and so on. On the other hand, we can also see some fairly meaningless entities, such as WEATHER, FIRE, HIM, SUMMER, and GAIN, whose type and description are both None and which have only one source chunk.

In addition to the nodes, each edge is described accordingly; for example, the relationship between BOB and PETER is described ("Bob is Peter's father..."):

After the raw graph is constructed, it is processed again to generate the summarized_entities graph. This can be seen by comparing the same entity, BOB: in the raw entity graph, BOB's description consists of multiple paragraphs separated by spaces, while in the summarized_entities graph it is a single summarized paragraph. Edge descriptions are treated in the same way.

 

Stage 3: Graph Augmentation

Now that we have a usable graph of entities and relationships, the next step is to understand its community structure and enhance the graph with additional information. This is done in two steps, Community Detection and Graph Embedding, which let us understand the topology of the graph in both explicit (community) and implicit (embedding) ways.

Community Detection

In this step, the Hierarchical Leiden algorithm is used to generate a hierarchy of entity communities. The method performs recursive community clustering on the graph until a community size threshold is reached. This lets us understand the community structure of the graph and provides a way to navigate and summarize it at different levels of granularity.
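The recursive idea can be sketched as follows, with `cluster_fn` standing in for a single Leiden pass. This is a structural illustration only; the real pipeline uses a proper Hierarchical Leiden implementation:

```python
def hierarchical_communities(nodes, cluster_fn, max_size, level=0):
    """Recursively cluster until every community is at or below max_size,
    recording (level, community) pairs. cluster_fn(nodes) -> list of
    communities is a stand-in for one graph-clustering pass."""
    result = []
    for community in cluster_fn(nodes):
        result.append((level, community))
        if len(community) > max_size:  # still too big: cluster again, one level down
            result.extend(hierarchical_communities(community, cluster_fn,
                                                   max_size, level + 1))
    return result
```

The output is a flat list of communities tagged with the hierarchy level they belong to, which mirrors the level column seen later in the communities table.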

Graph Embedding

In this step, a vector representation of the graph is generated using the Node2Vec algorithm. This will allow us to understand the implicit structure of the graph and provide an additional vector space for searching related concepts during the query phase.

Graph Tables Emission

Once the graph augmentation steps are complete and the text embedding has been done (entity descriptions are embedded and written to the vector database), the final Entities and Relationships tables are generated.

Graph Augment actual data

In this workflow, we know that community detection will be done first, and the corresponding results will be saved in the create_final_communities.parquet file, the following is part of the file:

Each row is one community, including the level it belongs to and the relationship IDs and text unit IDs it contains. Analyzing the file with pandas shows that a total of 67 communities were generated, divided into four levels.

As also mentioned above, this process ends up generating data for the Entities and Relationships tables, with corresponding files create_final_entities.parquet and create_final_relationships.parquet, respectively.

The final contents of the Entities table are:

You can see that each entity is labeled with its type (e.g., the BOB entity in the figure above is of type PERSON), its description, and the embedding vector of that description (stored in the vector database, LanceDB by default).

The corresponding Relationships table stores the structured representation of the relationships:

 

Stage 4: Community Summarization (CS)

At this point, we have a functional graph of entities and relationships, a community hierarchy over the entities, and the node2vec embedding of the graph.

Now we want to generate reports for each community based on the community data. This gives us a high-level understanding of the graph at multiple levels of granularity. For example, if community A is a top-level community, we get a report on the entire graph; if it is a lower-level community, we get a report on a local cluster.

Generate community reports

In this step, a summary is generated for each community using the LLM. This lets us understand the unique information contained in each community and provides a scoped understanding of the graph from either a high- or a low-level perspective. These reports contain an overview, as well as references to the key entities, relationships, and claims within the community substructure.

Summarizing community reports

In this step, each community report is simplified by summarizing it through LLM.

Community Embedding

In this step, we generate a vector representation of the community by generating text embeddings of three pieces of text: the community report, the community report summary, and the community report title.

Community Table Generation

At this point, some bookkeeping is done and the Communities and CommunityReports tables are generated.

Community summary of actual data

The summarized community data is written to the create_final_community_reports.parquet file:

As you can see, the table records each community's description, its summary, and an importance rating along with the reasons for that rating.

 

Stage 5: Document Processing

This phase creates the Documents table for the knowledge model.

Add columns (CSV data only)

If processing CSV data, you can configure the workflow to add additional fields to the document output. These fields should be present in the input CSV table.

Link to Unit Text

In this step, each document is linked to the text-unit created in the first stage. This allows us to understand which documents are associated with which text-unit and vice versa.

Document Embedding

In this step, a vector representation of each document is generated from the average embedding of the document's fragments. The document is first re-chunked without overlap, and an embedding is generated for each chunk. A weighted average of these chunk embeddings (weighted by token count) is then computed and used as the document embedding. This helps us understand the implicit relationships between documents and generate a network representation of them.
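The token-weighted averaging step can be sketched in plain Python, assuming the chunk embeddings and their token counts have already been computed:

```python
def document_embedding(chunk_embeddings, chunk_token_counts):
    """Token-count-weighted average of chunk embeddings: longer chunks
    contribute proportionally more to the document vector."""
    total = sum(chunk_token_counts)
    dim = len(chunk_embeddings[0])
    doc = [0.0] * dim
    for emb, n in zip(chunk_embeddings, chunk_token_counts):
        weight = n / total
        for i, v in enumerate(emb):
            doc[i] += v * weight
    return doc
```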

Document table output

Export the Documents table to the knowledge model.

Document processing of actual data

This step outputs the create_final_documents.parquet file with the contents:

Since we only have one document, there is only one row.

 

Stage 6: Network Visualization

In this stage, several steps are performed to support network visualization of the high-dimensional vector spaces in the existing graphs. At this point, two logical graphs are in play: the entity-relationship graph and the document graph.

Network Visualization Workflow

For each logical graph, UMAP dimensionality reduction is performed to generate a 2D representation of the graph. This lets us visualize the graph in two dimensions and understand the relationships between its nodes. The result of the UMAP embedding is output as a Nodes table, whose rows include an indicator of whether the node is a document or an entity, plus the UMAP coordinates.

Network visualization actual data

This part of the pipeline generates the create_final_nodes.parquet file. From its contents, each node is an entity together with its community, degree, source ID, and so on. However, graph_embedding and the x and y coordinates are empty or 0.

 

4. Querying process

From the above, we understand that while constructing the index, GraphRAG generates entity-relationship graphs, community hierarchies and their summaries, source chunks, and various other dimensions of information, stored in vectorized and structured form. Below we describe how this information is used for retrieval augmentation at query time.

There are two types of Query, Local Search and Global Search.

4.1. Local Search

Local Search is an entity-based answering mode. It combines structured data from the knowledge graph with unstructured data from the input documents to extend the LLM context with relevant entity information at query time. The method is well suited to questions that require understanding specific entities mentioned in the input documents (e.g., "What are the therapeutic properties of chamomile?").

The flow chart is shown below:

Given a user query (or coupled with a conversation history), Local Search identifies a set of entities from the knowledge graph that are semantically related to the user input. These entities serve as entry points into the knowledge graph, enabling the extraction of further relevant details such as connected entities, relationships, entity covariates (variables related to entities), and community reports. In addition, it extracts relevant text chunks from the original input document that are related to the identified entities. These candidate data sources are then prioritized and filtered to fit into a single context window of predefined size, which is used to generate responses to user queries.
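The final prioritize-and-filter step can be sketched as greedy packing of scored candidates into a fixed token budget. This is an illustration only, not GraphRAG's actual ranking logic; scores and token counts are assumed to be precomputed:

```python
def build_context(candidates, max_tokens):
    """Greedily pack the highest-priority candidate texts into a single
    context window of at most max_tokens. Each candidate is a
    (text, score, token_count) tuple."""
    chosen, used = [], 0
    for text, score, tokens in sorted(candidates, key=lambda c: -c[1]):
        if used + tokens <= max_tokens:  # keep it only if it still fits
            chosen.append(text)
            used += tokens
    return "\n\n".join(chosen)
```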

4.2. Global Search

Global Search is based on inference across the entire dataset. Conventional RAG struggles in scenarios where information must be aggregated across the dataset and then combined into an answer. A query like "What are the top 5 themes in the data?" performs poorly because conventional RAG relies on vector retrieval of semantically similar text in the dataset; if no text in the knowledge base directly contains an answer to the question, it cannot produce a high-quality answer.

GraphRAG, however, can answer such questions, because the structure of the LLM-generated knowledge graph tells us about the structure (and topics) of the entire dataset. This allows a private dataset to be organized into meaningful semantic clusters that are summarized in advance. Using the global search approach, the LLM can draw on these clusters to summarize the relevant topics and answer user questions about the dataset as a whole.

The flow chart is shown below:

Given a user query (optionally with a conversation history), Global Search generates a response in a map-reduce fashion, using a collection of LLM-generated community reports from a specified level of the graph's community hierarchy as context data. In the map step, the community reports are partitioned into text chunks of predefined size, and each chunk is used to generate an intermediate response containing a list of bullet points, each accompanied by a numerical rating of its importance. In the reduce step, the most important points from the intermediate responses are filtered, summarized, and used as the context for generating the final response.
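The map-reduce flow can be sketched with stub functions standing in for the two LLM calls: `map_fn` rates bullet points from one community report and `reduce_fn` writes the final answer from the best points. Both are placeholders, not GraphRAG's actual prompts:

```python
def global_search(community_reports, query, map_fn, reduce_fn, top_k=5):
    """Map-reduce sketch of Global Search. map_fn(report, query) returns
    a list of (point_text, importance_score) pairs; reduce_fn(points,
    query) summarises the top-rated points into the final response."""
    points = []
    for report in community_reports:          # map step: score points per report
        points.extend(map_fn(report, query))
    points.sort(key=lambda p: -p[1])          # keep only the most important points
    best = [text for text, score in points[:top_k]]
    return reduce_fn(best, query)             # reduce step: final answer
```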

The quality of the global search response can be significantly affected by the hierarchy level used as the source of community reports. Reports at lower levels contain more detail and typically produce more thorough responses, but the larger number of reports may also increase the time and LLM resources needed to generate the final answer.

 

5. Summary

From initial use and an understanding of the Indexing and Querying process, the benefits of GraphRAG are clear:

  1. Comprehensive knowledge: thanks to the entity-relationship graphs, communities, original document slices, and other information built during Indexing, GraphRAG can access richer and more layered information during retrieval (both vector retrieval and structured-data retrieval), providing more comprehensive answers
  2. Well-sourced: each index is generated with references back to the source document chunks, maintaining fidelity to the source data and making the answers reliable

On the other hand, its drawbacks are very obvious:

  1. Time-consuming and costly: Indexing requires frequent LLM calls (e.g., entity extraction, entity and relationship summarization, community summarization) and calls to external algorithms (e.g., community detection), which lengthens indexing time and makes LLM costs expensive (this may be mitigated in the future; for example, the introduction of GPT-4o mini has dramatically reduced the cost of LLM calls)
  2. Scalability: beyond becoming even more time-consuming and costly as the dataset grows by orders of magnitude, the stability of the Indexing process at scale needs further testing
  3. High latency: the multiple recall, filtering, and ranking steps during retrieval correspondingly increase the latency of answering questions

Another possible weakness is reference ambiguity and entity resolution. As we saw in the entity-relationship graph above, words such as HIM and GAIN have unclear referents, and different names for the same entity may end up as separate entity nodes. This introduces a risk of inaccurate information.

Overall, the author still considers GraphRAG a very powerful tool for extracting insight from unstructured data and for making up for the shortcomings of the existing RAG paradigm. However, time consumption, cost, and scalability remain challenging when applying it to production environments.

 

Finally, the testing phase used the default English dataset. As a next step, we will build a GraphRAG scenario using a Chinese corpus and introduce the retrieval process in more depth through code.

 

References

[1] GraphRAG: Unlocking LLM discovery on narrative private data: https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/

[2] Bedrock Access Gateway: https://github.com/aws-samples/bedrock-access-gateway

[3] Indexing Dataflow: https://microsoft.github.io/graphrag/posts/index/1-default_dataflow/