
RAG chunking strategies: mainstream methods (recursive, jina-seg) + cutting-edge recommendations (Meta-chunking, Late chunking, SLM-SFT)


The most commonly used data chunking methods are rule-based, relying on techniques such as a fixed chunk size or overlap between adjacent chunks to ensure that no information is lost. For documents with multiple hierarchical levels, you can use the RecursiveCharacterTextSplitter provided by LangChain, which allows documents to be split at different levels.

However, in practice, because the predefined rules (such as the chunk size or the size of the overlap) are too rigid, rule-based data chunking can easily lead to problems such as retrieval contexts that are incomplete, that contain noise ("noise" refers to unwanted, interfering information or data that may mislead the analysis or processing), or that are simply too large.

  • The challenge of vectorizing long texts

    In a vectorization model based on the Transformer architecture, each word is mapped to a high-dimensional vector. To represent the semantics of the whole text, it is common to average the word vectors or to use the vector at a special token position (e.g., [CLS]) as the overall representation. However, directly vectorizing excessively long text faces the following challenges:

    • semantic information dilution: Long texts often cover multiple topics or viewpoints, and it is difficult for the overall vector to accurately capture the semantics of the details, resulting in the dilution of semantic information, which fails to fully reflect the core content of the text.

    • Increased computational overhead: Processing long text requires more computational resources and storage space, which increases the computational complexity of the model and affects the performance and efficiency of the system.

    • Reduced search efficiency: Excessively long vectors may reduce the matching accuracy during the retrieval process, leading to a decrease in the relevance of the retrieval results, as well as reducing the speed and efficiency of the retrieval.

  • The need to improve the quality of retrieval and generation

    To overcome the above challenges, a reasonable text chunking strategy is particularly important. By splitting the text appropriately, the quality of retrieval and generation can be effectively improved:

    • Improved search accuracy: After chunking the text, the resulting text fragments have finer granularity, which can more accurately match the user's query intent and improve the relevance and accuracy of the retrieval results.

    • Optimize system performance: Shortening the length of a single text block reduces the computational and storage overhead of the model in the vectorization and retrieval process, and improves the processing efficiency and response speed of the system.

    • Enhancing the answer quality of large models: Providing more relevant and refined chunks of text to the large language model helps the model better understand the context and thus generate more accurate, coherent, and relevant responses.

By rationally chunking the text, not only can the semantic expression ability be improved in the vectorization process, but also higher matching accuracy can be achieved in the retrieval stage, which ultimately makes the RAG system able to provide users with better quality services.

1. The effect of text chunking strategy on the output of large models

1.1 Effects of excessively long text chunks

When building a RAG (Retrieval-Augmented Generation) system, the length of the text chunks has a crucial impact on the quality of the large model's output. Excessively long text blocks can create a host of problems:

  1. Semantic ambiguity: When a text block is too long, detailed semantic information is easily averaged out or diluted during vectorization. This is because the vectorization model must compress a large amount of lexical information into a fixed-length vector representation, so it cannot accurately capture the core themes and key details of the text. As a result, the generated vectors struggle to represent the important content of the text, reducing the accuracy of the model's understanding.

  2. Reduced Recall Precision: In the retrieval phase, the system needs to retrieve relevant text from the vector database based on the user's query. Excessively long text blocks may cover multiple topics or ideas, which increases the semantic complexity and makes it difficult for the retrieval model to accurately match the user's query intent. In this way, the relevance of the recalled text decreases, which affects the quality of the answers generated by the large model.

  3. Input constraints: The Large Language Model (LLM) has strict limits on input length. Excessively long text blocks take up more input space, reducing the number of text blocks available for input to the larger model. This limits the breadth of information that can be accessed by the model and may lead to the omission of important contextual or relevant information, affecting the final answer.

1.2 Effects of too short text chunking

Conversely, text blocks that are too short also adversely affect the output of the large model in a number of ways:

  1. missing context: Short text chunks may lack the necessary contextual information. Context is critical to understanding the meaning of language, and blocks of text that lack context can make it difficult for the model to accurately understand the meaning of the text, resulting in incomplete or off-topic responses being generated.

  2. Loss of subject information: Thematic information at the paragraph or section level requires a certain text length to be expressed. Text blocks that are too short may contain only snippets of information, failing to convey the main ideas or core concepts in their entirety and affecting the model's grasp of the overall content.

  3. Fragmentation issues: A large number of short text blocks leads to fragmentation of information and increases the complexity of retrieval and processing. The system needs to process more text blocks, which increases the computation and storage overhead. At the same time, too much fragmented information may interfere with the model's judgment, reducing system performance and answer quality.

The above analysis leads to a conclusion: a reasonable text chunking strategy is the key to improving the performance of RAG systems and the quality of large-model answers. To achieve the best results in practice, trade-offs and optimizations need to be made in the following areas:

  1. Choosing a Cutting Strategy Based on Text Content: Different types of text lend themselves to different slicing methods.

    • Texts with strong logic: For texts with tight logic within paragraphs, such as essays and technical documents, try to maintain the integrity of the paragraphs and avoid excessive cuts to retain the full semantic and logical structure.

    • Semantically independent text: For texts with relatively independent logic between sentences, such as regulatory provisions and product specifications, they can be sliced and diced by sentence. This approach helps to accurately match specific queries and improve the accuracy of retrieval.

  2. Consider the performance of vectorized models: Evaluate the processing power of the vectorization model used for texts of different lengths.

    • Long Text Processing: If the vectorization model tends to lose information when dealing with long text, the length of the text block should be appropriately shortened to improve the accuracy of the vector representation.

    • Short text optimization: For models that can effectively handle short texts, the text can be appropriately sliced, but care should be taken to retain the necessary contextual information.

  3. Focus on input constraints of large models: The large language model has limits on input length, so it must be ensured that the recalled text chunks can be fully fed into the model.

    • Input length optimization: When slicing text chunks, the length of each chunk is controlled so that it contains complete semantic information without exceeding the input limits of the model.

    • Information coverage: Ensure that the sliced text blocks cover the key information in the knowledge base to avoid missing important content.

  4. Experimentation and Iteration: There is no one-size-fits-all best practice, and it needs to be experimented with and adapted to specific application scenarios.

    • Performance Evaluation: Experimentally assessing the impact of different slicing strategies on retrieval accuracy and generation quality in order to select the most suitable scheme.

    • Continuous optimization: Continuously optimize the slicing strategy based on model performance and user feedback to improve the overall performance of the system.

2. Common text chunking strategies

In a RAG (Retrieval-Augmented Generation) system, the choice of text chunking strategy has an important impact on system performance and on the quality of the large model's generation. Reasonable text chunking can improve retrieval accuracy and provide better contextual support for generation. Below, several common text chunking methods and their application scenarios are discussed in depth.

Commonly used text chunking methods include: fixed-size chunking, NLTK-based chunking, special-format chunking, deep-learning-model-based chunking, and agent-based chunking.

2.1 Fixed-size text chunking (recursive method)

Fixed-size text chunking is the simplest and most intuitive method of text chunking: the text is divided into blocks of a predetermined fixed length. This method is easy to implement, but in practice the following points need attention:

  • Issues and challenges

    • Broken context: Simply truncating the text at a fixed number of characters may interrupt sentences or paragraphs, resulting in loss of contextual information. This affects the subsequent vectorization and semantic understanding.

    • Impaired semantic integrity: Text chunks may contain incomplete sentences or ideas, affecting the accuracy of matching in the retrieval phase, as well as the quality of answers generated by the larger model.

  • Improved methodology

    • introduce redundancy: Introduce a certain amount of overlap between neighboring blocks of text to ensure contextual coherence. For example, each block of text has a 50 character overlap with the previous block. This helps to preserve sentence integrity and paragraph coherence.

    • Intelligent truncation: When cutting text, try to choose to truncate at punctuation marks or at the end of a paragraph rather than strictly by character count. This avoids interrupting sentences and maintains semantic integrity.

  • Practical Tool: LangChain's RecursiveCharacterTextSplitter

The large-model application development framework LangChain offers RecursiveCharacterTextSplitter, which mitigates the flaws of fixed-size text chunking and is recommended for general-purpose text processing.

usage example

from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=50,
    length_function=len,
    separators=["\n", "。", ""]
)
text = "..." # Text to be processed
texts = text_splitter.create_documents([text])
for doc in texts:
    print(doc)

Parameter description

  • chunk_size: Maximum length of the text block (e.g. 200 characters).

  • chunk_overlap: Length of overlap between neighboring blocks (e.g. 50 characters).

  • length_function: Function for calculating text length; defaults to len.

  • separators: Defines a list of splitters to be used to prioritize appropriate positions when slicing text.

Working Principle

RecursiveCharacterTextSplitter recursively slices the text using the separators in the order given by separators:

  1. First cut: Use the first separator (e.g. "\n", indicating a paragraph break) to make an initial cut of the text.

  2. Check block size: If the length of a resulting text block exceeds chunk_size, the next separator (e.g. "。", indicating a sentence break) is used to cut it further.

  3. Recursive processing: The remaining separators are applied in turn until each text block reaches the required length or can no longer be cut.

  4. Merge blocks: If the combined length of neighboring text blocks does not exceed chunk_size, they are merged so that block lengths are as close as possible to chunk_size while preserving contextual integrity (a simplified sketch of this recursion follows).
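
The following is a minimal sketch of the recursion described above, written from scratch for illustration; it is not LangChain's actual implementation, and the merging logic is deliberately simplified.

def recursive_split(text, separators, chunk_size=200):
    """Illustrative recursive splitter: try the highest-priority separator first,
    recurse with the next separator on any piece that is still too long."""
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    pieces = text.split(sep) if sep else list(text)
    chunks, current = [], ""
    for piece in pieces:
        if len(piece) > chunk_size:
            # Flush what we have, then recurse into the oversized piece with the next separator
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, rest, chunk_size))
        elif len(current) + len(sep) + len(piece) <= chunk_size:
            current = (current + sep + piece) if current else piece  # merge small pieces
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

print(recursive_split("第一段第一句。第一段第二句。\n第二段只有一句。", ["\n", "。", ""], chunk_size=15))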

2.2 Text chunking based on NLTK, spaCy

NLTK (Natural Language Toolkit) is a widely used Python natural language processing library that provides a rich set of text processing features. Among them, the sent_tokenize method can be used to automatically split text into sentences.

Principle: sent_tokenize is based on the approach in the paper "Unsupervised Multilingual Sentence Boundary Detection", which uses unsupervised algorithms to build models of abbreviations, collocations, and sentence-starting words, and then uses these models to identify sentence boundaries. This approach has yielded good results on a wide range of (mainly European) languages; a direct usage example follows the notes below.

  • Pre-trained model missing for Chinese: NLTK does not provide pre-trained weights for a Chinese sentence-splitting model, so users need to train one themselves.

  • Training interface available: NLTK provides a training interface that allows users to train clause models based on their own Chinese corpus.
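
For reference, sent_tokenize can also be called directly, without going through LangChain. A minimal sketch (the pre-trained Punkt resource has to be downloaded once; newer NLTK releases may additionally require the punkt_tab resource):

import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # one-time download of the pre-trained Punkt sentence model

text = "NLTK is a Python NLP toolkit. It ships a pre-trained sentence tokenizer. It works best on European languages."
for sentence in sent_tokenize(text):
    print(sentence)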

  • Application in LangChain

LangChain integrates NLTK's text slicing function, which is convenient for users to call directly.

usage example

from langchain.text_splitter import NLTKTextSplitter
text_splitter = NLTKTextSplitter()
text = "..." # Text to be processed
texts = text_splitter.split_text(text)
for doc in texts:
    print(doc)

  • Extension: spaCy-based text chunking

spaCy is another powerful natural language processing library with more advanced linguistic analysis capabilities, and LangChain also integrates spaCy's text splitting method.

Usage

Simply replace NLTKTextSplitter with SpacyTextSplitter:


from langchain.text_splitter import SpacyTextSplitter

text_splitter = SpacyTextSplitter()

text = "..." # Text to be processed
texts = text_splitter.split_text(text)

for doc in texts:
    print(doc)

Tip: When using spaCy, you need to download the model of the corresponding language first. For example, to process Chinese text, you need to download the Chinese model package.
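
For example, a minimal sketch for Chinese text, assuming the zh_core_web_sm model package (the pipeline parameter name follows the LangChain documentation; check it against your installed version):

# Download the Chinese pipeline once before use (run in a shell):
#   python -m spacy download zh_core_web_sm
from langchain.text_splitter import SpacyTextSplitter

# The pipeline parameter selects which spaCy model performs sentence segmentation
text_splitter = SpacyTextSplitter(pipeline="zh_core_web_sm")
texts = text_splitter.split_text("这是第一句话。这是第二句话。")
for doc in texts:
    print(doc)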

2.3 Special format text chunking (HTML, Markdown)

In practical applications, it is often necessary to deal with text that has a special internal structure, such as HTML, Markdown, LaTeX, or code files. The structural information of these texts is crucial for understanding their content, and simple text splitting methods may destroy the original structure, resulting in the loss of contextual information.

  • Retention of structural information: When splitting text, preserve as much of its intrinsic structure as possible, such as tags, headings, code blocks, etc.

  • Reduce context loss: Avoid slicing text at key locations so that important contextual information is not lost.

  • Special text chunking methods provided by LangChain

LangChain provides users with chunking classes for a variety of special text formats, making it easy for users to handle different types of text.

Table: special text chunking classes provided by LangChain

Text format    Class name
Python         PythonCodeTextSplitter
Markdown       MarkdownTextSplitter
LaTeX          LatexTextSplitter
HTML           HTMLHeaderTextSplitter

Take the example of processing Markdown text:


from langchain.text_splitter import MarkdownTextSplitter

text_splitter = MarkdownTextSplitter()

text = "..." # pending Markdown copies
texts = text_splitter.split_text(text)

for doc in texts:
    print(doc)

These special text chunking classes preset a list of separators suitable for each text format and then call RecursiveCharacterTextSplitter for further cuts (the preset lists can also be inspected directly, as shown after the examples below). Examples:

  • PythonCodeTextSplitter: Set up separators that are appropriate for the structure of Python code, such as function definitions, class definitions, comments, and so on.

  • MarkdownTextSplitter: Slicing and dicing based on the structure of Markdown, such as headings, lists, paragraphs, and so on.

  • LatexTextSplitter: Identify sections, formulas, environments, etc. of LaTeX documents for slicing and dicing.

  • HTMLHeaderTextSplitter: Splits HTML documents according to their tag structure, by header/element level.
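
If you want to see exactly which separators a given format uses, RecursiveCharacterTextSplitter exposes the preset lists; a minimal sketch (the helper names follow recent LangChain releases and may differ in older versions):

from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

# Inspect the preset separator list for Markdown
print(RecursiveCharacterTextSplitter.get_separators_for_language(Language.MARKDOWN))

# Roughly equivalent to MarkdownTextSplitter: a recursive splitter pre-configured for Markdown
md_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.MARKDOWN, chunk_size=200, chunk_overlap=50
)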

  • Custom extensions

LangChain also predefines a list of splitters for other programming languages (e.g. Go, C++, Java), etc., making it easy for users to quickly define new text chunking classes. If you need to deal with text formats not provided, you can refer to the existing class implementation.

Customization example: create a text splitter class for Java code.

from langchain.text_splitter import RecursiveCharacterTextSplitter

class JavaCodeTextSplitter(RecursiveCharacterTextSplitter):
    def __init__(self, **kwargs):
        separators = [
            "\n\n",  # blank line
            "\n",    # line break
            ";",     # statement terminator
            " ",     # space
            ""       # no separator
        ]
        super().__init__(separators=separators, **kwargs)

text_splitter = JavaCodeTextSplitter(chunk_size=200, chunk_overlap=50)

text = "..." # pending Java coding
texts = text_splitter.split_text(text)

for doc in texts:
    print(doc)

By customizing the list of splitters and parameter settings, you can flexibly adapt to the needs of cutting text in different formats.

2.4 Semantic-based text chunking

  • Embedding-based (Translator's note: an embedding-based approach to data chunking, where data is mapped into a low-dimensional space to better capture its semantic information.)

  • Model-based (Translator's note: a model-based approach to data chunking that uses a pre-trained model for semantic chunking.)

  • LLM-based (Translator's note: a data chunking approach based on a Large Language Model (LLM), which uses the LLM to capture semantic information in text.)

2.4.1 Embedding-based methods

Both LlamaIndex [2] and LangChain [3] provide semantic chunkers based on embeddings (translator's note: tools or algorithms that chunk text or data according to semantic information). Their algorithms are roughly the same in concept; in this article, LlamaIndex is taken as the example.

Please note that the latest version of LlamaIndex needs to be installed in order to use its semantic chunker. My previously installed LlamaIndex, version 0.9.45, did not include this algorithm, so I created a new conda virtual environment and installed a newer version of LlamaIndex, 0.10.12:

pip install llama-index-core

pip install llama-index-readers-file

pip install llama-index-embeddings-openai

It is worth mentioning that LlamaIndex 0.10.12 lets you install only the components or modules you need, so in this article only a few key components are installed.

(llamaindex_010) Florian:~ Florian$ pip list | grep llama
llama-index-core              0.10.12
llama-index-embeddings-openai 0.1.6
llama-index-readers-file 0.1.5
llamaindex-py-client          0.1.13

The test code is shown below:

from llama_index.core.node_parser import (
    SentenceSplitter,
    SemanticSplitterNodeParser,
)
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import SimpleDirectoryReader


import os
["OPENAI_API_KEY"] = "YOUR_OPEN_AI_KEY"

# load documents
dir_path = "YOUR_DIR_PATH"
documents = SimpleDirectoryReader(dir_path).load_data()


embed_model = OpenAIEmbedding()
splitter = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model
)

nodes = splitter.get_nodes_from_documents(documents)
for node in nodes:
 print('-' * 100)
 print(node.get_content())

The internal implementation logic of the splitter.get_nodes_from_documents function [4] is as follows; its main flow is shown in Figure 2:

The "sentences" referred to in Figure 2 is a Python list, where each member is a dictionary containing four key-value pairs, with the following meanings for each key:

  • sentence: Current Sentence

  • index: Serial number of the current sentence

  • combined_sentence: Used to build a sliding window containing the three sentences [index - self.buffer_size, index, index + self.buffer_size] (by default, self.buffer_size = 1); it is used to compute the semantic correlation between sentences. Merging the current sentence with its preceding and following sentences reduces unnecessary interference and captures the correlation between consecutive sentences more effectively.

  • combined_sentence_embedding: Embedding vector for combined_sentence

It is evident from the above analysis that embedding-based semantic chunking (translator's note: dividing text or data into meaningful segments or chunks based on semantic information) essentially computes text similarity over a sliding window (combined_sentence): adjacent sentences whose similarity meets the threshold are grouped into the same semantic block.
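
The following is a minimal sketch of this sliding-window idea, written independently of LlamaIndex for illustration. Here embed_fn stands in for any sentence-embedding function (for example, OpenAIEmbedding().get_text_embedding), and the 95th-percentile breakpoint mirrors the breakpoint_percentile_threshold=95 setting used above; it is not LlamaIndex's actual implementation.

import numpy as np

def semantic_chunks(sentences, embed_fn, buffer_size=1, breakpoint_percentile=95):
    """Group adjacent sentences into chunks, splitting where the embedding distance
    between neighboring sliding windows is unusually high."""
    if len(sentences) < 2:
        return [" ".join(sentences)]
    # Build the sliding-window "combined_sentence" for each position
    combined = [
        " ".join(sentences[max(0, i - buffer_size): i + buffer_size + 1])
        for i in range(len(sentences))
    ]
    embs = np.asarray([embed_fn(c) for c in combined], dtype=float)
    embs /= np.linalg.norm(embs, axis=1, keepdims=True)
    # Cosine distance between consecutive windows; a large distance marks a semantic breakpoint
    distances = 1.0 - np.sum(embs[:-1] * embs[1:], axis=1)
    threshold = np.percentile(distances, breakpoint_percentile)
    chunks, start = [], 0
    for i, d in enumerate(distances):
        if d > threshold:
            chunks.append(" ".join(sentences[start:i + 1]))
            start = i + 1
    chunks.append(" ".join(sentences[start:]))
    return chunks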

There is only one BERT paper document [5] in the project path. Here are the results of the run:

(llamaindex_010) Florian:~ Florian$ python /Users/Florian/Documents/june_pdf_loader/test_semantic_chunk.py 
...
...
----------------------------------------------------------------------------------------------------
We argue that current techniques restrict the
power of the pre-trained representations, espe-
cially for the fine-tuning approaches. The ma-
jor limitation is that standard language models are
unidirectional, and this limits the choice of archi-
tectures that can be used during pre-training. For
example, in OpenAI GPT, the authors use a left-to-
right architecture, where every token can only at-
tend to previous tokens in the self-attention layers
of the Transformer (Vaswani et al., 2017). Such re-
strictions are sub-optimal for sentence-level tasks,
and could be very harmful when applying fine-
tuning based approaches to token-level tasks such
as question answering, where it is crucial to incor-
porate context from both directions.
In this paper, we improve the fine-tuning based
approaches by proposing BERT: Bidirectional
Encoder Representations from Transformers.
BERT alleviates the previously mentioned unidi-
rectionality constraint by using a “masked lan-
guage model” (MLM) pre-training objective, in-
spired by the Cloze task (Taylor, 1953). The
masked language model randomly masks some of
the tokens from the input, and the objective is to

predict the original vocabulary id of the
maskedarXiv:1810.04805v2  []  24 May 2019
----------------------------------------------------------------------------------------------------
word based only on its context. Unlike left-to-
right language model pre-training, the MLM ob-
jective enables the representation to fuse the left
and the right context, which allows us to pre-
train a deep bidirectional Transformer. In addi-
tion to the masked language model, we also use
a “next sentence prediction” task that jointly pre-
trains text-pair representations. The contributions
of our paper are as follows:
• We demonstrate the importance of bidirectional
pre-training for language representations. Un-
like Radford et al. (2018), which uses unidirec-
tional language models for pre-training, BERT
uses masked language models to enable pre-
trained deep bidirectional representations. This
is also in contrast to Peters et al. 
----------------------------------------------------------------------------------------------------
...
...

Test results show that using this data chunking method, the data chunks obtained are relatively coarse in granularity.

Figure 2 also shows that this approach is page-based (it chunks text by page rather than by smaller units such as sentences or paragraphs) and does not directly address the problem of chunking data that spans multiple pages.

Typically, the performance of embedding-based data chunking methods is heavily dependent on the embedding model. The actual effect needs to be further evaluated in the future.

2.4.2 Model-based approach

BERT-based Text Segmentation Method

To enable the BERT model to learn the relationship between two sentences, a binary classification task was designed into its pre-training: two sentences are fed into BERT simultaneously, and the model predicts whether the second sentence is the next sentence of the first. Based on this idea, the plainest text-splitting method can be designed, in which the smallest unit of splitting is the sentence.

Specifically, a sliding window is moved over the complete text, and each pair of neighboring sentences is fed into the BERT model for a binary prediction. If the prediction score is low, it indicates that the semantic relationship between the two sentences is weak, and that point can be used as a cut-off point for the text (a sketch follows the limitation list below). However, this method has some limitations:

  • Limited contextual information: When determining the cut point, only the two neighboring sentences before and after were considered, and textual information from longer distances was not utilized, which may lead to inaccurate cut points.

  • Computational inefficiency: The need to predict each pair of neighboring sentences in the text is computationally intensive and difficult to cope with the demands of large-scale text processing.
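
The following is a minimal sketch of this naive NSP-based approach using Hugging Face's BertForNextSentencePrediction; the model name and the 0.5 threshold are illustrative assumptions, not values taken from any paper.

import torch
from transformers import BertForNextSentencePrediction, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForNextSentencePrediction.from_pretrained("bert-base-chinese")
model.eval()

def cut_points(sentences, threshold=0.5):
    """Return indices i where a cut is suggested between sentences[i] and sentences[i + 1]."""
    cuts = []
    for i in range(len(sentences) - 1):
        inputs = tokenizer(sentences[i], sentences[i + 1], return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits        # index 0 = "B follows A", index 1 = "B is random"
        p_next = torch.softmax(logits, dim=-1)[0, 0].item()
        if p_next < threshold:                     # weak continuity -> candidate cut point
            cuts.append(i)
    return cuts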

Cross-Segment Modeling: Introducing Longer Contextual Dependencies

To address the shortcomings of the above method, Lukasik et al., in the paper "Text Segmentation by Cross Segment Attention", proposed the Cross-Segment model. The core of the model is to leverage longer contextual information while improving prediction efficiency.

In the cross-segment BERT model (a), the local context around a potential text segmentation point is fed into the model: k tokens on the left and k tokens on the right. In the BERT+Bi-LSTM model (b), the BERT model first encodes each sentence, and the sentence representations are then fed into a Bi-LSTM. In the hierarchical BERT model (c), each sentence is first encoded with BERT, and the output sentence representations are then fed into another Transformer-based model. Source: Text Segmentation by Cross Segment Attention [6].

The cross-segment BERT model in (a) defines text segmentation as a sentence-by-sentence classification task, determining for each sentence whether it is a text segmentation point. The local context near a potential segmentation point (k tokens on each side) is fed into the model, and the hidden state corresponding to [CLS] is passed to a softmax classifier, which decides whether to segment at the candidate sentence break.

Main Steps:

  1. sentence vectorization: The BERT model is utilized to obtain the vector representation of each sentence separately, preserving the semantic information of the sentence.

  2. Cross-paragraph prediction: The vector representations of multiple consecutive sentences are fed into another BERT or LSTM model, which predicts at once whether each sentence is a boundary of a text segment.

The advantage of this method is:

  • Integration of context: By processing multiple sentences at the same time, the model is able to capture dependencies over longer distances and improve the accuracy of the cutoff.

  • increase efficiency: Batching multiple sentences can improve computational efficiency compared to predicting neighboring sentences one by one.

SeqModel Model: Adaptive Sliding Window and Context Modeling

Although the Cross-Segment model introduces longer context, each sentence is still vectorized independently, without adequately modeling the complex dependencies between sentences. For this reason, Zhang et al., in the paper "Sequence Model with Self-Adaptive Sliding Window for Efficient Spoken Document Segmentation", proposed the SeqModel, which further improves the text splitting method.

SeqModel features:

  • Encoding multiple sentences at the same time: BERT encodes multiple consecutive sentences jointly, directly modeling the dependencies between sentences and obtaining sentence representations that contain contextual information.

  • Predicting the cut-off boundary: After obtaining the contextual representation of the sentences, the model predicts whether text segmentation is needed after each sentence.

  • Adaptive sliding window: An adaptive sliding window is introduced to dynamically resize the window according to the text content, speeding up inference without sacrificing accuracy.

This approach not only improves the accuracy of the cuts, but also makes the model more efficient in processing long texts.

Application and Implementation of the SeqModel Model

It is worth noting that pre-trained weights for SeqModel are already available on ModelScope, with support for Chinese text processing. This makes it convenient for developers to apply SeqModel in real projects.

Example of usage

from modelscope.outputs import OutputKeys
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Initialize the pipeline for the text segmentation task
p = pipeline(task=Tasks.document_segmentation, model='damo/nlp_bert_document-segmentation_chinese-base')

# Enter the long text to be segmented
documents = 'Here you enter the content of the long text you need to segment'

# Execute text segmentation
result = p(documents=documents)

# Output the result of the split text
print(result[OutputKeys.TEXT])

SeqModel is available through the ModelScope[10] framework. The code is as follows:

from modelscope.outputs import OutputKeys
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

p = pipeline(
    task = Tasks.document_segmentation,
    model = 'damo/nlp_bert_document-segmentation_english-base'
)

print('-' * 100)

result = p(documents='We demonstrate the importance of bidirectional pre-training for language representations. Unlike Radford et al. (2018), which uses unidirectional language models for pre-training, BERT uses masked language models to enable pretrained deep bidirectional representations. This is also in contrast to Peters et al. (2018a), which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs. • We show that pre-trained representations reduce the need for many heavily-engineered taskspecific architectures. BERT is the first finetuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many task-specific architectures. Today is a good day')

print(result[OutputKeys.TEXT])

The sentence "Today is a good day" is added to the end of the test data, but the result variable after the text splitting process does not split "Today is a good day" in any way.

We demonstrate the importance of bidirectional pre-training for language representations. Unlike Radford et al. (2018), which uses unidirectional language models for pre-training, BERT uses masked language models to enable pretrained deep bidirectional representations. This is also in contrast to Peters et al. (2018a), which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs. • We show that pre-trained representations reduce the need for many heavily-engineered taskspecific architectures. BERT is the first finetuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many task-specific architectures. Today is a good day

There is still much room for improvement in the above methods. One suggested improvement is to create training data customized for a specific project or task and use it for domain fine-tuning, which would improve model performance. Optimizing the model architecture is another possible improvement.

Deep learning model-based text slicing methods utilize the deep understanding of language from pre-trained models, which have been continuously improved from the original plain approach to Cross-Segment and then to the SeqModel model:

  • plain method: Using sentences as the smallest unit, BERT is utilized to predict the connection of neighboring sentences, but the contextual consideration is limited and less efficient.

  • Cross-Segment Model: Introducing longer contexts and batch prediction of cut-points improves accuracy and efficiency.

  • SeqModel Model: Encoding multiple sentences at the same time, modeling inter-sentence dependencies, and using adaptive sliding windows to further improve performance.

The evolution of these methods reflects the researchers' unremitting exploration and innovation in the field of text slicing. By choosing appropriate models and methods, text can be sliced more accurately to improve the effectiveness of downstream tasks and meet the needs of diverse practical applications.

2.5 LLM-based approach

The paper "Dense X Retrieval: What Retrieval Granularity Should We Use?" introduces a new retrieval unit called the proposition. A proposition is defined as an atomic expression in the text (translator's note: an individual semantic element that cannot be further decomposed, which can be combined into larger semantic units): it expresses a unique fact or specific concept from the text in a concise, self-contained piece of natural language, presenting a single concept or fact in its entirety without requiring additional information to interpret it.

So how are these propositions obtained? In the paper, they are obtained by constructing prompts and interacting with an LLM.

Both LlamaIndex and Langchain implement algorithms for this purpose, as demonstrated below using LlamaIndex.

LlamaIndex's implementation follows the paper's idea: it generates propositions using the prompt provided in the paper:

PROPOSITIONS_PROMPT = PromptTemplate(
    """Decompose the "Content" into clear and simple propositions, ensuring they are interpretable out of
context.
1. Split compound sentence into simple sentences. Maintain the original phrasing from the input
whenever possible.
2. For any named entity that is accompanied by additional descriptive information, separate this
information into its own distinct proposition.
3. Decontextualize the proposition by adding necessary modifier to nouns or entire sentences
and replacing pronouns (e.g., "it", "he", "she", "they", "this", "that") with the full name of the
entities they refer to.
4. Present the results as a list of strings, formatted in JSON.

Input: Title: Ēostre. Section: Theories and interpretations, Connection to Easter Hares. Content:
The earliest evidence for the Easter Hare (Osterhase) was recorded in south-west Germany in
1678 by the professor of medicine Georg Franck von Franckenau, but it remained unknown in
other parts of Germany until the 18th century. Scholar Richard Sermon writes that "hares were
frequently seen in gardens in spring, and thus may have served as a convenient explanation for the
origin of the colored eggs hidden there for children. Alternatively, there is a European tradition
that hares laid eggs, since a hare’s scratch or form and a lapwing’s nest look very similar, and
both occur on grassland and are first seen in the spring. In the nineteenth century the influence
of Easter cards, toys, and books was to make the Easter Hare/Rabbit popular throughout Europe.
German immigrants then exported the custom to Britain and America where it evolved into the
Easter Bunny."
Output: [ "The earliest evidence for the Easter Hare was recorded in south-west Germany in
1678 by Georg Franck von Franckenau.", "Georg Franck von Franckenau was a professor of
medicine.", "The evidence for the Easter Hare remained unknown in other parts of Germany until
the 18th century.", "Richard Sermon was a scholar.", "Richard Sermon writes a hypothesis about
the possible explanation for the connection between hares and the tradition during Easter", "Hares
were frequently seen in gardens in spring.", "Hares may have served as a convenient explanation
for the origin of the colored eggs hidden in gardens for children.", "There is a European tradition
that hares laid eggs.", "A hare’s scratch or form and a lapwing’s nest look very similar.", "Both
hares and lapwing’s nests occur on grassland and are first seen in the spring.", "In the nineteenth
century the influence of Easter cards, toys, and books was to make the Easter Hare/Rabbit popular
throughout Europe.", "German immigrants exported the custom of the Easter Hare/Rabbit to
Britain and America.", "The custom of the Easter Hare/Rabbit evolved into the Easter Bunny in
Britain and America." ]

Input: {node_text}
Output:"""
)

The key components of LlamaIndex 0.10.12 were already installed in the section on the embedding-based approach (2.4.1). However, to use DenseXRetrievalPack you also need to run pip install llama-index-llms-openai. Once installed, the current LlamaIndex-related components are shown below:

(llamaindex_010) Florian:~ Florian$ pip list | grep llama
llama-index-core                    0.10.12
llama-index-embeddings-openai       0.1.6
llama-index-llms-openai             0.1.6
llama-index-readers-file            0.1.5
llamaindex-py-client                0.1.13

In LlamaIndex, DenseXRetrievalPack is a package that needs to be downloaded separately. Here it is downloaded directly in the test code. The test code is as follows:

from llama_index.core import SimpleDirectoryReader
from llama_index.core.llama_pack import download_llama_pack

import os
["OPENAI_API_KEY"] = "YOUR_OPENAI_KEY"

# Download and install dependencies
DenseXRetrievalPack = download_llama_pack(
    "DenseXRetrievalPack", "./dense_pack"
)

# If you have already downloaded DenseXRetrievalPack, you can import it directly.
# from llama_index.packs.dense_x_retrieval import DenseXRetrievalPack

# Load documents
dir_path = "YOUR_DIR_PATH"
documents = SimpleDirectoryReader(dir_path).load_data()

# Use LLM to extract propositions from every document/node
dense_pack = DenseXRetrievalPack(documents)

response = dense_pack.run("YOUR_QUERY")

This test code mainly uses the constructor of the DenseXRetrievalPack class. Therefore, it is necessary to analyze the source code of the DenseXRetrievalPack class [11].

class DenseXRetrievalPack(BaseLlamaPack):
    def __init__(
        self,
        documents: List[Document],
        proposition_llm: Optional[LLM] = None,
        query_llm: Optional[LLM] = None,
        embed_model: Optional[BaseEmbedding] = None,
        text_splitter: TextSplitter = SentenceSplitter(),
        similarity_top_k: int = 4,
    ) -> None:
        """Init params."""
        self._proposition_llm = proposition_llm or OpenAI(
            model="gpt-3.5-turbo",
            temperature=0.1,
            max_tokens=750,
        )

        embed_model = embed_model or OpenAIEmbedding(embed_batch_size=128)

        nodes = text_splitter.get_nodes_from_documents(documents)
        sub_nodes = self._gen_propositions(nodes)

        all_nodes = nodes + sub_nodes
        all_nodes_dict = {n.node_id: n for n in all_nodes}

        service_context = ServiceContext.from_defaults(
            llm=query_llm or OpenAI(),
            embed_model=embed_model,
            num_output=self._proposition_llm.metadata.num_output,
        )

        self.vector_index = VectorStoreIndex(
            all_nodes, service_context=service_context, show_progress=True
        )

        self.retriever = RecursiveRetriever(
            "vector",
            retriever_dict={
                "vector": self.vector_index.as_retriever(
                    similarity_top_k=similarity_top_k
                )
            },
            node_dict=all_nodes_dict,
        )

        self.query_engine = RetrieverQueryEngine.from_args(
            self.retriever, service_context=service_context
        )

As shown in the code, the idea of this constructor is to first use text_splitter to split the document into nodes (translator's note: the smallest units the document is split into according to its original format), and then call self._gen_propositions to generate propositions and obtain the corresponding sub_nodes (translator's note: the document fragments corresponding to the propositions generated from each node). A VectorStoreIndex is then built over nodes + sub_nodes and queried through a RecursiveRetriever. A recursive retriever can work with the small chunks of a document to find the information it needs, just as one might go directly to a subsection or paragraph in a book; if complete information cannot be found in those small chunks, the recursive retriever passes the associated larger chunk to the generation stage for further processing, just as one would turn to the whole chapter or the whole book when more information is needed.

There is only one BERT paper document in the project path. Through debugging, I found that the text content of the sub_nodes is not the original text; it has been rewritten:

> /Users/Florian/anaconda3/envs/llamaindex_010/lib/python3.11/site-packages/llama_index/packs/dense_x_retrieval/(91)__init__()
     90 
---> 91         all_nodes = nodes + sub_nodes
     92         all_nodes_dict = {n.node_id: n for n in all_nodes}


ipdb> sub_nodes[20]
IndexNode(id_='ecf310c7-76c8-487a-99f3-f78b273e00d9', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Our paper demonstrates the importance of bidirectional pre-training for language representations.', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n', index_id='8deca706-fe97-412c-a13f-950a19a594d1', obj=None)
ipdb> sub_nodes[21]
IndexNode(id_='4911332e-8e30-47d8-a5bc-ed7cbaa8e042', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Radford et al. (2018) uses unidirectional language models for pre-training.', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n', index_id='8deca706-fe97-412c-a13f-950a19a594d1', obj=None)
ipdb> sub_nodes[22]
IndexNode(id_='83aa82f8-384a-4b06-92c8-d6277c4162bf', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='BERT uses masked language models to enable pre-trained deep bidirectional representations.', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n', index_id='8deca706-fe97-412c-a13f-950a19a594d1', obj=None)
ipdb> sub_nodes[23]
IndexNode(id_='2ac635c2-ccb0-4e62-88c7-bcbaef3ef38a', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Peters et al. (2018a) uses a shallow concatenation of independently trained left-to-right and right-to-left LMs.', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n', index_id='8deca706-fe97-412c-a13f-950a19a594d1', obj=None)
ipdb> sub_nodes[24]
IndexNode(id_='e37b17cf-30dd-4114-a3c5-9921b8cf0a77', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Pre-trained representations reduce the need for many heavily-engineered task-specific architectures.', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n', index_id='8deca706-fe97-412c-a13f-950a19a594d1', obj=None)

The relationship between sub_nodes and nodes is shown in Figure 7, and this index structure is sorted and organized in a small-to-big fashion.

Figure 7: Index structure organized in a small-to-big fashion.

This index structure is constructed using self._gen_propositions[12] with the following code:

    async def _aget_proposition(self, node: TextNode) -> List[TextNode]:
        """Get proposition."""
        inital_output = await self._proposition_llm.apredict(
            PROPOSITIONS_PROMPT, node_text=node.text
        )
        outputs = inital_output.split("\n")

        all_propositions = []

        for output in outputs:
            if not output.strip():
                continue
            if not output.strip().endswith("]"):
                if not output.strip().endswith('"') and not output.strip().endswith(
                    ","
                ):
                    output = output + '"'
                output = output + " ]"
            if not output.strip().startswith("["):
                if not output.strip().startswith('"'):
                    output = '"' + output
                output = "[ " + output

            try:
                propositions = json.loads(output)
            except Exception:
                # fallback to yaml
                try:
                    propositions = yaml.safe_load(output)
                except Exception:
                    # fallback to next output
                    continue

            if not isinstance(propositions, list):
                continue

            all_propositions.extend(propositions)

        assert isinstance(all_propositions, list)
        nodes = [TextNode(text=prop) for prop in all_propositions if prop]

        return [IndexNode.from_text_node(n, node.node_id) for n in nodes]

    def _gen_propositions(self, nodes: List[TextNode]) -> List[TextNode]:
        """Get propositions."""
        sub_nodes = asyncio.run(
            run_jobs(
                [self._aget_proposition(node) for node in nodes],
                show_progress=True,
                workers=8,
            )
        )

        # Flatten list
        return [node for sub_node in sub_nodes for node in sub_node]

For each original node, call self._aget_proposition asynchronously to get the inital_output returned by LLM via PROPOSITIONS_PROMPT, then get the propositions based on the inital_output and build the TextNode. Finally, associate these TextNodes with the original node using [IndexNode.from_text_node(n, node.node_id) for n in nodes].

One more thing worth mentioning: the original paper fine-tuned a text generation model using LLM-generated propositions as training data, and this model is now publicly accessible [13]. In general, this approach to chunking, which uses an LLM to construct propositions, achieves finer-grained chunking and forms a small-to-big index structure together with the original nodes, providing a new way of thinking about semantic chunking.

3. Optimization strategies for text chunking

3.1 Maintaining semantic integrity

3.1.1 Avoiding sentence splitting

Splitting sentences should be avoided as much as possible during text splitting. Sentences are the basic unit for expressing complete semantics; splitting them may fragment the meaning, affecting the accuracy of the vectorized representation and the model's understanding of the text. For example, subject-verb-object structures or modifier relations within a sentence lose their original meaning when truncated, making it difficult for the model to accurately capture the core content of the text.

Practice Recommendations

  • Split at punctuation: Use punctuation marks such as periods, question marks, and exclamation points as cut-off points to ensure that each block of text contains complete sentences.

  • Prioritization of separators: When configuring separators, place sentence-ending symbols at high priority so that sentence boundaries are respected first during splitting, as in the example below.
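
With RecursiveCharacterTextSplitter, this simply means placing sentence-ending punctuation ahead of weaker separators. The separator list below is one possible ordering for Chinese text, not a fixed recommendation:

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=50,
    # Paragraph breaks first, then sentence-ending punctuation, then weaker separators
    separators=["\n\n", "\n", "。", "！", "？", "；", "，", " ", ""]
)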

3.1.2 Consideration of paragraph relevance

Beyond the sentence level, paragraphs are also important units for expressing complete ideas. Sentences within a paragraph usually center on one theme and are closely logically related. Splitting closely related paragraph content into different text blocks may fragment the context and affect the model's grasp of the overall semantics.

Practice Recommendations

  • Keep paragraphs intact: Try to keep the content of the same paragraph in the same text block to avoid semantic breaks caused by splitting.

  • Combined theme segmentation: For long passages, cuts can be made based on themes or semantic turning points to ensure thematic unity within each block of text.

3.2 Controlling the length of text blocks

3.2.1 Setting reasonable length thresholds

The length of text blocks has a direct impact on the processing performance of vectorized models and large language models (LLMs). Excessively long text blocks may lead to:

  • Diluted vector representation: important semantic information gets drowned out, reducing retrieval accuracy.

  • Model input limits: the block exceeds the input length of the large model, so the full content cannot be processed.

Blocks of text that are too short may lack context, resulting in incomplete semantics.

Practice Recommendations

  • Evaluating model performance: Set appropriate text block length thresholds according to the processing effect of vectorized and large models on different lengths of text.

  • Common length reference: Normally, the length of the text block can be set to 200 to 500 characters, which can be adjusted on a case-by-case basis.

3.2.2 Dynamic adjustment

Different types of text may have different requirements for text block length. Flexible adjustment of the length threshold can better accommodate diverse text content.

Practice Recommendations

  • Text type analysis: For different types of texts such as news, legal documents, technical manuals, etc., analyze their structural characteristics and set the appropriate length.

  • Adaptive slice: Develop a dynamic adjustment mechanism to adjust the length of text blocks in real time according to the text content and structure to achieve personalized processing.

3.3 Overlapping cuts

3.3.1 Methodology

Overlapping splitting refers to introducing a certain amount of overlap between blocks of text so that neighboring blocks share some content. Specifically:

  • Set the overlap length: for example, keep a 50-character overlap between neighboring blocks of text.

  • Sliding-window splitting: Slice the text using a sliding window whose step size is smaller than the window size (see the sketch below).
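
A minimal sketch of character-level sliding-window chunking (the window and overlap sizes are arbitrary examples):

def sliding_window_chunks(text, window=200, overlap=50):
    """Split text into fixed-size windows where adjacent chunks share `overlap` characters."""
    step = window - overlap
    return [text[i:i + window] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = sliding_window_chunks("很长的文档内容……" * 100, window=200, overlap=50)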

3.3.2 Advantages

  • Preserving Contextual Connections: The overlapping part makes it possible to preserve the contextual information between neighboring text blocks and avoid semantic breaks caused by cuts.

  • Enhanced model understanding: The large model can refer to both the preceding and following context when generating responses, producing more coherent and accurate results.

Practice Recommendations

  • Rationalize the amount of overlap: Set appropriate overlap lengths based on text characteristics and modeling needs to preserve context without adding too much redundant information.

  • Efficiency considerations: Note that overlapping cuts increase the number of text blocks, and there is a trade-off between processing efficiency and context preservation.

3.4 Combining vectorized model performance

3.4.1 Adaptation Model Characterization

Different vectorization models (e.g., BERT, GPT, Sentence Transformers) perform differently in processing text length and semantic information.

Practice Recommendations

  • Model characterization: To gain insight into the characteristics of the vectorization model used and to evaluate its performance under different text lengths.

  • Choosing the right strategy: Depending on the strengths of the model, choose a chunking strategy for short or long texts. For example, if the model works better for vector representation of short texts, shorter text chunks should be favored.

3.4.2 Optimized Vector Representation

For long texts, which may suffer from the semantic dilution problem, a more advanced vector representation can be used.

Practice Recommendations

  • weighted average: Assign higher weights to important words in the text (e.g., keywords, proper nouns) to enhance their impact in the vector representation.

  • attention mechanism: Focus on key parts of the text using the Attention mechanism (Attention) to generate more representative vectors.

  • hierarchical coding: Encode the text hierarchically, encoding sentences first and then paragraphs, combining them layer by layer to maintain the semantic structure (a sketch follows).
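
A minimal sketch of the hierarchical idea using sentence-transformers (the model name is just an example): sentences are encoded first and then combined into a paragraph-level vector.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # example model

def paragraph_vector(sentences):
    """Encode each sentence, then combine them into a paragraph representation."""
    sent_vecs = model.encode(sentences)      # shape: (num_sentences, dim)
    return np.mean(sent_vecs, axis=0)        # simple average; a weighted average is also possible

vec = paragraph_vector(["第一句。", "第二句。", "第三句。"])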

3.5 Considering Input Constraints for Large Models

3.5.1 Input length control

Large language models have strict limits on the length of the input text (e.g., the maximum input length for GPT-3 is 2048 tokens). In the recall phase, it is necessary to ensure that the total length of the selected text blocks does not exceed the model's limit.

Practice Recommendations

  • Counting the length of text blocks: Calculate the total length of a text block before entering it into the model to ensure that it does not exceed the model limits.

  • truncation strategy: If the length exceeds the limit, consider truncating low relevance content and retaining the core block of text.

3.5.2 Prioritization

Different chunks of text contribute differently to the answer and should be sorted according to their relevance to the query, prioritizing the most important inputs.

Practice Recommendations

  • relevance score: Calculate the relevance score to the query for each recalled text block.

  • Select by priority: Based on the relevance score, select text blocks from high to low until the model's input length limit is reached (a sketch of this selection follows).
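
A minimal sketch of this selection step, assuming tiktoken for token counting and chunks that already carry a relevance score; the budget and encoding name are illustrative.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # example tokenizer; match it to your model

def select_chunks(scored_chunks, token_budget=2048):
    """scored_chunks: list of (text, relevance_score). Pick by score until the budget is full."""
    selected, used = [], 0
    for text, _score in sorted(scored_chunks, key=lambda x: x[1], reverse=True):
        n_tokens = len(enc.encode(text))
        if used + n_tokens > token_budget:
            continue                          # skip chunks that would overflow the budget
        selected.append(text)
        used += n_tokens
    return selected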

3.5.3 Content refinement

When important blocks of text are too long, they can be compressed or summarized to ensure that the core information is retained.

Practice Recommendations

  • Automatic summarization: Generate condensed summaries of long text blocks using a summarization model.

  • Key sentence extraction: Extract the sentences in the text block that best represent the core content for the larger model.

Optimizing the text chunking strategy requires combining semantic integrity, text block length, vectorization model performance, input constraints of large models, and other factors. By avoiding sentence splitting, maintaining paragraph coherence, introducing overlapping splits, adapting to model characteristics, and reasonably controlling input length, the retrieval effectiveness of the RAG system and the quality of large-model answers can be effectively improved.

4. Recommendations for cutting-edge methodologies

4.1 Meta-Chunking (2024.8.16)

Meta-Chunking aims to optimize text chunking in Retrieval-Augmented Generation (RAG) to improve performance on knowledge-intensive tasks. The study introduces a new level of granularity between sentences and paragraphs, the meta-chunk, which consists of multiple sentences within a paragraph that have deep linguistic and logical connections. To realize this concept, the researchers devised a perplexity-based (PPL) segmentation method that balances performance and speed and accurately determines the boundaries of text chunks. Considering the varying complexity of different texts, the study also proposes a strategy combining PPL segmentation with dynamic merging to strike a balance between fine-grained and coarse-grained segmentation. Experiments on 11 datasets show that Meta-Chunking can effectively improve the performance of RAG-based single-hop and multi-hop question answering. In addition, PPL segmentation is found to exhibit significant flexibility and adaptability across models of different sizes and types.

  • Paper: Meta-Chunking: Learning Efficient Text Segmentation via Logical Perception

  • github:/IAAR-Shanghai/Meta-Chunking/tree/386dc29b9cfe87da691fd4b0bd4ba7c352f8e4ed

Meta-Chunking is based on a core principle: allow block sizes to vary in order to more effectively capture and maintain the logical integrity of the content. This dynamic granularity adjustment ensures that each segmented block contains a complete, independent expression of ideas, avoiding breaks in the logical chain during segmentation. This not only enhances the relevance of document retrieval but also improves content clarity.

As the example in the paper shows, the example sentences are logically connected (a progressive relationship), yet their semantic similarity is low, which could cause the examples to be split apart entirely.

  • The first strategy, Margin Sampling Chunking, essentially asks the LLM to perform binary classification. Since the model's output is a probability distribution over the vocabulary, the probability difference between "yes" and "no" is computed and compared against a threshold to decide whether to split.
  • The second strategy, Perplexity Chunking, computes the perplexity of each sentence given its context (a high perplexity means the model is confused by that text, so splitting there is not recommended). Each time a sentence whose perplexity is a local minimum is found, i.e. the sentences immediately before and after it both have higher perplexity, a boundary can be placed there. The perplexity computation can use a fixed-length KV cache to keep GPU memory usage bounded.

Margin Sampling Chunking:

  1. Split the text into a series of sentences.
  2. For neighboring sentences, binary classification is performed using LLM to determine whether segmentation is required.
  3. LLM outputs the probabilities of the two options and calculates the probability difference Margin.
  4. Compare the margin with a preset threshold: if the margin exceeds the threshold, place a boundary between the two sentences; otherwise keep them in the same chunk.
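
A minimal sketch of this margin computation, assuming a small instruct model from Hugging Face and that " yes"/" no" each begin with a single token; the prompt wording is illustrative, not the paper's exact template:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any small causal LM works for this illustration; the model name is a placeholder choice.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
lm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

def split_margin(sent_a: str, sent_b: str) -> float:
    """Probability margin P(' yes') - P(' no') for the question: should a chunk
    boundary be placed between these two sentences?"""
    prompt = (
        "Should the following two sentences be placed in separate chunks? "
        f"Answer yes or no.\nSentence 1: {sent_a}\nSentence 2: {sent_b}\nAnswer:"
    )
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits[0, -1]            # next-token distribution
    probs = torch.softmax(logits, dim=-1)
    yes_id = tok(" yes", add_special_tokens=False).input_ids[0]
    no_id = tok(" no", add_special_tokens=False).input_ids[0]
    return (probs[yes_id] - probs[no_id]).item()

# Split when the margin exceeds a tuned threshold, e.g. 0.0.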

Perplexity Chunking:

  1. Split the text into a series of sentences.
  2. Use LLM to calculate the PPL value for each sentence based on its context.
  3. Analyze the distribution characteristics of PPL values and identify potential text block boundaries (i.e., local minima of PPL values).
  4. Split the sentence into multiple text blocks, each containing one or more consecutive sentences.
  • Advantages and disadvantages of the two strategies:

    • Margin Sampling Chunking:

      • Advantages: effectively reduces the required model size, so that even small models can handle the chunking task.
      • Disadvantages: the segmentation result depends on the LLM used, and the method is relatively inefficient.
    • Perplexity Chunking:

      • Advantages: the segmentation result is more objective and more efficient, and it captures the logical structure of the text well.
      • Disadvantages: the distribution of PPL values has to be analyzed, which requires a certain amount of computation.
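
A minimal sketch of the Perplexity Chunking rule described above, assuming the same small model serves as the scorer and omitting the fixed-length KV-cache optimization mentioned in the paper:

import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# A small causal LM as the perplexity scorer (placeholder choice).
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
lm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

def sentence_ppl(context: str, sentence: str) -> float:
    """Perplexity of `sentence` given the preceding `context` under the causal LM."""
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1] if context else 0
    text = (context + " " + sentence) if context else sentence
    full = tok(text, return_tensors="pt").input_ids
    labels = full.clone()
    labels[:, :ctx_len] = -100                  # score only the sentence's own tokens
    with torch.no_grad():
        loss = lm(full, labels=labels).loss     # mean negative log-likelihood
    return math.exp(loss.item())

def ppl_chunk(sentences: list) -> list:
    """Cut after sentence i whenever its PPL is a local minimum of the PPL curve."""
    ppls = [sentence_ppl(" ".join(sentences[:i]), s) for i, s in enumerate(sentences)]
    cut_after = [i for i in range(1, len(ppls) - 1)
                 if ppls[i] < ppls[i - 1] and ppls[i] < ppls[i + 1]]
    chunks, start = [], 0
    for b in cut_after:
        chunks.append(" ".join(sentences[start:b + 1]))
        start = b + 1
    chunks.append(" ".join(sentences[start:]))
    return chunks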

4.2 Late Chunking (Jina, 2024.8.4)

  • Paper: Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models: /abs/2409.04701
  • github:/jina-ai/late-chunking
  • jina blog:/news/late-chunking-in-long-context-embedding-models/
  • Project Link:/drive/15vNZb6AsU7byjYoaEtXuNu567JWNzXOz?usp=sharing

Imagine that you are organizing a large pile of documents. The traditional method is to divide them into small piles and then process each pile separately. Late chunking does the opposite: it processes the entire document first and only then divides it up. This may sound counterintuitive, but in the world of AI this approach has shown surprisingly good results. Specifically, late chunking works like this:

  1. First "read" the entire document using a long context embedding model (e.g., jina-embeddings-v2-small-en, with 8K context length)
  2. And then we cut this "understanding" into small pieces.

Late chunking is now available in the jina-embeddings-v3 API, which supports inputs of up to 8192 tokens.

Traditional methods pre-segment the text by sentence, paragraph, or a maximum length limit, and then apply the embedding model to each resulting chunk; to produce one embedding per chunk, many embedding models apply mean pooling over the chunk's token-level embeddings to output a single vector.
The late chunking approach instead first applies the transformer layers of the embedding model to the entire text, or to as much of it as possible. This yields a sequence of token-level vector representations, each of which carries information about the whole text. Mean pooling is then applied to each span of this token-vector sequence, producing chunk embeddings that take the context of the entire text into account. Unlike the naive approach, which encodes each chunk independently, late chunking creates a set of chunk embeddings that are each conditioned on the surrounding text, so every chunk encodes more contextual information.

To address this problem, the long input sequences that recent embedding models such as jina-embeddings-v2-base-en can handle are exploited. These models support much longer inputs, e.g. 8192 tokens, or roughly ten pages of standard text; text segments of that size are unlikely to have contextual dependencies that can only be resolved with an even larger context. However, smaller vector representations of text blocks are still needed, partly because of the limited input size of LLMs, but mainly because of the limited information capacity of short embedding vectors.

The naive encoding approach (shown on the left of the figure in the blog post) chunks the text before processing it, segmenting it a priori by sentence, paragraph, or maximum length constraints, and then applies the embedding model to the resulting chunks; to produce individual embeddings, many embedding models mean-pool over the token representations of each chunk. Late chunking, in contrast, first applies the transformer portion of the embedding model to the entire text (or to as large a portion as possible), producing a sequence of token vectors that each contain information from the whole text, and then applies mean pooling to smaller spans of this token-vector sequence, yielding chunk embeddings that consider the entire text.

Code section:

#!pip install transformers==4.43.4
#Load the model to use for embedding. jinaai/jina-embeddings-v2-base-en is chosen here, but any other model that supports mean pooling works; models with a larger maximum context length are preferred.
from transformers import AutoModel
from transformers import AutoTokenizer

#load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)

#Define the text to encode and split it into chunks. The chunk_by_sentences function also returns span annotations, which specify the token span of each chunk, needed for the chunked pooling.
def chunk_by_sentences(input_text: str, tokenizer: callable):
    """
    Split the input text into sentences using the tokenizer
    :param input_text: The text snippet to split into sentences
    :param tokenizer: The tokenizer to use
    :return: A tuple containing the list of text chunks and their corresponding token spans
    """
    inputs = tokenizer(input_text, return_tensors='pt', return_offsets_mapping=True)
    punctuation_mark_id = tokenizer.convert_tokens_to_ids('.')
    sep_id = tokenizer.convert_tokens_to_ids('[SEP]')
    token_offsets = inputs['offset_mapping'][0]
    token_ids = inputs['input_ids'][0]
    chunk_positions = [
        (i, int(start + 1))
        for i, (token_id, (start, end)) in enumerate(zip(token_ids, token_offsets))
        if token_id == punctuation_mark_id
        and (
            token_offsets[i + 1][0] - token_offsets[i][1] > 0
            or token_ids[i + 1] == sep_id
        )
    ]
    chunks = [
        input_text[x[1] : y[1]]
        for x, y in zip([(1, 0)] + chunk_positions[:-1], chunk_positions)
    ]
    span_annotations = [
        (x[0], y[0]) for (x, y) in zip([(1, 0)] + chunk_positions[:-1], chunk_positions)
    ]
    return chunks, span_annotations

#In production, you should use a more advanced and robust segmentation method, for example the Jina AI Tokenizer API (/tokenizer#apiform) or Segmenter API (/segmenter/#apiform).

import requests

def chunk_by_tokenizer_api(input_text: str, tokenizer: callable):
    #Define the API endpoint and payload
    url = '/'
    payload = {
        "content": input_text,
        "return_chunks": "true",
        "max_chunk_length": "1000"
    }

    #Make the API request
    response = requests.post(url, json=payload)
    response_data = response.json()

    #Extract chunks and positions from the response
    chunks = response_data.get("chunks", [])
    chunk_positions = response_data.get("chunk_positions", [])

    #Adjust chunk positions to match the input format
    span_annotations = [(start, end) for start, end in chunk_positions]

    return chunks, span_annotations


input_text = "Berlin is the capital and largest city of Germany, both by area and by population. Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits. The city is also one of the states of Germany, and is the third smallest state in the country in terms of area."

#determine chunks
chunks, span_annotations = chunk_by_sentences(input_text, tokenizer)
print('Chunks:\n- "' + '"\n- "'.join(chunks) + '"')


Chunks:
- "Berlin is the capital and largest city of Germany, both by area and by population."
- " Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits."
- " The city is also one of the states of Germany, and is the third smallest state in the country in terms of area."

#Encode the chunks with both the traditional approach and the context-sensitive late_chunking method:

def late_chunking(
    model_output: 'BatchEncoding', span_annotation: list, max_length=None
):
    token_embeddings = model_output[0]
    outputs = []
    for embeddings, annotations in zip(token_embeddings, span_annotation):
        if (
            max_length is not None
        ):  # remove annotations which go beyond the max-length of the model
            annotations = [
                (start, min(end, max_length - 1))
                for (start, end) in annotations
                if start < (max_length - 1)
            ]
        pooled_embeddings = [
            embeddings[start:end].sum(dim=0) / (end - start)
            for start, end in annotations
            if (end - start) >= 1
        ]
        pooled_embeddings = [
            embedding.detach().cpu().numpy() for embedding in pooled_embeddings
        ]
        outputs.append(pooled_embeddings)

    return outputs


#traditional chunking: embed each chunk independently
embeddings_traditional_chunking = model.encode(chunks)

#late chunking (context-sensitive chunked pooling)
inputs = tokenizer(input_text, return_tensors='pt')
model_output = model(**inputs)
embeddings = late_chunking(model_output, [span_annotations])[0]

#Compare the similarity of the word "Berlin" with each chunk. The context-sensitive chunked pooling method should yield higher similarity.
import numpy as np

cos_sim = lambda x, y: np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

berlin_embedding = model.encode('Berlin')

for chunk, new_embedding, trad_embeddings in zip(chunks, embeddings, embeddings_traditional_chunking):
    print(f'similarity_new("Berlin", "{chunk}"):', cos_sim(berlin_embedding, new_embedding))
    print(f'similarity_trad("Berlin", "{chunk}"):', cos_sim(berlin_embedding, trad_embeddings))
    
similarity_new("Berlin", "Berlin is the capital and largest city of Germany, both by area and by population."): 0.849546
similarity_trad("Berlin", "Berlin is the capital and largest city of Germany, both by area and by population."): 0.84862185
similarity_new("Berlin", " Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits."): 0.82489026
similarity_trad("Berlin", " Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits."): 0.7084338
similarity_new("Berlin", " The city is also one of the states of Germany, and is the third smallest state in the country in terms of area."): 0.84980094
similarity_trad("Berlin", " The city is also one of the states of Germany, and is the third smallest state in the country in terms of area."): 0.7534553

| Text | Similarity (Traditional) | Similarity (Late Chunking) |
| --- | --- | --- |
| Berlin is the capital and largest city of Germany, both by area and by population. | 0.84862185 | 0.849546 |
| Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits. | 0.7084338 | 0.82489026 |
| The city is also one of the states of Germany, and is the third smallest state in the country in terms of area. | 0.7534553 | 0.84980094 |

To validate the effectiveness of this method beyond a few simple examples, it was tested on several retrieval benchmarks from BeIR. Each retrieval task consists of a query set, a corpus of text documents, and a QRels file that records which documents are relevant to each query. To identify the relevant documents for a query, the documents are chunked, encoded into an embedding index, and the chunks most similar to each query embedding are determined via kNN. Since each chunk corresponds to a document, the kNN ranking over chunks can be converted into a kNN ranking over documents (for documents that appear multiple times, only the first occurrence is kept). The resulting ranking is then compared with the ranking implied by the ground-truth QRels, and retrieval metrics such as nDCG@10 are computed.
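
A minimal sketch of the chunk-to-document conversion and a simplified (linear-gain) nDCG@10, assuming chunk_hits is already sorted by embedding similarity and qrels maps doc_id to graded relevance for one query:

import math

def chunk_hits_to_doc_ranking(chunk_hits):
    """Collapse a similarity-sorted list of (doc_id, score) chunk hits into a document
    ranking, keeping only each document's first (best) occurrence."""
    seen, ranking = set(), []
    for doc_id, score in chunk_hits:
        if doc_id not in seen:
            seen.add(doc_id)
            ranking.append(doc_id)
    return ranking

def ndcg_at_10(ranking, qrels):
    """Simplified linear-gain nDCG@10 for a single query."""
    dcg = sum(qrels.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ranking[:10]))
    ideal = sorted(qrels.values(), reverse=True)[:10]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0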

| Dataset | Avg. Document Length (characters) | Traditional Chunking (nDCG@10) | Late Chunking (nDCG@10) | No Chunking (nDCG@10) |
| --- | --- | --- | --- | --- |
| SciFact | 1498.4 | 64.20% | 66.10% | 63.89% |
| TRECCOVID | 1116.7 | 63.36% | 64.70% | 65.18% |
| FiQA2018 | 767.2 | 33.25% | 33.84% | 33.43% |
| NFCorpus | 1589.8 | 23.46% | 29.98% | 30.40% |
| Quora | 62.2 | 87.19% | 87.19% | 87.19% |

The longer the document, the more effective the late chunking strategy will be.

Late chunking does not require additional training of the embedding model; it can be applied to any long-context embedding model that uses mean pooling.

Late chunking is also available directly through the jina-embeddings-v3 API (8192-token context):

import requests
import json

url = '/v1/embeddings'
headers = {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer jina_321dd2MDu0hLW2Gozj1fTd'
}
data = {
    "model": "jina-embeddings-v3",
    "task": "text-matching",
    "late_chunking": True,
    "input": [
        "Natural and organic skin care products specially designed for sensitive skin: Experience natural pampering with aloe vera and chamomile extracts. Designed especially for sensitive skin, these products are warming, nourishing and protective, leaving your skin free from irritation. Your skin will be noticeably less sensitive and more radiant and healthy.",
        "New makeup trends focus on vibrant colors and innovative technology: This season's makeup trends focus on bold colors and innovative technology. From neon eyeliners to holographic highlighters, unleash your creativity and create a unique look every time."
    ]
}

response = requests.post(url, headers=headers, data=json.dumps(data))
print(response.json())


4.3 Anthropic (2024.9.20)

Link:/news/contextual-retrieval

Anthropic introduced a separate strategy called contextual retrieval. Anthropic's approach is a powerful way to address the context-loss problem and works as follows:

  • Each chunk is sent to the LLM together with the complete document.

  • The LLM adds the relevant context for each chunk.

  • This produces richer, more informative embeddings.

This is essentially context enrichment, where the global context is explicitly hard-coded into each chunk by an LLM, which is expensive in terms of cost, time, and storage. Furthermore, it is unclear how well this approach tolerates poor chunk boundaries, since the LLM relies on accurate, readable chunks for effective context enrichment.
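
A minimal sketch of such an enrichment step; llm_complete stands for any prompt-in/text-out LLM call, and the prompt wording is illustrative rather than Anthropic's exact template:

def contextualize_chunk(chunk: str, full_document: str, llm_complete) -> str:
    """Prepend an LLM-generated situating context to a chunk before embedding it."""
    prompt = (
        f"<document>\n{full_document}\n</document>\n"
        f"Here is a chunk from that document:\n<chunk>\n{chunk}\n</chunk>\n"
        "Write one short sentence situating this chunk within the overall document, "
        "to improve retrieval. Answer with only that sentence."
    )
    context = llm_complete(prompt).strip()
    return context + "\n" + chunk   # embed this enriched text instead of the raw chunk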

4.4 Small Language Models (SLM): qwen2-0.5b

  • simple-qwen-0.5: the simplest model, which recognizes boundaries mainly from the structural elements of the document. It is simple and efficient, and suitable for basic chunking needs. /jinaai/text-seg-lm-qwen2-0.5b

  • topic-qwen-0.5: inspired by Chain-of-Thought reasoning, this model first identifies topics in the text, such as "the start of the Second World War", and then uses those topics to define chunk boundaries. It ensures that each chunk is thematically coherent, which makes it particularly suitable for complex, multi-topic documents. Initial tests show that it chunks very well, very close to human intuition. /jinaai/text-seg-lm-qwen2-0.5b-cot-topic-chunking

  • summary-qwen-0.5: this model not only recognizes chunk boundaries but also generates a summary for each chunk. Chunk summaries are useful in RAG applications, especially for question answering over long documents. The tradeoff is that more training data is needed. /jinaai/text-seg-lm-qwen2-0.5b-summary-chunking

All three models return only chunk heads, i.e. a truncated version of each chunk, rather than the complete chunk text; they output key points or sub-topics. Because only the key points or sub-topics are extracted, this captures the core meaning of each fragment, and the semantic transitions in the text are used to identify boundaries more accurately while keeping each chunk coherent. At retrieval time, the document is sliced according to these chunk heads, and the complete fragments are reconstructed from the resulting slice positions. In effect, the chunk heads act as an index from which the corresponding full fragments are recovered when needed, which preserves retrieval accuracy while improving efficiency and avoiding redundant output.
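
A minimal sketch of reconstructing full chunks from the generated chunk heads, assuming the heads appear verbatim in the document (a production version would fall back to fuzzy matching):

def reconstruct_chunks(document: str, chunk_heads: list) -> list:
    """Cut the original document at the position of each generated chunk head
    to recover the complete chunks."""
    starts, cursor = [], 0
    for head in chunk_heads:
        pos = document.find(head, cursor)
        if pos == -1:
            continue               # head not found verbatim; skip (or fuzzy-match instead)
        starts.append(pos)
        cursor = pos + len(head)
    starts.append(len(document))
    return [document[a:b].strip() for a, b in zip(starts, starts[1:])]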

4.4.1 Data set construction

The wiki727k dataset (/koomri/text-segmentation) was used, a large-scale collection of structured text fragments extracted from Wikipedia articles. It contains more than 727,000 fragments, each representing a different part of a Wikipedia article, such as an introduction, a section, or a subsection.

  • Data augmentation

To enrich the training data for each model variant, each article in the dataset was augmented using GPT-4o. Specifically, the following prompts were provided to GPT-4o:

f""""
Generate a five- to ten-word topic and a one-sentence summary for this text.

{text}

Ensure that the topic is concise and that the abstract covers the main themes wherever possible.
Please use the following format for your response:

Subject:...
Summary: ...

Reply directly to the desired topic and summary without including any other details, and do not enclose your reply in quotation marks, backquotes, or other separators.
   """.strip()

Each article is divided into blocks of text by three line breaks (\n\n\n), with sub-blocks of text within the block divided by two line breaks (\n\n). For example, an article about CGI will be divided into the following blocks:

[
    [
      "In computing, the Common Gateway Interface (CGI) provides a standard protocol for Web servers to execute programs (also known as command-line interface programs) that execute like console applications and that run on a server to dynamically generate Web pages.",
      "Such programs are called \"CGI scripts\" or simply \"CGIs\".",
      "The exact details of how the script is executed by the server are determined by the server.",
      "Typically, CGI scripts are executed and generate HTML when a request is made."
    ],
    [
      "In 1993, the National Center for Supercomputing Applications (NCSA) team wrote the specification for invoking command-line executables on the www-talk mailing list; however, NCSA no longer hosts the specification.",
      "Other Web server developers adopted it, and it has remained the standard for Web servers ever since.",
      "The following contributors are specifically mentioned in the RFC: \n1. Alice Johnson\n2. Bob Smith\n3. Carol White\n4. David Nguyen\n5. Eva Brown\n6. Frank Lee\n7. Grace Kim\n8. Henry Carter\n9. Ingrid Martinez\n10. Jack Wilson."
    ]
]
Then the text blocks of the original article and the topics and summaries generated by GPT-4o are organized into JSON format to facilitate model training.

{
  "sections": [
    [
      "In computing, the Common Gateway Interface (CGI) provides a standard protocol for Web servers to execute programs (also known as command-line interface programs) that run on the server and dynamically generate web pages.",
      // ... the split section content (result of the previous step)
    ]
  ],
  "topics": [
    "The Common Gateway Interface in Web servers",
    "History and standardization of CGI",
    // ... other topics
  ],
  "summaries": [
    "CGI provides a protocol for Web servers to run programs that generate dynamic web pages.",
    "NCSA first defined CGI in 1993, and it subsequently became the standard for Web servers...",
    // ... other summaries
  ]
}

Next, noise was added to the data, including breaking up the data, inserting random characters/words/letters, randomly removing punctuation, and removing all line breaks.

These data enhancement methods are effective but still have limitations, and the ultimate goal is to enable the model to generate coherent text and correctly handle structured content such as code snippets.

GPT-4o was therefore also used to generate code snippets, formulas, and lists to further enrich the dataset and strengthen the model's ability to handle these elements.

  • Training settings

    The settings during the training of the model are as follows:

    • Framework: Hugging Face's transformers library, combined with Unsloth to optimize the model. This is important for reducing memory usage and speeding up training, allowing small models to be trained efficiently on large datasets.

    • Optimizer and scheduler: the AdamW optimizer is used, together with a linear learning-rate schedule and a warmup phase, which helps stabilize the early stages of training.

    • Experiment tracking: all training runs were tracked with Weights & Biases, recording training loss, validation loss, learning-rate changes, overall model performance, and other key metrics. This real-time tracking makes it possible to follow the model's progress and quickly adjust parameters when needed to obtain the best results.

  • training process

    Using qwen2-0.5b-instruct as the base model, three SLM variants were trained with Unsloth, each corresponding to a different chunking strategy. The training data came from wiki727k and, in addition to the full text of the articles, contains the sections, topics, and summaries extracted in the "Data augmentation" step above.

    1. simple-qwen-0.5: trained on 10,000 samples for 5,000 steps; it converged quickly and effectively detects boundaries between coherent chunks of text. The training loss is 0.16.

    2. topic-qwen-0.5: like simple-qwen-0.5, trained on 10,000 samples for 5,000 steps, with a training loss of 0.45.

    3. summary-qwen-0.5: trained on 30,000 samples for 15,000 steps. This model has a lot of potential, but its training loss is relatively high (0.81), suggesting that more data, roughly twice the original number of samples, is needed to fully exploit its strength.

5. Comparison of the effects of different chunking strategies

5.1 jina vs qwen0.5

Take a look at the results of different chunking strategies. Here are three consecutive chunking examples generated by each strategy, along with the results from Jina's Segmenter API. In order to generate these chunks of text, we first used Jina Reader to grab the plain text of an article from Jina AI blog (including all the page data such as header, footer, etc.), and then processed it with different chunking methods respectively.

/news/can-embedding-reranker-models-compare-numbers/

  • Jina Segmenter API

The Jina Segmenter API chunks text at a very fine granularity, splitting on characters such as \n and \t, so the resulting chunks are usually small. Looking at just the first three chunks, it extracted the site's navigation bar (search\n, notifications\n and NEWS\n) but nothing relevant to the content of the article:

Further on there is eventually some blog-post content, but each chunk retains very little contextual information:

  • simple-qwen-0.5

simple-qwen-0.5 will break blog posts into longer text chunks based on semantic structure, each with a more coherent meaning:

  • topic-qwen-0.5

topic-qwen-0.5 It will first identify topics based on the content of the document and then chunk it based on those topics:

  • summary-qwen-0.5

summary-qwen-0.5 Not only does it recognize text block boundaries, it also generates a summary for each text block:

5.2 later chunk

The Weaviate researchers ran a series of tests comparing late chunking against naive chunking, using a snippet from a Weaviate blog post, and the results are striking.

Using a fixed-size chunking strategy (number of tokens = 128) resulted in the following sentence being split into two different chunks:

Weaviate's native, multi-tenant architecture provides advantages for customers who need to prioritize data privacy while maintaining fast retrieval and accuracy.

Below:

| Chunk | Content |
| --- | --- |
| chunk 1 | ... evolution of the technology stack. This flexibility, combined with ease of use, helps teams deploy AI prototypes into production environments faster. Flexibility is also critical when it comes to architecture. Different application scenarios have different needs. For example, we work with many software companies as well as companies operating in regulated industries. They often need multi-tenant capabilities to isolate data and maintain compliance. When building Retrieval Augmented Generation (RAG) applications that use account- or user-specific data to display results in context, the data must be retained in a tenant dedicated to its user group. Weaviate's native, multi-tenant architecture excels for customers who need to prioritize such needs. |
| chunk 2 | ... fast retrieval and accuracy while maintaining data privacy. On the other hand, we support some very large scale single-tenant use cases focused on real-time data access. Many of these cases involve e-commerce and industries where speed and customer experience are competitive points. |

Suppose someone asks, "What do customers care about most?"

| Method | AI's answer |
| --- | --- |
| Traditional chunking | 1. "Product updates, attend our webinars." (score: 75.6) 2. "Focus on data privacy while maintaining speed and accuracy. We support a number of large-scale single-tenant use cases, primarily in e-commerce and those industries where speed and customer experience are critical." (score: 70.1) |
| Late chunking | 1. "The customer's needs are diverse and we have introduced different storage tiers. It's amazing to see our customers' products grow in popularity. But as users grow, costs can skyrocket..." (score: 74.8) 2. "Technology choice is critical. Flexibility and ease of use help teams implement AI faster. Different use cases have different needs. For example, software companies and regulated industries often require multi-tenancy to isolate data and ensure compliance. When building AI applications, it's important to use specific user data to personalize results. Weaviate's multi-tenant architecture excels at this." (score: 68.9) |

Clearly, the answer produced with late chunking is more comprehensive and relevant, and better captures the core of the question.

  • Colab link: /drive/15vNZb6AsU7byjYoaEtXuNu567JWNzXOz?usp=sharing&ref=.io

  • Test link: /blog/late-chunking

  • Article: /blog/launching-into-production

6. Selection of an appropriate chunking strategy, taking into account business scenarios and textual features

  • different application requirements

    Different business scenarios have unique needs and challenges for text processing. When choosing a text chunking strategy, it is important to first fully understand the goals and requirements of the specific application.

    • Specialized domains: in fields such as law and medicine, texts often contain many technical terms and complex syntactic structures. Chunking must preserve the integrity of the terminology and its contextual relevance to avoid semantic misunderstandings caused by poor splits.

    • Long Document Processing: For applications that need to process long documents, such as technical documents and research reports, a chunking strategy that maintains the structure of paragraphs and chapters should be used to ensure that the larger model understands the overall logic and topic development of the document.

    • Real-time response scenarios: In scenarios requiring real-time response such as chatbots and intelligent customer service, the text chunking strategy needs to balance processing speed and semantic integrity, and may need to find a balance between chunk size and computational resources.

  • Analyze the internal structure of the text

    Different types of text have different structural characteristics, and these should be fully considered when choosing a chunking strategy.

    • Structured text: e.g. HTML or Markdown with clear markup and a hierarchical structure. Using targeted splitters (e.g. HTMLHeaderTextSplitter, MarkdownTextSplitter) preserves the structural information of the text and improves the model's comprehension.

    • Unstructured text: For unstructured texts such as prose and dialogues, semantic-based chunking methods can be considered, using natural language processing techniques to identify sentence boundaries and theme change points to ensure semantic integrity and coherence.

    • Multilingual texts: When processing text in multiple languages, you need to select a chunking tool that supports the appropriate language features and may need to adjust chunking parameters for different languages.

  1. Comparison of multiple scenarios

    In practice, different text chunking strategies should be verified experimentally to determine the most appropriate method for a particular application scenario.

    • Strategy diversity: try a variety of strategies such as fixed-size splitting, punctuation-based splitting, overlapping splitting, semantic splitting, etc.

    • Tool selection: compare different text chunking tools (e.g. RecursiveCharacterTextSplitter, NLTKTextSplitter, SpacyTextSplitter) to understand how they perform in the specific application (see the sketch below).
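
A minimal comparison sketch, assuming the langchain_text_splitters package is installed along with the NLTK and spaCy resources it requires:

from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    NLTKTextSplitter,
    SpacyTextSplitter,
)

splitters = {
    "recursive": RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50),
    "nltk": NLTKTextSplitter(chunk_size=500),
    "spacy": SpacyTextSplitter(chunk_size=500),
}

def compare_splitters(text: str):
    """Report chunk counts and average chunk lengths for each candidate splitter."""
    for name, splitter in splitters.items():
        chunks = splitter.split_text(text)
        avg_len = sum(len(c) for c in chunks) / max(len(chunks), 1)
        print(f"{name:10s} chunks={len(chunks):4d} avg_chars={avg_len:7.1f}")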

  2. Development of assessment indicators

    To objectively assess the effectiveness of different chunking strategies, clear evaluation metrics are needed.

    • Retrieval performance: evaluate the vectorization and retrieval stages, e.g. recall, precision, and average retrieval time.

    • Generation quality: evaluate the accuracy, completeness, coherence, and relevance to the user's query of the responses generated by the large model.

    • User feedback: collect user satisfaction ratings of the system's responses as an important reference for quality assessment.

  3. Data-driven optimization

    Optimization of chunking strategy based on experimental data and evaluation results

    • parameter tuning: According to the experimental results, adjust the text block length, overlap length, splitter and other parameters to seek the best configuration.

    • Problem orientation: Identify the causes of performance degradation in chunking strategies by analyzing model error cases and targeting improvements.

    • Progressive optimization: An iterative approach is used to gradually improve the chunking strategy and parameter settings, and each optimization is verified to ensure that the optimization is in the right direction.
