Paper reading: AlignSum: Data Pyramid Hierarchical Fine-tuning for Aligning with Human Summarization Preference
- Paper: /abs/2410.00409
- Code: /csyanghan/AlignSum
Innovations
- Pre-trained language models (PLMs) were found to perform inconsistently between automatic and human evaluation on text summarization tasks, possibly due to low-quality training data.
- Considering annotation cost, the paper proposes \({\tt AlignSum}\), a new framework for aligning with human summarization preferences. It builds a data pyramid from extractive, LLM-generated, and human-annotated data, making full use of an extremely limited amount of high-quality data to push PLMs to the limits of their summary-generation capability.
Content overview
Text summarization tasks typically fine-tune pre-trained language models (PLMs) on various standard datasets. Although these PLMs perform well in automatic evaluations, they often perform poorly in human evaluations, indicating a gap between the summaries they generate and human preferences. This discrepancy may stem from low-quality fine-tuning datasets or from the limited availability of high-quality human-annotated data that reflects true human preferences.
Annotating a large volume of high-quality summaries is impractical, so rather than relying on the traditional approach of simply fine-tuning on large amounts of training data, the paper leverages the extremely limited high-quality data available to push PLMs to the limits of their summary-generation capability.
To address this challenge, the paper proposes \({\tt AlignSum}\), a new framework for aligning with human summarization preferences. The framework consists of three parts: first, a data pyramid is constructed from extractive, generative, and human-annotated summary data; second, Gaussian resampling removes samples with extreme summary lengths; finally, two-stage hierarchical fine-tuning is performed on the resampled data pyramid.
Applying \({\tt AlignSum}\) to the human-annotated CNN/DailyMail and BBC XSum datasets, PLMs such as BART-Large outperform the 175B GPT-3 in both automatic and human evaluations. This demonstrates that \({\tt AlignSum}\) significantly improves the alignment of language models with human summarization preferences.
AlignSum
The overall framework consists of three parts:
- Construct the data pyramid (Data Pyramid) with methods such as extraction, LLM generation, and manual annotation.
- Since the source data have different summary lengths, use Gaussian resampling to bring the generated summary lengths close to the target length.
- Apply a two-stage hierarchical fine-tuning strategy: in the first stage, the PLM is trained on extractive and generative data for the general domain; in the second stage, the fine-tuned PLM is further fine-tuned on human-annotated data to align it with human preferences.
Building the Data Pyramid
The data pyramid consists of three tiers that increase in quality and acquisition difficulty from bottom to top while decreasing in quantity. The bottom two tiers are the two most common styles in summary generation and are collectively referred to as generic data. The top tier, the most critical part for aligning with human preferences, is called personalized data.
- Extractive data
Extractive data forms the bulk of the pre-training corpus and is the easiest to obtain. Following GSG (Gap Sentence Generation), ROUGE-1 is used as the similarity metric: the whole document is traversed and the sentences most similar to the rest of the document are selected as the pseudo-summary \(\hat{S}\):

\(\hat{S} = \arg\max_{s_i \in D} \mathrm{ROUGE}\text{-}1(s_i, D \setminus \{s_i\})\)
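A minimal sketch of this GSG-style selection, assuming a simple unigram-F1 stand-in for ROUGE-1 (a real pipeline would presumably use a standard ROUGE implementation); `gsg_pseudo_summary` and `k` are illustrative names:

```python
import re
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between two texts (a simple stand-in for ROUGE-1)."""
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def gsg_pseudo_summary(sentences: list[str], k: int = 1) -> list[str]:
    """Score each sentence against the rest of the document, keep the top-k."""
    scores = []
    for i, s in enumerate(sentences):
        rest = " ".join(sentences[:i] + sentences[i + 1:])
        scores.append((rouge1_f1(s, rest), i))
    top = sorted(scores, reverse=True)[:k]
    # Return selected sentences in their original document order.
    return [sentences[i] for _, i in sorted(top, key=lambda t: t[1])]
```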
- Generative data
Extractive data helps identify important sentences in a document, but it cannot summarize key information that spans multiple sentences. In contrast, LLMs (large language models) are effective zero-shot summary generators, able to aggregate summary information across both sentence and document levels.
System and user prompts guide the LLM to summarize a document \(D\) and generate a pseudo-summary \(\hat{S}\). The system prompt specifies the general requirements for accurate summary generation, and the document is inserted before the user prompt so that the LLM reads the entire document and follows the user's requirements. User prompts are dataset-specific, setting the required summary length and word count.
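A minimal sketch of this generative step, assuming the OpenAI chat completions API; the model name, prompt wording, and word limit are illustrative assumptions, not the paper's exact prompts:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (  # illustrative; the paper's exact system prompt may differ
    "You are a careful summarizer. Write an accurate, faithful summary "
    "of the document provided by the user."
)

def llm_pseudo_summary(document: str, max_words: int = 60) -> str:
    """Generate a pseudo-summary for a document via zero-shot prompting."""
    user_prompt = (
        f"{document}\n\n"
        # Dataset-specific length requirement goes in the user prompt.
        f"Summarize the document above in at most {max_words} words."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model choice
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content
```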
- Human-annotated data
Training on the two kinds of data above gives PLMs domain-specific knowledge, but further fine-tuning on human-annotated data is necessary to generate summaries that match human preferences.
To avoid the variability of arbitrary annotation, the Element-aware dataset is used. This dataset follows specific instructions and combines micro and macro requirements to ensure consistent, high-quality human annotation.
Gaussian resampling
The pseudo-summaries from the three data sources have distinct token-length distributions, and the distributions for extractive and generative data differ significantly. Training directly on these mixed distributions may therefore produce summaries that are too long or too short. To address this, Gaussian resampling is introduced to align all summary lengths with those of the human-annotated summaries.
The token-length distribution of the human-annotated data is modeled as a Gaussian \(\mathcal{N}(\mu, \sigma^2)\). The extractive and generative data are then resampled within the 95% probability interval \([\mu - 2\sigma, \mu + 2\sigma]\), removing samples whose pseudo-summaries are too long or too short.
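A minimal sketch of the resampling step, assuming token lengths have already been computed for each sample; function and variable names are illustrative:

```python
import statistics

def gaussian_resample(generic_samples, generic_lengths, human_lengths):
    """Keep only generic samples whose pseudo-summary token length falls
    inside the 95% interval [mu - 2*sigma, mu + 2*sigma] fit on human data."""
    mu = statistics.mean(human_lengths)
    sigma = statistics.stdev(human_lengths)
    lo, hi = mu - 2 * sigma, mu + 2 * sigma
    return [
        sample
        for sample, length in zip(generic_samples, generic_lengths)
        if lo <= length <= hi
    ]
```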
Two-stage hierarchical fine-tuning
Directly fine-tuning the pre-trained language model (PLM) on the full data pyramid is challenging: the small amount of high-entropy data is critical for alignment, but it can be drowned out by information from the large amount of low-entropy data, leaving the data pyramid underutilized.
To avoid this potential problem, the paper proposes a two-stage hierarchical fine-tuning strategy. Given a pre-trained language model\(p_{\theta}\):
- First, the generic fine-tuning stage: \(p_{\theta}\) is fine-tuned on extractive and generative data to strengthen its ability to generate domain-general summaries, yielding the model \(p_{\theta'}\).
- Next, the personalized fine-tuning stage: \(p_{\theta'}\) is fine-tuned on human-annotated data to produce the final model \(p_{\theta''}\), aligned with human preferences.
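A minimal sketch of this two-stage schedule with Hugging Face `transformers`, assuming BART-Large as \(p_{\theta}\); `generic_dataset` and `human_dataset` are hypothetical, already-tokenized datasets, and the hyperparameters are placeholders:

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

def fine_tune(model, tokenizer, dataset, output_dir):
    """Run one fine-tuning stage; hyperparameters here are placeholders."""
    args = Seq2SeqTrainingArguments(output_dir=output_dir, num_train_epochs=3)
    trainer = Seq2SeqTrainer(
        model=model, args=args, train_dataset=dataset, tokenizer=tokenizer
    )
    trainer.train()
    return trainer.model

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large")  # p_theta

# `generic_dataset` / `human_dataset`: hypothetical tokenized datasets for the
# Gaussian-resampled generic data and the Element-aware human annotations.
model = fine_tune(model, tokenizer, generic_dataset, "ckpt/generic")  # -> p_theta'
model = fine_tune(model, tokenizer, human_dataset, "ckpt/personal")   # -> p_theta''
```

Running the two stages sequentially keeps the scarce human-annotated data as the last signal the model sees, which is the point of the hierarchical schedule.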
Main experiments