Special thanks to Kimi; the following papers were read with Kimi's assistance.
Contents
- RMIB: Representation Matching Information Bottleneck for Matching Text Representations
- AttentionRank: Unsupervised Keyphrase Extraction using Self and Cross Attentions
- ANSWERING COMPLEX OPEN-DOMAIN QUESTIONS WITH MULTI-HOP DENSE RETRIEVAL
- APPROXIMATE NEAREST NEIGHBOR NEGATIVE CONTRASTIVE LEARNING FOR DENSE TEXT RETRIEVAL
- CogLTX: Applying BERT to Long Texts
- How to Fine-Tune BERT for Text Classification?
- Optimizing E-commerce Search: Toward a Generalizable and Rank-Consistent Pre-Ranking Model
RMIB: Representation Matching Information Bottleneck for Matching Text Representations
- 2024 ICML
- /chenxingphh/rmib/tree/main
For texts from different domains, the representation distributions obtained after encoding are not aligned in pairs. The authors propose RMIB, which is based on the Information Bottleneck (IB) and narrows the gap between the two texts' representation distributions by matching them to a prior distribution. Concretely, two constraints are added to the learning process:
- Sufficiency of interactions between textual representations.
- Incompleteness of a single textual representation.
Looking at the code, the paper's changes are concentrated in the loss function; rewritten in a more digestible form, the idea is roughly as follows.
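Below is a minimal PyTorch sketch of how I read the idea: on top of the usual task loss over the interaction of the two representations (the "sufficiency of interaction" part), each text's representation distribution is pulled toward a shared prior with a KL term, which both narrows the gap between the two domains' distributions and keeps any single representation "incomplete". The Gaussian prior, the reparameterized encoder heads, and the weights `alpha`/`beta` are my assumptions, not necessarily the paper's exact formulation; the linked repo is the source of truth.

```python
import torch
import torch.nn.functional as F

def rmib_style_loss(logits, labels, mu1, logvar1, mu2, logvar2,
                    alpha=0.1, beta=0.1):
    """Sketch of an RMIB-style objective (my reading, not the official code).

    logits: task head output over the interaction of the two representations.
    (mu1, logvar1), (mu2, logvar2): parameters of q(z|x) for each text,
    assumed Gaussian with diagonal covariance.
    """
    # Task loss: keep the interaction of z1 and z2 sufficient for the label.
    task_loss = F.cross_entropy(logits, labels)

    # KL( N(mu, sigma^2) || N(0, I) ): match each representation distribution
    # to a shared prior, which also narrows the gap between the two domains.
    kl1 = -0.5 * torch.mean(torch.sum(1 + logvar1 - mu1.pow(2) - logvar1.exp(), dim=-1))
    kl2 = -0.5 * torch.mean(torch.sum(1 + logvar2 - mu2.pow(2) - logvar2.exp(), dim=-1))

    return task_loss + alpha * kl1 + beta * kl2
```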
What drew me to this paper was its mention of asymmetric text matching; the scenarios outlined in the paper are:
- Domain differences. For example, medicine vs. computer science.
- Data distribution differences. Even within the same domain the data distribution can differ; for example, user queries in search are colloquial and short, while documents are standardized and long.
- Task differences. For example, question answering, long–short text matching, etc.
I've been working on text matching recently and ran into exactly this long–short text matching problem, so this paper should help a bit; I'll try it out when I have time. 🙈
AttentionRank: Unsupervised Keyphrase Extraction using Self and Cross Attentions
- 2021 EMNLP
The paper proposes AttentionRank, an unsupervised keyphrase extraction method that computes two kinds of attention on top of a pre-trained language model (PLM):
- Self-attention: used to measure the importance of a candidate phrase (candidates are nouns identified via part-of-speech tagging) within its sentence.
- Cross-attention: used to compute the semantic relevance between a candidate phrase and the other sentences in the document.
The overall process:
- Given an input document, identify nouns via PoS tagging, then generate noun-based candidate phrases with NLTK.
- Compute each candidate phrase's self-attention weight within its sentence.
- Compute each candidate phrase's cross-attention weight with respect to the document.
- Combine the self-attention and cross-attention weights into the final weight of each candidate phrase.
The model structure is as follows (the diagram is a bit rough):
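As a toy illustration of the last step (combining the two kinds of weights), here is a sketch; the normalization and the multiplicative combination are my own simplification, not the paper's exact formula.

```python
def combine_attention_scores(self_scores, cross_scores):
    """Toy combination of candidate-phrase scores (illustrative only).

    self_scores:  {phrase: self-attention weight within its sentence}
    cross_scores: {phrase: cross-attention weight w.r.t. the document}
    Returns candidate phrases ranked by the product of the normalized scores.
    """
    def normalize(scores):
        total = sum(scores.values()) or 1.0
        return {p: s / total for p, s in scores.items()}

    s, c = normalize(self_scores), normalize(cross_scores)
    final = {p: s.get(p, 0.0) * c.get(p, 0.0) for p in set(s) | set(c)}
    return sorted(final.items(), key=lambda kv: kv[1], reverse=True)

# e.g. combine_attention_scores({"neural network": 0.6, "method": 0.2},
#                               {"neural network": 0.5, "method": 0.1})
```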
ANSWERING COMPLEX OPEN-DOMAIN QUESTIONS WITH MULTI-HOP DENSE RETRIEVAL
- /pdf/2009.12756, ICLR 2021, Facebook.
The paper proposes a multi-hop dense retrieval method for answering complex open-domain questions, where the complexity mainly comes from multi-hop questions. The idea is to iteratively encode the question together with the previously retrieved documents as the query vector and retrieve the next relevant document with an efficient Maximum Inner Product Search (MIPS). The general flow is as follows:
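Roughly, in code form (the `encode` function, the brute-force inner product standing in for a real ANN/FAISS index, and the fixed number of hops are simplifications of mine, not the paper's exact setup):

```python
import numpy as np

def multihop_retrieve(question, encode, doc_vectors, docs, hops=2, top_k=1):
    """Iteratively retrieve documents for a multi-hop question (sketch).

    encode(text) -> 1D np.ndarray query vector; doc_vectors is an (N, d)
    matrix, so each hop is a maximum inner product search over the corpus.
    """
    retrieved = []
    query_text = question
    for _ in range(hops):
        q = encode(query_text)
        scores = doc_vectors @ q                    # MIPS over the corpus
        best = np.argsort(-scores)[:top_k]
        retrieved.extend(docs[i] for i in best)
        # Re-encode the question together with what has been retrieved so far.
        query_text = question + " " + " ".join(retrieved)
    return retrieved
```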
I stumbled across this paper, and even though it is only from 2021 it already feels like an "ancient" method (LLMs have since taken over question answering); today this would basically be done with an LLM 😂~
APPROXIMATE NEAREST NEIGHBOR NEGATIVE CONTRASTIVE LEARNING FOR DENSE TEXT RETRIEVAL
- /pdf/2007.00808, 2020, Microsoft
The paper targets a major training bottleneck in Dense Retrieval (DR): negatives are usually sampled from within the training batch. These in-batch negatives yield small gradients on the loss and contribute little to learning, and because the sample distribution varies a lot across batches, gradient variance is high and training is unstable.
On top of that, DR tasks need more (and harder) negatives. (As an aside: in an information-funnel system, the closer to the bottom of the funnel, the more it is about the art of negative sampling, and the closer to the top, the more it is about the art of representation? That only holds, of course, if the other stages are built reasonably.) The DR stage has to distinguish many kinds of negatives: it must separate relevant from irrelevant, but irrelevance comes in many flavors, e.g. lexically irrelevant, lexically relevant but semantically irrelevant, and semantically hard-to-distinguish negatives. In general, what DR sees during training should be as comprehensive as possible and close to the real distribution. The problem the paper addresses is therefore straightforward: in-batch negatives are too easy; they neither match the real distribution nor help the model learn.
To this end, the paper proposes ANCE (Approximate Nearest Neighbor Negative Contrastive Learning), which builds an ANN index with the DR model currently being optimized and selects global negatives from it. The training process is as follows:
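Roughly, the loop alternates between refreshing an index built with the current encoder and sampling global hard negatives from it. Below is a hedged sketch where brute-force inner product stands in for the real, asynchronously refreshed ANN index:

```python
import numpy as np

def refresh_negative_index(encode_doc, corpus):
    """Re-encode the corpus with the current (partially trained) DR model."""
    return np.stack([encode_doc(d) for d in corpus])

def sample_hard_negatives(query_vec, doc_vectors, positive_ids, k=5):
    """Pick global hard negatives: top-scoring docs that are not positives.

    In ANCE proper this search runs on an asynchronously refreshed ANN index;
    here it is a brute-force maximum inner product search for clarity.
    """
    pos = set(positive_ids)
    scores = doc_vectors @ query_vec
    ranked = np.argsort(-scores)
    return [int(i) for i in ranked if int(i) not in pos][:k]
```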
CogLTX: Applying BERT to Long Texts
- /paper_files/paper/2020/file/, 2020 NeurIPS
BERT usually struggles with long text, for the following reasons:
- Input length limit. BERT's maximum input length is usually 512 tokens (its position embeddings are learned and only cover 512 positions during pre-training), so key content may appear beyond the 512-token range, or the distance between pieces of key content may exceed 512.
- Time complexity. Self-attention in BERT is quadratic in the sequence length, so the compute cost on long texts may be unacceptable.
Some ways to handle long text:
- Truncation.
- Sliding window. Split the long text into multiple chunks, encode them separately, then pool (a minimal sketch follows this list).
- Compression. Similar to sequential modeling: process step by step and compress as you go.
- Modified attention. For example, sparse attention or sliding-window attention.
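A minimal sketch of the sliding-window idea, assuming some `encode_chunk` function that maps a token chunk to a fixed-size vector (the window/stride values and mean pooling are just illustrative choices):

```python
import torch

def sliding_window_encode(tokens, encode_chunk, window=510, stride=255):
    """Encode a long token sequence with overlapping windows, then mean-pool.

    encode_chunk(list_of_tokens) -> 1D tensor; window=510 leaves room for
    [CLS]/[SEP] when each chunk is fed to a BERT-style encoder.
    """
    chunks = []
    for start in range(0, max(len(tokens), 1), stride):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
    vecs = torch.stack([encode_chunk(c) for c in chunks])
    return vecs.mean(dim=0)   # mean pooling over chunk vectors
```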
To address these problems, the paper proposes CogLTX (Cognize Long TeXts). The core idea, analogous to how humans process information, is to introduce MemRecall to identify the key text blocks in a long text and feed only these key pieces to the model. CogLTX relies on a basic assumption: for most NLP tasks, a small number of key sentences from the source text is sufficient. Concretely, MemRecall (which can simply be another BERT model, trained jointly with the task BERT actually being used) extracts the key blocks. The workflow of MemRecall is shown below:
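As a very rough sketch of the "score blocks, keep the key ones within the length budget" idea (the scoring function here is a placeholder for the jointly trained judge BERT, and the real MemRecall works iteratively rather than in one greedy pass):

```python
def select_key_blocks(blocks, score_block, budget=512):
    """Greedily keep the highest-scoring text blocks under a token budget.

    blocks: list of (block_text, block_len) in document order;
    score_block(text) -> relevance score (stand-in for the judge model).
    """
    ranked = sorted(range(len(blocks)),
                    key=lambda i: score_block(blocks[i][0]), reverse=True)
    selected, used = [], 0
    for i in ranked:
        text, length = blocks[i]
        if used + length <= budget:
            selected.append(i)
            used += length
    # Keep the chosen blocks in their original document order.
    return [blocks[i][0] for i in sorted(selected)]
```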
This paper sat in my backlog for a long time. It mainly addresses problems that arose when applying BERT to long-text scenarios at the time, avoiding the influence of irrelevant content on the target. Even though the trend now is larger models and longer contexts, small models like this are still very useful in practical applications; worth trying out when the opportunity arises.
How to Fine-Tune BERT for Text Classification?
- /pdf/1905.05583, 2020
BERT's role in NLP needs no introduction; even in today's era of large models it is still hard to replace. As someone who came to NLP midway, it's worth refreshing my knowledge of these tools.
As a representative encoder model, BERT is commonly used for discriminative tasks such as text classification, similarity computation, and extractive summarization, i.e. learning token- or sentence-level representations. This paper explores how to fine-tune BERT for text classification. Going from a pre-trained BERT to a model suited to the target task usually involves three steps (a minimal code sketch follows the steps):
- Retraining (further pre-training). Pre-train again on a large corpus from the target scenario so the model adapts to that data. The reason is easy to understand: pre-trained models are usually trained on a general corpus and may lack domain data; to apply BERT in, say, the legal domain, another round of pre-training helps it understand the meaning of in-domain terms.
- Multi-task fine-tuning. Fine-tune the model on several tasks in the target domain to fit the task further. Why add this step when one could go straight to the next one? As Kimi reminded me, multi-task fine-tuning serves the following purposes:
  - Better generalization: sharing the underlying representation across tasks lets the model learn features common to them, improving generalization and reducing overfitting.
  - Knowledge transfer: when some tasks have little data, knowledge can be transferred from tasks with more data, helping the model learn and adapt to the small-data tasks.
- Final fine-tuning. Fine-tune on the actual application task.
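A minimal HuggingFace-style sketch of the recipe (the model names and the two-stage wiring are just an illustration; the paper itself predates this API and uses its own training setup):

```python
from transformers import (AutoModelForMaskedLM,
                          AutoModelForSequenceClassification, AutoTokenizer)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Step 1: further pre-training (masked language modeling) on in-domain text,
# e.g. with Trainer + DataCollatorForLanguageModeling, then save the weights.
mlm_model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
# ... train mlm_model on the domain corpus ...
# mlm_model.save_pretrained("bert-domain")

# Steps 2/3: (multi-task) fine-tuning for the target classification task,
# loading the domain-adapted weights from step 1 instead of the generic ones.
clf_model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",   # or "bert-domain" after step 1
    num_labels=2,
)
# ... fine-tune clf_model on the labeled task data ...
```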
Optimizing E-commerce Search: Toward a Generalizable and Rank-Consistent Pre-Ranking Model
- /pdf/2405.05606, 2024 SIGIR
Another long-backlogged paper, this one on the pre-ranking stage of JD.com's product search.
Pre-ranking is a lightweight module whose main role in the system pipeline is filtering (ad-hoc vs. filtering comes to mind). In much previous work, the goal of pre-ranking has mainly been to stay as consistent as possible with the ordering produced by the ranking stage. Whether pre-ranking should become more and more similar to ranking has been discussed in a number of works; I won't go into that here. The paper proposes GRACE (Generalizable and RAnk-ConsistEnt Pre-Ranking Model), with the following main improvements:
- Rank consistency is achieved by introducing multiple binary classification tasks that predict whether a product appears in the top-k results of the ranking stage (a sketch follows this list).
- Contrastive pre-training of all product representations to improve generalization.
- Easy to implement in terms of feature construction and online deployment.
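For the first point, a hedged sketch of how "is this product in the ranking stage's top-k" could be set up as multiple binary heads on a shared pre-ranking representation (the k values and the layout are my assumptions, not the paper's exact design):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKConsistencyHeads(nn.Module):
    """Multiple binary classifiers on top of a shared pre-ranking representation.

    Each head predicts whether the product lands in the ranking stage's top-k
    for one value of k, so the pre-ranker learns an ordering consistent with
    the ranker without having to regress its scores.
    """
    def __init__(self, hidden_dim, ks=(10, 50, 100)):
        super().__init__()
        self.ks = ks
        self.heads = nn.ModuleList([nn.Linear(hidden_dim, 1) for _ in ks])

    def forward(self, item_repr):                   # (batch, hidden_dim)
        return torch.cat([head(item_repr) for head in self.heads], dim=-1)

    def loss(self, logits, rank_positions):         # rank_positions: (batch,)
        targets = torch.stack([(rank_positions < k).float() for k in self.ks],
                              dim=-1)
        return F.binary_cross_entropy_with_logits(logits, targets)
```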
Some thoughts on the first improvement:
This way of thinking makes sense to me: pre-ranking essentially takes on the job of separating good results from bad, filtering out the bad and passing the potentially good ones to the later stages. But it naturally invites the question: wouldn't aligning with the ranking stage be more direct? After all, ignoring performance constraints, putting the ranking model's ordering directly into pre-ranking might even give better results. So should pre-ranking be aligned to ranking? I think it's better not to over-align, for several reasons: 1) pre-ranking and ranking sit at different points in the pipeline, so their input sample distributions differ; 2) a clicked sample is probably a good result, but an exposed-but-unclicked sample is not necessarily a bad one, and if pre-ranking treats unclicked exposures as bad results, position bias and other biases will cause good results to be wrongly penalized; 3) over-aligning pre-ranking to ranking easily creates a positive feedback loop in which clicks affect the whole pipeline more and more, which is unfriendly to good results that lack clicks and to cold start; 4) pre-ranking and ranking models differ in complexity, so weakening pre-ranking's fine-grained ordering in favor of a stronger ability to separate good from bad may be easier to optimize (?). It can also be seen as a decoupling of the two systems.
Summary
I have to say, the papers I'm reading are still a pretty mixed bag; I should gradually narrow the focus 🤣
These are mainly papers that had been piling up unread; procrastination is not the way 😣
Also, although this post is meant to be quick notes, it wasn't written particularly quickly 😢