
PlugIR: Open-source and fine-tuning-free, Seoul National University proposes plug-and-play multi-round interactive text-to-image retrieval | ACL 2024


The plug-and-play PlugIR progressively refines the text query for image retrieval through an LLM-driven dialog between the questioner and the user, and then uses the LLM to convert the dialog into a format (a single sentence) that the retrieval model understands better. First, it eliminates the need to fine-tune the retrieval model on existing visual dialog data by reformulating the dialog-form context, so that any black-box model can be used. Second, it constructs an LLM questioner that generates non-redundant questions about the attributes of the target image based on the retrieval candidate images in the current context, mitigating the noise and redundancy problems that arise during question generation. In addition, it newly proposes the Best log Rank Integral (BRI) metric to measure comprehensive performance in a multi-round task. The paper validates the effectiveness of the retrieval system in various environments and highlights its flexible capabilities.

Source: Xiaofei's Algorithm Engineering Notes WeChat public account

Paper: Interactive Text-to-Image Retrieval with Large Language Models: A Plug-and-Play Approach

  • Paper address: /abs/2406.03411
  • Code: /Saehyung-Lee/PlugIR

Introduction


  Text-to-image retrieval, the task of locating a target image in an image database that corresponds to an input text query, has made significant progress thanks to the development of vision-language multimodal models. Traditionally, approaches in this area have used single-round retrieval methods that rely on an initial text input, which requires the user to provide a comprehensive and detailed description. Recently, a study proposed a chat-based image retrieval system that uses a large language model (LLM) as a questioner to facilitate multiple rounds of dialog, so that retrieval efficiency and performance can be enhanced even if the user provides only a simple initial image description. However, this chat-based retrieval framework faces a number of limitations, including the need for fine-tuning to adequately encode conversational text, a process that is both resource-intensive and poorly suited to scaling. Furthermore, the LLM questioner relies only on the initial description and the conversation history, without the ability to view candidate images; drawing solely on the LLM's parameterized knowledge, it may generate content unrelated to the target image.

  To overcome these challenges, the authors introduce PlugIR, a novel plug-and-play approach to interactive text-to-image retrieval that is tightly coupled with LLMs. PlugIR comprises two key components: context reformulation and context-aware dialog generation. Exploiting the instruction-following ability of LLMs, PlugIR reformulates the interaction context between the user and the questioner into a format compatible with pre-trained vision-language models. This allows a range of multimodal retrieval models, including black-box variants, to be applied directly without further fine-tuning. In addition, the approach ensures that the LLM questioner's queries are grounded in the context of the retrieval candidate set, enabling it to ask questions related to the attributes of the target image; for this, the retrieval context is injected in textual form into the LLM questioner's input as a reference. Finally, the approach includes a filtering process that selects the most context-grounded, non-repetitive questions and thereby narrows the search space.

  The authors identify three key aspects that matter when evaluating interactive retrieval systems: user satisfaction, efficiency, and the significance of ranking improvement, and find that existing metrics such as Recall@K and Hits@K fall short in these areas. For example, Hits@K fails to take efficiency into account, even though localizing the target image in fewer interactions is clearly better. To address these issues, the authors introduce the Best log Rank Integral (BRI) metric. BRI covers all three key aspects and, unlike Recall@K or Hits@K, provides a comprehensive assessment that does not depend on a specific rank threshold K. Empirically, BRI is shown to be closer to human evaluation than existing metrics.

  Experiments conducted on multiple datasets, including VisDial, COCO, and Flickr30k, show that PlugIR has significant advantages over existing interactive retrieval systems that use zero-shot or fine-tuned models. In addition, the approach shows significant adaptability when applied to a variety of retrieval models, including black-box models, which extends its utility to a wider range of applications and scenarios.

  The paper's contributions are as follows:

  1. Presents the first empirical evidence that zero-shot models have difficulty understanding dialog, and introduces a context reformulation approach as a solution that does not require fine-tuning the retrieval model.

  2. Presents an LLM questioner designed to address the retrieval bottlenecks caused by noisy and redundant questions.

  3. Introduces the BRI metric, a new metric aligned with human judgment and specifically designed to enable comprehensive and quantifiable evaluation of interactive retrieval systems.

  4. Validates the effectiveness of the framework in a variety of environments, highlighting its versatile plug-and-play capabilities.

Method


Preliminaries: Interactive Text-to-Image Retrieval

  Interactive text-to-image retrieval is a multi-round task that starts from a simple initial description \(D_0\) provided by the user. The user and the retrieval system then hold a dialog about the image corresponding to \(D_0\) (the target image), forming a context that serves as the query for the target image in each round. In round \(t\), the retrieval system generates a question \(Q_t\) about the target image and the user responds with an answer \(A_t\), producing the dialog context \(C_t=(D_0, Q_0, A_0, \ldots, Q_t, A_t)\) for that round. This dialog context is suitably processed, e.g., by concatenating all of its textual elements into a single text query, which is then used for image retrieval in that round. During retrieval, the system matches every image in the image pool against the text query and ranks them by similarity score; the performance of the retrieval system can then be evaluated from the retrieval rank of the target image.
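  The round structure above can be captured in a few lines. Below is a minimal Python sketch of a single retrieval round, assuming a hypothetical `encode_text` function and a pre-computed matrix of L2-normalized image embeddings from any CLIP-style model (the names are illustrative, not from the paper's code):

```python
import numpy as np

def retrieval_round(context_parts, image_embeds, encode_text):
    """Rank the image pool against the concatenated dialog context.

    context_parts: [D0, Q0, A0, ..., Qt, At] as plain strings.
    image_embeds:  (N, d) L2-normalized image embeddings.
    encode_text:   maps a string to a (d,) L2-normalized vector.
    Returns image indices ordered from most to least similar.
    """
    query = " ".join(context_parts)   # naive concatenation baseline
    q = encode_text(query)            # (d,) text embedding
    scores = image_embeds @ q         # cosine similarity (normalized inputs)
    return np.argsort(-scores)        # descending similarity order

# The position of the target image in the returned order is the rank
# from which Recall@K, Hits@K, and (later) BRI are computed.
```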

  For evaluation, two metrics are commonly used: Recall@K and Hits@K. Under Recall@K, a query is counted as a success if the target image's rank computed in the current round falls within the top K. Under Hits@K, a query is counted as a success if the target image appeared in the top K results in any round up to and including the current one.
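  As a concrete reference, here is a small sketch of how the two metrics can be computed from a matrix of per-round target ranks; the example values are made up for illustration:

```python
import numpy as np

def recall_at_k(ranks, k):
    """Recall@K per round: target rank in the CURRENT round is <= k.
    ranks: (num_queries, num_rounds) array of 1-based target ranks."""
    ranks = np.asarray(ranks)
    return (ranks <= k).mean(axis=0)

def hits_at_k(ranks, k):
    """Hits@K per round: target entered the top-k in ANY round so far."""
    ranks = np.asarray(ranks)
    best_so_far = np.minimum.accumulate(ranks, axis=1)
    return (best_so_far <= k).mean(axis=0)

ranks = np.array([[12, 4, 20],    # query 1: target rank per round
                  [ 3, 7,  2]])   # query 2
print(recall_at_k(ranks, 10))     # [0.5 1.  0.5] -- can drop with noisy dialog
print(hits_at_k(ranks, 10))       # [0.5 1.  1. ] -- monotone by construction
```

This also makes the paper's later point concrete: Hits@K can only increase across rounds, while Recall@K reveals whether each round's query actually carries more information.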

Context Reformulation

  • Do zero-shot models understand dialogs?

  To demonstrate the necessity of the proposed approach, zero-shot models are evaluated on how well they understand and effectively use a given dialog in an interactive text-to-image retrieval task. Specifically, the change in retrieval performance is tracked for zero-shot models, consisting of three white-box models (CLIP, BLIP, and BLIP-2) and one black-box model, as additional question-answer pairs related to the target image are progressively provided over a total of 10 rounds. Thus, by the 10th round, the input query is a dialog containing the image caption and 10 question-answer pairs. The hypothesis is that if a zero-shot model can understand dialogs and effectively use them for image retrieval, it will perform better in later rounds than in the initial round, which uses the image caption alone.

  As shown in Figure 2, all tested zero-shot models show gradual improvement in Hits@10 scores over successive rounds. This trend suggests that some query samples that failed in the initial retrieval eventually succeed as the dialog becomes richer in later rounds. However, one should not jump from these observations alone to the conclusion that dialogs are effective as zero-shot model input queries. A proper analysis should rely more on the Recall@10 score than on the Hits@10 score, and Recall@10 points to a different conclusion: zero-shot models seem to have difficulty understanding dialog in text-to-image retrieval tasks.

  Indeed, simply adding noise to the similarity matrix between the image captions and the candidate images can increase Hits@K scores in successive rounds, because Hits@K only requires one successful retrieval attempt at any point up to each round. In contrast, Recall@K reflects the amount of information contained in each round's query in a text-to-image retrieval task.

  As shown in Figure 2, all retrieval models in the study achieve their highest Recall@10 score when using only the image caption as the input query. Notably, for CLIP, BLIP, and BLIP-2, Recall@10 decreases as the rounds progress, which implies that for these zero-shot models the appended dialog acts mainly as noise, with the noise effect becoming more pronounced as the dialog grows longer. The Amazon Titan Multimodal foundation model (ATM), while its Recall@10 does not drop as the dialog lengthens, also shows no performance improvement, suggesting that the added dialog does not substantially contribute to the query context.

  • A plug-and-play approach

  To overcome the failure of zero-shot retrieval models to use dialog effectively in text-to-image retrieval tasks, one strategy is to fine-tune pre-trained retrieval models on datasets consisting of image and dialog pairs. For example, fine-tuning the BLIP model on VisDial yields higher Hits@K scores, and the experiments in the paper also show that this approach can give the retrieval model the ability to understand dialog. However, this fine-tuning-based approach depends on conditions that are not always met: (1) access to the retrieval model's parameters, and (2) sufficient and appropriate training data. For example, it is not applicable to a black-box retrieval model such as ATM.

  The authors therefore explore a novel approach that makes text queries easier for the retrieval model to understand, rather than modifying the retrieval model to fit the format of the text query. Specifically, instead of using dialogs directly as input queries, they use LLMs to convert the dialog into a format that is more consistent with the training data distribution of the retrieval model (e.g., caption style). This strategy effectively bypasses the limitations of the fine-tuning-based approach because it requires no fine-tuning of the retrieval model.
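  As an illustration, the reformulation step might look like the following sketch; the prompt wording and the `llm_complete` interface are hypothetical stand-ins, not the paper's released prompts:

```python
# Hypothetical prompt for LLM-based context reformulation.
REFORMULATE_PROMPT = """\
Rewrite the following dialog about an image as one concise, caption-style
sentence suitable as a text-to-image retrieval query. Keep only attributes
confirmed by the dialog; do not invent details.

Dialog:
{dialog}

Caption-style query:"""

def reformulate(dialog_context, llm_complete):
    """dialog_context: [D0, Q0, A0, ...]; llm_complete: any text-completion API."""
    dialog = "\n".join(dialog_context)
    return llm_complete(REFORMULATE_PROMPT.format(dialog=dialog)).strip()
```

The reformulated sentence, rather than the raw dialog, is what gets embedded by the retrieval model, so the query stays close to the caption-style distribution the model was trained on.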

Context-aware Dialogue Generation

  • Is the additional information in dialogues actually effective?

  The motivation for the reformulation presented above was the observation that the dialog form tends to act more like noise than useful information for a pre-trained retriever. The authors then went beyond the form of the context and examined its actual content. When relying only on the dialog context to generate questions about the target image, they identified two key problems. First, the generated questions may involve attributes unrelated to the target image. For example, a question asking about an object that is not in the target image will elicit a negative answer, which itself can act as noise in the dialog context. As a result, the contextual representation introduces more confusion into the retrieval process than in previous rounds, degrading retrieval performance.

  The second problem is the potential redundancy of the generated questions. During question generation, routine questions like "What is the person in the photo doing?" can usually be answered from information already available in the dialog context, without viewing the target image. In such cases the question-answer pair provides no valuable additional information, leading to redundancy that does not help improve retrieval performance in later rounds. To address these issues, the authors propose a questioner structure that can be flexibly applied in a variety of situations to handle the noise and redundancy challenges in dialogs.

  • A plug-and-play approach

  To avoid generating questions about attributes unrelated to the target image, the current round's retrieval candidate information is injected into the LLM questioner's text input. For this, images that are similar to the (reformulated) dialog context in the embedding space are first extracted from the image pool as the set of "retrieval candidates". These similar images share attributes with the current dialog context, including some information about the target image, which ensures that questions generated about these attributes have some relevance to the target image.

  K-means clustering is then applied to the candidate image embeddings, and the similarity distribution between each candidate image and the other candidate images is computed. For each cluster, the image whose similarity distribution has the lowest entropy is selected as its representative. This choice rests on the idea that lower entropy in the similarity distribution indicates that the image contains more specific and distinguishable attributes. For example, among images in the same cluster, an image captioned "Home office" shows high entropy, while another captioned "A desk with two computer monitors and a keyboard" shows low entropy.

  The K images obtained in this way are then converted to textual information by an arbitrary image captioning model and provided as additional input to the LLM questioner. This retrieval-context extraction process is shown in Algorithm 1.
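  A compact sketch of this clustering and representative-selection step is given below, assuming L2-normalized candidate embeddings; the cluster count and temperature are illustrative choices, not values from the paper:

```python
import numpy as np
from sklearn.cluster import KMeans

def select_representatives(cand_embeds, k=8, tau=0.01):
    """Pick one low-entropy representative per cluster of retrieval candidates.
    cand_embeds: (M, d) L2-normalized candidate image embeddings."""
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(cand_embeds)
    sims = cand_embeds @ cand_embeds.T                   # pairwise similarities
    logits = sims / tau
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)            # similarity distribution
    ents = -(probs * np.log(probs + 1e-12)).sum(axis=1)  # entropy per image
    reps = []
    for c in range(k):
        members = np.where(labels == c)[0]
        reps.append(int(members[np.argmin(ents[members])]))  # most distinctive
    return reps  # indices to pass to the captioning model
```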

  To ensure that the LLM questioner effectively grounds its questions in the textual information of the retrieval candidates, a chain-of-thought (CoT) style method is used: the LLM questioner is given few-shot examples as additional guidance on how to make effective use of the candidate content.
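  The few-shot guidance might look like the hypothetical template below; the actual examples shipped with PlugIR may differ:

```python
# Hypothetical few-shot CoT prompt for the context-aware questioner.
QUESTIONER_PROMPT = """\
You are helping retrieve a target image. Using the dialog so far and the
captions of the current retrieval candidates, reason step by step about
which attribute best separates the candidates, then ask ONE short,
non-redundant question about that attribute.

[Example]
Dialog: a dog on the grass
Candidates: "a brown dog running on a lawn"; "a white dog lying in a park"
Reasoning: the candidates differ in the dog's color and pose.
Question: What color is the dog?

Dialog: {dialog}
Candidates: {captions}
Reasoning:"""
```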

  Questions generated from the additional context extracted from the retrieval search space can include attributes relevant to the target image, yet may still be redundant. To prevent such questions, an additional filtering process is employed, adopting the strategy recently presented in DDCoT: an LLM agent is asked to answer each generated question using only the corresponding description and dialog, replying "Uncertain" when no answer can be derived from them, which implies the question is not redundant; only the questions answered "Uncertain" are then used.
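  A minimal sketch of this redundancy filter is shown below, assuming a generic `llm_answer(prompt) -> str` completion function (a hypothetical name, not the paper's API):

```python
# DDCoT-style filter: keep only questions the context alone cannot answer.
FILTER_PROMPT = """\
Answer the question using ONLY the description and dialog below.
If the answer cannot be derived from them, reply exactly "Uncertain".

Description and dialog:
{context}

Question: {question}
Answer:"""

def keep_non_redundant(questions, context, llm_answer):
    kept = []
    for q in questions:
        reply = llm_answer(FILTER_PROMPT.format(context=context, question=q))
        if reply.strip().lower().startswith("uncertain"):
            kept.append(q)  # unanswerable from context => adds new information
    return kept
```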

  The filtering process effectively removes questions that can be answered without viewing the target image, but it fails to exclude questions that cannot be answered even with the target image; such failures involve attributes that are relevant to the candidate set but not to the target image. The authors observe that using such inappropriate questions causes relatively abrupt changes in the similarity distribution between the query and the candidate images, degrading retrieval performance. They therefore select among the remaining questions based on the similarity distributions of the dialog context before and after appending each question, choosing the question with the lowest Kullback-Leibler (KL) divergence.
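  A sketch of this KL-based selection, under the same assumptions as before (normalized embeddings, a hypothetical `encode_text`, an illustrative temperature), could look like:

```python
import numpy as np

def _softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def select_question(context, questions, cand_embeds, encode_text, tau=0.01):
    """Pick the question whose addition least perturbs the similarity
    distribution over the candidate images (lowest KL divergence)."""
    p = _softmax(cand_embeds @ encode_text(context) / tau)   # before question
    best_q, best_kl = None, np.inf
    for q in questions:
        r = _softmax(cand_embeds @ encode_text(context + " " + q) / tau)
        kl = float(np.sum(p * np.log((p + 1e-12) / (r + 1e-12))))  # KL(p || r)
        if kl < best_kl:
            best_q, best_kl = q, kl
    return best_q
```

An abrupt shift in the candidate similarity distribution signals a question about attributes the target likely lacks, so the lowest-KL question is the safest choice.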

  Algorithm 2 shows PlugIR's filtering process. The context-aware dialog generation process configured in this way can be used together with the context reformulation described earlier, and is also flexible enough to be used independently, particularly when combined with a retrieval model fine-tuned on dialog context.

The Best log Rank Integral Metric

  The following key aspects are essential when evaluating interactive retrieval systems:

  1. User satisfaction: this aspect is considered satisfied if the system successfully retrieves the target image at least once within its query budget.
  2. Efficiency: the efficiency of the system is measured by the number of rounds required for a successful retrieval; fewer rounds indicate better performance.
  3. Importance of ranking improvement: improvements at higher ranking positions are inherently more challenging, and the metric should therefore emphasize them more. For example, when an image's rank rises from 2 to 1, the metric's improvement should be significantly more pronounced than when it rises from 100 to 99. This distinction highlights the added challenge and value of achieving top ranks.

Recall@K, commonly used to evaluate non-interactive retrieval systems, does not fully address these three aspects in this setting. Hits@K, the metric recommended for interactive systems, fulfills the user-satisfaction criterion but still falls short on the last two aspects. The paper therefore introduces a novel evaluation metric that aims to address all three considerations.

  To address user satisfaction, the best rank is defined as follows: let \(R(q)\) denote the retrieval rank of the target image for query \(q\). Then the best rank \(\pi\) for the query \(q_t\) at round \(t\) is

\[\pi(q_t)= \begin{cases} \min(\pi(q_{t-1}), R(q_t))&\textup{if}\;t\geq1 \\ R(q_0)&\textup{if}\;t=0 \end{cases} \]

  Let \(Q\) and \(T\) denote the test query set and the given system query budget, respectively. Then BRI is defined as

\[\mathop{\mathbb{E}}_{q\in Q} \left[ \frac{1}{2T}\log\pi(q_0)\pi(q_T)+\frac{1}{T}\sum^{T-1}_{t=1}\log\pi(q_t) \right]. \]

BRI can be interpreted as the average, over all queries in \(Q\), of the area under the graph of \(\log\pi\) as a function of the round \(t\). The faster the target image's rank improves, the smaller the area under the graph. The logarithm makes BRI decrease more sharply near the top ranks, and a lower BRI indicates a better-performing interactive retrieval system. Notably, BRI differs from Recall@K and Hits@K in its evaluation methodology: rather than dichotomizing each data sample at a specific rank threshold (K), it aggregates the results of all data samples in the evaluation, making it a more general and reliable metric.
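  The formula transcribes directly into code; the sketch below takes 1-based target ranks per round and applies the running-minimum definition of \(\pi\) from above:

```python
import numpy as np

def bri(ranks_per_round):
    """ranks_per_round: (num_queries, T+1) target ranks for rounds 0..T."""
    ranks = np.asarray(ranks_per_round, dtype=float)
    pi = np.minimum.accumulate(ranks, axis=1)    # best rank so far (pi)
    log_pi = np.log(pi)
    T = pi.shape[1] - 1
    area = (log_pi[:, 0] + log_pi[:, -1]) / (2 * T) \
         + log_pi[:, 1:-1].sum(axis=1) / T       # trapezoidal average of log-rank
    return float(area.mean())                    # lower is better

# A query whose rank improves quickly (50 -> 10 -> 1) contributes less
# area than one stuck at a mediocre rank (8 -> 8 -> 2).
print(bri([[50, 10, 1], [8, 8, 2]]))
```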

  The experimental results confirm that BRI aligns with human assessment far more closely than the other metrics.

Experiments



