Paper: SAM4MLLM: Enhance Multi-Modal Large Language Model for Referring Expression Segmentation
- Paper address: /abs/2409.10542
- Paper code: /AI-Application-and-Integration-Lab/SAM4MLLM
Innovations
- A method is proposed that enables an MLLM to understand pixel-level detail. SAM4MLLM requires no changes to the MLLM architecture, no new tokens, and no additional losses; the approach is simple yet very effective for referring expression segmentation (RES).
- To connect the MLLM and SAM, a novel way of obtaining prompt-point cues by proactively querying the language model is introduced.
- Experiments on various RES benchmarks, including the RES datasets, GRES, and ReasonSeg, verify the effectiveness of SAM4MLLM and demonstrate its strong performance on complex pixel-aware tasks.
Content overview
SAM4MLLM is an approach that integrates the Segment Anything Model (SAM) with multimodal large language models (MLLMs) to realize pixel-aware tasks.
- First, pixel-level information is introduced into the MLLM training data without changing the original MLLM architecture, allowing the MLLM to understand pixel-level information with the same textual cross-entropy loss used by mainstream LLMs.
- Second, given the limited input resolution and a model architecture not explicitly designed for visual tasks, the MLLM may have limitations in expressing pixels. SAM is therefore used as a post-processing step to refine the MLLM output, obtaining higher-precision segmentation masks in a relatively simple way.
- Finally, to connect SAM and the MLLM in a simple way, the MLLM generates SAM prompt points. Leveraging the LLM's conversational ability, the MLLM is proactively queried to obtain effective prompt points for SAM.
SAM4MLLM addresses the RES problem by enabling MLLMs to learn pixel-level location information. It combines detailed visual information with the powerful expressive capability of a large language model in a unified, language-based manner, without additional computational overhead during learning.
SAM4MLLM
Encoding the segmentation mask as SAM prompts
Existing MLLMs for segmentation rely on specially designed model architectures, dedicated segmentation tokens, and heterogeneous loss functions to predict object masks. SAM4MLLM instead exploits a property of SAM: a small number of textual prompt tokens (a bounding box plus a few points indicating whether they lie in the object region) can be converted into a high-quality, continuous segmentation mask.
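To make the box-plus-points prompting concrete, here is a minimal sketch using the public `segment_anything` API (not part of the paper's code); the checkpoint path, image, and prompt coordinates are placeholders.

```python
# Minimal sketch: SAM turns a bounding box plus a few labeled points into a mask.
# Assumes the public `segment_anything` package and a locally downloaded ViT-H
# checkpoint; the image and prompt coordinates below are placeholders.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # assumed path
predictor = SamPredictor(sam)

image = np.zeros((480, 640, 3), dtype=np.uint8)  # replace with a real RGB image
predictor.set_image(image)

box = np.array([120, 80, 340, 260])              # x1, y1, x2, y2
points = np.array([[200, 150], [310, 250]])      # (x, y) pixel coordinates
labels = np.array([1, 0])                        # 1 = on the object, 0 = off it

masks, scores, _ = predictor.predict(
    point_coords=points,
    point_labels=labels,
    box=box,
    multimask_output=False,
)
print(masks.shape)  # (1, H, W) boolean mask
```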
SAM4MLLM uses points sampled within a bounding box as discrete prompts. Specifically, a bounding box \(Prompt_B \in \mathbb{N}^4\) and \(\mathcal{K}\) points are used to encode an arbitrarily shaped mask. Each of the \(\mathcal{K}\) point prompts contains three values: the \(x\) coordinate, the \(y\) coordinate, and whether the point lies on the mask, encoded together as \(Prompt_P \in \mathbb{N}^{\mathcal{K} \times 3}\).
Encoding the continuous segmentation mask as discrete SAM prompts avoids adding any tokens or changing the model structure, and training uses only the textual autoregressive cross-entropy loss. This is consistent with the language model's original training paradigm, allowing MLLMs to understand pixel-level information and making future model extensions easier.
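As an illustration of this encoding, the sketch below converts a binary ground-truth mask into a bounding box \(Prompt_B\) and a few point triples \((x, y, \text{on-mask})\) and serializes them as plain text; the `<box>`/`<pt>` tags are hypothetical and not the paper's exact textual format.

```python
# Illustrative sketch (not the paper's exact text format): encode a binary mask
# as a bounding box Prompt_B and K point triples Prompt_P = (x, y, on-mask),
# serialized as plain text so only the usual LM cross-entropy loss is needed.
import numpy as np

def encode_mask_as_prompts(mask: np.ndarray, num_points: int = 3, seed: int = 0) -> str:
    rng = np.random.default_rng(seed)
    ys, xs = np.nonzero(mask)
    x1, y1, x2, y2 = xs.min(), ys.min(), xs.max(), ys.max()   # Prompt_B
    # Sample points inside the box and record whether each lies on the mask.
    px = rng.integers(x1, x2 + 1, size=num_points)
    py = rng.integers(y1, y2 + 1, size=num_points)
    pts = " ".join(f"<pt>{x},{y},{int(mask[y, x])}</pt>" for x, y in zip(px, py))
    return f"<box>{x1},{y1},{x2},{y2}</box> {pts}"            # hypothetical tags

mask = np.zeros((256, 256), dtype=np.uint8)
mask[64:192, 96:224] = 1                                      # toy rectangular mask
print(encode_mask_as_prompts(mask))
```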
Prompting SAM with the MLLM
To integrate SAM and the MLLM in a unified way, one of the main problems is obtaining SAM's prompt points, i.e. positive points inside the object mask region (inside) and negative points outside it (outside). Two solutions are proposed: Prompt-Point Generation (PPG) and Proactive Query of Prompt-Points (PQPP).
PPG directly uses the MLLM to generate the prompt points and bounding box; however, learning to generate multiple points at once is challenging, so only a small number of prompt points is used. PQPP instead exploits the MLLM's dialog capability: it first asks for a rough bounding box and then probes multiple points of interest inside that box through question answering, which are used to prompt SAM.
- SAM4MLLM-PPG
PPG uses an MLLM that accepts both text prompts and image inputs. To align the MLLM with the segmentation task, parameter-efficient fine-tuning with LoRA is applied, and the model is trained on RES datasets containing image-text pairs and ground-truth masks. The LoRA outputs location prompts, including a bounding box \(Prompt_B \in \mathbb{N}^4\) and \(k\) groups of positive and negative points \(Prompt_P \in \mathbb{N}^{(n_1+n_2)k \times 3}\), as shown in Figure (a), where each group contains \(n_1\) positive points and \(n_2\) negative points (\(n_1=2, n_2=1\)).
To provide positional supervision for LoRA, \(K\) groups of points (\(K>k\)) are randomly sampled from the object mask during the training phase and sent to SAM as prompts. For each group, SAM outputs a segmentation result. Prompts whose results have a lower IoU with the ground-truth mask are filtered out, and only the top \(k\) groups are kept (as shown in (c)). In this implementation only the text loss (autoregressive cross-entropy) is required; \(K\) is typically 64 and \(k=1\).
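The selection step can be sketched as follows; `run_sam` is a hypothetical helper (e.g. wrapping `SamPredictor.predict`) that returns a binary mask, and the group sizes follow the \(n_1=2, n_2=1\) setting mentioned above.

```python
# Sketch of the PPG supervision step: sample K candidate point groups, score
# each by the IoU of the SAM mask it produces against the ground truth, and
# keep the top k groups as text targets. `run_sam(box, points, labels)` is a
# hypothetical helper returning a binary H x W mask.
import numpy as np

def select_point_groups(gt_mask, box, run_sam, K=64, k=1, n_pos=2, n_neg=1, seed=0):
    rng = np.random.default_rng(seed)
    pos_ys, pos_xs = np.nonzero(gt_mask)
    neg_ys, neg_xs = np.nonzero(gt_mask == 0)
    scored = []
    for _ in range(K):
        pi = rng.integers(0, len(pos_xs), size=n_pos)   # points on the mask
        ni = rng.integers(0, len(neg_xs), size=n_neg)   # points off the mask
        points = np.concatenate([np.stack([pos_xs[pi], pos_ys[pi]], axis=1),
                                 np.stack([neg_xs[ni], neg_ys[ni]], axis=1)])
        labels = np.array([1] * n_pos + [0] * n_neg)
        pred = run_sam(box, points, labels).astype(bool)
        iou = np.logical_and(pred, gt_mask).sum() / max(np.logical_or(pred, gt_mask).sum(), 1)
        scored.append((iou, points, labels))
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:k]   # the retained groups become the LoRA text targets
```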
In the inference stage, the points output by LoRA are sent directly to SAM to perform segmentation, as shown in Figure (b).
- SAM4MLLM-PQPP
PQPP exploits the MLLM's query-response capability instead of directly generating prompts. Prompt points are sampled and the MLLM is proactively queried about whether each point lies inside (or outside) the mask. In the training phase, a bounding box and \(K\) groups of points are randomly sampled based on the ground-truth mask, and training uses two rounds of dialog. In the first round, LoRA responds with a bounding box; in the second round, for each of the \((n_1+n_2)K\) points, LoRA answers whether the point is within the mask (yes or no).
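A minimal sketch of how such a two-round training dialog could be assembled is given below; the prompt wording and message format are assumptions, not the repo's actual templates.

```python
# Illustrative two-round PQPP training dialog: round 1 asks for a bounding box,
# round 2 asks a yes/no question for each sampled point. Wording and tags are
# hypothetical; only the structure mirrors the description above.
def build_pqpp_dialog(expression, box, points, gt_mask):
    x1, y1, x2, y2 = box
    dialog = [
        {"role": "user", "content": f'Where is "{expression}"? Answer with a bounding box.'},
        {"role": "assistant", "content": f"<box>{x1},{y1},{x2},{y2}</box>"},
    ]
    for x, y in points:
        dialog.append({"role": "user", "content": f"Is the point ({x},{y}) on the object?"})
        dialog.append({"role": "assistant", "content": "yes" if gt_mask[y, x] else "no"})
    return dialog
```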
In the inference stage, LoRA outputs a bounding box for the input text query and image in the first round. Points are then sampled uniformly on a grid within the bounding box and sent to the MLLM-LoRA in the second round, which is asked whether each is a positive (or negative) point, before SAM performs segmentation. The grid size is typically set to \(5\times 5\). Before the prompt points are sent to SAM, low-confidence points are removed so that only high-quality prompts remain.
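The grid query and confidence filtering could look roughly like this; `ask_point` is a hypothetical wrapper around the second-round MLLM query that returns a yes/no answer and a confidence score.

```python
# Sketch of PQPP inference: sample a uniform grid inside the predicted box,
# query the MLLM about each point, and keep only confident answers as SAM
# prompts. `ask_point(x, y)` is a hypothetical helper returning
# (is_positive, confidence).
import numpy as np

def grid_prompts(box, ask_point, grid=5, conf_thresh=0.8):
    x1, y1, x2, y2 = box
    xs = np.linspace(x1, x2, grid).round().astype(int)
    ys = np.linspace(y1, y2, grid).round().astype(int)
    points, labels = [], []
    for y in ys:
        for x in xs:
            is_pos, conf = ask_point(x, y)     # second-round MLLM query
            if conf >= conf_thresh:            # drop low-confidence points
                points.append([x, y])
                labels.append(1 if is_pos else 0)
    return np.array(points), np.array(labels)  # fed to SAM together with the box
```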
RES training
To align the base MLLM with the RES task, three datasets containing RES-related examples are used to guide the model toward the goal. Two of them (the RES datasets and the gRefCOCO dataset) contain RES data with ground-truth masks; the third (VQA) is a visual dialog dataset without masks, used to further enhance the overall capability of joint vision-language understanding.
During training, to preserve the MLLM's ability to generalize across images, most of the network parameters are frozen and only the MLLM's visual resampler and the LoRA adapter are tuned.
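A minimal sketch of this selective fine-tuning is shown below, assuming a PyTorch module whose resampler and LoRA weights can be recognized by parameter name; the name substrings are assumptions and vary across MLLM backbones.

```python
# Sketch: freeze everything, then re-enable gradients only for the visual
# resampler and LoRA adapter weights. The substrings "resampler" and "lora_"
# are assumptions; actual parameter names depend on the MLLM implementation.
import torch.nn as nn

def freeze_except_resampler_and_lora(mllm: nn.Module) -> None:
    for name, param in mllm.named_parameters():
        param.requires_grad = ("resampler" in name) or ("lora_" in name)
```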
For all the datasets mentioned above, we do not use data augmentation during training because flipping and/or cropping may change the relative positions or relationships of objects in the images.
Main experiments