Paper: SAM4MLLM: Enhance Multi-Modal Large Language Model for Referring Expression Segmentation
- Paper address: /abs/2409.10542
- Paper code: /AI-Application-and-Integration-Lab/SAM4MLLM
Innovations
- A method is proposed that enables an MLLM to understand pixel-level detail. SAM4MLLM requires no changes to the MLLM architecture, no new tokens, and no additional losses; the approach is simple yet very effective for referring expression segmentation (RES).
- To connect the MLLM and SAM, a novel way of obtaining prompt-point cues by proactively querying the language model is introduced.
- Experiments on various RES benchmarks, including the RES datasets, GRES, and ReasonSeg, verify the effectiveness of SAM4MLLM and demonstrate its strong performance on complex pixel-aware tasks.
Content overview
SAM4MLLM is an approach that integrates the Segment Anything Model (SAM) with multimodal large language models (MLLMs) to realize pixel-aware tasks.
- First, pixel-level information is introduced into the MLLM training data without changing the original MLLM architecture, allowing the MLLM to understand pixel-level information with the same textual cross-entropy loss used by mainstream LLMs.
- Second, given the limited input resolution and a model architecture not explicitly designed for visual tasks, the MLLM may have limitations in expressing pixels. SAM is therefore used as a post-processing step to refine the MLLM output, obtaining higher-precision segmentation masks in a relatively simple way.
- Finally, to connect SAM and the MLLM in a simple way, the MLLM generates SAM prompt points. Leveraging the LLM's conversational ability, the MLLM is proactively queried to obtain effective prompt points for SAM.
SAM4MLLM addresses the RES problem by enabling MLLMs to learn pixel-level location information. It combines detailed visual information with the powerful expressive capability of a large language model in a unified, language-based manner, without additional computational overhead during learning.
SAM4MLLM
Encoding the segmentation mask as SAM prompts
Existing MLLMs for segmentation rely on specially designed model architectures, dedicated segmentation tokens, and heterogeneous loss functions to predict object masks. SAM4MLLM instead exploits a property of SAM: a small number of textual prompt tokens (a bounding box plus a few points indicating whether they lie in the object region) can be converted into a high-quality, continuous segmentation mask.
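To make the box-plus-points prompting concrete, here is a minimal sketch using the public `segment_anything` API (not part of the paper's code); the checkpoint path, image, and prompt coordinates are placeholders.

```python
# Minimal sketch: SAM turns a bounding box plus a few labeled points into a mask.
# Assumes the public `segment_anything` package and a locally downloaded ViT-H
# checkpoint; the image and prompt coordinates below are placeholders.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # assumed path
predictor = SamPredictor(sam)

image = np.zeros((480, 640, 3), dtype=np.uint8)  # replace with a real RGB image
predictor.set_image(image)

box = np.array([120, 80, 340, 260])              # x1, y1, x2, y2
points = np.array([[200, 150], [310, 250]])      # (x, y) pixel coordinates
labels = np.array([1, 0])                        # 1 = on the object, 0 = off it

masks, scores, _ = predictor.predict(
    point_coords=points,
    point_labels=labels,
    box=box,
    multimask_output=False,
)
print(masks.shape)  # (1, H, W) boolean mask
```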
SAM4MLLM uses points sampled within a bounding box as discrete prompts. Specifically, a bounding box \(Prompt_B \in \mathbb{N}^4\) and \(\mathcal{K}\) points are used to encode an arbitrarily shaped mask. Each of the \(\mathcal{K}\) point prompts contains three values: the \(x\) coordinate, the \(y\) coordinate, and whether the point lies on the mask, encoded together as \(Prompt_P \in \mathbb{N}^{\mathcal{K} \times 3}\).
Encoding the continuous segmentation mask as discrete SAM prompts avoids adding any tokens or changing the model structure, and training uses only the textual autoregressive cross-entropy loss. This is consistent with the language model's original training paradigm, allowing MLLMs to understand pixel-level information and making future model extensions easier.
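As an illustration of this encoding, the sketch below converts a binary ground-truth mask into a bounding box \(Prompt_B\) and a few point triples \((x, y, \text{on-mask})\) and serializes them as plain text; the `<box>`/`<pt>` tags are hypothetical and not the paper's exact textual format.

```python
# Illustrative sketch (not the paper's exact text format): encode a binary mask
# as a bounding box Prompt_B and K point triples Prompt_P = (x, y, on-mask),
# serialized as plain text so only the usual LM cross-entropy loss is needed.
import numpy as np

def encode_mask_as_prompts(mask: np.ndarray, num_points: int = 3, seed: int = 0) -> str:
    rng = np.random.default_rng(seed)
    ys, xs = np.nonzero(mask)
    x1, y1, x2, y2 = xs.min(), ys.min(), xs.max(), ys.max()   # Prompt_B
    # Sample points inside the box and record whether each lies on the mask.
    px = rng.integers(x1, x2 + 1, size=num_points)
    py = rng.integers(y1, y2 + 1, size=num_points)
    pts = " ".join(f"<pt>{x},{y},{int(mask[y, x])}</pt>" for x, y in zip(px, py))
    return f"<box>{x1},{y1},{x2},{y2}</box> {pts}"            # hypothetical tags

mask = np.zeros((256, 256), dtype=np.uint8)
mask[64:192, 96:224] = 1                                      # toy rectangular mask
print(encode_mask_as_prompts(mask))
```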
Prompting SAM with the MLLM
To integrate SAM and the MLLM in a unified way, one of the main problems is obtaining SAM's prompt points, i.e. positive points inside the object mask region (inside) and negative points outside it (outside). Two solutions are proposed: Prompt-Point Generation (PPG) and Proactive Query of Prompt-Points (PQPP).
PPG directly uses the MLLM to generate the prompt points and bounding box; however, learning to generate multiple points at once is challenging, so only a small number of prompt points is used. PQPP instead exploits the MLLM's dialog capability: it first asks for a rough bounding box and then probes multiple points of interest inside that box through question answering, which are used to prompt SAM.
- SAM4MLLM-PPG
PPG uses an MLLM that accepts both text prompts and image inputs. To align the MLLM with the segmentation task, parameter-efficient fine-tuning with LoRA is applied, and the model is trained on RES datasets containing image-text pairs and ground-truth masks. The LoRA outputs location prompts, including a bounding box \(Prompt_B \in \mathbb{N}^4\) and \(k\) groups of positive and negative points \(Prompt_P \in \mathbb{N}^{(n_1+n_2)k \times 3}\), as shown in Figure (a), where each group contains \(n_1\) positive points and \(n_2\) negative points (\(n_1=2, n_2=1\)).
To provide positional supervision for LoRA, \(K\) groups of points (\(K>k\)) are randomly sampled from the object mask during the training phase and sent to SAM as prompts. For each group, SAM outputs a segmentation result. Prompts whose results have a lower IoU with the ground-truth mask are filtered out, and only the top \(k\) groups are kept (as shown in (c)). In this implementation only the text loss (autoregressive cross-entropy) is required; \(K\) is typically 64 and \(k=1\).
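The selection step can be sketched as follows; `run_sam` is a hypothetical helper (e.g. wrapping `SamPredictor.predict`) that returns a binary mask, and the group sizes follow the \(n_1=2, n_2=1\) setting mentioned above.

```python
# Sketch of the PPG supervision step: sample K candidate point groups, score
# each by the IoU of the SAM mask it produces against the ground truth, and
# keep the top k groups as text targets. `run_sam(box, points, labels)` is a
# hypothetical helper returning a binary H x W mask.
import numpy as np

def select_point_groups(gt_mask, box, run_sam, K=64, k=1, n_pos=2, n_neg=1, seed=0):
    rng = np.random.default_rng(seed)
    pos_ys, pos_xs = np.nonzero(gt_mask)
    neg_ys, neg_xs = np.nonzero(gt_mask == 0)
    scored = []
    for _ in range(K):
        pi = rng.integers(0, len(pos_xs), size=n_pos)   # points on the mask
        ni = rng.integers(0, len(neg_xs), size=n_neg)   # points off the mask
        points = np.concatenate([np.stack([pos_xs[pi], pos_ys[pi]], axis=1),
                                 np.stack([neg_xs[ni], neg_ys[ni]], axis=1)])
        labels = np.array([1] * n_pos + [0] * n_neg)
        pred = run_sam(box, points, labels).astype(bool)
        iou = np.logical_and(pred, gt_mask).sum() / max(np.logical_or(pred, gt_mask).sum(), 1)
        scored.append((iou, points, labels))
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:k]   # the retained groups become the LoRA text targets
```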
In the inference stage, the points output by LoRA are sent directly to SAM to perform segmentation, as shown in Figure (b).
- SAM4MLLM-PQPP
PQPP exploits the MLLM's query-response capability instead of directly generating prompts. Prompt points are sampled and the MLLM is proactively queried about whether each point lies inside (or outside) the mask. In the training phase, a bounding box and \(K\) groups of points are randomly sampled based on the ground-truth mask, and training uses two rounds of dialog. In the first round, LoRA responds with a bounding box; in the second round, for each of the \((n_1+n_2)K\) points, LoRA answers whether the point is within the mask (yes or no).
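A minimal sketch of how such a two-round training dialog could be assembled is given below; the prompt wording and message format are assumptions, not the repo's actual templates.

```python
# Illustrative two-round PQPP training dialog: round 1 asks for a bounding box,
# round 2 asks a yes/no question for each sampled point. Wording and tags are
# hypothetical; only the structure mirrors the description above.
def build_pqpp_dialog(expression, box, points, gt_mask):
    x1, y1, x2, y2 = box
    dialog = [
        {"role": "user", "content": f'Where is "{expression}"? Answer with a bounding box.'},
        {"role": "assistant", "content": f"<box>{x1},{y1},{x2},{y2}</box>"},
    ]
    for x, y in points:
        dialog.append({"role": "user", "content": f"Is the point ({x},{y}) on the object?"})
        dialog.append({"role": "assistant", "content": "yes" if gt_mask[y, x] else "no"})
    return dialog
```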
In the inference stage, LoRA outputs a bounding box for the input text query and image in the first round. Points are then sampled uniformly on a grid within the bounding box and sent to the MLLM-LoRA in the second round, which is asked whether each is a positive (or negative) point, before SAM performs segmentation. The grid size is typically set to \(5\times 5\). Before the prompt points are sent to SAM, low-confidence points are removed so that only high-quality prompts remain.
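The grid query and confidence filtering could look roughly like this; `ask_point` is a hypothetical wrapper around the second-round MLLM query that returns a yes/no answer and a confidence score.

```python
# Sketch of PQPP inference: sample a uniform grid inside the predicted box,
# query the MLLM about each point, and keep only confident answers as SAM
# prompts. `ask_point(x, y)` is a hypothetical helper returning
# (is_positive, confidence).
import numpy as np

def grid_prompts(box, ask_point, grid=5, conf_thresh=0.8):
    x1, y1, x2, y2 = box
    xs = np.linspace(x1, x2, grid).round().astype(int)
    ys = np.linspace(y1, y2, grid).round().astype(int)
    points, labels = [], []
    for y in ys:
        for x in xs:
            is_pos, conf = ask_point(x, y)     # second-round MLLM query
            if conf >= conf_thresh:            # drop low-confidence points
                points.append([x, y])
                labels.append(1 if is_pos else 0)
    return np.array(points), np.array(labels)  # fed to SAM together with the box
```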
RES training
To align the base MLLM with the RES task, three datasets containing RES-related examples are used to guide the model toward the goal. Two of them (the RES datasets and the gRefCOCO dataset) contain RES data with ground-truth masks; the third (VQA) is a visual dialog dataset without masks, used to further enhance the overall capability of joint vision-language understanding.
During training, to preserve the MLLM's ability to generalize across images, most of the network parameters are frozen and only the MLLM's visual resampler and the LoRA adapter are tuned.
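A minimal sketch of this selective fine-tuning is shown below, assuming a PyTorch module whose resampler and LoRA weights can be recognized by parameter name; the name substrings are assumptions and vary across MLLM backbones.

```python
# Sketch: freeze everything, then re-enable gradients only for the visual
# resampler and LoRA adapter weights. The substrings "resampler" and "lora_"
# are assumptions; actual parameter names depend on the MLLM implementation.
import torch.nn as nn

def freeze_except_resampler_and_lora(mllm: nn.Module) -> None:
    for name, param in mllm.named_parameters():
        param.requires_grad = ("resampler" in name) or ("lora_" in name)
```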
For all the datasets mentioned above, we do not use data augmentation during training because flipping and/or cropping may change the relative positions or relationships of objects in the images.
Main experiments