Panoptic scene graph generation (PSG) aims to segment the objects in an image and recognize the relationships between them, achieving a structured understanding of the image. Previous approaches focus on predicting predefined object and relation classes, which limits their application to open-world scenarios. Thanks to rapid progress in large multimodal models (LMMs), open-set object detection and segmentation have advanced significantly, but open-set relation prediction in PSG remains unexplored. The paper focuses on this open-set relation prediction task and combines it with a pre-trained open-set panoptic segmentation model to achieve truly open-set panoptic scene graph generation (OpenPSG). OpenPSG leverages LMMs to perform open-set relation prediction in an autoregressive manner and introduces a relation query transformer that efficiently extracts the visual features of object pairs and estimates whether a relationship exists between them; the latter improves prediction efficiency by filtering out irrelevant pairs. Finally, the paper designs generation and judgement instructions so that open-set relation prediction in PSG is carried out autoregressively. Extensive experiments show that the method achieves state-of-the-art performance in open-set relation prediction and panoptic scene graph generation.
Paper: OpenPSG: Open-set Panoptic Scene Graph Generation via Large Multimodal Models
- Paper: https://arxiv.org/abs/2407.11213
- Code: https://github.com/franciszzj/OpenPSG
Introduction
Panoptic scene graph generation (PSG) aims to segment the objects in an image and recognize the relationships between them, constructing a panoptic scene graph for a structured understanding of the image. Given its significant potential in applications such as visual question answering, image captioning, and embodied navigation, PSG has attracted considerable attention from researchers since its introduction.
Previous PSG methods can only predict closed-set object and relation categories and cannot recognize objects or relations beyond the predefined classes. In recent years, with the emergence of large multimodal models (LMMs) such as CLIP and BLIP-2, a large number of open-set prediction methods for object detection and segmentation have appeared, thanks to the rich language understanding of LMMs and the strong associations they establish between vision and language. However, open-set relation prediction has so far remained unexplored.
Compared with open-set object detection and segmentation, open-set relation prediction is more complex: the model must not only understand the individual objects but also recognize the relationship between an object pair based on their interaction. Moreover, the number of object pairs to consider grows quadratically with the number of objects. To address this problem, the paper focuses on open-set relation prediction.
Large language models (LLMs) have demonstrated excellent semantic analysis and comprehension in a variety of multimodal tasks. In particular, when processing text, LMMs are not only good at parsing nouns (which represent objects) but also pay considerable attention to predicates (which represent the relationships between objects), thereby keeping the generated content coherent. Inspired by this, the paper proposes OpenPSG, an open-set panoptic scene graph generation architecture that uses a large multimodal model (e.g., BLIP-2) to achieve open-set relation prediction.
Specifically, the model consists of three parts. The first is an open-set panoptic segmenter adapted from an existing model (e.g., OpenSeeD), which extracts open-set object categories, masks, and visual features of the whole image, from which object pairs and pair masks are formed. The second is the relation query transformer, which serves two purposes: extracting the visual features of object pairs based on the pair masks, with particular attention to the interactions between the objects; and estimating whether a potential relationship exists between an object pair. These two functions are realized by two sets of queries, i.e., pair feature extraction queries and relation existence estimation queries. Only object pairs judged to be potentially related are fed into the third part, the multimodal relation decoder. The decoder is inherited directly from the LMM and predicts open-set relations for a given object pair in an autoregressive manner, conditioned on a specifically designed textual instruction and the pre-extracted pair visual features.
The paper is the first to propose the open-set panoptic scene graph generation task, which requires open-set prediction of both object masks and relations. Extensive experiments show that OpenPSG achieves state-of-the-art results in the closed-set setting and performs strongly in the open-set setting.
Task Definition
Given an image \(I \in \mathbb{R}^{H \times W \times 3}\), the goal of open-set panoptic scene graph generation is to extract an open-set panoptic scene graph \(G = \{O, R\}\) from the image \(I\), where \(H\) and \(W\) are the height and width of the image.
- \(O = \{o_i\}_{i=1}^{N}\) denotes the \(N\) objects segmented from the image. Each object is defined as \(o_i = \{c, m\}\), where \(c\) is the object class, which may belong to the predefined base object classes \(C_{base}\) or the undefined novel object classes \(C_{novel}\), and \(m \in \{0, 1\}^{H \times W}\) is the binary mask of the object.
- \(R = \{r_{i,j} \mid i,j \in \{1, 2, \ldots, N\}, i \neq j\}\) denotes the relationships between objects, where \(r_{i,j}\) is the relationship between \(o_i\) and \(o_j\), with \(o_i\) as the subject and \(o_j\) as the object. Each relationship \(r\) may belong to the predefined base relation categories \(K_{base}\) or the undefined novel relation categories \(K_{novel}\).
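To make the target output concrete, the following is a minimal sketch of the graph \(G = \{O, R\}\) as plain Python data classes; the class and field names are illustrative and not taken from the paper's code.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class SceneObject:
    category: str      # object class c, from C_base or C_novel
    mask: np.ndarray   # binary mask m of shape (H, W)

@dataclass
class Relation:
    subject_idx: int   # index i of the subject o_i
    object_idx: int    # index j of the object o_j
    predicate: str     # relation category, from K_base or K_novel

@dataclass
class PanopticSceneGraph:
    objects: List[SceneObject]   # O = {o_i}
    relations: List[Relation]    # R = {r_ij}
```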
Method
As shown in Figure 2 of the paper, OpenPSG contains three components: an object segmenter, a relation query transformer (RelQ-Former), and a multimodal relation decoder (RelDecoder).
The object segmenter converts the input image into object categories and masks, together with visual features of the whole image, using a pre-trained open-set panoptic segmentation model. The object categories, masks, and visual features are then fed into the RelQ-Former, which, through two sets of learnable queries combined with the designed instructions, produces visual features of object pairs in a format compatible with the LMM input as well as judgements about whether a potential relationship exists. Finally, only the object pairs judged to have a potential relationship are passed to the RelDecoder for open-set relation prediction, ultimately producing an open-set panoptic scene graph.
Object Segmenter
Given an image \(I\), a pre-trained open-set object segmenter (e.g., OpenSeeD) predicts the objects \(O\) in the image and the visual features of the whole image \(F_I \in \mathbb{R}^{h \times w \times D}\). Here, \(h\) and \(w\) denote the height and width of \(F_I\), respectively, and \(D\) denotes the feature dimension.
The architecture of the segmenter is similar to Mask2Former and includes a pixel decoder; the whole-image visual features \(F_I\) are the features output by the pixel decoder. The paper develops a patchify module and a pairwise module to process the output of the segmenter and produce the input of the RelQ-Former.
- Patchify Module

The patchify module serializes the visual features \(F_I\) and the object masks \(m\) so that they can be processed by the RelQ-Former as input.
Similar to the patchify layer at the input of a vision transformer (ViT), a single convolutional layer is used to convert the extracted \(F_I\) into a sequence of visual tokens \(F_{Iseq} \in \mathbb{R}^{L \times D}\), where \(L\) is the number of patches and \(D\) is the feature dimension. When the kernel size and stride of the convolutional layer are both \(p\), \(L\) is given by \(L = \frac{h}{p} \times \frac{w}{p}\).

Likewise, each extracted object mask \(m_i\) is downsampled by nearest-neighbour interpolation to height \(\frac{h}{p}\) and width \(\frac{w}{p}\) and then reshaped into a one-dimensional vector of length \(L\). Applying the same procedure to all masks yields the mask sequence of all objects, \(m_{seq} \in \{0, 1\}^{N \times L}\).
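A minimal PyTorch sketch of this patchify step is given below; the feature dimension, tensor layouts, and module name are assumptions for illustration rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Patchify(nn.Module):
    """Serialize pixel-decoder features and object masks into token sequences."""

    def __init__(self, dim: int = 256, p: int = 8):
        super().__init__()
        self.p = p
        # Kernel size == stride == p, so every p x p patch becomes one token.
        self.proj = nn.Conv2d(dim, dim, kernel_size=p, stride=p)

    def forward(self, feat: torch.Tensor, masks: torch.Tensor):
        # feat:  (1, D, h, w) whole-image features F_I from the pixel decoder
        # masks: (N, h, w)    binary object masks at the same resolution
        h, w = feat.shape[-2:]
        tokens = self.proj(feat)                                # (1, D, h/p, w/p)
        f_iseq = tokens.flatten(2).transpose(1, 2)              # (1, L, D), L = (h/p)*(w/p)
        # Nearest-neighbour interpolation keeps the masks binary.
        m = F.interpolate(masks[None].float(),
                          size=(h // self.p, w // self.p), mode="nearest")
        m_seq = m[0].flatten(1).bool()                          # (N, L)
        return f_iseq, m_seq
```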
- Pairwise Module

The pairwise module constructs subject-object pairs. Given the \(N\) objects in image \(I\), all objects in \(O\) are combined pairwise into subject-object pairs \(P = \{(o_i, o_j) \mid i, j \in \{1, 2, \ldots, N\}, i \neq j\}\). The number of pairs in \(P\) is \(N \times (N - 1)\), which grows quadratically as \(N\) increases. The set of combined pair categories \(c^{pair} = \{(c_i, c_j) \mid i, j \in \{1, 2, \ldots, N\}, i \neq j\}\) is obtained in the same way.
For each subject-object pair, a logical OR is applied to the two mask sequences in \(m_{seq}\) corresponding to the objects with indices \(i\) and \(j\). Applying this operation to all subject-object pairs yields the pair mask sequence \(m_{seq}^{pair} \in \{0, 1\}^{N \times (N-1) \times L}\), where \(L\) is the number of patches.
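Below is a small sketch of the pairwise module under the same assumptions (the function and variable names are illustrative): it enumerates the \(N \times (N-1)\) ordered pairs and builds each pair mask with a logical OR.

```python
import torch

def build_pairs(categories, m_seq: torch.Tensor):
    """Enumerate all ordered subject-object pairs and OR their patch-level masks."""
    # categories: list of N class names; m_seq: (N, L) boolean mask sequence
    n = m_seq.shape[0]
    pair_index, pair_cats, pair_masks = [], [], []
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            pair_index.append((i, j))
            pair_cats.append((categories[i], categories[j]))   # element of c^pair
            pair_masks.append(m_seq[i] | m_seq[j])             # logical OR of the two masks
    m_seq_pair = torch.stack(pair_masks)                       # (N*(N-1), L)
    return pair_index, pair_cats, m_seq_pair
```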
Relation Query Transformer
The relation query transformer takes the obtained \(F_{Iseq}\), \(c^{pair}\), and \(m_{seq}^{pair}\) and uses two different types of queries, i.e., pair feature extraction queries and relation existence estimation queries, combined with customized instructions. This helps extract subject-object pair features and estimate which subject-object pairs are likely to be related.
- Pair Feature Extraction Query

The goal of the pair feature extraction query is to extract the features of a subject-object pair from the whole-image visual features based on the pair mask.

A common extraction method is mask pooling, which treats every region covered by the pair mask equally. However, features used for relation prediction should focus more on the regions where the objects interact. The paper therefore uses an attention mechanism to let the pair features interact with the visual tokens of different regions in \(F_{Iseq}\), which enhances the regions that are critical for relation prediction. In addition, inspired by previous research, the paper designs an instruction that helps this learnable query understand its purpose of extracting pair features.
Specifically, for each subject-object pair \((o_i, o_j)\), the pair feature extraction query \(Q^{feat} \in \mathbb{R}^{E \times D}\) is first fed into a self-attention layer (\(SA(\cdot)\)) together with a pair instruction specifically designed for this query. The pair instruction is converted by the tokenizer into \(F_{Inst}^{feat} \in \mathbb{R}^{X^{feat} \times D}\) and specifies the function of the pair feature extraction query, i.e., "extract the features of the pair (\(c_i\), \(c_j\)) from the visual features based on the mask." Here \(E\) is the number of tokens of the pair feature extraction query and \(X^{feat}\) is the number of tokens of the pair instruction. Note that the category names of the subject and object \((c_i, c_j)\) are inserted into this pair instruction. This operation can be expressed as \(F_{SA}^{feat} = \mathrm{Trunc}_E\big(SA(\mathrm{Concat}(Q^{feat}, F_{Inst}^{feat}))\big)\), where \(\mathrm{Concat}(\cdot)\) denotes concatenation and \(\mathrm{Trunc}_E(\cdot)\) denotes a truncation that keeps only the first \(E\) features, i.e., those corresponding to the pair feature extraction query.
Next, a masked cross-attention layer (\(MaskCA(\cdot)\)) takes \(F_{SA}^{feat}\) as the query, \(F_{Iseq}\) as the key and value, and the pair mask from \(m_{seq}^{pair}\) as the attention mask, extracting the features \(F_{CA}^{feat}\) corresponding to the subject-object pair. The features \(F_{CA}^{feat}\) are further refined by a feed-forward network (\(FFN(\cdot)\)), giving \(F_{FFN}^{feat} = FFN(F_{CA}^{feat})\).

By repeating this process twice, the visual features of the subject-object pair to be fed into the multimodal relation decoder, \(F_{I}^{pair(i,j)} \in \mathbb{R}^{E \times D}\), are obtained. These operations are performed in parallel for all subject-object pairs to obtain the corresponding features of every pair.
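The following PyTorch sketch walks one pair through this query pipeline (self-attention over the concatenated query and instruction tokens, truncation to the first \(E\) tokens, masked cross-attention, FFN, repeated twice). Layer dimensions, residual/normalization details, and names are assumptions rather than the paper's exact code.

```python
import torch
import torch.nn as nn

class PairFeatureExtractor(nn.Module):
    """Simplified pair-feature branch of the RelQ-Former for a single pair."""

    def __init__(self, dim: int = 768, heads: int = 8, num_blocks: int = 2):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.ModuleDict({
                "self_attn": nn.MultiheadAttention(dim, heads, batch_first=True),
                "cross_attn": nn.MultiheadAttention(dim, heads, batch_first=True),
                "ffn": nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                     nn.Linear(4 * dim, dim)),
            }) for _ in range(num_blocks)
        ])

    def forward(self, q_feat, f_inst, f_iseq, pair_mask):
        # q_feat:    (1, E, D) learnable pair feature extraction query
        # f_inst:    (1, X, D) tokenized pair instruction (with subject/object names)
        # f_iseq:    (1, L, D) patchified whole-image features
        # pair_mask: (1, L)    OR-ed patch mask of the pair (True = inside the pair)
        E = q_feat.shape[1]
        x = q_feat
        for blk in self.blocks:
            # Self-attention over query + instruction, then keep the first E tokens.
            sa_in = torch.cat([x, f_inst], dim=1)
            sa_out, _ = blk["self_attn"](sa_in, sa_in, sa_in)
            x = sa_out[:, :E]
            # Masked cross-attention: queries attend only to patches inside the pair mask.
            ca_out, _ = blk["cross_attn"](x, f_iseq, f_iseq,
                                          key_padding_mask=~pair_mask)
            x = x + ca_out
            x = x + blk["ffn"](x)
        return x  # (1, E, D) pair features F_I^pair(i,j) for the relation decoder
```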
- Relation Existence Estimation Query

In addition to the pair feature extraction query, the paper designs a relation existence estimation query that determines whether any relationship may exist between subject \(o_i\) and object \(o_j\), without predicting the specific relation category. Its goal is to filter out irrelevant subject-object pairs and save computation in the subsequent LMM decoding.

Specifically, for each subject-object pair \((o_i, o_j)\), the relation existence estimation query \(Q^{exist} \in \mathbb{R}^{1 \times D}\) is fed, like the pair feature extraction query, through the self-attention, masked cross-attention, and feed-forward layers, interacting with \(F_{Iseq}\), \(m_{seq}\), and a specially designed relation instruction. The relation instruction directs the query to determine whether a relation may exist for the subject-object pair, e.g., "Is there a relationship between \(o_i\) and \(o_j\)?" After tokenization, the relation instruction produces \(F_{Inst}^{exist} \in \mathbb{R}^{X^{exist} \times D}\), where \(X^{exist}\) is the number of tokens.
Finally, the extracted feature is fed into a relation existence prediction layer, which consists of a 2-layer multilayer perceptron (MLP) followed by a sigmoid function that normalizes the predicted score to the range \([0, 1]\). Notably, the model is trained with binary labels indicating whether any relationship exists between the two objects, and during inference the score is used for filtering by the selector described below.
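A sketch of this prediction layer (hidden size and names assumed) could look like the following.

```python
import torch
import torch.nn as nn

class RelationExistenceHead(nn.Module):
    """2-layer MLP + sigmoid scoring whether any relation exists for a pair."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, q_exist_out: torch.Tensor) -> torch.Tensor:
        # q_exist_out: (num_pairs, D) output of the relation existence estimation
        # query after the self-attention / masked cross-attention / FFN stack
        return torch.sigmoid(self.mlp(q_exist_out)).squeeze(-1)   # scores in [0, 1]

# Training sketch: binary labels say whether any relation is annotated for the pair.
# loss_exist = nn.functional.binary_cross_entropy(scores, labels.float())
```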
- Selector

The selector filters out irrelevant subject-object pairs based on the scores produced by the 2-layer multilayer perceptron (MLP) above: only pairs whose score exceeds the threshold \(\theta\) are fed into the multimodal relation decoder. Compared with making predictions for all subject-object pairs, this yields roughly a \(20\times\) speed-up.
Multimodal Relation Decoder
The goal of the multimodal relation decoder is to use the subject-object pair features \(F_{I}^{pair(i,j)}\) extracted by the modules above, together with instructions that guide it, to perform open-set relation prediction. Inspired by previous research, a generation instruction is first designed to perform open-set relation prediction in an autoregressive manner. This works well but is found to be somewhat biased towards the more common relations. The paper therefore further designs a judgement instruction that exploits the strong analytical and reasoning capabilities of the LMM. The judgement instruction also works autoregressively, but is used to determine whether a specific relation exists between the two objects, which simplifies open-set relation prediction. The two instructions are detailed below.
- Generation Instruction

The generation instruction follows the design of instructions used in open-set object recognition, asking "What are the relationships between \(c_i\) and \(c_j\)?", where \(c_i\) and \(c_j\) are the names of the subject and the object. The instruction is converted by the tokenizer into the feature \(F_{inst}^{gen} \in \mathbb{R}^{X^{gen} \times D}\), where \(X^{gen}\) is the number of tokens of the generation instruction. The generation instruction feature \(F_{inst}^{gen}\) is fed, together with the pair feature \(F_{I}^{pair(i,j)}\), into the multimodal relation decoder \(Dec(\cdot)\), which predicts all possible relations in an autoregressive manner: \(r_{i,j} = Dec(\mathrm{Concat}(F_{I}^{pair(i,j)}, F_{inst}^{gen}))\).

If multiple relations are predicted, they are separated by the delimiter "[SEP]".
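Assuming a HuggingFace-style autoregressive decoder interface (the paper builds on BLIP-2's language decoder), a hedged sketch of the generation-instruction path is given below; `decoder`, `tokenizer`, and the embedding calls are placeholders, not the OpenPSG code.

```python
import torch

def generate_relations(decoder, tokenizer, pair_feat, c_i, c_j, max_new_tokens=20):
    """Autoregressively generate all relations for one subject-object pair."""
    # pair_feat: (1, E, D) pair features F_I^pair(i,j) from the RelQ-Former
    prompt = f"What are the relationships between {c_i} and {c_j}?"
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    text_emb = decoder.get_input_embeddings()(ids)            # (1, X_gen, D)
    inputs = torch.cat([pair_feat, text_emb], dim=1)          # pair features come first
    out = decoder.generate(inputs_embeds=inputs, max_new_tokens=max_new_tokens)
    text = tokenizer.decode(out[0], skip_special_tokens=False)
    # Multiple predicted relations are separated by the "[SEP]" delimiter.
    return [r.strip() for r in text.split("[SEP]") if r.strip()]
```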
- Judgement Instruction

Unlike the generation instruction, the judgement instruction asks the relation decoder to determine whether a given relation exists between the subject and the object, for example "Please judge whether the relationship \(r_k\) exists between \(c_i\) and \(c_j\)." In this case, only a "yes" or "no" answer from the multimodal relation decoder is required to determine the existence of the relation. Since feeding the decoder a complete judgement instruction for every relation would be very costly, the relation name is placed at the end of the instruction. During inference, the judgement instruction is split into two parts: the part before the relation name, which the tokenizer converts into \(F_{inst}^{judge}\), and the relation name itself, which is treated as \(F_{inst}^{rel}\).
Open-set relation prediction is then performed autoregressively. First, the pair features \(F_{I}^{pair(i,j)}\) are fed into the multimodal relation decoder together with \(F_{inst}^{judge}\): \(F_{prefix} = Dec(\mathrm{Concat}(F_{I}^{pair(i,j)}, F_{inst}^{judge}))\).

This result is cached for the subsequent computation of every relation. For each relation \(r_k\), the multimodal relation decoder then only needs to process \(F_{prefix}\) and \(F_{inst}^{rel(k)}\) to perform the relation prediction: \(J_{i,j,k} = Dec(\mathrm{Concat}(F_{prefix}, F_{inst}^{rel(k)}))\).

Here \(J_{i,j,k}\) denotes the judgement for the triplet \((o_i, r_k, o_j)\). When \(J_{i,j,k}\) is "yes", the relation \(r_k\) exists between \(o_i\) and \(o_j\); otherwise it does not. In this way, the prediction time remains comparable to that of the generation instruction.
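Under the same decoder assumptions, the judgement path can be sketched as below: the pair features and the instruction prefix are encoded once, the resulting key/value cache is reused, and only the short relation-name suffix is decoded for each candidate relation. The cache-handling calls are assumptions about a generic HuggingFace-style API, not the paper's exact implementation.

```python
import torch

@torch.no_grad()
def judge_relations(decoder, tokenizer, pair_feat, c_i, c_j, relation_names):
    """Answer yes/no for each candidate relation, reusing the cached prefix."""
    prefix = f"Please judge whether there is a relationship between {c_i} and {c_j}:"
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    prefix_emb = torch.cat([pair_feat, decoder.get_input_embeddings()(prefix_ids)], dim=1)
    prefix_out = decoder(inputs_embeds=prefix_emb, use_cache=True)   # F_prefix, cached

    predicted = []
    for rel in relation_names:                                       # e.g. "on", "holding"
        rel_ids = tokenizer(" " + rel, return_tensors="pt").input_ids
        rel_emb = decoder.get_input_embeddings()(rel_ids)
        out = decoder(inputs_embeds=rel_emb,
                      past_key_values=prefix_out.past_key_values)    # reuse the prefix cache
        answer = tokenizer.decode(out.logits[0, -1].argmax()).strip().lower()
        if answer.startswith("yes"):
            predicted.append(rel)            # the triplet (o_i, rel, o_j) is accepted
    return predicted
```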
Performing the above process for every subject-object pair that may be related yields the final open-set relation prediction. The variant that uses the generation instruction is referred to as OpenPSG-G, and the variant that uses the judgement instruction as OpenPSG-J; OpenPSG refers to the latter by default.
Loss Function
Two loss functions are involved in training: a binary cross-entropy loss \(\mathcal{L}_{exist}\) for estimating relation existence via the relation existence estimation query in the relation query transformer, and a cross-entropy loss \(\mathcal{L}_{LM}\) consistent with the language-model training used by the multimodal relation decoder. The total loss is \(\mathcal{L} = \lambda \mathcal{L}_{exist} + \mathcal{L}_{LM}\), where \(\lambda\) is a weighting factor.
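As a small illustration (tensor shapes and names are assumed), the total loss can be computed as follows.

```python
import torch.nn.functional as F

def total_loss(exist_scores, exist_labels, lm_logits, lm_targets, lam: float = 10.0):
    """L = lambda * L_exist + L_LM, with lambda = 10 as in the implementation details."""
    # exist_scores / exist_labels: (num_pairs,) sigmoid scores and binary ground truth
    l_exist = F.binary_cross_entropy(exist_scores, exist_labels.float())
    # lm_logits: (num_kept_pairs, T, vocab); lm_targets: (num_kept_pairs, T) token ids
    l_lm = F.cross_entropy(lm_logits.reshape(-1, lm_logits.size(-1)),
                           lm_targets.reshape(-1))
    return lam * l_exist + l_lm
```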
Implementation details
In the experiments, a pre-trained OpenSeeD is used as the open-set object segmenter. The patch size \(p\) of the patchify module is set to 8. In the relation query transformer, the length of the pair feature extraction query \(E\) is 32, and the threshold \(\theta\) used to filter subject-object pairs is set to 0.35. The multimodal relation decoder uses the decoder of BLIP-2. During training, the loss weighting factor \(\lambda\) is set to 10. The same data augmentation strategy as in previous methods is used. The AdamW optimizer is used with a learning rate of \(1e^{-4}\) and a weight decay of \(5e^{-2}\). Training lasts 12 epochs in total, and the learning rate is reduced to \(1e^{-5}\) at epoch 8. The experiments are conducted on four A100 GPUs. Note that during training, the parameters of the object segmenter and the multimodal relation decoder are frozen, and only the proposed RelQ-Former is trained.
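The freezing and optimization schedule above can be summarized in a short sketch; the module attribute names such as `segmenter`, `relq_former`, and `rel_decoder` are assumed for illustration.

```python
import torch

def build_optimizer(model):
    """Freeze the segmenter and relation decoder; train only the RelQ-Former."""
    for p in model.segmenter.parameters():
        p.requires_grad = False
    for p in model.rel_decoder.parameters():
        p.requires_grad = False
    optim = torch.optim.AdamW(model.relq_former.parameters(), lr=1e-4, weight_decay=5e-2)
    # Drop the learning rate to 1e-5 (x0.1) at epoch 8 of the 12-epoch schedule.
    sched = torch.optim.lr_scheduler.MultiStepLR(optim, milestones=[8], gamma=0.1)
    return optim, sched
```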
Experiments