
OOOPS: Zero-Shot Open Panoramic Segmentation with a 360° Field of View, Open Source | ECCV'24

Published: 2024-10-10 13:42:54

Panoramic images capture a 360° field of view (FoV) containing omnidirectional spatial information that is critical for scene understanding. However, obtaining enough densely labeled panoramas for training is costly, and models trained in a closed-vocabulary setting are limited in application. To address this, the paper defines a new task called Open Panoramic Segmentation (OPS). In this task, models are trained with FoV-restricted pinhole images in the source domain and evaluated with FoV-open panoramic images in the target domain, enabling zero-shot open panoramic semantic segmentation. The paper further proposes a model called OOOPS, which incorporates a Deformable Adapter Network (DAN) to significantly improve zero-shot panoramic semantic segmentation performance. To further strengthen distortion-aware modeling learned from the pinhole source domain, the paper proposes a new data augmentation method called Random Equirectangular Projection (RERP), designed specifically to anticipate object deformation. Combining OOOPS with RERP outperforms other state-of-the-art open-vocabulary semantic segmentation methods on the OPS task across three panoramic datasets, WildPASS, Stanford2D3D, and Matterport3D, proving especially effective outdoors: +2.2% mIoU on outdoor WildPASS and +2.4% mIoU on indoor Stanford2D3D.

Paper: Open Panoramic Segmentation

  • Paper address: /abs/2407.02685
  • Paper code: /publications/OPS/

Introduction


Panoramic imaging systems have advanced significantly in recent years, enabling a wide range of panoramic vision applications. Thanks to their comprehensive 360° field of view, panoramas provide richer visual cues for perceiving the surrounding environment, enabling more complete and immersive capture of environmental data across many scene understanding tasks. This wide-angle view extends far beyond the limited scope of pinhole images and significantly enhances the ability of computer vision systems to perceive and parse the environment in a variety of applications. While the benefits of panoramic images over pinhole images are clear, several noteworthy challenges must be considered:

  1. The challenge of a wider field of view. Figure 1a shows the performance degradation of state-of-the-art open-vocabulary semantic segmentation methods as the field of view expands from a pinhole image to a 360° panorama. A drop of more than \(12\%\) mIoU is observed, demonstrating the challenge posed by the large difference in semantic and structural information between narrow and wide images.
  2. The limitation on categories. The traditional closed-vocabulary segmentation paradigm provides only a limited number of labeled categories and cannot handle the countless categories encountered in real applications. Figure 1b illustrates the difference between closed- and open-vocabulary segmentation. In contrast to the closed-vocabulary setting (first row), the open-vocabulary setting (second row) is not limited by the number of categories in the dataset. In the closed-vocabulary setting, only four predefined categories (highlighted in different colors) are identified, whereas in the open setting every panoramic pixel carries its own semantic meaning, even for categories not labeled in the dataset.

To further unlock the enormous potential of panoramic imagery, three key issues need to be addressed:

  1. How can holistic perception be obtained from a single image?
  2. How can the barrier of limited recognizable categories in existing panoramic datasets be broken, so that downstream vision applications benefit from unrestricted visual cues?
  3. How can the scarcity of panoramic labels be handled?

Based on these three problems, the paper proposes a new task called Open Panoramic Segmentation (OPS), which aims to address these challenges comprehensively and better exploit the advantages of panoramic imagery. The new task paradigm is shown in Figure 1c. The task accounts for all three problems, and the notion of "openness" is threefold: panoramic images with a 360° view (open FoV), an unlimited range of recognizable categories (open vocabulary), and training on pinhole images in the source domain with evaluation on panoramic images in the target domain (open domain). Since densely labeled pinhole segmentation masks are much cheaper to obtain than panoramic ones, opening up the domains is cost-effective. Note that OPS differs from Domain Adaptation (DA): in DA, training uses data from both the source and target domains, whereas in OPS the entire training process relies exclusively on source-domain data.

Beyond the new task, the paper proposes a new model called OOOPS for the three openness-related challenges in the OPS task. The model consists of a frozen CLIP model and a key component, the Deformable Adapter Network (DAN), which serves two important functions: (1) efficiently adapting the frozen CLIP model to the panoramic segmentation task, and (2) addressing object deformation and image distortion in panoramas. More specifically, the key component of DAN is the novel Deformable Adapter Operator (DAO), designed to cope with panoramic distortion. To improve the model's distortion awareness in the pinhole source domain, the paper further introduces Random Equirectangular Projection (RERP), designed specifically to address object deformation and image distortion: the pinhole image is split into four patches and randomly shuffled, and then equirectangular projection, a common method for mapping a sphere onto a panoramic plane, is applied to the shuffled image. The OOOPS model combined with RERP outperforms other state-of-the-art open-vocabulary segmentation methods on WildPASS, Stanford2D3D, and Matterport3D, improving mIoU by +2.2%, +2.4%, and +0.6%, respectively.

To summarize, the paper presents the following contributions:

  1. A new task is introduced, called Open Panoramic Segmentation (OPS), comprising an open field of view (Open FoV), an open vocabulary (Open Vocabulary), and an open domain (Open Domain). The model is trained in the source domain on FoV-restricted pinhole images in an open-vocabulary setting, and evaluated in the target domain on FoV-open panoramic images.

  2. A model called OOOPS is proposed to address the three openness-related challenges simultaneously. A Deformable Adapter Network (DAN) is proposed to transfer the zero-shot learning capability of the frozen CLIP model from the pinhole domain to the different panoramic domain.

  3. A novel data augmentation strategy called Random Equirectangular Projection (RERP) is designed specifically for the proposed OPS task to further improve the accuracy of the OOOPS model, achieving state-of-the-art performance on open panoramic segmentation.

  4. To better understand OPS, a comprehensive benchmark is conducted on both indoor and outdoor datasets (WildPASS, Stanford2D3D, and Matterport3D), covering more than 10 closed- and open-vocabulary segmentation models.

Methodology


Open Panoramic Segmentation

The Open Panoramic Segmentation (OPS) task is designed to solve three challenging problems:

  1. A narrow field of view (FoV)
  2. A limited range of categories
  3. The scarcity of panoramic labels

OPS offers three corresponding answers:

  1. Open FoV
  2. Open vocabulary
  3. Open domain

The OPS task paradigm is shown in Figure 1c. The model is trained in an open-vocabulary setting in the pinhole source domain with a narrow field of view, and evaluated in the panoramic target domain with a wide field of view.

Model Architecture

Adapters allow a base model to be transferred efficiently to downstream tasks. To improve panoramic modeling capability, the paper designs the OOOPS model. As shown in Figure 2, it consists of a frozen CLIP model and the proposed Deformable Adapter Network (DAN), the latter combining multiple transformation layers with the novel DAO. Feature fusion takes place between the intermediate layers of CLIP and DAN. One of DAN's two outputs contains the mask proposals, while the other serves as a deep-supervision guide that helps CLIP generate the proposal logits.

In the training phase, pinhole images are fed into OOOPS, which generates the mask proposals and proposal logits used to compute the loss. In the inference phase, panoramic images are fed into OOOPS, and the segmentation prediction is produced as the product of the mask proposals and the corresponding proposal logits. The frozen CLIP is essential to OOOPS's zero-shot learning capability.
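To make the "product of mask proposals and proposal logits" concrete, here is a minimal numpy sketch of how per-proposal masks and class logits might be fused into a per-pixel prediction. The function name, array shapes, and softmax-then-weighted-sum scheme are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def combine_proposals(mask_proposals, class_logits):
    """Fuse N soft mask proposals with per-proposal class logits (sketch).

    mask_proposals: (N, H, W) soft masks in [0, 1]
    class_logits:   (N, C) class scores for each proposal
    returns:        (H, W) integer label map
    """
    # Softmax over classes, separately for each proposal
    e = np.exp(class_logits - class_logits.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)            # (N, C)
    # Per-pixel class score = mask confidence * class probability,
    # summed over all proposals
    seg = np.einsum('nhw,nc->chw', mask_proposals, probs)  # (C, H, W)
    return seg.argmax(axis=0)                            # (H, W)
```

The einsum expresses exactly the mask-logit product mentioned above: each proposal votes for its classes wherever its mask is confident.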

Deformable Adapter Network

The Deformable Adapter Network combines multiple transformation layers with the proposed DAO. The distortion present in panoramic images poses a great challenge to exploiting their rich information. Drawing on deformable designs and sampling methods such as APES and MateRobot, the paper proposes DAO to cope with image distortion and object deformation in panoramas.

  • Revisiting DCN Series

The groundbreaking DCN endows traditional CNNs with the ability to perceive spatial deformation. Given a convolution kernel with \(K\) sampling locations, let \(\mathbf{w}_k\) and \(\mathbf{p}_k\) denote the weight and the preset offset of the \(k\)-th location, respectively. For example, \(K{=}9\) with \(\mathbf{p}_k {\in} \{(1,1), \ldots, (-1,-1)\}\) defines a \(3{\times}3\) convolution kernel with dilation rate \(1\). Let \(\mathbf{x}(\mathbf{p})\) and \(\mathbf{y}(\mathbf{p})\) denote the features of the input feature map \(\mathbf{x}\) and the output feature map \(\mathbf{y}\) at position \(\mathbf{p}\), respectively. DCN is formulated as:

\[\begin{equation} \mathbf{y}(\mathbf{p}) = \sum^K_{k=1}\mathbf{w}_k\mathbf{x}(\mathbf{p}+\mathbf{p}_k+\Delta\mathbf{p}_k), \end{equation} \]

where \(\Delta\mathbf{p}_k\) is the learnable offset of the \(k\)-th location. Although DCN can capture spatial deformation, every sampling location is treated identically when computing local features. DCNv2 adds an additional term called the modulation scalar. Specifically, DCNv2 can be expressed as:

\[\begin{equation} \mathbf{y}(\mathbf{p}) = \sum^K_{k=1}\mathbf{w}_k\mathbf{m}_k\mathbf{x}(\mathbf{p}+\mathbf{p}_k+\Delta\mathbf{p}_k), \end{equation} \]

where \(\mathbf{m}_k\) is the learnable modulation scalar of the \(k\)-th location. Inspired by the Transformer, DCNv3 introduces a grouping operation to further enhance DCNv2's deformation perception. DCNv3 can be expressed as:

\[\begin{equation} \mathbf{y}(\mathbf{p}) = \sum^G_{g=1}\sum^K_{k=1}\mathbf{w}_g\mathbf{m}_{gk}\mathbf{x}_g(\mathbf{p}+\mathbf{p}_k+\Delta\mathbf{p}_{gk}), \end{equation} \]

where \(G\) denotes the total number of aggregation groups. DCNv4, similar in form to DCNv3, achieves comparable performance while significantly reducing runtime.
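The formulas above boil down to sampling at learned fractional offsets and weighting each sample. The following numpy sketch evaluates the DCNv2-style modulated formula at a single output location with a single channel and group (real implementations use CUDA kernels over whole tensors; function names and the list-based parameters are illustrative assumptions):

```python
import numpy as np

def bilinear(x, py, px):
    """Bilinearly sample feature map x (H, W) at a fractional location."""
    H, W = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    y1, x1 = y0 + 1, x0 + 1
    wy, wx = py - y0, px - x0
    def v(yy, xx):  # zero-padding outside the map
        return x[yy, xx] if 0 <= yy < H and 0 <= xx < W else 0.0
    return ((1 - wy) * (1 - wx) * v(y0, x0) + (1 - wy) * wx * v(y0, x1)
            + wy * (1 - wx) * v(y1, x0) + wy * wx * v(y1, x1))

def dcn_v2_point(x, p, w, offsets, m):
    """y(p) = sum_k w_k * m_k * x(p + p_k + dp_k) for a 3x3 kernel (K=9).

    x: (H, W) input; p: (row, col) center; w, offsets, m: length-9 lists of
    kernel weights, learned offsets (dy, dx), and modulation scalars.
    """
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # preset p_k
    out = 0.0
    for k, (gy, gx) in enumerate(grid):
        dy, dx = offsets[k]
        out += w[k] * m[k] * bilinear(x, p[0] + gy + dy, p[1] + gx + dx)
    return out
```

DCNv3's grouping would simply repeat this per group with shared weights \(\mathbf{w}_g\) inside each group; bilinear interpolation is what makes the learned offsets differentiable.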

  • Deformable Adapter Operator (DAO)

When dealing with distortion in panoramas, DCNv3 and DCNv4 fall short of the required deformation awareness. The proposed DAO therefore tackles the distortion problem in panoramic images with the following expression:

\[\begin{equation} \mathbf{y}(\mathbf{p}) = \mathbf{s}(\mathbf{p})\sum^G_{g=1}\sum^K_{k=1}\mathbf{w}_g\mathbf{m}_{gk}\mathbf{x}_g(\mathbf{p}+\mathbf{p}_k+\Delta\mathbf{p}_{gk}), \end{equation} \]

where \(\mathbf{s}(\mathbf{p})\) is the learnable saliency scalar at position \(\mathbf{p}\). DAO inherits from DCNv3 and adds an additional term, the saliency scalar, to indicate the importance of each pixel in the panorama. Note that DCNv3 and DCNv4 share the same mathematical expression, but in preliminary experiments a DAO built on DCNv3 proved more robust; the paper therefore incorporates DCNv3, rather than DCNv4, into DAO.

As shown in Figure 2, the DCNv3 output feature maps are passed sequentially through a patch-similarity layer, a normalization layer, and a standard-deviation layer to form the saliency map. The intuition behind this design is straightforward: the salient pixels in an image are those that differ significantly from their neighbors, such as edge pixels. If all pixels within an image patch differ from one another, the standard deviation of the pixel similarities within that patch will be higher than in a patch of similar pixels, yielding a higher saliency scalar.

Figure 3 illustrates the saliency map generation in more detail. Given a feature map, DAO first computes the cosine similarity between the center pixel and all pixels within the convolution kernel, e.g., the \(9\) pixels of the \(3 \times 3\) kernel in Figure 3, yielding a \(9\)-dimensional cosine similarity vector. The vector is then normalized with Softmax. Subsequently, DAO computes the standard deviation of this normalized cosine similarity vector to indicate the saliency of the center pixel. Traversing every pixel of the feature map produces the saliency map, which boosts the saliency of pixels that typically lie at image edges, where strong panoramic distortion occurs.
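The similarity-softmax-std pipeline above can be sketched in a few lines of numpy. This is a naive per-pixel loop for clarity, assuming a (C, H, W) feature map and edge padding at the borders; the function name and padding choice are assumptions, not the paper's code:

```python
import numpy as np

def saliency_map(feat, eps=1e-8):
    """Saliency per pixel: std of softmax-normalized cosine similarities
    between each pixel's feature vector and its 3x3 neighborhood.

    feat: (C, H, W) feature map -> returns (H, W) saliency map.
    """
    C, H, W = feat.shape
    # Edge-pad so border pixels also have a full 3x3 neighborhood
    pad = np.pad(feat, ((0, 0), (1, 1), (1, 1)), mode='edge')
    sal = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            c = feat[:, i, j]
            sims = []
            for di in range(3):
                for dj in range(3):
                    n = pad[:, i + di, j + dj]
                    sims.append(c @ n / (np.linalg.norm(c) * np.linalg.norm(n) + eps))
            s = np.array(sims)
            s = np.exp(s - s.max()); s /= s.sum()  # softmax over the 9 sims
            sal[i, j] = s.std()  # uniform sims -> 0; dissimilar neighbors -> high
    return sal
```

On a perfectly uniform feature map the similarities are all equal, the softmax is uniform, and the saliency is zero everywhere; pixels straddling a feature discontinuity get a higher score, matching the edge-pixel intuition in the text.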

Random Equirectangular Projection

Equirectangular projection (ERP) is one of the most common methods for mapping a sphere onto a panoramic plane. It converts spherical coordinates to planar coordinates as follows:

\[\begin{align} &x = R(\lambda-\lambda_0)\cos(\varphi_1), \\ &y = R(\varphi-\varphi_0), \end{align} \]

where \(\lambda\) and \(\varphi\) are the longitude and latitude of the location to be projected, \(\varphi_1\) is the standard parallel, \(\lambda_0\) and \(\varphi_0\) are the central meridian and central parallel of the map, \(R\) is the radius of the sphere, and \(x\) and \(y\) are the horizontal and vertical coordinates of the projected location on the map.
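The two projection equations translate directly into code. A minimal sketch (function name and radian-based arguments are assumptions):

```python
import math

def erp_forward(lam, phi, lam0=0.0, phi0=0.0, phi1=0.0, R=1.0):
    """Equirectangular projection: spherical (lon, lat) -> planar (x, y).

    lam, phi: longitude/latitude of the point (radians); lam0, phi0:
    central meridian and central parallel; phi1: standard parallel;
    R: sphere radius.
    """
    x = R * (lam - lam0) * math.cos(phi1)
    y = R * (phi - phi0)
    return x, y
```

With \(\varphi_1 = 0\) (standard parallel at the equator) the mapping is the familiar identity between (lon, lat) and (x, y); a nonzero \(\varphi_1\) uniformly compresses the horizontal axis by \(\cos\varphi_1\).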

Figure 4a visualizes equirectangular projection onto the panoramic plane. After the projection, strong distortion appears in the panorama, e.g., straight lines become curves. To further improve performance, the paper proposes Random Equirectangular Projection (RERP) on pinhole images, since the OPS task requires the model to be trained on pinhole images rather than panoramas.

The pinhole image is divided into four parts and the patches are randomly shuffled; an equirectangular projection is then applied to the distortion-free pinhole image. Figure 4b visualizes pinhole images after Random Equirectangular Projection (RERP). The first column shows the pinhole image without any augmentation; the second column shows equirectangular projection applied without random shuffling (ERP); the last column shows the proposed Random Equirectangular Projection (RERP). After RERP, panorama-like distortion also appears in the pinhole images. The random shuffling enhances robustness and facilitates generalization.
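The shuffling step of RERP can be sketched as follows. Only the quadrant shuffle is shown; the subsequent equirectangular warp is omitted for brevity, and the function name, even-sized input, and seeding scheme are illustrative assumptions:

```python
import numpy as np

def rerp_shuffle(img, rng=None):
    """Randomly permute the four quadrants of a pinhole image (RERP step 1).

    img: (H, W, C) array with even H and W.
    """
    rng = np.random.default_rng(rng)
    H, W = img.shape[:2]
    h, w = H // 2, W // 2
    patches = [img[:h, :w], img[:h, w:], img[h:, :w], img[h:, w:]]
    order = rng.permutation(4)                       # random quadrant order
    top = np.concatenate([patches[order[0]], patches[order[1]]], axis=1)
    bot = np.concatenate([patches[order[2]], patches[order[3]]], axis=1)
    return np.concatenate([top, bot], axis=0)        # same shape as input
```

The shuffle changes pixel positions but not pixel values, so applying the ERP warp afterwards exposes each image region to varied distortion patterns across epochs.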

Experiments




For more content, please follow the WeChat public account [Xiaofei's Algorithm Engineering Notes].

work-life balance.