Panoramic images capture a 360° field of view (FoV) and contain omnidirectional spatial information that is critical for scene understanding. However, obtaining enough densely labeled panoramas for training is not only costly but also limits applications when models are trained in a closed-vocabulary setting. To address this problem, the paper defines a new task called Open Panoramic Segmentation (OPS): the model is trained on pinhole images with a restricted FoV in the source domain and evaluated on panoramic images with an open FoV in the target domain, realizing zero-shot open panoramic semantic segmentation. In addition, the paper proposes a model called OOOPS that incorporates a Deformable Adapter Network (DAN), which significantly improves zero-shot panoramic semantic segmentation performance. To further enhance distortion-aware modeling from the pinhole source domain, the paper proposes a new data augmentation method called Random Equirectangular Projection (RERP), specifically designed to handle object deformation in advance. Combined with RERP, the OOOPS model outperforms other state-of-the-art open-vocabulary semantic segmentation methods on the OPS task across three panoramic datasets, WildPASS, Stanford2D3D, and Matterport3D, and proves especially effective outdoors: mIoU improves by +2.2% on WildPASS and by +2.4% on the indoor Stanford2D3D.
Paper: Open Panoramic Segmentation
- Paper address: /abs/2407.02685
- Paper code: /publications/OPS/
Introduction
Panoramic imaging systems have evolved significantly in recent years, facilitating a wide range of panoramic vision applications. Thanks to their comprehensive 360° field of view, panoramas provide richer visual cues about the surrounding environment and enable a more complete and immersive capture of environmental data across a wide range of scene-understanding tasks, which is critical for in-depth scene understanding. This wide-angle view extends beyond the limited scope of pinhole images and significantly enhances the ability of computer vision systems to perceive and parse the environment in a variety of applications. While the benefits of using panoramic images over pinhole images in computer vision applications are obvious, several noteworthy challenges must be considered, as outlined below:
- Wider field-of-view challenge. Figure 1a shows the performance degradation of state-of-the-art open-vocabulary semantic segmentation methods as the field of view expands from pinhole images to 360° panoramic images. A drop of more than \(12\%\) mIoU is observed, demonstrating the challenge posed by the large gap in semantic and structural information between narrow and wide images.
- Category limitation. The traditional closed-vocabulary segmentation paradigm provides only a limited number of labeled categories and cannot handle the unbounded number of categories in real applications. Figure 1b illustrates the difference between closed- and open-vocabulary segmentation. Unlike the closed-vocabulary setting (first row), the open-vocabulary setting (second row) is not limited by the number of categories in the dataset. In the closed-vocabulary setting, only four predefined categories (highlighted in different colors) are identified, whereas in the open setting every panoramic pixel receives its own semantic meaning, even for categories not labeled in the dataset.
To further unlock the enormous potential of panoramic imagery, three key questions need to be addressed:
- How can holistic perception be obtained from a single image?
- How can the barrier of limited recognizable categories in existing panoramic datasets be broken, so that downstream vision applications benefit from unrestricted, informative visual cues?
- How can the scarcity of panoramic labels be dealt with?
Based on these three questions, the paper proposes a new task called Open Panoramic Segmentation (OPS), which aims to comprehensively address these challenges and better exploit the advantages of panoramic imagery. The new task paradigm is shown in Figure 1c. The OPS task accounts for all three questions, and its notion of "openness" is threefold: panoramic images with a full 360° view (open FoV), an unrestricted range of recognizable categories (open vocabulary), and training on pinhole images in the source domain with evaluation on panoramic images in the target domain (open domain). Since densely labeled pinhole segmentation labels are far cheaper to obtain than panoramic labels, opening up the domains is cost-effective. Note that OPS differs from Domain Adaptation (DA): in DA the training data includes both the source and target domains, whereas in OPS the entire training process relies exclusively on data from the source domain.
In addition to the new task, the paper proposes a new model called OOOPS to tackle the three openness-related challenges of the OPS task. The model consists of a frozen CLIP model and a key component, the Deformable Adapter Network (DAN), which serves two important purposes: (1) efficiently adapting the frozen CLIP model to the panoramic segmentation task, and (2) handling object deformation and image distortion in panoramic images. More specifically, the key component of DAN is a new Deformable Adapter Operator (DAO) designed to cope with panoramic distortion. To improve the model's ability to perceive distortion in the pinhole source domain, the paper further introduces Random Equirectangular Projection (RERP), specifically designed to address object deformation and image distortion: the pinhole image is divided into four patches, which are randomly shuffled, and equirectangular projection, a common method for mapping a sphere onto a panoramic plane, is then applied to the shuffled image. Combined with RERP, the OOOPS model outperforms other state-of-the-art open-vocabulary segmentation methods on WildPASS, Stanford2D3D, and Matterport3D, improving mIoU by +2.2%, +2.4%, and +0.6%, respectively.
To summarize, the paper presents the following contributions:
- A new task called Open Panoramic Segmentation (OPS) is introduced, featuring an open field of view (Open FoV), an open vocabulary (Open Vocabulary), and an open domain (Open Domain). The model is trained in the source domain using pinhole images with a restricted FoV in an open-vocabulary setting, and evaluated in the target domain using panoramic images with an open FoV.
- A model called OOOPS is proposed to address the three openness-related challenges simultaneously. It includes a Deformable Adapter Network (DAN) that transfers the zero-shot learning capability of the frozen CLIP model from the pinhole domain to the distinct panoramic domain.
- A novel data augmentation strategy called Random Equirectangular Projection (RERP) is designed specifically for the proposed OPS task; it further improves the accuracy of the OOOPS model, achieving state-of-the-art performance on open panoramic segmentation.
- To better understand OPS, a comprehensive evaluation is conducted on both indoor and outdoor benchmarks (WildPASS, Stanford2D3D, and Matterport3D), involving more than 10 closed- and open-vocabulary segmentation models.
Methodology
Open Panoramic Segmentation
The Open Panoramic Segmentation (OPS) task is designed to address three challenging problems:
- narrow field of view (FoV)
- limited category range
- scarcity of panoramic labels

OPS answers these three problems with three forms of openness:
- open FoV
- open vocabulary
- open domain

The OPS task paradigm is shown in Figure 1c. The model is trained in an open-vocabulary setting in the pinhole source domain with a narrow FoV, and evaluated in the panoramic target domain with a wide FoV.
Model Architecture
Adapters make it possible to transfer foundation models efficiently to downstream tasks. To improve panoramic modeling capability, the paper designs the OOOPS model. As shown in Figure 2, it consists of a frozen CLIP model and the proposed Deformable Adapter Network (DAN), the latter combining multiple transformation layers with the novel DAO. Feature fusion occurs between the intermediate layers of CLIP and DAN. One of DAN's two outputs contains the mask proposals, while the other serves as deep-supervision guidance that helps CLIP generate the proposal logits.

During training, pinhole images are fed into OOOPS, which produces the mask proposals and the proposal logits used for the loss computation. During inference, a panoramic image is fed into OOOPS, and the segmentation prediction is generated as the product of the mask proposals and the corresponding proposal logits. The frozen CLIP is essential for the zero-shot learning capability of OOOPS.
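To make the proposal-times-logits step concrete, here is a minimal PyTorch sketch of how such a fusion can be implemented. This is the generic mask-classification formulation rather than the paper's released code; tensor shapes, function name, and the use of sigmoid/softmax are illustrative assumptions.

```python
import torch

def combine_proposals(mask_proposals: torch.Tensor,   # (N, H, W) mask logits per proposal
                      class_logits: torch.Tensor       # (N, C) class scores per proposal
                      ) -> torch.Tensor:
    """Fuse per-proposal masks with per-proposal class scores into a
    per-pixel semantic prediction (sketch of the mask-classification idea)."""
    mask_probs = mask_proposals.sigmoid()              # (N, H, W)
    class_probs = class_logits.softmax(dim=-1)         # (N, C)
    # Product of mask proposals and proposal logits -> per-pixel class scores.
    semseg = torch.einsum("nc,nhw->chw", class_probs, mask_probs)
    return semseg.argmax(dim=0)                        # (H, W) predicted class per pixel

# Usage with dummy tensors: 100 proposals, 20 text-prompted classes, 512x1024 panorama.
pred = combine_proposals(torch.randn(100, 512, 1024), torch.randn(100, 20))
```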
Deformable Adapter Network
The Deformable Adapter Network combines multiple transformation layers with the proposed DAO. The distortion present in panoramic images is a major challenge when exploiting their rich information. Drawing on deformable designs and sampling methods such as APES and MateRobot, the paper proposes DAO to cope with the distortion and object deformation in panoramic images.
- Revisiting the DCN series

The pioneering work DCN gives traditional CNNs the ability to perceive spatial deformation. Given a convolution kernel with \(K\) sampling locations, let \(\mathbf{w}_k\) and \(\mathbf{p}_k\) denote the weight and the preset offset of the \(k\)-th location, respectively. For example, \(K{=}9\) with \(\mathbf{p}_k {\in} \{(1,1), \ldots, (-1,-1)\}\) defines a \(3{\times}3\) convolution kernel with dilation rate \(1\). Let \(\mathbf{x}(\mathbf{p})\) and \(\mathbf{y}(\mathbf{p})\) denote the features at position \(\mathbf{p}\) of the input feature map \(\mathbf{x}\) and the output feature map \(\mathbf{y}\), respectively. DCN is formulated as:

\[
\mathbf{y}(\mathbf{p}) = \sum_{k=1}^{K} \mathbf{w}_k \, \mathbf{x}(\mathbf{p} + \mathbf{p}_k + \Delta\mathbf{p}_k),
\]

where \(\Delta\mathbf{p}_k\) is the learnable offset of the \(k\)-th location. Although DCN can capture spatial deformation, every sampling location is treated equally when computing local features. DCNv2 therefore adds an extra term called the modulation scalar. Specifically, DCNv2 can be expressed as:

\[
\mathbf{y}(\mathbf{p}) = \sum_{k=1}^{K} \mathbf{w}_k \, \mathbf{m}_k \, \mathbf{x}(\mathbf{p} + \mathbf{p}_k + \Delta\mathbf{p}_k),
\]

where \(\mathbf{m}_k\) is the learnable modulation scalar of the \(k\)-th location. Inspired by Transformers, DCNv3 introduces a grouping operation to further enhance the deformation awareness of DCNv2. DCNv3 can be expressed as:

\[
\mathbf{y}(\mathbf{p}) = \sum_{g=1}^{G} \sum_{k=1}^{K} \mathbf{w}_g \, \mathbf{m}_{gk} \, \mathbf{x}_g(\mathbf{p} + \mathbf{p}_k + \Delta\mathbf{p}_{gk}),
\]

where \(G\) denotes the total number of aggregation groups. DCNv4 is similar to DCNv3 but achieves comparable performance with a significantly reduced runtime.
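As a concrete reference for the offset and modulation terms above, the modulated deformable convolution of DCNv2 is available in torchvision. The sketch below wires it up with small convolutions that predict \(\Delta\mathbf{p}_k\) and \(\mathbf{m}_k\); layer sizes and names are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class ModulatedDeformConv(nn.Module):
    """DCNv2-style layer: a 3x3 conv whose sampling grid is shifted by learned
    offsets and whose samples are re-weighted by learned modulation scalars."""
    def __init__(self, channels: int, k: int = 3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(channels, channels, k, k) * 0.01)
        # 2 offset values (dx, dy) and 1 modulation scalar per kernel position.
        self.offset_conv = nn.Conv2d(channels, 2 * k * k, k, padding=k // 2)
        self.mask_conv = nn.Conv2d(channels, k * k, k, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offset = self.offset_conv(x)                    # learnable offsets Δp_k
        mask = torch.sigmoid(self.mask_conv(x))         # learnable modulation m_k in [0, 1]
        return deform_conv2d(x, offset, self.weight, padding=1, mask=mask)

feat = torch.randn(1, 64, 32, 64)
out = ModulatedDeformConv(64)(feat)                     # same spatial size as the input
```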
- Deformable Adapter Operator (DAO)

When dealing with the distortion in panoramas, DCNv3 and DCNv4 do not fully meet the requirements of deformation awareness. DAO is therefore proposed to handle the distortion in panoramic images and is expressed as:

\[
\mathbf{y}(\mathbf{p}) = \sum_{g=1}^{G} \sum_{k=1}^{K} \mathbf{w}_g \, \mathbf{m}_{gk} \, \mathbf{s}(\mathbf{p}) \, \mathbf{x}_g(\mathbf{p} + \mathbf{p}_k + \Delta\mathbf{p}_{gk}),
\]

where \(\mathbf{s}(\mathbf{p})\) is the learnable saliency scalar at location \(\mathbf{p}\). DAO inherits from DCNv3 and adds this additional term, the saliency scalar, to indicate the importance of each pixel in the panorama. Note that DCNv3 and DCNv4 share the same mathematical expression, but in preliminary experiments for the DAO design DCNv3 proved more robust, so the paper incorporates DCNv3, rather than DCNv4, into DAO.
As shown in Figure 2, the output feature map of DCNv3 is passed sequentially through a patch-similarity layer, a normalization layer, and a standard-deviation layer to form a saliency map. The intuition behind this design is straightforward: salient pixels are those that differ significantly from their neighbors, such as edge pixels. If all pixels in an image patch are different, the standard deviation of the pixel similarities within that patch is higher than that of a patch containing similar pixels, leading to a higher saliency scalar.

Figure 3 explains the saliency-map generation in more detail. Given a feature map, DAO first computes the cosine similarity between the center pixel and all pixels within the convolution kernel, e.g., the \(9\) pixels inside the \(3 \times 3\) kernel in Figure 3, yielding a \(9\)-dimensional cosine-similarity vector. This vector is normalized with Softmax. DAO then computes the standard deviation of the normalized cosine-similarity vector to indicate the saliency of the center pixel. By traversing every pixel of the feature map, a saliency map is generated that emphasizes pixels which typically lie at edges, where strong panoramic distortion occurs.
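A minimal sketch of this saliency branch, assuming a plain PyTorch implementation with unfold-based neighborhood gathering (function and variable names are illustrative, not the paper's code), could look like:

```python
import torch
import torch.nn.functional as F

def saliency_map(feat: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Sketch of the saliency branch: per pixel, cosine similarity to its kxk
    neighbors -> Softmax -> standard deviation. Pixels whose neighborhoods are
    dissimilar (e.g. edges) receive a larger scalar."""
    b, c, h, w = feat.shape
    # Gather the kxk neighborhood of every pixel: (B, C, k*k, H*W).
    patches = F.unfold(feat, k, padding=k // 2).view(b, c, k * k, h * w)
    center = feat.view(b, c, 1, h * w)
    # Cosine similarity between the center pixel and each neighbor: (B, k*k, H*W).
    sim = F.cosine_similarity(patches, center, dim=1)
    sim = sim.softmax(dim=1)                     # normalize the k*k similarities
    sal = sim.std(dim=1, keepdim=True)           # standard deviation as saliency
    return sal.view(b, 1, h, w)

feat = torch.randn(1, 64, 32, 64)                # e.g. a DCNv3 output feature map
sal = saliency_map(feat)                         # (1, 1, 32, 64)
modulated = feat * sal                           # DAO-style saliency re-weighting
```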
Random Equirectangular Projection
Equirectangular projection (ERP) is one of the most common methods for mapping a sphere onto a panoramic plane. It converts spherical coordinates to planar coordinates as follows:

\[
x = R\,(\lambda - \lambda_0)\cos\varphi_1, \qquad y = R\,(\varphi - \varphi_0),
\]

where \(\lambda\) and \(\varphi\) are the longitude and latitude of the location to be projected, \(\varphi_1\) is the standard parallel, \(\lambda_0\) and \(\varphi_0\) are the central meridian and central parallel of the map, \(R\) is the radius of the sphere, and \(x\) and \(y\) are the horizontal and vertical coordinates of the projected location on the map.
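As a small worked example, the projection formula can be evaluated directly; the function name and default parameters below are illustrative assumptions.

```python
import math

def equirect_project(lam: float, phi: float,
                     lam0: float = 0.0, phi0: float = 0.0,
                     phi1: float = 0.0, R: float = 1.0) -> tuple[float, float]:
    """Equirectangular projection: map longitude/latitude (radians) to plane coordinates."""
    x = R * (lam - lam0) * math.cos(phi1)   # horizontal coordinate
    y = R * (phi - phi0)                    # vertical coordinate
    return x, y

# A point at 90 deg longitude, 45 deg latitude, with standard parallel phi1 = 0:
print(equirect_project(math.pi / 2, math.pi / 4))   # (~1.571, ~0.785)
```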
Figure 4a visualizes equirectangular projection on the panoramic plane. After the projection, strong distortion appears in the panorama, e.g., straight lines become curves. To further improve performance, the paper proposes Random Equirectangular Projection (RERP) on pinhole images, since the OPS task requires the model to be trained on pinhole images rather than panoramic images.
The pinhole image is divided into four patches, which are randomly shuffled; equirectangular projection is then applied to the otherwise distortion-free pinhole image. Figure 4b visualizes pinhole images after Random Equirectangular Projection (RERP). The first column shows the pinhole image without any augmentation. The second column shows equirectangular projection applied to the pinhole image without random shuffling (ERP). The last column shows the proposed Random Equirectangular Projection (RERP). After RERP, panorama-like distortion also appears in the pinhole images. The random shuffling enhances robustness and facilitates generalization.
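A minimal sketch of RERP under simplifying assumptions (nearest-neighbor resampling, an arbitrarily chosen latitude range, and a simplified warp; names and details are illustrative, not the paper's implementation):

```python
import numpy as np

def rerp(img: np.ndarray, phi_max: float = np.pi / 3, rng=np.random) -> np.ndarray:
    """Sketch of Random Equirectangular Projection on a pinhole image:
    (1) split into four quadrants and shuffle them, (2) apply a simplified
    equirectangular-style warp (rows away from the equator sample a narrower
    source span, so straight lines bend as in a panorama).
    Assumes H and W are even."""
    h, w = img.shape[:2]
    # 1) Randomly shuffle the four image patches.
    top, bottom = np.split(img, 2, axis=0)
    quads = np.split(top, 2, axis=1) + np.split(bottom, 2, axis=1)
    rng.shuffle(quads)
    shuffled = np.vstack([np.hstack(quads[:2]), np.hstack(quads[2:])])
    # 2) Simplified ERP-like warp via inverse mapping (nearest-neighbor sampling).
    rows = np.linspace(-phi_max, phi_max, h)              # pseudo-latitude per row
    cols = np.arange(w) - w / 2
    src_c = (cols[None, :] * np.cos(rows)[:, None] + w / 2).astype(int).clip(0, w - 1)
    src_r = np.repeat(np.arange(h)[:, None], w, axis=1)
    return shuffled[src_r, src_c]

# Usage on a dummy RGB pinhole image:
out = rerp(np.random.randint(0, 255, (256, 256, 3), dtype=np.uint8))
```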
Experiments