Integer-valued training and spike-driven inference of spiking neural networks for high-performance and energy-efficient object detection
Brain-inspired spiking neural networks (SNNs) offer biological plausibility and low power consumption compared with artificial neural networks (ANNs). However, because of their weaker performance, current SNN applications are limited to simple classification tasks. In this work, we focus on bridging the performance gap between ANNs and SNNs for object detection. Our design centers on the network architecture and the spiking neurons.
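For readers new to SNNs, here is a minimal sketch of a leaky integrate-and-fire (LIF) neuron in NumPy. It illustrates generic spike-driven computation only, not the paper's integer-valued training scheme; all parameter values are illustrative.

```python
import numpy as np

def lif_forward(inputs, tau=2.0, v_threshold=1.0):
    """Simulate a leaky integrate-and-fire neuron over T timesteps.

    inputs: array of shape (T,) with the input current per timestep.
    Returns a binary spike train of shape (T,).
    Generic LIF dynamics; not the paper's integer-valued variant.
    """
    v = 0.0                      # membrane potential
    spikes = np.zeros_like(inputs)
    for t, x in enumerate(inputs):
        v = v + (x - v) / tau    # leaky integration
        if v >= v_threshold:     # fire when the threshold is crossed
            spikes[t] = 1.0
            v = 0.0              # hard reset after a spike
    return spikes

print(lif_forward(np.array([1.0, 2.0, 0.5, 2.5, 0.2])))  # [0. 1. 0. 1. 0.]
```

Because activations are binary events rather than dense floating-point values, inference can skip computation wherever no spike occurred, which is the source of the energy savings.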
When pedestrian detection meets multimodal learning: generalist models and benchmark datasets
In recent years, pedestrian detection using different sensor modalities (e.g., RGB, IR, depth, LiDAR, and event cameras) has received increasing attention. However, designing a unified generalist model that can effectively handle different sensor modalities remains a challenge. In this paper, we introduce MMPedestron, a new generalist model for multimodal perception. Unlike previous expert models that handle only one or a pair of specific modality inputs, MMPedestron can process multiple modality inputs and their dynamic combinations.
TCC-Det: Temporally Consistent Cues for Weakly Supervised 3D Detection
Accurate object detection in LiDAR point clouds is a key prerequisite for robust and safe autonomous driving and robotics applications. Training 3D object detectors currently requires manually annotating large amounts of training data, which is time-consuming and expensive. As a result, the amount of readily available annotated training data is limited, and these annotated datasets may not contain edge cases or other rare instances simply because the probability of their appearing in such a small dataset is low. In this paper, we propose a method that trains 3D object detectors without any manual annotation, by exploiting the consistency of existing vision components and of the world around us. The method can therefore train 3D detectors from nothing more than sensor recordings collected in the real world, which is very cheap and permits training with an order of magnitude more data than traditional fully supervised methods.
CARB-Net: A Camera-Assisted Radar-Based Network for Vulnerable Road User Detection
Ensuring reliable sensing of vulnerable road users is essential for safe autonomous driving. Radar stands out as an attractive sensor option due to its resilience to inclement weather, cost-effectiveness, depth-sensing capability, and established role in adaptive cruise control. However, the limited angular resolution of radar poses a challenge for target identification, especially when distinguishing between nearby targets. To address this limitation, we propose the Camera-Assisted Radar-Based Network (CARB-Net), a novel and efficient framework that fuses the angular accuracy of a camera with the robustness and depth-perception capabilities of a radar.
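As a rough illustration of camera-radar fusion, here is a minimal PyTorch sketch that concatenates camera and radar feature maps on a shared spatial grid and mixes them with a 1x1 convolution. This is a generic fusion pattern, not CARB-Net's actual architecture; all channel counts and shapes are assumptions.

```python
import torch
import torch.nn as nn

class SimpleCameraRadarFusion(nn.Module):
    """Illustrative fusion block: concatenate camera and radar feature
    maps and mix them with a 1x1 convolution. A generic sketch only;
    CARB-Net's actual fusion design is more involved."""
    def __init__(self, cam_ch, radar_ch, out_ch):
        super().__init__()
        self.mix = nn.Conv2d(cam_ch + radar_ch, out_ch, kernel_size=1)

    def forward(self, cam_feat, radar_feat):
        # Both feature maps are assumed to live on a shared spatial grid,
        # e.g. a bird's-eye-view representation of the scene.
        return self.mix(torch.cat([cam_feat, radar_feat], dim=1))

fusion = SimpleCameraRadarFusion(cam_ch=64, radar_ch=32, out_ch=64)
cam = torch.randn(1, 64, 128, 128)
radar = torch.randn(1, 32, 128, 128)
print(fusion(cam, radar).shape)  # torch.Size([1, 64, 128, 128])
```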
Weak-to-strong compositional learning from generative models for language-based object detection
Vision-language (VL) models have proven highly effective across a variety of object detection tasks by exploiting weakly supervised image-text pairs from the Web. However, these models exhibit a limited understanding of the complex composition of visual objects (e.g., attributes, shapes, and their relationships), leading to significant performance degradation on complex and diverse linguistic queries. While traditional approaches have tried to enhance VL models with hard-negative synthetic augmentation in the text domain, their effectiveness remains limited without corresponding image-text augmentation. In this paper, we propose a structured synthetic data generation method to improve the compositional understanding of VL models for language-based object detection, generating densely paired positive and negative triplets (object, text description, bounding box) in both the image and text domains.
Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection
In this paper, we marry the Transformer-based detector DINO with grounded pre-training to develop an open-set object detector, Grounding DINO, which can detect arbitrary objects given human inputs such as category names or referring expressions. The key to open-set object detection is introducing language to generalize a closed-set detector to open-set concepts. To fuse the language and vision modalities effectively, we conceptually divide a closed-set detector into three phases and propose a tightly fused solution consisting of a feature enhancer, language-guided query selection, and a cross-modality decoder for cross-modality fusion.
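For experimentation, Grounding DINO is available through the Hugging Face transformers library. The sketch below assumes that port and the public grounding-dino-tiny checkpoint, both of which may differ from the authors' original release; the image path and text prompt are placeholders.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

model_id = "IDEA-Research/grounding-dino-tiny"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

image = Image.open("street.jpg")           # any local test image
text = "a pedestrian. a bicycle."          # category names as a text prompt

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Map model outputs back to boxes and phrase labels in image coordinates.
results = processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids,
    box_threshold=0.4, text_threshold=0.3,
    target_sizes=[image.size[::-1]],
)
print(results[0]["boxes"], results[0]["labels"])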
Unlocking Textual and Visual Intelligence: Enhancing Open Vocabulary 3D Object Detection with Comprehensive Guidance from Text and Images
Open-vocabulary 3D object detection (OV-3DDET) is a challenging task that aims to localize and recognize objects in 3D scenes, covering both seen and previously unseen categories. While the vision and language domains have abundant data for training generalist models, 3D detection suffers from a scarcity of training data. Despite this challenge, burgeoning vision-language models (VLMs) provide valuable insights that can guide the learning process of OV-3DDET. While several efforts have been made to incorporate VLMs into OV-3DDET learning, existing approaches often fail to establish a comprehensive link between 3D detectors and VLMs. In this paper, we investigate the application of VLMs to the open-vocabulary 3D detection task.
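To make the VLM link concrete, the sketch below uses a public CLIP checkpoint to embed candidate category names and score stand-in region features against them by cosine similarity. This is a generic open-vocabulary classification pattern, not the paper's method; the category list and random region features are placeholders.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Embed candidate category names with CLIP's text encoder; a 3D
# detector's region features could then be scored against these
# embeddings by cosine similarity.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

categories = ["chair", "sofa", "bookshelf", "bathtub"]
inputs = processor(text=[f"a photo of a {c}" for c in categories],
                   return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = model.get_text_features(**inputs)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

region_feat = torch.randn(5, text_emb.shape[-1])   # stand-in region features
region_feat = region_feat / region_feat.norm(dim=-1, keepdim=True)
scores = region_feat @ text_emb.T                  # cosine similarities
print(scores.argmax(dim=-1))                       # predicted category per region
```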
A simple background augmentation method for object detection based on diffusion models
In computer vision, it is well known that a lack of data diversity impairs model performance. In this study, we address the challenge of enhancing dataset diversity for various downstream tasks such as object detection and instance segmentation. We propose a simple yet effective data augmentation approach that leverages advances in generative models, particularly text-to-image synthesis techniques such as Stable Diffusion. Our approach focuses on generating variants of labeled real images, augmenting existing training data through inpainting to obtain generated-object and background augmentation without additional annotation. We find that background augmentation in particular significantly improves model robustness and generalization.
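A minimal sketch of background-only augmentation using the diffusers inpainting pipeline is shown below. The checkpoint id, prompt, file names, and box coordinates are illustrative assumptions; the paper's actual pipeline may differ. The idea is to protect the annotated objects with the mask and let the model repaint everything else, so the existing boxes stay valid.

```python
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("train_image.jpg").convert("RGB").resize((512, 512))
boxes = [(100, 150, 220, 400)]             # annotated object boxes (x0, y0, x1, y1)

mask = np.full((512, 512), 255, np.uint8)  # 255 = repaint, 0 = keep
for x0, y0, x1, y1 in boxes:
    mask[y0:y1, x0:x1] = 0                 # protect the labeled objects

augmented = pipe(prompt="a snowy city street",
                 image=image,
                 mask_image=Image.fromarray(mask)).images[0]
augmented.save("train_image_aug.jpg")      # boxes remain valid for the new image
```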
Bayesian Detector Combination for Object Detection with Crowdsourced Annotations
Obtaining fine-grained object detection annotations for unconstrained images is time-consuming, expensive, and prone to noise, especially in crowdsourcing scenarios. Most previous object detection approaches assume accurate annotations; some recent work has investigated object detection with noisy crowdsourced annotations, but evaluated on different synthetic crowdsourced datasets under varying settings with artificial assumptions. To address these algorithmic limitations and evaluation inconsistencies, we first propose a novel Bayesian Detector Combination (BDC) framework to train object detectors more effectively from noisy crowdsourced annotations, with the unique ability to automatically infer annotators' labeling quality. Unlike previous approaches, BDC is model-agnostic, requires no prior knowledge of annotators' skill levels, and integrates seamlessly with existing object detection models.
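As a toy illustration of why annotator reliability matters, the sketch below fuses crowdsourced boxes for a single object using given reliability weights. BDC's contribution is to infer such reliabilities automatically within a Bayesian model; this sketch does not attempt that, and all numbers are made up.

```python
import numpy as np

def fuse_boxes(boxes, reliabilities):
    """Reliability-weighted average of crowdsourced boxes for one object.

    boxes: (K, 4) array of [x0, y0, x1, y1] from K annotators.
    reliabilities: (K,) positive weights. BDC infers these from the data;
    here they are simply given. Toy illustration, not the BDC model.
    """
    w = np.asarray(reliabilities, dtype=float)
    w = w / w.sum()
    return (np.asarray(boxes, dtype=float) * w[:, None]).sum(axis=0)

boxes = [[10, 10, 50, 60], [12, 8, 54, 58], [30, 30, 90, 95]]  # third is noisy
print(fuse_boxes(boxes, reliabilities=[0.9, 0.8, 0.2]))
```

Down-weighting the unreliable third annotator keeps the fused box close to the two consistent ones, which is the effect a learned reliability model provides at scale.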
Bridging the past and the future: overcoming information asymmetry in incremental object detection
In incremental object detection, knowledge distillation has been shown to be an effective way to mitigate catastrophic forgetting. However, previous work focuses on preserving the knowledge of old models, ignoring that images can simultaneously contain categories from past, present, and future stages. This co-occurrence of objects makes the optimization targets inconsistent across stages, since the definition of a foreground object varies from stage to stage, which greatly limits model performance. To overcome this problem, we propose a method called Bridging Past and Future (BPF), which aligns the model across stages to ensure a consistent optimization direction.
Bucketed Ranking-Based Losses for Efficient Training of Object Detectors
Ranking-based loss functions, such as Average Precision loss and rank-and-sort loss, outperform the widely used score-based losses in object detection. These loss functions better match the evaluation criteria, have fewer hyperparameters, and are robust to imbalance between positive and negative classes. However, they require pairwise comparisons between positive and negative predictions, introducing a time complexity of $\mathcal{O}(PN)$, where $P$ and $N$ are the numbers of positive and negative predictions; this is prohibitive since $N$ is typically large. Despite their advantages, the widespread adoption of ranking-based losses has been hampered by this high time and space complexity. In this paper, we improve the efficiency of ranking-based loss functions. To this end, we propose bucketed ranking-based losses, which reduce the number of pairwise comparisons and hence the time complexity.
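To see where the $\mathcal{O}(PN)$ cost comes from, here is a naive pairwise margin-based ranking loss in PyTorch that materializes the full positive-negative comparison matrix. It is a generic illustration of the bottleneck, not the paper's bucketed formulation, and the margin value is arbitrary.

```python
import torch

def naive_ranking_loss(pos_scores, neg_scores, delta=0.5):
    """Margin-based ranking loss over all positive-negative pairs.

    Materializes every (positive, negative) pair, so time and memory
    are O(PN) -- exactly the cost that bucketing is designed to avoid.
    Toy illustration, not the paper's bucketed formulation.
    """
    # (P, 1) - (1, N) broadcasts to a full (P, N) pairwise matrix.
    diffs = neg_scores.unsqueeze(0) - pos_scores.unsqueeze(1) + delta
    return diffs.clamp(min=0).mean()

pos = torch.randn(100)      # P positive predictions
neg = torch.randn(10_000)   # N negative predictions (typically huge)
print(naive_ranking_loss(pos, neg))
```

With $P = 100$ and $N = 10{,}000$ the pairwise matrix already has a million entries per image, which makes the appeal of grouping negatives into buckets evident.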
IRSAM: Advancing the Segment Anything Model for Infrared Small Target Detection
The recently proposed Segment Anything Model (SAM) is a major advance in natural image segmentation, exhibiting strong zero-shot performance on various downstream segmentation tasks. However, because of the pronounced domain gap between natural and infrared images, directly applying the pre-trained SAM to the infrared small target detection (IRSTD) task does not achieve satisfactory performance. Unlike visible-light cameras, thermal cameras capture infrared radiation to reveal the temperature distribution of objects, and small targets usually show only subtle temperature differences at their boundaries. To address this problem, we propose the IRSAM model for IRSTD, which improves SAM's encoder-decoder architecture to better learn feature representations of small targets in infrared images.
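For reference, this is how the off-the-shelf SAM predictor is prompted with a point, using the official segment_anything package and a public ViT-B checkpoint. IRSAM's modified encoder-decoder is not reproduced here; the file names and point location are placeholders.

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Load a pre-trained SAM and prompt it with a point on a suspected
# small target.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("infrared_frame.png"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

masks, scores, _ = predictor.predict(
    point_coords=np.array([[120, 80]]),  # pixel location of the target
    point_labels=np.array([1]),          # 1 = foreground point
)
print(masks.shape, scores)               # candidate masks and their scores
```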
YOLOv9: Learning what you want to learn using programmable gradient information
Today's deep learning methods focus on designing the most appropriate objective function so that model predictions are as close as possible to the ground truth. Meanwhile, a suitable architecture must be designed to facilitate acquiring enough information for prediction. Existing methods ignore the fact that a large amount of information is lost as input data undergoes layer-by-layer feature extraction and spatial transformation. In this paper, we delve into this problem of data loss when data is transmitted through deep networks, namely the information bottleneck and reversible functions. We propose the concept of programmable gradient information (PGI) to cope with the various changes required for deep networks to achieve multiple objectives.
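As a loose analogy for training-only gradient paths, the sketch below attaches an auxiliary head that exists only during training and is dropped at inference. PGI's auxiliary reversible branch is far more elaborate, so treat this purely as an illustration of the general pattern; all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class WithAuxiliaryHead(nn.Module):
    """Backbone with a training-only auxiliary head. The basic idea --
    extra supervision paths exist only at training time and are dropped
    at inference -- is what this sketch shows; PGI itself differs."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.main_head = nn.Conv2d(16, 10, 1)
        self.aux_head = nn.Conv2d(16, 10, 1)   # discarded at inference

    def forward(self, x):
        feat = self.stem(x)
        if self.training:                      # auxiliary gradients only in training
            return self.main_head(feat), self.aux_head(feat)
        return self.main_head(feat)

model = WithAuxiliaryHead().eval()
print(model(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 10, 64, 64])
```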
CLIFF: Continuous Latent Diffusion for Open-Vocabulary Object Detection
Open-vocabulary object detection (OVD) utilizes image-level cues to expand the linguistic space of region proposals, thereby facilitating the detection of diverse novel categories. Recent works adapt CLIP embeddings by jointly minimizing object-image and object-text discrepancies within a discriminative paradigm. However, they ignore the underlying distributions of, and the inconsistencies between, image and text objects, leading to misaligned distributions between the visual and linguistic subspaces. To address this deficiency, we explore distribution-aware generative paradigms and propose a new diffusion-based framework, Continuous Latent Diffusion (CLIFF), which probabilistically models continuous distribution transfer among the object, image, and text latent spaces.
Projecting Points onto Axes: Oriented Object Detection via Point-Axis Representation
This paper presents the point-axis representation for oriented objects in aerial imagery, emphasizing its flexibility and geometric intuition. It comprises two key components: points and axes. 1) Points describe the spatial extent and contour of an object, providing a detailed shape description. 2) Axes define the primary orientation of the object, providing the essential orientation cues critical for accurate detection. The point-axis representation decouples position from rotation, resolving the loss-discontinuity problems often encountered in traditional bounding-box-based methods. For effective optimization without introducing additional annotations, we propose a max-projection loss to guide point-set learning and a cross-axis loss for robust axis representation learning.
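The geometric core of the representation, projecting a point set onto a candidate axis to measure the object's extent along it, can be sketched in a few lines of NumPy. This toy computation is only an illustration of the underlying geometry, not the paper's max-projection loss; the points and angle are made up.

```python
import numpy as np

def max_projection_extent(points, angle):
    """Project a point set onto the axis given by `angle` and return the
    extreme projections, i.e. the object's extent along that axis.
    Toy geometry illustrating the point-axis idea, not the paper's loss.
    """
    axis = np.array([np.cos(angle), np.sin(angle)])   # unit direction
    proj = points @ axis                              # scalar projections
    return proj.min(), proj.max()

pts = np.array([[0.0, 0.0], [4.0, 1.0], [3.0, 3.0], [1.0, 2.5]])
print(max_projection_extent(pts, angle=np.deg2rad(30)))
```

Because extent varies smoothly with the angle, this formulation avoids the discontinuities that arise when an oriented bounding box's angle parameter wraps around.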
Relation DETR: Exploring Explicit Position Relation Prior for Object Detection
In this paper, we present a general scheme for improving the convergence and performance of detection Transformers (DETRs). We investigate the slow-convergence problem of Transformers from a new perspective, arguing that it arises because self-attention introduces no structural prior over the inputs. To address this issue, we explore incorporating a position relation prior as an attention bias to enhance object detection, and we validate its statistical significance using a proposed quantitative macroscopic correlation (MC) metric. Our approach, called Relation-DETR, introduces an encoder that constructs position relation embeddings for progressive attention refinement, extending the traditional DETR pipeline into a contrastive relation pipeline that resolves the conflict between non-duplicate predictions and sufficient positive supervision.
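A minimal sketch of attention with an additive position bias is shown below. How Relation-DETR actually constructs its relation embeddings from box geometry is not reproduced; the bias here is a random stand-in, and all tensor shapes are assumptions.

```python
import torch

def attention_with_position_bias(q, k, v, pos_bias):
    """Scaled dot-product attention with an additive position bias.

    pos_bias has shape (n_queries, n_keys) and is added to the attention
    logits, injecting a structural prior that plain self-attention lacks.
    Generic sketch; Relation-DETR builds its bias from box geometry.
    """
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d ** 0.5 + pos_bias
    return torch.softmax(logits, dim=-1) @ v

q = k = v = torch.randn(8, 64)          # 8 tokens, 64-dim features
bias = torch.randn(8, 8)                # e.g. derived from box overlaps
print(attention_with_position_bias(q, k, v, bias).shape)  # torch.Size([8, 64])
```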
ECCV 2024 Paper Collection PDF
Due to differences in selection criteria, this post may not be comprehensive enough to include the particular paper you are looking for.
The following resource contains the titles and abstracts of all ECCV 2024 papers, with translations that remove the language barrier, so you can make full use of fragmented time and follow the most cutting-edge research in computer vision and pattern recognition anytime, anywhere.
Complete collection of ECCV 2024 paper titles and abstracts: /o/bread/mbd-Zpqal5dx