
TRLO: An Efficient LiDAR Odometry with 3D Dynamic Object Tracking and Removal


Article link: [2410.13240] TRLO: An Efficient LiDAR Odometry with 3D Dynamic Object Tracking and Removal
Project home page: GitHub - Yaepiii/TRLO


Abstract: Simultaneous state estimation and mapping is an essential capability for mobile robots operating in dynamic urban environments. Most existing SLAM solutions rely heavily on a static-world assumption. However, this assumption does not always hold due to the presence of moving vehicles and pedestrians, leading to degraded localization accuracy and distorted maps. To address this challenge, we propose TRLO, a dynamic LiDAR odometry method that effectively improves the accuracy of state estimation and generates cleaner point cloud maps. To detect dynamic objects in the surrounding environment, a deep learning-based method is applied to generate detection bounding boxes. Then, a 3D multi-object tracker based on the unscented Kalman filter (UKF) and a nearest neighbor (NN) strategy is designed to reliably identify and remove dynamic objects. Subsequently, a fast two-stage iterative closest point solver is employed to obtain state estimates from the cleaned static point cloud. Notably, a novel hash-based keyframe database management scheme is proposed for fast keyframe search and access. In addition, all detected object bounding boxes are used to impose pose consistency constraints that further refine the final state estimates. Extensive evaluation and ablation studies on the KITTI and UrbanLoco datasets show that our approach not only achieves more accurate state estimation but also generates clearer maps than the baselines.

Index Terms - Localization, mapping, dynamic SLAM, mobile robotics.

I. Introduction

Accurate state estimation is essential for self-driving vehicles to localize themselves and explore unknown environments without GPS assistance. In contrast to visual SLAM methods [1], [2], which may perform poorly under varying illumination or uniform textures, LiDAR-based methods can provide more reliable pose estimation by accurately capturing the geometric details of the environment [3]. Feature-based methods [4]-[6] attempt to solve for the transformation between two neighboring scans by focusing on the most representative points (e.g., edges or planes). However, they may inadvertently discard well-structured points that could help improve the quality of downstream registration. In particular, when the scene lacks distinctive features, feature-based methods may have difficulty extracting enough feature points, leading to pose drift and mapping distortion.

Nowadays, an alternative approach that directly uses the entire point cloud to estimate the transformation between LiDAR scans, usually via Iterative Closest Point (ICP) [7], has become popular. However, when dealing with dense scans, the computational cost of ICP is high and may hurt real-time performance. To address this problem, various ICP variants [8], [9] have been proposed to improve computational efficiency. Several derived LiDAR odometry (LO) works [10]-[12] have achieved promising performance based on these improved ICP algorithms.

Most existing LO solutions [4], [5], [10]-[12] rely heavily on the underlying assumptions of a static environment. While some of them [11], [12] mitigate the effects of dynamic objects through some algorithmic designs, they do not directly focus on moving objects, leading to degradation of localization accuracy or potential failures. In addition, moving objects usually affect the quality of mapping and hinder subsequent navigation tasks. Therefore, effectively detecting and tracking dynamic objects has become crucial for robust localization and mapping of robots in the real world.

Despite some impressive work [13]-[17] introducing innovative approaches to the SLAM challenge in dynamic environments, limitations remain. RF-LIO [13], DOR-LINS [14], and LIMOT [18] rely on an IMU to provide a pose prior, which can be susceptible to noise. Many works [15], [19] incorporate strict semantic segmentation into LO systems, whose performance is limited by sparse-channel LiDAR. Both [16] and [17] focus on improving the accuracy of object tracking rather than reducing the impact of dynamic objects on LiDAR odometry accuracy. In this work, a novel and efficient dynamic LO, namely TRLO, is proposed. To efficiently detect and track dynamic objects, a pillar-based object detection network is implemented as a TensorRT module, and a multiple object tracking (MOT) approach based on the unscented Kalman filter (UKF) [20] and a nearest neighbor (NN) strategy is proposed. Subsequently, the static point cloud is fed into the Fast GICP [21] module for two-stage scan matching for accurate localization and mapping. In addition, we apply the bounding boxes accumulated in a sliding window to impose consistency constraints that further refine the robot's pose. Fig. 1 shows an example of a point cloud map generated with our proposed method.

Figure 1. Clean global map produced with our dynamic object removal on the UrbanLoco dataset.

A and B are two examples of local map details at congested traffic intersections, demonstrating the superiority of our approach for mapping.

The contribution of our work is summarized below:

- We propose TRLO, a LiDAR odometry method capable of accurately detecting and tracking moving objects to achieve high-precision state estimation and clear mapping; its key component is the proposed reliable 3D multi-object tracker based on the UKF and an NN strategy.

- We propose hash-based keyframe database management, which accelerates scan-to-map registration by directly accessing keyframe indices.

- We reuse the poses of the bounding boxes accumulated in a short-term sliding window to impose consistency constraints on the state estimates, assuming that all objects in the local scene stand on the same ground.

- Extensive experiments on challenging sequences from the KITTI and UrbanLoco datasets show that our method outperforms baselines in terms of localization and mapping.

The source code and demo video of our method are available at: /TRLO/

II. Related work

LiDAR odometry and SLAM are classic topics in robotics with a large body of research. In this section, we briefly review traditional LiDAR odometry estimation methods and focus on SLAM solutions for dynamic scenarios.

A. Conventional LiDAR odometry

As the most typical feature-based approach, LOAM [4] introduces an innovative methodology that divides the estimation problem into two different algorithms, one that operates at high frequency with lower accuracy and another that operates at low frequency with higher accuracy; the two results are fused to produce a final estimate that is both high-frequency and highly accurate. LeGO-LOAM [5] implements lightweight LiDAR odometry using ground point segmentation and a two-stage Levenberg-Marquardt optimization, with a dynamic small-object rejection mechanism to improve accuracy. MULLS [22] focuses on employing Principal Component Analysis (PCA) to extract richer and more stable features, proposing a scanline-independent LiDAR-only SLAM framework with enhanced generalization capability. Li et al. [23] proposed a novel angle-based feature extraction method based on the equiangular distribution property of the point cloud and constructed a fixed-lag smoother to jointly optimize the poses of multiple keyframes.

With rapid advances in processing units [24], some recent work explores directly using ICP or its variants for real-time state estimation and mapping. Direct LiDAR Odometry (DLO) [10] introduces NanoGICP, a lightweight iterative closest point solver that accurately aligns dense raw point clouds to determine the robot's pose, and achieves sensible keyframe management and mapping through an adaptive approach. CT-ICP [11] combines continuity within scan matching and discontinuity between scans to improve the accuracy and robustness of point cloud alignment. KISS-ICP [12] is the latest direct registration method that focuses on simplifying the implementation of ICP; it balances accuracy and speed by employing adaptive-threshold data association.

B. LiDAR-based SLAM in dynamic scenes

Traditional methods operate under the core assumption of a static environment. However, in dynamic scenarios such as busy urban environments, odometry accuracy degradation and map distortion may occur. Several SLAM schemes therefore work on recognizing and filtering dynamic objects in the environment to preserve localization and mapping accuracy.

Early efforts [25], [26] used vision techniques to recognize and remove dynamic objects. To realize outdoor large-scale LiDAR-based dynamic SLAM, some works combine 3D LiDAR with other sensors. RF-LIO [13], built on the LIO-SAM [6] framework, uses IMU-provided initial values and multi-resolution range images to detect and remove dynamic points in the submap, achieving accurate scan-to-static-submap alignment. DOR-LINS [14] introduces ground pseudo-occupancy to recognize and remove dynamic objects. However, introducing additional sensors brings extra cost and complexity. In contrast, our approach uses LiDAR as the only sensor to achieve comparable or even better results while reducing cost.

Driven by advances in deep learning and Convolutional Neural Networks (CNNs) for scene understanding [27], [28], several approaches extract semantic information from LiDAR data for dynamic removal. SuMa++ [15] uses RangeNet++ [29] on spherical projections to extract semantic information for dynamic removal through a point-wise probabilistic approach. Pfreundschuh et al. [19] used deep learning techniques for real-time 3D dynamic object detection, extracting features from the static point cloud to estimate ego-motion. Although effective, these methods generalize poorly to LiDAR odometry because the semantic extraction networks are tailored to 64-channel or denser LiDAR; as a result, such systems fail when 32-channel LiDAR is used. Our method still produces strong results on datasets captured with 32-channel LiDAR.

References [16] and [17] are most similar to our work. Both integrate object poses into a factor-graph estimation framework to achieve effective tracking of dynamic objects. However, [16] removes all detected objects prior to odometry estimation, discarding valuable static information, and [17] cannot operate in real time due to asynchronous state estimation. Although S2MAT [30] focuses on dynamic object tracking and removal, it does not address how to mitigate the effect of dynamic objects on LiDAR odometry to improve localization accuracy. In addition, some methods [31], [32] focus on map construction in dynamic scenes and thus usually do not emphasize real-time performance.

III. Methodology

The architecture of our system is shown in Fig. 2. Our work addresses the following problem: given adjacent raw point cloud scans $\mathcal{P}'_k$ and $\mathcal{P}'_{k-1}$ at time $k$, remove dynamic objects to obtain static point clouds $\mathcal{P}_k$ and $\mathcal{P}_{k-1}$ as inputs to the LiDAR odometry, then estimate the robot's global pose $\chi^W_k \in SE(3)$ and construct a global map $\mathcal{M}_k$ in the world coordinate system $W$, where $\chi_k$ consists of a rotational part $R_k \in SO(3)$ and a translational part $t_k \in \mathbb{R}^3$.

Fig. 2. Our system architecture. The pillar-based 3D object detector first preprocesses the raw point cloud to detect dynamic and semi-static objects and generate 3D object bounding boxes. Subsequently, a 3D multi-object tracker is applied to recognize and remove dynamic objects. Neighboring static scans $\mathcal{P}_k$ are input to compute the scan-to-scan (S2S) transformation. The initial value is propagated to the world coordinate system and used in the scan-to-map (S2M) stage of Fast GICP, where $\mathcal{P}_k$ is matched against a submap $S_k$ consisting of selected keyframes. Finally, the S2M transform is further optimized using pose consistency constraints imposed by the detected bounding boxes to obtain a refined global robot pose, which is checked against multiple metrics to decide whether it should be stored in the keyframe hash database.

A. 3D object detector

When our system receives the raw point cloud $\mathcal{P}'_k$ acquired by a 360° LiDAR (Fig. 3(a)), to minimize noise and the loss of raw data and to preserve object detection accuracy, only a 0.5 m box filter around the robot and a 0.25 m voxel filter are applied in the preprocessing stage to reduce false detections and eliminate noise. The processed point cloud is then fed into the 3D object detector.
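As a concrete illustration, the following minimal sketch reproduces this preprocessing with Open3D as a stand-in for the paper's implementation; the function name and exact filter semantics are our assumptions, and only the 0.5 m / 0.25 m parameters come from the text.

```python
# Hedged sketch of the preprocessing stage (not the authors' code).
import numpy as np
import open3d as o3d

def preprocess(points: np.ndarray) -> o3d.geometry.PointCloud:
    """points: (N, 3) raw LiDAR scan P'_k in the sensor frame."""
    # Box filter: drop returns within a 0.5 m box around the robot
    # (assumed interpretation: these are mostly self-hits and noise).
    keep = np.any(np.abs(points) > 0.5, axis=1)
    pcd = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(points[keep]))
    # Light 0.25 m voxel filter to suppress noise while keeping detail.
    return pcd.voxel_down_sample(voxel_size=0.25)
```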

In this module, we use PointPillars [33] as the backbone for feature map extraction, followed by a center-based detection head [34] to generate 3D bounding boxes (Fig. 3(b)) for dynamic and semi-static objects, where semi-static objects are defined as objects that are capable of moving but are not currently moving. Although we chose the center-based head [34] to balance accuracy and speed, our pipeline is not limited to a specific object detection method. Let $B^L_{k,m}$ be a bounding box in the LiDAR coordinate system $L$; all bounding boxes at scan $k$ form the set $B^L_k = \{B^L_{k,m} \mid m = 1, \dots, M\}$. To mitigate the impact of false detections on 3D multi-object tracking, we selectively detect only cars and cyclists while raising the confidence threshold to 0.75 [35]. In addition, we optimize the overall efficiency of the system by rewriting the 3D object detector as a TensorRT module and integrating it into the system front-end, more than doubling the processing speed.
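The class and confidence gating described above reduces to a few lines; this is a hypothetical sketch (the Box3D container and function name are ours), with only the class list and the 0.75 threshold taken from the text.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Box3D:
    center: Tuple[float, float, float]  # (x, y, z) in the LiDAR frame L
    size: Tuple[float, float, float]    # (l, w, h)
    yaw: float                          # heading angle theta
    label: str                          # detector class name
    score: float                        # detection confidence

KEPT_CLASSES = {"car", "cyclist"}
SCORE_THRESHOLD = 0.75  # raised threshold to suppress false detections

def filter_detections(boxes: List[Box3D]) -> List[Box3D]:
    """Keep only high-confidence boxes of the selected classes."""
    return [b for b in boxes
            if b.label in KEPT_CLASSES and b.score >= SCORE_THRESHOLD]
```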

Fig. 3. Effects of the proposed dynamic detection and filtering. (a) The original point cloud. (b) 3D bounding boxes of dynamic and semi-static objects detected by our 3D object detector (blue boxes). (c) 3D bounding box of dynamic objects recognized by our 3D multi-object tracker (green box). (d) The static point cloud after removing the dynamic objects, which will be used as input to the LiDAR odometer.


B. 3D multi-object tracker

Traditional SLAM systems rely on geometric information to localize observations and associate them with maps. However, they cannot account for dynamic scene changes caused by moving objects, which can lead to incorrect alignment and map distortion. Dynamic SLAM algorithms are usually equipped with some outlier rejection. However, if we simply discard dynamic and semi-static objects, we risk losing their valuable contribution to odometry estimation.

We propose a method based on the UKF and NN techniques to track the bounding boxes $B^L_k$ detected in III-A. Unlike the EKF, the UKF handles nonlinear state estimation using deterministic sampling rather than linear approximation [20], which makes it more suitable for dynamic object tracking. The object tracking process is as follows:

1) Estimation model: we approximate the inter-frame displacement of each object using a constant velocity model that is independent of other objects and of the ego-motion. For simplicity, we omit the subscript m. The state of object $x^O_k$ at the $k$-th scan is defined as

$x^O_k = [x, y, z, \theta, l, w, h, v]^T \quad (1)$

where $x$, $y$, and $z$ are the center coordinates of the object in the robot coordinate system $B$ (we assume $L$ and $B$ coincide), $\theta$ is the yaw angle between the object coordinate system $O$ and $B$, $v$ is the velocity to be estimated, and $l$, $w$, and $h$ are the length, width, and height of the bounding box, respectively. We then estimate the state of each object via the UKF [20], whose prediction and update steps take the standard form

$\hat{x}^O_{k+1|k} = \textstyle\sum_i W^m_i \mathcal{X}^O_{i,k+1|k}, \qquad \hat{y}_{k+1|k} = \textstyle\sum_i W^m_i \mathcal{Y}^O_{i,k+1|k} \quad (2)$

$K_{k+1} = P^O_{xy,k+1|k} \big(P^O_{yy,k+1|k}\big)^{-1}, \quad \hat{x}^O_{k+1|k+1} = \hat{x}^O_{k+1|k} + K_{k+1}\big(y_{k+1} - \hat{y}_{k+1|k}\big), \quad \hat{P}^O_{k+1|k+1} = P^O_{xx,k+1|k} - K_{k+1} P^O_{yy,k+1|k} K^T_{k+1} \quad (3)$

where $\mathcal{X}^O_{k+1|k}$ and $\mathcal{Y}^O_{k+1|k}$ are the sigma points obtained after scaled symmetric sampling and nonlinear propagation, $W^m_i$ and $W^c_i$ are the corresponding weights, $K_{k+1}$ is the Kalman gain, $y_k$ is the observation at scan $k$, $\hat{x}^O_{k+1|k+1}$ and $\hat{P}^O_{k+1|k+1}$ are the estimated state and covariance, respectively, and $P^O_{xx,k+1|k}$, $P^O_{xy,k+1|k}$, and $P^O_{yy,k+1|k}$ are the covariance matrices associated with $\mathcal{X}^O_{k+1|k}$ and $\mathcal{Y}^O_{k+1|k}$. For more details, please refer to [20]. Based on the estimated velocity, we label a tracked object as dynamic if its velocity exceeds 1 m/s; otherwise it is considered semi-static. Our method thus effectively distinguishes dynamic objects (Fig. 3(c)) from semi-static ones.
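To make the filtering step concrete, here is a minimal per-object UKF sketch built on the open-source filterpy library rather than the authors' implementation; the state layout follows the definition above, while the motion model details, noise settings, and dt are assumptions, and Box3D is the hypothetical container from the detector sketch.

```python
import numpy as np
from filterpy.kalman import UnscentedKalmanFilter, MerweScaledSigmaPoints

def fx(x, dt):
    """Constant-velocity transition: move the box center along its yaw."""
    x = x.copy()
    x[0] += x[7] * np.cos(x[3]) * dt   # x advances with v cos(theta)
    x[1] += x[7] * np.sin(x[3]) * dt   # y advances with v sin(theta)
    return x

def hx(x):
    """The detector observes pose and size but not velocity."""
    return x[:7]

def make_tracker(box, dt=0.1):
    """One UKF per object; state [x, y, z, theta, l, w, h, v]."""
    pts = MerweScaledSigmaPoints(n=8, alpha=1e-3, beta=2.0, kappa=0.0)
    ukf = UnscentedKalmanFilter(dim_x=8, dim_z=7, dt=dt,
                                fx=fx, hx=hx, points=pts)
    ukf.x = np.array([*box.center, box.yaw, *box.size, 0.0])  # v = 0 at birth
    ukf.P = np.eye(8)
    ukf.P[7, 7] = 100.0  # velocity unobservable at birth: large covariance
    return ukf

def is_dynamic(ukf, v_thresh=1.0):
    """Dynamic if the estimated speed exceeds 1 m/s, else semi-static."""
    return abs(ukf.x[7]) > v_thresh
```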

2) Data association: after estimating the new state of each target using (2) and (3), we perform the data association step. Instead of typical IoU-based criteria, we simply use a nearest-neighbor technique to assign targets to detections in the current frame, as we found that this simple approach also achieves high accuracy in our application. When a detection is associated with a target, the detected bounding box is used to update the target state. In addition, to mitigate the effects of false detections or occlusions, we perform only the prediction step, without an update, for targets that have no associated detection.
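A greedy nearest-neighbor association, in place of IoU matching, might look like the following sketch; the gating radius is an assumed value, not from the paper.

```python
import numpy as np

def associate_nn(track_centers, det_centers, gate=2.0):
    """Greedily map each track index to its closest unclaimed detection
    within `gate` meters; unmatched tracks are simply left out."""
    matches, used = {}, set()
    if len(det_centers) == 0 or len(track_centers) == 0:
        return matches
    dets = np.asarray(det_centers, dtype=float)
    for ti, c in enumerate(track_centers):
        d = np.linalg.norm(dets - np.asarray(c, dtype=float), axis=1)
        if used:
            d[list(used)] = np.inf   # each detection is matched at most once
        j = int(np.argmin(d))
        if d[j] < gate:
            matches[ti] = j
            used.add(j)
    return matches
```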

3) Creation and deletion of tracked targets: when an object enters the LiDAR field of view, a tracker is initialized using the geometry of its bounding box, with the velocity set to zero. Since the velocity is unobservable at initialization, the covariance of the velocity component is set to a large value to reflect this uncertainty. When an object leaves the LiDAR field of view, it is not deleted immediately but is retained for a lifecycle of $Age_{max}$ scans, during which only the prediction step is performed. This effectively avoids deleting true-positive trajectories that are still present in the scene but temporarily unmatched due to missed detections or occlusion; a sketch of this lifecycle logic follows below. As shown in Fig. 3(d), after dynamic removal we obtain a cleaner static point cloud while retaining semi-static objects, which provide valuable information for LiDAR odometry.
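Putting the previous sketches together, the track lifecycle can be managed as below; AGE_MAX is an assumed value (the text leaves $Age_{max}$ unspecified), and make_tracker/associate_nn are the hypothetical helpers defined above.

```python
import numpy as np

AGE_MAX = 3  # assumed coasting budget, in scans

class Track:
    def __init__(self, box):
        self.ukf = make_tracker(box)   # UKF sketch above
        self.missed = 0                # consecutive scans without a match

def step(tracks, detections):
    """One tracker iteration: predict, associate, update, birth, death."""
    for t in tracks:
        t.ukf.predict()
    matches = associate_nn([t.ukf.x[:3] for t in tracks],
                           [d.center for d in detections])
    for ti, di in matches.items():
        d = detections[di]
        tracks[ti].ukf.update(np.array([*d.center, d.yaw, *d.size]))
        tracks[ti].missed = 0
    for ti in set(range(len(tracks))) - set(matches):
        tracks[ti].missed += 1         # coast: prediction only
    for di in set(range(len(detections))) - set(matches.values()):
        tracks.append(Track(detections[di]))   # new object enters the FOV
    return [t for t in tracks if t.missed <= AGE_MAX]
```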

C. LiDAR odometry

LiDAR odometry can be viewed as the process of recovering an SE(3) transform by comparing the processed neighboring static point clouds with those in memory. The process is typically executed in two stages, first providing a best instantaneous guess and then refining it to be more globally consistent.

1) Scan-to-scan: in the first stage, scan-to-scan matching computes the relative transform $\hat{\chi}^L_k$ between the source point cloud $\mathcal{P}^s_k$ and the target point cloud $\mathcal{P}^t_k$ (where $\mathcal{P}^t_k = \mathcal{P}^s_{k-1}$) captured in $L$:

$\hat{\chi}^L_k = \arg\min_{\chi^L_k} \sum_{i}^{N} d_i^{T} \left( C^t_{k,i} + \chi^L_k C^s_{k,i} (\chi^L_k)^{T} \right)^{-1} d_i \quad (4)$

where $N$ is the number of corresponding points between the source $\mathcal{P}^s_k$ and the target $\mathcal{P}^t_k$, $d_i = p^t_i - \chi^L_k p^s_i$ with $p^s_i \in \mathcal{P}^s_k$ and $p^t_i \in \mathcal{P}^t_k$, and $C^s_{k,i}$ and $C^t_{k,i}$ are the estimated covariance matrices corresponding to each point $i$ of the source and target clouds, respectively. If an external sensor (e.g., an IMU) is available, our system can provide an initial guess for Fast GICP via IMU pre-integration. However, our system does not rely on an IMU: when none is available, we still achieve robust state estimation by setting the initial guess to the identity matrix $\mathbf{I}$. When present, the IMU can also be used to estimate the covariance of each point $i$ in the target cloud.

Finally, we accumulate the result $\hat{\chi}^L_k$ onto the previous global scan-to-map transformation $\hat{\chi}^W_{k-1}$, i.e., $\hat{\chi}^W_k = \hat{\chi}^W_{k-1} \hat{\chi}^L_k$.
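For illustration, the scan-to-scan stage can be emulated with Open3D's generalized ICP in place of the Fast GICP solver used by the authors; the correspondence distance and neighborhood size are assumed values, and with no IMU the initial guess defaults to the identity as described above.

```python
import numpy as np
import open3d as o3d

def scan_to_scan(source_pts, target_pts, init=np.eye(4)):
    """Estimate chi^L_k aligning source P^s_k to target P^t_k = P^s_{k-1}."""
    src = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(source_pts))
    tgt = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(target_pts))
    for pc in (src, tgt):
        # Per-point covariances for the GICP distribution-to-distribution cost.
        pc.estimate_covariances(o3d.geometry.KDTreeSearchParamKNN(knn=20))
    reg = o3d.pipelines.registration.registration_generalized_icp(
        src, tgt, 1.0, init)  # 1.0 m max correspondence distance (assumed)
    return reg.transformation

# Accumulate onto the previous global pose:
#   T_world_k = T_world_prev @ scan_to_scan(P_k, P_k_minus_1)
```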

2) Scan-to-map: in this stage, we use $\hat{\chi}^W_k$ as the initial guess for scan-to-map matching and follow a process similar to scan-to-scan. The goal here, however, is to obtain a globally consistent robot state $\hat{\chi}^W_k$ by matching the current static point cloud $\mathcal{P}^s_k$ against the local submap $S_k$ such that

$\hat{\chi}^W_k = \arg\min_{\chi^W_k} \sum_{j}^{M} d_j^{T} \left( C^S_{k,j} + \chi^W_k C^s_{k,j} (\chi^W_k)^{T} \right)^{-1} d_j \quad (5)$

where $M$ is the number of corresponding points between the source $\mathcal{P}^s_k$ and the submap $S_k$, and $C^S_{k,j}$ is the scan-stitched covariance matrix corresponding to each point $j$ of the submap, as defined in III-D.

3) Bounding box consistency constraints: we use the 3D bounding boxes generated in III-A to impose pose consistency constraints on the $[t_z, roll, pitch]$ components of the scan-to-map result. Specifically, we assume that all detected objects stand on the same flat surface. We therefore project the center of each bounding box vertically downward to produce the ground point of the object, and use all of these ground points for ground fitting. If the number of inliers exceeds a specified threshold, we use them to compute the ground normal, which acts as a constraint on $[roll, pitch]$. Note that in scans with few bounding boxes this process may fail; to avoid this, we implement a sliding window that accumulates the bounding boxes of several consecutive scans, and all bounding boxes within this window are used for ground fitting.

In addition, we infer that the robot is traversing relatively flat terrain if the average z-axis change of the bounding boxes between neighboring scans is less than 0.1 m. In this case, we impose a constraint on $t_z$ by requiring the z-axis displacement within the local window to be close to zero, assuming the ground does not change between two consecutive scans.
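A least-squares version of the ground-fitting step might look as follows; it is a simplified stand-in (plain least squares instead of the inlier-thresholded fit described above), with min_points an assumed value.

```python
import numpy as np

def ground_normal_from_boxes(centers, heights, min_points=6):
    """Fit a ground plane z = ax + by + c to box footprints and return
    its unit normal (constrains roll/pitch), or None if too few boxes."""
    ground = np.asarray(centers, dtype=float).copy()
    ground[:, 2] -= np.asarray(heights, dtype=float) / 2.0  # drop center to ground
    if len(ground) < min_points:
        return None  # too few boxes in the window: skip the constraint
    A = np.c_[ground[:, 0], ground[:, 1], np.ones(len(ground))]
    (a, b, _), *_ = np.linalg.lstsq(A, ground[:, 2], rcond=None)
    n = np.array([-a, -b, 1.0])
    return n / np.linalg.norm(n)
```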

D. Hash-based keyframe management

Inspired by the visual SLAM approach [2], we use keyframes to manage maps. However, unlike common keyframe management, we use a hash table to access keyframes efficiently. Referring to the idea of adaptive thresholding in DLO [10], the threshold for adding keyframes is adaptively adjusted according to the spaciousness of the current environment. See [10] for more details. When the robot's moving distance or rotation angle is larger than the threshold, we associate the current state χk of the robot with the static point cloud Pk in the form of key-value pairs, which are then stored in the hash-based keyframe database.
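The keyframe store itself can be as simple as a hash map from an integer key to a (pose, cloud) pair; the sketch below is our assumption of the mechanism, with the adaptive threshold logic from DLO omitted and the thresholds passed in as parameters.

```python
import numpy as np

class KeyframeDB:
    """Hash-based keyframe storage: O(1) access by keyframe index."""
    def __init__(self):
        self.frames = {}     # hash table: index -> (4x4 pose, Nx3 cloud)
        self.next_id = 0

    def maybe_add(self, pose, cloud, last_pose, dist_thresh, rot_thresh_deg):
        """Insert a keyframe when motion since the last keyframe exceeds
        the (adaptively chosen) distance or rotation threshold."""
        dt = np.linalg.norm(pose[:3, 3] - last_pose[:3, 3])
        cos_angle = (np.trace(last_pose[:3, :3].T @ pose[:3, :3]) - 1) / 2
        dr = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
        if dt > dist_thresh or dr > rot_thresh_deg:
            self.frames[self.next_id] = (pose, cloud)
            self.next_id += 1
            return True
        return False
```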

When a submap needs to be constructed, we implement a keyframe selection strategy based on nearest neighbors and convex/concave hulls. Let $\mathcal{K}_k$ be the set of all keyframe point clouds; the submap $S_k$ is defined as the combination of $K$ nearest-neighbor keyframe scans $Q_k$, $L$ nearest-neighbor convex-hull scans $H_k$, and $J$ nearest-neighbor concave-hull scans $G_k$, such that $S_k = Q_k \oplus H_k \oplus G_k$, where $\oplus$ denotes concatenation. This combination reduces overall trajectory drift by maximizing scan-to-map overlap and aids alignment by providing environmental features at multiple scales.
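Submap assembly from the database can then combine the K nearest keyframes with keyframes lying on the convex hull of all keyframe positions; the sketch below omits the concave-hull part for brevity and treats K and L as assumed parameters.

```python
import numpy as np
from scipy.spatial import ConvexHull, cKDTree

def build_submap(db, query_pos, K=10, L=5):
    """Concatenate clouds of the K nearest keyframes plus up to L nearest
    keyframes on the 2D convex hull of all keyframe positions."""
    ids = list(db.frames)
    pos = np.array([db.frames[i][0][:3, 3] for i in ids])
    _, knn = cKDTree(pos).query(query_pos, k=min(K, len(ids)))
    chosen = {ids[i] for i in np.atleast_1d(knn)}
    if len(ids) >= 4:  # hull needs enough non-degenerate points
        hull = ConvexHull(pos[:, :2]).vertices
        nearest_hull = sorted(
            hull, key=lambda v: np.linalg.norm(pos[v] - query_pos))[:L]
        chosen |= {ids[v] for v in nearest_hull}
    return np.vstack([db.frames[i][1] for i in chosen])
```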

IV. Experiments

To validate the effectiveness of our approach, we rigorously conducted extensive experiments on the KITTI [36] and UrbanLoco [37] datasets, including odometry benchmarks and mapping results, as well as runtime analysis and ablation studies for each component. All experiments were executed on a laptop equipped with a 20-core Intel i9 2.50GHz CPU and RTX 3060 GPU.

A. Odometry benchmarks

1) Experimental setup: we conducted a baseline evaluation comparing our approach with several conventional LiDAR odometry methods (A-LOAM [4], LeGO-LOAM [5], DLO [10], CT-ICP [11], and KISS-ICP [12]) as well as several dynamic LiDAR-inertial odometry methods (LIO-SEGMOT [17] and LIMOT [18]). For a fair comparison, the loop closure module was disabled in LeGO-LOAM [5], LIO-SEGMOT [17], and LIMOT [18]. Since SuMa++ [15] relies on semantic information extraction to filter dynamic objects, it cannot run on datasets collected with 32-channel LiDAR. We re-implemented a dynamic LO baseline, RF-A-LOAM, integrating range-based dynamic removal. All experiments were performed on the KITTI [36] and UrbanLoco [37] datasets, using the root mean square error (RMSE) of the relative pose error (RPE) and the absolute pose error (APE) as metrics.

2) Experimental analysis: the results are shown in Table I. Compared with conventional LO systems, our approach generally achieves encouraging performance through the proposed dynamic removal and bounding box consistency constraint strategies, especially on highly dynamic sequences. Furthermore, the proposed method outperforms LIO-SEGMOT [17] and LIMOT [18], except on the UrbanLoco03 sequence, where LIMOT [18] achieves a lower APE. Notably, our method obtains competitive results using only LiDAR scans as input, demonstrating the effectiveness of the dynamic removal and robust alignment mechanisms.

B. Mapping results

1) Qualitative: as shown in Figure 4, we compare the maps constructed by the proposed method and other baselines in a dynamic urban environment. In the map generated by DLO [10], we clearly see severe ghost trails caused by moving cars. In contrast, our approach skillfully recognizes and filters moving objects while retaining valuable semi-static objects. Compared to LIO-SEGMOT [17], LIMOT [18], and RF-A-LOAM, our method exhibits better dynamic removal performance. Ultimately, we obtain high-precision LiDAR odometry and construct sharper maps.

Figure 4.

(a) Global maps generated by traditional methods.

(b) Local details of the global map generated by the traditional method, where the red dots represent ghost trails caused by moving objects.

(c) The global map generated by our method.

(d) Local details of the global map generated by our method. It is clear that our method effectively detects and filters moving objects, thus producing more consistent results.


2) Quantitative: map quality is evaluated using the preservation rate (PR) of static points and the removal rate (RR) of dynamic points, reported in Table II. Statistically, the PR score of our method is higher than all other baselines, while the RR score is lower than that of LIMOT [18]. Balancing PR and RR, the F1-score of our method is competitive.

The motivation of this work is a compact dynamic LO system dedicated to improving the overall accuracy of odometry and mapping by integrating object detection, tracking, and bounding box pose consistency constraints. We acknowledge that we do not focus on further improving the performance of object detection and tracking themselves. Extensive comparative experiments show that the proposed dynamic LO system is competitive with the baselines in terms of localization accuracy, mapping quality, and efficiency.

C. Ablation studies

To study the impact of our system components, we perform ablation studies on the KITTI07 and UrbanLoco05 sequences. We compare the accuracy of the following odometry variants: without dynamic removal and without bounding box consistency constraints (a), without dynamic removal (b), without bounding box consistency constraints (c), with the EKF in place of the UKF (d), and our full method (e). In addition, following the approach in [38], we use RANSAC on a global segmentation to fit the ground and impose ground constraints, and we compare the effect of our pose consistency constraints against this traditional ground segmentation method at different downsampling sizes (0.4 m, 0.5 m).

1) Tracker: as shown in Table III, the UKF-based object tracker outperforms the EKF-based one for the LO system, benefiting from more robust dynamic tracking. Theoretically, the EKF handles nonlinear state estimation via linear approximation, which may introduce tracking bias.

2) Dynamic removal: as shown in Table III, adding the dynamic removal step effectively filters outliers in the environment, resulting in more accurate odometry estimates. The improvement is especially evident on the highly dynamic UrbanLoco dataset.


3) Bounding box consistency constraint: the results are shown in Fig. 5 and Table IV. Imposing the bounding box consistency constraint effectively suppresses the z-axis drift of the odometry. Combining dynamic removal and the bounding box consistency constraint, our complete system obtains more accurate results: translation accuracy improves by 39.3% on KITTI07 and 30.4% on UrbanLoco05. As shown in Fig. 5 and Table IV, the proposed method with bounding box constraints achieves higher localization accuracy at lower time cost. The efficiency comes from the fact that the bounding box constraint removes the need for a separate ground segmentation step, which is one of our secondary contributions.

Figure 5. z-axis drift for all methods in the ablation study. By imposing a bounding box consistency constraint, z-axis drift is suppressed and odometry accuracy is improved.

D. Runtime analysis

As shown in Table V, our method runs roughly two times faster than LIMOT [18] and five times faster than LIO-SEGMOT [17]. Compared to the baselines, the proposed UKF-based 3D multi-object tracker requires less time for dynamic removal while maintaining similar accuracy. Besides its high efficiency, our odometry module also exhibits competitive localization accuracy. In summary, our system operates above 20 Hz, fulfilling real-time requirements.

V. Conclusion

This work presents TRLO, a LiDAR odometry method that remains accurate and robust in dynamic environments over long traversals. One of our key innovations is a 3D multi-object tracker based on the unscented Kalman filter and a nearest-neighbor strategy, which stably identifies and removes dynamic objects to provide static scans for a two-stage iterative closest point LiDAR odometry. In addition, efficient hash-based keyframe management is proposed for fast submap construction. Finally, we reuse the 3D bounding boxes to further optimize the computed poses, obtaining globally consistent poses and clean maps. Benchmarks in various challenging dynamic environments confirm the reliability of our approach, and we systematically explored the contribution of each component through ablation studies. In the future, we are interested in integrating an IMU into our system or adding loop closure for more robust pose estimation.