arXiv | Nanyang Technological University open-sources "Robust Loop Closure by Textual Cues in Challenging Environments"
Article link: [2410.15869] Robust Loop Closure by Textual Cues i...
Open-source repository: GitHub - TongxingJin/TXTLCD: This repository is fo...
Robust loop closure by textual cues in challenging environments
Abstract: Loop closure detection is an important task in robot navigation. However, most existing methods rely on implicit or heuristic features of the environment and can still fail in common settings such as corridors, tunnels, and warehouses. Indeed, navigating such featureless, degenerate, and repetitive (FDR) environments can be challenging even for humans, but explicit textual cues in the surroundings usually offer the best help. This inspires us to propose a multimodal loop closure approach based on explicit, human-readable textual cues in FDR environments. Specifically, our method first extracts scene text entities with Optical Character Recognition (OCR), then creates local maps of text cues based on accurate LiDAR odometry, and finally recognizes loop closure events via a graph-theoretic scheme. Experimental results show that the method outperforms existing approaches that rely only on vision or LiDAR sensors. For the benefit of the community, the source code and datasets are released at GitHub (TongxingJin/TXTLCD).
Index Terms: Loop closure, LiDAR SLAM, localization
I. Introduction
In recent years, LiDAR-inertial odometry (LIO) has become the backbone of mobile robotics [1]-[3]. Notably, the recent Livox Mid-360 3D LiDAR is now available at a cost similar to the Intel RealSense D455 camera while offering a wider field of view, longer range, and higher accuracy. For continuous localization, LiDAR-based approaches clearly demonstrate better accuracy and robustness than traditional visual SLAM and eliminate the need for it in most applications, as reflected in the Hilti SLAM Challenge [4]. However, LiDAR-based loop closure detection (LCD) approaches, such as Stable Triangle Descriptor (STD) [5], Scan Context (SC) [6], and Intensity Scan Context (ISC) [7], often struggle to find accurate matches in degenerate and repetitive environments. While vision-based LCD methods [8]-[10] provide higher-dimensional feature descriptors, most of them are sensitive to illumination and viewpoint changes.
Thus, vision-based LCD can fail despite dense feature representations, and LCD in general remains difficult in featureless, degenerate, and repetitive (FDR) environments under tight computational budgets. There is still a gap for an efficient, simple, and intuitive LCD solution for LIO in FDR environments that mirrors how humans localize.
The research presented in this paper is inspired by the observation that humans often rely on textual cues in their environment to determine their location. Indeed, these textual cues are often designed to help humans navigate FDR environments (Fig. 1) and take many forms, such as wayfinding signs, nameplates, and other language-based signage. TextSLAM [11], [12] was the first approach to tightly integrate scene text into a visual SLAM pipeline and demonstrated its effectiveness in a text-rich commercial plaza. In contrast to their approach, ours further exploits the spatial structure of the scene text to verify the veracity of loop closures, while operating effectively in environments with only moderate text density, which is typical of most real-world scenes.
Fig. 1. Examples of common FDR scenarios in which humans navigate naturally using readable text symbols and their spatial arrangement.
This inspired us to use textual cues to understand global locations.
Based on this inspiration, we propose a multimodal (MM) loop closure solution that exploits scene text cues in FDR scenes. Specifically, we use well-established visual Optical Character Recognition (OCR) technology to detect scene text entities near the current position, and then, based on a low-drift LIO, create a Local Text Entity Map (LTEM) that encodes the particular spatial arrangement of these texts, which serves as a token to verify the authenticity of candidate closures. The resulting loop closure constraints are introduced into the back-end pose graph to enhance the robustness and accuracy of state estimation in typical FDR environments. The contributions of our work can be summarized as follows:
1) We introduce a novel approach to text entity representation, estimation, and management by fusing LiDAR and vision data, which supports efficient loop closure retrieval and alignment.
2) We propose an association scheme for observations of the same text entity, which is then used to create loop closures and improve state estimation accuracy. In particular, we employ a graph-theoretic approach to verify the veracity of candidate loop closures.
3) We combine our approach with LiDAR odometry to form a SLAM framework and conduct extensive experiments to demonstrate its competitive performance compared to state-of-the-art (SOTA) methods.
4) We release our source code and survey-grade, high-precision datasets for the benefit of the community.
II. Related work
Vision and LiDAR fusion for global positioning is a common problem in perception tasks, which has been addressed in various previous studies.
Traditionally, global localization can be achieved by visual odometry or SLAM methods. However, visual methods often lack robustness when dealing with featureless regions, illumination variations, and distant objects [13]. Visual factors must often be supplemented with others, such as IMU or UWB factors [14], to improve efficiency and robustness. Recently, LiDAR methods have become mainstream for front-end odometry estimation because LIO [1], [3] consistently produces superior results compared to most real-time vision-fused methods. The popularity of vision-based methods for robot localization has declined with the advent of new low-cost LiDAR solutions [4].
Another way to achieve global localization is through LCD. Traditionally, vision-based methods with hand-crafted features have been dominant. DBoW2 [8] enabled real-time LCD using a binary bag-of-words model built on BRIEF features [15]. In recent years, learning-based approaches [9] have dominated LCD due to their better handling of viewpoint and appearance changes. As an extension of [9], a visual place recognition model based on optimal transport aggregation was proposed in [10], achieving SOTA results on many benchmarks. However, vision-based LCD is still far from perfect due to limited geometric understanding, FDR environments, and illumination variations.
Recently, LiDAR-based LCD methods have been widely explored in field robotics thanks to their accurate geometric measurements and illumination invariance. The SC family is currently the most popular line of LiDAR-based LCD [6], [16]; its main idea is to use projections and spatial partitions to encode the entire point cloud, and later work has extended the idea by integrating intensity [7] and semantic information [17]. However, this family of methods cannot estimate the full SE3 relative pose between candidate frames and relies on odometry poses to reject false loops, making it vulnerable to significant odometry drift.
STD [5] creates triangle-based descriptors by aggregating local point features, using the lengths of the edges as keys in a hash table to find loop closure candidates via a voting scheme. A recent work called Binary Triangle Combined (BTC) [18] combines STD with binary patterns to improve speed and viewpoint invariance; BTC is currently in a preview phase and is not yet available for open-source verification. However, these methods encounter difficulties in FDR scenarios, where similar spatial shapes, intensities, and semantics may lead to ambiguities in loop closure.
TextSLAM [11], [12] is the first visual SLAM framework that tightly integrates scene text into a point-based pipeline. It selects the top ten historical keyframes observing the most co-visible text objects as loop closure candidates. This co-visibility requirement demands that multiple text objects be visible in the loop closure frame, which limits its applicability in real-world environments. To overcome this limitation, we create local text entity maps with the help of a low-drift LIO and check the veracity of candidate closures using the spatial arrangement of the scene text, thus remaining valid in more common scenes with moderate text density.
III. Methodology
In this section, we describe how scene text is represented as text entities with content and pose attributes and how they are observed in LiDAR frames. We then explain how associations between text observations are created and how the veracity of candidate loop closures is verified with a graph-theoretic approach. The workflow of our approach is shown in Fig. 2.
Fig. 2. Pipeline of loop closure based on textual cues. Camera and LiDAR data are fused to estimate text entity poses and create local text entity maps that encode the specific arrangement of scene text.
A graph-theoretic scheme is applied to verify the authenticity of candidate closures retrieved from an online database, and pose graph optimization is performed whenever a new loop is closed to mitigate accumulated odometry drift and ensure global map consistency.
Notation: we define the following main coordinate systems: the world coordinate system W, the LiDAR coordinate system L, and the camera coordinate system C. We use $T^{W}_{L_t}$ to denote the LiDAR's SE3 pose in the world coordinate system at timestamp $t$. For simplicity, we may omit the superscript $W$ and write $T_{L_t}$ or $T_L$. The back-end pose graph optimization uses the LiDAR poses $\{T_{L_t}\}_{t=t_0}^{t_n}$ as nodes. Similarly, $T^{L}_{C}$ denotes the extrinsic transform between the camera and the LiDAR, and $T^{C}_{\text{text}}$ and $T^{L}_{\text{text}}$ are the text entity poses expressed in the camera and LiDAR frames, respectively.
A. Observation of textual entities
We abstract scene text as a text entity with two attributes: text content and an SE3 pose. The text content is the string recognized by OCR, while the pose observation is obtained by fusing camera and LiDAR measurements.
1) Text content interpretation: OCR is a well-established technique that first localizes text regions in an image as polygons and then converts the regions of interest into readable text. In our implementation, we use AttentionOCR [19] to extract scene text; it provides confidence scores that help filter out unreliable recognition results.
2) Text entity representation: inspired by TextSLAM [11], [12], we assume that scene text entities usually lie on flat surfaces or local planes, for example notices on bulletin boards, room numbers, nameplates on fire-fighting equipment, and emergency exit signs. As shown in Fig. 3, we define the midpoint of the left edge of the scene text region as the origin of the text entity. The x-axis points toward the midpoint of the right edge of the text, the z-axis is aligned with the normal of the local plane and points toward the camera, and the y-axis is determined by the right-hand rule.
Figure 3. Graphical representation of textual entities
3) Pose estimation: to estimate the SE3 pose of a text entity in the camera frame, we first accumulate the LiDAR scans of the past second into a local cloud map of the current location and project them into the camera frame through the extrinsic transform between the LiDAR and the camera:
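In homogeneous coordinates, this is the standard rigid-body transform:

$$
p^{C} = T^{C}_{L}\, p^{L},
$$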
where $p^{L}$ is the coordinate of a LiDAR point in the LiDAR frame, $T^{C}_{L}$ is the extrinsic transform from the LiDAR frame to the camera frame, and $p^{C}$ is the point's coordinate in the camera frame. The LiDAR points are then further projected into image coordinates:
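With $z_C$ denoting the depth of $p^{C}$ along the optical axis (a symbol introduced here for clarity), the standard pinhole projection reads:

$$
z_C \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K\, p^{C},
$$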
where $K$ is the intrinsic matrix of the camera and $[u, v]^{\top}$ are the pixel coordinates of the projected LiDAR point.
Since scene text is usually attached to a local plane, the plane parameters in the camera frame can be estimated by running RANSAC on the projected LiDAR points that fall within the detected text region. We denote the plane in the camera frame as:
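For a unit normal $n$, the plane equation is:

$$
n^{\top} p + d = 0,
$$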
where $n$ is the normal of the plane, $p$ is an arbitrary point on the plane, and $d$ is the distance from the camera's optical center to the plane. Given the plane parameters $(n, d)$ and the projected coordinates $[u, v]$ of a point $p^{C}$, the depth of the point can be recovered as follows:
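Since $p^{C}$ lies on the back-projected ray $p^{C} = z_C\, K^{-1}[u, v, 1]^{\top}$, substituting this into the plane equation gives:

$$
z_C = \frac{-d}{n^{\top} K^{-1} [u, v, 1]^{\top}}, \qquad p^{C} = z_C\, K^{-1} [u, v, 1]^{\top}.
$$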
Each text entity detected by OCR carries a bounding box. We denote the midpoints of the left and right sides of the bounding box as $p^{C}_{l}$ and $p^{C}_{r}$, respectively. We choose $p^{C}_{l}$ as the position of the text entity and $n_x \triangleq \dfrac{p^{C}_{r} - p^{C}_{l}}{\lVert p^{C}_{r} - p^{C}_{l} \rVert}$ as the unit vector of the x-axis. Therefore, the pose matrix of the text entity is defined as:
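With $n_z$ the unit plane normal oriented toward the camera and $n_y \triangleq n_z \times n_x$ completing the right-handed frame, one consistent way to assemble this pose is:

$$
T^{C}_{\text{text}} = \begin{bmatrix} n_x & n_y & n_z & p^{C}_{l} \\ 0 & 0 & 0 & 1 \end{bmatrix}.
$$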
Since the camera and LiDAR are different modal sensors triggered at different time points, the text entity is further anchored to the most recent LiDAR frame, with timestamp $t_i$ prior to the image timestamp $t_j$, and its SE3 pose in that LiDAR frame is represented as:
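Writing $T^{L_i}_{L_k}$ for the odometry-derived relative pose between the two bracketing LiDAR frames (a symbol introduced here), one consistent form of this anchoring is:

$$
T^{L_i}_{\text{text}} = \mathrm{interpolate}\!\left(T^{L_i}_{L_k},\ \frac{t_j - t_i}{t_k - t_i}\right) T^{L}_{C}\, T^{C}_{\text{text}},
$$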
where $t_i$ and $t_k$ are the two LiDAR timestamps immediately before and after the image timestamp $t_j$, respectively; $\mathrm{interpolate}(T, s)$ is a linear interpolation between the identity transform and $T$ by a factor $s \in (0, 1)$; and $T^{L_i}_{\text{text}}$ is the SE3 pose of the text entity in its anchored LiDAR frame. For simplicity, in the following we only deal with the pose of the text entity with respect to its LiDAR frame, $T^{L}_{\text{text}}$.
B. Text Observation Management
To support efficient loop closure storage, retrieval, and alignment, we keep all historical text entity observations in a text entity observation database, which consists of a text dictionary and a frame dictionary implemented as hash maps (Fig. 2). The text dictionary uses text strings as keys and, as values, the indices of all LiDAR frames that observe that text content together with the estimated text entity poses, enabling fast retrieval of candidate frames observing specific text content. The frame dictionary uses the frame index as key and the contents and estimated poses of all text entities observed in that frame as values, which helps to create a local text entity map in the vicinity of a candidate frame.
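As a minimal sketch (the field names and types are illustrative, not the paper's released implementation), the two dictionaries could look like this in Python:

```python
from dataclasses import dataclass, field
from collections import defaultdict

import numpy as np


@dataclass
class TextObservation:
    frame_idx: int            # index of the anchoring LiDAR frame
    content: str              # OCR string, e.g. "S1-B4c-14"
    T_L_text: np.ndarray      # 4x4 SE3 pose of the entity in that LiDAR frame


@dataclass
class TextEntityDatabase:
    # text dictionary: text string -> all observations of that string
    text_dict: dict = field(default_factory=lambda: defaultdict(list))
    # frame dictionary: frame index -> all observations made in that frame
    frame_dict: dict = field(default_factory=lambda: defaultdict(list))

    def insert(self, obs: TextObservation) -> None:
        self.text_dict[obs.content].append(obs)
        self.frame_dict[obs.frame_idx].append(obs)

    def frames_observing(self, content: str) -> list:
        """Candidate frames for loop closure: frames that saw the same string."""
        return [o.frame_idx for o in self.text_dict.get(content, [])]

    def entities_in_frame(self, frame_idx: int) -> list:
        """All text entities observed in a given frame (used to build an LTEM)."""
        return self.frame_dict.get(frame_idx, [])
```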
C. Text entity-based loop closure detection and alignment
Scene text is found in a wide variety of environments and provides insight into the function and location of the entities it labels. Unlike QR codes or other artificial landmarks, scene text does not require dedicated deployment and integrates seamlessly with human navigation. We classify scene text into two categories: ID text and generic text. ID text is address-like text that identifies a specific room or object, while generic text is everything else, such as "exit", "danger", and "power". We apply different loop closure detection strategies depending on the category.
1) ID text: ID texts follow conventions designed by humans to identify specific objects within a building or map. For example, S1-B4c-14 indicates building S1, fourth basement level, area c, room 14, while S2-B3c-AHU3 indicates building S2, third basement level, area c, Air Handling Unit 3. Such text can be picked out according to a predefined pattern for the application environment.
ID texts are usually designed to be unique, such as door or device numbers, so repeated detection of the same ID text content at different times indicates a high likelihood of a loop closure. The relative pose prior $\bar{T}^{L_i}_{L_j}$ between the current pose $T_{L_i}$ and the candidate loop closure pose $T_{L_j}$ is computed as follows:
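Since both poses observe the same physical entity, the prior follows from chaining the two entity poses:

$$
\bar{T}^{L_i}_{L_j} = T^{L_i}_{\text{text}} \left( T^{L_j}_{\text{text}} \right)^{-1}. \tag{9}
$$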
However, an ID text can also have multiple instances at different locations; for example, a room may have multiple doors with the same number. Therefore, we use an ICP check to rule out false loop candidates in such cases. In addition, ICP provides a more accurate relative pose prior than (9), which benefits the global pose graph optimization.
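A sketch of such a verification step, assuming Open3D is available and using the text-entity prior from (9) as the initial guess; the fitness and RMSE thresholds are illustrative:

```python
import numpy as np
import open3d as o3d


def verify_loop_with_icp(source_scan: np.ndarray,
                         target_scan: np.ndarray,
                         T_prior: np.ndarray,
                         max_corr_dist: float = 0.5,
                         min_fitness: float = 0.6,
                         max_rmse: float = 0.3):
    """Refine the relative pose prior with ICP and reject weak alignments.

    source_scan, target_scan: Nx3 point arrays of the two candidate frames.
    T_prior: 4x4 initial relative pose, e.g. from Eq. (9).
    Returns (accepted, refined_T).
    """
    src = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(source_scan))
    tgt = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(target_scan))

    result = o3d.pipelines.registration.registration_icp(
        src, tgt, max_corr_dist, T_prior,
        o3d.pipelines.registration.TransformationEstimationPointToPoint())

    accepted = result.fitness > min_fitness and result.inlier_rmse < max_rmse
    return accepted, result.transformation
```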
2) Generic text: in general, a large portion of scene text does not carry unique location information and may occur multiple times within a scene, e.g., "exit", "no parking", and "stop". Associations between such text entities can be ambiguous. To address this, we create a Local Text Entity Map (LTEM) by aggregating text entities near the current location. The LTEM encodes a spatial arrangement and can be used as a signature of the current pose to verify the authenticity of candidate loops with other poses, as we explain below.
Specifically, an LTEM is the set of all text entities, including both ID text and generic text, observed by a set of LiDAR odometry poses. Assuming we observe a text entity $E_c$ at the current pose $T_c$ (subscript c stands for current), we define $M_c$ as the LTEM containing all text entities observed by the consecutive poses $\mathcal{T}_c = \{T_{c-w}, \ldots, T_c\}$, where $T_{c-w}$ is the earliest pose within a certain distance $d$ of $T_c$. Note that $M_c$ may also contain text entities whose contents (i.e., text strings) differ from that of $E_c$.
We then use the content of $E_c$ to retrieve from the text dictionary all past poses that observed text entities with the same content. Denote this set of poses as $\mathcal{T}$. For each candidate previous pose $T_p \in \mathcal{T}$, we denote by $E_p$ the text entity with the same content as $E_c$ observed from $T_p$. We then construct the LTEM of all text entities observed by the consecutive poses $\mathcal{T}_p = \{T_{p-w}, \ldots, T_{p+v}\}$, where $T_{p-w}$ and $T_{p+v}$ are the earliest and latest poses within the same distance $d$ of $T_p$, respectively. We denote this LTEM as $M_p$ (Fig. 4).
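A minimal sketch of building such an LTEM from the frame dictionary, reusing the TextEntityDatabase sketch above (the distance test, the lookahead flag, and the default radius are simplified assumptions):

```python
import numpy as np


def build_ltem(db, poses, center_idx, d=15.0, lookahead=False):
    """Collect text entities observed by consecutive frames around center_idx
    whose poses stay within distance d of the center pose (simplified LTEM).

    poses: dict mapping frame index -> 4x4 world pose from LIO.
    lookahead=True is used for past candidates (poses T_{p+1} ... T_{p+v}).
    """
    center_t = poses[center_idx][:3, 3]

    def within(idx):
        return idx in poses and np.linalg.norm(poses[idx][:3, 3] - center_t) <= d

    entities = list(db.entities_in_frame(center_idx))
    idx = center_idx - 1
    while within(idx):                    # walk backwards: T_{c-w} ... T_{c-1}
        entities.extend(db.entities_in_frame(idx))
        idx -= 1
    idx = center_idx + 1
    while lookahead and within(idx):      # walk forwards for past candidates
        entities.extend(db.entities_in_frame(idx))
        idx += 1
    return entities
```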
Given $M_c$ and $M_p$, we construct a set of association relations $\mathcal{A} \triangleq \{a_i, \ldots\} = \{(E^c_i, E^p_i), \ldots\}$, where $E^c_i \in M_c$, $E^p_i \in M_p$, and $E^c_i$, $E^p_i$ share the same textual content. The set $\mathcal{A}$ is called the set of putative associations. Clearly, $\mathcal{A}$ may contain incorrect associations due to duplicated textual contents. As shown in Fig. 4, the associations $a_1$, $a_2$, and $a_3$ are mutually exclusive because they associate the same text entity from $M_c$ with three different entities from $M_p$.
The affinity between the putative associations in $\mathcal{A}$ can be represented by a consistency graph $G$, as shown in Fig. 5(a). The nodes of the consistency graph correspond to the putative associations in Fig. 4, an edge between two nodes $a_i$ and $a_j$ indicates their compatibility, and the darkness of the edge further indicates the geometric consistency score computed as follows:
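With the symbols defined below, one consistent form of this score is:

$$
c_{ij} = s\!\left( \left|\, \lVert p_i - p_j \rVert - \lVert q_i - q_j \rVert \,\right| \right),
$$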
where $p_i$ and $q_i$ are the positions of the two text entities associated by $a_i$, $p_j$ and $q_j$ are the positions of the text entities associated by $a_j$, $\lVert \cdot \rVert$ denotes the Euclidean norm, and $s : \mathbb{R} \to [0, 1]$ is a loss function satisfying $s(0) = 1$ and $s(x) = 0$ for $x > \varepsilon$, where $\varepsilon$ is a threshold. The score expresses that the distance between two entities in one LTEM should match the distance between the corresponding entities in the other LTEM, since the LiDAR odometry drift within an LTEM is negligible.
Next, we want to find a fully connected subgraph $G^* \subset G$ (Fig. 5(b)) and the corresponding subset of nodes $\mathcal{A}^* \subset \mathcal{A}$ such that any pair of associations $a_i$ and $a_j$ in $\mathcal{A}^*$ are consistent with each other. This is a variant of the maximum clique problem, which CLIPPER [20] formulates as finding the densest subgraph $G^*$. In this work, we use CLIPPER to solve it.
Once a set $\mathcal{A}^*$ with at least three elements is identified and $(E_c, E_p) \in \mathcal{A}^*$, the two entities $E_c$ and $E_p$ are used to construct the relative pose constraint $\bar{T}^{p}_{c}$ that closes the loop, similar to (9). We then iterate over all other poses in $\mathcal{T}$ to find all possible loop closure constraints $\bar{T}^{p}_{c}$. Algorithm 1 summarizes the overall process.
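A compact sketch of this verification step; the linear loss function, the threshold, and the greedy clique search below are simplified stand-ins for the CLIPPER solver used in the paper:

```python
import numpy as np


def consistency_score(pi, pj, qi, qj, eps=0.5):
    """s(|‖pi - pj‖ - ‖qi - qj‖|): 1 for perfectly consistent pairs, 0 beyond eps."""
    x = abs(np.linalg.norm(pi - pj) - np.linalg.norm(qi - qj))
    return max(0.0, 1.0 - x / eps)


def verify_candidate(associations, eps=0.5, min_inliers=3):
    """associations: list of (p, q) position pairs, p in Mc, q in Mp.

    Returns the indices of a mutually consistent subset, or [] if too small."""
    n = len(associations)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            pi, qi = associations[i]
            pj, qj = associations[j]
            A[i, j] = A[j, i] = consistency_score(pi, pj, qi, qj, eps)

    # Greedy densest-clique stand-in: seed with the node of highest total
    # affinity, then add nodes that are consistent with everything selected.
    order = np.argsort(-A.sum(axis=1))
    selected = []
    for k in order:
        if all(A[k, m] > 0.0 for m in selected):
            selected.append(int(k))
    return selected if len(selected) >= min_inliers else []
```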
We note that in the multimodal LCD and alignment scheme above, the LIO output is tightly integrated with the visual detections, first for estimating text entity poses and second for constructing the LTEMs. The high short-term accuracy of LIO is critical to the performance of our approach and cannot be matched by VIO, due to its poor depth perception and larger localization drift.
Fig. 4. Putative associations between two LTEMs. The LTEMs Mc and Mp contain the sets of text entities observed by the consecutive LiDAR poses Tc (green track) and Tp (blue track), respectively. Text entities with the same text content are shown as balls of the same color, connected by purple lines to indicate putative associations between the two LTEMs. The only EXIT in Mc is associated with three different entities in Mp. While a1 is the only correct association, a2 and a3 (purple dashed lines) should be rejected by our graph-theoretic loop closure verification method.
Fig. 5. Consistency graph. The darkness of an edge indicates the geometric consistency between the two connected nodes (putative associations).
IV. Experiments
In this section, we discuss the construction of our dataset and the comparison with existing SOTA methods. All experiments were performed on a laptop with an Intel i7-10875H CPU @ 2.30 GHz and an NVIDIA GeForce RTX 2060 GPU. Video summaries of our experiments can be viewed via the project links listed at the beginning of this article.
A. Dataset and experimental setup
To the best of our knowledge, existing public datasets pay little attention to textual cues. The dataset in [11] contains sufficient textual cues, but we were unable to produce accurate local text entity maps from it due to the lack of LiDAR data. Another key requirement is sufficient overlap between the camera and LiDAR fields of view. Given these requirements, we found no public dataset suitable for text-based visual-LiDAR loop closure studies.
To fill this gap, we collected high-quality datasets for multimodal LCD in repetitive and degenerate scenes. Our setup consists of a camera with a resolution of 1920 × 1080, a Livox Mid-360 LiDAR, and its embedded IMU. For ground truth, we use a Leica MS60 scanner to create high-precision prior point cloud maps of the environment and then align the LiDAR point clouds with these prior maps to obtain ground truth trajectories, similar to [4], [21], [22]. In total, 8 data sequences were collected from 3 different FDR scenarios: indoor corridors, semi-outdoor corridors, and cross-floor buildings, with lengths ranging from 200 to 500 meters. Their trajectories are shown in Fig. 6.
Fig. 6. Graphical representation of the trajectories in our dataset. The blue lines indicate the normal path, while red indicates the locations of loop closure events (Section IV-B.1). Sequences 1, 2, and 3 are captured on the same floors as 4, 5, and 6, while sequences 7 and 8 span different floors and vertical staircases. In (a), (b), (c), and (d), we show very similar scenes from different corridors in sequences 1 and 4.
We compare our approach with other popular open-source SOTA works, including SC [6], ISC [7], and STD [5]. To ensure a fair comparison, we integrate FAST-LIO2 [1] with the different loop closure methods to form complete SLAM systems for evaluation. We keep all parameters unchanged, except that the ikd-Tree map size is set to 100 m × 100 m with a resolution of 0.2 m and the scans are downsampled with a voxel resolution of 0.1 m.
While our approach is designed to address the LiDAR loop closure problem in FDR scenes, we also feed the image sequences from our dataset into DBoW2 [8] and SALAD [10] to evaluate their recall and precision, since our approach uses the camera to detect text.
B. LCD Recall and Precision Analysis
1) True loop closure events: based on the ground truth poses, we evaluate each pose to determine whether a loop closure should occur there. Specifically, for a pose $T_k$ we find the set $N_k \triangleq \{T_p : \lVert T_k \boxminus T_p \rVert < \tau \;\wedge\; S(T_k, T_p) > 10\,\text{m}, \ \forall p < k\}$, where $\tau$ is a Euclidean distance threshold, set to 1.0 m and 1.7 m in our experiments, and $S(T_k, T_p) \triangleq \sum_{i=p}^{k-1} \lVert T_{i+1} \boxminus T_i \rVert$ is the traveled distance. $T_k$ is labeled as a loop closure pose if $N_k \neq \emptyset$.
For each loop closure method, we evaluate its recall and precision. Based on the check of $N_k$ above, the method's prediction at pose $T_k$ is counted as a TP, FP, TN, or FN. The recall is then $\mathrm{TP}/(\mathrm{TP} + \mathrm{FN})$ and the precision is $\mathrm{TP}/(\mathrm{TP} + \mathrm{FP})$.
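A small sketch of this evaluation, assuming ground-truth positions and per-pose predictions are given as arrays (the ⊟ operator is reduced to Euclidean translation distance here for simplicity):

```python
import numpy as np


def label_loop_poses(positions: np.ndarray, tau: float = 1.0,
                     min_travel: float = 10.0) -> np.ndarray:
    """positions: (N, 3) ground-truth positions. Returns a boolean array that is
    True where a loop closure should occur (some earlier pose lies within tau
    while the traveled distance back to it exceeds min_travel)."""
    steps = np.linalg.norm(np.diff(positions, axis=0), axis=1)
    cum = np.concatenate([[0.0], np.cumsum(steps)])      # traveled distance
    labels = np.zeros(len(positions), dtype=bool)
    for k in range(len(positions)):
        dists = np.linalg.norm(positions[:k] - positions[k], axis=1)
        travel = cum[k] - cum[:k]
        labels[k] = np.any((dists < tau) & (travel > min_travel))
    return labels


def recall_precision(labels: np.ndarray, predictions: np.ndarray):
    """labels, predictions: boolean arrays over all poses."""
    tp = np.sum(labels & predictions)
    fp = np.sum(~labels & predictions)
    fn = np.sum(labels & ~predictions)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision
```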
2) Recall: As shown in Table I, SALAD achieves the best recall when τ = 1.0 m, over 70% in most cases. Both our method and SC show competitive results, recalling more than 50% of the loops in 4 sequences. The main factor limiting our recall is the repeatability of the OCR module, i.e., its ability to consistently produce the same text string across multiple observations of the same text entity.
ISC and STD have the lowest recall, usually below 10%, because they use tighter thresholds to confirm true loop closures and thus achieve relatively higher precision than SC.
For loop closure, a higher recall does not necessarily indicate a decisive advantage, since detecting at least one accurate loop in each repeated corridor is enough to significantly reduce odometry drift. Precision is more critical, as incorrect loop closures can corrupt the global pose estimation and map construction. In Section IV-C, the adequacy of our recall is confirmed by the effective reduction of the absolute translation error once the detected loops are incorporated into the pose graph optimization.
3) Precision: As shown in Table I, SALAD shows competitive performance when τ = 1.0 m, taking five second places, while ISC stands out as the most effective pure-LiDAR loop closure method, achieving a precision above 80% on four sequences. DBoW2 performs well in Sequences 1 and 2 (indoor corridors), reaching precisions above 80%; however, in Sequences 4 and 5 its performance drops significantly to around 40%. Our FDR dataset is indeed challenging for DBoW2, as it incorrectly matches similar-looking locations (e.g., (a) and (b), (c) and (d) in Fig. 6) that can be difficult to distinguish even for humans.
The multi-story buildings of Sequences 7 and 8 are typical repetitive scenes with very similar layouts on different floors, as shown in Fig. 6(c) and (d). Indeed, even humans may find it difficult to distinguish the corridors without the help of text indicators. A significant drawback of all compared methods is their tendency to produce catastrophic false closures. Fig. 7 shows the trajectories produced by several methods in Sequence 8. Both ISC and STD incorrectly associate frames from different floors as loop closures, causing their trajectories to deviate from the ground truth. In contrast, our method can use room or device numbers as textual cues to distinguish floors and avoid forming erroneous loops.
Fig. 7. Trajectories in Sequence 8. The STD and ISC trajectories deviate significantly from the ground truth, with green dashed arrows indicating the direction of convergence. In contrast, our trajectory stays consistently close to the ground truth.
In addition, the precision of all SOTA methods is significantly lower on the partially overlapping sequences than on the fully overlapping ones, even though they were collected in the same environments (Sequence 3 vs. 1-2 and Sequence 6 vs. 4-5). This reveals their tendency to predict incorrect loop closures, which becomes even more pronounced when the trajectory overlap is relatively low.
In contrast, our method consistently achieves the best performance, with a precision above 95% on all sequences, thanks to our graph-theoretic loop closure verification scheme, which efficiently exploits the spatial arrangement of text entities. The remaining errors stem from the strict threshold τ = 1.0 m used to define true loop closure events, as described in Section IV-B.1. If the threshold is slightly relaxed to 1.7 m, our method achieves 100% precision while maintaining the same recall, as shown in Table I.
C. Pose Graph Optimization Error Assessment
We use FAST-LIO2 as the front-end odometry and perform global pose graph optimization while detecting loop closures. The FDR environment poses significant challenges to visual odometry or SLAM methods such as ORB-SLAM [23], as many images face walls with few features that can be continuously extracted or tracked. Meanwhile, the vision-based method SALAD is not designed for loop closure and cannot directly output relative pose estimates for subsequent global pose optimization. Therefore, we only analyze the pose errors of the different LiDAR LCD methods when integrated with FAST-LIO2, evaluated using EVO [24].
As shown in Table II, by exploiting textual cues for loop closures, our approach effectively reduces odometry drift across all datasets, consistently achieving the lowest average translation error. In contrast, ISC and STD frequently report erroneous loop closures, resulting in higher average errors than the raw odometry poses. The main challenge in our dataset is its symmetric and repetitive layout, as shown in Fig. 6(a)-(d).
Although we avoid forming loop closures between neighboring poses within a 10 m travel distance, as described in Section IV-B.1, SC can retrieve loops between poses whose travel distance is only slightly larger than this threshold, introducing relative pose constraints between non-neighboring poses before a true loop is closed; this yields a smaller average translation error for SC compared to the other baselines.
In addition to the average translation error, Fig. 8 shows the error distributions of the different methods over three sequences. Our method consistently achieves the lowest error, and its upper error bound is also lower than those of the other methods, because all the loops it creates are genuine and introduce no false constraints into the pose graph. Moreover, the distribution of our localization error remains consistent across sequences.
Fig. 8. Error distributions of the different methods.
D. Runtime analysis
We evaluate the time cost of the different stages of our method on sequences 1, 4, and 7. The results in Table III indicate that OCR is the most time-consuming part; however, it could be replaced by faster OCR methods in the future.
V. Conclusion
To fill the gap left by existing navigation methods in FDR scenes, we propose a loop closure scheme that exploits textual cues in the scene, inspired by human navigation. Our approach fuses LiDAR and vision information to observe text entities in the environment and verifies the veracity of candidate loop closures through a graph-theoretic scheme. We collected multiple datasets from FDR scenes and conducted comprehensive comparative experiments to demonstrate the competitiveness of our approach. Our source code and datasets are released to the community.
Appendix:
I. What is OCR technology?
Optical Character Recognition (OCR) is a software technology that converts scanned images of varying quality into editable text formats (e.g., PDF, Word). It can recognize and process printed text, handwritten text, and scene text. OCR is widely used in document scanning, invoice recognition, license plate recognition, and document information extraction.
The basic process of OCR technology includes the following steps (a minimal code sketch follows the list):
1. **Image pre-processing**: This includes denoising, binarization, skew correction, etc., with the aim of improving the quality of the image and making the text easier to recognize.
2. **Text detection**: Locate the text areas in the image and determine the position and shape of the text.
3. **Character segmentation**: Splits the detected text area into individual characters.
4. **Character recognition**: The segmented characters are recognized and converted into the corresponding text characters.
5. **Post-processing**: Includes steps such as proofreading and error correction to improve recognition accuracy.
6. **Output**: Output of recognition results in editable text format.
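As a minimal sketch of these steps using common open-source tools (OpenCV for preprocessing and Tesseract via pytesseract for detection and recognition; the filename, thresholding choices, and language setting are illustrative):

```python
import cv2
import pytesseract


def ocr_image(path: str, lang: str = "eng") -> str:
    """Run a simple OCR pipeline: preprocess, then detect and recognize text."""
    image = cv2.imread(path)

    # 1. Preprocessing: grayscale conversion, denoising, and binarization.
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    gray = cv2.medianBlur(gray, 3)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # 2-4. Text detection, segmentation, and recognition are handled internally
    #      by the Tesseract engine.
    text = pytesseract.image_to_string(binary, lang=lang)

    # 5. Post-processing: here just trimming whitespace; real systems would add
    #    dictionary-based correction or spell checking.
    return text.strip()


if __name__ == "__main__":
    print(ocr_image("scanned_page.png"))
```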
With the development of deep learning technology, OCR technology has significantly improved in terms of recognition accuracy and processing speed. Many OCR service providers now use deep learning-based models such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to improve recognition of complex scenes and different fonts.
If you need to convert paper documents to electronic text, you can use commercially available OCR software such as Adobe Acrobat, ABBYY FineReader, etc., or use an online OCR service. These tools usually provide user-friendly interfaces that make document conversion easy and fast.
II. What are LiDAR-based loop closure detection (LCD) approaches?
LiDAR-based loop closure detection (LCD) is a key component of SLAM (Simultaneous Localization and Mapping) systems. It aims to recognize places the robot has already visited so that the accumulated drift in the map and trajectory can be corrected. Some LiDAR-based loop closure detection methods are described below:
1. **OverlapNet**:
- OverlapNet, open-sourced by the Photogrammetry and Robotics Lab at the University of Bonn, Germany, is a method for loop closure detection in LiDAR SLAM.
- Without prior pose information, it directly estimates the overlap and relative yaw angle between two LiDAR scans using a deep neural network.
- It combines odometry information with the predicted overlap to achieve loop closure detection and correction.
- It can estimate loop closure information in challenging environments and generalizes well across different datasets.
2. **LCDNet**:
- LCDNet is an end-to-end system designed to address loop closure detection for self-driving cars and other mobile robots.
- By combining deep learning and geometric methods, LCDNet can efficiently detect potential loop closures and perform matching between point clouds.
- It contains a deep neural network architecture that efficiently processes point cloud data to recognize potential loop closure scenarios and includes a fast point cloud registration module.
3. **Voxel-SLAM**:
- Voxel-SLAM, proposed by the University of *, is a complete, accurate, and versatile LiDAR-inertial SLAM system.
- It achieves real-time estimation and high-precision mapping by fully exploiting short-term, mid-term, long-term, and multi-map data associations.
- Its loop closure detection mitigates drift by using long-term data association and corrects accumulated errors through pose graph optimization.
4. **Point-cloud-based loop closure detection**:
- A fast and complete laser SLAM system based on point cloud loop closure detection, which detects loops by computing 2D histograms of keyframes and local map patches.
- The normalized cross-correlation of the 2D histograms is used as the similarity metric between the current keyframe and keyframes in the map, which is fast and rotation-invariant.
5. **LiDAR SLAM loop closure detection based on multi-scale point cloud feature Transformers**:
- A loop closure detection method based on multi-scale point cloud feature extraction and Transformer-based global context modeling is proposed.
- Sparse voxel convolution is used to obtain point cloud features at different resolutions, and a Transformer network establishes the contextual relationships between features across resolutions.
6. **Intensity-based LiDAR SLAM and loop closure detection**:
- A novel intensity-based LiDAR SLAM framework is proposed that emphasizes the importance of LiDAR intensity information in feature-sparse environments.
- Loop closure detection uses intensity cylindrical-projection shape context descriptors together with a binary loop candidate verification strategy.
These methods demonstrate the diversity of loop closure detection in LiDAR SLAM and the integration of deep learning techniques to improve its accuracy and robustness. With further research, these methods hold promise for autonomous driving, robot navigation, and other fields.