TaurusDB database- and table-level point-in-time fast recovery: dramatically shortening data recovery time

Experimental comparisons show a significant improvement when only a few tables need to be recovered from a large instance. Especially for gaming workloads and other scenarios that require frequent archiving, the feature significantly reduces the downtime caused by data recovery. We will gradually open this feature on public clouds to benefit more users.

This article is shared from the Huawei Cloud Community post "[Huawei Cloud MySQL Technical Column] TaurusDB Library Table Point-in-Time Recovery", author: GaussDB Database.

1. Background

Cloud customers often delete tables or databases by mistake. For such problems, the industry generally provides database- and table-level recovery solutions: the full and incremental data for the selected point in time are first restored to a temporary instance in the background, and the tables the user needs are then automatically exported and imported into the original instance, which reduces the impact on the original instance.

However, to guarantee data integrity, this process usually requires a complete recovery of the entire instance, and the long recovery time leaves customers dissatisfied with the solution. The problem is most visible when the table data to be recovered is much smaller than the instance: recovering only 20 MB of table data in a 3 TB instance still requires a PITR (Point-in-Time Recovery) of the entire 3 TB instance, followed by the export and import of the table data. This is inefficient and hard to justify.

To address these issues, TaurusDB draws on its own architecture to optimize the table-level recovery process and introduces a table-level fast recovery solution. The recovery time then depends only on the amount of data in the tables to be recovered rather than on the size of the entire instance, which dramatically reduces the RTO (Recovery Time Objective) and improves service availability.

2. Principles

2.1 Multi-range segmented download

The TaurusDB cloud-native database adopts an architecture that separates compute from storage. For its backup principle, see the official website documentation at /usermanual-gaussdbformysql/gaussdbformysql_03_0052.html

The smallest management unit of TaurusDB storage is a 64 MB plog. Within a plog, page data is stored discretely at a granularity of 16 KB. Fine-grained data recovery therefore relies on the multi-RANGE download capability provided by Huawei Cloud Object Storage Service (OBS).

Figure 1 Multi-RANGE download (example only)

As shown in Figure 1, the table data scattered across multiple plogs is downloaded and merged into a new plog, and the location is then updated in the log directory.
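To make the range computation concrete, here is a minimal Python sketch. It is not TaurusDB or OBS SDK code. It assumes the pages of interest are identified by their 16 KB slot index inside one 64 MB plog, and the names `coalesce_page_ranges` and `range_header` are hypothetical. It merges adjacent pages into contiguous byte ranges and formats them in the standard HTTP `Range` header syntax; whether the ranges are sent in one request or several is left open.

```python
PLOG_SIZE = 64 * 1024 * 1024  # 64 MB plog, as described in the text
PAGE_SIZE = 16 * 1024         # 16 KB page granularity


def coalesce_page_ranges(page_slots):
    """Merge page slot indexes within one plog into inclusive byte ranges.

    The slot index (position of a 16 KB page inside the plog) is an
    assumption made for this illustration.
    """
    ranges = []
    for slot in sorted(set(page_slots)):
        start = slot * PAGE_SIZE
        end = start + PAGE_SIZE - 1
        if slot < 0 or end >= PLOG_SIZE:
            raise ValueError(f"page slot {slot} is outside the plog")
        if ranges and ranges[-1][1] + 1 == start:
            # Adjacent to the previous range: extend it instead of adding one.
            ranges[-1] = (ranges[-1][0], end)
        else:
            ranges.append((start, end))
    return ranges


def range_header(ranges):
    """Format inclusive byte ranges as an HTTP Range header value."""
    return "bytes=" + ",".join(f"{start}-{end}" for start, end in ranges)


if __name__ == "__main__":
    wanted = coalesce_page_ranges([0, 1, 2, 10])
    print(wanted)
    print(range_header(wanted))
```

Merging adjacent pages keeps the number of ranges small when a table's pages happen to sit next to each other in a plog.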

2.2 Tablespace Storage Mapping

The unit of management on the TaurusDB storage side is called a slice. On the compute side, the SliceManager module manages the mapping from [tablespace id, pageno] to slice, and each slice is logically allocated 10 GB of storage.

As shown in Figure 2, for the table with tablespace id 8, we only need to recover slice1 and slice3.

Figure 2 Mapping between tables and slices

This mapping is persisted to a file so that it can still be queried after a restart. The backup module also needs to update the relevant slice information during recovery, so that the tables map to the newly created slices after recovery.
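The lookup and the post-recovery update can be sketched as follows. This is a toy in-memory model, not the SliceManager implementation: the class `SliceMap`, its methods, and all ids other than tablespace id 8 mapping to slice1 and slice3 are hypothetical, and persistence to a file is omitted.

```python
from collections import defaultdict


class SliceMap:
    """Toy stand-in for the [tablespace id, pageno] -> slice mapping."""

    def __init__(self):
        self._by_page = {}                 # (space_id, page_no) -> slice_id
        self._by_space = defaultdict(set)  # space_id -> {slice_id, ...}

    def assign(self, space_id, page_no, slice_id):
        self._by_page[(space_id, page_no)] = slice_id
        self._by_space[space_id].add(slice_id)

    def slices_for_tablespace(self, space_id):
        """Return the slices that must be recovered for one tablespace."""
        return sorted(self._by_space.get(space_id, set()))

    def remap(self, old_to_new):
        """Point pages at the newly created slices after recovery."""
        self._by_page = {
            key: old_to_new.get(slice_id, slice_id)
            for key, slice_id in self._by_page.items()
        }
        self._by_space = defaultdict(set)
        for (space_id, _page_no), slice_id in self._by_page.items():
            self._by_space[space_id].add(slice_id)


if __name__ == "__main__":
    m = SliceMap()
    m.assign(8, 0, 1)   # tablespace 8 has pages in slice1 ...
    m.assign(8, 1, 3)   # ... and slice3, as in Figure 2
    m.assign(9, 0, 2)   # hypothetical other table
    print(m.slices_for_tablespace(8))
    m.remap({1: 101, 3: 103})  # hypothetical ids of the new slices
    print(m.slices_for_tablespace(8))
```

Keeping a per-tablespace index alongside the per-page mapping makes the "which slices does this table touch" query a direct lookup rather than a scan over all pages.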

2.3 Tablespace change log tracking

From the above, we know that the slices to be recovered can be identified from the tablespace id of the table to be recovered. In practice, however, customers usually supply the table name when they use table-level recovery, so the system needs to know the mapping between the table name and the tablespace id. This mapping can be queried in real time through the INNODB_TABLESPACES table, but DDL operations such as drop, create, and rename change the tablespace id of a table, which must be taken into account in practice.

Figure 3 Table tablespace id change process

As shown in Figure 3, when the system is restored to time T2, the tablespace id of table A is 12. When the restore point is T3 or later, the system recognizes the drop statement and directly returns an error, because the table no longer exists from that time on.

TaurusDB additionally records the tablespace id changes of tables during incremental backups. During database- and table-level point-in-time fast recovery, the system combines this recorded information with the tablespace information in the full backup to obtain the tablespace id that corresponds to the table name at the recovery point in time.
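The resolution step described above can be sketched in Python. This is a simplified illustration under stated assumptions, not TaurusDB's record format: the `Change` record, the integer timestamps, the table name `db.A`, and the starting id 10 are all hypothetical, and a rename is not modeled separately. The sketch starts from the full backup's name-to-id snapshot and replays the recorded changes up to the restore time.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Change:
    """One recorded tablespace id change (hypothetical record layout)."""
    ts: int                  # time of the DDL
    table: str               # table name
    space_id: Optional[int]  # new tablespace id, or None if dropped


def resolve_space_id(full_backup: Dict[str, int],
                     changes: Iterable[Change],
                     table: str,
                     restore_ts: int) -> int:
    """Return the table's tablespace id as of restore_ts."""
    space_id = full_backup.get(table)
    for change in sorted(changes, key=lambda c: c.ts):
        if change.ts > restore_ts:
            break  # later changes do not affect the restore point
        if change.table == table:
            space_id = change.space_id
    if space_id is None:
        raise LookupError(f"table {table!r} does not exist at time {restore_ts}")
    return space_id


if __name__ == "__main__":
    snapshot = {"db.A": 10}  # hypothetical full-backup state
    log = [
        Change(ts=20, table="db.A", space_id=12),    # id becomes 12 (T2)
        Change(ts=30, table="db.A", space_id=None),  # drop (T3)
    ]
    print(resolve_space_id(snapshot, log, "db.A", restore_ts=25))
```

With this data, a restore time at or after the drop raises `LookupError`, which corresponds to the error returned for T3 and later in Figure 3.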

3. Overall process analysis

The overall process of a table-level recovery operation is shown in Figure 4:

Figure 4 Overall flow of table-level recovery operations

The Management Agent issues the name of the table to be recovered and the point in time, and obtains the tablespace id of that table;
The full recovery obtains the list of slices to be recovered based on the table's tablespace id, and issues a recovery task to the storage side to recover the specified plogs;
mysqld is started. At the InnoDB layer, tablespaces whose id is not in the list are reported as DB_CANNOT_OPEN_FILE, and logs of tables that are not being recovered are skipped when the incremental logs are replayed;
The recovered tables are exported and imported using mydumper and myloader.
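The log-skipping part of the third step can be illustrated with a small filter. This is only a sketch of the idea: the `LogRecord` fields and the function `records_to_replay` are hypothetical, and real incremental logs contain more record types (and tablespaces that may always need to be replayed) than this model covers.

```python
from typing import Iterable, Iterator, NamedTuple, Set


class LogRecord(NamedTuple):
    """Simplified log record (hypothetical layout for illustration)."""
    lsn: int
    space_id: int
    page_no: int


def records_to_replay(records: Iterable[LogRecord],
                      recover_spaces: Set[int]) -> Iterator[LogRecord]:
    """Yield only the records that belong to tablespaces being recovered."""
    for record in records:
        if record.space_id in recover_spaces:
            yield record
        # Records for other tablespaces are skipped: their pages were
        # never restored, so there is nothing to apply them to.


if __name__ == "__main__":
    log = [
        LogRecord(lsn=1, space_id=8, page_no=0),
        LogRecord(lsn=2, space_id=9, page_no=0),
        LogRecord(lsn=3, space_id=8, page_no=5),
    ]
    for record in records_to_replay(log, {8}):
        print(record)
```

Filtering by tablespace id is what keeps the replay cost proportional to the recovered tables rather than to all write activity on the instance.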

4. Applications

Taking the recovery of a 2 TB instance used by a public cloud customer as an example, a 12 MB table was recovered in the test. The overall time consumption before and after optimization is compared in Figure 5:

Figure 5 Comparison of time consumption before and after table-level recovery optimization

The figure shows that the amount of data recovered after optimization drops from the TB level to the MB level, and the overall recovery time is only 21% of the time required before optimization.

In addition, the instance creation phase takes less time because its sub-steps are processed in parallel. In the table export and import phase, data recovery performance was significantly improved by tuning the strategies of the open source mydumper and myloader tools; see /blogs/433475 for details.

5. Summary

With database- and table-level point-in-time fast recovery, TaurusDB significantly reduces the amount of data that has to be recovered. Experimental comparisons show a significant improvement when only a few tables need to be recovered from a large instance. Especially for gaming workloads and other scenarios that require frequent archiving, the feature significantly reduces the downtime caused by data recovery. We will gradually open this feature on the public cloud to benefit more users.

