Methodology for analyzing master-standby delay faults

[Title] Primary-Standby Delay Fault Analysis Method

[Problem Classification] Failure Analysis

[Keywords] YashanDB, primary-standby latency

[Problem description] When a standby database experiences replay latency, the cause of the latency must be identified. The database system views and operating system monitoring data can help locate the bottleneck behind the replay delay.

[Analysis of the causes of the problem]

Methods for troubleshooting primary-standby delay

Replication status of the current standby database

Note:

{rst}{asn}{blockid}

rst: the reset ID. After each failover, the reset ID of the new redo files generated by the database is incremented by 1.

asn: archive sequence number. Each time a redo file is generated, the ASN is incremented by 1, so every redo file has a different ASN.

blockid: the ID of the page within the redo file. The byte offset of the page is block id * block size.

lfn: log flush number. Each time redo is flushed to disk, the LFN is incremented by 1.
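
The offset rule for blockid can be illustrated with a short calculation. This is a minimal sketch: the function name and the 4096-byte block size are assumptions chosen for illustration, not values taken from YashanDB.

```python
def redo_block_offset(block_id: int, block_size: int) -> int:
    """Return the byte offset of a page in a redo file (block id * block size)."""
    if block_id < 0:
        raise ValueError("block_id must be non-negative")
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return block_id * block_size


# Assumed values for illustration only: block id 1024, 4096-byte redo blocks.
print(redo_block_offset(1024, 4096))  # 4194304
```

Use the real redo block size of the instance in place of the assumed 4096 when locating a page.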

Standby Replay Progress View

While the database moves from the MOUNT phase to the OPEN phase, the view reports restart replay statistics, and the Redo Remain item decreases as replay proceeds. On the primary, the view entries no longer change after it reaches OPEN. On the standby, the view contents may be reset after it reaches OPEN; the Redo Remain and Remain Time items then indicate the size of the logs that still have to be replayed and the time needed to replay them.
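
Because Redo Remain shrinks as replay proceeds, two readings taken a known interval apart give a rough replay rate. The sketch below is only an illustration: the function name, the sample values, and the assumption that Redo Remain is expressed in bytes are all hypothetical, and the readings would have to be taken from the view by hand.

```python
def estimate_replay(remain_start: float, remain_end: float, elapsed_s: float):
    """Return (net_replay_rate_per_s, estimated_seconds_left) from two samples.

    Both samples must use the same unit (assumed to be bytes here).
    estimated_seconds_left is None when the backlog is not shrinking.
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    rate = (remain_start - remain_end) / elapsed_s
    if rate <= 0:
        return rate, None
    return rate, remain_end / rate


# Made-up samples: 8 GiB remaining, then 6 GiB remaining 60 seconds later.
GIB = 1024 ** 3
rate, eta = estimate_replay(8 * GIB, 6 * GIB, 60)
print(f"{rate / 1024**2:.1f} MiB/s, about {eta:.0f} s left")
```

A rate that stays at or below zero means the backlog is not shrinking, that is, replay is not keeping up with the incoming redo.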

Checking the redo flush speed

Introduction to other auxiliary analysis views

Checking disk IO performance

Description of the output fields

Device: disk name

rrqm/s: number of read requests merged per second

wrqm/s: number of write requests merged per second

r/s: number of read I/O requests per second

w/s: number of write I/O requests per second

rkB/s: amount of data read from the device per second (unit: KB)

wkB/s: amount of data written to the device per second (unit: KB)

avgrq-sz: average size of each device I/O request (unit: sectors)

avgqu-sz: average I/O queue length

r_await: average time required for each read request, including queue wait time (unit: ms)

w_await: average time required for each write request, including queue wait time (unit: ms)

await: average time required for each device I/O request, including queue wait time (unit: ms; as a guideline, the average response time should not exceed 5 ms)

svctm: average service time per device I/O request (unit: ms)

%util: percentage of time the device was busy handling I/O requests; the closer the value is to 100%, the busier the disk

If svctm is close to await, I/O requests spend almost no time waiting in the queue.

If await is much higher than svctm, the I/O queue is too long and the response is too slow, so the I/O path needs to be optimized. The queue length can be confirmed from avgqu-sz.
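
The two rules above can be sketched as a small check on values read from the output. The function name, the returned labels, and the ratio of 2 used to decide that await is "much higher" than svctm are assumptions for illustration; the 5 ms limit is the guideline given for await.

```python
def diagnose_io(await_ms: float, svctm_ms: float,
                await_limit_ms: float = 5.0, ratio: float = 2.0) -> str:
    """Classify one device sample from its await and svctm values (both in ms)."""
    if await_ms <= await_limit_ms:
        # Response time is within the guideline, so there is nothing to flag.
        return "ok"
    if await_ms > ratio * svctm_ms:
        # Most of the response time is queue wait: check avgqu-sz next.
        return "queueing"
    # await is high but close to svctm: the device itself is slow to serve requests.
    return "slow-device"


# Made-up samples for illustration.
print(diagnose_io(await_ms=1.2, svctm_ms=1.0))    # ok
print(diagnose_io(await_ms=40.0, svctm_ms=2.0))   # queueing
print(diagnose_io(await_ms=12.0, svctm_ms=10.0))  # slow-device
```

A "queueing" result is the case where avgqu-sz should be inspected to confirm that the queue is long.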

Monitoring primary-standby latency with YCM

YCM V23.2.1.100 monitors primary-standby latency as follows:

Viewing thread status with gstack

gstack yasdb process >
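
Once the gstack output has been saved, a quick way to see where threads are waiting is to count the innermost frame (#0) of every thread. The sketch below assumes the usual gdb-style backtrace layout that gstack prints; the sample text is made up for illustration (worker_a is a hypothetical function name), and real yasdb stacks will show different functions.

```python
import re
from collections import Counter

# Matches the innermost frame line, e.g. "#0  0x00007f... in pread64 () from ..."
TOP_FRAME = re.compile(r"^#0\s+(?:0x[0-9a-fA-F]+\s+in\s+)?([^\s(]+)")


def top_frame_counts(gstack_text: str) -> Counter:
    """Count how many threads are currently in each innermost function."""
    counts = Counter()
    for line in gstack_text.splitlines():
        match = TOP_FRAME.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up sample in gdb backtrace format, for illustration only.
SAMPLE = """\
Thread 3 (Thread 0x7f0001 (LWP 101)):
#0  0x00007f0000001111 in pthread_cond_wait () from /lib64/libpthread.so.0
#1  0x0000000000402222 in worker_a ()
Thread 2 (Thread 0x7f0002 (LWP 102)):
#0  0x00007f0000003333 in pread64 () from /lib64/libc.so.6
Thread 1 (Thread 0x7f0003 (LWP 100)):
#0  0x00007f0000001111 in pthread_cond_wait () from /lib64/libpthread.so.0
"""

for func, n in top_frame_counts(SAMPLE).most_common():
    print(n, func)
```

Many threads sitting in a read or write call would point toward disk I/O, while threads parked in a condition wait are usually idle; comparing several dumps taken a few seconds apart shows whether a thread is really stuck.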

Typical cases

Issue ticket: Database latency is relatively high after production data migration is completed

Second-line analysis article: "High latency in primary-standby log replay"

IO Performance Testing Tools