【Title】Method for Analyzing Primary/Standby Replay Latency
[Problem Classification] Failure Analysis
[Keywords] YashanDB, primary/standby latency
[Problem description] When a standby database falls behind in redo replay, the cause of the latency needs to be identified. Database system views and operating system monitoring data can help locate the replay bottleneck.
[Analysis of the causes of the problem]
Methods for analyzing primary/standby latency
Replication status of the current standby database
Note:
{rst}{asn}{blockid}
rst: the reset ID. After each failover, the reset ID of the new redo files generated by the database is incremented by 1.
asn: archive sequence number. Each newly generated redo file increments the ASN by 1, so no two redo files share the same ASN.
blockid: the ID of the page within the redo file. The byte offset of the page is block ID * block size.
lfn: log flush number. Each time redo is flushed to disk, the LFN is incremented by 1.
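The field descriptions above can be illustrated with a minimal Python sketch. It assumes the four fields have already been parsed into integers; the `RedoPoint` type, the helper names, and the 4096-byte block size are hypothetical and do not come from YashanDB. The ordering rule (compare rst, then asn, then blockid) is also an assumption inferred from how the fields are described.

```python
from typing import NamedTuple


class RedoPoint(NamedTuple):
    """A redo position (hypothetical helper type, not a YashanDB API)."""
    rst: int       # reset ID, +1 after each failover
    asn: int       # archive sequence number, +1 per redo file
    block_id: int  # page ID inside the redo file
    lfn: int       # log flush number, +1 per redo flush


def byte_offset(point: RedoPoint, block_size: int) -> int:
    """Offset of the page inside its redo file: block ID * block size."""
    return point.block_id * block_size


def is_behind(standby: RedoPoint, primary: RedoPoint) -> bool:
    """True when the standby position is earlier than the primary position.

    Assumption: positions order by (rst, asn, block_id), compared left to right.
    """
    return standby[:3] < primary[:3]


if __name__ == "__main__":
    standby = RedoPoint(rst=1, asn=12, block_id=100, lfn=5000)
    primary = RedoPoint(rst=1, asn=13, block_id=4, lfn=5100)
    # 4096 is an assumed block size used only for illustration.
    print(byte_offset(standby, 4096))
    print(is_behind(standby, primary))
```

Comparing the received position with the replayed position on the standby in the same way gives a rough picture of whether the backlog lies in log transfer or in replay.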
Standby replay progress view
While the database moves from the MOUNT phase to the OPEN phase, the view reports restart replay statistics, and the Redo Remain item decreases as replay proceeds. On the primary, the view entries no longer change once it is OPEN. On the standby, however, the view contents may be reset after it is OPEN, and the Redo Remain and Remain Time items then indicate the size of the logs still to be replayed and the time needed to replay them.
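As a rough cross-check of the Remain Time item, the net drain rate can be estimated from two Redo Remain readings taken a known interval apart. The sketch below is only an illustration: the function name and the sample values are hypothetical, and it assumes both readings use the same unit and that the rate stays constant.

```python
from typing import Optional


def estimate_drain_seconds(remain_before: float,
                           remain_after: float,
                           interval_s: float) -> Optional[float]:
    """Estimate seconds until Redo Remain reaches zero.

    remain_before / remain_after: two Redo Remain readings in the same unit.
    interval_s: seconds between the two readings.
    Returns None when the backlog is not shrinking (replay is not keeping up).
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    net_rate = (remain_before - remain_after) / interval_s
    if net_rate <= 0:
        return None
    return remain_after / net_rate


if __name__ == "__main__":
    # Hypothetical readings: 1000 units, then 800 units ten seconds later.
    print(estimate_drain_seconds(1000.0, 800.0, 10.0))
```

A result of `None` means the backlog grew or stayed flat over the interval, which suggests that new redo is arriving at least as fast as the standby can replay it.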
Checking the speed at which redo is flushed to disk
Introduction to other auxiliary analysis views
Checking disk IO performance
Description of the output fields
Device : disk name
rrqm/s : number of read requests merged per second
wrqm/s : number of write requests merged per second
r/s : number of read I/Os per second
w/s : number of write I/Os per second
rkB/s : amount of data read from the device per second (unit: KB)
wkB/s : amount of data written to the device per second (unit: KB)
avgrq-sz : average size of each I/O request issued to the device
avgqu-sz : average I/O queue length
r_await : average time required for each read operation, including queue wait time (unit: ms)
w_await : average time required for each write operation, including queue wait time (unit: ms)
await : average time per device I/O operation, including queue wait time (unit: ms); a healthy average response time is generally no more than 5 ms
svctm : average service time per device I/O operation (unit: ms)
%util : device utilization, that is, how busy the disk is; check this value for each disk to judge whether it is busy
If svctm is close to await, I/O requests spend almost no time waiting in the queue.
If await is much higher than svctm, the I/O queue is too long and the response is too slow, so the I/O path needs to be optimized. The backlog can be confirmed from the avgqu-sz queue length.
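The await/svctm rule above can be sketched in Python as follows. The `IoSample` type, the function names, and the ratio threshold of 2.0 are hypothetical choices for illustration; only the 5 ms limit comes from the field description. The "slow-device" label for a high await that stays close to svctm is an interpretation, not something stated above.

```python
from dataclasses import dataclass


@dataclass
class IoSample:
    """One line of disk statistics, reduced to the two fields compared here."""
    await_ms: float  # average time per I/O, including queue wait
    svctm_ms: float  # average service time per I/O


def queue_wait_ms(sample: IoSample) -> float:
    """Time spent waiting in the queue: await minus service time."""
    return max(sample.await_ms - sample.svctm_ms, 0.0)


def classify(sample: IoSample,
             ratio_threshold: float = 2.0,
             await_limit_ms: float = 5.0) -> str:
    """Classify a sample as 'ok', 'queueing', or 'slow-device'.

    ratio_threshold is an assumed cut-off for "await much higher than svctm".
    """
    if sample.await_ms <= await_limit_ms:
        return "ok"
    if sample.await_ms >= ratio_threshold * sample.svctm_ms:
        return "queueing"     # most of the latency is queue wait
    return "slow-device"      # latency is dominated by service time


if __name__ == "__main__":
    # Hypothetical values, not taken from a real system.
    for s in (IoSample(1.2, 1.0), IoSample(40.0, 2.0), IoSample(12.0, 10.0)):
        print(s, classify(s), queue_wait_ms(s))
```

When a sample is classified as queueing, the avgqu-sz value for the same disk is the natural next field to inspect.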
Monitoring primary/standby latency with YCM
YCM V23.2.1.100 monitors primary/standby latency as follows:
Viewing thread status with gstack
gstack <yasdb process ID> >
Typical cases
Issue ticket: Relatively high database latency after production data migration completed
Second-line analysis article: "High latency in primary/standby log replay"
IO Performance Testing Tools