This article was shared from Huawei Cloud CommunityGaussTech Technical Column] GaussDB Performance Tuning., authored by GaussDB Database.
Database performance tuning is a complex and systematic task that requires comprehensive consideration of various factors. Therefore, tuning personnel should have a broad and deep understanding of system software architecture, hardware and software configuration, database configuration parameters, concurrency control, query processing and database applications.
The purpose of this paper is to dissect the general idea of GaussDB performance tuning, explore the overall system performance issues, as well as analyze and optimize the lock blocking problem.
1. Performance tuning ideas
GaussDB overall performance tuning idea is: first performance bottleneck analysis, find the corresponding bottleneck point, and then targeted optimization, until the system performance to reach the business acceptable range.
tuning idea, as shown in Figure 1:
Figure 1 GaussDB general performance tuning idea
First, you should confirm whether the application pressure is transferred to the database, which can be determined by analyzing the resource usage of the database nodes, such as CPU, I/O, memory, and information about the database thread pool, active sessions, etc. The GaussDB database control platform provides a rich system of monitoring metrics, which makes it easy for the performance analysts to view the real-time or historical resource usage of the database.
After logging into the control platform, enter the Monitor Patrol menu and select Monitor Disk to view the CPU/memory utilization of the corresponding instance, as shown in Figure 2:
Figure 2 CPU/memory utilization of corresponding instances
Click the Disk/Storage menu to view the disk I/O utilization, focusing on the disk read and write rates and whether the latency is as expected, as shown in Figure 3:
Figure 3 Disk read/write rate and delay situation
Click the Network menu to view the network transfer rate and whether the network card has lost packets, wrong packets, etc., as shown in Figure 4:
Fig. 4 Network Transmission Rate and NIC Transmission Rate
Select the Connections menu to view the connection and session status of the database, as shown in Figure 5:
Figure 5 Connection and session status
In Figure 5, if the percentage of active sessions is much lower than the number of concurrency of the application, it means that a large number of sessions are idle in the database. At the same time, if the CPU utilization rate is also very low, then it can be judged that the pressure did not reach the database, and at this time you need to troubleshoot whether there is a bottleneck on the application side.
Some of the more common causes of application-side bottlenecks are:
-
Application server resource bottlenecks. For example, the application server's CPU is full, the application memory allocation is insufficient, and so on;
-
Application to database network problems. For example, the network latency is high, the bandwidth is full, and there are packet losses;
-
Slow processing of the application's own logic;
-
Poor application configuration, such as improperly set connection pool parameters, memory-related configuration, etc.
For example, a customer does a large concurrency pressure test with jmeter and the performance is not as good as the business expectation. After analyzing the problem, it was found that the maximum available memory allocated by the jmeter tool was insufficient, resulting in the pressure not reaching the database. The problem is solved by modifying the following configuration.
Edit file: set HEAP=-Xms1g -Xmx4g
After confirming that the pressure reaches the database, then analyze and optimize for the corresponding bottleneck points. This is done in the following two main ways:
1) Troubleshoot the database for business SQL statements that do not perform as expected and optimize the SQL that does not perform as expected. With the following statement, view the TOP SQL statements in the database with high time consumption and analyze and tune those SQL statements whose execution performance is not as expected one by one.
select unique_sql_id,substr(query,1,50) as query ,n_calls,round(total_elapse_time/n_calls/1000,2) avg_time,round(total_elapse_time/1000,2) as total_time from dbe_perf.summary_statement t where n_calls>10 and avg_time>3 and user_name='root' order by total_time desc;
As shown in Figure 6, n_calls represents the number of times the SQL statement was executed in the database, avg_time is the average execution time of the SQL statement, and total_time is the total elapsed time of the SQL statement. For the SQL statements whose average execution time exceeds the threshold, the focus is on analysis and optimization.
Figure 6 SQL statement metrics and corresponding data display
For SQL statements with poor execution performance, you can view the execution details of the SQL statement through unique_sql_id to help analyze the performance bottleneck point of the SQL statement.
select * from dbe_perf.statement where unique_sql_id=3508314654;
As shown in Figure 7, this view records the detailed execution of the SQL statement in the database, for example, the total number of executions (n_calls) and the total elapsed time (total_elapse_time), which makes it easy to get the total elapsed time as well as the average elapsed time for this SQL.
Figure 7 Detailed view of the execution of SQL statements in the database
line activities.This includes information such as the number of rows scanned randomly, the number of rows scanned sequentially, the number of rows returned, the number of rows inserted/updated/deleted, and the number of pages hit by the buffer. In addition, the number of soft parses (n_soft_parse) and hard parses (n_hard_parse) is also recorded. For example, if a large number of hard parses of SQL cause the database CPU to be too high, you can analyze and locate the database by using this indicator.
time model, contains metrics such as db_time, cpu_time, execution_time, plan_time, data_io_time, net_send_info, net_recv_info, sort_time, and hash_time, which help to determine at what stage of the database the SQL is consumed. which phase of the SQL is being consumed in the database. For example, if an environment has poor disk performance, the percentage of time spent on data_io_time will be higher.
If you need to further analyze the performance of SQL itself, such as whether the execution plan is optimal or not, whether the index is optimal or not, and other performance issues, you can analyze it with the help of SQL's execution plan.
The SQL execution plan can be viewed in the following way:
explain analyze SELECT c_id FROM bmsql_customer WHERE c_w_id = 1 AND c_d_id = 1 AND c_last = 'ABLEABLEABLE' ORDER BY c_first;
Combined with the execution plan of SQL, the bottleneck point of SQL performance is analyzed and then performance optimization is performed as shown in Figure 8:
Figure 8 SQL performance optimization process
2) Tuning at the operating system level and database system level from the system level to fully utilize the machine's CPU, memory, I/O and network resources and avoid resource conflicts, thus improving the throughput of the entire system query.
2. Analysis of system-level performance issues
2.1 High CPU utilization
The high CPU usage of the database is usually caused by business SQL statements. We can get the SQL statements in the database that consume high CPU resources and optimize the corresponding business SQL statements in the following way.
select unique_sql_id,substr(query,1,50) as query ,n_calls,round(total_elapse_time/n_calls/1000,2) avg_time,round(total_elapse_time/1000,2) as total_time,round(cpu_time/1000,2) as cup_time from dbe_perf.statement t where n_calls>10 and avg_time>3 and user_name='root' order by cpu_time desc limit 5;
Common causes of high CPU resource consumption are:
-
SQL statements make heavy use of full table scans, which can be caused by missing indexes, index failures, poor execution plans, and other factors.
-
SQL statements are heavily hard parsed, usually because the application logic is not using PBE (Prepare Bind Execute).
-
The SQL statement scanned a large number of tuples, e.g., partitioned table partition pruning failed, full partition was scanned, a large number of dead tuples existed in the table, resulting in scanning a large number of useless pages, etc.
If the high CPU usage is caused by non-business SQL statements, you can analyze and locate them with the help of flame diagrams. Through the flame diagram, you can visualize which functions in the program take up a lot of CPU time or resources, and you can trace the function call path.
GaussDB has a built-in flamegraph tool in kernel version 505, which is automatically collected every 5 minutes by default, and saved in the $GAUSSLOG/gs_flamegraph/{datanode} path, for more information, please refer to the GaussDB product documentation, "Built-in perf tool" chapter.
For example, a customer found that the CPU SYS occupancy of the database server exceeded 70% during the pressure test. By capturing the flame graph during the pressure test and analyzing it, as shown in Figure 9, it was found that the threads of the zone file accounted for more than 40% of the database loading.
Figure 9 Flame map during pressure testing for a customer
After analyzing, the reason is that in high concurrency and frequent connection establishment, the database needs to read the time zone file to get the time zone information every time it establishes a connection, and the application does not use the long connection, which leads to a spike in the CPU SYS usage rate.
2.2 Insufficient memory
Memory resources, too, are one of the key factors affecting database performance. Before analyzing the memory problem, let's understand the memory management mechanism of GaussDB.
As shown in Figure 10, GaussDB's memory management uses a combination of dynamic and static memory, with the parameter max_process_memory controlling the maximum memory available to the database. Among them, the static memory area is mainly used as a shared buffer for the database, used to cache data pages, controlled by the shared_buffers parameter. The dynamic memory area, which is dynamically allocated by the database according to the needs, mainly includes the cache of metadata, the cache of the execution plan, the user's built-in connection, and the memory consumption of internal threads.
Figure 10 GaussDB's memory management mechanism
Performance problems caused by memory are usually categorized as follows:
1) Insufficient shared buffer resulting in low buffer hit rate for SQL. In order to see the corresponding performance metrics, you can use GaussDB's control platform or WDR reports. Typically, the buffer hit rate of a TP database should be above 99%. If the buffer hit rate of the database is low, it is recommended to check whether the shared_buffers parameter of the database is set reasonably (as shown in Figure 11).
Fig. 11 Buffer hit rate of the database
(2) In GaussDB, the SQL hash join or sort operator exists data drop disk operation, work_mem parameter controls the physical memory space available for drop disk operator. If the physical memory limited by work_mem is not enough, the data of the operator will be written to the temporary tablespace, which will bring 5-10 times performance degradation. In order to optimize performance, you can check the execution plan of the SQL, and if there is a case of the operator falling down (as shown in Figure 12), you can adjust the value of the work_mem parameter appropriately.
Fig. 12 The situation of the calculator's fall
(3) Insufficient dynamic memory in the database, resulting in business execution errors (ERROR: memory is temporarily unavailable) or inadequate performance. When the dynamic memory is insufficient, the following SQL statements can be used to find out the SQL statements with high memory consumption, so as to check whether there are any sub-optimal SQL statements. With the help of SQL execution plan analysis, you can check whether there is any unreasonable join order, or whether there is any non-essential sorting operation, so as to avoid consuming a lot of memory.
select unique_sql_id,substr(query,1,50) as query ,n_calls,round(total_elapse_time/n_calls/1000,2) avg_time,round(total_elapse_time/1000,2) as total_time,hash_mem_used,sort_mem_used from dbe_perf.statement t where n_calls>10 and avg_time>3 and user_name='root' order by (hash_mem_used+sort_mem_used) desc;
If you need to troubleshoot abnormal memory consumption issues caused by non-business SQL statements, such as memory buildup, memory leaks, etc., GaussDB provides a rich set of memory-related monitoring views, which can be used to view the memory consumption of a database node through the following view (as shown in Figure 13).
Figure 13 GaussDB memory-related monitoring view
Based on the above query results, if the dynamic_used_shrctx occupancy is high, it means that the global shared dynamic memory is highly consumed. You can view the consumption of the global shared dynamic memory context by using the following SQL statement.
select contextname, sum(totalsize)/1024/1024 totalsize, sum(freesize)/1024/1024 freesize, count(*) count from gs_shared_memory_detail group by contextname order by totalsize desc limit 10;
If the occupancy of max_dynamic_memory is high, but the occupancy of dynamic_used_shrctx is low, then it means that a thread or session is consuming more memory. You can query the memory context consumption of a database thread by using the following SQL statement.
select contextname, sum(totalsize)/1024/1024 totalsize, sum(freesize)/1024/1024 freesize, count(*) sum from gs_thread_memory_context group by contextname order by sum desc limit 10;
The query results are shown in the following figure, which shows that the highest memory usage in the current database is the cache for metadata (LocalSysCacheShareMemory). Combine the query results in Figure 14 to troubleshoot whether there is any unreasonable memory usage.
Figure 14 Memory context consumption of database threads
2.3 IO bottlenecks
With the iostat command, you can view the busyness and throughput of the I/O of the database nodes and analyze whether there is a performance bottleneck due to I/O. As shown in Figure 15:
Figure 15 Busyness and throughput of database node I/Os
Focus on the read/write throughput and read/write latency of the disk. Normally, the read/write latency of SSD disks is below 2ms and the bandwidth of a single disk is above 300MB. If there are abnormalities in disk performance, give priority to troubleshooting whether there are hardware failures, such as bad disks, slow disks, RAID card failures, or incorrect disk read/write policies. If the disk hardware performance is normal and the I/O pressure is high, you can adjust the database I/O related parameters appropriately to reduce the I/O consumption of the data, so as to optimize the overall performance of the database.The key I/O related parameters are linked below:
Back-end write process:/distributed-devg-v2-gaussdb/
Asynchronous I/O:/distributed-devg-v2-gaussdb/
2.4 Network anomalies
In traditional centralized database environments, application servers and database servers are usually deployed in the same server room, thus ensuring less network overhead between the application and the database. However, in cloud+distributed database environments, the network links from application servers to database servers are long and network elapsed time is critical for transaction performance. In this scenario, we not only need to focus on the network conditions between the application and the database (which should usually be less than 0.2ms), but also consider the network conditions between the nodes within the database, which can also have a large impact on the performance.
GaussDB requires intra-AZ network latency to be less than 0.2ms, inter-AZ network latency to be less than 2ms, and inter-region network latency to be less than 100ms. the network latency and packet loss between the two servers can be troubleshooted by using the ping command in linux as shown in Figure 16:
Figure 16 The ping command to troubleshoot network latency and packet loss between two servers
View the transmission of the network by using the sar -n DEV 1 command.
As shown in Figure 17, "rxkB/s" is the number of kilobytes received per second and "txkB/s" is the number of kilobytes sent per second, which mainly focuses on whether the transmission amount of each NIC reaches the transmission limit.
Figure 17 sar -n DEV 1 Command, Network Transmissions
3. Analysis of the lock blocking problem
Database locking mechanism is a technique used to manage concurrent access. It ensures data consistency and integrity when multiple users access the database concurrently by locking the data in the database.
In concurrent access scenarios, performance problems due to lock conflicts are often encountered. Here we look at how you should locate and analyze lock conflicts in GaussDB.
If the application is running, you can use the following SQL statement to see if there is lock blocking for the currently executing session in the database.
Centralized scenarios:
SELECT as w_pid, as w_query,, as locking_query, as l_state, as l_pid, as l_sessionid
FROM pg_stat_activity AS a
JOIN pg_thread_wait_status b ON b.query_id = a.query_id
JOIN pg_thread_wait_status c
ON = b.block_sessionid and c.node_name=b.node_name
JOIN pg_stat_activity d
on =
;
Distributed scenarios:
SELECT as w_pid, as w_query, as w_state, , , as lock_query, as l_state, as l_pid, as l_sessionid
FROM pgxc_stat_activity AS a
JOIN pgxc_thread_wait_status b ON b.query_id = a.query_id
JOIN pgxc_thread_wait_status c ON = b.block_sessionid and c.node_name=b.node_name
JOIN pgxc_stat_activity d
on substring(d.global_sessionid,0,instr(d.global_sessionid,'#')) ilike substring(c.global_sessionid,0,instr(c.global_sessionid,'#'))
;
The result of the query is shown in Figure 18, where you can get the SQL statement with lock blocking in the current library, as well as the session ID, thread ID, and the corresponding query that blocked it.
Figure 18 Lock blocking query result display
To find and end the session that is blocking the current query, you can use the following statement.
SELECT PG_TERMINATE_BACKEND(pid);
If the performance problem is caused by historical lock blocking, you can query the database wait events during the specified time period with the following statement. If you find a large number of ACCESS LOCK (including TRANSACTION ID, RELATIONSHIP, and TUPLE) events, it means that the database had a lock blocking problem during that time period.
select wait_status,event,count(*) from gs_asp where sample_time>='20241016 18:45:00' and sample_time <='20241016 19:00:00' group by 1,2 order by 3 desc;
ASP (Active Session Profile) reproduces the system activities in the past period of time in a low-cost way by sampling the state information of the active sessions in the instance, which mainly contains the basic information of the session, the session transactions, the executed statements, the waiting events, the state of the session (e.g., active, idle, etc.), and the locks or blocked by which session is currently blocked. event, which lock it is waiting for, or which session it is blocking.
As shown in Figure 19, the two waiting events that account for the highest percentage of the database in this time period, one is waiting for the dn_6004_6005_6006 slice to return the execution result, which needs to be further investigated to find out the cause of the performance bottleneck on this slice; the other waiting event is the ACCEIVED LOCK (RELATIONSHIP), which indicates that there are a large number of table-level locks waiting.
Figure 19 Two waiting events with the highest database share within a given event
Combined with the normalized view of the database, you can get the SQL statements in the database that have lock waiting, as shown in Figure 20:
Figure 20 Getting SQL statements with lock waiting in the database
Get the query_id of the query blocking the statement by the Unique_query_id of the statement.
execute direct on datanodes $$select t1.unique_query_id,t1.thread_id,,t1.wait_status,,,t2.query_id as lock_query_id from gs_asp t1,gs_asp t2 where t1.block_sessionid= and t1.unique_query_id=168353725$$;
As shown in Figure 21, lock_query_id is the query_id that blocks this SQL statement.
Figure 21 Getting the query_id of the blocking lock waiting SQL statement
Using the query_id from the previous step and combining it with the gs_asp view, you can get the details of this SQL statement with the following statement. The result of the query is shown in Figure 22. It can be seen that the blocking statement is also an UPDATE statement for the same table, which indicates a lock conflict due to concurrent updates to the same row of data.
Figure 22 Query result of SQL statement for lock wait
Typically, the solution to the concurrent update lock conflict problem needs to start from the business point of view, examining the existence of concurrent updates to the same row of the situation is consistent with the business scenario. If the business does not exist in such a scenario, it should be optimized from the business logic or business data to avoid concurrent updates to the same row.
4. Summary
Database performance tuning involves hardware, operating system, database, application and other levels, therefore, in the process of performance tuning, it is necessary to comprehensively consider the impact of various factors. This article introduces the common means and ideas for analyzing performance problems in GaussDB to help you familiarize yourself with the tools commonly used in GaussDB database performance diagnosis and how to use them.
Huawei Developer Space, which gathers development resources and tools from various root technologies such as Hongmeng, Rise, Kunpeng, GaussDB, and Euler, is committed to providing each developer with a cloud host, a set of development tools, and storage space on the cloud, so that developers can innovate based on Huawei's root ecosystem.click on a linkGet your exclusive cloud hosting for free.
Be the first to know about Huawei Cloud's fresh technology~!