Abstract: When a live network uses dynamic load management, statements are often seen in the wait in ccn state. It is easy to assume they are hung and to become anxious, but wait in ccn is simply a state of waiting for resources. This post summarizes how to handle ccn problems; ccn issues can be handled by following it.

This article is shared from the Huawei Cloud Community post "GaussDB(DWS) wait in ccn queue, how to quickly locate and process?", author: Malick.

Preface

When the live network uses dynamic load management, statements are often seen in the wait in ccn state. When you first deal with it, you may think the statements are hung and become very anxious, but wait in ccn is actually a state of waiting for resources. This post is a summary of how to handle ccn problems, and ccn problems can be handled by following it.

Background knowledge:

  • Which node is the ccn:

Connect to the environment,

source the environment variables:

source /opt/huawei/Bigdata/mppdb/.mppdbgs_profile

Run:

cm_ctl query -Cv | grep Cen -A 4
The results are as follows:
image.png

In this example, 5003 is the ccn of the cluster.

What is ccn: the ccn (Central Coordinator) is the concurrency-control brain of the cluster. All complex statements apply to the ccn for resources, and only statements that have been granted resources can be dispatched. Complex statements are recorded centrally on the ccn.

View descriptions:

  • pg_stat_get_workload_struct_info();
    image.png
  • totalsize is the maximum dynamic memory, that is, the total memory the ccn manages; freesize_limit is the maximum memory the ccn can allocate, which is 80% of the maximum dynamic memory; freesize is the memory currently remaining.
  • Only the central waiting/running counts in the figure matter (the global entries can be ignored: they belong to another data structure and duplicate the central waiting information). Each line represents one statement. running means the statement is executing, and waiting means the statement is queuing. queryId is the thread number of the statement and corresponds to lwtid in pg/pgxc_thread_wait_status and to processid in pg_session_wlmstat.
  • pg_session_wlmstat/pgxc_session_wlmstat();
    image.png

Step 1: Determine the problem scenario

  • Connect to the ccn and run the following queries to determine the problem scenario:

  • Query pgxc_stat_activity to determine whether a large number of statements are in the wait in ccn state, or whether all statements of a particular resource pool are in wait in ccn.

  • Query pg/pgxc_session_wlmstat to determine whether all complex statements are queued, or whether all statements in the same queue are queued.

The first step is to connect to the ccn node and query the workload view:

select * from pg_stat_get_workload_struct_info();
image.png

The second step is to query pg_session_wlmstat:

select threadid,processid,usename,attribute,status,enqueue,statement_mem,active_points,control_group,resource_pool,substring(query,position('explain' in query),20) as subquery from pg_session_wlmstat order by status,attribute,usename,subquery,resource_pool;

image.png

Decide which handling method to use according to the following scenarios:

1) If a few statements in the workload view are in the running state, those running statements occupy a large amount of memory and use up freesize, and a large number of statements are in the waiting state, the problem is basically scenario 1.

2) If the workload view shows a statement in the running state, but pgxc_stat_activity or pg_session_wlmstat actually contains only waiting statements, and two or more statements in the workload view have the same qid.queryId value, the problem is basically scenario 2.

3) If all statements are in the waiting state and none is in the running state, the problem is basically scenario 3.
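The three scenarios can be summarized as a small decision function. The Python sketch below is illustrative only: the function name, its parameters, and the returned numbers are hypothetical, and the inputs are assumed to have been read by hand from the workload view and from pgxc_stat_activity or pg_session_wlmstat. A result of 1 still needs the memory check described for scenario 1.

```python
def classify_ccn_scenario(workload_running, workload_waiting,
                          session_running, duplicate_query_ids):
    """Return 1, 2 or 3 for the scenarios described above, or None.

    workload_running / workload_waiting: number of running / waiting
        entries in the workload view on the ccn.
    session_running: number of statements actually running according to
        pgxc_stat_activity or pg_session_wlmstat.
    duplicate_query_ids: True if two or more workload entries share the
        same qid.queryId value.
    """
    if workload_waiting == 0:
        return None  # nothing is queued, so there is no ccn queuing problem
    if workload_running == 0:
        return 3     # everything waits and nothing runs
    if session_running == 0 and duplicate_query_ids:
        return 2     # the workload view shows a runner that no session owns
    return 1         # a real running statement may be holding the memory


print(classify_ccn_scenario(1, 20, 1, False))
```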

Handling scenario 1: a large-memory statement causes the problem

The first step is to find the statement that takes up too much memory in the workload view.
image.png

As shown in the figure above, the total available memory is 1638MB, one running statement occupies 1048MB, and the remaining memory is freesize=590MB.

At this point the waiting statements each have an estimated memory of 600MB, which is more than the 590MB that remains, so none of them can be dispatched. Only when the 1048MB statement ends and its memory is released does the system return to normal.
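The arithmetic behind this example can be checked with a short Python sketch. It is a simplified model, not the actual ccn admission logic: the function name is hypothetical, and it assumes a statement is dispatched only when its estimated memory fits into the memory still free.

```python
def can_dispatch(available_mb, running_mb, estimate_mb):
    """Simplified model: a statement fits only if its estimated memory
    does not exceed the memory left after the running statements."""
    freesize_mb = available_mb - sum(running_mb)
    return estimate_mb <= freesize_mb


# Figures from the example: 1638MB available, one 1048MB statement running.
freesize = 1638 - 1048                        # 590MB left
blocked = can_dispatch(1638, [1048], 600)     # 600MB does not fit into 590MB
after_release = can_dispatch(1638, [], 600)   # fits once the 1048MB statement ends
print(freesize, blocked, after_release)       # prints: 590 False True
```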

The second step is to find the pid of the statement from its qid.queryId (9145 in the figure above):

select coorname,pid,usename,substr(query,0,30) from pgxc_stat_activity a,pgxc_thread_wait_status b where a.pid = b.tid and b.lwtid = $qid.query_id;

The third step is to kill the large-memory statement based on its pid and cn. The system recovers once the memory is released.

Handling scenario 2: residual hash entries or other residual statements

The first step is to confirm the concurrency configuration on the problematic resource pool:

select * from pg_resource_pool;
image.png

In the second step, if the resource pool has merely reached its concurrency limit (for example, the concurrency of the resource pool is set to 10 and 10 statements are already running, so new statements wait because the upper limit has been reached), set the concurrency of the pool to -1, meaning no limit. The statements waiting on concurrency can then be dispatched.

To modify it, taking son_pool as an example:

alter resource pool son_pool with(active_statements=-1);

The third step is to clean up the problem statement (while the connection is not disconnected, the thread is not released and the residual information is not cleaned up automatically).

Note: invalid statement information is cleaned up automatically based on whether /proc/processed still exists; if it no longer exists, the information is cleaned up. If the connection is still held, the thread is not released and the residue is not cleaned up automatically.

1) Identifying the problem statement:

In the workload view, a statement whose qid.queryId appears more than once is the problem statement. The problem thread appears twice: one entry may be normal and the other is residual. Both entries may look problematic, but in the end only one active statement is actually queued or executing.
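Spotting repeated qid.queryId values by eye is error-prone when the view is long. As a minimal sketch, assuming the queryId values have already been copied out of the workload view into a Python list (the function name and the sample values are made up for illustration):

```python
from collections import Counter


def repeated_query_ids(query_ids):
    """Return the queryId values that occur more than once, sorted."""
    counts = Counter(query_ids)
    return sorted(qid for qid, n in counts.items() if n > 1)


# Hypothetical values copied from the workload view.
sample = [20311, 20458, 20311, 20977]
print(repeated_query_ids(sample))  # prints: [20311]
```

Each value returned is a candidate for the lookup and cleanup that follow.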

2) Cleaning up the problem statement: use the repeated qid.queryId identified in 1) above to find the problem statement:

select coorname,pid,usename,substr(query,0,30) from pgxc_stat_activity a,pgxc_thread_wait_status b where a.pid = b.tid and b.lwtid = $qid.query_id;

Then use pg_terminate_backend(pid) to kill the residual statement based on its pid and cn. The system recovers once the concurrency and memory resources are released.

Handling scenario 3: long jump lock problem

The first step is to confirm the problem

Dump the stack:

gstack $ccn_pid > ccnStack.log

grep pthread_mutex_lock ccnStack.log

If the output is similar to the following, the problem is confirmed:
image.png
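When the dump is large, it can help to list which threads are sitting in pthread_mutex_lock rather than only the matching frames. The following Python sketch assumes the usual gstack layout, in which each thread starts with a line beginning with "Thread " followed by its frames; the function name and the sample dump are invented for illustration, and real addresses and frames will differ.

```python
def threads_in_mutex_lock(stack_text):
    """Return the header lines of threads whose stack contains
    a pthread_mutex_lock frame."""
    blocked = []
    header = None
    seen = False
    for line in stack_text.splitlines():
        if line.startswith("Thread "):
            # A new thread section begins.
            header, seen = line.strip(), False
        elif "pthread_mutex_lock" in line and header and not seen:
            blocked.append(header)
            seen = True  # report each thread only once
    return blocked


# Invented sample in the usual gstack layout.
sample = """\
Thread 2 (Thread 0x7f01 (LWP 101)):
#0  0x0001 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0002 in pthread_mutex_lock () from /lib64/libpthread.so.0
Thread 1 (Thread 0x7f02 (LWP 100)):
#0  0x0003 in poll () from /lib64/libc.so.6
"""
print(threads_in_mutex_lock(sample))
```

To run it against the real dump, pass the contents of ccnStack.log instead of the sample string.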

The second step is emergency treatment

Approach:

kill -9 $ccn_pid
