本文为墨天轮数据库管理服务团队第96期技术分享,内容原创,作者为技术顾问陈昱,如需转载请联系小墨(VX:modb666)并注明来源。如需查看更多文章可关注【墨天轮】公众号。

image.png

适用范围

DB: ALL VERSION

OS: ALL PLATFORM

问题概述

数据库在7号9:00 左右开始卡顿,采集了对应时间段的AWR报告如下:
–node1
image.png
image.png
image.png
–node2
image.png
image.png
image.png
–node1
image.png
–node2
image.png

有3个EVENT需要注意,分别是

1)其中enq: TX - index contention主要由 apd5vv681f54w 引起。

SQL语句:INSERT INTO A_LM_*****_HOUR ("DATA_DATE", "METER_ID", "ASSET_NO", "MGT_ORG_CODE", "MGT_ORG_NAME", "CUST_ID", "CUST_NO", "CUST_NAME", "CONS_TYPE", "IS_SPECIAL", "UI_ABNOR_TYPE", "ABNOR_DESC_RATE", "ABNOR_DESC_DATA", "ABNOR_DATE", "CREATE_TIME", "IMPO_CONS_TYPE", "IMPO_CONS_SORT", "TMNL_ONLINE_FLAG", "THRES_HOLD") VALUES (:1 , :2 , :3 , :4 , :5 , :6 , :7 , :8 , :9 , :10 , :11 , :12 , :13 , :14 , :15 , :16 , :17 , :18 , :19 )

查询索引分裂对应的索引

select instance_number,current_obj#,count(*) cnt
from ashtmp
where to_char(sample_time,'yyyy-mm-dd hh24') between '2025-06-04 09' and '2025-06-04 11' 
and sql_id = 'apd5vv681f54w'
group by instance_number,current_obj#
order by instance_number,cnt;
11-12236
2104979
31313244769254
413132448144496

可以看见 3132447 和 3132448 最严重,下面查询2个索引对象

select owner,object_name,object_type from dba_objects where object_id in (3132447,3132448);

image.png

image.png

建议:修改_assm_high_gsp_threshold 和 _assm_low_gsp_threshold 参数,功能:

  • 控制在索引分裂过程中,搜索可重用的空闲块(free block) 时的最大拒绝次数阈值。
  • 当空间管理模块(ASSM)在寻找空闲块过程中累计的块拒绝次数达到该阈值时,会跳过继续搜索,直接扩展新的数据块,从而减少因反复扫描高使用率块带来的开销。
alter system set event='43822 trace name context forever,level 1','450502 trace name context forever, level 1',''450503 trace name context forever, level 1' scope=spfile sid='*';
ALTER SYSTEM SET "_assm_low_gsp_threshold"=512 SCOPE=BOTH;
ALTER SYSTEM SET "_assm_high_gsp_threshold"=512 SCOPE=BOTH;

2)其中enq: HW - contention主要由 8dyk8nmwd39aw引起。

SQL语句为:INSERT INTO A_COLL_*****_DAY (MGT_ORG_CODE, TMNL_ASSET_NO, METER_POINT_CNT, POINT_CNT, METER_U_POINT_CNT, U_POINT_CNT, METER_I_POINT_CNT, I_POINT_CNT, METER_P_POINT_CNT, P_POINT_CNT, METER_PF_POINT_CNT, PF_POINT_CNT, METER_PAR_POINT_CNT, PAR_POINT_CNT, METER_I0_POINT_CNT, I0_POINT_CNT, METER_CNT, METER_SUCC_CNT, TMNL_INTE_RATE, U_INST_RATE, I_INST_RATE, P_INST_RATE, PF_INST_RATE, PAR_INST_RATE, I0_INST_RATE, PO_POINT_CNT, IOT_POINT_NO, IOT_POINT_NAME, PRAP_IR_DAY, STAT_DATE) VALUES(:1 , :2 , :3 , :4 , :5 , :6 , :7 , :8 , :9 , :10 , :11 , :12 , :13 , :14 , :15 , :16 , :17 , :18 , :19 , :20 , :21 , :22 , :23 , :24 , :25 , :26 , :27 , :28 , :29 , :30 )

建议:
1) 将A_COLL_TMNL_ALL_INTE_DAY表的next extent 增大,默认是1M,可以修改为10M,减小大量insert带来的高水位争用。
2) 手工对改表进行扩容

3)enq: TX – contention等待链条如下。

with ash as
(select /*+ materialize*/ *
    from ashtmp t
   where to_char(sample_time,'yyyy-mm-dd hh24:mi') = '2025-06-04 08:35'
),
chains as
(select instance_number inst_id,
         blocking_session blocking_sid,
         blocking_session_serial# blocking_serial,
         session_id,
         session_serial# session_serial,
         level lvl,
         sys_connect_by_path(instance_number || ' ' || session_id || ',' ||
                             session_serial# || ' ' || sql_id || ' ' ||
                             event,
                             ' <-by ') path,
         connect_by_isleaf isleaf
   from ashtmp
   start with event in ('enq: TX - contention')
   connect by nocycle(prior blocking_session = session_id
                  and prior blocking_session_serial# = session_serial#
                  and prior sample_id = sample_id))
select inst_id,
       blocking_sid,
       blocking_serial,
       lpad(round(ratio_to_report(count(*)) over() * 100) || '%', 5, ' ') "%This",
       count(*) ash_time,
       path
  from chains
where isleaf = 1
group by inst_id, blocking_sid, blocking_serial, path
order by inst_id, ash_time desc;

image.png

可以发现enq: TX – contention 基本都被row cache lock 阻塞,对 row cache lock 进行分析发现

select inst_id,to_char(min(sample_time),'yyyy-mm-dd hh24:mi:ss') mintim
       ,to_char(max(sample_time),'yyyy-mm-dd hh24:mi:ss') maxtim
       ,event,p1,count(*) cnt
from ashtmp
where to_char(sample_time,'yyyy-mm-dd hh24:mi') between '2025-06-04 09' and '2025-06-04 11' 
and event in ('row cache lock')
group by inst_id,event,p1
order by inst_id,mintim,cnt;

image.png

DC_SEGMENTS:该缓冲队列锁会在段分配的时候发生,观察持有这个队列锁的会话在做什么。This is likely to be down to segment allocation. Identify what the session holding the enqueue is doing and use errorstacks to diagnose.

DC_TABLESPACES:该缓冲队列锁会在申请新的extent时产生,当next extent设置的较小,同时申请的较为频繁时产生。

select instance_number,user_id,count(*) cnt
from ashtmp
where to_char(sample_time,'yyyy-mm-dd hh24:mi') between '2025-06-04 09' and '2025-06-04 11' 
and event in ('row cache lock')
group by instance_number,user_id
order by instance_number,cnt;
1113
09
11717733
10319
1171238
117 就是 SG_MIS

image.png

用户对应的 quota tablespace 并没有做限制。

with ash as
(select /*+ materialize*/ *
    from ashtmp t
   where to_char(sample_time,'yyyy-mm-dd hh24:mi') = '2025-06-04 08:35'
),
chains as
(select instance_number instance_number,
         blocking_session blocking_sid,
         blocking_session_serial# blocking_serial,
         session_id,
         session_serial# session_serial,
         level lvl,
         sys_connect_by_path(instance_number || ' ' || session_id || ',' ||
                             session_serial# || ' ' || sql_id || ' ' ||
                             event,
                             ' <-by ') path,
         connect_by_isleaf isleaf
   from ashtmp
   start with event in ('row cache lock')
   connect by nocycle(prior blocking_session = session_id
                  and prior blocking_session_serial# = session_serial#
                  and prior sample_id = sample_id))
select instance_number,
       blocking_sid,
       blocking_serial,
       lpad(round(ratio_to_report(count(*)) over() * 100) || '%', 5, ' ') "%This",
       count(*) ash_time,
       path
  from chains
where isleaf = 1
group by instance_number, blocking_sid, blocking_serial, path
order by instance_number, ash_time desc;

image.png

以 1 4509,50180 为例进行分析

select to_char(sample_time,'yyyy-mm-dd hh24:mi:ss') tim,session_id,session_serial#,sql_id,event,p1,session_state,blocking_session,
 to_char(sql_exec_start,'yyyy-mm-dd hh24:mi:ss') sql_exec_start,sql_exec_id
from ashtmp t
where to_char(sample_time,'yyyy-mm-dd hh24:mi') between '2025-06-04 08:30' and '2025-06-04 08:45'
and instance_number = 1
and session_id = 4509
and session_serial#=50180
order by tim;

image.png

image.png

可以发现P1也是5,且SQL一直没有执行完成。

所以,本质上 enq: TX – contention 被 row cache lock 阻塞,raw cache lock 又因为 enq: HW – contention 引起,查询HW等待最严重的SQL

select sql_id,count(*) cnt
from ashtmp t
where to_char(sample_time,'yyyy-mm-dd hh24:mi') between '2025-06-04 08' and '2025-06-04 11' 
and event = 'enq: HW - contention'
group by sql_id
order by cnt;
159jx8w4gh1d1q    4
33v43zsn3h19n2    9
4    apd5vv681f54w    19
5    cz87u7n30qvh8    38
68tv7bhm1dvzvv    46
7    chua70mgbmx5j    57
864vb4f762gnaw    131
90w5addar5sg1j    206
10796yx0b9w53cm    520
11    dctqsa2bc5zru    863
12    c5hayahuzby96    1469
135a7451c2ku1j4    2570
143fa019sbh4653    3455
15079s8t9k1jzmu    4015
16    cufvk33jrzrpk        5310
178dyk8nmwd39aw    129508

最严重的SQL是 8dyk8nmwd39aw,建议参考2), 同时对于1000以上的也可以适当增加next extent.

问题原因

  1. 大量插入造成的索引分裂
      2)大量插入造成的高水位争用
  2. 由于enq: HW contention 导致raw cache lock 从而导致 enq: TX contetion,还是由于ENQ: HW引起

解决方案

1) 索引分裂

alter system set event=‘43822 trace name context forever,level 1’,‘450502 trace name context forever, level 1’,‘‘450503 trace name context forever, level 1’ scope=spfile sid=’*';

ALTER SYSTEM SET “\_assm\_low\_gsp\_threshold”=512 SCOPE=BOTH;

ALTER SYSTEM SET "assm\_high\_gsp\_threshold"=512 SCOPE=BOTH;

2) enq: HW

将A\_COLL\_TMNL*****\_DAY表的next extent 增大,默认是1M,可以修改为10M,减小大量insert带来的高水位争用。手工对改表进行扩容。


墨天轮从乐知乐享的数据库技术社区蓄势出发,全面升级,提供多类型数据库管理服务。墨天轮数据库管理服务旨在为用户构建信赖可托付的数据库环境,并为数据库厂商提供中立的生态支持。
墨天轮数据库服务官网:https://www.modb.pro/service


墨天轮
44 声望22 粉丝