High-performance data access middleware OBProxy (3): Troubleshooting and service operation and maintenance

1 Introduction

In the last article, we explained the installation and deployment of OBProxy. In practice, you may also fail to install and deploy due to problems such as machine environment and operation sequence. At this time, you can ask questions in the open source community Q&A section of OceanBase, and professional engineers will answer them. your problem. (You can also join the nail group: 33254054 to communicate face-to-face with technical students) After completing the installation and deployment of OBProxy, you can use OBProxy to access the OceanBase database.

At the beginning of using OBProxy, troubleshooting and operation and maintenance of OBProxy are the obstacles that everyone will encounter. Therefore, in this chapter, we will explain the troubleshooting methods and service operation and maintenance methods in detail, so that everyone can use OBProxy more easily and become an OBProxy expert.

OBProxy is an important part of the database access link. Combined with the content of this article, you can draw inferences from one case and think more about link tracking and troubleshooting of distributed systems, which is very helpful for understanding distributed systems. Alright, now let's get started!

2 Troubleshooting

In most cases, troubleshooting needs to be completed by the team. I will introduce some norms and ideas below to facilitate communication and problem solving.

2.1 Describe the problem

When encountering a problem, we first need to describe the problem clearly, preferably to be able to reproduce the problem, so that the R&D students can effectively solve the problem. For example, when asking questions in the OceanBase open source community's Q&A version, it is recommended to use the following template to describe the problem:

 <!-- 为了避免多次沟通确认，提升效率，请提供以下信息 -->
【 使用环境 】生产环境 or 测试环境
【 OB or 其他组件 】
【 使用版本 】
【问题描述】清晰明确描述问题
【复现路径】问题出现前后相关操作
【问题现象及影响】

【附件】

It is recommended that you fill in as required. While filling in the content, it will also help you to sort out the problem more clearly. The problem description suggests at least the following two points:

Problem time: For problems that are not easy to reproduce, time information is helpful for sorting out problems and aligning problems.
Client error report: The client error report contains rich information (SQL, stack, execution time, etc.), and the detailed information is very helpful for troubleshooting; when there is a conclusion, you can compare the results before and after to further verify the conclusion.

2.2 Analyzing the problem

After describing the problem, for the troubleshooter, it is time to analyze the problem. Before analyzing the problem, we must first understand the database access link globally, which mainly includes the following modules:

From the global to the local, the thinking will be clearer. OBProxy interacts with various modules. You can learn more about the application, LB and ObServer when you have time. When analyzing problems, analyze them one by one in the order from front to back (starting from the application side), first determine which module is the problem, and then determine the specific problem of the module.

Here are some inefficient troubleshooting methods to avoid:

The customer said that the application uses the database to report an error, so go directly to the log directory of OBProxy to search for WARN and ERROR level logs. If the log printing is not standardized or there are too many WARN/ERROR logs, this method is very useful. of inefficiencies that have nothing to do with customer issues.
The customer said that the database connection was broken. I skipped the intermediate link and started the investigation directly from the OceanBase database, but no problems were found.

2.3 Classification issues

This section classifies the common problems of OBProxy, so that you can have a better idea when analyzing the OBProxy module. From a business perspective, the problems of OBProxy mainly include the following categories. ODP in the figure below is the abbreviation of OBProxy.

Login failure: Most of the problems are caused by a configuration problem somewhere, which is easier to reproduce and troubleshoot.
SQL execution: There are many types of problems, and some difficult problems are difficult to troubleshoot.
Return error: The back-end service returns a clear error code. According to the error code, go to OBServer and OBProxy for troubleshooting.
Slow SQL: It is necessary to determine the time consumption of each module and find the bottleneck point to optimize. In addition to the modules in 2.2, the network is also an important factor.
Disconnection: Similar to slow SQL, it is necessary to first check which module is actively disconnecting (actively sending FIN messages), and then check the specific module.

2.4 Summary

After the introduction of the previous content, everyone has a methodology for troubleshooting, but this is only the first step of the Long March. Because of the limited length of the article, I will not continue to introduce it in depth. There are some simple examples later to facilitate your further understanding. Everyone also needs to constantly summarize the actual combat in order to face the problem with ease.

3 OBProxy logs

When solving the problem of OBProxy, there are three magic weapons: OBProxy logs, Linux commands (network commands, system commands and text commands) and OceanBase Cloud Platform (OCP for short). Among them, the knowledge points of Linux command and monitoring platform are very common, and there are a lot of information on the Internet. You can learn it by yourself. After mastering it, it will be very helpful for you to troubleshoot problems.

In this section, we mainly introduce the log part that is closely related to OBProxy. There are various types of OBProxy logs. We will introduce the function of each log to help you troubleshoot problems.

3.1 Error log

The file name of the error log is obproxy_error.log, and the error log will record the execution error request, including OBProxy's own error and OBServer return error. We test with select obproxy_error from dual , in which the obproxy_error field is not quoted, it will be treated as a column, causing the execution to fail. The client reports the following error:

 MySQL [test]> select obproxy_error from dual;
ERROR 1054 (42S22): Unknown column 'obproxy_error' in 'field list'

Open obproxy_error.log, the content is as follows:

 2022-07-11 10:26:09.358231,undefined,,,,ob9988.zhixin.lm.xx.xx.xx.xx:sys:test,OB_MYSQL,,,COM_QUERY,SELECT,failed,1054,select obproxy_error from dual,42423us,454us,0us,41222us,Y0-7F4B1EF653A0,,,,0,xx.xx.xx.xx:33041,Unknown column 'obproxy_error' in 'field list'

# 日志通过逗号分隔，如果SQL中有逗号，会通过%2C替代，通过tr ',' '\n'替换结果如下

1,2022-07-11 10:26:09.358231    #日志打印时间
2,undefined           # sharding模式下逻辑租户名
3,                    # 无需关注，内部使用
4,                    # 无需关注，内部使用
5,                    # sharding模式下逻辑库名
6,ob9988.zhixin.lm.xx.xx.xx.xx:sys:test  # 物理库信息（cluster:tenant:database）
7,OB_MYSQL         # 数据库类型
8,                 # 逻辑表名
9,                 # 物理表名
10,COM_QUERY       # SQL 命令（COM_QUERY、COM_STMT_PREPARE等）
11,SELECT          # SQL 类型
12,failed          # 执行结果（success/failed）
13,1054            # 错误码（succ时为空）
14,select obproxy_error from dual  # SQL语句
15,42423us         # 执行总耗时（ms，包括内部 SQL 执行耗时
16,454us           # 预执行时间
17,0us             # 建立连接时间
18,41222us         # 数据库执行时间
19,Y0-7F4B1EF653A0 # OBProxy内部日志trace_id
20,           # 无需关注，内部使用
21,           # 无需关注，内部使用
22,           # 无需关注，内部使用
23,0          # 无需关注，内部使用
24,xx.xx.xx.xx:33041  # 路由到的ObServer的地址信息
25,Unknown column 'obproxy_error' in 'field list' # 报错信息

3.2 Audit log

The file name of the audit log is obproxy_digest.log, and the audit log records requests and error response requests whose execution time is greater than the parameter query_digest_time_threshold threshold (default 100ms). For the example in 3.1, we can also see the log from obproxy_digest.log.

 2022-07-11 10:26:09.358221,undefined,,,,ob9988.zhixin.lm.xx.xx.xx.xx:sys:test,OB_MYSQL,,,COM_QUERY,SELECT,failed,1054,select obproxy_error from dual,42423us,454us,0us,41222us,Y0-7F4B1EF653A0,,,,0,xx.xx.xx.xx:33041

In addition to executing the wrong SQL, we use select sleep(3) from dual imitate slow SQL. After execution, check obproxy_digest.log, and you can see that OBProxy execution took 409us, and ObServer execution took 3039883us. The content is as follows:

 2022-07-11 14:32:51.758265,undefined,,,,ob9988.zhixin.lm.xx.xx.xx.xx:sys:test,OB_MYSQL,,,COM_QUERY,SELECT,success,,select sleep(3),3041116us,409us,0us,3039883us,Y0-7F4B1CEA13A0,,,,0,xx.xx.xx.xx:33041

日志分析

1,2022-07-11 14:32:51.758265    #日志打印时间
2,undefined          # sharding模式下逻辑租户名
3,                   # 无需关注，内部使用
4,                   # 无需关注，内部使用
5,                   # sharding模式下逻辑库名
6,ob9988.zhixin.lm.100.88.147.179:sys:test  # 物理库信息（cluster:tenant:database）
7,OB_MYSQL         # 数据库类型
8,                 # 逻辑表名
9,                 # 物理表名
10,COM_QUERY       # SQL 命令（COM_QUERY、COM_STMT_PREPARE等）
11,SELECT          # SQL 类型
12,success         # 执行结果（success/failed）
13,                # 错误码（succ时为空）
14,select sleep(3) # SQL语句
15,3041116us       # 执行总耗时（ms，包括内部 SQL 执行耗时
16,409us           # 预执行时间
17,0us             # 建立连接时间
18,3039883us       # 数据库执行时间
19,Y0-7F4B1CEA13A0 # OBProxy内部日志trace_id
20,                # 无需关注，内部使用
21,                # 无需关注，内部使用
22,                # 无需关注，内部使用
23,0               # 无需关注，内部使用
24,xx.xx.xx.xx:33041  # 路由到的ObServer的地址信息

3.3 Slow log

The file name of the slow log is obproxy_slow.log, and the slow log records requests whose execution time is greater than the slow_query_time_threshold threshold (default 500ms). As shown in the previous section SQL select sleep(3) from dual is also recorded in obproxy_slow.log, the content is as follows:

 2022-07-11 14:32:51.758270,undefined,,,,ob9988.zhixin.lm.xx.xx.xx.xx:sys:test,OB_MYSQL,,,COM_QUERY,SELECT,success,,select sleep(3),3041116us,409us,0us,3039883us,Y0-7F4B1CEA13A0,,,,0,xx.xx.xx.xx:33041

The log format of the slow log is the same as that of the audit log, which is not described here.

For slow SQL, there will also be records in obproxy.log, the keyword is Slow Query , the information recorded in obproxy.log is more detailed, such as the above SQL, search obproxy.log to get:

 [2022-07-11 14:32:51.758195] WARN  [PROXY.SM] update_cmd_stats (ob_mysql_sm.cpp:8425) [74744][Y0-7F4B1CEA13A0] [lt=7] [dc=0] Slow Query: ((
client_ip={127.0.0.1:50422},   // 发送SQL的客户端地址
server_ip={xx.xx.xx.xx:33041}, // SQL被路由到的目标BOServer
obproxy_client_port={xx.xx.xx.xx:52052},  // 和OBServer连接的本地地址
server_trace_id=Y81100B7C0535-0005E3460FBBE3CD-0-0, // 目标OBServer中执行过程中的trace id
route_type=ROUTE_TYPE_NONPARTITION_UNMERGE_LOCAL, // SQL使用的路由策略
user_name=root,         // 用户名
tenant_name=sys,        // 租户名
cluster_name=ob9988.zhixin.lm.100.88.147.179,  // 集群名
logic_database_name=,  // 逻辑库名
logic_tenant_name=,    // 逻辑租户名
ob_proxy_protocol=0,   // 协议类型
cs_id=14,         // client login时看到的connection id, proxy分配
proxy_sessid=7230691598940700681, // client 访问ob时内部记录connection id
ss_id=21, 
server_sessid=3221588238,  // SQL在目标observer中的connection id, observer 分配
sm_id=14, 
cmd_size_stats={
    client_request_bytes:20,  // client发给proxy的请求包大小
    server_request_bytes:38,  // OBProxy发给目标observer的请求包大小
    server_response_bytes:0,  // 目标observer发给proxy的响应包大小
    client_response_bytes:71}, // OBProxy发给client的响应包大小
cmd_time_stats={
    client_transaction_idle_time_us=0, // 在事务中该条SQL与上一条SQL执行结束之间的间隔时间, 即client事务间隔时间
    client_request_read_time_us=97, // OBProxy从client socket读取请求包的耗时
    client_request_analyze_time_us=95, // OBProxy分析client的SQL耗时
    cluster_resource_create_time_us=0, // OBProxy创建集群资源耗时(仅首次访问集群时需要创建)
    pl_lookup_time_us=0, // 根据SQL获取涉及路由表的耗时
    pl_process_time_us=0,  // 对涉及路由表的进行筛选排序的耗时
    congestion_control_time_us=21,  // 根据SQL获取涉及黑名单信息的耗时
    congestion_process_time_us=3,  // 对涉及黑名单的进行检查过滤的耗时
    do_observer_open_time_us=55,  // 对目标observer获取可用连接的耗时, 包含connect_time
        server_connect_time_us=0,  // 对目标observer创建连接的耗时
    server_sync_session_variable_time_us=0,  // 对选择的目标连接进行初始化的耗时, 包括saved_login, 同步db, 同步系统变量, 同步last_insert_id, 同步start_trans
        server_send_saved_login_time_us=0,  // 对选择的目标连接进行saved login耗时
        server_send_use_database_time_us=0,  // 对选择的目标连接同步db耗时
        server_send_session_variable_time_us=0,  // 对选择的目标连接同步已修改的系统变量耗时
        server_send_all_session_variable_time_us=0,  // 对选择的目标连接同步所有系统耗时
        server_send_last_insert_id_time_us=0, // 对选择的目标连接同步last_insert_id耗时
        server_send_start_trans_time_us=0,   // 对选择的目标连接同步start_trans/begin耗时
    build_server_request_time_us=23,  // 构建对目标server的请求包的耗时
    plugin_compress_request_time_us=0,   // 对请求包进行压缩耗时
    prepare_send_request_to_server_time_us=409,  // OBProxy接受到客户端请求，到转发到observer执行前总计时间，正常应该是前面所有时间之和
    server_request_write_time_us=32,  // OBProxy向目标server socket发送请求包的耗时
    server_process_request_time_us=3039883,   // 目标server该执行SQL的耗时
    server_response_read_time_us=67,   // OBProxy从目标server socket读取响应包的耗时
    plugin_decompress_response_time_us=59, // 对响应包进行解压缩耗时
    server_response_analyze_time_us=70,  // 对响应包进行分析的耗时
    ok_packet_trim_time_us=0, // 对响应包trim掉最后一个ok包的耗时
    client_response_write_time_us=185,  // OBProxy向client socket发送响应包的耗时
    request_total_time_us=3041116},  // OBProxy处理该请求总时间, 等于前面所有耗时之和
sql=select sleep(3)   //client的请求SQL
)

The above content is very detailed, which is convenient for locating slow SQL problems. For the meaning of each part, you can refer to the SQL execution process described in the first article "OceanBase Rewriting Series I: OceanBase Query Rewriting Practice Overview" (confirm whether the following article and link are correct) to correspond.

3.4 Statistics log

The file name of the statistics log is obproxy_stat.log. The statistics log is output every minute by default (controlled by the monitor_stat_dump_interval parameter). Through this log, you can view the SQL execution status of OBProxy within one minute. For example, to check whether OBProxy has request traffic, just look at the log record. The reference example is as follows:

 2022-07-11 10:26:59.499204,undefined,,ob9988.zhixin.lm.xx.xx.xx.xx:sys:test,OB_MYSQL,SELECT,success,,1,1,0,0,41480us,332us,40369us

日志分析

1,2022-07-11 10:26:59.499204   #日志打印时间
2,undefined      # sharding模式下逻辑租户名
3,               # sharding模式下逻辑库名
4,ob9988.zhixin.lm.xx.xx.xx.xx:sys:test  # 物理库信息（cluster:tenant:database）
5,OB_MYSQL  # 数据库类型
6,SELECT    # SQL 类型
7,success   # 执行结果（success/failed）
8,          # 错误码（succ时为空）
9,1         # 总请求数量
1           # 30 ms ~ 100 ms 请求数量
0           # 100 ms ~ 500 ms 请求数量
0           # 大于 500 ms 请求数量
41480us     # 执行总耗时（ms，包括内部 SQL 执行耗时）
332us       # 预执行时间
40369u      # 数据库执行时间

3.5 Main log

The file name of the main log is obproxy.log, and it has four levels: DEBUG, TRACE, INFO, and ERROR. It is controlled by the syslog_level parameter. When you need to further investigate the cause of the problem, you can filter it through the trace_id field to get a session All logs of OBProxy, viewing obproxy.log requires very familiarity with the code implementation of OBProxy, which is technically difficult. Interested students can view the OBProxy source code .

3.6 Summary

OBProxy's logs are very useful for troubleshooting. Mastering the above content can deal with most OBProxy problems. This section introduces the log format and further explains it with examples. I believe you can quickly grasp it.

4 Service operation and maintenance

OBProxy has an administrator account root@proxysys. For security, we restrict the login address of this account and require everyone to log in locally for operation and maintenance operations. Below we will give a detailed introduction to configuration management and internal commands.

4.1 Configuration Management

For the configuration parameters of OBProxy, please refer to the parameter description in the documentation. Log in with root@proxysys to view and modify configuration items:

 # 查看日志级别

MySQL [(none)]> show proxyconfig like '%syslog_level%';
+--------------+-------+-----------------------------------------------------------------------------------+-------------+---------------+
| name         | value | info                                                                              | need_reboot | visible_level |
+--------------+-------+-----------------------------------------------------------------------------------+-------------+---------------+
| syslog_level | DEBUG | specifies the current level of logging: DEBUG, TRACE, INFO, WARN, USER_ERR, ERROR | false       | USER          |
+--------------+-------+-----------------------------------------------------------------------------------+-------------+---------------+

# 修改日志级别为INFO

MySQL [(none)]> alter proxyconfig set syslog_level = INFO;
Query OK, 0 rows affected (0.01 sec)

For the input of show proxyconfig, the meaning of each field is as follows:

name: Configuration name, all configuration item names can be obtained from the parameter description document.
value: Configuration content, mainly in integer and string formats. An error will be reported when the content is illegal.
info: Describes specific information about configuration parameters.
need_reboot: Indicates whether the parameter takes effect after reboot. It should be noted here that although some parameters do not need to be restarted, sometimes they only take effect on new connections.

4.2 Internal Commands

Internal commands are only executed on OBProxy and will not be forwarded to OBServer. They are mainly used to obtain the internal state of OBProxy or modify internal behavior. This part of the content is closely related to the principle. This section only gives a brief introduction. You can learn more after you have a more in-depth understanding of OBProxy.

5 Summary

The content of this chapter is a summary of the troubleshooting of OBProxy. These methods have gathered the efforts of OBProxy technical experts for many years and have undergone a lot of actual combat tests. It can not only be used for learning, but also can be used as a manual. When encountering problems with OBProxy, you can read the relevant content instructions and hope to help everyone!

6 After class interaction

6.1 Last Interaction

The obproxyd.sh script probes the OBProxy process. What is the probe method? In addition to the methods in the script, what other detection methods are there? What are the advantages and disadvantages of these methods, please give examples.

obproxyd.sh is probed by checking if the /proc/$pid file exists. You can also check if the port is listening or not. These two methods achieve process-level detection, but in some scenarios, if the process is abnormal and cannot execute business logic, it cannot be detected. Further, application heartbeat messages can be sent for detection.

6.2 Interaction in this issue

Xiao Ming wants to view all the SQL that has passed through OBProxy. Is there any way?