This is Jerry's 67th article in 2021, and it is also the 344th original article in Wang Zixi's public account.
Since graduating from university in 2007 and joining SAP Chengdu Research Institute, Jerry has been engaged in the development of enterprise management software.
Enterprise management software is aimed at enterprise-level users. If the software fails (a bug), in some extreme cases the enterprise may suffer huge economic losses. This places higher demands on software developers in terms of programming standards, software testing, pre-delivery verification, and so on. At the same time, because enterprise management software is itself highly complex, some failures are difficult to reproduce, or can be reproduced only on the production system running the customer's specific business process. All of this makes failure analysis and handling in enterprise management software a huge challenge.
This article starts from an actual software failure handled by Jerry and shares his experience of handling some thorny failures in enterprise management software.
In Jerry's view, these thorny failures can be divided into the following categories.
Some manifestations of tricky faults in the field of enterprise management software
I have dealt with many headache-inducing failures at the SAP Chengdu Research Institute. Each of them had one or more of the following characteristics.
1. Requires a complicated process to reproduce
For example, I once handled a Customer Invoice failure in SAP Business ByDesign. The failure could be reproduced only when an invoice was released. To release an invoice, we must first create a Sales Order, create a Customer Demand based on the order, then create a Pick Task, generate a delivery note, and finally generate a new customer invoice.
These complex processes often require the corresponding master data and transaction data to be maintained in the system before they can be executed smoothly. Complex business processes make failures harder to reproduce.
2. The failure spans multiple modules of the enterprise management software
Because enterprise management software is so complex, a fault that looks simple to the end user may span multiple modules of the software.
Take the failure described in category 1 above as an example. Suppose the function described in the software help document is: the customer adds a new custom field on the sales order screen and maintains a value for it. This value is passed from the sales order to the pick task, then to the delivery note, and finally to the customer invoice. We call this transfer of field values across multiple documents a data flow.
If the customer then sees that the field is empty on the invoice page, the customer may think that the invoice module is malfunctioning. However, the module behind any node of the data flow may be the culprit. Sales orders and customer invoices belong to the CRM module, while pick tasks and delivery notes belong to SCM.
In actual development work, this means that analyzing the fault often requires cross-team collaboration, because the CRM and SCM modules are often owned by different development teams.
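The idea of walking the data flow node by node can be sketched in a few lines of Python. This is only an illustration: the document chain and module assignment follow the text above, while the function name, the dictionary of observed values, and the sample value "VIP" are assumptions made up for the example.

```python
from typing import Dict, List, Optional, Tuple

# Document chain and owning module, as described in the text.
DATA_FLOW: List[Tuple[str, str]] = [
    ("Sales Order", "CRM"),
    ("Pick Task", "SCM"),
    ("Delivery Note", "SCM"),
    ("Customer Invoice", "CRM"),
]


def first_node_losing_value(
    observed: Dict[str, Optional[str]],
) -> Optional[Tuple[str, str]]:
    """Return the first (document, module) in the flow whose field is empty."""
    for document, module in DATA_FLOW:
        if not observed.get(document):
            return document, module
    return None


# Hypothetical observation: the custom field is still filled on the pick task
# but already empty on the delivery note.
observed_values = {
    "Sales Order": "VIP",
    "Pick Task": "VIP",
    "Delivery Note": None,
    "Customer Invoice": None,
}

culprit = first_node_losing_value(observed_values)
print(culprit)  # ('Delivery Note', 'SCM')
```

In this made-up case the customer sees the symptom on the invoice, but the first node that loses the value belongs to SCM, so the SCM team is the one to involve first.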
3. The fault can only be reproduced in the customer's production system
Before enterprise management software is delivered, different levels of testing must be carried out in the internal development, test and validation systems. Even so, for various objective reasons, some failures are exposed only when the application runs on the customer's production system, under configurations tied to business processes that only this customer uses. Such configurations are not covered by the software vendor's internal system tests.
Because this type of failure can be reproduced only in the customer's production system, it is harder to analyze and locate the problem, especially when the reproduction steps write data to the production system. Usually the only option is to contact the customer's staff and, via remote desktop plus a conference call, have them perform the operations while the vendor's support staff debug online.
4. The fault can only be reproduced in the background operation mode, and everything is normal when running in the online mode
In enterprise management software, especially ERP, background jobs are often used to perform time-consuming batch tasks, such as batch processing of orders, report data analysis, aggregation and so on. Background mode differs from online mode, which has a user interface attached, and this makes single-step debugging more difficult.
5. The fault can only be reproduced in the normal operation mode of the software. When single-step debugging, the software works normally.
A fault with this characteristic sends a signal to the support staff: it may be related to the specific execution timing of the program. When the program runs normally, its execution timing is obviously different from its timing under single-step debugging. For example, stepping through a multi-threaded program in the debugger may disrupt its normal execution timing.
Because a powerful weapon like the debugger is not available, analyzing this type of failure demands stronger theoretical analysis and problem abstraction skills from the support staff.
Due to space limitations, this article only gives a practical example to share the analysis and processing flow of the above-mentioned fifth type of failure.
Jerry was once responsible for the maintenance of the SAP CRM IBASE (Installed Base) module. IBASE is an abstract model used to describe resource objects such as equipment, machines, services, or software that have been installed at the customer's location. The IBASE model describes the hierarchical structure of these objects and their various components in a tree structure, which is the reference basis of the service module.
One day, I received a failure report. A colleague from another team had used the IBASE API owned by my team to create an IBASE component, modify it, delete it, and then save, all in the same session. The result was a runtime error.
The screenshot of the runtime error mentioned in the fault description is shown in the figure above.
This colleague found that the error could be reproduced only in background mode, and not every time. It could not be reproduced in single-step debugging mode.
Not always reproducible != not reproducible.
To analyze this problem, I had to find a way to reproduce it stably. Because this fault was immune to single-step debugging, I had to think of other ways.
Reading the failure report word by word, the sequence of operations before the failure was:
(1) Create IBASE
(2) Modify IBASE
(3) Delete IBASE
(4) Save the transaction.
A runtime error occurred.
Because I am the person in charge of the IBASE module, I quickly wrote a program of fewer than 200 lines. In the program, I call the IBASE creation, modification and deletion APIs in turn, and then save the transaction.
The program source code is as follows:
Executing this report produced the expected runtime error. This was a good sign, because I had now found a way to reproduce the problem stably. The next step was to narrow the scope of the problem and find out which part of my 200 lines of code triggered the runtime error.
Jerry likes to call this kind of program he developed specifically for analyzing failures and reproducing errors, "scaffolding programs" or "fault triggers."
Because I wrote these 200 lines of code myself, I can modify them at will.
- First, comment out all the code, leaving only the IBASE creation API call. Execute the program: everything is normal.
- Then uncomment the IBASE modification API call so that it takes part in the execution. Everything is still normal.
- Then uncomment the IBASE deletion API call and execute the program: a runtime error occurs!
This shows that the runtime error is related to the IBASE deletion scenario.
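The narrowing procedure above can be sketched generically in Python. This is a minimal illustration, not the original ABAP report: the three step functions are hypothetical stand-ins, and the deletion stand-in simply raises an error to simulate the runtime error.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[], None]]


def first_failing_step(make_steps: Callable[[], List[Step]]) -> Optional[str]:
    """Enable one more step per run; return the step whose addition first fails."""
    total = len(make_steps())
    for enabled in range(1, total + 1):
        steps = make_steps()  # fresh state per run, like re-executing the report
        try:
            for _, action in steps[:enabled]:
                action()
        except Exception:
            return steps[enabled - 1][0]
    return None


def make_demo_steps() -> List[Step]:
    """Hypothetical stand-ins for the create, modify and delete API calls."""
    state = {"exists": False}

    def create() -> None:
        state["exists"] = True

    def modify() -> None:
        assert state["exists"], "nothing to modify"

    def delete() -> None:
        raise RuntimeError("simulated runtime error")

    return [("create", create), ("modify", modify), ("delete", delete)]


print(first_failing_step(make_demo_steps))  # delete
```

Each run starts from a fresh state, which mirrors re-executing the scaffolding program after uncommenting one more API call.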
Back to the screenshot of the runtime error in the failure report: line 103 raises an error of type X because the function CRM_IBASE_COMP_GET_DETAIL was called and could not read any IBASE data for the timestamp specified by the input parameters i_date and i_time. The program therefore terminates execution by raising the error.
Following the call stack in the runtime error context, I found out why CRM_IBASE_COMP_GET_DETAIL did not return any IBASE data. The CHECK statement highlighted on line 53 in the following figure checks whether the incoming timestamp (by default, the timestamp at which the IBASE was created) is less than the valto field (valid to, the timestamp at which the IBASE's validity ends) of the IBASE header to be read. If it is less, execution continues with the next CHECK on line 54. If it is greater than or equal, the loop body containing the data reading logic is exited.
In background job mode, and when my scaffolding program was executed directly, the timestamp condition on line 53 was not met, so the loop was exited. As a result CRM_IBASE_COMP_GET_DETAIL read nothing, which caused the fault.
There are only two ways the condition on line 53 can fail:
Current timestamp > IBASE valto field value
Current timestamp = IBASE valto field value
It should be emphasized that the timestamp field in the ABAP programming language is accurate to the second. For example, 20211024102424 represents October 24, 2021 at 10:24:24.
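To see how second-level precision interacts with a strict "less than" comparison, here is a small Python sketch. It is not the ABAP code from the screenshot: both function names are assumptions, and the comparison only mirrors the condition as described in the text (read timestamp < valto).

```python
from datetime import datetime


def to_second_timestamp(moment: datetime) -> int:
    """Format a moment as YYYYMMDDhhmmss, dropping everything below one second."""
    return int(moment.strftime("%Y%m%d%H%M%S"))


def passes_validity_check(read_timestamp: int, valto: int) -> bool:
    """Strict comparison, mirroring the check described for line 53."""
    return read_timestamp < valto


# Two moments 0.8 seconds apart, inside the same wall-clock second.
created = datetime(2021, 10, 24, 10, 24, 24, 100_000)
deleted_same_second = datetime(2021, 10, 24, 10, 24, 24, 900_000)
deleted_next_second = datetime(2021, 10, 24, 10, 24, 25)

read_ts = to_second_timestamp(created)
same_second_ok = passes_validity_check(read_ts, to_second_timestamp(deleted_same_second))
next_second_ok = passes_validity_check(read_ts, to_second_timestamp(deleted_next_second))

print(read_ts, same_second_ok, next_second_ok)  # 20211024102424 False True
```

Two events that are clearly ordered in real time collapse to the same value once sub-second precision is dropped, and a strict comparison then treats them as "not earlier".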
Although my scaffolding program cannot reproduce the fault in single-step debugging mode, it can reproduce it when executed directly. So I executed the scaffolding program and, on the runtime error page, clicked the Debugger button in the toolbar. This opens the debugger with full information about the runtime error raised by the program:
This time, in the debugger, the mystery was solved: the current timestamp was equal to the value of the IBASE valto field, so CRM_IBASE_COMP_GET_DETAIL read nothing and the runtime error was raised.
When calling the IBASE creation API, the valfr field of the header of the IBASE to be created will be assigned the current timestamp of the system.
When calling the IBASE delete API, the valto field of the header of the IBASE to be deleted will be assigned the current timestamp of the system.
Why can't this error be reproduced in single-step debugging mode? Let's look at a simple timing diagram.
The horizontal axis represents the timestamp. t3 represents the value of the <ibinadm>-valto field in the judgment statement on line 53 of code, and t1 represents the value of the lv_timestamp field in the judgment statement on line 53 of code.
In single-step debugging mode, suppose we step through the APIs starting from IBASE creation. Because a human can only press keys so fast, more than a second passes between creation and deletion, so t3 must be greater than t1.
In background mode, and when the scaffolding program runs normally, the IBASE creation, modification and deletion APIs may all complete within one second. The difference between t3 and t1 is then less than one second, so at second precision the two timestamps are equal, the CHECK statement fails, and the code returns directly.
In other words, the developers of the CRM IBASE API had not considered the scenario where an IBASE is created and deleted within the same second. After all, under normal circumstances a customer cannot create and then delete an IBASE in the UI within one second. This scenario is possible only when customers call the IBASE API in their own custom development.
Of course, the final fix was not simply to change the < operator in the CHECK statement on line 53 to <=. We carefully evaluated the possible side effects of such a change, discussed them with the developers of the team that reported the fault, and in the end adopted other methods to avoid it.
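The timing argument can be replayed deterministically with an injected clock. The sketch below is a hypothetical in-memory stand-in written in Python, not the SAP IBASE API: the class `FakeIbase`, its methods, and the open-ended default for valto are all assumptions made for illustration. Only the valfr and valto assignments and the strict comparison follow the description in the text.

```python
from typing import Callable, Optional

OPEN_ENDED = 99991231235959  # assumed "valid forever" default, for illustration


class FakeIbase:
    """Hypothetical stand-in for an IBASE header; not the SAP API."""

    def __init__(self, clock: Callable[[], int]) -> None:
        self._clock = clock
        self.valfr: Optional[int] = None
        self.valto: int = OPEN_ENDED

    def create(self) -> None:
        self.valfr = self._clock()  # valid-from = current timestamp

    def delete(self) -> None:
        self.valto = self._clock()  # valid-to = current timestamp

    def get_detail(self) -> str:
        # Reads at the creation timestamp and requires it to be strictly
        # earlier than valto, as described for the CHECK on line 53.
        if self.valfr is None or not (self.valfr < self.valto):
            raise RuntimeError("no IBASE data for the given timestamp")
        return "ibase data"


def run_scenario(create_time: int, delete_time: int) -> bool:
    """Return True if the read after create + delete succeeds."""
    times = iter([create_time, delete_time])
    ibase = FakeIbase(clock=lambda: next(times))
    ibase.create()
    ibase.delete()
    try:
        ibase.get_detail()
        return True
    except RuntimeError:
        return False


fast_run = run_scenario(20211024102424, 20211024102424)      # same second
debugger_run = run_scenario(20211024102424, 20211024102431)  # 7 seconds later
print(fast_run, debugger_run)  # False True
```

Passing the clock in as a parameter turns a timing-dependent fault into one that reproduces on every run, which is the same goal the scaffolding program serves.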
Going back to the failure analysis process itself: when the failure report first arrived, Jerry was at a loss because the failure could not be reproduced in single-step debugging. Later, he thought of writing a scaffolding program to reproduce the failure stably. This step was the breakthrough in the analysis.
With the scaffolding program in place, he first commented out all API calls, then gradually re-enabled the code for IBASE creation, modification and deletion, and finally narrowed the problem down to the IBASE deletion process.
When direct execution of the scaffolding program triggered the runtime error, he used the debugger to inspect the variable values at the moment the error was raised, pinned the problem down to the timestamp handling logic, and found the root cause.
This approach is a bit like the troubleshooting method used by computer DIY enthusiasts around the turn of the century when a self-built PC failed to start. Keep only the power supply, motherboard and CPU, and try to start. If that succeeds, add the graphics card, hard drive and other devices one by one. When a newly added device makes the system fail to start again, that device is the problem. At that time, enthusiasts called this the "Minimal System Method."
The most important thing in the entire analysis process is to abstract the content executed in the background job that cannot be stably reproduced in the fault report into a scaffolding program of less than 200 lines.
Chapter 5 of "Programming Pearls" shares an interesting debugging story: a programmer at an IBM research center installed a new workstation and found a strange fault. He could log in to the system only while sitting down; as soon as he stood up, he could not log in. Do you know how this fault was finally located? Go read the original book!
I hope this article can give you some inspiration for troubleshooting methods in the field of enterprise management software. Thank you for reading.
Related reading
- Jerry's introspection: programmers should not easily say "this function is technically impossible"
- Record the experience of a SAP development engineer reporting an incident to Microsoft Azure
More original articles by Jerry, all in: "Wang Zixi":