What is capital loss
Capital (asset) loss usually refers to the loss of funds in payment scenarios, and it can be viewed from two perspectives:
- From the user's perspective: over-deduction of a user's funds causes the user to lose money. Such problems generally surface through channels like customer service; the over-charged amount can be refunded, but much of the user experience is already lost.
- From the company's perspective: losses mainly come from over-withdrawal, over-shipment, and over-recharge. Such losses are generally hard to recover, and this is real capital loss.
Take e-commerce as an example: it may involve all of the businesses in the figure above, with delivery, callbacks, message passing, and other logic or state synchronization flowing between them. If some of these interactions are lost, inventory becomes inconsistent, fund settlement and conversion go wrong, billing flows fail to terminate, logic triggers duplicate requests, or concurrency control is handled improperly, the end result is a loss of assets or funds.
Besides a battery of rigorous tests beforehand and analysis, optimization, and remediation afterwards, we can also monitor while the business is running. For this, we prevent and control risk through the self-developed DCheck platform.
Getting to know DCheck
This system is led by the "Transaction & Stability" team. Its main goal is to discover data problems in a timely manner and keep the business running on stable data, especially in scenarios that involve capital loss. To achieve real-time, effective monitoring, a quasi-real-time checking system called DCheck was built against this background. The platform is based on monitoring MySQL binlogs and subscribing to MQ message streams; by configuring trigger conditions, rules, task execution, and alarms, it verifies that states stay consistent between the upstream and downstream of each business and that deduction, inflow, and outflow amounts are calculated accurately.
DCheck in depth
Functional level
Architecture logic
Concepts
- Topic: a logical database or an MQ subscription
- Event: an UPDATE/INSERT operation, or a custom MQ event
- Sub-event: the first layer of filtering on the data produced by an event
- Script pool: scripts implement two methods, filter and check; filter handles the second layer of data filtering, while check performs the logical verification between the upstream and downstream of the business
- Rules: the core execution unit that ties triggering and checking together
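To make these concepts concrete, here is a minimal sketch of what the BaseScript contract implemented by script-pool scripts presumably looks like. The real interface is internal to DCheck; only the method names and return conventions below are taken from the example script later in this article:

```groovy
import com.alibaba.fastjson.JSONObject

// Hypothetical sketch of the DCheck script contract (the real interface is internal).
interface BaseScript {
    // Second-layer filtering: return true to let the record proceed to check().
    boolean filter(JSONObject doneCleanData)

    // Business verification: return "SUCCESS" to pass, or an error message that triggers an alarm.
    String check(JSONObject doneCleanData)
}
```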
Using DCheck
A simple demonstration of the system's functions and usage follows:
Check configuration
Topic management
- TOPIC: the database name
- Topic code: the table name
- Topic name: the table name in Chinese
- MQ instance address: fill in * for binlog sources
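For instance, a topic for a hypothetical refund table might be filled in like this (all values are purely illustrative, not real database or table names):

- TOPIC: trade_db
- Topic code: refund_bill
- Topic name: 退款单 (refund bill)
- MQ instance address: *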
Event management
For binlog sources, events are divided into INSERT and UPDATE. They are generated automatically from the topic configuration above and do not need to be created manually.
Sub-event management
This is the filtering layer for the cleaned data. In the rule expression here, records for which the expression returns 'TRUE' enter the next layer, for example:
```groovy
if (obj.status.toInteger() == 10000 && (obj.type.toInteger() == 101 || obj.type.toInteger() == 301)) return 'TRUE';
```
If you want to pass everything through, simply return 'TRUE' directly.
Script pool management
All executed scripts are written in Groovy; each one mainly implements the two BaseScript methods, filter and check. You can refer to the internal script library:
- filter: secondary filtering after the event-level filtering, mainly for conditions that cannot be judged from doneCleanData alone and require fetching additional forward or reverse data, or other complex logic.
- check: the verification logic, mainly for verifying that upstream and downstream states are synchronized and for comparing complex calculated results (especially account deductions, inflows, and outflows).
Code example:

```groovy
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import groovy.util.logging.Slf4j;
import org.springframework.stereotype.Service;
import javax.annotation.Resource;

/**
 * DCheck: within the cooling-off period, platform customer service cancels the order and refunds the buyer's payment
 */
@Slf4j
@Service
class CheckRefundPayForLess30min implements BaseScript {

    @Resource
    private OrderDevOpsApi orderDevOpsApi;
    @Resource
    DCheckApi dCheckApi;
    @Resource
    private PayServiceFeignClient payServiceFeignClient

    String logTag = "TAG_CheckCrossAndOverSeaRefundPayForLess30min:{}"

    // 1. Order-close time minus payment time < 30 minutes
    @Override
    boolean filter(JSONObject doneCleanData) {
        // Look up the payment time
        String unionId = doneCleanData.getString("order_no");
        String payTime = getOrderData(unionId, "payTime", DevOpsSceneEnum.FORWARD_PAY);
        long modifyTime = doneCleanData.getDate("modify_time").getTime();
        long diffTime = modifyTime - Long.valueOf(payTime)
        if (diffTime < 30 * 60 * 1000) {
            log.info(logTag, "===> matching data found, entering check")
            return true;
        }
        return false;
    }

    @Override
    String check(JSONObject doneCleanData) {
        String subOrderNo = doneCleanData.getString("sub_order_no");
        Result<List<String>> listResult = dCheckApi.queryPayNoBySubOrderNo(subOrderNo);
        if (listResult == null || listResult.getData() == null) {
            return "No pay-log number found for this sub-order number via the forward query API";
        }
        if (listResult.getData().size() > 1) {
            return "Multiple pay-log numbers found for this sub-order number via the forward query API; check whether the logic needs optimization";
        }
        String outPayNo = listResult.getData().get(0);
        RefundQueryRequest refundQueryRequest = new RefundQueryRequest();
        refundQueryRequest.setPayLogNum(outPayNo);
        Result<List<RefundBillDTO>> resp = payServiceFeignClient.queryRefundsByPayLogNum(refundQueryRequest);
        // Check whether the payment query returned data: report a data error if empty, or if multiple rows came back
        if (resp == null || resp.getData() == null) {
            return "Upstream data is empty: payment refund query (by pay-log number)";
        } else if (resp.getData().size() != 1) {
            return "Upstream returned multiple rows, please confirm the logic: payment refund query (by pay-log number)";
        }
        // Check point 1: the payout status must be "paid out successfully"
        if (resp.getData().get(0).getStatus() != 2) {
            return "Payment payout status is not 2";
        }
        // Check point 2: the trade refund amount must match the amount from the RPC query, otherwise alarm
        if (resp.getData().get(0).getAmount() != doneCleanData.getLong("amount")) {
            return "Trade refund amount does not match payment payout amount";
        }
        return "SUCCESS";
    }

    // Query the value of the given field from the database
    String getOrderData(String unionNo, String key, DevOpsSceneEnum devOpsSceneEnum) {
        // internal implementation omitted....
        return value;
    }
}
```
Rule configuration
The rule configuration combines all of the basic configurations above and is the real core of execution. It has two main blocks, basic information and the downgrade strategy (a sketch of the combined shape follows the list below):
Basic information: sub-events (search and multi-select supported) plus a script (search and select) together define the trigger and execution logic; other auxiliary information is configured according to each domain's own requirements.
Downgrade strategy:
- Sampling percentage: the percentage of online traffic to sample. Traffic for pre-release tests, or checks with a heavy impact on the business, must be throttled and should not be 100%.
- First delay time: how long to delay execution after the trigger fires. Business processes involve some data-synchronization lag, so to avoid false alarms from states that simply have not synchronized yet, it is recommended to set a delay, generally around 10 seconds.
- Maximum timeout and effective time: configuration of when and for how long the rule is in effect.
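Put together, a rule is conceptually a structure like the one below. This is only a sketch of the shape; DCheck configures all of this through its web UI, and every field name here is hypothetical:

```groovy
// Illustrative shape of a DCheck rule: sub-events + script + downgrade strategy.
def rule = [
    subEvents    : ['refund_bill_UPDATE'],        // supports search and multi-select
    script       : 'CheckRefundPayForLess30min',  // picked from the script pool
    samplePercent: 10,       // sample part of online traffic; avoid 100% for high-impact checks
    firstDelayMs : 10_000,   // ~10s first delay so upstream/downstream data can finish syncing
    maxTimeoutMs : 60_000,   // maximum timeout for a single execution
    effectiveFrom: '2021-05-01 00:00:00',         // rule effective-time window
    effectiveTo  : '2021-12-31 23:59:59',
]
```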
Tool usage
Check exceptions
Abnormal check data is generally sent to the configured Feishu alert group and to the configured individuals first. Clicking through jumps to a page showing the specific data of the error. After confirmation, if it turns out to be a script problem or a partial data problem, it can be marked as handled and "retransmitted" once fixed; if it is confirmed to be a genuine problem, then a "capital loss" issue has been located.
Mock
Because scripts call RPC interfaces, there is currently no good way to debug them locally, so you must configure the rules first and then debug with the mock tool. The feature mainly used is rule debugging: select the target rule, search for or construct a JSON request parameter that matches the DCheck scenario, submit the request, and inspect the response.
There is a problem here: because DCheck's internal logic wraps some script system exceptions uniformly, you often cannot see the specific cause, only a pass or fail outside your own logic. This forces you to print more logs in the script and then trace the specific logic problem through the log platform.
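Until the platform surfaces script errors directly, a practical workaround is to log every decision point so the log platform can reconstruct what happened. A sketch, reusing the log.info(logTag, ...) convention from the example script above; the condition itself is illustrative:

```groovy
@Override
boolean filter(JSONObject doneCleanData) {
    // Log the raw input once, so a failing mock run can be replayed from the logs.
    log.info(logTag, "filter input: " + doneCleanData.toJSONString())

    boolean hit = doneCleanData.getInteger("status") == 10000  // illustrative condition
    log.info(logTag, "filter result: " + hit)
    return hit
}
```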
Some tips
Rule configuration tips
- A rule can select multiple events, so script logic whose checks are the same or similar across different events can be merged, reducing the number of rules to maintain.
- Configure trigger conditions on the event (or sub-event) where possible, and use script-side filter processing as little as possible; reserve code for logic that goes beyond the trigger data itself, as shown in the sketch below.
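As an illustration of the second tip: a condition that only reads fields of the trigger record belongs in the sub-event expression, while a script filter should be reserved for conditions that need extra lookups. A minimal sketch, reusing the sub-event syntax shown earlier:

```groovy
// Sub-event layer: cheap and declarative; runs before any script-pool code.
// Keep simple field predicates here instead of re-implementing them in filter().
if (obj.status.toInteger() == 10000) return 'TRUE'
```

By contrast, the filter in the CheckRefundPayForLess30min example above genuinely needs a script, because it must fetch the payment time over RPC before it can compare timestamps.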
Using Groovy closures
Script data processing involves a lot of list and key-value handling, and Groovy's closures can be used to great effect here, dramatically simplifying what would be complex processing logic in Java.
Example: suppose a query returns data of the following shape into objectList:

```
[{id=10086, refundNo=RE10086, orderNo=100888, userId=15206, bizType=110, payTool=0, payStatus=404, amount=100, feature=, isDel=0, createTime=2021-05-11 21:39:34.000, modifyTime=2021-05-11 21:39:34.000, moneyFrom=1, currency=, countryCode=},
 {id=10087, refundNo=RE10087, orderNo=100999, userId=15206, bizType=202, payTool=0, payStatus=404, amount=400, feature=, isDel=0, createTime=2021-05-11 21:39:34.000, modifyTime=2021-05-11 21:39:34.000, moneyFrom=1, countryCode=}]
```
- For filter-style conditional judgment, you can use any:

```groovy
def filterResult = objectList.any { it.bizType in [110, 119] && it.payStatus == 404 };
return filterResult
```

- For check-style retrieval of a qualifying value, you can use find:

```groovy
def amount = objectList.find { it.type == 5 }.amount
```
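Beyond any and find, closures such as groupBy and sum are handy when a check needs to reconcile amounts. For example, against the sample objectList above:

```groovy
// Total refund amount per bizType, e.g. for comparison against an upstream figure.
// For the sample data this yields [110: 100, 202: 400].
def totalByType = objectList.groupBy { it.bizType }
                            .collectEntries { type, rows -> [type, rows.sum { it.amount }] }
```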
For more Groovy features, see: https://www.jianshu.com/p/5d30f1443aa6
Some shortcomings of the platform
There is no convenient way to debug
At present the local script library has no environment where scripts can be run and debugged. Although there is a mock tool, you must first configure the events and upload the script configuration before you can debug. Script logic problems also require going back and forth: adding more logging to the script and checking the logs again. On top of that, once debugging passes in the test environment, you have to repeat the whole configuration process again in the production environment.
Suggestion: add a Debug button to the pages for creating sub-events, scripts, and rules online, so you can directly set the parameters, or grab a piece of matching data, run the debug, and see the result, ideally together with the corresponding logs.
Putting the filter and check methods together in the script pool creates a lot of redundancy
While developing scripts and configurations we found that many filter logics, or many check logics, are identical; but because the two are bundled into a single script, the logic ends up written cross-wise and cannot be effectively extracted into shared components.
Suggestion: split filter and check into separate shared pools and configure them independently in the rules, supporting both reference and import modes. Import support is mainly convenient when most of the logic is the same and only a few parameters differ, enabling quick modification of the configured rules and fast rollout.
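One possible shape for that split, purely as a sketch of the suggestion rather than anything DCheck offers today: two independent contracts that a rule composes by reference:

```groovy
import com.alibaba.fastjson.JSONObject

// Hypothetical split of today's BaseScript into two separately reusable pools.
interface FilterScript {
    boolean filter(JSONObject doneCleanData)
}

interface CheckScript {
    String check(JSONObject doneCleanData)
}

// A rule would then reference one of each, e.g.:
// rule = filterRef('RefundWithin30min') + checkRef('RefundAmountMatchesPayout')
```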
There is no circuit breaker when problems occur
Although the platform's real-world effectiveness remains to be seen, suppose it later works very well and a large-scale data problem or capital loss really occurs: the platform's mechanism is still only to raise an alarm and wait for engineers to intervene. That is essentially still after-the-fact handling, like the post-event remediation mentioned at the beginning, and cannot stop the loss in time.
Suggestion: once the platform becomes accurate enough, consider linking it with circuit breakers in key domains so losses can be stopped in time.
The platform should add on/off switches
At present there is no way to simply stop a rule from executing; it can only be disabled by editing its sampling percentage down to 0 or changing its effective time. There are also no batch operations, which makes day-to-day operation somewhat inconvenient.
Suggestion: add on/off switches, and add batch operations for things such as sampling percentage, switches, and alerting.
The platform could consider a dynamic mechanism for adjusting traffic
Since many check points call into the APIs of each domain, heavy traffic may noticeably impact the business product, especially for key businesses; alternatively, an interface with poor compatibility may throw system exceptions that produce a flood of unconventional errors and blow up everyone's alert channels.
Suggestion: grade the rules so core businesses get priority configuration; when the problems above occur, trigger an automatic adjustment that reduces the sampled traffic, and automatically raise it again once the problem recovers (see the sketch below).
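A sketch of what such self-throttling might look like, entirely hypothetical and not an existing DCheck feature:

```groovy
// Hypothetical traffic self-adjustment: halve sampling on error spikes,
// recover slowly once the error rate returns to normal.
int adjustSamplePercent(int currentPercent, double recentErrorRate) {
    if (recentErrorRate > 0.2)  return Math.max(1, currentPercent.intdiv(2))
    if (recentErrorRate < 0.01) return Math.min(100, currentPercent + 5)
    return currentPercent
}
```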
Text | Daqi
Follow Dewu Technology, and let's walk hand in hand toward the cloud of technology.