Plain-language walkthrough
Offline learning part
In essence, every moment in time and every spatial location is discretized into a spatiotemporal grid cell, and the expected revenue of each cell from that moment until the end of the day is computed from historical dispatch records (including drivers who took part in dispatch but received no order).
Key question: how is the expected revenue computed?
Dynamic-programming idea: suppose the time horizon is the interval [0, T). First compute the expected revenue of every grid cell at time T-1 (future revenue is 0 at this point, only immediate revenue remains), which in essence is just averaging the immediate revenue; then compute the expected revenue of every grid cell at time T-2; and so on, working backwards.
In this way, the expected revenue of every spatiotemporal grid cell up to the end of the day can be computed.
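A minimal sketch of this backward pass, assuming historical records have already been bucketed by (time slot, grid cell); the names `transitions`, `offline_value_iteration`, and the discount `gamma` are illustrative placeholders, not taken from the paper.

```python
from collections import defaultdict

def offline_value_iteration(transitions, T, gamma=0.9):
    """Backward dynamic programming over spatiotemporal grid cells.

    transitions[(t, cell)] is a list of historical records
    (reward, next_t, next_cell); records from drivers that joined
    dispatch but received no order carry reward 0.
    Returns V[(t, cell)]: expected revenue from (t, cell) to end of day.
    """
    V = defaultdict(float)  # states never observed keep value 0
    for t in range(T - 1, -1, -1):  # T-1, T-2, ..., 0: strictly backward in time
        for (tt, cell), records in transitions.items():
            if tt != t or not records:
                continue
            # Expected revenue = average over history of immediate reward
            # plus the discounted value of the state the driver lands in.
            V[(t, cell)] = sum(
                r + (gamma ** (nt - t)) * (V[(nt, nc)] if nt < T else 0.0)
                for r, nt, nc in records
            ) / len(records)
    return V
```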
Key point: why is the value function obtained this way reasonable?
The resultant value function captures spatiotemporal patterns of both the demand side and the supply side. To make it clearer, as a special case, when using no discount and an episode-length of a day, the state-value function in fact corresponds to the expected revenue that this driver will earn on average from the current time until the end of the day.
Online planning part
The match quality between an order and a driver is described by a formula with the following properties (a reconstruction is sketched after the list):
- The higher the trip price, the higher the match score
- The higher the value of the driver's current location, the lower the match score
- The higher the value of the future (drop-off) location, the higher the match score
- Pickup distance enters implicitly: the longer it is, the later the estimated completion time, the smaller the discount factor, and hence the lower the match score
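Putting the bullets together, the edge weight behaves like an advantage term. A hedged reconstruction consistent with the four properties above (the notation is mine: $\Delta t(i,j)$ is the estimated time from assignment to trip completion, $s^{\text{now}}_i$ and $s^{\text{dest}}_j$ are the driver's current and the order's drop-off spatiotemporal cells, and $R_\gamma(i,j)$ is the discounted trip price):

$$
A(i,j) \;=\; \gamma^{\Delta t(i,j)}\, V\!\big(s^{\text{dest}}_j\big) \;-\; V\!\big(s^{\text{now}}_i\big) \;+\; R_\gamma(i,j)
$$

A higher price raises $R_\gamma$; a more valuable current location is subtracted; a more valuable drop-off location raises the first term; and a longer pickup distance increases $\Delta t$, shrinking $\gamma^{\Delta t}$ and hence the score.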
The Kuhn-Munkres (KM) algorithm is then used to solve the driver-order matching.
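A sketch of the planning step under the same illustrative assumptions. SciPy's `linear_sum_assignment` (the Hungarian algorithm) is used here instead of a hand-rolled KM implementation; both solve the same maximum-weight bipartite matching problem. The field names (`price`, `eta_slots`, `dest_state`, `state`, `id`) and the helpers `edge_weight` / `dispatch_orders` are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def edge_weight(driver, order, V, gamma=0.9):
    """Advantage-style match score between one driver and one order."""
    dt = order["eta_slots"]                 # estimated slots until trip completion
    r = order["price"] * gamma ** dt        # one possible choice of discounted price
    return gamma ** dt * V[order["dest_state"]] - V[driver["state"]] + r

def dispatch_orders(drivers, orders, V, gamma=0.9):
    """Build the weight matrix and solve the assignment problem."""
    W = np.array([[edge_weight(d, o, V, gamma) for o in orders] for d in drivers])
    rows, cols = linear_sum_assignment(W, maximize=True)  # maximum-weight matching
    return [(drivers[i]["id"], orders[j]["id"]) for i, j in zip(rows, cols)]
```

In practice, infeasible driver-order pairs (e.g. pickup distance too large) would be masked with a large negative weight before solving; that detail is omitted here.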
Evaluation
A/B test design
we adopted a customized A/B testing design that splits traffic according to large time slices (three or six hours). For example, a three-hour split sets the first three hours in Day 1 to run variant A and the next three hours for variant B. The order is then reversed for Day 2. Such experiments will last for two weeks to eliminate the daily difference. We select large time slices to observe long-term impacts generated by order dispatch approaches.
Observed gains
the performance improvement brought by the MDP method is consistent in all cities, with gains in global GMV and completion rate ranging from 0.5% to 5%. Consistent to the previous discoveries, the MDP method achieved its best performance gain in cities with high order-driver ratios. Meanwhile, the averaged dispatch time was nearly identical to the baseline method, indicating little sacrifice in user experience.
Visualization of the value function
How this is framed as reinforcement learning
The spatiotemporal grid cell is defined as the state; dispatching versus not dispatching is defined as the action; the expected revenue of a state is defined as the state-value function.
The goal of reinforcement learning is to find the optimal policy, which is equivalent to finding the optimal value function. What makes the dispatch setting unusual is that the agent in the model is each individual driver, while decisions are made by the platform, so drivers actually have no policy of their own; or rather, through the dispatch mechanism, the drivers' policies are unified into maximizing the platform's expected revenue. Under the RL framework, offline learning and online planning can therefore be viewed as the two steps of policy iteration: learning updates the value function, and planning performs the policy update. On closer inspection, though, this framing still feels a bit forced.
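Read this way, a day of operation looks roughly like one sweep of policy iteration. A schematic that reuses the hypothetical `offline_value_iteration` and `dispatch_orders` sketches above (so it is only as real as they are):

```python
def daily_policy_iteration(transitions, dispatch_rounds, T, gamma=0.9):
    """One 'policy iteration' sweep in the dispatch setting.

    transitions: bucketed historical logs, as in offline_value_iteration.
    dispatch_rounds: iterable of (drivers, orders) batches arriving today.
    Yields the assignment chosen for each dispatch round.
    """
    # Policy evaluation: offline learning recomputes V from past logs.
    V = offline_value_iteration(transitions, T, gamma)

    # Policy improvement: online planning acts greedily with respect to V
    # in every dispatch round of the day.
    for drivers, orders in dispatch_rounds:
        yield dispatch_orders(drivers, orders, V, gamma)
```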