Tianchi Study Notes: O2O Coupon Usage Prediction [1]

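Problem Statement

Background: Using coupons to re-engage existing users or to attract new customers into stores is an important marketing tactic in O2O (Online to Offline). Randomly distributed coupons, however, are a pointless disturbance for most users. For merchants, indiscriminate coupons can damage brand reputation and make marketing costs hard to estimate. Personalized distribution is a key technique for raising the coupon redemption rate: it delivers real savings to consumers with matching preferences while giving merchants stronger marketing capability.

Goal: Using the rich data provided for the O2O scenario, analyze and model it to accurately predict whether a user will use a given coupon within the specified time window.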

Data Analysis

Read the data:
[image]
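The loading code in the original post survives only as an image, so the following is a minimal sketch of one way to load the data, not the author's exact code. It assumes the file is a comma-separated CSV with a header row, and it uses a few made-up rows in an in-memory buffer in place of the real file. By default pandas parses the literal string null as NaN; passing keep_default_na=False keeps it as the string 'null', which is what the comparisons later in this post rely on.

```python
import io

import pandas as pd

# Made-up rows standing in for the offline training file (assumption:
# the real file has the same seven columns and a header row).
sample_csv = io.StringIO(
    "User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received,Date\n"
    "1001,2001,null,null,0,null,20160217\n"
    "1001,2001,3001,20:1,0,20160217,null\n"
    "1002,2002,3002,0.95,null,20160429,20160501\n"
)

# dtype=str keeps dates such as 20160217 as strings;
# keep_default_na=False stops pandas from turning 'null' into NaN.
dfoff = pd.read_csv(sample_csv, dtype=str, keep_default_na=False)
print(dfoff.head())
```

To use it on the real data, replace the StringIO buffer with the path to the offline training CSV.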

The offline training set contains the following 7 fields:
User_id
Merchant_id
Coupon_id
Discount_rate
Distance
Date_received
Date

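When Coupon_id is null, the record is a purchase made without a coupon, and the Discount_rate and Date_received fields carry no meaning.

For the precise meaning of each field, see the competition link.

Depending on whether Coupon_id and Date are null, the records fall into four types: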

print('Has coupon, purchased:', dfoff[(dfoff['Coupon_id'] != 'null') & (dfoff['Date'] != 'null')].shape[0])
print('No coupon, purchased:', dfoff[(dfoff['Coupon_id'] == 'null') & (dfoff['Date'] != 'null')].shape[0])
print('Has coupon, did not purchase:', dfoff[(dfoff['Coupon_id'] != 'null') & (dfoff['Date'] == 'null')].shape[0])
print('No coupon, did not purchase:', dfoff[(dfoff['Coupon_id'] == 'null') & (dfoff['Date'] == 'null')].shape[0])

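The result:
[image]

Here 75382 is the number of purchases made with a coupon (positive samples); 977900 is the number of coupons that were received but never used, i.e. wasted coupons (negative samples); 701602 is the number of ordinary purchases made without a coupon.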
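The same four-way breakdown can also be produced in a single table. The sketch below is an alternative to the four print statements, not the author's code; it runs on a small made-up frame, and the label strings ('coupon', 'purchase' and so on) are hypothetical names chosen for readability.

```python
import numpy as np
import pandas as pd

# Made-up rows; 'null' marks a missing coupon or a missing purchase date.
df = pd.DataFrame({
    'Coupon_id': ['1', 'null', '2', 'null', '3'],
    'Date': ['20160101', '20160102', 'null', '20160103', 'null'],
})

has_coupon = pd.Series(
    np.where(df['Coupon_id'] != 'null', 'coupon', 'no coupon'), name='coupon')
purchased = pd.Series(
    np.where(df['Date'] != 'null', 'purchase', 'no purchase'), name='purchase')

# Rows: coupon / no coupon. Columns: purchase / no purchase.
table = pd.crosstab(has_coupon, purchased)
print(table)
```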

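Next we look at how the fields in the training set affect coupon usage.

1. Coupons and distance

print('Discount_rate values:', dfoff['Discount_rate'].unique())
print('Distance values:', dfoff['Distance'].unique())

[image]

The output shows that the values are of type str, so they need to be converted to numeric types.

Discount_rate encodes a discount in one of two ways: a value x in [0,1] is a discount rate, and x:y means y off for a purchase of at least x. We convert the x:y form into a discount rate with 1 - y/x (for example, 150:20 becomes 1 - 20/150 ≈ 0.867), and build four coupon features: discount_rate, discount_man (the threshold x), discount_jian (the reduction y) and discount_type. The code is as follows: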

# convert Discount_rate and Distance

def getDiscountType(row):
    if row == 'null':
        return 'null'
    elif ':' in row:
        return 1
    else:
        return 0

def convertRate(row):
    """Convert discount to rate"""
    if row == 'null':
        return 1.0
    elif ':' in row:
        rows = row.split(':')
        return 1.0 - float(rows[1])/float(rows[0])
    else:
        return float(row)

def getDiscountMan(row):
    if ':' in row:
        rows = row.split(':')
        return int(rows[0])
    else:
        return 0

def getDiscountJian(row):
    if ':' in row:
        rows = row.split(':')
        return int(rows[1])
    else:
        return 0

def processData(df):
    
    # convert Discount_rate into numeric features
    df['discount_rate'] = df['Discount_rate'].apply(convertRate)
    df['discount_man'] = df['Discount_rate'].apply(getDiscountMan)
    df['discount_jian'] = df['Discount_rate'].apply(getDiscountJian)
    df['discount_type'] = df['Discount_rate'].apply(getDiscountType)
    print(df['discount_rate'].unique())
    
    # convert distance
    df['distance'] = df['Distance'].replace('null', -1).astype(int)
    print(df['distance'].unique())
    return df

dfoff = processData(dfoff)
dftest = processData(dftest)

[image]
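The row-by-row apply calls above are easy to read but slow on large frames. As a hedged alternative, the sketch below computes the same discount_rate, discount_man and discount_jian values with vectorized string operations. It is an illustration on four made-up values, not the author's method, and the variable names are placeholders.

```python
import pandas as pd

raw = pd.Series(['150:20', '0.95', 'null', '20:1'])

# Rows of the form 'x:y' yield two captured groups; all other rows yield NaN.
parts = raw.str.extract(r'^(\d+):(\d+)$').astype(float)
man, jian = parts[0], parts[1]

# Plain rates parse directly; 'x:y' and 'null' become NaN here.
rate = pd.to_numeric(raw, errors='coerce')
# Where a threshold exists, use 1 - y/x instead.
rate = rate.where(man.isna(), 1.0 - jian / man)
# 'null' (no coupon) is treated as no discount, matching convertRate.
rate = rate.fillna(1.0)

discount_man = man.fillna(0).astype(int)
discount_jian = jian.fillna(0).astype(int)
print(rate.tolist(), discount_man.tolist(), discount_jian.tolist())
```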

2. Time
Process the coupon receipt dates (date_received) and the purchase dates (date_buy):

date_received = dfoff['Date_received'].unique()
date_received = sorted(date_received[date_received != 'null'])

date_buy = dfoff['Date'].unique()
date_buy = sorted(date_buy[date_buy != 'null'])

# Note: this overwrites the unique dates above with every purchase date (duplicates included)
date_buy = sorted(dfoff[dfoff['Date'] != 'null']['Date'])

The output:
[image]

Count the coupons received each day:

couponbydate = dfoff[dfoff['Date_received'] != 'null'][['Date_received', 'Date']].groupby(['Date_received'], as_index=False).count()
couponbydate.columns = ['Date_received','count']
couponbydate.head()

[image]

Count how many of those coupons were then used for a purchase:

buybydate = dfoff[(dfoff['Date'] != 'null') & (dfoff['Date_received'] != 'null')][['Date_received', 'Date']].groupby(['Date_received'], as_index=False).count()
buybydate.columns = ['Date_received','count']
buybydate.head()

[image]

Visualize the two series:

import matplotlib.pyplot as plt
import pandas as pd

plt.figure(figsize = (12,8))
date_received_dt = pd.to_datetime(date_received, format='%Y%m%d')

plt.subplot(211)
plt.bar(date_received_dt, couponbydate['count'], label = 'number of coupon received' )
plt.bar(date_received_dt, buybydate['count'], label = 'number of coupon used')
plt.yscale('log')
plt.ylabel('Count')
plt.legend()

plt.subplot(212)
plt.bar(date_received_dt, buybydate['count']/couponbydate['count'])
plt.ylabel('Ratio(coupon used/coupon received)')
plt.tight_layout()

[image]
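The plot divides buybydate['count'] by couponbydate['count'], which lines up correctly only if every receipt date has at least one used coupon. A minimal sketch of a more defensive variant, run on made-up rows, computes both counts in one groupby so that they always share the same date index. The column names received, used and ratio are hypothetical.

```python
import pandas as pd

# Made-up rows; the last receipt date has no used coupon at all.
df = pd.DataFrame({
    'Date_received': ['20160101', '20160101', '20160102', 'null', '20160102', '20160103'],
    'Date': ['20160105', 'null', 'null', '20160103', '20160110', 'null'],
})

coupons = df[df['Date_received'] != 'null']
daily = coupons.groupby('Date_received').agg(
    received=('Date', 'size'),                      # coupons handed out that day
    used=('Date', lambda s: (s != 'null').sum()),   # of those, later redeemed
)
daily['ratio'] = daily['used'] / daily['received']
print(daily)
```

A date with zero redemptions then shows up with a ratio of 0 instead of silently dropping out of one series.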

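Feature Extraction

The plots above show the volume for each individual day. People tend to go shopping more on weekends, when coupons are also more likely to be used, so we now build new features based on the day of the week.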

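from datetime import date

def getWeekday(row):
    """Return the weekday (1 = Monday ... 7 = Sunday) of a 'YYYYMMDD' string; pass 'null' through."""
    if row == 'null':
        return row
    else:
        return date(int(row[0:4]), int(row[4:6]), int(row[6:8])).weekday() + 1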

dfoff['weekday'] = dfoff['Date_received'].astype(str).apply(getWeekday)
dftest['weekday'] = dftest['Date_received'].astype(str).apply(getWeekday)

# weekday_type: 1 for Saturday and Sunday, 0 for working days
dfoff['weekday_type'] = dfoff['weekday'].apply(lambda x : 1 if x in [6,7] else 0 )
dftest['weekday_type'] = dftest['weekday'].apply(lambda x : 1 if x in [6,7] else 0 )

# change weekday to one-hot encoding 
weekdaycols = ['weekday_' + str(i) for i in range(1,8)]
print(weekdaycols)

tmpdf = pd.get_dummies(dfoff['weekday'].replace('null', np.nan))
tmpdf.columns = weekdaycols
dfoff[weekdaycols] = tmpdf

tmpdf = pd.get_dummies(dftest['weekday'].replace('null', np.nan))
tmpdf.columns = weekdaycols
dftest[weekdaycols] = tmpdf

The resulting tmpdf has the following form:
[image]
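Assigning tmpdf.columns = weekdaycols works only when all seven weekdays occur in the data; if one is missing, get_dummies returns fewer than seven columns and the assignment fails. The sketch below shows one possible way to guarantee seven columns by declaring the categories up front. It is an illustration on made-up values, not the author's code.

```python
import pandas as pd

# Made-up weekday column: integers 1..7 plus the string 'null'.
weekday = pd.Series([1, 7, 'null', 3], dtype=object)

# Map 'null' to None so it becomes a missing category, and fix the
# category list to 1..7 so unused weekdays still get a column.
values = [w if w != 'null' else None for w in weekday]
cat = pd.Categorical(values, categories=list(range(1, 8)))

onehot = pd.get_dummies(pd.Series(cat), prefix='weekday').astype(int)
print(onehot)
```

A 'null' row ends up as all zeros, and weekdays that never occur still appear as all-zero columns.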

Label each row from Date_received and Date, producing a numeric target:

def label(row):
    if row['Date_received'] == 'null':
        return -1
    if row['Date'] != 'null':
        td = pd.to_datetime(row['Date'], format='%Y%m%d') -  pd.to_datetime(row['Date_received'], format='%Y%m%d')
        if td <= pd.Timedelta(15, 'D'):
            return 1
    return 0
dfoff['label'] = dfoff.apply(label, axis = 1)

If Date_received == 'null', then y = -1; if Date != 'null' and Date - Date_received <= 15 days, then y = 1; otherwise y = 0.

After this step, the label column holds these values as 0, 1 and -1.
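Calling pd.to_datetime inside a row-wise apply is slow on a frame with well over a million rows. The following sketch applies the same labeling rule with whole-column operations. It is a possible alternative under the assumption that dates are 'YYYYMMDD' strings or 'null', shown on made-up rows.

```python
import pandas as pd

# Made-up rows covering each branch of the rule, including the 15-day boundary.
df = pd.DataFrame({
    'Date_received': ['20160101', '20160101', '20160101', 'null', '20160101'],
    'Date': ['20160110', '20160201', 'null', '20160105', '20160116'],
})

# 'null' cannot be parsed, so errors='coerce' turns it into NaT.
received = pd.to_datetime(df['Date_received'], format='%Y%m%d', errors='coerce')
bought = pd.to_datetime(df['Date'], format='%Y%m%d', errors='coerce')

label = pd.Series(0, index=df.index)
# Comparisons involving NaT are False, so unused coupons stay at 0.
label.loc[(bought - received) <= pd.Timedelta(days=15)] = 1
# No coupon received: not a sample for this task.
label.loc[received.isna()] = -1
print(label.tolist())
```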

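Model Training

Before fitting a model, we split the data. Coupons received from 20160101 to 20160515 form the training set (train), and coupons received from 20160516 to 20160615 form the validation set (valid).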

df = dfoff[dfoff['label'] != -1].copy()
train = df[(df['Date_received'] < '20160516')].copy()
valid = df[(df['Date_received'] >= '20160516') & (df['Date_received'] <= '20160615')].copy()
print(train['label'].value_counts())
print(valid['label'].value_counts())

[image]

We use the linear model SGDClassifier for prediction.

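import os
import pickle

from sklearn.linear_model import SGDClassifier
from sklearn.metrics import auc, roc_curve  # used in the evaluation step
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# original_feature is not defined anywhere in this post; it is presumably the list of
# feature columns built above (discount, distance and weekday features).
predictors = original_feature
print(predictors)

def check_model(data, predictors):
    
    classifier = lambda: SGDClassifier(
        loss='log',  # named 'log_loss' in scikit-learn 1.1 and later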
        penalty='elasticnet', 
        fit_intercept=True, 
        max_iter=100, 
        shuffle=True, 
        n_jobs=1,
        class_weight=None)

    model = Pipeline(steps=[
        ('ss', StandardScaler()),
        ('en', classifier())
    ])

    parameters = {
        'en__alpha': [ 0.001, 0.01, 0.1],
        'en__l1_ratio': [ 0.001, 0.01, 0.1]
    }

    folder = StratifiedKFold(n_splits=3, shuffle=True)
    
    grid_search = GridSearchCV(
        model, 
        parameters, 
        cv=folder, 
        n_jobs=-1, 
        verbose=1)
    grid_search = grid_search.fit(data[predictors], 
                                  data['label'])
    
    return grid_search

if not os.path.isfile('1_model.pkl'):
    model = check_model(train, predictors)
    print(model.best_score_)
    print(model.best_params_)
    with open('1_model.pkl', 'wb') as f:
        pickle.dump(model, f)
else:
    with open('1_model.pkl', 'rb') as f:
        model = pickle.load(f)
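The grid search above uses the default scoring, which for a classifier is accuracy. With labels as imbalanced as these, the parameters with the best accuracy are not necessarily the ones with the best AUC. The sketch below shows how the same pipeline could be tuned with scoring='roc_auc' instead. It is an assumption-laden illustration: it runs on synthetic data from make_classification rather than on train[predictors], it uses a smaller grid, and it scores a global AUC rather than the per-coupon average used for evaluation.

```python
import re

import sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# The logistic loss is called 'log_loss' from scikit-learn 1.1 on, 'log' before.
major, minor = (int(p) for p in re.match(r'(\d+)\.(\d+)', sklearn.__version__).groups())
log_loss_name = 'log_loss' if (major, minor) >= (1, 1) else 'log'

# Synthetic, imbalanced stand-in for train[predictors] and train['label'].
X, y = make_classification(n_samples=600, n_features=8,
                           weights=[0.9, 0.1], random_state=0)

model = Pipeline(steps=[
    ('ss', StandardScaler()),
    ('en', SGDClassifier(loss=log_loss_name, penalty='elasticnet',
                         max_iter=100, random_state=0)),
])
parameters = {'en__alpha': [0.001, 0.01], 'en__l1_ratio': [0.01, 0.1]}
folder = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# scoring='roc_auc' makes the search rank candidates by AUC, not accuracy.
search = GridSearchCV(model, parameters, cv=folder, scoring='roc_auc', n_jobs=1)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```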

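Next, we compute the AUC of the predictions separately for each coupon and then average over all coupons. A coupon whose labels contain only one class is skipped, because AUC is undefined in that case.

Predict: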

y_valid_pred = model.predict_proba(valid[predictors])
valid1 = valid.copy()
valid1['pred_prob'] = y_valid_pred[:, 1]

Compute the average AUC:

vg = valid1.groupby(['Coupon_id'])
aucs = []
for i in vg:
    tmpdf = i[1] 
    if len(tmpdf['label'].unique()) != 2:
        continue
    fpr, tpr, thresholds = roc_curve(tmpdf['label'], tmpdf['pred_prob'], pos_label=1)
    aucs.append(auc(fpr, tpr))
print(np.average(aucs))

The result is 0.5348655160896371.
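To make the metric concrete: AUC equals the probability that a randomly chosen positive sample gets a higher score than a randomly chosen negative one, with ties counting as one half. The sketch below implements that definition in plain Python and averages it per coupon, skipping single-class coupons as described above. It is for illustration only; the helper names are hypothetical, and the pairwise loop is far too slow for the real data, where roc_curve and auc are the practical choice.

```python
from collections import defaultdict

def auc_score(labels, scores):
    """Pairwise AUC; returns None when only one class is present."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_auc_by_coupon(rows):
    """rows: iterable of (coupon_id, label, predicted_probability)."""
    groups = defaultdict(lambda: ([], []))
    for coupon_id, label, score in rows:
        groups[coupon_id][0].append(label)
        groups[coupon_id][1].append(score)
    aucs = [auc_score(labels, scores) for labels, scores in groups.values()]
    aucs = [a for a in aucs if a is not None]
    return sum(aucs) / len(aucs) if aucs else None

rows = [
    ('c1', 1, 0.9), ('c1', 0, 0.2), ('c1', 0, 0.95),  # one of two pairs ranked correctly
    ('c2', 1, 0.8), ('c2', 0, 0.1),                   # perfectly ranked
    ('c3', 0, 0.3), ('c3', 0, 0.4),                   # single class, skipped
]
print(mean_auc_by_coupon(rows))  # 0.75
```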

Predict on the test set and write the submission file:

y_test_pred = model.predict_proba(dftest[predictors])
dftest1 = dftest[['User_id','Coupon_id','Date_received']].copy()
dftest1['label'] = y_test_pred[:,1]
dftest1.to_csv('submit1.csv', index=False, header=False)

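We now have a first submission. The features used so far cover the coupon, the distance and the time. The prediction quality is poor (a validation AUC of about 0.53 is barely above the 0.5 of random guessing), so further feature engineering is needed to improve it.

Summary of the Approach

To summarize: first analyze the data, using plots to show its characteristics more directly; then extract features based on that analysis and use them to train the chosen model. For training, the data is split into a training set and a validation set; finally, the test set is fed to the trained model, and the predictions are written out as a .csv file ready for submission.

That is the general workflow of a data analysis task. For this problem there is still plenty of room in feature engineering and model selection to improve the prediction quality.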

Concepts Used

one-hot encoding
AUC

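Problems Encountered

As for my own learning, this write-up exposed the following 3 problems:

I do not know the data visualization code well enough, even though plots can suggest many ideas
I need a deeper understanding of each model's parameters; to be more than someone who just calls library functions, I have to grasp the principles behind parameter tuning
Feature engineering is the key to winning and takes continual practice and study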

Reference links:
https://tianchi.aliyun.com/no...
https://tianchi.aliyun.com/no...

Corrections and suggestions are welcome.

秋刀鱼