Human or Robot
This project is a Kaggle competition hosted by Facebook; see the data source below, and find the full code on my GitHub. Feel free to take a look!
Code
Data exploration: Data_Exploration.ipynb
Data preprocessing & feature engineering: Feature_Engineering.ipynb & Feature_Engineering2.ipynb
Model design and evaluation: Model_Design.ipynb
Project data source
Additional packages required
mlxtend: provides the Stacking ensemble method
The whole project takes roughly 60 minutes to run on Ubuntu with 8 GB RAM; results can be found in the submitted Jupyter notebook files.
Because of its length, the write-up is split into two articles covering four parts in total:
Data exploration
Data preprocessing and feature engineering
Model design
Evaluation and summary
Data Exploration
import numpy as np
import pandas as pd
%matplotlib inline
from IPython.display import display
df_bids = pd.read_csv('bids.csv', low_memory=False)
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')
df_bids.head()
bid_id | bidder_id | auction | merchandise | device | time | country | ip | url | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 8dac2b259fd1c6d1120e519fb1ac14fbqvax8 | ewmzr | jewelry | phone0 | 9759243157894736 | us | 69.166.231.58 | vasstdc27m7nks3 |
1 | 1 | 668d393e858e8126275433046bbd35c6tywop | aeqok | furniture | phone1 | 9759243157894736 | in | 50.201.125.84 | jmqlhflrzwuay9c |
2 | 2 | aa5f360084278b35d746fa6af3a7a1a5ra3xe | wa00e | home goods | phone2 | 9759243157894736 | py | 112.54.208.157 | vasstdc27m7nks3 |
3 | 3 | 3939ac3ef7d472a59a9c5f893dd3e39fh9ofi | jefix | jewelry | phone4 | 9759243157894736 | in | 18.99.175.133 | vasstdc27m7nks3 |
4 | 4 | 8393c48eaf4b8fa96886edc7cf27b372dsibi | jefix | jewelry | phone5 | 9759243157894736 | in | 145.138.5.37 | vasstdc27m7nks3 |
df_train.head()
# df_train.dtypes
bidder_id | payment_account | address | outcome | |
---|---|---|---|---|
0 | 91a3c57b13234af24875c56fb7e2b2f4rb56a | a3d2de7675556553a5f08e4c88d2c228754av | a3d2de7675556553a5f08e4c88d2c228vt0u4 | 0.0 |
1 | 624f258b49e77713fc34034560f93fb3hu3jo | a3d2de7675556553a5f08e4c88d2c228v1sga | ae87054e5a97a8f840a3991d12611fdcrfbq3 | 0.0 |
2 | 1c5f4fc669099bfbfac515cd26997bd12ruaj | a3d2de7675556553a5f08e4c88d2c2280cybl | 92520288b50f03907041887884ba49c0cl0pd | 0.0 |
3 | 4bee9aba2abda51bf43d639013d6efe12iycd | 51d80e233f7b6a7dfdee484a3c120f3b2ita8 | 4cb9717c8ad7e88a9a284989dd79b98dbevyi | 0.0 |
4 | 4ab12bc61c82ddd9c2d65e60555808acqgos1 | a3d2de7675556553a5f08e4c88d2c22857ddh | 2a96c3ce94b3be921e0296097b88b56a7x1ji | 0.0 |
Detecting anomalous data
# Check each table for missing values
print('Is there any missing value in bids?', df_bids.isnull().any().any())
print('Is there any missing value in train?', df_train.isnull().any().any())
print('Is there any missing value in test?', df_test.isnull().any().any())
Is there any missing value in bids? True
Is there any missing value in train? False
Is there any missing value in test? False
Checking all three datasets shows no missing data in the train and test user sets, while the bids behaviour set does contain missing values; we now dig into bids to locate them.
# nan_rows = df_bids[df_bids.isnull().any(axis=1)]
# print(nan_rows)
pd.isnull(df_bids).any()
bid_id False
bidder_id False
auction False
merchandise False
device False
time False
country True
ip False
url False
dtype: bool
missing_country = df_bids['country'].isnull().sum()
print('No. of missing country: ', missing_country)
normal_country = df_bids['country'].notnull().sum()
print('No. of normal country: ', normal_country)
No. of missing country: 8859
No. of normal country: 7647475
import matplotlib.pyplot as plt
labels = ['unknown', 'normal']
sizes = [missing_country, normal_country]
explode = (0.1, 0)
fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90)
ax1.axis('equal')
plt.title('Distribution of missing countries vs. normal countries')
plt.show()
The analysis above shows that a small fraction of the bid records have no country value. These can be filled during preprocessing, with two candidate strategies:
Group the raw bid data by bidder and fill each bidder's missing values with the country that bidder most frequently bids from.
Group the raw bid data by bidder, sort by time, and forward- or backward-fill the missing values from the neighbouring country records within each group.
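The first strategy can be sketched as a per-bidder modal fill; a toy frame with hypothetical values stands in for bids.csv here:

```python
import pandas as pd

# Toy stand-in for the bids table (hypothetical values)
bids = pd.DataFrame({
    'bidder_id': ['a', 'a', 'a', 'b', 'b'],
    'time':      [1, 2, 3, 1, 2],
    'country':   ['us', None, 'us', None, 'in'],
})

# Strategy 1: fill each bidder's missing country with the country
# that bidder bids from most often (the per-group mode).
def fill_modal_country(s):
    # a bidder with no country at all keeps NaN
    return s.fillna(s.mode().iloc[0]) if s.notnull().any() else s

bids['country'] = bids.groupby('bidder_id')['country'].transform(fill_modal_country)
print(bids['country'].tolist())  # ['us', 'us', 'us', 'in', 'in']
```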
# Number of records in each table
# Check whether each table's id column uniquely identifies its records
print(df_bids.shape[0])
print(len(df_bids['bid_id'].unique()))
print(df_train.shape[0])
print(len(df_train['bidder_id'].unique()))
print(df_test.shape[0])
print(len(df_test['bidder_id'].unique()))
7656334
7656334
2013
2013
4700
4700
# Count the distinct values of each basic (categorical) feature, excluding time
print('total bidder in bids: ', len(df_bids['bidder_id'].unique()))
print('total auction in bids: ', len(df_bids['auction'].unique()))
print('total merchandise in bids: ', len(df_bids['merchandise'].unique()))
print('total device in bids: ', len(df_bids['device'].unique()))
print('total country in bids: ', len(df_bids['country'].unique()))
print('total ip in bids: ', len(df_bids['ip'].unique()))
print('total url in bids: ', len(df_bids['url'].unique()))
total bidder in bids: 6614
total auction in bids: 15051
total merchandise in bids: 10
total device in bids: 7351
total country in bids: 200
total ip in bids: 2303991
total url in bids: 1786351
From these basic counts we can see:
The number of bidders in bids is smaller than in train + test combined, so the mapping is not one-to-one; next we verify whether every bidder in bids comes from the train or test set.
Merchandise and country have relatively few distinct values, so they are natural categorical features to extract; the remaining features are better suited to count statistics.
lst_all_users = (df_train['bidder_id'].unique()).tolist() + (df_test['bidder_id'].unique()).tolist()
print('total bidders of train and test set', len(lst_all_users))
lst_bidder = (df_bids['bidder_id'].unique()).tolist()
print('total bidders in bids set', len(lst_bidder))
print('Is bidders in bids are all from train+test set? ', set(lst_bidder).issubset(set(lst_all_users)))
total bidders of train and test set 6713
total bidders in bids set 6614
Is bidders in bids are all from train+test set? True
lst_nobids = [i for i in lst_all_users if i not in lst_bidder]
print('No. of bidders never bid: ', len(lst_nobids))
lst_nobids_train = [i for i in lst_nobids if i in (df_train['bidder_id'].unique()).tolist()]
lst_nobids_test = [i for i in lst_nobids if i in (df_test['bidder_id'].unique()).tolist()]
print('No. of bidders never bid in train set: ', len(lst_nobids_train))
print('No. of bidders never bid in test set: ', len(lst_nobids_test))
No. of bidders never bid: 99
No. of bidders never bid in train set: 29
No. of bidders never bid in test set: 70
data_source = ['train', 'test']
y_pos = np.arange(len(data_source))
num_never_bids = [len(lst_nobids_train), len(lst_nobids_test)]
plt.bar(y_pos, num_never_bids, align='center', alpha=0.5)
plt.xticks(y_pos, data_source)
plt.ylabel('bidders no bids')
plt.title('Source of no bids bidders')
plt.show()
print(df_train[(df_train['bidder_id'].isin(lst_nobids_train)) & (df_train['outcome']==1.0)])
Empty DataFrame
Columns: [bidder_id, payment_account, address, outcome]
Index: []
The computation above shows that 99 bidders have no bid records: 29 from the train set and 70 from the test set. None of the 29 train-set bidders is labelled as a robot, so the 70 test-set bidders can later be labelled human, or be assigned the mean predicted value.
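Filling the no-bid test bidders with the mean prediction can be sketched as follows; bidder ids and scores here are hypothetical:

```python
import pandas as pd

# Hypothetical model predictions for test bidders that do have bid records
preds = pd.DataFrame({
    'bidder_id': ['u1', 'u2', 'u3'],
    'prediction': [0.9, 0.1, 0.2],
})

# Hypothetical stand-ins for the 70 no-bid test bidders found above
no_bid_test = ['u4', 'u5']

# With no behaviour to score, assign these bidders the mean prediction
mean_pred = preds['prediction'].mean()
filler = pd.DataFrame({'bidder_id': no_bid_test, 'prediction': mean_pred})
submission = pd.concat([preds, filler], ignore_index=True)
print(submission)
```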
# check the proportion of bots in train
print((df_train[df_train['outcome'] == 1].shape[0] * 1.0) / df_train.shape[0] * 100, '%')
5.11674118231 %
Users labelled as robots account for roughly 5% of the training set.
df_train.groupby('outcome').size().plot(labels=['Human', 'Robot'], kind='pie', autopct='%.2f', figsize=(4, 4),
title='Distribution of Human vs. Robots', legend=True)
<matplotlib.axes._subplots.AxesSubplot at 0x7f477135c5d0>
The label distribution above shows a clear class imbalance in this dataset, so we will use AUC, which is insensitive to the class ratio, as the evaluation metric, and favour Gradient Boosting family models for training.
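As a quick sanity check on this choice, a minimal sketch with synthetic data (standing in for the real features) of scoring a Gradient Boosting model on a ~5%-positive problem with cross-validated AUC:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced data (~5% positives) standing in for the real features
rng = np.random.RandomState(0)
X = rng.randn(400, 5)
y = (rng.rand(400) < 0.05).astype(int)
X[y == 1] += 1.5  # give the minority class some signal

# AUC is computed from the ranking of predicted scores, so the ~19:1
# class ratio does not inflate it the way accuracy would.
clf = GradientBoostingClassifier(random_state=0)
scores = cross_val_score(clf, X, y, cv=5, scoring='roc_auc')
print(round(scores.mean(), 3))
```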
Data Preprocessing and Feature Engineering
import numpy as np
import pandas as pd
import pickle
%matplotlib inline
from IPython.display import display
bids = pd.read_csv('bids.csv')
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
Handling missing data
For the missing country values found during exploration, we group the raw bid data by bidder, sort by time, and forward/backward fill each group's missing values from the neighbouring country records.
display(bids.head())
bid_id | bidder_id | auction | merchandise | device | time | country | ip | url | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 8dac2b259fd1c6d1120e519fb1ac14fbqvax8 | ewmzr | jewelry | phone0 | 9759243157894736 | us | 69.166.231.58 | vasstdc27m7nks3 |
1 | 1 | 668d393e858e8126275433046bbd35c6tywop | aeqok | furniture | phone1 | 9759243157894736 | in | 50.201.125.84 | jmqlhflrzwuay9c |
2 | 2 | aa5f360084278b35d746fa6af3a7a1a5ra3xe | wa00e | home goods | phone2 | 9759243157894736 | py | 112.54.208.157 | vasstdc27m7nks3 |
3 | 3 | 3939ac3ef7d472a59a9c5f893dd3e39fh9ofi | jefix | jewelry | phone4 | 9759243157894736 | in | 18.99.175.133 | vasstdc27m7nks3 |
4 | 4 | 8393c48eaf4b8fa96886edc7cf27b372dsibi | jefix | jewelry | phone5 | 9759243157894736 | in | 145.138.5.37 | vasstdc27m7nks3 |
# pd.algos.is_monotonic_int64(bids.time.values, True)[0]
print('Is the time monotonically non-decreasing? ', pd.Index(bids['time']).is_monotonic)
Is the time monotonically non-decreasing? False
# bidder_group = bids.sort_values(['bidder_id', 'time']).groupby('bidder_id')
bids['country'] = bids.sort_values(['bidder_id', 'time']).groupby('bidder_id')['country'].ffill()
bids['country'] = bids.sort_values(['bidder_id', 'time']).groupby('bidder_id')['country'].bfill()
display(bids.head())
bid_id | bidder_id | auction | merchandise | device | time | country | ip | url | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 8dac2b259fd1c6d1120e519fb1ac14fbqvax8 | ewmzr | jewelry | phone0 | 9759243157894736 | us | 69.166.231.58 | vasstdc27m7nks3 |
1 | 1 | 668d393e858e8126275433046bbd35c6tywop | aeqok | furniture | phone1 | 9759243157894736 | in | 50.201.125.84 | jmqlhflrzwuay9c |
2 | 2 | aa5f360084278b35d746fa6af3a7a1a5ra3xe | wa00e | home goods | phone2 | 9759243157894736 | py | 112.54.208.157 | vasstdc27m7nks3 |
3 | 3 | 3939ac3ef7d472a59a9c5f893dd3e39fh9ofi | jefix | jewelry | phone4 | 9759243157894736 | in | 18.99.175.133 | vasstdc27m7nks3 |
4 | 4 | 8393c48eaf4b8fa96886edc7cf27b372dsibi | jefix | jewelry | phone5 | 9759243157894736 | in | 145.138.5.37 | vasstdc27m7nks3 |
print('Is there any missing value in bids?', bids.isnull().any().any())
# pickle.dump(bids, open('bids.pkl', 'wb'))
Is there any missing value in bids? True
missing_country = bids['country'].isnull().sum()
print('No. of missing country: ', missing_country)
normal_country = bids['country'].notnull().sum()
print('No. of normal country: ', normal_country)
No. of missing country: 5
No. of normal country: 7656329
nan_rows = bids[bids.isnull().any(axis=1)]
print(nan_rows)
bid_id bidder_id auction \
1351177 1351177 f3ab8c9ecc0d021ebc81e89f20c8267bn812w jefix
2754184 2754184 88ef9cfdbec4c9e33f6c2e0b512e7a01dp2p2 cc5fs
2836631 2836631 29b8af2fea3881ef61911612372dac41vczqv jqx39
3125892 3125892 df20f216cbb0b0df5a7b2e94b16a7853iyw9g jqx39
5153748 5153748 5e05ec450e2dd64d7996a08bbbca4f126nzzk jqx39
merchandise device time country \
1351177 office equipment phone84 9767200789473684 NaN
2754184 mobile phone150 9633363947368421 NaN
2836631 jewelry phone72 9634034894736842 NaN
3125892 books and music phone106 9635755105263157 NaN
5153748 mobile phone267 9645270210526315 NaN
ip url
1351177 80.211.119.111 g9pgdfci3yseml5
2754184 20.67.240.88 ctivbfq55rktail
2836631 149.210.107.205 vasstdc27m7nks3
3125892 26.23.62.59 ac9xlqtfg0cx5c5
5153748 145.7.194.40 0em0vg1f0zuxonw
# print(bids[bids['bid_id']==1351177])
nan_bidder = nan_rows['bidder_id'].values.tolist()
# print(nan_bidder)
print(bids[bids['bidder_id'].isin(nan_bidder)])
bid_id bidder_id auction \
1351177 1351177 f3ab8c9ecc0d021ebc81e89f20c8267bn812w jefix
2754184 2754184 88ef9cfdbec4c9e33f6c2e0b512e7a01dp2p2 cc5fs
2836631 2836631 29b8af2fea3881ef61911612372dac41vczqv jqx39
3125892 3125892 df20f216cbb0b0df5a7b2e94b16a7853iyw9g jqx39
5153748 5153748 5e05ec450e2dd64d7996a08bbbca4f126nzzk jqx39
merchandise device time country \
1351177 office equipment phone84 9767200789473684 NaN
2754184 mobile phone150 9633363947368421 NaN
2836631 jewelry phone72 9634034894736842 NaN
3125892 books and music phone106 9635755105263157 NaN
5153748 mobile phone267 9645270210526315 NaN
ip url
1351177 80.211.119.111 g9pgdfci3yseml5
2754184 20.67.240.88 ctivbfq55rktail
2836631 149.210.107.205 vasstdc27m7nks3
3125892 26.23.62.59 ac9xlqtfg0cx5c5
5153748 145.7.194.40 0em0vg1f0zuxonw
Even after the per-bidder, time-ordered forward and backward fill, 5 records still lack a country. It turns out these 5 bidders each placed only a single bid; let's look at what else characterises them.
lst_nan_train = [i for i in nan_bidder if i in (train['bidder_id'].unique()).tolist()]
lst_nan_test = [i for i in nan_bidder if i in (test['bidder_id'].unique()).tolist()]
print('No. of bidders 1 bid in train set: ', len(lst_nan_train))
print('No. of bidders 1 bid in test set: ', len(lst_nan_test))
No. of bidders 1 bid in train set: 1
No. of bidders 1 bid in test set: 4
print(train[train['bidder_id']==lst_nan_train[0]]['outcome'])
546 0.0
Name: outcome, dtype: float64
Since these 5 bidders each have only one bid, with 1 from the train set (labelled human) and 4 from the test set, and so little behaviour to learn from, we drop their bid records. The 4 test-set bidders will later be treated like the no-bid bidders: their predictions are filled with the model's mean predicted value.
bid_to_drop = nan_rows.index.values.tolist()
# print(bid_to_drop)
# bid_to_drop holds index labels, so drop by label rather than position
bids.drop(index=bid_to_drop, inplace=True)
print('Is there any missing value in bids?', bids.isnull().any().any())
pickle.dump(bids, open('bids.pkl', 'wb'))
Is there any missing value in bids? False
Basic count features
Since, per the earlier exploration, the dataset consists mostly of categorical or discrete data, we first group the bid data by bidder and count each attribute: number of devices used, countries bid from, distinct IPs, and so on.
# group by bidder to do some statistics
bidders = bids.groupby('bidder_id')
# pickle.dump(bids, open('bidders.pkl', 'w'))
# print(bidders['device'].count())
def feature_count(group):
dct_cnt = {}
dct_cnt['devices_c'] = group['device'].unique().shape[0]
dct_cnt['countries_c'] = group['country'].unique().shape[0]
dct_cnt['ip_c'] = group['ip'].unique().shape[0]
dct_cnt['url_c'] = group['url'].unique().shape[0]
dct_cnt['auction_c'] = group['auction'].unique().shape[0]
dct_cnt['auc_mean'] = np.mean(group['auction'].value_counts()) # bids_c/auction_c
# dct_cnt['dev_mean'] = np.mean(group['device'].value_counts()) # bids_c/devices_c
dct_cnt['merch_c'] = group['merchandise'].unique().shape[0]
dct_cnt['bids_c'] = group.shape[0]
dct_cnt = pd.Series(dct_cnt)
return dct_cnt
cnt_bidder = bidders.apply(feature_count)
display(cnt_bidder.describe())
# cnt_bidder.to_csv('cnt_bidder.csv')
# print cnt_bidder[cnt_bidder['merch_c']==2]
auc_mean | auction_c | bids_c | countries_c | devices_c | ip_c | merch_c | url_c | |
---|---|---|---|---|---|---|---|---|
count | 6609.000000 | 6609.000000 | 6609.000000 | 6609.000000 | 6609.000000 | 6609.000000 | 6609.000000 | 6609.000000 |
mean | 6.593493 | 57.850810 | 1158.470117 | 12.733848 | 73.492359 | 544.507187 | 1.000151 | 290.964140 |
std | 30.009242 | 131.814053 | 9596.595169 | 22.556570 | 172.171106 | 3370.730666 | 0.012301 | 2225.912425 |
min | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
25% | 1.000000 | 2.000000 | 3.000000 | 1.000000 | 2.000000 | 2.000000 | 1.000000 | 1.000000 |
50% | 1.677419 | 10.000000 | 18.000000 | 3.000000 | 8.000000 | 12.000000 | 1.000000 | 5.000000 |
75% | 4.142857 | 47.000000 | 187.000000 | 12.000000 | 57.000000 | 111.000000 | 1.000000 | 36.000000 |
max | 1327.366667 | 1726.000000 | 515033.000000 | 178.000000 | 2618.000000 | 111918.000000 | 2.000000 | 81376.000000 |
特征相关性
在对竞标行为数据按照用户分组后,对数据集中的每一个产品特征构建一个散布矩阵(scatter matrix),来看看各特征之间的相关性
# Build a scatter matrix for each pair of features
# (pd.scatter_matrix in older pandas versions)
pd.plotting.scatter_matrix(cnt_bidder, alpha = 0.3, figsize = (16,10), diagonal = 'kde');
From the per-bidder statistics above (timestamps not yet considered), we can draw some basic conclusions:
Comparing the maxima with the medians and 75th percentiles, every feature except the merchandise count shows some extreme values, which may be worth examining as anomalous behaviour.
The features are heavily skewed, so we take logs and re-plot the scatter matrix to re-check the correlations.
The merchandise count has tiny variance: by the median and even the 75th percentile, most bidders bid within a single merchandise category. Since exploration already suggested treating merchandise as a categorical feature, we will count its categories separately and drop this column from the count features.
cnt_bidder.drop('merch_c', axis=1, inplace=True)
cnt_bidder = np.log(cnt_bidder)
pd.plotting.scatter_matrix(cnt_bidder, alpha = 0.3, figsize = (16,10), diagonal = 'kde');
The scatter matrix shows no strong correlation between the behavioural features. IP count, bid count, and device count show slight positive correlation after the log transform, but it is weak and only appears post-transform, so all three features are kept.
To examine the extreme behaviour noted above, we pick a few robot and human samples from the original train set and compare their per-bidder statistics.
cnt_bidder.to_csv('cnt_bidder.csv')
# trace samples: first 2 bots, last 2 humans
indices = ['9434778d2268f1fa2a8ede48c0cd05c097zey','aabc211b4cf4d29e4ac7e7e361371622pockb',
'd878560888b11447e73324a6e263fbd5iydo1','91a3c57b13234af24875c56fb7e2b2f4rb56a']
# build a DataFrame for the chosen indices
samples = pd.DataFrame(cnt_bidder.loc[indices], columns = cnt_bidder.keys()).reset_index(drop = True)
print("Chosen samples of training dataset: (first 2 bots, last 2 humans)")
display(samples)
Chosen samples of training dataset: (first 2 bots, last 2 humans)
auc_mean | auction_c | bids_c | countries_c | devices_c | ip_c | url_c | |
---|---|---|---|---|---|---|---|
0 | 3.190981 | 5.594711 | 8.785692 | 4.174387 | 6.011267 | 8.147578 | 7.557995 |
1 | 2.780432 | 4.844187 | 7.624619 | 2.639057 | 3.178054 | 5.880533 | 1.609438 |
2 | 0.287682 | 1.098612 | 1.386294 | 1.098612 | 1.386294 | 1.386294 | 0.000000 |
3 | 0.287682 | 2.890372 | 3.178054 | 1.791759 | 2.639057 | 2.995732 | 0.000000 |
We use seaborn to visualise the four samples above as a heatmap of their percentile ranks.
import matplotlib.pyplot as plt
import seaborn as sns
# look at percentile ranks
pcts = 100. * cnt_bidder.rank(axis=0, pct=True).loc[indices].round(decimals=3)
print(pcts)
# visualize percentiles with heatmap
sns.heatmap(pcts, yticklabels=['robot 1', 'robot 2', 'human 1', 'human 2'], annot=True, linewidth=.1, vmax=99,
fmt='.1f', cmap='YlGnBu')
plt.title('Percentile ranks of\nsamples\' feature statistics')
plt.xticks(rotation=45, ha='center');
auc_mean auction_c bids_c \
bidder_id
9434778d2268f1fa2a8ede48c0cd05c097zey 94.9 94.6 97.0
aabc211b4cf4d29e4ac7e7e361371622pockb 92.4 87.2 92.3
d878560888b11447e73324a6e263fbd5iydo1 39.8 30.4 30.2
91a3c57b13234af24875c56fb7e2b2f4rb56a 39.8 60.2 53.0
countries_c devices_c ip_c url_c
bidder_id
9434778d2268f1fa2a8ede48c0cd05c097zey 95.4 95.6 96.7 97.4
aabc211b4cf4d29e4ac7e7e361371622pockb 77.3 63.8 84.8 50.3
d878560888b11447e73324a6e263fbd5iydo1 48.8 38.7 34.2 13.4
91a3c57b13234af24875c56fb7e2b2f4rb56a 63.7 56.8 56.2 13.4
The heatmap comparison shows that the robots rank higher than the human users on every statistic, so we can design a baseline model built on simple statistical rules. The most telling feature should be auc_mean, a user's average number of bids per auction. We start by applying standard outlier detection to the basic statistics above.
Designing a naive classifier
Since the ultimate goal is to find the robots among the bidders, and common sense says a robot's bidding actions are far more frequent than a human's, we can design a naive classifier from an outlier-detection angle: for each of the per-bidder count features computed earlier, build a list of suspect (outlying) users, then combine the per-feature lists and treat users who are outliers on multiple features as robots.
# find the outliers for each feature
lst_outlier = []
for feature in cnt_bidder.keys():
    # 25th percentile
    Q1 = np.percentile(cnt_bidder[feature], 25)
    # 75th percentile
    Q3 = np.percentile(cnt_bidder[feature], 75)
    step = 1.5 * (Q3 - Q1)
    # show outliers
    # print("Data points considered outliers for the feature '{}':".format(feature))
    display(cnt_bidder[~((cnt_bidder[feature] >= Q1 - step) & (cnt_bidder[feature] <= Q3 + step))])
    lst_outlier += cnt_bidder[~((cnt_bidder[feature] >= Q1 - step) & (cnt_bidder[feature] <= Q3 + step))].index.values.tolist()
Having collected every user id that looks like an outlier on some feature, we count how often each id appears and look for users that are outliers on several features. After some testing, and given that bots make up under 5% of the train set, we settle on "outlier on at least 1 feature" as the anomaly criterion. Intersecting this set with the train users gives a user subset the naive classifier can act on.
# print(len(lst_outlier))
from collections import Counter
freq_outlier = dict(Counter(lst_outlier))
perhaps_outlier = [i for i in freq_outlier if freq_outlier[i] >= 1]
print(len(perhaps_outlier))
214
# basic_pred = test[test['bidder_id'].isin(perhaps_outlier)]['bidder_id'].tolist()
train_pred = train[train['bidder_id'].isin(perhaps_outlier)]['bidder_id'].tolist()
print(len(train_pred))
76
Choosing an evaluation metric
Exploration showed a negative-to-positive ratio of roughly 19:1, which is fairly imbalanced, so we use AUC, which is unaffected by the class ratio, as the metric. We now score the naive classifier on the original training set to obtain a baseline.
from sklearn.metrics import roc_auc_score
y_true = train['outcome']
naive_pred = pd.DataFrame(columns=['bidder_id', 'prediction'])
naive_pred['bidder_id'] = train['bidder_id']
naive_pred['prediction'] = np.where(naive_pred['bidder_id'].isin(train_pred), 1.0, 0.0)
basic_pred = naive_pred['prediction']
print(roc_auc_score(y_true, basic_pred))
0.54661464952
After the basic count statistics above, the timestamp, the only non-categorical feature, is still unprocessed. The earlier exploration also suggested splitting the single merchandise and country features into several per-category statistics, and the analysis above suggested further statistics grouped by user and auction. The next steps are therefore:
Process the timestamps
Split merchandise and country into per-category statistics
Group by user and auction for finer-grained statistics
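The second step, expanding merchandise (and likewise country) into one count column per category, can be sketched with pd.crosstab; the toy frame below uses hypothetical values:

```python
import pandas as pd

# Toy stand-in for the bids table (hypothetical values)
bids = pd.DataFrame({
    'bidder_id':   ['a', 'a', 'b', 'b', 'b'],
    'merchandise': ['jewelry', 'jewelry', 'mobile', 'jewelry', 'mobile'],
})

# One column per merchandise category, counting each bidder's bids in it;
# the same pattern applies to the country column.
merch_counts = pd.crosstab(bids['bidder_id'], bids['merchandise'])
print(merch_counts)
```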
Processing the timestamps
The idea is to analyse the gaps between bids: within each auction, measure the interval between each bid and the previous bid (by any user), then for each bidder aggregate their intervals into:
the mean interval
the maximum interval
the minimum interval
from collections import defaultdict
def generate_timediff():
    bids_grouped = bids.groupby('auction')
    bds = defaultdict(list)
    for _, bids_auc in bids_grouped:
        # reset per auction so intervals never cross auction boundaries
        last_row = None
        # sort by time within the auction: the raw order is not monotonic
        for i, row in bids_auc.sort_values('time').iterrows():
            if last_row is not None:
                bds[row['bidder_id']].append(row['time'] - last_row['time'])
            last_row = row
    df = []
    for key in bds.keys():
        df.append({'bidder_id': key, 'mean': np.mean(bds[key]),
                   'min': np.min(bds[key]), 'max': np.max(bds[key])})
    pd.DataFrame(df).to_csv('tdiff.csv', index=False)
generate_timediff()
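In the next part, the generated tdiff.csv can be merged back onto the per-bidder count features; a minimal sketch, with toy frames (hypothetical values) standing in for cnt_bidder.csv and tdiff.csv:

```python
import pandas as pd

# Toy stand-ins for cnt_bidder.csv and tdiff.csv (hypothetical values)
cnt_bidder = pd.DataFrame({'bidder_id': ['a', 'b'], 'bids_c': [10, 3]})
tdiff = pd.DataFrame({'bidder_id': ['a'],
                      'mean': [5.0], 'min': [1.0], 'max': [9.0]})

# Left merge keeps bidders with no time-diff record; their NaN stats
# are then imputed with the column medians before modelling.
features = cnt_bidder.merge(tdiff, on='bidder_id', how='left')
cols = ['mean', 'min', 'max']
features[cols] = features[cols].fillna(features[cols].median())
print(features)
```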
Since this article has exceeded the length limit, the remainder continues in Using Machine Learning to Identify Cheating Robot Bidders in Auctions (Part 2).