
Human or Robot

This project is based on a Kaggle competition hosted by Facebook; the competition page is linked under the data source below, and the complete code is on my GitHub. Feel free to take a look.

Code

  • Data exploration — Data_Exploration.ipynb

  • Data preprocessing & feature engineering — Feature_Engineering.ipynb & Feature_Engineering2.ipynb

  • Model design and evaluation — Model_Design.ipynb

Project data source

Additional packages required


Since the write-up is quite long, it is split into two posts covering four parts in total:

  • Data exploration

  • Data preprocessing and feature engineering

  • Model design

  • Evaluation and summary

Data Exploration

import numpy as np
import pandas as pd
%matplotlib inline
from IPython.display import display
df_bids = pd.read_csv('bids.csv', low_memory=False)
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')
df_bids.head()
bid_id bidder_id auction merchandise device time country ip url
0 0 8dac2b259fd1c6d1120e519fb1ac14fbqvax8 ewmzr jewelry phone0 9759243157894736 us 69.166.231.58 vasstdc27m7nks3
1 1 668d393e858e8126275433046bbd35c6tywop aeqok furniture phone1 9759243157894736 in 50.201.125.84 jmqlhflrzwuay9c
2 2 aa5f360084278b35d746fa6af3a7a1a5ra3xe wa00e home goods phone2 9759243157894736 py 112.54.208.157 vasstdc27m7nks3
3 3 3939ac3ef7d472a59a9c5f893dd3e39fh9ofi jefix jewelry phone4 9759243157894736 in 18.99.175.133 vasstdc27m7nks3
4 4 8393c48eaf4b8fa96886edc7cf27b372dsibi jefix jewelry phone5 9759243157894736 in 145.138.5.37 vasstdc27m7nks3
df_train.head()
# df_train.dtypes
bidder_id payment_account address outcome
0 91a3c57b13234af24875c56fb7e2b2f4rb56a a3d2de7675556553a5f08e4c88d2c228754av a3d2de7675556553a5f08e4c88d2c228vt0u4 0.0
1 624f258b49e77713fc34034560f93fb3hu3jo a3d2de7675556553a5f08e4c88d2c228v1sga ae87054e5a97a8f840a3991d12611fdcrfbq3 0.0
2 1c5f4fc669099bfbfac515cd26997bd12ruaj a3d2de7675556553a5f08e4c88d2c2280cybl 92520288b50f03907041887884ba49c0cl0pd 0.0
3 4bee9aba2abda51bf43d639013d6efe12iycd 51d80e233f7b6a7dfdee484a3c120f3b2ita8 4cb9717c8ad7e88a9a284989dd79b98dbevyi 0.0
4 4ab12bc61c82ddd9c2d65e60555808acqgos1 a3d2de7675556553a5f08e4c88d2c22857ddh 2a96c3ce94b3be921e0296097b88b56a7x1ji 0.0

Checking for anomalous data

# check whether each table contains any null values
print 'Is there any missing value in bids?',df_bids.isnull().any().any()
print 'Is there any missing value in train?',df_train.isnull().any().any()
print 'Is there any missing value in test?',df_test.isnull().any().any()
Is there any missing value in bids? True
Is there any missing value in train? False
Is there any missing value in test? False

Checking all three datasets for null values shows that the train and test bidder datasets have no missing data, while the bids dataset does contain missing values; next we dig into bids to locate them.

# nan_rows = df_bids[df_bids.isnull().T.any().T]
# print nan_rows
pd.isnull(df_bids).any()
bid_id         False
bidder_id      False
auction        False
merchandise    False
device         False
time           False
country         True
ip             False
url            False
dtype: bool
missing_country = df_bids['country'].isnull().sum().sum()
print 'No. of missing country: ', missing_country
normal_country = df_bids['country'].notnull().sum().sum()
print 'No. of normal country: ', normal_country
No. of missing country:  8859
No. of normal country:  7647475

import matplotlib.pyplot as plt
labels = ['unknown', 'normal']
sizes = [missing_country, normal_country]
explode = (0.1, 0)
fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90)
ax1.axis('equal')
plt.title('Distribution of missing countries vs. normal countries')
plt.show()

empty value

The analysis above shows that only a very small fraction of bid records are missing the country attribute. These missing values can be filled during preprocessing in one of two ways (a sketch of the first option follows this list):

  • Group the raw bid records by bidder and fill each bidder's missing values with the country that bidder most often bids from

  • Group the raw bid records by bidder, sort each group by time, and forward- or backward-fill the missing values from the neighbouring records
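A minimal sketch of the first option (not from the original notebook), assuming the df_bids frame loaded above; fill_country_with_mode is a hypothetical helper name:

# Sketch: fill each bidder's missing country with that bidder's most
# frequent country; bidders whose bids never carry a country stay NaN.
def fill_country_with_mode(group):
    if group['country'].notnull().any():
        group['country'] = group['country'].fillna(group['country'].mode()[0])
    return group

# df_bids_filled = df_bids.groupby('bidder_id').apply(fill_country_with_mode)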

# check the number of records in each dataset
# and whether each id column is a unique identifier
print df_bids.shape[0]
print len(df_bids['bid_id'].unique())
print df_train.shape[0]
print len(df_train['bidder_id'].unique())
print df_test.shape[0]
print len(df_test['bidder_id'].unique())
7656334
7656334
2013
2013
4700
4700

# count the distinct values of each basic (categorical) feature, excluding time
print 'total bidder in bids: ', len(df_bids['bidder_id'].unique())
print 'total auction in bids: ', len(df_bids['auction'].unique())
print 'total merchandise in bids: ', len(df_bids['merchandise'].unique())
print 'total device in bids: ', len(df_bids['device'].unique())
print 'total country in bids: ', len(df_bids['country'].unique())
print 'total ip in bids: ', len(df_bids['ip'].unique())
print 'total url in bids: ', len(df_bids['url'].unique())
total bidder in bids:  6614
total auction in bids:  15051
total merchandise in bids:  10
total device in bids:  7351
total country in bids:  200
total ip in bids:  2303991
total url in bids:  1786351

From these basic counts we can see:

  • The number of distinct bidders in bids is smaller than the number of bidders in train + test, so the two are not in one-to-one correspondence; next we verify whether every bidder in bids actually comes from the train or test set

  • Merchandise and country have far fewer distinct values than the other features, so they can be treated as natural categorical features, while the remaining features are better suited to count statistics

lst_all_users = (df_train['bidder_id'].unique()).tolist() + (df_test['bidder_id'].unique()).tolist()
print 'total bidders of train and test set',len(lst_all_users)
lst_bidder = (df_bids['bidder_id'].unique()).tolist()
print 'total bidders in bids set',len(lst_bidder)
print 'Is bidders in bids are all from train+test set? ',set(lst_bidder).issubset(set(lst_all_users))
total bidders of train and test set 6713
total bidders in bids set 6614
Is bidders in bids are all from train+test set?  True

lst_nobids = [i for i in lst_all_users if i not in lst_bidder]
print 'No. of bidders never bid: ',len(lst_nobids)
lst_nobids_train = [i for i in lst_nobids if i in (df_train['bidder_id'].unique()).tolist()]
lst_nobids_test = [i for i in lst_nobids if i in (df_test['bidder_id'].unique()).tolist()]
print 'No. of bidders never bid in train set: ',len(lst_nobids_train)
print 'No. of bidders never bid in test set: ',len(lst_nobids_test)
No. of bidders never bid:  99
No. of bidders never bid in train set:  29
No. of bidders never bid in test set:  70

data_source = ['train', 'test']
y_pos = np.arange(len(data_source))
num_never_bids = [len(lst_nobids_train), len(lst_nobids_test)]
plt.bar(y_pos, num_never_bids, align='center', alpha=0.5)
plt.xticks(y_pos, data_source)
plt.ylabel('bidders no bids')
plt.title('Source of no bids bidders')
plt.show()

source of no bid bidders

print df_train[(df_train['bidder_id'].isin(lst_nobids_train)) & (df_train['outcome']==1.0)]
Empty DataFrame
Columns: [bidder_id, payment_account, address, outcome]
Index: []

The computation above shows that 99 bidders have no bid records at all, 29 from the train set and 70 from the test set. None of the 29 train bidders is labelled as a robot, so the 70 test bidders can later be labelled as human, or simply be given the mean predicted probability (a rough sketch follows).
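A rough sketch of the mean-fill idea (illustration only; model_pred is a hypothetical mapping from bidder_id to a predicted probability, which this post has not built yet):

# Sketch: build a submission over all test bidders and give bidders
# without bid records the mean of the available predictions.
submission = pd.DataFrame({'bidder_id': df_test['bidder_id']})
submission['prediction'] = submission['bidder_id'].map(model_pred)
submission['prediction'] = submission['prediction'].fillna(submission['prediction'].mean())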

# check the proportion of bots in train
print (df_train[df_train['outcome'] == 1].shape[0]*1.0) / df_train.shape[0] * 100,'%'
5.11674118231 %

Bidders labelled as robots make up roughly 5% of all bidders in the train set.

df_train.groupby('outcome').size().plot(labels=['Human', 'Robot'], kind='pie', autopct='%.2f', figsize=(4, 4), 
                                        title='Distribution of Human vs. Robots', legend=True)

Human vs. robot class distribution

The class distribution above shows that the positive and negative examples are badly imbalanced, so we will use AUC (which is insensitive to the class ratio) as the evaluation metric and favour gradient-boosting models for training.

Data Preprocessing and Feature Engineering

import numpy as np
import pandas as pd
import pickle
%matplotlib inline
from IPython.display import display
bids = pd.read_csv('bids.csv')
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

Handling missing data

For the missing country values found during data exploration, we fill them by grouping the raw bid records by bidder, ordering each group by time, and forward- then backward-filling the missing values from neighbouring records.

display(bids.head())
bid_id bidder_id auction merchandise device time country ip url
0 0 8dac2b259fd1c6d1120e519fb1ac14fbqvax8 ewmzr jewelry phone0 9759243157894736 us 69.166.231.58 vasstdc27m7nks3
1 1 668d393e858e8126275433046bbd35c6tywop aeqok furniture phone1 9759243157894736 in 50.201.125.84 jmqlhflrzwuay9c
2 2 aa5f360084278b35d746fa6af3a7a1a5ra3xe wa00e home goods phone2 9759243157894736 py 112.54.208.157 vasstdc27m7nks3
3 3 3939ac3ef7d472a59a9c5f893dd3e39fh9ofi jefix jewelry phone4 9759243157894736 in 18.99.175.133 vasstdc27m7nks3
4 4 8393c48eaf4b8fa96886edc7cf27b372dsibi jefix jewelry phone5 9759243157894736 in 145.138.5.37 vasstdc27m7nks3
# pd.algos.is_monotonic_int64(bids.time.values, True)[0]
print 'Is the time monotonically non-decreasing? ', pd.Index(bids['time']).is_monotonic
Is the time monotonically non-decreasing?  False

# bidder_group = bids.sort_values(['bidder_id', 'time']).groupby('bidder_id')
bids['country'] = bids.sort_values(['bidder_id', 'time']).groupby('bidder_id')['country'].ffill()
bids['country'] = bids.sort_values(['bidder_id', 'time']).groupby('bidder_id')['country'].bfill()
display(bids.head())
bid_id bidder_id auction merchandise device time country ip url
0 0 8dac2b259fd1c6d1120e519fb1ac14fbqvax8 ewmzr jewelry phone0 9759243157894736 us 69.166.231.58 vasstdc27m7nks3
1 1 668d393e858e8126275433046bbd35c6tywop aeqok furniture phone1 9759243157894736 in 50.201.125.84 jmqlhflrzwuay9c
2 2 aa5f360084278b35d746fa6af3a7a1a5ra3xe wa00e home goods phone2 9759243157894736 py 112.54.208.157 vasstdc27m7nks3
3 3 3939ac3ef7d472a59a9c5f893dd3e39fh9ofi jefix jewelry phone4 9759243157894736 in 18.99.175.133 vasstdc27m7nks3
4 4 8393c48eaf4b8fa96886edc7cf27b372dsibi jefix jewelry phone5 9759243157894736 in 145.138.5.37 vasstdc27m7nks3
print 'Is there any missing value in bids?',bids.isnull().any().any()
# pickle.dump(bids, open('bids.pkl', 'w'))
Is there any missing value in bids? True

missing_country = bids['country'].isnull().sum().sum()
print 'No. of missing country: ', missing_country
normal_country = bids['country'].notnull().sum().sum()
print 'No. of normal country: ', normal_country
No. of missing country:  5
No. of normal country:  7656329

nan_rows = bids[bids.isnull().T.any().T]
print nan_rows
          bid_id                              bidder_id auction  \
1351177  1351177  f3ab8c9ecc0d021ebc81e89f20c8267bn812w   jefix   
2754184  2754184  88ef9cfdbec4c9e33f6c2e0b512e7a01dp2p2   cc5fs   
2836631  2836631  29b8af2fea3881ef61911612372dac41vczqv   jqx39   
3125892  3125892  df20f216cbb0b0df5a7b2e94b16a7853iyw9g   jqx39   
5153748  5153748  5e05ec450e2dd64d7996a08bbbca4f126nzzk   jqx39   

              merchandise    device              time country  \
1351177  office equipment   phone84  9767200789473684     NaN   
2754184            mobile  phone150  9633363947368421     NaN   
2836631           jewelry   phone72  9634034894736842     NaN   
3125892   books and music  phone106  9635755105263157     NaN   
5153748            mobile  phone267  9645270210526315     NaN   

                      ip              url  
1351177   80.211.119.111  g9pgdfci3yseml5  
2754184     20.67.240.88  ctivbfq55rktail  
2836631  149.210.107.205  vasstdc27m7nks3  
3125892      26.23.62.59  ac9xlqtfg0cx5c5  
5153748     145.7.194.40  0em0vg1f0zuxonw  

# print bids[bids['bid_id']==1351177]
nan_bidder = nan_rows['bidder_id'].values.tolist()
# print nan_bidder
print bids[bids['bidder_id'].isin(nan_bidder)]
          bid_id                              bidder_id auction  \
1351177  1351177  f3ab8c9ecc0d021ebc81e89f20c8267bn812w   jefix   
2754184  2754184  88ef9cfdbec4c9e33f6c2e0b512e7a01dp2p2   cc5fs   
2836631  2836631  29b8af2fea3881ef61911612372dac41vczqv   jqx39   
3125892  3125892  df20f216cbb0b0df5a7b2e94b16a7853iyw9g   jqx39   
5153748  5153748  5e05ec450e2dd64d7996a08bbbca4f126nzzk   jqx39   

              merchandise    device              time country  \
1351177  office equipment   phone84  9767200789473684     NaN   
2754184            mobile  phone150  9633363947368421     NaN   
2836631           jewelry   phone72  9634034894736842     NaN   
3125892   books and music  phone106  9635755105263157     NaN   
5153748            mobile  phone267  9645270210526315     NaN   

                      ip              url  
1351177   80.211.119.111  g9pgdfci3yseml5  
2754184     20.67.240.88  ctivbfq55rktail  
2836631  149.210.107.205  vasstdc27m7nks3  
3125892      26.23.62.59  ac9xlqtfg0cx5c5  
5153748     145.7.194.40  0em0vg1f0zuxonw  

Even after grouping by bidder and forward/backward-filling along the time axis, five bidders still have a missing country. As the output above shows, each of these five bidders placed only a single bid; let us see what else characterises them.

lst_nan_train = [i for i in nan_bidder if i in (train['bidder_id'].unique()).tolist()]
lst_nan_test = [i for i in nan_bidder if i in (test['bidder_id'].unique()).tolist()]
print 'No. of bidders 1 bid in train set: ',len(lst_nan_train)
print 'No. of bidders 1 bid in test set: ',len(lst_nan_test)
No. of bidders 1 bid in train set:  1
No. of bidders 1 bid in test set:  4

print train[train['bidder_id']==lst_nan_train[0]]['outcome']
546    0.0
Name: outcome, dtype: float64

Since these five bidders each placed only one bid, with one coming from the train set (labelled human) and four from the test set, and since a single bid carries too little information, we drop their bid records. The four test bidders will later be handled like the bidders with no bid records: their predictions will be filled with the model's mean prediction.

bid_to_drop = nan_rows.index.values.tolist()
# print bid_to_drop
bids.drop(bids.index[bid_to_drop], inplace=True)
print 'Is there any missing value in bids?',bids.isnull().any().any()
pickle.dump(bids, open('bids.pkl', 'w'))
Is there any missing value in bids? False

Basic count statistics

From the data exploration we know that the dataset consists mostly of categorical or discrete attributes, so we first group the bid records by bidder and count each attribute per bidder: number of distinct devices used, countries involved in bidding, distinct IPs, and so on.

# group by bidder to do some statistics
bidders = bids.groupby('bidder_id')
# pickle.dump(bids, open('bidders.pkl', 'w'))
# print bidders['device'].count()
def feature_count(group):
    dct_cnt = {}
    dct_cnt['devices_c'] = group['device'].unique().shape[0]
    dct_cnt['countries_c'] = group['country'].unique().shape[0]
    dct_cnt['ip_c'] = group['ip'].unique().shape[0]
    dct_cnt['url_c'] = group['url'].unique().shape[0]    
    dct_cnt['auction_c'] = group['auction'].unique().shape[0]
    dct_cnt['auc_mean'] = np.mean(group['auction'].value_counts())    # bids_c/auction_c
#     dct_cnt['dev_mean'] = np.mean(group['device'].value_counts())    # bids_c/devices_c
    dct_cnt['merch_c'] = group['merchandise'].unique().shape[0]
    dct_cnt['bids_c'] = group.shape[0]
    dct_cnt = pd.Series(dct_cnt)
    return dct_cnt
cnt_bidder = bidders.apply(feature_count)
display(cnt_bidder.describe())
# cnt_bidder.to_csv('cnt_bidder.csv')
# print cnt_bidder[cnt_bidder['merch_c']==2]
auc_mean auction_c bids_c countries_c devices_c ip_c merch_c url_c
count 6609.000000 6609.000000 6609.000000 6609.000000 6609.000000 6609.000000 6609.000000 6609.000000
mean 6.593493 57.850810 1158.470117 12.733848 73.492359 544.507187 1.000151 290.964140
std 30.009242 131.814053 9596.595169 22.556570 172.171106 3370.730666 0.012301 2225.912425
min 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
25% 1.000000 2.000000 3.000000 1.000000 2.000000 2.000000 1.000000 1.000000
50% 1.677419 10.000000 18.000000 3.000000 8.000000 12.000000 1.000000 5.000000
75% 4.142857 47.000000 187.000000 12.000000 57.000000 111.000000 1.000000 36.000000
max 1327.366667 1726.000000 515033.000000 178.000000 2618.000000 111918.000000 2.000000 81376.000000

Feature correlations

With the bid records grouped by bidder, we build a scatter matrix over every pair of engineered features to look at the correlations between them.

# build a scatter matrix for every pair of features
pd.scatter_matrix(cnt_bidder, alpha = 0.3, figsize = (16,10), diagonal = 'kde');

scatter matrix

From the per-bidder statistics above (timestamps not yet considered), a few basic conclusions can be drawn:

  • Comparing each statistic's maximum with its median and 75th percentile, every feature except the merchandise count shows some extreme values, which may be worth examining as anomalous behaviour

  • All features are heavily skewed, so we take the log of each feature and plot the scatter matrix again to re-check the correlations

  • The merchandise count has a very small variance: judging by the median and even the 75th percentile, most bidders bid within a single merchandise category. Since the exploration phase already showed that merchandise works best as a categorical feature, we will later split it into separate per-category statistics and drop it from the count features here

cnt_bidder.drop('merch_c', axis=1, inplace=True)
cnt_bidder = np.log(cnt_bidder)
pd.scatter_matrix(cnt_bidder, alpha = 0.3, figsize = (16,10), diagonal = 'kde');

scatter matrix (log)

The scatter matrix shows no strong correlation between the behaviour features. IP count, bid count and device count do become mildly positively correlated after the log transform, but since the correlation only appears after the transform and is still weak, all three features are kept.

To follow up on the anomalous behaviour mentioned above, we first pick a few samples from the original train set, some robots and some humans, and trace their per-bidder statistics for comparison.

cnt_bidder.to_csv('cnt_bidder.csv')
# trace samples: first 2 bots, last 2 humans
indices = ['9434778d2268f1fa2a8ede48c0cd05c097zey','aabc211b4cf4d29e4ac7e7e361371622pockb',
           'd878560888b11447e73324a6e263fbd5iydo1','91a3c57b13234af24875c56fb7e2b2f4rb56a']

# build a DataFrame for the chosen indices
samples = pd.DataFrame(cnt_bidder.loc[indices], columns = cnt_bidder.keys()).reset_index(drop = True)
print "Chosen samples of training dataset: (first 2 bots, last 2 humans)"
display(samples)
Chosen samples of training dataset: (first 2 bots, last 2 humans)

auc_mean auction_c bids_c countries_c devices_c ip_c url_c
0 3.190981 5.594711 8.785692 4.174387 6.011267 8.147578 7.557995
1 2.780432 4.844187 7.624619 2.639057 3.178054 5.880533 1.609438
2 0.287682 1.098612 1.386294 1.098612 1.386294 1.386294 0.000000
3 0.287682 2.890372 3.178054 1.791759 2.639057 2.995732 0.000000

We use seaborn to visualise the percentile ranks of these four samples as a heatmap.

import matplotlib.pyplot as plt
import seaborn as sns

# look at percentile ranks
pcts = 100. * cnt_bidder.rank(axis=0, pct=True).loc[indices].round(decimals=3)
print pcts

# visualize percentiles with heatmap
sns.heatmap(pcts, yticklabels=['robot 1', 'robot 2', 'human 1', 'human 2'], annot=True, linewidth=.1, vmax=99, 
            fmt='.1f', cmap='YlGnBu')
plt.title('Percentile ranks of\nsamples\' feature statistics')
plt.xticks(rotation=45, ha='center');
                                       auc_mean  auction_c  bids_c  \
bidder_id                                                            
9434778d2268f1fa2a8ede48c0cd05c097zey      94.9       94.6    97.0   
aabc211b4cf4d29e4ac7e7e361371622pockb      92.4       87.2    92.3   
d878560888b11447e73324a6e263fbd5iydo1      39.8       30.4    30.2   
91a3c57b13234af24875c56fb7e2b2f4rb56a      39.8       60.2    53.0   

                                       countries_c  devices_c  ip_c  url_c  
bidder_id                                                                   
9434778d2268f1fa2a8ede48c0cd05c097zey         95.4       95.6  96.7   97.4  
aabc211b4cf4d29e4ac7e7e361371622pockb         77.3       63.8  84.8   50.3  
d878560888b11447e73324a6e263fbd5iydo1         48.8       38.7  34.2   13.4  
91a3c57b13234af24875c56fb7e2b2f4rb56a         63.7       56.8  56.2   13.4  

heatmap

The heatmap shows that, except for the merchandise statistic, every statistic of the robots is higher than that of the human bidders, so a baseline model can be designed from simple rules over these statistics. The most telling feature appears to be auc_mean, a bidder's average number of bids per auction. We therefore start by applying a standard outlier-detection treatment to the basic statistics.

Designing a naive classifier

Since the final goal is to pick out robot users among the bidders, and common sense says a robot bids far more intensively than a human, the naive classifier can be designed from an outlier-detection point of view: using the per-bidder count features computed earlier, build a list of suspicious bidders for each feature, then merge the per-feature lists and treat bidders that are anomalous on multiple features as robots.

# find the outliers for each feature
lst_outlier = []
for feature in cnt_bidder.keys():
    # 25th percentile (Q1)
    Q1 = np.percentile(cnt_bidder[feature], 25)
    # 75th percentile (Q3)
    Q3 = np.percentile(cnt_bidder[feature], 75)
    step = 1.5 * (Q3 - Q1)    
    # show outliers
    # print "Data points considered outliers for the feature '{}':".format(feature)
    display(cnt_bidder[~((cnt_bidder[feature] >= Q1 - step) & (cnt_bidder[feature] <= Q3 + step))])
    lst_outlier += cnt_bidder[~((cnt_bidder[feature] >= Q1 - step) & (cnt_bidder[feature] <= Q3 + step))].index.values.tolist()

Having collected, for every feature, all bidder ids that could count as 'outliers', we tally how often each bidder appears and look for bidders that are anomalous on several features. After some experimentation, and given that bots make up less than 5% of the original train set, we settle on flagging bidders that are anomalous on at least one feature. Intersecting this set with the train set gives a user subset, which is how the naive classifier operates.

# print len(lst_outlier)
from collections import Counter
freq_outlier = dict(Counter(lst_outlier))
perhaps_outlier = [i for i in freq_outlier if freq_outlier[i] >= 1]
print len(perhaps_outlier)
214

# basic_pred = test[test['bidder_id'].isin(perhaps_outlier)]['bidder_id'].tolist()
train_pred = train[train['bidder_id'].isin(perhaps_outlier)]['bidder_id'].tolist()
print len(train_pred)
76

Choosing an evaluation metric

From the data exploration we know the two classes are imbalanced at roughly 19:1, so we use AUC, a metric that is unaffected by the class ratio, as the yardstick. We now score the naive classifier on the original train set to obtain a baseline.

from sklearn.metrics import roc_auc_score
y_true = train['outcome']
naive_pred = pd.DataFrame(columns=['bidder_id', 'prediction'])
naive_pred['bidder_id'] = train['bidder_id']
naive_pred['prediction'] = np.where(naive_pred['bidder_id'].isin(train_pred), 1.0, 0.0)
basic_pred = naive_pred['prediction']
print roc_auc_score(y_true, basic_pred)
0.54661464952

So far only the basic count statistics have been computed, and the one non-categorical feature, the timestamp, has not been touched. The exploration phase also showed that merchandise and country are categorical and can each be split into several per-category statistics, and the analysis above suggests a further grouping by bidder and auction. The remaining work is therefore (a sketch of the second item follows this list):

  • Process the timestamps

  • Split merchandise and country into per-category statistics

  • Group by bidder and auction for further statistics
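A minimal sketch of the second item, assuming the bids frame from above (the actual implementation is left to part two):

# Sketch: one count column per merchandise category for every bidder;
# the same pattern applies to country.
merch_counts = pd.crosstab(bids['bidder_id'], bids['merchandise'])
merch_counts.columns = ['merch_' + c.replace(' ', '_') for c in merch_counts.columns]
# merch_counts can then be joined onto cnt_bidder by bidder_id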

Processing the timestamps

The main idea is to analyse the time gaps between bids: within each auction, we record the interval between every bid and the bid immediately preceding it.

Then, for each bidder, these intervals are aggregated into:

  • the mean interval

  • the maximum interval

  • the minimum interval

from collections import defaultdict

def generate_timediff():
    # group bids by auction; within each auction, record the gap between
    # every bid and the bid immediately preceding it
    bids_grouped = bids.groupby('auction')
    bds = defaultdict(list)

    for _, bids_auc in bids_grouped:
        last_row = None    # reset per auction so gaps never span two auctions
        for i, row in bids_auc.iterrows():
            if last_row is None:
                last_row = row
                continue

            time_difference = row['time'] - last_row['time']
            bds[row['bidder_id']].append(time_difference)
            last_row = row

    # aggregate the per-bidder gaps into mean / min / max features
    df = []
    for key in bds.keys():
        df.append({'bidder_id': key, 'mean': np.mean(bds[key]),
                   'min': np.min(bds[key]), 'max': np.max(bds[key])})

    pd.DataFrame(df).to_csv('tdiff.csv', index=False)
generate_timediff()

Since this post has hit the length limit, the remaining content continues in part two: 使用机器学习识别出拍卖场中作弊的机器人用户(二) (identifying auction-cheating robot users with machine learning, part 2).

