This article continues the previous one: identifying cheating bot users in online auctions with machine learning.

The project is a Kaggle competition hosted by Facebook (see the data source link below); the complete code is on my GitHub, feel free to take a look.

Code

  • Data exploration: Data_Exploration.ipynb

  • Data preprocessing & feature engineering: Feature_Engineering.ipynb & Feature_Engineering2.ipynb

  • Model design and evaluation: Model_Design.ipynb

Project data source

Additional packages required


Since the full write-up is long, it is split into two articles covering four parts in total:

  • Data exploration

  • Data preprocessing and feature engineering

  • Model design

  • Evaluation and summary


Feature Engineering (continued)

import numpy as np
import pandas as pd
import pickle
%matplotlib inline
from IPython.display import display
# bids = pd.read_csv('bids.csv')
bids = pickle.load(open('bids.pkl', 'rb'))  # pickle files must be opened in binary mode
print(bids.shape)
display(bids.head())
(7656329, 9)

bid_id bidder_id auction merchandise device time country ip url
0 0 8dac2b259fd1c6d1120e519fb1ac14fbqvax8 ewmzr jewelry phone0 9759243157894736 us 69.166.231.58 vasstdc27m7nks3
1 1 668d393e858e8126275433046bbd35c6tywop aeqok furniture phone1 9759243157894736 in 50.201.125.84 jmqlhflrzwuay9c
2 2 aa5f360084278b35d746fa6af3a7a1a5ra3xe wa00e home goods phone2 9759243157894736 py 112.54.208.157 vasstdc27m7nks3
3 3 3939ac3ef7d472a59a9c5f893dd3e39fh9ofi jefix jewelry phone4 9759243157894736 in 18.99.175.133 vasstdc27m7nks3
4 4 8393c48eaf4b8fa96886edc7cf27b372dsibi jefix jewelry phone5 9759243157894736 in 145.138.5.37 vasstdc27m7nks3
bidders = bids.groupby('bidder_id')

Expand the single multi-category features country and merchandise into separate per-category indicator features for each bidder.

cates = (bids['merchandise'].unique()).tolist()
countries = (bids['country'].unique()).tolist()

def dummy_coun_cate(group):
    coun_cate = dict.fromkeys(cates, 0)
    coun_cate.update(dict.fromkeys(countries, 0))
    for cat, value in group['merchandise'].value_counts().items():
        coun_cate[cat] = value

    for c in group['country'].unique():
        coun_cate[c] = 1

    coun_cate = pd.Series(coun_cate)
    return coun_cate
bidder_coun_cate = bidders.apply(dummy_coun_cate)
display(bidder_coun_cate.describe())
bidder_coun_cate.to_csv('coun_cate.csv')
ad ae af ag al am an ao ar at ... vc ve vi vn ws ye za zm zw zz
count 6609.000000 6609.000000 6609.000000 6609.000000 6609.000000 6609.000000 6609.000000 6609.000000 6609.000000 6609.000000 ... 6609.000000 6609.000000 6609.000000 6609.000000 6609.000000 6609.000000 6609.000000 6609.000000 6609.000000 6609.000000
mean 0.002724 0.205629 0.054774 0.001059 0.048570 0.023907 0.000303 0.036314 0.120442 0.052655 ... 0.000605 0.033591 0.000303 0.130882 0.001967 0.040551 0.274474 0.067181 0.069753 0.000757
std 0.052121 0.404191 0.227555 0.032530 0.214984 0.152770 0.017395 0.187085 0.325502 0.223362 ... 0.024596 0.180186 0.017395 0.337297 0.044311 0.197262 0.446283 0.250354 0.254750 0.027497
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000
max 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

8 rows × 209 columns
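As an aside, the hand-rolled indicator coding above can also be sketched with pandas' built-in `pd.get_dummies`; the frame below is toy data, not the real bids table:

```python
import pandas as pd

# Toy stand-in for the real bids table (hypothetical values).
toy = pd.DataFrame({
    'bidder_id': ['a', 'a', 'b'],
    'country':   ['us', 'in', 'us'],
})

# One-hot encode country, then aggregate per bidder: max() yields the
# same 0/1 "has this bidder been seen in this country" flags as the
# manual dict-based loop above.
dummies = pd.get_dummies(toy['country'])
coun_flags = dummies.groupby(toy['bidder_id']).max()
print(coun_flags)
```

The dict-based version in the notebook has one advantage: it fixes the full column set up front, so every bidder gets the same 209 columns even when a country never appears in that bidder's group.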

Similarly, for each bidder we compute statistics over the time intervals between his consecutive bids.

def bidder_interval(group):
    time_diff = np.ediff1d(group['time'])
    bidder_interval = {}
    if len(time_diff) == 0:
        diff_mean = 0
        diff_std = 0
        diff_median = 0
        diff_zeros = 0
    else:
        diff_mean = np.mean(time_diff)
        diff_std = np.std(time_diff)
        diff_median = np.median(time_diff)
        diff_zeros = time_diff.shape[0] - np.count_nonzero(time_diff)
    bidder_interval['tmean'] = diff_mean
    bidder_interval['tstd'] = diff_std
    bidder_interval['tmedian'] = diff_median
    bidder_interval['tzeros'] = diff_zeros
    bidder_interval = pd.Series(bidder_interval)
    return bidder_interval
bidder_inv = bidders.apply(bidder_interval)
display(bidder_inv.describe())
bidder_inv.to_csv('bidder_inv.csv')
tmean tmedian tstd tzeros
count 6.609000e+03 6.609000e+03 6.609000e+03 6609.000000
mean 2.933038e+12 1.860285e+12 3.440901e+12 122.986231
std 8.552343e+12 7.993497e+12 6.512992e+12 3190.805229
min 0.000000e+00 0.000000e+00 0.000000e+00 0.000000
25% 1.192853e+10 2.578947e+09 1.749995e+09 0.000000
50% 2.641139e+11 5.726316e+10 5.510107e+11 0.000000
75% 1.847456e+12 6.339474e+11 2.911282e+12 0.000000
max 7.610295e+13 7.610295e+13 3.800092e+13 231570.000000
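To make the interval features concrete, here is what `np.ediff1d` and the zero count produce on a tiny hand-made series of timestamps (values are illustrative only):

```python
import numpy as np

# Hypothetical bid timestamps of one bidder, already sorted by time.
times = np.array([10, 12, 12, 20])

diffs = np.ediff1d(times)  # consecutive differences
zeros = diffs.shape[0] - np.count_nonzero(diffs)  # bid pairs with identical timestamps

print(diffs, zeros)
```

A large `tzeros` (many bids landing on the same timestamp) is exactly the kind of pattern this feature is meant to surface.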


Further analysis grouped by bidder and auction

The statistics above were grouped by bidder, describing each bidder's bidding behavior as a whole. Next the data is split further by auction to examine how each bidder behaves in individual auctions. Analogously to the per-bidder statistics, the following features are computed for each bidder-auction pair:

  • Basic counts: for each bidder in each auction, the number of distinct devices, countries, IPs, URLs, merchandise categories, and the number of bids, as new features

  • Time-interval statistics: the mean, standard deviation, median, and zero count of the intervals between consecutive bids of each bidder in each auction

  • Merchandise category and country again expanded into per-category indicator features

def auc_features_count(group):
    time_diff = np.ediff1d(group['time'])
    
    if len(time_diff) == 0:
        diff_mean = 0
        diff_std = 0
        diff_median = 0
        diff_zeros = 0
    else:
        diff_mean = np.mean(time_diff)
        diff_std = np.std(time_diff)
        diff_median = np.median(time_diff)
        diff_zeros = time_diff.shape[0] - np.count_nonzero(time_diff)

    row = dict.fromkeys(cates, 0)
    row.update(dict.fromkeys(countries, 0))

    row['devices_c'] = group['device'].unique().shape[0]
    row['countries_c'] = group['country'].unique().shape[0]
    row['ip_c'] = group['ip'].unique().shape[0]
    row['url_c'] = group['url'].unique().shape[0]
#     row['merch_c'] = group['merchandise'].unique().shape[0]
    row['bids_c'] = group.shape[0]
    row['tmean'] = diff_mean
    row['tstd'] = diff_std
    row['tmedian'] = diff_median
    row['tzeros'] = diff_zeros

    for cat, value in group['merchandise'].value_counts().items():
        row[cat] = value

    for c in group['country'].unique():
        row[c] = 1

    row = pd.Series(row)
    return row
bidder_auc = bids.groupby(['bidder_id', 'auction']).apply(auc_features_count)
bidder_auc.to_csv('bids_auc.csv')
print(bidder_auc.shape)
(382336, 218)
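These per-(bidder, auction) rows are later collapsed back to one row per bidder via `groupby('bidder_id').mean()` inside `merge_data()` below; a toy sketch of that aggregation, with hypothetical numbers:

```python
import pandas as pd

# Hypothetical per-(bidder, auction) feature rows.
per_auc = pd.DataFrame({
    'bidder_id': ['a', 'a', 'b'],
    'auction':   ['x', 'y', 'x'],
    'bids_c':    [4, 2, 5],
})

# Average each bidder's per-auction features into a single row.
per_bidder = per_auc.groupby('bidder_id')[['bids_c']].mean()
print(per_bidder)
```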

Model Design and Parameter Evaluation

Merging the features

Merge the features generated above into the final feature space.

import numpy as np
import pandas as pd
%matplotlib inline
from IPython.display import display

First merge the per-bidder statistics together, then join them with the bidder-auction pair features, and finally add the time features; the result is joined with the training and test sets to produce the feature files used for training and prediction.

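One detail worth noting before the code: the feature frames are indexed by `bidder_id` while train/test carry it as an ordinary column, hence the `left_on='bidder_id', right_index=True` pattern. A toy illustration with hypothetical data:

```python
import pandas as pd

# Hypothetical training rows: bidder_id is a regular column here.
train = pd.DataFrame({'bidder_id': ['a', 'b'], 'outcome': [0, 1]})

# Hypothetical feature rows: bidder_id is the index here.
feats = pd.DataFrame({'bids_c': [4, 5]}, index=['a', 'b'])

# Join a column of the left frame against the index of the right frame.
merged = train.merge(feats, left_on='bidder_id', right_index=True)
print(merged)
```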
def merge_data():    
    train = pd.read_csv('train.csv')
    test = pd.read_csv('test.csv')

    time_differences = pd.read_csv('tdiff.csv', index_col=0)
    bids_auc = pd.read_csv('bids_auc.csv')

    bids_auc = bids_auc.groupby('bidder_id').mean()
    
    bidders = pd.read_csv('cnt_bidder.csv', index_col=0)
    country_cate = pd.read_csv('coun_cate.csv', index_col=0)
    bidder_inv = pd.read_csv('bidder_inv.csv', index_col=0)
    bidders = bidders.merge(country_cate, right_index=True, left_index=True)
    bidders = bidders.merge(bidder_inv, right_index=True, left_index=True)

    bidders = bidders.merge(bids_auc, right_index=True, left_index=True)
    bidders = bidders.merge(time_differences, right_index=True,
                            left_index=True)

    train = train.merge(bidders, left_on='bidder_id', right_index=True)
    train.to_csv('train_full.csv', index=False)

    test = test.merge(bidders, left_on='bidder_id', right_index=True)
    test.to_csv('test_full.csv', index=False)    
merge_data()
train_full = pd.read_csv('train_full.csv')
test_full = pd.read_csv('test_full.csv')
print(train_full.shape)
print(test_full.shape)
(1983, 445)
(4626, 444)

train_full['outcome'] = train_full['outcome'].astype(int)
ytrain = train_full['outcome']
train_full.drop('outcome', axis=1, inplace=True)

test_ids = test_full['bidder_id']

labels = ['payment_account', 'address', 'bidder_id']
train_full.drop(labels=labels, axis=1, inplace=True)
test_full.drop(labels=labels, axis=1, inplace=True)

Designing the cross-validation

Model selection

Given the class imbalance identified in the earlier analysis, four models are considered for training and prediction: RandomForestClassifier, GradientBoostingClassifier, xgboost, and lightgbm. The plan for settling on a final model:

  • Cross-validate each of the four models with roc_auc as the scoring metric, plot the ROC curves, and report each model's mean score across the folds

  • Based on the scores, settle on one or more models

    • If one model clearly outperforms all the others, tune that model further

    • If several models perform well, ensemble them into an aggregated model

    • Use GridSearchCV to pick the best parameter combination from hand-specified grids for the final model

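Because the classes are imbalanced, the cross-validation below relies on `StratifiedKFold`, which keeps the share of positives roughly constant across folds; a minimal check on synthetic labels:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced labels: 90 negatives, 10 positives.
X = np.zeros((100, 1))
y = np.array([0] * 90 + [1] * 10)

kf = StratifiedKFold(n_splits=5)
pos_per_fold = [int(y[test_idx].sum()) for _, test_idx in kf.split(X, y)]
print(pos_per_fold)  # every test fold receives the same number of positives
```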
import time
from itertools import cycle

import matplotlib.pyplot as plt

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score, roc_curve, auc

def kfold_plot(train, ytrain, model):
    # Stratified folds keep the positive/negative ratio stable in each split.
    kf = StratifiedKFold(n_splits=5)
    scores = []
    mean_tpr = 0.0
    mean_fpr = np.linspace(0, 1, 100)
    exe_time = []

    colors = cycle(['cyan', 'indigo', 'seagreen', 'yellow', 'blue'])
    lw = 2

    i = 0
    for (train_index, test_index), color in zip(kf.split(train, ytrain), colors):
        X_train, X_test = train.iloc[train_index], train.iloc[test_index]
        y_train, y_test = ytrain.iloc[train_index], ytrain.iloc[test_index]
        begin_t = time.time()
        predictions = model(X_train, X_test, y_train)
        end_t = time.time()
        exe_time.append(round(end_t - begin_t, 3))
        scores.append(roc_auc_score(y_test.astype(float), predictions))
        fpr, tpr, thresholds = roc_curve(y_test, predictions)
        mean_tpr += np.interp(mean_fpr, fpr, tpr)
        mean_tpr[0] = 0.0
        roc_auc = auc(fpr, tpr)
        plt.plot(fpr, tpr, lw=lw, color=color, label='ROC fold %d (area = %0.2f)' % (i, roc_auc))
        i += 1
    plt.plot([0, 1], [0, 1], linestyle='--', lw=lw, color='k', label='Luck')

    mean_tpr /= kf.get_n_splits(train, ytrain)
    mean_tpr[-1] = 1.0
    mean_auc = auc(mean_fpr, mean_tpr)
    plt.plot(mean_fpr, mean_tpr, color='g', linestyle='--', label='Mean ROC (area = %0.2f)' % mean_auc, lw=lw)

    plt.xlim([-0.05, 1.05])
    plt.ylim([-0.05, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic')
    plt.legend(loc='lower right')
    plt.show()

    print('mean scores: ', np.mean(scores))
    print('mean model process time: ', np.mean(exe_time), 's')

RandomForestClassifier

from sklearn.model_selection import GridSearchCV
import time
from sklearn.ensemble import RandomForestClassifier

def forest_model(X_train, X_test, y_train):
    model = RandomForestClassifier(n_estimators=160, max_features=35, max_depth=8, random_state=7)
    model.fit(X_train, y_train)    
    predictions = model.predict_proba(X_test)[:, 1]
    return predictions
kfold_plot(train_full, ytrain, forest_model)
# kfold_plot(train_full, ytrain, model_forest)

mean scores:  0.909571935157
mean model process time:  0.6372 s

from sklearn.ensemble import GradientBoostingClassifier
def gradient_model(X_train, X_test, y_train):
    model = GradientBoostingClassifier(n_estimators=200, random_state=7, max_depth=5, learning_rate=0.03)
    model.fit(X_train, y_train)
    predictions = model.predict_proba(X_test)[:, 1]
    return predictions
kfold_plot(train_full, ytrain, gradient_model)

mean scores:  0.911847771023
mean model process time:  4.007 s

import xgboost as xgb
def xgboost_model(X_train, X_test, y_train):
    X_train = xgb.DMatrix(X_train.values, label=y_train.values)
    X_test = xgb.DMatrix(X_test.values)
    params = {'objective': 'binary:logistic', 'eval_metric': 'auc', 'silent': 1, 'seed': 7,
              'max_depth': 6, 'eta': 0.01}    
    model = xgb.train(params, X_train, 600)
    predictions = model.predict(X_test)
    return predictions
kfold_plot(train_full, ytrain, xgboost_model)

mean scores:  0.915372340426
mean model process time:  4.855 s

import lightgbm as lgb
def lightgbm_model(X_train, X_test, y_train):
    X_train = lgb.Dataset(X_train.values, y_train.values)
    params = {'objective': 'binary', 'metric': {'auc'}, 'learning_rate': 0.01, 'max_depth': 6, 'seed': 7}
    model = lgb.train(params, X_train, num_boost_round=600)
    predictions = model.predict(X_test)
    return predictions
kfold_plot(train_full, ytrain, lightgbm_model)

mean scores:  0.921512158055
mean model process time:  0.4236 s

Model comparison

Compare the four models' mean roc_auc scores on the cross-validation folds and their training times.

data_source = ['forest', 'gradient boosting', 'xgboost', 'lightgbm']
y_pos = np.arange(len(data_source))
model_auc = [0.910, 0.912, 0.915, 0.922]
barlist = plt.bar(y_pos, model_auc, align='center', alpha=0.5)
barlist[3].set_color('r')
plt.xticks(y_pos, data_source)
plt.ylabel('roc-auc score')
plt.title('Model Performance')
plt.show()

data_source = ['forest', 'gradient boosting', 'xgboost', 'lightgbm']
y_pos = np.arange(len(data_source))
model_auc = [0.6372,4.007, 4.855, 0.4236]
barlist = plt.bar(y_pos, model_auc, align='center', alpha=0.5)
barlist[3].set_color('r')
plt.xticks(y_pos, data_source)
plt.ylabel('time(s)')
plt.title('Time of Building Model')
plt.show()

auc_forest = [0.87, 0.93, 0.94, 0.86, 0.96]
auc_gb = [0.89, 0.91, 0.93, 0.87, 0.95]
auc_xgb = [0.89,0.92, 0.93, 0.89, 0.95]
auc_lgb = [0.90, 0.93, 0.94, 0.88, 0.96]
print('std of forest auc score: ', np.std(auc_forest))
print('std of gbm auc score: ', np.std(auc_gb))
print('std of xgboost auc score: ', np.std(auc_xgb))
print('std of lightgbm auc score: ', np.std(auc_lgb))
data_source = ['roc-fold-1', 'roc-fold-2', 'roc-fold-3', 'roc-fold-4', 'roc-fold-5']
y_pos = np.arange(len(data_source))
plt.plot(y_pos, auc_forest, 'b-', label='forest')
plt.plot(y_pos, auc_gb, 'r-', label='gbm')
plt.plot(y_pos, auc_xgb, 'y-', label='xgboost')
plt.plot(y_pos, auc_lgb, 'g-', label='lightgbm')
plt.title('roc-auc score of each epoch')
plt.xlabel('epoch')
plt.ylabel('roc-auc score')
plt.legend()
plt.show()
std of forest auc score:  0.0396988664826
std of gbm auc score:  0.0282842712475
std of xgboost auc score:  0.0233238075794
std of lightgbm auc score:  0.0285657137142

Judging by the per-fold roc-auc scores alone across the 5 folds of cross-validation, xgboost is the most stable of the four.

Ensembling the models

The comparison above shows that all four models hold up well under cross-validation, but xgboost and lightgbm come out ahead overall and also train faster. So the next step is to ensemble the models, as follows:

  • First tune each of the four models on the full training set with GridSearchCV to obtain the best sub-models

  • Then combine the sub-models with

    • stacking: aggregate the sub-models with the stacking method from the third-party mlxtend library, and score the ensemble with the same CV procedure as before

    • voting: aggregate the four models with sklearn's built-in VotingClassifier

  • Finally cross-validate the ensembles once more and pick the final model based on the results

First select parameter combinations for each model via cross-validation

def choose_xgb_model(X_train, y_train): 
    tuned_params = [{'objective': ['binary:logistic'], 'learning_rate': [0.01, 0.03, 0.05], 
                     'n_estimators': [100, 150, 200], 'max_depth':[4, 6, 8]}]
    begin_t = time.time()
    clf = GridSearchCV(xgb.XGBClassifier(seed=7), tuned_params, scoring='roc_auc')
    clf.fit(X_train, y_train)
    end_t = time.time()
    print('train time: ', round(end_t - begin_t, 3), 's')
    print('current best parameters of xgboost: ', clf.best_params_)
    return clf.best_estimator_
bst_xgb = choose_xgb_model(train_full, ytrain)
train time:  86.216 s
current best parameters of xgboost:  {'n_estimators': 150, 'objective': 'binary:logistic', 'learning_rate': 0.05, 'max_depth': 4}

def choose_lgb_model(X_train, y_train): 
    tuned_params = [{'objective': ['binary'], 'learning_rate': [0.01, 0.03, 0.05], 
                     'n_estimators': [100, 150, 200], 'max_depth':[4, 6, 8]}]
    begin_t = time.time()
    clf = GridSearchCV(lgb.LGBMClassifier(seed=7), tuned_params, scoring='roc_auc')
    clf.fit(X_train, y_train)
    end_t = time.time()
    print('train time: ', round(end_t - begin_t, 3), 's')
    print('current best parameters of lgb: ', clf.best_params_)
    return clf.best_estimator_
bst_lgb = choose_lgb_model(train_full, ytrain)
train time:  16.602 s
current best parameters of lgb:  {'n_estimators': 150, 'objective': 'binary', 'learning_rate': 0.05, 'max_depth': 4}

First, stacking is used to combine the two models with the best overall performance, lgb and xgb (xgb as the base learner and lgb as the meta-classifier):

from mlxtend.classifier import StackingClassifier

def stacking_model(X_train, X_test, y_train):    
    sclf = StackingClassifier(classifiers=[bst_xgb], use_probas=True, average_probas=False, meta_classifier=bst_lgb)
    sclf.fit(X_train, y_train)
    predictions = sclf.predict_proba(X_test)[:, 1]
    return predictions
kfold_plot(train_full, ytrain, stacking_model)

mean scores:  0.880479989868
mean model process time:  0.8142 s

However, stacking the two individually strongest models performs worse than any of the single models, so next all four models are stacked together.

def choose_forest_model(X_train, y_train):    
    tuned_params = [{'n_estimators': [100, 150, 200], 'max_features': [8, 15, 30], 'max_depth':[4, 8, 10]}]
    begin_t = time.time()
    clf = GridSearchCV(RandomForestClassifier(random_state=7), tuned_params, scoring='roc_auc')
    clf.fit(X_train, y_train)
    end_t = time.time()
    print('train time: ', round(end_t - begin_t, 3), 's')
    print('current best parameters: ', clf.best_params_)
    return clf.best_estimator_
bst_forest = choose_forest_model(train_full, ytrain)
train time:  42.852 s
current best parameters:  {'max_features': 15, 'n_estimators': 150, 'max_depth': 8}

def choose_gradient_model(X_train, y_train):    
    tuned_params = [{'n_estimators': [100, 150, 200], 'learning_rate': [0.03, 0.05, 0.07], 
                     'min_samples_leaf': [8, 15, 30], 'max_depth':[4, 6, 8]}]
    begin_t = time.time()
    clf = GridSearchCV(GradientBoostingClassifier(random_state=7), tuned_params, scoring='roc_auc')
    clf.fit(X_train, y_train)
    end_t = time.time()
    print('train time: ', round(end_t - begin_t, 3), 's')
    print('current best parameters: ', clf.best_params_)
    return clf.best_estimator_
bst_gradient = choose_gradient_model(train_full, ytrain)
train time:  632.815 s
current best parameters:  {'n_estimators': 100, 'learning_rate': 0.03, 'max_depth': 8, 'min_samples_leaf': 30}

def stacking_model2(X_train, X_test, y_train):    
    sclf = StackingClassifier(classifiers=[bst_xgb, bst_forest, bst_gradient], use_probas=True, average_probas=False, 
                              meta_classifier=bst_lgb)
    sclf.fit(X_train, y_train)
    predictions = sclf.predict_proba(X_test)[:, 1]
    return predictions
kfold_plot(train_full, ytrain, stacking_model2)

mean scores:  0.899170466059
mean model process time:  4.1236 s

The four-model stack does somewhat better than the two-model one, but still worse than the single models, so voting is tried next to combine the four models.

from sklearn.ensemble import VotingClassifier

def voting_model(X_train, X_test, y_train):    
    vclf = VotingClassifier(estimators=[('xgb', bst_xgb), ('rf', bst_forest), ('gbm',bst_gradient),
                                       ('lgb', bst_lgb)], voting='soft', weights=[2, 1, 1, 2])
    vclf.fit(X_train, y_train)
    predictions = vclf.predict_proba(X_test)[:, 1]
    return predictions
kfold_plot(train_full, ytrain, voting_model)

mean scores:  0.926889564336
mean model process time:  4.2736 s
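For reference, `voting='soft'` with `weights=[2, 1, 1, 2]` is just a weighted average of the four models' predicted probabilities; a hand computation on made-up probabilities:

```python
import numpy as np

# Hypothetical P(bot) from xgb, forest, gbm and lgb for two test bidders.
probs = np.array([
    [0.9, 0.6, 0.5, 0.8],   # bidder 1
    [0.1, 0.2, 0.3, 0.1],   # bidder 2
])
weights = np.array([2, 1, 1, 2])

# Soft voting: weighted mean of the probabilities across models.
blended = (probs * weights).sum(axis=1) / weights.sum()
print(blended)
```

Weighting xgb and lgb twice as heavily reflects their stronger individual CV scores above.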

Voting over the four models thus yields the highest-scoring model, which is adopted as the final model.

Final prediction on the test file with the ensemble

# predict(train_full, test_full, y_train)
def submit(X_train, X_test, y_train, test_ids):
    predictions = voting_model(X_train, X_test, y_train)

    sub = pd.read_csv('sampleSubmission.csv')
    result = pd.DataFrame()
    result['bidder_id'] = test_ids
    result['outcome'] = predictions
    sub = sub.merge(result, on='bidder_id', how='left')

    # Fill missing values with mean
    mean_pred = np.mean(predictions)
    sub.fillna(mean_pred, inplace=True)

    sub.drop('prediction', axis=1, inplace=True)
    sub.to_csv('result.csv', index=False, header=['bidder_id', 'prediction'])
submit(train_full, test_full, ytrain, test_ids)

The final result was submitted to Kaggle for scoring.

That completes the full pipeline. Many more models and ensembling methods could of course be tried, and the feature engineering still has plenty of room to grow; that is left for you to explore.

