💡 Author: Han Xinzi @ShowMeAI
📘Machine learning practice series : https://www.showmeai.tech/tutorials/41
📘Link to this article : https://www.showmeai.tech/article-detail/287
📢 Disclaimer: All rights reserved. To reprint, please contact the platform and the author and cite the source. 📢 Bookmark ShowMeAI for more great content
Introduction to Machine Learning Pipelines
We know that a machine learning application involves many steps, as shown in the figure "standard machine learning application process": data preprocessing, feature engineering, model training, model iteration and optimization, deployment and inference, and other stages.
For simple analysis and modeling, each component can be built and applied on its own. But in enterprise-level applications, we prefer the different stages of a machine learning project to be assembled into an orderly workflow, so that the steps are easier to understand and reproduce, and problems such as data leakage can be prevented.
Common machine learning modeling tools such as Scikit-Learn provide high-level pipeline functionality that chains transformers, models and other modules together.
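For readers who have not used pipelines before, here is a minimal, self-contained sketch (toy data, unrelated to this article's churn dataset) of what a basic Scikit-Learn pipeline looks like:

from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy data just to show the mechanics
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Each step is a (name, estimator) tuple; every step except the last must be a transformer
basic_ppl = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
basic_ppl.fit(X, y)
print(basic_ppl.score(X, y))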
For how to use Scikit-Learn, you can refer to the ShowMeAI article 📘SKLearn Most Complete Application Guide in the 📘Machine Learning Practical Tutorial, or go to the Scikit-Learn cheat sheet for a high-density summary of knowledge points.
However, with plain SKLearn usage, merging external tool libraries into the pipeline, such as imblearn for handling imbalanced data samples, can lead to incompatibility problems such as the following error:
TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'SMOTE()' (type <class 'imblearn.over_sampling._smote.base.SMOTE'>) doesn't
This article takes "customer churn" as an example to explain how to build a SKLearn pipeline, specifically:
- Build a pipeline that combines Scikit-Learn, imblearn and feature-engine tools
- Extract feature names after encoding steps (e.g. one-hot encoding)
- Plot a feature importance chart
The final solution is shown in the diagram below: multiple modules from different packages combined in one pipeline.
Our program flow below covers the stages mentioned above:
- Step ① : Data preprocessing: data cleaning
- Step ② : Feature engineering: numeric and categorical feature processing
- Step ③ : Sample processing: class imbalance handling
- Step ④ : Modeling: logistic regression, xgboost, random forest and a voting ensemble
- Step ⑤ : Hyperparameter tuning and feature importance analysis
💡 Step 0: Prepare and load data
We start by importing the required tool libraries.
# Data processing and plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Sklearn utilities
from sklearn.model_selection import train_test_split, RandomizedSearchCV, RepeatedStratifiedKFold, cross_validate
# Pipeline-related
from sklearn import set_config
from sklearn.pipeline import make_pipeline, Pipeline
from imblearn.pipeline import Pipeline as imbPipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
# Handling of constant columns, missing columns, duplicate columns, etc.
from feature_engine.selection import DropFeatures, DropConstantFeatures, DropDuplicateFeatures
# Imbalance handling and resampling
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# Models
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.inspection import permutation_importance
from scipy.stats import loguniform
# Pipeline visualization
set_config(display="diagram")
If you haven't heard of imblearn and the feature-engine toolkit before, here's a quick note:
- 📘 imblearn handles class-imbalanced classification problems with a variety of built-in sampling strategies (a short SMOTE example appears in Step ③ below)
- 📘 feature-engine handles feature-column processing (constant columns, missing columns, duplicate columns, etc.), as the quick sketch below shows
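As a quick, hedged sketch of what feature-engine's selectors do on their own (the toy data below is made up for illustration):

import pandas as pd
from feature_engine.selection import DropConstantFeatures, DropDuplicateFeatures

# Toy frame: one constant column and one duplicated column (made-up data)
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [7, 7, 7, 7],   # constant -> removed by DropConstantFeatures
    "c": [1, 2, 3, 4],   # duplicate of "a" -> removed by DropDuplicateFeatures
})
df = DropConstantFeatures(tol=1).fit_transform(df)
df = DropDuplicateFeatures().fit_transform(df)
print(df.columns.tolist())   # ['a']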
Dataset: Newspaper Subscriber Churn
The dataset we use here comes from the Kaggle competition Newspaper churn. It includes 15,856 records of individuals who currently subscribe or previously subscribed to the newspaper.
🏆 Dataset download (Baidu netdisk) : reply "Actual Combat" to the public account "ShowMeAI Research Center", or click here to get the "Newspaper churn dataset" for this article [[14] Machine Learning Modeling Application Pipeline](https://www.showmeai.tech/article-detail/287)
⭐ ShowMeAI official GitHub : https://github.com/ShowMeAI-Hub
The dataset contains demographic information such as household (HH) income, home ownership, presence of children, ethnicity, years of residence, age range and language; geographic information such as address, state, city, county and zip code; the subscription period chosen by the user and the associated billing data; and the user's source channel. The final field indicates whether the customer is still a subscriber (i.e., whether they have churned).
Data preprocessing and splitting
We first load the data and do some preprocessing (e.g. lowercase all column names and convert the target variable to boolean).
# Read the data
data = pd.read_excel("NewspaperChurn new version.xlsx")
# Data preprocessing
data.columns = [k.lower().replace(" ", "_") for k in data.columns]
data.rename(columns={'subscriber':'churn'}, inplace=True)
data['churn'].replace({'NO':False, 'YES':True}, inplace=True)
# Type conversion
data[data.select_dtypes(['object']).columns] = data.select_dtypes(['object']).apply(lambda x: x.astype('category'))
# Separate the feature columns and the label column
X = data.drop("churn", axis=1)
y = data["churn"]
# Train/validation split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
The preprocessed data should look like this:
💡 Step 1: Data Cleaning
The first step of the pipeline we build is "data cleaning": removing columns that are not helpful for prediction (such as ID-like fields, constant-value fields, or duplicate fields).
# Step 1: data cleaning + column handling
ppl = Pipeline([
    ('drop_columns', DropFeatures(['subscriptionid'])),
    ('drop_constant_values', DropConstantFeatures(tol=1, missing_values='ignore')),
    ('drop_duplicates', DropDuplicateFeatures())
])
The above code creates a pipeline object with 3 steps: drop_columns, drop_constant_values and drop_duplicates.
Each step is a tuple: the first element is the name of the step (e.g. drop_columns) and the second element is the transformer (e.g. DropFeatures()).
These simple steps could just as easily be done with external tools such as pandas, but the idea when assembling a pipeline is to integrate as much of the workflow as possible into it.
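As a side note, the step names given in those tuples are also how you reach into the pipeline later; a minimal sketch using the ppl object defined above:

# Retrieve a single step by the name given in its (name, transformer) tuple
print(ppl.named_steps["drop_columns"])
# The same names become prefixes for nested parameters, e.g.:
# ppl.set_params(drop_constant_values__tol=0.98)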
💡 Step 2: Feature Engineering and Data Transformation
After removing irrelevant columns, let's do missing value processing and feature engineering. You can see that the dataset contains different types of columns (numeric and categorical), and we will define two separate workflows for these two types.
For more on feature engineering, see the ShowMeAI article 📘The Most Complete Interpretation of Machine Learning Feature Engineering in the 📘Machine Learning Practical Tutorial.
# Data processing and feature engineering pipeline
ppl = Pipeline([
    # ① Drop irrelevant columns
    ('drop_columns', DropFeatures(['subscriptionid'])),
    ('drop_constant_values', DropConstantFeatures(tol=1, missing_values='ignore')),
    ('drop_duplicates', DropDuplicateFeatures()),
    # ② Missing-value imputation and numeric/categorical feature processing
    ('cleaning', ColumnTransformer([
        # 2.1: numeric fields - imputation and min-max scaling
        ('num', make_pipeline(
            SimpleImputer(strategy='mean'),
            MinMaxScaler()),
            make_column_selector(dtype_include='int64')
        ),
        # 2.2: categorical fields - imputation and one-hot encoding
        ('cat', make_pipeline(
            SimpleImputer(strategy='most_frequent'),
            OneHotEncoder(sparse=False, handle_unknown='ignore')),
            make_column_selector(dtype_include='category')
        )])
    )
])
We add a step named cleaning, which corresponds to a ColumnTransformer object.
Inside the ColumnTransformer, two sub-pipelines are defined: one for numeric columns and one for categorical columns. The make_column_selector function ensures the correct column types are selected each time.
Here the dtype_include parameter selects columns of the matching dtype; the selector also accepts a regular expression pattern, and ColumnTransformer can alternatively take a plain list of column names, as the sketch below shows.
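A minimal illustration of make_column_selector (the column names and values below are made up, not taken from the churn dataset):

import pandas as pd
from sklearn.compose import make_column_selector

df = pd.DataFrame({
    "age": pd.Series([25, 40, 31], dtype="int64"),
    "city": pd.Series(["LA", "SF", "SD"], dtype="category"),
    "zip_code": pd.Series([90001, 94016, 92101], dtype="int64"),
})

# The selector is called with the dataframe and returns the matching column names
print(make_column_selector(dtype_include="int64")(df))   # ['age', 'zip_code']
print(make_column_selector(pattern="zip")(df))           # ['zip_code']
# Alternatively, ColumnTransformer also accepts a plain list such as ['age', 'zip_code']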
💡 Step 3: Class Imbalance Processing (Data Sampling)
In problem scenarios such as churn prediction and fraud detection, a major challenge is "class imbalance": the number of churned users is small relative to the number of non-churned users.
Here we will use the imblearn tool library to deal with the class imbalance problem; it provides a range of data generation and resampling methods to alleviate it. In this article, the SMOTE method is used to oversample the minority class, as the short sketch below shows.
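A quick, hedged sketch of what SMOTE does on its own (synthetic data, not this article's dataset):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic imbalanced data: roughly 90% majority class vs 10% minority class
X_toy, y_toy = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print(Counter(y_toy))                              # about 900 negatives vs 100 positives

# SMOTE synthesizes new minority-class samples by interpolating between neighbours
X_res, y_res = SMOTE(random_state=42).fit_resample(X_toy, y_toy)
print(Counter(y_res))                              # both classes now have equal counts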
Handling class imbalance with SMOTE
The pipeline after adding the SMOTE step is as follows:
# Overall processing pipeline
ppl = Pipeline([
    # ① Drop irrelevant columns
    ('drop_columns', DropFeatures(['subscriptionid'])),
    ('drop_constant_values', DropConstantFeatures(tol=1, missing_values='ignore')),
    ('drop_duplicates', DropDuplicateFeatures()),
    # ② Missing-value imputation and numeric/categorical feature processing
    ('cleaning', ColumnTransformer([
        # 2.1: numeric fields - imputation and min-max scaling
        ('num', make_pipeline(
            SimpleImputer(strategy='mean'),
            MinMaxScaler()),
            make_column_selector(dtype_include='int64')
        ),
        # 2.2: categorical fields - imputation and one-hot encoding
        ('cat', make_pipeline(
            SimpleImputer(strategy='most_frequent'),
            OneHotEncoder(sparse=False, handle_unknown='ignore')),
            make_column_selector(dtype_include='category')
        )])
    ),
    # ③ Class imbalance handling: resampling
    ('smote', SMOTE())
])
Pipeline feature check
Before finally building the ensemble classifier model, let's look at the feature names and other information processed through the pipeline.
The pipeline object provides a function called get_feature_names_out() through which we can get the feature names. But before using it, we have to fit the pipeline on the dataset. Since step ③ (SMOTE) is a resampling step rather than a feature transformer, we skip it for now and fit only steps ① and ②.
# Fit the data and retrieve the feature names produced by the pipeline
ppl_fts = ppl[0:4]
ppl_fts.fit(X_train, y_train)
features = ppl_fts.get_feature_names_out()
pd.Series(features)
The result looks like this:
0 num__year_of_residence
1 num__zip_code
2 num__reward_program
3 cat__hh_income_$ 20,000 - $29,999
4 cat__hh_income_$ 30,000 - $39,999
...
12122 cat__source_channel_TMC
12123 cat__source_channel_TeleIn
12124 cat__source_channel_TeleOut
12125 cat__source_channel_VRU
12126 cat__source_channel_iSrvices
Length: 12127, dtype: object
Due to one-hot encoding, many feature names starting with cat__ (for categorical) have been created.
If you want the same pipeline visualization as the flow chart above, simply add set_config(display="diagram") to your code before displaying the pipeline object.
💡 Step 4: Build an ensemble classifier
Next we train multiple models and use a powerful ensemble model (voting classifier) to solve the current problem.
Regarding the logistic regression, random forest and xgboost models used here, you can see the detailed principle explanation in ShowMeAI's 📘 Graphical Machine Learning Algorithm Tutorial .
# Logistic regression model
lr = LogisticRegression(warm_start=True, max_iter=400)
# Random forest model
rf = RandomForestClassifier()
# xgboost
xgb = XGBClassifier(tree_method="hist", verbosity=0, silent=True)
# Ensemble with a voting classifier
lr_xgb_rf = VotingClassifier(estimators=[('lr', lr), ('xgb', xgb), ('rf', rf)],
                             voting='soft')
After defining the ensemble model, we add it to our pipeline as well.
# Overall processing pipeline
ppl = imbPipeline([
    # ① Drop irrelevant columns
    ('drop_columns', DropFeatures(['subscriptionid'])),
    ('drop_constant_values', DropConstantFeatures(tol=1, missing_values='ignore')),
    ('drop_duplicates', DropDuplicateFeatures()),
    # ② Missing-value imputation and numeric/categorical feature processing
    ('cleaning', ColumnTransformer([
        # 2.1: numeric fields - imputation and min-max scaling
        ('num', make_pipeline(
            SimpleImputer(strategy='mean'),
            MinMaxScaler()),
            make_column_selector(dtype_include='int64')
        ),
        # 2.2: categorical fields - imputation and one-hot encoding
        ('cat', make_pipeline(
            SimpleImputer(strategy='most_frequent'),
            OneHotEncoder(sparse=False, handle_unknown='ignore')),
            make_column_selector(dtype_include='category')
        )])
    ),
    # ③ Class imbalance handling: resampling
    ('smote', SMOTE()),
    # ④ Voting ensemble
    ('ensemble', lr_xgb_rf)
])
You may notice that the Pipeline used on the first line has been replaced by imbPipeline from imblearn. This is a critical change: if we use SKLearn's Pipeline, the error mentioned at the beginning of the article occurs during fitting:
TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'SMOTE()' (type <class 'imblearn.over_sampling._smote.base.SMOTE'>) doesn't
At this point, we have built the basic pipeline process.
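As an optional sanity check (a minimal sketch) before any tuning, the assembled pipeline can already be fitted and evaluated end to end:

# Fit the whole pipeline (cleaning -> encoding -> SMOTE -> voting ensemble) in one call
ppl.fit(X_train, y_train)

# Evaluate on the held-out validation split; predict_proba gives churn probabilities
val_scores = ppl.predict_proba(X_val)[:, 1]
print("AUC:", roc_auc_score(y_val, val_scores))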
💡 Step 5: Hyperparameter Tuning and Feature Importance
Hyperparameter Tuning
In the modeling pipeline we built, many components have hyperparameters that can be adjusted, and these hyperparameters affect the final model performance. To tune the pipeline's hyperparameters we use random search (RandomizedSearchCV); the code is as follows.
For the detailed principle knowledge of search parameter tuning, you can check ShowMeAI's introduction in the article 📘Network Optimization: Hyperparameter Tuning, Regularization, Batch Normalization and Program Framework .
Pay special attention to the naming rules in the code.
# Hyperparameter tuning
params = {
'ensemble__lr__solver': ['newton-cg', 'lbfgs', 'liblinear'],
'ensemble__lr__penalty': ['none', 'l1', 'l2', 'elasticnet'],
'ensemble__lr__C': loguniform(1e-5, 100),
'ensemble__xgb__learning_rate': [0.1],
'ensemble__xgb__max_depth': [7, 10, 15, 20],
'ensemble__xgb__min_child_weight': [10, 15, 20, 25],
'ensemble__xgb__colsample_bytree': [0.8, 0.9, 1],
'ensemble__xgb__n_estimators': [300, 400, 500, 600],
'ensemble__xgb__reg_alpha': [0.5, 0.2, 1],
'ensemble__xgb__reg_lambda': [2, 3, 5],
'ensemble__xgb__gamma': [1, 2, 3],
'ensemble__rf__max_depth': [7, 10, 15, 20],
'ensemble__rf__min_samples_leaf': [1, 2, 4],
'ensemble__rf__min_samples_split': [2, 5, 10],
'ensemble__rf__n_estimators': [300, 400, 500, 600],
}
# Random search for hyperparameter tuning
rsf = RepeatedStratifiedKFold(random_state=42)
clf = RandomizedSearchCV(ppl, params, scoring='roc_auc', verbose=2, cv=rsf)
clf.fit(X_train, y_train)
# Print results
print("Best Score: ", clf.best_score_)
print("Best Params: ", clf.best_params_)
print("AUC:", roc_auc_score(y_val, clf.predict(X_val)))
To explain the hyperparameter naming in the code above (e.g. ensemble__lr__solver):
- First part (ensemble__): the name of the pipeline step that holds our VotingClassifier
- Second part (lr__): the name of the model inside the ensemble
- Third part (solver): the name of that model's hyperparameter
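If you are ever unsure which parameter names are valid, they can all be listed straight from the pipeline object (a small sketch using the ppl defined above):

# List every tunable parameter exposed by the pipeline, in step__estimator__param form
for name in ppl.get_params():
    if name.startswith("ensemble__lr__"):
        print(name)   # e.g. ensemble__lr__C, ensemble__lr__penalty, ensemble__lr__solver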
Since this is a class-imbalanced scenario, we use repeated stratified k-fold cross-validation (RepeatedStratifiedKFold).
This hyperparameter tuning step is optional; in simple scenarios you can use the default parameters, or fix the hyperparameters when defining the models.
Feature importance plot
To keep our model from being a black box, we want to interpret it. The most important part of this is attribution analysis: we want to understand which features matter, so here we plot the feature importance.
# https://inria.github.io/scikit-learn-mooc/python_scripts/dev_features_importance.html
# Plot feature importance
def plot_feature_importances(perm_importance_result, feat_name):
    """Bar-plot the permutation feature importance"""
    fig, ax = plt.subplots()
    indices = perm_importance_result['importances_mean'].argsort()
    plt.barh(range(len(indices)),
             perm_importance_result['importances_mean'][indices],
             xerr=perm_importance_result['importances_std'][indices])
    ax.set_yticks(range(len(indices)))
    ax.set_title("Permutation importance")
    tmp = np.array(feat_name)
    _ = ax.set_yticklabels(tmp[indices])

# Get the feature names
ppl_fts = ppl[0:4]
ppl_fts.fit(X_train, y_train)
features = ppl_fts.get_feature_names_out()

# Compute permutation feature importance and plot it
perm_importance_result_train = permutation_importance(clf, X_train, y_train, random_state=42)
plot_feature_importances(perm_importance_result_train, features)
The resulting graph after running the code above is as follows. We can see that the feature hh_income dominates the prediction. Since this feature is actually ordinal (e.g. 30-40k is smaller than 150-175k), a different encoding such as label/ordinal encoding could be used instead; a short sketch follows.
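A hedged sketch of that alternative (the bracket strings below are illustrative and may not match the dataset's exact labels): treat hh_income as an ordered categorical and use its integer codes instead of one-hot columns.

# Hypothetical ordering of the income brackets, from lowest to highest (illustrative only)
income_order = ["$ 20,000 - $29,999", "$ 30,000 - $39,999", "$ 40,000 - $49,999"]  # ... and so on

# Ordered categorical: .cat.codes yields 0, 1, 2, ... in that order (-1 for unlisted labels)
hh_income_ordinal = (
    data["hh_income"]
    .astype(pd.CategoricalDtype(categories=income_order, ordered=True))
    .cat.codes
)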
That completes the machine learning pipeline construction process. As you can see, a pipeline ties the different stages together so they can be run and tuned in one go, making the code and the workflow more concise and efficient.
References
- 🏆 Dataset download (Baidu netdisk): reply "Actual Combat" to the public account "ShowMeAI Research Center", or click here to get the "Newspaper churn dataset" for this article [[14] Machine Learning Modeling Application Pipeline](https://www.showmeai.tech/article-detail/287)
- ⭐ ShowMeAI official GitHub: https://github.com/ShowMeAI-Hub
- 📘 Machine Learning Practical Tutorial: https://www.showmeai.tech/tutorials/41
- 📘 SKLearn Most Complete Application Guide: https://www.showmeai.tech/article-detail/203
- 📘 Imblearn for class-imbalanced classification: https://imbalanced-learn.org/stable/
- 📘 feature-engine for feature-column processing (constant, missing, duplicate columns, etc.): https://feature-engine.readthedocs.io/en/latest/
- 📘 The Most Complete Interpretation of Machine Learning Feature Engineering: https://www.showmeai.tech/article-detail/208
- 📘 Illustrated Machine Learning Algorithms Tutorial: https://www.showmeai.tech/tutorials/34
- 📘 Network Optimization: Hyperparameter Tuning, Regularization, Batch Normalization and Program Framework: https://www.showmeai.tech/article-detail/218
- 📘 Scikit-Learn cheat sheet: https://www.showmeai.tech/article-detail/108