I'm confused: if you apply OneHotEncoder first and then StandardScaler, there is a problem, because the scaler will also scale the columns that OneHotEncoder has already transformed. Is there a way to perform the encoding and the scaling at the same time and then concatenate the results together?
Originally posted by James Wong; the translation follows the CC BY-SA 4.0 license.
Since version 0.20, scikit-learn provides sklearn.compose.ColumnTransformer for exactly this use case (see "Column Transformer with Mixed Types" in the documentation). It lets you scale the numeric features and one-hot encode the categorical features in a single pass. Below is the official example (the code can be found in the scikit-learn documentation):
# Author: Pedro Morales <part.morales@gmail.com>
#
# License: BSD 3 clause

from __future__ import print_function

import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV

np.random.seed(0)

# Read data from Titanic dataset.
titanic_url = ('https://raw.githubusercontent.com/amueller/'
               'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')
data = pd.read_csv(titanic_url)

# We will train our classifier with the following features:
# Numeric Features:
# - age: float.
# - fare: float.
# Categorical Features:
# - embarked: categories encoded as strings {'C', 'S', 'Q'}.
# - sex: categories encoded as strings {'female', 'male'}.
# - pclass: ordinal integers {1, 2, 3}.

# We create the preprocessing pipelines for both numeric and categorical data.
numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])

X = data.drop('survived', axis=1)
y = data['survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))
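The snippet above imports GridSearchCV but does not use it. As a hedged sketch (not part of the answer's code), hyperparameters inside the pipeline can be tuned through scikit-learn's nested step__parameter naming convention; the particular grid values below are illustrative assumptions:

# Reach into the nested pipeline: tune the numeric imputer strategy
# and the logistic-regression regularization strength C.
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0, 10, 100],
}

grid_search = GridSearchCV(clf, param_grid, cv=5)
grid_search.fit(X_train, y_train)

print("best params:", grid_search.best_params_)
print("best CV score: %.3f" % grid_search.best_score_)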
Note: this method is experimental, and some behaviors may change between versions without a deprecation cycle.
Originally posted by NiYanchun; the translation follows the CC BY-SA 4.0 license.
Sure. Just scale and one-hot encode the individual columns separately, as needed:
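The code for that answer is not shown above; a minimal sketch of the idea, assuming a DataFrame X and hypothetical numeric_cols / categorical_cols lists naming the columns to transform, might look like this:

import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Scale only the numeric columns.
scaled = StandardScaler().fit_transform(X[numeric_cols])

# One-hot encode only the categorical columns; .toarray() densifies the
# sparse matrix so it can be stacked with the scaled block.
encoded = OneHotEncoder(handle_unknown='ignore').fit_transform(
    X[categorical_cols]).toarray()

# Concatenate the two feature blocks column-wise.
X_processed = np.hstack([scaled, encoded])

ColumnTransformer from the first answer does essentially this wiring for you, with the added benefit that the fitted transformers can be reused on new data at prediction time.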