I am using OneHotEncoder to encode a few categorical variables (e.g. Sex and AgeGroup). The feature names the encoder produces look like 'x0_female', 'x0_male', 'x1_0.0', 'x1_15.0', etc.

>>> import pandas as pd
>>> from sklearn.preprocessing import OneHotEncoder

>>> train_X = pd.DataFrame({'Sex': ['male', 'female'] * 3, 'AgeGroup': [0, 15, 30, 45, 60, 75]})

>>> encoder = OneHotEncoder()
>>> train_X_encoded = encoder.fit_transform(train_X[['Sex', 'AgeGroup']])

>>> encoder.get_feature_names()
array(['x0_female', 'x0_male', 'x1_0.0', 'x1_15.0', 'x1_30.0', 'x1_45.0',
       'x1_60.0', 'x1_75.0'], dtype=object)

Is there a way to tell OneHotEncoder to build the feature names with the column name as a prefix, e.g. Sex_female, AgeGroup_15.0, the way pandas get_dummies() does?
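For reference, here is a minimal sketch of the pandas behaviour being asked for, using the same toy data as above. The variable name `dummies` is only illustrative. With integer age groups, the names come out as AgeGroup_0, AgeGroup_15 and so on (no `.0` suffix).

```python
import pandas as pd

# Same toy data as in the question.
train_X = pd.DataFrame({
    "Sex": ["male", "female"] * 3,
    "AgeGroup": [0, 15, 30, 45, 60, 75],
})

# get_dummies prefixes each indicator column with the source column name.
dummies = pd.get_dummies(train_X, columns=["Sex", "AgeGroup"])

print(list(dummies.columns))
```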

Originally posted by Supratim Haldar; translation licensed under CC BY-SA 4.0

python-3.x scikit-learn one-hot-encoding

2 Answers

Community wiki

Posted on 2022-11-16

✓ Accepted

You can pass a list with the original column names to get_feature_names.

>>> encoder.get_feature_names(['Sex', 'AgeGroup'])

array(['Sex_female', 'Sex_male', 'AgeGroup_0', 'AgeGroup_15',
       'AgeGroup_30', 'AgeGroup_45', 'AgeGroup_60', 'AgeGroup_75'],
      dtype=object)

Deprecated: get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.
- per the sklearn.preprocessing.OneHotEncoder documentation.

>>> encoder.get_feature_names_out(['Sex', 'AgeGroup'])

array(['Sex_female', 'Sex_male', 'AgeGroup_0', 'AgeGroup_15',
       'AgeGroup_30', 'AgeGroup_45', 'AgeGroup_60', 'AgeGroup_75'],
      dtype=object)

Originally posted by kabochkov; translation licensed under CC BY-SA 4.0

Community wiki

Posted on 2022-11-16

type(train_X_encoded) → scipy.sparse.csr.csr_matrix
- Use pandas.DataFrame.sparse.from_spmatrix to load the sparse matrix directly; otherwise convert it to a dense matrix and load that with pandas.DataFrame.

# pandas.DataFrame.sparse.from_spmatrix will load this sparse matrix
>>> print(train_X_encoded)

  (0, 1)    1.0
  (0, 2)    1.0
  (1, 0)    1.0
  (1, 3)    1.0
  (2, 1)    1.0
  (2, 4)    1.0
  (3, 0)    1.0
  (3, 5)    1.0
  (4, 1)    1.0
  (4, 6)    1.0
  (5, 0)    1.0
  (5, 7)    1.0

# pandas.DataFrame will load this dense matrix
>>> print(train_X_encoded.todense())

[[0. 1. 1. 0. 0. 0. 0. 0.]
 [1. 0. 0. 1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 1. 0. 0. 0.]
 [1. 0. 0. 0. 0. 1. 0. 0.]
 [0. 1. 0. 0. 0. 0. 1. 0.]
 [1. 0. 0. 0. 0. 0. 0. 1.]]

import pandas as pd

column_name = encoder.get_feature_names_out(['Sex', 'AgeGroup'])
one_hot_encoded_frame = pd.DataFrame.sparse.from_spmatrix(train_X_encoded, columns=column_name)

# display(one_hot_encoded_frame)
   Sex_female  Sex_male  AgeGroup_0  AgeGroup_15  AgeGroup_30  AgeGroup_45  AgeGroup_60  AgeGroup_75
0         0.0       1.0         1.0          0.0          0.0          0.0          0.0          0.0
1         1.0       0.0         0.0          1.0          0.0          0.0          0.0          0.0
2         0.0       1.0         0.0          0.0          1.0          0.0          0.0          0.0
3         1.0       0.0         0.0          0.0          0.0          1.0          0.0          0.0
4         0.0       1.0         0.0          0.0          0.0          0.0          1.0          0.0
5         1.0       0.0         0.0          0.0          0.0          0.0          0.0          1.0

As of scikit-learn v1.0, use get_feature_names_out instead of get_feature_names.
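If the same code has to run on scikit-learn versions both before and after 1.0, one possible approach is a small wrapper that prefers the new method and falls back to the old one. This is only a sketch: `feature_names` is a hypothetical helper, not part of scikit-learn.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder


def feature_names(encoder, input_features=None):
    """Hypothetical helper: return encoded feature names on old and new scikit-learn."""
    if hasattr(encoder, "get_feature_names_out"):
        # scikit-learn >= 1.0
        return list(encoder.get_feature_names_out(input_features))
    # scikit-learn < 1.0 only has the older method
    return list(encoder.get_feature_names(input_features))


# Small illustrative frame, reusing the question's column names.
train_X = pd.DataFrame({"Sex": ["male", "female"], "AgeGroup": [0, 15]})
encoder = OneHotEncoder().fit(train_X[["Sex", "AgeGroup"]])

names = feature_names(encoder, ["Sex", "AgeGroup"])
print(names)
```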

Originally posted by Nursnaaz; translation licensed under CC BY-SA 4.0
