I'm new to this, so please bear with me

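Please point me in the right direction on this. How can I convert a column that contains a continuous variable into a discrete variable? I have prices for some financial instruments, and I'm trying to convert them into some kind of categorical value. I thought I could do the following.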

 labels = df['PRICE'].astype('category').cat.categories.tolist()
replace_map_comp = {'PRICE' : {k: v for k,v in zip(labels,list(range(1,len(labels)+1)))}}
print(replace_map_comp)

However, when I try to run a RandomForestClassifier on a subset of the data, I get an error.

 from sklearn.ensemble import RandomForestClassifier
features = np.array(['INTEREST',
'SPREAD',
'BID',
'ASK',
'DAYS'])
clf = RandomForestClassifier()
clf.fit(df[features], df1['PRICE'])

The error message is: ValueError: Unknown label type: 'continuous'

I'm pretty sure this is close, but something is definitely wrong here.

Updated code:

 # copy only numerics to new DF
df1 = df.select_dtypes(include=[np.number])

from sklearn import linear_model
features = np.array(['INTEREST',
'SPREAD',
'BID',
'ASK',
'DAYS'])
reg = linear_model.LinearRegression()
reg.fit(df1[features], df1['PRICE'])

# problems start here...
importances = clf.feature_importances_
sorted_idx = np.argsort(importances)

padding = np.arange(len(features)) + 0.5
plt.barh(padding, importances[sorted_idx], align='center')
plt.yticks(padding, features[sorted_idx])
plt.xlabel("Relative Importance")
plt.title("Variable Importance")
plt.show()

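Error: AttributeError: 'LinearRegression' object has no attribute 'feature_importances_'

I'm following the concept from here:

http://blog.yhat.com/tutorials/5-Feature-Engineering.html

FYI, I tried one-hot encoding, but the conversion made the columns too large and I got an error. Maybe the way to handle this is to work with a smaller sample of the data. With 250k rows, I'd guess 100k rows would represent the whole dataset reasonably well. Maybe that's the way to go. Just thinking out loud here.

Originally posted by ASH; translation licensed under CC BY-SA 4.0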

python python-3.x scikit-learn random-forest

2 answers

Community wiki

Posted on 2022-11-17

✓ Accepted

Pandas has a cut function that works for what you're trying to do:

 import pandas as pd
import numpy as np
from scipy.stats import norm
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
n_bins = 5
df = pd.DataFrame(data=norm.rvs(loc=500, scale=50, size=100),
                  columns=['PRICE'])
y = label_encoder.fit_transform(pd.cut(df['PRICE'], n_bins, retbins=True)[0])
rfc = RandomForestClassifier(n_estimators=100, verbose=2)
rfc.fit(df[['PRICE']], y)

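Here's an example. First, know that there are a hundred different ways to do this, so this isn't necessarily the "right" way; it's just one way.

Main idea: use the Pandas cut function to create buckets for the continuous data. The number of buckets is up to you. In this example, I chose 5 for n_bins.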
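To make the bucketing step concrete, here is a minimal sketch of what pd.cut returns on its own. The synthetic prices and the random seed are assumptions for illustration only; they are not the asker's data.

```python
import numpy as np
import pandas as pd

# Synthetic prices; the distribution and seed are illustrative assumptions.
rng = np.random.default_rng(0)
prices = pd.Series(rng.normal(loc=500, scale=50, size=100), name="PRICE")

n_bins = 5
# With an integer `bins`, pd.cut builds equal-width buckets over the data range.
# retbins=True returns both the binned Series and the array of bin edges.
binned, edges = pd.cut(prices, bins=n_bins, retbins=True)

print(binned.cat.categories)            # the interval buckets
print(binned.value_counts(sort=False))  # number of rows per bucket
print(edges)                            # n_bins + 1 edges delimit n_bins buckets
```

Equal-width buckets can end up with very uneven row counts. If roughly equal-sized classes matter, pd.qcut (quantile-based buckets) is one alternative worth trying.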

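Once you have the bins, you can convert them into classes with sklearn's LabelEncoder(). That makes it easier to refer back to the classes later. It acts as a storage system for your classes, so you can keep track of them. Use label_encoder.classes_ to see the classes.

After these steps, y will look like this: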

 array([1, 2, 2, 0, 2, 2, 0, 1, 3, 1, 1, 2, 1, 4, 4, 2, 3, 1, 1, 3, 2, 3,
       2, 2, 2, 0, 2, 2, 4, 1, 3, 2, 1, 3, 3, 2, 1, 4, 3, 1, 1, 4, 2, 3,
       3, 2, 1, 1, 3, 4, 3, 3, 3, 2, 1, 2, 3, 1, 3, 1, 2, 0, 1, 1, 2, 4,
       1, 2, 2, 2, 0, 1, 0, 3, 3, 4, 2, 3, 3, 2, 3, 1, 3, 4, 2, 2, 2, 0,
       0, 0, 2, 2, 0, 4, 2, 3, 2, 2, 2, 2])

You have now converted the continuous data into classes, which can be passed to RandomForestClassifier().
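As a rough sketch of how the encoder lets you map integer classes back to buckets, the example below uses named buckets. The bucket names and the synthetic data are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Synthetic prices; the distribution and seed are illustrative assumptions.
rng = np.random.default_rng(0)
prices = pd.Series(rng.normal(loc=500, scale=50, size=100), name="PRICE")

# Hypothetical bucket names, not part of the original answer.
names = ["very_low", "low", "mid", "high", "very_high"]
binned = pd.cut(prices, bins=5, labels=names)

# Encode the bucket names as integer classes.
bucket_strings = binned.astype(str).to_numpy()
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(bucket_strings)

print(label_encoder.classes_)  # class names in sorted (alphabetical) order

# Map integer classes (for example, classifier predictions) back to names.
decoded = label_encoder.inverse_transform(y)

# Alternative: the categorical's own codes follow bucket order (0 = lowest).
ordered_codes = binned.cat.codes.to_numpy()
```

One thing to watch: LabelEncoder sorts its classes, so with string labels the integers follow alphabetical order rather than price order. If the ordering of the buckets matters, binned.cat.codes keeps it.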

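Originally posted by Jarad; translation licensed under CC BY-SA 4.0

Community wiki

Posted on 2022-11-17

Classifiers are a good fit when the variable you are predicting consists of classes, and prices are not classes unless you collapse them into explicit classes: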

 df['CLASS'] = np.where( df.PRICE > 1000, 1, 0) # Classify price above 1000 or less

When the variable you are predicting is continuous, regression methods are much preferable.

 from sklearn import linear_model
# linear_model is a module and cannot be called; instantiate an estimator from it
reg = linear_model.LinearRegression()
reg.fit(df[features], df['CLASS'])
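If the goal is a feature-importance plot for the continuous PRICE itself, one option is a tree-based regressor, which exposes feature_importances_ (LinearRegression exposes coef_ instead). This is a minimal sketch rather than the answerer's method: RandomForestRegressor is swapped in, and the data and the relationship between columns are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

features = np.array(["INTEREST", "SPREAD", "BID", "ASK", "DAYS"])

# Synthetic stand-in for the asker's data; the PRICE formula is an assumption.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(500, len(features))), columns=features)
df["PRICE"] = 100 + 5 * df["BID"] + 2 * df["ASK"] + rng.normal(scale=0.5, size=500)

# A regressor accepts a continuous target, so no binning is needed.
reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(df[features], df["PRICE"])

# Impurity-based importances; they sum to 1 across features.
importances = reg.feature_importances_
sorted_idx = np.argsort(importances)
for name, score in zip(features[sorted_idx], importances[sorted_idx]):
    print(f"{name}: {score:.3f}")
```

The importances and sorted_idx arrays can be passed to the plt.barh plotting code from the question unchanged.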

Originally posted by DeepBlue; translation licensed under CC BY-SA 4.0
