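I'm learning sklearn, and I don't quite understand the difference between the four outputs of train_test_split() or why there are four of them.

I found some examples in the documentation, but they weren't enough to clear up my doubts.

Does the code use X_train to predict X_test, or use X_train to predict y_test?

What is the difference between train and test? Do I use train to predict test, or something like that?

I'm quite confused about this. Below is the example given in the documentation.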

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
>>> list(y)
[0, 1, 2, 3, 4]
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.33, random_state=42)
...
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> y_train
[2, 0, 3]
>>> X_test
array([[2, 3],
       [8, 9]])
>>> y_test
[1, 4]
>>> train_test_split(y, shuffle=False)
[[0, 1, 2], [3, 4]]

Originally posted by Jancer Lima; translation licensed under CC BY-SA 4.0

Tags: python, machine-learning, scikit-learn, sklearn-pandas, supervised-learning

2 Answers

Community wiki, posted 2023-01-08

✓ Accepted

Here is a dummy pandas.DataFrame as an example:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

df = pd.DataFrame({'X1': [100, 120, 140, 200, 230, 400, 500, 540, 600, 625],
                   'X2': [14, 15, 22, 24, 23, 31, 33, 35, 40, 40],
                   'Y': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]})

Here we have 3 columns: X1, X2 and Y. Suppose X1 and X2 are your independent variables and the Y column is your dependent variable.

X = df[['X1', 'X2']]
y = df['Y']

With sklearn.model_selection.train_test_split you create 4 portions of the data, which are used for fitting the model and for predicting values.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

X_train, X_test, y_train, y_test

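Now:

1). X_train - This holds the independent variables for the rows used to train the model. Because we specified test_size = 0.4, 60% of the observations from your complete data are used to train/fit the model and the remaining 40% are used to test it.

2). X_test - This is the remaining 40% of the independent variables. It is not used in the training phase; it is used to make predictions in order to test the accuracy of the model.

3). y_train - This is your dependent variable, which the model needs to learn to predict. It holds the class labels for the rows in X_train; the dependent variable has to be supplied when training/fitting the model.

4). y_test - This holds the class labels for the test data. These labels are compared with the predicted classes to measure the accuracy of the model.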
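To make the four parts concrete, here is a minimal sketch that rebuilds the same dummy DataFrame and inspects what train_test_split returns. It only checks sizes and row pairing, and it assumes the same test_size and random_state as above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'X1': [100, 120, 140, 200, 230, 400, 500, 540, 600, 625],
                   'X2': [14, 15, 22, 24, 23, 31, 33, 35, 40, 40],
                   'Y': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]})
X = df[['X1', 'X2']]
y = df['Y']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42)

# 10 rows with test_size=0.4 give 6 training rows and 4 test rows
print(X_train.shape, y_train.shape)  # (6, 2) (6,)
print(X_test.shape, y_test.shape)    # (4, 2) (4,)

# Features and labels keep the same row index, so each X row stays paired with its y
print(X_train.index.equals(y_train.index))  # True
print(X_test.index.equals(y_test.index))    # True

# No row ends up in both the training part and the test part
print(set(X_train.index) & set(X_test.index))  # set()
```

The point of the pairing check is that X_train and y_train describe the same rows, and X_test and y_test describe the same (different) rows.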

Now you can fit a model on this data. Let's fit sklearn.linear_model.LogisticRegression:

logreg = LogisticRegression()
logreg.fit(X_train, y_train)  # Training happens here, on the training data only
y_pred_logreg = logreg.predict(X_test)  # Predict labels for the held-out test features
print('Logistic Regression Train accuracy %s' % logreg.score(X_train, y_train))  # Train accuracy
# Logistic Regression Train accuracy 0.8333333333333334
print('Logistic Regression Test accuracy %s' % accuracy_score(y_test, y_pred_logreg))  # Test accuracy
# Logistic Regression Test accuracy 0.5
print(confusion_matrix(y_test, y_pred_logreg))  # Confusion matrix
print(classification_report(y_test, y_pred_logreg))  # Classification report
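To answer the original question directly: the model learns from X_train together with y_train, predicts from X_test, and those predictions are then compared with y_test. The sketch below spells out that comparison by hand; the names `model` and `manual_accuracy` and the `max_iter=1000` setting are choices made for this illustration, not part of the answer above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'X1': [100, 120, 140, 200, 230, 400, 500, 540, 600, 625],
                   'X2': [14, 15, 22, 24, 23, 31, 33, 35, 40, 40],
                   'Y': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]})
X_train, X_test, y_train, y_test = train_test_split(
    df[['X1', 'X2']], df['Y'], test_size=0.4, random_state=42)

# max_iter is raised only as a precaution for this unscaled toy data (an assumption)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)     # learn from training features AND training labels
y_pred = model.predict(X_test)  # predict from test features only; y_test is not seen

# Test accuracy is the fraction of predictions that match the true test labels
manual_accuracy = np.mean(y_pred == y_test.to_numpy())

# The library helpers compute the same quantity
print(np.isclose(manual_accuracy, accuracy_score(y_test, y_pred)))
print(np.isclose(manual_accuracy, model.score(X_test, y_test)))
```

So X_train is never used to predict X_test; X_test is the input to predict(), and y_test is the answer key the predictions are graded against.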

You can read more about metrics here

Read more about data splitting here

Hope this helps :)

Originally posted by ManojK; translation licensed under CC BY-SA 4.0

Community wiki, posted 2023-01-08

Suppose we have this data:

Age    Sex       Disease
----  ------ |  ---------

  X_train    |   y_train   )
                           )
 5       F   |  A Disease  )
 15      M   |  B Disease  )
 23      M   |  B Disease  ) training
 39      M   |  B Disease  ) data
 61      F   |  C Disease  )
 55      M   |  F Disease  )
 76      F   |  D Disease  )
 88      F   |  G Disease  )
-------------|------------

  X_test     |    y_test

 63      M   |  C Disease  )
 46      F   |  C Disease  ) test
 28      M   |  B Disease  ) data
 33      F   |  B Disease  )

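X_train contains the feature values (Age and Sex => training data).

y_train contains the target output corresponding to the X_train values (Disease => training data), that is, the values the model should arrive at after the training process.

If the model is successful, the values it generates after training (its predictions for X_train) should be very close to or the same as the y_train values.

X_test contains the feature values to be tested after training (Age and Sex => test data).

y_test contains the target output corresponding to X_test (Disease => test data). It is compared with the values the trained model predicts for the given X_test, to determine how successful the model is.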
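The table above can be turned into a small runnable sketch. The choice of DecisionTreeClassifier, the 0/1 encoding of Sex, and the variable names are assumptions made for illustration; the answer itself does not name a model.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    'Age': [5, 15, 23, 39, 61, 55, 76, 88, 63, 46, 28, 33],
    'Sex': ['F', 'M', 'M', 'M', 'F', 'M', 'F', 'F', 'M', 'F', 'M', 'F'],
    'Disease': ['A', 'B', 'B', 'B', 'C', 'F', 'D', 'G', 'C', 'C', 'B', 'B'],
})
# Encode Sex as 0/1 because the classifier needs numeric features (assumed encoding)
data['Sex'] = data['Sex'].map({'F': 0, 'M': 1})

X = data[['Age', 'Sex']]
y = data['Disease']

# Split by position to mirror the table: first 8 rows train, last 4 rows test
X_train, y_train = X.iloc[:8], y.iloc[:8]
X_test, y_test = X.iloc[8:], y.iloc[8:]

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)          # training uses X_train and y_train

train_pred = model.predict(X_train)  # should reproduce y_train if the fit worked
test_pred = model.predict(X_test)    # predictions to compare against y_test

print('train accuracy:', (train_pred == y_train).mean())
print('test accuracy:', (test_pred == y_test).mean())
```

With only eight training rows the test accuracy says little about real-world performance; the sketch is only meant to show which of the four pieces goes where.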

Originally posted by caner; translation licensed under CC BY-SA 4.0
