How can I improve the accuracy of my linear regression model? (machine learning with Python)

I'm new to this, so please bear with me.

I have a Python machine learning project using the scikit-learn library. I have two separate datasets for training and testing, and I am trying to do a linear regression. I use the code block shown below:

import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
from pylab import rcParams
import urllib
import sklearn
from sklearn.linear_model import LinearRegression
df = pd.read_csv("TrainingData.csv")
df2 = pd.read_csv("TestingData.csv")

df['Development_platform']= ["".join("%03d" % ord(c) for c in s) for s in df['Development_platform']]
df['Language_Type']= ["".join("%03d" % ord(c) for c in s) for s in df['Language_Type']]

df2['Development_platform']= ["".join("%03d" % ord(c) for c in s) for s in df2['Development_platform']]
df2['Language_Type']= ["".join("%03d" % ord(c) for c in s) for s in df2['Language_Type']]

X_train = df[['AFP','Development_platform','Language_Type','Resource_Level']]
Y_train = df['Effort']

X_test=df2[['AFP','Development_platform','Language_Type','Resource_Level']]
Y_test=df2['Effort']
lr = LinearRegression().fit(X_train, Y_train)
print("lr.coef_: {}".format(lr.coef_))
print("lr.intercept_: {}".format(lr.intercept_))
print("Training set score: {:.2f}".format(lr.score(X_train, Y_train)))
print("Test set score: {:.7f}".format(lr.score(X_test, Y_test)))

My results are: lr.coef_: [ 2.32088001e+00 2.07441948e-12 -4.73338567e-05 6.79658129e+02]

lr.intercept_: 2166.186033098048

Training set score: 0.63

Test set score: 0.5732999

Do you have any suggestions? How can I improve my accuracy? (adding code, parameters, etc.) My datasets are here: https://yadi.sk/d/JJmhzfj-3QCV4V

Originally posted by f.koglu under the CC BY-SA 4.0 license.

1 Answer

I'll expand on @GeorgiKaradjov's answer with some examples. Your question is very broad, and there are multiple ways to get improvements. Ultimately, having domain knowledge (context) will give you the best possible chance of getting improvements.

  1. Normalize your data, i.e., shift it to have a mean of zero and a spread of 1 standard deviation
  2. Turn categorical data into variables via, e.g., one-hot encoding
  3. Do feature engineering:
    • Are my features collinear?
    • Do any of my features have cross terms/higher-order terms?
  4. Regularize the features to reduce possible overfitting
  5. Look at alternative models, given the underlying features and the aims of the project

1) Normalize the data

from sklearn.preprocessing import StandardScaler
std = StandardScaler()
# StandardScaler expects a 2D array, hence the reshape.
afp = np.append(X_train['AFP'].values, X_test['AFP'].values).reshape(-1, 1)
std.fit(afp)

X_train[['AFP']] = std.transform(X_train[['AFP']])
X_test[['AFP']] = std.transform(X_test[['AFP']])

This gives:

0    0.752395
1    0.008489
2   -0.381637
3   -0.020588
4    0.171446
Name: AFP, dtype: float64
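Note that fitting the scaler on the combined train and test values leaks test information into the preprocessing. A stricter, leak-free variant (a small sketch, not part of the original snippet) fits on the training column only:

std = StandardScaler().fit(X_train[['AFP']])  # fit on training data only, no test leakage
X_train[['AFP']] = std.transform(X_train[['AFP']])
X_test[['AFP']] = std.transform(X_test[['AFP']])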

2) Encode the categorical features

def feature_engineering(df):

    dev_plat = pd.get_dummies(df['Development_platform'], prefix='dev_plat')
    df[dev_plat.columns] = dev_plat
    df = df.drop('Development_platform', axis=1)

    lang_type = pd.get_dummies(df['Language_Type'], prefix='lang_type')
    df[lang_type.columns] = lang_type
    df = df.drop('Language_Type', axis=1)

    resource_level = pd.get_dummies(df['Resource_Level'], prefix='resource_level')
    df[resource_level.columns] = resource_level
    df = df.drop('Resource_Level', axis=1)

    return df

X_train = feature_engineering(X_train)
X_train.head(5)

This gives:

AFP dev_plat_077070 dev_plat_077082 dev_plat_077117108116105    dev_plat_080067 lang_type_051071076 lang_type_052071076 lang_type_065112071 resource_level_1    resource_level_2    resource_level_4
0   0.752395    1   0   0   0   1   0   0   1   0   0
1   0.008489    0   0   1   0   0   1   0   1   0   0
2   -0.381637   0   0   1   0   0   1   0   1   0   0
3   -0.020588   0   0   1   0   1   0   0   1   0   0
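One caveat with this approach: calling get_dummies separately on the training and test frames yields mismatched columns whenever a category appears in only one of them. A small safeguard (an addition to the original recipe) is to align the test frame to the training columns after encoding:

X_test = feature_engineering(X_test)
# Reindex fills categories unseen in training with 0 and drops test-only columns.
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)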

3) Feature engineering: collinearity

import seaborn as sns
corr = X_train.corr()
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool), cmap=sns.diverging_palette(220, 10, as_cmap=True), square=True)

[Image: heatmap of the pairwise feature correlations]

You want the red line for y=x, because values should be correlated with themselves. However, any other red or blue columns suggest a strong correlation/anti-correlation that needs more investigation; for example, Resource_Level=1 and Resource_Level=4 might be highly anti-correlated, in the sense that having 1 makes having 4 less likely, and so on. Regression assumes that the parameters used are independent of one another.
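If you want to quantify collinearity beyond the heatmap, variance inflation factors are one option. A minimal sketch, assuming statsmodels is installed (an extra dependency this answer doesn't otherwise use):

from statsmodels.stats.outliers_influence import variance_inflation_factor

# Rule of thumb: VIF above ~5-10 flags problematic collinearity. A full dummy
# set per category will show inf here (the dummy trap);
# pd.get_dummies(..., drop_first=True) avoids that.
vif = pd.Series(
    [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
    index=X_train.columns,
)
print(vif)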

3) Feature engineering: higher-order terms

Maybe your model is too simple; you could consider adding higher-order and cross terms:

from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(2, interaction_only=True)
output_nparray = poly.fit_transform(df)
target_feature_names = ['x'.join('{}^{}'.format(pair[0], pair[1]) for pair in pairs if pair[1] != 0)
                        for pairs in [list(zip(df.columns, p)) for p in poly.powers_]]
output_df = pd.DataFrame(output_nparray, columns=target_feature_names)

I gave this a quick try, and I don't think the higher-order terms help much. It's also possible that your data is non-linear: a quick logarithm of the Y output gives a worse fit, which suggests it is linear. You could also look at the actuals, but I was too lazy...
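For reference, the log check mentioned above might look like this (a minimal sketch; np.log1p/np.expm1 assume Effort is non-negative):

# Fit on a log-transformed target, then score on the original scale.
lr_log = LinearRegression().fit(X_train, np.log1p(Y_train))
y_pred_log = np.expm1(lr_log.predict(X_test))
ss_res = np.sum((Y_test - y_pred_log) ** 2)
ss_tot = np.sum((Y_test - Y_test.mean()) ** 2)
print("Test R^2 with log target: {:.2f}".format(1 - ss_res / ss_tot))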

4) Regularization

Try scikit-learn's ridge regression (RidgeCV) and play with alpha:

from sklearn.linear_model import RidgeCV

lr = RidgeCV(alphas=np.arange(70, 100, 0.1), fit_intercept=True)
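After fitting, you can check which alpha the cross-validation picked (a short usage sketch; the alpha range above is just this answer's example, not a tuned choice):

lr.fit(X_train, Y_train)
print("Chosen alpha: {:.1f}".format(lr.alpha_))  # best alpha found by CV
print("Test set score: {:.2f}".format(lr.score(X_test, Y_test)))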

5) Alternative models

Linear regression isn't always suitable. For example, a random forest regressor can perform very well, and it is usually insensitive to data being standardized and to mixed categorical/continuous features. Other models include XGBoost and Lasso (linear regression with L1 regularization).

from sklearn.ensemble import RandomForestRegressor

lr = RandomForestRegressor(n_estimators=100)
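Since Lasso is mentioned, a comparable one-liner for it (the alpha here is an illustrative assumption):

from sklearn.linear_model import Lasso

lr = Lasso(alpha=1.0).fit(X_train, Y_train)
print("Test set score: {:.2f}".format(lr.score(X_test, Y_test)))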

Putting it all together

I got carried away and started playing with your problem myself, but I couldn't improve it much without knowing all the context behind the features:

import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
from pylab import rcParams
import urllib
import sklearn
from sklearn.linear_model import RidgeCV, LinearRegression, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import GridSearchCV

def feature_engineering(df):

    dev_plat = pd.get_dummies(df['Development_platform'], prefix='dev_plat')
    df[dev_plat.columns] = dev_plat
    df = df.drop('Development_platform', axis=1)

    lang_type = pd.get_dummies(df['Language_Type'], prefix='lang_type')
    df[lang_type.columns] = lang_type
    df = df.drop('Language_Type', axis=1)

    resource_level = pd.get_dummies(df['Resource_Level'], prefix='resource_level')
    df[resource_level.columns] = resource_level
    df = df.drop('Resource_Level', axis=1)

    return df

df = pd.read_csv("TrainingData.csv")
df2 = pd.read_csv("TestingData.csv")

df['Development_platform']= ["".join("%03d" % ord(c) for c in s) for s in df['Development_platform']]
df['Language_Type']= ["".join("%03d" % ord(c) for c in s) for s in df['Language_Type']]

df2['Development_platform']= ["".join("%03d" % ord(c) for c in s) for s in df2['Development_platform']]
df2['Language_Type']= ["".join("%03d" % ord(c) for c in s) for s in df2['Language_Type']]

X_train = df[['AFP','Development_platform','Language_Type','Resource_Level']]
Y_train = df['Effort']

X_test = df2[['AFP','Development_platform','Language_Type','Resource_Level']]
Y_test = df2['Effort']

std = StandardScaler()
# StandardScaler expects a 2D array, hence the reshape.
afp = np.append(X_train['AFP'].values, X_test['AFP'].values).reshape(-1, 1)
std.fit(afp)

X_train[['AFP']] = std.transform(X_train[['AFP']])
X_test[['AFP']] = std.transform(X_test[['AFP']])

X_train = feature_engineering(X_train)
X_test = feature_engineering(X_test)

lr = RandomForestRegressor(n_estimators=50)
lr.fit(X_train, Y_train)

print("Training set score: {:.2f}".format(lr.score(X_train, Y_train)))
print("Test set score: {:.2f}".format(lr.score(X_test, Y_test)))

y_pred = lr.predict(X_test)  # needed for the plot below

fig = plt.figure()
ax = fig.add_subplot(111)

ax.errorbar(Y_test, y_pred, fmt='o')                # predicted vs. actual Effort
ax.errorbar([1, Y_test.max()], [1, Y_test.max()])   # y = x reference line
plt.show()

Resulting in:

Training set score: 0.90
Test set score: 0.61

[Image: scatter of predicted vs. actual Effort with a y = x reference line]

You can look at the importance of the variables (the higher the value, the more important):

Importance
AFP                         0.882295
dev_plat_077070             0.020817
dev_plat_077082             0.001162
dev_plat_077117108116105    0.016334
dev_plat_080067             0.004077
lang_type_051071076         0.012458
lang_type_052071076         0.021195
lang_type_065112071         0.001118
resource_level_1            0.012644
resource_level_2            0.006673
resource_level_4            0.021227
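A table like this can be produced from the fitted forest (a small sketch using the feature_importances_ attribute of RandomForestRegressor):

importances = pd.Series(lr.feature_importances_, index=X_train.columns, name='Importance')
print(importances)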

You can then start looking at the hyperparameters to get further improvements: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV
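A minimal GridSearchCV sketch for the random forest could look like this (the parameter grid is an illustrative assumption, not a tuned choice):

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],   # hypothetical grid values
    'max_depth': [None, 5, 10],
    'min_samples_leaf': [1, 2, 5],
}
grid = GridSearchCV(RandomForestRegressor(), param_grid, cv=5)
grid.fit(X_train, Y_train)
print("Best params: {}".format(grid.best_params_))
print("Test set score: {:.2f}".format(grid.score(X_test, Y_test)))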

Originally posted by jonnybazookatone under the CC BY-SA 3.0 license.
