Feature/variable importance after PCA analysis

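I ran a PCA on my original dataset, and from the PCA-transformed, compressed dataset I selected the number of PCs I want to keep (they explain almost 94% of the variance). Now I am struggling to identify which of the original features are important in the reduced dataset. How do I find out which features are important and which are not among the principal components that remain after dimensionality reduction? Here is my code: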

from sklearn.decomposition import PCA
pca = PCA(n_components=8)
pca.fit(scaledDataset)
projection = pca.transform(scaledDataset)

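In addition, I also tried running a clustering algorithm on the reduced dataset, but to my surprise the score is lower than on the original dataset. How is that possible?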

Originally posted by fbm; translation licensed under CC BY-SA 4.0

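First of all, I assume that by features you mean the variables and not the samples/observations. In that case, you could do something like the following by creating a biplot function that shows everything in one plot. In this example, I am using the iris data.

Before the example, please note that the basic idea when using PCA as a tool for feature selection is to select variables according to the magnitude (from largest to smallest absolute value) of their coefficients (loadings). See the last paragraph after the plot for more details.

Overview:

PART 1: I explain how to check the importance of the features and how to plot a biplot.

PART 2: I explain how to check the importance of the features and how to save them into a pandas dataframe using the feature names.

PART 1: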

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
import pandas as pd
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = iris.data
y = iris.target
#In general a good idea is to scale the data
scaler = StandardScaler()
scaler.fit(X)
X=scaler.transform(X)

pca = PCA()
x_new = pca.fit_transform(X)

def myplot(score,coeff,labels=None):
    xs = score[:,0]
    ys = score[:,1]
    n = coeff.shape[0]
    scalex = 1.0/(xs.max() - xs.min())
    scaley = 1.0/(ys.max() - ys.min())
    plt.scatter(xs * scalex,ys * scaley, c = y)
    for i in range(n):
        plt.arrow(0, 0, coeff[i,0], coeff[i,1],color = 'r',alpha = 0.5)
        if labels is None:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, "Var"+str(i+1), color = 'g', ha = 'center', va = 'center')
        else:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, labels[i], color = 'g', ha = 'center', va = 'center')
    # Axis limits, labels and grid belong to the plot drawn by myplot
    plt.xlim(-1,1)
    plt.ylim(-1,1)
    plt.xlabel("PC{}".format(1))
    plt.ylabel("PC{}".format(2))
    plt.grid()

#Call the function. Use only the 2 PCs.
myplot(x_new[:,0:2],np.transpose(pca.components_[0:2, :]))
plt.show()

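Visualize what is going on using the biplot

Now, the importance of each feature is reflected by the magnitude of the corresponding values in the eigenvectors (higher magnitude, higher importance).

Let's first see how much variance each PC explains.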

pca.explained_variance_ratio_
[0.72770452, 0.23030523, 0.03683832, 0.00515193]

PC1 explains about 72.8% of the variance and PC2 about 23.0%. If we keep only PC1 and PC2, together they explain about 95.8%.
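As a rough sketch of how the number of components to keep could be read off these ratios, the snippet below accumulates `explained_variance_ratio_` and picks the smallest number of components that reaches a target. The 0.95 target and the variable names (`cumulative`, `target`, `k`, `pca_auto`) are my own illustrative choices, not part of the original answer. Passing a float between 0 and 1 as `n_components` lets scikit-learn make the same choice for you.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardized iris data, as in the example above
X = StandardScaler().fit_transform(load_iris().data)
pca = PCA().fit(X)

# Running total of the variance explained by the first k components
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest k whose cumulative ratio reaches the target
# (0.95 is an illustrative threshold, not a recommendation)
target = 0.95
k = int(np.searchsorted(cumulative, target) + 1)
print(cumulative, k)

# scikit-learn can make the same choice when n_components is a float in (0, 1)
pca_auto = PCA(n_components=target, svd_solver="full").fit(X)
print(pca_auto.n_components_)
```

For the iris data this should select two components, in line with the ratios shown above.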

Now, let's find the most important features.

print(abs( pca.components_ ))

[[0.52237162 0.26335492 0.58125401 0.56561105]
 [0.37231836 0.92555649 0.02109478 0.06541577]
 [0.72101681 0.24203288 0.14089226 0.6338014 ]
 [0.26199559 0.12413481 0.80115427 0.52354627]]

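Here, pca.components_ has shape [n_components, n_features]. Thus, by looking at PC1 (the first principal component), which is the first row: [0.52237162 0.26335492 0.58125401 0.56561105], we can conclude that features 1, 3 and 4 (Var 1, Var 3 and Var 4 in the biplot) are the most important. This is also clearly visible in the biplot (that is why we often use this plot to summarize the information visually).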
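To make the same reading with feature names instead of positions, here is a minimal sketch that puts `pca.components_` into a pandas DataFrame and sorts the PC1 row by absolute value. The names `loadings` and `pc1_ranking` are hypothetical, introduced only for this illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)
pca = PCA().fit(X)

# Rows are components, columns are the original features
loadings = pd.DataFrame(
    pca.components_,
    index=[f"PC{i + 1}" for i in range(pca.n_components_)],
    columns=iris.feature_names,
)

# Rank the features on PC1 by absolute loading, largest first
pc1_ranking = loadings.loc["PC1"].abs().sort_values(ascending=False)
print(pc1_ranking)
```

On the iris data, the sepal width column ends up last on PC1, which matches the small second entry in the first row printed above.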

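To sum up, look at the absolute values of the components of the eigenvectors that correspond to the k largest eigenvalues. In sklearn, the components are sorted by explained_variance_. The larger these absolute values are, the more a specific feature contributes to that principal component.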
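The link between eigenvectors and `pca.components_` can be checked directly. The sketch below (variable names are mine, for illustration) computes the eigen-decomposition of the sample covariance matrix with NumPy and compares it with what scikit-learn reports. Eigenvectors are only defined up to sign, so the comparison uses absolute values.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)
pca = PCA().fit(X)

# Eigen-decomposition of the sample covariance matrix (features in columns)
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))

# eigh returns eigenvalues in ascending order; reorder to descending to match sklearn
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Eigenvalues correspond to explained_variance_,
# eigenvectors (as rows) to components_ up to sign
same_values = np.allclose(eigvals, pca.explained_variance_)
same_vectors = np.allclose(np.abs(eigvecs.T), np.abs(pca.components_))
print(same_values, same_vectors)
```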

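PART 2:

The important features are the ones that influence the components the most, and thus have a large absolute value/score on the component.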
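If a single ranking across all retained components is wanted, one possible heuristic is to weight each absolute loading by the explained variance ratio of its component and sum over the components you keep. To be clear, this weighting is my own assumption for illustration. It is not part of this answer's method and not a scikit-learn feature, and the names `n_keep` and `scores` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)

n_keep = 2  # number of retained components (illustrative choice)
pca = PCA(n_components=n_keep).fit(X)

# Heuristic (an assumption, not a standard definition):
# sum over retained components of |loading| * explained variance ratio
weights = pca.explained_variance_ratio_[:, np.newaxis]   # shape (n_keep, 1)
scores = pd.Series(
    (np.abs(pca.components_) * weights).sum(axis=0),
    index=iris.feature_names,
).sort_values(ascending=False)
print(scores)
```

Treat the resulting order as a rough guide only; features with similar scores should not be separated on this basis alone.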

To get the most important features on the PCs by name and save them into a pandas dataframe, use this:

from sklearn.decomposition import PCA
import pandas as pd
import numpy as np
np.random.seed(0)

# 10 samples with 5 features
train_features = np.random.rand(10,5)

model = PCA(n_components=2).fit(train_features)
X_pc = model.transform(train_features)

# number of components
n_pcs= model.components_.shape[0]

# get the index of the most important feature on EACH component
# LIST COMPREHENSION HERE
most_important = [np.abs(model.components_[i]).argmax() for i in range(n_pcs)]

initial_feature_names = ['a','b','c','d','e']
# get the names
most_important_names = [initial_feature_names[most_important[i]] for i in range(n_pcs)]

# LIST COMPREHENSION HERE AGAIN
dic = {'PC{}'.format(i): most_important_names[i] for i in range(n_pcs)}

# build the dataframe
df = pd.DataFrame(list(dic.items()))
print(df)

This prints:

      0  1
 0  PC0  e
 1  PC1  d

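So on the first principal component (labeled PC0 in the table, because the labels are zero-based) the feature named e is the most important, and on the second (labeled PC1) the most important feature is d.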
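Taking only the argmax keeps a single feature per component. A small variation, sketched below on the same toy data, lists the few largest absolute loadings per component using `np.argsort`. The value of `top_n` and the column names `rank_1`, `rank_2`, ... are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

np.random.seed(0)
train_features = np.random.rand(10, 5)   # same toy data: 10 samples, 5 features
feature_names = np.array(["a", "b", "c", "d", "e"])

model = PCA(n_components=2).fit(train_features)

top_n = 3  # how many features to list per component (illustrative choice)

# Indices of the top_n largest absolute loadings on each component, largest first
order = np.argsort(-np.abs(model.components_), axis=1)[:, :top_n]

top_features = pd.DataFrame(
    feature_names[order],
    index=["PC{}".format(i) for i in range(model.n_components_)],
    columns=["rank_{}".format(r + 1) for r in range(top_n)],
)
print(top_features)
```

The `rank_1` column reproduces the argmax result, and the remaining columns show which features come next on each component.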

There is also a nice article here: https://towardsdatascience.com/pca-clearly-explained-how-when-why-to-use-it-and-feature-importance-a-guide-in-python-7c274582c37e?source=friends_link&sk=65bf5440e444c24aff192fedf9f8b64f

Originally posted by seralouk; translation licensed under CC BY-SA 4.0

# Import libraries
import numpy as np
import pandas as pd
from pca import pca

# Lets create a dataset with features that have decreasing variance.
# We want to extract feature f1 as most important, followed by f2 etc
f1=np.random.randint(0,100,250)
f2=np.random.randint(0,50,250)
f3=np.random.randint(0,25,250)
f4=np.random.randint(0,10,250)
f5=np.random.randint(0,5,250)
f6=np.random.randint(0,4,250)
f7=np.random.randint(0,3,250)
f8=np.random.randint(0,2,250)
f9=np.random.randint(0,1,250)

# Combine into dataframe
X = np.c_[f1,f2,f3,f4,f5,f6,f7,f8,f9]
X = pd.DataFrame(data=X, columns=['f1','f2','f3','f4','f5','f6','f7','f8','f9'])

# Initialize
model = pca()

# Fit transform
out = model.fit_transform(X)

# Print the top features. The results show that f1 is best, followed by f2 etc
print(out['topfeat'])

#     PC feature
# 0  PC1      f1
# 1  PC2      f2
# 2  PC3      f3
# 3  PC4      f4
# 4  PC5      f5
# 5  PC6      f6
# 6  PC7      f7
# 7  PC8      f8
# 8  PC9      f9
