Many of the decisions we make in life are based on the opinions of other people, and in general, a decision made by a group tends to produce a better result than a decision made by any individual member of that group. This is often called the wisdom of the crowd. Ensemble Learning builds on the same idea: it combines the predictions of multiple models and aims to perform better than any single member of the ensemble, thereby improving predictive performance (the accuracy of the model). Performance is also the most important concern for many classification and regression problems.
Ensemble Learning combines several weak classifiers (or regressors) to produce a new, stronger classifier. (A weak classifier is one whose classification accuracy is only slightly better than random guessing, that is, its error rate is below 0.5.)
Ensemble learning therefore combines the predictions of multiple skilled models. Its success rests on ensuring the diversity of the weak classifiers, and ensembling unstable algorithms tends to yield the most noticeable performance gains. Ensemble learning is an idea rather than a single algorithm: when the best possible performance of a predictive modeling project is the most important outcome, ensemble methods are popular and usually the preferred technique.
Why use ensemble learning
(1) Better performance: an ensemble can make better predictions and achieve better performance than any single contributing model;
(2) Stronger robustness: an ensemble reduces the spread, or dispersion, of the predictions and of model performance, smoothing the expected behavior of the model;
(3) More reasonable decision boundaries: the weak classifiers differ from one another, so their classification boundaries differ; merging multiple weak classifiers can yield a more reasonable overall boundary, reduce the overall error rate, and achieve better results;
(4) Adapts to different sample sizes: for data sets that are too large or too small, the data can be partitioned or resampled with replacement to generate different sample subsets, a classifier trained on each subset, and the classifiers merged at the end;
(5) Easy fusion: it is difficult to merge multiple heterogeneous feature data sets directly, but you can model each data set separately and then fuse the models.
Bias and variance of machine learning modeling
Errors generated by machine learning models are usually described by two attributes: bias and variance.
Bias measures how closely the model can capture the mapping function between inputs and outputs. It reflects the rigidity of the model: the strength of the assumptions the model makes about the functional form of the mapping between input and output.
The variance of a model is the amount by which its performance changes when it is fit to different training data. It captures how much the details of the training data influence the model.
Ideally, we prefer models with low bias and low variance; indeed, this is the goal of applying machine learning to any given predictive modeling problem. The bias and variance of model performance are linked: bias can usually be reduced easily at the cost of increased variance and, conversely, variance can easily be reduced by increasing bias.
Ensembles are used to achieve better predictive performance on a predictive modeling problem than any single predictive model. The way this works can be understood as accepting a small increase in bias in exchange for a reduction in the variance component of the prediction error (that is, navigating the bias-variance trade-off).
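As a minimal illustrative sketch (not from the original text; the data, model choices, and parameter values are assumptions for demonstration only), the following Python snippet shows the variance side of this trade-off: across many independent training sets, the prediction of a single deep regression tree at one query point varies much more than the average of several trees fit on bootstrap resamples.

```python
# Illustrative sketch: averaging many high-variance estimators reduces the
# variance of the combined prediction. Data and parameters are assumptions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)

def sample_training_set(n=80):
    X = rng.uniform(0, 5, size=(n, 1))
    y = np.sin(X).ravel() + rng.normal(scale=0.3, size=n)
    return X, y

x_test = np.array([[2.5]])           # a single query point
single_preds, ensemble_preds = [], []

for _ in range(200):                 # repeat over many independent training sets
    X, y = sample_training_set()
    # one deep tree: low bias, high variance
    single_preds.append(DecisionTreeRegressor().fit(X, y).predict(x_test)[0])
    # average of 25 trees fit on bootstrap resamples of the same training set
    preds = []
    for _ in range(25):
        idx = rng.randint(0, len(X), len(X))
        preds.append(DecisionTreeRegressor().fit(X[idx], y[idx]).predict(x_test)[0])
    ensemble_preds.append(np.mean(preds))

print("variance of single tree   :", np.var(single_preds))
print("variance of bagged average:", np.var(ensemble_preds))
```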
The Bagging idea in ensemble learning
Bagging, also known as Bootstrap Aggregating, involves fitting many learners on different samples of the same data set and averaging their predictions; it obtains diverse ensemble members by changing the training data.
The Bagging idea is to draw N new data sets from the original data set by sampling with replacement and to train N classifiers on them separately. Duplicate samples are allowed in each classifier's training data.
When a model trained with Bagging predicts the class of a new sample, it uses majority voting (or an averaging strategy, for regression) to produce the final result.
Bagging-based weak learners (classifiers/regressors) can be basic algorithm models, such as Linear Regression, Ridge, Lasso, Logistic Regression, Softmax Regression, ID3, C4.5, CART, SVM, KNN, Naive Bayes, etc.
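As a minimal sketch (not part of the original article; the data set and parameter values are illustrative assumptions), the following uses scikit-learn's BaggingClassifier with a decision-tree base learner; the final prediction is the majority vote of the bootstrap-trained trees, as described above.

```python
# Minimal Bagging sketch with scikit-learn; data and parameter values are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

bagging = BaggingClassifier(
    DecisionTreeClassifier(),   # any weak learner from the list above could be used
    n_estimators=50,            # number of classifiers, each trained on a bootstrap sample
    bootstrap=True,             # sample the training set with replacement
    random_state=42,
)
bagging.fit(X_train, y_train)
print("test accuracy:", bagging.score(X_test, y_test))  # majority vote over the 50 trees
```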
Random Forest
Principles of Random Forest Algorithm
Random forest is an algorithm built by modifying the Bagging strategy. The method is as follows (a code sketch follows the list):
(1) Use Bootstrap strategy to sample data from the sample set;
(2) Randomly select K features from the full feature set and use them to build an ordinary decision tree;
(3) Repeat steps (1) and (2) many times to build multiple decision trees;
(4) Integrate multiple decision trees to form a random forest, and make decisions on data through voting or averaging.
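Below is a minimal hand-rolled sketch of steps (1)-(4); it is an illustrative assumption, not code from the article, and the random feature selection is done per split via max_features, as in standard random forest implementations.

```python
# Hand-rolled random forest sketch following steps (1)-(4); names and
# parameter values are illustrative, not from the original text.
import numpy as np
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(0)
n_trees, trees = 25, []

for _ in range(n_trees):
    # (1) Bootstrap: sample the training set with replacement
    idx = rng.randint(0, len(X), len(X))
    # (2) Random feature subset: consider sqrt(n_features) candidates at each split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=rng)
    # (3) Repeat to build multiple trees
    trees.append(tree.fit(X[idx], y[idx]))

# (4) Majority vote across the forest for a new sample
votes = [t.predict(X[:1])[0] for t in trees]
print("forest prediction:", Counter(votes).most_common(1)[0][0])
```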
Random Forest OOB Error
In a random forest, roughly 1/3 of the samples do not appear in the data set drawn by Bootstrap sampling and therefore take no part in building the corresponding decision tree. These samples are called out-of-bag (OOB) data, and they can be used in place of a test set to estimate the test error.
For a random forest that has already been built, the out-of-bag data can be used to test its performance. Suppose the total number of out-of-bag samples is O. Feed these O samples into the previously trained random forest classifier, which returns a classification for each of them. Since the true class of each of the O samples is known, compare the classifier's outputs with the true labels and count the number of samples the random forest misclassifies; call this number X. The out-of-bag error is then X/O.
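As a small illustrative sketch (not from the original text), scikit-learn's RandomForestClassifier exposes this estimate directly through oob_score_, which is an OOB accuracy; the OOB error X/O above corresponds to 1 - oob_score_.

```python
# Minimal OOB-error sketch; oob_score_ is an OOB accuracy, so the OOB error
# X/O from the text corresponds to 1 - oob_score_. Data is illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X, y)
print("OOB error:", 1 - rf.oob_score_)
```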
Advantage: this has been proven to be an unbiased estimate, so the random forest algorithm needs neither cross-validation nor a separate test set to obtain an unbiased estimate of the test error.
Disadvantage: when the amount of data is small, the data set generated by Bootstrap sampling changes the distribution of the original data set, which introduces estimation bias.
Random forest algorithm variants
The RF algorithm has good practical properties and is widely used, mainly for classification, regression, feature transformation, and outlier detection. The following are common RF variant algorithms:
·Extra Trees (ET)
·Totally Random Trees Embedding (TRTE)
·Isolation Forest (IForest)
Extra Trees (ET)
Extra-Trees (Extremely Randomized Trees) were proposed by Pierre Geurts et al. in 2006. They are a variant of RF whose principle is essentially the same, but there are two main differences from random forest:
(1) Random forest uses Bootstrap random sampling to produce the training set for each sub-decision tree, applying the Bagging model, while ET uses all training samples to train every sub-tree; that is, each of ET's sub-decision trees is trained on the original sample set;
(2) When choosing split features, random forest behaves like a traditional decision tree (based on information gain, information gain ratio, Gini index, mean squared error, etc.), while ET selects the split features completely at random to partition the decision tree.
For a single decision tree, because its split features are chosen at random rather than optimally, its predictions are often inaccurate; but a combination of many such decision trees can achieve a good predictive effect.
Once the ET has been constructed, we can also use all of the training samples to estimate its error. Although the same training set is used both to build the decision trees and to make predictions, the split attribute at each node is selected at random, so the predictions still differ from the training labels, and comparing the predictions with the true response values of the samples gives the prediction error. By analogy with random forest, in ET all training samples are OOB samples, so computing the prediction error of ET amounts to computing this OOB error.
Since Extra Trees select the split points of feature values at random, the resulting decision trees are generally larger than those generated by RF. In other words, the variance of the Extra Trees model is reduced further relative to RF, and in some cases ET generalizes better than random forest.
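A minimal comparison sketch (illustrative only; the data set and parameters are assumptions) using scikit-learn, whose ExtraTreesClassifier defaults to bootstrap=False, so each tree is trained on the full training set and split thresholds are drawn at random, matching the two differences described above:

```python
# Compare Extra Trees and Random Forest on an illustrative synthetic data set.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
et = ExtraTreesClassifier(n_estimators=100, random_state=0)   # all samples per tree, random splits
rf = RandomForestClassifier(n_estimators=100, random_state=0) # bootstrap samples, optimized splits
print("ET accuracy:", cross_val_score(et, X, y, cv=5).mean())
print("RF accuracy:", cross_val_score(rf, X, y, cv=5).mean())
```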
Totally Random Trees Embedding (TRTE)
TRTE is an unsupervised data transformation method. It maps low-dimensional data into a high-dimensional space so that the transformed data can be used more effectively in classification and regression models.
The transformation process of the TRTE algorithm is similar to that of the RF algorithm: T decision trees are built to fit the data. Once the trees are built, the position of the leaf node that each sample in the data set reaches in each of the T decision trees is determined, and the feature transformation is completed by encoding this position information as a vector.
For example, suppose there are 3 decision trees, each with 5 leaf nodes, and a sample feature x falls into the 3rd leaf node of the first decision tree, the 1st leaf node of the second decision tree, and the 5th leaf node of the third decision tree. Then the feature encoding of x after mapping is (0,0,1,0,0, 1,0,0,0,0, 0,0,0,0,1), a 15-dimensional high-dimensional feature. After the features have been mapped to the high-dimensional space, supervised learning can proceed.
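A minimal sketch (illustrative assumptions only) of this kind of transformation using scikit-learn's RandomTreesEmbedding, which encodes each sample by the leaves it falls into across the trees:

```python
# TRTE-style transformation sketch; parameter values are illustrative.
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomTreesEmbedding

X, _ = make_circles(n_samples=200, factor=0.5, noise=0.05, random_state=0)
trte = RandomTreesEmbedding(n_estimators=3, max_depth=3, random_state=0)
X_high = trte.fit_transform(X)          # sparse one-hot encoding of leaf positions
print(X.shape, "->", X_high.shape)      # low-dimensional input -> one dimension per leaf
```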
Isolation Forest (IForest)
IForest is an outlier detection algorithm that uses an RF-like approach to detect anomalies; the differences between the IForest algorithm and the RF algorithm are:
(1) In the process of random sampling, generally only a small amount of data is needed;
(2) In the process of building a decision tree, the IForest algorithm will randomly select a partition feature and randomly select a partition threshold for the partition feature;
(3) The decision tree constructed by the IForest algorithm generally has a relatively small depth max_depth.
The purpose of IForest is to detect anomalous points, so as long as the anomalous data can be separated out, a large amount of data is not required; likewise, anomaly detection generally does not need large-scale decision trees.
To judge whether a point is anomalous, a test sample x is passed through the T decision trees. The depth h_t(x) of the leaf node that the sample reaches in each tree is computed, and from these the average depth h(x). The anomaly score of the sample point x can then be computed with the following formula (the standard Isolation Forest score); s(x, m) lies in [0, 1], and the closer it is to 1, the more likely x is an anomalous point:
s(x, m) = 2^(−h(x) / c(m)),  where  c(m) = 2(ln(m − 1) + ξ) − 2(m − 1)/m,
m is the number of samples used to build each tree and ξ is Euler's constant.
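A minimal sketch (data and parameter values are illustrative assumptions) using scikit-learn's IsolationForest; its predict method labels detected anomalies with -1, and its internal scoring is based on the average path depth described above:

```python
# Isolation Forest sketch on an illustrative synthetic data set.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # inliers around the origin
X_outlier = rng.uniform(low=-6, high=6, size=(10, 2))      # scattered anomalies
X = np.vstack([X_normal, X_outlier])

iforest = IsolationForest(n_estimators=100, max_samples=64, random_state=0)
iforest.fit(X)
labels = iforest.predict(X)            # +1 for inliers, -1 for detected anomalies
print("detected anomalies:", np.sum(labels == -1))
```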
Summary of the advantages and disadvantages of random forest
In this AI lesson, we have covered the idea and principles of Bagging, as well as random forests, which are built on Bagging. Finally, let us summarize the advantages and disadvantages of random forests:
Advantages
(1) Training can be parallelized, which has speed advantages for large-scale sample training;
(2) Because the split features of each decision tree are selected at random, training performance remains good even when the sample dimensionality is relatively high;
(3) Due to random sampling, the trained model has small variance and strong generalization ability;
(4) Simple implementation;
(5) Not sensitive to partially missing features;
(6) The importance of features can be measured.
Disadvantages
(1) It is prone to overfitting on features with relatively heavy noise;
(2) Split features that take many distinct values can have an outsized influence on the decisions of RF, which may hurt the model's effectiveness.