Microsoft Azure Machine Learning Studio is Microsoft's powerful machine learning platform. The designer ships with 15 built-in sample scenarios, but there seems to be no in-depth analysis of them available online, so I plan to write a series that covers all of them. Since these are in-depth analyses, they are not simple walkthroughs of the operations; each one goes into every detail, preferring to expand at length rather than skim the surface.
Researcher at Microsoft MVP Labs
The case we analyze this time is: Binary Classification using Vowpal Wabbit Model - Adult Income Prediction.
Preliminary knowledge
▍Dataset
This dataset contains annual income data for the US population; the original data comes from the 1994 US Census database. The dataset has 32,560 rows and 15 columns. You do not need to download it from there: in a later step we will discuss how to get this dataset from storage in Azure Machine Learning Studio.
What we want to predict is the income range of a new record. In this sample set there are only two income ranges, >50K and <=50K, so this is a typical classification problem. Classification models predict discrete values: when the final target (model output) of a machine learning model is a Boolean or a value from a fixed set, such as judging whether a picture contains a specific object (0 or 1) or producing an integer between 1 and 10, the task can usually be treated as a classification problem. A typical example is predicting win or lose. Whenever the predicted result comes from a clear set of options, we can use a classification scheme.
Download address: https://archive.ics.uci.edu/ml/datasets/Adult
▍Vowpal Wabbit data format
Vowpal Wabbit, or VW for short, is a powerful open-source, online, out-of-core machine learning system created by John Langford and colleagues at Microsoft Research. Azure ML provides native support for VW through the Train Vowpal Wabbit Model and Score Vowpal Wabbit Model modules. It can be used to train datasets larger than 10 GB, which is usually the upper limit allowed for learning algorithms in Azure ML. It supports many learning algorithms, including OLS regression, matrix factorization, single-layer neural networks, Latent Dirichlet Allocation, Contextual Bandits, and so on. VW input data contains one sample per line, and each sample must have the following format:
label | feature1:value1 feature2:value2 ...
Simply put, each sample starts with the label (Label), followed by the features (Features). In other words, every sample is a labeled sample.
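For instance, two hypothetical samples in VW format (the feature names here are purely illustrative) would look like this:

1 | age:39 hours_per_week:40 Bachelors Never-married
-1 | age:23 hours_per_week:20 HS-grad Never-married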
▍Parquet columnar storage format
Parquet is the mainstream columnar storage format in the Hadoop ecosystem. It was originally developed jointly by Twitter and Cloudera, and in May 2015 it graduated from the Apache incubator to become an Apache top-level project. There is a saying that if HDFS is the de facto standard for file systems in the big data era, then Parquet is the de facto standard for storage formats in the big data era. The Parquet columnar format has a high compression ratio, so IO operations are smaller. Parquet is language-independent and not bound to any data processing framework; it works with many languages and components. Query engines that support Parquet include Hive, Impala, Pig, Presto, Drill, Tajo, HAWQ, and IBM Big SQL; computing frameworks include MapReduce, Spark, Cascading, Crunch, Scalding, and Kite; data models include Avro, Thrift, Protocol Buffers, and POJOs. In short, Parquet is a storage format that lets query engines read data quickly.
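To get a feel for the format, here is a minimal pandas sketch (assuming the pyarrow engine is installed; example.parquet is just an illustrative file name):

import pandas as pd

# Write a tiny DataFrame to a Parquet file, then read it back
df = pd.DataFrame({"age": [39, 50], "income": ["<=50K", ">50K"]})
df.to_parquet("example.parquet")
print(pd.read_parquet("example.parquet"))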
In-depth analysis

This case has a total of nine working nodes. Let's analyze the details and the core information worth paying attention to in each node, one by one.
▍Adult Census Income Binary Classification dataset node
This node is the input of the data. The core information here is:
- Datastore name: azureml_globaldatasets is a link; click it to jump to the location of the data store
- Relative path: describes the location of the current file within the Datastore; the default is GenericCSV/Adult_Census_Income_Binary_Classification_dataset
Clicking azureml_globaldatasets takes you to the Datastore browser, where you can look at the stored data. The general interface looks like this:
The file on the left that we need to focus on most is _data.parquet, a file in the Parquet columnar storage format. It is recommended that you download it; in the next steps we will manipulate and analyze this file.
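If you would rather fetch the file programmatically than download it through the browser, a minimal sketch along the following lines should work with the azureml-core (v1) SDK; the workspace config and the exact relative path are assumptions to adapt to your environment:

from azureml.core import Workspace, Datastore, Dataset

# Assumes a local config.json describing your workspace
ws = Workspace.from_config()
datastore = Datastore.get(ws, "azureml_globaldatasets")

# Relative path as shown in the node's properties (adjust if yours differs)
dataset = Dataset.Tabular.from_parquet_files(
    path=(datastore, "GenericCSV/Adult_Census_Income_Binary_Classification_dataset/**"))
print(dataset.to_pandas_dataframe().head())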
▍Select Columns in Dataset node
This node selects the data columns (features) to use. The key information is to observe which columns are selected.
▍Execute Python Script node
This node converts the Parquet columnar storage file into the VW (Vowpal Wabbit) format. The core of this node is a piece of Python code. Let's understand and analyze this code in detail (first make sure you have downloaded _data.parquet and have a Python development environment).
The azureml_main function is a necessary entry function for Azure ML.
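Its basic shape is shown below (a minimal sketch: the Execute Python Script module calls azureml_main with up to two pandas DataFrames, one per input port, and expects the output DataFrame(s) to be returned):

import pandas as pd

def azureml_main(dataframe1=None, dataframe2=None):
    # dataframe1 / dataframe2 are the datasets connected to the input ports
    # Transform the data here, then return it for the downstream modules
    return dataframe1,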
The core of the rest of the code is to load the parquet file, set the labels and features, and generate the VW format.
To help you understand the code better, I have rewritten the script so that you can load _data.parquet locally and follow the whole process.
Make sure you have the following components installed
python -m pip install pandas
python -m pip install pyarrow
Detailed code and explanation
import pandas as pd
from pandas.api.types import is_numeric_dtype

label_col = 'income'
true_label = '>50K'

# Read _data.parquet
dataframe = pd.read_parquet('_data.parquet')

# Print the name of every column
for col in dataframe:
    print(col)
print('----------------------------------')

# The set of feature columns (excluding income)
feature_cols = [col for col in dataframe.columns if col != label_col]
for col in feature_cols:
    print(col)
print('----------------------------------')

# All numeric columns
numeric_cols = [col for col in dataframe.columns if is_numeric_dtype(dataframe[col])]
for col in dataframe:
    print(str(col) + ' type is: ' + str(dataframe[col].dtype))
print('----------------------------------')

def parse_row(row):
    line = []
    # The first element of a VW sample is its label; with only two classes
    # we map them to 1 and -1
    line.append(f"{1 if row[label_col] == true_label else -1} |")
    # Append the feature values of the sample
    for col in feature_cols:
        if col in numeric_cols:
            # Numeric values use the format column_name:value
            line.append(f"{col}:{row[col]}")
        else:
            # Non-numeric values are written as the bare value, with
            # whitespace, '|' and ':' characters stripped out
            line.append("".join((str(row[col])).split()).replace("|", "").replace(":", ""))
    print(line)
    vw_line = " ".join(line)
    return vw_line

vw = dataframe.apply(parse_row, axis=1).to_frame()
Sample output
['-1 | 39 77516 Bachelors 13 Never-married Not-in-family White Male 2174 0 40']
['1 | 31 84154 Some-college 10 Married-civ-spouse Husband White Male 0 0 38']
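If you want to keep the generated lines (for example, for the command-line sketches further below), you can write them out to a text file; adult.vw is just an illustrative name:

# One VW-formatted sample per line
with open('adult.vw', 'w') as f:
    f.write("\n".join(dataframe.apply(parse_row, axis=1)))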
▍Split Data node
This node is straightforward: it splits the dataset into two halves (50% / 50%) by rows.
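Locally, an equivalent 50/50 row split can be sketched with pandas (random_state is only for reproducibility):

# Sample half of the rows for training and keep the rest for scoring
train_df = dataframe.sample(frac=0.5, random_state=42)
test_df = dataframe.drop(train_df.index)
print(len(train_df), len(test_df))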
▍Train Vowpal Wabbit Model node
This node trains a model on the VW dataset we generated earlier. The information that needs attention is:
- VW arguments (VW parameters): these are the command-line arguments of the Vowpal Wabbit executable. The loss_function switch can be one of: classic, expectile, hinge, logistic, poisson, quantile, squared; the default is squared. The choice here is --loss_function logistic.
- Logistic regression was proposed by Cox in 1958. Although its name contains "regression", it is actually a binary classification algorithm and a linear model. Because it is a linear model, prediction is computationally cheap, and it has been applied successfully to some large-scale classification problems, such as advertising click-through rate (CTR) prediction. If your data is huge and predictions must be very fast, nonlinear models such as SVMs with nonlinear kernels or neural networks are no longer practical; logistic regression is then one of your few choices.
- Specify file type: VW is the internal format used by Vowpal Wabbit, and SVMLight is a format used by some other machine learning tools. Obviously we should choose VW.
- Output readable model file: select True; the file will be saved in the same storage account and container as the input file.
- Output inverted hash file: select True; the file will be saved in the same storage account and container as the input file.
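For reference, the equivalent training run with the Vowpal Wabbit command-line tool would look roughly like this (a sketch only; adult_train.vw and the output file names are illustrative):

vw adult_train.vw --loss_function logistic -f model.vw --readable_model readable_model.txt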
▍Score Vowpal Wabbit Model node
The Score Vowpal Wabbit Model node is similar to Train Vowpal Wabbit Model. The parameter that differs is VW arguments (VW parameters): the link switch can be glf1, identity, logistic, or poisson; the default is identity. The choice here is --link logistic, which maps the raw score to a probability.
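The command-line counterpart of scoring would be roughly as follows (again only a sketch with illustrative file names):

vw adult_test.vw -t -i model.vw --link logistic -p predictions.txt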
▍Execute Python Script node
This node is also a Python script. Its purpose is to add an evaluation (scored labels) column. The script is relatively simple; the key code is:
# Threshold: comparing the scored probability against it yields the label
threshold = 0.5
# Name of the binary scored-labels column
binary_class_scored_col_name = "Binary Class Scored Labels"
# Name of the binary scored-probabilities column; this value is mapped to the label
binary_class_scored_prob_col_name = "Binary Class Scored Probabilities"
output = dataframe.rename(columns={"Results": binary_class_scored_prob_col_name})
output[binary_class_scored_col_name] = output[binary_class_scored_prob_col_name].apply(
    lambda x: 1 if x >= threshold else -1)
▍Edit Metadata node
The core of Edit Metadata is to define the metadata of a column, which serves two important functions:
- Redefine the data type of a column. Note that the values and data types in the dataset are not actually changed; what changes is the metadata inside machine learning that tells downstream components how to use the column. For example, redefining a numeric column as categorical tells machine learning to treat that column as categories rather than numbers.
- Indicate which column contains the class label, or the value to classify or predict. This function is the more important one: it helps machine learning understand which column carries the training target.
So here we need to define the Labels column: we do not change its data type, but we define its training meaning as Labels. That description is a bit convoluted; put more simply: define the column named Labels as the Labels (class label) type.
Defining this metadata prepares for the next node, Evaluate Model, telling it which column is the label to be evaluated.
▍Evaluate Model node
The metrics returned by evaluating a model depend on the type of model you are evaluating:
- Classification model
- Regression model
- Cluster analysis model
At this node we are mainly concerned with the charts output after training is complete.
ROC curve: also known as the "receiver operating characteristic" curve, or sensitivity curve. The ROC curve is mainly used to assess how accurately X predicts Y. It was originally used in the military and is now widely used in the medical field to judge whether a certain factor has diagnostic value for a certain disease.

The ROC curve reflects the relationship between sensitivity and specificity. We generally read it this way: the X-axis is 1 - specificity, also known as the false positive rate; the closer the X value is to zero, the higher the accuracy. The Y-axis is the sensitivity, also known as the true positive rate; the larger the Y value, the better the accuracy.

The curve divides the graph into two parts. The area under the curve is called AUC (Area Under Curve) and is used to indicate prediction accuracy: the higher the AUC, the larger the area under the curve and the higher the prediction accuracy. The closer the curve lies to the upper-left corner (smaller X, larger Y), the more accurate the prediction. In other words, the closer the AUC is to 1.0, the more reliable the detection method; when it is less than or equal to 0.5, the method has the lowest reliability and no practical value.

It is clear that the ROC curve is well suited to describing a binary problem, where instances are divided into positive and negative classes. In a binary problem there are four possible outcomes: if an instance is positive and is predicted as positive, it is a true positive; if an instance is negative but is predicted as positive, it is a false positive. Correspondingly, if an instance is negative and is predicted as negative, it is a true negative, and if a positive instance is predicted as negative, it is a false negative.
Precision-recall curve: recall is the proportion of actual positives that are correctly predicted as positive. Recall is defined with respect to the original samples: it is the probability that an actual positive sample is predicted as positive. A high recall means there may be more false detections, but the model tries to find every object that should be found.

The precision-recall curve shows the trade-off between precision and recall at different thresholds. A high area under the curve represents both high recall and high precision, where high precision is associated with a low false positive rate and high recall with a low false negative rate. High scores for both indicate that the classifier returns accurate results (high precision) and returns the majority of all positive samples (high recall).

Because precision and recall are in tension with each other, we introduce the F1 score to measure the accuracy of a binary classification model. It takes into account both the precision and the recall of the model and can be regarded as their harmonic mean, with a maximum value of 1 and a minimum value of 0.
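To make these metrics concrete, here is a minimal local sketch with scikit-learn (an extra dependency, not part of the designer pipeline); the tiny arrays are made-up illustrations using the same 1/-1 label convention as above:

from sklearn.metrics import roc_auc_score, precision_score, recall_score, f1_score

y_true = [1, -1, 1, 1, -1]                         # actual labels
y_prob = [0.9, 0.2, 0.6, 0.4, 0.1]                 # scored probabilities
y_pred = [1 if p >= 0.5 else -1 for p in y_prob]   # thresholded labels

print("AUC:      ", roc_auc_score(y_true, y_prob))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))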
Lift curve: unlike the ROC curve, lift considers how much better the classifier is than chance, that is, the ratio between the number of positives obtained using the classifier and the number of positives obtained by random selection without it. The lift curve is a measure of the effectiveness of a predictive model; the ratio is calculated from the results obtained with and without the model.

For example, a company has 10,000 customers, and as the business changes, the credit of 500 of them begins to deteriorate. If you grant credit to 1,000 randomly chosen customers, roughly 50 of them can be expected to run into trouble because of credit problems. If instead you use a model to predict bad customers and grant credit only to the 1,000 customers with the best model scores, and only 8 of those 1,000 eventually turn out to be risky, the model has clearly added value and its LIFT is greater than 1. If, however, 50 or more of them turn out to be risky, LIFT is less than or equal to 1, and in terms of effect you would be better off not using the model at all. LIFT is exactly such a metric: it measures how many times better the model is at capturing bad samples than random selection.

Usually, when calculating LIFT, the model scores are sorted from low to high (risk probability from high to low) and divided into 10 equal-frequency groups; the LIFT value of the lowest-scoring (highest-risk) group is its cumulative share of bad samples divided by its cumulative share of all samples.
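A minimal pandas sketch of this cumulative LIFT calculation (the scores and bad-customer flags below are synthetic, just to show the mechanics):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
score = rng.random(1000)                     # model score; assume lower = riskier
bad = rng.random(1000) < (1 - score) * 0.1   # synthetic "bad customer" flag

df = pd.DataFrame({"score": score, "bad": bad}).sort_values("score")  # riskiest first
df["cum_bad_share"] = df["bad"].cumsum() / df["bad"].sum()
df["cum_sample_share"] = np.arange(1, len(df) + 1) / len(df)
df["cum_lift"] = df["cum_bad_share"] / df["cum_sample_share"]

# LIFT at each decile boundary (10 equal-frequency groups)
print(df["cum_lift"].iloc[[len(df) // 10 * k - 1 for k in range(1, 11)]])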
At this point, we have completed the analysis of the Binary Classification using Vowpal Wabbit Model - Adult Income Prediction case. Along the way we examined the core information and related concepts of each node, from data sources, data processing, and Python scripts to metadata definitions and model quality reports, and touched on a large number of machine learning concepts. This article can serve both as an introduction to and an in-depth read on Microsoft Azure Machine Learning Studio and machine learning.
After this, I'll move on to other Microsoft Azure Machine Learning Studio cases. Each case can be read independently, so some concepts are repeated across them.
Microsoft Most Valuable Professional (MVP)
The Microsoft Most Valuable Professional award is a global award given by Microsoft to third-party technology professionals. For 29 years, technology community leaders around the world have received this award for sharing their expertise and experience in technology communities both online and offline.
MVPs are a carefully selected group of experts representing the most skilled and insightful minds: passionate, helpful people who are deeply invested in the community. MVPs are committed to helping others and to helping users of the Microsoft technical community make the most of Microsoft technologies by speaking, answering questions in forums, creating websites, writing blogs, sharing videos, contributing to open-source projects, organizing conferences, and more.
For more details, please visit the official website: https://mvp.microsoft.com/zh-cn