COMPSCI 589 Homework 1 - Spring 2023
Due March 01, 2023, 11:55pm Eastern Time

1. Instructions

This homework assignment consists of a programming portion. While you may discuss problems with your peers, you must answer the questions on your own and implement all solutions independently. In your submission, do explicitly list all students with whom you discussed this assignment.

We strongly recommend that you use LaTeX to prepare your submission. The assignment should be submitted on Gradescope as a PDF with marked answers via the Gradescope interface. The source code should be submitted via the Gradescope programming assignment as a .zip file. Include with your source code instructions for how to run your code.

We strongly encourage you to use Python 3 for your homework code. You may use other languages. In either case, you must provide us with clear instructions on how to run your code and reproduce your experiments.

You may not use any machine learning-specific libraries in your code, e.g., TensorFlow, PyTorch, or any machine learning algorithms implemented in scikit-learn (though you may use other functions provided by this library, such as one that splits a dataset into training and testing sets). You may use libraries like numpy and matplotlib. If you are not certain whether a specific library is allowed, do ask us.

All submissions will be checked for plagiarism using two independent plagiarism-detection tools. Renaming variable or function names, moving code within a file, etc., are all strategies that do not fool the plagiarism-detection tools we use. If you get caught, all penalties mentioned in the syllabus will be applied, which may include directly failing the course with a letter grade of "F".

→ Before starting this homework, please review this course's policies on plagiarism by reading Section 14 of the syllabus.

The tex file for this homework (which you can use if you decide to write your solution in LaTeX) can be found here.
The automated system will not accept assignments after 11:55pm on March 01.

Programming Section (100 Points Total)

In this section of the homework, you will implement two classification algorithms: k-NN and Decision Trees. Notice that you may not use existing machine learning code for this problem: you must implement the learning algorithms entirely on your own and from scratch.

1. Evaluating the k-NN Algorithm (50 Points Total)

In this question, you will implement the k-NN algorithm and evaluate it on a standard benchmark dataset: the Iris dataset. Each instance in this dataset contains (as attributes) four properties of a particular plant/flower. The goal is to train a classifier capable of predicting the flower's species based on its four properties. You can download the dataset here.

The Iris dataset contains 150 instances. Each instance is stored in a row of the CSV file and is composed of 4 attributes of a flower, as well as the species of that flower (its label/class). The goal is to predict a flower's species based on its 4 attributes. More concretely, each training instance contains information about the length and width (in centimeters) of the sepal of a flower, as well as the length and width (in centimeters) of the flower's petal. The label associated with each instance indicates the species of that flower: Iris Versicolor, Iris Setosa, or Iris Virginica. See Figure 1 for an example of what these three species of the Iris flower look like. In the CSV file, the attributes of each instance are stored in the first 4 columns of each row, and the corresponding class/label is stored in the last column of that row.

Figure 1: Pictures of three species of the Iris flower (Source: Machine Learning in R for beginners).

The goal of this experiment is to evaluate the impact of the parameter k on the algorithm's performance when used to classify instances in the training data, and also when used to classify new instances. For each experiment described below, you should use Euclidean distance as the distance metric and then follow these steps (a minimal sketch of one complete run is given after this list):

(a) shuffle the dataset to make sure that the order in which examples appear in the dataset file does not affect the learning process;[1]

(b) randomly partition the dataset into two disjoint subsets: a training set, containing 80% of the instances selected at random; and a testing set, containing the other 20% of the instances. Notice that these sets should be disjoint: if an instance is in the training set, it should not be in the testing set, and vice-versa.[2] The goal of splitting the dataset in this way is to allow the model to be trained based on just part of the data, and then to "pretend" that the rest of the data (i.e., instances in the testing set, which were not used during training) correspond to new examples on which the algorithm will be evaluated. If the algorithm performs well when used to classify examples in the testing set, this is evidence that it is generalizing well the knowledge it acquired from the training examples;

(c) train the k-NN algorithm using only the data in the training set;

(d) compute the accuracy of the k-NN model when used to make predictions for instances in the training set. To do this, you should compute the percentage of correct predictions made by the model when applied to the training data; that is, the number of correct predictions divided by the number of instances in the training set;

(e) compute the accuracy of the k-NN model when used to make predictions for instances in the testing set. To do this, you should compute the percentage of correct predictions made by the model when applied to the testing data; that is, the number of correct predictions divided by the number of instances in the testing set.

Important: when training a k-NN classifier, do not forget to normalize the features!

[1] If you are writing Python code, you can shuffle the dataset by using, e.g., the sklearn.utils.shuffle function.
[2] If you are writing Python code, you can perform this split automatically by using the sklearn.model_selection.train_test_split function.
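The listing below is a minimal sketch of one such run, not the required implementation. It assumes the Iris CSV has already been loaded into a numpy array X of shape (150, 4) holding the four numeric attributes and an array y of length 150 holding the labels (both names are illustrative), and it uses min-max normalization computed from the training split only; other reasonable normalization schemes are also acceptable.

import numpy as np
from collections import Counter
from sklearn.utils import shuffle  # utility function only, not an ML algorithm

def knn_predict(train_X, train_y, query, k):
    """Predict one query point by majority vote among its k nearest
    training neighbors, using Euclidean distance."""
    dists = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    return Counter(train_y[nearest]).most_common(1)[0][0]

def run_once(X, y, k, normalize=True):
    """One experiment run, steps (a)-(e); returns (train_acc, test_acc)."""
    X, y = shuffle(X, y)                                   # (a) shuffle
    split = int(0.8 * len(X))                              # (b) disjoint 80/20 split
    train_X, test_X = X[:split], X[split:]
    train_y, test_y = y[:split], y[split:]
    if normalize:                                          # min-max normalization, using
        lo, hi = train_X.min(axis=0), train_X.max(axis=0)  # statistics from the training split only
        train_X = (train_X - lo) / (hi - lo)
        test_X = (test_X - lo) / (hi - lo)
    # (c) "training" k-NN simply stores the training instances;
    # (d) accuracy over the training set
    train_acc = np.mean([knn_predict(train_X, train_y, x, k) == t
                         for x, t in zip(train_X, train_y)])
    # (e) accuracy over the testing set
    test_acc = np.mean([knn_predict(train_X, train_y, x, k) == t
                        for x, t in zip(test_X, test_y)])
    return train_acc, test_acc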
You will now construct two graphs. The first one will show the accuracy of the k-NN model (for various values of k) when evaluated on the training set. The second one will show the accuracy of the k-NN model (for various values of k) when evaluated on the testing set. You should vary k from 1 to 51, using only odd numbers (1, 3, ..., 51). For each value of k, you should run the process described above (i.e., steps (a) through (e)) 20 times. This will produce, for each value of k, 20 estimates of the accuracy of the model over training data, and 20 estimates of the accuracy of the model over testing data. (A short matplotlib sketch for plotting these averages with error bars appears at the end of this section.)

Q1.1 (10 Points) In the first graph, you should show the value of k on the horizontal axis, and on the vertical axis, the average accuracy of models trained over the training set, given that particular value of k. Also show, for each point in the graph, the corresponding standard deviation; you should do this by adding error bars to each point. The graph should look like the one in Figure 2 (though the "shape" of the curve you obtain may be different, of course).

Q1.2 (10 Points) In the second graph, you should show the value of k on the horizontal axis, and on the vertical axis, the average accuracy of the trained models when evaluated over the testing set, given that particular value of k. Also show, for each point in the graph, the corresponding standard deviation by adding error bars to the point.

Q1.3 (8 Points) Explain intuitively why each of these curves looks the way it does. First, analyze the graph showing performance on the training set as a function of k. Why do you think the graph looks like that? Next, analyze the graph showing performance on the testing set as a function of k. Why do you think the graph looks like that?

Q1.4 (6 Points) We say that a model is underfitting when it performs poorly on the training data (and most likely on the testing data as well). We say that a model is overfitting when it performs well on training data but it does not generalize to new instances. Identify and report the ranges of values of k for which k-NN is underfitting, and ranges of values of k for which k-NN is overfitting.

Figure 2: Example showing how your graphs should look. The "shape" of the curves you obtain may be different, of course.

Q1.5 (6 Points) Based on the analyses made in the previous question, which value of k would you select if you were trying to fine-tune this algorithm so that it worked as well as possible in real life? Justify your answer.
Q1.6 (10 Points) In the experiments conducted earlier, you normalized the features before running k-NN. This is the appropriate procedure to ensure that all features are considered equally important when computing distances. Now, you will study the impact of omitting feature normalization on the performance of the algorithm. To accomplish this, you will repeat Q1.2 and create a graph depicting the average accuracy (and corresponding standard deviation) of k-NN as a function of k, when evaluated on the testing set. However, this time you will run the algorithm without first normalizing the features. This means that you will run k-NN directly on the instances present in the original dataset without performing any pre-processing normalization steps to ensure that all features lie in the same range/interval. Now (a) present the graph you created; (b) based on this graph, identify the best value of k; that is, the value of k that results in k-NN performing the best on the testing set; and (c) describe how the performance of this version of k-NN (without feature normalization) compares with the performance of k-NN with feature normalization. Discuss intuitively the reasons why one may have performed better than the other.
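For the graphs requested in Q1.1, Q1.2, and Q1.6, matplotlib's errorbar function can draw the mean accuracy with a standard-deviation bar for each value of k. The sketch below is illustrative only; it assumes X, y, and a run_once helper like the one sketched earlier (any equivalent of your own works) and plots the testing-set curve, which is the second element of the pair that helper returns.

import numpy as np
import matplotlib.pyplot as plt

ks = list(range(1, 52, 2))           # odd values of k: 1, 3, ..., 51
means, stds = [], []
for k in ks:
    # 20 independent runs per k; keep the testing-set accuracies
    accs = [run_once(X, y, k)[1] for _ in range(20)]
    means.append(np.mean(accs))
    stds.append(np.std(accs))

plt.errorbar(ks, means, yerr=stds, marker='o', capsize=3)
plt.xlabel('k (number of neighbors)')
plt.ylabel('accuracy over testing data')
plt.title('k-NN accuracy on the testing set')
plt.savefig('knn_test_accuracy.png')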

2. Evaluating the Decision Tree Algorithm (50 Points Total)

In this question, you will implement the Decision Tree algorithm, as presented in class, and evaluate it on the 1984 United States Congressional Voting dataset. This dataset includes information about how each U.S. House of Representatives Congressperson voted on 16 key topics/laws. For each topic/law being considered, a congressperson may have voted yea, nay, or may not have voted. Each of the 16 attributes associated with a congressperson, thus, has 3 possible categorical values. The goal is to predict, based on the voting patterns of politicians (i.e., on how they voted in those 16 cases), whether they are Democrat (class/label 0) or Republican (class/label 1). You can download the dataset here.

Notice that this dataset contains 435 instances. Each instance is stored in a row of the CSV file. The first row of the file describes the name of each attribute. The attributes of each instance are stored in the first 16 columns of each row, and the corresponding class/label is stored in the last column of that row.

For each experiment below, you should repeat the steps (a) through (e) described in the previous question, but this time you will be using the Decision Tree algorithm rather than the k-NN algorithm. You should use the Information Gain criterion to decide whether an attribute should be used to split a node (a sketch of this computation is shown below).
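The following is a minimal sketch of the Information Gain computation for a single categorical attribute. It assumes labels is a numpy array with the class labels of the instances reaching a node and X_col is the corresponding numpy array column of categorical attribute values; the helper names are illustrative, not required.

import numpy as np
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a collection of class labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

def information_gain(X_col, labels):
    """Information gain of splitting on one categorical attribute:
    entropy of the parent node minus the weighted average entropy
    of the partitions induced by each attribute value."""
    parent = entropy(labels)
    n = len(labels)
    weighted = 0.0
    for value in np.unique(X_col):
        mask = (X_col == value)
        weighted += (mask.sum() / n) * entropy(labels[mask])
    return parent - weighted

# At each decision node, split on the attribute with the highest gain, e.g.:
# best_attr = max(candidate_attrs, key=lambda a: information_gain(X[:, a], y))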
You will now construct two histograms. The first one will show the accuracy distribution of the Decision Tree algorithm when evaluated on the training set. The second one will show the accuracy distribution of the Decision Tree algorithm when evaluated on the testing set. You should train the algorithm 100 times using the methodology described above (i.e., shuffling the dataset, splitting the dataset into disjoint training and testing sets, computing its accuracy in each one, etc.). This process will result in 100 accuracy measurements for when the algorithm was evaluated over the training data, and 100 accuracy measurements for when the algorithm was evaluated over testing data. (A short matplotlib sketch for drawing these histograms appears after the questions below.)

Q2.1 (12 Points) In the first histogram, you should show the accuracy distribution when the algorithm is evaluated over training data. The horizontal axis should show different accuracy values, and the vertical axis should show the frequency with which that accuracy was observed while conducting these 100 experiments/training processes. The histogram should look like the one in Figure 3 (though the "shape" of the histogram you obtain may be different, of course). You should also report the mean accuracy and its standard deviation.

Figure 3: Example showing how your histograms should look. The "shape" of the histograms you obtain may be different, of course.

Q2.2 (12 Points) In the second histogram, you should show the accuracy distribution when the algorithm is evaluated over testing data. The horizontal axis should show different accuracy values, and the vertical axis should show the frequency with which that accuracy was observed while conducting these 100 experiments/training processes. You should also report the mean accuracy and its standard deviation.

Q2.3 (12 Points) Explain intuitively why each of these histograms looks the way it does. Is there more variance in one of the histograms? If so, why do you think that is the case? Does one histogram show higher average accuracy than the other? If so, why do you think that is the case?

Q2.4 (8 Points) By comparing the two histograms, would you say that the Decision Tree algorithm, when used on this dataset, is underfitting, overfitting, or performing reasonably well? Explain your reasoning.

Q2.5 (6 Points) In class, we discussed how Decision Trees might be non-robust. Is it possible to experimentally confirm this property/tendency via these experiments, by analyzing the histograms you generated and their corresponding average accuracies and standard deviations? Explain your reasoning.
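For the histograms in Q2.1 and Q2.2, matplotlib's hist function is sufficient to show how often each accuracy value occurred across the 100 runs. The sketch below is illustrative; it assumes test_accs is a list of the 100 testing-set accuracies collected by your own experiment loop (the training-set histogram is produced the same way from the training accuracies).

import numpy as np
import matplotlib.pyplot as plt

# test_accs: 100 testing-set accuracies from your own experiment loop (assumed)
plt.hist(test_accs, bins=20)
plt.xlabel('accuracy over testing data')
plt.ylabel('frequency (out of 100 runs)')
plt.title('Decision Tree accuracy distribution (testing set)')
plt.savefig('dt_test_histogram.png')

print('mean accuracy:', np.mean(test_accs))
print('standard deviation:', np.std(test_accs))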
[QE.1] Extra points (15 Points) Repeat the experiments Q2.1 to Q2.4, but now use the Gini criterion for node splitting, instead of the Information Gain criterion.

[QE.2] Extra points (15 Points) Repeat the experiments Q2.1 to Q2.4, but now use a simple heuristic to keep the tree from becoming too "deep"; i.e., to keep it from testing a (possibly) excessive number of attributes, which is known to often cause overfitting. To do this, use an additional stopping criterion: whenever more than 85% of the instances associated with a decision node belong to the same class, do not further split this node. Instead, replace it with a leaf node whose class prediction is the majority class within the corresponding instances. E.g., if 85% of the instances associated with a given decision node have the label/class Democrat, do not further split this node, and instead directly return the prediction Democrat. A sketch of both of these modifications is shown below.
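For the extra-credit questions, the only changes to your Decision Tree implementation are the node impurity measure (QE.1) and an extra stopping test applied before splitting a node (QE.2). The following is a hedged sketch of both, with illustrative names; labels is assumed to be the collection of class labels of the instances reaching the node.

import numpy as np
from collections import Counter

def gini(labels):
    """Gini index of a collection of class labels (QE.1 splitting criterion).
    Use it in place of entropy when scoring candidate splits."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    probs = counts / counts.sum()
    return 1.0 - np.sum(probs ** 2)

def should_stop_early(labels, threshold=0.85):
    """QE.2 stopping criterion: stop splitting when more than 85% of the
    instances at this node share the same class; the node then becomes a
    leaf predicting that majority class."""
    majority_count = Counter(labels).most_common(1)[0][1]
    return majority_count / len(labels) > threshold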

