ECE368: Probabilistic Reasoning
Lab 1: Classification with Multinomial and Gaussian Models

1 Naïve Bayes Classifier for Spam Filtering

In the first part of the lab, we use a naïve Bayes classifier to build a spam email filter based on whether, and how many times, each word in a fixed vocabulary occurs in an email. Suppose that we need to classify a set of N emails, and each email n is represented by {xn, yn}, n = 1, 2, ..., N, where yn is the class label, which takes the value

    yn = 1 if email n is spam,
         0 if email n is non-spam (also called ham),        (1)
and xn is a feature vector of email n. We use a multinomial model to construct the feature vector xn. Let W = {w1, w2, ..., wD} be the set of words (called the vocabulary) that appear at least once in the training set. The feature vector xn is defined as a D-dimensional vector xn = [xn1, xn2, ..., xnD], where each entry xnd, d = 1, 2, ..., D, is the number of occurrences of word wd in email n. Thus the total number of words in email n can be expressed as ln = xn1 + xn2 + ... + xnD.

We assume that each email n of length ln is generated by a sequence of ln independent events that randomly draw words from the vocabulary W. (This is known as the naïve Bayes assumption.) For each event, let p(wd | yn = 1) be the probability that word wd is picked, given that the email belongs to spam; let p(wd | yn = 0) be the probability that word wd is picked, given that the email belongs to ham. Note that p(wd | yn = 1) and p(wd | yn = 0) are different, which gives us a way to classify spam vs. ham. For example, words like "dollar" and "winner" would be more likely to occur in spam than in ham. Also, note that both p(wd | yn = 1), d = 1, 2, ..., D, and p(wd | yn = 0), d = 1, 2, ..., D, should sum to 1 over the vocabulary. The probabilities p(wd | yn = 1) and p(wd | yn = 0), d = 1, ..., D, should be learned from the training data.

We make use of the word frequencies to model each email n probabilistically. Since each word in the email is seen as independently drawn from the vocabulary W, the distribution of the feature vector xn given the label yn is a multinomial distribution:

    p(xn | yn) = ( (xn1 + xn2 + ... + xnD)! / (xn1! xn2! ... xnD!) ) ∏_{d=1}^{D} p(wd | yn)^{xnd}.

We assume that the prior class distribution p(yn) is modeled as

    p(yn = 1) = π,   p(yn = 0) = 1 − π.

In the following, we first estimate the probabilities p(wd | yn = 1) and p(wd | yn = 0), d = 1, ..., D, using the training set; we then build a classifier based on Bayes' rule and make predictions on the testing set.

Download classifier.zip under Modules/Lab1/ on Quercus and unzip the file. The spam emails for training are in the subfolder /data/spam/. The ham emails for training are in the subfolder /data/ham/. The unlabeled emails for testing are in the subfolder /data/testing/.

Please answer the questions below and complete the routine classifier.py. File util.py contains a few functions/classes that will be helpful in writing the code for the classifier.
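To make the feature representation concrete, here is a minimal sketch (not part of the required classifier.py code) of how a count vector xn could be built from one email, assuming the email has already been tokenized into a list of word strings and the vocabulary has been collected from the training set. All names in the sketch (count_vector, vocabulary, email_words) are illustrative, not identifiers from the lab files.

```python
import numpy as np

def count_vector(email_words, vocabulary):
    """Map a tokenized email to its multinomial count vector x = [x1, ..., xD].

    email_words : list of word tokens from one email
    vocabulary  : list [w1, ..., wD] of distinct words seen in the training set
    """
    index = {word: d for d, word in enumerate(vocabulary)}  # word -> position d
    x = np.zeros(len(vocabulary), dtype=int)
    for word in email_words:
        d = index.get(word)
        if d is not None:      # words outside the vocabulary are simply ignored here
            x[d] += 1
    return x

# Toy example: a vocabulary of D = 4 words and one short email
vocab = ["dollar", "winner", "meeting", "report"]
email = ["winner", "winner", "dollar", "hello"]
print(count_vector(email, vocab))   # [1 2 0 0]; the counted length is ln = 3
```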
Questions

1. Training. We estimate the conditional probability distribution of the D-ary random variable specified by p(wd | yn = 1) and p(wd | yn = 0), d = 1, ..., D, from the training data using a bag-of-words model as follows. For notational simplicity, we define pd = p(wd | yn = 1) and qd = p(wd | yn = 0).

(a) We put all the words from the spam emails in the training set in a bag and simply count the number of occurrences of each word wd, d = 1, ..., D. We do the same for the ham emails. The maximum likelihood estimates of pd and qd based on these counts are not the most appropriate to use when the probabilities are very close to 0 or 1. For example, some words that occur in one class may not occur at all in the other class. In this problem, we use the technique of "Laplace smoothing" to deal with this issue. Please write down such an estimator for pd and qd as functions of the training data {xn, yn}, n = 1, 2, ..., N, using Laplace smoothing for the D-ary random variable. (A generic illustration of add-one smoothing is sketched after this question.)

(b) Complete the function learn_distributions in file classifier.py. In learn_distributions, you first build the vocabulary {w1, ..., wD} by accounting for all the words that appear at least once in the training set; you then estimate pd and qd, d = 1, 2, ..., D, using your expressions from part (a).
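As a generic reminder of the add-one (Laplace) smoothing technique named in part (a), the sketch below smooths a single pooled count vector for a D-ary variable. It is only an illustration under assumed inputs, not the estimator you are asked to derive nor the learn_distributions implementation; the counts shown are made up.

```python
import numpy as np

def laplace_smoothed_estimate(counts):
    """Add-one (Laplace) smoothed estimate of a D-ary distribution.

    counts : length-D array, counts[d] = number of times outcome d was observed
    Returns a length-D probability vector that sums to 1 and has no zero entries.
    """
    counts = np.asarray(counts, dtype=float)
    D = counts.size
    return (counts + 1.0) / (counts.sum() + D)

# Example: an outcome with zero count still receives a small nonzero probability
spam_counts = np.array([5, 3, 0, 2])            # hypothetical pooled word counts
p_hat = laplace_smoothed_estimate(spam_counts)
print(p_hat)          # approximately [0.43, 0.29, 0.07, 0.21]
print(p_hat.sum())    # 1.0
```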
2. Testing. We classify the unlabeled emails in /data/testing/ using the trained classifier.

(a) Let {x, y} be a data point from the testing set whose class label y is unknown. Write down the maximum a posteriori (MAP) rule to decide whether y = 1 or y = 0 based on the feature vector x. The d-th entry of x is denoted by xd. Please incorporate pd and qd in your expression, and think carefully about how to treat words that do not appear in either the ham or the spam training set. Please assume that π = 0.5.

(b) Complete the function classify_new_email in file classifier.py to implement the MAP rule, and run it on the testing set. There are two types of errors in classifying unlabeled emails: a Type 1 error is the event that a spam email is misclassified as ham; a Type 2 error is the event that a ham email is misclassified as spam. Write down the number of errors of each type made by your classifier on the testing data. To avoid numerical underflow, work with the log probability log p(y | x) in your code.

(c) In practice, Type 1 and Type 2 errors lead to different consequences (or costs). Therefore, we may wish to trade off one type of error against the other in designing the classifier. For example, we usually want to achieve a very low Type 2 error, since the cost of missing a useful email can be severe, while we can tolerate a relatively high Type 1 error, as it merely causes inconvenience. Please provide a way to modify the decision rule in the classifier so that these two types of error can be traded off: the Type 2 error should decrease at the cost of the Type 1 error, and vice versa. Test your method on the testing set and provide the following plot: let the x-axis be the number of Type 1 errors and the y-axis be the number of Type 2 errors on the testing data set. Plot at least 10 points corresponding to different pairs of Type 1 and Type 2 errors obtained by adjusting the classification rule. The two end points of the plot should be: 1) the one with zero Type 1 errors; and 2) the one with zero Type 2 errors. The code should be included in file classifier.py.

(d) Why do we need Laplace smoothing? Briefly explain what would go wrong if we used maximum likelihood estimation in the training process, by considering a scenario in which a testing email contains both a word w1 that appears only in the ham training set (but not in the spam training set) and a word w2 that appears only in the spam training set (but not in the ham training set). How does Laplace smoothing resolve this issue?

The training and test data for this problem are taken from V. Metsis, I. Androutsopoulos and G. Paliouras, "Spam Filtering with Naive Bayes – Which Naive Bayes?", Proceedings of the 3rd Conference on Email and Anti-Spam (CEAS 2006), Mountain View, CA, USA, 2006.

2 Linear/Quadratic Discriminant Analysis for Height/Weight Data

When the feature vector is real-valued (instead of binary), a Gaussian vector model is appropriate. In this part of the lab, we use linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) on height/weight data, and visualize the classification of male and female persons based on height and weight.

Suppose that the data set contains N samples. Let xn = [hn, wn] be the feature vector, where hn denotes the height and wn denotes the weight of the person indexed by n. Let yn denote the class label: yn = 1 is male and yn = 2 is female. We model the class prior as p(yn = 1) = π and p(yn = 2) = 1 − π. For this problem, let π = 0.5.

For the class-conditional distributions, let μm be the mean of xn if the class label yn is male, and let μf be the mean of xn if the class label yn is female. For LDA, a common covariance matrix, denoted by Σ, is shared by both classes; for QDA, different covariance matrices, denoted by Σm and Σf, are used for male and female, respectively.
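The class-conditional distributions above are ordinary bivariate Gaussians. For reference, the sketch below evaluates the Gaussian density N(x; μ, Σ) at a set of points with numpy; util.py may already provide an equivalent helper, so treat this only as an illustration of the formula, with made-up parameter values rather than the estimates asked for in the questions.

```python
import numpy as np

def gaussian_density(X, mu, Sigma):
    """Evaluate the multivariate Gaussian density N(x; mu, Sigma) for each row x of X.

    X     : (M, d) array of points, e.g. [height, weight] pairs with d = 2
    mu    : length-d mean vector
    Sigma : (d, d) covariance matrix
    """
    d = mu.shape[0]
    diff = X - mu                                  # (M, d) deviations from the mean
    Sigma_inv = np.linalg.inv(Sigma)
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), computed row by row.
    quad = np.einsum('ij,jk,ik->i', diff, Sigma_inv, diff)
    norm_const = np.sqrt(((2 * np.pi) ** d) * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

# Made-up parameters and two query points, purely for illustration.
mu = np.array([68.0, 170.0])
Sigma = np.array([[9.0, 15.0], [15.0, 400.0]])
points = np.array([[68.0, 170.0], [62.0, 130.0]])
print(gaussian_density(points, mu, Sigma))
```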
Download ldaqda.zip under Modules/Lab1/ on Quercus and unzip the file. The data set for training is in file trainHeightWeight.txt, whereas the data set for testing is in file testHeightWeight.txt. Each file uses the same format: the first column contains the class labels, the second column the heights, and the third column the weights.

Please answer the questions below and complete the functions in ldaqda.py. File util.py contains a few functions/classes that will be useful in writing the code.

Questions

1. Training and visualization. We estimate the parameters of LDA and QDA from the training data in trainHeightWeight.txt and visualize the LDA/QDA model.

(a) Please write down the maximum likelihood estimates of the parameters μm, μf, Σ, Σm, and Σf as functions of the training data {xn, yn}, n = 1, 2, ..., N. The indicator function I(·) may be useful in your expressions.

(b) Once the above parameters are obtained, you can design a classifier to make a decision on the class label y of a new data point x. The decision boundary can be written as a linear equation in x in the case of LDA, and as a quadratic equation in x in the case of QDA. Please write down the expressions for these two boundaries.

(c) Complete the function discrimAnalysis in file ldaqda.py to visualize LDA and QDA. Please plot one figure for LDA and one figure for QDA. In both plots, the horizontal axis is the height with range [50, 80] and the vertical axis is the weight with range [80, 280]. Each figure should contain: 1) the N colored data points {xn, n = 1, 2, ..., N}, with the color indicating the corresponding class label (e.g., blue represents male and red represents female); 2) the contours of the conditional Gaussian distribution for each class (to create a contour plot, first build a two-dimensional grid over the range [50, 80] × [80, 280] using np.meshgrid; then compute the conditional Gaussian density at each point of the grid for each class; finally, use plt.contour, which takes the two-dimensional grid and the conditional Gaussian density on the grid as inputs and automatically produces the contours); 3) the decision boundary, which can also be created using plt.contour with an appropriate contour level. (A minimal sketch of this plotting workflow appears at the end of this section.)

2. Testing. We test the obtained LDA/QDA model on the testing data in testHeightWeight.txt. Complete the function misRate in file ldaqda.py to compute the misclassification rates for LDA and QDA, defined as the percentage of misclassified samples (both male and female) over all samples.

The data for this problem are taken from: K. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012.
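The following is a minimal sketch of the np.meshgrid/plt.contour workflow described in Question 1(c), as mentioned there. It draws the contours of a single class-conditional Gaussian over the required plotting range, using made-up parameters and scipy.stats.multivariate_normal for the density; the actual figures for the lab must be produced inside discrimAnalysis with your own estimated parameters (and you may use any density helper available in util.py instead of scipy).

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal   # used only for this illustration

# Build a two-dimensional grid over the plotting range: heights [50, 80], weights [80, 280].
h = np.linspace(50, 80, 100)
w = np.linspace(80, 280, 100)
H, W = np.meshgrid(h, w)              # both arrays have shape (100, 100)
grid = np.dstack((H, W))              # (100, 100, 2): a [height, weight] pair per grid cell

# Evaluate one class-conditional Gaussian on the grid (made-up parameters,
# not the maximum likelihood estimates asked for in Question 1(a)).
mu = np.array([68.0, 170.0])
Sigma = np.array([[9.0, 15.0], [15.0, 400.0]])
density = multivariate_normal(mean=mu, cov=Sigma).pdf(grid)   # shape (100, 100)

plt.figure()
plt.contour(H, W, density)            # contours of the conditional Gaussian
# The decision boundary can be drawn the same way: evaluate a discriminant
# (e.g. the difference of the two class log-posteriors) on the grid and call
# plt.contour(H, W, discriminant, levels=[0]).
plt.xlabel('height')
plt.ylabel('weight')
plt.show()
```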