MSBA7027 Machine Learning
Homework 2
Due 11:59 pm, Dec. 28, 2022

Notes:
- You are required to submit 1) the original R Markdown file and 2) a knitted HTML or PDF file via Moodle. Please provide comments for R code wherever you see appropriate. In general, be as concise as possible while giving a fully complete answer. Nice formatting of the assignment will receive extra points.
- Remember that the Class Policy strictly applies to homework. You are encouraged to work in groups and discuss with fellow students. However, each student has to know how to answer the questions on her/his own.
- Please allow some buffer time and do not submit homework at the last moment. You will have points deducted if you submit the above two items late.

Question. Load the dataset from HW2_house_dataset.csv. You will implement some tree-based methods to predict housing prices. Basic characteristics of the dataset are given as follows:
- Problem type: supervised learning, regression
- Response variable: selling price of houses (in log10 units)
- Data variable name in R: “price”
- Number of features: 17
- Number of observations: 21,613
- Task: use house attributes to predict the sale price of a house

Please perform the following tasks:
(1) Set a seed.
(2) Perform stratified sampling, using 80% of the data for training and 20% for testing. Do not touch the testing data until the last problem (7).
(3) Perform random forest (RF) on the training data. Find the best tuning parameters and describe how you find them, and after that report the smallest cross-validated RMSE on the training data. Which four predictors are the most important? Obtain PDPs for these four predictors, describe them and provide possible explanations.
(4) Repeat (3) for the basic GBM algorithm.
(5) (Optional; completing this part will earn you up to 5 bonus points.) Repeat (3) for the XGBoost algorithm.
(6) Are the four most important variables different in (3)-(4)? (Or in (3)-(5), if you have done (5).)
(7) Among RF and GBM (and XGBoost, if you have done (5)) with their own best tuning parameters, which one has the smallest cross-validated RMSE on the training data? Choose that method, refit the model with all of the training data, use that model to make predictions on the testing data, and report the RMSE for the testing data. Is the obtained RMSE smaller or larger than the cross-validated RMSE?
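As a rough illustration of tasks (1)-(2) and the RMSE reported in (3)-(7), the base R sketch below sets a seed, splits a data frame 80/20 within quantile bins of the response, and defines an RMSE helper. The assignment does not specify how to stratify, so binning `price` into quartiles is an assumption. The helper names `stratified_split` and `rmse`, the seed value, and the simulated `toy` data are all hypothetical stand-ins.

```r
# Stratified 80/20 split on a numeric response (sketch, not the official solution).
# Assumption: strata are quantile bins of the response; `bins = 4` is arbitrary.
stratified_split <- function(data, response = "price", prop = 0.8, bins = 4) {
  y <- data[[response]]
  # Cut the response into quantile bins so every bin is sampled at the same rate.
  breaks <- quantile(y, probs = seq(0, 1, length.out = bins + 1))
  strata <- cut(y, breaks = unique(breaks), include.lowest = TRUE)
  # Within each bin, draw `prop` of the row indices for the training set.
  train_idx <- unlist(lapply(split(seq_along(y), strata), function(idx) {
    idx[sample.int(length(idx), size = floor(prop * length(idx)))]
  }), use.names = FALSE)
  list(train = data[train_idx, , drop = FALSE],
       test  = data[-train_idx, , drop = FALSE])
}

# Root mean squared error between observed and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

set.seed(123)  # arbitrary seed, chosen only for illustration

# Simulated stand-in for the real data so the sketch runs without the CSV file.
toy <- data.frame(price = rnorm(1000, mean = 5.7, sd = 0.2),
                  sqft_living = runif(1000, 500, 5000))
parts <- stratified_split(toy)
nrow(parts$train)
nrow(parts$test)
```

With the real data, `toy` would be replaced by the data frame read from HW2_house_dataset.csv, and `parts$test` would be set aside until task (7).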
Appendix: Description of Features
- price (numeric): sale price (log10 units)
- bedrooms (numeric): number of bedrooms
- bathrooms (numeric): number of bathrooms
- sqft_living (numeric): size of living space
- sqft_lot (numeric): size of property
- floors (numeric): number of floors
- waterfront (numeric): binary indicator for a waterfront view
- view (numeric): rating of the quality of the view
- condition (factor): condition of the house (poor to very good)
- sqft_above (numeric): size of living space above ground
- sqft_basement (numeric): size of living space below ground
- yr_built (numeric): year built
- year_renovated (numeric): year renovated and, if not renovated, the year built
- zip_code (factor): zip code
- latitude (numeric): latitude
- longitude (numeric): longitude
- nn_sqft_living (numeric): size of living space of 15 neighbors
- nn_sqft_lot (numeric): size of lot of 15 neighbors
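
The column types listed in the appendix can be enforced when the data are loaded. The sketch below is one possible way to do this in base R, assuming the CSV sits in the working directory and its column names match the appendix. The helper `coerce_house_types` is a hypothetical name, not part of the assignment.

```r
# Convert the columns the appendix lists as factors; leave the others untouched.
# Assumption: the CSV column names match the appendix exactly.
coerce_house_types <- function(df) {
  factor_cols <- intersect(c("condition", "zip_code"), names(df))
  for (col in factor_cols) {
    df[[col]] <- factor(df[[col]])
  }
  df
}

path <- "HW2_house_dataset.csv"  # file name from the assignment; location is assumed
if (file.exists(path)) {
  house <- coerce_house_types(read.csv(path))
  str(house)  # check that price plus the 17 features were read with the expected types
}
```

Tree-based methods in R generally treat factor and numeric predictors differently, so it may be worth checking the types before fitting any model.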