Assignment 2
STATS 3860B/9155B, Winter 2023

This assignment is due Friday, March 10th, at 11:55 pm (EST).

You must write your R code and answers using Rmarkdown, generating a single pdf file. Submissions must be made via Gradescope. You must carefully assign each question part to its corresponding page (or pages) on your pdf file. Question parts with no pages assigned to them will receive zero marks.

Each student must submit their own work. Scholastic offences are taken seriously, and students are directed to read the appropriate policy, specifically, the definition of what constitutes a Scholastic Offence, at the following Web site: http://www.uwo.ca/univsec/pdf/academic_policies/appeals/schol...
Question 1
The data set NMES1988 in the AER package contains a sample of individuals over 65 who are covered by Medicare in order to assess the demand for health care through physician office visits, outpatient visits, ER visits, hospital stays, etc. The data can be accessed by installing and loading the AER package and then running data(NMES1988). More background information and references about the NMES1988 data can be found in the help pages for the AER package.
a) Show through graphical means that there are more respondents with 0 visits than might be expected under a Poisson model.
b) Fit a ZIP model for the number of physician office visits using chronic, health, and insurance as predictors for the Poisson count, and chronic and insurance as the predictors for the binary part of the model. Then, provide interpretations in context for the following estimated model parameters:
   - coefficient of chronic in the Poisson part of the model
   - coefficient of poor health in the Poisson part of the model
   - the intercept in the logistic part of the model
   - coefficient of insurance in the logistic part of the model
c) Use the estimated coefficients from part b) to calculate the predicted probability of “always zero” (never visiting the doctor) for an individual who has insurance and two chronic conditions. Then check your answer using the predict() function to compute the same probability.
d) Use the estimated coefficients from part b) to calculate the predicted probability of visiting the doctor five times for an individual with poor health, four chronic conditions and without insurance. Then check your answer using the predict() function to compute the same probability.
Question 2
The dataset melanoma gives data on a sample of patients suffering from melanoma (skin cancer) cross-classified by the type of cancer and the location on the body.

suppressMessages(library(faraway))
str(melanoma)
## 'data.frame': 12 obs. of 3 variables:
##  $ count: num 22 16 19 11 2 54 33 17 10 115 ...
##  $ tumor: Factor w/ 4 levels "freckle","indeterminate",..: 1 4 3 2 1 4 3 2 1 4 ...
##  $ site : Factor w/ 3 levels "extremity","head",..: 2 2 2 2 3 3 3 3 1 1 ...

a) Display the data in a two-way table. Make a mosaic plot and comment on the evidence of independence.
b) Check for independence between site and tumour type using a Chi-squared test.
c) Fit a Poisson GLM model and use it to check for independence.
d) Make a two-way plot of the deviance residuals from the last model. Comment on the larger residuals.
Question 3
The UCBAdmissions dataset presents data on applicants to graduate school at Berkeley for the six largest departments in 1973 classified by admission and sex.

suppressMessages(library(faraway))
str(UCBAdmissions)
## 'table' num [1:2, 1:2, 1:6] 512 313 89 19 353 207 17 8 120 205 ...
##  - attr(*, "dimnames")=List of 3
##   ..$ Admit : chr [1:2] "Admitted" "Rejected"
##   ..$ Gender: chr [1:2] "Male" "Female"
##   ..$ Dept : chr [1:6] "A" "B" "C" "D" ...
a) Show that this provides an example of Simpson’s paradox.
b) Determine the most appropriate dependence model between the variables. Explain in detail each step you take.
c) Fit a binomial regression with admissions status as the response and show the relationship to your model in the previous question.

Question 4

The hsb data was collected as a subset of the “High School and Beyond” study conducted by the National Education Longitudinal Studies program of the National Center for Education Statistics. The variables are gender, race, socioeconomic status (SES), school type, chosen high school program type, and scores on reading, writing, math, science and social studies. The response variable is the chosen high school program type (prog), which is multinomial with 3 levels.

b) Interpret the coefficients corresponding to the five subjects (scores on reading, writing, math, science and social studies) in terms of odds.
c) Regarding part b), identify which one of the five subjects gives unexpected results and suggest an explanation for this behavior. Any reasonable explanation will be accepted.

Question 5

An earlier study examined the effect of workplace rules in Minnesota which require smokers to smoke cigarettes outside. The number of cigarettes smoked by smokers in a 2-hour period was recorded, along with whether the smoker was at home or at work. A (very) small subset of the data appears in Table 1 and it is also available in the smoking.csv file.

Table 1: A small subset of hypothetical data on Minnesota workplace rules on smoking.

subject   x (location)   y (cigarettes)
1         0              3

- Model 1: Assume that $Y \sim \text{Poisson}(\lambda)$; there is no difference between home and work.
- Model 2: Assume that $Y \sim \text{Poisson}(\lambda_W)$ when the smoker is at work, and $Y \sim \text{Poisson}(\lambda_H)$ when the smoker is at home.
- Model 3: Assume that $Y \sim \text{Poisson}(\lambda)$ and $\log(\lambda) = \beta_0 + \beta_1 x$.

a) Write out the likelihood $L(\lambda)$ and log-likelihood $\log L(\lambda)$ in Model 1. Use the data values in Table 1, and simplify where possible.
b) Intuitively, what would be a reasonable estimate for $\lambda$ based on this data? Why?
c) Use R to produce a plot of the likelihood function $L(\lambda)$. Find the maximum likelihood estimator for $\lambda$ in Model 1 using an optimization routine in R (for example, optim(), but not the glm() function).
d) Write out the log-likelihood function $\log L(\lambda_W, \lambda_H)$ in Model 2. Use the data values in Table 1, and simplify where possible.
e) Intuitively, what would be reasonable estimates for $\lambda_W$ and $\lambda_H$ based on this data? Why?
f) Find the maximum likelihood estimators for $\lambda_W$ and $\lambda_H$ in Model 2 using an optimization routine in R (for example, optim(), but not the glm() function).
g) Write out the log-likelihood function $\log L(\beta_0, \beta_1)$ in Model 3. Again, use the data values in Table 1, and simplify where possible.
h) Find the maximum likelihood estimators for $\beta_0$ and $\beta_1$ in Model 3 using an optimization routine in R (for example, optim(), but not the glm() function). Use R to produce a 3D plot of the log-likelihood function.
i) Confirm your estimates for Model 1 and Model 3 using glm(). Then show that the MLEs for Model 3 agree with the MLEs for Model 2.

Question 6

This question refers to exercise 4 of Chapter 8 of the textbook; however, instead of working with the Galapagos data, you will work with the dataset in Table 1 from the previous question. The purpose of this question is to reproduce the details of the GLM fitting of this data via the IRWLS algorithm.

a) Consider the Poisson GLM fit for Model 3 in part i) of the previous question.
Report the estimated values of the coefficients and deviance.

For parts b), c), d), e), f) and g) refer to Exercise 4 Chapter 8 of the textbook (page 172).

Camila de Souza, SS 3860B/9055B - Deviance, Winter 2023

Use of deviance to measure goodness of fit

We can measure how well our proposed model fits the data by comparing it to the saturated (or full) model.

A saturated model is achievable by adding sufficiently many parameters, in most cases as many parameters as the number of data points.

This comparison (saturated versus proposed) can be done by calculating the so-called deviance, that is, twice the difference between the log likelihood of the saturated model (Sat) and the log likelihood of our proposed model (M):

$D_M = 2(\log L_{Sat} - \log L_M)$

Gaussian linear regression: $D_M = RSS$. However, we cannot use RSS as a test statistic for goodness of fit because of the variance (dispersion) parameter. Instead, we can use $R^2 = 1 - RSS/TSS$.

Binary logistic regression: $D_M = -2 \log L_M$, which also cannot be used as a test statistic for goodness of fit. For logistic regression we can use the Hosmer-Lemeshow (HL) test.

In Poisson or Binomial regression we can use $D_M$ as a test statistic to assess the goodness of fit of the proposed model.

$H_0$: no lack of fit versus $H_1$: lack of fit

Under $H_0$, $D_M \sim$ approx. $\chi^2_{n-p}$, where $n$ is the number of observations and $p$ the number of parameters in the model.

So, we reject “$H_0$: no lack of fit” if $P(D_M > D_{obs})$ is sufficiently small, say smaller than 0.05.

Note: For Poisson and Binomial regression we can also use the Pearson $\chi^2$ statistic instead of $D_M$ to conduct the goodness of fit test.

Quasi-Binomial or Quasi-Poisson models:
- $\log L$ is approximated by a function $Q$ that allows for extra variation
- $Q_{Sat} = 0$
- We therefore cannot use the deviance based on $Q$ to assess goodness of fit

Use of deviance to compare two nested models

For logistic regression consider the likelihood ratio test statistic (LRT):

$LRT = 2 \log(L_L / L_S) = D_S - D_L \sim$ approx. $\chi^2_{l-s}$

$H_0$: smaller model is correct, that is, $H_0: \theta \in \Theta_0$ vs. $H_1: \theta \in \Theta$, where $\Theta_0 \subset \Theta$.

We reject the null hypothesis (that the simpler model is consistent with the data), and therefore select the larger model over the smaller one, if $LRT_{obs}$ exceeds the 95% quantile of the reference (null) distribution, that is, if $P(LRT > LRT_{obs}) < 0.05$.

If $LRT_{obs}$ lies below this quantile then the null hypothesis is not rejected, and the smaller model is preferred over the larger one.

For Poisson and Binomial regression:
- If there is no under- or overdispersion: $D_S - D_L \sim$ approx. $\chi^2_{l-s}$
- If there is under- or overdispersion: use a statistic based on $(D_S - D_L)/(df_S - df_L)$

where $df_S - df_L = l - s$ is the difference in the number of parameters between the large and small models.

Note: for quasi-likelihood methods we must use F-tests to compare models because there is also the
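The goodness-of-fit test and the nested-model comparison described in these slides can be sketched in R as follows. This is only an illustration under stated assumptions: the data are simulated, and the seed, sample size, coefficients, and object names (`small`, `large`, `D_check`, and so on) are hypothetical and not taken from the course materials.

```r
# Hypothetical simulated data; not from the course materials.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)                                 # noise: not used to generate y
y  <- rpois(n, lambda = exp(0.5 + 0.3 * x1))

small <- glm(y ~ x1,      family = poisson)    # s = 2 parameters
large <- glm(y ~ x1 + x2, family = poisson)    # l = 3 parameters

# Goodness of fit of the small model:
# compare D_M with a chi-squared distribution on n - p degrees of freedom.
D_M   <- deviance(small)
df_M  <- df.residual(small)                    # n - p
p_gof <- pchisq(D_M, df = df_M, lower.tail = FALSE)

# Check the definition D_M = 2 * (log L_Sat - log L_M).
# The saturated Poisson model sets each fitted mean equal to y_i.
logL_sat <- sum(dpois(y, lambda = y, log = TRUE))
D_check  <- 2 * (logL_sat - as.numeric(logLik(small)))

# Nested comparison: LRT = D_S - D_L on l - s degrees of freedom.
LRT   <- deviance(small) - deviance(large)
df_LR <- df.residual(small) - df.residual(large)   # l - s
p_lrt <- pchisq(LRT, df = df_LR, lower.tail = FALSE)

# The same comparison through anova().
tab <- anova(small, large, test = "Chisq")

c(D_M = D_M, p_gof = p_gof, LRT = LRT, p_lrt = p_lrt)
```

Computing `D_check` by hand is there only to confirm that `deviance()` matches the definition on the slides. For quasi-Poisson or quasi-Binomial fits, `anova()` accepts `test = "F"` for the corresponding F-test comparison.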