STAT6030 GENERALISED LINEAR MODELLING
The Australian National University
Assignment 2
2023 Summer Session

1. Instructions

This assignment is worth 55 marks in total and 25% of your overall marks for this course. The assignment is compulsory and must be submitted by 5pm on Monday 6 March 2023.

You must write your answers to this assignment individually and by yourself. If you copy someone else's work or allow your work to be copied, you will receive a mark of zero for the assignment and risk severe academic consequences.

Your answers should be individually submitted through Turnitin on Wattle as a single pdf/Word document (less than 50MB) including the following:

1. The assignment Cover Sheet (available on Wattle).
2. Your answers (no more than 10 pages including graphs, summaries, tables, etc., but not the Appendix and Cover Sheet, and respecting the other requirements for each part).
3. An Appendix including all the R commands you used (no page limit).

Assignments should be typed and not handwritten. Your assignment may include some carefully edited R output (e.g., graphs, summaries, tables, etc.) and appropriate discussion of these results, as well as some selected R commands. Please be selective about what you present and only include as many pages and as much R output as necessary to justify your solution. Clearly label each part and question of your assignment and appendix with the corresponding numbers.

Unless otherwise advised, use a significance level of 5%. Round numeric answers to 4 decimal places (e.g., 0.00115 is rounded to 0.0012).

Marks will be deducted if these instructions are not strictly respected, especially when the total report is of an unreasonable length, i.e., more than the above page limit. The Appendix will generally not be marked; it will be checked if what you have written or done needs clarification.

Name your submission “CourseCode Uid”, e.g., “STAT6030 u1234567”. Try to submit your assignment at least 30 minutes before the deadline in case something unexpected happens, for instance an internet connection problem.

Late submissions will NOT be accepted. Extensions will usually be granted on medical or compassionate grounds on production of appropriate evidence, but must receive the lecturer's approval at least 24 hours before the deadline.

2. Part 1

Please provide your answers to the following questions and include short working out if there is any. There is a limit of 3 pages on your answers for Part 1.

(a) [1 mark] What is the definition of the canonical link function in the context of generalised linear models?

(b) [1 mark] Explain in words and/or by drawing a plot when a link function of a generalised linear model is valid.

(c) [1 mark] In the context of generalised linear models, does the value of the maximised log-likelihood for the saturated model depend on the choice of link function, and why?

(d) [1 mark] The mean of a generalised linear model is known to lie between 1 and 2 whatever the value of the linear predictor $\eta_i = x_i^\top \beta$ is, i.e. $1 < \mu_i < 2$. Let $\Phi$ denote the cumulative distribution function of the standard normal distribution $N(0, 1)$ and $\Phi^{-1}$ denote the inverse function of $\Phi$. Which function below is an appropriate link function in this setting?
Notes: (i) precisely one answer below is correct and the other ones are incorrect; (ii) an incorrect answer scores zero while the correct answer scores full marks for the question.
A. $\eta_i = g(\mu_i)$, where $g(\mu_i) = \Phi(\mu_i - 1)$.
B. $\eta_i = g(\mu_i)$, where $g(\mu_i) = \Phi(\mu_i/2)$.
C. $\eta_i = g(\mu_i)$, where $g(\mu_i) = \Phi^{-1}(\mu_i - 1)$.
D. $\eta_i = g(\mu_i)$, where $g(\mu_i) = \Phi^{-1}(\mu_i/2)$.

(e) [1 mark] The gamma distribution has probability density function
$f(y; \alpha, \beta) = \{\beta^{\alpha}/\Gamma(\alpha)\}\, y^{\alpha-1} \exp(-\beta y)$,
where $y > 0$, $\alpha > 0$ is a shape parameter, $\beta > 0$ is a rate parameter and $\Gamma(\cdot)$ is the gamma function. You may assume that
(i) the mean $\mu$ of the gamma distribution is given by $\mu = \alpha/\beta$;
(ii) the gamma distribution is a generalised linear model with dispersion parameter $\phi = 1/\alpha$, in the notation of equation (4.1) of Topic 4.
What is the canonical link function when the generalised linear model is gamma?

(f) [3 marks] The geometric distribution has probability mass function $f(y; p) = (1 - p)\,p^{y}$, for $y = 0, 1, \ldots$, where $0 < p < 1$. What are the canonical link function and variance function of the geometric distribution?
The deviance residual for observation $i$ is given by $\mathrm{sign}(y_i - \hat{\mu}_i)\sqrt{d_i^2}$, where
$d_i^2 = 2\,[\,y_i\{h(y_i) - h(\hat{\mu}_i)\} - \{b(h(y_i)) - b(h(\hat{\mu}_i))\}\,]$
is the deviance associated with observation $i$, which is written as a function of the response variable $y_i$ and of the fitted value $\hat{\mu}_i$, while $\mathrm{sign}(\cdot)$ is the sign function defined in the lecture notes. Also recall that $b'^{-1}(\mu) \equiv h(\mu)$. What is the expression for $d_i^2$, as a function of $y_i$ and $\hat{\mu}_i$, when the generalised linear model is geometric? Please simplify your expression as much as you can.

(g) [1 mark] Consider a generalised linear model with linear predictor $\eta_i = \upsilon_i + x_i^\top \beta$, where $\upsilon_i$ is an offset, $x_i$ is a vector of covariates of length $p$ and $\beta$ is a parameter vector of length $p$ to be estimated. Assuming that the model's dispersion parameter $\phi = 1$ is known, how many free parameters (i.e., parameters to estimate) are there in this model?

(h) [1 mark] A logistic regression model was fitted to a dataset consisting of a binary outcome variable, $y_i$, taking values 0 and 1, and a single numerical covariate $x_i$. The estimated intercept and slope on the linear predictor scale were found to be $-0.47$ and $1.3$, respectively, so that the linear predictor as a function of $x_i$ is given by
$\hat{\eta}(x_i) = -0.47 + 1.3\,x_i$.
Recall that the estimated probability $\mathrm{Prob}[y_i = 1 \mid x_i]$ is given by
$\mathrm{Prob}[y_i = 1 \mid x_i] = \exp\{\hat{\eta}(x_i)\}/[1 + \exp\{\hat{\eta}(x_i)\}]$
and so the estimated probability $\mathrm{Prob}[y_i = 0 \mid x_i]$ is given by $1 - \mathrm{Prob}[y_i = 1 \mid x_i]$. What is the value of $x_i$ such that the odds of the event $y_i = 1$ is 0.75? Recall that the odds of an event that occurs with probability $\pi$ is given by $\pi/(1 - \pi)$.

(i) [2 marks] Consider a distribution with the probability density function
$f(y; \mu) = (2\pi y^3)^{-1/2} \exp[-(y - \mu)^2/(2\mu^2 y)]$,
where $\mu$ is the mean of the distribution and $y > 0$. What is the variance function, $V(\mu)$, of this distribution?

(j) [1 mark] The following output from a linear regression model fit in R was obtained. Calculate the value for ++++ that the R program would give if the sample size is 10.

    Call:
    lm(formula = y ~ x)

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) -0.08888    0.66793  -0.133    0.897
    x            1.06903    0.10765    ????     ++++

(k) [1 mark] Suppose we fit a Poisson regression model A with log link to a dataset whose response variable is a count. No offset is included. In the fitted model we have included a covariate $x$ and the estimated coefficient of $x$ is $\hat{\beta}_A$. Suppose that we then decide to fit a second model B which is the same as model A but with $x$ included as an offset as well as included in the linear predictor as before. Suppose the estimated coefficient of $x$ is $\hat{\beta}_B$ in model B. Which of the following statements about the second fitted model is correct?
Notes: (i) precisely one answer below is correct and the other ones are incorrect; (ii) an incorrect answer scores zero while the correct answer scores full marks for the question.
A. $\hat{\beta}_B = \hat{\beta}_A - 1$ and the residual deviance of model B will (usually) change compared to that of model A.
B. $\hat{\beta}_B = \hat{\beta}_A - 1$ and the residual deviance of model B will not change compared to that of model A.
C. $\hat{\beta}_B = \hat{\beta}_A + 1$ and the residual deviance of model B will (usually) change compared to that of model A.
D. $\hat{\beta}_B = \hat{\beta}_A + 1$ and the residual deviance of model B will not change compared to that of model A.

(l) [2 marks] Suppose we have fitted a Poisson log-linear regression with extra-Poisson variation and the estimate of the dispersion parameter $\phi$ is greater than 1. If the standard Poisson model was used in this situation, would this be likely to be a case of underdispersion or overdispersion, and which assumption about the relationship between the mean and variance of the Poisson distribution should fail? What would happen to the estimates of the $\beta$ parameters for the standard Poisson model?
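The odds relation recalled in question (h) can be sketched numerically. The following is a minimal illustration only: the coefficients `b0` and `b1` and the target odds are hypothetical values chosen for the example, not the values in the question. It relies on the fact that, under a logit link, the odds equal the exponential of the linear predictor.

```r
# Hypothetical logistic regression coefficients (assumed for illustration,
# NOT the values given in the assignment).
b0 <- 0.2   # intercept on the logit scale
b1 <- 0.5   # slope on the logit scale

# Under a logit link, odds(x) = p / (1 - p) = exp(b0 + b1 * x),
# so the covariate value giving a target odds solves b0 + b1 * x = log(odds).
target_odds <- 2
x_star <- (log(target_odds) - b0) / b1

# Check: map the linear predictor back to a probability and recompute the odds.
p_star <- plogis(b0 + b1 * x_star)
odds_check <- p_star / (1 - p_star)

print(c(x = x_star, prob = p_star, odds = odds_check))
```

Working on the log-odds scale keeps the calculation linear in the covariate, which is why the inversion reduces to a single line.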

3. Part 2

Different doses of two chemicals, A and B, were used in a trial whose purpose was to reduce cockroach numbers. The variable x1 gives the dose of chemical A and the variable x2 gives the dose of chemical B. In the R code below, the first column of c gives the number of cockroaches killed and the second column of c gives the number of cockroaches that survived. The following R outputs were obtained:

Please provide your answers to the following questions and include short working out if there is any. There is a limit of 2 pages on your answers for Part 2.

(a) [1 mark] What type of generalised linear model is being fitted here and what link function is being used?

(b) [5 marks] Determine the missing information indicated by the letters A, B, C, D, E, F, G, H, J and K. Note that for E you are required to specify the link function.

(c) [2 marks] Write down the relevant model in mathematical form, focusing on the contribution of observation i to the model.

(d) [2 marks] Briefly indicate your impressions of the results of the statistical analysis provided above.

(e) [2 marks] What are the next questions you would investigate in the statistical analysis? State what your next two steps would be.

Part 3 [12 Marks]

The presence of sprouted or diseased kernels in wheat can reduce the value of a wheat producer's entire crop. It is important to identify these kernels after being harvested but prior to sale. To facilitate this identification process, automated systems have been developed to separate healthy kernels from the rest. Improving these systems requires a better understanding of the measurable ways in which healthy kernels differ from kernels that have sprouted prematurely or are infected with a fungus. To this end, Martin et al. (1998) conducted a study examining numerous physical properties of kernels (density, hardness, size, weight, and moisture) measured on a sample of wheat kernels from two different classes of wheat, hard red winter (hrw) and soft red winter (srw) (represented by the categorical variable class) in the wheat.csv dataset on Wattle. Each kernel's condition was also classified as “Healthy”, “Partly Diseased” and “Diseased” by human visual inspection (represented by the categorical variable type2).

Please provide your answers to the following questions and include short working out if there is any. There is a limit of 3 pages on your answers for Part 3.

Throughout the following questions, treat type2 as the response variable.

Suppose that we have conducted the following R analysis and obtained the R output below:

(a) [2 marks] Describe the interpretations of coefficient estimates -10.95451 and -0.6480912 in the summary() output.

(b) [2 marks] What are the null and alternative hypotheses corresponding to the p-value 0.0291 in the Anova() output? What conclusion can you obtain based on the p-value?

(c) [2 marks] Suppose we have a new observation of the following form:

    > xnew = data.frame(class = 'srw', density = 1, hardness = 25, size = 2,
    +                   weight = 25, moisture = 12)
    > xnew
      class density hardness size weight moisture
    1   srw       1       25    2     25       12

If we use predict(), what are the predicted probabilities for the different categories of the response type2 and what is the prediction of the response type2 for this new observation?

Suppose that we conducted further R analysis and obtained the R output below:

(d) [2 marks] Describe the interpretations of coefficient estimates -0.17370 and 13.50540 in the summary() output, respectively.

(e) [2 marks] What are the null and alternative hypotheses corresponding to the p-value 0.65749 in the Anova() output? What conclusion can you obtain based on the p-value?

(f) [2 marks] Fit a nominal logistic regression model and an ordinal logistic regression model, respectively, with covariates class, density, hardness, size, weight, moisture, class:density, class:hardness, class:size, class:weight and class:moisture. Based on the model fitting results, which model is better? Please explain why this model is better.
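The workflow behind Part 3 (fit a nominal and an ordinal logistic regression, predict category probabilities for a new observation, compare the fits) can be sketched on simulated data. This is a rough sketch under stated assumptions: the data frame `sim`, the covariate `x` and the response `condition` are hypothetical stand-ins, not the wheat.csv data, and the choice of `nnet::multinom()` and `MASS::polr()` is an assumption, since the functions used in the assignment's R output are not shown.

```r
library(nnet)  # multinom(): nominal (baseline-category) logistic regression
library(MASS)  # polr(): ordinal (proportional-odds) logistic regression

set.seed(1)
n <- 300
x <- rnorm(n)                      # hypothetical numeric covariate
latent <- 1.5 * x + rlogis(n)      # latent variable of a proportional-odds model
condition <- cut(latent, breaks = c(-Inf, -1, 1, Inf),
                 labels = c("Healthy", "Partly", "Diseased"),
                 ordered_result = TRUE)
sim <- data.frame(condition, x)    # simulated data, NOT wheat.csv

# Fit both model types to the same response and covariate.
fit_nom <- multinom(condition ~ x, data = sim, trace = FALSE)
fit_ord <- polr(condition ~ x, data = sim, method = "logistic")

# Predicted category probabilities and predicted class for a new observation.
xnew_sim <- data.frame(x = 0.5)
p_nom <- predict(fit_nom, newdata = xnew_sim, type = "probs")
p_ord <- predict(fit_ord, newdata = xnew_sim, type = "probs")
class_nom <- predict(fit_nom, newdata = xnew_sim, type = "class")

print(rbind(nominal = p_nom, ordinal = p_ord))
print(class_nom)

# One possible comparison criterion: AIC (smaller is preferred).
print(c(nominal = AIC(fit_nom), ordinal = AIC(fit_ord)))
```

AIC is only one way to compare the two fits; because the ordinal model imposes the proportional-odds constraint, a likelihood-ratio comparison of the two deviances is another option worth considering.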

4. Part 4

An analysis of some ship damage data is presented below. The data consists of a factor typ, corresponding to ship type, with 3 levels, A, B and C; a factor cons, corresponding to the period of construction of the ship, with 3 levels, 1960-1964, 1965-1969 or 1975-1979; a factor opr, corresponding to years of operation of the ship, with 2 levels, either 1960-1975 or 1975-1979; a numerical variable mnths, corresponding to the total number of months at risk; and dmge, corresponding to the number of damage incidents reported for the ship. The following R output was obtained.

Please provide your answers to the following questions and include short working out if there is any. There is a limit of 2 pages on your answers for Part 4.

(a) [1 mark] What type of generalised linear model is being fitted here to obtain the output out1 and what link function is being used?

(b) [7 marks] Determine the missing information indicated by the letters A, B, C, D, E, F, G, H, J, K, M, N, P and Q. Note that F should consist of either a blank, a dot, one star, two stars or three stars; and for J you should specify the link function that was used. All the other letters apart from A represent a number.

(c) [2 marks] Explain what is meant by an offset and the motivation for offsetting L=log(mnths) rather than mnths itself.

(d) [2 marks] Using the R printout for out1, give the value of the linear predictor for a ship of Type A that was constructed in the period 1965-1969 and operated in the period 1975-1979, assuming that mnths=1095.

(e) [3 marks] Write down brief notes on what you would conclude about the wave damage data from the R output. Can we draw any conclusions as to whether overdispersion is present in this dataset? What action would you consider taking if overdispersion were suspected to be present?
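The role of a log-exposure offset and a simple overdispersion check can be sketched on simulated counts. This is a minimal illustration under stated assumptions: `ship_type`, `months` and `incidents` are hypothetical simulated variables, not the ship damage data above, and the quasi-Poisson refit is one common option rather than the analysis used in the assignment.

```r
set.seed(2)
n <- 60
ship_type <- factor(sample(c("A", "B", "C"), n, replace = TRUE))  # hypothetical factor
months <- round(runif(n, 100, 2000))                              # hypothetical exposure

# Incident rate per month depends on type; expected count = months * rate.
log_rate <- -6 + c(A = 0, B = 0.4, C = -0.3)[as.character(ship_type)]
incidents <- rpois(n, months * exp(log_rate))

# With a log link, offset(log(months)) gives
#   log(mu) = log(months) + x'beta, i.e. mu = months * exp(x'beta),
# so the coefficients describe the rate per month at risk.
fit_pois <- glm(incidents ~ ship_type + offset(log(months)), family = poisson)

# Quasi-Poisson refit: same coefficients, but the dispersion is estimated.
fit_quasi <- glm(incidents ~ ship_type + offset(log(months)), family = quasipoisson)

# The estimated dispersion is the Pearson chi-square statistic divided by
# the residual degrees of freedom; values well above 1 suggest overdispersion.
pearson_disp <- sum(residuals(fit_quasi, type = "pearson")^2) / df.residual(fit_quasi)
disp_summary <- summary(fit_quasi)$dispersion

print(coef(fit_pois))
print(c(pearson = pearson_disp, summary = disp_summary))
```

Because these counts are simulated from a Poisson distribution, the estimated dispersion should be near 1 here; real data with unmodelled heterogeneity would typically give a larger value.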