
Introduction

So far at Queen Mary, the statistics modules have taught the classical or frequentist approach, which is based on the idea that probability represents a long-run limiting frequency. In the Bayesian approach, any uncertain quantity is described by a probability distribution, and so probability represents a degree of belief in an event which is conditional on the knowledge of the person concerned. This course will introduce you to Bayesian statistics. These notes are self-contained, but you may want to read other accounts of Bayesian statistics as well. A useful introductory textbook is:

- Bayesian Statistics: An Introduction (4th ed.) by P M Lee. (This is available as an e-book from the library.)

Parts of the following are also useful:

- Bayesian Inference for Statistical Analysis by G E P Box and G C Tiao.
- Bayesian Data Analysis by A Gelman, J B Carlin, H S Stern and D B Rubin.
- Probability and Statistics from a Bayesian Viewpoint (Vol 2) by D V Lindley.

1. Likelihood

First we review the concept of likelihood, which is essential for Bayesian theory but can also be used in frequentist methods. Let y be the data that we observe, which is usually a vector. We assume that y was generated by some probability model which we can specify. Suppose that this probability model depends on one or more parameters, which we want to estimate.

Definition 1.1. If the components of y are continuous, then the likelihood is defined as the joint probability density function of y; if y is discrete, then the likelihood is defined as the joint probability mass function of y. In either case, we denote the likelihood as p(y | θ), where θ denotes the unknown parameter(s), and we regard it as a function of θ for the observed data y.

Example 1.1. Let y = y1, . . . , yn be a random sample from a normal distribution with unknown parameters µ and σ². The likelihood is

p(y | µ, σ²) = f(y1 | µ, σ²) × f(y2 | µ, σ²) × · · · × f(yn | µ, σ²),

where f is the normal probability density function with parameters µ and σ².

Example 1.2. Suppose we observe k successes in n independent trials, where each trial has probability of success q. Now the unknown parameter is q, the observed data is k, and the likelihood is the binomial probability mass function

p(k | q) = (n choose k) q^k (1 − q)^(n−k).

It is also possible to construct likelihoods which combine probabilities and probability density functions, for example if the observed data contains both discrete and continuous components. Alternatively, probabilities may appear in the likelihood if continuous data is only observed to lie within some interval.

Example 1.3. Assume that the time until failure for a certain type of light bulb is exponentially distributed with parameter λ, and we observe n bulbs, with failure times t = t1, . . . , tn. The likelihood contribution for a single observation ti is the exponential probability density function

p(ti | λ) = λ exp(−λ ti).

Suppose instead that we observe the failure time for the first m light bulbs with m < n, but for the remaining n − m bulbs we only observe that they have not failed by time ti. Then for i ≤ m, the likelihood contributions are as before. For i > m, the likelihood is the probability of what we have observed. Denoting the random variable for the failure time by Ti, we have observed that Ti > ti, so the likelihood contribution is

P(Ti > ti) = exp(−λ ti).

1.1 Maximum likelihood estimation

In example 1.1, the parameters of the normal distribution which generated the data are known as population parameters. An estimator is defined as a function of the observed data which we use as an estimate of a population parameter. For example, the sample mean and variance may be used as estimators of the population mean and variance, respectively.

To use the likelihood to estimate parameters, we choose the parameter values which maximize the likelihood, given the data y that we have observed. This is the method of maximum likelihood, and the resulting estimator is the maximum likelihood estimator (MLE). In our examples the unknown parameters are continuous, and those are the only examples we cover, but the idea of likelihood also makes sense if the unknown quantity is discrete.

When finding the MLE it is usually more convenient to work with the log of the likelihood: as the log is a monotonically increasing function, the value of the parameter which maximizes the log-likelihood also maximizes the likelihood.
Since the likelihood is typically a product of terms for independent observations, the log-likelihood is a sum of terms, so using the log greatly simplifies finding the derivatives in order to find the maximum. Returning to the binomial example 1.2, the log-likelihood is

log p(k | q) = log (n choose k) + k log q + (n − k) log(1 − q),

and setting the derivative with respect to q to zero gives the MLE k/n (a short R sketch of this calculation appears below). We also do not cover confidence intervals, but we do cover the Bayesian version, which are called credible intervals.
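As a quick numerical check of the binomial MLE above, here is a minimal R sketch; the values of n and k are made up for illustration, and the function name loglik is ours:

```r
# Numerical check of the binomial MLE (n and k are made up for illustration)
n <- 20
k <- 13

# Log-likelihood of q: log C(n, k) + k log q + (n - k) log(1 - q)
loglik <- function(q) dbinom(k, size = n, prob = q, log = TRUE)

# Maximise the log-likelihood over (0, 1)
fit <- optimize(loglik, interval = c(0, 1), maximum = TRUE)
fit$maximum   # numerical MLE, close to 0.65
k / n         # analytic MLE k/n = 0.65
```

Since the log is monotone, maximising the log-likelihood with optimize gives the same answer as maximising the likelihood itself, and it agrees with the analytic result k/n up to the optimizer's tolerance.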

2. Bayesian inference

2.1 Bayes' theorem

Bayes' theorem is a formula from probability theory that is central to Bayesian inference. It is named after Rev. Thomas Bayes, a nonconformist minister who lived in England in the first half of the eighteenth century. The theorem states that:

Theorem 2.1. Let Ω be a sample space and A1, A2, . . . , Am be mutually exclusive and exhaustive events in Ω (i.e. Ai ∩ Aj = ∅ for i ≠ j, and A1 ∪ A2 ∪ · · · ∪ Am = Ω; the Ai form a partition of Ω). Let B be any event with p(B) > 0. Then

p(Ai | B) = p(B | Ai) p(Ai) / [ p(B | A1) p(A1) + · · · + p(B | Am) p(Am) ].

The proof follows from the definition of conditional probabilities and the law of total probability.

Example 2.1. Suppose a test for an infection has 90% sensitivity and 95% specificity, and the proportion of the population with the infection is q = 1/2000. Sensitivity is the probability of detecting a genuine infection, and specificity is the probability of being correct about a non-infection. So p(+ve test | infected) = 0.9 and p(−ve test | not infected) = 0.95. What is the probability that someone who tests positive is infected?

Let the events be as follows:

A1: infected
A2: not infected
B: test positive

Then by Bayes' theorem,

p(A1 | B) = p(B | A1) p(A1) / [ p(B | A1) p(A1) + p(B | A2) p(A2) ]
          = (0.9 × 1/2000) / (0.9 × 1/2000 + 0.05 × 1999/2000)
          ≈ 0.0089.

So there is a less than 1% chance that the person is infected if they test positive.
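The arithmetic of Example 2.1 can be reproduced directly in R; this is just Bayes' theorem written out, using only the numbers given in the example (the variable names are ours):

```r
# Bayes' theorem for Example 2.1
sens <- 0.90        # p(+ve test | infected)
spec <- 0.95        # p(-ve test | not infected)
prev <- 1 / 2000    # p(infected)

# p(+ve test), by the law of total probability
p_pos <- sens * prev + (1 - spec) * (1 - prev)

# p(infected | +ve test), by Bayes' theorem
sens * prev / p_pos   # approximately 0.0089
```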
Bayes' theorem is also applicable to probability densities.

Theorem 2.2. Let X, Y be two continuous random variables (possibly multivariate) and let f(x, y) be the joint probability density function (pdf), f(x | y) the conditional pdf, etc. Then

f(x | y) = f(y | x) f(x) / f(y).
Alternatively, Y may be discrete, in which case f(y | x) and f(y) are probability mass functions.

2.2 Bayes' theorem and Bayesian inference

In the Bayesian framework, all uncertainty is specified by probability distributions. This includes uncertainty about the unknown parameters. So we need to start with a probability distribution for the parameters before we observe the data, called the prior distribution. Bayes' theorem then combines the prior distribution with the likelihood for the parameters to give the posterior distribution, the distribution of the parameters conditional on the observed data. So the procedure is as follows: start with a prior distribution for the parameters, write down the likelihood of the observed data, apply Bayes' theorem to obtain the posterior distribution, and base all inference about the parameters on this posterior distribution. The use of Bayes' theorem can be summarized as

Posterior distribution ∝ prior distribution × likelihood.

Example 2.2. Suppose a biased coin has probability of heads q, and we observe k heads in n independent coin tosses. We saw the binomial likelihood for this problem:

p(k | q) = (n choose k) q^k (1 − q)^(n−k).

For Bayesian inference, we need to specify a prior distribution for q. As q is a continuous quantity lying in [0, 1], a natural choice is a beta distribution, with pdf proportional to q^(a−1) (1 − q)^(b−1). The uniform distribution on [0, 1] is a special case of the beta distribution with a = b = 1. Multiplying the prior by the likelihood, the posterior distribution is proportional to q^(k+a−1) (1 − q)^(n−k+b−1), which is the Beta(k + a, n − k + b) distribution, with posterior mean (k + a)/(n + a + b). Recall that the maximum likelihood estimate is k/n.
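A minimal R sketch of this conjugate update, assuming the uniform Beta(1, 1) prior; the data values n and k are made up for illustration:

```r
# Conjugate beta-binomial update (n and k made up; a = b = 1 is the uniform prior)
n <- 20
k <- 13
a <- 1
b <- 1

post_a <- k + a              # posterior is Beta(k + a, n - k + b)
post_b <- n - k + b

post_a / (post_a + post_b)   # posterior mean (k + 1)/(n + 2) = 14/22, about 0.64
k / n                        # MLE = 0.65
qbeta(c(0.025, 0.975), post_a, post_b)   # 95% equal-tailed credible interval
```

The same lines work for any beta prior by changing a and b.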
So for large values of k and n, the Bayesian estimate and MLE will be similar, whereas they differ more for smaller sample sizes. In the special case that the prior distribution is uniform on [0, 1], the posterior distribution is Beta(k + 1, n − k + 1), and the posterior mean value for q is

E(q | k) = (k + 1)/(n + 2).

In the binomial example 2.2 with a beta prior distribution, we saw that the posterior distribution is also a beta distribution. When we have the same family of distributions for the prior and posterior for one or more parameters, this is known as a conjugate family of distributions. We say that the family of beta distributions is conjugate to the binomial likelihood.

In practical Bayesian inference, Markov Chain Monte Carlo methods, covered later in these notes, are the most common means of approximating the posterior distribution. These methods produce an approximate sample from the joint posterior density, and once this is done the marginal distribution of each parameter is immediately available.

Example 2.6. Generating samples from the posterior distribution may also be helpful even if we can calculate the exact posterior distribution. Suppose that the data are the outcome of a clinical trial of two treatments for a serious illness, the number of deaths after each treatment. Let the data be ki deaths out of ni patients, i = 1, 2 for the two treatments, and the two unknown parameters are q1 and q2, the probability of death with each treatment. Assuming a binomial model for each outcome, and independent beta prior distributions for q1 and q2, we have independent prior distributions and likelihood, so the posterior distributions are also independent:

p(q1, q2 | k1, k2) = p(q1 | k1) p(q2 | k2) ∝ p(k1 | q1) p(q1) p(k2 | q2) p(q2).

However, it is useful to think in terms of the joint posterior density for q1 and q2, as then we can make probability statements involving both parameters. In this case, one quantity of interest is the probability P(q2 < q1), i.e. does the second treatment have a lower death rate than the first. To find this probability, we need to integrate the joint posterior density over the relevant region, which is not possible to do exactly when it is a product of beta pdfs. We can approximate the probability by generating a sample of (q1, q2) pairs from the joint density. To generate the sample, we just need to generate each parameter from its beta posterior distribution, which can be done in R using the rbeta command. Then once we have the sample, we just count what proportion of pairs has q2 < q1 to estimate P(q2 < q1). (A short R sketch of this appears at the end of these notes.)

3. Specifying prior distributions

The posterior distribution depends both on the observed data, via the likelihood, and on the prior distribution. So far, we have taken the prior distribution as given, but now we look at how to specify a prior.

3.1 Informative prior distributions

An informative prior distribution is one in which the probability mass is concentrated in some subset of the possible range for the parameter(s), and is usually based on some specific information. There may be other data that is relevant, and we might want to use this information without including all previous data in our current model. In that case, we can use summaries of the data to find a prior distribution.

Example 3.1. In example 2.3, the data was the lifetimes of light bulbs, t = (t1, . . . , tn), assumed to be exponentially distributed with parameter λ (the failure rate, the reciprocal of the mean lifetime). The gamma distribution provides a conjugate prior for this likelihood.
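Returning to Example 2.6, here is a minimal R sketch of the Monte Carlo approach described there, assuming uniform Beta(1, 1) priors so that the posteriors are Beta(ki + 1, ni − ki + 1); the trial counts are made up for illustration:

```r
# Monte Carlo estimate of P(q2 < q1) for Example 2.6
# Trial counts are made up for illustration; priors are uniform Beta(1, 1),
# so the posteriors are Beta(k_i + 1, n_i - k_i + 1).
set.seed(1)
k1 <- 12; n1 <- 50   # deaths / patients, treatment 1
k2 <- 8;  n2 <- 60   # deaths / patients, treatment 2

m  <- 100000                        # size of the posterior sample
q1 <- rbeta(m, k1 + 1, n1 - k1 + 1)
q2 <- rbeta(m, k2 + 1, n2 - k2 + 1)

mean(q2 < q1)   # proportion of sampled pairs with q2 < q1
```

The final line is just the proportion of sampled pairs with q2 < q1, which approximates the posterior probability P(q2 < q1); increasing m reduces the Monte Carlo error.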