Multivariate Statistical Analysis — Course Reading Material: The EM Algorithm

Statistics 580
The EM Algorithm

Introduction

The EM algorithm is a very general iterative algorithm for parameter estimation by maximum likelihood when some of the random variables involved are not observed, i.e., are considered missing or incomplete. The EM algorithm formalizes an intuitive idea for obtaining parameter estimates when some of the data are missing:

i. replace missing values by estimated values,
ii. estimate the parameters,
iii. repeat:
    • step (i), using the estimated parameter values as the true values, and
    • step (ii), using the estimated values as the "observed" values,
iterating until convergence.

This idea had been in use for many years before Orchard and Woodbury (1972), in their missing information principle, provided a theoretical foundation for it. The term EM was introduced by Dempster, Laird, and Rubin (1977), where proofs of general results about the behavior of the algorithm were first given, together with a large number of applications.

For this discussion, suppose that we have a random vector $y$ whose joint density $f(y; \theta)$ is indexed by a $p$-dimensional parameter $\theta \in \Theta \subseteq \mathbb{R}^p$. If the complete-data vector $y$ were observed, it would be of interest to compute the maximum likelihood estimate of $\theta$ based on the distribution of $y$. The log-likelihood function of $y$,
$$ \log L(\theta; y) = \ell(\theta; y) = \log f(y; \theta), $$
is then the function to be maximized.

In the presence of missing data, however, only a function of the complete-data vector $y$ is observed. We denote this by writing $y = (y_{obs}, y_{mis})$, where $y_{obs}$ denotes the observed but "incomplete" data and $y_{mis}$ denotes the unobserved or "missing" data. For simplicity of description, assume that the missing data are missing at random (Rubin, 1976), so that
$$ f(y; \theta) = f(y_{obs}, y_{mis}; \theta) = f_1(y_{obs}; \theta) \cdot f_2(y_{mis} \mid y_{obs}; \theta), $$
where $f_1$ is the joint density of $y_{obs}$ and $f_2$ is the conditional density of $y_{mis}$ given the observed data $y_{obs}$. Thus it follows that
$$ \ell_{obs}(\theta; y_{obs}) = \ell(\theta; y) - \log f_2(y_{mis} \mid y_{obs}; \theta), $$
where $\ell_{obs}(\theta; y_{obs})$ is the observed-data log-likelihood.

The EM algorithm is useful when maximizing $\ell_{obs}$ can be difficult but maximizing the complete-data log-likelihood $\ell$ is simple. However, since $y$ is not observed, $\ell$ cannot be evaluated and hence cannot be maximized directly. The EM algorithm attempts to maximize $\ell(\theta; y)$ iteratively, by replacing it with its conditional expectation given the observed data $y_{obs}$. This expectation is computed with respect to the distribution of the complete data evaluated at the current estimate of $\theta$.

More specifically, if $\theta^{(0)}$ is an initial value for $\theta$, then on the first iteration we compute
$$ Q(\theta; \theta^{(0)}) = E_{\theta^{(0)}}\left[\ell(\theta; y) \mid y_{obs}\right]. $$
$Q(\theta; \theta^{(0)})$ is then maximized with respect to $\theta$; that is, $\theta^{(1)}$ is found such that
$$ Q(\theta^{(1)}; \theta^{(0)}) \geq Q(\theta; \theta^{(0)}) $$
for all $\theta \in \Theta$. Thus the EM algorithm consists of an E-step (Expectation step) followed by an M-step (Maximization step), defined as follows:

E-step: Compute $Q(\theta; \theta^{(t)})$, where
$$ Q(\theta; \theta^{(t)}) = E_{\theta^{(t)}}\left[\ell(\theta; y) \mid y_{obs}\right]. $$

M-step: Find $\theta^{(t+1)}$ in $\Theta$ such that
$$ Q(\theta^{(t+1)}; \theta^{(t)}) \geq Q(\theta; \theta^{(t)}) $$
for all $\theta \in \Theta$.

The E-step and the M-step are repeated alternately until the difference $L(\theta^{(t+1)}) - L(\theta^{(t)})$ is less than $\delta$, where $\delta$ is a prescribed small quantity.

The computation of these two steps simplifies a great deal when the log-likelihood is linear in the sufficient statistic for $\theta$. In particular, this is the case when the distribution of the complete-data vector (i.e., $y$) belongs to the exponential family. In this case, the E-step reduces to computing the expectation of the complete-data sufficient statistic given the observed data. When the complete data are from the exponential family, the M-step also simplifies: although the M-step nominally maximizes the expected log-likelihood computed in the E-step, explicitly carrying out this maximization can be avoided. Instead, the conditional expectations of the sufficient statistics computed in the E-step are substituted directly for the sufficient statistics in the expressions for the complete-data maximum likelihood estimators of $\theta$, giving the next iterate. Several examples are discussed below to illustrate these steps in the exponential family case.
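Written as code, the iteration just described is simply a loop that alternates the two steps until the log-likelihood stops increasing by more than $\delta$. The sketch below is generic and illustrative only: `e_step`, `m_step`, and `obs_loglik` are hypothetical placeholders for the problem-specific computations worked out in the examples that follow.

```python
# A minimal generic sketch of the EM iteration described above.
# `e_step`, `m_step`, and `obs_loglik` are hypothetical problem-specific functions.

def em(theta0, e_step, m_step, obs_loglik, delta=1e-8, max_iter=500):
    """Alternate E- and M-steps until the observed-data log-likelihood
    increases by less than `delta`."""
    theta = theta0
    ll = obs_loglik(theta)
    for t in range(max_iter):
        expected_stats = e_step(theta)   # E-step: conditional expectations given y_obs
        theta = m_step(expected_stats)   # M-step: complete-data MLE formulas with expectations plugged in
        ll_new = obs_loglik(theta)
        if ll_new - ll < delta:          # stopping rule on the change in log-likelihood
            break
        ll = ll_new
    return theta
```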

As a general algorithm for complex maximum likelihood computations, the EM algorithm has several appealing properties relative to other iterative algorithms such as Newton-Raphson. First, it is typically easy to implement, because it relies on complete-data computations: the E-step of each iteration only involves taking expectations over complete-data conditional distributions, and the M-step of each iteration only requires complete-data maximum likelihood estimation, for which simple closed-form expressions are already available. Secondly, it is numerically stable: each iteration increases the log-likelihood $\ell(\theta; y_{obs})$, and if $\ell(\theta; y_{obs})$ is bounded, the sequence $\ell(\theta^{(t)}; y_{obs})$ converges to a stationary value. If the sequence $\theta^{(t)}$ converges, it does so to a local maximum or saddle point of $\ell(\theta; y_{obs})$, and to the unique MLE if $\ell(\theta; y_{obs})$ is unimodal. A disadvantage of EM is that its rate of convergence can be extremely slow if a lot of data are missing: Dempster, Laird, and Rubin (1977) show that convergence is linear, with rate proportional to the fraction of the information about $\theta$ in $\ell(\theta; y)$ that is observed.

Example 1: Univariate Normal Sample

Let the complete-data vector $y = (y_1, \ldots, y_n)^T$ be a random sample from $N(\mu, \sigma^2)$. Then
$$ f(y; \mu, \sigma^2) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left\{-\frac{1}{2}\sum_{i=1}^{n}\frac{(y_i-\mu)^2}{\sigma^2}\right\} = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left\{-\frac{1}{2\sigma^2}\left(\sum y_i^2 - 2\mu\sum y_i + n\mu^2\right)\right\}, $$
which implies that $\left(\sum y_i, \sum y_i^2\right)$ are sufficient statistics for $\theta = (\mu, \sigma^2)^T$. The complete-data log-likelihood function is
$$ \ell(\mu, \sigma^2; y) = -\frac{n}{2}\log(\sigma^2) - \frac{1}{2}\sum_{i=1}^{n}\frac{(y_i-\mu)^2}{\sigma^2} + \text{const.} = -\frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} y_i^2 + \frac{\mu}{\sigma^2}\sum_{i=1}^{n} y_i - \frac{n\mu^2}{2\sigma^2} + \text{const.} $$
It follows that the complete-data log-likelihood is linear in the complete-data sufficient statistics. Suppose $y_i,\ i = 1, \ldots, m$ are observed and $y_i,\ i = m+1, \ldots, n$ are missing (at random), where the $y_i$ are assumed to be i.i.d. $N(\mu, \sigma^2)$. Denote the observed data vector by $y_{obs} = (y_1, \ldots, y_m)^T$. Since the complete-data vector $y$ is from the exponential family, the E-step requires the computation of
$$ E_\theta\left(\sum_{i=1}^{n} y_i \,\Big|\, y_{obs}\right) \quad \text{and} \quad E_\theta\left(\sum_{i=1}^{n} y_i^2 \,\Big|\, y_{obs}\right), $$
rather than the expectation of the complete-data log-likelihood function shown above. Thus, at the $t$th iteration of the E-step, compute
$$ s_1^{(t)} = E_{\mu^{(t)}, \sigma^{2(t)}}\left(\sum_{i=1}^{n} y_i \,\Big|\, y_{obs}\right) = \sum_{i=1}^{m} y_i + (n - m)\,\mu^{(t)} \qquad (1) $$
since $E_{\mu^{(t)}, \sigma^{2(t)}}(y_i) = \mu^{(t)}$, where $\mu^{(t)}$ and $\sigma^{2(t)}$ are the current estimates of $\mu$ and $\sigma^2$, and

$$ s_2^{(t)} = E_{\mu^{(t)}, \sigma^{2(t)}}\left(\sum_{i=1}^{n} y_i^2 \,\Big|\, y_{obs}\right) = \sum_{i=1}^{m} y_i^2 + (n - m)\left[\sigma^{2(t)} + \left(\mu^{(t)}\right)^2\right] \qquad (2) $$
since $E_{\mu^{(t)}, \sigma^{2(t)}}(y_i^2) = \sigma^{2(t)} + \left(\mu^{(t)}\right)^2$.

For the M-step, first note that the complete-data maximum likelihood estimates of $\mu$ and $\sigma^2$ are
$$ \hat{\mu} = \frac{\sum_{i=1}^{n} y_i}{n} \quad \text{and} \quad \hat{\sigma}^2 = \frac{\sum_{i=1}^{n} y_i^2}{n} - \left(\frac{\sum_{i=1}^{n} y_i}{n}\right)^2. $$
The M-step is defined by substituting the expectations computed in the E-step for the complete-data sufficient statistics on the right-hand side of these expressions, giving the new iterates of $\mu$ and $\sigma^2$. Note that the complete-data sufficient statistics themselves cannot be computed directly, since $y_{m+1}, \ldots, y_n$ have not been observed. We obtain
$$ \mu^{(t+1)} = \frac{s_1^{(t)}}{n} \qquad (3) $$
and
$$ \sigma^{2(t+1)} = \frac{s_2^{(t)}}{n} - \left(\mu^{(t+1)}\right)^2. \qquad (4) $$
Thus, the E-step evaluates (1) and (2), beginning with starting values $\mu^{(0)}$ and $\sigma^{2(0)}$, and the M-step substitutes these into (3) and (4) to calculate the new values $\mu^{(1)}$ and $\sigma^{2(1)}$, and so on. The EM algorithm therefore iterates successively between (1)-(2) and (3)-(4). Of course, in this example it is not necessary to use the EM algorithm at all, since the maximum likelihood estimates of $(\mu, \sigma^2)$ are clearly given by $\hat{\mu} = \sum_{i=1}^{m} y_i/m$ and $\hat{\sigma}^2 = \sum_{i=1}^{m} y_i^2/m - \hat{\mu}^2$. $\Box$
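A small numerical sketch of the iteration (1)-(4) is given below. The observed values and the total sample size are made-up illustrative numbers, not data from these notes; the limit agrees with the observed-data MLEs noted above.

```python
import numpy as np

# Sketch of the EM iteration (1)-(4) for Example 1 with hypothetical data.
rng = np.random.default_rng(0)
y_obs = rng.normal(loc=5.0, scale=2.0, size=15)   # hypothetical observed values (m = 15)
n = 20                                            # hypothetical total sample size (5 values missing)

m = len(y_obs)
mu, sigma2 = 0.0, 1.0                             # starting values mu^(0), sigma2^(0)
for t in range(200):
    # E-step: expected complete-data sufficient statistics, equations (1) and (2)
    s1 = y_obs.sum() + (n - m) * mu
    s2 = (y_obs ** 2).sum() + (n - m) * (sigma2 + mu ** 2)
    # M-step: complete-data MLE formulas with the expectations substituted, equations (3) and (4)
    mu_new = s1 / n
    sigma2_new = s2 / n - mu_new ** 2
    if abs(mu_new - mu) < 1e-10 and abs(sigma2_new - sigma2) < 1e-10:
        mu, sigma2 = mu_new, sigma2_new
        break
    mu, sigma2 = mu_new, sigma2_new

# The limit equals the observed-data MLEs, as noted in the text:
print(mu, y_obs.mean())
print(sigma2, (y_obs ** 2).mean() - y_obs.mean() ** 2)
```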

Example 2: Sampling from a Multinomial Population

In Example 1, the "incomplete data" were in effect "missing data" in the conventional sense. In general, however, the EM algorithm also applies to situations where the complete data contain variables that are not observable by definition. In that setup, the observed data can be viewed as some function or mapping from the space of the complete data.

The following example is used by Dempster, Laird, and Rubin (1977) as an illustration of the EM algorithm. Let $y_{obs} = (38, 34, 125)^T$ be observed counts from a multinomial population with probabilities $\left(\tfrac{1}{2} - \tfrac{1}{2}\theta,\ \tfrac{1}{4}\theta,\ \tfrac{1}{2} + \tfrac{1}{4}\theta\right)$. The objective is to obtain the maximum likelihood estimate of $\theta$. First, to put this into the framework of an incomplete-data problem, define $y = (y_1, y_2, y_3, y_4)^T$ with multinomial probabilities $\left(\tfrac{1}{2} - \tfrac{1}{2}\theta,\ \tfrac{1}{4}\theta,\ \tfrac{1}{4}\theta,\ \tfrac{1}{2}\right) \equiv (p_1, p_2, p_3, p_4)$.

The vector $y$ is considered the complete data. Then define $y_{obs} = (y_1, y_2, y_3 + y_4)^T$ as the observed data vector, which is a function of the complete-data vector. Since only $y_3 + y_4$ is observed, and $y_3$ and $y_4$ are not, the observed data are considered incomplete. However, this is not simply a missing data problem.

The complete-data log-likelihood is
$$ \ell(\theta; y) = y_1 \log p_1 + y_2 \log p_2 + y_3 \log p_3 + y_4 \log p_4 + \text{const.}, $$
which is linear in $y_1, y_2, y_3, y_4$, which are also the sufficient statistics. The E-step requires that $E_\theta(y \mid y_{obs})$ be computed; that is, compute
$$ E_\theta(y_1 \mid y_{obs}) = y_1 = 38, $$
$$ E_\theta(y_2 \mid y_{obs}) = y_2 = 34, $$
$$ E_\theta(y_3 \mid y_{obs}) = E_\theta(y_3 \mid y_3 + y_4) = 125\left(\tfrac{1}{4}\theta\right) \Big/ \left(\tfrac{1}{2} + \tfrac{1}{4}\theta\right), $$
since, conditional on $y_3 + y_4$, $y_3$ is distributed as Binomial$(125, p)$ with
$$ p = \frac{\tfrac{1}{4}\theta}{\tfrac{1}{2} + \tfrac{1}{4}\theta}. $$
Similarly,
$$ E_\theta(y_4 \mid y_{obs}) = E_\theta(y_4 \mid y_3 + y_4) = 125\left(\tfrac{1}{2}\right) \Big/ \left(\tfrac{1}{2} + \tfrac{1}{4}\theta\right), $$
which is computed in the same way as $E_\theta(y_3 \mid y_{obs})$. But only
$$ y_3^{(t)} = E_{\theta^{(t)}}(y_3 \mid y_{obs}) = \frac{125\left(\tfrac{1}{4}\right)\theta^{(t)}}{\tfrac{1}{2} + \tfrac{1}{4}\theta^{(t)}} \qquad (1) $$
needs to be computed at the $t$th iteration of the E-step, as seen below.

For the M-step, note that the complete-data maximum likelihood estimate of $\theta$ is
$$ \frac{y_2 + y_3}{y_1 + y_2 + y_3}. $$
(Note: maximize
$$ \ell(\theta; y) = y_1 \log\left(\tfrac{1}{2} - \tfrac{1}{2}\theta\right) + y_2 \log \tfrac{1}{4}\theta + y_3 \log \tfrac{1}{4}\theta + y_4 \log \tfrac{1}{2} $$
and show that the above is indeed the maximum likelihood estimate of $\theta$.) Thus, substituting the expectations from the E-step for the sufficient statistics in the expression for the maximum likelihood estimate of $\theta$ gives
$$ \theta^{(t+1)} = \frac{34 + y_3^{(t)}}{72 + y_3^{(t)}}. \qquad (2) $$
Iterating between (1) and (2) defines the EM algorithm for this problem. The following table shows the convergence results of applying EM to this problem with $\theta^{(0)} = 0.50$.

Table 1. The EM Algorithm for Example 2 (from Little and Rubin (1987))

 t   $\theta^{(t)}$   $\theta^{(t)} - \hat{\theta}$   $(\theta^{(t+1)} - \hat{\theta})/(\theta^{(t)} - \hat{\theta})$
 0   0.500000000      0.126821498                     0.1465
 1   0.608247423      0.018574075                     0.1346
 2   0.624321051      0.002500447                     0.1330
 3   0.626488879      0.000332619                     0.1328
 4   0.626777323      0.000044176                     0.1328
 5   0.626815632      0.000005866                     0.1328
 6   0.626820719      0.000000779
 7   0.626821395      0.000000104
 8   0.626821484      0.000000014
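The iteration (1)-(2) is easy to reproduce. The short sketch below runs it from $\theta^{(0)} = 0.50$ with the observed counts $(38, 34, 125)$ from the text and converges to about 0.6268, in agreement with the $\theta^{(t)}$ column of Table 1.

```python
# Sketch of the EM iteration (1)-(2) for Example 2, starting from theta^(0) = 0.50.
theta = 0.50
for t in range(25):
    # E-step, equation (1): expected count y3 given that y3 + y4 = 125
    y3 = 125 * (theta / 4) / (0.5 + theta / 4)
    # M-step, equation (2): complete-data MLE of theta with y3 replaced by its expectation
    theta_new = (34 + y3) / (72 + y3)
    if abs(theta_new - theta) < 1e-9:
        theta = theta_new
        break
    theta = theta_new

print(theta)   # about 0.6268, matching Table 1
```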

Example 3: Sample from a Binomial/Poisson Mixture

The following table shows the number of children of $N$ widows entitled to support from a certain pension fund.

Number of children:      0    1    2    3    4    5    6
Observed # of widows:   n0   n1   n2   n3   n4   n5   n6

Since the actual data were not consistent with being a random sample from a Poisson distribution (the number of widows with no children being too large), the following alternative model was adopted. Assume that the discrete random variable is distributed as a mixture of two populations:

Population A: with probability $\xi$, the random variable takes the value 0;
Population B: with probability $(1 - \xi)$, the random variable follows a Poisson distribution with mean $\lambda$.

Let the observed vector of counts be $n_{obs} = (n_0, n_1, \ldots, n_6)^T$. The problem is to obtain the maximum likelihood estimate of $(\lambda, \xi)$. This is reformulated as an incomplete-data problem by regarding the observed number of widows with no children as the sum of observations that come from each of the two populations. Define

$n_0 = n_A + n_B$,
$n_A$ = number of widows with no children from population A,
$n_B = n_0 - n_A$ = number of widows with no children from population B.

The problem now becomes an incomplete-data problem because $n_A$ is not observed. Let $n = (n_A, n_B, n_1, n_2, \ldots, n_6)$ be the complete-data vector, where we assume that $n_A$ and $n_B$ are observed and $n_0 = n_A + n_B$.

Then the complete-data density is
$$ f(n; \xi, \lambda) = k(n)\,\xi^{\,n_A}\left\{(1 - \xi)e^{-\lambda}\right\}^{n_B} \prod_{i=1}^{6}\left\{(1 - \xi)\,\frac{e^{-\lambda}\lambda^{i}}{i!}\right\}^{n_i} = k(n)\,\xi^{\,n_A}\left\{(1 - \xi)e^{-\lambda}\right\}^{n_B + \sum_{i=1}^{6} n_i} \prod_{i=1}^{6}\left(\frac{\lambda^{i}}{i!}\right)^{n_i}, $$
where $k(n)$ denotes the corresponding multinomial coefficient. Obviously, the complete-data sufficient statistic is $(n_A, n_B, n_1, n_2, \ldots, n_6)$. The complete-data log-likelihood is
$$ \ell(\xi, \lambda; n) = n_A \log\xi + (N - n_A)\left[\log(1 - \xi) - \lambda\right] + \sum_{i=1}^{6} i\, n_i \log\lambda + \text{const.} $$
Thus, the complete-data log-likelihood is linear in the sufficient statistic. The E-step requires computing $E_{\xi,\lambda}(n \mid n_{obs})$. This yields
$$ E_{\xi,\lambda}(n_i \mid n_{obs}) = n_i \quad \text{for } i = 1, \ldots, 6, $$
and
$$ E_{\xi,\lambda}(n_A \mid n_{obs}) = \frac{n_0\,\xi}{\xi + (1 - \xi)\exp(-\lambda)}, $$
since $n_A$ is Binomial$(n_0, p)$ with $p = p_A/(p_A + p_B)$, where $p_A = \xi$ and $p_B = (1 - \xi)e^{-\lambda}$. The expression for $E_{\xi,\lambda}(n_B \mid n_{obs})$ follows in the same way as that for $E(n_A)$ and is not needed for the E-step computations. So the E-step consists of computing
$$ n_A^{(t)} = \frac{n_0\,\xi^{(t)}}{\xi^{(t)} + (1 - \xi^{(t)})\exp(-\lambda^{(t)})} \qquad (1) $$
at the $t$th iteration.

For the M-step, the complete-data maximum likelihood estimates of $(\xi, \lambda)$ are needed. To obtain these, note that $n_A \sim \text{Bin}(N, \xi)$ and that $n_B, n_1, \ldots, n_6$ are the observed counts for the values $0, 1, \ldots, 6$ of a Poisson distribution with parameter $\lambda$. Thus, the complete-data maximum likelihood estimates of $\xi$ and $\lambda$ are

$$ \hat{\xi} = \frac{n_A}{N} \quad \text{and} \quad \hat{\lambda} = \frac{\sum_{i=1}^{6} i\, n_i}{n_B + \sum_{i=1}^{6} n_i}. $$
The M-step therefore computes
$$ \xi^{(t+1)} = \frac{n_A^{(t)}}{N} \qquad (2) $$
and
$$ \lambda^{(t+1)} = \frac{\sum_{i=1}^{6} i\, n_i}{n_B^{(t)} + \sum_{i=1}^{6} n_i}, \qquad (3) $$
where $n_B^{(t)} = n_0 - n_A^{(t)}$.

The EM algorithm consists of iterating successively between (1) and (2)-(3). The following data are reproduced from Thisted (1988).

Number of children:    0      1     2     3    4    5    6
Number of widows:   3,062    587   284   103   33    4    2

Starting with $\xi^{(0)} = 0.75$ and $\lambda^{(0)} = 0.40$, the following results were obtained.

Table 2. EM Iterations for the Pension Data

 t   $\xi$      $\lambda$   $n_A$      $n_B$
 0   0.75       0.40        2502.779   559.221
 1   0.614179   1.035478    2503.591   558.409
 2   0.614378   1.036013    2504.219   557.781
 3   0.614532   1.036427    2504.704   557.296
 4   0.614651   1.036747    2505.079   556.921
 5   0.614743   1.036995    2505.369   556.631

A single iteration produced estimates that are within 0.5% of the maximum likelihood estimates and are comparable to the results after about four iterations of Newton-Raphson. However, the convergence of the subsequent iterations is very slow, which is more typical of the behavior of the EM algorithm.
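The iteration (1)-(3) can also be reproduced with a few lines of code. The sketch below uses the pension data and the starting values $\xi^{(0)} = 0.75$, $\lambda^{(0)} = 0.40$ given above; the printed rows reproduce Table 2.

```python
import math

# Sketch of the EM iteration (1)-(3) for Example 3 with the pension data above.
counts = [3062, 587, 284, 103, 33, 4, 2]   # n0, n1, ..., n6
n0, N = counts[0], sum(counts)
sum_i_ni = sum(i * counts[i] for i in range(1, 7))   # sum of i * n_i for i = 1..6
sum_ni = sum(counts[1:])                             # n_1 + ... + n_6

xi, lam = 0.75, 0.40
for t in range(100):
    # E-step, equation (1): expected number of zero counts from population A
    nA = n0 * xi / (xi + (1 - xi) * math.exp(-lam))
    nB = n0 - nA
    if t < 6:
        print(t, round(xi, 6), round(lam, 6), round(nA, 3), round(nB, 3))  # rows of Table 2
    # M-step, equations (2) and (3)
    xi_new = nA / N
    lam_new = sum_i_ni / (nB + sum_ni)
    if abs(xi_new - xi) < 1e-10 and abs(lam_new - lam) < 1e-10:
        xi, lam = xi_new, lam_new
        break
    xi, lam = xi_new, lam_new
```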

Example 4: Variance Component Estimation (Little and Rubin (1987))

The following example is from Snedecor and Cochran (1967, p. 290). In a study of artificial insemination of cows, semen samples from six randomly selected bulls were tested for their ability to produce conceptions. The number of samples tested varied from bull to bull, and the response variable was the percentage of conceptions obtained from each sample. Here the interest is in the variability of the bull effects, which are modeled as random effects. The data are:

Table 3. Data for Example 4 (from Snedecor and Cochran (1967))

Bull (i)   Percentages of conception               $n_i$
1          46, 31, 37, 62, 30                       5
2          70, 59                                   2
3          52, 44, 57, 40, 67, 64, 70               7
4          47, 21, 70, 46, 14                       5
5          42, 64, 50, 69, 77, 81, 87               7
6          35, 68, 59, 38, 57, 76, 57, 29, 60       9
Total                                              35

A common model used for the analysis of such data is the one-way random effects model
$$ y_{ij} = a_i + e_{ij}, \qquad j = 1, \ldots, n_i;\ i = 1, \ldots, k, $$
where it is assumed that the bull effects $a_i$ are i.i.d. $N(\mu, \sigma_a^2)$, the within-bull effects (errors) $e_{ij}$ are i.i.d. $N(0, \sigma^2)$, and the $a_i$ and $e_{ij}$ are independent. The standard one-way random effects analysis of variance is:

Source   d.f.     S.S.        M.S.     F      E(M.S.)
Bull       5      3322.059    664.41   2.68   $\sigma^2 + 5.67\,\sigma_a^2$
Error     29      7200.341    248.29          $\sigma^2$
Total     34     10522.400

Equating observed and expected mean squares gives $s^2 = 248.29$ as the estimate of $\sigma^2$ and $(664.41 - 248.29)/5.67 = 73.39$ as the estimate of $\sigma_a^2$.

To construct an EM algorithm for obtaining the MLEs of $\theta = (\mu, \sigma_a^2, \sigma^2)$, first consider the joint density of $y^* = (y, a)^T$, where $y^*$ is taken to be the complete data. This joint density can be written as a product of two factors: the first corresponds to the joint density of the $y_{ij}$ given the $a_i$, and the second to the joint density of the $a_i$:
$$ f(y^*; \theta) = f_1(y \mid a; \theta)\, f_2(a; \theta) = \prod_i \prod_j \left\{\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2\sigma^2}(y_{ij} - a_i)^2}\right\} \cdot \prod_i \left\{\frac{1}{\sqrt{2\pi}\,\sigma_a}\, e^{-\frac{1}{2\sigma_a^2}(a_i - \mu)^2}\right\}. $$

Thus, the log-likelihood is linear in the following complete-data sufficient statistics:
$$ T_1 = \sum_i a_i, \qquad T_2 = \sum_i a_i^2, \qquad T_3 = \sum_i \sum_j (y_{ij} - a_i)^2 = \sum_i \sum_j (y_{ij} - \bar{y}_{i.})^2 + \sum_i n_i(\bar{y}_{i.} - a_i)^2. $$
Here "complete data" means that both $y$ and $a$ are available. Since only $y$ is observed, let $y^*_{obs} = y$. The E-step of the EM algorithm then requires the expectations of $T_1$, $T_2$, and $T_3$ given $y^*_{obs}$, i.e., $E_\theta(T_i \mid y)$ for $i = 1, 2, 3$. The conditional distribution of $a$ given $y$ is needed to compute these expectations.

First, note that the joint distribution of $y^* = (y, a)^T$ is $(N + k)$-dimensional multivariate normal, $N(\mu^*, \Sigma^*)$, where $\mu^* = (\mu, \mu_a)^T$, $\mu = \mu\, j_N$, $\mu_a = \mu\, j_k$, and $\Sigma^*$ is the $(N + k) \times (N + k)$ matrix
$$ \Sigma^* = \begin{pmatrix} \Sigma & \Sigma_{12} \\ \Sigma_{12}^T & \sigma_a^2 I \end{pmatrix}. $$
Here
$$ \Sigma = \begin{pmatrix} \Sigma_1 & & 0 \\ & \ddots & \\ 0 & & \Sigma_k \end{pmatrix}, \qquad \Sigma_{12} = \sigma_a^2 \begin{pmatrix} j_{n_1} & & 0 \\ & \ddots & \\ 0 & & j_{n_k} \end{pmatrix}, $$
where $\Sigma_i = \sigma^2 I_{n_i} + \sigma_a^2 J_{n_i}$ is an $n_i \times n_i$ matrix. The covariance matrix $\Sigma$ of the joint distribution of $y$ is obtained by recognizing that the $y_{ij}$ are jointly normal with common mean $\mu$, common variance $\sigma^2 + \sigma_a^2$, covariance $\sigma_a^2$ within the same bull, and covariance 0 between bulls. That is,
$$ \mathrm{Cov}(y_{ij}, y_{i'j'}) = \mathrm{Cov}(a_i + e_{ij},\ a_{i'} + e_{i'j'}) = \begin{cases} \sigma^2 + \sigma_a^2 & \text{if } i = i',\ j = j', \\ \sigma_a^2 & \text{if } i = i',\ j \neq j', \\ 0 & \text{if } i \neq i'. \end{cases} $$
$\Sigma_{12}$ is the covariance of $y$ and $a$ and follows from the fact that $\mathrm{Cov}(y_{ij}, a_{i'}) = \sigma_a^2$ if $i = i'$ and 0 if $i \neq i'$. The inverse of $\Sigma$, needed for computing the conditional distribution of $a$ given $y$, is
$$ \Sigma^{-1} = \begin{pmatrix} \Sigma_1^{-1} & & 0 \\ & \ddots & \\ 0 & & \Sigma_k^{-1} \end{pmatrix}, \qquad \text{where } \Sigma_i^{-1} = \frac{1}{\sigma^2}\left[I_{n_i} - \frac{\sigma_a^2}{\sigma^2 + n_i\sigma_a^2} J_{n_i}\right]. $$
Using a well-known theorem in multivariate normal theory, the distribution of $a$ given $y$ is $N(\alpha, A)$, where
$$ \alpha = \mu_a + \Sigma_{12}^{T}\Sigma^{-1}(y - \mu) \qquad \text{and} \qquad A = \sigma_a^2 I - \Sigma_{12}^{T}\Sigma^{-1}\Sigma_{12}. $$
It can be shown after some algebra that, independently across $i$,
$$ a_i \mid y \ \sim\ N\left(w_i\mu + (1 - w_i)\bar{y}_{i.},\ v_i\right). $$
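The excerpt ends here, before the E- and M-step formulas for this example are written out. The sketch below is therefore only one plausible implementation, not the notes' own: it assumes the standard conditional moments $w_i = \sigma^2/(\sigma^2 + n_i\sigma_a^2)$ and $v_i = \sigma^2\sigma_a^2/(\sigma^2 + n_i\sigma_a^2)$ for $a_i$ given $y$ (consistent with the distribution displayed above), and it uses the complete-data MLEs $\hat{\mu} = T_1/k$, $\hat{\sigma}_a^2 = T_2/k - \hat{\mu}^2$, $\hat{\sigma}^2 = T_3/N$ implied by the sufficient statistics $T_1$, $T_2$, $T_3$, with their conditional expectations substituted in the E-step.

```python
import numpy as np

# One possible EM sketch for Example 4 (one-way random effects model), under the
# assumptions stated in the lead-in above.
bulls = [
    [46, 31, 37, 62, 30],
    [70, 59],
    [52, 44, 57, 40, 67, 64, 70],
    [47, 21, 70, 46, 14],
    [42, 64, 50, 69, 77, 81, 87],
    [35, 68, 59, 38, 57, 76, 57, 29, 60],
]
groups = [np.asarray(g, dtype=float) for g in bulls]
n_i = np.array([len(g) for g in groups])
ybar = np.array([g.mean() for g in groups])
within_ss = sum(((g - g.mean()) ** 2).sum() for g in groups)  # sum_i sum_j (y_ij - ybar_i)^2
k, N = len(groups), int(n_i.sum())

# starting values: the ANOVA (method-of-moments) estimates computed in the text
mu, sigma_a2, sigma2 = ybar.mean(), 73.39, 248.29
for t in range(500):
    # E-step: conditional moments of a_i given y (assumed standard normal-theory results)
    w = sigma2 / (sigma2 + n_i * sigma_a2)
    alpha = w * mu + (1 - w) * ybar                      # E(a_i | y)
    v = sigma2 * sigma_a2 / (sigma2 + n_i * sigma_a2)    # Var(a_i | y)
    T1 = alpha.sum()
    T2 = (alpha ** 2 + v).sum()
    T3 = within_ss + (n_i * ((ybar - alpha) ** 2 + v)).sum()
    # M-step: complete-data MLEs with the expected sufficient statistics substituted
    mu_new = T1 / k
    sigma_a2_new = T2 / k - mu_new ** 2
    sigma2_new = T3 / N
    if max(abs(mu_new - mu), abs(sigma_a2_new - sigma_a2), abs(sigma2_new - sigma2)) < 1e-8:
        mu, sigma_a2, sigma2 = mu_new, sigma_a2_new, sigma2_new
        break
    mu, sigma_a2, sigma2 = mu_new, sigma_a2_new, sigma2_new

print(mu, sigma_a2, sigma2)   # ML estimates, to compare with the ANOVA values above
```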