Log predictive density asymptotically in predictive information criteria for Bayesian models

Gabriella Sellers

Answered question

2022-06-13

Log predictive density asymptotically in predictive information criteria for Bayesian models
I am reading Andrew Gelman's paper, Understanding predictive information criteria for Bayesian models, and I give the relevant excerpt (from a screenshot) below:
Under standard conditions, the posterior distribution, $p(\theta \mid y)$, approaches a normal distribution in the limit of increasing sample size (see, e.g., DeGroot, 1970). In this asymptotic limit, the posterior is dominated by the likelihood (the prior contributes only one factor, while the likelihood contributes $n$ factors, one for each data point), and so the likelihood function also approaches the same normal distribution.
As sample size $n \to \infty$, we can label the limiting posterior distribution as $\theta \mid y \sim N(\theta_0, V_0/n)$. In this limit the log predictive density is
$$\log p(y \mid \theta) = c(y) - \frac{1}{2}\left(k \log(2\pi) + \log|V_0/n| + (\theta - \theta_0)^T (V_0/n)^{-1} (\theta - \theta_0)\right),$$
where $c(y)$ is a constant that only depends on the data $y$ and the model class but not on the parameters $\theta$.
The limiting multivariate normal distribution for $\theta$ induces a posterior distribution for the log predictive density that ends up being a constant (equal to $c(y) - \frac{1}{2}(k \log(2\pi) + \log|V_0/n|)$) minus $\frac{1}{2}$ times a $\chi^2_k$ random variable, where $k$ is the dimension of $\theta$, that is, the number of parameters in the model. The maximum of this distribution of the log predictive density is attained when $\theta$ equals the maximum likelihood estimate (of course), and its posterior mean is at a value $\frac{k}{2}$ lower.
For actual posterior distributions, this asymptotic result is only an approximation, but it will be useful as a benchmark for interpreting the log predictive density as a measure of fit.
With singular models (e.g. mixture models and overparameterized complex models more generally), a set of different parameters can map to a single data model, the Fisher information matrix is not positive definite, plug-in estimates are not representative of the posterior, and the distribution of the deviance does not converge to a $\chi^2$ distribution. The asymptotic behavior of such models can be analyzed using singular learning theory (Watanabe, 2009, 2010).
Sorry for the long paragraph. The things that confuse me are:
1. Why does it seem like we know the posterior distribution $p(\theta \mid y)$ first, and then use it to find $\log p(y \mid \theta)$? Shouldn't we get the model, $\log p(y \mid \theta)$, first?
2. What does the highlighted line "its posterior mean is at a value $\frac{k}{2}$ lower" mean? My understanding is that since there is a $\frac{1}{2}\chi^2_k$ term in the expression and the expectation of $\chi^2_k$ is $k$, this leads to a value $\frac{k}{2}$ lower. But $\frac{k}{2}$ lower than what?
3. How does $\log p(y \mid \theta)$ serve as a measure of fit? I can see that there is a mean-square-error (MSE) term in this expression, but it is an MSE of the parameter $\theta$, not the data $y$.
Thanks for any help!

Answer & Explanation

Jayce Bates

Beginner · 2022-06-14 · Added 18 answers

1. When we look at the posterior distribution, we are concerned with two contributing factors: the prior and the likelihood. As we are looking at the asymptotic limit $n \to \infty$, we know that the influence of the prior is negligible (see the log-scale form of Bayes' rule written out after point 3 below). We can model this limiting $\theta \mid y$ as $N(\theta_0, V_0/n)$ to ascertain the behavior of the posterior from the likelihood. In the excerpt you have provided, this is merely a heuristic for measuring the fit of your model, because the log predictive density is approximately inferred from the likelihood. So to answer your question: in a sense, we are.
2. The author is saying that the log predictive density, whose posterior distribution is $c(y) - \frac{1}{2}(k \log(2\pi) + \log|V_0/n|) - \frac{1}{2}\chi^2_k$, is maximized when $\theta$ equals the maximum likelihood estimate. At that point the quadratic term $(\theta - \theta_0)^T (V_0/n)^{-1}(\theta - \theta_0)$ vanishes, so the maximum value is $c(y) - \frac{1}{2}(k \log(2\pi) + \log|V_0/n|)$. The posterior mean of the log predictive density is $\frac{k}{2}$ less than this maximum. We expect this because $E[\chi^2_k] = k$, so $\left[c(y) - \frac{1}{2}\left(k\log(2\pi) + \log|V_0/n|\right)\right] - E\left[c(y) - \frac{1}{2}\left(k\log(2\pi) + \log|V_0/n|\right) - \frac{1}{2}\chi^2_k\right] = \frac{k}{2}$. So "$\frac{k}{2}$ lower" means $\frac{k}{2}$ lower than the maximum attained at the maximum likelihood estimate (a short simulation sketch at the end of this answer checks this numerically).
3. I hope that the answer to this question is made clearer by the previous two answers. Ultimately the idea is that these approximations will work well if the model is a good fit.
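A side note on point 1: the "one factor from the prior, $n$ factors from the likelihood" remark comes from writing Bayes' rule on the log scale (a standard identity, not spelled out in the excerpt; it assumes the $y_i$ are conditionally independent given $\theta$):
$$\log p(\theta \mid y) = \log p(\theta) + \sum_{i=1}^{n} \log p(y_i \mid \theta) - \log p(y).$$
The prior contributes a single term, the likelihood contributes $n$ terms, and the evidence term does not depend on $\theta$, so as $n \to \infty$ the likelihood dominates and both the posterior and the normalized likelihood approach the same normal limit $N(\theta_0, V_0/n)$.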
Alright, I did my best. I hope that this helps a little bit.
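To back up point 2, here is a minimal simulation sketch (my own illustration, not from the paper or from the answer above) that checks the "$\frac{k}{2}$ lower" claim numerically. All concrete values ($k$, $n$, $\theta_0$, $V_0$, and the stand-in constant `c_y`) are arbitrary choices for the illustration:

```python
# Simulation sketch: when theta | y ~ N(theta_0, V_0/n), the posterior mean of
# log p(y | theta) should sit about k/2 below its maximum (attained at theta_0).
# The values of k, n, theta_0, V_0, and c_y below are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)

k = 5                      # number of parameters (dimension of theta)
n = 200                    # sample size
theta_0 = np.zeros(k)      # limiting posterior mean
V_0 = np.eye(k)            # limiting posterior scale matrix
c_y = 0.0                  # stands in for c(y), which is constant in theta

Sigma = V_0 / n            # posterior covariance in the normal limit
Sigma_inv = np.linalg.inv(Sigma)
_, logdet = np.linalg.slogdet(Sigma)

# Draw posterior samples of theta from the limiting normal.
theta = rng.multivariate_normal(theta_0, Sigma, size=100_000)

# Log predictive density at each draw:
# c(y) - 1/2 * (k*log(2*pi) + log|V_0/n| + (theta - theta_0)' (V_0/n)^{-1} (theta - theta_0))
diff = theta - theta_0
quad = np.einsum('ij,jk,ik->i', diff, Sigma_inv, diff)   # quadratic form, ~ chi^2_k
lpd = c_y - 0.5 * (k * np.log(2 * np.pi) + logdet + quad)

# Maximum is attained at theta = theta_0, where the quadratic form is zero.
lpd_max = c_y - 0.5 * (k * np.log(2 * np.pi) + logdet)
print("max minus posterior mean of log p(y|theta):", lpd_max - lpd.mean())
```

With $k = 5$ the printed gap should come out close to $k/2 = 2.5$, matching the statement that the posterior mean of the log predictive density is $\frac{k}{2}$ lower than its maximum.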
