Log predictive density asymptotically in predictive information criteria for Bayesian models
Gabriella Sellers
Answered question
2022-06-13
I am reading Andrew Gelman's paper, Understanding predictive information criteria for Bayesian models, and the passage below is transcribed from a screenshot:
Under standard conditions, the posterior distribution, $p(\theta \mid y)$, approaches a normal distribution in the limit of increasing sample size (see, e.g., DeGroot, 1970). In this asymptotic limit, the posterior is dominated by the likelihood (the prior contributes only one factor, while the likelihood contributes $n$ factors, one for each data point), and so the likelihood function also approaches the same normal distribution.
As sample size $n \to \infty$, we can label the limiting posterior distribution as $\theta \mid y \to \mathrm{N}(\hat{\theta}, V_0/n)$. In this limit the log predictive density is
$$\log p(y \mid \theta) = c(y) - \tfrac{1}{2}\left(k\log(2\pi) + \log\lvert V_0/n \rvert + (\theta - \hat{\theta})^{T}(V_0/n)^{-1}(\theta - \hat{\theta})\right),$$
where $c(y)$ is a constant that only depends on the data $y$ and the model class but not on the parameters $\theta$.
The limiting multivariate normal distribution for $\theta$ induces a posterior distribution for the log predictive density that ends up being a constant minus $\tfrac{1}{2}$ times a $\chi^2_k$ random variable, where $k$ is the dimension of $\theta$, that is, the number of parameters in the model. The maximum of this distribution of the log predictive density is attained when $\theta$ equals the maximum likelihood estimate (of course), and its posterior mean is at a value $k/2$ lower.
For actual posterior distributions, this asymptotic result is only an approximation, but it will be useful as a benchmark for interpreting the log predictive density as a measure of fit.
With singular models (e.g. mixture models and overparameterized complex models more generally) a set of different parameters can map to a single data model, the Fisher information matrix is not positive definite, plug-in estimates are not representative of the posterior, and the distribution of the deviance does not converge to a $\chi^2$ distribution. The asymptotic behavior of such models can be analyzed using singular learning theory (Watanabe, 2009, 2010).
Sorry for the long paragraph. The things that confuse me are:
1. Why does it seem like we know the posterior distribution $p(\theta \mid y)$ first, and then use it to find the log predictive density $\log p(y \mid \theta)$? Shouldn't we get the model $p(y \mid \theta)$ first?
2. What does the sentence highlighted in green, "its posterior mean is at a value $k/2$ lower", mean? My understanding is that since there is a $-\tfrac{1}{2}(\theta - \hat{\theta})^{T}(V_0/n)^{-1}(\theta - \hat{\theta})$ term in the expression, and the expectation of $\chi^2_k$ is $k$, this leads to a value $k/2$ lower. But lower than what? (I tried to make this concrete with the small simulation sketch after this list.)
3. How does this help with interpreting the log predictive density as a measure of fit? I can see that there is a mean squared error (MSE)-like term in the expression, but it is an MSE of the parameter $\theta$, not of the data $y$.
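
To make point 2 concrete, here is a small simulation sketch I put together (my own construction, not from the paper; the values of $k$, $n$, $V_0$, and $\hat{\theta}$ are arbitrary made-up choices). It draws $\theta$ from the limiting normal posterior and compares the posterior mean of $\log p(y \mid \theta)$ with its value at $\theta = \hat{\theta}$; the gap comes out near $k/2$, but I am not sure this plug-in value is the baseline the paper has in mind:

```python
import numpy as np

# Sketch: under the limiting posterior theta | y ~ N(theta_hat, V0/n), the
# theta-dependent part of log p(y|theta) is -1/2 times a quadratic form that
# is chi^2_k distributed, so the posterior mean of log p(y|theta) should sit
# about k/2 below its value at theta = theta_hat.
# k, n, V0, theta_hat below are arbitrary illustrative choices, not from the paper.

rng = np.random.default_rng(0)
k, n = 4, 500                        # number of parameters, sample size
A = rng.normal(size=(k, k))
V0 = A @ A.T + k * np.eye(k)         # some positive-definite V0
Sigma = V0 / n                       # limiting posterior covariance
theta_hat = rng.normal(size=k)       # stand-in for the MLE / posterior mode

# Draw theta from the limiting posterior; c(y) and the log-determinant term
# cancel when we look at differences from the value at theta_hat.
S = 20000
theta = rng.multivariate_normal(theta_hat, Sigma, size=S)
diff = theta - theta_hat
quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)  # ~ chi^2_k
log_p_minus_plugin = -0.5 * quad     # log p(y|theta) - log p(y|theta_hat)

print("plug-in value minus posterior mean:", -log_p_minus_plugin.mean())
print("k/2 =", k / 2)
```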
Thanks for any help!