ajakanvao

2022-11-23

Should the independent (or dependent) variables in a linear regression model be normally distributed, or just the residuals?

barene55d

Linear regression expresses a relationship between a response and covariates that is linear in the coefficients. In the simple case it relates a one-dimensional response $Y$ to a one-dimensional covariate $X$ as follows:
$Y = \beta_0 + \beta_1 X + \epsilon$
where $Y$, $X$ and $\epsilon$ are random variables and $\beta_0, \beta_1$ are coefficients (model parameters) to be estimated.
Being a model for the conditional mean, the regression specifies
$E[Y \mid X] = \beta_0 + \beta_1 X$, with the implied assumptions that
$E[\epsilon \mid X] = 0$ and $\operatorname{Var}(\epsilon \mid X) = \sigma^2$ (constant).
Thus, model restrictions are placed only on the conditional distribution of $\epsilon$ given $X$, or equivalently of $Y$ given $X$. No restriction is placed on the marginal distribution of $X$ itself.
A convenient distribution for the error term $\epsilon$ (whose estimates are the residuals) is the Normal/Gaussian, but the regression model, in general, works with other distributions as well.
Not to confuse things further, but it should be noted that regression analysis need not make any distributional assumptions at all. To estimate the coefficients, for example, we can use the method of least squares with no mention of any distribution. For more complex analyses (confidence intervals, hypothesis tests), however, statisticians use probability distributions to specify models, make assumptions explicit, and justify results using probability theory.
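To illustrate the point that least squares needs no normality, here is a minimal sketch (the exponential covariate, uniform noise, and coefficient values are assumptions chosen for the demonstration, not part of the answer above). Neither $X$ nor $\epsilon$ is normal, yet least squares still recovers the coefficients, because only $E[\epsilon \mid X] = 0$ and constant variance matter for unbiased estimation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Deliberately non-normal ingredients: a skewed (exponential) covariate
# and uniform noise centered at zero, so E[eps | X] = 0 with constant variance.
X = rng.exponential(scale=2.0, size=n)
eps = rng.uniform(-1.0, 1.0, size=n)

beta0, beta1 = 1.5, -0.7          # true coefficients (chosen for this sketch)
Y = beta0 + beta1 * X + eps

# Ordinary least squares via a design matrix [1, X]; note that no
# distributional assumption enters the estimation step.
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
print(coef)  # close to [1.5, -0.7]
```

With a large sample the estimates land very close to the true $(\beta_0, \beta_1)$ despite both the covariate and the noise being non-normal; normality would only be needed to justify, say, exact $t$-based inference in small samples.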
