jhenezhubby01ff

2022-09-04

Dimensionality of datasets in multiple regression
As an example, let's say that a linear regression is performed of the form
$Y={\beta }_{0}+{\beta }_{1}{X}_{1}+{\beta }_{2}{X}_{2}+\cdots +{\beta }_{n}{X}_{n}+\epsilon$
where $Y$ is a vector of $10,000$ measurements of peak acceleration of different car models, and the regressors correspond to different technical features of the cars.
From a linear algebra standpoint $Y$ lives in ${\mathbb{R}}^{10000}$, and the coefficients are found by orthogonally projecting this vector onto a hyperplane, i.e., by minimizing the squared Euclidean distance from $Y$ to that hyperplane.
Now, from the point of view of dimension as the number of linearly independent vectors that span a space, this vector $Y$ spans just $1$ dimension.
If it is truly $1$-dimension of a ${\mathbb{R}}^{10000}$ ambient space, the Euclidean projection on the hyperplane that underpins the process of finding the coefficients does not have any dimensionality issues (collinearity between the regressors being a separate topic). Otherwise, ${L}^{2}$ norms in high dimensions do pose problems.
"So is $Y$ (the vector of $10,000$ observations) $1$-dimensional or high dimensional?"
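A minimal numpy sketch of the projection view described above, using made-up data in place of the car measurements (the coefficients and noise scale are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the car example: n = 10,000 observations
# of peak acceleration, p = 3 technical features.
n, p = 10_000, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design matrix with intercept
y = X @ np.array([1.0, 0.5, -0.3, 2.0]) + rng.normal(scale=0.1, size=n)

# OLS: project y (a vector in R^n) onto the column space of X,
# a (p + 1)-dimensional subspace of R^n; the coordinates of the
# projection in the basis given by X's columns are the coefficients.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat  # the Euclidean projection of y onto col(X)

# The residual is orthogonal to every column of X.
print(np.allclose(X.T @ (y - y_hat), 0, atol=1e-6))  # True
```

The point the question is probing is visible here: $y$ is a single vector in ${\mathbb{R}}^{10000}$, but the projection target has only $p+1$ dimensions.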

Baron Coffey

Consider the function
$f\left({X}_{1},{X}_{2}\right)={\beta }_{0}+{\beta }_{1}{X}_{1}+{\beta }_{2}{X}_{2}.$
This is a plane in 3 dimensions no matter how many times you evaluate the function. Thus, your problem "lives" in a $2$-dimensional space.
As for using the ${L}^{2}$ norm, you are correct.
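A quick check of this point, with hypothetical Gaussian regressors: however many times you evaluate the function, the design matrix built from $\left(1,{X}_{1},{X}_{2}\right)$ never has rank above $3$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Evaluate f(X1, X2) = b0 + b1*X1 + b2*X2 at more and more points:
# the fitted values always lie in the span of just three vectors
# (the constant column, x1, and x2), no matter how large n gets.
for n in (10, 1_000, 100_000):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    print(n, np.linalg.matrix_rank(X))  # rank stays 3
```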

Leonel Schwartz

The issue of dimensionality in the context of regression analysis is the ratio between $n$, the number of observations, and $p$, the number of estimated parameters. The closer $n$ is to $p$, the less reliable your estimated model is. Assume that your model is
$y={\beta }_{0}+\sum _{j=1}^{p}{x}_{j}{\beta }_{j}+ϵ,$
hence in order to find the OLS estimators of $\beta =\left({\beta }_{0},...,{\beta }_{p}\right)$ you project the vector $y$ onto the subspace spanned by $\left(1,{x}_{1},...,{x}_{p}\right)$, a space of dimension at most $p+1$. The number of observations, $n$, does not count as a dimension of the model. If you have a continuous stochastic process, then you can sample from it infinitely many times, i.e., $n\to \mathrm{\infty }$, which is usually a good feature because you can safely use asymptotic results. Notably, in such a case there is another problem of artificially low p-values, but this is unrelated to the dimension of the model or the embedding space.
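A small simulation of the $n$-versus-$p$ point: with $p$ fixed, the OLS estimates become far more variable as $n$ shrinks toward $p$. The model, coefficients, and sample sizes below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

p = 5
beta = np.ones(p + 1)  # hypothetical true coefficients (intercept + p slopes)

def beta_sd(n, reps=500):
    """Empirical standard deviation of the first slope estimate
    across repeated simulated samples of size n."""
    ests = []
    for _ in range(reps):
        X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
        y = X @ beta + rng.normal(size=n)
        ests.append(np.linalg.lstsq(X, y, rcond=None)[0][1])
    return np.std(ests)

print(beta_sd(8))    # n barely above p + 1: very noisy estimates
print(beta_sd(800))  # n >> p: much tighter estimates
```

The estimated model is unreliable precisely when $n$ is close to $p$, even though the projection target is the same $(p+1)$-dimensional subspace in both cases.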