---
category: machine learning
date: 2022-08-22
---

# Linear Regression

If you have ever had to fit a line to some data points, you have quite possibly come across _linear regression_ and _least squares_. Most of the time _(linear) regression_ is introduced as follows: Assume we have some target data $y \in \mathbb{R}^N$ and some observations $X \in \mathbb{R}^{N \times p}$, and our task is to fit a line $f(X_i) = w^T X_i + w_0$ which minimizes the error. At this point the _mean squared error_ is often introduced:

$$ \text{MSE}(f) = \frac{1}{N} \sum_{i=1}^N (y_i - f(X_i))^2 $$

The question one should always ask: _Why exactly do we use this?_ If we ask ourselves which conditions the error function should satisfy, we will see that the (mean) squared error arises quite naturally:

- The error has to be non-negative, i.e. each residual $y_i - f(X_i)$ should contribute a value $\ge 0$
- Small errors should have less influence than large errors
- It should be easy to optimize

All of these criteria are met by the squared error $(y_i - f(X_i))^2$. Squaring the residual results in non-negative values; residuals with absolute value between $0$ and $1$ get less weight, while those greater than $1$ are amplified further. At the same time, the least squares problem has the closed-form solution $\hat{w} = (X^T X)^{-1} X^T y$.

Now even though all of this sounds reasonable, in my opinion there is a better way to introduce regression and least squares.

## Probabilistic introduction to Regression

As before, we want to fit a line $f(X_i) = w^T X_i + w_0$ as well as possible to our target data $y$. What this means is that we assume the target values $y_i$ and the data have the following relationship:

$$ y_i = w^T X_i + w_0 + \epsilon_i $$

Here $\epsilon_i$ is a random error that will always be present in real observations, which we assume to be drawn from a normal distribution $\mathcal{N}(0, \sigma^2)$.

```{margin}
See this [stackexchange post](https://stats.stackexchange.com/questions/316936/linear-regression-proving-least-squares-model) for the mathematical derivation.
```

Knowing this, we can also express the conditional probability of $y$ in terms of a normal distribution:

$$ P(y_i | X_i) = \mathcal{N}(y_i | w^T X_i + w_0, \sigma^2) $$

We still want to find the best possible $w$ and $w_0$ for our data. A simple way to fit a statistical model is to use _maximum likelihood estimation_, which involves maximizing the following likelihood function:

$$ L(w) = \prod_{i=1}^N \mathcal{N}(y_i | w^T X_i + w_0, \sigma^2) $$

Taking the logarithm of the above expression, multiplying with $-1$ and plugging in the definition of the normal distribution, we can equivalently minimize the negative log-likelihood:

$$
\begin{aligned}
NLL(w) &= - \sum_{i=1}^N \log \left[ \sqrt{\frac{1}{2 \pi \sigma^2}} \exp \left( -\frac{1}{2\sigma^2}(y_i - w^T X_i - w_0)^2 \right) \right]\\
&= \frac{1}{2\sigma^2}\sum_{i = 1}^N (y_i - w^T X_i - w_0)^2 + \frac{N}{2}\log(2 \pi\sigma ^2)
\end{aligned}
$$

If you look closely, the first term includes the _squared error_ we introduced earlier. At the same time, the second term is a constant that can be neglected when minimizing.

```{margin}
The _residual sum of squares_ is defined as:

$$ RSS = \frac{1}{2} \sum_{i = 1}^N (y_i - f(X_i))^2 $$
```

The objective of the minimization problem at hand, $\mathop{\rm arg\,min}\limits_{w, w_0} \frac{1}{2\sigma^2}\sum_{i = 1}^N (y_i - w^T X_i - w_0)^2$, is proportional to the _residual sum of squares_ and the _mean squared error_ introduced above.
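This equivalence is easy to check numerically. Below is a minimal sketch, assuming NumPy and SciPy are available; the data and the names such as `X_aug` and `nll` are made up for illustration. It minimizes the negative log-likelihood directly and compares the result with the closed-form least-squares solution $\hat{w} = (X^T X)^{-1} X^T y$.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Hypothetical data: N observations with p features plus Gaussian noise.
N, p = 200, 3
X = rng.normal(size=(N, p))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.7 + rng.normal(scale=0.3, size=N)

# Absorb the intercept w_0 by appending a column of ones.
X_aug = np.hstack([X, np.ones((N, 1))])

# Closed-form least-squares solution: (X^T X)^{-1} X^T y
w_closed = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)

# Negative log-likelihood for fixed sigma, dropping the constant term.
def nll(w, sigma=1.0):
    residuals = y - X_aug @ w
    return np.sum(residuals**2) / (2 * sigma**2)

# Minimize the NLL numerically; the minimizer should match w_closed.
w_mle = minimize(nll, x0=np.zeros(p + 1)).x

print(np.allclose(w_closed, w_mle, atol=1e-4))  # True, up to optimizer tolerance
```

Note that the value of $\sigma$ only rescales the objective, so it does not change where the minimum lies.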
Hence the solution to the maximum likelihood estimation is also:

$$ \hat{w} = (X^T X)^{-1} X^T y $$

To me this is quite a remarkable explanation of why the squared error is used.

### One step further

```{margin}
Actually, MLE is a special case of MAP with a uniform prior, as explained in this [post](https://agustinus.kristia.de/techblog/2017/01/01/mle-vs-map/) by Agustinus Kristiadi.
```

We can take the above one step further if, instead of maximum likelihood estimation, we use maximum a posteriori (MAP) estimation:

$$ w_{MAP}, w_{0_{MAP}} = \mathop{\rm arg\,max}\limits_{w, w_0} \left[ \prod_{i=1}^N P(y_i | w^T X_i + w_0) \right] \cdot P(\mathbf{w}) $$

Here $P(\mathbf{w})$ is the probability density of the prior we choose for the weights $w, w_0$ of our model. Using this framework we can easily derive many regression models, such as lasso and ridge regression, as summarized in the following table:

```{table} Summary of regression models for different likelihoods and priors. Likelihood refers to the distribution of $P(y_i | X_i)$ in this case.

| *Likelihood* | *Prior*  | *Name*            |
|:-------------|:---------|:------------------|
| Gaussian     | Uniform  | Least Squares     |
| Gaussian     | Gaussian | Ridge             |
| Gaussian     | Laplace  | Lasso             |
| Laplace      | Uniform  | Robust regression |
```

An in-depth explanation of this topic can also be found in Chapter 11 of {cite:ps}`pml1Book`.
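To make the Gaussian-prior row of the table concrete, here is a minimal sketch, again assuming NumPy and hypothetical data, of ridge regression as the MAP estimate under a Gaussian prior $\mathcal{N}(0, \tau^2 I)$ on the weights. Working through the negative log-posterior gives the well-known closed form $\hat{w}_{ridge} = (X^T X + \lambda I)^{-1} X^T y$ with $\lambda = \sigma^2 / \tau^2$; the intercept is omitted here for simplicity, since one usually does not place the prior on it.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data, same setup as before (no intercept).
N, p = 200, 3
X = rng.normal(size=(N, p))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=N)

sigma2 = 0.3**2      # noise variance of the Gaussian likelihood
tau2 = 1.0           # variance of the Gaussian prior on the weights
lam = sigma2 / tau2  # ridge penalty implied by the prior

# MAP estimate with a Gaussian prior = ridge regression:
# w_ridge = (X^T X + lambda * I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# For comparison: the plain least-squares (MLE / uniform prior) solution.
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

print(w_ridge, w_ols)  # the ridge weights are shrunk towards zero
```

The stronger the prior (smaller $\tau^2$), the larger $\lambda$ becomes and the more the weights are pulled towards zero.

```{bibliography}
```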