
Linear Regression


A linear regression model assumes that the regression function \(E(y_i \mid x_i)\) is linear in the inputs \(x_i^{(1)}, \cdots, x_i^{(p)}\). The model has the form \(\hat{y}_i = f(x_i) = \beta_0 + \underset{1 \times p}{\beta^T} \ \underset{p \times 1}{x_i} = \beta_0 + \sum_{j=1}^{p} \beta^{(j)} x_i^{(j)}\). If we include a constant \(1\) as the first component of \(x_i\), the intercept is absorbed into \(\beta\) and the model can be written as \(\hat{y}_i = f(x_i) = \underset{1 \times p}{\beta^T} \ \underset{p \times 1}{x_i} = \sum_{j=1}^{p} \beta^{(j)} x_i^{(j)}\).

Let’s first suppose \(X\in\mathbb{R}^{n\times p}\) and \(y\in\mathbb{R}^n\), which means we have \(n\) observations and \(p\) features.

Ordinary Least Squares

The residual sum of squares:

\[\text{RSS}(\beta) = (y-X \beta)^T (y-X \beta) = \|y - X\beta \|^2_2 = \sum_{i=1}^{n}(y_i - \beta^T x_i)^2.\]

The OLS (ordinary least squares) estimator has the closed-form solution

\[\hat{\beta} = \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \ \text{RSS}(\beta) = (X^T X)^{-1} X^T y.\]

The hat matrix \(H = X (X^T X)^{-1} X^T\) maps \(y\) to the fitted values, \(\hat{y} = Hy\); it is symmetric, idempotent, and positive semi-definite.

Assume the errors are i.i.d. Gaussian, \(\epsilon \sim N(0, \sigma^2 I)\); then \(\hat{\beta} \sim N(\beta, (X^TX)^{-1} \sigma^2)\).
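This follows by substituting \(y = X\beta + \epsilon\) into the closed-form estimator:

\[\hat{\beta} = (X^TX)^{-1}X^T(X\beta + \epsilon) = \beta + (X^TX)^{-1}X^T\epsilon,\]

so \(E(\hat{\beta}) = \beta\) and \(\text{Var}(\hat{\beta}) = (X^TX)^{-1}X^T(\sigma^2 I)X(X^TX)^{-1} = \sigma^2 (X^TX)^{-1}\).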

For more detail, see The Elements of Statistical Learning, pages 47-49.

The Gauss-Markov theorem: the least squares estimate of the parameter \(\beta\) has the smallest variance among all linear unbiased estimates. In other words, the OLS estimator is BLUE (the best linear unbiased estimator).

Derivation of the OLS estimator: the first and second derivatives of \(\text{RSS}(\beta)\) are

\[\frac{\partial \text{RSS}(\beta)}{\partial \beta} = -2 X^T (y-X \beta),\ \frac{\partial^2 \text{RSS}(\beta)}{\partial \beta \partial \beta^T} = 2 X^T X.\]

Assuming that \(X\) has full column rank (and hence \(X^TX\) is positive definite), setting the first derivative to zero gives the solution \(\hat{\beta} = (X^T X)^{-1} X^T y\).
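Putting the pieces together, here is a minimal NumPy sketch of the closed-form estimator on synthetic data (the data, seed, and variable names are invented for illustration; solving the normal equations is preferred to forming the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
# Prepend a column of 1s so the intercept is absorbed into beta.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([2.0, 1.0, -3.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# beta_hat = (X^T X)^{-1} X^T y, computed by solving the normal equations.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Fitted values via the hat matrix H = X (X^T X)^{-1} X^T, i.e. y_hat = H y.
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y
rss = np.sum((y - y_hat) ** 2)
print(beta_hat, rss)
```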

We can also use gradient descent to find \(\hat{\beta}\). At the \((t+1)\)-th iteration, the gradient at the current iterate is \(\frac{\partial \text{RSS}(\beta_t)}{\partial \beta_t} = -2 X^T (y-X \beta_t)\), and we update \(\beta_{t+1} = \beta_t + 2\eta\, X^T (y-X \beta_t)\) for some step size \(\eta > 0\).
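A minimal NumPy sketch of this update (the \(1/n\) rescaling of the gradient, the learning rate, and the iteration count are my own choices to make the step size easier to tune; they are not part of the derivation above):

```python
import numpy as np

def ols_gradient_descent(X, y, lr=0.1, n_iter=1000):
    """Minimize RSS(beta) by gradient descent.

    The gradient of RSS is -2 X^T (y - X beta); here it is divided by n
    (i.e. we minimize the mean squared error) so that lr is easier to choose.
    """
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = -2.0 / n * X.T @ (y - X @ beta)
        beta -= lr * grad
    return beta
```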

Subset Selection

There are two reasons why we do variable subset selection:

  • The first is prediction accuracy: the least squares estimates often have low bias but large variance. Prediction accuracy can sometimes be improved by shrinking or setting some coefficients to zero. By doing so we sacrifice a little bit of bias to reduce the variance of the predicted values, and hence may improve the overall prediction accuracy.
  • The second reason is interpretation. With a large number of predictors, we often would like to determine a smaller subset that exhibit the strongest effects.

Forward-stepwise selection

Forward-stepwise selection starts with the intercept \(\bar{y}=\frac{1}{n}\sum_{i=1}^n y_i\), and then sequentially adds to the model the predictor that most improves the fit. It is a greedy algorithm, producing a nested sequence of models \(M_0, M_1, \ldots, M_p\); we can then select a single best model from among them using cross-validated prediction error, \(C_p\) (AIC), BIC, or adjusted \(R^2\).

The computation is about \(p(p+1)/2\) least-squares fits. Advantages: computationally efficient; smaller variance than best subset selection, though perhaps more bias; it can be used even when \(p>n\). Disadvantages: errors made at the beginning cannot be corrected later.
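As a concrete reference, here is a minimal sketch of the greedy forward pass, scoring candidate predictors by training RSS (the function name and the choice to refit with np.linalg.lstsq at every step are illustrative assumptions; in practice the resulting nested models \(M_0,\ldots,M_p\) are then compared by cross-validation, \(C_p\)/AIC, BIC, or adjusted \(R^2\)):

```python
import numpy as np

def forward_stepwise(X, y):
    """Greedy forward selection: at each step, add the predictor whose
    inclusion gives the lowest training RSS. Returns the order in which
    predictors enter, together with the RSS of each nested model."""
    n, p = X.shape
    selected, remaining, path = [], list(range(p)), []
    while remaining:
        best_j, best_rss = None, np.inf
        for j in remaining:
            cols = selected + [j]
            # Least-squares fit of the candidate subset (intercept included).
            A = np.column_stack([np.ones(n), X[:, cols]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = np.sum((y - A @ beta) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
        remaining.remove(best_j)
        path.append((list(selected), best_rss))
    return path
```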

Backward-stepwise selection

Backward-stepwise starts with the full model, and sequentially deletes the predictor that has the least impact on the fit.

Advantages: because each step considers the full (or nearly full) model, it can judge each predictor's effect in the presence of all the others before dropping it. Disadvantages: it is computationally inefficient (it must start by fitting the full model), and it cannot be used when \(p>n\).

| Method | Optimality | Computation |
| --- | --- | --- |
| Best subset | Best model guaranteed | \(2^p\) |
| Forward stepwise | No guarantee | \(p(p+1)/2\) |
| Backward stepwise | No guarantee | \(p(p+1)/2\) |

Common selection criteria include \(C_p\), AIC, BIC, and adjusted \(R^2\).

Shrinkage Methods

Linear regression with an L1-norm penalty is the LASSO; with an L2-norm penalty it is ridge regression.

Ridge regression:

\[\hat{\beta} = \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \Big(\| y - X\beta \| ^2_2 + \lambda \| \beta \|^2_2 \Big) = (X^T X + \lambda I)^{-1} X^T y.\]
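The ridge estimator is the same normal-equations computation with \(\lambda I\) added to \(X^TX\); a minimal sketch (this toy version penalizes every coefficient, whereas in practice the intercept is usually left unpenalized):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge estimator (X^T X + lambda I)^{-1} X^T y via a linear solve."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```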

LASSO:

\[\hat{\beta} = \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \Big( \|y - X\beta \|^2_2 + \lambda \| \beta \|_1 \Big).\]
(Figure: LASSO and ridge.)

The L1 penalty shrinks some coefficients exactly to \(0\) and produces sparse coefficient vectors, so it can be used for feature selection; the sparsity also makes prediction more computationally efficient. The L2 penalty is differentiable everywhere, so ridge regression has an analytical solution and can be computed efficiently when training the model.
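One quick way to see the sparsity claim is to fit both penalties with scikit-learn on synthetic data where only two features matter (the data and penalty strengths below are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two features actually matter.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print(np.sum(lasso.coef_ == 0))  # several coefficients are exactly 0
print(np.sum(ridge.coef_ == 0))  # typically 0: ridge shrinks but never zeros out
```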

Time Complexity

Suppose \(X\in\mathbb{R}^{n\times p}\) and \(y\in\mathbb{R}^n\); then the computational complexity of the closed-form solution \((X^T X)^{-1} X^T y\) is \(O(pnp + p^3 + pn + p^2) = O(np^2+p^3)\). Here is the breakdown:

  1. The product \(X^TX\) takes \(O(pnp)\) time;
  2. The inversion of \(X^TX\) takes \(O(p^3)\) time;
  3. The product \(X^T y\) takes \(O(p n)\) time;
  4. Finally, the multiplication of \((X^T X)^{-1}\) and \(X^T y\) takes \(O(p^2)\).

If we use gradient descent, at the \((t+1)\)-th iteration, computing the gradient \(\frac{\partial \text{RSS}(\beta_t)}{\partial \beta_t} = -2 X^T (y-X \beta_t)\) takes \(O(np+n+pn)=O(np)\) time. If we iterate \(M\) times, the total time is \(O(Mnp)\). Here is the per-iteration breakdown:

  1. The product \(X\beta_t\) takes \(O(np)\) time;
  2. The subtraction \(y-X \beta_t\) takes \(O(n)\) time;
  3. The multiplication of \(-2X^T\) and \((y-X \beta_t)\) takes \(O(pn)\) time.

We may prefer gradient descent over computing the closed-form solution when \(p\) is very large, since the \(O(p^3)\) matrix inversion then becomes expensive.

We can also use mini-batch gradient descent with batch size \(b\) (\(b < n\)); then computing the gradient takes \(O(bp)\) time. However, in this case we usually need more iterations than full-batch gradient descent, since each gradient estimate is noisier. Note that setting \(b=1\) gives stochastic gradient descent.
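A mini-batch version of the earlier gradient-descent sketch (the batch size, learning rate, and epoch count are arbitrary illustrative choices):

```python
import numpy as np

def ols_minibatch_gd(X, y, lr=0.01, batch_size=32, n_epochs=50, seed=0):
    """Mini-batch gradient descent on the mean squared error.
    batch_size=1 corresponds to stochastic gradient descent."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            # Gradient on the mini-batch: O(b p) per step.
            grad = -2.0 / len(batch) * Xb.T @ (yb - Xb @ beta)
            beta -= lr * grad
    return beta
```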


