We demonstrate several regularization techniques used in regression analysis: ridge regression, the lasso, and the elastic net.
Ordinary least squares (OLS) regression fits a linear model with coefficients \(w = (w_1, ..., w_p)\) that minimize the residual sum of squares (RSS) between the observed responses and the responses predicted by the linear model.
Mathematically, it tries to minimize the objective function:
\[
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
\underset{w}{min\,} \norm {X w - y}_2^2
\]
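For reference, this is the objective that R's built-in lm() minimizes. Below is a minimal sketch on simulated data; the names X, y.sim, and ols.fit are illustrative and not part of the analysis that follows.
# Simulate a small design matrix and response, then fit OLS with lm()
set.seed(1)
n <- 100; p <- 3
X <- matrix(rnorm(n * p), n, p)
y.sim <- X %*% c(2, -1, 0.5) + rnorm(n)
ols.fit <- lm(y.sim ~ X)   # minimizes ||Xw - y||_2^2
coef(ols.fit)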
Ridge regression imposes an \(\ell_2\) penalty on the size of the coefficients. Here, \(\alpha \geq 0\) is a complexity parameter that controls the amount of shrinkage: the larger \(\alpha\), the greater the shrinkage of the coefficients.
Objective to minimize:
\[
\underset{w}{min\,} {\norm {X w - y}_2^2 + \alpha\, \norm w_2^2}
\]
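For this objective, the ridge coefficients have the closed form \(w = (X^T X + \alpha I)^{-1} X^T y\) (ignoring the intercept and any standardization). The helper below is a hypothetical sketch of that formula; glmnet scales its penalty differently, so the numbers will not match glmnet output exactly.
# Hypothetical helper: closed-form ridge coefficients for a given penalty alpha
ridge_closed_form <- function(X, y, alpha) {
  solve(crossprod(X) + alpha * diag(ncol(X)), crossprod(X, y))
}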
The lasso is a linear model trained with an \(\ell_1\) prior as regularizer. The lasso estimate solves the least-squares minimization with the penalty \(\alpha ||w||_1\) added, where \(\alpha\) is a constant and \(||w||_1\) is the \(\ell_1\)-norm of the coefficient vector.
Objective to minimize:
\[
\underset{w}{min\,} { \frac{1}{2n_{samples}} \norm {X w - y}_2^2 + \alpha\, \norm w_1}
\]
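When the columns of \(X\) are orthonormal, the lasso solution reduces to soft-thresholding the OLS coefficients (with a threshold proportional to \(\alpha\)); in the general case glmnet solves the problem by coordinate descent, which applies this operator one coordinate at a time. A small sketch of the operator, with soft_threshold as an illustrative name:
# Soft-thresholding operator: shrinks z toward zero and sets small values exactly to zero
soft_threshold <- function(z, alpha) {
  sign(z) * pmax(abs(z) - alpha, 0)
}
soft_threshold(c(-3, -0.2, 0.5, 2), alpha = 1)  # -2  0  0  1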
The elastic net is a linear regression model trained with both \(\ell_1\) and \(\ell_2\) priors as regularizers. We control the convex combination of \(\ell_1\) and \(\ell_2\) via the \(\ell_1\)-ratio parameter \(\rho\).
Objective to minimize:
\[
\underset{w}{min\,} { \frac{1}{2n_{samples}} \norm {X w - y}_2^2 + \alpha \rho \norm w_1 +\frac{\alpha(1-\rho)}{2} \norm w_2^2}
\]
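To make the role of \(\rho\) concrete, the hypothetical helper below simply evaluates the elastic-net penalty term for a given coefficient vector: \(\rho = 1\) recovers the lasso penalty and \(\rho = 0\) the ridge penalty.
# Hypothetical helper: elastic-net penalty for coefficients w, strength alpha, and mix rho
enet_penalty <- function(w, alpha, rho) {
  alpha * rho * sum(abs(w)) + alpha * (1 - rho) / 2 * sum(w^2)
}
enet_penalty(c(1, -2, 0.5), alpha = 0.5, rho = 0.3)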
We use the Hitters dataset: Major League Baseball data from the 1986 and 1987 seasons. Players with missing values are dropped before modeling.
library(ISLR)  # Hitters data from the ISLR package
model.data <- Hitters[complete.cases(Hitters), ]  # keep only complete records
str(model.data)
## 'data.frame': 263 obs. of 20 variables:
## $ AtBat : int 315 479 496 321 594 185 298 323 401 574 ...
## $ Hits : int 81 130 141 87 169 37 73 81 92 159 ...
## $ HmRun : int 7 18 20 10 4 1 0 6 17 21 ...
## $ Runs : int 24 66 65 39 74 23 24 26 49 107 ...
## $ RBI : int 38 72 78 42 51 8 24 32 66 75 ...
## $ Walks : int 39 76 37 30 35 21 7 8 65 59 ...
## $ Years : int 14 3 11 2 11 2 3 2 13 10 ...
## $ CAtBat : int 3449 1624 5628 396 4408 214 509 341 5206 4631 ...
## $ CHits : int 835 457 1575 101 1133 42 108 86 1332 1300 ...
## $ CHmRun : int 69 63 225 12 19 1 0 6 253 90 ...
## $ CRuns : int 321 224 828 48 501 30 41 32 784 702 ...
## $ CRBI : int 414 266 838 46 336 9 37 34 890 504 ...
## $ CWalks : int 375 263 354 33 194 24 12 8 866 488 ...
## $ League : Factor w/ 2 levels "A","N": 2 1 2 2 1 2 1 2 1 1 ...
## $ Division : Factor w/ 2 levels "E","W": 2 2 1 1 2 1 2 2 1 1 ...
## $ PutOuts : int 632 880 200 805 282 76 121 143 0 238 ...
## $ Assists : int 43 82 11 40 421 127 283 290 0 445 ...
## $ Errors : int 10 14 3 4 25 7 9 19 0 22 ...
## $ Salary : num 475 480 500 91.5 750 ...
## $ NewLeague: Factor w/ 2 levels "A","N": 2 1 2 2 1 1 1 2 1 1 ...
We will use the R glmnet package to regress the Salary of the baseball players on the remaining predictors. In glmnet, the penalty on the coefficient vector \(\beta\) is defined as
\[
\lambda \left( \frac{1-\alpha}{2} \norm {\beta}_2^2 + \alpha\, \norm {\beta}_1 \right),
\]
where \(\alpha=1\) gives the lasso penalty and \(\alpha=0\) the ridge penalty. For \(0 < \alpha < 1\) you get the elastic net model. Note that glmnet calls the overall penalty strength lambda, while alpha is the \(\ell_1\) mixing proportion.
library(glmnet)
# Design matrix without the intercept column; model.matrix expands factors to dummy variables
x <- model.matrix(Salary ~ ., model.data)[, -1]
y <- model.data$Salary
# Grid of lambda values spanning several orders of magnitude
grid <- 10^seq(0.1, 5.25, length = 100)
# Ridge (alpha = 0), lasso (alpha = 1), and elastic net (alpha = 0.05) fits,
# each with a cross-validated counterpart
ridge.mod <- glmnet(x, y, alpha = 0, lambda = grid)
cv.ridge <- cv.glmnet(x, y, alpha = 0)
lasso.mod <- glmnet(x, y, alpha = 1, lambda = grid)
cv.lasso <- cv.glmnet(x, y, alpha = 1)
elastic.mod <- glmnet(x, y, alpha = 0.05, lambda = grid)
cv.elastic <- cv.glmnet(x, y, alpha = 0.05)
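Assuming the fits above, glmnet's plot methods give the coefficient paths as a function of \(\log(\lambda)\) and the cross-validation error curves (plots not shown here):
# Coefficient paths and CV error curves for the ridge and lasso fits
plot(ridge.mod, xvar = "lambda", label = TRUE)
plot(cv.ridge)
plot(lasso.mod, xvar = "lambda", label = TRUE)
plot(cv.lasso)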
ridge.mod$lambda[50]
## [1] 502.3543
Here are the ridge regression coefficients at \(\lambda \approx 502.35\), the 50th value in ridge.mod$lambda:
coef(ridge.mod)[,50]
## (Intercept) AtBat Hits HmRun Runs
## 32.28886473 0.10294692 0.73816726 0.97336679 0.99928237
## RBI Walks Years CAtBat CHits
## 0.87005503 1.45357384 2.18451382 0.01118123 0.05121007
## CHmRun CRuns CRBI CWalks LeagueN
## 0.36720312 0.10245821 0.10754182 0.06687056 17.25729011
## DivisionW PutOuts Assists Errors NewLeagueN
## -65.35634421 0.13934708 0.02059125 -0.96259342 9.37464578
To obtain the ridge regression coefficients for a new value of \(\lambda\), say 50 (glmnet interpolates between the fitted values of \(\lambda\) by default):
coef(ridge.mod, s = 50)[1:20,]
## (Intercept) AtBat Hits HmRun Runs
## 4.834785e+01 -3.541365e-01 1.953935e+00 -1.287662e+00 1.155112e+00
## RBI Walks Years CAtBat CHits
## 8.085495e-01 2.710330e+00 -6.215391e+00 5.982191e-03 1.070891e-01
## CHmRun CRuns CRBI CWalks LeagueN
## 6.292899e-01 2.180264e-01 2.156013e-01 -1.491177e-01 4.586129e+01
## DivisionW PutOuts Assists Errors NewLeagueN
## -1.182266e+02 2.501814e-01 1.209485e-01 -3.276847e+00 -9.426409e+00
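The same s argument also works when predicting responses. A sketch computing in-sample fitted values and the training RMSE at \(\lambda = 50\) (the value is only illustrative):
# Fitted salaries at lambda = 50 and the corresponding training RMSE
ridge.pred <- predict(ridge.mod, s = 50, newx = x)
sqrt(mean((ridge.pred - y)^2))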
The value of \(\lambda\) that minimizes the cross-validation error, and the corresponding mean CV error, for ridge regression:
c(cv.ridge$lambda.min, min(cv.ridge$cvm))
## [1] 28.01718 115163.60638
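The analogous quantities for the lasso and elastic-net fits can be read from their cv.glmnet objects in the same way (output omitted):
# lambda.min and minimum mean CV error for the lasso and elastic-net fits
c(cv.lasso$lambda.min, min(cv.lasso$cvm))
c(cv.elastic$lambda.min, min(cv.elastic$cvm))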
The lasso imposes sparsity on the coefficients and is well suited to recovering sparse signals or handling a large number of variables. Ridge regression limits the size of the coefficient vector and can therefore reduce overfitting through a better bias-variance trade-off. The elastic net is a mix of the two regularization techniques, and hence achieves both shrinkage and automatic feature selection.
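One way to see this sparsity in practice, assuming the cross-validated fits above, is to count the nonzero coefficients each model keeps at its lambda.min:
# Number of nonzero coefficients (excluding the intercept) at each model's lambda.min
sapply(list(ridge = cv.ridge, lasso = cv.lasso, elastic = cv.elastic),
       function(cv) sum(coef(cv, s = "lambda.min")[-1, ] != 0))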