A common issue when fitting a model to data is overfitting - having too many degrees of freedom (i.e. variables or features) compared to the number of samples or observations. With all of those degrees of freedom, a model can become overly complex, fit the training data too closely, and then perform poorly on new data.

One way to address this problem is feature selection. With feature selection, only a subset of variables (features) is used. Another approach is to employ regularization, which adds constraints to the optimization. In this post, we are going to look at two particular types of regularization called Ridge and LASSO (Least Absolute Shrinkage and Selection Operator) regression.

You can try out using LASSO regression to predict octane values from NIR spectra of gasoline samples using the Live Script I wrote, which you can find in the GitHub repository for my class here.

Regularization

The goal with simple linear regression is to minimize an objective function: the sum of the squared differences between the model-predicted values (\(\bf{x_i} \cdot \bf{b}\)) and the actual values (\(y_i\)):

\[\sum_{i=1}^n \left(y_i - \bf{x_i} \cdot \bf{b}\right)^2 \rightarrow min\]

(Note that the intercept has been included in the vector of regression coefficients, \(\bf{b}\), as \(b_0\), and that \(x_0 = 1\).)
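As a concrete (made-up) illustration, MATLAB’s backslash operator solves exactly this least-squares problem; the numbers below are invented purely to show the mechanics:

```matlab
% Minimal sketch with made-up numbers: ordinary least squares minimizes the
% sum of squared residuals, and the backslash operator returns that minimizer.
X = [ones(5,1), (1:5)'];          % column of ones supplies the intercept (x0 = 1)
y = [2.1; 3.9; 6.2; 8.1; 9.8];    % invented response values
b = X \ y;                        % least-squares coefficients [b0; b1]
residSS = sum((y - X*b).^2);      % the quantity being minimized
```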

With regularization, we add an additional term to the quantity being minimized; what that term is depends on the type of regularization being used. Ridge, LASSO and elastic net are common types of regularization. The objective function becomes:

\[\sum_{i=1}^n \left(y_i - \bf{x_i} \cdot \bf{b}\right)^2 + \lambda \sum_{j=1}^m r_j \rightarrow min\]

Here, the second sum is over all \(m\) variables, and the term being summed over, \(r_j\), is \(\frac{1}{2}b_j^2\) for Ridge regression and \(\lvert b_j \rvert\) for LASSO regression. Note the complexity parameter, \(\lambda\), which determines how important this additional constraint is in the overall minimization.
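To make the two penalties concrete, here is a small sketch of the penalized objective as anonymous functions (the helper names are mine, not from the post):

```matlab
% Sketch of the two penalty choices for r_j and the penalized objective.
% (These helper names are just for illustration.)
ridgePenalty = @(b) sum(0.5 * b.^2);     % r_j = (1/2) * b_j^2
lassoPenalty = @(b) sum(abs(b));         % r_j = |b_j|
objective    = @(b, X, y, lambda, pen) sum((y - X*b).^2) + lambda * pen(b);
```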

Simple example

Let’s look at a very simple example to illustrate how regularization works and what it does.

Imagine that we have one observation with a single response value, \(y = 8\), and a single vector of coordinates, \(\bf{x} = \left[2 ~~ 3\right]\).


The equation to be solved is then:

\[2b_0 + 3b_1 = 8\]

There are an infinite number of solutions to this equation. In other words, there are an infinite number of sets of regression coefficients, \(\begin{bmatrix} b_0 \cr b_1 \end{bmatrix}\), that will work. In the figure below, those solutions are represented by the blue diagonal line. Two solutions in particular, \(b_0=4, b_1=0\) and \(b_0=-2, b_1=4\) are represented by the grey dots.

[Figure: the blue line of solutions to \(2b_0 + 3b_1 = 8\), the two grey-dot solutions, and the Ridge and LASSO solutions as open circles]

The solutions with the Ridge and LASSO constraints are represented by the open circles. So you can see how, by imposing those constraints, we go from an infinite number of (equally valid) solutions to just one for each type of regularization.

The Ridge solution is: \(\begin{bmatrix} 1.23 \cr 1.85 \end{bmatrix}\).

The LASSO solution is: \(\begin{bmatrix} 0 \cr 2.67 \end{bmatrix}\).
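If you want to check those numbers yourself, here is a minimal sketch (not from the original Live Script) that searches along the constraint line \(2b_0 + 3b_1 = 8\) for the coefficients with the smallest Ridge and LASSO penalties - which is what the regularized solutions approach as \(\lambda\) becomes small:

```matlab
% Among all exact solutions of 2*b0 + 3*b1 = 8, find the ones with the
% smallest Ridge (L2) and LASSO (L1) penalties.
b0 = linspace(-4, 6, 100001);          % grid of candidate b0 values
b1 = (8 - 2*b0) / 3;                   % enforce the constraint 2*b0 + 3*b1 = 8
ridgePen = 0.5*(b0.^2 + b1.^2);        % r_j = (1/2) b_j^2
lassoPen = abs(b0) + abs(b1);          % r_j = |b_j|
[~, iR] = min(ridgePen);
[~, iL] = min(lassoPen);
fprintf('Ridge solution:  b0 = %.2f, b1 = %.2f\n', b0(iR), b1(iR));   % approx. 1.23, 1.85
fprintf('LASSO solution:  b0 = %.2f, b1 = %.2f\n', b0(iL), b1(iL));   % approx. 0.00, 2.67
```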

LASSO regression as feature selection

In the example above, notice that the Ridge solution has two non-zero coefficients, whereas the LASSO solution has only one. LASSO has performed a form of feature selection because one of the variables, \(x_0\), is not being used (its coefficient, \(b_0\), is zero).

This is one of the useful features of using LASSO regression - you get feature selection built in. To be fair, even though Ridge regression is not performing feature selection, it is adjusting the relative importance of the different features through the magnitudes of the coefficients - it’s just not eliminating some features like LASSO is.

\(\lambda\): the complexity parameter

The exact solution obtained with either Ridge or LASSO regression depends on the value of \(\lambda\), the complexity parameter. If \(\lambda=0\), there is no regularization and (in the example above) there are an infinite number of solutions. As the value of \(\lambda\) increases, the regularization component of the objective function being minimized becomes more important.

For LASSO regression, increasing the value of \(\lambda\) increases the number of features whose coefficients are exactly zero - in other words, larger values of \(\lambda\) result in a solution that uses fewer features.
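Here is a small synthetic sketch of that behavior (the data here are random stand-ins, not the NIR data used later in the post):

```matlab
% Synthetic illustration: as lambda grows, more LASSO coefficients are
% driven to exactly zero. X and y are random stand-ins, not real data.
rng(0);                                    % reproducible example
X = randn(40, 600);                        % 40 samples, 600 variables
y = X(:,5) - 2*X(:,17) + 0.1*randn(40,1);  % response depends on only two variables
lambdas = [0.001 0.01 0.1 1];              % increasing regularization strengths
B = lasso(X, y, 'Lambda', lambdas);        % one column of coefficients per lambda
disp(sum(B ~= 0));                         % number of selected features at each lambda
```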

A more complex (interesting?) example

To see how LASSO regression can be used with a “real” dataset, let’s look at NIR (near-infrared) spectra of 40 biscuit dough samples with their associated fat content values (% fat weight compared to flour weight). (Osborne, B.G., Fearn, T., Miller, A.R. and Douglas, S. (1984), Application of near infrared reflectance spectroscopy to the compositional analysis of biscuits and biscuit doughs. J. Sci. Food Agric., 35: 99-105.)

[Figure: NIR spectra of the 40 biscuit dough samples (absorption intensity vs. wavelength)]

There are 600 NIR absorption intensity values (“Data” on the y-axis) for each spectrum, corresponding to the 600 wavelengths from 1200 nm to 2398 nm (on the x-axis) - 600 variables for only 40 samples, so the system is underdetermined!

Even though NIR spectra lack the tell-tale “fingerprint” features that mid-infrared spectra typically have, they still provide a wealth of information and have been used extensively, especially in food science/chemistry.

You can see that there is not a lot of variation in the spectra despite the fact that the fat values range from 15.0 to 21.6 (not shown). That’s OK, this is the sort of thing that multivariate regression is built for!

When the value of \(\lambda\) is optimized (more on that below), 13 variables (i.e. wavelengths) are selected. In other words, only 13 of the 600 wavelengths have non-zero LASSO regression coefficients. You can see which ones below, indicated by blue lines overlaid on the average (normalized) NIR spectrum:

[Figure: the 13 selected wavelengths (blue lines) overlaid on the average normalized NIR spectrum]

You can see that the LASSO feature selection has focused on a particular region of the spectrum near 1650 nm, and that this region doesn’t correspond to a peak. In fact, it is the shoulder of a peak, and looking at the original spectra (above), it would have been hard to guess that this region contains the most information about fat content - in fact, all of the spectra look nearly identical there!
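For reference, a minimal sketch of how such a fit might be set up in MATLAB is below; spectra, fat, and wavelengths are placeholder variable names (not from the original Live Script), and the one-standard-error choice of \(\lambda\) is explained further down:

```matlab
% Minimal sketch of the LASSO fit with cross-validation. spectra (40x600 NIR
% intensities), fat (40x1 fat values), and wavelengths (1x600, in nm) are
% placeholder variable names, not taken from the original Live Script.
[B, FitInfo] = lasso(spectra, fat, 'CV', 10);   % 10-fold cross-validation over a lambda grid
idx      = FitInfo.Index1SE;                    % lambda chosen by the one-standard-error rule
selected = find(B(:, idx) ~= 0);                % wavelengths with non-zero coefficients
wavelengths(selected)                           % which wavelengths were kept
```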

How good is the resulting LASSO model with 13 variables? Take a look:

[Figure: performance of the 13-variable LASSO model]

That’s an RMSEC (root-mean-square error of calibration) of only 0.3 and an RMSECV (root-mean-square error of cross-validation) of only 0.5.
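For what it’s worth, here is one way those two numbers could be computed from a lasso fit like the sketch above (same placeholder variable names):

```matlab
% Calibration and cross-validation errors from the fit sketched above.
yhat   = spectra * B(:, idx) + FitInfo.Intercept(idx);   % predictions on the calibration data
RMSEC  = sqrt(mean((fat - yhat).^2));                    % root-mean-square error of calibration
RMSECV = sqrt(FitInfo.MSE(idx));                         % cross-validated error at the chosen lambda
```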

How to find the optimum value of \(\lambda\)

As I mentioned earlier, if you change the strength of regularization by changing the value of the complexity parameter, \(\lambda\), the number of features LASSO selects will change. In the biscuit dough example, above, I had MATLAB vary the value of \(\lambda\) and calculate the MSE (mean-squared error) using cross validation at each one.

[Figure: cross-validated MSE as a function of \(\lambda\)]

The recommended value of \(\lambda\) is actually not the one with the lowest MSE (the green circle and green dashed line) - it’s the largest one whose MSE is within 1 standard error of the lowest MSE (the blue circle and the blue dashed line). This is an example of parsimony - choosing the least complicated model without giving up too much performance. If we had chosen the value of \(\lambda\) with the lowest MSE (\(\lambda=0.009\)), we would have had a model with more features than the one built using \(\lambda=0.12\) - but not much better performance.
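The MSE-versus-\(\lambda\) plot and the two marked values come straight out of the same fit; a sketch (again with the placeholder names from above):

```matlab
% Cross-validated MSE as a function of lambda, with the minimum-MSE lambda
% (green) and the one-standard-error lambda (blue) marked by lassoPlot.
lassoPlot(B, FitInfo, 'PlotType', 'CV');
legend('show');                 % labels the two marked lambda values
FitInfo.LambdaMinMSE            % the lambda with the lowest cross-validated MSE
FitInfo.Lambda1SE               % the larger, more parsimonious (recommended) lambda
```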

Ridge regression

And here are the coefficients from Ridge regression.

[Figure: Ridge regression coefficients across the 600 wavelengths]

You can see that very few (if any) coefficients equal zero. So, the Ridge regression model still involves all wavelengths; it’s just that some are weighted more heavily than others. Notice that a lot of the wavelengths with sizeable coefficients are near the 1650 nm region, just like we saw with LASSO regression!
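A corresponding Ridge fit is a one-liner with MATLAB’s ridge function; the sketch below uses the same placeholder variable names, and the ridge parameter shown is an arbitrary example rather than the value used for the plot above:

```matlab
% Ridge regression over all 600 wavelengths. The ridge parameter k is an
% arbitrary example value, not the one used for the coefficient plot above.
k = 0.1;
bRidge = ridge(fat, spectra, k, 0);   % scaled = 0: coefficients on the original scale, intercept first
plot(wavelengths, bRidge(2:end));     % one (generally non-zero) coefficient per wavelength
xlabel('Wavelength (nm)');
ylabel('Ridge regression coefficient');
```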

While it is beyond the scope of this post to explain how to carry out LASSO and Ridge regression in MATLAB, MATLAB does have very convenient functions (lasso and ridge) for doing so. Check out the lasso and ridge documentation on the official MathWorks site.