In this project, I built a set of Machine Learning models to predict the concentration of dissolved solids in beer wort from NIR (near-infrared) spectra.

[Image: By Schlemazl - Own work, CC BY-SA 3.0]



Motivation

When brewing beer, the concentration of dissolved solids in the wort, the initial malty liquid that eventually becomes beer, is an important quantity to measure. That’s because its value is correlated with the eventual alcohol content of the final product. Traditionally, the concentration of dissolved solids is estimated by measuring the specific gravity, or density relative to water, of the wort.

However, making such a measurement requires extracting a sample of the wort and using a hydrometer. The hydrometer is weighted so that it sinks to a depth that depends on the density - the denser the wort, the less it sinks. This process must be repeated every time the specific gravity is measured as the beer progresses through fermentation.

NIR (near-infrared) spectroscopy provides a convenient alternative since it can be performed non-destructively and in an online process. The trick is, of course, figuring out how to relate the measured spectra to the specific gravity, or more precisely, the concentration of dissolved solids.

Objective

Here, I used a published dataset (Nørgaard, L. et al. Interval Partial Least-Squares Regression (iPLS): A Comparative Chemometric Study with an Example from Near-Infrared Spectroscopy. Appl Spectrosc 54, 413–419 (2000)) of NIR spectra and dissolved solids concentrations, called “extract” (in crazy brewing units called degrees Plato, degP), to build a set of Machine Learning regression models.

Approach

The overall approach I took consisted of the following steps:

  1. explore the data by plotting the NIR spectra
  2. build a LASSO model
  3. build a PLS model
  4. build a GPR model

Results

Data

The dataset consists of 40 samples, each with a NIR spectrum and a dissolved solids concentration (ranging from 4.2 to 18.8 degP).

[Figure: NIR spectra of 40 beer wort samples from Nørgaard et al., Appl Spectrosc 54, 413–419 (2000).]

Notice that the spectra differ very little, despite the fact that the dissolved solids concentration varies by more than a factor of four. This is a perfect example of where multivariate regression can be used to identify the key features (i.e. wavelengths) that are correlated with the quantity of interest.

We are going to use three different regression models:

  • LASSO regularization
  • PLS (partial least squares)
  • GPR (Gaussian process regression)

LASSO regularization

LASSO regularization adds an additional constraint (an L1 penalty on the regression coefficients) to the optimization problem, thereby providing a unique solution where otherwise there wouldn’t be one. Practically, this results in a form of feature selection in which the regression coefficients for many of the variables are exactly zero.
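
In the standard formulation, LASSO minimizes the usual sum of squared residuals plus an L1 penalty on the coefficients, with the regularization strength set by \lambda:

\hat{\beta} = \arg\min_{\beta} \left( \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1 \right)

It is the L1 penalty that drives many of the coefficients to exactly zero.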

Read my post on LASSO and Ridge regression to learn more about how regularization is used when problems are underdetermined.

The variables selected using the beer dataset are shown below:

[Figure: Average NIR spectrum of beer wort samples (red) and LASSO regression coefficients (blue).]

Note that only seven wavelengths have non-zero coefficients: 1184, 1320, 1326, 2126, 2128, 2136 and 2246 nm, and of these only three dominate.

Using 5-fold cross validation, we can see how well we can expect the LASSO model to perform on new data:

[Figure: Cross-validated predicted values of the “extract”.]

The extract RMSECV (root-mean-square error of cross validation) for this model is 0.38 degP, which is pretty good considering that the average extract value is 11.1 degP.
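
For reference, the RMSECV is just the root-mean-square error of the cross-validated predictions \hat{y}_i against the measured values y_i:

\mathrm{RMSECV} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2}

(The script at the end of this post computes it per fold and then averages the five fold values, which amounts to essentially the same thing.)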

PLS

Partial least squares regression builds a set of latent variables that are particularly good for predicting values of the predictand (i.e. the “extract”) because it uses both sets of data (i.e. the NIR spectra and the “extract” values) when constructing them. You can think of it as related to PCA, but whereas PCA tries to explain the variance of the NIR spectra alone, PLS tries to explain the variance of both the NIR spectra and the extract values (it’s a little more involved than that, but that’s the basic idea).
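
In the usual notation, PLS decomposes both data blocks with a shared set of score vectors, the latent variables:

X = T P^{\top} + E, \qquad y = T q + f

where the columns of T are the latent variables, P and q are loadings, and E and f are residuals. The scores in T are built to have high covariance with y, which is why they predict it well.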

Sounds good, right? Let’s use all of the data to come up with a great regression model. So, how does PLS do?

Well … the RMSECV of the PLS model is an underwhelming 0.94 degP. This is a lot worse than what we got with LASSO regression (0.38 degP) and is clearly not acceptable. What happened? The PLS model was overfit to the training data.

To address this issue, I used a variable ranking algorithm called RReliefF (see MathWorks documentation on the relieff() function) to select variables. Specifically, I chose just the top 10% (93) wavelengths.

[Figure: Average NIR spectrum of beer wort samples (red) and RReliefF weights for the top 10% of wavelengths (blue).]

But, in order to avoid so-called data leakage, I performed this variable selection inside each of the five folds of the 5-fold cross validation, as sketched below. Thus, each of the five PLS models built during cross validation was potentially built on a different set of 93 wavelengths.
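
A condensed sketch of that selection-inside-the-folds pattern (the full version, which also averages over repeated repartitions, is in the HW6_3_PLS script at the end of this post; the 5 latent variables are hard-coded here for brevity, whereas the full script picks the number from the explained variance):

c = cvpartition(length(extract_data),'KFold',5);
yhat = zeros(size(extract_data));
for i = 1:5
    tr = training(c,i); te = ~tr;
    % rank wavelengths on the training fold ONLY - no data leakage
    idx = relieff(beer_spectra(tr,:),extract_data(tr),5);
    idx = idx(1:round(numel(idx)/10));   % keep the top 10% of wavelengths
    [~,~,~,~,beta] = plsregress(beer_spectra(tr,idx),extract_data(tr),5);
    yhat(te) = [ones(sum(te),1) beer_spectra(te,idx)]*beta;
end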

The RMSECV for this approach was a much more reasonable 0.24 degP.

GPR

I also built a GPR (Gaussian process regression) model, which is basically a super cool interpolation method (a little too complicated to explain here). I ran into a similar problem with the GPR as I did with the PLS regression - overfitting when using all of the wavelengths in the spectra. Only this time it was a lot worse! The RMSECV was 2.3 degP! I may as well have just used the average extract value as my guess for each sample … literally!

However, if I use RReliefF to select variables, I get a much better RMSECV of 0.23 degP.
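
The GPR script isn’t included below, but here is a minimal sketch of what the cross-validation looked like, assuming MATLAB’s fitrgp() with its default squared-exponential kernel (the actual kernel and settings I used may differ):

k = 5;
c = cvpartition(length(extract_data),'KFold',k);
yhat = zeros(size(extract_data));
for i = 1:k
    tr = training(c,i); te = ~tr;
    % per-fold RReliefF selection of the top 10% of wavelengths, as with PLS
    idx = relieff(beer_spectra(tr,:),extract_data(tr),5);
    idx = idx(1:round(numel(idx)/10));
    mdl = fitrgp(beer_spectra(tr,idx),extract_data(tr));   % default kernel
    yhat(te) = predict(mdl,beer_spectra(te,idx));
end
RMSECV = sqrt(mean((yhat - extract_data).^2))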

Conclusion

Multivariate regression models were built that can predict the extract (i.e. dissolved solids concentration) very accurately from NIR spectra:

  • All three models worked well with only a subset of the variables (i.e. wavelengths).

  • Using all of the variables led to gross overfitting using the PLS and GPR models.

  • Variable selection using RReliefF improved performance of the PLS and GPR models.

  • Using NIR spectroscopy to measure extract values seems feasible and convenient.


MATLAB scripts

HW6_2_MLR_lasso

Contents

HW6_2: Multiple linear regression of beer extract data using LASSO

close all
clear
load beer_HW6

tic

2.1 Feature selection using LASSO

Use cross-validation to find the best value of lambda; Monte Carlo averaging over repeated partitions can be used to smooth out the randomness of the cross-validation.

[B,FitInfo] = lasso(beer_spectra,extract_data,'CV',3);

Figure 1: plot of MSE vs. lambda parameter value

lassoPlot(B,FitInfo,'PlotType','CV');
legend('show') % Show legend

% get fit coefficients and lambda within 1 MSE of best-fit lambda
idxLambda1SE = FitInfo.Index1SE;
b = B(:,idxLambda1SE);
b0 = FitInfo.Intercept(idxLambda1SE);
lambda = FitInfo.Lambda1SE;
numcoeffs = numel(b(abs(b)>0))
slm_idx = find(b);

% get fit coefficients and best-fit value of lambda
% idxLambdaMinMSE = FitInfo.IndexMinMSE;
% b = B(:,idxLambdaMinMSE);
% b0 = FitInfo.Intercept(idxLambdaMinMSE);
% lambda = FitInfo.LambdaMinMSE;
% numel(b(b>0))
numcoeffs =

    11

2.2 Figure 2: plot of best-fit regression coefficients

figure
plot(wavelength,b)
ylabel("regression coefficient")
hold on
yyaxis right
plot(wavelength,mean(beer_spectra));
xlabel("variable")
ylabel('intensity (arb. units)')

2.3 Figure 3: plot of predicted extract values vs. measured extract values

figure
yhat = beer_spectra*b + b0;
scatter(extract_data,yhat)
lsline
xlabel("measured extract (degP)")
ylabel("predicted extract (degP)")
RMSEC = sqrt(sumsqr(yhat-extract_data)/(numel(extract_data)-numcoeffs));
% RMSEC = sqrt(immse(extract_data,yhat));

2.4 RMSE for cross validation

% establish partitioning of data to use for 5-fold CV
k=5;
CV = cvpartition(length(extract_data),'KFold',k);
yhat = zeros(size(extract_data));
RMSECV = zeros(1,k);   % preallocate per-fold RMSE
for i=1:k
    [B,FitInfo] = lasso(beer_spectra(CV.training(i),:),extract_data(CV.training(i)),'CV',3);
    b_k = B(:,FitInfo.Index1SE);
    yhat(CV.test(i)) = beer_spectra(CV.test(i),:)*b_k+FitInfo.Intercept(FitInfo.Index1SE);
    RMSECV(i) = sqrt(sumsqr(yhat(CV.test(i))-extract_data(CV.test(i)))/numel(extract_data(CV.test(i))));
end
RMSECV = mean(RMSECV);

Figure 4: plot of CV-predicted extract values vs. measured extract values

figure
scatter(extract_data,yhat)
xlabel("measured extract (degP)")
ylabel("CV-predicted extract (degP)")
lsline
mdl = fitlm(extract_data,yhat);
text(14,6,['R^2: ',num2str(mdl.Rsquared.Ordinary)]);

toc
Elapsed time is 16.902215 seconds.

sumsqr() function

function sqr = sumsqr(x)
    % sum of squared elements (x can be a vector or a matrix)
    sqr = sum(x(:).^2);
end
HW6_3_PLS

Contents

HW6_3: PLS of beer extract data

tic
close all
clear
load beer_HW6
[n,m] = size(beer_spectra);
beer_spectra_orig = beer_spectra;
beer_spectra = beer_spectra - mean(beer_spectra);

3.1 Build PLS regression model using all variables

[~,~,~,~,~,pctvar] = plsregress(beer_spectra,extract_data);
pctvar_cum = cumsum(pctvar(2,:));
num_LVs = max(2,1+sum(pctvar_cum < 0.99));
num_LVs = min(num_LVs,m);
[~,~,~,~,beta] = plsregress(beer_spectra,extract_data,num_LVs);

Figure 1: predicted extract values vs. measured extract values

figure
y_pred = [ones(n,1) beer_spectra]*beta;
scatter(extract_data,y_pred)
refline(1,0)
xlabel("measured extract (degP)")
ylabel("predicted extract (degP)")
RMSEC = sqrt(sumsqr(y_pred-extract_data)/(n-num_LVs));

3.2 Run PLS using 5-fold CV

k = 5;
c = cvpartition(length(extract_data),'KFold',k);

iter = 100;   % I built in the averaging over multiple runs of the CV
RMSECV = zeros(iter,k);
for j = 1:iter
    c = repartition(c);
    for i = 1:k
        train_idx = training(c,i);
        test_idx = ~train_idx;
        X = beer_spectra(train_idx,:); y = extract_data(train_idx);

        idx = 1:width(beer_spectra);   % use all wavelengths

        [~,~,~,~,~,pctvar] = plsregress(X,y);
        pctvar_cum = cumsum(pctvar(2,:));
        num_LVs = max(2,1+sum(pctvar_cum < 0.99));
        num_LVs = min(num_LVs,m);
        [~,~,~,~,beta] = plsregress(X,y,num_LVs);

        y_pred = [ones(sum(test_idx),1) beer_spectra(test_idx,idx)]*beta;
        RMSECV(j,i) = sqrt(sumsqr(y_pred-extract_data(test_idx))/numel(y_pred));
    end
end
RMSECV = mean(mean(RMSECV));

3.3 Build PLS regression model using RReliefF-selected variables

[idx,weights] = relieff(beer_spectra,extract_data,10);
idx = idx(1:round(m/10));
[~,~,~,~,~,pctvar] = plsregress(beer_spectra(:,idx),extract_data);
pctvar_cum = cumsum(pctvar(2,:));
num_LVs = max(2,1+sum(pctvar_cum < 0.99));
num_LVs = min(num_LVs,numel(idx));   % cap at the number of selected wavelengths
[~,~,~,~,beta] = plsregress(beer_spectra(:,idx),extract_data,num_LVs);

Figure 2: predicted extract values vs. measured extract values

figure
y_pred = [ones(n,1) beer_spectra(:,idx)]*beta;
scatter(extract_data,y_pred)
refline(1,0)
xlabel("measured extract (degP)")
ylabel("predicted extract (degP)")
RMSEC = sqrt(sumsqr(y_pred-extract_data)/(n-num_LVs));

Figure 3: wavelengths selected with ReliefF

figure
scatter(wavelength(idx),weights(idx),'.');
ylabel('ReliefF Weight (arb. units)','Color','b');
hold on
yyaxis right
plot(wavelength,mean(beer_spectra_orig));
xlabel('wavelength (nm)');
ylabel('Beer spectrum absorbance (arb. units)','Color','r');

3.5 Re-run PLS using 5-fold CV with variable selection

k = 5;
c = cvpartition(length(extract_data),'KFold',k);

iter = 25;   % I built in the averaging over multiple runs of the CV
RMSECV = zeros(iter,k);
for j = 1:iter
    c = repartition(c);
    for i = 1:k
        train_idx = training(c,i);
        test_idx = ~train_idx;
        X = beer_spectra(train_idx,:); y = extract_data(train_idx);

        [idx,~] = relieff(X,y,5);
        idx = idx(1:round(m/10));
        X = X(:,idx);

        [~,~,~,~,~,pctvar] = plsregress(X,y);
        pctvar_cum = cumsum(pctvar(2,:));
        num_LVs = max(2,1+sum(pctvar_cum < 0.99));
        num_LVs = min(num_LVs,numel(idx));   % cap at the number of selected wavelengths
        [~,~,~,~,beta] = plsregress(X,y,num_LVs);

        y_pred = [ones(sum(test_idx),1) beer_spectra(test_idx,idx)]*beta;
        RMSECV(j,i) = sqrt(sumsqr(y_pred-extract_data(test_idx))/numel(y_pred));
    end
end
RMSECV = mean(mean(RMSECV));