Using Machine Learning to estimate potential alcohol content
In this project, I built a set of Machine Learning models to predict the concentration of dissolved solids in beer wort from NIR (near-infrared) spectra.
When brewing beer, the concentration of dissolved solids in the wort, the initial malty liquid that eventually becomes beer, is an important quantity to measure. That’s because its value is correlated with the eventual alcohol content of the final product. Traditionally, the concentration of dissolved solids is estimated by measuring the specific gravity, or density relative to water, of the wort.
However, making such a measurement requires extracting a sample of the wort and using a hydrometer. The hydrometer is weighted so that it sinks by an amount that depends on the density - the denser the wort, the less it sinks. This process must be repeated every time the specific gravity is measured as the beer progresses through the fermentation process.
NIR (near-infrared) spectroscopy provides a convenient alternative since it can be performed non-destructively and in an online process. The trick is, of course, figuring out how to relate the measured spectra to the specific gravity, or more precisely, the concentration of dissolved solids.
The overall approach I took consisted of the following steps:
explore the data by plotting NIR spectra
build LASSO model
build PLS model
build GPR model
Results
Data
The dataset consists of 40 samples, each with an NIR spectrum and a dissolved solids concentration (ranging from 4.2 to 18.8 degP).
NIR spectra of 40 beer wort samples from Nørgaard et al., Appl Spectrosc 54, 413–419 (2000).
Notice that there is not much difference between the spectra, despite the fact that the dissolved solids concentration varies by more than a factor of four. This is a perfect example of how multivariate regression can be used to identify the key features (i.e. wavelengths) that are correlated with the quantity of interest.
We are going to use three different regression models:
LASSO regularization
PLS (partial least squares)
GPR (Gaussian process regression)
LASSO regularization
LASSO regularization adds an additional constraint to the optimization problem, thereby providing a unique solution where otherwise there wouldn’t be one. Practically, this results in a form of feature selection in which the regression coefficients for many of the variables are exactly zero.
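In MATLAB, the fit can be done with the lasso() function from the Statistics and Machine Learning Toolbox. Here is a minimal sketch, assuming the spectra live in a 40-row matrix beer_spectra, the extract values in a vector extract_data, and the wavelength axis in wavelength (the same variable names used in the code at the end of the post):
% Fit LASSO over a range of lambda values, with cross-validation to pick
% the regularization strength.
[B,FitInfo] = lasso(beer_spectra,extract_data,'CV',5);
% Coefficients at the largest lambda within one standard error of the
% minimum-MSE lambda (a common, conservative choice).
b = B(:,FitInfo.Index1SE);
b0 = FitInfo.Intercept(FitInfo.Index1SE);
% Most coefficients are exactly zero; the non-zero ones are the selected wavelengths.
selected_wavelengths = wavelength(b ~= 0);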
The variables selected using the beer dataset are shown below:
Average NIR spectrum of beer wort samples (red) and regression coefficients (blue).
Note that only seven wavelengths have non-zero coefficients: 1184, 1320, 1326, 2126, 2128, 2136 and 2246 nm, and of these only three dominate.
Using 5-fold cross validation, we can see how well we can expect the LASSO model to perform on new data:
Cross-validated predicted values of the “extract”.
The extract RMSECV for this model is 0.38 degP, which is pretty good considering that the average extract value is 11.1 degP.
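A quick note on the metric: RMSECV is the root-mean-square error of the cross-validated predictions, i.e. each sample is predicted by a model that never saw it during training. Pooled over all held-out samples it is simply
RMSECV = sqrt(mean((yhat - extract_data).^2)); % yhat = cross-validated predictions
(the code at the end of the post computes a per-fold RMSE and averages across folds, which gives a very similar number when the folds are the same size).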
PLS
Partial least squares regression builds a set of latent variables that are particularly good for predicting the response (i.e. the “extract”) because it uses both sets of data (i.e. the NIR spectra and the “extract” values) when constructing them. You can think of it as related to PCA, but whereas PCA only tries to explain the variance of the NIR spectra, PLS tries to explain the variance of both the NIR spectra and the extract values (it’s a little more involved than that, but that’s the basic idea).
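MATLAB’s plsregress() function builds these latent variables and reports the fraction of variance each one explains in the spectra (row 1 of pctvar) and in the extract values (row 2). A minimal sketch, using the same assumed variable names as above and keeping enough latent variables to explain ~99% of the extract variance:
% First pass: get the fraction of variance explained per latent variable.
[~,~,~,~,~,pctvar] = plsregress(beer_spectra,extract_data);
num_LVs = max(2, 1 + sum(cumsum(pctvar(2,:)) < 0.99));
% Refit with that many latent variables; beta includes the intercept as its first element.
[~,~,~,~,beta] = plsregress(beer_spectra,extract_data,num_LVs);
y_fit = [ones(size(beer_spectra,1),1) beer_spectra]*beta; % in-sample fit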
Sounds good, right? Let’s use all of the data to come up with a great regression model. So, how does PLS do?
Well … the RMSECV of the PLS model is an underwhelming 0.94 degP. This is a lot worse than what we got with LASSO regression (0.38 degP) and is clearly not acceptable. What happened? Well, the PLS model overfit the training data.
To address this issue, I used a variable ranking algorithm called RReliefF (see MathWorks documentation on the relieff() function) to select variables. Specifically, I kept just the top 10% of wavelengths (93 of them); a short sketch of the selection follows the figure below.
Average NIR spectrum of beer wort samples (red) and RReliefF weights for the top 10% of wavelengths (blue).
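The ranking itself is a one-liner. A sketch of selecting the top 10% of wavelengths on the full dataset, again using the assumed variable names from above:
% Rank all wavelengths by their RReliefF weight, using 5 nearest neighbors.
[idx,weights] = relieff(beer_spectra,extract_data,5);
% Keep the top 10% of wavelengths (93 of them here).
m = width(beer_spectra);
top_idx = idx(1:round(m/10));
selected_spectra = beer_spectra(:,top_idx);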
But, in order to avoid so-called data leakage, I performed this variable selection inside each of the five folds of the 5-fold cross validation. Thus, each of the five PLS models built during cross validation could have been built on a different set of 93 wavelengths.
The RMSECV for this approach was a much more reasonable 0.24 degP.
GPR
I also built a GPR (Gaussian process regression) model, which is basically a super cool interpolation method (a little too complicated to explain here). I ran into a similar problem with GPR as I did with the PLS regression - overfitting when using all of the wavelengths in the spectra. Only this time it was a lot worse! The RMSECV was 2.3 degP! I may as well just use the average extract value as my guess for each sample … literally!
However, if I use RReliefF to select variables, I get a much better RMSECV of 0.23 degP.
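For reference, here is a minimal sketch of the GPR fit with MATLAB’s fitrgp(), first on all wavelengths and then on the RReliefF-selected subset (reusing top_idx from the sketch above). Note that reusing a selection made on the full dataset re-introduces a little of the leakage discussed earlier; the strict version repeats the relieff() step inside each fold.
% GPR on all wavelengths: cross-validated RMSE (this is the badly overfit case).
rmse_all = sqrt(kfoldLoss(crossval(fitrgp(beer_spectra,extract_data),'KFold',5)));
% GPR on the top-10% RReliefF wavelengths.
rmse_sel = sqrt(kfoldLoss(crossval(fitrgp(beer_spectra(:,top_idx),extract_data),'KFold',5)));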
Conclusion
Multivariate regression models were built that can predict the extract (i.e. dissolved solids concentration) very accurately from NIR spectra.
All three models worked well with only a subset of the variables (i.e. wavelengths).
Using all of the variables led to gross overfitting with the PLS and GPR models.
Variable selection using RReliefF improved performance of the PLS and GPR models.
Using NIR spectroscopy to measure extract values seems feasible and convenient.
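The MATLAB code behind the figures and numbers above follows. The first snippet plots the LASSO cross-validation results; it assumes B and FitInfo were already produced by a cross-validated lasso() call along the lines of the sketch shown earlier, e.g. [B,FitInfo] = lasso(beer_spectra,extract_data,'CV',5).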
lassoPlot(B,FitInfo,'PlotType','CV');
legend('show') % Show legend
% get fit coefficients and lambda within 1 MSE of best-fit lambda
idxLambda1SE = FitInfo.Index1SE;
b = B(:,idxLambda1SE);
b0 = FitInfo.Intercept(idxLambda1SE);
lambda = FitInfo.Lambda1SE;
numcoeffs = numel(b(abs(b)>0))
slm_idx = find(b);
% get fit coefficients and best-fit value of lambda
% idxLambdaMinMSE = FitInfo.IndexMinMSE;
% b = B(:,idxLambdaMinMSE);
% b0 = FitInfo.Intercept(idxLambdaMinMSE);
% lambda = FitInfo.LambdaMinMSE;
% numel(b(b>0))
numcoeffs =
11
2.2 Figure 2: plot of best-fit regression coefficients
figure
plot(wavelength,b)
ylabel("regression coefficient")
hold on
yyaxis right
plot(wavelength,mean(beer_spectra));
xlabel("variable")
ylabel('intensity (arb. units)')
2.3 Figure 3: plot of predicted extract values vs. measured extract values
% establish partitioning of data to use for 5-fold CV
k=5;
CV = cvpartition(length(extract_data),'KFold',k);
yhat = zeros(size(extract_data));
for i=1:k
[B,FitInfo] = lasso(beer_spectra(CV.training(i),:),extract_data(CV.training(i)),'CV',3);
b_k = B(:,FitInfo.Index1SE);
yhat(CV.test(i)) = beer_spectra(CV.test(i),:)*b_k+FitInfo.Intercept(FitInfo.Index1SE);
RMSECV(i) = sqrt(sumsqr(yhat(CV.test(i))-extract_data(CV.test(i)))/numel(extract_data(CV.test(i))));
end
RMSECV = mean(RMSECV);
Figure 4: plot of CV-predicted extract values vs. measured extract values
k = 5;
c = cvpartition(length(extract_data),'KFold',k);
m = width(beer_spectra); % total number of wavelength variables
iter = 100; % I built in the averaging over multiple runs of the CV
RMSECV = zeros(iter,k);
for j = 1:iter
c = repartition(c);
for i = 1:k
train_idx = training(c,i);
test_idx = ~train_idx;
X = beer_spectra(train_idx,:); y = extract_data(train_idx);
idx = 1:width(beer_spectra); % use all wavelengths
[~,~,~,~,~,pctvar] = plsregress(X,y); % pctvar(2,:) = fraction of extract variance per latent variable
pctvar_cum = cumsum(pctvar(2,:));
num_LVs = max(2,1+sum(pctvar_cum < 0.99)); % enough latent variables to explain ~99% of the extract variance
num_LVs = min(num_LVs,m);
[~,~,~,~,beta] = plsregress(X,y,num_LVs);
y_pred = [ones(sum(test_idx),1) beer_spectra(test_idx,idx)]*beta;
RMSECV(j,i) = sqrt(sumsqr(y_pred-extract_data(test_idx))/numel(y_pred));
end
end
RMSECV = mean(mean(RMSECV));
3.3 Build PLS regression model using all variables
figure
scatter(wavelength(idx),weights(idx),'.');
ylabel('ReliefF Weight (arb. units)','Color','b');
hold on
yyaxis right
plot(wavelength,mean(beer_spectra_orig));
xlabel('wavelength (nm)');
ylabel('Beer spectrum absorbance (arb. units)','Color','r');
3.5 Re-run PLS using 5-fold CV with variable selection
k = 5;
c = cvpartition(length(extract_data),'KFold',k);
m = width(beer_spectra); % total number of wavelength variables
iter = 25; % I built in the averaging over multiple runs of the CV
RMSECV = zeros(iter,k);
for j = 1:iter
c = repartition(c);
for i = 1:k
train_idx = training(c,i);
test_idx = ~train_idx;
X = beer_spectra(train_idx,:); y = extract_data(train_idx);
[idx,~] = relieff(X,y,5); % rank wavelengths by RReliefF weight using the training fold only (avoids leakage)
idx = idx(1:round(m/10)); % keep the top 10% of wavelengths
X = X(:,idx);
[~,~,~,~,~,pctvar] = plsregress(X,y);
pctvar_cum = cumsum(pctvar(2,:));
num_LVs = max(2,1+sum(pctvar_cum < 0.99));
num_LVs = min(num_LVs,m);
[~,~,~,~,beta] = plsregress(X,y,num_LVs);
y_pred = [ones(sum(test_idx),1) beer_spectra(test_idx,idx)]*beta;
RMSECV(j,i) = sqrt(sumsqr(y_pred-extract_data(test_idx))/numel(y_pred));
end
end
RMSECV = mean(mean(RMSECV));