In this project, I set out to demonstrate the utility of visualizing low-cost PurpleAir sensor data across the Southeast U.S. using Gaussian process regression for interpolation.

waitbar


Outline


Motivation

Because of their low cost (~ $250), PurpleAir aerosol sensors have gained in popularity with researchers and science hobbyists, alike. There are now over 20,000 of these sensors deployed all across the U.S. The resulting high density spatial coverage and fast time resolution (2 minutes) means that there is a wealth of information provided by this publicly-accessible dataset.

With so many sensors, however, it can be hard to synthesize all of the measurements and arrive at meaningful conclusions. This is especially true because the spatial coverage, while dense, is not homogeneous.

Objective

Here, I used Gaussian process regression to demonstrate how interpolation of PurpleAir sensor data can be used to provide a spatially complete overview of aerosol concentration in the Southeast U.S.

Approach

The overall approach I took consisted of the following steps:

  1. obtain hourly-averaged data from all PurpleAir sensors within 500 km of Athens, GA
  2. randomly pick one day/hour to use
  3. clean data by removing outliers
  4. fit GPR model to sensor data
  5. use GPR model to predict aerosol concentrations for grid of points covering Southeast U.S.
  6. plot 2-D and 3-D contour maps of predicted values

Results

Data

I obtained hourly-averaged data from June 2021 for 258 PurpleAir sensors located within 500 km of Athens, GA. (these data are publicly available from PurpleAir API service - details of the MATLAB scripts I wrote to access these data will be described in a future project!) I randomly picked one date / hour to use (June 8, 2021 - 8:00 am ).

Correct measurements

PurpleAir sensors are well-known to over-estimate aerosol concentrations. So, I applied a correction that I developed based on PurpleAir sensors around the country (but not those in Athens, GA). The development of this correction will be detailed in a future project!

Clean data

I didn’t want a handful of outliers to throw off the interpolation, so I used the MATLAB isoutlier() function to remove those samples considered outliers (more specifically, those that are more than three scaled median absolute deviations away from the median).

Convert locations to distances

In order to use GPR to carry out the interpolation, I needed to convert the latitude / longitude locations of each PurpleAir sensor to distances in meters. The distances were calculated relative to the centroid of the collection of 258 sensors.

Fit GPR model to sensor data

A GPR model was fit to the measurements at all 258 PurpleAir locations. A constant noise standard deviation of $2 \cdot \sigma$ was specified, where $\sigma$ is the standard deviation of the PurpleAir measurements (1.35 $\mu g / m^3$ in this example). Specifying the noise standard deviation was important as it prevented the GPR model from overfitting to the data; practically, this kept the interpolation smooth without “spikes” at each of the PurpleAir sites.

Predict aerosol concentrations over mesh of points

The GPR model was then used to predict aerosol concentration over a grid of 40,000 equally spaced locations (200 x 200) around the Southeast U.S.

Contour plot of predict values

A contour plot of the aerosol concentration values predicted by the GPR model was created.

waitbar

Interpolated aerosol concentration contours in units of $\mu g / m^3$. Dots represent indiviudal PurpleAir sensor locations.

This plot makes it easy to see the overall trends apparent thoughout the Southeast U.S. - there are higher aerosol concentrations on the western side, decreasing in a roughly southeastern direction.

Surface plot of predicted values

This interpolated “field” of aerosol concentrations can be visualized more readily as a 3-D surface plot:

waitbar

Interpolated aerosol concentration surface in units of $\mu g / m^3$. Dots represent PurpleAir measurements from indiviudal sensors.

Here, the dots show the PurpleAir measurements that the GPR interpolation model was based on. This representation highlights the need for and the utility of the GPR model - the PurpleAir measurements appear very scattered when viewed by themselves making it difficult to obtain a comprehensive picture of how aerosol concentration is distributed throughout the region.

The same surface can be viewed in a slightly more informative way with the contours superimposed on a map of the Southeastern U.S.:

waitbar

Conclusion

Scattered and noisy data from many individual, independent and non-uniformly sited PurpleAir sensors can be homogenized across a region using GPR interpolation. The resulting aerosol concentration field can be visualized easily with contour and surface plots.

  • GPR provides a powerful way to interpolate aerosol concentration values across a grid of points.

  • The extensive spatial coverage of the low-cost PurpleAir sensors can be leveraged to provide a level of geographic detail that is difficult to obtain using a much smaller number of expensive regulatory monitors.

  • The high time resolution (2 minutes) of the low-cost sensors makes it possible to obtain a picture of how aerosol concentration fields evovle with time, which is not possible with the regulatory monitors.


MATLAB script

try_kriging

Contents

trial run with Athens, GA PurpleAir sensors

AQS_loc = [33.918062, -83.344397];
pts = 200;

load data from one time point

% pick date/time for data from all available date/times
date_list = unique(PA_TT.Time);
date_chosen = date_list(randi(length(date_list)));
data = PA_TT(PA_TT.Time == date_chosen,:);

% add sensor locations from PA_T to data
[~,loc] = ismember(data.PA_ID,PA_T.ID);
data.loc = PA_T.loc(loc,:);
data2 = data;
data.PM25 = data.PM25_A;
data2.PM25 = data.PM25_B;
data = [data;data2];
data.PM25 = 3.27 + data.PM25*0.426 - 1.40*exp(-data.PM25*0.227);

% remove outliers
sum(isoutlier(data.PM25))
data(isoutlier(data.PM25),:) = [];
std(data.PM25)

clear date-list data2 date_chosen
ans =

    25


ans =

    1.3464

set up mesh points

x = linspace(min(data.loc(:,1)),max(data.loc(:,1)),pts);
y = linspace(min(data.loc(:,2)),max(data.loc(:,2)),pts);
[X,Y] = meshgrid(x,y);

convert latitude and longitude to local distances (in m)

origin = mean(data.loc); origin = [origin,0];
Z = [X(:) Y(:)];
[Z(:,1),Z(:,2),~] = latlon2local(Z(:,1),Z(:,2),zeros(height(Z),1),origin);
[xEast,yNorth,~] = latlon2local(data.loc(:,1),data.loc(:,2),zeros(height(data),1),origin);

fit GPR model

Mdl = fitrgp([xEast,yNorth],data.PM25,'KernelFunction','squaredexponential',... 'OptimizeHyperparameters',{'Sigma','BasisFunction'});

kparams0 = [10000, 5];
Mdl = fitrgp([xEast,yNorth],data.PM25,'KernelFunction','squaredexponential', ...
    'KernelParameters',kparams0,'Sigma',2*std(data.PM25),'ConstantSigma',true, ...
    'BasisFunction','constant');
pred = Mdl.predict([xEast,yNorth]);

predict and plot values for mesh

close all

data.PM25_ref(1)
Mdl.predict(flip(AQS_loc))

[PM25_pred,PM25_sd,PM25_CI] = Mdl.predict(Z,'Alpha',0.32);
meshc(Y,X,reshape(PM25_pred,pts,pts))
colorbar
hold on
scatter3(data.loc(:,2),data.loc(:,1),data.PM25,40,'filled')
scatter3(AQS_loc(2),AQS_loc(1),data.PM25_ref(1),40,'filled')

figure
contour(Y,X,reshape(PM25_pred,pts,pts),'ShowText','on')
hold on
scatter(data.loc(:,2),data.loc(:,1),40,'filled')

figure
contour(Y,X,reshape(PM25_sd,pts,pts),'ShowText','on')
hold on
scatter(data.loc(:,2),data.loc(:,1),40,'filled')
ans =

    5.3000


ans =

    5.3220

plot maps with contours

% 3D contour plot with map
figure
set(gcf,'Renderer','zbuffer')
h = worldmap([min(min(X)),max(max(X))],[min(min(Y)),max(max(Y))]);
states = readgeotable("usastatehi.shp");
geoshow(states,"DisplayType","polygon")

geoshow(X,Y,reshape(PM25_pred,pts,pts),'DisplayType','surface')
axis normal
view(3)

contourm(X,Y,reshape(PM25_pred,pts,pts),10)
colorbar