Aerosols are tiny particles supsended in the atmosphere, and they are important because they affect air quality, human health and climate change. Recently, a low-cost instrument for measuring aerosol concentrations, called a “PurpleAir” sensor, has become popular with researchers and scientific hobbyists, alike. By nature of their relatively low cost (~ $250), ease of setup and ability to run unattended for months and even years at a time, these sensors have been placed all over the United States and even the world. In fact, as of the writing of this report there are over 20,000 of these sensors active in the U.S.
Because the PurpleAir sensors are orders of magnitude less expensive than research-grade and regulatory instruments, many more of them can be placed providing:
increased spatial coverage
improved redundancy
higher time resolution
Additionally, each sensor contains two independent meaurement “channels” providing a way to check for internal consistency.
While very good sensors, they are not as accurate as the more expensive instruments and suffer from biases. Several researchers have developed corrections for the sensors to try to make them more accurate, but these “one-size-fits-all” approaches may not work well for all sensors.
Objective
Here, I used data from PurpleAir sensors and a regulatory monitor located in Athens, GA to develop a local correction to make these sensors more accurate.
Approach
The overall approach I took consisted of the following steps:
obtain data from PurpleAir sensors and regulatory monitor in Athens, GA
sync the two datasets and combine into a single timetable
clean the dataset
train a linear regression model to predict the particulate matter concentration from the PurpleAir measurements
compare performance of trained Athens model to “one-size-fits-all” correction on withheld test dataset
Results
Data
I obtained hourly-averaged data from Jan. 1, 2021 to Oct. 31, 2021 from 13 PurpleAir sensors located within 20 km of the regulatory monitor in Athens, GA. (these data are publicly available from PurpleAir and the U.S. EPA’s Air Quality System API services - details of the MATLAB scripts I wrote to access these data will be described in a future project!)
Particulate matter concentration values measured by PurpleAir sensors (orange and yellow dots) and the reference monitor (blue dot) around Athens, GA. Oct. 25, 2021.
Cleaning
Initially, I had a dataset with 92,534 hourly-averaged samples. After cleaning, in which entries with “NaN” and those for which the “A” and “B” channels disagreed by more than 30% were removed, I had 51,262 samples.
Splitting into training/test datasets
This dataset was split into a training set (Jan. 1, 2021 - Aug. 30, 2021) and a test set (Sep. 1, 2021 - Oct. 31, 2021) with 38,870 and 12,392 hourly-averaged samples, respectively.
Building linear model
The linear model was built on the training set resulting in the following correction equation:
where $PM_{corr}$ is the corrected particulate matter concentration (in $\mu g / m^3$), $PA_A$ and $PA_B$ are the PurpleAir channel A and B measurements, respectively.
Model performance
5-fold cross validation was used, and RMSECV and MAE from the cross validation were calculated. The model was then run with the test data (“Predicted”) and RMSEP and MAE values calculated, as well.
For comparison, a popular “one-size-fits-all” correction (the “LRAPA model”) was also used.
These performance metrics should be considered in relation to the average measurements of $PM_{ref}$ of 10.3 $\mu g / m^3$.
Conclusion
The linear model built specifically on Athens, GA data performed significantly better in cross validation and on the test dataset (from Athens, GA) than did the “one-size-fits-all” LRAPA model.
In fact, the Athens model performed about a whole $\mu g / m^3$ (11%) better than the LRAPA model.
While not surprising, this exercise goes to show that there can be local differences that are important to include when trying to make the most accurate predictions with the PurpleAir model.
In addition to providing a better correction for the PurpleAir sensors in the Athens, GA area, the linear model built here might also prove useful for use with sensors in other, similar environments.
MATLAB script
Script to build linear regression model for correction of PurpleAir measurements in Athens, GA
Script to build linear regression model for correction of PurpleAir measurements in Athens, GA
This script starts with a timetable called "PA_TT_all" that contains PurpleAir sensor and regulatory instrument measurements of aerosol concentration in the Athens, GA area.
Each PA sensor contains two channels, "A" and "B". Use redundancy to identify suspect samples for which "A" and "B" disagree by more than 30%
ratio = PA_TT_new.A ./ PA_TT_new.B;
flag = ratio > 1.3 | ratio < 1/1.3;
sum(flag);
PA_TT_new(flag,:) = [];
clearvars ratioflag;
Split into training and test datasets
Take cleaned data and split into training and test sets. Note that random partitioning into sets would not be a good idea because samples close in time can be correlated. Test: 1/1/21 - 8/31/21 Train: 9/1/21 - 10/31/21
Use 5-fold cross validation to assess model performance, both RMSE (root-mean square error) and MAE (mean absolute error). Also calculate RMSE and MAE for test ("predicted") data.