Madrid Pollution

15 minute read

All the files of this project are saved in a GitHub repository.

Dataset of the Pollution Level in Madrid

This report describes an analysis of the pollution in Madrid between 2011 and 2016.

The dataset consists in:

  • 72 csv files containing hourly measures of pollutants across 24 stations,
  • 1 xlsx file containing daily weather information.

The stations are located all across the city:


Packages

This analysis requires these R packages:

  • Data Cleaning: readxl, tidyr
  • Plotting: ggplot2, corrplot, GGally, gridExtra, leaflet
  • Statistics: jtools, lattice, car, caret, MASS

These packages are installed and loaded if necessary by the main script.


Data Preparation

The pollution and weather data are first read from the input files, formatted, combined and aggregated, into the data frame pollution_daily_h which provides the averaged information per day.

The dataset contains information for 12 pollutants: NO2, SO2, O3, PM2.5, BEN, CO, EBE, NMHC, NO, PM10, TCH, TOL.

The workflow to prepare the data is as below:


Additional variables have been added:

  • month: the first day of the related month
  • week: the first day of the related week
  • temp_gap: difference between temp_min and temp_max

The data frame pollution_daily_h is structured as below:

str(pollution_daily_h)
## 'data.frame':    2192 obs. of  22 variables:
##  $ date          : Date, format: "2011-01-01" "2011-01-02" ...
##  $ NO2           : num  41.5 48.5 63.6 46.3 51.5 ...
##  $ SO2           : num  10.71 11.93 11.91 8.84 9.51 ...
##  $ O3            : num  20.47 15.56 9.45 13.34 10.88 ...
##  $ PM2.5         : num  9.36 9.08 11.94 9.4 10.51 ...
##  $ BEN           : num  0.765 0.869 1.176 0.857 0.856 ...
##  $ CO            : num  0.372 0.453 0.532 0.367 0.395 ...
##  $ EBE           : num  0.595 0.722 1.047 0.788 1.074 ...
##  $ NMHC          : num  0.187 0.208 0.246 0.196 0.196 ...
##  $ NO            : num  16.5 28.8 71.1 27.8 36 ...
##  $ PM10          : num  13.8 13.7 21.6 14.4 15.6 ...
##  $ TCH           : num  1.43 1.62 1.62 1.46 1.52 ...
##  $ TOL           : num  1.66 2.14 3.56 2.4 3.19 ...
##  $ month         : Date, format: "2011-01-01" "2011-01-01" ...
##  $ week          : Date, format: "2010-12-27" "2010-12-27" ...
##  $ temp_avg      : num  8.3 8.6 4.2 6.5 8.9 12.2 10.9 9.8 8.4 6.9 ...
##  $ temp_max      : num  13 13 9.4 8 10 15 13 13 11 8.3 ...
##  $ temp_min      : num  3 4 -1.6 4.1 6.3 8.9 8.7 7.6 6.6 5.2 ...
##  $ precipitation : num  0 0 0 0 0 ...
##  $ humidity      : num  84 81 86 93 90 87 81 88 92 91 ...
##  $ wind_avg_speed: num  5.2 5.4 3.5 6.3 10.4 15.7 15.6 14.3 6.5 7.4 ...
##  $ temp_gap      : num  10 9 11 3.9 3.7 6.1 4.3 5.4 4.4 3.1 ...
summary(pollution_daily_h)
##       date                 NO2               SO2               O3         
##  Min.   :2011-01-01   Min.   :  7.819   Min.   : 1.679   Min.   :  3.979  
##  1st Qu.:2012-07-01   1st Qu.: 26.170   1st Qu.: 3.757   1st Qu.: 30.088  
##  Median :2013-12-31   Median : 35.646   Median : 5.483   Median : 49.555  
##  Mean   :2013-12-31   Mean   : 38.828   Mean   : 5.889   Mean   : 47.866  
##  3rd Qu.:2015-07-02   3rd Qu.: 48.122   3rd Qu.: 7.388   3rd Qu.: 65.058  
##  Max.   :2016-12-31   Max.   :105.077   Max.   :17.129   Max.   :110.287  
##      PM2.5             BEN               CO              EBE        
##  Min.   : 2.625   Min.   :0.1514   Min.   :0.1496   Min.   :0.1028  
##  1st Qu.: 7.306   1st Qu.:0.4520   1st Qu.:0.2575   1st Qu.:0.3318  
##  Median :10.024   Median :0.6152   Median :0.3075   Median :0.6782  
##  Mean   :11.112   Mean   :0.7392   Mean   :0.3574   Mean   :0.6789  
##  3rd Qu.:13.646   3rd Qu.:0.8911   3rd Qu.:0.4058   3rd Qu.:0.8651  
##  Max.   :58.556   Max.   :2.8132   Max.   :1.1792   Max.   :3.2252  
##       NMHC               NO               PM10              TCH       
##  Min.   :0.04319   Min.   :  2.351   Min.   :  3.618   Min.   :1.063  
##  1st Qu.:0.15269   1st Qu.:  6.973   1st Qu.: 12.730   1st Qu.:1.327  
##  Median :0.18694   Median : 11.814   Median : 18.556   Median :1.394  
##  Mean   :0.20500   Mean   : 23.681   Mean   : 20.805   Mean   :1.421  
##  3rd Qu.:0.23620   3rd Qu.: 26.178   3rd Qu.: 25.560   3rd Qu.:1.482  
##  Max.   :0.66167   Max.   :214.569   Max.   :216.628   Max.   :2.412  
##       TOL              month                 week           
##  Min.   : 0.2667   Min.   :2011-01-01   Min.   :2010-12-27  
##  1st Qu.: 1.6969   1st Qu.:2012-07-01   1st Qu.:2012-06-30  
##  Median : 2.4893   Median :2013-12-16   Median :2013-12-30  
##  Mean   : 3.0021   Mean   :2013-12-16   Mean   :2013-12-28  
##  3rd Qu.: 3.6722   3rd Qu.:2015-07-01   3rd Qu.:2015-06-29  
##  Max.   :15.6062   Max.   :2016-12-01   Max.   :2016-12-26  
##     temp_avg        temp_max        temp_min      precipitation    
##  Min.   :-0.50   Min.   : 3.30   Min.   :-7.000   Min.   : 0.0000  
##  1st Qu.: 8.40   1st Qu.:14.00   1st Qu.: 3.000   1st Qu.: 0.0000  
##  Median :14.60   Median :21.10   Median : 8.900   Median : 0.0000  
##  Mean   :15.45   Mean   :22.04   Mean   : 8.755   Mean   : 0.9173  
##  3rd Qu.:22.70   3rd Qu.:30.10   3rd Qu.:14.600   3rd Qu.: 0.0000  
##  Max.   :32.30   Max.   :42.00   Max.   :25.800   Max.   :48.0100  
##     humidity     wind_avg_speed     temp_gap    
##  Min.   :15.00   Min.   : 2.00   Min.   : 2.10  
##  1st Qu.:36.00   1st Qu.: 6.30   1st Qu.: 9.50  
##  Median :55.00   Median : 8.70   Median :14.10  
##  Mean   :55.04   Mean   :10.15   Mean   :13.29  
##  3rd Qu.:73.00   3rd Qu.:12.60   3rd Qu.:17.00  
##  Max.   :99.00   Max.   :35.60   Max.   :24.50

The data frame doesn’t contain any NA across its 2192 observations and 22 variables.


Evolution of the Variables Over Time

The charts below describe the evolution of each variable over time.

Dark Orange = Main Pollutants | Light Orange = Other Pollutants | Blue = Weather Parameters


Seasonal cycles suggests that the weather has an influence on the level of some pollutants.


Correlation Matrix

A correlation matrix is plotted to identify correlations between the variables:


Another view provides more information:


NO2 Model

A linear regression will be used to model the level of pollution in NO2. The data is split in train and test sets with the ratio 80|20.

set.seed(2018)
train.size <- 0.8
train.index <- sample.int(length(pollution_daily_h$NO2), round(length(pollution_daily_h$NO2) * train.size))
train.sample <- pollution_daily_h[train.index,]
test.sample <- pollution_daily_h[-train.index,]

The Train set has 1,754 rows and the Test set has 438 rows.

temp_min and temp_max are removed as correlated by definition with temp_avg. But the variable temp_gap is created to measure their influence on the model.

multi_model_NO2<-lm(NO2~.-month-week-date-temp_min-temp_max, data=train.sample)
lm_stats <- summary(multi_model_NO2)
print(lm_stats)
## 
## Call:
## lm(formula = NO2 ~ . - month - week - date - temp_min - temp_max, 
##     data = train.sample)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.9368  -3.1235   0.0047   3.2157  21.3306 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.229319   3.113391   0.395 0.693003    
## SO2             1.383111   0.077489  17.849  < 2e-16 ***
## O3             -0.149220   0.012032 -12.402  < 2e-16 ***
## PM2.5           0.255679   0.067646   3.780 0.000162 ***
## BEN             3.343452   1.165812   2.868 0.004182 ** 
## CO             45.100966   3.957109  11.397  < 2e-16 ***
## EBE             2.843966   0.553288   5.140 3.06e-07 ***
## NMHC            1.358368   1.981506   0.686 0.493105    
## NO             -0.201584   0.017344 -11.623  < 2e-16 ***
## PM10           -0.008354   0.024343  -0.343 0.731516    
## TCH            11.070151   1.621469   6.827 1.19e-11 ***
## TOL             2.850827   0.214695  13.279  < 2e-16 ***
## temp_avg       -0.122812   0.038525  -3.188 0.001459 ** 
## precipitation   0.178564   0.041030   4.352 1.43e-05 ***
## humidity       -0.098214   0.015546  -6.318 3.36e-10 ***
## wind_avg_speed -0.323058   0.032912  -9.816  < 2e-16 ***
## temp_gap        0.304426   0.053019   5.742 1.10e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.096 on 1737 degrees of freedom
## Multiple R-squared:  0.9113, Adjusted R-squared:  0.9105 
## F-statistic:  1116 on 16 and 1737 DF,  p-value: < 2.2e-16


The R-square of the model is 0.9113 and the Adjusted R-squared is 0.9105, which means that the model is able to well explain NO2. Precisely, the predictors explain 91.1% of the variability in NO2.

The Mean Squared Error measures the mean of all of our errors squared. It describes the accuracy of a model. The MSE of this model is 5.0962.

Another way to evaluate a model is looking at the confidence intervals of the coefficients. The estimates for each coefficient are not exact, so the confidence intervals define a range in which the actual values are, at a certain level of confidence:

For every change of one (1) unit in the SO2 level, one can be 95% confident that the level of NO2 will change between 1.23 and 1.54.

The confidence intervals can be plotted:


The residuals of the model can be checked using these plots:

resids_multi_NO2 <- multi_model_NO2$residuals

The residuals seem correct and validate the model. Significant variables can be found using a Stepwise Regression.

## Start:  AIC=5729.65
## NO2 ~ (date + SO2 + O3 + PM2.5 + BEN + CO + EBE + NMHC + NO + 
##     PM10 + TCH + TOL + month + week + temp_avg + temp_max + temp_min + 
##     precipitation + humidity + wind_avg_speed + temp_gap) - month - 
##     week - date - temp_min - temp_max
## 
##                  Df Sum of Sq   RSS    AIC
## - PM10            1       3.1 45114 5727.8
## - NMHC            1      12.2 45124 5728.1
## <none>                        45111 5729.7
## - BEN             1     213.6 45325 5735.9
## - temp_avg        1     263.9 45375 5737.9
## - PM2.5           1     371.0 45482 5742.0
## - precipitation   1     491.9 45603 5746.7
## - EBE             1     686.2 45798 5754.1
## - temp_gap        1     856.2 45968 5760.6
## - humidity        1    1036.6 46148 5767.5
## - TCH             1    1210.5 46322 5774.1
## - wind_avg_speed  1    2502.4 47614 5822.3
## - CO              1    3373.7 48485 5854.2
## - NO              1    3508.4 48620 5859.0
## - O3              1    3994.5 49106 5876.5
## - TOL             1    4579.1 49691 5897.2
## - SO2             1    8274.0 53385 6023.0
## 
## Step:  AIC=5727.77
## NO2 ~ SO2 + O3 + PM2.5 + BEN + CO + EBE + NMHC + NO + TCH + TOL + 
##     temp_avg + precipitation + humidity + wind_avg_speed + temp_gap
## 
##                  Df Sum of Sq   RSS    AIC
## - NMHC            1      11.8 45126 5726.2
## <none>                        45114 5727.8
## + PM10            1       3.1 45111 5729.7
## - BEN             1     226.7 45341 5734.6
## - temp_avg        1     268.4 45383 5736.2
## - precipitation   1     489.2 45604 5744.7
## - EBE             1     684.2 45799 5752.2
## - temp_gap        1     860.1 45974 5758.9
## - PM2.5           1    1026.3 46141 5765.2
## - humidity        1    1067.1 46181 5766.8
## - TCH             1    1212.3 46327 5772.3
## - wind_avg_speed  1    2579.1 47693 5823.3
## - CO              1    3375.5 48490 5852.3
## - NO              1    3519.4 48634 5857.5
## - O3              1    4095.4 49210 5878.2
## - TOL             1    4582.1 49697 5895.4
## - SO2             1    8271.4 53386 6021.0
## 
## Step:  AIC=5726.23
## NO2 ~ SO2 + O3 + PM2.5 + BEN + CO + EBE + NO + TCH + TOL + temp_avg + 
##     precipitation + humidity + wind_avg_speed + temp_gap
## 
##                  Df Sum of Sq   RSS    AIC
## <none>                        45126 5726.2
## + NMHC            1      11.8 45114 5727.8
## + PM10            1       2.6 45124 5728.1
## - BEN             1     219.7 45346 5732.7
## - temp_avg        1     285.4 45412 5735.3
## - precipitation   1     493.7 45620 5743.3
## - EBE             1     674.1 45800 5750.2
## - temp_gap        1     863.8 45990 5757.5
## - PM2.5           1    1015.2 46141 5763.2
## - humidity        1    1089.1 46215 5766.1
## - TCH             1    2061.8 47188 5802.6
## - wind_avg_speed  1    2568.7 47695 5821.3
## - NO              1    3616.4 48743 5859.4
## - CO              1    3638.7 48765 5860.2
## - O3              1    4133.0 49259 5877.9
## - TOL             1    4701.8 49828 5898.1
## - SO2             1    8611.7 53738 6030.6

## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## NO2 ~ (date + SO2 + O3 + PM2.5 + BEN + CO + EBE + NMHC + NO + 
##     PM10 + TCH + TOL + month + week + temp_avg + temp_max + temp_min + 
##     precipitation + humidity + wind_avg_speed + temp_gap) - month - 
##     week - date - temp_min - temp_max
## 
## Final Model:
## NO2 ~ SO2 + O3 + PM2.5 + BEN + CO + EBE + NO + TCH + TOL + temp_avg + 
##     precipitation + humidity + wind_avg_speed + temp_gap
## 
## 
##     Step Df  Deviance Resid. Df Resid. Dev      AIC
## 1                          1737   45111.36 5729.651
## 2 - PM10  1  3.058383      1738   45114.42 5727.770
## 3 - NMHC  1 11.762300      1739   45126.18 5726.227


These results indicate that the variables PM10 and NMHC can be removed. The resulting model is:

## 
## Call:
## lm(formula = NO2 ~ . - month - week - date - temp_min - temp_max - 
##     PM10 - NMHC, data = train.sample)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.0494  -3.1185   0.0019   3.2071  21.4585 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.45196    2.92446   0.155  0.87720    
## SO2             1.36999    0.07520  18.217  < 2e-16 ***
## O3             -0.14716    0.01166 -12.620  < 2e-16 ***
## PM2.5           0.23432    0.03746   6.255 5.00e-10 ***
## BEN             3.33921    1.14768   2.910  0.00367 ** 
## CO             45.69600    3.85895  11.842  < 2e-16 ***
## EBE             2.79371    0.54813   5.097 3.83e-07 ***
## NO             -0.20239    0.01714 -11.805  < 2e-16 ***
## TCH            11.71564    1.31433   8.914  < 2e-16 ***
## TOL             2.86320    0.21271  13.461  < 2e-16 ***
## temp_avg       -0.12662    0.03818  -3.317  0.00093 ***
## precipitation   0.17861    0.04095   4.362 1.37e-05 ***
## humidity       -0.09773    0.01509  -6.478 1.20e-10 ***
## wind_avg_speed -0.32267    0.03243  -9.949  < 2e-16 ***
## temp_gap        0.30558    0.05297   5.769 9.40e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.094 on 1739 degrees of freedom
## Multiple R-squared:  0.9113, Adjusted R-squared:  0.9106 
## F-statistic:  1276 on 14 and 1739 DF,  p-value: < 2.2e-16

The R-squared value of 0.9113 is consistent with the initial model.

Multicollinearity can be treated with the VIF Method (Variance Inflation Factors). As a general rule, if the VIF value is larger than 5, the multicollinearity is assumed to be high.

For each variable of the model:

  • the VIF values are calculated,
  • the variable with the largest value is removed,
  • the model is re-run the explanatory variables having a VIF value below 5.

The current VIF values are:

##            SO2             O3          PM2.5            BEN             CO 
##       2.731960       4.737638       2.589103      15.175970      21.433198 
##            EBE             NO            TCH            TOL       temp_avg 
##       3.384605      15.882579       2.124477      11.084651       6.505867 
##  precipitation       humidity wind_avg_speed       temp_gap 
##       1.218099       6.663257       1.991433       4.170651

The VIF values resulting from the procedure are:

##            SO2             O3          PM2.5            EBE            TCH 
##       1.958788       3.351606       2.049109       1.829030       1.781684 
##       temp_avg  precipitation wind_avg_speed       temp_gap 
##       3.056575       1.154820       1.743077       2.058636


After removing multicollinear variables, the Final Model is:

## NO2 ~ SO2 + O3 + PM2.5 + EBE + TCH + temp_avg + precipitation + 
##     wind_avg_speed + temp_gap

## 
## Call:
## lm(formula = NO2 ~ SO2 + O3 + PM2.5 + EBE + TCH + temp_avg + 
##     precipitation + wind_avg_speed + temp_gap, data = train.sample)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.9354  -4.1242   0.1759   3.9862  18.8808 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.34165    2.32129  -0.578   0.5634    
## SO2             1.70825    0.07619  22.421  < 2e-16 ***
## O3             -0.23205    0.01174 -19.774  < 2e-16 ***
## PM2.5           0.56677    0.03988  14.213  < 2e-16 ***
## EBE             8.34368    0.48211  17.307  < 2e-16 ***
## TCH            17.17983    1.44013  11.929  < 2e-16 ***
## temp_avg       -0.13117    0.03131  -4.189 2.94e-05 ***
## precipitation   0.14279    0.04770   2.993   0.0028 ** 
## wind_avg_speed -0.35111    0.03630  -9.672  < 2e-16 ***
## temp_gap        0.77269    0.04452  17.355  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.095 on 1744 degrees of freedom
## Multiple R-squared:  0.8727, Adjusted R-squared:  0.872 
## F-statistic:  1328 on 9 and 1744 DF,  p-value: < 2.2e-16


A 10-Fold Cross Validation can confirm the accuracy of the models:

## [1] "Initial Model:"

## Linear Regression 
## 
## 1754 samples
##   21 predictor
## 
## Pre-processing: centered (16), scaled (16) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1578, 1579, 1579, 1579, 1578, 1579, ... 
## Resampling results:
## 
##   RMSE     Rsquared   MAE    
##   5.13777  0.9079525  3.97799
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

## [1] "Final Model:"

## Linear Regression 
## 
## 1754 samples
##    9 predictor
## 
## Pre-processing: centered (9), scaled (9) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1578, 1578, 1579, 1579, 1578, 1578, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   6.112716  0.8716267  4.866125
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE


The predictions of both models can be compared.

test.sample$NO2_predicted_model_final <- predict(multi_model_NO2_final,test.sample)
test.sample$NO2_predicted_model_0 <- predict(multi_model_NO2_0,test.sample)

Depending on the prediction point, the final model can be better or worse than the initial model. Below table displays some examples of the prediction points (predictions rows 80-90):

##          NO2 NO2_predicted_model_0 NO2_predicted_model_final
## 428 32.90278              33.47543                  38.52169
## 436 41.62434              45.47223                  48.86567
## 438 66.43728              62.02274                  58.57356
## 441 37.94618              36.13524                  35.07174
## 443 20.01215              21.83412                  21.85867
## 447 58.33043              54.81361                  52.00605
## 464 35.96528              37.69067                  40.09754
## 470 18.65799              15.89007                  17.62860
## 482 28.10417              21.46581                  22.55336
## 489 45.54340              37.47732                  37.65892
## 490 31.50521              22.42451                  23.59314


The accuracy of each prediction point can be understood by comparing their values with the actual values:

Blue = Initial Model | Orange = Final Model


The models predictions can be compared statistically using an One-Way Analysis of Variance (ANOVA) and a plot of the coefficients confidence intervals:

## Analysis of Variance Table
## 
## Model 1: NO2 ~ (date + SO2 + O3 + PM2.5 + BEN + CO + EBE + NMHC + NO + 
##     PM10 + TCH + TOL + month + week + temp_avg + temp_max + temp_min + 
##     precipitation + humidity + wind_avg_speed + temp_gap) - month - 
##     week - date - temp_min - temp_max - PM10 - NMHC
## Model 2: NO2 ~ SO2 + O3 + PM2.5 + EBE + TCH + temp_avg + precipitation + 
##     wind_avg_speed + temp_gap
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1   1739 45126                                  
## 2   1744 64788 -5    -19661 151.54 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The low p-value returned by the ANOVA indicates that the Final model is significantly better than the Initial Model. The plot gives an indication on how each variable influences the predictions, with a 95% confidence interval.

As a conclusion, the Final Model provides a good way to predict the NO2 pollution level based on 9 pollutants and 5 weather parameters.