THIS R MARKDOWN DOCUMENT HAS BEEN PRODUCED IN THE FRAME OF THE MAS OF SPATIAL ANALYSIS FOR PUBLIC HEALTH AT JHU.

The unique purpose of this document is to permit reproducing the analysis. Code should be streamlined and commented for clarity.

A. Setup

Data sources

Selected Projection for project: NAD_1983_California_Teale_Albers (WKID: 3310 Authority: EPSG)

File name Source Overview URL
california_landcover USGS Land cover (especially developed land) URL
california_dem ESRI Digital elevation model URL
california_wildfires CalFire Fire location and surfaces burnt URL
california_coastline Natural Earth Coastline URL
acs_poverty US Census Bureau Population in poverty URL
acs_characteristics US Census Bureau Population characteristics URL
acs_population US Census Bureau Population URL
area_counties US Census Bureau County polygons URL
area_censusTracts US Census Bureau Census tract polygons URL
point_pm25 EPA PM2.5 monitors URL

The pre-processed files as well as the source code to run this analysis can be downloaded at: https://github.com/BRaimbault/2020-JHU-IA

Dependencies

Loading libraries:

{library(rgdal); library(geoR); library(ggplot2); library(dplyr); library(splancs); library(summarytools); library(grid);}

B. Data loading

State boundaries

Loading California state boundaries shapefile:

shp_california <- readOGR("Projected/area_state_single.shp");
plot(shp_california)
title("Data loading: Californa State boundaries")
footnote('Figure 1')

Monitors data

Loading PM2.5 monitors shapefile:

shp_monitors <- readOGR("Monitors/point_pm25_final.shp")
plot(shp_california)
points(shp_monitors,col='red')
title("Data loading: Monitors location")
footnote('Figure 2')

Convert shapefile to dataframe and to geodata:

frm_monitors <- data.frame(
  easting=shp_monitors@data$POINT_X, # Coordinates
  northing=shp_monitors@data$POINT_Y, # Coordinates 
  pm25=shp_monitors@data$meanPM, # PM2.5 concentration
  p_elev=shp_monitors@data$ELEV, # Elevation in m
  p_coast=shp_monitors@data$DISTCOAST, # Distance form the coast in m
  p_wfire=shp_monitors@data$WILDFIRE_p, # % of the county burned due to wildfires
  lc_100=as.character(shp_monitors@data$RS_LC_100), # Landcover majoritarian category in 100mx100m areas  
  lc_1000=as.character(shp_monitors@data$RS_LC_1000), # Landcover majoritarian category in 1000mx1000m areas
  c_popden=shp_monitors@data$COU_POPDEN, # From county level: Population density
  c_poppov=shp_monitors@data$COU_POPPOV, # From county level: % of population in poverty
  c_popblack=shp_monitors@data$COU_POPBLA, # From county level: % of population black or african american
  c_pophisp=shp_monitors@data$COU_POPHIS, # From county level: % of popylation of hispanic or latino origins
  t_popden=shp_monitors@data$TRA_POPDEN, # From census tract level: Population density
  t_poppov=shp_monitors@data$TRA_POPPOV, # From census tract level: % of population in poverty 
  t_popblack=shp_monitors@data$TRA_POPBLA, # From census tract level: % of population black or african american
  t_pophisp=shp_monitors@data$TRA_POPHIS # From census tract level: % of popylation of hispanic or latino origin
)
geo_monitors_pm25 <- as.geodata(frm_monitors,coords.col = 1:2, covar.col=4:16)

Check for duplicate monitor locations:

dup.coords(geo_monitors_pm25)
## NULL

Summary table:

st_css()
print(dfSummary(frm_monitors, plain.ascii = FALSE, graph.magnif = 0.85), max.distinct.values = 5, max.string.width=15, method = "render")

Data Frame Summary

frm_monitors

Dimensions: 100 x 16
Duplicates: 0
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 easting [numeric] Mean (sd) : 6273.1 (191006.7) min < med < max: -354967 < -40110.9 < 424024.6 IQR (CV) : 324360.2 (30.4) 100 distinct values 100 (100%) 0 (0%)
2 northing [numeric] Mean (sd) : -179415.7 (249237.4) min < med < max: -593579.9 < -144393.2 < 415195.2 IQR (CV) : 421266.6 (-1.4) 100 distinct values 100 (100%) 0 (0%)
3 pm25 [numeric] Mean (sd) : 8.6 (2.9) min < med < max: 3 < 8.1 < 15.6 IQR (CV) : 4.1 (0.3) 100 distinct values 100 (100%) 0 (0%)
4 p_elev [numeric] Mean (sd) : 239.8 (398.5) min < med < max: -32 < 90.5 < 2063 IQR (CV) : 211.2 (1.7) 87 distinct values 100 (100%) 0 (0%)
5 p_coast [numeric] Mean (sd) : 67109.5 (64424.4) min < med < max: 1128.4 < 45195.5 < 283641.9 IQR (CV) : 98210.9 (1) 100 distinct values 100 (100%) 0 (0%)
6 p_wfire [numeric] Mean (sd) : 0.5 (1.1) min < med < max: 0 < 0.1 < 5.7 IQR (CV) : 0.4 (2.2) 38 distinct values 100 (100%) 0 (0%)
7 lc_100 [character] 1. 0 2. 1 3. 2
20(20.0%)
29(29.0%)
51(51.0%)
100 (100%) 0 (0%)
8 lc_1000 [character] 1. 0 2. 1 3. 2
20(20.0%)
33(33.0%)
47(47.0%)
100 (100%) 0 (0%)
9 c_popden [numeric] Mean (sd) : 257.3 (334.2) min < med < max: 0.7 < 95.2 < 1415.7 IQR (CV) : 387.6 (1.3) 44 distinct values 100 (100%) 0 (0%)
10 c_poppov [numeric] Mean (sd) : 16.5 (5) min < med < max: 7.7 < 16.5 < 28.3 IQR (CV) : 6.7 (0.3) 39 distinct values 100 (100%) 0 (0%)
11 c_popblack [numeric] Mean (sd) : 4.5 (3.3) min < med < max: 0.1 < 2.7 < 14.1 IQR (CV) : 5.5 (0.7) 30 distinct values 100 (100%) 0 (0%)
12 c_pophisp [numeric] Mean (sd) : 37.5 (17.2) min < med < max: 8.4 < 41.2 < 82.7 IQR (CV) : 29 (0.5) 42 distinct values 100 (100%) 0 (0%)
13 t_popden [numeric] Mean (sd) : 1778.7 (1807.6) min < med < max: 0.2 < 1344.4 < 8226.6 IQR (CV) : 2564.4 (1) 99 distinct values 100 (100%) 0 (0%)
14 t_poppov [numeric] Mean (sd) : 20 (12.3) min < med < max: 1.9 < 17.6 < 56.4 IQR (CV) : 19 (0.6) 85 distinct values 100 (100%) 0 (0%)
15 t_popblack [numeric] Mean (sd) : 5.1 (8.3) min < med < max: 0 < 1.4 < 57.6 IQR (CV) : 5.7 (1.6) 58 distinct values 100 (100%) 0 (0%)
16 t_pophisp [numeric] Mean (sd) : 41.6 (25.5) min < med < max: 5.4 < 39.6 < 94.7 IQR (CV) : 41.1 (0.6) 93 distinct values 100 (100%) 0 (0%)

Generated by summarytools 0.9.6 (R version 4.0.0)
2020-05-15

Predictions data

Loading prediction grid shapefile:

shp_predictions <- readOGR("Point_Grid/point_grid_final.shp")
plot(shp_california)
points(shp_predictions,pch='.',col='blue')
title("Data loading: Predictors location")
footnote('Figure 4')

Convert shapefile to dataframe:

frm_predictions <- data.frame(
  easting=shp_predictions@data$POINT_X, # Coordinates
  northing=shp_predictions@data$POINT_Y, # Coordinates 
  p_elev=as.numeric(as.character(shp_predictions@data$ELEV)), # Elevation in m
  p_coast=shp_predictions@data$DISTCOAST, # Distance form the coast in m
  lc_1000=as.character(shp_predictions@data$RS_LC_1000), # Landcover majoritarian category in 1000mx1000m areas
  c_popden=shp_predictions@data$COU_POPDEN, # From county level: Population density
  c_poppov=shp_predictions@data$COU_POPPOV, # From county level: % of population in poverty
  c_popblack=shp_predictions@data$COU_POPBLA, # From county level: % of population black or african american
  c_pophisp=shp_predictions@data$COU_POPHIS # From county level: % of popylation of hispanic or latino origin
)

Filter locations where elevation is not available:

frm_predictions_filtered <- frm_predictions[frm_predictions$p_elev != '-9999',]

Convert dataframe to geodata:

geo_predictions <- as.geodata(frm_predictions_filtered,coords.col = 1:2, covar.col=3:9)

Check for duplicate prediction locations:

dup.coords(geo_predictions)
## NULL

Summary table on a sample of 9,999 values:

st_css()
frm_predictions_sample <- frm_predictions_filtered[sample(nrow(frm_predictions_filtered), 999), ]
print(dfSummary(frm_predictions_sample, plain.ascii = FALSE, graph.magnif = 0.85), max.distinct.values = 5, max.string.width=15, method = "render")

Data Frame Summary

frm_predictions_sample

Dimensions: 999 x 9
Duplicates: 0
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 easting [numeric] Mean (sd) : 46460.5 (220837.6) min < med < max: -360101.1 < 4898.9 < 524898.9 IQR (CV) : 357500 (4.8) 178 distinct values 999 (100%) 0 (0%)
2 northing [numeric] Mean (sd) : -92429.2 (278658.1) min < med < max: -595327 < -115327 < 444673 IQR (CV) : 452500 (-3) 210 distinct values 999 (100%) 0 (0%)
3 p_elev [numeric] Mean (sd) : 856 (775) min < med < max: -71 < 679 < 3905 IQR (CV) : 1038.5 (0.9) 770 distinct values 999 (100%) 0 (0%)
4 p_coast [numeric] Mean (sd) : 140226.4 (90193.4) min < med < max: 148.4 < 130574.7 < 359574.3 IQR (CV) : 145888.9 (0.6) 999 distinct values 999 (100%) 0 (0%)
5 lc_1000 [character] 1. 0 2. 1 3. 2
942(94.3%)
40(4.0%)
17(1.7%)
999 (100%) 0 (0%)
6 c_popden [numeric] Mean (sd) : 85.1 (174.5) min < med < max: 0.6 < 40.5 < 1415.7 IQR (CV) : 53.7 (2) 58 distinct values 999 (100%) 0 (0%)
7 c_poppov [numeric] Mean (sd) : 17.7 (5.1) min < med < max: 7.7 < 17.8 < 28.3 IQR (CV) : 7.5 (0.3) 50 distinct values 999 (100%) 0 (0%)
8 c_popblack [numeric] Mean (sd) : 3.7 (3) min < med < max: 0.1 < 2.5 < 14.1 IQR (CV) : 4.8 (0.8) 37 distinct values 999 (100%) 0 (0%)
9 c_pophisp [numeric] Mean (sd) : 35.2 (19.2) min < med < max: 7.2 < 33.1 < 82.7 IQR (CV) : 33.3 (0.5) 56 distinct values 999 (100%) 0 (0%)

Generated by summarytools 0.9.6 (R version 4.0.0)
2020-05-15

C. Exploratory analysis

Large-scale variations

Ploting monitors values:

area_california <- shp_california@polygons[[1]]@Polygons[[1]]@coords
colors <- c(rgb(252,146,114,maxColorValue=255),rgb(251,106,74,maxColorValue=255),rgb(222,45,38,maxColorValue=255),rgb(165,15,21,maxColorValue=255))
par(oma=c(0,0,2,0))
plot(geo_monitors_pm25, borders=area_california, lowess = TRUE,qt.col = colors)
mtext("PM2.5 Monitors", outer=TRUE,  cex=1.2, line=1)
footnote('Figure 6')

Short-scale variations

Estimate the semivariogram form the PM2.5 monitors data and then adjust a model:

# Estimate semivariogram from data
var_monitors_pm25.dot <- variog(geo_monitors_pm25)
names(var_monitors_pm25.dot)
max <- var_monitors_pm25.dot$max.dist
var_monitors_pm25.dot<-variog(geo_monitors_pm25,max.dist=max/2) #restricting to half the actual maximum distance

plot(var_monitors_pm25.dot,xlab="Distance (meters)",ylab="Semivariogram",pty="m", pch=16, ylim=c(0,11))
title("Semivariogram of PM2.5 Data\n(restricted < 1/2 max distance)")
footnote('Figure 7')

D. Modelling

Ordinary Kriging (OK)

Ordinary Kriging - Semivariogram adjustments (WLS & MLE)

Adjust a model to the semivariogram form the PM2.5 monitors data:

plot(var_monitors_pm25.dot,xlab="Distance (meters)",ylab="Semivariogram",pty="m", pch=16, ylim=c(0,11))
title("Semivariogram of PM2.5 Data\n(restricted < 1/2 max distance)")
footnote('Figure 8')

# Adjust semivariogram model
model<-"exponential" #"exponential" #"spherical" #"gaussian"
rangs <- c() #c(0.7e+05,0.8e+05,0.9e+05) #c(1.0e+05,1.1e+05,1.2e+05,1.3e+05,1.4e+05,1.5e+05)
sills <- c() #c(5,10,15) #c(9.0,9.5,10,10.5,11)
nugs <- c() #c(0,1,2) #c(0,0.2,0.4,0.6,0.8,1)
df.MLE<-data.frame(9999,9999,9999,9999,9999)
names(df.MLE)<-c("rang","sill","nug","AIC","BIC")
for (rang in rangs) { for (sill in sills) { for (nug in nugs) {
  psill<-sill-nug
  var_monitors_pm25.fit.MLE<-likfit(geo_monitors_pm25,ini.cov.pars=c(psill,rang),nugget=nug)
  temp <- data.frame(rang,sill,nug,var_monitors_pm25.fit.MLE$AIC,var_monitors_pm25.fit.MLE$BIC)
  names(temp)<-c("rang","sill","nug","AIC","BIC")
  df.MLE <- rbind(df.MLE, temp)
}}}

model<-"exponential" #"exponential" #"spherical" #"gaussian"
rangs <- c() #c(0.7e+05,0.8e+05,0.9e+05) #c(1.0e+05,1.1e+05,1.2e+05,1.3e+05,1.4e+05,1.5e+05)
sills <- c() #c(5,10,15) #c(9.0,9.5,10,10.5,11)
nugs <- c() #c(0,1,2) #c(0,0.2,0.4,0.6,0.8,1)
df.WLS<-data.frame(9999,9999,9999,9999)
names(df.WLS)<-c("rang","sill","nug","value")
for (rang in rangs) { for (sill in sills) { for (nug in nugs) {
  psill<-sill-nug
  var_monitors_pm25.fit.WLS<-variofit(var_monitors_pm25.dot,ini.cov.pars=c(psill,rang),cov.model=model,nugget=nug)
  temp <- data.frame(rang,sill,nug,var_monitors_pm25.fit.WLS$value)
  names(temp)<-c("rang","sill","nug","value")
  df.WLS <- rbind(df.WLS, temp)
}}}

model<-"exponential" #"exponential" #"spherical" #"gaussian"
rang.MLE <- 0.8e+05 #c(1.0e+05,1.1e+05,1.2e+05,1.3e+05,1.4e+05,1.5e+05)
sill.MLE <- 10 #c(9.0,9.5,10,10.5,11)
nug.MLE <- 0 #c(0,0.2,0.4,0.6,0.8,1)
psill.MLE<-sill.MLE-nug.MLE
var_monitors_pm25.fit.MLE<-likfit(geo_monitors_pm25,ini.cov.pars=c(psill.MLE,rang.MLE),nugget=nug.MLE)
rang.WLS <- 0.8e+05 #c(1.0e+05,1.1e+05,1.2e+05,1.3e+05,1.4e+05,1.5e+05)
sill.WLS <- 10 #c(9.0,9.5,10,10.5,11)
nug.WLS <- 0 #c(0,0.2,0.4,0.6,0.8,1)
psill.WLS<-sill.WLS-nug.WLS
var_monitors_pm25.fit.WLS<-variofit(var_monitors_pm25.dot,ini.cov.pars=c(psill.WLS,rang.WLS),cov.model=model,nugget=nug.WLS)

lines(var_monitors_pm25.fit.MLE,col="blue",lty=2)
lines(var_monitors_pm25.fit.WLS,col="red",lty=4)

legend(4.5e5,3,legend=c("MLE","WLS"),lty=2:4, col=c("blue","red"))

We tested a number of starting parameters to fit the variogram model and ended up with the value:

  • model: “exponential”
  • start values: rang: 0.8e+05 - sill: 10.0 - nug: 0.0
  • values adjusted by MLE: rang: 0.8000e+05 - sill: 8.5138 - nug: 0.3328
  • values adjusted by WLS: rang: 0.7861e+05 - sill: 9.3534 - nug: 0.0000

The MLE adjustement is more heavily affected by the values at larger distance (>500km) than the WLS adjustment.

Universal Kriging (UK)

Universal Kriging - Linear Models

Covariates selection:

Models generation:

linear_model_lc1000 <-lm(pm25~lc_1000,data=frm_monitors)
linear_model_lc100 <-lm(pm25~lc_100,data=frm_monitors)
linear_model_lvlcounty <-lm(pm25~c_popden+c_poppov+c_popblack+c_pophisp,data=frm_monitors)
linear_model_lvltract <-lm(pm25~t_popden+t_poppov+t_popblack+t_pophisp,data=frm_monitors)
linear_model_p_wfire <-lm(pm25~p_wfire,data=frm_monitors)
linear_model_p_coast <-lm(pm25~p_coast,data=frm_monitors)

Landcover resolution: 1,000m2 vs 100m2

{library(jtools); library(huxtable);}
model_names <- c("Model LC 1,000m2", "Model LC 100m2")
stats <- c(N = "nobs", R2 = "r.squared", AIC = "AIC", BIC = "BIC")
export_summs(linear_model_lc1000,linear_model_lc100, error_format = "[{conf.low}, {conf.high}]", model.names = model_names, statistics = stats)
Model LC 1,000m2 Model LC 100m2
(Intercept) 7.26 *** 7.38 ***
[5.99, 8.52]    [6.10, 8.66]   
lc_10001 1.95 *          
[0.35, 3.56]           
lc_10002 1.45            
[-0.06, 2.96]           
lc_1001         1.42    
        [-0.24, 3.08]   
lc_1002         1.56 *  
        [0.05, 3.06]   
N 100        100       
R2 0.06     0.04    
AIC 498.55     500.09    
BIC 508.97     510.51    
*** p < 0.001; ** p < 0.01; * p < 0.05.

Administrative level for population data (density, poverty, black, hispanic): County level vs Census Tract level

{library(jtools); library(huxtable);}
model_names <- c("Model lvl County", "Model lvl CTract")
export_summs(linear_model_lvlcounty,linear_model_lvltract, error_format = "[{conf.low}, {conf.high}]", model.names = model_names, statistics = stats)
Model lvl County Model lvl CTract
(Intercept) 3.15 *** 5.85 ***
[1.32, 4.98]    [4.65, 7.04]   
c_popden 0.00            
[-0.00, 0.00]           
c_poppov 0.13 *          
[0.02, 0.25]           
c_popblack 0.16            
[-0.01, 0.33]           
c_pophisp 0.06 ***        
[0.03, 0.10]           
t_popden         -0.00    
        [-0.00, 0.00]   
t_poppov         0.04    
        [-0.01, 0.08]   
t_popblack         0.06    
        [-0.01, 0.14]   
t_pophisp         0.04 ***
        [0.02, 0.07]   
N 100        100       
R2 0.36     0.24    
AIC 463.99     481.38    
BIC 479.62     497.01    
*** p < 0.001; ** p < 0.01; * p < 0.05.

Wildfire and Coast Distance covariates

{library(jtools); library(huxtable);}
model_names <- c("Wildfire", "Coast Distance")
export_summs(linear_model_p_wfire,linear_model_p_coast, error_format = "[{conf.low}, {conf.high}]", model.names = model_names, statistics = stats)
Wildfire Coast Distance
(Intercept) 8.77 *** 8.03 ***
[8.14, 9.40]    [7.20, 8.85]   
p_wfire -0.38            
[-0.92, 0.16]           
p_coast         0.00    
        [-0.00, 0.00]   
N 100        100       
R2 0.02     0.03