THIS R MARKDOWN DOCUMENT HAS BEEN PRODUCED IN THE FRAME OF THE MAS OF SPATIAL ANALYSIS FOR PUBLIC HEALTH AT JHU.
The unique purpose of this document is to permit reproducing the analysis. Code should be streamlined and commented for clarity.
Selected Projection for project: NAD_1983_California_Teale_Albers (WKID: 3310 Authority: EPSG)
File name | Source | Overview | URL |
---|---|---|---|
california_landcover | USGS | Land cover (especially developed land) | URL |
california_dem | ESRI | Digital elevation model | URL |
california_wildfires | CalFire | Fire location and surfaces burnt | URL |
california_coastline | Natural Earth | Coastline | URL |
acs_poverty | US Census Bureau | Population in poverty | URL |
acs_characteristics | US Census Bureau | Population characteristics | URL |
acs_population | US Census Bureau | Population | URL |
area_counties | US Census Bureau | County polygons | URL |
area_censusTracts | US Census Bureau | Census tract polygons | URL |
point_pm25 | EPA | PM2.5 monitors | URL |
The pre-processed files as well as the source code to run this analysis can be downloaded at: https://github.com/BRaimbault/2020-JHU-IA
Loading libraries:
{library(rgdal); library(geoR); library(ggplot2); library(dplyr); library(splancs); library(summarytools); library(grid);}
Loading California state boundaries shapefile:
shp_california <- readOGR("Projected/area_state_single.shp");
plot(shp_california)
title("Data loading: Californa State boundaries")
footnote('Figure 1')
Loading PM2.5 monitors shapefile:
shp_monitors <- readOGR("Monitors/point_pm25_final.shp")
plot(shp_california)
points(shp_monitors,col='red')
title("Data loading: Monitors location")
footnote('Figure 2')
Convert shapefile to dataframe and to geodata:
frm_monitors <- data.frame(
easting=shp_monitors@data$POINT_X, # Coordinates
northing=shp_monitors@data$POINT_Y, # Coordinates
pm25=shp_monitors@data$meanPM, # PM2.5 concentration
p_elev=shp_monitors@data$ELEV, # Elevation in m
p_coast=shp_monitors@data$DISTCOAST, # Distance form the coast in m
p_wfire=shp_monitors@data$WILDFIRE_p, # % of the county burned due to wildfires
lc_100=as.character(shp_monitors@data$RS_LC_100), # Landcover majoritarian category in 100mx100m areas
lc_1000=as.character(shp_monitors@data$RS_LC_1000), # Landcover majoritarian category in 1000mx1000m areas
c_popden=shp_monitors@data$COU_POPDEN, # From county level: Population density
c_poppov=shp_monitors@data$COU_POPPOV, # From county level: % of population in poverty
c_popblack=shp_monitors@data$COU_POPBLA, # From county level: % of population black or african american
c_pophisp=shp_monitors@data$COU_POPHIS, # From county level: % of popylation of hispanic or latino origins
t_popden=shp_monitors@data$TRA_POPDEN, # From census tract level: Population density
t_poppov=shp_monitors@data$TRA_POPPOV, # From census tract level: % of population in poverty
t_popblack=shp_monitors@data$TRA_POPBLA, # From census tract level: % of population black or african american
t_pophisp=shp_monitors@data$TRA_POPHIS # From census tract level: % of popylation of hispanic or latino origin
)
geo_monitors_pm25 <- as.geodata(frm_monitors,coords.col = 1:2, covar.col=4:16)
Check for duplicate monitor locations:
dup.coords(geo_monitors_pm25)
## NULL
Summary table:
st_css()
print(dfSummary(frm_monitors, plain.ascii = FALSE, graph.magnif = 0.85), max.distinct.values = 5, max.string.width=15, method = "render")
No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | easting [numeric] | Mean (sd) : 6273.1 (191006.7) min < med < max: -354967 < -40110.9 < 424024.6 IQR (CV) : 324360.2 (30.4) | 100 distinct values | 100 (100%) | 0 (0%) | |||||||||||||
2 | northing [numeric] | Mean (sd) : -179415.7 (249237.4) min < med < max: -593579.9 < -144393.2 < 415195.2 IQR (CV) : 421266.6 (-1.4) | 100 distinct values | 100 (100%) | 0 (0%) | |||||||||||||
3 | pm25 [numeric] | Mean (sd) : 8.6 (2.9) min < med < max: 3 < 8.1 < 15.6 IQR (CV) : 4.1 (0.3) | 100 distinct values | 100 (100%) | 0 (0%) | |||||||||||||
4 | p_elev [numeric] | Mean (sd) : 239.8 (398.5) min < med < max: -32 < 90.5 < 2063 IQR (CV) : 211.2 (1.7) | 87 distinct values | 100 (100%) | 0 (0%) | |||||||||||||
5 | p_coast [numeric] | Mean (sd) : 67109.5 (64424.4) min < med < max: 1128.4 < 45195.5 < 283641.9 IQR (CV) : 98210.9 (1) | 100 distinct values | 100 (100%) | 0 (0%) | |||||||||||||
6 | p_wfire [numeric] | Mean (sd) : 0.5 (1.1) min < med < max: 0 < 0.1 < 5.7 IQR (CV) : 0.4 (2.2) | 38 distinct values | 100 (100%) | 0 (0%) | |||||||||||||
7 | lc_100 [character] | 1. 0 2. 1 3. 2 |
|
100 (100%) | 0 (0%) | |||||||||||||
8 | lc_1000 [character] | 1. 0 2. 1 3. 2 |
|
100 (100%) | 0 (0%) | |||||||||||||
9 | c_popden [numeric] | Mean (sd) : 257.3 (334.2) min < med < max: 0.7 < 95.2 < 1415.7 IQR (CV) : 387.6 (1.3) | 44 distinct values | 100 (100%) | 0 (0%) | |||||||||||||
10 | c_poppov [numeric] | Mean (sd) : 16.5 (5) min < med < max: 7.7 < 16.5 < 28.3 IQR (CV) : 6.7 (0.3) | 39 distinct values | 100 (100%) | 0 (0%) | |||||||||||||
11 | c_popblack [numeric] | Mean (sd) : 4.5 (3.3) min < med < max: 0.1 < 2.7 < 14.1 IQR (CV) : 5.5 (0.7) | 30 distinct values | 100 (100%) | 0 (0%) | |||||||||||||
12 | c_pophisp [numeric] | Mean (sd) : 37.5 (17.2) min < med < max: 8.4 < 41.2 < 82.7 IQR (CV) : 29 (0.5) | 42 distinct values | 100 (100%) | 0 (0%) | |||||||||||||
13 | t_popden [numeric] | Mean (sd) : 1778.7 (1807.6) min < med < max: 0.2 < 1344.4 < 8226.6 IQR (CV) : 2564.4 (1) | 99 distinct values | 100 (100%) | 0 (0%) | |||||||||||||
14 | t_poppov [numeric] | Mean (sd) : 20 (12.3) min < med < max: 1.9 < 17.6 < 56.4 IQR (CV) : 19 (0.6) | 85 distinct values | 100 (100%) | 0 (0%) | |||||||||||||
15 | t_popblack [numeric] | Mean (sd) : 5.1 (8.3) min < med < max: 0 < 1.4 < 57.6 IQR (CV) : 5.7 (1.6) | 58 distinct values | 100 (100%) | 0 (0%) | |||||||||||||
16 | t_pophisp [numeric] | Mean (sd) : 41.6 (25.5) min < med < max: 5.4 < 39.6 < 94.7 IQR (CV) : 41.1 (0.6) | 93 distinct values | 100 (100%) | 0 (0%) |
Generated by summarytools 0.9.6 (R version 4.0.0)
2020-05-15
Loading prediction grid shapefile:
shp_predictions <- readOGR("Point_Grid/point_grid_final.shp")
plot(shp_california)
points(shp_predictions,pch='.',col='blue')
title("Data loading: Predictors location")
footnote('Figure 4')
Convert shapefile to dataframe:
frm_predictions <- data.frame(
easting=shp_predictions@data$POINT_X, # Coordinates
northing=shp_predictions@data$POINT_Y, # Coordinates
p_elev=as.numeric(as.character(shp_predictions@data$ELEV)), # Elevation in m
p_coast=shp_predictions@data$DISTCOAST, # Distance form the coast in m
lc_1000=as.character(shp_predictions@data$RS_LC_1000), # Landcover majoritarian category in 1000mx1000m areas
c_popden=shp_predictions@data$COU_POPDEN, # From county level: Population density
c_poppov=shp_predictions@data$COU_POPPOV, # From county level: % of population in poverty
c_popblack=shp_predictions@data$COU_POPBLA, # From county level: % of population black or african american
c_pophisp=shp_predictions@data$COU_POPHIS # From county level: % of popylation of hispanic or latino origin
)
Filter locations where elevation is not available:
frm_predictions_filtered <- frm_predictions[frm_predictions$p_elev != '-9999',]
Convert dataframe to geodata:
geo_predictions <- as.geodata(frm_predictions_filtered,coords.col = 1:2, covar.col=3:9)
Check for duplicate prediction locations:
dup.coords(geo_predictions)
## NULL
Summary table on a sample of 9,999 values:
st_css()
frm_predictions_sample <- frm_predictions_filtered[sample(nrow(frm_predictions_filtered), 999), ]
print(dfSummary(frm_predictions_sample, plain.ascii = FALSE, graph.magnif = 0.85), max.distinct.values = 5, max.string.width=15, method = "render")
No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | easting [numeric] | Mean (sd) : 46460.5 (220837.6) min < med < max: -360101.1 < 4898.9 < 524898.9 IQR (CV) : 357500 (4.8) | 178 distinct values | 999 (100%) | 0 (0%) | |||||||||||||
2 | northing [numeric] | Mean (sd) : -92429.2 (278658.1) min < med < max: -595327 < -115327 < 444673 IQR (CV) : 452500 (-3) | 210 distinct values | 999 (100%) | 0 (0%) | |||||||||||||
3 | p_elev [numeric] | Mean (sd) : 856 (775) min < med < max: -71 < 679 < 3905 IQR (CV) : 1038.5 (0.9) | 770 distinct values | 999 (100%) | 0 (0%) | |||||||||||||
4 | p_coast [numeric] | Mean (sd) : 140226.4 (90193.4) min < med < max: 148.4 < 130574.7 < 359574.3 IQR (CV) : 145888.9 (0.6) | 999 distinct values | 999 (100%) | 0 (0%) | |||||||||||||
5 | lc_1000 [character] | 1. 0 2. 1 3. 2 |
|
999 (100%) | 0 (0%) | |||||||||||||
6 | c_popden [numeric] | Mean (sd) : 85.1 (174.5) min < med < max: 0.6 < 40.5 < 1415.7 IQR (CV) : 53.7 (2) | 58 distinct values | 999 (100%) | 0 (0%) | |||||||||||||
7 | c_poppov [numeric] | Mean (sd) : 17.7 (5.1) min < med < max: 7.7 < 17.8 < 28.3 IQR (CV) : 7.5 (0.3) | 50 distinct values | 999 (100%) | 0 (0%) | |||||||||||||
8 | c_popblack [numeric] | Mean (sd) : 3.7 (3) min < med < max: 0.1 < 2.5 < 14.1 IQR (CV) : 4.8 (0.8) | 37 distinct values | 999 (100%) | 0 (0%) | |||||||||||||
9 | c_pophisp [numeric] | Mean (sd) : 35.2 (19.2) min < med < max: 7.2 < 33.1 < 82.7 IQR (CV) : 33.3 (0.5) | 56 distinct values | 999 (100%) | 0 (0%) |
Generated by summarytools 0.9.6 (R version 4.0.0)
2020-05-15
Ploting monitors values:
area_california <- shp_california@polygons[[1]]@Polygons[[1]]@coords
colors <- c(rgb(252,146,114,maxColorValue=255),rgb(251,106,74,maxColorValue=255),rgb(222,45,38,maxColorValue=255),rgb(165,15,21,maxColorValue=255))
par(oma=c(0,0,2,0))
plot(geo_monitors_pm25, borders=area_california, lowess = TRUE,qt.col = colors)
mtext("PM2.5 Monitors", outer=TRUE, cex=1.2, line=1)
footnote('Figure 6')
Estimate the semivariogram form the PM2.5 monitors data and then adjust a model:
# Estimate semivariogram from data
var_monitors_pm25.dot <- variog(geo_monitors_pm25)
names(var_monitors_pm25.dot)
max <- var_monitors_pm25.dot$max.dist
var_monitors_pm25.dot<-variog(geo_monitors_pm25,max.dist=max/2) #restricting to half the actual maximum distance
plot(var_monitors_pm25.dot,xlab="Distance (meters)",ylab="Semivariogram",pty="m", pch=16, ylim=c(0,11))
title("Semivariogram of PM2.5 Data\n(restricted < 1/2 max distance)")
footnote('Figure 7')
Adjust a model to the semivariogram form the PM2.5 monitors data:
plot(var_monitors_pm25.dot,xlab="Distance (meters)",ylab="Semivariogram",pty="m", pch=16, ylim=c(0,11))
title("Semivariogram of PM2.5 Data\n(restricted < 1/2 max distance)")
footnote('Figure 8')
# Adjust semivariogram model
model<-"exponential" #"exponential" #"spherical" #"gaussian"
rangs <- c() #c(0.7e+05,0.8e+05,0.9e+05) #c(1.0e+05,1.1e+05,1.2e+05,1.3e+05,1.4e+05,1.5e+05)
sills <- c() #c(5,10,15) #c(9.0,9.5,10,10.5,11)
nugs <- c() #c(0,1,2) #c(0,0.2,0.4,0.6,0.8,1)
df.MLE<-data.frame(9999,9999,9999,9999,9999)
names(df.MLE)<-c("rang","sill","nug","AIC","BIC")
for (rang in rangs) { for (sill in sills) { for (nug in nugs) {
psill<-sill-nug
var_monitors_pm25.fit.MLE<-likfit(geo_monitors_pm25,ini.cov.pars=c(psill,rang),nugget=nug)
temp <- data.frame(rang,sill,nug,var_monitors_pm25.fit.MLE$AIC,var_monitors_pm25.fit.MLE$BIC)
names(temp)<-c("rang","sill","nug","AIC","BIC")
df.MLE <- rbind(df.MLE, temp)
}}}
model<-"exponential" #"exponential" #"spherical" #"gaussian"
rangs <- c() #c(0.7e+05,0.8e+05,0.9e+05) #c(1.0e+05,1.1e+05,1.2e+05,1.3e+05,1.4e+05,1.5e+05)
sills <- c() #c(5,10,15) #c(9.0,9.5,10,10.5,11)
nugs <- c() #c(0,1,2) #c(0,0.2,0.4,0.6,0.8,1)
df.WLS<-data.frame(9999,9999,9999,9999)
names(df.WLS)<-c("rang","sill","nug","value")
for (rang in rangs) { for (sill in sills) { for (nug in nugs) {
psill<-sill-nug
var_monitors_pm25.fit.WLS<-variofit(var_monitors_pm25.dot,ini.cov.pars=c(psill,rang),cov.model=model,nugget=nug)
temp <- data.frame(rang,sill,nug,var_monitors_pm25.fit.WLS$value)
names(temp)<-c("rang","sill","nug","value")
df.WLS <- rbind(df.WLS, temp)
}}}
model<-"exponential" #"exponential" #"spherical" #"gaussian"
rang.MLE <- 0.8e+05 #c(1.0e+05,1.1e+05,1.2e+05,1.3e+05,1.4e+05,1.5e+05)
sill.MLE <- 10 #c(9.0,9.5,10,10.5,11)
nug.MLE <- 0 #c(0,0.2,0.4,0.6,0.8,1)
psill.MLE<-sill.MLE-nug.MLE
var_monitors_pm25.fit.MLE<-likfit(geo_monitors_pm25,ini.cov.pars=c(psill.MLE,rang.MLE),nugget=nug.MLE)
rang.WLS <- 0.8e+05 #c(1.0e+05,1.1e+05,1.2e+05,1.3e+05,1.4e+05,1.5e+05)
sill.WLS <- 10 #c(9.0,9.5,10,10.5,11)
nug.WLS <- 0 #c(0,0.2,0.4,0.6,0.8,1)
psill.WLS<-sill.WLS-nug.WLS
var_monitors_pm25.fit.WLS<-variofit(var_monitors_pm25.dot,ini.cov.pars=c(psill.WLS,rang.WLS),cov.model=model,nugget=nug.WLS)
lines(var_monitors_pm25.fit.MLE,col="blue",lty=2)
lines(var_monitors_pm25.fit.WLS,col="red",lty=4)
legend(4.5e5,3,legend=c("MLE","WLS"),lty=2:4, col=c("blue","red"))
We tested a number of starting parameters to fit the variogram model and ended up with the value:
The MLE adjustement is more heavily affected by the values at larger distance (>500km) than the WLS adjustment.
Models generation:
linear_model_lc1000 <-lm(pm25~lc_1000,data=frm_monitors)
linear_model_lc100 <-lm(pm25~lc_100,data=frm_monitors)
linear_model_lvlcounty <-lm(pm25~c_popden+c_poppov+c_popblack+c_pophisp,data=frm_monitors)
linear_model_lvltract <-lm(pm25~t_popden+t_poppov+t_popblack+t_pophisp,data=frm_monitors)
linear_model_p_wfire <-lm(pm25~p_wfire,data=frm_monitors)
linear_model_p_coast <-lm(pm25~p_coast,data=frm_monitors)
Landcover resolution: 1,000m2 vs 100m2
{library(jtools); library(huxtable);}
model_names <- c("Model LC 1,000m2", "Model LC 100m2")
stats <- c(N = "nobs", R2 = "r.squared", AIC = "AIC", BIC = "BIC")
export_summs(linear_model_lc1000,linear_model_lc100, error_format = "[{conf.low}, {conf.high}]", model.names = model_names, statistics = stats)
Model LC 1,000m2 | Model LC 100m2 | |
(Intercept) | 7.26 *** | 7.38 *** |
[5.99, 8.52] | [6.10, 8.66] | |
lc_10001 | 1.95 * | |
[0.35, 3.56] | ||
lc_10002 | 1.45 | |
[-0.06, 2.96] | ||
lc_1001 | 1.42 | |
[-0.24, 3.08] | ||
lc_1002 | 1.56 * | |
[0.05, 3.06] | ||
N | 100 | 100 |
R2 | 0.06 | 0.04 |
AIC | 498.55 | 500.09 |
BIC | 508.97 | 510.51 |
*** p < 0.001; ** p < 0.01; * p < 0.05. |
Administrative level for population data (density, poverty, black, hispanic): County level vs Census Tract level
{library(jtools); library(huxtable);}
model_names <- c("Model lvl County", "Model lvl CTract")
export_summs(linear_model_lvlcounty,linear_model_lvltract, error_format = "[{conf.low}, {conf.high}]", model.names = model_names, statistics = stats)
Model lvl County | Model lvl CTract | |
(Intercept) | 3.15 *** | 5.85 *** |
[1.32, 4.98] | [4.65, 7.04] | |
c_popden | 0.00 | |
[-0.00, 0.00] | ||
c_poppov | 0.13 * | |
[0.02, 0.25] | ||
c_popblack | 0.16 | |
[-0.01, 0.33] | ||
c_pophisp | 0.06 *** | |
[0.03, 0.10] | ||
t_popden | -0.00 | |
[-0.00, 0.00] | ||
t_poppov | 0.04 | |
[-0.01, 0.08] | ||
t_popblack | 0.06 | |
[-0.01, 0.14] | ||
t_pophisp | 0.04 *** | |
[0.02, 0.07] | ||
N | 100 | 100 |
R2 | 0.36 | 0.24 |
AIC | 463.99 | 481.38 |
BIC | 479.62 | 497.01 |
*** p < 0.001; ** p < 0.01; * p < 0.05. |
Wildfire and Coast Distance covariates
{library(jtools); library(huxtable);}
model_names <- c("Wildfire", "Coast Distance")
export_summs(linear_model_p_wfire,linear_model_p_coast, error_format = "[{conf.low}, {conf.high}]", model.names = model_names, statistics = stats)
Wildfire | Coast Distance | |
(Intercept) | 8.77 *** | 8.03 *** |
[8.14, 9.40] | [7.20, 8.85] | |
p_wfire | -0.38 | |
[-0.92, 0.16] | ||
p_coast | 0.00 | |
[-0.00, 0.00] | ||
N | 100 | 100 |
R2 | 0.02 | 0.03 |