In the field of Emergency Medical Services (EMS), response time is everything. When responding to EMS calls for conditions such as cardiac arrest or stroke, minutes can be the difference between life and death. Over the past year, the COVID-19 pandemic has pushed the efficiency and function of EMS around the country to its limit. Surges in EMS demand have caused ambulance delays and shortages and the process of taking COVID-related protective measures such as donning personal protective equipment has made response times longer. Attesting to the current strain on the medical system, paramedics in Los Angeles county were recently ordered to delay transporting cardiac arrest patients to the hospital until 5 minutes after they are revived in order to optimize use of limited ambulance and emergency room resources.
In this environment, increasing the efficiency of ambulance response is critical to minimize the number of lives lost. To do this, we use EMS call data to create a spatiotemporal model for forecasting future EMS calls. We perform this analysis in Santa Monica, California because it has a wealth of EMS call data that reaches back to 2009 and is updated daily.
As it is, Santa Monica dispatches ambulances from its network of fire stations and a private ambulance company as EMS calls are received. We propose to design an algorithm which forecasts the spatiotemporal patterns of EMS demand so that paramedics can arrive at predicted hotspots before emergencies occur. As Santa Monica’s COVID cases climb, nearly filling its intensive care units and infecting ambulance and hospital staff, we believe that a more data-driven approach to ambulance dispatch can help reduce strain on the medical system.
To watch our pitch deck video for Siren, the app we made with this algorithm, please click here.
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
options(scipen=10000000)
library(tidyverse)
library(sf)
library(RSocrata)
library(viridis)
library(spatstat)
library(raster)
library(spdep)
library(FNN)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(tidycensus)
library(mapview)
library(devtools)
library(lubridate)
library(dplyr)
library(pander)
library(ggplot2)
options(scipen=999)
options(tigris_class = "sf")
root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"
source_url("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")
paletteGray <- c("gray90", "gray70", "gray50", "gray30", "gray10")
mapTheme <- function(base_size = 12) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = 14,colour = "black"),
plot.subtitle=element_text(face="italic"),
plot.caption=element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(), axis.title = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "white", fill=NA, size=2)
)
}
plotTheme <- function(base_size = 12) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = 14,colour = "black"),
plot.subtitle = element_text(face="italic"),
plot.caption = element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),
panel.grid.major = element_line("grey80", size = 0.1),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "white", fill=NA, size=2),
strip.background = element_rect(fill = "grey80", color = "white"),
strip.text = element_text(size=12),
axis.title = element_text(size=12),
axis.text = element_text(size=10),
plot.background = element_blank(),
legend.background = element_blank(),
legend.title = element_text(colour = "black", face = "italic"),
legend.text = element_text(colour = "black", face = "italic"),
strip.text.x = element_text(size = 14)
)
}
paletteMap <- c("gray90","gray70","gray50","gray30","gray10")
qBr <- function(df, variable, rnd) {
if (missing(rnd)) {
as.character(quantile(round(df[[variable]],0),
c(.01,.2,.4,.6,.8), na.rm=T))
} else if (rnd == FALSE | rnd == F) {
as.character(formatC(quantile(df[[variable]]), digits = 3),
c(.01,.2,.4,.6,.8), na.rm=T)
}
}
q5 <- function(variable) {as.factor(ntile(variable, 5))}
palette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c")
palette4 <- c("#D2FBD4","#92BCAB","#527D82","#123F5A")
palette2 <- c("#9c2108","#08519c")
# NEAREST NEIGHBOR FUNCTION
nn_function <- function(measureFrom,measureTo,k) {
measureFrom_Matrix <-
as.matrix(measureFrom)
measureTo_Matrix <-
as.matrix(measureTo)
nn <-
get.knnx(measureTo, measureFrom, k)$nn.dist
output <-
as.data.frame(nn) %>%
rownames_to_column(var = "thisPoint") %>%
gather(points, point_distance, V1:ncol(.)) %>%
arrange(as.numeric(thisPoint)) %>%
group_by(thisPoint) %>%
summarize(pointDistance = mean(point_distance)) %>%
arrange(as.numeric(thisPoint)) %>%
dplyr::select(-thisPoint) %>%
pull()
return(output)
}
In our model, we primarily use data from Santa Monica’s Fire Calls for Service dataset, filtered for EMS calls. We additionally use tract-level demographic data from the American Community Survey (ACS), environmental and city services data from Santa Monica’s Open Data Portal, and tract-level health outcome data from the Center for Disease Control (CDC). For the temporal aspect of the model, we use weather data from the Iowa Environment Mesonet and city-wide COVID-19 case counts compiled by the L.A. Times.
Our intention is that the data from the ACS, CDC, and Santa Monica will inform the spatial element of our prediction, as health, socioeconomic conditions, employment, and the physical environment are correlated to EMS calls and medical emergencies. Our use of EMS call, weather, and COVID-19 data contribute to the temporal part of this model, as they offer past trends about when EMS calls occur. We divide the city into 234 fishnet grid cells of 1000 square feet and use these cells as our units of analysis.
################Santa Monica Base################
SM_Base <- st_read("/Users/annaduan/Documents/GitHub/EMS-Call-Forecasting/SMTracts") %>%
#SM_Base <- st_read("D:/Upenn/CPLN508/final project/EMS-Prediction-master/EMS-Prediction-master/SMTracts") %>%
st_union() %>%
st_transform('EPSG:2225')
#xmin = st_bbox(SM_Base)[[1]]
#ymin = st_bbox(SM_Base)[[2]]
#xmax = st_bbox(SM_Base)[[3]]
#ymax = st_bbox(SM_Base)[[4]]
################Census Data################
census_api_key("d9ebfd04caa0138647fbacd94c657cdecbf705e9", install = FALSE, overwrite = TRUE)
ACS <-
get_acs(geography = "tract", variables = c("B25002_003E", "B25001_001E", "B19013_001E", "B01001A_001E", "B01003_001E", "B07013_002E", "B07013_003E", "B06009_002E",
"B05003_008E", "B05003_019E", "B06012_002", "B21005_006E", "B21005_011E",
"B01001_006E", "B01001_007E", "B01001_008E", "B01001_009E","B01001_010E", "B01001_011E", "B01001_012E",
"B01001_031E", "B01001_032E", "B01001_033E", "B01001_034E", "B01001_035E", "B01001_036E",
"B01002_001E", "B01002_002E","B01002_003E"), year=2018, state=06, county=037, geometry=T, key="d9ebfd04caa0138647fbacd94c657cdecbf705e9") %>%
st_transform('EPSG:2225')
#row 1: vacant, total housing units, mhhinc
#row 2: white, population, renter occ, owner occ, #no HS degree
#row 3: male adults, female adults, poverty level, youth unemployed (18-34, veteran), youth unemployed (non veteran)
#row 4-5: male 15-17, male 18-19, male 20, male 21, male 22-24, male 25-29, male 30-34
#row 6-7: female 18-34,
###MEDIAN AGE TOTAL, male, female
dd18_5 <- load_variables(year = 2018,dataset = "acs5", cache = TRUE)
ACSTracts <-
ACS %>%
as.data.frame() %>%
distinct(GEOID, .keep_all = TRUE) %>%
dplyr::select(GEOID, geometry) %>%
st_sf %>%
st_transform('EPSG:2225')
###need to subset here 7012.01 - 7023
ACSTracts <- subset(ACSTracts, GEOID >= 06037701201 & GEOID <= 06037702300)
#####
ACS <-
rbind(
st_centroid(ACS)[SM_Base,] %>%
st_drop_geometry() %>%
left_join(ACS) %>%
st_sf() %>%
mutate(inSM = "YES"),
st_centroid(ACS)[SM_Base, op = st_disjoint] %>%
st_drop_geometry() %>%
left_join(ACS) %>%
st_sf() %>%
mutate(inSM = "NO")) %>%
filter(inSM == "YES") %>%
dplyr::select(-inSM)
#long to wide form
ACS <-
ACS %>%
dplyr::select(-moe, -GEOID) %>%
spread(variable, estimate) %>%
dplyr::select(-geometry) %>%
rename(vacantUnits = B25002_003,
totalUnits = B25001_001,
medHHInc = B19013_001,
white = B01001A_001,
population = B01003_001,
ownerOcc = B07013_002,
renterOcc = B07013_003,
noHsDegree = B06009_002,
maleAdult = B05003_008,
femaleAdult = B05003_019,
poverty = B06012_002,
youthUnempVet = B21005_006,
youthUnempNonVet = B21005_011,
male1517 = B01001_006,
male1819 = B01001_007,
male20 = B01001_008,
male21 = B01001_009,
male2224 = B01001_010,
male2529 = B01001_011,
male3034 = B01001_012,
female1819 = B01001_031,
female20 = B01001_032,
female21 = B01001_033,
female2224 = B01001_034,
female2529 = B01001_035,
female3034 = B01001_036,
median_age = B01002_001,
medianage_male = B01002_002,
medianage_female = B01002_003) %>%
mutate(pctVacant = ifelse(totalUnits > 0, vacantUnits / totalUnits, 0),
pctWhite = ifelse(population > 0, white / population, 0),
pctRenterOcc = renterOcc/ (renterOcc + ownerOcc),
pctNoHS = noHsDegree/ (maleAdult + femaleAdult),
pctPoverty = ifelse(population > 0, poverty / population, 0),
youthUnemploy = (youthUnempVet + youthUnempNonVet) / (male1819 + male20 + male21 + male2224 + male2529 + male3034 + female1819 + female20 + female21 + female2224 + female2529 + female3034),
pctMaleYouth = ifelse(population > 0, (male1517 + male1819 + male20 + male21 + male2224 + male2529 + male3034) / population, 0)) %>%
dplyr::select(-totalUnits,-vacantUnits,-white,-renterOcc,-ownerOcc, -noHsDegree, -maleAdult, -femaleAdult, -youthUnempVet, -youthUnempNonVet, -male1517, -male1819, -male20, -male21, -male2224, -male2529, -male3034, -female1819, -female20, -female21, -female2224, -female2529, -female3034, -poverty)
################Fishnet################
fishnet <-
st_make_grid(ACS, cellsize = 1000) %>%
st_sf() %>%
mutate(uniqueID = as.numeric(rownames(.)))
#use this to get rid of fishnet grids outside of census tracts
fishnet_centroid <- fishnet %>%
st_centroid()
tractNames <- ACS %>%
dplyr::select(NAME)
fishnet <- fishnet_centroid %>%
st_join(., tractNames) %>%
st_drop_geometry() %>%
left_join(fishnet,.,by="uniqueID") %>%
dplyr::select(NAME) %>%
na.omit() %>%
mutate(uniqueID = as.numeric(rownames(.)))
################Santa Monica Open Data################
violentCrime <- read.socrata("https://data.smgov.net/OData.svc/kn6p-4y74?UCR='0100' or UCR='0110' or UCR='0120' or UCR='0300' or UCR='0311' or UCR='0312' or UCR='0314' or UCR='0315' or UCR='0317' or UCR='0321' or UCR='0322' or UCR='0325'") %>%
dplyr::select(Y = Latitude, X = Longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Violent_Crime")
#trash/dumping
trash <- read.socrata("https://data.smgov.net/OData.svc/tsas-mvez?topic='Illegal Dumping'") %>%
dplyr::select(Y = Latitude, X = Longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Trash")
#weather
#vsby: visibility; sknt:wind speed; gust:wind gust in knots
library(riem)
library(dplyr)
library(lubridate)
weather.Panel <-
riem_measures(station = "KSMO", date_start = "2020-01-01", date_end = "2020-12-17") %>%
dplyr::select(valid, tmpf, p01i, sknt, vsby, gust) %>%
replace(is.na(.), 0) %>%
mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
mutate(week = week(interval60),
dotw = wday(interval60, label=TRUE)) %>%
group_by(interval60) %>%
summarize(Temperature = max(tmpf),
Precipitation = sum(p01i),
Wind_Speed = max(sknt),
visibility = sum(vsby),
wind_gust = max(gust)) %>%
mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))
#downtown distaance
zipcode <- st_read("https://opendata.arcgis.com/datasets/0b879034d6954177a1c3bac61039f204_0.geojson")%>%
st_transform(st_crs(fishnet)) %>%
dplyr::select("ZIPCODE")
downtown <- subset(zipcode, ZIPCODE == "90401")
downtown <- ACS %>%
filter(NAME == "Census Tract 7019.02, Los Angeles County, California") %>%
st_centroid()
#distance to bars
#bars <- st_read("D:/Upenn/CPLN508/final project/EMS-Prediction-master/EMS-Prediction-master/bar.geojson")%>%
bars <- st_read("/Users/annaduan/Documents/GitHub/EMS-Call-Forecasting/bar.geojson") %>%
st_transform(st_crs(fishnet))
#intersection
#intersections <- st_read("D:/Upenn/CPLN508/final project/EMS-Prediction-master/EMS-Prediction-master/intersection.geojson")%>%
intersections <- st_read("/Users/annaduan/Documents/GitHub/EMS-Call-Forecasting/intersection.geojson") %>%
st_transform(st_crs(fishnet))
#311 calls
call311 <- read.socrata("https://data.smgov.net/resource/tsas-mvez.json")
#https://www.smgov.net/santamonicaworks.aspx
streetlight <- call311 %>% filter(topic=="Streetlights")%>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "streetlights")
signandmarking <- call311 %>% filter(topic=="Street Signs & Markings")%>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "markings")
socialservice <- call311 %>% filter(topic=="General Concerns/Social Service Request")%>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "servicerequest")
#NEAR INDUSTRIAL SITES
parcel <- read.socrata("https://data.smgov.net/resource/sa4y-7yah.json")
lu <- as.data.frame(unique(parcel$specificusetype))
harmful_lu <- parcel %>%
filter(specificusetype=="Petroleum and Gas"|generalusetype=="Industrial") %>%
dplyr::select(Y = center_lat, X = center_lon) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "harmfullanduse")
#OSM
#sport <- st_read("D:/Upenn/CPLN508/final project/EMS-Prediction-master/EMS-Prediction-master/sport.geojson") %>%
sport <- st_read("/Users/annaduan/Documents/GitHub/EMS-Call-Forecasting/sport.geojson") %>%
st_transform(st_crs(fishnet))%>%
mutate(Legend = "sport")
ACS$row <- 1:nrow(ACS)
#CDC health data
CDC <-
read.socrata("https://chronicdata.cdc.gov/OData.svc/k86t-wghb?stateabbr=CA") %>%
filter(placename == "Santa Monica") %>%
dplyr::select(-stateabbr, -placename, -placefips, -place_tractid, -access2_crude95ci, -arthritis_crude95ci, -binge_crude95ci, -bphigh_crude95ci, -bpmed_crude95ci, -cancer_crude95ci, -casthma_crude95ci, -chd_crude95ci, -checkup_crude95ci, -cholscreen_crude95ci, -colon_screen_crude95ci, -copd_crude95ci, -corem_crude95ci, -corew_crude95ci, -csmoking_crude95ci, -dental_crude95ci, -diabetes_crude95ci, -highchol_crude95ci, -kidney_crude95ci, -lpa_crude95ci, -mammouse_crude95ci, -mhlth_crude95ci, -teethlost_crude95ci, -stroke_crude95ci, -sleep_crude95ci, -phlth_crude95ci, -paptest_crude95ci, -obesity_crude95ci) %>%
rename(FIPS = tractfips,
population = population2010,
uninsured = access2_crudeprev,
arthritis = arthritis_crudeprev,
bingeDrink = binge_crudeprev,
highBloodPressure = bphigh_crudeprev,
bloodPressureMeds = bpmed_crudeprev,
cancer = cancer_crudeprev,
asthma = casthma_crudeprev,
heartDisease = chd_crudeprev,
doctorCheckups = checkup_crudeprev,
cholesterolScreen = cholscreen_crudeprev,
colonScreen65 = colon_screen_crudeprev,
pulmonaryDisease = copd_crudeprev,
clinicalServicesMen65 = corem_crudeprev,
clinicalServicesWomen65 = corew_crudeprev,
smoking = csmoking_crudeprev,
dentalVisits = dental_crudeprev,
diabetes = diabetes_crudeprev,
highCholesterol = highchol_crudeprev,
kidneyDisease = kidney_crudeprev,
noPhysicalActivity = lpa_crudeprev,
mammograms50 = mammouse_crudeprev,
poorMentalHealth = mhlth_crudeprev,
obese = obesity_crudeprev,
papTest = paptest_crudeprev,
poorPhysicalHealth = phlth_crudeprev,
sleepDeprived = sleep_crudeprev,
stroke = stroke_crudeprev,
noTeeth = teethlost_crudeprev,
geometry = geolocation) %>%
dplyr::select(-population, -FIPS, -geometry)
CDC$row <- 1:nrow(CDC)
tractData <- left_join(CDC,ACS, by = "row") %>%
st_as_sf()
# ATTACH VARIABLES TO FISHNET
vars_net <-
rbind(violentCrime,trash, streetlight, signandmarking, socialservice, harmful_lu) %>%
st_join(., fishnet, join=st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID, Legend) %>%
summarize(count = n()) %>%
full_join(fishnet) %>%
spread(Legend, count, fill=0) %>%
st_sf() %>%
dplyr::select(-`<NA>`) %>%
na.omit() %>%
ungroup()
vars_net_tractData <- rbind(violentCrime,trash, streetlight, signandmarking, socialservice, harmful_lu) %>%
st_join(., fishnet, join=st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID, Legend) %>%
summarize(count = n()) %>%
full_join(fishnet) %>%
spread(Legend, count, fill=0) %>%
st_sf() %>%
dplyr::select(-`<NA>`) %>%
na.omit() %>%
ungroup() %>%
st_centroid() %>%
dplyr::select(uniqueID) %>%
st_join(., tractData) %>%
st_join(., zipcode) %>%
st_drop_geometry() %>%
left_join(vars_net,.,by="uniqueID") %>%
dplyr::select(-NAME.y) %>%
rename(tract = NAME.x) %>%
st_as_sf()
# MAKE MAPPABLE
vars_net.long <-
gather(vars_net_tractData, Variable, value, -geometry, -uniqueID)
vars <- unique(vars_net.long$Variable)
mapList <- list()
# MAKE NN FEATURES
st_c <- st_coordinates
st_coid <- st_centroid
vars_net_tractData <-
vars_net_tractData %>%
mutate(
violentCrime.nn =
nn_function(st_c(st_coid(vars_net)), st_c(violentCrime),3),
trash.nn =
nn_function(st_c(st_coid(vars_net)), st_c(trash),3),
violentCrime.nn =
nn_function(st_c(st_coid(vars_net)), st_c(violentCrime),3),
streetlightissue.nn =
nn_function(st_c(st_coid(vars_net)), st_c(streetlight),3),
signandmarking.nn =
nn_function(st_c(st_coid(vars_net)), st_c(signandmarking),3),
socialservice.nn =
nn_function(st_c(st_coid(vars_net)), st_c(socialservice),3),
harmfullu.nn =
nn_function(st_c(st_coid(vars_net)), st_c(harmful_lu),1),
sport.nn =
nn_function(st_c(st_coid(vars_net)), st_c(st_coid(sport)),1),
bar.nn =
nn_function(st_c(st_coid(vars_net)), st_c(bars),1),
sport.nn =
nn_function(st_c(st_coid(vars_net)), st_c(st_coid(sport)),1),
downtown.nn =
nn_function(st_c(st_coid(vars_net)), st_c(st_coid(downtown)),1),
intersection.nn =
nn_function(st_c(st_coid(vars_net)), st_c(intersections),3),
uniqueID = as.numeric(rownames(.)))
vars_net.long.nn <-
dplyr::select(vars_net_tractData, ends_with(".nn")) %>%
gather(Variable, value, -geometry)
vars <- unique(vars_net.long.nn$Variable)
mapList <- list()
rm(bars, call311, downtown, CDC, intersections, lu, parcel, signandmarking, socialservice, sport, streetlight, trash, violentCrime, zipcode)
To begin our analysis, we wrangle EMS call data to understand the spatial trends in EMS calls and response times.
#EMS
EMS <- read.socrata("https://data.smgov.net/Public-Safety/Fire-Calls-for-Service/5y3u-5db4?call_type_description=Emergency Medical Service (EMS)")%>%
dplyr::select (-census_block_2010_geoid, -census_tract_2010_geoid, -census_block_2000_geoid, -census_tract_2000_geoid, -incident_number, -call_type_description, -location, -disposition) %>%
mutate (interval = difftime(cleared,received,units = "min"))
EMS$incident_date <- as.POSIXct(EMS$incident_date)
#2020 (for exploratory analysis visualizations)
EMS_2020 <- subset(EMS,
incident_date >= as.POSIXct('2020-01-01') &
incident_date <= as.POSIXct('2020-12-16')
)
#SF
EMS_2020.sf <- na.omit(EMS_2020, cols=c("longitude", "latitude")) %>%
st_as_sf(coords = c("longitude", "latitude"), crs = 4326, agr = "constant") %>%
st_transform('EPSG:2225')
#2018 - 2020
EMS_18_20 <- subset(EMS,
incident_date >= as.POSIXct('2018-12-16') &
incident_date <= as.POSIXct('2020-12-16')
)
#SF
EMS_18_20.sf <- na.omit(EMS_18_20, cols=c("longitude", "latitude")) %>%
st_as_sf(coords = c("longitude", "latitude"), crs = 4326, agr = "constant") %>%
st_transform('EPSG:2225')
#2010 - 2020
EMS_2010_2020 <- EMS %>%
dplyr::filter(., grepl('201', incident_date))
#SF
EMS_2010_2020.sf <- na.omit(EMS_2010_2020, cols=c("longitude", "latitude")) %>%
st_as_sf(coords = c("longitude", "latitude"), crs = 4326, agr = "constant") %>%
st_transform('EPSG:2225')
#EMS NET - Count and response time
#18-20
EMS_18_20_net.count <-
dplyr::select(EMS_18_20.sf) %>%
mutate(countEMS = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(countEMS = replace_na(countEMS, 0),
uniqueID = as.numeric(rownames(.)))
EMS_18_20_net.response <-
dplyr::select(EMS_18_20.sf) %>%
mutate(responseTime = na.omit(EMS_18_20.sf$interval)) %>%
aggregate(., fishnet, mean) %>%
mutate(responseTime = replace_na(responseTime, 0),
uniqueID = as.numeric(rownames(.)),
cvID = sample(round(nrow(fishnet) / 24), size=nrow(fishnet), replace = TRUE))
#2020
EMS_2020_net.response <-
dplyr::select(EMS_2020.sf) %>%
mutate(responseTime = na.omit(EMS_2020.sf$interval)) %>%
aggregate(., fishnet, mean) %>%
mutate(responseTime = replace_na(responseTime, 0),
uniqueID = as.numeric(rownames(.)),
cvID = sample(round(nrow(fishnet) / 24), size=nrow(fishnet), replace = TRUE))
EMS_2020_net.count <-
dplyr::select(EMS_2020.sf) %>%
mutate(countEMS = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(countEMS = replace_na(countEMS, 0),
uniqueID = as.numeric(rownames(.)))
EMS_2020_net <- full_join(EMS_2020_net.count, st_drop_geometry(EMS_2020_net.response), by = "uniqueID")
#10-20
EMS_10_20_net.response <-
dplyr::select(EMS_2010_2020.sf) %>%
mutate(responseTime = na.omit(EMS_2010_2020.sf$interval)) %>%
aggregate(., fishnet, mean) %>%
mutate(responseTime = replace_na(responseTime, 0),
uniqueID = as.numeric(rownames(.)),
cvID = sample(round(nrow(fishnet) / 24), size=nrow(fishnet), replace = TRUE))
################Final Net################
final_net <-
inner_join(EMS_2020_net, st_drop_geometry(vars_net_tractData), by="uniqueID")
#2010 - 2020
EMS_net.count.emg <-
dplyr::select(EMS_2010_2020.sf) %>%
mutate(countEMS = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(countEMS = replace_na(countEMS, 0),
uniqueID = as.numeric(rownames(.)))
EMS_net.response.emg <-
dplyr::select(EMS_2010_2020.sf) %>%
mutate(responseTime = na.omit(EMS_2010_2020.sf$interval)) %>%
aggregate(., fishnet, mean) %>%
mutate(responseTime = replace_na(responseTime, 0),
uniqueID = as.numeric(rownames(.)))
#hour features
EMS_18_20 <- EMS_18_20 %>%
mutate(interval60 = floor_date(ymd_hms(received), unit = "hour"),
interval15 = floor_date(ymd_hms(received), unit = "15 mins"),
interval5 = floor_date(ymd_hms(received), unit = "5 mins"),
interval1 = floor_date(ymd_hms(received), unit = "1 min"),
interval4h = floor_date(ymd_hms(received), unit = "4 hours"),
day = day(interval60),
week = week(interval60),
month = month(interval60),
dotw = wday(interval60, label=TRUE))
EMS_2020 <- EMS_2020 %>%
mutate(interval60 = floor_date(ymd_hms(received), unit = "hour"),
interval15 = floor_date(ymd_hms(received), unit = "15 mins"),
interval5 = floor_date(ymd_hms(received), unit = "5 mins"),
interval1 = floor_date(ymd_hms(received), unit = "1 min"),
interval4h = floor_date(ymd_hms(received), unit = "4 hours"),
day = day(interval60),
week = week(interval60),
month = month(interval60),
dotw = wday(interval60, label=TRUE))
EMS_2020.sf <- EMS_2020.sf %>%
mutate(interval60 = floor_date(ymd_hms(received), unit = "hour"),
interval15 = floor_date(ymd_hms(received), unit = "15 mins"),
interval5 = floor_date(ymd_hms(received), unit = "5 mins"),
interval1 = floor_date(ymd_hms(received), unit = "1 min"),
interval4h = floor_date(ymd_hms(received), unit = "4 hours"),
day = day(interval60),
week = week(interval60),
month = month(interval60),
dotw = wday(interval60, label=TRUE))
EMS_18_20.sf <- EMS_18_20.sf %>%
mutate(interval60 = floor_date(ymd_hms(received), unit = "hour"),
interval15 = floor_date(ymd_hms(received), unit = "15 mins"),
interval5 = floor_date(ymd_hms(received), unit = "5 mins"),
interval1 = floor_date(ymd_hms(received), unit = "1 min"),
interval4h = floor_date(ymd_hms(received), unit = "4 hours"),
day = day(interval60),
week = week(interval60),
month = month(interval60),
dotw = wday(interval60, label=TRUE))
In Figure 1, we see that Santa Monica’s EMS calls come primarily from the downtown and midtown regions, where most tourist attractions are located. This is expected, as studies have found that emergencies are most likely to occur in places with high daytime populations. With greater distance from downtown, EMS incidents become less frequent. Out of curiosity, we checked if these patterns have been consistent in recent years, and found that the hot-spots have remained the same since 2010.
#Map of 2018-2020 and 2010-2020 call density
grid.arrange(
ggplot() +
geom_sf(data = ACS, fill = "black") +
stat_density2d(data = data.frame(st_coordinates(EMS_2020.sf)),
aes(X, Y, fill = ..level.., alpha = ..level..),
size = 0.1, bins = 200, geom = 'polygon') +
scale_fill_gradient(low = "blue", high = "yellow",
breaks=c(0.000000003,0.00000003),
labels=c("Minimum","Maximum"), name = "Density") +
scale_alpha(range = c(0.0, 0.5), guide = FALSE) +
labs(title = "2020") +
mapTheme(),
ggplot() +
geom_sf(data = ACS, fill = "black") +
stat_density2d(data = data.frame(st_coordinates(EMS_2010_2020.sf)),
aes(X, Y, fill = ..level.., alpha = ..level..),
size = 0.1, bins = 200, geom = 'polygon') +
scale_fill_gradient(low = "blue", high = "yellow",
breaks=c(0.000000003,0.00000003),
labels=c("Minimum","Maximum"), name = "Density") +
scale_alpha(range = c(0.0, 0.5), guide = FALSE) +
labs(title = "2010-2020") +
mapTheme(),
ncol = 2,
top = "Figure 1: Observed EMS Call Density in 2020 vs last 10 years")
We would expect that response times in the past year have been slower due to increased demand related to the pandemic. However, average response times in 2020 were actually lower than the average response time over the past 10 years (Figure 2). This may indicate higher efficiency, however EMS calls have also been lower this year. Indeed, it’s been documented that fear of COVID-19 infection is deterring people from calling ambulances for non-COVID emergencies such as heart attacks. Additionally, while response times have decreased, they are still relatively high: in 2020, the mean response time was 29 minutes, more than 6x the length of the standard set by the National Fire Protection Association.
#Map of 2020 and 2010-2020 response times
g_legend<-function(a.gplot){
tmp <- ggplot_gtable(ggplot_build(a.gplot))
leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
legend <- tmp$grobs[[leg]]
return(legend)}
mylegend<-g_legend(
ggplot() +
geom_sf(data = EMS_2020_net.response, aes(fill = responseTime, colour = responseTime)) +
scale_fill_viridis(option = "A", name = "Minutes") +
scale_colour_viridis(option = "A", name = "Minutes") +
labs(title = "2020", subtitle = "Santa Monica, CA") +
mapTheme())
p3 <- grid.arrange(arrangeGrob(ggplot() +
geom_sf(data = EMS_2020_net.response, aes(fill = responseTime, colour = responseTime)) +
scale_fill_viridis(option = "A", name = "Minutes") +
scale_colour_viridis(option = "A", name = "Minutes") +
labs(title = "2020", subtitle = "Santa Monica, CA") +
mapTheme() + theme(legend.position="none"),
ggplot() +
geom_sf(data = EMS_10_20_net.response, aes(fill = responseTime, colour = responseTime)) +
scale_fill_viridis(option = "A", name = "Minutes") +
scale_colour_viridis(option = "A", name = "Minutes") +
labs(title = "2010 - 2020", subtitle = "Santa Monica, CA") +
mapTheme() + theme(legend.position="none"),
nrow=1),
mylegend, nrow=2,heights=c(40, 40), top = "Figure 2: Observed EMS Response Times, 2010-2020")
To better understand the spatial distribution of EMS calls, we map the spatial distribution of some of its risk factors. We use health factors related to health emergencies such as past incidence of stroke, heart disease, and high blood pressure. We also use demographic features related to groups at high risk of relying on emergency medical services as their primary health care including poverty rate, unemployment, and percent White. In addition, we use environmental features related to poor community health including harmful land use, street light issues, and municipal service requests. Finally, we map violent crime incidents, perhaps the most direct proxy for EMS demand. We find three main patterns:
The distribution of poor mental health, obesity, stroke, median household income, and poverty point to an area in the middle of the city that appears to be highly economically disadvantaged and in poor health. Surprisingly, these areas are not in the hotspots for EMS calls.
Service requests, trash complaints, violent crime, binge drinking, and bars are clustered in the downtown region, the biggest hotspot for EMS calls.
Cancer, heart disease, stroke, and median age surprisingly do not exhibit clear spatial patterns and do not appear correlated with EMS calls.
#select features
vars_net.long <-
gather(vars_net_tractData %>%
dplyr::select(-uniqueID, -tract, -harmfullanduse, -markings, -streetlights, -arthritis, -highBloodPressure, -bloodPressureMeds, -doctorCheckups, -cholesterolScreen, -colonScreen65, -clinicalServicesMen65, -dentalVisits, -highCholesterol, -kidneyDisease, -noPhysicalActivity, -mammograms50, -papTest, -poorPhysicalHealth, -sleepDeprived, -noTeeth, -row, -medianage_male, -medianage_female, -population, -pctWhite, -pctRenterOcc, -pctNoHS, -youthUnemploy, -pctMaleYouth, -streetlightissue.nn, -signandmarking.nn, -harmfullu.nn, -sport.nn, -pulmonaryDisease, -violentCrime.nn, -socialservice.nn, -intersection.nn, -diabetes, -trash.nn, -ZIPCODE), Variable, value, -uninsured, -smoking, -asthma, -clinicalServicesWomen65, -pctVacant, -geometry)
#make list of unique variables
vars <- unique(vars_net.long$Variable)
mapList <- list()
#map risk factors
for(i in vars){
mapList[[i]] <-
ggplot() +
geom_sf(data = filter(vars_net.long, Variable == i), aes(fill=value), colour=NA) +
scale_fill_viridis(option = "A", name="") +
labs(title=i) +
mapTheme() +
theme(plot.title = element_text(size=10))}
do.call(grid.arrange,c(mapList, ncol = 3, top = "Figure 3: Risk Factors by Fishnet"))
To verify their correlation with EMS incidents, we plot EMS calls as a function of these risk factors. The results verify our suspicions: the most correlated spatial risk factors are violent crime, 311 service requests, downtown, trash complaints, distance to nearest bar, binge drinking, median income, and poverty.
correlation.long <-
st_drop_geometry(final_net) %>%
dplyr::select(-uniqueID, -ZIPCODE, -tract, -harmfullanduse, -markings, -streetlights, -arthritis, -highBloodPressure, -bloodPressureMeds, -doctorCheckups, -cholesterolScreen, -colonScreen65, -clinicalServicesMen65, -dentalVisits, -highCholesterol, -kidneyDisease, -noPhysicalActivity, -mammograms50, -papTest, -poorPhysicalHealth, -sleepDeprived, -noTeeth, -row, -medianage_male, -medianage_female, -population, -pctWhite, -pctRenterOcc, -pctNoHS, -youthUnemploy, -pctMaleYouth, -streetlightissue.nn, -signandmarking.nn, -harmfullu.nn, -sport.nn, -pulmonaryDisease, -violentCrime.nn, -socialservice.nn, -intersection.nn, -diabetes, -trash.nn, -clinicalServicesWomen65, -cvID, -responseTime, -cancer, -median_age, -poorMentalHealth, -pctVacant, -smoking, -uninsured, -stroke, -obese, -asthma) %>%
gather(Variable, Value, -countEMS) %>%
mutate(Value = as.numeric(Value))
correlation.cor <-
correlation.long %>%
group_by(Variable) %>%
summarize(correlation = cor(Value, countEMS, use = "complete.obs"))
ggplot(correlation.long, aes(Value, countEMS)) +
geom_point(size = 0.1) +
geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
geom_smooth(method = "lm", se = FALSE, colour = "pink") +
facet_wrap(~Variable, ncol = 3, scales = "free") +
labs(title = "Figure 4: Observed EMS Calls as a Function of Risk Factors", subtitle = "Santa Monica, CA") +
plotTheme()
We now visualize observed EMS calls over the past year to understand when they occur. It appears that calls peaked in spring, dropped during the summer, then picked up again in the fall.
# weekly
EMS_2020 %>%
mutate(count = 1) %>%
group_by(incident_date) %>%
summarize(EMS_Count = sum(count)) %>%
ungroup() %>%
ggplot(aes(incident_date, EMS_Count)) +
geom_area(fill = "pink") +
geom_line(color = "pink") +
labs(title="Figure 5: Observed EMS Calls, 2020",
subtitle="Santa Monica, CA",
x="Day", y="EMS Count") +
plotTheme() + theme(panel.grid.major = element_blank())
EMS calls are most frequent on weekdays, and they peak at 1pm and 3pm respectively on weekdays and weekends.
grid.arrange(
ggplot(EMS_2020 %>% mutate(hour = hour(interval60)))+
geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
labs(title="Day of Week",
x="Hour",
y="EMS Calls")+
plotTheme(),
ggplot(EMS_18_20 %>%
mutate(hour = hour(received),
weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")))+
geom_freqpoly(aes(hour, color = weekend), binwidth = 1)+
labs(title="Weekend vs Weekday, 2020",
x="Hour",
y="EMS Calls")+
plotTheme(),
ncol=1,
top = "Figure 6: EMS Calls by Hour and Day of the Week, 2020")
The day of week with the most calls is Tuesday.
ggplot(EMS_2020 %>%
group_by(dotw) %>%
tally())+
geom_line(aes(x = dotw, y = n, group = 1), color = "pink", size = 2)+
labs(title="Figure 7: EMS Calls by Day of the Week, 2020",
x="Day",
y="Number of Calls")+
plotTheme()
We now look at rush hours. There are times where more people are outside, and therefore there may be more traffic and accidents. Overall, call volume is similar across each rush period, although it is slightly lower in the morning and evening rush.
####AM/PM Rushes####
EMS_2020 %>%
mutate(time_of_day = case_when(hour(interval60) < 5 | hour(interval60) > 20 ~ "Overnight",
hour(interval60) >= 5 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 20 ~ "PM Rush"))%>%
group_by(interval60, reporting_district, time_of_day) %>%
tally()%>%
group_by(time_of_day, reporting_district,)%>%
summarize(mean_trips = mean(n))%>%
ggplot()+
geom_histogram(aes(mean_trips), binwidth = 1, fill = "pink")+
labs(title="Figure 8: Mean Number of Hourly Calls",
subtitle="Santa Monica, Nov, 2018",
x="Number of trips",
y="Frequency")+
facet_wrap(~time_of_day, nrow = 1)+
plotTheme()
# Select features for panel net
panel_net <- final_net %>%
dplyr::select(-tract, -harmfullanduse, -markings, -streetlights, -arthritis, -highBloodPressure, -bloodPressureMeds, -doctorCheckups, -cholesterolScreen, -colonScreen65, -clinicalServicesMen65, -dentalVisits, -highCholesterol, -kidneyDisease, -noPhysicalActivity, -mammograms50, -papTest, -poorPhysicalHealth, -sleepDeprived, -noTeeth, -row, -medianage_male, -medianage_female, -population, -pctWhite, -pctRenterOcc, -pctNoHS, -youthUnemploy, -pctMaleYouth, -streetlightissue.nn, -signandmarking.nn, -harmfullu.nn, -sport.nn, -uninsured, -cancer, -asthma, -heartDisease, -pulmonaryDisease, -clinicalServicesWomen65, -diabetes, -smoking, -poorMentalHealth, -obese, -stroke, -median_age, -medHHInc, -pctVacant, -violentCrime.nn, -trash.nn, -socialservice.nn, -intersection.nn)
dat_net <- st_join(EMS_2020.sf,
panel_net,
join=st_intersects,
left = TRUE) %>%
mutate(longitude = unlist(map(geometry, 1)),
latitude = unlist(map(geometry, 2))) %>%
as.data.frame()
Another factor we examine is Covid-19 cases. Figure 9 shows Covid-19 cases in Santa Monica with a 14 day lag. This is because Covid-19 cases do not become symptomatic until 14 days after exposure, therefore any cases involved in an EMS call were contracted around two weeks prior. These numbers are relatively low until a small peak in July. Cases then begin to increase rapidly starting in November. There doesn’t appear to be a correlation with EMS calls, however. This may be because Santa Monica has relatively few cases of Covid-19 compared to its total EMS calls.
covid <- st_read("/Users/annaduan/Documents/GitHub/EMS-Call-Forecasting/coronavirus.csv") %>%
#covid <- st_read("D:/Upenn/CPLN508/final project/EMS-Prediction-master/EMS-Prediction-master/coronavirus.csv") %>%
st_as_sf(coords = c("x", "y"), crs = 4326, agr = "constant") %>%
st_transform('EPSG:2225') %>%
mutate(date = as.POSIXct(date)) %>%
dplyr::select(date, confirmed_cases, new_cases, two_week_lag, geometry) %>%
rename(Case_Count = confirmed_cases) %>%
mutate(day = day(date),
week = week(date),
month = month(date),
Case_Count = as.numeric(Case_Count),
new_cases = as.numeric(new_cases),
two_week_lag = as.numeric(two_week_lag))
covid %>%
filter(two_week_lag >= 0) %>%
ggplot(., aes(date,two_week_lag)) +
geom_area(fill = "pink") +
geom_line(color = "pink") +
labs(title = "Figure 9: Covid-19 Cases with 14-day Lag, 2020", subtitle = "Santa Monica, CA", x="Date", y="New cases 2 weeks ago") +
plotTheme()
So far, we have found that Santa Monica has relatively high average income and good health, and that its EMS cases may have less to do with crime, poverty, and poor health than in other cities. This prompts us to consider another source of medical emergencies: motor accidents. One of the key temporal predictors of these accidents is weather, therefore we visualize weather conditions in Figure 10. Overall, Santa Monica has pleasant weather year round, although its precipitation and wind gust are highest during the earlier parts of the year when EMS calls also peaked. We examine this further in a later section.
library(ggplot2)
library(gridExtra)
#vsby: visibility; sknt:wind speed; gust:wind gust in knots
grid.arrange(
ggplot(weather.Panel, aes(interval60,Precipitation)) + geom_line(color = "pink") +
labs(title="Precipitation", x="Month", y="Precipitation") + plotTheme(),
ggplot(weather.Panel, aes(interval60,Wind_Speed)) + geom_line(color = "pink") +
labs(title="Wind Speed", x="Month", y="Wind Speed") + plotTheme(),
ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line(color = "pink") +
labs(title="Temperature", x="Month", y="Temperature") + plotTheme(),
ggplot(weather.Panel, aes(interval60,visibility)) + geom_line(color = "pink") +
labs(title="Visibility", x="Month", y="visibility") + plotTheme(),
ggplot(weather.Panel, aes(interval60,wind_gust)) + geom_line(color = "pink") +
labs(title="Wind Gust in Knots", x="Month", y="wind gust") + plotTheme(),
top="Figure 10: Weather Data, 2020 - Santa Monica, CA",
ncol=1)
To model the relationship between our spatial and temporal features and EMS calls, we now create the final dataset for our analysis: a ‘panel’. This panel summarizes our features for each unique space/time unit pair in our analysis (each 1 hour period from Jan/1/2020 to Dec/17/2020 x 234 fishnet grid cells).
library(dplyr)
#library(vctrs)
ACSTracts <- ACSTracts %>%
st_transform(crs=4326) %>%
st_as_sf() %>%
st_transform(st_crs(fishnet))
#STUDY.PANEL
study.panel <-
expand.grid(interval60=unique(dat_net$interval60),
uniqueID = unique(dat_net$uniqueID))
study.panel.rd <-
expand.grid(interval60=unique(EMS_2020$interval60),
reporting_district= unique(EMS_2020$reporting_district))
study.panel <-
dat_net %>%
mutate(EMS_Counter = 1) %>%
right_join(study.panel) %>%
group_by(interval60, uniqueID) %>%
summarize(EMS_Count = sum(EMS_Counter, na.rm=T)) %>%
left_join(weather.Panel, by = "interval60") %>%
ungroup() %>%
filter(is.na(uniqueID) == FALSE) %>%
mutate(week = week(interval60),
dotw = wday(interval60, label = TRUE))
study.panel.rd <-
EMS_2020 %>%
mutate(EMS_Counter = 1) %>%
right_join(st_drop_geometry(EMS_2020.sf)) %>%
group_by(interval60, reporting_district) %>%
summarize(EMS_Count = sum(EMS_Counter, na.rm=T)) %>%
left_join(weather.Panel, by = "interval60") %>%
ungroup() %>%
filter(is.na(reporting_district) == FALSE) %>%
mutate(week = week(interval60),
dotw = wday(interval60, label = TRUE))
This new panel allows us to look more closely at some of the temporal patterns around EMS calls. First, we visualize the effect of holidays on EMS calls. We expect that on major holidays such as Christmas Eve and Saint Patrick’s day, there is a higher density of people outdoors celebrating and likely more injuries as well. We identify a few of the most festive (and boozy) American holidays and plot them against EMS calls. It appears that there are indeed more calls on New Year’s, Mardi Gras, Saint Patrick’s Day, and Thanksgiving Eve.
#add holidays
thanksgivingEve <- as.POSIXct("2020-11-25 01:00:00")
laborDay <- as.POSIXct("2020-9-7 01:00:00")
july4 <- as.POSIXct("2020-7-4 01:00:00")
newYear <- as.POSIXct("2020-1-1 01:00:00")
mardiGras <- as.POSIXct("2020-2-25 01:00:00")
cincoDeMayo <- as.POSIXct("2020-5-5 01:00:00")
halloween <- as.POSIXct("2020-10-31 01:00:00")
stPattys <- as.POSIXct("2020-3-17 01:00:00")
holidayPlot <- study.panel %>%
group_by(interval60) %>%
summarize(EMS_Count = sum(EMS_Count)) %>%
ungroup()
holidayPlot %>%
ggplot(aes(interval60, EMS_Count)) + geom_line(colour = "pink") +
geom_vline(xintercept = thanksgivingEve, linetype = "dotted") +
geom_vline(xintercept = newYear, linetype = "dotted") +
geom_vline(xintercept = july4, linetype = "dotted") +
geom_vline(xintercept = laborDay, linetype = "dotted") +
geom_vline(xintercept = mardiGras, linetype = "dotted") +
geom_vline(xintercept = cincoDeMayo, linetype = "dotted") +
geom_vline(xintercept = stPattys, linetype = "dotted") +
geom_vline(xintercept = halloween, linetype = "dotted") +
labs(title="Figure 11: EMS Calls and Holidays, 2020",
subtitle="Dotted lines for Holidays",
x="Day", y="Trip Count") +
plotTheme() + theme(panel.grid.major = element_blank())
We also look more closely at temperature. Figure 12 suggests that on warmer days, there tend to be slightly fewer EMS calls.
study.panel %>%
filter(week < 10) %>%
group_by(interval60) %>%
summarize(EMS_Count = mean(EMS_Count),
Temperature = first(Temperature)) %>%
mutate(week = week(interval60)) %>%
ggplot(aes(Temperature, EMS_Count)) +
geom_point() + geom_smooth(method = "lm", se= FALSE, color = "pink", size = 2) +
facet_wrap(~week, ncol=3) +
labs(title="Figure 12: EMS Count as a Function of Temperature by Week",
x="Temperature", y="Mean EMS Count") +
plotTheme()
As seen in Figure 6, EMS call volume varies in consistent patterns throughout the day, indicating temporal clustering. We therefore make lag features to help us predict EMS demand at a given hour.
study.panel <-
study.panel %>%
arrange(uniqueID, interval60) %>%
mutate(lagHour = dplyr::lag(EMS_Count,1),
lag2Hours = dplyr::lag(EMS_Count,2),
lag3Hours = dplyr::lag(EMS_Count,3),
lag4Hours = dplyr::lag(EMS_Count,4),
lag12Hours = dplyr::lag(EMS_Count,12),
lag1day = dplyr::lag(EMS_Count,24)) %>%
mutate(day = yday(interval60))
study.panel.rd <-
study.panel.rd %>%
arrange(reporting_district, interval60) %>%
mutate(lagHour = dplyr::lag(EMS_Count,1),
lag2Hours = dplyr::lag(EMS_Count,2),
lag3Hours = dplyr::lag(EMS_Count,3),
lag4Hours = dplyr::lag(EMS_Count,4),
lag12Hours = dplyr::lag(EMS_Count,12),
lag1day = dplyr::lag(EMS_Count,24)) %>%
mutate(day = yday(interval60))
In Figure 13, we see that 1 hour lag appears to correlate with EMS count.
#lagHour / EMS Count plot
cor_lag <- na.omit(study.panel.rd[, c("interval60","reporting_district","EMS_Count", "lagHour")])
corlag.emsct <- na.omit(study.panel.rd[c(-1), c("interval60","EMS_Count")])%>%
mutate(ems_Count=EMS_Count)%>%
dplyr::select(-EMS_Count)%>%
subset(.,
interval60 >= as.POSIXct('2020-10-10 01:00:00 UTC') &
interval60 <= as.POSIXct('2020-11-29 01:00:00 UTC')
)
corlag.lag1ct <- na.omit(study.panel.rd[, c("interval60","lagHour")])%>%
mutate(ems_Count=lagHour)%>%
dplyr::select(-lagHour)%>%
subset(.,
interval60 >= as.POSIXct('2020-10-10 01:00:00 UTC') &
interval60 <= as.POSIXct('2020-11-29 01:00:00 UTC')
)
rbind(mutate(corlag.emsct, Legend = "EMS count"),
mutate(corlag.lag1ct, Legend = "EMS 1-hour lag count")) %>%
group_by(Legend, interval60) %>%
summarize(ems_Count = sum(ems_Count)) %>%
ungroup() %>%
ggplot(aes(interval60, ems_Count, colour = Legend)) + geom_line() +
scale_colour_manual(values = c("pink","cyan")) +
labs(title="Figure 13: EMS Calls and 1 Hour Lag, 2020",
subtitle="Santa Monica, CA",
x="Day", y="EMS Count") +
plotTheme() + theme(panel.grid.major = element_blank())
Table 1 confirms this: of all the lag features, lagHour correlates most strongly with EMS calls.
as.data.frame(study.panel) %>%
group_by(interval60) %>%
summarise_at(vars(starts_with("lag"), "EMS_Count"), mean, na.rm = TRUE) %>%
gather(Variable, Value, -interval60, -EMS_Count) %>%
mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
"lag12Hours","lag1day", "lay2day")))%>%
group_by(Variable) %>%
summarize(correlation = round(cor(Value, EMS_Count),3)) %>%
kable (caption="Table 1: Time Lag Features and Correlation", col.names = c('Time Lag', 'Correlation')) %>%
kable_styling()
Time Lag | Correlation |
---|---|
lagHour | 0.067 |
lag2Hours | 0.027 |
lag3Hours | 0.035 |
lag4Hours | 0.031 |
lag12Hours | 0.033 |
lag1day | 0.010 |
We also engineer features specific to the temporal trends we found in our exploratory analysis. These are isHoliday, isWeekend, and isPeakHour.
holidays <- c("2020-11-26", "2020-12-24", "2020-01-01", "2020-07-04", "2020-02-25", "2020-09-07", "2020-05-05", "2020-10-31", "2020-03-17", "2020-12-10")
study.panel <-
study.panel %>%
mutate(is1 = grepl("13:00:00", as.character(interval60), fixed = TRUE),
is3 = grepl("15:00:00", as.character(interval60), fixed = TRUE),
isPeakHour = ifelse(is1 == "TRUE" | is3 == "TRUE", 1, 0),
isWeekend = ifelse(dotw == "Sun" | dotw == "Sat", 1, 0),
isHoliday = ifelse(substring(as.character(interval60), 1, 10) %in% holidays, 1, 0))
Lastly, recall the spatial pattern of EMS calls that we saw in exploratory analysis. It is likely that EMS calls experience spatial as well temporal clustering, and we make features to account for this. We use a Local Moran’s I statistic to check if, and where, the clustering of EMS calls is greater than it would be if they were randomly distributed. In Figure 14, we see that of the fishnet grids with many EMS calls, only some of them have high clustering (Moran’s I). Next, we use the p-value of the Moran’s I value to identify areas with clustering that also have a high confidence level. This leaves us with two distinct hot spots in Santa Monica: one in the downtown area, and one in Midtown. These hotspots make sense: the downtown area has a beach, the Santa Monica pier, and Muscle Beach, three locations with high daytime occupancy and risk for accidents. Using this information, we make two new features which tell us whether a grid cell is in a hotspot, and a grid cell’s distance to the nearest hotspot.
##Prepare for moran's
library(spdep)
final_net.nb <- poly2nb(as_Spatial(final_net), queen=TRUE)
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)
final_net.localMorans <-
cbind(
as.data.frame(localmoran(final_net$countEMS, final_net.weights)),
as.data.frame(final_net)) %>%
st_sf() %>%
dplyr::select(EMS_Count = countEMS,
Local_Morans_I = Ii,
P_Value = `Pr(z > 0)`) %>%
mutate(Significant_Hotspots = ifelse(P_Value <= 0.0000001, 1, 0)) %>%
gather(Variable, Value, -geometry)
#Unique values (for mapping)
vars <- unique(final_net.localMorans$Variable)
varList <- list()
#map moran's I and P Value hotspots
for(i in vars){
varList[[i]] <-
ggplot() +
geom_sf(data = filter(final_net.localMorans, Variable == i),
aes(fill = Value), colour=NA) +
scale_fill_viridis(name="", option = "B") +
labs(title=i) +
mapTheme() + theme(legend.position="bottom")}
do.call(grid.arrange,c(varList, ncol = 4, top = "Figure 14: Local Moran's I Statistics, Observed EMS Calls"))
#feature engineering: is hotspot? + dist to hotspot
final_net <-
final_net %>%
mutate(EMS.isSig =
ifelse(localmoran(final_net$countEMS,
final_net.weights)[,5] <= 0.0000001, 1, 0)) %>%
mutate(EMS.isSig.dist =
nn_function(st_coordinates(st_centroid(final_net)),
st_coordinates(st_centroid(
filter(final_net, EMS.isSig == 1))), 1))
We now begin to build a series of linear regression models. First, we visualize the correlations between all of our final features. Figure 15 shows us that poverty rate and median income correlate strongly, therefore they may be co-linear and should not be included in a model together. The same goes for downtown distance and distance to nearest hotspot; percent white and median income; and day and week.
#Make final panel with spatial features
study.panel.net <- final_net %>%
dplyr::select(countEMS, uniqueID, responseTime, cvID, tract, servicerequest, Trash, Violent_Crime, bingeDrink, medHHInc, pctPoverty, ZIPCODE, bar.nn, downtown.nn, EMS.isSig.dist, pctWhite, medHHInc, median_age, poorPhysicalHealth, geometry) %>%
dplyr::mutate(latitude = sf::st_coordinates(st_centroid(.))[,1],
longitude = sf::st_coordinates(st_centroid(.))[,2]) %>%
left_join(study.panel, ., by= "uniqueID") %>%
na.omit() %>%
mutate(tract = as.factor(tract))
#Plot correlations
library(ggcorrplot)
ggcorrplot(round(cor(select_if(study.panel.net %>%
sample_n(., 20000), is.numeric) %>% na.omit()), 1),
p.mat = cor_pmat(select_if(study.panel.net, is.numeric) %>% na.omit()),
colors = c("dark blue", "white", "pink"),
type="lower",
insig = "blank") +
labs(title = "Figure 15: EMS Call Correlation with Numeric Variables") +
plotTheme()
To predict EMS calls across the time/space units in our panel, we first split our EMS data into a train and test set. We make four regression models:
1. Temporal features
2. Spatial features
3. Temporal + spatial features
4. Temporal + spatial + lag features
study.panel.net <- study.panel.net %>%
mutate(h_interval60 = hour(interval60))
#Split test/train sets
EMS.Train <- subset(study.panel.net,
interval60 >= as.POSIXct('2020-01-01') &
interval60 < as.POSIXct('2020-08-01')
) %>%
rename(Week = week)
EMS.Test <- subset(study.panel.net,
interval60 >= as.POSIXct('2020-8-01')) %>%
rename(Week = week)
#4 regressions
reg1_time <-
lm(EMS_Count ~ h_interval60 + dotw + Temperature + Precipitation + wind_gust + Wind_Speed + isWeekend + isHoliday + isPeakHour, data = EMS.Train)
reg2_space <-
lm(EMS_Count ~ uniqueID + Violent_Crime + bar.nn + downtown.nn, data = EMS.Train)
reg3_time_space <-
lm(EMS_Count ~ h_interval60 + dotw + Temperature + Precipitation + visibility + wind_gust + Wind_Speed + uniqueID + Violent_Crime + bar.nn + downtown.nn + isWeekend + isHoliday + isPeakHour, data = EMS.Train)
reg4_lag <- lm(EMS_Count ~ h_interval60 + dotw + Temperature + Precipitation + wind_gust + Wind_Speed + +isWeekend + isHoliday + isPeakHour + uniqueID + Violent_Crime + bar.nn + downtown.nn + lagHour + lag2Hours + lag4Hours + lag12Hours + lag1day, data = EMS.Train)
We evaluate the usefulness of these models using accuracy and generalizability as metrics. We test accuracy by calculating the Mean Absolute Error (MAE) in our predictions for each fishnet cell/hour pair. The lower the error, the more accurate our model. For generalizability, we check if our model predicts with equal accuracy across different spatial, temporal, and social contexts.
For a use case that concerns saving lives and responding to time-sensitive emergencies, it is critical that our model predicts with equal accuracy across social contexts such as race, income, and health levels.
##nesting
EMS.Test.weekNest <-
as.data.frame(EMS.Test) %>%
nest(-Week)
model_pred <- function(dat, fit){
pred <- predict(fit, newdata = dat)}
week_predictions <-
EMS.Test.weekNest %>%
mutate(Time = map(.x = data, fit = reg1_time, .f = model_pred),
Space = map(.x = data, fit = reg2_space, .f = model_pred),
Time_Space = map(.x = data, fit = reg3_time_space, .f = model_pred),
Time_Space_Lag = map(.x = data, fit = reg4_lag, .f = model_pred),) %>%
gather(Regression, Prediction, -data, -Week) %>%
mutate(Observed = map(data, pull, EMS_Count),
Absolute_Error = map2(Observed, Prediction, ~ abs(.x - .y)),
MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE),
)
Across our models, the lowest error is found in the temporal model (Figure 16). Our model consistently detects each demand peak, although it does not predict the full size of each peak every time (Figure 17). As a result, our average error is slightly high, as many space/time units have 0 observed calls.
#Plot: MAE by Model Spec and Week
week_predictions %>%
filter(Week == 40 | Week == 41 | Week == 42 | Week == 43) %>%
dplyr::select(Week, Regression, MAE) %>%
gather(Variable, MAE, -Regression, -Week) %>%
ggplot(aes(Week, MAE)) +
geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
scale_fill_manual(values = c("pink", "purple", "maroon", "magenta")) +
labs(title = "Figure 16: Mean Absolute Error by Model and Week") +
plotTheme()
#PLOT AGAINST ACTUAL SERIES
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
uniqueID = map(data, pull, uniqueID)) %>%
dplyr::select(interval60, uniqueID, Observed, Prediction, Regression) %>%
unnest() %>%
filter(interval60 >= as.POSIXct('2020-12-01') & interval60 < as.POSIXct('2020-12-15'),
Regression == "Time") %>%
gather(Variable, Value, -Regression, -interval60, -uniqueID) %>%
group_by(Regression, Variable, interval60) %>%
summarize(Value = sum(Value)) %>%
ggplot(aes(interval60, Value, colour=Variable)) +
geom_line(size = 1.1) +
facet_wrap(~Regression, ncol=1) +
labs(title = "Figure 17: Predicted/Observed EMS Call Time Series", subtitle = "Santa Monica, CA", x = "Hour", y= "EMS Calls") +
plotTheme()
While our model does not detect the full extent of each demand peak (low accuracy), generalizability is in some ways more important for this use case. When dispatching ambulances, it does not matter so much whether our model projects the exact number of calls in each fishnet cell. Rather, it must alert EMTs to the fishnet grids with the highest relative demand in the city, as this is what determines dispatch. Therefore, it is more important that our model predicts with comparable accuracy across different times and spaces. Further, it is critical that our model performs equally across different social contexts.
Figure 18 shows that our model has similar MAE across different hours of the day and week. This is a good sign of generalizability across time periods.
error.byWeek <-
filter(week_predictions, Regression == "Time_Space_Lag" & Week == 43) %>%
unnest() %>%
st_sf() %>%
dplyr::select(Absolute_Error, uniqueID, dotw, geometry) %>%
gather(Variable, Value, -dotw, -uniqueID, -geometry) %>%
group_by(uniqueID, dotw) %>%
summarize(MAE = mean(Value))
# Figure 1: Plot of error by time period
plot_week_pred <- week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
uniqueID = map(data, pull, uniqueID),
latitude = map(data, pull, latitude),
longitude = map(data, pull, longitude),
dotw = map(data, pull, dotw)) %>%
dplyr::select(interval60, uniqueID, longitude,
latitude, Observed, Prediction, Regression,
dotw) %>%
unnest() %>%
sample_n(., 500000) %>%
filter(Regression == "Time_Space_Lag")%>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))
ggplot(plot_week_pred)+
geom_point(aes(x= Observed, y = Prediction)) +
geom_abline(slope = 1, intercept = 0, color = "black") +
facet_grid(time_of_day ~ weekend)+
geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "pink")+
labs(title="Figure 18: Observed vs Predicted Across Time Periods",
x="Observed EMS Calls",
y="Predicted EMS Calls")+
plotTheme()
rm(plot_week_pred)
error.byHour <-
filter(week_predictions, Regression == "Time_Space_Lag" & Week == 35) %>%
unnest() %>%
st_sf() %>%
dplyr::select(uniqueID, Absolute_Error, interval60, geometry) %>%
gather(Variable, Value, -interval60, -uniqueID, -geometry) %>%
filter(wday(interval60, label = TRUE) == "Mon") %>%
group_by(hour = hour(interval60), uniqueID) %>%
summarize(MAE = mean(Value))
Figure 19, however, shows that while the average MAE of our model is around 0.015, a select few fishnet grids have error of up to 4x this amount.
#Figure 2: Plot of error by fishnet grid
error.byWeek %>%
ggplot(aes(uniqueID, MAE)) +
geom_bar(stat = "identity", fill = "pink", color = "pink") +
scale_fill_manual(values = palette5) +
labs(title = "Figure 19: Mean Absolute Error by Fishnet Grid", x = "Fishnet Grid") +
plotTheme()
Figure 20 confirms this. This is concerning, and it requires that we look more closely at where these large errors are occuring, and to whom.
# Figure 2: Map of MAE by time period
MAE_plot_map <- week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
uniqueID = map(data, pull, uniqueID),
latitude = map(data, pull, latitude),
longitude = map(data, pull, longitude),
dotw = map(data, pull, dotw) ) %>%
dplyr::select(interval60, uniqueID, longitude,
latitude, Observed, Prediction, Regression,
dotw) %>%
unnest() %>%
sample_n(., 10000) %>%
filter(Regression == "Time")%>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
group_by(uniqueID, weekend, time_of_day, longitude, latitude) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE)) %>%
ungroup() %>%
st_as_sf(coords = c("latitude","longitude"), crs = "EPSG:2225") %>%
st_transform( crs=4326) %>%
cbind(., st_coordinates(.))
ggplot(MAE_plot_map)+
geom_sf(data = ACS %>%
st_transform(crs=4326), colour = 'white', fill = "gray70")+
geom_sf(data = MAE_plot_map , aes(color = MAE),
fill = "transparent", size = 1, alpha = 1.5)+
scale_colour_viridis(direction = -1,
discrete = FALSE, option = "A") +
facet_grid(weekend~time_of_day)+
labs(title="Figure 20: Mean Absolute Error by Time Period and Fishnet Cell")+
mapTheme()
Figure 21 plots MAE as a function of income, age, race, and poor physical health. We see that areas with higher median income and share of White residents have slightly lower error. While the correlation between these factors and error is relatively weak, this is concerning because it means that our model has the potential to reinforce existing inequities in the way that EMS serves communities.
On the other hand, there is a slight negative correlation between median age and MAE. This may be a good sign, as older adults tend to require more medical care due to increased prevalence of chronic disease.
Finally, there is no significant correlation between MAE and percentage of residents who self report poor physical health. This is a very good sign. While income, age, and race can be proxies of health care needs and access, this outcome tells us that our model will not result in unequal EMS service for those who may need it most.
#pctWhite, medHHInc, median_age, poorPhysicalHealth
genByGroup <- week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
uniqueID = map(data, pull, uniqueID),
latitude = map(data, pull, latitude),
longitude = map(data, pull, longitude),
dotw = map(data, pull, dotw),
pctWhite = map(data, pull, pctWhite),
medHHInc = map(data, pull, medHHInc),
median_age = map(data, pull, median_age),
poorPhysicalHealth = map(data, pull, poorPhysicalHealth),) %>%
dplyr::select(interval60, uniqueID, longitude,
latitude, Observed, Prediction, Regression,
dotw, pctWhite, medHHInc, median_age, poorPhysicalHealth) %>%
unnest() %>%
sample_n(., 15000) %>%
filter(Regression == "Time_Space_Lag")%>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
filter(time_of_day == "AM Rush") %>%
group_by(uniqueID, pctWhite, medHHInc, median_age, poorPhysicalHealth) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
gather(-uniqueID, -MAE, key = "variable", value = "value")%>%
filter(MAE < 0.1)
genByGroup.cor <-
genByGroup %>%
group_by(variable) %>%
summarize(correlation = cor(value, MAE, use = "complete.obs"))
ggplot(genByGroup)+
geom_point(aes(x = value, y = MAE), alpha = 0.4)+
geom_text(data = genByGroup.cor, aes(label = paste("r = ", round(correlation,2))),
x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE, color = "pink")+
facet_wrap(~variable, scales = "free")+
labs(title="Figure 21: Error as a Function of Socio-economic Variables",
y="Mean Absolute Error (EMS Calls)")+
plotTheme()
We believe that fire stations and ambulance companies in Santa Monica and the nation more broadly can significantly improve their operating capacity and efficiency by adopting our algorithm.
Our algorithm consistently predicts hotspots with a high degree of accuracy, allowing us to alert EMTs to all incoming EMS demand surges. While it slightly under-predicts some of the peaks, it is more important that our model successfully identifies each peak because regardless of peak magnitude, it is relative demand surge that leads to a dispatch. More importantly, our algorithm achieves a high degree of generalizability across spatial, race, income, age, and health contexts. As our app will have a direct impact on EMS response times and therefore health outcomes across the city, it is critical that it performs equitably or else it will worsen already-existing inequities in healthcare access across income and race lines.
Our model is able to perform at a high level due to a range of temporal features such as time of day, holidays, and day of week. We did not see improvements in predictive power when we added spatial features, which was surprising as we expected that areas with high prevalence of chronic illnesses would also have high demand for EMS. We additionally were surprised to see that our lag features also diminished our model’s accuracy. In terms of improvements, we believe that our model’s accuracy would be increased if we had access to data on the type of emergency associated with each EMS call as there are likely temporal patterns behind each type of emergency. However, this data is confidential in Santa Monica, as is data on the level of priority that each call was triaged into.
In all, we are satisfied with the accuracy and generalizability of our algorithm. Particularly in small cities like Santa Monica which have relatively low EMS call volume, we are confident in recommending our algorithm as a data-driven solution to the problem of long EMS response times.
If your agency is interested in adopting our app Siren for your ambulance dispatching needs, please feel free to reach out to us for a free consultation:
Anna Duan: email
Bingchu Chen: email