Introduction

Accurate prediction of home sale prices is especially important now, as the real estate market is seeing record levels of activity during the pandemic. In this report, we construct a new hedonic model for Zillow’s housing market predictions. The task is challenging because of the number of factors that affect the real estate market and the non-linear relationships between many of those factors and prices. The model is built for Miami and Miami Beach, and we incorporate local intelligence from open-source data to adapt it to local housing and development patterns. We estimate sale prices from determinants of home value including internal characteristics, nearby amenities and dis-amenities, and spatial processes such as clustering. Applied to a set of 3,503 houses, the model predicts that sale prices are highest along the shoreline and in Miami Beach.
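As a framing, the standard hedonic specification decomposes the sale price of house $i$ into the implicit prices of its attributes (a general sketch; the exact set of predictors is developed in the Methods section):

$$SalePrice_i = \beta_0 + \sum_{k}\beta_k X_{ik} + \varepsilon_i$$

where $X_{ik}$ collects internal characteristics, amenity and dis-amenity measures, and neighborhood or spatial terms.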

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)  #global knit options (typical defaults)

####load libraries, etc
library(tidyverse)
library(sf)
library(spdep)
library(caret)
library(ckanr)
library(FNN)
library(grid)
library(knitr)
library(gridExtra)
library(ggcorrplot)
library(stargazer)
library(mapview)
library(osmdata)
library(tidycensus)
library(tidygeocoder)
library(raster)
library(rnaturalearth)
library(RColorBrewer)
library(rnaturalearthdata)
library(geosphere)

mapTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2)
  )
}

plotTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size=12),
    axis.title = element_text(size=12),
    axis.text = element_text(size=10),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic"),
    legend.text = element_text(colour = "black", face = "italic"),
    strip.text.x = element_text(size = 14)
  )
}

palette5 <- c("#c8ddfa", "#8cb5ed", "#5890db",   "#2868bd", "#023578")
paletteMap <- c("gray90","gray70","gray50","gray30","gray10")

qBr <- function(df, variable, rnd) {
  if (missing(rnd)) {
    as.character(quantile(round(df[[variable]],0),
                          c(.01,.2,.4,.6,.8), na.rm=T))
  } else if (rnd == FALSE) {
    #unrounded version: same quantile breaks, formatted to 3 significant digits
    as.character(formatC(quantile(df[[variable]],
                                  c(.01,.2,.4,.6,.8), na.rm=T),
                         digits = 3, format = "g"))
  }
}

q5 <- function(variable) {as.factor(ntile(variable, 5))}

#nearest neighbor function: mean distance from each measureFrom point to its k nearest measureTo points
nn_function <- function(measureFrom,measureTo,k) {
  measureFrom_Matrix <- as.matrix(measureFrom)
  measureTo_Matrix <- as.matrix(measureTo)
  nn <-
    get.knnx(measureTo_Matrix, measureFrom_Matrix, k)$nn.dist
  output <-
    as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    group_by(thisPoint) %>%
    summarize(pointDistance = mean(point_distance)) %>%
    arrange(as.numeric(thisPoint)) %>%
    dplyr::select(-thisPoint) %>%
    pull()

  return(output)
}

Data

For this analysis, we received a dataset of houses and their internal characteristics, including number of rooms, living area, and pools. In addition, we gathered open-source data from the American Community Survey, Miami-Dade County’s Open Data Hub, and OpenStreetMap. These sources provide census-tract-level demographic information as well as point and polygon layers of amenities such as restaurants and parks (Table 1). We expect attributes that contribute to quality of life, such as proximity to restaurants, park space, and short commute times, to correlate positively with sale prices.

#projected to NAD 1983 StatePlane Florida East FIPS 0901 Feet


#STUDY AREA
miamiBound <- st_read("E:/Upenn/CPLN508/miami/2_Miami-Prediction/Raw Data/Municipal_Boundary.geojson") %>%
#miamiBound <- st_read("/Users/annaduan/Documents/GitHub/2_Miami\ Prediction/Raw\ Data/Municipal_Boundary.geojson") %>%
  filter(NAME == "MIAMI BEACH" | NAME == "MIAMI") %>%
  st_union() %>%
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102658')



#STUDY AREA OSM (kept in lat/long so the OSM bounding-box query works)
miamiBoundOSM <- st_read("E:/Upenn/CPLN508/miami/2_Miami-Prediction/Raw Data/Municipal_Boundary.geojson") %>%
#miamiBoundOSM <- st_read("/Users/annaduan/Documents/GitHub/2_Miami\ Prediction/Raw\ Data/Municipal_Boundary.geojson") %>%
  filter(NAME == "MIAMI BEACH" | NAME == "MIAMI") %>%
  st_union()



#HOUSE DATA
houses <- st_read("E:/Upenn/CPLN508/miami/2_Miami-Prediction/Raw Data/studentsData.geojson") %>%
#houses <- st_read("/Users/annaduan/Documents/GitHub/2_Miami\ Prediction/Raw\ Data/studentsData.geojson") %>%
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102658') %>%
  st_centroid()




#HOUSE DATA OSM (Not projected)
housesOSM <- st_read("E:/Upenn/CPLN508/miami/2_Miami-Prediction/Raw Data/studentsData.geojson")
#housesOSM <- st_read("/Users/annaduan/Documents/GitHub/2_Miami\ Prediction/Raw\ Data/studentsData.geojson")



#CENSUS
census_api_key("d9ebfd04caa0138647fbacd94c657cdecbf705e9", install = TRUE, overwrite = TRUE)
#read in: vacant property, total housing units, mhhinc, white, population, owner occ, renter occ, travel time to work, monthly housing cost
acs <-
  get_acs(geography = "tract", variables = c("B25002_003E", "B25001_001E", "B19013_001E", "B01001A_001E", "B01003_001E", "B07013_002E", "B07013_003E", "B08012_001E", "B25104_001E"), year=2018, state=12, county=086, geometry=T) %>%
  st_transform('ESRI:102658')
#filter for Miami/Miami beach tracts
acs <-
  rbind(
    st_centroid(acs)[miamiBound,] %>%
      st_drop_geometry() %>%
      left_join(acs) %>%
      st_sf() %>%
      mutate(inMiami = "YES"),
    st_centroid(acs)[miamiBound, op = st_disjoint] %>%
      st_drop_geometry() %>%
      left_join(acs) %>%
      st_sf() %>%
      mutate(inMiami = "NO")) %>%
  filter(inMiami == "YES") %>%
  dplyr::select(-inMiami)
#long to wide form
acs <-
  acs %>%
  dplyr::select(-moe, -GEOID) %>%
  spread(variable, estimate) %>%
  dplyr::select(-geometry) %>%
  rename(vacantUnits = B25002_003,
         totalUnits = B25001_001,
         medHHInc = B19013_001,
         white = B01001A_001,
         population = B01003_001,
         ownerOcc = B07013_002,
         renterOcc = B07013_003,
         timeToWork = B08012_001,
         monthhousingcost = B25104_001)
acs["119", "medHHInc"] = 33194   #input value from nearby tract in same neighborhod because NA was messing up MAE
acs %>% na.omit()


#mutate
acs <-
  acs %>%
  mutate(pctVacant = ifelse(totalUnits > 0, vacantUnits / totalUnits, 0),
         pctWhite = ifelse(population > 0, white / population, 0),
         totalOcc = ownerOcc + renterOcc,
         pctRenterOcc = ifelse(totalOcc > 0, renterOcc / totalOcc, 0)) %>%
  dplyr::select(-totalUnits, -vacantUnits, -population, -white, -ownerOcc, -renterOcc, -totalOcc)


#OSM BBOX (uses the non-projected base)
xmin = st_bbox(miamiBoundOSM)[[1]]
ymin = st_bbox(miamiBoundOSM)[[2]]
xmax = st_bbox(miamiBoundOSM)[[3]]  
ymax = st_bbox(miamiBoundOSM)[[4]]



#FOOD AND BEVERAGE SPOTS
 foodBev <- opq(bbox = c(xmin, ymin, xmax, ymax)) %>%
  add_osm_feature(key = 'amenity', value = c("bar","pub","restaurant","cafe")) %>%
   osmdata_xml(filename = 'foodBev.osm')
 #project (OSM points arrive in WGS 84 / EPSG:4326)
 foodBev <- sf::st_read('foodBev.osm', layer = 'points') %>%
     st_as_sf(coords = c("LON", "LAT"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102658')
 #filter for facilities in study area
 foodBev <- rbind(
    st_centroid(foodBev)[miamiBound,] %>%
      st_drop_geometry() %>%
      left_join(foodBev) %>%
      st_sf() %>%
      mutate(inMiami = "YES"),
    foodBev[miamiBound, op = st_disjoint] %>%
      st_drop_geometry() %>%
      left_join(foodBev) %>%
      st_sf() %>%
      mutate(inMiami = "NO")) %>%
  filter(inMiami == "YES") %>%
   dplyr::select(name)



#COASTLINE
Coastline<-opq(bbox = c(xmin, ymin, xmax, ymax)) %>%
  add_osm_feature("natural", "coastline") %>%
  osmdata_sf()
#add to housesOSM and convert to miles, then add to houses
housesOSM <-
  housesOSM %>%  
  mutate(CoastDist=(geosphere::dist2Line(p=st_coordinates(st_centroid(housesOSM)),
                                        line=st_coordinates(Coastline$osm_lines)[,1:2])*0.00062137)[,1])
houses <-
  houses %>%
  mutate(distWater = housesOSM$CoastDist,
         SPSqFt = ifelse(!is.na(ActualSqFt)&!is.na(SalePrice), SalePrice / ActualSqFt, 0))



#PARKS
muniParks <- st_read("https://opendata.arcgis.com/datasets/16fe02a1defa45b28bf14a29fb5f0428_0.geojson") %>%
  st_as_sf(coords = c("LON", "LAT"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102658') %>%
  dplyr::select(NAME, ADDRESS, CITY, CLASS, Shape__Area)

countyParks <- st_read("https://opendata.arcgis.com/datasets/aca1e6ff0f634be282d50cc2d534a832_0.geojson") %>%
  st_as_sf(coords = c("LON", "LAT"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102658') %>%
    dplyr::select(NAME, ADDRESS, CITY, CLASS, Shape__Area)
parks <- bind_rows(muniParks, countyParks) %>%
  filter(CITY == "Miami" | CITY == "Miami Beach") %>%
  mutate(counter = 1)



#SCHOOL DISTRICT
schoolDist <- st_read("https://opendata.arcgis.com/datasets/bc16a5ebcdcd4f3e83b55c5d697a0317_0.geojson") %>%
  st_as_sf(coords = c("LON", "LAT"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102658') %>%
  dplyr::select(ID)



#PUBLIC SCHOOL CATCHMENT/ATTENDANCE ZONES
#elementary
elementary <- st_read("https://opendata.arcgis.com/datasets/19f5d8dcd9714e6fbd9043ac7a50c6f6_0.geojson") %>%
  st_as_sf(coords = c("LON", "LAT"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102658')
elementary <- rbind(
  st_centroid(elementary)[miamiBound,] %>%
    st_drop_geometry() %>%
    left_join(elementary) %>%
    st_sf() %>%
    mutate(inMiami = "YES"),
  st_centroid(elementary)[miamiBound, op = st_disjoint] %>%
    st_drop_geometry() %>%
    left_join(elementary) %>%
    st_sf() %>%
    mutate(inMiami = "NO")) %>%
  filter(inMiami == "YES") %>%
  dplyr::select(NAME)
#middle
middle <- st_read("https://opendata.arcgis.com/datasets/dd2719ff6105463187197165a9c8dd5c_0.geojson") %>%
  st_as_sf(coords = c("LON", "LAT"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102658')
middle <- rbind(
  st_centroid(middle)[miamiBound,] %>%
    st_drop_geometry() %>%
    left_join(middle) %>%
    st_sf() %>%
    mutate(inMiami = "YES"),
  st_centroid(middle)[miamiBound, op = st_disjoint] %>%
    st_drop_geometry() %>%
    left_join(middle) %>%
    st_sf() %>%
    mutate(inMiami = "NO")) %>%
  filter(inMiami == "YES") %>%
  dplyr::select(NAME)
#high
high <- st_read("https://opendata.arcgis.com/datasets/9004dbf5f7f645d493bfb6b875a43dc1_0.geojson") %>%
  st_as_sf(coords = c("LON", "LAT"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102658')
high <- rbind(
  st_centroid(high)[miamiBound,] %>%
    st_drop_geometry() %>%
    left_join(high) %>%
    st_sf() %>%
    mutate(inMiami = "YES"),
  st_centroid(high)[miamiBound, op = st_disjoint] %>%
    st_drop_geometry() %>%
    left_join(high) %>%
    st_sf() %>%
    mutate(inMiami = "NO")) %>%
  filter(inMiami == "YES") %>%
  dplyr::select(NAME)



#PUBLIC TRANSPORTATION
#bus
bus <- st_read("https://opendata.arcgis.com/datasets/021adadcf6854f59852ff4652ad90c11_0.geojson") %>%
  st_as_sf(coords = c("LON", "LAT"), crs = 4326, agr = "constant")  %>%
  st_transform('ESRI:102658')
bus <- rbind(
  bus[miamiBound,] %>%
    st_drop_geometry() %>%
    left_join(bus) %>%
    st_sf() %>%
    mutate(inMiami = "YES"),
  bus[miamiBound, op = st_disjoint] %>%
    st_drop_geometry() %>%
    left_join(bus) %>%
    st_sf() %>%
    mutate(inMiami = "NO")) %>%
  filter(inMiami == "YES")
#metro mover
metromover <- st_read("https://opendata.arcgis.com/datasets/aec76104165c4e879b9b0203fa436dab_0.geojson") %>%
  st_as_sf(coords = c("LON", "LAT"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102658')
metromover <- rbind(
  metromover[miamiBound,] %>%
    st_drop_geometry() %>%
    left_join(metromover) %>%
    st_sf() %>%
    mutate(inMiami = "YES"),
  metromover[miamiBound, op = st_disjoint] %>%
    st_drop_geometry() %>%
    left_join(metromover) %>%
    st_sf() %>%
    mutate(inMiami = "NO")) %>%
  filter(inMiami == "YES")
#metro rail
metrorail <- st_read("https://opendata.arcgis.com/datasets/ee3e2c45427e4c85b751d8ad57dd7b16_0.geojson") %>%
  st_as_sf(coords = c("LON", "LAT"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102658')
metrorail <- rbind(
  metrorail[miamiBound,] %>%
    st_drop_geometry() %>%
    left_join(metrorail) %>%
    st_sf() %>%
    mutate(inMiami = "YES"),
  metrorail[miamiBound, op = st_disjoint] %>%
    st_drop_geometry() %>%
    left_join(metrorail) %>%
    st_sf() %>%
    mutate(inMiami = "NO")) %>%
  filter(inMiami == "YES")



#CULTURE SPOTS
culture <- st_read("https://opendata.arcgis.com/datasets/70c48f0eb067448c8a787cfa1c1c3bb9_0.geojson") %>%
  st_as_sf(coords = c("LON", "LAT"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102658')
culture <- rbind(
  culture[miamiBound,] %>%
    st_drop_geometry() %>%
    left_join(culture) %>%
    st_sf() %>%
    mutate(inMiami = "YES"),
  culture[miamiBound, op = st_disjoint] %>%
    st_drop_geometry() %>%
    left_join(culture) %>%
    st_sf() %>%
    mutate(inMiami = "NO")) %>%
  filter(inMiami == "YES")



#COMMERCIAL PROPERTIES
#read, project
commercial <- st_read("https://opendata.arcgis.com/datasets/fb8303c577c24ea386a91be7329842be_0.geojson") %>%
  st_as_sf(coords = c("LON", "LAT"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102658')
#filter
commercial <- rbind(
  commercial[miamiBound,] %>%
    st_drop_geometry() %>%
    left_join(commercial) %>%
    st_sf() %>%
    mutate(inMiami = "YES"),
  commercial[miamiBound, op = st_disjoint] %>%
    st_drop_geometry() %>%
    left_join(commercial) %>%
    st_sf() %>%
    mutate(inMiami = "NO")) %>%
  filter(inMiami == "YES")



#FLOOD RISK ZONES
floodRisk <- st_read("https://opendata.arcgis.com/datasets/ef3bdd041b2e424695eb4dfe965966c4_0.geojson") %>%
  st_as_sf(coords = c("LON", "LAT"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102658')
#filter
 floodRisk <-
   rbind(
  st_centroid(floodRisk)[miamiBound,] %>%
    st_drop_geometry() %>%
    left_join(floodRisk) %>%
    st_sf() %>%
    mutate(inMiami = "YES"),
  st_centroid(floodRisk)[miamiBound, op = st_disjoint] %>%
    st_drop_geometry() %>%
    left_join(floodRisk) %>%
    st_sf() %>%
    mutate(inMiami = "NO")) %>%
  filter(inMiami == "YES") %>%
   dplyr::select(-inMiami, -SHAPE_Length, -ELEV, -FID) %>%
   dplyr::rename(FloodZone = FZONE, FloodHazard = ZONESUBTY)


 floodInsure <- st_read("https://opendata.arcgis.com/datasets/f589473ddada46e78d437aaf09205b04_0.geojson") %>%
  st_as_sf(coords = c("LON", "LAT"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102658')
#filter
 floodInsure <-
   rbind(
  st_centroid(floodInsure)[miamiBound,] %>%
    st_drop_geometry() %>%
    left_join(floodInsure) %>%
    st_sf() %>%
    mutate(inMiami = "YES"),
  st_centroid(floodInsure)[miamiBound, op = st_disjoint] %>%
    st_drop_geometry() %>%
    left_join(floodInsure) %>%
    st_sf() %>%
    mutate(inMiami = "NO")) %>%
  filter(inMiami == "YES") %>%
   mutate(floodInsureType = PANELID)



#CONTAMINATED SITES
contaminated <- st_read("https://opendata.arcgis.com/datasets/43750f842b1e451aa0347a2ca34a61d7_0.geojson") %>%
  st_as_sf(coords = c("LON", "LAT"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102658')
 contaminated <-
   rbind(
  st_centroid(contaminated)[miamiBound,] %>%
    st_drop_geometry() %>%
    left_join(contaminated) %>%
    st_sf() %>%
    mutate(inMiami = "YES"),
  st_centroid(contaminated)[miamiBound, op = st_disjoint] %>%
    st_drop_geometry() %>%
    left_join(contaminated) %>%
    st_sf() %>%
    mutate(inMiami = "NO")) %>%
  filter(inMiami == "YES")
#CONTAMINATION BUFFER

 contamBuffer <- contaminated %>%
   st_buffer(800) %>%
   st_union() %>%
   st_as_sf() %>%
   mutate(contam = 1)
 houses$contaminated <- houses %>%
   st_join(contamBuffer) %>%
   mutate(contam = ifelse(is.na(contam), 0, 1)) %>%
   pull(contam)



#NEAREST NEIGHBOR (some are used for testing, to determine feature buffer distances)
 st_c <- st_coordinates
 houses <-
   houses %>%
   mutate(
     #commercial properties NN
     commNN1 = nn_function(st_c(st_centroid(houses)), st_c(st_centroid(commercial)), 1),
     commNN5 = nn_function(st_c(st_centroid(houses)), st_c(st_centroid(commercial)), 5),
     #metro mover stations
     metroMNN1 = nn_function(st_c(st_centroid(houses)), st_c(metromover), 1),
     metroMNN5 = nn_function(st_c(st_centroid(houses)), st_c(metromover), 5),
     #metro rail stations
     metroRNN1 = nn_function(st_c(st_centroid(houses)), st_c(metrorail), 1),
     metroRNN5 = nn_function(st_c(st_centroid(houses)), st_c(metrorail), 5),
     #food/drinks
     foodBevNN1 = nn_function(st_c(st_centroid(houses)), st_c(foodBev), 1),
     foodBevNN5 = nn_function(st_c(st_centroid(houses)), st_c(foodBev), 5)
     ) 


#COMMERCIAL BUFFER
 commercial <- commercial %>%
   mutate(counter = 1) %>%
   dplyr::select(counter)
 #count properties within each buffer
houses$commercialProperties <-
   st_buffer(houses, 846) %>%
   aggregate(commercial, ., sum) %>%
   st_drop_geometry() %>%
  mutate(counter = ifelse(is.na(counter), 0, counter)) %>%
   pull(counter)



#FOOD AND BEV BUFFER
 foodBev <- foodBev %>%
   mutate(counter = 1) %>%
   dplyr::select(counter)
 #count food and beverage spots within each buffer
houses$foodEstablishments <-
   st_buffer(houses, 2774) %>%
   aggregate(foodBev, ., sum) %>%
   st_drop_geometry() %>%
    mutate(counter = ifelse(is.na(counter), 0, counter)) %>%
   pull(counter)



#CULTURE BUFFER
 culture <- culture %>%
   mutate(counter = 1) %>%
   dplyr::select(counter)
 #count culture within each buffer
houses$cultureSpots <-
   st_buffer(houses, 774) %>%
   aggregate(culture, ., sum) %>%
   st_drop_geometry() %>%
   mutate(counter = ifelse(is.na(counter), 0, counter)) %>%
   pull(counter)




#METRORAIL BUFFER
 metrorail <- metrorail %>%
   mutate(counter = 1) %>%
   dplyr::select(counter)
 #count stops within each buffer
houses$metrorailStops <-
   st_buffer(houses, 12076.7) %>%
   aggregate(metrorail, ., sum) %>%
   st_drop_geometry() %>%
  mutate(counter = ifelse(is.na(counter), 0, counter)) %>%
   pull(counter)



#METROMOVER BUFFER
 metromover <- metromover %>%
   mutate(counter = 1) %>%
   dplyr::select(counter)
 #count metroM stops within each buffer
houses$metromoverStops <-
   st_buffer(houses, 18845) %>%
   aggregate(metromover, ., sum) %>%
   st_drop_geometry() %>%
   mutate(counter = ifelse(is.na(counter), 0, counter)) %>%
   pull(counter)



#BUS BUFFER
 bus <- bus %>%
   mutate(counter = 1) %>%
   dplyr::select(counter)
 #count bus within each buffer
houses$busStops <-
   st_buffer(houses, 775) %>%
   aggregate(bus, ., sum) %>%
   st_drop_geometry() %>%
  mutate(counter = ifelse(is.na(counter), 0, counter)) %>%
   pull(counter)



 #PARKS BUFFER + AREA CALCULATION (using 1600ft buffer distance because the mean NN1 = 1600)
 #get centroids
 parkCentroids <- parks %>%
   st_centroid() %>%    #get centroids of park layer
   dplyr::select(counter)
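 #quick check (sketch): the 1600 ft radius above should roughly equal the mean
 #distance from each house to its nearest park centroid
 mean(nn_function(st_c(st_centroid(houses)), st_c(parkCentroids), 1))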
 #count parks within each buffer
houses$parkCount <-
   st_buffer(houses, 1600) %>%
   aggregate(parkCentroids, ., sum) %>%
   st_drop_geometry() %>%
   mutate(counter = ifelse(is.na(counter), 0, counter)) %>%
   pull(counter)
#make buffer for each house
parkBuffer <- st_buffer(houses, 1600) %>%
  dplyr::select(Property.Address) %>%
  st_as_sf()
#calculate area of park space in each buffer
bufferedParks <- st_intersection(parkBuffer, parks) %>%
  group_by(Property.Address) %>%
  summarise() %>%
  mutate(parkArea = units::drop_units(st_area(.))) %>%
  st_drop_geometry()
#add park area back to houses file
houses <-
  left_join(houses, bufferedParks)



#SCHOOL CATCHMENT CATEGORIES
 houses <-
   st_join(houses, elementary) %>%
   rename(elemCatch = 'NAME')

  houses <-
   st_join(houses, middle) %>%
   rename(middleCatch = 'NAME')

   houses <-
   st_join(houses, high) %>%
   rename(highCatch = 'NAME')



 #SCHOOL DISTRICT CATEGORIES
 houses <-
   st_join(houses, schoolDist) %>%
   rename(schoolDist = ID)


 #FLOOD INSURANCE CATEGORIES
floodInsure <- floodInsure %>%
  dplyr::select(floodInsureType)
houses <- houses %>%
  st_join(., floodInsure) %>%
  mutate(floodInsureType = ifelse(is.na(floodInsureType), "other", floodInsureType))



#ADD ACS DATA
houses <-
  st_join(houses, acs)




 #HOUSE AGE
houses <-
   houses %>%
   mutate(age = ifelse(is.na(YearBuilt), 0, (2020 - YearBuilt)))



 #MAKE CATEGORICAL VARIABLES
 houses <-
  houses %>%
  mutate(Bed.cat = case_when(
                  Bed >= 0 & Bed < 3  ~ "Up to 2 Beds",
                  Bed >= 3 & Bed < 4  ~ "3 Beds",
                  Bed >= 4                    ~ "4+ Beds"))


 houses <-
  houses %>%
  mutate(Bath.cat = case_when(
                  Bath < 2  ~ "Up to 1 Bathroom",
                  Bath == 2  ~ "2 Bathrooms",
                  Bath >= 3                    ~ "3+ Bathrooms"))


 houses <-
  houses %>%
  mutate(Stories.cat = case_when(
                  Stories < 2  ~ "Up to 1 Stories",
                  Stories == 2  ~ "2 Stories",
                  Stories >= 3                    ~ "3+ Stories"))



 houses <-
  houses %>%
  mutate(distWater.cat = case_when(
                  distWater < 0.25  ~ "Less than 1/4 Mile",
                  distWater >= 0.25   ~ "More than 1/4 Mile"))


 houses <-
  houses %>%
  mutate(parkArea.cat = case_when(
                  parkArea < 57000  ~ "Less than 57,000 SqFt",
                  parkArea >= 57000 & parkArea < 320000  ~ "57,000 - 320,000 SqFt",
                  parkArea >= 320000              ~ "More than 320,000 SqFt"))



 #OTHER HOUSE FEATURES
houses <- houses %>%
  mutate(Pool = ifelse(str_detect(XF1, "Pool") | str_detect(XF2, "Pool") | str_detect(XF3, "Pool") | str_detect(XF1, "Whirlpool") | str_detect(XF2, "Whirlpool") | str_detect(XF3, "Whirlpool") | str_detect(XF1, "Jacuzzi") | str_detect(XF2, "Jacuzzi") | str_detect(XF3, "Jacuzzi"), "Pool", "No Pool"),
         Patio = ifelse(str_detect(XF1, "Patio") | str_detect(XF2, "Patio") | str_detect(XF3, "Patio"), "Patio", "No Patio"),
         Fence = ifelse(str_detect(XF1, "Fence") | str_detect(XF2, "Fence") | str_detect(XF3, "Fence"), "Fence", "No Fence"),
         Gazebo = ifelse(str_detect(XF1, "Gazebo") | str_detect(XF2, "Gazebo") | str_detect(XF3, "Gazebo"), "Gazebo", "No Gazebo"),
         Carport = ifelse(str_detect(XF1, "Carport") | str_detect(XF2, "Carport") | str_detect(XF3, "Carport"), "Carport", "No Carport"),
         Wall = ifelse(str_detect(XF1, "Wall") | str_detect(XF2, "Wall") | str_detect(XF3, "Wall"), "Wall", "No Wall"),
         Dock = ifelse(str_detect(XF1, "Dock") | str_detect(XF2, "Dock") | str_detect(XF3, "Dock"), "Dock", "No Dock"),
         )

#FIX NA VALUES
#zip
 houses <-
  houses %>%
  mutate(Mailing.Zip = as.numeric(Mailing.Zip),
         Mailing.Zip = ifelse(is.na(Mailing.Zip), 0, Mailing.Zip))

#School dist
 houses <-
   houses %>%
  mutate(elemCatch = ifelse(is.na(elemCatch), "other", elemCatch),
         middleCatch = ifelse(is.na(middleCatch), "other", middleCatch),
         highCatch = ifelse(is.na(highCatch), "other", highCatch),
         )



#park
 houses <- houses %>%
     mutate(parkArea = ifelse(is.na(parkArea), 0, parkArea))

 #acs - here I found the location of NA values and gave these houses the ACS values of houses nearby/in same neighborhood
houses["3487", "NAME"] = houses["1099", "NAME"]
houses["3487","timeToWork"] =  houses["1099","timeToWork"]
houses["3487", "medHHInc"] = houses["1099", "medHHInc"]
houses["3487", "monthhousingcost"] = houses["1099", "monthhousingcost"]
houses["3487", "pctVacant"] = houses["1099", "pctVacant"]
houses["3487", "pctWhite"] = houses["1099", "pctWhite"]
houses["3487", "pctRenterOcc"] = houses["1099", "pctRenterOcc"]


houses[c("579","580","581","1016","1372","1557","1853","2140","2557","2563","2571","2603","2786","2981","3050","3361"), "NAME"] = houses["3458", "NAME"]
houses[c("579","580","581","1016","1372","1557","1853","2140","2557","2563","2571","2603","2786","2981","3050","3361"), "timeToWork"] = houses["3458", "timeToWork"]
houses[c("579","580","581","1016","1372","1557","1853","2140","2557","2563","2571","2603","2786","2981","3050","3361"), "medHHInc"] = houses["3458", "medHHInc"]
houses[c("579","580","581","1016","1372","1557","1853","2140","2557","2563","2571","2603","2786","2981","3050","3361"), "monthhousingcost"] = houses["3458", "monthhousingcost"]
houses[c("579","580","581","1016","1372","1557","1853","2140","2557","2563","2571","2603","2786","2981","3050","3361"), "pctVacant"] = houses["3458", "pctVacant"]
houses[c("579","580","581","1016","1372","1557","1853","2140","2557","2563","2571","2603","2786","2981","3050","3361"), "pctWhite"] = houses["3458", "pctWhite"]
houses[c("579","580","581","1016","1372","1557","1853","2140","2557","2563","2571","2603","2786","2981","3050","3361"), "pctRenterOcc"] = houses["3458", "pctRenterOcc"]

houses["441", c("NAME","timeToWork","medHHInc","monthhousingcost","pctVacant","pctWhite","pctRenterOcc")] = houses["2090", c("NAME","timeToWork","medHHInc","monthhousingcost","pctVacant","pctWhite","pctRenterOcc")]

houses <- houses %>%
  dplyr::select(-Property.Zip)
#table of summary statistics with variable descriptions, sorted by category

housesSub <- houses %>%
  dplyr::select("AdjustedSqFt", "LotSize", "Bed", "Bath", "Stories", "commercialProperties", "age", "distWater", "foodEstablishments", "cultureSpots", "busStops", "parkArea", "timeToWork","monthhousingcost","pctVacant")

housesSub <- st_drop_geometry(housesSub)
stargazer(as.data.frame(housesSub), type="text", digits=1, title="Table 1: Descriptive Statistics for Miami Houses", out = "Miami Data.txt")
## 
## Table 1: Descriptive Statistics for Miami Houses
## =================================================================================
## Statistic              N     Mean    St. Dev.   Min  Pctl(25) Pctl(75)     Max   
## ---------------------------------------------------------------------------------
## AdjustedSqFt         3,503  2,082.2   1,426.0   331   1,274     2,379    18,006  
## LotSize              3,503  7,657.8   4,401.4  1,250  5,500     8,025    80,664  
## Bed                  3,503    3.0       1.1      0      2         4        13    
## Bath                 3,503    2.1       1.3      0      1         3        12    
## Stories              3,503    1.2       0.4      0      1         1         4    
## commercialProperties 3,503   15.6      26.1      0      0        21        203   
## age                  3,503   69.2      22.5      1      67       82        115   
## distWater            3,503    1.1       1.3    0.000   0.1       1.6       5.4   
## foodEstablishments   3,503    2.9       7.8      0      0         3        120   
## cultureSpots         3,503   0.01       0.1      0      0         0         1    
## busStops             3,503    2.2       2.3      0      0         4        13    
## parkArea             3,503 318,639.1 664,530.2   0      0     272,049.5 4,625,338
## timeToWork           3,503  2,347.2   1,200.6   476   1,480     3,037     5,962  
## monthhousingcost     3,503  1,888.9    807.6    390   1,316     2,421     5,078  
## pctVacant            3,503    0.2       0.1     0.0    0.1       0.2       0.5   
## ---------------------------------------------------------------------------------

First, we use a correlation matrix (Figure 1) to gauge how each variable correlates with sale price. The variables with positive and negative correlations are listed below:

Positive:
- AdjustedSqFt
- Bath
- Bed
- Stories
- Food establishments (Slight)
- Park Area
- Median household income

Negative:
- Time to work
- Monthly housing cost
- Age
- Percent renter occupancy

corrPlotVars <- houses %>%
  dplyr::select(-saleDate, -saleType, -saleYear, -Bldg, -Land, -Assessed, -WVDB, -HEX, -GPAR, -County.2nd.HEX, -County.Senior, -County.LongTermSenior, -County.Other.Exempt, -County.Taxable, -City.2nd.HEX, -City.Senior, -City.LongTermSenior, -City.Other.Exempt, -City.Taxable, -MillCode, -Owner1, -Owner2, -Mailing.State, -Mailing.Country, -Legal1, -Legal2, -Legal3, -Legal4, -Legal5, -Legal6, -YearBuilt, -EffectiveYearBuilt, -toPredict)

numericVars <-
  select_if(st_drop_geometry(corrPlotVars), is.numeric) %>% na.omit()

ggcorrplot(
  round(cor(numericVars), 1),
  p.mat = cor_pmat(numericVars),
  colors = c("#25CB10", "white", "#FA7800"),
  type="lower",
  insig = "blank") +  
    labs(title = "Figure 1: Sale Price Correlation with Numeric Variables") +
  plotTheme()

Four variables that we explore further are distWater, parkArea, pctVacant, and busStops. We expect higher distWater and pctVacant values to correlate with lower sale prices, and higher parkArea and busStops values to correlate with higher sale prices.

As expected, proximity to the shoreline (low distWater) is desirable and correlates with higher sale prices. Figure 2.1 shows, however, that this trend is strongest when distWater is under one mile, which suggests feature-engineering distWater to capture the stronger relationship at short distances. The amount of park area near a house is positively correlated with sale price, although not as strongly as expected (Figure 2.2).
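One quick way to probe the distWater relationship (a sketch of our own, using only objects created above) is to compare the correlation of sale price with raw versus log-transformed shoreline distance; the log form weights variation near the shore more heavily. The quarter-mile distWater.cat variable created earlier is a binned alternative to the same idea.

#sketch: compare sale price correlations with raw vs. log-transformed shoreline distance
obsHouses <- dplyr::filter(houses, toPredict == 0)   #houses with observed sale prices
cor(obsHouses$SalePrice, obsHouses$distWater, use = "complete.obs")
cor(obsHouses$SalePrice, log(obsHouses$distWater + 0.01), use = "complete.obs")   #small offset avoids log(0)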

Surprisingly, the share of a census tract’s housing units that are vacant correlates positively with sale price (Figure 2.3). This is unexpected because vacancy typically signals neighborhood disorder and unattractiveness. However, it is possible that these vacancies reflect new construction and therefore do not make the area less attractive.

Bus stops also defy our expectations: according to transit-oriented development (TOD) theory, homes near transit should be more attractive. However, as Figure 2.4 shows, the number of nearby bus stops correlates negatively with sale price. This may be because lower-income households tend to rely more on public transit.

housesKnown <- houses %>%    
  filter(.,toPredict == 0)
housesUnknown <- houses %>%
  filter(.,toPredict ==1)

#1: distWater
ggplot(data = housesKnown, aes(x = distWater, y = SalePrice)) +
  geom_point(size=2, shape=20)  +
  labs(title = "Figure 2.1: Distance to Shoreline and Sale Price", subtitle = "Miami and Miami Beach, FL") +
  geom_smooth(method = "lm", se=F, colour = "green") +
  plotTheme()

#2: parkArea
ggplot(data = housesKnown, aes(x = parkArea, y = SalePrice)) +
  geom_point(size=2, shape=20) +
  labs(title = "Figure 2.2: Nearby Park Space and Sale Price", subtitle = "Miami and Miami Beach, FL") +
  geom_smooth(method = "lm", se=F, colour = "green") +
    plotTheme()

#3: vacant
ggplot(data = housesKnown, aes(x = pctVacant, y = SalePrice)) +
  geom_point(size=2, shape=20) +
 labs(title = "Figure 2.3: Share of Vacant Properties and Sale Price", subtitle = "Miami and Miami Beach, FL") +
  geom_smooth(method = "lm", se=F, colour = "green") +
    plotTheme()

#4: bus stops
ggplot(data = housesKnown, aes(x = busStops, y = SalePrice)) +
  geom_point(size=2, shape=20) +
 labs(title = "Figure 2.4: Bus Stops and Sale Price", subtitle = "Miami and Miami Beach, FL") +
  geom_smooth(method = "lm", se=F, colour = "green") +
    plotTheme()

We further examine the relationships between variables, sale prices, and space in a series of maps. Figure 3 shows the distribution of observed sale prices. Prices are highest in Miami Beach and along Miami’s shoreline, and they decline progressively toward the lowest-priced cluster in the north.

#map of your dependent variable (sale price)
  #water is just for mapping visuals
water <- st_read("https://opendata.arcgis.com/datasets/bf9de3192c9c4e458d1453f6d4c88d6c_0.geojson") %>%
 st_as_sf(coords = c("LON", "LAT"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102658') %>%
  st_union() %>%
  st_intersection(.,miamiBound)

ggplot() +
  geom_sf(data = acs, fill = "gray80", colour = "white") +
  geom_sf(data = water, fill = "light blue", colour = "light blue") +
  geom_sf(data = housesKnown, aes(colour = q5(SalePrice))) +
  scale_colour_manual(values = paletteMap) +
  labs(title = "Figure 3: Home Sale Price", subtitle = "Miami and Miami Beach, FL") +
  mapTheme()

Next, we map distWater, AdjustedSqFt, medHHInc, parkArea, and foodEstablishments.

In a city known for its beaches, we expect distance to water to be negatively correlated with sale price. Figure 4.1 shows that distance to the shoreline closely mirrors the sale price pattern.

However, it’s possible that sale price appears to be strongly correlated with distance to water because of other factors. Downtown Miami is located near the beach, and the heightened development around the beaches and downtown may give the area other attractive features.

Two of these features are green space and dining venues. Figure 4.2 shows that houses in Miami Beach have the most green space nearby. Interestingly, only some of the highest-priced houses, mostly in Miami Beach, have large amounts of nearby park area; this may reflect crime incidents in some parks. In Figure 4.3, we see a similar pattern, with the highest values among houses in Miami Beach and along the northern part of Miami’s shoreline. This suggests that distance to the shoreline is only part of the story.

Outside of amenities, we expect that sale price correlates with median household income of census tracts and the adjusted square footage of houses. In Figures 4.4 and 4.5, we see this confirmed: much like sale price, the highest values for income and square footage are along the shoreline.

#1: distWater
ggplot() +
geom_sf(data = acs, fill = "gray90", colour = "white") +
  geom_sf(data = water, fill = "light blue", colour = "light blue") +
  geom_sf(data = houses, aes(colour = q5(distWater))) +
  labs(title = "Figure 4.1: Distance to Shore", subtitle = "Miami and Miami Beach, FL") +
  scale_colour_manual(values = palette5) +
  mapTheme()

#2: parkArea
ggplot() +
  geom_sf(data = acs, fill = "gray90", colour = "white") +
  geom_sf(data = water, fill = "light blue", colour = "light blue") +
  geom_sf(data = housesKnown, aes(colour = q5(parkArea))) +
  scale_colour_manual(values = palette5) +
  labs(title = "Figure 4.2: Nearby Park Area", subtitle = "Miami, FL") +
  mapTheme()

#3: Food and bev
ggplot() +
  geom_sf(data = acs, fill = "gray90", colour = "white") +
  geom_sf(data = water, fill = "light blue", colour = "light blue") +
  geom_sf(data = housesKnown, aes(colour = q5(foodEstablishments))) +
  scale_colour_manual(values = palette5) +
  labs(title = "Figure 4.3: Number of Nearby Food/Bev Places", subtitle = "Miami, FL") +
  mapTheme()

#4: AdjustedSqFt
  ggplot() +
geom_sf(data = acs, fill = "gray90", colour = "white") +
  geom_sf(data = water, fill = "light blue", colour = "light blue") +
  geom_sf(data = houses, aes(colour = q5(AdjustedSqFt))) +
  labs(title = "Figure 4.4: Adjusted Square Feet", subtitle = "Miami and Miami Beach, FL") +
  scale_colour_manual(values = palette5) +
  mapTheme()

#5: MedHHInc
  ggplot() +
geom_sf(data = acs, fill = "gray90", colour = "white") +
  geom_sf(data = water, fill = "light blue", colour = "light blue") +
  geom_sf(data = houses, aes(colour = q5(medHHInc))) +
  labs(title = "Figure 4.5: Tract Median Household Income", subtitle = "Miami and Miami Beach, FL") +
  scale_colour_manual(values = palette5) +
  mapTheme()

Methods

Having established that attractive amenities, high-income households, large houses, and high sale prices cluster around the shoreline, we can begin to test features for our regression model. We ranked each feature’s correlation with sale price, then added features in that order until an additional feature no longer increased the model’s predictive power. The regression model we chose includes internal house features, census tract variables, transportation availability, park area, and flood risk.
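A minimal sketch of that ranking step, using the numericVars table built for Figure 1 (the final selection also weighed categorical features such as Zoning and floodInsureType):

#sketch: rank numeric candidate features by the absolute value of their correlation with sale price
corWithPrice <- cor(numericVars)[, "SalePrice"]
sort(abs(corWithPrice[names(corWithPrice) != "SalePrice"]), decreasing = TRUE)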

 #0.72 R2
  reg2 <- lm(SalePrice ~ ., data = st_drop_geometry(housesKnown) %>%
             dplyr::select(SalePrice, ActualSqFt, LotSize, Zoning, Stories.cat, Bath.cat, Pool, medHHInc, Dock, Bed.cat, middleCatch, age, pctVacant, pctRenterOcc, monthhousingcost, Patio, foodEstablishments, timeToWork, metromoverStops, metrorailStops, parkArea, floodInsureType))

Next, to test this model, we attach Zillow neighborhood boundaries to each house, then split the houses with observed sale prices into a training set (60%) for fitting the model and a test set for evaluating it.

#read FL neighborhoods
nhoods_fl <- st_read("E:/Upenn/CPLN508/miami/zillow_nghbrhd_feb17/zillow_nghbrhd_feb17.shp")
#nhoods_fl <- st_read("/Users/annaduan/Documents/GitHub/Miami-Oct12/Raw\ Data/zillow_nghbrhd_feb17/zillow_nghbrhd_feb17.shp")
nhoods_mb <- subset(nhoods_fl, CITY == "MIAMI BEACH")%>%
  st_as_sf(coords = c("LON", "LAT"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102658')
nhoods_m <- subset(nhoods_fl, CITY == "MIAMI")%>%
  st_as_sf(coords = c("LON", "LAT"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102658')

#Join neighborhoods
nhoods <- rbind(nhoods_mb, nhoods_m)
nhoods <- nhoods %>%
  dplyr::select(NAME) %>%
  rename(neighborhood = NAME)
housesKnown <- housesKnown %>% st_join(., nhoods, join = st_within)
houses <- houses %>% st_join(., nhoods, join = st_within)
housesUnknown <- housesUnknown %>% st_join(., nhoods, join = st_within)

#Separate test/train sets

inTrain <- createDataPartition(
              y = paste(housesKnown$Zoning, housesKnown$floodInsureType, housesKnown$neighborhood),
              p = .60, list = FALSE)
miami.training <- housesKnown[inTrain,]
miami.test <- housesKnown[-inTrain,]  


#Training regression 
reg.training <- 
  lm(SalePrice ~ ., data = st_drop_geometry(miami.training) %>% 
                             dplyr::select(SalePrice, ActualSqFt, LotSize, Zoning, Stories.cat, Bath.cat, Pool, medHHInc, Dock, Bed.cat, middleCatch, age, pctVacant, pctRenterOcc, monthhousingcost, Patio, foodEstablishments, timeToWork, metromoverStops, metrorailStops, parkArea, floodInsureType))

Results

The results of the regression model on the training set are presented in Table 2. Overall, the R-squared value of 0.9 means that our variables explain most of the variation in sale price. The low p-values on several coefficients also indicate that those relationships are statistically significant.

#polished table of  (training set) lm summary results (coefficients, R2 etc)
stargazer(reg.training, type="text", digits=1, title="Table 2: LM of Training Data", out = "Training LM.txt")
## 
## Table 2: LM of Training Data
## =====================================================================
##                                               Dependent variable:    
##                                           ---------------------------
##                                                    SalePrice         
## ---------------------------------------------------------------------
## ActualSqFt                                         829.3***          
##                                                     (26.3)           
##                                                                      
## LotSize                                              10.5            
##                                                      (7.9)           
##                                                                      
## Zoning0104 - SINGLE FAM - ANCILIARY UNIT           73,301.4          
##                                                   (83,493.6)         
##                                                                      
## Zoning0800 - SGL FAMILY - 1701-1900 SQ          1,675,399.0***       
##                                                   (145,693.8)        
##                                                                      
## Zoning2100 - ESTATES - 15000 SQFT LOT           5,052,095.0***       
##                                                   (287,379.7)        
##                                                                      
## Zoning2200 - ESTATES - 25000 SQFT LOT           4,247,518.0***       
##                                                   (420,466.2)        
##                                                                      
## Zoning2800 - TOWNHOUSE                             177,472.3         
##                                                   (397,764.4)        
##                                                                      
## Zoning3900 - MULTI-FAMILY - 38-62 U/A              91,734.9          
##                                                   (157,157.4)        
##                                                                      
## Zoning3901 - GENERAL URBAN 36 U/A LIMITED          127,001.4         
##                                                   (189,753.2)        
##                                                                      
## Zoning4600 - MULTI-FAMILY - 5 STORY                                  
##                                                   (257,562.7)        
##                                                                      
## Zoning4601 - MULTI-FAMILY - 8 STORY                                  
##                                                   (588,041.8)        
##                                                                      
## Zoning4801 - RESIDENTIAL-LIMITED RETAI             281,929.5         
##                                                   (410,297.5)        
##                                                                      
## Zoning5700 - DUPLEXES - GENERAL                    92,019.8          
##                                                   (73,595.1)         
##                                                                      
## Zoning6100 - COMMERCIAL - NEIGHBORHOOD            -106,258.5         
##                                                   (313,128.9)        
##                                                                      
## Zoning6101 - CEN-PEDESTRIAN ORIENTATIO             193,240.8         
##                                                   (292,266.0)        
##                                                                      
## Zoning6106 - RESIDENTIAL-LIBERAL RETAI           1,004,289.0*        
##                                                   (599,801.2)        
##                                                                      
## Zoning6107 - RESIDENTIAL-MEDIUM RETAIL             320,663.2         
##                                                   (250,718.1)        
##                                                                      
## Zoning6110 - COMM/RESIDENTIAL-DESIGN D             172,626.8         
##                                                   (433,235.2)        
##                                                                      
## Zoning6402 - URBAN CORE 24 STORY/7FLR             1,013,856.0        
##                                                   (824,277.7)        
##                                                                      
## Zoning7000 - INDUSTRIAL - GENERAL                  134,377.1         
##                                                   (814,280.5)        
##                                                                      
## Zoning7700 - INDUSTRIAL - RESTRICTED               196,635.5         
##                                                   (421,791.3)        
##                                                                      
## Stories.cat3+ Stories                           1,613,864.0***       
##                                                   (228,530.9)        
##                                                                      
## Stories.catUp to 1 Stories                       305,504.5***        
##                                                   (72,318.1)         
##                                                                      
## Bath.cat3+ Bathrooms                             -269,208.7***       
##                                                   (73,068.5)         
##                                                                      
## Bath.catUp to 1 Bathroom                         177,353.0***        
##                                                   (62,978.4)         
##                                                                      
## PoolPool                                           -62,657.5         
##                                                   (67,256.9)         
##                                                                      
## medHHInc                                             -0.01           
##                                                      (1.0)           
##                                                                      
## DockNo Dock                                      -313,217.6**        
##                                                   (130,720.0)        
##                                                                      
## Bed.cat4+ Beds                                   -214,034.7***       
##                                                   (68,081.4)         
##                                                                      
## Bed.catUp to 2 Beds                                97,219.1          
##                                                   (60,478.0)         
##                                                                      
## middleCatchde Diego, Jose Middle                   38,900.2          
##                                                   (144,072.8)        
##                                                                      
## middleCatchJones-Ayers, Georgia Middle             33,158.4          
##                                                   (135,887.4)        
##                                                                      
## middleCatchKinloch Park Middle                     61,712.6          
##                                                   (168,285.7)        
##                                                                      
## middleCatchMann, Horace Middle                     -72,703.3         
##                                                   (156,508.5)        
##                                                                      
## middleCatchNautilus Middle                         134,428.5         
##                                                   (275,167.6)        
##                                                                      
## middleCatchother                                   45,290.4          
##                                                   (128,273.1)        
##                                                                      
## middleCatchPonce de Leon Middle                    185,306.5         
##                                                   (134,181.3)        
##                                                                      
## middleCatchShenandoah Middle                       95,549.4          
##                                                   (117,704.5)        
##                                                                      
## age                                                -2,136.0*         
##                                                    (1,102.0)         
##                                                                      
## pctVacant                                         -728,991.9*        
##                                                   (409,690.3)        
##                                                                      
## pctRenterOcc                                       15,001.1          
##                                                   (220,519.9)        
##                                                                      
## monthhousingcost                                     -24.9           
##                                                     (82.3)           
##                                                                      
## PatioPatio                                         -44,995.1         
##                                                   (45,007.3)         
##                                                                      
## foodEstablishments                                  4,212.8          
##                                                    (3,549.2)         
##                                                                      
## timeToWork                                           -16.6           
##                                                     (53.8)           
##                                                                      
## metromoverStops                                     4,920.8          
##                                                    (6,332.9)         
##                                                                      
## metrorailStops                                     -12,098.3         
##                                                   (24,302.8)         
##                                                                      
## parkArea                                            -0.2***          
##                                                     (0.05)           
##                                                                      
## floodInsureType0294                                -54,809.2         
##                                                   (118,708.4)        
##                                                                      
## floodInsureType0304                                40,281.6          
##                                                   (193,332.5)        
##                                                                      
## floodInsureType0307                                -15,966.3         
##                                                   (287,712.4)        
##                                                                      
## floodInsureType0308                                256,863.3         
##                                                   (230,598.1)        
##                                                                      
## floodInsureType0309                               513,129.0*         
##                                                   (284,537.3)        
##                                                                      
## floodInsureType0311                               -210,266.8         
##                                                   (211,332.1)        
##                                                                      
## floodInsureType0312                                50,571.9          
##                                                   (239,226.8)        
##                                                                      
## floodInsureType0313                               -128,880.8         
##                                                   (202,493.4)        
##                                                                      
## floodInsureType0314                                -48,315.5         
##                                                   (228,168.4)        
##                                                                      
## floodInsureType0316                             2,176,855.0***       
##                                                   (332,671.3)        
##                                                                      
## floodInsureType0317                               679,144.9**        
##                                                   (290,617.2)        
##                                                                      
## floodInsureType0318                             1,552,020.0***       
##                                                   (519,511.0)        
##                                                                      
## floodInsureType0319                             12,628,485.0***      
##                                                   (954,024.7)        
##                                                                      
## floodInsureType0476                                23,330.7          
##                                                   (190,418.3)        
##                                                                      
## floodInsureType0477                               -342,170.9         
##                                                   (289,799.8)        
##                                                                      
## floodInsureTypeother                               51,789.1          
##                                                   (170,103.0)        
##                                                                      
## Constant                                         -885,135.7***       
##                                                   (294,759.4)        
##                                                                      
## ---------------------------------------------------------------------
## Observations                                         1,651           
## R2                                                    0.9            
## Adjusted R2                                           0.9            
## Residual Std. Error                          801,154.0 (df = 1586)   
## F Statistic                                179.5*** (df = 64; 1586)  
## =====================================================================
## Note:                                     *p<0.1; **p<0.05; ***p<0.01

Next, we test the model on our test set to see how well it performs on new data. Overall, it is relatively accurate: the average percentage error is 7.1%. However, the mean absolute error is $351,281, which is concerning because the average price in the test set is $689,606. This may be because our model is less accurate for expensive houses, whose absolute errors are larger at any given percentage error. Indeed, Figures 5.1 and 5.2 show that absolute error is higher for homes with high observed prices, while percent error is consistently low, at less than 10%.

#Test regression on miami.test
miami.test <-
  miami.test %>%
  mutate(Regression = "Baseline Regression",
         SalePrice.Predict = predict(reg.training, miami.test), #751571.5
         SalePrice.Error = SalePrice.Predict - SalePrice, #60608.35
         SalePrice.AbsError = abs(SalePrice.Predict - SalePrice), #363363
         SalePrice.APE = SalePrice.AbsError / SalePrice) %>% #0.05589877  #corrected 
  filter(SalePrice < 5000000) 

#Mean error and APE 432255.1   0.3407089   443611.7 0.1643853 467809 -0.0241117
mean(miami.test$SalePrice.AbsError, na.rm = T)#[1] 351280.6
mean(miami.test$SalePrice.APE, na.rm = T)#[1] 0.7101703
mean(miami.test$SalePrice.Predict, na.rm = T)#[1] 770165.4

ggplot(data = miami.test) +
  geom_point(aes(x = SalePrice, y = SalePrice.AbsError)) +
  labs(title = "Figure 5.1: Observed Sale Price and Absolute Error") +
  plotTheme()

ggplot(data = miami.test) +
  geom_point(aes(x = SalePrice, y = SalePrice.APE)) +
  labs(title = "Figure 5.2: Observed Sale Price and Absolute Percent Error") +
  plotTheme()

Is our model generalizable?

In addition to accuracy, generalizability is important for our model to be effective. To assess it, we run 100-fold cross-validation, testing the model on different segments of the houses with known sale prices.

We see in Table 3 that Fold75, one of the 100 partitioned segments of the known-price data, has an R-squared of 0.95 and a mean absolute error of roughly $580,397. While the R-squared is high, the error is more than half of the average predicted price across all folds. These results are broadly similar to those from the training set, which at first suggests that the model generalizes across groups.

#K Folds test
fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)

reg.cv <- 
  train(SalePrice ~ ., data = st_drop_geometry(housesKnown) %>% 
                                dplyr::select(SalePrice, ActualSqFt, LotSize, Zoning, Stories.cat, Bath.cat, Pool, medHHInc, Dock, Bed.cat, middleCatch, age, pctVacant, pctRenterOcc, monthhousingcost, Patio, foodEstablishments, timeToWork, metromoverStops, metrorailStops, parkArea, floodInsureType), 
     method = "lm", trControl = fitControl, na.action = na.pass)


#Alternative k-fold cross-validation helper (adapted from an online example); defined for reference
kfold.MLR = function(fit, k = 10, data = fit$model) {    
  sum.sqerr  = rep(0, k)
  sum.abserr = rep(0, k)
  sum.pererr = rep(0, k)
  y = fit$model[, 1]                            #observed outcome (SalePrice)
  n = nrow(data)
  folds = sample(1:k, nrow(data), replace = T)  #randomly assign each row to a fold
  for (i in 1:k) {
    fit2 <- lm(formula(fit), data = data[folds != i, ])  #refit without fold i
    ypred = predict(fit2, newdata = data[folds == i, ])  #predict on fold i
    sum.sqerr[i]  = sum((y[folds == i] - ypred)^2)
    sum.abserr[i] = sum(abs(y[folds == i] - ypred))
    sum.pererr[i] = sum(abs(y[folds == i] - ypred) / y[folds == i])
  }
  return(data.frame(RMSEP = sqrt(sum(sum.sqerr) / n),
                    MAE = sum(sum.abserr) / n,
                    MAPE = (sum(sum.pererr) / n) * 100))
}
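For completeness, a hedged usage sketch of this helper, assuming reg.training is the lm fitted on the training set in the earlier chunk; it may error if a rare factor level falls entirely into one fold:

#Sketch only: apply the helper to the baseline training regression
set.seed(825)
kfold.MLR(reg.training, k = 10) #returns RMSEP, MAE, and MAPE over 10 random folds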

#Predictions from the cross-validated model on the full set of known sales
reg.cv.known <-
  housesKnown %>%
  mutate(Regression = "Baseline Regression",
         SalePrice.Predict = predict(reg.cv, housesKnown),
         SalePrice.Error = SalePrice.Predict - SalePrice,
         SalePrice.AbsError = abs(SalePrice.Predict - SalePrice),
         SalePrice.APE = SalePrice.AbsError / SalePrice) %>%
  filter(SalePrice < 5000000) 
fold75 <- reg.cv$control$indexOut$Resample075 #row indices held out in fold 75

reg75 <- reg.cv.known[fold75,c("SalePrice", "SalePrice.Predict")]
reg75.test <-
  reg75 %>%
  mutate(SalePrice.Error = SalePrice.Predict - SalePrice, 
         SalePrice.AbsError = abs(SalePrice.Predict - SalePrice), 
         SalePrice.APE = SalePrice.AbsError / SalePrice) %>% 
  filter(SalePrice < 5000000) 
reg.cv.rs.min <- reg.cv$resample[75,]
reg.cv.rs.min$MAPE <- mean(reg75.test$SalePrice.APE)

round_df <- function(x, digits) {
    # round all numeric variables
    # x: data frame 
    # digits: number of digits to round
    numeric_columns <- sapply(x, mode) == 'numeric'
    x[numeric_columns] <-  round(x[numeric_columns], digits)
    x
}
reg.cv.rs.min <- round_df(reg.cv.rs.min, 2)

library(kableExtra)

#Revert to the raw resample metrics for fold 75; Table 3 therefore omits the MAPE computed above
reg.cv.rs.min <- reg.cv$resample[75,]

reg.cv.rs.min %>%
  gather(Variable, Value) %>%
  spread(Variable, Value) %>%
  kable(caption = "Table 3: Regression Results of One Test Set") %>%
  kable_classic(full_width = F, html_font = "Cambria")
Table 3: Regression Results of One Test Set

Resample   MAE          RMSE           Rsquared
Fold075    580,396.7    1,114,680.6    0.950

For more context, Figure 6.1 shows the distribution of mean absolute error across the 100 folds. Most folds have errors around $400,000, but a few outliers pull the average upward. If the model generalized well, the fold errors would cluster tightly around a single value; since they do not, the model is not predicting consistently across groups of houses.

ggplot(reg.cv$resample, aes(x=MAE)) +
  geom_histogram() +
  labs(title = "Figure 6.1: Mean Average Error in Cross Validation Tests") +
  plotTheme()
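To quantify that spread, a small companion check on the per-fold metrics stored by caret in reg.cv$resample:

#Spread of mean absolute error across the 100 cross-validation folds
summary(reg.cv$resample$MAE)
sd(reg.cv$resample$MAE, na.rm = TRUE)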

Do errors cluster spatially?

One way to investigate the reason for the model’s inconsistency is to look at how it treats houses across space.

library(knitr)
library(kableExtra)
library(scales)

coords <- st_coordinates(housesKnown)
neighborList <- knn2nb(knearneigh(coords, 5)) #each house's 5 nearest neighboring houses

spatialWeights <- nb2listw(neighborList, style="W") #"W" row-standardizes the weights, so each of the 5 neighbors gets weight 1/5
housesKnown$lagPrice <- lag.listw(spatialWeights, housesKnown$SalePrice) #spatial lag: mean sale price of the 5 neighbors
#plot(housesKnown$SalePrice, housesKnown$lagPrice)

coords.test <-  st_coordinates(miami.test)
neighborList.test <- knn2nb(knearneigh(coords.test, 5))
spatialWeights.test <- nb2listw(neighborList.test, style="W")
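As a sanity check on what the row-standardized lag means, a minimal sketch; the index 1 is an arbitrary example house:

#With style = "W", the spatial lag is simply the mean value among the 5 nearest neighbors
i <- 1                                          #an arbitrary house, for illustration
mean(housesKnown$SalePrice[neighborList[[i]]])  #manual mean of its neighbors' prices
housesKnown$lagPrice[i]                         #lag.listw returns the same value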

In Figure 7.1, we can see that residuals are fairly evenly distributed across space. This suggests that the model's limited generalizability is driven by factors other than spatial processes.

#Map of test set residuals (Figure 7.1)
library(modelr)

miami.test$resid <- 
  miami.test %>%
  as_data_frame() %>%
  add_residuals(., reg.training, var = "resid") %>%
  dplyr::select(resid, Folio) %>%
  pull(resid)

ggplot() +
geom_sf(data = acs, fill = "gray90", colour = "white") +
    geom_sf(data = water, fill = "light blue", colour = "light blue") +
  geom_sf(data = miami.test, aes(colour = q5(resid))) +
  scale_colour_manual(values = palette5) +
 labs(title = "Figure 7.1: Test Set Residual Errors", subtitle = "Miami and Miami Beach, FL") +
  mapTheme()

To see more clearly whether spatial processes drive our errors, we can look at spatial lags: for each house, the average price or error among its nearest neighbors. In Figure 8.1, the spatial lag of price errors shows only a weak relationship with sale price, which again suggests that most of our errors are not spatial in nature.

miami.test %>%                
  mutate(lagPriceError = lag.listw(spatialWeights.test, SalePrice.Error)) %>%  
  ggplot(aes(lagPriceError, SalePrice)) +
  geom_point() +
  stat_smooth(aes(lagPriceError, SalePrice), 
             method = "lm", se = FALSE, size = 1, colour="#FA7800")+
  labs(title = "Figure 8.1: Spatial Lag of Price Errors") +
  plotTheme() + theme(plot.title = element_text(size = 18, colour = "black")) 
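As a numeric companion to Figure 8.1, we can correlate each house's error with the average error among its five nearest neighbors; a value near zero would indicate little spatial clustering:

#Correlation between each test-set error and the mean error of its 5 nearest neighbors
errLag <- lag.listw(spatialWeights.test, miami.test$SalePrice.Error)
cor(miami.test$SalePrice.Error, errLag, use = "complete.obs")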

Finally, we can use a Moran's I permutation test to assess spatial autocorrelation in our model errors more formally. If the errors were unrelated to spatial processes, the observed Moran's I would be close to 0, while strong clustering would push it toward 1. Our observed value is below 0.2, which suggests only minimal spatial clustering of errors.

moranTest <- moran.mc(miami.test$SalePrice.Error,
                      spatialWeights.test, nsim = 999)

ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Figure 9.1: Observed and Permuted Moran's I",
       subtitle= "Observed Moran's I in orange",
       x="Moran's I",
       y="Count") +
  plotTheme()
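Alongside the histogram of permuted values, the test object itself reports the observed statistic and a pseudo p-value:

#Observed Moran's I and the pseudo p-value from the 999 permutations
moranTest$statistic
moranTest$p.value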

Accounting for neighborhood variance

We now add neighborhood as a categorical feature to account for differences in sale price across neighborhoods. Table 4 shows that accounting for neighborhood effects reduces both the mean absolute error and the mean absolute percentage error, but only marginally. This suggests that our baseline model already captured much of the spatial variation in prices through its open-data features.

#Make new neighborhood regression
reg.nhood <- lm(SalePrice ~ ., data = as.data.frame(miami.training) %>% 
                                 dplyr::select(neighborhood, SalePrice, ActualSqFt, LotSize, Zoning, Stories.cat, Bath.cat, Pool, medHHInc, Dock, Bed.cat, middleCatch, age, pctVacant, pctRenterOcc, monthhousingcost, Patio, foodEstablishments, timeToWork, metromoverStops, metrorailStops, parkArea, floodInsureType))
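Adding neighborhood as a factor introduces one dummy variable per neighborhood level; a quick check of how many levels that adds, using the neighborhood column selected above:

#Number of neighborhoods, and hence roughly the number of added dummy variables
length(unique(miami.training$neighborhood))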

#Outcomes on the test set, using the neighborhood model
miami.test.nhood <-
  miami.test %>%
  mutate(Regression = "Neighborhood Effects",
         SalePrice.Predict = predict(reg.nhood, miami.test),
         SalePrice.Error = SalePrice.Predict - SalePrice,             #same sign convention as the baseline
         SalePrice.AbsError = abs(SalePrice.Predict - SalePrice),
         SalePrice.APE = abs(SalePrice.Predict - SalePrice) / SalePrice) %>%
  filter(SalePrice < 5000000)

#Check accuracy
bothRegressions <-
  rbind(
    dplyr::select(miami.test, starts_with("SalePrice"), Regression, neighborhood) %>%
      mutate(lagPriceError = lag.listw(spatialWeights.test, SalePrice.Error)),
    dplyr::select(miami.test.nhood, starts_with("SalePrice"), Regression, neighborhood) %>%
      mutate(lagPriceError = lag.listw(spatialWeights.test, SalePrice.Error)))   

st_drop_geometry(bothRegressions) %>%
  gather(Variable, Value, -Regression, -neighborhood) %>%
  filter(Variable == "SalePrice.AbsError" | Variable == "SalePrice.APE") %>%
  group_by(Regression, Variable) %>%
    summarize(meanValue = mean(Value, na.rm = T)) %>%
    spread(Variable, meanValue) %>%
    kable(caption = "Table 4: Neighborhood Effect on Error")
Table 4: Neighborhood Effect on Error

Regression              SalePrice.AbsError    SalePrice.APE
Baseline Regression     351,705.1             0.7060
Neighborhood Effects    351,407.0             0.7033
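To express how marginal the difference is, a quick calculation using the values reported in Table 4:

#Relative change from the baseline to the neighborhood model (values from Table 4)
(351407.0 - 351705.1) / 351705.1 * 100      #change in mean absolute error, about -0.08%
(0.7032967 - 0.7059724) / 0.7059724 * 100   #change in mean absolute percentage error, about -0.4%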

Figure 10.1 shows the effect of the neighborhood model on the accuracy of our predictions. As suspected, it fits the data only marginally better than the baseline. For some of the higher-priced homes, however, adding neighborhood as a feature appears to cause an overestimation of price, which may reflect price variation within neighborhoods.

bothRegressions %>%
  dplyr::select(SalePrice.Predict, SalePrice, Regression) %>%
    ggplot(aes(SalePrice, SalePrice.Predict)) +
  geom_point() +
  stat_smooth(aes(SalePrice, SalePrice),
             method = "lm", se = FALSE, size = 1, colour="#FA7800") +
  stat_smooth(aes(SalePrice, SalePrice.Predict),
              method = "lm", se = FALSE, size = 1, colour="#25CB10") +
  facet_wrap(~Regression) +
  labs(title="Figure 10.1: Predicted Sale Price and Observed Price",
       subtitle="Orange line represents a perfect prediction; Green line represents prediction") +
  plotTheme() + theme(plot.title = element_text(size = 18, colour = "black"))

Finally, Figure 11.1 shows the prices that we predicted for the set of 3503 Miami area homes. As we saw with the known home sale prices, our predicted prices are highest close to the shoreline and on Miami Beach.

#Predict prices for all houses; predictions are then filtered to the homes flagged toPredict == 1
housesPredictions <-                   
  houses %>%
  mutate(prediction = predict(reg.nhood, houses),
         team_name = 'Panda')

predictions <- housesPredictions[,c("Folio", "prediction", "team_name", "toPredict")] %>%
  st_drop_geometry() %>%
  filter(toPredict == 1) %>%
  dplyr::select(-toPredict)

# write.csv(predictions, "PANDA.csv") #per Piazza, the column names must be "prediction", "Folio", and "team_name"
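Before writing the submission file, a quick check on how many homes receive predictions and the range of predicted values:

#Number of homes being predicted and the distribution of predicted prices
nrow(predictions)
summary(predictions$prediction)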

#Map values
ggplot() +
  geom_sf(data = acs, fill = "gray90", colour = "white") +
    geom_sf(data = water, fill = "light blue", colour = "light blue") +
  geom_sf(data = housesPredictions, aes(colour = q5(prediction))) +
 scale_colour_manual(values = palette5) +
 labs(title = "Figure 11.1: Predicted Sale Price Values", subtitle = "Miami and Miami Beach, FL") +
 # facet_wrap(~toPredict) +
  mapTheme()

In Figure 12.1, we can see the spatial distribution of our errors by neighborhood. Overall, mean absolute percentage error (MAPE) is lower in Miami Beach, where sale prices are higher; it is highest in the center of Miami and relatively low along the Miami shoreline. This suggests that our errors were smaller in neighborhoods with higher observed sale prices.

#Using the test set predictions, provide a map of mean absolute percentage error (MAPE) by neighborhood
st_drop_geometry(bothRegressions) %>%
  group_by(Regression, neighborhood) %>%
  summarise(mean.MAPE = mean(SalePrice.APE, na.rm = T)) %>%
  ungroup() %>%
  left_join(nhoods) %>%
    st_as_sf() %>%
   ggplot() +
    geom_sf(data = water, fill = "light blue", colour = "light blue") +
      geom_sf(colour = "gray", aes(fill = q5(mean.MAPE))) +
      scale_fill_manual(values = paletteMap) +
  labs(title = "Figure 12.1: Mean Average Percentage Error by Neighborhood") +
      mapTheme()

Figure 13.1 confirms this trend: with one exception, neighborhoods with lower mean predicted prices have higher mean absolute percentage errors.

scatter_hood <-
    miami.test.nhood %>%
    group_by(neighborhood) %>%
    dplyr::select(neighborhood, SalePrice.APE, SalePrice.Predict)

mean_sca_hd <-
  scatter_hood %>%
  group_by(neighborhood) %>%
  summarise_at(vars("SalePrice.APE", "SalePrice.Predict"), mean)

plot(mean_sca_hd$SalePrice.Predict, mean_sca_hd$SalePrice.APE,
     main = "Figure 13.1: MAPE by Neighborhood and Mean Price by Neighborhood",
     xlab = "Mean Predicted Price by Neighborhood", ylab = "MAPE by Neighborhood")
#base R plot; plotTheme() is a ggplot2 theme and does not apply to base graphics
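As a numeric companion to Figure 13.1, the correlation between neighborhood mean predicted price and neighborhood MAPE:

#Correlation between neighborhood mean predicted price and neighborhood MAPE
cor(mean_sca_hd$SalePrice.Predict, mean_sca_hd$SalePrice.APE, use = "complete.obs")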

#A labelled ggplot alternative (see https://r-graphics.org/recipe-scatter-labels), kept for reference:

#scatter_mae_mean <- ggplot(mean_sca_hd, aes(x = SalePrice.Predict, y = SalePrice.APE)) +
#    geom_point() +
#  plotTheme()

#scatter_mae_mean +    #with manual annotations so each point's neighborhood can be identified
#  annotate("text", x = 820000, y = -11.7, label = "UPPER EASTSIDE") +
#  annotate("text", x = 330000, y = -6.5, label = "LITTLE HAITI")+
#  annotate("text", x = 1300000.82, y = 6.85325579, label = "NAUTILUS")+
#  annotate("text", x = 631682.30, y = 3.57218432, label = "BISCAYNE POINT")+
#  annotate("text", x = 1557855.78, y = 2.81996319, label = "LA GORCE")+
#  annotate("text", x = 2031665.51, y = 1.95167248, label = "WYNWOOD - EDGEWATER")+
#  annotate("text", x = 890031.19, y = -2.16394539, label = "OVERTOWN")

Does this model work equally for different demographic groups?

The variation in MAPE across neighborhoods suggests our model has limited generalizability. To test this, we look at how well it predicts prices across different racial and income contexts; Figure 14.1 shows the racial and income context of Miami and Miami Beach.

Tables 5 and 6 confirm that the model performs differently across demographic contexts. In majority non-white neighborhoods, the baseline model's MAPE is 107%, compared with 59% in majority white neighborhoods, a gap of 48 percentage points. Similarly, MAPE is 88% in low-income neighborhoods versus 46% in high-income ones, a gap of 42 percentage points.

#RACE & INCOME
  #make new layer
acsRaceIncome <-
  acs %>%
  mutate(raceContext= ifelse(pctWhite > .5, "Majority White", "Majority Non-White"),
         incomeContext = ifelse(medHHInc > 49256.56, "High Income", "Low Income"))
#context
grid.arrange(ncol = 2,
  ggplot() + geom_sf(data = na.omit(acsRaceIncome), aes(fill = raceContext)) +
    scale_fill_manual(values = c("#25CB10", "#FA7800"), name="Race Context") +
    labs(title = "Figure 14.1: Race Context of Miami and Miami Beach") +
    mapTheme() + theme(legend.position="bottom"),
  ggplot() + geom_sf(data = na.omit(acsRaceIncome), aes(fill = incomeContext)) +
    scale_fill_manual(values = c("#25CB10", "#FA7800"), name="Income Context") +
    labs(title = "Income Context of Miami and Miami Beach") +
    mapTheme() + theme(legend.position="bottom"))

#tables
st_join(bothRegressions, acsRaceIncome) %>%
  filter(!is.na(raceContext)) %>%
  group_by(Regression, raceContext) %>%
  summarize(mean.MAPE = scales::percent(mean(SalePrice.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(raceContext, mean.MAPE) %>%
  kable(caption = "Table 5: MAPE by Neighborhood Racial Context")
Table 5: MAPE by Neighborhood Racial Context

Regression              Majority Non-White    Majority White
Baseline Regression     107%                  59%
Neighborhood Effects    108%                  59%

st_join(bothRegressions, acsRaceIncome) %>%
  filter(!is.na(incomeContext)) %>%
  group_by(Regression, incomeContext) %>%
  summarize(mean.MAPE = scales::percent(mean(SalePrice.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(incomeContext, mean.MAPE) %>%
  kable(caption = "Table 6: MAPE by Neighborhood Income Context")
Table 6: MAPE by Neighborhood Income Context

Regression              High Income    Low Income
Baseline Regression     46%            88%
Neighborhood Effects    47%            87%

Discussion

In conclusion, our model captures the overall distribution of home sale prices in the area: the pattern of predicted prices matches that of recent known sales. However, it is limited in its ability to generalize across different types of neighborhoods.

Several of our variables had interesting relationships with sale price. As expected, distance from the shore correlates negatively with sale price, but the relationship was much weaker than expected because it only seemed to matter within roughly the first mile of the water. By contrast, floodInsureType correlated quite strongly with sale price. This feature comes from FEMA's flood-risk ratings, and, contrary to our expectation, it is not determined solely by distance from the shoreline.

In general, we found that houses' internal features were the strongest predictors of sale price. Census features such as the share of vacant units in a tract, median household income, and travel time to work also correlated with sale price. Contrary to our expectations, distance to water, food establishments, businesses, school catchment areas, and park space were less strongly correlated with home sale price.

Our errors were higher than we would like. As discussed in the cross-validation section, some folds had mean absolute errors of hundreds of thousands of dollars. Further, our errors were not distributed evenly: they were higher in low-income and majority-minority neighborhoods, and the highest MAPE values on our maps corresponded with the lowest-income, lowest-price neighborhoods. Based on our Moran's I test, however, we avoided substantial spatial clustering of errors. In all, our model predicted much better in high-income neighborhoods with high observed sale prices. This disparity is likely attributable to the fact that we relied mainly on positive home attributes and neighborhood amenities. Incorporating more context variables, such as poverty rate and racial composition, might have helped the model better capture majority-minority and low-income neighborhoods.

Conclusion

To conclude, we do not believe this model is ready for use by Zillow. Its errors are too large, and its predictions are too uneven across different types of neighborhoods. Moving forward, we will add features to improve predictions in neighborhoods with varying demographics, and we will fine-tune existing features to account for non-linear relationships with sale price, for example by binning continuous variables into more categories.