Introduction

Bike-share is a system that lets people borrow and ride bikes without owning one. It consists of a network of stations with docks where bikes are checked in, stored and checked out.

Stations in different locations face demand that varies over time. One paradox is that popular stations have higher demand for bikes and so run out sooner if they are not replenished properly. For bike-share companies, this shortage can increase user dissatisfaction and churn, thereby lowering profits. For users, it seriously reduces the utility of the service and causes inefficiency and inconvenience in their work and life. For the city or region, bike-share cannot raise active travel if it does not supply enough bikes where and when they are needed. With some users turning back to car travel, congestion and environmental problems will only worsen. Given all these perspectives, bike-share re-balancing, which means re-distributing bikes so that the supply shortage at each station is kept as small as possible, is critical to making a bike-share system a success.

The re-balancing strategy here relies on trucks to collect bikes and move them to certain stations. Because the bike stations in Philadelphia are quite dispersed, relying purely on users to re-balance the bikes would require a relatively high incentive, which adds a non-negligible cost with possibly very limited effect. By comparison, small trucks are more likely to guarantee the outcome. At any given time, it would be ideal to predict demand for the next hour, which should give the trucks enough time to relocate bikes to hot-spot areas.

Data collection and feature engineering

I use bike-share data for Philadelphia from 10/29/2018 to 12/02/2018 to conduct a 5-week panel experiment in predicting demand. The features included are weather (temperature, precipitation and wind speed), shown in Figure 2.1; amenity features (proximity to schools, shops, parks, tourism sites, cuisine places, offices, and bus, trolley and transit stations); other spatial features (median income, median age, percent white population, mean commute time, percent taking public transportation, and number of commuters); and time lag features (Table 1). A fishnet is also created for spatial inputs.

weather.Panel <- 
  riem_measures(station = "PHL", date_start = "2018-10-29", date_end = "2018-12-02") %>%
  dplyr::select(valid, tmpf, p01i, sknt)%>%
  replace(is.na(.), 0) %>%
    mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
    mutate(week = week(interval60),
           dotw = wday(interval60, label=TRUE)) %>%
    group_by(interval60) %>%
    summarize(Temperature = max(tmpf),
              Precipitation = sum(p01i),
              Wind_Speed = max(sknt)) %>%
    mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))
library(ggplot2)
library(gridExtra)
grid.arrange(
  ggplot(weather.Panel, aes(interval60,Precipitation)) + geom_line() + 
  labs(title="Precipitation", x="Hour", y="Precipitation") + plotTheme,
  ggplot(weather.Panel, aes(interval60,Wind_Speed)) + geom_line() + 
    labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme,
  ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line() + 
    labs(title="Temperature", x="Hour", y="Temperature") + plotTheme,
  top="Figure 2.1: Weather Data - Philadelphia PHL - Nov, 2018")
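
The fishnet used for the spatial inputs is not constructed in the code shown here. As a rough sketch only, a regular grid can be built with sf::st_make_grid. The rectangle, the CRS (EPSG:32618, UTM zone 18N, in metres) and the 500 m cell size below are illustrative assumptions, not the values used in this analysis, and fishnet_demo is a hypothetical name.

```r
library(sf)

# Hypothetical study area: a 2000 m x 1000 m rectangle in a projected CRS (metres)
boundary <- st_sfc(
  st_polygon(list(rbind(
    c(480000, 4420000), c(482000, 4420000),
    c(482000, 4421000), c(480000, 4421000),
    c(480000, 4420000)))),
  crs = 32618)

# Regular square grid covering the boundary; the 500 m cell size is an assumption
fishnet_demo <- st_sf(geometry = st_make_grid(boundary, cellsize = 500, square = TRUE))

# Give every cell an ID so that cell-level features can be joined back later
fishnet_demo$uniqueID <- as.character(seq_len(nrow(fishnet_demo)))

nrow(fishnet_demo)
```

Trips can then be counted per cell in the same way as the aggregate() call that follows.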

library(sf)
library(dplyr)
dat_net <- 
  dplyr::select(dat_sf%>%
  st_transform(st_crs(fishnet))) %>% 
  mutate(countRIDE = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countRIDE = replace_na(countRIDE, 0),
         uniqueID = rownames(.),
         cvID = sample(round(nrow(fishnet) / 24), size=nrow(fishnet), replace = TRUE)) %>%
  st_join(., neighborhoods) %>%
  mutate(nbname=ifelse(!is.na(name),name,"unknown"))

final_net <-
  left_join(dat_net, st_drop_geometry(vars_net), by="uniqueID") 

final_net_weather <- left_join(final_net, st_drop_geometry(vars_net), by="uniqueID") 

##Prepare for moran's
library(spdep)
final_net.nb <- poly2nb(as_Spatial(final_net), queen=TRUE)
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)


final_net.localMorans <- 
  cbind(
    as.data.frame(localmoran(final_net$countRIDE, final_net.weights)),
    as.data.frame(final_net)) %>% 
    st_sf() %>%
      dplyr::select(RIDE_Count = countRIDE, 
                    Local_Morans_I = Ii, 
                    P_Value = `Pr(z > 0)`) %>%
      mutate(Significant_Hotspots = ifelse(P_Value <= 0.0000001, 1, 0)) %>%
      gather(Variable, Value, -geometry)

##distance to hotspots## 
final_net <-
  final_net %>% 
  mutate(RIDE.isSig = 
           ifelse(localmoran(final_net$countRIDE, 
                             final_net.weights)[,5] <= 0.05, 1, 0)) %>%
  mutate(RIDE.isSig.dist = 
           nn_function(st_coordinates(st_centroid(final_net)),
                       st_coordinates(st_centroid(
                         filter(final_net, RIDE.isSig == 1))), 1))
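
The helper nn_function is called above but not defined in this report. The sketch below is an assumption about what such a helper might compute: for each origin point, the mean Euclidean distance to its k nearest target points. The name nn_distance_demo and the toy coordinates are hypothetical, and the real helper may differ.

```r
# Mean distance from each row of `from` to its k nearest rows of `to`.
# Both arguments are two-column coordinate matrices (x, y) in the same projected units.
nn_distance_demo <- function(from, to, k) {
  apply(from, 1, function(p) {
    d <- sqrt((to[, 1] - p[1])^2 + (to[, 2] - p[2])^2)
    mean(sort(d)[seq_len(k)])
  })
}

from_pts <- rbind(c(0, 0), c(3, 4))
to_pts   <- rbind(c(0, 0), c(0, 1), c(6, 8))

nn_demo <- nn_distance_demo(from_pts, to_pts, k = 1)
nn_demo
```

With k = 1 this reduces to the distance to the single closest hot-spot cell, which matches how the call above is parameterised.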
dat_census <- st_join(dat_ymd %>% 
          filter(is.na(start_lon) == FALSE &
                   is.na(start_lat) == FALSE &
                   is.na(end_lat) == FALSE &
                   is.na(end_lon) == FALSE) %>%
          st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326),
        PhillyTracts %>%
          st_transform(crs=4326),
        join=st_intersects,
              left = TRUE) %>%
  rename(Origin.Tract = GEOID) %>%
  mutate(start_lon = unlist(map(geometry, 1)),
         start_lat = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  dplyr::select(-geometry)%>%
  st_as_sf(., coords = c("end_lon", "end_lat"), crs = 4326) %>%
  st_join(., PhillyTracts %>%
            st_transform(crs=4326),
          join=st_intersects,
          left = TRUE) %>%
  rename(Destination.Tract = GEOID)  %>%
  mutate(end_lon = unlist(map(geometry, 1)),
         end_lat = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  dplyr::select(-geometry)
study.panel <- 
  expand.grid(interval60=unique(dat_census$interval60), 
              start_station = unique(dat_census$start_station)) %>%
  left_join(., dat_census %>%
              dplyr::select(start_station, Origin.Tract, start_lon, start_lat )%>% #add station name
              distinct() %>%
              group_by(start_station) %>%
              slice(1))


study.panel <- 
  dat_census %>%
  mutate(Trip_Counter = 1) %>%
  right_join(study.panel) %>% 
  group_by(interval60, start_station, Origin.Tract, start_lon, start_lat) %>% #add station name
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
  left_join(weather.Panel) %>%
  ungroup() %>%
  filter(is.na(start_station) == FALSE) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label = TRUE)) %>%
  filter(is.na(Origin.Tract) == FALSE)
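
The panel construction above relies on expand.grid to create one row for every station-hour combination, so that hours with no trips appear as zeros rather than being missing. A minimal base-R sketch of the same idea on toy data (all names and values here are made up for illustration):

```r
# Toy trip records: one row per trip
toy_trips <- data.frame(
  interval60    = c(1, 1, 2, 3),
  start_station = c("A", "A", "B", "A"),
  stringsAsFactors = FALSE)

# Every hour x station combination, whether or not a trip occurred
toy_panel <- expand.grid(
  interval60    = 1:3,
  start_station = c("A", "B"),
  stringsAsFactors = FALSE)

# Count trips per station-hour, then join onto the full grid
toy_counts <- aggregate(
  Trip_Count ~ interval60 + start_station,
  data = transform(toy_trips, Trip_Count = 1),
  FUN = sum)

toy_panel <- merge(toy_panel, toy_counts,
                   by = c("interval60", "start_station"), all.x = TRUE)

# Station-hours with no recorded trips become explicit zeros
toy_panel$Trip_Count[is.na(toy_panel$Trip_Count)] <- 0
toy_panel
```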
study.panel <- 
  study.panel %>% 
  arrange(start_station, interval60) %>% 
  mutate(lagHour = dplyr::lag(Trip_Count,1),
         lag2Hours = dplyr::lag(Trip_Count,2),
         lag3Hours = dplyr::lag(Trip_Count,3),
         lag4Hours = dplyr::lag(Trip_Count,4),
         lag12Hours = dplyr::lag(Trip_Count,12),
         lag1day = dplyr::lag(Trip_Count,24),
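         # Veterans Day (Nov 11) and Thanksgiving (Nov 22) are days 315 and 326 of 2018;
         # %in% tests membership, whereas == would recycle the vector across rows
         holiday = ifelse(yday(interval60) %in% c(315,326),1,0)) %>%
   mutate(day = yday(interval60)) %>%
   # note: lag()/lead() shift by rows, and each row of this panel is one hour
   mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
                                 dplyr::lag(holiday, 2) == 1 ~ "PlusTwoDays",
                                 dplyr::lag(holiday, 3) == 1 ~ "PlusThreeDays",
                                 dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
                                 dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
                                 dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"),
         holidayLag = replace_na(holidayLag, "0"))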
as.data.frame(study.panel) %>%
    group_by(interval60) %>% 
    summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
    gather(Variable, Value, -interval60, -Trip_Count) %>%
    mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
                                                "lag12Hours","lag1day")))%>%
    group_by(Variable) %>%  
    summarize(correlation = round(cor(Value, Trip_Count),2)) %>%
    kable (caption="Table 1: Time lag features and correlation", col.names = c('Time lag', 'Correlation')) %>%
    kable_styling()
Table 1: Time lag features and correlation

Time lag      Correlation
lagHour        0.82
lag2Hours      0.59
lag3Hours      0.39
lag4Hours      0.23
lag12Hours    -0.22
lag1day        0.72
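
The lag code above sorts by station and hour but does not group by station, so the first few lagged values of each station are taken from the end of the previous station's series. As a hedged sketch of one possible variant (not the approach used for Table 1), grouping before lagging keeps each station's series separate. The toy data and the name toy_lagged are made up for illustration.

```r
library(dplyr)

toy <- data.frame(
  start_station = rep(c("A", "B"), each = 5),
  interval60    = rep(1:5, times = 2),
  Trip_Count    = c(1, 2, 3, 4, 5, 10, 20, 30, 40, 50))

toy_lagged <- toy %>%
  arrange(start_station, interval60) %>%
  group_by(start_station) %>%                     # lag within each station only
  mutate(lagHour = dplyr::lag(Trip_Count, 1)) %>%
  ungroup()

# The first hour of every station has no previous hour, so its lag is NA
toy_lagged

# Correlation between the lag and the current count, ignoring the NA rows
cor(toy_lagged$lagHour, toy_lagged$Trip_Count, use = "complete.obs")
```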

Exploratory analysis

More frequent re-balancing is required in places and at times that have the highest bike-share ridership demands.

According to Figure 4.1, bike-share ridership follows a similar pattern each week. Weekdays tend to have higher ridership than weekends. There is a major drop in ridership around Thanksgiving, and a decline around Veterans Day as well.

ggplot(dat_ymd %>%
         group_by(interval60) %>%
         tally())+
  geom_line(aes(x = interval60, y = n))+
  labs(title="Figure 4.1: Bike share trips per hr. Philadelphia, Nov, 2018",
       x="Date", 
       y="Number of trips")+
  plotTheme

Figure 4.2 reveals that peak hours have the highest demand and need more frequent re-balancing.

dat_ymd %>%
        mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
         group_by(interval60, start_station, time_of_day) %>%
         tally()%>%
  group_by(start_station, time_of_day)%>%
  summarize(mean_trips = mean(n))%>%
  ggplot()+
  geom_histogram(aes(mean_trips), binwidth = 1)+
  labs(title="Figure 4.2: Mean Number of Hourly Trips Per Station. Philadelphia, Nov, 2018",
       x="Number of trips", 
       y="Frequency")+
  facet_wrap(~time_of_day)+
  plotTheme
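
The same four-way time-of-day split is repeated in several chunks of this report. As an illustrative sketch, the rule could be wrapped in a small helper; the name classify_time_of_day is hypothetical and does not appear in the original code, but the hour boundaries are the ones used in the case_when calls.

```r
# Map an hour of the day (0-23) to the four periods used in this report
classify_time_of_day <- function(h) {
  ifelse(h < 7 | h > 18, "Overnight",
         ifelse(h < 10, "AM Rush",
                ifelse(h < 15, "Mid-Day", "PM Rush")))
}

tod_demo <- classify_time_of_day(c(3, 8, 12, 18, 22))
tod_demo
# "Overnight" "AM Rush" "Mid-Day" "PM Rush" "Overnight"
```

Keeping the rule in one place would make it easier to change the rush-hour boundaries consistently across all figures.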

Figures 4.3 and 4.4 indicate that, on average, Wednesdays have the highest peak demand and Fridays have the lowest demand among weekdays. Weekends have lower and more evenly distributed demand, and peaks on Sundays typically come around 1 hour later than peaks on Saturdays.

ggplot(dat_ymd %>% mutate(hour = hour(start_time)))+
     geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
  labs(title="Figure 4.3: Bike share trips in Philadelphia, by day of the week, Nov, 2018",
       x="Hour", 
       y="Trip Counts")+
     plotTheme

ggplot(dat_ymd %>% 
         mutate(hour = hour(start_time),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")))+
     geom_freqpoly(aes(hour, color = weekend), binwidth = 1)+
  labs(title="Figure 4.4: Bike share trips in Philadelphia - weekend vs weekday, Nov, 2018",
       x="Hour", 
       y="Trip Counts")+
     plotTheme

Figure 4.5 shows that 1) Weekdays tend to have higher demand than weekends; 2) On weekdays during PM rush hours, high demand clusters distinctly in Center City and on the bridge linking Center City to the eastern fringe of West Philly. Mid-day and overnight hours have very similar spatial distributions of demand, with nearly all hot spots in Center City. Compared with other times on weekdays, AM rush hours have high demand extending into the area south of Center City; 3) On weekends, AM rush hours have the lowest overall demand. Hot spots appear mostly at mid-day and during PM rush hours in Center City. Figure 4.6 is an animation of ridership on a typical Monday in early November in Philadelphia.

ggplot()+
  geom_sf(data = PhillyTracts %>%
          st_transform(crs=4326), fill = "white")+
  geom_point(data = dat_census %>% 
            mutate(hour = hour(start_time),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
                time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
              group_by(start_station, start_lat, start_lon, weekend, time_of_day) %>%
              tally(),
            aes(x=start_lon, y = start_lat, color = n), 
            fill = "transparent", alpha = 0.4, size = 1)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "D")+
  ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
  xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
  facet_grid(weekend ~ time_of_day)+
  labs(title="Figure 4.5: Bike share trips per hr by station. Philadelphia, Nov, 2018")+
  mapTheme

PhillyCensus <- PhillyCensus %>%
    st_transform(st_crs(fishnet))
bs_tract <- st_join(dat_sf, PhillyCensus, join=st_within)

week44 <-
  filter(bs_tract , week == 44 & dotw == "Mon")

week44.panel <-
  expand.grid(
    interval15 = unique(week44$interval15),
    Pickup.Census.Tract = unique(bs_tract$GEOID))

ride.animation.data <-
  mutate(week44, Trip_Counter = 1) %>%
    right_join(week44.panel) %>% 
    group_by(interval15, Pickup.Census.Tract) %>%
    summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>% 
    ungroup() %>% 
    st_sf() %>%
    mutate(Trips = case_when(Trip_Count == 0 ~ "0 trips",
                             Trip_Count > 0 & Trip_Count <= 3 ~ "1-3 trips",
                             Trip_Count > 3 & Trip_Count <= 6 ~ "4-6 trips",
                             Trip_Count > 6 & Trip_Count <= 10 ~ "7-10 trips",
                             Trip_Count > 10 ~ "11+ trips")) %>%
    # level names must match the labels created above ("11+ trips", not "10+ trips")
    mutate(Trips  = fct_relevel(Trips, "0 trips","1-3 trips","4-6 trips",
                                       "7-10 trips","11+ trips"))

#  left_join(dat_net, st_drop_geometry(vars_net), by="uniqueID")
# load gganimate before building the plot, since transition_manual() comes from it
library(gganimate)
library(gifski)

rideshare_animation <-
  ggplot() +
    geom_sf(data = ride.animation.data, aes(col = Trips, size = Trips), show.legend = "point")+
    geom_sf(data = PhillyTracts, color = "grey", fill = "transparent")+
    scale_fill_manual(values = c("green", "yellow", "orange","red")) +
    labs(title = "Figure 4.6: Rideshare pickups for one day in November 2018",
         subtitle = "15 minute intervals: {current_frame}") +
    transition_manual(interval15) +
    mapTheme

animate(rideshare_animation, duration=20, renderer = gifski_renderer())

anim_save("rideshare_local.gif", rideshare_animation, duration=20, renderer = gifski_renderer())

Comparison of model performance

Figures 5.1 and 5.2 show that the DTime_Space_FE_timeLags model has the lowest overall error, and that adding the holiday feature does not significantly improve the accuracy of the time-lag models.

ride.Train <- filter(study.panel, week <= 46) 
ride.Test <- filter(study.panel, week > 46)
study_panel.net <- merge(x=study.panel, y=st_drop_geometry(vars_net), by.x = "start_station", by.y = "id", all.x = TRUE) %>%
    filter(is.na(uniqueID) == FALSE)
  
ride.Train.net <- filter(study_panel.net, week <= 46) 
ride.Test.net <- filter(study_panel.net, week > 46)
ride.Test.weekNest <- 
  ride.Test %>%
  nest(-week) 

ride.Test.weekNest.net <- 
  ride.Test.net %>%
  nest(-week) 
model_pred <- function(dat, fit){
   pred <- predict(fit, newdata = dat)}
week_predictions <- 
  ride.Test.weekNest %>% 
    mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred),
           BSpace_FE = map(.x = data, fit = reg2, .f = model_pred),
           CTime_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
           DTime_Space_FE_timeLags = map(.x = data, fit = reg4, .f = model_pred),
           ETime_Space_FE_timeLags_holidayLags = map(.x = data, fit = reg5, .f = model_pred)) %>% 
    gather(Regression, Prediction, -data, -week) %>%
    mutate(Observed = map(data, pull, Trip_Count),
           Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
           MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
           sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))
week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
    scale_fill_manual(values = palette5) +
    labs(title = "Figure 5.1: Mean Absolute Errors by model specification and week") +
  plotTheme

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station)) %>%
    dplyr::select(interval60, start_station, Observed, Prediction, Regression) %>%
    unnest() %>%
    gather(Variable, Value, -Regression, -interval60, -start_station) %>%
    group_by(Regression, Variable, interval60) %>%
    summarize(Value = sum(Value)) %>%
    ggplot(aes(interval60, Value, colour=Variable)) + 
      geom_line(size = 1.1) + 
      facet_wrap(~Regression, ncol=1) +
      labs(title = "Figure 5.2: Predicted/Observed bike share time series", subtitle = "Philadelphia; A test set of 2 weeks",  x = "Hour", y= "Station Trips") +
      plotTheme

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station), 
           start_lat = map(data, pull, start_lat), 
           start_lon = map(data, pull, start_lon)) %>%
    dplyr::select(interval60, start_station, start_lon, start_lat, Observed, Prediction, Regression) %>%
    unnest() %>%
  filter(Regression == "DTime_Space_FE_timeLags") %>% ## keep model 4 only
  group_by(start_station, start_lon, start_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
  geom_sf(data = (PhillyTracts%>%
    st_transform(st_crs(4326))), color = "grey", fill = "transparent")+
  geom_point(aes(x = start_lon, y = start_lat, color = MAE), 
             fill = "transparent", alpha = 0.4)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "D")+
  ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
  xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
  labs(title="Figure 5.3: Mean Abs Error, Test Set, Model 4")+
  mapTheme

Mapping the errors of our best model (Figure 5.3) makes it clear that higher errors concentrate in Center City and follow a pattern similar to the overall peaks in demand. That is to say, this model may not generalize well enough to predict demand in Center City.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station), 
           start_lat = map(data, pull, start_lat), 
           start_lon = map(data, pull, start_lon),
           dotw = map(data, pull, dotw)) %>%
    dplyr::select(interval60, start_station, start_lon, 
           start_lat, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "DTime_Space_FE_timeLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  ggplot()+
  geom_point(aes(x= Observed, y = Prediction))+
    geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "red")+
    geom_abline(slope = 1, intercept = 0)+
  facet_grid(time_of_day~weekend)+
  labs(title="Figure 5.4: Observed vs Predicted",
       x="Observed trips", 
       y="Predicted trips")+
  plotTheme

According to Figure 5.4, the model seriously under-predicts ridership at all times of day and on both weekdays and weekends: actual ridership is typically higher than predicted, so the model is not yet accurate enough.
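
Under-prediction can also be checked numerically. Mean absolute error measures the size of the errors but not their direction, whereas the mean signed error (prediction minus observation) is negative when a model under-predicts on average. The sketch below uses made-up numbers purely for illustration; they are not taken from the models in this report.

```r
# Toy observed and predicted hourly trip counts (illustrative values only)
observed  <- c(0, 1, 3, 5, 8)
predicted <- c(0.5, 1, 2, 3, 4)

signed_error <- predicted - observed

mae_demo  <- mean(abs(signed_error))  # size of the errors, ignoring direction
bias_demo <- mean(signed_error)       # negative means under-prediction on average

c(MAE = mae_demo, Mean_Error = bias_demo)
# MAE = 1.5, Mean_Error = -1.3
```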

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station), 
           start_lat = map(data, pull, start_lat), 
           start_lon = map(data, pull, start_lon),
           dotw = map(data, pull, dotw)) %>%
    dplyr::select(interval60, start_station, start_lon, 
           start_lat, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "DTime_Space_FE_timeLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(start_station, weekend, time_of_day, start_lon, start_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = PhillyTracts%>%
    st_transform(st_crs(4326)), color = "grey", fill = "transparent")+
  geom_point(aes(x = start_lon, y = start_lat, color = MAE), 
             fill = "transparent", size = 0.5, alpha = 1)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "D")+
  ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
  xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
  facet_grid(weekend~time_of_day)+
  labs(title="Figure 5.5: Mean Absolute Errors, Test Set")+
  mapTheme

Examining the errors further by geography and time, we can see that the largest errors occur during AM rush hours in Lower North Philadelphia and the northern fringe of South Philadelphia. Errors are also high during PM rush hours in Center City and on the bridge connecting to West Philadelphia. High errors do not necessarily accompany high demand: weekday AM rush hours tend to have lower demand than PM rush hours (Figure 4.5), and weekend AM hours have higher overall errors than overnight hours even though their demand is lower. A larger mean absolute error means lower accuracy, and it also raises doubts about how well the model generalizes to certain places and times.

Cross validation

Temporal and spatial cross-validation are conducted to examine the models' generalizability to new data.

bikenetsample <- sample_n(study.panel, 10000)%>%
  na.omit()

study_panel.net <- study_panel.net[,colSums(is.na(study_panel.net)) < nrow(study_panel.net)]
cp_stu_pa.net <- na.omit(study_panel.net) 

bikenetsample.net <- sample_n(cp_stu_pa.net, 10000)%>%
  na.omit()

library(caret)
fitControl <- trainControl(method = "cv", 
                           number = 100,
                           savePredictions = TRUE)

set.seed(1000)
# for k-folds CV

reg.cv.k <-  
  train(Trip_Count ~ start_station +  hour(interval60) + dotw + Temperature + Precipitation +
          lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day, 
        data = bikenetsample,  
        method = "lm",  
        trControl = fitControl,  
        na.action = na.pass)

reg.cv.k
## Linear Regression 
## 
## 9712 samples
##   10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 9614, 9615, 9615, 9615, 9615, 9615, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.8660462  0.2714473  0.5275034
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
reg.cv.k.net <-
   train(Trip_Count ~ start_station +  hour(interval60) + dotw + Temperature + Precipitation +
          lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day + college.nn + national_park.nn + tourism.nn + shop.nn + cuisine.nn + office.nn + bus_station.nn + trolly_stops.nn + rail_station.nn + septaStops.nn + uniqueID + Total_Pop + Med_Inc + White_Pop + Travel_Time + Means_of_Transport + Total_Public_Trans + Med_Age + Percent_White + Mean_Commute_Time + Percent_Taking_Public_Trans, 
        data = bikenetsample.net,  
        method = "lm",  
        trControl = fitControl,  
        na.action = na.pass)

reg.cv.k.net
## Linear Regression 
## 
## 10000 samples
##    31 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 9900, 9900, 9900, 9900, 9901, 9901, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.9356777  0.2825749  0.5723044
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
ggplot(reg.cv.k$resample, aes(x=MAE)) +
  geom_histogram(bins = 30, colour="black", fill = "#FDE725FF") +
  labs(title = "Figure 6.1: Mean Absolute Error in Cross Validation Tests with model 4") 

ggplot(reg.cv.k.net$resample, aes(x=MAE)) +
  geom_histogram(bins = 30, colour="black", fill = "#FDE725FF") +
  labs(title = "Figure 6.2: Mean Absolute Error in Cross Validation Tests with model 4 plus Additional Features") 

The k-fold method is used here for the temporal cross-validation. The errors of our best time-lag model, DTime_Space_FE_timeLags, cluster around 0.5-0.55 (Figure 6.1), which is fair performance given the average ridership per station per hour. However, adding amenity and spatial features does not improve the model's temporal performance in this case (Figure 6.2).
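
For readers unfamiliar with what caret does internally, the sketch below spells out k-fold cross-validation by hand in base R: each observation is assigned to one of k folds, the model is fitted on the other k-1 folds, and the MAE is computed on the held-out fold. The simulated data, the choice of k = 5 and all names are assumptions made for illustration.

```r
set.seed(42)

# Simulated data: y depends linearly on x, plus noise
n <- 200
toy <- data.frame(x = runif(n))
toy$y <- 2 * toy$x + rnorm(n, sd = 0.1)

# Randomly assign each row to one of k folds of (nearly) equal size
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))

# Fit on k-1 folds, then measure MAE on the held-out fold
fold_mae <- sapply(seq_len(k), function(i) {
  fit  <- lm(y ~ x, data = toy[fold_id != i, ])
  pred <- predict(fit, newdata = toy[fold_id == i, ])
  mean(abs(toy$y[fold_id == i] - pred))
})

fold_mae        # one MAE per fold, the kind of values a resample histogram shows
mean(fold_mae)  # overall cross-validated MAE
```

Because the folds are drawn at random, this procedure tests generalization to unseen rows rather than to a strictly later time period.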

K-fold and LOGO-CV methods are both used here to examine how much adding spatial and amenity features improves model generalizability. The accumulated mean absolute error for each station is significantly reduced by adding the spatial process to the model (Figure 6.3). The errors in Center City are now much lower, with higher errors in southeastern Philly (Figure 6.4). The predictions by all four models (Figure 6.5) are very similar to the actual observations (Figure 6.6).
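
The LOGO-CV code is not included in this report. As a hedged sketch of the general technique only, leave-one-group-out cross-validation holds out all rows from one group (here, a station), fits the model on the remaining groups, and scores the held-out group. The simulated data, the single-predictor model and all names below are assumptions and do not reproduce the models compared in Figures 6.3-6.6.

```r
set.seed(1)

# Simulated station-hour data: four stations, 25 hours each
toy <- data.frame(
  station = rep(c("s1", "s2", "s3", "s4"), each = 25),
  lagHour = rpois(100, lambda = 2))
toy$Trip_Count <- toy$lagHour + rpois(100, lambda = 1)

# Hold out one station at a time, train on the others, score the held-out station
logo_mae <- sapply(unique(toy$station), function(g) {
  train <- toy[toy$station != g, ]
  test  <- toy[toy$station == g, ]
  fit   <- lm(Trip_Count ~ lagHour, data = train)
  mean(abs(test$Trip_Count - predict(fit, newdata = test)))
})

logo_mae  # one MAE per held-out station
```

Because the held-out station contributes nothing to the fit, a station identifier cannot be used as a predictor in this setup; the score reflects how well the other features transfer to an unseen location.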

Conclusions

Overall, I think my algorithm is more useful for predicting the overall spatial distribution of ridership hot spots than for predicting the next hour's bike-share demand at a particular station. Based on the current algorithm, I advise first allocating bikes roughly according to the predicted spatial pattern, assigning them to the hot spots, and then making minor adjustments among nearby hot spots.

To improve the current algorithm, data covering a longer time range should be used; the data here span only 5 weeks because of limits on computer processing power. Also, since demand in the next time period is most affected by the most recent time period, a smaller time lag in the temporal prediction should also help to improve the performance of the current algorithm.

