Bikeshare SpaceTime Prediction

CPLN 5920 - Spring 2026

Author

Zhuosi Yang

Published

April 23, 2026

1 INTRODUCTION

1.1 The Rebalancing Challenge in Philadelphia

Philadelphia’s Indego bike share system faces the same operational challenge as every bike share system: rebalancing bikes to meet anticipated demand.

Imagine I am an Indego operations manager at 6:00 AM on a Monday morning. I have:

  • 200 stations across Philadelphia
  • Limited trucks and staff for moving bikes
  • 2-3 hours before morning rush hour demand peaks
  • The question: Which stations will run out of bikes by 8:30 AM?

In this lab I build predictive models that forecast bike share demand across space (different stations) and time (different hours) to help solve this operational problem.

2 IN-CLASS: SPACE-TIME PREDICTION FOR Q1

2.1 Import and Prep Data

2.1.1 Load Indego Trip Data

Question Look at the data. What do you suspect the coordinate system is? What do you notice about the date and time structure?

Answer The start_lat / start_lon columns sit around 39.9 / -75.1, which are degree-based coordinates, so the data are in geographic WGS84 (EPSG:4326) rather than a projected CRS. The start_time / end_time fields are stored as "MM/DD/YYYY HH:MM" character strings rather than proper datetimes, so I will parse them with lubridate::mdy_hm() before doing any temporal aggregation.
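
As a quick sanity check on that parsing plan, a minimal sketch using a made-up timestamp (not a row from the file):

```r
library(lubridate)

# "MM/DD/YYYY HH:MM" strings parse cleanly with mdy_hm()
x <- mdy_hm("01/15/2025 08:37")

hour(x)                 # 8
wday(x, label = TRUE)   # Wed
```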

2.1.2 Examine the Data Structure

Next, I summarize the basic structure of the file — trip counts, date range, number of stations, and the categorical breakdowns for route type, passholder, and bike type.

Code
# How many trips?
n_trips <- nrow(indego)

# Date range
dt_parsed <- mdy_hm(indego$start_time)

# How many unique stations?
n_stations <- length(unique(indego$start_station))

overview_tbl <- tibble(
  Metric = c("Total trips (Q1 2025)", "Date range start", "Date range end", "Unique start stations"),
  Value  = c(format(n_trips, big.mark = ","),
             format(min(dt_parsed, na.rm = TRUE)),
             format(max(dt_parsed, na.rm = TRUE)),
             format(n_stations, big.mark = ","))
)

kable(overview_tbl, caption = "Q1 2025 Indego data overview") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Q1 2025 Indego data overview
Metric Value
Total trips (Q1 2025) 201,588
Date range start 2025-01-01
Date range end 2025-03-31 23:48:00
Unique start stations 265
Code
# Trip types
kable(as.data.frame(table(indego$trip_route_category)) %>%
        rename(Trip_Route_Category = Var1, Count = Freq),
      caption = "Trips by route category",
      format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Trips by route category
Trip_Route_Category Count
One Way 190,792
Round Trip 10,796
Code
# Passholder types
kable(as.data.frame(table(indego$passholder_type)) %>%
        rename(Passholder_Type = Var1, Count = Freq),
      caption = "Trips by passholder type",
      format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Trips by passholder type
Passholder_Type Count
Day Pass 5,494
Indego30 94,044
Indego365 91,628
IndegoFlex 3
Walk-up 10,419
Code
# Bike types
kable(as.data.frame(table(indego$bike_type)) %>%
        rename(Bike_Type = Var1, Count = Freq),
      caption = "Trips by bike type",
      format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Trips by bike type
Bike_Type Count
electric 129,561
standard 72,027

2.1.3 Create Time Bins

Next, I aggregate trips into hourly intervals for the panel data structure, using lubridate to parse datetimes and extract week, month, day-of-week, and hour features.

Code
indego <- indego %>%
  mutate(
    # Parse datetime
    start_datetime = mdy_hm(start_time),
    end_datetime = mdy_hm(end_time),

    # Create hourly bins
    interval60 = floor_date(start_datetime, unit = "hour"),

    # Extract time features
    week = week(interval60),
    month = month(interval60, label = TRUE, locale = "C"),
    dotw = wday(interval60, label = TRUE, locale = "C"),
    hour = hour(interval60),
    date = as.Date(interval60),

    # Create useful indicators
    weekend = ifelse(dotw %in% c("Sat", "Sun"), 1, 0),
    rush_hour = ifelse(hour %in% c(7, 8, 9, 16, 17, 18), 1, 0)
  )

# Look at temporal features
kable(head(indego %>% select(start_datetime, interval60, week, dotw, hour, weekend)),
      caption = "First rows of the hourly time features") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
First rows of the hourly time features
start_datetime interval60 week dotw hour weekend
2025-01-01 00:00:00 2025-01-01 1 Wed 0 0
2025-01-01 00:04:00 2025-01-01 1 Wed 0 0
2025-01-01 00:05:00 2025-01-01 1 Wed 0 0
2025-01-01 00:05:00 2025-01-01 1 Wed 0 0
2025-01-01 00:08:00 2025-01-01 1 Wed 0 0
2025-01-01 00:14:00 2025-01-01 1 Wed 0 0

Question Explain the interval60 feature to someone else at your table.

Answer interval60 is the hourly timestamp each trip falls into, built by flooring every start_datetime down to the top of its hour via floor_date(start_datetime, unit = "hour"). A trip that started at 08:37 and another at 08:58 on the same day both map to the same 08:00 bin, which lets me group every individual trip into station-hour buckets and later build a panel where each row represents one station in one hour.
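
A tiny illustration of that binning, with two made-up trips in the same hour and a third in the next:

```r
library(lubridate)

starts <- mdy_hm(c("01/06/2025 08:37", "01/06/2025 08:58", "01/06/2025 09:02"))
interval60 <- floor_date(starts, unit = "hour")

# The first two trips share the 08:00 bin; the third falls in the 09:00 bin
table(format(interval60, "%H:%M"))
```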

2.2 EDA of Indego Data

2.2.1 Trips Over Time

Code
# Daily trip counts
daily_trips <- indego %>%
  group_by(date) %>%
  summarize(trips = n())

ggplot(daily_trips, aes(x = date, y = trips)) +
  geom_line(color = "#3182bd", linewidth = 1) +
  geom_smooth(se = FALSE, color = "red", linetype = "dashed") +
  labs(
    title = "Indego Daily Ridership - Q1 2025",
    subtitle = "Winter demand patterns in Philadelphia",
    x = "Date",
    y = "Daily Trips",
    caption = "Source: Indego bike share"
  ) +
  plotTheme

Question: What patterns do you see? How does ridership change over time?

Answer: Daily ridership is noisy but trends upward from early January through late March as temperatures rise. There are several sharp one-day dips that line up with winter storms, and one huge outlier on February 14 that is far above any comparable Friday — the Eagles Super Bowl parade day. The underlying seasonal growth is real, but event-driven spikes dominate short-run variation and will need their own feature later.

Next I compare the 14th to all other Fridays in the data to confirm how large the Eagles-parade effect actually is. Keep this in mind for future Feature Engineering.

Code
fly_eagles_fly <- daily_trips %>% filter(date == "2025-02-14")

typical_boring_friday <- indego %>%
  filter(dotw == "Fri", date != "2025-02-14") %>%
  group_by(date) %>%
  summarize(trips = n()) %>%
  summarize(avg_friday_trips = mean(trips))

kable(fly_eagles_fly,
      caption = "Indego trips on 2025-02-14 (Eagles parade day)",
      format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Indego trips on 2025-02-14 (Eagles parade day)
date trips
2025-02-14 4,192
Code
kable(typical_boring_friday,
      digits = 1,
      caption = "Average Fridays in Q1 2025 (excluding Feb 14)",
      format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Average Fridays in Q1 2025 (excluding Feb 14)
avg_friday_trips
2,318.8

2.2.2 Hourly Patterns

Code
# Average trips by hour and day type
hourly_patterns <- indego %>%
  group_by(hour, weekend) %>%
  summarize(avg_trips = n() / n_distinct(date)) %>%
  mutate(day_type = ifelse(weekend == 1, "Weekend", "Weekday"))

ggplot(hourly_patterns, aes(x = hour, y = avg_trips, color = day_type)) +
  geom_line(linewidth = 1.2) +
  scale_color_manual(values = c("Weekday" = "#08519c", "Weekend" = "#6baed6")) +
  labs(
    title = "Average Hourly Ridership Patterns",
    subtitle = "Clear commute patterns on weekdays",
    x = "Hour of Day",
    y = "Average Trips per Hour",
    color = "Day Type"
  ) +
  plotTheme

Question: When are the peak hours? How do weekends differ from weekdays?

Answer: Weekdays show the classic twin-peak commuter signature: a morning peak around 7-9 AM and a larger evening peak around 4-6 PM, with a mid-day lull in between. Weekends look completely different: a single broad hump from roughly 10 AM through 6 PM, with lower peak intensity but higher mid-day volume than weekdays, consistent with leisure and discretionary riding rather than commuting.

2.2.3 Top Stations

Code
# Most popular origin stations
top_stations <- indego %>%
  count(start_station, start_lat, start_lon, name = "trips") %>%
  arrange(desc(trips)) %>%
  head(20)

kable(top_stations,
      caption = "Top 20 Indego Stations by Trip Origins",
      format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Top 20 Indego Stations by Trip Origins
start_station start_lat start_lon trips
3010 39.94711 -75.16618 3,999
3032 39.94527 -75.17971 2,842
3359 39.94888 -75.16978 2,699
3020 39.94855 -75.19007 2,673
3208 39.95048 -75.19324 2,503
3244 39.93865 -75.16674 2,486
3066 39.94561 -75.17348 2,396
3362 39.94816 -75.16226 2,387
3012 39.94218 -75.17747 2,361
3028 39.94061 -75.14958 2,348
3161 39.95486 -75.18091 2,278
3101 39.94295 -75.15955 2,274
3295 39.95028 -75.16027 2,160
3054 39.96250 -75.17420 2,123
3185 39.95169 -75.15888 2,116
3038 39.94781 -75.19409 2,111
3203 39.94077 -75.17227 2,106
3059 39.96244 -75.16121 2,027
3022 39.95472 -75.18323 2,014
3063 39.94633 -75.16980 2,014

2.3 Put Data in Context

2.3.1 Load Philadelphia Census Data

Next, I pull census tract data from the ACS to add demographic context to each station.

Code
# Get Philadelphia census tracts
philly_census <- get_acs(
  geography = "tract",
  variables = c(
    "B01003_001",  # Total population
    "B19013_001",  # Median household income
    "B08301_001",  # Total commuters
    "B08301_010",  # Commute by transit
    "B02001_002",  # White alone
    "B25077_001"   # Median home value
  ),
  state = "PA",
  county = "Philadelphia",
  year = 2022,
  geometry = TRUE,
  output = "wide"
) %>%
  rename(
    Total_Pop = B01003_001E,
    Med_Inc = B19013_001E,
    Total_Commuters = B08301_001E,
    Transit_Commuters = B08301_010E,
    White_Pop = B02001_002E,
    Med_Home_Value = B25077_001E
  ) %>%
  mutate(
    Percent_Taking_Transit = (Transit_Commuters / Total_Commuters) * 100,
    Percent_White = (White_Pop / Total_Pop) * 100
  ) %>%
  st_transform(crs = 4326)  #transforming to WGS84 for lat/lon matching, currently in NAD83

2.3.2 Map Philadelphia Context

Code
# Map median income
ggplot() +
  geom_sf(data = philly_census, aes(fill = Med_Inc), color = NA) +
  scale_fill_viridis(
    option = "viridis",
    name = "Median\nIncome",
    labels = scales::dollar
  ) +
  labs(
    title = "Philadelphia Median Household Income by Census Tract",
    subtitle = "Context for understanding bike share demand patterns"
  ) +
  # Stations
  geom_point(
    data = indego,
    aes(x = start_lon, y = start_lat),
    color = "red", size = 0.5, alpha = 0.6
  ) +
  mapTheme

Question These data are in WGS84. If you wanted to calculate a spatial distance-based variable, like a buffer, should you use the dataset as is?

Answer WGS84 is an angular coordinate system measured in degrees, so a buffer of “500 m” or a Euclidean distance isn’t well defined on it and will distort with latitude. For any distance-based feature (buffers, nearest-neighbor distances, k-ring aggregations) I should first reproject to a local planar CRS such as EPSG:2272 (PA State Plane South, US feet) or a UTM zone covering Philadelphia, so that distances are linear and in consistent metric units.
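
A back-of-the-envelope check on why degrees make poor distance units, using base-R arithmetic on a spherical-Earth approximation (the 6371 km radius is an assumed standard constant):

```r
# At Philadelphia's latitude (~39.95 N), one degree of longitude is much
# shorter than one degree of latitude, so a "buffer in degrees" is an ellipse
R_earth_km <- 6371
km_per_deg_lat <- R_earth_km * pi / 180                   # ~111 km anywhere
km_per_deg_lon <- km_per_deg_lat * cos(39.95 * pi / 180)  # ~85 km here

km_per_deg_lat / km_per_deg_lon  # a degree-based circle is ~30% too wide E-W
```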

2.3.3 Join Census Data to Stations

Next I spatially join census characteristics onto each bike station.

Code
# Create sf object for stations
stations_sf <- indego %>%
  distinct(start_station, start_lat, start_lon) %>%
  filter(!is.na(start_lat), !is.na(start_lon)) %>%
  st_as_sf(coords = c("start_lon", "start_lat"), crs = 4326)

# Spatial join to get census tract for each station
stations_census <- st_join(stations_sf, philly_census, left = TRUE) %>%
  st_drop_geometry()

# Look at the result - investigate whether all of the stations joined to census data -- according to the map above there are stations in non-residential tracts.

stations_for_map <- indego %>%
  distinct(start_station, start_lat, start_lon) %>%
  filter(!is.na(start_lat), !is.na(start_lon)) %>%
  left_join(
    stations_census %>% select(start_station, Med_Inc),
    by = "start_station"
  ) %>%
  mutate(has_census = !is.na(Med_Inc))

# Add back to trip data
indego_census <- indego %>%
  left_join(
    stations_census %>%
      select(start_station, Med_Inc, Percent_Taking_Transit,
             Percent_White, Total_Pop),
    by = "start_station"
  )



# Create the map showing problem stations
ggplot() +
  geom_sf(data = philly_census, aes(fill = Med_Inc), color = "white", size = 0.1) +
  scale_fill_viridis(
    option = "viridis",
    name = "Median\nIncome",
    labels = scales::dollar,
    na.value = "grey90"
  ) +
  # Stations with census data (small grey dots)
  geom_point(
    data = stations_for_map %>% filter(has_census),
    aes(x = start_lon, y = start_lat),
    color = "grey30", size = 1, alpha = 0.6
  ) +
  # Stations WITHOUT census data (red X marks the spot)
  geom_point(
    data = stations_for_map %>% filter(!has_census),
    aes(x = start_lon, y = start_lat),
    color = "red", size = 1, shape = 4, stroke = 1.5
  ) +
  labs(
    title = "Philadelphia Median Household Income by Census Tract",
    subtitle = "Indego stations shown (RED = no census data match)",
    caption = "Red X marks indicate stations that didn't join to census tracts"
  ) +
  mapTheme

2.3.4 Dealing with Missing Data

I need to decide what to do with the non-residential bike share stations. For this example I simply remove them — that is not necessarily the right choice in general, but for the sake of simplicity I am narrowing scope to stations in residential neighborhoods. A separate model for non-residential stations could be added later.

Code
# Identify which stations to keep
valid_stations <- stations_census %>%
  filter(!is.na(Med_Inc)) %>%
  pull(start_station)

# Filter trip data to valid stations only
indego_census <- indego %>%
  filter(start_station %in% valid_stations) %>%
  left_join(
    stations_census %>%
      select(start_station, Med_Inc, Percent_Taking_Transit,
             Percent_White, Total_Pop),
    by = "start_station"
  )

2.3.5 Get Weather Data

Weather significantly affects bike share demand. Next I pull hourly observations from the Philadelphia International Airport (KPHL) ASOS station.

Code
# Get weather from Philadelphia International Airport (KPHL)
# This covers Q1 2025: January 1 - March 31
# riem_measures() appears to treat date_end as exclusive (the panel join below
# turns up exactly one missing day), so request through April 1 to capture Mar 31
weather_data <- riem_measures(
  station = "PHL",  # Philadelphia International Airport
  date_start = "2025-01-01",
  date_end = "2025-04-01"
)

# Process weather data
weather_processed <- weather_data %>%
  mutate(
    interval60 = floor_date(valid, unit = "hour"),
    Temperature = tmpf,  # Temperature in Fahrenheit
    Precipitation = ifelse(is.na(p01i), 0, p01i),  # Hourly precip in inches
    Wind_Speed = sknt  # Wind speed in knots
  ) %>%
  select(interval60, Temperature, Precipitation, Wind_Speed) %>%
  distinct(interval60, .keep_all = TRUE)

# Check for missing hours and interpolate if needed
weather_complete <- weather_processed %>%
  complete(interval60 = seq(min(interval60), max(interval60), by = "hour")) %>%
  fill(Temperature, Precipitation, Wind_Speed, .direction = "down")

# Look at the weather
weather_complete %>%
  select(Temperature, Precipitation, Wind_Speed) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "value") %>%
  group_by(Variable) %>%
  summarise(
    Min    = min(value, na.rm = TRUE),
    Q1     = quantile(value, 0.25, na.rm = TRUE),
    Median = median(value, na.rm = TRUE),
    Mean   = mean(value, na.rm = TRUE),
    Q3     = quantile(value, 0.75, na.rm = TRUE),
    Max    = max(value, na.rm = TRUE),
    NAs    = sum(is.na(value)),
    .groups = "drop"
  ) %>%
  kable(digits = 2, caption = "Hourly weather summary (Q1 2025)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Hourly weather summary (Q1 2025)
Variable Min Q1 Median Mean Q3 Max NAs
Precipitation 0 0 0 0.00 0 0.28 0
Temperature 10 30 37 38.36 46 78.00 0
Wind_Speed 0 6 8 9.31 13 30.00 0

2.3.6 Visualize Weather Patterns

Code
ggplot(weather_complete, aes(x = interval60, y = Temperature)) +
  geom_line(color = "#3182bd", alpha = 0.7) +
  geom_smooth(se = FALSE, color = "red") +
  labs(
    title = "Philadelphia Temperature - Q1 2025",
    subtitle = "Winter to early spring transition",
    x = "Date",
    y = "Temperature (°F)"
  ) +
  plotTheme

2.4 Create Space-Time Panel

2.4.1 Aggregate Trips to Station-Hour Level

Code
# Count trips by station-hour
trips_panel <- indego_census %>%
  group_by(interval60, start_station, start_lat, start_lon,
           Med_Inc, Percent_Taking_Transit, Percent_White, Total_Pop) %>%
  summarize(Trip_Count = n()) %>%
  ungroup()

# How many station-hour observations?
# How many unique stations?
# How many unique hours?
panel_sizes <- tibble(
  Metric = c("Station-hour observations", "Unique stations", "Unique hours"),
  Count  = c(nrow(trips_panel),
             length(unique(trips_panel$start_station)),
             length(unique(trips_panel$interval60)))
)

kable(panel_sizes,
      caption = "Raw station-hour panel (before completion)",
      format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Raw station-hour panel (before completion)
Metric Count
Station-hour observations 116,718
Unique stations 245
Unique hours 2,150
Code
hist(trips_panel$Trip_Count)

2.4.2 Create Complete Panel Structure

Not every station has trips every hour. I need a complete panel where every station-hour combination exists (even if Trip_Count = 0).

Code
# Calculate expected panel size
n_stations <- length(unique(trips_panel$start_station))
n_hours <- length(unique(trips_panel$interval60))
expected_rows <- n_stations * n_hours

panel_check <- tibble(
  Metric = c("Expected panel rows", "Current rows", "Missing rows"),
  Count  = c(expected_rows, nrow(trips_panel), expected_rows - nrow(trips_panel))
)

kable(panel_check,
      caption = "Panel completeness check",
      format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Panel completeness check
Metric Count
Expected panel rows 526,750
Current rows 116,718
Missing rows 410,032
Code
# Create complete panel
study_panel <- expand_grid(
  interval60 = unique(trips_panel$interval60),
  start_station = unique(trips_panel$start_station)
) %>%
  # Join trip counts
  left_join(trips_panel %>%
              select(interval60, start_station, Trip_Count),
            by = c("interval60", "start_station")) %>%
  # Replace NA trip counts with 0
  mutate(Trip_Count = replace_na(Trip_Count, 0))

hist(study_panel$Trip_Count)

Code
# Fill in station attributes (they're the same for all hours)
station_attributes <- trips_panel %>%
  group_by(start_station) %>%
  summarize(
    start_lat = first(start_lat),
    start_lon = first(start_lon),
    Med_Inc = first(Med_Inc),
    Percent_Taking_Transit = first(Percent_Taking_Transit),
    Percent_White = first(Percent_White),
    Total_Pop = first(Total_Pop)
  )

study_panel <- study_panel %>%
  left_join(station_attributes, by = "start_station")

# Verify we have complete panel
kable(tibble(Metric = "Complete panel rows", Count = nrow(study_panel)),
      format.args = list(big.mark = ","),
      caption = "Final panel size") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Final panel size
Metric Count
Complete panel rows 526,750

2.4.3 Add Time Features

Code
study_panel <- study_panel %>%
  mutate(
    week = week(interval60),
    month = month(interval60, label = TRUE, locale = "C"),
    dotw = wday(interval60, label = TRUE, locale = "C"),
    hour = hour(interval60),
    date = as.Date(interval60),
    weekend = ifelse(dotw %in% c("Sat", "Sun"), 1, 0),
    rush_hour = ifelse(hour %in% c(7, 8, 9, 16, 17, 18), 1, 0)
  )

2.4.4 Join Weather Data

Code
study_panel <- study_panel %>%
  left_join(weather_complete, by = "interval60")

# Check for missing values
study_panel %>%
  select(Trip_Count, Temperature, Precipitation) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "value") %>%
  group_by(Variable) %>%
  summarise(
    Min    = min(value, na.rm = TRUE),
    Q1     = quantile(value, 0.25, na.rm = TRUE),
    Median = median(value, na.rm = TRUE),
    Mean   = mean(value, na.rm = TRUE),
    Q3     = quantile(value, 0.75, na.rm = TRUE),
    Max    = max(value, na.rm = TRUE),
    NAs    = sum(is.na(value)),
    .groups = "drop"
  ) %>%
  kable(digits = 2, caption = "Station-hour panel: key variables after weather join") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Station-hour panel: key variables after weather join
Variable Min Q1 Median Mean Q3 Max NAs
Precipitation 0 0 0 0.00 0 0.28 5880
Temperature 10 30 37 38.39 46 78.00 5880
Trip_Count 0 0 0 0.36 0 26.00 0

The 5,880 weather NAs are not random: 5,880 = 245 stations × 24 hours, exactly one full missing day, which points to the weather pull stopping one day short of March 31 rather than scattered sensor gaps.

2.5 Create Temporal Lag Variables

The key innovation for space-time prediction: past demand predicts future demand.

2.5.1 Why Temporal Lags?

If there were 15 bike trips from Station A at 8:00 AM, there will probably be ~15 trips at 9:00 AM. I can use this temporal persistence to improve predictions.

Note that I still can’t use spatial lags of the outcome as predictors. Temporal lags are different because time moves in one direction: the past can predict the future, but the future cannot leak back into the past. Spatial lags are non-directional, so predicting each station from its neighbors while those neighbors are in turn predicted from it creates circular prediction logic.

Code
# Sort by station and time
study_panel <- study_panel %>%
  arrange(start_station, interval60)

# Create lag variables WITHIN each station
study_panel <- study_panel %>%
  group_by(start_station) %>%
  mutate(
    lag1Hour = lag(Trip_Count, 1),
    lag2Hours = lag(Trip_Count, 2),
    lag3Hours = lag(Trip_Count, 3),
    lag12Hours = lag(Trip_Count, 12),
    lag1day = lag(Trip_Count, 24)
  ) %>%
  ungroup()

# Remove rows with NA lags (first 24 hours for each station)
study_panel_complete <- study_panel %>%
  filter(!is.na(lag1day))

kable(tibble(Metric = "Rows after removing NA lags", Count = nrow(study_panel_complete)),
      format.args = list(big.mark = ","),
      caption = "Panel size after dropping first-24-hour rows") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Panel size after dropping first-24-hour rows
Metric Count
Rows after removing NA lags 520,870

2.5.2 Visualize Lag Correlations

Code
# Sample one station to visualize
example_station <- study_panel_complete %>%
  filter(start_station == first(start_station)) %>%
  head(168)  # One week

# Plot actual vs lagged demand
ggplot(example_station, aes(x = interval60)) +
  geom_line(aes(y = Trip_Count, color = "Current"), linewidth = 1) +
  geom_line(aes(y = lag1Hour, color = "1 Hour Ago"), linewidth = 1, alpha = 0.7) +
  geom_line(aes(y = lag1day, color = "24 Hours Ago"), linewidth = 1, alpha = 0.7) +
  scale_color_manual(values = c(
    "Current" = "#08519c",
    "1 Hour Ago" = "#3182bd",
    "24 Hours Ago" = "#6baed6"
  )) +
  labs(
    title = "Temporal Lag Patterns at One Station",
    subtitle = "Past demand predicts future demand",
    x = "Date-Time",
    y = "Trip Count",
    color = "Time Period"
  ) +
  plotTheme

2.6 Temporal Train/Test Split

Code
# Split by week
# Q1 has weeks 1-13 (Jan-Mar)
# Train on weeks 1-9 (Jan 1 - early March)
# Test on weeks 10-13 (rest of March)

# Which stations have trips in BOTH early and late periods?
early_stations <- study_panel_complete %>%
  filter(week < 10) %>%
  filter(Trip_Count > 0) %>%
  distinct(start_station) %>%
  pull(start_station)

late_stations <- study_panel_complete %>%
  filter(week >= 10) %>%
  filter(Trip_Count > 0) %>%
  distinct(start_station) %>%
  pull(start_station)

# Keep only stations that appear in BOTH periods
common_stations <- intersect(early_stations, late_stations)


# Filter panel to only common stations
study_panel_complete <- study_panel_complete %>%
  filter(start_station %in% common_stations)

# NOW create train/test split
train <- study_panel_complete %>%
  filter(week < 10)

test <- study_panel_complete %>%
  filter(week >= 10)

split_summary <- tibble(
  Split            = c("Training", "Testing"),
  Observations     = c(nrow(train), nrow(test)),
  `Date range start` = c(as.character(min(train$date)), as.character(min(test$date))),
  `Date range end`   = c(as.character(max(train$date)), as.character(max(test$date)))
)

kable(split_summary,
      caption = "Temporal train/test split",
      format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Temporal train/test split
Split Observations Date range start Date range end
Training 352,240 2025-01-02 2025-03-04
Testing 153,748 2025-03-05 2025-03-31

2.7 Build Predictive Models

I will build 5 models with increasing complexity to see what improves predictions.

2.7.1 Model 1: Baseline (Time + Weather)

Code
# Create day of week factor with treatment (dummy) coding
train <- train %>%
  mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))

# Set contrasts to treatment coding (dummy variables)
contrasts(train$dotw_simple) <- contr.treatment(7)

# Now run the model
model1 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation,
  data = train
)

# Coefficients table
m1_coefs <- as.data.frame(summary(model1)$coefficients) %>%
  rownames_to_column("Term")

kable(m1_coefs, digits = 3,
      caption = "Model 1: Coefficients") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Model 1: Coefficients
Term Estimate Std. Error t value Pr(>&#124;t&#124;)
(Intercept) -0.187 0.009 -21.406 0.000
as.factor(hour)1 -0.025 0.009 -2.910 0.004
as.factor(hour)2 -0.036 0.009 -4.135 0.000
as.factor(hour)3 -0.053 0.009 -5.991 0.000
as.factor(hour)4 -0.036 0.009 -4.172 0.000
as.factor(hour)5 0.023 0.009 2.652 0.008
as.factor(hour)6 0.174 0.009 20.014 0.000
as.factor(hour)7 0.324 0.009 37.183 0.000
as.factor(hour)8 0.550 0.009 63.044 0.000
as.factor(hour)9 0.390 0.009 44.771 0.000
as.factor(hour)10 0.281 0.009 32.209 0.000
as.factor(hour)11 0.296 0.009 33.948 0.000
as.factor(hour)12 0.368 0.009 42.166 0.000
as.factor(hour)13 0.349 0.009 40.183 0.000
as.factor(hour)14 0.352 0.009 40.512 0.000
as.factor(hour)15 0.408 0.009 46.972 0.000
as.factor(hour)16 0.490 0.009 56.369 0.000
as.factor(hour)17 0.615 0.009 70.623 0.000
as.factor(hour)18 0.425 0.009 48.712 0.000
as.factor(hour)19 0.271 0.009 31.072 0.000
as.factor(hour)20 0.150 0.009 17.195 0.000
as.factor(hour)21 0.088 0.009 10.144 0.000
as.factor(hour)22 0.069 0.009 7.920 0.000
as.factor(hour)23 0.042 0.009 4.799 0.000
dotw_simple2 0.053 0.005 11.384 0.000
dotw_simple3 0.059 0.005 12.199 0.000
dotw_simple4 0.033 0.005 7.026 0.000
dotw_simple5 0.043 0.005 9.237 0.000
dotw_simple6 -0.065 0.005 -14.006 0.000
dotw_simple7 -0.063 0.005 -13.390 0.000
Temperature 0.008 0.000 51.702 0.000
Precipitation -2.503 0.191 -13.069 0.000
Code
# Fit statistics
kable(tibble(
    Metric = c("R²", "Adjusted R²", "Residual SE", "N obs"),
    Value  = c(summary(model1)$r.squared,
               summary(model1)$adj.r.squared,
               summary(model1)$sigma,
               nobs(model1))
  ),
  digits = 4,
  caption = "Model 1: Fit statistics",
  format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Model 1: Fit statistics
Metric Value
R² 0.0788
Adjusted R² 0.0787
Residual SE 0.7460
N obs 352,240.0000

The model uses Monday as the baseline. Each day-of-week coefficient is the difference in expected trips per station-hour relative to Monday (dotw_simple2 = Tuesday, dotw_simple3 = Wednesday, and so on).

Weekday Pattern (Tue-Fri):

  • All weekdays have positive coefficients (0.033 to 0.059)
  • Wednesday has the highest weekday effect (+0.059)
  • Weekdays likely benefit from concentrated commuting patterns

Weekend Pattern (Sat-Sun):

  • Both weekend days have negative coefficients (-0.065 and -0.063)
  • This means FEWER trips per station-hour than Monday

Hourly Interpretation

Hour Coefficient Interpretation
0 (baseline) 0.000 trips/hour (midnight)
1 -0.025 slightly fewer than midnight
6 +0.174 morning activity starting
7 +0.324 morning rush building
8 +0.550 PEAK morning rush
9 +0.390 post-rush
17 +0.615 PEAK evening rush (5 PM!)
18 +0.425 evening declining
23 +0.042 late night minimal
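
To make the treatment coding concrete, here is a hypothetical fitted-value calculation assembled by hand from a few of Model 1's reported coefficients (a Tuesday at 8 AM, 40°F, no rain; illustrative only, not a substitute for predict()):

```r
# Selected Model 1 coefficients, copied from the table above
b <- c(intercept = -0.187, hour8 = 0.550, tuesday = 0.053,
       temperature = 0.008, precipitation = -2.503)

# Expected trips per station-hour: Tuesday, 8 AM, 40 F, no precipitation
pred <- b[["intercept"]] + b[["hour8"]] + b[["tuesday"]] +
  b[["temperature"]] * 40 + b[["precipitation"]] * 0

round(pred, 2)   # about 0.74 trips per station-hour
```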

2.7.2 Model 2: Add Temporal Lags

Code
model2 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day,
  data = train
)

m2_coefs <- as.data.frame(summary(model2)$coefficients) %>%
  rownames_to_column("Term")

kable(m2_coefs, digits = 3,
      caption = "Model 2: Coefficients") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Model 2: Coefficients
Term Estimate Std. Error t value Pr(>&#124;t&#124;)
(Intercept) -0.098 0.008 -12.298 0.000
as.factor(hour)1 -0.007 0.008 -0.912 0.362
as.factor(hour)2 -0.006 0.008 -0.704 0.481
as.factor(hour)3 -0.010 0.008 -1.228 0.219
as.factor(hour)4 0.007 0.008 0.916 0.360
as.factor(hour)5 0.054 0.008 6.794 0.000
as.factor(hour)6 0.160 0.008 20.032 0.000
as.factor(hour)7 0.240 0.008 30.046 0.000
as.factor(hour)8 0.376 0.008 46.779 0.000
as.factor(hour)9 0.181 0.008 22.521 0.000
as.factor(hour)10 0.119 0.008 14.838 0.000
as.factor(hour)11 0.135 0.008 16.799 0.000
as.factor(hour)12 0.204 0.008 25.521 0.000
as.factor(hour)13 0.186 0.008 23.354 0.000
as.factor(hour)14 0.193 0.008 24.256 0.000
as.factor(hour)15 0.230 0.008 28.874 0.000
as.factor(hour)16 0.280 0.008 34.994 0.000
as.factor(hour)17 0.359 0.008 44.750 0.000
as.factor(hour)18 0.172 0.008 21.466 0.000
as.factor(hour)19 0.090 0.008 11.191 0.000
as.factor(hour)20 0.015 0.008 1.931 0.053
as.factor(hour)21 0.013 0.008 1.672 0.094
as.factor(hour)22 0.026 0.008 3.235 0.001
as.factor(hour)23 0.020 0.008 2.578 0.010
dotw_simple2 0.020 0.004 4.735 0.000
dotw_simple3 0.011 0.004 2.431 0.015
dotw_simple4 -0.002 0.004 -0.395 0.693
dotw_simple5 0.011 0.004 2.612 0.009
dotw_simple6 -0.072 0.004 -16.871 0.000
dotw_simple7 -0.043 0.004 -10.067 0.000
Temperature 0.004 0.000 25.920 0.000
Precipitation -1.373 0.175 -7.837 0.000
lag1Hour 0.229 0.002 139.865 0.000
lag3Hours 0.097 0.002 60.726 0.000
lag1day 0.235 0.002 145.387 0.000
Code
kable(tibble(
    Metric = c("R²", "Adjusted R²", "Residual SE", "N obs"),
    Value  = c(summary(model2)$r.squared,
               summary(model2)$adj.r.squared,
               summary(model2)$sigma,
               nobs(model2))
  ),
  digits = 4,
  caption = "Model 2: Fit statistics",
  format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Model 2: Fit statistics
Metric Value
R² 0.2293
Adjusted R² 0.2292
Residual SE 0.6823
N obs 352,240.0000

Question: Did adding lags improve R²? Why or why not?

Answer: Yes. R² jumps substantially from Model 1 to Model 2. Recent trip counts at the same station (lag1Hour, lag3Hours, lag1day) silently encode many things that plain hour/day/weather dummies cannot: station-specific popularity, persistent behavior of nearby commuters, and short-term shocks like events or closures. Past demand is by far the strongest single predictor of near-future demand here, which is exactly the signal temporal lags are designed to capture.
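
A self-contained toy example of that mechanism (simulated data, not Indego): two stations differ only in baseline popularity, which no covariate records, yet the 1-hour lag absorbs it.

```r
set.seed(1)

# Two stations with different base demand; nothing else distinguishes them
base_rate <- rep(c(2, 10), each = 100)
df <- data.frame(station = rep(c("A", "B"), each = 100),
                 trips = rpois(200, base_rate))

# 1-hour lag within each station (lm drops the leading NA rows automatically)
df$lag1 <- ave(df$trips, df$station,
               FUN = function(v) c(NA, head(v, -1)))

m_base <- lm(trips ~ 1, data = df)      # intercept only: R^2 of 0
m_lag  <- lm(trips ~ lag1, data = df)   # the lag soaks up the station effect

summary(m_lag)$r.squared > summary(m_base)$r.squared   # TRUE
```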

2.7.3 Model 3: Add Demographics

Code
model3 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day +
    Med_Inc + Percent_Taking_Transit + Percent_White,
  data = train
)

m3_coefs <- as.data.frame(summary(model3)$coefficients) %>%
  rownames_to_column("Term")

kable(m3_coefs, digits = 3,
      caption = "Model 3: Coefficients") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Model 3: Coefficients
Term Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.193 0.009 -20.918 0.000
as.factor(hour)1 -0.008 0.008 -1.001 0.317
as.factor(hour)2 -0.007 0.008 -0.862 0.388
as.factor(hour)3 -0.012 0.008 -1.464 0.143
as.factor(hour)4 0.005 0.008 0.668 0.504
as.factor(hour)5 0.052 0.008 6.605 0.000
as.factor(hour)6 0.160 0.008 20.061 0.000
as.factor(hour)7 0.242 0.008 30.397 0.000
as.factor(hour)8 0.381 0.008 47.586 0.000
as.factor(hour)9 0.188 0.008 23.484 0.000
as.factor(hour)10 0.125 0.008 15.678 0.000
as.factor(hour)11 0.142 0.008 17.781 0.000
as.factor(hour)12 0.211 0.008 26.436 0.000
as.factor(hour)13 0.192 0.008 24.185 0.000
as.factor(hour)14 0.199 0.008 25.072 0.000
as.factor(hour)15 0.237 0.008 29.820 0.000
as.factor(hour)16 0.288 0.008 36.082 0.000
as.factor(hour)17 0.369 0.008 46.046 0.000
as.factor(hour)18 0.182 0.008 22.714 0.000
as.factor(hour)19 0.097 0.008 12.178 0.000
as.factor(hour)20 0.022 0.008 2.786 0.005
as.factor(hour)21 0.017 0.008 2.177 0.030
as.factor(hour)22 0.028 0.008 3.529 0.000
as.factor(hour)23 0.021 0.008 2.711 0.007
dotw_simple2 0.022 0.004 5.054 0.000
dotw_simple3 0.013 0.004 2.856 0.004
dotw_simple4 0.000 0.004 -0.102 0.919
dotw_simple5 0.012 0.004 2.909 0.004
dotw_simple6 -0.072 0.004 -16.934 0.000
dotw_simple7 -0.044 0.004 -10.332 0.000
Temperature 0.004 0.000 27.273 0.000
Precipitation -1.424 0.175 -8.152 0.000
lag1Hour 0.221 0.002 135.194 0.000
lag3Hours 0.089 0.002 55.514 0.000
lag1day 0.227 0.002 140.046 0.000
Med_Inc 0.000 0.000 5.794 0.000
Percent_Taking_Transit -0.001 0.000 -6.195 0.000
Percent_White 0.002 0.000 24.442 0.000
Code
kable(tibble(
    Metric = c("R²", "Adjusted R²", "Residual SE", "N obs"),
    Value  = c(summary(model3)$r.squared,
               summary(model3)$adj.r.squared,
               summary(model3)$sigma,
               nobs(model3))
  ),
  digits = 4,
  caption = "Model 3: Fit statistics",
  format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Model 3: Fit statistics
Metric Value
R² 0.2339
Adjusted R² 0.2338
Residual SE 0.6803
N obs 352,240.0000

2.7.4 Model 4: Add Station Fixed Effects

Code
model4 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day +
    Med_Inc + Percent_Taking_Transit + Percent_White +
    as.factor(start_station),
  data = train
)

# Summary too long with all station dummies, just show key metrics
kable(tibble(
    Metric = c("R²", "Adjusted R²", "Residual SE", "N obs", "# coefficients"),
    Value  = c(summary(model4)$r.squared,
               summary(model4)$adj.r.squared,
               summary(model4)$sigma,
               nobs(model4),
               length(coef(model4)))
  ),
  digits = 4,
  caption = "Model 4: Fit statistics (station FE; coef table suppressed)",
  format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Model 4: Fit statistics (station FE; coef table suppressed)
Metric Value
0.2623
Adjusted R² 0.2617
Residual SE 0.6678
N obs 352,240.0000
# coefficients 275.0000

What do station fixed effects capture? Baseline differences in demand across stations (some are just busier than others!).
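A quick sanity check on that intuition: in a toy regression with two fake stations (assumed values, not Indego data), the station dummy recovers the baseline demand gap while leaving the shared slope intact.

```r
# Two toy stations with different baseline demand (A = 1, B = 3) but the
# same response to a covariate x; the station dummy absorbs the gap.
set.seed(7)
toy <- data.frame(station = rep(c("A", "B"), each = 50), x = rnorm(100))
toy$y <- ifelse(toy$station == "A", 1, 3) + 0.5 * toy$x + rnorm(100, sd = 0.1)

fe_fit <- lm(y ~ x + factor(station), data = toy)
gap <- unname(coef(fe_fit)["factor(station)B"])  # close to the true gap of 2
```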

2.7.5 Model 5: Add Rush Hour Interaction

Code
model5 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day + rush_hour + as.factor(month) +
    Med_Inc + Percent_Taking_Transit + Percent_White +
    as.factor(start_station) +
    rush_hour * weekend,  # Rush hour effects different on weekends
  data = train
)

kable(tibble(
    Metric = c("R²", "Adjusted R²", "Residual SE", "N obs", "# coefficients"),
    Value  = c(summary(model5)$r.squared,
               summary(model5)$adj.r.squared,
               summary(model5)$sigma,
               nobs(model5),
               length(coef(model5)))
  ),
  digits = 4,
  caption = "Model 5: Fit statistics (coef table suppressed)",
  format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Model 5: Fit statistics (coef table suppressed)
Metric Value
0.2672
Adjusted R² 0.2666
Residual SE 0.6656
N obs 352,240.0000
# coefficients 280.0000

2.8 Model Evaluation

2.8.1 Calculate Predictions and MAE

Code
# Get predictions on test set

# Create day of week factor with treatment (dummy) coding
test <- test %>%
  mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))

# Set contrasts to treatment coding (dummy variables)
contrasts(test$dotw_simple) <- contr.treatment(7)

test <- test %>%
  mutate(
    pred1 = predict(model1, newdata = test),
    pred2 = predict(model2, newdata = test),
    pred3 = predict(model3, newdata = test),
    pred4 = predict(model4, newdata = test),
    pred5 = predict(model5, newdata = test)
  )

# Calculate MAE for each model
mae_results <- data.frame(
  Model = c(
    "1. Time + Weather",
    "2. + Temporal Lags",
    "3. + Demographics",
    "4. + Station FE",
    "5. + Rush Hour Interaction"
  ),
  MAE = c(
    mean(abs(test$Trip_Count - test$pred1), na.rm = TRUE),
    mean(abs(test$Trip_Count - test$pred2), na.rm = TRUE),
    mean(abs(test$Trip_Count - test$pred3), na.rm = TRUE),
    mean(abs(test$Trip_Count - test$pred4), na.rm = TRUE),
    mean(abs(test$Trip_Count - test$pred5), na.rm = TRUE)
  )
)

kable(mae_results,
      digits = 2,
      caption = "Mean Absolute Error by Model (Test Set)",
      col.names = c("Model", "MAE (trips)")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Mean Absolute Error by Model (Test Set)
Model MAE (trips)
1. Time + Weather 0.62
2. + Temporal Lags 0.54
3. + Demographics 0.53
4. + Station FE 0.53
5. + Rush Hour Interaction 0.53
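The five `mean(abs(...))` lines above repeat the same formula; a small helper keeps it in one place (`mae` is a name I introduce, not part of the lab template):

```r
# MAE helper: one definition instead of five copy-pasted expressions.
mae <- function(actual, predicted) mean(abs(actual - predicted), na.rm = TRUE)

mae(c(1, 2, 3), c(1, 3, 5))  # (0 + 1 + 2) / 3 = 1
```

With this, each row of `mae_results` reduces to a call like `mae(test$Trip_Count, test$pred1)`.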

2.8.2 Visualize Model Comparison

Code
ggplot(mae_results, aes(x = reorder(Model, -MAE), y = MAE)) +
  geom_col(fill = "#3182bd", alpha = 0.8) +
  geom_text(aes(label = round(MAE, 2)), vjust = -0.5) +
  labs(
    title = "Model Performance Comparison",
    subtitle = "Lower MAE = Better Predictions",
    x = "Model",
    y = "Mean Absolute Error (trips)"
  ) +
  plotTheme +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Question: Which features gave us the biggest improvement?

Answer: Temporal lags give by far the largest improvement: test MAE drops from 0.62 to 0.54 between Model 1 and Model 2. Station fixed effects lift in-sample R² noticeably (0.234 → 0.262) because they absorb persistent station-level popularity differences that demographics alone cannot explain, but they barely move out-of-sample MAE. Demographics (Model 2 → Model 3) and the rush-hour × weekend interaction (Model 4 → Model 5) likewise shave only about a hundredth of a trip off MAE. Once the lags are in the model, everything else is marginal.

2.9 Space-Time Error Analysis

2.9.1 Observed vs. Predicted

I will use Model 2 for error analysis: its test MAE is essentially tied with the more complex models, and its smaller specification makes the error patterns easier to interpret.

Code
test <- test %>%
  mutate(
    error = Trip_Count - pred2,
    abs_error = abs(error),
    time_of_day = case_when(
      hour < 7 ~ "Overnight",
      hour >= 7 & hour < 10 ~ "AM Rush",
      hour >= 10 & hour < 15 ~ "Mid-Day",
      hour >= 15 & hour <= 18 ~ "PM Rush",
      hour > 18 ~ "Evening"
    )
  )

# Scatter plot by time and day type
ggplot(test, aes(x = Trip_Count, y = pred2)) +
  geom_point(alpha = 0.2, color = "#3182bd") +
  geom_abline(slope = 1, intercept = 0, color = "red", linewidth = 1) +
  geom_smooth(method = "lm", se = FALSE, color = "darkgreen") +
  facet_grid(weekend ~ time_of_day) +
  labs(
    title = "Observed vs. Predicted Bike Trips",
    subtitle = "Model 2 performance by time period",
    x = "Observed Trips",
    y = "Predicted Trips",
    caption = "Red line = perfect predictions; Green line = actual model fit"
  ) +
  plotTheme

Question: Where is the model performing well? Where is it struggling?

Answer: The green fit line hugs the red 1:1 line for overnight, mid-day and weekend panels, where demand is low and close to zero. Therefore the model is accurate in the quiet part of the distribution. It struggles in AM- and PM-rush weekday panels: the fit line tilts well below 1:1, which means the model systematically under-predicts the highest-demand hours at the busiest stations. In operational terms, the places and times where accuracy matters most (rebalancing before rush hour at big stations) are also where the model is weakest.

2.9.2 Spatial Error Patterns

Are prediction errors clustered in certain parts of Philadelphia?

Code
# Calculate MAE by station (dropping rows without a prediction first)
station_errors <- test %>%
  filter(!is.na(pred2)) %>%
  group_by(start_station, start_lat, start_lon) %>%
  summarize(
    MAE = mean(abs_error, na.rm = TRUE),
    avg_demand = mean(Trip_Count, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  filter(!is.na(start_lat), !is.na(start_lon))

# Map 1: Prediction Errors
p1 <- ggplot() +
  geom_sf(data = philly_census, fill = "grey95", color = "white", size = 0.1) +
  geom_point(
    data = station_errors,
    aes(x = start_lon, y = start_lat, color = MAE),
    size = 3.5,
    alpha = 0.7
  ) +
  scale_color_viridis(
    option = "plasma",
    name = "MAE (trips)",
    direction = -1,
    breaks = c(0.5, 1.0, 1.5),
    labels = c("0.5", "1.0", "1.5")
  ) +
  labs(title = "Prediction Errors") +
  mapTheme +
  theme(
    legend.position = "bottom",
    legend.title = element_text(size = 10, face = "bold"),
    legend.text = element_text(size = 9),
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5)
  ) +
  guides(color = guide_colorbar(
    barwidth = 12,
    barheight = 1,
    title.position = "top",
    title.hjust = 0.5
  ))

# Map 2: Average Demand
p2 <- ggplot() +
  geom_sf(data = philly_census, fill = "grey95", color = "white", size = 0.1) +
  geom_point(
    data = station_errors,
    aes(x = start_lon, y = start_lat, color = avg_demand),
    size = 3.5,
    alpha = 0.7
  ) +
  scale_color_viridis(
    option = "viridis",
    name = "Avg Demand (trips/hour)",
    direction = -1,
    breaks = c(0.5, 1.0, 1.5, 2.0, 2.5),
    labels = c("0.5", "1.0", "1.5", "2.0", "2.5")
  ) +
  labs(title = "Average Demand") +
  mapTheme +
  theme(
    legend.position = "bottom",
    legend.title = element_text(size = 10, face = "bold"),
    legend.text = element_text(size = 9),
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5)
  ) +
  guides(color = guide_colorbar(
    barwidth = 12,
    barheight = 1,
    title.position = "top",
    title.hjust = 0.5
  ))

# Combine
(p1 | p2)

Question: Do you see spatial clustering of errors? What neighborhoods have high errors?

Answer: Yes. MAE is clearly spatially clustered. The highest-error stations concentrate in Center City (around Rittenhouse / Washington Square / Market Street) and across the University City corridor (Penn and Drexel), which is also where average demand is highest. In quieter residential and outer neighborhoods errors are small and fairly uniform. The map visually confirms that error magnitude tracks demand magnitude, so the model is mostly making proportionally reasonable mistakes in the busiest commuter and campus hubs.
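A proper test of this clustering would be Moran's I on the real `station_errors`, but the idea can be sketched on toy stations: correlate each station's MAE with the mean MAE of its three nearest neighbors, where a clearly positive value indicates spatial clustering.

```r
# Toy clustering check: errors that rise toward one corner of the map are
# spatially clustered, so each station's MAE tracks its neighbors' MAE.
set.seed(1)
n <- 60
lon <- runif(n); lat <- runif(n)
station_mae <- lon + lat + rnorm(n, sd = 0.1)  # spatially structured errors

d <- as.matrix(dist(cbind(lon, lat)))
diag(d) <- Inf                                  # exclude self from neighbors
nb_mean <- sapply(seq_len(n), function(i) mean(station_mae[order(d[i, ])[1:3]]))
cor(station_mae, nb_mean)                       # well above 0 under clustering
```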

2.9.3 Temporal Error Patterns

When is the model most wrong?

Code
# MAE by time of day and day type
temporal_errors <- test %>%
  group_by(time_of_day, weekend) %>%
  summarize(
    MAE = mean(abs_error, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(day_type = ifelse(weekend == 1, "Weekend", "Weekday"))

ggplot(temporal_errors, aes(x = time_of_day, y = MAE, fill = day_type)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("Weekday" = "#08519c", "Weekend" = "#6baed6")) +
  labs(
    title = "Prediction Errors by Time Period",
    subtitle = "When is the model struggling most?",
    x = "Time of Day",
    y = "Mean Absolute Error (trips)",
    fill = "Day Type"
  ) +
  plotTheme +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

2.9.4 Errors and Demographics

Are prediction errors related to neighborhood characteristics?

Code
# Join demographic data to station errors
station_errors_demo <- station_errors %>%
  left_join(
    station_attributes %>% select(start_station, Med_Inc, Percent_Taking_Transit, Percent_White),
    by = "start_station"
  ) %>%
  filter(!is.na(Med_Inc))

# Create plots
p1 <- ggplot(station_errors_demo, aes(x = Med_Inc, y = MAE)) +
  geom_point(alpha = 0.5, color = "#3182bd") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  scale_x_continuous(labels = scales::dollar) +
  labs(title = "Errors vs. Median Income", x = "Median Income", y = "MAE") +
  plotTheme

p2 <- ggplot(station_errors_demo, aes(x = Percent_Taking_Transit, y = MAE)) +
  geom_point(alpha = 0.5, color = "#3182bd") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Errors vs. Transit Usage", x = "% Taking Transit", y = "MAE") +
  plotTheme

p3 <- ggplot(station_errors_demo, aes(x = Percent_White, y = MAE)) +
  geom_point(alpha = 0.5, color = "#3182bd") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Errors vs. Race", x = "% White", y = "MAE") +
  plotTheme

(p1 | p2) / (p3 | plot_spacer())

Critical Question: Are prediction errors systematically higher in certain demographic groups? What are the equity implications?

Answer: Errors trend upward with median income, transit-commuter share, and percent-white, but mostly because those tracts happen to contain the highest-demand Center City and University City stations. Even so, that has real equity consequences. If Indego uses raw MAE or raw predicted demand to prioritize rebalancing, attention flows to already-well-served, whiter, higher-income neighborhoods. In lower-demand, more vulnerable neighborhoods one missing bike is a much larger share of effective supply, so small absolute errors there can translate into large service failures. A fairer deployment should weight errors by station capacity or by demand relative to supply, and monitor equity-specific metrics alongside overall MAE.
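One concrete version of that weighting, sketched as a hypothetical helper (`relative_mae` and its `eps` offset are my names, not part of the lab): scale each miss by local demand, so small absolute errors at quiet stations are not ignored.

```r
# Demand-relative MAE: a one-trip miss counts for more where demand is low.
# eps keeps zero-demand station-hours from dividing by zero.
relative_mae <- function(actual, predicted, eps = 0.5) {
  mean(abs(actual - predicted) / (actual + eps), na.rm = TRUE)
}

# The same one-trip miss weighs 1/11 at a busy station but 1/2 at a quiet one.
relative_mae(c(10, 1), c(9, 0), eps = 1)
```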

3 HW5: SPACETIME PREDICTION FOR Q3

This section extends the Q1 walkthrough above to Q3 2025 (July 1 – September 30). Two things make Q3 operationally different from Q1 and worth a separate look: (i) school is out from late June through late August, which shifts weekday demand from commuting toward discretionary / leisure riding, and (ii) summer brings multi-day heat waves with afternoon temperatures above 90 °F that suppress ridership even though the calendar looks like a normal workday. Below I replicate the same workflow, compare results to Q1, diagnose errors, engineer summer-specific features, and reflect on deployment.

3.1 Import and Prep Data

3.1.1 Load Indego Trip Data

3.1.2 Examine the Data Structure

I summarize the Q3 file, including trip counts, date range, number of stations, and the categorical breakdowns.

Code
# How many trips?
n_trips_q3 <- nrow(indego_q3)

# Date range
dt_parsed_q3 <- mdy_hm(indego_q3$start_time)

# How many unique stations?
n_stations_q3 <- length(unique(indego_q3$start_station))

overview_tbl_q3 <- tibble(
  Metric = c("Total trips (Q3 2025)", "Date range start", "Date range end", "Unique start stations"),
  Value  = c(format(n_trips_q3, big.mark = ","),
             format(min(dt_parsed_q3, na.rm = TRUE)),
             format(max(dt_parsed_q3, na.rm = TRUE)),
             format(n_stations_q3, big.mark = ","))
)

kable(overview_tbl_q3, caption = "Q3 2025 Indego data overview") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Q3 2025 Indego data overview
Metric Value
Total trips (Q3 2025) 465,464
Date range start 2025-07-01 00:06:00
Date range end 2025-09-30 23:58:00
Unique start stations 284
Code
# Trip types
kable(as.data.frame(table(indego_q3$trip_route_category)) %>%
        rename(Trip_Route_Category = Var1, Count = Freq),
      caption = "Trips by route category",
      format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Trips by route category
Trip_Route_Category Count
One Way 434,518
Round Trip 30,946
Code
# Passholder types
kable(as.data.frame(table(indego_q3$passholder_type)) %>%
        rename(Passholder_Type = Var1, Count = Freq),
      caption = "Trips by passholder type",
      format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Trips by passholder type
Passholder_Type Count
Day Pass 18,418
Indego30 252,897
Indego365 157,695
NULL 5
Walk-up 36,449
Code
# Bike types
kable(as.data.frame(table(indego_q3$bike_type)) %>%
        rename(Bike_Type = Var1, Count = Freq),
      caption = "Trips by bike type",
      format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Trips by bike type
Bike_Type Count
electric 299,515
standard 165,949

3.1.3 Create Time Bins

Same hourly-binning pipeline as in Q1, now applied to Q3.

Code
indego_q3 <- indego_q3 %>%
  mutate(
    # Parse datetime
    start_datetime = mdy_hm(start_time),
    end_datetime   = mdy_hm(end_time),

    # Create hourly bins
    interval60 = floor_date(start_datetime, unit = "hour"),

    # Extract time features
    week  = week(interval60),
    month = month(interval60, label = TRUE, locale = "C"),
    dotw  = wday(interval60, label = TRUE, locale = "C"),
    hour  = hour(interval60),
    date  = as.Date(interval60),

    # Create useful indicators
    weekend   = ifelse(dotw %in% c("Sat", "Sun"), 1, 0),
    rush_hour = ifelse(hour %in% c(7, 8, 9, 16, 17, 18), 1, 0)
  )

kable(head(indego_q3 %>% select(start_datetime, interval60, week, dotw, hour, weekend)),
      caption = "First rows of the Q3 hourly time features") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
First rows of the Q3 hourly time features
start_datetime interval60 week dotw hour weekend
2025-07-01 00:06:00 2025-07-01 26 Tue 0 0
2025-07-01 00:11:00 2025-07-01 26 Tue 0 0
2025-07-01 00:13:00 2025-07-01 26 Tue 0 0
2025-07-01 00:16:00 2025-07-01 26 Tue 0 0
2025-07-01 00:16:00 2025-07-01 26 Tue 0 0
2025-07-01 00:18:00 2025-07-01 26 Tue 0 0

Findings: Q3 has roughly three to four times as many trips as Q1 over the same 13-week window. Summer weather and long daylight hours push absolute ridership far above winter levels even before I look at the hourly or daily shape.

3.2 EDA of Indego Data

3.2.1 Trips Over Time

Code
# Daily trip counts
daily_trips_q3 <- indego_q3 %>%
  group_by(date) %>%
  summarize(trips = n())

ggplot(daily_trips_q3, aes(x = date, y = trips)) +
  geom_line(color = "#3182bd", linewidth = 1) +
  geom_smooth(se = FALSE, color = "red", linetype = "dashed") +
  labs(
    title = "Indego Daily Ridership - Q3 2025",
    subtitle = "Summer demand patterns in Philadelphia",
    x = "Date",
    y = "Daily Trips",
    caption = "Source: Indego bike share"
  ) +
  plotTheme

Ridership patterns: Daily ridership stays consistently high from early July through mid-September and then softens as September cools off. Sharp one-day dips line up with heavy rain or extreme-heat days, and July 4 itself lands below a typical summer Friday, presumably because commute trips disappear. Unlike Q1, there is no single extreme outlier; the Q3 surprises are multi-day heat waves and holiday weekends rather than one-off events.

Next I compare July 4 to other Q3 Fridays to check how large the Independence-Day effect really is. I will keep this in mind for feature engineering.

Code
july4 <- daily_trips_q3 %>% filter(date == "2025-07-04")

typical_q3_friday <- indego_q3 %>%
  filter(dotw == "Fri", date != "2025-07-04") %>%
  group_by(date) %>%
  summarize(trips = n()) %>%
  summarize(avg_friday_trips = mean(trips))

kable(july4,
      caption = "Indego trips on 2025-07-04 (Independence Day)",
      format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Indego trips on 2025-07-04 (Independence Day)
date trips
2025-07-04 4,641
Code
kable(typical_q3_friday,
      digits = 1,
      caption = "Average Fridays in Q3 2025 (excluding Jul 4)",
      format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Average Fridays in Q3 2025 (excluding Jul 4)
avg_friday_trips
5,544.4
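July 4 comes in roughly 900 trips below a typical Q3 Friday, so a holiday flag is worth carrying into feature engineering. A minimal sketch (the two dates are the actual 2025 US federal holidays that fall in Q3):

```r
# Q3 2025 holiday indicator: Independence Day (Fri Jul 4) and Labor Day (Mon Sep 1).
q3_holidays <- as.Date(c("2025-07-04", "2025-09-01"))
is_holiday <- function(d) as.integer(as.Date(d) %in% q3_holidays)

is_holiday(c("2025-07-04", "2025-07-05"))  # 1 0
```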

3.2.2 Hourly Patterns

Code
# Average trips by hour and day type
hourly_patterns_q3 <- indego_q3 %>%
  group_by(hour, weekend) %>%
  summarize(avg_trips = n() / n_distinct(date)) %>%
  mutate(day_type = ifelse(weekend == 1, "Weekend", "Weekday"))

ggplot(hourly_patterns_q3, aes(x = hour, y = avg_trips, color = day_type)) +
  geom_line(linewidth = 1.2) +
  scale_color_manual(values = c("Weekday" = "#08519c", "Weekend" = "#6baed6")) +
  labs(
    title = "Average Hourly Ridership Patterns - Q3 2025",
    subtitle = "Longer evening peaks and fatter weekends than winter",
    x = "Hour of Day",
    y = "Average Trips per Hour",
    color = "Day Type"
  ) +
  plotTheme

Hourly patterns: Weekdays still show twin AM / PM peaks, but the PM peak is higher than the AM peak and extends well past 6 PM, which is consistent with people riding home in long daylight and then going out for dinner. Weekend curves are much fatter than in Q1 and start earlier (demand is non-trivial by 9 AM instead of 11 AM), which matches the shift from indoor-winter to leisure-summer behavior.

3.2.3 Top Stations

Code
# Most popular origin stations
top_stations_q3 <- indego_q3 %>%
  count(start_station, start_lat, start_lon, name = "trips") %>%
  arrange(desc(trips)) %>%
  head(20) %>%
  # IDs are labels, not quantities: cast to character so big.mark
  # doesn't render station 3010 as "3,010"
  mutate(start_station = as.character(start_station))

kable(top_stations_q3,
      caption = "Top 20 Indego Stations by Trip Origins - Q3 2025",
      format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Top 20 Indego Stations by Trip Origins - Q3 2025
start_station start_lat start_lon trips
3010 39.94711 -75.16618 7,716
3032 39.94527 -75.17971 5,884
3213 39.93887 -75.16663 5,660
3163 39.94974 -75.18097 5,174
3359 39.94888 -75.16978 4,681
3185 39.95169 -75.15888 4,670
3028 39.94061 -75.14958 4,607
3020 39.94855 -75.19007 4,556
3022 39.95472 -75.18323 4,475
3066 39.94561 -75.17348 4,353
3059 39.96244 -75.16121 4,265
3063 39.94633 -75.16980 4,230
3161 39.95486 -75.18091 4,223
3101 39.94295 -75.15955 4,183
3054 39.96250 -75.17420 4,117
3030 39.93935 -75.15716 3,986
3362 39.94816 -75.16226 3,965
3061 39.95425 -75.17761 3,716
3057 39.96439 -75.17987 3,668
3296 39.95134 -75.16758 3,665

3.3 Put Data in Context

The Philadelphia census tracts (philly_census) are already loaded from the Q1 section and do not change between quarters. I only redo the station-to-tract spatial join and drop non-residential stations for the Q3 station set.

3.3.1 Join Census Data to Stations

Code
# Create sf object for Q3 stations
stations_sf_q3 <- indego_q3 %>%
  distinct(start_station, start_lat, start_lon) %>%
  filter(!is.na(start_lat), !is.na(start_lon)) %>%
  st_as_sf(coords = c("start_lon", "start_lat"), crs = 4326)

# Spatial join to get census tract for each station
stations_census_q3 <- st_join(stations_sf_q3, philly_census, left = TRUE) %>%
  st_drop_geometry()

# Identify which stations to keep (i.e. inside residential tracts)
valid_stations_q3 <- stations_census_q3 %>%
  filter(!is.na(Med_Inc)) %>%
  pull(start_station)

# Filter trip data to valid stations only and attach census attributes
indego_census_q3 <- indego_q3 %>%
  filter(start_station %in% valid_stations_q3) %>%
  left_join(
    stations_census_q3 %>%
      select(start_station, Med_Inc, Percent_Taking_Transit,
             Percent_White, Total_Pop),
    by = "start_station"
  )

3.3.2 Get Weather Data

Code
# Get weather from Philadelphia International Airport (KPHL)
# Q3 2025 covers July 1 - September 30
weather_data_q3 <- riem_measures(
  station    = "PHL",
  date_start = "2025-07-01",
  date_end   = "2025-09-30"
)

# Process weather data
weather_processed_q3 <- weather_data_q3 %>%
  mutate(
    interval60    = floor_date(valid, unit = "hour"),
    Temperature   = tmpf,
    Precipitation = ifelse(is.na(p01i), 0, p01i),
    Wind_Speed    = sknt
  ) %>%
  select(interval60, Temperature, Precipitation, Wind_Speed) %>%
  distinct(interval60, .keep_all = TRUE)

# Check for missing hours and interpolate if needed
weather_complete_q3 <- weather_processed_q3 %>%
  complete(interval60 = seq(min(interval60), max(interval60), by = "hour")) %>%
  fill(Temperature, Precipitation, Wind_Speed, .direction = "down")

# Look at the weather
weather_complete_q3 %>%
  select(Temperature, Precipitation, Wind_Speed) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "value") %>%
  group_by(Variable) %>%
  summarise(
    Min    = min(value, na.rm = TRUE),
    Q1     = quantile(value, 0.25, na.rm = TRUE),
    Median = median(value, na.rm = TRUE),
    Mean   = mean(value, na.rm = TRUE),
    Q3     = quantile(value, 0.75, na.rm = TRUE),
    Max    = max(value, na.rm = TRUE),
    NAs    = sum(is.na(value)),
    .groups = "drop"
  ) %>%
  kable(digits = 2, caption = "Hourly weather summary (Q3 2025)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Hourly weather summary (Q3 2025)
Variable Min Q1 Median Mean Q3 Max NAs
Precipitation 0 0 0 0.00 0 0.1 0
Temperature 57 70 76 75.94 81 98.0 0
Wind_Speed 0 4 6 6.22 8 25.0 0
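The complete() + fill() step above deserves a quick illustration, since it silently invents weather for missing hours. On a toy two-row series (assumes tidyr/dplyr, which the lab already loads):

```r
library(dplyr)
library(tidyr)

# A gap at 01:00: complete() inserts the missing hour, fill() carries the
# last observed temperature forward into it.
toy <- tibble(
  interval60 = as.POSIXct(c("2025-07-01 00:00", "2025-07-01 02:00"), tz = "UTC"),
  temp = c(70, 74)
)
filled <- toy %>%
  complete(interval60 = seq(min(interval60), max(interval60), by = "hour")) %>%
  fill(temp, .direction = "down")

filled$temp  # 70 70 74
```

Note that `.direction = "down"` leaves any NAs at the very start of the series untouched; `"downup"` would backfill those as well.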

3.3.3 Visualize Weather Patterns

Code
ggplot(weather_complete_q3, aes(x = interval60, y = Temperature)) +
  geom_line(color = "#3182bd", alpha = 0.7) +
  geom_smooth(se = FALSE, color = "red") +
  labs(
    title = "Philadelphia Temperature - Q3 2025",
    subtitle = "Summer heat and late-season cooling",
    x = "Date",
    y = "Temperature (°F)"
  ) +
  plotTheme

Temperature patterns: The smoothed temperature curve sits above 80 °F for most of July and the first half of August, with individual hours spiking into the low-to-mid 90s °F during several multi-day heat waves. Extreme heat plausibly depresses afternoon riding even on otherwise normal weekdays, which gives me a concrete motivation to later construct a “hot afternoon” feature rather than rely on raw temperature alone.
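That "hot afternoon" feature can be sketched now so the later feature-engineering step has a concrete starting point (the 90 °F threshold and the noon-7 PM window are my assumptions, to be tuned against the data):

```r
# Hypothetical heat flag: 1 for afternoon hours at or above 90 F, else 0.
hot_afternoon <- function(temp_f, hour) {
  as.integer(temp_f >= 90 & hour >= 12 & hour <= 19)
}

hot_afternoon(c(95, 95, 80), c(15, 8, 15))  # 1 0 0
```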

3.4 Create Space-Time Panel

3.4.1 Aggregate Trips to Station-Hour Level

Code
# Count trips by station-hour
trips_panel_q3 <- indego_census_q3 %>%
  group_by(interval60, start_station, start_lat, start_lon,
           Med_Inc, Percent_Taking_Transit, Percent_White, Total_Pop) %>%
  summarize(Trip_Count = n()) %>%
  ungroup()

# How many station-hour observations?
# How many unique stations?
# How many unique hours?
panel_sizes_q3 <- tibble(
  Metric = c("Station-hour observations", "Unique stations", "Unique hours"),
  Count  = c(nrow(trips_panel_q3),
             length(unique(trips_panel_q3$start_station)),
             length(unique(trips_panel_q3$interval60)))
)

kable(panel_sizes_q3,
      caption = "Raw Q3 station-hour panel (before completion)",
      format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Raw Q3 station-hour panel (before completion)
Metric Count
Station-hour observations 214,178
Unique stations 263
Unique hours 2,208
Code
hist(trips_panel_q3$Trip_Count)

3.4.2 Create Complete Panel Structure

Code
# Calculate expected panel size
n_stations_q3 <- length(unique(trips_panel_q3$start_station))
n_hours_q3    <- length(unique(trips_panel_q3$interval60))
expected_rows_q3 <- n_stations_q3 * n_hours_q3

panel_check_q3 <- tibble(
  Metric = c("Expected panel rows", "Current rows", "Missing rows"),
  Count  = c(expected_rows_q3, nrow(trips_panel_q3), expected_rows_q3 - nrow(trips_panel_q3))
)

kable(panel_check_q3,
      caption = "Q3 panel completeness check",
      format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Q3 panel completeness check
Metric Count
Expected panel rows 580,704
Current rows 214,178
Missing rows 366,526
Code
# Create complete panel
study_panel_q3 <- expand_grid(
  interval60    = unique(trips_panel_q3$interval60),
  start_station = unique(trips_panel_q3$start_station)
) %>%
  # Join trip counts
  left_join(trips_panel_q3 %>%
              select(interval60, start_station, Trip_Count),
            by = c("interval60", "start_station")) %>%
  # Replace NA trip counts with 0
  mutate(Trip_Count = replace_na(Trip_Count, 0))

hist(study_panel_q3$Trip_Count)

Code
# Fill in station attributes (they're the same for all hours)
station_attributes_q3 <- trips_panel_q3 %>%
  group_by(start_station) %>%
  summarize(
    start_lat              = first(start_lat),
    start_lon              = first(start_lon),
    Med_Inc                = first(Med_Inc),
    Percent_Taking_Transit = first(Percent_Taking_Transit),
    Percent_White          = first(Percent_White),
    Total_Pop              = first(Total_Pop)
  )

study_panel_q3 <- study_panel_q3 %>%
  left_join(station_attributes_q3, by = "start_station")

kable(tibble(Metric = "Complete Q3 panel rows", Count = nrow(study_panel_q3)),
      format.args = list(big.mark = ","),
      caption = "Final Q3 panel size") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Final Q3 panel size
Metric Count
Complete Q3 panel rows 580,704

3.4.3 Add Time Features

Code
study_panel_q3 <- study_panel_q3 %>%
  mutate(
    week      = week(interval60),
    month     = month(interval60, label = TRUE, locale = "C"),
    dotw      = wday(interval60, label = TRUE, locale = "C"),
    hour      = hour(interval60),
    date      = as.Date(interval60),
    weekend   = ifelse(dotw %in% c("Sat", "Sun"), 1, 0),
    rush_hour = ifelse(hour %in% c(7, 8, 9, 16, 17, 18), 1, 0)
  )

3.4.4 Join Weather Data

Code
study_panel_q3 <- study_panel_q3 %>%
  left_join(weather_complete_q3, by = "interval60")

study_panel_q3 %>%
  select(Trip_Count, Temperature, Precipitation) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "value") %>%
  group_by(Variable) %>%
  summarise(
    Min    = min(value, na.rm = TRUE),
    Q1     = quantile(value, 0.25, na.rm = TRUE),
    Median = median(value, na.rm = TRUE),
    Mean   = mean(value, na.rm = TRUE),
    Q3     = quantile(value, 0.75, na.rm = TRUE),
    Max    = max(value, na.rm = TRUE),
    NAs    = sum(is.na(value)),
    .groups = "drop"
  ) %>%
  kable(digits = 2, caption = "Q3 station-hour panel: key variables after weather join") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Q3 station-hour panel: key variables after weather join
Variable Min Q1 Median Mean Q3 Max NAs
Precipitation 0 0 0 0.00 0 0.1 6312
Temperature 57 70 76 75.94 81 98.0 6312
Trip_Count 0 0 0 0.75 1 22.0 0

3.5 Create Temporal Lag Variables

Code
# Sort by station and time
study_panel_q3 <- study_panel_q3 %>%
  arrange(start_station, interval60)

# Create lag variables WITHIN each station
study_panel_q3 <- study_panel_q3 %>%
  group_by(start_station) %>%
  mutate(
    lag1Hour   = lag(Trip_Count, 1),
    lag2Hours  = lag(Trip_Count, 2),
    lag3Hours  = lag(Trip_Count, 3),
    lag12Hours = lag(Trip_Count, 12),
    lag1day    = lag(Trip_Count, 24)
  ) %>%
  ungroup()

# Remove rows with NA lags (first 24 hours for each station)
study_panel_complete_q3 <- study_panel_q3 %>%
  filter(!is.na(lag1day))

kable(tibble(Metric = "Rows after removing NA lags (Q3)", Count = nrow(study_panel_complete_q3)),
      format.args = list(big.mark = ","),
      caption = "Q3 panel size after dropping first-24-hour rows") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Q3 panel size after dropping first-24-hour rows
Metric Count
Rows after removing NA lags (Q3) 574,392

3.6 Temporal Train/Test Split

Q3 spans weeks 26-39 (Jul 1 - Sep 30). I mirror the Q1 approach of holding out the final four weeks: train on the first ten weeks (weeks 26-35, roughly Jul 1 - Aug 31) and test on the last four (weeks 36-39, roughly Sep 1 - 30).

Code
# Which stations have trips in BOTH early and late periods?
early_stations_q3 <- study_panel_complete_q3 %>%
  filter(week < 36) %>%
  filter(Trip_Count > 0) %>%
  distinct(start_station) %>%
  pull(start_station)

late_stations_q3 <- study_panel_complete_q3 %>%
  filter(week >= 36) %>%
  filter(Trip_Count > 0) %>%
  distinct(start_station) %>%
  pull(start_station)

# Keep only stations that appear in BOTH periods
common_stations_q3 <- intersect(early_stations_q3, late_stations_q3)

# Filter panel to only common stations
study_panel_complete_q3 <- study_panel_complete_q3 %>%
  filter(start_station %in% common_stations_q3)

# NOW create train/test split
train_q3 <- study_panel_complete_q3 %>% filter(week < 36)
test_q3  <- study_panel_complete_q3 %>% filter(week >= 36)

split_summary_q3 <- tibble(
  Split              = c("Training", "Testing"),
  Observations       = c(nrow(train_q3), nrow(test_q3)),
  `Date range start` = c(as.character(min(train_q3$date)), as.character(min(test_q3$date))),
  `Date range end`   = c(as.character(max(train_q3$date)), as.character(max(test_q3$date)))
)

kable(split_summary_q3,
      caption = "Q3 temporal train/test split",
      format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Q3 temporal train/test split
Split Observations Date range start Date range end
Training 397,656 2025-07-02 2025-09-02
Testing 176,736 2025-09-03 2025-09-30

3.7 Build Predictive Models

I build the same 5 models as in Q1 so the two quarters are directly comparable.

3.7.1 Model 1: Baseline (Time + Weather)

Code
# Create day of week factor with treatment (dummy) coding
train_q3 <- train_q3 %>%
  mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))

contrasts(train_q3$dotw_simple) <- contr.treatment(7)

model1_q3 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation,
  data = train_q3
)

m1q3_coefs <- as.data.frame(summary(model1_q3)$coefficients) %>%
  rownames_to_column("Term")

kable(m1q3_coefs, digits = 3,
      caption = "Q3 Model 1: Coefficients") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Q3 Model 1: Coefficients
Term Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.555 0.030 18.490 0.000
as.factor(hour)1 -0.083 0.014 -5.971 0.000
as.factor(hour)2 -0.152 0.014 -11.008 0.000
as.factor(hour)3 -0.212 0.014 -15.272 0.000
as.factor(hour)4 -0.212 0.014 -15.229 0.000
as.factor(hour)5 -0.092 0.014 -6.630 0.000
as.factor(hour)6 0.148 0.014 10.542 0.000
as.factor(hour)7 0.425 0.014 30.353 0.000
as.factor(hour)8 0.840 0.014 59.736 0.000
as.factor(hour)9 0.573 0.014 40.726 0.000
as.factor(hour)10 0.456 0.014 32.457 0.000
as.factor(hour)11 0.527 0.014 37.769 0.000
as.factor(hour)12 0.611 0.014 44.091 0.000
as.factor(hour)13 0.667 0.014 48.283 0.000
as.factor(hour)14 0.740 0.014 53.566 0.000
as.factor(hour)15 0.854 0.014 61.673 0.000
as.factor(hour)16 1.072 0.014 77.117 0.000
as.factor(hour)17 1.406 0.014 100.743 0.000
as.factor(hour)18 1.122 0.014 80.137 0.000
as.factor(hour)19 0.897 0.014 64.041 0.000
as.factor(hour)20 0.617 0.014 44.051 0.000
as.factor(hour)21 0.426 0.014 30.600 0.000
as.factor(hour)22 0.283 0.014 20.374 0.000
as.factor(hour)23 0.144 0.014 10.439 0.000
dotw_simple2 0.112 0.007 15.006 0.000
dotw_simple3 0.047 0.007 6.273 0.000
dotw_simple4 0.078 0.007 10.476 0.000
dotw_simple5 0.082 0.007 10.933 0.000
dotw_simple6 0.003 0.007 0.423 0.673
dotw_simple7 -0.061 0.007 -8.203 0.000
Temperature -0.004 0.000 -11.953 0.000
Precipitation -4.283 0.408 -10.487 0.000
Code
kable(tibble(
    Metric = c("R²", "Adjusted R²", "Residual SE", "N obs"),
    Value  = c(summary(model1_q3)$r.squared,
               summary(model1_q3)$adj.r.squared,
               summary(model1_q3)$sigma,
               nobs(model1_q3))
  ),
  digits = 4,
  caption = "Q3 Model 1: Fit statistics",
  format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Q3 Model 1: Fit statistics
Metric Value
R² 0.1087
Adjusted R² 0.1086
Residual SE 1.2563
N obs 397,656.0000

3.7.2 Model 2: Add Temporal Lags

Code
model2_q3 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day,
  data = train_q3
)

m2q3_coefs <- as.data.frame(summary(model2_q3)$coefficients) %>%
  rownames_to_column("Term")

kable(m2q3_coefs, digits = 3,
      caption = "Q3 Model 2: Coefficients") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Q3 Model 2: Coefficients
Term Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.233 0.026 8.920 0.000
as.factor(hour)1 -0.011 0.012 -0.946 0.344
as.factor(hour)2 -0.022 0.012 -1.829 0.067
as.factor(hour)3 -0.035 0.012 -2.870 0.004
as.factor(hour)4 -0.009 0.012 -0.761 0.446
as.factor(hour)5 0.080 0.012 6.592 0.000
as.factor(hour)6 0.231 0.012 18.879 0.000
as.factor(hour)7 0.368 0.012 29.994 0.000
as.factor(hour)8 0.583 0.012 47.237 0.000
as.factor(hour)9 0.256 0.012 20.762 0.000
as.factor(hour)10 0.206 0.012 16.811 0.000
as.factor(hour)11 0.236 0.012 19.350 0.000
as.factor(hour)12 0.311 0.012 25.747 0.000
as.factor(hour)13 0.349 0.012 28.918 0.000
as.factor(hour)14 0.381 0.012 31.526 0.000
as.factor(hour)15 0.439 0.012 36.182 0.000
as.factor(hour)16 0.565 0.012 46.287 0.000
as.factor(hour)17 0.744 0.012 60.389 0.000
as.factor(hour)18 0.446 0.012 36.068 0.000
as.factor(hour)19 0.328 0.012 26.694 0.000
as.factor(hour)20 0.142 0.012 11.582 0.000
as.factor(hour)21 0.109 0.012 8.961 0.000
as.factor(hour)22 0.080 0.012 6.604 0.000
as.factor(hour)23 0.047 0.012 3.865 0.000
dotw_simple2 0.057 0.006 8.837 0.000
dotw_simple3 -0.007 0.006 -1.093 0.274
dotw_simple4 0.022 0.007 3.296 0.001
dotw_simple5 0.018 0.007 2.790 0.005
dotw_simple6 -0.040 0.007 -6.183 0.000
dotw_simple7 -0.061 0.006 -9.437 0.000
Temperature -0.003 0.000 -9.231 0.000
Precipitation -1.801 0.356 -5.064 0.000
lag1Hour 0.249 0.002 163.605 0.000
lag3Hours 0.120 0.001 81.096 0.000
lag1day 0.283 0.002 188.635 0.000
Code
kable(tibble(
    Metric = c("R²", "Adjusted R²", "Residual SE", "N obs"),
    Value  = c(summary(model2_q3)$r.squared,
               summary(model2_q3)$adj.r.squared,
               summary(model2_q3)$sigma,
               nobs(model2_q3))
  ),
  digits = 4,
  caption = "Q3 Model 2: Fit statistics",
  format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Q3 Model 2: Fit statistics
Metric Value
R² 0.3246
Adjusted R² 0.3245
Residual SE 1.0937
N obs 397,656.0000

3.7.3 Model 3: Add Demographics

Code
model3_q3 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day +
    Med_Inc + Percent_Taking_Transit + Percent_White,
  data = train_q3
)

m3q3_coefs <- as.data.frame(summary(model3_q3)$coefficients) %>%
  rownames_to_column("Term")

kable(m3q3_coefs, digits = 3,
      caption = "Q3 Model 3: Coefficients") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Q3 Model 3: Coefficients
Term Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.108 0.027 3.995 0.000
as.factor(hour)1 -0.016 0.012 -1.300 0.194
as.factor(hour)2 -0.030 0.012 -2.506 0.012
as.factor(hour)3 -0.046 0.012 -3.814 0.000
as.factor(hour)4 -0.022 0.012 -1.847 0.065
as.factor(hour)5 0.068 0.012 5.623 0.000
as.factor(hour)6 0.222 0.012 18.263 0.000
as.factor(hour)7 0.366 0.012 29.965 0.000
as.factor(hour)8 0.591 0.012 48.091 0.000
as.factor(hour)9 0.268 0.012 21.878 0.000
as.factor(hour)10 0.218 0.012 17.837 0.000
as.factor(hour)11 0.252 0.012 20.810 0.000
as.factor(hour)12 0.326 0.012 27.111 0.000
as.factor(hour)13 0.364 0.012 30.297 0.000
as.factor(hour)14 0.398 0.012 33.114 0.000
as.factor(hour)15 0.459 0.012 38.040 0.000
as.factor(hour)16 0.590 0.012 48.540 0.000
as.factor(hour)17 0.777 0.012 63.269 0.000
as.factor(hour)18 0.480 0.012 38.973 0.000
as.factor(hour)19 0.359 0.012 29.319 0.000
as.factor(hour)20 0.171 0.012 14.008 0.000
as.factor(hour)21 0.129 0.012 10.631 0.000
as.factor(hour)22 0.093 0.012 7.695 0.000
as.factor(hour)23 0.052 0.012 4.376 0.000
dotw_simple2 0.061 0.006 9.379 0.000
dotw_simple3 -0.004 0.006 -0.659 0.510
dotw_simple4 0.025 0.006 3.795 0.000
dotw_simple5 0.022 0.006 3.331 0.001
dotw_simple6 -0.038 0.006 -5.893 0.000
dotw_simple7 -0.062 0.006 -9.532 0.000
Temperature -0.003 0.000 -9.526 0.000
Precipitation -1.974 0.354 -5.575 0.000
lag1Hour 0.238 0.002 155.625 0.000
lag3Hours 0.107 0.001 71.833 0.000
lag1day 0.270 0.002 179.261 0.000
Med_Inc 0.000 0.000 20.098 0.000
Percent_Taking_Transit -0.003 0.000 -13.746 0.000
Percent_White 0.002 0.000 19.330 0.000
Code
kable(tibble(
    Metric = c("R²", "Adjusted R²", "Residual SE", "N obs"),
    Value  = c(summary(model3_q3)$r.squared,
               summary(model3_q3)$adj.r.squared,
               summary(model3_q3)$sigma,
               nobs(model3_q3))
  ),
  digits = 4,
  caption = "Q3 Model 3: Fit statistics",
  format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Q3 Model 3: Fit statistics
Metric Value
R² 0.3309
Adjusted R² 0.3308
Residual SE 1.0886
N obs 397,656.0000

3.7.4 Model 4: Add Station Fixed Effects

Code
model4_q3 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day +
    Med_Inc + Percent_Taking_Transit + Percent_White +
    as.factor(start_station),
  data = train_q3
)

# The full summary is too long with all the station dummies; show key fit metrics only
kable(tibble(
    Metric = c("R²", "Adjusted R²", "Residual SE", "N obs", "# coefficients"),
    Value  = c(summary(model4_q3)$r.squared,
               summary(model4_q3)$adj.r.squared,
               summary(model4_q3)$sigma,
               nobs(model4_q3),
               length(coef(model4_q3)))
  ),
  digits = 4,
  caption = "Q3 Model 4: Fit statistics (station FE; coef table suppressed)",
  format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Q3 Model 4: Fit statistics (station FE; coef table suppressed)
Metric Value
R² 0.3604
Adjusted R² 0.3599
Residual SE 1.0646
N obs 397,656.0000
# coefficients 300.0000

3.7.5 Model 5: Add Rush Hour Interaction

Code
model5_q3 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day + rush_hour + as.factor(month) +
    Med_Inc + Percent_Taking_Transit + Percent_White +
    as.factor(start_station) +
    rush_hour * weekend,  # Rush hour effects different on weekends
  data = train_q3
)

kable(tibble(
    Metric = c("R²", "Adjusted R²", "Residual SE", "N obs", "# coefficients"),
    Value  = c(summary(model5_q3)$r.squared,
               summary(model5_q3)$adj.r.squared,
               summary(model5_q3)$sigma,
               nobs(model5_q3),
               length(coef(model5_q3)))
  ),
  digits = 4,
  caption = "Q3 Model 5: Fit statistics (coef table suppressed)",
  format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Q3 Model 5: Fit statistics (coef table suppressed)
Metric Value
R² 0.3641
Adjusted R² 0.3637
Residual SE 1.0615
N obs 397,656.0000
# coefficients 305.0000

3.8 Model Evaluation

3.8.1 Calculate Predictions and MAE

Code
# Get predictions on Q3 test set
test_q3 <- test_q3 %>%
  mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))

contrasts(test_q3$dotw_simple) <- contr.treatment(7)

test_q3 <- test_q3 %>%
  mutate(
    pred1 = predict(model1_q3, newdata = test_q3),
    pred2 = predict(model2_q3, newdata = test_q3),
    pred3 = predict(model3_q3, newdata = test_q3),
    pred4 = predict(model4_q3, newdata = test_q3),
    pred5 = predict(model5_q3, newdata = test_q3)
  )

mae_results_q3 <- data.frame(
  Model = c(
    "1. Time + Weather",
    "2. + Temporal Lags",
    "3. + Demographics",
    "4. + Station FE",
    "5. + Rush Hour Interaction"
  ),
  MAE = c(
    mean(abs(test_q3$Trip_Count - test_q3$pred1), na.rm = TRUE),
    mean(abs(test_q3$Trip_Count - test_q3$pred2), na.rm = TRUE),
    mean(abs(test_q3$Trip_Count - test_q3$pred3), na.rm = TRUE),
    mean(abs(test_q3$Trip_Count - test_q3$pred4), na.rm = TRUE),
    mean(abs(test_q3$Trip_Count - test_q3$pred5), na.rm = TRUE)
  )
)

kable(mae_results_q3,
      digits = 2,
      caption = "Q3 Mean Absolute Error by Model (Test Set)",
      col.names = c("Model", "MAE (trips)")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Q3 Mean Absolute Error by Model (Test Set)
Model MAE (trips)
1. Time + Weather 0.88
2. + Temporal Lags 0.75
3. + Demographics 0.75
4. + Station FE 0.74
5. + Rush Hour Interaction 0.74

3.8.2 Visualize Model Comparison

Code
ggplot(mae_results_q3, aes(x = reorder(Model, -MAE), y = MAE)) +
  geom_col(fill = "#3182bd", alpha = 0.8) +
  geom_text(aes(label = round(MAE, 2)), vjust = -0.5) +
  labs(
    title = "Q3 Model Performance Comparison",
    subtitle = "Lower MAE = Better Predictions",
    x = "Model",
    y = "Mean Absolute Error (trips)"
  ) +
  plotTheme +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

3.8.3 Q1 vs Q3

Code
q1_q3_compare <- mae_results %>%
  rename(MAE_Q1 = MAE) %>%
  left_join(mae_results_q3 %>% rename(MAE_Q3 = MAE), by = "Model") %>%
  mutate(Difference_Q3_minus_Q1 = MAE_Q3 - MAE_Q1)

kable(q1_q3_compare,
      digits = 3,
      caption = "Q1 vs Q3 MAE by model (test set, trips)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Q1 vs Q3 MAE by model (test set, trips)
Model MAE_Q1 MAE_Q3 Difference_Q3_minus_Q1
1. Time + Weather 0.622 0.875 0.253
2. + Temporal Lags 0.535 0.748 0.212
3. + Demographics 0.533 0.749 0.216
4. + Station FE 0.526 0.743 0.217
5. + Rush Hour Interaction 0.530 0.740 0.211

Comparison: Absolute MAE values are higher in Q3 than in Q1 across all five models, simply because summer demand has a much larger scale, with more peak-hour rides, a fatter right tail, and therefore more room to be wrong per station-hour. The ordering of model improvements, however, is essentially identical. Temporal lags (Model 1 → Model 2) produce the single biggest MAE drop, station fixed effects (Model 3 → Model 4) give the next meaningful gain, and the demographic and rush-hour-interaction additions contribute only marginal improvements on top of lags + station FEs. That consistency across seasons suggests the overall modeling recipe is robust even though the absolute error scale is season-dependent.
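Because the two quarters differ mainly in demand scale, a demand-normalized MAE makes this comparison cleaner. A minimal sketch, assuming the Q1 objects (`mae_results`, `test`) are named analogously to the Q3 ones in this session:

```r
# Normalize MAE by mean observed hourly demand so errors are comparable
# across seasons with different ridership scales.
normalized_mae <- function(mae, mean_demand) mae / mean_demand

q1_q3_norm <- q1_q3_compare %>%
  mutate(
    MAE_Q1_norm = normalized_mae(MAE_Q1, mean(test$Trip_Count,    na.rm = TRUE)),
    MAE_Q3_norm = normalized_mae(MAE_Q3, mean(test_q3$Trip_Count, na.rm = TRUE))
  )
# If the normalized columns are close, the apparent Q3 "degradation" is
# mostly a demand-scale effect rather than a genuinely worse model.
```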

3.9 Space-Time Error Analysis

I use Model 2 (time + weather + lags) as the working model for Q3 error diagnostics, for the same reason as Q1: it is the simplest model that captures most of the predictable signal, so any residual structure points to features I might be missing.

3.9.1 Observed vs. Predicted

Code
test_q3 <- test_q3 %>%
  mutate(
    error = Trip_Count - pred2,
    abs_error = abs(error),
    time_of_day = case_when(
      hour < 7 ~ "Overnight",
      hour >= 7 & hour < 10 ~ "AM Rush",
      hour >= 10 & hour < 15 ~ "Mid-Day",
      hour >= 15 & hour <= 18 ~ "PM Rush",
      hour > 18 ~ "Evening"
    )
  )

ggplot(test_q3, aes(x = Trip_Count, y = pred2)) +
  geom_point(alpha = 0.2, color = "#3182bd") +
  geom_abline(slope = 1, intercept = 0, color = "red", linewidth = 1) +
  geom_smooth(method = "lm", se = FALSE, color = "darkgreen") +
  facet_grid(weekend ~ time_of_day) +
  labs(
    title = "Observed vs. Predicted Bike Trips - Q3",
    subtitle = "Model 2 performance by time period",
    x = "Observed Trips",
    y = "Predicted Trips",
    caption = "Red line = perfect predictions; Green line = actual model fit"
  ) +
  plotTheme

Findings: The green fit line hugs the red 1:1 line at low-demand cells (overnight, weekday mid-day), which suggests that the model is accurate there. The systematic flattening appears in PM Rush and Evening panels, especially on weekends: the fit line tilts well below 1:1, so the model under-predicts the right tail of busy summer evenings and weekend afternoons. This is a different failure mode from Q1, where the problem was under-predicting weekday rush-hour peaks. In Q3 the model is missing leisure-driven peaks rather than commuter peaks.
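The under-prediction reading can also be checked numerically with the mean signed error (bias) in each facet, using the `error` and `abs_error` columns created above:

```r
# Mean signed error by period: positive bias = under-prediction on average.
bias_by_period_q3 <- test_q3 %>%
  group_by(time_of_day, weekend) %>%
  summarize(
    bias = mean(error, na.rm = TRUE),      # error = Trip_Count - pred2
    MAE  = mean(abs_error, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(bias))
# Large positive bias in the weekend PM Rush / Evening rows would confirm
# that the model is missing the leisure-driven right tail.
```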

3.9.2 Spatial Error Patterns

Code
# Calculate MAE by station
station_errors_q3 <- test_q3 %>%
  filter(!is.na(pred2)) %>%
  group_by(start_station, start_lat, start_lon) %>%
  summarize(
    MAE = mean(abs(Trip_Count - pred2), na.rm = TRUE),
    avg_demand = mean(Trip_Count, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  filter(!is.na(start_lat), !is.na(start_lon))

# Map 1: Prediction Errors
p1_q3 <- ggplot() +
  geom_sf(data = philly_census, fill = "grey95", color = "white", size = 0.1) +
  geom_point(
    data = station_errors_q3,
    aes(x = start_lon, y = start_lat, color = MAE),
    size = 3.5,
    alpha = 0.7
  ) +
  scale_color_viridis(
    option = "plasma",
    name = "MAE (trips)",
    direction = -1
  ) +
  labs(title = "Q3 Prediction Errors") +
  mapTheme +
  theme(
    legend.position = "bottom",
    legend.title = element_text(size = 10, face = "bold"),
    legend.text = element_text(size = 9),
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5)
  ) +
  guides(color = guide_colorbar(
    barwidth = 12,
    barheight = 1,
    title.position = "top",
    title.hjust = 0.5
  ))

# Map 2: Average Demand
p2_q3 <- ggplot() +
  geom_sf(data = philly_census, fill = "grey95", color = "white", size = 0.1) +
  geom_point(
    data = station_errors_q3,
    aes(x = start_lon, y = start_lat, color = avg_demand),
    size = 3.5,
    alpha = 0.7
  ) +
  scale_color_viridis(
    option = "viridis",
    name = "Avg Demand (trips/hour)",
    direction = -1
  ) +
  labs(title = "Q3 Average Demand") +
  mapTheme +
  theme(
    legend.position = "bottom",
    legend.title = element_text(size = 10, face = "bold"),
    legend.text = element_text(size = 9),
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5)
  ) +
  guides(color = guide_colorbar(
    barwidth = 12,
    barheight = 1,
    title.position = "top",
    title.hjust = 0.5
  ))

(p1_q3 | p2_q3)

Spatial clustering of errors: Q3 errors again cluster in Center City and University City where demand is highest, mirroring the Q1 spatial pattern. What changes in Q3 is the appearance of additional high-MAE stations along the Schuylkill and Delaware waterfront trails, which makes sense because those stations serve recreational weekend riding that a commuter-shaped model doesn’t capture well. The commercial core is still the primary hot spot of error, but recreation corridors light up more than they do in winter.
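To separate "high error because high demand" from genuinely hard-to-predict stations, one can regress station MAE on average demand and inspect the residuals; a sketch on the `station_errors_q3` table built above:

```r
# How tightly does station-level MAE track average demand in Q3?
cor(station_errors_q3$MAE, station_errors_q3$avg_demand, use = "complete.obs")

# Stations with high MAE *relative* to their demand level -- candidates for
# the recreation-corridor effect described above.
relative_hotspots_q3 <- station_errors_q3 %>%
  mutate(MAE_above_expected = resid(lm(MAE ~ avg_demand, data = station_errors_q3))) %>%
  slice_max(MAE_above_expected, n = 10)
```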

3.9.3 Temporal Error Patterns

Code
# MAE by time of day and day type
temporal_errors_q3 <- test_q3 %>%
  group_by(time_of_day, weekend) %>%
  summarize(
    MAE = mean(abs_error, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(day_type = ifelse(weekend == 1, "Weekend", "Weekday"))

ggplot(temporal_errors_q3, aes(x = time_of_day, y = MAE, fill = day_type)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("Weekday" = "#08519c", "Weekend" = "#6baed6")) +
  labs(
    title = "Q3 Prediction Errors by Time Period",
    subtitle = "When is the model struggling most?",
    x = "Time of Day",
    y = "Mean Absolute Error (trips)",
    fill = "Day Type"
  ) +
  plotTheme +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

3.9.4 Errors and Demographics

Code
station_errors_demo_q3 <- station_errors_q3 %>%
  left_join(
    station_attributes_q3 %>% select(start_station, Med_Inc, Percent_Taking_Transit, Percent_White),
    by = "start_station"
  ) %>%
  filter(!is.na(Med_Inc))

p1 <- ggplot(station_errors_demo_q3, aes(x = Med_Inc, y = MAE)) +
  geom_point(alpha = 0.5, color = "#3182bd") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  scale_x_continuous(labels = scales::dollar) +
  labs(title = "Q3 Errors vs. Median Income", x = "Median Income", y = "MAE") +
  plotTheme

p2 <- ggplot(station_errors_demo_q3, aes(x = Percent_Taking_Transit, y = MAE)) +
  geom_point(alpha = 0.5, color = "#3182bd") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Q3 Errors vs. Transit Usage", x = "% Taking Transit", y = "MAE") +
  plotTheme

p3 <- ggplot(station_errors_demo_q3, aes(x = Percent_White, y = MAE)) +
  geom_point(alpha = 0.5, color = "#3182bd") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Q3 Errors vs. Race", x = "% White", y = "MAE") +
  plotTheme

(p1 | p2) / (p3 | plot_spacer())

Findings: Q3 errors trend upward with median income, transit share, and percent white, just as in Q1, and for the same structural reason: the highest-demand stations sit in wealthy, whiter, more transit-rich tracts (Center City, University City, the waterfront). Because overall summer ridership is higher, the absolute equity gap widens even though the direction is the same. Operationally, a rebalancing policy driven by raw MAE or raw demand in summer would concentrate trucks in already-well-served neighborhoods, leaving smaller outlying stations systematically underserved. The fix is the same as in Q1: weight errors by capacity or expected demand, and monitor service-quality metrics by tract.
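A minimal sketch of the capacity-weighted alternative. The `docks` column is hypothetical — it is not in the trip data and would have to come from Indego's station-status feed:

```r
# Weight each station's error by its capacity: a one-bike miss is a much
# larger service failure at a small station than at a large one.
# `docks` is a hypothetical per-station capacity field (not in this data).
equity_errors_q3 <- station_errors_q3 %>%
  mutate(
    MAE_per_dock = MAE / docks,
    MAE_relative = MAE / pmax(avg_demand, 0.1)   # guard tiny denominators
  )
# Ranking by MAE_per_dock (rather than raw MAE) surfaces small outlying
# stations where the same absolute miss can mean an empty station for an hour.
```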

3.10 Feature Engineering

The Q3 error analysis above points to two problems the baseline Model 2 cannot capture:

  1. Holidays and the school calendar. July 4 and Labor Day behave very differently from the nominal Friday / Monday they fall on, and late August flips Center City / University City demand back on as students return.
  2. Extreme heat specifically in afternoons. Plain temperature averages heat effects across the whole day, but heat depresses riding most strongly when the sun is harshest (roughly 1-7 PM).

I add three features that target these phenomena directly:

  • holiday indicator for 2025-07-04 and 2025-09-01 (Labor Day).
  • school_in_session indicator that is 1 from Aug 25 onward (rough start of Penn / Temple / Drexel fall semesters).
  • hot_afternoon indicator for hours with Temperature > 88 °F between 13:00 and 19:00.
Code
holiday_dates <- as.Date(c("2025-07-04", "2025-09-01"))

add_new_features <- function(df) {
  df %>%
    mutate(
      holiday           = as.integer(date %in% holiday_dates),
      school_in_session = as.integer(date >= as.Date("2025-08-25")),
      hot_afternoon     = as.integer(Temperature > 88 & hour >= 13 & hour <= 19)
    )
}

train_q3 <- add_new_features(train_q3)
test_q3  <- add_new_features(test_q3)

# Sanity check: how often does each flag fire?
feature_check <- tibble(
  Feature = c("holiday", "school_in_session", "hot_afternoon"),
  Train_share = c(mean(train_q3$holiday),
                  mean(train_q3$school_in_session),
                  mean(train_q3$hot_afternoon, na.rm = TRUE)),
  Test_share  = c(mean(test_q3$holiday),
                  mean(test_q3$school_in_session),
                  mean(test_q3$hot_afternoon, na.rm = TRUE))
)

kable(feature_check,
      digits = 4,
      caption = "Share of station-hours flagged by each new feature (train vs test)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Share of station-hours flagged by each new feature (train vs test)
Feature Train_share Test_share
holiday 0.0317 0.0000
school_in_session 0.1429 1.0000
hot_afternoon 0.0509 0.0015

3.10.1 Feature-Engineered Model

I add the three new features on top of Model 2’s right-hand side (hour, day-of-week, weather, lags). I deliberately do not add station fixed effects here so that the coefficient on each new feature is interpretable across stations and so the Poisson comparison below stays tractable.

Code
model_feat_q3 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day +
    holiday + school_in_session + hot_afternoon,
  data = train_q3
)

mfeat_coefs <- as.data.frame(summary(model_feat_q3)$coefficients) %>%
  rownames_to_column("Term")

# Show only the key coefficient rows (weather, lags, and the three new features)
kable(mfeat_coefs %>%
        filter(Term %in% c("holiday", "school_in_session", "hot_afternoon",
                           "Temperature", "Precipitation",
                           "lag1Hour", "lag3Hours", "lag1day")),
      digits = 3,
      caption = "Feature-engineered Q3 model: key coefficients") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Feature-engineered Q3 model: key coefficients
Term Estimate Std. Error t value Pr(>|t|)
Temperature -0.002 0.000 -4.361 0.000
Precipitation -1.785 0.357 -5.001 0.000
lag1Hour 0.249 0.002 163.428 0.000
lag3Hours 0.120 0.001 80.898 0.000
lag1day 0.283 0.002 188.606 0.000
holiday -0.038 0.011 -3.651 0.000
school_in_session 0.012 0.006 2.060 0.039
hot_afternoon -0.084 0.009 -8.887 0.000
Code
kable(tibble(
    Metric = c("R²", "Adjusted R²", "Residual SE", "N obs"),
    Value  = c(summary(model_feat_q3)$r.squared,
               summary(model_feat_q3)$adj.r.squared,
               summary(model_feat_q3)$sigma,
               nobs(model_feat_q3))
  ),
  digits = 4,
  caption = "Feature-engineered Q3 model: fit statistics",
  format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Feature-engineered Q3 model: fit statistics
Metric Value
R² 0.3247
Adjusted R² 0.3247
Residual SE 1.0935
N obs 397,656.0000

3.10.2 Poisson Model for Count Data

Trip_Count is a non-negative integer with many zeros and a right-skewed distribution, so a Poisson GLM is a natural fit. I use the same right-hand side as the feature-engineered linear model:

Code
model_pois_q3 <- glm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day +
    holiday + school_in_session + hot_afternoon,
  family = poisson(),
  data = train_q3
)

kable(tibble(
    Metric = c("Null deviance", "Residual deviance", "AIC", "N obs"),
    Value  = c(model_pois_q3$null.deviance,
               model_pois_q3$deviance,
               AIC(model_pois_q3),
               nobs(model_pois_q3))
  ),
  digits = 1,
  caption = "Poisson GLM (Q3): deviance / AIC",
  format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Poisson GLM (Q3): deviance / AIC
Metric Value
Null deviance 713,616.7
Residual deviance 489,410.9
AIC 840,470.5
N obs 397,656.0

3.10.3 MAE Comparison

Code
test_q3 <- test_q3 %>%
  mutate(
    pred_feat = predict(model_feat_q3, newdata = test_q3),
    pred_pois = predict(model_pois_q3, newdata = test_q3, type = "response")
  )

mae_results_final <- bind_rows(
  mae_results_q3,
  tibble(
    Model = c("6. Feat-Eng (lags + holiday + school + hot afternoon)",
              "7. Poisson GLM (same RHS as Feat-Eng)"),
    MAE   = c(mean(abs(test_q3$Trip_Count - test_q3$pred_feat), na.rm = TRUE),
              mean(abs(test_q3$Trip_Count - test_q3$pred_pois), na.rm = TRUE))
  )
)

kable(mae_results_final,
      digits = 3,
      caption = "Q3 MAE across all models (including feature-engineered and Poisson)",
      col.names = c("Model", "MAE (trips)")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Q3 MAE across all models (including feature-engineered and Poisson)
Model MAE (trips)
1. Time + Weather 0.875
2. + Temporal Lags 0.748
3. + Demographics 0.749
4. + Station FE 0.743
5. + Rush Hour Interaction 0.740
6. Feat-Eng (lags + holiday + school + hot afternoon) 0.750
7. Poisson GLM (same RHS as Feat-Eng) 0.794
Code
ggplot(mae_results_final, aes(x = reorder(Model, -MAE), y = MAE)) +
  geom_col(fill = "#3182bd", alpha = 0.8) +
  geom_text(aes(label = round(MAE, 2)), vjust = -0.5) +
  labs(
    title = "Q3 Model Performance — All Models",
    subtitle = "Lower MAE = Better Predictions",
    x = "Model",
    y = "Mean Absolute Error (trips)"
  ) +
  plotTheme +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Effect: The three summer-targeted features each move in the expected direction: holiday is negative (July 4 and Labor Day suppress the typical Friday/Monday pattern), school_in_session is positive (late-August classes lift demand), and hot_afternoon is negative (>88 °F afternoons depress riding beyond what raw Temperature captures). Despite the sensible coefficients, the test-set MAE is essentially unchanged from Model 2 (0.750 vs 0.748) and still trails Model 4’s station fixed effects. Two reasons: the holiday flag never fires in the September test window (see the share table above), and the biggest unexplained variation in Q3 is between stations, not within the day. The Poisson GLM posts a similar MAE to the linear feature-engineered model; its real advantages are a much better fit to the right-skewed count distribution and predictions that can never go negative, which matters for operational use.
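Two quick diagnostics back up that comparison — the share of impossible negative predictions from the linear model, and a rough overdispersion check on the Poisson fit:

```r
# Linear models can predict negative trip counts; Poisson response-scale
# predictions are exp(linear predictor) and therefore strictly positive.
mean(test_q3$pred_feat < 0, na.rm = TRUE)   # share of negative OLS predictions
mean(test_q3$pred_pois < 0, na.rm = TRUE)   # 0 by construction

# Rough overdispersion check: residual deviance / residual df noticeably
# above 1 would argue for quasi-Poisson or negative binomial errors.
model_pois_q3$deviance / model_pois_q3$df.residual
```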

3.11 Critical Reflection

3.11.1 Operational Implications

My best linear model has an MAE of about 0.74 trips per station per hour on the Q3 test set. That sounds small, but the average station-hour in this panel sees fewer than one trip, so in relative terms the error is large. From an operations standpoint, though, rebalancing trucks move bikes in batches anyway, so being off by a bike here or there won’t completely break a dispatcher’s decision.

I think the model is useful for figuring out which stations are most likely to run out of bikes in the next hour, so Indego can decide where to send trucks first. But I wouldn’t use it to tell a dispatcher the exact number of trips a specific station will have. If Indego actually deployed this, I would combine the model with real-time bike availability data instead of letting the predictions drive dispatch decisions on their own.
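One way to turn the Poisson model into exactly that kind of dispatch signal is a per-station stockout probability. A sketch; `bikes_available` is hypothetical and would come from Indego's real-time station-status feed:

```r
# P(demand in the next hour exceeds bikes on hand), treating demand as
# Poisson at the model's predicted rate.
stockout_prob <- function(pred_rate, bikes_available) {
  1 - ppois(bikes_available, lambda = pred_rate)
}

stockout_prob(4, 2)   # station predicted at 4 trips/hr with 2 bikes docked

# Ranking stations for the next hour (bikes_available is hypothetical):
# test_q3 %>%
#   mutate(risk = stockout_prob(pred_pois, bikes_available)) %>%
#   slice_max(risk, n = 20)
```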

3.11.2 Equity Considerations

I noticed something interesting in the error patterns. The model is least accurate at busy stations in wealthy, whiter, transit-rich neighborhoods, and most accurate at low-demand stations in outlying areas. At first this looks fair, but I don’t think it actually is.

The problem is that being off by one bike means very different things in different places. A downtown station with twenty docks barely notices a one-bike error. But a small station in an outlying neighborhood might go a whole hour with zero bikes available because of that same error. If Indego only uses predicted demand to send trucks out, then in the summer all the rebalancing will end up concentrated around Center City and the universities, and the outer neighborhoods will get even less service. These are often the lower-income or majority-minority neighborhoods, and they’re also the places where waiting or walking in the heat is hardest.

A fairer setup would use an error metric that accounts for station capacity instead of just looking at MAE. Indego should also report service quality separately for each census tract, and set a minimum number of bikes that every station is guaranteed to have, no matter what the model predicts.

3.11.3 Model Limitations

There are three bigger issues with my current model.

The first one is special events. Things like parades, concerts at the Wells Fargo Center, and Penn move-in weekend cause huge spikes in ridership that the model can’t see coming. Adding a manually maintained event calendar as a feature would probably help a lot.
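Mechanically, an event calendar would enter the model exactly like the holiday flag above; a sketch with placeholder dates (the specific events are illustrative, not verified):

```r
# Hypothetical event calendar -- dates and events are placeholders.
event_dates <- as.Date(c(
  "2025-07-19",   # e.g. a stadium concert (placeholder)
  "2025-08-24"    # e.g. Penn move-in weekend (placeholder)
))

add_event_flag <- function(df) {
  df %>% mutate(special_event = as.integer(date %in% event_dates))
}
# An event x hour interaction (special_event * as.factor(hour)) would let
# the demand spike concentrate in the hours around each event.
```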

The second issue is that stations are connected to each other, but my model treats each one as if it exists in isolation. A bike showing up at Station A is there because someone rode it over from Station B. Flows between stations feed into each other, and right now the model ignores that completely.
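A first step toward capturing that interdependence is a spatial lag: each station's previous-hour trips averaged over its nearest neighbors. A sketch, assuming a hypothetical `neighbors_q3` lookup table (columns `start_station`, `neighbor_station`) built from the station coordinates:

```r
# Average last-hour trips at each station's neighbors, as a new predictor.
# neighbors_q3 is a hypothetical k-nearest-neighbor lookup (not built above).
neighbor_lag_q3 <- study_panel_q3 %>%
  select(interval60, neighbor_station = start_station,
         neighbor_trips = Trip_Count) %>%
  inner_join(neighbors_q3, by = "neighbor_station") %>%
  group_by(start_station, interval60) %>%
  summarize(neighbor_trips = mean(neighbor_trips), .groups = "drop") %>%
  group_by(start_station) %>%
  arrange(interval60, .by_group = TRUE) %>%
  mutate(lag1Hour_neighbors = lag(neighbor_trips, 1)) %>%
  ungroup()
# Joining lag1Hour_neighbors back onto the panel lets the model "see"
# demand arriving from nearby stations an hour earlier.
```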

The third issue is the trickiest. The data I used for training already reflects Indego’s past rebalancing decisions. So the model isn’t really learning pure demand, it’s learning demand plus whatever Indego has been doing operationally. To separate those two things, I would need logs of Indego’s rebalancing actions, or a test period where they stop rebalancing entirely. I don’t have either of those.