Lab 1: Census Data Quality for Policy Decisions

Evaluating Data Reliability for Algorithmic Decision-Making

Author

Zhuosi Yang

Published

February 8, 2026

Assignment Overview

Scenario

You are a data analyst for the California Department of Human Services. The department is considering implementing an algorithmic system to identify communities that should receive priority for social service funding and outreach programs. Your supervisor has asked you to evaluate the quality and reliability of available census data to inform this decision.

Drawing on our Week 2 discussion of algorithmic bias, you need to assess not just what the data shows, but how reliable it is and what communities might be affected by data quality issues.

Learning Objectives

  • Apply dplyr functions to real census data for policy analysis
  • Evaluate data quality using margins of error
  • Connect technical analysis to algorithmic decision-making
  • Identify potential equity implications of data reliability issues
  • Create professional documentation for policy stakeholders

Submission Instructions

Submit by posting your updated portfolio link on Canvas. Your assignment should be accessible at your-portfolio-url/labs/lab_1/

Make sure to update your _quarto.yml navigation to include this assignment under a “Labs” menu.

Part 1: Portfolio Integration

Create this assignment in your portfolio repository under a labs/lab_1/ folder structure. Update your navigation menu in _quarto.yml to include:

- text: Labs
  menu:
    - href: labs/lab_1/your_file_name.qmd
      text: "Lab 1: Census Data Exploration"

If a value contains a special character such as a colon, wrap it in double quotes so that Quarto parses it as literal text rather than YAML syntax.

Setup

# Load required packages (hint: you need tidycensus, tidyverse, and knitr)
library(tidycensus)
library(tidyverse)
library(knitr)

# Set your Census API key (use your own key; a real key should never be
# committed to a repository or published in a rendered document)
tidycensus::census_api_key("YOUR_API_KEY", install = TRUE, overwrite = TRUE)
# Choose your state for analysis - assign it to a variable called my_state
my_state <- "CA"

State Selection: I have chosen California for this analysis because its counties vary widely in population size and context, which creates meaningful differences in ACS margins of error and supports clear reliability categorization at the county level.

Part 2: County-Level Resource Assessment

2.1 Data Retrieval

Your Task: Use get_acs() to retrieve county-level data for your chosen state.

Requirements:

  • Geography: county level
  • Variables: median household income (B19013_001) and total population (B01003_001)
  • Year: 2022
  • Survey: acs5
  • Output format: wide

Hint: Remember to give your variables descriptive names using the variables = c(name = "code") syntax.

# Write your get_acs() code here
# 1) Retrieve county-level ACS data (wide format)
county_data <- get_acs(
  geography = "county",
  state = my_state,
  variables = c(
    median_income = "B19013_001",
    total_pop     = "B01003_001"
    ),
  year = 2022,
  survey = "acs5",
  output = "wide"
)

# Clean the county names to remove state name and "County" 
# Hint: use mutate() with str_remove()
county_data <- county_data %>%
  mutate(
    county = NAME %>%
      str_remove(paste0(",\\s*", my_state, "$")) %>%
      str_remove(paste0(",\\s*", state.name[state.abb == my_state], "$")) %>%
      str_remove("\\s*County$") %>%
      str_trim()
  )


# Display the first few rows
head(county_data)
# A tibble: 6 × 7
  GEOID NAME          median_incomeE median_incomeM total_popE total_popM county
  <chr> <chr>                  <dbl>          <dbl>      <dbl>      <dbl> <chr> 
1 06001 Alameda Coun…         122488           1231    1663823         NA Alame…
2 06003 Alpine Count…         101125          17442       1515        206 Alpine
3 06005 Amador Count…          74853           6048      40577         NA Amador
4 06007 Butte County…          66085           2261     213605         NA Butte 
5 06009 Calaveras Co…          77526           3875      45674         NA Calav…
6 06011 Colusa Count…          69619           5745      21811         NA Colusa

2.2 Data Quality Assessment

Your Task: Calculate margin of error percentages and create reliability categories.

Requirements:

  • Calculate MOE percentage: (margin of error / estimate) * 100
  • Create reliability categories:
    • High Confidence: MOE < 5%
    • Moderate Confidence: MOE 5-10%
    • Low Confidence: MOE > 10%
  • Create a flag for unreliable estimates (MOE > 10%)

Hint: Use mutate() with case_when() for the categories.

# Calculate MOE percentage and reliability categories using mutate()
county_data <- county_data %>%
  mutate(
    income_moe_pct = (median_incomeM / median_incomeE) * 100,
    reliability_categories = case_when(
      is.na(income_moe_pct) ~ NA_character_,
      income_moe_pct < 5 ~ "High Confidence",
      income_moe_pct <= 10 ~ "Moderate Confidence",
      TRUE ~ "Low Confidence"
    ),
    unreliable_flag = income_moe_pct > 10
  )


# Create a summary showing count of counties in each reliability category
# Hint: use count() and mutate() to add percentages
reliability_categories_summary <- county_data %>%
  filter(!is.na(reliability_categories)) %>%
  count(reliability_categories, name = "n_counties") %>%
  mutate(
    pct_counties = (n_counties / sum(n_counties)) * 100
  )

reliability_categories_summary
# A tibble: 3 × 3
  reliability_categories n_counties pct_counties
  <chr>                       <int>        <dbl>
1 High Confidence                42        72.4 
2 Low Confidence                  5         8.62
3 Moderate Confidence            11        19.0 

2.3 High Uncertainty Counties

Your Task: Identify the 5 counties with the highest MOE percentages.

Requirements: - Sort by MOE percentage (highest first) - Select the top 5 counties - Display: county name, median income, margin of error, MOE percentage, reliability category - Format as a professional table using kable()

Hint: Use arrange(), slice(), and select() functions.

# Create table of top 5 counties by MOE percentage
top5_high_uncertainty <- county_data %>%
  filter(!is.na(income_moe_pct)) %>%
  arrange(desc(income_moe_pct)) %>%
  slice(1:5) %>%
  mutate(income_moe_pct = round(income_moe_pct, 2)) %>%
  select(
    county,
    median_incomeE,
    median_incomeM,
    income_moe_pct,
    reliability_categories
  )

# Format as table with kable() - include appropriate column names and caption
knitr::kable(
  top5_high_uncertainty,
  col.names = c("County", "Median income", "Margin of Error", "MOE (%)", "Reliability"),
  caption = "Top 5 California counties with the highest MOE percentage for median household income (ACS 2022 5-year)."
)
Top 5 California counties with the highest MOE percentage for median household income (ACS 2022 5-year).

| County  | Median income | Margin of Error | MOE (%) | Reliability    |
|---------|--------------:|----------------:|--------:|----------------|
| Mono    | 82038         | 15388           | 18.76   | Low Confidence |
| Alpine  | 101125        | 17442           | 17.25   | Low Confidence |
| Sierra  | 61108         | 9237            | 15.12   | Low Confidence |
| Trinity | 47317         | 5890            | 12.45   | Low Confidence |
| Plumas  | 67885         | 7772            | 11.45   | Low Confidence |

Data Quality Commentary:

Counties with the highest MOE percentages, such as Mono, Alpine, and Sierra, may be poorly served by algorithms that rely on median household income, because their income estimates are the least reliable and the most likely to be misranked. Higher uncertainty is often associated with smaller populations and fewer effective survey responses, which inflates the margin of error relative to the estimate.
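The inverse link between population size and relative MOE claimed here can be sanity-checked with a rank correlation; a minimal base R sketch on toy values (illustrative numbers, not ACS estimates):

```r
# Toy county-level values: population and income MOE% (illustrative only)
pop     <- c(1500, 40000, 210000, 1600000)
moe_pct <- c(17.2, 8.1, 2.3, 1.0)

# Spearman rank correlation captures the monotone association without
# assuming linearity; -1 means the rankings are perfectly inverse
cor(pop, moe_pct, method = "spearman")
```

On the real data, the analogous call would be cor(county_data$total_popE, county_data$income_moe_pct, method = "spearman", use = "complete.obs"); a clearly negative value supports the interpretation above.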

Part 3: Neighborhood-Level Analysis

3.1 Focus Area Selection

Your Task: Select 2-3 counties from your reliability analysis for detailed tract-level study.

Strategy: Choose counties that represent different reliability levels (e.g., 1 high confidence, 1 moderate, 1 low confidence) to compare how data quality varies.

# Use filter() to select 2-3 counties from your county_reliability data
high_cty <- county_data %>%
  filter(reliability_categories == "High Confidence") %>%
  sample_n(1)

mod_cty <- county_data %>%
  filter(reliability_categories == "Moderate Confidence") %>%
  sample_n(1)

low_cty <- county_data %>%
  filter(reliability_categories == "Low Confidence") %>%
  sample_n(1)

# Store the selected counties in a variable called selected_counties
selected_counties <- bind_rows(high_cty, mod_cty, low_cty)

# Note: In earlier steps I used a random selection to pick example counties by reliability level. Quarto re-runs the code on every render, which would re-sample different counties each time. To keep the analysis reproducible and consistent with the counties used in my first run, I fix the selected counties here and use the same set for all subsequent tract-level analyses.
selected_counties <- county_data %>%
  filter(county %in% c("Lassen", "Madera", "Trinity")) %>%
  select(county, GEOID, median_incomeE, median_incomeM, income_moe_pct, reliability_categories)

# Display the selected counties with their key characteristics
# Show: county name, median income, MOE percentage, reliability category
selected_counties %>%
  select(county, GEOID, median_incomeE, median_incomeM, income_moe_pct, reliability_categories)
# A tibble: 3 × 6
  county  GEOID median_incomeE median_incomeM income_moe_pct
  <chr>   <chr>          <dbl>          <dbl>          <dbl>
1 Lassen  06035          59515           3551           5.97
2 Madera  06039          73543           2844           3.87
3 Trinity 06105          47317           5890          12.4 
# ℹ 1 more variable: reliability_categories <chr>

Comment on the output: These three California counties provide a clear comparison across reliability levels for median household income. Madera falls in the High Confidence category (MOE% ≈ 3.87), Lassen represents Moderate Confidence (MOE% ≈ 5.97), and Trinity is Low Confidence (MOE% ≈ 12.45). Keeping this fixed set of counties ensures the subsequent tract-level analyses are directly comparable and reproducible.
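An alternative to hard-coding the county list is to fix the random seed before sampling, so that sample_n() draws the same rows on every render; a minimal sketch of the idea:

```r
# Setting the RNG seed makes random draws repeatable across renders
set.seed(42)
draw1 <- sample(1:100, 3)

set.seed(42)             # the same seed restores the same RNG state
draw2 <- sample(1:100, 3)

identical(draw1, draw2)  # TRUE: identical draws
```

Calling set.seed() once near the top of the document would have made the original sample_n() selection reproducible without fixing the counties by name.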

3.2 Tract-Level Demographics

Your Task: Get demographic data for census tracts in your selected counties.

Requirements: - Geography: tract level - Variables: white alone (B03002_003), Black/African American (B03002_004), Hispanic/Latino (B03002_012), total population (B03002_001) - Use the same state and year as before - Output format: wide - Challenge: You’ll need county codes, not names. Look at the GEOID patterns in your county data for hints.

# Define your race/ethnicity variables with descriptive names
race <- c(
  total_pop = "B03002_001",
  white     = "B03002_003",
  black     = "B03002_004",
  hispanic  = "B03002_012"
)

# Use get_acs() to retrieve tract-level data
# Hint: You may need to specify county codes in the county parameter
county_codes <- substr(selected_counties$GEOID, 3, 5)

tract_demo <- get_acs(
  geography = "tract",
  state = my_state,
  county = county_codes,
  variables = race,
  year = 2022,
  survey = "acs5",
  output = "wide"
)

# Calculate percentage of each group using mutate()
# Create percentages for white, Black, and Hispanic populations
tract_demo <- tract_demo %>%
  mutate(
    pct_white    = (whiteE / total_popE) * 100,
    pct_black    = (blackE / total_popE) * 100,
    pct_hispanic = (hispanicE / total_popE) * 100
  ) 

# Add readable tract and county name columns using str_extract() or similar
tract_demo <- tract_demo %>%
  mutate(
    tract = str_extract(NAME, "(?<=Census Tract\\s)[^;]+"),
    county = str_extract(NAME, "(?<=;\\s)[^;]+(?=\\sCounty;)")
  )
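The lookaround patterns above assume tract NAME values follow the form “Census Tract X.XX; County Name County; California”; the logic can be verified on a literal example (base R equivalents shown so the sketch is self-contained, without stringr):

```r
nm <- "Census Tract 9.03; Madera County; California"

# Tract number: the text between "Census Tract " and the first semicolon
tract  <- sub("^Census Tract ([^;]+);.*$", "\\1", nm)

# County name: the text between the first "; " and " County;"
county <- sub("^[^;]+; ([^;]+) County;.*$", "\\1", nm)

c(tract = tract, county = county)  # "9.03", "Madera"
```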

3.3 Demographic Analysis

Your Task: Analyze the demographic patterns in your selected areas.

# Find the tract with the highest percentage of Hispanic/Latino residents
# Hint: use arrange() and slice() to get the top tract
top_hispanic_tract <- tract_demo %>%
  arrange(desc(pct_hispanic)) %>%
  slice(1) %>%
  select(county, tract, total_popE, pct_white, pct_black, pct_hispanic)

knitr::kable(
  top_hispanic_tract,
  caption = "Census Tract with the highest percentage of Hispanic/Latino residents in Lassen, Madera, Trinity."
)
Census Tract with the highest percentage of Hispanic/Latino residents in Lassen, Madera, Trinity.

| county | tract | total_popE | pct_white | pct_black | pct_hispanic |
|--------|-------|-----------:|----------:|----------:|-------------:|
| Madera | 9.03  | 4298       | 5.793392  | 0.6747324 | 93.53188     |

# Calculate average demographics by county using group_by() and summarize()
# Show: number of tracts, average percentage for each racial/ethnic group
county_demo_summary <- tract_demo %>%
  group_by(county) %>%
  summarize(
    n_tracts = n(),
    avg_pct_white = mean(pct_white, na.rm = TRUE),
    avg_pct_black = mean(pct_black, na.rm = TRUE),
    avg_pct_hispanic = mean(pct_hispanic, na.rm = TRUE)
  ) %>%
  arrange(county)

# Create a nicely formatted table of your results using kable()
knitr::kable(
  county_demo_summary,
  caption = "Average tract-level demographics by county (Lassen, Madera, Trinity)."
)
Average tract-level demographics by county (Lassen, Madera, Trinity).

| county  | n_tracts | avg_pct_white | avg_pct_black | avg_pct_hispanic |
|---------|---------:|--------------:|--------------:|-----------------:|
| Lassen  | 9        | 70.00704      | 4.953381      | 15.350125        |
| Madera  | 34       | 33.67908      | 2.364939      | 58.009667        |
| Trinity | 4        | 79.19083      | 1.738676      | 7.037537         |

Part 4: Comprehensive Data Quality Evaluation

4.1 MOE Analysis for Demographic Variables

Your Task: Examine margins of error for demographic variables to see if some communities have less reliable data.

Requirements: - Calculate MOE percentages for each demographic variable - Flag tracts where any demographic variable has MOE > 15% - Create summary statistics

# Calculate MOE percentages for white, Black, and Hispanic variables
# Hint: use the same formula as before (margin/estimate * 100)
tract_demo <- tract_demo %>%
  mutate(
    white_moe_pct    = (whiteM / whiteE) * 100,
    black_moe_pct    = (blackM / blackE) * 100,
    hispanic_moe_pct = (hispanicM / hispanicE) * 100
  )
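One caveat with this formula: when a subgroup estimate is zero, dividing the margin by the estimate yields Inf (or NaN if the margin is also zero), which contributes to the very high flag rate reported below; a minimal illustration on toy values:

```r
# Toy tract estimates and margins; the second tract has a zero estimate
est <- c(250, 0, 1200)
moe <- c(80, 12, 90)

moe_pct <- (moe / est) * 100
moe_pct             # 32, Inf, 7.5: the zero-estimate tract blows up
is.finite(moe_pct)  # TRUE FALSE TRUE
```

A guard such as ifelse(est > 0, moe / est * 100, NA_real_) would record such tracts as missing rather than infinite, keeping them out of the flag counts.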

# Create a flag for tracts with high MOE on any demographic variable
# Use logical operators (| for OR) in an ifelse() statement
tract_demo <- tract_demo %>%
  mutate(
    high_demo_moe = ifelse(
      white_moe_pct > 15 | black_moe_pct > 15 | hispanic_moe_pct > 15,
      TRUE,
      FALSE
    )
  )

# Create summary statistics showing how many tracts have data quality issues
demo_moe_summary <- tract_demo %>%
  summarize(
    total_tracts = n(),
    tracts_flagged = sum(high_demo_moe, na.rm = TRUE),
    pct_flagged = (tracts_flagged / total_tracts) * 100
  )

demo_moe_summary
# A tibble: 1 × 3
  total_tracts tracts_flagged pct_flagged
         <int>          <int>       <dbl>
1           47             45        95.7

4.2 Pattern Analysis

Your Task: Investigate whether data quality problems are randomly distributed or concentrated in certain types of communities.

# Group tracts by whether they have high MOE issues
# Calculate average characteristics for each group:
# - population size, demographic percentages
# Use group_by() and summarize() to create this comparison

pattern_summary <- tract_demo %>%
  group_by(high_demo_moe) %>%
  summarize(
    n_tracts = n(),
    avg_total_pop = mean(total_popE, na.rm = TRUE),
    avg_pct_white = mean(pct_white, na.rm = TRUE),
    avg_pct_black = mean(pct_black, na.rm = TRUE),
    avg_pct_hispanic = mean(pct_hispanic, na.rm = TRUE)
  ) %>%
  mutate(
    group = ifelse(high_demo_moe, "High MOE issues", "No high MOE issues"),
    avg_total_pop = round(avg_total_pop, 0),
    avg_pct_white = round(avg_pct_white, 2),
    avg_pct_black = round(avg_pct_black, 2),
    avg_pct_hispanic = round(avg_pct_hispanic, 2)
  ) %>%
  select(group, n_tracts, avg_total_pop, avg_pct_white, avg_pct_black, avg_pct_hispanic)

# Create a professional table showing the patterns
knitr::kable(
  pattern_summary,
  col.names = c("Group", "Number of tracts", "Avg total population", "Avg % White", "Avg % Black", "Avg % Hispanic"),
  caption = "Comparison of tract characteristics by presence of high MOE demographic issues."
)
Comparison of tract characteristics by presence of high MOE demographic issues.

| Group              | Number of tracts | Avg total population | Avg % White | Avg % Black | Avg % Hispanic |
|--------------------|-----------------:|---------------------:|------------:|------------:|---------------:|
| No high MOE issues | 2                | 7380                 | 27.78       | 18.52       | 45.80          |
| High MOE issues    | 45               | 4228                 | 45.25       | 2.11        | 45.49          |

Pattern Analysis: Census tracts flagged with high demographic MOE issues tend to have smaller populations and a more uneven racial composition, especially a much lower Black share. When a subgroup is very small in a tract, ACS has fewer sampled cases to estimate that group’s count, so the margin of error becomes large relative to the estimate. This pattern is consistent with sparsely populated tracts where limited sample size amplifies sampling error and reduces reliability.

Part 5: Policy Recommendations

5.1 Analysis Integration and Professional Summary

Your Task: Write an executive summary that integrates findings from all four analyses.

Executive Summary Requirements: 1. Overall Pattern Identification: What are the systematic patterns across all your analyses? 2. Equity Assessment: Which communities face the greatest risk of algorithmic bias based on your findings? 3. Root Cause Analysis: What underlying factors drive both data quality issues and bias risk? 4. Strategic Recommendations: What should the Department implement to address these systematic issues?

Executive Summary: Across the analyses above, data quality varies systematically rather than randomly. Reliability generally improves with larger geographies (counties) and degrades at smaller geographies (census tracts). At the census tract level, demographic estimates show even greater variability, and high-MOE flags cluster in tracts with smaller populations and more uneven racial composition. This is because subgroup counts in those tracts are low and sampling variability is amplified.

The greatest risk of algorithmic bias falls on communities that are hardest to measure reliably: sparsely populated areas with imbalanced demographics, especially places where minority subgroup counts are small. Algorithmic decision systems that rely on these estimates may misjudge demand, misallocate resources, or underestimate vulnerable groups. As a result, a tract may be incorrectly treated as “low priority” or “high priority” due to noise rather than true conditions, and this risk increases when models treat ACS estimates as precise inputs without accounting for MOE.

The bias risk is driven by human choices embedded throughout the process, including variable selection, classification, proxies, and interpretation, as well as structural sampling limits of small-area ACS estimates. In particular, using a single indicator as a stand-in for “demand” can create proxy failure. In this case, small subgroup counts at the tract level amplify sampling variability, producing high MOE percentages that can distort downstream ranking and eligibility rules.

The Department should treat MOE% as decision-relevant information by flagging or down-weighting high-MOE tract estimates, requiring corroboration from administrative or local program data when reliability is low, and documenting these rules transparently. It should also build equity safeguards into the workflow by adding human review and using vulnerability adjustments, so communities are not excluded simply because their data are noisier.

5.2 Specific Recommendations

Your Task: Create a decision framework for algorithm implementation.

# Create a summary table using your county reliability data
# Include: county name, median income, MOE percentage, reliability category
recommendations_table <- county_data %>%
  select(county, median_incomeE, income_moe_pct, reliability_categories) 

# Add a new column with algorithm recommendations using case_when():
# - High Confidence: "Safe for algorithmic decisions"
# - Moderate Confidence: "Use with caution - monitor outcomes"  
# - Low Confidence: "Requires manual review or additional data"
recommendations_table <- recommendations_table %>%
   mutate(
    algorithm_recommendation = case_when(
      reliability_categories == "High Confidence" ~ "Safe for algorithmic decisions",
      reliability_categories == "Moderate Confidence" ~ "Use with caution - monitor outcomes",
      reliability_categories == "Low Confidence" ~ "Requires manual review or additional data",
      TRUE ~ NA_character_
    )
  ) %>%
  arrange(match(reliability_categories, c("High Confidence", "Moderate Confidence", "Low Confidence")))

# Format as a professional table with kable()
knitr::kable(
  recommendations_table,
  col.names = c("County", "Median household income (E)", "MOE (%)", "Reliability category", "Algorithm recommendation"),
  caption = "Decision framework for algorithm implementation based on county-level income reliability."
)
Decision framework for algorithm implementation based on county-level income reliability.

| County          | Median household income (E) | MOE (%)    | Reliability category | Algorithm recommendation |
|-----------------|----------------------------:|-----------:|----------------------|--------------------------|
| Alameda         | 122488 | 1.0049964  | High Confidence     | Safe for algorithmic decisions |
| Butte           | 66085  | 3.4213513  | High Confidence     | Safe for algorithmic decisions |
| Calaveras       | 77526  | 4.9983231  | High Confidence     | Safe for algorithmic decisions |
| Contra Costa    | 120020 | 1.2464589  | High Confidence     | Safe for algorithmic decisions |
| El Dorado       | 99246  | 3.3552990  | High Confidence     | Safe for algorithmic decisions |
| Fresno          | 67756  | 1.4271799  | High Confidence     | Safe for algorithmic decisions |
| Humboldt        | 57881  | 3.6816917  | High Confidence     | Safe for algorithmic decisions |
| Imperial        | 53847  | 4.1116497  | High Confidence     | Safe for algorithmic decisions |
| Kern            | 63883  | 2.0741042  | High Confidence     | Safe for algorithmic decisions |
| Kings           | 68540  | 3.2871316  | High Confidence     | Safe for algorithmic decisions |
| Lake            | 56259  | 4.3353064  | High Confidence     | Safe for algorithmic decisions |
| Los Angeles     | 83411  | 0.5263095  | High Confidence     | Safe for algorithmic decisions |
| Madera          | 73543  | 3.8671254  | High Confidence     | Safe for algorithmic decisions |
| Marin           | 142019 | 2.8855294  | High Confidence     | Safe for algorithmic decisions |
| Mendocino       | 61335  | 3.5835983  | High Confidence     | Safe for algorithmic decisions |
| Merced          | 64772  | 3.3069845  | High Confidence     | Safe for algorithmic decisions |
| Monterey        | 91043  | 2.0869260  | High Confidence     | Safe for algorithmic decisions |
| Napa            | 105809 | 2.8173407  | High Confidence     | Safe for algorithmic decisions |
| Nevada          | 79395  | 4.8151647  | High Confidence     | Safe for algorithmic decisions |
| Orange          | 109361 | 0.8065032  | High Confidence     | Safe for algorithmic decisions |
| Placer          | 109375 | 1.6969143  | High Confidence     | Safe for algorithmic decisions |
| Riverside       | 84505  | 1.2555470  | High Confidence     | Safe for algorithmic decisions |
| Sacramento      | 84010  | 0.9713129  | High Confidence     | Safe for algorithmic decisions |
| San Bernardino  | 77423  | 1.0410343  | High Confidence     | Safe for algorithmic decisions |
| San Diego       | 96974  | 1.0239858  | High Confidence     | Safe for algorithmic decisions |
| San Francisco   | 136689 | 1.4295225  | High Confidence     | Safe for algorithmic decisions |
| San Joaquin     | 82837  | 1.7480112  | High Confidence     | Safe for algorithmic decisions |
| San Luis Obispo | 90158  | 2.5632778  | High Confidence     | Safe for algorithmic decisions |
| San Mateo       | 149907 | 1.7484174  | High Confidence     | Safe for algorithmic decisions |
| Santa Barbara   | 92332  | 2.0458779  | High Confidence     | Safe for algorithmic decisions |
| Santa Clara     | 153792 | 1.0046036  | High Confidence     | Safe for algorithmic decisions |
| Santa Cruz      | 104409 | 3.0390100  | High Confidence     | Safe for algorithmic decisions |
| Shasta          | 68347  | 3.6285426  | High Confidence     | Safe for algorithmic decisions |
| Siskiyou        | 53898  | 4.8981409  | High Confidence     | Safe for algorithmic decisions |
| Solano          | 97037  | 1.7766419  | High Confidence     | Safe for algorithmic decisions |
| Sonoma          | 99266  | 2.0026998  | High Confidence     | Safe for algorithmic decisions |
| Stanislaus      | 74872  | 1.8311251  | High Confidence     | Safe for algorithmic decisions |
| Sutter          | 72654  | 4.7141245  | High Confidence     | Safe for algorithmic decisions |
| Tulare          | 64474  | 2.3094581  | High Confidence     | Safe for algorithmic decisions |
| Ventura         | 102141 | 1.4959713  | High Confidence     | Safe for algorithmic decisions |
| Yolo            | 85097  | 2.7415773  | High Confidence     | Safe for algorithmic decisions |
| Yuba            | 66693  | 4.1923440  | High Confidence     | Safe for algorithmic decisions |
| Amador          | 74853  | 8.0798365  | Moderate Confidence | Use with caution - monitor outcomes |
| Colusa          | 69619  | 8.2520576  | Moderate Confidence | Use with caution - monitor outcomes |
| Del Norte       | 61149  | 7.1562904  | Moderate Confidence | Use with caution - monitor outcomes |
| Glenn           | 64033  | 6.1889963  | Moderate Confidence | Use with caution - monitor outcomes |
| Inyo            | 63417  | 8.5986407  | Moderate Confidence | Use with caution - monitor outcomes |
| Lassen          | 59515  | 5.9665631  | Moderate Confidence | Use with caution - monitor outcomes |
| Mariposa        | 60021  | 8.8185802  | Moderate Confidence | Use with caution - monitor outcomes |
| Modoc           | 54962  | 9.8049562  | Moderate Confidence | Use with caution - monitor outcomes |
| San Benito      | 104451 | 5.2282889  | Moderate Confidence | Use with caution - monitor outcomes |
| Tehama          | 59029  | 6.9508208  | Moderate Confidence | Use with caution - monitor outcomes |
| Tuolumne        | 70432  | 6.6603249  | Moderate Confidence | Use with caution - monitor outcomes |
| Alpine          | 101125 | 17.2479604 | Low Confidence      | Requires manual review or additional data |
| Mono            | 82038  | 18.7571613 | Low Confidence      | Requires manual review or additional data |
| Plumas          | 67885  | 11.4487737 | Low Confidence      | Requires manual review or additional data |
| Sierra          | 61108  | 15.1158604 | Low Confidence      | Requires manual review or additional data |
| Trinity         | 47317  | 12.4479574 | Low Confidence      | Requires manual review or additional data |

Key Recommendations:

Your Task: Use your analysis results to provide specific guidance to the department.

  1. Counties suitable for immediate algorithmic implementation: Counties with high confidence data: Alameda, Butte, Calaveras, Contra Costa, El Dorado, Fresno, Humboldt, Imperial, Kern, Kings, Lake, Los Angeles, Madera, Marin, Mendocino, Merced, Monterey, Napa, Nevada, Orange, Placer, Riverside, Sacramento, San Bernardino, San Diego, San Francisco, San Joaquin, San Luis Obispo, San Mateo, Santa Barbara, Santa Clara, Santa Cruz, Shasta, Siskiyou, Solano, Sonoma, Stanislaus, Sutter, Tulare, Ventura, Yolo, Yuba. These counties are appropriate for immediate algorithmic use because their median household income estimates fall in the High Confidence category (MOE% < 5%), meaning the margin of error is small relative to the estimate and the input data are less likely to distort rankings, eligibility thresholds, or resource allocation decisions.

  2. Counties requiring additional oversight: Counties with moderate confidence data: Amador, Colusa, Del Norte, Glenn, Inyo, Lassen, Mariposa, Modoc, San Benito, Tehama, Tuolumne. These counties fall in the Moderate Confidence range (MOE% 5–10%), so automated decisions should be used with caution and paired with monitoring that checks whether small shifts in the income estimate (within the MOE range) would change rankings or eligibility outcomes. In practice, this means running sensitivity tests around key thresholds, tracking how often these counties move in or out of priority groups across updates, and reviewing with additional administrative/local data before actions are taken.

  3. Counties needing alternative approaches: Counties with low confidence data: Alpine, Mono, Plumas, Sierra, Trinity. These counties fall in the Low Confidence category (MOE% > 10%), so relying on the ACS median income estimate alone creates a high risk of misranking. The Department should require manual review or secondary validation before automated actions are triggered, and consider pooling multiple years or using broader geographies. Where feasible, targeted outreach or supplemental surveying in these small-population counties can also reduce uncertainty over time.
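The threshold sensitivity test described in recommendation 2 can be sketched as follows: shift a county's estimate to each end of its MOE interval and check whether the reliability classification would flip (toy figures chosen near the 5% cutoff, not ACS values):

```r
# Toy county whose income estimate sits just under the 5% MOE cutoff
est <- 60000
moe <- 2900

(moe / est) * 100          # about 4.83: nominally High Confidence
(moe / (est + moe)) * 100  # about 4.61: still under 5% at the upper bound
(moe / (est - moe)) * 100  # about 5.08: crosses the cutoff at the lower bound
```

Counties whose classification flips within their own MOE interval deserve the same monitoring as Moderate Confidence counties, even when the point estimate lands in the High Confidence band.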

Questions for Further Investigation

  1. How stable are county- and tract-level reliability classifications over time?
  2. Are high-MOE tracts spatially clustered within counties, and how do those clusters align with service access, poverty, or other vulnerability indicators?
  3. How does demographic MOE-based reliability vary across additional characteristics such as age structure, housing tenure, or language proficiency, and do these factors identify communities that face compounded measurement risk?

Technical Notes

Data Sources: - U.S. Census Bureau, American Community Survey 2018-2022 5-Year Estimates - Retrieved via tidycensus R package on February 8, 2026

Reproducibility: - All analysis conducted in R version 4.5.1 - Census API key required for replication - Complete code and documentation available at: https://zhuosiyang-01.github.io/Zhuosi/

Methodology Notes: I used ACS 2018–2022 5-year estimates and retrieved county- and tract-level data via tidycensus with output = “wide”. County-level income reliability was assessed using MOE% = (MOE/estimate)×100 and classified into High (<5%), Moderate (5–10%), and Low (>10%) confidence categories. The same logic was applied to tract-level demographic variables, where a tract was flagged when any demographic variable's MOE% exceeded 15%. Counties for tract-level follow-up (Madera, Lassen, and Trinity, one from each reliability category) were initially drawn at random and then fixed to keep the analysis reproducible while still comparing patterns across different reliability contexts.

Limitations: First, the ACS is sample-based, so MOE-driven reliability issues are expected to increase at smaller geographies such as census tracts, especially where subgroup counts are small. This can cause many tracts to be flagged even when the county-level estimate appears reliable. Second, the analysis reflects the 2018–2022 5-year period and may not capture short-term changes after 2022, and results may differ if a different ACS release, geography, or threshold is used. Finally, tract-level demographic MOE% is sensitive to small denominators, meaning reliability flags may be more common in rural or demographically concentrated areas and should be interpreted as measurement risk rather than definitive evidence of true population patterns.


Submission Checklist

Before submitting your portfolio link on Canvas:

Remember: Submit your portfolio URL on Canvas, not the file itself. Your assignment should be accessible at your-portfolio-url/labs/lab_1/your_file_name.html