Air Quality Project

Data Science Project by Josue Sanchez Hernandez and Phebe Palmer

10 minutes

Introduction #

Exposure to air pollution is the 4th leading cause of premature death around the world. As environmental concerns have become increasingly prevalent in public discourse, we find it crucial that we understand which populations are most prone to being exposed to suboptimal environmental conditions. Understanding the effects and periphery of air pollution can help to inform policy and direct action moving forward.

For the sake of our analysis, we decided to focus on LA for the year 2015. It is a highly diverse and densely populated area in a state that has been at the forefront of domestic conversations surrounding air pollution in recent years. We decided to compile different demographic and health-related indicators with measurements of air quality from different locations around LA. We were specifically interested in the concentration of the following pollutants: Particulate Matter, Ozone, and Carbon Monoxide. These have been listed as some of the pollutants with the most hazardous health effects by organizations such as the EPA.

Data #

Census API #

Our data comes from 3 sources. Our demographic data was collected using the R Package censusapi. This API allows users to draw information from a variety of surveys conducted by the Census Bureau. We focused on the Census Planning Database (PDB) from the year 2015. This data is available both by block and by census tract. For our analysis, we drew from the tract-level database. Below is the code used to ingest and wrangle the data. An API Key was necessary to access this data. Poverty and Black population were not listed as proportions, so for consistency, these variables were transformed into proportions by dividing by population in each tract. The variables that we ended up with from this dataset were: Tract, Percent of the population with no health insurance, Percent of the population without HS diploma, Total population, Percent of the population below the poverty level, Percent Asian, Percent White, Percent Hispanic, and Percent Black.

# USE ACS 1-Year Detailed Tables
# Add key to .Renviron
Sys.setenv(CENSUS_KEY = "###") # Placeholder API Key
# Reload .Renviron
readRenviron("~/.Renviron")
# LA Specifically
demoLA <- getCensus(
  name = "pdb/tract",
  vintage = 2015,
  vars = c(
    "pct_Hispanic_ACS_09_13",
    "pct_No_Health_Ins_ACS_09_13",
    "pct_NH_White_alone_ACS_09_13",
    "pct_Not_HS_Grad_ACS_09_13",
    "pct_NH_Asian_alone_ACS_09_13",
    "Tot_Population_ACS_09_13",
    "Prs_Blw_Pov_Lev_ACS_09_13",
    "NH_Blk_alone_ACS_09_13"
  ),
  region = "tract:*",
  regionin = "state:06+county:037"
)

demoLA_refined <- demoLA %>%
  rename(
    hispanic = pct_Hispanic_ACS_09_13,
    white = pct_NH_White_alone_ACS_09_13,
    asian = pct_NH_Asian_alone_ACS_09_13,
    no_hs_degree = pct_Not_HS_Grad_ACS_09_13,
    no_hi = pct_No_Health_Ins_ACS_09_13,
    population = Tot_Population_ACS_09_13
  ) %>%
  mutate(
    poverty = (Prs_Blw_Pov_Lev_ACS_09_13 / population) * 100,
    black = (NH_Blk_alone_ACS_09_13 / population) * 100,
    no_hi = as.numeric(no_hi)
  )

Asthma Data #

Our tract-level asthma rates were extracted from the CDC 500 Cities project, which provides census tract-level estimates for chronic disease risk factors and health outcomes, for the 500 largest cities in the United States. This data was available as a .csv file on their website. We extracted Tract and Percent of Adult population with asthma.

# Asthma Rates by Census Tract for Plots
Asthma <- read_csv("indicator_data_download_20191203.csv")
names(Asthma) <- str_replace_all(names(Asthma), c(" " = "_"))
Asthma <- Asthma %>%
  filter(Period_of_Measure == 2015) %>%
  mutate(tract = substr(Location, 6, 11)) %>%
  select(tract, Indicator_Value) %>%
  rename(asthma_prct = Indicator_Value)

EPA Data #

The data that we used to assess atmospheric quality was extracted from the EPA’s Historical Air Quality dataset available on Big Query. The authorization .json file and project name are necessary to establish a connection to the database, where data was called as a SQL command. This data was not tidy, as multiple rows were given for each location, where each atmospheric indicator was listed as a separate row. Each recording location listed two separate values for “average annual value” (arithmetic_mean). We simply averaged these two values for the year. We ended up with the following variables: Latitude of recording station, Longitude of recording station, Annual average for Particulate Matter (micrograms per cubic meter), Annual average for Ozone (parts per million), and Annual Average for Carbon Monoxide (parts per million).

Below, a separate dataset was created for all of California. This was used to calculate an annual average for each of the three indicators for our presentation. Our biggest challenge with this dataset is that the values recorded for each location were associated with the latitude and longitude of the station. We found the census blocks for each location, which we were then able to extract the tracts from. We initially joined the datasets based on these extracted tracts, but drawing inference from this dataset proved challenging, as there were very few recording stations relative to the number of tracts in LA.

options(gargle_quiet = FALSE)
bq_auth(path = "###.json") # link json file
myproject <- "###" # assign google key
con <- dbConnect(
  bigrquery::bigquery(),
  project = "publicdata",
  dataset = "samples",
  billing = myproject
)
# All Data
sql <- "SELECT state_name, county_code, latitude, longitude, year, arithmetic_mean,
first_max_value, units_of_measure, parameter_name
FROM `bigquery-public-data.epa_historical_air_quality.air_quality_annual_summary`
WHERE state_name = 'California' AND year = 2015
AND parameter_name in( 'PM2.5 - Local Conditions', 'Ozone', 'Carbon monoxide');"

tb <- bq_project_query(myproject, sql) # use your project here
epa_data <- bq_table_download(tb)

epa_data_vis <- epa_data %>%
  filter(state_name == "California") %>%
  select(latitude, longitude, arithmetic_mean, parameter_name) %>%
  group_by(latitude, longitude, parameter_name) %>%
  summarize(arithmetic_mean = mean(arithmetic_mean)) %>%
  spread(parameter_name, arithmetic_mean)
names(epa_data_vis) <- str_replace_all(names(epa_data_vis), c(" " = "_", "-" = ""))
epa_data_vis <- epa_data_vis %>%
  rename(
    "ozone" = "Ozone", "co" = "Carbon_monoxide",
    "pm_2_5" = "PM2.5__Local_Conditions"
  )

# LA County
epa_data_la <- epa_data %>% filter(county_code == "037")
# FCC API Call Loop
census_block <- vector()
for (i in 1:nrow(epa_data_la)) {
  fcc_url <- paste0(
    "https://geo.fcc.gov/api/census/block/find?latitude=",
    epa_data_la$latitude[i], "&longitude=", epa_data_la$longitude[i],
    "&showall=false&format=json"
  )
  fcc_api_req <- GET(fcc_url)
  fcc_json <- content(fcc_api_req, as = "text")
  fcc_info <- fromJSON(fcc_json)

  census_block <- append(census_block, fcc_info$Block$FIPS)
}
epa_data_la$census_block <- census_block # Adding to original sample
epa_data_la_tracts <- epa_data_la %>% mutate(tract = substr(census_block, 6, 11))

# Tidy for EPA
epa_data_la_tidy <- epa_data_la_tracts %>%
  select(census_block, tract, latitude, longitude, arithmetic_mean, parameter_name) %>%
  group_by(latitude, longitude, census_block, tract, parameter_name) %>%
  summarize(arithmetic_mean = mean(arithmetic_mean)) %>%
  spread(parameter_name, arithmetic_mean)
names(epa_data_la_tidy) <- str_replace_all(names(epa_data_la_tidy), c(" " = "_", "-" = ""))
epa_data_la_tidy <- epa_data_la_tidy %>%
  rename(
    "ozone" = "Ozone", "co" = "Carbon_monoxide",
    "pm_2_5" = "PM2.5__Local_Conditions"
  )

Shapefiles #

# Getting Shapefiles
shapefile_og <- tracts(state = "06", county = "37", year = 2015, progress_bar = FALSE)

# Join Census demographic data with asthma data
temp_joined <- Asthma %>%
  left_join(demoLA_refined) %>%
  select(-c(state, county))

# Merging shapefile with all of the demographic data by tract
shapefile <- merge(shapefile_og, temp_joined, by.x = c("TRACTCE"), by.y = c("tract"), duplicateGeoms = TRUE)

# Creating shapefile dataframe for imputation
# shape_df <- as(shapefile, "data.frame")
shape_df <- as.data.frame(shapefile)

Data Imputation #

To move around this limitation, we decided to use a method of imputation. We joined our combined dataset with tract-level shapefiles and extracted the range of latitudes and longitudes for each tract. We then averaged the values for latitude and longitude for each tract to determine a “center”-point for each tract. We then calculated the distance from each of these tracts to all of the recording stations in LA and imputed the measurements from the closest location. This gave us a full dataset, which we then used for our Shiny App.

# Function to calculate distance between tracts and Epa Monitors
calc_distance <- function(tractlong, tractlat, stations) {
  distance <- sqrt((tractlong - stations$longitude)^2 + (tractlat - stations$latitude)^2)
  return(distance)
}

# Changing latitude and lingitude variable names
shape_df <- shape_df %>%
  mutate(lat = as.numeric(INTPTLAT), lon = as.numeric(INTPTLON))
shape_df_test <- shape_df

# Loop to find nearest epa air quality monitor to each tract
shape_df_test$nearMonitor <- NA
for (i in 1:nrow(shape_df_test)) {
  dist <- (calc_distance(
    as.numeric(shape_df_test$lon[i]),
    as.numeric(shape_df_test$lat[i]), epa_data_la_tidy
  ))
  min <- which.min(dist)
  shape_df_test$nearMonitor[i] <- as.numeric(epa_data_la_tidy$tract[min])
}

Joining Data #

# Joining new data with  Tract data shape data
join_df <- shape_df_test %>%
  select(TRACTCE, nearMonitor) %>%
  mutate(nearMonitor = as.character(nearMonitor))
# Joining imputated data with EPA air quality data
join_df <- left_join(join_df, epa_data_la_tidy, by = c("nearMonitor" = "tract"))

# Joining imputated data with Census and Asthma data
shiny_joined <- left_join(temp_joined, join_df, by = c("tract" = "TRACTCE"))
shapefile_merged <- merge(shapefile_og, shiny_joined, by.x = c("TRACTCE"), by.y = c("tract"), duplicateGeoms = TRUE)

Findings #

We created a Shiny app that allows users to map the demographic, asthma, and air quality variables we collected on a Leaflet map. We have included leaflet apps that can be generated on our Shiny app to demonstrate our findings.

# Leaflet map colored by poverty rates
colorData <- shapefile_merged[["poverty"]]
pal <- colorBin("viridis", colorData, 7, pretty = FALSE)

leaflet(shapefile_merged) %>%
  addTiles() %>%
  addPolygons(weight = 0.6, stroke = FALSE, fillOpacity = 0.4, fillColor = pal(colorData)) %>%
  addLegend(pal = pal, values = colorData, title = "Poverty Rates", layerId = "colorLegend") %>%
  addCircleMarkers(lng = epa_data_la_tidy$longitude, lat = epa_data_la_tidy$latitude, weight = 0, fillColor = "#FF3232", fillOpacity = 0.7, radius = 7)
# Leaflet map colored by Asthma Percentage
colorData <- shapefile_merged[["asthma_prct"]]
pal <- colorBin("viridis", colorData, 7, pretty = FALSE)
leaflet(shapefile_merged) %>%
  addTiles() %>%
  addPolygons(weight = 0.6, stroke = FALSE, fillOpacity = 0.4, fillColor = pal(colorData)) %>%
  addLegend(pal = pal, values = colorData, title = "Asthma Rate", layerId = "colorLegend") %>%
  addCircleMarkers(lng = epa_data_la_tidy$longitude, lat = epa_data_la_tidy$latitude, weight = 0, fillColor = "#FF3232", fillOpacity = 0.7, radius = 7)