Air Quality Project

Data Science Project by Josue Sanchez Hernandez and Phebe Palmer

December 14, 2019

10 minutes

Introduction #

Exposure to air pollution is the 4th leading cause of premature death around the world. As environmental concerns have become increasingly prevalent in public discourse, we find it crucial that we understand which populations are most prone to being exposed to suboptimal environmental conditions. Understanding the effects and periphery of air pollution can help to inform policy and direct action moving forward.

For the sake of our analysis, we decided to focus on LA for the year 2015. It is a highly diverse and densely populated area in a state that has been at the forefront of domestic conversations surrounding air pollution in recent years. We decided to compile different demographic and health-related indicators with measurements of air quality from different locations around LA. We were specifically interested in the concentration of the following pollutants: Particulate Matter, Ozone, and Carbon Monoxide. These have been listed as some of the pollutants with the most hazardous health effects by organizations such as the EPA.

Data #

Census API #

Our data comes from 3 sources. Our demographic data was collected using the R Package censusapi. This API allows users to draw information from a variety of surveys conducted by the Census Bureau. We focused on the Census Planning Database (PDB) from the year 2015. This data is available both by block and by census tract. For our analysis, we drew from the tract-level database. Below is the code used to ingest and wrangle the data. An API Key was necessary to access this data. Poverty and Black population were not listed as proportions, so for consistency, these variables were transformed into proportions by dividing by population in each tract. The variables that we ended up with from this dataset were: Tract, Percent of the population with no health insurance, Percent of the population without HS diploma, Total population, Percent of the population below the poverty level, Percent Asian, Percent White, Percent Hispanic, and Percent Black.

# USE ACS 1-Year Detailed Tables
# Add key to .Renviron
Sys.setenv(CENSUS_KEY = "###") # Placeholder API Key
# Reload .Renviron
readRenviron("~/.Renviron")

# LA Specifically
demoLA <- getCensus(
  name = "pdb/tract",
  vintage = 2015,
  vars = c(
    "pct_Hispanic_ACS_09_13",
    "pct_No_Health_Ins_ACS_09_13",
    "pct_NH_White_alone_ACS_09_13",
    "pct_Not_HS_Grad_ACS_09_13",
    "pct_NH_Asian_alone_ACS_09_13",
    "Tot_Population_ACS_09_13",
    "Prs_Blw_Pov_Lev_ACS_09_13",
    "NH_Blk_alone_ACS_09_13"
  ),
  region = "tract:*",
  regionin = "state:06+county:037"
)

demoLA_refined <- demoLA %>%
  rename(
    hispanic = pct_Hispanic_ACS_09_13,
    white = pct_NH_White_alone_ACS_09_13,
    asian = pct_NH_Asian_alone_ACS_09_13,
    no_hs_degree = pct_Not_HS_Grad_ACS_09_13,
    no_hi = pct_No_Health_Ins_ACS_09_13,
    population = Tot_Population_ACS_09_13
  ) %>%
  mutate(
    poverty = (Prs_Blw_Pov_Lev_ACS_09_13 / population) * 100,
    black = (NH_Blk_alone_ACS_09_13 / population) * 100,
    no_hi = as.numeric(no_hi)
  )

Asthma Data #

Our tract-level asthma rates were extracted from the CDC 500 Cities project, which provides census tract-level estimates for chronic disease risk factors and health outcomes, for the 500 largest cities in the United States. This data was available as a .csv file on their website. We extracted Tract and Percent of Adult population with asthma.

# Asthma Rates by Census Tract for Plots
Asthma <- read_csv("indicator_data_download_20191203.csv")
names(Asthma) <- str_replace_all(names(Asthma), c(" " = "_"))
Asthma <- Asthma %>%
  filter(Period_of_Measure == 2015) %>%
  mutate(tract = substr(Location, 6, 11)) %>%
  select(tract, Indicator_Value) %>%
  rename(asthma_prct = Indicator_Value)

EPA Data #

The data that we used to assess atmospheric quality was extracted from the EPA’s Historical Air Quality dataset available on Big Query. The authorization .json file and project name are necessary to establish a connection to the database, where data was called as a SQL command. This data was not tidy, as multiple rows were given for each location, where each atmospheric indicator was listed as a separate row. Each recording location listed two separate values for “average annual value” (arithmetic_mean). We simply averaged these two values for the year. We ended up with the following variables: Latitude of recording station, Longitude of recording station, Annual average for Particulate Matter (micrograms per cubic meter), Annual average for Ozone (parts per million), and Annual Average for Carbon Monoxide (parts per million).

Below, a separate dataset was created for all of California. This was used to calculate an annual average for each of the three indicators for our presentation. Our biggest challenge with this dataset is that the values recorded for each location were associated with the latitude and longitude of the station. We found the census blocks for each location, which we were then able to extract the tracts from. We initially joined the datasets based on these extracted tracts, but drawing inference from this dataset proved challenging, as there were very few recording stations relative to the number of tracts in LA.

options(gargle_quiet = FALSE)
bq_auth(path = "###.json") # link json file
myproject <- "###" # assign google key
con <- dbConnect(
  bigrquery::bigquery(),
  project = "publicdata",
  dataset = "samples",
  billing = myproject
)

# All Data
sql <- "SELECT state_name, county_code, latitude, longitude, year, arithmetic_mean,
first_max_value, units_of_measure, parameter_name
FROM `bigquery-public-data.epa_historical_air_quality.air_quality_annual_summary`
WHERE state_name = 'California' AND year = 2015
AND parameter_name in( 'PM2.5 - Local Conditions', 'Ozone', 'Carbon monoxide');"

tb <- bq_project_query(myproject, sql) # use your project here
epa_data <- bq_table_download(tb)

epa_data_vis <- epa_data %>%
  filter(state_name == "California") %>%
  select(latitude, longitude, arithmetic_mean, parameter_name) %>%
  group_by(latitude, longitude, parameter_name) %>%
  summarize(arithmetic_mean = mean(arithmetic_mean)) %>%
  spread(parameter_name, arithmetic_mean)
names(epa_data_vis) <- str_replace_all(names(epa_data_vis), c(" " = "_", "-" = ""))
epa_data_vis <- epa_data_vis %>%
  rename(
    "ozone" = "Ozone", "co" = "Carbon_monoxide",
    "pm_2_5" = "PM2.5__Local_Conditions"
  )

# LA County
epa_data_la <- epa_data %>% filter(county_code == "037")
# FCC API Call Loop
census_block <- vector()
for (i in 1:nrow(epa_data_la)) {
  fcc_url <- paste0(
    "https://geo.fcc.gov/api/census/block/find?latitude=",
    epa_data_la$latitude[i], "&longitude=", epa_data_la$longitude[i],
    "&showall=false&format=json"
  )
  fcc_api_req <- GET(fcc_url)
  fcc_json <- content(fcc_api_req, as = "text")
  fcc_info <- fromJSON(fcc_json)

  census_block <- append(census_block, fcc_info$Block$FIPS)
}
epa_data_la$census_block <- census_block # Adding to original sample
epa_data_la_tracts <- epa_data_la %>% mutate(tract = substr(census_block, 6, 11))

# Tidy for EPA
epa_data_la_tidy <- epa_data_la_tracts %>%
  select(census_block, tract, latitude, longitude, arithmetic_mean, parameter_name) %>%
  group_by(latitude, longitude, census_block, tract, parameter_name) %>%
  summarize(arithmetic_mean = mean(arithmetic_mean)) %>%
  spread(parameter_name, arithmetic_mean)
names(epa_data_la_tidy) <- str_replace_all(names(epa_data_la_tidy), c(" " = "_", "-" = ""))
epa_data_la_tidy <- epa_data_la_tidy %>%
  rename(
    "ozone" = "Ozone", "co" = "Carbon_monoxide",
    "pm_2_5" = "PM2.5__Local_Conditions"
  )

Shapefiles #

# Getting Shapefiles
shapefile_og <- tracts(state = "06", county = "37", year = 2015, progress_bar = FALSE)

# Join Census demographic data with asthma data
temp_joined <- Asthma %>%
  left_join(demoLA_refined) %>%
  select(-c(state, county))

# Merging shapefile with all of the demographic data by tract
shapefile <- merge(shapefile_og, temp_joined, by.x = c("TRACTCE"), by.y = c("tract"), duplicateGeoms = TRUE)

# Creating shapefile dataframe for imputation
# shape_df <- as(shapefile, "data.frame")
shape_df <- as.data.frame(shapefile)

Data Imputation #

To move around this limitation, we decided to use a method of imputation. We joined our combined dataset with tract-level shapefiles and extracted the range of latitudes and longitudes for each tract. We then averaged the values for latitude and longitude for each tract to determine a “center”-point for each tract. We then calculated the distance from each of these tracts to all of the recording stations in LA and imputed the measurements from the closest location. This gave us a full dataset, which we then used for our Shiny App.

# Function to calculate distance between tracts and Epa Monitors
calc_distance <- function(tractlong, tractlat, stations) {
  distance <- sqrt((tractlong - stations$longitude)^2 + (tractlat - stations$latitude)^2)
  return(distance)
}

# Changing latitude and lingitude variable names
shape_df <- shape_df %>%
  mutate(lat = as.numeric(INTPTLAT), lon = as.numeric(INTPTLON))
shape_df_test <- shape_df

# Loop to find nearest epa air quality monitor to each tract
shape_df_test$nearMonitor <- NA
for (i in 1:nrow(shape_df_test)) {
  dist <- (calc_distance(
    as.numeric(shape_df_test$lon[i]),
    as.numeric(shape_df_test$lat[i]), epa_data_la_tidy
  ))
  min <- which.min(dist)
  shape_df_test$nearMonitor[i] <- as.numeric(epa_data_la_tidy$tract[min])
}

Joining Data #

# Joining new data with  Tract data shape data
join_df <- shape_df_test %>%
  select(TRACTCE, nearMonitor) %>%
  mutate(nearMonitor = as.character(nearMonitor))
# Joining imputated data with EPA air quality data
join_df <- left_join(join_df, epa_data_la_tidy, by = c("nearMonitor" = "tract"))

# Joining imputated data with Census and Asthma data
shiny_joined <- left_join(temp_joined, join_df, by = c("tract" = "TRACTCE"))
shapefile_merged <- merge(shapefile_og, shiny_joined, by.x = c("TRACTCE"), by.y = c("tract"), duplicateGeoms = TRUE)

Findings #

We created a Shiny app that allows users to map the demographic, asthma, and air quality variables we collected on a Leaflet map. We have included leaflet apps that can be generated on our Shiny app to demonstrate our findings.

# Leaflet map colored by poverty rates
colorData <- shapefile_merged[["poverty"]]
pal <- colorBin("viridis", colorData, 7, pretty = FALSE)

leaflet(shapefile_merged) %>%
  addTiles() %>%
  addPolygons(weight = 0.6, stroke = FALSE, fillOpacity = 0.4, fillColor = pal(colorData)) %>%
  addLegend(pal = pal, values = colorData, title = "Poverty Rates", layerId = "colorLegend") %>%
  addCircleMarkers(lng = epa_data_la_tidy$longitude, lat = epa_data_la_tidy$latitude, weight = 0, fillColor = "#FF3232", fillOpacity = 0.7, radius = 7)

# Leaflet map colored by Asthma Percentage
colorData <- shapefile_merged[["asthma_prct"]]
pal <- colorBin("viridis", colorData, 7, pretty = FALSE)
leaflet(shapefile_merged) %>%
  addTiles() %>%
  addPolygons(weight = 0.6, stroke = FALSE, fillOpacity = 0.4, fillColor = pal(colorData)) %>%
  addLegend(pal = pal, values = colorData, title = "Asthma Rate", layerId = "colorLegend") %>%
  addCircleMarkers(lng = epa_data_la_tidy$longitude, lat = epa_data_la_tidy$latitude, weight = 0, fillColor = "#FF3232", fillOpacity = 0.7, radius = 7)

Census tracts in LA that have high poverty rates tend to also have high asthma percentage rates. Census tracts around Compton, Inglewood and Pamdale have higher asthma (11%-13%) and poverty rates (63-73), while other census tracts have lower rates.

# Leaflet map colored by ozone
colorData <- shapefile_merged[["ozone"]]
pal <- colorBin("viridis", colorData, 7, pretty = FALSE)
leaflet(shapefile_merged) %>%
  addTiles() %>%
  addPolygons(weight = 0.6, stroke = FALSE, fillOpacity = 0.4, fillColor = pal(colorData)) %>%
  addLegend(pal = pal, values = colorData, title = "Ozone Rates", layerId = "colorLegend") %>%
  addCircleMarkers(lng = epa_data_la_tidy$longitude, lat = epa_data_la_tidy$latitude, weight = 0, fillColor = "#FF3232", fillOpacity = 0.7, radius = 7)

Census tracts in Santa Clarita Valley have the highest Ozone levels in LA county. This might be due to the impact recent wildfires have had on air quality in these area. Because of the data inputation, we cannot say there is a strong negative relationship between asthma rates and ozone levels.

# Leaflet map colored by Particle Matter
colorData <- shapefile_merged[["pm_2_5"]]
pal <- colorBin("viridis", colorData, 7, pretty = FALSE)
leaflet(shapefile_merged) %>%
  addTiles() %>%
  addPolygons(weight = 0.6, stroke = FALSE, fillOpacity = 0.4, fillColor = pal(colorData)) %>%
  addLegend(pal = pal, values = colorData, title = "Particulate Matter", layerId = "colorLegend") %>%
  addCircleMarkers(lng = epa_data_la_tidy$longitude, lat = epa_data_la_tidy$latitude, weight = 0, fillColor = "#FF3232", fillOpacity = 0.7, radius = 7)

The map colored by particulate matter shows that census tracts in downtown Los Angeles and Compton have the highest Particulate Matter levels. Although particulate matter level is highest in areas where asthma rates are high, because of the data imputation, we cannot show a strong association between asthma rates and particulate matter levels.

Conclusion #

Our analysis clearly showed that lower-income populations are more prone to ailments like Asthma, and are less likely to have access to health care. It was a bit harder to draw clear conclusions from the environmental indicators we were particularly interested in, but populations with higher concentrations of Asthma are likely more prone to this from some sort of exposure to air pollution.

Having a visualization that allows us to directly compare the demographic indicators of different regions in LA is very important for understanding how health and life quality outcomes can be compared to access to good education and necessary resources in the future.

We acknowledge that there are many limitations to our findings. The imputation technique we used was not necessarily the most precise, and it would have been better if we had atmospheric data at a more precise level. We wonder if there are confounding factors that affect where recording centers are more likely to be (if certain areas of LA are underrepresented in atmospheric data measurements). Also, the values that were recorded as averages tended not to fall outside of a healthy range. Our analysis might have been better suited if we had instead used the maximum recorded value for the year (which was as a variable we initially considered using) to assess which areas were more prone to seeing dangerous air quality levels.

Lastly, we believe that in the future this kind of analysis might be better suited for more relevant years/ locations. In recent years, California has been ravaged by wildfires that have raised extreme concern regarding exposure to air pollution. Perhaps focusing on different locations in California for the past few years might have led to different observations/ conclusions.

Sources #

Background Information: #

Data: #

CENSUS API: https://www.census.gov/data/developers/data-sets.html
CDC 500 CITIES: https://www.cdc.gov/500cities/
EPA HISTORICAL DATASET: https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=epa_historical_air_quality&page=dataset&project=data-compound-255800&folder&organizationId