tidycensus
?tidycensus
is an R
package that allows users to interface with the US Census Bureau’s decennial Census and five-year American Community APIs and return tidyverse-ready data frames.
tidycensus
is created and maintained by Kyle Walker, a professor in geography at TCU.
Yes but with limitations. While the Census does offer programatic access to some of its data products it is far from comprehensive and even for the ones it does offer access to its not temporally comprehensive. You can see a complete list of the available APIs here and even request new APIs to be made available.
tidycensus
library(tidyverse)
library(jsonlite)
# For this tutorial you should have obtained a census API key
# you can replace the line below with your own key
# myKey <- "my_census_api_key"
myKEY <- Sys.getenv("CENSUS_API_KEY")
url <- "https://api.census.gov"
path <- "/data/2018/acs/acs1/"
query <- paste0("?get=NAME,group(B01001)&for=state&key=", myKEY)
callMAT <- fromJSON(paste0(url, path, query))
colnames(callMAT) <- callMAT[1,]
as_tibble(callMAT)[2:nrow(callMAT),1:3] %>%
mutate(B01001_001E = as.numeric(B01001_001E)) %>%
arrange(NAME)
## # A tibble: 52 x 3
## NAME GEO_ID B01001_001E
## <chr> <chr> <dbl>
## 1 Alabama 0400000US01 4887871
## 2 Alaska 0400000US02 737438
## 3 Arizona 0400000US04 7171646
## 4 Arkansas 0400000US05 3013825
## 5 California 0400000US06 39557045
## 6 Colorado 0400000US08 5695564
## 7 Connecticut 0400000US09 3572665
## 8 Delaware 0400000US10 967171
## 9 District of Columbia 0400000US11 702455
## 10 Florida 0400000US12 21299325
## # … with 42 more rows
Want to store your API key in a separate file? Try something like this:
tidycensus
library(tidycensus)
library(sf)
library(mapview)
census_api_key(myKEY)
get_acs(geography = "state",
variables = c("Total Population" = "B01001_001"),
year = 2018,
survey = "acs1")
## # A tibble: 52 x 5
## GEOID NAME variable estimate moe
## <chr> <chr> <chr> <dbl> <dbl>
## 1 01 Alabama Total Population 4887871 NA
## 2 02 Alaska Total Population 737438 NA
## 3 04 Arizona Total Population 7171646 NA
## 4 05 Arkansas Total Population 3013825 NA
## 5 06 California Total Population 39557045 NA
## 6 08 Colorado Total Population 5695564 NA
## 7 09 Connecticut Total Population 3572665 NA
## 8 10 Delaware Total Population 967171 NA
## 9 11 District of Columbia Total Population 702455 NA
## 10 12 Florida Total Population 21299325 NA
## # … with 42 more rows
tidycensus
apiR
as your primary workimg environmentFor this tutorial we will run through several examples of how you would go about getting data from the census API when you have a research idea but aren’t sure what the proper codes are for a given census year or ACS. For this example we will use the 2018 ACS 5 year to get information about the educational attainment of census tracts in king county.
# Download the Variable dictionary
var_df <- load_variables(2018, "acs5", cache = TRUE)
# Use the View function to see where educational attainment variables are
# SEARCH in concept "Educational Attainment"
# then narrow down using the name field B15003_02
var_selection <- c(Bachelor = "B15003_022")
raw_tract_edu_df <- get_acs("tract",
variables = var_selection,
summary_var = "B15003_001",
state = "WA",
county = "King",
geometry = TRUE,
moe = 95,
cache_table = TRUE)
tidyverse
Using the built in function moe_prop
we can calculate margin of errors for proportions with appropriate upper and lower bounds. We can then visualize our results using ggplot
.
tract_edu_df <- raw_tract_edu_df %>%
mutate(
# Calculate the proportion
college_completion_prop = estimate/summary_est,
# recalculate margin of error for proportions
ccp_moe = moe_prop(estimate, summary_est, moe, summary_moe),
# get lower and upper estimate bounds
ccp_upr = college_completion_prop + ccp_moe,
ccp_lwr = college_completion_prop - ccp_moe)
ggplot(tract_edu_df, aes(x=college_completion_prop)) +
geom_histogram() +
theme_classic()
In addition to the data estimates from ACS, we also pulled geography information through tidycensus
. The function returns an sf
object which can easily be plotted through ggplot
and made into interactive plots using leaflet
or wrappers like mapplot
.
Table B11009
contains information about unmarried-partner households. We can get every variable in the table at once by using the table
argument instead of the variable
argument.
# get the table
tract_hh <- get_acs("tract",
table = "B11009",
summary_var = "B11009_001",
year = 2018,
state = "WA",
county = "King",
geometry = TRUE,
survey = "acs5")
# wouldn't it be nice if the variables had labels?
table_b11009 <-
var_df %>%
filter(str_starts(name, "B11009")) %>%
mutate(short_label = str_split(label, "!!"),
short_label = map_chr(short_label, tail, 1)) %>%
select(name, short_label)
tract_hh <-
tract_hh %>%
left_join(table_b11009, by = c("variable" = "name"))
# calculate proportions
tract_hh_prop <-
tract_hh %>%
mutate(prop = estimate/summary_est,
prop_moe = moe_prop(estimate, summary_est,
moe, summary_moe))
# make a map
tract_hh_prop %>%
filter(short_label %in% c(
"Male householder and male partner",
"Female householder and female partner"
)) %>%
ggplot() +
geom_sf(aes(fill = prop), size = .25) +
facet_wrap(vars(short_label)) +
scale_fill_viridis_c() +
theme_void()
tigris
The tidycensus
package uses the tigris
package under the hood to get the geometries for spatial units, but not every kind of unit is covered. For those that aren’t, you can use tigris
directly. For instance, here’s how to get the boundaries for a certain place called “Seattle”:
library(tigris)
options(tigris_use_cache = TRUE)
# load places in Washington
wa <- places("WA", year = 2018, class = "sf")
# get the water bodies of King County
king_water <- area_water("WA", "King", class = "sf")
# filter to Seattle
seattle <-
wa %>%
filter(NAME == "Seattle")
ggplot(seattle) +
geom_sf(fill = "transparent") +
theme_minimal()
Doesn’t that shape look familiar? No?
Let’s intersect it with the King County tracts and remove the bodies of water from the final product:
Because of the way that the API is constructed, each year of the census and ACS has a different endpoint. This means that you will have to make multiple calls to the API, our get_acs
function, in order to get all desired years of a particular variable. Lucky for us, within the ACS variable definitions are pretty consistent.
years <- c(2012, 2014, 2016, 2018)
all_var_df <- bind_rows(lapply(years, function(x) {
load_variables(x, "acs1", cache = TRUE) %>%
mutate(YEAR = x)
}))
all_var_df %>%
filter(name == "B19113B_001" | name == "B19113A_001")
## # A tibble: 8 x 4
## name label concept YEAR
## <chr> <chr> <chr> <dbl>
## 1 B19113A… Estimate!!Median family incom… <NA> 2012
## 2 B19113B… Estimate!!Median family incom… <NA> 2012
## 3 B19113A… Estimate!!Median family incom… MEDIAN FAMILY INCOME IN THE PAS… 2014
## 4 B19113B… Estimate!!Median family incom… MEDIAN FAMILY INCOME IN THE PAS… 2014
## 5 B19113A… Estimate!!Median family incom… <NA> 2016
## 6 B19113B… Estimate!!Median family incom… <NA> 2016
## 7 B19113A… Estimate!!Median family incom… MEDIAN FAMILY INCOME IN THE PAS… 2018
## 8 B19113B… Estimate!!Median family incom… MEDIAN FAMILY INCOME IN THE PAS… 2018
raw_inc_df <- bind_rows(lapply(years, function(x){
get_acs(
"county",
variables = c(Black = "B19113B_001", White ="B19113A_001"),
state = "WA",
county = "King",
year = x,
survey = "acs1",
moe = 95,
cache_table = TRUE) %>%
mutate(Year = x)}))
inc_df <- raw_inc_df %>%
mutate(
inc_upr = estimate + moe,
inc_lwr = estimate - moe)
inc_df %>%
ggplot(aes(x = Year, y = estimate, ymin = inc_lwr, ymax = inc_upr,
group = variable)) +
geom_line(aes(color = variable)) +
geom_point(aes(color = variable)) +
geom_ribbon(aes(fill = variable), alpha = .4) +
theme_classic() +
ggtitle("Median Household Income King County") +
scale_fill_manual(values=c("#b7a57a", "#4b2e83", "#000000", "#DCDCDC")) +
scale_color_manual(values=c("#b7a57a", "#4b2e83", "#000000", "#DCDCDC")) +
scale_y_continuous(labels = scales::dollar)
raw_inc_df %>%
pivot_wider(names_from = variable, values_from = estimate:moe) %>%
mutate(ratio = estimate_Black/estimate_White) %>%
mutate(moe = moe_ratio(estimate_Black, estimate_White,
moe_Black, moe_White)) %>%
ggplot(aes(x = Year, y = ratio, ymin = ratio - moe, ymax = ratio + moe)) +
geom_line() +
geom_point() +
geom_ribbon(alpha = .4) +
theme_classic() +
ggtitle("Black-White Median Household Income Ratio")