6 Week 6

Topics: Cohort Component Method of Population Projection; Add Health Study data (exploring variables and data documentation)

This week we will be covering two topics:

Download template.Rmd.txt as a template for this lesson. Save in in your working folder for the course, renamed to week_06.Rmd (with no .txt extension). Make any necessary changes to the YAML header.

6.0.1 Cohort Component Method of Population Projection (CCMPP)

The ccmpp package implements the cohort component method of population projection (CCMPP) described in this section. We will follow the first half of the excellent CCMPP vignette for this package, which gives a worked example of projection Thailand population from 1960 through 2000 using a baseline population for the year 1960.

It has been suggested that you can use the ccmpp package to check your work in CSDE 533.

Let’s start by loading the package. Loading the package also exposes the thailand_data and thailand_initial_estimates data set.

Let’s look at the Thailand data. This shows the object is a list:

is(demCore::thailand_data)
## [1] "list"   "vector"

Looking at the structure with str(), we see it is composed of only one element named population. This is a data.table, which is an enhanced data frame.

str(demCore::thailand_data)
## List of 1
##  $ population:Classes 'data.table' and 'data.frame': 136 obs. of  5 variables:
##   ..$ year     : int [1:136] 1970 1970 1970 1970 1970 1970 1970 1970 1970 1970 ...
##   ..$ sex      : chr [1:136] "female" "female" "female" "female" ...
##   ..$ age_start: num [1:136] 0 5 10 15 20 25 30 35 40 45 ...
##   ..$ age_end  : num [1:136] 5 10 15 20 25 30 35 40 45 50 ...
##   ..$ value    : num [1:136] 3097000 2660000 2229000 1817000 1507000 ...
##   ..- attr(*, ".internal.selfref")=<externalptr>

The population data.table has 136 rows and 5 columns.

dim(demCore::thailand_data)
## NULL

This is the population of five year age classes by sex for years spanning 1970 to 2000. Table 1 shows the first few rows.

Table 1: CCMPP Thailand data

demCore::thailand_data$population %>%
    head() %>%
    kable() %>%
    kable_styling(
        bootstrap_options =
            c("striped", "hover", "condensed", "responsive"),
        font_size = 12,
        full_width = F, position = "left"
    )
year sex age_start age_end value
1970 female 0 5 3097000
1970 female 5 10 2660000
1970 female 10 15 2229000
1970 female 15 20 1817000
1970 female 20 25 1507000
1970 female 25 30 1249000

There is also an object thailand_initial_estimates. It is a list:

is(thailand_initial_estimates)
## [1] "list"   "vector"

… that is composed of 5 elements. Do the elements have names?

## [1] "srb"           "asfr"          "baseline"      "survival"     
## [5] "net_migration"

What type of objects are they? We can use lapply() and is().

lapply(demCore::thailand_initial_estimates, FUN = function(x) (is(x)))
## $srb
## [1] "data.table" "data.frame" "list"       "oldClass"   "vector"    
## 
## $asfr
## [1] "data.table" "data.frame" "list"       "oldClass"   "vector"    
## 
## $baseline
## [1] "data.table" "data.frame" "list"       "oldClass"   "vector"    
## 
## $survival
## [1] "data.table" "data.frame" "list"       "oldClass"   "vector"    
## 
## $net_migration
## [1] "data.table" "data.frame" "list"       "oldClass"   "vector"

This shows us each of the elements is a data.table. Let’s perusing the structure using str():

## List of 5
##  $ srb          :Classes 'data.table' and 'data.frame':  8 obs. of  3 variables:
##   ..$ year_start: num [1:8] 1960 1965 1970 1975 1980 ...
##   ..$ year_end  : num [1:8] 1965 1970 1975 1980 1985 ...
##   ..$ value     : num [1:8] 1.11 1.08 1.07 1.05 1.04 ...
##   ..- attr(*, ".internal.selfref")=<externalptr> 
##   ..- attr(*, "sorted")= chr [1:2] "year_start" "year_end"
##  $ asfr         :Classes 'data.table' and 'data.frame':  56 obs. of  5 variables:
##   ..$ year_start: num [1:56] 1960 1960 1960 1960 1960 ...
##   ..$ year_end  : num [1:56] 1965 1965 1965 1965 1965 ...
##   ..$ age_start : num [1:56] 15 20 25 30 35 40 45 15 20 25 ...
##   ..$ age_end   : num [1:56] 20 25 30 35 40 45 50 20 25 30 ...
##   ..$ value     : num [1:56] 0.0518 0.2402 0.2888 0.2721 0.2122 ...
##   ..- attr(*, ".internal.selfref")=<externalptr> 
##   ..- attr(*, "sorted")= chr [1:4] "year_start" "year_end" "age_start" "age_end"
##  $ baseline     :Classes 'data.table' and 'data.frame':  34 obs. of  5 variables:
##   ..$ year     : int [1:34] 1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
##   ..$ sex      : chr [1:34] "female" "female" "female" "female" ...
##   ..$ age_start: num [1:34] 0 5 10 15 20 25 30 35 40 45 ...
##   ..$ age_end  : num [1:34] 5 10 15 20 25 30 35 40 45 50 ...
##   ..$ value    : num [1:34] 2338000 1857000 1543000 1287000 1250000 ...
##   ..- attr(*, ".internal.selfref")=<externalptr> 
##  $ survival     :Classes 'data.table' and 'data.frame':  288 obs. of  6 variables:
##   ..$ year_start: num [1:288] 1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
##   ..$ year_end  : num [1:288] 1965 1965 1965 1965 1965 ...
##   ..$ sex       : chr [1:288] "female" "female" "female" "female" ...
##   ..$ age_start : num [1:288] 0 5 10 15 20 25 30 35 40 45 ...
##   ..$ age_end   : num [1:288] 5 10 15 20 25 30 35 40 45 50 ...
##   ..$ value     : num [1:288] 0.914 0.961 0.99 0.992 0.988 ...
##   ..- attr(*, ".internal.selfref")=<externalptr> 
##   ..- attr(*, "sorted")= chr [1:5] "year_start" "year_end" "sex" "age_start" ...
##  $ net_migration:Classes 'data.table' and 'data.frame':  272 obs. of  6 variables:
##   ..$ year_start: num [1:272] 1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
##   ..$ year_end  : num [1:272] 1965 1965 1965 1965 1965 ...
##   ..$ sex       : chr [1:272] "female" "female" "female" "female" ...
##   ..$ age_start : num [1:272] 0 5 10 15 20 25 30 35 40 45 ...
##   ..$ age_end   : num [1:272] 5 10 15 20 25 30 35 40 45 50 ...
##   ..$ value     : num [1:272] 0 0 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, ".internal.selfref")=<externalptr> 
##   ..- attr(*, "sorted")= chr [1:5] "year_start" "year_end" "sex" "age_start" ...

The help entry for thailand_initial_estimates states that it is a list of initial estimates for each ccmpp() input.

According to the vignette,

“[thailand_initial_estimates] is a summary of what the Thailand example inputs look like. The baseline population is for the year 1960 and includes both female and male populations for five year age groups up to 80+. The other four inputs define how the population changes every five calendar years between 1960 and 2000. The survivorship ratios go up to age 85+ while the asfr estimates are defined for the reproductive age groups between age 15 and 50.”

For example we can look at the sex ratio at birth (male / female) for each five-year interval (Table 2).

Table 2: Thailand sex-ratio at birth, 1960-2000

demCore::thailand_initial_estimates$srb %>% 
    kable() %>% 
    kable_styling(
        bootstrap_options =
            c("striped", "hover", "condensed", "responsive"),
        font_size = 12,
        full_width = F, position = "left"
    )    
year_start year_end value
1960 1965 1.110759
1965 1970 1.084629
1970 1975 1.067346
1975 1980 1.053400
1980 1985 1.043086
1985 1990 1.050142
1990 1995 1.055205
1995 2000 1.060781

For this run of CCMPP we will use the suggested settings, with comments in line with the code. See the documentation for ccmpp() for details.

# the settings for this run of ccmpp
thailand_settings <- list(
    # start of each calendar year, i.e., year_start
    years = seq(1960, 1995, 5),
    # use both sexes
    sexes = c("female", "male"),
    # ages being projected
    ages = seq(0, 80, 5),
    # ages for which mortality parameter estimates are available
    ages_mortality = seq(0, 85, 5),
    # assumed female fertility ages
    ages_asfr = seq(15, 45, 5)
)

We now run the ccmpp() function to generate a population projection

thailand_population <- ccmpp(
  inputs = demCore::thailand_initial_estimates,
  settings = thailand_settings
)

In order to make a graph roughly matching the one in the vignette, we add a new column ageclass by concatenating the start and end ages and reordering the factor based on the age_start column.

thailand_population %<>%
    mutate(ageclass = factor(str_c(age_start, age_end, sep = " to ") %>%
        fct_reorder(., age_start)))

Also we can change the maximum age class level to “80 plus” without hard coding in the string “80.”

# max label
maxlev <- length(levels(thailand_population$ageclass))

# current max level
currmaxlev <- levels(thailand_population$ageclass)[maxlev]

# substitute
updatemaxlev <- currmaxlev %>% 
    str_replace(pattern = "to Inf",
                replacement = "plus")

# update
levels(thailand_population$ageclass)[maxlev] <- updatemaxlev

Finally, make a graph plotting population against year, stratified by sex and age class.

ggplot(data = thailand_population,
       mapping = aes(x = year, y = value)) +
    geom_line() +
    # this stratifies the graph by sex and age class
    facet_grid(ageclass ~ sex, scales = "free_y", 
               labeller = labeller(age = thailand_population$age_class)) +
    # do not use a constant Y scale otherwise the older age classes would look flat.
    # and show the Y axis values with commas
    scale_y_continuous(labels = label_number(big.mark = ",")) +
    # change the axis labels
    xlab("Year") +
    ylab("Population")

6.0.2 The Add Health study data

The Add Health web site describes the study:

Initiated in 1994 and supported by five program project grants from the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD) with co-funding from 23 other federal agencies and foundations, Add Health is the largest, most comprehensive, nationally-representative longitudinal survey of adolescents ever undertaken. Beginning with an in-school questionnaire administered to a nationally representative sample of students in grades 7-12, the study followed up with a series of in-home interviews conducted in 1995, 1996, 2001-02, 2008, and 2016-18. Add Health participants are now full-fledged adults, aged 33-44, and will soon be moving into midlife. Over the years, Add Health has added a substantial amount of additional data for users, including contextual data on the communities and states in which participants reside, genomic data and a range of biological health markers of participants, and parental survey data.

The public-use data contain a subset of records and variables from the restricted-use full data set. The full data set requires a lengthy application process and meeting specific security standards. CSDE has a copy of most of the tables in the restricted-use data set on the UW Data Collaborative.

We will be using the Wave 1 public-use Add Health data for most of the remainder of the term. This week we will briefly delve into the documentation and see how the data set use is supported by the documentation.

The Add Health data are very well documented. The PDFs contain the verbatim text of the survey questions, the range of encoded answers, and count tabulations of responses.

We will revisit the documentation later in this lesson.

6.0.2.1 Data sets

The two data sets we will be using are AHwave1_v1.dta.zip and 21600-0001-Data.dta.zip. If you do not already have a folder data in your working directory, create it first, then download each file to that folder.

Note for the code examples to run without modification, the assumption is that the data have been downloaded and unzipped in a sub-folder named data within the current working directory. You can find out what the current working directory is by entering getwd() at the R prompt.

Both of these are Stata files. Stata files version 12 and below can be read with foreign::read.dta(), but version 13 and up require haven::read_dta() or readstata13::read.dta13().

One of the benefits of the Stata file formats, as compared to e.g., CSV or Excel, is that the Stata files can contain metadata about the variables. The imported data frames themselves often have cryptically coded variable names, but more extensive labels. The labels generally contain descriptions of the variables. The R import process can expose those labels, making the data more easy to interpret.

6.0.2.1.1 AHwave1_v1.dta

AHwave1_v1.dta is a Stata version 13 file. Here we will look at the two import options.

6.0.2.1.1.1 haven::read_dta()

haven::read_dta() will read the data in as a data frame. The following code will unzip an existing data/AHwave1_v1.dta.zip file and then read the unzipped file.

# unzip the file
if(file.exists("data/AHwave1_v1.dta.zip") & !file.exists("data/AHwave1_v1.dta")){
    unzip(zipfile = "data/AHwave1_v1.dta.zip", exdir = "data")
}

# read the data
AHwave1_v1_haven <- haven::read_dta(file = "data/AHwave1_v1.dta")

Each labeled variable has attributes can be perused by listing structure. Here we look at the imonth variable:

str(AHwave1_v1_haven$imonth)
##  dbl+lbl [1:6504] 6, 5, 6, 7, 7, 6, 5, 6, 6, 8, 9, 5, 6, 7, 5, 5, 7, 5, 8, ...
##  @ label       : chr "MONTH OF INTERVIEW-W1"
##  @ format.stata: chr "%13.0f"
##  @ labels      : Named num [1:10] 1 4 5 6 7 8 9 10 11 12
##   ..- attr(*, "names")= chr [1:10] "(1) January" "(4) April" "(5) May" "(6) June" ...

Another way to peruse a variable is to view the first few values (e.g., head()), here for the bio_sex variable:

head(AHwave1_v1_haven$bio_sex)
## <labelled<double>[6]>: BIOLOGICAL SEX-W1
## [1] 2 2 1 1 2 1
## 
## Labels:
##  value       label
##      1    (1) Male
##      2  (2) Female
##      6 (6) Refused

One can also list the attributes of the variable itself, which provides a more complete and verbose listing of the variable label, data format, class, and value labels. Here for the variable h1gi1m:

attributes(AHwave1_v1_haven$h1gi1m)
## $label
## [1] "S1Q1 BIRTH MONTH-W1"
## 
## $format.stata
## [1] "%13.0f"
## 
## $class
## [1] "haven_labelled" "vctrs_vctr"     "double"        
## 
## $labels
##   (1) January  (2) February     (3) March     (4) April       (5) May 
##             1             2             3             4             5 
##      (6) June      (7) July    (8) August (9) September  (10) October 
##             6             7             8             9            10 
## (11) November (12) December  (96) Refused 
##            11            12            96

We can see that the variable has attributes label (a more verbose label for the variable itself), format.stata (data format for the column), class (showing the column is a haven_labelled vector stored in double precision format), and labels showing the translation of variable values and more descriptive labels.

In order to access the metadata as a single object, one can use the lapply() function, because the data frame can also be treated as a list. Here, each variable has its name, label, data format, and labels extracted to a single data frame and presented as a DT::datatable. This provides the metadata in a format that is probably easier to use than the PDF documentation.

AHwave1_v1_haven_metadata <- bind_cols(
    # variable name
    varname = colnames(AHwave1_v1_haven),
    # label
    varlabel = lapply(AHwave1_v1_haven, function(x) attributes(x)$label) %>% 
        unlist(),
    # format
    varformat = lapply(AHwave1_v1_haven, function(x) attributes(x)$format.stata) %>%
        unlist(),
    # values
    varvalues = lapply(AHwave1_v1_haven, function(x) attributes(x)$labels) %>% 
        # names the variable label vector
        lapply(., function(x) names(x)) %>% 
        # as character
        as.character() %>% 
        # remove the c() construction
        str_remove_all("^c\\(|\\)$")
)

DT::datatable(AHwave1_v1_haven_metadata)

Can you think about writing a generic function for pulling metadata from dta files using haven::read_dta?

6.0.2.1.1.2 readstata13::read.dta13()

readstata13::read.dta13() will similarly read the data as a data frame. There are a number of different options for converting factors, so if you are dealing with a lot of Stata files you may want to become familiar with some of these options.

For example, using all default options kicks out some warnings about double precision coding and missing factor labels. [Note: to hide these warnings, for better or worse (i.e., you might not want to hide them), use the code chunk option warning=FALSE].

# read the data
AHwave1_v1_rs13 <- readstata13::read.dta13(file = "data/AHwave1_v1.dta")
## Warning in readstata13::read.dta13(file = "data/AHwave1_v1.dta"): 
##    Factor codes of type double or float detected in variables
## 
##    h1hr7a, h1hr7b
## 
##    No labels have been assigned.
##    Set option 'nonint.factors = TRUE' to assign labels anyway.
## Warning in readstata13::read.dta13(file = "data/AHwave1_v1.dta"): 
##    Missing factor labels for variables
## 
##    h1hr8e
## 
##    No labels have beend assigned.
##    Set option 'generate.factors=TRUE' to generate labels.
# read the data
AHwave1_v1_rs13 <- readstata13::read.dta13(file = "data/AHwave1_v1.dta", generate.factors = TRUE, nonint.factors = TRUE)

Metadata can be generated similarly by binding columns representing variable names, labels, and formats:

AHwave1_v1_rs13_metadata <- bind_cols(
    varname = colnames(AHwave1_v1_rs13),
    varlabel = attributes(AHwave1_v1_rs13)$var.labels,
    varformat = attributes(AHwave1_v1_rs13)$formats
)

# value ranges; need to do this separately because those variables with no value labels were not accounted for
varvalues <- bind_cols(
    varname = names(attributes(AHwave1_v1_rs13)$label.table) %>% tolower,
    vals = attributes(AHwave1_v1_rs13)$label.table %>% 
    lapply(., function(x) names(x)) %>% 
    as.character() %>% 
    str_remove_all("^c\\(|\\)$"))

# join
AHwave1_v1_rs13_metadata %<>% 
    left_join(varvalues, by = "varname")

DT::datatable(AHwave1_v1_rs13_metadata)

The default conversion creates factors, so the tables may be easy to read/interpret …

head(x = AHwave1_v1_rs13$imonth, n = 6)
## [1] (6) June (5) May  (6) June (7) July (7) July (6) June
## 10 Levels: (1) January (4) April (5) May (6) June (7) July ... (12) December

… but programming with this data frame will require more work because the factors would need to be explicitly named, e.g. in the filter(),

AHwave1_v1_rs13 %>% 
    head(10) %>% 
    dplyr::filter(imonth == "(6) June") %>% 
    select(aid, imonth, iday)
##        aid   imonth iday
## 1 57100270 (6) June   23
## 2 57103171 (6) June   27
## 3 57104649 (6) June   12
## 4 57109625 (6) June    7
## 5 57110897 (6) June   27

Because the factors are really just labelled numbers, one could use the numeric values, but care needs to be taken:

levels(AHwave1_v1_rs13$imonth) %>% t() %>% t()
##       [,1]           
##  [1,] "(1) January"  
##  [2,] "(4) April"    
##  [3,] "(5) May"      
##  [4,] "(6) June"     
##  [5,] "(7) July"     
##  [6,] "(8) August"   
##  [7,] "(9) September"
##  [8,] "(10) October" 
##  [9,] "(11) November"
## [10,] "(12) December"

Because not all months are represented in the data, the numerical value of the month may not be what you expect. For example, the 4th month factor level/label is “(6) June” rather than “(4) April.”

Compare this with the results from haven::read_dta(). The values are not factors, but labelled double-precision numbers where the integer value is the month:

head(AHwave1_v1_haven$imonth)
## <labelled<double>[6]>: MONTH OF INTERVIEW-W1
## [1] 6 5 6 7 7 6
## 
## Labels:
##  value         label
##      1   (1) January
##      4     (4) April
##      5       (5) May
##      6      (6) June
##      7      (7) July
##      8    (8) August
##      9 (9) September
##     10  (10) October
##     11 (11) November
##     12 (12) December

To perform a similar filter() & select() operation with the haven version is much more straightforward:

AHwave1_v1_haven %>% 
    head(10) %>% 
    filter(imonth == 6) %>% 
    select(aid, imonth, iday)
## # A tibble: 5 x 3
##   aid            imonth  iday
##   <chr>       <dbl+lbl> <dbl>
## 1 57100270 6 [(6) June]    23
## 2 57103171 6 [(6) June]    27
## 3 57104649 6 [(6) June]    12
## 4 57109625 6 [(6) June]     7
## 5 57110897 6 [(6) June]    27

Which approach do you prefer?

6.0.2.1.2 21600-0001-Data.dta

21600-0001-Data.dta is a much larger data set. It has the same count of records, but more columns than AHwave1_v1.dta.

# unzip and read in the larger data set
if(file.exists("data/21600-0001-Data.dta.zip") & !file.exists("data/21600-0001-Data.dta")){
    unzip(zipfile = "data/21600-0001-Data.dta.zip", exdir = "data")
}

# because this is a big file we might want to check if it has been read
if(!exists("data_21600_0001")){
    data_21600_0001 <- haven::read_dta(file = "data/21600-0001-Data.dta")
}

# dimensions of the two
dim(AHwave1_v1_haven)
## [1] 6504  103
dim(data_21600_0001)
## [1] 6504 2794

In fact, AHwave1_v1.dta seems to be a subset of 21600-0001-Data.dta. We can show this by lower-casing the column names and then using a select() for the same named columns. The identical() function can be used to determine if the contents of two objects are exactly the same.

# lowercase the column names
colnames(data_21600_0001) %<>% str_to_lower()

# select() some columns of the same name
dat <- data_21600_0001 %>% 
    select(colnames(AHwave1_v1_haven))

# identical?
identical(dat, AHwave1_v1_haven)
## [1] TRUE
# if(file.exists("data/AHwave1_v1.dta")){
#     file.remove("data/AHwave1_v1.dta")
# }

We can build a similar table of metadata, but this time as a function:

# a generic(?) function to generate metadata for a Stata file read by haven::read_dta()
# x is a data frame from haven::read_dta
f_haven_stata_metadata <- function(x){
    # variable names
    varname <- colnames(x)
    # labels
    varlabel <- x %>% 
        lapply(., function(x) attributes(x)$label) %>% 
        unlist()
    # format
    varformat <- x %>% 
        lapply(., function(x) attributes(x)$format.stata) %>%
        unlist()
    # values
    varvalues <- x %>% 
        lapply(., function(x) attributes(x)$labels) %>% 
        # names the variable label vector
        lapply(., function(x) names(x)) %>% 
        # as character
        as.character() %>% 
        # remove the c() construction
        str_remove_all("^c\\(|\\)$")  
    
    bind_cols(varname = varname, 
              varlabel = varlabel, 
              varformat = varformat,
              varvalues = varvalues)
}

# generate the metadata
data_21600_0001_metadata <- f_haven_stata_metadata(data_21600_0001)

# print the metadata table as a DT::datatable
DT::datatable(data_21600_0001_metadata)

6.0.2.2 Searching through documentation

Good data sets have good documentation. Sometimes the documentation is voluminous, as is the case for the Add Health data. With voluminous metadata, are there good approaches to finding what you are interested in without opening each PDF file, reading the table of contents, searching for string matches, etc.?

This section will cover two tools to make searching through PDF files less onerous and more efficient. The two utilities are pdfgrep and pdftools::pdf_text()

6.0.2.2.1 pdfgrep

grep is a string-matching utility developed mainly for UNIX, now available for all common operating systems. It is also implemented in base R with the function grep(). The name comes from the ed text editor command g/re/p (\(\underline{g}\)lobally search for a \(\underline{r}\)egular \(\underline{e}\)xpression and \(\underline{p}\)rint matching lines), typically used to print the line number of a text file containing the search pattern.

If you are not familiar with regular expressions and you plan on doing computational social sciences, the sooner you learn, the better. See the R help topic for base::regex().

We won’t be covering grep in general here or regular expressions in detail, but will introduce some of the regular expression logic in pdfgrep.

Start by installing a version of pdfgrep. These are three implementations, one for Mac and two for Windows. The demo today will use the Cygwin version. The native Windows version is not as powerful/customizable as the Cygwin version and will also likely be more comparable with the Mac version. Note: pdfgrep will only work on PDF files that contain text. PDFs that are composed solely from scanned images contain no text and are therefore not searchable using regular expressions.

pdfgrep for native Windows

pdfgrep for Cygwin under Windows; Cygwin is highly recommended for Windows users–it creates a UNIX-like environment that offers many data processing tools.

pdfgrep for Mac

If you are on Terminal Server 4 or 5 let’s try using Cygwin. Download cygwin64.zip and unzip it at the root level of your H: drive. You should see something like this:

Double-click the shortcut to mintty.exe to open a Cygwin terminal. If you see a security warning, click Run.

A zipped file containing the metadata files for the public use Add Health data is available as add_health_wave1_metadata.zip. Download the file and unzip it in an appropriate location.

6.0.2.2.1.1 A few use examples

[Note: the images below may be hard to read; clicking them will open them in full-size; click with the middle mouse button to open in a new tab.]

Here we will search through the entire set of PDF files for the regular expression black. First, in your Cygwin terminal (or your regular terminal if you are on a Mac), let’s list the PDF files using

cd /H/csde502_winter_2022/Wave1_InHome_Codebooks ls *.pdf

The syntax for the search is

pdfgrep black *.pdf

This appears to run very slowly on TS4 and TS5 … run the command and we will come back to the terminal in a few minutes.

This prints the file name and the matching text of each PDF containing the regular expression black. Suppose we wanted to search for black or Black or BLACK? Use the flag -i which is short for case \(\underline{i}\)nsensitive.

pdfgrep -i black *.pdf

This shows all files that contain black in any case combination.

We might want to know where to look (i.e., page number) in the file. Using the -n flag prints the page \(\underline{n}\)umber.

Seeing that there are several matches in inh25pub.pdf, the first match on page 13, let’s view that:

Let’s look for another pattern, the word recreation in any case combination and also printing page numbers:

pdfgrep -n -i recreation *.pdf

Seeing the matches, we might want to narrow the search to include only “social or recreation”al activity. Here, the regular expression is social.*recreation, where the .* regexp translates to “any number of any characters.”

We will perform one more pattern match. Suppose we were interested in whether subjects were tired or stressed. The regexp pattern "pat1|pat2" (note the quotes and the vertical bar). The quotes indicate to the shell that this is not a pipe, but part of the regexp. The vertical bar functions as an “or” as it does in logical criteria in R syntax.

Here the full expression to show those metadata files that match either of these patterns.

pdfgrep -i "tired|stress" *.pdf

Why, you may ask, if we created those fancy metadata tables from the data, would we want to search for specific strings in the PDF documentation? Because the full documentation is likely to provide more complete explanations, whereas the metadata created from the data labels is only a brief description.

6.0.2.2.2 pdftools::pdf_text()

Staying completely within R, we can perform similar searches through PDF files. We start with pdftools::pdf_text(), which converts PDFs to text vectors, where each page is converted to one vector element. This can be piped through text-matching functions, such as base::grep() or stringr::str_match() (stringr is loaded by tidyverse).

Unlike pdfgrep, which can serially search through a set of files in a directory, pdf_text() requires additional work because the function processes one file. Here is an example that mimics searching for the case-insensitive regular expression black in the set of PDFs.

We create a function that searches through a single PDF and then loop the function over the set of PDFs in a specified folder, returning the file name list, the pattern we searched on, the page number with the match, and whether the search was case sensitive or not.

# a function to get matching strings in a PDF, ignore case
f_pdf_str_match <- function(x, pat, ignore.case = TRUE){
    # convert the PDF to text
    mytext <- pdf_text(x)
    # pattern
    if(ignore.case){
        mypat <- regex(pat, ignore_case = TRUE)
    } else {
        mypat <- pat
    }
    # match strings = pages
    pages <- str_which(string = mytext, pattern = mypat)
    if(length(pages) == 0){
        return(data.frame(fname = x, pat, page_num = as.integer(NA), ignore.case))
    }
    # create a data frame
    data.frame(fname = x, pat, page_num = pages, ignore.case)
}

# a list of my PDFs
mypdfs <- list.files(path = "data/metadata/Wave1_InHome_Codebooks", 
                     pattern = "*.pdf$", 
                     full.names = TRUE,
                     ignore.case = TRUE)

# an empty data frame
x <- NULL

# run each one
for(i in mypdfs){
    x <- rbind(x, f_pdf_str_match(i, "black", ignore.case = TRUE))
}

# ignore NAs
x %>% 
    dplyr::filter(!is.na(page_num))
##                                                fname   pat page_num ignore.case
## 1  data/metadata/Wave1_InHome_Codebooks/INH01PUB.PDF black        7        TRUE
## 2  data/metadata/Wave1_InHome_Codebooks/INH01PUB.PDF black        9        TRUE
## 3  data/metadata/Wave1_InHome_Codebooks/INH25PUB.PDF black       13        TRUE
## 4  data/metadata/Wave1_InHome_Codebooks/INH25PUB.PDF black       56        TRUE
## 5  data/metadata/Wave1_InHome_Codebooks/INH25PUB.PDF black       96        TRUE
## 6  data/metadata/Wave1_InHome_Codebooks/INH26PUB.PDF black       13        TRUE
## 7  data/metadata/Wave1_InHome_Codebooks/INH26PUB.PDF black       43        TRUE
## 8  data/metadata/Wave1_InHome_Codebooks/INH26PUB.PDF black       69        TRUE
## 9  data/metadata/Wave1_InHome_Codebooks/INH26PUB.PDF black       98        TRUE
## 10 data/metadata/Wave1_InHome_Codebooks/INH26PUB.PDF black      120        TRUE
## 11 data/metadata/Wave1_InHome_Codebooks/INH26PUB.PDF black      143        TRUE
## 12 data/metadata/Wave1_InHome_Codebooks/SECTAPUB.PDF black        3        TRUE
## 13 data/metadata/Wave1_InHome_Codebooks/W1NDXPUB.PDF black        2        TRUE
## 14 data/metadata/Wave1_InHome_Codebooks/W1NDXPUB.PDF black       28        TRUE
## 15 data/metadata/Wave1_InHome_Codebooks/W1NDXPUB.PDF black       31        TRUE
## 16 data/metadata/Wave1_InHome_Codebooks/W1NDXPUB.PDF black       33        TRUE
## 17 data/metadata/Wave1_InHome_Codebooks/W1NDXPUB.PDF black       37        TRUE
## 18 data/metadata/Wave1_InHome_Codebooks/W1NDXPUB.PDF black       39        TRUE
## 19 data/metadata/Wave1_InHome_Codebooks/W1NDXPUB.PDF black       41        TRUE
## 20 data/metadata/Wave1_InHome_Codebooks/W1NDXPUB.PDF black       43        TRUE
## 21 data/metadata/Wave1_InHome_Codebooks/W1NDXPUB.PDF black       46        TRUE
## 22 data/metadata/Wave1_InHome_Codebooks/W1NDXPUB.PDF black       48        TRUE

6.0.3 Conclusion

We will use these data sets and metadata for the next several lessons. The methods presented in today’s lesson should increase efficiency and reduce busy-work.


Rendered at 2022-03-12 10:47:10

6.1 Source code

File is at CSDE-TS1: R:/Project/CSDE502/2022/csde502-winter-2022-main/06-week06.Rmd.

6.1.1 R code used in this document

pacman::p_load(
    magrittr,
    knitr,
    kableExtra,
    readstata13,
    haven,
    pdftools,
    captioner,
    dplyr, 
    magrittr,
    scales,
    tidyverse
)

figure_nums <- captioner(prefix = "Figure")
table_nums <- captioner(prefix = "Table")

# path to this file name
if (!interactive()) {
    fnamepath <- current_input(dir = TRUE)
    fnamestr <- paste0(Sys.getenv("COMPUTERNAME"), ": ", fnamepath)
} else {
    fnamepath <- ""
}
library(demCore)
is(demCore::thailand_data)
str(demCore::thailand_data)
dim(demCore::thailand_data)
demCore::thailand_data$population %>%
    head() %>%
    kable() %>%
    kable_styling(
        bootstrap_options =
            c("striped", "hover", "condensed", "responsive"),
        font_size = 12,
        full_width = F, position = "left"
    )
    
is(thailand_initial_estimates)
names(demCore::thailand_initial_estimates)
lapply(demCore::thailand_initial_estimates, FUN = function(x) (is(x)))
str(demCore::thailand_initial_estimates)
demCore::thailand_initial_estimates$srb %>% 
    kable() %>% 
    kable_styling(
        bootstrap_options =
            c("striped", "hover", "condensed", "responsive"),
        font_size = 12,
        full_width = F, position = "left"
    )    
# the settings for this run of ccmpp
thailand_settings <- list(
    # start of each calendar year, i.e., year_start
    years = seq(1960, 1995, 5),
    # use both sexes
    sexes = c("female", "male"),
    # ages being projected
    ages = seq(0, 80, 5),
    # ages for which mortality parameter estimates are available
    ages_mortality = seq(0, 85, 5),
    # assumed female fertility ages
    ages_asfr = seq(15, 45, 5)
)
thailand_population <- ccmpp(
  inputs = demCore::thailand_initial_estimates,
  settings = thailand_settings
)
thailand_population %<>%
    mutate(ageclass = factor(str_c(age_start, age_end, sep = " to ") %>%
        fct_reorder(., age_start)))
# max label
maxlev <- length(levels(thailand_population$ageclass))

# current max level
currmaxlev <- levels(thailand_population$ageclass)[maxlev]

# substitute
updatemaxlev <- currmaxlev %>% 
    str_replace(pattern = "to Inf",
                replacement = "plus")

# update
levels(thailand_population$ageclass)[maxlev] <- updatemaxlev
ggplot(data = thailand_population,
       mapping = aes(x = year, y = value)) +
    geom_line() +
    # this stratifies the graph by sex and age class
    facet_grid(ageclass ~ sex, scales = "free_y", 
               labeller = labeller(age = thailand_population$age_class)) +
    # do not use a constant Y scale otherwise the older age classes would look flat.
    # and show the Y axis values with commas
    scale_y_continuous(labels = label_number(big.mark = ",")) +
    # change the axis labels
    xlab("Year") +
    ylab("Population")
# unzip the file
if(file.exists("data/AHwave1_v1.dta.zip") & !file.exists("data/AHwave1_v1.dta")){
    unzip(zipfile = "data/AHwave1_v1.dta.zip", exdir = "data")
}

# read the data
AHwave1_v1_haven <- haven::read_dta(file = "data/AHwave1_v1.dta")
str(AHwave1_v1_haven$imonth)
head(AHwave1_v1_haven$bio_sex)
attributes(AHwave1_v1_haven$h1gi1m)
AHwave1_v1_haven_metadata <- bind_cols(
    # variable name
    varname = colnames(AHwave1_v1_haven),
    # label
    varlabel = lapply(AHwave1_v1_haven, function(x) attributes(x)$label) %>% 
        unlist(),
    # format
    varformat = lapply(AHwave1_v1_haven, function(x) attributes(x)$format.stata) %>%
        unlist(),
    # values
    varvalues = lapply(AHwave1_v1_haven, function(x) attributes(x)$labels) %>% 
        # names the variable label vector
        lapply(., function(x) names(x)) %>% 
        # as character
        as.character() %>% 
        # remove the c() construction
        str_remove_all("^c\\(|\\)$")
)

DT::datatable(AHwave1_v1_haven_metadata)
# read the data
AHwave1_v1_rs13 <- readstata13::read.dta13(file = "data/AHwave1_v1.dta")
# read the data
AHwave1_v1_rs13 <- readstata13::read.dta13(file = "data/AHwave1_v1.dta", generate.factors = TRUE, nonint.factors = TRUE)
AHwave1_v1_rs13_metadata <- bind_cols(
    varname = colnames(AHwave1_v1_rs13),
    varlabel = attributes(AHwave1_v1_rs13)$var.labels,
    varformat = attributes(AHwave1_v1_rs13)$formats
)

# value ranges; need to do this separately because those variables with no value labels were not accounted for
varvalues <- bind_cols(
    varname = names(attributes(AHwave1_v1_rs13)$label.table) %>% tolower,
    vals = attributes(AHwave1_v1_rs13)$label.table %>% 
    lapply(., function(x) names(x)) %>% 
    as.character() %>% 
    str_remove_all("^c\\(|\\)$"))

# join
AHwave1_v1_rs13_metadata %<>% 
    left_join(varvalues, by = "varname")

DT::datatable(AHwave1_v1_rs13_metadata)
head(x = AHwave1_v1_rs13$imonth, n = 6)
AHwave1_v1_rs13 %>% 
    head(10) %>% 
    dplyr::filter(imonth == "(6) June") %>% 
    select(aid, imonth, iday)
levels(AHwave1_v1_rs13$imonth) %>% t() %>% t()
head(AHwave1_v1_haven$imonth)
AHwave1_v1_haven %>% 
    head(10) %>% 
    filter(imonth == 6) %>% 
    select(aid, imonth, iday)
# unzip and read in the larger data set
if(file.exists("data/21600-0001-Data.dta.zip") & !file.exists("data/21600-0001-Data.dta")){
    unzip(zipfile = "data/21600-0001-Data.dta.zip", exdir = "data")
}

# because this is a big file we might want to check if it has been read
if(!exists("data_21600_0001")){
    data_21600_0001 <- haven::read_dta(file = "data/21600-0001-Data.dta")
}

# dimensions of the two
dim(AHwave1_v1_haven)
dim(data_21600_0001)
# lowercase the column names
colnames(data_21600_0001) %<>% str_to_lower()

# select() some columns of the same name
dat <- data_21600_0001 %>% 
    select(colnames(AHwave1_v1_haven))

# identical?
identical(dat, AHwave1_v1_haven)
# if(file.exists("data/AHwave1_v1.dta")){
#     file.remove("data/AHwave1_v1.dta")
# }
# a generic(?) function to generate metadata for a Stata file read by haven::read_dta()
# x is a data frame from haven::read_dta
f_haven_stata_metadata <- function(x){
    # variable names
    varname <- colnames(x)
    # labels
    varlabel <- x %>% 
        lapply(., function(x) attributes(x)$label) %>% 
        unlist()
    # format
    varformat <- x %>% 
        lapply(., function(x) attributes(x)$format.stata) %>%
        unlist()
    # values
    varvalues <- x %>% 
        lapply(., function(x) attributes(x)$labels) %>% 
        # names the variable label vector
        lapply(., function(x) names(x)) %>% 
        # as character
        as.character() %>% 
        # remove the c() construction
        str_remove_all("^c\\(|\\)$")  
    
    bind_cols(varname = varname, 
              varlabel = varlabel, 
              varformat = varformat,
              varvalues = varvalues)
}

# generate the metadata
data_21600_0001_metadata <- f_haven_stata_metadata(data_21600_0001)

# print the metadata table as a DT::datatable
DT::datatable(data_21600_0001_metadata)

# a function to get matching strings in a PDF, ignore case
f_pdf_str_match <- function(x, pat, ignore.case = TRUE){
    # convert the PDF to text
    mytext <- pdf_text(x)
    # pattern
    if(ignore.case){
        mypat <- regex(pat, ignore_case = TRUE)
    } else {
        mypat <- pat
    }
    # match strings = pages
    pages <- str_which(string = mytext, pattern = mypat)
    if(length(pages) == 0){
        return(data.frame(fname = x, pat, page_num = as.integer(NA), ignore.case))
    }
    # create a data frame
    data.frame(fname = x, pat, page_num = pages, ignore.case)
}

# a list of my PDFs
mypdfs <- list.files(path = "data/metadata/Wave1_InHome_Codebooks", 
                     pattern = "*.pdf$", 
                     full.names = TRUE,
                     ignore.case = TRUE)

# an empty data frame
x <- NULL

# run each one
for(i in mypdfs){
    x <- rbind(x, f_pdf_str_match(i, "black", ignore.case = TRUE))
}

# ignore NAs
x %>% 
    dplyr::filter(!is.na(page_num))

cat(readLines(fnamepath), sep = '\n')

6.1.2 Complete Rmd code

cat(readLines(fnamepath), sep = '\n')
# Week 6 {#week6}

```{r, echo=FALSE, warning=FALSE, message=FALSE}
pacman::p_load(
    magrittr,
    knitr,
    kableExtra,
    readstata13,
    haven,
    pdftools,
    captioner,
    dplyr, 
    magrittr,
    scales,
    tidyverse
)

figure_nums <- captioner(prefix = "Figure")
table_nums <- captioner(prefix = "Table")

# path to this file name
if (!interactive()) {
    fnamepath <- current_input(dir = TRUE)
    fnamestr <- paste0(Sys.getenv("COMPUTERNAME"), ": ", fnamepath)
} else {
    fnamepath <- ""
}
```
<h2>Topics: Cohort Component Method of Population Projection; Add Health Study data (exploring variables and data documentation)</h2>

This week we will be covering two topics:

* [Cohort Component Method of Population Projection (CCMPP)](#ccmpp)
* [Add Health Study data: exploring variables and data documentation](#addhealthmetadata)

Download [template.Rmd.txt](files/template.Rmd.txt) as a template for this lesson. Save in in your working folder for the course, renamed to `week_06.Rmd` (with no `.txt` extension). Make any necessary changes to the YAML header.

### Cohort Component Method of Population Projection (CCMPP) {#ccmpp}
The `ccmpp` package implements the cohort component method of population projection (CCMPP) described in this section. We will follow the first half of the excellent [CCMPP vignette](https://ihmeuw-demographics.github.io/demCore/articles/ccmpp.html) for this package, which gives a worked example of projection Thailand population from 1960 through 2000 using a baseline population for the year 1960.

It has been suggested that you can use the `ccmpp` package to check your work in CSDE 533.

Let's start by loading the package. Loading the package also exposes the `thailand_data` and `thailand_initial_estimates` data set.

```{r, message=FALSE, warning=FALSE}
library(demCore)
```

Let's look at the Thailand data. This shows the object is a list:

```{r}
is(demCore::thailand_data)
```

Looking at the structure with `str()`, we see it is composed of only one element named `population`. This is a [`data.table`](https://cran.r-project.org/web/packages/data.table/), which is an enhanced data frame. 

```{r}
str(demCore::thailand_data)
```

The `population` `data.table` has `r nrow(demCore::thailand_data$population)` rows and `r ncol(demCore::thailand_data$population)` columns. 

```{r}
dim(demCore::thailand_data)
```

This is the population of five year age classes by sex for years spanning `r min(thailand_data$population$year)` to `r max(thailand_data$population$year)`. `r table_nums(name = "tldata", display = "cite")` shows the first few rows.

*`r table_nums(name = "tldata", caption = "CCMPP Thailand data")`*

```{r}
demCore::thailand_data$population %>%
    head() %>%
    kable() %>%
    kable_styling(
        bootstrap_options =
            c("striped", "hover", "condensed", "responsive"),
        font_size = 12,
        full_width = F, position = "left"
    )
    
```

There is also an object `thailand_initial_estimates`. It is a list:

```{r}
is(thailand_initial_estimates)
```

... that is composed of `r length(thailand_initial_estimates)` elements. Do the elements have names?

```{r}
names(demCore::thailand_initial_estimates)
```

What type of objects are they? We can use `lapply()` and `is()`.

```{r}
lapply(demCore::thailand_initial_estimates, FUN = function(x) (is(x)))
```

This shows us each of the elements is a `data.table`. Let's perusing the structure using `str()`:

```{r}
str(demCore::thailand_initial_estimates)
```

The help entry for `thailand_initial_estimates` states that it is a list of initial estimates for each `ccmpp()` input. 

According to the vignette,

> "[thailand_initial_estimates] is a summary of what the Thailand example inputs look like. The baseline population is for the year 1960 and includes both female and male populations for five year age groups up to 80+. The other four inputs define how the population changes every five calendar years between 1960 and 2000. The survivorship ratios go up to age 85+ while the asfr estimates are defined for the reproductive age groups between age 15 and 50."

For example we can look at the sex ratio at birth (male / female) for each five-year interval (`r table_nums(name = "tldatasrb", display = "cite")`).

*`r table_nums(name = "tldatasrb", caption = "Thailand sex-ratio at birth, 1960-2000")`*

```{r}
demCore::thailand_initial_estimates$srb %>% 
    kable() %>% 
    kable_styling(
        bootstrap_options =
            c("striped", "hover", "condensed", "responsive"),
        font_size = 12,
        full_width = F, position = "left"
    )    
```

For this run of CCMPP we will use the suggested settings, with comments in line with the code. See the documentation for `ccmpp()` for details.

```{r, fig.width=3, fig.height=15}
# the settings for this run of ccmpp
thailand_settings <- list(
    # start of each calendar year, i.e., year_start
    years = seq(1960, 1995, 5),
    # use both sexes
    sexes = c("female", "male"),
    # ages being projected
    ages = seq(0, 80, 5),
    # ages for which mortality parameter estimates are available
    ages_mortality = seq(0, 85, 5),
    # assumed female fertility ages
    ages_asfr = seq(15, 45, 5)
)
```

We now run the `ccmpp()` function to generate a population projection

```{r, fig.width=3, fig.height=15}
thailand_population <- ccmpp(
  inputs = demCore::thailand_initial_estimates,
  settings = thailand_settings
)
```

In order to make a graph roughly matching the one in the vignette, we add a new column `ageclass` by concatenating the start and end ages and reordering the factor based on the `age_start` column.

```{r, fig.width=3, fig.height=15}
thailand_population %<>%
    mutate(ageclass = factor(str_c(age_start, age_end, sep = " to ") %>%
        fct_reorder(., age_start)))
```

Also we can change the maximum age class level to "80 plus" without hard coding in the string "80".

```{r, fig.width=4, fig.height=15}
# max label
maxlev <- length(levels(thailand_population$ageclass))

# current max level
currmaxlev <- levels(thailand_population$ageclass)[maxlev]

# substitute
updatemaxlev <- currmaxlev %>% 
    str_replace(pattern = "to Inf",
                replacement = "plus")

# update
levels(thailand_population$ageclass)[maxlev] <- updatemaxlev
```

Finally, make a graph plotting population against year, stratified by sex and age class.

```{r, fig.width=5, fig.height=15}
ggplot(data = thailand_population,
       mapping = aes(x = year, y = value)) +
    geom_line() +
    # this stratifies the graph by sex and age class
    facet_grid(ageclass ~ sex, scales = "free_y", 
               labeller = labeller(age = thailand_population$age_class)) +
    # do not use a constant Y scale otherwise the older age classes would look flat.
    # and show the Y axis values with commas
    scale_y_continuous(labels = label_number(big.mark = ",")) +
    # change the axis labels
    xlab("Year") +
    ylab("Population")
```


### The Add Health study data {#addhealthmetadata}
The Add Health [web site](https://data.cpc.unc.edu/projects/2/view) describes the study:

> Initiated in 1994 and supported by five program project grants from the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD) with co-funding from 23 other federal agencies and foundations, Add Health is the largest, most comprehensive, nationally-representative longitudinal survey of adolescents ever undertaken. Beginning with an in-school questionnaire administered to a nationally representative sample of students in grades 7-12, the study followed up with a series of in-home interviews conducted in 1995, 1996, 2001-02, 2008, and 2016-18. Add Health participants are now full-fledged adults, aged 33-44, and will soon be moving into midlife. Over the years, Add Health has added a substantial amount of additional data for users, including contextual data on the communities and states in which participants reside, genomic data and a range of biological health markers of participants, and parental survey data.

The public-use data contain a subset of records and variables from the restricted-use full data set. The full data set requires a lengthy application process and meeting specific security standards. [CSDE](https://csde.washington.edu/) has a copy of most of the tables in the restricted-use data set on the [UW Data Collaborative](https://dcollab.uw.edu/data/add-health/).

We will be using the Wave 1 public-use Add Health data for most of the remainder of the term. This week we will briefly delve into the documentation and see how the data set use is supported by the documentation.

The Add Health data are very well documented. The PDFs contain the verbatim text of the survey questions, the range of encoded answers, and count tabulations of responses.

We will revisit the documentation later in this lesson.

#### Data sets
The two data sets we will be using are [AHwave1_v1.dta.zip](data/AHwave1_v1.dta.zip) and [21600-0001-Data.dta.zip](data/21600-0001-Data.dta.zip). If you do not already have a folder `data` in your working directory, create it first, then download each file to that folder.

:::{.rmdnote}
Note for the code examples to run without modification, the assumption is that the data have been downloaded and unzipped in a sub-folder named `data` within the current working directory. You can find out what the current working directory is by entering `getwd()` at the R prompt.
:::

Both of these are Stata files. Stata files version 12 and below can be read with `foreign::read.dta()`, but version 13 and up require `haven::read_dta()` or `readstata13::read.dta13()`.

One of the benefits of the Stata file formats, as compared to e.g., CSV or Excel, is that the Stata files can contain metadata about the variables. The imported data frames themselves often have cryptically coded variable names, but more extensive ***labels***. The labels generally contain descriptions of the variables. The R import process can expose those labels, making the data more easy to interpret.

##### `AHwave1_v1.dta`
`AHwave1_v1.dta` is a Stata version 13 file. Here we will look at the two import options.

###### `haven::read_dta()`
`haven::read_dta()` will read the data in as a data frame. The following code will unzip an existing `data/AHwave1_v1.dta.zip` file and then read the unzipped file.

```{r}
# unzip the file
if(file.exists("data/AHwave1_v1.dta.zip") & !file.exists("data/AHwave1_v1.dta")){
    unzip(zipfile = "data/AHwave1_v1.dta.zip", exdir = "data")
}

# read the data
AHwave1_v1_haven <- haven::read_dta(file = "data/AHwave1_v1.dta")
```

Each labeled variable has attributes can be perused by listing structure. Here we look at the `imonth` variable:

```{r}
str(AHwave1_v1_haven$imonth)
```

Another way to peruse a variable is to view the first few values (e.g., `head()`), here for the `bio_sex` variable:

```{r}
head(AHwave1_v1_haven$bio_sex)
```

One can also list the attributes of the variable itself, which provides a more complete and verbose listing of the variable label, data format, class, and value labels. Here for the variable `h1gi1m`:

```{r}
attributes(AHwave1_v1_haven$h1gi1m)
```

We can see that the variable has attributes `label` (a more verbose label for the variable itself), `format.stata` (data format for the column), `class` (showing the column is a `haven_labelled` `vector` stored in `double precision` format), and `labels` showing the translation of variable values and more descriptive labels.

In order to access the metadata as a single object, one can use the `lapply()` function, because the data frame can also be treated as a list. Here, each variable has its name, label, data format, and labels extracted to a single data frame and presented as a `DT::datatable`. This provides the metadata in a format that is probably easier to use than the PDF documentation. 

```{r}
AHwave1_v1_haven_metadata <- bind_cols(
    # variable name
    varname = colnames(AHwave1_v1_haven),
    # label
    varlabel = lapply(AHwave1_v1_haven, function(x) attributes(x)$label) %>% 
        unlist(),
    # format
    varformat = lapply(AHwave1_v1_haven, function(x) attributes(x)$format.stata) %>%
        unlist(),
    # values
    varvalues = lapply(AHwave1_v1_haven, function(x) attributes(x)$labels) %>% 
        # names the variable label vector
        lapply(., function(x) names(x)) %>% 
        # as character
        as.character() %>% 
        # remove the c() construction
        str_remove_all("^c\\(|\\)$")
)

DT::datatable(AHwave1_v1_haven_metadata)
```

Can you think about writing a generic function for pulling metadata from `dta` files using `haven::read_dta`?

###### `readstata13::read.dta13()`
`readstata13::read.dta13()` will similarly read the data as a data frame. There are a number of different options for converting factors, so if you are dealing with a lot of Stata files you may want to become familiar with some of these options.

For example, using all default options kicks out some warnings about double precision coding and missing factor labels. _[Note: to hide these warnings, for better or worse (i.e., you might not want to hide them), use the code chunk option `warning=FALSE`]._

```{r}
# read the data
AHwave1_v1_rs13 <- readstata13::read.dta13(file = "data/AHwave1_v1.dta")
```

```{r}
# read the data
AHwave1_v1_rs13 <- readstata13::read.dta13(file = "data/AHwave1_v1.dta", generate.factors = TRUE, nonint.factors = TRUE)
```

Metadata can be generated similarly by binding columns representing variable names, labels, and formats:

```{r}
AHwave1_v1_rs13_metadata <- bind_cols(
    varname = colnames(AHwave1_v1_rs13),
    varlabel = attributes(AHwave1_v1_rs13)$var.labels,
    varformat = attributes(AHwave1_v1_rs13)$formats
)

# value ranges; need to do this separately because those variables with no value labels were not accounted for
varvalues <- bind_cols(
    varname = names(attributes(AHwave1_v1_rs13)$label.table) %>% tolower,
    vals = attributes(AHwave1_v1_rs13)$label.table %>% 
    lapply(., function(x) names(x)) %>% 
    as.character() %>% 
    str_remove_all("^c\\(|\\)$"))

# join
AHwave1_v1_rs13_metadata %<>% 
    left_join(varvalues, by = "varname")

DT::datatable(AHwave1_v1_rs13_metadata)
```

The default conversion creates factors, so the tables _may_ be easy to read/interpret ...

```{r}
head(x = AHwave1_v1_rs13$imonth, n = 6)
```

... but programming with this data frame will require more work because the factors would  need to be explicitly named, e.g. in the `filter()`,

```{r}
AHwave1_v1_rs13 %>% 
    head(10) %>% 
    dplyr::filter(imonth == "(6) June") %>% 
    select(aid, imonth, iday)
```

Because the factors are really just labelled numbers, one could use the numeric values, but care needs to be taken:

```{r}
levels(AHwave1_v1_rs13$imonth) %>% t() %>% t()
```

Because not all months are represented in the data, the numerical value of the month may not be what you expect. For example, the 4th month factor level/label is "(6) June" rather than "(4) April".

Compare this with the results from `haven::read_dta()`. The values are not factors, but labelled double-precision numbers where the integer value is the month:

```{r}
head(AHwave1_v1_haven$imonth)
```

To perform a similar `filter()` & `select()` operation with the `haven` version is much more straightforward:

```{r}
AHwave1_v1_haven %>% 
    head(10) %>% 
    filter(imonth == 6) %>% 
    select(aid, imonth, iday)
```

**Which approach do you prefer?**

##### `21600-0001-Data.dta`
`21600-0001-Data.dta` is a much larger data set. It has the same count of records, but more columns than `AHwave1_v1.dta`.

```{r}
# unzip and read in the larger data set
if(file.exists("data/21600-0001-Data.dta.zip") & !file.exists("data/21600-0001-Data.dta")){
    unzip(zipfile = "data/21600-0001-Data.dta.zip", exdir = "data")
}

# because this is a big file we might want to check if it has been read
if(!exists("data_21600_0001")){
    data_21600_0001 <- haven::read_dta(file = "data/21600-0001-Data.dta")
}

# dimensions of the two
dim(AHwave1_v1_haven)
dim(data_21600_0001)
```

In fact, `AHwave1_v1.dta` seems to be a subset of `21600-0001-Data.dta`. We can show this by lower-casing the column names and then using a `select()` for the same named columns. The `identical()` function can be used to determine if the contents of two objects are exactly the same.

```{r}
# lowercase the column names
colnames(data_21600_0001) %<>% str_to_lower()

# select() some columns of the same name
dat <- data_21600_0001 %>% 
    select(colnames(AHwave1_v1_haven))

# identical?
identical(dat, AHwave1_v1_haven)
# if(file.exists("data/AHwave1_v1.dta")){
#     file.remove("data/AHwave1_v1.dta")
# }
```

We can build a similar table of metadata, but this time as a function:

```{r}
# a generic(?) function to generate metadata for a Stata file read by haven::read_dta()
# x is a data frame from haven::read_dta
f_haven_stata_metadata <- function(x){
    # variable names
    varname <- colnames(x)
    # labels
    varlabel <- x %>% 
        lapply(., function(x) attributes(x)$label) %>% 
        unlist()
    # format
    varformat <- x %>% 
        lapply(., function(x) attributes(x)$format.stata) %>%
        unlist()
    # values
    varvalues <- x %>% 
        lapply(., function(x) attributes(x)$labels) %>% 
        # names the variable label vector
        lapply(., function(x) names(x)) %>% 
        # as character
        as.character() %>% 
        # remove the c() construction
        str_remove_all("^c\\(|\\)$")  
    
    bind_cols(varname = varname, 
              varlabel = varlabel, 
              varformat = varformat,
              varvalues = varvalues)
}

# generate the metadata
data_21600_0001_metadata <- f_haven_stata_metadata(data_21600_0001)

# print the metadata table as a DT::datatable
DT::datatable(data_21600_0001_metadata)

```

#### Searching through documentation
Good data sets have good documentation. Sometimes the documentation is voluminous, as is the case for the Add Health data. With voluminous metadata, are there good approaches to finding what you are interested in without opening each PDF file, reading the table of contents, searching for string matches, etc.?

This section will cover two tools to make searching through PDF files less onerous and more efficient. The two utilities are `pdfgrep` and `pdftools::pdf_text()`

##### `pdfgrep`
`grep` is a string-matching utility developed mainly for UNIX, now available for all common operating systems. It is also implemented in base R with the function `grep()`. The name comes from the [ed text editor](https://en.wikipedia.org/wiki/Ed_(text_editor)) command g/re/p ($\underline{g}$lobally search for a $\underline{r}$egular $\underline{e}$xpression and $\underline{p}$rint matching lines), typically used to print the line number of a text file containing the search pattern. 

If you are not familiar with regular expressions and you plan on doing computational social sciences, the sooner you learn, the better. See the R help topic for `base::regex()`.

We won't be covering `grep` in general here or regular expressions in detail, but will introduce some of the regular expression logic in `pdfgrep`. 

Start by installing a version of `pdfgrep`. These are three implementations, one for Mac and two for Windows. The demo today will use the Cygwin version. The native Windows version is not as powerful/customizable as the Cygwin version and will also likely be more comparable with the Mac version. ___Note:___ `pdfgrep` will only work on PDF files that contain text. PDFs that are composed solely from scanned images contain no text and are therefore not searchable using regular expressions.

[pdfgrep for native Windows](https://soft.rubypdf.com/software/pdfgrep-windows-version)

[pdfgrep for Cygwin under Windows](https://cygwin.com/cgi-bin2/package-cat.cgi?file=x86%2Fpdfgrep%2Fpdfgrep-1.4.1-1&grep=pdfgrep); Cygwin is highly recommended for Windows users--it creates a UNIX-like environment that offers many data processing tools.

[pdfgrep for Mac](http://macappstore.org/pdfgrep/)

:::{.rmdnote}
If you are on Terminal Server 4 or 5 let's try using Cygwin. Download [cygwin64.zip](http://gismo.gis.washington.edu/phurvitz/csde502_winter_2022/cygwin64.zip) and unzip it at the root level of your H: drive. You should see something like this:

![](images/week06/2022-02-11_08_39_49-cygwin64.png)
Double-click the shortcut to `mintty.exe` to open a Cygwin terminal. If you see a security warning, click `Run`.
:::

A zipped file containing the metadata files for the public use Add Health data is available as [add_health_wave1_metadata.zip](data/add_health_wave1_metadata.zip). Download the file and unzip it in an appropriate location.

###### A few use examples
_[Note: the images below may be hard to read; clicking them will open them in full-size; click with the middle mouse button to open in a new tab.]_

Here we will search through the entire set of PDF files for the regular expression `black`. First, in your Cygwin terminal (or your regular terminal if you are on a Mac), let's list the PDF files using

`
cd /H/csde502_winter_2022/Wave1_InHome_Codebooks
ls *.pdf
`

[![](images/week06/2021-02-11_21_31_11-_cygdrive_L.png)](images/week06/2021-02-11_21_31_11-_cygdrive_L.png)


The syntax for the search is 

`
pdfgrep black *.pdf
`

[![](images/week06/2021-02-11_21_45_41-_cygdrive_L.png)](images/week06/2021-02-11_21_45_41-_cygdrive_L.png)

*This appears to run very slowly on TS4 and TS5 ... run the command and we will come back to the terminal in a few minutes.*

This prints the file name and the matching text of each PDF containing the regular expression `black`. Suppose we wanted to search for `black` or `Black` or `BLACK`? Use the flag `-i` which is short for case $\underline{i}$nsensitive.

`
pdfgrep -i black *.pdf
`

[![](images/week06/2021-02-11_21_47_06-_cygdrive_L.png)](images/week06/2021-02-11_21_47_06-_cygdrive_L.png)

This shows all files that contain `black` in any case combination. 

We might want to know where to look (i.e., page number) in the file. Using the `-n` flag prints the page $\underline{n}$umber.

[![](images/week06/2021-02-11_21_47_38-_cygdrive_L.png)](images/week06/2021-02-11_21_47_38-_cygdrive_L.png)

Seeing that there are several matches in `inh25pub.pdf`, the first match on page 13, let's view that:

[![](images/week06/2021-02-11_22_16_08-inh25pub.pdf.png)](images/week06/2021-02-11_22_16_08-inh25pub.pdf.png)

Let's look for another pattern, the word `recreation` in any case combination and also printing page numbers:

`
pdfgrep -n -i recreation *.pdf
`

[![](images/week06/2021-02-11_21_48_14-_cygdrive_L.png)](images/week06/2021-02-11_21_48_14-_cygdrive_L.png)

Seeing the matches, we might want to narrow the search to include only "social or recreation"al activity. Here, the regular expression is `social.*recreation`, where the `.*` regexp translates to "any number of any characters."


[![](images/week06/2021-02-11_21_48_34-_cygdrive_L.png)](images/week06/2021-02-11_21_48_34-_cygdrive_L.png)


We will perform one more pattern match. Suppose we were interested in whether subjects were  tired or stressed. The regexp pattern `"pat1|pat2"` (note the quotes and the vertical bar). The quotes indicate to the shell that [this is not a pipe](#magrittr), but part of the regexp. The vertical bar functions as an "or" as it does in logical criteria in R syntax.

Here the full expression to show those metadata files that match either of these patterns.

`
pdfgrep -i "tired|stress" *.pdf
`

[![](images/week06/2021-02-11_21_54_04-_cygdrive_L.png)](images/week06/2021-02-11_21_54_04-_cygdrive_L.png)

___Why___, you may ask, if we created those fancy metadata tables from the data, would we want to search for specific strings in the PDF documentation? Because the full documentation is likely to provide more complete explanations, whereas the metadata created from the data labels is only a brief description.

##### `pdftools::pdf_text()`
Staying completely within R, we can perform similar searches through PDF files. We start with `pdftools::pdf_text()`, which converts PDFs to text vectors, where each page is converted to one vector element. This can be piped through text-matching functions, such as `base::grep()` or `stringr::str_match()` (`stringr` is loaded by `tidyverse`).

Unlike `pdfgrep`, which can serially search through a set of files in a directory, `pdf_text()` requires additional work because the function processes one file. Here is an example that mimics searching for the case-insensitive regular expression `black` in the set of PDFs.

We create a function that searches through a single PDF and then loop the function over the set of PDFs in a specified folder, returning the file name list, the pattern we searched on, the page number with the match, and whether the search was case sensitive or not.

```{r}
# a function to get matching strings in a PDF, ignore case
f_pdf_str_match <- function(x, pat, ignore.case = TRUE){
    # convert the PDF to text
    mytext <- pdf_text(x)
    # pattern
    if(ignore.case){
        mypat <- regex(pat, ignore_case = TRUE)
    } else {
        mypat <- pat
    }
    # match strings = pages
    pages <- str_which(string = mytext, pattern = mypat)
    if(length(pages) == 0){
        return(data.frame(fname = x, pat, page_num = as.integer(NA), ignore.case))
    }
    # create a data frame
    data.frame(fname = x, pat, page_num = pages, ignore.case)
}

# a list of my PDFs
mypdfs <- list.files(path = "data/metadata/Wave1_InHome_Codebooks", 
                     pattern = "*.pdf$", 
                     full.names = TRUE,
                     ignore.case = TRUE)

# an empty data frame
x <- NULL

# run each one
for(i in mypdfs){
    x <- rbind(x, f_pdf_str_match(i, "black", ignore.case = TRUE))
}

# ignore NAs
x %>% 
    dplyr::filter(!is.na(page_num))

```

### Conclusion
We will use these data sets and metadata for the next several lessons. The methods presented in today's lesson should increase efficiency and reduce busy-work.

<hr>
Rendered at <tt>`r Sys.time()`</tt>

## Source code
File is at `r fnamestr`.

### R code used in this document
```{r ref.label=knitr::all_labels(), echo=TRUE, eval=FALSE}
```

### Complete Rmd code
```{r comment=''}
cat(readLines(fnamepath), sep = '\n')
```