1 Week 1

Topics:


Today’s lessons will cover getting started with computing at CSDE, and quickly introduce R, RStudio, and R Markdown.

It is assumed that students in this course have a basic working knowledge of using R, including how to create variables with the assignment operator (“<-”), and how to run simple functions(e.g., mean(dat$age)). Often in courses that include using R for statistical analysis, some of the following foundations are not explained fully. This section is not intended to be a comprehensive treatment of R data types and structures, but should provide some background for students who are either relatively new at using R or who have not had a systematic introduction.

The other main topic for today is tidyverse, which refers to a related set of R packages for data management, analysis, and display. See Hadley Wickham’s tidy tools manifesto for the logic behind the suite of tools. For a brief description of the specific R packages, see Tidyverse packages. This is not intended to be a complete introduction to the tidyverse, but should provide sufficient background for data handling to support most of the technical aspects of the rest of the course and CSDE 533.

1.1 Getting started on Terminal Server 4

First, if you are not on campus, make sure you have the Husky OnNet VPN application running and have connected to the UW network. You should see the f5 icon in your task area:

Connect to TS4: csde-ts4.csde.washington.edu

If you are using the Windows Remote Desktop Protocol (RDP) connection, your connection parameters should look like this:

If you are using mRemoteNG, the connection parameters will match this:

Once you are connected you should see a number of icons on the desktop and application shortcuts in the Start area.

Open a Windows Explorer (if you are running RDP in full screen mode you should be able to use the key combination Win-E).

Before doing anything, let’s change some of the annoying default settings of the Windows Explorer. Tap File > Options. In the View tab, make sure that Always show menus is checked and Hide extensions for known file types is unchecked. The latter setting is very important because we want to see the complete file name for all files at all times.

Click Apply to Folders so that these settings become default. Click Yes to the next dialog.

Now let’s make a folder for the files in this course.

Navigate to This PC:

You should see the H: drive. This is is the mapped drive that links to your U Drive, and is the place where all of the data for this course is to be stored. Do not store any data on the C: drive! The C: drive can be wiped without any prior notification.

Be very careful with your files on the U Drive! If you delete files, there is no “undo” functionality. When you are deleting files, you will get a warning that you should take seriously:

Navigate into H: and create a new folder named csde502_winter_2022. Note the use of lowercase letters and underscores rather than spaces. This will be discussed in the section on file systems later in this lesson.

1.2 Introduction to R Markdown in RStudio

1.2.1 Create a project

Now we will use RStudio to create the first R Markdown source file and render it to HTML.

Start RStudio by either dbl-clicking the desktop shortcut or navigating to the alphabetical R section of the Start menu:

A brief aside: install R packages.

To get started, because it usually takes some time to install, open a second RStudio session and at the console, to install tidyverse, the other packages for CSDE 502 and 533, and for this lesson, download the file packages.R.

Open the file in your second RStudio session and in the upper right of the source code pane, click Source > Source.

Now continue on with the lesson in your original RStudio session…..

Create a new project (File > New Project...).

Since we just created the directory to house the project, select Existing Directory.

Navigate to that directory and select Open.

Click Create Project.

You will now have a blank project with only the project file.

1.2.2 Create an R Markdown file from built-in RStudio functionality

Let’s make an R Markdown file (File > New File > R Markdown...).

Do not change any of the metadata … this is just for a quick example.

Click OK and then name the file week_01.Rmd.

1.2.2.1 Render the Rmd file as HTML

At the console prompt, enter R Markdown::render("W and tap the TAB key. This should bring up a list of files that have the character “w” in the file name. Click week_01.Rmd.

The syntax here means “run the render() function from the R Markdown package on the file week_01.Rmd

After a few moments, the process should complete with a message that the output has been created.

If the HTML page does not open automatically, look for week_01.html in the list of files. Click it and select View in Web Browser.

You will now see the bare-bones HTML file.

Compare the output of this file with the source code in week_01.Rmd. Note there are section headers that begin with hash marks, and R code is indicated with the starting characters

```{r}

and the ending characters

```

Next, we will explore some enhancements to the basic R Markdown syntax.

1.2.3 Create an R Markdown file with some enhancements

Download this version of week_01.Rmd and overwrite the version you just created.

If RStudio prints a message that some packages are required but are not installed, click Install.

Change line 3 to include your name and e-mail address as shown.

1.2.3.1 Render and view the enhanced output

Repeat the rendering process (R Markdown::render("Week_01.Rmd"))

The new HTML file has a number of enhancements, including a mailto: hyperlink for your name, a table of contents at the upper left, a table that is easier to read, a Leaflet map, captions and cross-references for the figures and table, an image derived from a PNG file referenced by a URL, the code used to generate various parts of the document that are produced by R code, and the complete source code for the document. A downloadable version of the rendered file: week_01.html.

Including the source code for the document is especially useful for readers of your documents because it lets them see exactly what you did. An entire research chain can be documented in this way, from reading in raw data, performing data cleaning and analysis, and generating results.

1.3 R data types

You may want to download the file week01.Rmd, which contains many of the examples below.

There are six fundamental data types in R:

  1. logical
  2. numeric
  3. integer
  4. complex
  5. character
  6. raw

The most atomic object in R will exist having one of those data types, described below. An atomic object of the data type can have a value, NA which represents an observation with no data (e.g., a missing measurement), or NULL which isn’t really a value at all, but can still have the data type class.

You will encounter other data types, such as Date or POSIXct if you are working with dates or time stamps. These other data types are extensions of the fundamental data types.

To determine what data type an object is, use is(obj), str(obj), or class(obj).

print(is("a"))
## [1] "character"           "vector"              "data.frameRowLabels"
## [4] "SuperClassMethod"
print(str(TRUE))
##  logi TRUE
## NULL
print(class(123.45))
## [1] "numeric"
## [1] "integer"
n <- as.numeric(999999999999999999999)

print(class(n))
## [1] "numeric"

1.3.1 Logical

Use logical values for characteristics that are either TRUE or FALSE. Note that if logical elements can also have an NA value if the observation is missing. In the following examples,

# evaluate as logical, test whether 1 is greater than two
a <- 1 > 2
# create two numerical values, one being NA, representing ages
age_john <- 39
age_jane <- NA

# logical NA from Jane's undefined age
(jo <- age_john > 50)
## [1] FALSE
(ja <- age_jane > 50)
## [1] NA

Logical values are often expressed in binary format as 0 = FALSE and =TRUE`. in R these values are interconvertible. Other software (e.g., Excel, MS Access) may convert logical values to numbers that you do not expect.

(t <- as.logical(1))
## [1] TRUE
(f <- as.logical(0))
## [1] FALSE

1.3.2 Numeric

Numeric values are numbers with range about 2e-308 to 2e+308, depending on the computer you are using. You can see the possible range by entering .Machine at the R console. These can also include decimals. For more information, see Double-precision floating-point format

1.3.3 Integer

Integer values are numerical, but can only take on whole, rather than fractional values, and have a truncated range compared to numeric. For example, see below, if we try to create an integer that is out of range. The object we created is an integer, but because it is out of range, is value is set to NA.

i <- as.integer(999999999999999999999)
## Warning: NAs introduced by coercion to integer range
## [1] "integer"

1.3.4 Complex

The complex type is used in mathematics and you are unlikely to use it in applied social science research unless you get into some heavy statistics. See Complex number for a full treatment.

1.3.5 Character

Character data include the full set of keys on your keyboard that print out a character, typically [A-Z], [a-z], [0-9], punctuation, etc. The full set of ASCII characters is supported, e.g. the accent aigu in Café:

print(class("Café"))
## [1] "character"

Also numbers can function as characters. Be careful in converting between numerical and character versions. For example, see these ZIP codes:

# this is a character
my_zip <- "98115"

# it is not numeric.
my_zip + 2
## Error in my_zip + 2: non-numeric argument to binary operator
# we can convert it to numeric, although it would be silly to do with ZIP codes, which are nominal values
as.numeric(my_zip) + 2
## [1] 98117
# Boston has ZIP codes starting with zeros
boston_zip <- "02134"
as.numeric(boston_zip)
## [1] 2134

1.3.6 Raw

Raw values are used to store raw bytes in hexadecimal format. You are unlikely to use it in applied social science research. For example, the hexadecimal value for the character z is 7a:

## [1] 7a
## [1] "raw"

1.4 R data structures

There are 5 basic data structures in R, as shown in the graphic:

  1. vector
  2. matrix
  3. array
  4. list
  5. data frame

In addition, the factor data type is very important

1.4.1 Vector

A vector is an ordered set of elements of one or more elements of the same data type and are created using the c() constructor function. For example, a single value is a vector:

# create a vector of length 1
a <- 1
is(a)
## [1] "numeric" "vector"

If you try creating a vector with mixed data types, you may get unexpected results; mixing character elements with other type elements will result in character representations, e.g.,

c(1, "a", TRUE, charToRaw("z"))
## [1] "1"    "a"    "TRUE" "7a"

Results will depend on the data type you are mixing, for example because logical values can be expressed numerically, the TRUE and FALSE values are converted to 1 and 0, respectively.

(c(1:3, TRUE, FALSE))
## [1] 1 2 3 1 0

But if a character is added, all elements are converted to characters.

c(1:3, TRUE, FALSE, "awesome!")
## [1] "1"        "2"        "3"        "TRUE"     "FALSE"    "awesome!"

Order is important, i.e.,

1, 2, 3 is not the same as 1, 3, 2

R will maintain the order of elements in vectors unless a process is initiated that changes the order of those elements:

# a vector 
(v <- c(1, 3, 2))
## [1] 1 3 2
(sort(v))
## [1] 1 2 3

You can get some information about vectors, such as length and data type:

# create a random normal 
set.seed(5)
normvec1000 <- rnorm(n = 1000)

length(normvec1000)
## [1] 1000
class(normvec1000)
## [1] "numeric"
class(normvec1000 > 1)
## [1] "logical"

Elements of vectors are specified with their index number (1 .. n):

v <- seq(from = 0, to = 10, by = 2)
v[4]
## [1] 6

1.4.2 Matrix

A matrix is like a vector, in that it an contain only one data type, but it is two-dimensional, having rows and columns. A simple example:

# make a vector 1 to 100
(v <- 1:100)
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
##  [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
##  [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
##  [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
##  [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
##  [91]  91  92  93  94  95  96  97  98  99 100
# load to a matrix
(m1 <- matrix(v, ncol = 10, byrow = TRUE))
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    1    2    3    4    5    6    7    8    9    10
##  [2,]   11   12   13   14   15   16   17   18   19    20
##  [3,]   21   22   23   24   25   26   27   28   29    30
##  [4,]   31   32   33   34   35   36   37   38   39    40
##  [5,]   41   42   43   44   45   46   47   48   49    50
##  [6,]   51   52   53   54   55   56   57   58   59    60
##  [7,]   61   62   63   64   65   66   67   68   69    70
##  [8,]   71   72   73   74   75   76   77   78   79    80
##  [9,]   81   82   83   84   85   86   87   88   89    90
## [10,]   91   92   93   94   95   96   97   98   99   100
# different r, c ordering
(m2 <- matrix(v, ncol = 10, byrow = FALSE))
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    1   11   21   31   41   51   61   71   81    91
##  [2,]    2   12   22   32   42   52   62   72   82    92
##  [3,]    3   13   23   33   43   53   63   73   83    93
##  [4,]    4   14   24   34   44   54   64   74   84    94
##  [5,]    5   15   25   35   45   55   65   75   85    95
##  [6,]    6   16   26   36   46   56   66   76   86    96
##  [7,]    7   17   27   37   47   57   67   77   87    97
##  [8,]    8   18   28   38   48   58   68   78   88    98
##  [9,]    9   19   29   39   49   59   69   79   89    99
## [10,]   10   20   30   40   50   60   70   80   90   100

If you try to force a vector into a matrix whose row \(\times\) col length does not match the length of the vector, the elements will be recycled, which may not be what you want. At least R will give you a warning.

(m3 <- matrix(letters, ncol = 10, nrow = 10))
## Warning in matrix(letters, ncol = 10, nrow = 10): data length [26] is not a sub-
## multiple or multiple of the number of rows [10]
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,] "a"  "k"  "u"  "e"  "o"  "y"  "i"  "s"  "c"  "m"  
##  [2,] "b"  "l"  "v"  "f"  "p"  "z"  "j"  "t"  "d"  "n"  
##  [3,] "c"  "m"  "w"  "g"  "q"  "a"  "k"  "u"  "e"  "o"  
##  [4,] "d"  "n"  "x"  "h"  "r"  "b"  "l"  "v"  "f"  "p"  
##  [5,] "e"  "o"  "y"  "i"  "s"  "c"  "m"  "w"  "g"  "q"  
##  [6,] "f"  "p"  "z"  "j"  "t"  "d"  "n"  "x"  "h"  "r"  
##  [7,] "g"  "q"  "a"  "k"  "u"  "e"  "o"  "y"  "i"  "s"  
##  [8,] "h"  "r"  "b"  "l"  "v"  "f"  "p"  "z"  "j"  "t"  
##  [9,] "i"  "s"  "c"  "m"  "w"  "g"  "q"  "a"  "k"  "u"  
## [10,] "j"  "t"  "d"  "n"  "x"  "h"  "r"  "b"  "l"  "v"

1.4.3 Array

An array is similar to matrix, but it can have more than one dimension. These can be useful for analyzing time series data or other multidimensional data. We will not be using array data in this course, but a simple example of creating and viewing the contents of an array:

# a vector 1 to 27
v <- 1:27

# create an array, 3 x 3 x 3
(a <- array(v, dim = c(3, 3, 3)))
## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]   10   13   16
## [2,]   11   14   17
## [3,]   12   15   18
## 
## , , 3
## 
##      [,1] [,2] [,3]
## [1,]   19   22   25
## [2,]   20   23   26
## [3,]   21   24   27
# array index is r, c, m (row, column, matrix), e.g., row 1 column 2 matrix 3:
(a[1,2,3])
## [1] 22

1.4.4 List

R lists are ordered collections of objects that do not need to be of the same data type. Those objects can be single-value vectors, multiple-value vectors, matrices, data frames, other lists, etc. Because of this, lists are a very flexible data type. But because they can have as little or as much structure as you want, can become difficult to manage and analyze.

Here is an example of a list comprised of single value vectors of different data type. Compare this with the attempt to make a vector comprised of elements of different data type:

(l <- list("a", 1, TRUE))
## [[1]]
## [1] "a"
## 
## [[2]]
## [1] 1
## 
## [[3]]
## [1] TRUE

Let’s modify that list a bit:

(l <- list("a", 
           1:20, 
           as.logical(c(0,1,1,0))))
## [[1]]
## [1] "a"
## 
## [[2]]
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
## 
## [[3]]
## [1] FALSE  TRUE  TRUE FALSE

The top-level indexing for a list is denoted using two sets of square brackets. For example, the first element of our list can be accessed by l[[1]]. For example, the mean of element 2 is obtained by mean(l[[2]]): 10.5.

To perform operations on all elements of a list, use lapply():

# show the data types
(lapply(X = l, FUN = class))
## [[1]]
## [1] "character"
## 
## [[2]]
## [1] "integer"
## 
## [[3]]
## [1] "logical"
# mean, maybe?
(lapply(X = l, FUN = function(x) {mean(x)}))
## Warning in mean.default(x): argument is not numeric or logical: returning NA
## [[1]]
## [1] NA
## 
## [[2]]
## [1] 10.5
## 
## [[3]]
## [1] 0.5

1.4.5 Factor

Factors are similar to vectors, in that they are one-dimensional ordered sets. However, factors also use informational labels. For example, you may have a variable with household income as a text value:

  • “<$10,000”
  • “$10,000-$549,999”
  • “$50,000-$99,999”
  • “$100,000-$200,000”
  • “>$200,000”

As a vector:

(income <- c("<$10,000"
, "$10,000-$49,999"
, "$50,000-$99,999"
, "$100,000-$200,000"
, ">$200,000"))
## [1] "<$10,000"          "$10,000-$49,999"   "$50,000-$99,999"  
## [4] "$100,000-$200,000" ">$200,000"

Because these are characters, they do not sort in proper numeric order:

sort(income)
## [1] "$10,000-$49,999"   "$100,000-$200,000" "$50,000-$99,999"  
## [4] "<$10,000"          ">$200,000"

If these are treated as a factor, the levels can be set for proper ordering:

# create a factor from income and set the levels
(income_factor <- factor(x = income, levels = income))
## [1] <$10,000          $10,000-$49,999   $50,000-$99,999   $100,000-$200,000
## [5] >$200,000        
## 5 Levels: <$10,000 $10,000-$49,999 $50,000-$99,999 ... >$200,000
# sort again
(sort(income_factor))
## [1] <$10,000          $10,000-$49,999   $50,000-$99,999   $100,000-$200,000
## [5] >$200,000        
## 5 Levels: <$10,000 $10,000-$49,999 $50,000-$99,999 ... >$200,000

As a factor, the data can also be used in statistical models and the magnitude of the variable will also be correctly ordered.

1.4.6 Data frame

Other than vectors, data frames are probably the most used data type in R. You can think of data frames as matrices that allow columns with different data type. For example, you might have a data set that represents subject IDs as characters, sex or gender as text, height, weight, and age as numerical values, income as a factor, and smoking status as logical. Because a matrix requires only one data type, it would not be possible to store all of these as a matrix. An example:

# income levels 
inc <- c("<$10,000"
, "$10,000-$49,999"
, "$50,000-$99,999"
, "$100,000-$200,000"
, ">$200,000")

BMI <-  data.frame(
   sid = c("A1001", "A1002", "B1001"),
   gender = c("Male", "Male","Female"), 
   height_cm = c(152, 171.5, 165), 
   weight_kg = c(81, 93, 78),
   age_y = c(42, 38, 26),
   income = factor(c("$50,000-$99,999", "$100,000-$200,000", "<$10,000"), levels = inc)
)
print(BMI)
##     sid gender height_cm weight_kg age_y            income
## 1 A1001   Male     152.0        81    42   $50,000-$99,999
## 2 A1002   Male     171.5        93    38 $100,000-$200,000
## 3 B1001 Female     165.0        78    26          <$10,000

1.5 File systems

Although a full treatment of effective uses of file systems is beyond the scope of this course, a few basic rules are worth covering:

  1. Never use spaces in folder or file names. Ninety-nine and 44/100ths percent of the time, most modern software will have no problems handling file names with spaces. But that 0.56% of the time when software chokes, you may wonder why your processes are failing. If your directly and file names do not have spaces, then you can at least rule that out!

  2. Use lowercase letters in directory and file names. In the olden days (MS-DOS), there was not case sensitivity in file names. UNIX has has always used case sensitive file names. So MyGloriousPhDDissertation.tex and mygloriousphddissertation.tex could actually be different files. Macs, being based on a UNIX kernel, also employ case sensitivity in file names. But Windows? No. Consider the following: there cannot be both foo.txt and FOO.txt in the same directory.

    So if Windows doesn’t care, why should we? Save yourself some keyboarding time and confusion by using only lowercase characters in your file names.

  3. Include dates in your file names. If you expect to have multiple files that are sequential versions of a file in progress, an alternative to using a content management system such as git, particularly for binary files such as Word documents or SAS data files, is to have multiple versions of the files but including the date as part of the file name. If you expect to have multiple versions on the same date, include a lowercase alphabetical character; it is improbable that you would have more than 26 versions of a fine on a single calendar date. If you are paranoid, use a suffix number 0000, 0002 .. 9999. If you have ten thousand versions of the same file on a given date, you are probably doing something that is not right. Now that you are convinced that including dates in file names is a good idea, please use the format yyyy-mm-dd or yyyymmdd. If you do so, your file names will sort in temporal order.

  4. Make use of directories! Although a folder containing 100,000 files can be handled programatically (if file naming conventions are used), it is not possible for a human being to visually scan 100,000 file names. If you have a lot of files for your project, consider creating directories, e.g., - raw_data - processed_data - analysis_results - scripts - manuscript

  5. Agonize over file names. Optimally when you look at your file names, you will be able to know something about the content of the file. We spend a lot of time doing analysis and creating output. Spending an extra minute thinking about good file names is time well spent.

1.6 Data manipulation in the tidyverse

One of the R packages we will use frequently is tidyverse, which is itself a collection of several other packages, each with a specific domain:

  • ggplot2 (graphics)
  • dplyr (data manipulation)
  • tidyr (reformatting data for efficient processing)
  • readr (reading rectangular R x C data)
  • purrr (functional programming, e.g., to replace for() loops)
  • tibble (enhanced data frames)
  • stringr (string, i.e., text manipulation)
  • forcats (handling factor, i.e., categorical variables)

We will touch on some of these during this course, but there will not be a full review or treatment of the tidyverse.

This section will introduce some of the main workhorse functions in tidy data handling.

Installing tidyverse is straightforward but it may take some time to download and install all of the packages. If you have not done so yet, use

install.packages("tidyverse")

For today’s lesson we will be using one of the Add Health public use data sets, AHwave1_v1.dta.

# load pacman if necessary
package.check <- lapply("pacman", FUN = function(x) {
    if (!require(x, character.only = TRUE)) {
        install.packages(x, dependencies = TRUE)
        library(x, character.only = TRUE)
    }
})

# load readstata13 if necessary
pacman::p_load(readstata13)

# read the dta file
dat <- readstata13::read.dta13(file.path(myurl, "data/AHwave1_v1.dta"))

The data set includes variable labels, which make handling the data easier. Here we print the column names and their labels. Wrapping this in a DT::data_table presents a nice interface for showing only a few variables at a time and that allows sorting and searching.

x <- data.frame(colname = names(dat), label = attributes(dat)$var.labels)
DT::datatable(data = x, caption = "Column names and labels in AHwave1_v1.dta.")

1.6.1 magrittr

The R package magrittr allows the use of “pipes.” In UNIX, pipes were used to take the output of one program and to feed as input to another program. For example, the UNIX command cat prints the contents of a text file. This would print the contents of the file 00README.txt:

cat 00README.txt

but with large files, the entire contents would scroll by too fast to read. Using a “pipe,” denoted with the vertical bar character | allowed using the more command to print one screen at a time by tapping the Enter key for each screen full of text:

cat 00README.txt | more

As shown in these two screen captures:

The two main pipe operators we will use in magrittr are %>% and ‘%<>%.’

%>% is the pipe operator, which functions as a UNIX pipe, that is, to take something on the left hand side of the operator and feed it to the right hand side.

%<>% is the assignment pipe operator, which takes something on the left hand side of the operator, feeds it to the right hand side, and replaces the object on the left-hand side.

For a simple example of the pipe, to list only the first 6 lines of a data frame in base R, we use head(), e.g.,

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

using a tidy version of this:

iris %>% head()
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

In the R base version, we first read head, so we know we will be printing the first 6 elements of something, but we don’t know what that “something” is. We have to read ahead to know we are reading the first 6 records of iris. In the tidy version, we start by knowing we are doing something to the data set, after which we know we are printing the first 6 records.

In base R functions, the process is evaluated from the inside out. For example, to get the mean sepal length of the setosa species in iris, we would do this:

mean(iris[iris$Species == 'setosa', "Sepal.Length"])
## [1] 5.006

From the inside out, we read that we are making a subset of iris where Species = “setosa,” we are selecting the column “Sepal.Length,” and taking the mean. However, it requires reading from the inside out. For a large set of nested functions, we would have y <- f(g(h((i(x))))), which would require first creating the innermost function (i()) and then working outward.

In a tidy approach this would be more like y <- x %>% i() %>% h() %>% g() %>% f()because the first function applied to the data setxisi()`. Revisiting the mean sepal length of setosa irises, example, under a tidy approach we would do this:

iris %>% filter(Species == 'setosa') %>% summarise(mean(Sepal.Length))
##   mean(Sepal.Length)
## 1              5.006

Which, read from left to right, translates to “using the iris data frame, make a subset of records where species is setosa, and summarize those records to get the mean value of sepal length.” The tidy version is intended to be easier to write, read, and understand. The command uses the filter() function, which will be described below.

1.6.2 Data subsetting (dplyr)

dplyr is the tidyverse R package used most frequently for data manipulation. Selection of records (i.e., subsetting) is done using logical tests to determine what is in the selected set. First we will look at logical tests and then we will cover subsetting rows and columns from data frames.

1.6.2.0.1 Logical tests

If elements meet a logical test, they will end up in the selected set. If data frame records have values in variables that meet logical criteria, the records will be selected.

Some logical tests are shown below.

1.6.2.0.1.1 ==: equals
# numeric tests
(1 == 2)
## [1] FALSE
(1 == 3 - 2)
## [1] TRUE
# character test (actually a factor)
(dat$imonth %>% head() %>% str_c(collapse = ", "))
## [1] "(6) June, (5) May, (6) June, (7) July, (7) July, (6) June"
((dat$imonth == "(6) June") %>% head())
## [1]  TRUE FALSE  TRUE FALSE FALSE  TRUE
# character test for multiple patterns
(dat$imonth %in% c("(6) June", "(7) July") %>% head())
## [1]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
1.6.2.0.1.2 >, >=, <, <=: numeric comparisons
1 < 2
## [1] TRUE
1 > 2
## [1] FALSE
1 <= -10:10
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
## [13]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
1 >= -10:10
##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
1.6.2.0.1.3 !=: not equals
1 != 2
## [1] TRUE
# those of the first 6 days that are not 14
(dat$iday %>% head())
## [1] 23  5 27 14 14 12
((dat$iday != 14) %>% head())
## [1]  TRUE  TRUE  TRUE FALSE FALSE  TRUE
1.6.2.0.1.4 !: invert, or “not”

Sometimes it is more convenient to negate a single condition rather than enumerating all possible matching conditions.

dat$imonth %>% head(20)
##  [1] (6) June      (5) May       (6) June      (7) July      (7) July     
##  [6] (6) June      (5) May       (6) June      (6) June      (8) August   
## [11] (9) September (5) May       (6) June      (7) July      (5) May      
## [16] (5) May       (7) July      (5) May       (8) August    (7) July     
## 10 Levels: (1) January (4) April (5) May (6) June (7) July ... (12) December
((!dat$imonth %in% c("(6) June", "(7) July")) %>% head(20))
##  [1] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
## [13] FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE

1.6.2.1 Subset rows (filter())

The filter() function creates a subset of records based on a logical test. Logical tests can be combined as “and” statements using the & operator and “or” statements using the | operator. Here we will perform a few filters on a subset of the data.

# first 20 records, fist 10 columns
dat_sub <- dat[1:20, 1:10]
kable(dat_sub, format = "html") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left")
aid imonth iday iyear bio_sex h1gi1m h1gi1y h1gi4 h1gi5a h1gi5b
57100270
  1. June
23
  1. 1995
  1. Female
  1. October
  1. 1977
  1. No
  1. Legitimate skip (not Hispanic)
  1. Legitimate skip (not Hispanic)
57101310
  1. May
5
  1. 1995
  1. Female
  1. November
  1. 1976
  1. No
  1. Legitimate skip (not Hispanic)
  1. Legitimate skip (not Hispanic)
57103171
  1. June
27
  1. 1995
  1. Male
  1. October
  1. 1979
  1. No
  1. Legitimate skip (not Hispanic)
  1. Legitimate skip (not Hispanic)
57103869
  1. July
14
  1. 1995
  1. Male
  1. January
  1. 1977
  1. No
  1. Legitimate skip (not Hispanic)
  1. Legitimate skip (not Hispanic)
57104553
  1. July
14
  1. 1995
  1. Female
  1. June
  1. 1976
  1. No
  1. Legitimate skip (not Hispanic)
  1. Legitimate skip (not Hispanic)
57104649
  1. June
12
  1. 1995
  1. Male
  1. December
  1. 1981
  1. No
  1. Legitimate skip (not Hispanic)
  1. Legitimate skip (not Hispanic)
57104676
  1. May
31
  1. 1995
  1. Male
  1. October
  1. 1983
  1. No
  1. Legitimate skip (not Hispanic)
  1. Legitimate skip (not Hispanic)
57109625
  1. June
7
  1. 1995
  1. Male
  1. March
  1. 1981
  1. No
  1. Legitimate skip (not Hispanic)
  1. Legitimate skip (not Hispanic)
57110897
  1. June
27
  1. 1995
  1. Male
  1. September
  1. 1981
  1. No
  1. Legitimate skip (not Hispanic)
  1. Legitimate skip (not Hispanic)
57111071
  1. August
3
  1. 1995
  1. Male
  1. June
  1. 1981
  1. No
  1. Legitimate skip (not Hispanic)
  1. Legitimate skip (not Hispanic)
57111786
  1. September
7
  1. 1995
  1. Male
  1. September
  1. 1980
  1. No
  1. Legitimate skip (not Hispanic)
  1. Legitimate skip (not Hispanic)
57113943
  1. May
20
  1. 1995
  1. Male
  1. January
  1. 1979
  1. No
  1. Legitimate skip (not Hispanic)
  1. Legitimate skip (not Hispanic)
57116359
  1. June
24
  1. 1995
  1. Male
  1. April
  1. 1980
  1. No
  1. Legitimate skip (not Hispanic)
  1. Legitimate skip (not Hispanic)
57117542
  1. July
11
  1. 1995
  1. Male
  1. September
  1. 1979
  1. No
  1. Legitimate skip (not Hispanic)
  1. Legitimate skip (not Hispanic)
57117997
  1. May
20
  1. 1995
  1. Female
  1. October
  1. 1982
  1. No
  1. Legitimate skip (not Hispanic)
  1. Legitimate skip (not Hispanic)
57118381
  1. May
6
  1. 1995
  1. Female
  1. October
  1. 1982
  1. No
  1. Legitimate skip (not Hispanic)
  1. Legitimate skip (not Hispanic)
57118943
  1. July
19
  1. 1995
  1. Female
  1. April
  1. 1979
  1. No
  1. Legitimate skip (not Hispanic)
  1. Legitimate skip (not Hispanic)
57120005
  1. May
25
  1. 1995
  1. Male
  1. September
  1. 1982
  1. No
  1. Legitimate skip (not Hispanic)
  1. Legitimate skip (not Hispanic)
57120046
  1. August
20
  1. 1995
  1. Male
  1. October
  1. 1976
  1. Yes
  1. Marked
  1. Not marked
57120371
  1. July
20
  1. 1995
  1. Female
  1. August
  1. 1976
  1. No
  1. Legitimate skip (not Hispanic)
  1. Legitimate skip (not Hispanic)

Records from one month:

# from May
(dat_sub %>% filter(imonth == "(5) May"))
##        aid  imonth iday     iyear    bio_sex        h1gi1m    h1gi1y  h1gi4
## 1 57101310 (5) May    5 (95) 1995 (2) Female (11) November (76) 1976 (0) No
## 2 57104676 (5) May   31 (95) 1995   (1) Male  (10) October (83) 1983 (0) No
## 3 57113943 (5) May   20 (95) 1995   (1) Male   (1) January (79) 1979 (0) No
## 4 57117997 (5) May   20 (95) 1995 (2) Female  (10) October (82) 1982 (0) No
## 5 57118381 (5) May    6 (95) 1995 (2) Female  (10) October (82) 1982 (0) No
## 6 57120005 (5) May   25 (95) 1995   (1) Male (9) September (82) 1982 (0) No
##                               h1gi5a                             h1gi5b
## 1 (7) Legitimate skip (not Hispanic) (7) Legitimate skip (not Hispanic)
## 2 (7) Legitimate skip (not Hispanic) (7) Legitimate skip (not Hispanic)
## 3 (7) Legitimate skip (not Hispanic) (7) Legitimate skip (not Hispanic)
## 4 (7) Legitimate skip (not Hispanic) (7) Legitimate skip (not Hispanic)
## 5 (7) Legitimate skip (not Hispanic) (7) Legitimate skip (not Hispanic)
## 6 (7) Legitimate skip (not Hispanic) (7) Legitimate skip (not Hispanic)

Records from one month from females:

(dat_sub %>% filter(imonth == "(5) May" & bio_sex == "(2) Female"))
##        aid  imonth iday     iyear    bio_sex        h1gi1m    h1gi1y  h1gi4
## 1 57101310 (5) May    5 (95) 1995 (2) Female (11) November (76) 1976 (0) No
## 2 57117997 (5) May   20 (95) 1995 (2) Female  (10) October (82) 1982 (0) No
## 3 57118381 (5) May    6 (95) 1995 (2) Female  (10) October (82) 1982 (0) No
##                               h1gi5a                             h1gi5b
## 1 (7) Legitimate skip (not Hispanic) (7) Legitimate skip (not Hispanic)
## 2 (7) Legitimate skip (not Hispanic) (7) Legitimate skip (not Hispanic)
## 3 (7) Legitimate skip (not Hispanic) (7) Legitimate skip (not Hispanic)

Records from one month and from females or where the day of month was before the 15th, which will probably include some males:

(dat_sub %>% filter(imonth == "(5) May" & (bio_sex == "(2) Female") | iday < 15))
##         aid        imonth iday     iyear    bio_sex        h1gi1m    h1gi1y
## 1  57101310       (5) May    5 (95) 1995 (2) Female (11) November (76) 1976
## 2  57103869      (7) July   14 (95) 1995   (1) Male   (1) January (77) 1977
## 3  57104553      (7) July   14 (95) 1995 (2) Female      (6) June (76) 1976
## 4  57104649      (6) June   12 (95) 1995   (1) Male (12) December (81) 1981
## 5  57109625      (6) June    7 (95) 1995   (1) Male     (3) March (81) 1981
## 6  57111071    (8) August    3 (95) 1995   (1) Male      (6) June (81) 1981
## 7  57111786 (9) September    7 (95) 1995   (1) Male (9) September (80) 1980
## 8  57117542      (7) July   11 (95) 1995   (1) Male (9) September (79) 1979
## 9  57117997       (5) May   20 (95) 1995 (2) Female  (10) October (82) 1982
## 10 57118381       (5) May    6 (95) 1995 (2) Female  (10) October (82) 1982
##     h1gi4                             h1gi5a                             h1gi5b
## 1  (0) No (7) Legitimate skip (not Hispanic) (7) Legitimate skip (not Hispanic)
## 2  (0) No (7) Legitimate skip (not Hispanic) (7) Legitimate skip (not Hispanic)
## 3  (0) No (7) Legitimate skip (not Hispanic) (7) Legitimate skip (not Hispanic)
## 4  (0) No (7) Legitimate skip (not Hispanic) (7) Legitimate skip (not Hispanic)
## 5  (0) No (7) Legitimate skip (not Hispanic) (7) Legitimate skip (not Hispanic)
## 6  (0) No (7) Legitimate skip (not Hispanic) (7) Legitimate skip (not Hispanic)
## 7  (0) No (7) Legitimate skip (not Hispanic) (7) Legitimate skip (not Hispanic)
## 8  (0) No (7) Legitimate skip (not Hispanic) (7) Legitimate skip (not Hispanic)
## 9  (0) No (7) Legitimate skip (not Hispanic) (7) Legitimate skip (not Hispanic)
## 10 (0) No (7) Legitimate skip (not Hispanic) (7) Legitimate skip (not Hispanic)

Although these examples are silly and trivial, they show how filter() is used to create a selected set of data

1.6.2.2 Subset columns (select())

A subset of columns can be extracted from data frames using the select() function, most simply using named list of columns to keep.

# select 3 columns
(dat_sub_sel <- dat_sub %>%   
    select("aid", "imonth", "iday"))
##         aid        imonth iday
## 1  57100270      (6) June   23
## 2  57101310       (5) May    5
## 3  57103171      (6) June   27
## 4  57103869      (7) July   14
## 5  57104553      (7) July   14
## 6  57104649      (6) June   12
## 7  57104676       (5) May   31
## 8  57109625      (6) June    7
## 9  57110897      (6) June   27
## 10 57111071    (8) August    3
## 11 57111786 (9) September    7
## 12 57113943       (5) May   20
## 13 57116359      (6) June   24
## 14 57117542      (7) July   11
## 15 57117997       (5) May   20
## 16 57118381       (5) May    6
## 17 57118943      (7) July   19
## 18 57120005       (5) May   25
## 19 57120046    (8) August   20
## 20 57120371      (7) July   20
# select all but two named columns
(dat_sub_sel <- dat_sub %>%   
    select(-"imonth", -"iday"))
##         aid     iyear    bio_sex        h1gi1m    h1gi1y   h1gi4
## 1  57100270 (95) 1995 (2) Female  (10) October (77) 1977  (0) No
## 2  57101310 (95) 1995 (2) Female (11) November (76) 1976  (0) No
## 3  57103171 (95) 1995   (1) Male  (10) October (79) 1979  (0) No
## 4  57103869 (95) 1995   (1) Male   (1) January (77) 1977  (0) No
## 5  57104553 (95) 1995 (2) Female      (6) June (76) 1976  (0) No
## 6  57104649 (95) 1995   (1) Male (12) December (81) 1981  (0) No
## 7  57104676 (95) 1995   (1) Male  (10) October (83) 1983  (0) No
## 8  57109625 (95) 1995   (1) Male     (3) March (81) 1981  (0) No
## 9  57110897 (95) 1995   (1) Male (9) September (81) 1981  (0) No
## 10 57111071 (95) 1995   (1) Male      (6) June (81) 1981  (0) No
## 11 57111786 (95) 1995   (1) Male (9) September (80) 1980  (0) No
## 12 57113943 (95) 1995   (1) Male   (1) January (79) 1979  (0) No
## 13 57116359 (95) 1995   (1) Male     (4) April (80) 1980  (0) No
## 14 57117542 (95) 1995   (1) Male (9) September (79) 1979  (0) No
## 15 57117997 (95) 1995 (2) Female  (10) October (82) 1982  (0) No
## 16 57118381 (95) 1995 (2) Female  (10) October (82) 1982  (0) No
## 17 57118943 (95) 1995 (2) Female     (4) April (79) 1979  (0) No
## 18 57120005 (95) 1995   (1) Male (9) September (82) 1982  (0) No
## 19 57120046 (95) 1995   (1) Male  (10) October (76) 1976 (1) Yes
## 20 57120371 (95) 1995 (2) Female    (8) August (76) 1976  (0) No
##                                h1gi5a                             h1gi5b
## 1  (7) Legitimate skip (not Hispanic) (7) Legitimate skip (not Hispanic)
## 2  (7) Legitimate skip (not Hispanic) (7) Legitimate skip (not Hispanic)
## 3  (7) Legitimate skip (not Hispanic) (7) Legitimate skip (not Hispanic)
## 4  (7) Legitimate skip (not Hispanic) (7) Legitimate skip (not Hispanic)
## 5  (7) Legitimate skip (not Hispanic) (7) Legitimate skip (not Hispanic)
## 6  (7) Legitimate skip (not Hispanic) (7) Legitimate skip (not Hispanic)
## 7  (7) Legitimate skip (not Hispanic) (7) Legitimate skip (not Hispanic)
## 8  (7) Legitimate skip (not Hispanic) (7) Legitimate skip (not Hispanic)
## 9  (7) Legitimate skip (not Hispanic) (7) Legitimate skip (not Hispanic)
## 10 (7) Legitimate skip (not Hispanic) (7) Legitimate skip (not Hispanic)
## 11 (7) Legitimate skip (not Hispanic) (7) Legitimate skip (not Hispanic)
## 12 (7) Legitimate skip (not Hispanic) (7) Legitimate skip (not Hispanic)
## 13 (7) Legitimate skip (not Hispanic) (7) Legitimate skip (not Hispanic)
## 14 (7) Legitimate skip (not Hispanic) (7) Legitimate skip (not Hispanic)
## 15 (7) Legitimate skip (not Hispanic) (7) Legitimate skip (not Hispanic)
## 16 (7) Legitimate skip (not Hispanic) (7) Legitimate skip (not Hispanic)
## 17 (7) Legitimate skip (not Hispanic) (7) Legitimate skip (not Hispanic)
## 18 (7) Legitimate skip (not Hispanic) (7) Legitimate skip (not Hispanic)
## 19                         (1) Marked                     (0) Not marked
## 20 (7) Legitimate skip (not Hispanic) (7) Legitimate skip (not Hispanic)
# select columns by position and whose name matches a pattern, in this case the regular expression "^i" meaning "starts with lowercase i"
(dat_sub_sel <- dat_sub %>%   
    select(1, matches("^i")))
##         aid        imonth iday     iyear
## 1  57100270      (6) June   23 (95) 1995
## 2  57101310       (5) May    5 (95) 1995
## 3  57103171      (6) June   27 (95) 1995
## 4  57103869      (7) July   14 (95) 1995
## 5  57104553      (7) July   14 (95) 1995
## 6  57104649      (6) June   12 (95) 1995
## 7  57104676       (5) May   31 (95) 1995
## 8  57109625      (6) June    7 (95) 1995
## 9  57110897      (6) June   27 (95) 1995
## 10 57111071    (8) August    3 (95) 1995
## 11 57111786 (9) September    7 (95) 1995
## 12 57113943       (5) May   20 (95) 1995
## 13 57116359      (6) June   24 (95) 1995
## 14 57117542      (7) July   11 (95) 1995
## 15 57117997       (5) May   20 (95) 1995
## 16 57118381       (5) May    6 (95) 1995
## 17 57118943      (7) July   19 (95) 1995
## 18 57120005       (5) May   25 (95) 1995
## 19 57120046    (8) August   20 (95) 1995
## 20 57120371      (7) July   20 (95) 1995

select() can also be used to rename columns:

#select one column, rename two columns
(dat_sub_sel %>% 
   select(aid, Month = imonth, Day = iday))
##         aid         Month Day
## 1  57100270      (6) June  23
## 2  57101310       (5) May   5
## 3  57103171      (6) June  27
## 4  57103869      (7) July  14
## 5  57104553      (7) July  14
## 6  57104649      (6) June  12
## 7  57104676       (5) May  31
## 8  57109625      (6) June   7
## 9  57110897      (6) June  27
## 10 57111071    (8) August   3
## 11 57111786 (9) September   7
## 12 57113943       (5) May  20
## 13 57116359      (6) June  24
## 14 57117542      (7) July  11
## 15 57117997       (5) May  20
## 16 57118381       (5) May   6
## 17 57118943      (7) July  19
## 18 57120005       (5) May  25
## 19 57120046    (8) August  20
## 20 57120371      (7) July  20

Or column renaming can be done with rename(), which maintains all input data and only changes the named columns:

(dat_sub_sel %>% 
   rename(Month = imonth, Day = iday))
##         aid         Month Day     iyear
## 1  57100270      (6) June  23 (95) 1995
## 2  57101310       (5) May   5 (95) 1995
## 3  57103171      (6) June  27 (95) 1995
## 4  57103869      (7) July  14 (95) 1995
## 5  57104553      (7) July  14 (95) 1995
## 6  57104649      (6) June  12 (95) 1995
## 7  57104676       (5) May  31 (95) 1995
## 8  57109625      (6) June   7 (95) 1995
## 9  57110897      (6) June  27 (95) 1995
## 10 57111071    (8) August   3 (95) 1995
## 11 57111786 (9) September   7 (95) 1995
## 12 57113943       (5) May  20 (95) 1995
## 13 57116359      (6) June  24 (95) 1995
## 14 57117542      (7) July  11 (95) 1995
## 15 57117997       (5) May  20 (95) 1995
## 16 57118381       (5) May   6 (95) 1995
## 17 57118943      (7) July  19 (95) 1995
## 18 57120005       (5) May  25 (95) 1995
## 19 57120046    (8) August  20 (95) 1995
## 20 57120371      (7) July  20 (95) 1995

1.6.2.3 Subset rows and columns: filter() and select()

We can combine filter() and select() with a pipe to create a new data frame with a subset of rows and columns:

# records with day of month > 15 and the first 3 named columns
(x <- dat_sub %>% 
    filter(iday > 15) %>%
    select(aid, imonth, iday)
   )
##         aid     imonth iday
## 1  57100270   (6) June   23
## 2  57103171   (6) June   27
## 3  57104676    (5) May   31
## 4  57110897   (6) June   27
## 5  57113943    (5) May   20
## 6  57116359   (6) June   24
## 7  57117997    (5) May   20
## 8  57118943   (7) July   19
## 9  57120005    (5) May   25
## 10 57120046 (8) August   20
## 11 57120371   (7) July   20

1.6.2.4 Create or calculate columns: mutate()

mutate() will create new named columns or re-calculate existing columns. Here we will make a column that stratifies birth month, with the cut at June.

Although the birth month column (h1gi1m) is a factor, it is unordered, so we need to make it ordered before using the factor label in a numeric comparison. Fortunately, the factor labels were handled in correct order:

# is this ordered?
is.ordered(dat$h1gi1m)
## [1] FALSE
# what are the levels?
(levels(dat$h1gi1m))
##  [1] "(1) January"   "(2) February"  "(3) March"     "(4) April"    
##  [5] "(5) May"       "(6) June"      "(7) July"      "(8) August"   
##  [9] "(9) September" "(10) October"  "(11) November" "(12) December"
## [13] "(96) Refused"

Assign order, create a new column, and print nicely:

# make birth month ordered
dat$h1gi1m <- factor(dat$h1gi1m, ordered = TRUE)

# now is it ordered?
is.ordered(dat$h1gi1m)
## [1] TRUE
# perform the mutate() using the string representation of the factor for comparison
dat %>% 
    filter(iday > 15) %>%
    select(aid, imonth, iday, birth_month = h1gi1m) %>% 
    mutate(birth_1st_half = (birth_month < "(7) July")) %>% 
    head(20) %>% 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left")
aid imonth iday birth_month birth_1st_half
57100270
  1. June
23
  1. October
FALSE
57103171
  1. June
27
  1. October
FALSE
57104676
  1. May
31
  1. October
FALSE
57110897
  1. June
27
  1. September
FALSE
57113943
  1. May
20
  1. January
TRUE
57116359
  1. June
24
  1. April
TRUE
57117997
  1. May
20
  1. October
FALSE
57118943
  1. July
19
  1. April
TRUE
57120005
  1. May
25
  1. September
FALSE
57120046
  1. August
20
  1. October
FALSE
57120371
  1. July
20
  1. August
FALSE
57121476
  1. May
20
  1. October
FALSE
57123494
  1. July
21
  1. February
TRUE
57129567
  1. July
26
  1. February
TRUE
57130633
  1. August
26
  1. October
FALSE
57131909
  1. April
27
  1. July
FALSE
57133772
  1. July
19
  1. February
TRUE
57134457
  1. July
18
  1. April
TRUE
57136630
  1. May
16
  1. May
TRUE
57139880
  1. June
19
  1. October
FALSE

A silly example but showing that mutate() can change values of existing columns:

(X <- dat_sub %>% 
     mutate(iday = -1000 + iday))
##         aid        imonth iday     iyear    bio_sex        h1gi1m    h1gi1y
## 1  57100270      (6) June -977 (95) 1995 (2) Female  (10) October (77) 1977
## 2  57101310       (5) May -995 (95) 1995 (2) Female (11) November (76) 1976
## 3  57103171      (6) June -973 (95) 1995   (1) Male  (10) October (79) 1979
## 4  57103869      (7) July -986 (95) 1995   (1) Male   (1) January (77) 1977
## 5  57104553      (7) July -986 (95) 1995 (2) Female      (6) June (76) 1976
## 6  57104649      (6) June -988 (95) 1995   (1) Male (12) December (81) 1981
## 7  57104676       (5) May -969 (95) 1995   (1) Male  (10) October (83) 1983
## 8  57109625      (6) June -993 (95) 1995   (1) Male     (3) March (81) 1981
## 9  57110897      (6) June -973 (95) 1995   (1) Male (9) September (81) 1981
## 10 57111071    (8) August -997 (95) 1995   (1) Male      (6) June (81) 1981
## 11 57111786 (9) September -993 (95) 1995   (1) Male (9) September (80) 1980
## 12 57113943       (5) May -980 (95) 1995   (1) Male   (1) January (79) 1979
## 13 57116359      (6) June -976 (95) 1995   (1) Male     (4) April (80) 1980
## 14 57117542      (7) July -989 (95) 1995   (1) Male (9) September (79) 1979
## 15 57117997       (5) May -980 (95) 1995 (2) Female  (10) October (82) 1982
## 16 57118381       (5) May -994 (95) 1995 (2) Female  (10) October (82) 1982
## 17 57118943      (7) July -981 (95) 1995 (2) Female     (4) April (79) 1979
## 18 57120005       (5) May -975 (95) 1995   (1) Male (9) September (82) 1982
## 19 57120046    (8) August -980 (95) 1995   (1) Male  (10) October (76) 1976
## 20 57120371      (7) July -980 (95) 1995 (2) Female    (8) August (76) 1976
##      h1gi4                             h1gi5a
## 1   (0) No (7) Legitimate skip (not Hispanic)
## 2   (0) No (7) Legitimate skip (not Hispanic)
## 3   (0) No (7) Legitimate skip (not Hispanic)
## 4   (0) No (7) Legitimate skip (not Hispanic)
## 5   (0) No (7) Legitimate skip (not Hispanic)
## 6   (0) No (7) Legitimate skip (not Hispanic)
## 7   (0) No (7) Legitimate skip (not Hispanic)
## 8   (0) No (7) Legitimate skip (not Hispanic)
## 9   (0) No (7) Legitimate skip (not Hispanic)
## 10  (0) No (7) Legitimate skip (not Hispanic)
## 11  (0) No (7) Legitimate skip (not Hispanic)
## 12  (0) No (7) Legitimate skip (not Hispanic)
## 13  (0) No (7) Legitimate skip (not Hispanic)
## 14  (0) No (7) Legitimate skip (not Hispanic)
## 15  (0) No (7) Legitimate skip (not Hispanic)
## 16  (0) No (7) Legitimate skip (not Hispanic)
## 17  (0) No (7) Legitimate skip (not Hispanic)
## 18  (0) No (7) Legitimate skip (not Hispanic)
## 19 (1) Yes                         (1) Marked
## 20  (0) No (7) Legitimate skip (not Hispanic)
##                                h1gi5b
## 1  (7) Legitimate skip (not Hispanic)
## 2  (7) Legitimate skip (not Hispanic)
## 3  (7) Legitimate skip (not Hispanic)
## 4  (7) Legitimate skip (not Hispanic)
## 5  (7) Legitimate skip (not Hispanic)
## 6  (7) Legitimate skip (not Hispanic)
## 7  (7) Legitimate skip (not Hispanic)
## 8  (7) Legitimate skip (not Hispanic)
## 9  (7) Legitimate skip (not Hispanic)
## 10 (7) Legitimate skip (not Hispanic)
## 11 (7) Legitimate skip (not Hispanic)
## 12 (7) Legitimate skip (not Hispanic)
## 13 (7) Legitimate skip (not Hispanic)
## 14 (7) Legitimate skip (not Hispanic)
## 15 (7) Legitimate skip (not Hispanic)
## 16 (7) Legitimate skip (not Hispanic)
## 17 (7) Legitimate skip (not Hispanic)
## 18 (7) Legitimate skip (not Hispanic)
## 19                     (0) Not marked
## 20 (7) Legitimate skip (not Hispanic)

… so do be careful!

Other functions can be used with mutate include (but are by no means limited to!)

When we recoded the birth month, the output was a logical data type. If we wanted to create a character or factor, we could use if_else(). Here we are creating a new data frame based on several operations on dat.

dat_1 <- dat %>% 
    filter(iday > 15) %>%
    head(20) %>%
    select(aid, imonth, iday, birth_month = h1gi1m) %>% 
    mutate(birth_year_half = ifelse(test = birth_month < "(7) July", yes = "first", no = "last"))

# make that a factor
dat_1$birth_year_half <- factor(dat_1$birth_year_half, levels = c("first", "last"))
    
# print
kable(dat_1) %>% 
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left")
aid imonth iday birth_month birth_year_half
57100270
  1. June
23
  1. October
last
57103171
  1. June
27
  1. October
last
57104676
  1. May
31
  1. October
last
57110897
  1. June
27
  1. September
last
57113943
  1. May
20
  1. January
first
57116359
  1. June
24
  1. April
first
57117997
  1. May
20
  1. October
last
57118943
  1. July
19
  1. April
first
57120005
  1. May
25
  1. September
last
57120046
  1. August
20
  1. October
last
57120371
  1. July
20
  1. August
last
57121476
  1. May
20
  1. October
last
57123494
  1. July
21
  1. February
first
57129567
  1. July
26
  1. February
first
57130633
  1. August
26
  1. October
last
57131909
  1. April
27
  1. July
last
57133772
  1. July
19
  1. February
first
57134457
  1. July
18
  1. April
first
57136630
  1. May
16
  1. May
first
57139880
  1. June
19
  1. October
last

If one of your variables contains multiple values and you want to create classes, use case_when(). Here is a verbose example stratifying months into quarters. Also we are using the magrittr assignment pipe to update the input based on the statement, i.e., dat_1 will change based on the commands we use. Be careful using the assignment pipe because it will change your data frame.

case_when() will recode in order or the way the command is written, so for months and quarters, it is not necessary to specify both ends of the quarter. Also any cases that are not explicitly handled can be addressed with the TRUE ~ ... argument; in this case, any records that had birth months that were not before September get assigned to quarter 4.

dat_1 %<>% 
    mutate(quarter = case_when(
        birth_month < "(3) March" ~ 1,
        birth_month < "(6) June" ~ 2,
        birth_month < "(9) September" ~ 3,
        TRUE ~ 4))

# print
kable(dat_1) %>% 
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left")
aid imonth iday birth_month birth_year_half quarter
57100270
  1. June
23
  1. October
last 4
57103171
  1. June
27
  1. October
last 4
57104676
  1. May
31
  1. October
last 4
57110897
  1. June
27
  1. September
last 4
57113943
  1. May
20
  1. January
first 1
57116359
  1. June
24
  1. April
first 2
57117997
  1. May
20
  1. October
last 4
57118943
  1. July
19
  1. April
first 2
57120005
  1. May
25
  1. September
last 4
57120046
  1. August
20
  1. October
last 4
57120371
  1. July
20
  1. August
last 3
57121476
  1. May
20
  1. October
last 4
57123494
  1. July
21
  1. February
first 1
57129567
  1. July
26
  1. February
first 1
57130633
  1. August
26
  1. October
last 4
57131909
  1. April
27
  1. July
last 3
57133772
  1. July
19
  1. February
first 1
57134457
  1. July
18
  1. April
first 2
57136630
  1. May
16
  1. May
first 2
57139880
  1. June
19
  1. October
last 4

recode() is used to change the birth_year_half column:

(dat_1 %<>% 
     mutate(birth_year_half_split = recode(birth_year_half,
                   "first" = "early",
                   "last" = "late")))
##         aid     imonth iday   birth_month birth_year_half quarter
## 1  57100270   (6) June   23  (10) October            last       4
## 2  57103171   (6) June   27  (10) October            last       4
## 3  57104676    (5) May   31  (10) October            last       4
## 4  57110897   (6) June   27 (9) September            last       4
## 5  57113943    (5) May   20   (1) January           first       1
## 6  57116359   (6) June   24     (4) April           first       2
## 7  57117997    (5) May   20  (10) October            last       4
## 8  57118943   (7) July   19     (4) April           first       2
## 9  57120005    (5) May   25 (9) September            last       4
## 10 57120046 (8) August   20  (10) October            last       4
## 11 57120371   (7) July   20    (8) August            last       3
## 12 57121476    (5) May   20  (10) October            last       4
## 13 57123494   (7) July   21  (2) February           first       1
## 14 57129567   (7) July   26  (2) February           first       1
## 15 57130633 (8) August   26  (10) October            last       4
## 16 57131909  (4) April   27      (7) July            last       3
## 17 57133772   (7) July   19  (2) February           first       1
## 18 57134457   (7) July   18     (4) April           first       2
## 19 57136630    (5) May   16       (5) May           first       2
## 20 57139880   (6) June   19  (10) October            last       4
##    birth_year_half_split
## 1                   late
## 2                   late
## 3                   late
## 4                   late
## 5                  early
## 6                  early
## 7                   late
## 8                  early
## 9                   late
## 10                  late
## 11                  late
## 12                  late
## 13                 early
## 14                 early
## 15                  late
## 16                  late
## 17                 early
## 18                 early
## 19                 early
## 20                  late

1.6.2.5 Summarizing/aggregating data

We will spend more time later in the course on data summaries, but an introduction with dplyr is worthwhile introducing at this stage. The two main functions are summarise() and group_by().

A simple summary will tabulate the count of respondents and the mean age. The filter ! str_detect(h1gi1y, "Refused") drops records from respondents who refused to give their birth year.

dat %>% 
    filter(! str_detect(h1gi1y, "Refused")) %>% 
    mutate(yeari = str_replace_all(iyear, ".* ", "") %>% as.integer(),
           yearb = str_replace_all(h1gi1y, ".* ", "") %>% as.integer()) %>% 
    summarise(n = n(),
              mean_age = mean(yeari - yearb))
##      n mean_age
## 1 6501 16.03676

Here we will summarize age by sex using the group_by() function, and also piping to prop_table() to get the percentage:

dat %>% 
    filter(! str_detect(h1gi1y, "Refused")) %>% 
    mutate(yeari = str_replace_all(iyear, ".* ", "") %>% as.integer(),
           yearb = str_replace_all(h1gi1y, ".* ", "") %>% as.integer()) %>% 
    group_by(bio_sex) %>% 
    summarise(mean_age = mean(yeari - yearb),
              sd_age = sd(yeari - yearb),
              n = n(),
              .groups = "drop_last") %>% 
    mutate(pct = prop.table(n) * 100)
## # A tibble: 2 x 5
##   bio_sex    mean_age sd_age     n   pct
##   <fct>         <dbl>  <dbl> <int> <dbl>
## 1 (1) Male       16.1   1.77  3147  48.4
## 2 (2) Female     16.0   1.77  3354  51.6

1.6.2.6 purrr: efficient iterating over elements in vectors and lists

More attention will be paid to purrr in the lesson on functions.

The workhorse function in purrr is map(), which applies a function over a list or atomic vector.

A brief example uses a vector c(9, 16, 25) and the map() function is used to get the square root of each element. The output is a list

# apply the sqrt() function to each element of a vector of integers
map(c(9, 16, 25), sqrt)
## [[1]]
## [1] 3
## 
## [[2]]
## [1] 4
## 
## [[3]]
## [1] 5

Other resources for purrr: Learn to purrr, purrr tutorial

1.7 Data sets

1.7.1 Edward Babushkin’s Employee turnover data

Some data that will be used in CSDE 533: Edward Babushkin’s employee turnover data, explained a bit at kaggle.com and as a downloadable file.

Here we will load the data set from a URL:

etdata <- read.csv("https://raw.githubusercontent.com/teuschb/hr_data/master/datasets/turnover_babushkin.csv")

Just to get a bit of tidyverse in at the last minute, let’s get mean and standard deviation of job tenure by gender and 10-year age class:

# create 10-year age classes 
etdata %<>% 
    mutate(age_decade = plyr::round_any(age, 10, f = ceiling)) 

# summarize
etdata %>% 
    # group by gender and age class
    group_by(gender, age_decade) %>% 
    # mean and sd
    summarize(mean_tenure_months = mean(tenure) %>% round(1),
              sd_tenure_months = sd(tenure) %>% round(1), 
              .groups = "keep") %>% 
    # order the output by age and gender
    arrange(gender, mean_tenure_months) %>% 
    # print it nicely
    kable() %>% 
    kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
gender age_decade mean_tenure_months sd_tenure_months
f 60 21.6 20.1
f 50 22.9 24.2
f 40 35.4 31.3
f 30 37.9 35.1
f 20 75.0 49.3
m 60 21.1 5.1
m 50 31.2 29.7
m 40 33.9 26.3
m 30 41.8 40.2
m 20 121.8 33.5

Rendered at 2022-03-12 10:45:06

1.8 Source code

File is at CSDE-TS1: R:/Project/CSDE502/2022/csde502-winter-2022-main/01-week01.Rmd.

1.8.1 R code used in this document

library(tidyverse)
library(magrittr)
library(knitr)
library(kableExtra)

# this course's URL
myurl <- "https://csde-uw.github.io/csde502-winter-2022"

# path to this file name
if (!interactive()) {
    fnamepath <- current_input(dir = TRUE)
    fnamestr <- paste0(Sys.getenv("COMPUTERNAME"), ": ", fnamepath)
} else {
    fnamepath <- ""
}
print(is("a"))

print(str(TRUE))

print(class(123.45))

print(class(as.integer(1000)))

n <- as.numeric(999999999999999999999)

print(class(n))
# evaluate as logical, test whether 1 is greater than two
a <- 1 > 2
# create two numerical values, one being NA, representing ages
age_john <- 39
age_jane <- NA

# logical NA from Jane's undefined age
(jo <- age_john > 50)
(ja <- age_jane > 50)
(t <- as.logical(1))
(f <- as.logical(0))
i <- as.integer(999999999999999999999)

print(class(i))
print(class("Café"))
# this is a character
my_zip <- "98115"

# it is not numeric.
my_zip + 2
# we can convert it to numeric, although it would be silly to do with ZIP codes, which are nominal values
as.numeric(my_zip) + 2

# Boston has ZIP codes starting with zeros
boston_zip <- "02134"
as.numeric(boston_zip)
print(charToRaw("z"))

class(charToRaw("z"))
# create a vector of length 1
a <- 1
is(a)
c(1, "a", TRUE, charToRaw("z"))
(c(1:3, TRUE, FALSE))
c(1:3, TRUE, FALSE, "awesome!")
# a vector 
(v <- c(1, 3, 2))

(sort(v))
# create a random normal 
set.seed(5)
normvec1000 <- rnorm(n = 1000)

length(normvec1000)
class(normvec1000)
class(normvec1000 > 1)
v <- seq(from = 0, to = 10, by = 2)
v[4]
# make a vector 1 to 100
(v <- 1:100)

# load to a matrix
(m1 <- matrix(v, ncol = 10, byrow = TRUE))

# different r, c ordering
(m2 <- matrix(v, ncol = 10, byrow = FALSE))
(m3 <- matrix(letters, ncol = 10, nrow = 10))
# a vector 1 to 27
v <- 1:27

# create an array, 3 x 3 x 3
(a <- array(v, dim = c(3, 3, 3)))

# array index is r, c, m (row, column, matrix), e.g., row 1 column 2 matrix 3:
(a[1,2,3])
(l <- list("a", 1, TRUE))
(l <- list("a", 
           1:20, 
           as.logical(c(0,1,1,0))))
# show the data types
(lapply(X = l, FUN = class))

# mean, maybe?
(lapply(X = l, FUN = function(x) {mean(x)}))
(income <- c("<$10,000"
, "$10,000-$49,999"
, "$50,000-$99,999"
, "$100,000-$200,000"
, ">$200,000"))
sort(income)
# create a factor from income and set the levels
(income_factor <- factor(x = income, levels = income))

# sort again
(sort(income_factor))
# income levels 
inc <- c("<$10,000"
, "$10,000-$49,999"
, "$50,000-$99,999"
, "$100,000-$200,000"
, ">$200,000")

BMI <-  data.frame(
   sid = c("A1001", "A1002", "B1001"),
   gender = c("Male", "Male","Female"), 
   height_cm = c(152, 171.5, 165), 
   weight_kg = c(81, 93, 78),
   age_y = c(42, 38, 26),
   income = factor(c("$50,000-$99,999", "$100,000-$200,000", "<$10,000"), levels = inc)
)
print(BMI)
# load pacman if necessary
package.check <- lapply("pacman", FUN = function(x) {
    if (!require(x, character.only = TRUE)) {
        install.packages(x, dependencies = TRUE)
        library(x, character.only = TRUE)
    }
})

# load readstata13 if necessary
pacman::p_load(readstata13)

# read the dta file
dat <- readstata13::read.dta13(file.path(myurl, "data/AHwave1_v1.dta"))
x <- data.frame(colname = names(dat), label = attributes(dat)$var.labels)
DT::datatable(data = x, caption = "Column names and labels in AHwave1_v1.dta.")
head(iris)
iris %>% head()
mean(iris[iris$Species == 'setosa', "Sepal.Length"])
iris %>% filter(Species == 'setosa') %>% summarise(mean(Sepal.Length))
# numeric tests
(1 == 2)
(1 == 3 - 2)
# character test (actually a factor)
(dat$imonth %>% head() %>% str_c(collapse = ", "))
((dat$imonth == "(6) June") %>% head())
# character test for multiple patterns
(dat$imonth %in% c("(6) June", "(7) July") %>% head())
1 < 2
1 > 2
1 <= -10:10
1 >= -10:10
1 != 2
# those of the first 6 days that are not 14
(dat$iday %>% head())
((dat$iday != 14) %>% head())
dat$imonth %>% head(20)
((!dat$imonth %in% c("(6) June", "(7) July")) %>% head(20))
# first 20 records, fist 10 columns
dat_sub <- dat[1:20, 1:10]
kable(dat_sub, format = "html") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left")
# from May
(dat_sub %>% filter(imonth == "(5) May"))
(dat_sub %>% filter(imonth == "(5) May" & bio_sex == "(2) Female"))
(dat_sub %>% filter(imonth == "(5) May" & (bio_sex == "(2) Female") | iday < 15))

# select 3 columns
(dat_sub_sel <- dat_sub %>%   
    select("aid", "imonth", "iday"))
# select all but two named columns
(dat_sub_sel <- dat_sub %>%   
    select(-"imonth", -"iday"))
# select columns by position and whose name matches a pattern, in this case the regular expression "^i" meaning "starts with lowercase i"
(dat_sub_sel <- dat_sub %>%   
    select(1, matches("^i")))
#select one column, rename two columns
(dat_sub_sel %>% 
   select(aid, Month = imonth, Day = iday))
(dat_sub_sel %>% 
   rename(Month = imonth, Day = iday))
# records with day of month > 15 and the first 3 named columns
(x <- dat_sub %>% 
    filter(iday > 15) %>%
    select(aid, imonth, iday)
   )
# is this ordered?
is.ordered(dat$h1gi1m)
# what are the levels?
(levels(dat$h1gi1m))
# make birth month ordered
dat$h1gi1m <- factor(dat$h1gi1m, ordered = TRUE)

# now is it ordered?
is.ordered(dat$h1gi1m)
# perform the mutate() using the string representation of the factor for comparison
dat %>% 
    filter(iday > 15) %>%
    select(aid, imonth, iday, birth_month = h1gi1m) %>% 
    mutate(birth_1st_half = (birth_month < "(7) July")) %>% 
    head(20) %>% 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left")
(X <- dat_sub %>% 
     mutate(iday = -1000 + iday))
dat_1 <- dat %>% 
    filter(iday > 15) %>%
    head(20) %>%
    select(aid, imonth, iday, birth_month = h1gi1m) %>% 
    mutate(birth_year_half = ifelse(test = birth_month < "(7) July", yes = "first", no = "last"))

# make that a factor
dat_1$birth_year_half <- factor(dat_1$birth_year_half, levels = c("first", "last"))
    
# print
kable(dat_1) %>% 
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left")
dat_1 %<>% 
    mutate(quarter = case_when(
        birth_month < "(3) March" ~ 1,
        birth_month < "(6) June" ~ 2,
        birth_month < "(9) September" ~ 3,
        TRUE ~ 4))

# print
kable(dat_1) %>% 
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left")
(dat_1 %<>% 
     mutate(birth_year_half_split = recode(birth_year_half,
                   "first" = "early",
                   "last" = "late")))
dat %>% 
    filter(! str_detect(h1gi1y, "Refused")) %>% 
    mutate(yeari = str_replace_all(iyear, ".* ", "") %>% as.integer(),
           yearb = str_replace_all(h1gi1y, ".* ", "") %>% as.integer()) %>% 
    summarise(n = n(),
              mean_age = mean(yeari - yearb))
dat %>% 
    filter(! str_detect(h1gi1y, "Refused")) %>% 
    mutate(yeari = str_replace_all(iyear, ".* ", "") %>% as.integer(),
           yearb = str_replace_all(h1gi1y, ".* ", "") %>% as.integer()) %>% 
    group_by(bio_sex) %>% 
    summarise(mean_age = mean(yeari - yearb),
              sd_age = sd(yeari - yearb),
              n = n(),
              .groups = "drop_last") %>% 
    mutate(pct = prop.table(n) * 100)
# apply the sqrt() function to each element of a vector of integers
map(c(9, 16, 25), sqrt)
etdata <- read.csv("https://raw.githubusercontent.com/teuschb/hr_data/master/datasets/turnover_babushkin.csv")
# create 10-year age classes 
etdata %<>% 
    mutate(age_decade = plyr::round_any(age, 10, f = ceiling)) 

# summarize
etdata %>% 
    # group by gender and age class
    group_by(gender, age_decade) %>% 
    # mean and sd
    summarize(mean_tenure_months = mean(tenure) %>% round(1),
              sd_tenure_months = sd(tenure) %>% round(1), 
              .groups = "keep") %>% 
    # order the output by age and gender
    arrange(gender, mean_tenure_months) %>% 
    # print it nicely
    kable() %>% 
    kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
cat(readLines(fnamepath), sep = '\n')

1.8.2 Complete Rmd code

cat(readLines(fnamepath), sep = '\n')
# Week 1 {#week1}

```{r, echo=FALSE, message=FALSE, error=FALSE}
library(tidyverse)
library(magrittr)
library(knitr)
library(kableExtra)

# this course's URL
myurl <- "https://csde-uw.github.io/csde502-winter-2022"

# path to this file name
if (!interactive()) {
    fnamepath <- current_input(dir = TRUE)
    fnamestr <- paste0(Sys.getenv("COMPUTERNAME"), ": ", fnamepath)
} else {
    fnamepath <- ""
}
```

<h2>Topics:</h2>
* [Getting started on terminal server 4](#gettingstarted)
* [Introduction to R/RStudio/R Markdown](#intrormd)
* [R data types](#rdatatypes)
* [R data structures](#rdatastructures)
* [File systems](#filesystems)
* [Data manipulation in the `tidyverse`](#tidyverse)
* [Data sets:](#datasets001)
    * Employee turnover data
    
<hr>
Today's lessons will cover getting started with computing at CSDE, and quickly introduce R, RStudio, and R Markdown.

It is assumed that students in this course have a basic working knowledge of using R, including how to create variables with the assignment operator ("`<-`"), and how to run simple functions(e.g., `mean(dat$age)`). Often in courses that include using R for statistical analysis, some of the following foundations are not explained fully. This section is not intended to be a comprehensive treatment of R data types and structures, but should provide some background for students who are either relatively new at using R or who have not had a systematic introduction.    

The other main topic for today is [`tidyverse`](https://www.tidyverse.org/), which refers to a related set of R packages for data management, analysis, and display. See Hadley Wickham's [tidy tools manifesto](https://tidyverse.tidyverse.org/articles/manifesto.html) for the logic behind the suite of tools. For a brief description of the specific R packages, see [Tidyverse packages](https://www.tidyverse.org/packages/). This is not intended to be a complete introduction to the `tidyverse`, but should provide sufficient background for data handling to support most of the technical aspects of the rest of the course and CSDE 533.

## Getting started on Terminal Server 4 {#gettingstarted}
First, if you are not on campus, make sure you have the Husky OnNet VPN application running and have connected to the UW network. You should see the f5 icon in your task area:

![](images/week01/2021-01-07_21_40_25-.png)

Connect to TS4: `csde-ts4.csde.washington.edu`

If you are using the Windows Remote Desktop Protocol (RDP) connection, your connection parameters should look like this:

![](images/week01/2021-01-07_21_48_03-Remote Desktop Connection.png)

If you are using mRemoteNG, the connection parameters will match this:

![](images/week01/2021-01-07_21_37_36-Window.png)

Once you are connected you should see a number of icons on the desktop and application shortcuts in the Start area.

![](images/week01/2021-01-07_21_59_38-.png)

![](images/week01/2021-01-07_22_00_14-Window.png)

Open a Windows Explorer (if you are running RDP in full screen mode you should be able to use the key combination Win-E).

Before doing anything, let's change some of the annoying default settings of the Windows Explorer. Tap `File > Options`. In the `View` tab, make sure that `Always show menus` is checked and `Hide extensions for known file types` is unchecked. The latter setting is very important because we want to see the complete file name for all files at all times.

![](images/week01/2021-01-07_22_30_46-Folder_Options.png)

Click `Apply to Folders` so that these settings become default. Click `Yes` to the next dialog.

![](images/week01/2021-01-07_22_31_37-FolderViews.png)

Now let's make a folder for the files in this course.

Navigate to This PC:

![](images/week01/2021-01-07_22_05_59-Window.png)

You should see the `H:` drive. This is is the mapped drive that links to your [U Drive](https://itconnect.uw.edu/wares/online-storage/u-drive-central-file-storage-for-users/), and is the place where all of the data for this course is to be stored. __Do not store any data on the `C:` drive!__ The `C:` drive can be wiped without any prior notification.

__Be very careful with your files on the U Drive!__ If you delete files, there is no "undo" functionality. When you are deleting files, you will get a warning that you should take seriously:

![](images/week01/2021-01-07_23_01_10-Delete_Folder.png)

Navigate into `H:` and create a new folder named `csde502_winter_2022`. Note the use of lowercase letters and underscores rather than spaces. This will be discussed in the section on file systems later in this lesson.

![](images/week01/2021-01-07_22_32_29-new_folder.png)

## Introduction to R Markdown in RStudio {#intrormd}

### Create a project
Now we will use RStudio to create the first R Markdown source file and render it to HTML.

Start RStudio by either dbl-clicking the desktop shortcut or navigating to the alphabetical R section of the Start menu:

![](images/week01/2021-01-07_23_05_49-Window.png)

:::{.rmdnote}
***A brief aside: install R packages.***

To get started, because it usually takes some time to install, open a second RStudio session and at the console, to install `tidyverse`, the other packages for CSDE 502 and 533, and for this lesson, download the file [`packages.R`](tools/packages.R).

Open the file in your second RStudio session and in the upper right of the source code pane, click `Source > Source`.

![](images/week01/2022-01-06 21_47_15-source.png)

Now continue on with the lesson in your original RStudio session.....
:::

Create a new project (`File > New Project...`).

![](images/week01/2021-01-07_23_08_34-rstudiorappbroker.csde.washington.edu.png)

Since we just created the directory to house the project, select `Existing Directory`.

![](images/week01/2021-01-07_23_09_11-csde502_winter_2021_course-RStudiorappbroker.csde.washington.edu.png)

Navigate to that directory and select `Open`.

![](images/week01/2021-01-07_23_09_48-ChooseDirectoryrappbroker.csde.washington.edu.png)

Click `Create Project`.

![](images/week01/2021-01-07_23_10_02-csde502_winter_2021_course-RStudiorappbroker.csde.washington.edu.png)

You will now have a blank project with only the project file.

![](images/week01/2021-01-07_23_11_16-csde502_winter_2021-RStudiorappbroker.csde.washington.edu.png)

### Create an R Markdown file from built-in RStudio functionality
Let's make an R Markdown file (`File > New File > R Markdown...`).

![](images/week01/2021-01-07_23_12_31-csde502_winter_2021-RStudiorappbroker.csde.washington.edu.png)

Do not change any of the metadata ... this is just for a quick example.

![](images/week01/2021-01-07_23_13_41-csde502_winter_2021-RStudiorappbroker.csde.washington.edu.png)

Click `OK` and then name the file `week_01.Rmd`.

![](images/week01/2021-01-07_23_14_59-SaveFile-Untitled1rappbroker.csde.washington.edu.png)

#### Render the Rmd file as HTML

At the console prompt, enter `R Markdown::render("W` and tap the `TAB` key. This should bring up a list of files that have the character "w" in the file name. Click `week_01.Rmd`.

The syntax here means "run the `render()` function from the `R Markdown` package on the file `week_01.Rmd`"

![](images/week01/2021-01-07_23_15_32-csde502_winter_2021-RStudiorappbroker.csde.washington.edu.png)

After a few moments, the process should complete with a message that the output has been created.

![](images/week01/2021-01-07_23_16_13-csde502_winter_2021-RStudiorappbroker.csde.washington.edu.png)

If the HTML page does not open automatically, look for `week_01.html` in the list of files. Click it and select `View in Web Browser`.

![](images/week01/2021-01-07_23_16_39-csde502_winter_2021-RStudiorappbroker.csde.washington.edu.png)

You will now see the bare-bones HTML file.

![](images/week01/2021-01-07_23_17_10-Untitled.png)

Compare the output of this file with the source code in `week_01.Rmd`. Note there are section headers that begin with hash marks, and R code is indicated with the starting characters 

<code>
\`\`\`\{r\}
</code>

and the ending characters

<code>
\`\`\`
</code>

Next, we will explore some enhancements to the basic R Markdown syntax.

### Create an R Markdown file with some enhancements

Download this version of [`week_01.Rmd`](files/week_01.Rmd) and overwrite the version you just created.

If RStudio prints a message that some packages are required but are not installed, click `Install`.

![](images/week01/2021-01-07_23_26_55-csde502_winter_2021-RStudiorappbroker.csde.washington.edu.png)

Change line 3 to include your name and e-mail address as shown. 

![](images/week01/2021-01-08_00_35_18-Window.png)

#### Render and view the enhanced output
Repeat the rendering process (`R Markdown::render("Week_01.Rmd")`) 

The new HTML file has a number of enhancements, including a mailto: hyperlink for your name, a table of contents at the upper left, a table that is easier to read, a Leaflet map, captions and cross-references for the figures and table, an image derived from a PNG file referenced by a URL, the code used to generate various parts of the document that are produced by R code, and the complete source code for the document. A downloadable version of the rendered file: [week_01.html](files/week_01.html).

![](images/week01/2021-01-08_00_19_27-Week01.png)

Including the source code for the document is especially useful for readers of your documents because it lets them see exactly what you did. An entire research chain can be documented in this way, from reading in raw data, performing data cleaning and analysis, and generating results.

## R data types {#rdatatypes}
You may want to download the file [week01.Rmd](files/week01/week01.R), which contains many of the examples below.

There are six fundamental data types in R:

1. logical
1. numeric
1. integer
1. complex
1. character
1. raw

The most atomic object in R will exist having one of those data types, described below. An atomic object of the data type can have a value, `NA` which represents an observation with no data (e.g., a missing measurement), or `NULL` which isn't really a value at all, but can still have the data type class.

You will encounter other data types, such as `Date` or `POSIXct` if you are working with dates or time stamps. These other data types are extensions of the fundamental data types.

To determine what data type an object is, use `is(obj)`, `str(obj)`, or `class(obj)`. 

```{r}
print(is("a"))

print(str(TRUE))

print(class(123.45))

print(class(as.integer(1000)))

n <- as.numeric(999999999999999999999)

print(class(n))
```

### Logical
Use `logical` values for characteristics that are either `TRUE` or `FALSE`. Note that if `logical` elements can also have an `NA` value if the observation is missing. In the following examples, 

```{r}
# evaluate as logical, test whether 1 is greater than two
a <- 1 > 2
```

```{r}
# create two numerical values, one being NA, representing ages
age_john <- 39
age_jane <- NA

# logical NA from Jane's undefined age
(jo <- age_john > 50)
(ja <- age_jane > 50)
```

Logical values are often expressed in binary format as 0 = `FALSE` and ` = `TRUE`. in R these values are interconvertible. Other software (e.g., Excel, MS Access) may convert logical values to numbers that you do not expect.

```{r}
(t <- as.logical(1))
(f <- as.logical(0))
```

### Numeric
`Numeric` values are numbers with range about 2e-308 to 2e+308, depending on the computer you are using. You can see the possible range by entering `.Machine` at the R console. These can also include decimals. For more information, see [Double-precision floating-point format](https://en.wikipedia.org/wiki/Double-precision_floating-point_format)


### Integer
`Integer` values are numerical, but can only take on whole, rather than fractional values, and have a truncated range compared to `numeric`. For example, see below, if we try to create an integer that is out of range. The object we created is an integer, but because it is out of range, is value is set to `NA`.

```{r}
i <- as.integer(999999999999999999999)

print(class(i))
```

### Complex
The `complex` type is used in mathematics and you are unlikely to use it in applied social science research unless you get into some heavy statistics. See [Complex number](https://en.wikipedia.org/wiki/Complex_number) for a full treatment.

### Character
`Character` data include the full set of keys on your keyboard that print out a character, typically [A-Z], [a-z], [0-9], punctuation, etc. The full set of ASCII characters is supported, e.g. the `accent aigu` in Café:

```{r}
print(class("Café"))
```

Also numbers can function as characters. Be careful in converting between numerical and character versions. For example, see these ZIP codes:

```{r error=TRUE}
# this is a character
my_zip <- "98115"

# it is not numeric.
my_zip + 2
```

```{r}
# we can convert it to numeric, although it would be silly to do with ZIP codes, which are nominal values
as.numeric(my_zip) + 2

# Boston has ZIP codes starting with zeros
boston_zip <- "02134"
as.numeric(boston_zip)
```

### Raw
`Raw` values are used to store raw bytes in hexadecimal format. You are unlikely to use it in applied social science research. For example, the hexadecimal value for the character `z` is `7a`:

```{r}
print(charToRaw("z"))

class(charToRaw("z"))
```


## R data structures {#rdatastructures}

![](images/week02/data_structures.png)

There are 5 basic data structures in R, as shown in the graphic: 

1. vector
1. matrix
1. array
1. list
1. data frame

In addition, the `factor` data type is very important

### Vector
A vector is an ordered set of elements of one or more elements of the same data type and are created using the `c()` constructor function. For example, a single value is a vector:

```{r}
# create a vector of length 1
a <- 1
is(a)
```


If you try creating a vector with mixed data types, you may get unexpected results; mixing character elements with other type elements will result in character representations, e.g., 

```{r}
c(1, "a", TRUE, charToRaw("z"))
```

Results will depend on the data type you are mixing, for example because logical values can be expressed numerically, the `TRUE` and `FALSE` values are converted to `1` and `0`, respectively.

```{r}
(c(1:3, TRUE, FALSE))
```

But if a character is added, all elements are converted to characters.

```{r}
c(1:3, TRUE, FALSE, "awesome!")
```

Order is important, i.e., 

`1, 2, 3` is not the same as `1, 3, 2`

R will maintain the order of elements in vectors unless a process is initiated that changes the order of those elements:

```{r}
# a vector 
(v <- c(1, 3, 2))

(sort(v))
```

You can get some information about vectors, such as length and data type:

```{r}
# create a random normal 
set.seed(5)
normvec1000 <- rnorm(n = 1000)

length(normvec1000)
class(normvec1000)
class(normvec1000 > 1)
```

Elements of vectors are specified with their index number (1 .. n):

```{r}
v <- seq(from = 0, to = 10, by = 2)
v[4]
```

### Matrix
A matrix is like a vector, in that it an contain only one data type, but it is two-dimensional, having rows and columns. A simple example:

```{r}
# make a vector 1 to 100
(v <- 1:100)

# load to a matrix
(m1 <- matrix(v, ncol = 10, byrow = TRUE))

# different r, c ordering
(m2 <- matrix(v, ncol = 10, byrow = FALSE))
```

If you try to force a vector into a matrix whose row $\times$ col length does not match the length of the vector, the elements will be recycled, which may not be what you want. At least R will give you a warning.

```{r}
(m3 <- matrix(letters, ncol = 10, nrow = 10))
```

### Array
An array is similar to matrix, but it can have more than one dimension. These can be useful for analyzing time series data or other multidimensional data. We will not be using array data in this course, but a simple example of creating and viewing the contents of an array:

```{r}
# a vector 1 to 27
v <- 1:27

# create an array, 3 x 3 x 3
(a <- array(v, dim = c(3, 3, 3)))

# array index is r, c, m (row, column, matrix), e.g., row 1 column 2 matrix 3:
(a[1,2,3])
```

### List
R lists are ordered collections of objects that do not need to be of the same data type. Those objects can be single-value vectors, multiple-value vectors, matrices, data frames, other lists, etc. Because of this, lists are a very flexible data type. But because they can have as little or as much structure as you want, can become difficult to manage and analyze.

Here is an example of a list comprised of single value vectors of different data type. Compare this with the attempt to make a vector comprised of elements of different data type:

```{r}
(l <- list("a", 1, TRUE))
```

Let's modify that list a bit:

```{r}
(l <- list("a", 
           1:20, 
           as.logical(c(0,1,1,0))))
```

The top-level indexing for a list is denoted using two sets of square brackets. For example, the first element of our list can be accessed by `l[[1]]`. For example, the mean of element 2 is obtained by `mean(l[[2]])`: ``r mean(l[[2]])``.

To perform operations on all elements of a list, use `lapply()`:

```{r}
# show the data types
(lapply(X = l, FUN = class))

# mean, maybe?
(lapply(X = l, FUN = function(x) {mean(x)}))
```
### Factor
Factors are similar to vectors, in that they are one-dimensional ordered sets. However, factors also use informational labels. For example, you may have a variable with household income as a text value:

* "<$10,000"
* "$10,000-$549,999"
* "$50,000-$99,999"
* "$100,000-$200,000"
* ">$200,000"

As a vector:

```{r}
(income <- c("<$10,000"
, "$10,000-$49,999"
, "$50,000-$99,999"
, "$100,000-$200,000"
, ">$200,000"))
```

Because these are characters, they do not sort in proper numeric order:

```{r}
sort(income)
```

If these are treated as a factor, the levels can be set for proper ordering:

```{r}
# create a factor from income and set the levels
(income_factor <- factor(x = income, levels = income))

# sort again
(sort(income_factor))
```

As a factor, the data can also be used in statistical models and the magnitude of the variable will also be correctly ordered.

### Data frame
Other than vectors, data frames are probably the most used data type in R. You can think of data frames as matrices that allow columns with different data type. For example, you might have a data set that represents subject IDs as characters, sex or gender as text, height, weight, and age as numerical values, income as a factor, and smoking status as logical. Because a matrix requires only one data type, it would not be possible to store all of these as a matrix. An example:

```{r}
# income levels 
inc <- c("<$10,000"
, "$10,000-$49,999"
, "$50,000-$99,999"
, "$100,000-$200,000"
, ">$200,000")

BMI <-  data.frame(
   sid = c("A1001", "A1002", "B1001"),
   gender = c("Male", "Male","Female"), 
   height_cm = c(152, 171.5, 165), 
   weight_kg = c(81, 93, 78),
   age_y = c(42, 38, 26),
   income = factor(c("$50,000-$99,999", "$100,000-$200,000", "<$10,000"), levels = inc)
)
print(BMI)
```

## File systems {#filesystems}
Although a full treatment of effective uses of file systems is beyond the scope of this course, a few basic rules are worth covering:

1. Never use spaces in folder or file names. 
    Ninety-nine and 44/100ths percent of the time, most modern software will have no problems handling file names with spaces. But that 0.56% of the time when software chokes, you may wonder why your processes are failing. If your directly and file names do not have spaces, then you can at least rule that out!
1. Use lowercase letters in directory and file names.
    In the olden days (MS-DOS), there was not case sensitivity in file names. UNIX has has always used case sensitive file names. So 
    `MyGloriousPhDDissertation.tex` and `mygloriousphddissertation.tex` could actually be different files. Macs, being based on a UNIX kernel, also employ case sensitivity in file names. But Windows? No. Consider the following: there cannot be both `foo.txt` and `FOO.txt` in the same directory. 
    ![](images/week01/2021-01-08_01_13_50-CommandPrompt.png)
    
    So if Windows doesn't care, why should we? Save yourself some keyboarding time and confusion by using only lowercase characters in your file names.
1. Include dates in your file names.
    If you expect to have multiple files that are sequential versions of a file in progress, an alternative to using a content management system such as [git](https://git-scm.com/), particularly for binary files such as Word documents or SAS data files, is to have multiple versions of the files but including the date as part of the file name. If you expect to have multiple versions on the same date, include a lowercase alphabetical character; it is improbable that you would have more than 26 versions of a fine on a single calendar date. If you are paranoid, use a suffix number `0000`, `0002` .. `9999`. If you have ten thousand versions of the same file on a given date, you are probably doing something that is not right.
    Now that you are convinced that including dates in file names is a good idea, _please_ use the format `yyyy-mm-dd` or `yyyymmdd`. If you do so, your file names will sort in temporal order.
1. Make use of directories! 
   Although a folder containing 100,000 files can be handled programatically (if file naming conventions are used), it is not possible for a human being to visually scan 100,000 file names. If you have a lot of files for your project, consider creating directories, e.g., 
       - raw_data
       - processed_data
       - analysis_results
       - scripts
       - manuscript
1. Agonize over file names. 
    Optimally when you look at your file names, you will be able to know something about the content of the file. We spend a lot of time doing analysis and creating output. Spending an extra minute thinking about good file names is time well spent.


## Data manipulation in the `tidyverse` {#tidyverse}
One of the R packages we will use frequently is [`tidyverse`](https://www.tidyverse.org/packages/), which is itself a collection of several other packages, each with a specific domain: 

* `ggplot2` (graphics)
* `dplyr` (data manipulation)
* `tidyr` (reformatting data for efficient processing)
* `readr` (reading rectangular R x C data)
* `purrr` (functional programming, e.g., to replace `for()` loops)
* `tibble` (enhanced data frames)
* `stringr` (string, i.e., text manipulation)
* `forcats` (handling factor, i.e., categorical variables)

We will touch on some of these during this course, but there will not be a full review or treatment of the `tidyverse`.

This section will introduce some of the main workhorse functions in tidy data handling. 

Installing tidyverse is straightforward but it may take some time to download and install all of the packages. If you have not done so yet, use

```
install.packages("tidyverse")
```

For today's lesson we will be using one of the Add Health public use data sets, [AHwave1_v1.dta](data/AHwave1_v1.dta). 

```{r warning=FALSE, message=FALSE}
# load pacman if necessary
package.check <- lapply("pacman", FUN = function(x) {
    if (!require(x, character.only = TRUE)) {
        install.packages(x, dependencies = TRUE)
        library(x, character.only = TRUE)
    }
})

# load readstata13 if necessary
pacman::p_load(readstata13)

# read the dta file
dat <- readstata13::read.dta13(file.path(myurl, "data/AHwave1_v1.dta"))
```

The data set includes variable labels, which make handling the data easier. Here we print the column names and their labels. Wrapping this in a `DT::data_table` presents a nice interface for showing only a few variables at a time and that allows sorting and searching.

```{r}
x <- data.frame(colname = names(dat), label = attributes(dat)$var.labels)
DT::datatable(data = x, caption = "Column names and labels in AHwave1_v1.dta.")
```


### magrittr{#magrittr}
![](images/week02/unepipe.jpeg)

The R package [`magrittr`](https://cran.r-project.org/web/packages/magrittr/index.html) allows the use of "pipes". In UNIX, pipes were used to take the output of one program and to feed as input to another program. For example, the UNIX command `cat` prints the contents of a text file. This would print the contents of the file `00README.txt`:

```cat 00README.txt```

but with large files, the entire contents would scroll by too fast to read. Using a "pipe", denoted with the vertical bar character `|` allowed using the `more` command to print one screen at a time by tapping the `Enter` key for each screen full of text:

```cat 00README.txt | more```

As shown in these two screen captures:

![](images/week02/cat_more.png)

![](images/week02/cat_more2.png)

The two main pipe operators we will use in `magrittr` are `%>%` and '%<>%'.

`%>%` is the pipe operator, which functions as a UNIX pipe, that is, to take something on the left hand side of the operator and feed it to the right hand side. 

`%<>%` is the assignment pipe operator, which takes something on the left hand side of the operator, feeds it to the right hand side, and replaces the object on the left-hand side.

For a simple example of the pipe, to list only the first 6 lines of a data frame in base R, we use `head()`, e.g.,

```{r}
head(iris)
```

using a tidy version of this:

```{r}
iris %>% head()
```

In the R base version, we first read `head`, so we know we will be printing the first 6 elements of something, but we don't know what that "something" is. We have to read ahead to know we are reading the first 6 records of `iris`. In the tidy version, we start by knowing we are doing something to the data set, after which we know we are printing the first 6 records.

In base R functions, the process is evaluated from the inside out. For example, to get the mean sepal length of the _setosa_ species in iris, we would do this:

```{r}
mean(iris[iris$Species == 'setosa', "Sepal.Length"])
```

From the inside out, we read that we are making a subset of `iris` where Species = "setosa", we are selecting the column "Sepal.Length", and taking the mean. However, it requires reading from the inside out. For a large set of nested functions, we would have ` y <- f(g(h((i(x)))))`, which would require first creating the innermost function (`i()`) and then working outward.

In a tidy approach this would be more like y <- x %>% i() %>% h() %>% g() %>% f()` because the first function applied to the data set `x` is `i()`. Revisiting the mean sepal length of _setosa_ irises, example, under a tidy approach we would do this:

```{r}
iris %>% filter(Species == 'setosa') %>% summarise(mean(Sepal.Length))
```

Which, read from left to right, translates to "using the iris data frame, make a subset of records where species is _setosa_, and summarize those records to get the mean value of sepal length." The tidy version is intended to be easier to write, read, and understand. The command uses the `filter()` function, which will be described below.

### Data subsetting (dplyr)
`dplyr` is the tidyverse R package used most frequently for data manipulation. Selection of records (i.e., subsetting) is done using logical tests to determine what is in the selected set. First we will look at logical tests and then we will cover subsetting rows and columns from data frames.

##### Logical tests
If elements meet a logical test, they will end up in the selected set. If data frame records have values in variables that meet logical criteria, the records will be selected. 

Some logical tests are shown below.

###### `==`: equals

```{r}
# numeric tests
(1 == 2)
```

```{r}
(1 == 3 - 2)
```

```{r}
# character test (actually a factor)
(dat$imonth %>% head() %>% str_c(collapse = ", "))
((dat$imonth == "(6) June") %>% head())
```

```{r}
# character test for multiple patterns
(dat$imonth %in% c("(6) June", "(7) July") %>% head())
```


###### `>`, `>=`, `<`, `<=`: numeric comparisons

```{r}
1 < 2
```

```{r}
1 > 2
```

```{r}
1 <= -10:10
```

```{r}
1 >= -10:10
```

###### `!=`: not equals

```{r}
1 != 2
```

```{r}
# those of the first 6 days that are not 14
(dat$iday %>% head())
((dat$iday != 14) %>% head())
```

###### `!`: invert, or "not"
Sometimes it is more convenient to negate a single condition rather than enumerating all possible matching conditions.

```{r}
dat$imonth %>% head(20)
((!dat$imonth %in% c("(6) June", "(7) July")) %>% head(20))
```

#### Subset rows (`filter()`)
The `filter()` function creates a subset of records based on a logical test. Logical tests can be combined as "and" statements using the `&` operator and "or" statements using the `|` operator. Here we will perform a few filters on a subset of the data.

```{r}
# first 20 records, fist 10 columns
dat_sub <- dat[1:20, 1:10]
kable(dat_sub, format = "html") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left")
```

Records from one month:

```{r}
# from May
(dat_sub %>% filter(imonth == "(5) May"))
```

Records from one month from females:

```{r}
(dat_sub %>% filter(imonth == "(5) May" & bio_sex == "(2) Female"))
```

Records from one month and from females or where the day of month was before the 15th, which will probably include some males:

```{r}
(dat_sub %>% filter(imonth == "(5) May" & (bio_sex == "(2) Female") | iday < 15))

```

Although these examples are silly and trivial, they show how `filter()` is used to create a selected set of data

#### Subset columns (`select()`)
A subset of columns can be extracted from data frames using the `select()` function, most simply using  named list of columns to keep.

```{r}
# select 3 columns
(dat_sub_sel <- dat_sub %>%   
    select("aid", "imonth", "iday"))
```

```{r}
# select all but two named columns
(dat_sub_sel <- dat_sub %>%   
    select(-"imonth", -"iday"))
```

```{r}
# select columns by position and whose name matches a pattern, in this case the regular expression "^i" meaning "starts with lowercase i"
(dat_sub_sel <- dat_sub %>%   
    select(1, matches("^i")))
```

`select()` can also be used to rename columns:

```{r}
#select one column, rename two columns
(dat_sub_sel %>% 
   select(aid, Month = imonth, Day = iday))
```

Or column renaming can be done with `rename()`, which maintains all input data and only changes the named columns:

```{r}
(dat_sub_sel %>% 
   rename(Month = imonth, Day = iday))
```

#### Subset rows and columns: `filter()` and `select()`
We can combine `filter()` and `select()` with a pipe to create a new data frame with a subset of rows and columns:

```{r}
# records with day of month > 15 and the first 3 named columns
(x <- dat_sub %>% 
    filter(iday > 15) %>%
    select(aid, imonth, iday)
   )
```

#### Create or calculate columns: `mutate()`
`mutate()` will create new named columns or re-calculate existing columns. Here we will make a column that stratifies birth month, with the cut at June. 

Although the birth month column (`h1gi1m`) is a factor, it is unordered, so we need to make it ordered before using the factor label in a numeric comparison. Fortunately, the factor labels were handled in correct order:

```{r}
# is this ordered?
is.ordered(dat$h1gi1m)
```

```{r}
# what are the levels?
(levels(dat$h1gi1m))
```

Assign order, create a new column, and print nicely:

```{r}
# make birth month ordered
dat$h1gi1m <- factor(dat$h1gi1m, ordered = TRUE)

# now is it ordered?
is.ordered(dat$h1gi1m)
```

```{r}
# perform the mutate() using the string representation of the factor for comparison
dat %>% 
    filter(iday > 15) %>%
    select(aid, imonth, iday, birth_month = h1gi1m) %>% 
    mutate(birth_1st_half = (birth_month < "(7) July")) %>% 
    head(20) %>% 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left")
```

A silly example but showing that `mutate()` can change values of existing columns:

```{r}
(X <- dat_sub %>% 
     mutate(iday = -1000 + iday))
```

... so do be careful!

Other functions can be used with mutate include (but are by no means limited to!) 

* `if_else()`: create a column by assigning values based on logical criteria
* `case_when()`: similar to `if_else()` but for multiple input values
* `recode()`: change particular values

When we recoded the birth month, the output was a `logical` data type. If we wanted to create a 
`character` or `factor`, we could use `if_else()`. Here we are creating a new data frame based on several operations on `dat`.

```{r}
dat_1 <- dat %>% 
    filter(iday > 15) %>%
    head(20) %>%
    select(aid, imonth, iday, birth_month = h1gi1m) %>% 
    mutate(birth_year_half = ifelse(test = birth_month < "(7) July", yes = "first", no = "last"))

# make that a factor
dat_1$birth_year_half <- factor(dat_1$birth_year_half, levels = c("first", "last"))
    
# print
kable(dat_1) %>% 
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left")
```

If one of your variables contains multiple values and you want to create classes, use `case_when()`. Here is a verbose example stratifying months into quarters. Also we are using the `magrittr` assignment pipe to update the input based on the statement, i.e., `dat_1` will change based on the commands we use. __Be careful using the assignment pipe because it will change your data frame.__

`case_when()` will recode in order or the way the command is written, so for months and quarters, it is not necessary to specify both ends of the quarter. Also any cases that are not explicitly handled can be addressed with the `TRUE ~ ...` argument; in this case, any records that had birth months that were not before September get assigned to quarter 4.

```{r}
dat_1 %<>% 
    mutate(quarter = case_when(
        birth_month < "(3) March" ~ 1,
        birth_month < "(6) June" ~ 2,
        birth_month < "(9) September" ~ 3,
        TRUE ~ 4))

# print
kable(dat_1) %>% 
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left")
```

`recode()` is used to change the `birth_year_half` column:

```{r}
(dat_1 %<>% 
     mutate(birth_year_half_split = recode(birth_year_half,
                   "first" = "early",
                   "last" = "late")))
```

#### Summarizing/aggregating data
We will spend more time later in the course on data summaries, but an introduction with `dplyr` is worthwhile introducing at this stage. The two main functions are `summarise()` and `group_by()`.

A simple summary will tabulate the count of respondents and the mean age. The filter `! str_detect(h1gi1y, "Refused")` drops records from respondents who refused to give their birth year.

```{r}
dat %>% 
    filter(! str_detect(h1gi1y, "Refused")) %>% 
    mutate(yeari = str_replace_all(iyear, ".* ", "") %>% as.integer(),
           yearb = str_replace_all(h1gi1y, ".* ", "") %>% as.integer()) %>% 
    summarise(n = n(),
              mean_age = mean(yeari - yearb))
```

Here we will summarize age by sex using the `group_by()` function, and also piping to `prop_table()` to get the percentage:

```{r}
dat %>% 
    filter(! str_detect(h1gi1y, "Refused")) %>% 
    mutate(yeari = str_replace_all(iyear, ".* ", "") %>% as.integer(),
           yearb = str_replace_all(h1gi1y, ".* ", "") %>% as.integer()) %>% 
    group_by(bio_sex) %>% 
    summarise(mean_age = mean(yeari - yearb),
              sd_age = sd(yeari - yearb),
              n = n(),
              .groups = "drop_last") %>% 
    mutate(pct = prop.table(n) * 100)
```

#### purrr: efficient iterating over elements in vectors and lists
More attention will be paid to `purrr` in the lesson on [functions](#week2).

The workhorse function in `purrr` is `map()`, which applies a function over a list or atomic vector.

A brief example uses a vector `c(9, 16, 25)` and the `map()` function is used to get the square root of each element. The output is a list

```{r}
# apply the sqrt() function to each element of a vector of integers
map(c(9, 16, 25), sqrt)
```

Other resources for `purrr`: [Learn to purrr](https://www.rebeccabarter.com/blog/2019-08-19_purrr/), [purrr tutorial](https://jennybc.github.io/purrr-tutorial/)

## Data sets {#datasets001}

### Edward Babushkin's Employee turnover data {#babushkin1}
Some data that will be used in CSDE 533: Edward Babushkin's employee turnover data, explained a bit at [kaggle.com](https://www.kaggle.com/davinwijaya/employee-turnover) and as a [downloadable file](https://github.com/teuschb/hr_data/blob/master/datasets/turnover_babushkin.csv).

Here we will load the data set from a URL:

```{r}
etdata <- read.csv("https://raw.githubusercontent.com/teuschb/hr_data/master/datasets/turnover_babushkin.csv")
```

Just to get a bit of `tidyverse` in at the last minute, let's get mean and standard deviation of job tenure by gender and 10-year age class:

```{r}
# create 10-year age classes 
etdata %<>% 
    mutate(age_decade = plyr::round_any(age, 10, f = ceiling)) 

# summarize
etdata %>% 
    # group by gender and age class
    group_by(gender, age_decade) %>% 
    # mean and sd
    summarize(mean_tenure_months = mean(tenure) %>% round(1),
              sd_tenure_months = sd(tenure) %>% round(1), 
              .groups = "keep") %>% 
    # order the output by age and gender
    arrange(gender, mean_tenure_months) %>% 
    # print it nicely
    kable() %>% 
    kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
```

<hr>
Rendered at <tt>`r Sys.time()`</tt>

## Source code
File is at `r fnamestr`.

### R code used in this document
```{r ref.label=knitr::all_labels(), echo=TRUE, eval=FALSE}
```

### Complete Rmd code
```{r comment=''}
cat(readLines(fnamepath), sep = '\n')
```