class: center, middle, inverse, title-slide

# Types of variables

---
layout: true

<div class="my-footer">
<span>
by Dr. Lucy D'Agostino McGowan
</span>
</div>

---

## <i class="fas fa-laptop"></i> `Diamonds`

- Go to RStudio Cloud and open `Diamonds`

---

## Variable types

* There are two major classes of variables
  * numeric (quantitative)
  * categorical

---

## Variable types

* Recall from the first week of class, you can use the `glimpse()` function to see all of your variables and their types

--

.small[

```r
data("PorschePrice")
glimpse(PorschePrice)
```

```
## Observations: 30
## Variables: 3
## $ Price   <dbl> 69.4, 56.9, 49.9, 47.4, 42.9, 36.9, 83.0, 72.9, 69.9, 67…
## $ Age     <int> 3, 3, 2, 4, 4, 6, 0, 0, 2, 0, 2, 2, 4, 3, 10, 11, 4, 4, …
## $ Mileage <dbl> 21.50, 43.00, 19.90, 36.00, 44.00, 49.80, 1.30, 0.67, 13…
```
]

--

* What are the variables here?

---

## Variable types

* Recall from the first week of class, you can use the `glimpse()` function to see all of your variables and their types

.small[

```r
data("Diamonds")
glimpse(Diamonds)
```

```
## Observations: 351
## Variables: 6
## $ Carat      <dbl> 1.08, 0.31, 0.31, 0.32, 0.33, 0.33, 0.35, 0.35, 0.37,…
## $ Color      <fct> E, F, H, F, D, G, F, F, F, D, E, F, D, D, F, F, D, D,…
## $ Clarity    <fct> VS1, VVS1, VS1, VVS1, IF, VVS1, VS1, VS1, VVS1, IF, V…
## $ Depth      <dbl> 68.6, 61.9, 62.1, 60.8, 60.8, 61.5, 62.5, 62.3, 61.4,…
## $ PricePerCt <dbl> 6693.3, 3159.0, 1755.0, 3159.0, 4758.8, 2895.8, 2457.…
## $ TotalPrice <dbl> 7228.8, 979.3, 544.1, 1010.9, 1570.4, 955.6, 860.0, 8…
```
]

--

* What are the variables here?

--

* `fct`: "factor", a type of categorical variable

---

## Variable types

* Recall from the first week of class, you can use the `glimpse()` function to see all of your variables and their types

.small[

```r
glimpse(starwars)
```
]

.small[

```
## Observations: 87
## Variables: 5
## $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "L…
## $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, …
## $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.…
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "bro…
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "lig…
```
]

--

* `chr`: "character", a type of categorical variable

---

## Variable types

* So far, our models have only included _numeric_ (_quantitative_) variables

--

* What would the equation be for predicting `\(y\)` from `\(x\)` when `\(x\)` is numeric?

--

* What would happen if `\(x\)` is categorical?

--

* What would the equation be for predicting `\(y\)` from `\(x\)` if `\(x\)` is categorical with 2 levels?

--

* What would the equation be for predicting `\(y\)` from `\(x\)` if `\(x\)` is categorical with 3 levels?

---
class: middle

## Indicator variable

An **indicator variable** uses two values, usually 0 and 1, to indicate whether a data case does (1) or does not (0) belong to a specific category

---

```r
data("Diamonds")
```

.small[
]

---

## Indicator variables

.question[
What does this line of code do?
]

```r
Diamonds <- Diamonds %>%
  mutate(
*   ColorD = ifelse(Color == "D", 1, 0),
    ColorE = ifelse(Color == "E", 1, 0),
    ColorF = ifelse(Color == "F", 1, 0),
    ColorG = ifelse(Color == "G", 1, 0),
    ColorH = ifelse(Color == "H", 1, 0),
    ColorI = ifelse(Color == "I", 1, 0),
    ColorJ = ifelse(Color == "J", 1, 0)
  )
```

---

## Indicator variables

.question[
What does this line of code do?
]

```r
Diamonds <- Diamonds %>%
  mutate(
    ColorD = ifelse(Color == "D", 1, 0),
*   ColorE = ifelse(Color == "E", 1, 0),
    ColorF = ifelse(Color == "F", 1, 0),
    ColorG = ifelse(Color == "G", 1, 0),
    ColorH = ifelse(Color == "H", 1, 0),
    ColorI = ifelse(Color == "I", 1, 0),
    ColorJ = ifelse(Color == "J", 1, 0)
  )
```

---

## Indicator variables

.small[
]

---

## Indicator variables

.question[
What if I wanted to model the relationship between `TotalPrice` and `Color`?
]

.small[
]

---

## Indicator variables

.question[
Why is `ColorJ` `NA`?
]

.small[

```r
lm(TotalPrice ~ ColorD + ColorE + ColorF + ColorG + ColorH + ColorI + ColorJ,
   data = Diamonds)
```

```
## 
## Call:
## lm(formula = TotalPrice ~ ColorD + ColorE + ColorF + ColorG + 
##     ColorH + ColorI + ColorJ, data = Diamonds)
## 
## Coefficients:
## (Intercept)       ColorD       ColorE       ColorF       ColorG
##        1936         3632         2423         7224         7623
*##      ColorH       ColorI       ColorJ
*##        6732         5704           NA
```
]

--

* When including indicator variables in a model for `k` categories, always include `k-1`

--

* The one that is left out is the "reference" category

---

## Indicator variables

.question[
What is the reference category?
]

.small[

```r
lm(TotalPrice ~ ColorD + ColorE + ColorF + ColorG + ColorH + ColorI,
   data = Diamonds)
```

```
## 
## Call:
## lm(formula = TotalPrice ~ ColorD + ColorE + ColorF + ColorG + 
##     ColorH + ColorI, data = Diamonds)
## 
## Coefficients:
## (Intercept)       ColorD       ColorE       ColorF       ColorG
##        1936         3632         2423         7224         7623
##      ColorH       ColorI
##        6732         5704
```
]

--

* **Interpretation:** A diamond with Color `D` has an expected total price 3632 higher than a diamond with Color `J`.

--

* **Interpretation:** A diamond with Color `E` has an expected total price 2423 higher than a diamond with Color `J`.

---

## Indicator variables

.question[
What is the reference category?
]

.small[

```r
lm(TotalPrice ~ ColorD + ColorE + ColorF + ColorG + ColorH + ColorI,
   data = Diamonds)
```

```
## 
## Call:
## lm(formula = TotalPrice ~ ColorD + ColorE + ColorF + ColorG + 
##     ColorH + ColorI, data = Diamonds)
## 
## Coefficients:
## (Intercept)       ColorD       ColorE       ColorF       ColorG
##        1936         3632         2423         7224         7623
##      ColorH       ColorI
##        6732         5704
```
]

* **Interpretation:** A diamond with Color `D` has an expected total price 3632 higher than a diamond with Color `J`.

* What is the interpretation for a diamond with Color `F`?

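---

## Indicator variables

A minimal sketch of the `k-1` rule on a small made-up data frame. The names `toy`, `price`, and `grp` are hypothetical and not part of `Diamonds`, and only base R is assumed. With all three indicators, `lm()` should report `NA` for the redundant one; with two, the left-out group acts as the reference.

.small[
```r
# Hypothetical toy data: 3 groups, 2 observations each
toy <- data.frame(
  price = c(10, 12, 20, 22, 30, 32),
  grp   = c("A", "A", "B", "B", "C", "C")
)

# Code the indicators by hand, as on the earlier slides
toy$grpA <- ifelse(toy$grp == "A", 1, 0)
toy$grpB <- ifelse(toy$grp == "B", 1, 0)
toy$grpC <- ifelse(toy$grp == "C", 1, 0)

# All k = 3 indicators: grpC is redundant given the intercept, grpA and grpB
fit_all <- lm(price ~ grpA + grpB + grpC, data = toy)
coef(fit_all)

# k - 1 = 2 indicators: group C is the reference category
fit_ref <- lm(price ~ grpA + grpB, data = toy)
coef(fit_ref)

# Group means, for comparison with the coefficients
mean_A <- mean(toy$price[toy$grp == "A"])
mean_C <- mean(toy$price[toy$grp == "C"])

# Intercept vs. reference-group mean; grpA coefficient vs. difference in means
all.equal(coef(fit_ref)[["(Intercept)"]], mean_C)
all.equal(coef(fit_ref)[["grpA"]], mean_A - mean_C)
```
]

With group `C` left out, the intercept lines up with the mean `price` in group `C`, and each coefficient lines up with that group's mean minus the mean of group `C`.
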
---

## R is smart

.small[

```r
lm(TotalPrice ~ Color, data = Diamonds)
```

```
## 
## Call:
## lm(formula = TotalPrice ~ Color, data = Diamonds)
## 
## Coefficients:
## (Intercept)       ColorE       ColorF       ColorG       ColorH
##        5569        -1209         3592         3990         3100
##      ColorI       ColorJ
##        2071        -3632
```
]

---

## R is smart

.question[
What is the reference category?
]

.small[

```r
lm(TotalPrice ~ Color, data = Diamonds)
```

```
## 
## Call:
## lm(formula = TotalPrice ~ Color, data = Diamonds)
## 
## Coefficients:
## (Intercept)       ColorE       ColorF       ColorG       ColorH
##        5569        -1209         3592         3990         3100
##      ColorI       ColorJ
##        2071        -3632
```
]

--

* What is the interpretation for Color `E` now?

--

* What if we wanted a different reference category?

--

* We could code the indicators ourselves

--

* We could use the **forcats** package

---

## forcats

.pull-right[
![](img/09/forcats-logo.png)
]

* R uses factors to handle categorical variables, variables that have a fixed and known set of possible values.
* The **forcats** package is loaded with the **tidyverse**. It helps you do things like reorder the levels of your factors.

Source: [forcats.tidyverse.org](https://forcats.tidyverse.org)

---

## forcats

```r
levels(Diamonds$Color)
```

```
## [1] "D" "E" "F" "G" "H" "I" "J"
```

--

```r
new_levels <- c("J", "D", "E", "F", "G", "H", "I")
Diamonds <- Diamonds %>%
  mutate(Color = fct_relevel(Color, new_levels))
```

```r
levels(Diamonds$Color)
```

```
## [1] "J" "D" "E" "F" "G" "H" "I"
```

---

## R is smart

.small[

```r
lm(TotalPrice ~ Color, data = Diamonds)
```

```
## 
## Call:
## lm(formula = TotalPrice ~ Color, data = Diamonds)
## 
## Coefficients:
## (Intercept)       ColorD       ColorE       ColorF       ColorG
##        1936         3632         2423         7224         7623
##      ColorH       ColorI
##        6732         5704
```
]

---

## R is smart

.question[
What is the reference category?
]

.small[

```r
lm(TotalPrice ~ Color, data = Diamonds)
```

```
## 
## Call:
## lm(formula = TotalPrice ~ Color, data = Diamonds)
## 
## Coefficients:
## (Intercept)       ColorD       ColorE       ColorF       ColorG
##        1936         3632         2423         7224         7623
##      ColorH       ColorI
##        6732         5704
```
]

---

## What if the variable is **binary**?

* A **binary** variable is a special type of categorical variable with **two levels**

---

## ICU example

* A sample of 200 patients in an ICU
* We want to see if the patient's heart rate is related to whether they were admitted via the emergency room

--

* y: Heart rate (beats per minute)
* x: indicator for emergency room admission

--

* Aside: Is this inference or prediction?

---

## Binary x variable

```r
data("ICU")
lm(Pulse ~ Emergency, data = ICU)
```

```
## 
## Call:
## lm(formula = Pulse ~ Emergency, data = ICU)
## 
## Coefficients:
## (Intercept)    Emergency
##       91.11        10.63
```

--

* How can we interpret `\(\hat{\beta}_0\)` now?

--

* How can we interpret `\(\hat{\beta}_1\)`?

---

## <i class="fas fa-laptop"></i> `Diamonds`

- Go to RStudio Cloud and open `Diamonds`

---
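## Binary x variable

A small sketch of how the two coefficients behave when `\(x\)` is a 0/1 indicator. The data frame `toy_icu` and its values are made up for illustration and are not taken from `ICU`. Under that assumption, the intercept should equal the mean of `\(y\)` in the 0 group and the slope should equal the difference between the two group means.

.small[
```r
# Hypothetical toy data: pulse for patients with emergency = 0 or 1
toy_icu <- data.frame(
  pulse     = c(80, 90, 100, 95, 105, 115),
  emergency = c(0, 0, 0, 1, 1, 1)
)

# Simple linear regression with a binary x variable
fit <- lm(pulse ~ emergency, data = toy_icu)
coef(fit)

# Mean pulse in each group
mean_0 <- mean(toy_icu$pulse[toy_icu$emergency == 0])
mean_1 <- mean(toy_icu$pulse[toy_icu$emergency == 1])

# Intercept vs. mean of the 0 group; slope vs. difference in group means
all.equal(coef(fit)[["(Intercept)"]], mean_0)
all.equal(coef(fit)[["emergency"]], mean_1 - mean_0)
```
]

The same check can be run on the fitted `ICU` model by comparing its coefficients with the mean `Pulse` in each `Emergency` group.
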