Types of variables

Diamonds

  • Go to RStudio Cloud and open Diamonds

Variable types

  • There are two major classes of variables
    • numeric (quantitative)
    • categorical

Variable types

  • Recall from the first week of class, you can use the glimpse() function to see all of your variables and their types
data("PorschePrice")
glimpse(PorschePrice)
## Observations: 30
## Variables: 3
## $ Price <dbl> 69.4, 56.9, 49.9, 47.4, 42.9, 36.9, 83.0, 72.9, 69.9, 67…
## $ Age <int> 3, 3, 2, 4, 4, 6, 0, 0, 2, 0, 2, 2, 4, 3, 10, 11, 4, 4, …
## $ Mileage <dbl> 21.50, 43.00, 19.90, 36.00, 44.00, 49.80, 1.30, 0.67, 13…
  • What are the variables here?

Variable types

  • Recall from the first week of class, you can use the glimpse() function to see all of your variables and their types
data("Diamonds")
glimpse(Diamonds)
## Observations: 351
## Variables: 6
## $ Carat <dbl> 1.08, 0.31, 0.31, 0.32, 0.33, 0.33, 0.35, 0.35, 0.37,…
## $ Color <fct> E, F, H, F, D, G, F, F, F, D, E, F, D, D, F, F, D, D,…
## $ Clarity <fct> VS1, VVS1, VS1, VVS1, IF, VVS1, VS1, VS1, VVS1, IF, V…
## $ Depth <dbl> 68.6, 61.9, 62.1, 60.8, 60.8, 61.5, 62.5, 62.3, 61.4,…
## $ PricePerCt <dbl> 6693.3, 3159.0, 1755.0, 3159.0, 4758.8, 2895.8, 2457.…
## $ TotalPrice <dbl> 7228.8, 979.3, 544.1, 1010.9, 1570.4, 955.6, 860.0, 8…
  • What are the variables here?
  • fct: "factor", a type of categorical variable

Variable types

  • Recall from the first week of class, you can use the glimpse() function to see all of your variables and their types
glimpse(starwars)
## Observations: 87
## Variables: 5
## $ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "L…
## $ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, …
## $ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.…
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "bro…
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "lig…
  • chr: "character", a type of categorical variable
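
A minimal sketch of how these types look in base R, using a small made-up data frame (the object `toy` and its values are hypothetical, not part of the course data). `class()` reports `numeric` for a column that glimpse() would label `<dbl>`, and both factor and character columns would be treated as categorical.

```r
# Hypothetical toy data: one numeric, one factor, one character column
toy <- data.frame(
  carat = c(0.31, 0.35, 1.08),
  color = factor(c("D", "E", "D")),
  name  = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Report the class of each column
types <- sapply(toy, class)
print(types)

# A factor stores a fixed set of levels; a character vector does not
levels(toy$color)
```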

Variable types

  • So far, our models have only included numeric (quantitative) variables
    • What would the equation be for predicting y from x when x is numeric?
  • What would happen if x is categorical?
    • What would the equation be for predicting y from x if x is categorical with 2 levels?
    • What would the equation be for predicting y from x if x is categorical with 3 levels?
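
One way to write these three equations, as a sketch in the usual simple linear regression notation (the symbols $\beta_0$, $\beta_1$, $\beta_2$, $\varepsilon$ and the level names A, B, C are assumptions for illustration). A numeric x enters the model directly, while a categorical x enters through 0/1 terms, one fewer than the number of levels.

```latex
% x numeric
y = \beta_0 + \beta_1 x + \varepsilon

% x categorical with 2 levels (A, B); A is taken as the baseline level
y = \beta_0 + \beta_1 \, \mathbf{1}\{x = B\} + \varepsilon

% x categorical with 3 levels (A, B, C); A is taken as the baseline level
y = \beta_0 + \beta_1 \, \mathbf{1}\{x = B\} + \beta_2 \, \mathbf{1}\{x = C\} + \varepsilon
```

Here $\mathbf{1}\{x = B\}$ equals 1 when x is at level B and 0 otherwise.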

Indicator variable

An indicator variable uses two values, usually 0 and 1, to indicate whether a data case does (1) or does not (0) belong to a specific category.

data("Diamonds")

Indicator variables

What does this code do?

Diamonds <- Diamonds %>%
  mutate(
    ColorD = ifelse(Color == "D", 1, 0),
    ColorE = ifelse(Color == "E", 1, 0),
    ColorF = ifelse(Color == "F", 1, 0),
    ColorG = ifelse(Color == "G", 1, 0),
    ColorH = ifelse(Color == "H", 1, 0),
    ColorI = ifelse(Color == "I", 1, 0),
    ColorJ = ifelse(Color == "J", 1, 0)
  )
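
To see what this kind of code produces, here is a minimal base R sketch on a made-up three-row data frame (the object `toy` is hypothetical; the slide itself uses `mutate()` on the full Diamonds data). Each new column is 1 where the color matches and 0 otherwise.

```r
# Hypothetical toy data with three colors
toy <- data.frame(Color = factor(c("D", "E", "J")))

# Same idea as the mutate() call: 1 if the row has that color, 0 otherwise
toy$ColorD <- ifelse(toy$Color == "D", 1, 0)
toy$ColorE <- ifelse(toy$Color == "E", 1, 0)
toy$ColorJ <- ifelse(toy$Color == "J", 1, 0)

print(toy)

# Every row belongs to exactly one color, so the indicators sum to 1 in each row
indicator_sums <- rowSums(toy[, c("ColorD", "ColorE", "ColorJ")])
print(indicator_sums)
```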

Indicator variables

What if I wanted to model the relationship between TotalPrice and Color?

Indicator variables

Why is ColorJ NA?

lm(TotalPrice ~ ColorD + ColorE + ColorF + ColorG + ColorH + ColorI + ColorJ,
   data = Diamonds)
##
## Call:
## lm(formula = TotalPrice ~ ColorD + ColorE + ColorF + ColorG +
##     ColorH + ColorI + ColorJ, data = Diamonds)
##
## Coefficients:
## (Intercept)       ColorD       ColorE       ColorF       ColorG
##        1936         3632         2423         7224         7623
##      ColorH       ColorI       ColorJ
##        6732         5704           NA
  • When a model with an intercept includes indicator variables for a categorical variable with k categories, include only k - 1 of them
  • The category that is left out is the "reference" category
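
The reason for the NA can be checked on a small made-up example (a sketch; the data frame `toy` and its prices are hypothetical). When every category has its own indicator, the indicators add up to 1 in every row, which is exactly the intercept column, so one coefficient cannot be estimated and `lm()` reports it as NA.

```r
# Hypothetical toy data: three colors, two diamonds each
toy <- data.frame(
  Color = c("D", "D", "E", "E", "J", "J"),
  TotalPrice = c(5000, 6000, 4000, 5000, 1500, 2500)
)
toy$ColorD <- ifelse(toy$Color == "D", 1, 0)
toy$ColorE <- ifelse(toy$Color == "E", 1, 0)
toy$ColorJ <- ifelse(toy$Color == "J", 1, 0)

# All k = 3 indicators plus an intercept: one of them is redundant
fit_all <- lm(TotalPrice ~ ColorD + ColorE + ColorJ, data = toy)
print(coef(fit_all))

# Only k - 1 = 2 indicators: J becomes the reference category
fit_ref <- lm(TotalPrice ~ ColorD + ColorE, data = toy)
print(coef(fit_ref))
```

The coefficients that are estimated in the first fit agree with those in the second, which is why dropping the redundant indicator loses nothing.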

Indicator variables

What is the reference category?

lm(TotalPrice ~ ColorD + ColorE + ColorF + ColorG + ColorH + ColorI,
   data = Diamonds)
##
## Call:
## lm(formula = TotalPrice ~ ColorD + ColorE + ColorF + ColorG +
##     ColorH + ColorI, data = Diamonds)
##
## Coefficients:
## (Intercept)       ColorD       ColorE       ColorF       ColorG
##        1936         3632         2423         7224         7623
##      ColorH       ColorI
##        6732         5704
  • Interpretation: A diamond with color D has an expected total price that is 3632 higher than a diamond with color J (the reference category).
  • Interpretation: A diamond with color E has an expected total price that is 2423 higher than a diamond with color J.
  • What is the interpretation for a diamond with Color F?
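
A quick way to check this kind of interpretation is to compare the coefficients with group means. The sketch below uses a made-up data frame (`toy` and its prices are hypothetical): with J as the reference category, the intercept is the mean price for color J, and each coefficient is the difference between that color's mean and the J mean.

```r
# Hypothetical toy data: three colors, two diamonds each
toy <- data.frame(
  Color = c("D", "D", "F", "F", "J", "J"),
  TotalPrice = c(5000, 6000, 9000, 9500, 1500, 2500)
)
toy$ColorD <- ifelse(toy$Color == "D", 1, 0)
toy$ColorF <- ifelse(toy$Color == "F", 1, 0)

# J has no indicator in the model, so it is the reference category
fit <- lm(TotalPrice ~ ColorD + ColorF, data = toy)

# Mean total price within each color
group_means <- tapply(toy$TotalPrice, toy$Color, mean)

print(coef(fit))
print(group_means)

# Differences from the J mean, to compare with the ColorD and ColorF coefficients
diff_D <- group_means[["D"]] - group_means[["J"]]
diff_F <- group_means[["F"]] - group_means[["J"]]
print(c(diff_D, diff_F))
```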

R is smart

What is the reference category?

lm(TotalPrice ~ Color, data = Diamonds)
##
## Call:
## lm(formula = TotalPrice ~ Color, data = Diamonds)
##
## Coefficients:
## (Intercept)       ColorE       ColorF       ColorG       ColorH
##        5569        -1209         3592         3990         3100
##      ColorI       ColorJ
##        2071        -3632
  • What is the interpretation for Color E now?
  • What if we wanted a different reference category?
    • We could code the indicators ourselves
    • We could use the forcats package

forcats

  • R uses factors to handle categorical variables, variables that have a fixed and known set of possible values.
  • The forcats package is loaded with the tidyverse; it helps you do things like reorder the levels of your factors

Source: forcats.tidyverse.org

forcats

levels(Diamonds$Color)
## [1] "D" "E" "F" "G" "H" "I" "J"
new_levels <- c("J", "D", "E", "F", "G", "H", "I")
Diamonds <- Diamonds %>%
  mutate(Color = fct_relevel(Color, new_levels))
levels(Diamonds$Color)
## [1] "J" "D" "E" "F" "G" "H" "I"
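
If only the reference level needs to change, base R's `relevel()` is an alternative that moves a single level to the front and keeps the others in order. A minimal sketch on a made-up factor (the object `color` is hypothetical):

```r
# Hypothetical factor with levels in alphabetical order
color <- factor(c("D", "E", "J", "D"))
print(levels(color))

# Move "J" to the first position so it becomes the reference level
color_j <- relevel(color, ref = "J")
print(levels(color_j))
```

Because `lm()` treats the first level of a factor as the reference category, either approach changes which category the coefficients are compared against.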

R is smart

What is the reference category?

lm(TotalPrice ~ Color, data = Diamonds)
##
## Call:
## lm(formula = TotalPrice ~ Color, data = Diamonds)
##
## Coefficients:
## (Intercept)       ColorD       ColorE       ColorF       ColorG
##        1936         3632         2423         7224         7623
##      ColorH       ColorI
##        6732         5704

What if the variable is binary?

  • A binary variable is a special type of categorical variable with two levels

ICU example

  • A sample of 200 patients in an intensive care unit (ICU)
  • We want to see whether a patient's heart rate is related to whether they were admitted via the emergency room
    • y: Heart rate (beats per minute)
    • x: indicator for emergency room admission
  • Aside: Is this inference or prediction?

Binary x variable

data("ICU")
lm(Pulse ~ Emergency, data = ICU)
##
## Call:
## lm(formula = Pulse ~ Emergency, data = ICU)
##
## Coefficients:
## (Intercept)    Emergency
##       91.11        10.63
  • How can we interpret $\hat{\beta}_0$ now?
  • How can we interpret $\hat{\beta}_1$?
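
As a worked sketch, plug the two possible values of x into the fitted line. This assumes Emergency is coded 1 for patients admitted via the emergency room and 0 otherwise, which the output alone does not confirm.

```latex
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = 91.11 + 10.63\,x

% x = 0 (not admitted via the emergency room)
\hat{y} = 91.11 + 10.63(0) = 91.11

% x = 1 (admitted via the emergency room)
\hat{y} = 91.11 + 10.63(1) = 101.74
```

Under that coding, $\hat{\beta}_0$ is the estimated mean heart rate for patients not admitted via the emergency room, and $\hat{\beta}_1$ is the estimated difference in mean heart rate between the two groups.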

Diamonds

  • Go to RStudio Cloud and open Diamonds