Unusual Observations

Unusual Observations1 / 25

Dr. Lucy D'Agostino McGowan

 Starwars (2)Go to RStudio Cloud and open Starwars (2)
2 / 25

Definitions

How does your book define an "outlier"?

3 / 25

Definitions

How does your book define an "influential point"?

4 / 25

5 / 25

Example

lm(mass ~ height, data = starwars)

## 
## Call:
## lm(formula = mass ~ height, data = starwars)
## 
## Coefficients:
## (Intercept)       height  
##    -13.8103       0.6386

6 / 25

Example

y_hat <- lm(mass ~ height, data = starwars) %>%
  predict()
starwars %>%
  filter(!is.na(height) & !is.na(mass)) %>%
  mutate(residual = mass - y_hat) %>%
  ggplot(aes(y_hat, residual)) + 
  geom_point() + 
  geom_hline(yintercept = 0)

7 / 25

Example

What does this line of code do?

y_hat <- lm(mass ~ height, data = starwars) %>%
  predict()
starwars %>%
  filter(!is.na(height) & !is.na(mass)) %>%
  mutate(residual = mass - y_hat) %>%
  ggplot(aes(y_hat, residual)) + 
  geom_point() + 
  geom_hline(yintercept = 0)

8 / 25

Example

What does this line of code do?

y_hat <- lm(mass ~ height, data = starwars) %>%
  predict()
starwars %>%
  filter(!is.na(height) & !is.na(mass)) %>% 
  mutate(residual = mass - y_hat) %>%
  ggplot(aes(y_hat, residual)) + 
  geom_point() + 
  geom_hline(yintercept = 0)

9 / 25

Example

Is this an outlier?

10 / 25

Example

Gold-medal-winning distances (m) for the men's Olympic long jump, 1900–2008

11 / 25

Example

Gold-medal-winning distances (m) for the men's Olympic long jump, 1900–2008

12 / 25

Example

Gold-medal-winning distances (m) for the men's Olympic long jump, 1900–2008

13 / 25

Dr. Lucy D'Agostino McGowan

How can we tell if a residual is "unusually" large?14 / 25

Dr. Lucy D'Agostino McGowan

How can we tell if a residual is "unusually" large?Do we have a "typical" error we can standardize by?15 / 25

Dr. Lucy D'Agostino McGowan

Standardize residualsˆσϵ: reflects the typical error
16 / 25

Dr. Lucy D'Agostino McGowan

Standardize residualsˆσϵ: reflects the typical error
residualˆσϵ
16 / 25

Dr. Lucy D'Agostino McGowan

Standardize residualsˆσϵ: reflects the typical error
residualˆσϵ
y−ˆyˆσϵ
16 / 25

Dr. Lucy D'Agostino McGowan

Studentized residualsAnother option is to estimate the standard deviation of the regression error using a model that is fit after omitting the point in question
17 / 25

Dr. Lucy D'Agostino McGowan

Studentized residualsAnother option is to estimate the standard deviation of the regression error using a model that is fit after omitting the point in question
In R: rstudent()
17 / 25

Example

What is model?

model <- lm(Gold ~ Year, data = LongJumpOlympics)

18 / 25

Example

model <- lm(Gold ~ Year, data = LongJumpOlympics) 
y_hat <- model %>%
  predict()

19 / 25

Example

model <- lm(Gold ~ Year, data = LongJumpOlympics) 
y_hat <- model %>%
  predict()

y_hat <- lm(Gold ~ Year, data = LongJumpOlympics) %>%
  predict()

19 / 25

Example

model <- lm(Gold ~ Year, data = LongJumpOlympics) 
y_hat <- model %>%
  predict()

y_hat <- lm(Gold ~ Year, data = LongJumpOlympics) %>%
  predict()

LongJumpOlympics %>%
  mutate(y_hat = model %>% predict())

19 / 25

Example

model <- lm(Gold ~ Year, data = LongJumpOlympics)
LongJumpOlympics %>%
  mutate(y_hat = model %>% predict(),
         stud_resid = model %>% rstudent())

##    Year  Gold    y_hat  stud_resid
## 1  1900 7.185 7.241150 -0.24969110
## 2  1904 7.340 7.297413  0.18773767
## 3  1906 7.200 7.325544 -0.55459469
## 4  1908 7.480 7.353675  0.55605557
## 5  1912 7.600 7.409938  0.83801927
## 6  1920 7.150 7.522463 -1.69661296
## 7  1924 7.445 7.578726 -0.57565964
## 8  1928 7.730 7.634988  0.40587196
## 9  1932 7.640 7.691251 -0.21761617
## 10 1936 8.060 7.747514  1.37486325
## 11 1948 7.825 7.916301 -0.38535068
## 12 1952 7.570 7.972564 -1.80894501
## 13 1956 7.830 8.028827 -0.84888005
## 14 1960 8.120 8.085089  0.14690763
## 15 1964 8.070 8.141352 -0.30102045
## 16 1968 8.900 8.197615  3.76651449
## 17 1972 8.240 8.253877 -0.05865636
## 18 1976 8.350 8.310140  0.16903844
## 19 1980 8.540 8.366402  0.74709891
## 20 1984 8.540 8.422665  0.50367210
## 21 1988 8.720 8.478928  1.05875652
## 22 1992 8.670 8.535190  0.58546175
## 23 1996 8.500 8.591453 -0.39790914
## 24 2000 8.550 8.647716 -0.42816057
## 25 2004 8.590 8.703978 -0.50378890
## 26 2008 8.370 8.760241 -1.85376067

20 / 25

Example

model <- lm(Gold ~ Year, data = LongJumpOlympics)
LongJumpOlympics %>%
  mutate(y_hat = model %>% predict(),
         stud_resid = model %>% rstudent()) %>%
  ggplot(aes(Year, stud_resid)) + 
  geom_point() + 
  geom_hline(yintercept = c(2, 4, -2, -4), lty = 2) + 
  labs(y = "studentized residual")

21 / 25

Influential points

Would removing the observation from the dataset change the regression equation by much?

22 / 25

Dr. Lucy D'Agostino McGowan

Examplelm(mass ~ height, data = starwars)

## 
## Call:
## lm(formula = mass ~ height, data = starwars)
## 
## Coefficients:
## (Intercept)       height  
##    -13.8103       0.6386
starwars %>%
  filter(name != "Jabba Desilijic Tiure") %>%
  lm(mass ~ height, data = .)

## 
## Call:
## lm(formula = mass ~ height, data = .)
## 
## Coefficients:
## (Intercept)       height  
##    -32.5408       0.6214
23 / 25

Example

24 / 25

Dr. Lucy D'Agostino McGowan

 Starwars (2)Go to RStudio Cloud and open Starwars (2)
For each question you work on, set the eval chunk option to TRUE and knit
25 / 25

↑, ←, Pg Up, k	Go to previous slide
↓, →, Pg Dn, Space, j	Go to next slide
Home	Go to first slide
End	Go to last slide
Number + Return	Go to specific slide
b / m / f	Toggle blackout / mirrored / fullscreen mode
c	Clone slideshow
p	Toggle presenter mode
t	Restart the presentation timer
?, h	Toggle this help

Unusual Observations

`Starwars (2)`

Definitions

Definitions

Example

Example

Example

Example

Example

Example

Example

Example

How can we tell if a residual is "unusually" large?

How can we tell if a residual is "unusually" large?

Do we have a "typical" error we can standardize by?

Standardize residuals

Standardize residuals

Standardize residuals

Studentized residuals

Studentized residuals

Example

Example

Example

Example

Example

Example

Influential points

Example

Example

`Starwars (2)`

`Starwars (2)`

Help