Unusual Observations


Starwars (2)

  • Go to RStudio Cloud and open Starwars (2)

Definitions

How does your book define an "outlier"?


Definitions

How does your book define an "influential point"?


Example

lm(mass ~ height, data = starwars)
##
## Call:
## lm(formula = mass ~ height, data = starwars)
##
## Coefficients:
## (Intercept) height
## -13.8103 0.6386
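
As a worked reading of these coefficients, the sketch below plugs one height into the fitted line. The 180 cm height is a hypothetical input chosen for illustration, and the coefficients are copied from the rounded output above, so the result is approximate.

```r
# Coefficients copied from the printed lm() output (rounded)
intercept <- -13.8103
slope <- 0.6386

# Hypothetical character height in cm (illustrative value, not from the slides)
height_cm <- 180

# Predicted mass (kg) = intercept + slope * height
predicted_mass <- intercept + slope * height_cm
predicted_mass
```

Read this way, each additional centimetre of height is associated with roughly 0.64 kg more predicted mass.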

Example

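library(tidyverse)  # provides %>%, filter(), mutate(), ggplot(), and starwars

# Fitted values; lm() drops rows with a missing height or mass
y_hat <- lm(mass ~ height, data = starwars) %>%
  predict()

starwars %>%
  # keep the same rows lm() used, so y_hat lines up with mass
  filter(!is.na(height) & !is.na(mass)) %>%
  mutate(residual = mass - y_hat) %>%
  ggplot(aes(y_hat, residual)) +
  geom_point() +
  geom_hline(yintercept = 0)  # reference line at zero residual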
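
The residual in the code above is computed by hand as observed minus predicted. A minimal sketch of why the rows have to be filtered first, using base R only: the built-in `cars` data with two values set to missing is an assumption standing in for `starwars`, and `cars_na` and `kept` are hypothetical names.

```r
# Assumption: built-in `cars` data with two missing responses stands in for starwars
cars_na <- cars
cars_na$dist[c(3, 10)] <- NA

model <- lm(dist ~ speed, data = cars_na)

# predict() returns one fitted value per row actually used in the fit
y_hat <- predict(model)
length(y_hat)   # two fewer than nrow(cars_na)

# Keep the same complete rows so the subtraction lines up
kept <- !is.na(cars_na$speed) & !is.na(cars_na$dist)
manual_resid <- cars_na$dist[kept] - y_hat

# residuals() gives the same quantity directly
all.equal(unname(manual_resid), unname(residuals(model)))
```

Using `residuals(model)` avoids the manual filtering step, at the cost of hiding what a residual is.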


Example

What does this line of code do?

y_hat <- lm(mass ~ height, data = starwars) %>%
predict()
starwars %>%
filter(!is.na(height) & !is.na(mass)) %>%
mutate(residual = mass - y_hat) %>%
ggplot(aes(y_hat, residual)) +
geom_point() +
geom_hline(yintercept = 0)

Example

Is this an outlier?


Example

Gold-medal-winning distances (m) for the men's Olympic long jump, 1900–2008

How can we tell if a residual is "unusually" large?

Do we have a "typical" error we can standardize by?


Standardize residuals

  • $\hat{\sigma}_\epsilon$: reflects the typical error
  • $\dfrac{\text{residual}}{\hat{\sigma}_\epsilon}$
  • $\dfrac{y - \hat{y}}{\hat{\sigma}_\epsilon}$
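
The ratio above can be sketched in R as follows. This is a minimal illustration using the built-in `cars` data (an assumption, not the slide data); `sigma_hat` and `std_resid` are hypothetical names.

```r
# Assumption: built-in `cars` data used purely for illustration
model <- lm(dist ~ speed, data = cars)

# Estimated standard deviation of the regression error
sigma_hat <- sigma(model)

# The same quantity by hand: sqrt(SSE / residual degrees of freedom)
sigma_by_hand <- sqrt(sum(residuals(model)^2) / df.residual(model))

# Residuals divided by the typical error size
std_resid <- residuals(model) / sigma_hat
head(std_resid)
```

Note that R's `rstandard()` additionally adjusts each residual for its leverage, so its values differ slightly from this simple ratio.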

Studentized residuals

  • Another option is to estimate the standard deviation of the regression error using a model that is fit after omitting the point in question
  • In R: rstudent()
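
To make the "omit the point in question" idea concrete, the sketch below refits the model without one observation and rebuilds that observation's studentized residual by hand. It uses the built-in `cars` data and an arbitrarily chosen row `i`, both assumptions for illustration.

```r
# Assumption: built-in `cars` data; row i is an arbitrary illustrative choice
model <- lm(dist ~ speed, data = cars)
i <- 49

# Refit after omitting observation i and take that model's error estimate
fit_minus_i <- lm(dist ~ speed, data = cars[-i, ])
sigma_minus_i <- sigma(fit_minus_i)

# Leverage of observation i in the full model
h_i <- hatvalues(model)[i]

# Residual from the full model, scaled by the leave-one-out error estimate
manual <- residuals(model)[i] / (sigma_minus_i * sqrt(1 - h_i))

# Matches what rstudent() reports for the same observation
all.equal(unname(manual), unname(rstudent(model)[i]))
```

`rstudent()` gets the same values without refitting the model once per observation.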

Example

What is model?

model <- lm(Gold ~ Year, data = LongJumpOlympics)

Example

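model <- lm(Gold ~ Year, data = LongJumpOlympics)

# Option 1: predict from the stored model
y_hat <- model %>%
  predict()

# Option 2: fit and predict in one pipeline
y_hat <- lm(Gold ~ Year, data = LongJumpOlympics) %>%
  predict()

# Option 3: add the predictions to the data as a new column
LongJumpOlympics %>%
  mutate(y_hat = model %>% predict())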

Example

model <- lm(Gold ~ Year, data = LongJumpOlympics)
LongJumpOlympics %>%
  mutate(y_hat = model %>% predict(),
         stud_resid = model %>% rstudent())
## Year Gold y_hat stud_resid
## 1 1900 7.185 7.241150 -0.24969110
## 2 1904 7.340 7.297413 0.18773767
## 3 1906 7.200 7.325544 -0.55459469
## 4 1908 7.480 7.353675 0.55605557
## 5 1912 7.600 7.409938 0.83801927
## 6 1920 7.150 7.522463 -1.69661296
## 7 1924 7.445 7.578726 -0.57565964
## 8 1928 7.730 7.634988 0.40587196
## 9 1932 7.640 7.691251 -0.21761617
## 10 1936 8.060 7.747514 1.37486325
## 11 1948 7.825 7.916301 -0.38535068
## 12 1952 7.570 7.972564 -1.80894501
## 13 1956 7.830 8.028827 -0.84888005
## 14 1960 8.120 8.085089 0.14690763
## 15 1964 8.070 8.141352 -0.30102045
## 16 1968 8.900 8.197615 3.76651449
## 17 1972 8.240 8.253877 -0.05865636
## 18 1976 8.350 8.310140 0.16903844
## 19 1980 8.540 8.366402 0.74709891
## 20 1984 8.540 8.422665 0.50367210
## 21 1988 8.720 8.478928 1.05875652
## 22 1992 8.670 8.535190 0.58546175
## 23 1996 8.500 8.591453 -0.39790914
## 24 2000 8.550 8.647716 -0.42816057
## 25 2004 8.590 8.703978 -0.50378890
## 26 2008 8.370 8.760241 -1.85376067

Example

model <- lm(Gold ~ Year, data = LongJumpOlympics)
LongJumpOlympics %>%
  mutate(y_hat = model %>% predict(),
         stud_resid = model %>% rstudent()) %>%
  ggplot(aes(Year, stud_resid)) +
  geom_point() +
  geom_hline(yintercept = c(2, 4, -2, -4), lty = 2) +
  labs(y = "studentized residual")
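
The dashed lines at ±2 and ±4 can also be used to pick out observations in a table rather than by eye. A minimal sketch, assuming the lines are meant as cut-offs and using the built-in `cars` data in place of `LongJumpOlympics`; `flagged` is a hypothetical name.

```r
# Assumption: built-in `cars` data used purely for illustration
model <- lm(dist ~ speed, data = cars)
stud <- rstudent(model)

# Observations whose studentized residual falls outside the +/- 2 lines
is_unusual <- abs(stud) > 2
flagged <- cars[is_unusual, ]
flagged$stud_resid <- stud[is_unusual]
flagged

# How many fall outside the wider +/- 4 lines
sum(abs(stud) > 4)
```

Changing the cut-off from 2 to 4 in the first comparison would list only the most extreme points.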


Influential points

Would removing the observation from the dataset change the regression equation by much?
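
One way to explore this question is to refit the model once per observation, leaving that observation out, and record how far the slope moves. This is a sketch under stated assumptions: it uses the built-in `cars` data, and `slope_without` and `slope_change` are hypothetical names.

```r
# Assumption: built-in `cars` data used purely for illustration
full_fit <- lm(dist ~ speed, data = cars)
full_slope <- coef(full_fit)[["speed"]]

# Slope of the regression refit without observation i, for every i
slope_without <- sapply(seq_len(nrow(cars)), function(i) {
  coef(lm(dist ~ speed, data = cars[-i, ]))[["speed"]]
})

# Change in slope caused by dropping each observation
slope_change <- slope_without - full_slope

# The observation whose removal moves the slope the most
most_influential <- which.max(abs(slope_change))
cars[most_influential, ]
slope_change[most_influential]
```

A point that produces a large change relative to the others is a candidate influential point; the same loop could track the intercept instead.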


Example

lm(mass ~ height, data = starwars)
##
## Call:
## lm(formula = mass ~ height, data = starwars)
##
## Coefficients:
## (Intercept) height
## -13.8103 0.6386
starwars %>%
  filter(name != "Jabba Desilijic Tiure") %>%
  lm(mass ~ height, data = .)
##
## Call:
## lm(formula = mass ~ height, data = .)
##
## Coefficients:
## (Intercept) height
## -32.5408 0.6214

Starwars (2)

  • Go to RStudio Cloud and open Starwars (2)
  • For each question you work on, set the eval chunk option to TRUE and knit