class: center, middle, inverse, title-slide # Unusual Observations --- layout: true <div class="my-footer"> <span> Dr. Lucy D'Agostino McGowan </span> </div> --- ## <i class="fas fa-laptop"></i> `Starwars (2)` - Go to RStudio Cloud and open `Starwars (2)` --- ## Definitions .question[ How does your book define an "outlier"? ] --- ## Definitions .question[ How does your book define an "influential point"? ] --- <img src="img/02/jabbaplot.png" width="768" /> --- ## Example ```r lm(mass ~ height, data = starwars) ``` ``` ## ## Call: ## lm(formula = mass ~ height, data = starwars) ## ## Coefficients: ## (Intercept) height ## -13.8103 0.6386 ``` --- ## Example .small[ ```r y_hat <- lm(mass ~ height, data = starwars) %>% predict() starwars %>% filter(!is.na(height) & !is.na(mass)) %>% mutate(residual = mass - y_hat) %>% ggplot(aes(y_hat, residual)) + geom_point() + geom_hline(yintercept = 0) ``` ![](04-unusual-observations_files/figure-html/unnamed-chunk-5-1.png)<!-- --> ] --- ## Example .question[ What does this line of code do? ] ```r y_hat <- lm(mass ~ height, data = starwars) %>% predict() starwars %>% * filter(!is.na(height) & !is.na(mass)) %>% mutate(residual = mass - y_hat) %>% ggplot(aes(y_hat, residual)) + geom_point() + geom_hline(yintercept = 0) ``` --- ## Example .question[ What does this line of code do? ] ```r y_hat <- lm(mass ~ height, data = starwars) %>% predict() starwars %>% filter(!is.na(height) & !is.na(mass)) %>% mutate(residual = mass - y_hat) %>% ggplot(aes(y_hat, residual)) + geom_point() + * geom_hline(yintercept = 0) ``` --- ## Example .question[ Is this an outlier? ] ![](04-unusual-observations_files/figure-html/unnamed-chunk-8-1.png)<!-- --> --- ## Example Gold-medal-winning distances (m) for the men's Olympic long jump, 1900–2008 ![](04-unusual-observations_files/figure-html/unnamed-chunk-9-1.png)<!-- --> --- ## Example Gold-medal-winning distances (m) for the men's Olympic long jump, 1900–2008 ![](04-unusual-observations_files/figure-html/unnamed-chunk-10-1.png)<!-- --> --- ## Example Gold-medal-winning distances (m) for the men's Olympic long jump, 1900–2008 ![](04-unusual-observations_files/figure-html/unnamed-chunk-11-1.png)<!-- --> --- class: center, middle ## How can we tell if a residual is "unusually" large? --- class: center, middle ## How can we tell if a residual is "unusually" large? ### Do we have a "typical" error we can standardize by? --- ## Standardize residuals * `\(\hat{\sigma}_\epsilon\)`: reflects the typical error -- * `\(\Large\frac{\textrm{residual}}{\hat{\sigma}_\epsilon}\)` -- * `\(\Large{\frac{y - \hat{y}}{\hat{\sigma}_\epsilon}}\)` --- ## Studentized residuals * Another option is to estimate the standard deviation of the regression error using a model that is fit **after omitting the point in question** -- * In R: `rstudent()` --- ## Example .question[ What is `model`? ] ```r model <- lm(Gold ~ Year, data = LongJumpOlympics) ``` --- ## Example ```r model <- lm(Gold ~ Year, data = LongJumpOlympics) y_hat <- model %>% predict() ``` -- ```r y_hat <- lm(Gold ~ Year, data = LongJumpOlympics) %>% predict() ``` -- ```r LongJumpOlympics %>% mutate(y_hat = model %>% predict()) ``` --- ## Example ```r model <- lm(Gold ~ Year, data = LongJumpOlympics) LongJumpOlympics %>% mutate(y_hat = model %>% predict(), * stud_resid = model %>% rstudent()) ``` ``` ## Year Gold y_hat stud_resid ## 1 1900 7.185 7.241150 -0.24969110 ## 2 1904 7.340 7.297413 0.18773767 ## 3 1906 7.200 7.325544 -0.55459469 ## 4 1908 7.480 7.353675 0.55605557 ## 5 1912 7.600 7.409938 0.83801927 ## 6 1920 7.150 7.522463 -1.69661296 ## 7 1924 7.445 7.578726 -0.57565964 ## 8 1928 7.730 7.634988 0.40587196 ## 9 1932 7.640 7.691251 -0.21761617 ## 10 1936 8.060 7.747514 1.37486325 ## 11 1948 7.825 7.916301 -0.38535068 ## 12 1952 7.570 7.972564 -1.80894501 ## 13 1956 7.830 8.028827 -0.84888005 ## 14 1960 8.120 8.085089 0.14690763 ## 15 1964 8.070 8.141352 -0.30102045 ## 16 1968 8.900 8.197615 3.76651449 ## 17 1972 8.240 8.253877 -0.05865636 ## 18 1976 8.350 8.310140 0.16903844 ## 19 1980 8.540 8.366402 0.74709891 ## 20 1984 8.540 8.422665 0.50367210 ## 21 1988 8.720 8.478928 1.05875652 ## 22 1992 8.670 8.535190 0.58546175 ## 23 1996 8.500 8.591453 -0.39790914 ## 24 2000 8.550 8.647716 -0.42816057 ## 25 2004 8.590 8.703978 -0.50378890 ## 26 2008 8.370 8.760241 -1.85376067 ``` --- ## Example .small[ ```r model <- lm(Gold ~ Year, data = LongJumpOlympics) LongJumpOlympics %>% mutate(y_hat = model %>% predict(), stud_resid = model %>% rstudent()) %>% ggplot(aes(Year, stud_resid)) + geom_point() + geom_hline(yintercept = c(2, 4, -2, -4), lty = 2) + labs(y = "studentized residual") ``` ![](04-unusual-observations_files/figure-html/unnamed-chunk-17-1.png)<!-- --> ] --- ## Influential points .question[ Would removing the observation from the dataset change the regression equation by much? ] --- ## Example .small[ ```r lm(mass ~ height, data = starwars) ``` ``` ## ## Call: ## lm(formula = mass ~ height, data = starwars) ## ## Coefficients: ## (Intercept) height ## -13.8103 0.6386 ``` ] .small[ ```r starwars %>% * filter(name != "Jabba Desilijic Tiure") %>% lm(mass ~ height, data = .) ``` ``` ## ## Call: ## lm(formula = mass ~ height, data = .) ## ## Coefficients: ## (Intercept) height ## -32.5408 0.6214 ``` ] --- ## Example ![](04-unusual-observations_files/figure-html/unnamed-chunk-20-1.png)<!-- --> --- ## <i class="fas fa-laptop"></i> `Starwars (2)` - Go to RStudio Cloud and open `Starwars (2)` - For each question you work on, set the `eval` chunk option to `TRUE` and knit