class: center, middle, inverse, title-slide

# Simple linear regression

---

layout: true

<div class="my-footer">
<span>
Dr. Lucy D'Agostino McGowan
</span>
</div>

---

## <i class="fas fa-laptop"></i> `Porsche Price`

- Go to RStudio Cloud and open `Porsche Price`

---

# Steps for modeling

![](img/03/flowchart.png)

---

# Steps for modeling

![](img/03/flowchart-arrow.png)

---

class: center, middle

# Data = Model + Error

---

class: center, middle

`\(\Huge y = f(x) + \epsilon\)`

---

class: center, middle

`\(\Huge y = f(x) + \epsilon\)`

## Simple linear regression

---

class: center, middle

`\(\Huge \color{blue}y = f(x) + \epsilon\)`

* **y:** continuous (quantitative) variable

<br><br><br>

### properties of simple linear regression

---

class: center, middle

`\(\Huge y = f(\color{blue}x) + \epsilon\)`

* **x:** continuous (quantitative) variable

<br><br><br>

### properties of simple linear regression

---

class: center, middle

`\(\Huge y = \color{blue}{f(x)} + \epsilon\)`

* **f(x):** a function that gives the mean value of `\(y\)` at any value of `\(x\)`

<br><br><br>

### properties of simple linear regression

---

class: middle

.definition[
**function**: a rule that relates a set of inputs to a set of outputs
]

--

- For example, `\(y = 1.5 + 0.5x\)` is a function where `\(x\)` is the input and `\(y\)` is the output

--

- If you plug in `\(2\)` for `\(x\)`: `\(y = 1.5 + 0.5 \times 2 \rightarrow y = 1.5 + 1 \rightarrow y = 2.5\)`

---

.question[
What function do you think we are using to get the mean value of `\(y\)` with simple **linear** regression?
]

---

.question[
What function do you think we are using to get the mean value of `\(y\)` with simple **linear** regression?
]

![](03-simple-linear-regression_files/figure-html/unnamed-chunk-2-1.png)<!-- -->

---

class: center, middle

## We express the mean weight of sparrows as a _linear function_ of wing length.

---

.question[
What is the equation that represents this line?
]

![](03-simple-linear-regression_files/figure-html/unnamed-chunk-3-1.png)<!-- -->

---

# y = mx + b

![](03-simple-linear-regression_files/figure-html/unnamed-chunk-4-1.png)<!-- -->

---

# `\(y = \beta_0 + \beta_1 x\)`

![](03-simple-linear-regression_files/figure-html/unnamed-chunk-5-1.png)<!-- -->

---

# `\(y = \beta_0 + \beta_1 \times \textrm{Wing Length}\)`

![](03-simple-linear-regression_files/figure-html/unnamed-chunk-6-1.png)<!-- -->

---

## `\(\textrm{Weight} = \beta_0 + \beta_1 \times \textrm{Wing Length}\)`

![](03-simple-linear-regression_files/figure-html/unnamed-chunk-7-1.png)<!-- -->

---

.large[.question[
What is `\(\beta_0\)`?
]]

![](03-simple-linear-regression_files/figure-html/unnamed-chunk-8-1.png)<!-- -->

---

.large[.question[
What is `\(\beta_0\)`?
]]

![](03-simple-linear-regression_files/figure-html/unnamed-chunk-9-1.png)<!-- -->

---

.large[.question[
What is `\(\beta_1\)`?
]]

![](03-simple-linear-regression_files/figure-html/unnamed-chunk-10-1.png)<!-- -->

---

.large[.question[
What is `\(\beta_1\)`?
]]

![](03-simple-linear-regression_files/figure-html/unnamed-chunk-11-1.png)<!-- -->

---

class: center, middle

## Do all of the data points actually fall exactly on the line?
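---

## Seeing the scatter for ourselves

Before answering, we can plot the data against the fitted line. A minimal sketch, assuming the `Sparrows` data from **Stat2Data** (the same data fit later in these slides) and **ggplot2**:

```r
library(Stat2Data)
library(ggplot2)
data(Sparrows)

ggplot(Sparrows, aes(x = WingLength, y = Weight)) +
  geom_point() +                          # each observed sparrow
  geom_smooth(method = "lm", se = FALSE)  # the fitted least squares line
```

Nearly every point misses the line, which is exactly why the model needs one more piece.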
---

class: center, middle

## `\(y = \beta_0 + \beta_1x + \color{red}{\epsilon}\)`

---

## `\(y = \beta_0 + \beta_1x + \color{red}\epsilon\)`

![](03-simple-linear-regression_files/figure-html/unnamed-chunk-12-1.png)<!-- -->

---

## `\(y = \beta_0 + \beta_1x + \color{red}\epsilon\)`

![](03-simple-linear-regression_files/figure-html/unnamed-chunk-13-1.png)<!-- -->

---

# Truth

`\(\Huge y = \beta_0 + \beta_1x + \epsilon\)`

![](https://media.giphy.com/media/s7uHDGT8SoJAQ/giphy.gif)

---

# Truth

`\(\Huge y = \beta_0 + \beta_1x + \epsilon\)`

.definition[
If we had the **whole population** of sparrows, we could quantify the exact relationship between `\(y\)` and `\(x\)`
]

---

# Reality

`\(\Huge \hat{y} = \hat{\beta}_0 + \hat{\beta}_1x\)`

![](https://media.giphy.com/media/3o6nURrfzsAC9zZaG4/giphy.gif)

---

# Reality

`\(\Huge \hat{y} = \hat{\beta}_0 + \hat{\beta}_1x\)`

.definition[
In reality, we have a **sample** of sparrows to **estimate** the relationship between `\(x\)` and `\(y\)`. The "hats" indicate that these are estimated (fitted) values
]

---

# Put a hat on it

.question[
How can you tell whether a quantity is a **parameter** from the **whole population** or an **estimate** from a **sample**?
]

![](https://media.giphy.com/media/dZ0yRjxBulRjW/giphy.gif)

---

class: center, middle

# Pause for definitions

---

# Definitions

- **parameters**
  - `\(\beta_0\)`
  - `\(\beta_1\)`
- **population** versus **sample**
- **simple linear model**

---

# Definitions

- **parameters**: `\(\beta_0\)`, `\(\beta_1\)`
  - `\(\beta_0\)`: intercept
  - `\(\beta_1\)`: slope
- **population** versus **sample**
- **simple linear model**: `\(y = \beta_0 + \beta_1x + \epsilon\)` **estimated by** `\(\hat{y} = \hat{\beta}_0+\hat{\beta}_1x\)`

---

class: center, middle

# Let's do this in R

---

```r
library(Stat2Data)
data(Sparrows)
*lm(Weight ~ WingLength, data = Sparrows)
```

```
## 
## Call:
## lm(formula = Weight ~ WingLength, data = Sparrows)
## 
## Coefficients:
## (Intercept)   WingLength  
##      1.3655       0.4674
```

---

.question[
What is `\(\hat{\beta}_0\)`?
]

```r
lm(Weight ~ WingLength, data = Sparrows)
```

```
## 
## Call:
## lm(formula = Weight ~ WingLength, data = Sparrows)
## 
## Coefficients:
## (Intercept)   WingLength  
##      1.3655       0.4674
```

---

.question[
What is `\(\hat{\beta}_1\)`?
]

```r
lm(Weight ~ WingLength, data = Sparrows)
```

```
## 
## Call:
## lm(formula = Weight ~ WingLength, data = Sparrows)
## 
## Coefficients:
## (Intercept)   WingLength  
##      1.3655       0.4674
```
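---

## Pulling the estimates out of the fit

Instead of reading `\(\hat{\beta}_0\)` and `\(\hat{\beta}_1\)` off the printout, we can extract them with base R's `coef()`. A minimal sketch:

```r
model <- lm(Weight ~ WingLength, data = Sparrows)

coef(model)                  # both estimates as a named vector
coef(model)["(Intercept)"]   # beta_0 hat: 1.3655
coef(model)["WingLength"]    # beta_1 hat: 0.4674
```

Storing the fit in an object also lets us reuse it, as we do next with `predict()`.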
---

```r
y_hat <- lm(Weight ~ WingLength, data = Sparrows) %>%
*  predict()
```

---

```r
y_hat <- lm(Weight ~ WingLength, data = Sparrows) %>%
  predict()

Sparrows %>%
*  mutate(y_hat = y_hat) %>%
  select(WingLength, y_hat) %>%
  slice(1:5)
```

```
##   WingLength    y_hat
## 1         29 14.92020
## 2         31 15.85501
## 3         25 13.05059
## 4         29 14.92020
## 5         30 15.38761
```

---

class: center, middle

# Let's try to match these values using `\(\hat{\beta}_0\)` and `\(\hat{\beta}_1\)`

---

```
## 
## Call:
## lm(formula = Weight ~ WingLength, data = Sparrows)
## 
## Coefficients:
## (Intercept)   WingLength  
##      1.3655       0.4674
```

```
##   WingLength    y_hat
*## 1         29 14.92020
## 2         31 15.85501
## 3         25 13.05059
## 4         29 14.92020
## 5         30 15.38761
```

---

```r
lm(Weight ~ WingLength, data = Sparrows)
```

```
## 
## Call:
## lm(formula = Weight ~ WingLength, data = Sparrows)
## 
## Coefficients:
## (Intercept)   WingLength  
*##      1.3655       0.4674
```

# `\(1.3655 + 0.4674 \times 29\)`

---

```r
lm(Weight ~ WingLength, data = Sparrows)
```

```
## 
## Call:
## lm(formula = Weight ~ WingLength, data = Sparrows)
## 
## Coefficients:
## (Intercept)   WingLength  
##      1.3655       0.4674
```

# `\(1.3655 + 0.4674 \times 29 = 14.92\)`

---

# How'd we do?

```r
y_hat <- lm(Weight ~ WingLength, data = Sparrows) %>%
  predict()

Sparrows %>%
  mutate(y_hat = y_hat) %>%
  select(WingLength, y_hat) %>%
  slice(1:5)
```

```
##   WingLength    y_hat
*## 1         29 14.92020
## 2         31 15.85501
## 3         25 13.05059
## 4         29 14.92020
## 5         30 15.38761
```

---

.question[
How did we decide on THIS line?
]

![](03-simple-linear-regression_files/figure-html/unnamed-chunk-24-1.png)<!-- -->

---

## Minimizing Least Squares

![](03-simple-linear-regression_files/figure-html/unnamed-chunk-25-1.png)<!-- -->

---

## Minimizing Least Squares

![](03-simple-linear-regression_files/figure-html/unnamed-chunk-26-1.png)<!-- -->

---

## Minimizing Least Squares

![](03-simple-linear-regression_files/figure-html/unnamed-chunk-27-1.png)<!-- -->

---

## Minimizing Least Squares

.center[
![](img/03/least-squares-vis.gif)
]

---

## "Squared Residuals"

![](03-simple-linear-regression_files/figure-html/unnamed-chunk-29-1.png)<!-- -->

---

## "Residuals"

![](03-simple-linear-regression_files/figure-html/unnamed-chunk-30-1.png)<!-- -->

---

## Definitions

* **residual** `\((e)\)`
* **squared residual** `\((e^2)\)`
* **sum of squared residuals (SSE)**
* **standard deviation of the errors** `\((\sigma_\epsilon)\)`
* **n**

---

## Definitions

* **residual** `\((e)\)`: observed `\(y\)` minus predicted `\(y\)` `\(\rightarrow\)` `\(y - \hat{y}\)`

--

* **squared residual** `\((e^2)\)`: `\((y - \hat{y})^2\)`

--

* **sum of squared residuals (SSE)**: `\(\sum (y-\hat{y})^2\)`

--

* **standard deviation of the errors** `\((\sigma_\epsilon)\)`: estimated by `\(\hat{\sigma}_\epsilon = \sqrt{\frac{SSE}{n-2}}\)` (the **regression standard error**)

--

* **n**: sample size

---

## ☝️ Note about notation

- `\(\Huge\sum(y - \hat{y})^2\)`

--

- `\(\Huge\sum_{i=1}^n(y_i - \hat{y}_i)^2\)`

---

## ☝️ Note about notation

.center[
### the subscript i refers to a single individual

`\(\Huge e_i = y_i - \hat{y}_i\)`
]

---

## ☝️ Note about notation

.center[
### for the first observation, i = 1

`\(\Huge e_1 = y_1 - \hat{y}_1\)`
]

---

## ☝️ Note about notation

.center[
### for the first observation, i = 1

`\(\Huge -0.02 = 14.9 - 14.92\)`
]

---

class: center, middle

# Back to R!
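---

## Check `\(e_1\)` first

A minimal sketch that reproduces the single residual from the notation slides, `\(e_1 = y_1 - \hat{y}_1\)`, before we compute them all:

```r
model <- lm(Weight ~ WingLength, data = Sparrows)
y_hat <- predict(model)

Sparrows$Weight[1]             # y_1: 14.9
y_hat[1]                       # y_1 hat: 14.92
Sparrows$Weight[1] - y_hat[1]  # e_1: -0.02
```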
---

## Calculate the residuals

```r
y_hat <- lm(Weight ~ WingLength, data = Sparrows) %>%
  predict()

Sparrows %>%
  mutate(y_hat = y_hat,
*        residual = Weight - y_hat) %>%
  select(Weight, y_hat, residual) %>%
  slice(1:5)
```

```
##   Weight    y_hat    residual
## 1   14.9 14.92020 -0.02020496
## 2   15.0 15.85501 -0.85501292
## 3   14.3 13.05059  1.24941095
## 4   17.0 14.92020  2.07979504
## 5   16.0 15.38761  0.61239106
```

---

## Calculate the squared residuals

```r
y_hat <- lm(Weight ~ WingLength, data = Sparrows) %>%
  predict()

Sparrows %>%
  mutate(y_hat = y_hat,
         residual = Weight - y_hat,
*        residual_2 = residual^2) %>%
  select(Weight, y_hat, residual, residual_2) %>%
  slice(1:5)
```

```
##   Weight    y_hat    residual   residual_2
## 1   14.9 14.92020 -0.02020496 0.0004082405
## 2   15.0 15.85501 -0.85501292 0.7310470869
## 3   14.3 13.05059  1.24941095 1.5610277150
## 4   17.0 14.92020  2.07979504 4.3255474012
## 5   16.0 15.38761  0.61239106 0.3750228116
```

---

## Calculate the SSE

```r
y_hat <- lm(Weight ~ WingLength, data = Sparrows) %>%
  predict()

Sparrows %>%
  mutate(y_hat = y_hat,
         residual = Weight - y_hat,
         residual_2 = residual^2) %>%
*  summarise(sse = sum(residual_2))
```

```
##        sse
## 1 223.3107
```

---

## Calculate the regression standard error

```r
y_hat <- lm(Weight ~ WingLength, data = Sparrows) %>%
  predict()

Sparrows %>%
  mutate(y_hat = y_hat,
         residual = Weight - y_hat,
         residual_2 = residual^2) %>%
  summarise(sse = sum(residual_2),
            n = n(),
*           rse = sqrt(sse / (n - 2)))
```

```
##        sse   n      rse
## 1 223.3107 116 1.399595
```

---

## Calculate the regression standard error

.small[

```
##        sse   n      rse
## 1 223.3107 116 1.399595
```

```r
lm(Weight ~ WingLength, data = Sparrows) %>%
  summary()
```

```
## 
## Call:
## lm(formula = Weight ~ WingLength, data = Sparrows)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5440 -0.9935  0.0809  1.0559  3.4168 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.36549    0.95731   1.426    0.156
## WingLength   0.46740    0.03472  13.463   <2e-16
## 
*## Residual standard error: 1.4 on 114 degrees of freedom
## Multiple R-squared:  0.6139,	Adjusted R-squared:  0.6105 
## F-statistic: 181.3 on 1 and 114 DF,  p-value: < 2.2e-16
```

]

---

## <i class="fas fa-laptop"></i> `Porsche Price`

- Go to RStudio Cloud and open `Porsche Price`
- For each question you work on, set the `eval` chunk option to `TRUE` and knit

---

## Linearity

The overall relationship between the variables has a linear pattern. The average values of the response `\(y\)` for each value of `\(x\)` fall on a common straight line.

---

## Zero Mean

The error distribution is centered at zero. This means that the points are scattered at random above and below the line.

_(Note: By using least squares regression, we force the residual mean to be zero. Other techniques would not necessarily satisfy this condition.)_

---

## Constant Variance

The variability in the errors is the same for all values of the predictor variable. This means that the spread of points around the line remains fairly constant.

---

## Independence

The errors are assumed to be independent from one another. Thus, one point falling above or below the line has no influence on the location of another point.

When we are interested in using the model to make formal inferences (conducting hypothesis tests or providing confidence intervals), additional assumptions are needed.

---

## Random

The data are obtained using a random process. Most commonly, this arises either from random sampling from a population of interest or from the use of randomization in a statistical experiment.

---

## Normality

In order to use standard distributions for confidence intervals and hypothesis tests, we often need to assume that the random errors follow a normal distribution.
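---

## Checking the conditions visually

These conditions are usually assessed with residual plots. A minimal sketch using base R's plotting method for `lm` fits:

```r
model <- lm(Weight ~ WingLength, data = Sparrows)

plot(model, which = 1)  # residuals vs fitted: look for a flat band centered at zero
plot(model, which = 2)  # normal Q-Q plot: points near the line support normality
```

Independence and randomness cannot be seen in a plot; they come from how the data were collected.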
---

## Summary of conditions

For a quantitative response variable `\(y\)` and a single quantitative explanatory variable `\(x\)`, the simple linear regression model is

`\(y = \beta_0 + \beta_1 x + \epsilon\)`

where `\(\epsilon\)` follows a normal distribution, that is, `\(\epsilon \sim N(0, \sigma_\epsilon)\)`, and the errors are independent from one another.

---

# broom

.pull-left[
![](img/03/tidyverse.png)
![](img/03/broom.png)
]

.pull-right[
- You're familiar with the tidyverse:

```r
library(tidyverse)
```

- The broom package takes the messy output of built-in R functions, such as `lm`, and turns it into tidy data frames.

```r
library(broom)
```
]

.my-footer[
<font size="2">
Slides adapted from <a href="https://github.com/Sta199-S18/website" target="_blank">Dr. Mine Çetinkaya-Rundel</a> by Dr. Lucy D'Agostino McGowan
</font>
]

---

## <i class="fas fa-laptop"></i> `Porsche Price`

- Go to RStudio Cloud and open `Porsche Price`
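---

## <i class="fas fa-lightbulb"></i> Hint: broom on your Porsche model

A minimal sketch of broom in action. The dataset name `PorschePrice` and the columns `Price` and `Mileage` are assumptions here; check the names in the project before running:

```r
library(Stat2Data)
library(broom)

data(PorschePrice)  # assumed dataset; adjust if the project uses another name
model <- lm(Price ~ Mileage, data = PorschePrice)

tidy(model)    # one row per coefficient, as a tidy data frame
glance(model)  # one-row model summary (R-squared, sigma, ...)
```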