class: center, middle, inverse, title-slide

# Simple linear regression

---

layout: true

<div class="my-footer">
<span>
Dr. Lucy D'Agostino McGowan
</span>
</div>

---

## <i class="fas fa-laptop"></i> `Porsche Price`

- Go to RStudio Cloud and open `Porsche Price`

---

# Steps for modeling

![](img/03/flowchart.png)

---

# Steps for modeling

![](img/03/flowchart-arrow.png)

---

class: center, middle

# Data = Model + Error

---

class: center, middle

`\(\Huge y = f(x) + \epsilon\)`

---

class: center, middle

`\(\Huge y = f(x) + \epsilon\)`

## Simple linear regression

---

class: center, middle

`\(\Huge \color{blue}y = f(x) + \epsilon\)`

* **y:** continuous (quantitative) variable

<br><br><br>

### properties of simple linear regression

---

class: center, middle

`\(\Huge y = f(\color{blue}x) + \epsilon\)`

* **x:** continuous (quantitative) variable

<br><br><br>

### properties of simple linear regression

---

class: center, middle

`\(\Huge y = \color{blue}{f(x)} + \epsilon\)`

* **f(x):** a function that gives the mean value of `\(y\)` at any value of `\(x\)`

<br><br><br>

### properties of simple linear regression

---

class: middle

.definition[
**function**: a rule that relates a set of inputs to a set of outputs
]

--

- For example, `\(y = 1.5 + 0.5x\)` is a function where `\(x\)` is the input and `\(y\)` is the output

--

- If you plug in `\(2\)` for `\(x\)`: `\(y = 1.5 + 0.5 \times 2 \rightarrow y = 1.5 + 1 \rightarrow y = 2.5\)`

---

.question[
What function do you think we are using to get the mean value of `\(y\)` with simple **linear** regression?
]

---

.question[
What function do you think we are using to get the mean value of `\(y\)` with simple **linear** regression?
]

![](03-simple-linear-regression_files/figure-html/unnamed-chunk-2-1.png)<!-- -->

---

class: center, middle

## We express the mean weight of sparrows as a _linear function_ of wing length.

---

.question[
What is the equation that represents this line?
]

![](03-simple-linear-regression_files/figure-html/unnamed-chunk-3-1.png)<!-- -->

---

# y = mx + b

![](03-simple-linear-regression_files/figure-html/unnamed-chunk-4-1.png)<!-- -->

---

# `\(y = \beta_0 + \beta_1 x\)`

![](03-simple-linear-regression_files/figure-html/unnamed-chunk-5-1.png)<!-- -->

---

# `\(y = \beta_0 + \beta_1 \times \textrm{Wing Length}\)`

![](03-simple-linear-regression_files/figure-html/unnamed-chunk-6-1.png)<!-- -->

---

## `\(\textrm{Weight} = \beta_0 + \beta_1 \times \textrm{Wing Length}\)`

![](03-simple-linear-regression_files/figure-html/unnamed-chunk-7-1.png)<!-- -->

---

.large[.question[
What is `\(\beta_0\)`?
]]

![](03-simple-linear-regression_files/figure-html/unnamed-chunk-8-1.png)<!-- -->

---

.large[.question[
What is `\(\beta_0\)`?
]]

![](03-simple-linear-regression_files/figure-html/unnamed-chunk-9-1.png)<!-- -->

---

.large[.question[
What is `\(\beta_1\)`?
]]

![](03-simple-linear-regression_files/figure-html/unnamed-chunk-10-1.png)<!-- -->

---

.large[.question[
What is `\(\beta_1\)`?
]]

![](03-simple-linear-regression_files/figure-html/unnamed-chunk-11-1.png)<!-- -->

---

class: center, middle

## Do all of the data points actually fall exactly on the line?
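---

## Seeing the scatter for ourselves

Before answering, we can plot the data against the fitted line. A minimal sketch, assuming the `Sparrows` data from **Stat2Data** (the same data fit later in these slides) and **ggplot2**:

```r
library(Stat2Data)
library(ggplot2)
data(Sparrows)

ggplot(Sparrows, aes(x = WingLength, y = Weight)) +
  geom_point() +                          # each observed sparrow
  geom_smooth(method = "lm", se = FALSE)  # the fitted least squares line
```

Nearly every point misses the line, which is exactly why the model needs one more piece.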
---

class: center, middle

## `\(y = \beta_0 + \beta_1x + \color{red}{\epsilon}\)`

---

## `\(y = \beta_0 + \beta_1x + \color{red}\epsilon\)`

![](03-simple-linear-regression_files/figure-html/unnamed-chunk-12-1.png)<!-- -->

---

## `\(y = \beta_0 + \beta_1x + \color{red}\epsilon\)`

![](03-simple-linear-regression_files/figure-html/unnamed-chunk-13-1.png)<!-- -->

---

# Truth

`\(\Huge y = \beta_0 + \beta_1x + \epsilon\)`

![](https://media.giphy.com/media/s7uHDGT8SoJAQ/giphy.gif)

---

# Truth

`\(\Huge y = \beta_0 + \beta_1x + \epsilon\)`

.definition[
If we had the **whole population** of sparrows, we could quantify the exact relationship between `\(y\)` and `\(x\)`
]

---

# Reality

`\(\Huge \hat{y} = \hat{\beta}_0 + \hat{\beta}_1x\)`

![](https://media.giphy.com/media/3o6nURrfzsAC9zZaG4/giphy.gif)

---

# Reality

`\(\Huge \hat{y} = \hat{\beta}_0 + \hat{\beta}_1x\)`

.definition[
In reality, we have a **sample** of sparrows to **estimate** the relationship between `\(x\)` and `\(y\)`. The "hats" indicate that these are estimated (fitted) values
]

---

# Put a hat on it

.question[
How can you tell whether a quantity is a **parameter** from the **whole population** or an **estimate** from a **sample**?
]

![](https://media.giphy.com/media/dZ0yRjxBulRjW/giphy.gif)

---

class: center, middle

# Pause for definitions

---

# Definitions

- **parameters**
  - `\(\beta_0\)`
  - `\(\beta_1\)`
- **population** versus **sample**
- **simple linear model**

---

# Definitions

- **parameters**: `\(\beta_0\)`, `\(\beta_1\)`
  - `\(\beta_0\)`: intercept
  - `\(\beta_1\)`: slope
- **population** versus **sample**
- **simple linear model**: `\(y = \beta_0 + \beta_1x + \epsilon\)` **estimated by** `\(\hat{y} = \hat{\beta}_0+\hat{\beta}_1x\)`

---

class: center, middle

# Let's do this in R

---

```r
library(Stat2Data)
data(Sparrows)
*lm(Weight ~ WingLength, data = Sparrows)
```

```
## 
## Call:
## lm(formula = Weight ~ WingLength, data = Sparrows)
## 
## Coefficients:
## (Intercept)   WingLength  
##      1.3655       0.4674
```

---

.question[
What is `\(\hat{\beta}_0\)`?
]

```r
lm(Weight ~ WingLength, data = Sparrows)
```

```
## 
## Call:
## lm(formula = Weight ~ WingLength, data = Sparrows)
## 
## Coefficients:
## (Intercept)   WingLength  
##      1.3655       0.4674
```

---

.question[
What is `\(\hat{\beta}_1\)`?
]

```r
lm(Weight ~ WingLength, data = Sparrows)
```

```
## 
## Call:
## lm(formula = Weight ~ WingLength, data = Sparrows)
## 
## Coefficients:
## (Intercept)   WingLength  
##      1.3655       0.4674
```
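---

## Pulling the estimates out of the fit

Instead of reading `\(\hat{\beta}_0\)` and `\(\hat{\beta}_1\)` off the printout, we can extract them with base R's `coef()`. A minimal sketch:

```r
model <- lm(Weight ~ WingLength, data = Sparrows)

coef(model)                  # both estimates as a named vector
coef(model)["(Intercept)"]   # beta_0 hat: 1.3655
coef(model)["WingLength"]    # beta_1 hat: 0.4674
```

Storing the fit in an object also lets us reuse it, as we do next with `predict()`.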
---

```r
y_hat <- lm(Weight ~ WingLength, data = Sparrows) %>%
*  predict()
```

---

```r
y_hat <- lm(Weight ~ WingLength, data = Sparrows) %>%
  predict()

Sparrows %>%
*  mutate(y_hat = y_hat) %>%
  select(WingLength, y_hat) %>%
  slice(1:5)
```

```
##   WingLength    y_hat
## 1         29 14.92020
## 2         31 15.85501
## 3         25 13.05059
## 4         29 14.92020
## 5         30 15.38761
```

---

class: center, middle

# Let's try to match these values using `\(\hat{\beta}_0\)` and `\(\hat{\beta}_1\)`

---

```
## 
## Call:
## lm(formula = Weight ~ WingLength, data = Sparrows)
## 
## Coefficients:
## (Intercept)   WingLength  
##      1.3655       0.4674
```

```
##   WingLength    y_hat
*## 1         29 14.92020
## 2         31 15.85501
## 3         25 13.05059
## 4         29 14.92020
## 5         30 15.38761
```

---

```r
lm(Weight ~ WingLength, data = Sparrows)
```

```
## 
## Call:
## lm(formula = Weight ~ WingLength, data = Sparrows)
## 
## Coefficients:
## (Intercept)   WingLength  
*##      1.3655       0.4674
```

# `\(1.3655 + 0.4674 \times 29\)`

---

```r
lm(Weight ~ WingLength, data = Sparrows)
```

```
## 
## Call:
## lm(formula = Weight ~ WingLength, data = Sparrows)
## 
## Coefficients:
## (Intercept)   WingLength  
##      1.3655       0.4674
```

# `\(1.3655 + 0.4674 \times 29 = 14.92\)`

---

# How'd we do?

```r
y_hat <- lm(Weight ~ WingLength, data = Sparrows) %>%
  predict()

Sparrows %>%
  mutate(y_hat = y_hat) %>%
  select(WingLength, y_hat) %>%
  slice(1:5)
```

```
##   WingLength    y_hat
*## 1         29 14.92020
## 2         31 15.85501
## 3         25 13.05059
## 4         29 14.92020
## 5         30 15.38761
```

---

.question[
How did we decide on THIS line?
]

![](03-simple-linear-regression_files/figure-html/unnamed-chunk-24-1.png)<!-- -->

---

## Minimizing Least Squares

![](03-simple-linear-regression_files/figure-html/unnamed-chunk-25-1.png)<!-- -->

---

## Minimizing Least Squares

![](03-simple-linear-regression_files/figure-html/unnamed-chunk-26-1.png)<!-- -->

---

## Minimizing Least Squares

![](03-simple-linear-regression_files/figure-html/unnamed-chunk-27-1.png)<!-- -->

---

## Minimizing Least Squares

.center[
![](img/03/least-squares-vis.gif)
]

---

## "Squared Residuals"

![](03-simple-linear-regression_files/figure-html/unnamed-chunk-29-1.png)<!-- -->

---

## "Residuals"

![](03-simple-linear-regression_files/figure-html/unnamed-chunk-30-1.png)<!-- -->

---

## Definitions

* **residual** `\((e)\)`
* **squared residual** `\((e^2)\)`
* **sum of squared residuals (SSE)**
* **standard deviation of the errors** `\((\sigma_\epsilon)\)`
* **n**

---

## Definitions

* **residual** `\((e)\)`: observed `\(y\)` minus predicted `\(y\)` `\(\rightarrow\)` `\(y - \hat{y}\)`

--

* **squared residual** `\((e^2)\)`: `\((y - \hat{y})^2\)`

--

* **sum of squared residuals (SSE)**: `\(\sum (y-\hat{y})^2\)`

--

* **standard deviation of the errors** `\((\sigma_\epsilon)\)`: estimated by `\(\hat{\sigma}_\epsilon = \sqrt{\frac{SSE}{n-2}}\)` (the **regression standard error**)

--

* **n**: sample size

---

## ☝️ Note about notation

- `\(\Huge\sum(y - \hat{y})^2\)`

--

- `\(\Huge\sum_{i=1}^n(y_i - \hat{y}_i)^2\)`

---

## ☝️ Note about notation

.center[
### the subscript i refers to a single individual

`\(\Huge e_i = y_i - \hat{y}_i\)`
]

---

## ☝️ Note about notation

.center[
### for the first observation, i = 1

`\(\Huge e_1 = y_1 - \hat{y}_1\)`
]

---

## ☝️ Note about notation

.center[
### for the first observation, i = 1

`\(\Huge -0.02 = 14.9 - 14.92\)`
]

---

class: center, middle

# Back to R!
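---

## Check `\(e_1\)` first

A minimal sketch that reproduces the single residual from the notation slides, `\(e_1 = y_1 - \hat{y}_1\)`, before we compute them all:

```r
model <- lm(Weight ~ WingLength, data = Sparrows)
y_hat <- predict(model)

Sparrows$Weight[1]             # y_1: 14.9
y_hat[1]                       # y_1 hat: 14.92
Sparrows$Weight[1] - y_hat[1]  # e_1: -0.02
```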
---

## Calculate the residuals

```r
y_hat <- lm(Weight ~ WingLength, data = Sparrows) %>%
  predict()

Sparrows %>%
  mutate(y_hat = y_hat,
*        residual = Weight - y_hat) %>%
  select(Weight, y_hat, residual) %>%
  slice(1:5)
```

```
##   Weight    y_hat    residual
## 1   14.9 14.92020 -0.02020496
## 2   15.0 15.85501 -0.85501292
## 3   14.3 13.05059  1.24941095
## 4   17.0 14.92020  2.07979504
## 5   16.0 15.38761  0.61239106
```

---

## Calculate the squared residuals

```r
y_hat <- lm(Weight ~ WingLength, data = Sparrows) %>%
  predict()

Sparrows %>%
  mutate(y_hat = y_hat,
         residual = Weight - y_hat,
*        residual_2 = residual^2) %>%
  select(Weight, y_hat, residual, residual_2) %>%
  slice(1:5)
```

```
##   Weight    y_hat    residual   residual_2
## 1   14.9 14.92020 -0.02020496 0.0004082405
## 2   15.0 15.85501 -0.85501292 0.7310470869
## 3   14.3 13.05059  1.24941095 1.5610277150
## 4   17.0 14.92020  2.07979504 4.3255474012
## 5   16.0 15.38761  0.61239106 0.3750228116
```

---

## Calculate the SSE

```r
y_hat <- lm(Weight ~ WingLength, data = Sparrows) %>%
  predict()

Sparrows %>%
  mutate(y_hat = y_hat,
         residual = Weight - y_hat,
         residual_2 = residual^2) %>%
*  summarise(sse = sum(residual_2))
```

```
##        sse
## 1 223.3107
```

---

## Calculate the regression standard error

```r
y_hat <- lm(Weight ~ WingLength, data = Sparrows) %>%
  predict()

Sparrows %>%
  mutate(y_hat = y_hat,
         residual = Weight - y_hat,
         residual_2 = residual^2) %>%
  summarise(sse = sum(residual_2),
            n = n(),
*           rse = sqrt(sse / (n - 2)))
```

```
##        sse   n      rse
## 1 223.3107 116 1.399595
```

---

## Calculate the regression standard error

.small[

```
##        sse   n      rse
## 1 223.3107 116 1.399595
```

```r
lm(Weight ~ WingLength, data = Sparrows) %>%
  summary()
```

```
## 
## Call:
## lm(formula = Weight ~ WingLength, data = Sparrows)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5440 -0.9935  0.0809  1.0559  3.4168 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.36549    0.95731   1.426    0.156
## WingLength   0.46740    0.03472  13.463   <2e-16
## 
*## Residual standard error: 1.4 on 114 degrees of freedom
## Multiple R-squared:  0.6139,	Adjusted R-squared:  0.6105 
## F-statistic: 181.3 on 1 and 114 DF,  p-value: < 2.2e-16
```

]

---

## <i class="fas fa-laptop"></i> `Porsche Price`

- Go to RStudio Cloud and open `Porsche Price`
- For each question you work on, set the `eval` chunk option to `TRUE` and knit

---

## Linearity

The overall relationship between the variables has a linear pattern. The average values of the response `\(y\)` for each value of `\(x\)` fall on a common straight line.

---

## Zero Mean

The error distribution is centered at zero. This means that the points are scattered at random above and below the line.

_(Note: By using least squares regression, we force the residual mean to be zero. Other techniques would not necessarily satisfy this condition.)_

---

## Constant Variance

The variability in the errors is the same for all values of the predictor variable. This means that the spread of points around the line remains fairly constant.

---

## Independence

The errors are assumed to be independent from one another. Thus, one point falling above or below the line has no influence on the location of another point.

When we are interested in using the model to make formal inferences (conducting hypothesis tests or providing confidence intervals), additional assumptions are needed.

---

## Random

The data are obtained using a random process. Most commonly, this arises either from random sampling from a population of interest or from the use of randomization in a statistical experiment.

---

## Normality

In order to use standard distributions for confidence intervals and hypothesis tests, we often need to assume that the random errors follow a normal distribution.
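---

## Checking the conditions visually

These conditions are usually assessed with residual plots. A minimal sketch using base R's plotting method for `lm` fits:

```r
model <- lm(Weight ~ WingLength, data = Sparrows)

plot(model, which = 1)  # residuals vs fitted: look for a flat band centered at zero
plot(model, which = 2)  # normal Q-Q plot: points near the line support normality
```

Independence and randomness cannot be seen in a plot; they come from how the data were collected.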
---

## Summary of conditions

For a quantitative response variable `\(y\)` and a single quantitative explanatory variable `\(x\)`, the simple linear regression model is

`\(y = \beta_0 + \beta_1 x + \epsilon\)`

where `\(\epsilon\)` follows a normal distribution, that is, `\(\epsilon \sim N(0, \sigma_\epsilon)\)`, and the errors are independent from one another.

---

# broom

.pull-left[
![](img/03/tidyverse.png)
![](img/03/broom.png)
]

.pull-right[
- You're familiar with the tidyverse:

```r
library(tidyverse)
```

- The broom package takes the messy output of built-in R functions, such as `lm`, and turns it into tidy data frames.

```r
library(broom)
```
]

.my-footer[
<font size="2">
Slides adapted from <a href="https://github.com/Sta199-S18/website" target="_blank">Dr. Mine Çetinkaya-Rundel</a> by Dr. Lucy D'Agostino McGowan
</font>
]

---

## <i class="fas fa-laptop"></i> `Porsche Price`

- Go to RStudio Cloud and open `Porsche Price`
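---

## <i class="fas fa-lightbulb"></i> Hint: broom on your Porsche model

A minimal sketch of broom in action. The dataset name `PorschePrice` and the columns `Price` and `Mileage` are assumptions here; check the names in the project before running:

```r
library(Stat2Data)
library(broom)

data(PorschePrice)  # assumed dataset; adjust if the project uses another name
model <- lm(Price ~ Mileage, data = PorschePrice)

tidy(model)    # one row per coefficient, as a tidy data frame
glance(model)  # one-row model summary (R-squared, sigma, ...)
```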