
Simple linear regression

1 / 75

Porsche Price

  • Go to RStudio Cloud and open Porsche Price
2 / 75

Steps for modeling

3 / 75

Steps for modeling

4 / 75

Data = Model + Error

5 / 75

y=f(x)+ϵ

6 / 75

y=f(x)+ϵ

Simple linear regression

7 / 75

y=f(x)+ϵ

  • y: continuous (quantitative) variable




properties of simple linear regression

8 / 75

y=f(x)+ϵ

  • x: continuous (quantitative) variable




properties of simple linear regression

9 / 75

y=f(x)+ϵ

  • f(x): a function that gives the mean value of y at any value of x




properties of simple linear regression

10 / 75

function: a function is a relationship between a set of inputs and a set of outputs

11 / 75

function: a function is a relationship between a set of inputs and a set of outputs

  • For example, y=1.5+0.5x is a function where x is the input and y is the output
11 / 75

function: a function is a relationship between a set of inputs and a set of outputs

  • For example, y=1.5+0.5x is a function where x is the input and y is the output
  • If you plug in 2 for x: y = 1.5 + 0.5 × 2 = 1.5 + 1 = 2.5
11 / 75
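The plug-in arithmetic above can be written directly as an R function (a minimal sketch; the name `f` is my own, not from the slides):

```r
# The linear function from the slide, y = 1.5 + 0.5x,
# written as an R function (the name f is arbitrary)
f <- function(x) {
  1.5 + 0.5 * x
}

f(2)  # plugging in 2 for x gives 2.5
```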

What function do you think we are using to get the mean value of y with simple linear regression?

12 / 75

What function do you think we are using to get the mean value of y with simple linear regression?

13 / 75

We express the mean weight of sparrows as a linear function of wing length.

14 / 75

What is the equation that represents this line?

15 / 75

y = mx + b

16 / 75

y=β0+β1x

17 / 75

y=β0+β1×Wing Length

18 / 75

Weight=β0+β1×Wing Length

19 / 75

What is β0?

20 / 75

What is β0?

21 / 75

What is β1?

22 / 75

What is β1?

23 / 75

Do all of the data points actually fall exactly on the line?

24 / 75

y=β0+β1x+ϵ

25 / 75

y=β0+β1x+ϵ

26 / 75

y=β0+β1x+ϵ

27 / 75

Truth

y=β0+β1x+ϵ

28 / 75

Truth

y=β0+β1x+ϵ

If we had the whole population of sparrows, we could quantify the exact relationship between y and x.

29 / 75

Reality

y^=β^0+β^1x

30 / 75

Reality

y^=β^0+β^1x

In reality, we have a sample of sparrows with which to estimate the relationship between x and y. The "hats" indicate that these are estimated (fitted) values.

31 / 75

Put a hat on it

How can you tell the difference between a parameter that is from the whole population versus a sample?

32 / 75

Pause for definitions

33 / 75

Definitions

  • parameters
  • β0
  • β1
  • population versus sample
  • simple linear model
34 / 75

Definitions

  • parameters: β0, β1
  • β0: intercept
  • β1: slope
  • population versus sample
  • simple linear model: y=β0+β1x+ϵ estimated by y^=β^0+β^1x
35 / 75

Let's do this in R

36 / 75
library(Stat2Data)
data(Sparrows)
lm(Weight ~ WingLength, data = Sparrows)
##
## Call:
## lm(formula = Weight ~ WingLength, data = Sparrows)
##
## Coefficients:
## (Intercept) WingLength
## 1.3655 0.4674
37 / 75

What is β^0?

lm(Weight ~ WingLength, data = Sparrows)
##
## Call:
## lm(formula = Weight ~ WingLength, data = Sparrows)
##
## Coefficients:
## (Intercept) WingLength
## 1.3655 0.4674
38 / 75

What is β^1?

lm(Weight ~ WingLength, data = Sparrows)
##
## Call:
## lm(formula = Weight ~ WingLength, data = Sparrows)
##
## Coefficients:
## (Intercept) WingLength
## 1.3655 0.4674
39 / 75
y_hat <- lm(Weight ~ WingLength, data = Sparrows) %>%
  predict()
40 / 75
y_hat <- lm(Weight ~ WingLength, data = Sparrows) %>%
  predict()
Sparrows %>%
  mutate(y_hat = y_hat) %>%
  select(WingLength, y_hat) %>%
  slice(1:5)
## WingLength y_hat
## 1 29 14.92020
## 2 31 15.85501
## 3 25 13.05059
## 4 29 14.92020
## 5 30 15.38761
41 / 75

Let's try to match these values using β^0 and β^1

42 / 75
##
## Call:
## lm(formula = Weight ~ WingLength, data = Sparrows)
##
## Coefficients:
## (Intercept) WingLength
## 1.3655 0.4674
## WingLength y_hat
## 1 29 14.92020
## 2 31 15.85501
## 3 25 13.05059
## 4 29 14.92020
## 5 30 15.38761
43 / 75
lm(Weight ~ WingLength, data = Sparrows)
##
## Call:
## lm(formula = Weight ~ WingLength, data = Sparrows)
##
## Coefficients:
## (Intercept) WingLength
## 1.3655 0.4674

1.3655+0.4674×29

44 / 75
lm(Weight ~ WingLength, data = Sparrows)
##
## Call:
## lm(formula = Weight ~ WingLength, data = Sparrows)
##
## Coefficients:
## (Intercept) WingLength
## 1.3655 0.4674

1.3655+0.4674×29=14.92

45 / 75
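The hand calculation above can be reproduced with base R arithmetic, using the (rounded) coefficients printed by lm(); because the printed values are rounded, the match with predict() is approximate:

```r
# Coefficients as printed (rounded) by lm()
b0 <- 1.3655  # intercept
b1 <- 0.4674  # slope for WingLength

# Fitted weight for a sparrow with wing length 29
y_hat_29 <- b0 + b1 * 29
round(y_hat_29, 2)  # 14.92, matching predict()'s 14.92020
```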

How'd we do?

y_hat <- lm(Weight ~ WingLength, data = Sparrows) %>%
  predict()
Sparrows %>%
  mutate(y_hat = y_hat) %>%
  select(WingLength, y_hat) %>%
  slice(1:5)
## WingLength y_hat
## 1 29 14.92020
## 2 31 15.85501
## 3 25 13.05059
## 4 29 14.92020
## 5 30 15.38761
46 / 75

How did we decide on THIS line?

47 / 75

Minimizing Least Squares

48 / 75

Minimizing Least Squares

49 / 75

Minimizing Least Squares

50 / 75

Minimizing Least Squares

51 / 75

"Squared Residuals"

52 / 75

"Residuals"

53 / 75

Definitions

  • residual (e)
  • squared residual (e2)
  • sum of squared residuals (SSE)
  • standard deviation of the errors (σϵ)
  • n
54 / 75

Definitions

  • residual (e): observed y − predicted y, y−y^
55 / 75

Definitions

  • residual (e): observed y − predicted y, y−y^
  • squared residual (e2): (y−y^)2
55 / 75

Definitions

  • residual (e): observed y − predicted y, y−y^
  • squared residual (e2): (y−y^)2
  • sum of squared residuals (SSE): ∑(y−y^)2
55 / 75

Definitions

  • residual (e): observed y − predicted y, y−y^
  • squared residual (e2): (y−y^)2
  • sum of squared residuals (SSE): ∑(y−y^)2
  • standard deviation of the errors (σϵ): estimated by σ^ϵ = √(SSE / (n−2)) (regression standard error)
55 / 75

Definitions

  • residual (e): observed y − predicted y, y−y^
  • squared residual (e2): (y−y^)2
  • sum of squared residuals (SSE): ∑(y−y^)2
  • standard deviation of the errors (σϵ): estimated by σ^ϵ = √(SSE / (n−2)) (regression standard error)
  • n: sample size
55 / 75
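These definitions can be traced with base R on the first few sparrows from the earlier output (observed weights and fitted values copied from the slides; no packages needed; with only three points the RSE line just illustrates the formula, not the full-data value):

```r
# Observed weights and fitted values for the first three sparrows
y     <- c(14.9, 15.0, 14.3)
y_hat <- c(14.92, 15.86, 13.05)  # predict() output, rounded

e   <- y - y_hat             # residuals
e2  <- e^2                   # squared residuals
sse <- sum(e2)               # sum of squared residuals (for these 3 rows only)
n   <- length(y)
rse <- sqrt(sse / (n - 2))   # regression standard error formula
```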

☝️ Note about notation

  • ∑(y−y^)2
56 / 75

☝️ Note about notation

  • ∑(y−y^)2
  • ∑_{i=1}^{n} (yi−y^i)2
56 / 75

☝️ Note about notation

the subscript i indicates a single individual

ei = yi − y^i

57 / 75

☝️ Note about notation

for the first observation, i = 1

e1 = y1 − y^1

58 / 75

☝️ Note about notation

for the first observation, i = 1

−0.02 = 14.9 − 14.92

59 / 75

Back to R!

60 / 75

Calculate the residual

y_hat <- lm(Weight ~ WingLength, data = Sparrows) %>%
  predict()
Sparrows %>%
  mutate(y_hat = y_hat,
         residual = Weight - y_hat) %>%
  select(Weight, y_hat, residual) %>%
  slice(1:5)
## Weight y_hat residual
## 1 14.9 14.92020 -0.02020496
## 2 15.0 15.85501 -0.85501292
## 3 14.3 13.05059 1.24941095
## 4 17.0 14.92020 2.07979504
## 5 16.0 15.38761 0.61239106
61 / 75

Calculate the squared residuals

y_hat <- lm(Weight ~ WingLength, data = Sparrows) %>%
  predict()
Sparrows %>%
  mutate(y_hat = y_hat,
         residual = Weight - y_hat,
         residual_2 = residual^2) %>%
  select(Weight, y_hat, residual, residual_2) %>%
  slice(1:5)
## Weight y_hat residual residual_2
## 1 14.9 14.92020 -0.02020496 0.0004082405
## 2 15.0 15.85501 -0.85501292 0.7310470869
## 3 14.3 13.05059 1.24941095 1.5610277150
## 4 17.0 14.92020 2.07979504 4.3255474012
## 5 16.0 15.38761 0.61239106 0.3750228116
62 / 75

Calculate the SSE

y_hat <- lm(Weight ~ WingLength, data = Sparrows) %>%
  predict()
Sparrows %>%
  mutate(y_hat = y_hat,
         residual = Weight - y_hat,
         residual_2 = residual^2) %>%
  summarise(sse = sum(residual_2))
## sse
## 1 223.3107
63 / 75

Calculate the regression standard error

y_hat <- lm(Weight ~ WingLength, data = Sparrows) %>%
  predict()
Sparrows %>%
  mutate(y_hat = y_hat,
         residual = Weight - y_hat,
         residual_2 = residual^2) %>%
  summarise(sse = sum(residual_2),
            n = n(),
            rse = sqrt(sse / (n - 2)))
## sse n rse
## 1 223.3107 116 1.399595
64 / 75

Calculate the regression standard error

## sse n rse
## 1 223.3107 116 1.399595
lm(Weight ~ WingLength, data = Sparrows) %>%
  summary()
##
## Call:
## lm(formula = Weight ~ WingLength, data = Sparrows)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5440 -0.9935 0.0809 1.0559 3.4168
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.36549 0.95731 1.426 0.156
## WingLength 0.46740 0.03472 13.463 <2e-16
##
## Residual standard error: 1.4 on 114 degrees of freedom
## Multiple R-squared: 0.6139, Adjusted R-squared: 0.6105
## F-statistic: 181.3 on 1 and 114 DF, p-value: < 2.2e-16
65 / 75

Porsche Price

  • Go to RStudio Cloud and open Porsche Price
  • For each question you work on, set the eval chunk option to TRUE and knit
66 / 75

Linearity

The overall relationship between the variables has a linear pattern. The average values of the response y for each value of x fall on a common straight line.

67 / 75

Zero Mean

The error distribution is centered at zero. This means that the points are scattered at random above and below the line. (Note: By using least squares regression, we force the residual mean to be zero. Other techniques would not necessarily satisfy this condition.)

68 / 75

Constant Variance

The variability in the errors is the same for all values of the predictor variable. This means that the spread of points around the line remains fairly constant.

69 / 75

Independence

The errors are assumed to be independent from one another. Thus, one point falling above or below the line has no influence on the location of another point. When we are interested in using the model to make formal inferences (conducting hypothesis tests or providing confidence intervals), additional assumptions are needed.

70 / 75

Random

The data are obtained using a random process. Most commonly, this arises either from random sampling from a population of interest or from the use of randomization in a statistical experiment.

71 / 75

Normality

In order to use standard distributions for confidence intervals and hypothesis tests, we often need to assume that the random errors follow a normal distribution.

72 / 75

Summarise conditions

For a quantitative response variable y and a single quantitative explanatory variable x, the simple linear regression model is

y=β0+β1x+ϵ

where ϵ follows a normal distribution, that is, ϵ ∼ N(0, σϵ), and the errors are independent from one another.

73 / 75
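One way to internalize these conditions is to simulate data that satisfies them exactly and refit the model; here is a sketch with made-up parameter values (β0 = 1.37, β1 = 0.47, σϵ = 1.4, loosely echoing the sparrow fit):

```r
set.seed(2021)
n     <- 100
b0    <- 1.37   # true intercept (chosen for illustration)
b1    <- 0.47   # true slope
sigma <- 1.4    # true sd of the errors

x   <- runif(n, 20, 35)                # predictor values
eps <- rnorm(n, mean = 0, sd = sigma)  # errors: independent, mean 0, constant sd, normal
y   <- b0 + b1 * x + eps               # the simple linear model, exactly

fit <- lm(y ~ x)
coef(fit)  # estimates should land near b0 = 1.37 and b1 = 0.47
```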

broom

  • You're familiar with the tidyverse:
library(tidyverse)
  • The broom package takes the messy output of built-in functions in R, such as lm, and turns them into tidy data frames.
library(broom)
## Warning: package 'broom' was built under R version 3.5.2
74 / 75

Porsche Price

  • Go to RStudio Cloud and open Porsche Price
75 / 75
