Data and visualization
📉

Dr. Lucy D’Agostino McGowan

1 / 61

Starwars

  • Go to RStudio Cloud and launch the Starwars project.
2 / 61

Exploratory data analysis

3 / 61

What is EDA?

  • Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics.
  • Often, this is visual. That's what we're focusing on today.
  • But we might also calculate summary statistics and perform data wrangling/manipulation/transformation at (or before) this stage of the analysis. That's what we'll focus on next.
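
As a small taste of the summary-statistics side of EDA, here is a minimal sketch that counts characters and computes the median height within each gender. It assumes the starwars data that ships with dplyr; the column names n_characters and median_height are illustrative choices, not part of the original slides.

```r
library(dplyr)

# One row per gender: how many characters, and their median height (cm).
# na.rm = TRUE drops characters whose height is missing.
gender_summary <- starwars %>%
  group_by(gender) %>%
  summarize(
    n_characters  = n(),
    median_height = median(height, na.rm = TRUE)
  )

gender_summary
```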
4 / 61

Data visualization

5 / 61

Data visualization

"The simple graph has brought more information to the data analyst’s mind than any other device." — John Tukey

  • Data visualization is the creation and study of the visual representation of data.
  • There are many tools for visualizing data (R is one of them), and many approaches/systems within R for making data visualizations (ggplot2 is one of them, and that's what we're going to use).
6 / 61

ggplot2 ∈ tidyverse

  • ggplot2 is the tidyverse's data visualization package
  • The gg in "ggplot2" stands for Grammar of Graphics
  • It is inspired by the book The Grammar of Graphics by Leland Wilkinson
  • A grammar of graphics is a tool that enables us to concisely describe the components of a graphic

Source: BloggoType

7 / 61

Which function creates the plot?

ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point() +
  labs(title = "Mass vs. height of Starwars characters",
       x = "Height (cm)", y = "Weight (kg)")
## Warning: Removed 28 rows containing missing values (geom_point).

8 / 61

What is the dataset being plotted?

ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point() +
  labs(title = "Mass vs. height of Starwars characters",
       x = "Height (cm)", y = "Weight (kg)")
## Warning: Removed 28 rows containing missing values (geom_point).

9 / 61

Which variables are on the x-axis and y-axis?

ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point() +
  labs(title = "Mass vs. height of Starwars characters",
       x = "Height (cm)", y = "Weight (kg)")
## Warning: Removed 28 rows containing missing values (geom_point).

10 / 61

What about that warning?

ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point() +
  labs(title = "Mass vs. height of Starwars characters",
       x = "Height (cm)", y = "Weight (kg)")
## Warning: Removed 28 rows containing missing values (geom_point).

11 / 61

What does geom_smooth() do? What else changed between the previous plot and this one?

ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point() +
  geom_smooth() +
  labs(title = "Mass vs. height of Starwars characters",
       x = "Height (cm)", y = "Weight (kg)")

12 / 61

Hello ggplot2!

13 / 61

ggplot2 components

  • data
  • aesthetic mapping
  • layer(s)
14 / 61

ggplot2 premise

15 / 61

Hello ggplot2!

  • ggplot() is the main function in ggplot2, and plots are constructed in layers
  • The structure of the code for plots can often be summarized as
ggplot +
  geom_xxx

or, more precisely

ggplot(data = [dataset], mapping = aes(x = [x-variable], y = [y-variable])) +
  geom_xxx() +
  other options
21 / 61

Hello ggplot2!

  • To use ggplot2 functions, first load tidyverse
library(tidyverse)
22 / 61

Visualizing Star Wars

23 / 61

Dataset terminology

What does each row represent? What does each column represent?

starwars
## # A tibble: 5 x 13
##   name  height  mass hair_color skin_color eye_color birth_year gender
##   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
## 1 Luke…    172    77 blond      fair       blue            19   male
## 2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>
## 3 R2-D2     96    32 <NA>       white, bl… red             33   <NA>
## 4 Dart…    202   136 none       white      yellow          41.9 male
## 5 Leia…    150    49 brown      light      brown           19   female
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
  • Each row is an observation
  • Each column is a variable
24 / 61

Luke Skywalker

25 / 61

What's in the Star Wars data?

Take a glimpse at the data:

glimpse(starwars)
## Observations: 87
## Variables: 13
## $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "L…
## $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, …
## $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.…
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "bro…
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "lig…
## $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "…
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, …
## $ gender     <chr> "male", NA, NA, "male", "female", "male", "female", N…
## $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaa…
## $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human",…
## $ films      <list> [<"Revenge of the Sith", "Return of the Jedi", "The …
## $ vehicles   <list> [<"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <…
## $ starships  <list> [<"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanc…
26 / 61

What's in the Star Wars data?

How many rows and columns does this dataset have? What does each row represent? What does each column represent?

Run the following in the Console to view the help

?starwars
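
Besides reading the help page, the dimensions can be checked directly in the Console. This is a minimal sketch using base R functions on the starwars data from dplyr; the object names n_rows and n_cols are just illustrative.

```r
library(dplyr)

n_rows <- nrow(starwars)  # number of observations (one per character)
n_cols <- ncol(starwars)  # number of variables

dim(starwars)    # rows and columns together
names(starwars)  # the variable names
```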

27 / 61

Mass vs. height

ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point()
## Warning: Removed 28 rows containing missing values (geom_point).

28 / 61

What's that warning?

  • Not all characters have height and mass information (hence 28 of them not plotted)
## Warning: Removed 28 rows containing missing values (geom_point).
  • Going forward I'll suppress the warning to save room on slides, but it's important to note it
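
To see where the warning comes from, one option is to count the missing values yourself. The sketch below assumes the starwars data from dplyr; the column names in missing_counts are illustrative. geom_point() drops any row where x or y is missing, so missing_either should line up with the number reported in the warning. Setting na.rm = TRUE in the geom is one way to drop those rows silently instead.

```r
library(dplyr)
library(ggplot2)

# How many characters are missing height, mass, or at least one of the two?
missing_counts <- starwars %>%
  summarize(
    missing_height = sum(is.na(height)),
    missing_mass   = sum(is.na(mass)),
    missing_either = sum(is.na(height) | is.na(mass))
  )

missing_counts

# Same scatterplot, but rows with missing values are removed without a warning
p_quiet <- ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point(na.rm = TRUE)

p_quiet
```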
29 / 61

Mass vs. height

How would you describe this relationship? What other variables would help us understand data points that don't follow the overall trend? Who is the not-so-tall but very heavy character?

ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point() +
  labs(title = "Mass vs. height of Starwars characters",
       x = "Height (cm)", y = "Weight (kg)")

30 / 61

Jabba!

31 / 61

Additional variables

We can map additional variables to various features of the plot:

  • aesthetics
    • color
    • size
    • shape
    • alpha (transparency)
  • faceting: small multiples displaying different subsets
32 / 61

Aesthetics

33 / 61

Aesthetics options

Visual characteristics of plotting characters that can be mapped to a specific variable in the data are

  • color
  • size
  • shape
  • alpha (transparency)
34 / 61

Mass vs. height + gender

ggplot(data = starwars, mapping = aes(x = height, y = mass, color = gender)) +
  geom_point()

35 / 61

Mass vs. height + gender

Let's map the size to birth_year:

ggplot(data = starwars,
       mapping = aes(x = height, y = mass, color = gender,
                     size = birth_year)) +
  geom_point()

36 / 61

Mass vs. height + gender

Let's now increase the size of all points, without basing it on the values of a variable in the data:

ggplot(data = starwars, mapping = aes(x = height, y = mass, color = gender)) +
  geom_point(size = 2)

37 / 61

Aesthetics summary

  • Continuous variables are measured on a continuous scale
  • Discrete variables are measured (or often counted) on a discrete scale

aesthetics   discrete                   continuous
color        rainbow of colors          gradient
size         discrete steps             linear mapping between radius and value
shape        different shape for each   shouldn't (and doesn't) work

  • Use aesthetics to map features of a plot to a variable; define features in the geom for customization that is not mapped to a variable
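
The difference between mapping and setting can be sketched with two other aesthetics from the list, shape and alpha. This is an illustrative example, not a plot from the original slides; the object names p_mapped and p_set and the color "steelblue" are arbitrary choices.

```r
library(dplyr)
library(ggplot2)

# Mapped: shape varies with gender, so it goes inside aes()
p_mapped <- ggplot(data = starwars,
                   mapping = aes(x = height, y = mass, shape = gender)) +
  geom_point(na.rm = TRUE)

# Set: alpha and color are the same for every point, so they go in the geom
p_set <- ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point(alpha = 0.5, color = "steelblue", na.rm = TRUE)

p_mapped
p_set
```

Only the mapped version gets a legend, because only there does the feature carry information about the data.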
38 / 61

Faceting

39 / 61

Faceting options

  • Smaller plots that display different subsets of the data
  • Useful for exploring conditional relationships and large data
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  facet_grid(. ~ gender) +
  geom_point() +
  labs(title = "Mass vs. height of Starwars characters",
       subtitle = "Faceted by gender",
       x = "Height (cm)", y = "Weight (kg)")

40 / 61

Dive further...

In the next few slides describe what each plot displays. Think about how the code relates to the output.




The plots in the next few slides do not have proper titles, axis labels, etc. because we want you to figure out what's happening in the plots. But you should always label your plots!

41 / 61
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point() +
  facet_grid(gender ~ .)

42 / 61
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point() +
  facet_grid(. ~ gender)

43 / 61
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point() +
  facet_wrap(~ eye_color)

44 / 61

Facet summary

  • facet_grid():
    • 2d grid
    • rows ~ cols
    • use . for no split
  • facet_wrap(): 1d ribbon wrapped into 2d
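
The earlier facet_grid() examples split on one variable only. A full rows ~ cols grid might look like the sketch below. Note that is_human is a hypothetical helper column created here for illustration; it is not part of the starwars data or the original slides.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper column: label each character as human, not human,
# or unknown when species is missing
starwars_flagged <- starwars %>%
  mutate(is_human = if_else(species == "Human", "Human", "Not human",
                            missing = "Unknown species"))

# 2d grid: gender defines the rows, is_human defines the columns
p_grid <- ggplot(data = starwars_flagged,
                 mapping = aes(x = height, y = mass)) +
  geom_point(na.rm = TRUE) +
  facet_grid(gender ~ is_human)

# 1d ribbon of panels, wrapped into two columns
p_wrap <- ggplot(data = starwars_flagged,
                 mapping = aes(x = height, y = mass)) +
  geom_point(na.rm = TRUE) +
  facet_wrap(~ is_human, ncol = 2)

p_grid
p_wrap
```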
45 / 61

More ggplot2 info:

ggplot2 in 2

https://leanpub.com/ggplot2in2/c/sta-212-f19

46 / 61

Starwars

  • Go to RStudio Cloud and launch the Starwars project.
  • Open and knit the R Markdown document
47 / 61

Identifying variables

48 / 61

Types of variables

  • Numerical variables (quantitative variables) can be classified as continuous or discrete, depending on whether the variable can take on an infinite number of values or only non-negative whole numbers, respectively.
  • A categorical variable is ordinal if its levels have a natural ordering.
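
In R, a first hint about a variable's type is how it is stored. The sketch below checks the storage class of a few starwars columns and then builds an ordered factor as an example of an ordinal variable. The size_class variable and its cut points (150 and 200 cm) are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Storage class of a few columns: height and mass are numerical,
# gender is stored as text (a categorical variable)
sapply(starwars[c("height", "mass", "gender")], class)

# Hypothetical ordinal variable: height binned into ordered categories
size_class <- cut(starwars$height,
                  breaks = c(0, 150, 200, Inf),
                  labels = c("short", "medium", "tall"),
                  ordered_result = TRUE)

is.ordered(size_class)  # the levels have a natural ordering
levels(size_class)
```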
49 / 61

Visualizing numerical data

50 / 61

Describing shapes of numerical distributions

  • shape:
    • skewness: right-skewed, left-skewed, symmetric (skew is to the side of the longer tail)
    • modality: unimodal, bimodal, multimodal, uniform
  • center: mean (mean), median (median), mode (not always useful)
  • spread: range (range), standard deviation (sd), inter-quartile range (IQR)
  • unusual observations
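
The functions named above can be applied directly to a column. Here is a minimal sketch for the height variable in the starwars data; na.rm = TRUE is needed because some heights are missing, and the object names are illustrative.

```r
library(dplyr)

height_cm <- starwars$height

# Center
height_center <- c(
  mean   = mean(height_cm, na.rm = TRUE),
  median = median(height_cm, na.rm = TRUE)
)

# Spread
height_spread <- c(
  sd  = sd(height_cm, na.rm = TRUE),
  IQR = IQR(height_cm, na.rm = TRUE)
)

# range() returns the minimum and maximum
height_range <- range(height_cm, na.rm = TRUE)

height_center
height_spread
height_range
```

If the mean and median differ noticeably, that is a hint the distribution is skewed or contains unusual observations.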
51 / 61

Histograms

ggplot(data = starwars, mapping = aes(x = height)) +
  geom_histogram(binwidth = 10)

52 / 61

Density plots

ggplot(data = starwars, mapping = aes(x = height)) +
  geom_density()

53 / 61

Side-by-side box plots

ggplot(data = starwars, mapping = aes(y = height, x = gender)) +
  geom_boxplot()

54 / 61

Side-by-side box plots

ggplot(data = starwars, mapping = aes(y = height, x = gender)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter()

55 / 61

Show your data!

56 / 61

Visualizing categorical data

57 / 61

Bar plots

ggplot(data = starwars, mapping = aes(x = gender)) +
  geom_bar()

58 / 61

Segmented bar plots, counts

ggplot(data = starwars,
       mapping = aes(x = gender, fill = hair_color)) +
  geom_bar()

59 / 61

Segmented bar plots, proportions

ggplot(data = starwars,
       mapping = aes(x = gender, fill = hair_color)) +
  geom_bar(position = "fill") +
  labs(y = "proportion")
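
The proportions that position = "fill" displays can also be computed as a table, which helps when checking what the bars mean. This is a minimal sketch with dplyr; the names hair_props and proportion are illustrative.

```r
library(dplyr)

# Count each gender / hair color combination, then divide by the
# total within each gender so the proportions sum to 1 per gender
hair_props <- starwars %>%
  count(gender, hair_color) %>%
  group_by(gender) %>%
  mutate(proportion = n / sum(n)) %>%
  ungroup()

hair_props
```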

60 / 61

Which plot is more appropriate?

Which plot is a more useful representation for visualizing the relationship between gender and height?

61 / 61
