Tidy data and data wrangling
🔧


NC bike crashes

  • Go to RStudio Cloud and open NC bike crashes

Tidy data

Tidy data

Happy families are all alike; every unhappy family is unhappy in its own way.

Leo Tolstoy

Characteristics of tidy data:

  • Each variable forms a column.
  • Each observation forms a row.
  • Each type of observational unit forms a table.

Characteristics of untidy data:

!@#$%^&*()
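
One common way data ends up untidy is when the values of a variable are hidden in the column names. The following is a minimal sketch of that case, assuming the tidyr package is installed (it is not used elsewhere in these slides); the counties and counts are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical, made-up counts: one column per year means the
# variable "year" lives in the column names (untidy).
untidy <- tibble(
  county = c("A", "B"),
  `2013` = c(10, 4),
  `2014` = c(12, 6)
)

# pivot_longer() moves the year columns into rows so that each
# variable (county, year, crashes) forms its own column.
tidy <- untidy %>%
  pivot_longer(cols = -county, names_to = "year", values_to = "crashes")

tidy
```

After reshaping, each row is one county-year observation, which matches the three characteristics listed above.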


Summary tables

Is each of the following a dataset or a summary table?

## # A tibble: 87 x 3
##    name               height  mass
##    <chr>               <int> <dbl>
##  1 Luke Skywalker        172    77
##  2 C-3PO                 167    75
##  3 R2-D2                  96    32
##  4 Darth Vader           202   136
##  5 Leia Organa           150    49
##  6 Owen Lars             178   120
##  7 Beru Whitesun lars    165    75
##  8 R5-D4                  97    32
##  9 Biggs Darklighter     183    84
## 10 Obi-Wan Kenobi        182    77
## # … with 77 more rows

## # A tibble: 5 x 2
##   gender        avg_height
##   <chr>              <dbl>
## 1 <NA>                120
## 2 female              165.
## 3 hermaphrodite       175
## 4 male                179.
## 5 none                200
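
To make the distinction concrete, here is a sketch of how two outputs of this shape could be produced from the starwars data that ships with dplyr. The exact code behind the slide is not shown in the source, so this is an assumption, and the categories in gender (and therefore the averages) depend on the installed dplyr version.

```r
library(dplyr)

# A dataset: one row per character (observation).
starwars %>%
  select(name, height, mass)

# A summary table: one row per group, computed from the dataset.
# na.rm = TRUE drops missing heights before averaging.
avg_heights <- starwars %>%
  group_by(gender) %>%
  summarise(avg_height = mean(height, na.rm = TRUE))

avg_heights
```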

Pipes


Where does the name come from?

The pipe operator is implemented in the magrittr package; it is pronounced "and then".

[Images: pipe; magrittr]


Review: How does a pipe work?

  • You can think about the following sequence of actions: find keys, start car, drive to campus, park.
  • Expressed as a set of nested functions in R pseudocode, this would look like:
park(drive(start_car(find("keys")), to = "campus"))
  • Writing it out using pipes gives it a more natural (and easier to read) structure:
find("keys") %>%
  start_car() %>%
  drive(to = "campus") %>%
  park()
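
The car example is pseudocode, so it cannot be run. As a runnable counterpart, this small sketch applies the same idea to real functions; the vector x is an arbitrary example value.

```r
library(magrittr)

x <- c(1, 4, 4)

# Nested: read from the inside out.
nested <- sqrt(sum(x))

# Piped: read top to bottom, "take x, and then sum, and then sqrt".
piped <- x %>%
  sum() %>%
  sqrt()

# Both forms compute the same value.
identical(nested, piped)
```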

What about other arguments?

To send the result to an argument other than the first one, or to use the previous result in multiple arguments, use . as a placeholder:

starwars %>%
  filter(species == "Human") %>%
  lm(mass ~ height, data = .)
##
## Call:
## lm(formula = mass ~ height, data = .)
##
## Coefficients:
## (Intercept)       height
##     -116.58         1.11
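
The slide shows the placeholder in a non-first argument. The sketch below adds the second case mentioned in the text, using the placeholder more than once in one call. It uses the built-in mtcars data and an arbitrary numeric vector rather than anything from the slides.

```r
library(magrittr)

# `.` marks where the left-hand side goes when it is not the first argument.
fit <- mtcars %>%
  lm(mpg ~ wt, data = .)

# `.` can also appear more than once in the same call:
# here the vector is pasted to its own reverse.
pairs <- c(1, 2, 3) %>%
  paste(., rev(.))

pairs
```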

Data wrangling


Bike crashes in NC 2007 - 2014

The dataset is in the dsbox package:

library(dsbox)
ncbikecrash

Variables

View the names of variables via

names(ncbikecrash)
## [1] "object_id" "city" "county"
## [4] "region" "development" "locality"
## [7] "on_road" "rural_urban" "speed_limit"
## [10] "traffic_control" "weather" "workzone"
## [13] "bike_age" "bike_age_group" "bike_alcohol"
## [16] "bike_alcohol_drugs" "bike_direction" "bike_injury"
## [19] "bike_position" "bike_race" "bike_sex"
## [22] "driver_age" "driver_age_group" "driver_alcohol"
## [25] "driver_alcohol_drugs" "driver_est_speed" "driver_injury"
## [28] "driver_race" "driver_sex" "driver_vehicle_type"
## [31] "crash_alcohol" "crash_date" "crash_day"
## [34] "crash_group" "crash_hour" "crash_location"
## [37] "crash_month" "crash_severity" "crash_time"
## [40] "crash_type" "crash_year" "ambulance_req"
## [43] "hit_run" "light_condition" "road_character"
## [46] "road_class" "road_condition" "road_configuration"
## [49] "road_defects" "road_feature" "road_surface"
## [52] "num_bikes_ai" "num_bikes_bi" "num_bikes_ci"
## [55] "num_bikes_ki" "num_bikes_no" "num_bikes_to"
## [58] "num_bikes_ui" "num_lanes" "num_units"
## [61] "distance_mi_from" "frm_road" "rte_invd_cd"
## [64] "towrd_road" "geo_point" "geo_shape"

Variables

See detailed descriptions with ?ncbikecrash.


Viewing your data

  • In the Environment pane, after loading the data with data(ncbikecrash), click on the name of the data frame to view it in the data viewer
  • Use the glimpse function to take a peek
glimpse(ncbikecrash)
## Observations: 7,467
## Variables: 66
## $ object_id <int> 1686, 1674, 1673, 1687, 1653, 1665, 1642, 1…
## $ city <chr> "None - Rural Crash", "Henderson", "None - …
## $ county <chr> "Wayne", "Vance", "Lincoln", "Columbus", "N…
## $ region <chr> "Coastal", "Piedmont", "Piedmont", "Coastal…
## $ development <chr> "Farms, Woods, Pastures", "Residential", "F…
## $ locality <chr> "Rural (<30% Developed)", "Mixed (30% To 70…
## $ on_road <chr> "SR 1915", "NICHOLAS ST", "US 321", "W BURK…
## $ rural_urban <chr> "Rural", "Urban", "Rural", "Urban", "Urban"…
## $ speed_limit <chr> "50 - 55 MPH", "30 - 35 MPH", "50 - 55 M…
## $ traffic_control <chr> "No Control Present", "Stop Sign", "Double …
## $ weather <chr> "Clear", "Clear", "Clear", "Rain", "Clear",…
## $ workzone <chr> "No", "No", "No", "No", "No", "No", "No", "…
## $ bike_age <chr> "52", "66", "33", "52", "22", "15", "41", "…
## $ bike_age_group <chr> "50-59", "60-69", "30-39", "50-59", "20-24"…
## $ bike_alcohol <chr> "No", "No", "No", "Yes", "No", "No", "No", …
## $ bike_alcohol_drugs <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ bike_direction <chr> "With Traffic", "With Traffic", "With Traff…
## $ bike_injury <chr> "B: Evident Injury", "C: Possible Injury", …
## $ bike_position <chr> "Bike Lane / Paved Shoulder", "Travel Lane"…
## $ bike_race <chr> "Black", "Black", "White", "Black", "White"…
## $ bike_sex <chr> "Male", "Male", "Male", "Male", "Female", "…
## $ driver_age <chr> "34", NA, "37", "55", "25", "17", NA, "50",…
## $ driver_age_group <chr> "30-39", NA, "30-39", "50-59", "25-29", "0-…
## $ driver_alcohol <chr> "No", "Missing", "No", "No", "No", "No", "M…
## $ driver_alcohol_drugs <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ driver_est_speed <chr> "51-55 mph", "6-10 mph", "41-45 mph", "11-1…
## $ driver_injury <chr> "O: No Injury", "Unknown Injury", "O: No In…
## $ driver_race <chr> "White", "Unknown/Missing", "Hispanic", "Bl…
## $ driver_sex <chr> "Male", NA, "Female", "Male", "Male", "Fema…
## $ driver_vehicle_type <chr> "Single Unit Truck (2-Axle, 6-Tire)", NA, "…
## $ crash_alcohol <chr> "No", "No", "No", "Yes", "No", "No", "No", …
## $ crash_date <chr> "11DEC2013", "20NOV2013", "03NOV2013", "14D…
## $ crash_day <chr> "Wednesday", "Wednesday", "Sunday", "Saturd…
## $ crash_group <chr> "Motorist Overtaking Bicyclist", "Bicyclist…
## $ crash_hour <int> 6, 20, 18, 18, 13, 17, 17, 7, 15, 2, 12, 22…
## $ crash_location <chr> "Non-Intersection", "Intersection", "Non-In…
## $ crash_month <chr> "December", "November", "November", "Decemb…
## $ crash_severity <chr> "B: Evident Injury", "C: Possible Injury", …
## $ crash_time <drtn> 06:10:00, 20:41:00, 18:05:00, 18:34:00, 13…
## $ crash_type <chr> "Motorist Overtaking - Undetected Bicyclist…
## $ crash_year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ ambulance_req <chr> "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Y…
## $ hit_run <chr> "No", "Yes", "No", "No", "No", "No", "Yes",…
## $ light_condition <chr> "Dark - Roadway Not Lighted", NA, "Dark - R…
## $ road_character <chr> "Straight - Level", "Straight - Level", "St…
## $ road_class <chr> "State Secondary Route", "Local Street", "U…
## $ road_condition <chr> "Dry", "Dry", "Dry", "Water (Standing, Movi…
## $ road_configuration <chr> "Two-Way, Not Divided", "Two-Way, Divided, …
## $ road_defects <chr> "None", NA, "None", "None", "None", "None",…
## $ road_feature <chr> "No Special Feature", "T-Intersection", "No…
## $ road_surface <chr> "Coarse Asphalt", "Smooth Asphalt", "Smooth…
## $ num_bikes_ai <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_bi <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_ci <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_ki <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_no <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_to <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_ui <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_lanes <chr> "2 lanes", "2 lanes", "2 lanes", "1 lane", …
## $ num_units <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ distance_mi_from <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0"…
## $ frm_road <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ rte_invd_cd <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ towrd_road <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ geo_point <chr> "35.3336070056, -77.9955023901", "36.315187…
## $ geo_shape <chr> "{\"type\": \"Point\", \"coordinates\": [-7…

A Grammar of Data Manipulation

dplyr is based on the concept of functions as verbs that manipulate data frames.

  • filter: pick rows matching criteria
  • slice: pick rows using index(es)
  • select: pick columns by name
  • pull: grab a column as a vector
  • arrange: reorder rows
  • mutate: add new variables
  • distinct: filter for unique rows
  • sample_n / sample_frac: randomly sample rows
  • summarise: reduce variables to values
  • ... (many more)

dplyr rules for functions

  • First argument is always a data frame
  • Subsequent arguments say what to do with that data frame
  • Always return a data frame (pull, which returns a vector, is the exception)
  • Never modify the data frame in place
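
The last rule is easy to check. A minimal sketch with a made-up three-row data frame (hypothetical values, not the real ncbikecrash data):

```r
library(dplyr)

# Hypothetical toy data for illustration.
crashes <- tibble(
  county = c("Durham", "Wake", "Durham"),
  crash_hour = c(6L, 20L, 18L)
)

# filter() returns a new data frame ...
durham <- crashes %>%
  filter(county == "Durham")

nrow(durham)   # 2
# ... and the original stays as it was unless you reassign it.
nrow(crashes)  # 3
```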

A note on piping and layering

  • The %>% operator in dplyr functions is called the pipe operator. This means you "pipe" the output of the previous line of code as the first input of the next line of code.
  • The + operator in ggplot2 functions is used for "layering". This means you create the plot in layers, separated by +.
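
The two operators often appear in the same pipeline. The sketch below, which assumes ggplot2 is installed and uses the built-in mtcars data, pipes between the data steps and then layers with + once the plot has been started.

```r
library(dplyr)
library(ggplot2)

# Pipe (%>%) between data-wrangling steps,
# then layer (+) between plot components.
p <- mtcars %>%
  filter(cyl == 4) %>%
  ggplot(aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight", y = "Miles per gallon")

p
```

Using %>% between ggplot2 layers instead of + is a common slip and results in an error.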

filter to select a subset of rows

for crashes in Durham County

ncbikecrash %>%
  filter(county == "Durham")
## # A tibble: 340 x 66
## object_id city county region development locality on_road rural_urban
## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2452 Durh… Durham Piedm… Residential Urban (… <NA> Urban
## 2 2441 Durh… Durham Piedm… Commercial Urban (… <NA> Urban
## 3 2466 Durh… Durham Piedm… Commercial Urban (… <NA> Urban
## 4 549 Durh… Durham Piedm… Residential Urban (… PARK A… Urban
## 5 598 Durh… Durham Piedm… Residential Urban (… BELT S… Urban
## 6 603 Durh… Durham Piedm… Residential Urban (… HINSON… Urban
## 7 3974 Durh… Durham Piedm… Commercial Urban (… <NA> Urban
## 8 7134 Durh… Durham Piedm… Commercial Urban (… <NA> Urban
## 9 1670 Durh… Durham Piedm… Commercial Urban (… INFINI… Urban
## 10 1773 Durh… Durham Piedm… Residential Urban (… <NA> Urban
## # … with 330 more rows, and 58 more variables: speed_limit <chr>,
## # traffic_control <chr>, weather <chr>, workzone <chr>, bike_age <chr>,
## # bike_age_group <chr>, bike_alcohol <chr>, bike_alcohol_drugs <chr>,
## # bike_direction <chr>, bike_injury <chr>, bike_position <chr>,
## # bike_race <chr>, bike_sex <chr>, driver_age <chr>,
## # driver_age_group <chr>, driver_alcohol <chr>,
## # driver_alcohol_drugs <chr>, driver_est_speed <chr>,
## # driver_injury <chr>, driver_race <chr>, driver_sex <chr>,
## # driver_vehicle_type <chr>, crash_alcohol <chr>, crash_date <chr>,
## # crash_day <chr>, crash_group <chr>, crash_hour <int>,
## # crash_location <chr>, crash_month <chr>, crash_severity <chr>,
## # crash_time <drtn>, crash_type <chr>, crash_year <int>,
## # ambulance_req <chr>, hit_run <chr>, light_condition <chr>,
## # road_character <chr>, road_class <chr>, road_condition <chr>,
## # road_configuration <chr>, road_defects <chr>, road_feature <chr>,
## # road_surface <chr>, num_bikes_ai <int>, num_bikes_bi <int>,
## # num_bikes_ci <int>, num_bikes_ki <int>, num_bikes_no <int>,
## # num_bikes_to <int>, num_bikes_ui <int>, num_lanes <chr>,
## # num_units <int>, distance_mi_from <chr>, frm_road <chr>,
## # rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>, geo_shape <chr>

filter for many conditions at once

for crashes in Durham County where the biker was 0-5 years old

ncbikecrash %>%
  filter(county == "Durham", bike_age_group == "0-5")
## # A tibble: 4 x 66
## object_id city county region development locality on_road rural_urban
## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 4062 Durh… Durham Piedm… Residential Urban (… <NA> Urban
## 2 414 Durh… Durham Piedm… Residential Urban (… PVA 90… Urban
## 3 3016 Durh… Durham Piedm… Residential Urban (… <NA> Urban
## 4 1383 Durh… Durham Piedm… Residential Urban (… PVA 62… Urban
## # … with 58 more variables: speed_limit <chr>, traffic_control <chr>,
## # weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>,
## # bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>,
## # bike_injury <chr>, bike_position <chr>, bike_race <chr>,
## # bike_sex <chr>, driver_age <chr>, driver_age_group <chr>,
## # driver_alcohol <chr>, driver_alcohol_drugs <chr>,
## # driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>,
## # driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>,
## # crash_date <chr>, crash_day <chr>, crash_group <chr>,
## # crash_hour <int>, crash_location <chr>, crash_month <chr>,
## # crash_severity <chr>, crash_time <drtn>, crash_type <chr>,
## # crash_year <int>, ambulance_req <chr>, hit_run <chr>,
## # light_condition <chr>, road_character <chr>, road_class <chr>,
## # road_condition <chr>, road_configuration <chr>, road_defects <chr>,
## # road_feature <chr>, road_surface <chr>, num_bikes_ai <int>,
## # num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>,
## # num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>,
## # num_lanes <chr>, num_units <int>, distance_mi_from <chr>,
## # frm_road <chr>, rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>,
## # geo_shape <chr>

Logical operators in R

operator   definition                  operator      definition
<          less than                   x | y         x OR y
<=         less than or equal to       is.na(x)      test if x is NA
>          greater than                !is.na(x)     test if x is not NA
>=         greater than or equal to    x %in% y      test if x is in y
==         exactly equal to            !(x %in% y)   test if x is not in y
!=         not equal to                !x            not x
x & y      x AND y
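
A few of these operators in use inside filter, as a minimal sketch on a made-up four-row data frame (hypothetical values, not the real ncbikecrash data):

```r
library(dplyr)

# Hypothetical toy data for illustration.
crashes <- tibble(
  county = c("Durham", "Wake", "Orange", "Durham"),
  crash_hour = c(6L, 20L, NA, 18L)
)

# & (AND): Durham crashes at or after noon
afternoon <- crashes %>%
  filter(county == "Durham" & crash_hour >= 12)

# %in%: crashes in either of two counties
two_counties <- crashes %>%
  filter(county %in% c("Durham", "Orange"))

# !is.na(): keep only rows where the hour was recorded
known_hour <- crashes %>%
  filter(!is.na(crash_hour))
```

Note that filter keeps only rows where the condition is TRUE, so rows where a comparison evaluates to NA are dropped.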

select to keep variables

ncbikecrash %>%
  filter(county == "Durham", bike_age_group == "0-5") %>%
  select(locality, speed_limit)
## # A tibble: 4 x 2
## locality speed_limit
## <chr> <chr>
## 1 Urban (>70% Developed) 30 - 35 MPH
## 2 Urban (>70% Developed) 5 - 15 MPH
## 3 Urban (>70% Developed) 20 - 25 MPH
## 4 Urban (>70% Developed) 20 - 25 MPH

select to exclude variables

ncbikecrash %>%
  select(-object_id)
## # A tibble: 7,467 x 65
## city county region development locality on_road rural_urban speed_limit
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 None… Wayne Coast… Farms, Woo… Rural (… SR 1915 Rural 50 - 55 M…
## 2 Hend… Vance Piedm… Residential Mixed (… NICHOL… Urban 30 - 35 M…
## 3 None… Linco… Piedm… Farms, Woo… Rural (… US 321 Rural 50 - 55 M…
## 4 Whit… Colum… Coast… Commercial Urban (… W BURK… Urban 30 - 35 M…
## 5 Wilm… New H… Coast… Residential Urban (… RACINE… Urban <NA>
## 6 None… Robes… Coast… Farms, Woo… Rural (… SR 1513 Rural 50 - 55 M…
## 7 None… Richm… Piedm… Residential Mixed (… SR 1903 Rural 30 - 35 M…
## 8 Rale… Wake Piedm… Commercial Urban (… PERSON… Urban 30 - 35 M…
## 9 Whit… Colum… Coast… Residential Rural (… FLOWER… Urban 30 - 35 M…
## 10 New … Craven Coast… Residential Urban (… SUTTON… Urban 20 - 25 M…
## # … with 7,457 more rows, and 57 more variables: traffic_control <chr>,
## # weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>,
## # bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>,
## # bike_injury <chr>, bike_position <chr>, bike_race <chr>,
## # bike_sex <chr>, driver_age <chr>, driver_age_group <chr>,
## # driver_alcohol <chr>, driver_alcohol_drugs <chr>,
## # driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>,
## # driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>,
## # crash_date <chr>, crash_day <chr>, crash_group <chr>,
## # crash_hour <int>, crash_location <chr>, crash_month <chr>,
## # crash_severity <chr>, crash_time <drtn>, crash_type <chr>,
## # crash_year <int>, ambulance_req <chr>, hit_run <chr>,
## # light_condition <chr>, road_character <chr>, road_class <chr>,
## # road_condition <chr>, road_configuration <chr>, road_defects <chr>,
## # road_feature <chr>, road_surface <chr>, num_bikes_ai <int>,
## # num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>,
## # num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>,
## # num_lanes <chr>, num_units <int>, distance_mi_from <chr>,
## # frm_road <chr>, rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>,
## # geo_shape <chr>

select a range of variables

ncbikecrash %>%
  select(city:locality)
## # A tibble: 7,467 x 5
## city county region development locality
## <chr> <chr> <chr> <chr> <chr>
## 1 None - Rural … Wayne Coastal Farms, Woods, Pa… Rural (<30% Develop…
## 2 Henderson Vance Piedmo… Residential Mixed (30% To 70% D…
## 3 None - Rural … Lincoln Piedmo… Farms, Woods, Pa… Rural (<30% Develop…
## 4 Whiteville Columbus Coastal Commercial Urban (>70% Develop…
## 5 Wilmington New Hanov… Coastal Residential Urban (>70% Develop…
## 6 None - Rural … Robeson Coastal Farms, Woods, Pa… Rural (<30% Develop…
## 7 None - Rural … Richmond Piedmo… Residential Mixed (30% To 70% D…
## 8 Raleigh Wake Piedmo… Commercial Urban (>70% Develop…
## 9 Whiteville Columbus Coastal Residential Rural (<30% Develop…
## 10 New Bern Craven Coastal Residential Urban (>70% Develop…
## # … with 7,457 more rows
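
Besides names and ranges, select also accepts helper functions that match column names by pattern. These helpers are not covered in the slides; the sketch below is an addition, using a made-up one-row data frame that borrows a few ncbikecrash column names.

```r
library(dplyr)

# Hypothetical toy data reusing a few ncbikecrash column names.
crashes <- tibble(
  county = "Durham",
  bike_age = "22",
  bike_sex = "Female",
  driver_age = "25",
  driver_sex = "Male"
)

# starts_with() keeps every column whose name begins with a prefix.
bike_cols <- crashes %>%
  select(starts_with("bike_"))

names(bike_cols)

# A range and a helper can be combined in one call.
crashes %>%
  select(county:bike_sex, ends_with("_sex"))
```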

slice for certain row numbers

First five

ncbikecrash %>%
  slice(1:5)
## # A tibble: 5 x 66
## object_id city county region development locality on_road rural_urban
## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1686 None… Wayne Coast… Farms, Woo… Rural (… SR 1915 Rural
## 2 1674 Hend… Vance Piedm… Residential Mixed (… NICHOL… Urban
## 3 1673 None… Linco… Piedm… Farms, Woo… Rural (… US 321 Rural
## 4 1687 Whit… Colum… Coast… Commercial Urban (… W BURK… Urban
## 5 1653 Wilm… New H… Coast… Residential Urban (… RACINE… Urban
## # … with 58 more variables: speed_limit <chr>, traffic_control <chr>,
## # weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>,
## # bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>,
## # bike_injury <chr>, bike_position <chr>, bike_race <chr>,
## # bike_sex <chr>, driver_age <chr>, driver_age_group <chr>,
## # driver_alcohol <chr>, driver_alcohol_drugs <chr>,
## # driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>,
## # driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>,
## # crash_date <chr>, crash_day <chr>, crash_group <chr>,
## # crash_hour <int>, crash_location <chr>, crash_month <chr>,
## # crash_severity <chr>, crash_time <drtn>, crash_type <chr>,
## # crash_year <int>, ambulance_req <chr>, hit_run <chr>,
## # light_condition <chr>, road_character <chr>, road_class <chr>,
## # road_condition <chr>, road_configuration <chr>, road_defects <chr>,
## # road_feature <chr>, road_surface <chr>, num_bikes_ai <int>,
## # num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>,
## # num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>,
## # num_lanes <chr>, num_units <int>, distance_mi_from <chr>,
## # frm_road <chr>, rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>,
## # geo_shape <chr>

slice for certain row numbers

Last five

last_row <- nrow(ncbikecrash)
ncbikecrash %>%
  slice((last_row - 4):last_row)
## # A tibble: 5 x 66
## object_id city county region development locality on_road rural_urban
## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 6989 High… Guilf… Piedm… Residential Urban (… <NA> Urban
## 2 6991 Wilm… New H… Coast… Residential Urban (… <NA> Urban
## 3 6995 Kins… Lenoir Coast… Commercial Urban (… <NA> Urban
## 4 6998 Faye… Cumbe… Coast… Residential Urban (… <NA> Urban
## 5 7000 None… Onslow Coast… Farms, Woo… Rural (… <NA> Rural
## # … with 58 more variables: speed_limit <chr>, traffic_control <chr>,
## # weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>,
## # bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>,
## # bike_injury <chr>, bike_position <chr>, bike_race <chr>,
## # bike_sex <chr>, driver_age <chr>, driver_age_group <chr>,
## # driver_alcohol <chr>, driver_alcohol_drugs <chr>,
## # driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>,
## # driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>,
## # crash_date <chr>, crash_day <chr>, crash_group <chr>,
## # crash_hour <int>, crash_location <chr>, crash_month <chr>,
## # crash_severity <chr>, crash_time <drtn>, crash_type <chr>,
## # crash_year <int>, ambulance_req <chr>, hit_run <chr>,
## # light_condition <chr>, road_character <chr>, road_class <chr>,
## # road_condition <chr>, road_configuration <chr>, road_defects <chr>,
## # road_feature <chr>, road_surface <chr>, num_bikes_ai <int>,
## # num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>,
## # num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>,
## # num_lanes <chr>, num_units <int>, distance_mi_from <chr>,
## # frm_road <chr>, rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>,
## # geo_shape <chr>
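
The helper variable last_row is not strictly needed. As a sketch of two alternatives, shown on a made-up ten-row data frame: n() returns the number of rows inside dplyr verbs, and slice_tail is available in dplyr 1.0.0 and later.

```r
library(dplyr)

# Hypothetical toy data: ten numbered rows.
toy <- tibble(id = 1:10)

# n() gives the row count inside slice(), so no helper variable is needed.
last_five <- toy %>%
  slice((n() - 4):n())

# slice_tail() expresses the same intent directly (dplyr >= 1.0.0).
last_five_alt <- toy %>%
  slice_tail(n = 5)

all.equal(last_five, last_five_alt)
```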

pull to extract a column as a vector

ncbikecrash %>%
  slice(1:6) %>%
  pull(locality)
## [1] "Rural (<30% Developed)" "Mixed (30% To 70% Developed)"
## [3] "Rural (<30% Developed)" "Urban (>70% Developed)"
## [5] "Urban (>70% Developed)" "Rural (<30% Developed)"

vs.

ncbikecrash %>%
  slice(1:6) %>%
  select(locality)
## # A tibble: 6 x 1
## locality
## <chr>
## 1 Rural (<30% Developed)
## 2 Mixed (30% To 70% Developed)
## 3 Rural (<30% Developed)
## 4 Urban (>70% Developed)
## 5 Urban (>70% Developed)
## 6 Rural (<30% Developed)

sample_n / sample_frac for a random sample

  • sample_n: randomly sample 5 observations
ncbikecrash_n5 <- ncbikecrash %>%
  sample_n(5, replace = FALSE)
dim(ncbikecrash_n5)
## [1] 5 66
  • sample_frac: randomly sample 20% of observations
ncbikecrash_perc20 <- ncbikecrash %>%
  sample_frac(0.2, replace = FALSE)
dim(ncbikecrash_perc20)
## [1] 1493 66
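
Because the rows are drawn at random, rerunning the code gives a different sample each time. A minimal sketch of making the draw repeatable with set.seed, on a made-up 100-row data frame; the seed value 1234 is arbitrary.

```r
library(dplyr)

# Hypothetical toy data: 100 numbered rows.
toy <- tibble(id = 1:100)

# Setting the same seed before each draw makes the sample repeatable.
set.seed(1234)
sample_a <- toy %>%
  sample_n(5, replace = FALSE)

set.seed(1234)
sample_b <- toy %>%
  sample_n(5, replace = FALSE)

identical(sample_a$id, sample_b$id)

# In dplyr 1.0.0 and later, slice_sample() supersedes
# sample_n() and sample_frac().
toy %>%
  slice_sample(prop = 0.2)
```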

distinct to filter for unique rows

And arrange to order alphabetically

ncbikecrash %>%
  select(county, city) %>%
  distinct() %>%
  arrange(county, city)
## # A tibble: 391 x 2
## county city
## <chr> <chr>
## 1 Alamance Alamance
## 2 Alamance Burlington
## 3 Alamance Elon
## 4 Alamance Elon College
## 5 Alamance Gibsonville
## 6 Alamance Graham
## 7 Alamance Green Level
## 8 Alamance Mebane
## 9 Alamance None - Rural Crash
## 10 Alexander None - Rural Crash
## # … with 381 more rows
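
Two small variations on this pattern, sketched on made-up data: distinct can take the columns directly, and desc reverses the sort order of a column inside arrange.

```r
library(dplyr)

# Hypothetical toy data for illustration.
toy <- tibble(
  county = c("Wake", "Durham", "Wake", "Durham"),
  city = c("Raleigh", "Durham", "Cary", "Durham")
)

pairs_desc <- toy %>%
  distinct(county, city) %>%    # unique county-city pairs
  arrange(desc(county), city)   # county Z to A, then city A to Z

pairs_desc
```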

summarise to reduce variables to values

ncbikecrash %>%
  summarise(avg_hr = mean(crash_hour))
## # A tibble: 1 x 1
## avg_hr
## <dbl>
## 1 14.7
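
The call above returns a number, so crash_hour evidently has no missing values. For columns that do, mean returns NA unless told otherwise; a minimal sketch on made-up data:

```r
library(dplyr)

# Hypothetical toy data with one missing hour.
toy <- tibble(crash_hour = c(6L, 20L, NA, 18L))

# One missing value makes the mean missing ...
with_na <- toy %>%
  summarise(avg_hr = mean(crash_hour))

# ... unless it is dropped explicitly with na.rm = TRUE.
# n() counts the rows, including the one with the missing hour.
without_na <- toy %>%
  summarise(avg_hr = mean(crash_hour, na.rm = TRUE), n = n())

with_na
without_na
```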

group_by to do calculations on groups

ncbikecrash %>%
  group_by(hit_run) %>%
  summarise(avg_hr = mean(crash_hour))
## # A tibble: 2 x 2
## hit_run avg_hr
## <chr> <dbl>
## 1 No 14.6
## 2 Yes 15.0

count observations in groups

ncbikecrash %>%
  count(driver_alcohol_drugs)
## # A tibble: 6 x 2
## driver_alcohol_drugs n
## <chr> <int>
## 1 <NA> 6654
## 2 Missing 99
## 3 No 695
## 4 Yes-Alcohol, impairment suspected 12
## 5 Yes-Alcohol, no impairment detected 3
## 6 Yes-Drugs, impairment suspected 4
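
count is a shortcut for grouping and then counting rows with n(). A minimal sketch of the equivalence on made-up data:

```r
library(dplyr)

# Hypothetical toy data for illustration.
toy <- tibble(hit_run = c("No", "Yes", "No", "No"))

# count() ...
by_count <- toy %>%
  count(hit_run)

# ... gives the same counts as group_by() followed by summarise(n = n()).
by_summarise <- toy %>%
  group_by(hit_run) %>%
  summarise(n = n())

# sort = TRUE puts the largest group first.
toy %>%
  count(hit_run, sort = TRUE)
```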

mutate to add new variables

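ncbikecrash %>%
  mutate(driver_alcohol_drugs_simplified = case_when(
    # NA_character_ keeps every outcome of type character
    driver_alcohol_drugs == "Missing" ~ NA_character_,
    # str_detect() comes from the stringr package
    str_detect(driver_alcohol_drugs, "Yes") ~ "Yes",
    TRUE ~ "No"
  ))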
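
It helps to try a case_when recipe on a few representative values before applying it to a whole column. The sketch below runs the three conditions from this slide on a made-up vector; it assumes stringr is installed, and it writes the missing value as NA_character_, which works across dplyr versions.

```r
library(dplyr)
library(stringr)

# Hypothetical values mimicking driver_alcohol_drugs.
x <- c("Missing", "Yes-Alcohol, impairment suspected", "No", NA)

simplified <- case_when(
  x == "Missing" ~ NA_character_,
  str_detect(x, "Yes") ~ "Yes",
  TRUE ~ "No"
)

simplified
```

Conditions are checked in order and the first TRUE wins. An NA input matches neither of the first two conditions, so it falls through to the TRUE branch and becomes "No", which may or may not be what you want.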


"Save" when you mutate

When you define a new variable with mutate, you'll usually also want to save the resulting data frame, often by overwriting the original data frame.

ncbikecrash <- ncbikecrash %>%
  mutate(driver_alcohol_drugs_simplified = case_when(
    str_detect(driver_alcohol_drugs, "Yes") ~ "Yes",
    TRUE ~ driver_alcohol_drugs
  ))

"Save" when you mutate

When you define a new variable with mutate, you'll usually also want to save the resulting data frame, often by overwriting the original data frame. The assignment can also go at the end of the pipeline, using ->:

ncbikecrash %>%
  mutate(driver_alcohol_drugs_simplified = case_when(
    str_detect(driver_alcohol_drugs, "Yes") ~ "Yes",
    TRUE ~ driver_alcohol_drugs
  )) -> ncbikecrash

Check before you move on

ncbikecrash %>%
  count(driver_alcohol_drugs, driver_alcohol_drugs_simplified)
## # A tibble: 6 x 3
## driver_alcohol_drugs driver_alcohol_drugs_simplified n
## <chr> <chr> <int>
## 1 <NA> <NA> 6654
## 2 Missing Missing 99
## 3 No No 695
## 4 Yes-Alcohol, impairment suspected Yes 12
## 5 Yes-Alcohol, no impairment detected Yes 3
## 6 Yes-Drugs, impairment suspected Yes 4

ncbikecrash %>%
  count(driver_alcohol_drugs_simplified)
## # A tibble: 4 x 2
## driver_alcohol_drugs_simplified n
## <chr> <int>
## 1 <NA> 6654
## 2 Missing 99
## 3 No 695
## 4 Yes 19

NC bike crashes

  • Go to RStudio Cloud and open NC bike crashes
  • For each question you work on, set the eval chunk option to TRUE and knit