class: center, middle, inverse, title-slide # Tidy data and data wrangling
🔧

---

layout: true

<div class="my-footer">
<span>
<img src = "img/dsbox-logo.png" width = "30"> </img>
Slides adapted from <a href="https://datasciencebox.org" target="_blank">datasciencebox.org</a> by Dr. Lucy D'Agostino McGowan
</span>
</div>

---

## <i class="fas fa-laptop"></i> `NC bike crashes`

- Go to RStudio Cloud and open `NC bike crashes`

---

class: center, middle

# Tidy data

---

## Tidy data

>Happy families are all alike; every unhappy family is unhappy in its own way.
>
>Leo Tolstoy

--

.pull-left[
**Characteristics of tidy data:**

- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
]

--

.pull-right[
**Characteristics of untidy data:**

!@#$%^&*()
]

---

## Summary tables

.question[
Is each of the following a dataset or a summary table?
]

.small[
.pull-left[

```
## # A tibble: 87 x 3
##    name               height  mass
##    <chr>               <int> <dbl>
##  1 Luke Skywalker        172    77
##  2 C-3PO                 167    75
##  3 R2-D2                  96    32
##  4 Darth Vader           202   136
##  5 Leia Organa           150    49
##  6 Owen Lars             178   120
##  7 Beru Whitesun lars    165    75
##  8 R5-D4                  97    32
##  9 Biggs Darklighter     183    84
## 10 Obi-Wan Kenobi        182    77
## # … with 77 more rows
```
]
.pull-right[

```
## # A tibble: 5 x 2
##   gender        avg_height
##   <chr>              <dbl>
## 1 <NA>                120
## 2 female              165.
## 3 hermaphrodite       175
## 4 male                179.
## 5 none                200
```
]
]

---

class: center, middle

# Pipes

---

## Where does the name come from?

The pipe operator is implemented in the package **magrittr**; it's pronounced "and then".

.pull-left[
![pipe](img/02/magritte.jpg)
]
.pull-right[
![magrittr](img/02/magrittr.jpg)
]

---

## Review: How does a pipe work?

- You can think about the following sequence of actions - find keys, start car, drive to campus, park.
- Expressed as a set of nested functions in R pseudocode this would look like:

```r
park(drive(start_car(find("keys")), to = "campus"))
```

- Writing it out using pipes gives it a more natural (and easier to read) structure:

```r
find("keys") %>%
  start_car() %>%
  drive(to = "campus") %>%
  park()
```

---

## What about other arguments?

To send results to a function argument other than the first one, or to use the previous result for multiple arguments, use `.`:

```r
starwars %>%
  filter(species == "Human") %>%
  lm(mass ~ height, data = .)
```

```
##
## Call:
## lm(formula = mass ~ height, data = .)
##
## Coefficients:
## (Intercept)       height
##     -116.58         1.11
```

---

# Data wrangling

![](img/02/dplyr_wrangling.png)

.my-footer[
<font size="2">
Artwork by @allison_horst
</font>
]

---

## Bike crashes in NC 2007 - 2014

The dataset is in the **dsbox** package:

```r
library(dsbox)
ncbikecrash
```

---

## Variables

View the names of variables via

.small[

```r
names(ncbikecrash)
```

```
##  [1] "object_id"            "city"                 "county"
##  [4] "region"               "development"          "locality"
##  [7] "on_road"              "rural_urban"          "speed_limit"
## [10] "traffic_control"      "weather"              "workzone"
## [13] "bike_age"             "bike_age_group"       "bike_alcohol"
## [16] "bike_alcohol_drugs"   "bike_direction"       "bike_injury"
## [19] "bike_position"        "bike_race"            "bike_sex"
## [22] "driver_age"           "driver_age_group"     "driver_alcohol"
## [25] "driver_alcohol_drugs" "driver_est_speed"     "driver_injury"
## [28] "driver_race"          "driver_sex"           "driver_vehicle_type"
## [31] "crash_alcohol"        "crash_date"           "crash_day"
## [34] "crash_group"          "crash_hour"           "crash_location"
## [37] "crash_month"          "crash_severity"       "crash_time"
## [40] "crash_type"           "crash_year"           "ambulance_req"
## [43] "hit_run"              "light_condition"      "road_character"
## [46] "road_class"           "road_condition"       "road_configuration"
## [49] "road_defects"         "road_feature"         "road_surface"
## [52] "num_bikes_ai"         "num_bikes_bi"         "num_bikes_ci"
## [55] "num_bikes_ki"         "num_bikes_no"         "num_bikes_to"
## [58] "num_bikes_ui"         "num_lanes"
"num_units" ## [61] "distance_mi_from" "frm_road" "rte_invd_cd" ## [64] "towrd_road" "geo_point" "geo_shape" ``` ] --- ## Variables See detailed descriptions with `?ncbikecrash`. ![](img/02/bike-help.png) --- ## Viewing your data - In the Environment, after loading with `data(ncbikecrash)`, click on the name of the data frame to view it in the data viewer - Use the `glimpse` function to take a peek ```r glimpse(ncbikecrash) ``` ``` ## Observations: 7,467 ## Variables: 66 ## $ object_id <int> 1686, 1674, 1673, 1687, 1653, 1665, 1642, 1… ## $ city <chr> "None - Rural Crash", "Henderson", "None - … ## $ county <chr> "Wayne", "Vance", "Lincoln", "Columbus", "N… ## $ region <chr> "Coastal", "Piedmont", "Piedmont", "Coastal… ## $ development <chr> "Farms, Woods, Pastures", "Residential", "F… ## $ locality <chr> "Rural (<30% Developed)", "Mixed (30% To 70… ## $ on_road <chr> "SR 1915", "NICHOLAS ST", "US 321", "W BURK… ## $ rural_urban <chr> "Rural", "Urban", "Rural", "Urban", "Urban"… ## $ speed_limit <chr> "50 - 55 MPH", "30 - 35 MPH", "50 - 55 M… ## $ traffic_control <chr> "No Control Present", "Stop Sign", "Double … ## $ weather <chr> "Clear", "Clear", "Clear", "Rain", "Clear",… ## $ workzone <chr> "No", "No", "No", "No", "No", "No", "No", "… ## $ bike_age <chr> "52", "66", "33", "52", "22", "15", "41", "… ## $ bike_age_group <chr> "50-59", "60-69", "30-39", "50-59", "20-24"… ## $ bike_alcohol <chr> "No", "No", "No", "Yes", "No", "No", "No", … ## $ bike_alcohol_drugs <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… ## $ bike_direction <chr> "With Traffic", "With Traffic", "With Traff… ## $ bike_injury <chr> "B: Evident Injury", "C: Possible Injury", … ## $ bike_position <chr> "Bike Lane / Paved Shoulder", "Travel Lane"… ## $ bike_race <chr> "Black", "Black", "White", "Black", "White"… ## $ bike_sex <chr> "Male", "Male", "Male", "Male", "Female", "… ## $ driver_age <chr> "34", NA, "37", "55", "25", "17", NA, "50",… ## $ driver_age_group <chr> "30-39", NA, "30-39", 
"50-59", "25-29", "0-… ## $ driver_alcohol <chr> "No", "Missing", "No", "No", "No", "No", "M… ## $ driver_alcohol_drugs <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… ## $ driver_est_speed <chr> "51-55 mph", "6-10 mph", "41-45 mph", "11-1… ## $ driver_injury <chr> "O: No Injury", "Unknown Injury", "O: No In… ## $ driver_race <chr> "White", "Unknown/Missing", "Hispanic", "Bl… ## $ driver_sex <chr> "Male", NA, "Female", "Male", "Male", "Fema… ## $ driver_vehicle_type <chr> "Single Unit Truck (2-Axle, 6-Tire)", NA, "… ## $ crash_alcohol <chr> "No", "No", "No", "Yes", "No", "No", "No", … ## $ crash_date <chr> "11DEC2013", "20NOV2013", "03NOV2013", "14D… ## $ crash_day <chr> "Wednesday", "Wednesday", "Sunday", "Saturd… ## $ crash_group <chr> "Motorist Overtaking Bicyclist", "Bicyclist… ## $ crash_hour <int> 6, 20, 18, 18, 13, 17, 17, 7, 15, 2, 12, 22… ## $ crash_location <chr> "Non-Intersection", "Intersection", "Non-In… ## $ crash_month <chr> "December", "November", "November", "Decemb… ## $ crash_severity <chr> "B: Evident Injury", "C: Possible Injury", … ## $ crash_time <drtn> 06:10:00, 20:41:00, 18:05:00, 18:34:00, 13… ## $ crash_type <chr> "Motorist Overtaking - Undetected Bicyclist… ## $ crash_year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2… ## $ ambulance_req <chr> "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Y… ## $ hit_run <chr> "No", "Yes", "No", "No", "No", "No", "Yes",… ## $ light_condition <chr> "Dark - Roadway Not Lighted", NA, "Dark - R… ## $ road_character <chr> "Straight - Level", "Straight - Level", "St… ## $ road_class <chr> "State Secondary Route", "Local Street", "U… ## $ road_condition <chr> "Dry", "Dry", "Dry", "Water (Standing, Movi… ## $ road_configuration <chr> "Two-Way, Not Divided", "Two-Way, Divided, … ## $ road_defects <chr> "None", NA, "None", "None", "None", "None",… ## $ road_feature <chr> "No Special Feature", "T-Intersection", "No… ## $ road_surface <chr> "Coarse Asphalt", "Smooth Asphalt", "Smooth… ## $ num_bikes_ai <int> 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_bi         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_ci         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_ki         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_no         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_to         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_ui         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_lanes            <chr> "2 lanes", "2 lanes", "2 lanes", "1 lane", …
## $ num_units            <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ distance_mi_from     <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0"…
## $ frm_road             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ rte_invd_cd          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ towrd_road           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ geo_point            <chr> "35.3336070056, -77.9955023901", "36.315187…
## $ geo_shape            <chr> "{\"type\": \"Point\", \"coordinates\": [-7…
```

---

## A Grammar of Data Manipulation

**dplyr** is based on the concepts of functions as verbs that manipulate data frames.

.pull-left[
![](img/02/dplyr-part-of-tidyverse.png)
]
.pull-right[
.midi[
- `filter`: pick rows matching criteria
- `slice`: pick rows using index(es)
- `select`: pick columns by name
- `pull`: grab a column as a vector
- `arrange`: reorder rows
- `mutate`: add new variables
- `distinct`: filter for unique rows
- `sample_n` / `sample_frac`: randomly sample rows
- `summarise`: reduce variables to values
- ... (many more)
]
]

---

## **dplyr** rules for functions

- First argument is *always* a data frame
- Subsequent arguments say what to do with that data frame
- Always return a data frame
- Doesn't modify in place

---

## A note on piping and layering

- The `%>%` operator in **dplyr** functions is called the pipe operator. This means you "pipe" the output of the previous line of code as the first input of the next line of code.
-- - The `+` operator in **ggplot2** functions is used for "layering". This means you create the plot in layers, separated by `+`. --- ## `filter` to select a subset of rows for crashes in Durham County .small[ ```r ncbikecrash %>% * filter(county == "Durham") ``` ``` ## # A tibble: 340 x 66 ## object_id city county region development locality on_road rural_urban ## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 2452 Durh… Durham Piedm… Residential Urban (… <NA> Urban ## 2 2441 Durh… Durham Piedm… Commercial Urban (… <NA> Urban ## 3 2466 Durh… Durham Piedm… Commercial Urban (… <NA> Urban ## 4 549 Durh… Durham Piedm… Residential Urban (… PARK A… Urban ## 5 598 Durh… Durham Piedm… Residential Urban (… BELT S… Urban ## 6 603 Durh… Durham Piedm… Residential Urban (… HINSON… Urban ## 7 3974 Durh… Durham Piedm… Commercial Urban (… <NA> Urban ## 8 7134 Durh… Durham Piedm… Commercial Urban (… <NA> Urban ## 9 1670 Durh… Durham Piedm… Commercial Urban (… INFINI… Urban ## 10 1773 Durh… Durham Piedm… Residential Urban (… <NA> Urban ## # … with 330 more rows, and 58 more variables: speed_limit <chr>, ## # traffic_control <chr>, weather <chr>, workzone <chr>, bike_age <chr>, ## # bike_age_group <chr>, bike_alcohol <chr>, bike_alcohol_drugs <chr>, ## # bike_direction <chr>, bike_injury <chr>, bike_position <chr>, ## # bike_race <chr>, bike_sex <chr>, driver_age <chr>, ## # driver_age_group <chr>, driver_alcohol <chr>, ## # driver_alcohol_drugs <chr>, driver_est_speed <chr>, ## # driver_injury <chr>, driver_race <chr>, driver_sex <chr>, ## # driver_vehicle_type <chr>, crash_alcohol <chr>, crash_date <chr>, ## # crash_day <chr>, crash_group <chr>, crash_hour <int>, ## # crash_location <chr>, crash_month <chr>, crash_severity <chr>, ## # crash_time <drtn>, crash_type <chr>, crash_year <int>, ## # ambulance_req <chr>, hit_run <chr>, light_condition <chr>, ## # road_character <chr>, road_class <chr>, road_condition <chr>, ## # road_configuration <chr>, road_defects <chr>, 
road_feature <chr>, ## # road_surface <chr>, num_bikes_ai <int>, num_bikes_bi <int>, ## # num_bikes_ci <int>, num_bikes_ki <int>, num_bikes_no <int>, ## # num_bikes_to <int>, num_bikes_ui <int>, num_lanes <chr>, ## # num_units <int>, distance_mi_from <chr>, frm_road <chr>, ## # rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>, geo_shape <chr> ``` ] --- ## `filter` for many conditions at once for crashes in Durham County where biker was 0-5 years old .small[ ```r ncbikecrash %>% filter(county == "Durham", bike_age_group == "0-5") ``` ``` ## # A tibble: 4 x 66 ## object_id city county region development locality on_road rural_urban ## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 4062 Durh… Durham Piedm… Residential Urban (… <NA> Urban ## 2 414 Durh… Durham Piedm… Residential Urban (… PVA 90… Urban ## 3 3016 Durh… Durham Piedm… Residential Urban (… <NA> Urban ## 4 1383 Durh… Durham Piedm… Residential Urban (… PVA 62… Urban ## # … with 58 more variables: speed_limit <chr>, traffic_control <chr>, ## # weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>, ## # bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>, ## # bike_injury <chr>, bike_position <chr>, bike_race <chr>, ## # bike_sex <chr>, driver_age <chr>, driver_age_group <chr>, ## # driver_alcohol <chr>, driver_alcohol_drugs <chr>, ## # driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>, ## # driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>, ## # crash_date <chr>, crash_day <chr>, crash_group <chr>, ## # crash_hour <int>, crash_location <chr>, crash_month <chr>, ## # crash_severity <chr>, crash_time <drtn>, crash_type <chr>, ## # crash_year <int>, ambulance_req <chr>, hit_run <chr>, ## # light_condition <chr>, road_character <chr>, road_class <chr>, ## # road_condition <chr>, road_configuration <chr>, road_defects <chr>, ## # road_feature <chr>, road_surface <chr>, num_bikes_ai <int>, ## # num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki 
<int>,
## #   num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>,
## #   num_lanes <chr>, num_units <int>, distance_mi_from <chr>,
## #   frm_road <chr>, rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>,
## #   geo_shape <chr>
```
]

---

## Logical operators in R

operator    | definition                   || operator     | definition
------------|------------------------------||--------------|----------------
`<`         | less than                    ||`x`&nbsp;&#124;&nbsp;`y` | `x` OR `y`
`<=`        | less than or equal to        ||`is.na(x)`    | test if `x` is `NA`
`>`         | greater than                 ||`!is.na(x)`   | test if `x` is not `NA`
`>=`        | greater than or equal to     ||`x %in% y`    | test if `x` is in `y`
`==`        | exactly equal to             ||`!(x %in% y)` | test if `x` is not in `y`
`!=`        | not equal to                 ||`!x`          | not `x`
`x & y`     | `x` AND `y`                  ||              |

---

## `select` to keep variables

.small[

```r
ncbikecrash %>%
  filter(county == "Durham", bike_age_group == "0-5") %>%
  select(locality, speed_limit)
```

```
## # A tibble: 4 x 2
##   locality               speed_limit
##   <chr>                  <chr>
## 1 Urban (>70% Developed) 30 - 35 MPH
## 2 Urban (>70% Developed) 5 - 15 MPH
## 3 Urban (>70% Developed) 20 - 25 MPH
## 4 Urban (>70% Developed) 20 - 25 MPH
```
]

---

## `select` to exclude variables

.small[

```r
ncbikecrash %>%
  select(-object_id)
```

```
## # A tibble: 7,467 x 65
##    city  county region development locality on_road rural_urban speed_limit
##    <chr> <chr>  <chr>  <chr>       <chr>    <chr>   <chr>       <chr>
##  1 None… Wayne  Coast… Farms, Woo… Rural (… SR 1915 Rural       50 - 55 M…
##  2 Hend… Vance  Piedm… Residential Mixed (… NICHOL… Urban       30 - 35 M…
##  3 None… Linco… Piedm… Farms, Woo… Rural (… US 321  Rural       50 - 55 M…
##  4 Whit… Colum… Coast… Commercial  Urban (… W BURK… Urban       30 - 35 M…
##  5 Wilm… New H… Coast… Residential Urban (… RACINE… Urban       <NA>
##  6 None… Robes… Coast… Farms, Woo… Rural (… SR 1513 Rural       50 - 55 M…
##  7 None… Richm… Piedm… Residential Mixed (… SR 1903 Rural       30 - 35 M…
##  8 Rale… Wake   Piedm… Commercial  Urban (… PERSON… Urban       30 - 35 M…
##  9 Whit… Colum… Coast… Residential Rural (… FLOWER… Urban
30 - 35 M… ## 10 New … Craven Coast… Residential Urban (… SUTTON… Urban 20 - 25 M… ## # … with 7,457 more rows, and 57 more variables: traffic_control <chr>, ## # weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>, ## # bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>, ## # bike_injury <chr>, bike_position <chr>, bike_race <chr>, ## # bike_sex <chr>, driver_age <chr>, driver_age_group <chr>, ## # driver_alcohol <chr>, driver_alcohol_drugs <chr>, ## # driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>, ## # driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>, ## # crash_date <chr>, crash_day <chr>, crash_group <chr>, ## # crash_hour <int>, crash_location <chr>, crash_month <chr>, ## # crash_severity <chr>, crash_time <drtn>, crash_type <chr>, ## # crash_year <int>, ambulance_req <chr>, hit_run <chr>, ## # light_condition <chr>, road_character <chr>, road_class <chr>, ## # road_condition <chr>, road_configuration <chr>, road_defects <chr>, ## # road_feature <chr>, road_surface <chr>, num_bikes_ai <int>, ## # num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>, ## # num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>, ## # num_lanes <chr>, num_units <int>, distance_mi_from <chr>, ## # frm_road <chr>, rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>, ## # geo_shape <chr> ``` ] --- ## `select` a range of variables .small[ ```r ncbikecrash %>% select(city:locality) ``` ``` ## # A tibble: 7,467 x 5 ## city county region development locality ## <chr> <chr> <chr> <chr> <chr> ## 1 None - Rural … Wayne Coastal Farms, Woods, Pa… Rural (<30% Develop… ## 2 Henderson Vance Piedmo… Residential Mixed (30% To 70% D… ## 3 None - Rural … Lincoln Piedmo… Farms, Woods, Pa… Rural (<30% Develop… ## 4 Whiteville Columbus Coastal Commercial Urban (>70% Develop… ## 5 Wilmington New Hanov… Coastal Residential Urban (>70% Develop… ## 6 None - Rural … Robeson Coastal Farms, Woods, Pa… Rural (<30% Develop… ## 7 None - 
Rural … Richmond Piedmo… Residential Mixed (30% To 70% D… ## 8 Raleigh Wake Piedmo… Commercial Urban (>70% Develop… ## 9 Whiteville Columbus Coastal Residential Rural (<30% Develop… ## 10 New Bern Craven Coastal Residential Urban (>70% Develop… ## # … with 7,457 more rows ``` ] --- ## `slice` for certain row numbers First five .small[ ```r ncbikecrash %>% slice(1:5) ``` ``` ## # A tibble: 5 x 66 ## object_id city county region development locality on_road rural_urban ## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 1686 None… Wayne Coast… Farms, Woo… Rural (… SR 1915 Rural ## 2 1674 Hend… Vance Piedm… Residential Mixed (… NICHOL… Urban ## 3 1673 None… Linco… Piedm… Farms, Woo… Rural (… US 321 Rural ## 4 1687 Whit… Colum… Coast… Commercial Urban (… W BURK… Urban ## 5 1653 Wilm… New H… Coast… Residential Urban (… RACINE… Urban ## # … with 58 more variables: speed_limit <chr>, traffic_control <chr>, ## # weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>, ## # bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>, ## # bike_injury <chr>, bike_position <chr>, bike_race <chr>, ## # bike_sex <chr>, driver_age <chr>, driver_age_group <chr>, ## # driver_alcohol <chr>, driver_alcohol_drugs <chr>, ## # driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>, ## # driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>, ## # crash_date <chr>, crash_day <chr>, crash_group <chr>, ## # crash_hour <int>, crash_location <chr>, crash_month <chr>, ## # crash_severity <chr>, crash_time <drtn>, crash_type <chr>, ## # crash_year <int>, ambulance_req <chr>, hit_run <chr>, ## # light_condition <chr>, road_character <chr>, road_class <chr>, ## # road_condition <chr>, road_configuration <chr>, road_defects <chr>, ## # road_feature <chr>, road_surface <chr>, num_bikes_ai <int>, ## # num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>, ## # num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>, ## # num_lanes <chr>, num_units 
<int>, distance_mi_from <chr>, ## # frm_road <chr>, rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>, ## # geo_shape <chr> ``` ] --- ## `slice` for certain row numbers Last five .small[ ```r last_row <- nrow(ncbikecrash) ncbikecrash %>% slice((last_row - 4):last_row) ``` ``` ## # A tibble: 5 x 66 ## object_id city county region development locality on_road rural_urban ## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 6989 High… Guilf… Piedm… Residential Urban (… <NA> Urban ## 2 6991 Wilm… New H… Coast… Residential Urban (… <NA> Urban ## 3 6995 Kins… Lenoir Coast… Commercial Urban (… <NA> Urban ## 4 6998 Faye… Cumbe… Coast… Residential Urban (… <NA> Urban ## 5 7000 None… Onslow Coast… Farms, Woo… Rural (… <NA> Rural ## # … with 58 more variables: speed_limit <chr>, traffic_control <chr>, ## # weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>, ## # bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>, ## # bike_injury <chr>, bike_position <chr>, bike_race <chr>, ## # bike_sex <chr>, driver_age <chr>, driver_age_group <chr>, ## # driver_alcohol <chr>, driver_alcohol_drugs <chr>, ## # driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>, ## # driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>, ## # crash_date <chr>, crash_day <chr>, crash_group <chr>, ## # crash_hour <int>, crash_location <chr>, crash_month <chr>, ## # crash_severity <chr>, crash_time <drtn>, crash_type <chr>, ## # crash_year <int>, ambulance_req <chr>, hit_run <chr>, ## # light_condition <chr>, road_character <chr>, road_class <chr>, ## # road_condition <chr>, road_configuration <chr>, road_defects <chr>, ## # road_feature <chr>, road_surface <chr>, num_bikes_ai <int>, ## # num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>, ## # num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>, ## # num_lanes <chr>, num_units <int>, distance_mi_from <chr>, ## # frm_road <chr>, rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>, ## # 
geo_shape <chr>
```
]

---

## `pull` to extract a column as a vector

.small[

```r
ncbikecrash %>%
  slice(1:6) %>%
  pull(locality)
```

```
## [1] "Rural (<30% Developed)"       "Mixed (30% To 70% Developed)"
## [3] "Rural (<30% Developed)"       "Urban (>70% Developed)"
## [5] "Urban (>70% Developed)"       "Rural (<30% Developed)"
```
]

vs.

.small[

```r
ncbikecrash %>%
  slice(1:6) %>%
  select(locality)
```

```
## # A tibble: 6 x 1
##   locality
##   <chr>
## 1 Rural (<30% Developed)
## 2 Mixed (30% To 70% Developed)
## 3 Rural (<30% Developed)
## 4 Urban (>70% Developed)
## 5 Urban (>70% Developed)
## 6 Rural (<30% Developed)
```
]

---

## `sample_n` / `sample_frac` for a random sample

- `sample_n`: randomly sample 5 observations

.small[

```r
ncbikecrash_n5 <- ncbikecrash %>%
  sample_n(5, replace = FALSE)
dim(ncbikecrash_n5)
```

```
## [1]  5 66
```
]

--

- `sample_frac`: randomly sample 20% of observations

.small[

```r
ncbikecrash_perc20 <- ncbikecrash %>%
  sample_frac(0.2, replace = FALSE)
dim(ncbikecrash_perc20)
```

```
## [1] 1493   66
```
]

---

## `distinct` to filter for unique rows

And `arrange` to order alphabetically

.small[

```r
ncbikecrash %>%
  select(county, city) %>%
  distinct() %>%
  arrange(county, city)
```

```
## # A tibble: 391 x 2
##    county    city
##    <chr>     <chr>
##  1 Alamance  Alamance
##  2 Alamance  Burlington
##  3 Alamance  Elon
##  4 Alamance  Elon College
##  5 Alamance  Gibsonville
##  6 Alamance  Graham
##  7 Alamance  Green Level
##  8 Alamance  Mebane
##  9 Alamance  None - Rural Crash
## 10 Alexander None - Rural Crash
## # … with 381 more rows
```
]

---

## `summarise` to reduce variables to values

.small[

```r
ncbikecrash %>%
  summarise(avg_hr = mean(crash_hour))
```

```
## # A tibble: 1 x 1
##   avg_hr
##    <dbl>
## 1   14.7
```
]

---

## `group_by` to do calculations on groups

.small[

```r
ncbikecrash %>%
  group_by(hit_run) %>%
  summarise(avg_hr = mean(crash_hour))
```

```
## # A tibble: 2 x 2
##   hit_run avg_hr
##   <chr>    <dbl>
## 1 No        14.6
## 2 Yes       15.0
```
]

---

## `count` observations in groups
.small[

```r
ncbikecrash %>%
  count(driver_alcohol_drugs)
```

```
## # A tibble: 6 x 2
##   driver_alcohol_drugs                    n
##   <chr>                               <int>
## 1 <NA>                                 6654
## 2 Missing                                99
## 3 No                                    695
## 4 Yes-Alcohol, impairment suspected      12
## 5 Yes-Alcohol, no impairment detected     3
## 6 Yes-Drugs, impairment suspected         4
```
]

---

## `mutate` to add new variables

.small[

```r
ncbikecrash %>%
  mutate(driver_alcohol_drugs_simplified = case_when(
    driver_alcohol_drugs == "Missing"       ~ NA_character_,
    str_detect(driver_alcohol_drugs, "Yes") ~ "Yes",
    TRUE                                    ~ "No"
  ))
```
]

<img height="400" src="img/02/dplyr_mutate.png"> </img>

.my-footer[
<font size="2">
Artwork by @allison_horst
</font>
]

---

## "Save" when you `mutate`

Most often when you define a new variable with `mutate` you'll also want to save the resulting data frame, often by writing over the original data frame.

.small[

```r
*ncbikecrash <- ncbikecrash %>%
  mutate(driver_alcohol_drugs_simplified = case_when(
    str_detect(driver_alcohol_drugs, "Yes") ~ "Yes",
    TRUE                                    ~ driver_alcohol_drugs
  ))
```
]

---

## "Save" when you `mutate`

Most often when you define a new variable with `mutate` you'll also want to save the resulting data frame, often by writing over the original data frame.
.small[

```r
ncbikecrash %>%
  mutate(driver_alcohol_drugs_simplified = case_when(
    str_detect(driver_alcohol_drugs, "Yes") ~ "Yes",
    TRUE                                    ~ driver_alcohol_drugs
* )) -> ncbikecrash
```
]

---

## Check before you move on

.small[

```r
ncbikecrash %>%
  count(driver_alcohol_drugs, driver_alcohol_drugs_simplified)
```

```
## # A tibble: 6 x 3
##   driver_alcohol_drugs                driver_alcohol_drugs_simplified     n
##   <chr>                               <chr>                           <int>
## 1 <NA>                                <NA>                             6654
## 2 Missing                             Missing                            99
## 3 No                                  No                                695
## 4 Yes-Alcohol, impairment suspected   Yes                                12
## 5 Yes-Alcohol, no impairment detected Yes                                 3
## 6 Yes-Drugs, impairment suspected     Yes                                 4
```

```r
ncbikecrash %>%
  count(driver_alcohol_drugs_simplified)
```

```
## # A tibble: 4 x 2
##   driver_alcohol_drugs_simplified     n
##   <chr>                           <int>
## 1 <NA>                             6654
## 2 Missing                            99
## 3 No                                695
## 4 Yes                                19
```
]

---

## <i class="fas fa-laptop"></i> `NC bike crashes`

- Go to RStudio Cloud and open `NC bike crashes`
- For each question you work on, set the `eval` chunk option to `TRUE` and knit
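
A rough sketch of what flipping the `eval` option might look like in the exercise file. The chunk label `durham-crashes` and the `eval = FALSE` starting state are assumptions for illustration, not taken from the exercise itself; the chunk headers appear as comments, and the code is the Durham County filter from earlier in these slides.

.small[

```r
# Hypothetical chunk header as it might ship in the exercise
# (the chunk is skipped when knitting):
#   {r durham-crashes, eval = FALSE}
#
# Once your answer is filled in, switch the option so the chunk runs on knit:
#   {r durham-crashes, eval = TRUE}

library(dplyr)  # provides %>% and filter()
library(dsbox)  # provides the ncbikecrash data frame

ncbikecrash %>%
  filter(county == "Durham")
```
]

Leaving unfinished chunks at `eval = FALSE` lets the document knit even when later answers are still incomplete.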