Lab 05: Multiple regression

Due October 31 at noon. Turn in the .html file on Sakai.

Bike sharing systems are a new generation of traditional bike rentals in which the whole process, from membership to rental and return, has become automatic. Through these systems, a user can easily rent a bike at one location and return it at another. Currently, there are over 500 bike-sharing programs around the world, comprising over 500 thousand bicycles. Today, there is great interest in these systems due to their important role in traffic, environmental, and health issues.

Apart from the interesting real-world applications of bike sharing systems, the characteristics of the data these systems generate make them attractive for research. Unlike other transport services such as bus or subway, these systems explicitly record the duration of travel and the departure and arrival positions. This feature turns a bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most important events in the city could be detected by monitoring these data.

Source: UCI Machine Learning Repository - Bike Sharing Dataset

Data

The data include daily bike rental counts (by members and casual users) of Capital Bikeshare in Washington, DC in 2011 and 2012 as well as weather information on these days.

The original data sources are http://capitalbikeshare.com/system-data and http://www.freemeteo.com.

The data is in your RStudio Project as bike-data.csv. Use the read_csv() function to read this data in. If you don’t recall how this function works, try ?read_csv
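As a quick reminder of how read_csv() behaves, here is a minimal sketch. It assumes the readr package is installed, and it writes a tiny stand-in file so that it runs anywhere. The names `toy_path` and `bikes_toy` and the two toy rows are illustrative only; in the lab you would pass "bike-data.csv" to read_csv() directly.

```r
library(readr)  # read_csv() comes from readr (part of the tidyverse)

# Toy stand-in for bike-data.csv (hypothetical values, a subset of columns).
toy_path <- tempfile(fileext = ".csv")
writeLines(c("instant,dteday,season,cnt",
             "1,2011-01-01,1,985",
             "2,2011-01-02,1,801"),
           toy_path)

# read_csv() guesses a type for each column; show_col_types = FALSE only
# silences the message that reports those guesses.
bikes_toy <- read_csv(toy_path, show_col_types = FALSE)

# Inspect the result: a tibble with one row per day.
print(bikes_toy)
dim(bikes_toy)
```

Note that read_csv() reads coded variables such as season as plain numbers, which is why the wrangling exercises ask you to recode them as factors.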

The codebook is below:

| Variable name | Description |
| --- | --- |
| instant | record index |
| dteday | date |
| season | season (1: winter, 2: spring, 3: summer, 4: fall) |
| yr | year (0: 2011, 1: 2012) |
| mnth | month (1 to 12) |
| holiday | whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule) |
| weekday | day of the week |
| workingday | 1 if the day is neither a weekend nor a holiday, otherwise 0 |
| weathersit | 1: Clear, Few clouds, Partly cloudy |
| | 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist |
| | 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds |
| | 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog |
| temp | Normalized temperature in Celsius. The values are divided by 41 (max) |
| atemp | Normalized feeling temperature in Celsius. The values are divided by 50 (max) |
| hum | Normalized humidity. The values are divided by 100 (max) |
| windspeed | Normalized wind speed. The values are divided by 67 (max) |
| casual | Count of casual users |
| registered | Count of registered users |
| cnt | Count of total rental bikes, including both casual and registered |

Exercises

Data wrangling

  1. Recode the season variable to be a factor with meaningful level names as outlined in the codebook. Change spring to be the referent level.

  2. Recode the binary variables holiday and workingday to be factors with levels no (0) and yes (1), with no as the baseline level.

  3. Recode the yr variable to be a factor with levels 2011 and 2012, with 2011 as the baseline level.

  4. Recode the weathersit variable as 1 - clear, 2 - mist, 3 - light precipitation, and 4 - heavy precipitation, with clear as the baseline.

  5. Calculate raw temperature, feeling temperature, humidity, and windspeed as their values given in the dataset multiplied by the maximum raw values stated in the codebook for each variable. Instead of writing over the existing variables, create new ones with concise but informative names.

  6. Check that the sum of casual and registered adds up to cnt for each record.
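The wrangling steps above all follow a few recurring patterns: turning numeric codes into labelled factors, choosing a reference level, undoing a normalization, and checking a row-wise identity. The sketch below shows these patterns in base R on a made-up data frame. The object `toy`, its values, and the new column name `temp_raw` are assumptions for illustration, not part of the lab data.

```r
# Toy rows using the codebook's column names; the values are made up.
toy <- data.frame(
  season     = c(1, 2, 3, 4),
  holiday    = c(0, 0, 1, 0),
  temp       = c(0.25, 0.50, 0.75, 0.40),
  casual     = c(100, 300, 500, 200),
  registered = c(900, 1700, 2500, 1800),
  cnt        = c(1000, 2000, 3000, 2000)
)

# Numeric codes -> labelled factor, then move one level to the front so it
# becomes the reference level in later models.
toy$season <- factor(toy$season, levels = 1:4,
                     labels = c("winter", "spring", "summer", "fall"))
toy$season <- relevel(toy$season, ref = "spring")
levels(toy$season)

# Binary 0/1 -> no/yes; the first level listed is the baseline.
toy$holiday <- factor(toy$holiday, levels = c(0, 1), labels = c("no", "yes"))

# Undo the normalization: value * maximum from the codebook (41 for temp),
# stored in a new column rather than overwriting the original.
toy$temp_raw <- toy$temp * 41

# Row-wise consistency check: TRUE only if every record adds up.
all(toy$casual + toy$registered == toy$cnt)
```

If you prefer the tidyverse, dplyr::mutate() together with forcats::fct_relevel() expresses the same steps inside a pipeline.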

Exploratory data analysis

  7. Recreate the following visualization, and interpret it in context of the data. Hint: You will need to use one of the variables you created above.

  8. Create a visualization displaying the relationship between bike rentals and season. Interpret the plot in context of the data.
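One common way to display a numerical variable against a categorical one is a set of side-by-side boxplots. The following is only a sketch of that idea in base R, using made-up counts in a hypothetical data frame `toy`. Other plot types, or ggplot2, would work equally well.

```r
# Made-up daily totals, three per season, purely for illustration.
toy <- data.frame(
  season = factor(rep(c("winter", "spring", "summer", "fall"), each = 3),
                  levels = c("winter", "spring", "summer", "fall")),
  cnt    = c(1200, 1500, 1800,
             3900, 4300, 4700,
             5200, 5600, 6000,
             4100, 4500, 4900)
)

# One box per season; boxplot() also returns the summary statistics it drew.
bp <- boxplot(cnt ~ season, data = toy,
              xlab = "Season", ylab = "Total daily rentals",
              main = "Toy example: rentals by season")

# Group labels in plotting order (the factor's level order).
bp$names
```

When interpreting such a plot, compare the centers and spreads of the boxes across seasons rather than individual points.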

Modelling

  9. Fit a linear model predicting total daily bike rentals from daily temperature. Write the linear model, interpret the slope and the intercept in context of the data, and determine and interpret the \(R^2\).

  10. Fit another linear model predicting total daily bike rentals from daily feeling temperature. Write the linear model, interpret the slope and the intercept in context of the data, and determine and interpret the \(R^2\). Is temperature or feeling temperature a better predictor of bike rentals?

  11. Fit a full model predicting total daily bike rentals from season, year, whether the day is holiday or not, whether the day is a workingday or not, the weather category, temperature, feeling temperature, humidity, and windspeed, as well as the interaction between at least one numerical and one categorical variable. Report the \(R^2_{adj}\), AIC, and BIC for this model.

  12. Fit a reduced model, a model nested in the model fit in Exercise 11. Report the \(R^2_{adj}\), AIC, and BIC for this model. Is it better than the full model?

  13. Perform a nested F-test comparing the full model in Exercise 11 to your reduced model in Exercise 12. Do these results match what you saw based on \(R^2_{adj}\), AIC, and BIC?

  14. Interpret slope coefficients associated with two of the variables in your final model in context of the data. Note: If one of these is categorical with multiple levels, make sure you interpret all of the slope coefficients associated with the levels of the variable.

  15. Based on the final model you found in the previous question, discuss what makes for a good day to bike in DC (as measured by rental bikes being more in demand).
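The mechanics behind the modelling exercises (fitting with lm(), reading off \(R^2\) and \(R^2_{adj}\), computing AIC and BIC, and running a nested F-test) can be sketched on simulated data. Everything below is an assumption for illustration: the data frame `toy`, the column names `temp_raw` and `hum_raw`, the helper `fit_stats()`, and the simulated relationship are not taken from the bike data.

```r
set.seed(1)  # reproducible toy data
n <- 60
toy <- data.frame(
  temp_raw = runif(n, 0, 35),
  season   = factor(rep(c("spring", "summer", "fall", "winter"),
                        length.out = n)),
  hum_raw  = runif(n, 30, 100)
)
# Simulated response: depends on temperature only, plus noise.
toy$cnt <- 1500 + 120 * toy$temp_raw + rnorm(n, sd = 600)

# A simple model and a larger one; `temp_raw * season` adds both main
# effects and a numerical-by-categorical interaction.
reduced <- lm(cnt ~ temp_raw, data = toy)
full    <- lm(cnt ~ temp_raw * season + hum_raw, data = toy)

coef(reduced)               # intercept and slope
summary(reduced)$r.squared  # proportion of variability explained

# Hypothetical helper collecting the three comparison criteria.
fit_stats <- function(m) {
  c(adj_r2 = summary(m)$adj.r.squared, AIC = AIC(m), BIC = BIC(m))
}
rbind(reduced = fit_stats(reduced), full = fit_stats(full))

# Nested F-test: the reduced model must be a special case of the full one.
cmp <- anova(reduced, full)
cmp
```

Higher \(R^2_{adj}\) and lower AIC or BIC favor a model, while a small p-value in the anova() table suggests the extra terms in the full model improve the fit beyond what chance would explain.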

Principles of Data Analysis

  16. Rank each of the following principles from the Elements and Principles of Data Analysis article for this data analysis from 1 to 10 along with a one sentence summary:




Lab adapted from datasciencebox.org by Dr. Lucy D’Agostino McGowan