Module 5

Last week we discussed general guidelines for first interacting with a new data set. This week we build on those activities with early exploratory data analysis: answering questions about your data by visualizing and transforming it. We have two objectives for this week:

  1. Visualization should be used deliberately to answer specific questions about your data. You will learn how to use the ggplot2 package to build your visualization skills and analyze your data systematically.
  2. Early in the data wrangling process you will likely need to sort, filter, or summarize your data set, or even create new variables from the existing data. You will learn how to use the dplyr package to perform many common data transformation and manipulation tasks.
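As a rough sketch of the tasks named in the second objective, the dplyr verbs for filtering, creating a variable, summarizing, and sorting can be chained together. The mpg data set (bundled with ggplot2) and the names avg_mpg and class_summary are choices made here for illustration only; they are not part of the course material.

```r
library(dplyr)
library(ggplot2)  # loaded here only for the bundled mpg data set

# Illustrative pipeline (assumed example, not a course deliverable):
class_summary <- mpg %>%
  filter(year == 2008) %>%                   # filter: keep 2008 models only
  mutate(avg_mpg = (cty + hwy) / 2) %>%      # create a new variable
  group_by(class) %>%                        # summarize within vehicle class
  summarise(n_cars = n(),
            mean_avg_mpg = mean(avg_mpg)) %>%
  arrange(desc(mean_avg_mpg))                # sort: most efficient class first

print(class_summary)
```

Each verb takes a data frame and returns a data frame, which is what allows the steps to be piped one into the next.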

Combining the activities of data transformation and visualization in a methodical way is what defines exploratory data analysis (EDA). Only by systematically applying these techniques will you be able to answer and refine questions about your data. Module 5 focuses on the visualization component of EDA.
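One pass through this cycle might look like the sketch below: pose a question, transform the data to get a numeric answer, then visualize to check and refine it. The question, the mpg data set, and the object names hwy_by_drv and eda_plot are assumptions made for illustration.

```r
library(dplyr)
library(ggplot2)

# Hypothetical question: does highway mileage differ by drive type?

# Transform: a numeric summary per drive type
hwy_by_drv <- mpg %>%
  group_by(drv) %>%
  summarise(median_hwy = median(hwy), n = n())

# Visualize: the full distribution behind each summary value
eda_plot <- ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_boxplot() +
  labs(x = "Drive type", y = "Highway miles per gallon")

print(hwy_by_drv)
print(eda_plot)
```

The plot often raises a follow-up question (for example, whether vehicle class explains the difference), which starts the next pass.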


Tutorials & Resources

Being able to create visualizations (graphical representations) of data is a key step in data analysis. In this module you will learn to use the ggplot2 package to visualize your data. As illustrated in the last module, R does provide built-in plotting functions; however, ggplot2 implements what is known as the Grammar of Graphics, a consistent way of describing a plot in terms of data, aesthetic mappings, and geometric objects. This makes it particularly effective for describing how visualizations should represent data, and has made it the preeminent plotting package in R.
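A minimal sketch of what the grammar looks like in practice: a plot is assembled from a data set, aesthetic mappings, and one or more layers added with `+`. The mpg data set and the particular variables chosen are assumptions for illustration.

```r
library(ggplot2)

# Data + aesthetic mappings form the base of the plot
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  # Layer 1: points, with colour mapped to vehicle class
  geom_point(aes(colour = class)) +
  # Layer 2: a smoothed trend line drawn over the same x/y mapping
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  # Facets: one panel per drive type
  facet_wrap(~ drv) +
  # Labels: human-readable axis and legend titles
  labs(x = "Engine displacement (litres)",
       y = "Highway miles per gallon",
       colour = "Vehicle class")

print(p)
```

Because each component is added independently, swapping `geom_point()` for another geom or removing the facets changes the plot without touching the rest of the specification.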

The following tutorials will provide you with the knowledge and skills required to create the meaningful, elegant, and finely tuned data visualizations that I will be looking for in the remainder of your project deliverables.

  1. Introduction to ggplot2: Read and work through Chapter 3 ("Data Visualization") of R for Data Science to get an introduction to the ggplot2 package.

  2. Advancing your visualizations: In your final project I will be looking for publication-worthy visualizations, and I fully expect your visualizations to improve with each deliverable submitted. It is therefore essential that you learn how to use some of the more advanced features of ggplot2 and other packages that work with ggplot2. Here are some resources to help you take your visualizations to the next level:


Class Prep

  1. Work through as many of the exercises as you can in Chapters 3 & 28 of R for Data Science.
  2. Identify at least 10 specific questions you want to ask of your thesis data. What types of visualizations do you need to explore to answer these questions? Be ready to use ggplot2 to answer these questions in class.

You can download class material here: