In Module 3 you learned to import data. However, data by themselves are of little use, so we need to start doing some basic care and feeding of the data we’ve imported. In this module we investigate good practices for when you get a new data set. Spending a little time up front to understand your data will speed up your analysis later on.
Please work through the following tutorials prior to class. The skills and functions they introduce will be necessary to complete project deliverable #2. The tutorials cover three objectives that we should have when we first open a new data set:
1. Review the codebook: Understanding the source data is crucial to any analysis. A codebook is the documentation that explicitly tells you about the data you are working with and should be the first thing you review before starting any kind of analysis. Read Review the Codebook to get a taste of what to look for.
2. Learn about the data: When first opening a data set, it is important to get a basic understanding of its dimensions (rows and columns), what the data look like, how many missing values there are, and some basic summary statistics such as the mean, median, and range of each variable. Read and work through Learn About the Data to understand some of the first things you should do with a fresh data set.
3. Quick visualizations: It is also good to get an initial understanding of your data through visual means. Module 5 will dive deep into creating more sophisticated visualizations; however, it is important to know how to do some basic plotting for quick data exploration. Read and work through Getting Started with Charts in R.
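
As a rough preview of objectives 2 and 3, the sketch below runs a first pass over a data set using only base R. It is not taken from the tutorials themselves: the built-in `airquality` data set stands in for whatever you imported, and the names `df`, `data_dims`, and `missing_counts` are placeholders chosen for illustration.

```r
# Assumption: the built-in airquality data stands in for a freshly imported data set.
df <- airquality

# Dimensions: number of rows and columns
data_dims <- dim(df)
print(data_dims)

# What the data look like: variable types and the first few rows
str(df)
print(head(df))

# Number of missing values in each variable
missing_counts <- colSums(is.na(df))
print(missing_counts)

# Basic summary statistics (min, quartiles, median, mean, max, NA count)
print(summary(df))

# Quick exploratory plots; these are meant for a fast look, not for presentation
hist(df$Ozone, main = "Distribution of ozone", xlab = "Ozone (ppb)")
plot(df$Temp, df$Ozone, xlab = "Temperature (degrees F)", ylab = "Ozone (ppb)")
boxplot(Ozone ~ Month, data = df, xlab = "Month", ylab = "Ozone (ppb)")
```

When you try this on your own data, replace `airquality` with your imported data frame and compare what `str()` and `summary()` report against the codebook; a mismatch in variable types or value ranges is often the first sign of an import problem.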