Module 9

In the last several modules we have been learning to import, tidy, manipulate, and visualize our data. For the most part, we applied these skills to fairly clean, well-structured data. Unfortunately, real data are often not so standard and contain messy character strings that we need to clean up or extract patterns from. This module teaches you the basics of working with character strings and regular expressions.
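To make "clean up or extract patterns" concrete, here is a minimal sketch. The values in `raw` are invented for illustration, and the sketch uses only base R functions so it runs without extra packages; the assigned chapter works with the stringr package instead, so treat this as a rough preview rather than the chapter's own approach.

```r
# Hypothetical messy values, invented for illustration only
raw <- c("  Alice SMITH (ID: 0042) ", "bob jones (id: 17)", "Carol  Lee")

# Trim leading/trailing whitespace and collapse repeated internal spaces
clean <- gsub("\\s+", " ", trimws(raw))

# Detect which entries contain an ID, ignoring case
has_id <- grepl("id: \\d+", clean, ignore.case = TRUE)

# Extract the first run of digits from each entry (NA when there is none)
m <- regexpr("\\d+", clean)
found <- as.vector(m) != -1
ids <- rep(NA_character_, length(clean))
ids[found] <- regmatches(clean, m)

# Remove the parenthesised ID and standardise the case of the names
names_only <- tolower(trimws(sub("\\s*\\(.*\\)", "", clean)))

print(data.frame(name = names_only, id = ids, has_id = has_id))
```

The pattern `\\d+` matches one or more digits and `\\s+` matches one or more whitespace characters; the doubled backslash is needed because the pattern is written inside an R string.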


Tutorials & Resources

  1. Basics of strings & regex: To understand the fundamentals of working with strings and regular expressions, read and work through Chapter 14: Strings of R for Data Science.
  2. Creating tidy text: A fundamental first step in text mining is to get your text into a tidy format so that you can perform word frequency analysis. Text is often unstructured, so even the most basic analysis requires some restructuring. Read and work through the Tidying Text and Word Frequency tutorial to learn how to tidy unstructured text.
  3. More (voluntary) Fun: If your final project is going to require you to work with lots of messy text data, or if you just want to learn more, you can get more practice by working through the following tutorials:

Class Prep

Work through as many exercises as you can in Chapter 14: Strings of R for Data Science and in the Tidying Text and Word Frequency tutorial. Don’t worry, you will get more practice in class!
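As a warm-up for the word frequency exercises, the following sketch shows the general idea of turning unstructured text into one row per word with a count. The sample sentences are invented, and the code relies only on base R; the tutorial may use dedicated text-mining packages, so this is an assumption-laden illustration rather than the tutorial's method.

```r
# Hypothetical lines of unstructured text, invented for illustration only
text <- c("The quick brown fox.", "The lazy dog; the QUICK fox!")

# One token per element: lower-case, then split on anything that is not a letter
words <- unlist(strsplit(tolower(text), "[^a-z]+"))
words <- words[words != ""]  # drop any empty strings left by the split

# Count each word and sort from most to least frequent
freq <- sort(table(words), decreasing = TRUE)

# Store the result in a tidy layout: one row per word, one column per variable
word_counts <- data.frame(
  word = names(freq),
  n = as.integer(freq),
  stringsAsFactors = FALSE
)

print(word_counts)
```

Splitting on `[^a-z]+` is a deliberately crude tokenizer: it discards digits, punctuation, and accented letters, which is fine for this toy input but would need refinement for real text.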

You can download class material here: