AFIT Data Science Lab R Programming Guide 2018-06-19T14:31:48+00:00 https://afit-r.github.io
Random Forests 2018-05-09T00:00:00+00:00 https://afit-r.github.io/2018/05/09/random-forests <p><img src="/public/images/analytics/random_forests/RF_icon.jpg" style="float:right; margin: 2px 5px 0px 20px; width: 30%; height: 30%;" /></p> <p>Bagging regression trees is a technique that can turn a single tree model with high variance and poor predictive power into a fairly accurate prediction function. Unfortunately, bagging regression trees typically suffers from tree correlation, which reduces the overall performance of the model. <strong><em>Random forests</em></strong> are a modification of bagging that builds a large collection of <em>de-correlated</em> trees and have become a very popular “out-of-the-box” learning algorithm with good predictive performance. This <a href="https://afit-r.github.io/random_forests">latest tutorial</a> will cover the fundamentals of random forests.</p>
Regression Trees 2018-04-28T00:00:00+00:00 https://afit-r.github.io/2018/04/28/regression-trees <p><img src="/public/images/analytics/regression_trees/iris.png" style="float:right; margin: 0px 0px 0px 0px; width: 30%; height: 30%;" /> Basic <strong><em>regression trees</em></strong> partition a data set into smaller groups and then fit a simple model (such as a constant) for each subgroup. Unfortunately, a single tree model tends to be highly unstable and a poor predictor. However, bootstrap aggregating (<strong><em>bagging</em></strong>) regression trees can make this simple technique quite powerful and effective. Moreover, it provides the fundamental basis of more complex tree-based models such as <em>random forests</em> and <em>gradient boosting machines</em>. This <a href="https://afit-r.github.io/regression_trees">latest tutorial</a> will get you started with regression trees and bagging.</p>
Naïve Bayes Classifier 2018-04-20T00:00:00+00:00 https://afit-r.github.io/2018/04/20/naive-bayes <p><img src="/public/images/analytics/naive_bayes/naive_bayes_icon.png" style="float:right; margin: 10px 0px 0px 0px; width: 50%; height: 50%;" /></p> <p>The <strong><em>Naïve Bayes classifier</em></strong> is a simple probabilistic classifier based on Bayes’ theorem but with strong assumptions regarding independence. Historically, this technique became popular with applications in email filtering, spam detection, and document categorization. Although it is often outperformed by other techniques, and despite the naïve design and oversimplified assumptions, this classifier can perform well in many complex real-world problems. And since it is a resource-efficient algorithm that is fast and scales well, it is definitely a machine learning algorithm to have in your toolkit. This <a href="https://afit-r.github.io/naive_bayes">tutorial</a> will introduce you to this simple classifier.</p>
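<p>For a quick flavor of the technique, here is a minimal naive Bayes sketch in R. It assumes the <code class="highlighter-rouge">e1071</code> package and the built-in <code class="highlighter-rouge">iris</code> data; these are illustrative choices and not necessarily the package or data used in the tutorial itself.</p>
<pre><code class="language-r"># Minimal naive Bayes sketch (assumes the e1071 package; illustrative only)
library(e1071)

set.seed(123)
train_idx <- sample(nrow(iris), 0.7 * nrow(iris))
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Fit the classifier: class-conditional densities are estimated per feature,
# assuming the features are independent given the class.
nb_fit <- naiveBayes(Species ~ ., data = train)

# Predict held-out classes and check accuracy.
preds <- predict(nb_fit, newdata = test)
mean(preds == test$Species)
table(predicted = preds, actual = test$Species)
</code></pre>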
Feedforward Deep Learning Models 2018-04-09T00:00:00+00:00 https://afit-r.github.io/2018/04/09/feedforward-deep-models <p><img src="/public/images/analytics/deep_learning/deep_nn.png" style="float:right; margin: 2px 0px 0px 10px; width: 55%; height: 55%;" /></p> <p>Machine learning algorithms typically search for the optimal representation of data using some feedback signal (aka objective/loss function). However, most machine learning algorithms only have the ability to use one or two layers of data transformation to learn the output representation. As data sets continue to grow in the dimensions of the feature space, finding the optimal output representation with a <em>shallow</em> model is not always possible. Deep learning provides a multi-layer approach to learn data representations, typically performed with a <em>multi-layer neural network</em>. Like other machine learning algorithms, deep neural networks (DNNs) perform learning by mapping features to targets through a process of simple data transformations and feedback signals; however, DNNs place an emphasis on learning successive layers of meaningful representations. Although an intimidating subject, the overarching concept is rather simple and has proven highly successful in predicting a wide range of problems (e.g., image classification, speech recognition, autonomous driving). This <a href="https://afit-r.github.io/feedforward_DNN">tutorial</a> will teach you the fundamentals of building a <em>feedforward</em> deep learning model.</p>
Regularized Regression 2018-03-28T00:00:00+00:00 https://afit-r.github.io/2018/03/28/regularized-regression <p><img src="/public/images/analytics/regularized_regression/regularization_logo.png" style="float:right; margin: 0px 0px 0px 0px; width: 50%; height: 50%;" /> Linear regression is a simple and fundamental approach for supervised learning. Moreover, when the assumptions required by ordinary least squares (OLS) regression are met, the coefficients produced by OLS are unbiased and, of all unbiased linear techniques, have the lowest variance. However, in today’s world, data sets being analyzed typically have a large number of features. As the number of features grows, our OLS assumptions typically break down and our models often overfit (aka have high variance) to the training sample, causing our out-of-sample error to increase. <strong><em>Regularization</em></strong> methods provide a means to constrain our regression coefficients, which can reduce the variance and decrease out-of-sample error. This latest <a href="https://afit-r.github.io/regularized_regression">tutorial</a> explains why you should know this technique and how to implement it in R.</p>
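<p>As a quick illustration of regularization in R, the sketch below fits a lasso with the <code class="highlighter-rouge">glmnet</code> package, using <code class="highlighter-rouge">mtcars</code> purely as a stand-in data set; the tutorial’s own package and data choices may differ.</p>
<pre><code class="language-r"># Minimal lasso sketch with glmnet (alpha = 1); mtcars is a stand-in data set.
library(glmnet)

# glmnet expects a numeric predictor matrix and a response vector.
x <- model.matrix(mpg ~ ., data = mtcars)[, -1]  # drop the intercept column
y <- mtcars$mpg

# Cross-validation chooses the penalty strength lambda.
set.seed(123)
cv_fit <- cv.glmnet(x, y, alpha = 1)
plot(cv_fit)  # CV error across the lambda path

# Coefficients at the lambda with minimum CV error; with the lasso,
# some coefficients may shrink exactly to zero.
coef(cv_fit, s = "lambda.min")
</code></pre>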
Visual Data Exploration 2018-01-22T00:00:00+00:00 https://afit-r.github.io/2018/01/22/visual-data-exploration <p><img src="/public/images/visual/graphical_data_analysis/gda_icon.png" style="float:right; margin: 20px 0px 0px 5px; width: 50%; height: 50%;" /></p> <p>Visual data exploration is a mandatory initial step whether or not more formal analysis follows. When combined with <a href="https://afit-r.github.io/descriptive">descriptive statistics</a>, visualization provides an effective way to identify summaries, structure, relationships, differences, and abnormalities in the data. Oftentimes no elaborate analysis is necessary, as all the important conclusions required for a decision are evident from simple visual examination of the data. Other times, data exploration will be used to help guide the data cleaning, feature selection, and sampling process. Regardless, visual data exploration is about investigating the characteristics of your data set. To do this, we typically create numerous plots in an interactive fashion. This <a href="https://afit-r.github.io/gda">tutorial</a> will show you how to create plots that answer some of the fundamental questions we typically have of our data.</p>
Exponential Smoothing Models 2017-10-16T00:00:00+00:00 https://afit-r.github.io/2017/10/16/exponential-smoothing <p><img src="/public/images/analytics/time_series/es10-1.png" style="float:right; margin: 2px 0px 0px 10px; width: 40%; height: 40%;" /> Exponential forecasting models are smoothing methods that have been around since the 1950s and are extremely effective. Where <a href="https://afit-r.github.io/ts_benchmarking#naive">naive forecasting</a> places 100% weight on the most recent observation and <a href="https://afit-r.github.io/ts_moving_averages">moving averages</a> place equal weight on the last <em>k</em> observations, exponential smoothing allows for weighted averages where greater weight can be placed on recent observations and lesser weight on older observations. Exponential smoothing methods are intuitive, computationally efficient, and generally applicable to a wide range of time series. Consequently, exponential smoothers are great forecasting tools to have, and <a href="https://afit-r.github.io/ts_exp_smoothing">this tutorial</a> will walk you through the basics.</p>
Support Vector Machines 2017-09-27T00:00:00+00:00 https://afit-r.github.io/2017/09/27/support-vector-machines <p><img src="/public/images/analytics/svm/unnamed-chunk-13-1.png" style="float:right; margin: 0px -5px 0px 10px; width: 45%; height: 45%;" /> The advent of computers brought on rapid advances in the field of statistical classification, one of which is the <em>Support Vector Machine</em>, or SVM. The goal of an SVM is to take groups of observations and construct boundaries to predict which group future observations belong to based on their measurements. The different groups that must be separated are called “classes”. SVMs can handle any number of classes, as well as observations of any dimension. SVM decision boundaries can take almost any shape (including linear, radial, and polynomial, among others), and SVMs are generally flexible enough to be used in almost any classification endeavor the user chooses to undertake. This new <a href="https://afit-r.github.io/svm">tutorial</a> will introduce you to this supervised classification technique.</p>
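<p>The sketch below shows one common way to fit an SVM in R: the <code class="highlighter-rouge">e1071</code> package with a radial kernel, tuned by cross-validation on the built-in <code class="highlighter-rouge">iris</code> data. These are illustrative assumptions and not necessarily the tutorial’s exact approach.</p>
<pre><code class="language-r"># Minimal SVM sketch with e1071 (radial kernel); iris is a stand-in data set.
library(e1071)

set.seed(123)
train_idx <- sample(nrow(iris), 0.7 * nrow(iris))
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Tune cost and gamma over a small grid via cross-validation.
tuned <- tune(svm, Species ~ ., data = train, kernel = "radial",
              ranges = list(cost = c(0.1, 1, 10), gamma = c(0.01, 0.1, 1)))

# Predict with the best model found by tune() and check accuracy.
best_fit <- tuned$best.model
mean(predict(best_fit, newdata = test) == test$Species)
</code></pre>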
Introduction to Neural Networks 2017-09-07T00:00:00+00:00 https://afit-r.github.io/2017/09/07/Introduction-to-Neural-Networks <p><img src="https://upload.wikimedia.org/wikipedia/commons/6/60/ArtificialNeuronModel_english.png" style="float:right; margin: 0px -5px 0px 10px; width: 45%; height: 45%;" /> Artificial neural networks (ANNs) describe a specific class of machine learning algorithms designed to acquire their own knowledge by extracting useful patterns from data. ANNs are function approximators, mapping inputs to outputs, and are composed of many interconnected computational units, called neurons. Each individual neuron possesses little intrinsic approximation capability; however, when many neurons function cohesively together, their combined effects show remarkable learning performance. A new set of <a href="https://afit-r.github.io/predictive#deep-learning">tutorials</a> introduces you to this popular analytic technique.</p>
Imprecise Regression 2017-08-23T00:00:00+00:00 https://afit-r.github.io/2017/08/23/imprecise-regression <p><img src="/public/images/analytics/imprecise_regression/unnamed-chunk-3-1.png" style="float:right; margin: 0px -5px 0px 10px; width: 45%; height: 45%;" /> Imprecise regression is a generalization of linear regression that gives us stronger tools for modeling uncertainty. In a typical linear regression setting, we consider the input data to be precise observations: single points. In imprecise regression, we generalize the notion of observations to intervals rather than points. This allows us to more accurately represent scenarios with measurement error or other sources of uncertainty that make our input data “fuzzy”. This <a href="https://afit-r.github.io/imprecise_regression">tutorial</a> is based on the imprecise regression work in <a href="http://www.sciencedirect.com/science/article/pii/S0888613X12000862">Cattaneo and Wiencierz (2012)</a> and will teach you the fundamentals of this technique.</p>
Moving Averages 2017-08-04T00:00:00+00:00 https://afit-r.github.io/2017/08/04/moving-averages <p><img src="/public/images/analytics/time_series/ma_icon.png" style="float:right; margin: 2px 0px 0px 10px; width: 40%; height: 40%;" /> Smoothing methods are a family of forecasting approaches that average values over multiple periods in order to reduce the noise and uncover patterns in the data. Moving averages are one such smoothing method. In this new <a href="https://afit-r.github.io/ts_moving_averages">time series tutorial</a>, you will learn the basics of computing moving averages.</p>
Benchmark Methods & Forecast Accuracy 2017-06-16T00:00:00+00:00 https://afit-r.github.io/2017/06/16/benchmarking-methods-and-forecast-accuracy <p><img src="/public/images/analytics/time_series/partitioning-1.png" style="float:right; margin: 2px 0px 0px 10px; width: 40%; height: 40%;" /> In this new <a href="https://afit-r.github.io/ts_benchmarking">time series tutorial</a>, you will learn general tools that are useful for many different forecasting situations. It describes some methods for benchmark forecasting, methods for checking whether a forecasting model has adequately utilized the available information, and methods for measuring forecast accuracy. These are important tools to have in your forecasting toolbox, as each is leveraged repeatedly as you develop and explore a range of forecasting methods.</p>
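<p>As a small taste of benchmarking, the sketch below fits naive and seasonal naive benchmarks with the <code class="highlighter-rouge">forecast</code> package and compares their accuracy on a hold-out set. The <code class="highlighter-rouge">AirPassengers</code> series and the specific functions are stand-in choices, not necessarily the tutorial’s.</p>
<pre><code class="language-r"># Minimal benchmark-forecasting sketch with the forecast package;
# AirPassengers is a stand-in monthly series.
library(forecast)

# Hold out the last two years as a test set.
train <- window(AirPassengers, end = c(1958, 12))
test  <- window(AirPassengers, start = c(1959, 1))

# Two simple benchmarks: naive (repeat the last value) and seasonal naive.
fc_naive  <- naive(train,  h = length(test))
fc_snaive <- snaive(train, h = length(test))

# Compare forecast accuracy (RMSE, MAE, MAPE, ...) on the hold-out set.
accuracy(fc_naive,  test)
accuracy(fc_snaive, test)
</code></pre>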
New Tutorial on Exploring and Visualizing Time Series 2017-06-02T00:00:00+00:00 https://afit-r.github.io/2017/06/02/exploring-and-visualizing-time-series <p><img src="/public/images/analytics/time_series/unnamed-chunk-6-1.png" style="float:right; margin: 2px 0px 0px 10px; width: 40%; height: 40%;" /> Time series forecasting is performed in nearly every organization that works with quantifiable data. Retail stores forecast sales. Energy companies forecast reserves, production, demand, and prices. Educational institutions forecast enrollment. Governments forecast tax receipts and spending. International financial organizations forecast inflation and economic activity. The list is long but the point is short: forecasting is a fundamental analytic process in every organization. This <a href="https://afit-r.github.io/ts_exploration">new tutorial</a> gets you started with some fundamental time series exploration and visualization.</p>
New Tutorial on Linear Model Selection 2017-04-21T00:00:00+00:00 https://afit-r.github.io/2017/04/21/model-selection <p><img src="/public/images/analytics/model_selection/unnamed-chunk-6-1.png" style="float:right; margin: 2px 0px 0px 10px; width: 50%; height: 50%;" /></p> <p>It is often the case that some or many of the variables used in a multiple regression model are in fact <em>not</em> associated with the response variable. Including such irrelevant variables leads to unnecessary complexity in the resulting model. Unfortunately, manually filtering through and comparing regression models can be tedious. Luckily, several approaches exist for automatically performing feature selection or variable selection; that is, for identifying those variables that result in superior regression results. This latest <a href="https://afit-r.github.io/model_selection">tutorial</a> covers a traditional approach known as <em>linear model selection</em>.</p>
New Tutorial Providing an Introduction to `ggplot2` 2017-04-07T00:00:00+00:00 https://afit-r.github.io/2017/04/07/ggplot_intro <p><img src="http://bradleyboehmke.github.io/public/images/tufte/unnamed-chunk-1-1.png" style="float:right; margin: 2px 0px 0px 2px; width: 50%; height: 50%;" /></p> <p>Although R provides built-in plotting functions, <code class="highlighter-rouge">ggplot2</code> has become the preeminent visualization package in R. <code class="highlighter-rouge">ggplot2</code> implements the Grammar of Graphics, making it particularly effective for constructing visual representations of data. Learning this library will allow you to make nearly any kind of (static) data visualization, customized to your exact specifications. This <a href="https://afit-r.github.io/ggplot_intro">intro tutorial</a> will get you started in making effective visualizations with R.</p>
New Tutorial on Resampling Methods 2017-03-17T00:00:00+00:00 https://afit-r.github.io/2017/03/17/resampling <p><img src="/public/images/analytics/resampling/bootstrap.png" style="float:right; margin: 2px 0px 0px 10px; width: 40%; height: 40%;" /></p> <p>See the new tutorial on <a href="https://afit-r.github.io/resampling_methods">resampling methods</a>, which are an indispensable tool in modern statistics. They involve repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model. For example, in order to estimate the variability of a linear regression fit, we can repeatedly draw different samples from the training data, fit a linear regression to each new sample, and then examine the extent to which the resulting fits differ. Such an approach may allow us to obtain information that would not be available from fitting the model only once using the original training sample.</p>
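<p>The bootstrap idea described above can be sketched in a few lines with the <code class="highlighter-rouge">boot</code> package; <code class="highlighter-rouge">mtcars</code> and the <code class="highlighter-rouge">mpg ~ wt</code> model are illustrative stand-ins rather than the tutorial’s own example.</p>
<pre><code class="language-r"># Minimal bootstrap sketch with the boot package: resample rows, refit the
# regression, and examine the spread of the coefficient estimates.
library(boot)

# Statistic: refit mpg ~ wt on a bootstrap sample and return the coefficients.
boot_coef <- function(data, indices) {
  coef(lm(mpg ~ wt, data = data[indices, ]))
}

set.seed(123)
boot_out <- boot(data = mtcars, statistic = boot_coef, R = 2000)

boot_out                                      # bootstrap standard errors
boot.ci(boot_out, type = "perc", index = 2)   # percentile CI for the slope
</code></pre>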
New Tutorials on Clustering 2017-03-03T00:00:00+00:00 https://afit-r.github.io/2017/03/03/clustering <p><img src="/public/images/analytics/clustering/kmeans/unnamed-chunk-18-1.png" style="float:right; margin: 2px 0px 0px 10px; width: 40%; height: 40%;" /></p> <p>Clustering is a broad set of techniques for finding subgroups of observations within a data set. When we cluster observations, we want observations in the same group to be similar and observations in different groups to be dissimilar. Clustering allows us to identify which observations (e.g., customers, students, states) are alike, and potentially categorize them accordingly. Check out our new tutorials covering <a href="https://afit-r.github.io/kmeans_clustering">k-means</a> and <a href="https://afit-r.github.io/hc_clustering">hierarchical</a> clustering.</p>
New Tutorial on Principal Components Analysis 2017-02-17T00:00:00+00:00 https://afit-r.github.io/2017/02/17/pca <p><img src="/public/images/analytics/pca/unnamed-chunk-8-1.png" style="float:right; margin: 2px 0px 0px 10px; width: 40%; height: 40%;" /></p> <p>Principal components analysis (PCA) reduces the dimensionality of our data, allowing most of the variability to be explained using fewer variables than the original data set. This allows us to understand the primary features that best represent our data. Check out the latest <a href="https://afit-r.github.io/pca">tutorial</a>, which covers PCA.</p>
New Tutorial on Logistic Regression 2017-02-03T00:00:00+00:00 https://afit-r.github.io/2017/02/03/logistic-regression <p><img src="/public/images/analytics/logistic_regression/plot2-1.png" style="float:right; margin: 2px 0px 0px 10px; width: 40%; height: 40%;" /></p> <p>Logistic regression is a foundational analytic technique for classification problems. It allows us to estimate the probability of a categorical response based on one or more predictor variables and tells us whether the presence of a predictor increases (or decreases) the odds of a given outcome, and by how much. Check out the recently added <a href="https://afit-r.github.io/logistic_regression">tutorial on logistic regression</a>.</p>
New Tutorial on Linear Regression 2017-01-20T00:00:00+00:00 https://afit-r.github.io/2017/01/20/linear-regression-tutorial <p><img src="/public/images/analytics/regression/sq.errors-1.png" style="float:right; margin: 2px 0px 0px 10px; width: 40%; height: 40%;" /></p> <p>Linear regression is a useful and widely used statistical learning method and serves as a good jumping-off point for newer predictive analytic approaches. Check out the newly added <a href="https://afit-r.github.io/linear_regression">linear regression tutorial</a> that covers the basics of this powerful analytic technique.</p>
New Tutorials for Text Mining 2017-01-07T00:00:00+00:00 https://afit-r.github.io/2017/01/07/text-mining-tutorials <p>Analysts are often trained to handle tabular or rectangular data that are mostly numeric, but much of the data proliferating today is unstructured and typically text-heavy. Many of us who work in analytic fields are not trained in even simple interpretation of natural language. Fortunately, many of the principles used to organize, analyze, and visualize tabular data can be applied to unstructured text to extract meaning from this information.</p> <p>Check out the first couple of <a href="http://afit-r.github.io/descriptive">text mining tutorials</a> that have been released. Additional tutorials will be released in the coming weeks.</p>
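<p>One common tidy-style starting point for text mining in R is sketched below using the <code class="highlighter-rouge">tidytext</code> and <code class="highlighter-rouge">dplyr</code> packages; the example text and the package choice are illustrative and may not match the approach taken in the tutorials.</p>
<pre><code class="language-r"># Minimal text-mining sketch with tidytext + dplyr: tokenize free text,
# drop common stop words, and count word frequencies. The example sentences
# are made up purely for illustration.
library(dplyr)
library(tidytext)

docs <- data.frame(
  doc  = c(1, 2),
  text = c("Analysts are trained to handle tabular data.",
           "Much of the data proliferating today is unstructured text."),
  stringsAsFactors = FALSE
)

docs %>%
  unnest_tokens(word, text) %>%            # one row per word, lowercased
  anti_join(stop_words, by = "word") %>%   # remove common stop words
  count(word, sort = TRUE)
</code></pre>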
New Tutorials for Better Workflow 2016-09-17T00:00:00+00:00 https://afit-r.github.io/2016/09/17/analytic-workflow <p>A.A. Milne said that <em>“organization is what you do before you do something, so that when you do it, it is not all mixed up.”</em> If you are not careful, your data analyses can become an explosion of data files, R scripts, ggplot graphs, and final reports. Each project evolves and mutates in its own way, and keeping all the files associated with a project organized together is a wise practice. Furthermore, reproducibility is crucial in any analytic project.</p> <p>RStudio provides several great options for improving your analytic workflow. These include RStudio Projects, R Markdown, and the newly released R Notebooks. Check out the tutorials we just released on these tools and get your analytic workflow moving in the right direction before your next project!</p> <ul> <li><a href="http://afit-r.github.io/r_projects">RStudio Projects</a></li> <li><a href="http://afit-r.github.io/r_markdown">R Markdown</a></li> <li><a href="http://afit-r.github.io/r_notebook">R Notebooks</a></li> </ul>
Welcome! 2016-08-16T00:00:00+00:00 https://afit-r.github.io/2016/08/16/welcome <p>Welcome to R programming at AFIT! This is a new online repository hosted by the Air Force Institute of Technology Department of Operational Sciences that provides tutorials to get you proficient in R programming. You will find a wide range of tutorials, from the basics of getting R and RStudio up and running to more advanced subjects such as regular expressions, analytic modeling, and visualization techniques. And we will continue to add more tutorials over time, so be sure to check back often to keep up to date on our most current offerings!</p>