From Practical Data Science with R, 2nd Ed. By Nina Zumel and John Mount

In this article, we demonstrate some ways to get to know your data, and discuss some of the potential issues that you’re looking for as you explore.

Figure 1 Mental Model of Practical Data Science with R

As shown in the mental model (figure 1), this article emphasizes the science of exploring the data prior to the model-building step. Your goal is to have data that is as clean and useful as possible.

Example Scenario

Suppose your goal is to build a model to predict which of your customers don’t have health insurance. You’ve collected a dataset of customers whose health insurance status you know. You’ve also identified some customer properties that you believe help predict the probability of insurance coverage: age, employment status, income, information about residence and vehicles, and so on.

You’ve put all your data into a single data frame called customer_data that you’ve loaded into R.[1] Now you’re ready to start building the model to identify the customers you’re interested in.

It’s tempting to dive right into the modeling step without looking closely at the dataset first, especially when you have a lot of data. Resist the temptation. No dataset is perfect: you’ll be missing information about some of your customers, and you’ll have incorrect data about others. Some data fields will be dirty and inconsistent. If you don’t take the time to examine the data before you start to model, you may find yourself redoing your work repeatedly as you discover bad data fields or variables that need to be transformed before modeling. In the worst case, you’ll build a model that returns incorrect predictions, and you won’t be sure why.

TIP: Get to know your data before modeling. By addressing data issues early, you can save yourself some unnecessary work, and a lot of headaches!

You’d also like to get a sense of who your customers are: Are they young, middle-aged, or seniors? How affluent are they? Where do they live? Knowing the answers to these questions can help you build a better model, because you’ll have a more specific idea of what information most accurately predicts the probability of insurance coverage.

Data exploration uses a combination of summary statistics — means and medians, variances, and counts — and visualization, or graphs of the data. You can spot some problems by using summary statistics; other problems are easier to find visually.

Organizing data for analysis

For this article, we’ll assume that the data you’re analyzing is in a single data frame. This isn’t how data is usually stored. In a database, for example, data is usually stored in normalized form to reduce redundancy: information about a single customer is spread across many small tables. In log data, data about a single customer can be spread across many log entries, or sessions. These formats make it easy to add (or, in the case of a database, modify) data, but they aren’t optimal for analysis. You can often join all the data you need into a single table in the database using SQL, and join commands can also be used within R to further consolidate data.
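
As a minimal sketch of consolidating tables inside R, the example below uses base R’s merge() on two small made-up tables. The table and column names (customers, vehicles, custid) and all values are hypothetical stand-ins, not part of the book’s dataset.

```r
# Two hypothetical normalized tables keyed by a customer id.
customers <- data.frame(
  custid = c(1, 2, 3),
  age    = c(34, 51, 27)
)
vehicles <- data.frame(
  custid       = c(1, 3),
  num_vehicles = c(2, 1)
)

# Left join: keep every customer; customers with no matching
# vehicle record get NA in num_vehicles.
customer_wide <- merge(customers, vehicles, by = "custid", all.x = TRUE)
print(customer_wide)
```

Using all.x = TRUE keeps customers that have no vehicle record, which also makes the resulting missing values visible for the checks discussed later in this article.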

Using summary statistics to spot problems

In R, you’ll typically use the summary command to take your first look at the data. The goal is to understand whether you have the kind of customer information that can potentially help you predict health insurance coverage, and whether the data is of good enough quality to be informative.

Listing 1. The summary() command
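
The annotations below refer to code along the following lines. This is a minimal sketch, assuming the file described in footnote 1 was saved under a directory named PDSwR2/Custdata; that path is an assumption, so change it to match your machine.

```r
# Assumed location of the downloaded file; adjust to your own path.
data_file <- file.path("PDSwR2", "Custdata", "custdata.RDS")

if (file.exists(data_file)) {
  customer_data <- readRDS(data_file)
  # One block of statistics per column of the data frame.
  print(summary(customer_data))
} else {
  message("custdata.RDS not found at ", data_file)
}
```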

❶ Change this to your actual path to the directory where you unpacked PDSwR2

❷ The variable is_employed is missing for about a third of the data. The variable income has negative values, which are potentially invalid.

❸ About 90% of the customers have health insurance.

❹ The variables housing_type, recent_move, num_vehicles, and gas_usage are each missing 1720 or 1721 values.

❺ The average value of the variable age seems plausible, but the minimum and maximum values seem unlikely. The variable state_of_res is a categorical variable; summary() reports how many customers are in each state (for the first few states).

The summary command on a data frame reports a variety of summary statistics on the numerical columns of the data frame, and count statistics on any categorical columns (if the categorical columns have already been read in as factors).[2]
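
To see the difference a factor makes, here is a small sketch on a made-up data frame. The column names mirror the article’s, but the values are invented for illustration and are not taken from custdata.RDS.

```r
# Invented toy data; only the column names come from the article.
toy <- data.frame(
  age          = c(24, 58, NA, 41),
  state_of_res = c("Ohio", "Texas", "Ohio", "Utah"),
  stringsAsFactors = FALSE
)

# A character column only reports its length and class.
print(summary(toy$state_of_res))

# Converted to a factor, the same column reports a count per level.
toy$state_of_res <- factor(toy$state_of_res)
state_counts <- summary(toy$state_of_res)
print(state_counts)

# Numeric columns get quartiles, the mean, and a count of NAs.
print(summary(toy$age))
```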

As you see from listing 1, the summary of the data helps you quickly spot potential problems, like missing data or unlikely values. You also get a rough idea of how categorical data is distributed. Let’s go into more detail about the typical problems that you can spot using the summary.

Typical problems revealed by data summaries

At this stage, you’re looking for several common issues:

  • Missing values
  • Invalid values and outliers
  • Data ranges that are too wide or too narrow
  • The units of data

Let’s address each of these issues in detail.

MISSING VALUES

A few missing values may not be a problem, but if a particular data field is largely unpopulated, it shouldn’t be used as an input without some repair. In R, for example, many modeling algorithms, by default, quietly drop rows with missing values. As you see in listing 2, all the missing values in the is_employed variable could cause R to quietly ignore over a third of the data.
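
The quiet row-dropping is easy to demonstrate. The sketch below fits a linear model with lm() on invented toy data (not the book’s dataset) and compares the number of rows supplied with the number actually used in the fit. It assumes R’s default setting of options("na.action"), which is na.omit.

```r
# Invented toy data: 3 of the 8 rows are missing is_employed.
toy <- data.frame(
  income      = c(20, 35, 50, 42, 61, 28, 75, 33),
  age         = c(25, 31, 47, 39, 52, 28, 60, 36),
  is_employed = c(TRUE, NA, TRUE, FALSE, NA, TRUE, TRUE, NA)
)

# With the default na.action (na.omit), lm() drops incomplete rows
# without issuing a warning.
model <- lm(income ~ age + is_employed, data = toy)

rows_total <- nrow(toy)
rows_used  <- nobs(model)
cat("rows in data:", rows_total, " rows used in fit:", rows_used, "\n")
```

Comparing nrow() of the input with nobs() of the fitted model is a cheap check to run whenever a model’s inputs contain NAs.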

Listing 2. Will the variable is_employed be useful for modeling?

❶ The variable is_employed is missing for over a third of the data. Why? Is employment status unknown? Did the company start collecting employment data only recently? Does NA mean “not in the active workforce” (for example, students or stay-at-home parents)?

❷ The variables housing_type, recent_move, num_vehicles and gas_usage are missing relatively few values — about 2% of the data. It’s probably safe to just drop the rows that are missing values, especially if the missing values are all in the same 1720 rows.
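
One way to check both points (how much is missing, and whether the gaps fall in the same rows) is sketched below on invented toy data. With the real data you would pass the relevant columns of customer_data instead.

```r
# Invented toy data; only the column names come from the article.
toy <- data.frame(
  housing_type = c("Rented", NA, "Owned", NA, "Rented"),
  recent_move  = c(FALSE, NA, TRUE, NA, FALSE),
  num_vehicles = c(1, NA, 2, NA, 0)
)

# Number and fraction of missing values in each column.
na_counts   <- colSums(is.na(toy))
na_fraction <- colMeans(is.na(toy))
print(na_counts)
print(na_fraction)

# Do the missing values fall in the same rows? Count NAs per row:
# if every row has either no NAs or NAs in all columns, the gaps coincide.
na_per_row <- rowSums(is.na(toy))
same_rows  <- all(na_per_row %in% c(0, ncol(toy)))
print(same_rows)
```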

If a particular data field is largely unpopulated, it’s worth trying to determine why; sometimes the fact that a value is missing is informative in and of itself. For example, why is the is_employed variable missing many values? As we noted in listing 2, there are many possible reasons.

Whatever the reason for missing data, you must decide on the most appropriate action. Do you include a variable with missing values in your model? If you decide to include it, do you drop all the rows where this field is missing, or do you convert the missing values to 0 or to an additional category? In this example, you might decide to drop the data rows where you’re missing data about housing or vehicles, because there aren’t many of them. You probably don’t want to throw out the data where you’re missing employment information, because employment status is probably highly predictive of having health insurance; you might instead treat the NAs as a third employment category. You’ll likely encounter missing values when scoring the model, so you should deal with them during model training.
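
Both choices can be sketched in a few lines of base R. The data below is invented, and the new column name (employment) and its level names are illustrative choices rather than anything prescribed by the book.

```r
# Invented toy data; only the original column names come from the article.
toy <- data.frame(
  is_employed  = c(TRUE, NA, FALSE, NA, TRUE),
  num_vehicles = c(1, 2, 0, NA, 3)
)

# Choice 1: drop the few rows that are missing vehicle data.
toy_kept <- toy[!is.na(toy$num_vehicles), ]

# Choice 2: recode employment status so that NA becomes its own level
# instead of causing the row to be dropped.
toy_kept$employment <- ifelse(
  is.na(toy_kept$is_employed), "missing",
  ifelse(toy_kept$is_employed, "employed", "not employed")
)
toy_kept$employment <- factor(
  toy_kept$employment,
  levels = c("employed", "not employed", "missing")
)
print(table(toy_kept$employment))
```

Whatever recoding you settle on during training needs to be applied in exactly the same way to the data you later score.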

About the authors:
Nina Zumel and John Mount are co-founders of Win-Vector LLC, a San Francisco-based data science consulting firm. Both hold PhDs from Carnegie Mellon and blog on statistics, probability, and computer science at win-vector.com.

[1] We have a copy of this synthetic dataset available for download from https://github.com/WinVector/PDSwR2/tree/master/Custdata, and once saved, you can load it into R with the command customer_data <- readRDS("custdata.RDS"). This data set is derived from census data. We introduced a little noise to the age variable to reflect what’s typically seen in real-world noisy data sets. We also included columns which are not necessarily relevant to our example scenario, but which exhibit some important data anomalies.

[2] Categorical variables are of class factor in R. They can be represented as strings (class character), and some analytical functions automatically convert string variables to factor variables. To get a useful summary of a categorical variable, it needs to be a factor.

Originally published at freecontent.manning.com.
