Function With Special Talent from ‘caret’ package in R — NearZeroVar()

Xiaotong Ding (Claire)
5 min read · Aug 31, 2021

By Xiaotong Ding (Claire), With Greg Page

A practical tool that enables a modeler to remove non-informative predictors during the variable selection stage of data modeling

In this article, we will introduce a powerful function called ‘nearZeroVar()’. This function, which comes from the caret package, is a practical tool that enables a modeler to remove non-informative predictors during the variable selection stage of data modeling.

Identification of near zero variance predictors

For starters, the nearZeroVar() function identifies zero-variance predictors: constants that take only a single unique value across all samples. In addition, nearZeroVar() diagnoses predictors as having “near-zero variance” when they possess very few unique values relative to the number of samples, and when the ratio of the frequency of the most common value to the frequency of the second most common value is large.

Regardless of the modeling process being used, or of the specific purpose for a particular model, the removal of non-informative predictors is a good idea. Leaving such variables in a model only adds extra complexity, without any corresponding payoff in model accuracy or quality.

For this analysis, we will use the dataset hawaii.csv , which contains information about Airbnb rentals from Hawaii. In the code cell below, the dataset is read into R, and blank cells are converted to NA values.
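The original code cell is not reproduced here, but a minimal sketch of the setup looks like the following (the file path, the na.strings choice, and the object names are assumptions made for illustration, not the authors’ exact code):

    # Load caret and read the Hawaii Airbnb listings, treating blank cells as NA
    # (file name and na.strings values are assumed for illustration)
    library(caret)
    hawaii <- read.csv("hawaii.csv", na.strings = c("", "NA"))

    # saveMetrics = TRUE returns one row of diagnostics per predictor
    nzv_metrics <- nearZeroVar(hawaii, saveMetrics = TRUE)
    head(nzv_metrics)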

The code chunk shown above generates a dataframe with 74 rows (one for each variable in the dataset) and four columns. If saveMetrics is set to FALSE, the positions of the zero or near-zero predictors are returned instead.
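For example, a call along these lines (a sketch reusing the hawaii dataframe from above) returns just those column positions:

    # With the default saveMetrics = FALSE, nearZeroVar() returns only the
    # positions of the zero- and near-zero-variance columns
    nzv_positions <- nearZeroVar(hawaii)
    nzv_positions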

The first column, freqRatio, tells us the ratio of frequencies for the most common value over the second most common value for that variable. To see how this is calculated, let’s look at the freqRatio for host_has_profile_pic (282.184):

In the entire dataset, there are 76 ‘f’ values, and 21446 ‘t’ values. The frequency ratio of the most common outcome to the second-most common outcome, therefore, is 21446/76, or 282.1842.
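A quick sketch of that arithmetic, using the counts quoted above:

    # Tabulate host_has_profile_pic, then divide the larger count by the smaller
    table(hawaii$host_has_profile_pic)
    # f: 76, t: 21446 (counts as reported above)
    21446 / 76   # 282.1842, matching the freqRatio column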

The second column, percentUnique, indicates the percentage of unique data points out of the total number of data points.

To illustrate how this is determined, let’s examine the ‘license’ variable, which shows a value here of 45.384007806. The length of the output from the unique() function, generated below, indicates that license contains 9768 distinct values throughout the entire dataset (most likely, some are repeated because a single individual may own multiple Airbnb properties).

By dividing the number of unique values by the number of observations, and then multiplying by 100, we arrive back at the percentUnique value shown above:
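Here is a sketch of both steps (the 9768 count and the resulting percentage are taken from the output quoted above, not re-computed here):

    # Count the distinct license values, then express them as a share of all rows
    length(unique(hawaii$license))                           # 9768 distinct values
    100 * length(unique(hawaii$license)) / nrow(hawaii)      # ~45.384, matching percentUnique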

For predictive modeling with numeric input features, it can be okay to have 100 percent uniqueness, as numeric values exist along a continuous spectrum. Imagine, for example, a medical dataset with the weights of 250 patients, all taken to 5 decimal places of precision — it is quite possible to expect that no two patients’ weights would be identical, yet weight could still carry predictive value in a model focused on patient health outcomes.

For non-numeric data, however, 100 percent uniqueness means that the variable will not have any predictive power in a model. If every customer in a bank lending dataset has a unique address, for example, then the ‘customer address’ variable cannot offer us any general insights about default likelihood.

The third column, zeroVar, is a vector of logicals (TRUE or FALSE) that indicate whether the predictor has only one distinct value. Such variables will not yield any predictive power, regardless of their data type.

The fourth column, nzv, is also a vector of logical values, for which TRUE values indicate that the variable is a near-zero variance predictor. For a variable to be flagged as such, it must meet two conditions: (1) Its frequency ratio must exceed the freqCut threshold used by the function; AND (2) its percentUnique value must fall below the uniqueCut threshold used by the function.

By default, freqCut is set to 95/5 (that is, 19), and uniqueCut is set to 10 (a percentage, matching the units of percentUnique).
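Expressed directly against the metrics table, the flag amounts to something like this (a sketch of the logic just described, not caret’s internal code; zero-variance predictors are flagged as well):

    # A predictor is flagged nzv when both conditions described above hold
    flag_nzv <- nzv_metrics$freqRatio > 95/5 & nzv_metrics$percentUnique < 10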

Let’s take a look at the variables with the 10 highest frequency ratios:
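The original output is not shown here; one way to pull those rows out of the metrics table is a sketch like this:

    # Sort the diagnostics by freqRatio, largest first, and keep the top ten
    head(nzv_metrics[order(nzv_metrics$freqRatio, decreasing = TRUE), ], 10)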

Right now, number_of_reviews_l30d (number of reviews in the last 30 days) is considered an ‘nzv’ variable, with its frequency ratio of 26.54 falling above the default freqCut of 19, and its uniqueness percentage of 0.046 falling below the default uniqueCut of 10. If we adjust the function’s settings in a way that nullifies either of those conditions, it will no longer be considered an nzv variable:
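The exact cutoff used in the original code cell is not shown; the sketch below lowers uniqueCut beneath 0.046, which is one way to achieve the effect described (the 0.04 value is an assumed example):

    # With uniqueCut below 0.046, number_of_reviews_l30d no longer meets both
    # nzv conditions (0.04 is an illustrative choice, not the authors' setting)
    nzv_adjusted <- nearZeroVar(hawaii, saveMetrics = TRUE, uniqueCut = 0.04)
    nzv_adjusted["number_of_reviews_l30d", ]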

Note that with the lower uniqueCut in place, number_of_reviews_l30d no longer qualifies for nzv status. Raising freqCut to any value above 26.55 would have had a similar effect.

So what is the “correct” setting to use? Like nearly everything else in the world of modeling, this question does not lend itself to a “one-size-fits-all” answer. At times, nearZeroVar() may serve as a handy way to quickly whittle down the size of an enormous dataset. Other times, it might even be used in a nearly-opposite way — if a modeler is specifically looking to call attention to anomalous values, this function could be used to flag variables that contain them.

Either way, we encourage you to explore this function, and to consider making it part of your Exploratory Data Analysis (EDA) routine, especially when you are faced with a large dataset and looking for places to simplify the task in front of you.
