Introduction

Binning transforms numerical (continuous) data into categorical data. It is a common data pre-processing step in model building.

rbin has the following features:

  • manual binning using a Shiny app
  • equal length binning method
  • winsorized binning method
  • quantile binning method
  • combine levels of categorical data
  • create dummy variables based on binning method
  • calculate weight of evidence (WOE), entropy and information value (IV)
  • provide summary information about binning pre-processing
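Since WOE and IV appear throughout, here is a minimal base R sketch of one common way to compute them from per-bin counts. The counts are made up for illustration, and the sign convention (non-events over events) is an assumption; rbin's own convention and its entropy formula are not stated here, so treat this as a sketch rather than the package's implementation.

```r
# Hypothetical per-bin counts (made up for illustration)
non_event <- c(80, 60, 40)   # observations with response == 0
event     <- c(20, 40, 60)   # observations with response == 1

# Share of all non-events and of all events that falls in each bin
dist_non_event <- non_event / sum(non_event)
dist_event     <- event / sum(event)

# WOE per bin: log ratio of the two distributions
# (assumed sign convention; some references use events over non-events)
woe <- log(dist_non_event / dist_event)

# IV: WOE weighted by the difference in distributions, summed over bins
iv <- sum((dist_non_event - dist_event) * woe)

data.frame(non_event, event, woe)
iv
```

A bin in which events and non-events are equally represented gets a WOE of zero, and IV is never negative under this definition.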

Manual Binning

For manual binning, you need to specify the cut points for the bins. rbin creates bins as left-closed, right-open intervals ([0,1) = {x | 0 ≤ x < 1}). The number of cut points you specify is one less than the number of bins you want to create, i.e. to create 10 bins, you need to specify only 9 cut points. The accompanying RStudio addin, rbinAddin(), can be used to iteratively bin the data and to enforce a monotonically increasing or decreasing trend.
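The relationship between cut points and bins can be sketched in base R with cut(). This is not rbin's internal code: the age values and cut points below are hypothetical, and the open-ended first and last bins are an assumption made so that every value lands in some bin.

```r
# Made-up ages for illustration (not taken from any dataset)
age <- c(18, 29, 30, 31, 35, 42, 56, 70)

# Nine hypothetical cut points, which should yield ten bins
cut_points <- c(29, 31, 34, 36, 39, 42, 46, 51, 56)

# right = FALSE gives left-closed, right-open intervals [a, b);
# -Inf and Inf add the open-ended first and last bins
bins <- cut(age, breaks = c(-Inf, cut_points, Inf), right = FALSE)

nlevels(bins)   # number of bins
table(bins)     # observations per bin
```

Because the intervals are left-closed, a value equal to a cut point (29 here) falls into the bin that starts at that cut point, not the one that ends there.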

After finalizing the bins, you can use rbin_create() to create the dummy variables.
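To make the idea of dummy variables concrete, the base R sketch below builds one 0/1 indicator column per bin. The values and breaks are made up, and the layout of the result (one column per bin, none dropped) is an assumption; the columns that rbin_create() actually returns may be named and arranged differently.

```r
# Made-up values and hypothetical breaks for illustration
x <- c(5, 12, 25, 40, 18)
bins <- cut(x, breaks = c(-Inf, 10, 20, Inf), right = FALSE)

# One indicator column per bin: 1 if the observation falls in that bin, else 0
dummies <- sapply(levels(bins), function(lv) as.integer(bins == lv))

dummies
```

Each row contains exactly one 1, because every observation belongs to exactly one bin.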

Plot

# plot the binning result stored in `bins`
plot(bins)

Factor Binning

You can collapse or combine levels of a factor/categorical variable using rbin_factor_combine() and then use rbin_factor() to look at weight of evidence, entropy and information value. After finalizing the bins, you can use rbin_factor_create() to create the dummy variables. You can use the RStudio addin, rbinFactorAddin() to interactively combine the levels and create dummy variables after finalizing the bins.

Combine Levels

# store the levels to combine in a vector, then merge them into a new level "upper"
upper <- c("secondary", "tertiary")
out <- rbin_factor_combine(mbank, education, upper, "upper")
table(out$education)
#> 
#> primary unknown   upper 
#>     691     179    3651

# equivalent call with the levels passed inline
out <- rbin_factor_combine(mbank, education, c("secondary", "tertiary"), "upper")
table(out$education)
#> 
#> primary unknown   upper 
#>     691     179    3651