Class imbalance is a common problem in machine learning: the number of observations in one class is far smaller than in another. It hurts the quality and reliability of results because most evaluation metrics, and many learning algorithms, implicitly assume a balanced class distribution. In this post, I share a few simple yet effective methods for handling imbalanced datasets in R.
The most common method is to assign a weight to each class. Consider a dataset named `data_training` whose binary target variable `target` takes the value `Y` for the positive class and `N` for the negative class; the class weights can then be assigned as follows.
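A minimal sketch of inverse-frequency weighting, using the `data_training` and `target` names from above; the `gbm` model in the commented call is only an illustration of a method that accepts case weights:

```r
# Weight each observation inversely to its class frequency, so that the
# minority class contributes as much to the loss as the majority class.
class_counts  <- table(data_training$target)
model_weights <- ifelse(data_training$target == "Y",
                        nrow(data_training) / (2 * class_counts["Y"]),
                        nrow(data_training) / (2 * class_counts["N"]))

# Pass the weights to caret::train() for a model that supports case weights:
# fit <- train(target ~ ., data = data_training,
#              method = "gbm", weights = model_weights)
```

With these weights, errors on the rare `Y` class are penalized more heavily than errors on the abundant `N` class.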
The simplest data resampling methods are downsampling and upsampling, and both are supported out of the box by the caret package.
We only need to specify the resampling method in the control object for training. For downsampling:
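A sketch using caret's built-in `sampling = "down"` option; the cross-validation settings and the `glm` model are placeholders, not prescriptions:

```r
library(caret)

# Downsampling: caret randomly drops majority-class rows inside each
# resampling iteration so the model trains on balanced data.
ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     sampling = "down")

# fit <- train(target ~ ., data = data_training,
#              method = "glm", metric = "ROC", trControl = ctrl)
```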
And for upsampling:
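The same control object with `sampling = "up"` instead; again, the resampling scheme is just an example:

```r
library(caret)

# Upsampling: caret resamples minority-class rows with replacement until
# the classes are balanced within each resampling iteration.
ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     sampling = "up")
```

Note that caret applies the sampling inside the resampling loop, which avoids leaking duplicated minority rows into the held-out folds.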
There are also a few hybrid methods, such as random over-sampling examples (ROSE) and the synthetic minority over-sampling technique (SMOTE), which downsample the majority class while generating synthetic examples of the minority class. To use ROSE, we need to load the ROSE package.
And we create a wrapper around the `ROSE()` function so that caret can call it during resampling.
We specify the resampling method in the control object.
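The steps above can be sketched as follows. caret also accepts the built-in string `sampling = "rose"`; the explicit wrapper below uses caret's custom-sampling interface (a list with `name`, `func`, and `first` elements) to make the call to `ROSE()` visible. The names `rose_wrapper` and `.outcome` are my own illustrative choices:

```r
library(caret)
library(ROSE)

rose_wrapper <- list(
  name = "ROSE hybrid sampling",
  func = function(x, y) {
    # caret hands this function the predictors x and the outcome y;
    # ROSE() expects a formula plus a data frame, so bind them together.
    dat <- if (is.data.frame(x)) x else as.data.frame(x)
    dat$.outcome <- y
    dat <- ROSE(.outcome ~ ., data = dat)$data
    list(x = dat[, setdiff(colnames(dat), ".outcome"), drop = FALSE],
         y = dat$.outcome)
  },
  first = TRUE  # apply the sampling before any pre-processing steps
)

ctrl <- trainControl(method = "cv", number = 5, sampling = rose_wrapper)
# fit <- train(target ~ ., data = data_training,
#              method = "rf", trControl = ctrl)
```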
Similarly, to use SMOTE, we first load the DMwR package.
Then we create a wrapper around the `SMOTE()` function in the same way.
Finally, we specify the resampling method in the control object.
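A parallel sketch for SMOTE. The built-in `sampling = "smote"` string works as well; the wrapper form shows where SMOTE's parameters would go. Note that the DMwR package has since been archived on CRAN, so it may need to be installed from the archive. As before, `smote_wrapper` and `.outcome` are illustrative names:

```r
library(caret)
library(DMwR)  # provides SMOTE()

smote_wrapper <- list(
  name = "SMOTE",
  func = function(x, y) {
    dat <- if (is.data.frame(x)) x else as.data.frame(x)
    dat$.outcome <- y
    # Defaults oversample the minority class and undersample the majority;
    # tune perc.over / perc.under / k here if needed.
    dat <- SMOTE(.outcome ~ ., data = dat)
    list(x = dat[, setdiff(colnames(dat), ".outcome"), drop = FALSE],
         y = dat$.outcome)
  },
  first = TRUE
)

ctrl <- trainControl(method = "cv", number = 5, sampling = smote_wrapper)
```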