A Feature Selection Method to Handle Imbalanced Data in Text Classification

Fengxiang Chang, Jun Guo, Weiran Xu, Kejun Yao

Journal of Digital Information Management, vol. 13, no. 3, 2015

The imbalanced data problem is often encountered in applications of text classification. Feature selection, which can reduce the dimensionality of the feature space and improve classifier performance, is widely used in text classification. This paper presents a new feature selection method named NFS, which selects class information words rather than terms with high document frequency. To improve classifier performance further, we combine NFS with data resampling to address the imbalanced data problem. Experiments were conducted on the Reuters-21578 collection, and the results show that NFS performs better than chi-square statistics and mutual information on the original dataset when the number of selected features is greater than 1000. The maximum Macro-F1 is 0.7792 when NFS is applied to the resampled dataset, an increase in Macro-F1 of 4.02% over the original dataset. Thus, our proposed method effectively improves minority class performance.
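
Macro-F1 is the evaluation measure reported above, and a small sketch may help show why it is sensitive to minority class performance. The code below is not taken from the paper: the function names `macro_f1` and `accuracy` and the toy labels are illustrative assumptions, and it treats the task as single-label for simplicity. It computes F1 per class and averages the per-class scores without weighting by class size, so a class with few documents counts as much as a large one.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (single-label setting)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def accuracy(y_true, y_pred):
    """Fraction of documents labelled correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented toy data: 9 majority-class documents and 1 minority-class document.
y_true = ["earn"] * 9 + ["corn"]
# A classifier that ignores the minority class entirely.
y_pred = ["earn"] * 10

print(f"accuracy = {accuracy(y_true, y_pred):.4f}")
print(f"macro-F1 = {macro_f1(y_true, y_pred):.4f}")
```

On this toy data the classifier is right for nine of ten documents, yet its Macro-F1 is below 0.5 because the minority class contributes an F1 of zero to the average. This is why a gain in Macro-F1 is read as a gain in minority class performance.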