Comparative Analysis of Filter Feature Selection Methods on Microarray Datasets

Authors

Madhuri Gokhale

Keywords:

Microarray, classification, feature selection, gene selection

Abstract

Microarray technology is an emerging tool for analyzing the expression of large numbers of genes simultaneously. However, interpreting gene expression data remains challenging because the data are high-dimensional and the sample size is low: microarray datasets contain thousands of genes but only a limited number of samples, which complicates classification. Feature selection methods, also known as gene selection methods, are therefore essential for identifying the most informative genes, those that best discriminate between cancerous and normal tissues. Although several feature selection approaches have been proposed, no universally accepted method consistently produces optimal results across different datasets. In this study, four widely used filter-based feature selection methods, namely Chi-Square (χ²), ReliefF, Mutual Information, and Symmetrical Uncertainty, are compared on five benchmark microarray cancer datasets: Colon, Central Nervous System (CNS), Leukemia, Lung, and Ovarian. The selected features are evaluated using six machine learning classifiers: Random Forest, Decision Tree, Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Naive Bayes, and Logistic Regression. Experimental results demonstrate that feature selection significantly improves classification performance by removing irrelevant and redundant features. Among the evaluated combinations, SVM with Mutual Information achieved the best overall performance on most datasets. The study provides a comprehensive evaluation of filter-based feature selection techniques and their impact on cancer classification accuracy using microarray data.
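
As an illustration of how a filter method ranks genes independently of any classifier, the following minimal sketch scores discretized features by Mutual Information and Symmetrical Uncertainty and keeps the top k. It is not the authors' implementation: the toy data, the gene names, and the helper functions (entropy, mutual_information, symmetrical_uncertainty, rank_features) are assumptions made for this example, and real microarray expression values are continuous and would first need to be discretized or scored with a continuous estimator.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) for two discrete sequences."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a normalized score in [0, 1]."""
    denom = entropy(x) + entropy(y)
    return 0.0 if denom == 0 else 2.0 * mutual_information(x, y) / denom


def rank_features(columns, labels, score=mutual_information):
    """Return feature names sorted from most to least relevant to the labels."""
    scores = {name: score(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: 8 samples, expression already discretized to "lo"/"hi".
labels = ["tumor"] * 4 + ["normal"] * 4
genes = {
    "gene_a": ["hi", "hi", "hi", "hi", "lo", "lo", "lo", "lo"],  # separates the classes
    "gene_b": ["hi", "hi", "hi", "lo", "hi", "lo", "lo", "lo"],  # partly informative
    "gene_c": ["hi", "lo", "hi", "lo", "hi", "lo", "hi", "lo"],  # independent of the class
}

ranking = rank_features(genes, labels)
top_k = ranking[:2]  # genes passed on to the classifier
```

Because the scores depend only on the data and the class labels, the same ranking can be reused with any of the classifiers listed in the abstract; passing score=symmetrical_uncertainty to rank_features switches the criterion.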

Published

2026-05-26

Issue

Volume 2, Issue 1 (March 2026)

Section

Articles
