Comparative Analysis of Filter Feature Selection Methods on Microarray Datasets
- Authors
- Madhuri Gokhale
- Keywords:
- Microarray, classification, feature selection, gene selection
- Abstract
Microarray technology is an emerging technology that enables the simultaneous analysis of large-scale gene expression data. However, interpreting gene expression data remains challenging because of its high-dimensional, low-sample-size nature: microarray datasets contain thousands of genes but only a limited number of samples, which complicates classification. Feature selection methods, also known as gene selection methods, are therefore essential for identifying the most informative genes, those that best discriminate between cancerous and normal tissues. Although several feature selection approaches have been proposed, no universally accepted method consistently produces optimal results across different datasets. This study presents a comparative analysis of four widely used filter-based feature selection methods, namely Chi-Square (χ²), ReliefF, Mutual Information, and Symmetrical Uncertainty, on five benchmark microarray cancer datasets: Colon, Central Nervous System (CNS), Leukemia, Lung, and Ovarian. The selected features are evaluated using six machine learning classifiers: Random Forest, Decision Tree, Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Naive Bayes, and Logistic Regression. Experimental results demonstrate that feature selection significantly improves classification performance by removing irrelevant and redundant features. Among the evaluated combinations, SVM with Mutual Information achieved the best overall performance on most datasets. The study provides a comprehensive evaluation of filter-based feature selection techniques and their impact on cancer classification accuracy using microarray data.
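As a rough illustration of how a filter method scores genes independently of any classifier, the sketch below computes Mutual Information and Symmetrical Uncertainty for discrete features and ranks them against the class label. It is not the implementation used in the study: the function names, the toy gene values, and the assumption that expression levels have already been discretized into bins are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two discrete sequences."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def symmetrical_uncertainty(x, y):
    """SU(X,Y) = 2 * I(X;Y) / (H(X) + H(Y)), a value between 0 and 1."""
    denom = entropy(x) + entropy(y)
    return 0.0 if denom == 0 else 2.0 * mutual_information(x, y) / denom


def rank_features(columns, labels, score=symmetrical_uncertainty):
    """Score every feature against the labels and sort best-first."""
    scores = {name: score(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True), scores


# Toy data (made up): 8 samples, binary class label, 3 discretized "genes".
labels = [0, 0, 0, 0, 1, 1, 1, 1]
genes = {
    "gene_a": [0, 0, 0, 0, 1, 1, 1, 1],  # tracks the label exactly
    "gene_b": [0, 0, 0, 1, 1, 1, 1, 0],  # agrees with the label on 6 of 8 samples
    "gene_c": [0, 1, 0, 1, 0, 1, 0, 1],  # statistically independent of the label
}

ranking, scores = rank_features(genes, labels)
for name in ranking:
    print(f"{name}: SU = {scores[name]:.3f}")
```

In a filter pipeline of this kind, only the top-ranked genes would be passed to a classifier such as an SVM. Passing `score=mutual_information` to `rank_features` ranks by Mutual Information instead of Symmetrical Uncertainty.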
- Published
- 2026-05-26
- Versions
- 2026-05-26 (3)
- 2026-05-26 (2)
- 2026-05-26 (1)
- Section
- Articles