Application of Random Forest Method Classification for Glycosylation in Lysine Protein Sequences

Silfia Fitriyana
Admi Syarif
Favorisen Rossyking
Mohammad Reza Faisal

Abstract

Classification of glycosylation in lysine protein sequences is the task of grouping lysine proteins according to the type of glycosylation observed in their sequences. This work examines the sensitivity, specificity, accuracy, and Matthews correlation coefficient (MCC) of the Random Forest approach for this task. The lysine protein dataset, derived from benchmark data, contains 620 protein sequences (214 positive and 406 negative), each with a sequence length of 15. 90% of the dataset is used for training and 10% for testing. Feature extraction was performed with the R package BioSeqClass version 1.44.0 using the protein descriptors AA Index, CTD, and PseAAC, giving a total of 60 features. The Random Forest classification algorithm was then applied to the extracted features with Mtry values of 4, 8, and 16, and the number of trees (ntree) was varied over 250, 500, 750, and 1000. The best results were achieved with a dataset split of 90% training data and 10% test data, using Mtry of 42 and 1000 trees, resulting in 89.97% sensitivity, 92.79% specificity, 80.76% MCC, and 90.42% accuracy. These results demonstrate that the combination of feature extraction and the Random Forest algorithm is effective in classifying lysine proteins.
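The four evaluation measures named in the abstract can all be derived from the counts of a binary confusion matrix. The following is a minimal Python sketch of those standard definitions, not the authors' R code; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the study.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Compute sensitivity, specificity, accuracy and MCC from confusion-matrix counts.

    tp/fn: positive samples predicted correctly/incorrectly.
    tn/fp: negative samples predicted correctly/incorrectly.
    """
    # Sensitivity: share of positive samples that were recovered.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: share of negative samples that were recovered.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Accuracy: share of all samples predicted correctly.
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total if total else 0.0
    # MCC: correlation between predicted and true labels, in [-1, 1].
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Hypothetical counts for illustration only (not results from the paper).
    scores = binary_metrics(tp=8, fn=2, tn=18, fp=2)
    for name, value in scores.items():
        print(f"{name}: {value:.4f}")
```

MCC is reported alongside accuracy because the dataset is imbalanced (214 positive versus 406 negative sequences), and MCC takes all four cells of the confusion matrix into account.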
