Markov Chain Feature Extraction for Protein Family Classification

Toto Haryanto, Rizky Kurniawan, Sony Muhammad, Aziz Kustiyo, Endang Purnama Giri

Abstract


Proteins are complex organic molecules built from combinations of twenty amino acids, and they serve living things in many roles, such as transport, catalysis of the chemical reactions of metabolism, and food reserves. This research aims to classify protein families from their amino acid sequences, that is, from the primary structure. The data set consists of 300 amino acid fragments obtained from the Pfam protein family database, drawn from three classes: 1-cysPrx_C, 4HBT, and ABC_Tran. First- and second-order Markov chains were applied to extract features. A Probabilistic Neural Network (PNN) was used as the classifier and compared with a joint probability technique under Markov assumptions. The two techniques were evaluated by comparing their sensitivity and specificity. Overall, the PNN performed slightly better than the joint probability technique at classifying protein families.
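To make the feature extraction and the joint probability baseline concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that the Markov features are smoothed transition probabilities over the 20 standard residues and that the joint probability technique scores a fragment by the product of its transition probabilities under each family's model. The function names, the additive smoothing constant `alpha`, and the toy sequences are hypothetical; the abstract does not specify these details.

```python
import math
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def markov_features(sequences, order=1, alpha=1.0):
    """Estimate P(next residue | previous `order` residues) from sequences.

    Returns a dict keyed by (context, next_residue). Additive smoothing
    (alpha) is an assumption made here so that no probability is zero.
    """
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - order):
            counts[(seq[i:i + order], seq[i + order])] += 1

    features = {}
    for context in map("".join, product(AMINO_ACIDS, repeat=order)):
        total = sum(counts[(context, a)] for a in AMINO_ACIDS)
        total += alpha * len(AMINO_ACIDS)
        for a in AMINO_ACIDS:
            features[(context, a)] = (counts[(context, a)] + alpha) / total
    return features


def log_joint(seq, model, order=1):
    """Sum of log transition probabilities of seq under a Markov model.

    The initial-state term is omitted, and transitions that involve
    non-standard residues are skipped (both are simplifications).
    """
    total = 0.0
    for i in range(len(seq) - order):
        key = (seq[i:i + order], seq[i + order])
        if key in model:
            total += math.log(model[key])
    return total


def classify(seq, models, order=1):
    """Return the family whose model gives seq the highest log joint probability."""
    return max(models, key=lambda name: log_joint(seq, models[name], order))


# Toy data for illustration only; these are not Pfam fragments.
toy_training = {"family_A": ["ACACACAC"], "family_B": ["GGGGGG"]}
models = {name: markov_features(seqs) for name, seqs in toy_training.items()}
predicted = classify("ACAC", models)
```

With `order=1` each fragment yields 20 × 20 = 400 transition features, and with `order=2` it yields 400 × 20 = 8000. Calling `markov_features([fragment], order=...)` on a single fragment gives a fixed-length feature dictionary whose values could be fed to a classifier such as a PNN.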

Keywords


amino acid, joint probability, Markov chain, probabilistic neural network, protein family

DOI: http://dx.doi.org/10.30646/sinus.v21i2.748


 


STMIK Sinar Nusantara

KH Samanhudi 84 - 86 Street, Laweyan Surakarta, Central Java, Indonesia
Postal Code: 57142, Phone & Fax: +62 271 716 500 

Email: ejurnal @ sinus.ac.id | https://p3m.sinus.ac.id/jurnal/e-jurnal_SINUS/

ISSN: 1693-1173 (print) | 2548-4028 (online)


This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
