Research Article

Predicting bacterial essential genes using only sequence composition information

Published: June 17, 2014
Genet. Mol. Res. 13 (2) : 4564-4572 DOI: https://doi.org/10.4238/2014.June.17.8
Cite this Article:
(2014). Predicting bacterial essential genes using only sequence composition information. Genet. Mol. Res. 13(2): gmr3515. https://doi.org/10.4238/2014.June.17.8

Abstract

Essential genes are those that an organism needs at all times and under all conditions. Identifying essential genes in bacterial genomes is important because of their vital role in synthetic biology and biomedical practice. In this paper, we developed a support vector machine (SVM)-based method to predict essential genes in bacterial genomes using only compositional features. These features are all derived from the primary sequences, i.e., nucleotide and protein sequences. After training on multiple samplings of the labeled (essential or non-essential) features using a library for SVM, we obtained an average area under the ROC curve (AUC) of about 0.82 in 5-fold cross-validation for Escherichia coli and about 0.74 for Mycoplasma pulmonis. We further evaluated the proposed method on a dataset consisting of 16 bacterial genomes and achieved an average AUC of 0.76. Based on this training dataset, a model for essential gene prediction was established. Two further independent genomes, Shewanella oneidensis RW1 and Salmonella enterica serovar Typhimurium SL1344, were used to evaluate the model, yielding AUC scores of 0.77 and 0.81, respectively. For the convenience of experimental scientists, a web server has been constructed and is freely available at http://cefg.uestc.edu.cn:9999/egp.

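To make the notions of "compositional features" and AUC concrete, the following minimal Python sketch may help. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not list the exact features, so the choice of mononucleotide, dinucleotide, and amino acid frequencies is assumed, and the function names (`kmer_frequencies`, `composition_features`, `roc_auc`) are hypothetical. SVM training itself is omitted; the resulting vectors could be passed to any SVM library.

```python
from itertools import product

NUCLEOTIDES = "ACGT"
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_frequencies(seq, alphabet, k):
    """Return relative frequencies of all k-mers over `alphabet`, in a fixed order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing ambiguous symbols (e.g. N, X)
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


def composition_features(dna, protein):
    """Assumed feature set: 4 nucleotide + 16 dinucleotide + 20 amino acid frequencies."""
    return (kmer_frequencies(dna, NUCLEOTIDES, 1)
            + kmer_frequencies(dna, NUCLEOTIDES, 2)
            + kmer_frequencies(protein, AMINO_ACIDS, 1))


def roc_auc(labels, scores):
    """AUC as the probability that a positive example outscores a negative one.

    Labels are 1 (essential) or 0 (non-essential); tied scores count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Each gene is thus represented by a fixed-length vector of 40 frequencies computed from its sequence alone, and `roc_auc` scores how well a classifier's decision values rank essential genes above non-essential ones.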