Research Article

Mining ORESTES no-match database: can we still contribute to cancer transcriptome?

Published: February 24, 2006
Genet. Mol. Res. 5(1): 24-32
Cite this Article:
R. da Silva Fonseca, D. Maria Carraro, H. Brentani (2006). Mining ORESTES no-match database: can we still contribute to cancer transcriptome? Genet. Mol. Res. 5(1): 24-32.

Abstract

The Human Cancer Genome Project generated about 1 million expressed sequence tags by the ORESTES method, principally with the aim of obtaining data from cancer. Of this total, 341,680 showed no similarity with sequences in the public transcript databases and are referred to as “no-match”. Some of them represent low-abundance or difficult-to-detect human transcripts, but others represent genomic contamination or immature mRNA. We applied a bioinformatics pipeline to determine the novelty of the ORESTES “no-match” datasets from prostate or breast tissues. We started with 14,908 clusters mapped on the human genome. From these, we selected a total of 2,226 clusters that either originated from more than two libraries or were singletons with gaps upon genome alignment. Ninety-four clusters with canonical splice sites, the most stringent criterion for being considered a gene, were manually inspected for their genomic hits. Of the manually inspected clusters, 49.6% contained new sequences: 42.2% were probable low-expression alternative forms of characterized genes and 7.4% were unpredicted genes. RT-PCR followed by sequencing was performed to validate the largest spliced sequence from 8 clusters, confirming five sequences as true human transcript fragments. An in silico analysis indicated that some of them were differentially expressed between tumor and normal tissue. We conclude that, after cleaning up the no-match dataset, about 939 new exons and 165 unpredicted genes remain that could complete the prostate or breast transcriptome.
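The closing estimates can be reproduced arithmetically from figures quoted in the abstract. The sketch below assumes, as an inference from the numbers rather than a method stated in the text, that the fractions observed among the manually inspected clusters were scaled to the full set of selected clusters. All variable and function names are illustrative.

```python
# Figures quoted in the abstract.
selected_clusters = 2226          # clusters kept after the library/gap filter
pct_alternative_forms = 42.2      # % probable low-expression alternative forms
pct_unpredicted_genes = 7.4       # % unpredicted genes


def scale(pct: float, total: int) -> int:
    """Scale a percentage to a whole-number count out of `total`."""
    return round(total * pct / 100)


# The two categories together make up the share with new sequences.
pct_new = round(pct_alternative_forms + pct_unpredicted_genes, 1)

# ASSUMPTION: the inspected fractions are extrapolated to all selected clusters.
est_new_exons = scale(pct_alternative_forms, selected_clusters)
est_unpredicted = scale(pct_unpredicted_genes, selected_clusters)

print(f"new sequences: {pct_new}%")
print(f"estimated new exons: {est_new_exons}")
print(f"estimated unpredicted genes: {est_unpredicted}")
```

Under this assumption the scaled counts come out at 939 and 165, matching the abstract's closing figures, and the two category percentages sum to the reported 49.6%.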
