DDEC: Dragon database of genes implicated in esophageal cancer
© Essack et al. 2009
Received: 12 December 2008
Accepted: 06 July 2009
Published: 06 July 2009
Esophageal cancer ranks eighth in order of cancer occurrence. Its lethality stems primarily from the inability to detect the disease during the early organ-confined stage and the lack of effective therapies for advanced-stage disease. Moreover, the understanding of molecular processes involved in esophageal cancer is incomplete, hampering the development of efficient diagnostics and therapy. Efforts made by the scientific community to improve the survival rate of esophageal cancer have resulted in a wealth of scattered information that is difficult to find and not easily amenable to data mining. To reduce this gap and to complement available cancer-related bioinformatic resources, we have developed a comprehensive database, the Dragon Database of Genes Implicated in Esophageal Cancer (DDEC), as an integrated knowledge database intended to serve as a gateway to esophageal cancer related data.
The database contains 529 manually curated genes that are differentially expressed in EC. We extracted and analyzed the promoter regions of these genes and complemented gene-related information with transcription factors that potentially control them. We further precompiled text-mined and data-mined reports about each of these genes to allow for easy exploration of information about associations of EC-implicated genes with other human genes and proteins, metabolites and enzymes, toxins, chemicals with pharmacological effects, disease concepts and human anatomy. The resulting database, DDEC, can display potential associations that are rarely reported and thus difficult to identify. Moreover, DDEC enables inspection of potentially new 'association hypotheses' generated from the precompiled reports.
We hope that this resource will serve as a useful complement to the existing public resources and as a good starting point for researchers and physicians interested in EC genetics. DDEC is freely accessible to academic and non-profit users at http://apps.sanbi.ac.za/ddec/. DDEC will be updated twice a year.
The major histological form of esophageal cancer (EC), esophageal squamous cell carcinoma (ESCC), comprises 90% of ECs worldwide [1, 2]. The poor prognosis of EC results in a five-year survival rate of 5–20%. The lethality of EC stems from our inability to detect the disease during the early stage, combined with the lack of effective therapies for advanced-stage disease. Like most diseases, EC arises as a consequence of errors occurring in the cellular regulatory system or errors being introduced into the genome as mutations, causing cellular behavior to deviate from the norm. Identifying the mechanisms by which the genomic information is controlled in EC will provide further insights into the partially understood cellular and molecular functioning that characterizes this disease.
Gene expression in EC is a multifunctional process influenced by chromatin remodeling and the interplay between transcription regulatory proteins and DNA sequences known as transcription factor binding sites (TFBSs) [5, 6]. This combination of transcription regulatory proteins, TFBSs, and affected transcripts defines the transcription regulatory networks (TRNs) that are responsible for the regulation of every transcript encoded in the genome. Knowledge of these transcripts and of the control mechanisms of their initiation sets the stage for inferring TRNs and may help in the search for therapeutic mechanisms that could correct or compensate for the errors underlying pathological states of EC.
Efforts made by the scientific community to improve the survival rate associated with EC have resulted in a wealth of scattered research data. Researchers need to sift through these data to identify relevant findings, and compiling the relevant information is tedious and time consuming, which slows the research process. In an attempt to enhance research endeavors related to EC, we have developed the Dragon Database of Genes Implicated in Esophageal Cancer (DDEC) as an integrated knowledge database that contains information about various genes differentially expressed in EC. It should be noted that there are two initiatives aimed at coordinating activities in producing resources related to cancer research: the International Cancer Genome Consortium (ICGC, http://www.icgc.org/) and caBIG (cancer Biomedical Informatics Grid™, http://cabig.cancer.gov/). These two initiatives intend to promote specific data formats and other conditions that will enable easier integration of cancer-related resources. There are also cancer-related databases that include information on EC, such as the Cancer Gene Expression Database (CGED), PDQ and Oncomine. CGED houses a collection of gene expression and clinical data from a large number of patients with major cancers, including EC. CGED expression data were obtained by adaptor-tagged competitive PCR (ATAC-PCR), and the database allows researchers to explore the correlation between gene expression and clinical data for future diagnostic application. PDQ is the National Cancer Institute's (NCI's) cancer database, which includes peer-reviewed summaries on cancer treatment, screening, prevention, genetics, and complementary and alternative medicine. The Oncomine initiative collects and analyzes all published cancer microarray data and currently houses EC-related microarray data.
However, none of the current public databases focuses on genes implicated in EC and their potential associations with other relevant biological, biochemical and medical entities. Moreover, DDEC provides a combination of features for exploring information related to EC-implicated genes that cannot be found elsewhere: filtering for putative transcription factors shared amongst promoters of EC-implicated genes, inference of association networks, and precompiled reports that provide insights into other human genes and proteins, metabolites and enzymes, toxins, chemicals with pharmacological effects, disease concepts and human anatomy associated with differentially expressed EC-implicated genes. It also enables finding rare information that is likely to be missed in a common literature search. As a special feature, DDEC provides a module for generating 'association hypotheses' between concepts related to EC-implicated genes. Batch queries and a database dump are also provided. We thus believe that DDEC represents a useful complement to the existing databases and will contribute to more efficient EC-related research. DDEC is freely accessible to academic and non-profit users at http://apps.sanbi.ac.za/ddec/. The semi-automated methodology used to populate DDEC with genes and related data will be used to update the database twice a year.
Information in the DDEC is structured into four distinct parts:
A platform for searching the integrated gene information through standardized vocabularies.
Selection of genes of interest from the gene list. This search provides users with gene details such as general information, the gene in other resources, experimental evidence, related proteins, associated pathways, associated diseases, orthologous genes, regulation, and text-mined reports that can support building interactive association networks.
Transcription regulation information, which includes all putative TFBSs for the EC-implicated genes in DDEC. This part is useful for gene regulation studies: TFBSs of interest can be selected, and the results list each selected TFBS together with the gene promoters that contain it. Genes sharing all the selected TFBSs are listed as well.
A batch query and data download interface, provided to increase utility for users.
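The shared-TFBS filter described in the third part amounts to a set-containment test. The following is a minimal Python sketch of that idea, assuming a hypothetical in-memory mapping from gene symbols to the putative TFBSs on their promoters. The gene names, TFBS names and the function `genes_sharing_all` are illustrative and are not taken from DDEC.

```python
from typing import Dict, Iterable, List, Set


def genes_sharing_all(promoter_tfbs: Dict[str, Set[str]],
                      selected: Iterable[str]) -> List[str]:
    """Return (sorted) the genes whose promoters carry every selected TFBS."""
    wanted = set(selected)
    if not wanted:
        return []  # nothing selected, so nothing to report
    return sorted(gene for gene, sites in promoter_tfbs.items()
                  if wanted <= sites)  # subset test: all selected sites present


# Hypothetical toy data, for illustration only.
toy = {
    "GENE_A": {"TF1", "TF2", "TF3"},
    "GENE_B": {"TF1", "TF3"},
    "GENE_C": {"TF2"},
}
shared = genes_sharing_all(toy, ["TF1", "TF3"])
print(shared)
```

Selecting more TFBSs can only shrink the resulting gene list, which is why such a filter is useful for narrowing down candidate co-regulated genes.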
DDEC contains information on EC-implicated genes compiled from scientific publications in PubMed. The PubMed database was queried with the keyword expression "esophageal (cancer OR cancers OR tumor OR tumors OR carcino* OR adenocarc* OR malign* OR neoplasm*)" on 31/01/2008, and 35,892 PubMed abstracts were retrieved. The search for relevant publications was further refined using the licensed Dragon Exploration System (DES) from OrionCell (http://www.orioncell.org), which has an integrated Biomedical Text-Miner tool. DES retrieved a list of 1677 putative genes associated with EC from the extracted abstracts. Biologists then evaluated information about the experimental conditions these genes had been subjected to, using full-text articles whenever possible and abstracts in other cases. When the available information was insufficient to deduce the correct experimental conditions, the gene was discarded. Because experimental conditions influence gene expression, DDEC provides details of the cell line, tissue or cell type, expression status, disease stage, tumor grade, esophageal cancer type and laboratory method reported in the literature.
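For readers who want to rerun a comparable literature search, the sketch below builds (but does not send) a request URL for the NCBI E-utilities ESearch endpoint using the keyword expression quoted above. The text does not say which interface was used for the original query, so the use of E-utilities, the publication-date restriction and the helper name `build_esearch_url` are assumptions made for illustration. Counts returned today will differ from the 35,892 abstracts reported.

```python
from urllib.parse import parse_qs, urlencode, urlparse

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Keyword expression quoted in the text.
QUERY = ("esophageal (cancer OR cancers OR tumor OR tumors OR carcino* "
         "OR adenocarc* OR malign* OR neoplasm*)")


def build_esearch_url(term, max_date=None, retmax=0):
    """Build an ESearch URL for PubMed; the request itself is not sent."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    if max_date is not None:
        # Assumption: approximate the search date by restricting on
        # publication date. ESearch needs mindate and maxdate together.
        params.update({"datetype": "pdat",
                       "mindate": "1800/01/01",
                       "maxdate": max_date})
    return ESEARCH + "?" + urlencode(params)


url = build_esearch_url(QUERY, max_date="2008/01/31")
print(url)
```

With `retmax=0` the response carries only the hit count, which is enough to check how the size of the corpus has changed since the original query.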
A final list of 529 genes was identified in this way and used to populate the database. The general information about the genes, which includes HGNC ID, approved symbol, approved name, Entrez ID, previous symbol, previous name, aliases, OMIM-related information, and chromosome location, was extracted from sources such as HUGO http://www.genenames.org/ and GeneCards http://www.genecards.org/index.shtml. The database includes gene-related identifiers from EMBL http://www.ebi.ac.uk/embl/, Ensembl http://www.ensembl.org/index.html, RefSeq, GenBank http://www.ncbi.nlm.nih.gov/, UniGene http://www.ncbi.nlm.nih.gov/sites/entrez?db=unigene&orig_db, UniProt http://www.ebi.ac.uk/uniprot/, Swiss-Prot http://www.expasy.ch/sprot/ and PDB http://www.rcsb.org/pdb/home/home.do. ID conversion tools such as IDconverter http://idconverter.bioinfo.cnio.es/ and Onto-Tools http://vortex.cs.wayne.edu/ontoexpress/servlet/UserInfo were used to convert between different types of identifiers. A summary of the statistics of the above-mentioned features is given in the documentation. We have provided links to the relevant sources of data, such as Gene Ontology http://www.geneontology.org/, eVOC http://www.evocontology.org/, and Reactome pathway data http://www.reactome.org/.
As a useful feature, we generated lists of putative TFBSs that map to the promoter regions of EC-implicated genes, allowing users to identify genes that share common TFBSs. For this purpose, promoter sequences were extracted mainly from FANTOM3 CAGE tag data, as well as with TOUCAN v. 3.0.2. To map TFBSs to promoters we used the TRANSFAC Professional database v.11.4. All TRANSFAC mammalian matrix models of binding sites were mapped using the Match™ program with minFP profiles for optimized thresholds of the matrix models. The complete list of 529 genes was used to extract promoter sequences for the identification of putative TFBSs. Promoter sequences of 409 genes (1200 bp upstream and 200 bp downstream of the transcription start site, TSS) were extracted from the FANTOM3 CAGE tag data; they correspond to 1582 TSSs, each of which has at least five tags in the tag cluster and a minimum of three tags in the representative tag. An additional 108 promoter sequences (1200 bp upstream and 200 bp downstream of the TSS) were extracted using TOUCAN v. 3.0.2.
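The promoter definition and the CAGE tag thresholds above can be expressed as a short, strand-aware coordinate calculation. The Python sketch below is illustrative only. The coordinate convention (1-based, inclusive, with the TSS base counted as the first downstream base) and the function names are assumptions, since the text does not specify them.

```python
from typing import Tuple

UPSTREAM = 1200    # bp upstream of the TSS, as stated in the text
DOWNSTREAM = 200   # bp downstream of the TSS, as stated in the text


def promoter_window(tss: int, strand: str,
                    upstream: int = UPSTREAM,
                    downstream: int = DOWNSTREAM) -> Tuple[int, int]:
    """Return (start, end) on the forward axis, 1-based and inclusive.

    Assumed convention: the TSS base is the first downstream base, so an
    unclamped window spans upstream + downstream bases.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream - 1
    elif strand == "-":
        # On the reverse strand, "upstream" means higher coordinates.
        start, end = tss - downstream + 1, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(1, start), end  # clamp at the start of the chromosome


def passes_tag_filter(cluster_tags: int, representative_tags: int) -> bool:
    """Thresholds from the text: >= 5 tags in the cluster, >= 3 in the representative tag."""
    return cluster_tags >= 5 and representative_tags >= 3


print(promoter_window(5000, "+"), promoter_window(5000, "-"))
```

Handling the strand explicitly matters because a window computed as if every gene were on the forward strand would capture the gene body, not the promoter, for reverse-strand genes.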
As an additional feature, for each of the 529 EC-implicated genes, we extracted all related PubMed documents and analyzed them using DES. DES uses a dictionary-based text-mining approach: it maps the entities from its dictionaries to the submitted PubMed documents and uses the matches to build the precompiled reports. We applied six manually curated DES dictionaries, namely human genes and proteins, metabolites and enzymes, toxins, chemicals with pharmacological effects, disease concepts and human anatomy. These dictionaries were compiled from literature and public databases. The accuracy of data integrated in this way was evaluated by Sagar et al. in terms of precision, recall and F-measure; that analysis showed precision and recall ranging from 81% to 100%, with an average F-measure of 92.9%, for the SCN1A gene. The precompiled reports from this study are incorporated in DDEC and allow the user to inspect possible interactions associated with the genes of interest and the associated networks of relevant biomedical entities. An additional feature in DES allows hypotheses to be generated between two dictionary entries that are linked to a common dictionary entry. This tool allows the user to test a generated hypothesis by retrieving PubMed documents related to the two dictionary terms linked through the hypothesis; if no PubMed documents are retrieved, the hypothesis may warrant further exploration. The functioning of the text-mining modules of DDEC is based on concepts similar to those used in Pan et al. and Bajic et al. DES has also been employed in the creation of a module for the ovarian cancer database, DDOC.
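The idea behind the 'association hypotheses' can be illustrated with a small sketch: two dictionary terms that never appear in the same document, but that each co-occur with a common third term, form a candidate hypothesis. The Python code below is a simplified illustration under stated assumptions and is not the DES implementation. The whole-word matching rule, the toy abstracts, the dictionary entries and the function names are all hypothetical.

```python
import re
from collections import defaultdict
from itertools import combinations


def tag_document(text, dictionary):
    """Return the dictionary terms found in text (case-insensitive, whole word)."""
    return {term for term in dictionary
            if re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE)}


def association_hypotheses(docs, dictionary):
    """Map each never co-reported pair (A, C) to the terms B linking them."""
    neighbours = defaultdict(set)
    for text in docs:
        for a, b in combinations(sorted(tag_document(text, dictionary)), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    hypotheses = {}
    for a, c in combinations(sorted(neighbours), 2):
        if c in neighbours[a]:
            continue  # already reported together, so not a new hypothesis
        linkers = neighbours[a] & neighbours[c]
        if linkers:
            hypotheses[(a, c)] = sorted(linkers)
    return hypotheses


# Hypothetical toy abstracts and dictionary entries, for illustration only.
docs = [
    "GENE_X expression is altered by compound_1.",
    "compound_1 is associated with disease_Y.",
]
dictionary = ["GENE_X", "compound_1", "disease_Y"]
result = association_hypotheses(docs, dictionary)
print(result)
```

In this toy corpus, GENE_X and disease_Y are never mentioned together but both co-occur with compound_1, so the pair is reported as a hypothesis linked through compound_1.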
Batch queries and data download are provided to increase utility for users. Further, a database dump has been provided to support integration with other database resources.
The process of biocurated data collection and integration outlined above will be repeated twice yearly as an update process. Each update will extract the abstracts added between the date of the previous update and the current date. This semi-automated process is more time consuming than current automated update systems but has the advantage of reducing redundant information.
DDEC provides a comprehensive compilation of information obtained from published EC research, complemented with the information from public databases and information derived from computational analysis. The information captured in DDEC is centered on genes differentially expressed in EC. The information used for selection of genes to be included in DDEC was curated by biologists. Only genes that satisfy all conditions listed below are included in DDEC:
Genes that are differentially expressed in human EC with experimental proof.
Differential expression of EC-implicated genes has not been influenced by anti-cancer therapy.
Differentially expressed EC-implicated genes have not been artificially constructed.
Microarray data has been excluded at this stage as the results obtained using high throughput technologies are debatable in terms of deciding about a meaningful level of gene expression and statistical methods used for analysis and interpretation of data [34, 35]. However, as a future prospect we will expand the database by adding a subset for raw expression data and analysis of the EC-related microarray data.
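The inclusion rules above act as a conjunction: a gene report is kept only if every condition holds. The sketch below shows one way to encode them in Python. The record layout, field names and example entries are hypothetical and are not the DDEC schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ExpressionReport:
    gene: str
    differentially_expressed: bool  # in human EC, with experimental proof
    therapy_influenced: bool        # expression affected by anti-cancer therapy
    artificial_construct: bool      # gene was artificially constructed
    method: str                     # laboratory method reported in the paper


def is_eligible(report: ExpressionReport) -> bool:
    """True only if all inclusion conditions hold and the method is not microarray."""
    return (report.differentially_expressed
            and not report.therapy_influenced
            and not report.artificial_construct
            and report.method.strip().lower() != "microarray")


# Hypothetical curated records, for illustration only.
reports: List[ExpressionReport] = [
    ExpressionReport("GENE_A", True, False, False, "RT-PCR"),
    ExpressionReport("GENE_B", True, False, False, "Microarray"),
    ExpressionReport("GENE_C", True, True, False, "Western blot"),
]
kept = [r.gene for r in reports if is_eligible(r)]
print(kept)
```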
DDEC contains precompiled text-mined and data-mined reports that allow for easy exploration of information about associations of EC-implicated genes with other genes and proteins, metabolites and enzymes, toxins, chemicals with pharmacological effects, disease concepts, human anatomy, pathways and pathway reactions. Moreover, DDEC provides potentially new 'association hypotheses' generated from the precompiled reports. It also reports the frequency of each association, which lets users spot rare associations with the genes of interest that would usually be overlooked in a normal literature search, given the huge volume of data available. DDEC can be used to answer questions such as:
Is my gene of interest differentially expressed in EC, i.e. is it an EC-implicated gene as defined here?
Which putative transcription factors regulate the expression of an EC-implicated gene or sets of these genes?
Which of the other EC-implicated genes in DDEC are regulated by the same transcription factor (or factors) as the gene of interest?
My gene of interest has putative associations with other biomedical concepts. What are these concepts and what are the documents from which such associations are deduced so that I can explore them?
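The frequency-based view of associations mentioned above can be sketched as a simple document count per associated concept, sorted so that the rarest associations come first. The Python code below is a minimal illustration. It assumes that each document has already been reduced to the set of dictionary terms it mentions, and the term names and the function `rank_associations` are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def rank_associations(doc_terms: Iterable[Set[str]],
                      gene: str) -> List[Tuple[str, int]]:
    """Count documents in which each term co-occurs with gene; rarest first."""
    counts: Counter = Counter()
    for terms in doc_terms:
        if gene in terms:
            counts.update(t for t in terms if t != gene)
    # Sort by frequency, then alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))


# Hypothetical tagged documents, for illustration only.
tagged = [
    {"GENE_X", "term_a"},
    {"GENE_X", "term_a"},
    {"GENE_X", "term_b"},
    {"term_a", "term_c"},
]
ranking = rank_associations(tagged, "GENE_X")
print(ranking)
```

Because each hit can be traced back to the documents that produced it, a rarely reported association at the top of such a ranking can be checked directly against its source abstracts.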
The potential uses and advantages of the database are described in the documentation section http://apps.sanbi.ac.za/ddec/ddec.pdf. An example of data analysis has been included in the documentation and should help users to understand and utilize different functions implemented in this database to maximize information exploration and extraction.
Table: A comparison of the DDEC and DDOC gene lists.
Columns: Gene Ontology terms representing functionally distinct groups | Genes unique to esophageal cancer (EC) | Genes unique to ovarian cancer (OC) | Genes common to EC and OC
Rows (Gene Ontology terms): Neuron differentiation and development; Sex differentiation and development; Regulation of apoptosis; Regulation of cell cycle
We further identified which KEGG pathways (see additional file 1) are enriched for the genes unique to EC, the genes unique to OC and the genes common to EC and OC. We found the MAPK signaling pathway, ErbB signaling pathway and p53 signaling pathway to be the most pronounced pathways for genes common to EC and OC. The pathways most pronounced for the genes unique to EC were the MAPK signaling pathway and the Wnt signaling pathway, with androgen and estrogen metabolism being unique to this group. The MAPK signaling pathway, ErbB signaling pathway and TGF-beta signaling pathway were most pronounced for the genes unique to OC.
The above analysis suggests that distinct categories of genes participating in specific pathways are involved in the pathogenesis of different types of cancers. These cancer-specific categories of genes can be investigated as potential biomarkers for the prognosis and diagnosis of the disease.
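The comparison underlying these results starts from splitting the two gene lists into three groups, after which each group can be tested for pathway over-representation. The Python sketch below shows the split and a standard hypergeometric over-representation test. The text does not state which enrichment statistic was used, so the test shown is an assumption made for illustration, and the gene symbols and numbers are hypothetical.

```python
from math import comb
from typing import Dict, Iterable, Set


def partition(ec: Iterable[str], oc: Iterable[str]) -> Dict[str, Set[str]]:
    """Split two gene lists into EC-only, OC-only and common groups."""
    ec_set, oc_set = set(ec), set(oc)
    return {"unique_ec": ec_set - oc_set,
            "unique_oc": oc_set - ec_set,
            "common": ec_set & oc_set}


def hypergeom_pvalue(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k): n genes drawn from N background genes, K of them in the pathway."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


# Hypothetical toy lists, for illustration only.
groups = partition(["G1", "G2", "G3"], ["G3", "G4"])
print(groups)

# Toy numbers: 2 of 2 selected genes fall in a 3-gene pathway, background of 10.
p = hypergeom_pvalue(k=2, n=2, K=3, N=10)
print(round(p, 4))
```

A small p-value indicates that the group contains more pathway members than expected by chance. When many pathways are tested, a multiple-testing correction would normally be applied on top of this.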
In future, we intend to incorporate the effect of current therapeutic drugs. Additional features that may enhance search and retrieval of DDEC information will be added in due course, as well as incorporation of DDEC into ICGC, caBIG and LinkOut. DDEC will further be updated twice a year and will continue to grow in both content and functionality.
DDEC is an integrated knowledge database intended to serve as a gateway to EC-related data. DDEC houses information associated with 529 hand-curated human genes implicated in EC and allows users to easily access the wealth of EC-related data that is typically difficult to find and not easily amenable to data mining. Users are also provided with the DES interface, which allows easy exploration of information, viewing of potential associations that are rarely reported and thus difficult to identify, and inspection of potentially new 'association hypotheses' generated from the precompiled reports. We hope that this resource will serve as a useful complement to the existing public resources and as a good starting point for researchers and physicians interested in EC genetics.
DDEC is freely accessible to academic and non-profit users at http://apps.sanbi.ac.za/ddec/.
ME was partly supported with a Scarce Skills Scholarship from the National Research Fund, South Africa; ME and VBB were partly supported by the National Research Foundation grant (61070); SRS, AC and VBB were supported partly by the DST/NRF Research Chair grant (64751); VBB was partly supported by the National Research Foundation grant (62302). AR, US, SS and VBB were partly supported by the National Bioinformatics Network grants; MK has been supported by the postdoctoral fellowship from the Claude Leon Foundation, South Africa.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.