Comparative transcriptomics reveals similarities and differences between astrocytoma grades

Background Astrocytomas are the most common primary brain tumors distinguished into four histological grades. Molecular analyses of individual astrocytoma grades have revealed detailed insights into genetic, transcriptomic and epigenetic alterations. This provides an excellent basis to identify similarities and differences between astrocytoma grades. Methods We utilized public omics data of all four astrocytoma grades focusing on pilocytic astrocytomas (PA I), diffuse astrocytomas (AS II), anaplastic astrocytomas (AS III) and glioblastomas (GBM IV) to identify similarities and differences using well-established bioinformatics and systems biology approaches. We further validated the expression and localization of Ang2 involved in angiogenesis using immunohistochemistry. Results Our analyses show similarities and differences between astrocytoma grades at the level of individual genes, signaling pathways and regulatory networks. We identified many differentially expressed genes that were either exclusively observed in a specific astrocytoma grade or commonly affected in specific subsets of astrocytoma grades in comparison to normal brain. Further, the number of differentially expressed genes generally increased with the astrocytoma grade with one major exception. The cytokine receptor pathway showed nearly the same number of differentially expressed genes in PA I and GBM IV and was further characterized by a significant overlap of commonly altered genes and an exclusive enrichment of overexpressed cancer genes in GBM IV. Additional analyses revealed a strong exclusive overexpression of CX3CL1 (fractalkine) and its receptor CX3CR1 in PA I possibly contributing to the absence of invasive growth. We further found that PA I was significantly associated with the mesenchymal subtype typically observed for very aggressive GBM IV. Expression of endothelial and mesenchymal markers (ANGPT2, CHI3L1) indicated a stronger contribution of the micro-environment to the manifestation of the mesenchymal subtype than the tumor biology itself. We further inferred a transcriptional regulatory network associated with specific expression differences distinguishing PA I from AS II, AS III and GBM IV. Major central transcriptional regulators were involved in brain development, cell cycle control, proliferation, apoptosis, chromatin remodeling or DNA methylation. Many of these regulators showed directly underlying DNA methylation changes in PA I or gene copy number mutations in AS II, AS III and GBM IV. Conclusions This computational study characterizes similarities and differences between all four astrocytoma grades confirming known and revealing novel insights into astrocytoma biology. Our findings represent a valuable resource for future computational and experimental studies. Electronic supplementary material The online version of this article (doi:10.1186/s12885-015-1939-9) contains supplementary material, which is available to authorized users.


Text S1: Processing of Rembrandt copy number data
We downloaded Affymetrix 100K SNP array data from Rembrandt (release 1.5.9) [1], reconstructed hybridization images and performed stringent quality controls to remove all arrays with hybridization artifacts. We identified 78 tumor samples (16 AS II, 17 AS III, 45 GBM IV) and 8 blood reference samples that passed this quality check (Tab. S1). Next, we analyzed these data using R connected to a local installation of the GenePattern server [2]. The SNPFileCreator module was used to transform each array into a single-nucleotide polymorphism (SNP) file in combination with standard dChip normalization of each array against a global blood reference sample (Rembrandt-IDs: E09419 for Xba240 arrays and E10615 for Hind240 arrays). The obtained SNP files of tumor samples were further converted into log 2ratio copy number profiles under consideration of average blood sample reference measurements. All resulting tumor-specific copy number profiles were further segmented using circular binary segmentation (CBS) [3] to identify individual deletions and amplifications of chromosomal regions. We finally utilized the obtained segmentation profiles to determine corresponding gene copy number measurements for each individual gene in each individual tumor. In some cases, breakpoints occurred within genes and we then utilized the corresponding average log-ratio of involved segments to specify the copy of the affected gene.

Text S2: Lasso-based regulatory network inference
We considered gene-specific sub-network inference problems to learn a transcriptional regulatory network associated with the expression of molecular signature genes distinguishing PA I from AS II, AS III and GBM IV. Therefore, we focused on the expression levels of N = 1, 089 signature genes in our individual parameter a ji ∈ R quantifies the impact of the expression level of transcriptional regulator j on the expression level of signature gene i: (i) a ji < 0 specifies that TF j is a putative inhibitor of gene i, (ii) a ji > 0 defines that TF j is a putative activator of gene i, and (iii) a ji = 0 means that no dependency between j and i exists.
We used lasso (least absolute shrinkage and selection operator) regression [4] to determine a sparse solution for the signature gene-specific linear model in Eq. (1). Lasso minimizes the residual sum of squares of the measured expression e id of gene i and the model-based predicted expression of gene i under consideration of all astrocytomas in dependency of a fixed signature gene-specific complexity parameter λ i . The complexity parameter λ i determines the amount of shrinkage of the individual model parameters a ji toward zero, where larger values of λ i lead to greater shrinkage. This allows to select relevant transcriptional regulators, because irrelevant model parameters will be shrunken to zero. We used the R package glmnet [5] to determine an optimal signature gene-specific complexity parameter and corresponding optimal model parameters. We determined λ i by averaging the optimal complexity parameters (cv.glmnet: lambda.min) obtained from ten independent repeats of a ten-fold cross-validation across all astrocytomas. We then used this gene-specific complexity parameter λ i to compute the corresponding optimal model parameters considering all astrocytomas. We further determined the significance of model parameters when they first enter the lasso model in Eq. (2) using a recently developed significance test for lasso [6]. This provides an efficient way to get p-values instead of using computationally expensive permutation strategies. To realize this, we first computed the lasso solution paths for the active transcriptional regulators (model parameters in a * i that are unequal zero) with respect to all astrocytomas using the R package lars [7]. These results were further evaluated using the R package covTest [8] to obtain p-values that characterize the importance of individual active transcriptional regulators for the signature gene-specific linear model.
We applied this approach to each signature gene to identify corresponding putative transcriptional regulators that best predict the expression levels of the signature gene across all types of astrocytomas. For each signature gene, we only considered potential transcriptional regulators with p-values less than 5 × 10 −5 (standard numerical precision limit of covTest). The obtained regulatory network is provided in Tab. S4. Since we used expression data to learn this model, the expression levels of a selected regulator are either positively (potential activator link from regulator to signature gene) or negatively (potential repressor link from regulator to signature gene) correlated with those of the signature gene.
Thus, this does not necessarily imply that a selected regulator directly regulates the signature gene via physical interactions. A selected regulator may only have an indirect regulatory impact on the signature gene or may only be co-expressed with the signature gene.
We further validated the predictive power of the obtained regulatory network on independent astrocytoma data sets (Text S4, Fig. S7) and we also evaluated the putative proportion of included direct TF-target gene interactions (Text S5, Fig. S8). All these validation studies clearly indicated that the obtained regulatory network included relevant regulator information to predict the expression levels of signature genes based on the expression profiles of TFs.

Immunohistochemistry
Glioma and control specimens were fixed in 10% formaldehyde and embedded in paraffin. Sections (4µm) were deparaffinized and antigen retrieval was performed. A Ventana Benchmark automated staining system was used for immunohistochemistry of Ang2 (anti-human Ang-2 antibody ((MAB 0983), 1:100 from R&D Biosciences, Minneapolis, MN, USA). For antigen retrieval, cell conditioning 1 (prediluted, Ventana Medical Systems) was used. After incubation with the primary and corresponding secondary antibody, the signal was detected with the Chromo DAB Map kit (Ventana Medical Systems) and with the IHC Ultra Map AP (Ventana Medical Systems) kit. Note that the Ang2 protein is encoded by the ANGPT2 gene (alias ANG2).

Endothelial cells are the source of Ang2 protein expression in both PA I and GBM IV
To assess the cellular source of genes significantly contributing to the clustering of PA I within the mesenchymal subgroup of glioblastoma, we exemplarily analyzed normal and neoplastic CNS tissue for Ang2 by immunohistochemistry (Fig. S4). Of note, normal white ( Fig. S4A) and gray (Fig. S4B) matter were completely devoid of Ang2. In contrast, both PA I (Fig. S4C) and GBM IV (Fig. S4F) showed prominent Ang2-positive endothelial cells in case that activated blood vessel were present, a feature that was largely absent in AS II (Fig. S4D) and AS III (Fig. S4E). This finding indicates that the significantly higher Angpt2 mRNA expression found in PA I and GBM IV is strongly related to the presence of vascular proliferation which -per definition -are only seen in those tumor entities while being absent in WHO grade II and III astrocytomas.

Independent glioma gene expression data for regulatory network validation
We downloaded raw PA I expression data published in [9] from Gene Expression Omnibus (GSE5675) and normalized all microarrays using GCRMA [10]. This independent PA I cohort contained expression levels of 16,973 genes for 41 PA I tumor samples. We further downloaded processed gene expression data of brain lower grade gliomas and glioblastoma from The Cancer Genome Atlas (TCGA) data por-

Network validation using independent glioma cohorts
We validated the predictive power of the obtained signature-specific regulatory network on independent data. Therefore, we utilized more than one thousand publicly available gene expression profiles from three independent brain tumor cohorts (41 PA I tumors from [9], 465 low grade gliomas (including AS II and AS III) and 553 GBM IV tumors from TCGA). For each signature gene, we predicted its expression level in each of these public tumor expression profiles using the previously determined signature genespecific linear model with corresponding parameters derived from our astrocytoma training data set. We then computed the correlation between predicted and measured expression levels of each signature gene across all tumor samples for each validation cohort and observe significant shifts of each cohortspecific correlation distribution into the positive range ( Fig. S7b-d, P < 2.2 × 10 −16 for each validation cohort, Wilcoxon rank sum test). Thus, the derived regulatory network is also highly predictive for the vast majority of signature genes in independent brain tumors suggesting that the network includes relevant regulatory links.

Text S5: Comparison of motif search-based and gene expression-based
TF-target links

Motif search-based TF-target identification
We downloaded TF-binding site models from [11] and [12] for TFs included in the molecular signature distinguishing PA I from AS II, AS III and GBM IV. These models enabled us to compare predicted target genes of TFs in the transcriptional regulatory network to corresponding target genes of signature-specific TFs identified using motif search to determine the proportion of putative direct physical interactions in the network. Each TF-binding site model was represented by a specific position weight matrix (PWM) modeling the DNA-binding preferences of the TF. We obtained PWMs of 28 TFs from HOCOMOCO [11] and 37 TF-specific PWMs from JOLMA [12]. We further determined putative promoter sequences (DNA sequences 1,000 bp upstream of annotated transcription start sites) of all 1,089 molecular signature genes using the UCSC Genome Browser (upstream1000.fa.gz: hg19, GRCh37, downloaded April 2014). We next used FIMO (Find Individual Motif Occurrences) [13] to search for binding sites of each signaturespecific TF in the promoter sequences of all signature genes. For each TF, we provided its specific PWM and all promoter sequences of signature genes as input data to FIMO. FIMO reported by default all motif occurrences in promoter sequences with p-values less than 0.0001. This enabled us to determine putative target genes of each specific TF based on its DNA-binding preferences. We compared these independently derived TF target gene interactions with the corresponding target genes of the TF in our signature-specific transcriptional regulatory network. To realize this, we quantified the significance of the size of the overlap between both sets of target genes using Fisher's exact test.

Signature-specific TF-target validation
We further compared predicted target genes of TFs in the regulatory network to target genes predicted using motif search to evaluate the proportion of putative direct physical interactions in the regulatory network. Therefore, we considered binding site models of signature-specific TFs included in HOCO-MOCO [11] and JOLMA [12] and searched for occurrences of putative TF binding sites in promoter sequences 1,000 bp upstream of each signature gene using FIMO [13]. This allowed us to identify potential direct target genes that we compared with the corresponding target genes of the TF in the regulatory network. We then tested for each TF if the intersection of both target gene sets was larger than