Identification of somatic and germline mutations using whole exome sequencing of congenital acute lymphoblastic leukemia

Background: Acute lymphoblastic leukemia (ALL) diagnosed within the first month of life is classified as congenital ALL and has a significantly worse outcome than ALL diagnosed in older children. This suggests that congenital ALL is a biologically different disease and thus may be caused by a distinct set of mutations. To understand the somatic and germline mutations contributing to congenital ALL, the protein-coding regions of the genome were captured and whole-exome sequencing was used to identify single-nucleotide variants and small insertions and deletions in the germlines and primary tumors of four patients with congenital ALL.
Methods: Exome sequencing was performed on Illumina GAIIx or HiSeq 2000 (Illumina, San Diego, California). Reads were aligned to the human reference genome and the Genome Analysis Toolkit was used for variant calling. An in-house Ensembl-based variant annotator was used to annotate each variant.
Results: There were 1–3 somatic, protein-damaging mutations per ALL, including a novel mutation in Sonic Hedgehog. Additionally, there were many germline mutations in genes known to be associated with cancer predisposition, as well as in genes involved in DNA repair.
Conclusion: This study is the first to comprehensively characterize the germline and somatic mutational profile of all protein-coding genes in patients with congenital ALL. These findings identify potentially important therapeutic targets and offer insight into possible cancer predisposition genes.


Background
Acute lymphoblastic leukemia (ALL) is the most common type of cancer diagnosed in children. Congenital ALL is a rare and aggressive subtype of ALL defined as diagnosis within the first month of life. A recent study of 30 patients with congenital ALL treated on the Interfant-99 protocol reported a 2-year event-free survival of 20% despite intensive chemotherapy [1]. This is significantly worse than the 5-year event-free survival of older children with ALL, which approaches 80% [2]. Although the 11q23 rearrangement is the most common cytogenetic abnormality in congenital and infant ALL [3], studies demonstrate that this rearrangement is not sufficient for leukemogenesis [4,5] and does not entirely explain the aggressiveness of ALL in this population of patients [6][7][8].
These data suggest that congenital ALL (cALL) is a biologically different disease, and that its blast cells may therefore carry a distinct set of mutations not found in blasts from older patients. Whole-exome sequencing can be used to characterize the majority of amino acid encoding base positions of the genome. When applied to cancer, this method can identify somatic mutations that may contribute to leukemogenesis, as well as germline mutations that may reveal cancer predisposition genes [9][10][11][12]. In this paper, we report whole-exome sequencing of four paired tumor-normal samples from patients with congenital ALL and characterize their germline and somatic mutations. In addition, the healthy parents of one patient were sequenced to verify any inherited germline mutations. Our results show that there are very few somatic mutations in cALL and that some of them lie in potentially druggable targets that may provide new therapeutic options to improve outcomes.

Methods
The UCLA Institutional Review Board approved this study, which was carried out in compliance with the Helsinki Declaration, and all participants, or parents of participants, provided written informed consent before samples were collected.

Patient characteristics
We collected peripheral blood at diagnosis and bone marrow at remission from four patients with congenital ALL (Table 1).

DNA extraction and sequencing
Tumor genomic DNA was extracted from peripheral blood at diagnosis and normal genomic DNA was extracted from remission bone marrow using the QIAamp DNA Mini Kit (Qiagen, Valencia, California). Genomic DNA was enriched for coding exons using the SureSelect Human All Exon kit for sample 1 and the Human All Exon 50Mb kit for samples 2-4 (Agilent, Santa Clara, California). Sample 1 was sequenced on one full lane of the Illumina Genome Analyzer IIx as 76x76 base paired-end reads and on one full lane of the HiSeq2000 as 50x50 base paired-end reads (Illumina, San Diego, California); the reads from both runs were merged for downstream analysis. Leukemia samples 2 through 4 and the parents of sample 1 were each sequenced on one full lane of the HiSeq2000 as 100x100 base paired-end reads, while the germlines of samples 2-4 were each sequenced on one full lane of the HiSeq2000 as 50x50 base paired-end reads.

Variant calling and filtration
Sequence reads were aligned to the human reference genome build 37, using Novoalign (novocraft.com). Post-processing of reads was performed using Samtools (samtools.sf.net) and Picard (picard.sf.net) for removal of PCR duplicates, merging, and indexing [13].
The Genome Analysis Toolkit (GATK) was used for recalibration of base quality, variant calling, filtration and evaluation [14,15]. Quality scores generated by the sequencer were recalibrated by analyzing the covariation among reported quality score, position within the read, dinucleotide, and probability of a reference mismatch. Local realignment around small insertions and deletions (indels) was performed using GATK's indel realigner to minimize the number of mismatching bases across all reads. Statistically significant non-reference variants, both single nucleotide substitutions (SNSs) and small indels, were identified using the GATK UnifiedGenotyper. The GATK VariantAnnotator annotated each variant with various statistics, including allele balance, depth of coverage, strand balance, and multiple quality metrics. These statistics were then used in an adaptive error model to identify likely false positive SNSs, using the GATK VariantQualityScoreRecalibrator (VQSR). Single nucleotide substitutions with a low VQSR score were filtered out, leaving a set of likely true variants. Hard filtering was applied to indels and only passing indels were used for subsequent analyses.

Germline analysis
Variants were filtered out if they were in non-coding regions, resulted in synonymous amino acid changes, or were predicted to have a benign change in protein function by Polyphen (http://genetics.bwh.harvard.edu/pph) or Sift (http://sift.jcvi.org). Variants were classified as rare if alternate allele frequencies were less than 1%.
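The filtering rules in this paragraph can be sketched as a single predicate. This is only an illustration under stated assumptions: the record fields, the consequence labels, and the prediction strings are hypothetical and do not reflect the schema of the in-house annotator, and the text does not say how variants without a Polyphen or Sift prediction were handled (this sketch drops them).

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record layout; field names and label strings are assumptions.
@dataclass
class Variant:
    gene: str
    consequence: str            # e.g. "missense", "synonymous", "intronic"
    polyphen: Optional[str]     # e.g. "probably_damaging", "benign", or None
    sift: Optional[str]         # e.g. "deleterious", "tolerated", or None
    alt_allele_freq: float      # population alternate allele frequency

# Assumed set of coding, nonsynonymous consequence labels.
NONSYNONYMOUS = {"missense", "nonsense", "frameshift", "inframe_indel"}
RARE_AF_CUTOFF = 0.01  # the text defines "rare" as frequency below 1%

def is_predicted_damaging(v: Variant) -> bool:
    # Assumed reading: keep a variant when either predictor calls it damaging.
    # Variants with no prediction from either tool are dropped (assumption).
    polyphen_damaging = v.polyphen in {"probably_damaging", "possibly_damaging"}
    sift_damaging = v.sift == "deleterious"
    return polyphen_damaging or sift_damaging

def passes_germline_filter(v: Variant) -> bool:
    # Coding and nonsynonymous, predicted damaging, and rare.
    return (v.consequence in NONSYNONYMOUS
            and is_predicted_damaging(v)
            and v.alt_allele_freq < RARE_AF_CUTOFF)
```

A list of annotated records could then be reduced with `[v for v in variants if passes_germline_filter(v)]`.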
Nonsynonymous, protein-damaging, and rare germline variants were intersected with known germline mutations that predispose to cancer syndromes, found in COSMIC [16]. Germline variants were also intersected with known DNA repair genes [17]. Germline variants in sample 1 were cross-checked with the parents' sequence data to distinguish inherited from de novo mutations. All germline and somatic variants remaining at the last step of filtering were manually visualized using the Integrative Genomics Viewer [18].
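The gene-list intersection and the parental cross-check can be illustrated with plain set operations. This is a minimal sketch, assuming each variant is identified by a hypothetical `(chrom, pos, ref, alt)` tuple and that a parent's call set is available as a set of such tuples; the function names are illustrative and not taken from the authors' workflow.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical variant key: (chromosome, position, reference allele, alternate allele).
VariantKey = Tuple[str, int, str, str]

def intersect_with_gene_list(
    variants: Iterable[Tuple[VariantKey, str]], gene_list: Set[str]
) -> List[Tuple[VariantKey, str]]:
    """Keep (variant key, gene) pairs whose gene appears in the curated list."""
    return [(key, gene) for key, gene in variants if gene in gene_list]

def classify_origin(
    key: VariantKey, mother_variants: Set[VariantKey], father_variants: Set[VariantKey]
) -> str:
    """Label a child's germline variant by its presence in each parent's call set."""
    in_mother = key in mother_variants
    in_father = key in father_variants
    if in_mother and in_father:
        return "inherited (both parents)"
    if in_mother:
        return "inherited (maternal)"
    if in_father:
        return "inherited (paternal)"
    # Absent from both parental call sets; still needs manual review,
    # since a parental allele can be missed at low coverage.
    return "de novo candidate"
```

Comparing called variants only is a simplification; checking raw parental reads at the site, as done for the somatic variants in this study, is a stricter test.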

Somatic analysis
Mutations were classified as somatic if they were rare and found in the tumor sample only with no evidence in the germline data. Fisher's exact test was performed on the reference and non-reference reads and a p-value < 1 × 10^-6 was used as the cut-off for significance. Somatic mutations found in sample 1 were cross-checked with the parents' sequence data to ensure they were indeed somatic and not alleles missed in the germline. Three somatic variants were excluded because they were present as non-reference reads in one or both parents.
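The read-count test can be sketched with a small pure-Python implementation of Fisher's exact test. Several points are assumptions rather than details given in the text: the 2x2 table is taken to be tumor versus germline counts of reference and non-reference reads, the test is taken to be two-sided, and "no evidence in the germline" is read as zero non-reference reads in the normal sample.

```python
from math import comb

P_CUTOFF = 1e-6  # significance cut-off stated in the text

def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of a table whose top-left cell is x.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)

def is_somatic_candidate(tumor_ref: int, tumor_alt: int,
                         normal_ref: int, normal_alt: int) -> bool:
    # Assumed reading of "no evidence in the germline": no alternate reads at all.
    if normal_alt > 0:
        return False
    p = fisher_exact_two_sided(tumor_ref, tumor_alt, normal_ref, normal_alt)
    return p < P_CUTOFF
```

With toy counts, a heterozygous site at roughly 120x in both samples (`is_somatic_candidate(60, 60, 120, 0)`) passes the cut-off, while a site with only two alternate reads out of twenty in the tumor does not.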

Polymerase chain reaction and capillary sequencing
The SHH mutation in Sample 1, FLT3 mutation in Sample 3, and DMBT-1 mutation in Sample 4 were validated using PCR and capillary sequencing.

Results

Alignment and coverage statistics
The total number of reads per sample ranged from about 185,000,000 to 304,000,000 (Table 2). Between 68% and 99% of reads aligned to the human reference genome, and 87-94% of targeted bases were covered at least 20 times. The overall average coverage ranged from 107x to 210x. Each sample had 19,210-23,859 total single nucleotide substitutions. More than 93% of these were single nucleotide polymorphisms found in dbSNP132, and the alternate allele was 99.8% concordant with the one recorded in dbSNP132. There were 791-1,462 novel single nucleotide variants per sample after removing polymorphisms found in dbSNP132 (Figure 1). Each sample had 1,222-1,716 total small indels. After removing polymorphisms found in dbSNP132, each sample had 688-943 novel indels (Figure 2). Variants were further prioritized if they were nonsynonymous, predicted to be damaging by either Sift or Polyphen, and rare in the general control population (Figure 3).
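The summary statistics in this paragraph reduce to simple tallies. The sketch below is an illustration only, assuming that depth is recorded once per targeted base position and that variants and dbSNP entries share a hypothetical `(chrom, pos, ref, alt)` key; it is not the evaluation code used in the study.

```python
from typing import Iterable, List, Sequence, Set, Tuple

# Hypothetical variant key: (chromosome, position, reference allele, alternate allele).
VariantKey = Tuple[str, int, str, str]

def coverage_summary(depths: Sequence[int], min_depth: int = 20) -> Tuple[float, float]:
    """Return (mean depth, fraction of positions covered at least min_depth times)."""
    if not depths:
        raise ValueError("no positions supplied")
    n = len(depths)
    mean_depth = sum(depths) / n
    fraction_covered = sum(1 for d in depths if d >= min_depth) / n
    return mean_depth, fraction_covered

def split_known_novel(
    variant_keys: Iterable[VariantKey], dbsnp_keys: Set[VariantKey]
) -> Tuple[List[VariantKey], List[VariantKey]]:
    """Partition called variants into those present in dbSNP and those absent."""
    known: List[VariantKey] = []
    novel: List[VariantKey] = []
    for key in variant_keys:
        (known if key in dbsnp_keys else novel).append(key)
    return known, novel
```

Matching on the full key, including the alternate allele, treats a known position with a different alternate allele as novel; matching on position alone would be a looser definition.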

Germline mutations
There were 2-6 germline mutations in each sample in genes on the COSMIC list of genes previously associated with cancer predisposition [16]. Additionally, there were 7-13 germline mutations in each sample in known DNA repair genes. When the congenital ALL samples were compared with 28 control exomes from children without cancer, sequenced in the same laboratory and analyzed with the same workflow, there were no statistically significant differences in the mean number of mutations that overlapped with COSMIC germline genes or DNA repair genes (Table 3). Because of the small number of patients in each group, it was not possible to directly compare specific germline mutations between the congenital ALL group and the control group.

Somatic mutations
There were 1-3 nonsynonymous, protein-damaging, rare variants found in each tumor sample with no evidence in the corresponding germline data using Fisher's exact p-value < 1 × 10^-6 on the reference and non-reference reads (Table 4). All these mutations were heterozygous. The two somatic mutations in sample 1 were homozygous reference in both parents.

Discussion
Although there has been significant progress in overall survival for children with ALL, newborns with congenital ALL continue to have poor prognoses despite intensive therapy. There is a need to identify new therapeutic targets in congenital ALL to rationally design treatment regimens that will produce sustained remissions with less toxicity. Additionally, understanding the molecular basis for congenital ALL may lead to novel insights into leukemogenesis and new cancer predisposition syndromes. This study is the first to comprehensively characterize the somatic and germline mutational profile of all protein-coding genes in four tumor-normal paired samples from patients born with congenital ALL.
Sample 1 had a somatic mutation in SHH, which has not previously been reported in ALL. The Hedgehog pathway is known to have a role in normal B-lymphocyte development and use of Hedgehog pathway inhibitors leads to decreased self-renewal potential [19]. The G143S mutation found in Sample 1 lies in a critical signaling region of the SHH protein that interacts with the SHH receptor, Patched (PTCH). Association of SHH with PTCH releases the inhibitory effect of PTCH on Smoothened (SMO), which allows for the propagation of SHH signals to activate transcription factors including GLI-1, 2, and 3 [20]. It is possible that this mutation has an activating effect on SHH that leads to dysregulation of downstream target genes.
Two of the four samples had somatic mutations in FLT3. Point mutations and internal tandem duplications in FLT3 are known to be driver mutations in acute myelogenous leukemia (AML) but are also enriched in infant ALL [21]. Multiple oral FLT3 inhibitors have been tested in Phase 1 and 2 trials as single agents, as well as in combination with other chemotherapy agents, for treatment of AML [22][23][24][25] with promising results. This study found single nucleotide substitutions in FLT3 in two of four congenital ALL samples, which suggests that they are recurrent in this disease and that infants with ALL might benefit from treatment with FLT3 inhibitors.

Conclusion
This is the first study to perform exome sequencing on paired tumor and normal samples from patients with congenital ALL. Three of the four tumor samples had somatic mutations in genes that are druggable targets. Germline analyses did not reveal any clear set of cancer predisposition genes, but a larger number of samples will need to be sequenced to delineate the role of DNA repair genes and known germline cancer predisposition genes, as well as to identify novel cancer predisposition genes.
As the cost of next-generation sequencing continues to decrease, patients and physicians will routinely encounter opportunities to supplement traditional morphology, flow cytometry, and cytogenetics tests with base-pair resolution of all variants in the exome as well as the whole genome. High-throughput functional assays that validate the effect of candidate driver mutations will be needed to take full advantage of this level of mutational profiling. Additionally, inherited or de novo mutations in patients' germlines will continue to expand the set of known cancer predisposition syndromes and may eventually lead to approaches for earlier cancer detection and even cancer prevention.