The landscape of human SNPs
To get the full set of human SNP data, we retrieved the recent version of human SNPs from NCBI dbSNP (http://www.ncbi.nlm.nih.gov/SNP/). Among the “common” and “all” SNPs that are labeled distinctly in the website, we first chose the common SNPs for analysis due to its higher occurrence and reliability, since the total number of all SNPs already came up to one tenth of the human genome. In total, 34,082,224 common SNP sites were downloaded. We began to filter these common SNPs (Fig. 1a). We sought for the ancestral variants in human (Homo sapiens) genome according to the orthologous sites in two other mammalian species, rhesus macaque (monkey, Macaca mulatta) and mouse (Mus musculus). Only those SNPs with genomic sequences identical to at least one species in monkey or mouse were considered as ancestral SNPs (Fig. 1a). Next, part of the SNPs has more than one mutation types and might cause conflict in functional annotation. Thus, these “multi-mutation” SNPs were discarded and only those “uni-mutation” SNPs were retained (Fig. 1b). After filtration, 21,221,571 SNPs were remaining, among which the C > T and G > A transitions are the most prevalent while transversions are less frequent (Fig. 1c). We annotated the SNPs with SnpEff [33] and the canonical transcript of each gene were chosen. Most of the SNPs were located in intergenic or intronic region, and the exonic SNPs were dispersed in CDSs, UTRs or noncoding RNAs (Fig. 1d). There were totally 17,940 genes that has at least one SNP in CDS (Additional file 2: Table S2). Note that we have excluded those sites in splicing region when defining nonsynonymous or synonymous mutations in CDSs. Therefore, if any selection is detected for the nonsynonymous or synonymous sites, it might not be imposed by the selection on splicing events.
Profiling the SNPs in cancer-related genes and other genes
We downloaded the list of human cancer-related genes (719 genes, Additional file 1: Table S1) from the latest version of cancer gene census website (CGC, https://cancer.sanger.ac.uk/census/), and 649 of them belong to the 17,940 genes and have SNPs in CDSs. In the following analyses, the genes with SNPs in coding regions were divided into cancer-related genes and other genes. First we calculated the SNP density (number of SNPs per Kb) in exonic (Additional file 3: Figure S1A) or coding (Additional file 3: Figure S1B) region. The result is that the SNPs in cancer-related genes are strongly avoided especially those in CDSs (Additional file 3: Figure S1B). Furthermore, we displayed the global profile of number of SNPs in each functional category, in cancer-related genes and other genes, respectively (Additional file 3: Figure S1C). It is reasonable that many SNPs in other genes belong to the “other” category, the majority of which are noncoding RNAs. Since the identified cancer-related genes are mainly protein-coding genes, the noncoding RNAs are consequently grouped to the other genes.
Constraint on mutations belonging to different evolutionary categories
We have profiled the SNPs in different functional categories and found that the mutations in CDS of cancer-related genes are constrained. This might not be new to us since we did not separate nonsynonymous and synonymous sites. Before we go deep into the selection on nonsynonymous and synonymous SNPs, we tried to reveal the selection in CDS of cancer-related genes from another aspect. We defined “mutation to monkey” as the human SNPs with reference allele identical to the mouse genome and the alternative allele identical to the monkey genome (Fig. 2a). We found that the cancer-related genes have higher fraction of this kind of SNPs (Fig. 2). On the contrary, for the “mutation to mouse” as we definebd (Fig. 2c), they are significantly avoided in cancer-related genes (Fig. 2d). For those mutations to mouse, the reference allele in human should already be fixed in the common ancestor of human and monkey, so the novel mutation in human might be deleterious and depleted in cancer-related genes (Fig. 2c, d). But for the mutations to monkey, the reference allele in monkey should already be fixed in the common ancestor of human and monkey, and the human reference allele might be the deleterious one. Thus, the changes caused by human SNPs simply “reversed” the deleterious allele in human genome, which might be welcome by the cancer-related genes (Fig. 2a, b).
Purifying selection on the nonsynonymous as well as the synonymous SNPs in cancer-related genes
Nonsynonymous to synonymous ratio (nsy/syn) is a very commonly used measurement in evolutionary biology to detect the positive or purifying selection. From our SNPs data, we could see that the nsy/syn ratios in cancer-related genes are significantly lower than those in other genes (Fig. 3a for each gene and Fig. 3b for pooled result). This pattern suggests that nonsynonymous mutations in cancer-related genes are subjected to purifying selection and are repressed. Again, this might not be new to us since numerous driver mutations were reported to cause nonsynonymous changes. Next, we examined the conservation level of the nonsynonymous and synonymous SNP sites. Note that the conservation level of a site refers to the DNA level conservation pattern across different species (and it has nothing to do with the SNP). A very smart measurement is the phyloP score (downloaded from UCSC Genome Browser, genome.ucsc.edu) calculated from the multiple sequence alignment of a large set of species. We found that for both the nonsynonymous and synonymous sites in cancer-related genes have higher conservation level than those in other genes (Fig. 3c). Another aspect to reflect the DNA conservation level is to calculate the fraction of sites that are conserved between human and the orthologous sites in another species (e.g. mouse). Similarly, we observed higher conservation level for the nonsynonymous or synonymous sites in cancer-related genes compared to those in other genes (Fig. 3d). Taken together, the DNA sites of the nonsynonymous and synonymous SNPs are more conserved in cancer-related genes than other genes, indicating that the mutations on these sites might cost a higher price. This result supports our notion that both nonsynonymous and synonymous SNPs in cancer-related genes are subjected to purifying selection.
Purifying selection on the synonymous SNPs in cancer-related genes
As we have mentioned above, the major selection force acting on synonymous mutations is the codon usage bias. We followed an early study [21] and calculated the parameters needed for codon bias (see Methods for detail). In brief, each of the 59 sense codons (excluding three stop codons and two codons without synonymous counterparts) has a “codon bias value” between − 1 and 1 (which is actually a correlation coefficient). A positive value indicates a preferred codon, and vice versa. The extent of preference increases with this codon bias value (Fig. 4a). As clearly shown, the preferred codons are C/G ending and those unpreferred ones are A/T ending. But do not simply treat this codon bias value as the frequency of each codon appearing in the genome. The codon bias and codon frequency is highly correlated in the preferred or unpreferred groups separately, but not together (Fig. 4b). Therefore, the codon bias and codon frequency could be used as independent parameters to measure a synonymous change. For each synonymous human SNP, we calculated the change (delta) of codon bias and codon frequency after mutation. We divided all these synonymous sites into ten bins with the order of increasing delta codon bias (Fig. 4c) or delta codon frequency (Fig. 4d). In each bin, we calculated the fraction of cancer-related genes. Amazingly, the fractions of cancer-related genes are significantly positively correlated with the bins of delta codon bias (Fig. 4c) or delta codon frequency (Fig. 4d). This trend demonstrates that the synonymous SNPs in cancer-related genes tend to gain a more frequently used codon or a preferred codon (compared to those in other genes). The result strongly suggests the purifying selection exerted on synonymous SNPs in cancer-related genes.
Dilemma of cancer-related genes: lower nsy/syn ratios but preferred codon changes
We have demonstrated that the cancer-related genes have significantly lower nsy/syn ratios (Fig. 3), and that the synonymous or nonsynonymous mutations in cancer-related genes tend to increase the codon preference or codon frequency compared to other genes (Fig. 4). It is reasonably to ask that whether the nsy/syn ratio of genes and the changing direction of codon preference are correlated. Contrary to our expectation, at gene level, the nsy/syn ratio and the delta codon bias (Additional file 3: Figure S2A) or the delta codon frequency (Additional file 3: Figure S2B) is positively correlated. Here comes the dilemma of cancer-related genes. Cancer-related genes are the set of genes with low nsy/syn ratios but high delta codon bias/frequency values. We selected the cancer-related genes with relatively low nsy/syn ratios (nsy ≤ 2 and syn ≥ 2), and displayed the top 20 genes with the highest delta codon bias (Additional file 3: Figure S2C). These genes are supposed to suffer most from the dilemma. We also illustrated the relative position of the nonsynonymous or synonymous SNPs on the CDSs of these 20 genes (Additional file 3: Figure S2D). Given the dilemma of cancer-related genes, we could consider that the nsy/syn ratios and the codon usage might play independent roles in oncogenesis. This is not paradoxical because the nonsynonymous mutations contribute to cancer by changing the amino acid and the protein function, while synonymous mutations putatively contribute to cancer by affecting codon usage and the mRNA translation rate.
Purifying selection on the nonsense mutations in cancer-related genes
Nonsense mutations introduce pre-mature stop codons to CDS, which are highly deleterious and subjected to purifying selection in most cases. Intriguingly, when we calculated the relative positions of nonsense mutations (SNPs) on CDS, we discovered that the nonsense mutations in cancer-related genes are located closer to the stop codons (Fig. 5a). If a “less truncated” protein is less deleterious, the nonsense mutations closer to the end of CDS would reduce their harms. Thus, our observation is probably the relics shaped by the purifying selection on the nonsense mutations in cancer-related genes. Moreover, similar to the nsy/syn ratio, we calculated the nonsense/syn ratios of cancer-related genes and other genes (Fig. 5b), unsurprisingly, the ratio in cancer-related genes is significantly lower. An interesting issue is that if we consider multiple mutations in the same codon simultaneously, the combined effect of the mutations might be different from any of the single mutations (Fig. 5c), and the combined effect can even lead to a nonsense mutation. Considering that the adjacent mutations are not necessarily coupled in the same individual, this concern of “multiple mutations” may not be useful in all cases. In this case of human SNPs, we extracted all the “double-mutations” that are located in the same codon, and selected the cases in which the combined effect is different from any of the single mutations (in total 691 cases). Amazingly, among all these 691 cases, 129 (18.7%) were causing nonsense mutations eventually. We calculated the double-mutation/syn ratios in cancer-related genes and other genes, and found that the double-mutations in cancer-related genes are significantly suppressed (Fig. 5d), probably due to their deleterious effect similar to that of nonsynonymous mutations. For the 129 cases of nonsense mutations caused by double-mutations, the cases in cancer-related genes are located closer to the end of CDSs (Fig. 5e). In this part, we fully demonstrated the purifying selection on nonsense mutations in cancer-related genes, and these so called nonsense mutations could even be derived from multiple nonsynonymous or synonymous mutations within the same codon.
Other confounding factors to be addressed
Several confounding factors need to be considered before a consolidated conclusion is drawn. The factors are listed as follows: (1) While the cancer-related genes are well-annotated, the “other genes” might contain many poorly characterized genes which could introduce bias into the analysis; (2) GC content might potentially affect the results; (3) The canonical transcript chosen for each gene might not be representative or could be lowly expressed.
To address these problems, we first decided to take into account the gene expression level. We believe that the highly expressed genes (usually more conserved and more important) tend to be better annotated and those lowly expressed genes tend to be (relatively) poorly characterized. Therefore, if we only focus on the highly expressed genes and observe the same pattern (between cancer-related genes versus other genes), then the potential bias could be canceled.
We retrieved the highly expressed genes (NGS read count > 100, see Methods for details) in human HeLa cells. In total, 9903 genes are defined as highly expressed. We used this set of highly expressed genes and tested the following patterns between cancer-related genes versus other genes (Additional file 3: Figure S3): (1) Nonsynonymous to synonymous ratios; (2) Conservation level of SNP site; (3) The changes of CUB of synonymous mutations. The results are as follows: (1) The nonsynonymous to synonymous ratios in cancer-related genes are significantly lower than those in other genes; (2) Both nonsynonymous and synonymous mutations in cancer-related genes are more conserved at DNA level compared to mutations in other genes; (3) The synonymous mutations exhibit preferred changes in codon usage in cancer-related genes compared to other genes. These results indicate that even if we only look at the highly expressed genes (which are likely to be well-characterized), the pattern on nonsynonymous or synonymous mutations in cancer-related genes also exists.
To control for the effect of GC content, we chose the top 20% genes with the highest GC content (Additional file 3: Figure S4) and the bottom 20% genes with the lowest GC content (Additional file 3: Figure S5). Within each gene set, we compared the following issues in cancer-related genes versus other genes. (1) Nonsynonymous to synonymous ratios; (2) Conservation level of SNP site; (3) The changes of CUB of synonymous mutations. These patterns are the main points of our study. Fortunately, the patterns keep intact if we only focus on the high GC genes (Additional file 3: Figure S4) or low GC genes (Additional file 3: Figure S5), suggesting that the differences observed in cancer-related genes versus other genes are not caused by GC content of genes.
Next, we need to discuss the potential limitation if we only consider the canonical transcript of each gene. We think our control for gene expression level (Additional file 3: Figure S3) should partially cancel this bias because the gene expression level is calculated by NGS read count on the canonical transcript (Methods). Moreover, in some cases, different isoforms of the same gene share the same CDS, which might reduce the bias of this approach (since we only focus on the variations in CDS).
In addition, in the CUB analyses, one might concern that a comprehensive study should add analysis on (1) only the 4-fold degenerate sites and considering that (2) the mutations are context-dependent. Moreover, the patterns observed in nonsense mutations (between cancer-related genes versus other genes) might be related to CUB. These issues are untested at this stage and are regarded as limitations in this study.