The viral expression and immune status in human cancers and insights into novel biomarkers of immunotherapy

Background Viral infections are prevalent in human cancers and they have great diagnostic and theranostic values in clinical practice. Recently, their potential of shaping the tumor immune microenvironment (TIME) has been related to the immunotherapy of human cancers. However, the landscape of viral expressions and immune status in human cancers remains incompletely understood. Methods We developed a next-generation sequencing (NGS)-based pipeline to detect viral sequences from the whole transcriptome and used machine learning algorithms to classify different TIME subtypes. Results We revealed a pan-cancer landscape of viral expressions in human cancers where 9 types of viruses were detected in 744 tumors of 25 cancer types. Viral infections showed different tissue tendencies and expression levels. Multi-omics analyses further revealed their distinct impacts on genomic, transcriptomic and immune responses. Epstein-Barr virus (EBV)-infected stomach adenocarcinoma (STAD) and Human Papillomavirus (HPV)-infected head and neck squamous cell carcinoma (HNSC) showed decreased genomic variations, significantly altered gene expressions, and effectively triggered anti-viral immune responses. We identified three TIME subtypes, in which the “Immune-Stimulation” subtype might be the promising candidate for immunotherapy. EBV-infected STAD and HPV-infected HNSC showed a higher frequency of the “Immune-Stimulation” subtype. Finally, we constructed the eVIIS pipeline to simultaneously evaluate viral infection and immune status in external datasets. Conclusions Viral infections are prevalent in human cancers and have distinct influences on hosts. EBV and HPV infections combined with the TIME subtype could be promising biomarkers of immunotherapy in STAD and HNSC, respectively. The eVIIS pipeline could be a practical tool to facilitate clinical practice and relevant studies. Supplementary Information The online version contains supplementary material available at 10.1186/s12885-021-08871-9.


Background
Human oncogenic viruses have been implicated in causing 10-15% of human cancers worldwide [1]. They can lead to carcinogenesis directly or indirectly by influencing many of the hallmarks of human cancers, such as sustaining proliferative signaling, triggering genome instability and mutation, eliciting inflammation and avoiding immune destruction [2]. The exploration of virus-cancer associations is critical, as it provides identifiable targets for the prevention, diagnosis and treatment of human cancers [3][4][5][6]. Hypothesis-driven methods through epidemiology and low-throughput investigations were the primary methods to study virus-cancer associations. These methods had great limitations in efficiency and have caused false associations [7,8]. With the advent of nextgeneration sequencing (NGS), successful efforts have been made in screening viral sequences in a high-throughput fashion. Moreover, large cohorts, such as The Cancer Genome Atlas (TCGA) database, combined with bioinformatics techniques further inspired research of detecting viral sequences in human genomes [9][10][11].
Immunotherapy has revolutionized the therapeutic strategies of human cancers. The presence of programmed cell death 1 ligand 1 (PD-L1), microsatellite instability-high (MSI-high) or DNA mismatch-repair deficiency (dMMR) and tumor mutation burden (TMB) are the most promising biomarkers for immunotherapy. However, these biomarkers have limited ability in selecting responders [12,13]. Besides, patients lacking PD-L1, MSI-high or high TMB markers could lose the therapeutic opportunity. Thus finding effective biomarkers in this group of patients is critically urgent. Recently, a study exploring viral infections in six cancers revealed that certain viral infections can shape tumor immune microenvironment (TIME) with altered immune cellular infiltrations. Interestingly, these infected tumors harbor neither high TMB nor MSI-high markers. This indicates viral infections are likely to be an independent biomarker of immunotherapy in certain cancers [14]. Furthermore, several clinical trials have reported a higher overall response to anti-PD treatment in virus-positive cohorts, including squamous cell carcinoma of the head and neck [15], Merkel cell polyomavirus [16,17], Hodgkin lymphomas and post-transplant lymphoproliferative disorders [18]. And the potential of virus-cancer associations to be applied to immunotherapy has also been reported in several other studies [19,20]. However, the landscape of viral expressions, immune status and the interplay between them in human cancers remain incompletely understood.
In this study, we aim to investigate viral sequences across human cancers and find their influences on the genome, transcriptome and TIME of their hosts. Also, we aim to generate an integrated pipeline to detect viral infections and identify TIME subtypes. Hopefully, the revealed virus-cancer associations and the developed tools may provide insights into immunotherapy in human cancers.

Datasets and viral sequences detection pipeline
We downloaded 11,206 TCGA BAM format files of 33 cancer types from The Genomic Data Commons (GDC) data portal with official authorization. We aligned raw RNA-seq data in BAM files which came from STAR with reference sequences of human and viral genomes to detect viral expression. Next, we employed StringTie (version 1.3.3) to assemble transcripts for each BAM file, with GENCODE v22 as the reference annotation. Finally, the expression level of each transcript was normalized into TPM (transcripts per kilobase million). For transcripts of the same viral infection type, we selected the maximum TPM value as the final viral mRNA expression. Of note, we refer to an infected case as a tumor infected by a specific virus type, for example, an HPV-infected tumor of HNSC is an HPV infection case in HNSC. By comparison, an infected sample is a sample harboring viral sequences, no matter how many types of viral sequences were detected. Therefore, in samples co-infected by different viruses, the number of infected cases is not equal to that of infected samples or tumor samples.
MAF format files, gene expression profiles (fragments per kilobase million, FPKM) and corresponding clinical information of 11,206 tumor samples were also downloaded from the TCGA database. Information of total leukocyte fraction (LF), 22 types of lymphocyte infiltration, genomic features (including silent mutation rate, nonsilent mutation rate, copy number variation (CNV), aneuploidy score, homologous recombination defects (HRD), intratumor heterogeneity (ITH)) of each tumor sample was obtained from a previous study [21]. All data involved in the analysis is available in Supplementary  Table 1.

TMB calculation
Tumor mutation burden (TMB) is defined as the number of somatic mutations (excluding germline mutations) within the whole genome. In this study, the TMB of each tumor sample was calculated by measuring mutations per megabase (Mb) based on MAF format files from TCGA. For each sample, we merged all somatic mutations calculated by four different techniques, including MuTect [22], MuSE [23], VarScan [24], and SomaticSniper [25]. 40 Mb was considered as the size of the exome, and the final TMB is calculated as the number of somatic mutations divided by 40. According to the previous study [26], tumor samples were further partitioned into 3 groups: samples with 1-5 mutations/Mb are defined as "Low-TMB"; samples with 6-19 mutations/Mb are defined as "Intermediate-TMB"; samples with > = 20 mutations/Mb are defined as "High-TMB". TMB of each sample is available in Supplementary  Table 1.

Criteria of segregating 22 leukocyte subtypes into 9 subsets
We segregated 22 leukocyte subtypes into 9 subsets according to the criteria from a previous study [21]

Gene set variation analysis
Gene set variation analysis (GSVA) was implemented using the "GSVA" R package (3.8). Sixteen immunerelated pathways were obtained from the Molecular Signatures Database (MSigDB) [27,28] Detailed information about GSVA is available in a previous study [29].

Differential expression analysis
Gene expressions (FPKM) were log2-transformed after adding one as the pseudo count and then processed by the "limma" R package (3.36.5). Differentially expressed genes (DEGs) were defined as genes with j log FC 2 j > 0:5 and adjusted P < 0.05. Enrichment analyses were performed using the "clusterProfiler" R package [30].

Soft clustering analysis
Using whole DEGs for enrichment analysis can cause redundancies which is much less biologically interpretable. In this case, subtle changes would be masked by dominant alterations. Therefore, we chose fuzzy c-means (FCM) to divide whole DEGs into different functional modules. FCM is a soft clustering approach and we implemented the method using the "Mfuzz" R package (3.8) [31,32]. After excluding genes with low standard deviation and standardization, DEGs profiles were converted into "ExpressionSet" objects. Fuzzier "m" was identified using the "mestimate" function and the number of clusters was adjusted by the "Dmin" function where the turning point in the decline curve of the minimum centroid distance was selected. Cluster comparisons were implemented using the "compareCluster" function of "clusterProfiler" R package (3.8.1).

K-means unsupervised clustering
We performed clustering analysis on the "LF-high" samples of six cancers including LGG, COAD, CESC, KIRC, UVM and SKCM [33]. These cancers were reported covering various immune environment types. We used infiltration levels of 22 immune cell subtypes and expression levels of 76 immune-related genes as features (Supplementary Table 3). Features were first scaled and the Kmeans clustering algorithm was performed by the "kmeans" function in R (version 4.0.1). The optimal number of the cluster was determined by the "NbClust" function.

Kaplan-Meier analysis
The Kaplan-Meier analysis was used to estimate the empirical survival probabilities. Differences between survival curves were tested by the log-rank test using the "survival" R package (version 3.2-7).

TIDE and IFNG score prediction
We first normalized raw FPKM values according to the recommended way (http://tide.dfci.harvard.edu/). First, raw FPKM values were log2 transformed after adding a pseudo value of one (log2(FPKM+ 1)). For each gene, we then subtracted the mean value calculated by averaging normalized FPKM values across all patients of the same cancer type. Finally, we uploaded the normalized tab files of the expression matrix to the website to get TIDE predictions. The results include "Responder", "TIDE score", "IFNG score", "Dysfunction score", and "Exclusion score", etc. (Supplementary Table 5). . We found LFs of cancers lacking leucocyte infiltrations [33] were mostly below the 25th percentile of LFs (0.086) in all samples). Using this threshold, samples with LF > = 0.086 (the 25th percentile of LF) were assigned to the "LF-high" group and samples with LF < 0.086 were assigned to the "LF-low" group.

Construction of the support vector machine (SVM) classifier
The SVM classifier is built for distinguishing the "Immune-Anergy" subtype from the "Immune-Stimulation" subtype. We used samples of LGG, COAD, CESC, KIRC, UVM and SKCM from TCGA to develop the model. The "Immune-Anergy" and "Immune-Stimulation" subtypes of these samples were first randomly assigned into a training cohort (1200 samples) and a validation cohort (513 samples). For feature selection, DEG analysis was performed in the training cohort where thirty-seven genes were found differentially expressed between two subtypes ( j log FC 2 j > 2 and adjusted p < 0.05). An SVM model was then trained using the 10-fold crossvalidation strategy by the "e1071" R package. The efficiency of the classifier was evaluated by the receiver operating characteristic curve (ROC) calculated by the"pROC" R package. The classifier could distinguish the "Immune-Stimulation" subtype from the "Immune-Anergy" subtype in both training cohort (AUC = 0.99, 95% CI: 0.997 to 0.999) ( Supplementary Fig. S9) and validation cohort (AUC = 0.995, 95% CI: 0.991 to 0.999) ( Supplementary Fig. S9). The best cut-off threshold showed great performance in both training (sensitivity: 99%, specificity: 94%, accuracy: 0.98) ( Supplementary  Fig. S9) and validation (sensitivity: 98%, specificity: 93%, accuracy: 0.97) ( Supplementary Fig. S9) cohorts.

Construction of eVIIS pipeline
Based on the LF.Score classifier and the SVM classifier, we developed the eVIIS pipeline. The pipeline was built using Python (version 3.6.0) and shell language. It takes several steps to predict viral infections and the TIME subtype: (1) align RNA-seq data to human and viral reference genome sequences using STAR (version <= 2.5) [34]; (2) assemble mapped reads by StringTie (version <= 1.2.3) [35] to detect viral sequences; (3) obtain read counts of genes by featureCounts (version > = 1.5.0) [36], quantify expression levels of genes of FPKM values, and predict the immune status (TIME). The expression of a virus type within a tumor is defined as the maximum TPM value of the multiple transcripts (TPM) from the assembly results generated by StringTie. For TIME prediction, eVIIS quantifies gene expression by normalizing counts into FPKM, according to the formulas provided in the GDC mRNA analysis pipeline: where RC g and L are the number of counts mapped to the gene and the total length of exons for the gene, respectively. RC pc indicates the number of reads mapped to all protein-coding genes (excluding mitochondrial genes). TIME prediction was implemented based on the two models (LASSO regression and SVM) derived from the TCGA training cohort. For a given sample, eVIIS first calculates its LF.Score using the LASSO regression model based on FPKM values of 30 LF-relevant genes. The sample will be classified as the "Immune-Exclusion" subtype if LF.Score < 81.03. Samples with LF.Score > = 81.03 will be further processed by the SVM model. The SVM model evaluates each sample by the 37 DEGs (FPKM) and predict the sample as the "Immune-Stimulation" or the "Immune-Anergy" subtype. The eVIIS pipeline is available at https://github.com/HuangLab-Fudan/eVIIS.

Statistical analyses
Wilcoxon rank-sum test was used to compare continuous variables. Pearson's Chi-squared test was used to compare unordered categorical variables. Correlation analysis was performed by Spearman's rank correlation. All tests were two-tailed with p < 0.05 as a significance cutoff. All statistical analyses were performed using R (version 4.0.1).

The landscape of viral expressions in human cancers
We designed a pipeline to detect 212 types of viral sequences by interrogating RNA-Seq data of 11,206 tumor samples of 33 cancer types from TCGA (Fig. 1A). Overall, 9 types of viral sequences were detected in 744 tumors of 25 cancer types (Table 1). Over 1% of tumors were found infected in 20 cancer types and over 5% of tumors were infected in 6 cancer types (Fig. 1B) Fig. 1C.
For CESC, all infected tumors were related to Human Papillomavirus (HPV), in which the high-risk HPV16 and HPV18 dominantly accounted for 60.90 and 14.53%, respectively. Besides, we detected 2 cases of HBV infection in CESC. For LIHC, 93.6% tumors (117 out of 125) were infected by Hepatitis B virus (HBV). Additionally, four tumors were infected by HPV (HPV16 and HPV35), four tumors were infected by Hepatitis C virus (HCV), and one tumor was infected by Epstein-Barr virus (EBV). For STAD, 55.1% of tumors (43 out of 78) were infected by EBV and 51.3% of tumors (40 out of 78) were infected by Cytomegalovirus (CMV). Besides, one tumor was infected by HPV18, one tumor was infected by Kaposi's sarcoma-associated herpesvirus (KSHV), and one tumor was infected by Human Immunodeficiency Virus (HIV). Of note, we found co-infection of EBV and CMV occurred frequently in gastrointestinal tumors. Apart from STAD, ESCA showed 3.2% of tumors (6 out of 185) with EBV infection and 5.9% of tumors (11 out of 185) with CMV infection. For HNSC, 89.61% of tumors (69 out of 77) were infected by HPV and HPV16 was the dominant subtype (79.7%; 55 out of 69). Also, we found one tumor infected by EBV, 3 tumors infected by CMV, 2 tumors infected by HBV. For LAML, we detected CMV infection in all tumors and SV40 infection in 83.6% of tumors (46 out of 55). We compared the above results with the clinical information provided by TCGA. For HPV infection, 22 out of 23 records of CESC and 91 out of 96 records of HNSC agreed with our detection. Additionally, we detected other subtypes of HPV, including HPV45, HPV52, HPV58 and HPV70. As for EBV infection, EBV infections were detected in 27 out of 30 tumors that were accordingly defined as the GI.EBV subtype by TCGA (Supplementary Table 1) [37].

EBV and HPV infections showed decreased genomic variations
To assess the effects of viral infections upon the host genome, we analyzed tumor mutation burden (TMB), mutation rate (silent and non-silent), copy number variation (CNV, both segment and fraction of genomic alterations), aneuploidy, homologous recombination defects (HRD) and intratumor heterogeneity (ITH) for the most prevalent virus-cancer associations (including EBV infection in STAD, HBV infection in LIHC and HPV infection in HNSC). We found all eight features decreased consistently in HPV-infected HNSC and features including CNV, HRD and ITH were decreased in EBV-infected STAD (p < 0.05; two-tailed Mann-Whitney U test). By comparison, five features (SNV, CNV-segment, CNVfraction, aneuploidy and HRD) were increased in HBVinfected LIHC (p < 0.05; two-tailed Mann-Whitney U test) (Fig. 3A and Fig. 3B; Supplementary Table 2). Using TMB as an indicator, we further compared infected tumors with tumors with high microsatellite instability (MSI-high) [38,39]. Most virus-infected tumors were classified as "intermediate TMB" or "low TMB", compared to that most MSI-high tumors were classified as "high TMB" (Fig. 3C). Beside, we found a significant negative correlation between HPV expressions and TMB in all HPV-infected tumors (P < 0.05, Spearman's rank correlation) (Fig. 3D).

EBV and HPV infections displayed significantly changed gene expressions
To explore the impact of viral infections upon the transcriptome, we performed differential expression analysis. For HPV-infected HNSC, we identified 3367 differentially expressed genes (DEGs), of which 2446 DEGs were (See figure on previous page.) Fig. 1 The landscape of viral expressions in human cancers. A Pipeline of viral sequence detection and downstream analyses. We downloaded 11,206 BAM files, expression profiles (fragments per kilobase million, FPKM) and clinical data including 33 cancer types from TCGA. For viral sequence detection, we aligned raw RNA-seq data in BAM files came from STAR with reference genomic sequences of human and 212 types of virus to detect viral sequences. We assembled transcripts with StringTie and the expression level of each transcript was normalized by TPM (transcripts per kilobase million). For transcripts of the same viral infection type, the maximum TPM value was selected as the final viral mRNA expression. Viral expression data, RNA-seq expression data and clinical data were then used for downstream statistical, genomic, transcriptional and immune analyses.  (Fig. 4A). Expressions of these genes were positively correlated with the expressions of EBV and HPV (Fig. 4B). Except that HLA-DPA1 is related to antigen presentation, the rest genes are exclusively associated with cell proliferation. To gain a better understanding of the biological functions of these DEGs, we performed fuzzy c-means (FCM) clustering analysis. We identified 3 gene clusters in STAD (S1, S2, S3), 4 gene clusters in LIHC (L1, L2, L3, L4) and 4 gene clusters in HNSC (H1, H2, H3, H4) (Supplementary Fig. 1). 385  DEGs of S2 and 440 DEGs of H1 were exclusively upregulated, while 18 DEGs of L3 were all downregulated. Enrichment analysis showed these gene clusters were mainly associated with immunity. All DEGs in L2 and H3 were upregulated, while a large portion of DEGs of S1 was decreased. Enrichment analysis showed these gene clusters were relevant to cell proliferation (Fig. 4C). Moreover, we found pathways involving DNA replication, mismatch repair, base excision repair, and nucleotide excision repair were upregulated in S1 and H3 ( Fig.  4D; Supplementary Fig. S2).

EBV and HPV infections showed effective anti-viral immune responses and upregulated PD1 signaling pathway
We then focused on the influences of viral infections on immune cellular infiltrations. Distributions of the integrated 9 immune cells of STAD, LIHC and HNSC are presented in Fig. 5A. For STAD, the leukocyte fraction (LF) and infiltrations of CD8+ T cells and macrophages were higher in EBV-infected tumors (p < 0.05; two-tailed Mann-Whitney U test). For LIHC, LF, CD4+ T cells and mast cells were decreased in HBV-infected tumors (p < 0.05; two-tailed Mann-Whitney U test). For HNSC, infiltrations of CD8+ T cells and B cells were increased, while LF, macrophages and mast cells were decreased in HPV-infected tumors (p < 0.05; two-tailed Mann-Whitney U test) (Fig. 5B). In all EBV-and HPVassociated cancers, infiltrations of CD8+ T cells were higher in infected tumors ( Supplementary Fig. S3). For all EBV-infected tumors, a positive correlation was observed between EBV expressions and infiltration levels of CD8+ T cells (r = 0.37, P = 0.0086, Spearman's rank correlation) (Supplementary Fig. S4). Shannon entropy and species richness are indicators of T cell infiltration levels, and TCR evenness could stand for the diversity of T cell receptors. We noticed that Shannon entropy and species richness were increased, while TCR evenness was decreased in EBV-infected STAD. This implies that EBV-infected tumors may have undergone a clonal expansion [14] ( Supplementary Fig. S5). This is consistent with the observation that ITH was decreased in EBV-infected tumors. We further examined how immune-related functions were regulated in these virus-infected tumors [27,28]. In terms of immune stimulation,12 pathways responsible for immune cell migration and infiltration, antigen recognition, innate immune response and adaptive immune response were upregulated in EBV-infected STAD; and 4 pathways representing B cell receptor signaling, NK cell-mediated cytotoxicity, RIG-1 like receptor signaling and T cell receptor signaling were upregulated in HPVinfected HNSC. As for immune suppression, suppressive pathways like PD1 signaling and CTLA-4 pathways were upregulated in EBV-infected STAD and HPV-infected HNSC (two-tailed Mann-Whitney U test) (Fig. 5C). Immune-suppressive molecules, including LAG3, PDCD1, CD274 (PD-L1), CTLA4 and IL10 were also upregulated in EBV-infected STAD and HPV-infected HNSC (two-tailed Mann-Whitney U test) (Fig. 5D). By comparison, all immune-related pathways were decreased in HBV-infected LIHC (two-tailed Mann-Whitney U test) (Fig. 5C).

Identification of TIME subtypes
Tumor immune microenvironment (TIME) is a prerequisite of applying immunotherapy in the clinic [40]. In this part, we proposed a method to classify tumors into different types of TIME. The general idea of identifying different TIME subtypes is to first separate tumors with high leucocyte infiltrations from those lacking leucocyte infiltrations. For this purpose, we first studied LF distributions of the 30 cancer types with available data from TCGA. We found LFs of cancers lacking leucocyte infiltrations [33] were mostly below the 25th percentile of LFs (0.086) (Fig. 6A). Using this threshold, we separated tumors into a "LF-high" group and a "LFlow" group. Then within the "LF-high" group, we wanted to identify tumors with activated immune responses and those with functionally impaired immune responses. Specifically, we used infiltrations of all 22 types of immune cells and expressions of 76 immune-related genes (Supplementary Table 3) as features to perform an unsupervised clustering analysis on all "LF-high" tumors of LGG, COAD, CESC, KIRC, UVM and SKCM TCGA datasets. We chose these cancer types because they have been reported covering different immune environment subtypes [33]. The clustering analysis identified two different clusters (Fig. 6B). Cluster1 is characterized by increased infiltrations of both cytolytic cells (CD8+ T cells, M1 macrophages, activated NK cells and follicular helper T cells) and immune-suppressive cells (Tregs and M2 macrophages). Accordingly, expressions of cytolytic genes (GZMA, GZMB, PRF1, IFNG and TNF) and immunosuppressive genes (PDCD1, CD274, LAG3 and CTLA4) are also upregulated. By contrast, infiltration levels of immune cells and expressions of immunerelated genes are nearly all lower in Cluster2 than Clus-ter1 (Fig. 6C). LF levels reflect the overall level of immune cellular infiltrations, and different patterns of expressions of 76 immune-related genes and infiltrations of 22 immune cells indicate different functional conditions. By aggregating LF levels and the clustering results, we manually identified three different TIME subtypes. The "LF-low" group is defined as the "Immune-Exclusion" subtype, and Cluster1 and Cluster2 within the "LF- high" group are defined as the "Immune-Stimulation" subtype and the "Immune-Anergy" subtype separately. We compared our TIME subtyping results with a previous research that identified different immune types across human cancers [41]. LGG and UVM showed higher fractions of the "Immune-Exclusion" subtype and lower fractions of the "Immune-Stimulation" subtype. This result is compatible with the decreased immune responses in LGG and UVM reported by the study. By comparison, CESC, KIRC and SKCM showed relatively higher fractions of the "Immune-Stimulation" subtype, and these cancer types also showed stimulated immune responses in the study (Fig. 6D). TIDE is a tool to predict immune responses for an individual cancer type. It currently only supports predicting for SKCM and non-small cell lung cancer (NSCLC) and the prediction of SKCM is more robust [42]. Therefore, we chose the SKCM TCGA dataset for comparison. Both TIDE and TIME methods evaluate CTL infiltration levels and dysfunction status, and TIME additionally evaluates a stimulated status. For the evaluation of CTL infiltrations, all "CTLhigh" tumors are constituted of the "TIME-Stimulation" (62%) and the "TIME-Anergy" (38%) TIME subtypes, and most of "CTL-low" tumors are constituted of the "TIME-Exclusion" (25%) and the "TIME-Anergy" (73%) TIME subtypes (Supplementary Fig. S10). Besides, "T cell exclusion scores" showed a significant negative correlation with "LF scores" (cor = − 0.66; p-value < 2.2e-16, Pearson's correlation) (Supplementary Fig. S10). These results are expected for both "T cell exclusion scores" and "LF scores" are measurements of CTL infiltrations. In addition, we found "T cell exclusion scores" is significantly higher in the "Immune-Exclusion" subtype than the "Immune-Anergy" subtype (mean: 0.88 and 0.35; p- value = 5.971e-06, t-test) and the "Immune-Stimulation" subtype (mean: 0.88 and − 1.67; p-value < 2.2e-16, t-test); and the "Immune-Anergy" subtype is higher than the "Immune-Stimulation" subtype (mean: 0.35 and − 1.67; p-value < 2.2e-16, t-test) (Supplementary Fig. S10). For the measurement of dysfunction status, we found "T cell dysfunction scores" are significantly higher in the "Immune-Stimulation" subtype than the "Immune-Anergy" subtype (mean: 1.16 and − 0.23; p < 2.2e-16, t-test) and the "Immune-Exclusion" subtype (mean: 1.16 and − 0.99; p < 2.2e-16, t-test); and the "Immune-Anergy" subtype is higher than the "Immune-Exclusion" subtype (mean: − 0.99 and − 0.23; p = 2.3e-11, t-test) (Supplementary Fig.  S10). These results are expected because, in the "Immune-Stimulation" subtype, cytolytic genes and immunosuppressive genes are co-expressed and upregulated. The "Immune-Stimulation" subtype gets higher "T cell dysfunction scores" for their high expressions of immune-suppressive genes. By comparison, the "Immune-Anergy" subtype lacks immune signals, so "T cell dysfunction scores" are expected to be lower. We also compared expression levels of the IFN-Y signature among TIME subtypes. The "Immune-Stimulation" subtype showed the highest level of "IFNG scores" (mean: 2.09), which is higher than the "Immune-Anergy" subtype (mean: − 0.26; p < 2e-16, t-test) and the "Immune-Exclusion" subtype (mean: − 1.68; p < 2e-16, ttest). and "IFNG scores" are higher in the "Immune-Anergy" subtype also shows a higher level of "IFNG scores" than the "Immune-Exclusion" subtype (p < 2e-16, t-test) ( Supplementary Fig. S11). These results are expected as the IFN-Y signal is strengthened by stimulated immune responses and increased immune infiltration.

Correlations between viral infections and TIME subtypes
To predict TIME subtypes in STAD, LIHC, HNSC, as well as other cancers, we developed a workflow, including a LASSO regression model to predict the LF level and an SVM classifier to separate two clusters. The LF model was trained on the training cohort (3865 samples) and validated in the two validation datasets (2931 and 2911 samples) from the whole TCGA datasets. And the SVM model was trained (1200 samples) and validated (513 samples) on samples of the "Immune-Anergy" and the "Immune-Stimulation" TIME subtypes of LGG, COAD, CESC, KIRC, UVM and SKCM from TCGA datasets ( Supplementary Figs. S7, S8, and S9). EBV and HPV significantly influenced the TIME subtypes of STAD (p = 5.8e-9, Pearson's Chi-squared test) and HNSC (p = 1.4e-7, Pearson's Chi-squared test), respectively. However, HBV infection showed no significant impact on the TIME subtypes of LIHC (Fig. 7A). Compared to non-infected tumors, EBV-infected STAD (59.52% vs 17.61%) and HPV-infected HNSC (44.12% vs 16.12%) presented higher fractions of the "Immune-Stimulation" subtype. Fractions of the "Immune-Anergy" subtype were lower in EBV-infected STAD (35.71% vs 74.93%) and HPV-infected HNSC (47.06% vs 78.04%) (Fig. 7A). Kaplan-Meier analysis further demonstrated prognostic significance of TIME subtypes in STAD and HNSC. The "Immune-Exclusion" subtype exhibited the best overall survival (OS) (P = 0.026, log-rank test) and progression-free interval (PFI) (P = 0.041, log-rank test) in STAD, and the "Immune-Stimulation" subtype showed the best OS (P = 0.014, log-rank test) and PFI (P = 0.029, log-rank test) in HNSC (Fig. 7B). In contrast, TIME subtypes showed no significant difference in the prognosis of LIHC (Supplementary Fig. S6).

Construction of the eVIIS pipeline
We combined the viral sequence detection pipeline and the TIME subtyping workflow to develop an integrated eVIIS pipeline. Given an RNA-seq dataset, eVIIS simultaneously evaluates viral infection and immune status (Fig. 8). eVIIS provides a stepwise or one-step analysis for each sample. Users can choose the appropriate mode for different purposes and different forms of the dataset. Details about the usage of the eVIIS pipeline are described in the README file (https://github.com/ HuangLab-Fudan/eVIIS).
For independent validation, we used an extra dataset including 83 human primary gastric tumor tissue samples from the surgical specimen archives from Fudan University Shanghai Cancer Center (FUSCC). The results showed that 4 EBV-infected samples and 12 CMVinfected samples, in which 6 samples (7.23%) were predicted as the "Immune-Stimulation" subtype. The results were consistent with the prevalence of EBV and CMV infections in gastrointestinal tumors observed in the TCGA cohort.

Discussion
In this study, we designed an NGS-based pipeline to detect 212 types of viral sequences. We 14% of tumors (43 out of 304) in HNSC [11]. However, there're also some disagreements relating to EBV and CMV infections. For STAD, the study only detected one EBV-infected tumor in STAD (2.3%; 1 out of 43). However, the number is much lower than the number of tumors of the GI.EBV TCGA subtype (7.8%; 30/383) [37]. Furthermore, the occurrence of SV40 and CMV in LAML was controversial in several studies [43][44][45][46][47]. Our results supported the SV40 infection in LAML and we also detected high proportions of CMV infection in LAML. These differences might due to different standards of the definition of viral infections. One of those studies used the expression levels of oncogenic viruses HBV and HPV as referrence and they excluded samples with viral expressions less than 0.5 ppm (p.p.m.) [46]. The purpose is to neglect the trace signals generated by the infiltrated virus-positive lymphocytes instead of the tumor cells themselves [48,49]. In contrast, we didn't discard these weak signals. One reason is these weak signals can be caused by technical issues and the exclusion of weak virus-cancer associations might miss out on rare virus-cancer associations [50,51]. The other reason is that the trace signals coming from infiltrated viruspositive lymphocytes may sensitively reflect the immune status. In a recent study, after filtering low viral expressions, the study showed that the CIBERSORT estimation of absolute immune cell infiltrations was not significantly different between EBV-infected and non-infected tumors in multiple cancers, including STAD [52]. Therefore, they rejected the idea that the detection of EBV is due to infiltrating immune cells and confirmed the active contribution of EBV to STAD. However, the sensitivity of EBV infection to reflect immune status is undercut. We showed no viral infection in eight cancers, including ACC, BRCA, GBM, KICH, PCPG, TGCT, THYM and UVM. The absence of viral infection in BRCA and GBM has been reported previously [11], and our results may offer convincing evidence against viral etiology in the rest tumors.
Viral infections showed different tissue tendencies and their main hosts were highly selective. This could be explained by their different viral receptors that are required during infections. However, viral infections were also detected sporadically in some uncommon hosts. HPV16 was detected in a broad spectrum of cancers including uterus, lung, bladder carcinomas and low-grade gliomas tumors. These have also been reported in previous findings [10,11,53]. In our study, HPV-infected BLCA presented as the third most prevalent type of all HPV infections and this is consistent with previous results [52]. Moreover, among different HPV variants, HPV33 ranked the third most prevalent type of HPV infection in HNSC. And a recent study also reported HPV33 in HNSC (n = 3). Besides, the study reported HPV6 and HPV45 infections in BLCA. These have also been detected in our results and we additionally detected HPV52 and HPV56. It's technically hard to discard all the false-positive results of viral infections, but the varied tissue tendencies and expression levels could be used as a reference. For example, HBV-infected LIHC tumors usually harbor high viral expressions. Therefore, in tumors with low HBV viral expressions, they would likely be considered as contaminations or from infected lymphocytes [54], and this has been proved to be a contaminant in KIRC [11]. Similarly, further assessments of low-expressed CMV-infections and SV40-infections in OV will also be needed to exclude contaminations [51]. Compared to this, the uncommon hosts with high levels of viral expressions may represent a special subtype and the diagnostic and therapeutical values should be further explored.
Commonly, viral infections have the potential to cause perturbations in the host genome. In our study, EBVinfected STAD showed decreased CNV and HRD and HPV-infected HNSC exhibited consistently decreased genomic variations. This could partially be explained by the following transcriptional analyses that pathways involving DNA replication, mismatch repair, base excision repair, and nucleotide excision repair in S1 and H3 were upregulated, leading to decreased genomic instability in EBV-infected STAD and HPV-infected HNSC. A recent (See figure on previous page.) Fig. 6 Identification of TIME subtypes. A Distributions of leukocyte fractions (LFs) of human cancers. We analyzed LFs of 9692 tumors of 30 different cancer types, and 25th (LF = 0.086), 50th (LF = 0.171) and 75th (LF = 0.295) percentiles of LFs of these tumors are annotated in the figure. (B) Heatmap reveals two different clusters in the LF-high tumors. The clustering analysis was implemented using expression levels of 76 immunerelated genes and infiltration levels of 22 types of immune cells. All values were scaled before clustering. Important immune cell types are labeled with arrows, and immune checkpoint and super categories of each gene are labeled on the left side. C Heatmap of scaled cluster centers of Cluster 1 and Cluster2. Values greater than 0 are in red, values less than 0 are in blue. Labels of 76 immune-related genes and 22 types of immune cells are annotated on the right side. Nearly all expression levels of the immune-related genes and immune cellular infiltrations are higher in Cluster1 than Cluster2. D TIME subtypes in LGG, COAD, CESC, KIRC, UVM and SKCM. Based on the LF levels and clustering results, we manually classified training samples of LGG, COAD, CESC, KIRC, UVM and SKCM into three different TIME subtypes, namely "Immune-Stimulation", "Immune-Anergy" and "Immune-Exclusion" subtypes study reported that HPV-positive HNSC exhibited an almost complete mutual exclusivity with mutations in known drivers such as TP53, CDKN2A and TERT. Such decreased mutation burden and the independence from carcinogenic drivers confirmed the mutation-independent oncogenic and tumorigenic potential of HPV [52].
The impacts of viral infections were different on the host transcriptomes. While the small number and small changes of expression levels of DEGs were seen in HBVinfected LIHC, much greater changes were observed in EBV-infected STAD and HPV-infected HNSC. The common genes that were changed in all types of Fig. 7 Correlations between viral infections and TIME subtypes. A Associations between viral infections and TIME subtypes. We predicted TIME subtypes for tumors of STAD, LIHC and HSNC. We found significant correlations between TIME subtypes and viral infections in STAD and HNSC. By comparison, LIHC showed no significant correlation with TIME subtypes. B Survival analyses of different TIME subtypes of STAD and HNSC. The survival curves show better outcomes of overall survival (OS) and progression-free interval (PFI) in the "Immune-Exclusion" subtype of STAD and the "Immune-Stimulation" subtype of HNSC infections were primarily concerning cell proliferation. This reflects the ability of viral infections to stimulate cell proliferation that leads to tumor development [2]. The clustering results provided more interpretable results that the cell proliferation-relevant DEGs were consistently upregulated in HBV-infected LIHC and HPVinfected HNSC, compared to the inconsistent change in EBV-infected STAD. We showed that different viral infections harbored varied expression levels, with HBV and HPV infections displaying the highest expression levels. This could be interpreted as the greater ability of HBV and HPV to enhance cell proliferation, which leads to higher levels of viral expressions.
Another group of genes that changed commonly were immune-related. These immune-related genes were consistently upregulated in EBV-infected STAD and HPV-infected HNSC. Accordingly, multiple immune cells were increased in the HPV-infected HNSC. This could be supported by a recent study that reported a significant increase in M1 macrophages and T-cells (follicular helper, CD8+ T cells and regulatory T cells) in HPV-positive HNSC [52]. Many immune-stimulating signaling pathways were also upregulated in EBVinfected STAD and HPV-infected HNSC. Of note, the RIG-1 like receptor signaling pathway is primarily responsible for detecting and eliminating viral pathogens. Moreover, immune-suppressive pathways and molecular were also upregulated in the two types of infections. This finding indicates that EBV and HPV infections could elicit effective anti-viral immune responses that Fig. 8 Construction of eVIIS pipeline. The eVIIS pipeline includes several steps: (1) align RNA-seq data to human and viral reference genome sequences using STAR (version <= 2.5); (2) For viral expression detection: use StringTie (version <= 1.2.3) to assemble mapped reads and quantify viral expressions into TPM values. Samples with TPM over zero will be predicted as "Infected", and samples with TPM equals zero will be predicted as "Nnon-Infected"; (3) For TIME status prediction: use featureCounts (version > = 1.5.0) to obtain read counts and normalize gene expressions into FPKM values. TIME prediction will be made based on the two models (LASSO regression and SVM). For a given sample, eVIIS first calculates its LF.Score using the LASSO regression model based on FPKM values of 30 LF-relevant genes. The sample will be classified as "Immune-Exclusion" if LF.Score < 81.03. Otherwise, it will be further evaluated by the SVM model as "Immune-Stimulation" or "Immune-Anergy", using FPKM values of the 37 DEGs. The eVIIS pipeline is available at https://github.com/HuangLab-Fudan/eVIIS were concomitant with immune suppression. It's worth noting that though immune responses are often aroused by mutation-driven neoantigens, in our study, viral infections were independent of MSI status for most infected tumors exhibited low or intermediate TMB levels. Therefore, the stimulated immune responses might largely be caused by viral infections rather than the intrinsic tumor genomic variations. This result highlights the potential of viral infections being independent markers from MSI-high or TMB.
For anti-PD immunotherapy, TIME is considered as a prerequisite to select appropriate patients [55]. A large number of studies have tried to identify TIME subtypes [56,57], in which the type displaying high-level tumorinfiltrating leukocytes and stimulated PD pathway was suggested as the optimal one [55]. Since T cells could be functionally impaired and merely serve as bystanders in many tumors with persistent viral infections, the existence of immune cells doesn't guarantee an effective antitumor immune response [58][59][60]. For this concern, we took both immune cellular infiltrations and their functionality into account to construct the TIME subtyping system model. The model contains three subtypes. The "Immune-Stimulation" subtype showed increased leukocyte infiltration, enhanced cytotoxic activity and immunosuppressive status, especially the increased CD274 (PD-L1) expression level, which could be considered as an appropriate candidate for the anti-PD therapy. EBV-infected STAD and HPV-infected HNSC showed increased fractions of the "Immune-Stimulation" subtype and decreased fractions of the "Immune-Anergy" subtype. Because most EBV-and HPV-infected tumors lacked MSI-high status or high TMB burden and may still have some portion of the "Immune-Stimulation" subtype, highlighting the potential of combing viral infections and immune status to select responders in this group of patients. Recently, a study reported a gastric cancer patient who responded to the anti-PD-L1 drug avelumab. This patient showed no high TMB or MSIhigh markers, but the tumor was strongly positive for EBV mRNA [61]. This indicates that viral infection could be a biomarker of immunotherapy to help recognize responders without high TMB or MSI-high markers. In this regard, evaluating viral infection and TIME status could be of great importance to immunotherapy, and the eVIIS pipeline that can simultaneously evaluate viral infection and TIME status could help achieve this goal. Considering the fruitful results from previous studies [10,11,52], we did not perform DNAbased fusion site analysis and focused our research on the transcriptome-based detection results. Since viral infections not expressed at the transcript level may be missed out, the results could be further improved by incorporating genome-based results.

Conclusions
We provided a comprehensive virus-cancer association landscape and revealed different properties of viral infections. EBV-infection and HPV-infection led to decreased genomic variations, significantly altered gene expressions, and effectively triggered anti-viral immune responses in STAD and HNSC. EBV-infection and HPVinfection combined with the TIME subtype could be candidate biomarkers of the immunotherapy in STAD and HNSC, respectively. Finally, the eVIIS pipeline could be a practical tool to facilitate clinical practice and relevant studies.