Skip to main content

Estimation of immune cell content in tumor using single-cell RNA-seq reference data



The rapid development of single-cell RNA sequencing (scRNA-seq) provides unprecedented opportunities to study the tumor ecosystem that involves a heterogeneous mixture of cell types. However, the majority of previous and current studies related to translational and molecular oncology have only focused on the bulk tumor and there is a wealth of gene expression data accumulated with matched clinical outcomes.


In this paper, we introduce a scheme for characterizing cell compositions from bulk tumor gene expression by integrating signatures learned from scRNA-seq data. We derived the reference expression matrix to each cell type based on cell subpopulations identified in head and neck cancer dataset. Our results suggest that scRNA-Seq-derived reference matrix outperforms the existing gene panel and reference matrix with respect to distinguishing immune cell subtypes.


Findings and resources created from this study enable future and secondary analysis of tumor RNA mixtures in head and neck cancer for a more accurate cellular deconvolution, and can facilitate the profiling of the immune infiltration in other solid tumors due to the expression homogeneity observed in immune cells.

Peer Review reports


Cancer immunotherapy has made substantial progress and has dramatically impacted the treatment of multiple cancers, including skin cancer, lung cancer, and head and neck cancer. The cellular composition of a tumor and its immune microenvironment varies between patients and tissue types. The presence and higher content of tumor-infiltrating lymphocytes (TILs) is believed to be associated with response to the immunotherapy. In melanoma, it was also found that the composition of immune cells such as CD8+ cytotoxic lymphocytes and dendritic cells are strong prognostic predictors themselves and are associated with overall clinical outcomes. However, there are still considerable technological and analytical barriers to assess cancer and immune cell compositions in the tumor quantitatively. The pathological approaches such as immunohistochemical (IHC) staining and flow cytometry analysis are labor intensive and often involve considerable inter-observer variation. Therefore, the cell decomposition based on existing molecular profiles of tumors has received many attentions in recent years. Earlier work has been centered on whole exome sequencing data. Based on DNA mutational signatures and the distribution of local copy numbers, several methods have been proposed to infer the tumor purity—defined as the proportion of cancerous cells in the tumor tissue. Based on the similar computational model [1], subclonal heterogeneity and somatic homozygosity can also be explored. Previous studies have also attempted to deconvolve gene expression profiles (including microarray and RNA-seq) of tumor samples to infer the stromal and immune cell admixture [2]. These methods leverage distinct transcriptional properties of different cell types, which provide finer granularity in the cell composition estimation than using DNA mutational profiles alone.

The software CIBERSORT has now been widely used in the area to estimate immune cell subsets from tumor expression profiles. But its application has been limited to microarray studies due to the source of the training gene expression panel. Only recently have efforts begun to extend the cell deconvolution method to RNA-seq data and to identify more microenvironment-informative markers. These reference markers were selected from whole transcriptome data and narrowed down through correlating gene expression with tumor purity estimates. The nCounter system (NanoString) has gained popularity in the clinical and translational setting as an alternative tool for immune cell profiling. The advantage of NanoString platform is that it is based on a highly sensitive and non-enzymatic process to enable a more precise quantification of RNA expression, which provides reliable data even with FFPE samples. However, nCounter is a targeted gene expression panel, and the surrogate expression profile cannot differentiate all cell subpopulations. Therefore, there is a pressing need to develop more efficient gene reference panel and related computational tools to quantify the components of tumor microenvironment in situ on a larger scale, which will facilitate both retrospective and prospective studies.

The recent maturation of single-cell RNA sequencing (scRNA-seq) has enabled us to directly profile the cell composition and understand tumor heterogeneity at a cellular level. With newly developed high-throughput cell sorting and barcoding technologies, thousands of individual cells per tumor can be profiled in parallel to capture intra-tumor heterogeneity at an unprecedented resolution [3,4,5]. Unless the main goal of a project is to study underrepresented cell populations, scRNA-seq experiments can be done without the need for cell sorting--which is laborious and prone to considerable bias due to cell death and cell selection. The unbiased and simultaneous characterization of both immune and cancer cell is essential for tracking and forecasting the tumor ecosystem, e.g., in patients before and after immunotherapy. The cellular composition, as well as the relationships between different cell subpopulations, are generally explored by clustering analysis using all gene expression data--most notably, based on the method called t-Distributed Stochastic Neighbor Embedding (t-SNE). Cell types corresponding to each cell cluster can then be inferred based on existing cell-type-specific marker genes and any available prior knowledge about the cells. Furthermore, a differential expression analysis between distinct cell populations may provide new marker genes for cell mixture deconvolution. Nevertheless, large-scale scRNA-seq studies involve expensive sequencing efforts, prohibiting them from being more widely used in practical and clinical settings. There is still considerable interest in the community to drive cell-type-informative markers for facilitating the analysis of bulk tumor sequencing. It thus motivates us to derive more efficient cell-type-informative markers by leveraging high-quality scRNA-seq data generated from existing studies.

Here we investigated gene expression profiles of 6,000 single cells from 15 head and neck squamous cell carcinoma (HNSCC) patients. To allow for a finer deconvolution of immune cell subtypes, we employ an adaptive divide-and-conquer scheme to isolate cell populations in silico. The reference gene expression profile matrix was then built based on identified single cell populations. We show that the reference profiles obtained from single cell expression data enable a more reliable estimation of cellular composition in bulk tumor, and they have ability to discriminate immune cell types with finer granularity. Our work demonstrates that established single cell gene expression in each tumor type can further add value to the digital dissection the tumor microenvironments. We provide these reference matrices and gene panels, namely single-cell gene expression profiles (scGEPs), to the community as a useful resource for studying heterogeneous tumor ecosystems.


Single-cell RNAseq data

We downloaded the single-cell RNA-seq data from Puram et al. [3] which generated expression data of 6,000 single cells from head and neck squamous cell carcinoma (HNSCC) patients. By reviewing all published single-cell RNA-seq data (up to Dec 2018) in cancer, we found that this dataset covered the most diverse stromal, malignment and immune cells in the tumor microenvironment (TME), and relatively large number of patients. Importantly, it provides annotated cells from four T cell major subpopulations: regulatory T-cells (Tregs), convectional CD4+ T cells (CD4+ Tconv), CD8+ T and CD8+ T exhausted. Therefore, single cell expression profiles from Puram et al. study is an ideal source of reference data. Note that expression profiles of malignant cells are highly specific to HNSCC, but we hypothesize that expression reference of immune cells is applicable to other cancer types. After removing the patient samples (MEEI9 and MEEI23) with less than 50 cells, 5712 cells from 16 treatment-naive patients plus matched lymph nodes from three of these patients remained for analysis (Additional file 2: Table S1). As described in Methods in Puram et al. [3] gene expressions were quantified as y = log2 (TPM + 1), where TPM refers to transcripts per million, a gene quantification method that has been considered superior to FPKM (fragments per kilobase per million read) and more robust to differences in RNA library size [6].

Enrichment analysis of cell-type-specific genes

We adapted the single-sample Gene Set Enrichment Analysis, or ssGSEA [7], to calculate the enrichment scores of pre-existing cell-type-specific marker genes. These scores will be used to assist the cell type assignment step to be described in the following sections. ssGSEA is an extension of GSEA method that computes an aggregated enrichment score for a gene set. But instead of gene-phenotype association score, ssGSEA considers rankings of gene expression relative to remaining genes in the genome within each sample, and calculate a score that represents the degree that genes in a gene set are coordinately up- or down-regulated. Signature genes for HNSCC tumor, immune, and stromal cells were obtained from previous studies [3, 8, 9]. To choose the most reliable and generalized signatures, we used only the genes shared by all resources. Together, we collected 140 signature genes covering 15 cell types including HNSCC tumor cells, immune cells, T cell subtypes, and stromal cells. The curated gene list is given in Additional file 3: Table S2. Note that this list alone is not sufficient to be used as a reference panel for the cell content deconvolution with bulk tumor gene expression data. Enrichment of each cell-type signature was assessed using ssGSEA implemented in R package gsva [8].

Cell type identification

Similar to the data analysis presented in Purma study, we choose to use the t-SNE method to visualize the cell clusters and explore the cell type compositions based on transcriptomes of all examined cells. However, as shown in the previous analysis and in the results section, t-SNE method alone is only able to identify clusters of major cell types and not able to distinguish between T cell subpopulations. Furthermore, the location of the clusters in the t-SNE map and their relative positions to other clusters will change across analysis runs. As a limitation of the technique, t-SNE cannot reproduce the same clustering map if different cells or perplexity parameters are chosen in one analysis run. Therefore, we propose to use a multi-stage cell identification scheme for obtaining more accurate cell type inference--by adaptively integrating t-SNE and ssGSEA results. The steps and detailed parameters used are described below.

  1. (1).

    Tumor cell classification: To classify HNSCC malignant cells, we performed t-SNE analysis of all cells using perplexity parameter of 50 followed by DBscan clustering (with parameters eps = 5 and minPts =5). Clusters were classified as malignant cells and non-malignant cells based on their ssGSEA enrichment scores using signature genes for HNSCC tumor cells (Additional file 1: Figure S1A, B). As reported previously in various cancer studies [3, 5], malignant cells were clustered by patients while non-malignant cells were clustered by cell types (Additional file 1: Figure S1C).

  2. (2).

    Non-tumor cell classification: The non-tumor cells identified in step 1 were subjective to a secondary stage of clustering analysis. t-SNE with the perplexity of 30 was performed followed by DBscan clustering (with parameters eps = 6 and minPts =15). These parameters were chosen based on two criteria: (1) the resulted clusters should maximize the degree of differentiation of cell populations; (2) the resulted clusters should have the greatest consensus possible with the ssGSEA metrics. Based on the ssGSEA enrichment scores, clusters are assigned to major immune and stromal cell types including Fibroblasts, B cell, Macrophages, Endothelial cells, Dendritic cells, Mast cells and T cells (Additional file 1: Figure S2A and Additional file 1: Figure S2B).

  3. (3).

    T cell subtype identification: Similar procedure was used to classify T cell subtypes from the lumped T cells population identified in step 2. We performed single-cell consensus clustering (SC3) analysis [10] and were able to identify four distinct clusters of T cell subpopulations. These four clusters were assigned to conventional CD4+ T cells (CD4+ Tconv), T-regulatory cells (Treg), conventional CD8+ T cells (CD8+ Tconv), and exhausted CD8+ T cells, based on their ssGSEA enrichment scores (Additional file 1: Figure S3A and Additional file 1: Figure S3B). Next, differential expression analysis was performed comparing CD4 Tconv vs. Treg cells, and CD8+ Tconv vs. exhausted CD8+ T cells using R package limma [11]. Only genes with |log2FoldChange| > 1 and Benjamini-Hochberg adjusted p-value < 0.05 were considered significantly differentially expressed and reported in Additional file 4: Table S3. The identified differentially expressed genes were compared with previously reported marker genes for these cell types.

scRNA-derived marker genes

To develop a finer panel of cell-type-specific genes, we identified marker genes that are specifically expressed in each cell type. Differential expression analysis was first performed between any pairs of the 11 cell types using R package limma. Then marker genes of each cell type were identified as those significantly highly expressed in cell type under consideration compared to at least 5 other cell types (log2FoldChange > 3 and Benjamini-Hochberg adjusted p-value < 0.05). In total, we identified 581 marker genes and reported the gene names and limma results in Additional file 5: Table S4.

Deconvolution method for bulk tumor

The objective of the deconvolution algorithm is designed to solve for the linear equations m = f × B, where m is the input gene expression profile (GEP) matrix, f is a vector of cell fractions to be estimated, and B is the gene expression signature or reference GEP matrix. A machine learning method, ν-support vector regression (ν-SVR) combining feature selection with a linear loss function and L2-regularisation [12], was used to infer the compositions of the malignant cells, tumor-infiltrating cell types/subtypes, and stromal cells from the bulk gene expression. This method has been implemented in CIBERSOR [13], a tool that has now been widely used for in cancer research. The initial setting of CIBERSORT was designed for estimating 22 immune cell types using 547 signature genes (LM22) derived from microarray data. In this study, we will apply the same SVR method implemented in CIBERSORT to infer cell types that are more representative in head and neck tumors. The reference GEP panels used in SVR will be described in the following section.

In silico assessment of final reference GEP panels

With the availability of high-resolution scRNA-seq data, one main objective of this study is to explore new ways to generate the reference GEP matrices to be used in bulk tumor deconvolution, i.e., the matrix B as described in the previous section. The ideal B matrix should be able to yield maximal and robust discriminatory power between cell type clusters. Meanwhile, the pooled scRNA-seq data can be served as ground truth for benchmarking the performance of reference GEP as well as deconvolution methods—because the true cell composition in the bulk gene expression data will be known. The similar idea has been implemented in a recent study [9]. The first step of constructing reference GEP matrices is to choose a panel of reference genes that can distinguish the cell populations. In this study, we will focus on four gene panels: (1) LM22 gene reference panel, designed by Newman et al.: it contains 547 genes that distinguish 22 human hematopoietic cell phenotypes including several T-cells types, B cells, and natural killer cells. This panel is the default panel used in CIBERSORT and thus has been used extensively; (2) A panel of signature genes identified from previous literature: it contains 140 genes that are served as signatures for 15 major cell types including HNSCC tumor cells, immune cells, T cell subtypes, and stromal cells (Additional file 3: Table S2). (3) The scRNA-derived marker gene panel discovered through the steps described previously in the method: which contains genes that uniquely expressed in each cell population identified from HNSC scRNA-seq data (Additional file 5: Table S4); (4) A T-cell-specific GEP panel discovered through steps similar to GEP panel (3) but with a focus on four T cell subtypes (Additional file 4: Table S3). Note that we only used the gene list information of these panels. The GEP matrix of these genes is formed through averaging all single cells assigned to these populations. In order to assess the prediction performance of the above four GEP panels, we tested them on in silico bulk tumors by aggregating the single cell transcriptome data. Expression data of individual cells from the same patient in Puram study were pooled to form 15 in-silico tumors, which exhibit varied cellular compositions.


Identifiable cell types using HNSCC single cell data

Overall, the adaptive clustering analysis on single-cell transcriptome data pooled from all HNSCC tumor samples identified distinct 11 cell clusters to be used in generating reference GEP. These cells types are: HNSCC Malignant cells, Fibroblasts, Macrophages, Dendritic cells, Endothelial cells, Mast cells, B cells, conventional CD4+ T cells, T-regulatory cells, conventional CD8+ T cells, and exhausted CD8+ T cells. As shown in the t-SNE plot with all cells projected (Fig. 1a), most cells from same immune cell types are grouped together while malignant cell and Fibroblasts cell clusters contains multiple subgroups within each cluster. In the follow-up analyses, we will show that these subgroups are mainly driven by inter-tumor heterogeneity. The cell grouping information was then used to construct the cell composition map back in each tumor. As illustrated in the stacked bar chart in Fig. 1b, the proportions of malignant cells (tumor purity) vary uniformly between 0 and 1. This pattern reflects the original experimental design and is consistent with results from the original analysis [3]. We also observed that some important immune subsets such as tumor-infiltrating Treg cells (coded with dark blue) only exist in tumor samples with lower tumor purity, i.e. sample towards the right side of the plot. Treg cells plays important role as regulators of anti-tumor immune suppression and Treg/CD8+ T cell ratio may have a clinical significance in analyzing tumors in HNSCC patients [14]. However, results from scRNA-seq data suggests that the overall Treg expression signature may be underrepresented in genomic projects that are biased towards tumors with higher purity, such as TCGA. In the following, we briefly describe results generated from each step. First, we observed that the unsupervised clustering on all cells based on t-SNE revealed eight major clusters as depicted in Additional file 1: Figure S1A. Note that, at this stage, we had no information about cell types underlying these cell groups and the number of clusters might differ subject to the perplexity parameter choice in t-SNE. We started the cell type identification from first distinguishing tumor and non-tumor cells. By adding ssGSEA scores representing the tumor cell signature into the t-SNE map (Additional file 1: Figure S1B), we identified two major cluster regions of malignant cells located in the very top and lower regions. By further adding the color layers reflecting the tumor origin, we observed that the cell clusters in these regions were clearly separated by patient IDs while they were mixed together in a mosaic pattern in other cluster regions (Fig. 1c). The results above align with previous findings [3, 5, 9] that inter-tumor heterogeneity may arise more at the tumor malignant cell level than at the immune cell level—suggesting that immune cell signatures abstracted from the proposed scheme will be applicable to not only HNSCC samples generated from different studies but also samples from different tumor types. Next, we performed a second round of t-SNE analysis by excluding all tumor cells identified from previous steps. The new clustering analysis revealed seven major cell clusters (Additional file 1: Figure S2A). We were able to identify the cell types corresponding to each cluster by adding ssGSEA score specific to Fibroblasts, B cell, Macrophages, endothelial cells, dendritic cells, mast cell, and T cells one at each time as depicted in Additional file 1: Figure S2B. As expected, this subset of cell population is dominated by Fibroblasts and T cells. When we adding the color layers reflecting patient origins into Additional file 1: Figure S2A, we found a similar pattern that patient IDs were mixed together in each cell type cluster, indicating that the sub-clusters (such as in the T cells) may reveal further cell subtypes. This leads us to the next step by further zooming into the expression profiles of cells from T cell populations.

Fig. 1
figure 1

Profiling cellular composition of HNSCC tumors using scRNA-seq. a 2D t-sne projection of the expression profiles of 5,712 single cells (3,259 immune cells and 2,453 malignant cells) from 20 HNSCC tumors and lymph node samples of 16 patients. Single cells are shown in dot and colored by cell types. b Cell composition per sample. Patients are ordered by their fractions of malignant cells. c The same 2D t-sne projection as (a) with cells colored by patient origins

Deconvolution of T cell subtypes using identified T cell population

Based on SC3, we further identified four clusters from T cells (Figs. 2a). The cell types in the T cell subpopulation were first determined based on the gene enrichment signatures of CD4+ and CD8+ cells (as shown in the upper panel in Additional file 1: Figure S3B). Within these two subpopulations, CD8+ cells further marked with ssGSEA signatures for CD8+ Tconv and CD8+ exhausted; and CD4+ cells were marked with CD4+ Tconv and Treg cells signature values (Additional file 1: Figure S3B). As shown in Additional file 1: Figure S3, the signatures for two CD8+ cell types are overlapped and it is difficult to assign these cells to any subtypes. As further summarized in the heatmap of ssGSEA scores (Additional file 1: Figure S4), the ssGSEA analysis based on curated signature genes were able to distinguish between major cell types using single cell level expression data but failed to provide the necessary granularity in separating T cell subtypes. To determine T cell subtypes, especially CD8+ subtypes, we performed differential expression analysis between the two cell groups identified within CD4+ T cells and CD8+ T cells. Differentially expressed genes (adjusted p value < 0.05, limma moderated t-test, and |log2fold-change| > 1) are reported in Additional file 4: Table S3. Cell subtypes were then inferred from the status of top differentially expressed genes, by comparing them with existing cell-type-specific marker genes. Figure 2b and c are heatmaps depicting top differentially expressed genes between CD8+ cell clusters and CD4+ cell clusters, respectively. Candidate genes that overlapped with marker genes identified from previous studies are listed and labeled in heatmaps. Note that several exhaustion-related genes can serve as markers for separating both subtypes in CD4 and CD8, such as TIGIT and CTLA4. For the CD8+ T cell subtypes, we compared the candidate marker genes identified in our DE analysis to the exhausted CD8+ T cells marker genes reported in a previous single-cell RNA-seq from infiltrating T cells of lung cancer [15]. A total of 36 genes are found shared by the two studies and all labeled in Fig. 2b. Among these 36 genes also includes 14 known exhaustion markers, such as PCCD1, TIGIT, HAVCR2, and CTLA4 (Fig. 2b, text in red), which further confirmed the identify of these exhausted CD8+ T cells. The other CD8+ T cell cluster without expression of exhaustion genes is considered as conventional CD8+ T cells. For the CD4+ T cell subtypes, we also compared the candidate marker genes identified from the DE analysis with the Tregs marker genes reported by four previously published scRNA-seq data from different cancer types [15,16,17,18] (Fig. 2d). We observed that there were 20 genes shared by all five studies (Fig. 2c, text in red), including known Tregs markers FOXP3, TIGIT, and CLTA4; and there were many more genes previously identified at least once (Fig. 2e). Our study also identified 207 genes that uniquely enriched in this HNSCC dataset (Fig. 2e), including PPP1CA, RUNX3, CCR6, and PSMB8 which were previously reported to be associated with Tregs and their functions [19,20,21,22]. Based on these observations, we assigned Tregs to this cluster of CD4+ T cells. The other CD4+ cluster with low expression of exhaustion markers and with exclusively high expression of CCR7, CXCR4, and TOBI was considered as conventional CD4+ T cells.

Fig. 2
figure 2

Deconvolution of T cell subtypes. a 2D t-sne projection of T cells. T cell subtypes identified by clustering analysis are annotated and marked by color codes. b Heatmap of genes significantly expressed in exhausted CD8+ T cells comparing to conventional CD8+ T cells (adjusted p-value <0.05, log2fold-change > 1). Genes also reported by a previous study are labeled on left, of which the known exhaustion markers are labeled in red text. Cell types are indicated by the colored bar at top. c Heatmap of genes differentially expressed in Tregs comparing with conventional CD4+ T cells (adjusted p-value < 0.05, |log2fold-change| > 1). Selective Treg genes are labeled in dark blue and known markers for conventional CD4+ T cells are labeled in light blue. d Comparing Treg genes identified in (c) with Treg genes reported by previous four studies. The combination matrix at the bottom indicates all intersections of any of the five studies. If a study is participating in an interaction, the corresponding matrix cell is filled with black. All studies participating in the same interaction are linked by lines. The bars above the combination matrix encode the size of each intersection. The 20 Treg genes shared by all five studies are highlighted in orange and also labeled in (c). e Volcano plot of genes differentially expressed in Tregs vs. conventional CD4+ T cells. Unique genes found by this study are labeled in green. Those identified once (blue), twice (red), and three times (pink) previously are also labeled

Evaluation of prediction performance of reference GEPs

For each cell type identified from previous steps, we established cell-type-specific reference GEP matrix by the mean expression values of selected genes. We use C1 to denote the curated gene list from previous literatures which are used in ssGSEA (Additional file 3: Table S2), C2 to denote marker genes selected from the DE analysis described above (Additional file 5: Table S4), T1 to denote the marker genes selected from DE analyses for separating T cell subtypes (Additional file 4: Table S3), and M1 to denote marker genes selected from DE analyses for separating tumor and non-tumor cells. In our analysis, we constructed reference GEP matrices by taking the mean from the following ensemble gene lists: (1) LM22, (2) C1, (3) C2, (4) LM22 + C1, (5) LM22 + C1 + T1, (6) LM22 + C1 + T1 + M1, and (7) LM22 + C1 + C2 + T1 + M1. As presented in Additional file 1: Figure S5, we evaluated the prediction performance of CIBERSORT using these GEPs in terms of correlation between predicted abundance and the true abundance in the simulated bulk tumor (through pooling all cells in one patient, see Methods). We observed that all of these reference GEPs achieved promising prediction accuracies (r > 0.9). This result indicates that existing marker genes provides saturated signatures if forming GEPs on right cell groups. Therefore, we will focus on the evaluation of the LM22 + C1 gene panel because of it has a moderate number of genes and all genes included are well studied. All reference GEPs matrices used in this study are provided in Additional file 6: Table S5.

Scatterplots in Fig. 3a demonstrate strong correlations between true cell proportions and predicted cell proportions based on GEP curated form LM22 + C1 scRNA-seq data, where each point represents a simulated bulk sample. Figure 3b further compares the cell abundance estimation accuracy (correlation) for the reference GEP included in CIBERSORT and the reference GEP trained based on the LM22 + C1 scRNA-seq panel. Our method shows better prediction performance in all case for cell types that CIBERSORT can provide estimation, especially in estimating CD8 T cells. We further gauged the estimated cell proportion from CIBERSROT by taking into account the fact that the original GEP only include reference for immune cells. Such adjustment was made by assuming that tumor cell (purity) and stromal cell proportion were known so that a relative abundance on each remaining cell types can be calculated. Even with this unrealistic scenario, the prediction performance based on the adjusted proportion was still inferior to the scRNA-seq trained GEP in all cases. But we did observe that CIBERSORT estimation on macrophages and dendritic cells was greatly improved with this adjustment (Additional file 1: Figure S6). To test the robustness of the GEP panel to the cell components, we re-run the devolution analysis on all simulated samples using the leave-one-out GEP, i.e. each time we remove one cell-type-specific vector from the GEP matrix. As shown in Additional file 1: Figure S7, the high prediction accuracy was maintained in most scenarios, and only the estimations for fibroblasts and malignant cells were detectably impacted by the leave-one-out GEP.

Fig. 3
figure 3

Estimation accuracy of cellular compositions using LM22 + C1 scGEPs. a Scatter plots of the estimated and true cell proportions for the 20 simulated bulk tumor samples. Each dot represents one sample and r denotes the Pearson’s correlation coefficient. b Estimation accuracy of LM22 + C1 scGEPs and CIBERSORT microarray GEPs. Estimation accuracy is measured as the Pearson’s correlation coefficient between true cell proportions and the estimated proportions. The value of Pearson’s correlation coefficients is coded by both area and color of the pie charts for CIBERSORT microarray GEPs (top) and LM22 + C1 scGEPs (bottom). Larger pie slices and darker blue represent larger Pearson’s correlation coefficients thus higher accuracy. The missing cell types in CIBERSORT microarray GEPs are denoted by dashes. T cell composition is calculated taking sum of the four T cell subtypes

Although C2 and T1 gene sets (determined based on DE tests) did not provide additional information as a gene panel in constructing GEP, they provide a new alternative cell-type-specific biomarker for future studies. As shown in violin plots (Additional file 1: Figure S14), these markers are exclusively over-expressed in cell types that they are representing, indicating their validity as independent surrogate biomarkers. A total of 182 genes were found overlapping between groups C2 + T1 and LM22 + C1. Expressions of these genes for each single cell were plotted in Additional file 1: Figure S15, demonstrating their ability as biomarker panel alone to separate major cell types but not T cell subtypes.

Finally, as a supportive validation, we tested the proposed scGEP on TCGA HNSCC tumor samples and compared with results generated from similar methods developed for bulk tumor deconvolution. Additional file 1: Figure S8 compared the tumor purity estimates with three other methods: ABSOLUTE [1], ESTIMATE [2] and CPE [23]. These methods are based on WES, RNAseq and a consensus score based on all molecular data. Our method showed the best correlation with the estimation from ESTIMATE in terms of purity estimation. Further, we compared the Immune and Stromal score predicted by ESTIMATE with the absolute proportion estimates from the scGEP-based method. As shown in Additional file 1: Figure S9, the analysis showed a good agreement between two methods. We also compared the estimated total immune cell proportions and total T cell proportions between HPV positive and HPV negative cancer patients. As expected, tumors from HPV positive patients showed higher infiltration of immune cells and T cells (Additional file 1: Figure S10). Abundance of tumor infiltrating CD8 and total immune cells were also found associated with survival outcomes in TCGA HNSCC patients (Additional file 1: Figure S11).


scRNA-seq provides high resolution data to study cell heterogeneity, and provides new chance to understand the dynamic ecosystem comprising tumor cells, fibroblasts, and immune cells. Nevertheless, gene expression data from bulk tumors is indispensable and still dominates the clinical and translational settings. In this study we developed a pipeline to construct the reference gene expression profile matrix based on scRNA-seq data (scGEP), and assessed its performance in estimating cancer and immune cell compositions from bulk tumor gene expression data. By combining gene expression profiles of major cancer and immune cell types in HNSCC established from a high-quality single cell data, our approach overcomes a key shortcoming of most existing studies that relied on limited source of FACS-purified cell populations for the reference signature gene matrix. As noted in previous studies, PBMC-based GEP is also insufficient to provide accurate estimate on bulk tumor samples. The scGEP matrix derived from our analysis provides a new resource for future endeavors in analyzing expression data in head and neck cancers. The estimation on tumor purity will be greatly improved with the tailored reference signature for HNSCC malignant cells. Importantly, more accurate estimation on cancer cells partly contributes to better estimation on the relative abundance of immune cells. We validated results by using in silico pooled bulk tumor samples, and also showed that single-cell-derived signatures provides the ability to separate T cell subtypes. The finer and more accurate tumor immune profiling of HNSCC samples will help reveal more prognostic biomarkers with implications for immunotherapy. Furthermore, because immune cell share very similar expression profiles across cancer types, in theory the reference matrix can be broadly employed to other solid tumors, but it will only provide relative abundance for immune cell types. With the increased availability of single-cell data in cancers such as melanoma and lung cancers, an ideal scGEP matrix should be generated based on the same tumor type using the proposed pipeline.

The key step in constructing scGEP matrix involves accurately identifying cells of the same types or subtypes from heterogeneous populations, which is the in-silico equivalent of isolating cells using physical sorting methods. Compared to traditional sorting methods such as FACS, in-silco methods are less time consuming, less laborious, and more cost effective. Cell type determination at cellular level have benefited greatly from specialized clustering methods developed for scRNA-seq [10, 16, 24,25,26]. While there are more advanced approaches including deep learning [27, 28] have been proposed in recent years, fully automated decomposition of cell types is still a challenging problem. Part of the difficulty arises from the fact that each tumor includes a large variety of malignant and nonmalignant cells at different stages. The cellular mixing component and proportions even with the same section of a tumor can be very different if sampled under different time or conditions, e.g., before or after treatment. In addition, due to the limitations of the scRNA-seq technology itself, single cell gene expression data are often very noisy. And hence cells of the same type can end up in different clusters, and cells of different types can be in the same cluster due to unknown technology batch effects. Therefore, it is important to carefully curate and select high-quality cell clusters before calculating cell-type-specific reference matrix. In this study, we adopted an adaptive divide-and-conquer scheme to identify all major cell types in HNSCC tumor tissues, starting from the easiest split of cancer vs. non-cancer cells to the most challenging T cell subtype separation. In every step of the process, cell types are inferred based on both the results from the unsupervised clustering analysis and the expression status of existing marker genes. Any prior knowledge about the cellular component of the studied tumor type also helps in assigning a cell cluster to a cell types. In later stages of the analysis where cell subtypes are getting harder to distinguish, multiple settings or even multiple methods of clustering analysis need to tired. This process cannot be automated due to the need to visually inspect the clustering results in each step, but can achieve the best possible results for cell mixture deconvolution. The adaptive method was also based on a key assumption and it was further demonstrated in our clustering analysis: despite the significant heterogeneity of malignant cells across tumors, cells from the same immune cell types can be clustered together due to their relatively similar gene expression profiles.

It is important to highlight main advantage of using the data from Puram et al. as training dataset: it by far contains the largest collection of single cells from solid tumors (in terms of patient number and cell number) from a single study. The above-described property allows us to calculate the composite reference GEP not only from pooled cells from different tumor samples, but also from different scRNA-seq experiments. The single-cell HNSCC data is complementary to the TCGA bulk tumor data in that, while TCGA designs have been focused on tumor regions, scRNA-seq experiments can capture more immune cells in the surrounding stroma or tumor margin, where a higher amount of lymphocytes such as Treg cells might reside in. A caveat is that data pooled across studies involves more complicated batch effects and it is by now generally accepted that correcting for the batch effect in RNAseq data across experiments is technically challenging. It is interesting to note, though, that some recently proposed ideas for batch effect correction with scRNA-seq data are based on consensus clustering, which leverages the same philosophy mentioned above by projecting more homogenous immune cells into the same cluster. As pointed out in the original analysis, some apparent batch effect observed may be linked to the enzyme used for reverse transcription in the scRNA experiments. We further investigated the factor of enzyme usage in the adaptive clustering scheme and found that it explains well the sub-clusters observed in Fibroblast cell populations (Additional file 1: Figure S12), but had limited impact on other cell types (Additional file 1: Figure S13).

One notable observation in our simulation studies is that the GEP calculated based on existing marker gene panel (LM22 + C1) can provide as accurate a predictive capability as the genes selected only from the differential expression of single cell populations, although they are overlapped in many genes. We conclude that the prediction performance of GEP is more sensitive to the cell populations purified for a particular cell type than the marker gene panel. Nevertheless, the newly discovered genes from the scRNA-seq data and their underlying pathway warrant further validations as potential biomarkers, especially those genes that are differentially expressed between T subtypes.

Although we have only tested support vector regression method for cell mixture estimation, the HNSCC single cell sequencing data curated from this study provides a useful source for the assessment of accuracy of newly developed deconvolution methods. For example, the core SVR algorithm implemented in CIBESRORT only uses a single kernel under the fixed default parameter setting. The prediction performance might be improved through searching for an optimal kernel or using the state-of-the-art multiple kernel learning technologies [24]. Currently there is a lack of suitable benchmark dataset that allows a fair and systematic evaluation of methods for estimating cell mixtures in solid tumors. Weak correlations were often found between molecular-data-based estimations and pathology based methods such as IHC and H&E images [29]. This is partially due to the fact that each of these assays was carried out using input materials from different parts of a tumor. Because all cell proportions are known, the in silico pooled bulk tumor data from individual cells provides a more accurate reference at almost zero cost. Plus the composite cells from a single tumor could better mimic the real case scenario than creating bulk expression dataset through conducting RNA-seq on randomly mixed cells. For head and neck cancer per se, the scRNA-seq data from Puram study provide an ideal source for both training and validation purposes because the studied tumors have (1) uniformly varied tumor purity, and (2) it provides reference for subpopulations such as exhaustive CD8+ T cell that were not present in previous scRNA-seq experiments on melanoma and lung cancers. A limitation of the in silico method is that the cell size factor has not been taken into account. As cell types of different size have different amount of RNA yield, it is of interest for future research to be able to adjust for the cell size factor so that the estimated relative abundance will be closer to absolute cell proportions.

The key idea proposed in this work is most similar to a previous study conducted by Schelker et al. [9] which focused on scRNA-seq data from melanoma and PBMC. The two main differences between the two works are (1) the melanoma data used by Schelker et al. only provided sufficient information for distinguish nine major cell types and three T cell subtypes, whereas the HNSCC data we studied was able to further separate exhaustive CD8 T cells and provide corresponding reference GEP; (2) In our method, we used both marker gene information and a global ssGSEA scores to determine cell types from adaptive clustering analysis. We believe that more studies along this line will be conducted to generate more accurate cancer-type-specific and T-cell-subtype-specific reference GEP. Finally, we believe that apart from looking for reference profiles based on gene expression, the same approach can be extended in future search to identify reference DNA methylation profiles (DMP). DMP will be a promising new resource for tumor composition deconvolution because Alternations at DNA methylation level are deemed to be more stable than the gene expression level. But the single-cell DNA methylation analysis, such as bisulfite sequencing, is still in an experimental phase.


We developed a novel scheme for characterizing cell compositions from bulk tumor gene expression by integrating signatures learned from scRNA-seq data. Findings and reference panels created from this study enable future and secondary analysis of tumor RNA mixtures in head and neck cancer for a more accurate cellular deconvolution, and can facilitate the profiling of the immune infiltration in other solid tumors due to the expression homogeneity observed in immune cells.

Availability of data and materials

All data generated during this study are included in this published article and its supplementary information files. All single-cell data used in this analysis were downloaded from the published literature cited in this paper.


CD4+ Tconv:

conventional CD4+ T cells

CD8+ exhausted:

exhausted CD8+ T cells

CD8+ Tconv:

conventional CD8+ T cells


gene expression profiles


head and neck squamous cell carcinoma




single-cell gene expression profiles


single-cell RNA sequencing


the single-sample Gene Set Enrichment Analysis


tumor-infiltrating lymphocytes


tumor microenvironment


transcripts per million


t-Distributed Stochastic Neighbor Embedding


ν-support vector regression


  1. Carter SL, Cibulskis K, Helman E, McKenna A, Shen H, Zack T, Laird PW, Onofrio RC, Winckler W, Weir BA, et al. Absolute quantification of somatic DNA alterations in human cancer. Nat Biotechnol. 2012;30:413.

    Article  CAS  Google Scholar 

  2. Yoshihara K, Shahmoradgoli M, Martínez E, Vegesna R, Kim H, Torres-Garcia W, Treviño V, Shen H, Laird PW, Levine DA, et al. Inferring tumour purity and stromal and immune cell admixture from expression data. Nat Commun. 2013;4:2612.

    Article  Google Scholar 

  3. Puram SV, Tirosh I, Parikh AS, Patel AP, Yizhak K, Gillespie S, Rodman C, Luo CL, Mroz EA, Emerick KS, et al. Single-cell transcriptomic analysis of primary and metastatic tumor ecosystems in head and neck Cancer. Cell. 2017;171(7):1611–1624.e1624.

    Article  CAS  Google Scholar 

  4. Tanay A, Regev A. Scaling single-cell genomics from phenomenology to mechanism. Nature. 2017;541(7637):331–8.

    Article  CAS  Google Scholar 

  5. Tirosh I, Izar B, Prakadan SM, Wadsworth MH 2nd, Treacy D, Trombetta JJ, Rotem A, Rodman C, Lian C, Murphy G, et al. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science (New York, NY). 2016;352(6282):189–96.

    Article  CAS  Google Scholar 

  6. Wagner GP, Kin K, Lynch VJ. Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory Biosci. 2012;131(4):281–5.

    Article  CAS  Google Scholar 

  7. Barbie DA, Tamayo P, Boehm JS, Kim SY, Moody SE, Dunn IF, Schinzel AC, Sandy P, Meylan E, Scholl C, et al. Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1. Nature. 2009;462(7269):108–12.

    Article  CAS  Google Scholar 

  8. Şenbabaoğlu Y, Gejman RS, Winer AG, Liu M, Van Allen EM, de Velasco G, Miao D, Ostrovnaya I, Drill E, Luna A, et al. Tumor immune microenvironment characterization in clear cell renal cell carcinoma identifies prognostic and immunotherapeutically relevant messenger RNA signatures. Genome Biol. 2016;17(1):231.

    Article  Google Scholar 

  9. Schelker M, Feau S, Du J, Ranu N, Klipp E, MacBeath G, Schoeberl B, Raue A. Estimation of immune cell content in tumour tissue using single-cell RNA-seq data. Nat Commun. 2017;8(1):2032.

    Article  Google Scholar 

  10. Kiselev VY, Kirschner K, Schaub MT, Andrews T, Yiu A, Chandra T, Natarajan KN, Reik W, Barahona M, Green AR, et al. SC3: consensus clustering of single-cell RNA-seq data. Nat Methods. 2017;14(5):483–6.

    Article  CAS  Google Scholar 

  11. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43(7):e47.

    Article  Google Scholar 

  12. Schölkopf B, Smola AJ, Williamson RC, Bartlett PL. New support vector algorithms. Neural Comput. 2000;12(5):1207–45.

    Article  Google Scholar 

  13. Newman AM, Liu CL, Green MR, Gentles AJ, Feng W, Xu Y, Hoang CD, Diehn M, Alizadeh AA. Robust enumeration of cell subsets from tissue expression profiles. Nat Methods. 2015;12(5):453–7.

    Article  CAS  Google Scholar 

  14. Mandal R, Şenbabaoğlu Y, Desrichard A, Havel JJ, Dalin MG, Riaz N, Lee K-W, Ganly I, Hakimi AA, Chan TA, et al. The head and neck cancer immune landscape and its immunotherapeutic implications. JCI insight. 2016;1(17):e89829.

    Article  Google Scholar 

  15. Zheng C, Zheng L, Yoo J-K, Guo H, Zhang Y, Guo X, Kang B, Hu R, Huang JY, Zhang Q, et al. Landscape of infiltrating T cells in liver Cancer revealed by single-cell sequencing. Cell. 2017;169(7):1342–1356.e1316.

    Article  CAS  Google Scholar 

  16. Duò A, Robinson MD, Soneson C. A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000Research. 2018;7:1141.

    Article  Google Scholar 

  17. Plitas G, Konopacki C, Wu K, Bos PD, Morrow M, Putintseva EV, Chudakov DM, Rudensky AY. Regulatory T cells exhibit distinct features in human breast Cancer. Immunity. 2016;45(5):1122–34.

    Article  CAS  Google Scholar 

  18. De Simone M, Arrigoni A, Rossetti G, Gruarin P, Ranzani V, Politano C, Bonnal RJP, Provasi E, Sarnicola ML, Panzeri I, et al. Transcriptional landscape of human tissue lymphocytes unveils uniqueness of tumor-infiltrating T regulatory cells. Immunity. 2016;45(5):1135–47.

    Article  Google Scholar 

  19. Sugai M, Aoki K, Osato M, Nambu Y, Ito K, Taketo MM, Shimizu A. Runx3 is required for full activation of regulatory T cells to prevent colitis-associated tumor formation. J Immunol. 2011;186(11):6515–20.

    Article  CAS  Google Scholar 

  20. Yamazaki T, Yang XO, Chung Y, Fukunaga A, Nurieva R, Pappu B, Martin-Orozco N, Kang HS, Ma L, Panopoulos AD, et al. CCR6 regulates the migration of inflammatory and regulatory T cells. Journal of immunology (Baltimore, Md: 1950). 2008;181(12):8391–401.

    Article  CAS  Google Scholar 

  21. Ruszkowski J, Lisowska KA, Pindel M, Heleniak Z, Dębska-Ślizień A, Witkowski JM. T cells in IgA nephropathy: role in pathogenesis, clinical significance and potential therapeutic target. Clin Exp Nephrol. 2019;23(3):291–303.

    Article  CAS  Google Scholar 

  22. Li Z, Li D, Tsun A, Li B. FOXP3+ regulatory T cells and their functional regulation. Cell Mol Immunol. 2015;12(5):558–65.

    Article  Google Scholar 

  23. Aran D, Sirota M, Butte AJ. Systematic pan-cancer analysis of tumour purity. Nat Commun. 2015;6:8971.

    Article  CAS  Google Scholar 

  24. Wang B, Ramazzotti D, De Sano L, Zhu J, Pierson E, Batzoglou S. SIMLR: a tool for large-scale genomic analyses by multi-kernel learning. PROTEOMICS. 2018;18(2):1700232.

    Article  CAS  Google Scholar 

  25. Butler A, Hoffman P, Smibert P, Papalexi E, Satija R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol. 2018;36:411.

    Article  CAS  Google Scholar 

  26. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC bioinformatics. 2008;9:559.

    Article  Google Scholar 

  27. Lopez R, Regier J, Cole MB, Jordan MI, Yosef N. Deep generative modeling for single-cell transcriptomics. Nat Methods. 2018;15(12):1053–8.

    Article  CAS  Google Scholar 

  28. Cho H, Berger B, Peng J. Generalizable and scalable visualization of single-cell data using neural networks. Cell Systems. 2018;7(2):185–191.e184.

    Article  CAS  Google Scholar 

  29. Saltz J, Gupta R, Hou L, Kurc T, Singh P, Nguyen V, Samaras D, Shroyer KR, Zhao T, Batiste R, et al. Spatial organization and molecular correlation of tumor-infiltrating lymphocytes using deep learning on pathology images. Cell Rep. 2018;23(1):181–193.e187.

    Article  CAS  Google Scholar 

Download references


The authors would like to thank Colleagues at Department of Biostatistics and Bioinformatics at Moffitt Cancer Center for providing feedback.


This work was supported in part by Institutional Research Grant number 14–189-19 from the American Cancer Society, and a Department Pilot Project Award from Moffitt Cancer Center. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Author information

Authors and Affiliations



All authors read and approved the final manuscript. XY, YAC, JRC, CHC and XW conceived the study. XY, YAC and XW designed the algorithm. XY and XW performed the analyses. XY, JRC and CHC interpreted the results. XY, YAC, JRC, CHC and XW contributed to literature search, analysis of head and neck data, and were major contributors in manuscript writing and revision.

Corresponding author

Correspondence to Xuefeng Wang.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional files

Additional file 1:

Figure S1. Separating malignant cells from immune and stromal cells. Figure S2. Identifying major immune and stromal cell-type clusters. Figure S3. ssGSEA scores for CD4+ and CD8+ T cell subtypes. Figure S4. Heatmap of the ssGSEA scores of 12 cell types across all single cells. Figure S5. Estimation accuracy of cellular compositions using different scGEPs. Figure S6. Comparing the estimation accuracy between LM22 + C1 scGEPs and CIBERSORT microarray GEPs, based on adjusted proportion. Related to Fig. 3b. Figure S7. The impact of missing cell types in scGEPs. Figure S8. Malignant cell proportion estimated by scGEPs vs. tumor purity in TCGA HNSCC bulk tumor RNA-seq data. Figure S9. Comparing the estimated immune and stromal proportions with ESTIMATE scores in TCGA HNSCC bulk tumor RNA-seq data. Figure S10. Association between immune/T cell proportion and HPV status in TCGA HNSCC. Figure S11. Association of estimated cellular compositions with overall survival in TCGA HNSCC patients. Figure S12. Identification of fibroblast cell subtypes. Figure S13. Batch effect of enzyme treatment. Figure S14. Expression of DE markers (T1) across all cells stratified by cell types. Figure S15. Expression of genes shared between C2 + T1 and LM22 + C1 across all single cells stratified by cell types. (PDF 9315 kb)

Additional file 2:

Table S1. Patient origins of tumor and lymph node samples, related to Figure S1. (CSV 1 kb)

Additional file 3:

Table S2. Cell-type specific signature genes used in ssGSEA. (CSV 2 kb)

Additional file 4:

Table S3. Differentially expressed genes between T cell subtypes, related to Fig. 2. Differentially expressed genes between CD4+ T cell subtypes in sheet 1. Differentially expressed genes between CD8+ T cell subtypes in sheet 2. (XLSX 132 kb)

Additional file 5:

Table S4. Cell-type specific marker genes identified from HNSC scRNA-seq data. (XLSX 304 kb)

Additional file 6:

Table S5. The seven reference GEPs matrices constructed using scRNA-seq data, related to Additional file 1: Figure S5. (XLSX 640 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Yu, X., Chen, Y.A., Conejo-Garcia, J.R. et al. Estimation of immune cell content in tumor using single-cell RNA-seq reference data. BMC Cancer 19, 715 (2019).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: