The study protocol was reviewed and approved by the institutional review board (IRB) of Cathay General Hospital. Written informed consent was obtained from all the participants after explanation by the investigators (MHS and CCH). An overview of the study design is depicted in Fig. 1.
CRC samples
CRC samples were prospectively and consecutively collected during surgery. Enrollment criteria included CRC patients who had never received neoadjuvant therapy, were in clinical stages 0 (in situ) to III (no systemic spread) and had no concurrent secondary malignancy. Enrolled patients were managed according to standard guidelines with regular follow-up. All patients with resectable CRC were treated with curative surgeries.
The cancerous tissues were snap frozen and stored in liquid nitrogen below − 80 °C with RNAlater reagent (Qiagen, Germantown, MD) to stabilize RNA. The frozen samples were dissected into slices of 1-2 mm thickness, and more than 90% cancerous content was a prerequisite for microarray experiments.
Microarray experiments: GE arrays
Total RNA was extracted from frozen specimens using TRIzol reagent (Invitrogen, Carlsbad, CA). Purification of RNA was performed using a RNeasy Mini Kit (Qiagen, Valencia, CA) according to the manufacturer’s instructions. The minimal RNA concentration was set to 100 ng/μl (25 μl) per sample. RNA integration was checked by gel electrophoresis with 2 bands of 18S and 28S indicating satisfactory RNA quality, and an RIN value > 8.0 was pursued to remove heavily degraded samples. Illumina Human HT-12 BeadChip v4.0 (Illumina, San Diego, CA) was used for GE experiments, which provided genome-wide transcriptional coverage of well-characterized genes. Each array on HumanHT-12 v4.0 Expression BeadChip targeted more than 47,000 probes derived from NCBI Reference Sequence (RefSeq) Release 38 (November 7, 2009) and other sources. GE data were normalized using Illumina BeadStudio software Gene Expression Module, and the generated final report files were exported for further analyses, including the AVGSignal spreadsheet.
Microarray experiments: SNP arrays
DNA was extracted from cancerous tissues using a QIAamp DNA mini kit (Qiagen, Valencia, CA) from the same subject undergoing GE experiments. A minimum of 4 μg DNA was required. A Bioanalyzer 2100 (Agilent, Santa Clara, CA) was used to verify the purity and concentration of cancerous DNA with quality control indicated by OD260/280 > 1.8. Illumina Human Omni 25 BeadChip v1.1 was used for SNP array experiments, which featured approximately 2.5 million markers that captured variants down to a minor allele frequency (MAF) of 2.5% and delivered whole genomic coverage across diverse populations. Structural variations, mainly copy number variants (CNVs), were detected. Illumina HiScan array scanners supported genotyping, CNV, and GE profiling. Projects created with BeadStudio were exported, with three spreadsheets, namely, Genotype, Intensity, and BAlleleFrequency, reported separately.
CNV detection
CNV detection began with segmentation of normalized data (Intensity spreadsheet from BeadStudio), followed by identification of common (recurrent) gains and losses across multiple SNP arrays. Circular binary segmentation (CBS) was used to identify regions in each chromosome such that copy numbers in each region were equal [20, 21]. The significance level for the test to accept change points was set to 0.01, and the number of permutations was 1000. The Smoothing and MergeLevels algorithm were applied to enhance efficiency [22, 23]. Based on segmented log ratios, the copy number at a particular genomic location was determined using the median absolute deviation (MAD) of log ratios of each array. High-level CNV (amplification and homozygous deletion) was assigned to regions with segmentation mean log ratios > 1 and < − 1 timed the MAD of each corresponding array. The thresholds for low-level CNV (both gains and losses) were 0.5 and − 0.5 MAD, respectively. Pathway enrichment analyses were based on the BioCarta (URL: https://maayanlab.cloud/Harmonizome/dataset/Biocarta+Pathways) database, evaluating the association between a pathway and regions of gain/loss with an empirical P-value by 1000 times random sampling.
Regions of recurrent CNV within a cohort of samples were identified using the Genomic Identification of Significant Targets in Cancer (GISTIC, ref. [24]). A null distribution of G scores was generated based on 10,000 resamplings. The significance of CNV at a particular genomic location was determined based on a statistical test obtained from the segmentation log ratios of assayed samples. All bioinformatics analyses of CNV were conducted with the CGH Tools v1.3, part of the BRB-ArrayTools [21]. Results of pathway enrichment and GISTIC analyses were reported for CRC cases assayed for SNP arrays.
Concurrent gains and losses
Concurrent gains and losses were detected from common genes across SNP and GE microarrays by using HUGO gene symbols as identifiers. The process of mapping between SNP and GE microarray platforms was performed with the SOURCE (URL: https://source-search.princeton.edu/) or Clone/Gene ID Converter (URL: https://cran.r-project.org/web/packages/IDConverter/index.html), depending on which method provided the greatest number of reliable conversions. For probe reduction, multiple probes/probe sets were reduced to one per gene symbol by using the most variable probe/probe set measured by IQR across arrays.
We integrated GE and CNV data to identify genes whose transcriptional abundance was impacted by CNV. A Gene-centric table was outputted by the CGH Tools and detailed the average log-intensity ratio (calculated from all markers within a gene) rather than the discrete CNV status per gene, which was also used to deduce a value corresponding to each gene for each array in the array-covered genomic regions. This value was used to calculate correlations between CNV and GE arrays to distinguish concurrent genes. Concurrent gains and losses were declared if significant changes in a coherent manner were observed for both GE and SNP microarrays (assessed by Spearman correlation coefficients with P values < 0.05).
Concurrent gene signature and classification algorithms
Concurrent genes were identified from Taiwanese CRC and The Cancer Genome Atlas (TCGA) data. The TCGA-COAD (colon adenocarcinoma) Project level 3 dataset of 345 patients was assayed for both GE and CNV profiles using Agilent 4502A (Agilent, Santa Clara, CA) and Affymetrix Genome-wide SNP 6.0 (Thermo Fisher Scientific, Waltham, MA) microarrays. Clinical, CNV, and GE data were downloaded under the synapse ID syn1461155 as the secondary discovery dataset from the URL (URL: https://www.synapse.org/, ref. [25]). Gene mapping was performed as described in Method E, with an additional source of NetAffy (URL: https://www.affymetrix.com/analysis/netaffx_analysis_center_retired.html).
Rather than identifying common genes from both discovery datasets, the “universal” concurrent gene set was derived statistically. Fisher’s z transformation was used to combine correlation estimates from concurrent genes identified from Taiwanese populations and those from the TCGA-COAD dataset with the mathematical formula as follows:
$${\mathrm Z}_r=\tanh^{-1}(r)=\frac12\log\left(\frac{1+\mathrm r}{1-\mathrm r}\right)$$
$$V\left({\mathrm Z}_r\right)=\frac1{n-3}$$
The combined correlations from independent samples were:
$$\overline Z=\frac{\left(n1-3\right)\;z1+\left(n2-3\right)\;z2}{n1+n2-6}$$
$$\overline r=\tanh\left(\overline Z\right)$$
$$V\left(\overline Z\right)=\frac1{\left(n1+n2-6\right)}$$
where z is Fisher’s z-transformation, r is the sample correlation, V is variance, and n is the sample size. The universal concurrent gene set was filtered with the predefined threshold of a 10−3 α level and was used for downstream prognostic model construction. SAS/STAT software version 15.1 (SAS Institute Inc., Cary, NC) with the CORR procedure was used for the estimation of z-transformed correlation coefficients.
Microarray datasets
Publicly available microarray datasets were retrieved and fulfilled the purpose of training and validation of the risk predictive model. The primary outcomes were relapse-free or overall survival, and the secondary outcomes were adverse events following curative therapy, such as recurrence, metastasis, or mortality (all were dichotomous outcomes without survival data). Datasets that met the outcome variables were retrieved and are detailed in Supplementary Table 1 (GSE12945, GSE14333, GSE17538, GSE39582, and TCGA_COAD with survival data) and Supplementary Table 2 (GSE5206, GSE9348, GSE18088, GSE18105, and GSE64857 with dichotomous outcomes but without survival data).
Risk predictive model
A CRC risk predictive model for relapse-free/overall survival was constructed using supervised principal component regression [26]. Concurrent genes were first filtered by the univariate Cox proportional hazards regression, and significant genes within a stringent α level of 0.001 were further used to synthesize the first principal component (supergene), which was subsequently used in risk prediction. A continuous prognostic index score was calculated based on the first principal component for each subject within a dataset, and the high- and low-risk groups were defined by the 50th percentile prognostic index score (noninformative prior). A sensitivity analysis was performed with the prognostic index score cutoff between the predicted high- and low-risk group determined by the lowest censored percentage across all studies (the 75th percentile, Supplementary Table 6).
For dichotomous outcomes such as recurrence, metastasis or death, differentially expressed concurrent genes were identified using the univariate two-sample t test at a 0.001 significance level. A global multivariate permutation test (α level of 10− 3) was further used to control false positivity. Multiple methods, including compound covariate predictor, diagonal linear discriminative analysis, 3 nearest neighbors, nearest centroid, and support vector machine (SVM, with default penalty of LIBSVM, ref. [27]), were used to evaluate the prediction accuracy of the CRC risk model (class prediction functions of the BRB- ArrayTools, ref. [21]). For all class prediction methods, leave-one-out cross-validation (LOOCV) was used to calculate the misclassification rate with a permutation P-value reported. For each random permutation of class labels, the entire cross-validation procedure was repeated to calculate the cross-validated misclassification rate with the final P value determined from the proportion of the random permutations giving the least misclassification rate. A minimum of 1000 permutations was required.
As distinct statistic/bioinformatics tools were adopted with different underlying hypotheses and corresponding scenarios, there was no uniform alpha-level across all these tests. Consequently, default alpha-level of each test from the BRB-ArrayTools was followed. Usually reduced α levels were required for multiple testing (Bonferroni correction). Model training and LOOCV were performed within each study, and consensual genes across all microarray datasets were used to build the CRC risk predictive model. Genes were median-centered first within each dataset to avoid introducing bias from extremely high intensities as well as batch effects.