
Differential presence of exons (DPE): sequencing liquid biopsy by NGS. A new method for clustering colorectal Cancer patients

Abstract

Differential presence of exons (DPE) by next generation sequencing (NGS) is a method of interpretation of whole exome sequencing. This method has been proposed to design a predictive and diagnostic algorithm with clinical value in plasma from patients bearing colorectal cancer (CRC). The aim of the present study was to determine a common exonic signature to discriminate between different clinical pictures, such as non-metastatic, metastatic and non-disease (healthy), using a sustainable and novel technology in liquid biopsy.

Through DPE analysis, we determined the differences in the levels of exonic DNA circulating in plasma for three comparisons: patients bearing CRC vs. healthy donors, patients bearing metastatic CRC vs. patients bearing non-metastatic CRC, and patients bearing metastatic CRC vs. healthy donors. We identified a set of 510 exons (469 up and 41 down) whose differential presence in plasma allowed us to group and classify the three cohorts. Random forest classification (machine learning) gave an estimated out-of-bag (OOB) error rate of 35.9%, and the predictive model had an accuracy of 75% with a confidence interval (CI) of 56.6–88.5.

In conclusion, the DPE analysis allowed us to discriminate between different pathophysiological states: metastatic patients, non-metastatic patients and healthy donors. In addition, because we increased the number of samples in our study, this analysis yielded more significant values than previously published results. These results suggest that circulating DNA in patients' plasma may be actively released by cells and may be involved in intercellular communication and, therefore, may play a pivotal role in malignant transformation (genometastasis).


Introduction

Colorectal cancer (CRC) is one of the most widespread malignancies and represents a challenge due to its high incidence and mortality worldwide [1]. Its burden is expected to increase by 60% with more than 2.2 million new cases and 1.1 million cancer deaths by 2030 [1]. Metastasis is the leading cause of death and prevention and early diagnosis are key to counteracting this trend [2].

The gold standard for the detection and diagnosis of CRC is colonoscopy. This approach allows for purely static analysis of the tumor at a given time and location, i.e. at the time of surgery [3]. However, CRC is a slowly progressive and dynamic disease, which becomes symptomatic when it progresses to advanced stages, with the timing of diagnosis being the most important factor influencing survival rates [4].

Fecal Occult Blood Test (FOBT) is the current preferred method for CRC screening in large populations [5]. Stool and blood tests to identify methylated DNA are rising as the preferred genetic-based methods for screening pre-symptomatic CRC patients [6], but have important limitations such as cost, standardization, and high false negative and positive results [7]. Therefore, there is a need to develop new early detection methods and to develop non-invasive methods to detect CRC at earlier stages, as well as to improve and incorporate new detection methods at more advanced stages in order to introduce Precision Medicine Criteria in the follow-up of these patients. The search for specific markers appears to be essential in improving the management of CRC patients along disease phases: from diagnosis to treatment and follow-up.

Liquid biopsy, by means of tumoral cell-free DNA (cfDNA) detection, has been one of the most encouraging prospects for cancer monitoring during the last two decades, providing additional information and enabling the discovery of new biomarkers [8]. In recent years, circulating tumoral cfDNA has proven to be the most appropriate non-invasive method and the best way to analyze tumors, especially when a biopsy to obtain tumor tissue is difficult or not available [9]. Most studies suggest that cfDNA analysis should be used for molecular profiling, therapy-related mutation detection and minimal residual disease [10,11,12]. Moreover, a recent study showed how a circulating tumor DNA (ctDNA)-guided approach to the treatment of stage II colon cancer reduced adjuvant chemotherapy use without compromising recurrence-free survival [13].

Circulating tumor DNA (ctDNA) is a plasma biomarker widely used in oncology [14, 15]. ctDNA detection in colorectal cancer is related to RAS/BRAF point mutations before anti-EGFR treatment [16,17,18]. Moreover, an increased cfDNA level, together with a heterogeneous hotspot mutation pattern, provides a strong clinical prognostic predictor.

Although droplet digital PCR (ddPCR) seems to be the most used technology to analyze cancer-specific mutations [19], the advent of next-generation sequencing (NGS) and new bioinformatic methods make it possible to use more complex liquid biopsy analysis, leading to what is known as precision medicine. Our group has gone further, developing an approach based on whole exome sequencing in plasma called “differential presence of exons” (DPE).

DPE is a new and innovative strategy of NGS analysis that evaluates the differential presence of exons in cfDNA to cluster and classify patients with disseminated and localized disease [20]. The DPE method for clustering proved to be easier and more cost-effective than other NGS methods with the same task [20]. This new approach was also tested in an animal model of CRC, showing similar results [21].

The aim of the present study is to expand DPE analysis by NGS, comparing healthy donors with patients bearing CRC at different stages, in order to explore the genomic heterogeneity of cfDNA and to search for exonic signatures that can be used in precision medicine.

Materials and methods

Patients selection

A cohort of ninety-six CRC patients was recruited from April 2018 to November 2019 in the Department of General Surgery at the University Hospital Fundación Jiménez Díaz, Madrid, Spain. All patients gave informed consent, and the study was approved by the hospital clinical research ethical committee (Cod ER_PIC_135/2017_FJD). Inclusion and exclusion criteria are shown in Table 1.

Table 1 Criteria for patient selection

Patients were distributed into three groups: nonmetastatic colon cancer (N; n = 68), metastatic colon cancer (M; n = 17) and unclassified patients according to the selection criteria (U; n = 11). Patients' clinical characteristics are shown in Table 2.

Table 2 Clinical-biological characteristics

In addition, 63 healthy volunteer donors (H) were enrolled in the study after providing informed consent and with the approval of the hospital clinical research ethical committee. Blood samples were obtained by venipuncture through the biobank of the University Hospital Fundación Jiménez Díaz, Madrid, Spain.

cfDNA extraction from plasma samples

Ten millilitres of peripheral blood were collected from each patient before surgery at room temperature into Streck cell-free circulating DNA (cfDNA) BCT® tubes (Streck, La Vista, NE, US) and processed immediately (less than 1 hour). Samples were centrifuged at 1800×g for 10 minutes, and the plasma obtained in the first centrifugation was centrifuged again at 3000×g at 4 °C for 10 minutes [20], aliquoted and stored at − 80 °C prior to analysis. Plasma cfDNA was extracted automatically using a modified protocol of the QIAsymphony DSP circulating DNA kit on the QIAsymphony (QIAGEN, Hilden, Germany). cfDNA was quantified using the Qubit dsDNA HS assay kit (Thermo Fisher Scientific Inc., Waltham, MA, USA) and stored at − 20 °C.

Whole exome sequencing

An optimized WES protocol for cfDNA samples was performed using the Twist Human Core Exome kit + RefSeq V1 (Twist Bioscience Corporation, San Francisco, CA, USA), focused on clinically relevant genes with 41.2 Mb capture size.

Samples were barcoded (unique dual index), qualified and quantified using TapeStation 2200 (Agilent Technologies, Santa Clara, CA, USA) and Qubit 2.0 Fluorometer (ThermoFisher Scientific, Waltham, MA, USA).

An Illumina NovaSeq 6000 system (Illumina, Inc., San Diego, CA, USA) was used for 100 bp paired-end read sequencing. Reads were aligned against the GRCh38.103 human genome build using the Bowtie-2 aligner [22].

Data analysis

Detection of differentially present exons (DPE)

DPE analysis was done with the ‘edgeR’ R package [23], using the per-exon read counts calculated with HTSeq-count [24]. Sequence data were normalized and filtered to remove exons with less than 1 count per million (CPM) in at least 20 samples. edgeR’s underlying statistical methods are based on generalized linear models (GLMs), which test for DPE using either likelihood ratio tests (LRT) [23] or quasi-likelihood F-tests (QLF) [25]. Exons with a False Discovery Rate (FDR) ≤ 0.001 were selected for each method. Finally, the exons highlighted by both methods were considered as DPE. Venn diagrams were drawn with the Venny 2.1 online tool (https://bioinfogp.cnb.csic.es/tools/venny/index.html) to detect the exons in common between the different comparisons of the study.
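The selection logic of this step can be sketched in a few lines. What follows is a minimal illustration in Python, not the R/edgeR workflow used in the study. It computes CPM from raw counts, keeps exons that reach a CPM threshold in a minimum number of samples (one possible reading of the filter), applies a Benjamini-Hochberg adjustment (the default adjustment in edgeR's `topTags`) to per-exon p-values that are assumed to come from the LRT and QLF tests, and retains only the exons significant under both. All function names and inputs are hypothetical, and the TMM normalization and GLM fitting done by edgeR are not reproduced.

```python
from typing import Dict, List, Set


def counts_per_million(counts: Dict[str, List[int]]) -> Dict[str, List[float]]:
    """Scale raw per-exon counts by each sample's library size (column sum)."""
    n_samples = len(next(iter(counts.values())))
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        exon: [1e6 * c / lib_sizes[j] for j, c in enumerate(row)]
        for exon, row in counts.items()
    }


def keep_expressed(cpm: Dict[str, List[float]],
                   min_cpm: float = 1.0,
                   min_samples: int = 20) -> Set[str]:
    """Assumed filter: keep exons with CPM >= min_cpm in at least min_samples samples."""
    return {
        exon for exon, row in cpm.items()
        if sum(value >= min_cpm for value in row) >= min_samples
    }


def benjamini_hochberg(pvalues: Dict[str, float]) -> Dict[str, float]:
    """Return BH-adjusted p-values (FDR) for a dict of exon -> raw p-value."""
    ranked = sorted(pvalues.items(), key=lambda item: item[1])
    m = len(ranked)
    adjusted: Dict[str, float] = {}
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        exon, p = ranked[rank - 1]
        running_min = min(running_min, p * m / rank)
        adjusted[exon] = running_min
    return adjusted


def consensus_dpe(p_lrt: Dict[str, float],
                  p_qlf: Dict[str, float],
                  fdr: float = 0.001) -> Set[str]:
    """Exons passing the FDR threshold in both tests (hypothetical inputs)."""
    sig_lrt = {e for e, q in benjamini_hochberg(p_lrt).items() if q <= fdr}
    sig_qlf = {e for e, q in benjamini_hochberg(p_qlf).items() if q <= fdr}
    return sig_lrt & sig_qlf
```

Requiring agreement between both tests is a conservative choice: an exon flagged by only one of them is discarded.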

DPE: clustering and principal components analysis (PCA)

Three comparisons were performed to obtain a DPE signature for each of the comparisons:

  1. Patients with metastatic colorectal cancer (M) vs. patients with non-metastatic colorectal cancer (N).

  2. Cancer patients (C) vs. healthy controls (H).

  3. Patients with metastatic colorectal cancer (M) vs. healthy controls (H).

PCA was performed using the DPE resulting from the above filters and an in-house script. A Venn diagram was drawn for the three comparisons in order to obtain a common signature that could explain the differences in DPE. After obtaining the exons common to the three comparisons, clustering using Ward’s method [26] and principal component analysis were performed for the three comparisons to see how these exons behaved.
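To make the clustering step concrete, the sketch below implements agglomerative clustering with Ward's criterion using the Lance-Williams update on squared Euclidean distances. It is a small pure-Python illustration, not the in-house script used in the study. Samples are assumed to be rows of normalized exon values, and the function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def ward_linkage(points: List[List[float]]) -> List[Tuple[int, int, float]]:
    """Return the merge history as (cluster_a, cluster_b, height) tuples.

    Original samples have ids 0..n-1; the cluster created at step s
    receives id n + s.
    """
    n = len(points)
    size: Dict[int, int] = {i: 1 for i in range(n)}
    # Squared Euclidean distance between every pair of current clusters.
    d2: Dict[Tuple[int, int], float] = {
        (i, j): sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
        for i, j in combinations(range(n), 2)
    }
    merges: List[Tuple[int, int, float]] = []
    next_id = n
    while len(size) > 1:
        (a, b), d_ab = min(d2.items(), key=lambda item: item[1])
        merges.append((a, b, d_ab ** 0.5))
        na, nb = size.pop(a), size.pop(b)
        del d2[(a, b)]
        for k, nk in size.items():
            d_ak = d2.pop((min(a, k), max(a, k)))
            d_bk = d2.pop((min(b, k), max(b, k)))
            # Lance-Williams update for Ward's minimum-variance criterion.
            d2[(k, next_id)] = (
                (na + nk) * d_ak + (nb + nk) * d_bk - nk * d_ab
            ) / (na + nb + nk)
        size[next_id] = na + nb
        next_id += 1
    return merges


def cut_tree(merges: List[Tuple[int, int, float]],
             n_points: int, k: int) -> List[List[int]]:
    """Replay the first n_points - k merges to obtain k flat clusters."""
    members: Dict[int, List[int]] = {i: [i] for i in range(n_points)}
    for step, (a, b, _) in enumerate(merges[: n_points - k]):
        members[n_points + step] = members.pop(a) + members.pop(b)
    return sorted(sorted(group) for group in members.values())
```

For example, with four toy samples `[[0, 0], [0, 1], [10, 10], [10, 11]]`, calling `cut_tree(ward_linkage(points), 4, 2)` groups the first two samples apart from the last two.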

Random forest

Random forest (RF) classification was implemented with an R script using the “randomForest” package [27]. To generate a predictive model, 16 metastatic samples, 67 non-metastatic samples and 62 healthy controls were selected as training sets, because the outlier healthy control and two patients with NA values in some of the 510 exons were previously eliminated. The mean value of the probabilities obtained was calculated. The accuracy of the resulting model was tested by checking its ability to correctly classify the randomly drawn samples into their corresponding groups of origin. In addition, the 11 unclassifiable samples were tested to see in which group they were classified.

In addition, we drew random samples from the dataset for training and testing (70% and 30%, respectively), using 5000 trees and all the variables except those with NA values, which had previously been eliminated in the random forest process.
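As a hedged illustration of the 70/30 partition, the sketch below shuffles sample identifiers with a fixed seed and splits them, and adds a helper that scores predictions against known labels. It stands in for the R script used in the study only at the level of the idea. The function names, the seed and the use of a plain (non-stratified) split are assumptions, and the random forest itself is not reproduced.

```python
import random
from typing import Dict, List, Sequence, Tuple


def split_train_test(labels: Dict[str, str],
                     train_fraction: float = 0.7,
                     seed: int = 0) -> Tuple[List[str], List[str]]:
    """Randomly partition sample ids into training and test sets."""
    sample_ids = sorted(labels)          # fixed order before shuffling
    rng = random.Random(seed)            # reproducible shuffle
    rng.shuffle(sample_ids)
    n_train = int(train_fraction * len(sample_ids))
    return sample_ids[:n_train], sample_ids[n_train:]


def accuracy(true_labels: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of samples whose predicted group matches the true group."""
    hits = sum(t == p for t, p in zip(true_labels, predicted))
    return hits / len(true_labels)
```

With the 145 samples described above (16 M, 67 N and 62 H), this split would place 101 samples in the training set and 44 in the test set; the study's actual partition sizes are not stated.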

Gene list functional enrichment analysis - pathway analysis

The Ensembl BioMart platform (https://www.ensembl.org/info/data/biomart/index.html) was used to obtain a gene list from the DPE [28]. Functional analysis of the gene list was performed with Enrichr (https://maayanlab.cloud/Enrichr/#) [29], GeneCodis 4.0 (https://genecodis.genyo.es/) [30] and ShinyGO V0.65 (http://bioinformatics.sdstate.edu/go/) [31].

Enrichr was used to analyze gene ontology (GO) [32] and KEGG [33].

Statistical analysis

Nonparametric Kolmogorov–Smirnov and Kruskal-Wallis test for significance were performed in R to test differences in DNA concentration in plasma, histologic grade and TNM stratification (cutoff P-value of P < 0.05). LRT and QLF tests were performed for DPE with a FDR of ≤0.001.

Data availability

Raw whole exome sequencing data of the samples in the study were deposited at European Genome-phenome Archive (EGA) under accession number EGAP00001002916 (https://wwwdev.ebi.ac.uk/ega/studies/EGAS00001006656).

Results

Data analysis

On average, 129 million paired-end reads per sample were collected after sequencing, with a minimum of 76 million and a maximum of 241 million reads. Read depth varied in a range from 105x to 333x, with an average read depth of 196x per sample.

Exploratory data analysis. Clinical data

After performing the corresponding filters, the results were visualized in a multidimensional scaling plot (MDSplot), in which the samples separated according to the sex of the patient (Supplementary Fig. S1). To avoid this bias, an internal script was developed to remove the sex-linked exons from the analysis, and an MDSplot was re-run for each of the available clinical variables (Supplementary Fig. S2). We observed that the samples were not distributed according to TNM stratification, disease stage, histologic grade, presence of MSI, BRAF and RAS mutations, or age.
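The internal script is not described in the text. As a minimal sketch, assuming that the exons removed were those annotated on chromosomes X and Y, the filter could look like the following. The annotation mapping, the chromosome naming and the function name are all hypothetical.

```python
from typing import Dict, List

# Assumed chromosome labels; both Ensembl-style and UCSC-style names are covered.
SEX_CHROMOSOMES = {"X", "Y", "chrX", "chrY"}


def drop_sex_chromosome_exons(counts: Dict[str, List[int]],
                              exon_to_chrom: Dict[str, str]) -> Dict[str, List[int]]:
    """Return the count table without exons annotated on chromosome X or Y.

    Exons missing from the annotation are kept, since their location is unknown.
    """
    return {
        exon: row
        for exon, row in counts.items()
        if exon_to_chrom.get(exon) not in SEX_CHROMOSOMES
    }
```

Removing these exons before the MDS plot prevents sex from dominating the leading dimensions, which is the bias described above.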

The age of the patients and the results for each study group (metastatic, non-metastatic, unclassified) are shown in supplementary Table S1.

DPE analysis. Clustering and principal component analysis

The differential presence of exons was analyzed with edgeR, using the QLF and LRT tests, with a threshold of FDR ≤ 0.001. The following comparisons were performed: metastatic vs non-metastatic, cancer vs healthy and metastatic vs healthy.

Metastatic vs non-metastatic patients

For the metastatic (M) vs non-metastatic (N) group comparison, a total of 1760 differentially present exons were obtained, common to the QLF and LRT methods with FDR ≤ 0.001, of which 1405 were overrepresented in the M group and 355 were overrepresented in the N group. The MA plots for selected DPEs are shown in Fig. 1A-B. Statistically significant exons were located at the margins of the point cloud, as expected.

Fig. 1

Exploratory analysis of patients using DPE (M vs N). MA plots for selected differentially present exons combining two different methods: quasi-likelihood F-tests (QLF) (A plot) and likelihood ratio tests (LRT) (B plot) for comparative M vs N. The log-fold change (FC) ratio is plotted on the y-axis, and the average normalized counts (counts per million; CPM) is plotted on the x-axis. Differentially present exons are highlighted in red (DPE; p ≤ 0.001) and a total of 1760 exons were obtained with EdgeR. In the graph (C) we can see grouping of patients using normalized values of differentially present exons (DPE) by Ward’s method. Patients are marked with different colors according to the group in which they were included: M: red; N: green; and U: blue. Most metastatic (M) and non-metastatic (N) patients were clearly separated into two groups, while unclassifiable (U) patients were located between M and N, indicating that they share features with both groups. In the two-dimensional (D) plot, a principal component analysis (PCA) can be observed. Metastatic (M) and non-metastatic (N) patients cluster in a cloud of different points and are separated from each other, while unclassifiable (U) patients are located between M and N, probably because of their intermediate characteristics. M: red; N: green; and U: blue

Normalized DPEs clustering was performed using Ward’s method, which gives a distance tree shown in Fig. 1C with the 1760 DPEs. For clustering, unclassifiable samples were also included to see how they were distributed in the tree. Samples were mainly clustered into disease groups (metastatic, non-metastatic and unclassified). PCA was then performed as can be seen in Fig. 1D, which shows a two-dimensional plot with the first two principal components. As can be seen, the M and N groups are clearly separated and correctly grouped; unclassifiable samples were also included to see their distribution in the graph.

Cancer patients vs healthy donor analysis

For the cancer patients vs. healthy donor comparison, a total of 14,300 DPE were obtained, common to the QLF and LRT methods with FDR ≤ 0.001. Despite this large number, the log fold changes (logFC) were much lower than in the M vs N comparison, indicating that the changes found were small.

Of the 14,300 DPEs, 5663 were overrepresented in the cancer group and 8637 in the healthy group. The MA plots for the selected DPEs are shown in Fig. 2A-B. Statistically significant exons were located at the margins of the point cloud, as expected.

Fig. 2

Exploratory analysis of patients using DPE (C vs H). MA plots for selected differentially present exons combining two different methods: quasi-likelihood F-tests (QLF) (A plot) and likelihood ratio tests (LRT) (B plot) for comparative C vs H. The log-fold change (FC) ratio is plotted on the y-axis, and the average normalized counts (counts per million; CPM) is plotted on the x-axis. Differentially present exons are highlighted in red (DPE; p ≤ 0.001) and a total of 14,300 exons were obtained with EdgeR. In the graph (C) we can see the grouping of patients using the normalized values of differentially present exons (DPE) by Ward’s method. Patients are marked with different colors according to the group in which they were included: C: brown and H: blue. Some of the cancer patients were clearly separated from the healthy ones. However, some of the cancer patients and healthy controls appear grouped together. A principal component analysis (PCA) can be seen in the two-dimensional (D) plot. Cancer patients (C) and healthy controls (H) were separated in a cloud of individual points. However, some patients and healthy controls appear together in the same point cloud, corresponding to non-metastatic patients who have a small tumor. C: cancer - brown and H: healthy - blue

As in the previous case, clustering of the normalized DPEs was performed using Ward’s method, which gave the distance tree shown in Fig. 2C with the 14,300 DPEs. The samples clustered sparsely: most of the cancer patients clustered together, as did most of the healthy controls, although some of the N cancer samples tended to cluster with the healthy ones. We then performed PCA with the DPEs common to both tests (QLF and LRT) at an FDR ≤ 0.001. Fig. 2D shows a two-dimensional plot of the first two principal components, with two point clouds, one for cancer patients and one for healthy donors, although some samples from the two groups clustered together.

Metastatic cancer patients vs healthy donor analysis

In the comparison of metastatic colorectal cancer patients vs. healthy controls, a total of 14,430 DPEs were obtained, common to the QLF and LRT methods with FDR ≤ 0.001, of which 6526 were overrepresented in the M group and 7904 were overrepresented in the H group. The MA plots for selected DPEs are shown in Fig. 3A-B. Statistically significant exons were located at the margins of the point cloud, as expected.

Fig. 3

Exploratory analysis of patients using DPE (M vs H). MA plots for selected differentially present exons combining two different methods: quasi-likelihood F-tests (QLF) (A plot) and likelihood ratio tests (LRT) (B plot) for comparative M vs H. The log-fold change (FC) ratio is plotted on the y-axis, and the average normalized counts (counts per million; CPM) is plotted on the x-axis. Differentially present exons are highlighted in red (DPE; p ≤ 0.001) and a total of 14,430 exons were obtained with EdgeR. In the graph (C) we can see the grouping of the participants using the normalized values of differentially present exons (DPE) by Ward’s method. The participants are marked with different colors according to the group in which they were included: M: red, U: blue and H: yellow. Some of the metastatic patients were clearly separated from the healthy ones. However, the metastatic patients appear at the ends of the tree and the healthy controls in the middle, and the unclassifiable near the metastatic patients. A principal component analysis (PCA) can be seen in the two-dimensional (D) plot. Metastatic patients (M) and healthy controls (H) appear separated in a cloud of individual points. M (red) and H (yellow)

Clustering of the normalized DPEs was performed, as in the previous cases, using Ward’s method, which gave the distance tree shown in Fig. 3C with the 14,430 DPEs. For clustering, unclassifiable samples were also included to see how they were distributed in the tree. The M patients, together with some unclassifiable patients and a few healthy controls, were grouped mainly at the extremes of the tree, whereas healthy controls were mainly grouped together in the middle. PCA was then performed, as can be seen in Fig. 3D, which shows a two-dimensional plot of the first two principal components. The M and healthy groups were clearly separated and correctly grouped.

Integration of the three groups (M, N, H)

After integrating the three comparisons of the study using a Venn diagram (Fig. 4A) and the QLF/LRT tests, we obtained a signature of 510 exons that could have biological value. This list was used to perform a supervised separation analysis between the M, N and healthy groups (Fig. 4B-C).
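The central region of a three-way Venn diagram is simply the intersection of the three DPE lists. The sketch below shows that operation in Python with toy exon identifiers; the function name and the identifiers are hypothetical, and the study itself used the Venny tool.

```python
from typing import Set


def common_signature(*dpe_lists: Set[str]) -> Set[str]:
    """Exons present in every DPE list (the central region of the Venn diagram)."""
    if not dpe_lists:
        return set()
    return set.intersection(*(set(dpe) for dpe in dpe_lists))


# Toy DPE lists for the three comparisons (identifiers are made up).
m_vs_n = {"exon_1", "exon_2", "exon_3"}
c_vs_h = {"exon_2", "exon_3", "exon_4"}
m_vs_h = {"exon_3", "exon_2", "exon_5"}

signature = common_signature(m_vs_n, c_vs_h, m_vs_h)
```

Here `signature` contains only the exons shared by all three toy lists, in the same way that the 510-exon signature contains the exons shared by the three comparisons.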

Fig. 4

Exploratory analysis of common exons in the three groups of patients studied. A Venn diagram of all comparisons using the Venny platform; Oliveros, J.C. (2007–2015) Venny. An interactive tool for comparing lists with Venn’s diagrams. https://bioinfogp.cnb.csic.es/tools/venny/index.html. As we can observe, there are 510 exons in common from the three comparatives (M vs N, C vs H and M vs H) and these exons were used as an exonic signature related to colorectal cancer. B Two-dimensional principal component analysis (PCA) plot with the 510 exons in common (exonic signature), using samples from the comparatives M vs N, C vs H and M vs H. It can be seen that there is discrimination between the groups using these 510 exons. Samples from groups M, N, U and H are marked with different colors according to the graph. The graph on the left: M in red, N in green, and U in blue. The middle graph: C in brown and H in blue. The graph on the right: M in red and H in yellow. C Clustering of the samples by Ward’s method with the 510 exons in common (exonic signature). We can see the clustering on the left: M in red, U in blue and N in green. It is clear that there is separation between the clusters. In the middle cluster (C in brown and H in blue), the separation between groups is also observed, as it is in the cluster on the right (M in red, U in pink and H in yellow). Unclassifiable patients (U) were placed in the center, indicating that they share features with both groups. M: metastatic; N: non-metastatic; C: colorectal cancer patients; H: healthy controls

We succeeded in segregating M patients from healthy controls, although the separation between cancer and healthy was not entirely possible. Most of the N patients were clustered together, while the M patients were more scattered.

We then examined in which genes those exons were located. Considering that some exons belonged to the same gene, we finally identified a total of 382 genes. We also used the exons common to the three comparisons to perform pathway enrichment analysis.

Random forest

These results encouraged us to develop a predictive algorithm to classify the samples using an internal random forest script. To do so, we performed two approaches:

1) Classification of unclassifiable samples, the results of which are shown in Supplementary Table S2.

2) Division of the dataset into two study populations (training, 70%, and testing, 30%), followed by a random forest with an ntree of 5000 and an mtry of 510, which gave a 35.9% OOB error rate and a 75% accuracy. The sensitivity and classification accuracy for the metastatic (MCC), non-metastatic (NMCC) and healthy groups in the test dataset are shown in Supplementary Table S3.
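An accuracy estimated on a small test set carries a wide confidence interval. The text does not say how the interval reported for the 75% accuracy was computed, so the sketch below is only one possibility: an exact (Clopper-Pearson) binomial interval, solved by bisection with the standard library. The counts in the usage note are hypothetical and are not the study's test-set counts.

```python
from math import comb
from typing import Callable, Tuple


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(f: Callable[[float], float], target: float,
                      tol: float = 1e-12) -> float:
    """Find p in (0, 1) with f(p) = target, for f increasing in p."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(successes: int, n: int,
                    confidence: float = 0.95) -> Tuple[float, float]:
    """Exact two-sided binomial confidence interval for a proportion."""
    alpha = 1 - confidence
    if successes == 0:
        lower = 0.0
    else:
        # Lower bound: P(X >= successes | p) = alpha / 2.
        lower = _solve_increasing(
            lambda p: 1 - binom_cdf(successes - 1, n, p), alpha / 2)
    if successes == n:
        upper = 1.0
    else:
        # Upper bound: P(X <= successes | p) = alpha / 2.
        upper = _solve_increasing(
            lambda p: 1 - binom_cdf(successes, n, p), 1 - alpha / 2)
    return lower, upper
```

For instance, `clopper_pearson(30, 40)` gives the interval for a hypothetical test set of 40 samples with 30 correct classifications (75% accuracy); smaller test sets give wider intervals at the same accuracy.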

Pathway analysis

The pathway study (KEGG and Gene Ontology, GO) of that signature with the Enrichr tool revealed an enrichment of pathways that could be associated with cancer. However, we did not obtain significant results for the 510 exons when the p-value was corrected for multiple testing (adjusted p-value). The genes were related to three biological functions according to the KEGG 2021 Human database: estrogen signalling pathway (p-value = 0.004), protein digestion and absorption (p-value = 0.01) and gap junction (p-value = 0.02) (Fig. 5A).

Fig. 5

Pathways analysis. A Bar chart of top enriched terms from the KEGG_2021_Human gene set library. The top 10 enriched terms for the input gene set are displayed based on the -log10(p-value), with the actual p-value shown next to each term. The term at the top has the most significant overlap with the input query gene set. B Bar chart of top enriched terms from the GO_Biological_Process_2021 gene set library. The top 10 enriched terms for the input gene set are displayed based on the -log10(p-value), with the actual p-value shown next to each term. The term at the top has the most significant overlap with the input query gene set. C Bar chart of top enriched terms from the GO_Molecular_Function_2021 gene set library. The top 10 enriched terms for the input gene set are displayed based on the -log10(p-value), with the actual p-value shown next to each term. The term at the top has the most significant overlap with the input query gene set. D Bar chart of top enriched terms from the GO_Cellular_Component_2021 gene set library. The top 10 enriched terms for the input gene set are displayed based on the -log10(p-value), with the actual p-value shown next to each term. The term at the top has the most significant overlap with the input query gene set

The most overrepresented gene ontology pathways were as follows:

- Biological Processes (BP): regulation of retrograde transport, endosome to Golgi (GO:1905279) (p-value = 1.3E-4), negative regulation of tumor necrosis factor production (GO:0032720) (p-value = 0.001) and histone H3-K4 trimethylation (GO:0080182) (p-value = 0.002). It is worth mentioning the fourth most significant pathway, negative regulation of tumor necrosis factor superfamily cytokine production (GO:1903556) (p-value = 0.002), which is related to cancer (Fig. 5B).

- Molecular Functions (MF): vascular endothelial growth factor-activated receptor activity (GO:0005021) (p-value = 0.007), protein phosphatase 1 binding (GO:0008157) (p-value = 0.009) and sodium:chloride symporter activity (GO:0015378) (p-value = 0.01) (Fig. 5C).

- Cellular Components (CC): collagen type IV trimer (GO:0005587) (p-value = 0.005), early endosome (GO:0005769) (p-value = 0.005) and cytoskeleton of presynaptic active zone (GO:0048788) (p-value = 0.007) (Fig. 5D).

We performed a functional analysis of the 382 genes with the ShinyGO V0.65 tool, using all the pathway databases incorporated in the tool, and obtained significant results (FDR < 0.005) for pathways related to cancer (Supplementary Table S4). We also observed that our gene list was enriched in genes found on chromosome 20, as can be seen in Supplementary Fig. S3.

Finally, the GeneCodis platform was used to perform the functional analysis with the Panther pathways, BioPlanet and WikiPathways databases (Supplementary Fig. S4):

- Panther pathways, the three most significant pathways were: 5HT3 type receptor mediated signaling pathway (p-value = 0.03), angiotensin II-stimulated signaling through G proteins and beta-arrestin (p-value = 0.02) and integrin signaling pathway (p-value = 0.02).

- BioPlanet, the three most significant pathways were: collagen biosynthesis and modifying enzymes (p-value = 0.002), ion transport by P-type ATPases (p-value = 0.002) and extracellular matrix organization (p-value = 0.003).

- WikiPathways, the three most significant pathways were: circadian rhythm genes (p-value = 0.001), hippo signaling regulation pathways (p-value = 0.004) and G protein signaling pathways (p-value = 0.004).

Discussion

Next-generation sequencing (NGS) has been proposed as a suitable tool for liquid biopsy in colorectal cancer (CRC), although most studies to date have focused on sequencing panels of potential candidate genes that are clinically actionable [34]. Mutation analysis by liquid biopsy is affected by several issues, including cfDNA concentration [35], ctDNA abundance in plasma (related to tumor mass) and tumor heterogeneity [36]. To overcome these issues, which affect sensitivity, high sequencing depths are needed, which is a costly strategy in expanded NGS panels.

In this study, we have tested a new approach called differential presence of exons (DPE), previously described by the Garcia-Olmo research group at the University Hospital Fundación Jiménez Díaz [20, 21]. This NGS approach at a relatively shallow depth represents an easy, rapid, non-invasive and affordable strategy for future use in clinical practice.

In this DPE scenario, we developed a more novel and robust technology to draw a signature in common between metastatic and non-metastatic CRC patients and healthy controls. We were able to define differential DPE signatures between the three groups (metastatic, non-metastatic, healthy) consisting of 510 exons that could explain the overall variation.

The resulting common DPE profile was used to cluster and classify all study participants and this information was processed to develop a DPE algorithm generating a predictive model using machine learning. Most patients were correctly grouped and separated between metastatic and non-metastatic, and it was also observed that when using these 510 DPEs healthy controls were placed in a point cloud in the PCA. Unclassifiable patients were clustered between non-metastatic and metastatic groups according to the common features they share.

Differential detection of exons suggests that cfDNA is actively and differentially released by tumor and non-tumor cells, which could have biological implications if cfDNA acts as a means of intercellular communication. Previous studies have described that cell-free nucleic acids circulating in the plasma of patients with colorectal cancer induce oncogenic transformation of distant cells, predisposing them to malignant transformation (genometastasis theory) [37, 38]. Moreover, it has been suggested that metastatic seeding may occur before clinical detection [39] and that cfDNA may be involved in the metastatic process [40]. This leads us to believe that this DPE signature may have some involvement in malignant transformation and metastasis.

In fact, functional analysis of the 510 DPEs highlighted significant results in cancer-related pathways such as kidney cancer [41], liver cancer [42] and estrogen signaling (all of them related to oncogenic development and associated with CRC) [43, 44]. Examples of Gene Ontology terms associated with colorectal cancer were GO:0005021 (vascular endothelial growth factor receptor activity) [45] and GO:0032720 (negative regulation of tumor necrosis factor production) [46]. Therefore, DPE discovery may not only help to elucidate the molecular mechanism of carcinogenesis but also provide a new approach to liquid biopsy analysis, and we propose the use of DPE as a non-invasive biomarker.

To date, this is the largest dataset using DPE for CRC, and we have included healthy controls, which were not included in previous studies. We have also innovated in the use of new exome technology and implemented a new standardised cfDNA extraction protocol that could facilitate its incorporation into clinical practice.

To conclude, liquid biopsy analysis can be used to gain new insights into the biology of metastasis, as a companion diagnostic to improve the stratification of therapies, and to obtain information on therapy-induced cancer cell selection (precision medicine); the technical and clinical validation of these assays is very important [47].

Conclusions

The DPE-based approach has allowed us to discriminate between patients with cancer and healthy controls and between patients with metastatic CRC and healthy controls, and it showed a tendency to separate patients with metastatic and non-metastatic CRC. From all comparisons we obtained a common signature of 510 exons, which corroborates previous results on discrimination between groups. Functional analysis associated the resulting signature (510 DPEs) with cancer-related pathways. The results found here support the theory of an active and targeted release of cfDNA (genometastasis). For future studies, we need more samples to profile this signature and to assess its prognostic value for the survival of these patients.

Availability of data and materials

The datasets supporting the conclusions of this article are available in the European Genome-phenome Archive (EGA) repository, under data accession number EGAP00001002916 (https://wwwdev.ebi.ac.uk/ega/studies/EGAS00001006656).

References

  1. Arnold M, Sierra MS, Laversanne M, Soerjomataram I, Jemal A, Bray F. Global patterns and trends in colorectal Cancer incidence and mortality. Gut. 2017;66(4):683–91. https://doi.org/10.1136/gutjnl-2015-310912.

  2. Fares J, Fares MY, Khachfe HH, Salhab HA, Fares Y. Molecular principles of metastasis: a Hallmark of Cancer revisited. Signal Transduct Target Ther. 2020;5:28. https://doi.org/10.1038/s41392-020-0134-x.

  3. Rex DK, Boland CR, Dominitz JA, Giardiello FM, Johnson DA, Kaltenbach T, et al. Colorectal Cancer screening: recommendations for physicians and patients from the U.S. multi-society task force on colorectal Cancer. Gastrointest Endosc. 2017;86(1):18–33. https://doi.org/10.1016/j.gie.2017.04.003.

  4. Toma SC, Ungureanu BS, Patrascu S, Surlin V, Georgescu I. Colorectal Cancer biomarkers - a new trend in early diagnosis. Curr Health Sci J. 2018;44(2):140–6. https://doi.org/10.12865/CHSJ.44.02.08.

  5. D’Onise K, Iacobini ET, Canuto KJ. Colorectal Cancer screening using Faecal occult blood tests for indigenous adults: a systematic literature review of barriers, enablers and implemented strategies. Prev Med. 2020;134:106018. https://doi.org/10.1016/j.ypmed.2020.106018.

  6. Song L, Jia J, Peng X, Xiao W, Li Y. The performance of the SEPT9 gene methylation assay and a comparison with other CRC screening tests: a Meta-analysis. Sci Rep. 2017;7(1):3032. https://doi.org/10.1038/s41598-017-03321-8.

  7. Ferrari A, Neefs I, Hoeck S, Peeters M, Van Hal G. Towards novel non-invasive colorectal Cancer screening methods: a comprehensive review. Cancers (Basel). 2021;13(8):1820. https://doi.org/10.3390/cancers13081820.

  8. Keller L, Belloum Y, Wikman H, Pantel K. Clinical relevance of blood-based ctDNA analysis: mutation detection and beyond. Br J Cancer. 2020:1–14. https://doi.org/10.1038/s41416-020-01047-5.

  9. Tivey A, Church M, Rothwell D, Dive C, Cook N. Circulating tumour DNA — looking beyond the blood. Nat Rev Clin Oncol. 2022;19(9):600–12. https://doi.org/10.1038/s41571-022-00660-y.

  10. Cescon DW, Bratman SV, Chan SM, Siu LL. Circulating tumor DNA and liquid biopsy in oncology. Nat Cancer. 2020;1(3):276–90. https://doi.org/10.1038/s43018-020-0043-5.

  11. Cisneros-Villanueva M, Hidalgo-Pérez L, Rios-Romero M, Cedro-Tanda A, Ruiz-Villavicencio CA, Page K, et al. Cell-free DNA analysis in current Cancer clinical trials: a review. Br J Cancer. 2022;126(3):391–400. https://doi.org/10.1038/s41416-021-01696-0.

  12. Corcoran RB, Chabner BA. Application of cell-free DNA analysis to Cancer treatment. N Engl J Med. 2018;379(18):1754–65. https://doi.org/10.1056/NEJMra1706174.

  13. Tie J, Cohen JD, Lahouel K, Lo SN, Wang Y, Kosmider S, et al. Circulating tumor DNA analysis guiding adjuvant therapy in stage II Colon Cancer. N Engl J Med. 2022;386(24):2261–72. https://doi.org/10.1056/NEJMoa2200075.

  14. Otandault A, Anker P, Dache ZAA, Guillaumon V, Meddeb R, Pastor B, et al. Recent advances in circulating nucleic acids in oncology. Ann Oncol. 2019;30(3):374–84. https://doi.org/10.1093/annonc/mdz031.

  15. Pastor B, André T, Henriques J, Trouilloud I, Tournigand C, Jary M, et al. Monitoring levels of circulating cell-free DNA in patients with metastatic colorectal Cancer as a potential biomarker of responses to Regorafenib treatment. Mol Oncol. 2021;15(9):2401–11. https://doi.org/10.1002/1878-0261.12972.

  16. van Helden EJ, Angus L, Menke-van der Houven van Oordt CW, Heideman DAM, Boon E, et al. RAS and BRAF mutations in cell-free DNA are predictive for outcome of Cetuximab monotherapy in patients with tissue-tested RAS wild-type advanced colorectal Cancer. Mol Oncol. 2019;13(11):2361–74. https://doi.org/10.1002/1878-0261.12550.

  17. Vitiello PP, De Falco V, Giunta EF, Ciardiello D, Cardone C, Vitale P, et al. Clinical practice use of liquid biopsy to identify RAS/BRAF mutations in patients with metastatic colorectal Cancer (MCRC): a single institution experience. Cancers (Basel). 2019;11(10):E1504. https://doi.org/10.3390/cancers11101504.

  18. Yao J, Zang W, Ge Y, Weygant N, Yu P, Li L, et al. RAS/BRAF circulating tumor DNA mutations as a predictor of response to first-line chemotherapy in metastatic colorectal Cancer patients. Can J Gastroenterol Hepatol. 2018;2018:4248971. https://doi.org/10.1155/2018/4248971.

  19. Palacín-Aliana I, García-Romero N, Asensi-Puig A, Carrión-Navarro J, González-Rumayor V, Ayuso-Sacido Á. Clinical utility of liquid biopsy-based actionable mutations detected via ddPCR. Biomedicines. 2021;9(8):906. https://doi.org/10.3390/biomedicines9080906.

  20. Olmedillas-López S, García-Olmo DC, García-Arranz M, Peiró-Pastor R, Aguado B, García-Olmo D. Liquid biopsy by NGS: differential presence of exons (DPE) in cell-free DNA reveals different patterns in metastatic and nonmetastatic colorectal Cancer. Cancer Medicine. 2018;7(5):1706–16. https://doi.org/10.1002/cam4.1399.

  21. García-Olmo DC, Peiró-Pastor R, Picazo MG, Olmedillas-López S, García-Arranz M, Aguado B, et al. Liquid biopsy by NGS: differential presence of exons (DPE) is related to metastatic potential in a Colon-Cancer model in the rat. Transl Oncol. 2020;13(11):100837. https://doi.org/10.1016/j.tranon.2020.100837.

  22. Langmead B, Wilks C, Antonescu V, Charles R. Scaling read aligners to hundreds of threads on general-purpose processors. Bioinformatics. 2019;35(3):421–32. https://doi.org/10.1093/bioinformatics/bty648.

  23. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–40. https://doi.org/10.1093/bioinformatics/btp616.

  24. Anders S, Pyl PT, Huber W. HTSeq--a Python Framework to Work with High-Throughput Sequencing Data. Bioinformatics. 2015;31(2):166–9. https://doi.org/10.1093/bioinformatics/btu638.

  25. Lund SP, Nettleton D, McCarthy DJ, Smyth GK. Detecting differential expression in RNA-sequence data using quasi-likelihood with shrunken dispersion estimates. Stat Appl Genet Mol Biol. 2012;11(5). https://doi.org/10.1515/1544-6115.1826.

  26. Ward JH. Hierarchical grouping to optimize an objective function. J Am Stat Assoc. 1963;58(301):236–44. https://doi.org/10.1080/01621459.1963.10500845.

  27. Breiman L. Random forests. Mach Learn. 2001;45(1):5–32. https://doi.org/10.1023/A:1010933404324.

  28. Cunningham F, Allen JE, Allen J, Alvarez-Jarreta J, Amode MR, Armean IM, et al. Ensembl 2022. Nucleic Acids Res. 2022;50(D1):D988–95. https://doi.org/10.1093/nar/gkab1049.

  29. Xie Z, Bailey A, Kuleshov MV, Clarke DJB, Evangelista JE, Jenkins SL, et al. Gene set knowledge discovery with Enrichr. Current Protocols. 2021;1(3):e90. https://doi.org/10.1002/cpz1.90.

  30. Garcia-Moreno A, López-Domínguez R, Villatoro-García JA, Ramirez-Mena A, Aparicio-Puerta E, Hackenberg M, et al. Functional enrichment analysis of regulatory elements. Biomedicines. 2022;10(3):590. https://doi.org/10.3390/biomedicines10030590.

  31. Ge SX, Jung D, Yao R. ShinyGO: a graphical gene-set enrichment tool for animals and plants. Bioinformatics. 2020;36(8):2628–9. https://doi.org/10.1093/bioinformatics/btz931.

  32. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: tool for the unification of biology. Nat Genet. 2000;25(1):25–9. https://doi.org/10.1038/75556.

  33. Kanehisa M, Goto S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28(1):27–30. https://doi.org/10.1093/nar/28.1.27.

  34. Normanno N, Cervantes A, Ciardiello F, De Luca A, Pinto C. The liquid biopsy in the Management of Colorectal Cancer Patients: current applications and future scenarios. Cancer Treat Rev. 2018;70:1–8. https://doi.org/10.1016/j.ctrv.2018.07.007.

  35. Shohdy KS, West H. Circulating tumor DNA testing—liquid biopsy of a Cancer. JAMA Oncology. 2020;6(5):792. https://doi.org/10.1001/jamaoncol.2020.0346.

  36. Ignatiadis M, Sledge GW, Jeffrey SS. Liquid biopsy enters the clinic — implementation issues and future challenges. Nat Rev Clin Oncol. 2021;18(5):297–312. https://doi.org/10.1038/s41571-020-00457-x.

  37. García-Olmo DC, Domínguez C, García-Arranz M, Anker P, Stroun M, García-Verdugo JM, et al. Cell-free nucleic acids circulating in the plasma of colorectal Cancer patients induce the oncogenic transformation of susceptible cultured cells. Cancer Res. 2010;70(2):560–7. https://doi.org/10.1158/0008-5472.CAN-09-3513.

  38. Olivera-Salazar R, García-Arranz M, Sánchez A, Olmedillas-López S, Vega-Clemente L, et al. Oncological transformation in vitro of hepatic progenitor cell lines isolated from adult mice. Sci Rep. 2022;12:3149. https://doi.org/10.1038/s41598-022-06427-w.

  39. Magrì A, Bardelli A. Does early metastatic seeding occur in colorectal Cancer? Nat Rev Gastroenterol Hepatol. 2019;16(11):651–3. https://doi.org/10.1038/s41575-019-0200-4.

  40. García-Olmo D, García-Olmo DC. Functionality of circulating DNA: the hypothesis of Genometastasis. Ann N Y Acad Sci. 2001;945:265–75. https://doi.org/10.1111/j.1749-6632.2001.tb03895.x.

  41. Aksu G, Fayda M, Sakar B, Kapran Y. Colon Cancer with isolated metastasis to the kidney at the time of initial diagnosis. Int J Gastrointest Cancer. 2003;34(2–3):73–7. https://doi.org/10.1385/IJGC:34:2-3:073.

  42. Valderrama-Treviño AI, Barrera-Mera B, Ceballos-Villalva C, Montalvo-Javé EE. Hepatic metastasis from colorectal Cancer. Euroasian J Hepatogastroenterol. 2017;7(2):166–75. https://doi.org/10.5005/jp-journals-10018-1241.

  43. Lipovka Y, Konhilas JP. The complex nature of Oestrogen Signalling in breast Cancer: enemy or ally? Biosci Rep. 2016;36(3):e00352. https://doi.org/10.1042/BSR20160017.

  44. Barzi A, Lenz AM, Labonte MJ, Lenz H-J. Molecular pathways: estrogen pathway in colorectal Cancer. Clin Cancer Res. 2013;19(21):5842–8. https://doi.org/10.1158/1078-0432.CCR-13-0325.

  45. Fan F, Wey JS, McCarty MF, Belcheva A, Liu W, Bauer TW, et al. Expression and function of vascular endothelial growth factor Receptor-1 on human colorectal Cancer cells. Oncogene. 2005;24(16):2647–53. https://doi.org/10.1038/sj.onc.1208246.

  46. Al Obeed OA, Alkhayal KA, Al Sheikh A, Zubaidi AM, Vaali-Mohammed M-A, Boushey R, et al. Increased expression of tumor necrosis factor-α is associated with advanced colorectal Cancer stages. World J Gastroenterol. 2014;20(48):18390–6. https://doi.org/10.3748/wjg.v20.i48.18390.

  47. Alix-Panabières C, Pantel K. Liquid biopsy: from discovery to clinical application. Cancer Discov. 2021;11(4):858–73. https://doi.org/10.1158/2159-8290.CD-20-1311.

Acknowledgements

The authors would like to acknowledge all the staff members of the Surgery Department of the University Hospital Fundación Jiménez Díaz for their collaboration in the collection of samples and in the clinical follow-up of the patients, especially Susana Olmedillas PhD. The authors would like to acknowledge Juan Cruz Cigudosa PhD, Beatriz Maroto PhD and Antonio Gomez PhD for their contribution to the initial steps of this project. We also would like to thank Marta Carcajona for her involvement in the development of sample alignment pipelines. The authors would also like to thank Dr. Sara Rosenstone Calvo for editing this manuscript. Lastly, we would especially like to thank all the patients for making this research possible.

Funding

The authors declare having received the following financial support for the research, authorship and/or publication of this article: a grant from the “Industrial PhD students Community of Madrid” (grant number IND2019/BMD-17241); a grant in the framework of the Regional Strategy for Research and Innovation for Smart Specialization (RIS3) for high-intensity innovative SMEs of more than 5 years and less than 15 years of age of the Community of Madrid (reference: S-2018/L3–25); and a grant from the Fondo de Investigaciones Sanitarias of Instituto de Salud Carlos III (ISCIII)-Fondo Europeo de Desarrollo Regional (FEDER) [grant number PI20/01052]. This research is associated with a patent related to “methods for identifying cancer patients at high risk of developing metastases” (EP17382659.5).

Author information

Contributions

D. Rubio-Mangas: Conceptualization, resources, data curation, software, formal analysis, validation, investigation, analysis and interpretation of data, visualization, methodology, writing–original draft, writing–review and editing. M. García-Arranz: Conceptualization, supervision, project administration, investigation, funding acquisition, writing–review and editing. Y. Torres-Rodriguez: Supervision, methodology, writing–review and editing. M. León-Arellano: Resources (acquisition of clinical information), writing–review and editing. J. Suela-Rubio: Conceptualization, resources, supervision, funding acquisition, validation, investigation, methodology, writing–original draft, project administration, writing–review and editing. D. García-Olmo: Conceptualization, resources, supervision, funding acquisition, validation, investigation, methodology, writing–original draft, project administration, writing–review and editing. The author(s) read and approved the final manuscript.

Corresponding authors

Correspondence to David Rubio-Mangas, Javier Suela or Damián García-Olmo.

Ethics declarations

Ethics approval and consent to participate

The methods have been carried out in accordance with the relevant guidelines and regulations. All study participants signed the informed consent form provided by the hospital. The study received approval from the clinical research ethics committee of the Hospital Jimenez Diaz (Cod ER_PIC_135/2017_FJD).

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Rubio-Mangas, D., García-Arranz, M., Torres-Rodriguez, Y. et al. Differential presence of exons (DPE): sequencing liquid biopsy by NGS. A new method for clustering colorectal Cancer patients. BMC Cancer 23, 2 (2023). https://doi.org/10.1186/s12885-022-10459-w

  • DOI: https://doi.org/10.1186/s12885-022-10459-w

Keywords