TP53 mutation prevalence in normal airway epithelium as a biomarker for lung cancer risk

Background There is a need for biomarkers that improve accuracy compared with current demographic risk indices to detect individuals at the highest lung cancer risk. Improved risk determination will enable more effective lung cancer screening and better stratification of lung nodules into high- or low-risk categories. We previously reported discovery of a biomarker for lung cancer risk characterized by increased prevalence of TP53 somatic mutations in airway epithelial cells (AEC). Here we present results from a validation study in an independent retrospective case–control cohort. Methods Targeted next generation sequencing was used to identify mutations within three TP53 exons spanning 193 base pairs in AEC genomic DNA. Results TP53 mutation prevalence was associated with cancer status (P < 0.001). The lung cancer detection receiver operator characteristic (ROC) area under the curve (AUC) for the TP53 biomarker was 0.845 (95% confidence limits 0.749–0.942). In contrast, TP53 mutation prevalence was not significantly associated with age or smoking pack-years. The combination of TP53 mutation prevalence with PLCOM2012 risk score had an ROC AUC of 0.916 (0.846–0.986), which was significantly higher than that for either factor alone (P < 0.03). Conclusions These results support the validity of the TP53 mutation prevalence biomarker and justify additional steps to assess this biomarker in AEC specimens from a prospective cohort and in matched nasal brushing specimens as a potential non-invasive surrogate specimen. Supplementary Information The online version contains supplementary material available at 10.1186/s12885-023-11266-7.


Introduction
Lung cancer causes more deaths worldwide than any other cancer [1]. The National Lung Screening Trial (NLST) and the NELSON trial each demonstrated that early detection through low-dose computed tomography (LDCT) screening reduces lung cancer mortality by at least 20% compared to chest x-ray screening [2-4]. Based on the strength of data from these and other studies, the latest guidelines from the United States Preventive Services Task Force (USPSTF) recommend expansion of LDCT screening eligibility to individuals 50-80 years of age with at least a 20 pack-year smoking history who currently smoke or quit < 15 years prior.
In spite of these important advances, there remains an opportunity to better stratify individuals for LDCT screening based on biomarker risk criteria that supplement demographic risk criteria. For example, a large fraction of lung cancers are found in individuals who do not meet even the expanded LDCT screening threshold criteria [5], including those younger than 50. This younger group comprised 28% of lung cancer cases in one study [6]. Further, roughly 25% of all lung cancer cases worldwide occur in never smokers, and this proportion will continue to increase due to smoking cessation and prevention efforts [7,8].
Another opportunity is to reduce the large number of false positive findings associated with LDCT screening. Specifically, each early lung cancer first appears on a chest CT as an indeterminate pulmonary nodule (IPN) (defined as a well-defined, non-calcified nodule in the lung less than 3 cm in size), but > 95% of nodules identified by screening are the result of a benign process, such as scar tissue, an infection, or a non-infectious inflammatory process [9]. Based on nodule size, shape, and location characteristics, combined with patient age and smoking history, cancer risk for each nodule may be designated as low, intermediate, or high (e.g., < 15%, 15-75%, or > 75%, respectively) [10]. For example, regardless of lung cancer risk based on smoking history and age of the individual, the probability of malignancy is less than 1% for all nodules smaller than 6 mm, and 1% to 2% for nodules 6 mm to 8 mm [11]. The American College of Radiology [12], the National Comprehensive Cancer Network (NCCN) [13], the American College of Chest Physicians (ACCP) [10], and other consensus groups agree that low-risk nodules may be safely followed with an interval CT scan (e.g., three to six months) to assess for growth, while high-risk nodules should be subjected to expedited diagnosis and/or surgical removal. However, intermediate-risk nodules represent about one third of all nodules and present a significant clinical challenge. Specifically, intermediate-risk nodules are associated with the highest rate of diagnostic errors, in part because there is no clear strategy for their management [14]. Thus, a pressing unmet need is a biomarker that will reliably move intermediate-risk nodules to high or low risk [15].
A prevalent strategy to more accurately identify individuals at risk for cancer or to detect early-stage cancer is to develop biomarkers based on inherited (i.e., germ cell) and/or acquired (i.e., somatic cell) genetic risk determinants [16-24]. In an application of this strategy, we used an optimized targeted NGS method [21,25,26] in a discovery study to measure driver gene mutations, including those with low variant allele frequency (VAF) (0.01-1.0% VAF), in grossly normal airway epithelial cells (AEC) [21,26]. We discovered that TP53 mutations at known lung cancer hotspot sites within three TP53 exons (5-7) were significantly more prevalent in AEC specimens from lung cancer cases compared to non-cancer controls [21]. Here, we evaluated this TP53 mutation biomarker in a larger independent retrospective case-control cohort using AEC specimen DNA from lung cancer and non-cancer subjects.

Study cohort enrollment and characterization
Subjects undergoing bronchoscopy for clinical purposes provided informed consent and were enrolled into one of two approved research protocols at Vanderbilt University Medical Center (VUMC) (Table 1).
The first VUMC protocol (Molecular Predictors of Lung Cancer Behavior-0398) included subjects aged 18-80 years with IPN between 6 and 30 mm in the largest axial diameter detected incidentally or through screening.
The second VUMC protocol (Nashville Lung Screening Trial-1078) included subjects aged 55 years or older with screening-detected nodules and a ≥ 25 pack-year smoking history. Current and former smokers (quit < 15 years prior) were enrolled. In each protocol, cases were subjects with lung cancer confirmed through positive biopsy, and controls were subjects confirmed to not have cancer through negative biopsies and/or 2-year longitudinal imaging follow-up with no sign of growth.
AEC specimens were collected by bronchoscopic brush biopsy of grossly normal (not overtly metaplastic or abnormal-appearing to the trained pulmonologist) airway from 30 lung cancer (CA) cases, all of whom had a smoking history, and 30 non-cancer (NC) controls. Brushings were taken from the opposite lung or a region distant from known or suspected disease using cytology brushes (Cook Medical BCB-5-120-3-S). AEC were collected by rotating the brush while completing 20-30 brush strokes. Brushes were clipped into empty tubes, immediately frozen on dry ice, then stored at -80 °C until DNA extraction.
The PLCO M2012 lung cancer risk score was calculated for each subject based on demographic characteristics as previously described [27].

DNA extraction
Genomic DNA (gDNA) was extracted from each AEC specimen at Vanderbilt using the DNeasy Blood and Tissue Kit (Qiagen, Hilden, Germany) according to manufacturer protocol and assessed for purity using NanoDrop. Aliquots of genomic DNA extracted from the AEC specimens were de-identified and blinded by VUMC and provided to the University of Toledo (UToledo) through an approved material transfer agreement.

DNA quantification
The number of amplifiable gDNA copies present in each sample was quantified at UToledo using competitive polymerase chain reaction (PCR) amplification of a well-characterized genomic locus in the Secretoglobin, family 1A, member 1 gene (SCGB1A1), as previously described, using SCGB1A1-q primers (Supplementary Table 1) and SCGB1A1 genomic DNA reagents (Accugenomics, Inc., Wilmington, NC) [28,29].

Target primer design
Primer sequences and priming strategy are depicted in Supplementary Table 1 and Supplementary Fig. 1. TP53 primers were designed to amplify both gDNA and complementary DNA (cDNA), although only gDNA was used in this study. An alien tag sequence (APEX) was added to the 5' end of each primer as previously reported [29] to enable addition of barcode/sequencing adaptors in a downstream reaction. Barcode/adaptor primers comprise an Illumina P5/read 1 or P7/read 2 region, a 4-base variable pad region, a 10-base barcode region, and a region complementary to the APEX tag sequence at the 5' ends of the target primers (Supplementary Fig. 1). A complete list of barcodes is presented in Supplementary Table 2. Target-specific primer sequences were designed to optimize PCR efficiency, and primers were synthesized as standard, desalted oligos as a service at Integrated DNA Technologies (IDT, Coralville, IA). Barcode/adaptor primers were synthesized using the IDT Ultramer platform due to the length of these primers (93-96 bases).

Internal Standard (IS) design
Competitive synthetic DNA internal standard (IS) molecules for the TP53 targets described above were designed, using methods previously described, to mimic each target area but with substituted dinucleotides approximately every 50 bases to enable bioinformatic separation of native template (NT) and IS reads following sequencing (details in Supplementary Methods). The IS for each target was included in each assay to control for technical sequencing error as described previously [21,26].

External Complexity Calibration Ladder (ECCL)
We designed a control for PCR amplicon library complexity using a known number of synthetic SCGB1A1 IS molecules to ensure that we reliably measured original genomic copies loaded. The design of the ECCL is provided in Supplementary Methods.

Amplicon library generation
In order to maximize the number of genome copies loaded into the assay for each sample and thereby maximize the opportunity to detect low frequency variants in each target, a multiplex competitive PCR amplicon library was prepared for each AEC gDNA sample [21].
Conditions were optimized to minimize technical error during PCR, including use of Q5 HotStart High Fidelity DNA Polymerase, which has a reported error frequency of 10⁻⁶ (New England Biolabs, Ipswich, MA), and minimization of PCR cycles in each of two rounds (details in Supplementary Methods). Each PCR reaction was prepared containing at least 50,000 genome equivalents of both AEC gDNA sample and ISM.

Sequencing
The purified sequencing library was sent to the University of Michigan Advanced Genomics Core facility for Next Generation Sequencing on an Illumina NovaSeq 6000 SP flow cell with a 20% PhiX spike-in control.

Analysis of NGS Data
FASTQ files were received from the University of Michigan Genomics core facility and processed using the Qiagen CLC Genomics Workbench software suite for quality-trimming, alignment, and variant calling. A modified IS reference genome was made in silico by concatenating each IS reference sequence as a separate contig to the end of the hg19 reference genome. Primer sequences and internal standard dinucleotide positions, plus the nucleotides on their 5' and 3' sides, were excluded from variant analysis. Spiking a known number of synthetic SCGB1A1 IS ladder molecules into each sample controlled for a) sample loading and b) down-sampling of sequencing reads from each library prep to represent the true number of molecules captured for each sample. This step satisfied the Poisson statistical criterion that each event (observed sequencing read) be independent (Supplementary Tables 3 and 4). Down-sampling was completed as follows. First, if the lowest diluted IS in the ladder yielded a minimum threshold number of reads, we inferred that at least 50,000 molecules of SCGB1A1 IS1 were captured. If SCGB1A1 IS represented at lower concentrations in the ECCL did not yield sufficient reads, the expected SCGB1A1 IS1 molecule number was adjusted down from 50,000 accordingly (Supplementary Table 3). Next, measured SCGB1A1 molecules in the sample (NT molecules) were calculated using the formula: Finally, target NT molecules measured were calculated for each TP53 target using the formula:
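As a rough illustration of the molecule calculation, the sketch below assumes the usual competitive internal-standard relationship, in which native molecules equal the NT:IS read ratio multiplied by the number of IS molecules loaded. The function names, the read counts, and the simple read cap used to represent down-sampling are hypothetical and are not taken from the study.

```python
def native_molecules(nt_reads: int, is_reads: int, is_molecules: float) -> float:
    """Estimate native template (NT) molecules from the NT:IS read ratio.

    Assumption: NT molecules = NT reads / IS reads x IS molecules loaded.
    """
    if is_reads <= 0:
        raise ValueError("internal standard reads must be positive")
    return nt_reads * is_molecules / is_reads


def downsample_reads(total_reads: int, molecules: float) -> int:
    """Cap the reads kept for a target at the estimated molecule count.

    Assumption: down-sampling means keeping no more reads than molecules captured.
    """
    return min(total_reads, int(molecules))


# Hypothetical read counts for one sample with 50,000 SCGB1A1 IS1 molecules loaded.
scgb1a1_nt = native_molecules(nt_reads=120_000, is_reads=100_000, is_molecules=50_000)
kept_reads = downsample_reads(total_reads=500_000, molecules=scgb1a1_nt)
print(scgb1a1_nt, kept_reads)
```

Under the same assumption, the function could be applied to each TP53 target using that target's own NT and IS read counts.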

Variant calling
The Basic Variant Detection tool in Qiagen CLC Genomics Workbench software was used to identify every single nucleotide variant (SNV) present in the patient sample NT reads as well as the IS reads for that patient sample. As described above, the methods used to synthesize the IS molecules for this study result in very low synthesis-error variant frequency (VAF < 0.0001%), a level sufficiently low that it would not confound analysis of biological mutations with VAF of 0.01% or higher that were the subject of this study. Thus, any variants in the synthetic IS spike-in measured with VAF > 0.01% resulted from technical error during library preparation or sequencing. As such, the VAF measured for each type of transition/transversion at each base position in the IS enabled limit of blank (LOB) calculation. The Poisson Exact Test (PET) was then used to determine significance of each detected NT variant relative to the LOB measurement in the IS for each type of mutation at each base position, as previously described [26]. A Bonferroni correction for false discovery was used based on the number of nucleotides assessed (193 bp) and the number of substitution mutations possible at each nucleotide position (N = 3). Further, to minimize potential analytical variation resulting from stochastic sampling, only mutations present in > 5 NT molecules measured were included.
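The filtering logic described above can be sketched as follows. This is a minimal illustration, not the published pipeline: it assumes a one-sided Poisson upper-tail test against the count expected from the LOB, and a family-wise alpha of 0.05 (the source does not state the alpha). The function names and example counts are hypothetical.

```python
import math


def _log_pmf(i: int, lam: float) -> float:
    """Log of the Poisson probability mass at i for mean lam (lam > 0)."""
    return -lam + i * math.log(lam) - math.lgamma(i + 1)


def poisson_upper_tail(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    if lam <= 0.0:
        return 0.0
    if k > lam:
        # Above the mean the terms shrink monotonically, so sum the tail directly.
        term = math.exp(_log_pmf(k, lam))
        total, i = 0.0, k
        while term > total * 1e-17:
            total += term
            i += 1
            term *= lam / i
        return min(1.0, total)
    # At or below the mean, subtract the (short) lower tail from one.
    lower = sum(math.exp(_log_pmf(i, lam)) for i in range(k))
    return max(0.0, 1.0 - lower)


def is_significant(variant_molecules: int, nt_molecules: float, lob_vaf: float,
                   n_bases: int = 193, n_substitutions: int = 3,
                   alpha: float = 0.05, min_molecules: int = 5) -> bool:
    """Apply the > 5 molecule filter, then a Bonferroni-corrected Poisson test."""
    if variant_molecules <= min_molecules:
        return False
    expected = lob_vaf * nt_molecules  # variant molecules expected from technical error
    p_value = poisson_upper_tail(variant_molecules, expected)
    return p_value < alpha / (n_bases * n_substitutions)


# Hypothetical variant: 12 variant molecules among 50,000, LOB VAF of 0.002%.
print(is_significant(12, 50_000, 2e-5))
```

With 193 positions and 3 possible substitutions per position, the corrected threshold under the assumed alpha is 0.05 / 579, or about 8.6 × 10⁻⁵.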

Variant annotation and hotspot analysis
Called variants were characterized for pathogenicity using publicly available databases including dbSNP, COSMIC, and FASMIC. Identification of known oncogenic hotspots and generation of corresponding figures were performed using the cBioPortal for Cancer Genomics developed at Memorial Sloan Kettering (MSK) Cancer Center [30].

Statistical analysis

Primary endpoint
The primary endpoint was to determine whether the diagnostic performance of the TP53 biomarker in this expanded case-control cohort was consistent with that reported in the discovery study [21]. This biomarker comprises the number of unique TP53 mutation clones in an AEC specimen obtained by bronchoscopy from each subject (TP53 mutation prevalence). In primary endpoint analysis, the biomarker was assessed for association with lung cancer risk based on the mean TP53 mutation prevalence in the 193 bp assessed in AEC among all cases (mutations/bp/30 subjects) vs all controls (mutations/bp/29 subjects) (Kruskal-Wallis), and the association of TP53 mutation prevalence with cancer diagnosis based on receiver operator characteristic (ROC) area under the curve (AUC).
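For readers less familiar with ROC AUC, the sketch below computes it directly from its rank interpretation: the probability that a randomly chosen case has a higher biomarker value than a randomly chosen control, with ties counted as one half. The per-subject mutation counts are hypothetical, and this is not the software used in the study.

```python
from typing import Sequence


def roc_auc(case_scores: Sequence[float], control_scores: Sequence[float]) -> float:
    """ROC AUC as the fraction of case/control pairs ranked correctly (ties = 0.5)."""
    if not case_scores or not control_scores:
        raise ValueError("both groups must contain at least one subject")
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical unique TP53 mutation counts per subject.
cases = [3, 5, 2]
controls = [1, 2, 0]
print(roc_auc(cases, controls))
```

An AUC of 0.5 corresponds to no discrimination and 1.0 to perfect separation of cases from controls.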

Secondary endpoint
Previously reported significant mutation enrichment among cases a) in tobacco smoke or age signatures, or b) at TP53 "hot-spots" was assessed with the Kruskal-Wallis test using a Chi-square distribution.

Exploratory endpoint
A combined biomarker (CBM) comprising the TP53 biomarker and the PLCO M2012 risk score was calculated using a logistic regression model with flexible functional forms, as previously described [31]. Comparison between the TP53 biomarker alone, PLCO M2012 alone, and the CBM was based on ROC analysis as well as risk distribution based on the sampling distribution of sample proportions (p) and the Wilcoxon rank sum test with continuity correction [31].

Results
Subject characteristics are presented in Table  4).

Diagnostic performance of the TP53 mutation biomarker
A total of 192 unique biological TP53 mutations with VAF > 0.01% were observed within the targeted regions (193 bp) among specimens from the 59 subjects studied. All of these mutations were missense except for one silent mutation observed in a control subject (Supplementary Table 5). After un-blinding, we determined that the mean TP53 mutation prevalence (mutations/bp) among AEC specimens from cases was significantly higher compared to controls (P < 0.001, Kruskal-Wallis) (Fig. 1a). Specifically, of the 192 mutations observed, there were 159 mutations among the cases (prevalence 0.027 mutations/bp) and 33 mutations among the controls (prevalence 0.006 mutations/bp). Notably, TP53 mutation prevalence in AEC was not associated with smoking pack-years among the entire group (N = 59), the 30 cases, or the 29 controls (Table 3). There was also no association of TP53 mutation prevalence with age (Table 3). In this study there was not sufficient information to assess the effect of reported race on association of the TP53 biomarker with lung cancer (Tables 2 and 3).
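The reported group prevalences can be checked with simple arithmetic, assuming prevalence is the total mutation count divided by the targeted bases multiplied by the number of subjects (as the "mutations/bp/30 subjects" notation in the Methods suggests). The function name below is illustrative.

```python
TARGET_BP = 193  # targeted TP53 bases per subject


def mean_prevalence(total_mutations: int, n_subjects: int, target_bp: int = TARGET_BP) -> float:
    """Mean mutations per base pair per subject for a group."""
    return total_mutations / (target_bp * n_subjects)


case_prev = mean_prevalence(159, 30)     # 159 / 5790
control_prev = mean_prevalence(33, 29)   # 33 / 5597
print(round(case_prev, 3), round(control_prev, 3))
```

Under this assumption the values round to 0.027 and 0.006 mutations/bp, matching the figures reported above and corresponding to a roughly 4.7-fold difference between cases and controls.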

Tobacco signature TP53 mutation characteristics
Mutation substitution types with known cigarette smoke exposure association and phenotypic effects were significantly enriched among AEC specimens from cases and closely approximated the spectrum of TP53 mutations reported for lung cancer tissues (Fig. 1b) [32,33]. Specifically, as observed in the discovery study, nearly all of the TP53 mutations in cases were tobacco smoke signature (C > A) or age signature (C > T, T > C) mutations [21,34]. After Bonferroni adjustment, the difference between cases and controls was significant for each of these substitution types: C > A (P = 0.001), C > T (P = 0.0013), and T > C (P = 0.0024). Notably, the proportion of mutations at known TP53 hot-spot locations was significantly higher among cases compared to controls (Supplementary Fig. 2).

Diagnostic accuracy of a CBM comprising the TP53 biomarker and the PLCO M2012 risk score compared with each biomarker alone
A logistic regression model with flexible functional forms, as previously described [31], was used to develop a CBM comprising the PLCO M2012 risk score continuous values (PRS.con) and the TP53 biomarker continuous values (TP53.con) as:

$$P(\text{cancer}) = \frac{e^{X\beta}}{1 + e^{X\beta}}$$

where $X\beta = -3.165853 + 29.31218\,\text{PRS.con} + 0.5462315\,\text{TP53.con}$.

Using the Youden optimal cutoff, the diagnostic accuracy (mean ROC AUC [± 95% confidence limits]) for a CBM comprising the TP53 biomarker with the PLCO M2012 risk score was 0.916 [0.846-0.986] (Fig. 2a), and this was significantly higher than for the PLCO M2012 risk score alone (0.856 [0.763-0.949]) (P < 0.03) (Fig. 2b) or the TP53 biomarker alone (0.845 [0.749-0.942]) (P < 0.03) (Fig. 2c). As presented in Fig. 3, based on probability distribution analysis, the difference between cases and controls was significant for the CBM (P < 0.001), the PLCO M2012 risk score alone (P < 0.001), and the TP53 biomarker alone (P < 0.001). In comparison with the PLCO M2012 risk score alone or the TP53 biomarker alone, the CBM shifted the distribution of intermediate-risk nodules further toward high or low risk.
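To make the fitted model concrete, the sketch below evaluates the reported linear predictor and converts it to a probability with the standard logistic function. It assumes that PRS.con is the PLCO M2012 risk expressed as a proportion (for example 0.05 for 5%) and that TP53.con is the number of unique TP53 mutations per subject; these scalings and the example inputs are assumptions, not values stated in the text.

```python
import math

# Coefficients as reported for the combined biomarker (CBM).
INTERCEPT = -3.165853
BETA_PRS = 29.31218
BETA_TP53 = 0.5462315


def cbm_probability(prs_con: float, tp53_con: float) -> float:
    """Logistic probability from the CBM linear predictor X*beta."""
    x_beta = INTERCEPT + BETA_PRS * prs_con + BETA_TP53 * tp53_con
    return 1.0 / (1.0 + math.exp(-x_beta))


# Hypothetical subjects: same PLCO M2012 risk, different TP53 mutation counts.
for mutations in (0, 2, 6):
    print(mutations, round(cbm_probability(0.05, mutations), 3))
```

Because both coefficients are positive, the predicted probability rises with either a higher PLCO M2012 risk score or a higher TP53 mutation count.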

Discussion
This independent retrospective case-control cohort study confirmed our prior discovery that somatic TP53 mutations are significantly more prevalent in grossly normal AEC specimens from cases with lung nodules diagnosed as cancer compared with controls with lung nodules confirmed to be benign (Fig. 1a). Moreover, confirmation of our previous observation that these TP53 mutations represent tobacco-smoke signatures (Fig. 1b) and are primarily in hot-spot driver mutation sites (Supplementary Fig. 2) [21] further supports the conclusion that they are biological true positives. Importantly, the TP53 biomarker was not associated with smoking pack-years (Table 3). This observation is consistent with our hypothesis that the TP53 biomarker measures the effect of hereditary susceptibility to lung cancer and, therefore, is independent of and synergistic with risk conferred by cigarette smoke exposure. Specifically, it is hypothesized that AEC TP53 mutations and lung cancer each occur predominantly in those individuals with a heavy smoking history who also inherit a set of germ line variants that increase their risk for somatic mutations caused by exposure to inhaled cigarette smoke. Possible mechanisms for hereditary predisposition include sub-optimal DNA repair or antioxidant protection [18,35,36] and higher risk for nicotine addiction [36]. If this hypothesis is correct, the TP53 biomarker may serve as a summation biomarker for both hereditary risk and the degree of acquired risk from smoking and other forms of environmental exposure. For example, individuals with the same pack-year smoking history may acquire a different prevalence of TP53 mutations not only due to differences in hereditary risk, but also due to differences in cigarette smoking characteristics, such as brand of cigarettes, depth of inhalation, or the number of inhalations per cigarette, and/or a difference in inhalational exposure to other carcinogens, such as radon.
Based on ROC analysis, the TP53 biomarker at 100% specificity has a relatively high sensitivity (Fig. 2c). This is in contrast to demographic risk criteria such as the PLCO M2012 risk score, which have moderately high specificity at 100% sensitivity, as observed in our cohort (Fig. 2b). This difference at least partly explains the observation that the CBM had a significantly higher ROC AUC than either the TP53 biomarker or the PLCO M2012 risk score alone (Fig. 2a), and better identified nodule probability of malignancy (Fig. 3). In future studies, through collaboration with the Early Detection Research Network (EDRN), we plan to explore the combination of the TP53 biomarker with other promising biomarkers [31,35,37-41].

Conclusion
The strong performance of the TP53 biomarker in this independent cohort justifies additional validation studies, including analysis of TP53 mutation prevalence alone and in the CBM in AEC specimens, as well as in nasal brushings as potential non-invasive surrogate specimens, from prospective cohorts. If validated, this biomarker may help achieve the key goals of reducing unnecessary invasive tests for benign nodules and reducing time to diagnosis for malignant nodules [31], as well as better stratifying patients for lung cancer prevention trials and improving performance of annual LDCT screening. Specifically, use of this biomarker may enable selection of more optimal eligibility criteria and thereby reduce cost and false positive results associated with prevention trials and LDCT screening. Moreover, this biomarker may increase screening access for individuals who are at increased risk but do not meet current eligibility criteria based on demographic factors alone. The performance of this biomarker in different racial groups will require additional study.

Fig. 1 a Subject group-specific mean mutation prevalence (mutations/bp) in the targeted TP53 exon regions (spanning 193 bp) among control non-cancer smokers (NCS) (N = 21), control non-cancer non-smokers (NCNS) (N = 8), all controls (NC ALL) (N = 29), and cancer (CA) (N = 30) subjects. Prevalence in AEC specimens from CA subjects was significantly higher compared with NC ALL (P < 0.001), while there was no significant difference between NCS and NCNS (Kruskal-Wallis). b Assessment of TP53 mutation signature patterns as prevalence (mutations/bp) according to nucleotide substitution type. After Bonferroni adjustment, there was a significantly higher prevalence of C > A (P = 0.001), C > T (P = 0.0013), and T > C (P = 0.0024) in CA compared with NC ALL (Kruskal-Wallis test using a Chi-square distribution)

Fig. 2 ROC curves with AUC (95% confidence limits) for a CBM comprising the PLCO M2012 Risk Score (PRS) + TP53, b PRS alone, and c TP53 biomarker alone. The AUC for the CBM was significantly higher (P = 0.03) compared to PRS or TP53 alone

Table 2
Summary Demographic Characteristics of Cohort With TP53 Biomarker Data

Table 3
TP53 Biomarker Association with Demographic Characteristics