Minimum follow-up time required for the estimation of statistical cure of cancer patients: verification using data from 42 cancer sites in the SEER database

Background The five-year survival rates in common use are not adequate to represent statistical cure. In the present study, we established the minimum number of years of follow-up required to estimate the statistical cure rate, using a lognormal distribution of the survival times of patients who died of their cancer. We introduce the term threshold year: the follow-up time by which the survival times of patients dying from the specific cancer are almost entirely covered, leaving about 2.25% uncovered. Survival at this point is close enough to cure from that specific cancer.
Methods Data from the Surveillance, Epidemiology and End Results (SEER) database were used to test, with a minimum chi-square method, whether the survival times of cancer patients who died of their disease followed a lognormal distribution. Patients diagnosed from 1973–1992 in the registries of Connecticut and Detroit were chosen so that a maximum of 27 years was allowed for follow-up to 1999. A total of 49 specific organ sites were tested. The parameters of the lognormal distributions were found for each cancer site. The cancer-specific survival rates at the threshold years were compared with the longest available Kaplan-Meier survival estimates.
Results For 42 of the 49 cancer sites, the survival times of cancer patients who died of their disease were verified to follow lognormal distributions, with different parameters for each site. The threshold years validated for statistical cure varied between cancer sites, from 2.6 years for pancreatic cancer to 25.2 years for cancer of the salivary gland. At the threshold year, the statistical cure rates estimated for 40 cancer sites matched the actuarial long-term survival rates estimated by the Kaplan-Meier method within six percentage points. For two cancer sites, breast and thyroid, the threshold years were so long that the cancer-specific survival rates could not yet be obtained, because the SEER data do not provide sufficiently long follow-up.
Conclusion The present study suggests that, for each cancer site, follow-up must reach a certain threshold year before the statistical cure rate can be estimated. For some cancers, such as breast and thyroid, the 5- or 10-year survival rates inadequately reflect statistical cure rates, which highlights the need for long-term follow-up of these patients.


Background
The normal distribution is often used to describe the random variation of data in many scientific disciplines. However, some distributions are skewed, with low mean values and large variance, and may take only positive values; examples include the duration of survival of cancer patients with chronic leukemias [1], the incubation time of infectious diseases [2], and the abundance of biological species [3]. These skewed distributions often fit the lognormal distribution [4][5][6][7]. A random variable follows a lognormal distribution if its logarithm follows a normal distribution. The lognormality of the survival time of cancer patients who died of their disease has been tested and applied for various anatomic sites [8][9][10][11][12][13][14][15][16][17][18][19][20][21].
The term survival rate is commonly used, yet it is inaccurate. Five- and ten-year survival rates are commonly reported in the literature, but the Kaplan-Meier and life-table (actuarial) methods estimate the proportion (or fraction) of survivors. In this study, to be precise, the term "survival rate" means "survival fraction" expressed as a percentage, and the term "cure rate" means "cure fraction" expressed as a percentage.
It is difficult to know the statistical cure rate, which is an estimate based on statistical models [11,15], especially for slowly proliferating tumors. Because of this limitation, oncologists usually discuss survival in terms of 5-year or 10-year survival rates. In certain rapidly growing tumors, the cancer-specific survival rate [22] reaches a plateau within 5-10 years and approaches the statistical cure rate, which is the survival rate observed when there is no further risk of death from the disease. In the Kaplan-Meier method [23], a person with residual cancer who died of a cause other than the specific cancer is censored for cancer-specific survival. In the present study we applied the analysis to cancer-specific survival rates. Statistical cure is reached at the plateau of the Kaplan-Meier plot of the cancer-specific survival rate. For some fast-proliferating cancers, such as pancreatic and stomach cancers, the plateau appears within 10 years. However, for some slowly proliferating cancers, such as thyroid and early breast cancers, the plateau does not appear even after decades. Hence the commonly used five-year survival rates are not adequate to represent the statistical cure rate.
In the present study, we established the minimum number of years of follow-up required to estimate the statistical cure rate, using a lognormal distribution of the survival times of patients who died of their cancer and applying the result derived by Limpert et al. [24]. This minimum number of years of follow-up was defined as the threshold year, given by µ* × (σ*)², where µ* is the median and σ* is the multiplicative standard deviation of a lognormal distribution (see Additional file 1: Appendix). By the threshold year, follow-up covers most of the survival times of patients who die of the specific cancer, leaving about 2.25% uncovered, so the survival rate at that point is close to the statistical cure rate for that cancer.
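The threshold-year definition can be sketched in a few lines of Python. The function name `threshold_year`, the use of natural logarithms, and the example values are assumptions made for illustration; the text reports M and S in log-months without stating the base of the logarithm.

```python
import math


def threshold_year(m_log_months: float, s_log_months: float) -> float:
    """Threshold year tau = mu* x (sigma*)^2, returned in years.

    Assumes natural logarithms, so that mu* = exp(M) is the median survival
    in months and sigma* = exp(S) is the multiplicative standard deviation.
    """
    median_months = math.exp(m_log_months)
    mult_sd = math.exp(s_log_months)
    return median_months * mult_sd ** 2 / 12.0


# Hypothetical site (not taken from Table 1): median 12 months, sigma* = 2.
tau = threshold_year(math.log(12.0), math.log(2.0))
print(round(tau, 2))  # 4.0
```

Because σ* enters squared, a modest increase in the spread of survival times lengthens the threshold year much more than the same relative increase in the median.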

Data sources
We analysed the 1973-1999 Database of the Surveillance, Epidemiology, and End Results (SEER) Program [25] of the United States National Cancer Institute. Data from registries of Connecticut and Detroit were used in this study and results were compared with the SEER-9 registries. The cancer sites were chosen according to the SEER codes. Primary site, vital status, cause of death, and survival time information were used. Those patients with unknown or zero survival time and unknown cause of death were excluded.
To test for lognormality of the survival times of patients who died of their specific cancer, patients diagnosed from 1973-1992 in the registries of Connecticut and Detroit were chosen so that a maximum of 27 years was allowed for follow-up to 1999. A total of 49 specific cancer sites were tested. For prostate, salivary gland, breast, and thyroid cancer, patients diagnosed from 1973-1977 were chosen to allow a long enough time for follow-up to 1999. For cancer sites with high frequencies of older patients, such as lung and bronchus, colon, prostate, and breast, only patients younger than 60 years were chosen, because elderly patients are more likely to die of intercurrent diseases, which affects the distribution of cancer-specific deaths versus other deaths.
As an illustration for special cohorts of interest, breast cancer was further analyzed based on accepted prognostic factors such as stage, histologic type and tumor grade, to show that the threshold years were different for different cohorts of interest.

Statistical analysis
The survival times of the cancer patients who died of their disease were tested for goodness of fit to a lognormal distribution using a minimum chi-square method. The class intervals were powers of 2 in months of survival time: 0-2, >2-4, >4-8, >8-16, and so on. The values of M, the mean of log(survival time), and S, the standard deviation of log(survival time), were varied until the chi-square value reached a minimum, and the values of M and S at that minimum were taken as the estimates. The null hypothesis states that there is no difference between the observed data distribution and the lognormal distribution; it is rejected if P < 0.05.
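A minimal sketch of such a minimum chi-square fit is given below. It is not the authors' implementation: the grid search, the open-ended last class, the natural logarithm, and all function names are assumptions, and the survival times are synthetic rather than SEER data. Converting the minimum chi-square value to a P-value is left out.

```python
import math
import random


def lognormal_cdf(t, m, s):
    """P(T <= t) when ln(T) is normal with mean m and standard deviation s."""
    if t <= 0.0:
        return 0.0
    z = (math.log(t) - m) / s
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def power_of_two_edges(max_months):
    """Class boundaries 0, 2, 4, 8, ... extended until they cover max_months."""
    edges = [0.0, 2.0]
    while edges[-1] < max_months:
        edges.append(edges[-1] * 2.0)
    return edges


def bin_counts(times, edges):
    """Count survival times in the classes (edges[i], edges[i + 1]]."""
    counts = [0] * (len(edges) - 1)
    for t in times:
        for i in range(len(counts)):
            if edges[i] < t <= edges[i + 1]:
                counts[i] += 1
                break
    return counts


def chi_square(counts, edges, m, s):
    """Pearson chi-square between observed counts and lognormal expectations."""
    n = sum(counts)
    last = len(counts) - 1
    total = 0.0
    for i, observed in enumerate(counts):
        # Treat the last class as open-ended so expected counts sum to n.
        upper = 1.0 if i == last else lognormal_cdf(edges[i + 1], m, s)
        expected = max(n * (upper - lognormal_cdf(edges[i], m, s)), 1e-12)
        total += (observed - expected) ** 2 / expected
    return total


def min_chi_square_fit(times, m_grid, s_grid):
    """Grid search for the (chi2, M, S) triple with the smallest chi-square."""
    edges = power_of_two_edges(max(times))
    counts = bin_counts(times, edges)
    return min((chi_square(counts, edges, m, s), m, s)
               for m in m_grid for s in s_grid)


# Synthetic survival times in months (hypothetical, not SEER data).
rng = random.Random(0)
times = [rng.lognormvariate(2.5, 1.0) for _ in range(2000)]
m_grid = [2.0 + 0.05 * i for i in range(21)]
s_grid = [0.6 + 0.05 * i for i in range(17)]
chi2, m_hat, s_hat = min_chi_square_fit(times, m_grid, s_grid)
print(f"chi2={chi2:.2f}, M={m_hat:.2f}, S={s_hat:.2f}")
```

A grid search is used only to keep the sketch dependency-free; any numerical minimiser over M and S would serve the same purpose.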
Let τ be the threshold year at which the statistical cure rate can be estimated. The cancer-specific survival rate can then be obtained by the Kaplan-Meier method with follow-up time equal to τ. This cancer-specific survival rate at time τ is an estimate of the statistical cure rate of the disease. It was compared with the long-term cancer-specific survival rate calculated using the Kaplan-Meier method with the actual long-term data available up to 1999 (Figures 1 and 2).
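The comparison can be illustrated with a small Kaplan-Meier sketch. The product-limit estimator itself is standard, but the function names, the ten-patient cohort, and the value of τ below are hypothetical and chosen only to show the mechanics; deaths from other causes are entered as censored observations.

```python
def kaplan_meier(times, events):
    """Product-limit estimate as a list of (time, survival) at each death time.

    events[i] is True for a death from the specific cancer and False for a
    censored observation (death from another cause or end of follow-up).
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += 1 if data[i][1] else 0
            removed += 1
            i += 1
        if deaths:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed
    return curve


def survival_at(curve, t):
    """Survival fraction at time t, read off the step function."""
    s = 1.0
    for time, surv in curve:
        if time > t:
            break
        s = surv
    return s


# Hypothetical cohort: follow-up in years and cancer-death indicators.
times = [1, 2, 2, 3, 4, 5, 6, 8, 10, 12]
events = [True, True, False, True, False, True, False, False, False, False]
curve = kaplan_meier(times, events)
tau = 6.0  # hypothetical threshold year
s_tau = survival_at(curve, tau)
s_long = survival_at(curve, max(times))
print(round(100 * s_tau, 1), round(100 * s_long, 1))  # 54.9 54.9
```

In this toy cohort no cancer deaths occur after τ, so the estimate at τ and the long-term estimate coincide; with real data the two would differ by the small tail of late deaths.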

Lognormality
The present study verified that, for 42 specific organ sites out of 49 cancer sites in the SEER database, the survival time of cancer patients who died of their disease approximately followed different lognormal distributions. For the cancer sites with cancer-specific survival time following a lognormal distribution, the number of patients and the values of S, multiplicative standard deviation, M, median and P at minimum chi-square are listed in Table 1. All the P-values in Table 1 are above 0.10.

Table 1: List of cancer sites following lognormal distribution and their S (standard deviation of the log-survival), multiplicative standard deviation, M (mean of the log-survival), and median values at minimum chi-square with maximum P-value. The order of organ sites follows Table 2 in ascending order of the threshold years. SD*: multiplicative standard deviation of the lognormal distribution; †: M values in log-months; ‡: median in months; Pt. No.: number of patients who died of the specific cancer; #: nervous system other than brain.
The seven specific organ sites that failed the test for lognormality were: lip, oropharynx, rectosigmoid junction, rectum, testis, urinary bladder, and kidney & renal pelvis.

Estimation of statistical cure rate
The threshold years validated for statistical cure were found to range from 2.6 years for pancreatic cancer to 25.2 years for cancer of the salivary gland. For the cancer sites whose survival times followed a lognormal distribution, the cancer-specific survival rates were obtained by the Kaplan-Meier method at the threshold year and compared with the corresponding long-term survival rates with follow-up to 1999. Of the 42 cancer sites whose survival times approximately followed a lognormal distribution, the statistical cure rates at the threshold year for 40 sites matched the actuarial long-term survival rates estimated by the Kaplan-Meier method within six percentage points. For the two remaining cancer sites, breast and thyroid, the threshold years were so long that the cancer-specific survival rates could not be obtained, because the SEER data up to 1999 do not provide sufficiently long follow-up. Table 2 shows the comparison of the cancer-specific survival rates at τ years and at long-term follow-up.
Breast cancer was further analyzed according to accepted prognostic factors. The threshold years by stage in the SEER classification, localized (age < 50), regional (age < 50) and distant, were 30.8, 27.9, and 13.1 years respectively. For the histologic types medullary, ductal combined with adenocarcinoma not otherwise specified (NOS), and lobular, the threshold years were 17.0, 30.6, and 46.0 years respectively. The threshold years by breast tumor grade were 39.8, 26.3, and 20.7 years for Grades I, II, and III+IV respectively. The breast cancer-specific survival rates at their threshold years were 9% for distant stage, 68% for medullary histology, and 40% for Grades III+IV breast cancer. These three survival rates were only one percentage point higher than the available long-term actuarial survival rates. For regional stage and Grade II breast cancer, the threshold years were 27.9 and 26.3 years respectively, so it can be predicted that a few more years of follow-up are needed to see the plateau and that their statistical cure rates are close to 39% and 47% respectively. The most recently available SEER database now extends to the end of 2001. With 2 additional years of follow-up from the end of 1999, the cancer-specific survival rates were 39% and 44% respectively. For cohorts with τ values longer than 27 years, the cancer-specific survival rates could not be obtained because the SEER data to date do not provide sufficiently long follow-up.

Discussion
Cancer-specific survival does not count deaths due to other causes, and cancer-specific rates depend on the reliability of the assignment of the cause of death. Generally, cancer-specific death rates underestimate the mortality associated with a diagnosis of the specific cancer, because some patients die of other causes [26]. SEER is a set of geographically defined, population-based, central cancer registries in the United States, operated by local non-profit organizations under contract to the National Cancer Institute (NCI). Registry data are submitted electronically to the NCI on a biannual basis, and the NCI makes the data available for analysis. The SEER Program is considered the standard for quality among cancer registries around the world. Quality control has been an integral part of SEER since its inception. Every year, studies are conducted in SEER areas to evaluate the quality and completeness of the data being reported.
Gamel and Vogel [27] have described the advantages of the lognormal distribution over other distributions such as the Weibull and log logit.
The 1973-1992 data were used so that the data were not outdated. For a lognormal distribution of the survival times of patients who died of the specific cancer, only a very small proportion died in the tail of follow-up (Figure 1), so the lognormal distribution would change little even though the most recently diagnosed patients had only 7 years of follow-up to 1999. For cancer sites with threshold years longer than 24 years, the 1973-1977 data were used to allow a long enough time for follow-up to 1999.
For 42 sites in the SEER database, the survival times of cancer patients who died of their disease followed lognormal distributions with site-specific parameters. For 40 cancer sites, the ultimate cure rate can be roughly estimated from the cancer-specific survival rates at τ years. These τ values are the minimum numbers of follow-up years required to estimate the cure rates, and they differ between cancer sites. For pancreatic cancer, with its typically short natural history, the cure rate can be estimated after only 2.6 years. Cancers with a longer natural history, such as breast cancer (36.2 years), require longer follow-up. These long periods are cancer-specific survival times, and in reality patients may die from intercurrent non-cancer causes before τ years. For thyroid cancer, the estimated threshold year was 134.1 years. It seems that for some slowly proliferating cancer types, the cure rate can never be estimated because of the limit of the human lifetime.
We also found that the required minimum number of years of follow-up, τ, is independent of the cure rate (coefficient of determination, R² = 0.10). Even for cancer sites where the cure rates were >50%, the required follow-up time τ could be less than 10 years. Conversely, for other cancer sites the cure rates could be <50% while the required follow-up time was >10 years. This shows that 5- or 10-year survival rates are inadequate to reflect the statistical cure rates.
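For readers who want to reproduce this kind of check, the sketch below computes a coefficient of determination as the squared Pearson correlation between threshold years and cure rates. That this is the exact statistic used in the study is an assumption, and the five (τ, cure rate) pairs are hypothetical values for illustration, not entries from Table 2.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


# Hypothetical threshold years and cure rates (%), not data from the study.
tau_years = [3.0, 6.0, 12.0, 20.0, 25.0]
cure_pct = [10.0, 60.0, 30.0, 70.0, 20.0]
r2 = r_squared(tau_years, cure_pct)
print(f"R^2 = {r2:.2f}")
```

A value near 0 indicates that knowing τ says little about the cure rate, which is the sense in which the two quantities are described as independent.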
If more patients die of other causes than of the specific cancer, the cause-specific survival time distribution will not be lognormal. Hence not all cancer-specific survival time distributions are expected to follow lognormal distributions.
Because the logarithm of the survival time is normally distributed, the interval from 0 to τ years, which ends two standard deviations above the mean on the log scale, covers about 97.75% of the lognormally distributed survival times of patients who died from their specific cancer. The cancer-specific survival rates estimated at τ years generally overestimate the long-term cure rates from the Kaplan-Meier method slightly, but the differences are reasonably small, less than six percentage points as verified empirically. Patients must still be followed to τ years to obtain the actual value of the estimated cure rate.
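The coverage figure follows from a short calculation, sketched here under the assumption that natural logarithms are used, so that M = ln µ* and S = ln σ*; the symbol T (survival time of a patient who dies of the specific cancer) and Φ (the standard normal distribution function) are notation introduced for this sketch.

```latex
\begin{align*}
\tau &= \mu^{*} \times (\sigma^{*})^{2}
      \quad\Longrightarrow\quad \ln \tau = M + 2S, \\
P(T \le \tau) &= P\!\left(\frac{\ln T - M}{S} \le 2\right)
      = \Phi(2) \approx 0.977 .
\end{align*}
```

The result agrees, to within rounding, with the coverage of roughly 97.75% quoted in the text.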
For both rapidly and slowly proliferating cancers, we have shown that the statistical cure rates can be estimated before a stable plateau is reached in the Kaplan-Meier survival curve. It may take decades to see a stable plateau, and during this waiting time many patients might be lost to follow-up or die of intercurrent diseases.
Gamel and Vogel [28] used cause-specific survival and relative survival to determine actuarial survival in breast cancer patients from the SEER database. They found that there was only minimal deviation between the two survival methods.
Ries et al. [29] reported up to 20-year relative survival rates (RSR) from 9 registries of the SEER database. For cancers of pancreas, esophagus and stomach, the RSR were slightly decreased after five years since diagnosis. These are consistent with the threshold years of the present study varying from 2.6 years for pancreas, to 3.9 years for esophagus, and to 5.8 years for stomach. Dickman et al. [30] [...] study are similar to those from 9 registries and from the Finnish and Swedish Cancer Registries.

Figure 1: The survival time for cancer patients who died of their disease. Axes: Frequency against Time (months).

Figure 2: Kaplan-Meier survival curve. Cancer-specific survival rate of cancer patients who died of their disease calculated by Kaplan-Meier method. Note the small difference between the cancer-specific survival rate at τ year as compared to the long-term survival rate. The single arrow shows the location of the threshold year (τ). The pair of opposing arrows at the long-term follow-up shows that the difference between the estimated and actual cancer-specific rate is small, less than six percentage points. The survival time for cancer patients who died of their disease for stomach cancer is shown as an example. Axes: Cancer-specific survival rate (%) against Time (months).

For prostate cancer, the threshold year for statistical cure was 24.6 years. In the article of Talback et al., the RSR graph started to level off 24 years after diagnosis.
For breast cancer, the threshold year was 36.2 years; in the article of Talback et al., leveling of the RSR was not seen even 30 years after diagnosis. Leveling of the RSR in breast cancer was also not seen in two separate studies by Schairer et al. [32] and Brenner and Hakulinen [33]. Kerr et al. [34] reported that the ratio of observed to expected mortality remained significantly greater than unity for at least 25 years following diagnosis and treatment, indicating a failure to demonstrate cure of the disease in a statistical sense for a median of 32 years of follow-up.

Conclusion
The present study suggests that follow-up must reach a certain threshold year before the statistical cure rate can be estimated, and that this threshold differs between cancer sites. Although the often used 5- or 10-year survival rates may adequately reflect statistical cure rates for cancers with short natural history, such as pancreatic cancer, this is not the case for many other cancers. This highlights the need for continued long-term cancer surveillance, especially for cancers with long natural histories such as thyroid cancer and early stage, well-differentiated breast cancer. This study is relevant for public health and cancer control. Whether knowledge of the threshold year will have any impact on decisions regarding therapy for cancers thought to have a good prognosis remains to be investigated.