Estimation of age- and stage-specific Catalan breast cancer survival functions using US and Catalan survival data

Background During the last part of the 1990s the chance of surviving breast cancer increased. Changes in survival functions reflect a mixture of effects. Both, the introduction of adjuvant treatments and early screening with mammography played a role in the decline in mortality. Evaluating the contribution of these interventions using mathematical models requires survival functions before and after their introduction. Furthermore, required survival functions may be different by age groups and are related to disease stage at diagnosis. Sometimes detailed information is not available, as was the case for the region of Catalonia (Spain). Then one may derive the functions using information from other geographical areas. This work presents the methodology used to estimate age- and stage-specific Catalan breast cancer survival functions from scarce Catalan survival data by adapting the age- and stage-specific US functions. Methods Cubic splines were used to smooth data and obtain continuous hazard rate functions. After, we fitted a Poisson model to derive hazard ratios. The model included time as a covariate. Then the hazard ratios were applied to US survival functions detailed by age and stage to obtain Catalan estimations. Results We started estimating the hazard ratios for Catalonia versus the USA before and after the introduction of screening. The hazard ratios were then multiplied by the age- and stage-specific breast cancer hazard rates from the USA to obtain the Catalan hazard rates. We also compared breast cancer survival in Catalonia and the USA in two time periods, before cancer control interventions (USA 1975–79, Catalonia 1980–89) and after (USA and Catalonia 1990–2001). Survival in Catalonia in the 1980–89 period was worse than in the USA during 1975–79, but the differences disappeared in 1990–2001. Conclusion Our results suggest that access to better treatments and quality of care contributed to large improvements in survival in Catalonia. On the other hand, we obtained detailed breast cancer survival functions that will be used for modeling the effect of screening and adjuvant treatments in Catalonia.


Background
Mortality from breast cancer has been decreasing in the majority of industrialized countries since the beginning of the 1990s. Extended use of mammography screening and improvements in adjuvant therapies have been indicated as the causes of this decline. Adjuvant therapies are drugs, usually chemotherapy or hormonotherapy, used as additional treatments for patients with cancers that are thought to have spread beyond their original sites.
On one hand, screening with mammography may reduce breast cancer mortality, because more tumors are detected at earlier stages where survival time is longer. On the other hand, concurrently with mammography use, the introduction of adjuvant therapies also may have increased the survival time at all stages of the disease and modified the survival functions.
Based on a collaborative effort, seven research groups modeled the US breast cancer mortality trends from 1975 to 2000. The Cancer Intervention and Surveillance Modeling Network (CISNET) is a consortium of the US National Cancer Institute that sponsored and oversaw this initiative [1]. All groups shared the same model inputs: incidence, survival after breast cancer diagnosis and mortality by other causes of death, over time. The goal of the modeling process was to determine the contribution of screening mammography and adjuvant treatments on the reduction in breast cancer mortality [2]. One of the CIS-NET groups, lead by Marvin Zelen and Sandra Lee at the Dana-Farber Cancer Institute in Boston [3], developed a stochastic model that requires specific breast cancer survival probability density functions (pdf s) by age and disease stage at diagnosis.
Catalonia is an autonomous region of Spain with authority over health-care planning, administration, and provision since 1985. It has approximately 7.2 million inhabitants, one sixth of the Spanish population. The Catalan Health Department has implemented an independent information system and preventive programs in the region. Population breast cancer screening programs were initiated in the early 1990s [4]. All this facilitates the collection of almost all the required inputs to use the Lee and Zelen's model to study the impact of mammography and adjuvant treatments in Catalonia. But Catalan survival information is not reliable enough to directly derive the survival functions by age and disease stage at diagnosis. While stage-specific Catalan survival functions are robust, when stratifying by age and stage the number of women at risk over time decreases and the survival estimates are unstable (see Additional file 1). Since younger women are affected by more agressive tumors, it is important to estimate age-and stage-specific survival functions.
This work presents the methodology used to estimate ageand stage-specific Catalan breast cancer survival functions from scarce Catalan survival data by adapting the age-and stage-specific US functions. In addition, we compared breast cancer survival in Catalonia and the USA in two time periods, before and after the introduction of screening mammography.

Methods
This section is structured in four parts: a) The obtention of the US breast cancer hazard rate functions, by stage and also by age and stage; b) The description of the existent and inferred Catalan survival data; c) The estimation of the hazard ratios (HR) or relative hazards that compare Catalonia versus the US, which sometimes are timedependent; and d) The estimation of age-and stage-specific breast cancer survival functions for Catalonia using the US functions and the estimated HR.

US survival data
Background survival data To understand how early screening and adjuvant therapies contributed to the reduction of the breast cancer mortality it is necessary to consider breast cancer survival prior to these two interventions. During the period of 1975-79, the two factors were absent in the USA. Thus, background data and functions refer to these years. Survival information following cancer diagnosis was retrieved from the National Cancer Institute's Surveillance Epidemiology and End Results (SEER) Program [5]. SEER currently collects and publishes cancer incidence and survival data from population-based cancer registries covering approximately 26% of the US population.
For the period 1975-79, women diagnosed with breast cancer were classified using the SEER historical summary of disease stage at diagnosis (localized, regional, and distant). Localized is defined as disease confined entirely to the organ of origin, regional is disease that has extended beyond the limits of the organ of origin into surrounding organs or tissue and/or regional lymph nodes, and distant is disease with metastasis. This classification was used in the comparison of survival in the USA and Catalonia. From the historical summary stage information and additional data on the extent of disease, the CISNET inferred breast cancer survival functions for the US population in 5 age groups (30-39, 40-49, 50-59, 60-69 and 70-84 years) and for 5 disease stages at diagnosis (American Joint Committee on Cancer (AJCC) staging system: I, IInode negative (II-), II-node positive (II+), III, IV) [6]. These 25 age-and stage-specific functions will be used to obtain the detailed Catalan survival functions.

Recent survival data
Estimates of the US breast cancer specific survival functions (hazard, density and cumulative survival) for the period 1990-2001 were obtained using data output from the SEER*Stat Survival software developed at the National Cancer Institute [5]. The AJCC disease stage information is provided in detail for this period: I, II, III, IV, and whether regional nodes are affected or not. Data selection was done using the default parameters and the Display Standard Life by Cause-Specific Survival option. Women aged 30-84 years and with a diagnosis of breast cancer were grouped in stages I, II-, II+, III and IV.

Hazard functions
The previously mentioned SEER and CISNET data sources record cases with 25 years of follow-up after diagnosis for women diagnosed during the background period, and cases with 16 years of follow-up for women diagnosed during the recent period. The information provided was the number of women alive at the beginning of the interval (N t ) and the number of breast cancer deaths (D t ) and women lost to follow-up (or withdrawals, W t ) in time intervals of 1 year [t, t + 1). Then the conditional hazard rates ( ) by age and stage were estimated as the number of breast cancer deaths divided by person-years at risk The hazard rates estimated using expression (1) draw a step function with fluctuations. We smoothed these (u) and obtained continuous hazard rate functions using a restricted cubic spline model (natural splines) [7]. First, we used the function rc_spline from the STATA software to create a series of covariates that are functions of the independent variable time from diagnosis and predefined knots [8][9][10]. These covariates take the form of piecewise cubic polynomials between adjacent knots. Then, we run a weighted linear regression model with dependent variable (u) estimated in (1) and the specified covariates. We used knots at the 1.

Catalan survival data
Background and recent survival data Catalan breast cancer survival data is not available at the population level. There are two population based cancer registries that cover an area of 20% of the Catalan population. We obtained breast cancer survival data from one of these registries, the Girona Cancer Registry (GCR), which currently covers an area of 700 000 inhabitants and represents approximately 10% of the Catalan population. Breast cancer data from 1975 to 2005 collected by the GCR was considered representative of the Catalan breast cancer data. In 1986 the female population covered by the registry was 227 228 women, 7.45% of the Catalan female population (3 050 749 women) [11]. The reporting system was based on information from the area hospital records, pathology laboratories, and death certificates extracted from the Catalan Mortality Registry of the Catalan Government's Department of Health [12]. For the period 1980-89, the estimated exhaustivity was 96.7%, 93.2% of cases were histologically verified [11], and the percentage of unstaged cases was around 7%. For comparison, SEER provided 4% of non-classified stages for the same period.
For background analysis we used the period 1980-89 for two reasons. First, during this period screening and adjuvant therapies had no impact on the Catalan population. And second, data prior to this period was incomplete and inaccurate. For the recent analysis, all women incident from 1990 to 2001 were included. Ages selected for thê analysis were 30 to 84 years old. Additional file 1 presents the number of women diagnosed of breast cancer by age and stage in the GCR.

Relative survival
Relative breast cancer survival is defined as the ratio between observed breast cancer survival and expected survival in the general population. It corrects the excess of mortality experienced by patients diagnosed with breast cancer. Relative survival can be interpreted as the probability of cancer survival after adjustment for competing causes of death [13] and is a close estimate of cause-specific survival [14]. Relative survival was estimated using a web-based application (WAERS [15]) developed by the Catalan Institute of Oncology which uses the Hakulinen's method to estimate the expected survival [16], the Kaplan-Meier's method to estimate the observed survival and the Greenwood's method [17] to estimate confidence intervals.
WAERS provided the relative cumulative survival from GCR data. Then relative breast cancer cumulative survival data were smoothed using the same method as for US hazards. We fitted a cubic spline (knots at the 1.5, 2.5, 4.5, 8.5, 10.5 and 12.5 time points) and a regression model weighted by the number of women at risk at the beginning of each interval.

Hazard ratios (HR) of breast cancer mortality, Catalonia versus USA
Our goal was to obtain age-and stage-specific Catalan breast cancer survival functions. But the number of breast cancer cases in the GCR precluded estimation of these functions from the GCR data. Thus, we looked for a relation between the USA and the GCR (t) at the disease stage level, the hazard ratio (HR).
Since the size of the US population covered by SEER and the Girona province populations were very dissimilar, we used the USA data as reference data. First, we used the GCR relative cumulative survival and the number of all-cause deaths to estimate the number of deaths due to breast cancer (D bc ) in the GCR at each year interval [t, t + 1) after diagnosis (see first section in the Appendix for further details).
Second, we multiplied the by the GCR person-years at risk to estimate the expected number of breast cancer deaths (expected bc ) in the GCR, if the hazard rates were those of the USA (see second section in the Appendix). Third, we fitted a Poisson regression model with dependent variable and with the as an offset variable. This corresponds to a proportional hazards model, with the usual Cox baseline hazard λ 0 replaced by the known hazard function from the US data. When the proportional hazards assumption was not fulfilled we included a function with time from diagnosis as a covariate. The log of time (log(t)) was the function that worked best. We included it in the models when it was statistically significant [18].
We checked for overdispersion by fitting negative binomial regression models and assessing the significance of the extra variation parameter. The Poisson model was adequate in all cases. In addition, we assessed the goodnessof-fit of the model using the deviance. All the final models had non-significant p-values (> 0.05) for the deviance χ 2 test. Degrees of freedom were equal to the number of observations minus 1, or minus 2 if time was included as a covariate.
The following equations reflect the Poisson model and the expression used to obtain the estimated HR: Then, the stage-specific hazard ratios were obtained as: The model was fitted for the first 14 years after breast cancer diagnosis, because after 14 years of follow-up the number of breast cancer deaths approaches 0. After year 14 we assumed that the HR was constant and equal to the year 14 estimated value.
Ninety-five percent confidence intervals (95% CI) for the HR were obtained using the expression: When the HR was constant over time (log(HR) = β 0 ), the 95% CI for the HR were estimated as: When the HR was time-dependent (HR(t)), the variance of the log(HR(t)) was obtained using the expression: Then, we used expression (4) to obtain the 95% CI of the HR.
For the period of time prior to the dissemination of mammography (background), we used the 1975-1979 US hazard rates and the 1980-1989 GCR person-years at risk to estimate the GCR breast cancer deaths (expected bc ). We performed this analysis by historical stage of disease (localized, regional, distant). For the recent period, we used the 1990-2001 data from the GCR and the 1975-79 data for the USA, using the AJCC five disease stages at diagnosis (I, II-, II+, III, IV).
The following scenarios provided the periods chosen for the comparisons: Additionally, from the HR obtained in C1 and C3, we can assess the differences in breast cancer survival between the USA and Catalonia, in the recent past, before the dissemination of screening with mammography.

Estimation of age and stage-specific Catalan survival functions: λ, cumulative survival, and pdfs
We used the age-and stage-specific USA functions in the period 1975-79 and the HR by stage from C1 and C2 to estimate the age-and stage-specific Catalan functions for the periods 1980-89 and 1990-2001, respectively. We used the same HR for all age groups within disease stages: To obtain the AJCC (t) for Catalonia in 1980-89 we applied the historical stage localized HR to the USA hazard rate functions for AJCC stages I and II-. We used the historical stage regional HR for AJCC stages II+ and III, and the disseminated HR for the AJCC stage IV.
We also assessed whether the (t) functions, estimated using expression (7), fit the data well by using the goodness-of-fit deviance statistic. We obtained the deviance between the estimated breast cancer deaths in the GCR and the predicted number of deaths obtained assuming a Poisson distribution with parameter (t) with the offset term being the number of person-years at risk. In all cases the p-values of the χ 2 deviance tests were higher than 0.01, except for C1-Disseminated, C2-I, C2-II+, and C3-II+.
To estimate confidence bands for the variance of we used the delta method [19]. First, we applied the logarithmic transformation: Then, assuming independence of the hazard ratios and the USA hazard rates: The term was obtained from the Poisson regression models and the term was obtained using the delta method as follows. The delta method stays that if we want to approximate the variance of G(X) where X is a random variable with mean μ and G is differentiable, then ( 1 0 ) and  Once the cumulative survival function S(t) is estimated, the survival can be obtained from the relation [17]:

Software
We used the Mathematica v6.0.3 Grid Edition software package for almost all computations [20]. STATA 10.0 was used to smooth data, and to fit the Poisson models [21]. SEER*Stat provided USA breast cancer specific cumulative survival for period 1990-2001 [5]. Figure 1 shows the raw and smoothed US hazard rates, by stage and time period (historical and AJCC stages for 1975-79 and AJCC stages for 1990-2001). Figure 2 presents the observed and smoothed US hazard rates, by age groups and AJCC stages, for the 1975-79 (background) period. The hazard rate functions in Figure 2, together with the estimated HR, will be used to derive the hazard rate functions in Catalonia for the background (1980-89) and recent (1990-2001) time periods.

USA hazard functions
Data in Figure 2 displays the high variability of the observed US hazard rates for young women at favorable stages and justifies the need for smoothing and deriving these functions for Catalonia. Table 1 presents the coefficients and standard errors of the Poisson models that make it possible to estimate the HR. When the HR are not time-dependent (β 1 = 0) the estimated HR are log(β 0 ). In three situations both parameters β 0 and β 1 were not statistically different from 0, meaning that the compared (t) functions for Catalonia and the USA were similar. Figure 3 shows the estimated HR and their 95% confidence intervals. Horizontal lines that correspond to HR = 1 refer to non statistically different functions. HR > 1 indicate that the risk of breast cancer death in Catalonia was higher than in the USA and HR < 1 mean the opposite. It is interesting to note that in some cases the HR values cross the HR = 1 line, indicating a change in the risk relation which needs to be interpreted by taking into account the confidence intervals. In addition, when the are low, changes in the HR may not be rellevant, for example in the early stages of disease.   Table 2).     Table 2: Estimated relative cumulative survival at five (S(5)) and ten (S (10)) years after breast cancer diagnosis in Catalonia and the USA.

Source
Stage ing estimated HR. Although Figure 5 shows some overlap between the 95% confidence intervals for the 1980-89 and the 1990-2001 hazards, the estimated functions reflect important improvements in breast cancer survival in Catalonia for all age groups and stages of disease.

Discussion
Our study achieves two goals. We have obtained age-and stage-specific breast cancer survival functions for Catalonia for two time periods, 1980-89 and 1990-2001. These functions can be used 1) to assess changes over time in survival after breast cancer diagnosis, and 2) to generate breast cancer pdf s that are needed to model the impact of early detection and adjuvant treatments in the reduction of breast cancer mortality in Catalonia.
The results of this study show that breast cancer survival in Catalonia during the 1980s was worse than in the USA during the second half of the 1970s. But, during the 1990s breast cancer survival in Catalonia experienced great improvements and ended up being comparable to US survival for most disease stages.
Previous studies compared trends in breast cancer incidence, mortality, and survival in 10 European countries between 1983 and 1994 [22]. Spain and Italy started with the lowest incidence and mortality rates which increased markedly up to the mid 1990s, probably due to changes in lifestyle and diet over time [23]. Despite low mortality, the Spanish relative breast cancer survival in 1984 was among the worst ones according to the EUROCARE study [22]. Fortunately, the increasing trends in breast cancer incidence and mortality (with a maximum around mid 1990s) were accompanied by the steepest increase in relative survival from 1984 to 1994, of all European countries studied, (8% per year) [22]. Whereas breast cancer impact on mortality increased, relative survival improved, probably due to the rapid dissemination of scientific and medical advances.
Estimated age-(years) and stage-specific Catalan hazard rate functions From mid 1990s to the present, breast cancer mortality in Catalonia shows a decreasing trend [24]. The introduction of early screening and adjuvant treatments may have played an important role, as in most of the western countries [25]. Both cancer control interventions seem to have contributed similarly to the reduction in breast cancer mortality in the USA [2]. In Catalonia, the contribution of each still remains to be evaluated and the results of this study will be used as inputs to the Lee and Zelen model [3]. In this work we have observed that breast cancer survival improved from the 1980s to the 1990s within disease stages. Although earlier time of diagnosis for mammography detected cases (lead time bias) may play a role in the estimated survival distributions for the 1990s, the improvement within disease stages may be related to the introduction of better treatments, too.
When comparing breast cancer survival in different populations it is necessary to take into account that differences in overall mortality and life expectancy influence the cause-specific survival in these populations [13,26,27]. Relative survival and cancer-specific survival provide similar results [14,16]. In relative survival, the total mortality hazard is assumed to consist of the general population mortality hazard and of an excess hazard which is attributable to cancer. In the cause-specific survival, on the other hand, the cause-specific hazard is estimated using the cause of death information only. The cause-specific survival rate is calculated by censoring the survival time related to a non-cancer death. The relative survival analysis is often preferred over the cause-specific survival since the cause of death information in the cancer registries may be unreliable or unavailable. In this study, we have worked with breast cancer relative survival which was Estimated age-(years) and stage-specific Catalan cumulative survival obtained considering the life tables of the Catalan population [15].
Since survival data in Catalonia was not available at the population level and estimations of survival functions by age groups and disease stages would have been inaccurate and not reliable, we decided to customize the US survival functions. The limitations of our original data led us to undertake this study. Our assumption was that the Catalan functions had age and stage patterns similar to the US functions, but that levels of risk of breast cancer mortality were the Catalan observed levels. As the number of breast cancer cases in the GCR was small compared to the US SEER cases, instead of using the Cox regression model to compare both geographical areas, we used a Poisson regression model that handled the US as the reference group [28]. The Poisson regression model also made it possible to deal with time varying hazard ratios. We assessed the proportionality of hazards assumption and, when necessary, we included log(time) as a covariate.
Another limitation intrinsic to the type of data that we used is the noise and fluctuation of hazard rates. To deal with this problem we smoothed the data and weighted it by the number of person-years at risk in each time interval. Figures 1 and 2 show that the adjusted hazard rate functions capture the main trends of the raw data. There are several examples in the literature that combine fitting a spline function to the hazards and obtaining a HR as a function of time or log(time) [29,30].
We assessed the calibration of the Catalan estimated functions using the goodness-of-fit deviance statistic. The test did not reject the null hypothesis in all but four of the cases. The discrepancies can be attributed to the instability of both D bc and person-years at risk in the GCR. In general, Figure 4 shows that the observed and predicted values of the Catalan cumulative survival functions are very similar. Therefore, the methods used seem to provide a robust estimate of the breast cancer survival functions in Catalonia.

Conclusion
In summary, we have combined limited breast cancer survival data from Catalonia (Spain) with more robust data from a bigger population, the USA, to obtain robust survival functions for Catalonia. Our methodology has used standard statistical approaches that can be performed using available software. The results of our study are useful for public health and clinical purposes since they provide information on the progress against breast cancer in Catalonia. In addition, our results provide a comparison of Catalonia and the USA in terms of the prognosis of women diagnosed with breast cancer. Finally we have generated survival probability density functions for Cata-lonia that will be used in the modeling of the impact of screening mammography and adjuvant treatments. 4. From the 1,000 replicated hazard rate functions we estimated the variance. The 95% confidence interval limits were obtained using the percentiles 2.5 and 97.5 of the bootstrapped hazard rates.

Estimation of the number of breast cancer deaths in the GCR
To estimate the number of breast cancer deaths at each time interval after breast cancer diagnosis, ( ), we used data from the GCR and also the relative breast cancer survival data obtained using the WAERS web application [15]. Data available from the GCR were the number of women at risk at the beginning of each time interval (N t ), the number of all-cause deaths ( ) and the number of women lost to follow-up (W t ). The relative breast cancer cumulative survival (S bc (t)) provided by the WAERS application made it possible to derive the conditional probability of dying from breast cancer, ( ), at each interval [t, t + 1): Then, solving the following equation for each interval [t, t + 1): We estimated the number of breast cancer deaths ( ) as:

Estimation of the woman-years at risk in the GCR
Woman-years at risk for each time interval [t, t + 1) were obtained using the expression N t -0.5 ( + W t ) (see previous section for notation). The implied assumption is that either deaths or losses to follow-up occur uniformly in each time interval. In this formula it is not necessary to distinguish if deaths were from breast cancer or other causes. Both types of deaths contribute similarly subtracting woman-years at risk.
The number of woman-years at risk in each time interval has been used to estimate the expected number of breast cancer deaths (expected bc ) in the GCR, if the hazard rates were the US rates, as described in the Methods section.