Global and regional estimates of cancer mortality and incidence by site: I. Application of regional cancer survival model to estimate cancer mortality distribution by site

Background: The Global Burden of Disease 2000 (GBD 2000) study starts from an analysis of the overall mortality envelope in order to ensure that the cause-specific estimates add up to the total all-cause mortality by age and sex. For regions where information on the distribution of cancer deaths is not available, a site-specific survival model was developed to estimate the distribution of cancer deaths by site.

Methods: An age-period-cohort model of cancer survival was developed based on data from the Surveillance, Epidemiology, and End Results (SEER) program. The model was further adjusted for the level of economic development in each region. Combined with the available incidence data, cancer death distributions were estimated, and the model estimates were validated against vital registration data from regions other than the United States.

Results: Comparison with the cancer mortality distribution from vital registration confirmed the validity of this approach. The model also yielded a cancer mortality distribution consistent with the estimates based on regional cancer registries. There was significant variation in relative interval survival across regions, in particular for cancers of the bladder, breast and prostate, melanoma of the skin, and haematological malignancies. Moderate variations were observed among cancers of the colon, rectum, and uterus. Cancers with very poor prognosis, such as liver, lung, and pancreas cancers, showed very small variations across the regions.

Conclusions: The survival model presented here offers a new approach to the calculation of the distribution of deaths for areas where mortality data are either scarce or unavailable.


Background
As a part of the Global Burden of Disease 2000 (GBD 2000) project, the present study aims at estimating the total global and regional cancer mortality and incidence based on its detailed analysis of all-cause mortality levels and cause of death distributions for 191 Member States of the World Health Organization (WHO) [1]. GBD 2000 requires age- and sex-specific incidence, duration and mortality as a minimum input to estimate the burden of each disease sequela by a composite measure of mortality and morbidity (i.e., disability-adjusted life years: DALYs).
Attempts have been made to quantify the global burden of cancer, and estimate site-specific cancer mortality and morbidity [2][3][4][5][6]. Such studies have repeatedly suggested that incidence and mortality from cancer are continuously increasing in many parts of the world. Despite an increasing trend of cancer incidence and mortality, data on survival and prognosis of incident cases from population-based cancer registries are limited in the majority of developing countries. One of the most credible sources of information is the International Agency for Research on Cancer (IARC), which has been coordinating and implementing the cancer registries in such regions [7].
While vital registration of causes of death and national cancer registries are perhaps the best source of data on cancer mortality, mortality data are still scarce, poor or even unavailable for some regions of the world. Innovative methods will thus continue to be needed to exploit available data. Estimating mortality from morbidity and, especially, morbidity from mortality was a common practice in the 1970s and 1980s [8,9]. More recently, the continuous effort made by IARC has led to the Globocan 2000 estimates, which have also used information on incidence and survival from various sources, including cancer registries, to estimate cancer deaths for the year 2000 [2,6]. Still others have made use of vital statistics and cancer incidence data to predict the number of new cancer cases and deaths for the US in the subsequent year [10].
On the basis of available published information on age-, sex-, and site-specific cancer incidence and survival, we developed an algorithm to estimate region-specific overall cancer mortality, and site-specific survival, death distributions and incidence for the year 2000. This paper is the first of two consecutive reports presenting the detailed methods and results of the GBD 2000 estimates for mortality and incidence of cancer by site.
The particular feature of the GBD 2000 study is that the number of deaths by age and sex in each region provides an essential envelope which constrains individual disease and injury estimates of deaths, and that competing claims for the magnitude of deaths from various causes must be reconciled within this envelope [1]. Given the regional cancer mortality envelope by age and sex, the estimates of site-specific distributions of cancer mortality are necessary to disaggregate the estimated total cancer deaths by age and sex for each region.
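As a minimal illustration of the envelope constraint described above, the following sketch splits a total number of cancer deaths for one age-sex group across sites in proportion to a site distribution. The function name, site labels and numbers are hypothetical and are not taken from the GBD 2000 data; the sketch only shows that the site-specific deaths add up to the envelope.

```python
def disaggregate_envelope(total_cancer_deaths, site_weights):
    """Split an all-cancer death envelope across sites.

    site_weights: non-negative relative weights by site (they need not sum
    to 1); they are normalised so the result always adds up to the envelope.
    """
    weight_sum = sum(site_weights.values())
    if weight_sum <= 0:
        raise ValueError("site weights must have a positive sum")
    return {site: total_cancer_deaths * w / weight_sum
            for site, w in site_weights.items()}


# Hypothetical example: 1,000 cancer deaths in one age-sex group.
weights = {"lung": 30.0, "stomach": 20.0, "liver": 10.0, "other": 40.0}
deaths = disaggregate_envelope(1000.0, weights)
```

Because the weights are normalised inside the function, only the relative distribution by site matters, which mirrors the use of the survival model to estimate the distribution rather than the magnitude of cancer deaths.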
For geographic disaggregation of the GBD 2000, the six WHO regions of the world have been further divided into 14 sub regions, based on levels of child (under five years) and adult (15-59 years) mortality for WHO Member States [1]. Five mortality strata were defined in terms of quintiles of the distribution of child and adult mortality (both sexes combined). Adult mortality was regressed on child mortality and the regression line used to divide countries with high child mortality into high adult mortality (stratum D) and very high adult mortality (stratum E). Stratum E includes the countries in sub-Saharan Africa where HIV/AIDS has had a very substantial impact.
When these mortality strata are applied to the six WHO regions, they produce 14 mortality subregions. For the purposes of burden of disease epidemiological analyses, 2 of these regions were further subdivided: EurB into EurB1 and EurB2, the latter including the central Asian states; and WprB into WprB1 (mainly China), WprB2 (South East Asian countries) and WprB3 (Pacific Islands). Additionally, some Member States have been reclassified into subregions with similar epidemiological/geographic/ethnic patterns in order to maximise the epidemiological homogeneity of the subregions for the purposes of epidemiological analysis. The resulting 17 epidemiological subregions are listed in Table 1.
The approaches to estimating mortality distributions differed depending on the availability and quality of data on detailed causes of death. Direct estimates of the site-specific distributions of cancer mortality were possible for the regions where established vital registration records with high coverage and coding practice based on the International Statistical Classification of Diseases and Related Health Problems (ICD) are available, including countries in the A sub regions (AmrA, EurA and WprA) and countries in AmrB, EurB1, EurB2 and EurC [1]. For the other regions of the world (AfrD, AfrE, AmrD, EmrB, EmrD, SearB, SearD, WprB1, WprB2 and WprB3), we developed a site-specific model for relative interval survival adjusted for each region and applied it to the regional incidence estimates to calculate the mortality distribution by site for the year 2000. This model can also be used to estimate survival at different ages and the average duration of cancer by site. In this paper, we present the model in detail as a key input for estimating the distribution of cancer deaths by site for the regions where few data are available.

Incidence data
We initially used the Globocan 2000 estimates of the International Agency for Research on Cancer (IARC) to apply the survival model for a region [6], assuming incidence rates to be constant over the years. We then estimated the region-specific number of new cases for 1986 to 2000 by applying these age-specific incidence rates to the annual population. We carefully examined the methods used to estimate country-specific incidence data in Globocan 2000 to ensure that, for all the regions where we required incidence estimates, the Globocan estimates were based on cancer registry incidence data and not modelled from mortality data using assumptions about survival (which would result in circularity in our mortality estimation process for regions without good mortality data by cancer site).
[...] -14] and 2) Globocan 2000 estimates include cancers of unknown primary with cancers of other specified sites. The GBD 2000 attributes these ill-defined cancers back to specific sites as described above. Accordingly, Globocan 2000 incidence estimates by age, sex, site and country were adjusted for these differences. Firstly, unpublished data on the incidence of Kaposi's sarcoma for countries in Africa were provided by IARC and used to adjust the incidence of other cancers to remove the incidence of Kaposi's sarcoma. Secondly, relative risks of non-Hodgkin lymphoma (NHL) from HIV [15][16][17][18] were estimated and, together with the UNAIDS prevalence estimates of HIV in each country of the African region, used to remove NHL attributable to HIV. Thirdly, incidence estimates for cancers of unknown primary site were redistributed among specific sites using the GBD 2000 algorithm [1]. The proportion of the "others" category (balance of all but skin cancers) in Globocan 2000 corresponding to unknown primary sites was estimated from published data on the distribution of cancer incidence by site which included unknown primary as a specific category [19][20][21][22][23][24][25][26][27].
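The first step described above, applying constant age-specific incidence rates to each year's population, can be sketched as follows. The function name, age groups, rates and populations are hypothetical placeholders, not Globocan 2000 or United Nations figures.

```python
def new_cases_by_year(rates_per_100k, population_by_year):
    """Apply constant age-specific incidence rates to annual populations.

    rates_per_100k: {age_group: incidence rate per 100,000 person-years}
    population_by_year: {year: {age_group: population}}
    Returns {year: {age_group: expected number of new cases}}.
    """
    return {
        year: {age: rates_per_100k[age] * pop / 100_000.0
               for age, pop in pops.items()}
        for year, pops in population_by_year.items()
    }


# Hypothetical rates and populations for two age groups and two years.
rates = {"45-54": 50.0, "55-64": 120.0}
population = {
    1986: {"45-54": 1_000_000, "55-64": 800_000},
    2000: {"45-54": 1_400_000, "55-64": 1_000_000},
}
cases = new_cases_by_year(rates, population)
```

Under the constant-rate assumption, differences in the number of new cases between years arise only from changes in population size and age structure.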
After adjusting the Globocan 2000 incidence estimates for each country as described above, these estimates were summed for the countries in each GBD 2000 region, resulting in estimated incidence distributions by site, age and sex for each region. Finally, the GBD 2000 uses the latest population estimates for the Member States of the World Health Organization prepared by the United Nations Population Division [28]. In order to obtain incidence from 1986 to 2000, we estimated the age-specific population by sex for each of these years, using growth rates also from the United Nations data.

Multiplicative model of relative interval survival
The specification of relative interval survival (RIS_{α,t,τ}) for age α, calendar year t and the interval from τ - 1 to τ years since diagnosis, separately for each cancer site, was of the form:

\[ RIS_{\alpha,t,\tau} = 1 - (1 - RIS_1)\, A_\alpha\, T_t\, Y_\tau \tag{1} \]

where RIS_1 is the relative interval survival after 1 year for all ages, averaged across the calendar years 1973 to 1997; A_α is the ratio of the relative probability of death after 1 year at age α to the relative probability of death after 1 year for all ages, averaged across the calendar years 1973 to 1997; T_t is the ratio of the relative probability of death after 1 year for all ages in calendar year t to the relative probability of death after 1 year for all ages from 1973 to 1997; and Y_τ is the ratio of the relative probability of death after τ years for all ages to the relative probability of death after 1 year for all ages from 1973 to 1997.
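The multiplicative structure can be sketched in a few lines. The sketch assumes, from the parameter definitions above, that the three indexed terms multiply the baseline 1-year probability of death, i.e. RIS = 1 - (1 - RIS_1) × A × T × Y. The function name, the clipping of the result to the interval [0, 1] and the numerical values are assumptions made for illustration only.

```python
def relative_interval_survival(ris_1, a_age, t_period, y_duration):
    """Multiplicative model: RIS = 1 - (1 - RIS_1) * A * T * Y.

    ris_1:      baseline relative interval survival after 1 year (all ages)
    a_age:      indexed age term A
    t_period:   indexed calendar-year (or regional) term T
    y_duration: indexed time-since-diagnosis term Y
    The result is clipped to [0, 1]; the clipping is an assumption of this
    sketch, not something stated in the text.
    """
    p_death = (1.0 - ris_1) * a_age * t_period * y_duration
    return min(1.0, max(0.0, 1.0 - p_death))


# Hypothetical parameter values, not taken from Tables 2 to 6.
ris = relative_interval_survival(ris_1=0.90, a_age=1.2,
                                 t_period=0.9, y_duration=0.5)
```

With all three indexed terms equal to 1, the function returns the baseline RIS_1, which is the sense in which the terms are indexed to the all-ages, all-years, first-year probability of death.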
To estimate survival for developing regions where few or no data are available, we replaced T_t with the "equivalent" calendar year survival term T_r in equation (1) for each region. T_r is the ratio of the relative probability of death after 1 year for all ages in the relevant region to the relative probability of death after 1 year for all ages in the SEER data, averaged across the calendar years 1973 to 1997. In this way, we obtain a new calendar year survival term for the model.
Equivalent period survival terms were estimated for each region by examining the relationship between period survival terms and gross domestic product per capita (measured in international dollars adjusted for purchasing power parity) using the following data: 1) SEER survival data for the USA for the years 1973 to 1997 [11,29]; 2) Connecticut survival data for the years 1950 and 1958 [30]; 3) survival data for the late 1980s from cancer registries in 5 developing countries [31]; 4) survival data for European countries [32]; 5) specific recent national estimates of cancer survival as published [24,33]. Survivorship functions were estimated to derive regional relative survival from registry data by fitting a Weibull distribution function. To allow for a proportion who are cured and never die from the cancer, we modified the Weibull model as follows:

\[ S(t) = k + (1 - k)\exp\left[-(\lambda t)^{\gamma}\right] \]

where S(t) is the proportion surviving at time t since diagnosis, k is the proportion who never die from the cancer, λ is the location parameter (1/λ is the time by which a proportion 1 - e^{-1}, about 63%, of those who will die have died) and γ is the shape parameter. The mean survival time for those who die is given by

\[ \frac{1}{\lambda}\,\Gamma\!\left(1 + \frac{1}{\gamma}\right) \]

where Γ denotes the gamma function. The analysis of survival data in developed regions suggested that the 10-year relative survival can be used as an estimate of the proportion who never die from the cancer. This is particularly useful when relative survival point estimates fluctuate significantly and plausible exact solutions cannot be obtained, as in some developing regions. To assess the goodness-of-fit of the survival curve, we compared the fitted 5-year survival with the observed survival and confirmed a good fit in all cases.
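A Weibull survivorship function with a cured fraction can be sketched as below. The sketch assumes the standard parameterisation S(t) = k + (1 - k) exp(-(λt)^γ), under which the mean survival time of those who die is Γ(1 + 1/γ)/λ; the function names and example values are hypothetical.

```python
import math


def cure_weibull_survival(t, k, lam, gamma):
    """Proportion surviving at time t with cured fraction k.

    Assumed form: S(t) = k + (1 - k) * exp(-(lam * t) ** gamma).
    """
    return k + (1.0 - k) * math.exp(-((lam * t) ** gamma))


def mean_survival_of_fatal_cases(lam, gamma):
    """Mean survival time among those who eventually die of the cancer."""
    return math.gamma(1.0 + 1.0 / gamma) / lam


# Hypothetical parameters: 30% cured, lam = 0.5 per year, shape 1.0.
s5 = cure_weibull_survival(5.0, k=0.3, lam=0.5, gamma=1.0)
mean_t = mean_survival_of_fatal_cases(lam=0.5, gamma=1.0)
```

With shape γ = 1 the fatal cases follow an exponential distribution, so the mean survival time reduces to 1/λ, which gives a simple check on the gamma-function expression.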
There are substantial variations in relative interval survival (all ages) among countries; these variations are even larger, and fluctuate substantially with age, when the age-sex specific survival estimates are examined. There is also a possibility of correlation among the observations within the same region. Thus we employed both linear and non-linear feasible generalised least squares (FGLS) by maximum likelihood estimation to accommodate heteroscedasticity [...]
[...] where l_x is the number of individuals surviving at exact age x in the life table, h_x = ln(l_{x+1}/l_x), α is age and τ is time since diagnosis.
The number of individuals S_{α,τ} who had survived up to 2000 was obtained by multiplying incidence and observed interval survival for the corresponding year of age and survival time, where I_{α,t} is the incidence at age α in calendar year t. The probability of dying in 2000 due to cancer hazard, P_{α,τ}, was calculated for each single age and year of survival. For each site, the number of deaths in 2000 among those individuals aged α years who had survived cancer for τ years was estimated by multiplying the number of survivors S_{α,τ} by the relevant probability of dying in 2000 due to cancer hazard P_{α,τ}. The total number of cancer deaths for the same site in 2000 at age α years was then estimated by summing over all survival times τ. Based on these region-specific adjusted incidence estimates and survival levels, cancer deaths were calculated by equations (3)-(6) for each region by age group and sex to estimate the distribution, but not the magnitude, of cancer deaths by site, sex, and age group.
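The final summation step can be sketched as follows, taking the survivors S_{α,τ} and the probabilities of dying P_{α,τ} as given inputs. The sketch does not reproduce the equations used to derive S and P; the function names, data layout and numbers are hypothetical.

```python
def deaths_by_age(survivors, p_death):
    """Sum S * P over all survival times tau for each age.

    survivors: {(age, tau): number alive in 2000, tau years after diagnosis}
    p_death:   {(age, tau): probability of dying in 2000 from the cancer}
    Returns {age: estimated deaths in 2000 for one cancer site}.
    """
    totals = {}
    for (age, tau), n_alive in survivors.items():
        totals[age] = totals.get(age, 0.0) + n_alive * p_death[(age, tau)]
    return totals


def site_shares(deaths_by_site):
    """Convert site-specific deaths into a distribution that sums to 1."""
    total = sum(deaths_by_site.values())
    return {site: d / total for site, d in deaths_by_site.items()}


# Hypothetical survivors and probabilities for one site.
S = {(60, 1): 100.0, (60, 2): 50.0, (61, 1): 80.0}
P = {(60, 1): 0.30, (60, 2): 0.20, (61, 1): 0.25}
site_deaths = deaths_by_age(S, P)
shares = site_shares({"lung": 40.0, "liver": 10.0})
```

Only the shares, not the absolute numbers, would be carried forward, since the model is used to estimate the distribution of cancer deaths by site within the regional mortality envelope.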

Validation of the model
We tested the validity and performance of the proposed survival model in three different ways. Firstly, we compared our estimated RIS_{α,t,τ} for τ = 1 to 10 years for individuals diagnosed with cancer in 1986 with the SEER RIS_{α,t,τ} for τ = 1 to 10 years for the same cohort of individuals. [...] AfrE, SearB and SearD sub regions) to assess whether our model estimates are comparable to the estimates extrapolated from the actual observed data from the registries. In all cases, non-parametric tests for trend and Pearson's correlation were performed to examine whether the model estimates and observed data are consistent with each other. All statistical analyses were performed with Stata 7.0 (Stata Corporation, College Station, TX).

Parameter estimates
Relative interval survival (RIS) was directly obtained from the SEER database for every age group, sex, and cancer site. The probability of death in the first year (1 - RIS_1), which is most crucial for the survival of most cancer patients, is shown in Table 2. The probability of death varied significantly, from less than 5% for melanoma and cancers of the breast, uterus, and prostate to over 80% for pancreas and liver cancers.
Relative interval survival (RIS) was further indexed to the three parameters in the multiplicative cancer survival model by dividing each time-specific probability of death (1 - RIS) by the probability of death in the first 1-year interval (1 - RIS_1). Tables 3, 4 and 5 present the indexed estimates of the three parameters by site: for every 5-year age group from 0 to 85+ years of age, for calendar years from 1981 to 1995, and for time since cancer diagnosis from 1 to 15 years, respectively. While there is considerable variation in the cohort parameters, which reflect the prognosis among patients since the time of diagnosis, both age and period parameters are generally consistent across different types of cancer.
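The indexing step described above amounts to a simple ratio of probabilities of death. The following sketch uses hypothetical RIS values and a hypothetical function name; it is not based on the SEER figures in Tables 2 to 5.

```python
def index_to_baseline(ris_by_key, ris_1):
    """Index probabilities of death to the baseline 1-year value.

    ris_by_key: {key: relative interval survival}, where the key may be an
                age group, a calendar year or a time since diagnosis
    ris_1:      baseline relative interval survival after 1 year
    Returns {key: (1 - RIS) / (1 - RIS_1)}.
    """
    baseline_p_death = 1.0 - ris_1
    if baseline_p_death <= 0:
        raise ValueError("baseline probability of death must be positive")
    return {key: (1.0 - ris) / baseline_p_death
            for key, ris in ris_by_key.items()}


# Hypothetical age-specific RIS values with a baseline RIS_1 of 0.80.
age_terms = index_to_baseline({"45-49": 0.85, "75-79": 0.70}, ris_1=0.80)
```

An indexed value below 1 indicates a lower probability of death than the baseline, and a value above 1 a higher one.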
Based on the fitted data for each site and sex, and the estimated GDP per capita in international dollars for each region in 2000, T_r factors were estimated for each site and sex for each GBD 2000 region. The results are presented in Table 6. An example is shown for breast cancer: given a GDP per capita in AfrD of $1,158 in 2000, the corresponding indexed calendar year-specific term was T_t = 2.748. This was then the value used in the age-period-cohort survival model for breast cancer in the AfrD sub region. A similar process was applied to the other regions and for other cancer sites.
The period parameters (T_r) for all the available survival data and the fitted values from regression analysis were plotted against GDP per capita (international dollars) for each site and sex, as shown in Figures 1 to 6. The largest variation in survival was observed among cancers of the breast, melanoma of the skin, and hematological malignancies such as lymphoma and leukemia. For cancers of the cervix and of the colon and rectum, both early detection and availability of treatment determine survival, and the variation among regions was moderate. On the other hand, cancers with very poor prognosis, such as liver, lung, and pancreas cancers, showed very small variations across the regions regardless of the levels of national income.

Model performance and validation
In order to check the performance of the model, we graphically compared our estimated RIS_{α,t,τ} for τ = 1 to 10 years for individuals diagnosed with cancer in 1986 with the SEER RIS_{α,t,τ} for τ = 1 to 10 years for the same cohort of individuals. We show the results obtained for males and females 55-59 years old, and for every cancer site, in Figures 7, 8 and 9. These figures show that the model predicts relative interval survival very well. For those cancer sites with greater numbers of cases, such as colon, lung, breast, corpus uteri, and prostate cancer, the model fits very well. For those with smaller numbers, such as cancers of the liver and pancreas, the estimated RIS smooths the curves for the observed RIS, also showing a very good fit.
We also tested the validity of our model when applied to other populations. We chose the age groups from 45 to 79 years, in which cancer mortality rates are relatively stable and the probability of miscoding of cause of death is small. Figures 10, 11 [...] points in age groups from 45 to 79 were also presented. When compared with the site-specific mortality distribution of Globocan 2000 based on regional cancer registries for the AFRO and SEARO regions, the model estimates also yielded a consistent mortality distribution pattern (Figure 12). In all cases, the correlation coefficients were in the range of 0.91 to 0.98, suggesting that model estimates for these regions are quite consistent with the observed cancer mortality distribution.
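The agreement between modelled and observed site distributions is summarised above by Pearson's correlation coefficient. A minimal sketch of that calculation is given below; the function name and the two short example series are hypothetical and do not reproduce the data behind Figures 10 to 12.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical site shares (%) from the model and from vital registration.
model_share = [25.0, 15.0, 10.0, 5.0]
observed_share = [24.0, 17.0, 9.0, 6.0]
r = pearson_r(model_share, observed_share)
```

A coefficient close to 1 indicates that sites with a large modelled share of deaths also have a large observed share, which is the sense in which the two distributions are described as consistent.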

Probability of 5-year survival and mean duration by site
The proposed model also yields RIS for various years and the mean duration of cancer by site, both of which are important inputs for the future estimation of cancer morbidity burden in terms of years lived with disability (YLDs).
As an illustration of the further use of our model, Figures 13 and 14 show, respectively, the conventional 5-year survival and average duration for female cancer patients aged 45-54 in four different sub regions (AfrE, AmrA, SearD and WprB3). Depending on the site, the chance of 5-year survival and the average duration varied considerably across the regions, which is consistent with the estimated survival pattern above.

Discussion
As a part of the Global Burden of Disease 2000 (GBD 2000) study, we have developed a multiplicative model of relative interval survival for cancer by site based on the best available evidence from published population-based survival data from both developed and developing countries. Because of the sparseness of survival data for the developing regions of the world, we decided to use all the available data, including the most valuable cancer registry data compiled by the International Agency for Research on Cancer (IARC), national cancer registries, and lengthy time series data from the United States, to establish trends in survival with gross domestic product (GDP) per capita, and then to use the latest estimates of GDP per capita for developing regions in order to estimate survival by site.
This approach takes into account, through increases in average GDP per capita for regions, the likely improvements in survival over the periods since those for which developing country survival data are available. Since our survival model adjusted for age and differences in competing mortality in each population by employing relative interval survival, the remaining variations in survival are likely to be due to differences in diagnosis and availability of appropriate treatment options. For instance, large variation in survival was observed among cancers of the bladder and breast, melanoma of the skin, and hematological malignancies such as lymphoma and leukemia, for which effective therapy is established in developed regions. For cancers of the cervix and of the colon and rectum, both early detection and availability of treatment determine survival, and the variation among regions was moderate. On the other hand, cancers with very poor prognosis showed very small variations across the regions. The survival patterns across the regions are consistent with a previous analysis based on the IARC cancer registry data [35].
The proposed model takes into account time in its three dimensions: age, calendar year (period) and time since cancer diagnosis (cohort). Owing to the availability of data, the model outcome was compared with the data reported by the US vital statistics and by other regions of the world. This has given us the opportunity to evaluate our model and the data available.
However, perhaps the main advantage of this approach to estimating regional survival distributions by cancer site for developing regions is that the model estimates survival, smooths it, and ensures that regional survival estimates are consistent with trends in survival across all regions, even where the numbers for some cancer sites are small, fluctuate widely, and are inconsistent with other regions. For example, as can be seen in Figure 1a, cancer registry survival estimates for some sites in some developing countries are better than recent experience in the United States, or significantly below the trend line with GDP per capita, suggesting that survival may have been overestimated due to small numbers or incomplete case follow-up. In these cases, the survival model provides survival estimates more consistent with the complete body of evidence. The second advantage of the proposed approach is that the model is flexible enough to yield survival estimates for various ages, years and periods, as well as the mean duration of cancer by site. In addition to mortality and incidence estimates [36], such information is required to estimate the cancer burden in terms of disability-adjusted life years (DALYs) in the future analysis for the GBD 2000 [1].
The main limitations in applying this model were the relative lack of region-specific survival data and the scarcity of regional cancer incidence data, which are probably not always representative, for some developing regions. We assumed that cancer incidence reported by a few countries of one region or sub region would represent the incidence of the whole area, which may not always be the case.
It has been suggested that model-based estimates of cancer mortality in the previous GBD 1990 study did not reflect the actual profile of cancer recorded at the regional registries, in particular the site-specific cancer mortality distribution [3,5,37]. Although population-based estimates from cancer registry data should be incorporated, they may not be representative of the countries as a whole. Such estimates are sometimes restricted to certain geographic areas and are also related to the extent of the health care and surveillance system. Furthermore, several developing regions of the world were not included in these estimates, and the need to produce model-based estimates would persist.
Contrary to the previous GBD 1990 model, the present survival model specifically developed for the GBD 2000 incorporated all available survival information obtained from registries and corrected for possible bias. The model was used to estimate the distribution of deaths by site, not the actual magnitude of cancer mortality, in regions where few or no data on detailed cause of death are available. In fact, the model estimates were quite comparable to the mortality distribution estimated from vital registration records. Furthermore, for the regions where vital records are not available, our model was consistent with the Globocan 2000 estimates based on the regional registries [6].

Figure 11
Mortality distribution by site: comparison between model estimates and vital registration data in three sub regions (EurB2, EurC, and WprA). Cancer site: 1 = mouth and pharynx, 2 = oesophagus, 3 = stomach, 4 = colon and rectum, 5 = liver, 6 = pancreas, 7 = trachea, bronchus and lung, 8 = melanoma of the skin, 9 = breast, 10 = cervix uteri, 11 = corpus uteri, 12 = ovary, 13 = prostate, 14 = bladder, 15 = lymphomas and multiple myeloma, 16 = leukaemia. r = Pearson's correlation coefficient when analysed with all data in age groups 45-79.

Figure 12
Mortality distribution by site: comparison between model estimates and previous estimates based on cancer registration data in AFRO and SEARO regions. Cancer site: 1 = mouth and pharynx, 2 = oesophagus, 3 = stomach, 4 = colon and rectum, 5 = liver, 6 = pancreas, 7 = trachea, bronchus and lung, 8 = melanoma of the skin, 9 = breast, 10 = cervix uteri, 11 = corpus uteri, 12 = ovary, 13 = prostate, 14 = bladder, 15 = lymphomas and multiple myeloma, 16 = leukaemia. r = Pearson's correlation coefficient when analysed with all data in age groups 45-79.

Conclusions
The survival model presented offers a new approach to the calculation of the number and distribution of deaths for areas where mortality data are either scarce or unavailable. It can also be applied in areas with good quality data, but where there are small numbers of some site-specific cancers. The model is flexible enough to estimate some of the parameters required to estimate the cancer burden. In our future work, we will attempt to collect further information on survival and incidence from more individual countries in order to improve our estimates, with more precise inputs for the model.

Competing Interests
None declared.