### Incidence data

We initially used the Globocan 2000 estimates of the International Agency for Research on Cancer (IARC) to apply the survival model for a region [6], assuming incidence rates to be constant over the years. We then estimated the region-specific number of new cases for 1986 to 2000 by applying these age-specific incidence rates to the annual population. We carefully examined the methods used to estimate country-specific incidence data in Globocan 2000 to ensure that, for all the regions where we required incidence estimates, the Globocan estimates were based on cancer registry incidence data and not modelled from mortality data using assumptions about survival (which would result in circularity in our mortality estimation process for regions without good mortality data by cancer site).

Globocan 2000 estimates of cancer incidence by site for countries differ from those required for the GBD 2000 in two major respects: 1) Globocan 2000 estimates include Kaposi's sarcoma and non-Hodgkin lymphomas (NHL) caused by HIV/AIDS, whereas the GBD 2000 includes these cases among AIDS sequelae and their burden is included with the HIV/AIDS burden [12–14]; and 2) Globocan 2000 estimates include cancers of unknown primary with cancers of other specified sites, whereas the GBD 2000 attributes these ill-defined cancers back to specific sites as described above. Accordingly, Globocan 2000 incidence estimates by age, sex, site and country were adjusted for these differences. Firstly, unpublished data on the incidence of Kaposi's sarcoma for countries in Africa were provided by IARC and used to adjust incidence of other cancers to remove incidence of Kaposi's sarcoma. Secondly, relative risks of NHL from HIV [15–18] were estimated and, together with the UNAIDS prevalence estimates of HIV in each country of the African region, used to remove NHL attributable to HIV. Thirdly, incidence estimates for cancers of unknown primary site were redistributed among specific sites using the GBD 2000 algorithm [1]. The proportion of the "others" category (balance of all but skin cancers) in Globocan 2000 corresponding to unknown primary sites was estimated from published data on the distribution of cancer incidence by site which included unknown primary as a specific category [19–27].

After adjusting the Globocan 2000 incidence estimates for each country as described above, these estimates were summed for the countries in each GBD 2000 region, resulting in estimated incidence distributions by site, age and sex for each region. Finally, the GBD 2000 uses the latest population estimates for the Member States of the World Health Organization prepared by the United Nations Population Division [28]. In order to obtain incidence from 1986 to 2000, we estimated the age-specific population by sex for each of these years, using growth rates also taken from the United Nations data.
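The incidence step can be illustrated with a minimal sketch. It assumes that the population of an earlier year is back-projected from the 2000 population with a constant annual growth rate; the text only states that United Nations growth rates were used, so the geometric form, the function names and the per-100,000 rate unit are all illustrative assumptions.

```python
def population_in_year(pop_2000, growth_rate, year):
    """Back-project the 2000 population to an earlier year.

    Assumes constant geometric growth (an illustrative assumption,
    not necessarily the projection used for the GBD 2000).
    """
    return pop_2000 / (1.0 + growth_rate) ** (2000 - year)


def new_cases(rate_per_100k, population):
    """New cases implied by an age-specific incidence rate per 100,000."""
    return rate_per_100k / 100_000 * population


def cases_by_year(rate_per_100k, pop_2000, growth_rate, years=range(1986, 2001)):
    """Apply one constant incidence rate to each year's population, 1986 to 2000."""
    return {
        year: new_cases(rate_per_100k, population_in_year(pop_2000, growth_rate, year))
        for year in years
    }


# Hypothetical inputs for one age-sex group in one region.
example_cases = cases_by_year(rate_per_100k=50.0, pop_2000=1_000_000, growth_rate=0.02)
```

Because the rate is held constant, the year-to-year change in new cases in this sketch comes entirely from the change in population.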

### Multiplicative model of relative interval survival

In order to estimate cancer death distribution for regions where little cancer mortality data is available (AfrD, AfrE, AmrD, EmrB, EmrD, SearB, SearD, WprB1, WprB2 and WprB3 sub regions), we developed an age-period-cohort multiplicative model for the relative interval survival (*RIS*) for each site. To incorporate all three time dimensions, we took into account the relative survival for every 5-year age group from 0 up to 85+ years of age, for each calendar year over 15 years (1981 to 1995), and for time since cancer diagnosis (survival time for cohorts) from 1- up to 15-year survival. After obtaining the time-specific survival data, we indexed all the age, time, and calendar year survival information to the first-year interval survival for each sex and cancer site. The first year of survival was chosen because, for most if not all cancer sites, it is the most critical year for cancer survival experience. After the first year of survival, the relative survival curve usually increases and then flattens smoothly. Indexing was done by dividing each of the time-specific *RIS* by the survival at the 1-year interval.

The specification of relative interval survival (*RIS*_{α,t,τ}) for age α, calendar year *t* across the interval (τ - 1) since diagnosis in years, separately for each cancer site, was of the form:

*RIS*_{α,t,τ} = 1 - (1 - *RIS*_{1}) *A*_{α} *T*_{t} *Y*_{τ} (1)

where *RIS*_{1} is the relative interval survival after 1 year for all ages, averaged across the calendar years 1973 to 1997; *A*_{α} is the ratio of the relative probability of death after 1 year at age α to the relative probability of death after 1 year for all ages, averaged across the calendar years 1973 to 1997; *T*_{t} is the ratio of the relative probability of death after 1 year for all ages in calendar year *t* to the relative probability of death after 1 year for all ages from 1973 to 1997; and *Y*_{τ} is the ratio of the relative probability of death after τ years for all ages to the relative probability of death after 1 year for all ages from 1973 to 1997.
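As a minimal sketch of equation (1), the function below combines the baseline first-year survival with the three multiplicative terms. The argument names and the numerical values are hypothetical and are used only to show how the terms interact.

```python
def relative_interval_survival(ris_1, a_age, t_period, y_interval):
    """Equation (1): RIS = 1 - (1 - RIS_1) * A_alpha * T_t * Y_tau.

    ris_1       baseline relative interval survival after 1 year, all ages
    a_age       age term A_alpha (ratio of relative death probabilities)
    t_period    calendar-year term T_t
    y_interval  time-since-diagnosis term Y_tau
    """
    return 1.0 - (1.0 - ris_1) * a_age * t_period * y_interval


# Hypothetical values: an older age group (A > 1), a slightly better
# calendar period (T < 1) and a later interval since diagnosis (Y < 1).
example_ris = relative_interval_survival(ris_1=0.80, a_age=1.2, t_period=0.9, y_interval=0.5)
```

When all three ratios equal 1 the function returns the baseline *RIS*_{1}, which is what the indexing to first-year survival implies.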

To estimate survival for developing regions where little or no data is available, we replaced *T*_{t} with the "equivalent" calendar year survival term *T*_{r} in equation (1) for each region. *T*_{r} is the ratio of the relative probability of death after 1 year for all ages in the relevant region to the relative probability of death after 1 year for all ages in the SEER data, averaged across the calendar years 1973 to 1997. In this way, we obtained a new calendar year survival term for the model.

Equivalent period survival terms were estimated for each region by examining the relationship between period survival terms and gross domestic product per capita (measured in international dollars adjusted for purchasing power parity) using the following data: 1) SEER survival data for the USA for the years 1973 to 1997 [11, 29]; 2) Connecticut survival data for the years 1950 and 1958 [30]; 3) survival data for the late 1980s from cancer registries in 5 developing countries [31]; 4) survival data for European countries [32]; 5) specific recent national estimates of cancer survival as published [24, 33]. Survivorship functions were estimated to derive regional relative survival from registry data by fitting a Weibull distribution function. To allow for a proportion who are cured and never die from the cancer, we modify the Weibull model as follows:

S(*t*) = *k* + (1 - *k*) exp (-(λ *t*)^{γ}) (2)

where *k* is the proportion who never die from the cancer, λ is the rate (inverse scale) parameter (1/λ is the time by which 1 - e^{-1}, about 63%, of those who will die have died) and γ is the shape parameter. The mean survival time for those who die is given by

Γ(1 + 1/γ)/λ

where Γ denotes the gamma function. The analysis of survival data in developed regions suggested that the 10-year relative survival can be used as an estimate of the proportion who never die from the cancer. This is particularly useful when relative survival point estimates fluctuate significantly and plausible exact solutions could not be obtained, as in some developing regions. To assess the goodness-of-fit of the survival curve, we compared fitted 5-year survival with the observed survival and confirmed a good fit in all cases.
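The modified Weibull model in equation (2) can be sketched as follows. The mean survival time uses the standard result that a Weibull survivor function exp(-(λt)^γ) has mean Γ(1 + 1/γ)/λ; the function names and parameter values are illustrative assumptions, not fitted estimates.

```python
import math


def cure_weibull_survival(t, k, lam, gamma):
    """Equation (2): S(t) = k + (1 - k) * exp(-(lam * t) ** gamma)."""
    return k + (1.0 - k) * math.exp(-((lam * t) ** gamma))


def mean_survival_of_those_who_die(lam, gamma):
    """Mean of the Weibull component, Gamma(1 + 1/gamma) / lam."""
    return math.gamma(1.0 + 1.0 / gamma) / lam


# Hypothetical parameters: 30% never die from the cancer.
k, lam, gamma = 0.30, 0.5, 1.3
five_year_survival = cure_weibull_survival(5.0, k, lam, gamma)
mean_years = mean_survival_of_those_who_die(lam, gamma)
```

In this form S(t) starts at 1 and levels off at *k*, which is why the text can use a long-run relative survival (such as the 10-year value) as an estimate of *k*.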

There are substantial variations in relative interval survival (all ages) among countries; these variations are even larger, and fluctuate substantially with age, when the age-sex specific survival estimates are examined. There is also a possibility of correlation among the observations within the same region. We therefore employed both linear and non-linear feasible generalised least squares (FGLS) by maximum likelihood estimation to accommodate heteroscedasticity and correlation among the observations, and chose the best-fitting model for each site [34]. We did not include region fixed effects since the data are not available for all sub regions.

#### Model estimation of cancer death distribution

For the estimation of the number of deaths and cancer death distribution by site, we needed to estimate the number of individuals who survived up to 2000 by age and time of survival, as well as their corresponding probability of death during this year. The number of surviving individuals at age α in 2000 was calculated by multiplying incidence at age (α - τ) in year (2000 - τ) by the observed interval survival for τ years since diagnosis for individuals aged α in 2000 (*OIS*_{α,τ}), and summing over τ. To estimate *OIS*_{α,τ}, we first calculated the relative cumulative survival (*RCS*_{α,τ}) for every single age and year of survival for 2000, by multiplying *RIS*_{α,τ} over the years of survival. In a standard life table format, *OIS*_{α,τ} is written in the form:

where *l*_{x} is the number of individuals surviving at exact age *x* in the life table, *h*_{x} = -ln (*l*_{x+1}/*l*_{x}), α is age and τ is time since diagnosis.
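Two of the ingredients described here can be sketched directly: the relative cumulative survival as a running product of interval survivals, and the life-table hazard at age *x*. The sketch takes the hazard as a non-negative quantity, -ln(*l*_{x+1}/*l*_{x}), which is an assumption about the sign convention so that it can act as a background hazard in equation (6). The function names and inputs are hypothetical.

```python
import math
from itertools import accumulate
from operator import mul


def relative_cumulative_survival(ris_by_interval):
    """Running product of interval survivals: RCS_tau = RIS_1 * ... * RIS_tau."""
    return list(accumulate(ris_by_interval, mul))


def background_hazard(l_x, l_x_plus_1):
    """Life-table hazard between exact ages x and x + 1.

    Sign chosen so the result is non-negative (assumed convention).
    """
    return -math.log(l_x_plus_1 / l_x)


# Hypothetical interval survivals for the first three years after diagnosis.
rcs = relative_cumulative_survival([0.80, 0.90, 0.95])
# Hypothetical life-table survivors at two consecutive exact ages.
h_60 = background_hazard(l_x=85_000, l_x_plus_1=84_000)
```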

The number of individuals *S*_{α,τ} who had survived up to 2000 was calculated by multiplying incidence and observed interval survival for the corresponding year of age and survival time:

*S*_{α,τ} = *I*_{α-τ,2000-τ}*OIS*_{α,τ} (5)

where *I*_{α,t} is the incidence at age α in calendar year *t*. For example, the number of individuals who were 7 years of age (α = 7) in 2000, and who had survived cancer for 4 years (τ = 4) in 2000, was calculated by multiplying the incidence of cancer for the cohort of individuals who were 3 years of age (α - τ = 3) in 1996 (= 2000 - τ, the year of diagnosis) by the *OIS*_{α,τ} calculated for a 7-year-old person who had survived 4 years since cancer diagnosis.
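The worked example for equation (5) can be sketched as a lookup of the diagnosis cohort followed by a multiplication. The dictionary layout, the function name and the numbers are hypothetical.

```python
def survivors_in_2000(incidence, ois, age, tau, target_year=2000):
    """Equation (5): S[age, tau] = I[age - tau, target_year - tau] * OIS[age, tau].

    incidence  dict keyed by (age at diagnosis, year of diagnosis)
    ois        dict keyed by (age in target year, years since diagnosis)
    """
    cohort_key = (age - tau, target_year - tau)
    return incidence[cohort_key] * ois[(age, tau)]


# Hypothetical data: 120 cases diagnosed at age 3 in 1996, of whom
# 60% are still alive four years later at age 7.
incidence = {(3, 1996): 120.0}
ois = {(7, 4): 0.6}
alive_at_7 = survivors_in_2000(incidence, ois, age=7, tau=4)
```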

The probability of dying in 2000 due to cancer hazard, for each single age and year of survival, was calculated as follows:

*P*_{α,τ} = [1 - exp (-(-ln (*RIS*_{α,τ}) + *h*_{α}))] [-ln (*RIS*_{α,τ})/(-ln (*RIS*_{α,τ}) + *h*_{α})] (6)
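Equation (6) has the form of a competing-risks calculation: the first factor is the probability of dying from any cause during the year, and the second is the share of the total hazard that is due to cancer. A minimal sketch, assuming *h*_{α} is supplied as a non-negative background hazard and using hypothetical names:

```python
import math


def prob_cancer_death(ris, h_background):
    """Equation (6): probability of dying from cancer during the year.

    ris           relative interval survival for this age and survival time
    h_background  non-negative hazard of death from other causes (assumed sign)
    """
    cancer_hazard = -math.log(ris)
    total_hazard = cancer_hazard + h_background
    if total_hazard == 0.0:
        return 0.0  # no risk of death from any cause
    prob_any_death = 1.0 - math.exp(-total_hazard)
    return prob_any_death * cancer_hazard / total_hazard


# Hypothetical inputs.
p = prob_cancer_death(ris=0.90, h_background=0.02)
```

With no background mortality the expression reduces to 1 - *RIS*; a positive background hazard lowers the cancer death probability slightly, because some individuals die of other causes first.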

For each site, the number of deaths in 2000 among those individuals aged α years who had survived cancer for τ years was estimated by multiplying the number of survivors *S*_{α,τ} by the relevant probability of dying in 2000 due to cancer hazard *P*_{α,τ}. The total number of cancer deaths for the same site in 2000 at age α years was then estimated by summing over all survival times τ, that is, Σ_{τ} *S*_{α,τ} *P*_{α,τ}.
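The summation over survival times can be sketched as follows, with survivors and death probabilities stored in dictionaries keyed by (age, years since diagnosis). The data layout, function name and values are hypothetical.

```python
def cancer_deaths_at_age(survivors, prob_death, age):
    """Sum S[age, tau] * P[age, tau] over every survival time tau present for this age."""
    return sum(
        count * prob_death[(a, tau)]
        for (a, tau), count in survivors.items()
        if a == age
    )


# Hypothetical survivors and death probabilities for one site.
survivors = {(60, 1): 100.0, (60, 2): 80.0, (61, 1): 50.0}
prob_death = {(60, 1): 0.20, (60, 2): 0.10, (61, 1): 0.30}
deaths_at_60 = cancer_deaths_at_age(survivors, prob_death, age=60)
```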

Based on these region-specific adjusted incidence estimates and survival levels, cancer deaths were calculated by equations (3)-(6) for each region by age group and sex to estimate the distribution, but not the magnitude, of cancer deaths by site, sex, and age group.
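Since only the distribution is taken from the model, the modelled deaths can be normalised to proportions by site. A minimal sketch for one region, sex and age group, with a hypothetical function name and made-up counts:

```python
def death_distribution(deaths_by_site):
    """Convert modelled deaths by site into proportions that sum to 1."""
    total = sum(deaths_by_site.values())
    if total <= 0:
        raise ValueError("total modelled deaths must be positive")
    return {site: deaths / total for site, deaths in deaths_by_site.items()}


# Hypothetical modelled deaths for one region, sex and age group.
shares = death_distribution({"stomach": 50.0, "lung": 30.0, "liver": 20.0})
```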

#### Validation of the model

We tested the validity and performance of the proposed survival model in three different ways. Firstly, we compared our estimated *RIS*_{α,t,τ} for τ = 1 to 10 years for individuals diagnosed with cancer in 1986 with the SEER *RIS*_{α,t,τ} for τ = 1 to 10 years for the same cohort of individuals. Secondly, we compared the model estimates of cancer mortality distribution with the observed distributions in the regions with good vital records (AmrB, EurA, EurB, EurC and WprA sub regions). The AmrA sub region was excluded from the validation since it includes the United States. Finally, we compared the cancer death distribution of our model with the Globocan 2000 estimates for the regions where no vital records are available (AfrD, AfrE, SearB and SearD sub regions) to assess whether our model estimates are comparable to the estimates extrapolated from the actual observed data from the registries. In all cases, non-parametric tests for trend and Pearson's correlation were performed to examine whether the model estimates and observed data are consistent with each other. All statistical analyses were performed using STATA 7.0 (STATA Corporation, College Station, TX).
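As an illustration of the correlation check only (the analysis itself was run in STATA), Pearson's correlation between a modelled and an observed distribution can be computed as below. The function name and the two distributions are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical proportions of cancer deaths by site: model versus observed.
model_shares = [0.30, 0.25, 0.20, 0.15, 0.10]
observed_shares = [0.28, 0.27, 0.18, 0.16, 0.11]
r = pearson_r(model_shares, observed_shares)
```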