Cases and controls
This population-based case–control study included all lung cancer cases among male and female Canadian citizens aged 35–75 years residing in Montreal and its surrounding suburbs. Histologically confirmed incident cases of lung cancer diagnosed between January 1996 and December 1997 were ascertained through active monitoring of pathology reports in the 18 participating hospitals in the metropolitan Montreal region, which provided almost complete (≈98%) coverage of lung cancer diagnoses in the area. Histology of lung cancer was coded according to World Health Organization/International Agency for Research on Cancer technical report 31 [11].
Controls were randomly sampled from population-based electoral lists and frequency-matched to cases by age group (±5 years), gender and residential area. Further details about the study can be found elsewhere [12]. Ethics approval was obtained from all collaborating institutions, and written informed consent was obtained from all participants.
Data collection
A face-to-face interview was conducted by one of our bilingual (English and French) interviewers. If the subject was deceased or too ill to respond, we attempted to conduct the interview with a close next-of-kin proxy, usually the surviving spouse. The questionnaire was designed to collect information on sociodemographic and lifestyle characteristics, including smoking history, and a detailed semi-structured history of all jobs ever held.
Smoking history
Detailed self-reported information was collected on cigarette smoking habits, including smoking status, ages at initiation and cessation, periods of interruption, and average number of cigarettes smoked per day over the subject’s lifetime. Smokers were defined as those who had smoked regularly (at least one cigarette per week) for at least 6 months and had smoked at least 100 cigarettes in their lifetime; all others were considered “never smokers”. Because early symptoms of lung cancer can lead to changes in smoking behavior, we discounted the two years before the index date (i.e. date of diagnosis for cases and date of interview for controls) in computing each of the smoking variables, in order to avoid reverse causality bias. This cutpoint of two years was recommended by Leffondré et al. [7], based on their fitting of models with different cutpoints. Thus “current smokers” were defined as subjects who still smoked at interview or had quit smoking less than 2 years before the index date, and “former smokers” were those who had quit at least 2 years before the index date.
Smoking duration was computed as age at the index date (for current smokers) or age at cessation (for former smokers) minus age at initiation, minus the total duration of any temporary cessation periods. Cumulative smoking exposure was represented by two alternative constructed variables: pack-years and the cumulative smoking index (CSI). The pack-years variable was computed by multiplying the average number of cigarettes smoked per day by the duration of smoking in years, and dividing by 20 (cigarettes per pack). The CSI combines all the smoking dimensions collected from study subjects into a single function that is biologically motivated and that optimizes predictive power [13, 14]. Leffondré et al. proposed a modified version of the CSI, adapted specifically for lung cancer, and demonstrated that the resulting aggregate exposure measure improved the fit to the data, compared with conventional modeling of the separate effects of different smoking components [14].
The equation is:
$$ \mathrm{CSI}=\left(1-{0.5}^{{\mathrm{dur}}^{\ast }/\uptau}\right)\left({0.5}^{{\mathrm{tsc}}^{\ast }/\uptau}\right)\ln \left(\operatorname{int}+1\right), $$
where
$$ {\displaystyle \begin{array}{l}\mathrm{tsc}=\text{time since cessation},\\ {}{\mathrm{tsc}}^{\ast }=\max \left(\mathrm{tsc}-\updelta, 0\right),\\ {}\mathrm{dur}=\text{duration},\\ {}{\mathrm{dur}}^{\ast }=\max \left(\mathrm{dur}+\mathrm{tsc}-\updelta, 0\right)-{\mathrm{tsc}}^{\ast },\\ {}\operatorname{int}=\text{average daily amount smoked in cigarettes},\\ {}\updelta =\text{lag between ‘causal action’ and disease detection},\\ {}\uptau =\text{biological half-life of tobacco carcinogens}.\end{array}} $$
The latter two parameters, δ and τ, are estimated by trial-and-error so as to optimize the fit to data [13, 14].
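As an illustration only, the smoking metrics defined in this section can be sketched in Python as follows. The function names and argument conventions are hypothetical and are not taken from the study's own programs; the sketch applies the definitions above literally, and the CSI parameters (half-life τ and lag δ) must be supplied by the user.

```python
import math

def smoking_status(age_at_index, age_at_cessation, lag_years=2.0):
    """Classify an ever smoker as 'current' or 'former'.

    age_at_cessation=None means the subject still smoked at interview.
    Subjects who quit less than lag_years before the index date count
    as current smokers.
    """
    if age_at_cessation is None or age_at_index - age_at_cessation < lag_years:
        return "current"
    return "former"

def smoking_duration(age_at_initiation, age_at_index, age_at_cessation,
                     years_interrupted=0.0):
    """Years smoked, net of temporary cessation periods."""
    status = smoking_status(age_at_index, age_at_cessation)
    # Literal reading of the text: index age for current smokers,
    # cessation age for former smokers.
    end_age = age_at_index if status == "current" else age_at_cessation
    return max(end_age - age_at_initiation - years_interrupted, 0.0)

def pack_years(cigs_per_day, duration_years):
    """Average cigarettes per day times years smoked, divided by 20."""
    return cigs_per_day * duration_years / 20.0

def csi(dur, tsc, intensity, half_life, lag):
    """Cumulative smoking index.

    dur: duration (years); tsc: time since cessation (years);
    intensity: average cigarettes per day; half_life: tau; lag: delta.
    """
    tsc_star = max(tsc - lag, 0.0)
    dur_star = max(dur + tsc - lag, 0.0) - tsc_star
    return ((1.0 - 0.5 ** (dur_star / half_life))
            * 0.5 ** (tsc_star / half_life)
            * math.log(intensity + 1.0))

# Hypothetical subject: started at 18, quit at 50, index date at 60,
# 2 years of interruptions, 20 cigarettes per day.
example_duration = smoking_duration(18, 60, 50, years_interrupted=2)
example_pack_years = pack_years(20, example_duration)
example_csi = csi(example_duration, 60 - 50, 20, half_life=26, lag=1)
```

The example subject and the parameter values passed to `csi` are chosen only to exercise the functions. For a fixed intensity, the index grows with duration and decays with time since cessation, each at a rate governed by τ.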
Other covariates
Detailed information was collected on sociodemographic characteristics, including ethnicity, education and family income.
In addition, a team of chemists and industrial hygienists examined each completed questionnaire and, using the detailed employment history and the description of each job, translated each job into a list of potential exposures drawn from a checklist of 294 agents that included many IARC-recognized Group 1 lung carcinogens [15, 16].
Statistical analysis
Analyses were performed separately for men and women, and either with all histologies combined or by histologic type. When simply using the term “lung cancer”, we mean all histologies combined.
All associations were estimated using multivariable unconditional logistic regression. When several variables were tested simultaneously, Wald statistics were used to compare the contribution of each variable within a model, while the Akaike information criterion (AIC) was used to compare goodness of fit between models.
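For readers unfamiliar with the AIC, a minimal sketch of how it ranks competing models is given below. It uses the standard definition AIC = 2k − 2 ln L, where k is the number of estimated parameters and L the maximized likelihood; the function names, model labels and log-likelihood values are hypothetical and serve only to show the comparison.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood

def best_by_aic(models):
    """Return the name of the model with the lowest AIC.

    models maps a model name to (maximized log-likelihood, number of
    estimated parameters).
    """
    return min(models, key=lambda name: aic(*models[name]))

# Hypothetical fits of two models with the same number of parameters.
candidate_models = {
    "pack_years": (-520.0, 12),
    "csi": (-512.0, 12),
}
preferred = best_by_aic(candidate_models)
```

Because the penalty term depends only on the number of parameters, two models with equal k are ranked by their log-likelihoods alone.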
We assessed the relations between lung cancer and various smoking metrics, including duration, daily intensity, time since cessation, pack-years, and CSI. The CSI parameters were set a priori to the values established by Leffondré et al.: half-life = 26 years and lag = 1 year (males) or 0.7 years (females) [14]. Initially, the smoking metrics were analyzed one at a time. Subsequently, we conducted analyses with selected multiple smoking metrics in the same model. Analyses involving the time since cessation variable were performed among smokers only. For models involving all subjects, with nonsmokers as the reference group, an indicator of ever smoking was included and each continuous smoking variable was centered by subtracting, for all smokers, the mean value of the smoking variable from the original value, while keeping 0 for never smokers [7]. For each model, the smoking variables under study and the non-smoking covariates were forced into the model.
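The centering step can be sketched as follows. This is a minimal illustration that assumes the mean is computed among ever smokers only, which is one reading of the description above; the function name and the example values are hypothetical.

```python
def center_among_smokers(values, ever_smoker):
    """Center a continuous smoking variable for use alongside an
    ever-smoking indicator.

    values: the smoking variable for each subject.
    ever_smoker: matching booleans (True for ever smokers).
    Smokers get value minus the smokers' mean (an assumption about how
    the mean was defined); never smokers keep 0.
    """
    smoker_values = [v for v, s in zip(values, ever_smoker) if s]
    if not smoker_values:
        raise ValueError("at least one ever smoker is required")
    mean_smokers = sum(smoker_values) / len(smoker_values)
    return [v - mean_smokers if s else 0.0
            for v, s in zip(values, ever_smoker)]

# Hypothetical pack-years for one never smoker and two smokers.
centered = center_among_smokers([0.0, 10.0, 30.0], [False, True, True])
```

With this coding, the coefficient of the ever-smoking indicator compares a smoker at the mean level of the metric with a never smoker, and the coefficient of the centered variable describes the dose-response among smokers.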
Some analyses were conducted with continuous smoking variables transformed into categorical variables, while others used the continuous variables directly. For the latter, different functions were used to model the relations between continuous smoking metrics and the logit of lung cancer risk: (i) linear functions, (ii) logarithmic functions, and (iii) fractional polynomials (FP) [17]. In FP analyses, for each continuous variable X, one or two terms of the form \( {X}^p \) were fitted, with powers p chosen from the set (−2, −1, −0.5, 0, 0.5, 1, 2, 3) to optimize goodness of fit, i.e. to minimize the model’s deviance [17].
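A minimal sketch of the candidate FP transformations is given below. It follows the usual fractional polynomial conventions, which are assumed here rather than stated in the text: a power of 0 denotes ln X, and a repeated power p in a two-term model yields the terms X^p and X^p ln X. The function names are hypothetical, and the model fitting and deviance comparison are not shown.

```python
import math
from itertools import combinations_with_replacement

# Candidate powers listed in the text.
POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)

def fp_terms(x, powers):
    """Fractional polynomial terms of a positive value x.

    powers: a tuple of one or two powers in non-decreasing order.
    Conventions (assumed): p = 0 means ln(x); a repeated power
    multiplies the previous term by ln(x).
    """
    if x <= 0:
        raise ValueError("fractional polynomials require x > 0")
    terms = []
    for i, p in enumerate(powers):
        if i > 0 and p == powers[i - 1]:
            terms.append(terms[-1] * math.log(x))
        else:
            terms.append(math.log(x) if p == 0 else x ** p)
    return terms

def candidate_power_sets():
    """All one-term and two-term power combinations to be compared."""
    one_term = [(p,) for p in POWERS]
    two_term = list(combinations_with_replacement(POWERS, 2))
    return one_term + two_term

all_candidates = candidate_power_sets()
```

In a full analysis, each candidate power set would be fitted in the logistic model and the one with the lowest deviance retained. Variables that can equal zero would first need a shift or a separate treatment, since the transformations are defined only for positive values.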
The following covariates were included in all models: age (continuous), respondent status (self, proxy), ethnic origin (dummy variables: French / British Isles / Italian / other European / other), educational level (elementary, secondary, post-secondary), socioeconomic status (SES) as measured by the median household income of the residential neighborhood, derived from census information (continuous), and exposure to those IARC Group 1 occupational lung carcinogens that had at least 1% lifetime prevalence in our study population. The following occupational exposures (lifetime prevalence as indicated) satisfied these criteria: diesel engine emissions (23.8%), crystalline silica (15.9%), benzo[a]pyrene (15.3%), chrysotile asbestos (10.9%), nickel and its compounds (6.2%), chromium VI and its compounds (4.5%) and cadmium and its compounds (2.2%). For men, these were included in the models as ordinal variables with three levels: no exposure, ‘non-substantial’ exposure and ‘substantial’ exposure, where the two exposed levels were distinguished by duration of exposure, concentration of exposure and number of hours per week of exposure. For women, because of the much lower prevalence of occupational exposures, binary variables (ever vs never exposed) were used instead.
Some sensitivity analyses were carried out with study subjects restricted to those who answered for themselves, i.e. excluding proxy responses.
The population attributable fraction (PAF) was estimated as \( \mathrm{PAF}={p}_{\mathrm{exp}}\left(\frac{\mathrm{OR}-1}{\mathrm{OR}}\right) \), where \( {p}_{\mathrm{exp}} \) represents the ratio of the number of exposed cases to the total number of cases [18]. A 95% confidence interval (CI) for the PAF was derived by replacing the point estimate of the relevant OR by, respectively, the lower and upper boundaries of the corresponding 95% CI.
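The PAF calculation and its interval can be sketched as follows. The function names and the numerical inputs are hypothetical and do not correspond to any result of this study.

```python
def paf(p_exp, odds_ratio):
    """Population attributable fraction: p_exp * (OR - 1) / OR.

    p_exp: proportion of cases who were exposed.
    odds_ratio: adjusted odds ratio for the exposure.
    """
    return p_exp * (odds_ratio - 1.0) / odds_ratio

def paf_with_ci(p_exp, or_point, or_lower, or_upper):
    """PAF with limits obtained by plugging in the limits of the
    95% CI of the OR, as described in the text."""
    return (paf(p_exp, or_point),
            paf(p_exp, or_lower),
            paf(p_exp, or_upper))

# Hypothetical inputs: 90% of cases exposed, OR = 10 (95% CI 8 to 12.5).
estimate, lower, upper = paf_with_ci(0.9, 10.0, 8.0, 12.5)
```

Because this interval treats the proportion of exposed cases as fixed, it reflects only the uncertainty in the OR.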
All statistical analyses were performed using SAS® 9.4 software. The %MFP8 macro was used for determining the transformation of continuous variables using fractional polynomials [19].