Data sources
Lung and bronchus cancer cases in PA between 2010 and 2017 were obtained from the Pennsylvania Cancer Registry (PCR) [8] using International Statistical Classification of Diseases, 10th revision (ICD 10) diagnosis codes—C340 (main bronchus), C341 (upper lobe, bronchus or lung), C342 (middle lobe, bronchus or lung), C343 (lower lobe, bronchus or lung), C348 (overlapping sites of bronchus and lung), and C349 (unspecified part of bronchus or lung). PCR is an incidence-based registry and has earned Gold Certification from the North American Association of Central Cancer Registries (NAACCR), the highest level of data quality achieving at least 95% completeness, for all years under study [9]. The following three exclusion criteria were applied to exclude cases that were: (i) in situ and non-carcinoma histology, (ii) not uniquely matched with a census tract ID, and (iii) the age of diagnosis belonged to an age group with zero population size as estimated by US Census Bureau indicating a possible error. This resulted in a total of 73,937 cases from 3,197 census tracts. We used the census tract, the small and relatively permanent statistical subdivision defined by the US Census Bureau, as the unit of analysis for the consistency in the data collected over the years and the validity when used in research studies [10]. We conducted the present analysis under a data use agreement with the Pennsylvania Department of Health and with the approval of the University of Pennsylvania Institutional Review Board (IRB number 831671).
The reported street addresses at the time of diagnosis were geocoded using ArcGIS 10.6.1 software [11] and matched with the 2010 census tract ID. Lung cancer cases were grouped into 18 age groups (0–4, 5–9, 10–14,15–19, 20–24, 25–29, 30–34, 35–39, 40–44, 45–49, 50–54, 55–59, 60–64, 65–74, 75–84, 85 and above). Annual population size for the same year by age groups for a census tract was obtained using the American Community Survey (ACS), a national survey conducted by the US Census Bureau that provides various individual demographic and household information on a yearly basis [12].
Demographic data at the census tract level were extracted from the 2011–2015 ACS including median age (years), percentage of males, distribution by race and ethnicity, per capita income, median household income (thousands of $), percent poverty, distribution by educational attainment, total population size, and population density (per square mile).
Poor mental health and poor physical health, defined as the percent of individuals ≥ 18 years who self-reported having 14 or more days during the past 30 days in which their mental or physical health was not good, were extracted from the Centers for Disease Control and Prevention (CDC) PLACES 2020 database derived using the 2018 Behavioural Risk Factor Surveillance System (BRFSS). Both mental and physical health measures were based on self-assessment only without an objective health component [13].
Age-adjusted Incidence rates and trends over time
The age-adjusted incidence rates (number of cases per 100,000) for each census tract were calculated by adjusting the crude incidence rate with respect to the 2000 U.S. Standard Million Population, a commonly used standard population for adjustment that assumes a total population of 1,000,000 [14]. The adjustment used the 18 age groups and population size estimates described above. A choropleth map for the age-adjusted incidence rate using the cumulative cases over 8 years was created to visualize the spatial pattern. Temporal trends in the adjusted incidence rates were examined and modeled using linear quantile mixed models [15]. Such mixed models were utilized to allow census tract level random effects of intercept and slope for the calendar year to be estimated, while the use of the quantile regression provided a robust summary of the trends that were less sensitive to outlying values in the incidence rates, which are often observed in smaller census tracts. The estimated 50th (median), 75th, 80th, and 90th quantiles were plotted, and the mean profile was included as a reference.
Spatio-temporal disease risk and mapping
To understand the spatio-temporal disease risk, we modeled the observed case counts through a log-linear Poisson regression with both spatial and temporal terms, as well as a space–time interaction term. Specifically, the mean case count for location i (in this case a census tract) and year j was modeled as the expected case counts for the same location and year combination (Eij) times the relative risk parameter, RRij, which is also indexed by location i and year j (i.e., relative risk specific to a location and a time). The expected case counts Eij were determined based on the age distribution of the corresponding location i and year j such that Eij equals the crude incidence rate in a particular age group in the study population in year j times the population size in the same age group of the location i from the same year (i.e., internal standardization). Extending the model proposed by Lawson et al. [16] for the spatial model, the log of the space–time relative risk parameter RRij was modeled with four components: an intercept as the overall relative risk for the study region, location-specific random effects, a linear trend term in time j, and the interaction random effects between the location and time. The spatial random effects were assumed to follow a normal distribution under the conditional autoregressive (CAR) setting based on Queen contiguity spatial weight matrix (i.e., two areas are considered neighbors if they share a common boundary). The model was fit using the R-integrated nested Laplace approximation, R-INLA [17] under the Bayesian framework with a normal prior distribution. Temporal trends in the RR estimates from 20 randomly selected census tracts were plotted to examine the changes over time. The spatial pattern for the estimated RR for a given year was illustrated using a choropleth map. Furthermore, we calculated the standardized incidence ratio (SIR) for each census tract as the total number of cases observed divided by the total number of cases expected Eij across the 8 years combined. We then created a choropleth map of the SIR to examine this spatial pattern empirically with SIR > 1 indicating an elevated risk such that the number of cases observed is higher than the expected number of cases.
Detection of high-risk clusters
We used the SaTScan cluster detection method which employs Kulldorff scan statistics to detect high risk clusters. This approach has been widely used in spatial statistics to evaluate the risk of disease geographically to detect high risk clusters. This method generated circular spatial windows of various sizes and evaluated the observed over the expected number of cases by comparing inside versus outside the circles to identify statistically significant clusters [18]. To detect spatio-temporal clusters [19, 20], scan statistics covered the study area with many overlapping “windows” now defined as cylinders with the base as the area and the height as the time period in the space–time setting. As the window expanded to contain more areas and more cases, we used a log-linear ratio (LLR) to compare the number of cases inside the windows to the number of cases outside the window. The null hypothesis was calculated under the probability that being a case is the same inside and outside the window relative to the age-adjusted expected number of cases. A LLR > > 1 indicated evidence that the current window forms a high incidence or high-risk cluster. In our analysis, the age-adjusted expected case counts used were the same Eij that was used for the log-linear Poisson model in the previous section. The most likely cluster (i.e., the window with the maximum LLR) and secondary clusters (i.e., other statistically significant windows at 0.05 significance level) were identified in the current analysis. The RR of each cluster was determined by the total number of cases observed over the total number of cases expected in the years when the cluster is present. The statistical significance of a cluster was determined through a Monte Carlo hypothesis testing procedure [21]. The proposed analysis was performed using the R shiny application SpatialEpiApp, which allows estimation of spatio-temporal disease risk and detection of clusters [22].
Comparison of census tracts in high-risk cluster versus not
The nonparametric Wilcoxon rank sum test with a continuity correction was used to compare demographic variables between census tracts in any high incidence cluster at any time during 2010 to 2017 versus those not in any clusters. Data on smoking, which is a known risk factor for lung cancer development, were not available at the census tract level and for the same time frame as the demographic variables used, thus comparison of the smoking prevalence between census tracts was not possible. A two-sided p < 0.01 was considered statistically significant. We used a lower p-value threshold for statistical significance to account for testing multiple variables.