Mastectomy or breast conserving surgery? Factors affecting type of surgical treatment for breast cancer – a classification tree approach
© Martin et al; licensee BioMed Central Ltd. 2006
Received: 02 February 2006
Accepted: 20 April 2006
Published: 20 April 2006
A critical choice facing breast cancer patients is which surgical treatment – mastectomy or breast conserving surgery (BCS) – is most appropriate. Several studies have investigated factors that impact the type of surgery chosen, identifying features such as place of residence, age at diagnosis, tumor size, socio-economic and racial/ethnic elements as relevant. Such assessment of "propensity" is important in understanding issues such as a reported under-utilisation of BCS among women for whom such treatment was not contraindicated. Using Western Australian (WA) data, we further examine the factors associated with the type of surgical treatment for breast cancer using a classification tree approach. This approach deals naturally with complicated interactions between factors, and so allows flexible and interpretable models for treatment choice to be built that add to the current understanding of this complex decision process.
Data was extracted from the WA Cancer Registry on women diagnosed with breast cancer in WA from 1990 to 2000. Subjects' treatment preferences were predicted from covariates using both classification trees and logistic regression.
Tumor size was the primary determinant of patient choice, with subjects whose tumors were smaller than 20 mm in diameter preferring BCS. For subjects with tumors greater than 20 mm in diameter, factors such as patient age, nodal status, and tumor histology became relevant as predictors of patient choice.
Classification trees perform as well as logistic regression for predicting patient choice, but are much easier to interpret for clinical use. The selected tree can inform clinicians' advice to patients.
Breast cancer is a disease that affects about ten percent of Australian women. Because of its devastating impact on the community, much research has been conducted on multiple aspects of the condition, including possible causative factors, methods of treatment and patient care, and preventative measures such as breast screening. In this paper we investigate factors that affect the choice between the treatment options of mastectomy and breast conserving surgery (BCS) for Western Australian breast cancer patients. Current Australian guidelines for the treatment of early breast cancer recommend that women be treated using a multidisciplinary approach involving appropriate surgery, radiotherapy, and systemic adjuvant therapy [1, 2]. The factors investigated include tumor size and histology, nodal status, marital status, aboriginality, age, method of payment, area of residence, and country of birth.
A number of studies have investigated the factors that determine the type of surgery that breast cancer patients choose [3–9]. These studies have identified a number of important factors including place of residence, age at diagnosis, tumor size, socio-economic factors, and racial/ethnic factors. Studies of this nature are important because even though there has been an increase in the use of BCS since the early 1990s, an apparent under-utilisation of BCS among women for whom such treatment was not contraindicated has been documented. For this reason, research that may shed light on reasons for the under-utilisation of BCS is important. For example, if it is found that women of a certain ethnicity are less likely to choose BCS when it is a viable option, education campaigns could be specifically implemented to target these women.
A recent Western Australian study found several factors that affected the likelihood of women with breast cancer receiving BCS. In particular, it found that women from disadvantaged backgrounds were significantly less likely to receive BCS than those from more privileged backgrounds. In that study, the classification of an individual to a background category was based on a five category index of relative social disadvantage, with disadvantaged backgrounds and privileged backgrounds at opposite ends of this scale. However, the findings from that study were criticized because the study failed to adjust for tumor size. In his criticism, Furnival stated that, as a consequence of tumor size not being included in the analyses, "no reliable conclusions can be drawn as to the cause of the lower incidence of breast-conserving surgery in women from 'disadvantaged backgrounds'". In our paper we also use data from Western Australia but address the issue raised by Furnival by including tumor size in our analyses. In addition, we employ classification trees to help elucidate the factors, as well as the interactions between factors, that are relevant predictors of the choice of surgery. Classification trees are gaining broader acceptance in this area of biomedical research, but they are not yet in widespread use. Our paper highlights how they can be used to improve and clarify the results obtained from the standard logistic regression approach. Classification trees enable a flow-chart to be produced that can be easily followed for a patient with a given set of characteristics in order to predict the likely treatment chosen by the patient. This property enables classification trees, unlike logistic regression models, to be readily interpreted by people with little statistical knowledge. Attempting to explain treatment decisions based on a set of patient characteristics has been termed the "propensity score method" in the literature. Graf (1997) provides an application of this method to the treatment decision of mastectomy versus BCS.
Data for the study was sourced from linked administrative data obtained from the Western Australian (WA) Department of Health Record Linkage unit. The dataset was extracted from the WA Linked Database, a dynamic linkage system linking three core data sources: the WA Cancer Registry (WACR), the WA Hospital Morbidity Database (WAHMD) and the WA Death Register. The WAHMD contains comprehensive patient demographic, diagnosis and procedure information for each hospital admission occurring in any WA hospital. The dataset used consists of the linked hospital, death and WACR records containing the diagnosis, subsequent admissions to hospital and death (where applicable) of 2713 women diagnosed with one primary breast tumor in WA between 1 January, 1990 and 31 December, 1999. The WA Linked Database is unique in the Australian context and is an extraordinarily rich and comprehensive resource.
In this analysis, treatment was defined as the last surgical treatment within four months (120 days) of diagnosis. The date of diagnosis is defined for this study as the time at which the subject enters the study as recorded on the WACR as none of the databases linked by the Western Australian unit specifies diagnosis date. It is common for people who initially choose BCS to be readmitted for mastectomy because initial surgery reveals that the cancer is more progressed than initially advised. Thus, if a subject had a lumpectomy but was later readmitted for a mastectomy within 120 days of diagnosis, their treatment would be defined for this study as mastectomy. Individuals who had no surgery within 120 days following diagnosis were removed from the dataset. The treatment (surgery) variable was assigned as follows: a value of 0 was assigned to those subjects who received mastectomy as surgical treatment for the breast cancer within 120 days as indicated on the WACR; a value of 1 was assigned to those subjects who received BCS as the last surgical treatment for the breast cancer within 120 days as indicated on the WACR.
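The assignment rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure actually run against the WACR records: the function name `treatment_code`, the record layout (a list of date and surgery-type pairs), and the choice to treat the 120-day window as inclusive of day 120 are all assumptions made for the example.

```python
from datetime import date

MASTECTOMY, BCS = 0, 1   # coding described in the text: 0 = mastectomy, 1 = BCS
WINDOW_DAYS = 120        # window after diagnosis (assumed inclusive here)


def treatment_code(diagnosis_date, surgeries):
    """Return 0 (mastectomy), 1 (BCS), or None if no surgery falls in the window.

    `surgeries` is a hypothetical list of (surgery_date, kind) pairs,
    with kind equal to "mastectomy" or "bcs".
    """
    in_window = [
        (when, kind)
        for when, kind in surgeries
        if 0 <= (when - diagnosis_date).days <= WINDOW_DAYS
    ]
    if not in_window:
        return None  # such a subject would be removed from the dataset
    # The treatment is the LAST surgery within the window.
    _, last_kind = max(in_window, key=lambda rec: rec[0])
    return MASTECTOMY if last_kind == "mastectomy" else BCS


if __name__ == "__main__":
    diagnosed = date(1995, 3, 1)
    # Lumpectomy followed by a mastectomy 50 days after diagnosis.
    history = [(date(1995, 3, 10), "bcs"), (date(1995, 4, 20), "mastectomy")]
    print(treatment_code(diagnosed, history))
```

Under this rule a subject whose later mastectomy falls outside the 120-day window is still coded as BCS, which is one consequence of the definition worth keeping in mind when reading the results.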
Summary statistics for subject characteristics, N = 2713. For categorical variables, the percentages of subjects in each category are presented. For continuous variables, five-number summaries are presented.
Categorical variables: Method of Payment (Public, eligible for Medicare; Private, not insured; Ineligible for assistance); Country of Birth; Area of Residence.
Continuous variables: Tumor Size (diameter in mm); Patient Age (years).
Nodal Staging Characteristics: Node negative (N-); Node positive (N+) 1 to 3 nodes; Node positive (N+) 4 plus nodes.
Tumor Size Categories (T1/2/3): T1 (20 mm or less); T2 (20 to 50 mm); T3 (exceeding 50 mm).
Patient Age Categories: Age under 40; Age 40 to 60; Age 60 or over.
Classification trees and multivariate logistic regression were used to investigate whether the variables described in the previous paragraph affected the likelihood of a breast cancer patient choosing BCS as their treatment. Multivariate logistic regression has been used in a number of studies of this nature and will not be described further here. Classification trees are another, less pervasive, method that can be used to discriminate for a categorical response based on several, possibly interacting, covariates [14, 15]. In this context, classification trees create subgroups of the data with the property that within the subgroups the outcomes (choice of BCS or mastectomy) are as homogeneous as possible, and between subgroups the outcomes are heterogeneous. These subgroups are created by a recursive series of rules or binary splits. For example, if we were only interested in how tumor size and area of residence were related to choice of surgery, a classification tree might first split on "tumor size smaller than 20 mm", and then for such tumor sizes it might further split on "area of residence metropolitan". A classification tree of this form would be interpreted as follows: a patient with a tumor size exceeding 20 mm would be predicted to choose mastectomy; a patient with tumor size less than 20 mm and having a metropolitan area of residence would be predicted to choose BCS; and a patient with tumor size less than 20 mm and a rural or remote area of residence would be predicted to choose mastectomy. This tree has stratified the population of breast cancer patients into strata of treatment choice based on tumor size and area of residence. This simple example illustrates the high interpretability of classification trees and highlights the reason why tree representations are popular with medical scientists and doctors.
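The illustrative two-split tree described above is equivalent to a pair of nested rules, which can be written out directly. The sketch below encodes only that hypothetical example, not the tree fitted in this study; the function name and the string labels are chosen for the illustration, and a tumor of exactly 20 mm is assumed to fall on the "not smaller than 20 mm" side of the first split.

```python
def predict_surgery(tumor_size_mm, area_of_residence):
    """Follow the illustrative tree from the text and return the predicted surgery.

    area_of_residence is one of "metropolitan", "rural", "remote"
    (labels assumed for this example).
    """
    if tumor_size_mm < 20:
        # First split: tumor size smaller than 20 mm.
        if area_of_residence == "metropolitan":
            # Second split, applied to small tumors only.
            return "BCS"
        return "mastectomy"  # small tumor, rural or remote residence
    return "mastectomy"      # tumor of 20 mm or more


if __name__ == "__main__":
    for size, area in [(12, "metropolitan"), (12, "rural"), (35, "metropolitan")]:
        print(size, area, predict_surgery(size, area))
```

Reading a prediction off the tree is the same as following one path through these conditions, which is why the representation needs no statistical background to use.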
One potential problem with classification trees is that if continuous variables are used the tree may select "odd" splits for these variables; see, for example, Altman et al. (1994). For example, a tree model may choose to split on the continuous variables size and age at a tumor size of 27.4 mm and at an age of 61.6 years, highly specific values that are objectively meaningless. One way of avoiding this problem is to select, a priori, a few sensible splits for the continuous variables of interest. This was the approach taken in this study. Patient age and nodal status were split into three categories, ≤ 40, 40–60, ≥ 60 years and 0, 1–3, > 3 nodes, respectively, splits that have been used in previous studies [17–19]. Tumor size was split into the three size categories consistent with the American Joint Committee on Cancer and the International Union Against Cancer TNM staging system: ≤ 20 mm (T1), 20–50 mm (T2), ≥ 50 mm (T3). Creating these categorical variables out of the three continuous variables ensures that the tree can only split at sensible cutpoints, that is, at points recognised as clinically important by the medical profession. For ease of comparison, these new categorical variables were also used in the logistic regression analysis. While such a choice inherently involves information loss, the practical loss of information is slight, especially considering the increased interpretability of the resultant models.
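The pre-specified cutpoints can be expressed as three small categorising functions. This is a sketch under stated assumptions: the function names and category labels are illustrative, and because the ranges quoted above share their boundary values, the boundary handling follows the summary table (T1 is 20 mm or less, T3 exceeds 50 mm, the youngest group is under 40, the oldest is 60 or over).

```python
def size_category(diameter_mm):
    """TNM-style tumor size category from diameter in mm."""
    if diameter_mm <= 20:
        return "T1"
    if diameter_mm <= 50:   # assumption: exactly 50 mm is placed in T2
        return "T2"
    return "T3"


def age_category(age_years):
    """Three age bands; boundary ages are assigned by assumption."""
    if age_years < 40:
        return "under 40"
    if age_years < 60:
        return "40 to 60"
    return "60 or over"


def node_category(positive_nodes):
    """Nodal status from the count of positive lymph nodes."""
    if positive_nodes == 0:
        return "N-"
    if positive_nodes <= 3:
        return "N+ (1 to 3)"
    return "N+ (4 plus)"


if __name__ == "__main__":
    # The "odd" split values from the text map onto clinically familiar groups.
    print(size_category(27.4), age_category(61.6), node_category(2))
```

Once every subject is coded this way, a tree can only split between these categories, so a cutpoint such as 27.4 mm can never appear in the fitted model.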
The classification trees in this paper were fit using the rpart package available in the statistical package S-Plus. The tree-fitting process initially proceeds by finding the covariate that "best" divides the subjects into two groups. The "best" split is defined as the one that results in the most homogeneous subgroups with respect to the response variable, with homogeneity assessed by standard measures of goodness-of-fit such as the drop in deviance or the misclassification rate at each potential split. The process then partitions the subjects into the two resulting groups and repeats the splitting process in each of the two groups, a process referred to as recursive partitioning. The tree is "grown" in this way until some minimum group size is reached. This initial tree is usually too large and complicated to lead to useful inferences, in the same way that an initial logistic regression fit often contains too many variables, many of them ultimately insignificant. The initial, full tree is then "pruned" to produce a simpler, more interpretable tree that adequately models patient choice of surgical treatment, compared with the full tree, which likely overfits the data. Pruning a classification tree is the name given to the process of simplifying the initial tree by removing some of the lower splits. The pruning used in this paper was based on minimizing the cross-validated misclassification error across competing sub-trees. The minimum group size used and further details on the pruning process used are described below. After a suitable pruned tree is identified, a process called "burling" (examining alternative splits at each node in the tree) is used to assess the reliability of the selected tree.
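The search for the "best" first split can be illustrated with a short, self-contained sketch that scores every candidate split by its drop in deviance. It is not the rpart implementation: it considers only splits of the form "one category versus the rest", it performs a single split rather than the full recursion and pruning, and the eight toy subjects are invented purely for the example.

```python
from math import log


def deviance(labels):
    """Deviance of a node: -2 * sum over classes of n_k * log(n_k / n)."""
    n = len(labels)
    total = 0.0
    for cls in set(labels):
        n_k = labels.count(cls)
        total += n_k * log(n_k / n)
    return -2.0 * total


def best_split(rows, labels):
    """Return (drop, variable, level) for the split 'variable == level'
    versus 'variable != level' with the largest drop in deviance."""
    parent = deviance(labels)
    best = None
    for var in rows[0]:
        for level in sorted({row[var] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[var] == level]
            right = [y for row, y in zip(rows, labels) if row[var] != level]
            if not left or not right:
                continue  # not a genuine two-way partition
            drop = parent - deviance(left) - deviance(right)
            if best is None or drop > best[0]:
                best = (drop, var, level)
    return best


# Invented toy data: 1 = BCS, 0 = mastectomy.
TOY_ROWS = [
    {"size": "T1", "age": "under 40"}, {"size": "T1", "age": "60 or over"},
    {"size": "T1", "age": "40 to 60"}, {"size": "T2", "age": "under 40"},
    {"size": "T2", "age": "60 or over"}, {"size": "T3", "age": "40 to 60"},
    {"size": "T3", "age": "60 or over"}, {"size": "T2", "age": "40 to 60"},
]
TOY_LABELS = [1, 1, 1, 1, 0, 0, 0, 0]

if __name__ == "__main__":
    drop, variable, level = best_split(TOY_ROWS, TOY_LABELS)
    print(f"best first split: {variable} == {level} (drop in deviance {drop:.2f})")
```

Recursive partitioning would then apply `best_split` again within each of the two resulting groups until the minimum group size is reached, after which the pruning step selects among the nested sub-trees.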
Initial logistic regression model fit. Significant variables are denoted by ** (significant at 5%) or *** (significant at 1%).
Significance (1% or 5%)
Tumor Size T1 (baseline)
Tumor Size T2
Tumor Size T3
Subject Age under 40 (baseline)
Subject Age 40–60
Subject Age ≥ 60
Area of Residence – Metro (baseline)
Area of Residence – Rural
Area of Residence – Remote
Tumor Histology – Ductal (baseline)
Tumor Histology – Lobular
Tumor Histology – Other
Lymph Node Negative (N-) (baseline)
Lymph Node Positive, 1–3 nodes
Lymph Node Positive, 4 or more nodes
Country of birth – Australia/NZ (baseline)
Country of birth – Western Europe
Country of birth – Other
Payment class – Public/Medicare (baseline)
Payment class – Private, not insured
Payment class – Private, insured
Payment class – Ineligible
Marital Status – Married/De Facto (baseline)
Marital Status – Other
Of course, the initial model fit ignored potential interactions between variables in the model. A stepwise procedure was used to fit a larger logistic regression model that included relevant two-way interaction terms. The results of that model revealed that several interaction terms were, indeed, significant, including interactions between size and area, size and marital status, age and country of birth, age and tumor histology, age and marital status, and area of residence and country of birth. The large number of significant two-way interactions in the fitted model makes presentation of results, as well as their subsequent interpretation, rather difficult. The presence of numerous significant interactions not only makes interpretation of the model coefficients difficult, it also renders the model useless as a basis for forming clinical guidelines for women faced with the choice between mastectomy and BCS.
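A small numerical sketch may help show why interaction terms complicate interpretation. In a logistic model with a size-by-area interaction, the odds ratio for area of residence is no longer a single number but depends on the tumor size category. The coefficients below are hypothetical values invented for illustration; they are not estimates from the fitted model in this study, and the function names are likewise assumptions.

```python
from math import exp

# Hypothetical coefficients on the log-odds scale (NOT estimates from this study).
COEF = {
    "intercept": 1.0,
    "size_T2": -1.2,        # main effect of T2 versus T1
    "rural": -0.5,          # main effect of rural versus metropolitan
    "size_T2:rural": -0.8,  # two-way interaction term
}


def log_odds_bcs(size_t2, rural):
    """Log-odds of choosing BCS for indicator values size_t2, rural in {0, 1}."""
    return (
        COEF["intercept"]
        + COEF["size_T2"] * size_t2
        + COEF["rural"] * rural
        + COEF["size_T2:rural"] * size_t2 * rural
    )


def odds_ratio_rural(size_t2):
    """Odds ratio of BCS for rural versus metropolitan at a given size level."""
    return exp(log_odds_bcs(size_t2, 1) - log_odds_bcs(size_t2, 0))


if __name__ == "__main__":
    print("rural vs metro, T1 tumors:", round(odds_ratio_rural(0), 3))
    print("rural vs metro, T2 tumors:", round(odds_ratio_rural(1), 3))
```

With six such interactions in the model, each covariate's effect must be reported separately for every level of the covariates it interacts with, which is the presentational burden that a tree avoids by encoding the same conditional structure as branches.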
One of the benefits of tree models is that the output – a simple decision tree – is analogous to a diagnostic process with which medical professionals are familiar. As a result, the output of the tree model is generally easy to understand and summarize, even in the presence of significant two-way or even three-way interactions. This feature of trees allows the building of simple "profiles" of subjects who are predicted to prefer BCS over mastectomy and vice-versa. For example, patients with small tumors are likely to have BCS as their surgical treatment, while patients who have moderate to large sized tumors and are aged 60 years or older are likely to have mastectomy. Note that profiles of this kind derived from the tree are indicative rather than prescriptive, and that they describe tendencies, or broad patterns, rather than the behavior of particular individuals. Nevertheless, they are useful in characterising the patterns of patient behavior observed in the WACR data set, and in clarifying how the main factors identified in the analysis combine to influence patients' choices of surgical treatment for breast cancer. Tumor size is clearly the factor that dominates a patient's decision process, although a number of other factors appear relevant for patients with moderate to large sized tumors. The profiles paint rich but subtle pictures of patient behavior, indicating that patient decisions are often highly conditional rather than set by separate, individual rules for each relevant factor. This revelation would come as no surprise to doctors, as such a nested decision process would be entirely familiar to them in the context of advising their patients as to their most appropriate treatment option taking into account the patient's particular circumstances.
As with any analysis of complex data such as these, our tree analysis has limitations which must be acknowledged [15, 21]. Firstly, tree models have a tendency to be quite variable or unstable, wherein a small change in the data can result in a quite different series of splits and hence a different tree model. This instability of tree models can make their interpretation somewhat open to question. While this problem is inherent to tree models, it is likely that the tree model produced in this paper is very stable at the first split on tumor size. The choice of this split was unequivocal as the improvement in goodness-of-fit (both from drop in deviance and misclassification standpoints) from splitting on size at this point was substantially larger than for any of the other explanatory variables. Nevertheless, the interpretation of splits low in the tree structure is somewhat more uncertain as alternative splits low in the tree may have produced a tree model with similar goodness-of-fit to that of the selected model. Secondly, due to the large number of statistical comparisons that are performed during the fitting of a tree, p-values are not particularly useful or interpretable for these models. This issue necessitates the use of cross-validation or the use of an independent data set to validate the tree model. A cross-validatory exploration revealed the selected tree model to be relatively stable, particularly so with regard to the first few splits. Thirdly, trees may not capture global linear relationships between the response and covariates because the tree must approximate linear effects with a series of binary splits. These limitations illustrate that the easy interpretability and straightforward treatment of interactions characteristic of tree models compared to logistic regression models comes at some cost. It is important that these limitations be kept in mind when deciding whether classification trees are useful as an alternative to logistic regression in a study of this nature. Further, limitations imposed by the data must also be acknowledged. While data on tumor size was available, other relevant factors such as the size of the tumor relative to the size of breast and the degree of differentiation of the tumor were not available from this linked database.
Despite these limitations, our analysis offers some clear advantages over traditional approaches to analysing data such as these, and our findings are broadly useful for discovering which characteristics impact patients' choice of surgical treatment for breast cancer, and in estimating the extent to which each characteristic is important in the decision-making process.
Classification trees perform as well as logistic regression as a predictor of patient choice, but are much easier to interpret for clinical use. The selected tree can inform clinicians' advice to patients and can clarify complex interactions between covariates in predicting patient choice. In the context of this study of breast cancer in Western Australian patients, the tree model shows that tumor size is a major determinant of which surgical treatment patients choose, but that a variety of other factors, such as patient age, nodal status, and tumor histology, are also important in refining predictions of patient choice.
- National Health and Medical Research Council: Clinical Practice Guidelines for the Management of Early Breast Cancer. 2001, Canberra: National Health and Medical Research Council, 2
- Spilsbury K, Semmens JB, Saunders CM, Holman CD: Long-term survival outcomes following breast cancer surgery in Western Australia. ANZ J Surg. 2005, 75: 625-30. 10.1111/j.1445-2197.2005.03478.x.
- Potosky A, Ballard-Barbash R, Harlan L, Nayfield SG, Kessler LG: Factors associated with surgical and radiation therapy for early stage breast cancer in older women. J Natl Cancer Inst. 1996, 88: 716-26.
- Riley G, Potosky A, Klabunde CN, Warren JL, Ballard-Barbash R: Stage at diagnosis and treatment patterns among older women with breast cancer. JAMA. 1999, 281: 720-6. 10.1001/jama.281.8.720.
- Morris C, Cohen R, Schlag R, Wright WE: Increasing trends in the use of breast conserving surgery in California. Am J Public Health. 2000, 90: 281-4.
- Morrow M, White J, Moughan J, Owen J, Pajack T, Sylvester J, Wilson JF, Winchester D: Factors predicting the use of breast conserving surgery in stage I and II breast carcinoma. J Clinical Oncology. 2001, 19: 2254-62.
- Hall SE, Holman CD, Hendrie DV, Spilsbury K: Unequal access to breast-conserving surgery in Western Australia 1982–2000. ANZ J Surg. 2004, 74: 413-9. 10.1111/j.1445-1433.2004.03020.x.
- Hiotis K, Ye W, Sposto R, Goldberg J, Mukhi V, Skinner K: The importance of location in determining breast conservation rates. Am J Surg. 2005, 190: 18-22. 10.1016/j.amjsurg.2004.12.002.
- Hiotis K, Ye W, Sposto R, Skinner KA: Predictors of breast conservation therapy: size is not all that matters. Cancer. 2005, 103: 892-9. 10.1002/cncr.20853.
- Maskarinec G, Dhakal S, Yamashiro G, Issell BF: The use of breast conserving surgery: linking insurance claims with tumor registry data. BMC Cancer. 2002, 2: 3. 10.1186/1471-2407-2-3.
- Furnival C: Access to breast-conserving treatment: are surgeons responsible?. ANZ J Surg. 2004, 74: 402-03. 10.1111/j.1445-2197.2004.03104.x.
- Graf E: The propensity score in the analysis of therapeutic studies. Biometrical J. 1997, 39: 297-307.
- Giles GG: Medical record linkage in Australia: This is as good as it gets. ANZ J Surg. 2005, 75: 259. 10.1111/j.1445-2197.2005.03412.x.
- Breiman L, Friedman JH, Olshen RA, Stone CJ: Classification and Regression Trees. 1984, Boca Raton, FL, CRC Press
- Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning: Data Mining, Inference and Prediction. 2001, Springer-Verlag, New York
- Altman D, Lausen B, Sauerbrei W, Schumacher M: Dangers of using "optimal" cutpoints in the evaluation of prognostic factors. J Nat Cancer Inst. 1994, 86: 829-35.
- Foo CS, Su D, Chong CK, Chng HC, Tay KH, Low SC, Tan SM: Breast cancer in young Asian women: study on survival. ANZ J Surg. 2005, 75: 566-2. 10.1111/j.1445-2197.2005.03431.x.
- Jayasinghe UW, Taylor R, Boyages J: Is age at diagnosis an independent prognostic factor for survival following breast cancer?. ANZ J Surg. 2005, 75: 762-7. 10.1111/j.1445-2197.2005.03515.x.
- Najafi M, Ebrahimi M, Kaviani A, Hashemi E, Montazeri A: Breast conserving surgery versus mastectomy: cancer practice by general surgeons in Iran. BMC Cancer. 2005, 5: 35. 10.1186/1471-2407-5-35.
- S-PLUS (Version 6.1). © 2002, Insightful Corporation, Seattle, WA, [http://www.insightful.com]
- Hess KR, Abbruzzese MC, Lenzi R, Raber MN, Abbruzzese JL: Classification and regression tree analysis of 1000 consecutive patients with unknown primary carcinoma. Clin Cancer Res. 1999, 5: 3403-1.
- The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2407/6/98/prepub
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.