Factors to improve distress and fatigue in Cancer survivorship; further understanding through text analysis of interviews by machine learning

Background From patient-reported surveys and individual interviews by health care providers, we attempted to identify the significant factors related to the improvement of distress and fatigue for cancer survivors by text analysis with machine learning techniques, as the secondary analysis using the single institute data from the Korean Cancer Survivorship Center Pilot Project. Methods Surveys and in-depth interviews from 322 cancer survivors were analyzed to identify their needs and concerns. Among the keywords in the surveys, including EQ-VAS, distress, fatigue, pain, insomnia, anxiety, and depression, distress and fatigue were focused. The interview transcripts were analyzed via Korean-based text analysis with machine learning techniques, based on the keywords used in the survey. Words were generated as vectors and similarity scores were calculated by the distance related to the text’s keywords and frequency. The keywords and selected high-ranked ten words for each keyword based on the similarity were then taken to draw a network map. Results Most participants were otherwise healthy females younger than 50 years suffering breast cancer who completed treatment less than 6 months ago. As the 1-month follow-up survey’s results, the improved patients were 56.5 and 58.4% in distress and fatigue scores, respectively. For the improvement of distress, dyspepsia (p = 0.006) and initial scores of distress, fatigue, anxiety, and depression (p < 0.001, < 0.001, 0.043, and 0.013, respectively) were significantly related. For the improvement of fatigue, economic state (p = 0.021), needs for rehabilitation (p = 0.035), initial score of fatigue (p < 0.001), any intervention (p = 0.017), and participation in family care program (p = 0.022) were significant. For the text analysis, Stress and Fatigue were placed at the center of the keyword network map, and words were intricately connected. From the regression anlysis combined survey scores and the quantitative variables from the text analysis, participation in family care programs and mention of family-related words were associated with the fatigue improvement (p = 0.033). Conclusion Common symptoms and practical issues were related to distress and fatigue in the survey. Through text analysis, however, we realized that the specific issues and their relationship such as family problem were more complicated. Although further research needs to explore the hidden problem in cancer patients, this study was meaningful to use personalized approach such as interviews. Supplementary Information The online version contains supplementary material available at 10.1186/s12885-021-08438-8.

Conclusion: Common symptoms and practical issues were related to distress and fatigue in the survey. Through text analysis, however, we realized that the specific issues and their relationship such as family problem were more complicated. Although further research needs to explore the hidden problem in cancer patients, this study was meaningful to use personalized approach such as interviews.

Background
In Korea, the number of cancer patients has increased gradually over the years [1]. However, due to the development of therapeutic options, cancer mortality rates have decreased since the early 2000s. Consequently, the number of patients who survive after cancer treatment (cancer survivors) has continuously increased. In Korea, patients diagnosed with cancer between 2013 and 2017 showed a 5-year relative survival rate of 70.4%, which lead to a prevalence of approximately 1.87 million cases at the end of 2017. Although the population of cancer survivors is substantial, the importance of care for cancer survivors is underestimated.
After cancer treatment, the cancer survivors primarily screened for treatment outcomes, including physical complications. However, previous reports described that even after finishing cancer treatment, cancer survivors still experience various issues in physical, mental, and even practical aspects of life [2][3][4][5]. Although distress and fatigue are included in the guidelines for cancer survivorship, many survivors have often reported distress and fatigue regardless of disease progression [6]. In most cancer survivorship studies, analyses are based on simple surveys or reports from patients, which may be insufficient for detailed analysis to suggest solutions. Moreover, although many hospitals and public organizations show an interest in cancer survivorship [7], it has practical limitations on budget and resources to provide appropriate care for cancer survivors individually.
This study aimed to find specific and individualized causes behind cancer survivors' distress and fatigue in existing survey and interview data of the survivorship program through the newly approached machine learning technique for text analysis of the Korean language.

Methods
The manuscript of this study was prepared according to Strengthening the Reporting of Observationally Studies in Epidemiology (STROBE) guidelines [8].

Korean Cancer survivorship pilot project
The Korean Cancer Survivorship Pilot Project was launched in 2017 across Regional Cancer Centers in Korea. Adult cancer survivors who completed active cancer treatment such as surgery, radiotherapy, or chemotherapy, excluding palliative care, were recruited for a prospective and observational cohort study approved by IRB (AJIRB-MED-MDB-17-008). One of the regional cancer centers covering Kyung-gi province near Seoul, our institution joined the pilot project as one of the participating regional cancer centers.
After signing informed consent, enrolled cancer survivors filled out the surveys on quality of life (Supplementary File 1). Once each survivor's survey answers were reviewed, an experienced nurse conducted individual interviews for about an hour where each patient was asked further detailed questions about subjects listed in the Survivorship Assessment (NCCN Guidelines Version 2016) [9]. Furthermore, the nurse interviewed the patients more specifically focused on the issues with at least the three worst scores in the surveys. During the interviews, the nurse transcribed the consultation details as a text form, which was used to identify each patient's issues and ultimately match an appropriate program for each issue. Categories and process of cancer survivor support programs provided at the center and patients' participation records are shown in Table 1 and Fig. 1. The center's programs cover a wide range of topics such as nutrition, exercise, meditation, family care, and art therapy. The second round of the survey for evaluating the change of the scores was performed when the patients visited the hospital 1 month after the enrollment. If the patients could not visit the center at the time, the nurse tried the survey by the phone calls.

Sample size
A total of 561 patients initially signed up at our center between May 2018 and July 2019 to participate in the Korean Cancer Survivorship Pilot Project. The analysis excluded surveys filled out by ten patients as they did not thoroughly answer their questionnaires. Among those 551 patients, 322 patients completed the second round of surveys a month after the initial survey. To track changes in survey scores, we used survey sets from those 322 patients instead of those of 551. On the other hand, we used all of the individual interviews conducted on 551 patients for text analysis to ensure that machine learning had enough data to draw conclusions. Separate IRB approval for further analysis, including the text analysis, was obtained.

Measurements and outcomes
The survey questions were designed to assess patients' distress, fatigue, pain, insomnia, anxiety, depression based on the following tools: the National Comprehensive Cancer Network (NCCN) distress thermometer, a Korean version of the problem list and a five dimensional tool for visual analogue scales of anxiety, depression, fatigue, pain, and outcome domain of need for help [10,11]. The score systems were scaled from 0 (representing not at all or in the best condition) to 10 (in the worst condition). Also, EuroQol Visual Analogue Scale (EQ-VAS) was measured, ranging from 0 to 100 [10][11][12]. These survey measurements were developed and validated in previous Korean studies [10][11][12]. The survey also asked for general information about survivorship such as disease status, cancer treatment, socioeconomic condition, lifestyle behaviors and their physical, psychological, and practical needs, and problems. "High distress" was defined with a score 4 or higher, commonly considered severe distress [13]. This study's primary endpoint was the improvement in distress and fatigue scores at the time of the one-month follow-up. We grouped and compared the patients based on the improvement of the scores; the "improved" group indicated the patients had a higher score at follow-up visits, and the "not improved" group was defined by a follow-up score that was the same as or less than the initial score.

Text analysis
Machine learning-based text analysis was conducted to identify the survivors' hidden needs that the survey might not have captured. The text analysis process used Word2Vec, which is an advanced version of the neural network language model (NNLM) that provides a word embedding function by turning words into vectors [14]. Word2Vec uses previously mentioned words or context to predict ensuing words and content. Recognizing that Korean words are not separated by spaces like in English, the part of speech (POS) tagging was used to distinguish each individual word. The study specifically used the Python (version 3.4) package genism model from the Word2vec method and Konlpy for the POS tagging of Korean [15]. Once words are tokenized, we extracted nouns exclusively, which was followed by the postprocessing stage extracted list was reviewed to add or delete additional words to the library. Afterwards, the nouns were extracted in the order of the keywords by using the Most_similar function of the gensim model. The dimensionality of the word vector was 300, the minimum frequency of appearance was 50, and the window size, which means the number of surrounding words, was 8. After the extraction of words, the top10 ranked words were selected according to the relative similarity scores of each keyword including Health, Stress, Fatigue, Pain. Insomnia, Anxiety, and Depression. To quantify the result, mentioning the selected words were checked in the text of the interview for each patient. The counts for 10 words related with each keyword such as Stress and Fatigue, by frequency of mentions, were included in the statistical analysis as well as the survey results.
A network map was made with nodes and edges to visualize the relationship among the words. The nodes meant each word and their diameters reflected the frequency of mentions in the interviews, and the weights of the edges showed the relative similarities between two words. As the most basic centrality analysis measure, the degree centrality was defined as the count of the number of unique edges connected to the node [16,17]. This technique was used to determine the network's central node by measuring the degree of the edge between the nodes constituting the network and other nodes directly connected to the network. In this study, the word with the highest degree centrality score implies the frequency along with other words during the interview. Thus, significant keywords can be identified through degree centrality analysis.

Statistical analysis
Appropriate tests were selected for comparison of characteristics between the groups of high and low distress levels and univariable analysis for the improvement of the scores; Fisher's exact test for discrete variables, and Mann Whitney U-test for continuous variables. For the comparison between the groups based on the improvement of scores, multivariable analysis was performed using the logistic regression including the variables at p < 0.1 from the univariable analysis. Values were considered statistically significant when p < 0.05. Statistical analyses were performed using R 3.6.2 (R Development Core Team, Vienna, Austria, http://www.R-project.org). Table 2 displays the characteristics for the 322 patients who performed the 1-month follow-up survey. The majority of the patients were young (< 50 years, 81.1%), female (79.5%), less than 6 months from the end of treatment (70.5%), diagnosed with breast cancer (86.0%), early stages (75.2%), and previously healthy patients shown at the bottom of Table 2. All scores showed improvement after 1 month. Improvement at the 1 month follow-up was noted in about 50% of the patients in all categories. In the initial survey (Supplementary Table 1 in Supplementary File 2), participants' most frequently experienced physical problems were in exercise (33.2%), Most of patients with other treatment had hormonal therapies for breast cancer except radio-frequency ablation for one patient with liver cancer. b Continuous variables were represented by mean ± standard deviation followed by difficulties in memory or concentration (25.5%), and in nutrition (24.5%). Among emotional problems, more than 30% of the patients expressed difficulties with sleep and worry. Practical problems were relatively less common than physical or emotional problems. As for needs, 50.9 and 44.7% of the patients sought help in nutrition and exercise, respectively, which was higher than the percentage of patients actually expressing physical problems with nutrition and exercise (24.5 and 33.2%, respectively). However, for sleep or emotional problems, only a part of the patients experiencing those problems expressed the need for help. We checked the different characteristics between patients with high and low distress ( Table 2). Compared to low distress patients (less than 4), patients with higher distress (4 or higher) were younger (p = 0.043), more previous counseling history (p = 0.012), and less family caregiver (p < 0.001). Though the one-month follow-up scores of the initially high distress group were still worse, they showed a greater improvement rate (the number of improved patients in all patients) in distress (84.9% versus 33.0%, p < 0.001), anxiety (68.5% versus 49.9%, p < 0.001), and depression (71.2% versus 34.7%, p < 0.001) at the one-month follow-up than the initially low distress patients.

Text analysis for individual interviews
The top 10 ranked vocabularies were selected by correlation with each keyword, Health, Stress, Fatigue, Pain, Insomnia, Anxiety, and Depression, from the text analysis (Fig. 2). Even after finishing cancer treatment, many of the words correlating with the keywords that the patients used during the interviews were therapeuticrelated. Interestingly, family-related words such as family, child, and husband showed the highest correlation with the keyword Stress. Family-related words also correlated with Fatigue, Anxiety, and Depression. Emotional words were mostly related to Insomnia and Depression. In comparison, physical words were frequently associated with Fatigue.
During the interviews, the mean counts for mentioning the ten words related to Stress and Fatigue were 2.4 and 2.8, respectively. The counts were compared according to the improvement of distress and fatigue scores, and both were significantly different between the improved group and not-improved group (p < 0.001 and 0.036, respectively, Tables 3 and 4).
The keyword network map for the seven keywords and the extracted words showed multiple rings linked with each other or spherical form in the center of the network and some branches (Fig. 3). It implied that the words had a very complex relationship. In the center, the keywords Fatigue and Stress were positioned, which were the main outcomes of this study. The degree centrality of the seven keywords was higher than other words (Supplementary File 2). Among all the words on the keyword network map, Fatigue and Stress recorded the highest centrality score, implying that these two Fig. 2 The keywords (blue nodes) and the top 10 ranked correlated words (gray nodes) from the text analysis with the individual interviews; Gray nodes with yellow boundary, family-related words; and the numbers, the ranks of words  keywords frequently mentioned and tightly connected with other words when the patients described their conditions during the interviews.

Analysis of the scores
We performed uni-and multi-variable analysis with the characteristics, survey data, and the result of the text analysis. The significant factors of improvement in distress by the analyses are shown in Table 3. Multivariable analysis using logistic regression showed that dyspepsia (odds ratio (OR) 3.192, p = 0.006), the initial scores of distress (OR 2.777, p < 0.001), and anxiety (OR 1.253, p = 0.043) had a significantly positive relationship with the improvement in distress, and the initial score of fatigue (OR 0.681, p < 0.001), and depression (OR 0.753, p < 0.013) were negatively related to the improvement in distress. Table 4 shows the results for the improvement in fatigue by the uni-and multivariable analysis. After adjusting other factors, initially higher scores of fatigue (OR 1.517, p < 0.001), participation in family care programs (OR 3.895, p = 0.022), and mention of family-related words during the interview (OR 1.983, p = 0.033) were significantly related to the improvement of fatigue. On the other hand, practical problems with economy or finance (OR 0.409, p = 0.021), need for rehabilitation (OR 0.440, p = 0.035), and participation in any clinic or other program in the center (OR 0.491, p = 0.017) were unfavorable factors for the improvement in fatigue.

Discussion
Cancer survivors not only face cancer-related problems but also numerous physical, mental, and/or practical problems and challenges during and after cancer treatment. Subsequently, the needs of cancer survivors are frequently unmet during life after treatment. Although previous studies have addressed and tried to improve these issues, no effective solutions have been reported [18][19][20]. This could be due to the complexity of the problems that survivors experience and the various background and histories of individuals. Thus, the solutions from the previous studies using simple survey data may be insufficient to apply to a real-world situation. In this study, we used the latest technique, keyword  Fig. 3 Networks among the keywords (blue nodes) and their related words (gray nodes, except yellow family-related words) from the text analysis with individual interviews; the size of each circle (node) was the frequency of the word mentioned in the interviews, and the length or weight of line (edge) between two circles were their relative correlation network mapping using deep learning, and the conventional statistical analysis to better understand each individual cancer survivor with the survey and text review from the individual interviews. The main endpoints of this study were distress and fatigue, which are generally the most common chief complaints that in clinic patients express. Interestingly, distress and fatigue were also the most essential words in the keyword network map with the highest centrality scores. Statistical analysis showed that patients who participated in the family care program and those who used family-related words during the interview were significantly associated with fatigue improvement. Distress, one of the main endpoints in this study, is recommended as one of the main indicators for managing cancer survivors in the NCCN guidelines [6,13]. High emotional distress has been reported with a wide range from 9 to 50% [21][22][23], and our result was comparable. Since many patients were enrolled directly after finishing treatment, it may be seen that the psychological stress was mainly affected by the cancer treatment [24][25][26]. Additionally, the patients tended to focus on more physical or practical issues, not mental problems or stress issues, and high distress patients were younger and stayed without a family caregiver in our study. Though distress can be an overall indicator, its score may not show the detailed individual factors and its relationships. Furthermore, a recent report suggested that it was irrational to identify the cancer survivors' condition by distress level alone [27]. An interesting point in our study was that dyspepsia was one of the factors related to distress; it may have been due to the residual side effects after radiotherapy to left-sided breast or systemic therapy. Alternatively, the possibility of somatization disorder can also be considered. Although it is not currently used as a psychiatric diagnosis, one of the previous diagnoses in DSM-IV names called "Hwabyung" was used as a characteristic psychosis code in Koreans [28]. The diagnosis of Hwabyeong included ambiguous physical symptoms such as dyspepsia while showing anxiety or depression [28]. The cohort of this study included many middle-aged Korean housewives who were susceptible to Hwabyeong. In Korea, many Koreans who have less tendency to express themselves psychiatry-wise reflect their issues by the vague physical symptoms.
Fatigue associated with cancer or cancer treatment, another main outcome of this study, requires some time to be recovered after treatment [3,29]. However, long-term fatigue in cancer survivors is also majorly affected by factors other than cancer-related factors [30]. Many studies reported little association between persistent fatigue in cancer survivors and cancer treatment [31]. Thus, fatigue that persists after treatment may have different individual causes in aspects of lifestyles, practical situations, or even personalities [32]. Similarly, it is understandable that fatigue might not be improved if there were practical difficulties such as financial problems or lack of transportation [33]. For this reason, it is imperative that the cancer survivorship care team considers other various social or medical conditions of the patient such as physical recovery, moving distance, or family situation. Correspondingly, some reports suggest that individual life care in the community where the patients live or even private tele-cares through telephone or the Internet may be needed for effective personal supports [34,35].
In our study, like distress, the improvement of fatigue was positively related to few factors. Interestingly, the participation of any programs or clinics was not a good factor for fatigue improvement. We believe that it was not a worse factor, but most of the patients were enrolled just after the end of treatments, and some patients needed more time to rest. Although practical problems of economy or finance and the need for rehabilitation were negatively associated with improvement in fatigue, participating in the family care program was a positive factor for fatigue improvement. The family care program was not education for the family members as caregivers, as family members of cancer survivors. The program's purpose was to help the patients adapt to their role in the family after cancer treatment. In this study, most of the patients were diagnosed with breast cancer, and characteristics of the breast cancer population in Korea is that the median age is younger compared to the Western population, with many breast cancer patients younger than 50 years old. Accordingly, our results may reflect the situations of many young female cancer patients in which they need to maintain their lives as the caregivers for other family members such as their babies, husband, or older parents, rather than expressing their need to their family and society [36]. Even after finishing cancer treatment, their expected role and performance in the family may be the same as before their diagnosis. This could stem from the cultural notion of the close and united relationships as a family being one of the most critical values rather than a separate individual in Korea. Therefore, it is important to note that some cancer survivors need thoughtful considerations and programs for their role in their family and community.
Our study's uniqueness is that we used a new approach for analysis through text analysis. At the time of registration, each participant was interviewed individually by a nurse, trained for this project. Because time or budget limitations, few other centers could not interview all cancer survivors in depth. Even if the interviews were performed in our institution, it was a challenge to find an appropriate method to analyze the text data. Recently in Korea, analysis techniques for natural languages or free text analysis were developed [15], and have been used widely in non-medical parts such as marketing. However, the studies in medicine related to text analysis are scarce, and no significant results have yet been reported [37]. Although many researchers have tried to work with the electric health records (EMR) or similar systems in the hospitals, limitations existed due to the rigid structure of EMR and its information security issue. Our dataset included simple and large sets of interview results in text form as well as the survey form from the project, and we were able to perform text analysis without considering the limitations arising from the structure of EMR or security issues.
In the survey, patients were asked to answer questionnaires for each category of distress, fatigue, pain, insomnia, anxiety, depression, and EQ-VAS. However, there were very complex links in the text analysis that were difficult to interpret among the keywords in the interviews. Using the keyword network map, we showed the relationship among the words, which was previously hard to show using the survey results. In the keyword network map, each keyword showed a different frequency and centrality, and the keywords at the center with the highest centrality were fatigue and distress. Additional text analysis showing the top 10 words correlated with the keywords (health, stress, fatigue, pain, insomnia, anxiety and depression) (Fig. 2) indicated that there were still many treatment-related statements even after finishing treatment. Considering that the patients were enrolled directly after finishing the treatment, it seemed that the patients were still affected by cancer at the survey and interview time [36]. In severe cases, some patients may be affected by cancer, like a trauma that may last for a long time and need psychiatric interventions [38]. Therefore, further emphasis of post-treatment care and support for cancer survivors is required. Another finding in this additional text analysis was that the family-related vocabularies such as children and husband were important words related to stress, anxiety, and depression. This may be due to the concerns that the patients feel responsible about taking care of other family members. In addition, multivariate analysis showed that family-related vocabulary and participation in the family care program were related to fatigue improvement. Although only 15% of the patients expressed difficulties for child rearing in the initial survey, which may lead to no statistical significance, text analysis was able to catch the importance of issues related to family for cancer survivors. Thus, it should be noted that individual interviews are important to grasp the complex issues of that are not comprehensible through simple survey. To provide practical support to cancer survivors, a simple follow-up with indicators such as distress level or short clinic sessions may not be sufficient. Therefore, our institution has been carrying out surveys and rather lengthy interviews whenever possible and when necessary to help more practical problems of each patient. As this study confirmed that the family care program helped to improve fatigue, we will continue to take further individual actions.
There were some limitations in this study. First, the population of the study had a selection bias. We used the existing data with the registered patients in the project, making it impossible to compare to other groups such as non-registered patients or other specific disease sites. Additionally, due to the limitations in human resources and budget, individual interviews for all patients in real-world situation might be difficult. In the future, personalized care in our cancer survivorship center will be established with criteria for an effective operation to fully support those who are in need. Finally, the deep learning techniques and analytic methods for the text analysis were relatively limited due to the small number of patients.

Conclusions
Our cancer survivorship project showed favorable results in most aspects, including distress and fatigue. Particularly, considering the large portion of young breast cancer survivors and Korean culture, the participation in the family care programs was effective for improving fatigue. Moreover, text analysis with deep learning to use the individual interviews showed that the mention of family-related words during interviews was associated with fatigue improvement. It implied that the interviews had positive effects on the patients to participate in the family care program, which consequently lead to improvement in fatigue. This study will help health professionals provide more effective and personalized approaches to cancer survivorship.