GC-PROM: validation of a patient-reported outcomes measure for Chinese patients with gastric cancer

Background There is increasing recognition that patient-reported outcomes (PROs) are important in the estimation of the burden of long-term survival among patients with gastric cancer. The study aimed to develop a disease-specific instrument to assess patient-reported outcomes for Chinese patients with gastric cancer. Method Following the FDA’s draft guidance on patient-reported outcome measures, a conceptual framework and an item pool were defined based on relevant existing work. A draft scale was formed after revising some items based on feedback from experts and Chinese patients with gastric cancer. The pre-survey and formal survey were conducted in eight different hospitals in Shanxi Province, and two item-selection processes based on classical test theory and item response theory were applied. Finally, the patient-reported outcomes measure for Chinese patients with gastric cancer (GC-PROM) was validated in terms of reliability, validity, and feasibility. The minimal clinically important difference was determined by a distribution-based method. Results The final GC-PROM consisted of 38 items, 13 subdomains, and 4 domains. Reliability was verified by Cronbach’s alpha coefficients for the four domains and 13 subdomains. The validity results showed that the multidimensional scale fulfilled expectations. In the formal survey, the completion rate was 96.16%, and the average filling time was less than half an hour. The values of the minimal clinically important difference were 4.14, 3.41, 3.37, and 3.28 in the four domains. Conclusions The GC-PROM had good reliability, validity, and feasibility and thus can be considered an effective clinical evaluation instrument for Chinese patients with gastric cancer.


Background
Gastric cancer (gastric carcinoma, GC) is a malignant tumor occurring in the epithelial tissue of the stomach. GC accounts for more than 95% of malignant tumors of the stomach [1]. There are approximately 989,000 new patients with GC worldwide each year, but the incidence of the disease varies greatly by region [2]. Although the diagnosis and treatment of GC are advancing, the 5-year survival rate for patients with GC is only 20%. In China, GC is a major public health problem [3]. GC causes physical pain, a poor mental state, and enormous costs for many families, all of which reduce Chinese patients' quality of life (QoL). Many patients with GC are therefore focusing more on how to improve overall QoL [4].
In recent years, patients' subjective feelings about treatment have become an important part of improving patients' QoL [5]. However, earlier methods, such as physician reports, were unable to capture patients' self-reported results [6]. Therefore, patient-generated reports, also known as patient-reported outcomes (PROs), are now used to assess the overall burden of cancer and the effectiveness of interventions. PROs involve reports taken directly from patients regarding their health status, functional status, and treatment experience [7]. In medical care for patients with GC, functional effects have usually been separated into three categories: physiological, psychological, and social. Treatments may also cause physical discomfort to patients and test the psychological endurance of both patients and their families [8]. Economic effects have sometimes also been discussed among the functional effects of illness [9]. To select the best therapeutic schedule, it is necessary to carry out a comprehensive assessment of the various plans.
At present, the main disease-specific instruments for GC that have been developed are the EORTC quality of life questionnaire-stomach cancer (EORTC QLQ-STO52) [10], the Functional Assessment of Cancer Therapy-gastric (FACT-Ga) [11], the quality of life instruments for cancer patients-stomach cancer (QLICP-ST) [12], and the Special Symptom Scale developed by Chen-wun in Taiwan, China [13]. The EORTC QLQ-STO52, FACT-Ga, and QLICP-ST were developed by combining a general module with a specific module. The Chinese versions of the EORTC QLQ-STO52 and FACT-Ga have been culturally adapted and evaluated [14], but some items might still not be suitable for Chinese culture. The QLICP-ST is a gastric cancer scale developed for Chinese cancer patients. However, it might contain fewer disease-specific items than the EORTC QLQ-STO52, and it has few specific items on effectiveness, compliance, satisfaction, and side effects in the field of cancer treatment [15]. The Special Symptom Scale developed by Chen-wun is not divided into domains [13].
In sum, there are already many reliable scales for measuring the QoL of patients with GC worldwide. However, if used alone, these scales are often not specific enough and cannot comprehensively measure the QoL of Chinese patients with GC [16]. Additionally, because QoL depends strongly on cultural background, foreign scales cannot be used directly after translation. Because of economic and cultural differences across regions of China, Chinese-developed instruments for patients with GC have not been widely used [17]. Therefore, it was necessary to develop a PROM for Chinese patients with GC that focuses more on treatment-related aspects as perceived by patients. In addition to laboratory and imaging methods, the data from a PROM can be used to improve the reliability of clinical efficacy evaluations by comprehensively measuring many aspects of patient-reported health [18]. As a result, PROs are able to provide a reference for doctors in their diagnosis and treatment practices [19]. Prior to using PRO measures in clinical practice and research, the instruments need to be cautiously developed and validated to avoid biased results that might lead to incorrect interpretations [20].

Setting
The two surveys (i.e., pre-survey and formal survey) were carried out in eight hospitals in Shanxi Province, China. These hospitals were the First Hospital of Shanxi Medical University, the Second Hospital of Shanxi Medical University, Shanxi Cancer Hospital, the 264 Hospital of Chinese People's Liberation Army (PLA), the 17th Hospital of the Chinese Railway, the People's Hospital of Gaoping City, the People's Hospital of Zezhou City, and the Fourth People's Hospital of Linfen City.

Sample
Before collecting samples, investigators contacted the related departments of the target hospitals and communities to get support from hospital staff and community workers. The study was also publicized through posters in hospital departments and communities, and documents introducing the survey were distributed. From July 2015 to September 2015, patients diagnosed with GC were recruited. The inclusion criteria were as follows: patients who had been diagnosed with GC and were over 18 years old. The exclusion criteria were as follows: patients with other serious diseases; patients with disturbance of consciousness; and patients who were unable to understand or complete the questionnaire for any reason. We simultaneously selected healthy subjects who lived in the same communities as the patients. Healthy subjects met the following criteria: they were not suffering from other diseases of the digestive system, other malignant tumors, or mental illness; they were similar in age to the patients with GC; and they volunteered to participate in the investigation.

Development and formation of GC-PROM
The GC-PROM was developed in three phases [21], and details of each phase are described below. Figure 1 presents a flowchart of the three-phase development process.

Phase 1: identification of conceptual framework and items

Literature searches and patient interviews
Literature searches were carried out on network databases for keywords such as PRO measure, PRO scale, PRO instruments, and gastric cancer. Based on the FDA's principles for PROMs and on the search results, we established a conceptual framework for the GC-PROM that included four domains and 13 subdomains. We conducted face-to-face interviews with 10 patients with GC. Researchers wrote down the interviewees' original words as far as possible. After the interviews, all information was sorted and an initial item pool was developed.

Cognitive test and expert consultation
Another 10 hospitalized patients with GC took part in a cognitive test of the questionnaire. The group included seven men and three women, with an average age of 54 years. We also sought views from experts. In the final step, we integrated the views of experts and patients to modify the items and develop the draft version of the GC-PROM.

Scale scoring
The response options of the items used five-point Likert scales, with raw scores ranging from zero to four points. Items were either positive (higher raw scores indicate higher QoL) or negative (higher raw scores indicate lower QoL). For the convenience of calculation, positive items were recoded as the original score plus one point, and negative items were recoded as five minus the original score [22]. The higher the total score of a subdomain, the better the patient's QoL.
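The recoding rule above can be sketched in a few lines. This is a minimal illustration, assuming raw responses are stored as integers from 0 to 4; the function names `recode_item` and `subdomain_score` are hypothetical and not part of the GC-PROM.

```python
def recode_item(raw: int, positive: bool) -> int:
    """Recode a raw 0-4 Likert response so that a higher value always means better QoL."""
    if raw not in range(5):
        raise ValueError("raw response must be an integer from 0 to 4")
    # Positive items: original score plus one (0..4 -> 1..5).
    # Negative items: five minus the original score (0..4 -> 5..1).
    return raw + 1 if positive else 5 - raw


def subdomain_score(responses, positive_flags):
    """Sum the recoded scores of all items in one subdomain (hypothetical helper)."""
    return sum(recode_item(r, p) for r, p in zip(responses, positive_flags))
```

After recoding, both item types lie on the same 1 to 5 scale, so subdomain totals can be summed directly.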
Phase 2: formation of initial and final scales using two item-selection processes
During the formation of the GC-PROM, seven methods were used to select items through two item-selection processes. The first six methods were based on classical test theory (CTT), and the seventh was based on item response theory (IRT). One IRT model, Samejima's Graded Response Model, was the preferred methodology for statistically analyzing patients' latent traits [23]. An item was considered for selection if it was retained by six or more methods. Before an item was deleted in the pre-survey, its practical significance was considered: if the item was meaningful in practice, it was temporarily retained and screened again in the formal survey. The item was removed only if deletion was still suggested at that stage.
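For readers unfamiliar with the Graded Response Model, the following sketch shows how category probabilities are obtained for one five-category item. It assumes the common logistic form without a scaling constant; the function name and the parameter values used in any call are illustrative and are not estimates from this study.

```python
import math


def grm_category_probs(theta, a, thresholds):
    """Category probabilities for one item under Samejima's graded response model.

    theta      : latent trait level of the respondent
    a          : item discrimination parameter
    thresholds : increasing difficulty (threshold) parameters, one fewer than
                 the number of response categories
    """
    # Cumulative probabilities P(X >= k), padded with 1 below the lowest
    # category and 0 above the highest one.
    cumulative = [1.0]
    cumulative += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cumulative.append(0.0)
    # Probability of each category is the difference of adjacent cumulative curves.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(cumulative) - 1)]
```

With four thresholds the function returns five probabilities, one per Likert category, that sum to one.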

Statistical methods
Seven methods were used to evaluate the items: (1) When the standard deviation (SD) of an item was ≤1, the corresponding item was deleted [24]. (2) Items whose factor loadings in the exploratory factor analysis were low (< 0.4) or close to their loadings on other factors were deleted [25]. (3) An item was considered for deletion when the Pearson correlation coefficient between the item and its subdomain was < 0.60 or the Pearson correlation coefficient between the item and another subdomain was > 0.50 [25]. (4) An item was considered for deletion when the corrected item-total correlation was < 0.50 and the item's deletion increased the value of Cronbach's alpha coefficient [24]. (5) Items with low test-retest correlation coefficients (< 0.6) were removed [26]. (6) In the distinction analysis, each item score was compared between patients and healthy subjects using a t-test, and deletion was recommended for items with P values > 0.05 [23]. (7) In the Graded Response Model, the practical values of the item parameters for deletion were as follows: item discrimination parameter (a) < 0.4 or difficulty
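Some of the CTT screening rules can be sketched as follows. This is only an illustration under stated assumptions: it uses the sample SD, covers the SD rule, the item-subdomain correlation threshold, and the correlation part of the corrected item-total rule, and leaves out the cross-subdomain correlation and the change in Cronbach's alpha. The names `ctt_flags` and `retained` are hypothetical.

```python
import statistics


def pearson(x, y):
    """Pearson correlation coefficient of two equally long score lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def ctt_flags(item_scores, other_items):
    """Report which of three CTT rules suggest deleting an item.

    item_scores : scores of all respondents on the item under review
    other_items : list of score lists for the remaining items of the subdomain
    """
    rest_total = [sum(scores) for scores in zip(*other_items)]
    subdomain_total = [s + r for s, r in zip(item_scores, rest_total)]
    return {
        # Rule (1): sample SD of the item is at most 1 (sample SD is an assumption).
        "low_sd": statistics.stdev(item_scores) <= 1,
        # Rule (3), first part: item-subdomain correlation below 0.60.
        "low_item_subdomain_r": pearson(item_scores, subdomain_total) < 0.60,
        # Rule (4), first part: corrected item-total correlation below 0.50.
        "low_corrected_item_total_r": pearson(item_scores, rest_total) < 0.50,
    }


def retained(n_methods_keeping, n_required=6):
    """An item stays in the scale if at least six of the seven methods keep it."""
    return n_methods_keeping >= n_required
```

In practice the flags from all seven methods would be tallied per item before the retention rule is applied.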

Phase 3: evaluation of measurement properties
The properties of the final GC-PROM version were assessed by using data from a formal investigation.

Evaluation of reliability
The internal consistency of the GC-PROM was assessed using Cronbach's alpha coefficients of the 13 subdomains. Generally, a value above 0.70 indicates good internal consistency [28].
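As a reminder of how the coefficient is obtained, Cronbach's alpha for a subdomain with k items is k/(k-1) multiplied by one minus the ratio of the summed item variances to the variance of the total score. The sketch below implements this standard formula with sample variances; the function name and the data layout are assumptions made for illustration.

```python
import statistics


def cronbach_alpha(items):
    """Cronbach's alpha for one subdomain.

    items : list of k score lists, one per item, all covering the same respondents
    """
    k = len(items)
    if k < 2:
        raise ValueError("at least two items are required")
    # Total score of each respondent across the k items.
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(statistics.variance(item) for item in items)
    return k / (k - 1) * (1 - item_variance_sum / statistics.variance(totals))
```

A returned value above 0.70 would meet the threshold cited above.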

Evaluation of validity
Content validity
The relevant literature, subjects' opinions, and experts were consulted in establishing the content validity, which represents how well the items captured the concept of interest [29].

Construct validity
Confirmatory factor analysis was used to examine the structure of the GC-PROM. The standardized factor loadings for an item should be greater than 0.5 [30].

Discriminant validity
Discriminant validity is the ability of an instrument to measure a difference between two groups. The t-test was used to compare differences between patients with GC and healthy subjects, with the significance level set at P < 0.05 [31].

Evaluation of feasibility
Feasibility mainly reflects the acceptability of the GC-PROM. The return and response rates of the questionnaires were considered acceptable when they reached the general requirement of ≥85%, and the questionnaire completion time was generally expected to be less than half an hour. We also examined the proportion of missing data and the maximum endorsement frequencies [32].
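Two of these feasibility indicators are simple per-item proportions. The sketch below assumes that the responses to one item are held in a list with `None` marking a missing answer; both function names are hypothetical.

```python
from collections import Counter


def missing_proportion(responses):
    """Share of respondents who left the item unanswered (None marks a missing answer)."""
    return sum(r is None for r in responses) / len(responses)


def max_endorsement_frequency(responses):
    """Share of answering respondents who chose the most popular response category."""
    answered = [r for r in responses if r is not None]
    most_common_count = Counter(answered).most_common(1)[0][1]
    return most_common_count / len(answered)
```

A very high maximum endorsement frequency means that nearly everyone chose the same category, which points to a floor or ceiling effect for that item.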

Interpretation of PRO results: minimal clinically important difference (MCID)
The MCID was used to give clinical meaning to a change in GC-PROM scores [33]. The methods used to estimate the MCID mainly include the effect size (ES), standard error of measurement (SEM), standardized response mean, and reliable change index (RCI) [34]. In this article, we used the SEM and RCI to estimate the MCID.
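The article does not spell out the formulas it used, so the following is a sketch based on the commonly cited distribution-based definitions: SEM equals the SD multiplied by the square root of one minus the reliability, and the RCI-based threshold equals 1.96 times the square root of two times the SEM. Which reliability coefficient is used (for example Cronbach's alpha or the test-retest coefficient) is an assumption left to the caller, and the function names are illustrative.

```python
import math


def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)


def mcid_rci(sd, reliability, z=1.96):
    """Smallest score change that exceeds measurement error at the given z level."""
    return z * math.sqrt(2) * sem(sd, reliability)
```

Under these definitions the RCI-based threshold is always larger than the SEM itself, because the multiplier 1.96 times the square root of two exceeds one.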

Participant characteristics
A total of 145 patients and 55 healthy subjects were included in the pre-survey. Among these subjects, 20 patients completed the questionnaire again 4 days after first completing it. Completed questionnaires were collected from 130 patients and 52 healthy subjects, and all 20 retest questionnaires were recovered. In the formal survey, a total of 530 questionnaires (400 patients with GC, 130 healthy subjects) were administered. Ultimately, completed questionnaires were collected from 364 patients with GC and 112 healthy subjects. A total of 45 patients with GC were retested, and all of the retest questionnaires were recovered. We compared the baseline data of the two groups using t-tests for continuous variables and chi-square tests for categorical variables. With the significance level set at P < 0.05, the results showed that the baseline data from patients with GC and from healthy subjects were all comparable (Table 1).

The conceptual framework of the GC-PROM
The established conceptual framework included four domains and 13 subdomains. After the literature review and interviews with patients with GC, an initial pool of 79 items was developed. Based on the cognitive test and expert consultation, we deleted 14 items, added three items, and modified two items. Finally, the draft scale contained 4 domains (physiological, psychological, social, and therapeutic domains), 13 subdomains (abdominal symptoms, systemic symptoms, physical state, independence, anxiety, depression, pessimism, fear, social support, social adaptation, effectiveness, satisfaction, compliance, and drug side effects), and 68 items.

Formation of the initial and final scales through two item-selection processes
Seven methods, including the SD, exploratory factor analysis, Cronbach's alpha coefficient, retest reliability, correlation coefficient, distinction analysis, and IRT, were used to select items. In the first item-selection process, 22 items in the item pool were suggested for deletion by the seven methods. The practical meanings of these 22 items were also taken into account, and a consensus was reached that they should be deleted. In the second item-selection process, a formal investigation was conducted with the reduced 46-item questionnaire. The items were again screened using the same seven methods and their practical meanings. According to the results shown in Table 2, eight items were deleted. Finally, the scale contained 4 domains, 13 subdomains, and 38 items (see Additional file 1). The structural framework of the final scale is shown in Table 3.
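The item counts reported at each stage can be checked with a short calculation. All numbers below are taken from the text; the variable names are illustrative.

```python
initial_pool = 79

# Cognitive test and expert consultation: 14 items deleted, 3 items added.
draft_scale = initial_pool - 14 + 3

# First item-selection process (pre-survey): 22 items deleted.
initial_scale = draft_scale - 22

# Second item-selection process (formal survey): 8 items deleted.
final_scale = initial_scale - 8

print(draft_scale, initial_scale, final_scale)
```

The three values correspond to the 68-item draft scale, the 46-item questionnaire used in the formal survey, and the 38-item final GC-PROM.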

Evaluating the properties of the GC-PROM
The final GC-PROM was evaluated for validity, reliability, and feasibility using data obtained from 364 patients with GC and 112 healthy subjects.

Evaluation of reliability
Cronbach's alpha coefficients for the four domains and 13 subdomains were between 0.700 and 0.917. As was evident in these values, the GC-PROM demonstrated a good degree of internal consistency reliability.

Construct validity
The results of the confirmatory factor analysis are shown in Table 4. The standardized factor loadings of the 13 subdomains were greater than 0.5. Therefore, the construct validity was deemed satisfactory.

Discriminant validity
The results of discriminant validity are shown in Table 5. The results of discriminant validity (P values < 0.05) suggested that the GC-PROM was an appropriate instrument to distinguish between patients and healthy subjects.

Evaluation of feasibility
In the formal survey, the return and response rates of the questionnaires were 93.40% and 96.16%, respectively. The average completion time was less than half an hour. No major floor or ceiling effects were found. The maximum proportion of participants who endorsed a single category for each item was less than 80%. Only 3.84% of the responses to individual items were missing. We tested the missing questionnaire data using Little's Missing Completely at Random Test. The test showed that the data were missing completely at random, and we imputed them using the Expectation-Maximization Algorithm.
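The two rates can be reproduced from the survey counts under some assumptions that the article does not state explicitly: that the return rate is the number of returned questionnaires divided by the number administered, that the response rate is the number of completed questionnaires divided by the number returned, and that 495 questionnaires were returned. The figure of 495 is inferred here only because it makes both reported percentages consistent; it is not reported in the text.

```python
administered = 400 + 130   # questionnaires handed out in the formal survey
completed = 364 + 112      # completed questionnaires (patients + healthy subjects)
returned = 495             # ASSUMPTION: inferred, not reported in the article

return_rate = returned / administered
response_rate = completed / returned

print(f"return rate:   {return_rate:.2%}")
print(f"response rate: {response_rate:.2%}")
```

Under these assumptions both rates are above the ≥85% requirement set in the feasibility criteria.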

MCID
According to the statistical results in the MCID table, the value of the MCID was greater when determined using the RCI than when determined using the SEM. Therefore, the value of the MCID determined using the RCI was chosen as the final estimate. The resulting MCID values were 4.14, 3.41, 3.37, and 3.28 in the physiological, psychological, social, and therapeutic domains, respectively.

Discussion
There is increasing recognition that PROs are important in the estimation of the burden of long-term survival among patients with GC. In this environment, it is essential to become better acquainted with information regarding patients' QoL [3]. Therefore, the present study developed a reliable and valid patient-reported scale for patients with GC in China. Using the currently available PRO instruments as a starting point, we developed the GC-PROM to assess the QoL of patients with GC. The GC-PROM comprises four domains, 13 subdomains, and 38 items. The results of our study indicated that the GC-PROM is a valid instrument for measuring quality of life among patients with GC. The application of PROs in the evaluation of curative effects could make clinicians more aware of the patient's situation and provide a reference for diagnosis and treatment [7]. Quality of life research conducted in China has historically involved the use of questionnaires that have been translated from another language. As a result, some of the items have been inconsistent with habits typical of Chinese people, particularly habits pertaining to inherently personal practices, or have asked about habits that many Chinese people would consider to be sensitive areas of inquiry, resulting in potential bias [17]. The scale developed in the current study via discussion with specialists and interviews with patients with GC addresses this applicability problem with regard to patients in China. The GC-PROM is characterized by taking the therapeutic field and family relationships as independent domains, in contrast to other GC questionnaires. The measurement of patients' satisfaction with the treatment they received is the main focus in new drug clinical trials [9]. The therapeutic subdomains (i.e., effectiveness, compliance, drug side effects) can provide related information about the effects of the targeted drug on patients' quality of life and identify the acceptance of new drugs among patients. Researchers can promote clinical therapeutic drug development and select an optimal therapy based on the information and data gained. In the social field, family relationships are emphasized to recognize the importance of family support during the recovery of patients.
Exploratory factor analysis was carried out in the four domains based on the unidimensionality assumption of IRT [27]. The Kaiser-Meyer-Olkin values in the four domains were 0.822, 0.875, 0.761, and 0.774 in the first item-selection process. The P value of Bartlett's test of sphericity was < 0.001, indicating that the data were suitable for factor analysis. Four, three, two, and four factors with eigenvalues greater than 1 were extracted from the physiological, psychological, social, and therapeutic domains, respectively. The factor analysis also showed that each factor (i.e., subdomain) was unidimensional. The GRM was then fitted to the items of each subdomain.
Many methods were used in selecting the items. This variety was intended to ensure the quality of the selection and to make the selected items more representative, independent, and sensitive. Previous research mostly used CTT methods for item selection. Recently, IRT has gradually gained popularity for selecting items [23]. The GRM is one of the most commonly used IRT models and is suitable for Likert-type scales, so it was used as a criterion for selecting items in our study. The significance of IRT is that it can guide item selection and test construction. The information function of IRT can be used to describe how precisely items measure, which can guide the formation and modification of these items [24]. Therefore, the present study used IRT in the process of creating the GC-PROM.
To obtain reliable and accurate parameter estimates, some scholars have suggested that the sample size should be 5 to 10 times the number of observed variables in a factor analysis [20]. Most previous work that has applied item response theory (IRT) has not specified the sample size [35]. We conducted a pre-survey among a small sample (145 patients with GC and 55 healthy subjects) using a 68-item questionnaire. The purpose of this pre-survey was to ask patients how they felt about the GC-PROM items. This avoided ambiguity in understanding and reduced omission of important information. Patients were also able to point out the shortcomings of the scale in the pre-survey. For the formal survey, a larger sample (400 patients with GC and 130 healthy subjects) responded to a questionnaire with a reduced number of items (46 items) to improve the rationality of the GC-PROM.
In the development stage of the GC-PROM, we used healthy subjects as a control group to evaluate discriminant validity. The scores of the healthy subjects on the 13 subdomains could be used as baseline values. In future practical applications of the GC-PROM, we will evaluate the instrument's discriminant validity using patients with gastrointestinal diseases and non-GC patients as controls. Concurrent validity was not evaluated as part of the validation stage of the GC-PROM because the simultaneous use of other existing scales in the actual investigation phase may result in estimation bias. In addition, administering multiple questionnaires places a burden on patients with GC, which may increase patients' boredom and the cost of the survey. Therefore, this study did not include specific comparisons between this scale and conventional questionnaires such as the EORTC QLQ-STO52 or FACT-Ga, and we could not compare the validity of the newly developed questionnaire (GC-PROM) with that of conventional ones. In a subsequent questionnaire survey, multiple gastric cancer scales (e.g., GC-PROM, EORTC QLQ-STO52, and FACT-Ga) will be used to evaluate the QoL of patients with GC and to assess concurrent validity. We used a distribution-based method to determine the value of the MCID. In the formal investigation, the repeated-measures sample size was relatively small, so the conditions were not well suited to the anchor-based method. In future studies, we will further standardize the sample size and the time interval for repeated measurements. Shanxi is a Mandarin-speaking province in northern China. Therefore, in the actual survey, the GC-PROM was in Mandarin, which is the standardized language commonly used in China. This approach ensured that the scale could be used in most areas of China, where Mandarin is used. However, in a few areas of southern China, such as Guangdong and Shenzhen, the most common language is Cantonese. For use in these areas, the newly developed GC-PROM would require further adjustment and verification.

Conclusions
This project essentially completed the development and validation of the GC-PROM according to the PRO production process stipulated by the United States Food and Drug Administration. GC-PROM can be considered an effective clinical evaluation instrument for patients with GC.
Additional file 1. Final version of GC-PROM. After two item-selection processes based on classical test theory and item response theory, the final GC-PROM consisted of 38 items. The file describes which items were included in the final scale.