GC-PROM: Validation of a patient-reported outcomes measure for patients with gastric cancer

Background: There is increasing recognition that PROs are important in the estimation of the burden of long-term survival among patients with gastric cancer. The study aimed to develop a disease-specific instrument to assess patient-reported outcomes for patients with gastric cancer. Method: Following the FDA's draft guidance for patient-reported outcomes, the conceptual framework and item pool were defined based on relevant existing work. A draft scale was formed after revising some items based on feedback from experts and patients with gastric cancer. The pre-survey and formal survey were conducted in eight different hospitals in Shanxi Province, and two item-selection processes based on classical test theory and item response theory were carried out. Finally, the patient-reported outcomes measure for patients with gastric cancer (GC-PROM) was validated in terms of reliability, validity, and feasibility. The minimal clinically important difference was determined by a distribution-based method. Results: The final GC-PROM consisted of 38 items, 13 subdomains, and 4 domains. Reliability was verified by Cronbach's alpha coefficients for the four domains and 13 subdomains, respectively. The validity results showed that the multidimensional scale fulfilled expectations. In the formal survey, the completion rate was 96.16%, and the average filling time was less than half an hour. The values of the minimal clinically important difference were 4.14, 3.41, 3.37, and 3.28 in the four domains. Conclusions: The GC-PROM had good reliability, validity, and feasibility and thus can be considered an effective clinical evaluation instrument for patients with gastric cancer.
GC-PROM: patient-reported outcomes measure for patients with gastric cancer; PRO(s): patient-reported outcome(s); GC: Gastric cancer; QoL: quality of life; EORTC QLQ-C30: European Organization for Research and Treatment of Cancer quality of life questionnaire-core questionnaire; EORTC QLQ-STO52: European Organization for Research and Treatment of Cancer quality of life questionnaire-stomach module; FACT-Ga: Functional Assessment of Cancer Therapy-gastric; QLICP-ST: quality of life instruments for cancer patients-stomach cancer; IRT: Item response theory; CTT: Classical test theory; MCID: Minimal clinically important difference; SEM: Standard error of measurement; RCI: Reliable change index.

Symptom Scale developed by Chen-wun in Taiwan, China [13]. EORTC QLQ-STO52, FACT-Ga, and QLICP-ST were developed by combining a general module with a specific module. The Chinese versions of EORTC QLQ-STO52 and FACT-Ga had been culturally adapted and evaluated [14], but some items might still not be suitable for Chinese culture. QLICP-ST was a gastric cancer scale developed for Chinese cancer patients.
However, it might contain fewer disease-specific items than the EORTC QLQ-STO52. It had few specific items on the effectiveness, compliance, satisfaction, and side effects in the field of cancer treatment [15]. The Special Symptom Scale developed by Chen-wun also was not divided into domains [13].
In sum, there are already many reliable scales for measuring the QoL of patients with GC worldwide. However, if used alone, these scales are often not specific enough and cannot comprehensively measure the QoL of patients with GC [16]. Additionally, because QoL depends strongly on cultural background, foreign scales cannot be used directly after translation. Because of economic and cultural differences across regions of China, Chinese-developed instruments for patients with GC have not been widely used [17]. Therefore, it was necessary to develop a PROM for patients with GC that focuses more on the treatment-related aspects as they are perceived by patients. In addition to laboratory and imaging methods, the data from a PROM can be used to improve the reliability of clinical efficacy evaluations by comprehensively measuring many aspects of patient-reported health [18]. As a result, PROs are able to provide a reference for doctors in their diagnosis and treatment practices [19]. Prior to using PRO measures in clinical practice and research, the instruments need to be carefully developed and validated to avoid biased results that might lead to incorrect interpretations [20].

Setting
The two surveys (i.e., pre-survey and formal survey) were carried out in eight hospitals in Shanxi Province, China. These hospitals were the First Hospital of Shanxi Medical University, the Second Hospital of Shanxi Medical University, Shanxi Cancer Hospital, the 264 Hospital of Chinese People's Liberation Army (PLA), the 17th Hospital of the Chinese Railway, the People's Hospital of Gaoping City, the People's Hospital of Zezhou City, and the Fourth People's Hospital of Linfen City.

Sample
Before collecting samples, investigators contacted the relevant departments of the target hospitals and communities to get support from hospital staff and community workers. Preparations were also made to publicize the study through posters in hospital departments and communities, and documents introducing the survey were distributed. From July 2015 to September 2015, patients diagnosed with GC were recruited. The inclusion criteria for patients with GC were as follows: patients who had been diagnosed with GC and were over 18 years old. The exclusion criteria were as follows: patients with other serious diseases; patients with disturbance of consciousness; and patients who were unable to understand or complete the questionnaire for any reason. We simultaneously selected healthy subjects who lived in the same communities as the patients. Healthy subjects met the following criteria: they were not suffering from other diseases of the digestive system, other malignant tumors, or mental illness; they were similar in age to the patients with GC; and they volunteered to participate in the investigation.

Development and formation of GC-PROM
The GC-PROM was developed in three phases [21], and details of each phase are described below. Figure 1 presents a flowchart of the three-phase development process. Literature searches were carried out in online databases for keywords such as PRO measure, PRO scale, PRO instruments, and gastric cancer. Using the FDA's principles on PROMs and the search results, we established a conceptual framework for the GC-PROM that included four domains and 13 subdomains. We conducted face-to-face interviews with ten patients with GC. Researchers wrote down the interviewees' original words as far as possible. After the interviews, all information was sorted and an initial item pool was developed.

Cognitive test and expert consultation
Another ten hospitalized patients with GC took part in a cognitive test of the questionnaire. The group included seven men and three women, with an average age of 54 years. We also sought views from experts. In the final step, we integrated the views of experts and patients to modify the items and develop the draft version of the GC-PROM.

Scale scoring
The response options of the items used five-point Likert scales, with scores ranging from zero to four points, and included positive items (items with higher QoL) and negative items (items with lower QoL). For convenience of calculation, positive items were recoded as the original score plus one point, and negative items were recoded as five minus the original score [22].

Phase 2: Formation of initial and final scales using two item-selection processes
During the formation of the GC-PROM, seven methods were used to select items through two item-selection processes. The first six methods were based on classical test theory (CTT), and IRT was used as the seventh method. One IRT model, Samejima's Graded Response Model, was the preferred methodology for statistically analyzing patients' latent traits [23]. An item was considered for selection if it was retained by six or more methods. In the pre-survey, an item's practical significance was considered before deletion: if the item was meaningful in practice, it was temporarily retained and screened again in the formal survey.
The item was finally removed if its deletion was still suggested.
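The recoding rule described under Scale scoring can be illustrated with a short sketch. It is not the authors' code: the function name `recode_item` and the `positive` flag are hypothetical, and raw responses are assumed to be integers from 0 to 4.

```python
def recode_item(raw_score: int, positive: bool) -> int:
    """Recode a raw 0-4 Likert response to a 1-5 item score.

    Positive items: original score plus one.
    Negative items: five minus the original score.
    """
    if raw_score not in range(5):
        raise ValueError("raw_score must be an integer from 0 to 4")
    return raw_score + 1 if positive else 5 - raw_score


# Both item types end up on the same 1-5 range; negative items are reversed.
positive_scores = [recode_item(s, positive=True) for s in range(5)]
negative_scores = [recode_item(s, positive=False) for s in range(5)]
```

Under this reading of the rule, a raw response of 0 on a negative item receives the highest recoded score, so all items point in the same direction before subdomain scores are summed.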

Statistical methods
Seven methods were used to evaluate the items:
When the standard deviation (SD) of an item was ≤ 1, the corresponding item was deleted [24].
We deleted items whose factor loadings in the exploratory factor analysis were low (< 0.4) or close to their loadings on other factors [25].
An item was considered for deletion when the Pearson correlation coefficient between the item and its subdomain was < 0.60 or the Pearson correlation coefficient between the item and another subdomain was > 0.50 [25].
An item was considered for deletion when the corrected item-total correlation was < 0.50 and the item's deletion increased the value of Cronbach's alpha coefficient [24].
In the distinction analysis, each item score was compared between patients and healthy subjects using a t-test. Deletion was recommended for items with P values > 0.05 [23].
In the Graded Response Model, the practical values of the item parameters for deletion were as follows: item discrimination parameter (a) < 0.4 or difficulty parameter (b) outside the range (-3, 3) [27].
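Two of the CTT criteria above, together with the rule that an item is kept when six or more methods retain it, can be sketched in a few lines. This is an illustrative sketch only: the function names, the dictionary layout, and the toy scores are hypothetical, and the item-total flag covers only the correlation half of that criterion (the change in Cronbach's alpha is not checked here).

```python
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def ctt_flags(items, name):
    """Deletion flags for one item within its subdomain.

    items: dict mapping item name -> list of recoded scores, one per respondent.
    """
    scores = items[name]
    # Subdomain total with the item itself left out (the "corrected" total).
    rest = [sum(items[k][i] for k in items if k != name)
            for i in range(len(scores))]
    return {
        "low_sd": stdev(scores) <= 1.0,                   # SD criterion
        "low_item_total": pearson(scores, rest) < 0.50,   # corrected item-total r
    }


def keep_item(deletion_votes, n_methods=7, min_retained=6):
    """An item is kept when at least `min_retained` methods retain it."""
    return n_methods - deletion_votes >= min_retained


# Toy data (hypothetical): three items answered by five respondents.
toy = {"q1": [1, 2, 3, 4, 5], "q2": [2, 2, 3, 5, 5], "q3": [3, 3, 3, 3, 4]}
flags_q1 = ctt_flags(toy, "q1")
flags_q3 = ctt_flags(toy, "q3")
```

In the toy data, `q3` varies very little across respondents, which is the situation the SD criterion is meant to catch.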

Phase 3: evaluation of measurement properties
The properties of the final GC-PROM version were assessed using data from the formal investigation.

Evaluation of reliability
The internal consistency of the GC-PROM was assessed using the Cronbach's alpha coefficients of the 13 subdomains. Generally, a value of more than 0.70 indicated good internal consistency [28].
Construct validity. Confirmatory factor analysis was used to examine the structure of the GC-PROM. The standardized factor loading of an item should be greater than 0.5 [30].
Discriminant validity. Discriminant validity is the ability of an instrument to measure a difference between two groups. The t-test was used to compare differences between patients with GC and healthy subjects, with the significance level set at P < 0.05 [31].

Evaluation of feasibility
Feasibility mainly reflects the acceptability of the GC-PROM. The return and response rates of the questionnaires were judged against the general requirement of 85%. The questionnaire completion time was generally less than half an hour. We also examined the proportion of missing data and the maximum endorsement frequencies [32].
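The two item-level feasibility indicators mentioned above can be computed as in the following sketch. The function names are hypothetical, missing answers are assumed to be stored as `None`, and the choice to compute the endorsement frequency over observed answers only is an assumption rather than something the text specifies.

```python
from collections import Counter


def missing_proportion(responses):
    """Share of respondents with no answer for this item."""
    return sum(r is None for r in responses) / len(responses)


def max_endorsement_frequency(responses):
    """Largest share of observed answers falling in a single category."""
    observed = [r for r in responses if r is not None]
    counts = Counter(observed)
    return max(counts.values()) / len(observed)


# Toy item (hypothetical): five respondents, one missing answer.
toy_item = [1, 1, 2, None, 3]
toy_missing = missing_proportion(toy_item)
toy_max_endorsement = max_endorsement_frequency(toy_item)
```

A very high maximum endorsement frequency means that most respondents chose the same category, which is how floor or ceiling effects show up at the item level.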

Interpretation of PRO results: Minimal clinically important difference (MCID)
MCID was designed to solve the clinical explanation problem of a GC-PROM score change [33]. The methods used to estimate the MCID mainly include the effect size (ES), standard error of measurement (SEM), standardized response mean, and reliable change index (RCI) [34]. In this article, we used SEM and RCI to estimate the MCID.
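For orientation, the two distribution-based quantities can be written down in their commonly used textbook forms. The text does not state which exact variants the authors applied, so the formulas below (SEM as the baseline SD times the square root of one minus the reliability, and an RCI-based threshold of 1.96 times the square root of two times the SEM) are assumptions, as are the function names and the example inputs.

```python
import math


def sem(sd_baseline: float, reliability: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd_baseline * math.sqrt(1.0 - reliability)


def rci_threshold(sd_baseline: float, reliability: float, z: float = 1.96) -> float:
    """Smallest change exceeding measurement error at the chosen z level."""
    return z * math.sqrt(2.0) * sem(sd_baseline, reliability)


# Hypothetical inputs, not values from the study.
example_sem = sem(10.0, 0.84)
example_rci = rci_threshold(10.0, 0.84)
```

Under these definitions the RCI-based threshold is always about 2.77 times the SEM, so it is necessarily the larger of the two estimates.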

Participant characteristics
A total of 145 patients and 55 healthy subjects were included in the pre-survey. Among these subjects, 20 patients completed the questionnaire again four days after first completing it. Finally, completed questionnaires were collected from 130 patients and 52 healthy subjects. All 20 retest questionnaires were recovered. In the formal survey, a total of 530 questionnaires (400 patients with GC, 130 healthy subjects) were administered. Ultimately, completed questionnaires were collected from 364 patients with GC and 112 healthy subjects. A total of 45 patients with GC were retested, and all of the retest questionnaires were recovered. We compared the baseline data of the two groups using t-tests for continuous variables and chi-square tests for categorical variables. With the significance level set at P < 0.05, the results showed that the baseline data from patients with GC and from healthy subjects were comparable.

The Conceptual Framework of the GC-PROM
The established conceptual framework included four domains and 13 subdomains. After the literature review and interviews with patients with GC, an initial pool of 79 items was developed. Based on the cognitive test and expert consultation, we deleted 14 items, added three items, and modified two items. Finally, the draft scale contained 4 domains (physiological, psychological, social, and therapeutic domains), 13 subdomains (abdominal symptoms, systemic symptoms, physical state, independence, anxiety, depression, pessimism, fear, social support, social adaptation, effectiveness, satisfaction, compliance, and drug side effects), and 68 items.

Formation of the Initial and Final Scales through Two Item-selection Processes
Seven methods, including the SD, exploratory factor analysis, Cronbach's alpha coefficient, retest reliability, correlation coefficient, distinction analysis, and IRT, were used to select items. In the first item-selection process, twenty-two items in the item pool were suggested for deletion by the seven methods. The practical meanings of these 22 items were also taken into account. Finally, a consensus was reached that these items should be deleted. In the second item-selection process, a formal investigation was conducted with the reduced 46-item questionnaire. The items were again screened using the above seven methods and their practical meanings.
According to the results shown in Table 2, eight items were deleted.
Insert Table 2 here
Finally, the scale contained 4 domains, 13 subdomains, and 38 items (see Additional file 1). The structural framework of the final scale is shown in Table 3.

Evaluating the Properties of the GC-PROM
The final GC-PROM was evaluated for validity, reliability, and feasibility using data obtained from 364 patients with GC and 112 healthy subjects.

Evaluation of reliability
Cronbach's alpha coefficients for the four domains and 13 subdomains were between 0.700 and 0.917. These values indicate that the GC-PROM demonstrated a good degree of internal consistency reliability.
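As a reminder of how the reported coefficient is obtained, the sketch below computes Cronbach's alpha for one subdomain from its item scores using the standard formula, k/(k-1) times one minus the ratio of summed item variances to the variance of the total score. The function name and the toy scores are hypothetical and unrelated to the study data.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for one subdomain.

    item_scores: list of items, each a list of scores (one per respondent).
    """
    k = len(item_scores)
    if k < 2:
        raise ValueError("at least two items are required")
    # Total score of each respondent across the k items.
    totals = [sum(resp) for resp in zip(*item_scores)]
    item_var_sum = sum(variance(item) for item in item_scores)
    return k / (k - 1) * (1 - item_var_sum / variance(totals))


# Toy subdomain (hypothetical): three items, five respondents.
toy_items = [[1, 2, 3, 4, 5], [2, 2, 3, 5, 5], [1, 3, 3, 4, 4]]
toy_alpha = cronbach_alpha(toy_items)
```

The same function would be applied once per subdomain and once per domain, and each result compared with the 0.70 threshold used in the study.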

Evaluation of validity
Content validity. To ensure that all the items were appropriate, we assessed content validity by referring to the relevant previous literature. Face-to-face interviews were conducted with patients with GC to identify potential items. Meanwhile, we also consulted with experts for item refinement.
Construct validity. The results of the confirmatory factor analysis are shown in Table 4. The standardized factor loadings of the 13 subdomains were greater than 0.5. Therefore, the construct validity was deemed satisfactory.
Insert Table 4 here
Discriminant validity. The results of discriminant validity are shown in Table 5. These results (P values < 0.05) suggested that the GC-PROM was an appropriate instrument to distinguish between patients and healthy subjects.

Evaluation of feasibility
In the formal survey, the return and response rates of the questionnaires were 93.40% and 96.16%, respectively. The average completion time was less than half an hour. No major floor or ceiling effects were found. The maximum proportion of participants who endorsed a single category for each item was less than 80%. Only 3.84% of the responses to individual items were missing. We tested the missing questionnaire data using Little's Missing Completely at Random Test. The test showed that the data were missing at random, and we filled them in using the Expectation-Maximization Algorithm.

MCID
According to the statistical results in the MCID table, the value of the MCID was greater when determined using the RCI than when determined using the SEM. Therefore, the value of the MCID determined using the RCI was chosen as the final judgment. We finally identified MCID values of 4.14, 3.41, 3.37, and 3.28 in the physiological, psychological, social, and therapeutic domains, respectively.

Discussion
There is increasing recognition that PROs are important in the estimation of the burden of long-term survival among patients with gastric cancer. In this environment, it is essential to get more acquainted with information regarding patients' QoL [3]. Therefore, the study developed a reliable and valid patient-reported scale for patients with GC in China. Using the currently available PRO instruments as a starting point, we developed the GC-PROM to assess the QoL of patients with GC. It consisted of four domains, 13 subdomains, and 38 items. The results of our study indicated that the GC-PROM was a valid instrument for measuring survival state for patients with GC. The application of PRO in the evaluation of curative effects could make clinicians more aware of the patient's situation and provide a reference for diagnosis and treatment [7].
Exploratory factor analysis was carried out in the four domains based on the unidimensionality assumption of IRT [27]. The Kaiser-Meyer-Olkin values in the four domains were 0.822, 0.875, 0.761, and 0.774 in the first item-selection process. The P value of Bartlett's test of sphericity was < 0.001, indicating that the data were suitable for factor analysis. Four, three, two, and four factors with eigenvalues greater than 1 were extracted from the physiological, psychological, social, and therapeutic domains, respectively. The factor analysis also showed that each factor (i.e., subdomain) was unidimensional. The GRM was then fitted to the items of each subdomain.
Many methods were used in selecting the items. This variety of methods was intended to ensure the quality of the selection and to make the selected items more representative, independent, and sensitive. Previous research mostly used CTT methods for item selection. Recently, IRT has gradually gained popularity for selecting items [23]. The GRM is one of the most commonly used IRT models and is suitable for Likert-type scales. The GRM was used as a criterion for selecting items in our study. The significance of IRT is that it can guide item selection and test construction. The information function of IRT can be used to describe items' measurement validity, which can serve as a guide for the formation and modification of these items [24]. Therefore, the present study used IRT in the process of creating the GC-PROM.
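To make the GRM concrete, the sketch below evaluates the category probabilities of a single item under the usual logistic form of Samejima's model, where the probability of responding in category k or higher is 1 / (1 + exp(-a(θ - b_k))). The function name and the parameter values are hypothetical, and the study's own estimation software and settings are not specified here.

```python
import math


def grm_category_probs(theta, a, thresholds):
    """Category probabilities under Samejima's graded response model.

    theta: latent trait level of the respondent
    a: item discrimination parameter
    thresholds: increasing difficulty parameters b_1 < ... < b_{m-1}
                for an item with m ordered categories
    """
    def cum(b):
        # Probability of responding in the category above threshold b or higher.
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))

    # Cumulative probabilities: 1 for "lowest category or higher",
    # 0 for "beyond the highest category".
    cumulative = [1.0] + [cum(b) for b in thresholds] + [0.0]
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]


# Hypothetical five-category item evaluated at an average trait level.
probs = grm_category_probs(theta=0.0, a=1.2, thresholds=[-2.0, -1.0, 1.0, 2.0])
```

A larger discrimination parameter makes the category probabilities change more sharply with the latent trait, which is why items with a very small a contribute little information and become candidates for deletion.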
In the development stage of the GC-PROM, we used healthy subjects as controls to evaluate discriminant validity. The scores of healthy subjects on the 13 subdomains could serve as baseline values. In the practical application of the GC-PROM, we will evaluate its discrimination by taking patients with gastrointestinal diseases and non-gastric cancer patients as controls. Concurrent validity was not evaluated in the validation stage of the GC-PROM, because administering other scales at the same time during the actual investigation might have biased the estimates. When determining the value of the MCID, we used a distribution-based method. In the formal investigation, the repeated measurement sample size was relatively small, so the conditions were not well suited to the anchor-based method. In future studies, we will further standardize the sample size and the time interval for repeated measurements.

Consent for publication
All authors have approved the manuscript for publication.

Availability of data and materials
Please contact the corresponding author for the study data, which will be granted upon reasonable request.

Authors' contributions
All authors participated in the study design. XH and FZ were responsible for collecting the data and drafting the article. YH and YL participated in the data analysis. JL and YZ proposed the original concept for this study, supervised the data analysis, and revised the paper. All authors read and approved the final manuscript.

Tables
Table 1 Baseline
Negative items were denoted by "-". Positive items were denoted by "+".
Table 5 Score comparisons between healthy subjects and patients with GC

Figure 1
A flowchart of the three-phase development process
Phase 1: Identification of conceptual framework and items

Supplementary Files
This is a list of supplementary files associated with this preprint. supplement1.docx