Datasets acquisition for ordinal response prediction
The gene expression datasets analyzed to predict the drug response were acquired from two independent clinical trials. The two datasets are publically available from GEO under accession number [GEO: GSE9782] and [GEO: GSE68871]. They were originally published in Mulligan et al. [10] and Terragna et al. [2]. Mulligan et al. [10] recruited patients (n = 169) with relapsed myeloma enrolled in phase 2 and phase 3 clinical trials of bortezomib, whose pretreated tumor samples were further analyzed for genomic profiling with consent. Myeloma samples were collected prior to enrollment in clinical trials of bortezomib and samples were subject to replicate gene expression profiling using the Affymetrix 133A microarray (22,283 probes). Terragna et al. [2] focused primarily on treating the new MM patients (n = 118) with the induction therapy of VTD. The gene expression profiling (54,677 probes) was carried out in the Affymetrix Human Genome U133 Plus 2.0 Array. We standardized all the probes for statistical analysis in both datasets.
Outcome definitions
The original datasets contained drug response and overall survival as two outcomes. We here focused on developing genomic classifier for the drug response. In Mulligan et al. [10], patients were classified as achieving complete response (CR), partial response (PR), minimal response (MR), no change (NC) and Progressive Disease (PD) according to European Group for Bone Marrow Transplantation criteria [21]. In Terragna et al. [2], patients’ drug responses were classified as five categories: complete response (CR), near complete response (nCR), very good partial response (VGPR), partial response (PR) and stable disease (SD). For both datasets, the fivelevel ordinal drug response was used in our analysis. In the meantime, we combined the fivelevel drug response to a new threelevel drug response in both datasets to avoid low frequencies in certain levels. For Mulligan et al. [10], we combined PD and NC to a new level, and PR and MR as another new level. For Terragna et al. [2], we combined SD and PR as a new level, and VGPR and nCR as another new level. Both the original fivelevel and the new threelevel outcomes were separately analyzed for these two datasets.
Ordinal drug response prediction modeling
Let y_{
i
} be the ordinal outcome for which there exists a clear ordering of the response categories, and X_{
ij
} the gene expression profile for the ith individual and jth probe in the data of sample size n with a total number J of probes. For notational convenience, we code the ordinal outcome as the integers 1, 2, ···, K, with K being the number of categories.
Univariate ordinal logistic regression to rank top probes
It will not be efficient to include all the genes in a predictive model, due to the large number of genes. We thus first use the univariate ordinal logistic regression to filter top q associated probes of the gene expression profile with the ordinal outcome. For the jth gene expression X_{
ij
}, the univariate ordinal logistic regression is expressed as:
$$ \Pr \left({y}_i=k\right)=\left\{\begin{array}{l}1{\mathrm{logit}}^{1}\left({X}_{ij}\alpha {c}_{1j}\right)\kern9em \mathrm{if}\kern0.5em k=1\\ {}{\mathrm{logit}}^{1}\left({X}_{ij}\alpha {c}_{\left(k1\right)j}\right){\mathrm{logit}}^{1}\left({X}_{ij}\alpha {c}_{kj}\right)\kern1.25em \mathrm{if}\kern0.5em 1<k<K\\ {}{\mathrm{logit}}^{1}\left({X}_{ij}\alpha {c}_{\left(k1\right)j}\right)\kern9.25em \mathrm{if}\kern0.5em k=K\ \end{array}\right. $$
where c_{
kj
} denoted cutpoints or thresholds, are constrained to increase, c_{1j} < ⋯ < c_{(K − 1)j}. We then select the top q associated probes based on the pvalue for testing the hypothesis H_{0}: α = 0.
Predictive modeling for genomic classifiers
We use all the q selected probes to build a multivariable ordinal model for predicting the multilevel response, i.e.
$$ \Pr \left({y}_i=k\right)=\left\{\begin{array}{l}1{\mathrm{logit}}^{1}\left({X}_i\beta {c}_1\right)\kern8.5em \mathrm{if}\kern0.5em k=1\\ {}{\mathrm{logit}}^{1}\left({X}_i\beta {c}_{k1}\right){\mathrm{logit}}^{1}\left({X}_{ij}\beta {c}_k\right)\kern1.25em \mathrm{if}\kern0.5em 1<k<K\\ {}{\mathrm{logit}}^{1}\left({X}_i\beta {c}_{k1}\right)\kern9em \mathrm{if}\kern0.5em k=K\ \end{array}\right. $$
where the vector X_{
i
} includes the expression measures of the q genes, and β = (β_{1}, · ··, β_{
q
})^{T} is a vector of the effects. With hundreds or tens of correlated top associated probes, however, the standard ordinal logistic regression may fail, due to the nonidentifiability and overfitting. To overcome the problems, we use an appropriate prior distribution to constrain the coefficients to lie in reasonable ranges and allow the model to be reliably fitted and to identify important predictors [22, 23]. We employ the commonly used Cauchy prior distribution on the coefficients in the ordered logistic regression:
$$ p\left({\beta}_j\right)=\mathrm{Caucy}\left(0,s\right)=\frac{1}{\pi s}\frac{1}{1+{\beta}_j^2/{s}^2} $$
The scale parameter s controls the amount of shrinkage in the coefficient estimate; smaller s induces stronger shrinkage and forces more coefficients towards zero. For the cutpoints parameters, we use a uniform prior. We have developed a quasiNewton algorithm (BFGS) for fitting the hierarchical ordinal model by finding the posterior mode of the parameters (β, c), i.e., estimating the parameters by maximizing the posterior density.
Our algorithm has been implemented in our R package BhGLM, which is freely available from the website http://www.ssg.uab.edu/bhglm/ and the public GitHub repository https://github.com/abbyyan3/BhGLM that includes R codes for examples.
Assessing the performance of a fitted hierarchical ordinal logistic regression
After fitting a hierarchical ordinal model, we obtain the estimate (\( \widehat{\beta},\widehat{c} \)) and can estimate the probabilities: \( {p}_{ik}=\Pr \left({y}_i=k{X}_i\widehat{\beta},\widehat{c}\right),i=1,\cdots, n;k=1,\cdots, K \). Denote y_{
ik
} = I(y_{
i
} = k) as the binary indictor response for the kth category. We can evaluate the performance using several measures:

(1)
Deviance: \( d=2{\sum}_{i=1}^n\log {p}_{ik} \). Deviance measures the overall quality of a fitted model;

(2)
AUC (area under the ROC curve). We can calculate AUC for the kth category using {y_{
ik
}, p_{
ik
}: i = 1, ···, n} as usual. Then the AUC for all the categories is defined as\( \frac{1}{K}{\sum}_{k=1}^K{AUC}_k \).

(3)
MSE (mean squared error). MSE is defined as: \( MSE=\frac{1}{K}{\sum}_{k=1}^K\left[\frac{1}{n}{\sum}_{i=1}^n{\left({y}_{ik}{p}_{ik}\right)}^2\right] \).

(4)
Misclassification. The misclassification is defined as: \( MIS=\frac{1}{K}{\sum}_{k=1}^K\left[\frac{1}{n}{\sum}_{i=1}^nI\left({y}_{ik}{p}_{ik}>0.5\right)\right] \), where I( y_{
ik
} − p_{
ik
} >0.5) = 1 if ∣y_{
ik
} − p_{
ik
} ∣ > 0.5, and I( y_{
ik
} − p_{
ik
} >0.5) = 0 if ∣y_{
ik
} − p_{
ik
} ∣ ≤ 0.5;
To evaluate the predictive performance of the model, we use the prevalidation method, a variant of crossvalidation [24, 25], by randomly splitting the data to H subsets of roughly the same size, and using (H – 1) subsets to fit a model. We then calculate the measures described above with hth subset and cycle through all H subsets to get the averaged measurements to evaluate the predictive performance. To get stable results, we can run Hfold crossvalidation multiple times and use the average of the measure over the repeats to assess the predictive performance. We also can use leaveoneout crossvalidation (i.e., H = n) to obtain unique result. In this study, 10fold crossvalidation with 10 repeats and leaveoneout crossvalidation were both utilized. Deviance AUC, MSE and misclassification rate were all reported.
Selecting optimal scale values
The performance of the hierarchical ordinal model can depend on the scale parameter in the Cauchy prior. We fit a sequence of models with different scales ranging from 0.01 to 1 by 0.01, from which we can choose an optimal one based on the criteria described above.
Selecting the optimal number of q probes
To select an optimal number of top q probes, we fit a sequence of models with a different number of probes with the options from 30 and 50 to 500 by 50. The number q will be determined based on the predictive performance of the hierarchical ordinal logistic regression by evaluating the deviance of the models. The chosen top q probes will be identified as associated significant biomarkers and present in heatmaps for visual examination.