This website was developed at the Charité Berlin to enhance the comparability of data from Patient-Reported Outcomes (PROs). Using various published IRT models – so-called common metrics – you can place data collected with different self-report instruments onto one common scale. Once your data are scored, you can download the estimates for further statistical analysis. To date, our app includes metrics for Depression, Anxiety, and Physical Functioning.
In the field of PRO measurement there is a plethora of instruments and questionnaires. For example, it has been estimated that more than 100 instruments have been designed to measure (aspects of) depression or depressive severity alone.
All these instruments differ in various ways, for example, with respect to their underlying philosophies, the psychometric principles of their construction, their emphasis on different aspects of the construct, and their precision and validation. One of the main challenges is that data obtained with different measures are hard to compare. As a consequence, several measures are often administered in a single study for the sake of comparability, which increases respondent burden. In summary, the lack of standardization in PRO measurement is a problem that has been widely acknowledged in the literature.
Unfortunately, in the framework of Classical Test Theory (CTT), scores from different measures are hard to compare. Item Response Theory (IRT) can enhance comparability across instruments by providing so-called common metrics.
A common metric is an IRT model, such as the Graded Response Model (GRM) or the Generalized Partial Credit Model (GPCM), that comprises item parameters from various measures of a common latent variable. Item parameters describe the relation between the item response and the latent variable. With such a statistical model, one can estimate this common variable from subsets of items, e.g., if different measures are used or if data are missing. Such models are usually estimated on large samples and often calibrated to a reference population.
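To make concrete what GRM item parameters encode, here is a minimal, self-contained Python sketch (the app itself is built on the R package mirt; the discrimination and threshold values below are invented for illustration). Each item has a discrimination parameter a and ordered thresholds b; category probabilities follow from differences of cumulative logistic curves.

```python
import math

def grm_category_probs(theta, a, b):
    """Graded Response Model: probability of each response category
    for an item with discrimination a and ordered thresholds b."""
    # Cumulative probabilities P(X >= k), padded with P(X >= 0) = 1 and 0
    cum = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - bk))) for bk in b]
    cum.append(0.0)
    # Category probability = difference of adjacent cumulative curves
    return [cum[k] - cum[k + 1] for k in range(len(b) + 1)]

# Hypothetical 4-category item: discrimination 1.5, thresholds -1.0, 0.0, 1.2
probs = grm_category_probs(0.0, 1.5, [-1.0, 0.0, 1.2])
# probs covers the 4 categories and sums to 1 by construction
```

Because every instrument's items are expressed through parameters like these on the same latent scale, any subset of items can be used to estimate the same variable.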
Several such common metrics have been developed over the past years. Further below you can find a short description including the full reference of the ones we included in the score conversion app.
The application presented here sets up an IRT model with all parameters fixed to the item parameters of the selected common metric. Currently, GRMs and GPCMs are implemented.
The underlying R package mirt uses a marginal maximum likelihood method to estimate item parameters of IRT models; hence, person parameters can be estimated independently of the item parameters. For person parameter estimation we included the Expected A Posteriori (EAP), Bayes Modal (MAP), Weighted Likelihood Estimation (WLE), and Maximum Likelihood (ML) methods. An EAP estimate for sum scores is available when your data stem from one measure only.
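To illustrate how EAP scoring works in principle, here is a minimal Python sketch (not the app's actual R/mirt code; the item parameters are invented for illustration): the EAP estimate is the posterior mean of theta, computed on a quadrature grid under a standard normal prior.

```python
import math

def grm_pattern_likelihood(theta, responses, items):
    """Likelihood of a response pattern under the Graded Response Model.
    responses: observed categories (0-based); items: (a, thresholds) tuples."""
    lik = 1.0
    for x, (a, b) in zip(responses, items):
        # Cumulative probabilities P(X >= k), padded with 1 and 0
        cum = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - bk))) for bk in b] + [0.0]
        lik *= cum[x] - cum[x + 1]
    return lik

def eap_estimate(responses, items, n_points=61):
    """EAP person estimate: posterior mean of theta on a fixed grid,
    under a standard normal prior (mean = 0, variance = 1)."""
    grid = [-4 + 8 * i / (n_points - 1) for i in range(n_points)]
    prior = [math.exp(-t * t / 2) for t in grid]
    post = [w * grm_pattern_likelihood(t, responses, items) for t, w in zip(grid, prior)]
    return sum(t * p for t, p in zip(grid, post)) / sum(post)

# Two hypothetical 4-category items
items = [(1.5, [-1.0, 0.0, 1.2]), (1.2, [-0.8, 0.3, 1.0])]
```

Responding in the lowest categories pulls the estimate below 0 and the highest categories pull it above 0; the prior shrinks extreme response patterns toward the population mean, which is one reason the choice of estimation method matters.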
Some general remarks on the choice of estimation method:
With this app you can estimate theta scores using a standard normal prior (mean = 0, variance = 1), a diffuse normal prior (mean = 0, variance = 10), or a normal prior with mean and variance estimated from your data. Please note that after estimation your theta scores are transformed to the popular T-metric (mean = 50, SD = 10).
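The T-metric transformation mentioned above is a simple linear rescaling of the latent scale; as a sketch:

```python
def to_t_metric(theta, mean=50.0, sd=10.0):
    """Rescale a theta estimate (latent scale: mean 0, SD 1) to the T-metric."""
    return mean + sd * theta

to_t_metric(0.0)   # theta at the calibration mean -> T = 50.0
to_t_metric(1.5)   # 1.5 SD above the mean -> T = 65.0
```

On the T-metric, a score of 50 corresponds to the mean of the calibration (reference) population, and each 10 points correspond to one standard deviation.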
Test-specific standard errors for precision comparison were calculated using the testinfo() function of mirt for models comprising all items of a single questionnaire. These standard errors coincide with those obtained when theta is estimated with the ML method; hence, a comparison with test precision is only possible under the ML approach.
If you want to learn more about the implemented methods, we refer you to the following books:
We strongly believe that common metrics offer the chance to set standards in Patient-Reported Outcome measurement independent of the measures used. For example, many common metrics are anchored at meaningful values (e.g., 50 as the general population mean), which facilitates score interpretation.
The particular strengths of the direct estimation of the latent variable from the response pattern as provided here are:
Nonetheless, there are some limitations of this method:
In our opinion, the enhanced comparability of data – especially data already collected – outweighs these limitations. We encourage you to use our app and share your experiences with us, so that we can further investigate its strengths and limitations for future applications. As the app has not been widely tested yet, we ask you to use it with caution. Please feel free to contact us. We are interested in your experiences and your needs.
Please let us know what you think about this site and feel free to send us your questions. We are grateful for feedback and will provide support!
Furthermore, we would like to thank