Comparison of calibrated projection score equating and equipercentile score equating

Sangdon Lim

Introduction

Scale linking vs Score equating

a scale is an instrument composed of multiple items
a score is a value that quantifies responses to an instrument
- raw score
- T-score
- \(\theta\)

Scale linking

Scale link is achieved when

a set of item parameters \(\xi\)
is on the same metric as
another set of anchor item parameters \(\xi'\)

Notation

\(\xi\) : the set of item parameters (e.g. difficulty, …)

Scale linking

Suppose we have a response dataset \(\mathbf{X}\) with

items \(a_1 ... a_{10}\) from scale \(a\)

Through item parameter calibration on \(\mathbf{X}\), we can obtain

item parameters \(\xi_a\) for items \(a_1 ... a_{10}\)

Scale linking

Suppose we have another dataset \(\mathbf{X'}\) with

items \(a_1 ... a_{10}\) from scale \(a\)
items \(b_1 ... b_{10}\) from scale \(b\)

Let the item parameters from this dataset be denoted by \(\xi'_a\) and \(\xi'_b\)

Without conversion, \(\xi'_a\) is on a different metric compared to \(\xi_a\)

because \(\mathbf{X'}\) comes from a different ability range compared to \(\mathbf{X}\)

Scale linking

Scale link is achieved when \(\xi'_a\) is on the same metric as \(\xi_a\)

this makes \(\xi'_a\) comparable to \(\xi_a\)
this makes \(\xi'_b\) from \(\mathbf{X'}\) interpretable on the (unobtainable) metric of \(\xi_b\) as if it was from \(\mathbf{X}\)

Scale linking

Scale linking methods include

Linear transformation

find a function \(f: \xi' \rightarrow \xi\) that converts \(\xi'_a\) to the metric of \(\xi_a\)
once \(f\) is determined, \(f\) can be used to convert \(\xi'_b\) to \(\xi_b\)
Haebara method (1980)
Stocking-Lord method (1983)
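
As a rough illustration of the linear-transformation idea, the sketch below uses the simpler mean/sigma method, not the Haebara or Stocking-Lord methods named above, and assumes a unidimensional parameterization with discrimination a and difficulty b. The object names (xi_a, xi_prime_a, link_mean_sigma, convert) and all parameter values are hypothetical.

```r
# hypothetical anchor item parameters on the target metric (xi_a)
xi_a <- data.frame(a = c(1.2, 0.8, 1.5, 1.0), b = c(-1.0, 0.0, 0.5, 1.2))

# the same items on another metric (xi'_a), constructed for this example
# so that theta = A * theta' + B with A = 1.25 and B = 0.4
A_true <- 1.25
B_true <- 0.4
xi_prime_a <- data.frame(a = xi_a$a * A_true, b = (xi_a$b - B_true) / A_true)

# mean/sigma method: choose A and B so that the converted difficulties
# have the same mean and SD as the target difficulties
link_mean_sigma <- function(from, to) {
  A <- sd(to$b) / sd(from$b)
  B <- mean(to$b) - A * mean(from$b)
  list(A = A, B = B)
}

f <- link_mean_sigma(from = xi_prime_a, to = xi_a)

# apply the linear conversion to item parameters on the xi' metric
convert <- function(xi, f) {
  data.frame(a = xi$a / f$A, b = f$A * xi$b + f$B)
}

xi_a_converted <- convert(xi_prime_a, f)
```

The same convert() call could then be applied to the non-anchor parameters, which is the step described above for \(\xi'_b\).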

Scale linking

Scale linking methods include

Fixed-parameter calibration

Item calibration phase on \(\mathbf{X'}\) is modified
\(\xi' = \{\xi'_a, \xi'_b\}\) is estimated subject to the constraint \(\xi'_a = \xi_a\)
\(\xi' = \{\xi'_a, \xi'_b\}\) is obtained so that \(\xi'_a\) is on the same metric as \(\xi_a\)
This achieves the scale link
Further metric conversions should not be done
because altering the metric again breaks the link

Score equating

Score equating is achieved when

a set of score levels for one scale
is mapped to corresponding score levels on another scale

Score equating

Suppose that we have

scale \(a\) with scores \(x_a\) ranging in \([0, 10]\)
scale \(b\) with scores \(x_b\) ranging in \([0, 100]\)

Scores \(x_a\) and \(x_b\) are on different metrics

Score equating

Given a score \(x_a = 5\)

one may define a corresponding score level \(\hat{x}_b\) on instrument \(b\)
so that \(\hat{x}_b\) can be compared to \(x_b\) in the same metric

Score equating is the process of determining

the map \(f: x_a \rightarrow \hat{x}_b\) for all \(x_a\) levels

Score equating

Equipercentile equating is a method of score equating

scores of scale \(a\) are mapped onto the percentile \(p\) metric
and then onto the metric of scale \(b\)
so that \(x_a \rightarrow p \rightarrow \hat{x}_b\)

The process does not involve item parameters

only involves observed scores \(x_a\) and \(x_b\)
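
A minimal sketch of the \(x_a \rightarrow p \rightarrow \hat{x}_b\) chain, assuming simulated raw scores and a simplified treatment (mid-percentile ranks, default quantile interpolation, no smoothing). The function name equipercentile_sketch and the simulated data are hypothetical; dedicated equating software handles continuization and smoothing more carefully.

```r
set.seed(1)

# hypothetical observed raw scores on the two scales
x_a <- rbinom(500, size = 10,  prob = 0.5)   # scale a, range 0-10
x_b <- rbinom(500, size = 100, prob = 0.5)   # scale b, range 0-100

equipercentile_sketch <- function(x_from, x_to, levels_from) {
  # step 1: score level -> percentile rank (mid-percentile definition)
  p <- sapply(levels_from, function(v) {
    mean(x_from < v) + 0.5 * mean(x_from == v)
  })
  # step 2: percentile rank -> score on the other scale
  x_hat <- as.numeric(quantile(x_to, probs = p))
  data.frame(x_from = levels_from, p = p, x_to_hat = x_hat)
}

crosswalk <- equipercentile_sketch(x_a, x_b, levels_from = 0:10)
```

Only the observed score vectors enter the computation; no item parameters are used.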

Score equating

Equipercentile equating may be modified to get standardized scores

scores of scale \(a\) are mapped onto the percentile \(p\) metric
and then onto the \(\theta\) metric
so that \(x_a \rightarrow p \rightarrow \theta\)

To accomplish this,

scale \(b\) scores are first mapped onto the \(\theta\) metric
using a presupplied set of item parameters for scale \(b\)
the item parameters may be obtained from free calibration or converted with scale linking as needed

Score equating

The end product of score equating is a crosswalk table

Scale A (raw)	Scale B (raw)	Scale B (theta)	Scale B (T-score)
0	5.0	-0.781	42.189
1	14.1	-0.416	45.835
2	23.2	-0.108	48.918
3	32.3	0.159	51.589
4	41.4	0.394	53.944
5	50.5	0.605	56.052
6	59.6	0.796	57.958
7	68.7	0.970	59.698
8	77.8	1.130	61.299
9	86.9	1.278	62.781
10	96.0	1.416	64.161
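
The T-score column in the table is consistent, up to rounding, with the linear rescaling \(T = 10\theta + 50\). The sketch below reproduces the first three rows under that assumption and shows how a crosswalk can be used as a lookup table; the object names and the observed scores are hypothetical.

```r
# first three rows of the table above
theta_b   <- c(-0.781, -0.416, -0.108)
crosswalk <- data.frame(
  raw_a   = 0:2,
  raw_b   = c(5.0, 14.1, 23.2),
  theta_b = theta_b,
  t_b     = 10 * theta_b + 50   # assumed T-score rescaling
)

# look up the scale B T-score for a few hypothetical scale A raw scores
observed_raw_a <- c(2, 0, 1, 2)
looked_up_t    <- crosswalk$t_b[match(observed_raw_a, crosswalk$raw_a)]
```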

Summary

Scale linking is about the metrics of item parameters

Score equating is about the metrics of observed scores

Calibrated projection

Calibrated projection [CP; Thissen et al. (2011)] is a procedure for mapping the score levels between two scales

maps each score level in scale \(a\) onto a corresponding \(\theta\) in scale \(b\)
Lord-Wingersky recursion (1984) is the standard method

the objective is related to the metrics of scores
not to the metrics of item parameters
thus CP can be considered as a score equating method

Calibrated projection

Suppose that we have a response dataset \(\mathbf{X}\) with

scale \(a\) with items \(a_1 ... a_{10}\)
scale \(b\) with items \(b_1 ... b_{10}\)

A 2-factor IRT model is fitted onto the response dataset \(\mathbf{X}\)

Calibrated projection

model <- mirt.model("
  F1  =  1-10  # free estimation for scale a items
  F2  = 11-20  # free estimation for scale b items
  COV = F1*F2
")

cp_calib <- mirt(X, model, itemtype = "graded")

First discrimination parameter

is freely estimated for scale \(a\) items; fixed at \(0\) for other scales

Second discrimination parameter

is freely estimated for scale \(b\) items; fixed at \(0\) for other scales

Other item parameters are freely estimated as usual

The correlation between factors is freely estimated

Calibrated projection

Calibrated model can be used to produce a crosswalk table

Lord-Wingersky recursion (1984) is the standard method
requires multidimensional extension to apply to CP

In Thissen et al. (2011), the authors presented a table

raw scores in PedsQL scale are mapped onto T-scores in PAIS scale

Calibrated projection

Table 4 Thissen et al. (2011)

Calibrated projection

Table 4 (reproduced)

Calibrated projection

Step 1. Read in item parameters


# demo/CP_demo_read.r
# read origin tables and create item objects

d2 <- read.csv(file.path(root, "data/table2.csv"))
d3 <- read.csv(file.path(root, "data/table3.csv"))
d  <- cbind(
  d2[order(d2[, 1]), ],
  d3[order(d3[, 1]), -1]
)

ipar <- d[, -c(1, 14)]
colnames(ipar)[9:12] <- paste0("d", 1:4)
ipar <- ipar[, c(1:2, 9:12)]

itempool <- generate.mirt_object(ipar, itemtype = "graded")

Tables 2, 3 from Thissen et al. (2011)

Calibrated projection

Step 2. Initialize theta grid for multidimensional integration

# module/module_grid.r
# creates quadrature points over two-dimensional space

nd         <- 2
theta      <- seq(-4.5, 4.5, .2)
theta_grid <- as.matrix(expand.grid(theta, theta))
n_grid     <- dim(theta_grid)[1]

Used -4.5(0.2)4.5 for each dimension, fully crossed
Total # of quadrature points: 2116
With more dimensions, a fully crossed grid grows too large; other integration methods should be used
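
To make the cost of a fully crossed grid concrete, the short sketch below counts quadrature points for one to four dimensions using the same per-dimension grid. The object names are hypothetical.

```r
theta     <- seq(-4.5, 4.5, .2)
n_per_dim <- length(theta)          # points along a single dimension

# number of points in a fully crossed grid with 1, 2, 3, 4 dimensions
n_points        <- n_per_dim ^ (1:4)
names(n_points) <- paste0(1:4, "D")
```

The count is multiplied by the per-dimension grid size with every added dimension.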

Calibrated projection

Step 3. Function for getting category probability

# module/module_computeResponseProbability.r
# function for computing category response probability
# at a given theta point

computeResponseProbability <- function(
  itempool, theta, item_idx, score_level
) {

  # category probabilities for every item, evaluated at each row of theta
  probs       <- mirt::probtrace(itempool, Theta = theta)

  # columns are selected by name: "<item>.P.<category>", categories counted from 1
  itemname    <- colnames(itempool@Data$data)[item_idx]
  use_these   <- sprintf("%s.P.%s", itemname, score_level + 1)
  probs       <- probs[, use_these]

  return(probs)

}

input: item pool, 2D \(\theta\), item ID, score level on that item
output: the probability of that score level at each supplied \(\theta\) point
necessary for multidimensional Lord-Wingersky recursion
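
For readers without the item pool at hand, the sketch below computes graded-model category probabilities directly in base R, assuming the slope-intercept form in which \(P(X \ge k) = \mathrm{logistic}(a'\theta + d_k)\). The function name graded_prob_sketch and the item parameter values are hypothetical, not taken from the tables used in the demo.

```r
graded_prob_sketch <- function(theta, a, d) {
  # theta: n x k matrix of theta points
  # a:     length-k vector of slopes
  # d:     intercepts d1, d2, ... in decreasing order
  z <- as.vector(theta %*% a)

  # cumulative probabilities P(X >= k) for k = 1, ..., K - 1
  p_star <- sapply(d, function(dk) plogis(z + dk))
  p_star <- matrix(p_star, nrow = nrow(theta))

  # pad with P(X >= 0) = 1 and P(X >= K) = 0
  p_cum <- cbind(1, p_star, 0)

  # category probabilities are differences of adjacent cumulative probabilities
  p_cum[, -ncol(p_cum), drop = FALSE] - p_cum[, -1, drop = FALSE]
}

theta <- rbind(c(0, 0), c(1, -0.5), c(-2, 1))
a     <- c(0, 1.8)                  # item loads only on the second dimension
d     <- c(2.0, 0.5, -0.7, -2.2)    # four intercepts, five categories

P <- graded_prob_sketch(theta, a, d)
```

Each row of P holds the probabilities of score levels 0 to 4 at one \(\theta\) point.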

Calibrated projection

Step 4. Lord-Wingersky recursion (multidimensional extension)

# module/module_LWrecursion.r
# function for performing Lord-Wingersky recursion
# this obtains likelihoods of each score level over quadrature points

LWrecursion <- function(itempool, use_items, theta_grid) {

  L_init <- TRUE

  for (item_idx in use_items) {

    new_max_value_of_item <- itempool@Data$K[item_idx] - 1
    new_possible_values   <- 0:new_max_value_of_item

    P <- list()
    for (v in new_possible_values) {
      P[[as.character(v)]] <-
        computeResponseProbability(itempool, theta_grid, item_idx, v)
    }

    if (L_init) {

      L <- P
      old_possible_values <- new_possible_values
      L_init <- FALSE

    } else {

      map_values <- expand.grid(old_possible_values, new_possible_values)

      map_L <- do.call(rbind, L[as.character(map_values[, 1])])
      map_P <- do.call(rbind, P[as.character(map_values[, 2])])

      map_lls <- map_L * map_P

      tmp <- aggregate(map_lls, by = list(apply(map_values, 1, sum)), sum)

      tmp_lls   <- tmp[, -1]
      tmp_value <- tmp[, 1]

      L <- list()
      for (i in 1:nrow(tmp)) {
        L[[as.character(tmp_value[i])]] <-
          tmp_lls[i, ]
      }

      old_possible_values <- tmp[, 1]

    }

  }

  return(L)

}

Calibrated projection

Step 4. Lord-Wingersky recursion (multidimensional extension)

input: item pool, items to use, theta grid
output: for each possible score level, likelihood value of obtaining the score level at each quadrature point

# demo/CP_demo_LW.r
# likelihood values for 11 items in PedsQL instrument
# the test score ranges from 0-44

pedsql_items <- 18:28
L <- LWrecursion(itempool, pedsql_items, theta_grid)

Use PedsQL items

range of possible score levels: \([0, 44]\)
likelihood of obtaining score \(0\) at each of 2116 quadrature points
likelihood of obtaining score \(1\) at each of 2116 quadrature points
…
likelihood of obtaining score \(44\) at each of 2116 quadrature points
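
The recursion itself can be illustrated at a single \(\theta\) point with a compact sketch: items are added one at a time, and the summed-score likelihoods are updated after each item. The function name lw_sketch and the category probabilities are hypothetical; the module above applies the same idea over the whole grid at once.

```r
lw_sketch <- function(P) {
  # P: list of category-probability vectors (one per item, score levels 0, 1, ...)
  #    all evaluated at the same theta point
  L <- 1   # before any item is added, summed score 0 has likelihood 1
  for (p in P) {
    L_new <- numeric(length(L) + length(p) - 1)
    for (k in seq_along(p)) {
      # scoring k - 1 on the new item shifts every old summed score by k - 1
      idx        <- seq_along(L) + (k - 1)
      L_new[idx] <- L_new[idx] + L * p[k]
    }
    L <- L_new
  }
  names(L) <- 0:(length(L) - 1)
  L
}

# three hypothetical items with 3, 2, and 4 categories
P <- list(c(0.5, 0.3, 0.2), c(0.6, 0.4), c(0.1, 0.2, 0.3, 0.4))
L_point <- lw_sketch(P)
```

The result has one likelihood per summed score from 0 to 6, and the likelihoods sum to 1 at the given \(\theta\) point.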

Calibrated projection

Step 5. Compute EAP estimates from likelihoods

input: likelihoods, theta grid, latent correlation
output: EAP estimates and covariance matrix for each score level

# module/module_LtoEAP.r
# converts likelihoods obtained from Lord-Wingersky recursion
# into two-dimensional EAP estimates

LtoEAP <- function(L, theta_grid, sigma) {

  nd  <- dim(theta_grid)[2]
  tmp <- list()

  for (i in 1:length(L)) {

    num <- matrix(0, 1, nd)
    den <- 0

    for (j in seq_len(nrow(theta_grid))) {  # grid size from the argument, not a global
      term_T <- theta_grid[j, , drop = FALSE]
      term_L <- as.numeric(L[[i]][j])
      term_W <- dmvn(term_T, rep(0, nd), sigma)
      num <- num + (term_T * term_L * term_W)
      den <- den + (term_L * term_W)
    }

    th <- num / den

    num <- matrix(0, nd, nd)
    den <- 0

    for (j in seq_len(nrow(theta_grid))) {  # grid size from the argument, not a global
      term_T <- theta_grid[j, , drop = FALSE]
      term_C <- (term_T - th)
      term_V <- t(term_C) %*% term_C
      term_L <- as.numeric(L[[i]][j])
      term_W <- dmvn(term_T, rep(0, nd), sigma)
      num <- num + (term_V * term_L * term_W)
      den <- den + (term_L * term_W)
    }

    COV <- num / den

    tmp[[names(L)[i]]]$EAP <- th
    tmp[[names(L)[i]]]$COV <- COV

  }

  return(tmp)

}

Calibrated projection

Step 5. Compute EAP estimates from likelihoods

estimated correlation \(.96\) is used here

# demo/CP_demo_EAP.r
# converts likelihood values of PedsQL instrument
# into two-dimensional theta estimates

est_cor     <- .96
sigma       <- diag(nd)
sigma[2, 1] <- est_cor
sigma[1, 2] <- est_cor

EAP <- LtoEAP(L, theta_grid, sigma)

Calibrated projection

Step 5. Compute EAP estimates from likelihoods

the equations were adapted from Bryant et al. (2005)

Calibrated projection

Given a \(k\)-dimensional vector \(\theta\),

the \(k\)-dimensional EAP estimate given a score level \(x\) is

\[\mathrm{E}(\theta|x) = \frac{\int{\theta \mathrm{L}(x|\theta) f(\theta,\Sigma) d\theta}} {\int{\mathrm{L}(x|\theta) f(\theta,\Sigma) d\theta}}\]

approximated by

\[\mathrm{E}(\theta|x) = \frac{\sum{\theta \mathrm{L}(x|\theta) f(\theta,\Sigma)}} {\sum{\mathrm{L}(x|\theta) f(\theta,\Sigma)}}\]

\(\mathrm{L}(x|\theta)\): previously computed likelihood
\(f(\theta, \Sigma)\): multivariate normal density value
\(\Sigma\): the 2D correlation matrix

The summation is taken over all \(\theta\) grid points

Calibrated projection

Given a \(k\)-dimensional vector \(\theta\),

the \(k\)-dimensional EAP covariance given a score level \(x\) is

\[\mathrm{C}(\theta|x) = \frac{\int{\mathrm{C}(\theta) \mathrm{L}(x|\theta) f(\theta,\Sigma) d\theta}} {\int{\mathrm{L}(x|\theta) f(\theta,\Sigma) d\theta}}\]

approximated by

\[\mathrm{C}(\theta|x) = \frac{\sum{\mathrm{C}(\theta) \mathrm{L}(x|\theta) f(\theta,\Sigma)}} {\sum{\mathrm{L}(x|\theta) f(\theta,\Sigma)}}\]

\(\mathrm{C}(\theta)\): variance-covariance matrix \((\theta - \theta_\mathrm{EAP})(\theta - \theta_\mathrm{EAP})'\)

The summation is taken over all \(\theta\) grid points
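
The two sums can be checked numerically with a vectorized base R sketch. It assumes a flat likelihood, so that the posterior equals the prior and the EAP mean and covariance should come out close to \(0\) and \(\Sigma\). The bivariate normal density is written out by hand here instead of calling a density function from a package; all object names are hypothetical.

```r
theta <- seq(-4.5, 4.5, .2)
grid  <- as.matrix(expand.grid(theta, theta))
sigma <- matrix(c(1, .96, .96, 1), 2, 2)

# zero-mean bivariate normal density f(theta, Sigma) at each grid point
dens_sketch <- function(x, sigma) {
  q <- rowSums((x %*% solve(sigma)) * x)
  exp(-q / 2) / (2 * pi * sqrt(det(sigma)))
}

w   <- dens_sketch(grid, sigma)
lik <- rep(1, nrow(grid))        # flat likelihood, for checking only

# normalized posterior weight at each grid point
post <- lik * w / sum(lik * w)

# EAP mean: weighted sum of grid points
eap <- colSums(grid * post)

# EAP covariance: weighted sum of outer products of deviations from the EAP
centered <- sweep(grid, 2, eap)
cov_eap  <- t(centered) %*% (centered * post)
```

Replacing lik with the likelihood of a given score level yields the EAP and covariance for that score level.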

Calibrated projection

Step 6. Aggregate into a table

# module/module_EAPtoTABLE.r
# function for converting EAP estimates into a table

EAPtoTABLE <- function(EAP, dimension, tscore) {

  eap  <- lapply(EAP, function(x) x$EAP[dimension])
  se   <- lapply(EAP, function(x) sqrt(x$COV[dimension, dimension]))
  eap  <- do.call(c, eap)
  se   <- do.call(c, se)
  if (tscore) {
    eap <- (eap*10) + 50
    se  <- se*10
  }
  x <- as.numeric(names(eap))
  o <- cbind(x, eap, se)
  o <- as.data.frame(o)

  return(o)

}

# demo/CP_demo_TABLE.r
# converts two-dimensional theta estimates into a table

o <- EAPtoTABLE(EAP, dimension = 1, tscore = TRUE)

Calibrated projection

It should be emphasized that the focus of the CP method is not the item parameters

focus is on the latent correlation and its use in producing the table

The correlation between the two factors represents the relationship between the two scales

using this information, obtaining a multidimensional EAP \(\theta\) estimate yields a \(\theta\) estimate on scale \(a\) and scale \(b\) simultaneously
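
One way to see why the latent correlation matters is a standard property of the bivariate normal prior assumed above (zero means, unit variances, correlation \(\rho\)). This is a simplification, since it treats \(\theta_a\) as known exactly, which a raw score never achieves:

\[\theta_b \,|\, \theta_a \sim \mathrm{N}\left(\rho\,\theta_a,\; 1 - \rho^2\right)\]

With \(\rho = 1\), \(\theta_b\) is fully determined by \(\theta_a\). As \(\rho\) decreases, the projected \(\theta_b\) shrinks toward \(0\) and its conditional variance \(1 - \rho^2\) grows.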

Motivation

Advantage of CP over EQP:

CP takes the correlation between constructs explicitly into account

If the latent correlation is perfect, CP and EQP should perform similarly in terms of producing T-scores

As latent correlation decreases, CP should perform better than EQP

Question: by how much?

Study objective

Equipercentile equating method vs. calibrated projection method

compare performance in producing corresponding \(\theta\) values
varied correlation between the constructs underlying each scale

Method: Item parameters

Derived from PROMIS Depression - CES-D dataset in PROsetta

includes 731 response rows and 48 items
(after removing missing data)
20 items on scale \(a\) (CES-D scale)
28 items on scale \(b\) (PROMIS Depression scale)

Method: Item parameters

1D IRT model was fitted on the dataset

graded response model for all items

Obtained 1D parameters were converted to 2D parameters

to be used in response data generation
CES-D items were loaded onto dimension 1
PROMIS items were loaded onto dimension 2

Method: Simulee & response data

1000 2D \(\theta\) values were sampled from MVN with specified correlation

Response dataset \(\mathbf{X}\) was generated from item parameters and 2D theta values
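
A minimal sketch of the sampling step, assuming zero means and unit variances and using a Cholesky factor in base R. The correlation value and object names are hypothetical, and the response-generation step is not shown.

```r
set.seed(1)

rho   <- 0.8                               # one hypothetical correlation level
sigma <- matrix(c(1, rho, rho, 1), 2, 2)

# independent standard normal draws, then impose the correlation structure
z          <- matrix(rnorm(1000 * 2), nrow = 1000, ncol = 2)
true_theta <- z %*% chol(sigma)            # 1000 x 2 matrix of 2D theta values
```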

Method: EQP

Equipercentile method was performed on \(\mathbf{X}\)

smoothing was not applied
since obtaining \(\theta\) values from equipercentile method requires item parameters, item parameters were estimated by performing free calibration on \(\mathbf{X}\)

Method: CP

Calibrated projection method was performed on \(\mathbf{X}\)

used 2D model
factor 1: 20 CES-D items
factor 2: 28 PROMIS items
latent correlation: free estimation with upper bound of .999
to avoid singular structures

Method: 1D pattern scoring

used 1D item parameters for 48 items obtained for performing EQP as basis
used PROMIS item parameters to obtain EAP estimates from the PROMIS part of \(\mathbf{X}\)
to serve as best-case reference

Performance criteria

From \(\mathbf{X}\), CES-D raw score was computed for each simulee

CES-D raw score was mapped to PROMIS \(\theta\) using the crosswalk table from CP or EQP
produced PROMIS \(\theta\) was compared to true PROMIS \(\theta\)
used RMSE
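
The criterion can be sketched as a crosswalk lookup followed by an RMSE computation. The crosswalk values, raw scores, and true \(\theta\) values below are hypothetical and only show the mechanics.

```r
rmse <- function(estimate, truth) sqrt(mean((estimate - truth)^2))

# hypothetical crosswalk: scale a raw score -> scale b theta
crosswalk <- data.frame(raw_a = 0:3, theta_b = c(-1.5, -0.5, 0.4, 1.3))

# hypothetical simulees: observed raw score on scale a, true theta on scale b
raw_a        <- c(0, 2, 3, 1, 2)
true_theta_b <- c(-1.2, 0.1, 1.6, -0.4, 0.7)

# map each raw score to theta through the crosswalk, then compare to truth
mapped_theta_b <- crosswalk$theta_b[match(raw_a, crosswalk$raw_a)]
result         <- rmse(mapped_theta_b, true_theta_b)
```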

Simulation

factor correlation was \(0.95(-0.05)0.50\)
repeated 20 trials each
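
The design above can be laid out as a grid of conditions, one row per simulation run. This is only a sketch of the bookkeeping; the object and column names are hypothetical.

```r
# correlation levels 0.95, 0.90, ..., 0.50, each crossed with 20 trials
conditions <- expand.grid(
  trial       = 1:20,
  correlation = seq(0.95, 0.50, by = -0.05)
)

n_runs <- nrow(conditions)
```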

Results

Discussion

Calibrated projection explicitly accounts for latent correlation between measured constructs

CP provides a better crosswalk table
since CP uses multidimensional modeling, CP can equate more than two scales simultaneously
technical issue: multidimensional integration is time-consuming

References

Bryant, D. U., Smith, A. K., Alexander, S. G., Vaughn, K., & Canali, K. G. (2005). Expected A Posteriori Estimation of Multiple Latent Traits: (518612013-445). American Psychological Association. https://doi.org/10.1037/e518612013-445

Haebara, T. (1980). Equating logistic ability scales by a weighted least squares method. Japanese Psychological Research, 22(3), 144–149.

Lord, F. M., & Wingersky, M. S. (1984). Comparison of IRT True-Score and Equipercentile Observed-Score "Equatings". Applied Psychological Measurement, 8(4), 453–461. https://doi.org/10.1177/014662168400800409

Stocking, M. L., & Lord, F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7(2), 201–210.

Thissen, D., Varni, J. W., Stucky, B. D., Liu, Y., Irwin, D. E., & DeWalt, D. A. (2011). Using the PedsQL™ 3.0 asthma module to obtain scores comparable with those of the PROMIS pediatric asthma impact scale (PAIS). Quality of Life Research, 20(9), 1497–1505. https://doi.org/10.1007/s11136-011-9874-y