Comparison of calibrated projection score equating and equipercentile score equating

Sangdon Lim

Introduction

Scale linking vs Score equating

  • a scale is an instrument composed of multiple items
  • a score is a value that quantifies responses to an instrument
    • raw score
    • T-score
    • \(\theta\)

Scale linking

A scale link is achieved when

  • a set of item parameters \(\xi\)
  • is on the same metric as
  • another set of anchor item parameters \(\xi'\)


Notation

\(\xi\) : the set of item parameters (e.g. difficulty, …)

Scale linking

Suppose we have a response dataset \(\mathbf{X}\) with

  • items \(a_1 ... a_{10}\) from scale \(a\)


Through item parameter calibration on \(\mathbf{X}\), we can obtain

  • item parameters \(\xi_a\) for items \(a_1 ... a_{10}\)

Scale linking

Suppose we have another dataset \(\mathbf{X'}\) with

  • items \(a_1 ... a_{10}\) from scale \(a\)
  • items \(b_1 ... b_{10}\) from scale \(b\)

Let the item parameters from this dataset be denoted by \(\xi'_a\) and \(\xi'_b\)

Without conversion, \(\xi'_a\) is on a different metric from \(\xi_a\)

  • because \(\mathbf{X'}\) comes from a different ability range than \(\mathbf{X}\), and a free calibration defines its metric relative to the sample it is fitted on

Scale linking

A scale link is achieved when \(\xi'_a\) is on the same metric as \(\xi_a\)

  • this makes \(\xi'_a\) comparable to \(\xi_a\)
  • this makes \(\xi'_b\) from \(\mathbf{X'}\) interpretable on the (unobtainable) metric of \(\xi_b\) as if it were from \(\mathbf{X}\)

Scale linking

Scale linking methods include

Linear transformation

  • find a function \(f: \xi' \rightarrow \xi\) that converts \(\xi'_a\) to the metric of \(\xi_a\)
  • once \(f\) is determined, \(f\) can be used to convert \(\xi'_b\) to the metric of \(\xi_b\)
  • Haebara method (1980)
  • Stocking-Lord method (1983)
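
The linear-transformation idea can be sketched on a toy example. This is a minimal sketch, assuming dichotomous 2PL anchor items for brevity (the slides use graded items). It searches for the slope \(A\) and intercept \(B\) that minimize the squared difference between test characteristic curves, which is the Stocking-Lord criterion. The names `p2pl`, `tcc`, `sl_loss`, `to_old_metric` and all parameter values are hypothetical.

```r
# Hypothetical 2PL anchor item parameters on the target metric (xi_a)
a_old <- c(1.0, 1.5, 0.8, 1.2, 2.0)
b_old <- c(-1.0, -0.3, 0.0, 0.6, 1.2)

# The same items on another metric (xi'_a), constructed here from a
# known transformation theta_old = A * theta_new + B
A_true <- 1.2
B_true <- 0.5
a_new  <- a_old * A_true
b_new  <- (b_old - B_true) / A_true

# 2PL item response function
p2pl <- function(theta, a, b) 1 / (1 + exp(-a * (theta - b)))

# Test characteristic curve: expected raw score at each theta
tcc <- function(theta, a, b) {
  rowSums(sapply(seq_along(a), function(i) p2pl(theta, a[i], b[i])))
}

# Stocking-Lord criterion: squared TCC difference after converting
# the new-metric parameters with slope A and intercept B
sl_loss <- function(par, theta) {
  A <- par[1]
  B <- par[2]
  sum((tcc(theta, a_old, b_old) - tcc(theta, a_new / A, A * b_new + B))^2)
}

theta <- seq(-4, 4, 0.2)
fit   <- optim(c(1, 0), sl_loss, theta = theta,
               method = "L-BFGS-B", lower = c(0.1, -5), upper = c(5, 5))
A_hat <- fit$par[1]
B_hat <- fit$par[2]

# f: converts any parameters calibrated on the new metric (e.g. xi'_b)
to_old_metric <- function(a, b) list(a = a / A_hat, b = A_hat * b + B_hat)
```

Because the toy data are built from an exact transformation, the recovered `A_hat` and `B_hat` should land close to `A_true` and `B_true`; with real calibrations the anchor parameters contain estimation error and the criterion will not reach zero.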

Scale linking

Scale linking methods include

Fixed-parameter calibration

  • The item calibration phase on \(\mathbf{X'}\) is modified
  • \(\xi' = \{\xi'_a, \xi'_b\}\) is estimated subject to the constraint \(\xi'_a = \xi_a\)
  • \(\xi'\) is thus obtained on the same metric as \(\xi_a\)
  • This achieves the scale link
  • Further metric conversions should not be done
  • Because altering the metric again breaks the link

Score equating

Score equating is achieved when

  • a set of score levels for one scale
  • is mapped to corresponding score levels on another scale

Score equating

Suppose that we have

  • scale \(a\) with scores \(x_a\) ranging in \([0, 10]\)
  • scale \(b\) with scores \(x_b\) ranging in \([0, 100]\)


Scores \(x_a\) and \(x_b\) are on different metrics

Score equating

Given a score \(x_a = 5\)

  • one may define a corresponding score level \(\hat{x}_b\) on instrument \(b\)
  • so that \(\hat{x}_b\) can be compared to \(x_b\) in the same metric


Score equating is the process of determining

  • the map \(f: x_a \rightarrow \hat{x}_b\) for all \(x_a\) levels

Score equating

Equipercentile equating is a method of score equating

  • scores of scale \(a\) are mapped onto the percentile \(p\) metric
  • and then onto the metric of scale \(b\)
  • so that \(x_a \rightarrow p \rightarrow \hat{x}_b\)

The process does not involve item parameters

  • only involves observed scores \(x_a\) and \(x_b\)
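
The chain \(x_a \rightarrow p \rightarrow \hat{x}_b\) can be sketched in a few lines of base R. This is a simplified illustration, not a full implementation: it uses plain cumulative proportions with no smoothing and no interpolation between score levels, and the function name `equipercentile` and the simulated scores are hypothetical.

```r
# Simplified equipercentile equating on observed scores
equipercentile <- function(x_a, x_b) {
  levels_a <- sort(unique(x_a))
  levels_b <- sort(unique(x_b))

  p     <- ecdf(x_a)(levels_a)   # x_a -> p (cumulative proportion)
  cdf_b <- ecdf(x_b)(levels_b)   # cumulative proportion at each x_b level

  # p -> x_b_hat: lowest x_b level whose cumulative proportion reaches p
  idx <- sapply(p, function(pp) which(cdf_b >= pp - 1e-12)[1])

  data.frame(x_a = levels_a, p = p, x_b_hat = levels_b[idx])
}

set.seed(1)
x_a <- rbinom(500, size = 10,  prob = 0.5)   # hypothetical scores in [0, 10]
x_b <- rbinom(500, size = 100, prob = 0.5)   # hypothetical scores in [0, 100]
crosswalk <- equipercentile(x_a, x_b)
```

Only the two observed score vectors enter the computation; no item parameters are needed at any point.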

Score equating

Equipercentile equating may be modified to get standardized scores

  • scores of scale \(a\) are mapped onto the percentile \(p\) metric
  • and then onto the \(\theta\) metric
  • so that \(x_a \rightarrow p \rightarrow \theta\)

To accomplish this,

  • scale \(b\) scores are first mapped onto the \(\theta\) metric
  • using a presupplied set of item parameters for scale \(b\)
  • the item parameters may be obtained from free calibration or converted with scale linking as needed

Score equating

The end product of score equating is a crosswalk table

Scale A (raw)   Scale B (raw)   Scale B (theta)   Scale B (T-score)
            0             5.0            -0.781              42.189
            1            14.1            -0.416              45.835
            2            23.2            -0.108              48.918
            3            32.3             0.159              51.589
            4            41.4             0.394              53.944
            5            50.5             0.605              56.052
            6            59.6             0.796              57.958
            7            68.7             0.970              59.698
            8            77.8             1.130              61.299
            9            86.9             1.278              62.781
           10            96.0             1.416              64.161
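
Once a crosswalk table exists, applying it is a lookup. The sketch below copies the Scale A raw scores and Scale B \(\theta\) values from the table above and converts \(\theta\) to T-scores with \(T = 10\theta + 50\); the object names `crosswalk`, `x_a`, and `theta_hat` are hypothetical.

```r
# Crosswalk values copied from the table above
crosswalk <- data.frame(
  raw_a   = 0:10,
  theta_b = c(-0.781, -0.416, -0.108, 0.159, 0.394, 0.605,
              0.796, 0.970, 1.130, 1.278, 1.416)
)

# T-score metric: mean 50, SD 10
crosswalk$tscore_b <- crosswalk$theta_b * 10 + 50

# Look up the scale B theta for a few observed scale A raw scores
x_a       <- c(5, 0, 10, 3)
theta_hat <- crosswalk$theta_b[match(x_a, crosswalk$raw_a)]
```

The recomputed T-scores agree with the tabled ones up to rounding of the \(\theta\) column.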

Summary

Scale linking is about the metrics of item parameters

Score equating is about the metrics of observed scores

Calibrated projection

Calibrated projection [CP; Thissen et al. (2011)] is a procedure for mapping the score levels between two scales

  • maps each score level in scale \(a\) onto a corresponding \(\theta\) in scale \(b\)
  • Lord-Wingersky recursion (1984) is the standard method


  • the objective is related to the metrics of scores
  • not to the metrics of item parameters
  • thus CP can be considered as a score equating method

Calibrated projection

Suppose that we have a response dataset \(\mathbf{X}\) with

  • scale \(a\) with items \(a_1 ... a_{10}\)
  • scale \(b\) with items \(b_1 ... b_{10}\)

A 2-factor IRT model is fitted onto the response dataset \(\mathbf{X}\)

Calibrated projection

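library(mirt)

# F1: freely estimated for scale a items (columns 1-10)
# F2: freely estimated for scale b items (columns 11-20)
# COV = F1*F2 frees the covariance between the two factors
model <- mirt.model("
  F1  =  1-10
  F2  = 11-20
  COV = F1*F2
")

cp_calib <- mirt(X, model, itemtype = "graded")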

First discrimination parameter

  • is freely estimated for scale \(a\) items; fixed at \(0\) for other scales

Second discrimination parameter

  • is freely estimated for scale \(b\) items; fixed at \(0\) for other scales

Other item parameters are freely estimated as usual

The correlation between the factors is freely estimated

Calibrated projection

Calibrated model can be used to produce a crosswalk table

  • Lord-Wingersky recursion (1984) is the standard method
  • requires multidimensional extension to apply to CP


In Thissen et al. (2011), the authors presented a table

  • raw scores in PedsQL scale are mapped onto T-scores in PAIS scale

Calibrated projection

Table 4 Thissen et al. (2011)

Calibrated projection

Table 4 (reproduced)

Calibrated projection

Step 1. Read in item parameters


# demo/CP_demo_read.r
# read origin tables and create item objects

d2 <- read.csv(file.path(root, "data/table2.csv"))
d3 <- read.csv(file.path(root, "data/table3.csv"))
d  <- cbind(
  d2[order(d2[, 1]), ],
  d3[order(d3[, 1]), -1]
)

ipar <- d[, -c(1, 14)]
colnames(ipar)[9:12] <- paste0("d", 1:4)
ipar <- ipar[, c(1:2, 9:12)]

# generate.mirt_object() is provided by the mirtCAT package
itempool <- mirtCAT::generate.mirt_object(ipar, itemtype = "graded")
  • Tables 2, 3 from Thissen et al. (2011)

Calibrated projection

Step 2. Initialize theta grid for multidimensional integration

# module/module_grid.r
# creates quadrature points over two-dimensional space

nd         <- 2
theta      <- seq(-4.5, 4.5, .2)
theta_grid <- as.matrix(expand.grid(theta, theta))
n_grid     <- dim(theta_grid)[1]
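  • Used a grid from -4.5 to 4.5 in steps of 0.2 (46 points) for each dimension, fully crossed
  • Total number of quadrature points: 46 × 46 = 2116
  • With more dimensions the fully crossed grid grows exponentially, so other integration methods should be used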
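
A quick sketch of why a fully crossed grid stops being practical as dimensions are added. It reuses the same per-dimension grid as above; the name `n_points` is hypothetical.

```r
theta    <- seq(-4.5, 4.5, .2)

# number of grid points when the grid is fully crossed over 1 to 4 dimensions
n_points <- length(theta)^(1:4)
names(n_points) <- paste0(1:4, "D")
n_points
```

The count is multiplied by 46 with every added dimension, and every score-level likelihood has to be evaluated at each point.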

Calibrated projection

Step 3. Function for getting category probability

# module/module_computeResponseProbability.r
# function for computing category response probability
# at a given theta point

computeResponseProbability <- function(
  itempool, theta, item_idx, score_level
) {

  # category probabilities for every item at every theta point
  probs     <- mirt::probtrace(itempool, Theta = theta)

  # select the column for this item and score level;
  # category labels start at 1 while score levels start at 0
  itemname  <- colnames(itempool@Data$data)[item_idx]
  use_these <- sprintf("%s.P.%s", itemname, score_level + 1)

  return(probs[, use_these])

}
  • input: item pool, 2D \(\theta\), item ID, score level on that item
  • output: the probability of that score level at each supplied \(\theta\) point (one value per row of \(\theta\))
  • necessary for multidimensional Lord-Wingersky recursion

Calibrated projection

Step 4. Lord-Wingersky recursion (multidimensional extension)

# module/module_LWrecursion.r
# function for performing Lord-Wingersky recursion
# this obtains likelihoods of each score level over quadrature points

LWrecursion <- function(itempool, use_items, theta_grid) {

  L_init <- TRUE

  for (item_idx in use_items) {

    new_max_value_of_item <- itempool@Data$K[item_idx] - 1
    new_possible_values   <- 0:new_max_value_of_item

    P <- list()
    for (v in new_possible_values) {
      P[[as.character(v)]] <-
        computeResponseProbability(itempool, theta_grid, item_idx, v)
    }

    if (L_init) {

      L <- P
      old_possible_values <- new_possible_values
      L_init <- FALSE

    } else {

      map_values <- expand.grid(old_possible_values, new_possible_values)

      map_L <- do.call(rbind, L[as.character(map_values[, 1])])
      map_P <- do.call(rbind, P[as.character(map_values[, 2])])

      map_lls <- map_L * map_P

      tmp <- aggregate(map_lls, by = list(apply(map_values, 1, sum)), sum)

      tmp_lls   <- tmp[, -1]
      tmp_value <- tmp[, 1]

      L <- list()
      for (i in 1:nrow(tmp)) {
        L[[as.character(tmp_value[i])]] <-
          tmp_lls[i, ]
      }

      old_possible_values <- tmp[, 1]

    }

  }

  return(L)

}
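
The recursion is easier to follow on a small case. The sketch below is a compact unidimensional version for dichotomous items only, written in base R without mirt; the name `lw_dichotomous`, the matrix layout (rows are quadrature points, columns are summed scores), and the item difficulties are hypothetical and differ from the list layout used by `LWrecursion`.

```r
# probs: matrix of P(score = 1), one row per quadrature point
#        and one column per item (dichotomous items assumed)
lw_dichotomous <- function(probs) {
  n_q <- nrow(probs)
  # L[q, s + 1] = likelihood of summed score s at quadrature point q;
  # before any item is added, only score 0 is possible
  L <- matrix(1, n_q, 1)
  for (i in seq_len(ncol(probs))) {
    p <- probs[, i]
    # a score of s arises from s with a 0 on the new item,
    # or from s - 1 with a 1 on the new item
    L <- cbind(L * (1 - p), 0) + cbind(0, L * p)
  }
  colnames(L) <- 0:ncol(probs)
  L
}

theta <- seq(-4.5, 4.5, .2)
b     <- c(-1, 0, 1)                                  # hypothetical difficulties
probs <- sapply(b, function(bi) plogis(theta - bi))   # Rasch-type probabilities
L     <- lw_dichotomous(probs)
```

A useful sanity check for any implementation is that the likelihoods of all score levels sum to 1 at every quadrature point.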

Calibrated projection

Step 4. Lord-Wingersky recursion (multidimensional extension)

  • input: item pool, items to use, theta grid
  • output: for each possible score level, likelihood value of obtaining the score level at each quadrature point
# demo/CP_demo_LW.r
# likelihood values for 11 items in PedsQL instrument
# the test score ranges from 0-44

pedsql_items <- 18:28
L <- LWrecursion(itempool, pedsql_items, theta_grid)

Use PedsQL items

  • range of possible score levels: \([0, 44]\)
  • likelihood of obtaining score \(0\) at each of 2116 quadrature points
  • likelihood of obtaining score \(1\) at each of 2116 quadrature points
  • …
  • likelihood of obtaining score \(44\) at each of 2116 quadrature points

Calibrated projection

Step 5. Compute EAP estimates from likelihoods

  • input: likelihoods, theta grid, latent correlation
  • output: EAP estimates and covariance matrix for each score level
# module/module_LtoEAP.r
# converts likelihoods obtained from Lord-Wingersky recursion
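# into two-dimensional EAP estimates
# note: dmvn() must be supplied by a loaded package; mvnfast is one
# package with a dmvn(X, mu, sigma) function (an assumption, the source
# does not state which package is used)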

LtoEAP <- function(L, theta_grid, sigma) {

  nd     <- dim(theta_grid)[2]
  n_grid <- dim(theta_grid)[1]  # defined locally instead of relying on a global
  tmp    <- list()

  for (i in 1:length(L)) {

    num <- matrix(0, 1, nd)
    den <- 0

    for (j in 1:n_grid) {
      term_T <- theta_grid[j, , drop = FALSE]
      term_L <- as.numeric(L[[i]][j])
      term_W <- dmvn(term_T, rep(0, nd), sigma)
      num <- num + (term_T * term_L * term_W)
      den <- den + (term_L * term_W)
    }

    th <- num / den

    num <- matrix(0, nd, nd)
    den <- 0

    for (j in 1:n_grid) {
      term_T <- theta_grid[j, , drop = FALSE]
      term_C <- (term_T - th)
      term_V <- t(term_C) %*% term_C
      term_L <- as.numeric(L[[i]][j])
      term_W <- dmvn(term_T, rep(0, nd), sigma)
      num <- num + (term_V * term_L * term_W)
      den <- den + (term_L * term_W)
    }

    COV <- num / den

    tmp[[names(L)[i]]]$EAP <- th
    tmp[[names(L)[i]]]$COV <- COV

  }

  return(tmp)

}

Calibrated projection

Step 5. Compute EAP estimates from likelihoods

  • estimated correlation \(.96\) is used here
# demo/CP_demo_EAP.r
# converts likelihood values of PedsQL instrument
# into two-dimensional theta estimates

est_cor     <- .96
sigma       <- diag(nd)
sigma[2, 1] <- est_cor
sigma[1, 2] <- est_cor

EAP <- LtoEAP(L, theta_grid, sigma)

Calibrated projection

Step 5. Compute EAP estimates from likelihoods

  • the equations were adapted from Bryant et al. (2005)

Calibrated projection

Given a \(k\)-dimensional vector \(\theta\),

the \(k\)-dimensional EAP estimate given a score level \(x\) is

\[\mathrm{E}(\theta|x) = \frac{\int{\theta \mathrm{L}(x|\theta) f(\theta,\Sigma) d\theta}} {\int{\mathrm{L}(x|\theta) f(\theta,\Sigma) d\theta}}\]

approximated by

\[\mathrm{E}(\theta|x) = \frac{\sum{\theta \mathrm{L}(x|\theta) f(\theta,\Sigma)}} {\sum{\mathrm{L}(x|\theta) f(\theta,\Sigma)}}\]

  • \(\mathrm{L}(x|\theta)\): previously computed likelihood
  • \(f(\theta, \Sigma)\): multivariate normal density value
  • \(\Sigma\): the 2D correlation matrix

The summation is taken over all points of the \(\theta\) grid

Calibrated projection

Given a \(k\)-dimensional vector \(\theta\),

the \(k\)-dimensional EAP covariance given a score level \(x\) is

\[\mathrm{C}(\theta|x) = \frac{\int{\mathrm{C}(\theta) \mathrm{L}(x|\theta) f(\theta,\Sigma) d\theta}} {\int{\mathrm{L}(x|\theta) f(\theta,\Sigma) d\theta}}\]

approximated by

\[\mathrm{C}(\theta|x) = \frac{\sum{\mathrm{C}(\theta) \mathrm{L}(x|\theta) f(\theta,\Sigma)}} {\sum{\mathrm{L}(x|\theta) f(\theta,\Sigma)}}\]

  • \(\mathrm{C}(\theta)\): variance-covariance matrix \((\theta - \theta_\mathrm{EAP})(\theta - \theta_\mathrm{EAP})'\)

The summation is taken over all points of the \(\theta\) grid
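
The two summations can also be written in vectorized form. This is a minimal base-R sketch of the same formulas, not the `LtoEAP` module itself: the bivariate normal density is written out by hand so that no extra package is needed, and the names `dmvn2` and `eap_from_likelihood` are hypothetical.

```r
# Bivariate normal density with zero mean, evaluated at each row of theta
dmvn2 <- function(theta, sigma) {
  q <- rowSums((theta %*% solve(sigma)) * theta)   # quadratic form per row
  exp(-q / 2) / (2 * pi * sqrt(det(sigma)))
}

# lik: likelihood of one score level at each grid point (vector)
eap_from_likelihood <- function(lik, theta_grid, sigma) {
  w   <- lik * dmvn2(theta_grid, sigma)   # L(x | theta) * f(theta, Sigma)
  w   <- w / sum(w)                       # divide by the denominator
  eap <- colSums(theta_grid * w)          # E(theta | x)
  dev <- sweep(theta_grid, 2, eap)        # theta - theta_EAP
  cov <- t(dev) %*% (dev * w)             # C(theta | x)
  list(EAP = eap, COV = cov)
}

theta      <- seq(-4.5, 4.5, .2)
theta_grid <- as.matrix(expand.grid(theta, theta))
sigma      <- matrix(c(1, .5, .5, 1), 2, 2)   # hypothetical correlation of .5

# A flat likelihood carries no information, so the posterior equals the prior
out <- eap_from_likelihood(rep(1, nrow(theta_grid)), theta_grid, sigma)
```

With a flat likelihood the EAP should be near zero and the covariance near \(\Sigma\), which gives a simple check that the grid and the weights are set up correctly.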

Calibrated projection

Step 6. Aggregate into a table

# module/module_EAPtoTABLE.r
# function for converting EAP estimates into a table

EAPtoTABLE <- function(EAP, dimension, tscore) {

  eap  <- lapply(EAP, function(x) x$EAP[dimension])
  se   <- lapply(EAP, function(x) sqrt(x$COV[dimension, dimension]))
  eap  <- do.call(c, eap)
  se   <- do.call(c, se)
  if (tscore) {
    eap <- (eap*10) + 50
    se  <- se*10
  }
  x <- as.numeric(names(eap))
  o <- cbind(x, eap, se)
  o <- as.data.frame(o)

  return(o)

}
# demo/CP_demo_TABLE.r
# converts two-dimensional theta estimates into a table

o <- EAPtoTABLE(EAP, dimension = 1, tscore = TRUE)

Calibrated projection

It should be emphasized that the focus of the CP method is not the item parameters

  • focus is on the latent correlation and its use in producing the table

The correlation between the two factors represents the relationship between the two scales

  • using this information, obtaining a multidimensional EAP \(\theta\) estimate yields a \(\theta\) estimate on scale \(a\) and scale \(b\) simultaneously

Motivation

Advantage of CP over EQP:

  • CP takes the correlation between constructs explicitly into account


If the latent correlation is perfect, CP and EQP should perform similarly in terms of producing T-scores

As latent correlation decreases, CP should perform better than EQP


Question: by how much?

Study objective

Equipercentile equating method vs. calibrated projection method

  • compare performance in producing corresponding \(\theta\) values
  • varied correlation between the constructs underlying each scale

Method: Item parameters

Derived from PROMIS Depression - CES-D dataset in PROsetta

  • includes 731 response rows and 48 items
  • (after removing missing data)
  • 20 items on scale \(a\) (CES-D scale)
  • 28 items on scale \(b\) (PROMIS Depression scale)

Method: Item parameters

1D IRT model was fitted on the dataset

  • graded response model for all items


Obtained 1D parameters were converted to 2D parameters

  • to be used in response data generation

  • CES-D items were loaded onto dimension 1

  • PROMIS items were loaded onto dimension 2

Method: Simulee & response data

1000 2D \(\theta\) values were sampled from MVN with specified correlation
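
A minimal sketch of the sampling step in base R, assuming standard normal marginals. The correlation value, the seed, and the object names are hypothetical; the slides do not state which function was used.

```r
set.seed(1)
n     <- 1000
rho   <- 0.8                              # hypothetical target correlation
sigma <- matrix(c(1, rho, rho, 1), 2, 2)

# z has independent N(0, 1) columns; multiplying by the Cholesky factor
# of sigma induces the target correlation between the two columns
z     <- matrix(rnorm(n * 2), n, 2)
theta <- z %*% chol(sigma)

cor(theta)[1, 2]                          # sample correlation, close to rho
```

Column 1 of `theta` would then play the role of the CES-D dimension and column 2 the PROMIS dimension when generating responses.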


Response dataset \(\mathbf{X}\) was generated from item parameters and 2D theta values

Method: EQP

Equipercentile method was performed on \(\mathbf{X}\)

  • smoothing was not applied
  • since obtaining \(\theta\) values from equipercentile method requires item parameters, item parameters were estimated by performing free calibration on \(\mathbf{X}\)

Method: CP

Calibrated projection method was performed on \(\mathbf{X}\)

  • used 2D model
  • factor 1: 20 CES-D items
  • factor 2: 28 PROMIS items
  • latent correlation: free estimation with upper bound of .999
  • to avoid singular structures

Method: 1D pattern scoring

  • used the 1D item parameters for all 48 items, obtained from the free calibration performed for EQP, as the basis
  • used the PROMIS item parameters to obtain EAP estimates from the PROMIS part of \(\mathbf{X}\)
  • this serves as a best-case reference

Performance criteria

From \(\mathbf{X}\), CES-D raw score was computed for each simulee

  • CES-D raw score was mapped to PROMIS \(\theta\) using the crosswalk table from CP or EQP
  • produced PROMIS \(\theta\) was compared to true PROMIS \(\theta\)
  • used RMSE
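
The criterion can be sketched as a lookup followed by an RMSE. The crosswalk values, raw scores, and true \(\theta\) values below are made up for illustration, and the names `rmse`, `crosswalk`, and `theta_hat` are hypothetical.

```r
# root mean squared error between mapped and true theta
rmse <- function(estimate, truth) sqrt(mean((estimate - truth)^2))

# hypothetical crosswalk (CES-D raw score -> PROMIS theta) and simulee data
crosswalk  <- data.frame(raw_a = 0:3, theta_b = c(-1.2, -0.4, 0.4, 1.2))
raw_a      <- c(0, 1, 1, 2, 3)                 # CES-D raw score per simulee
true_theta <- c(-1.0, -0.5, -0.2, 0.6, 1.0)    # true PROMIS theta per simulee

# map each raw score through the crosswalk, then compare to the truth
theta_hat <- crosswalk$theta_b[match(raw_a, crosswalk$raw_a)]
rmse(theta_hat, true_theta)
```

The same function applies unchanged to a crosswalk produced by CP or by EQP, which keeps the comparison between the two methods on equal terms.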

Simulation

  • factor correlation was varied from \(0.95\) down to \(0.50\) in steps of \(0.05\) (10 levels)
  • repeated 20 trials each

Results

Discussion

Calibrated projection explicitly accounts for latent correlation between measured constructs

  • CP provides a better crosswalk table

  • since CP uses multidimensional modeling, CP can equate more than two scales simultaneously

  • technical issue: multidimensional integration is time-consuming

References

Bryant, D. U., Smith, A. K., Alexander, S. G., Vaughn, K., & Canali, K. G. (2005). Expected a posteriori estimation of multiple latent traits. American Psychological Association. https://doi.org/10.1037/e518612013-445
Haebara, T. (1980). Equating logistic ability scales by a weighted least squares method. Japanese Psychological Research, 22(3), 144–149.
Lord, F. M., & Wingersky, M. S. (1984). Comparison of IRT true-score and equipercentile observed-score "equatings". Applied Psychological Measurement, 8(4), 453–461. https://doi.org/10.1177/014662168400800409
Stocking, M. L., & Lord, F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7(2), 201–210.
Thissen, D., Varni, J. W., Stucky, B. D., Liu, Y., Irwin, D. E., & DeWalt, D. A. (2011). Using the PedsQL™ 3.0 asthma module to obtain scores comparable with those of the PROMIS pediatric asthma impact scale (PAIS). Quality of Life Research, 20(9), 1497–1505. https://doi.org/10.1007/s11136-011-9874-y