Evaluating latent class models with conditional dependence in record linkage.

Research Article Received 16 May 2013,

Accepted 22 May 2014

Published online 17 June 2014 in Wiley Online Library

(wileyonlinelibrary.com) DOI: 10.1002/sim.6230

Joanne Daggy,a*† Huiping Xu,a Siu Hui,a,b and Shaun Grannis,b,c

Record linkage methods commonly use a traditional latent class model to classify record pairs from different sources as true matches or non-matches. This approach was first formally described by Fellegi and Sunter and assumes that the agreement in fields is independent conditional on the latent class. Consequences of violating the conditional independence assumption include bias in parameter estimates from the model. We sought to further characterize the impact of conditional dependence on the overall misclassification rate, sensitivity, and positive predictive value in the record linkage problem when the conditional independence assumption is violated. Additionally, we evaluate various methods to account for the conditional dependence. These methods include loglinear models with appropriate interaction terms identified through the correlation residual plot as well as Gaussian random effects models. The proposed models are used to link newborn screening data obtained from a health information exchange. On the basis of simulations, loglinear models with interaction terms demonstrated the best misclassification rate, although this type of model cannot accommodate other data features such as continuous measures for agreement. Results indicate that Gaussian random effects models, which can handle additional data features, perform better than assuming conditional independence and in some situations perform as well as the loglinear model with interaction terms. Copyright © 2014 John Wiley & Sons, Ltd.

Keywords: latent class; record linkage; loglinear model; random effects

a Department of Biostatistics, Indiana University School of Medicine, Indianapolis, IN 46202, U.S.A.
b Regenstrief Institute, Indianapolis, IN 46202, U.S.A.
c Department of Family Medicine, Indiana University School of Medicine, Indianapolis, IN 46202, U.S.A.
*Correspondence to: Joanne Daggy, Department of Biostatistics, Indiana University School of Medicine, Indianapolis, IN 46202, U.S.A.
† E-mail: [email protected]

Statist. Med. 2014, 33 4250–4265

J. DAGGY ET AL.

1. Introduction

As health information exchange (HIE) becomes more pervasive in the American health-care system, it is important to accurately link patients’ medical records from disparate sources. Record linkage is the process of identifying records that represent the same entity when there is no unique perfect identifier. The most commonly used approach in health informatics is to use a traditional latent class model to classify each record pair as either a ‘match’ or ‘non-match’ based on agreement between corresponding fields that represent patient traits [1]. This method was first described by Fellegi and Sunter [2] with a prior discussion on the ideas of record linkage by Newcombe et al. [3]. A key assumption of the traditional latent class model is that the agreement patterns for the multiple fields are independent conditional on the true latent class. For example, among truly matched pairs, whether two records agree on first name is assumed to be independent of whether the two records agree on last name. This assumption is often violated.

Previous work in the statistical literature on conditional dependence was primarily motivated by the diagnostic testing scenario where several binary diagnostic tests are used to determine disease status when no gold standard is available. Substantial literature in the diagnostic testing area describes the impact of the conditional independence (CI) assumption on the estimates of disease prevalence and diagnostic test accuracy when data are truly conditionally dependent [4–8]. In simulation studies, for example, Vacek [8] and Torrance-Rynard [7] demonstrated biases in these estimators for various situations.

The record linkage problem is similar to the diagnostic testing problem in that there are often several binary fields of agreement status between two record pairs and record pairs are assumed to belong to a true match or non-match class. Additionally, a gold standard may not be available because of the expense of manually reviewing record pairs or because of the error associated with manual review [9]. However, the primary goal of record linkage is to maximize the ability of a model to correctly identify true record pairs, not to estimate the individual test sensitivities and specificities, as is the case in diagnostic testing. Although simulation studies in diagnostic testing exist, which examine the impact of conditional dependence on the estimates of individual test accuracies [7, 8], studies that evaluate the impact of conditional dependence on overall classification accuracy are still lacking. There has been only one such study of limited scope by Tromp et al. in record linkage [10].

In diagnostic testing, many methods to address the lack of CI in latent class models have been proposed [11–14]. These include but are not limited to loglinear models [11, 14], Gaussian random effects (GRE) models [12], and Probit latent class models [13, 14]. Additionally, several approaches have been proposed to identify important conditional dependencies [11, 12, 15]. The most commonly used method to incorporate conditional dependence in record linkage is through the loglinear model framework with interaction terms described by both Winkler and Thibaudeau among others [16–20], with a stepwise model-building strategy proposed recently [21]. Other methods include incorporating conditional dependence between two fields by combining them into one field with four nominal levels [10] or fitting a saturated model [22].

We extend the current research in several ways. First, we introduce the GRE model [12] to the record linkage problem. The use of GRE models has not been explored in the record linkage setting. Random effects models are promising because they can handle additional features of the data, such as continuous measures of agreement rather than binary indicators of agreement.
For example, rather than exact match between two strings, they may be compared using an edit distance proposed by Levenshtein [23] or Jaro-Winkler [24, 25]. These measures take on a continuous value between 0 and 1 and have been used to adjust the decision rule after the latent class model has been fit [26]. Random effects models also offer an easy way to incorporate covariates. Second, we compare the GRE model matching algorithm to the loglinear model matching algorithm. Lastly, we evaluate the impact of conditional dependence on overall misclassification rate, sensitivity, and positive predictive value (PPV) in the record linkage setting through simulation under different scenarios. We implement the statistical approaches in standard software, which uses a quasi-Newton approach for estimating random effects models and loglinear models. This differs from most of the record linkage literature [18, 27], which describes estimation of loglinear models using the expectation–maximization algorithm [28] or its variants. In Section 2, we review methods to account for conditional dependence in latent class models. In Section 3 we apply the different models discussed in Section 2 to our motivating example linking newborn screening (NBS) data to a HIE. Section 4 is devoted to simulation studies that examine how the assumption of CI can impact the misclassification rate and prevalence estimates for our motivating example. Additionally, we compare the overall accuracy of the different proposed models described in Section 2 in terms of the misclassification rate, sensitivity, and PPV under different scenarios. We conclude with a brief discussion on limitations and future work in Section 5.

2. Methods to account for conditional dependence in latent class models



Consider two data sets A and B, each containing records with K fields in common. Records are compared by examining the agreement of each field. Thus, for each record pair, an agreement vector is observed Yi = (yi1 , yi2 , … , yiK ), where yik is the binary agreement status for field k (k = 1, 2, … , K) with 1 indicating agreement and 0 indicating disagreement. As we are comparing n = |𝐀|×|𝐁| record pairs, for moderately sized data sets and larger, it can become computationally infeasible to examine all potential pairs, and a large proportion of these potential pairs will be non-matches. To limit the search space for true matches, a subset of record pairs is commonly created using a ‘block’, and the model is run with the blocked data. For example, we may only evaluate record pairs as possible matches if they match on gender, month, and day of birth. These are the blocking fields; thus, other matching fields in the data set are then used to help distinguish between matched and non-matched records. Records not matching on the blocking fields are excluded as potential matches in the analysis. Multiple blocks are used to determine all the matches in two data sets. Complete details are available in the original F-S paper [2].
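The blocking step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the record layout, the field names (`gender`, `birth_month`, `birth_day`, `last_name`, `zip`), and the choice to count a missing value as a disagreement are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical field names; the text blocks on gender, month, and day of birth.
BLOCKING_FIELDS = ("gender", "birth_month", "birth_day")


def block_key(record):
    """Tuple of blocking-field values used to group records."""
    return tuple(record[f] for f in BLOCKING_FIELDS)


def candidate_pairs(file_a, file_b):
    """Yield only the pairs from A x B that agree on every blocking field."""
    index = defaultdict(list)
    for rec_b in file_b:
        index[block_key(rec_b)].append(rec_b)
    for rec_a in file_a:
        for rec_b in index.get(block_key(rec_a), []):
            yield rec_a, rec_b


def agreement_vector(rec_a, rec_b, fields):
    """Binary agreement per matching field (assumption: missing = disagreement)."""
    return tuple(
        int(rec_a[f] is not None and rec_b[f] is not None and rec_a[f] == rec_b[f])
        for f in fields
    )


file_a = [
    {"gender": "F", "birth_month": 7, "birth_day": 4, "last_name": "SMITH", "zip": "46202"},
    {"gender": "M", "birth_month": 7, "birth_day": 4, "last_name": "JONES", "zip": "46201"},
]
file_b = [
    {"gender": "F", "birth_month": 7, "birth_day": 4, "last_name": "SMITH", "zip": None},
    {"gender": "F", "birth_month": 8, "birth_day": 1, "last_name": "SMITH", "zip": "46202"},
]

pairs = list(candidate_pairs(file_a, file_b))
vectors = [agreement_vector(a, b, ("last_name", "zip")) for a, b in pairs]
```

Indexing one file by its blocking key avoids enumerating all |A|×|B| pairs; only pairs inside the same block are ever compared.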


2.1. Conditional independence model

For the classical F-S model, each record pair is assumed to belong to one of two latent classes, either a true matched record pair or a true non-matched record pair. A key assumption of the F-S model is that the agreement patterns are independent across fields conditional on the true match status. Under this assumption, the probability of observing the $Y_i$ agreement pattern given that the record pair is a true matched record pair is

$$m(Y_i) = P(Y_i \mid M_i = 1) = \prod_{k=1}^{K} m_k^{y_{ik}} (1 - m_k)^{(1 - y_{ik})}$$

where $M_i \in \{0, 1\}$ represents the true match status of the $i$th record pair and $m_k = P(y_{ik} = 1 \mid M_i = 1)$. The probability of observing the $Y_i$ agreement pattern given that the record pair is a true unmatched record pair is

$$u(Y_i) = P(Y_i \mid M_i = 0) = \prod_{k=1}^{K} u_k^{y_{ik}} (1 - u_k)^{(1 - y_{ik})}$$

where $u_k = P(y_{ik} = 1 \mid M_i = 0)$. The probability of belonging to the true matched record class, $\tau = P(M = 1)$, is termed the match prevalence, and the parameters to estimate are $\theta = \{m_1, m_2, \ldots, m_K, u_1, u_2, \ldots, u_K, \tau\}$. The distribution of the agreement pattern $P(Y_i)$ is a finite mixture distribution

$$P(Y_i) = \tau P(Y_i \mid M_i = 1) + (1 - \tau) P(Y_i \mid M_i = 0). \tag{1}$$
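As a small numerical illustration of the mixture in (1), the sketch below enumerates every agreement pattern for K = 3 fields and evaluates its probability under conditional independence. The m, u, and 𝜏 values are made up for the example and are not estimates from this study.

```python
from itertools import product


def pattern_prob(pattern, field_probs):
    """P(Y | class) under conditional independence: a product over fields."""
    p = 1.0
    for y_k, q_k in zip(pattern, field_probs):
        p *= q_k if y_k == 1 else (1.0 - q_k)
    return p


def mixture_probs(m, u, tau):
    """Equation (1) evaluated for every one of the 2^K agreement patterns."""
    return {
        y: tau * pattern_prob(y, m) + (1.0 - tau) * pattern_prob(y, u)
        for y in product((0, 1), repeat=len(m))
    }


# Hypothetical parameter values, chosen only for illustration.
m = [0.9, 0.8, 0.7]
u = [0.05, 0.10, 0.02]
tau = 0.1

probs = mixture_probs(m, u, tau)
```

Because each class-conditional distribution sums to one over the 2^K patterns, the mixture probabilities do as well, which is a convenient sanity check on any implementation.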

With $K$ fields, there are $D = 2^K$ different agreement patterns; thus, the log-likelihood $\sum_{i=1}^{n} \log[P(Y_i)]$ in terms of the record pair can be equivalently written in terms of the agreement pattern as

$$l(\theta \mid Y) = \sum_{d=1}^{D} f_d \log[P(Y_d)] \tag{2}$$

where $P(Y_d)$ is the probability of being in the $d$th agreement pattern and $f_d$ is the observed frequency for that vector pattern. This is the log-likelihood to be maximized. The true match status is considered missing data, and the expectation–maximization algorithm is commonly used to find the maximum likelihood parameter estimates of the statistical model [28]. Once the parameter estimates are obtained, records may be classified either by setting acceptable type I and type II error thresholds [2], using estimated match prevalence [29], or using the posterior match probability [20]. Classification methods that incorporate clerically reviewed data have also been proposed [30].

2.2. Latent class loglinear model

The classical Fellegi–Sunter formulation of the CI model may be reparameterized in terms of a mixture of loglinear models [31]. Let $\mu_{Y_d, M_d}$ denote the expected number of record pairs with agreement pattern $Y_d$ and match status $M_d$. To simplify, we suppress the $d$ indices; thus, $\mu_{Y,M}$ may be expressed as

$$\log \mu_{Y,M} = \lambda + \lambda_M M + \sum_{k=1}^{K} \lambda_k y_k + \sum_{k=1}^{K} \lambda_{Mk} M y_k. \tag{3}$$

The parameters to be estimated in the loglinear formulation are the $\lambda$ values. Under this formulation, the matching prevalence $\tau$ and the $m$ and $u$ parameters for each field are as follows:

$$\tau = \frac{\sum_{d=1}^{D} \mu_{Y_d, M_d = 1}}{\sum_{d=1}^{D} \left( \mu_{Y_d, M_d = 1} + \mu_{Y_d, M_d = 0} \right)}$$

and

$$m_k = \frac{\exp(\lambda_k + \lambda_{Mk})}{1 + \exp(\lambda_k + \lambda_{Mk})}, \qquad u_k = \frac{\exp(\lambda_k)}{1 + \exp(\lambda_k)}$$
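The correspondence between the 𝜆 parameters and (𝜏, m, u) can be checked numerically. The sketch below builds the expected counts from the CI loglinear form (3) for K = 2 fields and recovers 𝜏, m_k, and u_k by summing cells; the 𝜆 values are arbitrary illustrative numbers, and the function names are hypothetical.

```python
from itertools import product
from math import exp


def expected_counts(lam0, lam_M, lam, lam_Mk):
    """Expected cell counts mu[(y, M)] under the CI loglinear model (3)."""
    K = len(lam)
    mu = {}
    for M in (0, 1):
        for y in product((0, 1), repeat=K):
            eta = lam0 + lam_M * M
            eta += sum(lam[k] * y[k] + lam_Mk[k] * M * y[k] for k in range(K))
            mu[(y, M)] = exp(eta)
    return mu


def derived_parameters(mu, K):
    """Recover tau, m_k and u_k by summing expected counts over cells."""
    total = {c: sum(v for (y, M), v in mu.items() if M == c) for c in (0, 1)}
    tau = total[1] / (total[1] + total[0])

    def agree_rate(k, c):
        return sum(v for (y, M), v in mu.items() if M == c and y[k] == 1) / total[c]

    m = [agree_rate(k, 1) for k in range(K)]
    u = [agree_rate(k, 0) for k in range(K)]
    return tau, m, u


def expit(x):
    return exp(x) / (1.0 + exp(x))


# Arbitrary illustrative lambda values (not estimates from the paper).
lam0, lam_M = 5.0, -2.0
lam, lam_Mk = [-3.0, -2.0], [5.0, 3.5]

mu = expected_counts(lam0, lam_M, lam, lam_Mk)
tau, m, u = derived_parameters(mu, K=2)
```

Summing cells gives the same m_k and u_k as the closed-form logistic expressions, which holds only while the model contains no interaction terms between fields.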

The maximum likelihood estimates of the $\lambda$ parameters for the loglinear model can be obtained by maximizing the log-likelihood function (2), where the marginal probability of observing the agreement pattern $Y_h$ is given by

$$P(Y_h) = \frac{\mu_{Y_h, M_h = 1} + \mu_{Y_h, M_h = 0}}{\sum_{d=1}^{D} \left( \mu_{Y_d, M_d = 1} + \mu_{Y_d, M_d = 0} \right)}$$

To incorporate dependence in the loglinear model setting, we simply add the appropriate interaction terms to the model. One or two interaction terms may be added for each pairwise dependence depending on whether dependence is expected to be within one or both classes. For example, to add pairwise dependence between fields $j$ and $l$ within each latent class, the model would include two more terms, $\lambda_{jl}$ and $\lambda_{Mjl}$:

$$\log \mu_{Y,M} = \lambda + \lambda_M M + \sum_{k=1}^{K} \lambda_k y_k + \sum_{k=1}^{K} \lambda_{Mk} M y_k + \lambda_{jl} y_j y_l + \lambda_{Mjl} M y_j y_l \tag{4}$$

Several tools have been proposed to detect conditional dependence among the observed variables in latent class models [11, 12, 15]. We take the approach described by Qu et al. as the graphical method is simple and informative. Specifically, we calculate the pairwise correlation based on the observed and expected two-way cross-classification tables. The correlation residuals are computed as the difference between the observed correlation and the expected correlation under the model. As described by Qu et al. [12], the correlation residuals are computed overall, not within latent classes; thus, the correlation between fields $j$ and $l$ may be computed by estimating $p_{jl} = P(Y_j = 1, Y_l = 1)$, $p_j = P(Y_j = 1)$, and $p_l = P(Y_l = 1)$. Thus, the pairwise correlation is given by

$$\mathrm{corr}_{j,l} = \frac{p_{jl} - p_j p_l}{\sqrt{p_j (1 - p_j)\, p_l (1 - p_l)}}. \tag{5}$$
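A correlation residual for one pair of fields can be computed directly from pattern-level tables, as in the sketch below. It assumes the observed frequencies and the model-expected cell counts are both stored as dictionaries keyed by agreement pattern; the toy counts are invented for illustration.

```python
from math import sqrt


def pairwise_corr(cells, j, l):
    """Equation (5) from a table of cell weights keyed by agreement pattern."""
    total = sum(cells.values())
    p_j = sum(w for y, w in cells.items() if y[j] == 1) / total
    p_l = sum(w for y, w in cells.items() if y[l] == 1) / total
    p_jl = sum(w for y, w in cells.items() if y[j] == 1 and y[l] == 1) / total
    return (p_jl - p_j * p_l) / sqrt(p_j * (1 - p_j) * p_l * (1 - p_l))


def correlation_residual(observed, expected, j, l):
    """Observed correlation minus the correlation implied by the fitted model."""
    return pairwise_corr(observed, j, l) - pairwise_corr(expected, j, l)


# Toy two-field tables: the observed counts show strong association,
# while the hypothetical fitted model implies none.
observed = {(0, 0): 40, (0, 1): 10, (1, 0): 10, (1, 1): 40}
expected = {(0, 0): 25, (0, 1): 25, (1, 0): 25, (1, 1): 25}

residual = correlation_residual(observed, expected, 0, 1)
```

Looping this over all field pairs (j, l) with j > l gives the values that would be plotted in a correlation residual plot.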



For the loglinear model, we simply use the expected cell count for each vector pattern to estimate the pairwise correlations. A correlation residual that is far from zero would imply conditional dependence. Bootstrap confidence intervals may also be computed but are relatively computationally intensive and not necessary as we are using the plots as a diagnostic tool to aid in choosing important pairwise conditional dependencies. To build the loglinear model in practice, first, the CI model is fit, and observed correlations between fields are computed within each class. Examination of the observed correlations aids in determining whether correlation is present in both latent classes or only one latent class. The correlation residual plot is then examined and appropriate interaction term(s) are added to the model to account for the pairwise conditional dependency found by the highest pairwise correlation residual. This is an iterative model building process that continues until the fit of the model (measured through BIC or deviance) shows little improvement or correlation residual plot shows that only small residuals are present. Although we use model fit to help identify the best model as we do not have a gold standard, one must be cautious because this may not correlate to the best model for record linkage [20]. The loglinear model may be fit in standard statistical software such as sas (Cary, NC) through PROC NLMIXED [32], which implements the quasi-Newton algorithm and provides standard error estimates for parameter estimates.


2.3. Random effects model

We also evaluate the GRE model [12, 33]. This latent class model was first introduced by Qu, Tan, and Kutner to model the conditional dependence among multiple diagnostic tests [12]. The authors demonstrate the usefulness of this model by applying it to three examples taken from biometric literature where determining individual test sensitivities and specificities was of most interest. This model has additionally been used in other biomedical applications [33] but not yet in record linkage. We evaluate this GRE model in the record linkage scenario. As this model includes random effect terms on the individual level, equations are in terms of the $i$th record pair rather than the $d$th vector pattern. In the random effects model, the probability of agreement on field $k$ for the $i$th record pair depends on the unobserved latent class, $M_i$, and also on a continuous latent random variable $T_i$, which accounts for the unrecorded characteristics of the record pair through a regression model

$$P(Y_{ik} = 1 \mid M_i = m, T_i = t) = \Phi(a_{km} + b_{km} t) \tag{6}$$

where $\Phi$ represents the cumulative distribution function of the standard Normal distribution and $t$ is distributed as a standard Normal distribution. Thus the model includes an intercept term for each field and a coefficient for the random effects term, which defines the tetrachoric correlation among the binary agreement patterns. The random effects model may vary depending on the assumed correlation structure. For example, when the coefficients are the same within a specific class, that is, $b_{km} = b_m$, the correlations between the pairs of fields are all equal. We explore two random effects models: (i) assuming equal correlation within one or both classes and (ii) allowing coefficients for the random effects to vary with each field in one or both classes. The posterior probability of belonging to the match class may be calculated for each vector pattern for classification purposes. The conditional probability of observing vector pattern $Y_i$ given the latent class may be found by integrating out the random effects,

$$P(Y_i \mid M_i = m) = \int_{-\infty}^{\infty} \prod_{k=1}^{K} \Phi(a_{km} + b_{km} t)^{y_{ik}} \left(1 - \Phi(a_{km} + b_{km} t)\right)^{(1 - y_{ik})} \, d\Phi(t). \tag{7}$$

This integral may be computed using Gauss–Hermite quadrature [34] where $w_j$ is the mass of point $t_j$,

$$P(Y_i \mid M_i = m) = \sum_{j=1}^{J} w_j \prod_{k=1}^{K} \Phi(a_{km} + b_{km} t_j)^{y_{ik}} \left(1 - \Phi(a_{km} + b_{km} t_j)\right)^{(1 - y_{ik})}. \tag{8}$$

Thus, the posterior probability that the $i$th record pair is a match is given by Bayes’ theorem as

$$P(M_i = 1 \mid Y_i) = \frac{\tau P(Y_i \mid M_i = 1)}{\tau P(Y_i \mid M_i = 1) + (1 - \tau) P(Y_i \mid M_i = 0)}. \tag{9}$$
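Equations (8) and (9) can be sketched in a few lines. The example below is an illustration under stated assumptions rather than the PROC NLMIXED and PROC IML workflow used in the paper: it uses NumPy's `hermegauss` rule (weight function exp(−t²/2)) rescaled so the weights sum to one, a fixed rather than adaptive grid of 40 nodes, and made-up values for 𝜏 and the a and b coefficients.

```python
from itertools import product
from math import erf, sqrt

from numpy.polynomial.hermite_e import hermegauss


def Phi(x):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def quadrature_rule(n_nodes=40):
    """Gauss-Hermite nodes with weights rescaled to integrate against N(0, 1)."""
    nodes, weights = hermegauss(n_nodes)
    return nodes, weights / weights.sum()


def class_conditional_prob(y, a, b, nodes, weights):
    """Equation (8): quadrature approximation of P(Y | M = m)."""
    total = 0.0
    for t, w in zip(nodes, weights):
        term = w
        for y_k, a_k, b_k in zip(y, a, b):
            p = Phi(a_k + b_k * t)
            term *= p if y_k == 1 else 1.0 - p
        total += term
    return total


def posterior_match(y, tau, match_par, nonmatch_par, nodes, weights):
    """Equation (9): posterior probability that a pair with pattern y is a match."""
    p1 = class_conditional_prob(y, *match_par, nodes, weights)
    p0 = class_conditional_prob(y, *nonmatch_par, nodes, weights)
    return tau * p1 / (tau * p1 + (1.0 - tau) * p0)


nodes, weights = quadrature_rule()

# Hypothetical (a, b) coefficients: equal loadings in the match class,
# zero loadings (conditional independence) in the non-match class.
match_par = ([1.0, 0.5, 0.2], [0.8, 0.8, 0.8])
nonmatch_par = ([-2.0, -1.5, -2.5], [0.0, 0.0, 0.0])
tau = 0.05

posteriors = {
    y: posterior_match(y, tau, match_par, nonmatch_par, nodes, weights)
    for y in product((0, 1), repeat=3)
}
```

A pair would then be classified as a match when its posterior probability exceeds a chosen cutoff; the cutoff itself is a design decision that this sketch does not fix.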


Traditional classification for latent class models is based on the posterior match probability, which we use to estimate the overall sensitivity, misclassification rate, and PPV for each of the models. The GRE models are fit in standard statistical software such as SAS (Cary, NC) through PROC NLMIXED using the adaptive Gaussian quadrature technique. Estimation of the posterior match probability in Equation (9) requires an additional calculation, which we complete with PROC IML using parameter estimates obtained from PROC NLMIXED. Additionally, for comparison with the loglinear model in Section 2.2, the correlation residuals (5) corresponding to the random effects models are computed. This also requires estimation of a joint probability that is not conditional on random effects; thus, we again use Gauss–Hermite quadrature to compute $p_{jl} = P(Y_j = 1, Y_l = 1)$. The remaining terms required in the correlation residual, $p_j$ and $p_l$, may be easily computed as

$$p_j = P(Y_j = 1) = \tau P(Y_j = 1 \mid M = 1) + (1 - \tau) P(Y_j = 1 \mid M = 0)$$

where the probability of agreement on the $j$th field given the record pair is a true match,

$$m_j = \Phi\left( \frac{a_{j1}}{\sqrt{1 + b_{j1}^2}} \right)$$

is referred to as the m-parameter in record linkage and the u-parameter is also easily estimated as

$$u_j = \Phi\left( \frac{a_{j0}}{\sqrt{1 + b_{j0}^2}} \right).$$

In the simplified case where correlation is only present in the match class, the coefficients $b_{j0} = 0$ and therefore $u_j = \Phi(a_{j0})$.
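The closed form for the marginal agreement probability can be verified against direct integration over the random effect. The sketch below compares Φ(a/√(1 + b²)) with a Gauss–Hermite approximation of the integral of Φ(a + bt) against the standard Normal density; the values of a and b and the number of nodes are arbitrary choices for the illustration.

```python
from math import erf, sqrt

from numpy.polynomial.hermite_e import hermegauss


def Phi(x):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def marginal_by_quadrature(a, b, n_nodes=40):
    """E[Phi(a + b T)] for T ~ N(0, 1), approximated by Gauss-Hermite quadrature."""
    nodes, weights = hermegauss(n_nodes)
    weights = weights / weights.sum()  # rescale so the weights sum to one
    return sum(w * Phi(a + b * t) for t, w in zip(nodes, weights))


def marginal_closed_form(a, b):
    """The m- or u-parameter implied by intercept a and loading b."""
    return Phi(a / sqrt(1.0 + b * b))


a, b = 0.4, 0.8  # arbitrary illustrative values
numeric = marginal_by_quadrature(a, b)
closed = marginal_closed_form(a, b)
```

Setting b = 0 reduces both expressions to Φ(a), which is the simplified non-match case noted above.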

3. Application



We examine the performance of the random effects model and the loglinear model for accounting for conditional dependence using NBS data linked to a HIE data repository. To improve NBS follow-up, we seek to identify infants who may lack screening for potentially harmful disorders by linking data from Indiana’s statewide NBS registry obtained from July 1, 2007 to December 31, 2007 to data for patients less than 1 month old for the same time period obtained from the Indiana Network for Patient Care [35]. The Indiana Network for Patient Care is a regional HIE that serves more than 40 hospitals in Indianapolis and surrounding areas.

In record linkage, missing data have been handled by adding a third category to the binary agreement/disagreement pattern or treating fields with missing values as a disagreement. We take the latter approach because it is more conservative and protects against false matches. To make computation feasible, the data are first blocked on fields of gender, month, and day of birth. Thus, only records matching on these fields are considered as possible matches. This results in 10,467,025 record pairs. The remaining fields available for classification are the patient’s medical record number (MRN), last name, next of kin’s last name, next of kin’s first name, telephone number, zip code, and doctor’s last name. There are $2^7 = 128$ potential vector patterns for these seven fields, of which 126 are actually observed.

We compare four models, which include (i) the CI model; (ii) the loglinear model with interaction terms chosen by the correlation residual plot; (iii) the GRE model assuming equal correlation within match class only or within both classes depending on the correlation structure (GRE1); and (iv) the GRE model assuming different loadings within match class only or within both classes depending on the correlation structure (GRE2). Comparisons of these four models are made throughout this study.

The CI model was first fit to the data, which yielded an extremely low estimated match prevalence of $\hat{\tau} = 0.0087$. This was expected when using non-specific blocking fields of gender, month of birth, and day of birth. This resulted in 90,155 matched records using the posterior match probability for classification. We stratified record pairs into matches and non-matches on the basis of this model and examined the observed correlation. It was determined that some fields among the matched class were highly correlated, whereas fields in the non-match class were close to independent.

Loglinear models were first used to account for the conditional dependency by sequentially adding two-way interaction terms to the model for the match class only. These terms were identified by examining the correlation residual plots. The two-way interaction terms were added sequentially to account for the largest correlation residual until only extremely small correlation residuals remained. In the first iteration, the largest correlation residual as seen in Figure 1A was found for fields $y_3$ and $y_4$ corresponding to next-of-kin last name and next-of-kin first name. Sequentially adding interaction terms in the match class led to the final loglinear model, which included 14 interaction terms. The final model improved the fit of the model with decreased BIC and decreased deviance, which went from $G^2$ = 116,817 to $G^2$ = 9934. The correlation residuals were very close to zero, as shown in Figure 1B. Estimated prevalence in the final loglinear model is 0.01006 for approximately 101,202 estimated matches.

The fit of the GRE model that assumes equal correlation within the match class (GRE1) is better than CI in terms of deviance with $G^2$ = 83,342 and gives an estimated prevalence of 0.01019 and results in 91,929 matched records. However, the correlation residual plot for this model, shown in Figure 1C, does not look much better than the CI correlation residual plot, Figure 1A.
The more general random effects model that allows the loadings to differ (GRE2) results in a deviance of $G^2$ = 55,001 and estimated prevalence of 0.01009, for a total of 101,202 matched records. Thus, allowing the loadings to vary for different fields within the match class provides a more reasonable fit and produces smaller correlation residuals, although correlation residuals are not completely reduced. As will be seen from the simulation study in Section 4, not accounting for conditional dependence even when match prevalence is low will lead to a bias in the estimated match prevalence. This in turn leads to missed matches in the database.

Table I contains the maximum likelihood parameter estimates and asymptotic standard errors for $\theta = \{m_1, m_2, \ldots, m_K, u_1, u_2, \ldots, u_K, \tau\}$. Estimated match prevalence is higher for all models that account for the conditional dependence. Overall, the m parameter estimates are much lower, and u parameter estimates are slightly lower for all models over CI. The u parameters for MRN and telephone number are very close to zero, which indicates that among truly non-matched pairs, the chance of agreeing on MRN or telephone number is extremely small when accounting for the conditional dependence. In addition, the extremely small u parameters for MRN and telephone number resulted in a final Hessian matrix that was not positive definite for the GRE models. The models were verified by holding the parameters relating to $u_1$ and $u_5$ fixed and re-running the models. The resulting models were identifiable with a positive-definite Hessian and parameter estimates and standard errors identical to the unconstrained models.

[Figure 1: four panels (A, Conditional Independence; B, Loglinear Model with Interactions; C, Gaussian Random Effects Model 1; D, Gaussian Random Effects Model 2), each plotting the correlation residual (vertical axis, −0.1 to 0.4) for every pairwise correlation (horizontal axis, field pairs (2,1) through (7,6)).]

Figure 1. Correlation residual plots obtained from fit of conditional independence, loglinear with interaction terms, GRE1 (equal correlation within match class only), and GRE2 (different loadings within match class only).


Table I. ML estimates and asymptotic standard errors for NBS.

                          CI                  Loglinear           GRE1                GRE2
Field          Parameter  EST       SE        EST       SE        EST       SE        EST       SE
Prevalence     τ          0.00870   2.9e−05   0.01006   4.0e−05   0.01019   4.2e−05   0.01009   3.1e−05
MRN            m1         0.616     0.0016    0.537     0.0021    0.531     0.0022    0.547     0.0016
Last name      m2         0.829     0.0013    0.791     0.0017    0.743     0.0020    0.820     0.0013
NK last name   m3         0.435     0.0017    0.376     0.0018    0.376     0.0018    0.399     0.0015
NK first name  m4         0.357     0.0016    0.307     0.0016    0.305     0.0017    0.320     0.0014
Telephone      m5         0.649     0.0016    0.562     0.0021    0.553     0.0022    0.568     0.0016
Zip code       m6         0.794     0.0014    0.704     0.0021    0.710     0.0021    0.713     0.0014
DR last name   m7         0.290     0.0015    0.251     0.0015    0.250     0.0015    0.257     0.0014
MRN            u1         5.4e−05   2.7e−06   2.9e−08   —         3.8e−42   —         9.8e−07   —
Last name      u2         9.7e−04   1.0e−05   2.2e−04   1.7e−05   5.7e−04   1.2e−05   6.2e−06   —
NK last name   u3         9.3e−05   3.1e−06   8.7e−05   3.2e−06   4.9e−05   3.2e−06   1.0e−05   —
NK first name  u4         1.7e−03   1.3e−05   1.7e−03   1.3e−05   1.6e−03   1.3e−05   1.7e−03   1.3e−05
Telephone      u5         1.3e−05   1.8e−06   4.9e−08   —         5.9e−17   —         8.6e−07   —
Zip code       u6         5.1e−03   2.2e−05   4.9e−03   2.3e−05   4.7e−03   2.3e−05   4.9e−03   2.2e−05
DR last name   u7         1.7e−03   1.3e−05   1.7e−03   1.3e−05   1.7e−03   1.3e−05   1.7e−03   1.3e−05
BIC                       3,163,051           3,056,395           3,129,593           3,101,864
Deviance G2               116,817             9934                83,342              55,001

Table II. Classification results for NBS: vector patterns that differed between models.

                       Match status
Vector*    Observed    CI    Loglinear    GRE1    GRE2
0001010    206         1     0            1       0
0010000    1037        0     0            1       0
0100000    10516       0     1            0       1
1000000    737         0     1            1       1

* Agreement on fields: MRN, last name, NK last name, NK first name, telephone, zip code, and DR last name.

This is not unusual in the latent class setting; it was recently noted that even for well-behaved and identifiable models, the Fisher information matrix can exhibit singular behavior at the maximum-likelihood estimates [36].

Although the four models produced very different degrees of goodness of fit, a more meaningful comparison is whether they classified record pairs differently. Classification of record pairs by the different models resulted in four discrepant vector patterns out of the 126 observed. Further detail on the discrepancies is in Table II. The loglinear model and GRE2 classified record pairs identically, despite a much lower $G^2$ for the loglinear model. Both of these models identified 11,047 additional matches in the NBS data over the CI model. This suggests that the more general random effects model (GRE2) may be a viable alternative to the loglinear model for data with conditional dependence.
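The figure of 11,047 additional matches appears to be the net change implied by Table II: the loglinear model gains the pairs in patterns 0100000 and 1000000 and loses those in pattern 0001010, which only CI and GRE1 classify as matches. The short check below reproduces that arithmetic from the table values; the dictionary layout is simply a convenience for the example.

```python
# Observed frequencies and match classifications (1 = match) from Table II.
observed = {"0001010": 206, "0010000": 1037, "0100000": 10516, "1000000": 737}
ci = {"0001010": 1, "0010000": 0, "0100000": 0, "1000000": 0}
loglinear = {"0001010": 0, "0010000": 0, "0100000": 1, "1000000": 1}


def matched_pairs(classification):
    """Number of record pairs in the discrepant patterns classified as matches."""
    return sum(observed[v] for v, is_match in classification.items() if is_match == 1)


net_gain = matched_pairs(loglinear) - matched_pairs(ci)
print(net_gain)  # 11047
```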

4. Simulation



Model-based estimators of field-specific accuracy and prevalence may be substantially biased under the CI model when conditional dependence is present or, more generally, when the conditional dependence structure is misspecified [4, 8]. To investigate the bias in prevalence under a realistic record linkage scenario, we simulate data on the basis of the empirical parameters derived from our NBS data from Section 3 with a range of true match prevalence. We vary the prevalence to reflect values observed in practice. For example, data may be blocked on Social Security number (SSN), which would yield an extremely high match prevalence such as 0.98; conversely, a blocking scheme may use only partial field data, such as the last four digits of SSN. These partial field blocking schemes may produce mid-range values for match prevalence. Further, potential pairs may be formed using date of birth, as in our NBS example, which produces very low match prevalence.

To further assess the robustness of different models to account for conditional dependence, we conduct a second simulation study under four additional scenarios with m and u parameters set at typical values found in record linkage, while varying the dependence structure: Scenario 1, high correlation within match class only and no correlation in non-match class; Scenario 2, high correlation in both classes; Scenario 3, low correlation in match class only and no correlation in non-match class; and Scenario 4, low correlation in both classes. We did not explore the case where correlation is present only in the non-match class as this is not a realistic scenario found in record linkage. For these four scenarios, we again fit the four models for comparison of overall classification performance. For the GRE1 and GRE2 models, correlation is assumed to occur within match class only or within both classes depending on the scenario.

Although understanding bias in parameter estimates is important, in record linkage, correct classification of records as matches or non-matches is of key interest. When linking records, the goal is to identify the true record pairs; thus, an ideal model will have a low misclassification rate and high overall sensitivity. Additionally, the PPV, which represents the proportion of linked records that are true matches, is also of interest. Thus, these measures of overall classification performance are used to compare models rather than model fit alone [20].

4.1. Simulation study I: based on newborn screening data with varying true prevalence

To begin the simulations, we first used our motivating example of NBS data analyzed in Section 3.
Thus, the m and u values are set at the estimated values under the CI model in Table I. Correlation between fields is set at the observed pairwise Pearson correlation coefficients within each class. In the non-match class, pairwise correlation is almost non-existent with all |r| < 0.01 except for the pairwise correlation between y3 and y4 which is 0.097. The correlation matrix in the match class was set at

\[
\mathrm{corrMatch} =
\begin{pmatrix}
1 & 0.29 & 0.27 & 0.27 & 0.25 & 0.24 & 0 \\
0.29 & 1 & 0.49 & 0.21 & 0.27 & 0.29 & -0.20 \\
0.27 & 0.49 & 1 & 0.53 & 0.38 & 0.38 & -0.18 \\
0.27 & 0.21 & 0.53 & 1 & 0.40 & 0.48 & -0.02 \\
0.25 & 0.27 & 0.38 & 0.40 & 1 & 0.52 & -0.13 \\
0.24 & 0.29 & 0.38 & 0.48 & 0.52 & 1 & -0.15 \\
0 & -0.20 & -0.18 & -0.02 & -0.13 & -0.15 & 1
\end{pmatrix}
\]


We simulated data assuming a Probit model so that we could specify the true underlying correlation structure. Specifically, we simulated the underlying continuous variables from a multivariate normal distribution for each latent class, with the mean set at appropriate values to obtain the specified m and u values and the correlation set at the observed correlation matrix for each class. We simulated 500 data sets of 500,000 record pairs from this scenario at each prevalence value (0.01, 0.05, 0.10, 0.50, 0.80, 0.90, and 0.98) and examined the bias in the estimator for match prevalence from the CI model. When conditional dependence exists, prevalence may be extremely biased if CI is assumed, as can be seen in Figure 2. The match prevalence estimate is slightly biased when the true prevalence is low because the dependence affects only a small subset of true matches, but it is severely underestimated when the true prevalence is high. To investigate how the bias in prevalence impacts the overall classification, we fit the four models to our simulated NBS data. Figure 3 summarizes the performance of the four models across the different prevalence values. The loglinear model fit was the best in terms of BIC for all simulations based on the NBS data. The GRE2 model fit was the second best, with a lower BIC than both CI and GRE1. No substantial difference in classification performance among the methods was observed at lower values of match prevalence. At a true prevalence of 0.50, the misclassification rates of the loglinear and GRE2 models are lower than those of CI and GRE1, primarily because of the 6% higher sensitivity, despite a less than 1% lower PPV. As prevalence increases to 0.80 and beyond, the CI model performs the worst, and the loglinear and GRE2 models perform the best, with almost identical performance. Accounting for the conditional dependency results in a lower misclassification rate than assuming CI.
Convergence for the random effects models was not always reached; the average convergence rate across all prevalence values was 0.994 for GRE1 and 0.991 for GRE2. GRE1 makes the strong assumption that conditional dependence is equal between fields in the match class, and GRE2 is a more complex model and thus more difficult to fit.
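The probit simulation described above can be illustrated with a minimal Python sketch. It is not the authors' code; the function names `cholesky` and `simulate_pairs`, the three-field example, and the seed are assumptions made purely for illustration. Each latent variable is given unit variance and mean Φ⁻¹(m_j) in the match class or Φ⁻¹(u_j) in the non-match class, and a field is coded as agreeing when its latent variable is positive, so the marginal agreement probabilities equal the specified m and u values.

```python
import math
import random
from statistics import NormalDist


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                diag = matrix[i][i] - partial
                if diag <= 0.0:
                    raise ValueError("correlation matrix is not positive definite")
                lower[i][j] = math.sqrt(diag)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def simulate_pairs(n_pairs, prevalence, m, u, corr_match, corr_nonmatch, seed=0):
    """Simulate (is_match, agreement_vector) tuples from a two-class probit model."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    # Latent means chosen so that P(z_j > 0) equals the target m_j or u_j.
    mean_match = [std_normal.inv_cdf(p) for p in m]
    mean_non = [std_normal.inv_cdf(p) for p in u]
    chol_match = cholesky(corr_match)
    chol_non = cholesky(corr_nonmatch)
    pairs = []
    for _ in range(n_pairs):
        is_match = rng.random() < prevalence
        mean, chol = (mean_match, chol_match) if is_match else (mean_non, chol_non)
        eps = [rng.gauss(0.0, 1.0) for _ in mean]
        # z = mean + L @ eps has unit variances and the requested correlations.
        z = [mu + sum(chol[j][k] * eps[k] for k in range(j + 1))
             for j, mu in enumerate(mean)]
        pairs.append((is_match, [1 if value > 0.0 else 0 for value in z]))
    return pairs


# Hypothetical three-field example (values chosen for illustration only).
example_m = [0.95, 0.60, 0.95]
example_u = [0.001, 0.50, 0.001]
example_corr_match = [[1.0, 0.7, 0.0], [0.7, 1.0, 0.0], [0.0, 0.0, 1.0]]
independent = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
example_pairs = simulate_pairs(2000, 0.5, example_m, example_u,
                               example_corr_match, independent, seed=1)
```

Note that the correlation specified here is that of the latent normal variables; the Pearson correlation of the resulting binary agreement indicators will generally be smaller in magnitude.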

Figure 2. Bias in prevalence under the conditional independence model for simulation study I. (Plot of bias in prevalence against true prevalence.)

Figure 3. Misclassification rate, sensitivity, and PPV for all models fit in simulation study I. (Three panels plotting misclassification rate, sensitivity, and PPV against true prevalence for the CI, Loglinear, GRE1, and GRE2 models.)

4.2. Simulation study II: model comparison under different correlation structures



We again used the Probit model to simulate data based on a range of m and u values typically encountered in record linkage studies. In record linkage, the m-values represent the probability of agreement on the different fields within the matched record pairs. These values were set at 0.95 for more accurately recorded fields and 0.6 for fields prone to error. The u-values represent the probability that fields agree among the true non-matched record pairs and were set at 0.001 for fields with little chance of agreement and 0.5 for non-specific fields such as gender. Thus, we set m = {0.95, 0.95, 0.6, 0.95, 0.95, 0.6, 0.6} and u = {0.001, 0.5, 0.001, 0.001, 0.5, 0.001, 0.5}. We used four scenarios for the correlation structure. For scenario 1 (high correlation within the match class), we set three of the 21 pairwise correlations at 0.70 (r23 = r34 = r56), two at 0.10 (r12 = r27), and drew the remaining correlations from U(0, 0.01); within the non-match class, all pairwise correlations were set at 0. Scenario 2 is identical to scenario 1, except that the high correlation structure is present in both classes. For scenario 3 (low correlation within the match class), we set three of the 21 pairwise correlations at 0.30 (r23 = r34 = r56), two at 0.05 (r12 = r27), and drew the remaining correlations from U(0, 0.01); within the non-match class, all pairwise correlations were set at 0. Scenario 4 has this low correlation structure present in both classes. For each scenario and prevalence value, we simulated 500 datasets of 500,000 record pairs. We then compared the four models for overall classification accuracy.
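As an illustration of the correlation structures just described, the following Python sketch builds one 7 × 7 match-class correlation matrix. The function name `scenario_correlation`, its arguments, and the seed are hypothetical and are not taken from the paper; only the pair assignments (r23 = r34 = r56, r12 = r27) and the U(0, 0.01) draws follow the text.

```python
import random

N_FIELDS = 7
HIGH_PAIRS = {(2, 3), (3, 4), (5, 6)}   # r23 = r34 = r56
MODERATE_PAIRS = {(1, 2), (2, 7)}       # r12 = r27


def scenario_correlation(high, moderate, seed=0):
    """Build a symmetric 7 x 7 correlation matrix for one simulation scenario.

    `high` and `moderate` are the correlations assigned to the fixed pairs;
    the other 16 pairwise correlations are drawn from U(0, 0.01).
    """
    rng = random.Random(seed)
    corr = [[1.0 if i == j else 0.0 for j in range(N_FIELDS)]
            for i in range(N_FIELDS)]
    for i in range(1, N_FIELDS + 1):          # fields are numbered from 1
        for j in range(i + 1, N_FIELDS + 1):
            if (i, j) in HIGH_PAIRS:
                value = high
            elif (i, j) in MODERATE_PAIRS:
                value = moderate
            else:
                value = rng.uniform(0.0, 0.01)
            corr[i - 1][j - 1] = corr[j - 1][i - 1] = value
    return corr


# Scenarios 1 and 2 use (0.70, 0.10); scenarios 3 and 4 use (0.30, 0.05).
high_corr_match = scenario_correlation(0.70, 0.10, seed=1)
low_corr_match = scenario_correlation(0.30, 0.05, seed=1)
```

With r23 = r34 = 0.70 and r24 close to zero, the 3 × 3 block for fields 2 to 4 is nearly singular (its smallest eigenvalue is about 1 − 0.7√2 ≈ 0.01), so it seems prudent to verify that a generated matrix is positive definite before simulating from it.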

0.01

0.9719 (0.0025) 0.9718 (0.0025) 0.9719 (0.0025) 0.9719 (0.0024)

0.9721 (0.0024) 0.9719 (0.0026) 0.9721 (0.0024) 0.9721 (0.0024)

0.9778 (0.0020) 0.9777 (0.0020) 0.9778 (0.0020) 0.9778 (0.0020)

0.9778 (0.0022) 0.9779 (0.0022) 0.9979 (0.0022) 0.9979 (0.0022)

Model

CI Loglin GRE1 GRE2

CI Loglin GRE1 GRE2

CI Loglin GRE1 GRE2

CI Loglin GRE1 GRE2

0.90

0.98



0.9974 (9e−5) 0.9993 (4e−5) 0.9992 (4e−5) 0.9993 (4e−5) 0.9974 (0.0002) 0.9993 (4e−5) 0.9992 (4e−5) 0.9992 (4e−5)

Scenario 4 (low correlation in both classes) 0.9948 (0.0003) 0.9967 (0.0003) 0.9973 (8e−5) 0.9948 (0.0003) 0.9988 (0.0001) 0.9992 (7e−5) 0.9948 (0.0003) 0.9989 (0.0003) 0.9991 (6e−5) 0.9948 (0.0004) 0.9982 (0.0001) 0.9991 (9e−5)

0.9910 (0.0018) 0.9942 (0.0012) 0.9948 (0.0005) 0.9920 (0.0027)

0.9973 (8e−5) 0.9993 (4e−5) 0.9992 (4e−5) 0.9992 (4e−5)

0.9938 (0.0001) 0.9990 (0.0001) 0.9968 (0.0006) 0.9989 (0.0002)

Scenario 3 (low correlation in match class, no correlation in non-match class) 0.9973 (0.0002) 0.9903 (0.0011) 0.9949 (0.0003) 0.9965 (0.0004) 0.9973 (8e−5) 0.9993 (5e−5) 0.9937 (0.0014) 0.9949 (0.0003) 0.9990 (0.0001) 0.9992 (5e−5) 0.9949 (0.0005) 0.9949 (0.0003) 0.9990 (0.0003) 0.9991 (6e−5) 0.9992 (5e−5) 0.9945 (0.0011) 0.9948 (0.0003) 0.9988 (0.0001) 0.9991 (5e−5) 0.9992 (5e−5)

0.9829 (0.0047) 0.9924 (0.0008) 0.9841 (0.0025) 0.9858 (0.0073)

0.8846 (0.0034) 0.9990 (4e−5) 0.9990 (0.0001) 0.9982 (0.0003)

0.80

Scenario 2 (high correlation in both classes) 0.9924 (0.0007) 0.9938 (0.0002) 0.9939 (0.0002) 0.9934 (0.0007) 0.9989 (0.0001) 0.9990 (0.0001) 0.9925 (0.0004) 0.9959 (0.0005) 0.9970 (0.0003) 0.9940 (0.0006) 0.9963 (0.0006) 0.9986 (0.0005)

True prevalence 0.50

0.8849 (0.0036) 0.9990 (0.0001) 0.9973 (0.0004) 0.9988 (0.0001)

0.10

Scenario 1 (high correlation in match class, no correlation in non-match class) 0.9772 (0.0032) 0.9914 (0.0025) 0.9934 (0.0002) 0.9938 (0.0002) 0.9938 (0.0001) 0.9904 (0.0007) 0.9941 (0.0007) 0.9988 (0.0001) 0.9990 (0.0001) 0.9990 (0.0001) 0.9925 (0.0006) 0.9925 (0.0004) 0.9968 (0.0004) 0.9984 (0.0007) 0.9987 (0.0005) 0.9908 (0.0011) 0.9929 (0.0006) 0.9966 (0.0007) 0.9984 (0.0003) 0.9988 (0.0001)

0.05

Table III. Results of simulation: sensitivity and SD.


Table III compares the sensitivities of the four models under each of the four scenarios (correlation structures) at varying prevalence values. There are virtually no differences among models in any scenario at low prevalence (0.01). The CI model provides the lowest sensitivity across all scenarios when prevalence is above 0.01, as expected because match prevalence is biased. At a match prevalence of 0.50, the sensitivity of CI is lower than that of all other models, albeit by no more than 0.6%. At a match prevalence of 0.98, however, the sensitivity of CI is lower than that of all other models by over 10% in the two high-correlation scenarios, in which GRE1 is also slightly less sensitive than loglinear and GRE2. In terms of the PPV, or the proportion of linked records that are true matches, all models provide a PPV above 0.99 under almost all scenarios. Results may be found in Appendix A. The PPV for the CI model is generally the highest, and the PPVs for the other models are very close. Figure 4 summarizes the misclassification rates observed in each of the four scenarios. Note that the y-scale for the low correlation scenarios is smaller, to make the lines distinguishable, because the overall misclassification rate is low for all models. When match prevalence was low (0.01), the misclassification rate was not improved by fitting models to account for the conditional dependence. As match prevalence increases to 0.50, the CI model performs more poorly than the other models in terms of the misclassification rate, with the difference most pronounced in scenario 1 (high correlation within the match class only). Specifically, the loglinear model with appropriate interaction terms found through the correlation residual plot fits the data the best in all scenarios in terms of BIC and misclassification rate. The random effects model that allows different loadings (GRE2) performs the next best in most scenarios. This was also found to be true when prevalence was as high as 0.98.
Figure 4. Mean misclassification rate for all scenarios under different models fit in simulation study II. (Four panels plotting misclassification rate against true prevalence for the CI, Loglinear, GRE1, and GRE2 models. Scenario 1: High Correlation within Match Class; Scenario 2: High Correlation within Both Classes; Scenario 3: Low Correlation within Match Class; Scenario 4: Low Correlation in Both Classes.)

Whether correlation was present in only the match class or within both classes, accounting for the conditional dependency, even if the correlation structure is misspecified, results in a lower misclassification rate than assuming CI. Overall, the CI model's performance is least favorable when correlation is high within the match class only. In this situation, our results suggest that accounting for the conditional dependency through either a loglinear model or a random effects model performs better than naively assuming CI. For the NBS simulation, the loglinear model and GRE2 performed almost equivalently in terms of misclassification. For scenario 1 (high correlation within the match class), which is not quite as extreme as the NBS example, the loglinear model was still optimal, but both GRE1 and GRE2 still performed much better than CI. When correlation is present in both classes (scenario 2), results are similar, although GRE1 performs worse than GRE2 and the loglinear model. These results are similar in the low correlation situation, although overall misclassification for all models is much lower. Thus, random effects models that may fail to capture the true correlation structure still reduce misclassification rates.

Random effects models may not converge in all situations, depending on the match prevalence and correlation structure. Convergence was obtained for 95.7% of the simpler GRE1 models, ranging from 87.5% to 100% across all four scenarios and prevalence values, whereas for the more complex GRE2 model, convergence was obtained for only 79.2% of the models (ranging from 66.3% to 96.3%). Most notable was the difficulty in attaining convergence for scenario 1 (high correlation in the match class only) when match prevalence was low. Additionally, good initial parameter estimates are crucial to obtaining convergence for the random effects models.
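The three performance measures used to compare models in this section (misclassification rate, sensitivity, and PPV) can be computed from true and predicted match labels as in the following minimal Python sketch. The function name `classification_summary` and the toy labels are assumptions for illustration; true match status is available here only because the data are simulated.

```python
def classification_summary(truth, predicted):
    """Return misclassification rate, sensitivity, and PPV for binary labels.

    `truth` and `predicted` are equal-length sequences of 0/1 (or bool) values,
    where 1 denotes a match.
    """
    if len(truth) != len(predicted) or len(truth) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    tp = fp = fn = tn = 0
    for is_match, linked in zip(truth, predicted):
        if is_match and linked:
            tp += 1          # true match that was linked
        elif is_match:
            fn += 1          # true match that was missed
        elif linked:
            fp += 1          # non-match that was linked
        else:
            tn += 1          # non-match correctly left unlinked
    total = tp + fp + fn + tn
    return {
        "misclassification_rate": (fp + fn) / total,
        # Proportion of true matches that were linked.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Proportion of linked pairs that are true matches.
        "ppv": tp / (tp + fp) if (tp + fp) else float("nan"),
    }


# Toy example: 3 true matches and 5 true non-matches.
summary = classification_summary([1, 1, 1, 0, 0, 0, 0, 0],
                                 [1, 1, 0, 1, 0, 0, 0, 0])
```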

5. Discussion

In this study, we use simulation to examine the impact of conditional dependency between fields on overall sensitivity, misclassification rate, and PPV when parameters such as the u-parameters and m-parameters are set at realistic values from record linkage situations. Additionally, we compare the loglinear model approach with the random effects model approach through simulation. In the literature, simulations that address the impact of conditional dependence on classification are lacking. In the record linkage literature, Tromp et al. presented a basic empirical study and simulation scenario with extremely low match prevalence (𝜏 < 0.001), which showed how high correlation with only five fields can negatively impact correct classification of record pairs [10]. Another paper motivated by the diagnostic testing problem included simulations that broadly describe how correlation affects test-specific sensitivity and specificity and match prevalence [8]. Our simulations suggest that accounting for conditional dependence can substantially improve model fit and misclassification rate. This was most notable when match prevalence was high, as was the case with the NBS simulation. However, when match prevalence was low, regardless of the correlation structure, little difference could be seen between the conditional independence model and the other models. With fewer, more highly correlated fields, however, we might have found more impact in the low prevalence situation, as described previously by Tromp et al. [10]. Our results indicate that if all fields have binary indicators of agreement status, then in the absence of a gold standard, one should try to fit the most appropriate loglinear model, informed by the correlation residual plot, to obtain the best model fit, as this is not difficult to implement in standard statistical software. A drawback of this approach is its iterative nature: multiple loglinear models must be fit.
The random effects model with different loadings performs comparably with the loglinear model when correlation is high. This was found to be true in our simulations and in our application to a real HIE data set, although sensitivity and specificity of the models being compared cannot be given for the observed NBS data because of the lack of a gold standard. Random effects models are investigated because of the ease with which these models may be expanded to include variables that are continuous as well as dichotomous. Such is the case if a string comparator (or similarity index) is modeled rather than a binary agreement variable for a given field. In addition, random effects models can easily allow for covariates. Our results suggest that random effects models may be useful and are also easy to implement via standard software. We ran all statistical models through the NLMIXED procedure in SAS (Cary, NC). There are two issues with using the GRE models described in the paper. First, convergence of random effects models is more difficult to attain and is highly dependent on initial starting values. Second, classification based on the random effects models requires an additional step (after model fitting) to compute the posterior match probability. This was carried out using PROC IML, with code available from the author upon request. Future work will evaluate the use of random effects models if a similarity index is available and explore the use of a Probit latent class model [13] that has an even more general correlation structure. Overall, even if separate independent blocks are run, as is usual for record linkage, accounting for appropriate conditional dependencies in the model will provide better fit and lead to more accurate classification of record pairs as matches or non-matches. Both loglinear models and GRE models may be fit in standard software.
Although loglinear models that incorporate appropriate interaction terms may provide slightly lower misclassification rates, GRE models may be a useful tool if one wants to model additional features of the data as well as account for the conditional dependence.
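To make the posterior match probability concrete, the following Python sketch computes it for the simpler CI latent class model, in which the class-conditional probability of an agreement pattern is a product over fields. This is an illustrative sketch only, not the PROC IML code referred to above: for the GRE models, the class-conditional probabilities would additionally have to be integrated over the random effect, which is not attempted here. The function name `posterior_match_probability` is hypothetical, and the prevalence of 0.5 in the usage lines is an arbitrary choice.

```python
import math


def posterior_match_probability(pattern, prevalence, m, u):
    """Posterior probability that a record pair is a match under the CI model.

    pattern    : sequence of 0/1 agreement indicators, one per field
    prevalence : prior probability of a match, strictly between 0 and 1
    m, u       : per-field agreement probabilities among matches and
                 non-matches, each strictly between 0 and 1
    """
    log_match = math.log(prevalence)
    log_non = math.log(1.0 - prevalence)
    for agrees, m_j, u_j in zip(pattern, m, u):
        # Under conditional independence, fields contribute multiplicatively.
        log_match += math.log(m_j if agrees else 1.0 - m_j)
        log_non += math.log(u_j if agrees else 1.0 - u_j)
    # Subtract the larger log term before exponentiating for numerical stability.
    top = max(log_match, log_non)
    weight_match = math.exp(log_match - top)
    weight_non = math.exp(log_non - top)
    return weight_match / (weight_match + weight_non)


# m and u values from simulation study II; the prevalence is illustrative.
m_values = [0.95, 0.95, 0.6, 0.95, 0.95, 0.6, 0.6]
u_values = [0.001, 0.5, 0.001, 0.001, 0.5, 0.001, 0.5]
all_agree = posterior_match_probability([1] * 7, 0.5, m_values, u_values)
all_disagree = posterior_match_probability([0] * 7, 0.5, m_values, u_values)
```

A pair would then be classified as a match when its posterior probability exceeds a cut-off chosen by the analyst.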

Appendix A: Results of simulation

Additional results based on simulation described in Section 4.2.

0.01

0.9995 (0.0003) 0.9996 (0.0003) 0.9994 (0.0004) 0.9995 (0.0003)

0.9836 (0.0019) 0.9843 (0.0029) 0.9836 (0.0019) 0.9836 (0.0019)

0.9995 (0.0003) 0.9995 (0.0003) 0.9994 (0.0003) 0.9995 (0.0003)

0.9981 (0.0006) 0.9980 (0.0006) 0.9980 (0.0006) 0.9980 (0.0006)

Model

CI Loglin GRE1 GRE2

CI Loglin GRE1 GRE2

CI Loglin GRE1 GRE2



CI Loglin GRE1 GRE2

0.9999 (1e−5) 0.9999 (1e−5) 0.9999 (1e−5) 0.9999 (1e−5)

Scenario 4 (low correlation in both classes) 0.9950 (0.0003) 0.9984 (0.0001) 0.9995 (4e−5) 0.9950 (0.0003) 0.9979 (0.0001) 0.9992 (5e−5) 0.9950 (0.0003) 0.9976 (0.0001) 0.9992 (6e−5) 0.9950 (0.0003) 0.9983 (0.0001) 0.9993 (4e−5)

0.9917 (0.0011) 0.9900 (0.0011) 0.9895 (0.0006) 0.9916 (0.0021)

0.9997 (3e−5) 0.9996 (4e−5) 0.9996 (3e−5) 0.9997 (3e−5)

0.9999 (1e−5) 0.9999 (1e−5) 0.9999 (1e−5) 0.9999 (1e−5)

0.9915 (0.0016) 0.9876 (0.0007) 0.9904 (0.0010) 0.9902 (0.0032)

Scenario 3 (low correlation in match class, no correlation in non-match class) 0.9998 (2e−5) 0.9930 (0.0009) 0.9955 (0.0003) 0.9988 (0.0001) 0.9996 (3e−5) 0.9997 (3e−5) 0.9916 (0.0013) 0.9955 (0.0003) 0.9981 (0.0001) 0.9993 (4e−5) 0.9905 (0.0006) 0.9955 (0.0003) 0.9980 (0.0001) 0.9994 (5e−5) 0.9996 (3e−5) 0.9908 (0.0011) 0.9955 (0.0003) 0.9982 (0.0001) 0.9994 (4e−5) 0.9997 (3e−5)

0.98

0.9999 (1e−5) 0.9999 (1e−5) 0.9999 (1e−5) 0.9999 (1e−5)

0.90

0.9998 (2e−5) 0.9996 (4e−5) 0.9997 (4e−5) 0.9996 (3e−5)

0.80

Scenario 2 (high correlation in both classes) 0.9941 (0.0003) 0.9984 (0.0001) 0.9995 (4e−5) 0.9923 (0.0008) 0.9973 (0.0002) 0.9992 (0.0001) 0.9941 (0.0003) 0.9977 (0.0001) 0.9993 (6e−5) 0.9911 (0.0010) 0.9977 (0.0002) 0.9992 (5e−5)

True prevalence 0.50

0.9999 (1e−5) 0.9999 (1e−5) 0.9999 (1e−5) 0.9999 (1e−5)

0.10

Scenario 1 (high correlation in match class, no correlation in non-match class) 0.9998 (2e−5) 0.9960 (0.0015) 0.9957 (0.0005) 0.9991 (0.0001) 0.9998 (3e−5) 0.9927 (0.0006) 0.9934 (0.0010) 0.9977 (0.0001) 0.9993 (0.0001) 0.9997 (3e−5) 0.9904 (0.0006) 0.9955 (0.0003) 0.9983 (0.0001) 0.9993 (0.0001) 0.9996 (3e−5) 0.9920 (0.0014) 0.9945 (0.0011) 0.9985 (0.0001) 0.9994 (4e−5) 0.9997 (3e−5)

0.05

Table AI. Results of simulation: PPV and SD.


Acknowledgement

This project was supported by grant number R01HS018553 from the Agency for Healthcare Research and Quality. The content is solely the responsibility of the authors and does not necessarily represent the official views of the Agency for Healthcare Research and Quality.

References

1. Gomatam S, Carter R, Ariet M, Mitchell G. An empirical comparison of record linkage procedures. Statistics in Medicine 2002; 21:1485–1496.
2. Fellegi IP, Sunter AB. A theory for record linkage. Journal of the American Statistical Association 1969; 64:1183–1210.
3. Newcombe HB. The use of medical record linkage for population and genetic studies. Methods of Information in Medicine 1969; 8:7–11.
4. Albert PS, Dodd LE. A cautionary note on the robustness of latent class models for estimating diagnostic error without a gold standard. Biometrics 2004; 60(2):427–435.
5. Dendukuri N, Joseph L. Bayesian approaches to modeling the conditional dependence between multiple diagnostic tests. Biometrics 2001; 57:158–167.
6. Pepe MS, Janes H. Insights into latent class analysis of diagnostic test performance. Biostatistics 2007; 8(2):474–484.
7. Torrance-Rynard VL, Walter SD. Effects of dependence errors in the assessment of diagnostic test performance. Statistics in Medicine 1997; 97:2157–2175.
8. Vacek PM. The effect of conditional dependence on the evaluation of diagnostic tests. Biometrics 1985; 41(4):959–968.
9. Blakely T, Salmond C. Probabilistic record linkage and a method to calculate the positive predictive value. International Journal of Epidemiology 2002; 31(6):1246–1252.
10. Tromp M, Meray N, Ravelli AC, Reitsma JB, Bonsel GJ. Ignoring dependency between linking variables and its impact on the outcome of probabilistic record linkage studies. Journal of the American Medical Informatics Association 2008; 15(5):654–660.
11. Hagenaars J. Latent structure models with direct effects between indicators: local independence models. Sociological Methods and Research 1988; 16:379–405.
12. Qu Y, Tan M, Kutner MH. Random effects models in latent class analysis for evaluating accuracy of diagnostic tests. Biometrics 1996; 52(3):797–810.
13. Xu H, Craig BA. A probit latent class model with general correlation structures for evaluating accuracy of diagnostic tests. Biometrics 2009; 65:1145–1155.
14. Xu H, Black MA, Craig BA. Evaluating accuracy of diagnostic tests with intermediate results in the absence of a gold standard. Statistics in Medicine 2013; 32(15):2571–2584.
15. Garret E, Zeger S. Latent class model diagnosis. Biometrics 2000; 56:1055–1067.
16. Armstrong JB, Mayda JE. Estimation of Record Linkage Models Using Dependent Data. Survey Methodology 1993; 19:137–147.
17. Thibaudeau Y. The discrimination power of dependency structures in record linkage. Survey Methodology 1993; 19:31–38.
18. Winkler WE. Improved decision rules in the Fellegi-Sunter model of record linkage. Proceedings of the Section on Survey Research Methods, American Statistical Association, 1993; 274–279. Available from: https://www.census.gov/srd/papers/pdf/rr93-12.pdf.
19. Winkler WE. Matching and record Linkage. In Business Survey Methods, Cox BG, Binder DA, Chinnappa BN, Christianson A, Colledge MJ, Kott PS (eds). John Wiley & Sons, Inc.: Hoboken, NJ, USA, 1995.
20. Larsen MD, Rubin DB. Iterative automated record linkage using mixture models. Journal of the American Statistical Association 2001; 96(453):32–41.
21. Zhu R, Zhang J, Zhang D, Yan G. Stepwise variable selection in loglinear mixtures in record linkage. European Journal of Pure and Applied Mathematics 2010; 3(2):141–162.
22. Schürle J. A method for consideration of conditional dependencies in the Fellegi and Sunter model of record linkage. Statistical Papers 2005; 46(3):433–449.
23. Levenshtein V. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady 1966; 10:707.
24. Cohen WW, Ravikumar P, Fienberg SE. A comparison of string distance metrics for name-matching tasks. Proceedings of IJCAI-03 Workshop on Information Integration, Acapulco, Mexico, 2003, 73–78.
25. Porter EH, Winkler WE. Approximate string comparison and its effect on an advanced record linkage system. Technical Report, U.S. Bureau of the Census, 1997.
26. Winkler WE. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. Proceedings of the Section on Survey Research, Washington, DC, 1990, 354–359.
27. Winkler WE, Thibaudeau Y. An application of the Fellegi-Sunter model of record linkage to the 1990 U.S. Decennial Census. Technical Report, US Bureau of the Census, 1987.
28. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 1977; 39(1):1–38.
29. Jaro MA. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association 1989; 84(406):414–420.
30. Belin TR, Rubin DB. A method for calibrating false-match rates in record linkage. Journal of the American Statistical Association 1995; 90:694–707.
31. Clogg CC. Latent class models. In Handbook of Statistical Modeling for the Social and Behavioral Sciences, Arminger G, Clogg CC, Sobel ME (eds), chap. 6. Plenum: New York, 1995; 311–359.
32. Littell R, Milliken G, Stroup W, Wolfinger R, Schabenberger O. SAS System for Mixed Models, (2nd edn). SAS Institute Inc.: Cary, NC, 2006.
33. Goetghebeur E, Liinev J, Boelaert M, der Stuyft P. Diagnostic test analyses in search of their gold standard: latent class analyses with random effects. Statistical Methods in Medical Research 2000; 9:231–248.
34. Golub GH, Welsch JH. Calculation of Gauss quadrature rules. Mathematics of Computation 1969; 23:221–230.
35. Zhu VJ, Overhage MJ, Egg J, Downs SM, Grannis SJ. An empiric modification to the probabilistic record linkage algorithm using frequency-based weight scaling. Journal of the American Medical Informatics Association 2009; 16(5):738–745.
36. Fienberg SE, Hersh P, Rinaldo A, Zhou Y. Maximum likelihood estimation in latent class models for contingency table data. In Algebraic and Geometric Methods in Statistics. Cambridge University Press: Cambridge Books Online, 2009; 27–62.
