Detangling complex relationships in forensic data: principles and use of causal networks and their application to clinical forensic science.

Int J Legal Med DOI 10.1007/s00414-015-1164-8

ORIGINAL ARTICLE

Detangling complex relationships in forensic data: principles and use of causal networks and their application to clinical forensic science Thomas Lefèvre & Aude Lepresle & Patrick Chariot

Received: 27 June 2014 / Accepted: 27 February 2015 # Springer-Verlag Berlin Heidelberg 2015

Abstract The search for complex, nonlinear relationships and causality in data is hindered by the availability of techniques in many domains, including forensic science. Linear multivariable techniques are useful but present some shortcomings. In the past decade, Bayesian approaches have been introduced in forensic science. To date, authors have mainly focused on providing an alternative to classical techniques for quantifying effects and dealing with uncertainty. Causal networks, including Bayesian networks, can help detangle complex relationships in data. A Bayesian network estimates the joint probability distribution of data and graphically displays dependencies between variables and the circulation of information between these variables. In this study, we illustrate the interest in utilizing Bayesian networks for dealing with complex data through an application in clinical forensic science. Evaluating the functional impairment of assault survivors is a complex task for which few determinants are known. As routinely estimated in France, the duration of this impairment can be quantified by days of ‘Total Incapacity to Work’ (‘Incapacité totale de travail,’ ITT). In this study, we used a Bayesian network approach to identify the injury type, victim category and time to evaluation as the main determinants of the ‘Total Incapacity to Work’ (TIW). We computed the conditional probabilities associated with the TIW node and its T. Lefèvre (*) : A. Lepresle : P. Chariot Department of Forensic Science and Medicine, Hôpital Jean-Verdier (AP-HP), F-93140 Bondy, France e-mail: [email protected] A. Lepresle e-mail: [email protected] P. Chariot e-mail: [email protected] T. Lefèvre : P. Chariot IRIS - Institut de recherches interdisciplinaires sur les enjeux sociaux (UMR 8156-723), Bobigny, France

parents. We compared this approach with a multivariable analysis, and the results of both techniques were converging. Thus, Bayesian networks should be considered a reliable means to detangle complex relationships in data. Keywords Bayesian framework . Causal networks . Multivariable analysis . Forensic science . Functional impairment . Total incapacity to work

Introduction The complexity of data encountered by forensic scientists in their daily practice (e.g., determining the manner of death [1], evaluating the functional impairment of assault survivors [2], and estimating the age of unaccompanied minors [3]) is a challenge that few approaches can meet. In many cases, the search for causality is hindered by the limited number of available techniques and the lack of adequate experimental settings. Indeed, high standards, such as randomised controlled trials, are often out of reach, materially or ethically, so a panel of statistical techniques must be used to get as close as possible to quasiexperimental settings. Forensic science is no exception, and several attempts have been made to introduce innovative techniques, such as multivariable analysis [4] and Bayesian approaches [5]. These techniques aim at lessening the impact of potential confounders and biases and assessing a possible causal relationship between two parameters of interest. Most multivariable techniques have several shortcomings [6]. Multivariable models cannot integrate a large number of covariates while remaining valid and not overfitting on data (i.e., when models account more for noise than for the real underlying relationships between variables). Additionally, biases cannot be directly identified but only suspected and compensated for in models to lessen their contribution. Although nonlinear models exist, multivariable models usually encountered are most often linear

Int J Legal Med

models that work under linearity or normality assumptions. However, if these assumptions are verified, multivariable models can quantify the respective contribution of each covariate to the main effect, take the size effect of the considered sample into account and provide confidence intervals. In practice, few tests are robust enough to verify these assumptions that are often not adequately checked. Over the last decade, the Bayesian framework has been used with increasing frequency in forensic science [7–9]. Despite the quality of these papers, Bayesian approaches should be presented in a slightly different manner, as they have been used in other fields for several years (e.g., cell and molecular biology, complex systems and/or epidemiology) [10–15]. Bayesian techniques in forensics are usually presented as an alternative to classical statistics and are used to integrate evidence and quantify effects. Bayesian networks represent a complementary approach to classical multivariable analysis and can be used to detangle complex relationships in data. Finally, assuming that a priori distributions are specified, parameters of the networks can be estimated and their effects can be quantified. In this study, we utilise the Bayesian approach and the principles of causal networks. We applied these techniques to evaluate the functional impairment of assault survivors, a complex task for clinical forensic scientists. We compared the results obtained by Bayesian network (BN) analysis with those previously obtained using a classical multivariable analysis.

certain extent. Because dependencies are found between variables, this approach can help select relevant variables and confounders such that a more classical multivariable model can be built based on these variables. Size effects and quantification in terms of odds-ratios can be estimated using classical multivariable models. Mathematical formalism and assumptions Definitions Bayesian network. One can represent their data using a set of n variables fX 1 ; ⋯; X n g, for which a BN can be built. A BN is given by a graph (G) and the JPD (P(S): BN ¼ fG; PðS Þg, where PðS Þ ¼ PðX 1 ; ⋯; X n Þ. Nodes of the graph are variables fX 1 ; ⋯; X n g, whereas edges or arrows linking the nodes depict the dependencies between variables (Fig. 1). Bayes’ rule. For two random variables, X and Y, Bayes’ rules states: PðX jY Þ ¼ PðYPjXðYÞPÞ ðX Þ, where PðX jY Þ denotes the conditional probability of X, given Y, and PðX Þ indicates the probability of X. Conditional independency (CI). X is conditionally independent of Z given Y if the probability distribution of X conditioned on both Y and Z is the same as the probability distribution of X conditioned only on Y. The conditional independency of X and Y given Z is defined as follows: PðX ; Y jZ Þ ¼ PðX jZ ÞPðY jZ Þ ¼ PðY ; X jZ Þ, w h i c h i s

Methods

A

Principles for Bayesian approaches and Bayesian networks Bayesian approaches rely on the following basic rules and principles: 1) Bayes’ rules, 2) the conditional independency of variables, 3) the simplification, 4) the estimation of the joint probability distribution (JPD) of the data, and 5) the graphical representation of this distribution [14]. If data are composed of n variables of interest (e.g., age, sex, height or weight), they can be considered random variables, continuous or categorical. The ultimate aim of the Bayesian approach presented here is two-fold: to estimate the JPD of these n variables in its simplest form and to represent their statistical dependencies. The corresponding graph is composed of nodes that display the variables and edges linking the nodes that represent the dependencies between variables. The JPD can be simplified by applying Bayes’ rule and assuming the conditional independency of some variables regarding other variables. Ultimately, the graph obtained on the basis of the simplification of the JPD can be oriented; the edges can become arrows depicting the direction of the information (i.e., its circulation between two variables). No linear dependency between variables is assumed, and complex associations between variables can be retrieved to a

S

Y

B

Z

X

U

W Fig. 1 An example of a Bayesian network. Nodes of the graph (A, B,…, Z) are variables, whereas arrows depict the dependencies between variables (e.g., X depends causally on Y and Z). Parents of a given variable (e.g., B) are denoted C() (e.g., C(B)={S, Y}). Variables X, Y, Z constitute a Vstructure (Y →X ←Z , and there is no edge between Y and Z). S is a spouse of Y because they both share a common child (node B) and have no other direct relationship (no edge linking S and Y). The direction of the arrow relates to the circulation of the information. Note that this graph is totally directed (all edges are arrows) and acyclic (starting from a given node, whatever the node, it is not possible to go back to this node by following the direction of arrows from one node to another)

Int J Legal Med

equivalent to the following statement; PðX jY ; Z Þ ¼ PðX jZ Þ and PðY jX ; Z Þ ¼ PðY jZ Þ. Indeed, for PðZ Þ > 0, PðX ; Y ; Z Þ ¼ PðX ; Y jZ Þ:PðZ Þ ¼ PðZ Þ:PðY jZ Þ:PðX jY ; Z Þ. If PðX ; Y jZ Þ ¼ PðX jZ Þ:PðY jZ Þ , PðX jZ Þ:PðY jZ Þ ¼ PðY jZ Þ :PðX jY ; Z Þ and PðX jY ; Z Þ ¼ PðX jZ Þ. Conversely, if we assume that P ðX jY ; Z Þ ¼ PðX jZ Þ and PðY jX ; Z Þ ¼ PðY jZ Þ, Pð X ; Y ; ZÞ ¼ PðX jY ; Z Þ :PðY jZ Þ:PðZ Þ ¼ PðX jZ Þ: PðY jZ Þ: PðZ Þ ¼ P ðX ; Y jZ Þ: PðZ Þ. Thus, it follows that PðX jZ Þ: PðY jZ Þ ¼ PðX ; Y jZ Þ. In practice, conditional independency tests are often based on the concept of mutual information. As in classical statistical testing, a p value must be specified for these CI tests. The threshold for accepting or rejecting the CI between two variables is usually set at 0.05 by default. Mutual information. The mutual information IðX; YÞ of X and Y is an information-theoretic distance measure defined as follows: IðX; YÞ ¼ SðXÞ þ SðYÞ SðX; YÞ, where SðXÞ is the entropy of the variable X . Conditional independence tests for discrete data are functions of the conditional probability tables implied by the graphical structure of the network, and it can use the conditional mutual information of X and Y given z Y X Z, as follows: I ðX ; Y jZ Þ ¼ ∑Nk¼1 ∑Nj¼1 ∑Ni¼1 P xi ; y j ; zk log n o Pðzk ÞPðxi ;y j ;zk Þ ð x ; z ÞP y ; z ; y ; z , where x are values i k i j k j k P taken by the variables fX ; Y ; Z g. Parents, children and spouses of a node. Given a node in an oriented graph, three main types of relatives can be defined according to their relationships to this node: a parent, child or a spouse. A parent of node X is a node Y, such as Y →X . A child of node X is a node Z, such as X →Z. Finally, a spouse of node X is a node S, such as X →Y ←S. In this latter case, S is also a parent of Y. Spouses are defined as a pair of nodes that share at least one common child while sharing no other direct link. These graph-based definitions are depicted in Fig. 1. Generally speaking, the JPD P(S) can be rewritten as follows: PðS Þ ¼ ∏ni¼1 P X i jX i1; ⋯; X 1 . We usually denote the direct parents of a given variable, X, as C(X). With this notation, we derive the following equation: PðS Þ ¼ ∏ni¼1 PðX i jC ðX i ÞÞ. This form is the most compact expression of the JPD according to Bayes’ rule and using conditional independencies between variables. V-structures. Direct causation between X and Y, noted X -> Y, implies that X and Y are dependent given any conditioning set that does not contain X and Y [16]. Additionally, if we assume that no hidden common cause of two variables exists, for any conditioning set that does not contain X and Y, X and Y are conditionally dependent given these sets, giving either X →Y or Y →X. With this property, all adjacencies of the causal graph can be determined using CI tests, but edges cannot be oriented. In a special pattern known as a V-structure, CI tests can reveal the direction of causation [17]. In this pattern, two common causes, Y and Z, which are initially independent, become dependent when conditioned on a common effect, X.

We may consider the following example using three variables fX ; Y ; Zg, where X is caused by Y and Z. These dependencies translate as follows: Y →X ←Z. If we add the constraint that there is no edge between Y and Z, such a configuration is called a V-structure, and it can be opposed to two other possible configurations Y →X →Z or Y ←X →Z. V-structures are fundamental in Bayesian networks because they directly relate to conditional independency and information circulation between variables. Finally, the joint probability P(S) can be written as follows: PðS Þ ¼ PðX jY ; Z Þ:PðY Þ:PðZ Þ. We also can write the values of the JPD P(S) (Table 1). Markov blankets. The goal of the Markov blanket approach is to discover causation by detecting V-structures [18]. The Markov blanket of each variable, T of G, is the minimal set MB(T) whose variables contain information about T that other variables do not. In a DAG, it is composed of its parents, children and spouses. Stated otherwise, the Markov blanket M BðTÞ of a variable, T, is a minimal set for which CInd ðX; T jMBðTÞÞ ¼ 1 for all X in G fTg MBðTÞ, where C IndðX; T jMBðTÞÞ ¼ 1 if X and T are CI given MB(T) and C IndðX; T jMBðTÞÞ ¼ 0 otherwise. Unoriented graphs, partially directed acyclic graphs and directed acyclic graphs. Three types of networks are of interest. Unoriented graphs are graphs composed of nodes and edges depicting conditional dependencies between nodes, but they provide no clues on how information flows from one node to another. Partially directed acyclic graphs (PDAGs) are graphs that are similar to unoriented graphs with two supplementary properties: i) some edges are replaced with arrows, informing about information circulation between nodes, but there are still some edges in the graph (which explains the Bpartially^ directed), and ii) there is no cycle in the graph (i.e., if following arrows in the graph from any starting node, it is not possible to go back to the starting node). Finally, directed acyclic graphs (DAGs) are PDAGs in which all edges are replaced with arrows. A Bayesian network is a DAG.

Table 1 The distribution of the joint probability P(S)=P(X,Y,Z) where each variable is binary (0/1)

PðY ¼ 1Þ

PðX ¼ 1Þ PðX ¼ 0Þ

PðZ ¼ 1Þ X 111 X 011

PðY ¼ 0Þ PðZ ¼ 0Þ X 110 X 010

PðZ ¼ 1Þ X 101 X 001

PðZ ¼ 0Þ X 100 X 000

X 111 corresponds to the value of the joint probability P(S) when fX ; Y Zg ¼ f1; 1; 1g. The knowledge of the whole table fully determines the knowledge of the joint probability P(S).

Int J Legal Med

Structure learning algorithms Different algorithms can be used to retrieve the structure of a network. Constraint-based algorithms use tests to detect relationships among variables. Score-based algorithms select the structure that has the highest score of a function that measures how well the structure fits the data. Hybrid or local discovery algorithms utilise both score-based and constraint-based methods to construct the graph. Constraint-based algorithms The Incremental Association Markov Blanket (IAMB) algorithm is an example of a constraint-based algorithm [19]. It can be decomposed into two phases. The first phase uses Markov blankets to maximise a heuristic; for each variable, T, it builds an associated Markov blanket by selecting one variable, X, after another (other than the current variable T) and checks if that variable (X) maximises the heuristic. Otherwise stated, it searches for the variable X in G CM B fT g that maximises f ðX ; T jCMBÞ, where G represents the graph nodes, CMB represents the current Markov blanket and f() is the heuristic. A CI test is performed between T and X given the rest of the Markov blanket of T. If X and T are not independent, X is added to the Markov blanket of T. The second phase consists of removing all false positives from the Markov blanket (i.e., it removes from the current Markov blanket all variables X for which CIndðX; T jCM B−fX gÞ ¼ 1). A pseudo-code corresponding to the IAMB algorithm is described in [19]. These algorithms usually produce PDAGs that can be completed (i.e., fully oriented with different approaches). Score-based algorithms Score-based algorithms attempt to identify the best network in terms of a specific score (e.g., AIC or BIC score) in all possible Bayesian networks. One such example, the HillClimbing (HC) algorithm, is described in [20]. A score is computed for the initial graph, and the algorithm adds, deletes or reverses an arrow between a pair of nodes and computes the new score. If the new score is higher, it keeps going; an arrow is added, deleted or reversed between another pair of node, and the new score is computed. These steps are repeated until the score stops increasing. Operations on arrows are conducted so that no cycle appears in the graph. Such algorithms can be trapped in a local optimum, making it impossible to retrieve the best solution (i.e., a suboptimal solution). Hybrid or local discovery algorithms .The Algorithm for the Reconstruction of Accurate Cellular Networks (ARACNE) algorithm is composed of

two phases and it relies on the assumption that the JPD is a function of a Hamiltonian with a second order truncature [21]. It also makes use of mutual information and the data processing inequality (DPI). The JPD of variables fXi g, PðX 1 ; ⋯; X n Þ is assumed to be written N as follows: PðX 1 ; ⋯; X n Þ ¼ 1 exp ∑N φ ðXi Þ ∑ φ i ij i Z N i; j HðfXi gÞ , where N Xi ; X j ∑ φi jk Xi ; X j ; Xk ⋯≝e i; j;k

is the number of variables, Z is a normalisation factor, φ… are potentials, and HðfXi gÞ is the Hamiltonian defining the system’s statistics. Variables are supposed to interact only if the potential that depends exclusively on these variables is not null. The current version of ARACNE truncates the previous equation at the pairwise interactions level: Hð fXi gÞ ¼ ∑i φi ðXi Þ þ ∑i; j φi j ðXi ; X j Þ. All variables for which φi j ¼ 0 are assumed to be non-interacting. This includes variables that are statistically independent (i.e., P Xi ; X j ≈PðXi ÞP X j Þ, as well as variables that do not directly interact, but are statistically dependent through their association with other variables (i.e., P Xi ; X j ≠PðXi ÞP X j , but φi j ¼ 0). Different strategies to orient edges An efficient method relies on the search for and use of Vstructures. Another way to retrieve the orientation of an edge from a PDAG is to orient the remaining edges so that no cycle appears (DAGs are acyclic). As in score-based algorithms, edges can be oriented by comparing different graphs using different configurations where the edges are oriented according to their score. Criteria such as the Akaïke information criterion (AIC) or the Bayesian information criterion (BIC) are used to compare different graphs. Usually, the smaller the score, the better the model. In some cases, edges are oriented based on expert knowledge. Finally, if interventions can be conducted (e.g., an intervention allowing for measuring the impact on an observation X of a variation of Y), they can help orient edges according to experience. Sample size for structure learning Bayesian networks are usually computed for relatively small samples, as in genetics [15, 22–25]. A sample size greater than 100 is considered sufficient to estimate two-way marginals in genomics problems, whereas P Xi ; X j ; Xk requires an order of magnitude more samples [21]. To the best of our knowledge, there is no consensus on the minimum (and maximum) size of the sample needed for which a Bayesian network is optimally retrieved from data. Some theoretical lower and upper bounds have been suggested [23]. In the case of constraint-based algorithms, CI tests are usually not performed if the sample size is very small.

Int J Legal Med

Classically, tests assessing the CI of X and Y given set Z of two binary variables will not be performed unless there are at least ten data samples per different possible patterns of X, Y and Z. If X and Y were also binary, it would lead to a minimum of 2^4×10=160 samples. In many approaches (as in constraint-based or local discovery-based algorithms), tests are Blocal,^ and not necessarily exhaustive or massively multivariate; not all variables are tested for dependencies at the same time, allowing smaller samples than in classical approaches. For very small samples, only pairwise CI tests are performed to search for pairwise associations.

Application to the evaluation of the functional impairment of assault survivors Persons exposed to violence (e.g., assault survivors) have to face intricate perspectives concerning their own health and the judicial system. In most European countries, physicians assess the outcome of violence so the appropriate court sentence is based on factual pieces of evidence. Practices vary from one country to another, but similar data are collected when a victim is examined [26]. The description of traumatic injuries and the expected outcome on functioning are major items of evaluation. French law quantifies the seriousness of the violence committed as the duration of the victims’ inability to fulfil their usual daily activities. Functional impairment is quantified in days of ‘Total Incapacity to Work’ (TIW, Incapacité totale de Travail, ITT) [27]. In a previous study, we reviewed 1,064 consecutive medical files of victims of assaults evaluated in the Department of Forensic Science and Medicine from the Paris area, between 10/01/2010 and 11/22/2010. Trained forensic examiners collected data using a standardised questionnaire. Patient data are summarised in Table 2 and further details can be found in ref. 2. In that paper, we explored the potential determinants of the TIW using univariate and multivariable analyses to build a generalised linear model. Variables were included in the model if a p value below 0.1 was found when testing the association between TIW and the variable. In this model, six variables were included, plus one interaction term (Table 3). The link function was the identity. Here, we used the same data to build Bayesian networks and explore dependencies between variables, so we could perfect our comprehension of the TIW determinants. Continuous variables (age, TIW and time to evaluation) were transformed to categorical variables for this purpose. Indeed, most validated and widespread algorithms are conceived or coded for either continuous or categorical variables. Discretisation is also a good option for continuous data that do not correspond to specific probability distributions, mainly normal distributions. Compared to the first study, we added four categorical variables to this model: the

Table 2

Description of the studied population (n=1,064)

Gender Presence of aggravating factors

Patient category

Type of assailant Forensic physicians TIW duration (days) Time to evaluation (hours) Age (years)

Men Women 3 or more 2 1 0 Detainees Police officers Other individuals 20 categories 19 categories

N (%) 662 (62.2) 402 (37.8) 68 (6.4) 251 (23.6) 464 (43.6) 281 (26.4) 226 (21.2) 69 (6.5) 769 (72.3)

mean (SD) 3.4 (2.5) 46.1 (51.8) 31.3 (14.3)

N: number of patients. For the Bayesian analysis, we also considered traumatic injuries (present, with no bone fracture; absent), the type of assault (10 categories) and the place where the assault happened (19 categories)

assault place, the assailant category, the assault category and the forensic examiner. Software resources and packages to run Bayesian techniques Several software resources and packages, free or commercial, are available in growing numbers. Hugin, OpenBUGS, proBAYES and winBUGS and the librarie's Bayes Net Toolbox (Matlab) and bnlearn (R project) software are popular solutions. All analyses presented in this paper were conducted using the R statistical project release 2.15.0, with the bnlearn package [28]. The two main algorithms used to build our Bayesian networks were the ARACNE algorithm that provides unoriented graphs and the IAMB that provides partially or completely directed acyclic graphs [PDAG/DAG]). The IAMB algorithm delivers mainly PDAGs and rarely DAGs. The remaining unoriented edges were oriented such that the DAG presenting the best (i.e., higher) score (AIC and BIC scores) was selected among all alternative DAGs. Finally, we ran a score-based algorithm, the Hill-Climbing algorithm (HC), on the data.

Results Results from the multivariable analysis are given in Table 3. The presence of traumatic injuries, patient category and time to evaluation influenced the duration of TIW [2]. Figure 2 presents the unoriented graph, and the directed acyclic graph is shown in Fig. 3. Figure 2 demonstrates that dependencies

Int J Legal Med Table 3 Main results obtained via classical multivariable analysis (generalised linear model). The duration of Total Incapacity to Work (TIW, days) is taken as the outcome (n=1,064)

Constant Gender

Women Men

Age (years) Patient category

Presence of aggravating factors

Traumatic injuries Time to evaluation (hours) Interaction time-to-evaluation*patient category (hours)

Detainees Police officers Other individuals 3 or more 2 1 0 Present, with no fracture Absent Time*detainees Time*police officers Time*other individuals

Beta (estimate) [95 % CI]

Standard Error

p value

3.242 [2.020; 4.465] 0.092 [−0.220; 0.400] Ref 0.017 [0.007; 0.027]

0.624 0.157 Ref 0.005

Forensic science.

Expanding forensic science through forensic intelligence.

Information in forensic science.

Diatoms in forensic science.

Forensic science and animal rights.

Application of Kinect gaming sensor in forensic science.

Forensic nursing science knowledge and competency: the use of simulation.

Application of next-generation sequencing technology in forensic science.

SNPs and SNVs in forensic science.

The general principles of clinical forensic medicine and the place of forensic medicine in a modern society.

Scientific misconduct. Science meets forensic science.

Forensic imaging for causal investigation of death.

Forensic and clinical toxicology.

U.S. initiatives to strengthen forensic science & international standards in forensic DNA.

Drawbacks in the scientification of forensic science.

DNA methylation and application in forensic sciences.

Frequencies of salivary genetic marker systems in the Japanese population and their application to forensic medicine.

Molecular Imprinting Applications in Forensic Science.

Organic mass spectrometry in forensic science.

International co-operation in forensic science.

Application of soil in forensic science: residual odor and HRD dogs.

[Forensic autopsy and clinical autopsy].

Current and future directions of DNA in wildlife forensic science.

A 21-locus autosomal SNP multiplex and its application in forensic science.