Int J Legal Med DOI 10.1007/s00414-014-1070-5

ORIGINAL ARTICLE

Sex estimation from the tarsal bones in a Portuguese sample: a machine learning approach

David Navega & Ricardo Vicente & Duarte N. Vieira & Ann H. Ross & Eugénia Cunha

Received: 31 March 2014 / Accepted: 28 August 2014
© Springer-Verlag Berlin Heidelberg 2014

Abstract Sex estimation is extremely important in the analysis of human remains, as many of the subsequent biological parameters are sex specific (e.g., age at death, stature, and ancestry). When dealing with incomplete or fragmented remains, metric analysis of the tarsal bones of the feet has proven valuable. In this study, the utility of 18 width, length, and height tarsal measurements was assessed for sex-related variation in a Portuguese sample. A total of 150 males and 150 females from the Coimbra Identified Skeletal Collection were used to develop sex prediction models based on statistical and machine learning algorithms such as discriminant function analysis, logistic regression, classification trees, and artificial neural networks. All models were evaluated using 10-fold cross-validation and an independent test sample composed of 30 males and 30 females from the Identified Skeletal Collection of the 21st Century. Results showed that tarsal bone sex-related variation can be easily captured with a high degree of repeatability. A simple tree-based multivariate algorithm involving measurements from the calcaneus, talus, first and third cuneiforms, and cuboid resulted in 88.3 % correct sex estimation on both the training and independent test sets. Traditional statistical classifiers such as discriminant function analysis were outperformed by machine learning techniques. The results show that machine learning algorithms are an important tool that forensic practitioners should consider when developing new standards for sex estimation.

Keywords Sex estimation · Forensic anthropology · Tarsal bones · Machine learning

D. Navega · R. Vicente · D. N. Vieira · E. Cunha
Forensic Sciences Centre (CENCIFOR), Largo da Sé Nova, s/n, 3000-213 Coimbra, Portugal

D. Navega (*) · R. Vicente · E. Cunha
Department of Life Sciences, Faculty of Sciences and Technology, University of Coimbra, Calçada Martim de Freitas, 3000-456 Coimbra, Portugal
e-mail: [email protected]

A. H. Ross
Department of Sociology and Anthropology, North Carolina State University, Raleigh, NC 27695-8107, USA

D. N. Vieira
Faculty of Medicine, University of Coimbra, Rua Larga, 3004-504 Coimbra, Portugal

Introduction

The estimation of biological sex is one of the four pillars (i.e., sex, age at death, ancestry, and stature) in the forensic analysis of human skeletal remains. The most diagnostic area for sex estimation is the pelvis, assessed through morphological and sometimes metric analysis [1, 2]. Modern forensic anthropological analysis can rely on tools such as geometric morphometrics and medical imaging to acquire, analyze, and quantify sex-related variation in human skeletal structures [3, 4]. However, population variation and secular change (i.e., change over time) require continuous research and updating of biological profiling standards so that methods reflect and account for population biology at a regional and temporal level [5].

The degree of completeness and preservation of the remains presents a major limitation in the evaluation of any parameter of the biological profile. In many cases, only the cranium and long bones are present or, worse, just a few bone fragments are preserved, as in cases of mass disaster or cremated human remains [6]. As an alternative, tarsal bones have a robust and compact structure, which renders them very resistant to taphonomic factors and of particular value for the forensic analysis of incomplete or fragmented human remains [7–9], particularly in sex estimation.


The usefulness of tarsal bones in sex estimation was first documented by Steele [10] through his metric analysis of the calcaneus and talus of European-Americans and African-Americans from the Robert J. Terry Anatomical Collection. He reported sexing accuracy rates that ranged from 79 to 89 % using linear discriminant function analysis (DFA). Building on these results, several authors analyzed and developed new classification models for sex estimation using the talus and calcaneus in prehistoric and contemporary samples from different geographic regions [8, 11–19]. Their results confirmed the findings of Steele, with similar accuracy rates being obtained.

More recently, Harris and Case [9] extended the analysis of sexual dimorphism to the other tarsal bones in a sample of contemporary European-Americans from the William M. Bass Donated Skeletal Collection using a set of 18 newly developed measurements. Harris and Case [9] ran univariate and multivariate logistic regression models for sex estimation, with correct sex allocation rates that ranged from 80 to 93.6 %. In addition to the talus and calcaneus, the cuboid and the first and third cuneiforms proved suitable for sex estimation.

Sex estimation standards using measurements of the tarsal bones have been constructed exclusively with linear discriminant functions and logistic regression models. While these two statistical techniques are well suited for the purpose of classification, they have several limitations. For example, discriminant function analysis has to meet certain assumptions, namely multivariate normality and homoscedasticity of the variance/covariance matrices. When those assumptions are not met, which is common in real-world applications, logistic regression is a more suitable technique.
However, different numerical optimization routines can be used to fit a logistic regression model, and some of them can produce optimistically biased classification rates [20]. This is particularly important in forensic anthropology, where generally only resubstitution accuracy rates are reported. The aforementioned statistical methods have been shown to be outperformed by machine learning techniques in different disciplines, including forensic anthropology [21–24]. In this study, we evaluated the suitability of different machine learning techniques for developing standards for sex estimation from the tarsal bones in a sample of Portuguese origin.

Machine learning (ML) is a scientific field where disciplines such as artificial intelligence, computational statistics, cognitive science, and information theory converge. The core of machine learning is to develop algorithms that can learn and map underlying properties and structural patterns of data, which can later be used to understand or predict a specific phenomenon [25, 26]. Examples of such algorithms are artificial neural networks, classification and regression trees, support vector machines, and ensemble models (e.g., bagging, random forest). A major difference between machine learning methods and traditional statistical learning techniques is that ML algorithms make more exhaustive use of the training data. ML techniques do not operate solely on group parameters such as means and covariance matrices. Instead, many ML algorithms iteratively solve optimization problems in order to find the set of neuron weights or the tree structure that provides a decision boundary, often non-linear, with the maximum classification rate. Generally, ML methods are not required to meet particular statistical assumptions such as normality, since they are data-driven; this property, however, requires a more rigorous training and validation process in order to avoid under- and over-fitting. A light and practical introduction to this field can be found in the book by Witten et al. [26]; for readers looking for a more theoretical and mathematical view of machine and statistical learning, the books by Mitchell [25] and Hastie et al. [27] are suggested.

Although scarce, there are some publications in which machine learning techniques were applied to forensic anthropological problems addressing biological profile estimation. McBride et al. [28] illustrated the accuracy and automatic variable selection feature of the ID3 learner in morphological sex estimation from the os coxae. Du Jardin et al. [22] and Mahfouz [23] both provide examples of sex estimation using artificial neural networks. These two studies are of particular interest because they illustrate that traditional statistical classifiers can be easily outperformed by machine learning methods. Moore and Schaefer [29] employed a regression tree to model body weight estimation from skeletal remains. Corsini et al. [30] and Buk et al. [31] explored the suitability of artificial neural networks and ensemble models in age-at-death estimation. Most recently, Hefner et al. [32] and Hefner and Ousley [24] demonstrated the usefulness of machine learning methods such as artificial neural networks and random forest ensembles in the estimation of ancestry using morphoscopic traits.

The purpose of this study is twofold. First, it aims to analyze the utility of the seven tarsal bones in sex estimation of a Portuguese sample using the measurements derived by Harris and Case [9]. Second, it aims to evaluate the accuracy achieved when using classical statistical classifiers, such as linear discriminant function analysis, and more modern and sophisticated machine learning classification algorithms, which are not commonly applied by forensic practitioners.

Material and methods

Samples

Two samples of tarsal bones from documented individuals of Portuguese origin were used in this study. The first sample is composed of 150 males and 150 females from the Coimbra Identified Skeletal Collection (CISC) and was used as the training set and to cross-validate all the computational models developed in this study. The death dates of individuals in this sample range between 1904 and 1939, and their ages at death range from 20 to 79 years, with a homogeneous distribution in both sexes. The test sample comprised 30 males and 30 females from the Identified Skeletal Collection of the 21st Century (ISC-XXI), with death dates ranging between 1995 and 2001 and ages at death between 32 and 95 years. At the time of data collection, only part of the ISC-XXI was available for analysis, which limited the number of individuals included in this study. The ISC-XXI sample is therefore not large enough to be used with confidence as a training sample, as it ideally would be given that it represents contemporary Portuguese individuals. However, the experimental design allowed us to evaluate the possible effect of temporal variation on the accuracy of the classification algorithms, given the considerable temporal distance between the two samples.

Methods

Data collection and measurement error

The method of Harris and Case [9] for capturing the dimensions of the seven tarsal bones was followed. This method includes 18 measurements (Table 1) designed to capture tarsal bone dimensions along three axes: length, breadth, and height. All measurements were collected on the left side with a specialized mini-osteometric board available from Paleo-Tech Concepts® and recorded to the nearest millimeter (mm). Detailed descriptions and illustrations of the measurements can be found in the original publication [9]. Only individuals with normal morphology (i.e., no pathological or traumatic changes in bone structure) were analyzed. Intra- and inter-observer error analysis was conducted on a sub-sample of 50 individuals to assess the repeatability of each measurement.
The first author collected the 18 tarsal measurements on the 50 individuals in two separate sessions (1 month apart); these two data sets were used to perform the intra-observer error analysis. The second author performed the same measurements on the same individuals in a separate session, and the resulting data were compared with the first set of measurements collected by the first observer in order to conduct an inter-observer error analysis. Absolute and relative technical error of measurement (TEM and %TEM) were computed following Ulijaszek and Kerr [33] as indicators of the repeatability of each measurement.
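The two repeatability indicators can be sketched as follows, assuming the usual two-session form of the formula, TEM = sqrt(Σd² / 2n), where d is the difference between paired measurements and n the number of individuals, with %TEM expressed relative to the grand mean of both sessions. The function names and the sample values are illustrative only and are not taken from the study data.

```python
import math


def tem(first, second):
    """Absolute technical error of measurement for two paired sessions."""
    if len(first) != len(second) or not first:
        raise ValueError("need two equally long, non-empty series")
    # Sum of squared differences between the paired measurements.
    squared_diffs = sum((a - b) ** 2 for a, b in zip(first, second))
    return math.sqrt(squared_diffs / (2 * len(first)))


def relative_tem(first, second):
    """%TEM: the TEM as a percentage of the grand mean of both sessions."""
    grand_mean = (sum(first) + sum(second)) / (2 * len(first))
    return 100.0 * tem(first, second) / grand_mean


# Hypothetical repeated measurements (mm) of one dimension on four individuals.
session_1 = [72.0, 75.5, 80.0, 78.5]
session_2 = [72.5, 75.0, 80.0, 79.0]

print(f"TEM  = {tem(session_1, session_2):.2f} mm")
print(f"%TEM = {relative_tem(session_1, session_2):.2f} %")
```

For the intra-observer analysis the two series would be the two sessions of the same observer; for the inter-observer analysis they would be the first session of each observer.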

Table 1 Tarsal measurements used in this study

Measurement        Abbreviation  Measurement               Abbreviation  Measurement              Abbreviation
Calcaneus length   CalcLg        First cuneiform length    CF1Lg         Third cuneiform length   CF3Lg
Talus length       TalLg         First cuneiform breadth   CF1Brd        Third cuneiform breadth  CF3Brd
Talus breadth      TalBrd        First cuneiform height    CF1Ht         Third cuneiform height   CF3Ht
Talus height       TalHt         Second cuneiform length   CF2Lg         Cuboid length            CubLg
Navicular length   NavLg         Second cuneiform breadth  CF2Brd        Cuboid breadth           CubBrd
Navicular breadth  NavBrd        Second cuneiform height   CF2Ht         Cuboid height            CubHt

Detailed descriptions and illustrations can be found in Harris and Case [9]

Machine learning algorithms for sex estimation

Sex estimation in forensic anthropology is generally achieved by the application of a classification algorithm, which can be either a simple set of rules created by a human expert or a complex function constructed through statistical or artificial intelligence algorithms. In machine and statistical learning, a classification algorithm is an inductive algorithm that, from a given labeled dataset, constructs a model by extracting general parameters and patterns that allow discrimination between dissimilar groups. An important concept in machine learning is representation. The efficacy and interpretability of the group discrimination are constrained by the type of representation assumed to produce the mapping between the predictors and the target variable [25, 26, 34–36]. The output of an ML technique takes the form of a model or representation with a specific structure that can range from a simple linear model to a complex equation system such as those found in artificial neural network models. Different representations or models thus learn to discriminate groups in different ways, so different assumptions and constraints lead to differential accuracy among algorithms depending on the problem.

We evaluated several algorithms and learning paradigms, such as rule and decision tree induction, statistical learning, probabilistic learning, instance-based learning (k-nearest neighbor), and artificial neural networks, for the purpose of sex estimation from the tarsal bones. All models used in this study, with the exception of linear discriminant function analysis, were trained with WEKA (Waikato Environment for Knowledge Analysis) [37], a freely available open-source Java software package that provides implementations of different machine learning algorithms, from the simplest to the most complex. The algorithms used in this study are listed in Table 2 by learning paradigm, including the abbreviation/command used to train them in WEKA, the settings used, and the relevant reference for a detailed reading about each algorithm. The linear discriminant function analysis (DFA) model was trained and evaluated in MATLAB (MathWorks). In the k-nearest neighbor (KNN) model, the value of k was first set to the square root of the number of individuals in the training sample (a simple heuristic, see [34]); then an internal cross-validation cycle, looping from 1 to this initial guess of k, was applied to find the optimal value of k given the training data.

Feature selection

To maximize the utility of a classification algorithm, the most discriminative features should be used during the training process. The process of selecting a subset of variables that maximizes classification accuracy is called feature selection. Some of the algorithms used in this study, namely rule and tree induction, have internal feature selection mechanisms, which select the most relevant features during their greedy “divide-and-conquer” training process. Algorithms such as the k-nearest neighbor and naïve Bayes classifiers lack such a mechanism, and their performance can be heavily degraded by the presence of irrelevant and redundant attributes. Feature selection was performed in this study with a filter method, which means that feature selection is done before the training process of each algorithm [26]. Correlation-based feature selection by exhaustive search was performed to select the most discriminative dimensions of the tarsal bones. This technique selects the skeletal features in this sample (osteometric dimensions) that are most strongly related to sex while having the least correlation among themselves, which avoids redundancy among the selected features. The filter approach was chosen to allow a more robust comparison between the algorithms. This approach also avoided generating predictive models from tarsal bones and dimensions that had little or no relation to sex, making data modeling less time-consuming. To avoid making feature selection highly dependent on the training sample, it was performed within 10-fold cross-validation. All feature selection processes were conducted in WEKA.

Model evaluation

To assess the performance of the algorithms, overall accuracy, the kappa statistic, recall, and precision were analyzed. Overall accuracy is a measure of total agreement between the real sex and the estimated sex and is calculated by dividing the number of correctly estimated instances by the total number of instances. The kappa statistic is also a measure of total agreement, but it corrects for agreement that occurs by chance [26]. Recall is a group-specific measure of performance and can be read as the probability of the algorithm correctly estimating the sex of an individual of a given group. It is obtained by dividing the number of correctly sexed individuals by the total number of individuals in the group. Precision
refers to the probability of a sex estimation being correct given the prediction of the algorithm. For example, the precision of an algorithm for male individuals is obtained by dividing the number of correctly identified males by the number of all individuals estimated as males. In the medical literature, recall goes by the name of sensitivity and precision by predictive value. Stratified 10-fold cross-validation and an independent test sample were used to calculate unbiased estimates of the performance indicators.

Other statistical procedures

Normality of the data and differences between sexes were assessed using the D’Agostino K2 omnibus normality test and Student’s t tests, respectively.

Table 2 Classification algorithms applied in the current study

Learning paradigm           Algorithm                                                WEKA command          WEKA settings                                                      Reference
Instance-based              k-nearest neighbor (Euclidean distance)                  IBk                   IBk -K 17 -W 0 -X -I -A                                            Aha and Kibler [23]
Rule induction              PART decision list                                       PART                  PART -M 2 -C 0.25 -Q 1                                             Frank and Witten [24]
Tree induction              Classification and regression tree                       SimpleCart            SimpleCart -S 1 -M 2.0 -N 5 -H -C 1.0                              Breiman et al. [25]
                            Alternating decision tree                                ADTree                ADTree -B 10 -E -3                                                 Freund and Mason [26]
                            Best-first decision tree                                 BFTree                BFTree -S 1 -M 2 -N 5 -C 1.0 -P POSTPRUNED                         Shi [27], Friedman et al. [28]
                            Decision tree with naïve Bayes classifier at the leaves  NBTree                NBTree                                                             Kohavi [29]
Probabilistic learning      Gaussian naïve Bayes                                     NaiveBayes            NaiveBayes                                                         John and Langley [30]
                            Kernel density naïve Bayes                               NaiveBayes            NaiveBayes -K
                            Discretized predictors naïve Bayes                       NaiveBayes            NaiveBayes -D
Statistical learning        Linear discriminant analysis                             –                     –                                                                  Duda et al. [18]
                            Logistic regression                                      SimpleLogistic        SimpleLogistic -I 0 -S -M 500 -H 50 -W 0.0                         Landwehr et al. [31]
                            Logistic regression with ridge parameter                 Logistic              Logistic -R 1.0E-8 -M -1                                           Cessie and van Houwelingen [32]
Artificial neural networks  Multilayer perceptron (backpropagation)                  MultilayerPerceptron  MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a -D  Mitchell [17], Bishop [19], Witten et al. [20]

Results

Intra- and inter-observer error analysis (Table 3) showed that all measurements were collected with very good repeatability. Measurements collected by the same observer diverged on average by 0.31 mm, and measurements collected by different observers diverged on average by 0.33 mm. The relative technical error of measurement (%TEM), the amount of variation in the data that is explained by measurement error, was on average 1.27 % in both situations. The features CF2Ht, CF3Ht, and CubHt were somewhat difficult to measure due to variation in bone morphology, which was evident in their higher values of TEM and %TEM.

Table 3 Intra- and inter-observer error statistics for tarsal measurements

             Intra-observer    Inter-observer
Measurement  TEM (mm)  %TEM    TEM (mm)  %TEM
CalcLg       0.2       0.26    0.41      0.53
TalLg        0.2       0.37    0.4       0.73
TalBrd       0.24      0.6     0.3       0.75
TalHt        0.14      0.46    0.14      0.46
NavLg        0.28      1.39    0.32      1.59
NavBrd       0.26      0.68    0.4       1.05
CF1Lg        0.32      1.26    0.26      1.02
CF1Brd       0.2       1.15    0.22      1.26
CF1Ht        0.47      1.53    0.24      0.78
CF2Lg        0.26      1.46    0.33      1.85
CF2Brd       0.22      1.4     0.24      1.53
CF2Ht        0.54      2.59    0.41      1.97
CF3Lg        0.39      1.64    0.37      1.56
CF3Brd       0.22      1.39    0.26      1.64
CF3Ht        0.56      2.46    0.24      1.05
CubLg        0.17      0.48    0.14      0.39
CubBrd       0.48      1.8     0.51      1.91
CubHt        0.48      2.05    0.69      2.95

Table 4 Descriptive statistics and t test between males and females of the training sample (CISC)

             Females                          Males                            Student’s t test
Measurement  n    Mean   SD    K2    p value  n    Mean   SD    K2    p value  t      p value
CalcLg       150  72.57  3.59  0.19  0.912    150  81.47  3.94  0.02  0.988    20.48  0.000
TalLg        150  51.32  2.81  2.90  0.235    150  58.54  3.36  0.00  1.00     20.2   0.000
TalBrd       150  37.59  2.17  0.06  0.972    150  42.35  2.4   0.07  0.965    17.98  0.000
TalHt        150  28.85  1.6   0.04  0.982    150  32.47  1.72  0.36  0.835    18.82  0.000
NavLg        150  19.33  1.34  0.02  0.988    150  21.37  1.46  0.03  0.985    12.59  0.000
NavBrd       150  35.97  2.51  2.86  0.239    150  39.93  2.55  2.09  0.352    13.85  0.000
CF1Lg        150  24.11  1.47  0.18  0.914    150  26.76  1.67  0.06  0.971    14.61  0.000
CF1Brd       150  16.56  1.37  0.01  0.994    150  18.76  1.25  0.06  0.969    14.56  0.000
CF1Ht        150  28.86  1.62  0.07  0.965    150  32.27  1.84  0.03  0.987    17.04  0.000
CF2Lg        150  16.83  1.06  1.08  0.584    150  18.41  1.24  0.01  0.994    11.77  0.000
CF2Brd       150  15.03  1.18  0.05  0.977    150  16.59  1.27  0.16  0.923    11.03  0.000
CF2Ht        150  19.79  1.46  0.22  0.895    150  21.81  1.58  0.08  0.96     11.49  0.000
CF3Lg        150  22.26  1.28  0.09  0.955    150  24.57  1.51  0.03  0.983    14.27  0.000
CF3Brd       150  14.92  1.1   2.75  0.253    150  16.8   1.07  0.04  0.981    15.83  0.000
CF3Ht        150  21.73  1.87  0.79  0.675    150  24.16  1.56  0.16  0.923    12.22  0.000
CubLg        150  33.31  2.29  0.06  0.971    150  36.93  2.49  0.28  0.871    13.1   0.000
CubBrd       150  25.13  1.67  0.06  0.97     150  28.03  1.83  0.06  0.97     14.32  0.000
CubHt        150  22.15  1.48  0.16  0.925    150  24.88  1.79  1.53  0.465    14.36  0.000
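As a rough cross-check of Table 4, the Student's t statistic can be recomputed from the reported means, standard deviations, and sample sizes alone. The sketch below assumes the classical pooled-variance (equal variances) form of the test; the function name is hypothetical. Because the tabulated summary values are rounded, the result should approximate rather than exactly reproduce the reported t.

```python
import math


def pooled_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Student's t (equal variances assumed) from summary statistics.

    Returns the t statistic for (mean_b - mean_a) and the degrees of freedom.
    """
    df = n_a + n_b - 2
    # Pooled variance: group variances weighted by their degrees of freedom.
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_b - mean_a) / std_err, df


# CalcLg summary values as listed in Table 4 (females first, then males).
t_stat, df = pooled_t(72.57, 3.59, 150, 81.47, 3.94, 150)
print(f"t = {t_stat:.2f}, df = {df}")
```

For CalcLg this gives a value close to the t of 20.48 listed in Table 4. With equal group sizes, as here, the pooled and unpooled standard errors coincide, so the choice between the two forms of the test does not change the statistic.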

Table 5 Results of measurement selection through cross-validated correlation-based feature selection

Measurement  n        Measurement  n        Measurement  n
CalcLg       +10/10   CF1Lg        +10/10   CF3Lg        6/10
TalLg        +10/10   CF1Brd       5/10     CF3Brd       +10/10
TalBrd       +10/10   CF1Ht        0/10     CF3Ht        +10/10
TalHt        +10/10   CF2Lg        1/10     CubLg        0/10
NavLg        0/10     CF2Brd       0/10     CubBrd       6/10
NavBrd       0/10     CF2Ht        0/10     CubHt        +9/10

n number of times a measurement is in the subset selected by the feature selection algorithm during the cross-validation
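Table 5 counts how often each measurement was retained when correlation-based feature selection was repeated across the 10 cross-validation folds. The sketch below illustrates the merit heuristic usually associated with this technique, merit = k·r_cf / sqrt(k + k(k − 1)·r_ff), where k is the subset size, r_cf the mean feature-class correlation, and r_ff the mean feature-feature correlation, combined with an exhaustive search over all subsets. It rests on stated assumptions: absolute Pearson correlation stands in for whatever correlation measure the WEKA implementation uses, and the feature names and values are invented toy data, not measurements from the study.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def cfs_merit(subset, features, target):
    """Merit of a subset: high class correlation, low inter-correlation."""
    k = len(subset)
    r_cf = sum(abs(pearson(features[f], target)) for f in subset) / k
    pairs = list(combinations(subset, 2))
    r_ff = 0.0
    if pairs:
        r_ff = sum(abs(pearson(features[a], features[b]))
                   for a, b in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def best_subset(features, target):
    """Exhaustive search: score every non-empty subset, keep the best."""
    names = sorted(features)
    candidates = (subset
                  for size in range(1, len(names) + 1)
                  for subset in combinations(names, size))
    return max(candidates, key=lambda s: cfs_merit(s, features, target))


# Toy data: sex coded 0/1, two informative features, one irrelevant feature.
sex = [0, 0, 0, 0, 1, 1, 1, 1]
toy_features = {
    "length_a": [1, 2, 1, 2, 4, 5, 4, 5],
    "length_b": [1, 2, 2, 1, 4, 5, 5, 4],
    "noise": [3, 1, 3, 1, 3, 1, 3, 1],
}

print(best_subset(toy_features, sex))
```

On this toy data the search keeps the two informative features and drops the irrelevant one. Repeating the search on each training fold and counting how often a feature appears in the selected subset would give tallies of the kind reported in Table 5.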

Table 4 reports the descriptive statistics (mean and standard deviation), the results of the D’Agostino K2 omnibus normality test for males and females, and the results of Student’s t tests evaluating differences between sexes in individuals from the CISC. All measurements were normally distributed in both sexes. As expected, all measurements were on average larger in males. All tarsal dimensions presented statistically significant differences between sexes (p value
