
Biometrical Journal 56 (2014) 5, 778–780

DOI: 10.1002/bimj.201300183

Discussion

OODA of graph and tree-structured data

Ela Sienkiewicz and Haonan Wang∗

Department of Statistics, Colorado State University, CO 80523, USA

Received 10 September 2013; revised 10 December 2013; accepted 23 December 2013

This is a discussion of the paper: “Overview of object oriented data analysis” by J. Steve Marron and Andrés M. Alonso.

Keywords: Data objects; Functional data analysis; Regression.

In the past two decades, with the development of new data collection and storage techniques, complex data objects have become increasingly common. In their paper, Marron and Alonso (2014) discuss some frequently encountered object types, including curves, shapes, and images, and they briefly mention graph- and tree-structured data. There is a rich literature devoted to probability measures on the space of random graphs; see Banks and Constantine (1998) for more details. Recently, the development of web-based social networks has triggered renewed interest in this subject; see, for instance, Newman et al. (2002) and Snijders et al. (2010), among others.

Trees, which are a simple form of graph, come with their own set of problems and applications. The authors have summarized two different types of tree-structured data objects. The first type arises in the context of phylogenetic trees, which are used to describe the relations among a set of leaves representing genes or species. The second type is motivated by the analysis of sets of blood vessel trees or neuron trees, which possess both topological and geometric properties. As the authors pointed out, the set of tree-structured objects of the first type can be embedded into a manifold stratified space. However, such a framework cannot be applied directly to the second type. The second type of trees, specifically binary trees, is the focus of our discussion.

1 Node labeling and correspondence

In an earlier paper, one of the authors and his collaborator developed the major theoretical framework for the statistical analysis of tree-structured data (Wang and Marron, 2007). In that paper the authors introduced a metric to quantify topological differences on the space of random binary trees. This metric is based on assigning a labeling system to the nodes of each tree. The so-called level-order index is assigned recursively to all nodes: (i) the root node has index 1; (ii) for a node with index k, its left child and right child are assigned indices 2k and 2k + 1. If IND(t) denotes the level-order index set of a tree t, then for any two binary trees s and t, the integer tree metric is defined as

$$ d(s, t) = \sum_{k=1}^{\infty} 1\{k \in \mathrm{IND}(s) \,\triangle\, \mathrm{IND}(t)\}, $$

∗ Corresponding author: e-mail: [email protected], Phone: +1-970-491-2449

© 2014 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim


where $\triangle$ denotes the symmetric difference of two sets. Ambiguity may arise due to the designation of left child and right child. For example, consider two tree-structured objects s and t with level-order index sets

IND(s) = {1, 2, 3, 4, 5} and IND(t) = {1, 2, 3, 6, 7}.

Topologically, trees s and t are identical, since one can be obtained from the other by node flipping (Shen et al., 2013); yet their distance under the integer tree metric is d(s, t) = 4. To solve this problem, Aydın et al. (2009) proposed thickness correspondence and descendant correspondence for node assignment. An alternative view of this issue is to consider an equivalence relation defined by certain tree operations, for example, node flipping. The tree space can then be written as a collection of equivalence classes. It is of interest to define tree metrics on these equivalence classes, which is crucial for unlabeled tree-structured objects. Such a metric would be invariant under those node operations, which may further enable us to quantify the difference between two trees independently of the choice of node assignment.
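The example above can be checked with a short sketch. Trees are represented here simply as Python sets of level-order indices. The function names `integer_tree_metric` and `flip_invariant_code` are hypothetical, and the parenthesis encoding used to test equivalence under node flipping is a standard device for unordered trees chosen for illustration; it is not one of the correspondences or metrics proposed in the cited papers.

```python
def integer_tree_metric(ind_s, ind_t):
    """Integer tree metric: size of the symmetric difference of the index sets."""
    return len(set(ind_s) ^ set(ind_t))


def flip_invariant_code(ind, k=1):
    """Encode the subtree rooted at index k so that swapping left and right
    children anywhere in the tree leaves the code unchanged (illustrative only)."""
    if k not in ind:
        return ""
    left = flip_invariant_code(ind, 2 * k)
    right = flip_invariant_code(ind, 2 * k + 1)
    first, second = sorted([left, right])  # order of children no longer matters
    return "(" + first + second + ")"


s = {1, 2, 3, 4, 5}
t = {1, 2, 3, 6, 7}

print(integer_tree_metric(s, t))                           # 4
print(flip_invariant_code(s) == flip_invariant_code(t))    # True
```

The two trees are four units apart under the integer tree metric even though they fall in the same equivalence class under node flipping, which is the ambiguity described in the text.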

2 Regression analysis

Regression analysis is one of the most widely used tools in statistics, and one of its major goals is to model the relationship between a response variable and predictor variables. For tree-structured objects, an intuitive approach is to model a numeric summary (e.g., the number of branches) extracted from the tree-structured objects. Standard techniques, including linear regression and nonparametric/semiparametric regression, can then be applied directly to these summaries. For instance, Chang et al. (2013) studied the branching properties of tree-structured objects. In particular, they proposed a generalized mixed effect model to characterize the relationships between tree branching probability and various covariates.

In contrast, Wang et al. (2012) considered a regression problem on trees directly, with a tree-structured response and a numeric covariate. Due to the non-Euclidean nature of the tree space, it is rather difficult to develop an analog of “linear” regression. The authors proposed a nonparametric regression technique that generalizes Nadaraya-Watson kernel regression to tree space. In particular, they formulated the regression problem as an optimization problem that can be solved in linear time. Given the sample data $\{(x_1, t_1), \ldots, (x_n, t_n)\}$, the estimated tree at x, denoted by $\hat{t}(x)$, can be found as

$$ \hat{t}(x) = \arg\min_{t} \sum_{i=1}^{n} d(t, t_i)\, K_h(x - x_i), $$

where $d(t, t_i)$ is the integer tree distance between tree t and tree $t_i$, and $K_h$ is a standard kernel function with bandwidth h. An open problem is to develop asymptotic properties of $\hat{t}(x)$ as n → ∞, which requires further careful investigation.
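A minimal sketch of this estimator follows, under stated assumptions. Because the integer tree metric is a sum of per-node indicators, the objective separates node by node: including index k costs the total kernel weight of the sample trees that lack k, and excluding it costs the weight of those that contain k. A weighted majority vote over nodes therefore minimizes the objective, and the result is a valid tree because a parent always carries at least as much weight as its child. This is an illustration of that observation and not necessarily the algorithm of Wang et al. (2012); the Gaussian kernel, the function names, and the toy data are all assumptions.

```python
import math


def gaussian_kernel(u, h):
    """K_h(u) for a Gaussian kernel with bandwidth h (an assumed kernel choice)."""
    return math.exp(-0.5 * (u / h) ** 2) / h


def objective(tree, x, xs, trees, h):
    """Kernel-weighted sum of integer tree distances from `tree` to the sample."""
    return sum(len(tree ^ ti) * gaussian_kernel(x - xi, h)
               for xi, ti in zip(xs, trees))


def kernel_tree_estimate(x, xs, trees, h):
    """Weighted majority vote: keep index k when the trees containing k carry
    more than half of the total kernel weight at x."""
    weights = [gaussian_kernel(x - xi, h) for xi in xs]
    total = sum(weights)
    node_weight = {}
    for w, tree in zip(weights, trees):
        for k in tree:
            node_weight[k] = node_weight.get(k, 0.0) + w
    return {k for k, w in node_weight.items() if w > total / 2}


# Toy sample: covariate values and trees given as level-order index sets.
xs = [0.0, 0.1, 1.0]
trees = [{1, 2, 3}, {1, 2, 3, 4, 5}, {1}]

for x in (0.0, 0.1, 1.0):
    print(x, sorted(kernel_tree_estimate(x, xs, trees, h=0.2)))
```

Each evaluation makes a single pass over the nodes of the sample trees, so the cost grows linearly with the total number of nodes in the sample.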

3 Trees, curves, and functional data

There is a benefit to moving the statistical analysis of complex tree objects from the space of trees to a function space. The space of trees is non-Euclidean, as discussed in Wang and Marron (2007), and linear operations are not well defined on it. Functional data analysis offers a powerful apparatus for inference; see Ramsay and Silverman (1997) for more details. The idea of a correspondence between trees and curves dates back to Harris (1952) and his work on stochastic processes, in particular branching processes. In his seminal work, Harris analyzed paths drawn by the depth-first traversal of a tree. The so-called Harris paths (a.k.a. Dyck paths) are very useful for studying tree asymptotics, but not particularly convenient for comparing different trees or groups of trees.
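To make the tree-to-curve correspondence concrete, the sketch below records the depth of the current node at each step of a depth-first traversal, which is one common convention for the Harris path. Trees are given as sets of level-order indices (root 1, children 2k and 2k + 1); the function name `harris_path` and this representation are assumptions made for illustration.

```python
def harris_path(ind, k=1, depth=0):
    """Depths visited by a depth-first walk of the subtree rooted at index k,
    going down each existing child (left first) and back up again."""
    path = [depth]
    for child in (2 * k, 2 * k + 1):
        if child in ind:
            path.extend(harris_path(ind, child, depth + 1))
            path.append(depth)  # return to the current node
    return path


s = {1, 2, 3, 4, 5}
t = {1, 2, 3, 6, 7}

print(harris_path(s))  # [0, 1, 2, 1, 2, 1, 0, 1, 0]
print(harris_path(t))  # [0, 1, 0, 1, 2, 1, 2, 1, 0]
```

The trees s and t differ only by a node flip, yet their paths differ, which shows how strongly this representation depends on the left and right designation.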

www.biometrical-journal.com


Recently, Shen et al. (2013) suggested a modification to Dyck paths. By embedding the trees in a support tree (the union of all trees in the sample), comparison between trees becomes feasible, and commonly used functional data techniques can be applied to the resulting tree curves. It is worth mentioning that both the classical Harris path and the modified Dyck path are very sensitive to the assignment of left and right nodes. In a recent study, as yet unpublished, we proposed a new curve representation for unlabeled tree-structured data objects. This curve representation can also be generalized to forests of binary trees.

Now, our question is: are we treating tree-structured objects as too Euclidean? In addition, can we interpret the results from functional data tools in the same fashion as if the data were functional rather than trees? For example, the mean curve may not correspond to a tree-structured object. The answers to these questions remain unclear, and this line of research is still ongoing. A possible approach is to consider a multiobjective optimization problem. In particular, based on curve representations, we can quantify topological and geometric variations separately, and both can be adopted as objective functions in the optimization problem.

Conflict of interest

The authors have declared no conflict of interest.

References

Aydın, B., Pataki, G., Wang, H., Bullitt, E. and Marron, J. (2009). A principal component analysis for trees. The Annals of Applied Statistics 3, 1597–1615.

Banks, D. and Constantine, G. (1998). Metric models for random graphs. Journal of Classification 15, 199–223.

Chang, H.-W., Iyer, H., Bullitt, E. and Wang, H. (2013). Generalized linear mixed models for branching probabilities of brain artery systems. Model Assisted Statistics and Applications 8, 121–133.

Harris, T. (1952). First passage and recurrence distributions. Transactions of the American Mathematical Society 73, 471–486.

Marron, J. S. and Alonso, A. (2014). Overview of object oriented data analysis. Biometrical Journal 56, 732–753.

Newman, M. E., Watts, D. J. and Strogatz, S. H. (2002). Random graph models of social networks. Proceedings of the National Academy of Sciences of the United States of America 99, 2566–2572.

Ramsay, J. and Silverman, B. W. (1997). Functional Data Analysis. Springer.

Shen, D., Shen, H., Bhamidi, S., Maldonado, Y. M., Kim, Y. and Marron, J. (2013). Functional data analysis of tree data objects. Journal of Computational and Graphical Statistics, doi: 10.1080/10618600.2013.786943

Snijders, T. A., Van de Bunt, G. G. and Steglich, C. E. (2010). Introduction to stochastic actor-based models for network dynamics. Social Networks 32, 44–60.

Wang, H. and Marron, J. (2007). Object oriented data analysis: sets of trees. The Annals of Statistics 35, 1849–1873.

Wang, Y., Marron, J., Aydın, B., Ladha, A., Bullitt, E. and Wang, H. (2012). A nonparametric regression model with tree-structured response. Journal of the American Statistical Association 107, 1272–1285.

