Mem Cogn (2015) 43:266–282 DOI 10.3758/s13421-014-0458-2

Observation versus classification in supervised category learning Kimery R. Levering & Kenneth J. Kurtz

Published online: 5 September 2014. © Psychonomic Society, Inc. 2014

Abstract The traditional supervised classification paradigm encourages learners to acquire only the knowledge needed to predict category membership (a discriminative approach). An alternative that aligns with important aspects of real-world concept formation is learning with a broader focus to acquire knowledge of the internal structure of each category (a generative approach). Our work addresses the impact of a particular component of the traditional classification task: the guess-and-correct cycle. We compare classification learning to a supervised observational learning task in which learners are shown labeled examples but make no classification response. The goals of this work sit at two levels: (1) testing for differences in the nature of the category representations that arise from two basic learning modes; and (2) evaluating the generative/discriminative continuum as a theoretical tool for understanding learning modes and their outcomes. Specifically, we view the guess-and-correct cycle as consistent with a more discriminative approach and therefore expected it to lead to narrower category knowledge. Across two experiments, the observational mode led to greater sensitivity to distributional properties of features and correlations between features. We conclude that a relatively subtle procedural difference in supervised category learning substantially impacts what learners come to know about the categories. The results demonstrate the value of the generative/discriminative continuum as a tool for advancing the psychology of category learning and also provide a valuable constraint for formal models and associated theories.

K. R. Levering · K. J. Kurtz
Department of Psychology, Binghamton University, Binghamton, NY, USA

K. R. Levering (*)
Department of Psychology, Marist College, Poughkeepsie, NY, USA
e-mail: [email protected]

Keywords Categorization · Concepts · Generative versus discriminative · Category learning modes · Classification learning · Supervised observational learning

It is now widely accepted that rich and flexible concept representations are required to support the diversity of ways in which knowledge is acquired, organized, and used (cf. Markman & Ross, 2003; Solomon, Medin, & Lynch, 1999). However, many leading formal accounts of human category learning (e.g., Ashby, Alfonso-Reese, Turken, & Waldron, 1998; Kruschke, 1992; Nosofsky, 1984; Nosofsky, Palmeri, & McKinley, 1994) are not designed to address the robust and multifaceted nature of real psychological concepts. Instead, much research on formal models and the theoretical accounts of categorization they represent (see Pothos & Wills, 2011) focuses on accounting for patterns of learning performance under highly constrained and standardized conditions. Central to this state of affairs is the ubiquity of a specific research paradigm: artificial classification learning. In this task, learners complete a training session in which they repeatedly guess category membership for a set of examples presented one at a time and receive corrective feedback after each trial. For decades, data from this particular task have been used almost exclusively as the means for developing theories and models of category learning. While some process models (e.g., SUSTAIN, Love, Medin, & Gureckis, 2004; and DIVA, Kurtz, 2007) are naturally and effectively extensible to other learning modes, the primary basis for evaluating models in the field has been the degree of success in fitting benchmark classification learning results (e.g., Medin & Schaffer, 1978; Shepard, Hovland, & Jenkins, 1961). The focus on this task would be less problematic if the nature of category learning were known to be broadly universal across tasks, modes, and settings. However, there is now considerable evidence that the learning process and resulting


representational structure of categories depend on the learning task (e.g., Love, 2002; Markman & Ross, 2003; Yamauchi & Markman, 1998). A key finding is that the traditional supervised classification paradigm encourages focusing on information that distinguishes between categories – while alternative learning tasks (such as inference learning) promote learning within-category statistical regularities (Chin-Parker & Ross, 2002, 2004; Markman & Ross, 2003; Yamauchi, Love, & Markman, 2002; Yamauchi & Markman, 1998) and yield representations that can be more flexibly applied to novel category contrasts or subsequent tasks (Hoffman & Rehder, 2010).

Generative versus discriminative methods of category learning

One goal in the present work is to take the existing body of knowledge on artificial classification learning performance and figure out how it fits with, connects to, or extends to other tasks and to a broader view of human category learning. Toward this end, we draw on the existing literature in machine learning on discriminative versus generative classifiers (e.g., Ng & Jordan, 2001). Like humans in a classification task, machine learning systems are designed to take data from a set of labeled training patterns and use it to learn to predict a category label (C) from a set of input features (F). The difference between generative and discriminative methods is a matter of the calculations through which prediction occurs and the type of data used in the calculations (Mitchell, 2010; Ng & Jordan, 2001; Vapnik, 1998). Discriminative models (e.g., logistic regression, support vector machines) use training examples to optimize a function for directly estimating the probability, p(C|F), that an example belongs to a category given its features. A solution of this kind focuses on the specific goal of classification success and does not address learning the distributional properties of each category. By contrast, a generative approach focuses on exactly such properties. Rather than directly calculating the probability of each category given a set of features, generative classifiers (e.g., naïve Bayes classifiers, hidden Markov models) estimate from the training data the likelihood, p(F|C), that each category generated the given input. In order to classify an example, generative methods typically apply Bayes' rule to these likelihoods along with the category label base rates, p(C). The intermediate step of computing the likelihood of the features given each category means that generative methods rely upon access to, or estimation of, a full statistical description of each category. Importantly, both generative and discriminative methods are effective approaches to classification problems – generative methods achieve their success by modeling more of the available data than is strictly required.
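To make the contrast concrete, here is a minimal sketch (ours, not from the original article) of the generative route for binary features: estimate p(C) and p(F|C) from the labeled training patterns and invert with Bayes' rule under a naïve feature-independence assumption. A discriminative learner would instead fit p(C|F) directly, never modeling the per-category feature distributions.

import numpy as np

# Toy labeled training set: rows are examples, columns are binary features.
X = np.array([[1, 1, 0],
              [1, 0, 0],
              [1, 1, 1],
              [0, 0, 1],
              [0, 1, 1],
              [0, 0, 0]])
y = np.array([0, 0, 0, 1, 1, 1])   # category label for each example

# Generative route: model each category's internal structure.
p_c = np.array([(y == c).mean() for c in (0, 1)])            # p(C)
p_f1 = np.array([X[y == c].mean(axis=0) for c in (0, 1)])    # p(F_i = 1 | C)

def posterior(x):
    # Bayes' rule: p(C|F) is proportional to p(C) * prod_i p(F_i|C)
    lik = np.prod(np.where(x == 1, p_f1, 1 - p_f1), axis=1)  # p(F|C)
    post = lik * p_c
    return post / post.sum()

print(posterior(np.array([1, 0, 1])))   # classify a novel example

The point of the sketch is that the generative learner ends up with a full description of each category (the rows of p_f1), which supports feature inference and pattern completion as by-products; a purely discriminative solution retains only what is needed for the decision boundary.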


In terms of a psychological account of human category learning, we can interpret the generative/discriminative distinction in terms of characteristics of the learning process or the nature of the developed representation. During the learning process, a discriminative approach is characterized by the use of strategies that optimize performance on the single task of predicting class labels from input features. When possible, these strategies involve using a reduced set of the available input features. By contrast, a purely generative approach is task-general and involves strategies designed to develop as much knowledge (i.e., as good a model as possible) of the regularities that hold among members of each category. With regard to representation, a key point is that the knowledge gained from generative learning is of a within-category, rather than between-category, nature. While discriminative learning yields nothing more to say about the nature of the categories than the basis for discriminating members (i.e., items fitting a rule or falling into particular regions of a multidimensional space), representation resulting from generative learning includes knowledge of the internal structure of individual categories. This knowledge can provide a basis not just for classification, but also for tasks like pattern completion (i.e., predicting unknown features or even generating a full example from scratch). As such, representations formed through a generative approach encompass expectations about the features of category members, rather than the likelihood of category membership given some feature or combination of features.

A strong parallel exists between this framework and the continuum identified by Goldstone (1996, 2003) between isolated and interrelated categories. Goldstone posited that in addition to concepts being more or less dependent on each other (interrelated) because of inherent qualities of the categories, factors during learning can differentially encourage the extent to which a category representation is affected by contrast categories. While we believe that the isolated/interrelated framework may map well onto the one we are currently exploring, our framework may have the potential to make more specific predictions about learning and representation and may offer a more useful tool for evaluating models and theories of learning as well as task demands.

In our view, theories and mechanistic models of human category learning have been guilty of making overly strong assumptions about the extent to which learning and representation are reliant on between-category differences. Discriminative assumptions about learning can be seen most plainly in current theoretical accounts of category learning via their reliance on selective attention – a mechanism that reduces or eliminates allocation of attentional resources to cues that are uninformative for discrimination (Kruschke, 2011). While there are exceptions (e.g., Anderson, 1991; Rehder & Murphy, 2003), this notion of selective attention lies at the core of most leading formal models, including exemplar-based (Kruschke,


1992; Kruschke & Johansen, 1999; Nosofsky, 1984), rule-based (Ashby et al., 1998; Nosofsky et al., 1994), and cluster- or prototype-based (Love et al., 2004; Smith & Minda, 1998) approaches. Although parameters may be modulated to account for varying levels of selective attention, a priori predictions of how aspects of the learning task (including the task addressed in this paper) may mediate these parameters have not been well explored. Beyond selective attention, the simple fact that the majority of formal models of category learning are designed exclusively to optimize predictions of class membership means that such models are naturally focused on representing information about how features predict category membership and not vice versa. In contrast, the DIVA model (Kurtz, 2007) provides an example of how a system can successfully learn to classify based on modeling the statistical regularities of each category in an autoassociative neural network that recodes the input and then predicts its features based on the expectations of each category.

Because the key goal of this paper is to explore the possibility that we have miscalculated the extent to which real-world concept learning is focused on between-category differences, it is important to identify what performance characteristics are diagnostic of more discriminative processing or representation. The answer in broad terms is that discriminative learners know less about the structure of categories. They are less likely to have expectations about features that are not needed to determine class membership, and they are less likely to have a basis for predicting features of an item based on information about its class and other feature values. During learning, discriminative learners should demonstrate an increased tendency to focus (either perceptually or decisionally) exclusively on diagnostic properties, at the expense of information that is less diagnostic.

With the help of these specific indicators, the generative–discriminative continuum offers a promising way to anchor our understanding of psychological tasks (i.e., modes of category learning). Depending on factors such as task demands and category structure, we expect that human learning can reflect a range of approaches from highly generative to highly discriminative. One can imagine situations in which a highly discriminative approach to category learning would be beneficial: for example, a domain expert who performs massive repetition of a particular discrimination. More generally, however, our intuition is that much of the category learning that supports everyday concept use in the real world is more generative. Focusing exclusively on information that discriminates between particular categories is advantageous only when other types of information are unnecessary for future tasks (such as other classifications or other forms of category use). However, categories that represent meaningful ways of understanding the world are not typically experienced in relation to just one particular discrimination or type of task.


To serve as the building blocks of knowledge and as the basis for comprehension and inference, concepts must capture the nature of category A, not just how to tell A from B. Concerns about the relative dominance of the supervised artificial classification task can be grounded in the critique that the task represents an unjustified commitment to a highly discriminative learning approach. Notably, the differences found in conceptual representation when people learn categories via tasks other than classification (e.g., Hayes & Younger, 2004; Jee & Wiley, 2007; Kemler Nelson, 1984; Love, 2002; Minda & Ross, 2004; Ross, 1997, 1999, 2000; Yamauchi & Markman, 1998) are consistent with learning that is less discriminative. Even within the context of a classification task, Goldstone (1996) found that manipulating several methodological factors – asking participants to "form an image of what each category looks like" rather than "find features … that help you distinguish between the categories", alternating categories infrequently during learning, and labeling a second category independently of the first – resulted in categorization performance at test indicating a greater reliance on features not necessary for classification. These manipulations can be viewed within the current framework as reducing the discriminative focus of the classification task. Given these findings, it is possible that in emphasizing a particular version of the traditional classification learning task, we may have become overly reliant on discriminative assumptions and design principles in our status quo psychological understanding of categorization.

Observation as a less discriminative form of supervised category learning

We suspect that there are a number of standard characteristics of the artificial classification learning paradigm (e.g., binary-valued stimulus dimensions, two mutually exclusive categories, a small number of examples and dimensions) that create a bias toward discriminative learning. We have chosen to address one core component: the guess-and-correct cycle of the supervised classification trial. On each trial, learners in the traditional task are asked to select the correct category label for a presented example and then receive feedback. The nature of the task itself is likely to encourage learners to focus on getting right answers by figuring out which features can be used in what way to make a membership decision between the given choices – rather than learning about the general nature of the categories. By contrast, observational learning is a special case of category learning in which the learner does not explicitly generate a guess about category membership. Instead, the object and category label are both provided, either simultaneously or label-first. While this type of task conveys the same information per trial about category membership as


classification, observation learners are not being asked to focus cognitive resources on devising a method for telling the categories apart. Classification requires explicit comparison and weighing of multiple category options on each trial, thereby emphasizing the mutual exclusivity of the categories and making hypothesis testing about the diagnosticity of particular features easier or more likely. Without this, learning through observation may afford greater opportunity to understand more broadly the basis for coherence among members of each category. To be clear, we do not mean to suggest that observational learning is purely, or even largely, generative: the observational task in fact shares all characteristics of the classification task except for the guess-and-correct cycle. In these experiments we test the hypothesis that, relative to classification learning, learning through observation facilitates a decidedly less discriminative approach. Evidence of this would be seen in reduced focusing on diagnostic features during learning and more robust representation of within-category statistical regularities when learning through observation.

At first glance, predicting greater acquisition of category knowledge through observation relative to classification may seem counterintuitive. Given that learners are never asked to explicitly consider category alternatives, observation is a less directed task than classification, and observation learners would seemingly be more apt to lose interest in gathering information or focusing on the task. Further, the task eliminates the explicit generation of response error on each trial. Not only might this mitigate the motivational drive for learning, but the generation of error is also widely thought to be central to how categories are learned (Kruschke, 2001, 2003). Error reduction (minimizing the discrepancy between a classification guess and a "teaching signal") via the adjustment of weights is the core mechanism behind supervised learning in adaptive network models (e.g., Kruschke, 1992). It is not clear how such models (or associated theories) would account for learning when no error signal is generated. One line of thinking would be to assume that an error signal is generated relative to an implicit class prediction during an observational learning trial (a proposal that, while possible, is made less likely when the label is presented before the example, as it is in our experiment); this would imply no difference in learning and representation between classification and observation conditions.

A point that has been made previously in the literature (Ashby, Maddox, & Bohil, 2002; Dye & Ramscar, 2009; Ramscar, Yarlett, Dye, Denny, & Thorpe, 2010) is that typical classification and observational learning conditions reflect two procedural differences: (1) whether or not a response is generated; and (2) the presentation order of the example and label within each trial. Prior research shows that the sequence of presented information can impact classification performance at test (Ashby et al., 2002; Ramscar et al., 2010).


In line with this evidence, our view is that observational trials in which the example is presented before its label and learners do not make a response are likely to mimic classification by encouraging the generation of an implicit guess about category membership. This essentially undermines the distinct character of the observational mode by inviting a category decision (even without its expression). For this reason, we evaluate the observational mode with the label presented before the example.

Past research on classification and observation

While there has been only limited research comparing the traditional supervised classification task with observational category learning, some existing evidence points away from a potential observational learning advantage. Estes (1994) reported an initial marginal boost for observational learners over classification learners within the first training blocks, but this advantage disappeared over the course of learning, with classification learners showing higher overall performance. When Ashby et al. (2002) tested learning of a category structure that necessitated the integration of information across more than one dimension pre-decisionally (referred to as an information integration structure), they found that observational learning resulted in significantly poorer performance than classification learning. However, when categories were defined by a simple unidimensional rule along two dimensions, there was little difference in performance between classification and observational learning. Ramscar et al. (2010) also found that observational learning resulted in poorer classification performance on new examples at test. These data (Ashby et al., 2002; Estes, 1994; Ramscar et al., 2010) have been taken as evidence that observational learning is at best equal to, and sometimes worse than, supervised classification. For present purposes it is important to note that these experiments measured the ability of learners to categorize examples, but they were not designed to test for differences in the nature of the category knowledge acquired.

We know of two sources of empirical evidence regarding learner sensitivity to internal structure after learning through classification and observation. Hoffman and Murphy (2006) taught participants a family resemblance category structure based on either four or eight feature dimensions. After learning the 8-dimensional structure (but not the 4-dimensional structure), observation learners were knowledgeable about significantly more features on average than classification learners. In addition, Hsu and Griffiths (2010) found that observation led to more sensitivity to variability along a diagnostic dimension in learning a simple rule-based category structure. Results from these studies are consistent with the intuition that observational learning may have a more generative basis than supervised classification, but further work is clearly needed.



Logic of experimental approach


Category structure

In the present experiments, participants learn categories in which one feature dimension perfectly predicts category membership by way of a simple rule. Such unidimensional rules are extremely easy to learn within supervised paradigms, presumably due to the ability to focus on the perfectly predictive dimension (e.g., McKinley & Nosofsky, 1996; Nosofsky et al., 1994; Shepard et al., 1961). Unidimensional organizations are also typically dominant in unsupervised categorization settings (Ahn & Medin, 1992; Ashby, Queller, & Berretty, 1999; Imai & Garner, 1965; Medin, Wattenmaker, & Hampson, 1987). While we expect deficits in overall category knowledge to arise in discriminative learning whenever a subset of features can successfully predict membership, a unidimensional rule may most directly highlight these deficits: a purely discriminative representation need only include knowledge about that one feature, regardless of the structure along other dimensions. Therefore, greater sensitivity to internal structure beyond the unidimensional rule will serve as evidence that learning was less discriminative.

In this study, such additional structure was included in the form of distributional regularities of individual features and correlations between features. To allow for a more nuanced investigation of how feature and correlation diagnosticity affect learning, regularities varied in the extent to which they predicted category membership. Each experiment included at least one feature or correlation that was completely non-diagnostic (i.e., knowing that an example possesses the feature or correlation does not help to determine its category membership) and at least one cue that was partially diagnostic (somewhat, but not perfectly, predictive of category membership). It is straightforward to expect both observation and classification learning to result in sensitivity to the fully diagnostic information (the unidimensional rule). The key prediction is that observational learning will result in increased sensitivity to distributions of features and consistent relationships between features that are partially diagnostic – and perhaps those that are not diagnostic at all.

Learners who use a more discriminative approach should display reduced sensitivity to the likelihood of feature values – particularly those features that are either not diagnostic or partially/redundantly diagnostic. When learning about the category dog, highly discriminative learners might ascertain that the presence of barking indicates that an example is a dog and not a cat, but a learner developing a more generative representation should acquire more information about dimensions that are less diagnostic (type of fur) or not diagnostic (presence of fur).

To investigate whether observational learning results in representations that include more information about the distribution of features than classification learning does, we use a modified version of the family resemblance category structure. In the traditional family resemblance structure, knowing that an example is a member of a category means that each feature has a likely value, but never a certain value (Rosch & Mervis, 1975). The 'prototype' (an actual or potential item possessing all of the likely properties) of one category is the opposite of the prototype of the other category. Since the features that determine the internal structure of a category are the same ones that serve as the basis for discrimination, it is difficult to determine whether knowledge about feature distributions arises due to its usefulness in understanding the coherence of a category (indicating a less discriminative representation) or its usefulness in distinguishing between categories (indicating a more discriminative representation).

To address this problem we make two changes to the traditional family resemblance structure. First, only some of the family resemblance features discriminate between the categories (i.e., the prototypes match on two feature dimensions). Therefore, while all of the family resemblance features contribute to generative knowledge about the types of properties that members tend to share, the features differ in terms of being either partially or not at all diagnostic of membership. Using a category structure like this, Chin-Parker and Ross (2004) found that inference learners integrated information about non-diagnostic family resemblance dimensions into their representations, while classification learners did not. Second, we include a dimension that follows a perfectly diagnostic rule. The presence of a unidimensional rule is likely to lead highly discriminative learners to focus on fewer feature dimensions. Previous experiments (Kemler Nelson, 1984; Minda & Ross, 2004) have tested learners on this type of category structure (one perfectly diagnostic feature plus a number of partially diagnostic features), comparing classification learning with an incidental or indirect learning task. At test, incidental learners were knowledgeable of the distributions of more features and were more likely to use partially diagnostic features when making classification responses.

Measures

In a standard category learning study, researchers measure the ease of learning to classify examples and, in some cases, generalization to new items. An important component of the present work is developing and demonstrating the value of additional techniques to assess the representations formed in category learning. We focus on ratings of item typicality (see Barsalou, 1983; Rosch & Mervis, 1975) and tests that assess knowledge about individual features (explained below). These measures provide a window into the representational differences that can arise from distinct forms of category learning.

To test for sensitivity to feature distributions we employ typicality rating difference scores. By comparing typicality



ratings of examples that possess some property (e.g., a particular feature value or correlation between features) to ratings of examples that are identical in every way except for the presence of that property, we can determine how robustly a property is realized within the representation of the learner. We also employ single feature inference tests, in which we present a category label and ask the participant to decide which of two possible feature options is more likely given the category.2 Across these tasks, we expect observational learners to possess greater knowledge of dimensions that were either not diagnostic or partially diagnostic. Specifically, we predict that observational learners will be more likely to rate examples possessing common feature values as more typical members of their category than examples possessing uncommon features. We also expect observational learners to be more accurate than classification learners in responding to questions about feature values common to each category.

Experiment 1

Method

Participants Seventy-seven undergraduates from Binghamton University participated in exchange for partial fulfillment of a course requirement. Random assignment placed 38 participants in the classification condition and 39 in the observation condition.

Materials Stimuli were line drawings of fictitious cartoon animals adapted from Chin-Parker and Ross (2004). The stimuli varied along five binary-valued feature dimensions: type of beak, type of antenna, type of wing, type of tail, and type of feet. Each category was defined by a unidimensional rule along one of the dimensions (see Fig. 1 for the logical structure and assignment of physical dimensions). Two of the features were partially diagnostic family resemblance (FR) features: each occurred 80 % of the time in one category and 20 % of the time in the other. The other two features were non-diagnostic FR features in that a particular feature value occurred 80 % of the time in both categories. While these latter features were non-diagnostic, they were nonetheless part of each category's internal structure. Each category therefore consisted of three types of examples (see Fig. 1): a prototype that possessed only features that were likely for the category, two examples that were off-by-one relative to the prototype along a partially diagnostic dimension, and two examples that were off-by-one relative to the prototype along a non-diagnostic dimension.

Fig. 1 Logical structure and assignment of physical dimensions for training examples used in Experiment 1. Prototypes for each category are indicated by a border. Each other category example deviated from the prototype by one feature that was either partially diagnostic or non-diagnostic of category membership. There was no example deviating from the prototype along the first dimension (wing type).
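The logical structure in Fig. 1 can be made explicit in a few lines. The sketch below is our own illustration (not part of the original method); the ten items are coded from Table 1, with the first digit carrying the perfectly diagnostic rule dimension (wing type):

import numpy as np

# Ten training items from Fig. 1 / Table 1, over five binary dimensions.
yugli = np.array([[0,0,0,0,0], [0,0,0,0,1], [0,0,0,1,0], [0,0,1,0,0], [0,1,0,0,0]])
zifer = np.array([[1,1,1,0,0], [1,1,1,0,1], [1,1,1,1,0], [1,1,0,0,0], [1,0,1,0,0]])

p_y = yugli.mean(axis=0)   # p(feature = 1 | Yugli) -> [0.0, 0.2, 0.2, 0.2, 0.2]
p_z = zifer.mean(axis=0)   # p(feature = 1 | Zifer) -> [1.0, 0.8, 0.8, 0.2, 0.2]

# Dimension 1 separates the categories perfectly (the rule); dimensions 2-3
# differ across categories (partially diagnostic, 80/20); dimensions 4-5 are
# identically distributed (non-diagnostic, yet part of internal structure).
print(np.abs(p_y - p_z))   # [1.0, 0.6, 0.6, 0.0, 0.0]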

2 This task is different from the more prevalent single feature classification task, in which a single feature is presented and participants are asked to guess which category is most likely (e.g., Anderson et al., 2002; Hayes & Younger, 2004; Hoffman & Murphy, 2006; Murphy & Allopenna, 1994; Rehder, Colner, & Hoffman, 2009; Ross, 2000).


Procedure Participants in both conditions were told to imagine that a new planet had been discovered. They were told that they were part of a student training program and would be learning about the creatures by looking at pictures that researchers had taken of them (see Appendix for the exact learning and testing instructions). Once participants had read the instructions, examples were presented on the computer screen one at a time in a random order with buttons labeled "Yugli" and "Zifer." Participants in the classification condition performed a traditional supervised classification learning task: learners were asked to select via mouse click the correct category for each example. After making a guess, they received feedback as to whether they were right or wrong and were shown the correct answer. Learners in the observation group were presented with the correct category label for 2000 ms before the stimulus was displayed; the image and label were then shown together on the screen for another 3000 ms. The length of exposure to each image in the observation condition was intended to equate, on average (based on pilot work), the time that classification learners viewed an image before selecting a response. In order to help ensure that participants were paying attention, as well as to equate motor behavior with the supervised condition, participants were required to click a button that displayed the correct label in the context of the phrase "click to confirm: this creature is a ________." Participants in both conditions experienced four training blocks, with each block consisting of all ten examples.

Immediately after the training phase, all participants received a series of test measures designed to evaluate their ability to classify examples and, most importantly, their sensitivity to within-category structure (see Table 1 for the logical structure of the test items). To evaluate category learning, participants were given an endorsement task on the training items. The examples were displayed randomly one at a time along with the following phrase referring to a correct or incorrect category label: "This is a _______." Participants were asked to select "agree" or "disagree" depending on whether they believed the label to be correct. Each example was presented once, and there were an equal number of trials in which "agree" and "disagree" were the correct response. The endorsement task was used instead of the traditional classification task without feedback to ensure that the task in the test phase was different from the procedure used in either learning condition.

The remainder of the test phase served our goal of assessing sensitivity to feature distributions. First, typicality ratings were collected. Participants were told to "rate how good an example (i.e., typical or representative) each is of the type indicated." All training examples were presented in random order along with correct category labels (to help ensure that ratings were not based on confidence in category membership). Participants were asked to rate on a scale of 1 (not at all typical) to 7 (highly typical) the typicality of the example (i.e.,




how good or representative it is) relative to the category. Participants were then given a single feature inference test phase (see Fig. 2). For this test, participants were given a category label and two options in the form of images representing the two possible feature values along a queried dimension. Each option consisted of a full creature with all features other than the one being queried occluded. Participants were asked which feature value was more likely given the category label. All features were queried.

Fig. 2 Layout of the single feature inference test used in Experiments 1 and 2.

Table 1 Structure of Key Test Trials Used in Experiment 1

Typicality of Trained Examples     Single Feature Inference
Y | 00000*    Z | 11100*           Y | ----?    Z | ----?
Y | 00001     Z | 11101            Y | ---?-    Z | ---?-
Y | 00010     Z | 11110            Y | --?--    Z | --?--
Y | 00100     Z | 11000            Y | -?---    Z | -?---
Y | 01000     Z | 10100

Note. Asterisks mark the prototypes for each category.

Results and discussion

Our core predictions about representation are best assessed given successful category learning. Because committing three or more errors would not indicate performance exceeding chance according to a binomial distribution, we eliminated a total of three learners (one in the classification condition and two in the observation condition) who committed three or more errors on the endorsement test of trained examples. This left 37 participants in each group.

Classification accuracy A preliminary concern is whether learners were able to correctly assign category membership and whether performance differed meaningfully between the two groups. We found that the proportion correct for both observation (M = 0.98, SD = 0.04) and classification (M = 1.00, SD = 0.02) learners was quite high on the endorsement




task and did not differ between the groups, p = 0.23 (because the data were not normally distributed, a Mann–Whitney U test was used instead of the traditional t test). It was not possible to compare performance during the training phase because of the nature of the learning task for the observation group; however, we note that the proportion correct for classification learners reflected relatively few errors (M = 0.95, SD = 0.04). A related question is whether learners mastered the unidimensional rule. Proportion correct on the single feature inference test indicated perfect knowledge (no errors) of the feature on which the unidimensional rule was applied, for both classification and observation learners. As intended, our design was successful in producing learners in both conditions who were highly knowledgeable about category membership and about the usefulness of the perfectly predictive dimension.

Sensitivity to partially diagnostic feature information

Typicality ratings For each learner, the average typicality rating across the four examples (two from each category) that were off-by-one relative to the prototypes along the partially diagnostic FR dimensions (see Fig. 1) was subtracted from the average typicality rating of the corresponding prototypes. A difference score of zero indicated that the learner rated the items possessing an uncommon feature along those dimensions as just as typical of the category as the prototype. This would strongly suggest that the dimension was not a part of their representation or was not integral to their idea of what kinds of features a typical category member possessed. To the extent that the difference score was positive, participants viewed the prototypical items as more typical of their category than the items deviating from the prototype along particular dimensions. To be clear, higher values indicate greater dimensional sensitivity. Average typicality ratings and difference scores for each example can be seen in Table 2.

As predicted, for partially diagnostic FR feature dimensions, observation learners (M = 0.71, SD = 0.98) produced a significantly higher average typicality difference score than classification learners (M = 0.27, SD = 0.59), t(72) = 2.33, p = 0.02, d = 0.54 (see Fig. 3). Further analysis showed that the groups did not differ in their average typicality ratings for the prototypical items, p > 0.05. The effect was instead due to the fact that observation learners rated items with uncommon features on partially diagnostic FR dimensions significantly lower than classification learners did, t(72) = 2.26, p = 0.03, d = 0.54. Importantly, both classification learners, t(36) = 2.79, p = 0.01, d = 0.46, and observation learners, t(36) = 4.39, p < 0.001, d = 0.88, had average typicality difference scores significantly greater than zero, indicating that both types of learning resulted in some sensitivity to the values that were common to each category along partially diagnostic FR feature dimensions.

Table 2 Average Typicality Ratings and Difference Scores for All Examples in Experiment 1

                                        Classification           Observation
                                        Rating   Difference      Rating   Difference
Prototype Y00000                        6.65     –               6.76     –
Part-diag off Y01000                    6.68     -0.03           5.87     0.89
Part-diag off Y00100                    6.05     0.60            5.78     0.97
Non-diag off Y00010                     6.38     0.27            6.68     0.08
Non-diag off Y00001                     6.49     0.16            6.22     0.54
Prototype Z11100                        6.65     –               6.43     –
Part-diag off Z10100                    6.62     0.03            6.03     0.41
Part-diag off Z11000                    6.16     0.49            5.87     0.57
Non-diag off Z11110                     6.27     0.38            6.38     0.05
Non-diag off Z11101                     6.30     0.35            6.00     0.43
All partially diagnostic off-by-one              0.27                     0.71
All non-diagnostic off-by-one                    0.29                     0.28
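As a worked illustration of the difference score computation, here is a small sketch (ours) using the classification group means from Table 2; the reported analysis used each participant's own ratings, but the arithmetic is the same:

# Classification-group mean ratings from Table 2.
proto = {"Y": 6.65, "Z": 6.65}                      # prototype ratings
part_diag_off = {"Y01000": 6.68, "Y00100": 6.05,    # off-by-one items on a
                 "Z10100": 6.62, "Z11000": 6.16}    # partially diagnostic dimension
diffs = [proto[item[0]] - r for item, r in part_diag_off.items()]
print(sum(diffs) / len(diffs))   # 0.2725 -> the 0.27 reported above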


Single feature inference test On the single feature inference test (i.e., predicting a target feature given the category), we compared proportion correct on the queries pertaining to the partially diagnostic FR features. On this task, observation learners (M = 0.85, SD = 0.22) performed significantly better than classification learners (M = 0.72, SD = 0.28), t(72) = 2.16, p = 0.03, d = 0.52, thereby demonstrating better knowledge of the distribution along these features (see Fig. 3). Consistent with the typicality rating difference scores, both groups demonstrated some degree of knowledge of these dimensions, as both performed above chance, ps < 0.001. These advantages for observational learners came despite the fact that they had less exposure to the examples on the screen (M = 4587 ms, SD = 776) than classification learners (M = 6072 ms, SD = 1198), t(72) = 6.33, p < 0.001, d = 1.47. (Although classification learners had greater overall exposure to each example, they were exposed to the combination of the label and the example for significantly less time, M = 3579 ms, SD = 664, on average than observational learners, M = 4587 ms, SD = 776, t(72) = 6.00, p < 0.001.)

Sensitivity to non-diagnostic information

Typicality ratings Unlike our findings for the partially diagnostic dimensions, the average typicality difference scores for non-diagnostic FR dimensions did not differ between the classification (M = 0.29, SD = 0.64) and observation (M = 0.28, SD = 0.58) groups, t(72) = 0.10, p = 0.92 (see Fig. 3). The difference scores for both classification learners, t(36) = 2.78, p = 0.01, d = 0.45, and observation learners, t(36) = 2.89, p = 0.01, d = 0.46, were significantly greater than zero. Therefore, according to typicality ratings, both classification and observation learners demonstrated some sensitivity to the typical values along the non-diagnostic family resemblance dimensions – but observational learning did not facilitate greater sensitivity to the presence of common features along these dimensions.

Single feature inference test According to the proportion correct on the single feature inference test, there was no difference between classification (M = 0.54, SD = 0.13) and observation learners (M = 0.52, SD = 0.11), t(72) = 0.75, p = 0.46 (see Fig. 3). In this case, neither classification learners, t(36) = 0.20, p = 0.06, nor observation learners, t(36) = 1.14, p = 0.26, performed significantly better than chance.

For single feature tests, a curious result was found. Despite the fact that the distribution of features along the non-diagnostic family resemblance dimensions was identical between the two categories, learners appear to have developed a representation of the categories as opposites along those dimensions. For questions about non-diagnostic features, 87 % of learners responded in a way that was consistent with the idea that the categories had opposing prototypical values along the dimension (the response for one category was




correct while the response for the other category was not). By chance, this should happen only 50 % of the time. This response pattern did not differ between observational and classification learners. It is possible that learners, in the absence of sufficient knowledge about non-diagnostic features, simply used a response strategy during testing that assumed contrast along those dimensions. It could also be an example of how elements of both classification and observational learning trials lead learners to expect categories to be complete opposites of each other.

Summary of results

Based on typicality ratings and single feature inference tests, classification and observation learners were each sensitive to feature likelihood along partially diagnostic dimensions. In accord with our predictions, observation learners were (1) significantly more likely than classification learners to rate items that deviated from the common values along partially diagnostic feature dimensions lower than category prototypes, and (2) better than classification learners at predicting the correct value of a missing partially diagnostic feature based on category membership. For completely non-diagnostic dimensions, however, typicality ratings and single feature inference performance did not indicate that observation learners were more knowledgeable than classification learners. In fact, according to the single feature inference test, neither type of learner was aware of the common features along non-diagnostic dimensions. Instead, for these cues they maintained a representation of illusory contrast by assuming that categories were opposites along non-diagnostic dimensions even though they were not.

While typicality ratings demonstrated some sensitivity in both observation and classification learners to the distribution of non-diagnostic features, single feature tests did not. One possible explanation for this is that the single feature queries assess more explicit knowledge of the distribution of features,


while typicality ratings are more sensitive to some form of implicit distributional knowledge. Additionally, the increased sensitivity of typicality ratings may be due to the full presentation of training examples during the typicality rating test phase, whereas in the single feature inference test only one feature was presented at a time. It is important to note that the attentional contributions or decisional processes involved in typicality ratings or single feature tests are not well established. It is possible that a demonstrated lack of sensitivity to feature distributions merely reflects either the application of selective attention weights developed during learning or a response bias at a later decision stage.

These results confirm our prediction that category learning is more discriminative when the learning mode includes the act of generating a guess about category membership before the presentation of the label on each learning trial. Specifically, this element of the learning experience made learners less sensitive to information about the distribution of features along partially diagnostic feature dimensions. However, the very same factor did not seem to mediate learner knowledge of information along completely non-diagnostic dimensions. Aspects of the task that were common to observational and classification learning (e.g., materials, structure, other task demands) may have contributed to the lack of distributional knowledge along these dimensions.

While this experiment provides evidence of observational learning leading to greater knowledge of feature distributions not necessary for classification, it is not clear whether this advantage extends to other types of internal structure. The following experiment was designed to explore the extent to which observational learning, like inference learning, increases sensitivity to relationships between features, particularly those not predictive of category membership. In addition, to ensure that the effects found in the first experiment were not limited to a




particular assignment of feature dimensions to logical structure, we aimed to replicate the differences in sensitivity to feature distributions with a counterbalanced stimulus set.

Fig. 3 Average typicality difference score and performance on the single feature inference test by task and diagnosticity of dimension in Experiment 1. The difference score for each participant was calculated by subtracting the average typicality rating of examples off-by-one from the average typicality rating of the examples' corresponding prototype. Accordingly, higher difference scores and higher proportion correct on the single feature inference test indicate greater knowledge of feature information. Across both measures, people in the observation condition showed significantly more knowledge of partially diagnostic features, but neither group showed much knowledge of non-diagnostic features.

Experiment 2

In the realm of machine learning, generative classifiers (e.g., naïve Bayes) traditionally include the underlying assumption that features are independent of each other. In other words, relationships between features are not represented and therefore cannot be used to aid a classification decision. Outside of the laboratory, however, the world is rich with features that co-occur, and these correlations between features are important pieces of information that help us understand and use categories and make predictions about unknown properties. In fact, researchers have long proposed that concepts are formed around clusters of correlated features (Rosch, 1978). This holds for natural, taxonomic categories (e.g., birds that are small tend to sing), but it can also be true of artifact categories (e.g., bike tires that are wide tend to be knobby). If highly discriminative learning makes learners less aware of the statistical regularities within a category, sensitivity to feature correlations may be a good way of determining whether they are engaging in such learning. In the following experiment, we test the idea that observational learning results in greater sensitivity to correlations between features. Just as independent features can vary in the extent to which they are diagnostic of membership (and therefore potentially part of a discriminative representation), the presence of correlations


between features can reflect degrees of diagnosticity. As was demonstrated in the previous experiment with feature distributions, the effect of the learning task on sensitivity to correlations might be mediated by the extent to which those correlations are diagnostic. Diagnostic correlations are combinations of features that, taken together, predict category membership even if the features independently do not. The most extreme example of a diagnostic feature correlation is the exclusive-or category structure (i.e., Type II from Shepard et al., 1961). In contrast, non-diagnostic feature correlations are relationships between features that do not make an example any more likely to be a member of one category or another. Non-diagnostic correlations occur when the same relationship between features exists across an entire domain, regardless of category.

The extent to which feature correlations are diagnostic has been shown to influence learner sensitivity. Non-diagnostic correlations are considerably more difficult to learn within a classification task than diagnostic correlations (Chin-Parker & Ross, 2002; Little & Lewandowsky, 2009; Medin, Altom, Edelson, & Freko, 1982; Murphy & Wisniewski, 1989; Wattenmaker, 1991). In fact, there is little evidence supporting the ability of learners within a classification task to show sensitivity to such correlations in the absence of prior knowledge.

In the present experiment, we investigate the effect of the guess-and-correct cycle on acquiring knowledge of non-diagnostic correlations. If more generative learning results in greater acquisition of correlations between features, observation learners might show more sensitivity than classification learners. In this experiment, the features making up the correlation are independently partially diagnostic. The reason for this is twofold. First, we wanted to replicate the finding from Experiment 1 that observation learners are more sensitive to distributions of partially diagnostic features. Second, we wanted to increase the chances of both learning tasks resulting in sensitivity to the correlation. As seen in Experiment 1, neither classification nor observation learners were sensitive to statistical regularities along independently non-diagnostic feature dimensions. By making the dimensions over which a correlation exists independently predictive of membership (albeit only slightly), we may make it more likely that learners will pick up on the relationship, even if the correlation itself is not diagnostic.
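To make the two poles concrete, a minimal illustration (ours) with two binary features:

# Diagnostic correlation (exclusive-or / Shepard Type II): neither feature
# predicts the category alone, but the relationship between them does.
cat_a = [(0, 0), (1, 1)]                    # members: features match
cat_b = [(0, 1), (1, 0)]                    # members: features mismatch
print(all(f1 == f2 for f1, f2 in cat_a))    # True
print(all(f1 != f2 for f1, f2 in cat_b))    # True
# Each single feature value occurs equally often in both categories,
# so p(C | F_i) = 0.5 for every individual feature.

# Non-diagnostic correlation: the same relationship holds across the whole
# domain, so detecting it tells you nothing about membership.
cat_c = [(0, 0), (1, 1)]
cat_d = [(0, 0), (1, 1)]
print(all(f1 == f2 for f1, f2 in cat_c + cat_d))   # True in both categories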

Method

Participants One hundred twenty-eight students from Binghamton University participated in this experiment in exchange for partial fulfillment of a course requirement. Participants were randomly assigned to conditions that differed in the training task (classification or observation). Participants were also randomly assigned to one of two different assignments of logical structure to feature dimensions (detailed below).


Materials Each category consisted of seven training examples, varying along four of the five feature dimensions used in Experiment 1 (see Fig. 4). Once again, category membership was based on a simple unidimensional rule. As in the previous experiment, two features were partially diagnostic of category membership. Specifically, each partially diagnostic feature value occurred in five of the seven examples in one category and in only two of the seven examples in the contrast category. Determining membership based on either of these features alone would result in correct classification of 10 of 14 examples (71.4 % accuracy). As such, these features were less diagnostic than those used in Experiment 1. In addition to being independently predictive, the two partially diagnostic features were perfectly correlated with each other. The correlation held across all category members (e.g., the presence of a certain type of feet always co-occurred with a certain type of wing). Therefore, because the correlation could not be used to determine membership in either category, it was non-diagnostic. The remaining dimension (always the tail) could possess one of five possible values and was included to increase the size of the stimulus set.

In order to ensure that the results could not be attributed to a particular stimulus set, two different assignments of physical features to logical structure were used. In set 1 (shown in Fig. 4), the perfectly predictive dimension was the type of arm, and there was a non-diagnostic correlation between the dimensions of foot type and antennae type. In set 2, the foot dimension perfectly predicted membership, and the type of arm was correlated with the type of antennae.

Procedure The classification and observation learning trials were identical to those in the previous experiment. Participants experienced three blocks of training on the set of 14 examples (42 total trials). At test, participants were given the endorsement task, the single feature inference test, and a typicality rating task. These tests were administered in the same way as in the previous experiment (see Table 3) and were included to ascertain each learner's sensitivity to the feature distributions along the partially predictive features. To test for knowledge of the correlation, we conducted a single feature correlation test: one feature was presented to the participant, who was asked to choose the feature value along a second dimension that they believed to be most likely. The possible options (labeled option 1 and option 2) were images displaying both the given feature and one of the two options for the missing feature value (see Table 3 for the structure of the single feature correlation test trials). While there were a total of 12 correlation trials (testing every pairwise combination of feature relationships between the perfectly diagnostic dimension and the two partially diagnostic dimensions), the four of interest queried the relationship between the features that were perfectly correlated (–1?, –0?, –?1, –?0). Performance on these four items was averaged to provide a measure of sensitivity to the correlation for each participant.
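The logical structure can be checked compactly. In this sketch (ours; items coded from Table 3, with digit 1 the rule dimension, digit 2 the five-valued tail, and digits 3–4 the correlated partially diagnostic features):

yugli = ["0000", "0100", "0200", "0300", "0400", "0011", "0111"]
zifer = ["1011", "1111", "1211", "1311", "1411", "1000", "1100"]

# Either partially diagnostic feature alone classifies 10 of 14 items:
hits = sum(x[2] == "0" for x in yugli) + sum(x[2] == "1" for x in zifer)
print(hits / 14)                                   # 0.714...

# The two features always agree within an item, in both categories, so the
# correlation itself carries no information about category membership.
print(all(x[2] == x[3] for x in yugli + zifer))    # True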



Fig. 4 Logical structure and assignment of physical dimensions for one training set used in Experiment 2

Results and discussion

For each measure, we performed a 2 (task: classification or observation) × 2 (stimulus set: set 1 or set 2) ANOVA. There was a significant effect of stimulus set on performance at test: one stimulus set produced significantly higher performance on the measures of sensitivity to both feature distributions and feature correlations, ps < 0.001, and typicality ratings from this group were also significantly higher. However, there were no interactions between the task and stimulus set variables, so we present the following main effects of task collapsed across the sets.

Knowledge of category membership As in the previous experiment, the classification (M = 0.986, SD = 0.041) and observation (M = 0.979, SD = 0.060) groups did not differ in their ability to correctly endorse members of the categories, p > 0.05 (as in Experiment 1, a Mann–Whitney U test was used because the data were not normally distributed). Again, no errors were made in either group on the single feature inference test for the perfectly predictive feature.

Sensitivity to distribution of features As predicted, observation learners were significantly more accurate (M = 0.81, SD = 0.29) than classification learners (M = 0.70, SD = 0.32) at guessing the correct feature value for the partially diagnostic

features in the single feature inference test, F (1,124) = 4.68, p = 0.03, η2 = 0.04 (see Fig. 5). Both classification learners, t(63) = 17.58, p < 0.001, and observation learners, t(63)=22.25, p < 0.001 were significantly better at this task than chance. While not statistically significant, observation learners had typicality difference scores (see Table 4) that were marginally higher (M = 1.42, SD = 1.59) than classification learners (M = 0.92, SD = 1.74), F (1, 124) = 2.91, p = 0.09, η2 = 0.02. For both observational learners, t(63) = 7.19, p < 0.001, d = 0.89, and classification learners, t(63) = 4.24, p < 0.001, d = 0.53, typicality difference scores were significantly greater than zero, meaning that items possessing the more common feature along partially diagnostic dimensions were rated as significantly more typical than those possessing prototype-inconsistent values. Results from these tests provides further evidence that observational learning makes one more sensitive to the distribution of features along partially predictive dimensions in the presence of a perfectly diagnostic cue. As with the previous experiment, the enhanced performance of observational learners cannot be explained by mere exposure to examples, as the median amount of time that classification learners viewed each example was significantly greater (M = 5604, SD =939) than observational learners (M = 4459, SD = 988), t(126) = 6.722, p < 0.001, d = 1.18.6 6 As in Experiment 1, although classification learners had greater overall exposure to each example, they were exposed to the combination of the label and the example for significantly fewer milliseconds (M = 3418, SD = 501) than observational learners (M = 4459, SD = 988), t(126) = 7.502, p < 0.001, d = 1.33.
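The test measures above reduce to simple per-participant computations. The following sketch shows the scoring logic on hypothetical response data; the array values, variable names, and five-participant sample are our assumptions for illustration, not the study's data. Chance on the inference test is 0.5 because each trial offers two options.

```python
import numpy as np
from scipy import stats

# Hypothetical responses for five participants (illustration only).
# Proportion correct on the single feature inference test (chance = 0.5).
inference = np.array([0.75, 1.00, 0.50, 0.875, 0.75])

# Mean typicality ratings for items with the most common vs. least common
# values along the partially diagnostic dimensions (higher = more typical).
typ_common = np.array([6.1, 5.8, 6.9, 5.2, 6.0])
typ_rare = np.array([4.9, 5.6, 4.2, 5.0, 4.7])

# Difference score: positive values indicate sensitivity to the feature
# distribution (prototype-consistent items rated as more typical).
diff_scores = typ_common - typ_rare

# One-sample t tests against chance and against zero, as reported in the text.
print(stats.ttest_1samp(inference, popmean=0.5))
print(stats.ttest_1samp(diff_scores, popmean=0.0))
```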


Table 3 Structure of Key Test Trials Used in Experiment 2

Typicality of Trained Examples
  Y | 0000*   Y | 0100*   Y | 0200*   Y | 0300*   Y | 0400*   Y | 0011   Y | 0111
  Z | 1011*   Z | 1111*   Z | 1211*   Z | 1311*   Z | 1411*   Z | 1000   Z | 1100

Single Feature Correlation
  – | – – 1 ?   – | – – ? 1   – | – – 0 ?   – | – – ? 0

Single Feature Inference
  Y | – – ? –   Y | – – – ?   Z | – – ? –   Z | – – – ?

Note. Items possessing the most common feature values along partially diagnostic dimensions are marked with an asterisk.

Sensitivity to non-diagnostic correlation

According to the proportion correct on the single feature correlation test, observation learners showed significantly more sensitivity to the correlation (M = 0.78, SD = 0.31) than classification learners (M = 0.63, SD = 0.35), F(1, 124) = 7.58, p = 0.01, η² = 0.06. Both classification learners, t(63) = 14.14, p < 0.001, d = 0.37, and observational learners, t(63) = 20.15, p < 0.001, d = 0.90, performed better than chance (see Fig. 6).

Summary of results

Consistent with our predictions, the observation group learned more than the classification group about both the distribution of features and the correlation between features. As in the previous experiment, learning through observation resulted in greater sensitivity to the properties that were common within a category, even though this knowledge was not necessary to discriminate the categories. Further, while observation and classification learners both showed sensitivity to the non-diagnostic correlation between features, observation learners were more knowledgeable regarding this correlation. These results support the hypothesis that observational learning results in greater knowledge of internal structure, in this case knowledge of the relationships between features.

General discussion

The abundance of research using supervised classification as an experimental approach has led to a rich set of empirical findings and theoretical advances. However, it has also perpetuated biases and assumptions about the goals of the research, the types of phenomena that are important to study, the kinds of experiments that yield broadly meaningful results, and the nature of what counts as explanation in the study of human categorization (see Murphy, 2002, 2003; see also Hoffman & Rehder, 2010; Love et al., 2004; Markman & Ross, 2003; Murphy & Medin, 1985; Solomon, Medin, & Lynch, 1999). Despite recognition that the field faces limitations tied to this dominant paradigm, and despite an increase in the prevalence of alternative research strategies, there has not been great progress in our ability to answer basic questions about the nature of category learning. The present work is intended to help address the need for: (1) systematic investigation of when, why, and in what ways category learning tasks promote a strict emphasis on diagnostic feature-to-class knowledge; (2) deeper treatment of the difference between classification and observational learning modes; and (3) development of a theoretical framework for organizing and clarifying research on category learning beyond the traditional classification task.

Learning via classification versus observation

Across two experiments, we found evidence that making a category label response and receiving feedback on each trial biases learners toward a more discriminative approach. By contrast, observational learners developed representations that were more robust in terms of sensitivity to distributional information about partially predictive features and to consistent correlations between features. It is important to note that evidence for these more robust representations would not have been revealed through the common practice of looking exclusively at categorization accuracy (e.g., Ashby et al., 2002; Estes, 1994; Ramscar et al., 2010).

These data suggest that statistical regularities along completely non-predictive dimensions are not learned well through either task. There are at least two ways to interpret this lack of sensitivity. First, it could reflect a real property of category learning: when learning about contrast categories like forks and spoons, it is possible that one actually does not learn about common properties such as shininess or smoothness because these properties cannot be used to determine membership. A possibility that we consider more likely is that real-world category learning is robust enough to support the acquisition of this type of information, but that certain elements common to both tasks caused learners to process information in a way that was more discriminative than would occur naturally.

The results support our initial intuition that the observation task should not be considered a purely generative task. For observational learners, the extent of knowledge about a particular feature dimension depended on its diagnosticity, something that would not be the case had learners been developing a truly generative model of the information. Rather, the observation task was simply less likely to bias learners toward focusing exclusively on the most diagnostic properties. Eye-tracking studies of classification learning (e.g., Rehder & Hoffman, 2005) have found that learners initially test simple rules but also spread their gaze broadly across features; after diagnostic features are discovered, the learner is likely to stop fixating on features that are not necessary for telling the categories apart. One possibility is that observational learners engage in initial hypothesis testing in a manner akin to classification, but are more likely to maintain a broad focus on features that are unnecessary for distinguishing categories or redundant, allowing for continued learning. Our results suggest that this continued attentional breadth applies only (or most strongly) to those dimensions that are at least partially helpful for discrimination. Another possibility is that the observational task changes the nature of hypothesis testing, for example by encouraging learners to test a greater number of simple hypotheses or by discouraging testing of the narrowest hypotheses regarding feature diagnosticity (e.g., unidimensional rules).

Table 4 Average Typicality Ratings for Examples With and Without Common Features Across Category, Task, and Stimulus Set in Experiment 2

                              Stimulus Set 1        Stimulus Set 2
                              Class      Obs        Class      Obs
Yugli
  with common features        5.74       6.30       5.84       5.93
  without common features     4.91       5.02       4.84       4.20
  difference                  0.84       1.28       1.00       1.73
Zifer
  with common features        5.77       6.18       5.69       6.23
  without common features     4.91       4.97       4.70       4.75
  difference                  0.86       1.21       0.98       1.48

Note. Class = classification task; Obs = observation task.


Fig. 5 Average typicality difference score and performance on single feature inference test by task in Experiment 2. The difference score for each participant was calculated by subtracting the average typicality rating of examples possessing the least frequent values along partially diagnostic dimensions from the average typicality rating of those possessing the most frequent values. Accordingly, higher difference scores and higher proportion correct on the single feature inference test indicate greater knowledge of feature information. Across both measures, people in the observation condition demonstrated more knowledge of partially diagnostic features (although the effect on difference scores was only marginally significant)


Fig. 6 Proportion correct on single feature correlation test for nondiagnostic feature correlation along partially diagnostic feature dimensions in Experiment 2. Learning through observation resulted in significantly better performance on the single feature correlation test


Using the generative framework to evaluate human learning experiments

This work represents a step forward in applying the generative/discriminative framework to identify the extent to which experimental conditions may lead to different types of learning. However, the classification task itself is likely not the only factor contributing to a bias toward discriminative learning and toward discriminative-oriented psychological models and theories. Support for this can be seen in learners' insensitivity to non-diagnostic dimensions regardless of task within the current experiments. Beyond task, the nature of the standard stimuli and category structures used in laboratory studies may also work against generative-style learning. Experiments with category structures that are asymmetrical or non-mutually exclusive, or that do not use exactly two categories, could decrease the tendency toward a discriminative approach. Increasing the richness of materials by manipulating the number of feature dimensions (e.g., Hoffman & Murphy, 2006) or the number of values along each dimension may lead learners to consider categories to be more naturalistic and less like logic puzzles. In addition, including prior knowledge or causal relationships between features may make learning more authentic and encourage the learning of more information than is strictly necessary to discriminate (Kaplan & Murphy, 2000; Murphy, 2003; Murphy & Allopenna, 1994; Murphy & Medin, 1985; Murphy & Wisniewski, 1989). These directions would help us to better understand the kinds of learning phenomena that are universal versus those that arise due to discriminative task demands inherent in the traditional paradigm.
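In machine learning terms (Mitchell, 2010; Ng & Jordan, 2001), a generative classifier models the feature distributions within each category, whereas a discriminative classifier fits only what separates the categories. The sketch below is our own illustration of that distinction, not a model from the paper: a naive Bayes learner trained on the abstract Experiment 2 items retains the within-category distributions (e.g., the 5/7 split on the partially diagnostic dimensions) even though a rule on the first dimension already classifies perfectly.

```python
from collections import Counter

# Abstract Experiment 2 items: digit 1 is perfectly predictive, digit 2 is the
# tail, digits 3-4 are the partially diagnostic, correlated dimensions.
data = {
    "Y": ["0000", "0100", "0200", "0300", "0400", "0011", "0111"],
    "Z": ["1011", "1111", "1211", "1311", "1411", "1000", "1100"],
}

# Generative side: estimate p(value | category) for every dimension.
# These parameters encode within-category structure that classification
# does not require, e.g. p(value '0' on dimension 3 | Y) = 5/7.
params = {
    (label, dim): {
        value: count / len(examples)
        for value, count in Counter(ex[dim] for ex in examples).items()
    }
    for label, examples in data.items()
    for dim in range(4)
}
print(params[("Y", 2)])  # {'0': 0.714..., '1': 0.285...}
print(params[("Z", 2)])  # {'1': 0.714..., '0': 0.285...}

# Discriminative side: a unidimensional rule on digit 1 suffices, so a purely
# discriminative learner never needs the distributions estimated above.
def rule(ex):
    return "Y" if ex[0] == "0" else "Z"

assert all(rule(ex) == label for label, exs in data.items() for ex in exs)
```

Note that naive Bayes, by assuming conditional independence, would still miss the correlation between dimensions 3 and 4; capturing it requires modeling their joint distribution, which is precisely the kind of internal structure that our observational learners showed sensitivity to.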

Conclusion

Information that distinguishes categories is only part of what is necessary for conceptual knowledge to be applied successfully across tasks and situations. As learners, we must also be able to represent what category members tend to be like. Our claim is that the way category learning has typically been tested has led to learning and representations that de-emphasize this latter form of information. Because theories and models are driven by data from such experiments, they, too, may have become overly reliant on differentiating between categories rather than mastering the nature of categories. We have applied the generative/discriminative framework to the psychology of human category learning and demonstrated its use in predicting and interpreting experimental results. While we aim to be cautious in our claims regarding the potential of the generative/discriminative framework, we believe a conceptual tool like this can offer: (1) new questions to ask about how concepts are learned and represented; (2) specific predictions regarding how certain elements of the learning task may push learners to adopt certain strategies; (3) a tool by which to interpret category learning results; and (4) a means of evaluating formal models of category learning. We hope the present work represents an advance from which more will follow.

Author Note We acknowledge the valuable comments of Greg Murphy and help from Paul Melman and other members of the Learning and Representation in Cognition Lab at Binghamton University, Binghamton, NY, USA.

Appendix

Initial instructions for classification learners

“Imagine that a new planet has been discovered in a nearby galaxy. Two types of living creatures have been identified by researchers. These creatures are called: Yugli and Zifer. As part of a student training program, you are being asked to learn about the Yugli and Zifer creatures. Researchers have explored this planet, so pictures of the creatures are now available. You will be shown some pictures in order to learn about the Yuglis and Zifers. TRAINING TASK INSTRUCTIONS: In your training task you will be shown pictures of the creatures one at a time. For each creature, you will decide whether it is a Yugli or a Zifer. You will then be given feedback telling you if you were right or wrong. At first you will not know anything about the two types of creatures, but you will gain experience as you go along. Try your best to learn how to recognize Yuglis and Zifers because you will be tested on your knowledge of them. Good Luck!”

Initial instructions for observational learners

“Imagine that a new planet has been discovered in a nearby galaxy. Two types of living creatures have been identified by researchers. These creatures are called: Yugli and Zifer. As part of a student training program, you are being asked to learn about the Yugli and Zifer creatures. Researchers have explored this planet, so pictures of the creatures are now available. You will be shown some pictures in order to learn about the Yuglis and Zifers. TRAINING TASK INSTRUCTIONS: In your training task you will be shown pictures of the creatures one at a time. Before you see each picture, you will be told whether it is a Yugli or a Zifer. At first you will not know anything about the two types of creatures, but you will gain experience as you go along. Try your best to learn how to recognize Yuglis and Zifers because you will be tested on your knowledge of them. Good Luck!”

Instructions for endorsement task

“Good job! Now that you are familiar with the types of creatures, you will be tested on what you have learned in a number of ways. First, as before, pictures representing creatures will be shown to you. This time, each will come with a statement about whether it is a Yugli or a Zifer. If you agree with the statement, please click “Agree”. If you disagree with the statement, please click “Disagree”. You will not be told whether you are right or wrong.”

Instructions for eliciting typicality ratings

“Now we will show you some pictures of creatures labeled as Yuglis or Zifers. Please rate how good of an example (i.e., typical or representative) each is of the type indicated. An example that you consider typical of its type should be rated high on the scale, while an example that you think is not so typical of its type should be rated low.”

Instructions for single feature inference test

“Next, we will show you pictures of the creatures with one feature missing. We will tell you whether the creature is a Yugli or a Zifer. Please choose which of the two feature options you believe would be most likely.”

Instructions for single feature correlation test

“Now we will show you pictures of the creatures with only one of their features and two possible options for another feature. Please choose which of two feature options would be most likely.”

References

Ahn, W., & Medin, D. L. (1992). A two-stage model of category construction. Cognitive Science, 16, 81–121.
Anderson, A. L., Ross, B. H., & Chin-Parker, S. (2002). A further investigation of category learning by inference. Memory & Cognition, 30, 119–128.
Anderson, J. R. (1991). The adaptive nature of human categorization. Psychological Review, 98, 409–429.
Ashby, F. G., Alfonso-Reese, L. A., Turken, A. U., & Waldron, E. M. (1998). A neuropsychological theory of multiple systems in category learning. Psychological Review, 106, 529–550.
Ashby, F. G., Maddox, W. T., & Bohil, C. J. (2002). Observational versus feedback training in rule-based and information-integration category learning. Memory & Cognition, 30, 666–677.
Ashby, F. G., Queller, S., & Berretty, P. M. (1999). On the dominance of unidimensional rules in unsupervised categorization. Perception & Psychophysics, 61, 1178–1199.
Barsalou, L. W. (1983). Ad hoc categories. Memory & Cognition, 11, 211–227.
Chin-Parker, S., & Ross, B. H. (2002). The effect of category learning on sensitivity to within-category correlations. Memory & Cognition, 30, 353–362.
Chin-Parker, S., & Ross, B. H. (2004). Diagnosticity and prototypicality in category learning: Inference learning and classification learning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 30, 216–226.
Dye, M., & Ramscar, M. (2009). No representation without taxation: The costs and benefits of learning to conceptualize the environment. Sophia: 2nd International Analogy Conference.
Estes, W. K. (1994). Classification and cognition. New York: Oxford University Press.
Goldstone, R. L. (1996). Isolated and interrelated concepts. Memory & Cognition, 24, 608–628.

Goldstone, R. L., Steyvers, M., & Rogosky, B. J. (2003). Conceptual interrelatedness and caricatures. Memory & Cognition, 31, 169–180.
Hayes, B. K., & Younger, K. (2004). Category-use effects in children. Child Development, 75, 1719–1732.
Hoffman, A. B., & Murphy, G. L. (2006). Category dimensionality and feature knowledge: When more features are learned as easily as fewer. Journal of Experimental Psychology: Learning, Memory, and Cognition, 32, 301–315.
Hoffman, A. B., & Rehder, B. (2010). The costs of supervised classification: The effect of learning task on conceptual flexibility. Journal of Experimental Psychology: General, 139, 319–340.
Hsu, A. S., & Griffiths, T. L. (2010). Effects of generative and discriminative learning on use of category variability. In Proceedings of the 32nd Annual Conference of the Cognitive Science Society.
Imai, S., & Garner, W. R. (1965). Discriminability and preference for attributes in free and constrained classification. Journal of Experimental Psychology, 69, 596–608.
Jee, B. D., & Wiley, J. (2007). How goals affect the organization and use of domain knowledge. Memory & Cognition, 35, 837–851.
Kaplan, A. S., & Murphy, G. L. (2000). Category learning with minimal prior knowledge. Journal of Experimental Psychology: Learning, Memory, and Cognition, 26, 829–846.
Kemler Nelson, D. G. (1984). The effect of intention on what concepts are acquired. Journal of Verbal Learning and Verbal Behavior, 23, 734–759.
Kruschke, J. K. (1992). ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99, 22–44.
Kruschke, J. K. (2001). Toward a unified model of attention in associative learning. Journal of Mathematical Psychology, 45, 812–863.
Kruschke, J. K. (2003). Attention in learning. Current Directions in Psychological Science, 12, 171–175.
Kruschke, J. K. (2011). Models of attentional learning. In E. M. Pothos & A. Wills (Eds.), Formal approaches in categorization (pp. 120–152). Cambridge: Cambridge University Press.
Kruschke, J. K., & Johansen, M. K. (1999). A model of probabilistic category learning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 25, 1083–1119.
Kurtz, K. J. (2007). The divergent autoencoder (DIVA) model of category learning. Psychonomic Bulletin & Review, 14, 560–576.
Little, D. R., & Lewandowsky, S. (2009). Better learning with more error: Probabilistic feedback increases sensitivity to correlated cues in categorization. Journal of Experimental Psychology: Learning, Memory, and Cognition, 35, 1041–1061.
Love, B. C. (2002). Comparing supervised and unsupervised category learning. Psychonomic Bulletin & Review, 9, 829–835.
Love, B. C., Medin, D. L., & Gureckis, T. M. (2004). SUSTAIN: A network model of human category learning. Psychological Review, 111, 309–332.
Markman, A. B., & Ross, B. H. (2003). Category use and category learning. Psychological Bulletin, 129, 592–613.
McKinley, S. C., & Nosofsky, R. M. (1996). Selective attention and the formation of linear decision boundaries. Journal of Experimental Psychology: Human Perception and Performance, 22, 294–317.
Medin, D. L., Altom, M. W., Edelson, S. M., & Freko, D. (1982). Correlated symptoms and simulated medical classification. Journal of Experimental Psychology: Learning, Memory, and Cognition, 8, 37–50.
Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85, 207–238.
Medin, D. L., Wattenmaker, W. D., & Hampson, S. E. (1987). Family resemblance, conceptual cohesiveness, and category construction. Cognitive Psychology, 19, 242–279.
Minda, J. P., & Ross, B. H. (2004). Learning categories by making predictions: An investigation of indirect category learning. Memory & Cognition, 32, 1355–1368.

Mitchell, T. M. (2010). Generative and discriminative classifiers: Naive Bayes and logistic regression. In T. M. Mitchell, Machine learning. New York: McGraw-Hill.
Murphy, G. L. (2002). The big book of concepts. Cambridge: MIT Press.
Murphy, G. L. (2003). Ecological validity and the study of concepts. In B. H. Ross (Ed.), The psychology of learning and motivation (Vol. 43, pp. 1–41). San Diego: Academic Press.
Murphy, G. L., & Allopenna, P. D. (1994). The locus of knowledge effects in concept learning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 20, 904–919.
Murphy, G. L., & Medin, D. L. (1985). The role of theories in conceptual coherence. Psychological Review, 92, 289–316.
Murphy, G. L., & Wisniewski, E. J. (1989). Feature correlations in conceptual representations. In G. Tiberghien (Ed.), Advances in cognitive science: Theory and applications (Vol. 2, pp. 23–45). Chichester: Ellis Horwood.
Ng, A. Y., & Jordan, M. I. (2001). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Advances in Neural Information Processing Systems 14.
Nosofsky, R. M. (1984). Exemplar-based accounts of relations between classification, recognition, and typicality. Journal of Experimental Psychology: Learning, Memory, and Cognition, 14, 700–708.
Nosofsky, R. M., Palmeri, T. J., & McKinley, S. C. (1994). Rule-plus-exception model of classification learning. Psychological Review, 101, 53–79.
Pothos, E. M., & Wills, A. J. (Eds.). (2011). Formal approaches in categorization. Cambridge: Cambridge University Press.
Ramscar, M., Yarlett, D., Dye, M., Denny, K., & Thorpe, K. (2010). The effects of feature-label-order and their implications for symbolic learning. Cognitive Science, 34, 909–957.
Rehder, B., Colner, R. M., & Hoffman, A. B. (2009). Feature inference learning and eyetracking. Journal of Memory and Language, 60, 393–419.

Rehder, B., & Hoffman, A. B. (2005). Eyetracking and selective attention in category learning. Cognitive Psychology, 51, 1–41.
Rehder, B., & Murphy, G. L. (2003). A knowledge-resonance (KRES) model of category learning. Psychonomic Bulletin & Review, 10, 759–784.
Rosch, E. (1978). Principles of categorization. In E. Rosch & B. B. Lloyd (Eds.), Cognition and categorization. Hillsdale, NJ: Erlbaum.
Rosch, E., & Mervis, C. B. (1975). Family resemblances: Studies in the internal structure of categories. Cognitive Psychology, 7, 573–605.
Ross, B. H. (1997). The use of categories affects classification. Journal of Memory and Language, 37, 249–267.
Ross, B. H. (1999). Post-classification category use: The effects of learning to use categories after learning to classify. Journal of Experimental Psychology: Learning, Memory, and Cognition, 25, 743–757.
Ross, B. H. (2000). The effects of use on learned categories. Memory & Cognition, 28, 51–63.
Shepard, R. N., Hovland, C. L., & Jenkins, H. M. (1961). Learning and memorization of classifications. Psychological Monographs, 75(13, Whole No. 517).
Smith, J. D., & Minda, J. P. (1998). Prototypes in the mist: The early epochs of category learning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 24, 1411–1436.
Solomon, K. O., Medin, D. L., & Lynch, E. (1999). Concepts do more than categorize. Trends in Cognitive Sciences, 3, 99–105.
Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.
Wattenmaker, W. D. (1991). Learning modes, feature correlations, and memory-based categorization. Journal of Experimental Psychology: Learning, Memory, and Cognition, 17, 908–923.
Yamauchi, T., Love, B. C., & Markman, A. B. (2002). Learning nonlinearly separable categories by inference and classification. Journal of Experimental Psychology: Learning, Memory, and Cognition, 28, 585–593.
Yamauchi, T., & Markman, A. B. (1998). Category learning by inference and classification. Journal of Memory and Language, 39, 124–148.
