Progress in Neurobiology Vol. 37, pp. 383 to 431, 1991. Printed in Great Britain. All rights reserved.

0301-0082/91 $0.00 + 0.50 © 1991 Pergamon Press plc

GENERALIZATION AND SPECIALIZATION IN ARTIFICIAL NEURAL NETWORKS

STEVEN HAMPSON

Department of Information and Computer Science, University of California, Irvine, California 92717, U.S.A.

(Received 26 February 1991)

CONTENTS

1. Introduction
2. Node structure and training
  2.1. Introduction
  2.2. Node structure
  2.3. Prototypic classification
  2.4. Node training
  2.5. Neural mechanisms for perceptron training
  2.6. Input order
  2.7. Alternative LTU models
  2.8. Continuous features and certainty
  2.9. Representing continuous features
  2.10. Excitation, inhibition and facilitation
3. Improving on perceptron training
  3.1. Introduction
  3.2. Perceptron training time complexity
  3.3. Output-specific feature associability
  3.4. Adaptive origin placement
4. Learning and using specific instances
  4.1. Introduction
  4.2. Focusing
  4.3. Generalization vs specific instance learning
  4.4. Use of specific instances
    4.4.1. Stimulus learning
    4.4.2. Stimulus description
    4.4.3. Known specific instances
    4.4.4. Constrained learning circumstances
    4.4.5. Pattern completion and pattern association
    4.4.6. "Working" or "scratch pad" memory
    4.4.7. Specific instance based categorization
    4.4.8. Combining generalization and specific instances
5. Boolean operator structure and training
  5.1. Introduction
  5.2. Network interconnections and association types
  5.3. Operator training (disjunctive representation)
  5.4. Operator training results
  5.5. Shared memory focusing
    5.5.1. Input driven learning
    5.5.2. Error driven focusing
  5.6. Input-specific feature associability
6. Summary and conclusions
7. Further reading and notes
References

1. INTRODUCTION

The work presented in this review was done in the context of a larger computer simulation of neural networks that are capable of adaptive problem solving (e.g. maze running). This article is based on the first few chapters of "Connectionistic Problem Solving" (Hampson, 1990) with the permission of Birkhauser, Boston. Thanks are due to Jack Beausmans and John Justeson who read and commented on both the book and this article.

Simulation and analysis of artificial neural networks requires a large number of simplifying assumptions. More complex models permit greater biological realism, but whether at the neuron or network level, simulation complexity is always a limiting factor. Considering that Cray supercomputers have been used to model individual synapses, it is clear that simulation "from first principles" is not possible.
The behavior of more biologically realistic models is also generally harder to characterize formally. Fortunately, higher-level regularities exist that can be formalized, simulated and analyzed at their own level. The value of investigating these more abstract models for the purpose of understanding biological systems is ultimately an empirical question. This paper considers some of the empirical and formal characteristics of perhaps the simplest neuron model, a Linear Threshold Unit (LTU).

In Section 2, the class of functions computed by an LTU is discussed in terms of prototypic categorization. A standard LTU training algorithm (perceptron training) is discussed and related to the Rescorla-Wagner model of classical conditioning. Possible physiological implementations are considered.

In Section 3, the characteristics of perceptron training are considered. Two modifications of perceptron training (variable associability and origin shifting) are discussed. These modifications address identified limitations of perceptron training and empirically improve learning speed. Both of these modifications have identifiable analogs in biological learning characteristics.

In Section 4, an alternate LTU training rule (focusing) is developed so that specific instances (individual stimulus patterns) can be rapidly learned. This specialized approach addresses a significant deficiency of perceptron training: while perceptron training is good at learning prototypic generalizations, it is poor at learning specific instances. Specific instance detectors permit a wide range of behaviorally relevant applications besides simple stimulus categorization. This general-specific dichotomy has much in common with the procedural-declarative distinction that is often made when characterizing biological behavior. Training speed and categorization accuracy can be significantly improved if specific instance detectors are used in conjunction with perceptron training.

In Section 5, multilevel networks are discussed and two methods of network training are considered. These training algorithms are based on perceptron training and focusing, and consequently reflect, at the network level, the strengths and weaknesses of those two approaches as discussed in the context of single node training. Network-level applications of variable associability and origin shifting are also described.

An LTU is a very abstract neuron model, and consequently misses most of the richness and complexity of actual neural information processing. However, many of the general characteristics of LTU representation and training find natural parallels in biological systems, and simulation and analysis of LTUs and simple LTU networks often gives insight into the general characteristics of actual neural systems.

2. NODE STRUCTURE AND TRAINING

2.1. INTRODUCTION

This section develops the basic model neuron/node and node training algorithm. The basic node is a Linear Threshold Unit (LTU). The standard LTU is a thresholded linear equation that is used for the binary categorization of feature patterns. An LTU can be viewed as computing a type of prototypic classification. One important learning mechanism for a node is the perceptron training algorithm. Although neither the representation nor the training strategy is explicitly biological, an LTU is a standard (simplified) model of neural computation, and perceptron training is sufficiently similar to biological behavior during classical conditioning to be of interest. Possible neural mechanisms for perceptron training have been worked out to a considerable extent. Both the node representation and the training algorithm are simple enough to permit a significant amount of formal analysis. Variations on the basic model are considered, and while discussion is largely limited to binary input features, some extensions to deal with continuous input values are explored.

2.2. NODE STRUCTURE

In general, a node/neuron takes some number of input values and computes an output value. It may also have a specialized "teacher" input (Fig. 1). In the classical Linear Threshold Unit (LTU) model, the input patterns are represented as vectors of binary values, the absence of a feature being represented by 0 and its presence by 1. One additional component is also necessary to represent the constant or threshold term of the LTU. For representational convenience, the threshold term can be represented by an input feature that is constantly on. Thus, a feature vector with d features can be represented by F = (F0, F1, ..., Fd), where Fi is the value of the ith feature, and F0, the additional component associated with the threshold, or constant term of the linear equation, is always equal to 1. Similarly, the coefficients of the linear equation (the synaptic weights of the neuron) are represented as a weight vector W = (W0, W1, ..., Wd).


FIG. 1. The model node/neuron. A node's single output value is some function of its input features. A dashed teacher signal may also be used to train the node.

An LTU classifies a feature vector, F, by computing its dot product with W and comparing with 0:

    Out := sum over i of (Fi * Wi)
    if Out > 0 then Out := 1 else Out := 0.

That is, if F . W (also written FW) is greater than 0, the feature vector F is classified as a positive instance, and the node fires (i.e. Out = 1). If FW is less than 0, F is classified as a negative instance and the node does not fire (Out = 0). (You can do what you want with FW = 0 as long as you are consistent.) For example, if F = (1 0 1) and W = (1 2 -3), FW = (1*1) + (0*2) + (1*-3) = -2, so F is classified as a negative instance. Any LTU function of binary features has a solution weight vector for which FW is never 0, and the analysis presented here is generally for weight vectors with that property.

Geometrically, an LTU weight vector describes a hyperplane, or more generally a separating surface, that partitions the input space into positive and negative regions. For binary features, the "input space" is limited to the vertices or corners of a d-dimensional hypercube. Alternatively, the linear equation can be thought of as measuring similarity to an arbitrary prototypic point on the surface of the d-dimensional hypersphere that circumscribes the hypercube (Fig. 2a). The closer an input pattern is to the prototypic point, the larger the dot product FW. The threshold, W0, simply divides the similarity measure into "true" and "false" regions. If all relevant features are weighted equally, an LTU is specialized to the "at least x of m features" function, written (x of m). If all features are relevant and equally weighted (that is, if m = d), the prototypic points are restricted to the vertices of the hypercube (Fig. 2b). The (x of m) function can be further specialized to (1 of m) = OR and (m of m) = AND (Fig. 3). For example, the weight vector (-1 2 2 2) computes OR of three features and the weight vector (-5 2 2 2) computes AND.

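For concreteness, the classification rule can be written out in a few lines of Python. This is a minimal sketch of the rule as stated above, not code from the original simulations; the function and variable names are mine.

    def ltu_classify(f, w):
        # f and w include the constant threshold feature: f[0] = 1, w[0] = W0.
        fw = sum(fi * wi for fi, wi in zip(f, w))   # dot product FW
        return 1 if fw > 0 else 0                   # fire iff FW > 0

    # The example from the text: F = (1 0 1), W = (1 2 -3) gives FW = -2.
    print(ltu_classify((1, 0, 1), (1, 2, -3)))      # 0 (negative instance)

    # OR and AND of three features as (x of m) weight vectors (Fig. 3):
    w_or  = (-1, 2, 2, 2)    # fires if at least 1 of 3 features is present
    w_and = (-5, 2, 2, 2)    # fires only if all 3 features are present
    print(ltu_classify((1, 0, 1, 0), w_or))         # 1
    print(ltu_classify((1, 0, 1, 0), w_and))        # 0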


Thresholded summation is the most common abstract model of single neuron function. Its biological justification is that the decision to fire a neuron is made (ideally) in one place only, the base of the axon, where all dendritic inputs are summed. The actual process is more complex (Kuffler et al., 1984; Kandel and Schwartz, 1985; Shepherd, 1988) and only partly understood, but at a minimum is presumably capable of something like simple summation. One desirable characteristic of summation is that it can be implemented as a one-step, parallel operation. In particular, multiple current sources can add their contribution to a final summed current, independent of other current sources. Consequently, it is possible to integrate the contributions of any number of features in a constant amount of time.

The LTU model has a threshold value which the summed inputs must exceed in order to fire the node at all, and above which it fires at full intensity. Besides any computational advantages this binary strategy may have, it is a reasonable approximation of neural behavior. Standard neurons do respond in an all-or-nothing fashion. However, this does not mean that magnitude information cannot be conveyed. Input is also temporally summed; that is, the effect of an input spike is spread over a short period of time, and so can summate with spikes that arrive soon after it. Because of this temporal summation, firing frequency can act as a magnitude measure (Adrian, 1946; Barlow, 1972). Frequency, rather than direct analog representation of magnitude information, has some theoretical advantages, but it may also simply reflect biological limitations inherent to neurons. In simulated systems, there does not appear to be any immediate practical benefit in transmitting magnitude by frequency modulation, so while explicit frequency representation provides greater biological realism (Gluck et al., 1989), most models represent magnitude directly as an amplitude, which is considerably easier to model.


FIG. 2. The category an LTU computes is determined by the direction of the weight vector and the location of the threshold along that direction. (a) For LTU functions the weight vector can point between valid input patterns. (b) For (x of d) functions the weight vector must point at a valid input pattern (a corner of the hypercube).



FIG. 3. AND and OR are the extremes of the (x of m) function. For example, in 3 dimensions: (a) W = (-5 2 2 2) = (at least 3 of 3 features present) = AND. (b) W = (-3 2 2 2) = (at least 2 of 3 features present). (c) W = (-1 2 2 2) = (at least 1 of 3 features present) = OR.

If a continuous output signal is desired, the unbounded dot product FW can be bounded between lower and upper bounds (e.g. -1...1 or 0...1) by thresholding, or by asymptotically approaching them, as in a sigmoid (Fig. 4). For example, the function 1/(1 + e**(-x)) is commonly used as a sigmoid "squashing function". It varies between 0 and 1 as x varies between negative and positive infinity. Under some circumstances, the magnitude of neural response seems to vary with the logarithm (or as a power function) of input intensity (Shepherd, 1988, p. 218), and similar psychophysical scaling functions have been proposed to relate the actual stimulus intensity to the perceived "sensation magnitude" (Gescheider, 1988).

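For reference, the three output functions of Fig. 4 can be written as follows. This is a minimal sketch of standard definitions, not code from the paper:

    import math

    def step(x):                        # (a) a single threshold
        return 1 if x > 0 else 0

    def clipped(x, lo=0.0, hi=1.0):     # (b) thresholds at lower and upper bounds
        return max(lo, min(hi, x))

    def sigmoid(x):                     # (c) squashing function 1/(1 + e**-x)
        return 1.0 / (1.0 + math.exp(-x))

    print(step(-2), clipped(1.3), round(sigmoid(0.0), 2))   # 0 1.0 0.5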

FIG. 4. Three possible functions of the summed input. (a) A single threshold, or step function. (b) Two thresholds at lower and upper bounds. (c) A sigmoid function between lower and upper bounds. Other functions such as log or power functions are also of possible interest.

Besides the "standard", additive input features, most models utilize specialized inputs of one sort or another. The existence of nonadditive feature interaction is of both biological and theoretical interest. Certainly different neurotransmitters have different effects and time courses of action (Thompson, 1985, Chap. 5; Siggins and Gruol, 1986; Panksepp, 1986; Foote and Morrison, 1987; Shepherd, 1988, Chap. 8, 24; Daw et al., 1989). For example, under some circumstances the neurotransmitter norepinephrine (NE) has limited direct excitatory or inhibitory effects; instead, it acts in a modulatory role by adjusting the sensitivity of the target neuron to its other inputs (Bloom, 1979; Woodward et al., 1979; Rogawski and Aghajanian, 1980; Tucker and Williamson, 1984; McEntee and Mair, 1984; Gold, 1984a; Foote and Morrison, 1987; Gordon et al., 1988; Servan-Schreiber et al., 1990a,b). Such specialized (multiplicative) input can be thought of as adjusting the "gain" in neural signal processing. The anatomical characteristics of the NE system support its role as a more general modulator, rather than a carrier of detailed information. There are relatively few NE producing neurons (about 3000 in the rat, 26,000 in man) which are located in a single nucleus (the locus coeruleus), but they ramify profusely and project over much of the brain (Foote and Morrison, 1987). Other transmitters can also modulate overall neural excitability (Kaczmarek and Levitan, 1986; Foote and Morrison, 1987). Besides the range of different neurotransmitters, there are a number of other complicating factors: a single transmitter can have different receptor types which produce different effects, a single receptor may be affected by multiple transmitters, a receptor may
be regulated by other factors such as membrane voltage or internal cytoplasmic conditions, and a single neuron can have multiple receptor types and produce multiple transmitters. In addition, there are hundreds of morphologically distinct neuron types, presumably with specialized computational characteristics. Thus, a wide range of potential neural computations exist. For detailed physiological simulation, correspondingly complex neuron models are necessary, but the abstract model developed here will be limited to simple linear summation of excitatory and inhibitory input.

The computational limitations of a linear function may parallel biological limitations. Of the 16 possible Boolean functions of 2 features, only Exclusive Or ((0 1) and (1 0) are True) and Equivalence ((0 0) and (1 1) are True) cannot be expressed as linear functions. Geometrically, this is equivalent to saying that a line (hyperplane) cannot be drawn through a square (hypercube) so as to include the two positive instances on one side without including a third supposedly negative instance. Biologically, some experiments have found Exclusive Or and Equivalence to be the most difficult to learn (Neisser and Weene, 1962; Hunt et al., 1966; Bourne, 1970; Bourne et al., 1979, Chap. 7). On the other hand, some experiments have found no consistent difference in difficulty in learning linearly and nonlinearly separable categories (Medin and Schwanenflugel, 1981; Medin, 1983; Wattenmaker et al., 1986), so that particular point is questionable. However, any nonlinear Boolean function can be represented in a two-level network using only the functions OR and AND, and the relative contribution of one- and multi-level learning presumably depends on the particular function and training circumstances.

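To make the two-level claim concrete, here is a small sketch (my own construction, not from the text) computing Exclusive Or with three LTUs; each positive instance gets an AND-like detector, and an output node ORs the detectors:

    def ltu(f, w):
        return 1 if sum(fi * wi for fi, wi in zip(f, w)) > 0 else 0

    def xor(x1, x2):
        # One AND-like detector per positive instance, then OR of the detectors.
        h1 = ltu((1, x1, x2), (-1, -2, 2))    # fires only on (0 1)
        h2 = ltu((1, x1, x2), (-1, 2, -2))    # fires only on (1 0)
        return ltu((1, h1, h2), (-1, 2, 2))   # OR of the two detectors

    print([xor(a, b) for a in (0, 1) for b in (0, 1)])   # [0, 1, 1, 0]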

2.3. PROTOTYPIC CLASSIFICATION

As previously observed, an LTU of binary features can be viewed as measuring similarity to some ideal or "prototypic" point. It gives its strongest response to inputs near the prototype, and decreasing output for decreasing similarity to it. The (x of m) function assumes that all relevant features are of equal importance, while a full LTU function allows varying weights to be attached to the different features. With this representation, a large weight means that a feature is highly predictive of the category, not that the feature is (necessarily) frequently present when the category is. This is different from the more common assumption that the prototypic point for a category is simply the average of the feature values over all patterns in the category. Both approaches are "prototypic" in the sense that input patterns are classified by similarity to a prototypic point, but the definition of that point is quite different in the two approaches. There are numerous formal models of prototype definition and similarity measurement, and different learning models to acquire the relevant information, but some form of prototypic similarity, or family resemblance detection, is generally considered a useful and demonstrated capability for biological organisms (see further reading, 2.1). When categories can be represented effectively as prototypic clusters, a representation system with a bias toward such prototypic grouping can benefit both in terms of economy of representation, and perhaps more importantly, predictive generalization over as yet unseen stimuli.

Although there is much support for the use of prototypic categorization, there are also many limitations of prototype theory as a complete model of human categorization behavior (Murphy and Medin, 1985; Wattenmaker et al., 1986; Barsalou, 1985; Medin et al., 1987; Hunt, 1989, p. 620). Among other things, a single category (e.g. animal) may not have a single prototype, but may require multiple prototypes (fish, bird, etc.). More significantly, prototypic classification presupposes the "proper" level of featural description. For example, receptor/pixel-level visual descriptions of an object before and after a shift or rotation should not be expected to have a significant number of features in common. It is the same object, and consequently should be, and is, perceived as similar, but its receptor-level description can be completely changed by simple eye motion. As a more extreme example, a response to an image of a dog might generalize to the written or spoken word "dog". In general, the most important properties/features of an object for the purpose of classification may not be explicit in its raw, receptor-level representation; they must be computed or inferred. Although prototype theory is far from a complete model of human categorization behavior, it provides a useful and relatively simple component for more complex systems. Its relative contribution may vary considerably depending on the circumstances. Adult humans may rely more heavily on logical analysis than simple prototypic similarity, but simpler organisms probably rely more on simple mechanisms. In addition, an argument can be made that the naturalistic categories an animal must deal with are more apt to have a prototypic structure than the linguistic categories defined by English words. In any event, prototype theory provides a useful framework when considering the classification power of LTU neurons.

2.4. NODE TRAINING

There are many organizing processes that shape the nervous system. During early development, controlled cell division, migration and differentiation take place, and specific patterns of interconnection develop. Later, selective cell death and synaptic degeneration occur, producing a highly structured system before it is exposed to the external environment (Lund, 1978; Cowan, 1979; Oppenheim, 1981; Kuffler et al., 1984, Chap. 19; Edelman et al., 1985; Crutcher, 1986; Williams and Herrup, 1988; Shepherd, 1988, Chap. 9). Some aspects of neural growth are present throughout the life span of most organisms, and at the synaptic level, the formation of new synapses and learning-related structural changes are often observed (Tsukahara, 1981, 1984; Thompson et al., 1983; Greenough, 1984, 1985; Black and Greenough, 1986; Wenzel and Matthies, 1985; Crutcher, 1986). Metabolic shifts may also contribute to adaptive plasticity by affecting neural excitability. Most importantly (from the perspective of the model developed here), even with a fixed set of neurons and connections, a considerable amount of
neural, and therefore behavioral, plasticity remains, due to the existence of variable-strength synapses. Many mechanisms of synaptic modification exist, ranging from short-term transmitter depletion to long-term structural change (Brindley, 1967). It has been suggested that there are specialized plasticity controlling systems in the nervous system, and that specific neurotransmitters such as norepinephrine, acetylcholine and dopamine may have identifiable roles in such a plasticity modulating function (Kety, 1982; McGaugh, 1983, 1989; Gold and Zornetzer, 1983; Kasamatsu, 1983, 1987; Singer, 1984; Bear and Singer, 1986; Gordon et al., 1988; Pettigrew, 1985; Carlson, 1986, Chap. 13). However, the detailed mechanisms of plasticity control are still far from worked out. In many cases, these plasticity control systems appear to simply modulate the learning rate in other systems, but more direct involvement in teaching or reinforcement is also possible. For example, individual neurons can sometimes be "taught" to increase their firing rate using local applications of dopamine as a reinforcer for firing (Stein and Belluzzi, 1988). In general, neural information processing is a complex process, and virtually any step in the process is a potential site for adaptive modification. Specialized neural systems would presumably have specialized learning strategies to adjust the particular parameters of their computation. The current discussion is limited to training an LTU for pattern classification.

At a formal level, the ability to train an LTU as a pattern classifier is well known as the perceptron convergence theorem (Rosenblatt, 1962; Nilsson, 1965; Minsky and Papert, 1972; Duda and Hart, 1973). As previously described, feature vectors are classified according to their dot product, FW. If the feature vector F is misclassified, the weight vector W is adjusted by:

(1) W := W + F if FW is too low
(2) W := W - F if FW is too high.

For example, if F = (1 1 1) and W = (1 1 -2), their dot product FW is 0. If FW should be positive, then W := W + F = (2 2 -1). After this adjustment of W, FW is 3. Likewise, if FW should be negative, W := W - F = (0 0 -3), and FW becomes -3. This training rule is appropriate for either binary or continuous input features. It is biologically implausible that synapses can change sign (excitatory or inhibitory), but an equivalent effect can be achieved by using two weights of fixed sign per feature.

By controlling the amount of F added or subtracted on each adjustment, a linear equation can also be trained for continuous output (e.g. the least-mean-squares (LMS) rule (Widrow and Hoff, 1960; Widrow and Stearns, 1985; Duda and Hart, 1973)). That is, rather than simply adding or subtracting F, the weight vector is adjusted by ±F * a, where a is chosen so that FW yields the desired output. With perceptron training, a node is trained for binary output, and when its sign is correct for all input patterns, total output error is zero. With LMS training for continuous output, total output error asymptotically approaches zero. Thus convergence on continuous output can only be defined as falling within a particular error rate, not as absolute convergence.

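The adjustment rule and the worked example above are easy to express in code. The following is a minimal sketch under the conventions used here (F0 is the constant threshold feature); the function names are mine, not the author's.

    def perceptron_train(patterns, targets, w, max_cycles=100):
        # Cycle through the training pairs, adjusting w only on mistakes,
        # until a complete cycle requires no adjustment.
        w = list(w)
        for _ in range(max_cycles):
            mistakes = 0
            for f, target in zip(patterns, targets):
                out = 1 if sum(fi * wi for fi, wi in zip(f, w)) > 0 else 0
                if out != target:
                    mistakes += 1
                    sign = 1 if target == 1 else -1   # W := W + F or W := W - F
                    w = [wi + sign * fi for fi, wi in zip(f, w)]
            if mistakes == 0:
                return w
        return w

    # Learn AND of two features; F = (F0, F1, F2) with F0 = 1 (threshold term).
    patterns = [(1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
    targets  = [0, 0, 0, 1]
    print(perceptron_train(patterns, targets, (0, 0, 0)))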

There are other variations on this general approach (Duda and Hart, 1973; Sutton and Barto, 1981; Parker, 1986), but the simplicity of perceptron training and its relative ease of analysis make it a useful point of reference.

An algorithm that is essentially equivalent to the LMS rule has also been proposed as a model of learning during classical conditioning (Rescorla and Wagner, 1972). For example, if a dog is consistently fed after a tone and light are presented together, but not when they are presented separately, it will learn to salivate to the input pattern (1 1) (= tone and light), but not to (0 0), (0 1) or (1 0). Food presentation is referred to as the Unconditioned Stimulus (US), salivation as the Unconditioned Response (UR), and the tone and light are Conditioned Stimuli (CS). In this case the Conditioned Response (CR) is the same as the UR, but in general it need not be. Classical conditioning has been studied in a wide range of animals, and has been demonstrated in very simple organisms (e.g. slugs, Sahley et al., 1981). Besides simple conditioning phenomena, various aspects of human category learning are consistent with this learning strategy (Gluck and Bower, 1988a,b; Estes et al., 1989). More complex models of classical conditioning are capable of covering more of the empirical data, but, potentially, the whole organism can be brought to bear on the learning task. Consequently, a "complete" model of classical conditioning would have to be correspondingly complex. Classical conditioning is a training and testing paradigm, not a particular learning process or mechanism. The attraction of the Rescorla-Wagner model is that it produces reasonable coverage with a very simple mechanism.

The difference between thresholded and continuous output has some interesting behavioral consequences in classical conditioning. For example, while perceptron training and Rescorla-Wagner learning are both capable of learning the function OR if the input features never co-occur, Rescorla-Wagner learning has trouble if the ORed features sometimes occur together. If the individual features correctly predict the magnitude of the US (e.g. the amount of shock or food), then their combination would amount to an overprediction. Consequently, they would be reduced when they occurred together, thus reducing their separate predictive value (Rescorla, 1970; Rescorla and Holland, 1976). Because an LTU gives only a binary prediction, there is no such thing as an overprediction, and stable OR weights can be learned for any combination of features.

It should also be pointed out that the temporal aspects of classical conditioning are not dealt with here. For example, in actual training, the stimulus (CS) precedes the teacher (US), so that classification is actually a prediction. The temporal aspects of this prediction are obviously important in real-world behavior, but are largely avoided in the simplified model developed here.

Perceptron training was not developed as a model of any particular biological learning process, but its similarity to models of classical conditioning makes it of biological interest, and its simplicity makes it more accessible to formal analysis than more complex, biologically motivated processes. Many of these formal results appear relevant to biological learning. In addition, variations on perceptron training (or LMS) are used in many network training algorithms, so many issues that are relevant to single-node training are relevant to network training as well. However, the added complexity of network training makes analysis significantly more difficult. In general, a good understanding of single node behavior has much to offer in understanding network properties.

One important measure of the training characteristics of perceptron training is the number of mistakes/adjustments required to learn a given function. This can be quantified to some extent since one proof of the perceptron convergence theorem (Nilsson, 1965, p. 82) provides an upper bound on the number of adjustments needed for training an LTU:

    M * |W|**2 / a**2

Here M is the squared length of the longest input vector, |W| is the length of some solution vector W, and "a" is the minimum value of |FW| over all input vectors. This provides an upper bound on learning time for the function that W computes. Empirical results have generally reflected the time complexity results based on this upper bound, making it a good source of insight when considering the relative difficulty of learning different functions or the effects of different training conditions. Time complexity results will often be presented in "O" notation, which implies an upper bound with constant multipliers and lower order terms dropped (e.g. 6 + 3n + 4(n**2) is denoted O(n**2)). At other times results are simply approximated.

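As a worked example (my own, under the stated definitions), the bound can be evaluated for the three-feature AND function of Fig. 3, using the solution vector W = (-5 2 2 2):

    from itertools import product

    W = (-5, 2, 2, 2)     # AND of three features (Fig. 3a)
    patterns = [(1,) + p for p in product((0, 1), repeat=3)]   # F0 = 1

    M    = max(sum(fi * fi for fi in f) for f in patterns)     # longest |F|**2: 4
    w_sq = sum(wi * wi for wi in W)                            # |W|**2 = 37
    a    = min(abs(sum(fi * wi for fi, wi in zip(f, W)))
               for f in patterns)                              # min |FW| = 1

    print(M * w_sq / a ** 2)    # 148.0: at most 148 adjustments for this W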

2.5. NEURAL MECHANISMS FOR PERCEPTRON TRAINING

As a model of neural learning, perceptron training can be interpreted as:

(1) If the neuron's output is too low, increase the synaptic weights of the active input features.
(2) If the neuron's output is too high, decrease the synaptic weights of the active input features.

A similar learning process occurs in the gill withdrawal reflex of Aplysia, a learning system in which neural mechanisms of classical conditioning have been extensively studied (Kandel, 1976, 1979; Carew et al., 1983; Hawkins and Kandel, 1984; Abrams, 1985; Carew, 1987). Behaviorally, the gill withdrawal reflex is quite simple. If the animal is poked (the CS), the gill is withdrawn (the CR). This response can be habituated by continual poking (jets of water from a water pic in the original experiments) so that the reflex is completely suppressed. The response returns, however, if the animal is "hurt" by electrical stimulation (the US). The response is especially facilitated if the animal is poked prior to being hurt (Kandel and Schwartz, 1982; Carew et al., 1983). Thus the response displays both associative (specific to the active input features) and nonassociative conditioning. At the risk of being teleological, the purpose of this system can be interpreted as a protective withdrawal from possible trauma. That is, since the animal may be hurt after initial contact with a foreign object, it
is adaptive to withdraw the delicate gill at the first signs of danger. If the animal is continually poked with no deleterious results, perhaps by a piece of seaweed, the response should be suppressed (habituated), so that the animal isn't continually holding its breath. If the animal is suddenly hurt, a reasonable strategy would be to reinstate (sensitize) the response until it could be safely habituated again. Sensitization is especially reasonable if the painful experience was immediately preceded by being poked. This may be taking liberties in identifying the purpose of Aplysia behavior, but the observed functioning of the system is consistent with this interpretation. In any case, it is not unreasonable to consider it as a mechanism which fulfils this purpose, even if the true biological significance of the system is debatable.

The neural circuitry underlying this behavior is relatively simple. A pain sensor can be thought of as "instructing" a motor neuron which receives input from a poke sensor (Fig. 5). If pain follows being poked, the gill should have been withdrawn and its input synapses are (presynaptically) strengthened by sensitization. The important aspect of this from the perspective of perceptron training is that the synapse is especially strengthened if the poke detector just fired. If no pain occurs, the gill should not have been withdrawn and its input synapses are (also presynaptically) weakened by habituation.

The most significant difference between perceptron training and Aplysia learning is that perceptron training adjusts only on mistakes (i.e. if the motor neuron performed incorrectly), while gill withdrawal (as described) adjusts on all inputs. However, the resulting behavioral phenomenon of "blocking" (a correct response on the basis of existing features blocks conditioning of new features paired with them) (Kamin, 1969; Mackintosh, 1978) has been demonstrated in other molluscs (Sahley et al., 1981; Sahley, 1984; Sahley et al., 1984; Gelperin et al., 1985), so it seems reasonably safe to assume that the appropriate neural mechanisms for perceptron training (and the corresponding characteristics of classical conditioning) do exist in simple organisms.


FIG. 5. Aplysia gill withdrawal network (simplified). A pain detector adjusts the synaptic strength between the poke detector and the motor neuron.


Possible neural implementations of classical conditioning have been suggested (Hawkins and Kandel, 1984; Hawkins, 1989a,b; Gelperin et al., 1985; Tesauro, 1986; Gluck and Thompson, 1987; Thompson, 1989; Donegan et al., 1989; Gluck et al., 1990), but a complete mechanism is still not known. A more abstract, but more complete neural model for classical conditioning is shown in Fig. 6. The US signal (pain) will always activate the response (withdrawal) directly. The CS (poking) may or may not activate the response, depending on the strength of the variable synapse between it and the response. The teacher node compares the US signal to the current response. If the response is not correctly triggered by the CS prior to the US signal, then a teacher signal of the proper sign is sent to the response node. That is, the teacher node simply subtracts the response signal from the US signal and outputs the result (-1 = decrease, 0 = OK as is, 1 = increase) as a teacher signal for the response. If the CS reliably precedes the US, then its associative strength to the response will increase until the strength of the CS-response connection correctly predicts the US, at which point the teacher signal will go to zero. If a number of CS features are provided, the circuit is capable of implementing the complete perceptron training algorithm. This abstract circuit is compatible with a more complex, biologically motivated circuit proposed to explain classical conditioning in the cerebellum (Thompson, 1986, 1989).

The LTU model assumes a specialized "teacher" input, which tells a node when it should have been on or off, or perhaps whether it should increase or decrease its output. Since a linear equation can be trained for continuous output, the teacher might also specify a particular desired output value or amount of output change. Alternatively, the teacher might only provide an evaluation of preceding action, in which case the node receiving the instruction would have to decide on the appropriate direction and amount of modification. Consequently, there are a number of different forms that "instructive" information can take. Likewise there are a number of different physiological mechanisms by which instructive information might be differentiated from "standard" input. For example, it might be specialized in time (the period after a stimulus was presented and after the neuron fired or should have fired), synapse type (as in gill withdrawal), location (cell body vs dendrites), signal pattern (high frequency bursts or more complex patterns), system timing (in conjunction with "theta" rhythm activity), or transmitter type (norepinephrine, acetylcholine, dopamine). Besides a range of implementations for instructive input, there are different possible sites of associative plasticity. For example, although associative learning is thought to be a presynaptic process in Aplysia gill withdrawal, what appears to be at least partly postsynaptic associative learning has been demonstrated in mammalian neurons (Brown et al., 1990), suggesting the possibility of quite different neural mechanisms for similar functional properties. Thus, there need not be only one mechanism for implementing perceptron training, or any other formal learning rule.

Although there may be identifiable physiological mechanisms for implementing a formal learning rule, formal rules should not be expected to cover all details of physiological plasticity. As previously observed, there are many processes involved in neural information processing, and a correspondingly large number of opportunities for plasticity. Formal rules are at best idealized abstractions, and seldom precisely capture what a given biological system is "actually computing". Detailed simulation of physiological mechanisms of associative learning (e.g. Alkon et al., 1989; Byrne and Gingrich, 1989; Byrne et al., 1989; Tam and Perkel, 1989; Brown et al., 1989) often underscores the considerable gap between abstract learning rules and the actual physiology of neural plasticity.

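The circuit of Fig. 6 can be simulated directly. This sketch is my own reading of the figure; the learning rate a and the function names are assumptions, and the update (weight change = a * teacher * CS) is the LMS-style rule discussed above:

    def condition(cs_patterns, us_signals, w, a=0.5, cycles=20):
        # Train the variable CS-response synapses from the teacher signal.
        w = list(w)
        for _ in range(cycles):
            for cs, us in zip(cs_patterns, us_signals):
                response = 1 if sum(c * wi for c, wi in zip(cs, w)) > 0 else 0
                t = us - response    # teacher: -1 decrease, 0 OK as is, +1 increase
                w = [wi + a * t * c for wi, c in zip(w, cs)]  # active CS only
        return w

    # CS pattern = (constant, tone); the tone reliably precedes the US.
    print(condition([(1, 1), (1, 0)], [1, 0], (0, 0)))
    # The tone weight grows until the CS alone triggers the response,
    # at which point the teacher signal goes to zero and learning stops.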


FIG. 6. A simple circuit for classical conditioning. The US always produces a response. The response level just prior to US presentation is subtracted from the US and used as a teacher signal (+ = increase, 0 = stay same, - = decrease) for the output node. If the CS reliably precedes the US, then its variable connection to the output node will gain in strength until the correct response is produced by the CS. The teacher signal will then go to zero and the output node will quit learning.

2.6. INPUT ORDER

When measuring empirical behavior during LTU training, a sufficient test for convergence is to make a complete cycle through all input vectors without requiring an adjustment. For testing convenience, it is therefore useful to use presentation orders in which such cycling occurs. Both the number of cycles and the number of adjustments to convergence provide useful measures of learning speed. Since all available information has been presented by the end of the first cycle, the total number of cycles to convergence is a reasonable measure of "learning efficiency". On the other hand, the perceptron convergence proof provides an upper bound on the number of adjustments. Consequently, most formal analysis will be in terms of the number of adjustments. Since the order of input pattern presentation affects the number of cycles and adjustments to convergence, it is useful to consider a number of different presentation strategies.

One simple presentation strategy is to present the inputs in "numeric" order. With binary features, each input pattern can be viewed as a binary number. In a d-dimensional space there are 2**d input vectors, which can be cycled through in the implied numeric order (i.e. 0 to 2**d - 1). For multivalued features, the n**d input vectors can be viewed as base n numbers, where n is the number of values a feature can assume (for binary features, n = 2). The inputs can then be cycled through in the implied order. No single ordering is adequate to characterize empirical behavior, but numeric order provides an easily generated test case.

"Shuffle-cycle" ordering randomizes the order of input patterns before each cycle. This would seem the least biased input ordering and a reasonable measure of "average" performance. However, it is of limited value in measuring best or worst case performance.

A final ordering technique has proved useful in measuring the extremes of single node performance, and provides some insight as to what constitutes "good" and "bad" training instances. Based on the upper bound on adjustments from the perceptron convergence proof, the input vectors (the Fs) can be ordered so as to maximize (or minimize) the term |F|**2 / |FW|**2, given a known solution vector W. Input presentation starts at the top of the list, and whenever an adjustment is made (i.e. there is a misclassification) presentation is restarted at the top. This leads to nearly worst (or best) case performance. These are referred to as "least-correct" and "most-correct" ordering, respectively. The |F|**2 term means that, on the average, adjusting on longer input vectors results in slower learning. The |FW|**2 term means that learning is slow for adjustments on instances near the category boundary (near hits and near misses) and rapid for adjustments on inputs more distant from the separating surface. For binary features, the central prototype and its negation are the most distant points in the positive and negative regions, and are consequently the most informative. Conversely, during training, correct classification of input patterns is learned in order of decreasing distance to the separating surface (i.e. most prototypic first, and boundary last), even for patterns that are not actually presented to the system (Fig. 7). Both of these characteristics are generally true in biological learning studies (Mervis and Rosch, 1981; Homa, 1984).

From an alternative point of view, given a function and a representation structure, the boundary patterns are those points that, if classified correctly, guarantee that all other input patterns are also classified correctly. This depends on the particular representation structure, as a representation with fewer degrees of freedom can be constrained with fewer points. For example, the same points that would constrain an LTU would generally not constrain a Quadratic Threshold Unit (QTU). Least-correct ordering considers the boundary patterns first, so, while it requires the largest number of adjusts, it requires the fewest number of different input patterns to learn the function. Cover (1965) shows that, on the average, 2d patterns are sufficient to define an LTU function.

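The orderings themselves are straightforward to generate. A sketch under the paper's conventions (my function names; the scoring follows the |F|**2 / |FW|**2 term from the convergence bound):

    import random

    def numeric_order(d):
        # All 2**d binary patterns in numeric order, with F0 = 1 prepended.
        return [(1,) + tuple((i >> k) & 1 for k in reversed(range(d)))
                for i in range(2 ** d)]

    def shuffle_cycle(patterns):
        # Randomize the presentation order before each cycle.
        order = list(patterns)
        random.shuffle(order)
        return order

    def least_correct(patterns, w):
        # Order by decreasing |F|**2 / |FW|**2, so that instances near the
        # separating surface (small |FW|) come first. Assumes a solution
        # vector for which FW is never 0 (Section 2.2).
        def score(f):
            fw = sum(fi * wi for fi, wi in zip(f, w))
            return sum(fi * fi for fi in f) / fw ** 2
        return sorted(patterns, key=score, reverse=True)

    print(least_correct(numeric_order(3), (-5, 2, 2, 2))[0])   # (1, 1, 1, 1)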


FIG. 7. Input patterns close to the separating surface lead to slow learning and are the last to be correctly classified during perceptron training.

The average number of different patterns missed using least-correct order on randomly generated LTU functions closely approximates that value. Empirically, most-correct results are not greatly different from shuffle-cycle results. As expected, fewer adjusts are required, but also slightly fewer different patterns. Empirically, the number of different patterns missed grows as about 1.7**d using either ordering technique.

2.7. ALTERNATIVE LTU MODELS

There are several alternative models for implementing an LTU. For binary inputs they are computationally equivalent, but their training characteristics and response to continuous input can differ considerably. In the classical model, feature absence and presence are represented as 0 and 1. In the symmetric model, feature absence is represented as -1 rather than 0. Classification and training algorithms are the same with both representations. However, an adjustment of the symmetric model always adjusts all weights, and a fixed fraction, 1/(d + 1), of the adjustment is allocated to the threshold. In the classical model, only features that are present are adjusted, and the "threshold fraction" varies between 1 and 1/(d + 1). The two-vector model is the third representation. In this case feature presence and absence are represented as separate values, and a node associates two weights with each feature, one for feature presence and one for feature absence. For binary input, there is no need for an explicit threshold weight; it is implicit in the 2d weights of the weight vectors. For classification, present features use the present weights and absent features use the absent weights. Weights are adjusted in a similar manner. For binary input, the two-vector model is equivalent to the symmetric model with a threshold fraction of 0.5; that is, with half the total weight change associated with the threshold.

At the conceptual level, there are different strategies for adjusting the threshold.

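The three input encodings can be compared directly. The following sketch (my own, for illustration) builds the same binary pattern under each model:

    def classify(f, w):
        return 1 if sum(fi * wi for fi, wi in zip(f, w)) > 0 else 0

    x = (0, 1, 1)                       # raw binary pattern, d = 3

    # Classical: absence = 0, presence = 1, explicit threshold feature F0 = 1.
    f_classical = (1,) + x

    # Symmetric: absence = -1, presence = +1, same explicit threshold.
    f_symmetric = (1,) + tuple(2 * xi - 1 for xi in x)

    # Two-vector: separate "present" and "absent" signals for each feature;
    # the threshold is implicit in the 2d weights.
    f_two_vector = x + tuple(1 - xi for xi in x)

    print(classify(f_classical, (-1, 2, 2, 2)))    # OR of three features: 1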


At one extreme, the size of the category's positive region is not changed (i.e. the area of the positive region on the hypersphere does not change). The category is modified by shifting the direction of its central prototype to include or exclude the current input, but maintaining the same size of the threshold relative to the length of the weight vector. Equivalently, the weight vector (without the threshold) can be normalized to a fixed length (e.g. 1.0), in which case the tip of the weight vector is always on the unit hypersphere. Obviously this extreme case will not converge unless the threshold is already fortuitously correct. At the other extreme, the prototype is not shifted at all, but the threshold is adjusted to include/exclude the current instance. This also will fail to converge if the central prototype is "misplaced", but anything short of a threshold fraction of 1 will eventually converge. Since the threshold determines the size of the grouping, this provides a continuum of learning strategies with an adjustable bias in the amount of generalization. In the first case, approximately as many input patterns are shifted out of the group recognized by the LTU as are shifted into it. In the second case, instances are only added or subtracted on a single adjust, but never both. An interesting intermediate point occurs if the threshold fraction is 0.5, as is the case with the two-vector model. Because at least half of the total output adjustment is in the "correct" direction, this method of adjustment retains the characteristic that instances are only added or subtracted on a single adjustment, while also accomplishing a shift in the central prototype. Empirically, a threshold fraction of about 0.2 to 0.3 appears to be optimal. The theoretical optimum is unknown, and may be a function of the testing conditions.

Geometrically, (0, 1) and (-1, 1) input correspond to coordinate systems in which the origin is at a corner, or at the center of the input hypercube, respectively. With binary input, only the corners of the cube are valid inputs, but with multivalued features the cube contains a grid of valid inputs. With continuous input, the cube is solid. Besides the obvious choices of locating the origin in the center or at a corner, any number of coordinate systems are possible. These choices have equal representational power provided they are linear transformations of each other, although specific functions may be much easier to learn in particular coordinate systems. For binary features, the two-vector model is representationally equivalent to a single-vector LTU, but for continuous features the two-vector model is different in two respects. First, it lacks an explicit threshold; thus all solution surfaces must contain the origin. Second, the solution surface is not restricted to a single hyperplane, but may consist of different hyperplanes in each (hyper-)quadrant, provided that they meet at the quadrant boundaries. Consequently, if an explicit threshold (or constant input feature) is provided, the two-vector model is representationally more powerful.

The nervous system appears to use all three approaches. For example, some neurons can modulate their output both above and below a resting frequency of firing: "Note... the high rate of resting impulse discharge in the nerve; this means that inhibitory as well as excitatory changes are faithfully encoded. This high set point is a
common property of many cells in the vestibular and auditory pathways and the associated cerebellar system. In some species the resting frequency is incredibly constant, which enhances the ability of the nerve to transmit extremely small signals and have them detected by centers in the central nervous system" (Shepherd, 1988, p. 311). However, in most other cases, the resting frequency is typically rather low (or 0) (Crick and Asanuma, 1986), thus limiting the resolution of downward modulation. Separate representation of feature presence and absence is found in the visual system, where "on-center" and "off-center" cells detect the presence or absence of light against a complementary background. Again, however, this is a specialized system and the principle may not be widely utilized. There is also behavioral evidence that absent features are not utilized as readily as present ones (Jenkins and Sainsbury, 1970; Hearst, 1978, 1984, 1987; Barsalou and Bower, 1984), so the mathematical option of treating feature presence and absence in a symmetric fashion may be of limited biological relevance.

2.8. CONTINUOUS FEATURES AND CERTAINTY

It is often convenient to think of features as being either present or absent, but stimulus intensity is an important aspect of real-world perception. Animals demonstrate their sensitivity to stimulus intensity in many circumstances, including both classical and instrumental conditioning. Although all organisms must deal with stimuli of variable intensity, a number of different strategies exist for representing and processing that information, each with its particular time/space characteristics. One advantage of connectionistic representation and processing over more symbolic approaches is an ability to directly compute on continuous values between the binary extremes of full on and full off. Depending on the application, these intermediate values can be viewed as an intensity measure or as a degree of certainty.

If continuous output is viewed as a measure of certainty, continuous versions of the three binary LTU models provide different capabilities. If input and output values are limited to be between 0 and 1 (classical), and interpreted as 0 to 100% probability of feature presence, there is no place on the scale which represents "unknown". This is a problem inherent in any implementation of the law of the excluded middle (i.e. Not(True) implies False). Most logic applications simply confound false and unknown. One possible approach to this problem is to use a second signal to represent confidence in the first. However, since this requires the tight coordination of two signals to represent each feature, it seems unlikely as a general biological principle. A more tractable two-value approach is to use one signal to represent confidence in a feature's presence and another to represent confidence in its absence (two-vector). No special relationship between the signals is required; they can be treated as independent features. Unknown would be represented by zero evidence for both presence and absence. Positive values for both would indicate conflicting information, a common state of affairs when dealing with real-world situations. A similar four-valued logic
(true, false, unknown, conflicting) has been utilized in Artificial Intelligence (AI) systems (Belnap, 1977; Doyle, 1979). An alternative arrangement using only a single value is to establish a baseline "floating" level of output (e.g. 0) and express confidence in the presence or absence of the feature as variations above and below that level (symmetric). Input and output can be constrained to be between -1 (certainly not present) and 1 (certainly present). Unknown is represented as the "floating" middle value of 0. A similar three-valued logic has also been employed in AI systems (Shortliffe and Buchanan, 1975). With this single-value representation, "no information" and "conflicting information" are confounded. This confounding of information is not necessarily fatal, but under some circumstances the distinction may be important.

At present there are no completely successful formal logic models of intermediate certainty, so a general description of the appropriate information to represent is impossible. Bayesian probability calculations provide a well-founded approach, but the required conditions of mutually exclusive and exhaustive alternatives cannot be guaranteed in most real-world situations. Consequently, a strict application is not generally possible. If theoretical purity cannot be immediately achieved, ease of calculation has something to recommend it. Initial attempts to propagate a separate certainty signal were not rewarding. On the other hand, the continuous three- or four-valued logic signals are easily propagated through thresholded summation. The resulting output magnitude is a useful value, but cannot be strictly interpreted as a certainty measure. It seems likely that the mechanisms of neural integration constrain the type of information that can be accurately propagated through individual neurons.

2.9. REPRESENTING CONTINUOUS FEATURES

In artificial systems it is often convenient to represent continuous features directly as real-valued numbers, or in hardware as analog signals. Neurons can convey analog information over short distances as a continuous, graded membrane potential, but longer distance communication is essentially binary (spiking or nonspiking). Consequently, such a direct amplitude representation is not generally feasible. Several alternative magnitude representations are possible though, each with distinct characteristics.

As previously observed, the standard approach is to view a neuron's frequency of firing as a magnitude measure (Adrian, 1946; Barlow, 1972). By integrating the neuron's input over time, this frequency can be converted to an equivalent direct amplitude representation. Since neurons can potentially vary their frequency of firing around a "resting" level, both positive and negative magnitude are possible. More complex magnitude codes are also theoretically possible. For example, a node could represent its output magnitude as a binary number (i.e. as a specific sequence of 0s and 1s), rather than just the relative frequency of 0s and 1s.

The next approach is frequently used in symbolic AI representations. By dividing a continuous range into a number of discrete subranges, each subrange
can be associated with a Boolean value. There are numerous ways this can be achieved. For example, the subranges can be nonoverlapping, symmetrically overlapping with neighboring subranges, or nested starting at one of the extremes (Atkinson and Estes, 1963). Note that the latter requires only a threshold difference between otherwise identical units. Subrange representation of the frequency continuum is well developed in the auditory cortex, and a subrange representation of auditory amplitude has also been observed (Tunturi, 1952; Suga, 1977, 1990; Suga and Manabe, 1982; Knudsen et al., 1987; Teas, 1989, p. 423; Gallistel, 1990, p. 496). Again, more complex encodings are possible. For example, the values 1 through n can be represented as binary numbers using log2(n) nodes. In general, any temporal code that a single node might produce can be represented as a spatial code by using a set of nodes.

Finally, it should be recognized that the frequency modulation approach requires time to reliably sample an input's firing frequency. High accuracy requires an extended observation period. For rapid calculation, this time may not be available (Marr, 1982; Sejnowski, 1986). Subrange representation avoids that problem, but requires a specialized range encoder. Both of these problems can be avoided by viewing a neuron as a probabilistic device (Little and Shaw, 1975; Ackley et al., 1985; Barto, 1985; Clark, 1988). For example, a continuous output between 0 and 1 can be converted into a probability of firing in a given time unit. In this model a single channel can simply be replicated to reduce integration time. That is, observing x inputs over one unit of time is equivalent to observing one input over x units of time. At the extreme, essentially no integration time is necessary; the accuracy with which the continuous value is encoded is determined solely by the number of probabilistic units encoding it. These approaches are, of course, not mutually exclusive and can be combined to optimize the time/space/resolution requirements of the particular application. The model developed here is limited to binary inputs, but with minimal modification can be extended to deal with continuous or multivalued inputs as well.

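Two of the magnitude representations mentioned here are simple to sketch (illustrative only; the subrange boundaries and sample count are arbitrary choices of mine):

    import random

    def nested_subranges(x, thresholds=(0.2, 0.4, 0.6, 0.8)):
        # Nested subranges starting at one extreme: unit i fires if x exceeds
        # its threshold (otherwise-identical units differing only in threshold).
        return [1 if x > t else 0 for t in thresholds]

    def probabilistic_units(x, n=1000):
        # A continuous value in 0..1 as the firing probability of n replicated
        # units; resolution is set by n rather than by integration time.
        return sum(random.random() < x for _ in range(n)) / n

    print(nested_subranges(0.55))       # [1, 1, 0, 0]
    print(probabilistic_units(0.55))    # approximately 0.55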

2.10. EXCITATION, INHIBITION AND FACILITATION

The LTU model assumes the simple summation of positive (excitatory) and negative (inhibitory) input. However, physiologically, this is not entirely correct. Excitatory inputs do roughly summate in depolarizing the cell membrane, but "shunting" inhibition is more like a localized, multiplicative reduction of excitatory effects. In addition, inhibitory effects are generally longer lasting than excitatory effects, and there is little evidence for adaptive plasticity in inhibitory synapses. Thus, at the physiological level, there is an asymmetry between excitation and inhibition which is not reflected in the LTU model. At the binary extremes of full on and full off, the difference between multiplication and subtraction may be immaterial, but for intermediate values the difference could be significant.

The Rescorla-Wagner learning model for classical conditioning also assumes an additive symmetry between excitation and inhibition. This has not been a serious problem in most applications of the model, but a detailed analysis of behavioral inhibition does point out some limitations (Rescorla, 1979, 1982c, 1985; Rescorla et al., 1985a). For example, additive inhibition permits summation below zero, while the multiplicative model implies a floor effect at zero, an observed behavioral phenomenon which has long been a source of annoyance for the Rescorla-Wagner model (Zimmer-Hart and Rescorla, 1974; Rescorla, 1985, p. 318; Kaplan and Hearst, 1985). More seriously, the potential for local interaction of specific excitatory and inhibitory inputs would produce results quite different from those predicted by a simple additive model. Any particular inhibitory synapse can be thought of as "modulating" the excitatory inputs in its vicinity. For modulatory synapses on or near the cell body, the effect would be quite global, but modulating synapses on dendrites would have feature-specific effects (Grossberg, 1970, 1983; Pinter, 1985). Localized, dendritic modulation has been proposed to explain directional sensitivity in the retina (Poggio and Koch, 1987; Grzywacz and Amthor, 1989; Grzywacz and Poggio, 1990). The behavioral observation of this sort of stimulus interaction is another problem for the basic Rescorla-Wagner model (Holland and Lamarre, 1984; Lamarre and Holland, 1985, 1987; Holland, 1982, 1984, 1985a,b, 1986a,b, 1989c; Jenkins, 1985). Similar behavior can be achieved by using a network of strictly additive nodes, but it is a direct consequence of single node behavior if a slightly more accurate model of neural inhibition is used. There is no requirement that the particular characteristics of individual neuron information processing be directly observed at the behavioral level, but the possibility should not be discounted, especially for simple conditioning paradigms.

Neural inhibition and excitation may not be symmetrical, but it is possible that inhibition and facilitation are. If inhibition is an "increase in ionic conductance that drives the membrane potential toward the equilibrium potential" (Shepherd, 1988, p. 132), then facilitation could represent a decrease. In fact, there are specialized "conductance-decrease" synapses that have exactly that effect (Shepherd, 1988, p. 133). They amplify the effect of neighboring synapses. At the behavioral level, Rescorla has suggested that inhibition and facilitation are complementary, and distinct from excitation (Rescorla, 1985, 1986a,b, 1987, 1988; Davidson and Rescorla, 1986); that is, if inhibition is viewed as multiplying excitatory input by a value less than 1, facilitation can be viewed as a multiplying factor greater than 1. It is tempting to speculate that some of the behavioral aspects of inhibition, facilitation and "occasion setting" (Ross and Holland, 1981; Holland, 1983, 1986a,b, 1989a,b,c,d; Ross, 1983; Ross and LoLordo, 1986, 1987; Bouton and Swartzentruber, 1986; Rescorla, 1986a,b) are based on such a multiplicative mechanism. In these experiments, sequential, rather than simultaneous, stimulus pairing seems especially prone to multiplicative interaction. Weisman and Konigslow (1984) provide additional evidence that temporally separate features can interact multiplicatively rather than additively.

In summary, the actual process of neural integration displays a number of effects beyond the simple linear additive model. An LTU should thus be viewed as a convenient first approximation that is roughly within the capabilities of real neurons. Simple additive feature interaction is sufficient for almost all aspects of the behavioral model developed by Hampson (1990), but feature multiplication is also useful, so a more realistic treatment of inhibition and facilitation may be of value. The value of more complex node models depends on the particular domain of application.
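The asymmetry discussed above can be illustrated with a minimal numerical sketch (hypothetical Python, an illustration rather than a physiological model): additive inhibition subtracts from the summed excitation and can drive the total below zero, while shunting inhibition multiplicatively scales the excitation and so bottoms out at zero.

    def additive_node(excitation, inhibition):
        # LTU-style integration: inhibition is just a negative term,
        # so summation below zero is possible.
        return sum(excitation) - sum(inhibition)

    def shunting_node(excitation, inhibition):
        # Shunting inhibition: each inhibitory input (a conductance
        # increase g in [0, 1]) multiplicatively reduces the excitatory
        # total, giving a floor effect at zero.
        total = sum(excitation)
        for g in inhibition:
            total *= (1.0 - g)
        return total

    print(additive_node([1.0, 0.5], [2.0]))    # -0.5: summation below zero
    print(shunting_node([1.0, 0.5], [0.9]))    # 0.15: bounded at zero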

3. IMPROVING ON PERCEPTRON TRAINING

3.1. INTRODUCTION

When positive and negative input patterns are linearly separable, use of the perceptron training procedure guarantees convergence on the correct output. Theoretically this is sufficient, but in practice convergence may be quite slow. In this section, the basic time complexity of perceptron training is considered, and two techniques are developed which can significantly improve performance. These techniques are variable associability and adaptive origin placement. Each addresses a different limiting aspect of perceptron training.

3.2. PERCEPTRON TRAINING TIME COMPLEXITY

As previously described, one convergence proof for the perceptron training algorithm provides an upper bound of M|W| ** 2/a ** 2 on the number of adjustments before convergence. For the purpose of the proof, the fact that such a bound can be computed is sufficient to prove convergence. It is a separate issue to ask how tight the bound is, and whether it provides any useful information about the actual (empirical) training speed. For example, if U is an upper bound, U ** U is also a valid upper bound, but provides little information about actual learning speed. Empirically, using least-correct ordering and (0, 1) input, the upper bound is in fact reasonably tight. "Average" training order (e.g. shuffle-cycle ordering) is considerably faster, and (-1, 1) input further accelerates learning speed. For these conditions, the upper bound is less informative, but still often reflects the overall time complexity of the observed behavior.

Given that the bound is sufficiently tight to be of interest, it can be used to gain some insight into the conditions that lead to slow or fast learning, and what the resulting time complexity is. As was discussed in the context of input orderings, one conclusion based on the upper bound is that boundary instances are the least informative training examples and the last to be learned, while "prototypic" examples are the most informative and the first to be learned--even if they are never actually seen. Another important issue is how the complexity of a task increases with the number of features. In the upper bound, M (the squared length of the longest input vector) increases as d, which means the bound increases at least linearly with d. With integer input,

the smallest dot product, a, is always 1, and so can be ignored for the moment. ("a" cannot be 0, since input patterns exactly on the separating surface are considered to be misclassified.) The most important term is the length of the shortest solution vector, W, for a particular function. In particular, large weights take longer to learn, and the upper bound increases with the square of weight vector length. Muroga (1971) describes an LTU function that requires weights of size about 2 ** d (the values used here are approximate, see Muroga for more precise values). He also shows an upper bound of d ** (d/2) on LTU weight size. However, there are no known LTU functions which require weights bigger than 2 ** d, so that may be the true upper bound. (For functions of multivalued features, the corresponding bound appears to be n ** d, where n is the number of equally spaced values a feature can assume.) Thus, assuming 2 ** d is the actual upper bound on weight size, the upper bound on learning time for LTU functions increases as W ** 2 = (2 ** d) ** 2 = 2 ** (2d) = 4 ** d. As a point of comparison, the simple learning process of sequentially trying all combinations of weights up to size 2 ** d would require on the order of (2 ** d) ** d = 2 ** (d ** 2) adjusts. Perhaps not coincidentally, 2 ** (d ** 2) is also an upper bound on the number of LTU functions (Muroga, 1971). It is also of interest to ask what weight size, and


consequently what training time, the "average" LTU function requires. A lower bound on the required weight size is relatively straightforward to calculate. Muroga (1971) shows that there are between 2 ** ((d ** 2)/2) and 2 ** (d ** 2) distinct LTU functions. With a maximum weight size of x, an LTU can represent at most x ** d functions. In order to represent any fixed fraction of the possible LTU functions (e.g. half of the lower bound), x must be at least 2 ** (d/2). Based on this lower bound on required weight size, the upper bound on adjusts increases as W ** 2 = (2 ** (d/2)) ** 2 = 2 ** d. To summarize, a lower bound on average weight size is 2 ** (d/2) and an upper bound is (probably) 2 ** d. Using these weight bounds, the corresponding upper bounds on adjustments can be calculated. Since the upper bound is generally reflected in empirical training speed, this suggests that training time for the average LTU function is between 2 ** d and 4 ** d.

To test this analytic prediction, 100 random LTU functions were generated for 2 to 10 features, and learned using two-vector representation and least-correct, shuffle-cycle and most-correct ordering. Log (average adjusts) vs number of features is shown in Fig. 8. With least-correct ordering, the growth rate is around 3.5 ** d, although the rate may well approach 4 ** d for larger d. With shuffle-cycle or most-correct ordering, the observed growth rate is closer to 3 ** d. Thus, the "average" shuffle-cycle order is much closer to the best case than the worst case.

[Fig. 8. Log of average adjusts vs number of features. 100 random LTU functions, 2 to 10 features. Top to bottom: least-correct, shuffle-cycle, most-correct ordering.]


The average adjusts for 10 features were 4683, 7983, and 45,554, for most-correct, shuffle-cycle, and least-correct, respectively. Thus, analysis and empirical behavior both show that when using perceptron training, learning time for the average LTU function grows exponentially with the number of features. In fact, the observed number of mistakes (about 3.5 ** d) for the average LTU function is greater than the total number of input patterns (2 ** d). At first this might suggest that it is impossible to learn anything of interest in a reasonable amount of time. However, there are a number of ameliorating circumstances.

(1) While an LTU may require an exponential number of adjusts, it does not store the patterns it has seen, but only maintains a linear, O(d), number of weights. The weights may be of exponential size, but assuming they are no larger than 2 ** d, they can be represented with d bits each, for a total of d ** 2 bits per LTU. There are about 2 ** (d ** 2) LTU functions, so log2(2 ** (d ** 2)) = d ** 2 bits are needed at a minimum. Thus, for the average LTU function, an LTU is a more compact representation than memorizing all positive (or negative) input patterns, which requires on the order of 2 ** d bits. Perhaps not surprisingly, LTU weight vectors are a compact representation for LTU functions. In this context, it is interesting to note that, on the average, random LTU functions have 2d boundary patterns. Since an individual pattern can be described with d bits, this means that the average LTU function can be stored using 2 * (d ** 2) bits by storing its boundary instances. However, simply storing the boundary patterns is not in itself sufficient to classify other patterns; you still need to define a hyperplane that separates the given boundary instances. In addition, some LTU functions have an exponential number of boundary patterns, so this is not a general solution.

(2) Many interesting functions can be calculated using fixed-size subgroups of features (i.e. at most k inputs to any individual node). For example, as will be shown, it is sometimes possible to compute an exponential LTU function using (d - 1) 2-input nodes and constant sized weights.

(3) The upper bound calculations assume that all input patterns, or at least all the boundary patterns, are presented to the LTU. Real-world learning circumstances will almost certainly be very limited in the number of patterns that are actually seen. As the sparseness of the space increases, the percentage of the functions over it that are linearly separable increases, and the average linear function becomes easier to learn. Sparseness is especially useful in the vicinity of the solution hyperplane. This can be quantified to some extent since the upper bound on training time, M|W| ** 2/a ** 2, can be related to the required resolution in hyperplane placement. In the formula, "a" is the minimum dot product, FW, over all input patterns. If the length of the weight vector is normalized to 1, "a" is equal to the minimum geometric distance from any point in the input space to the hyperplane. Consequently, for a fixed weight vector, the upper bound goes down as 1/a ** 2, where "a" is increased by missing boundary patterns. This

point is considered in greater detail in Section 4.4.8.

(4) While a large number of adjusts may be required for perfect classification (complete convergence), a large percentage of the adjusts are made on a small number of boundary instances. Initial learning is generally quite rapid, in terms of the percent of the input space which is correctly classified after a relatively small number of adjusts. This is significant when only a reasonable accuracy, rather than perfect classification, is required. For example, in Fig. 9, the percentage of the total input space correctly classified is graphed vs the number of adjustments. 100 random 10-feature LTU functions were generated and learned using two-vector representation and most-correct, shuffle-cycle and least-correct ordering. The shape of the curves emphasizes the difference between most- and least-correct order. Most-correct accuracy increases rapidly and approaches 100% roughly asymptotically, subject to a fixed minimum growth rate. Least-correct presentation results in a very slow, linear increase in accuracy. These results can be loosely explained by the input selection strategy. Least-correct ordering always adjusts on the same set of patterns, those that are closest to the hyperplane. Consequently, the weight vector grows in the correct direction at a constant, linear rate. With most-correct order, the inputs furthest from the hyperplane are chosen first. However, as training proceeds, the furthest patterns are correctly classified, so the adjustments occur on points closer to the hyperplane. The rate of growth of the weight vector in the correct direction consequently decreases, and approaches the slow, linear rate produced by least-correct ordering. Shuffle-cycle results resemble those of most-correct ordering. Although an average of 7983 adjusts were required for complete convergence (100% accuracy), early learning was quite effective. 85% accuracy was achieved after only 50 adjusts, and 90% after 100 adjusts. Interestingly, the curve was not visibly affected when the number of features was increased from 10 to 14. Although the size of the input space was increased by a factor of 16, from 1024 to 16,384, overall accuracy was always within 1% of 85% and 90% at 50 and 100 adjusts. That is, while the number of adjusts required for 100% accuracy increased exponentially with d (as about 3 ** d), the number of adjusts to any other fixed accuracy did not increase at all with d.

(5) There are functions of interest that can be learned in polynomial time because they require weights of only polynomial size. The (x of d) function is one of these. In this function, all feature weights are of constant size (2), and the threshold is at most 2d. This means that all (x of d) functions, including OR and AND, are learned in O(d ** 3) time.

(6) Alternative LTU learning rules might be profitably employed in certain circumstances. For example, perceptron training can require an exponential number of adjustments to separate a linear number of input patterns. In Section 4, a one-shot learning method to train an LTU to detect specific instances is described. Using this technique, a linear number of patterns can simply be individually memorized in linear time.
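For reference, the following is a minimal sketch of the kind of training loop behind these adjustment counts (illustrative Python; it does not reproduce the original simulations, and all names are hypothetical). The threshold is carried as a constant input feature, each misclassification produces one adjustment, and shuffle-cycle ordering reshuffles the patterns on every pass.

    import random

    def ltu_output(w, x):
        # x[0] is a constant 1 acting as the threshold feature
        return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

    def perceptron_train(patterns, max_passes=10000):
        # patterns: list of (input vector, target 0/1).
        # Returns the weights and the number of adjustments made.
        w = [0.0] * len(patterns[0][0])
        adjusts = 0
        for _ in range(max_passes):
            random.shuffle(patterns)             # shuffle-cycle ordering
            errors = 0
            for x, target in patterns:
                if ltu_output(w, x) != target:
                    sign = 1 if target == 1 else -1
                    w = [wi + sign * xi for wi, xi in zip(w, x)]
                    adjusts += 1
                    errors += 1
            if errors == 0:                      # complete convergence
                break
        return w, adjusts

    # AND of two (-1, 1) features, constant threshold feature first:
    pats = [([1, -1, -1], 0), ([1, -1, 1], 0), ([1, 1, -1], 0), ([1, 1, 1], 1)]
    w, n = perceptron_train(pats)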


[Fig. 9. Boundary patterns lead to poor generalization during perceptron training. The percent of the input space that is correctly classified is plotted vs the number of adjustments. 100 random LTU functions, 10 features. Top to bottom: most-correct, shuffle-cycle, least-correct ordering.]

3.3. OUTPUT-SPECIFIC FEATURE ASSOCIABILITY

One reason that perceptron training can be slow is that irrelevant input activity is not excluded from weight modification. Because M in the upper bound grows linearly with the total number of features, and W and "a" are unchanged by the introduction of irrelevant features, the upper bound grows linearly (O(n)) with the number of irrelevant features (n). The precise rate of growth with n depends on a number of other factors such as the difficulty of the function and the input order. A similar linear effect has been observed in human learning studies (Bourne and Restle, 1959; Walker and Bourne, 1961; Bulgarella and Archer, 1962; Haygood and Stevenson, 1967). Since real-world learning conditions may contain an effectively infinite number of irrelevant features (e.g. each fluttering leaf on a tree), some method for at least partially determining the relevance of input features is necessary. In this section, an approach is developed which achieves a significant improvement in perceptron training speed by using conditional probability to determine the relative salience of the features for the particular output. This salience measure is used to separately control the "associability" or plasticity (rate of change) of each feature. The two-vector model is used in all empirical studies.

A node can incrementally calculate conditional probability "traces" for each feature as:

[X|Fi] := [X|Fi] + (X - [X|Fi]) * Fi * rt
[X] := [X] + (X - [X]) * rt

where X is the correct output (0 or 1), Fi is the input value for feature i, [X|Fi] is the probability of X = 1 given Fi, [X] is the probability of X = 1, and rt is a rate constant between 0 and 1 which determines the "memory length" of the traces. A value of 0.01 is generally used. The conditional probability [X|~Fi] (the probability of X = 1 given the absence of Fi) can be calculated in a similar manner. In standard contingency calculations, if [X|Fi] = [X] (or [X|Fi] = [X|~Fi], or [XFi] = [X] * [Fi]), Fi and X are statistically independent, and Fi is considered to be an irrelevant feature with respect to X. Most contingency models compare [X|Fi] to [X|~Fi], though good results have also been reported by comparing [X|Fi] to [X] (Gibbon, 1981; Jenkins et al., 1981; Miller and Schachtman, 1985a). The latter approach has been used in this model. (Note that if [Fi] = [~Fi], then [X|Fi] - [X] = [X] - [X|~Fi] = ([X|Fi] - [X|~Fi])/2.) If [X|Fi] > [X], Fi is predictive of X's occurrence, and if [X|Fi] < [X], Fi is predictive of X's nonoccurrence. Related probabilistic processes have been used or at least considered in numerous behavioral models (see further reading, 3.1).
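These trace updates can be written out directly (a minimal sketch in Python; variable names are hypothetical): each presentation nudges the stored estimates toward the current outcome, with rt setting the memory length.

    def update_traces(x, inputs, p_x_given_f, p_x, rt=0.01):
        # x: correct output (0 or 1); inputs: feature values Fi (0 or 1)
        for i, fi in enumerate(inputs):
            # [X|Fi] moves toward X only when feature i is present
            p_x_given_f[i] += (x - p_x_given_f[i]) * fi * rt
        # [X] moves toward X on every presentation
        p_x += (x - p_x) * rt
        return p_x_given_f, p_x

    # Feature i is predictive of X's occurrence when p_x_given_f[i] > p_x,
    # and of X's nonoccurrence when p_x_given_f[i] < p_x.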


It should be noted that there is no necessary relationship between contingency and appropriate LTU weight size, as contingency is dependent on the probability distribution of the input patterns, while LTU weight size is not. Thus, without changing the LTU function to be learned, a feature can be made to be positively predictive, negatively predictive, or statistically independent by adjusting the relative probability of occurrence of the positive and negative input patterns. The exact weights learned during perceptron training are affected to some extent by the probability distribution of the input patterns, but the optimum (shortest) weight vector is not. The fact that contingency and appropriate weight size are generally correlated permits the quickly acquired statistics to guide the slower acquisition of LTU weights, but it should not be overlooked that they are independently variable properties.

In order to use this probabilistic information, when weights are adjusted, they are changed in proportion to each feature's predictive potential. Various formal measures of contingency are possible (Gibbon et al., 1971; Allan, 1980; Hammond and Paynter, 1983; Scott and Platt, 1985), but empirically, a simple difference has proven satisfactory. In fact, for randomly generated LTU functions of independent features, a symmetric weight vector with weights set equal to ([X|Fi] - [X]) (as measured over the entire input space) is generally close to a solution vector, although it seldom perfectly separates the positive and negative instances. This contingency information is used to bias the adjustments in perceptron training. In the simplest case, the adjustment rate, or plasticity, of each feature is set to the absolute value of ([X|Fi] - [X]), which should generally be near zero for irrelevant features. That is, instead of adjusting each Wi by ±Fi, it is adjusted by ±Fi * |[X|Fi] - [X]| (this update is sketched in code below). For fixed conditional probability values, this modification of perceptron training is still provably convergent. This use of conditional probability reduces the effects of irrelevant features, but has little effect when all features are relevant.

Another approach is to adjust Wi using only the associated value of [X|Fi], rather than [X|Fi] - [X]. In this version, weights are adjusted by [X|Fi] to increase and (1 - [X|Fi]) (= [~X|Fi]) to decrease. Irrelevant features are de-emphasized but are not ignored using this approach. However, learning time based on relevant features is accelerated. Using least-correct ordering, learning time was reduced from around 3.5 ** d to 2 ** d.

More aggressive techniques are also possible. For example, weights can be adjusted by ([X|Fi] - [X]) only when that term is of the proper sign (positive for weight increase and negative for weight decrease). Such an approach is not convergent for all input orderings, but using shuffle-cycle input order it appears to work quite well. Empirically, using shuffle-cycle order, the average number of adjustments was reduced from about 3 ** d to about 1.7 ** d. As a concrete point of comparison, the average number of adjusts for randomly generated 10-feature LTU functions was reduced from 7983 to 308.

The results of these modifications are still generally consistent with observed characteristics of classical conditioning. In particular, Rescorla (1966, 1967, 1968, 1969) found that associative strength was positively correlated with the contingency between conditioned and unconditioned stimuli (corresponding to the stimulus and the category in the current discussion). Two general classes of theories have been proposed to explain this effect: the "molecular" Rescorla-Wagner learning model, in which the contingency effect results indirectly from step by step output error correction, and theories in which the organism more directly computes probabilities (Rescorla, 1972a; Miller and Schachtman, 1985a; Gallistel, 1990, Chap. 13). Both models have some strengths and some weaknesses, depending in part on the goal of the calculation. If the goal is simply to find a hyperplane that separates positive and negative instances, then perceptron training seems appropriate, but if sensitivity to actual probabilities or rates is desired, then it makes sense to explicitly compute them.

The proposed two-stage model utilizes both approaches. The goal is to correctly position a separating hyperplane, and weights are adjusted by the basic Rescorla-Wagner/perceptron mechanism. However, contingency is also directly computed and used for the adjustment of each feature's salience. In the Rescorla-Wagner model, salience was modeled as a constant multiplying factor specific to each feature which determined the "associability" (maximum rate of change) of the feature's associative strength.

The concept of variable associability can be used to explain a number of "latent" learning phenomena ("latent inhibition", "latent facilitation" and "learned irrelevance") that are not adequately captured by the Rescorla-Wagner model (see further reading, 3.2). For example, in latent inhibition, a CS is repeatedly presented by itself. Later, it is generally harder for that stimulus to enter into new associations when it is explicitly paired with a US. Preexposure produces no discernible associative changes, but affects the "associability" of the stimulus on later occasions. With learned irrelevance, the CS and US are both presented, but in an explicitly uncorrelated pattern. Again, the stimulus can be reduced in associability without changing its current associative strength.

If the conditional probability traces are adjusted on every input presentation, the conditional probability predicts whether the node should be on or off, and the result is (vaguely) similar to Mackintosh's (1975) model in which a feature's salience is determined by its relative predictiveness for correct output. If the traces are adjusted only for inputs that result in an incorrect output, the conditional probability predicts whether the node's output should move up or down, and the result is more consistent with Pearce and Hall's suggestion that salience decreases for features which have reached their proper associative strength (Pearce and Hall, 1980; Pearce et al., 1982a,b). The latent learning effects of on/off vs up/down probability are somewhat different, but, empirically, learning acceleration is roughly equivalent. However, the up/down approach is less sensitive to biased probabilities of presentation for different input patterns.

There are other computational variations, each of which produces slightly different effects. The important commonality is that a two-stage model permits a significant acceleration of learning by explicitly calculating and utilizing feature salience to adjust associability.
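As a sketch of the simplest variation described above (illustrative Python, with hypothetical names), the standard ±Fi perceptron adjustment is scaled by each feature's contingency |[X|Fi] - [X]|. How the constant threshold feature should be treated is not specified here; it is assumed to keep full plasticity.

    def salience_weighted_adjust(w, x, target, output, p_x_given_f, p_x):
        # One perceptron adjustment with output-specific associability:
        # each weight change is scaled by the feature's contingency,
        # |[X|Fi] - [X]|, which is near zero for irrelevant features.
        if output == target:
            return w
        sign = 1 if target == 1 else -1
        return [wi + sign * fi * abs(pxf - p_x)
                for wi, fi, pxf in zip(w, x, p_x_given_f)]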


It is worth emphasizing that these variable plasticity approaches to accelerating perceptron training are not designed to model any particular behavioral phenomena; they are simply designed to accelerate convergence on a correctly separating hyperplane. The fact that similar mechanisms have been proposed to explain animal learning characteristics suggests that the concept of variable associability is a useful construct, but there is still considerable flexibility in the way it can be implemented (Lubow, 1989). The capacity to control feature associability may also vary considerably between different classes of animals (Pearce, 1987, p. 180; Lubow, 1989, p. 101).

In the model proposed here, each feature's output-specific salience is computed and used at the level of the individual node. However, network-level systems involving the hippocampus have also been proposed to determine feature salience (Solomon and Moore, 1975; Solomon, 1977, 1979, 1987; Moore and Stickney, 1980, 1982; Moore and Solomon, 1984; Schmajuk and Moore, 1985; Gabriel et al., 1980, 1982; Nadel et al., 1985; Kaye and Pearce, 1987; Salafia, 1987; Lubow, 1989, p. 130). For example, feature associability can be context specific, a phenomenon that is not easily explained without network-level systems. In addition, learned irrelevance is not necessarily feature-specific, in that learning that one feature is irrelevant in a task may also reduce the learning rates for other features in that task. Thus there are a number of different approaches to variable associability, both at the node and network level. The output-specific associability model developed here does not require network-level systems, but a complementary model of input-specific associability does (Section 4.6), and the hippocampus still seems a likely participant in that system.

3.4. ADAPTIVE ORIGIN PLACEMENT

The upper bound on perceptron training time increases by a multiplicative factor of O(x ** 4) with x, the distance from the solution hyperplane to the origin. Consequently, origin placement can significantly influence training time. For example, it is better for a feature to vary between 0 and 1 than between 9 and 10, and as previously observed, learning is generally faster with (-1, 1) than with (0, 1) input. Based on the upper bound formula, training time for (0, 1) input can be worse than (-1, 1) input by a multiplicative factor of O(d). In the absence of other information (e.g. some knowledge of the function(s) to be learned), a reasonable choice of origin is at the average value of each feature. That is, input feature values can be "centered" by the adjustment

Fi := Fi - F̄i

where F̄i is the average value of feature Fi. (Note that the constant threshold feature must be excluded from this adjustment.) For fixed values of F̄i, perceptron training is still provably convergent. This can be achieved by averaging over some bounded sample of input patterns and fixing the F̄is at the resulting values. If feature presence and absence are equally likely, this has the effect of changing (0, 1) input to (-0.5, 0.5) input, with the resulting advantages of symmetric input. Instead of averaging over a fixed sample of input patterns, the average value of a feature can be incrementally calculated as

F̄i := F̄i + (Fi - F̄i) * rt

where rt determines the memory length of the running average. Using this approach, perceptron training is not necessarily convergent since a large value of rt can lead to significant wandering of the origin. The problem is that an origin shift can reduce the accuracy of existing weight settings. However, by choosing an appropriately small value of rt, the problem of origin wander can be reduced while still achieving a significant improvement in performance. The value of rt can also be productively manipulated using conditional probability. As an alternative to modifying the actual output value of a feature (i.e. Fi := Fi - F̄i), only its associability might be controlled by (e.g. set equal to) the term abs(Fi - F̄i). At the extreme of rt = 1, F̄i is always equal to the previous value of Fi, and only features that changed in the last time step are salient. Although not necessarily convergent, even this extreme case works reasonably well in practice. Constant "background" features are immediately eliminated from consideration.

This simple "one context" model can be modified to permit different origin placement in different contexts. For example, the constant features in one room are different from the constant features in another room. Rather than incrementally adjusting the origin for each context change (and forgetting the previous setting), it would be desirable to learn the proper placement for each context and flip between them when changing rooms. This is possible for an LTU because any origin shift can be exactly compensated for by an appropriate adjustment in the threshold weight. One approach is to provide a distinct constant/context feature for each distinct context. This would permit the node to learn its proper threshold setting for each context. This approach requires network-level structures and is developed further in Section 4.6.

Placing the origin at the average value of each feature is based entirely on input characteristics. Alternatively, individual nodes can maintain private origin adjustments based on the particular function each is computing. In this case, a reasonable choice is at the average value of patterns it has frequently misclassified. This can be achieved by adjusting the origin only on misclassifications. Again, this is not necessarily convergent but is generally beneficial. This is quite useful if the two-vector model is used with continuous input. While a single weight vector, by definition, describes a hyperplane, the two-vector model can describe closed positive or negative regions in a continuously valued space (Fig. 10). This allows a single node to cover a cluster of instances, something that would otherwise require a set of strictly linear nodes. With an adjustable origin, these regions can be positioned anywhere in the input space.
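A sketch of the incremental version (illustrative Python; names are hypothetical): a running average F̄i is maintained for each feature, inputs are expressed as deviations from it, and the constant threshold feature is exempted as noted above. A small rt limits origin wander.

    def center_inputs(x, means, rt=0.01):
        # x[0] is the constant threshold feature and is not centered.
        centered = [x[0]]
        for i in range(1, len(x)):
            means[i] += (x[i] - means[i]) * rt    # running average F-bar
            centered.append(x[i] - means[i])      # Fi := Fi - mean(Fi)
        return centered, means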

[Fig. 10. The two-vector model can represent closed regions in a continuous input space. (a) If both features are weighted equally, a diamond-shaped region results. The size of the threshold determines the size of the diamond. (b) If features are not weighted equally, the diamond is squashed or elongated along the different dimensions. The interior can be either positive or negative.]

Adaptive origin placement is consistent with most behavioral models of habituation (Domjan and Burkhard, 1982, p. 35; Mackintosh, 1983, p. 285; Mazur, 1986, p. 47; Walker, 1987, Chap. 2). In general, these models assume that current sensory stimuli are compared to a memory of recent events, and that only those stimuli which differ from the expected value are available for further processing. More elaborate models are possible (and necessary), in which the "expected" value of a feature is the result of an arbitrarily complex world model (e.g. Sokolov, 1960, 1963; Lynn, 1966; Lara and Arbib, 1985), but simple origin centering at the average values seems to be a desirable minimum capability.

4. LEARNING AND USING SPECIFIC INSTANCES

4.1. INTRODUCTION

Various tradeoffs between time and space efficiency are possible in connectionistic models. For example, while an ability to effectively generalize permits improved space efficiency and predictive potential, generalization is not always possible or appropriate. By definition, a maximal generalization over observed positive instances includes as many unobserved (possibly negative) instances as possible. This is a significant drawback when specific instance learning is known to be appropriate. For example, in order to accurately recognize a particular stimulus (e.g. a photograph of a person) as having been seen before, the representation should be as specific as possible to avoid incorrectly responding to similar stimuli. Since the goal is to be as specific as possible, there is no need to provide negative training examples. A second problem with generalization is that, even when generalization is appropriate, an incremental learning system that stores only a single generalization hypothesis can make repeated mistakes on the same input

patterns, a situation which need not occur with specific instance learning. Perceptron training is good at learning prototypic generalizations but poor at learning specific instances. The numerous possible applications of specific instances suggest that more specialized approaches should be of value. This section develops a specialized mechanism for learning specific instances. Although the proposed process has an adjustable bias toward specificity, the primary concern is for the extreme case of specific instance learning; that is, with AND "snapshots" of the current input pattern in which all features are relevant and included. These Specific Instance Detectors (SIDs) can be used independently when specific instance learning is known to be appropriate. They can also be used in conjunction with a generalization system when generalization may be appropriate, but can be improved upon if certain instances are simply memorized as specific instances.

4.2. FOCUSING

Geometrically, perceptron training can be viewed as two distinct processes: a rotation of the weight vector toward or away from some prototypic point, and an adjustment of the threshold. On misclassified positive instances, the central prototype is rotated toward the current input pattern and the threshold is reduced (i.e. the size of the generalization is increased). For misclassified negative instances, the prototype is rotated away from the current instance and the threshold is increased (the size of the generalization is decreased). The perceptron training algorithm has many desirable properties in that it is provably convergent, is a reasonable model of classical conditioning, and appropriate neural components have been worked out to a considerable extent. However, while it is relatively good at learning prototypic generalizations, it is correspondingly poor at


learning specific instances. In particular, AND is an (x of m) function, and for d relevant features perceptron training requires O(d ** 3) adjustments to completely distinguish a single positive instance from all other negative instances. Empirically, with (0, 1) input and least-correct order, exactly 1 - (1/2)d + (7/2)d ** 2 + d ** 3 adjusts are required. As a point of comparison, the theoretical upper bound without "O" notation is M = (d + 1), W ** 2 = (4d + (2d - 1) ** 2), a = 1, for an upper bound of 1 + d + 4d ** 2 + 4d ** 3. Besides over-generalizing the positive instance and missing on negative instances, the LTU will also repeatedly miss on the single positive instance. Empirically, about 50% of the total adjusts are on the single positive instance. Over-generalizing is not unreasonable if the instance is not known beforehand to be an isolated one, but repeatedly missing the single instance is a more serious objection. Because a learning system that is biased toward generalization is inherently biased against learning specific instances, it might be advantageous to provide a specialized learning system with a bias toward specificity rather than generality. Interestingly, only a minor modification of the perceptron training algorithm is necessary to accomplish this. When presented with a positive instance, the weight vector is rotated toward it as before, but the threshold is increased rather than decreased. That is, the size of the generalization is decreased when presented with a positive instance rather than increased.

There are various alternative mechanisms for learning specific instances. For example, the output of a node could be bounded between 0 and 1 and taken to an adjustable power. As the power is increased, the node converges on a SID, although output is always positive over the entire input space. Shepard (1987) argues that an initial generalization gradient around a positive instance should drop off exponentially with distance from the observed example. Alternatively, if the node is intended to be an AND, a multiplicative rule of feature integration


is also possible, and behaviorally supported (Butler, 1963; Merlin and Reynolds, 1988). The particular method of AND learning does not significantly affect the model developed here, so, for simplicity, only the additive model will be considered. Because it "focuses" the positive region of the LTU on the current input, this learning process is referred to as focusing. As the logical extreme, the weight vector can be rotated to point directly at the current instance and the threshold increased to the length of the vector (Fig. 11). Thus a specific instance can be learned in one adjustment if desired, in effect forming an AND "snapshot" of the current input. Although this extreme case of one-shot learning can be advantageous, it runs the risk of being overly specific. By incrementally focusing, the relevant features and appropriate level of specificity can be identified to some extent. Since the amount of focusing is adjustable between 0 (no change) and 1 (one-shot), it can be controlled to suit the particular learning circumstances.

The nature of the "teacher" signal is somewhat different between perceptron training and focusing. In perceptron training, the instruction is to "learn this pattern as part (not a part) of that category". With focusing, the instruction is simply to "learn this pattern" with a variable degree of specificity. Humans sometimes display what is called "now print" (Livingston, 1967a,b), "flashbulb" (Brown and Kulik, 1977), or one-shot learning. For example, after only a single presentation of a particular pattern (e.g. a picture), it can be reliably recognized for days or weeks (Standing, 1973; Standing et al., 1970). Lower animals (pigeons) demonstrate similar capabilities (Herrnstein, 1985). This is not to say that the memory is necessarily specific on every detail, just that there is no observable generalization within the relevant domain of application. Perceptron training does not have this property as it tends to generalize quite aggressively. There is some disagreement on the status of "flashbulb memory" as a separate learning process (McCloskey et al., 1988; Schmidt and Bohannon, 1988; Cohen et al., 1988), but the value of a distinct specialization-based learning mechanism is reasonably clear in the context of the current model.

[Fig. 11. Focusing an LTU. (a) Initially, the input patterns (1 1) and (1 -1) are both on the positive side of the plane. (b) To focus on (1 1), the weight vector is rotated up to point at that pattern, and the threshold is moved out to the end of the vector. Consequently, only (1 1) is now on the positive side of the plane.]
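The one-shot extreme of focusing can be sketched directly (illustrative Python for (-1, 1) input; names are hypothetical): the weight vector is set to point at the current pattern and the threshold is raised to just under the pattern's squared length, so only that pattern remains on the positive side.

    def focus_one_shot(pattern):
        # Form a Specific Instance Detector (SID): any other (-1, 1)
        # pattern differs in at least one feature, so its dot product
        # with the weights is at least 2 lower and falls below threshold.
        w = list(pattern)
        theta = sum(fi * fi for fi in pattern) - 1    # |pattern|**2 - 1
        return w, theta

    def sid_output(w, theta, x):
        return 1 if sum(wi * xi for wi, xi in zip(w, x)) > theta else 0

    w, theta = focus_one_shot([1, 1, -1])
    assert sid_output(w, theta, [1, 1, -1]) == 1   # the learned instance
    assert sid_output(w, theta, [1, 1, 1]) == 0    # everything else is off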


A neural system displaying the appropriate components of focusing has been described in the hippocampus (Alger and Teyler, 1976; Dunwiddie and Lynch, 1978; Anderson et al., 1980; Abraham and Goddard, 1983, 1985; Brown et al., 1990). In that system, the currently active inputs to a neuron can become more effective in firing it (long-term potentiation), while the inactive inputs become less effective (heterosynaptic depression). This is apparently achieved by simultaneously strengthening the synapses of the active inputs (rotate prototype) and (possibly) reducing the excitability of the cell as a whole (raise threshold). Large changes in synaptic strength can occur on a single stimulus presentation. The cell's firing function can thus be modified to respond more selectively to the current input. The actual neural process is considerably more complex and is incompletely understood (Bliss and Dolphin, 1984; Abraham and Goddard, 1985; Schwartzkroin and Taube, 1986; Teyler and DiScenna, 1987; Cotman et al., 1988; Matthies, 1989; Brown et al., 1989, 1990). However, it appears at least potentially capable of the desired characteristics of focusing. Long-term potentiation has been studied primarily in the hippocampus, but has also been reported in other brain structures (Teyler and DiScenna, 1984; Morris and Baker, 1984; Artola and Singer, 1987; Brown et al., 1990). Although heterosynaptic depression in the hippocampus is often transient, long-term reductions in background firing rate (presumably resulting from some sort of threshold shift) have been observed in the cortex and linked to the acquisition of classically conditioned responses (Disterhoft and Olds, 1972; Olds, 1975; Disterhoft and Stuart, 1976; Brons et al., 1982; Diamond and Weinberger, 1984; Weinberger and Diamond, 1987).

Neural models often utilize a variable threshold to adjust "excitability" (MacGregor, 1987). Changes in neural excitability have been reported (see further reading, 4.1) but at present there is considerably less physiological evidence for adaptive threshold adjustment than there is for the adaptive modification of feature-specific synaptic strengths. However, as demonstrated by the two-vector model, a general linear equation of binary features can be represented without an explicit adjustable threshold by using two weights per feature. In addition, the explicit threshold of an LTU can be represented as an additional input feature which is constantly on. A fixed threshold with an adjustable global multiplier (excitability) is also computationally equivalent. This indicates that while it may be conceptually useful to utilize an explicit, adjustable threshold, there is some flexibility in the formal or physiological implementation.

Since the length of the weight vector (i.e. the size of the weights) is not necessarily bounded, one possible variation is to introduce an element of competition among the input synapses. For example, one might hypothesize that some fixed amount of post-synaptic material is redistributed among synapses (Stent, 1973), or that the area of dendritic surface usable for synapses is constant (von der Malsburg, 1973). This could be implemented by requiring the sum of all synaptic weights to be constant, and that only the distribution changes with training. Alternatively, the length of the weight vector could be held constant. Although these simplistic models appear unlikely as a general learning rule, a constant sum model appears reasonable under some circumstances. For example, acetylcholine receptors at the neuromuscular junction can be redistributed by synaptic activity (Poo, 1985). Uncontrolled weight magnitude has not been a problem in the model, and is not explicitly addressed.

4.3. GENERALIZATION VS SPECIFIC INSTANCE LEARNING

An ability to rapidly learn specific instances permits a considerably different approach to adaptive behavior than one based on more slowly acquired generalized category detectors. In general, specific instance manipulation has much more of a cognitive flavor than generalized category detection acquired by perceptron training/Rescorla-Wagner learning. Overall, generalization and specific instance learning correspond well to the "procedural/declarative" distinction (e.g. Cohen and Squire, 1980; Squire, 1982, 1983, 1987, Chap. 11; Squire et al., 1984; Squire and Cohen, 1984; Cohen, 1984). Similar divisions have been proposed between "knowing how" and "knowing that" (Ryle, 1949), "context free" and "context dependent" (Kent, 1981; Kinsbourne and Wood, 1982; O'Keefe and Nadel, 1978; Nadel et al., 1985), "reference" and "working" (Honig, 1978; Olton, 1978, 1979, 1983, 1986; Olton et al., 1979, 1984), "semantic" and "episodic" (Tulving, 1972, 1983, 1984, 1986; Schacter and Tulving, 1983; Kinsbourne, 1988; Kinsbourne and Wood, 1975, 1982; Wood et al., 1982; McKoon et al., 1986; Rolls, 1990), "implicit" and "explicit" (Schacter, 1985, 1987; Graf and Schacter, 1985; Richardson-Klavehn and Bjork, 1988), or "habit" and "memories" (Mishkin et al., 1984; Mishkin and Petri, 1984; Mahut, 1985). A double dissociation in learning capabilities produced by pharmacological intervention is along similar lines (Staubli et al., 1985a,b). A similar distinction is also frequently made in the AI literature (Winograd, 1975; Anderson, 1976, 1982, 1983; Neves and Anderson, 1981). It has been suggested that problem solving behavior is initially declarative, but with practice, shifts toward a procedural representation (Larkin et al., 1980; Neves and Anderson, 1981; Anderson, 1983, Chap. 6; Adams and Dickinson, 1981).

The differences between perceptron training and focusing reflect this general dichotomy. Perceptron training is oriented toward particular categorization systems, is gradual (in the sense that it can repeatedly misclassify the same input patterns), generalizes aggressively, and is inappropriate for learning specific instances. Because only a generalization is stored, information about specific instances is lost. Consequently, only those categorization systems that have been trained to include or exclude the instance can express information about it. Focusing is categorically uncommitted, rapid, is a form of specialization, and is quite capable of learning specific instances, in one shot if necessary.
If specific instances are


recorded, no information is lost. Consequently, the information can potentially be used in any number of ways by any system that can access it. This dedicated vs flexible use of information is an important distinction in the procedural/declarative dichotomy. That is not to say that focusing is by itself necessarily "declarative", but rather that most declarative learning phenomena appear to have a common denominator in their reliance on rapidly acquired specific instances. In addition, there are forms of learning besides perceptron training that can be viewed as procedural (e.g. habituation and sensitization), so perceptron training and procedural learning cannot be strictly equated. The point is simply that perceptron training and focusing are well-defined learning processes with complementary characteristics, and that these characteristics fall naturally on opposite sides of the observed procedural/declarative distinction.

These characteristic differences result directly from the basic processes of learning. Perceptron training learns to be on when presented with positive instances and to be off when presented with negative ones. Consequently, it requires a number of input presentations to distinguish a particular stimulus pattern from all related ones. However, because it can effectively generalize over prototypic groupings, it can be quite space efficient and should have good predictive potential for novel circumstances. Focusing learns positive instances only. It learns both when to be on (for the current input) and when to be off (everything else) when presented with a single positive instance. The resulting one-shot learning capability is appropriate for the rapid acquisition of specific instances. However, focusing has no commitment to useful generalization, and consequently is more resource expensive and has little predictive potential for novel stimuli. (Generalizations and predictions can be inferred later from a set of specific instances, but unlike the results of perceptron training, the information is not stored in that form.)

At the extreme, all learning might be nothing but recording specific instances in time and space. Appropriate behavior would then be produced as an "on-the-fly" computation over this collection of facts. This is an extreme declarative point of view. Any generalization-based system could be simulated by computing the appropriate generalization from the internal store of raw data. However, as a general strategy, this appears too extravagant with memory, and rather complex in the required computations over it. If an organism can get by with just a generalization rather than a complete data set, there are good reasons to maintain only the generalization. On the other hand, certain behaviorally uncommitted information (e.g. spatial maps) may be best acquired and retained as specific instances.

In keeping with their relative time and space efficiencies, overt behavior might effectively utilize both approaches on the same task. For example, a fast-learning specific instance representation would permit rapidly acquired behavior while the more slowly acquired generalized representation was being learned (Mishkin et al., 1984; Barnes and McNaughton, 1985). The interaction between procedural and declarative learning is a complex, but interesting, area in the investigation of intelligent behavior. The hippocampus is often linked to the recording of specific instances, and it has been suggested that both the hippocampal system and hippocampus-type learning are relatively recent phylogenetic developments, at least compared to classical conditioning (Mishkin et al., 1984; Lynch and Baudry, 1984). In keeping with a more recent phylogenetic status, declarative memory is later to develop ontogenetically (Mishkin et al., 1984; Bachevalier and Mishkin, 1984; Nadel and Zola-Morgan, 1984; Schacter and Moscovitch, 1984; Schacter, 1984; Moscovitch, 1985). The evolutionary implication is that "specific instance" learning and representation has assumed increasing importance in the evolution of adaptive behavior, at least among the mammals.

4.4. USE OF SPECIFIC INSTANCES

The stimulus categorization model developed here is relatively restricted in its use of specific instances, but a number of other distinct, behaviorally relevant applications are possible. Some of those applications are considered in this section. The discussion is based on single-node SIDs, but that is not a necessity. Actual neural systems would need some sort of redundancy. The only requirement is that a particular activation state can be logically manipulated as an SID.

4.4.1. Stimulus learning

An ability to learn specific instances permits the separation of stimulus recognition from response learning. That is, it is possible to simply recognize a particular stimulus pattern as having been seen before without any knowledge of an appropriate classification or correct response. Conversely, a previously unseen input pattern might be confidently categorized without a corresponding belief that it has been seen before. Recognition can be viewed as a form of categorization, but the specialized requirements of recognition suggest the use of specialized mechanisms. Only SIDs permit completely accurate stimulus recognition, as any generalization beyond the observed instance will potentially respond to novel (previously unseen) input patterns. In addition, if multiple SIDs are formed with repeated presentation of the same stimulus, a certain amount of frequency information is encoded (Hintzman, 1988; Nosofsky, 1988b). Behaviorally, stimulus recognition is strongly affected by both similarity and frequency.

The separation of stimulus learning from response learning can have significant behavioral effects. If only responses are learned, prior presentation of the stimuli would not be useful if no categorical information were supplied. On the other hand, if stimulus learning is treated as a separate process, prior experience with the stimuli would give the organism a chance to learn and recognize the most frequent stimuli as individual cases, which could then be rapidly associated with the correct response at some later date. The separation of input learning from output learning can lead to rather complex behavioral effects. For example, repeated exposure to a


particular input pattern might lead to improved discrimination of that pattern from other inputs, but simultaneous loss of associability for it (Hall and Honey, 1989a; Honey and Hall, 1989).

As a related application of stimulus learning, the stimulus and information about its correct classification may be separated in time. A SID can bridge the temporal gap between stimulus presentation and classification information. It may itself be used for classification learning, or it may be used to "remember" and re-establish the complete feature-level description of the stimulus (next section).

4.4.2. Stimulus description

While pattern recognition and/or classification is of primary importance, there are also important applications for pattern description. For example, in input-specific associability (Section 4.6), once a particular pattern has been recognized, its expected features (a pattern description) are compared to the currently observed features, and associability is reduced for those features that were accurately predicted. SIDs are useful for pattern description since, like pattern detection, pattern descriptions (Specific Instance Descriptions in this case) can be learned in one shot. Another advantage of SID descriptors is that while they can be triggered in a data-driven (bottom up) mode by an appropriate combination of features, they can also be easily accessed in a top-down fashion. That is, while SIDs can be used to recognize and then describe the expected features of input patterns, they can also be used to recreate the appropriate pattern of activity de novo. This would seem to be an essential part of high-level cognitive processes, where high-level symbols (e.g. the word "dog") can be re-expanded into a more detailed representation (e.g. a mental image of a dog and associated feature-level properties). A generalized category detector does not have this symmetric detector/descriptor property.

4.4.3. Known specific instances

Specific instance learning is probably useful in any domain, but it appears to be especially suited for spatial or temporal information. A location in space should not be represented simply as a particular response in that circumstance, but as a discrete entity with which any response, or any other property, can be rapidly associated. Models of spatial "cognitive maps" (e.g. O'Keefe and Nadel, 1978) are based on the identification of such discrete spatial locations. Spatial capabilities are strongly affected by damage to the hippocampus. "Episodic" information can be viewed as a collection of specific instances in time and space (e.g. object (x) at location (y), time (z)). Not surprisingly, episodic learning is also strongly disrupted by hippocampal damage. Everyday experience is filled with applications of specific instance learning (e.g. "the acorn is hidden HERE", or "THAT dog bites"). That is, generalization is sometimes known to be pointless, and it is often important not to make the same mistake twice. At a more cognitive level, most "facts"

4.4.4. Constrained learning circumstances

When presented with a novel learning environment with relatively few choices, it may be faster, and of comparable space efficiency, to simply memorize the specific instances rather than attempt to learn a general rule. For example, perceptron training requires O(d ** 3) adjusts to learn an (x of d) function. Consequently, when there are fewer than O(d ** 3) instances in an (x of d) grouping, specific instance learning can be faster. As will be shown later, under some circumstances a linear number of SIDs can avoid exponential perceptron training time for an LTU grouping. On the other hand, the (d/2 of d) function has 2 ** (d - 1) positive instances, and SID-based learning (Section 4.4.7) empirically requires on the order of 2 ** d nodes and adjusts, while perceptron training time is still O(d ** 3). Thus, perceptron training is good for generalizing over large, prototypic clusters while focusing is good for memorizing a limited number of specific instances. There is behavioral evidence that generalization is more readily utilized for large categories while specific instances are more readily used for small ones (Homa et al., 1973, 1981; Homa and Vosburgh, 1976; Homa, 1978).

Besides having a relatively small number of instances to learn, the instances may be of interest for only a limited period of time. By rapidly learning a number of specific instances when first presented with a novel learning task, but forgetting them afterwards, the required number of SIDs can be minimized. There is behavioral evidence that with a short delay after training, both specific instances and generalizations are available; but with a longer delay, memory for specific instances is degraded while memory for generalizations is unaffected (Posner and Keele, 1970; Strange et al., 1970; Homa et al., 1973; Homa and Chambliss, 1975; Homa and Vosburgh, 1976; Robbins et al., 1978; Medin and Schaffer, 1978). Mishkin et al. (1984) provide both neural and behavioral evidence for such a process. In that study, combined lesions of the hippocampus and amygdala eliminated the fast-learning, short-term system without affecting the performance of the slow-learning, long-term system.

4.4.5. Pattern completion and pattern association

One of the more interesting capabilities of recurrently connected networks is their ability to reconstruct a complete pattern when presented with a partial or degraded version of it. Pattern reconstruction has been impressively demonstrated with digitized pictures of faces (Kohonen et al., 1981). Unfortunately, the number of clearly reconstructible patterns which can be stored in a recurrently connected network of feature-level detectors is relatively small. Using symmetrically connected nodes (bidirectional links), Hopfield (1982) reports that about 0.15 * n patterns can be stored with n nodes. More precisely, upper bounds of (approximately) n/(4 log(n)), n, and 2n patterns can be demonstrated, depending on the weighting scheme (Psaltis and Venkatesh, 1988).

In addition, the training process required to learn the optimum connections may be slow. A similar linear (O(n)) capacity result was obtained with uni-directional links (Amari, 1972). The storage capacity and reconstructive capability of a recurrently connected pool of feature-level detectors is limited by the linear constraints on the individual nodes. Storage capacity can be increased by increasing the computational power of the nodes (Chen et al., 1986; Psaltis and Park, 1986; Psaltis et al., 1988; Guyon et al., 1988; Keeler, 1988), e.g. from linear to quadratic, but the cost of more powerful nodes can be considerable, and there is still no guarantee that any particular pattern of interest can be effectively reconstructed. However, by providing an additional set of SIDs to supplement or replace the feature-level associations, any number of patterns can be stored and reconstructed (Hintzman, 1986, 1988). For example, while associations between letters can, to some extent, complete degraded patterns (words), specific word detectors provide a more detailed reconstruction capability (McClelland and Rumelhart, 1981; Rumelhart and McClelland, 1982). One reasonable strategy for pattern completion is to choose the SID that has the largest dot product with the input (that is, which best matches the current input), and use it to reconstruct the expected features. Alternatively, patterns could be completed as a superposition of all competing SIDs, weighted by their similarity to the input (Hintzman, 1986, 1988). Since both the bottom-up connections (which define the SID) and the top-down connections (which describe the pattern) can be learned in one shot, SIDs provide an accurate reconstructive capability long before feature-level recurrent connections can be trained to provide such a capability (Teyler and DiScenna, 1986).
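The best-match strategy just described is simple enough to sketch directly. The following fragment (Python; the class and parameter names are illustrative, not those of the simulations reported here) stores each pattern as an SID in one shot and completes a degraded cue by reading out the best-matching stored pattern:

    import numpy as np

    # A minimal sketch of SID-based pattern completion.  Each stored
    # pattern acts as a Specific Instance Detector: its bottom-up weights
    # are the pattern itself, so its response to an input is a dot product
    # measuring similarity.  Completion picks the best-matching SID and
    # plays its top-down weights (the same pattern) back out.

    class SIDMemory:
        def __init__(self):
            self.patterns = []          # one SID per stored pattern

        def store(self, pattern):       # one-shot learning
            self.patterns.append(np.asarray(pattern, dtype=float))

        def complete(self, partial):
            partial = np.asarray(partial, dtype=float)
            scores = [p @ partial for p in self.patterns]   # bottom-up match
            best = self.patterns[int(np.argmax(scores))]
            return np.sign(best)        # top-down reconstruction

    mem = SIDMemory()
    mem.store([ 1,  1, -1, -1])
    mem.store([-1,  1, -1,  1])
    print(mem.complete([1, 1, 0, 0]))   # degraded cue -> [ 1.  1. -1. -1.]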


Pattern completion is usually implemented using recurrent connections to the input features, in which case the "completed" features are indistinguishable from the actually observed ones. Alternatively, the completed input could be represented in a duplicate set of features that indicate the expected values of the features, as distinct from their observed values.

In general, any two patterns of activity can be associated in one shot by the use of an intervening SID (Fig. 12a). For pattern completion, a pattern is simply associated with itself. As in pattern completion, SIDs can be used to complement the more distributed feature-level associations which are learned by perceptron training (Fig. 12b). That is, SIDs can be used to "stabilize" the association of arbitrary patterns while the more slowly acquired, feature-level associations are being incrementally adjusted. The hippocampus is frequently implicated in such a temporary stabilization process (Squire, 1987, p. 209). Temporary dysfunction of such a system (taking it "off line") would have two noticeable effects: recent memories that were still dependent on SIDs would be temporarily forgotten, and new SIDs would not be formed. When the dysfunction was over, old SIDs and their associated memories would once again be available, but SID-dependent memories would not be available from the period of dysfunction. This is generally consistent with temporary hippocampal dysfunction (Squire, 1987). In fact, since SIDs can potentially activate either of the patterns to be associated, they could do more than just stabilize the desired associations. Rather than waiting for the initial stimulus of the desired pattern pairs to naturally occur, the desired training pairs could be generated at any time, and used as procedural training examples. Another possibility is that the hippocampus, rather than simply memorizing the desired pattern associations while the slower procedural system is trained, might also help allocate more permanent intermediate neurons between the given input and output patterns in the procedural system (Wickelgren, 1979).
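A comparable sketch of one-shot pattern association through an intervening SID (Fig. 12a) follows; again, the names are illustrative. Each stored pair acts as an exact-match detector whose output connections re-create the associated pattern:

    import numpy as np

    # A minimal sketch of one-shot pattern association through an
    # intervening SID.  The SID's input weights are an AND of the input
    # pattern, and its output weights copy the target pattern, so
    # presenting the input re-creates the associated output.

    class PairMemory:
        def __init__(self):
            self.pairs = []                        # one SID per stored pair

        def associate(self, x, y):                 # one-shot learning
            self.pairs.append((np.asarray(x, dtype=float),
                               np.asarray(y, dtype=float)))

        def recall(self, x):
            x = np.asarray(x, dtype=float)
            for key, out in self.pairs:
                if np.all(key == x):               # SID fires on exact match
                    return out
            return None

    mem = PairMemory()
    mem.associate([1, -1, 1], [-1, -1, 1, 1])
    print(mem.recall([1, -1, 1]))                  # -> [-1. -1.  1.  1.]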


Fig. 12. (a) Any two patterns can be associated in one shot by use of an intervening SID. Its input connections are set to be an AND of the input features (i.e. an SID), and its output connections are set to match the values of the output features. (b) Alternatively, input and output patterns can be associated using perceptron training of the output nodes and feature-level associations. Some intermediate nodes may be necessary, but the general characteristics of perceptron training are still applicable.


An ability to associate any two patterns in one shot also permits higher-level systems to "write their own S-R code", or at least set "traps" to watch for particular patterns. For example, the verbal instruction "push the button when you see a blue triangle" could be translated into the appropriate S-R connections, which would free the higher-level system from constantly looking for the appropriate pattern. The current discussion has little to say about the temporal aspects of association, but the explicit association of patterns across one or more time slices, as well as within time slices, is certainly a matter of general interest.

4.4.6. "Working" or "scratch pad" memory

"Working" or "scratch pad" memory has been suggested as an appropriate domain for (or definition of) declarative learning. However, any application of this rather high-level process is apt to involve nearly all of the previously described processes. A commitment to rapidly formed, short-term storage is implicit in the name, and if mental problem solving is viewed as (at least capable of) performing a simple state-space search, the other SID applications would also seem useful characteristics of a mental scratch pad. For example, you would want to recognize a previously visited state as such, remember the evaluation of the best state seen so far, and perhaps remember a description of a particular state so it could be returned to. Thus it is difficult to say just where a term such as "declarative" is best applied: at a high level to a process with many components, or at a low level to a process with many applications. The abundance of overlapping terminology in this particular area testifies to the difficulty of partitioning complex processes/mechanisms on the basis of largely indirect evidence.

4.4.7. Specific instance based categorization

While stimulus categorization can be based on the formation of appropriate generalizations, it can also be based solely on the retention of specific instances. This general vs specific dichotomy is reflected in the "probabilistic" and "exemplar" categorization models discussed by Smith and Medin (1981). For determining categorization, the probabilistic model maintains a generalized category characterization, while the exemplar model maintains a set of specific instances. As a third possibility, the probabilistic and exemplar models might also be combined. Although they are quite different in principle, it is often difficult to behaviorally distinguish the three possibilities (Medin and Schaffer, 1978; Smith and Medin, 1981; Busemeyer et al., 1984; Hintzman and Ludlam, 1980; Hintzman, 1986, 1988; Nosofsky, 1984, 1986, 1988c; Estes, 1986).

Perhaps the simplest use of SIDs for categorization is the Nearest Neighbor (NN) approach. The classification rule in this model is simply "Decide that the input vector is a member of class F if it is nearer to the nearest known member of F than to the nearest known member of G" (Sebestyen, 1962, Chap. 4) for arbitrary classes F and G. In Fig. 13, the resulting decision boundaries are shown for 3 positive and 3 negative instances in a continuous 2-dimensional input space.

In a 3-dimensional space, the boundaries between the positive and negative patterns are planes; in higher dimensions they are hyperplanes. Learning can be equally simple: if an input is misclassified, add it to the set of SIDs that have been learned so far (Hart, 1968). This learning and categorization strategy seems biologically plausible, and is used in the empirical studies presented here. As a slight modification, classification can be based on the k nearest neighbors, thus averaging over a limited range of the input space (Duda and Hart, 1973). This technique has proven useful in some connectionistic models (Specht, 1990).

Because SID-based classification can potentially memorize the entire input space, it will eventually converge on correct classification for any fixed set of input patterns. For highly irregular (e.g. random) categories, that is about the best you can do: no generalization strategy can have predictive potential for a random function. However, when generalization is possible, an appropriate bias in the learning system can be quite valuable. In this context it is interesting to consider how well SID-based learning learns LTU functions. The "save all mistakes" NN learning algorithm was tested on random LTU functions of binary features using shuffle-cycle ordering. Empirically, NN generalization was ineffective, and it had to memorize about half the input space, requiring about 0.5 * 2 ** d nodes and adjusts. Perceptron training is considerably more space efficient, but the number of adjusts grows as more than 3 ** d, so simple memorization is faster to converge on perfect classification. However, NN classification cannot distinguish between relevant and irrelevant features, so while perceptron training time is linear in the number of irrelevant features, the NN algorithm is exponential. In addition, while perceptron training takes longer to learn perfect classification, it generalizes quite well after a limited number of patterns.
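The "save all mistakes" rule described above is compact enough to sketch directly (Python; the names are illustrative and squared Euclidean distance is an assumed choice of metric):

    import numpy as np

    # A minimal sketch of the "save all mistakes" nearest-neighbor learner
    # (after Hart, 1968): an input is classified by the label of the
    # nearest stored instance, and every misclassified input is added to
    # the store.

    class NearestNeighborLearner:
        def __init__(self):
            self.instances = []                  # list of (pattern, label)

        def classify(self, x):
            if not self.instances:
                return 0                         # arbitrary default
            dists = [np.sum((p - x) ** 2) for p, _ in self.instances]
            return self.instances[int(np.argmin(dists))][1]

        def train(self, x, label):
            x = np.asarray(x, dtype=float)
            if self.classify(x) != label:        # save all mistakes
                self.instances.append((x, label))

    nn = NearestNeighborLearner()
    for x, y in [([0, 0], -1), ([1, 1], 1), ([1, 0], -1), ([0, 1], 1)]:
        nn.train(x, y)
    print(nn.classify(np.array([1, 1])))         # -> 1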

Fig. 13. Nearest-neighbor classification boundaries in a continuous 2-dimensional space with 3 positive points and 3 negative points.


In Fig. 14, the classification accuracies of perceptron training and NN classification are compared on 100 random 10-feature LTU functions. The top and bottom curves are perceptron training using shuffle-cycle and least-correct ordering, and are the same as the curves shown in Fig. 9. Only shuffle-cycle results are shown for the NN training algorithm. As previously observed, the shuffle-cycle perceptron training curve is roughly asymptotic. On the other hand, NN classification shows essentially no useful generalization, and accuracy is simply proportional to the number of patterns memorized. With shuffle-cycle ordering, the average number of adjusts to 100% accuracy was 677 for NN and 7983 for perceptron training, but early training clearly favors perceptron training. With least-correct ordering, perceptron training accuracy is linear with the number of adjusts, and worse (smaller slope) than the NN algorithm.

Since NN classification separates the input space with hyperplanes, the poor NN results on random LTU functions were a bit surprising. They are also, in fact, somewhat misleading. With least-correct ordering, the NN algorithm does show some useful generalization on LTU functions, although still short of perceptron training with shuffle-cycle ordering. In addition, NN classification may not generalize well for LTU functions of binary features, but works quite well for LTU functions of continuous or multivalued features.


For example, with d = 3 and n = 10 (rather than d = 10, n = 2) and shuffle-cycle ordering, NN generalization was quite effective, and almost as good as perceptron training. The current discussion is limited to binary input features, so the relative value of NN classification for continuous input will not be pursued further. Clearly though, the particular testing conditions have a significant impact on the perceived values of the different approaches.

The current discussion is also limited to binary output tasks, but a continuously-valued output function can be approximated with the use of SIDs. For example, the output value of the nearest neighbor could be used. Alternatively, output could be computed as a combination of multiple SIDs weighted by distance to the current pattern. Given a sufficient number of nodes, any continuous output function can be approximated.
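As a sketch of the second option (illustrative names; the Gaussian similarity kernel and its width are assumptions, not taken from the text), stored SID values can be combined with weights that fall off with distance from the current input:

    import numpy as np

    # A minimal sketch of approximating a continuous output from stored
    # SIDs, weighting each stored instance by its closeness to the current
    # input (in the spirit of Specht, 1990).

    def sid_output(x, patterns, values, width=1.0):
        x = np.asarray(x, dtype=float)
        d2 = np.array([np.sum((p - x) ** 2) for p in patterns])
        w = np.exp(-d2 / (2.0 * width ** 2))     # closer SIDs count more
        return float(np.sum(w * values) / np.sum(w))

    patterns = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
    values = np.array([0.0, 1.0])
    print(sid_output([0.9, 0.9], patterns, values))   # close to 1.0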

Fig. 14. A comparison of nearest-neighbor and perceptron training in learning LTU functions. The percentage of the input space correctly classified is shown as a function of the number of adjustments. The top and bottom curves are the same as in Fig. 9, in which 100 random 10-feature LTU functions were learned using shuffle-cycle and least-correct ordering. The middle curve is the result of the nearest-neighbor algorithm using shuffle-cycle ordering.


4.4.8. Combining generalization and specific instances

Even when a stimulus can appropriately be included in a generalized category, there are advantages in using specific instances. Generalization is space efficient and hopefully has predictive potential for unseen inputs, but if only a single generalization hypothesis is stored during incremental training, the system has less than perfect memory for past stimuli. Specific instance learning has the opposite characteristics, so an obvious strategy is to use them both in parallel. Such a combination has been considered in several models (Brooks, 1978, 1987; Homa et al., 1981; Smith and Medin, 1981; Medin et al., 1984; Busemeyer et al., 1984). See also the discussion and references for "configural" learning (Section 5.2).

There are multiple ways to use SIDs in conjunction with generalization. In particular, SIDs can be treated as undistinguished features which are simply fed into the generalization system along with the other features, or they can be used in a specialized manner. The advantage of the first approach is that it does not depend on the added features necessarily being SIDs, and thus allows techniques such as incremental focusing. The advantage of the second approach is that when it is known which features are SIDs, this information can be used to improve performance. Empirically, with a small number of input features, perceptron training is relatively quick to utilize undistinguished SID input, so specialized treatment is generally not dramatically better. The use of incremental focusing provides some advantages, but only the extreme case of one-shot SIDs will be considered here. If incremental focusing is used, the combined system has some of the characteristics of an NN classifier, in that the SIDs respond not only to previously seen instances, but also to instances close to them.

If SIDs are treated as undistinguished features and are simply added to a feature set which is used for perceptron training, the type of representation can be important. In particular, with (-1, 1) SID output, perceptron training time based solely on SID input grows linearly with the number of SIDs. With (0, 1) SID output, training time grows linearly with the number of positive or negative instances, whichever is smaller. For a large input space, the difference between the two can be significant, as there may be an exponential number of input patterns, but only one positive instance. The problem is that while SID presence is infrequent and informative, SID absence is virtually constant and therefore of little information value. The use of conditional probability to adjust feature associability is effective in reducing this problem, since [X|¬Fi] - [X] approaches 0 as [¬Fi] approaches 1, but it still requires a number of input presentations before the near-irrelevance of SID absence can be detected. Consequently, it may be advantageous to not explicitly represent SID absence, as is the case with (0, 1) output.

Using (0, 1) output, the addition of a set of SIDs to the original feature set can only improve perceptron learning time, although learning may sometimes be slower than if SIDs alone were used. If an unlimited number of SIDs is available, all positive and/or negative examples can be uniquely encoded and learning can be very rapid. If only a limited number of SIDs is available, fewer instances can be uniquely represented, so learning must depend more heavily on generalization. However, the judicious use of relatively few SIDs can still have a considerable effect. In particular, there is no point in allocating SIDs to patterns that are correctly classified by existing generalizations. The value of SIDs comes from memorizing patterns that are frequently misclassified.

For example, when learning a generalization, most of the mistakes are made on the boundary instances. In fact, a single boundary instance can be missed an exponential number of times. By using SIDs for these difficult instances, a considerable time improvement might be expected. Empirically this appears to be the case. "Easy" (x of d) functions, which can be learned by an LTU in polynomial time (O(d ** 3)), can have an exponential number of boundary cases (e.g. (d/2 of d), which has on the order of 2 ** d boundary patterns), but "hard" (exponential) LTU functions have only a linear number. As previously observed, random LTU functions have 2d boundary patterns on average. Consequently, while a large number of SIDs is required to improve learning time on easy LTU functions, only a few are required to improve performance on hard functions.

As a concrete example, consider the AND/OR function: (((((F1 and F2) or F3) and F4) or F5) ... Fd). This is a linear function, and with (-1, 1) input it can be represented with feature weights which form a Fibonacci series (i.e. (1, 1, 2, 3, 5, 8, ...), with Wi = W(i-1) + W(i-2), so each weight is the sum of the previous two). Because the ratio between successive terms in a Fibonacci series approaches the golden ratio (about 1.6), weight size (1.6 ** d) and training time ((1.6 ** d) ** 2 = 1.6 ** (2d), or about 2.6 ** d) are exponential in d. For this solution vector, the values of |FW| ("a" in the upper bound) are 1, 3, 5, 7, etc., for inputs at increasing distance from the hyperplane. It can be shown that there are exactly d + 1 boundary instances at distance |FW| = 1. (With least-correct ordering, all adjusts occur on this set of patterns.) The next set of exactly 2d - 3 input patterns is at distance |FW| = 3. This means that "a" in the upper bound can be increased from 1 to 3 and the upper bound divided by 9 (reduced by about 89%) by memorizing the closest d + 1 instances with SIDs (in effect removing them from the input space). By the same argument, "a" can be increased from 1 to 5 and the upper bound divided by 25 (reduced by about 96%) if (d + 1) + (2d - 3) = 3d - 2 SIDs are used to memorize the 3d - 2 boundary patterns. Using least-correct ordering and (0, 1) input, the observed reductions were approximately 88% and 95%, respectively. Using the same testing conditions, reductions of 80% and 93% were achieved by removing the closest d + 1 and 3d - 2 patterns from randomly generated LTU functions. As expected, essentially no improvement was observed when d + 1 or 3d - 2 boundary instances were removed from the "easy" (d/2 of d) function, which has an exponential (O(2 ** d)) number of boundary instances.

As an extreme case, if only the d + 1 boundary instances of the AND/OR function are to be learned, perceptron training time is unchanged (at about 2.6 ** d). However, only d + 1 SIDs and adjusts are required to memorize all the patterns.
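The Fibonacci solution vector can be checked by brute force. In the following sketch (illustrative code, not the paper's simulations), the weights and bias are built recursively: for each new feature, a weight of roughly half the current maximum sum is added, and the bias is shifted down for an AND or up for an OR. The construction is one plausible reading of the text, not necessarily the author's implementation:

    from itertools import product

    # Brute-force check that the alternating AND/OR function
    # (((F1 and F2) or F3) and F4) ... is computed by one linear
    # threshold unit over (-1, 1) inputs with Fibonacci feature weights.

    def and_or(bits):                      # bits are -1 (false) or 1 (true)
        value = bits[0] > 0
        for i, b in enumerate(bits[1:], start=2):
            value = (value and b > 0) if i % 2 == 0 else (value or b > 0)
        return value

    d = 10
    w, bias = [1], 0
    for i in range(2, d + 1):
        m = (sum(map(abs, w)) + abs(bias) + 1) // 2
        w.append(m)                        # Fibonacci: 1 1 2 3 5 8 ...
        bias += -m if i % 2 == 0 else m    # AND lowers bias, OR raises it

    for bits in product((-1, 1), repeat=d):
        lt = sum(wi * b for wi, b in zip(w, bits)) + bias > 0
        assert lt == and_or(bits)
    print("weights:", w)                   # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]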


The point of this example is not that perceptron training is inefficient. In fact, from the same input patterns, the two approaches have learned quite different functions. An LTU trained to correctly classify only the d + 1 boundary instances of the AND/OR function will also correctly classify all other patterns. NN classification using the same d + 1 boundary instances will not; in fact, it requires about half the input space to learn the AND/OR function. The point is that the two learning strategies have complementary strengths and weaknesses. Consequently, a combination can perform better than either of them individually.

Besides the difficult boundary cases considered so far, there may be a number of exceptions to any large generalization. If a positive grouping cannot contain any negative instances, a single, large generalization may have to be broken up into a number of smaller groupings. For example, excluding a single instance from an LTU grouping over d features may require splitting the single cluster into d separate clusters. By rapidly learning such exceptions to a more general rule, large but imperfect generalizations can still be utilized.

In the interest of conserving memory, SIDs that are infrequently useful could be released for reuse. The two-level focusing algorithm (Section 5.5) produces this effect by defocusing nodes. Patterns of current interest are repeatedly focused on, thus retaining their specificity, while the rest are slowly defocused and eventually reused. More sophisticated approaches to node recycling are possible, but did not prove necessary.

5. BOOLEAN OPERATOR STRUCTURE AND TRAINING

5.1. INTRODUCTION

The current domain is restricted to Boolean inputs and outputs, so for completeness a category detector should be able to represent any Boolean function. A structure capable of this will be referred to as a Boolean operator. There are Boolean functions a single node (an LTU) cannot detect (e.g. Exclusive Or), so an assembly is needed to represent them. In particular, there are 2 ** (2 ** d) Boolean functions, while there are at most 2 ** (d ** 2) LTU functions. Biologically, it is frequently suggested that the proper unit of analysis for neural systems is not the individual neuron, but small assemblies of neurons whose combined activity can be functionally described. For example, the cortical column is one possible unit of description (Mountcastle, 1978, 1979; Hubel, 1981; Kohonen, 1984, Chap. 8; Kuffler et al., 1984, Chap. 3).

Training multilevel networks to detect arbitrary Boolean functions from input/output pairs has been a long-standing problem. Considerable progress has been made, but current network learning algorithms (e.g. Hinton et al., 1984; Ackley et al., 1985; Barto, 1985; Rumelhart et al., 1986) are of limited physiological relevance, and empirically are quite slow. The approaches developed here are somewhat more physiologically plausible and considerably faster. In this section, two training algorithms are considered which have the capability of training a two-level network to detect Boolean functions.


The first uses perceptron training to learn to represent functions in disjunctive form, and is appropriate for single-output functions. The second uses focusing to train a common memory shared by multiple output systems. Much of the nervous system (e.g. sensory systems) is shared by multiple output systems (e.g. motor systems), but it is likely that groups of neurons are also dedicated to specific outputs. Different learning algorithms can be used in the two cases. Because it tries to generalize over clusters of input patterns, the disjunctive learning algorithm is potentially space efficient but is often slow to learn. Focusing is more concerned with specific instances (in the extreme), and consequently is less space efficient but faster to learn ("one-shot" in the extreme). The two systems thus provide complementary capabilities that can be combined in a single system that capitalizes on the strengths of both.

5.2. NETWORK INTERCONNECTIONS AND ASSOCIATION TYPES

The structure of a Boolean operator can be quite simple. Since all Boolean functions can be expressed in Disjunctive Normal Form (DNF) (that is, as the OR of ANDs), a two-level network is a logically sufficient, though not necessarily efficient, representation (Fig. 15). A single OR node is required for the final output, and up to 2 ** (d - 1) AND nodes may be needed in the first layer for arbitrarily complex Boolean functions. More generally, both the first- and second-level nodes can be LTUs. The minimum connections required are that the first-level nodes see the input features and that the second-level node sees the first-level nodes. Thus the desirable simplicity of a linear function can be maintained for individual nodes while implementing a completely general Boolean function.

Even a simple two-level network allows a variety of distinct association types. For example, in Fig. 16a, features A and B produce response R. Direct input-output associations (A → R, B → R) have been the basis of most S-R models of conditioning, and constitute the simplest formal model of associative memory, the so-called "matrix" model (Kohonen et al., 1981). Connecting the output node directly with the input as well as with intermediate levels has some biological justification. The neocortex is a layered structure, and its principal output (pyramidal) cells have inputs which connect with the raw input as well as the output of other layers (Shepherd, 1979, 1988, Chap. 30; Diamond, 1979; Rockel et al., 1980; Jones, 1981a,b; Porter, 1981; White, 1981). Likewise, information is sequentially processed by different areas of the cortex, but the more advanced areas also receive relatively raw, unprocessed information in addition to the output of the preceding areas (Stone et al., 1979). Such deep connections can be a significant advantage in layered systems. If each layer were connected to only its nearest neighbors, the appropriate discriminations might have to be sequentially learned by each layer before they could be utilized at the top level. By connecting a node with all nodes below it, or conversely making its output available to all nodes above it, the output level can use information as soon as it becomes available anywhere in the network.
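To make the two-level DNF structure concrete, the following sketch (Python; illustrative names, with (0, 1) inputs assumed) implements Exclusive Or as an OR of two AND nodes, each of which is an ordinary LTU:

    import numpy as np

    def ltu(weights, threshold, x):
        # a linear threshold unit over the vector x
        return 1 if float(np.dot(weights, x)) >= threshold else 0

    def and_node(signs):
        # AND of features (sign +1) and negated features (sign -1)
        w = np.array(signs, dtype=float)
        theta = float(np.sum(w > 0))       # every required feature must be 1
        return lambda x: ltu(w, theta, x)

    layer1 = [and_node([1, -1]), and_node([-1, 1])]   # F1 AND ~F2, ~F1 AND F2

    def operator(x):
        h = np.array([f(x) for f in layer1], dtype=float)
        return ltu(np.ones(len(h)), 1.0, h)           # OR of the AND nodes

    for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
        print(x, "->", operator(np.array(x, dtype=float)))   # prints XOR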


Fig. 15. A two-level network of LTUs is capable of representing arbitrary Boolean functions in disjunctive normal form.

As previously mentioned, for Boolean completeness the model requires at least one intervening layer between stimulus and response. It is apparent that biological S-R pathways also possess intermediate levels of analysis. Human reaction time studies suggest about 100 cell delays between stimulus and response for relatively complex S-R tasks. With the introduction of intermediate nodes, intermediate associations are possible (Fig. 16b). The existence of such associations (A → C, B → C, C → R) has been demonstrated and incorporated into many models of associative conditioning (Razran, 1965; Rescorla, 1972b, 1973, 1980b; Rescorla et al., 1985b; Rudy and Wagner, 1975; Sutherland and Rudy, 1989; Kehoe and Gormezano, 1980; Bellingham et al., 1985; Kehoe and Schreurs, 1986; Kehoe, 1988; Kehoe and Graham, 1988; Forbes and Holland, 1985; Brown, 1987). In theoretical models of conditioning, these intermediate nodes are generally invoked as AND functions to explain "configural" or "unique cue" conditioning on compound stimuli. However, in principle, they might compute any LTU function. An advantage of AND nodes is that they can be learned in one shot. As previously discussed, the hippocampus is frequently associated with configural learning.

Finally, the introduction of recurrent connections (links that permit activity loops) (Fig. 16c) produces a new set of associations. Sideways connections (A → B, B → A) are the most direct interpretation of within-stimulus learning phenomena (Brogden, 1939; Seidel, 1959; Thompson, 1972; Rescorla and Cunningham, 1978; Durlach and Rescorla, 1980; Rescorla and Durlach, 1981; Rescorla and Colwill, 1983; Rescorla, 1980a, 1981a,b, 1982a,d, 1983, 1984; Speers et al., 1980; Holland and Ross, 1983). Inhibitory sideways connections produce lateral inhibition, a common processing strategy in neural systems (Lindsay and Norman, 1977; Rozsypal, 1985).

Fig. 16. Association types in a two-level network. (a) Direct S-R connections. (b) Indirect S-R connections. (c) Recurrent connections.

Top-down connections (C → A, C → B) permit a form of "cognitively" driven bias or modulation. Recurrent connections are well developed within the cerebral cortex, and between the thalamus and the cortex (Wong-Riley, 1978; Jones, 1981a,b, 1985; Crick, 1984; Maunsell and Newsome, 1987; LaBerge and Brown, 1989), but their exact function(s) are generally not known. These connection schemes can be mixed in various ways, and in describing the dorsal horn (a sensory region) of the spinal cord, Shepherd (1988, p. 261) identifies essentially all of them: "First, there are connections for straight-through transmission of each specific modality. ... These include not only axodendritic synapses (for forward transmission) but also dendroaxonic synapses (for immediate feedback excitation or inhibition) and dendrodendritic synapses (for lateral inhibition between responding cells). Second, there are connections that mediate interactions between modalities. ... Third, there are connections from descending axons that provide for modulation of incoming sensory information by higher brain centers."

5.3. OPERATOR TRAINING (DISJUNCTIVE REPRESENTATION)

Training a two-level operator to detect arbitrary Boolean functions can be treated as two separate processes: training the second-level node and training the first level. Although the output node has the full computational power of an LTU, in this section operators are trained to represent concepts in disjunctive form; that is, as the OR of groupings of positive instances. Consequently, training the output node is trivial. The problem is to train the first-level nodes.

The constraints needed to produce a disjunctive representation in the first level are relatively simple: (1) for all negative input patterns, all nodes should be off; (2) for each positive input pattern, at least one node should be on. Paralleling these constraints, two types of adjustment may be needed during training. Existing groupings must be contracted to exclude misclassified negative instances and expanded to include misclassified positive ones. Perceptron training is used in both cases. Similar techniques are discussed in Batchelor (1974, 1978, Chap. 4).

The first type of error is easy to rectify: a teaching signal (T) of -1 is simply broadcast to all nodes in the first level. There is no problem of credit assignment on negative instances, since all nodes with output above 0 are in error and should adjust their output downward. The second type of error is more difficult. When presented with an uncovered positive instance, preexisting groupings must be expanded to cover it. Unfortunately, it is not obvious which group(s) can be generalized to include the new instance without incorrect overgeneralization. Under such circumstances error-free generalization may not be possible, but since learning is incremental, over-generalization is not necessarily fatal. Future negative instances will spare appropriate groupings and eliminate inappropriate over-generalization.


The easiest approach is to increase the output of all nodes for the current input by a fixed amount. In conjunction with the use of conditional probability (to be discussed shortly), this strategy works reasonably well if the input space is small, but the adjustment is essentially training noise for all but the "correct" nodes; that is, those which can cover the instance without over-generalization. Consequently, training could be improved upon if there were some indication as to which generalizations were most apt to be successful. There may be no exact method of determining the most appropriate group(s) to generalize, but a heuristic method can be employed. The simple, and intuitively reasonable, heuristic adopted here is that a group which requires little generalization (that is, which almost includes the instance already) is a better candidate than a group which requires major expansion. For complex, symbolic representation schemes, "closeness" may be difficult to measure, but the output of a linear equation gives a single numeric measure. Based on this, each group can be generalized in proportion to its similarity to the new instance as compared to other groups. In the simplest implementation, each first-level node's rate of learning is set equal to its "output rank", that is, its output level relative to other first-level nodes:

    Learn_rate := (Output - Min_out) / (Max_out - Min_out)

where Min_out and Max_out are the minimum and maximum output of nodes in the plane, and Output is the node's own current output. The node(s) with the smallest output do not learn at all, and those with the largest output learn at the maximum rate. All other nodes learn at intermediate rates. Related "competitive" learning rules have been used in many systems (Widrow and Hoff, 1960; Nilsson, 1965, Chap. 6; Fukushima, 1975, 1980, 1984; Fukushima and Miyake, 1978, 1982; Miyake and Fukushima, 1984; Bienenstock et al., 1982; Anderson, 1982; Kohonen, 1984, 1988; Rumelhart and Zipser, 1985; Grossberg, 1987, 1988, p. 38; Silverman and Noetzel, 1988; Ahalt et al., 1990).

Adjustment by closeness or similarity can be viewed as a heuristic guess as to which nodes are the most apt to generalize successfully. However, because the appropriate nodes are not necessarily the closest, this produces a fundamental tradeoff. Unproductive generalization (and consequently training time) can be reduced, but the probability of missing a successful generalization is increased. Empirically, it appears that the ability to generalize is not seriously limited by the similarity heuristic, while training time is significantly improved. It is worth noting that the continuous output value used for similarity measurement is necessarily nonbinary. For the purpose of output calculation, all nodes can be treated as LTUs, but for the purpose of learning, the continuous, nonthresholded output of the first-level nodes is utilized in determining a node's output rank. Unthresholded output provides a measure of categoric "closeness" that is lost with binary classification.
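The output-rank rule can be sketched directly (Python; the base learning rate and the additive update form are illustrative assumptions):

    import numpy as np

    # A minimal sketch of the output-rank learning rate: on an uncovered
    # positive instance, each first-level node is nudged toward the input
    # in proportion to how close it already is (its unthresholded output
    # rank relative to the other nodes in the plane).

    def rank_update(weights, x, base_rate=0.1):
        out = weights @ x                            # unthresholded outputs
        lo, hi = out.min(), out.max()
        rank = (out - lo) / (hi - lo) if hi > lo else np.ones_like(out)
        # closest nodes (rank near 1) expand the most toward the instance
        return weights + base_rate * rank[:, None] * x[None, :]

    W = np.array([[ 0.5, -0.5,  0.0],
                  [-0.5,  0.5,  0.0]])
    x = np.array([1.0, -1.0, 1.0])                   # uncovered positive instance
    print(rank_update(W, x))                         # node 0 moves the most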


Similarity-based generalization is consistent with reports linking a neuron's likelihood of learning a pattern to its predisposition to respond to it (Weinberger, 1982, p. 65; Woody, 1982a, p. 135; Singer, 1985), and with behavioral observations that activated memory is especially susceptible to modification (see further reading, 5.1). However, there have been some failures to replicate the behavioral results (e.g. Dawson and McGaugh, 1969) and there are various interpretations of the positive results, so the status of that phenomenon is still debatable (Spear and Mueller, 1984).

As with weight adjustment, a node's conditional probability traces are also adjusted to reflect viable groupings of positive instances. Numerous approaches are possible, but conditional probability is currently adjusted as follows: (1) For all negative instances, the conditional probability traces of all nodes are reduced using a conditional probability teacher (X in the conditional probability equations) of 0. (2) For a positive instance where no node in the first level has a positive output, the conditional probability is adjusted for each node with X equal to the node's output rank for that input. (3) For a positive instance where some node in the first level has a positive output, the node with maximum output has its conditional probability traces adjusted with an X of 1, and all other nodes are adjusted with an X of 0. (Alternatively, the conditional probabilities of all other nodes are simply not adjusted.)

There are numerous variations on the use of conditional probability at the node and system level, many of which are empirically effective. However, the proposed disjunctive learning model requires the use of some form of conditional probability for convergence on all but the simplest functions. At a highly abstract level, the conditional probability serves to stabilize useful connections by "remembering" the area of appropriate firing and biasing adjustment to retain that useful firing pattern.

5.4. OPERATOR TRAINING RESULTS

The use of conditional probability can significantly accelerate single node training, and networks built of such conditional probability nodes appear to learn faster than networks built from standard nodes. As shown in the following results, the operator training (OT) algorithm is much faster in training a two-level network than the results published for other algorithms. For example, the 2-feature parity function (Exclusive Or) was learned with 2 first-level nodes in 980 input presentations with back-propagation (Rumelhart et al., 1986); the OT algorithm required 15, on the average. Barto (1985) reports that his A(R-P) algorithm learned the 6-feature multiplexer problem in 130,000 presentations using 4 nodes; the OT algorithm converged in 524 presentations using 5 nodes. Finally, the 6-feature symmetry problem was learned in 77,312 presentations using 2 nodes and back-propagation (Rumelhart et al., 1986); the OT algorithm converged in 640 presentations using 3 nodes. These results indicate that there is considerable room for improvement in network training speed over current implementations.

However, speed is not the sole criterion for judging connectionistic learning models, as all of the above functions could be learned much faster by simply memorizing every input pattern. SID-based classification can rapidly learn the instances it has seen, but is space expensive and limited in its generalization potential. For example, the previously described nearest neighbor algorithm is incapable of detecting irrelevant features. Consequently, the number of SIDs (and adjusts) required to learn a function increases exponentially with the number of irrelevant features. In addition, the NN algorithm is not particularly effective in learning prototypic categories, and empirically requires an exponential number of nodes and adjusts to learn (x of d) or LTU functions. The goal of the OT algorithm, and of most other connectionistic learning models, is not just to learn the function, but to find effective generalizations. This reduces space requirements and hopefully improves predictive potential. In this respect the OT algorithm is reasonably effective, as it generally finds near-optimum disjunctive representations (thus achieving a compact category representation), and the learned LTU groupings are reasonable measures of prototypic similarity (thus generalizing in ways that are apt to correctly classify novel input). Appropriate generalization over unseen inputs would seem to be a universally desirable goal, but it is not clear that a maximally compact representation is necessarily desirable, since the resulting representation would be maximally sensitive to damage.

The particular characteristics of the OT algorithm are important in determining its training speed, but much of its speed advantage over other network training algorithms simply comes from using smarter nodes. For example, by using the same nodes as the OT algorithm, back-propagation results are improved to almost match OT results. In the context of the S-R model, any learning algorithm that can effectively generalize over operator preconditions is acceptable. The OT algorithm is a simple, general approach. However, the more that is known about a particular task, the more the representation and learning algorithms can be tailored to exploit the known regularities. At a minimum, input from different sensory modalities might be preprocessed in different ways before being presented to a general learning mechanism.

5.5. SHARED MEMORY FOCUSING

A logical extension of a single Boolean operator is a Boolean behavioral system, mapping inputs to sets of output actions. A simple approach to training multiple outputs is to independently train a collection of separate operators. This would produce the desired behavior, but has certain drawbacks. The most obvious is that providing a separate nervous system for each operator is unnecessarily extravagant. A related objection is that totally separate operators cannot share information. If the same pattern is important to several operators, it must be learned and represented separately by each of them. Consequently there is some value in providing a shared memory structure for the multiple operators (Fig. 17).


Fig. 17. Shared memory for multiple operators. The shared memory adds a set of new features to the original input set.

The OT algorithm is inappropriate for training a shared memory. The back-propagation algorithm is appropriate, and has been successfully utilized for that task. However, an alternative approach based on focusing is developed here. Its learning characteristics are quite different from those of the OT algorithm, as focusing attempts to isolate problematic instances, while the OT algorithm tries to group categorically similar inputs together. In the combined system, the separate operators are trained using the OT algorithm, and the shared memory is trained using focusing. For simplicity, operators can be limited to single nodes and the shared memory to a single layer of nodes, since that structure is computationally complete in the Boolean domain.

More generally, a certain amount of dedicated memory should be allocated to the individual operators. While there are space advantages in shared structures, there are often speed advantages in training independent systems. For example, while back-propagation can train a single pool of nodes to be shared by a number of output nodes, training time generally increases with the number of output nodes. If the functions to be learned are relatively independent, it is faster to learn them in independent structures. Both shared and output-specific systems are well defined at the extremes of the nervous system (e.g. the sensory and motor maps).

The shared memory focusing algorithm has two components: (1) input driven learning, and (2) error driven focusing. These two learning processes will be considered separately.

5.5.1. Input driven learning

Input driven (teacherless/unsupervised) learning can be summarized as:

At least one node in the shared memory should be on for any input.

Input driven learning introduces an intrinsic, teacherless learning process in the shared memory. Independent of the system's output, learning takes place until all observed inputs are represented by some activity pattern in the shared memory. Learning is identical to single operator training, except that the teaching signal is equal to 1 for all input patterns. Alternatively, the thresholds of all nodes can simply be backed off until some nodes begin to fire. This is a very simple approach to unsupervised learning. All that is required is that every input pattern result in some activity pattern, which can then be modified by the supervised learning component. A more sophisticated approach to unsupervised learning might try to produce a representation that is apt to be useful for the supervised portion.

Input driven learning appears to play a major role in the functional development of the visual cortex (Spinelli et al., 1972; Lund, 1978; Movshon and Van Sluyters, 1981; Singer, 1984, 1985; Kuffler et al., 1984, Chap. 20). For example, kittens were reared in an apparatus which permitted them to see only a limited number of vertical stripes with one eye and horizontal stripes with the other. Later, cells in the visual cortex were found to be preferentially responsive to stimuli resembling the ones they had been exposed to previously (Spinelli et al., 1972).
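The threshold back-off option mentioned above can be sketched as follows (Python; the names and step size are illustrative):

    import numpy as np

    # A minimal sketch of input driven learning by threshold back-off: if
    # no shared-memory node fires for the current input, all thresholds
    # are lowered until at least one node begins to fire.

    def ensure_coverage(weights, thresholds, x, step=0.1):
        while not np.any(weights @ x - thresholds > 0):
            thresholds -= step          # back off until some node covers x
        return thresholds

    W = np.array([[0.5, -0.5], [-0.5, 0.5]])
    theta = np.array([1.0, 1.0])
    x = np.array([1.0, 1.0])            # initially covered by no node
    print(ensure_coverage(W, theta, x)) # thresholds lowered until one fires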


5.5.2. Error driven focusing

Error driven focusing can be summarized as: if an output error occurs, the shared memory may not represent the current input pattern in a specific enough form for the operators to use. Therefore the state of the shared memory should be adjusted to produce a more specific representation.

One of the experimental approaches demonstrating input driven learning was also used to demonstrate output dependent modification (Spinelli and Jensen, 1979, 1982). In that study, sensory areas were biased to preferentially detect behaviorally important patterns.

With a single operator, output error can be easily computed as the difference between the output node's actual and correct output. For multiple operators, error can be summed over all the operators. Shared memory is adjusted when output error > 0. This can be implemented as a step function, but works more smoothly if the amount of adjustment varies continuously with the error. A significant characteristic of this learning strategy is that stimulus patterns are learned independently of the correct behavioral response to them. That is, the shared memory learns to represent the input patterns with increasing specificity; it is up to the operators to decide what, if any, categorical response the patterns should be associated with. At a minimum, operators are single nodes capable of computing a thresholded linear function, so first-level modification should converge on a representation that is decodable by that function. An obviously decodable extreme results if each behaviorally relevant input pattern is uniquely represented by one node in the first level. The operators could then pick and choose among the SIDs, functioning only as ORs. Any learning process that converges on this state will eventually converge on correct behavior, although it might require up to 2 ** d nodes.

In order to implement such a process, large first-level categories must be subdivided into smaller ones. One mechanism for achieving this is to focus the most specific current category more narrowly on the current input whenever an output error occurs. If the error persists, the category should ultimately be focused down to a single input pattern. The amount of focusing can be any fraction between 0 (no change) and 1 (one-shot learning). If no node includes the misclassified input, focusing does not take place. Input driven learning guarantees that there will eventually be at least one active node to focus. "Specificity" can be defined as the number of input patterns in a category besides the current pattern, or geometrically, as the diameter of the positive region on the hypersphere. As usual, alternative implementations are possible. A mathematically simple, though biologically implausible, approach is to normalize the length (without threshold) of all weight vectors to a fixed length (e.g. 1.0). The threshold then provides a direct measure of category specificity. An alternative measure of specificity is to consider the placement of the positive instance within the category. This can be formally calculated as the farthest distance from the input to the border of the category. Empirically, the latter measure works somewhat better.
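A minimal sketch of one focusing step is given below. The specificity measure (unit-length weight vectors with the threshold as specificity) follows the "mathematically simple" option just described; the focusing rate and the exact update form are illustrative assumptions:

    import numpy as np

    # One error driven focusing step: when an output error occurs, the
    # most specific node that still covers the current input is narrowed
    # toward it.  With unit-length weight vectors, a higher threshold
    # means a smaller positive region, so the threshold doubles as the
    # specificity measure; a rate near 1 gives (roughly) one-shot focusing.

    def focus(weights, thresholds, x, rate=0.5):
        x = x / np.linalg.norm(x)
        margin = weights @ x - thresholds
        covering = np.flatnonzero(margin > 0)           # nodes including x
        if covering.size == 0:
            return                                      # nothing to focus yet
        j = covering[np.argmax(thresholds[covering])]   # most specific cover
        weights[j] += rate * (x - weights[j])           # turn toward the input
        weights[j] /= np.linalg.norm(weights[j])
        thresholds[j] += rate * margin[j]               # shrink the region

    W = np.array([[1.0, 0.0], [0.0, 1.0]])
    theta = np.array([0.1, 0.1])
    focus(W, theta, np.array([1.0, 0.2]))
    print(W[0], theta[0])   # node 0 has turned toward the input and tightened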

This general process of training the shared memory to detect and represent increasingly specific input categories has a number of appealing features. In particular, it is a (potentially) complete technique for learning Boolean functions. As previously considered for SIDs, the ability to rapidly learn specific instances has numerous applications. Consequently, the extreme case of one-shot focusing may sometimes be appropriate. However, with incremental focusing, greater memory efficiency is possible. Error driven focusing is a reasonably efficient process, since learning takes place only when behavior needs to be improved. If no output errors occur, the representation is specific enough for acceptable behavior, so no learning is required. When an error does occur, the representation is not specific enough for that pattern. Through successive focusing, the area of error is identified and can then be used by the operators to correct their output.

5.6. INPUT-SPECIFIC FEATURE ASSOCIABILITY

In multilevel networks, a new set of computed features can be added to the initial input set. This is adequate for representational completeness, but may be far from optimal in the rate of higher-level learning based on this combined feature set. In particular, the upper bound on perceptron training time increases linearly with the number of redundant features in the representation. Redundant information can be useful (e.g. in the presence of noise), but under many circumstances it can significantly retard learning. Consequently, from the perspective of higher-level learning, the number of adjustable features should be kept to a minimum. Output-specific salience is useful in reducing the effects of unnecessary features, but is not perfect. While feature irrelevance, or relative relevance, is by definition with respect to something (e.g. to a particular output, or to learning in general), redundancy or predictability is a property of the input patterns (that is, of the context in which the features occur), and permits more specialized approaches.

One approach to the problem of redundant features is for first-level nodes (shared memory) to reduce the associability of those lower-level features that they adequately predict. This way the total number of adjustable features can be decreased rather than increased. For example, if categorization node Cn detected the category (F1 AND F2 AND F3) for input features F1 through F6, the resulting feature set, from the viewpoint of higher-level learning, would be (Cn, F4, F5, F6) rather than (Cn, F1, F2, F3, F4, F5, F6) whenever Cn was present. If Cn corresponded to a recurring environmental context involving hundreds or thousands of features, the resulting reduction would be considerable. As another example, the individual features of a pixel-level representation of an object are essentially worthless for the purpose of higher-level learning. It is the high-level categorization node that provides useful information. If it is simply added to the thousands of pixel-level features, it will be swamped.

This feature-replacement process can be formalized with the use of conditional probability. In addition to the forward conditional probability trace [Cn|Fi] (the probability of Cn given Fi) that each category node associates with each input feature, each categorization node can also maintain a reverse conditional probability trace [Fi|Cn] (the probability of Fi given Cn) for each input feature. Such "top-down expectations" play a prominent role in Grossberg's Adaptive Resonance Theory (e.g. Grossberg, 1980).


As before, the conditional probability trace can be incrementally computed as:

    [Fi|Cn] := [Fi|Cn] + (Fi - [Fi|Cn]) * Cn * rt

where Fi and Cn are the output values of those nodes, and rt is a rate constant determining the memory length of the trace. Whenever Cn is present and Fi = [Fi|Cn] (that is, when Fi is adequately predicted), Fi can be safely deleted from the representation for the purpose of higher-level learning, since the thing that predicted it (Cn) can be used for associative learning instead. As was observed in the discussion of SIDs, perceptron training based on a combination of raw features and SIDs can be slower than with the use of SIDs alone. This input-specific method of adjusting associability can eliminate that problem, since SIDs will always predict all of their relevant inputs. In practice, since [Fi|Cn] can only asymptotically approach 1 or -1, an arbitrary threshold of 0.95 is used. Output calculation is the same as before, but a pass of feature deletion occurs before learning takes place. Any node whose current output is adequately predicted by higher-level nodes has its associability set to zero; otherwise its associability remains at 1. A continuous version of this might make associability equal to |Fi - [Fi|Cn]| * Cn. The only things that are noticed as relevant for learning are high-level categories and the remaining features that are not adequately predicted by those categories. Consequently, new associations are preferentially attached at the highest level of categorization. A similar process is employed in Fukushima's neocognitron (Miyake and Fukushima, 1984, 1986, 1989).

Behaviorally, an "orienting response" is often directed toward unexpected stimuli, and its strength has been used as an index of stimulus associability (Kaye and Pearce, 1984a,b, 1987; Hall et al., 1985; Collins and Pearce, 1985; Honey et al., 1987; Lubow, 1989, p. 154), although the two are not always correlated (Hall and Schachtman, 1987; Lubow, 1989, pp. 40, 154). In this case "attention" is thought to be directed to the unexpected stimuli, which enhances their associability, rather than reducing the associability of expected features. However, assuming that associability is a parameter that can be adjusted in either direction, both processes may occur.

As an (unimplemented) alternative to adjusting the associability of the predicted features, their output values could be modified. In particular, an origin shift can be accomplished by category node Cn by simply subtracting its reverse conditional probability vector from the current input vector. For a fixed set of Cs, the result is still provably convergent since, for a linear equation, any origin shift (produced by Fi - [Fi|Cn] * Cn) can be exactly compensated for by a threshold shift (provided by the associative strength attached to the context feature Cn). Of course, some care must be taken so that Cn does not turn itself off by reducing its own inputs. One possible approach would be to process the input stimuli in stages or layers. At each stage, nodes such as Cn could block the further passage of features they adequately predict, while leaving the previous stage intact for their own use.
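The trace update and the feature-deletion pass can be sketched as follows (Python; the rate constant, trial structure and random seed are illustrative assumptions):

    import numpy as np

    # A minimal sketch of input-specific feature associability with
    # (-1, 1) features.  A category node Cn keeps a reverse trace [Fi|Cn]
    # for each feature; features whose trace predicts their current value
    # (past a 0.95 threshold) lose their associability whenever Cn is
    # active, leaving only Cn and the unpredicted features adjustable.

    rng = np.random.default_rng(0)
    rt, threshold = 0.2, 0.95
    trace = np.zeros(6)

    for _ in range(200):                 # Cn fires whenever F1 = F2 = F3 = 1
        x = np.concatenate(([1., 1., 1.], rng.choice([-1., 1.], size=3)))
        cn = 1.0
        trace += (x - trace) * cn * rt   # reverse trace update

    x = np.concatenate(([1., 1., 1.], rng.choice([-1., 1.], size=3)))
    assoc = np.where(np.abs(x - trace) < (1 - threshold), 0.0, 1.0)
    print(trace.round(2))                # roughly [1, 1, 1, ~0, ~0, ~0]
    print(assoc)                         # typically F1..F3 deleted, F4..F6 kept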


The separate storage and use of forward and backward conditional probability emphasizes the distinction between concept detection and description. The forward probabilities are appropriate for detection (categorization) while the reverse probabilities are appropriate for description. For convenience, the same set of concepts can be used for both processes. However, a better approach might be to train a separate set of categories that are optimized for input predictiveness. For example, a simple generalized Hebbian learning rule (Hebb, 1949) (increase weights when pre- and postsynaptic output values are the same, decrease weights when they are different) will train a node to respond to a set of co-occurring input features (a minimal sketch of this rule is given at the end of this section). Sanger (1989) describes a learning algorithm in which high-level nodes are trained to predict just those input features that are not already correctly predicted by other high-level nodes.

This more general class of "predictive" world models can be learned independently of proper behavior. In general, any model that adequately predicts a set of features can be used in place of them. If the number of nodes needed to represent the state of the model is significantly smaller than the set of features it predicts, then a significant acceleration of learning may be possible. SIDs are basically one-node models, and can be used directly by the output nodes. More complex, multinode models might require another layer of processing, since an SID for the particular state of the model might be required.

This method of adjusting feature associability is another form of latent learning. However, it is based on input regularities rather than predictiveness for any particular output (as it was with single node training). The concept of context-dependent feature salience is not new (see further reading, 5.2) and this particular implementation is similar to Nadel's model (Nadel and Willner, 1980; Nadel et al., 1985). The model proposed here is functionally simple, although an actual biological implementation would probably require rather complex circuit-level systems. Nadel identifies the hippocampus as a likely component of such a system. Other researchers also identify the hippocampus as a likely site (or at least a likely component) for the matching of actual and expected conditions (Vinogradova, 1975; Gray, 1982, 1984; Schmajuk, 1989; Schmajuk and Moore, 1985; Kaye and Pearce, 1987; Lubow, 1990, p. 130). Specific neurotransmitters (dopamine and serotonin) have also been linked to particular aspects of latent inhibition (Lubow, 1989, Chap. 6). Lubow (1989) develops a biologically motivated model of latent inhibition in which context plays a key role. In that model, latent inhibition occurs when the context predicts that a particular feature is irrelevant. Thus, it combines features of both the output-dependent and context-dependent approaches to variable associability which have been developed here.

The proposed model addresses the general problem of predictable input features, but does not directly address the problem of identifying context. Given the potentially huge number of redundant contextual features, it would make sense to specifically address that problem. More generally, context identification has been suggested to serve a number of beneficial functions (Balsam, 1985). Unfortunately, there may be no precise criteria for defining context (there may be a continuum between foreground and background features), but spatial cues would seem to be likely candidates for determining biological context. More complex "world models" would presumably permit more complex context models.
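The generalized Hebbian rule quoted above (increase weights when pre- and postsynaptic values agree, decrease them when they differ) can be sketched in a few lines. This is an illustrative reading of the rule for +/-1 valued units, not code from the review; the names and constants are assumptions.

import numpy as np

def hebbian_step(w, pre, post, eta=0.05):
    # For +/-1 units the product pre*post is +1 when the pre- and
    # postsynaptic values agree (weights strengthen) and -1 when
    # they differ (weights weaken).
    return w + eta * post * pre

# Repeatedly pairing a cluster of co-occurring features with an
# active node tilts the weights toward that cluster.
w = np.zeros(6)
cluster = np.array([1, 1, 1, -1, -1, -1])
for _ in range(10):
    w = hebbian_step(w, pre=cluster, post=1)
print(w)  # positive on the co-occurring features, negative elsewhere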

6. SUMMARY AND CONCLUSIONS

This paper considers some of the representation and training characteristics of the simplest neuron model, an LTU. Simple networks and simple network training algorithms were also developed. The time and space complexity of these models was discussed in the context of learning LTU and Boolean functions. Various acceleration techniques were considered. A recurring theme was the tradeoff between generalization and specialization. A number of topics were addressed in the development of the model, and many are of general interest and make contact with both biological and theoretical data. Some of those issues are summarized briefly below.

(1) The basic unit of computation (the model neuron) is a Linear Threshold Unit (LTU), which, for binary features, can be viewed as a prototype detector. Threshold logic is a standard simplification of neural computation, and prototypic categorization corresponds well to the structure of natural categories and to observed classification strategies in biological organisms. A good match between representation structure and the structure of the function to be learned permits effective generalization, which in turn permits compact representation and predictive potential for novel inputs. An LTU is consequently a reasonable first approximation for connectionistic modeling. More complex node models are possible, but their value over LTUs depends on the application. Formal analysis provides bounds on representation power and required weight size.

(2) An LTU can be trained to compute any linearly separable classification by use of the perceptron training algorithm. Appropriate neurophysiological components for perceptron training have been demonstrated in simple organisms, and the results of perceptron training correspond well with many characteristics of classical conditioning. A considerable amount of formal analysis is possible, providing learning speed bounds on different classes of functions and under different training circumstances. This analysis helps identify areas of possible improvement. Analysis and empirical evidence demonstrate the importance of input order in determining the learning rate and predictive potential of perceptron training. In general, prototypic instances lead to good generalization and fast learning, while border instances produce poor generalization and slow learning.

(3) Perceptron training speed can be significantly accelerated with the use of output-specific feature associability based on the conditional probability relationship between input and correct output. This is effective for both relevant and irrelevant features. Variable associability is consistent with the behavioral phenomena of latent inhibition and learned irrelevance. Adaptive origin placement can also significantly accelerate training speed, and is generally consistent with the behavioral characteristics of habituation.

(4) Focusing is developed as a specialized mechanism for learning specific instances. Neural evidence for such a learning process is weak, but the resulting dichotomy between generalized category and specific instance learning is consistent with the procedural/declarative distinction. Numerous applications of SIDs are possible, although their application in this model is quite limited. Formal analysis and empirical evidence permit the value of SIDs for category learning to be bounded under some circumstances. Significant improvement in training speed is often possible when SIDs are used in conjunction with perceptron training. In particular, SIDs are quite useful for memorizing the difficult boundary cases of LTU functions.

(5) There are Boolean functions that a single node cannot represent, so a network is needed. Since any Boolean category can be represented in a two-level network, this is a necessary minimal extension of the basic prototypic view of categorization. Many simple conditioning results can be explained in terms of such structures.

(6) A two-level, disjunctive representation can be learned with a modification of perceptron training. In this algorithm (operator training), the existing groupings of inputs are extended to cover new inputs in proportion to their pre-existing strengths of response to those inputs. There is some biological evidence for such a generalization strategy. As in the case of individual nodes, network structure and learning strategies that can effectively capture the appropriate generalizations permit compact representation and have predictive potential for novel inputs. Assuming positive instances come in prototypic clusters, the OT algorithm should be reasonably efficient, but more specialized applications would permit more specialized approaches.

(7) Error-driven focusing is used to train a memory shared by multiple operators. It is based on a different approach than the OT algorithm. While the OT algorithm attempts to generalize over groups of similarly categorized patterns, focusing tries to isolate patterns that have caused errors. By rapidly identifying troublesome inputs, behavior can be rapidly corrected, even for "exceptions to the rule" which run counter to the prevailing OT generalizations. In the extreme, SIDs are formed, but since focusing is incremental, complete specificity is not necessary.

(8) Input-specific associability can be used to accelerate training in multilevel systems. Perceptron training time increases with the number of redundant features, so significant speed-up can be achieved if high-level nodes reduce the associability of those lower-level features that they accurately predict. Controlled associability thus emerges as an important, general process in network training. There is good behavioral evidence for input-specific associability. This process highlights the difference between pattern categorization and pattern description, and is a simple instance of the more general class of predictive/explanatory world models.
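Points (2), (3) and (8) fit together in a few lines of code. The sketch below is illustrative only and assumes the usual conventions (0/1 features, a constant bias input folded into the weight vector, function names invented here); the optional associability vector scales each weight correction, which is the shared mechanism behind output-specific and input-specific associability.

import numpy as np

def perceptron_step(w, x, target, assoc=None, eta=1.0):
    # One perceptron training step for an LTU. The last component
    # of x is a constant 1, so the threshold is learned as an
    # ordinary weight. assoc (if given) scales each correction,
    # implementing variable feature associability.
    if assoc is None:
        assoc = np.ones_like(x, dtype=float)
    y = 1 if w @ x > 0 else 0           # LTU output
    if y != target:                     # correct only on error
        w = w + eta * (target - y) * assoc * x
    return w

# Usage: learn a linearly separable function (here, x0 OR x1).
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
T = np.array([0, 1, 1, 1])
w = np.zeros(3)
for _ in range(10):                     # a few passes suffice here
    for x, t in zip(X, T):
        w = perceptron_step(w, x, t)
print([1 if w @ x > 0 else 0 for x in X])  # -> [0, 1, 1, 1]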


7. FURTHER READING AND NOTES

2.1. PROTOTYPIC SIMILARITY

The basic idea of prototypic similarity is that the similarity of two instances is a continuous function of all the individual features that define them. In particular, given an ideal or "prototypic" point for some category, a function can be defined which returns a numeric measure of any other instance's similarity to the prototype (e.g. a value between 0 and 1, where the prototype is maximally similar to itself). This similarity measure is generally related to the geometric distance between the two points in some multidimensional space. A key characteristic that distinguishes this approach from the more "classical" approach is that there may be no critical features that can individually determine an instance's classification. Instead, there is a continuous measure based on all features. There are different ways the prototypical point might be defined, and many forms the similarity function can take, but the general approach has considerable support in the biological literature (Wittgenstein, 1953; Neisser, 1967, Chap. 3; Posner, 1969; Posner and Keele, 1968, 1970; Strange et al., 1970; Franks and Bransford, 1971; Reed, 1972, 1973, 1978; Homa et al., 1973; Homa, 1984; Smith et al., 1974; Hayes-Roth and Hayes-Roth, 1977; McCloskey and Glucksberg, 1978, 1979; Tversky, 1977; Tversky and Gati, 1982; Gati and Tversky, 1982; Tversky and Hutchinson, 1986; Sattath and Tversky, 1987; Hulse et al., 1980, p. 232; Rosch, 1973, 1978; Rosch and Mervis, 1975; Mervis and Rosch, 1981; Medin and Schaffer, 1978; Smith and Medin, 1981; Medin and Smith, 1984; Knapp and Anderson, 1984; Cohen and Murphy, 1984; Busemeyer et al., 1984; Busemeyer and Myung, 1988; Kohonen, 1984, p. 59; Nosofsky, 1984, 1986, 1988a,b; Keil, 1987; Shepard, 1987, 1988; Ashby and Perrin, 1988).
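As a concrete (and deliberately simple) reading of the above, similarity can be written as a decreasing function of geometric distance from the prototype. The exponential form below is only one of the many forms the text allows (it echoes Shepard, 1987); the function name and the sharpness parameter are inventions for illustration.

import numpy as np

def prototype_similarity(x, prototype, sharpness=1.0):
    # Similarity in (0, 1]: 1.0 at the prototype itself, falling
    # smoothly toward 0 with distance in feature space.
    d = np.linalg.norm(np.asarray(x, float) - np.asarray(prototype, float))
    return float(np.exp(-sharpness * d))

print(prototype_similarity([1, 1, 0], [1, 1, 0]))  # 1.0, the prototype
print(prototype_similarity([1, 0, 0], [1, 1, 0]))  # ~0.37, one feature off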


3.1. PROBABILITY

There are many behavioral models that use probability, conditional probability or contingency as a measure of expectations. To some extent these reflect what an organism, or neuron, should "optimally" compute, and to some extent they are intended to model observed behavior. While the fit between theoretical prediction and actual behavior is frequently good, there are also significant deviations from simple statistical models (Beach, 1964a,b; Rescorla, 1967, 1972a; Rescorla and Wagner, 1972; Schoenfeld et al., 1973; Gibbon et al., 1974; Gibbon, 1981; Ajzen and Fishbein, 1975; Seligman, 1975; Maier and Seligman, 1976; Alloy and Seligman, 1979; Maier, 1989; Estes, 1976; Bindra, 1976, 1978; McCauley and Stitt, 1978; Rosch, 1978; Uttley, 1979; Lolordo, 1979c,d; Dickinson, 1980; Dickinson et al., 1984; Dickinson and Shanks, 1985; Shanks and Dickinson, 1987; Shanks, 1985, 1986, 1987; Sejnowski, 1981; Damianopoulos, 1982; Jenkins et al., 1981; Jenkins and Lambos, 1983; Smith and Medin, 1981; Hammond and Paynter, 1983; Hammond, 1980, 1985; Staddon, 1983; Mackintosh, 1983, Chap. 7; Schwartz, 1984; Alloy and Tabachnik, 1984; Miller and Schachtman, 1985a,b; Scott and Platt, 1985; Granger and Schlimmer, 1986; Farley, 1986; Miller and Matzel, 1989, p. 62; Levy, 1989; Levy et al., 1990). Not surprisingly, statistically motivated models of human decision making (Slovic et al., 1988) are generally more complex than those for other animals.

3.2. VARIABLE ASSOCIABILITY

The Rescorla-Wagner learning rule is concerned with controlling the associative strength of features, but a feature's plasticity, or "associability", is also a modifiable parameter. The basic experimental paradigm is to show that some pretreatment involving the stimuli A and/or B does not affect the associative strength between them, but does affect the speed with which they can be associated during later associative training. There are many conditions that appear to affect a feature's associability, but a basic goal is that associative training is at least partially restricted to those features that are most relevant to the task at hand (Lubow and Moore, 1959; Lubow, 1973, 1989; Lubow et al., 1976a,b, 1981; Siegel, 1972; Weiss and Brown, 1974; Solomon and Moore, 1975; Solomon, 1979; Moore and Stickney, 1980, 1985; Rescorla, 1971, 1972a; Rescorla and Holland, 1982; Reiss and Wagner, 1972; Halgren, 1974; Mackintosh, 1973, 1975, 1983, Chaps 8 and 9; Baker and Mackintosh, 1977, 1979; Baker, 1976; Baker and Mercier, 1982a,b, 1989; Mercier and Baker, 1985; Baker and Baker, 1985; Bitterman, 1979; Lolordo, 1979a,b,d; Frey and Sears, 1978; Dickinson and Mackintosh, 1979; Dickinson, 1980, Chap. 4; Pearce and Hall, 1979, 1980; Pearce et al., 1982a,b; Pearce, 1987, Chap. 5; Hall, 1980; Hall and Pearce, 1979, 1982a,b; Hall et al., 1985; Hall and Honey, 1989a,b; Tarpy, 1982, Chap. 3; Gormezano et al., 1983; Staddon, 1983, p. 436; Alloy and Tabachnik, 1984; Nadel et al., 1985; Solomon, 1987; deVietti et al., 1987; Klein, 1987, Chap. 2; Dess and Overmier, 1989).

4.1. ADJUSTABLE EXCITABILITY

While adjustable synaptic strength is well established as a physiological plasticity mechanism, adjustable "excitability" of the cell as a whole is less frequently reported (Spencer et al., 1966; Thompson and Spencer, 1966; Sokolov et al., 1970; Woody et al., 1976; Woody, 1982a, pp. 151, 195, 1982b, 1984a,b, 1986; Brons and Woody, 1980; Brons et al., 1982; Kandel, 1977; Lynch et al., 1977; Bindman et al., 1979; Bindman and Lippold, 1981; Bindman and Prince, 1986; Abraham and Goddard, 1985; Abraham and Bliss, 1985; Alkon, 1986, 1987, 1989; Farley, 1988; Carley, 1988; Levy et al., 1990).

5.1. MODIFICATION OF REACTIVATED MEMORY

It is often observed that initial learning can be subsequently modulated (e.g. disrupted by electroconvulsive shock soon after training). Although the results and interpretations are considerably more variable, it has also been observed that if the memory is reactivated at a later date by some sort of reminder, it is again subject to disruption or modification (Misanin et al., 1968; DeVietti et al., 1977; Lewis et al., 1972; Lewis, 1979; Mactutus et al., 1979; Loftus and Loftus, 1980; Loftus, 1982; Riccio and Ebner, 1981; Spear, 1981; Spear and Mueller, 1984; Gordon, 1981; Miller, 1982; Judge and Quartermain, 1982; Schneider and Plough, 1983; Miller and Marlin, 1984).

5.2. CONTEXT-SPECIFIC ASSOCIABILITY

As previously observed, the associability of features is a controlled parameter. Under some circumstances, the level of associability is a function of the particular context, so that a feature which is learned to be irrelevant in one context may not be treated as such in another context (Dexter and Merrill, 1969; Anderson et al., 1969a,b; Lubow et al., 1976a; Lubow, 1989, pp. 74, 213; Wagner, 1976, 1978, 1979; Baker and Mercier, 1982a,b; Channell and Hall, 1983; Hall and Minor, 1984; Hall and Channell, 1985a,b, 1986; Hall and Honey, 1989a,b; Lovibond et al., 1984; Mackintosh, 1985a,b; Kaye et al., 1987).

REFERENCES

ABRAHAM, W. C. and BLISS, T. V. P. (1985) An analysis of the increase in granule cell excitability accompanying habituation in the dentate gyrus of the anesthetized rat. Brain Res. 331, 303-313.
ABRAHAM, W. C. and GODDARD, G. V. (1983) Asymmetric relations between homosynaptic long-term potentiation and heterosynaptic long-term depression. Nature, Lond. 305, 717-719.
ABRAHAM, W. C. and GODDARD, G. V. (1985) Multiple traces of neural activity in the hippocampus. In: Memory Systems of the Brain, Eds. N. M. WEINBERGER, J. L. MCGAUGH and G. LYNCH. The Guilford Press: New York, NY.
ABRAMS, T. W. (1985) Cellular studies of an associative mechanism for classical conditioning in Aplysia. In: Model Neural Networks and Behavior, Ed. A. I. SELVERSTON. Plenum Press: New York, NY.
ACKLEY, D. H., HINTON, G. E. and SEJNOWSKI, T. J. (1985) A learning algorithm for Boltzmann machines. Cog. Sci. 9, 147-169.
ADAMS, C. and DICKINSON, A. (1981) Actions and habits: Variations in associative representations during instrumental learning. In: Information Processing in Animals: Memory Mechanisms, Eds. N. E. SPEAR and R. R. MILLER. Lawrence Erlbaum Associates: Hillsdale, NJ.
ADRIAN, E. D. (1946) The Physical Background of Perception. Clarendon Press: Oxford, England.
AHALT, S. C., KRISHNAMURTHY, A. K., CHEN, P. and MELTON, D. E. (1990) Competitive learning algorithms for vector quantization. Neural Networks 3, 277-290.
AJZEN, I. and FISHBEIN, M. (1975) A Bayesian analysis of attribution processes. Psychol. Bull. 82, 262-277.
ALGER, B. E. and TEYLER, T. J. (1976) Long-term and short-term plasticity in CA1, CA3 and dentate regions of the rat hippocampal slice. Brain Res. 110, 463-480.
ALKON, D. L. (1986) Changes of membrane currents and calcium dependent phosphorylation during associative learning. In: Neural Mechanisms of Conditioning, Eds. D. L. ALKON and C. D. WOODY. Plenum Press: New York, NY.
ALKON, D. L. (1987) Memory Traces in the Brain. Oxford University Press: New York, NY.
ALKON, D. L. (1989) Memory storage and neural systems. Sci. Am. 261, 42-50.
ALKON, D. L., QUEK, F. and VOGEL, T. P. (1989) Computer modeling of associative learning. In: Advances in Neural Information Processing Systems 1, Ed. D. S. TOURETZKY. Morgan Kaufman: San Mateo, CA.
ALLAN, L. G. (1980) A note on measurement of contingency between two binary variables in judgement tasks. Bull. Psychon. Soc. 15, 147-149.
ALLOY, L. B. and SELIGMAN, M. E. P. (1979) On the cognitive component of learned helplessness and depression. In: The Psychology of Learning and Motivation, Vol. 13. Ed. G. H. BOWER. Academic Press: New York, NY.
ALLOY, L. B. and TABACHNIK, N. (1984) Assessment of covariation by humans and animals: The joint influence of prior expectations and current situational information. Psychol. Rev. 91, 112-149.
AMARI, S. (1972) Learning patterns and pattern sequences by self-organizing nets of threshold elements. IEEE Trans. Computers C-21, 1197-1206.
ANDERSEN, P., SUNDBERG, S. H., SVEEN, O., SWANN, J. W. and WIGSTROM, H. (1980) Possible mechanisms for long-lasting potentiation of synaptic transmission in hippocampal slices from guinea-pigs. J. Physiol. 302, 463-482.
ANDERSON, C. W. (1982) Feature generation and selection by a layered network of reinforcement learning elements: Some initial experiments. Tech. Rep. 82-12, Comp. and Info. Sci. Dept., University of Massachusetts, Amherst, MA.
ANDERSON, D. C., O'FARRELL, T., FORMICA, B. and CAPONIGRI, V. (1969a) Preconditioning CS exposure: variation in the place of conditioning and presentation. Psychon. Sci. 15, 54-55.
ANDERSON, D. C., WOLF, D. and SULLIVAN, P. (1969b) Preconditioning exposures to the CS: Variations in place of testing. Psychon. Sci. 14, 233-235.
ANDERSON, J. R. (1976) Language, Memory and Thought. Lawrence Erlbaum Associates: Hillsdale, NJ.
ANDERSON, J. R. (1982) Acquisition of cognitive skill. Psychol. Rev. 89, 369-406.
ANDERSON, J. R. (1983) The Architecture of Cognition. Harvard University Press: Cambridge, MA.
ARTOLA, A. and SINGER, W. (1987) Long-term potentiation and NMDA receptors in rat visual cortex. Nature 330, 649-652.
ASHBY, F. G. and PERRIN, N. A. (1988) Toward a unified theory of similarity and recognition. Psychol. Rev. 95, 124-150.
ATKINSON, R. C. and ESTES, W. K. (1963) Stimulus sampling theory. In: Handbook of Mathematical Psychology, Vol. 2. Eds. R. D. LUCE, R. R. BUSH and E. GALANTER. Wiley: New York, NY.
BACHEVALIER, J. and MISHKIN, M. (1984) An early and late developing system for learning and retention in infant monkeys. Behav. Neurosci. 98, 770-776.
BAKER, A. G. (1976) Learned irrelevance and learned helplessness: Rats learn that stimuli, reinforcers, and responses are uncorrelated. J. exp. Psychol.: Animal Behav. Processes 2, 130-141.
BAKER, A. G. and BAKER, P. A. (1985) Does inhibition differ from excitation: Proactive interference, contextual conditioning, and extinction. In: Information Processing in Animals: Conditioned Inhibition, Eds. R. R. MILLER and N. E. SPEAR. Lawrence Erlbaum Associates: Hillsdale, NJ.
BAKER, A. G. and MACKINTOSH, N. J. (1977) Excitatory and inhibitory conditioning following uncorrelated presentations of the CS and US. Animal Learning and Behavior 5, 315-319.
BAKER, A. G. and MACKINTOSH, N. J. (1979) Preexposure to the CS alone, US alone, or CS and US uncorrelated: Latent inhibition, blocking by context or learned irrelevance. Learning and Motivation 10, 278-294.
BAKER, A. G. and MERCIER, P. (1982a) Prior experience with the conditioning events: Evidence for a rich cognitive representation. In: Quantitative Analyses of Behavior, Vol. III. Eds. M. L. COMMONS, R. J. HERRNSTEIN and A. R. WAGNER. Ballinger: Cambridge, MA.
BAKER, A. G. and MERCIER, P. (1982b) Extinction of the context and latent inhibition. Learning and Motivation 13, 391-416.
BAKER, A. G. and MERCIER, P. (1989) Attention, retrospective processing and cognitive representation. In: Contemporary Learning Theories, Vol. I. Eds. S. B. KLEIN and R. R. MOWRER. Lawrence Erlbaum Associates: Hillsdale, NJ.
BARLOW, H. B. (1972) Single units and sensation: A neuron doctrine for perceptual psychology. Perception 1, 371-394.
BARNES, C. A. and MCNAUGHTON, B. L. (1985) Spatial information: How and where is it stored? In: Memory Systems of the Brain, Eds. N. M. WEINBERGER, J. L. MCGAUGH and G. LYNCH. The Guilford Press: New York, NY.
BARSALOU, L. W. (1985) Ideals, central tendency, and frequency of instantiation as determinants of graded structure in categories. J. exp. Psychol.: Learning Memory and Cognition 11, 625-654.
BARSALOU, L. W. and BOWER, G. H. (1984) Discrimination nets as psychological models. Cog. Sci. 8, 1-26.
BARTO, A. G. (1985) Learning by statistical cooperation of self-interested neuron-like computing elements. Hum. Neurobiol. 4, 229-256.
BATCHELOR, B. G. (1974) Practical Approach to Pattern Classification. Plenum Press: New York, NY.
BATCHELOR, B. G. (1978) Classification and data analysis in vector space. In: Pattern Classification, Ed. B. G. BATCHELOR. Plenum Press: New York, NY.
BEACH, L. R. (1964a) Cue probabilism and inference behavior. Psychol. Monographs 78(5), 1-20.
BEACH, L. R. (1964b) Recognition, assimilation, and identification of objects. Psychol. Monographs 78(6), 21-37.
BEAR, M. F. and SINGER, W. (1986) Modulation of visual cortical plasticity by acetylcholine and noradrenaline. Nature 320, 172-176.
BELLINGHAM, W. P., GILLETTE-BELLINGHAM, K. and KEHOE, E. J. (1985) Summation and configuration in patterning schedules with the rat and rabbit. Animal Learning and Behav. 13, 152-164.
BELNAP, N. D. (1977) A useful four-valued logic. In: Modern Uses of Multiple-Valued Logic, Eds. J. M. DUNN and G. EPSTEIN. Reidel Publishing: Boston, MA.
BIENENSTOCK, E. L., COOPER, L. N. and MUNRO, P. W. (1982) Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. J. Neurosci. 2, 32-48.
BINDMAN, L. and LIPPOLD, O. (1981) The Neurophysiology of the Cerebral Cortex. Texas University Press: Austin, TX.
BINDMAN, L., LIPPOLD, O. C. J. and MILNE, A. R. (1979) Prolonged changes in excitability of pyramidal tract neurones in the cat: A post-synaptic mechanism. J. Physiol., Lond. 286, 457-477.
BINDMAN, L. J. and PRINCE, C. A. (1986) Persistent changes in excitability and input resistance of cortical neurons in the rat. In: Neural Mechanisms of Conditioning, Eds. D. L. ALKON and C. D. WOODY. Plenum Press: New York, NY.
BINDRA, D. (1976) A Theory of Intelligent Behavior. John Wiley and Sons: New York, NY.
BINDRA, D. (1978) How adaptive behavior is produced: A perceptual-motivational alternative to response-reinforcement. Behav. Brain Sci. 1, 41-91.
BITTERMAN, M. E. (1979) Attention. In: Animal Learning: Survey and Analysis, Eds. M. E. BITTERMAN, V. M. LOLORDO, J. B. OVERMIER and M. E. RASHOTTE. Plenum Press: New York, NY.
BLACK, J. E. and GREENOUGH, W. T. (1986) Developmental approaches to the memory process. In: Learning and Memory, Eds. J. L. MARTINEZ, JR and R. P. KESNER. Academic Press: New York, NY.
BLISS, T. V. P. and DOLPHIN, A. C. (1984) Where is the locus of long-term potentiation? In: Neurobiology of Learning and Memory, Eds. G. LYNCH, J. L. MCGAUGH and N. M. WEINBERGER. Guilford Press: New York, NY.
BLOOM, F. E. (1979) Chemical integrative processes in the central nervous system. In: Neurosciences: Fourth Study Program, Eds. F. O. SCHMITT and F. G. WORDEN. MIT Press: Cambridge, MA.
BOURNE JR, L. E. (1970) Knowing and using concepts. Psychol. Rev. 77, 546-556.
BOURNE JR, L. E., DOMINOWSKI, R. L. and LOFTUS, E. F. (1979) Cognitive Processes. Prentice Hall: Englewood Cliffs, NJ.
BOURNE JR, L. E. and RESTLE, F. (1959) Mathematical theory of concept identification. Psychol. Rev. 66, 278-296.
BOUTON, M. E. and SWARTZENTRUBER, D. (1986) Analyses of the associative and occasion-setting properties of contexts participating in a Pavlovian discrimination. J. exp. Psychol.: Animal Behav. Processes 12, 333-350.
BRINDLEY, G. S. (1967) Classification of modifiable synapses and their use in models for conditioning. Proc. Royal Soc. Lond. B 168, 361-376.
BROGDEN, W. J. (1939) Sensory pre-conditioning. J. exp. Psychol. 25, 323-332.
BRONS, J. F. and WOODY, C. D. (1980) Long-term changes in excitability of cortical neurons after Pavlovian conditioning and extinction. J. Neurophysiol. 44, 605-615.
BRONS, J. F., WOODY, C. D. and ALLON, N. (1982) Changes in the excitability to weak intensity electrical stimulation of units of the pericruciate cortex in cats. J. Neurophysiol. 47, 377-388.
BROOKS, L. (1978) Nonanalytic concept formation and memory for instances. In: Cognition and Categorization, Eds. E. ROSCH and B. B. LLOYD. Lawrence Erlbaum Associates: Hillsdale, NJ.
BROOKS, L. R. (1987) Decentralized control of categorization: The role of prior processing episodes. In: Concepts and Conceptual Development, Ed. U. NEISSER. Cambridge University Press: New York, NY.
BROWN, M. F. (1987) Dissociation of stimulus compounds by pigeons. J. exp. Psychol.: Animal Behav. Processes 13, 80-91.
BROWN, R. and KULIK, J. (1977) Flashbulb memories. Cognition 5, 73-99.
BROWN, T. H., GANONG, A. H., KAIRISS, E. W., KEENAN, C. L. and KELSO, S. R. (1989) Long-term potentiation in two synaptic systems of the hippocampal brain slice. In: Neural Models of Plasticity, Eds. J. H. BYRNE and W. O. BERRY. Academic Press: New York, NY.
BROWN, T. H., KAIRISS, E. W. and KEENAN, C. L. (1990) Hebbian synapses: Biophysical mechanisms and algorithms. A. Rev. Neurosci. 13, 475-511.
BULGARELLA, R. G. and ARCHER, E. J. (1962) Concept identification of auditory stimuli as a function of amount of relevant and irrelevant information. J. exp. Psychol. 63, 254-257.
BUSEMEYER, J. R., DEWEY, G. I. and MEDIN, D. L. (1984) Evaluation of exemplar-based generalization and the abstraction of categorical information. J. exp. Psychol.: Learning Memory and Cognition 10, 638-648.
BUSEMEYER, J. R. and MYUNG, I. J. (1988) A new method for investigating prototype learning. J. exp. Psychol.: Learning, Memory and Cognition 14, 3-11.
BUTTER, C. M. (1963) Stimulus generalization along one and two dimensions in pigeons. J. exp. Psychol. 65, 339-346.
BYRNE, J. H. and GINGRICH, K. J. (1989) Mathematical model of cellular and molecular processes contributing to associative and nonassociative learning in Aplysia. In: Neural Models of Plasticity, Eds. J. H. BYRNE and W. O. BERRY. Academic Press: New York, NY.



BYRNE, J., GINGRICH, K. J. and BAXTER, D. A. (1989) Computational capabilities of single neurons: Relationship to simple forms of associative and nonassociative learning in Aplysia. In: The Psychology of Learning and Motivation, Vol. 23. Eds. R. D. HAWKINS and G. H. BOWER. Academic Press: New York, NY.
CAREW, T. J. (1987) Cellular and molecular advances in the study of learning in Aplysia. In: The Neural and Molecular Basis of Learning, Eds. J. P. CHANGEUX and M. KONISHI. Wiley: New York, NY.
CAREW, T. J., HAWKINS, R. D. and KANDEL, E. R. (1983) Differential classical conditioning of a defensive withdrawal reflex in Aplysia californica. Science 219, 397-400.
CARLEY, L. R. (1988) Presynaptic neural information processing. In: Neural Information Processing Systems, Ed. D. Z. ANDERSON. American Institute of Physics: New York, NY.
CARLSON, N. R. (1986) Physiology of Behavior. Allyn and Bacon, Inc.: Boston, MA.
CHANNELL, S. and HALL, G. (1983) Contextual effects in latent inhibition with an appetitive conditioning procedure. Animal Learning and Behavior 11, 67-74.
CHEN, H. H., LEE, Y. C., SUN, G. Z., LEE, H. Y., MAXWELL, T. and GILES, C. L. (1986) High order correlation model for associative memory. In: Neural Networks for Computing, Ed. J. S. DENKER. American Institute of Physics: New York, NY.
CLARK, J. W. (1988) Probabilistic neural networks. In: Evolution, Learning and Cognition, Ed. Y. C. LEE. World Scientific: Teaneck, NJ.
COHEN, A. H., ROSSIGNOL, S. and GRILLNER, S. (1988) (Eds.) Neural Control of Rhythmic Movements in Vertebrates. Wiley: New York, NY.
COHEN, B. and MURPHY, G. L. (1984) Models of concepts. Cog. Sci. 8, 27-58.
COHEN, N. J. (1984) Preserved learning capacity in amnesia: Evidence for multiple memory systems. In: Neuropsychology of Memory, Eds. L. R. SQUIRE and N. BUTTERS. Guilford Press: New York, NY.
COHEN, N. J. and SQUIRE, L. R. (1980) Preserved learning and retention of pattern-analyzing skill in amnesia: dissociation of knowing how and knowing that. Science 210, 207-210.
COLLINS, L. and PEARCE, J. M. (1985) Predictive accuracy and the effects of partial reinforcement on serial autoshaping. J. exp. Psychol.: Animal Behav. Processes 11, 548-564.
COTMAN, C. W., MONAGHAN, D. T. and GANONG, A. H. (1988) Excitatory amino acid neurotransmission: NMDA receptors and Hebb-type synaptic plasticity. A. Rev. Neurosci. 11, 61-80.
COVER, T. (1965) Geometrical and statistical properties of systems of linear inequalities with applications to pattern recognition. IEEE Trans. Electronic Computers EC-14, 326-334.
COWAN, W. M. (1979) The development of the brain. Sci. Am. 241, 112-133.
CRICK, F. (1984) Function of the thalamic reticular complex: The searchlight hypothesis. Proc. natn. Acad. Sci. U.S.A. 81, 4586-4590.
CRICK, F. H. C. and ASANUMA, C. (1986) Certain aspects of the anatomy and physiology of the cerebral cortex. In: Parallel Distributed Processing, Vol. 2. Eds. J. L. MCCLELLAND and D. E. RUMELHART. MIT Press: Cambridge, MA.
CRUTCHER, K. A. (1986) Anatomical correlates of neuronal plasticity. In: Learning and Memory, Eds. J. L. MARTINEZ, JR and R. P. KESNER. Academic Press: New York, NY.
DAMIANOPOULOS, E. N. (1982) Necessary and sufficient factors in classical conditioning. Pavlovian J. biol. Sci. 17, 215-229.
DAVIDSON, T. L. and RESCORLA, R. A. (1986) Transfer of facilitation in the rat. Animal Learning and Behavior 14, 380-386.

DAW, N. W., BRUNKEN, W. J. and PARKINSON, D. (1989) The function of synaptic transmitters in the retina. A. Rev. Neurosci. 12, 205-225.
DAWSON, R. G. and MCGAUGH, J. L. (1969) Electroconvulsive shock effect on a reactivated memory trace: further examination. Science 166, 525-527.
DESS, N. K. and OVERMIER, J. B. (1989) General learned irrelevance: Proactive effects on Pavlovian conditioning in dogs. Learning and Motivation 20, 1-14.
DEVIETTI, T. L., BAUSTE, R. L., NUTT, G., BARRETT, O. V., DALY, K. and PETREE, A. D. (1987) Latent inhibition: A trace conditioning phenomenon? Learning and Motivation 18, 185-204.
DEVIETTI, T. L., CONGER, G. L. and KIRKPATRICK, B. R. (1977) Comparison of the enhancement gradients of retention obtained with stimulation of the mesencephalic reticular formation after training or memory reactivation. Physiol. Behav. 19, 549-554.
DEXTER, W. R. and MERRILL, H. K. (1969) Role of contextual discrimination in fear conditioning. J. comp. Physiol. Psychol. 69, 677-681.
DIAMOND, D. M. and WEINBERGER, N. M. (1984) Physiological plasticity of single neurons in auditory cortex of the cat during acquisition of the pupillary conditioned response: II. Secondary field (AII). Behav. Neurosci. 98, 182-210.
DIAMOND, I. T. (1979) The subdivisions of neocortex: A proposal to revise the traditional view of sensory, motor and association areas. In: Progress in Psychobiology and Physiological Psychology, Vol. 8. Eds. J. M. SPRAGUE and A. N. EPSTEIN. Academic Press: New York, NY.
DICKINSON, A. (1980) Contemporary Animal Learning Theory. Cambridge University Press: Cambridge, MA.
DICKINSON, A. and MACKINTOSH, N. J. (1979) Reinforcer specificity in the enhancement of conditioning by posttrial surprise. J. exp. Psychol. Animal Behav. Processes 5, 162-177.
DICKINSON, A. and SHANKS, D. (1985) Animal conditioning and human causality judgements. In: Perspectives on Learning and Memory, Eds. L. NILSSON and T. ARCHER. Lawrence Erlbaum Associates: Hillsdale, NJ.
DICKINSON, A., SHANKS, D. R. and EVENDEN, J. L. (1984) Judgement of act-outcome contingency: the role of selective attribution. Q. J. exp. Psychol. 36A, 29-50.
DISTERHOFT, J. F. and OLDS, J. (1972) Differential development of conditioned unit changes in thalamus and cortex of rat. J. Neurophysiol. 35, 665-679.
DISTERHOFT, J. F. and STUART, D. K. (1976) Trial sequence of changed unit activity in auditory system of alert rat during conditioned response acquisition and extinction. J. Neurophysiol. 39, 266-281.
DOMJAN, M. and BURKHARD, B. (1982) The Principles of Learning and Behavior. Brooks/Cole Publishing Company: Monterey, CA.
DONEGAN, N. H., GLUCK, M. A. and THOMPSON, R. F. (1989) Integrating behavioral and biological models of classical conditioning. In: The Psychology of Learning and Motivation, Vol. 23. Eds. R. D. HAWKINS and G. H. BOWER. Academic Press: New York, NY.
DOYLE, J. (1979) A truth maintenance system. Artif. Intell. 12, 231-272.
DUDA, R. O. and HART, P. E. (1973) Pattern Classification and Scene Analysis. Wiley: New York, NY.
DUNWIDDIE, T. and LYNCH, G. (1978) Long-term potentiation and depression of synaptic responses in the rat hippocampus: Localization and frequency dependency. J. Physiol. 276, 353-367.
DURLACH, P. J. and RESCORLA, R. A. (1980) Potentiation rather than overshadowing in flavor-aversion learning: An analysis in terms of within-compound associations. J. exp. Psychol. Animal Behav. Processes 6, 175-187.

EDELMAN, G. M., GALL, W. E. and COWAN, W. M. (1985) (Eds.) Molecular Bases of Neural Development. John Wiley and Sons: New York, NY.

ESTES, W. K. (1976) The cognitive side of probability learning. Psychol. Rev. 83, 37-64.
ESTES, W. K. (1986) Memory storage and retrieval processes in category learning. J. exp. Psychol. Gen. 115, 155-174.
ESTES, W. K., CAMPBELL, J. A., HATSOPOULOS, N. and HURWITZ, J. B. (1989) Base-rate effect in category learning: A comparison of parallel network and memory storage-retrieval models. J. exp. Psychol. Learning Memory and Cognition 15, 556-571.
FARLEY, J. (1986) Cellular mechanisms of causal detection in a mollusk. In: Neural Mechanisms of Conditioning, Eds. D. L. ALKON and C. D. WOODY. Plenum Press: New York, NY.
FARLEY, J. (1988) Causal detection in a mollusc: Cellular mechanisms of predictive coding, associative learning and memory. In: Quantitative Analyses of Behavior, Vol. VIII. Eds. M. L. COMMONS, R. M. CHURCH, J. R. STELLAR and A. R. WAGNER. Lawrence Erlbaum Associates: Hillsdale, NJ.
FOOTE, S. L. and MORRISON, J. H. (1987) Extrathalamic modulation of cortical function. A. Rev. Neurosci. 10, 67-95.
FORBES, D. T. and HOLLAND, P. C. (1985) Spontaneous configuring in conditioned flavor aversion. J. exp. Psychol. Animal Behav. Processes 11, 224-240.
FREY, P. W. and SEARS, R. J. (1978) Model of conditioning incorporating the Rescorla-Wagner associative axiom, a dynamic attentional process, and a catastrophe rule. Psychol. Rev. 85, 321-340.
FUKUSHIMA, K. (1975) Cognitron: A self-organizing multilayered neural network. Biol. Cyber. 20, 121-136.
FUKUSHIMA, K. (1980) Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cyber. 36, 193-202.
FUKUSHIMA, K. (1984) A hierarchical neural network model for associative memory. Biol. Cyber. 50, 105-113.
FUKUSHIMA, K. and MIYAKE, S. (1978) A self-organizing neural network with a function of associative memory: Feedback-type cognitron. Biol. Cyber. 28, 201-208.
FUKUSHIMA, K. and MIYAKE, S. (1982) Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition. In: Competition and Cooperation in Neural Nets, Eds. S. AMARI and M. A. ARBIB. Springer-Verlag: New York, NY.
GABRIEL, M., FOSTER, K., ORONA, E., SALTWICK, S. E. and STANTON, M. (1980) Neuronal activity of cingulate cortex, anteroventral thalamus, and hippocampal formation in discriminative conditioning: Encoding and extraction of the significance of conditional stimuli. In: Progress in Psychobiology and Physiological Psychology, Vol. 9. Eds. J. M. SPRAGUE and A. N. EPSTEIN. Academic Press: New York, NY.
GABRIEL, M., ORONA, E. and FOSTER, K. (1982) Mechanism and generality of stimulus significance coding in a mammalian model system. In: Advances in Behavioral Biology, Vol. 26. Ed. C. D. WOODY. Plenum Press: New York, NY.
GALLISTEL, C. R. (1990) The Organization of Learning. MIT Press: Cambridge, MA.
GATI, I. and TVERSKY, A. (1982) Representations of qualitative and quantitative dimensions. J. exp. Psychol.: Human Perception and Performance 8, 325-340.
GELPERIN, A., HOPFIELD, J. J. and TANK, D. W. (1985) The logic of Limax learning. In: Model Neural Networks and Behavior, Ed. A. I. SELVERSTON. Plenum Press: New York, NY.
GESCHEIDER, G. A. (1988) Psychophysical scaling. A. Rev. Psychol. 39, 169-200.


GIBBON, J. (1981) The contingency problem in autoshaping. In: Autoshaping and Conditioning Theory, Eds. C. M. LOCURTO, H. S. TERRACE and J. GIBBON. Academic Press: New York, NY.
GIBBON, J., BERRYMAN, R. and THOMPSON, R. L. (1974) Contingency spaces and measures in classical and instrumental conditioning. J. exp. Analysis Behav. 21, 585-605.
GLUCK, M. A. and BOWER, G. H. (1988a) From conditioning to category learning: An adaptive network model. J. exp. Psychol.: General 117, 227-247.
GLUCK, M. A. and BOWER, G. H. (1988b) Evaluating an adaptive network model of human learning. J. Memory and Language 27, 166-195.
GLUCK, M., PARKER, D. B. and REIFSNIDER, E. S. (1989) Learning with temporal derivatives in pulse-coded neuronal systems. In: Advances in Neural Information Processing Systems 1. Ed. D. S. TOURETZKY. Morgan Kaufman: San Mateo, CA.
GLUCK, M. A., REIFSNIDER, E. S. and THOMPSON, R. F. (1990) Adaptive signal processing and the cerebellum: Models of classical conditioning and VOR adaptation. In: Neuroscience and Connectionist Theory, Eds. M. A. GLUCK and D. E. RUMELHART. Lawrence Erlbaum: Hillsdale, NJ.
GLUCK, M. A. and THOMPSON, R. F. (1987) Modeling the neural substrates of associative learning and memory: A computational approach. Psychol. Rev. 94, 176-191.
GOLD, P. E. (1984a) Memory modulation: Neurobiological contexts. In: Neurobiology of Learning and Memory, Eds. G. LYNCH, J. L. MCGAUGH and N. M. WEINBERGER. Guilford Press: New York, NY.
GOLD, P. E. (1984b) Memory modulation: Role of peripheral catecholamines. In: Neuropsychology of Memory, Eds. L. R. SQUIRE and N. BUTTERS. Guilford Press: New York, NY.
GOLD, P. E. and ZORNETZER, S. F. (1983) The mnemon and its juices: Neuromodulation of memory processes. Behav. neural Biol. 38, 151-189.
GORDON, B., ALLEN, E. E. and TROMBLEY, P. Q. (1988) The role of norepinephrine in plasticity of visual cortex. Prog. Neurobiol. 30, 171-191.
GORDON, W. C. (1981) Mechanisms of cue-induced retention enhancement. In: Information Processing in Animals: Memory Mechanisms, Eds. N. E. SPEAR and R. R. MILLER. Lawrence Erlbaum Associates: Hillsdale, NJ.
GRAF, P. and SCHACTER, D. L. (1985) Implicit and explicit memory for new associations in normal and amnesic subjects. J. exp. Psychol.: Learning Memory and Cognition 11, 501-518.
GRANGER, R. H. and SCHLIMMER, J. C. (1986) The computation of contingency in classical conditioning. In: The Psychology of Learning and Motivation, Vol. 20. Ed. G. H. BOWER. Academic Press: New York, NY.
GRAY, J. A. (1982) The Neuropsychology of Anxiety: An Enquiry into the Function of the Septo-hippocampal System. Oxford University Press: Oxford, England.
GRAY, J. A. (1984) The hippocampus as an interface between cognition and emotion. In: Animal Cognition, Eds. H. L. ROITBLAT, T. G. BEVER and H. S. TERRACE. Lawrence Erlbaum Associates: Hillsdale, NJ.
GREENOUGH, W. T. (1984) Possible structural substrates of plastic neural phenomena. In: Neurobiology of Learning and Memory, Eds. G. LYNCH, J. L. MCGAUGH and N. M. WEINBERGER. Guilford Press: New York, NY.
GREENOUGH, W. T. (1985) The possible role of experience-dependent synaptogenesis, or "synapses on demand", in the memory process. In: Memory Systems of the Brain, Eds. N. M. WEINBERGER, J. L. MCGAUGH and G. LYNCH. The Guilford Press: New York, NY.
GROSSBERG, S. (1970) Neural pattern discrimination. J. theor. Biol. 27, 291-337.
GROSSBERG, S. (1980) How does the brain build a cognitive code? Psychol. Rev. 87, 1-51.



GROSSBERG, S. (1983) The quantized geometry of visual space: The coherent computation of depth, form, and lightness. Behav. Brain Sci. 6, 625-692.
GROSSBERG, S. (1987) Competitive learning: From interactive activation to adaptive resonance. Cog. Sci. 11, 23-63.
GROSSBERG, S. (1988) Nonlinear neural networks: Principles, mechanisms, and architectures. Neural Networks 1, 17-61.
GRZYWACZ, N. M. and AMTHOR, F. R. (1989) A computationally robust anatomical model for retinal directional selectivity. In: Advances in Neural Information Processing Systems 1. Ed. D. S. TOURETZKY. Morgan Kaufman: San Mateo, CA.
GRZYWACZ, N. M. and POGGIO, T. (1990) Computation of motion by real neurons. In: An Introduction to Neural and Electronic Networks, Eds. S. F. ZORNETZER, J. L. DAVIS and C. LAU. Academic Press: New York, NY.
GUYON, I., PERSONNAZ, L., NADAL, J. P. and DREYFUS, G. (1988) High order neural networks for efficient associative memory design. In: Neural Information Processing Systems, Ed. D. Z. ANDERSON. American Institute of Physics: New York, NY.
HALGREN, C. R. (1974) Latent inhibition in rats: associative or nonassociative? J. comp. Physiol. Psychol. 86, 74-78.
HALL, G. (1980) Exposure learning in animals. Psychol. Bull. 88, 535-550.
HALL, G. and CHANNELL, S. (1985a) Differential effects of contextual change on latent inhibition and on the habituation of an orienting response. J. exp. Psychol.: Animal Behav. Processes 11, 470-481.
HALL, G. and CHANNELL, S. (1985b) Latent inhibition and conditioning after preexposure to the training context. Learning and Motivation 16, 381-397.
HALL, G. and CHANNELL, S. (1986) Context specificity of latent inhibition in taste aversion learning. Q. J. exp. Psychol. 38B, 121-139.
HALL, G. and HONEY, R. C. (1989a) Perceptual and associative learning. In: Contemporary Learning Theories, Vol. I. Eds. S. B. KLEIN and R. R. MOWRER. Lawrence Erlbaum Associates: Hillsdale, NJ.
HALL, G. and HONEY, R. C. (1989b) Contextual effects in conditioning, latent inhibition, and habituation: Associative and retrieval functions of contextual cues. J. exp. Psychol.: Animal Behav. Processes 15, 232-241.
HALL, G., KAYE, H. and PEARCE, J. M. (1985) Attention and conditioned inhibition. In: Information Processing in Animals, Eds. R. R. MILLER and N. E. SPEAR. Lawrence Erlbaum Associates: Hillsdale, NJ.
HALL, G. and MINOR, H. (1984) A search for context-stimulus associations in latent inhibition. Q. J. exp. Psychol. 36B, 145-169.
HALL, G. and PEARCE, J. M. (1979) Latent inhibition of a CS during CS-US pairings. J. exp. Psychol.: Animal Behav. Processes 5, 31-42.
HALL, G. and PEARCE, J. M. (1982a) Restoring the associability of a pre-exposed CS by a surprising event. Q. J. exp. Psychol. 34B, 127-140.
HALL, G. and PEARCE, J. M. (1982b) Changes in stimulus associability during acquisition: Implications for theories of acquisition. In: Quantitative Analyses of Behavior, Vol. III. Eds. M. L. COMMONS, R. J. HERRNSTEIN and A. R. WAGNER. Ballinger: Cambridge, MA.
HALL, G. and SCHACHTMAN, T. R. (1987) Differential effects of a retention interval on latent inhibition and the habituation of an orienting response. Animal Learning Behav. 15, 76-82.
HAMMOND, L. J. (1980) The effect of contingency upon the appetitive conditioning of free-operant behavior. J. exp. Analysis of Behav. 34, 297-304.
HAMMOND, L. J. (1985) An empirical legacy of two-process theory: Two-term versus three-term relations. In: Affect, Conditioning and Cognition, Eds. F. R. BRUSH and J. B. OVERMIER. Lawrence Erlbaum Associates: Hillsdale, NJ.

HAMMOND, L. J. and PAYNTER, W. E. JR (1983) Probabilistic contingency theories of animal conditioning: A critical analysis. Learning and Motivation 14, 527-550.
HAMPSON, S. E. (1990) Connectionistic Problem Solving: Computational Aspects of Biological Learning. Birkhauser: Boston, MA.
HART, P. E. (1968) Condensed nearest neighbor rule. IEEE Trans. Info. Theory IT-14, 515-516.
HAWKINS, R. D. (1989a) A simple circuit model for higher-order features of classical conditioning. In: Neural Models of Plasticity, Eds. J. H. BYRNE and W. O. BERRY. Academic Press: New York, NY.
HAWKINS, R. D. (1989b) A biologically based computational model for several simple forms of learning. In: The Psychology of Learning and Motivation, Vol. 23. Eds. R. D. HAWKINS and G. H. BOWER. Academic Press: New York, NY.
HAWKINS, R. D. and KANDEL, E. R. (1984) Is there a cell-biological alphabet for simple forms of learning? Psychol. Rev. 91, 375-391.
HAYES-ROTH, B. and HAYES-ROTH, F. (1977) Concept learning and the recognition and classification of exemplars. J. verb. Learn. verb. Behav. 16, 321-338.
HAYGOOD, R. D. and STEVENSON, M. (1967) Effect of number of irrelevant dimensions in non-conjunctive concept learning. J. exp. Psychol. 74, 302-304.
HEARST, E. (1978) Stimulus relationships and feature selection in learning and behavior. In: Cognitive Processes in Animal Behavior, Eds. S. H. HULSE, H. FOWLER and W. K. HONIG. Lawrence Erlbaum Associates: Hillsdale, NJ.
HEARST, E. (1984) Absence as information: Some implications for learning, performance and representational processes. In: Animal Cognition, Eds. H. L. ROITBLAT, T. G. BEVER and H. S. TERRACE. Lawrence Erlbaum Associates: Hillsdale, NJ.
HEARST, E. (1987) Extinction reveals stimulus control: Latent learning of feature-negative discriminations in pigeons. J. exp. Psychol.: Animal Behav. Processes 13, 52-64.
HEBB, D. O. (1949) The Organization of Behavior: A Neuropsychological Theory. Wiley: New York, NY.
HERRNSTEIN, R. J. (1985) Riddles of natural categorization. Phil. Trans. R. Soc. Lond. B 308, 129-144.
HINTON, G. E., SEJNOWSKI, T. J. and ACKLEY, D. H. (1984) Boltzmann machines: Constraint satisfaction networks that learn. Carnegie-Mellon Univ. Tech. Rep. CMU-CS-84-119.
HINTZMAN, D. L. (1986) "Schema abstraction" in a multiple-trace memory model. Psychol. Rev. 93, 411-428.
HINTZMAN, D. L. (1988) Judgements of frequency and recognition memory in a multiple-trace memory model. Psychol. Rev. 95, 528-551.
HINTZMAN, D. L. and LUDLAM, G. (1980) Differential forgetting of prototypes and old instances: simulation of an exemplar-based classification model. Memory and Cognition 8, 378-382.
HOLLAND, P. C. (1984) Differential effects of reinforcement of an inhibitory feature after serial and simultaneous feature negative discrimination training. J. exp. Psychol.: Animal Behav. Processes 10, 461-475.
HOLLAND, P. C. (1985a) The nature of conditioned inhibition in serial and simultaneous feature negative discrimination. In: Information Processing in Animals: Conditioned Inhibition, Eds. R. R. MILLER and N. E. SPEAR. Lawrence Erlbaum Associates: Hillsdale, NJ.
HOLLAND, P. C. (1985b) Element pretraining influences the content of appetitive serial compound conditioning in rats. J. exp. Psychol.: Animal Behav. Processes 11, 367-387.
HOLLAND, P. C. (1986a) Transfer after serial feature positive discrimination training. Learning and Motivation 17, 243-268.

HOLLAND, P. C. (1986b) Temporal determinants of occasion setting in feature positive discriminations. Animal Learning and Behav. 14, 111-120.
HOLLAND, P. C. (1989a) Acquisition and transfer of conditional discrimination performance. J. exp. Psychol.: Animal Behav. Processes 15, 154-165.
HOLLAND, P. C. (1989b) Occasion setting with simultaneous compounds in rats. J. exp. Psychol.: Animal Behav. Processes 15, 183-193.
HOLLAND, P. C. (1989c) Transfer of negative occasion setting and conditioned inhibition across conditioned and unconditioned stimuli. J. exp. Psychol.: Animal Behav. Processes 15, 311-328.
HOLLAND, P. C. (1989d) Feature extinction enhances transfer of occasion setting. Animal Learning Behav. 17, 269-279.
HOLLAND, P. C. and LAMARRE, J. (1984) Transfer of inhibition after serial and simultaneous feature negative discrimination training. Learning and Motivation 15, 219-243.
HOLLAND, P. C. and ROSS, R. T. (1983) Savings test for associations between neutral stimuli. Animal Learning Behav. 11, 83-90.
HOMA, D. (1978) Abstraction of ill-defined form. J. exp. Psychol.: Hum. Learning and Memory 4, 407-416.
HOMA, D. (1984) On the nature of categories. In: The Psychology of Learning and Motivation, Ed. G. H. BOWER. Academic Press: New York, NY.
HOMA, D. and CHAMBLISS, D. (1975) The relative contributions of common and distinctive information on the abstraction from ill-defined categories. J. exp. Psychol.: Human Learning and Memory 1, 351-359.
HOMA, D., CROSS, J., CORNELL, D., GOLDMAN, D. and SCHWARTZ, S. (1973) Prototype abstraction and classification of new instances. J. exp. Psychol. 101, 116-122.
HOMA, D., STERLING, S. and TREPEL, L. (1981) Limitations of exemplar-based generalization and the abstraction of categorical information. J. exp. Psychol.: Human Learning and Memory 7, 418-439.
HOMA, D. and VOSBURGH, R. (1976) Category breadth and the abstraction of prototypical information. J. exp. Psychol.: Human Learning and Memory 2, 322-330.
HONEY, R. C. and HALL, G. (1989) Enhanced discriminability and reduced associability following flavor preexposure. Learning and Motivation 20, 262-277.
HONEY, R. C., SCHACHTMAN, T. R. and HALL, G. (1987) Partial reinforcement in serial autoshaping: The role of attentional and associative factors. Learning and Motivation 18, 288-300.
HONIG, W. K. (1978) Studies of working memory in the pigeon. In: Cognitive Processes in Animal Behavior, Eds. S. H. HULSE, H. FOWLER and W. K. HONIG. Lawrence Erlbaum Associates: Hillsdale, NJ.
HOPFIELD, J. J. (1982) Neural networks and physical systems with emergent collective computational abilities. Proc. natn. Acad. Sci. U.S.A. 79, 2554-2558.
HUBEL, D. H. (1981) Columns and their function in the primate visual cortex. In: Theoretical Approaches in Neurobiology, Eds. W. E. REICHARDT and T. POGGIO. MIT Press: Cambridge, MA.
HULSE, S. H., EGETH, H. and DEESE, J. (1980) The Psychology of Learning, 5th edn. McGraw-Hill: New York, NY.
HUNT, E. (1989) Cognitive science: Definition, status, and questions. A. Rev. Psychol. 40, 603-629.
HUNT, E., MARTIN, J. and STONE, P. (1966) Experiments in Induction. Academic Press: New York, NY.
JENKINS, H. M. (1985) Conditioned inhibition of key pecking in the pigeon. In: Information Processing in Animals: Conditioned Inhibition, Eds. R. R. MILLER and N. E. SPEAR. Lawrence Erlbaum Associates: Hillsdale, NJ.
JENKINS, H. M., BARNES, R. A. and BARRERA, F. J. (1981) Why autoshaping depends on trial spacing. In: Autoshaping and Conditioning Theory, Eds. C. M. LOCURTO, H. S. TERRACE and J. GIBBON. Academic Press: New York, NY.
JENKINS, H. M. and LAMBOS, W. A. (1983) Tests of two explanations of response elimination by noncontingent reinforcement. Animal Learning and Behavior 11, 302-308.
JENKINS, H. M. and SAINSBURY, R. S. (1970) Discrimination learning with the distinctive feature on positive and negative trials. In: Attention: Contemporary Theory and Analysis, Ed. D. MOSTOFSKY. Appleton-Century-Crofts: New York, NY.
JONES, E. G. (1981a) Anatomy of cerebral cortex: Columnar input-output organization. In: The Organization of the Cerebral Cortex, Eds. F. O. SCHMITT, F. G. WORDEN, G. ADELMAN and S. G. DENNIS. MIT Press: Cambridge, MA.
JONES, E. G. (1981b) Functional subdivisions and synaptic organization of the mammalian thalamus. In: Neurophysiology IV, Int. Rev. of Physiol., Vol. 25. Ed. R. PORTER. University Park Press: Baltimore, MD.
JONES, E. G. (1985) The Thalamus. Plenum Press: New York, NY.
JUDGE, M. E. and QUARTERMAIN, D. (1982) Characteristics of retrograde amnesia following reactivation of memory in mice. Physiol. Behav. 28, 585-590.
KACZMAREK, L. K. and LEVITAN, I. B. (1986) (Eds.) Neuromodulation: The Biochemical Control of Neuronal Excitability. Oxford University Press: New York, NY.
KAMIN, L. J. (1969) Predictability, surprise, attention and conditioning. In: Punishment and Aversive Behavior, Eds. B. A. CAMPBELL and R. M. CHURCH. Appleton-Century-Crofts: New York, NY.
KANDEL, E. R. (1976) Cellular Basis of Behavior. W. H. Freeman and Company: San Francisco, CA.
KANDEL, E. R. (1977) Neuronal plasticity and the modification of behavior. In: Handbook of Physiology, the Nervous System, Ed. E. R. KANDEL. Waverly Press Inc.: Baltimore, MD.
KANDEL, E. R. (1979) Behavioral Biology of Aplysia. W. H. Freeman and Company: San Francisco, CA.
KANDEL, E. R. and SCHWARTZ, J. (1982) Molecular biology of learning: Modulation of transmitter release. Science 218, 433-443.
KANDEL, E. R. and SCHWARTZ, J. (1985) (Eds.) Principles of Neural Science. Elsevier North-Holland: New York, NY.
KAPLAN, P. S. and HEARST, E. (1985) Contextual control and excitatory versus inhibitory learning: Studies of extinction, reinstatement and interference. In: Context and Learning, Eds. P. D. BALSAM and A. TOMIE. Lawrence Erlbaum Associates: Hillsdale, NJ.
KASAMATSU, T. (1983) Neuronal plasticity maintained by the central norepinephrine system in the cat visual cortex. In: Progress in Psychobiology and Physiological Psychology, Vol. 10. Eds. J. M. SPRAGUE and A. N. EPSTEIN. Academic Press: New York, NY.
KASAMATSU, T. (1987) Norepinephrine hypothesis for visual cortical plasticity: Thesis, antithesis, and recent development. Curr. Topics dev. Biol. 21, 367-389.
KAYE, H. and PEARCE, J. M. (1984a) The strength of the orienting response during Pavlovian conditioning. J. exp. Psychol.: Animal Behav. Processes 10, 90-109.
KAYE, H. and PEARCE, J. M. (1984b) The strength of the orienting response during blocking. Q. J. exp. Psychol. 36B, 131-144.
KAYE, H. and PEARCE, J. M. (1987) Hippocampal lesions attenuate latent inhibition and the decline of the orienting response in rats. Q. J. exp. Psychol. 39B, 107-125.
KAYE, H., PRESTON, G. C., SZABO, L., DRUIFF, H. and MACKINTOSH, N. J. (1987) Context specificity of conditioning and latent inhibition: Evidence for a dissociation of latent inhibition and associative interference. Q. J. exp. Psychol. 39B, 127-145.
KEELER, J. D. (1988) Capacity for patterns and sequences in Kanerva's SDM as compared to other associative memory models. In: Neural Information Processing Systems, Ed. D. Z. ANDERSON. American Institute of Physics: New York, NY.
KEHOE, E. J. (1988) A layered network model of associative learning: Learning to learn and configuration. Psychol. Rev. 95, 411-433.
KEHOE, E. J. and GORMEZANO, I. (1980) Configuration and combination laws in conditioning with compound stimuli. Psychol. Bull. 87, 351-378.
KEHOE, E. J. and GRAHAM, P. (1988) Summation and configuration: Stimulus compounding and negative patterning in the rabbit. J. exp. Psychol. Animal Behav. Processes 14, 320-333.
KEHOE, E. J. and SCHREURS, B. G. (1986) Compound-component differentiation as a function of CS-US interval and CS duration in the rabbit's conditioned nictitating membrane response. Animal Learning Behav. 14, 144-154.
KEIL, F. C. (1987) Conceptual development and category structure. In: Concepts and Conceptual Development, Ed. U. NEISSER. Cambridge University Press: New York, NY.
KENT, E. W. (1981) The Brains of Men and Machines. BYTE/McGraw-Hill: Peterborough, NH.
KETY, S. S. (1982) The evolution of concepts of memory. In: The Neural Basis of Behavior, Ed. A. L. BECKMAN. SP Medical and Scientific Books: New York, NY.
KINSBOURNE, M. (1988) Brain mechanisms and memory. Hum. Neurobiol. 6, 81-92.
KINSBOURNE, M. and WOOD, F. (1975) Short-term memory processes and the amnesic syndrome. In: Short-Term Memory, Eds. D. DEUTSCH and J. A. DEUTSCH. Academic Press: New York, NY.
KINSBOURNE, M. and WOOD, F. (1982) Theoretical considerations regarding the episodic-semantic memory distinction. In: Human Memory and Amnesia, Ed. L. S. CERMAK. Lawrence Erlbaum Associates: Hillsdale, NJ.
KLEIN, S. B. (1987) Learning. McGraw-Hill: New York, NY.
KNAPP, A. G. and ANDERSON, J. A. (1984) Theory of categorization based on distributed memory storage. J. exp. Psychol.: Learning Memory and Cognition 10, 616-637.
KNUDSEN, E. I., DU LAC, S. and ESTERLY, S. D. (1987) Computational maps in the brain. A. Rev. Neurosci. 10, 41-65.
KOHONEN, T. (1984) Self-organization and Associative Memory. Springer-Verlag: New York, NY.
KOHONEN, T. (1988) An introduction to neural computing. Neural Networks 1, 3-16.
KOHONEN, T., OJA, E. and LEHTIO, P. (1981) Storage and processing of information in distributed associative memory systems. In: Parallel Models of Associative Memory, Eds. G. E. HINTON and J. A. ANDERSON. Lawrence Erlbaum Associates: Hillsdale, NJ.
KUFFLER, S. W., NICHOLLS, J. G. and MARTIN, A. R. (1984) From Neuron to Brain: A Cellular Approach to the Function of the Nervous System. Sinauer Associates: Sunderland, MA.
LABERGE, D. and BROWN, V. (1989) Theory of attentional operations in shape identification. Psychol. Rev. 96, 101-124.
LAMARRE, J. and HOLLAND, P. C. (1985) Acquisition of feature negative discriminations in rats conditioned suppression. Bull. Psychonom. Soc. 23, 71-74.
LAMARRE, J. and HOLLAND, P. C. (1987) Transfer of inhibition after serial feature negative discrimination training. Learning and Motivation 18, 319-342.
LARA, R. and ARBIB, M. A. (1985) A model of the neural mechanisms responsible for pattern recognition and stimulus specific habituation in toads. Biol. Cyber. 51, 223-237.
LARKIN, J. H., MCDERMOTT, J., SIMON, D. P. and SIMON, H. A. (1980) Expert and novice performance in solving physics problems. Science 208, 1335-1342.
LEVY, W. B. (1989) A computational approach to hippocampal function. In: The Psychology of Learning and Motivation, Vol. 23. Eds. R. D. HAWKINS and G. H. BOWER. Academic Press: New York, NY.
LEVY, W. B., COLBERT, C. M. and DESMOND, N. L. (1990) Elemental adaptive processes of neurons and synapses: A statistical/computational perspective. In: Neuroscience and Connectionist Theory, Eds. M. A. GLUCK and D. E. RUMELHART. Lawrence Erlbaum: Hillsdale, NJ.
LEWIS, D. J. (1979) Psychobiology of active and inactive memory. Psychol. Bull. 86, 1054-1083.
LEWIS, D. J., BREGMAN, N. J. and MAHAN, J. J. (1972) Cue-dependent amnesia in rats. J. comp. Physiol. Psychol. 81, 243-247.
LINDSAY, P. H. and NORMAN, D. A. (1977) Human Information Processing. Academic Press: New York, NY.
LITTLE, W. A. and SHAW, G. L. (1975) A statistical theory of short and long term memory. Behav. Biol. 14, 115-133.
LIVINGSTON, R. B. (1967a) Brain circuitry relating to complex behavior. In: The Neurosciences, p. 499. Eds. G. C. QUARTON, T. MELNECHUK and F. O. SCHMITT. Rockefeller University Press: New York, NY.
LIVINGSTON, R. B. (1967b) In: The Neurosciences, pp. 568-576. Eds. G. C. QUARTON, T. MELNECHUK and F. O. SCHMITT. Rockefeller University Press: New York, NY.
LOFTUS, E. F. (1982) Remembering recent experiences. In: Human Memory and Amnesia, Ed. L. S. CERMAK. Lawrence Erlbaum Associates: Hillsdale, NJ.
LOFTUS, E. F. and LOFTUS, G. R. (1980) On the permanence of stored information in the human brain. Am. Psychol. 35, 409-420.
LOLORDO, V. M. (1979a) Selective associations. In: Mechanisms of Learning and Motivation, Eds. A. DICKINSON and R. A. BOAKES. Lawrence Erlbaum Associates: Hillsdale, NJ.
LOLORDO, V. M. (1979b) Classical conditioning: The Pavlovian perspective. In: Animal Learning: Survey and Analysis, Eds. M. E. BITTERMAN, V. M. LOLORDO, J. B. OVERMIER and M. E. RASHOTTE. Plenum Press: New York, NY.
LOLORDO, V. M. (1979c) Classical conditioning: Contingency and contiguity. In: Animal Learning: Survey and Analysis, Eds. M. E. BITTERMAN, V. M. LOLORDO, J. B. OVERMIER and M. E. RASHOTTE. Plenum Press: New York, NY.
LOLORDO, V. M. (1979d) Classical conditioning: Compound CSs and the Rescorla-Wagner model. In: Animal Learning: Survey and Analysis, Eds. M. E. BITTERMAN, V. M. LOLORDO, J. B. OVERMIER and M. E. RASHOTTE. Plenum Press: New York, NY.
LOVIBOND, P. F., PRESTON, G. C. and MACKINTOSH, N. J. (1984) Context specificity of conditioning, extinction, and latent inhibition. J. exp. Psychol.: Animal Behav. Processes 10, 360-375.
LUBOW, R. E. (1973) Latent inhibition. Psychol. Bull. 79, 398-407.
LUBOW, R. E. (1989) Latent Inhibition and Conditioned Attention Theory. Cambridge University Press: New York, NY.
LUBOW, R. E. and MOORE, A. U. (1959) Latent inhibition: the effect of nonreinforced pre-exposure to the conditional stimulus. J. comp. Physiol. Psychol. 52, 416-419.
LUBOW, R. E., RIFKIN, B. and ALEK, M. (1976a) The context effect: The relationship between stimulus preexposure and environmental preexposure determines subsequent learning. J. exp. Psychol.: Animal Behav. Processes 2, 38-47.
LUBOW, R. E., SCHNUR, P. and RIFKIN, B. (1976b) Latent inhibition and conditioned attention theory. J. exp. Psychol.: Animal Behav. Processes 2, 163-177.
LUBOW, R. E., WEINER, I. and SCHNUR, P. (1981) Conditioned attention theory. In: The Psychology of Learning and Motivation, Vol. 15. Ed. G. H. BOWER. Academic Press: New York, NY.

ARTIFICIALNEUgAL N~wogr, s Ltn¢o, R. D. (1978) Development and Plasticity of the Brain. Oxford University Press: New York, NY. LYNOt, G. and BAUDR¥, M. (1984) The biochemistry of memory: A new and specific hypothesis. Science 224, I057-I063. LYtCCH, G. S., ~ t r ~ T. and GmSKOFF, V. (1977) Heterosynaptic depression: A postsynaptic correlate of long-term potentiation. Nature 2(to, 737-739. LYNN, R. (1966) Attention, Arousal and the Orientation Reaction. Pergamon Press: New York, N'Y. MAcGitr~oolL R. L. (1987) Neural and Brain Modelling. Academic Press: New York, NY. M A ~ , N. J. (1973) Stimulus selection: learning to ignore stimuli that predict no change in reinforcement. In: Constraints on Learning, Eds. R. A. H I ~ E and J. STEVENSON-HINDF_Academic Press: New York, N'Y. M ^ ~ , N. J. (1975) A theory of attention: Variations in the associability of stimuli with reinforcements. Psychol. Rev. 82, 276-298. M A ~ N. J. (1978) Cognitive or associative theories of conditioning: Implications of an analysis of blocking. In: Cognitive Process in Animal Behavior, Eds. S. H. Hut.~, H. Fowt.E~ and W. K. Ho~G. Lawrence Erlhaum As~mciates: Hill~lale, NJ. M A ~ , N. J. (1983) Conditio¢ling and Associative Learning. Oxford Univennty Press: New York, N'Y. MAcrdl,rrm~ N. J. (1985a) Contextual specificity or state dependency of human and animal learning. In: Perspectives on Learning and Memory, Eds. L. N I ~ N and T. A.aCH~. Lawrence Edhaum Associates: Hiiisdale, NJ. MAC~NTO~, N. J. (1985b) Varieties of conditioning. In: Memory Systems of the Brain, Eds. N. M. WEINBEI~OEIg, J. L. MCGAUGH and G. LYNCH. The Guilford Press: New York, NY. MACrUTUS, C. F., I~COO, D. C. and ~ , J. M. (1979) Retrograde amnesia for old (reactivated) memories: Some anomalous characteristics. Science 204, 1319-1320. M,~Hn~T,H. (1985) Dissociation of two-behavioral functions in the monkey after early bippocampal ablations. In: Bral~ Plasticity, Learning and Memory, Eds. B. E. WILL, P. ~ and J. C. DAIA~YMPLE-ALFOI~D.Plenum Press: New York, NY. M ~ , S. F. (1989) Learned helplessness: Event covariation and cognitive changes. In: Contemporary Learning Theories, Vol. II. Eds. S. B. Kt.e~ and R. R. Mowv,ta. Lawrence Erlhaum Associates: Hillsdale, NJ. M ~ , S. F. and S ~ O ~ N , M. E. P. (1976) Learned heip-

k:ssncm: Theory and evidence. J. exp. PsychoL Gen. lOS, 3-46. M~,~, D. (1982) V~sion. W. H. Freeman and Company: San Francisco, CA. MATrHI~ H. (1989) Neurobiological aspects of learning and memory. A. Rev. Psychol. 40, 381-404. M^tn~rd,L, J. H. R. and NEWSOMr:, W. T. (1987) Visual processing in monkey extrastriate cortex. A. Rev.

Neurasct. I0, 363-401. MAZUR, J. E. (1986) Learning and Behavior. Prentice-Hall, Inc.: Englewood Cliffs, NJ. McCAut,~, C. and STITr, C. L. (1978) An individual and quantitative measure of stereotypes. J. Per. Soc. Psychol. 36, 929-94O. MCO~J-~qD, J. L. and Ruurd.I~T, D. E. (1981) An interactive activation model of the effect of context in perception. Part I. An account of basic findings. Psychol. Rev. 8~, 375-407. Mc'CLosg~, M. E. and GLucr,,s~o, S. (1978) Natural categories: Well-defined or fuzzy sets? Memory and Cognition 6, 462-472. McCt,oa.~, M. E. and G L U ~ G , S. (1979) Decision processes in verifying category membership statements: Implications for models of semantic memory. Cog. Psychol. 11, 1-37.

425

McCLOSKL~, M., WmLE, C. G. and COHEN, N. J. (1988) Is there a special flashbulb-memory mechanism? J. exp. PsychoL: Gen. 117, 171-181. McErcr~, W. J. and MAre, R. G. (1984) Some behavioral consequences of neurocbemical deficits in Korsnkoff psychosis. In: Neuropsychoiogy of Memory, Eds. L. R. SOtaRE and N. Bu'rre~. Guilford Press: New York, NY. Mc-'GAUOH,J. L. (1983) Hormonal influences on memory. A. Rev. PsychoL 34, 297-323. McGAuOH, J. L. (1989) Involvement of hormonal and neuromodulatory systems in the regulation of memory storage. A. Rev. Neurasci. 12, 255-287. McKOON, G., RXTCLn~, R. and DELL, G. (1986) A critical evaluation of the semantic-episodic distinction.

J. exp. Psychol.: Learning, Memory and Cognition 12, 295-306. MEDtN, D. L. (1983) Structural principles in categorization. In: Perception, Cognition and Development, Eds. T. J. TIOH~ and B. E. SH~PP. Lawrence Erlbaum Associates: Hillsdale, NJ. MEDXN, D. L., ALTOM, M. W. and MURPh'~, T. D. (1984) Given versus induced category representations: Use of prototype and exemplar information in classification. J. exp. Psychol.: Learning, Memory and Cognition 10, 333-352. MEDIN, D. L. and REYNOLDS,T. J. (1985) Cue-context interactions in discrimination, categorization, and memory. In" Context and Learning, Eds. P. D. BAL.~M and A. TOMn~. Lawrence Erlbaum Associates: Hillsdale, NJ. MEDIN, D. L. and ~ , M. M. (1978) Context theory of classification in learning. Psychal. Rev. 85, 207-238. MEDIN, D. L. and SC~WAW~'~F~UOEL,P. J. 0981) Linear separability in classification learnin 8. J. exp. Psychol.: Human Learning and Memory 7, 353-368. MEDIN, D. L. and SMrrH, E. E. (1984) Concepts and concept formation. A. Rev. Psychoi. 35, 113-138. MEDIN, D. L., W A T ' n ~ , W. D. and H~PSoN, S. E. 0987) Family resemblance conceptual cohesiveness, and category construction. Cog. Psychol. 19, 242-279. MERCIER, P. and BAKJ~, A. G. (1985) Latent inhibition, habituation, and sensory preconditioning: A test of priming in short-term memory. J. exp. Psychol.: Animal Behav. Processes 11,485-501. MEtvls, C. B. and ROSCH, E. (1981) Categorization of natural objects. A. Rev. Psychol. 32" 89-115. MILt.ER, R. R. (1982) Behavioral constraints on biochemical and physiological models of memory. In: Changing Concepts of the Nervous System, Eds. A. R. MOmUSoN and P. L. STaGe. Academic Press: New York, NY. M1LL.ER, R. R. and MAJ~J.IN,N. A. (1984) The physiology and semantics of consolidation. In: Memory Con.~olidation, Eds. H. WtaNOAI~T~ and E. S. P~KER. Lawrence Erlbaum Associates: HillKlale, NJ. MILt.E~, R. R. and MATZEL,L. D. (1989) Contingency and relative associative strength. In: Contemporary Learning Theories, Vol. I. Eds. S. B. Kt~N and R. R. Mowlt£t. Lawrence Erlbaum Associates: Hilhtdale, NJ. MILt.ER, R. R. and SCHACHTlV~N,T. R. (1985a) The several roles of context at the time of retrieval. In: Context and Learning, Eds. P. D. BALSAMand A. Tom~. Lawrence Erihaum Associates: Hiihdale, NJ. MILLER, R. R. and SCt~CHTUXN,T. R. (1985b) Conditioning context as an associative baseline: Implications for response generalization and nature of conditioned inhibition. In: Information Processing in Animals: Conditioned Inhibition, Eds. R. R. MIt.t.n and N. E. S ~ . Lawrence Eribanm Associates: Hillsdale, NJ. M I ~ ' ~ , M. and P A n ' r , S. (1972) Perceptrons. MIT Press: Cambridge, MA. MtSAt,r~, J. R., MltZ.E~, R. R. and LEwis, D. J. (1968) Retrograde amnesia produced by electroconvulsive shock after reactivation of a consolidated memory trace. Science 160, 554-555.

426

S. HAMPSON

MISHledN, M., MALAWdT,B. and BACHEV^LIER,J. (1984) Memories and habits: Two neural systems. In: Neuro. biology of Learning and Memory, G. LYNCH, J. L. McG^uGH and N. M. WEtNaF.g6Eg. Guilford Press: New York, NY. MISHKIN,M. and l~rPa, H. L. (1984) Memories and habits: Some implications for the analysis of learning and retention. In: Neuropsychology of Memory, Eds. L. R. SQuI~ and N. B ~ . Guilford Press: New York, NY. Mxv^r,z, S. and F u t o s m ~ , K. (1984) A neural network model for the mechanism of feature-extraction. Biol. Cyber. 50, 377-384. MIYAr,X, S. and FUKUSmM^,K. (1986) A neural network model for the mechanism of pattern information processing. In: Neural Networks for Computing, Ed. J. S. DENKER. American Institute of Physics: New York, NY. Mlv^r~, S. and FtJKOSmMA,K. (1989) Self-organizing neural networks with the mechanism of f¢odback information processing. In: Dynamic Interactions in Neural Networks: Models and Data, Eds. M. A. Aama and S. AM^RL Springer, New York, NY. MoogE, J. W. and SOLOMON,P. R. (1984) Forebrain-brain stem interactions: Conditioning and the hippocampus. In: Neuropsychology of Memory, Eds. L. R. SQUIRE and N. BUTTERS.Guilford Press: New York, NY. MOO~, J. W. and STICgr,'EV,K. J. (1980) Formation of attentional-associative networks in real time: Role of the hippocampus and implications for conditioning. Physwl. Psychol. 8, 207-217. MOOgE, J. W. and SrICKNEY,K. J. (1982) Goal tracking in attentional-associative networks: Spatial learning and the hippocampus. Physiol. Psychol. 10, 202-208. MooR~, J. W. and SrlCKNEV,K. J. (1985) Antiassociations: Conditioned inhibition in attentional-asso¢iative networks. In: Information Processing in Animals: Conditioned Inhibition, Eds. R. R. MXLI.Egand N. E. SPE^g. Lawrence Erlbaum Associates: Hillsdale, NJ. MoggLs, R. and B^Kr:g, M. (1984) Does long-term potentiation/synaptic enhancement have anything to do with learning or memory. In: Neuropsychology of Memory, Eds. L. R. S Q u ~ and N. Btrrr~s. Guilford Press: New York, NY. MOSCOV~TCH,M. (1985) Memory from infancy to old age: Implications for theories of normal and pathological memory. Ann. N Y Acad. Sci. 644~ 78-96. MOtn,rrcASrLE, V. B. (1978) An organizing principle for cerebral function: The unit module and distributed system. In: The Mindful Brain, Eds. G. M. EDl~l.IO,r~ and V. B. MOU~rTCASTL~.MIT Press: Cambridge, MA. Mou~rc,~Tt~, V. B. (1979) An organizing principle for cerebral function: The unit module and distributed system. In: The Neurosciences, Eds. F. O. SCm~T'r and F. G. WO~.OEN. MIT Press: Cambridge, MA. NovsnoN, J. A. and VAN SLUarrtas, R. C. (1981) Visual neural development. A. Rev. Psychol. 32, 477-522. MUROG^, S. (1971) Threshold Logic and its Applications. John Wiley: New York, NY. Mtra~rly, G. L. and M~D~r4,D. L. (1985) The role of theories in conceptual coherence. Psychol. Rev. 92, 289-316. NADr.t., L. and W~LL~, J. (1980) Context and conditioning. A place for space. Physiol. PsychoL ~ 218-228. N^D~.~., L., W~LLNI~,J. and KUgTZ, E. M. (1985) Cognitive maps and environmental context, in: Context and Learning, Eds. P. D. BALaAUand A. ToM~. Lawrence Erlbaum Associates: Hiilsdale, NJ. N ~ t L , L. and ZOL^-MoarAr4, S. (1984) Infantile amn~ia: A neurobiological perspective. In: Advances in the Study of Communication and Effect, Vol. 9. Ed. M. MOSCOVlTCrl. Plenum Press: New York, NY. NF~sa., U. (1967) Cognitive Psychology. AppletonCentury-Crofts: New York, NY. N~L~.a, U. and W I ~ , P. (1962) Hierarchies in concept attainment. J. exp. Psycho/. 
64, 640-645.

NEvrr.s, D. M. and A N D ~ N , J. R. (1981) Knowledge compilation: Mechanisms for the automatization of cognitive skills. In: Cognitive Skills and their Acquisition, Ed. J. R. AIqDF.gSON.Lawrence Erlbaum Associates: Hillsdale, NJ. NlLSSOrq, N. J. (1965) Learning Machines. McGraw-Hill, New York, NY. NosoFsKY, R. M. 0984) Choice, similarity, and the context theory of classification. J. exp. PsychoL: Learning, Memory and Cognition 10, 104-114. NOSO~KY, R. M. (1986) Attention, similarity, and the identification-categorization relationship. J. exp. Psychot.: Gen. 115, 39-57. NOSOFsKY, R. M. (1988a) On exemplar-based exemplar representations: Reply to Ennis (1988) J. exp. Psychol. Gen. 117, 412-414. NOSOFSKY, R. M. (1988b) Similarity, frequency, and category representations. J. exp. Psycho/.: Learning, Memory. and Cognition 14, 54-65. NOSO~KV, R. M. (1988c) Exemplar-based accounts of relations between classification, recognition, and typicality. J. exp. Psychol.: Learning, Memory and Cognition 14, 700 -708. O'K~FE, J. and NAora., U (1878) The Hippocampus as a Cognitive Map. Clarendon Press: Oxford, England. OLDS, J. (1975) Unit r~:ording during Paviovian conditioning. In: Brain Mechanisms in Mental Retardation, Eds. N. A. BUCrnV^LDand M. A. B. ID.Az~. Academic Press: New York, NY. OLTON, D. S. (1978) Characteristics of spatial memory. In: Cognitive Processes in Animal Behavior, Eds. S. H. Hut.~, H. FOWLERand W. K. HON,G. Lawrence Erlbaum Associates: Hillsdale, NJ. OLrON, D S. (1979) Mazes, maps and memory. Am. Psychol. 34, 583-596. OLTorq, D. S. (1983) Memory functions and the hippocampus. In: Neurobiology of the blippocampus, Ed. W. SEtFEgT. Academic Press: New York, NY. OLrOr;. D. S. (1986) Hippocampal function and memory for temporal context. In: The Hippocampm, Vol. 4. Eds. R. L. Is^^csorq and K. H. PRImtAM. Plenum Press: New York, NY. OLTON, D. S., BEcgr~, J. T. and HAND£t.MANN,G. E. (1979) Hippocampus, space and memory. Behav. Brain Sci. 2, 313-365. OLrON, D. S., SnAI,L~O, M. L. and HUL~, S. H. (1984.) Working memory and serial patterns. In: Animal Cognition, Eds. H. L. ROrrIILAT,T. G. ~ ~ H. S. TnJ~CE. Lawrence Erlbaum Amaciatm: Hilbdale, NJ. OPpEmlr.xM, R. W. (1981) Neuronal cell death and some related phenomena during ncuroilenc~s: A selective historical review and progrcm report. In: Studie~ in Developmental Neurobtoiogy, Ed. W. M. COW^N. Oxford University Press: New York, NY. PANgStPr, J. (1986) The nenrochcmistry of behavior. A. Rev. Psychol. 37, 77-107. P^Rr,Eg, D. B. (1986) A comparison of algolfthml of neuron-like cells. In: Neural Networlu for Computing, Ed. J. S. Dta,,'Kr.g. American Institute of Physics: New York, NY. PEARCE,J. M. (1987) An lntrodaction to Animal Cognition. Lawrence Erlbaum Associates: Hillglale, NJ. Pr.ARCF.,J. M. and HAt.L, G. (1979) Loss of uso¢iability by a compound stimulus compming excitatory and inhibitory elements. J. exp. Psychol.: Animal Behav. Processes 5, 19-30. Pr.ARCt, J. M. and HALL, G. (1980) A model for Paviovian learning: Variations in the effeL-tivencu of conditioned but not of unconditioned stimuli. Psyehol. Rev. 87, 532-552. Pr.Agcr., J. M., K~vE, H. and HAIL, G. (1982a) Predictive accuracy and stimulus auociabfl/ty: Development of a model of Pavlovian learning. In: Quantitative Analyses of

.4,RTIFICIAL NEURALNETWORKS

Be/me/or. III. Eds. M. L. C_.om~, R. J. HEXRNSTEINand A. R. WAc31~nt. Ballinger: Cambridge, MA. Pv.~tc~, J. M., Nlcuot.~, D. J. and D I ~ N , A. (1982b) Loss of asmciability by a conditioned inhibitor. Q. J. exp. Psycho/. 33B, 149-162. ! ~ .o~.l~v, J. D. (1985) Some constraints operating on the synaptic modifications underlying binocular competition in the developing visual cortex. In: Synaptic Modification, Neuron Selectivity, and Neroous System Organization, Eds. W. B. ~ , J. A. Am~t,~ON and S. ~ U H L ~ . Lawrence Erlbaum A~ociates: H i l l , s / e , NJ. l~rlnt, R. B. (1985) Adaptation of spatial modulation transfer functions via nonlinear lateral inhibition. Biol. Cyber. $1, 285-291. PooG~o, T. and KOCH, C. (1987) Synapses that compute motion. Sci. Am. ~ 46-52. Poo, M. (1985) Mobifity and localization of proteins in excitable membranes. A. 1~. Neurosci. 8, 369-406. Poltrmt, R. (1981) Internal organization of the motor cortex for input-output arrangements. In: Handbook of Physiology, See. 1, The Nervous System, Vol. 2, Motor Control, Ed. V. B. BtooI~. American Physiol. Society: Bethesda, MD. M. I. (1969) Atntraction and the process of recognition. In: Psychology of Learning and Motivation. 3. Ed~. G. H. Bowlw, and J. Sv1~¢~ Academic press: New York, NY. Po~b'~, M. I. and I¢,J~, S. W. (1968) On the genesis of abstract ideas. J. exp. Psychal. 77, 353-363. M. I. and Klntt~ S. W. (1970) Retention of abstract ideas. J. exp. Psychol. 83, 304-308. PS~U.Tet, D. and P~tK, C. (1986) Nonlinear discriminant functions and associative memories. In: Neural Networks for Computing, Ed. J. S. ~ . American Institute of Physics: N e w York, N'Y. P~LI-~, D., P.~d~K,C. H. and HONG, J. (1988) Higher order associative memories and their optical implementation. Neural Networks 1, 149-163. ~ALrm, D. and V~nCAr~S~, S. S. (1988) Information storage in fully connected networks. In: Evolution, Learning and Cognition, Ed. Y. C. L~. World Scientific: Teaneck, NJ. R~7.t~N, G. (1965) Empirical codifications and specific theoretical implications of compound-stimulus conditioning: perception. In: C/assival Conditioning, Ed. W. F. PtolLtaY. Appleton-Century-Crofts: New York, NY. REED, S. K. (1972) Pattern recognition and categorization. Cog. PsychoL 3, 382-407. REED, S. K. (1973) Psychological Processes in Pattern Recognition. Academic Press: New York, NY. Rm~, S. K. (1978) Category vs item learning: implications for categorization models. Memory and Cognition 6, 612-621. S. and W^~3hn~., A. R. (1972) CS habituation produces a "latent inhibition" effect but no active "conditioned inhibition". Learning and Motivation 3, 237-245. ~ , R. A. (1966) Predictability and number of pairing in Pavlovian fear conditioning. Psychon. Sci. 4, 382-384. ~ , R. A. (1967) Pavlovian conditioning and its proper control procedures. Psychol. Rev. 74, 71-80. ~ , R. A. (1968) Probability of shock in the presence and absence of CS in fear conditioning. J. comp. Physiol. Psycho/. 66, 1-5. ~ , R. A. (1969) Conditioned inhibition of fear. In: Fundamental Issues in Assoc/otive Learning, Eds. W. K. Hom~ and N. J. M^C~WrOSH. Dalhonsie University Press: Halifax, Can. RDcogt~, R. A. (1970) Reduction in the effectiveness of reinforcement after prior excitatory conditioning. Learning and Motivation l, 372-381. Rmcom~, R. A. (1971) Summs_tion and retardation tests of intent inhibition. J. comp. Physiol. Psychol. 7S, 77-81.

427

~RLA, R. A. (1972a) Informational variables in pavlovian conditioning. In: Psychology of Learning and Motivation, Vol. 6. Ed. G. H. Bowl~. Academic Press: N e w York, NY. P-,ESCO~.LA, R. A. (1972b) "Configural" conditioning in discrete-trialbar pressing.J. comp. Physiol. Psychol. 79, 307-317. RESC~RL^, R. A. (1973) Evidence for a "unique stimulus" account of conflgural conditioning. J. comp. Physiol. Psychal. 85, 331-338. ~RL^, R. A. (1979) Conditioned inhibitionand extinction. In: Mechanisms of Learning and Motivation, Eds. A. DICKI~tSON and A. Bo^g.Es. Lawrence Erlbaum Associates: Hillsdale,NJ. R.~,CORLA, R. A. (1980a) Simultaneous and s,___~cess_ ive associations in sensory preconditioning.J. exp. Psycho/.: Animal Behav. Processes 6, 207-216. RESeX3RLA, R. A. (1980b) Pavlovian Second-Order Conditioning: Studies in Associative Learning. Lawrence Eflbaum Associates: Hillsdale, NJ. RmCX:)RL^, R. A. (1981a) Simultaneous ass(~ations. In: Predictability, Correlation and Contihm'ty, Eds. P. HARZi~ and M. D. ZEtL~. John Wiley and Sons: New York, NY. Rm(~, R. A. (1981b) Within-signal learning in autoshaping. Animal Learning Behav. 9, 245-252. RESCOP.LX, R. A. (1982a) Simultaneous second-order conditioning produces S-S learning in conditioned suppressions. J. exp. Psycho/.: Animal Behov. Processes 8, 23-32. P,ESCOe.L^, R. A. (1982b) Effect of a stimulus intervening between CS and US in autoshaping. J. exp. Psychol.: Animal Behav. Processes 8, 131-141. RZSCOP,LA, R. A. (1982c) Some consequences of associations between the excitor and the inhibitor in a conditioned inhibition paradigm. J. exp. Psychol.: Animal Behav. Processes 8, 288-298. Par,CORLA, R. A. (1982d) Comments on a technique for assessing associative learning. In: Quantitative Analyses ofBehav/or. III. Eds. A. R. W A G ~ , R. HEitRNSl"~Nand M. COMMONS. Ballinger: Cambridge, MA. R~,CORLA, R. A. (1983) Effect of separate presentation of the elements on within-compound learning in autoshaping. Animal Learning Behov. 11, 439-446. RESCORLA, R. A. 0984) Comments on three pavlovian paradigms. In: Primary Neural Substrates on Learning and Behavioral Change, F_As. D. L. ALKON and J. FARLI~'. Cambridge University Press: New York, NY. RESCOgL^, R. A. (1985) Conditioned inhibition and facilitation. In: Information Proce~ing in Animals: Condit/oncd Inhibition, Eds. R. R. MILLERand N. E. Sl,~ag. Lawrence Erlbaum Associates: Hiilsdale, NJ. R~CORLA R. A. (1986a) Extinction of facilitation. J. exp. Psychol.: Animal Behav. Processes 12, 16-24. ~ L ^ R. A. (1986b) Facilitation and excitation. J. exp. Psychol.: Animal Behav. Processes 12, 325-332. RZSCORL^, R. A. (1987) Facilitation and inhibition. J. exp. Psychal.: Animal Behav. Processes 13, 250-259. ~LA, R. A. (1988) Facilitation based on inhibition. Animal Learning Behav. 16, 169-176. P,XSCORL^, R. A. and COLWlLL, R. M. (1983) Withincompound associations in unbiocking. J. exp. Psychol.: Animal Behav. Processes 9, 390-400. RESCORL^, R. A. and CUN~INGH^M,C. L. (1978) Withincompound flavor associations. J. exp. Psychol.: Animal Behav. Processes 4, 267-275. ~ , R. A. and DUgLACH,P. J. (1981) Within-event learning in pavlovian conditining. In: Information Processing in Animals: Memory Mechanisms, Eds. N. E. Spv.Ag and R. R. MILLr~g. Lawrence Erlbaum Associates: H i ~ l e , NJ. ~ I ~ , L A , R. A., D e , LAtH, P. J. and Gro~u, J. W. (1985a) Contextual learning in Pavlovian conditioning. In: Context and Learning, F_As. P. D. B ~ and A. TOME. Lawrence Erlbaum Associates: HillK~le, NJ.

428

S. HAMPSON

RZSCO~U.A,R. A., G~,u, J. W. and DUgLACH,P. J. (1985b) Analysis of the unique cue in configural discriminations. J. exp. Psychol.: Animal Behav. Processes 11, 356-366. RZSCORLA, R. A. and HOLLAND, P. C. (1976) Some behavioral approaches to the study of learning. In: Neural Mechanisms of Learning and Memory, Eds. M. R. ROS~NZW~G and E. L. BEN1,~T. MIT Press: Cambridge, MA. RZSCORLA,R. A. and HOLLAND, P. C. (1982) Behavioral studies of associative learning in animals. A. Rev. Psychol. 33, 265-308. R£scogts~, R. A. and WAGNER,A. R. (1972) A theory of Pavlovian conditioning. In: Classical Conditioning It, Eds. A. H. BLACK and W. F. I~OK~SV. Appleton-CenturyCrofts: New York, NY. Rt¢c~o, D. C. and E~.R, D. L. (1981) Postacquisition modification of memory. In: Information Processing in Animals: Memory Mechan~m, Eds. N. E. SPF..~ and R. R. M~LL~. Lawrence Erlbaum Associates: Hillsdale, NJ. RtCH~DSON-KL~V~N, A. and B~O~K, R. A. (1988) Measures of memory. A. Rev. Psychol. 39, 475-543. Ro~mNs, D., B ~ s ~ , J., CO~PTON,P., FuRs'r, A., Rtrsso, M. and SMrr~, M. A. (1978) The genesis and use of exemplar vs prototype knowledge in abstract category learning. Memory and Cognition 6, 473-480. ROCr~L, A. J., HtO~NS, R. W. and POWELL, T. P. S. (1980) The basic uniformity in structure of the neocortex. Brain 103, 221-244. ROOAWSKI, M. A. and AGH~AN~AN, G. K. (1980) Modulation of lateral geniculate neuron excitability by noradrenalin¢ microiontophoresis or locus ceruleus stimulation. Nature 287, 731-734. ROLLS, E. T. 0990) Functions of the primate hippocampus in spatial processing and memory. In: Neurobiology of Comparative Cognition, Eds. R. P. KESNER and D. S. OLTON. Lawrence Erlbaum Associates: Hillsdale, NJ. Rc:~cH, E. (1973) Natural Categories. Cog. Psycho/. 4, 328-350. RoacH, E. (! 978) Principles of categorization. In: Cognition and Categorization, Eds. E. ROACH and B. B. LLOYD. Lawrence Erlbaum Associates: Hillsdale, NJ. ROACH,E. and M~v~s, C. B. (1975) Family resonblances: Studies in the internal structure of categories. Cog. Psychol. 7, 573-605. Rf~ENm~', F. (1962) Principles of neurodynamics: Perceptrona and the Theory of Brain Mechanisms. Spartan Books: Wash/ngton, DC. Ross, R. T. (1983) Relationships between the determinants of performance in serial feature positive discriminations. J. exp. Psychol.: Animal Behav. Processes 9, 349-373. Ross, R. T. and HOLLAND,P. C. (1951) Conditioning of simultaneous and serial feature-positive discriminations. Animal Learning Behav. 9, 293-303. ROSS, R. T. and LOt,O~DO, V. M. (1986) Blocking during serial feature-positive discriminations: associative versus occasion-setting functions. J. exp. Psychal.: Animal Behav. Processes 12, 315-324. Ross, R. T. and LoLOm~, V. M. (1987) Evaluation of the relation between Pavlovian _o,:c~___~ion-settingand instrumental discriminative stimuli: A blocking analysis. J. exp. Psychol.: Animal Behav. Processes 13, 3-16. ROZS~'PAL,A. J. 0985) Computer simulation of an ideal lateral inhibition function. Biol. Cyber. 52, 15-22. RUDY, J. W. and WAomm, A. R. (1975) Stimulus selection in associative learning. In: Handbook of Learning and Cognitive Processes, Vol. 2. Ed. W. K. E,~rl~. Lawrence Erlbaum Associates: Hillsdale, NJ. RUU~.H~T, D. E., H~N'rON, G. E. and W~LtS~XS, R. J. (1986) Learning internal representations by error propagation. In: Parallel Distributed Processing, Eds. D. E. Rtn~n~JO~T and J. L. MCCL~LL~ND. MIT Press: Cambridge, MA.

RUMELHART, D. E. and MCCLELLAND,J. L. (1982) An interactive activation model on the effect of context in perception. Part II. The contextual enhancement effect and some tests and extensions of the model. Psychol. Rev. 89, 60-94. RUMELH/~T, D. E. and ZIi,s~R. D. (1985) Feature discovery by competitive learning. Cog. Sci. 9, 75-112. RYL£, G. (1949) The Concept of Mind. Hutchinson: San Francisco, CA. SAHL~', C. L. (1984) Behavior theory and invertebrate learning. In: The Biology of Learning, Eds. P. MXU,I.F_~and H. S. Tr.m~CE. Springer-Verlag: New York, NY. SAHL~, C. L., RUDY, J. W. and GF.LPERIN,A. (1981) An analysis of associative learning in a terrestrial mollusc. J. comp. Physiol. 144, I-8. SAHLEY,C. L., RUDY,J. W. and GrO.PEmN,A. (1984) Associative learning in a mollusc: A comparative analysis. In: Primary Neural Substrates on Learning and Behavioral Change, Eds. D. L. ALKON and J. FARLEY. Cambridge University Press: New York, NY. SALAFIA, W. R. (1987) Pavlovian conditioning, information processing, and the hippocampus. In: Classical Conditwning, 3rd edn, Eds. I. CORMIET~),W. F. l~or,.~Y and R. F. THOMPSON. Lawrence Erlhaum Associates: Hillsdale, NJ. SANGER,T. D. (1989) Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks 2, 459-473. SXTTATH,S. and Tvr.~Ky, A. (1987) On the relation between common and distinctive feature models. Psychol. Rev. 94, 16-32. SCHACTEX, D. L. (1984) Toward the multidisciplinary study of memory: Ontogcny, phylogeny and pathology of memory systems. In: Neuropaychology of Memory, Eds. L. R. SQuI~ and N. Bt~ Jr.~. Guilford Press" New York, NY. Sc~crrEx, D. L. (1985) Multiple forms of memory in humans and animals. In: Memory Systems of the Brain, Eds. N. M. WEINnEI~OF.~,J. L. MCGAUOHand G. LYNCH. The Guilford Press: New York, N'Y. SCHAC'rER, D. L. (1987) Implicit memory: History and current status. J. exp. Psychol.: Learning Memory and Cognition 13, 501-518. SCHAC'rER, D. L. and MoscovrrcH, M. 0984) Infants, amnesics, and dissociable memory systems. In: Infant Memory, Ed. M. Mo~'ovrrcH. Plenum Press: New York, NY. SCHACTr~,D. L. and TULWNG,E. (1983) Memory, amnesia, and the episodic/semantic distinction. In: Expression of Knowledge, Eds. R. L. ~ c s o N and N. E. SPEAR.Plenum Press: New York, NY. SCHMAJUK,N. A. (1989) The hippocampus and the control of information storage in the brain. In: Dynamic Interactions in Neural Networks: Models and Data, Eds. M. A. Apasm and S. AMAm. Springer: New York, NY. Sc~uvouK, N. A. and M o o ~ , ]. W. (1985) Real-time attentional models for classical conditioning and the hippocampus. Physiol. Psychol. 13, 278-290. SCHMIDT,S. R. and B ~ I , VSONIII, J. N. (1988) In defense of the flashbulb-memory hypothesis: A comment on McCluskey, Wible and Cohen (1988). J. exp. Psychol. Gen. 117, 332-335. SEH~retD~, A. M. and PLOuoH, M. (1983) Electroconvulsive shock and memory. In: The Physiological Basis of Memory, Ed. J. A. DelJ'rscH. Academic Press: New York, NY. SCHOL~I~LD,W. N., COLE,B. K., LANO,J. and M.4~KOI~, R. (1973) "Contingency" in behavior theory. In: Contemporary Approaches to Conditiocling and Learn~, Eds. F. J. MCGUIOAN and D. B. Lbaum~xtN. John Wiley and Sons: New York, NY. SEHW.~.TZ, B. (1984) Psychology of Learning and Behavior. W. W. Norton and Company: New York, NY.

ARTIFICIALNEURALNETWOggS

$Ch'WARTZKgOIN,P. A. and T^us.~ J. S. (1986) Mechanisms underlying long-term potentiation. In: Neural Mechanof Conditioning, F_As.D. L. ALKONand C. D. WOODY. Plenum Press: New York, N'Y. SCOTT,G. K. and PLATr, J. R. (1985) Model of responsereinforcer contingency. J. exp. Psychol.: Animal Behav. Processes 11, 152-171. Slmi~rrv~, G. S. (1962) Decision-making Processes in Pattern Recognition. The Macmillan Company: New York, NY. SnDEL, R. J. (1959) A review of sensory preconditioning. Psychal. Bull. 56, 58-73. SDNOWSrd, T. J. (1981) Skeleton filters in the brain. In: Parallel Models of Associative Memory, Eds. G. E. HleCroN and J. A. Am~asoN. Lawrence Erlbaum Associates: Hillsdale, NJ. S~NOWSrd, T. J. (1986) Open questions about computation in cerebral cortex. In: Parallel Distributed Processing, Vol. 2. Eds. J. L. McCL~LXND and D. E. RUMELHART. MIT Press: Cambridge, MA. S~UOMXN,M. E. P. (1975) Helplessness. W. H. Freeman and Company: San Francisco, CA. SEav^s-~, D., Pm'rz, H. and COn~'N,J. D. (1990a) The effect of catecholamines on performance: From unit to system behavior. In: Advances in Neural Information Processing Systems 2, Ed. TotmFrzgY, D. S.. Morgen Kaufman: San Maum, CA. S~av^s-Sow~a~a, D., l~rrz, H. and COheN, J. D. (1990b) A network model ofcatecholamine effect" Gain, signal-tonoise ratio, and behavior. Science 249, 892-895. SHANKS,D. R. (1985) Forward and backward blocking in human contingency judgement. Q. J. exp. Psychol. 37B, 1-21. SHANKS,D. R. (1986) Selective attribution and the judgement of causality. Learning and Motivation 17, 311-334. SRANr,S, D. R. (1987) Acquisition functions in contingency judgement. Learning and Motivation 18, 147-166. SUANr,S, D. R. and DICKINSON,A. (1987) Associative accounts of causality judgement. In: The Psychology of Learning and Motivation, Vol. 21, Ed. G. H. BoweR. Academic Preas: New York, NY. S ~ p ~ , o , R. N. (1987) Toward a universal law of generalization for psychological science. Science 237, 1317-1323. Sn~X~D, R. N. (1988) Time and distance in generalization and discrimination: Reply to Ennis (1988). J. exp. Psychol.: Gen. 117, 415-416. SI~I~I1~, G. M. (1979) The Synaptic Organization of the Brain. Oxford University Press: New York, NY. SHI~M.I~D, G. M. (1988) Neurobiology. Oxford University Press: New York, NY. SHOaTLn~, E. H. and BUO~N^N, B. G. (1975) A model of inexact reasoning in medicine. Math. Biosci. 23, 351-379. S~OEL, S. (1972) Latent inhibition and eyelid conditioning. In: Classical Conditioning H: Current Theory and Research, Eds. A. H. BLACKand W. F. PRolo~s'f. Appleton-CenturyCrofts: New York, N'Y. S~C~INS, G. R. and GaUOL, D. L. (1986) Mechanisms of transmitter action in the vertebrate central nervous system. In: Handbook of Physiology: IV. Intrinsic regulatory systems of the brain, Ed. S. R. GE~OEg. American Physiological Society: Bethesda, MA. SILV~,t~N, R, H. and N o ~ 7 . ~ , A. S. (1988) Timesequential self-orsanization of hierarchical neural networks. In: Neural Information Processing Systems, Ed. D. Z. Am>~.SoN. American Institute of Physics: New York, NY. SINO~t, W. (1984) Learning to see: Mechanisms in experience-dependent development. In: The Biology of Learning, Eds. P. Ms.~.Ea and H. S. TegaAc~. Springer-Verlag: New York, NY. Sl~31at, W. (1985) Hebbian modification of synaptic transmission as a common mechanism in experience-dependent maturation of cortical functions. In: Synaptic Modifi-

429

cation, Neuron Selectivity, and Nervous System Organization, Eds. W. B. LLn~, J. A. ANDEIL.qON and S. ~b'W~. Lawrence Eribaum Associates: Hillsdale, NJ. SLOvtc, P., LIorr~sI~aN, S. and Ftscm4o~, B. (1988) Decision making. In: Steven's Handbook of Experimental Psychology, Eds. R. C. ATglNSON, R. H~a~sr~ar;, G. LINDZL~ and R. D. Luc1~. John Wiley and Sons: New York, N'Y. SI~I-H, E. E. and MEDIN, D. L. (1981) Categories and Concepts. Harvard University Press: Cambridge, MA. S!~a'rH, E. E., S H O ~ , E. J. and Pd~S, L. J. (1974) Structure and process in semantic memory: A feature model for semantic decisions. Psychol. Rev. gl, 214-241. SOKoLOv, E. N. (1960) Neuronal models and the orienting reflex. In: The Central Nervous System and Behavior, Ed. M. A. B. BAzma. Josiah Macy Jr. Foundation: New York, NY. SoKoLOv, E. N. (1963) The orienting reflex. A. Rev. Psychol. 25, 545-580. SOKoLOv, E. N., PAKUt.A,A. and Al~rd~.ov, G. G. (1970) The aftereffects due to an intracellular electric stimulation of the giant neuron "A" in the left parietal ganglion of the mollusk Limnaea stagnalis. In: Biology of Memory, F-As. K. H. P~aR^M and D. E. BRO^DBEI,rr. Academic Press: New York, NY. SOLOMON,P. R. (1977) Role of the hippocampus in blocking and conditioned inhibition of the rabbit's nictitating membrane response. J. comp. Physiol. Psychol. 91, 407-417. SOt.OMON,P. R. (1979) Temporal versus spatial information processing theories of hippocampal function. Psychol. Bull. 86, 1272-1279. SOLOMON,P. R. (1987) Neural and behavioral mechanism involved in learning to ignore irrelevant stimuli. In: Classical Conditioning, 3rd edn, Eds. I. GOgMEZANO, W. F. Paor, xsY and R. F. TXOMPSON. Lawrence Erlbaum Associates: Hillsdale, NJ. SOLOMON,P. R. and MOORE,J. W. (1975) Latent inhibition and stimulus generalization of the classically conditioned nictitating membrane response in rabbits following dorsal hippocampal ablation. J. comp. Physiol. Psychol. 89, 1192-1203. S~.AR, N. E. (1981) Extending the domain of memory retrieval. In: Information Processing in Animals: Memory Mechanisms, Eds. N. E. S ~ and R. R. MILLER. Lawrence Erlbaum Associates: Hillsdale, NJ. SPEAR, N. E. and MUELI.£g,C. W. 0984) Consolidation as a function of retrieval. In: Memory Consolidation, Eds. H. WEINGARTNER and E. S. PARKER. Lawrence Erlbaum Associates: Hillsdale, NJ. St~CHT, D. F. (1990) Probabilistic neural networks. Neural Networks 3, I09-I 18. SPEEaS, M. J., GILLXN, D. J. and RESCORL^,R. A. (1980) Within-compound associations in a variety of compound conditioning procedures. Learning and Motivation II, 135-149. Spt~cr~, W. A., THOMPSON,R. F. and N~LSON, D. R. JR (1966) Decrement of ventral root electronic and intracellularly recorded PSPs produced by iterated cutaneous afferent volleys. J. Neurophysiol. 29, 253-274. Spn,~.ta, D. N., HIgscu, H. V. B., Pm~L~, R. W. and ME1"ZLEg,J. (1972) Visual experience as a determinant of the response characteristics of cortical receptive fields in cats. Expl. Brain Res. 15, 289-304. SPlNELLI, D. N. and JEqs~-~, F. E. (1979) Plasticity: The mirror of experience. Science 203, 75-78. Sm~r.LU, D. N. and Jena's, F. E. (1982) Plasticity, experience and resource allocation in motor cortex and hypothalamus. In: Advances in Behavioral Biology, Vol. 26. Ed. C. D. WooDY. Plenum Press- New York, NY. SQUmlE, L. R. (1982) The neuropsycbology of human memory. A. Rev. Neurosci. 5, 241-273.

430

S. HAMP$ON

SQUIre, U R. (1983) The hippocampus and the neuropsychology of memory. In: Neurobiology of the Hippocampus, Ed. W. SV.IFr.RT. Academic Press: New York, NY. SQUIRE,L. R. (1987) Memory and Brain. Oxford University Press: New York, NY. Soul~, L. R. and CoI-~q, N. J. (1984) Human memory and amnesia. In: Neurobiology of Learning and Memory, Eds. G. LYNCH, J. U McGAuGH and N M. WEINBERGER. Guilford Press: New York, NY. SQUII~, L. R., CoHr~, N. J. and NADEL, L. (1984) The medial temporal region and memory consolidation: A new hypothesis. In: Memory Consolidation, Eds. H. WEINGARTNER and E. PARKER. Lawrence Erlbaum Associates: Hillsdale, NJ. STADDON,J. E. R. (1983) Adaptive Behavior and Learning, Cambridge University Press: New York, NY. S'r^NDIN¢3, L. (1973) Learning 10,000 pictures. Q. J. exp. Psycho/. 25, 207-222. STXNDING, L., COSEZIO, J. and H^B~, R. N. (1970) Perception and memory for pictures: Single-trial learning of 2560 visual stimuli. Psychonom. Sci. lg, 89-90. ST^UBU, U., B^UDRY,M. and LY~qCH,G. (1985) Olfactory discrimination learning is blocked by leupeptin, a thiol protease inhibitor. Brain Res. 337, 333-336. STAUaLI,U., F~DAY, R. and LYNCH,G. (1985) Pharmacological dissociation of memory: Anisomycin, a protein synthesis inhibitor, and leupeptin, a protease inhibitor, block different learning tasks. Behav. Neural Biol. 43, 287-297. S1"~s, U and BELLUZZl,J. D. (1988) Operant conditioning of individual neurons. In: Quantitative Analyses of Behavior. VII, Eds. M. L. COMMONS,R. M. CHURCH, J. R. STF.LL^g and A. R. WXGr,~R. Lawrence Erlbaum Associates: Hillsdale, NJ. S'rE~rr, G. S. (1973) A psychological mechanism of Hebb's postulate of learning. Proc. hath. ^cad. Sci. U.S.A. 70, 997-1001. S'ro~, J., Daze.a, B. and LEV~'tlXL, A. (1979) Hierarchical and parallel mechanisms in the organization of visual cortex. Brain Res. Rev. I, 345-394. S'rC,~NGE, W., ~ , T., ~ L , F. and JENKINS, J. (1970) Abstraction over time of prototypes from distortions of random dot patterns. J. exp. Psychol. 83, 508-510. SUG^, N. (1977) Amplitude spectrum representation in the doppler-shifter-CF processing area of the auditory cortex of the mustached bat. Science 196, 64-67. SUG^, N. (1990) Cortical computational maps for auditory imaging. Neural Networks 3, 3-21. SUG^, N. and M^N^~, T. (1982) Neural basis of amplitudespectrum representation in auditory cortex of the mustached bat. J. Neurophysiol. 47, 225-255. SUTH~L^~qD, R. J. and RffDY, J. W. (1989) Configural association theory: The role of the hippocampal formation in learning, memory and amnesia. Psychobiology 17, 129-144. Strr'ro~, R. S. and B~,~TO,A. G. (1981) Toward a modern theory of adaptive networks: Expectation and prediction. Psychol. Rev. gg, 135-170. T^~, D. C. and P~V.L, D. H. (1989) Quantitative modeling of synaptic plasticity. In: The Psychology of Learning and Motivation, Vol. 23. Eds. R. D. H^wglh'S and G. H. BOWER. Academic Pre~: New York, NY. T.~,pY, R. M. (1982) Principles of Animal Learning and Motivation. Foresman and Company: Glenview, IL. TEAs, D. C. (1989) Auditory physiology: Present trends. A. Rev. Psychal. 40, 405-429. Tl~u~o, G. (1986) Simple neural models of classical conditioning. BloL Cyber. ~ , 187-200. "I~t.Ea, T. J. and D ~ ^ , P. (1984) Long-term potentiation as a candidate mnemonic device. Bra/n Res. Rev. 7, 15-28.

TEYLER, T. J. and DISCENN^, P. (1986) The hippocampal memory indexing theory. Behav. Neurosci. 100, 147-154. TEYLF.g, T. J. and DISCL~N^, P. (1987) Long-term potentiation. A. Rev. Neurosci. 10, 131-161. THOMPSON,R. F (1972) Sensory preconditioning. In: Topics in Learning and Performance, Eds. R. F. THOMPSONand J. F. Voss. Academic Press: New York, N Y THOMPSON, R. F. (1985) The Brain. W. H. Freeman and Company: New York, NY. THOMPSON, R. F. (1986) The neurobiology of learning and memory. Science 233, 941-947. THOMPSON, R. F. (1989) Neural circuit for classical conditioning of the eyelid closure response. In: Neural Models of Plasticity, Eds. J. H. BYRNE and BERRY, W. O.. Academic Press: New York, NY. THOMPSON, R. F., MCCORMICK, D. A., LAVOND, D. G., CLARK, G. A., ~ , R. E. and MAUK, M. D. (1983) The engram found? Initial localization of the memory trace for a basic form of associative memory. In: Progress in Psychobiology and Physiological Psychology, Vol. 10. Eds. J. M. SPRAGLrEand A. N. E~"r~N. Academic Press: New York, NY. THOMPSOn, R. F. and SFl~qCr.g,W. A. (1966) Habituation: A model phenomenon for the study of neuronal substrates of behavior. Psychol. Rev. 73, 16-43. TSUK^Z~,~gx,N. (1981) Synaptic plasticity in the mammalian central nervous system. A. Rev. Neurosci. 4, 351-379. TSUKAHARA,N. (1984) Classical conditioning mediated by the red nucleus: An approach beginning at the cellular level. In: Neurobiology of Learning and Memory, Eds. G. LYNCH,J. L. McGAUGHand N. M. WEnqS~,c,v.a. Guilford Press: New York, NY. TUCKER, D. M. and WILLI^gSON,P. A. (1984) Asymmetric neural control systems in human self-regulation. Psychol. Rev. 91, 185-215. TULVING, E. (1972) Episodic and semantic memory. In: Organizatwn of Memory, Eds. E. TULVlNG and W. DON^LDSON. Academic Press: New York, NY. TULVI~a:3,E. (1983) Elements of Episodic Memory. Oxford University Press: London. TULVI~G, E. (1984) Precis of elements of episodic memory. Behav. Brain Sci. 7~ 223-268. TULVING, E. (1986) What kind of a hypothesis is the distinction between episodic and semantic memory? J. exp. Psychol.: Learning, Memory and Cognition 12, 307-311. TUNTURI,A. R. (1952) A difference in the representation of auditory signals for the left and fight cars in the isofrequency contours of right middle ecosylvian auditory cortex in the dog. A. J. Physiol. 168, 712-727. TVEgSKY,A. (1977) Features of similarity. Psycho/. Rev. 84, 327-352. TVEgSKV,A. and G^'n, I. (1982) Similarity, separability, and the triangle inequality. Psychol. Rev. m, 123-154. Tv~gsK¥, A. and HUTCHINSON,J. W. (1986) Ncarest neigh. bor analysis of psychological spaces. Psychol. Rev. 93, 3-22. UTTLEY, A. M. (1979) Information Transmi,gsion m the Nervous System. Academic Press: New York, NY. VI~OGP.AOOV^,O. S. (1975) Functional organization of the limbic system in the process of registration of information: Facts and hypothesis. In: The Htppocampus, Eds. R. L. ISAACSONand K. H. I~,m~M. Plenum Press: New York, NY. yon DER MALS~UgO, C. (1973) Self-organizing of orientation sensitive cells in the striate cortex. Kybernetik 14, 85-100. W^o~'~g, A. R. (1976) Priming in STM: An information processing mechansim for self-generated or retrievalgenerated depression in performance. In: Habituation: Perspectives from Child Development, Animal Behavior and Neurophyslology, Eds. T. J. TIOI~ and R. N. I.X.ATON. Lawrence Erlbaum Associates: Hillsdale, NJ.

AarrmcL~,LNEtmAL NETWORKS WAol,~t, A. R. (1978) Expectancies and the priming of STM. In: Cognit~e Processes in Animal Behavior, Eds. S. H. HUL~ H. F o ~ . ~ and W. K. HONIG. Lawrence Erlbaum Associates: Hillsdak, NJ. WAON~,A. R. (1979) Habituation and memory. In: Mechanisms of Leafing and Motivation, Eds. A. D1CKn,r~N and R. A. BAOg~. Lawrence Erlhatun Associates: HillKlale, NJ. W ~ C. M. and BOURNE,L. E. Jp, (1961) The identification of concepts as a functionof relevant and irrelevant information. Am. J. Psychol. 74, 410-417. WALI~I~t, S. (1987) Animal Learning: An Introduction. Routledge & Kegan Paul: New York, N'Y. W x ~ W. D., Dt'w~, G. I., M U ~ , T. D. and MeDm, D. L. (1986) Linear separability and concept learning: context, relational properties, and concept naturalneu. Cog. Psychoi. 18, 158--194. W~o~t, N. M. (1982) EffecUof conditioned arousal on the auditory system. In: The Neural Basis of Behavior, Ed. A. Bwcrot~q. Spectrum Publishing: New York, NY. W ~ n t O l ~ , N. M. and Dt~Nom3, D. M. (1987) Physiological plasticity in auditory cortex: Rapid induction by learning. Prog. Neurobiol. 29, 1-55. Wm~L~N, R. G. and Komcm,ow, R. (1984) Order competenci~ in animals: Modeh for the delayed sequence discrimination task. In: Animal Cognition. Eds. H. L. RorraLAT, T. G. Bzvn and H. S. ~ c ~ . Lawrence Erlbaum Associates: Hillsdale, NJ. Wins, K. R. and Btow~, B. (1974) Latent inhibition: A review and new hypothesis. Acto neurobiol, exp. 34, 301-316. W l ~ . , J. and MA~m~s, H. (1985) Morphological changes in the hiplx)campal formation accompanying memory formation and long-term potentiation. In: Memory Systems of the Brain, Eds. N. M. W~Na~l~o~g,J. L. McGAuoH and G. LYNCH. The Guilford Press: New York, NY. WHrr~, E. L. (1981) Thalamocortical synaptic relations. In: The Organization of the Cerebral Cortex, Eds. F. O. Scmerr, F. G. Wov,v ~ , G. Av~.u~t~ and S. G. 1 3 ~ s . MIT Press: Cambridge, MA. W ~ , W. A. (1979) Chunking and consolidation. Psychoi. Ray. g6, 64-60. Wn>ltOW,B. and Ho~, M. E. (1960) Adaptive switching circuits. Institute of radio engineers, western electronic show and convention. Convention Rec. Part I, 96-104.

Yl/t---O

431

WU3tOW, B. and STP_~tm, S. D. (1985) Adaptive Signal Processing. Prentice-Hall: Englewood Cliff's, NJ. WILLL~m, R. W. and Hngtw, K. (1988) The control of neuron number. A. Ray. Neurosci. 11, 423-453. WINoGv..~, T. (1975) Frame representations and the declarative-procodumi controversy. In: RelneSentation and Understanding, Eds. D. G. Bomtow and A. COLU~. Academic Press: New York, NY. Wn-ro~wmN, L. (1953) Philosophical lrwesagat~is. Macmillan: New York, NY. WO~G-Rn.~, M. (1978) Reciprocal connections between striate and prestriate cortex in squirrel monkey as demonstrated by combined pernr,idas histochemistry and autoradiography. Brain Res. 14"/, 159-164. WOOD, F., E ~ T , V. and ~ M. (1982) The episodic-semantic memory distinction in memory and amnesia: clinical and experimental obeervatiom. In: Human Memory and Amnesia, Ed. L. S. Cegg~. Lawrence Erlbaum As,u~ate$: Hilledale, NJ. WOODWARD, D. J., Moe~, H. C., W ^ r ~ J l o u ~ B. D., Hof~t, B. J. and FREEMAN,R. (1979) Modulatory actions of norepinephrine in the central nervous system. Fedn Proc. 38, 2109-2116. WOODY,C. D. (1982a) Memory, Learning and Higher Mental Function. Springer-Verlag: New York, NY. WoovY, C. D. (1982b) Neuroph3nfiological correlate, of latent facilitation. In: Advances in Behavioral Bfoiofy, VoL 26. Ed. C. D. WoovY. Plenum Pren: New York, NY. WooDY, C. D. (1984a) Studies of Pavlovian eyeblink conditioning in awake cats. In: Neurobiology of Leamtnf and Memory. Eds. G. L ~ c t l , J. L. MCGAUOR and N. M. W B ~ o ~ t . Guilford Press: New York, N'Y. Woov¥, C. D. (1984b) The electrical excitability of nerve cells as an index of learned behavior. In: Pr/mary Neural Substrates of Learning and Behavioral Change, Eds. D. ALKONand J. FARLEY.Cambridge University Press: New York, NY. WoooY, C. D. (1986) Understanding the cellular basis of memory and learning. A. Ray. Psychol. 37, 433-493. WOODY,C. D., BuE~o~t, A. A., UNO~Ut,R. A. and D. S. (1976) Modeling a ~ c t s of learning by altering-biophysical properties of a simulated neuron. Biol. Cyher. 23, 73-82. ZIMM~-H~T, C. L. and RESO31tLA,R. A. (1974) Extinction of Pavlovian conditioned inhibition. J. comp. Physiol. Psychoi. g6, 837--845.
