This article was downloaded by: [Thuringer University & Landesbibliothek]
On: 14 January 2015, At: 10:53
Publisher: Taylor & Francis
Informa Ltd Registered in England and Wales. Registered Number: 1072954. Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK

Ergonomics
Publication details, including instructions for authors and subscription information:
http://www.tandfonline.com/loi/terg20

Development of microcomputer-based mental acuity tests

J. J. TURNAGE a, R. S. KENNEDY b, M. G. SMITH b, D. R. BALTZLEY c & N. E. LANE b

a University of Central Florida, Orlando, Florida, USA
b Essex Corporation, Orlando, Florida, USA
c now located at Dade-Miami Criminal Justice Assessment Center, Miami, Florida, USA

Published online: 31 May 2007.

To cite this article: J. J. TURNAGE, R. S. KENNEDY, M. G. SMITH, D. R. BALTZLEY & N. E. LANE (1992) Development of microcomputer-based mental acuity tests, Ergonomics, 35:10, 1271-1295, DOI: 10.1080/00140139208967393
To link to this article: http://dx.doi.org/10.1080/00140139208967393


ERGONOMICS, 1992, VOL. 35, NO. 10, 1271-1295

Development of microcomputer-based mental acuity tests

J. J. TURNAGE,* R. S. KENNEDY,** M. G. SMITH,** D. R. BALTZLEY,*** and N. E. LANE**

* University of Central Florida, Orlando, Florida, USA
** Essex Corporation, Orlando, Florida, USA
*** now located at Dade-Miami Criminal Justice Assessment Center, Miami, Florida, USA

Keywords: Performance; Environmental stress; Alcohol; Test battery; Cognition; Microcomputer tests.

Recent disasters have focused attention on performance problems due to the use of alcohol and controlled substances in the workplace. Environmental stressors such as thermal extremes, mixed gases, noise, motion, and vibration also have adverse effects on human performance and operator efficiency. However, the lack of a standardized, sensitive human performance assessment battery has probably delayed the systematic study of the deleterious effects of various toxic chemicals and drugs at home and in the workplace. The collective goal of the research reported here is the development of a menu of tests embedded in a coherent package of hardware and software that may be useful in repeated-measures studies of a broad range of agents that can degrade human performance. A menu of 40 tests from the Automated Performance Test System (APTS) is described, and the series of interlocking studies supporting its development is reviewed. The APTS tests, which run on several versions of laptop portables and desktop personal computers, have been shown to be stable, reliable, and factorially rich, and to have predictive validities with holistic measures of intelligence and simulator performances. In addition, sensitivity studies have been conducted in which performance changes due to stressors, agents, and treatments were demonstrated. We believe that tests like those described here have prospective use as an adjunct to urine testing for screening for performance loss in individuals who are granted access to workplaces and stations that affect public safety.

1. Introduction

The recent oil spill in Alaska ('Oil tanker captain...' 1989), where there was evidence of alcoholic intoxication, and the train crash outside Washington, DC ('Engineer in Amtrak crash...' 1988), where the use of controlled substances was implicated, have focused attention on methods for the determination of fitness for duty. Now, ten years after, and largely prompted by, the Three Mile Island incident, the US Nuclear Regulatory Commission has recently promulgated a fitness-for-duty rule (US Code of Federal Regulations, 10 CFR Part 26, Federal Register, 22 September 1988), whose purpose is to assure that power plant personnel are not 'under the influence of any substance (legal or illegal) or mentally or physically impaired from any cause which adversely affects their ability to safely and competently perform their duties' (CFR 1988: 36795). The rule is applicable to all individuals who are granted unescorted access. Other US federal agencies are planning similar directives.

Corresponding author: Robert S. Kennedy, Vice-President, Essex Corporation, 1040 Woodcock Road, Suite 227, Orlando, FL 32803, USA.

0014-0139/92 $3.00 © 1992 Taylor & Francis Ltd.



Implementation of such policies is not without difficulty. For example, in addressing this issue for the nuclear power industry, the Commission noted that there are many degrees of mental impairment, some of which bear no relationship to the use (or lack of use) of the controlled substances that are presently to be detected through random urine testing. In addition, a positive urine test does not establish that an individual is currently impaired; it establishes only the presumption of impairment. Present methods to detect potential impairment from drugs or alcohol by chemical testing of body fluids have serious drawbacks. Self-medications such as cold or allergy remedies, or colds themselves, are not normally tested for, but may affect job performance as much as illegal drugs or alcohol ('Cold remedies' 1989). Of greatest importance, a positive drug test does not predict whether and how performance will be affected. Yet effects of impairment cannot be measured satisfactorily either by self-reports of psychomotor condition or by real-world measures of performance, because the former can be faked and the latter, despite their validity from a measurement standpoint, are almost hopelessly unreliable. Lane (1986) suggests that, historically, reliabilities in the range r = 0.15 to 0.50 have been found. Operator state could be assessed, however, using standardized tasks that can be administered rapidly. Tasks that accurately and reliably measure variation in individual psychomotor condition could be used to identify individual factors affecting real-world performance. This would be of immediate practical use and could guide the institution of standards and regulations for human performance in many sensitive job positions.

2. Recent research on microcomputer test battery development

Recently, considerable research effort has focused on the development of computer-based neurobehavioural and cognitive test batteries for the assessment of human performance and mental acuity in the presence of toxic elements and environmental stressors. The US Department of Defense (DOD), the Veterans Administration, the Environmental Protection Agency, the Federal Aviation Administration, other agencies, and several universities have active research programmes. These programmes constitute valuable resources for the research and development of a computerized testing system. For example, the Unified Tri-Service Cognitive Performance Assessment Battery (UTC-PAB) (Englund et al. 1986) was developed to provide a standardized DOD instrument for the computerized assessment of drug-dependent cognitive degradation. The US Army has developed a similar Performance Assessment Battery (PAB) which has been shown to be sensitive to changes in performance due to sleep deprivation, with all tasks on the battery showing similar decrement patterns across time (Thorne et al. 1985). A related programme of testing has been conducted with the PAB to evaluate the effects of hypoxia (Bandaret et al. 1988, Bandaret and Burse 1984, Bandaret et al. 1984). A neurophysiological microprocessor test battery was developed at the Air Force Aerospace Medical Research Laboratory (AFAMRL) (O'Donnell 1981) to be used in a field environment to assess the effects of workload on operator performance. In addition, a subjective workload scale has also been developed (Reid et al. 1981, Schlegel and Shingledecker 1985, Shingledecker 1984). The Learning Abilities Measurement Program (LAMP) at the Air Force Human Resources Laboratory (AFHRL) has used automated testing to assess individual differences in cognitive abilities and information processing (Christal 1981, Payne 1982), and recently a



Basic Attribute Test of psychomotor and cognitive tests (BAT) has been related to success in pilot selection and training (Kantor and Bordelon 1985, Carretta 1989). In addition, the US Federal Aviation Administration (Bannich, Stokes and Elledge 1989) has included mental acuity tests for purposes of identifying persons whose flying capabilities may be impaired as a result of residual deficits due to alcohol, degenerative disease, brain trauma, cardiovascular problems or psychiatric disturbance. Sponsored by the US Environmental Protection Agency (EPA), Eckerman and his colleagues (Guillion and Eckerman 1986) have also developed an automated test battery to detect the effects of toxic substances on human performance. Related batteries are found in the USA (Baker et al. 1985, Barrett et al. 1982, Rosa and Colligan 1988), Canada (Heslegrave and Angus 1985), and in Europe (Hanninen and Lindstrom 1979, Logie and Baddeley 1985).

A handful of recently published studies has compared automated and manual versions of tests and reported favourable results (Barrett et al. 1982, Wilson et al. 1982) in the form of high correlations between the two versions. However, according to some (e.g., Berger et al. 1988, Giannetti 1988), merely being significantly correlated is insufficient evidence for assuming the computer-generated surrogate test to be equivalent to the original. An experiment conducted by Krause (1983) illustrates this point: microcomputer and paper-and-pencil versions of four well-documented cognitive tests were compared with two paper-and-pencil forms and one computer test form. Reliability correlations for the paper-and-pencil tests were significantly and substantially higher than those for the computer versions when corrected for test length. In another study (Lane and Kennedy 1988), a standard test (Pattern Comparison) revealed somewhat different factor structures for the same tests, presumably because of the time permitted for each test, and perhaps also because of the way they were implemented in the two batteries. In addition, we do not know much about the influence of practice on the stability of factor structure. These issues suggest that researchers should re-evaluate computerized tests empirically rather than assume that computer-generated results are comparable to their paper-and-pencil predecessors.

The approach followed in the development of most computerized test batteries has included, among others, cognitive theory (e.g., Guillion and Eckerman 1986), a desire to 'standardize' (e.g., Hegge, cited in Sanders et al. 1986), the need to field a battery to test the mental functions disrupted by various agents (e.g., Baker et al. 1985), and the opportunity to computerize paper-and-pencil tests (Barrett et al. 1982). To this list we add our own philosophy.
Instead of following cognitive theory, whereby tests are selected according to the type of information-processing function expected to be most heavily involved in their performance, it is our view that the technical requirements for developing a battery of tests should be based on the tenets of the classical theory of mental tests and testing (e.g., Allen and Yen 1979, Gulliksen 1950, Thorndike and Hagen 1977). Classical test theory requires that tests meet set criteria such as stability and reliability. Stability is defined as that point after repeated testings at which practice no longer produces changes in performance. Stabilization occurs when the group mean no longer increases, the variance among subjects no longer changes, and the correlation with earlier trials remains the same from one stabilized trial to the next (Jones 1980). Stability is essential to a performance test battery because, when it is absent, practice and environmental effects are confounded, making interpretation difficult. Stability implies the use of parallel forms of a test, administered on different days to the same group of subjects. Relatedly, repeated testing with at least three parallel forms is the best way to obtain reliability because one can compare means, variances, and correlations. If means and variances are equal and correlations between repeated measures are high, one has 'little fear that undetected and irrelevant factors are rendering the obtained reliability coefficient either spuriously high or spuriously low' (Gulliksen 1950: 196). Thus, one is relatively sure that an individual's obtained performance score on a test reflects his or her 'true' score. This approach, which considers stability and reliability of measurement to be the foundation for test construction, differs from other current computer-based testing systems because it focuses on building a battery from psychometrically proven parts (tests) and then attempts to achieve factorial diversity of cognitive content, rather than the reverse. Such an approach can easily accommodate hypothetical constructs like 'controlled vs automatic' processing (Ackerman and Schneider 1984) or 'components' (Sternberg 1979) as they emerge. When a test is stable, systematic differences in automaticity, learning, or fatigue are no longer present, and the effect of introducing treatments or agents may be seen to influence the construct which the true score purports to tap. Therefore, a critical requirement of tests employed in repeated-measures applications and within-subject designs is that they be stable, so that alternate forms of the tests are parallel. The requirement for parallel forms is logically necessary for proper interpretation of any loss (or gain) in the performance being measured as being due to a treatment. We believe that past test battery development has paid little if any attention to certain areas of test theory, particularly differential stability. More to the point, we know of no battery other than the Automated Performance Test System (APTS) described below which has followed a differential approach. In addition, because within-subject designs are so often the intended application, we believe this is a critical failing of other batteries. In the next section, we discuss in greater detail the requirements that we consider critical for constructing a suitable battery for detecting changes in human performance.
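The parallelism screen described above — comparing means, variances, and intercorrelations across at least three repeated forms — can be sketched in code. The function below is our own illustration of the idea, not part of the APTS software; the correlation threshold and the tolerance used to call means and variances "equal" are arbitrary assumptions:

```python
import numpy as np

def parallel_forms_check(scores, r_min=0.7, tol=0.05):
    """Gulliksen-style parallelism screen for a subjects x forms score matrix.

    scores : 2-D array, rows = subjects, columns = repeated parallel forms.
    r_min  : minimum acceptable between-form correlation (our assumption).
    tol    : relative tolerance for "equal" means/variances (also assumed).
    """
    scores = np.asarray(scores, dtype=float)
    means = scores.mean(axis=0)
    variances = scores.var(axis=0, ddof=1)
    corr = np.corrcoef(scores, rowvar=False)          # form-by-form matrix
    between = corr[np.triu_indices_from(corr, k=1)]   # off-diagonal entries
    means_ok = np.ptp(means) <= tol * np.abs(means).mean()
    vars_ok = np.ptp(variances) <= 2 * tol * variances.mean()
    corr_ok = between.min() >= r_min
    return {"means": means, "variances": variances,
            "between_form_r": between,
            "parallel": bool(means_ok and vars_ok and corr_ok)}
```

In practice one would replace the crude tolerance checks with formal tests of mean and variance homogeneity; the sketch only mirrors the logic of inspecting means, variances, and correlations together before trusting a reliability coefficient.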

3. Methodology for evaluating tests for repeated-measures usage

The following criteria are needed for repeated-measures testing.

3.1. Stability
Repeated-measures studies of environmental influences on performance require stable measures if changes in the treatment are to be meaningfully related to changes in performance (Jones 1970a). Of particular concern is the fact that a subject's score may differ significantly over time due to measure instability. Jones (1970a, b) clearly describes the potential for score instability in the two-process theory of skill acquisition. The theory maintains that the advancement of a skill involves an acquisition phase, in which persons improve at different rates, and a terminal phase, in which persons reach or approximate their individual limits. The theory further implies that, when the terminal phase is reached, scores will cease to deviate despite additional practice. Unless tests have been practised to this point of differential stability, the determination of change in scores due to practice or some other variable is confounded. For example, in a study of the effects of alcohol, if scores on a performance test remained the same before and after exposure, and if the test was not differentially stable, it would be impossible to determine whether a decline in performance was masked by practice effects or whether there was no treatment effect. Only after differential stability is clearly and consistently established between subjects can investigators place confidence in the adequacy of their measures and subsequent results.
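The plateau criterion for differential stability can be operationalized as a search over the intertrial correlation matrix. The sketch below uses only adjacent-trial correlations and a fixed threshold — simplifications of ours, not the quantitative procedure described by Jones:

```python
import numpy as np

def stabilization_trial(scores, r_plateau=0.8):
    """First trial from which every adjacent-trial correlation stays at or
    above r_plateau (differential stability); None if no plateau is reached.

    scores : subjects x trials array of performance scores.
    """
    scores = np.asarray(scores, dtype=float)
    n_trials = scores.shape[1]
    # Correlation between each trial and the next one.
    r_adj = [np.corrcoef(scores[:, t], scores[:, t + 1])[0, 1]
             for t in range(n_trials - 1)]
    for t in range(n_trials - 1):
        if all(r >= r_plateau for r in r_adj[t:]):
            return t
    return None
```

A fuller implementation would also require the group mean and between-subject variance to stop changing, as the stabilization definition above demands; the correlation plateau is only one of the three criteria.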

3.2. Reliability
Retest reliability is usually defined as the correlation between test scores at two points in time. In repeated-measures studies it is measured differently: retest reliability, sometimes called task definition, is defined as the average of the retest correlations of the stabilized task (Jones 1980, Wilkes et al. 1988). By this definition, it is necessary first to determine when a task (test) stabilizes. Stability can be determined quantitatively, but is frequently taken to be the trial in an intertrial repeated-measures correlation matrix at which the correlation coefficients plateau (i.e., become high and remain high over subsequent trials). Task definition is the average of all the correlations that occur on and after the trial of stability. Higher average reliability improves power in repeated-measures studies when variances are constant, and the lower the error within a measure, the greater the likelihood that mean differences will be detected. Therefore, tasks with low task definition are insensitive to such differences and are to be avoided. Because different tasks stabilize at different levels, task definition becomes an important criterion in task selection. Task definitions for different tests, however, cannot be directly compared without first standardizing the tests for test length (i.e., reliability efficiency).

3.3. Reliability efficiency
Reliability correlations are known to be influenced by test length (Guilford 1954). Tests with longer administration times and/or more items maintain a reliability advantage over shorter versions. Test length should therefore be equalized before meaningful comparisons can be made. We have found that a useful tool for making relative judgements is the reliability efficiency, or standardized reliability, of the test (Kennedy et al. 1980).
Reliability efficiencies are computed by correcting the reliabilities of different tests to a common test length by use of the Spearman-Brown prophecy formula (Guilford 1954: 354). Reliability efficiency not only facilitates judgements concerning different tests, but also provides an objective means for comparing the sensitivity of one test with that of another, or of the same test in different populations where range restrictions may occur.

3.4. Stabilization time
The evaluation of highly transitory changes in performance may be necessary when studying the effects of various treatments, drugs, alcohol, or environmental stress. 'Good' performance measures should stabilize quickly following short periods of practice without sacrificing metric qualities. As a general rule, 'good' performance measures should always be economical in terms of time. A task under consideration for environmental research must be characterized in terms of the number of trials and/or the total amount of time necessary to establish stability. Stabilization time must be determined for the group means, standard deviations, and intertrial correlations (differential stability).
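The Spearman-Brown correction underlying reliability efficiency (section 3.3 above) is simple to state in code. In the sketch below, the correction to a common one-minute length is an illustrative choice of ours; the text does not fix a particular common length:

```python
def spearman_brown(r, k):
    """Prophesied reliability when test length is multiplied by factor k:
    r_k = k*r / (1 + (k - 1)*r)  (Guilford 1954)."""
    return k * r / (1.0 + (k - 1.0) * r)

def reliability_efficiency(r, length_min, common_length_min=1.0):
    """Reliability corrected to a common administration time so that tests
    of different durations can be compared (common length is illustrative)."""
    return spearman_brown(r, common_length_min / length_min)
```

For example, a three-minute test with retest reliability 0.90 has a corrected per-minute reliability of (1/3)(0.90) / (1 - (2/3)(0.90)) = 0.30/0.40 = 0.75, so a one-minute test with reliability 0.80 is, minute for minute, the more efficient measure.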

3.5. Task ceiling
If all subjects asymptote at the maximum level of performance, then the task is said to have a ceiling (Jones 1980). Ceilings are undesirable because they limit discrimination between subjects. When subjects perform equally well, except for random error, between-trial correlations fall to zero. Therefore, the best task (test) is one which achieves stability early (i.e., plateaus), but which still exhibits subtle increments in learning.
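The attenuation of between-trial correlations at ceiling is easy to demonstrate numerically. The simulation below is entirely our own illustration, with arbitrary score distributions: scores are clipped at a task maximum, and once most subjects sit at the cap the discriminating variance is lost and the between-trial correlation drops:

```python
import numpy as np

def between_trial_r(ability, noise_sd, cap, rng):
    """Correlation between two simulated trials of a task whose score
    cannot exceed `cap` (the task ceiling)."""
    t1 = np.minimum(ability + rng.normal(0.0, noise_sd, ability.size), cap)
    t2 = np.minimum(ability + rng.normal(0.0, noise_sd, ability.size), cap)
    return float(np.corrcoef(t1, t2)[0, 1])

rng = np.random.default_rng(1)
ability = rng.normal(50.0, 10.0, 500)                 # stable individual levels
r_open = between_trial_r(ability, 3.0, cap=1e9, rng=rng)     # no effective ceiling
r_ceiling = between_trial_r(ability, 3.0, cap=40.0, rng=rng) # most scores capped
```

With no effective ceiling, the between-trial correlation approaches the task's true reliability; with the cap at 40 the majority of subjects tie at the maximum and the correlation is attenuated, which is why ceilinged tasks cannot resolve treatment effects between subjects.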

3.6. Factor richness
Where possible, subtests should be selected that tap independent factors with little or no overlap. Such selection ensures that the overall battery is rich in factor structure while free of unwanted redundancies.


3.7. Validity
Good tests are those which are demonstrably valid according to several criteria. For example, they should: (1) be sensitive to agents and stimuli like hypoxia, drugs, and sleep loss; (2) be predictive of other mental test scores and cognitive performances; (3) tap constructs and factors which reflect a theoretical basis; and (4) appear on the face to be testing a mental acuity function.

In summary, we believe that establishing the stability, reliability, and validity of newly developed microcomputer tests has lagged far behind both the use and the marketing of such tests. Elsewhere (Carter et al. 1980, Kennedy and Bittner 1977), the traditional criteria for validity and reliability, along with hardware factors and other measurement issues, have been listed. In a review of alternate computerized performance test systems, we also evaluated (to the best of our knowledge) various batteries for the presence of five metric properties and five practical properties considered to be necessary components of a performance monitoring device capable of detecting behavioural changes due to various treatments and agents. The five metric properties were: (1) suitability for repeated measures (stability); (2) demonstrated reliability; (3) indexing to external referents (criterion-related validity); (4) demonstrated sensitivity; and (5) known, diverse factor content. The five practical properties were: (1) portability; (2) self-administration; (3) self-scoring; (4) no special interfaces required; and (5) minimal administration time (
