What is a Criterion-Referenced Test?
A criterion-referenced test is a measurement tool designed to estimate
mastery of an identified unit of a curriculum (e.g., battles of the Civil
War, multi-digit addition with regrouping, use of prepositions).
In somewhat different forms, such tests may also be referred to as
curriculum-based measures and, more broadly, as curriculum-based assessment.
Criterion-referenced tests are standardized instruments, constructed with
sufficient precision that different examiners will administer, score, and interpret
results in the same way. Criterion-referenced tests contain items designed
to represent the unit of instruction adequately. Each item has a
predetermined correct answer that can be scored objectively by the assessor.
A criterion-referenced test may be used for two main purposes. First, it
can be used to determine whether or not a student is weak in a given skill
and therefore needs further instruction. Second, it can be used following
instruction to determine the effectiveness of that instruction. Although
criterion-referenced tests are seldom normed nationally, it is very beneficial
to collect sufficient data from which to compute local means for appropriate
grade groups.
Steps in Constructing a Criterion-Referenced Test
Step 1: Naming the Test.
Although it seems trivial, it is important to give the test a name that
accurately represents its content. Over the course of years, teachers
construct many tests, and a filing system that allows efficient retrieval
for future use demands that test names obviously reflect their content.
Step 2: Objective(s) Represented
by the Test. It is important that tests be created after objectives have
been specified. The test items are then constructed to reflect mastery
of the objectives, not the other way around. Ideally, the objectives
will be drawn from a large Taxonomy of Objectives designed to cover the
entire domain. A good objective contains: (a) conditions under which
the behavior will occur, (b) a statement of the behavior demanded of the
examinee, and (c) criteria that will be used to determine whether or not
the examinee has mastered the objective. The behavior should be objectively
defined such that two people would agree that the behavior has or has not
occurred. Criteria should be in the form of one of a number of recognized
scores: percentage correct, behavior rates, duration, response latency,
intensity, standard score, percentile rank, etc.
Step 3: Statement of the
Purpose of the Test. This is merely a restatement of the objective in more
easily communicated form. That is, it uses everyday language without
the technical verbiage.
Step 4: Instructions for
Administration. This component tells the user how the test should be given.
This helps to standardize data collection so that the test is administered
in the same way from occasion to occasion, from child to child, and from
examiner to examiner. This makes the results (i.e., the scores) comparable.
Typical elements included are: (a) instructions to the child, (b)
materials needed (e.g., two sharpened number 2 pencils, use of a watch
for timing purposes), (c) how to deal with interruptions, (d) how to deal
with questions from the child, among others. The test maker must
ask herself what elements impinge upon the successful administration of
her test.
Step 5: Instructions for
Scoring. This section tells the user how to transform the examinee's responses
into item scores and total scores. This often means providing criteria
for correct and incorrect responses to individual items in a scoring key.
There may be a formula required to obtain a total score (e.g., the formula
for behavior rates) that should be illustrated for the uninformed user.
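As an illustration of a scoring section, the following is a minimal sketch in Python. It assumes dichotomous scoring against an answer key, percentage correct as the total score, and a behavior rate computed as responses per minute. The function names, the four-item key, and the responses are hypothetical and are not taken from any particular test.

```python
from typing import List, Sequence


def score_items(responses: Sequence[str], key: Sequence[str]) -> List[int]:
    """Score each item dichotomously: 1 if the response matches the key, else 0."""
    if len(responses) != len(key):
        raise ValueError("responses and key must have the same number of items")
    return [1 if r.strip() == k.strip() else 0 for r, k in zip(responses, key)]


def percent_correct(item_scores: Sequence[int]) -> float:
    """Total score expressed as percentage correct."""
    if not item_scores:
        raise ValueError("at least one item score is required")
    return 100.0 * sum(item_scores) / len(item_scores)


def behavior_rate(count: int, minutes: float) -> float:
    """Behavior rate: number of responses divided by elapsed time in minutes."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return count / minutes


# Hypothetical four-item scoring key and one examinee's responses.
key = ["12", "15", "9", "21"]
responses = ["12", "14", "9", "21"]
item_scores = score_items(responses, key)
print(item_scores)                   # [1, 0, 1, 1]
print(percent_correct(item_scores))  # 75.0
print(behavior_rate(30, 2.0))        # 15.0 responses per minute
```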
Step 6: Instructions for
Interpretation. Here the user is told how to make decisions on the basis
of the score(s) obtained from an administration of the instrument.
Basically, the criteria for minimally acceptable performance laid out in
the objective guide this process. For instance, if the criterion mentions
95% accuracy, then the user should compare the examinee's score with 95%.
If the examinee's score equals or exceeds that value, the child has mastered
the objective. If not, then the child needs more instruction and practice
on that objective.
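The decision rule just described can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the 95% default simply mirrors the example criterion above.

```python
def has_mastered(percent_correct: float, criterion: float = 95.0) -> bool:
    """True when the score equals or exceeds the criterion stated in the objective."""
    return percent_correct >= criterion


def interpret(percent_correct: float, criterion: float = 95.0) -> str:
    """Turn a percentage-correct score into the decision described in the text."""
    if has_mastered(percent_correct, criterion):
        return "mastered"
    return "needs more instruction and practice"


print(interpret(95.0))  # mastered
print(interpret(90.0))  # needs more instruction and practice
```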
Step 7: Specific Items in
the Instrument. The key here is for the test maker to ensure that the items
in the test are representative of the skills specified in the objectives.
First, there must be enough items to constitute a reliable sample of the
skills in question. It is rarely possible to have a reliable measure
of any objective with fewer than 25 items. Second, the items should
adequately represent the various kinds of subskills contained within an
objective. For instance, a test on addition facts is unrepresentative
if it does not include items containing 6 or 8.
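Both checks can be roughed out in code. The sketch below, in Python, assumes an addition-facts test whose items are stored as pairs of addends and assumes that every digit from 0 to 9 should appear at least once. The function names, the digit requirement, and the sample items are illustrative assumptions, not part of the original procedure.

```python
from typing import Iterable, List, Sequence, Tuple


def missing_addends(
    items: Sequence[Tuple[int, int]], required: Iterable[int] = range(10)
) -> List[int]:
    """Return the required addends that never appear in any item."""
    used = {addend for item in items for addend in item}
    return sorted(set(required) - used)


def has_enough_items(items: Sequence[Tuple[int, int]], minimum: int = 25) -> bool:
    """Check the item count against the 25-item guideline in the text."""
    return len(items) >= minimum


# Hypothetical (and far too short) addition-facts test.
items = [(1, 2), (3, 4), (5, 7), (9, 0)]
print(missing_addends(items))   # [6, 8]
print(has_enough_items(items))  # False
```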
Step 8: Standard Error of
Measurement. The standard error of measurement (SEM) of a test can be estimated
using the number of items contained in the test (Eaves, 1979). This
estimate should be included along with the other components of the test.
As an example of the use of the SEM, consider the student who obtains a
raw score of 7 on a test containing 11 items. His percentage correct is
64%. Because the percentage does not fall into one of the exceptions,
the estimated SEM is 2 (the value for tests with fewer than 24 items). To
construct a 95% confidence interval, the assessor doubles the SEM
(i.e., 2 × 2 = 4). This product is subtracted from the student's raw score
(7 - 4 = 3) and then added to the raw score (7 + 4 = 11). These two values
bound the 95% confidence interval in raw-score form (i.e., 3 to 11). In
percentage-correct form, the assessor can say, with the knowledge that he
will be correct on 95 out of 100 such judgments, that the student's true
score lies within the interval of 27% to 100%. Notice that such results on a test with
few items provide virtually no useful information for decision making.
The same relative performance on a 100-item test (a raw score of 64, with
an estimated SEM of 5) would result in a 95% confidence interval of 54% to 74%.
Estimated Standard Errors of Test Scores
Number of Items | Standard Error* | Exceptions: regardless of the length of the test, the standard error is:
<24             | 2               | 0 when the score is zero or perfect
24-47           | 3               | 1 when 1 or 2 percentage points from 0 or 100%
48-89           | 4               | 2 when 3 to 7 percentage points from 0 or 100%
90-109          | 5               | 3 when 8 to 15 percentage points from 0 or 100%
110-129         | 6               |
130-150         | 7               |
* Standard errors are in raw score form. Items are assumed to be scored
dichotomously (i.e., 0 or 1).
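The table lookup and the confidence-interval arithmetic from Step 8 can be sketched in Python as follows. Several details are assumptions of this sketch and not statements from the source: percentages are rounded to whole numbers before the exceptions are checked, the exceptions take precedence over the length-based value (a literal reading of "regardless of the length of the test"), and the interval is clipped to the possible score range. All function names are hypothetical.

```python
from typing import Tuple

# (largest number of items in the band, SEM in raw-score units), from the table.
SEM_BY_LENGTH = [(23, 2), (47, 3), (89, 4), (109, 5), (129, 6), (150, 7)]


def estimated_sem(raw_score: int, n_items: int) -> int:
    """Estimated SEM in raw-score form for a dichotomously scored test."""
    if not 1 <= n_items <= 150:
        raise ValueError("the table covers tests of 1 to 150 items")
    if not 0 <= raw_score <= n_items:
        raise ValueError("raw_score must lie between 0 and n_items")
    if raw_score in (0, n_items):  # zero or perfect score
        return 0
    # Assumption: work with whole-number percentages, as the worked example does.
    percent = round(100 * raw_score / n_items)
    distance = min(percent, 100 - percent)  # points from 0% or 100%
    if distance <= 2:
        return 1
    if distance <= 7:
        return 2
    if distance <= 15:
        return 3
    # No exception applies: use the value for the test's length.
    return next(sem for upper, sem in SEM_BY_LENGTH if n_items <= upper)


def confidence_interval_95(raw_score: int, n_items: int) -> Tuple[int, int]:
    """Raw score plus or minus twice the SEM, clipped to 0..n_items (an assumption)."""
    margin = 2 * estimated_sem(raw_score, n_items)
    return max(0, raw_score - margin), min(n_items, raw_score + margin)


def as_percent(raw_score: int, n_items: int) -> int:
    """Raw score as a whole-number percentage correct."""
    return round(100 * raw_score / n_items)


low, high = confidence_interval_95(7, 11)
print(low, high)                                  # 3 11
print(as_percent(low, 11), as_percent(high, 11))  # 27 100
print(confidence_interval_95(64, 100))            # (54, 74)
```

The two printed intervals reproduce the 11-item and 100-item examples worked through in Step 8.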
Reference
Eaves, R.C. (1979).
Some simple statistics for classroom use. Diagnostique, 4, 3-12.