|







|
RAND conducted analysis of assessments from five of the six Bridge Project
statesCalifornia, Georgia, Maryland, Oregon, and Texas. Illinois,
which was part of the Bridge Project study, was not used because assessments
were in a state of flux at the time of the RAND analysis. The Bridge Project
case study sites were chosen for a variety of state-specific reasons (e.g.
the use of affirmative action in higher education admission decisions
in California and Texas; and specific K-16 reforms in Georgia, Maryland
and Oregon).
The following are tables excerpted from the RAND report, "Alignment
Among Secondary and Post-Secondary Assessments in Five Case Study States,"
followed by more detail about the methodology and coding categories.
Additional Information about the Methodology:
Assessments Examined
Purposes of Assessments Examined in this Study
National College Admissions Tests Used in Each
Case Study Site
Coding Categories
Evaluating the Extent of Alignment Among Tests
Limitations
Assessments Examined
Because of the magnitude of tests available, it was necessary to limit
the number of tests examined in this study. The analysis were restricted
to math and English/Language Arts (ELA) measures administered to high-school
and incoming first-year college students; most remediation decisions at
the postsecondary level are based on achievement deficiencies in those
areas. The assessments were limited to those used by selected institutions
in the Bridge Projects five case study sites. Because the kinds
of tests administered may vary by college, it is important to sample exams
from a range of institutions. Namely, the minimum skill level required
of students entering a highly selective institution may differ from the
level required of students entering a less selective college. As a result,
the content of remedial college placement tests used to assess entry-level
skills can vary by institution. For each of our sites, RAND examined assessments
administered by colleges that represented a range from less selective
to highly selective. However, the chosen institutions are not a scientific
sample.
Purposes of Assessments Examined in this Study
Bridge Project researchers asked RAND to examine four types of assessments
in this study: state achievement, college admissions, college placement,
and end-of-course exams. The tests and their goals are:
- State achievement tests provide a broad survey of student proficiency
toward state standards in a particular subject like math or reading.
- College admissions exams sort applicants more qualified for college-level
work from those less qualified.
- College placement tests determine which course is most appropriate
for students. Some college placement tests are used to identify students
who may require additional remediation, whereas other college placement
exams are used more broadly to determine which course students are eligible
to enroll in, given their prior academic preparation.
- End-of-course tests are measures of knowledge of one particular course.
Because of the magnitude of tests available, it was necessary to limit
the number of tests examined in this study. RAND focused its analysis
on math and English/Language Arts (ELA) measures administered to high-school
and incoming first-year college students. Bridge researchers chose math
and ELA because most remediation decisions at the postsecondary level
are based on achievement deficiencies in these areas. Because the kinds
of tests administered may vary by college, it is important to sample exams
from a range of institutions. For each of our sites, researchers examined
assessments administered by colleges that represented a range from less
selective to highly selective. However, the chosen institutions are not
a scientific sample.
National College Admissions Tests Used in Each
Case Study Site
The first set of tests we examined, which includes the SAT I, SAT II,
ACT, and AP exams, are used in five of Bridges case study sites,
as well nationally, to aid in college admissions decisions. For those
students applying to a four-year institution, many are required to take
either the SAT I or ACT, and, at more selective schools, several SAT II
exams as part of the admissions process. While AP tests are not a requirement,
admissions officers are likely to view students with AP experience as
better-prepared and more competitive applicants. Also, students with scores
of four or five on AP tests are often rewarded with college credit in
the subject area.
The SAT I, a three-hour mostly multiple-choice exam, is intended to help
admissions officers distinguish applicants more qualified for college-level
work from those less qualified. It is not designed to measure knowledge
from any specific high school course, but instead measures general mathematical
and verbal reasoning.
The SAT II is a series of one-hour, mostly multiple-choice tests that
assess in-depth knowledge of a particular subject, and is used by admissions
officers as an additional measure with which to evaluate student subject-matter
competence. The SAT II is used primarily at the more selective institutions
and is taken by far fewer students than is the SAT I. For this study,
RAND examined the following SAT II tests: Mathematics Level IC, Mathematics
Level IIC, Literature, and Writing. The SAT II Mathematics Level IC test
assesses math knowledge commonly taught in three years of college preparatory
math courses, whereas the SAT II Mathematics Level IIC test assesses math
knowledge in more than three years of college preparatory math courses.
The SAT II Literature test assesses students proficiency in understanding
and interpreting reading passages, and the SAT II Writing test assesses
students knowledge of standard written English.
The ACT is an approximately three-hour exam consisting entirely of multiple-choice
items. Developed to be an alternative measure to the SAT I in evaluating
applicants chances of success in college, it does not emphasize
general reasoning (as does the SAT I) but is instead a curriculum-based
exam that assesses achievement in science, reading, language arts, and
math (Wightman & Jaeger, 1988). RAND included only the reading, language
arts, and math sections for this study.
AP tests are used to measure college-level achievement in several subjects,
and to award academic credit to students who demonstrate college-level
proficiency. RAND examined the AP Language and Composition exam for this
study.
Analyses are not included here. Rather, this document provides data from
RANDs analysis for use by policymakers and researchers interested
in relationships between assessments.
RAND excluded the two AP exams in calculus (i.e., Calculus AB and Calculus
BC) because they are markedly different from the other studied math tests.
For example, they do not include material from any other mathematical
content area except calculus, and are the only measures that require a
graphing calculator.
Coding Categories
Two raters examined alignment among the different types of assessments
using several coding categories, which are described below.
The coding categories for both math and ELA describe the technical features,
cognitive demands, and content of each assessment. The technical features
category involves characteristics such as time limit and format. The cognitive
demands category captures the kinds of cognitive processes elicited. For
reasons that will be explained in another section, the content category
is slightly different for math and ELA. In math, the content category
captures what is being assessed (i.e., particular content area such as
elementary algebra or geometry). In ELA, the content category describes
the reading passage.
The cognitive demands category and the content category in math are the
focus of this study because discrepancies in these areas can potentially
send mixed messages to students about the kinds of skills they should
learn in order to be prepared for college-level courses. Although variations
in technical features and in the ELA content category are believed to
have less direct impact on the kinds of signals students receive, it is
nevertheless important to document discrepancies in these areas. Technical
features, such as test format, can facilitate or limit the kinds of skills
that are measured (Bennet & Ward, 1993). RAND described differences
in the ELA content category to characterize fully the range of test content
possible.
RAND created the above coding categories by exploring different ways
of summarizing test content. RAND examined several sources, including
test frameworks, such as those used to develop the National Assessment
of Educational Progress (NAEP), as well as coding categories used in previous
studies of alignment (Education Trust, 1999; Kenney, Silver, Alacaci,
& Zawojewski, 1998; Webb, 1999). RAND then combined and modified the
different sources to produce coding categories that addressed the range
of topics and formats appearing on the tests included in this study.
Math Coding Categories
The first aspect of the math coding categories concerns technical features.
Technical features describe test length, time limit, format, and characteristics
that can be described by inspection of test instructions or items. In
math, items could be classified as one of four formats: multiple-choice,
quantitative comparison, grid-in, or open-ended. Multiple-choice items
require students to select their answer from a list of possible options.
Quantitative comparison items require students to determine the relative
sizes of two quantities. Because quantitative comparison items ask students
to select their answer from four possible options, they are considered
a subset of the multiple-choice format. However, RAND distinguished
between the two types of formats because response options across multiple-choice
items vary from one question to the next, whereas response options across
quantitative comparison items remain the same. Grid-in items require
students to produce their own answer and mark their answer in a corresponding
grid. Open-ended items also require students to produce their own answer,
but differ from grid-in items in that the former item type can take
on negative values. Additionally, many of the open-ended items in our
study require students to explain their reasoning.
Technical features also include characteristics that can be described
by examining the test instructions or items. These include characteristics
such as provisions for the use of tools such as calculators or rulers,
the use of diagrams or other graphics, the use of formulas, whether
formulas were provided or had to be memorized, and whether each item
was contextualized (i.e., a word problem that made reference to a real-life
situation). The use of formulas was sometimes difficult to determine
because problems can be solved in multiple ways, and in some cases an
item could be solved either with or without a formula. Items were coded
as requiring a formula only if it was determined that the formula was
necessary for solving the problem.
For the content category, RAND listed several math content areas, ranging
from basic through advanced math. The content areas included prealgebra,
elementary algebra, intermediate algebra, planar geometry, coordinate
geometry, statistics and probability, trigonometry, and miscellaneous.
Almost all of the tests RAND examined had specifications that included
many or all of these content areas. RAND listed subareas as means of
making the distinctions among the main content areas clearer, but raters
coded using only the main content areas.
To evaluate the cognitive demands of a test, RAND needed a coding scheme
that captured different levels of cognitive processes, from routine
procedures to complex problem solving. This led researchers to adopt
the same criteria as those used for NAEP, namely conceptual understanding,
procedural knowledge, and problem solving. The descriptions of each
are described in the table below.
| Table 2.1 Descriptions of the Math Cognitive Demands
Category |
| Dimension |
Definition |
| Conceptual understanding |
Reflects a students ability to reason in settings involving
the careful application of concept definitions, relations, or representations
of either |
| Procedural knowledge |
Includes the various numerical algorithms in mathematics that
have been created as tools to meet specific needs efficiently |
| Problem solving |
Requires students to connect all of their mathematical knowledge
of concepts, procedures, reasoning, and communication/representational
skills in confronting new situations |
| Source: Mathematics Framework for the 1996 and 2000
National Assessment of Educational Progress (NAGB, 2000). |
As is typical with studies like these (e.g., Kenney, Silver, Alacaci,
& Zawojewski, 1998), the raters found the cognitive demands category
to be the most difficult to code, partly because items can often be
solved in multiple ways, sometimes as a function of the examinee's proficiency.
What might be a problem-solving item for one examinee might require
another to apply extensive procedural knowledge. For instance, consider
an item asking students for the sum of the first 101 numbers starting
with zero. A procedural knowledge approach might involve a computation-intensive
method, such as entering all the numbers into a calculator to obtain
the resulting sum. However, a problem-solving approach would entail
a recognition that all the numbers, except the number 50, can be paired
with another number to form a sum of 100 (100+0, 99+1, 98+2, etc.).
The total sum is then computed by multiplying the number of pairs (i.e.,
50) by 100 and adding 50.
For items that can be solved in multiple ways, raters coded for the
cognitive process that was most likely to be elicited. Judgments about
the cognitive processes most likely to be evoked were based on raters
prior experience in which they observed high school students as the
students solved math problems.
ELA Coding Categories
Coding categories in ELA cover three types of skills: reading, editing,
and writing. Reading skills relate to students vocabulary and
comprehension of reading passages. Editing skills relate to students
ability to recognize sentences that violate standard written English
conventions. Writing skills pertain to how well students can produce
a composition that clearly and logically expresses their ideas. Many
of the tests include two or all three of these skills, but some assessments
focus on a single type, namely writing. Because many of the tests measuring
reading or editing skills include reading passages followed by sets
of items, it was necessary to categorize both the reading passage and
the individual items.
As with math, the ELA coding categories summarize the technical features,
content, and cognitive demands of each assessment. The technical features
category in ELA describes time limit, test length, and format. There
are only two possible formats for ELA assessments, multiple-choice or
open-ended. In ELA, open-ended items require students to produce a writing
sample.
The content category is different from that in math because content
areas are not as clearly delineated in English. Whereas a math item
may be classified into specific content areas such as elementary algebra,
geometry, or so forth, there are not analogous English content areas
in which to classify reading passages. Instead, the content category
in English describes the subject matter of the reading passage (topic),
the authors writing style (voice), and the type of reading passage
(genre). In other words, there are three dimensions to the ELA content
category: topic, voice, and genre. Topic consists of five levelsfiction,
humanities, natural science, social science, and personal accounts.
Voice consists of four levelsnarrative, descriptive, persuasive,
and informative. Genre also contains four levelsletter, essay,
poem, and story.
For exams measuring reading or editing skills, raters used all three
dimensions of the content category to describe the reading passages.
However, all three dimensions are not relevant to writing tests, as
such tests do not include reading passages. Instead, writing tests contain
a short prompt that introduces a topic that serves as the focus of students
compositions. Because students are not instructed to use a particular
voice or genre for their compositions, only the topic dimension is needed
to describe the writing prompts. Table 2.2 provides more details of
the content category used for ELA assessments.
Table 2.2
Descriptions of the ELA Content Category |
| Dimension |
Description or Example |
Used for Reading Skills |
Used for Editing Skills |
Used for Writing Skills |
| Topic |
|
yes |
yes |
yes |
Fiction
|
Writing based on imaginary events or people |
|
|
|
Humanities
|
e.g., artwork of Vincent Van Gogh |
|
|
|
Natural sciences
|
e.g., the reproductive process of fish |
|
|
|
Social sciences
|
e.g., one man, one vote; cost effectiveness of heart transplants
|
|
|
|
Personal accounts
|
e.g., diary account of death of a parent |
|
|
|
| Voice |
|
yes |
yes |
no |
Narrative
|
Stories, personal accounts, personal anecdotes |
|
|
|
Descriptive
|
Describes person, place, or thing |
|
|
|
Persuasive
|
Attempt to influence others to take some action or
to influence someones attitudes or ideas |
|
|
|
Informative
|
Share knowledge; convey messages, provide information
on a topic, instructions for performing a task |
|
|
|
| Genre |
|
yes |
yes |
no |
Letters
|
|
|
|
|
Essays
|
|
|
|
|
Poems
|
|
|
|
|
Stories
|
|
|
|
|
For measures of reading and editing skills, the cognitive demands category
is intended to capture different levels of cognitive processes, ranging
from low to high levels. Using similar coding categories as those in
previous alignment studies (Education Trust, 1999), RAND distinguished
among three levels of cognitive processes: recall, evaluate style, and
inference.
Table 2.3 provides descriptions of each of these levels.
In reading, questions that could be answered via direct reference to
the passage are recall items, whereas questions that require examinees
to interpret the material are inference items. Questions that pertain
to the development of ideas or improve upon the presentation of the
reading passages are coded as evaluating style. For editing measures,
items that entail application of grammatical rules are recall items.
Typically, most of these questions concern mechanics or usage errors.
Inference items are those that require examinees to identify cause-and-effect
relationships, and evaluating style items relate to rhetoric skills,
such as sentence organization or clarity.
Table 2.3
Descriptions of the ELA Cognitive Demands Category for Tests
Measuring Reading or Editing Skills |
| Cognitive Process |
Description or Example |
| Recall |
Answer can be found directly in the text, or by using the definitions
of words or literary devices, or by applying grammatical rules |
| Infer |
Interpret what is already written |
| Evaluate style |
Improve the way the material is written |
The above cognitive demands category is not applicable to writing measures
because students do not respond to items, but instead write their own
compositions. For writing tests, the cognitive demands category focuses
on the scoring criteria, particularly the emphasis given to mechanics,
word choice, organization, style, and insight.
Table 2.4
Description of the ELA Cognitive Demands Category for Tests
Measuring Writing Skills |
Scoring Criteria |
Description or Example |
| Mechanics |
Grammar, punctuation, capitalization |
| Word choice |
Use of language, vocabulary, sentence structure |
| Organization |
Logical presentation, development of ideas, use of appropriate supporting
examples |
| Style |
Voice, attention to audience |
| Insight |
Analytic proficiency, accurate understanding of stimulus passage,
thoughtful perceptions about its ramifications |
Rater Agreement in Applying the Coding Categories
Two raters, who had expertise in the relevant subject area, judged alignment
by applying the coding categories to each item. One rater coded the
math assessments, and the other rater coded the English assessments.
An additional rater coded eight tests, 4 each in math and English. All
discrepancies were resolved through discussion. Consensus was fairly
high, as kappa statistics ranged from .80 to 1.0 (i.e., perfect agreement)
for all but two categories. (For specific percent agreement for each
coding category, see Appendix B). One of the exceptions was the content
category in math, where items often assessed skills in more than one
area. Agreement in this category was .76. The final exception was the
cognitive demands category in math, where kappa values ranged from .42
to .63, with an average of .56.
Evaluating the Extent of Alignment Among Tests
In interpreting the results, an important issue concerns the standard
with which to evaluate the extent of alignment. Specifically, how large
should the discrepancies be before we consider two tests to be poorly
aligned? To guide its decisions, RAND analyzed data from an alignment
study conducted by Education Trust (1999). They found the average discrepancy
across coding categories to be approximately 24 percent, with a standard
deviation of 19 percent. That is, differences between any two tests were
typically within the range of 5 percent-43 percent. Using these results
as a guideline, RAND decided that differences of 25 percent or less are
considered small, (i.e., the tests are well-aligned), differences
between 25 percent and 50 percent are moderate, (i.e., the
tests are moderately aligned), and differences greater than 50 percent
are large (i.e., the tests are not well-aligned). Thus, in
order to be a misalignment, discrepancies between tests must be greater
than 50 percent and cannot be attributed to differences in intended test
use.
As there is currently no consensus among researchers regarding the standards
for judging the extent of alignment among assessments, the above criteria
should not be interpreted as a definitive standard. Additionally, because
there has been no research to date that has established how large discrepancies
among tests must be before they send students received mixed signals regarding
the skills needed to be prepared for college-level work, differences that
are considered small can nevertheless have important implications.
Thus, categorizations of discrepancies as small, moderate,
or large should be interpreted cautiously, and the studys
focus should be viewed as mainly descriptive and comparative.
Limitation
Although the use of expert judgments is a fairly common approach to studying
alignment (e.g., Kenney, Silver, Alacaci, Zawojewski, 1998; Webb, 1999),
this study does not provide a complete picture of these assessments. RAND
has not, for example, systematically examined differences in content standards
or test specifications, which may account for some of the discrepancies
among exams. Furthermore, an analysis of scores might reveal that seemingly
different instruments rank order or classify students roughly equivalently.
Similarly, observations and interviews with students as they take the
tests, an approach that is sometimes used during the test development
process, could result in somewhat different interpretations of a tests
reasoning requirements.
Finally, increasing the number of forms studied for each assessment may
enhance the generalizability of the findings. Although researchers attempted
to examine all available forms, this was not always possible. Namely,
college admissions measures and some commercially-available college placement
exams have multiple, parallel versions. Despite the fact that parallel
forms are intended to have similar content and structure, tests represent
a sample of skills from a single testing occasion, so forms from other
occasions will vary to some extent. This is especially true when RAND
analyze alignment among English Language Arts (ELA) topics, where any
given test form provides a limited sample (e.g., there may be only one
reading passage).
|
|