Multiple Choice Test
Educational Psychology Program
University of New Mexico
Why Use a Multiple Choice Test?
Multiple choice testing is an efficient and effective way to assess a wide range of knowledge, skills, attitudes and abilities (Haladyna, 1999). When done well, it allows broad and even deep coverage of content in a relatively efficient way. Though often maligned, and though it is true that no single format should be used exclusively for assessment (American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education, 1999), multiple choice testing remains one of the most commonly used assessment formats (Haladyna, 1999; McDougall, 1997).
What is a Multiple Choice Test?
The multiple-choice test is a very flexible assessment format that can be used to measure knowledge, skills, abilities, values, thinking skills, etc. Such a test usually consists of a number of items that pose a question to which students must select an answer from among a number of choices. Items can also be statements to which students must find the best completion. Multiple-choice items, therefore, are fundamentally recognition tasks, where students must identify the correct response.
WHAT IS INVOLVED?
|Instructor Preparation Time:
||Medium to high if you are writing your own items the first time; low if you have validated items.
|Preparing Your Students:
||Little or none. Especially in introductory classes, it might be wise not to assume that students know strategies for taking multiple-choice tests. Some time spent on test taking strategies may be useful.
|Class Time:
||Depends on the length of the test.
|Class Size:
||Any. Especially efficient in large classes.
|Special Classroom/Technical Requirements:
||None. An optical scanner and scan sheets may be useful with large classes.
|Individual or Group Involvement:
||Usually individual; team testing is possible.
|Analyzing Results:
||For grading purposes, analysis is usually quick and straightforward. For developing and refining items or for diagnostic purposes, analysis can be a little more complex.
|Other Things to Consider:
||Logistical concerns such as your students using scanning sheets correctly.
A multiple choice test is constructed of multiple choice items, which consist of two, sometimes three, parts as shown below.
- What is the radius of the circle you'd draw with a compass set at 1 inch?
- A. 1/2 inch
- B. About 3.14 inches
- C. 2 inches
- D. 1 inch
The Stem of a multiple choice item is the part to which the student is to respond. In everyday language, we'd call this the question, but since it could be a statement or even an analogy or an equation, we'll use the technical term, Stem.
The Options are the choices from which examinees are to choose. There are two kinds of options: the Key is the correct or best choice; the Distracters are incorrect or less appropriate choices.
Sometimes Stimulus Materials can also be used with a multiple choice item, something like a bar graph, a table, a map, a short text, etc. There will be more about stimulus materials in a moment.
Multiple-choice tests can be used to meet a variety of educational purposes, though they most often are used in the classroom to measure academic achievement, and to determine course grades. Other purposes to which they can be put include feedback to students, instructional feedback, diagnosis of student misconceptions, and others. (See the Suggestions for Use section.)
A Few Multiple Choice Myths
- Multiple-choice tests are objective: Multiple-choice items are sometimes called "objective" items, an unfortunate label. They can be just as subjective as an essay question if they're poorly written. (Indeed, a well-crafted essay prompt and scoring rubric can be much more objective than some multiple choice items.) Subjectivity/objectivity does not reside in the format but in the construction and scoring, thus objectivity must be planned into multiple-choice questions (Dwyer, 1993).
- Multiple-choice tests assess only superficial knowledge: It is perhaps because faculty test as they were tested, not following state-of-the-art rules for testing, that multiple choice has the reputation it does. Research has long shown that college-level faculty do not write exams well (Guthrie, 1992; Lederhouse & Lower, 1974; McDougall, 1997), and that both faculty and students notice the side effects, like focusing on memorization and facts (Crooks, 1988; Shifflett, Phibbs, & Sage, 1997).
- Multiple choice tests are for grading only: This myth arises from the misapprehension that assessment and instruction are separate stages of the learning process. Indeed, there can be no instruction without sound assessment, and, more importantly, both are critical for learning to occur. As Crooks (1988) succinctly put it: "Too much emphasis has been placed on the grading function of evaluation, and too little on its role in assisting students to learn" (p. 468). There are a number of ways in which multiple choice items can be designed and used to promote and refine learning, to inform instruction, as well as to assign grades. Some of these will be highlighted throughout this CAT.
Student Learning Outcomes:
- Demonstrate recognition and recall of knowledge, skills and abilities
- Demonstrate analysis, synthesis, and evaluation
- Demonstrate critical thinking
Instructor Teaching Outcomes:
- Assess higher-order and lower-order thinking skills
- Assess content broadly and deeply
- Assess quickly and efficiently
- Identify student misconceptions
In content assessment, you should aim at finding out which students have mastered the material and which have not, and to what extent they have or have not. The test score should reflect that and only that. The "rules" of test development, then, work in fundamentally two ways: to maximize the potential for the student to show you what they know about the content, and to minimize the possibility that other factors will influence the score. Such factors range from test forms that are hard to read and navigate, to fatigue, to reading ability or computer literacy. In writing items, that means eliminating "guessing" and "trick" questions. In test layout, it means avoiding student confusion about how to complete the test.
Planning the Test: The critical issue at this stage is matching the test to your teaching so that what was taught gets tested and what is tested gets taught. This is fundamentally an issue of fairness as well as matching assessments to learning outcomes. A great many of the complaints about testing stem from a mismatch here ("I didn't know that was going to be on the test!", "We didn't talk about that in class!"). Constructing a test blueprint is an excellent planning tool for testing and for teaching.
A test blueprint [sometimes called a Table of Specifications (e.g. Gronlund & Linn, 1990)] is a table that records the content areas for the test on one axis, which may be listed by topic or text chapter or other divisions. The other axis of the table categorizes the ways you expect your students to know that content. These can be traditional taxonomies like Bloom's Taxonomy of Educational Objectives or your own categories. Finally, each cell in the table records the relative weight you'll give to the intersection of each content and way of knowing. These can be expressed as proportions or as numbers of items. Here's an example of a test blueprint:
[Table: Example Test Blueprint for an Introductory Physics Course. Column headings: "Facts, Terminology, Concepts" and "Application and Integration".]
Once you have the Test Blueprint constructed, you can use it to plan your teaching as well as to write test items. It's useful in planning teaching because it records what content you feel is most important and how you expect your students to know that content. It also informs decisions relevant to your instructional planning. And, since the goal is a match between teaching, testing and outcomes, the same emphases used in teaching should be used in testing. The Test Blueprint thus serves as a powerful way to align teaching and assessment (Nitko, 2001).
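As a rough sketch of how blueprint weights can be turned into numbers of items per cell, the Python below spreads a fixed test length across the cells in proportion to their weights. The content areas ("Kinematics", "Work and Energy"), the weights, and the function name `allocate_items` are hypothetical illustrations; only the two "way of knowing" column labels come from the example blueprint. Weights are assumed to be whole numbers (for example, percentage points).

```python
def allocate_items(weights, total_items):
    """Convert whole-number blueprint weights into item counts that sum to total_items.

    Uses largest-remainder rounding: every cell first gets the floor of its
    exact share, then any leftover items go to the cells with the largest
    remainders.
    """
    total_weight = sum(weights.values())
    counts, remainders = {}, {}
    for cell, weight in weights.items():
        counts[cell], remainders[cell] = divmod(total_items * weight, total_weight)
    leftover = total_items - sum(counts.values())
    neediest = sorted(remainders, key=remainders.get, reverse=True)
    for cell in neediest[:leftover]:
        counts[cell] += 1
    return counts


# Hypothetical blueprint: (content area, way of knowing) -> weight in percentage points.
blueprint = {
    ("Kinematics", "Facts, Terminology, Concepts"): 15,
    ("Kinematics", "Application and Integration"): 20,
    ("Work and Energy", "Facts, Terminology, Concepts"): 25,
    ("Work and Energy", "Application and Integration"): 40,
}

item_counts = allocate_items(blueprint, total_items=30)
for cell, count in item_counts.items():
    print(cell, count)
```

Whatever rounding rule you prefer, the point is that the counts always add up to the planned test length while staying as close as possible to your stated emphases.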
Writing the Items: Perhaps the first issue here is whether you will actually write your own items or use items from test or item banks that accompany textbooks. Using publishers' test banks comes with a number of risks. First, your test blueprint and the publishers' may look different, so you'll need to ensure that the test still retains your emphases. Second, it's doubtful that published item banks consist of tried and true, high quality items (e.g. Hansen & Dexter, 1997; Sims, 1997). Neither authors nor publishers usually receive compensation for providing item banks, and therefore little if any item development work is likely conducted. Therefore, you should use published items sparingly or, at least, use them very carefully with as much scrutiny as you would your own items.
These tips and "rules" for item writing are synthesized from research-based sources and/or consensus-of-the-field sources (e.g. Haladyna & Dowling, 1989). For an extensive treatment of this topic with many examples and checklists, see Nitko's chapter 8 (2001).
- Determine how many total items you want. Considerations here include how much material the test will cover, how deep the coverage needs to be, how complex the items are, and how long students will have to take the test. [The rule of thumb is one minute per question or possibly more if the items are complicated (Gronlund, 1988; Oosterhof, 2001)].
- Use the Test Blueprint to determine how many items you need to write in each cell.
- Avoid being overly general or overly specific with the content for each item. This is somewhat dependent on your learning objectives, but you don't want to be asking about broad, sweeping issues, nor do you want to be asking about minutiae.
- Make sure each item tests one and only one concept. If the item is "double-barreled", that is, if it tests two or more concepts, you won't know which of the concepts the student truly understands if she gets the item correct. (See Pros and Cons for more on this issue.)
- Decide what item variations you're going to use (see below).
Rules for Writing Item Stems
- Either write the stem in the form of a question or, if a statement is in the stem with its completion in the options, put the "blank" at the end of the stem, not in the middle.
- Poor: When looking at liquid in a test tube, the ______ is the name of the curved surface of the liquid.
- Better: The curved surface of liquid in a test tube is called a ______.
- (Answer: meniscus)
- Put the "main idea" of the item in the stem, not in the options.
- Streamline the stem to avoid extraneous language, but try to put as much of the text in the stem leaving the options shorter.
- Avoid negatives like "except" or "not"; if you do use them, highlight them in italics, boldface or underline.
- Highlight important words like "not", "only", "except" etc. if you use them at all.
Rules for Writing Options
- General Option Writing Tips
- Practically speaking, there's no "magic number" of options you should use (Nitko, 2001). Write the number of options that make sense. It is acceptable to mix and match the number of options. It's better to have a three-option item than a four-option item with a poor distracter (Nitko, 2001).
- Options should be relatively equal in length. A much longer or much shorter option can attract responses because it stands out visually.
- Make sure all options are grammatically congruent with the stem. For example, the article "an" at the end of the stem will give away a key that starts with a vowel.
- Put repeating words or phrases in the stem rather than in each option.
- Avoid overlapping options.
- Poor: Water will be a liquid between ______ and ______ degrees centigrade.
- a) 0; 100
- b) -50; 0
- c) 100; 150
- (Note that a and b both include 0 and a and c both include 100 -- they overlap.)
- Better: Water will be a liquid between ______ and ______ degrees centigrade.
- a) 1; 99
- b) -50; 0
- c) 100; 150
- (Note that there is now no overlap.)
- Avoid "all of the above." Its use muddles the interpretation of a student's response (Nitko, 2001).
- Key-writing Tips
- Make sure the key is, in fact, correct according to the text and/or consensus in the field (which, we hope, is also what's been taught in class!).
- Distracter Writing Tips
- They're called distracters because they are strategically designed to attract examinees who haven't completely mastered the content. This isn't "tricky" or "deceptive" or "unfair". It stems from the premise that the goal of testing is to find out who has learned the content and who has not, perhaps along a continuum between the two. Students who mastered the material should recognize the key and those who haven't should not.
- All distracters need to be plausible. We've all seen "Mickey Mouse" listed as an option on tests, but that's seldom a believable possibility, and, in terms of finding out who has mastered the material, a waste of ink.
- Use your knowledge of common student misunderstandings in writing distracters. If you know, for example, that students often miss a step in a calculation, include a distracter that would result from that miscalculation. There are also many studies that have documented common misconceptions in science concepts. Building these into multiple-choice distracters is a possibility (Sadler, 1998).
- Avoid "linking" items where the answer to one item is found in or dependent on another item. This is something to check for near final test assembly when you are proofreading. It's very easy to do inadvertently when you're writing items across several sessions.
Constructing the Test: Here are a few tips for assembling the test.
- Provide very clear, printed instructions and directions. If you're using a variety of Multiple Choice formats, you may wish to write separate directions for each section.
- Order the items by content, by item format, and then by increasing difficulty of items (Gronlund, 1988). This rule is based on information processing principles. It is easier mentally for students to answer all of the items about one content before moving to another. They also perform similar mental tasks on similar items before changing mental tasks with other formats. Finally, putting easy items before hard items helps students gain some success early on.
- Randomize the ordering of keyed responses to overcome many of the rules of thumb for guessing, like "always choose B or C". This also eliminates students looking for or using patterns like AABBCCDD, ABADABA, etc.
- Make sure students don't get "lost" in the test because then their score doesn't reflect what they know, but how well they navigated the test.
- Don't crowd items on the page.
- Avoid using double-sided pages because students may miss the back page.
- Use navigational cues throughout like, "continue on the next page" or "page 4 of 12".
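The advice above about randomizing the position of keyed responses can be sketched in a few lines of Python. This is only an illustration: the function name `shuffle_options` is hypothetical, the item is the compass example from earlier in this document, and the fixed seed is there only to make the example repeatable.

```python
import random


def shuffle_options(key, distracters, rng):
    """Return (options, key_letter) with the key placed at a random position."""
    options = [key] + list(distracters)
    rng.shuffle(options)
    key_letter = "ABCDEFGH"[options.index(key)]
    return options, key_letter


rng = random.Random(2024)  # fixed seed is an assumption, used only for repeatability
options, key_letter = shuffle_options(
    "1 inch", ["1/2 inch", "About 3.14 inches", "2 inches"], rng
)

for letter, text in zip("ABCD", options):
    print(f"{letter}. {text}")
print("Key:", key_letter)
```

Because the key letter is recorded at the moment of shuffling, the answer key stays correct no matter where the key lands. If order carries meaning (for example, numeric options listed from smallest to largest), you may prefer to vary which item gets which key position rather than shuffle within an item.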
Scoring the Test: You may use a full-credit model, where the student gets the item correct or incorrect. You may also use a partial credit model, where the key receives full credit and some distracters receive partial credit. More information on scoring appears in the "Analysis" section.
The range of possible item variations is actually one of the great strengths of multiple-choice items. Several good variations make multiple choice a very flexible item format.
- Correct Answer: The key is clearly right and the distracters are clearly wrong. For example, computation items often fit this category.
The formula for calculating Work is:
- a) ½*m*v²
- b) 1N*m
- c) m*g*h
- d) F*D*cos θ
(Answer is d)
- Best Answer: Here, the key is the best choice (most complete, most comprehensive) while distracters, though "correct", may be incomplete.
- What are the key elements that define Work in physics?
- a) force and cause
- b) force and displacement
- c) force, cause and displacement
(Answer is c)
- Interpretive Exercises: These rely on a stimulus material and consist of a multiple-choice question or series of questions about the stimulus. Stimulus material can consist of many things: text, a graph, a picture, a table, etc. This format is particularly useful in tapping into several different ways of knowing quickly. Based on the same stimulus material, you can ask a knowledge question, an application question and a synthesis question, for example.
- Problem A: You are asked to design a hoist system to move sacks of grain from the entrance to a grain warehouse to a loft five meters above the entrance. Each grain sack weighs 60 kg, and the warehouse operator wants to move 20 bags per minute.
NOTE: Refer to Problem A when answering questions 1, 2 & 3.
1. What formula would you use to compute the amount of work required to move one bag of grain from the ground floor to the loft?
d) F*D*cos θ
(Answer is a)
2. How much work would be required to move the twenty bags of grain to the loft?
a) 60000 Joules
b) 3000 Joules
c) 12000 Joules
d) 1000 Joules
(Answer is a)
3. If the time allowed were shortened to 30 seconds, how would power be affected?
a) The power would double.
b) The power would be cut in half.
c) The power would decrease, but by an indeterminate amount given the present information.
d) The power would increase, but by an indeterminate amount given the present information.
(Answer is a)
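When you write computation items like these, it is worth checking the keys by working the numbers yourself. The sketch below does that for Problem A, assuming the sacks are lifted straight up against gravity and taking g as 9.8 m/s² (an assumption; the problem does not state a value).

```python
G = 9.8  # m/s^2, assumed value for gravitational acceleration

mass_kg = 60      # mass of one grain sack (from Problem A)
height_m = 5      # height of the loft above the entrance (from Problem A)
bags = 20         # bags to be moved (from Problem A)

work_per_bag = mass_kg * G * height_m   # work to lift one bag, in joules
total_work = bags * work_per_bag        # work to lift all twenty bags

power_60s = total_work / 60             # watts if the job takes one minute
power_30s = total_work / 30             # watts if the job takes 30 seconds

print(round(work_per_bag), "J per bag")
print(round(total_work), "J for twenty bags")
print(round(power_30s / power_60s, 2), "times the power when the time is halved")
```

Under these assumptions the work for twenty bags comes to roughly 58,800 joules, which the item rounds to 60000 Joules, and halving the time doubles the power, in line with the keys given for the second and third questions.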
There are essentially two kinds of analysis possible with multiple choice questions. The first kind of analysis, scoring models, deals with scoring the test to determine student achievement. The second kind of analysis, item analysis, deals with analyzing how well each test item functioned.
Scoring Models: With multiple-choice questions, there are several potential ways to score students' responses. Here are two:
- One is the full credit model where the response is either correct or incorrect. The student's total score is then the sum of correct responses.
- A second is a partial credit model where some responses receive full credit and others fractional credit. This works best with "best answer" questions or ones where students may choose more than one answer. You can reward students who pick a correct but not best answer, for example.
A final issue in scoring is the weighting of items. The easiest approach is to let each item equal one point so all are equally weighted. However, based on the Test Blueprint or other considerations, you may decide to prioritize some items over others and weight them accordingly in computing the final score for the test.
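The scoring models and item weighting described above can be sketched with one small Python function. The function name `score_test`, the three-item answer key, the half-credit value, and the weights are all hypothetical; they simply illustrate full credit, partial credit, and weighted scoring side by side.

```python
def score_test(responses, credit, weights=None):
    """Score one student's responses.

    responses: the option chosen for each item, e.g. ["d", "b", "a"]
    credit:    for each item, a dict mapping options to the credit they earn
               (options not listed earn nothing)
    weights:   optional points per item; defaults to one point each
    """
    if weights is None:
        weights = [1] * len(credit)
    return sum(
        weight * item_credit.get(choice, 0)
        for choice, item_credit, weight in zip(responses, credit, weights)
    )


# Full credit model: only the key earns credit.
full_credit = [{"d": 1}, {"c": 1}, {"a": 1}]
# Partial credit model: on item 2, a correct-but-not-best option earns half credit.
partial_credit = [{"d": 1}, {"c": 1, "b": 0.5}, {"a": 1}]

responses = ["d", "b", "a"]
full_score = score_test(responses, full_credit)
partial_score = score_test(responses, partial_credit)
weighted_score = score_test(responses, partial_credit, weights=[1, 2, 1])

print(full_score, partial_score, weighted_score)
```

Keeping the credit table separate from the responses makes it easy to rescore a test later, for instance after item analysis suggests giving credit for a second option.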
Item Analysis: Once you have administered the test, it is possible to conduct several analyses to ensure the quality of the questions for both this and future administrations of the test. Many optical scanning systems common in university testing offices have many of these options available for you automatically.
Item Difficulty: How hard or easy was the item for this group of students? This is typically computed as the proportion of students who got the item correct. So a low value means a hard question and a high value means an easy question on a scale from 0.00 to 1.00. It is best to have a mix of difficulties: some hard ones to challenge top students; some easy ones so low-performing students will persist; and the bulk of items at a moderate difficulty level. The average item difficulty across a test, for most multiple-choice types, should be between 0.6 and 0.7 (Gronlund & Linn, 1990).
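If your scanning system does not report item difficulty, it is easy to compute from a table of scored responses. The sketch below assumes a small, made-up data set in which each row is one student and each column is one item, scored 1 for correct and 0 for incorrect; the function name `item_difficulty` is hypothetical.

```python
def item_difficulty(scored):
    """Return the proportion of students answering each item correctly."""
    n_students = len(scored)
    n_items = len(scored[0])
    return [sum(row[i] for row in scored) / n_students for i in range(n_items)]


# Made-up scored responses: 5 students (rows) by 4 items (columns).
scored = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
]

difficulties = item_difficulty(scored)
average_difficulty = sum(difficulties) / len(difficulties)

for number, p in enumerate(difficulties, start=1):
    print(f"Item {number}: {p:.2f}")
print(f"Average difficulty: {average_difficulty:.2f}")
```

In this made-up example the first item is the easiest (0.80) and the last two are the hardest (0.40 each), giving an average of 0.55, a little below the 0.6 to 0.7 range suggested above.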
Item Discrimination: How well does the item differentiate between students who have mastered the content and students who have not? This is calculated either as a correlation coefficient between the item score and the total score or as the difference between the proportion of high-scoring students who got the item right and the proportion of low-scoring students who got the item right. Either way, it is expressed on a scale from -1.00 to +1.00. Negative 1 means all low scorers got the item right and all high scorers got the item wrong. Given that we want students who have mastered the content getting each item correct, that's bad. A positive 1 means the item worked exactly as it should. A zero means the item doesn't distinguish between mastery and non-mastery students. That's also bad. There are a number of statistical issues at work that cause +1.00 to be a rare occurrence, so that a reasonable expectation for item discrimination indices is between 0.3 and 0.5 (Oosterhof, 2001). Again, a mix of different values in an exam is acceptable.
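Both ways of computing item discrimination can be sketched briefly. The code below assumes made-up 0/1 scored data (rows are students, columns are items), a hypothetical choice of two students each for the high- and low-scoring groups, and an item-total correlation that leaves the item itself in the total score. Testing software may make different choices on each of these points.

```python
from math import sqrt


def discrimination_index(scored, item, group_size):
    """Proportion correct in the top group minus proportion correct in the bottom group."""
    ranked = sorted(scored, key=sum, reverse=True)
    upper, lower = ranked[:group_size], ranked[-group_size:]
    p_upper = sum(row[item] for row in upper) / group_size
    p_lower = sum(row[item] for row in lower) / group_size
    return p_upper - p_lower


def item_total_correlation(scored, item):
    """Pearson correlation between an item score and the total score.

    Undefined (division by zero) if every student has the same item score
    or the same total score.
    """
    xs = [row[item] for row in scored]
    ys = [sum(row) for row in scored]
    n = len(scored)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread = sqrt(sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys))
    return cov / spread


# Made-up scored responses: 6 students (rows) by 4 items (columns).
scored = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

d_first_item = discrimination_index(scored, item=0, group_size=2)
d_second_item = discrimination_index(scored, item=1, group_size=2)
r_second_item = item_total_correlation(scored, item=1)

print(d_first_item, d_second_item, round(r_second_item, 2))
```

With so few students the values are extreme; on a real class the indices will be much less tidy, which is why a range of values across the exam is to be expected.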
Distracter Analysis: This approach looks to see who is choosing each option for an item. Usually the examinees are divided into low-scoring and high-scoring groups, and the proportion of each choosing each option is reported. High-scoring students will usually pick the correct response and low-scoring students will usually pick a distracter. If the opposite happens, that's a cue to revise the item. Something about a distracter is attracting high-performing students (Oosterhof, 2001).
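A distracter analysis for a single item might be tabulated along the following lines. The data, the group size of three, and the function name `distracter_table` are made up for illustration; the key for this hypothetical item is option "a".

```python
from collections import Counter


def distracter_table(choices, totals, options, group_size):
    """Proportion of the high- and low-scoring groups choosing each option.

    choices: the option each student picked on this item
    totals:  each student's total test score, in the same order
    """
    order = sorted(range(len(totals)), key=lambda s: totals[s], reverse=True)
    groups = {"high": order[:group_size], "low": order[-group_size:]}
    table = {}
    for name, members in groups.items():
        counts = Counter(choices[s] for s in members)
        table[name] = {option: counts[option] / group_size for option in options}
    return table


# Made-up data for one item whose key is "a".
choices = ["a", "b", "a", "b", "c", "b"]
totals = [9, 8, 7, 4, 3, 2]

table = distracter_table(choices, totals, options=["a", "b", "c"], group_size=3)
for group, proportions in table.items():
    print(group, proportions)
```

Here two thirds of the high scorers pick the key and none of the low scorers do, which is the pattern you hope for. The one high scorer drawn to option "b" would be worth a second look if the same thing showed up in a larger class.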
One final statistic that is useful is the reliability coefficient. Reliability is the degree of accuracy present in the score. For multiple-choice tests, this is indexed using a reliability coefficient like Cronbach's Alpha or KR-21. Both range from 0.00 to 1.00 with higher values indicating higher reliability. Though there are not strict cutoffs for "acceptable" reliability coefficients because reliability is influenced by many diverse factors, the consensus in the measurement field is that 0.60 or 0.70 would be an acceptable lower value (Oosterhof, 2001). Having most item discrimination values at or near .50, having a variety of item difficulties, and having more items will all increase estimates of reliability (Gronlund, 1988; Nitko, 2001; Oosterhof, 2001). In essence, you are calculating internal correlations of student responses between individual items.
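For readers who want to see what goes into these coefficients, here is a minimal sketch of KR-21 and Cronbach's Alpha using their conventional textbook formulas (the source does not spell them out). It assumes 0/1 scored data with rows as students and columns as items, and uses population variances throughout; software packages may use sample variances instead, which changes the values slightly. The data are made up.

```python
from statistics import mean, pvariance


def kr21(scored):
    """KR-21: needs only the number of items and the mean and variance of total scores."""
    k = len(scored[0])
    totals = [sum(row) for row in scored]
    m, var = mean(totals), pvariance(totals)
    return (k / (k - 1)) * (1 - m * (k - m) / (k * var))


def cronbach_alpha(scored):
    """Cronbach's Alpha: compares the sum of item variances with the total-score variance."""
    k = len(scored[0])
    item_variances = [pvariance([row[i] for row in scored]) for i in range(k)]
    total_variance = pvariance([sum(row) for row in scored])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Made-up scored responses: 6 students (rows) by 4 items (columns).
scored = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

print(f"KR-21: {kr21(scored):.2f}")
print(f"Cronbach's Alpha: {cronbach_alpha(scored):.2f}")
```

KR-21 treats all items as equally difficult, so it usually comes out somewhat lower than Alpha on the same data, as it does here. A four-item, six-student example is far too small for the coefficients to mean much; it only shows the arithmetic.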
These analyses help to identify items that are unfair, unclear, or poor in other ways. This information can be used in a variety of ways.
- In the current administration of the test. You can use this information to discard items in the current administration of the test, another state-of-the-art practice (McDougall, 1997). I find students really respect the scores more when I can explain to them that certain poor items or unfair items were removed based on item analysis. For example, if I find many high-scoring students choosing a certain distracter, that cues me to rethink whether it could be correct. Often I'll give all students who chose that distracter credit also.
- In future administrations of the test. Reusing items is a very "cost-effective" procedure. The overall quality of the test, however, will continue to increase if you target poor items -- ones that are too easy or too difficult, that have zero or negative discrimination, or ones with odd response patterns -- to be reworked or discarded. I find it helpful when going over exams with classes or individual students to have the item analyses handy so that I can listen for reasons why those values came out as they did. Sometimes I'll even probe to get explanations as to why someone chose a distracter.
Suggestions for Use
In addition to the variety of item types, there are also a variety of ways to use multiple-choice questions and administer multiple-choice tests. Here are some ideas:
Multiple Choice Items and Student Learning
Diagnosing misconceptions through distracter analysis: If you've used common student misconceptions or research-based misconceptions when writing distracters, you can look to see how many students chose those distracters. As a simplistic example, we know that in double digit addition, students will forget to "carry" values from the ones column to the tens column. So we could make sure there's a distracter that represents not carrying in several of our items. We can then look to see how many students consistently choose that option. We may then choose to return to that concept and re-teach it, or make other adjustments.
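The "forgetting to carry" example can be made concrete. The small Python function below (the name `add_without_carry` is hypothetical) produces the answer a student would get by adding each column separately and dropping any carry, which is exactly the value you would use as the diagnostic distracter.

```python
def add_without_carry(a, b):
    """Add two non-negative whole numbers column by column, dropping every carry."""
    result, place = 0, 1
    while a or b:
        column_sum = (a % 10 + b % 10) % 10  # keep only the ones digit of each column
        result += column_sum * place
        a, b, place = a // 10, b // 10, place * 10
    return result


key = 47 + 38                              # the correct answer
distracter = add_without_carry(47, 38)     # what a student who forgets to carry would write

print("Key:", key)
print("No-carry distracter:", distracter)
```

For 47 + 38 the key is 85 and the no-carry distracter is 75. Note that a problem such as 25 + 13 needs no carrying, so both methods give 38 and the item cannot diagnose this misconception; choose the numbers with that in mind.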
Item Writing as a Study Strategy: A good study strategy for students is to have them write multiple-choice items. This is especially good if you also want to help students learn strategies for taking multiple-choice tests. This puts them "behind the scenes" and helps them identify material that might be on the test and how it might be asked. It also is a great way for the instructor to gauge their depth of understanding and what issues/topics they are focusing on. This can be very useful, for example, if you notice that the things they're focusing on aren't what you would focus on.
Share the Test Blueprint: I usually give my students my test blueprint as a study guide. After all, it tells me what's important, what I want students to know, and how I want them to know it. I use it to teach and to assess. Why can't they also use that information (McDougall, 1997)? This is yet another way to share goals and expectations with students.
Mastery Learning Revisions: One approach to emphasize learning over "testing" after the test is to offer students the opportunity to write about items they missed (Jacobsen, 1993; Murray, 1990). For each item, students argue in writing why the key isn't the right answer or why a distracter should be. Or they can write an explanation of why the key is the best choice. They can earn partial credit back on each item for which they make a valid argument. The advantages are that this avoids the emotional pleas and "feeding frenzies" that occur when tests are returned. Students usually, when forced to write about it, come to see where they went astray. Upon occasion, a student will justify a distracter, something you can then adjust for the future. Students appreciate being allowed to explain their answers (Jacobsen, 1993). The disadvantage is that it usually requires releasing the items and keys to students, and you may not wish to reuse those items in the future once they're "out".
Computers and Multiple Choice Items
Multiple choice testing owes much of its ubiquity to the invention of the optical scanner in 1934, and technology has continued to play a role in multiple-choice developments. It is no surprise, then, that computing has facilitated multiple choice testing. The major technological applications are summarized below:
Item Banking: Item banking consists of storing items and information about items (e.g. difficulty, discrimination, content codes, etc.) in electronic format. This allows searching based on parameters, and tracking item characteristics over multiple administrations. There are a variety of item banking software programs available (e.g. FastTEST, C-Quest, Examiner, Exam Manager, MicroCat). This is the least complicated technological application in testing, but it interfaces with others.
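To give a feel for what an item bank stores and how searching by parameters works, here is a toy sketch in Python. It is not modeled on any of the programs named above; the class `BankedItem`, the function `search_bank`, the content codes, and the statistics are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BankedItem:
    item_id: str
    stem: str
    content_code: str      # e.g. a chapter or topic label
    difficulty: float      # proportion correct from earlier administrations
    discrimination: float  # discrimination index from earlier administrations


def search_bank(bank, content_code=None, min_difficulty=0.0,
                max_difficulty=1.0, min_discrimination=-1.0):
    """Return the items that match every search parameter supplied."""
    return [
        item for item in bank
        if (content_code is None or item.content_code == content_code)
        and min_difficulty <= item.difficulty <= max_difficulty
        and item.discrimination >= min_discrimination
    ]


# A made-up bank of three items.
bank = [
    BankedItem("W01", "The formula for calculating Work is:", "work", 0.72, 0.41),
    BankedItem("W02", "What are the key elements that define Work in physics?", "work", 0.55, 0.12),
    BankedItem("K01", "A hypothetical kinematics stem", "kinematics", 0.64, 0.38),
]

matches = search_bank(bank, content_code="work", min_discrimination=0.3)
print([item.item_id for item in matches])
```

Storing the statistics alongside each item is what lets you pull, say, moderately difficult, well-discriminating items for a given blueprint cell when assembling a new form.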
Computer Administered Tests: Here, students take tests on a computer. This allows for multiple forms of a test to be automatically generated, administered, scored, and recorded quickly and efficiently. Many colleges and universities have testing labs where this can be accomplished. You can make sure no two students get exactly the same test; students see their score immediately; you can arrange for the lab to schedule and proctor the tests so that you don't use class time for testing. In short, there are many advantages due to the automation the technology provides. However, there may be issues with student anxiety: some software is restrictive in terms of letting students return to answered items, changing responses, or viewing more than a single question at a time.
Computer Adaptive Tests: Adaptive testing is not logistically feasible for most college classes because test items need to be calibrated on, hopefully, hundreds of examinees before the "adaptive" part can be done. In essence, each item is assigned a difficulty rating based on the hundreds of responses, which is stored in the item bank. Then, the computer administers a brief set of questions to get a preliminary estimate of the student's knowledge. Based on that estimate, the computer adapts the test to the student, asking increasingly easier or harder questions in order to pinpoint the student's ability. Once it has done so to a certain level of accuracy, the student receives a score. The advantages are the accuracy and efficiency of this approach. One of the disadvantages is that students do not receive the same test or even the same number of items. Also, in only the very largest courses is item calibration feasible. Again, software for this is available (e.g. BiLog, MultiLog, Parscale, ConQuest, Quest, RASCAL). Trained psychometric staff would probably be necessary to do the calibrating. Such expertise may be found in the university's testing center or in educational psychology or psychology departments.
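The adaptive idea can be illustrated with a deliberately simplified sketch. This is not how the calibration software named above works: real adaptive tests rely on statistically calibrated item parameters and stop at a target level of accuracy. Here, as assumptions for illustration only, each item has a made-up difficulty level on a scale where higher means harder, the estimate moves up after a correct answer and down after an incorrect one by a step that halves each time, and the test stops after a fixed number of items.

```python
def run_adaptive_test(bank, answers_correctly, start=0.0, step=1.0, max_items=3):
    """Administer items one at a time, always choosing the unused item whose
    difficulty level is closest to the current ability estimate.

    bank:              dict mapping item id -> difficulty level (higher = harder)
    answers_correctly: function taking an item id and returning True or False
    """
    estimate = start
    unused = dict(bank)
    administered = []
    while unused and len(administered) < max_items:
        item_id = min(unused, key=lambda i: abs(unused[i] - estimate))
        del unused[item_id]
        administered.append(item_id)
        estimate += step if answers_correctly(item_id) else -step
        step /= 2  # smaller adjustments as the estimate settles
    return estimate, administered


# Made-up bank and a simulated student who can answer anything at or below level 0.8.
bank = {"q1": -2.0, "q2": -1.0, "q3": 0.0, "q4": 0.5, "q5": 1.0, "q6": 2.0}
student_level = 0.8


def simulated_student(item_id):
    return bank[item_id] <= student_level


estimate, administered = run_adaptive_test(bank, simulated_student)
print(administered, estimate)
```

In this run the simulated student sees q3, then q5, then q4, and the estimate settles at 0.75 after only three items. A different student would see a different sequence, which is the sense in which students do not receive the same test.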
Internet Testing: Testing on the Internet is becoming more and more common. All of the aforementioned techniques can be done in an Internet environment. At the moment, the main disadvantage is security. There's no good, inexpensive way yet to make sure that the person clicking the mouse is actually the student who is supposed to be taking the test. In Internet banking, for example, the customer is part of the security system. They have a password or PIN, and it's in their interest not to share that with others. In an academic testing situation, however, the "customer" or student cannot be considered part of the security because it may be in their interest to let someone else have access to their PIN or password. So the Internet works well for practice tests or low stakes assessments (e.g. ones that don't lead to grades) but is likely not yet ready for high stakes assessment (ones on which grades are based).
Pros and Cons
Multiple choice testing brings an efficiency and economy of scale that can be indispensable. Because students can respond to dozens of questions in a class period, it allows broad coverage of content. Multiple-choice items are flexible enough to tap nearly any level of Bloom's taxonomy. It also prevents students from trying anything to get some points: either they know it or they don't. Of course, these advantages only accrue when the items are well written and the test well constructed. That takes time, planning, creativity, and thoughtfulness.
A key downfall and disappointment with multiple choice items is that we tend to write items that are easy to write rather than items that take hard work to craft. This bias causes the tests to focus on the recall of facts (knowledge & comprehension level) and to leave out analysis, synthesis and evaluation questions, something that concerns both faculty and students (Crooks, 1988; Shifflett, Phibbs, & Sage, 1997). But this is only a drawback if we surrender to it. Just as we want to be thoughtful and explicit about all aspects of our assessment (Crooks, 1988), we especially want to be thoughtful about the items on our tests. Referring to the test blueprint while writing items is an excellent way to avoid this pitfall (Gronlund & Linn, 1990).
Another criticism of multiple-choice items is that they decontextualize information, that is, they pull knowledge -- often facts -- out of context (Shepard, 1989; Wiggins, 1989, 1992). Some argue that pulling information out of one's memory without some context is an invalid assessment. This charge doesn't come with an easy defense. Given the restricted nature of the format, it is often true that only a little bit of information like a word problem or a short scenario can be provided (Shepard, 1989). Case studies, portfolios, even essays are often considered more "authentic" or "real world" (Wiggins, 1989, 1992). Using interpretive exercises ameliorates this concern to some extent because of the additional context the stimulus material provides. Some consider this decontextualization an advantage, not a disadvantage, however. They argue that we want to know whether the student knows the answer or not, not whether they can use context to find the answer, etc. Authenticity in assessment actually falls on a continuum. Any assessment can always be made more authentic or less authentic. One could ask, "What is 2 +2?" or one could ask, "If Johnny has two apples and you have two apples, how many apples do you have together?" or one could actually hand two apples to one student and two apples to another and ask the question, and so on.
Through the analysis of distracters that have been carefully designed, it is possible to gain insight into the misconceptions of students. However, a correct answer does not necessarily mean that a student has mastered the content (Dufresne, Leonard, & Gerace, 2002). In other words, it is possible to answer correctly for many reasons, including guessing, which results in "false positives". This issue has led to a category of scoring models that correct for guessing (Gronlund & Linn, 1990), which are used in some well-known testing programs like the Scholastic Assessment Test (SAT). These techniques are not recommended for classroom use, however, because they correct for blind guessing, not for the informed guessing that is more common in classroom tests (Gronlund & Linn, 1990).
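To make the idea of a correction for guessing concrete, here is a minimal sketch of the classic formula, in which the corrected score equals the number right minus the number wrong divided by one less than the number of options per item. It is offered only as an illustration of how such scoring models work, not as the procedure of any particular testing program; the function and parameter names are hypothetical, and the sketch assumes every item has the same number of options and that omitted items are not counted as wrong.

```python
from fractions import Fraction


def corrected_score(num_right: int, num_wrong: int, num_options: int) -> Fraction:
    """Classic correction for guessing: R - W / (k - 1).

    num_right   -- items answered correctly (R)
    num_wrong   -- items answered incorrectly (W); omitted items are excluded
    num_options -- answer choices per item (k), assumed equal for all items
    """
    if num_options < 2:
        raise ValueError("an item needs at least two options")
    if num_right < 0 or num_wrong < 0:
        raise ValueError("counts cannot be negative")
    # Each wrong answer costs 1 / (k - 1) of a point, which cancels the
    # expected gain from purely blind guessing.
    return Fraction(num_right) - Fraction(num_wrong, num_options - 1)


# A 40-item test with four options per item:
# 28 right, 9 wrong, 3 omitted.
student = corrected_score(28, 9, 4)

# A blind guesser is expected to get 10 right and 30 wrong.
blind_guesser = corrected_score(10, 30, 4)

print(student, blind_guesser)
```

In this example the student's raw score of 28 is adjusted to 25, while the expected score of a purely blind guesser is adjusted to 0. The penalty assumes all wrong answers come from blind guessing, which is exactly why it fits classroom tests poorly: a student who eliminates two distracters before guessing is penalized as though no knowledge had been applied.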
A final complaint that needs careful consideration is the potential for language or cultural bias in multiple-choice questions. Any assessment can be biased, so we should always take care to avoid it. Following the item writing rules mitigates the potential for bias. Eliminating extraneous verbiage and keeping the language simple and on the students' level helps, and at least one study has shown that well-constructed items do not favor one cognitive style over another, while poorly constructed items do (Armstrong, 1993).
Kehoe, Jerard (1995). Writing multiple choice test items. Practical Assessment, Research & Evaluation, 4(9).
Frary, Robert B. (1995). More multiple choice item writing do's and don'ts. Practical Assessment, Research & Evaluation, 4(11).
Kehoe, Jerard (1995). Basic item analysis for multiple choice tests. Practical Assessment, Research & Evaluation, 4(10).
Kehoe, Jerard. Basic Item Analysis for Multiple choice Tests. -- "This Digest offers some suggestions for the improvement of multiple choice tests using "item analysis" statistics. These statistics are typically provided by a measurement services, where tests are machine-scored, as well as by testing software packages."
Teaching Effectiveness Program, University of Oregon. Writing Multiple choice Questions that Demand Critical Thinking
Carneson, J., Delpierre, G. & Masters, K. Designing and Managing Multiple Choice Questions
Office of Measurement Services, University of Minnesota.
Haladyna, T. M. (1999). Developing and validating multiple choice test items (2nd Ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
- This text is primarily focused on large-scale testing, but much of it is still very relevant to classroom testing.
Nitko, A. J. (2001). Educational assessment of students (3rd Ed.). Columbus, OH: Merrill Prentice Hall.
- Though this book is for K-12 teachers, Nitko is unparalleled in his copious examples and checklists when it comes to item writing. See especially chapter 8.
Gronlund, N. E., & Linn, R. L. (1990). Measurement and evaluation in teaching (6th Ed.). New York: Macmillan.
- This is a classic measurement text that expands on many of the issues presented here. See especially chapters 7 & 8.
Oosterhof, A. (2001). Classroom applications of educational measurement (3rd Ed.). Columbus, OH: Merrill Prentice Hall.
- Another text written for K-12 teachers, it also has excellent information about a broad range of topics relevant to multiple choice testing.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1999). Standards for educational and psychological testing. Washington, DC: AERA.
Armstrong, A. (1993). Cognitive-style difference in testing situations. Educational Measurement: Issues and Practice, 12(3), 17-22.
Brissenden & Slater (n.d.) Assessment Primer. Retrieved June 26, 2002 from the Internet at http://www.flaguide.org/start/start.php
Crooks, T. J. (1988). The impact of classroom evaluation practices on students. Review of Educational Research, 58(4), 438-481.
Dufresne, R. J., Leonard, W. J., & Gerace, W. J. (2002). Making sense of students' answers to multiple-choice questions. The Physics Teacher, 40, 174-180.
Dwyer, C. A. (1993). "Innovation and reform: Examples from teacher assessment." In R. E. Bennett & W. C. Ward (Eds.), Construction versus choice in cognitive measurement (pp. 265-289). Hillsdale, NJ: Lawrence Erlbaum Associates.
Gronlund, N. E. (1988). How to construct achievement tests (4th Ed.). Englewood Cliffs, NJ: Prentice Hall.
Gronlund, N. E., & Linn, R. L. (1990). Measurement and evaluation in teaching (6th Ed.). New York: Macmillan.
Guthrie, D. S. (1992). "Faculty goals and methods of instruction: Approaches to classroom assessment." In Assessment and Curriculum Reform. New Directions for Higher Education No. 80, 69-80. San Francisco: Jossey-Bass.
Haladyna, T. M. (1999). Developing and validating multiple choice test items (2nd Ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
Haladyna, T. M., & Dowling, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 51-78.
Hansen, J. D., & Dexter, L. (1997). Quality multiple-choice test questions: Item-writing guidelines and an analysis of auditing test banks. Journal of Education for Business, 73(2), 94-97.
Jacobsen, R. H. (1993). What is good testing?: Perceptions of college students. College Teaching, 41(4), 153-156.
Lederhouse, J. E., & Lower, J. M. (1974). Testing college professor's tests. College Student Journal, 8(1), 68-70.
McDougall, D. (1997). College faculty's use of objective tests: State-of-the-practice versus state-of-the-art. Journal of Research and Development in Education, 30(3), 183-193.
Murray, J. P. (1990). Better testing for better learning. College Teaching, 38(4), 148-152.
Nitko, A. J. (2001). Educational assessment of students (3rd Ed.). Columbus, OH: Merrill Prentice Hall.
Oosterhof, A. (2001). Classroom applications of educational measurement (3rd Ed.). Columbus, OH: Merrill Prentice Hall.
Sadler, P. M. (1998). Psychometric models of student conceptions in science: Reconciling qualitative studies and distracter-driven assessment instruments. Journal of Research in Science Teaching, 35(3), 265-296.
Shepard, L. (1989). Why we need better assessments. Educational Leadership, 46(7), 4-9.
Shifflett, B., Phibbs, K., & Sage, M. (1997). Attitudes toward collegiate level classroom testing. Educational Research Quarterly, 21(1), 15-26.
Sims, R. L. (1997). Gender equity in management education: A content analysis of test bank questions. Journal of Education for Business, 72(5), 283-287.
Wiggins, G. (1989). A true test: Toward more authentic and equitable assessment. Phi Delta Kappan, 70(9), 703-713.
Wiggins, G. (1992). Creating tests worth taking. Educational Leadership, 49(8), 26-33.
It's highly ironic in many ways that I should write the section on multiple choice testing. I started out professionally as an English teacher and, after a course in tests and measurements, vowed never to use such instruments. I've come a long way since then, studying Educational Psychology and earning two degrees in Applied Measurement. Even though I have maintained my preference for performance assessments and other open-ended formats, making them the focus of my research agenda, I now use multiple-choice tests in most of my university teaching. I have gained a healthy respect for them.