




Multiple Choice Test



Variations

Variety is actually one of the great strengths of multiple-choice items. Several good variations make multiple choice a very flexible item format.

Correct Answer: The key is clearly right and the distracters are clearly wrong. For example, computation items often fit this category.

The formula for calculating Work is:
a) ½*m*v²
b) 1N*m
c) m*g*h
d) F*D*cos θ

(Answer is d)

Best Answer: Here, the key is the best choice (most complete, most comprehensive) while distracters, though "correct", may be incomplete.
What are the key elements that define Work in physics?
a) force and cause
b) force and displacement
c) force, cause and displacement

(Answer is c)

Interpretive Exercises: These rely on a stimulus material and consist of a multiple-choice question or series of questions about the stimulus. Stimulus material can consist of many things: text, a graph, a picture, a table, etc. This format is particularly useful in tapping into several different ways of knowing quickly. Based on the same stimulus material, you can ask a knowledge question, an application question and a synthesis question, for example.

Problem A: You are asked to design a hoist system to move sacks of grain from the entrance of a grain warehouse to a loft five meters above the entrance. Each grain sack has a mass of 60 kg, and the warehouse operator wants to move 20 bags per minute.

NOTE: Refer to Problem A when answering questions 1, 2 & 3.
    1. What formula would you use to compute the amount of work required to move one bag of grain from the ground floor to the loft?
      a) ½*m*v²
      b) 1N*m
      c) m*g*h
      d) F*D*cos θ

      (Answer is c)

    2. How much work would be required to move the twenty bags of grain to the loft?
      a) 60000 Joules
      b) 3000 Joules
      c) 12000 Joules
      d) 1000 Joules

      (Answer is a)

    3. If the time constraint were shortened to 30 seconds, how would power be affected?
      a) The power would double.
      b) The power would be cut in half.
      c) The power would decrease, but by an indeterminate amount given the present information.
      d) The power would increase, but by an indeterminate amount given the present information.

      (Answer is a)
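
The arithmetic behind questions 2 and 3 can be checked with a short Python sketch. It assumes g is rounded to 10 m/s² (Problem A does not state a value; this rounding is what makes the 60000 Joules option come out exactly) and treats the 60 kg figure as each sack's mass. The function names are illustrative only.

```python
# Assumption: g is rounded to 10 m/s^2; Problem A does not give a value.
G = 10.0  # m/s^2 (assumed)

def lifting_work(mass_kg, height_m, g=G):
    """Work done against gravity to raise a mass: W = m * g * h (joules)."""
    return mass_kg * g * height_m

def power(work_j, time_s):
    """Average power: work divided by the time taken (watts)."""
    return work_j / time_s

work_one_bag = lifting_work(60, 5)      # one 60 kg sack raised 5 m
work_twenty = 20 * work_one_bag         # twenty sacks
power_60s = power(work_twenty, 60)      # twenty sacks in one minute
power_30s = power(work_twenty, 30)      # same work in half the time

print(work_twenty, power_60s, power_30s)
```

Because the work is unchanged and the time is halved, the power doubles, which is the reasoning question 3 is testing.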



Analysis

There are essentially two kinds of analysis possible with multiple choice questions. The first kind of analysis, scoring models, deals with scoring the test to determine student achievement. The second kind of analysis, item analysis, deals with analyzing how well each test item functioned.

Scoring Models: With multiple-choice questions, there are several potential ways to score students' responses. Here are two:

  • One is the full credit model where the response is either correct or incorrect. The student's total score is then the sum of correct responses.
  • A second is a partial credit model where some responses receive full credit and others fractional credit. This works best with "best answer" questions or ones where students may choose more than one answer. You can reward students who pick a correct but not best answer, for example.

A final issue in scoring is the weighting of items. The easiest approach is to let each item equal one point so all are equally weighted. However, based on the Test Blueprint or other considerations, you may decide to prioritize some items over others and weight them accordingly in computing the final score for the test.
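
As a rough illustration of these scoring choices, the sketch below scores one student's responses under a full credit model, a partial credit model, and a weighted partial credit model. The data layout (a per-item dictionary mapping each option to a fraction of credit) and the function name are assumptions made for this example, not a prescribed format.

```python
def score_test(responses, credit, weights=None):
    """Sum weighted credit over items.

    responses: option chosen on each item, e.g. ["d", "b"]
    credit:    per-item dict mapping option -> fraction of credit (0.0 to 1.0);
               options not listed earn nothing
    weights:   optional per-item weights (defaults to 1 point per item)
    """
    if weights is None:
        weights = [1.0] * len(responses)
    return sum(w * c.get(r, 0.0) for r, c, w in zip(responses, credit, weights))

responses = ["d", "b"]

# Full credit model: only the key earns credit.
full_credit = [{"d": 1.0}, {"c": 1.0}]
# Partial credit model: on item 2, "b" is correct but not the best answer.
partial_credit = [{"d": 1.0}, {"c": 1.0, "b": 0.5}]

full_score = score_test(responses, full_credit)
partial_score = score_test(responses, partial_credit)
weighted_score = score_test(responses, partial_credit, weights=[1, 2])

print(full_score, partial_score, weighted_score)
```

Keeping the credit table separate from the responses makes it easy to rescore the same answer sheets after an item is re-keyed or reweighted.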

Item Analysis: Once you have administered the test, it is possible to conduct several analyses to ensure the quality of the questions for both this and future administrations of the test. Many optical scanning systems common in university testing offices have many of these options available for you automatically.

Item Difficulty—How hard or easy was the item for this group of students? This is typically computed as the proportion of students who got the item correct. So a low value means a hard question and a high value means an easy question on a scale from 0.00 to 1.00. It is best to have a mix of difficulties: some hard ones to challenge top students; some easy ones so low-performing students will persist; and the bulk of items at a moderate difficulty level. The average item difficulty across a test, for most multiple-choice types, should be between 0.6 and 0.7 (Gronlund & Linn, 1990).
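
A minimal sketch of the difficulty calculation, assuming responses have already been scored as 1 (correct) or 0 (incorrect) in a students-by-items table. The tiny data set is invented purely for illustration.

```python
def item_difficulty(scored):
    """Proportion of students answering each item correctly.

    scored: list of rows, one per student; each row is a list of 0/1 item scores.
    """
    n_students = len(scored)
    n_items = len(scored[0])
    return [sum(row[i] for row in scored) / n_students for i in range(n_items)]

# Four students, three items (hypothetical data).
scored = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
]

difficulties = item_difficulty(scored)
mean_difficulty = sum(difficulties) / len(difficulties)

print(difficulties, round(mean_difficulty, 2))
```

Here the first item is very easy (everyone got it right) and the third is hard (one student in four got it right).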

Item Discrimination—How well does the item differentiate between students who have mastered the content and students who have not? This is calculated either as a correlation coefficient between the item score and the total score or as the difference between the proportion of high-scoring students who got the item right and the proportion of low-scoring students who got the item right. Either way, it is expressed on a scale from -1.00 to +1.00. Negative 1 means all low scorers got the item right and all high scorers got the item wrong. Given that we want students who have mastered the content getting each item correct, that's bad. A positive 1 means the item worked exactly as it should. A zero means the item doesn't distinguish between mastery and non-mastery students. That's also bad. There are a number of statistical issues at work that cause +1.00 to be a rare occurrence, so that a reasonable expectation for item discrimination indices is between 0.3 and 0.5 (Oosterhof, 2001). Again, a mix of different values in an exam is acceptable.
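
Both ways of computing discrimination can be sketched in a few lines of Python. This example assumes 0/1 item scores, splits the class into top and bottom halves for the upper-lower index (other splits are also used in practice), and computes the item-total correlation with a hand-written Pearson formula. The data are hypothetical.

```python
from math import sqrt

def upper_lower_index(scored, item, fraction=0.5):
    """Proportion correct in the high-scoring group minus the low-scoring group."""
    ranked = sorted(scored, key=sum, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    p_upper = sum(row[item] for row in ranked[:k]) / k
    p_lower = sum(row[item] for row in ranked[-k:]) / k
    return p_upper - p_lower

def item_total_correlation(scored, item):
    """Pearson correlation between one item's scores and students' total scores."""
    x = [row[item] for row in scored]
    y = [sum(row) for row in scored]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # no variation, so the item cannot discriminate
    return sxy / sqrt(sxx * syy)

# Four students, four items (hypothetical data).
scored = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
]

indices = [upper_lower_index(scored, i) for i in range(4)]
r_first = item_total_correlation(scored, 0)
r_last = item_total_correlation(scored, 3)

print(indices, round(r_first, 2), round(r_last, 2))
```

The last item is answered correctly only by a low scorer, so both statistics come out negative and flag it for review.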

Distracter Analysis—This approach looks at who is choosing each option for an item. Usually the examinees are divided into low-scoring and high-scoring groups, and the proportion of each group choosing each option is reported. High-scoring students will usually pick the correct response and low-scoring students will usually pick a distracter. If the opposite happens, that's a cue to revise the item: something about a distracter is attracting high-performing students (Oosterhof, 2001).
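
A distracter table of this kind might be produced as follows. The sketch assumes one item with options a through d, splits the class into top and bottom halves by total score, and uses invented responses. Scanning software typically reports something similar, though the layout will differ.

```python
from collections import Counter

def distracter_table(choices, totals, options="abcd"):
    """Proportion of high and low scorers choosing each option on one item.

    choices: option each student picked on this item
    totals:  each student's total test score (same order as choices)
    """
    order = sorted(range(len(totals)), key=lambda i: totals[i], reverse=True)
    half = len(order) // 2
    groups = {"high": order[:half], "low": order[-half:]}
    table = {}
    for name, members in groups.items():
        counts = Counter(choices[i] for i in members)
        table[name] = {opt: counts[opt] / len(members) for opt in options}
    return table

# Eight students; the key for this item is "a" (hypothetical data).
choices = ["a", "a", "b", "a", "c", "b", "c", "d"]
totals = [9, 8, 8, 7, 4, 3, 3, 2]

table = distracter_table(choices, totals)
for group, row in table.items():
    print(group, row)
```

In this example option b draws a quarter of the high scorers, which would be worth a second look at how that distracter is worded.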

One final statistic that is useful is the reliability coefficient. Reliability is the degree of accuracy present in the score. For multiple-choice tests, this is indexed using a reliability coefficient like Cronbach's Alpha or KR-21. Both range from 0.00 to 1.00 with higher values indicating higher reliability. Though there are not strict cutoffs for "acceptable" reliability coefficients because reliability is influenced by many diverse factors, the consensus in the measurement field is that 0.60 or 0.70 would be an acceptable lower value (Oosterhof, 2001). Having most item discrimination values at or near .50, having a variety of item difficulties, and having more items will all increase estimates of reliability (Gronlund, 1988; Nitko, 2001; Oosterhof, 2001). In essence, you are calculating internal correlations of student responses between individual items.
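
For readers who want to see what these coefficients involve, here is a minimal sketch using the standard textbook formulas for Cronbach's Alpha and KR-21 on 0/1 scored items. It assumes population variances throughout, and the five-student data set is hypothetical. Real tests would have many more items and examinees.

```python
from statistics import mean, pvariance

def cronbach_alpha(scored):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = len(scored[0])
    item_vars = [pvariance([row[i] for row in scored]) for i in range(k)]
    total_var = pvariance([sum(row) for row in scored])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

def kr21(scored):
    """KR-21 = k/(k-1) * (1 - M(k - M) / (k * variance of total scores)).

    Uses only the mean M and variance of total scores, and in effect
    assumes all items are equally difficult.
    """
    k = len(scored[0])
    totals = [sum(row) for row in scored]
    m = mean(totals)
    return (k / (k - 1)) * (1 - m * (k - m) / (k * pvariance(totals)))

# Five students, four items (hypothetical data).
scored = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

alpha = cronbach_alpha(scored)
kr = kr21(scored)
print(round(alpha, 3), round(kr, 3))
```

KR-21 comes out lower than Alpha here because the items differ in difficulty, which the simpler formula does not take into account.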

These analyses help to identify items that are unfair, unclear, or poor in other ways. This information can be used in a variety of ways.

  • In the current administration of the test. You can use this information to discard items in the current administration of the test, another state-of-the-art practice (McDougall, 1997). I find students really respect the scores more when I can explain to them that certain poor items or unfair items were removed based on item analysis. For example, if I find many high-scoring students choosing a certain distracter, that cues me to rethink whether it could be correct. Often I'll give all students who chose that distracter credit also.
  • In future administrations of the test. Reusing items is a very "cost-effective" procedure. The overall quality of the test, however, will continue to increase if you target poor items -- ones that are too easy or too difficult, that have zero or negative discrimination, or ones with odd response patterns -- to be reworked or discarded. I find it helpful when going over exams with classes or individual students to have the item analyses handy so that I can listen for reasons why those values came out as they did. Sometimes I'll even probe to get explanations as to why someone chose a distracter.


Suggestions for Use

In addition to the variety of item types, there is also a variety of ways to use multiple-choice questions and administer multiple-choice tests. Here are some ideas:

Multiple Choice Items and Student Learning

Diagnosing misconceptions through distracter analysis—If you've used common student misconceptions or research-based misconceptions when writing distracters, you can look to see how many students chose those distracters. As a simplistic example, we know that in double digit addition, students will forget to "carry" values from the ones column to the tens column. So we could make sure there's a distracter that represents not carrying in several of our items. We can then look to see how many students consistently choose that option. We may then choose to return to that concept and re-teach it, or make other adjustments.
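
The addition example can be made concrete with a small sketch. It generates the "forgot to carry" distracter for each item and then lists students who chose that distracter on every item. The function names, the interpretation of the misconception as dropping each carry, and the response data are assumptions made for illustration.

```python
def add_without_carry(a, b):
    """Add column by column and drop every carry, mimicking the misconception."""
    result, place = 0, 1
    while a > 0 or b > 0:
        result += ((a % 10 + b % 10) % 10) * place
        a, b, place = a // 10, b // 10, place * 10
    return result

def students_showing_misconception(items, answers):
    """Students who picked the no-carry distracter on every item.

    items:   list of (a, b) addend pairs
    answers: dict mapping student -> list of chosen answers, one per item
    """
    distracters = [add_without_carry(a, b) for a, b in items]
    return [s for s, chosen in answers.items() if chosen == distracters]

items = [(27, 48), (36, 57)]            # correct sums are 75 and 93
answers = {
    "s1": [75, 93],                     # correct on both
    "s2": [65, 83],                     # no-carry answer on both
    "s3": [65, 93],                     # no-carry answer once
}

flagged = students_showing_misconception(items, answers)
print(flagged)
```

A student who picks the distracter once may simply have slipped; the consistent pattern is what suggests the concept needs re-teaching.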

Item Writing as a Study Strategy—A good study strategy for students is to have them write multiple-choice items. This is especially good if you also want to help students learn strategies for taking multiple-choice tests. This puts them "behind the scenes" and helps them identify material that might be on the test and how it might be asked. It also is a great way for the instructor to gauge their depth of understanding and what issues and topics they are focusing on. This can be very useful, for example, if you notice that the things they're focusing on aren't what you would focus on.

Share the Test Blueprint—I usually give my students my test blueprint as a study guide. After all, it tells me what's important, what I want students to know, and how I want them to know it. I use it to teach and to assess. Why can't they also use that information (McDougall, 1997)? This is yet another way to share goals and expectations with students.

Mastery Learning Revisions—One approach to emphasize learning over "testing" after the test is to offer students the opportunity to write about items they missed (Jacobsen, 1993; Murray, 1990). For each item, students argue in writing why the key isn't the right answer or why a distracter should be. Or they can write an explanation of why the key is the best choice. They can earn partial credit back on each item for which they make a valid argument. The advantages are that this avoids the emotional pleas and "feeding frenzies" that occur when tests are returned. Students usually, when forced to write about it, come to see where they went astray. Upon occasion, a student will justify a distracter, something you can then adjust for the future. Students appreciate being allowed to explain their answers (Jacobsen, 1993). The disadvantage is that it usually requires releasing the items and keys to students, and you may not wish to reuse those items in the future once they're "out".

Computers and Multiple Choice Items

Multiple choice testing owes much of its ubiquity to the invention of the optical scanner in 1934, and technology has continued to play a role in multiple-choice developments. It is no surprise, then, that computing has facilitated multiple choice testing. The major technological applications are summarized below:

Item Banking—Item banking consists of storing items and information about items (e.g. difficulty, discrimination, content codes, etc.) in electronic format. This allows searching based on parameters, and tracking item characteristics over multiple administrations. There are a variety of item banking software programs available (e.g. FastTEST, C-Quest, Examiner, Exam Manager, MicroCat). This is the least complicated technological application in testing, but it interfaces with others.
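
To show the idea rather than any particular product, here is a hypothetical in-memory item bank in Python. The record fields and the search function are assumptions for this sketch and do not reflect how the programs named above store their data.

```python
from dataclasses import dataclass, field

@dataclass
class BankItem:
    item_id: str
    stem: str
    options: dict                       # option letter -> option text
    key: str
    content_codes: set = field(default_factory=set)
    difficulty: list = field(default_factory=list)       # one value per administration
    discrimination: list = field(default_factory=list)   # one value per administration

def search(bank, code, min_difficulty=0.0, max_difficulty=1.0):
    """IDs of items with the given content code whose most recent
    difficulty falls inside the requested range."""
    hits = []
    for item in bank:
        if code not in item.content_codes or not item.difficulty:
            continue
        if min_difficulty <= item.difficulty[-1] <= max_difficulty:
            hits.append(item.item_id)
    return hits

bank = [
    BankItem("W1", "The formula for calculating work is:",
             {"c": "m*g*h", "d": "F*D*cos(theta)"}, "d",
             {"work"}, [0.55, 0.62], [0.35, 0.40]),
    BankItem("W2", "How much work is required to move twenty bags?",
             {"a": "60000 Joules", "b": "3000 Joules"}, "a",
             {"work", "power"}, [0.91], [0.10]),
    BankItem("P1", "How would power be affected?",
             {"a": "It would double.", "b": "It would be cut in half."}, "a",
             {"power"}, [0.48], [0.45]),
]

moderate_work_items = search(bank, "work", 0.3, 0.8)
print(moderate_work_items)
```

Storing one statistic per administration is what allows item characteristics to be tracked over time.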

Computer Administered Tests—Here, students take tests on a computer. This allows for multiple forms of a test to be automatically generated, administered, scored, and recorded quickly and efficiently. Many colleges and universities have testing labs where this can be accomplished. You can make sure no two students get exactly the same test; students see their score immediately; you can arrange for the lab to schedule and proctor the tests so that you don't use class time for testing. In short, there are many advantages due to the automation the technology provides. However, there may be issues with student anxiety: some software is restrictive in terms of letting students return to answered items, changing responses, or viewing more than a single question at a time.
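
A sketch of how multiple forms might be generated automatically: each student gets a random draw of items from a pool, with the options shuffled. The pool format and the use of a per-student seed are assumptions for this example. Random draws make identical forms unlikely but do not by themselves guarantee that no two students see the same test.

```python
import random

def make_form(pool, n_items, seed):
    """Build one test form: a random selection of items with shuffled options.

    pool: list of (stem, options, key_text) tuples
    seed: per-student value so the same form can be regenerated for grading
    """
    rng = random.Random(seed)
    form = []
    for stem, options, key_text in rng.sample(pool, n_items):
        shuffled = list(options)
        rng.shuffle(shuffled)
        form.append({
            "stem": stem,
            "options": shuffled,
            "key_index": shuffled.index(key_text),  # where the key landed
        })
    return form

pool = [
    ("2 + 2 = ?", ["3", "4", "5"], "4"),
    ("3 + 5 = ?", ["7", "8", "9"], "8"),
    ("6 + 1 = ?", ["5", "6", "7"], "7"),
    ("9 + 4 = ?", ["12", "13", "14"], "13"),
]

form_a = make_form(pool, 3, seed=1001)
form_b = make_form(pool, 3, seed=1002)

for question in form_a:
    print(question["stem"], question["options"])
```

Recording the position of the key on each form is what lets the computer score immediately despite the shuffling.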

Computer Adaptive Tests—Adaptive testing is not logistically feasible for most college classes because test items need to be calibrated on, ideally, hundreds of examinees before the "adaptive" part can be done. In essence, each item is assigned a difficulty rating based on the hundreds of responses, which is stored in the item bank. Then, the computer administers a brief set of questions to get a preliminary estimate of the student's knowledge. Based on that estimate, the computer adapts the test to the student, asking progressively easier or harder questions in order to pinpoint the student's ability. Once it has done so to a certain level of accuracy, the student receives a score. The advantages are the accuracy and efficiency of this approach. One of the disadvantages is that students do not receive the same test or even the same number of items. Also, item calibration is feasible only in the very largest courses. Again, software for this is available (e.g. BiLog, MultiLog, Parscale, ConQuest, Quest, RASCAL). Trained psychometric staff would probably be necessary to do the calibrating. Such expertise may be found in the university's testing center or in educational psychology or psychology departments.
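
The adaptive loop can be illustrated with a deliberately simplified sketch. It is not the statistical estimation that calibration packages perform. Instead it assumes each item already has a difficulty value on a common scale, picks the unused item closest to the current ability estimate, moves the estimate up or down after each answer, and halves the step until it falls below a chosen precision. All names and numbers are hypothetical.

```python
def adaptive_test(bank, answer_fn, start=0.0, step=1.0, min_step=0.125):
    """Simplified staircase version of an adaptive test.

    bank:      dict mapping item id -> calibrated difficulty
    answer_fn: function(item id) -> True if the student answers correctly
    Returns the final ability estimate and the list of (item, correct) pairs.
    """
    ability = start
    remaining = dict(bank)
    administered = []
    while remaining and step >= min_step:
        # Choose the unused item whose difficulty is closest to the estimate.
        item = min(remaining, key=lambda i: abs(remaining[i] - ability))
        del remaining[item]
        correct = answer_fn(item)
        ability += step if correct else -step
        step /= 2   # smaller adjustments as the estimate settles
        administered.append((item, correct))
    return ability, administered

difficulties = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
bank = {f"item{i}": d for i, d in enumerate(difficulties)}

# Hypothetical student who answers correctly whenever difficulty <= 1.2.
true_ability = 1.2
estimate, administered = adaptive_test(bank, lambda item: bank[item] <= true_ability)

print(round(estimate, 3), administered)
```

Even this toy version shows the efficiency argument: four of the nine items are enough to place the student near the right point on the scale.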

Internet Testing—Testing on the Internet is becoming more and more common. All of the aforementioned techniques can be done in an Internet environment. At the moment, the main disadvantage is security. There's no good, inexpensive way yet to make sure that the person clicking the mouse is actually the student who is supposed to be taking the test. In Internet banking, for example, the customer is part of the security system. They have a password or PIN, and it's in their interest not to share that with others. In an academic testing situation, however, the "customer" or student cannot be considered part of the security because it may be in their interest to let someone else have access to their PIN or password. So the Internet works well for practice tests or low stakes assessments (e.g. ones that don't lead to grades) but is likely not yet ready for high stakes assessment (ones on which grades are based).


Pros and Cons

Multiple choice testing brings an efficiency and economy of scale that can be indispensable. Because students can respond to dozens of questions in a class period, it allows broad coverage of content. Multiple-choice items are flexible enough to tap nearly any level of Bloom's taxonomy. It also prevents students from trying anything to get some points: either they know it or they don't. Of course, these advantages only accrue when the items are well written and the test well constructed. That takes time, planning, creativity, and thoughtfulness.

A key downfall and disappointment with multiple choice items is that we tend to write items that are easy to write rather than items that take hard work to craft. This bias causes the tests to focus on the recall of facts (knowledge & comprehension level) and to leave out analysis, synthesis and evaluation questions, something that concerns both faculty and students (Crooks, 1988; Shifflett, Phibbs, & Sage, 1997). But this is only a drawback if we surrender to it. Just as we want to be thoughtful and explicit about all aspects of our assessment (Crooks, 1988), we especially want to be thoughtful about the items on our tests. Referring to the test blueprint while writing items is an excellent way to avoid this pitfall (Gronlund & Linn, 1990).

Another criticism of multiple-choice items is that they decontextualize information, that is, they pull knowledge -- often facts -- out of context (Shepard, 1989; Wiggins, 1989, 1992). Some argue that pulling information out of one's memory without some context is an invalid assessment. This charge doesn't come with an easy defense. Given the restricted nature of the format, it is often true that only a little bit of information like a word problem or a short scenario can be provided (Shepard, 1989). Case studies, portfolios, even essays are often considered more "authentic" or "real world" (Wiggins, 1989, 1992). Using interpretive exercises ameliorates this concern to some extent because of the additional context the stimulus material provides. Some consider this decontextualization an advantage, not a disadvantage, however. They argue that we want to know whether the student knows the answer or not, not whether they can use context to find the answer, etc. Authenticity in assessment actually falls on a continuum. Any assessment can always be made more authentic or less authentic. One could ask, "What is 2 + 2?" or one could ask, "If Johnny has two apples and you have two apples, how many apples do you have together?" or one could actually hand two apples to one student and two apples to another and ask the question, and so on.

Through the analysis of distracters that have been carefully designed, it is possible to gain insight into the misconceptions of students. However, simply because a student answers correctly, it does not necessarily mean they have mastered the content (Dufresne, Leonard, & Gerace, 2002). In other words, it is possible to answer correctly for many reasons, including guessing, which results in "false positives". This issue has led to a category of scoring models that correct for guessing (Gronlund & Linn, 1990), which are used in some well-known testing programs like the Scholastic Assessment Test (SAT). These techniques are not recommended for classroom use, however, because they correct for blind guessing, not the informed guessing that is more common in classroom tests (Gronlund & Linn, 1990).
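
For reference, the usual correction-for-guessing formula subtracts a fraction of the wrong answers from the number right: corrected score = R - W/(k - 1), where R is the number right, W the number wrong (omitted items are not counted), and k the number of options per item. The sketch below applies it to two hypothetical students on a 40-item, four-option test.

```python
def corrected_score(n_right, n_wrong, n_options):
    """Correction for blind guessing: R - W / (k - 1). Omitted items are ignored."""
    return n_right - n_wrong / (n_options - 1)

# A student who guesses blindly on all 40 four-option items can expect
# about 10 right and 30 wrong.
blind_guesser = corrected_score(n_right=10, n_wrong=30, n_options=4)

# A student with 30 right, 6 wrong, and 4 omitted.
prepared_student = corrected_score(n_right=30, n_wrong=6, n_options=4)

print(blind_guesser, prepared_student)
```

The formula zeroes out the expected gain from purely random guessing, which is why it says little about a student who can rule out one or two distracters before choosing.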

A final complaint that needs careful consideration is the potential for language or cultural bias in multiple-choice questions. Any assessment can be biased, so we should always take care to avoid it. Following the item writing rules mitigates the potential for bias. Eliminating extraneous verbiage and keeping the language simple and on the students' level helps, and at least one study has shown that well-constructed items don't favor particular cognitive styles while poorly constructed ones do (Armstrong, 1993).





