Writing Good Assessment Questions

January 7, 2020

One Friday I was approached by a teacher from another institution who asked me how things were going. When I told him I had just gotten out of a somewhat stressful assessment, he nodded understandingly and replied that he had heard one of his colleagues say that teaching is something we do for free; it’s the assessing we have to be paid for.

Assessment exists to verify that students actually learned what we intended them to learn. Sometimes we are disappointed to find that many students didn’t pick up everything we thought they should have. The disappointment can be mutual if students feel the assessment didn’t truly measure their achievement of the learning outcomes, but rather their ability to interpret and navigate flawed questions.

In the September 2019 issue of the American Journal of Pharmaceutical Education (AJPE), Michael J. Rudolph and colleagues published a review article designed to be “an accessible primer on test item development for pharmacy and other health professions faculty members.” In it, they outline an extensive list of do’s and don’ts for constructing test questions. Anyone interested in improving their item writing is encouraged to read the full paper; a few highlights follow.

First, before writing the assessment, the authors suggest creating a blueprint that outlines which learning outcomes will be assessed and at what cognitive level. With this blueprint in hand, teachers can ensure appropriate class coverage of the content and offer students plenty of formative practice with the question formats they can expect to see on the summative assessment.

When the time comes to actually write questions, one of the most important considerations is item validity: the extent to which a question accurately measures student achievement of an intended learning outcome. Flaws in item design and construction can reduce validity by introducing irrelevant difficulty, that is, difficulty arising from factors other than the learning outcome being tested. For example, unnecessary wordiness, negative wording (e.g., “Which of these is not a correct answer?”), and double negatives disadvantage students for whom English is a second language or who have weak reading comprehension. Unless the learning outcome is about English usage (and class time was spent teaching that outcome), item wording has the potential to introduce irrelevant difficulty.

Test-wise students can also undermine a question’s validity by recognizing and exploiting item flaws to get the correct answer, regardless of whether they’ve achieved the associated learning outcome. For example, students might take cues from wording similarities or grammatical inconsistencies between the stem and options, eliminate improbable absolute terms like “never” and “always,” or use the process of elimination to deduce correct answers in all-of-the-above, none-of-the-above, and K-type questions (K-type questions use options like Only A, Both A & B, Both B & C, etc.). To protect validity, the AJPE authors suggest using “select all that apply” instead of K-type questions and recommend asking a colleague to critically evaluate your questions for other flaws that test-wise students could exploit. If colleagues who are not experts in your field can get the right answer from an item’s wording alone, that is a sign the question is not a valid measure of student attainment of the learning outcome.

Certain statistics found on an item analysis, such as the difficulty index (p) and the discrimination index (d) or point biserial, can point to problems with particular questions. The difficulty index, p, is the decimal proportion of students who answered a question correctly. The upper bound for p is 1 (100% of students got the question right) and the lower bound is theoretically zero; in practice, random guessing keeps the effective lower bound near 0.25 for four answer choices or 0.33 for three. Various guidelines have been proposed for what constitutes an appropriate value of p, but the AJPE authors point out that a “competency-based examination, or one designed to ensure that students have a basic understanding of specific content, should contain items that most students answer correctly (high p value).” A low p value can signal that an assessment question may be flawed and merits closer evaluation; alternatively, the concept may not have been emphasized in class or practiced sufficiently by students.
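To make the arithmetic concrete, here is a minimal Python sketch (my own illustration, not something taken from the Rudolph paper) that computes p, assuming each student’s response to the item has already been scored 1 for correct and 0 for incorrect:

```python
# Illustrative sketch only (not from the article): the difficulty index p.
# Assumes each student's response has been scored 1 (correct) or 0 (incorrect).

def difficulty_index(item_scores):
    """Return p, the proportion of students who answered the item correctly."""
    return sum(item_scores) / len(item_scores)

# Example: 18 of 24 students answered correctly, so p = 0.75
print(difficulty_index([1] * 18 + [0] * 6))  # 0.75
```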

The discrimination index, d, compares how high- and low-scoring students performed on a particular question. To calculate d, the decimal proportion of low-scoring students who got the question right is subtracted from the decimal proportion of high-scoring students who got it right, where “low-scoring” is usually the bottom 27% of scorers and “high-scoring” the top 27%. The range of d is -1.0 to +1.0. For most questions, d will be positive, indicating that more high-scoring than low-scoring students answered the question correctly. Questions with a negative d may have been difficult or confusing for top-scoring students, a sign that item construction should be scrutinized.
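A minimal sketch of that calculation, again my own illustration under the same 0/1 scoring assumption, ranks students by total score, takes the top and bottom 27%, and subtracts the two groups’ proportions correct:

```python
# Illustrative sketch only (not from the article): the discrimination index d.
# Assumes item_scores (0/1 per student) and total_scores (overall exam scores)
# are parallel lists, one entry per student.

def discrimination_index(item_scores, total_scores, tail=0.27):
    """Return d = (proportion correct in top 27%) - (proportion correct in bottom 27%)."""
    n = len(item_scores)
    k = max(1, round(n * tail))                          # students in each tail group
    ranked = sorted(range(n), key=lambda i: total_scores[i])
    low, high = ranked[:k], ranked[-k:]
    p_low = sum(item_scores[i] for i in low) / k
    p_high = sum(item_scores[i] for i in high) / k
    return p_high - p_low
```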

Because d becomes less reliable for classes of about 50 students or fewer, item analyses often also include a point biserial value: a correlation coefficient, calculated across the whole class, that indicates whether answering a particular question correctly correlates with a higher overall assessment score. The point biserial ranges from -1.0 to +1.0 and is interpreted the same way as d.
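Since the point biserial amounts to the Pearson correlation between the 0/1 item score and the total score, a minimal sketch (my illustration, not the article’s method) can lean on numpy:

```python
# Illustrative sketch only (not from the article): the point biserial for one item,
# computed as the Pearson correlation between 0/1 item scores and total scores.
import numpy as np

def point_biserial(item_scores, total_scores):
    """Return the correlation (-1.0 to +1.0) between item success and overall score."""
    return float(np.corrcoef(item_scores, total_scores)[0, 1])
```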

In summary, you can improve your assessment items by 1) creating an assessment blueprint so that you are intentional about your assessments and the way you prepare students for them, 2) using feedback from the item analysis, your colleagues, and your students (in the post-assessment review!) to identify items that fall short of expectations, and 3) consulting a list of common assessment flaws to troubleshoot or preemptively repair questions of dubious validity (for example, see Table 1 of Rudolph et al. Am J Pharm Educ. 2019; 83(7): Article 7204). As you engage in these activities, they will become second nature and your assessments will improve.

If you would like to contribute to The Faculty Development Blog, please contact Tyler Rose at trose@roseman.edu.

Author
Tyler Rose, Ph.D.
Associate Professor of Pharmaceutical Sciences
Roseman University of Health Sciences College of Pharmacy