Students’ Course Evaluations: From Design Problems to Genuine Responses

By: Christine Winskowski

You know the ritual: You administer the institution’s course evaluation forms with 10-15 items (statements or questions, like “The teacher was enthusiastic”) and often 5 or 6 responses (from “1 – strongly agree” to “5 – strongly disagree”). A few days later, the completed forms are returned.

Do you feel anticipation? Trepidation? Do you ever wonder…why? (“Why did these people put a 2 or 3?”)

I first became acquainted with course evaluation forms around 1983 when I taught ESL in a small liberal arts college in Hawaii. My liberal arts colleagues had developed an in-house course evaluation form. As the students filled it out, I browsed the form.

One item got my attention: “The instructor explains the origins of key concepts.”

But my first class of the day was Intermediate Listening. “Key concepts??” I thought. Certainly, there were important elements, like word stress, sentence intonation, and pronunciation in casual speech. But they were not important in the same way that key concepts like relativity are to physics, or the unconscious is to psychology. The origins of concepts were…let us say, of questionable relevance.

Nonetheless, the ESL students dutifully put down a rating for this item.  Thus began my skepticism about students’ class rating forms.


Mystery in course evaluation responses

In the 1970s, there was enormous interest in the US in university course evaluations, first attempted by the University of Washington in the 1930s. Psychologists led this movement, and most of the early research on course evaluation was conducted in psychology classes. By the early 2000s, it was estimated that there might be about 2,000 publications on students’ course/teacher evaluations, also called questionnaires, rating forms, and surveys. Social scientists wanted evaluation forms to be as scientifically rigorous as measurements in the “hard” sciences, like physics and geology. This meant ensuring that evaluation forms were reliable; that is, if a form is given several times to similar groups, it should produce similar results. The evaluation instrument also had to be valid, representing something real in the classroom experience, something observed, confirmed, or validated by outside observers, interviews with students, or other validated instruments.
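
For readers who like to see these ideas in numbers, here is a minimal sketch in Python (with wholly invented data) of what “reliable” and “valid” mean operationally: a form given twice to similar groups should produce similar item averages, and those averages should line up with some outside measure, such as a trained observer’s ratings. Every number below is hypothetical, purely for illustration.

    # A minimal sketch of reliability and validity checks; all data invented.
    import numpy as np

    # Hypothetical mean ratings for ten items, from two administrations of the same form
    admin_1 = np.array([4.2, 3.8, 4.5, 3.1, 4.0, 3.6, 4.4, 3.9, 4.1, 3.3])
    admin_2 = np.array([4.0, 3.9, 4.6, 3.3, 3.8, 3.5, 4.3, 4.0, 4.2, 3.1])

    # Reliability: the two administrations should correlate strongly
    reliability_r = np.corrcoef(admin_1, admin_2)[0, 1]
    print(f"test-retest correlation: r = {reliability_r:.2f}")

    # A crude validity check: correlate the ratings with an outside criterion,
    # e.g., scores from a classroom observer (also invented)
    observer = np.array([4.1, 3.5, 4.4, 3.0, 3.9, 3.4, 4.5, 3.7, 4.0, 3.2])
    validity_r = np.corrcoef(admin_1, observer)[0, 1]
    print(f"correlation with observer ratings: r = {validity_r:.2f}")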

The mysterious part of the evaluation process is that we often don’t know how students interpret an item, why they choose a particular answer, or what their answers represent to them. Also, no one knows how much weight we should give to student comments. If 60% of the class complains about something, we probably should pay attention. If 15% complain, we’re less sure about its significance.

The fact is, there is really a good deal of uncertainty about what class evaluation forms are telling us.

Students’ course evaluation forms: What you should know

As my career got underway, I was uneasy about distributing evaluation forms to students. I had to do it, of course. But I knew that no one could say with any certainty what the rating numbers meant. The emperor truly had no clothes.

When I found out around 2005 that my college in Japan was implementing course evaluation forms, I began researching what was known about them. You might be surprised at all the things that influence evaluation form results (and not only in education). Here is a sample:

    • Humanities courses get the highest ratings, followed by social science courses, then science and technology courses. Electives get higher ratings than required courses, as well.

    • In a survey on marriage, when people are asked 1) how satisfied they are with their life as a whole (a “global” item), then 2) how satisfied they are with their marriage (a more specific item), the correlation between each survey-taker’s responses is r = .32. When those questions are reversed in order (that is, asking first about marriage, then about life as a whole), the correlation is r = .67. This and related studies suggest that one survey item forms a discoursal context for the next item, influencing the response. This may be why global questions (e.g., “How satisfied were you with this course?”) are often put at the bottom of students’ course evaluation forms (Schwarz, 1999).

    • It may matter whether choices are bipolar (e.g., strongly agree to strongly disagree) or unipolar (strongly agree to do not agree). The unipolar responses showed better discrimination; that is, responses were more widely distributed among the choices, suggesting that some survey-takers want to avoid the negative pole of the bipolar scales. Also, if the scale is numbered 1 (poor) to 10 (excellent), most responses will be on the positive side. If the same scale is labeled -5 to +5, even more responses will be on the positive side. People are just reluctant to give negative ratings (Ostrom & Gannon, 1996; Tourangeau et al., 2000).

    • Labels that go with evaluation rating numbers are often vague for concepts like frequency (never, rarely, not too often, etc.), probability (very unlikely, unlikely, likely, etc.), and amounts (none, few, some, etc.). It is likely that students will interpret them differently. After all, where is the dividing line between very rarely and rarely (Tourangeau et al., 2000)?

    • Further, there is a difference between a survey item with an even number of response choices and one with an odd number. A 6-point item, for example, divides evenly into negative and positive sides, making it what is called a “forced-choice” item. If survey-takers want to respond “agree and disagree, depending on the situation” (as I often do), they have no option. A 5-point item, however, provides a nice middle choice between the positive and negative sides. On the other hand, unless it is explicitly labeled, what does that middle choice mean? It could mean “both.” Or “neither.” Or “I have no idea.”

    • Word choice and grammar matter as well. An unclearly worded item, or grammar that is too complex, can make an item difficult to understand. Some items ask students to make an inference about the instructor or course, e.g., “The instructor was well-prepared.” Students may not know much detail about what instructors do and so may make dubious inferences about their preparation. Also, judgments of what counts as good or poor preparation have little reliability. An item like this is called a high-inference item because the student has to infer (assume or judge without observing) what happened. In contrast, an item like “The instructor started the class on time” is a low-inference item, since being on time is relatively easy to observe and requires less inference. Even here, though, some inference is possible: is two minutes late really late? How about five minutes?

    • Evidence of bias in survey results is well known in social science. Bias refers to influence on the survey-taker’s response that has nothing to do with teacher effectiveness or course quality. All psychology students learn about these three common biases in person perception: 1) the fundamental attribution error, the tendency to attribute (judge the cause of) a person’s behavior to internal or psychological states (e.g., attitudes, personality, states of mind, beliefs) rather than external causes (institutional rules, job constraints, social/cultural contexts); 2) the halo effect, where attributions about the instructor (friendliness, political views, likeability, etc.) may influence students’ survey responses on instructor effectiveness; and 3) errors of central tendency, the avoidance of the lowest and highest choices in favor of the middle ones.

    • Some researchers have found gender and racial bias (e.g. stereotyping) in evaluating instructors. Evidence is mixed, but some studies show bias against minority or female instructors or authorities, particularly when the student has received criticism from them (part of the research design). More subtle biases may be found when men enter traditionally female fields (e.g., nursing or childhood education), or when women enter traditionally male fields (e.g., STEM or manufacturing) (Sinclair & Kunda, 1999, 2000).

Students’ mental processes as they do the evaluation

What do we know about the mental processes of students as they fill out an evaluation form?

Tourangeau et al. (2000) offer a synopsis of the mental steps people take as they complete survey items. First, they must understand the question, that is, note its key terms (question words like when and how many, important verbs and nouns), logical structure, presuppositions, and implications. Some parts of the item may activate specific memories as the person reads, or previous items may have activated some memories.

Second, relevant information about the course must be retrieved from memory. Basic findings on memory recall include these points: Recent events are easier to recall than non-recent ones (“the recency effect”); events that are distinct and significant are easier to recall; routine (scripted) events like chapter quizzes blend together, and more recent routine events may interfere with recollection of older routine events; events near a temporal boundary (e.g. before the mid-term exam) are easier to recall (p. 98).

Third, Tourangeau et al. (2000) say the person must make some sort of judgement, then formulate the judgement in a way that is consistent with the question’s demands, e.g., whether the student learned a lot. Students are sometimes asked to make judgements requiring inference, e.g., whether a teacher was enthusiastic or how well/much the teacher prepared.[1] In any case, students must assess their own knowledge and/or draw inferences where facts cannot be retrieved or are not available from observations or general knowledge. The final step is to map a response from the available choices.

The authors concede that there are many sources of potential error that can creep into this process, such as misinterpreting the question or statement, not remembering (or mis-remembering) important information, making mistaken inferences or attributions, answering carelessly, and others. Computer scientist Faith Fich (2003), also noting many “sources of error” in the evaluation process, rightfully says the results of student evaluations have “low precision” (p. 3). 


Multicultural classrooms and cultural neuroscience

To complicate things, research on students’ course evaluation seems largely unacquainted with multicultural classrooms like our foreign/second language classes. However, if students come from various cultures, if students and their teacher come from different cultural or social backgrounds, if evaluation instruments are constructed by people from one background while the students completing them come from another, or if students and their teacher come from the same background but the teacher has assimilated to a second cultural or social background, what does that mean for students’ perceptions, attributions, and judgments during course evaluation? Your guess is as good as mine, but there might be some hints from cultural psychology and cultural neuroscience.

It has been well established that mental processes in East Asian and Western cultures differ. (See, for example, my favorite introduction to this topic, Richard Nisbett’s The geography of thought: How Asians and Westerners think differently… and why, 2004.)[2] East Asians tend to be more interdependent, whereas Westerners tend toward independence (or individualism). East Asians’ perceptions tend to focus on the social setting and relationships among people and things. Westerners’ perceptions tend to focus on a central element (a person or thing), with less attention on the context.


Work in neuroscience has confirmed that cross-cultural differences are reflected in different patterns of the brain’s neural activation. A summary article by Han (2015) identifies East-West cultural differences, reflected in different neural activity patterns, in studies on attention, causal attribution, mental state reasoning, empathy, and trait inference, among other areas. Several of these may have implications for the students’ course evaluation process, although work on this is in its infancy.

The attribution of causes for behavior might be a significant area to consider, especially when dealing with evaluation items about a teacher. Remember the fundamental attribution error? Many Westerners tend to attribute causes of behavior to a person’s disposition (“Ms. Blake spoke sharply because she is an impatient person.”). Many East Asians, in contrast, tend to attribute the causes of behavior to the situation (“Ms. Blake spoke sharply to prevent an accident with the scissors.”). Research suggests that people (Eastern or Western) automatically think of a dispositional attribution first, and then some (more likely Easterners) may rethink their attribution, integrating situational evidence (Mason & Morris, 2010). This rethinking takes place in the dorsolateral prefrontal cortex (DLPFC), where “controlled, effortful processing and the inhibition of inappropriate automatic reactions” take place (Brosch et al., 2013, p. 644). It raises the question: should we instruct students to consider the teacher’s work in the context of course constraints? And what would that involve?

The ability to “read” states of mind might be undermined by attributional biases, according to Adams et al. (2009), particularly in cross-cultural settings. The authors report that “mindreading” also involves two stages: recognizing emotion from nonverbal cues like the eyes, then reasoning about the person’s state of mind. Further, people find it easier to recognize states of mind within their own cultural group than in other groups, an intracultural advantage. Adams and his colleagues found that small groups of Japanese and white American adults assessed emotion in photographs of eyes more accurately for their own group, a finding confirmed by neural correlates. For many language classes, this certainly seems relevant to know; instructors might want to weigh what it could mean for their delivery (e.g., being more explicit about their own presuppositions, assumptions, perspectives, etc.) ahead of evaluation time.

On the other hand, as we all know, exposure to cultural groups outside our own gradually shapes a person’s internal culture. Moreover, it shapes neural structure and function, as reviews by Park & Huang (2012) as well as Han & Ma (2014) show. Again, instructors might keep this in mind when interpreting questionnaire results.


Why not ask students’ reasons-for-rating?

There is a small cluster of research publications that, I feel, tempers the claims of validity for conventional students’ course evaluation forms. These studies depart radically from conventional course evaluation research, which ordinarily finds that students’ ratings of instructors/courses are “moderately” correlated[3] with grades, peer faculty observations, faculty self-reports, and other measures. In contrast, these studies simply ask students to report, via interview or in writing, the reasons for selecting a particular response to an item as they complete their evaluations. This approach is called a think-aloud study or protocol analysis.

Here is what these authors did:

Benz & Blatt (1996)[4] asked students from 10 of their colleagues’ classes (389 students) to fill out evaluation items focused on the instructor and write the reasons for each rating choice. Block (1998) interviewed 24 adult students in courses at a large Barcelona language school, focusing on three items about the instructor (overall evaluation, making the course interesting, and punctuality), asking the reason for their choice. Kolitch & Dean (1998) had 96 undergraduates respond to a single global-evaluation item (“Overall, the instructor was an effective teacher”), using narratives and semi-structured interviews on the item’s meaning and their reasons-for-rating. Billings-Gagliardi et al. (2004) interviewed 24 medical school students regarding their evaluation rating choices. I attempted this procedure twice, first with ten Japanese EFL students, interviewed on item meaning and their reasons for rating choices (Winskowski, 2010). I then administered my Japanese university’s evaluation form and my own course-specific form for a comparative culture class (30-45 students annually), adding “Please give the reason for your rating” to each item. I compared the data from 2010 and 2012 (Winskowski, 2015).[5] Here are general and specific findings from these studies:

In contrast to the averaged numerical ratings we see for an item or for the whole evaluation form, the most striking finding was the extraordinary variety in the students’ reasons for their rating choices.

    1. One way variety manifested was in topical themes, shown by key terms and phrases that grouped student responses into the same topic. For example, for Benz & Blatt’s Item 10, “The instructor prepared well for classes,” one topical theme was instructors’ knowledge of their subjects. Another was time, e.g., using time well, readiness to start. Benz & Blatt and Kolitch & Dean reported two to six themes per item. My 2015 study found six to 12 themes per item in the 2010 data, and eight to 15 themes in the 2012 data. About 80% of all topical themes were addressed by only 1–4 students; larger groups of students writing on the same topical theme became rarer and rarer. Thus students were markedly varied in what they were thinking of as they responded.

    2. Variety also showed in students’ interpretations of items. Block, Billings-Gagliardi et al., and I all found variation in how students interpreted each item. Different interpretations also appeared for terminology, even for the term “punctuality.” A particular kind of variety in interpretation appeared in responses showing unexpected or idiosyncratic reasoning, according to Billings-Gagliardi et al., e.g., “I never rate Integration [of material] below agree because I believe the onus is on me to integrate the material…” Some evaluations of faculty were affected by perceptions of the faculty member’s caring, e.g., “I didn’t like the lecture she gave…but I liked Dr. DD as a person, so I said agree…” (p. 1066). One of Benz & Blatt’s students said that the class was “all discussion,” so the instructor “did not have to prepare much.” In my 2010 data, an item about “appropriate use of teaching materials” (blackboards, AV materials) prompted this response: “The teacher sometimes erased the board with her hand” (p. 19).

    3. All of the studies mentioned found that students evaluating instructors use different strategies for choosing their ratings, another form of variation. Some students used prior items’ ratings to make an average rating of teacher effectiveness; others used their feelings; still others identified a salient, critical element in the instructor’s work. Some students attributed the evaluation of their instructor to their own behavior (see point 2 above), to the subject matter, or entirely to the instructor. Some students compared their instructors with other instructors; others did not. Block noted that students “often conflated the teacher as an individual, the class as an event, and the class as a group of language learners” in providing a global overall evaluation of the teacher (p. 408).

    4. Finally, both the Block and Winskowski (2015) studies found that students’ reasons for rating did not correspond in any consistent way to a rating number on the scales. Block found that similar arguments were assigned to different rating numbers, as did I. However, in the 2015 study, I was interested to see whether (after masking students’ rating numbers) I could predict those ratings within a 2-point range on the form’s scales, e.g., 1-2, 2-3, 3-4, from students’ reasons-for-rating.[6] Across the evaluation instruments, there was an overall average of 72% correct matches between students’ reasons-for-rating and my predicted 2-point range (a minimal sketch of this matching check follows this list). There was also a noticeable number of cases where my predictions landed on the wrong side of the scale! It did show, however, that there was a general correspondence between the content of a student’s reason for rating and the valence or approximate location of most ratings on a rating scale.
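
For concreteness, here is a minimal sketch in Python of the matching check mentioned in point 4: each student’s actual rating is compared with the 2-point range predicted from their written reason, and the percentage of matches is reported. The ratings and predicted ranges below are invented, not the study’s data.

    # A rough sketch of the reasons-to-ratings matching check; all values invented.
    predictions = [
        {"actual": 5, "predicted_range": (4, 5)},   # match
        {"actual": 2, "predicted_range": (2, 3)},   # match
        {"actual": 6, "predicted_range": (3, 4)},   # miss
        {"actual": 4, "predicted_range": (4, 5)},   # match
        {"actual": 1, "predicted_range": (5, 6)},   # miss, and on the wrong side of the scale
    ]

    matches = sum(
        p["predicted_range"][0] <= p["actual"] <= p["predicted_range"][1]
        for p in predictions
    )
    print(f"correct matches: {matches / len(predictions):.0%}")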

Here are some conclusions the authors drew: Benz & Blatt felt that items which garnered more consensus in student responses were more “valid,” but that evaluation forms’ validity in general was challenged by the variation in students’ interpretations of the items. They state: “Numerical ratings as ultimate meaning are insufficient evidence of how students perceive teaching” (p. 431). Kolitch and Dean found little evidence that a global item like “The instructor was an effective teacher” could help improve teaching. They felt that student evaluation items would be more helpful to teachers if “tied to their individual courses” (pp. 73-74), addressing specific events, activities, teaching methods, etc. Block (1998) felt that the variation in student responses was never anticipated by item writers, that the rating scales were problematic, and that items need to better capture what is salient to language learners, preferably in course-specific evaluation forms. Billings-Gagliardi et al. found that their study called into question the assumptions that students understand item meaning, that faculty with the same or similar ratings are perceived similarly, and that ratings actually reflect the effectiveness of teaching. As for me, I stand with Block: administer official forms if you must, but periodically issue your own evaluations, designed for your courses, and ask the reason for students’ responses. I found the students’ responses remarkably illuminating about their engagement with the course. Unlike Benz & Blatt and Block, I found that response variation was the rule rather than an aberration, and that it is a significant finding. One might say it is the point of students’ class evaluations.

Once, in a JALT conference talk on students’ course evaluations, I started with a “thought experiment.” I put my university’s 2012 evaluation Item 4, “Did you (were you able to) participate seriously in this course?”, on the board. Then I asked the audience: “What kinds of reasons for rating would students have for picking a rating number?” As you might expect, the audience offered general versions of “yes, because I did,” “no, because I didn’t,” and “partly,” but could not think of more than 3-4 themes.

Of the 34 students making a rating, 26 offered reasons clustered in 10 themes. These are listed in order of ratings, from 1 = no to 6 = yes: because I felt sleepy; I lost my concentration; we were passive; could participate without chat; concentrated; tried hard to listen; attended class/worked hard; took notes; participated actively; course & lectures were interesting. If this item were in your class evaluation form, would you find an average rating (3.85 on a 6-point scale), or these reasons for rating, more informative?
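
To make the contrast concrete, here is a small Python sketch with invented rating/reason pairs that loosely echo the themes above (they are not the actual data): the single average tells you very little, while grouping the written reasons by rating shows what students were actually responding to.

    # Contrast a single average rating with the reasons behind the ratings; data invented.
    from collections import defaultdict

    responses = [
        (1, "I felt sleepy"),
        (2, "I lost my concentration"),
        (3, "we were passive"),
        (4, "tried hard to listen"),
        (5, "took notes"),
        (5, "participated actively"),
        (6, "course and lectures were interesting"),
    ]

    average = sum(rating for rating, _ in responses) / len(responses)
    print(f"average rating: {average:.2f}")        # one number, little insight

    by_rating = defaultdict(list)
    for rating, reason in responses:
        by_rating[rating].append(reason)

    for rating in sorted(by_rating):               # the reasons tell the story
        print(rating, "->", "; ".join(by_rating[rating]))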


Making your own instructor-designed, course-specific evaluation forms[7]

I would encourage readers to make their own course evaluation forms, designed for a particular course, and to ask students to explain their responses. There are good reasons to do it:

    1. To ask your students specifically what you really want to know about their experience with your courses, with an eye toward refining them. You can identify what is working well in your course, and show meaningful efforts to improve.

    2. To provide a complement to your institution’s official evaluation instrument, especially if the institution’s evaluation results do not reflect the effectiveness that you feel your course merits. You will have concrete evidence in hand to claim that your evaluation survey results are meaningful and valid, since they address course elements directly. If necessary, you can point out that institutional evaluation forms have less certainty, since they are more general.

 Having taken this stand, however, I’m aware that it is time-consuming and labor-intensive to process the data and analyze it. Teachers may not want to do this with all classes every term, but rather select occasions when there is a need to know. You could administer your course-specific survey at mid-term, giving yourself time to respond. Or, you do not have to administer an entire survey all at once. You could cut it up into pieces, addressing particular events like group projects, exams, three-unit intervals, etc. It’s a good idea to explain the rationale to the students, and add that you are looking for genuine responses. And by all means, share the findings with students. This type of instructor-designed evaluation form lends itself better to evaluating for formative purposes (to find out how the class is going) rather than summative purposes (evaluating to assess the instructor for promotion, etc.).

So here is how to construct your own course-specific evaluation form. Adapt at will.

    1. Identify specific course objectives: Start with the broadest skills, including standard ones (e.g., listening, writing), applied in the setting of interest (general skills, academic, conversational). Include levels, if you wish, and consider unit-specific or activity-specific skills (vocabulary on a topic, rhetorical forms or discourse conventions). Consider incidental skills, like technical skills or business or social practices.

    2. Identify the course activities and events that are intended to achieve these objectives.

    3. Make evaluation items that ask students if a course activity/event helped them achieve an objective, or how it helped them. You can use or adapt conventional items (Arreola, 2000, offers hundreds of examples). Or you can design your own (see below). Avoid items that ask students to speculate about the job of teaching (“the teacher used appropriate methods”) or about mental states (“the teacher respects students as people”). But by all means, ask questions about what students witnessed or did or felt (“the teacher responded helpfully to students’ questions”).

    4. Ask students to explain their responses or give reasons for them (see the sketch after this list).
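
If you keep your form in a script or file, a structure as simple as the sketch below (in Python) can carry steps 1-4: each item records the objective it targets, the activity meant to achieve it, the question students actually see, labeled response choices, and a required reason-for-rating. The field names and wording are my own invention, offered as one possible shape rather than a prescribed format.

    # One possible shape for a course-specific evaluation item, following steps 1-4.
    # Field names and wording are invented; adapt them to your own course.
    from dataclasses import dataclass

    @dataclass
    class EvaluationItem:
        objective: str            # the course objective this item targets (step 1)
        activity: str             # the activity meant to achieve it (step 2)
        prompt: str               # what students are actually asked (step 3)
        choices: list[str]        # labeled, low-inference response options
        ask_reason: bool = True   # always collect the reason-for-rating (step 4)

    item = EvaluationItem(
        objective="recognize how English paragraphs are organized",
        activity="diagramming texts from the textbook",
        prompt="How helpful was diagramming the texts for seeing their organization?",
        choices=["very helpful", "somewhat helpful", "not so helpful"],
    )
    print(item.prompt)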

Alternative Item Design Examples

Don’t feel that you must stick to conventional survey item formats (stem, alternatives, and rating scales) or to vague response labels (“agree” and “disagree”). You can customize your items and choices. Here are a few examples, followed by a sketch of how such responses might be tallied:

    1. Were the directions for reading assignments clear?
      • Directions for the reading assignment were quite clear, and I was able to complete the work without uncertainty or confusion.
      • Directions for the reading assignment were mostly clear, with minor exceptions.
      • Directions for the reading assignment were difficult to understand, and I often had to ask for clarification.
      • Other; please explain: …………………………………………………………………………
    2. What is the most effective way to study the texts and the audio files? Rank order the following choices from 1 to 5, like this:
      1 = most effective for me 5 = least effective for me
      • Listen to the audio file first, then read the text
      • Read the text, then listen to the audio file
      • Read and listen at the same time
      • Skim the reading quickly, then read and listen at the same time
      • Other; please explain: …………………………………………………………………………
    3. Please show how helpful each of the following class activities was in helping you understand how English paragraphs are organized (description, process, comparison, and others). Use one of the numbers:
      1 = very helpful 2 = somewhat helpful 3 = not so helpful 
      • reading the texts in our textbook
      • diagramming the texts to show their organization
      • learning about phrases that signal the organization (for example, first…next, in conclusion, etc.)
      • reading several texts with the same organization to see how they correspond
      • Other; please explain: …………………………………………………………………………
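
As one way to summarize responses to items like examples 2 and 3, here is a short Python sketch with invented responses: an average rank for each study strategy from the ranking item, and a count of helpfulness labels for each activity. The strategy names, activity names, and response data are all hypothetical.

    # Tallying responses to a ranking item and a helpfulness item; data invented.
    from collections import Counter
    from statistics import mean

    # Example 2: each student ranks the study strategies (1 = most effective)
    rankings = [
        {"listen first": 1, "read first": 2, "read and listen": 3, "skim, then both": 4},
        {"listen first": 2, "read first": 1, "read and listen": 4, "skim, then both": 3},
        {"listen first": 1, "read first": 3, "read and listen": 2, "skim, then both": 4},
    ]
    for strategy in rankings[0]:
        avg_rank = mean(r[strategy] for r in rankings)
        print(f"{strategy}: average rank {avg_rank:.1f}")

    # Example 3: count how many students chose each helpfulness label per activity
    helpfulness = {
        "diagramming the texts": ["very helpful", "very helpful", "somewhat helpful"],
        "learning signal phrases": ["somewhat helpful", "not so helpful", "very helpful"],
    }
    for activity, labels in helpfulness.items():
        print(activity, Counter(labels))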

Conclusion

With conventional course evaluation forms, students are not asked to document classroom events or instructor behavior. Usually, they are not asked to explain their responses. This is truly exasperating if you want substantive feedback on your course. But we shouldn’t feel hemmed in by these limitations. Simply asking your students to explain the reasons for their responses on your institution’s evaluation form, and/or designing your own evaluation form will give your students an authentic voice and may surprise you with interesting news about what is going on in your class.

  1. Some find this practice controversial, feeling that most traditional-age students may not be sufficiently experienced to make these kinds of assessments. Others argue that students have quite a lot of experience being in class, and are able to make reasonable inferences.
  2. A more recent treatment of this topic with greater emphasis on reasoning may be found in De Oliveira, S., & Nisbett, R. E. (2017). Culture changes how we think about thinking: From “Human Inference” to “Geography of Thought”. Perspectives on Psychological Science, 12(5), 782-790. 
  3. Moderate correlation for large groups of people is about r = .30 to .40. Social scientists seem to find this satisfactory.
  4. Now cited as Ridenour & Blatt, 1996.
  5. My primary goal was to get intensive feedback to develop the course. The secondary goal was to see what this approach would reveal about what students attend to in their selection of a rating. Hence, this was a field study with small procedural variations annually. I am indebted to colleagues Susan Duggan and Harumi Ogawa for their assistance on the two studies.
  6. This was an admittedly speculative and subjective exercise, motivated by nosiness. Still, it would be worth having a more systematic look at the relationship between the language of rating justifications and the rating selection.
  7. Adapted from “Taking the class ‘temperature:’ Assessing student engagement and course effectiveness” (2014).

References

  • Arreola, R. A. (2000). Catalog of student rating form items. In R. A. Arreola (Ed.) Developing a comprehensive faculty evaluation system: A handbook for college faculty and administrators on designing and operating a comprehensive faculty evaluation system, 2nd ed. Anker Publishing.

  • Billings-Gagliardi, S., Barrett, S. V., & Mazor, K. M. (2004). Interpreting course evaluation results: Insights from think-aloud interviews with medical students. Medical Education, 38(10), 1061-1070.

  • Kolitch, E., & Dean, A. V. (1998). Item 22. Journal on Excellence in College Teaching, 9(2), 119-40.

  • Ostrom, T. M., & Gannon, K. M. (1996). Exemplar generation: Assessing how respondents give meaning to rating scales. In N. Schwarz & S. Sudman (Eds.), Answering questions: Methodology for determining cognitive and communicative processes in survey research (pp. 293–318). Jossey-Bass/Wiley.

  • Sinclair, L., & Kunda, Z. (2000). Motivated stereotyping of women: She’s fine if she praised me but incompetent if she criticized me. Personality and Social Psychology Bulletin, 26(11), 1329-1342.

  • Tourangeau, R., Rips, L. J., & Rasinski, K. (2000). The psychology of survey response. Cambridge University Press.

Christine Winskowski (PhD Psychology) taught ESL/EFL in the U.S., China, and at Iwate Prefectural University, Japan. She has written and presented on the topics of culture learning and students’ course evaluations. Currently, she writes, edits, and consults in Hawaii.
