Administering Speaking Tests: Who We Are Becomes What We Do

By: Alun Roger

Evolving Validity

If you cast your mind back far enough, you may recall a time from your earlier education when you were waiting in the school hallway for an interview test and secretly praying that you would get “Mrs. Smith” as your rater because “she’s always sympathetic and gives high grades!” In this article I want to argue that “Mrs. Smith” may have given you those preferential scores (what she did) partly because of who she was. While it seems intuitive that the characteristics of those we interact with might influence how a discourse unfolds and hence shape what is said (and even how!), it has not always been thus, at least not within the second language (L2) testing community.

A little over 20 years ago, a brief salvo was exchanged between Foot (1999a, 1999b) and Saville and Hargreaves (1999) regarding the validity of pairing test takers together for a speaking test at a time when the effects of test-taker characteristics were little understood. This critique provided an impetus for research which deepened our understanding of test validity. We now know, for example, that the degree of acquaintanceship between interlocutors (O’Sullivan, 2008) and the assertiveness of test takers in group-format tests (Ockey, 2009) can affect spoken output and the scores obtained. If test-taker characteristics can change a performance, then why not those of the test facilitators (examiners and raters) too?

In this article, I argue that our understanding of face-to-face interview speaking tests as a social interaction between all participants (McNamara, 1997) means that the study of personality traits should be extended to include the examiners (individuals who facilitate the test) and raters (individuals who merely observe and score the examiner-test taker interaction). Examples of high-stakes interview speaking tests would be Eiken TEAP, IELTS, and Cambridge English Qualifications.

I argue that an examiner’s or rater’s personality could impact an individual test taker’s spoken performance and the judgement of said performance. This has potentially serious consequences in high-stakes test contexts where the results may determine whether a test taker can immigrate or get into the university of their choice. It is the intention of this article to generate ideas for researchers and also to prompt teachers to reflect on how they choose to interact with or score students in their own speaking tests.

The Co-Constructed Nature of Spoken Interaction

In striving for context validity (replication of the conditions, functions, and features of the Target Language Use domain as closely as possible), it is the very nature of the construct of speaking that confounds our measurement of it for individual test takers. Speaking tests are interactive, social events between a number of “actors” (McNamara, 1997). How the examiner manages the encounter, what the examiner says or leaves unsaid, how they act, and how long their turns last could all influence a test taker’s performance and create variability between individual test takers’ test experiences. An element of the test context (the examiner) becomes an inseparable variable of the performance.

What evidence exists to support such a view? Studies by Brown (2003) and Nakatsuhara (2008) have shown that examiners can have distinctly different interview styles, despite training to strict interlocutor frames. These differences include the level of rapport, functional and topical choices, the way they ask questions or develop topics, and the degree to which they modify their speech to accommodate a test taker. The studies suggest that examiners co-construct test performances, and that the resulting interactions lead to different performances and hence to different estimates of ability. Scoring validity (reliability) is therefore affected by elements of the test context.

McNamara also goes on to state that even the process of rating “is an inherently social act” (1997, p. 453). What evidence is there that raters’ observations or behaviours alter ratings? Are raters aware of the different examiner styles and do they attempt to account for them in their scores? McNamara and Lumley’s (1997) study sheds light on this issue by examining rater perceptions of examiner behaviour. Results showed that raters did seem to compensate test takers who had been tested by a “less competent” examiner (in terms of both managing the test and establishing rapport) by awarding them higher scores. Speaking tests are, therefore, interactive events, and performances are affected by a variety of factors beyond test-taker proficiency alone.

The Role of Personality

In this section I explore how the co-constructed nature of spoken interaction might predict a much broader role for personality within L2 speaking tests. We have seen evidence that examiners vary in their approaches, behaviours, and accommodations. We have seen evidence that raters can perceive these effects and attempt to compensate test takers appropriately, despite training to rate performances by referring to a rating scale. So, what is driving these behaviours and causing this variance?

Personality. At least to some degree. McCrae and Costa’s (2010) Five Factor Model (better known as “The Big Five Personality Traits”) is one of the most commonly used taxonomies in the study of personality today (Corr & Matthews, 2009). This model identifies and describes five main traits of personality using the OCEAN acronym: Openness to Experience, Conscientiousness, Extraversion, Agreeableness, and Neuroticism (Lim, 2020). While it is possible that any of these traits might have some role to play with regard to driving examiner/rater behaviours, I want to suggest that a first place to start looking is trait agreeableness. Agreeable people are altruistic, supportive, and empathic towards others, deriving value and enjoyment from helping them. Highly agreeable people are interested in others’ feelings and needs, often utilising compassion and cooperation to maintain social equilibrium and social relationships, and they experience higher levels of stress from interpersonal conflict (McCrae & Costa, 2010; Corr & Matthews, 2009; Lim, 2020). Agreeableness is a prosocial trait with clear implications for social contexts, such as the co-constructed test-taker performances of L2 speaking tests.

Next, I provide an exemplar to illustrate the hypothesised role that agreeableness could play in examiner behaviours. O’Sullivan (2008) and Brown (2003) both describe studies which indicate that female and male examiners produce qualitatively different test interactions from each other. Female examiners, generally, used fewer fillers, rephrased less, used a broader intonational range, and used expressions of interest and open-ended questioning much more than male examiners. I argue the issue here is not related to sex per se, but to trait agreeableness, which may overlap with sex. Research from psychology (Corr & Matthews, 2009; McCrae & Costa, 2010) shows that males and females are distributed differently along the agreeableness trait such that, generally speaking, females tend to be higher in trait agreeableness than males. This is not a value judgement, nor an absolute; there is plenty of overlap between the sexes in this trait. According to McCrae and Costa’s (2010) data, the female mean for agreeableness was between 0.54 and 0.61 standard deviations higher than the male mean.
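To give a rough sense of what a gap of 0.54 to 0.61 standard deviations means for the amount of overlap, the following is a minimal sketch only. It assumes, purely for illustration, that agreeableness scores in both groups are normally distributed with equal variance; this assumption, the function names, and the choice of summary statistics are not taken from McCrae and Costa's manual.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def overlap_coefficient(d: float) -> float:
    """Shared area of two equal-variance normal curves whose means
    differ by d standard deviations (illustrative assumption)."""
    return 2.0 * normal_cdf(-abs(d) / 2.0)


def probability_of_superiority(d: float) -> float:
    """Chance that a randomly drawn member of the higher-scoring group
    outscores a randomly drawn member of the lower-scoring group."""
    return normal_cdf(d / math.sqrt(2.0))


# The two mean differences reported in the text, in standard deviations.
for d in (0.54, 0.61):
    print(
        f"d = {d:.2f}: overlap = {overlap_coefficient(d):.2f}, "
        f"P(female > male) = {probability_of_superiority(d):.2f}"
    )
```

Under these assumptions the two distributions share roughly three quarters of their area, and a randomly chosen woman would score higher than a randomly chosen man about two times in three. This is consistent with a real average difference that still leaves many individual exceptions.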

Does it seem plausible that the female examiner behaviours displayed in these studies are congruent with people high in trait agreeableness? Individuals high in agreeableness seek to support and help others, maintain social equilibrium, and are conflict-avoidant. Monotone delivery of instructions, questions, or feedback might convey disinterest or feigned interest on the part of the examiner. Open-ended questions allow a test taker to state their ideas in greater depth, which might convey, on the part of the examiner, a sincere interest in the test taker (“Why do you play tennis?” implies tell me about your life). Conversely, closed questions might convey an impression of disinterest on the part of the examiner (“Do you like tennis?” implies yes or no, I don’t really care why). Regular use of expressions of interest might signal one’s cooperation in the performance (“Really? That’s interesting”) or empathy for their situation (“Ok, that’s fine”). Perhaps you will recall times when students you have known were somewhat nervous during an in-class assessment and, wanting to encourage them (there’s that supportive cooperation again!), you caught yourself being a bit too generous with the feedback (“Yeah that’s true!”, “Oh I know what you mean, that’s great!” etc.).

I argue that some of this is the prosocial aspect of agreeableness manifesting itself in behaviours which help to co-construct a spoken performance, and, in doing so, facilitate a social relationship and mitigate the potential for interpersonal conflict. If it is the case that agreeableness predicts certain examiner behaviours, and that during L2 speaking tests, these behaviours help co-construct different performances that lead to different scores, then it raises further questions: What ramifications might this have for examiner training? Might test scripts standardise interviewer responses and/or behaviours? But then, what impact might such standardisation have on the natural, organic flow of interaction?

And then what about raters? As we saw above (McNamara & Lumley, 1997), raters do observe how examiners interact with test takers, and if the examiner (in the rater’s opinion) fails to establish good rapport to the extent that the rater feels it has negatively affected a test taker, they might choose to compensate the test taker with a higher score. Again, given the definition of trait agreeableness, do the observations of rater behaviour (lenient scores for candidates with “poorly performing” examiners) align with our understanding of how we might expect a highly agreeable rater to act? Might highly agreeable raters be more inclined to leniency in rating when observing examiners who, in their mind, fail to support the candidate sufficiently with appropriate prosocial behaviours? Is there something about highly agreeable or highly disagreeable raters, in the way they expect both participants to behave (beyond purely language output), that causes them to perceive the social act of an L2 speaking test in a fundamentally different way, and hence come to differing judgements of the performance? Furthermore, is such compensation itself a benefit or a detriment to the fairness of the test?

We simply do not know the answer to any of these questions because no examination of personality as a facet of examiner/rater characteristics has been attempted to date, which leaves me pondering whether who we are (when acting as examiners and raters) becomes what we do.

The Impact

We (researchers) often get caught up in abstractions of statistics and probabilities, forgetting the individuals beneath those calculations. Imagine Yoko and Hanako, both university students applying to study abroad. Yoko hasn’t really studied too much; she’s outgoing and things “just work out” for her. Hanako, on the other hand, has put in a lot of effort and is more proficient than Yoko. The day of the interview selection arrives and there is a commotion at the interview schedule noticeboard. Yoko rushes past Hanako, smiling contentedly. “I’ve got Prof. Smith,” she says. Whispers float down the line of collected students: one of the interviewers is Prof. Brown! This individual is notorious among the students. A highly disagreeable individual, he often appears very judgmental of students, is overly critical of undergraduate-level work, and sees interacting with students as a distraction from more important matters. Many students feel nervous around him. Selection for the study abroad program will depend upon how these interviews go. Upon reaching the schedule, Hanako sees that she has been assigned to…Prof. Brown. Her heart sinks: “It might be difficult for me to go to the US this year.”

References

Brown, A. (2003). Interviewer variation and the co-construction of speaking proficiency. Language Testing, 20(1), 1-25. https://doi.org/10.1191/0265532203lt242oa
Corr, P. J., & Matthews, G. (2009). The Cambridge Handbook of Personality Psychology. Cambridge University Press.
Foot, M. C. (1999a). Relaxing in pairs. ELT Journal, 53(1), 36-41.
Foot, M. C. (1999b). Reply to Saville and Hargreaves. ELT Journal, 53(1), 52-53.
Lim, A. G. Y. (2020). What Are the Big 5 Personality Traits? https://www.simplypsychology.org/big-five-personality.html
McCrae, R. R., & Costa, P. T. (2010). NEO Inventories For The NEO Personality Inventory-3 (NEO-PI-3), NEO Five-Factor Inventory-3 (NEO-FFI-3), NEO Personality Inventory-Revised (NEO PI-R): Professional Manual. PAR.
McNamara, T. F. (1997). ‘Interaction’ in second language performance assessment: Whose performance? Applied Linguistics, 18(4), 446-466. https://doi.org/10.1093/applin/18.4.446
McNamara, T. F., & Lumley, T. (1997). The effect of interlocutor and assessment mode variables in overseas assessments of speaking skills in occupational settings. Language Testing, 14(2), 140-156. https://doi.org/10.1177/026553229701400202
Nakatsuhara, F. (2008). Inter-interviewer variation in oral interview tests. ELT Journal, 62(3), 266-275. https://doi.org/10.1093/elt/ccm044
O’Sullivan, B. (2008). Modelling Performance in Tests of Spoken Language (Vol. 12). Peter Lang.
Ockey, G. (2009). The effects of group members’ personalities on a test taker’s L2 group oral discussion test scores. Language Testing, 26(2), 161-186.
Saville, N., & Hargreaves, P. (1999). Assessing speaking in the revised FCE. ELT Journal, 53(1), 42-51.

Alun Roger (PhD) is an associate professor at Nagoya Gakuin University, where he teaches Applied Linguistics (CLIL) to undergraduates and SLA for the graduate program. His research interests focus on testing and psychometrics.