So, Why Integrated Testing?

By: Tony Gallucci

I started teaching English as a Foreign Language at Wolfson College, Cambridge, in 1995, with my Japanese students teaching me more about the realities of teaching than any subsequent training ever could. They were challenging, talented, studious, and hilarious. The group presented a wide range of language abilities and was culturally so phenomenally different from anything I had experienced before that I soon realized I was learning as much as they were. As I juggled a wide range of didactic, pedagogic, linguistic, and social issues in my classes, all to try and steer them towards their desired aim, I slowly realized that aim was not an exam.

These students wanted to learn English—the exam was just a thing they had to do. I taught them English, and at the end of the course they took the test.

Now, when I say “the test,” I mean it wasn’t my test. It was a test designed by the university, but it didn’t really match up with my students’ desires, or their ability to communicate in the language.

A few years later, I encountered another problem. During a brief stint at a prestigious examination assessment center, I was tasked, as one of many poor unfortunates, with assessing candidates’ written responses to open-ended questions and essay prompts. As one would expect from so august an institution, the exams were developed and quality checked by teams of professional exam developers. Yet, I—along with many other graders—had issues. Why?

Well, sometimes I felt a candidate’s response was correct even though that response did not match the answer specified by the exam developers. So, what exactly was I assessing? The candidate’s ability to guess what the exam developers expected, or the candidate’s ability to respond appropriately to the questions and prompts?

And so began years of frustration with exams, revolving around key questions of what we are assessing and how we are assessing students’ work.

The discrete charm of assessing the four skills

The majority of modern language exams assess the four skills of reading, writing, listening, and speaking, and there is a common tendency to try to assess each one separately. When assessing the active skills of speaking and writing, we have criteria of accuracy, cogency, clarity, control, and competence, which we can apply within a grade scale with a certain amount of grader discretion.

But with the receptive skills of reading and listening it gets a little muddy. Often, listening and reading are assessed by a written response in one of its many diverse and wonderful forms. So, right from the outset we must acknowledge that we’re checking students’ listening and reading comprehension through the filter of their writing.

After going through the umpteenth reading comprehension exam paper, I am usually reaching for some aspirin and yearning for the days of multiple choice and single-word responses. The rushed, hand-scrawled responses, with their distracting corrections and crossings out and deviously intricate arrows pointing to an extra sentence written in the margins (even when the directions expressly say not to do so), emerge as a convoluted mess that needs a good deal more deciphering than any teacher wants to do. This extends the marking time and, with it, the grader’s fatigue. With fatigue comes cognitive variance, and the disparity in grading between the first and last paper (not to mention the amount of coffee I can drink without becoming psychotic) becomes an issue.

I have to ask myself: Am I truly assessing the knowledge they gleaned from the text accurately and impartially? And when reading a short answer response aimed at testing reading or listening, can I truly say I am not influenced by the grammatical accuracy, the style, or the persuasiveness of the response?

Combined approaches

For me, the solution to these problems is an integrated approach to testing. There’s nothing new in this approach: Carroll wrote about it in his “Model of School Learning” back in 1963, and there has been discussion about its use ever since. An integrated testing approach certainly has its limitations but, for an experienced examiner, the benefits far outweigh the drawbacks.

But I am getting ahead of myself.

What is integrated testing?

Most commonly, it is the combination of two or more language skills in a single task, and the response provides information on the performance of those skills. In essence, you assess a student’s reading or listening comprehension through their spoken and/or written response, with the accuracy, depth, and breadth of that understanding expressed in their active (spoken and/or written) production.

One can look at it in the form of:

    Required information  –  Expression of understanding
    (what is taken in via reading)  –  (the written response expressing that information)

Let’s look at an example. Imagine an exam task asks students to read two written texts with opposing views and subsequently explain the perspectives of each author, citing examples from both texts to support their interpretation. A student’s reading comprehension can be clearly determined from the information contained in the response—did the student understand the different perspectives? The student’s writing can be assessed by how well that comprehension is communicated in terms of syntax, accuracy, style, etc.

The advantages are distinct. There is only one exam with two areas to assess, which equates to a notable saving in time and brain power. The frustrating question of “Am I assessing the reading or the writing?” is obviated by actually assessing the reading in the writing. The assessor knows the arguments and information presented in the passive material provided and then determines the cogency and effectiveness of the active responses.

Problem solved!

Except, it isn’t.

There are issues with an integrated testing approach. There is a higher degree of subjectivity inherent in the assessment. In comparison to a short-answer response, grading a response to an integrated test task is far more nuanced and requires a more detailed understanding of the grading criteria. With any form of extended written or spoken response, there will always be a degree of interpretation. And what I’ve noticed when more than one person grades the same integrated test task is that different assessors can end up with very different final scores, despite rubrics and a shared understanding of their criteria.

Assessors are not machines. If I feel engaged by the witty, nuanced style of a written response, I will be more attentive to the student’s message and more partial to their efforts (and, ironically, also more critical of them), but another assessor might be less favourably disposed to a style they don’t consider “academic enough.”

So, there is a larger element of interpretation in an integrated testing approach, but it is unrealistic to suggest this element is not present in short-answer questions designed to assess only one skill area. With a clear grading rubric and a detailed exam requirement key, such variability can be mitigated, but obviously never fully eliminated.

One example: Integrated speaking and listening task

A clear strength of integrated testing is that it can be adapted to suit different learning goals.

Here’s a case in point. In a Presentations and Public Speaking course, I was compelled to give a listening assessment to comply with the faculty’s assessment criteria. For 12 weeks, I trained the students to speak clearly, effectively, and confidently, and then I had to give them a graded listening exam which was not part of the core speaking training.

This disconnect between classroom instruction and assessment prompted me to try out an integrated testing approach in which I first recorded public speakers debating a topic. Next, each student listened to just one speaker and prepared a presentation of that speaker’s views.

Compared with the previously employed sit-down listening exam, a sound file played twice with 30-odd questions requiring written answers, the difference was explicit: I was testing my students on how much they understood, and how well they could convey that information. It was less an exercise in quickly writing answers to numerous questions about a sound file and more a cogent demonstration of their listening and speaking abilities.

This, then, would be a Listening-Speaking variant. While listening to the file, the students took copious notes; they could optionally be given a task or question(s) to focus their response, and they were given time to construct their presentation. The writing in this exam was not assessed; it was simply a medium with which to complete the task. If their grammar or syntax was incorrect, as demonstrated in their speaking, the rubric would acknowledge this, as it would cogency, fluency, lexis, and execution. But it would not look at any element of writing.

It worked. And it worked astonishingly well.
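
For readers who like to see the mechanics, here is a minimal sketch, in Python, of how the two strands of such a rubric might be kept separate when scoring a single task. The delivery criteria (cogency, fluency, lexis, and execution) are the ones named above; the comprehension criteria, the band scale, and the 50/50 weighting are purely illustrative assumptions, not a prescribed standard.

    # A minimal sketch of a two-strand rubric for an integrated
    # Listening-Speaking task. The delivery criteria come from the
    # description above; everything else here is an illustrative assumption.

    from dataclasses import dataclass


    @dataclass
    class CriterionScore:
        criterion: str
        band: int  # e.g. 0-5, anchored by written descriptors on the rubric


    # Comprehension strand: how much of the source speaker's argument was
    # understood (hypothetical criterion names).
    COMPREHENSION_CRITERIA = ["accuracy of content", "coverage of main points"]

    # Delivery strand: how well that understanding is expressed in speech.
    DELIVERY_CRITERIA = ["cogency", "fluency", "lexis", "execution"]


    def strand_average(scores: list[CriterionScore]) -> float:
        """Average the band scores within one strand of the rubric."""
        return sum(s.band for s in scores) / len(scores)


    def overall_band(comprehension: list[CriterionScore],
                     delivery: list[CriterionScore],
                     comprehension_weight: float = 0.5) -> float:
        """Weight the two strands into a single mark; the 50/50 split is arbitrary."""
        return (comprehension_weight * strand_average(comprehension)
                + (1 - comprehension_weight) * strand_average(delivery))


    if __name__ == "__main__":
        comprehension = [CriterionScore("accuracy of content", 4),
                         CriterionScore("coverage of main points", 3)]
        delivery = [CriterionScore(name, 4) for name in DELIVERY_CRITERIA]
        print(f"Overall band: {overall_band(comprehension, delivery):.1f}")

The point of the sketch is simply that the listening is assessed through what the presentation contains, while the speaking is assessed through how it is delivered, and neither has to be untangled from a pile of hand-written short answers.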

This kind of adaptation can be used to combine two or more skills, with Reading and Writing being the most common combination. But different combinations are possible, depending on the task, the material, or indeed the course. A fully integrated exam, asking students to write an essay based on connected data from both Listening and Reading material, followed up by a presentation and discussion, can lead to a detailed and thorough demonstration of their language abilities, in a way that mirrors how the language is most likely to be used in everyday work or social situations.

Conclusion

So, you may well be asking yourself: is this all a load of imprecise, highly convoluted, and organizationally nightmarish over-complication?

Jein, as the Germans say.[1]

[1] Jein (Ja und Nein) is the German expression for the equivocation well… yes and no.

Yes, it can be complex to organize, especially at the outset, and there are certainly factors the examiner must be cognizant of to ensure standards and reliability. But the process of construction and evaluation becomes logical very quickly. Best-practice procedures and quality-control checks can reduce many aspects of variability and, after a few sessions of a well-implemented procedure, going back to fully discrete testing seems illogical and somewhat archaic.

As a more rounded assessment that follows organically from course material and topics, especially in thematic or topic-based courses, it simply fits better. By integrating the core skills, the student gets to demonstrate their knowledge and competence in a more authentic way.

For the examiner, the benefits of a well-constructed exam far outweigh any potential weaknesses. You actually get to assess how well the student demonstrates what they learned throughout the course, and how well they communicate it, in a manner more fitting to any metric they will be judged by outside the classroom. And, as a final point, it does cut down on marking time—and that has to be a good thing.

References

  • Carroll, J. B. (1963). A model of school learning. Teachers College Record, 64(8), 723–733.

Tony Gallucci is an English-Italian medieval historian from Cambridge (the proper one) who has worked in the field of languages for the last 30 years and still regularly asks himself why.
