The following is an excerpt from “As Good As Gold? Why We Focus on the Wrong Drivers in Education,” the first in the new series of Gold Papers by John Hattie and Arran Hamilton.
In many education systems, every teacher must undergo at least one observation per year by their school leader. Heads and principals generally rate their teachers against some form of rubric or scoring sheet. At our last count, we located more than 120 published observation forms with some evidence of their reliability and validity.
These observations are often used for performance management, to identify the "good" and "less good" teachers, and by national inspectorates to make more holistic judgments about whether a school is outstanding, good, or poor. They are also used for developmental purposes: teachers peer-review one another's lessons so they can offer advice and harvest good practice to apply back in their own classrooms. Finally, they can be used to sift education pyrite from education gold, by observing the impact of a new education product or teacher development program in the classroom.
But we should ask ourselves an important question: Can you actually see, hear, and sniff a good lesson? Are our five senses any good at measuring outstanding, adequate, and poor? Can we see the impact of a teacher in a class of students? Do we watch teacher performance, or do we watch the impact on the students? And what if the performance is spectacular, but the impact is of little consequence?
If we phrase these questions as binary yes/no choices, then the answer to whether we can make meaningful and rigorous observations is a resounding yes. And, by binary, we mean questions where there is a clear yes/no answer, like:
- Is the teacher in the classroom?
- Are they talking to the class?
- Are the children all awake?
- Has homework been set and marked?
It’s relatively straightforward to establish a sampling plan for each of these, and any two observers will have a high degree of consistency in their observations (with minimal training), even if they are not educationalists.
So, for these kinds of binary questions about the performance, we can see, hear, and sniff reasonably reliably. We could probably stretch from binary questions to inquiring about frequency, such as how often something occurred (e.g., Were all the students awake, all the time during the lesson?).
But when we want to use observation to determine whether the teacher delivered a high-quality lesson, we ask questions like these:
- Did the teacher deliver a “good” lesson?
- Did the students “achieve” the learning objective?
- Were the learning objectives worthwhile, appropriate, and sufficiently challenging for the students?
- Was the classwork a “good” fit with classroom-based activity?
- Did the teacher provide “good” feedback?
- Were the education products “effective”?
- Did the teacher-training program deliver “impact” in the classroom?
With these questions, we open a huge can of worms. Who decides what “good” is, and who decides what “impact on students” means?
To answer these questions, observers rely on proxies for learning. A proxy measure uses something that is easy to collect data on to tell us about something that is much harder to measure directly. For example, doctors rely on blood tests, blood pressure measurements, and heart rate analyses to tell them whether a patient is fit and well. And, generally, these work relatively well. However, it’s possible to have a rare type of illness that does not show up on these tests, which means that you might be given a clean bill of health by the doctor but actually be at death’s door.
It’s the same with lesson observations. It is possible that, when we measure with our eyes, we are looking in the wrong areas. We may see busy, engaged students in a calm and ordered classroom where some students have supplied the correct answers and we conclude that a heck of a lot of learning is going on. Yet it is quite possible that absolutely nothing of any significance is being learned at all (as in the good old days where teachers practiced their lessons before the inspector came).
We know, too, that much of what goes on inside the classroom is completely hidden. The late great Graham Nuthall, in his seminal work The Hidden Lives of Learners (2007), theorizes that there are three separate cultural spheres at play in the classroom: the Public Sphere, which in theory is controlled by the teacher; the Social Sphere of the students, which the teacher is often unaware of; and the Private Mental Worlds of the students themselves, which neither the teacher nor the other students can directly access. In short, most of what goes on in the classroom is inaccessible to the teacher, and even less of it is accessible to a third-party observer.
Confounding this, the evidence from neuroscience suggests that, of the vast array of data collected by our various senses each second, very little is actively processed by the conscious mind. So even within the Public Sphere that we have direct access to as observers, it’s likely that we see very little. As we focus narrowly on some aspects of classroom practice, we miss the stooge in a gorilla suit dancing across the room. As observers, we have our own lens, our own theories, and our own beliefs about what we consider “best” practice, and these can bias the observations, no matter how specific the questions in any observation system. Most observations of other teachers end up with us telling the teachers how they can teach like us!
The challenge with observation is that we often end up seeing what we want to see, guided by our cognitive biases. The process of observing is like interpreting a Rorschach image, one of those inkblots that psychiatrists show to their patients, where some say they can see their mother and others JFK.
There has been quite a lot of research into the problem of lesson observation in the last few years. The strongest dataset comes from the Measures of Effective Teaching (MET) project, which was funded by the Bill & Melinda Gates Foundation (2013). The MET study concluded that a single lesson observed by one individual, where the purpose was to rate teacher performance, has a 50% chance of being graded differently by a different observer. In the best-case scenario, where a teacher undergoes six separate observations by five separate observers, there is “only” a 72% chance of agreement and thus a 28% chance that the judgments are misaligned to the lesson observation rubric. Now that’s a whole lot of observation for still almost a one-in-three chance of error.
Observers frequently disagree about what they are observing, even with a well-established observation schedule. In assessment, we call this the interrater reliability problem.
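To make the interrater reliability problem concrete, the sketch below computes Cohen's kappa, a standard statistic that measures how much two raters agree beyond what chance alone would produce. The ratings and function name are our own illustration, not MET data; they simply show how even 70% raw agreement shrinks once chance agreement is accounted for.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: share of items where the raters gave the same grade.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement: what two raters with these grade frequencies
    # would agree on by chance alone.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Two hypothetical observers grading the same ten lessons on a 3-point rubric.
a = ["good", "good", "poor", "ok", "good", "ok", "poor", "good", "ok", "good"]
b = ["good", "ok",   "poor", "ok", "poor", "ok", "ok",   "good", "ok", "good"]
print(round(cohens_kappa(a, b), 2))  # → 0.55
```

Here the observers agree on 7 of 10 lessons, yet kappa is only 0.55, conventionally read as merely "moderate" agreement, because raters who mostly hand out "good" grades will often match by luck alone.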
In short, the whole process of lesson observation (when used to measure the impact of training, a new product, or the effectiveness of a teacher) is riddled with many of the cognitive biases that we describe further in our paper.
To read more, please see the Gold Papers on VisibleLearningplus.com.