Saturday / December 21

The Limits of Assessment

The following is an excerpt from “As Good As Gold? Why We Focus on the Wrong Drivers in Education,” the first in the new series of Gold Papers by John Hattie and Arran Hamilton. Read an earlier excerpt, “The Limits of Lesson Observation.”


Assessment

High-stakes assessment has been an important rite of passage throughout much of human history. Many ancient cultures and tribal societies required their young to undertake risky and painful quests to mark the transition to adulthood. For Aboriginal Australians, this involved boys surviving unaided in the outback for up to six months, using the skills that they had been taught during childhood. For some African tribes, it involved successfully hunting a lion. In some South American communities, the transition to adulthood involved being able to demonstrate a very high threshold for pain, including the imbibing of neurotoxins.

The ancient Chinese were possibly the first to develop a national written assessment system. This was called the Imperial Examination, and it was used as a mechanism to select administrators for government posts (Fukuyama, 2011). The system originated in 605 AD as a way of avoiding hereditary appointments to government office. Candidates would be placed in individually curtained examination cells to undertake the written assessment, which lasted for several days. At night, their writing board doubled as a bed.

It is this rite of passage that we continue to deploy today in the form of national school-leaver examinations such as the SAT and the International Baccalaureate (IB). Modern educational assessments are high stakes but carry none of the physical risk of the tribal tests (although they can invoke high levels of stress). Different times, different measures. The SAT, A Levels, IB, and other assessments signal to employers and training providers that school leavers have acquired the required skills for the next stage of their journey.

These assessments can tell us, often with relatively high levels of accuracy, a student’s level of competence in mathematics, literacy, foreign languages, and science, and about the depth and breadth of knowledge the student has acquired across a range of curriculum areas. From this, we can also make inferences about a student’s readiness for university studies and life beyond school, albeit with less precision (as we may also need to include the capacity to learn, address challenges, be curious, feel a sense of belonging in more open learning environments, achieve financial security, and gather support from others).

Navigating by the Light of the Stars

The outcomes of high-stakes summative assessments are also often used to make inferences about the quality of schools (e.g., school league tables), school systems (e.g., PISA, Trends in International Mathematics and Science Study [TIMSS], and Progress in International Reading Literacy Study [PIRLS]), and individual teachers, and about whether certain education products and programs are more effective than others. In other words, they are often used in the quest to find education gold.

In this context, high-stakes assessments are blunt instruments: akin to piloting your boat by the stars on a cloudy night rather than with a GPS system. We can infer something about which schools are higher and lower performers, but we need to carefully tease out background variables, like the starting points and circumstances of the learners, and multiple other important outcomes, so that we can measure the distance traveled rather than the absolute end point in one set of competencies. Indeed, all too often we find that the greatest variability in learning outcomes is not between different schools but between different teachers within the same school (McGaw, 2008). The key unit of analysis should therefore be the teacher rather than the school, and the outcomes of many high-stakes assessments may not be attributable to a particular school at all.

In the context of individual teachers (provided there is a direct link between the teacher and the particular content assessed), the outcomes of high-stakes assessments can tell us quite a lot about which teachers are more or less effective, particularly where the pattern of performance holds over several years. Again, care is needed, as it is not only the outcomes of the assessments but the growth from the beginning to the end of the course that should be considered. Otherwise, teachers who start with students already knowing much but growing little look great, and those who start with students who know less at the beginning but grow remarkably look poor, when it should be the other way around.
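To make the arithmetic concrete, here is a minimal sketch (in Python, with invented scores; the Gold Paper itself prescribes no particular calculation) of how ranking teachers by end-of-year attainment and ranking them by growth can point in opposite directions:

```python
# Minimal sketch with invented data: scores out of 100 for two classes.
# Teacher A's students start high and grow little; Teacher B's start
# low and grow a lot. Ranking by final attainment alone rewards A;
# ranking by growth (the distance traveled) rewards B.

classes = {
    "Teacher A": {"start": [82, 85, 88, 90], "end": [84, 86, 89, 91]},
    "Teacher B": {"start": [40, 45, 50, 55], "end": [65, 70, 72, 78]},
}

def mean(xs):
    return sum(xs) / len(xs)

for teacher, scores in classes.items():
    attainment = mean(scores["end"])
    growth = attainment - mean(scores["start"])
    print(f"{teacher}: attainment = {attainment:.1f}, growth = {growth:+.1f}")

# Teacher A: attainment = 87.5, growth = +1.2
# Teacher B: attainment = 71.2, growth = +23.8
```

On attainment alone, Teacher A looks far stronger; on growth, the picture reverses.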

But unless the outcomes of high-stakes student assessments are reported back to schools at the item level (i.e., how well students did and grew on each component of the assessment, rather than just the overall grade), teachers are left in the dark about which elements of their practice (or of third-party products) are more or less effective or completely ineffective. They just know that, overall, by the light of the stars, they are navigating in the right or wrong direction. And even where they are navigating in the wrong direction, there are likely some elements of their tradecraft or product kitbag that are truly outstanding but are missed.
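To see how an overall grade can hide exactly the information a teacher needs, consider this minimal sketch (again in Python, with invented marks): two students with identical overall grades but very different item-level profiles.

```python
# Minimal sketch with invented marks: the same overall grade can mask
# very different item-level profiles. Only the per-component breakdown
# tells the teacher which elements of practice may need attention.

items = ["algebra", "geometry", "statistics", "reasoning"]

students = {
    "Student 1": [90, 88, 30, 92],   # strong overall, weak in statistics
    "Student 2": [75, 75, 75, 75],   # uniformly solid
}

for name, marks in students.items():
    overall = sum(marks) / len(marks)
    profile = ", ".join(f"{item}: {mark}" for item, mark in zip(items, marks))
    print(f"{name}: overall = {overall:.0f} ({profile})")

# Student 1: overall = 75 (algebra: 90, geometry: 88, statistics: 30, reasoning: 92)
# Student 2: overall = 75 (algebra: 75, geometry: 75, statistics: 75, reasoning: 75)
```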

Even where teachers are able to access item-level data from high-stakes evaluation, the inferential jump they must make to systematically map this back to specific elements of their tradecraft, or to the impact of specific training programs or pieces of educational technology, is too great to make with any meaningful fidelity. In other words, the outputs of high-stakes examinations are not reported at high enough resolution to tease out, with high confidence, the educational cargo cults from the education gold. Moreover, such examinations are often a single event (two to three hours on one day), and the inference from that event back to the teaching and learning is too great a leap.

Navigating With a GPS System

The only way we can use student achievement data with any sense of rigor to sift out the education gold is by collecting data (formatively) at the beginning and middle of the journey and (summatively) at the end, to systematically measure the distance traveled by individual students and groups of learners. By experimentally varying very narrow elements of teacher practice, or aspects of educational products and programs, we can see whether this results in an upward or downward spike in student performance. It is as important to know about the efficiency and effectiveness of the journey as it is to reach your destination. This is one of the benefits of GPS systems.
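What that kind of analysis might look like can be sketched in a few lines. Here is a minimal illustration (in Python, with invented gain scores, using Cohen’s d as a simple effect-size measure; the Gold Paper prescribes no particular statistic): comparing pre-to-post growth in a class where one narrow element of practice was varied against a comparison class.

```python
# Minimal sketch with invented data: compare pre-to-post gains in a
# class where one narrow element of practice was varied against a
# comparison class, using Cohen's d as a simple effect-size measure.
from statistics import mean, stdev

def cohens_d(treatment_gains, control_gains):
    """Standardized mean difference between two sets of gain scores."""
    diff = mean(treatment_gains) - mean(control_gains)
    n_t, n_c = len(treatment_gains), len(control_gains)
    pooled_sd = (((n_t - 1) * stdev(treatment_gains) ** 2 +
                  (n_c - 1) * stdev(control_gains) ** 2) /
                 (n_t + n_c - 2)) ** 0.5
    return diff / pooled_sd

# Gain (post minus pre) for each student, invented for illustration.
varied_practice = [12, 15, 9, 14, 11, 13]
comparison = [8, 10, 7, 9, 6, 11]

print(f"Effect size d = {cohens_d(varied_practice, comparison):.2f}")  # d = 1.90
```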


Summative vs. Formative Evaluation

Too often, teachers see summative assessment as “bad” and formative assessment as “good,” when this is nonsense. Some see summative assessment as needing to be highly reliable while assuming that, with formative assessment, the measurement rigor can be less. If formative assessment is the more powerful, then it, too, needs to be based on highly valid measures and observations.

We prefer to use the terms formative and summative evaluation and to abandon the misleading terms formative and summative assessment. Our arguments and analysis in this section have principally been about the use of summative evaluation as a systematic mechanism to make inferences about what’s education gold. But we want to stress that it’s more often about what an assessment is used for than about the mechanism of data collection itself. That is, the same assessment instrument can be used both formatively and summatively. As Bob Stake puts it: when the cook tastes the soup, it is formative; but when the guest tastes the soup, it is summative.

Within the context of the individual teacher in the individual classroom, we know that formative evaluation is educational gold in and of itself (Hattie & Timperley, 2007). The most effective approach to formative evaluation contains three components:

  • Feed-up: Where am I going?
  • Feed-back: How am I doing?
  • Feed-forward: What is my next step?

What is important is not the testing itself but the way that it is incorporated into the cycle of challenging goals to support learners in unlocking the skill, will, and thrill to learn.


The challenge, of course, is that “everything seems to work somewhere and nothing everywhere” (Wiliam, 2014). So, even where this analysis is conducted systematically, we cannot be completely certain that the educational approach, training program, or technology intervention that resulted in education gold in one context will not end up being pyrite in quite another.

We need repeated evaluation projects that investigate the same approaches across many different contexts to give us much greater confidence in the fidelity of our findings. And once we have these data, we face the challenge of vacuuming them up from disparate sources and drawing out the common threads to build a compelling narrative about what’s gold. We can then ask not only about overall effects but also under what conditions and for which students programs work best. Thankfully, a great deal of progress has been made here through the use of meta-analysis, and we discuss this in the next section.
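The basic move of meta-analysis can itself be sketched briefly. Here is a minimal illustration (in Python, with invented effect sizes and variances, using a simple fixed-effect, inverse-variance-weighted average; real syntheses involve far more care):

```python
# Minimal sketch with invented numbers: pool effect sizes from several
# evaluations of the same approach using a fixed-effect,
# inverse-variance-weighted average.

# (effect size d, variance of d) for three hypothetical studies
studies = [(0.45, 0.02), (0.30, 0.05), (0.60, 0.04)]

weights = [1 / var for _, var in studies]
pooled = sum(w * d for (d, _), w in zip(studies, weights)) / sum(weights)

print(f"Pooled effect size = {pooled:.2f}")  # 0.46
```

More precise studies (smaller variances) count for more, which is how the common threads get drawn out of disparate sources.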


To read more, please see the Gold Papers on VisibleLearningplus.com.

Written by

Dr. John Hattie has been Professor of Education and Director of the Melbourne Education Research Institute at the University of Melbourne, Australia, since March 2011. He was previously Professor of Education at the University of Auckland. His research interests are based on applying measurement models to education problems. He is president of the International Test Commission, has served as an advisor to various Ministers, chaired the NZ Performance-Based Research Fund, and in the most recent Queen’s Birthday honours was appointed to the New Zealand Order of Merit for services to education. He is a cricket umpire and coach, enjoys being a dad to his young men, is besotted with his dogs, and moved with his wife as she attained a promotion to Melbourne. Learn more about his research at www.corwin.com/visiblelearning, and view his Corwin titles here.

Dr. Arran Hamilton is Group Director of Strategy at Cognition Education. His early career included teaching and research at Warwick University and a stint in adult and community education. Arran transitioned into educational consultancy more than 15 years ago and has held senior positions at Cambridge Assessment, Nord Anglia Education, Education Development Trust (formerly CfBT), and the British Council. Much of this work was international and focused on supporting Ministries of Education and corporate funders to improve learner outcomes.
