Wednesday / April 24

The Limits of Assessment

The following is an excerpt from “As Good As Gold? Why We Focus on the Wrong Drivers in Education,” the first in the new series of Gold Papers by John Hattie and Arran Hamilton. Read an earlier excerpt, “The Limits of Lesson Observation.”


High-stakes assessment has been an important rite of passage throughout much of human history. Many ancient cultures and tribal societies required their young to undertake risky and painful quests to mark the transition to adulthood. For the Australian Aboriginals, this involved boys surviving unaided in the outback for up to six months, using the skills that they had been taught during childhood. For some African tribes, it involved successfully hunting a lion. In some South American communities, the transition to adulthood involved being able to demonstrate a very high threshold for pain, including the imbibing of neurotoxins.

The ancient Chinese were possibly the first to develop a national written assessment system. This was called the Imperial Examination and it was used as a mechanism to select administrators for government posts (Fukuyama, 2011). The system originated in 605 AD as a way of avoiding hereditary appointments to government office. Candidates would be placed in individually curtained examination cells to undertake the written assessment, which lasted for several days. At night, their writing board doubled as a bed.

It is this rite of passage that we continue to deploy in the form of national school-leaver examinations, such as the SAT and the International Baccalaureate (IB), today. Modern educational assessments are high stakes but without the physical risk of the tribal tests (although they can invoke high levels of stress). Different times, different measures. The SAT, A Levels, IB, and other assessments signal to employers and training providers that school leavers have acquired the required skills for the next stage of their journey.

These assessments can tell us, often with relatively high levels of accuracy, a student’s level of competence in mathematics, literacy, foreign languages, and science and about the depth and breadth of knowledge the student has acquired across a range of curriculum areas. From this, we can also make inferences about a student’s readiness for university studies and life beyond school, albeit with less precision (as we may need to also include the proficiency to learn, address challenges, be curious, feel a sense of belonging in more open learning environments, achieve financial security, and gather support from others).

Navigating by the Light of the Stars

The outcomes of high-stakes summative assessments are also often used to make inferences about the quality of schools (e.g., school league tables), school systems (e.g., PISA, Trends in International Mathematics and Science Study [TIMSS], and Progress in International Reading Literacy Study [PIRLS]), and individual teachers and about whether certain education products and programs are more effective than others. In other words, they are often used in the quest to find education gold.

In this context, high-stakes assessments are blunt instruments, akin to piloting your boat by the stars on a cloudy night rather than with a GPS system. We can infer something about which schools are higher and lower performers, but we need to carefully tease out background variables like the starting points and circumstances of the learners and multiple other important outcomes, so that we can measure the distance traveled rather than the absolute end point in one set of competencies. Indeed, all too often we find that the greatest variability in learning outcomes is not between different schools but between different teachers within the same school (McGaw, 2008). The key unit of analysis should be the teacher rather than the school, and many high-stakes assessments may not be attributable to a particular school.

In the context of individual teachers (provided there is a direct link between the teacher and the particular content assessed), the outcomes of high-stakes assessments can tell us quite a lot about which teachers are more or less effective, particularly where the pattern of performance holds over several years. Again, care is needed, as it is not only the outcomes of the assessments but the growth from the beginning to end of the course that should be considered. Otherwise, those teachers who start with students already knowing much but growing little look great, and those who start with students who know less at the beginning but grow remarkably look poor, when it should be the other way around.
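The difference between judging by end point and judging by growth can be shown with a little arithmetic. The sketch below is only an illustration: the teacher names and every score in it are hypothetical, invented for this example, and are not data from the paper. It assumes each student has a score at the start and at the end of a course on the same scale.

```python
# Hypothetical scores for illustration only; not data from the paper.
from statistics import mean

# Each pair is (start-of-course score, end-of-course score) for one student.
classes = {
    "Teacher A": [(80, 83), (85, 86), (78, 82)],  # starts high, grows little
    "Teacher B": [(40, 62), (45, 70), (50, 68)],  # starts low, grows a lot
}

def summarize(scores):
    """Return (mean end score, mean growth) for a list of (start, end) pairs."""
    end_mean = mean(end for _, end in scores)
    growth_mean = mean(end - start for start, end in scores)
    return end_mean, growth_mean

summary = {name: summarize(scores) for name, scores in classes.items()}

# Pick the "top" teacher by final attainment and, separately, by growth.
by_attainment = max(summary, key=lambda name: summary[name][0])
by_growth = max(summary, key=lambda name: summary[name][1])

for name, (end_mean, growth_mean) in summary.items():
    print(f"{name}: mean end score {end_mean:.1f}, mean growth {growth_mean:.1f}")
print("Highest attainment:", by_attainment)
print("Highest growth:", by_growth)
```

With these made-up numbers, the two rankings disagree: the class that finishes highest is not the class that traveled furthest. A simple difference of scores is the crudest possible growth measure and is used here only to make the point.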

But unless the outcomes of high-stakes student assessments are reported back to schools at the item level (i.e., how well students did and grew on each component of the assessment, rather than just the overall grade), teachers are left in the dark about which elements of their practice (or third-party products) are more or less effective or completely ineffective. They just know that overall, by the light of the stars, they are navigating in the right or wrong direction. And even where they are navigating in the wrong direction, there are likely some elements of their tradecraft or product kitbag that are truly outstanding but are missed.

Even where teachers are able to access item-level data from high-stakes evaluation, the inferential jump that they must make to systematically map this back to specific elements of their tradecraft or the impact of specific training programs or pieces of educational technology is too great to do with any meaningful fidelity. In other words, the outputs of high-stakes examinations are not reported at high enough resolution to tease out, with high confidence, the educational cargo cults from education gold. Too often, such examinations are a single event (two to three hours on one day), and the inference from this event to the teaching and learning is too great a leap.

Navigating With a GPS System

The only way we can use student achievement data with any sense of rigor to sift out the education gold is by collecting data (formatively) at the beginning, middle, and (summatively) end of the journey to systematically measure the distance traveled by individual students and groups of learners. By experimentally varying very narrow elements of teacher practice or aspects of educational products and programs, we can see whether this results in an upward or downward spike in student performance. It is as important to know about the efficiency and effectiveness of the journey as it is to reach your destination. This is one of the benefits of GPS systems.

Summative vs. Formative Evaluation

Too often, teachers see summative assessment as “bad” and formative assessment as “good,” when this is nonsense; some assume that summative assessment needs to be highly reliable but that formative assessment can get by with less measurement rigor. If formative is more powerful, then it, too, needs to be based on highly valid measures and observations.

We prefer to use the terms formative and summative evaluations and abandon the misleading terms formative and summative assessments. Our arguments and analysis in this section have principally been about the use of summative evaluation as a systematic mechanism to make inferences about what’s education gold. But we want to stress that it’s more often about what it is used for than the mechanism of data collection itself. That is, the same assessment instrument can be used both formatively and summatively. As Bob Stake puts it: when the cook tastes the soup, it is formative; but when the guest tastes the soup, it is summative.

Within the context of the individual teacher in the individual classroom, we know that formative evaluation is educational gold in and of itself (Hattie & Timperley, 2007). The most effective approach to formative evaluation contains three components:

  • Feed-up: Where am I going?
  • Feed-back: How am I doing?
  • Feed-forward: What is my next step?

What is important is not the testing itself but the way that it is incorporated into the cycle of challenging goals to support learners in unlocking the skill, will, and thrill to learn.

The challenge, of course, is that “everything seems to work somewhere and nothing everywhere” (Wiliam, 2014). So, even where this analysis is conducted systematically, we cannot be completely certain that the educational approach, training program, or technology intervention that resulted in education gold in one context will not end up being pyrite in quite another.

We need repeated evaluation projects that investigate the same approaches across many different contexts to give us much greater confidence in the fidelity of our findings. And once we have these data, we face the challenge of vacuuming them up from disparate sources and drawing out the common threads to build a compelling narrative about what’s gold. We can then ask not only about overall effects, but under what conditions and for which students programs work best. Thankfully, a great deal of progress has been made here through the use of meta-analysis and we discuss this in the next section.

To read more, please see the Gold Papers on

Written by

Dr. John Hattie has been Professor of Education and Director of the Melbourne Education Research Institute at the University of Melbourne, Australia, since March 2011. He was previously Professor of Education at the University of Auckland. His research interests are based on applying measurement models to education problems. He is president of the International Test Commission, served as advisor to various Ministers, chaired the NZ performance based research fund, and in the last Queen’s Birthday awards was made “Order of Merit for New Zealand” for services to education. He is a cricket umpire and coach, enjoys being a Dad to his young men, is besotted with his dogs, and moved with his wife as she attained a promotion to Melbourne. Learn more about his research at, and view his Corwin titles here.

Dr. Arran Hamilton is Group Director of Strategy at Cognition Education. His early career included teaching and research at Warwick University and a stint in adult and community education. Arran transitioned into educational consultancy more than 15 years ago and has held senior positions at Cambridge Assessment, Nord Anglia Education, Education Development Trust (formerly CfBT) and the British Council. Much of this work was international and focused on supporting Ministries of Education and corporate funders to improve learner outcomes.
