A recent study by Shermis et al. looking at how a number of automated essay scoring (AES) systems fare compared with human raters has generated a bit of press about advances in automated test scoring and its implications. Not all of that press is favorable – so what should we all be doing?
This study compares results for nine different AES “engines” (who knew there were that many out there?), together covering more than 97% of current AES usage in the US. A large number of essays (more than 20,000!) from eight different prompts given across three states were included – a non-trivial sample size.
The results confirm what has been true of automated essay grading for a while – these engines work well for one specific purpose: replicating the grades that human graders give to similar work.
So is that the end of the story? Can all those hard-working English teachers and faculty finally relax over their weekends instead of stacking up hundreds of papers to grade, secure that the work is in good “hands”?
Not exactly.
The trick is being very clear about what you are asking the technology to do – and what you are not. These AES engines must first be trained on a large number of human-graded papers before they can do their work. (Truth be told, human grading of papers is also significantly improved by pre-training with anchor papers – ETS has for many years run exactly this kind of grader training for its yearly high school exams, AP and otherwise.) As black boxes, AES engines establish a wide array of correlations that allow them to duplicate the final marks on papers – but they wind up with essentially no insight to offer (beyond simple grammar and spelling commentary in some cases, a bit more in others) about how to improve the structure, meaning, or grace of even a non-fiction piece of prose.
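To make the “black box” point concrete, here is a minimal sketch of the general pattern such engines follow: learn a statistical mapping from surface features of human-scored essays to the scores those humans assigned, then reproduce that mapping on new essays. This is an illustration only (assuming Python and scikit-learn, with toy data standing in for the thousands of training essays a real engine needs), not any vendor’s actual method – and note that nothing in it “reads” the essay in any meaningful sense.

```python
# Illustrative sketch only: a regression model fit to surface features of
# essays that humans have already scored, then used to predict scores for
# new essays. Real AES engines use far richer (proprietary) feature sets.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

train_essays = [
    "The author argues that renewable energy is essential because ...",
    "climate is big problem and we should to fix it now",
    # ... in practice, thousands of human-scored training essays
]
human_scores = [5.0, 2.0]  # marks assigned by trained human raters

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # word/phrase frequency features
    Ridge(alpha=1.0),                     # correlational model, not "understanding"
)
model.fit(train_essays, human_scores)

new_essay = ["Renewable energy matters because it reduces emissions ..."]
predicted = model.predict(new_essay)
print(f"Predicted score: {predicted[0]:.1f}")  # mimics human marks, offers no feedback
```

The point of the sketch is what is missing: the fitted model can echo human scores, but it has nothing to say about how a student should revise the essay.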
The researchers make this clear, too. They know it can be an issue that the internal grading algorithm bears no relationship to how real human graders work:
Another issue not addressed by this study is the question of construct validity. A predictive model may do a good job of matching human scoring behavior, but do this by means of features and methods which do not bear any plausible relationship to the competencies and construct that the item aims to assess. To the extent that such models are used, this will limit the validity argument for the assessment as a whole.
Others, too, have gone after this issue. As the article in Inside Higher Ed reported, at least one researcher, Les Perelman at MIT, is no fan of AES and looks for ways to trip up automated essay grading services by writing obviously nonsensical essays that still receive high marks.
The difficulty arises when assessments, or your marking of them, rely completely on statistical correlations that no longer hold up. Imagine a test of foreign-language competency based on your vocabulary in that language – simple enough to test. If students pre-train on the exact words the test looks for, however, the “natural” correlation breaks down – the assessment can no longer reliably predict the full skill.
The best assessments are much richer performance assessments that exercise the actual skills themselves, but these are often expensive and technically difficult to administer. With simpler correlation-based tests, though, we run the risk that as soon as the results matter to someone other than the student (i.e., not just for a student’s own learning), there is perceived value in gaming them.
So does that mean we should forget automated essay scoring? Sorry, English teachers and faculty, get out those pens and start back in again, word by word, paragraph by paragraph?
It depends.
If all you want are reasonably accurate final grades (e.g., for summative evaluation), then a mix of essay-grading software to speed grading plus a quick human read to make sure each essay actually is coherent, not a trick, might get you 80% of the efficiency gain for a smallish increment in cost. That’s nice.
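For what that hybrid pass might look like in practice, here is a hypothetical sketch. The `aes_score()` and `quick_human_read()` helpers are placeholders (not a real vendor API): the engine proposes a grade, a human does a brief coherence check, and anything that fails the check goes back for full human grading.

```python
# Hypothetical workflow sketch: automated score first, quick human sanity
# check second. Helper functions are placeholders, not a real AES API.
from dataclasses import dataclass
from typing import Optional

@dataclass
class GradedEssay:
    text: str
    aes_score: float
    human_confirmed: bool
    final_score: Optional[float]  # None means "route to full human grading"

def grade_batch(essays, aes_score, quick_human_read):
    """aes_score(text) -> float; quick_human_read(text) -> bool (coherent, not gamed)."""
    results = []
    for text in essays:
        score = aes_score(text)          # fast, automated first pass
        ok = quick_human_read(text)      # brief human coherence check
        results.append(GradedEssay(
            text=text,
            aes_score=score,
            human_confirmed=ok,
            final_score=score if ok else None,
        ))
    return results
```

The design point is simply that the human read is a cheap gate on the automated score, not a full re-grade – which is where the efficiency gain comes from.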
However, if your goal is to lift student performance, not simply document it, it gets more complicated.
Research (not always familiar to English teachers or faculty) shows many things have the potential to improve student writing, but none of them is simply handing grades back to students. Getting students to write more, however, as long as they are putting in the effort to write well (e.g., applying some of the evidence-based principles of structuring an argument), may well improve writing.
So, as with most things involving technology, a combination of people, systems, and data may hold the answer to implementing good, evidence-based learning efficiently at scale. Students clearly need rich human mark-ups (aligned with evidence-based principles on improving writing), but perhaps these can be interleaved with writing assignments graded by AES engines plus a light human read. It is worth trying, to see whether the right balance can lift student performance more reliably over time while limiting the extra grading costs – e.g., assign one fewer fully marked essay, and see whether adding several AES-plus-human-graded essays gets students further for the same investment (or less). Not clear – but something to try.
Is it possible that automated essay scoring systems could provide more guidance on structure directly? Well, in other areas, intelligent tutoring systems are beginning to make strong progress – let’s not count them out over time.
But for now, let’s start carefully, with eyes open to what could go wrong, as we seek the benefits of well-designed technology added to well-trained people: lifting student performance and getting good data on how we’re doing.