Kurt VanLehn recently published a meta-analysis of almost a hundred well-constructed papers on computer tutoring systems. It's an eye-opening survey of a huge amount of work done over the last few decades, showing that there's real promise in using these systems to help students – and suggesting, indeed, that they may be much closer to human tutoring performance than we've realized.
If Kurt's analysis is right, more of us need to get busy figuring out how to scale more of these ideas – and test them at scale. Hard work, mistakes to come – but an average effect size of 0.75 standard deviation units for the better types of tutoring systems ain't chump change!
In 2011, he published a meta-analysis of decades' worth of intelligent tutoring systems research, asking how far the field has come compared with what's known about human tutoring.
As he points out, a lot of interest in intelligent tutoring systems came from the early reporting of human tutoring results by Benjamin Bloom, who reported several experiments suggesting that well-designed human tutoring could deliver around two standard deviations' worth of learning performance. This is a shockingly large move:
On a normal curve, a shift of two standard deviations takes an average student to roughly the 98th percentile of performance, and, more startling, moves someone at the lower quartile into the upper quartile.
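The percentile arithmetic behind that claim is easy to check. A quick sketch, assuming normally distributed test scores, using only Python's standard library:

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

# An average student (50th percentile, z = 0) shifted up two standard deviations:
avg_after = std_normal.cdf(2.0) * 100
print(f"Average student after +2 SD: ~{avg_after:.1f}th percentile")  # ~97.7th

# A lower-quartile student (25th percentile) shifted up two standard deviations:
z_lower_quartile = std_normal.inv_cdf(0.25)  # z ≈ -0.67
lq_after = std_normal.cdf(z_lower_quartile + 2.0) * 100
print(f"Lower-quartile student after +2 SD: ~{lq_after:.1f}th percentile")  # ~90.7th
```

So a two-standard-deviation gain does indeed carry a 25th-percentile student well into the upper quartile.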
Great, if you can get it – but, of course, such human tutoring is not practical for learning at scale. Still, the results do suggest that minds can learn a lot, if only we can afford the interventions – the limits are in the engineering of instructional systems. Bloom's results became an aspirational target for the intelligent tutoring systems community, as they've sought effective ways to intervene with machines to solve the engineering puzzle.
Kurt's meta-analysis suggests a lot of progress has been made – and also raises questions about whether the original Bloom's target has been misinterpreted, so that machine tutoring is actually doing much better in comparison with human tutoring than we might have thought:
Examining 10 experiments that compared human tutoring with no tutoring (including Bloom's work), he concludes the average effect size is closer to 0.79 than to 2 standard deviations. He points out that Bloom's exceptional results may stem from a difference in the mastery criteria used by tutors versus classroom instruction – that difference may account for much of the extra standard deviation observed. (Bloom's work may actually support a different intervention: high, mastery-based criteria for success.)
He splits machine and human tutoring research along a spectrum of “grain size” for the interaction. Answer-based tutors provide feedback and guidance after a student has worked through a problem and selected an answer, based on that answer. Step-based tutors break the solution to a problem or task into steps, follow those steps with students, and provide feedback around each one. Substep-based tutors work within the steps – e.g., checking whether students are familiar with the concepts or actions that make up a step before they start on it, and intervening if they aren't. Human tutoring is, of course, the most fine-grained approach of all.
He classifies student behavior on a spectrum that runs from passive (no physical activity – pure reading or listening), to active (some physical activity tied closely to what's presented, like note-taking or underlining), then constructive (going beyond the information presented – self-explaining, drawing a concept map, etc.), and finally interactive (back-and-forth engagement in which the learner takes into account what's coming back). Artificial tutors are not quite ready to engage in the most interactive kinds of activity in a convincingly human way, but constructive work is clearly being done.
The meta-analysis results are surprising: conventional wisdom would suggest that the finer the grain-size of instruction, the better the results should be. However, at least for the studies reported that fit Kurt's criteria, it looks like typical answer-based tutoring systems average an effect size of around 0.35 standard deviation units (real progress by itself), while all three of step-based, substep-based, and human tutoring seem to cluster around an effect size of 0.75 standard deviation units.
There's a fair degree of spread around that latter effect size – but note that it looks as if machine-based tutoring is beginning to reach levels of help that overlap with what human tutors achieve.
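To put those effect sizes in concrete terms: a common way to read an effect size (Cohen's d) is as the expected percentile of a previously average student after the intervention, again assuming normally distributed scores. A small sketch:

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

effect_sizes = [
    ("answer-based tutoring", 0.35),
    ("step-, substep-, and human tutoring", 0.75),
    ("Bloom's revised human-tutoring estimate", 0.79),
]

for label, d in effect_sizes:
    # Percentile an average (50th-percentile) student reaches after a gain of d SDs
    percentile = std_normal.cdf(d) * 100
    print(f"{label}: d = {d:.2f} -> ~{percentile:.0f}th percentile")
```

On this reading, answer-based tutoring moves an average student to roughly the 64th percentile, while the 0.75–0.79 cluster moves them to roughly the 77th–79th.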
There are, of course, limits to this analysis, as Kurt himself points out:
While this was a major undertaking, he did not engage other researchers in searching for and coding the research articles. It's possible that some studies are mis-coded, or that other controlled-trial reports were left out.
Meta-analysis numbers always carry the risk of averaging (i.e., confusing) well-conceived practice and trials with less-well-conceived approaches. Bob Slavin, a well-known researcher in K-12 education, advocates a “best evidence synthesis” that pays attention not just to the evidence but to what was actually done, to extract the most helpful direction possible. Indeed, Kurt's synthesis is worth reading in detail precisely for the richness of analysis he applies to the results and their specifics, not simply for the numbers.
That human tutoring, as studied so far, appears to be in the same league as much of the step- and substep-based tutoring doesn't mean these approaches can't be pushed further. What if we combined sophisticated decision-support analytics and intervention suggestions with the oversight of human coaches? TBD.
Still, this meta-analysis reminds us that exciting work is out there waiting to inspire us to innovate at scale:
- For any of us solving a specific learning problem at scale, it's helpful to focus on efforts that are more like what we're doing, rather than a broad brush. Kurt's enabled that by providing tables giving more details of each study.
- Looking across the success of many, many machine tutoring studies over the last few decades, it sure looks like “there's something here” that should be exploited at scale. There are a relative handful of such efforts in the market now (ALEKS, Carnegie Learning's tools), but we don't seem to have a pipeline from research with solid impacts to experiments and use at scale. If this were research showing such results on a major unsolved medical problem...
- From a practical standpoint, it looks like starting simply has value. Even answer-based tutoring systems give a decent lift – and (if Kurt's analysis is right) going one step further gets you into human-tutoring range. We should be able to get to a “good enough” benefit – worth trying, and testing.
- Machine-tutoring systems change the economics of student help. While the up-front cost is much higher, the delivery cost per student goes way down. That can shift investment in people (which doesn't have to shrink unless you want or need it to) toward other parts of the learning puzzle – the most intractable learning situations, motivation issues, diagnostic and group activities (where evidence shows the latter pay off well), etc.
Dr. VanLehn has written a piece that challenges us to get going: machine-based tutoring now has almost 100 careful studies published, with good evidence that these systems can make a significant difference. It's up to all of us to figure out the next steps: how do we put them on an evidence-based path to use at scale, in the subjects and with the students they can best help – and in systems that can benefit from the data flow and changed economics that result?
Learning - and economics - stand to gain.