Some Unsettling Observations about Teacher Evaluations

Recently, Megan McArdle and Dana Goldstein had a very interesting Bloggingheads discussion that was mostly about teacher evaluations. They referenced some widely-discussed attempts to evaluate teacher performance using what is called “value-added.” This is a very hot topic in education right now. Roughly speaking, it refers to evaluating teacher performance by measuring the average change in standardized test scores for the students in a given teacher’s class from the beginning of the year to the end of the year, rather than simply measuring their scores. The rationale is that this is an effective way to adjust for different teachers being confronted with students of differing abilities and environments.

This seems like a broadly sensible idea as far as it goes, but consider that the real formula for calculating such a score in a typical teacher value-added evaluation system is not “Average math + reading score at end of year – average math + reading score at beginning of year,” but rather a very involved regression equation. What this reflects is real complexity, which has a number of sources. First, at the most basic level, teaching is an inherently complex activity. Second, differences between students do not stay constant across time and subject matter. How do we know that Johnny, who was 20 percent better at learning math than Betty in third grade, is not relatively more or less advantaged in learning reading in fourth grade? Third, an individual person-year of classroom education is executed as part of a collective enterprise with shared contributions. Teacher X had special needs assistant 1 working with her class, and teacher Y had special needs assistant 2 working with his class – how do we disentangle the effects of the teacher versus the special ed assistant? Fourth, teaching has effects that continue beyond that school year. For example, how do we know whether teacher X got a great gain in scores for students in third grade by using techniques that made them less prepared for fourth grade, or vice versa for teacher Y? The argument behind complicated evaluation scoring systems is that they untangle this complexity sufficiently to measure teacher performance with imperfect but tolerable accuracy.
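
To make the contrast concrete, here is a minimal sketch (in Python, with invented column names, a made-up covariate, and made-up data, not the specification of any actual evaluation system) of the difference between a simple gain score and a regression-adjusted estimate of the kind these systems use:

```python
# Illustrative sketch only: a toy contrast between a raw gain score and a
# regression-adjusted estimate. The column names, the single demographic
# covariate, and the data are all invented; real systems use far more
# elaborate specifications.
import pandas as pd
import statsmodels.formula.api as smf

# One row per student: start- and end-of-year scores, a teacher id, and a covariate.
df = pd.DataFrame({
    "score_end":   [72, 80, 65, 90, 70, 85],
    "score_start": [70, 75, 60, 88, 66, 78],
    "teacher":     ["X", "X", "X", "Y", "Y", "Y"],
    "low_income":  [1, 0, 1, 0, 1, 0],   # stand-in demographic covariate
})

# Naive version: average end-of-year score minus average start-of-year score.
means = df.groupby("teacher")[["score_end", "score_start"]].mean()
naive_gain = means["score_end"] - means["score_start"]

# Regression-adjusted version: predict the end score from the start score and
# the covariate; the teacher coefficients play the role of "value-added."
model = smf.ols("score_end ~ score_start + low_income + C(teacher)", data=df).fit()
value_added = model.params.filter(like="C(teacher)")

print(naive_gain)
print(value_added)
```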

Any successful company that I have ever seen employs some kind of serious system for evaluating and rewarding / punishing employee performance. But if we think of teaching in these terms – as a job like many others, rather than some sui generis activity – then I think that the hopes put forward for such a system by its advocates are somewhat overblown.

There are some job categories that have a set of characteristics that lend themselves to these kinds of quantitative “value-added” evaluations. Typically, they have hundreds or thousands of employees in a common job classification operating in separated local environments without moment-to-moment supervision; the differences in these environments make simple output comparisons unfair; the job is reasonably complex; and the performance of any one person often has some indirect, but material, influence on the performance of others over time. Think of trying to manage an industrial sales force of 2,000 salespeople, or the store managers for a chain of 1,000 retail outlets. There is a natural tendency in such situations for analytical headquarters types to say “Look, we need some way to measure performance in each store / territory / office, so let’s build a model that adjusts for inherent differences, and then do evaluations on these adjusted scores.”

I’ve seen a number of such analytically-driven evaluation efforts up close. They usually fail. By far the most common result that I have seen is that operational managers muscle through use of this tool in the first year of evaluations, and then give up on it by year 2 in the face of open revolt by the evaluated employees. This revolt is based partially on veiled self-interest (no matter what they say in response to surveys, most people resist being held objectively accountable for results), but is also partially based on the inability of the system designers to meet the legitimate challenges raised by the employees.

Here is a typical (illustrative) conversation between a district manager delivering an annual review based on such an analytical tool, and the retail store manager receiving it:

District Manager: Your 2007 performance ranking is Level 3, which represents 85% payout of annual bonus opportunity.

Store Manager: But I was Level 2 (with 90% bonus payout) last year, and my sales are up more than the chain-wide average this year.

DM: [Reading from a laptop screen] We now establish bonus eligibility based on your sales gain versus the change in the potential of your store’s trade area over the same time period. This is intended to fairly reflect the actual value-added of your performance. We average this over the past three years. Your sales were up 5% this year, but Measured Potential for your store’s area was 10% higher this year, so your actual value-added averaged over 2005 – 2007 declined versus 2004 – 2006.

SM: My “area potential” increased 10%? – that’s news to me. Based on what?

DM: The new SOAP (Store Operating Area Potential) Model.

SM: What?

DM: [Reading from a laptop screen] “SOAP is based on a neural network model that has been carefully statistically validated.” Whatever that means.

[Continues reading] “It considers such factors as trade area demographic changes, competitor store openings, closures and remodels, changes in traffic patterns, changes in co-tenancy, and a variety of other important factors.”

SM: What factors are up that much in my area?

DM: [Skipping to the workbook page for this specific store, and reading from it] A combination of factors, including competitor openings and the training investment made in your store.

SM: But Joe Phillips had the same training program in his store, and he had no new competitor openings – and he told me that he got Level 2 this year, even though his sales were flat with last year. How can that be?

DM: Look, the geniuses at HQ say this thing is right. Let me check with them.

[2 weeks later, via cell phone]

DM: Well, I checked with the Finance, Planning & Analysis Group in Dallas, and they said that “the model is statistically valid at the 95% significance level” (whatever that means), “but any one data point cannot be validated.”

[10 second pause]

DM: Let me try to take this up the chain to VP Ops, and see what we can do, OK?

SM: Whatever. I’ve got customers at the register to deal with. [Hangs up]

Not all attempts to incorporate rigorous measures of value-added fail. Let me make some observations about when and how workable systems that do this tend to be designed and implemented. I doubt these will please either side in the debate.

1. Remember that the real goal of an evaluation system is not evaluation

The goal of an employee evaluation system is to help the organization achieve an outcome. For purposes of discussion, let’s assume the goal of a particular school to be “produce well-educated, well-adjusted graduates.” The question to be asked about this school’s evaluation system is not “Is it fair to the teachers?” It is not even “Does it measure real educational advancement?” Ultimately, all we should care about is whether or not the school produces more well-educated, well-adjusted graduates with this evaluation system than if it used the next-best alternative. In this way, it is like a new training program, investment in better physical facilities, or anything else that might consume money or time.

The fairness or accuracy of the measurement versus some abstract standard is the means; changing human behavior in a way that increases overall organizational performance is the end. To put a fine point on it, if a teacher evaluation based on a formula that considers only blood type, whether it is raining on the day of the evaluation, and the last digit of the teacher’s phone number is the one that does the best job of producing better-educated and better-adjusted graduates, then that’s the best evaluation system.

In practice, of course, an effective evaluation system normally has to have some reasonably clear linkage to what we think of intuitively as performance, but clarity about means versus ends helps keep the organization focused. On one hand, it prevents the perfect from being the enemy of the good – all we need to show is that this program is better than its next best competitor for resources to accept that it should be implemented. And on the other hand, it prevents the endless search for theoretical perfection, by constantly forcing this specific cost / benefit test on proposed “enhancements” to any evaluation system. Because there is enormous practical value to employees understanding and accepting the metrics used to evaluate them, this tends to produce evaluations using simpler metrics, even if they are theoretically less comprehensive.

2. You need a scorecard, not a score

There is almost never one number that can adequately summarize the performance of complex tasks like teaching that are executed as part of a collective enterprise. Outputs that can be measured with good precision and assigned to a specific employee, even when using very sophisticated statistical techniques, tend to be localized by time and organizational unit; therefore, evaluation systems that rely exclusively on such measures tend to reward short-term and selfish behavior to an irrational degree. In a business, this usually means that if we rely, for example, only on this year’s financial metrics to reward a salesperson, we will incent him to undermine the company’s brand, give away margin potential, and not work well with other salespeople on big sales projects that are shared and may take years to come to fruition. In some sales forces, this is no big deal, and we can just pay straight commission as a percent of sales, and get on with life. But for, say, most retail chains, it would be a long-term disaster to pay store managers only based on that year’s store profits – you’d be likely to end up with a bunch of stores that were poorly maintained, had untrained staff, and ran constant promotional sales targeted specifically to customers who shopped at nearby branches of the same chain (hold the jokes about retailer X that you don’t like). For this reason, most organizations create a so-called Balanced Scorecard for each such employee that combines several financial and several non-financial performance metrics, some of which almost always involve some degree of management judgment.
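
As a concrete illustration (a minimal sketch in Python, with invented metric names, weights, and targets, not a recommendation of any particular scorecard), such a scorecard for a store manager might combine financial measures, operational measures, and an explicitly judgment-based rating into one weighted review:

```python
# Minimal sketch of a balanced scorecard: the metrics, weights, and targets
# are invented. The point is the structure (several financial and
# non-financial measures, at least one of which is a management judgment),
# not the particular numbers.
from dataclasses import dataclass

@dataclass
class Metric:
    name: str
    weight: float      # share of the overall rating
    actual: float
    target: float

    def attainment(self) -> float:
        # Ratio of actual to target, capped so no single metric can dominate.
        return min(self.actual / self.target, 1.5)

scorecard = [
    Metric("sales_growth_pct",      0.35, actual=5.0, target=7.0),
    Metric("store_profit_k",        0.25, actual=420, target=400),
    Metric("customer_satisfaction", 0.15, actual=4.2, target=4.5),
    Metric("staff_training_pct",    0.10, actual=80,  target=90),
    Metric("dm_judgment_rating",    0.15, actual=3.0, target=4.0),  # district manager's call
]

overall = sum(m.weight * m.attainment() for m in scorecard)
print(f"Weighted attainment: {overall:.2f}")  # e.g. mapped to a bonus-payout band
```

The weights, the cap, and the judgment-based line are management design choices rather than outputs of a statistical model, which is exactly the contrast with trying to bundle everything into a single “value added” number.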

It’s not as though this concept is alien to all schools. In fact, to most experienced practitioners in just about any relevant field, this is common sense. But note that the attempt to bundle all of this into a single number called “value added” directly contradicts this understanding. It is very unlikely to work.

3. All scorecards are temporary expedients

Beyond this, usually no list of metrics can adequately summarize performance, either. In pure theory, what we would want to know in a business would be the impact of a given employee’s behavior on company stock price. But we can never really measure that. Instead, we have a bunch of proxies that we believe collectively approximate it. But the attempt to build up such a perspective as a pure data-analytic exercise always ends up creating some kind of Rube Goldberg system. We have maybe a few tens of thousands of relevant employee data points, and the complexity of a phenomenon that we only understand very partially overwhelms this amount of data.
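
One way to see the point about complexity overwhelming the data is a toy simulation (purely illustrative, with invented sample sizes and factor counts): when the number of plausible adjustment factors approaches the number of observations, the estimate of even a genuinely real effect becomes very noisy.

```python
# Illustrative simulation only: with a limited number of observations and many
# candidate adjustment factors, the estimate of the one factor that actually
# matters swings widely from sample to sample. All numbers are invented.
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_factors = 60, 50          # e.g. 60 employee-years, 50 candidate drivers

estimates = []
for _ in range(100):               # re-draw the data 100 times
    X = rng.normal(size=(n_obs, n_factors))
    true_beta = np.zeros(n_factors)
    true_beta[0] = 0.5             # only the first factor has a real effect
    y = X @ true_beta + rng.normal(scale=2.0, size=n_obs)   # noisy outcome
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    estimates.append(beta_hat[0])

print(f"estimate of the one real effect: mean={np.mean(estimates):.2f}, "
      f"std={np.std(estimates):.2f}")   # the spread is comparable to the effect itself
```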

Normally, an effective balanced scorecard for the kinds of positions I have been discussing is not constructed through such a process. Instead, its design starts with the view that the practical purpose of the evaluation system is to get employees focused on a combination of basic priorities, plus a few more targeted issues that are the object of current management attention. In this way, the scorecard partially depends on the current strategy of the organization. For example, for a store manager, annual sales would almost certainly be on any scorecard, but warranty penetration (the percentage of sales in which the store also cross-sells the consumer a warranty) and the percentage of store employees participating in sales effectiveness training might only be on the store manager’s scorecard for one or two specific years for a given retailer, and not at all for a competing retailer with a different strategy. Beyond this, when their own comp is at stake, any group of thousands of people will always figure out how to outsmart the team of analysts who designed the scorecard. That is, they will always figure out how to game the metrics and get the comp in ways that violate the (often implicit) assumptions that were used to link those metrics to performance improvement. Therefore, it is very helpful to present a moving target by changing some of the metrics each year. Finally, effective scorecards also tend to have a short list of metrics, since otherwise you run into the “anybody with many priorities really has no priorities” problem.

Taken together, these realities – linkage to strategy, avoiding gaming, and the need to have a short list of metrics to capture a very complicated phenomenon – mean that effective scorecards change a lot over time. Once again, they are correctly thought of as a management tool to improve performance, not as some Platonic measure of effectiveness.

4. Effective employee evaluation is not fully separable from effective management

One conclusion of this is that effective teacher evaluation is not fully separable from effective management of those teachers. This statement can be read in both directions, and therefore cuts both ways in this debate. The model of “measure and publish a metric for individual teacher value-added, and use a combination of shame, money and external pressure to convert this into improved schools” is not consistent with anything that I’ve ever seen work in comparable situations. On the other hand, neither is the argument one often (though not as often as in the past) hears that somehow “teaching is special,” in that reasonable attempts to objectively evaluate teachers – and link these evaluations to material changes in comp, promotions and retention – should not be expected to help the organization improve performance.

So where does this leave us? Without silver bullets.

Organizational reform is usually difficult because there is no one, simple root cause, other than at the level of gauzy abstraction. We are faced with a bowl of spaghetti of seemingly inextricably interlinked problems. Improving schools is difficult, long-term scut work. Market pressures are, in my view, essential. But, as I’ve tried to argue elsewhere at length, I doubt that simply “voucherizing” schools is a realistic strategy.

More serious measurement of teacher performance, very likely including relative improvement on standardized tests, will almost certainly be part of what an improved school system would look like. But any employees, teachers included, will face imperfect evaluation systems, and will have to have some measure of trust in the system and its application. The evaluation system will have some direct linkage to the strategy of the school, and this will have to be at least a decent strategy that has a real shot at improving learning. The evaluation system will have to have teeth, and this means realistic processes that link comp (and probably more important, promotions and outplacement) to performance.

In other words, better measurements of teacher value-added are useful on the margin, but teacher evaluation as a program to improve school performance will likely only work in the context of much better school organization and management.