Some Unsettling Observations about Teacher Evaluations

Recently, Megan McArdle and Dana Goldstein had a very interesting Bloggingheads discussion that was mostly about teacher evaluations. They referenced some widely-discussed attempts to evaluate teacher performance using what is called “value-added.” This is a very hot topic in education right now. Roughly speaking, it refers to evaluating teacher performance by measuring the average change in standardized test scores for the students in a given teacher’s class from the beginning of the year to the end of the year, rather than simply measuring their scores. The rationale is that this is an effective way to adjust for different teachers being confronted with students of differing abilities and environments.

This seems like a broadly sensible idea as far as it goes, but consider that the real formula for calculating such a score in a typical teacher value-added evaluation system is not “Average math + reading score at end of year – average math + reading score at beginning of year,” but rather a very involved regression equation. What this reflects is real complexity, which has a number of sources. First, at the most basic level, teaching is an inherently complex activity. Second, differences between students do not stay constant across time and subject matter. How do we know that Johnny, who was 20 percent better at learning math than Betty in third grade, is not relatively more or less advantaged in learning reading in fourth grade? Third, an individual person-year of classroom education is executed as part of a collective enterprise with shared contributions. Teacher X had special needs assistant 1 working with her class, and teacher Y had special needs assistant 2 working with his class – how do we disentangle the effects of the teacher versus the special ed assistant? Fourth, teaching has effects that continue beyond that school year. For example, how do we know whether teacher X got a great gain in scores for students in third grade by using techniques that made them less prepared for fourth grade, or vice versa for teacher Y? The argument behind complicated evaluation scoring systems is that they untangle this complexity sufficiently to measure teacher performance with imperfect but tolerable accuracy.
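
To make the contrast concrete, here is a minimal sketch (in Python, with invented column names, a made-up covariate, and made-up data, not the specification of any actual evaluation system) of the difference between a simple gain score and a regression-adjusted estimate of the kind these systems use:

```python
# Illustrative sketch only: a toy contrast between a raw gain score and a
# regression-adjusted estimate. The column names, the single demographic
# covariate, and the data are all invented; real systems use far more
# elaborate specifications.
import pandas as pd
import statsmodels.formula.api as smf

# One row per student: start- and end-of-year scores, a teacher id, and a covariate.
df = pd.DataFrame({
    "score_end":   [72, 80, 65, 90, 70, 85],
    "score_start": [70, 75, 60, 88, 66, 78],
    "teacher":     ["X", "X", "X", "Y", "Y", "Y"],
    "low_income":  [1, 0, 1, 0, 1, 0],   # stand-in demographic covariate
})

# Naive version: average end-of-year score minus average start-of-year score.
means = df.groupby("teacher")[["score_end", "score_start"]].mean()
naive_gain = means["score_end"] - means["score_start"]

# Regression-adjusted version: predict the end score from the start score and
# the covariate; the teacher coefficients play the role of "value-added."
model = smf.ols("score_end ~ score_start + low_income + C(teacher)", data=df).fit()
value_added = model.params.filter(like="C(teacher)")

print(naive_gain)
print(value_added)
```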

Any successful company that I have ever seen employs some kind of serious system for evaluating and rewarding / punishing employee performance. But if we think of teaching in these terms – as a job like many others, rather than some sui generis activity – then I think that the hopes put forward for such a system by its advocates are somewhat overblown.

There are some job categories that have a set of characteristics that lend themselves to these kinds of quantitative “value-added” evaluations. Typically, they have hundreds or thousands of employees in a common job classification operating in separated local environments without moment-to-moment supervision; the differences in these environments make simple output comparisons unfair; the job is reasonably complex; and the performance of any one person often has some indirect, but material, influence on the performance of others over time. Think of trying to manage an industrial sales force of 2,000 salespeople, or the store managers for a chain of 1,000 retail outlets. There is a natural tendency in such situations for analytical headquarters types to say “Look, we need some way to measure performance in each store / territory / office, so let’s build a model that adjusts for inherent differences, and then do evaluations on these adjusted scores.”

I’ve seen a number of such analytically-driven evaluation efforts up close. They usually fail. By far the most common result that I have seen is that operational managers muscle through use of this tool in the first year of evaluations, and then give up on it by year 2 in the face of open revolt by the evaluated employees. This revolt is based partially on veiled self-interest (no matter what they say in response to surveys, most people resist being held objectively accountable for results), but is also partially based on the inability of the system designers to meet the legitimate challenges raised by the employees.

Here is a typical (illustrative) conversation between a district manager delivering an annual review based on such an analytical tool, and the retail store manager receiving it:

District Manager: Your 2007 performance ranking is Level 3, which represents 85% payout of annual bonus opportunity.

Store Manager: But I was Level 2 (with 90% bonus payout) last year, and my sales are up more than the chain-wide average this year.

DM: [Reading from a laptop screen] We now establish bonus eligibility based on your sales gain versus the change in the potential of your store’s trade area over the same time period. This is intended to fairly reflect the actual value-added of your performance. We average this over the past three years. Your sales were up 5% this year, but Measured Potential for your store’s area was 10% higher this year, so your actual value-added averaged over 2005 – 2007 declined versus 2004 – 2006.

SM: My “area potential” increased 10%? – that’s news to me. Based on what?

DM: The new SOAP (Store Operating Area Potential) Model.

SM: What?

DM: [Reading from a laptop screen] “SOAP is based on a neural network model that has been carefully statistically validated.” Whatever that means.

[Continues reading] “It considers such factors as trade area demographic changes, competitor store openings, closures and remodels, changes in traffic patterns, changes in co-tenancy, and a variety of other important factors.”

SM: What factors are up that much in my area?

DM: [Skipping to the workbook page for this specific store, and reading from it] A combination of factors, including competitor openings and the training investment made in your store.

SM: But Joe Phillips had the same training program in his store, and he had no new competitor openings – and he told me that he got Level 2 this year, even though his sales were flat with last year. How can that be?

DM: Look, the geniuses at HQ say this thing is right. Let me check with them.

[2 weeks later, via cell phone]

DM: Well, I checked with the Finance, Planning & Analysis Group in Dallas, and they said that “the model is statistically valid at the 95% significance level” (whatever that means), “but any one data point cannot be validated.”

[10 second pause]

DM: Let me try to take this up the chain to VP Ops, and see what we can do, OK?

SM: Whatever. I’ve got customers at the register to deal with. [Hangs up]

Not all attempts to incorporate rigorous measures of value-added fail. Let me make some observations about when and how workable systems that do this tend to be designed and implemented. I doubt these will please either side in the debate.

1. Remember that the real goal of an evaluation system is not evaluation

The goal of an employee evaluation system is to help the organization achieve an outcome. For purposes of discussion, let’s assume the goal of a particular school to be “produce well-educated, well-adjusted graduates.” The question to be asked about this school’s evaluation system is not “Is it fair to the teachers?” It is not even “Does it measure real educational advancement?” Ultimately, all we should care about is whether or not the school produces more well-educated, well-adjusted graduates with this evaluation system than if it used the next-best alternative. In this way, it is like a new training program, investment in better physical facilities, or anything else that might consume money or time.

The fairness or accuracy of the measurement versus some abstract standard is the means; changing human behavior in a way that increases overall organizational performance is the end. To put a fine point on it, if a teacher evaluation based on a formula that considers only blood type, whether it is raining on the day of the evaluation, and the last digit of the teacher’s phone number is the one that does the best job of producing better-educated and better-adjusted graduates, then that’s the best evaluation system.

In practice, of course, an effective evaluation system normally has to have some reasonably clear linkage to what we think of intuitively as performance, but clarity about means versus ends helps keep the organization focused. On one hand, it prevents the perfect from being the enemy of the good – all we need to show is that this program is better than its next best competitor for resources to accept that it should be implemented. And on the other hand, it prevents the endless search for theoretical perfection, by constantly forcing this specific cost / benefit test on proposed “enhancements” to any evaluation system. Because there is enormous practical value to employees understanding and accepting the metrics used to evaluate them, this tends to produce evaluations using simpler metrics, even if they are theoretically less comprehensive.

2. You need a scorecard, not a score

There is almost never one number that can adequately summarize the performance of complex tasks like teaching that are executed as part of a collective enterprise. Outputs that can be measured with good precision and assigned to a specific employee, even when using very sophisticated statistical techniques, tend to be localized by time and organizational unit; therefore, evaluation systems that rely exclusively on such measures tend to reward short-term and selfish behavior to an irrational degree. In a business, this usually means that if we rely, for example, only on this year’s financial metrics to reward a salesperson, we will incent him to undermine the company’s brand, give away margin potential, and not work well with other salespeople on big sales projects that are shared and may take years to come to fruition. In some sales forces, this is no big deal, and we can just pay straight commission as a percent of sales, and get on with life. But for, say, most retail chains, it would be a long-term disaster to pay store managers only based on that year’s store profits – you’d be likely to end up with a bunch of stores that were poorly maintained, had untrained staff, and ran constant promotional sales targeted specifically to customers who shopped at nearby branches of the same chain (hold the jokes about retailer X that you don’t like). For this reason, most organizations create a so-called Balanced Scorecard for each such employee that combines several financial and several non-financial performance metrics, some of which almost always involve some degree of management judgment.
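
As a concrete illustration (a minimal sketch in Python, with invented metric names, weights, and targets, not a recommendation of any particular scorecard), such a scorecard for a store manager might combine financial measures, operational measures, and an explicitly judgment-based rating into one weighted review:

```python
# Minimal sketch of a balanced scorecard: the metrics, weights, and targets
# are invented. The point is the structure (several financial and
# non-financial measures, at least one of which is a management judgment),
# not the particular numbers.
from dataclasses import dataclass

@dataclass
class Metric:
    name: str
    weight: float      # share of the overall rating
    actual: float
    target: float

    def attainment(self) -> float:
        # Ratio of actual to target, capped so no single metric can dominate.
        return min(self.actual / self.target, 1.5)

scorecard = [
    Metric("sales_growth_pct",      0.35, actual=5.0, target=7.0),
    Metric("store_profit_k",        0.25, actual=420, target=400),
    Metric("customer_satisfaction", 0.15, actual=4.2, target=4.5),
    Metric("staff_training_pct",    0.10, actual=80,  target=90),
    Metric("dm_judgment_rating",    0.15, actual=3.0, target=4.0),  # district manager's call
]

overall = sum(m.weight * m.attainment() for m in scorecard)
print(f"Weighted attainment: {overall:.2f}")  # e.g. mapped to a bonus-payout band
```

The weights, the cap, and the judgment-based line are management design choices rather than outputs of a statistical model, which is exactly the contrast with trying to bundle everything into a single “value added” number.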

It’s not as though this concept is alien to all schools. In fact, to most experienced practitioners in just about any relevant field, this is common sense. But note that the attempt to bundle all of this into a single number called “value added” directly contradicts this understanding. It is very unlikely to work.

3. All scorecards are temporary expedients

Beyond this, usually no list of metrics can adequately summarize performance, either. In pure theory, what we would want to know in a business would be the impact of a given employee’s behavior on company stock price. But we can never really measure that. Instead, we have a bunch of proxies that we believe collectively approximate it. But the attempt to build up such a perspective as a pure data-analytic exercise always ends up creating some kind of Rube Goldberg system. We have maybe a few tens of thousands of relevant employee data points, and the complexity of a phenomenon that we only understand very partially overwhelms this amount of data.
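
One way to see the point about complexity overwhelming the data is a toy simulation (purely illustrative, with invented sample sizes and factor counts): when the number of plausible adjustment factors approaches the number of observations, the estimate of even a genuinely real effect becomes very noisy.

```python
# Illustrative simulation only: with a limited number of observations and many
# candidate adjustment factors, the estimate of the one factor that actually
# matters swings widely from sample to sample. All numbers are invented.
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_factors = 60, 50          # e.g. 60 employee-years, 50 candidate drivers

estimates = []
for _ in range(100):               # re-draw the data 100 times
    X = rng.normal(size=(n_obs, n_factors))
    true_beta = np.zeros(n_factors)
    true_beta[0] = 0.5             # only the first factor has a real effect
    y = X @ true_beta + rng.normal(scale=2.0, size=n_obs)   # noisy outcome
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    estimates.append(beta_hat[0])

print(f"estimate of the one real effect: mean={np.mean(estimates):.2f}, "
      f"std={np.std(estimates):.2f}")   # the spread is comparable to the effect itself
```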

Normally, an effective balanced scorecard for the kinds of positions I have been discussing is not constructed through such a process. Instead, its design starts with the view that the practical purpose of the evaluation system is to get employees focused on a combination of basic priorities, plus a few more targeted issues that are the object of current management attention. In this way, the scorecard partially depends on the current strategy of the organization. For example, for a store manager, annual sales would almost certainly be on any scorecard, but warranty penetration (the percentage of sales in which the store also cross-sells the consumer a warranty) and the percentage of store employees participating in sales effectiveness training might only be on the store manager’s scorecard for one or two specific years for a given retailer, and not at all for a competing retailer with a different strategy. Beyond this, when their own comp is at stake, any group of thousands of people will always figure out how to outsmart the team of analysts who designed the scorecard. That is, they will always figure out how to game the metrics and get the comp in ways that violate the (often implicit) assumptions that were used to link those metrics to performance improvement. Therefore, it is very helpful to present a moving target by changing some of the metrics each year. Finally, effective scorecards also tend to have a short list of metrics, since otherwise you run into the “anybody with many priorities really has no priorities” problem.

Taken together, these realities – linkage to strategy, avoiding gaming, and the need to have a short list of metrics to capture a very complicated phenomenon – mean that effective scorecards change a lot over time. Once again, they are correctly thought of as a management tool to improve performance, not as some Platonic measure of effectiveness.

4. Effective employee evaluation is not fully separable from effective management

One conclusion of this is that effective teacher evaluation is not fully separable from effective management of those teachers. This statement can be read in both directions, and therefore cuts both ways in this debate. The model of “measure and publish a metric for individual teacher value-added, and use a combination of shame, money and external pressure to convert this into improved schools” is not consistent with anything that I’ve ever seen work in comparable situations. On the other hand, neither is the argument one often (though not as often as in the past) hears that somehow “teaching is special,” in that reasonable attempts to objectively evaluate teachers – and link these evaluations to material changes in comp, promotions and retention – should not be expected to help the organization improve performance.

So where does this leave us? Without silver bullets.

Organizational reform is usually difficult because there is no one, simple root cause, other than at the level of gauzy abstraction. We are faced with a bowl of spaghetti of seemingly inextricably interlinked problems. Improving schools is difficult, long-term scut work. Market pressures are, in my view, essential. But, as I’ve tried to argue elsewhere at length, I doubt that simply “voucherizing” schools is a realistic strategy.

More serious measurement of teacher performance, very likely including relative improvement on standardized tests, will almost certainly be part of what an improved school system would look like. But any employees, teachers included, will face imperfect evaluation systems, and will have to have some measure of trust in the system and its application. The evaluation system will have some direct linkage to the strategy of the school, and this will have to be at least a decent strategy that has a real shot at improving learning. The evaluation system will have to have teeth, and this means realistic processes that link comp (and probably more important, promotions and outplacement) to performance.

In other words, better measurements of teacher value-added are useful on the margin, but teacher evaluation as a program to improve school performance will likely only work in the context of much better school organization and management.