Friday, July 17, 2009

Measuring Teacher Effectiveness


Something I've been thinking a lot about lately is the idea of linking test scores to teacher evaluation. It's a topic that's everywhere this summer. One of the notions that you often hear during these discussions is, "The good teachers have nothing to be afraid of." Let's talk about that for a bit.

Last year, for one of my Master's classes, I dug into testing data I had on hand for the first grade team in my building. These are real numbers and real averages with real kids behind them; the test in question is the Measures of Academic Progress (MAP), from the Northwest Evaluation Association.
Teacher A: In the fall, her class had an average score of 162.5 on the MAP. In the spring the class average rose to 184.3, an average gain of 21.8 points.

Teacher B: Her fall average was 164.7; her spring average, 183.85, for an increase of 19.15 points.

Teacher C: 169.05 in the fall, 189.35 in the spring, so an average gain of 20.3 points.

Teacher D: An average score of 155.30 points in the fall and 174.85 in the spring. Her fall-to-spring gain, then, was 19.55 points.
With this data, then, you could argue the case for two different teachers as the "winners" in the group. If you look at the average gain, Teacher A is your champion:
  1. Teacher A: 21.8 points
  2. Teacher C: 20.3 points
  3. Teacher D: 19.55 points
  4. Teacher B: 19.15 points
But, if you look at the overall class average at the end of the year, Teacher C is far and away your winner:
  1. Teacher C: 189.35
  2. Teacher A: 184.3
  3. Teacher B: 183.85
  4. Teacher D: 174.85
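If you want to poke at these numbers yourself, here's a quick Python sketch that reproduces both rankings from the class averages above. Nothing fancy, just the arithmetic:

```python
# Fall and spring MAP class averages, as reported above.
scores = {
    "Teacher A": (162.50, 184.30),
    "Teacher B": (164.70, 183.85),
    "Teacher C": (169.05, 189.35),
    "Teacher D": (155.30, 174.85),
}

# Ranking 1: average fall-to-spring gain.
print("By average gain:")
for teacher, (fall, spring) in sorted(
    scores.items(), key=lambda kv: kv[1][1] - kv[1][0], reverse=True
):
    print(f"  {teacher}: {spring - fall:.2f} points")

# Ranking 2: spring class average.
print("By spring average:")
for teacher, (fall, spring) in sorted(
    scores.items(), key=lambda kv: kv[1][1], reverse=True
):
    print(f"  {teacher}: {spring:.2f}")
```
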
If you went strictly by these numbers from this year, then, you could see who your quality teachers are. If you were judging solely by the numbers, you might also think that you have a problem with Teacher D--her class average trails everybody else's by almost 10 points, which on the MAP is very nearly an entire year's worth of growth.

But we have to dig even deeper before making a statement about teacher quality, because here the raw numbers aren't telling the whole story.

In the fall, the average score for this test is 164 points. In the spring, the average score is 178. Knowing that, here's some new data to chew on.
In Teacher A's room in the fall, 10 kids scored in the below average range. In the spring, 6 kids scored below average.

In Teacher B's room, 7 kids were below average in the fall, while 3 were below average in the spring.

In Teacher C's room, 6 kids were below average in the fall, and 3 in the spring.

In Teacher D's room, 16 kids were below average in the fall, and 6 tested below average in the spring.
With this new information, you can make two new arguments. First, Teacher B is your best teacher because she had more of her kids cross the finish line (the goal score, 178) than the other teachers did. You could also argue that Teacher D is your best teacher because she lowered her percentage of kids who were below standard more than any of the other teachers did.
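Here's the same sort of sketch for the below-average counts. One caveat: class sizes aren't listed above, so this works with the raw counts rather than the percentages the Teacher D argument technically rests on:

```python
# Students scoring below the test average (164 in fall, 178 in spring),
# per classroom, as reported above.
below_average = {
    "Teacher A": (10, 6),
    "Teacher B": (7, 3),
    "Teacher C": (6, 3),
    "Teacher D": (16, 6),
}

# How many kids moved out of the below-average range over the year?
for teacher, (fall, spring) in sorted(
    below_average.items(), key=lambda kv: kv[1][0] - kv[1][1], reverse=True
):
    print(f"{teacher}: {fall} below average in the fall, {spring} in the spring "
          f"({fall - spring} fewer)")
```
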

So, who is your Most Valuable Teacher?

Is it Teacher A, who added the most value to her class over the course of the year?
Is it Teacher B, who had more of her kids meet the year-end goal?
Is it Teacher C, whose class scored the highest in the spring?
Is it Teacher D, who turned around more failing kids than any of the others?

"Value" is a homophone; there's the value signified by the numbers, but there's also the values of the school, the district, and the state which have to be superimposed atop any effort to link the data to the teacher. If the incentive pay/merit pay/whatever pay in this case goes to only one of the four teachers, you're making a statement about the value of the work the other three did, and it's a pretty lousy thing to say to the other three who also made progress that their success didn't matter as much.

Conversely, can we countenance a system where every one of these teachers is given the bonus money, indicating that they all did a good job? In the eyes of some reformers, I could see that being too close to what we do now, where every teacher is assumed to be a good teacher. If a merit pay system is intended to have winners and losers, and to inspire the "less-capable" teachers to emulate the "better" teachers, can we really have a 4-way tie?

These are the questions that have to be answered going forward.

If you'd like to see the raw scores presented in a spreadsheet, you can find them here.


Wednesday, September 24, 2008

First Day of NWEA Testing

The MAP Assessment from the Northwest Evaluation Association (NWEA) is one that I've been boosting incessantly for a while now. I think that value-added models are the only real way to measure student growth; it's where kids start and end that really matters, not necessarily where the finish line is supposed to be.

So this morning I had both computer labs here in my building buzzing. In one I was set up for the first grade classes, who are very, very easy to test this time of the year: average test time is about 13 minutes, and you even have some knuckleheads (one from my class, natch) who finish in under 5. That's a fun conversation:

"Johnny, there's no way you could be done in that short a time."
"But I am done!"
"You got me. Let me rephrase--there's no way you could have done a good job in that amount of time."
"I tried my hardest!"
"Really? Let's do the math--you spent about 8 seconds per problem. About half of that time is the computer reading the problem to you, sooooooo......"
"Can I go to the bathroom?"
"No. No you may not."

That said, the beauty of the NWEA is instant feedback. It backed up what I was already thinking, that I don't have any exceptionally high kids, that one or two are very, very low, and that the rest are in-between. That's OK with me.

Over in the other lab a couple of third grade classes came through, and they say their test is quite a bit harder this year after a re-alignment that occurred over the summer. It'll be neat to see what they're saying at the end of the year.

This, to me, is what testing should be: quick, functional, and applicable. My next big project will be to lay out all the scores on a graph for each grade level to get ready for my RTI presentation to WERA this coming spring. I'm rather looking forward to that piece, because data matters, and giving teachers an easy way to read the data is critical to the growth we're trying to achieve.
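In case anybody's curious what that might look like, here's a rough Python sketch of the graphing piece. The file name and column names are placeholders I made up for the example, not the actual NWEA export:

```python
# Rough sketch: one histogram of fall RIT scores per grade level.
# "map_fall_scores.csv" and its columns ("grade", "rit_score") are
# placeholders for illustration; the real export will be shaped differently.
import csv
from collections import defaultdict

import matplotlib.pyplot as plt

scores_by_grade = defaultdict(list)
with open("map_fall_scores.csv", newline="") as f:
    for row in csv.DictReader(f):
        scores_by_grade[row["grade"]].append(float(row["rit_score"]))

for grade, scores in sorted(scores_by_grade.items()):
    plt.figure()
    plt.hist(scores, bins=15)
    plt.xlabel("Fall RIT score")
    plt.ylabel("Number of students")
    plt.title(f"Grade {grade}: fall MAP scores")
    plt.savefig(f"grade_{grade}_fall_map.png")
```
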

Dino Rossi is running for Governor here in Washington, and a big part of his platform is ditching our state level assessment (the WASL) in favor of something better. I won't be voting for him, but on that point he's absolutely right.


Wednesday, May 21, 2008

What Value Value Added?

There’s an interesting article in the May 7th Education Week on the “value added” model for judging teachers and schools. Under a value-added (VA) system, quality is judged by the change in scores on a consistent scale; the MAP assessment from the NWEA is a good example of what a value-added test might look like. Some pieces from the article that struck me:

“My personal opinion is that this model is promising way more than it can deliver,” (Audrey Amrein-Beardsley) said in an interview with Education Week. “The problem is that when these things are being sold to superintendents, they don’t know any better.”
I’d be curious to understand what exactly it is that a data model like value added can really promise. The results of data analysis can be spun in a variety of ways, true, but the data itself is what it is. I’ve looked at VA as more a piece of the puzzle than as a whole puzzle in and of itself, but maybe I need to look closer at the issue.

Later the article talks about the problems associated with using VA for programs like merit pay:

For example, results might be biased if it turns out that a school’s students are not randomly assigned to teachers—if, for instance, principals routinely give high-achieving students to the teachers who are considered the school’s best.
This is a legitimate concern. I’ve known several teachers in my short career who can do amazing things with gifted kids but can’t reach the low learners at all; by the same token, I want the low kids, because the growth in them is the most spectacular. I think this could be a strength of VA: if you add 30 points to the scaled score of a low kid but only 5 to a high achiever, it’s clear where the most progress was made.
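To make that last point concrete, here’s a toy example. The 30-point and 5-point gains are the ones mentioned above; the starting scores are invented purely for illustration:

```python
# Hypothetical students: the gains (+30 and +5) come from the paragraph
# above; the fall/spring scores themselves are made up for the example.
students = {
    "low-scoring student":  {"fall": 150, "spring": 180},  # +30
    "high-scoring student": {"fall": 210, "spring": 215},  # +5
}

for label, s in students.items():
    gain = s["spring"] - s["fall"]
    print(f"{label}: fall {s['fall']}, spring {s['spring']}, value added = {gain}")

# A status-only ranking puts the high scorer first; a value-added view
# credits the bigger (30-point) gain instead.
```
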

The most important piece of the article, and one that I’ve brushed on before:

But the more sophisticated the technique, the less understandable it could become for practitioners. The question is whether the added accuracy will make it harder for teachers and administrators to buy into value-added accountability systems, several experts say.
It’s critical that the teachers understand what the score on the test means, and that they know what factors could move that score up or down, especially if you intend to use this test to make a judgment about the teachers or their students. There’s nothing that breeds distrust faster than to be told, “You don’t need to know the details,” because that’s where the devil usually is.

It will be interesting to watch this conversation unfold.


Saturday, July 07, 2007

John Merrow on the Value Added Model

I think pretty highly of John Merrow; the podcasts that he puts out are always well worth listening to, and the segments that he does for PBS are quality education journalism. In the June 13th edition of Education Week he has an interesting commentary on using growth models to measure the progress that schools and the students in them make; a part that resonated with me was this:

...to have a valid growth model, schools need what families have: a common yardstick. But schools also need a generally agreed-upon (and nourishing) curriculum. So if we want to develop a valid “growth model,” we first need to debate what belongs in the curriculum and figure out what sort of performance measures make sense.
It’s perhaps the most important question since the rise of NCLB: should the testing drive the curriculum, or should the curriculum drive the testing? The strengths and limitations of the two approaches mirror each other:

*If we base our tests off of the curriculum, we can accurately measure how well students learned what was taught. This allows for discussions of teacher effectiveness, because the teachers are the ones who taught the material. The weakness in this approach is that the curriculum might not cover everything that is believed essential for that grade level, which creates gaps in learning that can have a terrible cumulative effect if allowed to grow over time.

*If we base our curriculum off of our tests, we can help ensure that kids are being shown what has been deemed important, because what’s tested is what’s taught. If we believe that what’s tested is important, this is a good thing. This allows for discussions of school effectiveness, because the yardstick is the same for everyone statewide and we can see which buildings “get it” and which don’t. The weakness in this approach is the overwhelming number of learning requirements, more than can realistically be taught in most grades, which perhaps leads to a narrower curriculum.

This is why I’d like to see more schools around the state using the MAP assessment from NWEA, and school-wide scores being made more readily available for study. One of the biggest weaknesses I see in the WASL is that a 1, 2, 3, 4 scale doesn’t tell you nearly enough about where you’re at as a school. The MAP breaks the scores out into the basic areas of reading and math and can be used to measure students’ growth from the beginning of the year to the end, which is also much more useful than the WASL, whose results come back five months later.

Growth models: I like ‘em!
