Anyone who has followed news reports about health research knows the pattern: one study shows a food, nutrient, or behavior is good for you, the next that it’s bad, the next is inconclusive, and further research suggests that it depends or that it doesn’t matter. Even experienced researchers working with statistical data about something as complex as health outcomes have trouble coming to clear conclusions, especially on specific points. (And news reporters looking for a simple takeaway message often make it worse.)
Moreover, as Daniel Kahneman explains engagingly in his new book, Thinking, Fast and Slow, all of us are hardwired to be bad at statistics and overly confident of our erroneous conclusions.
So, when a school district and a community try to make sense of statistical data about educational outcomes concerning the already emotional issue of leveling and de-leveling, we should proceed with great care.
To make sense of the data we have, it would help us all to keep two things in mind. First, no data will provide a complete and conclusive answer to some of our key questions. Second, there are things the available data can tell us and things they can’t.
What questions can the data answer?
We have one year of data from two cohorts of around 500 students. It includes test scores and class grades. These are limited measures of educational outcomes. District administrators have wisely looked at other measures (work samples and curriculum analysis). They addressed two questions: Did students recommended for level 3 benefit from the increased rigor of leveled-up classes? And did the leveled-up classes remain challenging for students recommended for level 4?
These questions are sometimes phrased colloquially in discussions I am hearing as "Did it help?" and "Did it hurt?" These latter questions are much broader and largely impossible to answer with the available data. (Grades and test scores don’t address the pain some students and families feel about placement in either heterogeneously grouped or leveled classes. Nor can they measure the benefits for all students of more racially integrated classrooms.)
The district won’t even be able to answer the question of long-term benefit in terms of college success (a motivating reason for leveling up) for many years.
Answers to the questions we are really debating—is leveling up working? where do we go from here?—depend not just on data, but on values, priorities, and individual experience.
What can the available data tell us?
The data we have are for (relatively) small samples, two different cohorts, and a non-constant context (leveling up is not the only change in curriculum and instruction). They lend themselves to descriptive analysis, not definitive conclusions. That means we can sort the information in charts and graphs and look for patterns and anomalies that are relevant to our questions. We can’t draw firm conclusions about the relative outcomes of the cohorts, especially in specific cases.
In particular, there are several factors that muddy the waters. Their impact increases as the sample size decreases, as it does in subgroups. These factors can “explain away” some portion of a statistical increase or decrease in a measure. Statisticians may try to control for these factors, though not always successfully. We need to be mindful of them.
The first is individual factors. Were there more or fewer students experiencing personal problems, or family stress from one cohort to the next? Were some students coming down with the flu when they took the test? In small samples, these random variations due to individual factors can sometimes have large effects.
Second, there are contextual factors. If there is a spike or decline in scores or in the number of Fs or As, is it distributed more or less evenly among all teams and both middle schools? We can’t tell from the aggregate data. But school and district administrators can look to see whether the increase or decrease is concentrated (in a school, team, or classroom) and determine what, if anything, happened, and what, if anything, needs to be done about it.
Finally, there are cohort-related factors. In general terms our cohorts are similar: 7th graders in the South Orange/Maplewood district, in consecutive years. But they can and do vary from year to year in demographic characteristics that can make a difference.
Statistically speaking, some demographic groups test and perform (in terms of grades) measurably differently than others. A relative increase or decrease in students of a given race, sex, or socioeconomic status from one cohort to the next can produce greater than expected changes, again especially in subgroups, as can yearly variations in the number and type of IEPs (individualized education programs). Interestingly, enrollment data for the two cohorts show a relatively large swing in the balance of boys and girls from year to year, especially among white students. That’s probably statistically significant.
These kinds of cohort effects almost certainly explain some of the year-to-year variation in NJASK scores, which is why year-to-year changes don’t necessarily mean we are getting “worse” or “better” or that one school is “better” or “worse” than another.
What do the data show?
The analysis presented by the superintendent in October was, for some, underwhelming. But I would suggest that, based on what we can and cannot do with data, it was neither a failure to analyze the data nor an effort to mislead. Rather, it was an honest and informative assessment of the information we currently have. On balance, it tells us that leveled-up students were not in over their heads, that leveled-up classes were not dumbed down, and that heterogeneous grouping did not diminish the learning of students designated as level 4. Moreover, twice as many leveled-up students qualified (under current criteria) for level 4 placement in 8th grade as non-leveled-up students from the year before. That is a significant benefit for a good number of students.
There may be conversations still to have about Honors classes, “gifted” students, and the rigor of the curriculum in general, but the data certainly suggest to me that any difference between so-called level 3 students and so-called level 4 students is not a difference that makes a difference in their capacity to learn and achieve, or to share a classroom. It should not make a difference in the kind of instruction and opportunities they receive.
Julia Burch is a parent of a 5th and a 7th grader who attend South Orange-Maplewood School District schools.
Morrisa da Silva
4:49 pm on Friday, December 9, 2011
You state that "I would suggest, based on what we can and cannot do with data, it was neither a failure to analyze the data nor an effort to mislead. Rather, it was an honest and informative assessment of the information we currently have."
The shortcomings of the data that you lay out belie the above quote. The district used the data to proclaim level up a success. You state that the data tell us "that leveled up students were not in over their heads and that leveled up classes were not dumbed down, and that heterogeneous grouping did not diminish the learning of students designated as level 4." The increase in D's and F's for students designated as level 3 and the surging numbers of A's for students designated as level 4 point to the fallacy of that statement. The presence of grade inflation certainly indicates that learning may have been diminished for level 4 students, but we won't know for sure, because NJASK only looks at proficiency and is not useful for analyzing educational outcomes at the top of the ability range. As for your assertion "that twice as many leveled up students qualified (under current criteria) for level 4 placement in 8th grade as non-leveled up students from the year before," this is not the whole story. As Jeffrey Bennett has shown, the criteria changed this year, allowing a lower cutoff for level 4 designation. This may very well be a good thing, but the increase cannot be causally linked to the level up program.
Julia Burch
7:36 am on Saturday, December 10, 2011
There is no generalized increase (across subjects, exams, and final grades) from one cohort to the next in Ds and Fs for students recommended to level 3. There is one measurable increase, in final marks (but not exams) for science, and it is small. There are somewhat larger increases in the percentages of As for final marks in Language Arts and Science for students recommended to level 4.
But absolutely nothing can be concluded from these numbers. They cannot prove either poor outcomes or grade inflation.
These increases are anomalies in our descriptive data. They can, and should, prompt us to ask questions: What happened in science? Does this indicate grade inflation? Finding answers to those questions requires additional information, and that task properly belongs to school and district administrators.
(to be continued)
Julia Burch
7:38 am on Saturday, December 10, 2011
(part 2)
But there are two things that strike me as an outside observer. First, there was a significant increase in the relative number of white girls from one cohort to the next. Statistically speaking, this group is likely to do well in greater numbers than other subgroups. Increasing the relative number of higher achievers from one year to the next is likely to increase the number of high achievements, especially in Language Arts, where the effect of gender is most pronounced. (And possibly in life science, where girls also perform relatively better than in the physical sciences.)
Secondly, at the end of the October report, the district had already made some internal analyses and identified areas of improvement, one of which is: “Continuing to work with teachers to improve student lab experiences in science.” Given that final exam grade distributions did not indicate a decrease in mastery of the material, could it be that something is going on with lab work that is relevant to the question of what happened in science?
Morrisa da Silva
4:52 pm on Friday, December 9, 2011
I am unaware of any analysis of work product that the district did to make sure that rigor was maintained or to adequately assess whether differentiated instruction was being used effectively for both lower-ability and higher-ability students. This is the kind of analysis that is needed, as well as more than a year's data. Deleveling in and of itself does not mean better instruction or more opportunities for anyone if it does not effectively deliver a rigorous curriculum. We must look closely at educational outcomes and not just see the outcomes we want.
Michael Paris
10:16 am on Saturday, December 10, 2011
Thanks to Julia Burch for this incredibly thoughtful piece. One thing the data do show is that there were significant numbers of young persons previously categorized as "level 3 kids" AND significant numbers of young persons previously categorized as "level 4 kids" throughout the grade distributions (A to F) and test score results. In other words, many young persons previously categorized as level 3 kids did better than or just as well as many young persons previously categorized as level 4 kids. Ms. Burch is certainly right: any difference between these artificial categories of students is not a difference that makes a difference. Mr. Bennett is entitled to his values and commitments when it comes to public education. But I would urge him to stop talking constantly, and blithely, about "Level 3 students" and "Level 4 students," as if these words captured any meaningful distinctions between the young persons attending our middle schools.
Michael Paris
Amy Higer
10:36 am on Saturday, December 10, 2011
Obviously we can debate the data and interpret it in all kinds of ways. Let's face it, all of us are inclined to favor interpretations that suit our own predisposed positions on the leveling/deleveling issue. This is an inherent problem of using quantitative data to draw conclusions about the social world. And I think this is the most important point Julia Burch raises in her excellent comment. Thank you, Julia, for reminding us why labeling children "Level 4" or "Level 3" or "Level 2" is so problematic, and, to my mind, obnoxious, especially in such a wonderfully diverse and dynamic community as ours.
Morrisa da Silva
12:35 pm on Saturday, December 10, 2011
Jeffrey Bennett's labeling, and mine as well, is only for the purposes of analyzing the different cohorts. Paul Roth's analysis uses the labels as well in his data presentations. It is unfair to target Jeff Bennett as if only his use of the label is morally repugnant. What we need to know, and Julia, you are right, is what kind of teaching is going on in the classrooms: what kind of rigor, and what DI practices are being used (if they are at all) and for which segments of the class. The results of 7th grade deleveling, as evidenced by the data we have, albeit slim, are not indicative of success, and the mere fact that Mr. Roth and the Super say it is will not make it so.
Amy Higer
2:00 pm on Saturday, December 10, 2011
I think we can all agree that data are important but that this debate needs to extend beyond numbers (and certainly beyond numbers gleaned only one year after a policy change takes effect). Mr. Bennett's questions are all reasonable, and we should discuss them in a community forum that allows for such discussion. However, it's obvious to me from these questions that his perspective is that of the "highly-ready student" (no longer Level 4?--thank you!) who is perceived to benefit from the status quo, rather than the "less-ready" student who is likely to have been greatly harmed. So, here are some questions I'd like to add to the leveling debate: How can we better assess and understand the harm that's been done (and continues to be done) to students by the present leveling system? What are other economically and racially diverse districts doing that we should be doing to ensure that all students are getting a high-quality "rigorous" education? How can we ensure that no students are being harmed by our educational policies? ("Parents seem to think so" is, of course, not a reliable metric.) What can we do that will make our diversity into an educational advantage, rather than a disadvantage? I.e., can we move the discussion away from "How can I make sure that my advanced child is not harmed by others?" and toward one that asks "How can we all work together as a community to make sure that teachers have the resources they need to teach and challenge all students?"
Amy Higer
3:12 pm on Saturday, December 10, 2011
Jeffrey, thank you for these thoughtful comments. I absolutely agree on working to help make all our teachers effective. And it's appalling (if I understand your point correctly) that social studies and science have been sacrificed on the altar of NJASK/teaching to the test in elementary schools. We should all be up in arms about this narrowing of the elementary school curriculum.
Michael Paris
4:16 pm on Saturday, December 10, 2011
Thanks for pointing out this change to the elementary school curriculum. It was already the case in our elementary schools that during weeks when science is taught, social studies is not taught, and that when social studies is taught, science is not taught. Now the district is making things worse by taking more time away from science and social studies? Why? What logic justifies this? The answer has to be that NJASK tests for language arts and math, but not for science and social studies.
John Skywalker
6:39 pm on Saturday, February 18, 2012
After reading through all of the comments and the detailed analysis and assertions above on whether the "Quantitative Data Analysis" is accurate or flawed, given the methodology and criteria being used to promote deleveling in our school district, it is my own belief that the deleveling proposal should be put to a referendum on the November ballot for the voting public of both Maplewood and South Orange. It's too important for either the Superintendent or the School Board to vote it through.
The future integrity and quality of education in our school district hang in the balance, especially if the data, methodology, and criteria being used are in question, whether you are in favor of deleveling or not.
The "qualitative" analysis and "feedback" from the voting public (taxpayers and parents) should be considered. The only way to take this step is for the deleveling proposal to be placed on the November ballot so it can be voted on, since the critical future and direction of education in our school district is in question, including how it will impact both communities and their property values for years to come.
Michael Paris
7:36 pm on Saturday, February 18, 2012
John Skywalker wants a popular referendum on Superintendent Brian Osborne's proposal for reform at our middle schools and high school. Although I'm confident that the supporters of reform could win such a contest, thankfully local democracy in our state does not work that way. Democracy works best when the people participate through mediating and established nongovernmental and governmental institutions. That way, people have to organize actively along with others to control governing institutions, and the officials empowered through the process are encouraged to exercise power responsibly. That way, deliberation and reason, and not ill-informed, base appeals to fear and narrow self-interests, stand a better chance of carrying the day.