Technical Methods Report: Survey of Outcomes Measurement in Research on Character Education Programs
NCEE 2009-006
December 2009

Executive Summary

Children's social and moral development has long been a central goal of American schools (McClellan 1999). Through the Partnerships in Character Education Program (PCEP), located in the Office of Safe and Drug-Free Schools (OSDFS) in the U.S. Department of Education, the federal government has distributed up to approximately $25 million annually in grants to state and local education agencies for the design and implementation of character education programs. Conducted at the request of OSDFS and under the auspices of the Institute of Education Sciences (IES), the present study has three objectives: (1) to document the constructs measured in studies of a delimited group of character education programs; (2) to develop a framework for systematically describing and assessing measures of character education outcomes; and (3) to provide a resource to help evaluators identify and select measures of the outcomes of character education programs.

Method

We approached the selection of programs for review so as to include programs that address the goals of PCEP and that are diverse along key programmatic dimensions. We drew on three primary sources: (1) the IES What Works Clearinghouse (WWC) 2007 review of character education programs (WWC 2007); (2) research-driven guides to character education developed by the What Works in Character Education Project (WWCEP), a collaborative effort of the Center for Character and Citizenship at the University of Missouri-St. Louis and the Character Education Partnership (Berkowitz and Bier 2006a, 2006b); and (3) grantee reports from state and local education agencies that received funds from PCEP between 2003 and 2007. From the pool of 68 programs identified from these sources, we randomly selected 36 programs for review after stratifying by source, grade level of focus, and whether the program is comprehensive (that is, fully integrated into the life of a school) or modular (that is, a stand-alone program). Random selection of the 36 programs ensured that the analysis of outcome measurement was conducted for a subset that reflected the diversity, in both measured and unmeasured attributes, of the larger set of 68 programs.

We then systematically identified the studies of each program, using PsycINFO and gray literature searches, and focused on the studies that provided the greatest detail on outcome measurement. In reviewing these studies, we developed a classification system that groups related outcome constructs conceptually. This taxonomy, outlined in Table 2, organizes outcomes from broad to increasingly specific conceptual categories. The broadest level distinguishes student-level outcomes from "other"-level outcomes, the latter category including teacher, school, parent, and community outcomes; the mid-range of specificity distinguishes among student affective, behavioral, and cognitive outcomes; and the finest levels distinguish, for example, between conceptual categories such as student knowledge and reasoning, and prosocial and risk behaviors. For each study reviewed, we identified all reported outcome measures and classified them according to the taxonomy (Appendix B provides a crosswalk between the taxonomy and the programs selected for the report), described the measures used, including their psychometric properties, and provided citations for the information on measures.
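As an illustration of the selection step described above, the following Python sketch shows one way a stratified random draw of 36 programs from a pool of 68 might be implemented. The program records, field names, proportional-allocation rule, and random seed are hypothetical stand-ins; this is a sketch of the general technique, not the procedure actually used for this report.

    import random
    from collections import defaultdict

    # Synthetic stand-in for the 68 candidate programs; the real pool came from
    # the WWC review, the WWCEP guides, and PCEP grantee reports.
    gen = random.Random(42)
    programs = [
        {
            "name": f"Program {i:02d}",
            "source": gen.choice(["WWC", "WWCEP", "PCEP"]),
            "grade": gen.choice(["elementary", "middle", "high"]),
            "structure": gen.choice(["comprehensive", "modular"]),
        }
        for i in range(68)
    ]

    def stratified_sample(records, strata_keys, n_total, seed=0):
        """Randomly select about n_total records, allocating draws across strata
        in proportion to each stratum's share of the full pool."""
        draw = random.Random(seed)
        strata = defaultdict(list)
        for rec in records:
            strata[tuple(rec[k] for k in strata_keys)].append(rec)

        sample = []
        for members in strata.values():
            # Proportional allocation, with every non-empty stratum represented
            # at least once; the final slice trims any rounding overshoot.
            n_stratum = max(1, round(n_total * len(members) / len(records)))
            sample.extend(draw.sample(members, min(n_stratum, len(members))))
        return sample[:n_total]

    selected = stratified_sample(programs, ["source", "grade", "structure"], n_total=36)
    print(f"{len(selected)} of {len(programs)} programs selected for review")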

Key Findings

Research on the selected character education programs addresses a wide variety of outcomes. Student-level outcomes are measured in studies of 34 of the 36 programs, with 25 of the 36 programs addressing one or more cognitive outcomes, 28 addressing one or more affective outcomes, and 31 addressing one or more behavioral outcomes. Among these student-level outcomes, the most often measured were academic content (measured for 14 programs), prosocial dispositions and interpersonal strengths (each measured for 11 programs), discipline issues and interpersonal competencies (each measured for 13 programs), and substance use and intrapersonal competencies (each measured for 11 programs). In terms of outcomes at other levels (that is, beyond the student), research on 7 programs addressed teacher-level outcomes, 16 addressed school-level outcomes, and 14 addressed parent/community-level outcomes. Staff morale, school climate, and parent participation in school were the constructs measured most often in these respective domains (for 6, 16, and 11 programs, respectively).

Measurement methods were also diverse. Appendix A provides detail, by program, on every measure used in the studies reviewed for this report. For each program, the appendix provides a brief description of the program, descriptions of each measure, and an indication of which outcome constructs from the taxonomy each measure addressed. As shown in these tables, researchers employed direct and indirect assessments, as well as surveys completed by teachers, parents, and students. They reported outcomes as scales, as stand-alone items, and as non-scaled measures such as attendance or disciplinary infractions.

Table 3 summarizes information on all of the scaled measures included in the studies reviewed. For each measure, the table shows the name of the instrument, whether it was developed for the study or is available "off the shelf," its source, the type of assessment (for example, direct assessment versus self-report), the domain it assessed (student [cognitive, affective, or behavioral] or "other"), and a rating of its reliability. Table 4 provides a crosswalk between the taxonomy outlined in Table 2 and the scaled measures identified in our review with reported reliability of .70 or greater.1 Our assessment of the characteristics of the scaled measures revealed two central themes:

  • Among the 95 scales that researchers applied in the studies reviewed here, 46 were developed for the study in which they were used, 17 were adapted from existing measures, and 32 were available "off the shelf," having been developed and published through other research. Among this last category, six scales were employed in research on more than one of the programs under review.
  • Reporting of the psychometric properties of character education outcome measures is not consistent. Researchers reported reliability statistics for 62 of the 95 multiple-item scales applied in the studies under review: 30 exhibited reliability of .70 or above, 27 exhibited mixed reliability across contexts, and 5 exhibited reliability below .70. For the remaining 33 scaled measures, no reliability statistics were reported. Validity was addressed less often than reliability; research on just 5 of the 36 selected programs provided information on the validity of the measures. (A sketch of how the most commonly reported reliability statistic is computed follows this list.)
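Reliability for multiple-item scales of this kind is most commonly reported as coefficient alpha (Cronbach's alpha). The following Python sketch, run against hypothetical Likert-type responses, shows how such a coefficient might be computed and screened against the .70 threshold used in this report; it is not drawn from any of the studies reviewed here.

    import numpy as np

    def cronbach_alpha(item_scores):
        """Coefficient alpha for an (n_respondents x n_items) array of item scores."""
        items = np.asarray(item_scores, dtype=float)
        k = items.shape[1]
        item_variances = items.var(axis=0, ddof=1)      # variance of each item
        total_variance = items.sum(axis=1).var(ddof=1)  # variance of the summed scale
        return (k / (k - 1)) * (1.0 - item_variances.sum() / total_variance)

    # Hypothetical five-item scale with 0-4 Likert-type responses driven by a
    # single underlying trait plus noise.
    rng = np.random.default_rng(0)
    trait = rng.normal(size=(200, 1))
    responses = np.clip(np.rint(2 + trait + rng.normal(scale=0.8, size=(200, 5))), 0, 4)

    alpha = cronbach_alpha(responses)
    verdict = "meets" if alpha >= 0.70 else "falls below"
    print(f"alpha = {alpha:.2f} ({verdict} the .70 threshold)")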

Considerations When Using This Report

The evidence developed from studies of the sample of programs reviewed here suggests that character education researchers should use this report's information on outcome measurement with the following considerations in mind. First, the taxonomy presented here suggests that a diverse array of outcomes may be affected by character education programming. Reference to a clear theory of how program elements are linked to specific outcomes may help researchers to identify those outcomes that the program in question is most likely to affect. In the absence of a clearly articulated theory, researchers could "work backward" from the taxonomy presented here to assess the extent to which each of the constructs is likely to be influenced by their intervention, selecting for measurement those that seem most appropriate.
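For researchers who want to "work backward" in this way, the taxonomy's broad-to-specific structure lends itself to a simple nested representation. The partial Python sketch below uses construct names mentioned in this report, but the assignment of example constructs to domains and the selection logic are illustrative assumptions; Table 2 and Appendix B contain the authoritative taxonomy.

    # Partial, illustrative representation of the outcome taxonomy; the domain
    # assignments here are assumptions, not a reproduction of Table 2.
    taxonomy = {
        "student": {
            "cognitive": ["academic content", "knowledge and reasoning"],
            "affective": ["prosocial dispositions", "intrapersonal competencies"],
            "behavioral": ["prosocial behaviors", "risk behaviors",
                           "discipline issues", "substance use",
                           "interpersonal competencies"],
        },
        "other": {
            "teacher": ["staff morale"],
            "school": ["school climate"],
            "parent/community": ["parent participation in school"],
        },
    }

    def candidate_outcomes(predicate):
        """'Work backward' from the taxonomy: keep the constructs that the
        supplied predicate judges likely to be influenced by the intervention."""
        return [
            (level, domain, construct)
            for level, domains in taxonomy.items()
            for domain, constructs in domains.items()
            for construct in constructs
            if predicate(level, domain, construct)
        ]

    # Example: a program expected to affect only student social behavior.
    selected = candidate_outcomes(
        lambda level, domain, construct: level == "student" and domain == "behavioral"
    )
    print(selected)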

Second, given the complexity of "character" as a construct, it could be beneficial for researchers to select or develop measures with demonstrated reliability and validity. While the measures presented here are neither necessarily representative of the universe of research on character education programs nor necessarily the best measures available, this report provides information on a variety of outcome measures with demonstrated psychometric properties. Related to this, the field of character education could benefit from more consistent reporting on the psychometric properties of outcome measures. Studies provided insufficient information to assess reliability for 33 of the 95 scaled measures identified here. Consistent reporting of measures' psychometric properties would support comparison of outcomes across programs and populations and potentially improve our understanding of effective character education practices.

Finally, the findings of this report highlight the importance of alignment between the conceptualization and measurement of outcomes. Our review revealed two ways in which measurement methods demonstrate a potential lack of such alignment: (1) there may be misalignment between items in a particular scale (they do not "hang together"); and (2) there may be a mismatch between the domain or construct a measure actually captures and the domain or construct the researcher conceptualizes or reports. Clear conceptualization of constructs and alignment with measures may be supported by reference to the outcome taxonomy and related measures presented here.
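One common diagnostic for the first kind of misalignment is the corrected item-total correlation: each item's correlation with the sum of the remaining items in its scale. The Python sketch below illustrates the check on hypothetical responses; the data, scale composition, and the .30 flagging cutoff are illustrative assumptions rather than values drawn from the studies reviewed here.

    import numpy as np

    def corrected_item_total(items):
        """Correlation of each item with the total of the *other* items in the scale."""
        items = np.asarray(items, dtype=float)
        totals = items.sum(axis=1)
        return np.array([
            np.corrcoef(items[:, j], totals - items[:, j])[0, 1]
            for j in range(items.shape[1])
        ])

    # Hypothetical six-item scale in which the last item taps something unrelated.
    rng = np.random.default_rng(1)
    trait = rng.normal(size=(300, 1))
    coherent = trait + rng.normal(scale=0.7, size=(300, 5))
    stray = rng.normal(size=(300, 1))  # unrelated to the underlying trait
    responses = np.hstack([coherent, stray])

    for j, r in enumerate(corrected_item_total(responses), start=1):
        flag = "" if r >= 0.30 else "  <- does not 'hang together' with the rest"
        print(f"item {j}: r = {r:.2f}{flag}")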


1 See Sattler (2001) on the choice of 0.70 as a threshold of acceptable reliability of a scale measure.