Thank you very much for the opportunity to talk with all of you today. As Karen said, I
have a number of areas that I have worked in over the last several years, mostly
pertaining to multivariate applications. I've been very lucky, thank you very much to
IES, to also be quite well funded in terms of research projects to stimulate some of the
methodological work that I've been engaged in over the last several years as well.
So what I'm going to talk about today is actually not going to be terribly empirical, in
that this isn't going to be a simulation study or a series of simulation study results;
rather, it's going to be ideas and some project-related findings that have been inspired
by a number of the IES projects that I have had the opportunity to work on.
In particular, I'm going to be talking about planned missing data designs; I tried to come
up with a little more creative title here, a play on not all X need apply. But basically
the question being do we really have to assess all individuals on all measures?
So my quick answer is, really, no. We don't need to collect full information on all
participants in our studies, whether it be an experimental study or whether it be
something more of an assessment nature. Whether it be small scale or large scale.
In fact, thanks in part to computational advances and high-speed desktop computing,
we can handle intentionally missing data: data that we never intended to collect in the
first place, that was purposefully left out. It's usually referred to as a missing data
problem; I prefer, in this context, to think of it as a solution, because in a number of
ways the financial flexibility that planned missing data designs can make available to us
far outweighs the computational burden that might be involved in accommodating the
missing data.
Plus, there are already plenty of examples of planned missing designs in the
substantive literature. You see these across the board in terms of disciplines:
accelerated longitudinal designs or cohort sequential designs, popular in
developmental, cradle-to-grave applications, and specific planned missing data designs
that are also sometimes referred to as efficiency and measurement designs. It gets back
to this notion of saving something, whether it's saving participant burden, whether it's
literally saving money, or whether it's just saving time and researcher resources.
Measurement applications, from simple matrix sampling to more modern
computerized adaptive testing applications, those are intentionally missing designs. As
well as the area that I've become most interested in which comes from biostatistics
and medical applications, the notion of sequentially designed experiments.
So I'm going to talk a little bit right off the bat about my motivating context, so like I
said, most of this work, most of the concepts and the ideas I'm going to present to you
today, are inspired by some of my work on substantive projects. So I'm going to talk a
little bit about those motivating contexts. As I said, I like to think of this as a missing
data solution rather than a problem, but I really can't get away with not talking about
missing data itself. So I'll talk a little bit about the types of missing data and I'll talk a
little bit about some of the common methods for dealing with missing data, and that's
really what makes this planned missingness feasible in the first place.
And then I'm going to talk about two classifications that I use in terms of how I
compartmentalize all of these different types of designs that are out there. One
class covers situations where all participants are assessed, so big-S little-s (Ss)
participants, not technically APA-correct anymore, but it's wired in there someplace
and it's hard to override. You want to assess all participants, but not all participants
receive all the measures. The other class is where all the instruments, all the
measures, are administered to those individuals who are assessed, but not all of the
initially planned individuals will be assessed. There
are some circumstances where it can stop early. And the savings is in terms of not
recruiting additional participants, rather than reducing the battery that's administered
to everybody.
So in the first category, everybody is assessed on something; the particular
background will come from my work with the Reading for Understanding projects. I'll
talk about that, specifically accelerated longitudinal designs; I'll give some of the other
examples of planned missing data designs, as well as computerized adaptive testing.
So at the grantee meeting a month ago, Susan Embretson was a part of the panel and
she specifically talked about adaptive testing, and I just made reference to it. I've
expanded my slides a little bit to talk about adaptive testing. Because that then gives
me the segue into the idea of not assessing everyone we might have initially intended
to assess, and there I'm going to talk about sequentially randomized, or sequentially
designed, experiments. And my interest in that class of
models comes from my experience through a number of randomized trials that have
been either conducted or are currently in place through a number of the IES-funded
projects.
So as I said, I'm going to describe kind of my motivating context a little bit more. Four
particular projects really provide the body of my interest in this area. First is the
currently funded Reading for Understanding project. I'm an investigator on the UNL
subaward led by Tiffany Hogan, but the overall project is led by Ohio State;
somewhere around the corner, I think Laura Justice is still in the room, and it's
probably a good thing that there's no opportunity for eye contact here.
In that particular study, or set of studies, we have an assessment project, Study 1,
that's currently underway, and then we have the later intervention work that's going
to be done. I'm going to talk about some of the ideas regarding planned missingness
that have applications to that assessment study, some of the issues that we've
thought about, discussed, and grappled with over the last year or so.
Then I'm going to use data from a completed randomized trial (2004-10) that I
worked on in collaboration with our PI, Sue Sheridan, also at the University of
Nebraska. In that funded project, we evaluated an intervention referred to as
Conjoint Behavioral Consultation (CBC) that is intended, for one outcome, to reduce
disruptive behaviors among children, both in the classroom and at home. I used that
completed project as a data lab, so to speak: I treated it as a secondary dataset and
went back and reanalyzed the data as if it had come from a sequentially designed
experiment in the first place, and I'll demonstrate some of the potential sample
savings that came about. That work was presented at the Research Conference, I
think in 2008 or 2009.
I learned all about sequential designs in that context; now we have two other
projects that also involve randomized trials. One is the National Center for Research
on Rural Education, where we're in the midst of a randomized trial. And then we have
a second efficacy study in the CBC family of projects, where we're replicating the
intervention in a rural context.
I think it's not an uncommon problem, but we've run into some recruitment issues.
Getting schools, especially some of the rural schools, to agree to participate is always a
problem, and it's rearing its head for us now. So Sue, Todd Glover, and other
investigators on this project are continuously coming to me and asking, "Well, okay, do
we really have to recruit everybody we said we were going to?" Thanks to the work
that I had done on the previous CBC project, I had a little bit of foresight.
Now, I gave them one power analysis that went into the proposal. But at the same
time, I also started a sequentially designed parallel plan, anticipating that at some
point in the future they would come to me and ask, "Okay, do we really have to collect
everything we thought we needed?" I planned for the opportunity to take a peek, and
I'll talk about how that can be done in a principled manner, because we're at the point
right now where we're recruiting for next year already. Basically, it's a power analysis
that also takes into account this idea of looking at the data early, with multiplicity
control. So I built in a component that allows us to test repeatedly, but keep the error
rate in control.
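The actual plan uses a formal group-sequential procedure, but a small Monte Carlo sketch (illustrative only; the sample sizes and number of looks are made up) shows why that multiplicity control is needed: peeking at accumulating data with an uncorrected test at each look inflates the overall Type I error rate well beyond the nominal .05.

```python
import numpy as np

rng = np.random.default_rng(42)

n_sims = 4000           # simulated trials under the null (no treatment effect)
looks = [50, 100, 150]  # interim analyses after 50, 100, 150 cases per arm
z_crit = 1.96           # two-sided nominal 5% critical value

rejections = 0
for _ in range(n_sims):
    treat = rng.normal(0, 1, looks[-1])
    ctrl = rng.normal(0, 1, looks[-1])
    # stop and "declare an effect" at the first look that crosses the boundary
    for n in looks:
        se = np.sqrt(2 / n)
        z = (treat[:n].mean() - ctrl[:n].mean()) / se
        if abs(z) > z_crit:
            rejections += 1
            break

rate = rejections / n_sims
print(f"Overall Type I error with 3 uncorrected looks: {rate:.3f}")  # well above .05
```

Group-sequential boundaries such as Pocock or O'Brien-Fleming raise the per-look critical value so that this overall rate comes back down to .05.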
So what is missing data? Missing data comes about through a number of mechanisms.
It may be something intentional on the part of the participants, selective non-response:
they choose to respond to some items, some measures, and not others, or they don't
show up for an assessment period; they might be sick one day but be back for the next
assessment in a longitudinal design. Missing data can come about by attrition, where
participants simply drop out of the experiment or the study, whether that's an
intentional or more of a random process; the kid can't participate if the parents move
them out of the school system, especially to another state. That's not necessarily
because of the intervention itself; there may be something systematic, but something
that we don't have to deal with in terms of the inference.
Sometimes missing data is what we're going to continue to talk about today, missing
by design, where we intentionally don't collect something in the first place. And
sometimes missing data comes about just by human or technology error, whether it's
coffee spilling on the laptop, the internet connection going down, or those darn
undergrads just not getting something quite right in the data entry process. You can't
blame it on grad students or post-docs, but undergrads, you know, they can be the
scapegoats for everything.
Regardless of how the missing data comes about, there are types of missing data in
terms of the set of assumptions we make about the statistical properties of the
missingness. I'll give an example here: suppose we're modeling a construct like
literacy, and I'll use the label Y for that. And let's say we're modeling literacy as a
function of some predictor variable, in this case SES, socioeconomic status, which I'll
just leave in the later diagrams as an X. Some participants don't complete the literacy
measure, so we do have some missing data on this outcome variable, Y.
We have to ask ourselves why we have that incomplete data. Is it a random process
or a systematic process? Can we make some assumption about the answer to that
question, and if it is a systematic influence, can we determine its nature? That leads us
to a typology of missing data: missing at random or missing not at random, and I like to
think of missing completely at random as a special case of missing at random. Basically,
it's missing data we can deal with easily, missing at random or completely at random,
versus not at random, a much more troublesome situation.
So, further defining the elements: X is our completely observed predictor variable; Y is
our partly observed outcome variable that has some missingness to it; Z is the
component representing the causes of missingness. Whatever mechanism led to
missing data on Y, that's captured in Z, whether it's one variable or a set of variables.
And then R is the indicator of whether a given value is missing or not; okay, that's the
probability of missingness.
Starting off with missing completely at random, the MCAR model: in a missing
completely at random situation, missing values on this outcome variable Y, literacy,
are not associated with other variables in a given dataset or with the variable Y itself.
So there may be some causal mechanism between X and Y, but as you can see from
the lower left here, there's no association between Y and R. R, the probability of
missingness, and its causes, Z, are in no way associated with the system that we're
interested in. It's not necessarily that the missingness is completely, literally random;
but whatever the mechanism is, it's not part of the substantive system that we're
trying to model. There's no causal mechanism.
The second type of missing data is missing at random. This adds an additional set of
assumptions. Now the missing values on the given variable Y, literacy, are not
associated with the unobserved variable Z, notice that there's no link between Z and Y,
but they may be related to other measured or observed variables. So the probability
that Y is missing, R, may depend on X; X is part of the mechanism leading to the
missingness, but as long as X is in the model, X predicting Y, the mechanism for
missingness is incorporated within the modeling process. So the analysis should be
unbiased, because X is part of the system. It's essentially the same principle as
omitted-variable bias: if you leave out a variable that's supposed to be part of the
system, the results from the model would be biased. That's the principle at play with
missing at random. As long as the system is correct, the model is correct, the bias is
avoided.
Versus missing not at random: now we see the association between Y and R, so the
probability that Y is missing depends on what the value of Y itself might have been.
So, participants don't complete the literacy measure because they have poor literacy
skills. The probability of non-response is directly predicted by what their value would
have been, had it been observed. That's the non-ignorable case; that's the
problematic case.
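To make the typology concrete, here is a small illustrative simulation (the variable names and missingness rates are mine, not from any of these projects): Y is literacy, X is SES, and we delete values of Y under each of the three mechanisms and look at the complete-case mean of Y.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

x = rng.normal(0, 1, n)              # SES (fully observed)
y = 0.5 * x + rng.normal(0, 1, n)    # literacy, true mean = 0

# MCAR: missingness unrelated to anything in the system
r_mcar = rng.random(n) < 0.3
# MAR: missingness depends only on the observed predictor X
r_mar = rng.random(n) < np.where(x > 0, 0.5, 0.1)
# MNAR: missingness depends on Y itself (poor literacy -> non-response)
r_mnar = rng.random(n) < np.where(y < 0, 0.5, 0.1)

for label, r in [("MCAR", r_mcar), ("MAR", r_mar), ("MNAR", r_mnar)]:
    print(f"{label}: complete-case mean of Y = {y[~r].mean():+.3f}")
```

Under MCAR the complete-case mean stays near the true value of zero; under MAR it is biased unless X is brought into the model; under MNAR it is biased in a way no observed variable can repair.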
So, what can we tell from our data? We only have access to what we've actually
measured and observed, so our hands are a little bit tied in whether we can find
empirical evidence for missing completely at random versus missing at random versus
not missing at random. Okay? We can test, and reject, missing completely at random;
those procedures have been available for decades. Missing at random is itself not
testable, because we have no knowledge of what the value would have been, had it
been measured in the first place. We have no ability to predict the missingness,
because the outcome we would need to predict it isn't available to us. So we can't
distinguish between missing at random (MAR) and missing not at random.
Now in the case of planned missing data, we, the experimenters, are the mechanism;
we are the Z leading to the probability of missing data. As long as we're either a
random process or something that can otherwise be deemed non-systematic, as long
as we're not selecting certain individuals to not receive a certain assessment because
they might not have filled it out completely or correctly in the first place, as long as
we're avoiding those types of circumstances, we are acting as Z and we're
unassociated with the missingness itself. So we should be meeting the missing
completely at random assumption.
Missing data techniques, solutions for dealing with missing data: most assume missing
at random or missing completely at random. Missing at random is the typical
assumption. Everything works much better when you have completely at random; the
model can be simplified.

Traditional techniques: pairwise and listwise deletion; substitution of the mean or
some other value, where you put a value in; regression models for predicting what
the observation might have been had it been observed; and stochastic procedures
that put some additional randomness or uncertainty into that. Versus the modern
techniques: full information maximum likelihood and multiple imputation. Those are
the two big categories of missing data procedures, and thanks to all the statisticians
who have developed those fields, that's what makes it feasible for us to allow planned
missingness and to account for it in our analyses.
So I'll now talk pretty briefly about some of the classic approaches, just to make sure
we're all on the same page in terms of what I'm talking about here, and then I'll
provide an answer to that question about the completeness of X.
Listwise deletion: if our overall goal for this talk is to reduce the amount of
information that we have to collect but still be able to make a valid inference about
the phenomenon we're studying, then, assuming we are so overpowered and have
such a large sample size in the first place, listwise deletion actually isn't a bad idea.
What happens is that if a case has even a single missing data point, you delete the
entire case.

If you can afford to lose participants, then you're in a good situation. But, as I already
talked about with strapped resources, our sample probably isn't as big as we might
have liked in the first place, and we don't want to toss anything. We don't want to
give up any pieces of information. All of those observed values, not just the periods
marking the missing data, now have a line through them; that's wasted information
that we really would like to keep hold of.
Well, that leads us to pairwise deletion. Pairwise deletion would have you, in a
bivariate, two-variable situation, remove a case that is missing one or the other
observation only for the purpose of estimating, say, that correlation or covariance.
Here you can see an example: the upper case is missing the BADL variable but has
MMSE, missing one out of the two, so the correlation between those two boxed
variables would be calculated without those two participants. If you take any other
pairing of two variables, the correlation would potentially be estimated on different
data, different amounts of data. Well, this tends to lead to a problem in multivariate
statistics that we refer to as a not-positive-definite matrix, which is basically the kiss of
death for most multivariate analytic procedures.

Pairwise deletion is generally not a good idea, although it does allow us to keep more
of the data, whereas listwise deletion would have eliminated case six and case fifteen
right off the bat. So it seems like a good idea, but it leads to computational problems.
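Here is a deliberately extreme toy example (hypothetical data, not from the slide) of how pairwise deletion can produce an impossible, not-positive-definite correlation matrix: each correlation is estimated on a different subset of cases, and the three resulting values cannot all be true of any one population.

```python
import numpy as np
import pandas as pd

# Each pair of variables is observed on a *different* subset of cases.
v = [1.0, 2.0, 3.0, 4.0, 5.0]
nan = [np.nan] * 5
df = pd.DataFrame({
    "x1": v + v + nan,                # observed in cases 0-4 and 5-9
    "x2": v + nan + v,                # observed in cases 0-4 and 10-14
    "x3": nan + v + [-a for a in v],  # observed in cases 5-9 and 10-14
})

# pandas computes correlations with pairwise deletion by default
r = df.corr()
print(r)
# r(x1,x2) = +1, r(x1,x3) = +1, r(x2,x3) = -1: mutually impossible

eigvals = np.linalg.eigvalsh(r.to_numpy())
print("smallest eigenvalue:", eigvals.min())  # negative -> not positive definite
```

A negative eigenvalue is exactly the "kiss of death" condition: procedures that need to invert or factor that matrix will fail.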
So, okay, mean substitution: instead of getting rid of cases with missing data, let's put
something in. The two most common procedures are substituting the mean of the
sample, the mean of the column, in place of the missing data, or the mean for that
person. Say it's item-level response data on an entire scale, a consistent
unidimensional measure; you could find how that person typically responds and put in
their person mean, so that's case-by-case substitution. We'd like for our distribution
to look like the one on the lower left, but with mean substitution we end up getting
the one on the lower right.

So, the means are unbiased, meaning the parameter estimates tend to be unbiased.
But the variance components, our standard errors, get drastically distorted, and that
leads to Type I error. There's far less variance on the right than there is on the left, so
we're typically going to have an increase in the Type I error rate with mean
substitution.
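The shrinkage is easy to demonstrate with a quick sketch (made-up scores, assuming MCAR deletion):

```python
import numpy as np

rng = np.random.default_rng(7)
y = rng.normal(100, 15, 1000)     # "true" complete scores

miss = rng.random(1000) < 0.4     # 40% of values missing completely at random
y_obs = y[~miss]

y_imputed = y.copy()
y_imputed[miss] = y_obs.mean()    # mean substitution: one constant for all gaps

print(f"SD of observed cases:     {y_obs.std(ddof=1):6.2f}")      # ~15
print(f"SD after mean imputation: {y_imputed.std(ddof=1):6.2f}")  # shrunken
```

The mean is preserved exactly, but the imputed column's standard deviation drops by roughly the square root of the observed-data fraction, and those understated standard errors are what drive the Type I error inflation.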
That leads us to the modern, model-based approaches: multiple imputation and full
information maximum likelihood. You'll see both of these prevalent in the literature.
Multiple imputation is a multistep process, where full information maximum likelihood
is a simultaneous one. Multiple imputation, a lot of the tradition, comes out of survey
research, where a big battery of assessments is given and plausible values are
generated when missing data occurs. There are multiple imputations; say the tradition
is m, m meaning the number of imputations or parallel datasets that are created, and
the tradition is five to ten. So five to ten parallel complete datasets are created.
Whatever model is intended is run on each of those datasets using whatever
procedure you might want to use, whether it's a mixed model, a structural model, or
traditional GLM-type procedures. And then those five parallel results are recombined
into a single report, a single analysis.
There's some user burden in terms of multiple imputation. I'll throw in a caution here:
some recent work by John Graham and colleagues has suggested that m = 5 may not
be enough; in fact, the simulation presented in that Graham paper shows that you
may need as many as 100 imputations to achieve the same statistical power as full
information maximum likelihood.
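Here is a stripped-down sketch of that workflow, using stochastic regression imputation and Rubin's pooling rules; in practice you would use dedicated software such as PROC MI, the mice package in R, or Mplus. The data and the analysis (estimating a mean) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 500, 20                     # m = number of imputations

x = rng.normal(0, 1, n)
y = 2.0 + 0.8 * x + rng.normal(0, 1, n)
y[rng.random(n) < np.where(x > 0, 0.4, 0.1)] = np.nan  # MAR missingness on y

obs = ~np.isnan(y)
b, a = np.polyfit(x[obs], y[obs], 1)                   # imputation model
resid_sd = np.std(y[obs] - (a + b * x[obs]), ddof=2)

est, var = [], []
for _ in range(m):
    y_i = y.copy()
    # stochastic regression imputation: predicted value plus random noise
    y_i[~obs] = a + b * x[~obs] + rng.normal(0, resid_sd, (~obs).sum())
    est.append(y_i.mean())                   # the analysis of interest
    var.append(y_i.var(ddof=1) / n)          # its squared standard error

# Rubin's rules: total variance = within + (1 + 1/m) * between
qbar = np.mean(est)
within = np.mean(var)
between = np.var(est, ddof=1)
total_var = within + (1 + 1 / m) * between
print(f"pooled mean = {qbar:.3f}, pooled SE = {np.sqrt(total_var):.3f}")
```

The between-imputation variance is what carries the extra uncertainty due to the missing data; a single imputation would leave it out and overstate precision.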
Full information maximum likelihood is almost the default procedure now in any
statistical paradigm that utilizes maximum likelihood, so the whole class of latent
variable models, structural equation modeling, IRT, latent class analysis, etc., and all
the mixed model, multilevel model, or HLM-type programs: the program HLM, PROC
MIXED, Mplus, any of those programs. When you're using maximum likelihood, FIML
is the default.

Full information maximum likelihood is conditional upon endogenous variables, which
means that its ability to account for missing data only extends to variables that are
endogenous, that have predictive arrows coming into them. X has to be complete.
The conceptual idea is that for the subgroup of participants where you have both X
and Y, there's some knowledge about the association. That knowledge, given their
value of X, is used to imply what the sufficient statistics on Y would have been, had
the full data been observed. So, in a way, you have to know what the value of X is for
each person, both those with complete data and those with missing data, so you can
match up: had someone had that same X value, what would their Y value have been?
X has to be complete for the FIML approaches.
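For the simplest case, a bivariate normal with X complete and Y partly missing, that logic can even be written in closed form: the complete-case regression of Y on X, evaluated at the mean of all X. A hypothetical sketch (my own simulated data, not the project's) shows how this repairs the bias that a complete-case mean suffers under MAR missingness:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 5_000

x = rng.normal(0, 1, n)                    # complete predictor
y = 0.8 * x + rng.normal(0, 0.6, n)        # true mean of y is 0
miss = rng.random(n) < np.where(x > 0, 0.6, 0.1)   # MAR: driven by x

xc, yc = x[~miss], y[~miss]                # complete cases

# complete-case estimate ignores the mechanism -> biased
mean_cc = yc.mean()

# FIML-style estimate: complete-case regression evaluated at the mean of ALL x
b = np.cov(xc, yc)[0, 1] / xc.var(ddof=1)
a = yc.mean() - b * xc.mean()
mean_fiml = a + b * x.mean()

print(f"complete-case mean: {mean_cc:+.3f}")   # noticeably below 0
print(f"FIML-style mean:    {mean_fiml:+.3f}") # close to 0
```

The software generalizes this same idea, using every case's observed variables to inform the sufficient statistics, rather than literally computing regressions by hand.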
So, there are some tricks to make everything endogenous, to make everything a Y
variable within the framework. Multiple imputation, by contrast, isn't a model-based
procedure in the same way, so all variables are fair game to have a value imputed for
them.

It's through these two procedures that we're able either to obtain complete data
despite planned missingness, or to model the fact that the data were missing in the
first place, intentionally not collected, and still estimate the sufficient statistics:
means, variances, covariances, whatever the distributional assumptions might be for
the given situation. If we can make those assumptions, we can continue on with the
modeling process.
That's my primer on missing data and its procedures. I've spent some time talking
about it, and there is so much more to say; I would recommend having Craig Enders
from Arizona State come in if you want further information. He has a recent book out
on the topic; it's very approachable, very readable, and he does a really good
presentation. Little and Rubin have the classic book, which is now in its second
edition, and there are a number of recent articles in the peer-reviewed literature,
handbook chapters, etc. These are some of the ones that I find to be useful, that I
tend to go to when I have to refresh myself on something or other, so I would
encourage any of you to pursue some of these. The biometrics articles may not be the
most readable, but the Psychological Methods and Structural Equation Modeling
journal articles are fairly readable.
Alright. Now that I have spent a few minutes talking about missing data and some of
the solutions for dealing with it, I'm going to talk about some of the designs that are
really the focal point of this talk.
So, the first motivating context that I'll use is the Language and Reading Resource
Consortium, we call it LARRC. This is the Ohio State-led project, under the Reading for
Understanding Initiative. Laura Justice, in the room, is our PI. It's a consortium of five
universities: Arizona State, University of Kansas, Lancaster University and the
University of Nebraska-Lincoln, in addition to our leadership, of course, at Ohio State.
In particular, I'm going to use our assessment panels as my motivating context for this
piece.
Now in our assessment panel, last year, in year one, we recruited a panel of preschool
through third-grade students. We had some cross-sectional aims that we wanted to
answer based on that year-one data, but we also have some longitudinal pieces that
we want to be able to answer later on, plus a need to inform our intervention studies.
We originally proposed an ambitious sample size, 1,200 participants, to be collected in
year one, with the majority of those followed for as long as we could keep them.
We're going to have some attrition, but as long as we can keep them, up through the
completion of third grade. So for last year's third graders, it's one and done; they're
out of the study. Last year's second graders are assessed again this year. Last year's
first graders will be assessed this year and next. The preschoolers will be assessed for
all five years of the study.
It's actually a modified cohort sequential design; I'll talk a little bit more about that
cohort design in a moment. I say it's modified in that we're kicking them out after
third grade. In a true cohort design, we would continue to track all five cohorts for all
five years, and we'd have an extended assessment period, a developmental period
that we could say something about.
Well, we originally proposed what was about a four-hour battery, which quickly
expanded to six hours, six to seven hours; it got really big really fast by the time we
put in all the measures we really needed and by the time we got our hands dirty and
found out how long it really took to assess the kids.

Then we found out that in some cases we didn't have the logistical capacity to assess
all of those kids, all 1,200 of them, for all of those hours. We had to go outside of the
school day, we had to expand our testing window; we had a lot of data we wanted to
get, but we had some logistical constraints. That led us to start thinking about this
idea of planned missing designs. Could we cut some of the measures, or could we cut
some of the measures for some participants, to reduce the assessment battery?
Ultimately, what we wanted to be able to do, and this is an early version of the model,
so please don't take anything inferential from it, was, in our cross-sectional analysis,
to test a model for each of the preschool through third grades, and to be able to say
something about how parameters within that model change as a function of
development, as a function of grade. You can see a lightly circled coefficient up in the
upper right region of each of those diagrams. We want to be able to test whether that
relationship between those two constructs changes as we move from first grade to
second to third, etc.
So we need a sufficient sample size in each of the grades. We also need sufficient
representation longitudinally; we need to preserve some of this information over the
five-year period. We have a lot of latent variables, so if we want to do good latent
variable modeling, we need at least three indicators per construct for identification
purposes. Yes, we can get away with two, but we shouldn't plan to do that; that's just
what happens sometimes in reality. And in some cases we have four or five measures.
We want to be able to measure some of these constructs really well.
Well, once we realized the magnitude of the problem, some of the sites were already
underway; some thought maybe they could collect the full sample, others not. We
couldn't look to a simple measurement solution. We didn't want to have some sites
with complete data and some sites with missing data. We're very protocol-based; we
want to be as experimental with this as possible. So we ended up not doing a missing
data design, but it forced us to consider a number of these issues.
A number of our measures are experimental, or they're being used in new contexts,
whether for earlier grades than they were originally developed for, or for older
grades. We also have an ELL sample that we want to be able to compare our primary
sample to, and so we have to look at invariance issues.
So, especially for some of these experimental measures, measures that we don't
know as much about as we'd like, we really need to preserve information. We need as
much information as possible. We don't want to get into a situation where we have to
determine the psychometric properties of these measures and then not have the data
to be able to do so. It's also a complex sampling design: kids within classrooms within
schools across four sites, spread across the U.S. We have a lot of between-group
comparisons that we have to be able to make, and so it became really important that
we preserve our sample sizes and preserve the breadth, the richness, of the data that
we intentionally wanted to collect in the first place.
Luckily, we felt that we were overpowered to begin with by making use of this
accelerated, cohort sequential design, which we had planned in the first place. So we
elected to maximize our between-measure information, capitalizing on some of our
previous intentional design decisions and some oversampling. As I said, we did
consider dropping some measures; we ended up deciding to reduce the sample
instead.
Using a cohort design, if we ultimately want to be able to say something both cross-
sectionally early on in the study, in say years one and two, and also longitudinally over
the five years, we can look at how the data accumulate. As long as we can preserve
most of that preschool sample, those 400 participants we recruited in the initial
sample, and carry them forward, we're going to end up with between 400 and 600
data points at each of those five grade levels by the time we're done.
So some of our cross-sectional aims aren't necessarily answerable down to a small
effect size in year one, but if we allow a second year of data to accumulate, by about
this time next year we'll be able to say something. Hopefully, within the next three or
four months, as our data continue to get cleaned and rolled out, and certainly a year
from now, we should be able to say something pretty conclusive about those cross-
sectional aims. Plus, we'll be accumulating longitudinal data and, once we get some
overlapping information, be able to say some hopefully really interesting things about
development.
I didn't really think about it initially, but if you take the chart from the IES website
showing the overlap in the panels across the RFU projects, not all RFU panels are
assessing all grades. There's intentional overlap. The whole RFU process itself is an
accelerated longitudinal design.
I'll use that as a segue to talk about a very specific set of planned missing data
designs, one that has very good face validity for this audience: accelerated designs,
also referred to as convergent designs, cross-sequential designs, or cohort sequential
or accelerated longitudinal designs. They have a 50-plus-year history and are
widespread in developmental applications. My primary exposure to them is more on
the gerontology end than early childhood development.
But what is the term accelerated mean here? A cohort sequential design and
accelerated design is one characterized by overlapping cohorts; you can see the three
cohorts in the diagram on the slide. We're going to recruit participants in a given year,
usually those groups of participants are across different grades or different ages,
different areas in the developmental spectrum; so in this case let's say kindergarten
this year's kindergarten, first grade, second grade. We're going to track those for a
limited number of measurement occasions. But because there's some linking
available, there's some overlap to the design in that this year's kindergartners will be
first graders next year, while this year's first graders are first graders now. We can link
by the fact that they will both be experiencing the first grade phenomenon and
likewise for the second graders, third graders, etc.
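To make the pattern concrete, here is a minimal sketch (the cohort and wave counts are hypothetical, chosen to match the slide's three-cohort example) of the observed/missing grid such a design produces:

```python
# Sketch of a cohort-sequential (accelerated) design's data pattern.
# Assumption: 3 cohorts, each measured for 3 consecutive years,
# starting in kindergarten (grade 0), grade 1, and grade 2.
grades = ["K", "1", "2", "3", "4"]
n_cohorts, n_waves = 3, 3

pattern = []
for cohort in range(n_cohorts):
    # "G" marks an intentionally collected wave, "." a by-design gap
    row = ["G" if cohort <= g < cohort + n_waves else "." for g in range(len(grades))]
    pattern.append(row)

for cohort, row in enumerate(pattern):
    print(f"cohort {cohort}: " + " ".join(row))
# Cohort 0 covers K-2, cohort 1 covers 1-3, cohort 2 covers 2-4:
# adjacent cohorts overlap at two grades, enough to link a linear slope.
```

The corner cells that are never collected are exactly the shaded boxes on the slide: missing completely at random, by design.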
Where you see a G in the diagram, that's data that's intentionally collected; assuming
no attrition, assuming no dropout, of course. The shaded three boxes in the upper left
and the lower right are elements of the developmental span, but are not intended to
be collected in the first place; missing completely at random. This phenomenon, this
cohort sequential design, the data from this type of research design, can be treated as
a missing data solution, not a problem.
Advantages: It allows for assessment of intra-individual change, and takes less time than a
longitudinal design. So here we get longitudinal data on an age range over five years
with three years of data collection. That's really where the accelerated phrase comes
into play. Subject attrition, testing effects, and other threats to validity can
be reduced because the temporal burden on the participants is reduced. The longer
a study is, the more the risk of attrition, the more the risk of cumulative testing
effects, especially if the repeated observations are closer together in time. So the
design itself has some nice safeguards built into it.
Applications: Basically anything in a longitudinal context. It may require a relatively
large sample size. It may require more information collected in each of those group
panels, so more kindergartners, more first graders, more second graders, but you're
not tracking them for as long a period of time, so it's a cost-benefit analysis: does it
cost more to recruit an extra 15-20 kids a year, or does it cost more to go into the
schools for a fourth and a fifth year? There will be some economic tradeoffs there that
have to be addressed. There are no universal sample size recommendations; typically,
in the anecdotal literature, numbers run around 150 per cohort. It also partly depends
on the analytic method. If you're using maximum likelihood and some of the more
advanced estimation frameworks, you need larger sample sizes for the validity of the
inference.
There has to be a sufficient degree of overlap. So there needs to be at least two
points of overlap to test for differences in a linear slope between adjacent groups,
more than two if you're going to look at anything curvilinear, quadratics, cubics, etc. So
you need at least two points of overlap to make the linear function work.
There are several analytic models in the applied statistics literature that can deal with
this. There is the original multiple-group SEM approach, where each of those cohort
groups has its own growth model and commonalities across groups are constrained to
be equal. You can treat it as a panel study, time one, two, three, regardless of the
developmental starting point, where age, here the developmental starting point, is
included as a covariate. You can treat it as a missing data design: if you go back a
couple slides to those corners with the intentional missing data, if you allow the
missingness to be in the data itself, it can be accounted for that way. There are
individually varying time point approaches out of the mixed model literature. Or it can
be treated as a random coefficients model, as, say, an econometrician might do.
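As a toy illustration of the missing-data treatment just mentioned, the sketch below (IDs and scores are hypothetical) restructures cohort-by-wave records onto a common grade axis, so the by-design gaps become explicit missing values that a likelihood-based growth model could then accommodate:

```python
# Restructure cohort-by-wave records onto a common grade axis, so
# unobserved grades appear as explicit missing values (None) that an
# ML/FIML growth model could handle. Data below are hypothetical.
records = [
    {"id": 1, "cohort": 0, "wave": 0, "score": 10.1},
    {"id": 1, "cohort": 0, "wave": 1, "score": 12.4},
    {"id": 2, "cohort": 2, "wave": 0, "score": 15.0},
    {"id": 2, "cohort": 2, "wave": 1, "score": 16.2},
]
grades = ["K", "1", "2", "3", "4"]

wide = {}
for r in records:
    grade = r["cohort"] + r["wave"]          # starting grade + wave = common axis
    row = wide.setdefault(r["id"], {g: None for g in grades})
    row[grades[grade]] = r["score"]

for pid, row in sorted(wide.items()):
    print(pid, row)
# Participant 1 (cohort 0) has data at K and grade 1; participant 2
# (cohort 2) at grades 2 and 3; everything else is missing by design.
```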
So you start to get the idea on this notion of planned missingness. The accelerated
designs are just one case of an efficiency-of-measurement design. We're trying to
accelerate our ability to assess along the developmental process using fewer years'
worth of data, but still trying to get the same bang for our buck, so to speak. Well,
Graham, Taylor, Olchowski, and Cumsille in a 2006 article, summarized several of these
more broadly, inclusive efficiency of measurement designs. Random sampling, itself,
is an efficiency design as the simplest case. But they also bring forth the notion of
optimal designs. Now this isn't necessarily Raudenbush's Optimal Design power
analysis program, but there is a component in that Optimal Design program that allows
you to put financial elements to it. So you can do a power analysis conditional not just
on effect size and various estimates, but also on the cost per unit, whether the unit is
the classroom or the participant. That's literally out of this Optimal
Design literature. So the attempt is to balance the cost of design decisions with
statistical power.
Fractional factorial designs: Box, Hunter, and Hunter is a resource on that type of
design, but the literature goes back a bit further. Instead of using a full factorial
design, those elements of the factorial design that are of most interest are selected,
rather than a fully-crossed design.
Which is actually not so different from adaptive testing: focusing information, focusing
resources on the area of inference that is of most interest to you. So, I'm trying to
make a lot of allusions, a lot of foreshadowing to later concepts; I'm priming you now
for adaptive testing.
And then there are classic measurement models that also fall under this efficiency-of-
measurement piece. The originator was probably simple matrix sampling; Shoemaker
is a citation for that, and in the upper right corner is an example of that type of design,
where you would have a set of participants, each assigned a different form of the
assessment battery, and each form contains a different block of items.
So not all participants get all assessments, but as you can see from the diagonal
elements here, the ones, there's no overlap. It's good for means, but it doesn't allow
for any type of correlation or covariance outside of correlations within the block.
Whether a block of items is just item A or whether it's a set of items, A, you'll be able
to say something about the correlation among items within that block, but you can't
say anything about the association between A and B.
The fractional block design allows means and some correlations; the lower right
diagram is an example of that. So now you see the squares are where assessments
are given; the circles are the absence of assessment. This diagram comes from Jack
McArdle. It allows for means again and allows for some correlations. His particular
approach requires a multiple group SEM application. It's been generalized out to this
notion of balanced incomplete blocks, so depending upon the degree of overlap, there
is some correlation information available. But in both of these cases, both the
fractional block design and the balanced incomplete block design, in terms of the
missing data solutions that are available, hopefully it would be apparent to you, based
on the pairwise deletion illustration earlier, that there's potential for some algebraic
problems, some matrix problems in terms of incompatible calculations.
Graham and his colleagues further generalized these to what they referred to as a
three-form design, and others have contributed to the development of these classes
of designs. They would have you split the overall item set or battery into four blocks,
called X, A, B, and C. All subjects get X; X becomes the linking, the anchoring
information. Presumably, that's the most important information, the core of your
assessment battery. In IRT or in measurement, we would literally call those linking
items, whether it's linking across multiple forms or the notion of vertical equating: you
give the kids at an earlier developmental period some items that are probably difficult
for them, but those items are still applicable to the next developmental stage up, and
as long as you administer the same items across those developmental ranges, you can
link the information together, by the principle of vertical equating.
In this design you also get that common set of linking information, X, and then
participants get two out of the remaining three blocks: A and B, A and C, or B and C.
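A rough sketch of that three-form assignment logic (the participant count and the round-robin assignment are hypothetical, just to show the coverage property):

```python
import itertools

# Sketch of the Graham et al. three-form planned-missing design:
# every participant gets the common block X plus two of A, B, C.
forms = [("X",) + pair for pair in itertools.combinations("ABC", 2)]
# forms: ('X','A','B'), ('X','A','C'), ('X','B','C')

# Round-robin assignment of 9 hypothetical participants to forms.
assignment = {pid: forms[pid % 3] for pid in range(9)}

# Every pair of blocks is jointly observed by some participants, so
# all pairwise covariances are estimable (unlike simple matrix
# sampling, where off-diagonal blocks are never seen together).
for pair in itertools.combinations("XABC", 2):
    n = sum(set(pair) <= set(form) for form in assignment.values())
    print(pair, "jointly observed by", n, "participants")
```

Note that the X pairs are observed on two-thirds of the sample while the A/B/C pairs are observed on one-third, which is why the most important content belongs in X.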
So a number of hypotheses are testable now; this k(k-1)/2 is referring to basically
descriptive univariate type hypotheses, are the means or correlations different from
zero.
I just threw in: don't forget multiplicity. Being able to test a number of additional
hypotheses isn't always a good thing. We should be intentional about which ones
we're trying to test, of course. And there are modifications on this, such as the split
questionnaire survey design. For the items in block A, do you really want all those
items together? Think of basic counterbalancing effects. You want to split up the
items in block A so they appear in different orders throughout the assessment
battery; basic manipulations of that nature.
In particular, this Graham article's culminating point was what they refer to as two-
method measurement. This builds off the notion of that common set of information,
X. There are many assessment situations where we can easily administer a cheap,
pencil-and-paper or computer-based instrument, maybe not as high in reliability or
validity as we might like, but easy to apply. Self-reports are the perfect case; self-
report measures bring their own bag of problems with them.
But in parallel to those cheap, easy-to-administer measures that we like to use, there
may be some gold standard, some really effective, reliable, precise measure that's
just too expensive or too time-consuming to give to everybody. A good example of
these is biological markers.
So Graham and colleagues used an example from smoking cessation. A really good
measure of whether someone smokes or not can be taken from analysis of, say, saliva
or blood work. It's really expensive, time consuming, and hard to do on a full sample,
but it is a whole lot more reliable than asking them, Hey, did you smoke or not? You're
going to get all kinds of response biases to that. But it's really easy to just ask them,
Did you smoke or not? So what they proposed is a two-method approach: get some
information on some of the participants using that really strong, valid, reliable
measure; get the cheap, easy-to-acquire information on the rest. And by having that
overlap in a construct situation, here in the lower left, you're able to maximize the
information. They found that even over the three-form model, this two-method
design, assuming that one of those two sets of information (in this case the biological
markers, the saliva measures) has sufficient strength and quality, can achieve better
psychometric precision with just two blocks than the three-block case.
All variations on this idea of providing some information to some, assessing some
information on some, and not assessing on others.
The last example or classification that I'll talk about in this category of assessing
everybody, but not necessarily assessing on everything, is the notion of computerized
adaptive testing. This is what Susan Embretson spoke on at the grantee meetings a
month ago, which I've now added in to expand upon for my talk.
Adaptive testing administers items that are most appropriate for a given ability level.
So again, I foreshadowed earlier this notion of fractional factorial designs: if some
pieces of the inference space are more important than others, why not focus
information in those areas rather than collecting what may be extraneous, redundant,
unnecessary information?
So, for example, for higher-ability examinees, why give them the easy questions you
know they're going to get right? Why not concentrate their effort on answering the
harder questions that are more appropriate and more challenging? They'll probably
provide more ability to discriminate among that subset of the sample.
Items essentially become weighted according to their difficulty, and that weighting
makes scores comparable even when participants don't get the same set of items.
Even in cases of complete non-overlap among the assessment battery, because we
made some assumptions about the difficulty or the characteristics of the items, we
can still make comparisons among participants.
Adaptive testing can often achieve the precision of a fixed-length test using as few as
half the items of the original fixed-length test. So this is, again, all made possible
through item response theory, IRT, which is basically model-based measurement.
So here's the formula at the bottom. This is an example of what's referred to as the
2PL, the two-parameter logistic item response model. It's basically the categorical
version of a common measurement model. So we assume that there is some
construct, theta, and there are two additional parameters to describe the relationship
between the item and the construct; those are the b and the a parameters. A would
be referred to as the discrimination parameter; b is the difficulty parameter. It's the
difficulty of the item, the mean response to the item, in a sense. And the item-total
correlation would be the parallel to the discrimination parameter in classical test
theory.
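The slide's formula isn't reproduced here, but the standard 2PL response function, P(correct) = 1 / (1 + exp(-a(theta - b))), can be sketched as:

```python
import math

def p_correct(theta, a, b):
    """Standard 2PL item response function: P(correct | theta)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# An average-ability examinee (theta = 0) facing an average-difficulty
# item (b = 0) has a 0.5 chance of a correct response, whatever a is.
print(p_correct(0.0, a=1.5, b=0.0))   # 0.5
print(p_correct(1.0, a=1.5, b=0.0))   # higher ability -> higher probability
```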
These are example item response functions from a number of hypothetical items.
The ogive, the S shape, is due to the nature of the item responses: traditionally, in
item response theory, it's correct or incorrect. It's a dichotomous response, and so
we're trying to predict the probability of whether someone gets an item correct, gets
a one versus a zero. Probability is bounded by zero and one, so that's what forces the
curvilinearity in the response function.
If this were a Likert response or some assumed continuous item response, then it
would be a straight line, and values both below zero and above one could be plausible
values.
Each of these items differs in its location. Take the inflection point, the point where
the curve ceases to accelerate and starts decelerating, which corresponds with a
probability of .5; you draw a line from .5 over and then down. That's the difficulty,
the location, of the item.
So if we look at the dark shaded item here, its .5 anchor is at a difficulty, or an ability
level, of zero. Zero means average, not absence of ability. So take a participant who is
of average ability, of average theta, whatever the construct might be, whether it's an
ability or some other type of trait or characteristic. A person at that average ability has
a .5 probability of getting that item right. That means the item is appropriate for a
person at that ability level.
In adaptive tests, items are selected so that the majority of the items are appropriate;
they basically all have difficulty values matched to the examinee, zero in this example.
You minimize the easy items and you minimize the overly difficult items: you minimize
the ones they should get right anyway, and you minimize the ones that they had no
chance of getting.
So you can look at the location, left to right, in terms of items differing in their
difficulty, their location. And you can look at the slope at the inflection point, the
discrimination parameter: the relationship between the construct on the X-axis and
the likelihood of a response. That's the item-total correlation, so to speak, from
classical test theory.
So what do we know about the steepness of a slope? A steeper slope means a higher
correlation: more information provided by the item in terms of measuring the
construct, higher discrimination, an increased ability to effectively rank-order
participants on that continuum.
We can convert those item response functions to information functions. So for each
of the three ogive shapes that you see, you should now also see three corresponding
humps, where the peak of the hill is situated at the inflection point of the
corresponding curve. So, again, the dark shaded one, which looks like a normal
distribution there in the forefront, corresponds with that previous item that had
average difficulty, that was located at the zero point.
Look at that item response function's curve, its steepness, relative, say, to the one
immediately to the left of it. The one to the left has a steeper slope at the inflection
point: there's higher discrimination, so its corresponding information function, the
small dashed hill there, has a higher peak. There's more information because it has a
higher discrimination value, a higher association. But look at the breadth, the width, of
that item information function: it has more information, but over a narrower range of
the ability distribution than the one with a less steep slope and a wider range.
And then the item response function to the right has the least steep slope, and you
can see its corresponding information function is basically a Nebraska mountain out
there. So not a lot of information, but what little information it has is distributed over
the broadest range.
So what we'd like in a well-conceived test is a range of difficulty: a spread of item
response functions from left to right that all have comparable slopes, slopes as steep
as possible, as high a discrimination as possible. The corresponding test information
function shown here is just based on the previous items, so this isn't an ideal test; the
ideal test information function would look like a plateau, high information over as
broad a range as possible.
That corresponds, then, with a standard error of measurement. The dark line is the
test information function; the dashed line is its inverse, the standard error of
measurement. At the point where the information is localized and concentrated, at its
peak, the standard error of measurement, the precision on that individual, is as low as
it's going to be, given this assessment. When you get out to the high-ability or low-
ability range, information is less, because there are usually fewer items appropriate
for high or low ability, and so information is decreased and measurement precision is
decreased as well. That standard error is larger than it would be otherwise.
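As a rough numerical sketch (the a and b values below are hypothetical), a 2PL item's information is a²P(1-P), test information is the sum over items, and the standard error of measurement is its inverse square root:

```python
import math

def p2pl(theta, a, b):
    """2PL response probability."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_info(theta, a, b):
    """Fisher information of a 2PL item: a^2 * P * (1 - P)."""
    p = p2pl(theta, a, b)
    return a * a * p * (1.0 - p)

# Hypothetical three-item test spanning the difficulty range.
items = [(2.0, -1.0), (1.5, 0.0), (0.8, 1.0)]   # (a, b) pairs

theta = 0.0
test_info = sum(item_info(theta, a, b) for a, b in items)
sem = 1.0 / math.sqrt(test_info)   # standard error of measurement
print(test_info, sem)
# Each item's information peaks at its own difficulty b, and steeper
# (higher-a) items peak higher but over a narrower theta range.
```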
What an adaptive test tries to do is maximize the peak of that information function for
the individual over that individual's true ability level. So it would have a high and very
narrow peak, very similar to the tallest peak here.
Adaptive tests work by making some initial assumptions about participants. We
assume that basically everyone is average, and so you get a moderately difficult item.
If you miss an average item, you get an easier one the next time around, one more
appropriate for a lower ability estimate. If you get it correct, though, you get a
slightly more difficult one, and you move up. Using item response theory, because we
have discrimination values and difficulty values, we make some assumptions, or we've
otherwise calibrated, we've learned something empirical, about the relationship
between the items and the construct. Because we have those additional parameters
in our arsenal, we can select the next item based on its difficulty, rather than having
to administer the entire test and then determine the scores afterward.
So subsequent items basically get tailored to the respondent's ability level. You
continue this until the algorithm identifies a stopping point based on some precision
criterion, like that person's standard error meeting some minimal criterion, or you've
administered the maximum number of items that would have been administered had
it been, say, a fixed-length test.
So here's a diagram representing the branching idea. Question one: everybody usually
starts with the same item, the same level of item. Say you get it correct. Then, moving
down the left-hand side, you take the item response, update your assumptions about
the individual, and then choose the next item. And at each point, it branches. If you
miss the item initially, you get a lower-ability item, an item appropriate for someone
with a lower ability than what was initially assumed.
Notice that on both tracks, whether you get the first item correct or miss it, you can
still end up in the same eventual location. You can still meet this middle item across
the bottom row. But the pathway to getting to that item becomes more complex.
That lengthens the test: your response pattern is basically more inconsistent, and so
that lengthens the test, reduces the efficiency, etc.
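A toy sketch of that branching logic (the item bank is hypothetical, and the fixed-step theta update is a crude stand-in for the maximum likelihood or EAP estimation a real CAT would use):

```python
import math

def p2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def info(theta, a, b):
    """2PL item information at the current ability estimate."""
    p = p2pl(theta, a, b)
    return a * a * p * (1.0 - p)

# Toy item bank of (a, b) pairs with difficulties from -2.0 to 2.0.
bank = [(1.2, b / 2.0) for b in range(-4, 5)]

def simulate_cat(responses, max_items=5):
    theta, used = 0.0, set()                 # start by assuming "average"
    for correct in responses[:max_items]:
        # pick the unused item that is most informative at the
        # current ability estimate (i.e., difficulty nearest theta)
        i = max((j for j in range(len(bank)) if j not in used),
                key=lambda j: info(theta, *bank[j]))
        used.add(i)
        theta += 0.5 if correct else -0.5    # step up on correct, down on miss
    return theta

print(simulate_cat([True, True, False, True]))
```

Each response nudges the ability estimate, and the next item is chosen to be maximally informative there, which is exactly the "update, then branch" loop in the diagram.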
That's the end of my section on assessing everybody, but not necessarily on all
instruments or all items in the case of adaptive testing. The other classification of
models falls within this idea of you give everything to everybody, but you don't
necessarily assess or test or intervene with everyone you might have started out
intending to assess.
Let me contrast this notion of sequential designs with what we're more familiar with,
a fixed experimental design. Fixed designs are typical in education and psychological
research, the social and behavioral sciences, etc. A fixed design is one where the
sample size and the composition, so who's assigned to what group, are determined
prior to conducting the experiment. So we do a power analysis to figure out that we
need 100 participants; 50 are assigned to the counterfactual and 50 are assigned to
receive the intervention. And we intervene with the 50 participants, measure the
business-as-usual or whatever the counterfactual condition is with the other set, and
at the end, we make our inference on a fixed sample size. Versus a sequential
experimental design, where the sample size is treated as a random variable.
We don't know what the eventual sample size is going to be, but we make a set of
assumptions; we lay out a protocol for the parameters of potentially stopping early.
This allows for sequential analyses and decision making. The
idea is that we make our inferences based on accumulating evidence, accumulating
data. All the while, we maintain our appropriate error rates, both Type I and Type II:
preserving our statistical power while maintaining our Type I error.
Also referred to as adaptive or flexible designs (think of computerized adaptive
testing). Current design decisions are sequentially selected according to previous
design points. It is kind of a Bayesian idea. It's not necessarily a Bayesian statistic, but
it's a Bayesian idea, this notion of prior assumptions, collecting some information, that
prior is updated by the data to become the posterior, while the posterior becomes the
next prior. So we may form an opinion; something happens to change our opinion; we
update our opinion. And then something else happens, and then we update. It's an
iterative, cumulative process.
This is the principle in play with a sequential design versus, again, the fixed design
where that composition size is fixed. We have to encounter everything that we might
encounter to form a cumulative opinion, a cumulative inference, rather than adapt it
as we go. Although there is the potential for this continuously iterative process until
we reach a certain criterion, typically an upper limit is set in practice, and that is
typically close to what the fixed sample size might have been in the first place.
Primary benefits: it allows for early termination of experiments. So instead of
reducing, say, the size of an assessment battery, you don't collect data for as long or
on as many participants. It's an early termination. From an ethical perspective, this
prevents unnecessary exposure. It also prevents unnecessarily withholding the
administration of something that is showing clear evidence of working.
From a logistical perspective, there can be considerable financial savings. Typically, the
savings are reported to be between 10% and 50%. So remember, in adaptive testing
one of the common selling points is that you can reduce the fixed-length test by half;
that would be a 50% savings.
The adaptive test is actually an example of a sequential design. Sequential designs
have a long history; this is nothing new. It's new to the social and behavioral sciences,
but it has a long, almost 100-year history in other disciplines. The earliest reference
I've found goes back to 1929, with the double-sampling inspection procedure for
industrial quality control. Mahalanobis (as in Mahalanobis distance) contributed to this
with a census of the Bengal jute area in India; jute is basically the fiber in burlap sacks.
Where it really started taking off was in 1943, with the development of the sequential
probability ratio test by Wald and other members of the Statistical Research Group at
Columbia. You see a lot of familiar statistical last names among that group. I say this is
the real jumping-off point because this is also where the parallel field of sequential
analysis took off, primarily because of this development of the sequential probability
ratio test. It turns out that the sequential probability ratio test is the stopping criterion
used in adaptive testing. In 1960, Peter Armitage published what could probably be
considered the resource on sequential designs in biomedical applications. And then in
the 80s, we saw the rise of adaptive testing.
Adaptive testing is made possible through IRT. The statistical foundation for it,
though, is Wald's sequential probability ratio test, but the whole idea of adaptive
testing in general actually goes back to Binet at the turn of the last century, with
individualized intelligence testing. So this isn't anything new. It's new to us. It was
definitely new to me; it wasn't something I was exposed to in graduate school, but it
has a long, long history.
Characteristics of the sequential design: there needs to be at least one interim
analysis. I guess you can think of it as one data snoop. You have to have the
opportunity to look at the data once and make a decision at that point in time, prior
to the formal, fixed completion of the experiment. But there's a protocol for that, so
the decisions that you can make at that one interim analysis are predetermined, and
the criteria leading to those decisions are predetermined.
So you have to figure out how many times you're going to look (how many interim
analyses), and how much information (what's the n at each stage: are you going to
look halfway through; are you going to look every ten participants; are you going to
look after every participant that completes the study). You have to know what your
nominal alpha and beta levels are (how highly do you want to be powered; what error
rate are you going to control, the 0.05 level versus 0.01 versus something even more
stringent). And then we determine the critical values. Think of a simple t-test with a
normal distribution: the upper and lower regions of rejection, plus or minus 1.96, are
the boundary values. So you have to determine what the boundary values are going
to be each time you make an interim analysis.
All available data are analyzed at each stage. So if you're going to look every 10
participants, the first interim look is based on the first 10. At the second interim look,
you look at the first 20, because it's the first 10 plus an additional 10. And you
continue forward. At each stage, the appropriate test statistic is calculated, and the
Fisher information level (which is just the inverse of the squared standard error) is
calculated; that's basically the denominator for your inferential test. That test statistic
is compared against a critical value, so it's traditional hypothesis testing, checking
whether the test statistic falls within a decision region. Now, the decision regions, as
you'll see in a moment, get a little bit more complicated. Decision regions aren't just
regions of rejection; they're also regions of futility. There's a central region where,
basically, it's not going anywhere and you might as well give up. Or if it's somewhere
in between, you keep going: not enough information, keep going until you do get
enough or you have to stop.
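A minimal sketch of that accumulating-data procedure (the boundary values here are illustrative only; in practice they would come from a method such as O'Brien-Fleming or Pocock, and the z statistic assumes a known-variance one-sample test):

```python
import math

# Sketch of a group-sequential analysis: look every `group_size`
# observations and compare a z statistic to a predetermined boundary.
# Boundary values below are made up for illustration: stricter at
# early looks, relaxed as information accumulates.
boundaries = [3.47, 2.45, 2.00]
group_size = 10

def sequential_test(data, mu0=0.0, sigma=1.0):
    for look, bound in enumerate(boundaries, start=1):
        n = look * group_size
        if n > len(data):
            break
        mean = sum(data[:n]) / n            # ALL accumulated data, each look
        z = (mean - mu0) / (sigma / math.sqrt(n))
        if z >= bound:
            return ("reject", look)         # stop early for efficacy
    return ("fail to reject", look)

# A strong true effect tends to stop at an early look.
print(sequential_test([1.2] * 30))   # -> ('reject', 1)
```

A futility boundary would be a second, lower set of values triggering an early "give up" decision; it is omitted here to keep the sketch short.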
Here are some examples of boundary plots; this is kind of the power analysis stage of
a sequential design. The diagrams on the left are just meant to illustrate the overlap
or non-overlap of the different boundary plot methods. There are different
procedures employed in the field of sequential design for determining what the
boundary values are. So if you look in any of the four regions: you have the white
area, and then you have the first vertical line. That's the first look.
The point, the boundary of the blue area, that's your critical value; it's not literally
1.96 or 1.64, but it plays that role. Notice that as data accumulates, as you go from
the beginning to what eventually is the fixed sample size, the criterion by which you
make your decision becomes, I guess you might call it, more liberal. So if you're going
to make a decision early, you have a much higher standard to meet, but that standard
is relaxed as you accumulate more information.
So in this figure, the top two panels are one-tailed tests, the bottom two are two-
tailed tests. Early on, at that first vertical line, you have three decisions you can make.
If the test statistic falls in the white area: keep going, not enough information. If it falls
in the dark area at the top, you've found statistically supported evidence that
whatever the effect is may likely be true given that sample, so you reject the null
hypothesis. If it falls in the low area: thanks for trying. It's not the region of rejection;
it's the region of futility.
And with the two-tailed tests, you see how it just gets more complicated, because you
have two tails: you have two boundaries for the region of rejection, and you also have
two boundaries for this futility region in the middle. By the time you get to the far
right, the initial fixed sample size has been reached, and you see that all points
converge. We're back to our two decisions: you reject, or you fail to reject.
The only reason why these plots are different is just different methods are used to
determine the boundary values. That is an area of research within this field.
There are three general types of sequential designs: a fully sequential design, a group
sequential design, or a flexible design. Fully sequential designs are continuous: there's
an interim analysis after every observation. Adaptive testing, where each item is
adaptively selected, is an example of a fully sequential design. If instead a testlet is
administered, say a block of ten items, scores are calculated, and then the next set of
10 is based on the performance of the previous set of 10, that's an example of a group
sequential design, where instead of looking after every observation, it's after a set:
looking every 10, every 25, those types of considerations. The flexible designs are a
compromise between the two.
Limitations: clearly, there's an increase in design complexity. You pretty much have to
get a methodologist or a statistician, probably somebody with some background in
biostatistics at this point, to collaborate on what should already be an interdisciplinary
proposal. There are increased computational burdens, but just like there's an app (a
protocol) for that, there's an app for this. SAS has two sequential design procedures
that make this very, very feasible: a procedure to determine the boundary values, and
then an analytic procedure that tests the statistic, controlling for the error rate.
There are threats to validity due to this early-termination idea. If you terminate,
whatever the reason is, whether it's efficacy, futility, or safety, where you just have to stop
early because of the risk to the participants, that small sample size can lead to distrust.
There could be some assumption problems, depending upon the analytic method,
the inferential method, especially if you're using maximum likelihood.
Maximum-likelihood principles are asymptotic, so it works a whole lot better with large
samples.
And oftentimes, the early termination is more complex than just that statistical
criterion. We don't usually measure just a single outcome in the social sciences; we
usually have a full battery of outcomes. And what happens when one variable shows
early termination and another variable doesn't? You have decisions to make.
Do you just stop collecting data on that one outcome that you've shown evidence for,
or do you have to continue on with everything? What if that one measure is an
indicator for a construct, and the rest of the construct isn't done yet? You have to
make those types of decisions: what's primary, what's secondary, etc.
So back to the substantive context, and we'll be pretty close to wrapping up here.
Based on the CBC in the Early Grades Project (and that citation should actually be
'08, not '11), we completed a four-cohort, fixed-design, cluster-randomized trial to
evaluate the effect of the CBC intervention for students with challenging classroom
behavior. We had data from 22 schools, 90 classrooms (and an equivalent number of teachers),
and 207 K-through-third-grade students and their parents. Student-parent dyads within a
teacher were randomly assigned to one of two conditions: a control
condition (business as usual), or receiving the CBC. So assignment to condition was at
the teacher, the classroom small-group, level. That makes it a cluster-randomized
trial.
The study was additionally proposed and designed to detect a medium standardized
effect of approximately 0.38. That told us we needed a sample size of 270 children in
90 classrooms, with an assumption of three kids per class. We didn't end up getting a
full three kids in some of the classes, but we progressed through until we got all 90
classrooms. That took us into a no-cost extension year, but we were able to do that.
Because assignment to condition was at the classroom level, it's not the 270 kids that
drive the sample size; it's the 90 classrooms. That's why we had to push through there,
and why it was acceptable that we didn't get the remaining 63 child participants.
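As a rough illustration of why the classrooms, not the children, drive the power calculation, the standard design-effect formula for a two-arm cluster-randomized trial can be sketched in Python. The intraclass correlation of 0.10 in the usage line is purely an assumed illustrative value, not the study's actual power-analysis input:

```python
import math
from statistics import NormalDist

def clusters_needed(delta, m, icc, alpha=0.05, power=0.80):
    """Clusters per arm for a two-arm cluster-randomized trial.

    Normal-approximation formula: per-arm cluster count is
    2 * (z_{1-alpha/2} + z_{power})**2 * DEFF / (m * delta**2),
    where the design effect DEFF = 1 + (m - 1) * icc inflates the
    required sample for the within-cluster correlation.
    delta is the standardized effect and m the children per classroom.
    """
    z = NormalDist()
    z_a = z.inv_cdf(1 - alpha / 2)   # two-sided critical value
    z_b = z.inv_cdf(power)           # power quantile
    deff = 1 + (m - 1) * icc
    return math.ceil(2 * (z_a + z_b) ** 2 * deff / (m * delta ** 2))

# delta = 0.38, m = 3 kids per classroom, assumed icc = 0.10
print(clusters_needed(0.38, 3, 0.10))
```

With those inputs this gives 44 classrooms per arm, about 88 in total, which lands in the neighborhood of the 90 classrooms (270 children) the study targeted; the point is that adding more children per existing classroom buys far less power than adding classrooms.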
Well, I took that completed study and did a methodological piece: presented it at the IES
Research Conference, presented it at APA, and we're currently working through the
manuscript and hope to get that out by the end of the year. Basically, I implemented
a post hoc application of this sequential-design analysis strategy. I treated cohorts
as the groups: we ended up collecting data over four years; the intervention itself is
contained within a 12-week period, so the intervention can be contained within a
calendar or an academic year; and we collected four cohorts' worth of data. So I'm going to
treat each cohort as a group, assume that the eventual decision that we made based on
the fixed sample is the true finding, and try to see if I could have found
that same finding early on.
So what's the degree to which sample-size savings might have been realized if we'd
implemented this as a group sequential design from the start? Everything
was implementable in SAS. PROC SEQDESIGN is basically a power-analysis procedure.
I used GLIMMIX as my analytic model, as I would have done in a fixed design anyway
(and as I did; that paper is almost in press). PROC SEQTEST then integrates the
information: it takes the information from GLIMMIX, conditional upon the boundary
values set out in the sequential design, and reports back. Basically, the figures that
you saw earlier with the boundary values, and you might have seen some dots on
there, those dots were the hypothesis tests from GLIMMIX placed in the context of the
sequential design. So that was output from the sequential test procedure.
So here we see those again, a little bit bigger. Now we have several measures, both
parent report and teacher report, and so this sets up part of that conundrum of
"When do you stop what?" Because we're looking at a number of outcomes, and not
everything draws the same conclusion. So here, this is our adaptive skills measure from
the BASC. On the left are the parent reports; on the right are the teacher reports. On the
left you can see the dots progressed very quickly into the lightly shaded area (the
region of futility). Parents' perceptions of changes in their children's
adaptive skills were not impacted by the intervention.
Well, look at the teachers. All of their data points, from the start, after the first 25
classrooms in the first cohort, clearly showed an effect. Parents aren't showing it; teachers
do. Then, parent report versus teacher report on the externalizing behaviors score from the
BASC: again, we see the same general phenomenon. Parents don't get it;
teachers do. But here, for parents, it's initially in the "keep going" region; it's
in the white area after the first cohort. After the second cohort, though, they move into,
and then stay in, the region of futility. We could have stopped collecting data on this
measure after the second cohort for the parent responses. On the teacher report,
we eventually were able to reject the null hypothesis, but we needed to go all the way
into the fourth cohort to be able to do so. So already, we've seen four different
decisions that have to be negotiated.
Looking at the parent/teacher relationship, we see a similar phenomenon in that no
decisions can be made after the first cohort. By the time we get to the second
cohort, or actually the third cohort for the parents, it's just barely outside, right on
the boundary, and then it moves in by the third. But again, parents didn't get it and
teachers did, with different stopping points, though, for different measures. Then we have
the social skills measure: parents, right off the bat, didn't get it, and for teachers we had
to go through the third cohort to be able to reject.
So when you compare those decisions from the sequential design to the fixed results,
on a number of the measures, pretty much all but one, we could have
stopped early. It was just that externalizing teacher measure that we had to take all the
way through to get our full sample. Everything else, we could have stopped early. We
basically could have stopped collecting parent information after the second cohort,
because the intervention clearly wasn't working for them, at least given the sensitivity
of the measures that we were employing. Whereas for the teacher effects, we could see
some of those early, but to get the full flavor of the intervention's efficacy we had to go
all the way through.
A number of source materials are listed here: both of my presentations, a chapter I have
in a handbook with a grad student, and some of the other resources I've found useful.
These are here for your information if you want to do some additional reading. I cite
Wald and provide you a citation there; despite the journal, it's fairly readable.
And I'll just wrap up now with conclusions and things to think about. Like any
methodological or statistical approach: different approaches for different questions.
There's no one-size-fits-all solution to any of this. To me, in terms of the "you collect
data on everybody, but not everything" type of category, the question is what degree of
overlap is necessary. Just as there's no one-size-fits-all approach in general, I think that
even the designs that I talked about earlier in the presentation, in terms of the matrix
sampling, the three-group or the two-group designs, I don't know that any one of
those is something that should be universally recommended.
I think that you need to be conscious, for me as an experimentalist particularly, of the
effect sizes. If anything, you have to think about the quality of the information
that you seek to obtain and the magnitude of the relationship you're trying to detect,
whether it be a factor loading or whether it be a mean difference. Logically, if it's a
small effect or a weak loading, you have to have more information to be able to
make a reasonable, valid claim about that effect or that association. So you need more
data on that.
Maybe it's a case of taking all of those items, or that block of measures, that have weak
associations or small effect sizes between them, and placing those in the
X category that everybody gets. And then it's the big effect sizes, the broad side of the
barn: clearly this is an indicator of the construct, clearly there's a reasonable effect
size to detect here. Those are the measures that get put into categories A, B, and C
that are administered to some and not to others. It needs careful consideration
before you apply any of these, but the tools to be able to consider these problems are
readily available to us and have been for decades.
Issues of counterbalancing: where should X occur in the battery? Again, I always fall
back on my experimental roots. I just encourage everyone not to give up that
experimenter's, that researcher's, control over this. Yes, the statistics can bail you out
when reality happens, to an extent. They can't solve all problems, but you can
incorporate enough covariates and stratification variables and what have you to get
some reasonable inference. But all of this works better if it's protocol-based, if
it's purposeful, if it's intentional, if it's planned out, and you follow the script. Then the
inference is always better.
Alright. Thank you very much for your time. For any follow-up questions, or any
further conversations that maybe you would like to have, my contact information is
available as well.