Thank you very much for the opportunity to talk with all of you today. As Karen said, I have worked in a number of areas over the last several years, mostly pertaining to multivariate applications. I've also been very fortunate, thanks to IES, to be quite well funded on research projects that have stimulated much of the methodological work I've been engaged in over the last several years. What I'm going to talk about today is not terribly empirical; this isn't a simulation study or a series of simulation results, but rather a set of ideas and project-related findings inspired by a number of the IES projects I've had the opportunity to work on. In particular, I'm going to be talking about planned missing data designs. I tried to come up with a more creative title, a play on "not all X need apply," but basically the question is: do we really have to assess all individuals on all measures? My quick answer is no. We don't need to collect full information on all participants in our studies, whether the study is experimental or more of an assessment, and whether it is small scale or large scale. Thanks in part to high-speed desktop computing, we can take intentionally missing data, data we never intended to collect in the first place, data that was purposefully left out, and treat it as what is usually referred to as a missing data problem. I prefer, in this context, to think of it as a missing data solution, because in a number of ways the financial flexibility that planned missing data designs make available to us far outweighs the computational burden involved in accommodating the missing data. Plus, there are already plenty of examples of planned missing designs in the substantive literature. You see them across disciplines: accelerated longitudinal designs or cohort sequential designs, popular in developmental, cradle-to-grave applications; specific planned missing data designs sometimes referred to as efficiency of measurement designs, getting back to this notion of saving something, whether it's participant burden, money, time, or researcher resources; measurement applications, from simple matrix sampling to modern computerized adaptive testing, which are intentionally missing designs; and the area I've become most interested in, which comes from biostatistics and medical applications, the notion of sequentially designed experiments. So I'm going to talk right off the bat about my motivating context. As I said, most of the concepts and ideas I'm going to present today are inspired by my work on substantive projects, so I'll describe those motivating contexts. I like to think of this as a missing data solution rather than a problem, but I really can't get away with not talking about missing data itself, so I'll talk a little about the types of missing data and some of the common methods for dealing with it; those methods are really what make planned missingness feasible in the first place.
Then I'm going to talk about the two classifications I use to compartmentalize all of the different types of designs that are out there. One category covers situations where all participants are assessed, big S little s (Ss) participants, not technically correct APA format anymore, but it's wired in there someplace and hard to override. You assess all participants, but not all participants receive all of the measures. The other category is the reverse: all instruments, all measures, are administered to those individuals who are assessed, but not all of the initially planned individuals will be assessed; there are circumstances where the study can stop early. The savings comes from not recruiting additional participants rather than from reducing the battery administered to everybody. In the first category, where everybody is assessed on something, the particular background will come from my work with the Reading for Understanding projects. I'll talk specifically about accelerated longitudinal designs, give some of the other examples of planned missing data designs, and discuss computerized adaptive testing. At the grantee meeting a month ago, Susan Embretson was on the panel and spoke specifically about adaptive testing, while I just made reference to it; I've expanded my slides a little to cover it here, because it gives me the segue into the idea of not assessing everyone we might initially have intended to assess, and for that I'm going to talk about sequentially randomized, or sequentially designed, experiments. My interest in that class of models comes from my experience with a number of randomized trials that have been conducted, or are currently in place, through IES-funded projects. So, as I said, I'll describe my motivating context a little more. Four particular projects really provide the body of my interest in this area. First is the currently funded Reading for Understanding project. I'm an investigator on the UNL subaward led by Tiffany Hogan, but the overall project is led by Ohio State, and somewhere around the corner I think Laura Justice is still in the room; probably a good thing there's no opportunity for eye contact here. In that set of studies, we have an assessment project, Study 1, currently underway, and then the later intervention work that's going to be done. I'm going to talk about some of the ideas regarding planned missingness that apply to that assessment study, issues we've thought about, discussed, and grappled with over the last year or so. Then I'm going to use data from a completed 2004-10 randomized trial that I worked on in collaboration with our PI, Sue Sheridan, also at the University of Nebraska, in which we evaluated an intervention referred to as Conjoint Behavioral Consultation, CBC, intended, for one outcome, to reduce disruptive behaviors among children both in the classroom and at home. I used that completed project as a data lab, so to speak: I treated it as a secondary dataset and reanalyzed the data as if it had come from a sequentially designed experiment in the first place, and I'll demonstrate some of the potential sample savings that came about.
That work was presented at the Research Conference, I think in 2008 or 2009. I learned all about sequential designs in that context, and now we have two other projects that also involve randomized trials. One is the National Center for Research on Rural Education, where we're in the midst of a randomized trial. The other is a second efficacy study in the CBC family of projects, where we're replicating the intervention in a rural context. I think it's not an uncommon problem, but we've run into some recruitment issues; getting schools, especially some of the rural schools, to agree to participate is always a challenge, and it's rearing its head for us now. So Sue, Todd Glover, and the other investigators on this project keep coming to me and asking, "Okay, do we really have to recruit everybody we said we were going to?" Thanks to the work I had done on the previous CBC project, I had a little bit of foresight. I gave them one power analysis that went into the proposal, but at the same time I also started a sequentially designed parallel plan, anticipating that at some point they would come to me and ask, "Do we really have to collect everything we thought we needed?" I planned for the opportunity to take a peek, and I'll talk about how that can be done in a principled manner, because we're at the point right now where we're recruiting for next year already. Basically, it's a power analysis that also takes into account this idea of looking at the data early, with multiplicity control; I built in a component that allows us to test repeatedly while keeping the error rate under control. So what is missing data? Missing data comes about through a number of mechanisms. It can be something intentional on the part of the participants, selective non-response: they choose to respond to some items or measures and not others, or they don't show up for an assessment period; maybe they're sick one day but are back at the next assessment in a longitudinal design. Missing data can come about through attrition, where participants simply drop out of the experiment or the study, whether that's intentional or more of a random process; the kid can't participate if the parents move them out of the school system, especially to another state. That's not necessarily because of the intervention itself; there may be something systematic, but not something we have to deal with in terms of the inference. Sometimes missing data is what we're going to continue to talk about today, missing by design, where we intentionally don't collect something in the first place. And sometimes missing data comes about through human or technology error, whether it's coffee spilled on the laptop, the internet connection going down, or those darn undergrads not getting something quite right in the data entry process. Can't blame it on grad students or post-docs, but undergrads can be the scapegoats for everything. Regardless of how the missing data comes about, there are types of missing data defined by the set of assumptions we make about the statistical properties of the missingness. So I'll give an example here: suppose we're modeling a construct like literacy, and I'll use the label Y for that.
And let's say we're modeling literacy as a function of some predictor variable, in this case SES, socioeconomic status, which I'll just label X in the later diagrams. Some participants don't complete the literacy measure, so we have missing data on the outcome variable Y. We have to ask ourselves why we have that incomplete data. Is it a random process or a systematic process? And if we determine that it's a systematic influence, can we determine its nature? That leads us to a typology of missing data: missing at random, with missing completely at random as what I like to think of as a special case of it, versus missing not at random. Basically, missing at random or completely at random is missing data we can deal with fairly easily; missing not at random is a much more troublesome situation. Further defining the elements: X is our completely observed predictor variable; Y is our partly observed outcome variable that has some missingness to it; Z represents the causes of missingness, whatever mechanism led to missing data on Y, whether it's one variable or a set of variables; and R is the indicator of whether a value is missing or not, the probability of missingness. Starting with missing completely at random, the MCAR model: in an MCAR situation, missing values on the outcome variable Y, literacy, are not associated with other variables in the dataset or with Y itself. There may be some causal mechanism between X and Y, but as you can see in the lower left, there's no association between Y and R. R, the probability of missingness, and its causes, Z, are in no way associated with the system we're interested in. It's not that the missingness is literally, completely random, but whatever the mechanism is, it's not part of the substantive system we're trying to model; there's no causal mechanism linking them. The second type of missing data is missing at random. This adds an additional set of assumptions. Now the missing values on Y, literacy, are not associated with the unobserved variable Z, notice there's no link between Z and Y, but they may be related to other measured or observed variables. So the probability that Y is missing, R, may depend on X; X is part of the mechanism leading to the missingness, but as long as X is in the model, with X predicting Y, the mechanism for missingness is incorporated within the modeling process, and the analysis should be unbiased because X is part of the system. It's essentially the same kind of bias you avoid by not leaving out an important predictor: if you leave out a variable that's supposed to be part of the system, the results from the model would be biased. That's the principle at play with missing at random: as long as the system is correct, the model is correct, the bias is avoided. Versus not missing at random: now we see the association between Y and R, so the probability that Y is missing depends on what the value of Y itself would have been. Participants don't complete the literacy measure because they have poor literacy skills.
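To make the three mechanisms concrete, here is a minimal simulation sketch in Python with numpy; the variable names, cutoffs, and effect sizes are hypothetical illustrations, not anything from the actual project data.

```python
import numpy as np

rng = np.random.default_rng(2024)
n = 1000

# X = completely observed predictor (e.g., an SES composite); Y = outcome (e.g., literacy)
x = rng.normal(0, 1, n)
y = 0.5 * x + rng.normal(0, 1, n)      # Y depends on X plus noise

# R = True means Y is missing for that case
r_mcar = rng.random(n) < 0.30                         # MCAR: missingness unrelated to X or Y
r_mar  = rng.random(n) < 1 / (1 + np.exp(-(x - 1)))   # MAR: probability of missingness depends on observed X
r_mnar = rng.random(n) < 1 / (1 + np.exp(-(-y - 1)))  # MNAR: probability depends on the (unobserved) Y itself

for label, r in [("MCAR", r_mcar), ("MAR", r_mar), ("MNAR", r_mnar)]:
    y_obs = y[~r]
    print(f"{label}: {r.mean():.0%} missing, mean of observed Y = {y_obs.mean():.2f} (true mean ~ 0)")
```

Notice that the complete-case mean of Y is only trustworthy under MCAR; under MAR the distortion can be removed by conditioning on X, which is exactly what the modern procedures discussed below exploit, and under MNAR it generally cannot.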
The probability of non-response is directly predicted by what their value would have been, had it been observed. That's the non-ignorable case; that's the problematic case. So, what can we tell from our data? We only have access to what we've actually measured and observed, so our hands are somewhat tied in finding empirical evidence for missing completely at random versus missing at random versus missing not at random. Okay? We can test for, and reject, missing completely at random; procedures for that have been available for decades. Missing at random is itself not testable, because we have no knowledge of what the value would have been had it been measured in the first place. We have no ability to model the missingness from the missing values themselves, because the outcome we would need isn't available to us. So we can't distinguish between missing at random (MAR) and missing not at random. Now, in the case of planned missing data, we, the experimenters, are the mechanism; we are the Z leading to the probability of missing data. As long as we act as a random process, or something that can otherwise be deemed non-systematic, as long as we're not selecting certain individuals to skip a certain assessment because they might not have filled it out completely or correctly in the first place, as long as we avoid those kinds of circumstances, we are acting as Z and the missingness is unassociated with the system itself. So we should be meeting the missing completely at random assumption. Missing data techniques, solutions for dealing with missing data: most assume missing at random or missing completely at random. Missing at random is the typical assumption; everything works much better when you have completely at random, because the model can be simplified. The traditional techniques are pairwise and listwise deletion; mean or other-value substitution, where you put a value in; regression models for predicting what the observation might have been had it been observed; and some stochastic procedures that add randomness or uncertainty to that prediction. The modern techniques are full information maximum likelihood and multiple imputation. Those are the two big categories of missing data procedures, and thanks to the statisticians who developed those fields, that's what makes it feasible for us to allow planned missingness and account for it in our analyses. So I'll talk briefly now about some of the classic approaches, just to make sure we're all on the same page, and then I'll provide an answer to that question about the completeness of X. Listwise deletion: if our overall goal for this talk is to reduce the amount of information we have to collect but still make a valid inference about the phenomenon we're studying, then, assuming we are so overpowered and have such a large sample size in the first place, listwise deletion actually isn't a bad idea. What happens is that if a case has a single missing data point, you delete the entire case. If you can afford to lose participants, you're in a good situation. But, as I already said about strapped resources, our sample probably isn't as big as we would have liked in the first place; we don't want to toss anything. We don't want to give up any pieces of information.
So all of those whole numbers, not the periods, that now have a line through them, that's wasted information we would really like to keep hold of. That leads us to pairwise deletion. Pairwise deletion would have you, in a bivariate, two-variable situation, remove a case only for the purpose of estimating that particular correlation or covariance if the case is missing one or the other observation. Here you can see an example: the upper case is missing the BADL variable but has MMSE; they're missing one of the two, so the correlation between those two boxed variables would be calculated without those two participants. If you take any other pairing of two variables, the correlation would potentially be estimated on different data, different amounts of data. This tends to lead to a problem in multivariate statistics that we refer to as a non-positive-definite matrix, which is basically the kiss of death for most multivariate analytic procedures. Pairwise deletion is generally not a good idea, although it does allow us to keep more of the data, whereas listwise deletion would have eliminated case six and case fifteen right off the bat. So it seems like a good idea, but it leads to computational problems. So, okay, mean substitution: instead of getting rid of cases with missing data, let's put something in. The two most common procedures are substituting the sample mean, the mean of the column, in place of the missing data, or the mean for that person. Say it's item-level response data on a consistent, unidimensional scale: you could find how that person typically responds and put in their person mean, case-by-case substitution. We'd like our distribution to look like the one on the lower left, but with mean substitution we end up with the one on the lower right. The means are unbiased, so the parameter estimates tend to be unbiased, but the variance components, and thus our standard errors, get drastically distorted, and that leads to Type I error. There's far less variance on the right than on the left, so we typically see an increase in the Type I error rate with mean substitution. That leads us to the modern, model-based approaches: multiple imputation and full information maximum likelihood. You'll see both of these throughout the literature. Multiple imputation is a multistep process, whereas full information maximum likelihood is a simultaneous one. Much of the multiple imputation tradition comes out of survey research, where a big battery of assessments is given and plausible values are generated when missing data occur. There are multiple imputations; the tradition is m, the number of imputations or parallel datasets created, of five to ten. So five to ten parallel complete datasets are created, whatever model is intended is run on each of those datasets using whatever procedure you want, whether it's a mixed model, a structural model, or a traditional GLM-type procedure, and then those parallel results are recombined into a single report, a single analysis. There's some user burden in multiple imputation.
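For that recombination step, the pooling arithmetic usually attributed to Rubin is straightforward. Here is a minimal sketch in Python; the five estimates and standard errors are made-up numbers just to show the calculation, not results from any real analysis.

```python
import numpy as np

# Hypothetical results from m = 5 imputed datasets:
# one regression coefficient and its standard error from each analysis
estimates = np.array([0.42, 0.47, 0.39, 0.45, 0.44])
std_errors = np.array([0.10, 0.11, 0.10, 0.12, 0.10])
m = len(estimates)

q_bar = estimates.mean()            # pooled point estimate
u_bar = (std_errors ** 2).mean()    # average within-imputation variance
b = estimates.var(ddof=1)           # between-imputation variance
t = u_bar + (1 + 1 / m) * b         # total variance (Rubin's rules)

print(f"pooled estimate = {q_bar:.3f}, pooled SE = {np.sqrt(t):.3f}")
```

The between-imputation piece is what carries the extra uncertainty due to the missing data; if the imputations disagree a lot, the pooled standard error grows accordingly.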
I'll throw a caution in here: some recent work by John Graham and colleagues suggests that m = 5 may not be enough; in fact, the simulations presented in that Graham paper show you may need as many as 100 imputations to achieve the same statistical power as full information maximum likelihood. Full information maximum likelihood is almost the default procedure now; it's available in any statistical framework that uses maximum likelihood, so the whole class of latent variable models, structural equation modeling, IRT, latent class analysis, and so on, and all the mixed model, multilevel model, or HLM-type programs; so the program HLM, PROC MIXED, Mplus, any of those: when you're using maximum likelihood, FIML is the default. Full information maximum likelihood is conditional on endogenous variables, which means its ability to account for missing data only extends to variables that are endogenous, that have predictive arrows coming into them. X has to be complete. The conceptual idea is that for the subgroup of participants where you have both X and Y, there's some knowledge about the association, and that knowledge, given their value of X, is used to imply what the sufficient statistics on Y would have been had the full data been observed. So, in a way, you have to know the value of X for each person, both those with complete data and those with missing data, so you can match up: had someone had that same X value, what would their Y value have been? X has to be complete for the FIML approaches. There are some tricks to make everything endogenous, to make everything a Y variable within the framework. Multiple imputation isn't a model-based procedure in the same way, so all variables are fair game to have a value imputed for them. It's through these two procedures that we're able either to obtain complete data despite planned missingness, or to model the fact that data were missing in the first place, intentionally not collected, and still estimate the sufficient statistics: means, variances, covariances, whatever the distributional assumptions might be for the given situation. If we can make those assumptions, we can continue on with the modeling process. That's my primer on missing data and its procedures. There is so much more to say, and I would recommend having Craig Enders from Arizona State come in if you want further information. He has a recent book on the topic; it's very approachable, very readable, and he gives a really good presentation. Little and Rubin have the classic book, which is now in its second edition, and there are a number of recent articles in the peer-reviewed literature, handbook chapters, and so on. These are some of the ones I find useful, the ones I tend to go to when I have to refresh myself on something. I would encourage any of you to pursue them. The Biometrika piece may not be the most readable, but the Psychological Methods and Structural Equation Modeling journal articles are fairly readable. All right. Now that I've spent a few minutes on missing data and some of the solutions for dealing with it, I'm going to talk about the designs that are really the focal point of this talk. The first motivating context I'll use is the Language and Reading Research Consortium, which we call LARRC. This is the Ohio State-led project under the Reading for Understanding initiative; Laura Justice, in the room, is our PI.
It's a consortium of five universities: Arizona State, the University of Kansas, Lancaster University, and the University of Nebraska-Lincoln, in addition to our leadership, of course, at Ohio State. In particular, I'm going to use our assessment panel as my motivating context for this piece. In our assessment panel, last year, in year one, we recruited a panel of preschool through third-grade students. We had some cross-sectional aims we wanted to answer based on that year one data, but we also have some longitudinal questions we want to answer later on, plus a need to inform our intervention studies. We originally proposed an ambitious sample size, 1,200 participants, to be collected in year one, with the majority of those followed for as long as we could keep them; we're going to have some attrition, but as long as we can keep them, up through the completion of third grade. So last year's third graders are one and done; they're out of the study. Last year's second graders are assessed again this year. Last year's first graders will be assessed this year and next. The preschoolers will be assessed for all five years of the study. It's actually a modified cohort sequential design; I'll talk a little more about cohort designs in a moment. I say it's modified in that we're kicking them out after third grade. In a true cohort design, we would continue to track all five cohorts for all five years and we'd have an extended assessment or developmental period that we could speak to. Well, we originally proposed about a four-hour battery, which quickly expanded to six, then six to seven hours; it got really big really fast by the time we put in all the measures we really needed and by the time we got our hands dirty and found out how long it really took to assess the kids. Then we found out that, in some cases, we didn't have the logistical capacity to assess all 1,200 kids for all of those hours. We had to go outside the school day, we had to expand our testing window; we had a lot of data we wanted to get, but we had some logistical constraints. So that led us to start thinking about this idea of planned missing designs. Could we cut some of the measures, or cut some of the measures for some participants, to reduce the assessment battery? Ultimately, what we wanted to be able to do in our cross-sectional analysis, and this is an early version of the model, please don't take anything inferential from it, was test a model for each of the preschool through third grades, and we want to be able to say something about how parameters within that model change as a function of development, as a function of grade. You can see a lightly circled coefficient in the upper right region of each of those diagrams. We want to be able to test whether the relationship between those two constructs changes as we move from first grade to second to third, and so on. So we need a sufficient sample size in each of the grades. We also need sufficient representation longitudinally; we need to preserve some of this information over the five-year period. We have a lot of latent variables, so if we want to do good latent variable modeling, we need at least three indicators per construct for identification purposes. Yes, we can get away with two, but we shouldn't plan to do that; that's just what happens sometimes in reality.
And in some cases we have four or five measures; we want to be able to measure some of these constructs really well. Well, once we realized the magnitude of the problem, some of the sites were already underway; some thought maybe they could collect the full sample, others not. We couldn't look to a simple measurement solution. We didn't want to have some sites with complete data and some sites with missing data; we're very protocol-based, and we want to be as experimental with this as possible. So we ended up not doing a missing data design, but the situation forced us to consider a number of these issues. A number of our measures are experimental, or they're being used in new contexts, whether for earlier grades than they were originally developed for, or older grades. We also have an ELL sample that we want to be able to compare to our primary sample, so we have to look at invariance issues. So especially for some of these experimental measures, measures we don't know as much about as we'd like, we really need to preserve information; we need as much information as possible. We don't want to end up in a situation where we have to determine the psychometric properties of these measures and then not have the data to do so. It's also complex sampling: kids within classrooms within schools across four sites, spread across the U.S. We have a lot of between-group comparisons to make, so it became really important that we preserve our sample sizes and preserve the breadth and richness of the data we intentionally wanted to collect in the first place. Luckily, we felt we were overpowered to begin with, by making use of the accelerated, cohort sequential design that we had planned in the first place. So we elected to maximize our between-measure information, capitalizing on some of our previous, intentional design decisions and some oversampling. We did, as I said, consider dropping some measures; we ended up deciding to reduce the sample. Using a cohort design, if we ultimately want to be able to say something both cross-sectionally early in the study, in years one and two, and longitudinally over the five years, we can look at how the data accumulate. As long as we can preserve most of that preschool sample, the roughly 400 participants recruited in the initial sample, and carry them forward, we're going to end up with between 400 and 600 data points at each of those five grade levels by the time we're done. So some of our cross-sectional aims aren't necessarily answerable down to a small effect size in year one, but if we allow a second year of data to accumulate, by about this time next year we'll be able to say something; hopefully within the next three or four months, as our data continue to get cleaned and rolled out, and certainly a year from now, we should be able to say something pretty conclusive about those cross-sectional aims. Plus we'll be accumulating longitudinal data and, once we get some overlapping information, we'll hopefully be able to say some really interesting things about development. I didn't really think about it initially, but if you take the chart from the IES website showing the overlap in the panels across the RFU projects, not all RFU panels are assessing all grades. There's intentional overlap. The whole RFU enterprise itself is an accelerated longitudinal design.
I'll use that as a segue to talk about a very specific set of planned missing data designs, one that has good familiarity for this audience. Accelerated designs are also referred to as convergent designs, cross-sequential designs, cohort sequential designs, or accelerated longitudinal designs. They have a 50-plus year history and are widespread in developmental applications; my primary exposure to them is more on the gerontology end than early childhood development. What does the term accelerated mean here? A cohort sequential or accelerated design is one characterized by overlapping cohorts; you can see the three cohorts in the diagram on the slide. We recruit participants in a given year, and usually those groups of participants span different grades or ages, different areas of the developmental spectrum; in this case, let's say this year's kindergarten, first grade, and second grade. We track them for a limited number of measurement occasions. But because there's some linking available, some overlap in the design, this year's kindergartners will be first graders next year, and this year's first graders are first graders now; we can link by the fact that both will have experienced the first-grade phenomenon, and likewise for the second graders, third graders, and so on. Where you see a G in the diagram, that's data that's intentionally collected, assuming no attrition, no dropout, of course. The three shaded boxes in the upper left and lower right are elements of the developmental span that were never intended to be collected in the first place: missing completely at random. The data from this type of research design can be treated as a missing data solution, not a problem. Advantages: it allows for assessment of intra-individual change and takes less time than a single-cohort longitudinal design. Here we get longitudinal data on an age range spanning five years with three years of data collection; that's really where the accelerated label comes into play. Subject attrition, testing effects, and some of the other threats to validity can be reduced because the temporal burden on the participants is reduced. The longer a study runs, the greater the risk of attrition and of cumulative testing effects, especially if the repeated observations are close together in time, so the design itself has some nice safeguards built in. Applications: basically anything in a longitudinal context. It may require a relatively large sample size; it may require more information collected in each of those group panels, so more kindergartners, more first graders, more second graders, but you're not tracking them for as long a period of time, so it's a cost-benefit analysis. Does it cost more to recruit an extra 15-20 kids a year, or does it cost more to go into the schools for a fourth and a fifth year? There will be some economic tradeoffs there to address. There are no universal sample size recommendations; the anecdotal literature typically mentions numbers around 150 per cohort. It also partly depends on the analytic method: if you're using maximum likelihood and some of the more advanced estimation frameworks, you need larger sample sizes for the validity of the inference. And there has to be a sufficient degree of overlap.
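A cohort sequential layout like the one on the slide is easy to see in data form. Here is a minimal, hypothetical sketch in Python of three cohorts followed for three collection years, spanning grades K through 4, with the planned-missing grade cells left as dots; the cohort labels and spans are illustrative, not the LARRC design itself.

```python
# "G" marks an intentionally collected cell; "." marks data never intended to be
# collected (missing completely at random by design).
cohorts = {"Cohort 1 (starts in K)": 0, "Cohort 2 (starts in 1st)": 1, "Cohort 3 (starts in 2nd)": 2}
grades = ["K", "1st", "2nd", "3rd", "4th"]
years_of_collection = 3

print(f"{'':<26}" + "".join(f"{g:>6}" for g in grades))
for name, start in cohorts.items():
    row = ["G" if start <= g < start + years_of_collection else "." for g in range(len(grades))]
    print(f"{name:<26}" + "".join(f"{c:>6}" for c in row))
```

Analytically, those dot cells can simply be left in the dataset as missing values and handled by FIML or multiple imputation, which is the "treat it as a missing data design" option I'll mention in a moment.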
So there need to be at least two points of overlap to test for differences in a linear slope between adjacent groups, and more than two if you're going to look at anything curvilinear: quadratics, cubics, and so on. You need at least two points of overlap just to make the linear function work. There are several analytic models commonplace in the applied statistics literature that can handle this: the original multiple-group SEM approach, where each cohort group has its own growth model and the commonalities across groups are constrained to be equal; treating it as a panel study, time one, two, three, regardless of the developmental starting point, where, in this case, age as the developmental starting point is included as a covariate; treating it as a missing data design, so if you go back a couple of slides to those corners with the intentional missing data, you allow the missingness to stay in the data itself and account for it that way; individually varying time point approaches out of the mixed model literature; or treating it as a random coefficients model, as an econometrician might do. So we're starting to build up this notion of planned missingness. The accelerated designs are just one case of an efficiency of measurement design: we're trying to accelerate our ability to assess along the developmental process using fewer years' worth of data, while still getting the same bang for our buck, so to speak. Graham, Taylor, Olchowski, and Cumsille, in a 2006 article, summarized several of these more broadly inclusive efficiency of measurement designs. Random sampling itself is an efficiency design, the simplest case. But they also bring up the notion of optimal designs. This isn't necessarily Raudenbush's Optimal Design power analysis program, although there is a component in that program that lets you include financial elements, so you can do a power analysis conditional not just on effect size and variance estimates but also on the cost per unit, whether the unit is the classroom or the participant; that's literally out of this optimal design literature. The attempt is to balance the cost of design decisions against statistical power. Fractional factorial designs: Box, Hunter, and Hunter is one resource on that type of design, but the literature goes back a bit further. Instead of using a full, fully crossed factorial design, the elements of the factorial design that are of most interest are selected. Which is actually not so different from adaptive testing: focusing information, focusing resources, on the area of inference that is of most interest to you. I'm trying to make a lot of allusions, a lot of foreshadowing of later concepts, so I'm priming you now for adaptive testing. Then there are classic measurement models that also fall under this efficiency of measurement umbrella. The originator was probably simple matrix sampling, with Shoemaker as a citation; in the upper right corner is an example of that type of design, where you have a set of participants and each is assigned a different form of the assessment battery, with each form containing a different block of items. So not all participants get all assessments, but as you can see from the diagonal elements here, the ones, there's no overlap. It's good for means, but it doesn't allow for any correlations or covariances outside of correlations within a block.
Whether a block is a single item A or a set of items, A, you'll be able to say something about the correlations among items within that block, but you can't say anything about the association between A and B. The fractional block design allows means and some correlations; the lower right diagram is an example of that. Now the squares are where assessments are given and the circles are the absence of assessment. This diagram comes from Jack McArdle; it allows for means again and for some correlations, and his particular approach requires a multiple-group SEM application. It's been generalized to the notion of balanced incomplete blocks, so depending on the degree of overlap, some correlation information is available. But in both of these cases, the fractional block design and the balanced incomplete block design, in terms of the missing data solutions that are available, hopefully it's apparent, based on the pairwise deletion illustration earlier, that there's potential for some algebraic problems, some matrix problems, in terms of incompatible calculations. Graham and his colleagues further generalized these to what they refer to as a three-form design, and others have contributed to the development of these classes of designs. They would have you split the overall item pool or battery into four sets, called X, A, B, and C. All subjects get X; X becomes the linking, the anchoring information. Presumably that's the most important information, the core of your assessment battery. In IRT, or in measurement generally, we would literally call those linking items, whether it's linking across multiple forms or the notion of vertical equating: you give the kids at an earlier developmental period some items that are probably difficult for them but still applicable to the next developmental stage up, and as long as you administer the same items across those developmental ranges, you can link the information together; that's the principle of vertical equating. In the three-form design, everyone gets that common set of linking information, X, and then two out of the remaining three blocks: A and B, A and C, or B and C (a small sketch of this assignment scheme appears below). A number of hypotheses are now testable; the k(k-1)/2 on the slide refers to basically descriptive, univariate-type hypotheses: are the means or correlations different from zero? And I'll just throw in: don't forget multiplicity. Being able to test a number of additional hypotheses isn't always a good thing; we should be intentional about which ones we're trying to test, of course. A modification of this is the split questionnaire survey design: for the items in block A, do you really want all of them together? Think of basic counterbalancing effects. You want to split up the items in block A so they appear in different orders throughout the assessment battery; basic manipulations of that nature. In particular, the culminating point of that Graham article was what they refer to as two-method measurement. This builds off the notion of the common set of information, X. There are many assessment situations where we can easily administer a cheap, pencil-and-paper or computer-based instrument, maybe not as high in reliability or validity as we might like, but easy to apply. Self-reports are the perfect case; self-report measures bring their own bag of problems with them.
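Before getting to the gold-standard piece, here is the three-form assignment sketch promised above, written in Python; the sample size and block labels are hypothetical, and equal assignment probabilities are assumed just for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 300
forms = [("X", "A", "B"), ("X", "A", "C"), ("X", "B", "C")]  # the three forms

# Randomly assign each participant to one of the three forms
assigned = rng.integers(0, 3, size=n)

for i, form in enumerate(forms):
    n_form = int((assigned == i).sum())
    print(f"Form {i + 1}: blocks {form}, n = {n_form}")

# Every pairing that involves X is observed for roughly 2/3 of the sample, and each
# A-B, A-C, or B-C pair is observed for the roughly 1/3 assigned to the form containing
# both blocks, so all pairwise covariances remain estimable (unlike simple matrix sampling).
```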
But in parallel to those cheap, inexpensive, easy-to-administer measures that we like to use, there may be a gold standard, a really effective, reliable, precise measure that's just too expensive or too time-consuming to give to everybody. A good example is biological markers. Graham and colleagues used an example from smoking cessation: a really good measure of whether someone smokes can be taken from analysis of, say, saliva or blood work. It's really expensive, time-consuming, and hard to do on a full sample, but it's a whole lot more reliable than asking, "Hey, did you smoke or not?" You're going to get all kinds of response biases to that, but it's really easy to just ask. So what they proposed is a two-method approach: get information using that really strong, valid, reliable measure on some of the participants, and get the cheap, easy-to-acquire information on the rest. By having that overlap, in a construct situation like the one in the lower left, you're able to maximize the information. They found that even compared with the three-form model, this two-method design, assuming one of the two sets of information, in this case the biological markers, the saliva measures, has sufficient strength and quality, can achieve better psychometric precision with just two blocks than the three-block cases. These are all variations on the idea of providing some information to some participants, assessing some information on some, and not assessing it on others. The last example or classification I'll talk about in this category of assessing everybody, but not necessarily on everything, is computerized adaptive testing. This is what Susan Embretson spoke on at the grantee meetings a month ago, which I've now added in and expanded on for my talk. Adaptive testing administers the items that are most appropriate for a given ability level. Again, I foreshadowed earlier this notion of fractional factorial designs: if some pieces of the inference space are more important than others, why not focus information in those areas rather than collecting what may be extraneous, redundant, unnecessary information? So, for example, for higher-ability examinees, why give them the easy questions you know they're going to get right? Why not concentrate their effort on answering the harder questions that are more appropriate, more challenging, and will probably provide more discriminating ability among that subset of the sample? Items essentially become weighted according to their difficulty, and through that weighting, scores become comparable even when participants don't get the same set of items. Even in cases of complete non-overlap in the assessment battery, because we've made some assumptions about the difficulty or the characteristics of the items, we can still make comparisons among participants. Adaptive testing can often achieve the precision of a fixed-length test using as few as half of the original fixed-length test's items. This is all made possible through item response theory, IRT, which is basically model-based measurement. Here's the formula at the bottom of the slide; this is an example of what's referred to as the 2PL, the two-parameter logistic item response model. It's basically the categorical version of a common measurement model.
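For reference, the standard form of the 2PL is:

```latex
P(y_{ij} = 1 \mid \theta_i) = \frac{1}{1 + \exp\!\left[-a_j\left(\theta_i - b_j\right)\right]}
```

where theta_i is person i's standing on the construct, b_j is the difficulty of item j, and a_j is its discrimination, which are exactly the two parameters described next.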
So we assume there is some construct, theta, and there are two additional parameters describing the relationship between the item and the construct: the b and the a parameters. The a parameter is referred to as the discrimination parameter, and b is the difficulty parameter. It's the difficulty of the item; in a sense, it's the mean response to the item, and the item-total correlation would parallel the discrimination parameter, by analogy to classical test theory. These are example item response functions from a number of hypothetical items. The ogive, the S shape, is due to the nature of the item responses: traditionally in item response theory, the outcome is correct or incorrect. It's a dichotomous response, and we're trying to predict the probability that someone gets an item correct, gets a one versus a zero. Probability is bounded by zero and one, and that's what forces the curvilinearity of the response function. If this were a Likert response, or some assumed-continuous item response, then it would be a straight line, and values both below the current zero and above the current one could be plausible. Each of these items differs in its location. If you take the inflection point, the point where the curve stops accelerating and starts decelerating, which corresponds to a probability of .5, and you draw a line from .5 over to the curve and then down, that's the difficulty, the location of the item. So if we look at the dark shaded item here, its .5 anchor sits at a difficulty, an ability score or level, of zero. Zero means average, not absence of the trait. So a participant of average ability, of average theta, whatever the construct might be, whether it's an ability or some other trait or characteristic, a person at that average level has a .5 probability of getting that item right. That means the item is appropriate for a person at that ability level. In adaptive tests, items are selected so that the majority of the items are appropriate for the examinee; relative to the examinee's current ability estimate, they essentially all have difficulty values near zero. You minimize the easy items, the ones they should get right anyway, and you minimize the difficult items, the ones they had no chance of getting. So you can look at the location, left to right, in terms of items differing in their difficulty, and you can look at the slope at the inflection point as the discrimination parameter: the relationship between the construct on the X-axis and the likelihood of a response. That's the item-total correlation, so to speak, from the classical perspective. So what do we know about the strength, the steepness, of a slope? A steeper slope means a higher correlation, more information provided by the item in terms of measuring the construct, higher discrimination, an increased ability to effectively rank-order participants on that continuum. We can convert those item response functions into information functions. So for each of the three ogive shapes you see, you should now also see three corresponding humps, where the peak of each hill is situated at the inflection point of the corresponding curve. So again, the dark shaded one, which looks like a normal distribution there in the forefront, corresponds to the previous item that had average difficulty, that was located at the zero point.
Look at that item response function's curve, its steepness, relative, say, to the one immediately to its left. The one to the left has a steeper slope at the inflection point; there's higher discrimination, so its corresponding information function, the small dashed hill there, has a higher peak; there's more information because it has a higher discrimination value, a higher association. But if you look at the breadth, the width, of that item information function, it has more information, but over a narrower range of the ability distribution than the one with the less steep slope and the wider range. And the item response function to the right has the least steep slope, and you can see its corresponding information function; it's basically a Nebraska mountain out there: not a lot of information, but what little it has is distributed over the broadest range. So what we'd like in a well-conceived test is a range of difficulty, a spread of item response functions from left to right, that all have comparable slopes, as steep as possible, as high a discrimination as possible, so that the corresponding test information function, and this one is just based on the previous items, so this isn't an ideal test, looks like a plateau: high information over as broad a range as possible. That corresponds, then, with a standard error of measurement: the dark line is the test information function, and the dashed line is its inverse, the standard error of measurement. At the point where the information is localized and concentrated at its peak, the standard error of measurement, the precision for that individual, is as low as it's going to be, given this assessment. When you get out to the high-ability or low-ability range, information is less, because there are usually fewer items appropriate for very high or very low ability, so information decreases, measurement precision decreases, and that standard error is larger than it would be otherwise. What an adaptive test tries to do is maximize the peak of that information function over the individual's true ability level, so it would have a high, narrow peak, very similar to the tallest peak here. Adaptive tests work by making some initial assumptions about participants. We assume that basically everyone is average, so you get a moderately difficult item first. If you miss an average item, you get an easier one the next time around, one more appropriate for a lower ability estimate. If you get it correct, you get a slightly more difficult one, and you move up. Using item response theory, because we have discrimination values and difficulty values, because we've made some assumptions or have otherwise calibrated the items, we've learned something empirical about the relationship between the items and the construct. Because we have those additional parameters in our arsenal, we can select the next item based on difficulty, rather than having to administer the entire test and then compute the classical total-score statistics. So subsequent items get tailored to the respondent's ability level.
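Putting those pieces together, here is a minimal sketch in Python of the item-selection step: compute each item's 2PL information at the current provisional ability estimate and administer the most informative one. The small item bank and the provisional theta value are hypothetical, purely for illustration.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1.0 - p)

# Hypothetical calibrated item bank: columns are (discrimination a, difficulty b)
bank = np.array([(1.8, -1.0), (1.2, 0.0), (2.0, 0.2), (0.8, 1.5), (1.5, 1.0)])

theta_hat = 0.3                                   # provisional ability estimate after a few items
info = item_information(theta_hat, bank[:, 0], bank[:, 1])
next_item = int(np.argmax(info))                  # administer the item with maximum information

print(f"information at theta = {theta_hat}: {np.round(info, 2)}")
print(f"next item to administer: item {next_item + 1}")
```

A precision-based stopping rule then just checks whether one over the square root of the accumulated test information, the standard error mentioned a moment ago, has dropped below some target.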
You continue this until the algorithm either reaches a stopping point based on some precision criterion, such as the person's standard error meeting some minimal threshold, or you've simply administered the maximum number of items, the number that would have been administered had it been, say, a fixed-length test. Here's a diagram representing the branching idea. Question one: everybody usually starts with the same item, or the same level of item. Say you get it correct; then, moving down the left-hand side, you take the item response, you update your assumptions about the individual, and you choose the next item. At each point, it branches. If you miss the first item, you get a lower-ability item, an item appropriate for someone with lower ability than what was initially assumed. Notice that on both tracks, whether you get the first item correct or miss it, you can still end up in the same eventual location; you can still reach that middle item across the bottom row. But the pathway to that item becomes more complex. That lengthens the test: your response pattern is basically more inconsistent, and so it lengthens the test, reduces the efficiency, and so on. That's the end of my section on assessing everybody, but not necessarily on all instruments, or all items in the case of adaptive testing. The other classification of designs falls within the idea that you give everything to everybody, but you don't necessarily assess, test, or intervene with everyone you might have started out intending to include. Contrast this notion of sequential designs with what we're more familiar with, the fixed experimental design. Fixed designs are typical in educational and psychological research, the social and behavioral sciences generally. A fixed design is one where the sample size and the composition, who's assigned to what group, are determined prior to conducting the experiment. We do a power analysis, figure out we need 100 participants, and 50 are assigned to the counterfactual while 50 are assigned to receive the intervention. We intervene with those 50 participants, we measure the business-as-usual or whatever the counterfactual condition is in the other set, and at the end we make our inference on a fixed sample size. In a sequential experimental design, by contrast, the sample size is treated as a random variable. We don't know what the eventual sample size is going to be, but we make a set of assumptions and lay out a protocol for the conditions under which the study can stop early. This allows for sequential analyses and decision making. The idea is that we make our inferences based on accumulating evidence, accumulating data, all the while maintaining our appropriate error rates, both Type I and Type II, so preserving statistical power while maintaining the Type I error rate. These are also referred to as adaptive or flexible designs (think of computerized adaptive testing): current design decisions are sequentially selected according to previous design points. It's kind of a Bayesian idea. It's not necessarily Bayesian statistics, but it's a Bayesian idea, this notion of prior assumptions: we collect some information, the prior is updated by the data to become the posterior, and the posterior becomes the next prior. We may form an opinion; something happens to change our opinion; we update our opinion. Then something else happens, and we update again. It's an iterative, cumulative process.
That's the principle in play with a sequential design, versus, again, the fixed design, where the composition and size are fixed: we have to collect everything we set out to collect in order to form a cumulative opinion, a cumulative inference, rather than adapting as we go. Although in principle this iterative process could continue until we reach a certain criterion, in practice an upper limit is typically set, and it's typically close to what the fixed sample size would have been in the first place. The primary benefit is that it allows for early termination of experiments. So instead of reducing, say, the size of an assessment battery, you don't collect data for as long or on as many participants; it's an early termination. From an ethical perspective, this prevents unnecessary exposure; it also prevents unnecessarily withholding the administration of something that is showing clear evidence of working. From a logistical perspective, there can be considerable financial savings; typically the savings are reported to be between 10% and 50%. Remember, in adaptive testing one of the common selling points is that you can cut a fixed-length test by half; that would be a 50% savings. The adaptive test is actually an example of a sequential design. Sequential designs have a long history; this is nothing new. It's new to the social and behavioral sciences, but it has an almost 100-year history in other disciplines. The earliest reference I've found goes back to 1929, with the double-sampling inspection procedure for industrial quality control. Mahalanobis (as in Mahalanobis distance) contributed to this with a census of the Bengal jute area; jute is basically the fiber in burlap sacks, and Bengal is in India. Where it really started taking off is in 1943, with the development of the sequential probability ratio test by Wald and other members of the Statistical Research Group at Columbia; you see a lot of familiar statistical last names in that group. I say this is the real jumping-off point because this is also where the parallel field of sequential analysis took off, primarily because of the development of the sequential probability ratio test. It turns out that the sequential probability ratio test is the stopping criterion used in adaptive testing. In 1960, Peter Armitage published what could probably be considered the resource on sequential designs in biomedical applications. And then in the '80s, we saw the rise of adaptive testing. Adaptive testing is made possible through IRT; its statistical foundation, though, is Wald's sequential probability ratio test, and the whole idea of adaptive testing actually goes back to Binet at the turn of the last century, with individualized intelligence testing. So this isn't anything new. It's new to us; it was definitely new to me. It wasn't something I was exposed to in graduate school, but it has a long, long history. Characteristics of a sequential design: there needs to be at least one interim analysis. I guess you can think of it as one data snoop, but you have the opportunity to look at the data once and make a decision at that point in time, prior to the formal, fixed completion of the experiment. But there's a protocol for that, so the decisions you can make at that one interim analysis are predetermined, and the criteria leading to those decisions are determined in advance.
So you have to figure out how many times you're going to look (how many interim analyses); how much information you need at each stage (what's the n at each stage: are you going to look halfway through, every ten participants, or after every participant who completes the study); what your nominal alpha and beta levels are (how highly you want to be powered, and what error rate: are you going to control at the 0.05 level, 0.01, or something even more stringent); and then we determine the critical values. Think of a simple t-test with a normal distribution: the upper and lower regions of rejection, plus or minus 1.96, are the boundary values. So you have to determine what the boundary values are going to be each time you make an interim analysis. All available data are analyzed at each stage, so if you're going to look every 10 participants, the first interim look is based on the first 10; the second interim look is based on the first 20, because it's the first 10 plus an additional 10; and you continue forward. At each stage, the appropriate test statistic is calculated. The Fisher information level (which is just the inverse of the squared standard error) is calculated; the standard error is basically the denominator for your inferential test. That test statistic is compared against a critical value, so traditional hypothesis testing. You check whether the test statistic falls within a decision region, and the decision regions, as you'll see in a moment, get a little more complicated. Decision regions aren't just regions of rejection; there are also regions of futility, a central region where, basically, the effect isn't going anywhere and you might as well give up. Or, if the statistic is somewhere in between, you keep going: not enough information, keep going until you do get enough or you have to stop. Here are some examples of boundary plots, so this is kind of the power-analysis stage of a sequential design. The diagrams on the left are just meant to illustrate the overlap or non-overlap of the different boundary plot methods; there are different procedures employed in the field of sequential design for determining what the boundary values are. So if you look at any of the four panels, you have the white area, and then you have the first vertical line; that's the first look. The point at the boundary of the blue area is your critical value; it's not exactly 1.96 or 1.64, but it plays the same role. Notice that as data accumulate, as you go from the beginning to what eventually is the fixed sample size, the criterion by which you make your decision becomes, I guess you could call it, more liberal. So if you're going to make a decision early, you have a much higher standard that you have to meet, but that standard is relaxed as you accumulate more information. In this figure, the top two panels are one-tailed tests and the bottom two are two-tailed tests. Early on, at that first vertical line, you have three decisions you can make. If the test statistic falls in the white area: keep going, not enough information. If it falls in the dark area at the top, you've found statistically supported evidence that the effect is likely real given that sample, so you reject the null hypothesis. If it falls in the lower area: thanks for trying. It's not the region of rejection, it's the region of futility.
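To make that strict-early, relaxed-later pattern concrete, here is a small simulation sketch of O'Brien-Fleming-style efficacy boundaries for four equally spaced looks, with the constant chosen so the overall two-sided Type I error stays at the nominal level. The function name and the four-look setup are illustrative assumptions; dedicated software computes these boundaries analytically rather than by brute-force simulation.

```python
import numpy as np

def obrien_fleming_bounds(n_looks=4, alpha=0.05, n_sim=200_000, seed=1):
    """Simulate O'Brien-Fleming-type boundaries c_k = c * sqrt(K / k) for equally
    spaced looks, choosing c so that the chance of crossing ANY boundary under
    the null (the overall two-sided Type I error) is alpha."""
    rng = np.random.default_rng(seed)
    k = np.arange(1, n_looks + 1)
    # Cumulative Z statistics under H0 with the right correlation across looks.
    z = np.cumsum(rng.standard_normal((n_sim, n_looks)), axis=1) / np.sqrt(k)
    shape = np.sqrt(n_looks / k)          # boundary shape: high early, lower late
    c = np.quantile((np.abs(z) / shape).max(axis=1), 1 - alpha)
    return c * shape

print(np.round(obrien_fleming_bounds(), 2))   # roughly [4.05, 2.86, 2.34, 2.02]
```

Notice that the last boundary sits near 2.02, close to the familiar 1.96 for a single fixed-sample test, which is exactly the convergence the boundary plots show at the far right.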
And with the two-tailed tests, you see how it just gets more complicated, because you have two tails: you have two boundaries for the region of rejection, and you also have two boundaries for the futility region in the middle. By the time you get to the far right, the initial fixed sample size has been reached, and you see that all points converge; we're back to our two decisions, you reject or you fail to reject. The only reason these plots look different is that different methods are used to determine the boundary values, and that is an area of research within this field. There are three general types of sequential designs: a fully sequential design, a group sequential design, or a flexible design. Fully sequential designs are continuous, with a decision point after every observation; adaptive testing, where each item is adaptively selected, is an example of a fully sequential design. If instead a testlet is administered, so let's say a block of ten items is administered, scores are calculated, and then the next set of 10 is based on performance on the previous set of 10, that's an example of a group sequential design, where instead of looking after every observation you look after a set: every 10, every 25, those kinds of considerations. The flexible designs are a compromise between the two. Limitations: clearly, there's an increase in design complexity. You pretty much have to get a methodologist or a statistician, probably somebody with some background in biostatistics at this point, to collaborate on what should already be an interdisciplinary proposal. There are increased computational burdens, but just as there's an app for that, there's a procedure for this: SAS has two sequential procedures that make this very, very feasible, one that determines the boundary values and then an analytic procedure that tests the statistic while controlling the error rate. There are threats to validity due to this early-termination idea. If you terminate early, whatever the reason, whether it's efficacy, futility, or safety because of risk to the participants, that small sample size can lead to distrust. There could be some assumption problems, depending upon the analytic method, the inferential method, especially if you're using maximum likelihood; maximum-likelihood principles are asymptotic, so it works a whole lot better with large samples. And oftentimes, the early-termination decision is more complex than just that statistical criterion. We don't usually measure just a single outcome in the social sciences; we usually have a full battery of outcomes. What happens when one variable shows early termination and another variable doesn't? You have decisions to make. Do you just stop collecting data on that one outcome that you've shown evidence for, or do you have to continue on with everything? What if that one measure is an indicator for a construct, and the rest of the construct isn't done yet? You have to make those types of decisions: what's primary, what's secondary, etc. So, back to the substantive context, and we'll be pretty close to wrapping up here. Based on the CBC in the Early Grades project (and that citation should actually be '08, not '11), we completed a four-cohort, fixed-design, cluster-randomized trial to evaluate the effect of the CBC intervention for students with challenging classroom behavior. We had data from 22 schools, 90 classrooms (and an equivalent number of teachers), and 207 kindergarten through third-grade students and their parents.
Student-parent dyads, within a teacher, were randomly assigned to one of two conditions: a control, business-as-usual condition, or the CBC condition. Assignment to condition was at the teacher, the classroom or small-group, level; that makes it a cluster-randomized trial. The study was proposed and designed to detect a medium standardized effect of approximately 0.38. That told us we needed a sample size of 270 children in 90 classrooms, with an assumption of three kids per class. We didn't end up getting a full three kids in some of the classes, but we pressed on until we got all 90 classrooms. That took us into a no-cost extension year, but we were able to do that. Because assignment to condition is at the classroom level, it's not the 270 kids that drive the sample size, it's the 90 classrooms. That's why we had to push on there, and why we didn't go after the remaining 63 child participants. Well, I took that completed study and did a methodological piece: presented it at the IES Research Conference, presented it at APA, and we're currently working through the manuscript and hope to get that out by the end of the year. I basically implemented a post hoc application of this sequential design analysis strategy. I treated cohorts as the groups: we ended up collecting data over four years, and the intervention itself is contained within a 12-week period, so it can be contained within a calendar or academic year. We collected four cohorts' worth of data, and I treat each cohort as a group. I'm going to assume that the eventual decision we made based on the fixed sample is the true finding, and I'm trying to see whether I could have reached that same finding earlier. So what's the degree to which sample-size savings might have been realized if we'd implemented this as a group sequential design from the start? Everything was implementable in SAS. PROC SEQDESIGN is basically a power-analysis procedure. I used GLIMMIX as my analytic model, as I would have done in a fixed design anyway, and as I did; that paper is almost in press. PROC SEQTEST then integrates the information: it takes the information from GLIMMIX, conditional upon the boundary values set up in the sequential design, and reports back. Basically, the figures that you saw earlier with the boundary values, where you might have seen some dots, those dots were the hypothesis tests from GLIMMIX placed in the context of the sequential design; they were output from the sequential test procedure. So here we see those again, a little bit bigger. Now, we have several measures, both parent report and teacher report, and that sets up part of the conundrum of, "When do you stop what?" Because we're looking at a number of outcomes, and not everything leads to the same conclusion. Here, this is our adaptive skills measure from the BASC. On the left are the parent reports; on the right are the teacher reports. On the left you can see the dots progressed very quickly into the lightly shaded area (the region of futility). Parents' perceptions of changes in their children's adaptive skills were not impacted by the intervention. Well, look at the teachers: all of their data points, from the start, after the first 25 classrooms in the first cohort, clearly showed an effect. Parents aren't showing it, teachers do. Then parent report versus teacher report on the externalizing behaviors score from the BASC: again, we see the same general phenomenon. Parents don't get it, teachers do.
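The actual reanalysis ran through SAS (PROC SEQDESIGN for the boundaries, PROC GLIMMIX for the model, PROC SEQTEST for the monitoring), but the cohort-by-cohort logic is easy to sketch. The fragment below uses made-up boundary values and made-up cumulative Z statistics, not the CBC estimates, purely to show how each interim look is classified as stop-for-efficacy, stop-for-futility, or keep-going.

```python
def sequential_decision(cum_z, efficacy, futility):
    """Walk the interim looks in order and return the first stopping decision.
    cum_z:    cumulative Z statistics, one per completed cohort (look)
    efficacy: outer critical values per look (reject H0 if |Z| is at or above)
    futility: inner bounds per look (stop for futility if |Z| is at or below)"""
    for look, z in enumerate(cum_z, start=1):
        if abs(z) >= efficacy[look - 1]:
            return look, "stop: efficacy (reject H0)"
        if abs(z) <= futility[look - 1]:
            return look, "stop: futility"
    return len(cum_z), "final look reached: fail to reject H0"

# Illustrative four-look boundaries: strict early efficacy bounds plus futility
# bounds that converge to the efficacy bound at the final look.
efficacy = [4.05, 2.86, 2.34, 2.02]
futility = [0.00, 0.40, 0.90, 2.02]
print(sequential_decision([0.8, 0.3, 0.2, 0.1], efficacy, futility))
# -> stops for futility at the second look, like the parent-report pattern above
```

In the real reanalysis, each of those cumulative statistics came from refitting the GLIMMIX model on all cohorts collected up to that point, as described above, and PROC SEQTEST is what placed them against the boundaries.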
But here we see that for parents, initially it's in the region of "keep going"; it's in the white area after the first cohort. After the second cohort, though, it moves into, and then stays in, the region of futility. We could have stopped collecting data on this measure after the second cohort for the parent responses. On the teacher report, we eventually were able to reject the null hypothesis, but we needed to go all the way into the fourth cohort to do so. So already we've seen four different decisions that would have to be negotiated. Looking at the parent/teacher relationship measure, we see a similar phenomenon in that no decision can be made after the first cohort. By the time we get to the second cohort, or actually the third cohort for the parents, it's just barely outside, right on the boundary, and it moves in at the third. But again, parents didn't get it, teachers did, with different stopping points for different measures. Then we have the social skills measure: parents, right off the bat, didn't get it, and for teachers we had to go through the third cohort to be able to reject. So when you compare those decisions from the sequential design to the fixed results, on a number of the measures, pretty much all but one, we could have stopped early. It was just that teacher externalizing measure where we had to go all the way through and get our full sample. Everything else, we could have stopped early. We basically could have stopped collecting parent information after the second cohort, because the intervention clearly wasn't working for them, at least given the sensitivity of the measures we were employing. The teacher effects we could see early for some measures, but to get the full flavor of the intervention's efficacy we had to go all the way through. There are a number of source materials here: both of my presentations, a chapter in a handbook with a grad student, and some of the other resources I've found useful. These are here for your information if you want to do some additional reading. I cite Wald and provide a citation there; despite the journal, it's fairly readable. And I'll just wrap up now with conclusions and things to think about. Like any methodological or statistical approach, there are different approaches for different questions; there's no one-size-fits-all solution to any of this. To me, in terms of the "you collect data on everybody, but not on everything" category, the question is what degree of overlap is necessary. Just as there's no one-size-fits-all approach, I think that even the designs I talked about earlier in the presentation, the matrix sampling, the three-group or two-group classes of designs, I don't know that any one of those should be universally recommended. I think you need to be conscious, for me as an experimentalist particularly, of the effect sizes. If anything, you have to think about the quality of the information you seek to obtain and the magnitude of the relationship you're trying to detect, whether it be a factor loading or a mean difference. Logically, if it's a small effect or a weak loading, you have to have more information to be able to make a reasonable, valid claim about that effect or association, so you need more data on that. Maybe it's a case of taking all of those items, or that block of measures, that have weak associations or small effect sizes between them, and placing those in the X category that everybody gets.
And then there's the big effect size, the broad side of the barn: clearly this is an indicator of the construct, clearly there's a reasonable effect size to detect here. Those are the measures that get put into categories A, B, and C that are administered to some and not to others. It needs careful consideration before you apply any of these, but the tools to be able to consider these problems are readily available to us and have been for decades. There are issues of counterbalancing: where should X occur in the battery? Again, I always fall back on my experimental roots. I just encourage everyone not to give up that experimental, that researcher, control over this. Yes, the statistics can bail you out when reality happens, to an extent. They can't solve all problems, but you can incorporate enough covariates and stratification variables and so on to get some reasonable inference. But all of this works better if it's protocol-based, if it's purposeful, if it's intentional, if it's planned out, and if you follow the script. The inference is always better. Alright, thank you very much for your time. For any follow-up questions, or any further conversations you might like to have, my contact information is available as well.