This study sought to advance implementation science by systematically developing valid, reliable, and pragmatic measures of three key implementation outcomes: acceptability (Acceptability of Intervention Measure, AIM), appropriateness (Intervention Appropriateness Measure, IAM), and feasibility (Feasibility of Intervention Measure, FIM). Substantive and discriminant content validity assessment involving implementation researchers and implementation-experienced mental health professionals indicated that most of the items that we generated reflected the conceptual content of these three implementation outcomes. Exploratory and confirmatory factor analyses produced brief 5-item scales with acceptable model fit and high reliability. Although the scales were highly correlated, nested confirmatory factor analysis models provided evidence that the three implementation outcomes are, empirically, best represented as distinguishable constructs, just as Proctor and colleagues suggest.
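As a minimal sketch of how scale reliability of this kind might be checked, the Python snippet below computes Cronbach's alpha for a set of item ratings. It assumes that reliability is summarized with Cronbach's alpha, which this section does not state. The function name `cronbach_alpha` and the ratings are hypothetical and are not study data.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of k item ratings."""
    k = len(rows[0])
    if k < 2:
        raise ValueError("need at least two items")
    # Sample variance of each item across respondents.
    item_vars = [variance([r[i] for r in rows]) for i in range(k)]
    # Sample variance of the summed scale score across respondents.
    total_var = variance([sum(r) for r in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical ratings (illustration only): 5 respondents x 4 items, 1-5 scale.
ratings = [
    [5, 5, 4, 5],
    [4, 4, 4, 4],
    [2, 3, 2, 2],
    [3, 3, 3, 4],
    [1, 2, 1, 1],
]
alpha = cronbach_alpha(ratings)
print(round(alpha, 3))
```

Alpha approaches 1 when respondents rate all items on a scale consistently, which is the pattern one would expect from short scales whose items tap a single construct.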
Scale refinement through construct-specific confirmatory factor analysis of data obtained from a vignette study involving practicing mental health counselors resulted in trimmed 4-item scales. Nested confirmatory factor analysis models provided evidence of structural validity, with the three-factor model demonstrating acceptable model fit and high scale reliability. Analysis of variance provided evidence of known-groups validity, with medium to large main effects of each manipulation on the relevant scale score. Although a survey design error precluded a full exploration of discriminant validity, the analysis of variance indicated that the newly developed measures of acceptability, appropriateness, and feasibility can differentiate groups with known differences in the levels of these implementation outcomes.
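To illustrate what a known-groups comparison involves, the sketch below computes eta-squared, one common effect-size index for a one-way analysis of variance, for two groups that received a "high" or "low" version of a manipulation. This section does not say which effect-size index was used, and the study's factorial design is simplified here to a single factor. The function name `eta_squared` and the scores are hypothetical.

```python
from statistics import mean


def eta_squared(groups):
    """Proportion of total variance in scores explained by group membership."""
    all_scores = [s for g in groups for s in g]
    grand_mean = mean(all_scores)
    ss_total = sum((s - grand_mean) ** 2 for s in all_scores)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    return ss_between / ss_total


# Hypothetical scale scores (illustration only), one list per vignette condition.
high_condition = [4.5, 4.0, 4.25, 3.75]
low_condition = [2.5, 3.0, 2.0, 2.75]
effect = eta_squared([high_condition, low_condition])
print(round(effect, 2))
```

A value near 0 would mean the scale fails to separate the groups; larger values indicate that scale scores track the manipulated level of the outcome.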
Finally, test-retest reliability and sensitivity to change were demonstrated when a subsample of mental health counselors participating in the vignette study randomly received either the same vignette or the opposite vignette and re-rated the implementation outcomes. These measurement properties are important to researchers and other stakeholders (e.g., intermediaries, policymakers, practice leaders) yet are rarely assessed. Importantly, regression analysis indicated that the implementation outcome measures were sensitive to change in both directions: high to low and low to high. This makes the measures useful for assessing the impact of planned strategies or unexpected events on practitioners’ perceptions of acceptability, appropriateness, and feasibility.
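The two properties described above can be sketched with simple descriptive statistics. The study itself used regression analysis for sensitivity to change; the snippet below substitutes a Pearson correlation for respondents who saw the same vignette twice and a mean change score for respondents who saw the opposite vignette. All function names and scores are hypothetical and are not study data.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_change(time1, time2):
    """Average within-person change from the first to the second rating."""
    return mean([b - a for a, b in zip(time1, time2)])


# Hypothetical scale scores (illustration only).
same_t1 = [4.5, 4.0, 2.0, 3.25, 1.5]    # same vignette, first rating
same_t2 = [4.25, 4.0, 2.25, 3.0, 1.75]  # same vignette, second rating
retest_r = pearson_r(same_t1, same_t2)

high_to_low = mean_change([4.5, 4.0, 4.75], [2.0, 1.75, 2.5])
low_to_high = mean_change([1.75, 2.0, 2.25], [4.0, 4.5, 4.25])
print(round(retest_r, 2), high_to_low, low_to_high)
```

A high correlation in the same-vignette group suggests stable scores, while change scores of opposite sign in the two switched groups suggest sensitivity to change in both directions.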
Contributions to implementation science and practice
This study contributes to the literature by developing valid and reliable measures of important implementation outcomes. The field of implementation science has been deemed a Tower of Babel given the lack of conceptual clarity and consistency in its terminology, and concerns about the state of measurement in the field have been well documented [3, 27, 28]. These concerns extend to implementation outcomes: despite their centrality to understanding the extent to which implementation is successful, valid and reliable measures of them are lacking. This study fills that gap by developing valid and reliable measures of three implementation outcomes that are salient to a wide range of implementation studies as well as a wide range of pilot, efficacy, and effectiveness studies. Indeed, with an increasing focus on designing for dissemination and implementation, and with some going so far as to say that all effectiveness studies should be Hybrid I effectiveness-implementation studies [30, 31], measuring constructs such as acceptability, appropriateness, and feasibility will be a ubiquitous need. These outcomes are relevant to assessing stakeholders’ perceptions of clinical and public health interventions, as well as assessing perceptions of implementation strategies, which are often complex interventions in and of themselves [1, 32, 33]. Assessing these outcomes early in the research process may ensure that interventions and implementation strategies are optimized and fit with end-users’ preferences.
In addition to the need for valid and reliable measures of implementation-related constructs, there is a need for pragmatic measures. Implementation stakeholders are unlikely to use measures that are not pragmatic, a quality that may include broad domains such as being (1) useful in informing decision-making, (2) compatible with the settings in which they are employed, (3) easy to use, and (4) acceptable. Pragmatic measures are particularly important for low-resource settings. There are several examples of recent efforts to develop pragmatic measures for implementation constructs such as organizational readiness for change, implementation leadership, and implementation climate [38, 39]. We sought to ensure that our measures were pragmatic, and we believe that we have accomplished that in three ways. First, we sought to develop measures that were brief. We began by developing 12 or fewer items for each construct, and our psychometric testing resulted in final measures with only four items per construct. Second, we made each item as general as possible by not naming a particular context or clinical problem within the items. For example, the appropriateness items did not specify a purpose (e.g., “for treating depression”), person (e.g., “for my patients”), or situation (e.g., “for my organization”); those wishing to use these measures could add such referents to explore specific aspects of appropriateness (i.e., social or technical fit). Third, we purposefully made the measures “open access” to ensure that the scales are freely available to all who might wish to use them. Our hope is that developing measures that are free, brief, easy to use, and not context- or treatment-specific will increase the chances of their broad use in implementation research and practice.
One of the unfortunate realities in implementation science is that, to date, most measures have been developed for the purpose of a single study (usually with minimal conceptual clarity and psychometric testing) and then never used again. This state of affairs precludes our ability to develop generalizable knowledge in implementation science and, more specifically, knowledge about how current measures perform across a wide range of contexts. Of course, whether the measures developed here ultimately demonstrate predictive validity within the field of mental health and other clinical settings is yet to be determined.
Finally, we have laid out a systematic process for measure development and testing that we believe is both replicable and feasible. In doing so, we stress the importance of clearly defining constructs and engaging in domain delineation processes that ensure that constructs are sufficiently differentiated from similar constructs. We also stress careful psychometric testing, and lay out a process for establishing substantive validity, discriminant content validity, structural validity, discriminant validity, known-groups validity, test-retest reliability, and sensitivity to change. The measurement development and testing process took 15 months, suggesting that the process could be completed within the period of a grant-funded implementation study. We encourage other teams to replicate this methodology and suggest further refinements that may enhance the efficiency and effectiveness of this process.
Though there were a number of strengths in the current study, there were also limitations. First, correlations among the three factors of acceptability, appropriateness, and feasibility were at times fairly high and discriminant validity was not fully tested due to a survey design error. Future research would benefit from further explorations of the discriminant validity of these constructs.
Replication is also needed. Testing the measures with samples of providers whose backgrounds and characteristics differ from those of the samples included here would yield important information about generalizability (e.g., structural invariance). Testing the measures with different methods or materials would likewise be useful.
We plan to prospectively administer our newly developed measures to a large sample of providers faced with the decision to adopt, or not adopt, an evidence-based practice (EBP). We will then evaluate whether their perceptions of the acceptability, appropriateness, and feasibility of the EBP predict their adoption of the EBP. If the predictive validity of our measures is established, researchers and practitioners would have a brief tool for assessing early on whether staff are likely to adopt an EBP or whether more work needs to be done to increase the EBP’s acceptability, appropriateness, or feasibility.
Glasgow and Riley argue that measures need to be pragmatic if they are to be useful (and used) outside the context of research. Pragmatic features of measures include sensitivity to change, brevity, psychometric strength, actionability, and relevance to stakeholders. While their list of pragmatic features is helpful, it was not developed with stakeholder input and therefore might not reflect what stakeholders view as important. To address this limitation, we are working with stakeholders to define and operationalize pragmatic features of measures, with the goal of developing rating criteria that can be used to assess the pragmatic properties of measures, much as the psychometric properties of measures are assessed. In future work, we will apply these pragmatic rating criteria to the three new measures of implementation outcomes developed here.