We next present our list of six identified reporting problems and proposed recommendations to prevent them in future work. To accompany this, we created two tools. Additional file 1 is an audit worksheet that the reader can use to assess adherence to our proposed recommendations or to plan out the inclusion of implementation outcomes in potential work. In Additional file 2, we provide exemplar articles that the reader may use as a guide and to generate ideas.
Recommendation 1: consistent term use
The 2011 paper noted widespread inconsistency in terminology for implementation outcomes and called for consistent use of language in describing key constructs. Our review revealed that this problem persists and appears in the literature in three specific ways. One way was reporting different outcomes in different manuscript sections. In one article, for example, the stated study goal in the Introduction was to assess fidelity and sustainment. However, the authors only reported on fidelity in the “Methods” and “Results” sections, never addressing results pertaining to sustainment. Whether these authors failed to distinguish between fidelity and sustainment conceptually and operationally, or whether the paper simply failed to address sustainment, the effect is the same: lack of clarity about the specific outcome being addressed. Inconsistent terminology prevents readers from knowing what construct was assessed and what exactly was learned, and both gaps impede the accrual of information across studies.
Another way that this problem appeared was using terms from the 2011 taxonomy in a new way and without explanation. While the original taxonomy invited the identification and study of additional implementation outcomes, interchangeable use of terms perpetuates confusion and impedes synthesis across studies. Examples included an article where authors reported that they were assessing fidelity but called it uptake and an article where the definition of feasibility included the term acceptability. In both cases, the authors cited the 2011 taxonomy but did not explain why the outcome terms were applied in this way.
Third, we found confusing instances of studies that merged implementation outcomes in the analysis and interpretation of results without explanation. For example, in one article, fidelity and acceptability were combined and called feasibility. In another, acceptability, feasibility, and appropriateness were combined into a variable called value. The 2011 research agenda described how implementation outcomes could be used in a “portfolio” of factors that explain implementation success. For example, implementation success could be conceptualized as a combination of treatment effectiveness, acceptability, and sustainability. However, understanding the role of implementation outcomes in mechanisms of change, including how we get to “implementation success” and what it looks like, requires precision in outcomes measurement and reporting. Until we have a stronger knowledge base, our field needs concepts to be disentangled rather than merged, absent compelling theory or evidence for combining. To address these reporting problems, our first recommendation is to clearly state each implementation outcome and provide the operational definition that the study will use. Ensure consistent use of outcome terms and operational definitions across manuscript sections, and provide an explanation if using the taxonomy in a new way or merging terms.
Recommendation 2: role in analysis
Another reporting problem is lack of specificity around how the outcome was measured relative to other constructs. This problem appeared as poor or unclear alignment between outcomes-related aims, research questions and/or hypotheses, and the reported results. One example of this was an article that aimed to examine fidelity, adoption, and cost across multiple phases of implementation. However, the authors assessed barriers to adoption instead of actual adoption and used the terms fidelity, engagement, and adoption interchangeably when reporting results on the intervention and implementation strategies. This made it difficult to assess the roles that different implementation outcomes played in the study. In another article, the authors stated that their qualitative interview guide “provided insight into” acceptability, adoption, and appropriateness of the practice of interest. However, the “Results” section did not include any information about these implementation outcomes, and they were not mentioned again until the discussion of future directions.
Our second recommendation is to specify how each implementation outcome will be or was analyzed relative to other constructs. Readers can draw upon the categories that we observed during data charting. For example, an implementation outcome may be treated as an independent, dependent, mediating, moderating, or descriptive variable. Correlations may be assessed between an implementation outcome and another implementation outcome or a contextual variable. An implementation outcome may be treated as a predictor of system or clinical outcomes, or as an outcome of a planned implementation strategy. Manuscripts that succinctly list research questions or study aims—detailing the outcome variables measured and their role in analyses—are easier to identify in literature searches, easier to digest, and contribute to the accrual of information about the attainment and effects of specific implementation outcomes.
Recommendation 3: referent
The next problem that we observed is difficulty identifying what “thing” the implementation outcome is referring to. For example, in one article that examined both an intervention and an implementation strategy, the aims referred to feasibility and acceptability of the intervention. However, the “Results” section only reported on intervention acceptability, and the “Discussion” section mentioned that acceptability and feasibility of the intervention using the implementation strategy were assessed. This example illustrates how study conclusions can be confusing when the implementation outcome referent is unclear. Another study compared different training approaches for promoting fidelity within a process improvement study. However, we were unable to discern whether fidelity was referring to the process improvement model, the training approaches, or both. As a result, it was difficult to assess which body of fidelity literature these findings pertained to.
As such, our third recommendation is to specify “the thing” that each implementation outcome will be measured in relation to. This requires a thorough review of all manuscript sections and can be especially important if you are concurrently studying interventions and strategies (e.g., in a hybrid study), or if you are studying interventions and strategies that have multiple components of interest. Coding options for “the thing” in our scoping review included screening, assessment, or diagnostic procedures (e.g., X-rays), one manualized treatment, program, or intervention (e.g., trauma-focused cognitive behavioral therapy), or multiple manualized interventions that are simultaneously implemented. We also observed that “the thing” may refer to research evidence or guidelines. It could be an administrative intervention (e.g., billing system, care coordination strategy, supervision approach), a policy, technology (e.g., health information technology, health app, data system), a form of outcome monitoring (e.g., measurement-based care for individual clients), data systems, indicators, or monitoring systems. Finally, “the thing” that an outcome is being measured in relation to may be a clinical pathway or service cascade intervention (e.g., screening, referral, treatment type of program).
Recommendation 4: data source and measurement
The fourth reporting problem is lack of detail around how the implementation outcome was measured, including what data were used. For instance, some studies drew upon participant recruitment and retention information to reflect feasibility without describing the way this information was obtained or recorded. The “Methods” section of another article stated that “project records” were used to assess fidelity without providing additional detail. Another example is an article in which the “Measures” section stated that survey items were created by the study team based on the 2011 taxonomy, including feasibility, acceptability, and sustainability. However, appropriateness was the only one that was clearly operationalized in the “Methods” section. Lack of information about data source and measurement limits transparency, the ability to understand the strengths and limitations of different measurement approaches for implementation outcomes, and replication.
To address this, our fourth recommendation is twofold. The first element of our recommendation is to report who provided data and the level at which data were collected for each implementation outcome. The reader can consider the following categories when reporting this information for their studies. Possible options for who reported the data for an implementation outcome include client/patient, individual provider, supervisor/middle manager, administrator/executive leader, policymaker, or another external partner. Possible options for the level at which data were collected include individual/self, team/peers, organization, or larger system environment and community. We found that implementation outcomes studies often drew upon multiple levels of data (see also “Recommendation 6: unit of analysis vs. unit of observation”). Furthermore, the level at which data were collected and the level at which data were reported may not be the same (e.g., individual providers reporting on an organizational level implementation outcome variable). To address this, the second element of our recommendation is to report what type of data was or will be collected and used to assess each implementation outcome. Data may be quantitative, qualitative, or both. Information may be collected from interviews, administrative data, observation, focus groups, checklists, self-reports, case audits, chart or electronic health record reviews, client reports, responses to vignettes, or a validated survey instrument or questionnaire.
Recommendation 5: timing and frequency
Another reporting problem that we encountered is lack of information about the timing and frequency of implementation outcome measurement. Clear research reporting requires disclosing observation periods, times between observations, and the number of times constructs are measured. Yet, our review of implementation outcome research was hampered by lack of such details. For example, in one article, self-assessments and independent assessments of fidelity were compared for a particular intervention. However, in the “Methods” section, fidelity assessments of both types were described as “completed during the last quarter” of a particular year. Without further detail, it was difficult to tell if these were cross-sectional fidelity assessments for unique providers or longitudinal data that tracked the same provider’s fidelity over time. Lack of detail about data collection timeframes limits researchers’ ability to assess the internal validity of study findings and the actual time that it takes to observe change in a given implementation outcome (and at a particular level of analysis). Therefore, our fifth recommendation is to state the number of time points and the frequency at which each outcome was or will be measured. Broad categories that the reader may consider include measuring the implementation outcome once (cross-sectional), twice (pre-post), or longitudinally (three or more time points are assessed). Reporting the phase or stage can also help to clarify when during the implementation lifecycle outcomes are observed or are most salient.
Recommendation 6: unit of analysis vs. unit of observation
The last problem we encountered is inconsistent or insufficient specification of the unit of analysis (the unit for which we make inferences about implementation outcomes) and the unit of observation (the most basic unit observed to measure the implementation outcome). In multiple instances, studies relied on reports from individual providers or clinicians to make inferences about team or organizational implementation outcomes (e.g., aggregating observations about individual providers’ adoption to understand overall team adoption). However, in some studies, these distinctions between the units of analysis and observation were not clearly drawn, explained, or appropriate. For instance, in a study examining practitioners participating in a quality improvement initiative, the study team assessed group-level sustainability by asking individual practitioners to discuss their perceptions of sustainability in interviews. It was not clear how the research team arrived at their conclusions about group-level sustainability from individual reports, which limits transparency and replicability. Furthermore, the lack of clarity around units of observation and analysis muddles the causal pathways that we are trying to understand in implementation outcomes research because mechanisms of change may differ among individual, group, organizational, and system levels.
A related issue involved limited explanation as to why units of observation (e.g., individuals’ perceptions of appropriateness) can and should be aggregated to reflect higher levels in the analysis (e.g., organizational-level appropriateness). Aggregating individual-level data to the group, team, or organizational level requires a strong theoretical justification that bottom-up processes exist to create a shared characteristic. In this example, sufficient theory was needed to demonstrate that appropriateness was an organizational-level construct (unit of analysis) and reflected a shared perception of appropriateness among individuals (unit of observation). This type of study also requires an analytic design that allows the researcher to rigorously test this assumption, including sufficient sample sizes to account for between-group effects [30, 31]. During our data charting process, lack of clarity in how unit of observation and unit of analysis were distinguished and treated, and why, made it difficult to assess the presence of such considerations.
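As one illustration of what an analytic check on aggregation might look like, the sketch below computes ICC(1), the one-way ANOVA intraclass correlation, which is a commonly used index of how much of the variance in individual ratings lies between groups. This is an assumption on our part about a suitable index, not a method drawn from the studies reviewed here; the function name `icc1` and the example data are hypothetical, and any threshold for "enough" agreement would need to be justified by the study team.

```python
from statistics import mean


def icc1(groups):
    """ICC(1) from a one-way ANOVA.

    `groups` maps each unit of analysis (e.g., an organization id) to the
    list of ratings from its units of observation (e.g., individual providers).
    Hypothetical helper for illustration only.
    """
    sizes = [len(ratings) for ratings in groups.values()]
    n_groups = len(groups)
    n_total = sum(sizes)
    if n_groups < 2 or n_total <= n_groups:
        raise ValueError("need at least two groups and more ratings than groups")

    grand_mean = sum(sum(ratings) for ratings in groups.values()) / n_total

    # Between-group and within-group sums of squares.
    ss_between = sum(
        len(ratings) * (mean(ratings) - grand_mean) ** 2
        for ratings in groups.values()
    )
    ss_within = sum(
        (x - mean(ratings)) ** 2
        for ratings in groups.values()
        for x in ratings
    )
    ms_between = ss_between / (n_groups - 1)
    ms_within = ss_within / (n_total - n_groups)

    # Adjusted average group size, which equals the common size when balanced.
    k = (n_total - sum(s * s for s in sizes) / n_total) / (n_groups - 1)

    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("ratings have no variance; ICC(1) is undefined")
    return (ms_between - ms_within) / denominator


# Hypothetical appropriateness ratings from providers in three organizations.
ratings_by_org = {
    "org_a": [4, 5, 4, 5],
    "org_b": [2, 2, 3, 2],
    "org_c": [3, 4, 3, 3],
}
print(round(icc1(ratings_by_org), 3))
```

Values near 1 indicate that ratings differ mostly between groups, and values near or below 0 indicate little group-level clustering. An index like this complements, but does not replace, the theoretical justification for treating the outcome as a shared, higher-level construct.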
In response, our sixth recommendation is to state the unit of analysis and unit of observation for each implementation outcome. Observations may be generated by individual clients/patients, individual providers, teams, organizations, or another type of system. However, these observations may be aggregated in some way to reflect an implementation outcome at a higher level (e.g., assessing team adoption based on an aggregation of each individual member’s adoption). We urge the reader to ensure that the level of analysis theoretically and methodologically aligns with who provided data and the level of data collection described in the “Recommendation 4: data source and measurement” section. If conducting multilevel analyses in which the units of observation and analysis differ, we also encourage the reader to include theoretical and analytical justification when aggregating implementation outcome data to a higher level of analysis.
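For readers who track reporting elements programmatically, the six recommendations can be sketched as a simple record per implementation outcome. The example below is a minimal sketch under our own assumptions: the class name `OutcomeReport`, its field names, and the helper functions are hypothetical and do not reproduce the audit worksheet in Additional file 1.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class OutcomeReport:
    """One record per implementation outcome (hypothetical field names)."""
    outcome_term: str = ""               # Recommendation 1
    operational_definition: str = ""     # Recommendation 1
    role_in_analysis: str = ""           # Recommendation 2, e.g., "dependent"
    referent: str = ""                   # Recommendation 3, "the thing"
    data_reporter: str = ""              # Recommendation 4, who provided data
    data_collection_level: str = ""      # Recommendation 4
    data_type: str = ""                  # Recommendation 4, e.g., "quantitative"
    n_time_points: int = 0               # Recommendation 5
    unit_of_observation: str = ""        # Recommendation 6
    unit_of_analysis: str = ""           # Recommendation 6
    aggregation_justification: str = ""  # needed only when the two units differ


def timing_category(n_time_points: int) -> str:
    """Map a number of time points to the broad categories in Recommendation 5."""
    if n_time_points < 1:
        raise ValueError("an outcome must be measured at least once")
    if n_time_points == 1:
        return "cross-sectional"
    if n_time_points == 2:
        return "pre-post"
    return "longitudinal"


def missing_items(report: OutcomeReport) -> List[str]:
    """List reporting elements that are still empty for this outcome."""
    missing = [
        f.name
        for f in fields(report)
        if f.name != "aggregation_justification" and not getattr(report, f.name)
    ]
    units_differ = (
        report.unit_of_observation
        and report.unit_of_analysis
        and report.unit_of_observation != report.unit_of_analysis
    )
    if units_differ and not report.aggregation_justification:
        missing.append("aggregation_justification")
    return missing


# Hypothetical example: team adoption inferred from individual provider reports.
report = OutcomeReport(
    outcome_term="adoption",
    operational_definition="provider used the intervention with at least one client",
    role_in_analysis="dependent",
    referent="one manualized intervention",
    data_reporter="individual provider",
    data_collection_level="individual/self",
    data_type="quantitative",
    n_time_points=2,
    unit_of_observation="individual provider",
    unit_of_analysis="team",
)
print(timing_category(report.n_time_points))
print(missing_items(report))
```

In this example the units of observation and analysis differ, so the check flags the absent aggregation justification even though every other element is reported.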