This study represents a systematic effort to characterize the types of modifications that are made to interventions when they are implemented in real-world settings. At a high level, two of the major coding categories mapped onto Castro and colleagues' distinction between modifications of program content and modifications of the form of delivery (e.g., location of delivery, delivery person, or channel of delivery) [27]. In a sample of studies described in peer-reviewed articles, representing a variety of interventions and contexts, we found that contextual modifications were only occasionally reported, whereas content modifications were reported much more frequently. Tailoring the intervention to address language, cultural differences, literacy, or situational constraints was the most commonly identified content modification, followed by the addition or removal of elements and changes to the length or pacing of the intervention.
Other modifications identified in our coding process, such as drift and loosening of structure, occurred relatively rarely within the articles that we reviewed. This low frequency in the current sample is not surprising: such behaviors are unlikely to occur in a planned manner and may be less likely to be emphasized when describing an evidence-based intervention in a peer-reviewed article. Furthermore, relatively few of the sampled articles employed the type of observation or stakeholder interviews through which such behaviors may be identified. While drift might also be considered a discontinuation of the intervention entirely, or a lack of fidelity rather than a modification, it seems important to capture it in a system designed to classify deviations from and modifications to a protocol, in order to better measure the impact of such deviations on outcomes of interest. For example, the impact of the option to occasionally or strategically drift on clinician or client satisfaction may be important to explore, in addition to the impact of drift on clinical effectiveness.
In contrast to the findings of Hill and colleagues' study [20], most of the articles identified in our search described modifications that were made proactively, in recognition of key differences between the implementation setting and the settings in which the intervention was originally developed. In another report, we describe findings that emerged when we applied this framework to interview data from a sample of community-based mental health service providers who were trained in an EBP [53]. Several of the lower-frequency modifications identified in the current study were endorsed much more frequently in that study, suggesting that modifications made proactively may differ from those made once implementation is underway. Thus, at this stage of development, we determined that it was important to represent a more exhaustive set of possible modifications in the classification system.
As the discussion above indicates, some modifications may signify decreases in fidelity, while others may be consistent with the design of the intervention. The tension between modification and fidelity is a critical issue in implementation science [4, 54, 55]. Many researchers recognize that modifications will occur throughout the course of an implementation effort, but the type and extent of modifications that can occur without compromising effectiveness or degrading fidelity to an unacceptable degree have not been sufficiently explored. In theory, it is possible to make some types of modifications without compromising effectiveness or removing the key elements of an intervention. However, for some interventions, the core elements have not yet been determined empirically, and very little is known about the impact of behaviors such as integrating other interventions or selectively implementing particular aspects of a treatment. Fidelity measures that emphasize competence or the spirit of an intervention over adherence may not adequately capture some potentially important types of modification, and those that emphasize adherence may not capture modifications such as tailoring. Thus, when observation or reliable self-report is possible, using a fidelity measure alongside this modification framework can guide decisions regarding the extent to which a particular modification represents a departure from the core elements of an intervention. Used alone or as a complement to fidelity measures, this framework may also be useful in determining whether particular elements can be removed, re-ordered, integrated, or substituted without compromising effectiveness.
Despite the breadth of the coding system we developed, interrater agreement for the subset of independently coded articles was quite high, reaching the standards of 'substantial' and 'almost perfect' agreement for the level and nature of modifications, respectively [52]. Within our research group, this level of reliability was achieved after a brief series of hour-long weekly coding meetings, suggesting that our coding scheme can be used to reliably classify modifications described in research articles without overly burdensome training.
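Agreement benchmarks of this kind are conventionally expressed in terms of Cohen's kappa, which corrects the observed agreement between raters for the agreement expected by chance:

\[
\kappa = \frac{p_o - p_e}{1 - p_e}
\]

where p_o is the observed proportion of agreement and p_e is the proportion of agreement expected by chance alone. Under the commonly cited Landis and Koch guidelines, kappa values of 0.61 to 0.80 are interpreted as 'substantial' agreement and values above 0.80 as 'almost perfect' agreement.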
We note several potential limitations of the study and framework. First, our search process was not intended to identify every article that described modifications to evidence-based interventions, particularly if adaptation or modification was not a major topic addressed in the article. Instead, we sought to identify articles describing modifications that occurred across a variety of interventions and contexts and to achieve theoretical saturation. In the development of the coding system, we did in fact reach a point at which no additional modifications were identified, and the implementation experts who reviewed our coding system likewise identified no new concepts. Thus, it is unlikely that additional articles would have resulted in significant additions or changes to the system.
In developing this framework, we made a number of decisions regarding the codes and levels of coding that should be included. We considered including codes for planned vs. unplanned modifications, for major vs. minor modifications (or degree of modification), for changes to the entire intervention vs. changes to specific components, and for the reasons behind modifications. Because we wished to minimize the number of levels of coding so that the scheme could be used in quantitative analyses, we did not include these constructs, or constructs such as dosage or intensity, which are frequently included in frameworks and measures for assessing fidelity [56]. Additionally, we intend the framework to be used with multiple types of data sources, including observation, interviews, and written descriptions, and we considered how easily some codes might be applied to information derived from each source. Some data sources, such as observations, might not allow coders to discern reasons for modification or to distinguish planned from unplanned modifications; we therefore limited the framework to characterizations of the modifications themselves rather than how or why they were made. However, codes in the existing scheme sometimes implied such additional information. For example, the numerous findings regarding tailoring interventions for specific populations indicate that adaptations to address differences in culture, language, or literacy were common. Aarons and colleagues offer a distinction among consumer-driven, provider-driven, and organization-driven adaptations that may be useful for researchers who wish to record additional information regarding how or why particular changes were made [35].

Although major and minor modifications may be easier to distinguish by consulting the intervention's manual, we also decided against including a code for this distinction. For some interventions, it has not been empirically established which particular processes are critical, and we hope that this framework might ultimately allow an empirical exploration of which modifications should be considered major (e.g., having a significant impact on outcomes of interest) for specific interventions. Furthermore, our effort to develop an exhaustive set of codes meant that some types of modifications, or some categories of individuals who made them, appeared at fairly low frequencies in our sample; their reliability and utility therefore require further study. As the framework is applied to different interventions or sources of data, additional assessment of reliability and further refinement of the coding system may be warranted.
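To make the resulting structure concrete, the sketch below shows one way a single coded modification might be represented for quantitative analysis. The class and field names, and the partial category lists, are our illustrative choices rather than the framework's official code list; only the categories mentioned in this article (content vs. context, 'tailoring/tweaking/refining', adding or removing elements, changes to length or pacing, loosening of structure, drift, and the decision-maker code) are drawn from the text.

```python
from dataclasses import dataclass
from enum import Enum

class ModType(Enum):
    CONTENT = "content"   # changes to what is delivered
    CONTEXT = "context"   # changes to the form of delivery: location, person, channel

class Nature(Enum):
    # Partial, illustrative list of 'nature of modification' codes
    TAILORING = "tailoring/tweaking/refining"
    ADDING_ELEMENTS = "adding elements"
    REMOVING_ELEMENTS = "removing elements"
    LENGTH_PACING = "changes to length or pacing"
    LOOSENING_STRUCTURE = "loosening structure"
    DRIFT = "drift"

@dataclass
class CodedModification:
    """One modification event, coded on the framework's dimensions."""
    mod_type: ModType     # content vs. context
    nature: Nature        # what kind of change was made
    decision_maker: str   # who made the modification (may be omitted in some designs)
    data_source: str      # e.g., "article", "interview", "observation"
```

Keeping each dimension a single flat field, rather than nesting additional levels, reflects the decision described above to limit the number of coding levels so that coded records can feed directly into quantitative analyses.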
An additional limitation of the current study is that our ability to confidently rate modifications depended on the quality of the descriptions provided in the articles that we reviewed. At times, it was necessary to make assumptions about how things were actually modified, or about the level at which the modifications occurred. The level of detail available in records, clinical notes, or other qualitative data used to investigate modifications may similarly affect future investigations. We attempted to address this limitation by establishing decision rules about the level of detail and clarity required to assign codes and by documenting these rules in detail in our coding manual. The level of rater agreement that we achieved suggests that our process was reasonably successful, despite occasional ambiguities in the descriptions.

In future efforts to utilize this system, two strategies can minimize the likelihood that insufficient data are available to assign codes. First, whenever possible, observation by raters knowledgeable about the intervention and its core components should be used to identify modifications. This may be especially important in differentiating minor modifications (which might be coded as 'tailoring/tweaking/refining') from more intensive modifications (which, for example, might be coded as 'removing elements'); ultimately, making these distinctions requires a thorough knowledge of the intervention itself. Second, when interviews are conducted in lieu of observation or in addition to review of existing records, we recommend asking very specific follow-up questions about the modifications that were made. Familiarity with both the intervention and the coding system when interviewing can increase the likelihood that sufficient information is obtained to make an appropriate judgment. Despite these measures, interrater reliability may vary across data sources, although additional work by our research group suggests that reliability remains high when the coding scheme is applied to interview data [53]. We are also examining reliability when the coding scheme is applied to observational data, using audio recordings of psychotherapy sessions, and we recommend that researchers assess reliability whenever they use this framework.
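As one illustration of the kind of decision rule a coding manual of this sort might contain (the rule and keyword lists below are hypothetical, not quoted from our manual), a code could be assigned only when a description explicitly names the kind of change, with anything more ambiguous flagged for follow-up rather than guessed at:

```python
from typing import Optional

def assign_nature_code(description: str) -> Optional[str]:
    """Hypothetical decision rule: code a description only if it names the
    kind of change explicitly; otherwise return None rather than guessing.
    The keyword lists are illustrative, not taken from the coding manual."""
    text = description.lower()
    if "remov" in text or "dropped" in text or "omitted" in text:
        return "removing elements"
    if "tailor" in text or "adapted for" in text or "translated" in text:
        return "tailoring/tweaking/refining"
    if "added" in text or "supplement" in text:
        return "adding elements"
    return None  # insufficient detail: flag for a follow-up question, do not code

# An ambiguous description is flagged rather than coded:
print(assign_nature_code("The protocol was changed for the new clinic."))   # None
print(assign_nature_code("Two sessions were omitted due to time limits."))  # removing elements
```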
We believe that the framework we present can be used flexibly, depending on the goals of the research and the type of data collection that occurs. For example, researchers may wish to code exclusively for content- or context-level modifications if they are interested in determining the impact of specific types of modifications on health outcomes. Similarly, the code for the decision maker may not be necessary if researchers are studying modifications made by one particular group, or evaluating adaptations that were pre-specified by a single decision maker before implementation began. However, this code might be very informative if researchers wish to understand the impact of the nature and process of modification on outcomes such as stakeholder engagement or fidelity to core program or intervention elements.
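Continuing the illustrative classes sketched earlier, this kind of flexibility can amount to a simple filtering step at analysis time: researchers interested only in content modifications restrict on that field, and those uninterested in the decision-maker dimension simply aggregate over it (the records below are invented examples, not study data).

```python
from collections import Counter

# Invented example records using the illustrative classes above
records = [
    CodedModification(ModType.CONTENT, Nature.TAILORING, "provider", "interview"),
    CodedModification(ModType.CONTEXT, Nature.LENGTH_PACING, "administrator", "article"),
    CodedModification(ModType.CONTENT, Nature.REMOVING_ELEMENTS, "provider", "observation"),
]

# Code exclusively for content-level modifications:
content_only = [r for r in records if r.mod_type is ModType.CONTENT]

# Ignore the decision-maker dimension by aggregating over it:
nature_counts = Counter(r.nature for r in content_only)
print(nature_counts)  # e.g., TAILORING: 1, REMOVING_ELEMENTS: 1
```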
This coding system may be used to advance research designed to understand the impact of changes made to interventions in particular contexts. Ultimately, such an understanding will require simultaneous use of this coding scheme and treatment outcome assessments, to help researchers and clinicians determine which specific types of modifications are most useful in increasing the effectiveness of interventions. Such an understanding will allow stakeholders to make more informed decisions about whether and how to modify interventions when implementing them in contexts that differ from those in which they were originally developed and tested. Additionally, when used in the context of fidelity monitoring, this system can provide more useful information about what actually occurs when lower levels of adherence are identified, as well as about the types of modifications that can occur within acceptable levels of fidelity. Baumann and colleagues suggested that there is a range of feasible fidelity, as well as a point of 'dramatic mutation' at which the intervention is no longer recognizable or effective [8]. This system of characterizing modifications may be useful in determining these ranges and boundaries with greater specificity. By understanding the types of modifications that can be made while keeping the intervention out of the range of dramatic mutation, stakeholders may ultimately find it easier to adapt interventions as needed while attending to an intervention's most critical components. Investigations of the impact of particular types of modifications on clinical outcomes can further inform efforts to implement evidence-based interventions while preserving desired levels of effectiveness. Finally, another potential area of investigation using this framework is the impact of specific modifications on implementation outcomes such as adoption and sustainability. Additional knowledge about these critical issues in implementation science will yield important guidance for those wishing to advance the implementation of evidence-based programs and interventions.