Standardizing an approach to the evaluation of implementation science proposals

Background The fields of implementation and improvement sciences have experienced rapid growth in recent years. However, research that seeks to inform health care change may have difficulty translating core components of implementation and improvement sciences within the traditional paradigms used to evaluate efficacy and effectiveness research. A review of implementation and improvement sciences grant proposals within an academic medical center using a traditional National Institutes of Health framework highlighted the need for tools that could assist investigators and reviewers in describing and evaluating proposed implementation and improvement sciences research.
Methods We operationalized existing recommendations for writing implementation science proposals as the ImplemeNtation and Improvement Science Proposals Evaluation CriTeria (INSPECT) scoring system. The resulting system was applied to pilot grants submitted to a call for implementation and improvement science proposals at an academic medical center. We evaluated the reliability of the INSPECT system using Krippendorff’s alpha coefficients and explored the utility of the INSPECT system to characterize common deficiencies in implementation research proposals.
Results We scored 30 research proposals using the INSPECT system. Proposals received a median cumulative score of 7 out of a possible score of 30. Across individual elements of INSPECT, proposals scored highest for criteria rating evidence of a care or quality gap. Proposals generally performed poorly on all other criteria. Most proposals received scores of 0 for criteria identifying an evidence-based practice or treatment (50%), conceptual model and theoretical justification (70%), setting’s readiness to adopt new services/treatment/programs (53%), implementation strategy/process (67%), and measurement and analysis (70%). Inter-coder reliability testing showed excellent reliability (Krippendorff’s alpha coefficient 0.88) for the application of the scoring system overall and demonstrated reliability scores ranging from 0.77 to 0.99 for individual elements.
Conclusions The INSPECT scoring system provides a new set of scoring criteria with a high degree of inter-rater reliability and utility for evaluating the quality of implementation and improvement sciences grant proposals.


Background
The recognition that experimental efficacy studies alone are insufficient to improve public health [1] has led to the rapid expansion of the fields of implementation and improvement sciences [2][3][4][5]. However, studies that aim to identify strategies that facilitate adoption, sustainability, and scalability of evidence may not translate well within traditional efficacy and effectiveness research paradigms [6].
The need for new tools to aid investigators and research stakeholders in implementation science became clear during evaluation of grant submissions to the Evans Center for Implementation and Improvement Sciences (CIIS) at Boston University. CIIS was established in 2016 to promote scientific rigor in new and ongoing projects aimed at increasing the use of evidence and improving patient outcomes within an urban, academic, safety net medical center. As part of CIIS's goal to foster rigorous implementation and improvement methods, CIIS established a call for pilot grant applications for implementation and improvement sciences [7]. Proposals were peer-reviewed using traditional National Institutes of Health (NIH) scoring criteria [8]. Through two cycles of grant applications, proposal reviewers identified a need for improved evaluation criteria capable of identifying specific strengths and weaknesses in order to rate the potential impact of implementation and/or improvement study designs.
We describe the development and evaluation of ImplemeNtation and Improvement Science Proposal Evaluation CriTeria (INSPECT): a tool for the standardized evaluation of implementation and improvement research proposals. The INSPECT tool seeks to operationalize criteria proposed by Proctor et al. as "key ingredients" that constitute a well-crafted implementation science proposal, which operate within the NIH proposal scoring framework [6].

Assessment of need
CIIS released requests for pilot grant applications focused on implementation and improvement sciences in April 2016 and April 2017 [7]. The request for applications described an opportunity for investigators to receive up to $15,000 for innovative implementation and improvement sciences research on any topic related to improving the processes and outcomes of health care delivery in safety net settings. CIIS funds pilot grants with the goal of providing investigators with the opportunity to obtain preliminary data for further research. Proposals were required to include a specific aims page and a three-page research plan structured within the traditional NIH framework with subheadings for significance, innovation, approach, environment, and research team. The NIH framework was required because it corresponds with the grant proposal structure required by the NIH. A study budget and justification as well as research team biographical sketches were required with no page limit restrictions. CIIS received 30 pilot grant applications covering a broad array of content areas, such as smoking cessation, hepatitis C, diabetes, cancer, and neonatal abstinence syndrome.
Six researchers with experience in implementation and improvement sciences served as grant reviewers. Four reviewers scored each proposal. Reviewers evaluated the quality of pilot study proposals, assigning numerical scores from 1 to 9 (1 = exceptional, 9 = poor) for each of the NIH criteria (significance, innovation, investigators, approach, environment, overall impact) [8]. CIIS elected to use the NIH criteria to evaluate the pilot grant applications because the criteria are those used by the NIH peer review systems to evaluate the scientific and technical merit of grant proposals. The CIIS grant review team held a "study section" to review and discuss the proposals. However, during that meeting, reviewers provided feedback that the NIH evaluation criteria, based in the traditional efficacy and effectiveness research paradigm, did not offer sufficient guidance for evaluating implementation and improvement science proposals, nor did they provide enough specificity for proposal writers who are less experienced in implementation research. Grant reviewers requested new proposal evaluation criteria that would better inform scoring decisions and feedback to proposal writers on specific aspects of implementation science, including the strength of the implementation study design, strategy, feasibility, and relevance.
Despite the challenges of using the traditional NIH evaluation criteria, the review panel used those criteria to score all of the grants received during the first 2 years of proposal requests. CIIS pilot grant funding was awarded to applications that received the lowest (best) scores under the NIH criteria and received positive feedback from the review panel.
The request for more explicit implementation science evaluation criteria prompted the CIIS research team to conduct a qualitative needs assessment of all 30 pilot study applications in order to determine how the proposals described study designs, implementation strategies, and other aspects of proposed implementation and improvement research. Three members of the CIIS research team (MLD, AJW, DB) independently open-coded pilot proposals to identify properties related to core implementation science concepts or efficacy and effectiveness research [9]. The team identified common themes in the proposals, including an emphasis on efficacy hypotheses, descriptions of untested interventions, and the absence of implementation strategies and conceptual frameworks. The consistent lack of features identified as important aspects of implementation science reinforced the need for criteria that specifically addressed implementation science approaches to guide both proposal preparation and evaluation.

Operationalizing scoring criteria
We identified Proctor et al.'s "ten key ingredients" for writing implementation research proposals [6] as an appropriate framework to guide and evaluate proposals. We operationalized the "ingredients" into a scoring system. To construct the scoring system, a four-point scale (0-3) was created for each element. In general, a score of 3 was given for an element if all of the criteria requirements for the element were fully met; a score of 2 was given if the criteria were somewhat, but not fully addressed; a score of 1 was given if the ingredient was mentioned but not operationalized in the proposal or linked to the rest of the study; and a score of 0 was given if the element was not addressed at all in the proposal. Table 1 illustrates the INSPECT scoring system for the 10 items, in which proposals receive one score for each of the 10 ingredients, for a cumulative score between 0 and 30.
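The scoring arithmetic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the item labels paraphrase Proctor et al.'s ingredients and are placeholders, the function name `cumulative_score` is hypothetical, and the example scores are invented rather than taken from any reviewed proposal.

```python
# Illustrative sketch of the INSPECT arithmetic: 10 items, each scored 0-3,
# summed to a cumulative score between 0 and 30.
# Item labels are paraphrased placeholders; example scores are invented.

MIN_SCORE, MAX_SCORE = 0, 3
N_ITEMS = 10


def cumulative_score(item_scores):
    """Validate one proposal's item scores and return their sum (0-30)."""
    if len(item_scores) != N_ITEMS:
        raise ValueError(f"expected {N_ITEMS} item scores, got {len(item_scores)}")
    for item, score in item_scores.items():
        if not isinstance(score, int) or not MIN_SCORE <= score <= MAX_SCORE:
            raise ValueError(f"{item!r}: score must be an integer from 0 to 3")
    return sum(item_scores.values())


# Hypothetical proposal: strong on the care gap, weak elsewhere.
example_proposal = {
    "care or quality gap": 3,
    "evidence-based treatment to be implemented": 0,
    "conceptual model and theoretical justification": 0,
    "stakeholder priorities and engagement": 1,
    "setting's readiness to adopt": 0,
    "implementation strategy/process": 0,
    "team experience": 2,
    "feasibility of design and methods": 1,
    "measurement and analysis": 0,
    "policy/funding environment": 0,
}

print(cumulative_score(example_proposal))
```

Rejecting out-of-range or missing items at entry keeps every cumulative score comparable on the same 0-30 scale.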

Testing INSPECT
We used the pilot study proposals submitted to CIIS to develop and evaluate the utility and reliability of the INSPECT scoring system. Initially, two research team members (ELC, DB) independently applied the 10-element criteria to 7 of the 30 pilot grant proposals. Four team members (MLD, AJW, ELC, DB) then met to discuss these initial results and achieve consensus on the scoring criteria. Two team members (ELC, DB) then independently scored the remaining 23 pilot study applications using the revised scoring system. Both reviewers recorded brief justifications for each of the ten scores assigned to individual study proposals. The two coders (ELC, DB) then met to compare scores, share scoring justifications, and determine the final item-specific scores for each proposal using group consensus.
Inter-coder reliability with the scoring protocol was measured using Krippendorff's alpha to assess observed and expected disagreement between the two coders' initial individual item scores [10,11]. An alpha coefficient of 0.70 was deemed a priori as the lowest acceptable level of agreement to establish reliability of the new scoring protocol [10,11]. Frequency analyses were conducted to determine the distribution of final element-specific scores (0-3) across all proposals. We calculated a correlation coefficient to assess the association between proposal scores assigned using the NIH framework and scores assigned using INSPECT. All calculations were performed in R version 3.3.2 [12].
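For readers unfamiliar with Krippendorff's alpha, the following Python sketch computes it as one minus the ratio of observed to expected disagreement for the simple case of two coders and no missing ratings. It is not the analysis code used in this study, which was run in R. The text does not state which level of measurement was used, so the choice between the nominal and interval distance functions here is an assumption, and the example ratings are invented.

```python
# Minimal sketch of Krippendorff's alpha for two coders, no missing data.
# alpha = 1 - D_o / D_e, where D_o is the mean within-unit disagreement and
# D_e is the mean disagreement over all ordered pairs of pooled ratings.
# The distance metric is an assumption; the source does not specify one.
from itertools import permutations


def nominal_distance(a, b):
    """Any two different categories count as full disagreement."""
    return 0.0 if a == b else 1.0


def interval_distance(a, b):
    """Squared difference, treating the 0-3 scores as equally spaced."""
    return float((a - b) ** 2)


def krippendorff_alpha(coder_a, coder_b, distance=interval_distance):
    if len(coder_a) != len(coder_b) or len(coder_a) == 0:
        raise ValueError("both coders must rate the same non-empty set of units")
    n_units = len(coder_a)
    # Observed disagreement: average distance between the two ratings per unit.
    observed = sum(distance(a, b) for a, b in zip(coder_a, coder_b)) / n_units
    # Expected disagreement: average distance over all ordered pairs of the
    # pooled ratings, ignoring which unit each rating came from.
    pooled = list(coder_a) + list(coder_b)
    n = len(pooled)
    expected = sum(distance(x, y) for x, y in permutations(pooled, 2)) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when all ratings are identical")
    return 1.0 - observed / expected


# Invented ratings for four proposals on one item.
coder_1 = [0, 1, 2, 3]
coder_2 = [0, 1, 2, 2]
alpha = krippendorff_alpha(coder_1, coder_2)
print(round(alpha, 3), alpha >= 0.70)
```

In this sketch the a priori 0.70 threshold is applied as a simple comparison on the returned coefficient.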

Results
Iterative review of the 30 research proposals using Proctor et al.'s "ten key ingredients" resulted in the development and testing of the INSPECT system for assessing implementation and improvement science proposals. Figure 1 displays the right-skewed distribution of cumulative proposal scores, with most proposals receiving low overall scores. Out of a possible cumulative score of 30, proposals had a median score of 7 (IQR 3.3-11.8). Table 2 presents the distribution of cumulative and item-specific scores assigned to proposals using the INSPECT criteria. Across individual elements, proposals scored highest for criteria describing care/quality gaps in health services. Thirty-six percent of proposals received the maximum score of 3 for meeting all care or quality gap element requirements, including using local setting data to support the existence of a gap, including an explicit description of the potential for improvement, and linking the proposed research to funding priorities (i.e., safety net setting).
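The descriptive summaries reported here (a median with interquartile range for cumulative scores, and the share of proposals at each score level for an item) could be reproduced along the following lines. This is a sketch, not the authors' R code: the function names are hypothetical, the data are invented, and the quartile convention (Python's "inclusive" method) is an assumption, since different conventions give slightly different IQR bounds.

```python
# Sketch of the descriptive summaries: median/IQR of cumulative scores and
# the percentage of proposals at each score level (0-3) for a single item.
# Quartile convention and all data below are assumptions for illustration.
from collections import Counter
from statistics import median, quantiles


def summarize_cumulative(scores):
    """Return (median, first quartile, third quartile) of cumulative scores."""
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return median(scores), q1, q3


def score_shares(item_scores):
    """Return the percentage of proposals receiving each score from 0 to 3."""
    counts = Counter(item_scores)
    total = len(item_scores)
    return {level: 100 * counts.get(level, 0) / total for level in range(4)}


# Invented cumulative scores for five proposals.
print(summarize_cumulative([1, 2, 3, 4, 5]))

# Invented scores for one item across ten proposals.
print(score_shares([0, 0, 0, 1, 2, 3, 3, 0, 1, 0]))
```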
Proposals generally scored poorly for other criteria. As shown in Table 2, most study proposals received scores of 0 in the categories of evidence-based treatment to be implemented (50%), conceptual model and theoretical justification (70%), setting's readiness to adopt new services/treatment/programs (53%), implementation strategy/process (67%), and measurement and analysis (70%). For example, reviewers gave scores of 0 for the "evidence-based intervention to be implemented" element because the intervention was not evidence-based and the project sought to establish efficacy, rather than to examine uptake of an established evidence-based practice. Similarly, proposals that only sought to study effectiveness and did not assess any implementation outcomes [13] (e.g., adoption, fidelity) received scores of 0 for "measurement and analysis." None of the study proposals primarily aiming to assess effectiveness outcomes expressed the dual research intent of a hybrid design. Scores of 0 for other categories were given when applications lacked any description relevant to the category, such as no conceptual model, no implementation strategy, or no research team skills relevant to implementation or improvement science.
Table 2 displays the assessed rates of inter-coder reliability in applying INSPECT to the 30 pilot study proposals. An overall alpha coefficient of 0.88 was observed between the coders. Rates of inter-coder reliability in applying each of the 10 items to the proposals ranged from 0.77 to 0.99, all above the 0.70 reliability threshold.
Additionally, we observed a moderate inverse correlation (r = −0.62, p < 0.01) between the proposal scores initially assigned using the NIH framework and the scores assigned using INSPECT.
Table 1 (excerpt): INSPECT scoring criteria
Measurement and analysis
Score 0
• No measurement and/or data analysis plan are included to describe how variables and outcomes will be measured
Score 1
• Measurement and/or data analysis plans do not clearly describe how all variables and outcomes will be measured, or plans are inappropriate for the proposed study
Score 2
• Measurement and/or data analytic plans describe how all variables and outcomes will be measured and is appropriate for the proposed study, but linkage to the theoretical model is unclear
Score 3
• The unit of analysis is appropriate for the proposed study
• Measurement and data analytic plans robustly describe how all variables and outcomes will be measured and are appropriate for the proposed study through a clear theoretical justification
Policy/funding environment; leverage of support for sustaining change
Score 0
• No acknowledgement of the internal/external policy trends and/or funding environment for the proposed study is included
• Zero or limited discussion of the potential impact of the intervention is included
• Zero or limited discussion of disseminating study findings is included
Score 1
• The internal/external policy trends and/or funding environment are discussed but additional clarification is needed
• The potential impact of the intervention is not linked to the policy and/or funding context and may not be relevant to a safety net setting
• The dissemination plan for study findings does not clearly indicate a contribution will be made to the broader policy level and safety net setting
Score 2
• The internal/external policy trends and/or funding environment are clearly described
• The potential impact of the intervention is linked to relevant policies and funding issues associated with a safety net setting but may need further explanation
• The dissemination plan for study findings indicates a contribution will be made to the broader policy level and safety net setting, but what contribution and how it will be achieved is unclear
Score 3
• The internal/external policy trends and/or funding environment are clearly described
• Potential impact of the intervention is explicitly linked to relevant policies and funding issues associated with a safety net setting
• The dissemination plan for study findings indicates what and how a contribution will be made to the broader policy level and safety net setting
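The negative sign of the correlation between NIH and INSPECT scores follows from the two scales running in opposite directions: NIH scores run from 1 (exceptional) to 9 (poor), whereas higher INSPECT scores indicate stronger proposals. The Python sketch below illustrates this with a hand-written Pearson coefficient. The text reports only "a correlation coefficient", so the use of Pearson's r is an assumption, and the paired scores are invented.

```python
# Sketch: agreement between two scales that run in opposite directions
# appears as a negative correlation. Pearson's r is an assumed choice;
# the paired scores below are invented for illustration.
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical proposals: low (good) NIH scores pair with high INSPECT scores.
nih_scores = [2.0, 3.5, 5.0, 6.5, 8.0]      # 1 = exceptional, 9 = poor
inspect_scores = [20, 12, 9, 5, 2]          # 0 = weakest, 30 = strongest

r = pearson_r(nih_scores, inspect_scores)
print(r < 0)  # True: the two rating systems rank the proposals similarly
```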

Discussion
We developed a reliable proposal scoring system that operationalizes Proctor et al.'s "ten key ingredients" for writing an implementation research grant [6]. Previous research analyzing peer-review grant processes has highlighted a need to improve scoring agreement between peer reviewers [14]. High levels of disagreement in assessors' interpretation of grant scoring criteria result in unreliable peer-review processes and funding decisions based more on chance than scientific merit [14].
Measuring inter-rater reliability is a standard approach for evaluating the utility of existing proposal scoring criteria and assessing efforts to improve the criteria [15,16]. Application of the INSPECT system demonstrated high inter-rater reliability overall and within each of the 10 items. The high degree of reliability measured for INSPECT may be related to the specificity of its design as a set of implementation and improvement science scoring criteria. A review of scoring rubrics reported in the scientific literature suggests that topic-focused criteria contribute to increased scoring reliability [17]. Additionally, the moderate inverse correlation between scores assigned using the NIH framework and scores assigned using INSPECT suggests validity of the INSPECT criteria in evaluating proposal quality. The correlation is inverse because lower NIH scores and higher INSPECT scores both indicate stronger proposals. Proctor et al.'s "ten key ingredients" for grant writers were developed to map onto the existing NIH criteria. Our operationalized version of the ingredients as scoring criteria demonstrated that proposals that scored poorly under NIH criteria also scored poorly under INSPECT.
Applying the INSPECT system to proposed implementation and improvement science research at an academic medical center improved proposal reviewers' ability to identify specific strengths and weaknesses in implementation approach. Overall, proposals only received high scores for identifying the care gap or quality gap. Since efficacy and implementation or improvement research may use similar techniques to establish the significance of the study questions [18], proposals may score well on describing the quality gap, even if they later described [...]
Consistently low scores in four areas (defining the evidence-based treatment to be implemented, conceptual model and theoretical justification, setting's readiness to adopt new programs, and measurement and analysis) suggest that many investigators seeking to conduct implementation research may have misconceptions about the fundamental goals of this field.
One misconception may relate to a sole focus on evaluating an intervention's effectiveness rather than studying the processes and outcomes of implementation strategies. The majority of study proposals evaluated using INSPECT neither aimed to improve uptake of any evidence-based practice nor included any implementation measures such as acceptability, adoption, feasibility, fidelity, penetration, or sustainability [19]. Inadequate and inconsistent descriptions of implementation strategies and outcomes represent major challenges to overall implementation study success [20]. In addition to guidance provided by the INSPECT criteria, recent efforts to develop implementation study reporting standards [21] may assist proposal writers in describing planned research.
Several proposals addressed treatments or practices with low evidence for the potential to improve healthcare. Although hybrid studies, which study both effectiveness and implementation outcomes, are practical approaches to establishing the effectiveness of evidence-informed practices while measuring implementation efforts [18], none of the study proposals expressed this dual research intent or were conceived as hybrid designs.
Our findings also suggest low familiarity with and use of resources to evaluate the strength of evidence (such as the Grading Quality of Evidence and Strength of Recommendations system [22] and the Strength of Recommendation Taxonomy grading scale [23]) for implementation science research. A more systematic evaluation of the strength of evidence [24][25][26][27] necessary to warrant implementation efforts may help to differentiate implementation science from efficacy or effectiveness research and improve understanding of the utility hybrid studies offer [28].
Expanding access to implementation science training in universities as part of the core health services research curriculum, and enhancing access to professional development opportunities that focus on conceptual and methodological implementation skills in a content-agnostic way, would aid in building capacity for the next generation of implementation science researchers. Additionally, training programs offer an opportunity to provide guidance on both writing and evaluating the quality of implementation science grant applications.
Strengths of our study include the application of INSPECT to study proposals that were submitted by investigators with a wide range of implementation and improvement science-specific experience and that covered a variety of content areas. However, our results are limited in that they characterize one academic institution's familiarity with implementation and improvement science research, and the INSPECT system requires validation in other settings and over a broader range of proposal ratings. Additionally, we measured a high degree of inter-rater reliability for INSPECT when it was applied to a sample of low-scoring proposals. INSPECT's inter-rater reliability may decrease when it is applied to a sample of higher quality proposals and reviewers are required to discriminate between gradations of quality (i.e., scores of 1-3) rather than mostly scoring the absence of key items (i.e., scores of 0). Future research should test the validity of INSPECT by comparing INSPECT-assigned scores to ratings assigned to approved proposals by the NIH Dissemination and Implementation Research in Health study section. Future research should also assess the relationship between INSPECT score assignments and successful study completion to determine the utility of INSPECT as a mechanism for ensuring the quality and impact of funded research. To aid in these prospective research efforts, forthcoming proposal calls from CIIS will specifically use INSPECT as the proposal evaluation criteria.
Although multiple tools exist to aid researchers in writing implementation science proposals [6,29,30], few resources exist to support grant reviewers. This study identified additional functionality of Proctor et al.'s "ten key ingredients" as a guide for writers by developing it into a detailed checklist for proposal reviewers. The current research makes a substantive contribution to implementation and improvement sciences by demonstrating the utility and reliability of a new tool designed to aid grant reviewers in identifying high-quality research.

Conclusion
In conclusion, we operationalized an implementation and improvement research-specific scoring system to provide guidance for proposal writers and grant reviewers. We demonstrated the utility and reliability of the new INSPECT scoring system in evaluating the quality of implementation and improvement sciences research proposed at one academic medical center. The prevalence of low scores across the majority of INSPECT criteria suggests a need to promote education about the goals of implementation and improvement science, including the conceptual and methodological distinctions from efficacy and effectiveness research.