Psychometric properties of implementation measures for public health and community settings and mapping of constructs against the Consolidated Framework for Implementation Research: a systematic review

Background Recent reviews have synthesised the psychometric properties of measures developed to examine implementation science constructs in healthcare and mental health settings. However, no reviews have focussed primarily on the properties of measures developed to assess innovations in public health and community settings. This review identified quantitative measures developed in public health and community settings, examined their psychometric properties, and described how the domains of each measure align with the five domains and 37 constructs of the Consolidated Framework for Implementation Research (CFIR). Methods MEDLINE, PsycINFO, EMBASE, and CINAHL were searched to identify publications describing the development of measures to assess implementation science constructs in public health and community settings. The psychometric properties of each measure were assessed against recommended criteria for validity (face/content, construct, criterion), reliability (internal consistency, test-retest), responsiveness, acceptability, feasibility, and revalidation and cross-cultural adaptation. Relevant domains were mapped against implementation constructs defined by the CFIR. Results Fifty-one measures met the inclusion criteria. The majority of these were developed in schools, universities, or colleges and other workplaces or organisations. Overall, most measures did not adequately assess or report psychometric properties. Forty-six percent of measures using exploratory factor analysis reported >50 % of variance was explained by the final model; none of the measures assessed using confirmatory factor analysis reported root mean square error of approximation (<0.06) or comparative fit index (>0.95). Fifty percent of measures reported Cronbach’s alpha of <0.70 for at least one domain; 6 % adequately assessed test-retest reliability; 16 % of measures adequately assessed criterion validity (i.e. known-groups); 2 % adequately assessed convergent validity (r > 0.40). 
Twenty-five percent of measures reported revalidation or cross-cultural validation. The CFIR constructs most frequently assessed by the included measures were relative advantage, available resources, knowledge and beliefs, complexity, implementation climate, and other personal resources (assessed by more than ten measures). Five CFIR constructs were not addressed by any measure. Conclusions This review highlights gaps in the range of implementation constructs that are assessed by existing measures developed for use in public health and community settings. Moreover, measures with robust psychometric properties are lacking. Without rigorous tools, the factors associated with the successful implementation of innovations in these settings will remain unknown. Electronic supplementary material The online version of this article (doi:10.1186/s13012-016-0512-5) contains supplementary material, which is available to authorized users.


Background
In the field of implementation science, a considerable number of theories and frameworks are being used to better understand implementation processes and guide the development of strategies to improve the implementation of health innovations [1][2][3]. Many of these theories and frameworks, however, have not been tested empirically. As such, examining the utility of theories and frameworks has been recognised as critical to advance the field of implementation science [4].
The assessment of implementation theories and frameworks necessitates robust measures of their theoretical constructs. Psychometric properties important for measures of implementation research have been proposed [5] and include the following: reliability (internal consistency and test-retest); validity (construct and criterion); broad application (validated in different settings and cultures); and sensitivity to change (responsiveness). Tools which are acceptable, feasible, and display face and content validity are also particularly useful for researchers in real-world settings [5]. Furthermore, the psychometric characteristics of measures that assess a comprehensive range of implementation constructs have been highlighted as a particular priority area of research [4].
A number of reviews of implementation measures exist [6][7][8][9][10][11][12][13]. Such reviews indicate that the quality of existing measures of implementation constructs is limited. A review by Brennan and colleagues, for example, identified 41 instruments designed to assess factors hypothesised to influence quality improvement in primary care [6]. The review found that while most studies reported the internal consistency of instruments, very few assessed the construct validity of the measures using factor analysis [6]. Similarly, in a review of the psychometric properties of research utilisation measures used in health care, Squires and colleagues found that, of the 97 identified studies (60 unique measures), only 31 reported internal consistency and only 3 reported test-retest reliability [13]. Twenty percent of the included measures had not undergone any type of validity testing, and no studies reported on measure acceptability [13].
There are a number of limitations of previous reviews. Most do not provide comprehensive details of the psychometric properties of included measures [7,8,12] or address only a small number of constructs or outcomes relevant to implementation science [8,10]. Additionally, the majority of these reviews primarily focus on measures developed for use in healthcare settings [6,9,11,13]. Evidence from the field of psychometric research has suggested that, even when administered to similar population groups, changes in measure reliability and validity can occur when a measure developed in one setting is applied to another setting with different characteristics [14,15].
Currently, a comprehensive review of measures of implementation constructs is being conducted by the Society for Implementation Research Collaboration (SIRC) Instrument Review Project [16,17]. The SIRC review addresses some of the limitations of past reviews by extracting a range of psychometric properties from identified measures and assessing a more comprehensive range of outcomes [18] and constructs relevant to implementation science [19]. The outcomes of interest in the SIRC review are taken from Proctor and colleagues' Implementation Outcomes Framework (IOF) and focus on the appropriateness, acceptability, feasibility, adoption, penetration, cost, fidelity, and sustainability of the intervention itself [18]. The constructs of interest for the review are drawn from the Consolidated Framework for Implementation Research (CFIR), which outlines factors or conditions deemed important to support the successful implementation of an intervention [19]. The constructs are grouped under five domains which describe the following: (1) Intervention characteristics (details of the intervention itself); (2) Outer setting (factors of influence which are external to an organisation); (3) Inner setting (internal characteristics of an organisation such as culture and learning climate); (4) Characteristics of individuals (actions and behaviours of individuals within the organisation); and (5) Process (systems and pathways within an organisation) [19].
To date, the SIRC review has uncovered 420 instruments related to 34 of the CFIR constructs and 104 instruments related to Proctor and colleagues' IOF [16,17]. At present, the data are available for the measures relevant to the inner setting domain of the CFIR and the IOF [20]. However, while comprehensive, the SIRC review only pertains to measures primarily applied to healthcare or mental health care settings, where the individuals responsible for implementing health-related interventions are most likely to be healthcare professionals [16,17]. In the field of public health, the implementation of health-related interventions often occurs in non-clinical settings, with non-healthcare professionals responsible for implementing these changes. Therefore, there is a need to identify measures which have been developed specifically to measure constructs important for the implementation of health-related interventions in community settings, where the primary role of the organisations and individuals is not healthcare delivery.
To our knowledge, no previous reviews of measures of implementation constructs have focussed on instruments designed for use in a broad range of community settings. Such measures are of particular interest to public health researchers who are utilising implementation theories or frameworks to support evidence-based practice in these settings. As such, the aim of this study was to (1) systematically review the literature to identify measures of implementation constructs which have been developed in community settings; (2) describe each measure's psychometric properties; and (3) describe how the domains of each measure align with the five domains and 37 constructs of the CFIR.

Scope of this review
The focus of this review was to identify, from peer-reviewed literature, measures which have been developed for use in community-based (non-clinical) settings, and which measure constructs aligned to the CFIR. These measures were then examined to determine their psychometric properties and identify which of the CFIR constructs they captured. In this review, 'measures' are defined as surveys, questionnaires, instruments, tools, or scales which contain individual items that are answered or scored using predefined response options. 'Constructs' are defined as the broad attributes or characteristics which these items (usually grouped into domains) are attempting to capture. The constructs of interest were chosen to align with the CFIR, as this framework is the most comprehensive and draws together numerous theories which have been developed to guide the planning and evaluation of implementation research and combines them into one uniform theory with overarching domains [19].

Design
A systematic search and review was conducted to address the broad question of 'what psychometrically robust measures are currently available to assess implementation research in public health and community settings'. A comprehensive search of peer-reviewed publications was conducted using four electronic databases and the quality of identified measures was assessed using well-established, pre-defined psychometric criteria.

Eligibility
Publications were included if they (1) were peer-reviewed journal articles reporting original research results; (2) reported research from non-clinical settings; (3) reported details regarding the development of a measure; (4) described a measure which assessed at least one of the 37 CFIR constructs; (5) described a measure which was being applied to a specific innovation or intervention; and (6) used statistical methods to assess the measures' factor structure.
In this review, clinical settings included the following: hospitals, general practices, allied health facilities such as physiotherapy or dental practices, rehabilitation centres, psychiatric facilities, and any other settings where the delivery of health or mental health care was the primary focus. Non-clinical settings included schools, universities, private businesses, childcare centres, correctional facilities, and any other settings where the delivery of health or mental health care was not the primary focus. Given that an aim of the study was to map the domains of included measures against constructs within the CFIR, it was important that measures displayed a minimum level of construct validity via exploratory or confirmatory factor analysis.
Duplicate abstracts were excluded from the review, as were abstracts describing reviews, editorials, commentaries, protocols, conference abstracts, and dissertations. Publications which reported on measures developed using qualitative methods only were also ineligible.

Search strategy
A search of MEDLINE, PsycINFO, EMBASE, and CINAHL databases was conducted to identify publications describing the development of measures to assess factors relevant to the implementation of innovations. These four databases were selected as they index journals from the field of implementation science and provide extensive coverage of research across a range of public health and community settings, such as schools, pharmacies, businesses, nursing homes, sporting clubs, and childcare facilities.
Prior to the database searches being conducted, four authors met to ensure that the chosen keywords accurately captured the constructs of interest and that keywords were combined using the correct Boolean operators [21]. The core search terms comprised keywords that related to measurement, the psychometric properties of instruments, the levels at which the measurement could occur (e.g. organisational or individual), and the goals of research implementation. These keywords were as follows: [questionnaire or measure or scale or tool] AND [psychometric or reliability or validity or acceptability] AND [organisation* or institut* or service or staff or personnel] AND [implement* or change or adopt* or sustain*].
Similar to the strategy used in the SIRC review [16,17], the core search terms were combined with five more keyword searches designed to capture the constructs within each of the five CFIR domains: (1) Intervention Characteristics [strength or quality or advantage or adapt* or complex* or pack* or cost]; (2) Outer Setting [needs or barrier* or facilitate* or resource* or network or external or peer or compet* or poli* or regulation* or guideline* or incentive*]; (3) Inner Setting [structur* or communication or cultur* or value* or climate or tension or risk* or reward* or goal* or feedback or commitment or leadership or knowledge*]; (4) Characteristics of Individuals [belief* or attitude* or self-efficacy or skill* or identi* or trait* or ability* or motivat*]; or (5) Process [plan* or market or train or manager or team or champion or execut* or evaluat*].
The keyword search terms were repeated for all four databases. Keyword searches were limited to the English language; however, no limit was placed on the year of publication, as measurement tools often evolve over many years. Medical Subject Headings (MeSH) were not used in the literature search, as keyword searches have been found to have higher sensitivity, being more successful than subject searching in identifying relevant publications [22].
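The way the core terms and the domain-specific terms combine can be illustrated with a short sketch. The keyword lists below are taken from the search strategy described above; the function names, the use of parentheses, and the upper-case OR/AND operators are assumptions made for illustration, and the exact syntax accepted by each database interface may differ.

```python
# Core keyword groups from the search strategy; terms within a group are OR-ed.
CORE_GROUPS = [
    ["questionnaire", "measure", "scale", "tool"],
    ["psychometric", "reliability", "validity", "acceptability"],
    ["organisation*", "institut*", "service", "staff", "personnel"],
    ["implement*", "change", "adopt*", "sustain*"],
]

# Domain-specific keyword groups, one per CFIR domain.
DOMAIN_TERMS = {
    "Intervention Characteristics": [
        "strength", "quality", "advantage", "adapt*", "complex*", "pack*", "cost"],
    "Outer Setting": [
        "needs", "barrier*", "facilitate*", "resource*", "network", "external",
        "peer", "compet*", "poli*", "regulation*", "guideline*", "incentive*"],
    "Inner Setting": [
        "structur*", "communication", "cultur*", "value*", "climate", "tension",
        "risk*", "reward*", "goal*", "feedback", "commitment", "leadership",
        "knowledge*"],
    "Characteristics of Individuals": [
        "belief*", "attitude*", "self-efficacy", "skill*", "identi*", "trait*",
        "ability*", "motivat*"],
    "Process": [
        "plan*", "market", "train", "manager", "team", "champion", "execut*",
        "evaluat*"],
}


def or_group(terms):
    """Join the terms of one group with OR and wrap them in parentheses."""
    return "(" + " OR ".join(terms) + ")"


def build_query(domain):
    """Combine the four core groups with one domain group using AND."""
    groups = CORE_GROUPS + [DOMAIN_TERMS[domain]]
    return " AND ".join(or_group(group) for group in groups)


if __name__ == "__main__":
    for name in DOMAIN_TERMS:
        print(name, "->", build_query(name))
```

Each of the five resulting strings corresponds to one of the domain-specific searches that were repeated across the four databases.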

Identification of eligible publications
One author coded all abstracts according to the inclusion and exclusion criteria. A second author cross-checked 10 % of the abstracts to confirm they had been correctly classified. Full-text versions of publications were obtained for included abstracts. To ensure that no relevant tools had been missed, previous systematic reviews [7,8,10] were also screened for relevant measures, as were tools included on the SIRC Instrument Review Project website [20]. Copies of publications for any additional measures that met the inclusion criteria were obtained. Full-text versions of all eligible publications were then obtained and screened to identify the names and acronyms of all relevant measures they described. The reference lists of all eligible publications were also screened for any additional measures, and Google Scholar was used to conduct cited reference searches. A final literature search was conducted by 'measure name' and 'author names', using Google Scholar. This search strategy ensured that as many publications as possible were found that related to the psychometric development and validation and revalidation and cross-cultural adaptation of identified measures.

Extraction of data from eligible publications
The properties of each measure were extracted from all full-text publications relating to the development of the measure using data reported in the manuscript text, tables, or figures. Extracted data included: (1) the research setting, sample, and characteristics of the intervention or innovation being assessed; (2) psychometric properties including face and content validity, construct and criterion validity, internal consistency, test-retest reliability, responsiveness, acceptability, and feasibility; and (3) whether the measure had undergone a process of revalidation or cross-cultural adaptation.
The psychometric properties of each measure were independently assessed by two authors using the same criteria described in previous systematic reviews [23,24] and according to the guidelines for the development and use of tests, including the Standards for Educational and Psychological Testing [5,25,26]. The Standards provides a frame of reference to ensure all relevant issues are addressed when developing a measure and allows the quality of measures to be evaluated by those who wish to use them [25]. Following the assessment of psychometric properties, two authors then independently coded each publication to determine which measure domains corresponded with which CFIR constructs. When discrepancies emerged, a third author assisted in reaching consensus.

Psychometric coding

Setting, sample, and characteristics of the innovation being assessed
Details regarding the country and setting where the measure was developed, characteristics of the innovation or intervention being assessed, response rate, sample size, and demographic characteristics of the sample (gender and profession) who completed the measure were described.

Face and content validity
An instrument is said to have face validity if both the administrators and those who complete it agree that it measures what it was designed to measure [27]. To have content validity, the description of the measure's development needed to include: (1) the process by which items were selected; (2) who assessed the measure's content; and (3) what aspects of the measure were revised [14,28]. Information regarding any theories or frameworks that the measure was developed to test, as well as whether items were adapted from previously validated measures, was also extracted.

Construct and criterion validity
A measure was classified as having good internal structure (construct validity) if exploratory factor analysis (EFA) was performed with eigenvalues set at >1 [14,29] and >50 % of the variance was explained [30], or confirmatory factor analysis (CFA) was performed with a root mean square error of approximation (RMSEA) of <0.06 and a comparative fit index (CFI) of >0.95 [31,32]. The number of items and domains in the measure following factor analysis was recorded. Additional construct validity was determined by assessing whether the measure had convergent validity (correlations (r) >0.40) with similar instruments or divergent validity (correlations (r) <0.30) with dissimilar instruments [33]. Criterion validity was determined by assessing whether the measure was able to obtain different scores for subpopulations with known differences (known-groups validity) [34].
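The thresholds above can be expressed as simple checks. The following is a minimal sketch only: the function and parameter names are hypothetical, and the cut-offs are those stated in this section rather than universal standards.

```python
def efa_adequate(eigenvalue_cutoff, variance_explained_pct):
    """EFA criterion: factors retained above an eigenvalue cut-off of at least 1
    (i.e. retained eigenvalues >1) and >50 % of the variance explained."""
    return eigenvalue_cutoff >= 1.0 and variance_explained_pct > 50.0


def cfa_adequate(rmsea, cfi):
    """CFA criterion: RMSEA <0.06 and CFI >0.95."""
    return rmsea < 0.06 and cfi > 0.95


def convergent_validity(r):
    """Correlation with a similar instrument should exceed 0.40."""
    return r > 0.40


def divergent_validity(r):
    """Correlation with a dissimilar instrument should be below 0.30."""
    return r < 0.30


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    print(efa_adequate(eigenvalue_cutoff=1.0, variance_explained_pct=48.0))
    print(cfa_adequate(rmsea=0.05, cfi=0.96))
```

Under these rules, a measure whose final EFA model explains 48 % of the variance would not meet the criterion even if the eigenvalue cut-off was applied correctly.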

Internal consistency and test-retest reliability
To meet the criteria for internal consistency, correlations for a measure's subscales and total scale needed to have a Cronbach's alpha (α) of >0.70 or a Kuder-Richardson 20 (KR-20) of >0.70 for dichotomous response scales [28]. For test-retest reliability, the measure needed to have undergone a repeated administration with the same sample within 2-14 days [35]. Agreement between scores from the two administrations needed to be calculated, with item, subscale, and total scale correlations having (1) a Cohen's kappa coefficient (κ) of >0.60 for nominal or ordinal response scales [14]; (2) a Pearson correlation coefficient (r) of >0.70 for interval response scales [14,28]; or (3) an intraclass correlation coefficient (ICC) of >0.70 for interval response scales [14,28].
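For readers unfamiliar with the internal consistency statistic, the standard formula for Cronbach's alpha is α = (k / (k − 1)) × (1 − Σ item variances / variance of total scores), where k is the number of items. The sketch below computes it from raw item responses; the function names and the data layout (one list of item scores per respondent) are assumptions for illustration, and the included publications may have used different software or variance conventions.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of k item scores.

    Uses sample variances throughout. With 0/1 items, the same formula gives
    the KR-20 statistic when one variance convention is applied consistently.
    """
    k = len(responses[0])
    if k < 2 or len(responses) < 2:
        raise ValueError("need at least two items and two respondents")
    items = list(zip(*responses))  # one tuple of scores per item
    sum_item_variances = sum(variance(item) for item in items)
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum_item_variances / total_variance)


def meets_internal_consistency(alpha, threshold=0.70):
    """Criterion used in this review: alpha (or KR-20) above 0.70."""
    return alpha > threshold


if __name__ == "__main__":
    # Hypothetical responses from three people to a two-item subscale.
    data = [[1, 2], [2, 1], [3, 3]]
    alpha = cronbach_alpha(data)
    print(round(alpha, 3), meets_internal_consistency(alpha))
```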

Responsiveness, acceptability, feasibility, revalidation, and cross-cultural adaptation
A measure's potential to detect change over time was confirmed if it could show a moderate effect size (>0.5) for a given change [14,28,36], and if it had minimal floor and ceiling effects (less than 5 % of the sample achieved the highest or lowest scores) [37]. To determine acceptability and feasibility (burden associated with using the measure), data on the following were extracted: proportion of missing items, time needed to complete, and time needed to interpret and score [28]. Data from publications reporting the revalidation of a measure with additional samples, or in different languages or cultures, were also extracted [28].
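The responsiveness criteria can be sketched as follows. The review does not state which effect size statistic was required, so the standardised effect size used here (mean change divided by the baseline standard deviation) is an assumption, as are the function names and the example data.

```python
from statistics import mean, stdev


def standardised_effect_size(baseline, follow_up):
    """Mean change divided by the baseline standard deviation (assumed formula)."""
    return (mean(follow_up) - mean(baseline)) / stdev(baseline)


def floor_ceiling_ok(scores, min_score, max_score, limit=0.05):
    """True when fewer than 5 % of respondents score at the floor and
    fewer than 5 % score at the ceiling of the scale."""
    n = len(scores)
    at_floor = sum(1 for s in scores if s == min_score) / n
    at_ceiling = sum(1 for s in scores if s == max_score) / n
    return at_floor < limit and at_ceiling < limit


def responsive(baseline, follow_up, min_score, max_score):
    """Both criteria from this section: effect size >0.5 and minimal
    floor and ceiling effects (checked here on the baseline scores)."""
    effect = abs(standardised_effect_size(baseline, follow_up))
    return effect > 0.5 and floor_ceiling_ok(baseline, min_score, max_score)


if __name__ == "__main__":
    # Hypothetical scores on a 1-5 scale.
    before = [3] * 98 + [1, 5]
    after = [4] * 98 + [2, 5]
    print(responsive(before, after, min_score=1, max_score=5))
```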

CFIR coding
The domains of each included measure were assessed to determine whether the factors they measured corresponded with one or more of the 37 CFIR constructs [19]. A brief summary of each of the CFIR constructs is presented in Additional file 1. The mapping process was domain-focused (i.e. mapping the overall measure domains to constructs) rather than item-focused (i.e. mapping individual items to constructs) to ensure that the overall construct was well captured. Within a measure, only one domain needed to be judged by the reviewers to address a CFIR construct. Therefore, it was possible that a measure with five domains might only have one of its domains mapped to a CFIR construct. Similarly, a measure with three domains might have all contributing to the same CFIR construct. In the latter scenario, the construct was only counted once.
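The counting rule described above (a construct is counted once per measure, regardless of how many of that measure's domains map to it) can be illustrated with a short sketch. The measure names, domain names, and mappings below are hypothetical and do not come from the included publications.

```python
from collections import Counter


def constructs_covered(domain_mapping):
    """Return the set of CFIR constructs addressed by one measure.

    domain_mapping maps each domain name of the measure to a list of the
    CFIR constructs the reviewers judged it to address (possibly empty).
    Using a set means each construct is counted once per measure.
    """
    covered = set()
    for constructs in domain_mapping.values():
        covered.update(constructs)
    return covered


def measures_per_construct(all_measures):
    """Count, for each CFIR construct, how many measures address it."""
    counts = Counter()
    for domain_mapping in all_measures.values():
        counts.update(constructs_covered(domain_mapping))
    return counts


if __name__ == "__main__":
    # Hypothetical example: measure A has three domains on the same construct,
    # measure B has five domains of which only one maps to a construct.
    example = {
        "Measure A": {"d1": ["culture"], "d2": ["culture"], "d3": ["culture"]},
        "Measure B": {"d1": ["complexity"], "d2": [], "d3": [], "d4": [], "d5": []},
    }
    print(measures_per_construct(example))
```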

Analysis
Descriptive statistics (frequencies and proportions) were used to report the number of domains from the included measures which were mapped to each of the CFIR constructs and CFIR domains. Frequencies and proportions were also used to describe the number of measures which met various psychometric criteria.

Identified measures of implementation constructs
The initial searches of MEDLINE, PsycINFO, EMBASE, and CINAHL identified 8547 potentially relevant publications. Of these, 5195 were duplicates, leaving 3352 publication abstracts to be coded. Of these 3352 publications, 3317 did not meet the inclusion criteria (see Fig. 1 for PRISMA diagram), leaving 35 eligible publications. The process of identifying measures included in systematic reviews related to the current review [7,8,10], and a secondary literature search by measure or author name, led to the inclusion of an additional 30 publications. A total of 65 full-text publications were retained which described 51 unique measures.
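The publication flow reported above can be checked with simple arithmetic. The sketch below uses only the counts given in the text; the function and key names are hypothetical.

```python
def publication_flow(identified, duplicates, excluded, additional):
    """Reproduce the screening arithmetic for a PRISMA-style flow."""
    screened = identified - duplicates          # abstracts coded
    eligible = screened - excluded              # eligible from database search
    retained = eligible + additional            # after secondary searches
    return {"screened": screened, "eligible": eligible, "retained": retained}


if __name__ == "__main__":
    flow = publication_flow(identified=8547, duplicates=5195,
                            excluded=3317, additional=30)
    print(flow)
```

The three derived values (3352 abstracts screened, 35 eligible publications, 65 publications retained) agree with the figures reported in the text.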

Face and content validity
Almost all measures (n = 47) had undergone a process of face and content validation. The development of 36 measures was guided by an existing theory or framework (Additional file 2). No measures were specifically designed to address all constructs considered important for the implementation of innovations by the CFIR. Twenty-six measures had adapted at least some of their items from pre-existing instruments (Additional file 2).

Mapping of measure domains that align with the 37 constructs of the CFIR
The number of measure domains that mapped onto the CFIR constructs ranged from 1 to 19. Relative advantage, networks and communications, culture, implementation climate, learning climate, readiness for implementation, available resources, and reflecting and evaluating were the constructs most frequently addressed by the included measures. Five of the CFIR constructs were not addressed by any measure (Additional file 6). These five constructs were as follows: intervention source, tension for change, engaging, opinion leaders, and champions.

Discussion
To our knowledge, this is the first systematic review to describe the psychometric properties of measures developed to assess innovations and implementation constructs specifically in public health and community settings. Overall, the psychometric properties of included measures were typically inadequately assessed or not reported. No single measure reported on all key psychometric quality indicators. The majority of studies assessed face, content, and construct validity, as well as internal consistency. However, criterion validity (known-groups), test-retest reliability, and acceptability and feasibility were rarely reported. Only seven measures had responsiveness to change assessed. These findings mirror those of previous reviews [7,13] that found that few measures demonstrated test-retest reliability, acceptability, or criterion validity.
When measures did report psychometric data, the values were typically below the widely accepted thresholds defined in this review. Almost half of the measures that reported undertaking EFA reported that their final factor model explained <50 % of the variance. Furthermore, none of the measures that used CFA alone reported satisfactory RMSEA (<0.06) or CFI (>0.95). This suggests that a notable proportion of the implementation measures currently available for use in non-clinical settings are not particularly robust or are prone to model misspecification. That only eight of the 51 measures explored criterion validity using known-groups is also concerning. The lack of attention to known-groups validity limits the confidence we can place in these measures being able to detect how groups within community settings (e.g. experienced teachers vs. new teachers) vary with regard to implementation of innovation. This is important for identifying which aspects of an intervention or innovation might need to be adjusted to ensure more robust implementation in the future.
Internal consistency was frequently reported, but only 40 % of measures reported that all scale domains had a Cronbach's alpha >0.70, highlighting a need for further refinement of scale items and revalidation. Only three measures assessed test-retest reliability, another area requiring much greater attention in future studies. Those studies that did assess test-retest reliability performed well, meeting the vast majority of threshold criteria. However, the stability of these types of measures over time remains unclear. Acceptability and feasibility data were reported for just 33 % of the measures. Mean completion time for measures was almost 35 min. Although shorter questionnaires have been shown to improve response rates [113], it is unclear what the optimal survey length is while still maintaining the survey validity. Rates of missing data ranged from 1.5 to <5 %, which according to Schafer [114] is acceptable given missing data rates of less than 5 % are likely to be inconsequential. Only 25 % of measures had been revalidated or validated in a different culture. This limits the generalisability of the measures and poses a significant barrier to research translation within potentially underserved communities or cultures [115].
Without more comprehensive assessment of the psychometric properties of these instruments, the ability to ascertain the utility of theories or frameworks to support the implementation of innovations in public health and community settings is limited. For example, understanding the responsiveness of measures is essential for evaluating implementation interventions and ensuring that changes in constructs over time can be detected [116,117]. Having measures which are acceptable and feasible is also important to the conduct of rigorous research, particularly in more pragmatic research studies [5,18]. Low survey response rates or high rates of attrition due to onerous research methods can introduce bias and compromise study internal and external validity [118,119].

Alignment of measure domains with constructs of the CFIR
While some of the CFIR constructs were addressed by domains from multiple measures in this study, five constructs were not assessed by any measure. These were intervention source, tension for change, engaging, opinion leaders, and champions. The development of psychometrically robust measures which can assess these constructs in public health and community settings may be a priority area of research for the field.
The most frequently addressed constructs appeared to fall within the 'inner setting' and 'characteristics of individuals' domains, suggesting that the focus of measures to date has been on understanding only the immediate environment where the innovation or intervention will be implemented. It appeared that measures addressing 'outer setting' or 'process' constructs were less frequently observed than other domains. The development of future measures should target these domains of the CFIR to ensure a greater breadth and depth of understanding of all factors which may influence the implementation of evidence into practice in public health and community settings.

Comparison of the current review with the SIRC Instrument Review Project
Despite the similarity in review methodologies utilised by the current review and that undertaken by SIRC [16], few measures have been reported by both reviews. This is not surprising, as although the SIRC review captured some measures developed in education or workplace settings, other public health and community settings were not addressed. Furthermore, the SIRC review used much broader inclusion criteria with regard to measures of CFIR constructs. For example, for the construct of 'self-efficacy', the SIRC review includes all measures of self-efficacy, regardless of the context in which self-efficacy is being examined. In contrast, the current review only includes measures which assess self-efficacy in the context of an individual's perceived ability to implement the target innovation.
Despite these differences, the use of a common framework (CFIR) for examining constructs captured by different measures in the current review promotes consistency and complements the findings of the SIRC review.

Limitations
It is possible that not all existing implementation measures in public health and community settings were captured by this review. The keywords used to identify measures were limited to 'questionnaire', 'measure', 'scale', or 'tool', and other possible terms such as 'instrument' and 'test' were not used. These terms were excluded due to the likelihood of identifying non-relevant publications related to clinical practice (e.g. surgical instruments, immunologic tests). However, the exclusion of these keywords may have meant that some relevant publications were not identified during the database search. Additionally, the review did not assess measures published in the grey literature and only studies published in English were included. However, it is likely that those measures which were identified represent the best available evidence, given their publication in peer-reviewed journals and indexing in four scientific databases. The psychometric properties that were chosen to be extracted from publications about each measure may have also limited the findings. For example, for studies that utilised CFA, only data pertaining to the RMSEA and CFI were recorded based on recommendations by Schmitt [32]. Included publications may have reported additional CFA metrics (such as the goodness-of-fit index (GFI) or the normed fit index (NFI)); however, they were not included in this review.
Despite these limitations, the findings from this review are likely to be of value to public health researchers who are looking to identify measures with robust psychometric properties that can be used to assess implementation constructs. There are, however, a small number of constructs for which no measure could be identified. Developing measures which can assess these five remaining constructs will be an important consideration for future research.

Conclusion
Existing measures of implementation constructs for use in public health and community settings require additional