We find some statistical support, in terms of reliability and factor analyses, for aggregation of survey items and subscales into three scales of organizational readiness-to-change based on the core elements of the PARIHS framework: evidence, context and facilitation. Reliability statistics met conventional thresholds for the majority of subscales, indicating that the subscales intended to measure the individual components of the main elements of the framework (e.g., the six components of the context scale) held together reasonably well. Exploratory factor analysis applied to the aggregated subscale scores supports three underlying factors, with the majority of subscale scores clustered corresponding to the core elements of the PARIHS framework.
However, three findings may indicate concerns and suggest need for further revision to the instrument and further research on its reliability and validity: (1) reliability was poor for the three evidence subscales; (2) the subscales measuring clinical champion (as part of the facilitation scale), and availability of general resources (as part of the context scale) failed to load significantly on any factor; and (3) the leadership practices subscale loaded on the second factor with most of the context subscales. We discuss each of these in turn.
Reliability of evidence subscales
Reliability, as measured by Cronbach's alpha, was mediocre for the evidence scale and the three constituent subscales. Poor reliability could be a function of too few items (alpha coefficients are highly sensitive to the number of items in a scale); could indicate that the items are deficient measures of the evidence construct; or could signal that the subscales are not uni-dimensional, i.e., they reflect multiple underlying factors with none measured reliably or well.
There is some evidence for the latter given the observed improvement in reliability statistics after dropping three items: q3d and q3e from the research evidence subscale, and q4d from the practice experience subscale. These items had some important conceptual differences from other items in their respective subscales. Both q3d and q3e are about anticipating the effect of the practice change on patient outcomes, whereas the other items in the subscale (q3a – q3c) are about the scientific evidence for the practice change. The former require respondents to make a prediction about a future state, not just an assessment of a current one (i.e., the state of the research evidence). Item q4d, on the other hand, is about whether the practice change has previously been attempted in the respondent's clinical setting, which was unlikely given the context was quality improvement projects introducing new practices. However, factor analysis generally supported a common factor solution for the three subscales, supporting the hypothesis that the subscales may tap into a common latent variable. This question would benefit from more conceptual as well as empirical work.
The patient preferences subscale requires further consideration; how it fits with the model and with the survey remains, in our view, an open question. It had high uniqueness, indicating that the majority of variance in the items was not accounted for by the three factors. Furthermore, past research appears to conflict with the contention that patient preferences or experiences have significant influence on how favorably clinicians evaluate a given practice or course of treatment. For example, some research concludes there is little or no correlation between patient preferences and what clinicians do [32, 33], and even after interventions to increase shared decision making (a practice intended to better incorporate patient preferences into health care practice), the actual effects on clinical choices appear limited, even though providers and patients may perceive greater participation. Patient preference should be a major driver of implementation of evidence-based practices, but we suspect that in our current health care system it is generally not. It remains unclear what this means for assessing patient preferences as a distinct component of organizational readiness to change, but additional exploratory research would seem to be in order.
It is also important to note that Cronbach's alpha findings do not mean that the evidence scale is invalid. The item-level results from the item-rest correlations suggested the evidence subscales had strong reliability, and the subscale-level principal factors analysis suggested a common, latent factor structure. Other researchers have demonstrated that Cronbach's alpha is not a measure of uni-dimensionality; it is possible to obtain a high alpha coefficient from a multidimensional scale, i.e., from a scale representing multiple constructs, and conversely to obtain a low alpha coefficient from a uni-dimensional scale. Overall, the scale reliability findings for the evidence scale primarily suggest caution in interpreting the aggregated scale and that further study is warranted.
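The two properties of Cronbach's alpha discussed above (its sensitivity to the number of items, and its independence from dimensionality) can be illustrated with the standardized form of the coefficient, alpha = k·r / (1 + (k − 1)·r), where k is the number of items and r is the mean inter-item correlation. The following is a minimal sketch only; the item counts and correlations are hypothetical values chosen for illustration and are not ORCA results.

```python
def standardized_alpha(k, mean_r):
    """Standardized Cronbach's alpha for k items with mean inter-item correlation mean_r."""
    return k * mean_r / (1 + (k - 1) * mean_r)


def mean_r_two_clusters(items_per_cluster, within_r, between_r):
    """Mean off-diagonal correlation for a scale made of two equal-sized item clusters.

    Items correlate at within_r inside a cluster and at between_r across clusters.
    """
    n = items_per_cluster
    within_pairs = 2 * (n * (n - 1) // 2)  # item pairs inside either cluster
    between_pairs = n * n                  # item pairs spanning the two clusters
    total = within_r * within_pairs + between_r * between_pairs
    return total / (within_pairs + between_pairs)


# Hypothetical uni-dimensional scale: same mean correlation, different lengths.
alpha_3_items = standardized_alpha(3, 0.30)    # about 0.56
alpha_12_items = standardized_alpha(12, 0.30)  # about 0.84

# Hypothetical two-construct scale: 10 items in two unrelated clusters of 5.
r_bar = mean_r_two_clusters(5, within_r=0.60, between_r=0.0)
alpha_two_constructs = standardized_alpha(10, r_bar)  # about 0.78

print(round(alpha_3_items, 2), round(alpha_12_items, 2), round(alpha_two_constructs, 2))
```

In this sketch, a short uni-dimensional scale falls below the conventional 0.70 threshold, whereas a longer scale made up of two unrelated item clusters exceeds it, which is why alpha alone cannot settle the question of dimensionality.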
As noted in the background, the ORCA omits a subscale for routine information, which was added to the framework beginning in 2004, and that could affect reliability for the overall evidence scale. However, this omission would not account for the weak reliability of the other subscales. Moreover, conceptually, routine information would appear more congruent with the context element. Routine information addresses the existence and use of data gathering and reporting systems, which are a function of the place where the evidence-based practice or technology is being implemented rather than a characteristic of the evidence-based practice itself or how it is perceived by users. In contrast, the other evidence subscales are dimensions of the perceived strength of the evidence, e.g., the strength of the research evidence; how well the new practice fits with past clinical experience. The meaning of a routine information subscale, as a dimension for evaluating the strength of the evidence, requires further consideration.
Two subscales with low factor loadings
Two subscales failed to load significantly on any of the three factors: One measured dimensions of facilitation related to the clinical champion role, the other measured dimensions of context related to the availability of general resources, such as facilities and staffing. There are at least two ways to interpret this finding, with different attendant implications.
First, the failure of the two subscales to load on any of the three factors may indicate that overall availability of resources and clinical champion roles are functions of unique factors, distinct from evidence, context and facilitation (at least as framed in this instrument). Empirically and conceptually, we believe this may be the case for the general resource availability, but not for the clinical champion role.
In the case of general resource availability, the subscale had high uniqueness, indicating that a majority of variance of the items was not accounted for by any of the three factors. Conceptually, this subscale was not part of the original PARIHS framework; it was added to the ORCA based on other organizational research supporting the powerful influence of resource availability as an initial state that often sets boundaries in planning and execution. Although this seems to fit logically within the domain of the context scale, general resources may be a function of factors at other levels. This is consistent with the observed subscale scores, which were lowest for the general resources subscale across the three study samples. General resource availability may be less a function of the organization (in this case individual VHA facilities), and more a function of the broader resource environment in the VHA, or in the US health care system generally. The period covered in these three quality improvement projects has been one of high demand on Veterans Health Administration services, and cost containment was (and continues to be) a major and pervasive issue in healthcare. We still believe that resource availability is an important factor in the determination of organizational readiness to change. However, it may be distinct from the three factors hypothesized in the PARIHS model, appearing different from the other dimensions of context. We propose that additional conceptual work is needed on this subscale and that more items are likely needed to reliably measure it.
Second, the distinctiveness of the two subscales may indicate measurement error. General resource availability and clinical champion role might be appropriately understood as distinct reflections of the favorability of the context in the organization. However, the items, and their component subscales, may simply be inaccurate measures of the latent variables, or the number of observations in this analysis may have been insufficient for a stable estimate of the factors. We believe the latter is the case for the clinical champion subscale, which had a relatively low uniqueness value (0.34), and relatively high factor loading (0.49). Although the factor loading did not meet the threshold (0.60), we set an unusually high threshold for this analysis because the relatively small number of observations needed to be balanced with high factor loadings in order to achieve stable estimates. We expect that repeating the analysis with a larger sample will confirm that the clinical champion subscale loads onto the same factor as the other facilitation subscales.
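The relationship between uniqueness and factor loadings referred to above can be made explicit with a short calculation. The sketch below assumes an orthogonal factor solution, in which a subscale's communality is the sum of its squared loadings and its uniqueness is one minus the communality; that assumption, and the function names, are illustrative and not taken from the analysis itself. Only the two values reported above for the clinical champion subscale (uniqueness 0.34, loading 0.49) are used as inputs.

```python
def communality(loadings):
    """Share of a variable's variance explained by the factors (orthogonal solution assumed)."""
    return sum(loading ** 2 for loading in loadings)


def uniqueness(loadings):
    """Share of a variable's variance not explained by any factor."""
    return 1 - communality(loadings)


# Values reported in the text for the clinical champion subscale.
reported_uniqueness = 0.34
reported_loading = 0.49

implied_communality = 1 - reported_uniqueness    # variance explained by all factors together
from_reported_loading = reported_loading ** 2    # variance explained via the 0.49 loading
from_other_factors = implied_communality - from_reported_loading

print(round(implied_communality, 2), round(from_reported_loading, 2), round(from_other_factors, 2))
```

Under these assumptions, roughly two thirds of the subscale's variance is explained by the three factors together, but only about a quarter of it through the 0.49 loading; the remainder would be spread across the other factors, a pattern consistent with an unstable estimate rather than with a genuinely distinct construct.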
The leadership practices subscale loaded on the context factor
The subscale measuring leaders' practices (from the facilitation scale) loaded on the second factor with context subscales. The leaders' practice subscale addressed whether senior leaders or clinical managers propose an appropriate, feasible project; provide clear goals; establish a project schedule; and designate a clinical champion. The high loading on the second factor could indicate that the leaders' practices subscale is properly understood as part of context, or it could signal poor discriminant validity between the context and facilitation scales. However, in this case, we believe the overlap may be a function of measurement error related to item wording. Two of the items refer to "a project," which may have put the respondent in mind of a generic change more consonant with the questions in the context scale, whereas many of the facilitation items in the subsequent subscales refer to "this project" or "the intervention," implying the specific implementation project named in the question stem from the opening of the survey.
We believe that this unintended discrepancy in the pattern of wording cued respondents to answer the leader practices questions in a different frame of mind, conceiving of them in terms of projects in general rather than their estimate of leadership practices in the project they were actively engaged in. This will be a revision to explore in future use of the survey.
Another question readers should bear in mind is whether readiness to change is best understood as a formative scale or a reflective scale. Principal factors analysis assumes that the individual items are reflective of common, latent variables (or factors) that cause the item responses [38, 39]; when a scale is reflective, it corresponds to a given latent variable. However, organizational readiness to change may be more aptly understood as a formative scale, meaning that the constituent pieces (items or subscales) are the determinants and the latent variable organizational readiness to change is the intermediate outcome. In a reflective scale, the constituent parts are necessarily correlated (see Howell et al 2007 for a comparison of the mathematical assumptions underlying formative and reflective scales). For example, a scale meant to measure native athletic ability should register high correlations among constituent components meant to assess speed, strength, and agility; i.e., the physiological factors that determine speed are also thought to determine strength and agility, and therefore a person scoring "high" on one component should score relatively high on the others. Conversely, a scale meant to measure how good a baseball player is might assess their throwing, fielding, and batting to create a composite score. Throwing, fielding and batting may often be related, being in part a function of native athletic ability, but they are also a function of specific training activities and experience, and skill developed in one does not necessarily carry over to the others. Rigorous training in pitching will not make you a good batter. For the purposes of the present analyses, we assumed that the ORCA is a reflective scale; the factor analysis appears to support that conclusion.
However, the domains covered are quite diverse, and it seems appropriate to further explore the question of whether organizational readiness to change should properly be understood as a formative or a reflective scale.
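The distinction between reflective and formative scales can be sketched with a small simulation. This is an illustration under stated assumptions only: the data are synthetic, the three indicators per scale and the unit-variance noise are arbitrary choices, and the formative components are made fully independent as a limiting case (in practice they may be partly correlated, as in the baseball example). None of the names or values below come from the ORCA data.

```python
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def mean_pairwise_r(columns):
    """Average correlation over all distinct pairs of columns."""
    pairs = [(i, j) for i in range(len(columns)) for j in range(i + 1, len(columns))]
    return sum(pearson(columns[i], columns[j]) for i, j in pairs) / len(pairs)


def simulate(n=5000, seed=0):
    rng = random.Random(seed)
    # Reflective: a single latent variable causes every indicator (plus noise).
    latent = [rng.gauss(0, 1) for _ in range(n)]
    reflective = [[value + rng.gauss(0, 1) for value in latent] for _ in range(3)]
    # Formative: independent components jointly determine the composite score.
    formative = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(3)]
    composite = [sum(values) for values in zip(*formative)]
    return reflective, formative, composite


reflective, formative, composite = simulate()
print("reflective indicators, mean r:", round(mean_pairwise_r(reflective), 2))
print("formative components, mean r:", round(mean_pairwise_r(formative), 2))
print("component vs composite, r:", round(pearson(formative[0], composite), 2))
```

In the reflective case the indicators are substantially intercorrelated because they share a cause; in the formative case the components are essentially uncorrelated with one another yet each still relates to the composite they define. Factor-analytic and internal-consistency statistics are informative only in the first case.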
There are five major limitations to our work. First, this analysis does not address the validity of the instrument as a predictor of evidence-based clinical practice, or even as a correlate of theoretically relevant covariates, such as implementation activities. Our objective with the present analysis was confined to assessing correlations among items within respondents to determine if the items cluster into scales and subscales as predicted. Criterion validation using implementation and quality-of-care outcomes is the next phase of our work.
Second, this study relied on secondary data from quality improvement projects, which did not employ some standard practices for survey development intended to mitigate threats to internal validity. We note two specific examples. First, the items were organized according to the predicted scales and subscales, rather than being presented to respondents in a random order. Item ordering can influence item scoring, and introduces the danger that reliability statistics may be inflated because items were organized according to the predicted subscales. However, this is not an uncommon practice in health services research survey instruments. Second, two of the quality improvement projects (Cardiac Care Initiative, and the intensive care unit quality improvement project) entailed multiple evidence-based practice changes, each of which could conceivably elicit different responses in terms of evidence, context and facilitation. The surveys assessed these practice changes as a whole, and therefore may have introduced measurement error to the extent that respondents perceived evidence, context and facilitation differently for different components. However, the danger here is less significant than for the item ordering, as the measurement error would tend to inflate item variance within scales, and therefore bias results towards the null (i.e., toward an undifferentiated mass of items rather than distinct scales), which we did not observe.
Third, the survey instrument is somewhat long (77 items), and may need to be shorter to be most useful. Despite the length, we note that most respondents are able to complete the survey in about 15 minutes, and this instrument is shorter than organizational readiness instruments used in other sectors, such as business and IT. Moreover, any item reduction needs to consider the threat to content validity posed by potentially failing to measure an essential content domain. The research presented included only preliminary item reduction based on scale reliability. Although scale reliability statistics often serve as a basis for excluding items, we believe that item reduction is best done as a function of criterion validation, i.e., that items are retained as a function of how much variance they account for in some theoretically meaningful outcome, and content validity, i.e., consideration of the theoretical domains the instrument is purported to measure. We regard this as a priority for the next stage of research.
Fourth, the sample size was small (80) relative to the number of survey items (77). This led us to factor analyze the aggregated subscales rather than the constituent items. This assumed that the subscales were unidimensional. While Cronbach's alpha findings generally supported the reliability of the subscales, high average correlations can still occur among items that reflect multiple factors, and high reliability is no guarantee that the subscales were unidimensional. This limitation will be corrected with time when additional data become available and the analysis can be repeated with a larger sample.
Fifth, the ORCA was fielded a single time in each project, which leaves unanswered questions both about the proper timing of the assessment and how variable subscales and scales are over time. In terms of timing, in the Lipids Clinical Reminders project and the intensive care unit quality improvement project, the instrument was fielded before any work related to the specific change was undertaken. In the case of the Cardiac Care Initiative, some work had already begun at some sites. It is possible that administering the instrument at more than one time point might yield different factor structures.
Other limitations include questions of external validity, for example, in terms of the setting in the VHA and these particular evidence-based practices; and questions of internal validity, in terms of the sensitivity of the measures to changes in wording or format. These limitations are all important topics for future research on the instrument.