Assessment of the quality of recommendations from 161 clinical practice guidelines using the Appraisal of Guidelines for Research and Evaluation–Recommendations Excellence (AGREE-REX) instrument shows there is room for improvement

Objective: To assess the quality of recommendations from 161 clinical practice guidelines (CPGs) using the AGREE-REX-D (Appraisal of Guidelines REsearch and Evaluation-Recommendations Excellence Draft).
Design: Cross-sectional study.
Setting: International CPG community.
Participants: Three hundred twenty-two international CPG developers, users, and researchers.
Intervention: Participants were assigned to appraise one of 161 CPGs selected for the study using the AGREE-REX-D tool.
Main outcome measures: AGREE-REX-D scores of 161 CPGs (7-point scale, maximum 7).
Results: Recommendations from 161 CPGs were appraised by 322 participants using the AGREE-REX-D. CPGs were developed by 67 different organizations. The total overall average score of the CPG recommendations was 4.23 (standard deviation (SD) = 1.14). AGREE-REX-D items that scored the highest were (mean; SD): evidence (5.51; 1.14), clinical relevance (5.95; 0.8), and patients/population relevance (4.87; 1.33), while the lowest scores were observed for the policy values (3.44; 1.53), local applicability (3.56; 1.47), and resources, tools, and capacity (3.49; 1.44) items. CPGs developed by government-supported organizations and CPGs developed in the UK and Canada had significantly higher recommendation quality scores with the AGREE-REX-D tool (p < 0.05) than their comparators.
Conclusions: We found that there is significant room for improvement in some elements of CPGs, such as the consideration of patient/population values, policy values, local applicability, and resources, tools, and capacity. These findings may be considered a baseline upon which to measure future improvements in the quality of CPGs.


Introduction
Clinical practice guidelines (CPGs) are systematically developed statements informed by a systematic review of evidence and an assessment of the benefits and harms of alternative care options with the aim of optimizing patient care [1][2][3]. However, concerns about variation in the quality of CPGs and their resultant recommendations exist in the literature [1, 3, 4]. The AGREE II is an established instrument, used internationally, to evaluate the overall methodological quality of CPGs and to serve as a methodological blueprint to inform CPG development and reporting [5][6][7]. The AGREE II focuses on the entire CPG development process. As its complement, the AGREE-REX (Appraisal of Guidelines REsearch and Evaluation-Recommendations EXcellence) was designed to focus specifically on CPG recommendations and the justifications that underpin them [8]. It was developed in response to data demonstrating that high-quality CPG processes, although necessary, are not always sufficient to yield individual CPG recommendations that are clinically credible and implementable [9, 10].
The prototype of the AGREE-REX (the AGREE-REX-D) and the AGREE II were applied to 161 guidelines. In this article, we present the results of this assessment, identify areas for CPG recommendation improvement, and compare the evaluative information garnered by both tools.

Materials and methods
This study represents a component of a larger program of research designed to create the AGREE-REX version 1 (AGREE-REX-D); the technical components of this program of research are reported elsewhere [8]. The main program was a mixed-methods project designed to create the AGREE-REX tool, and this manuscript presents the cross-sectional study that summarizes the assessment of the selected CPGs during the development of the AGREE-REX-D. This study received ethics approval from the Hamilton Integrated Research Ethics Board (project #13-700).

Participants
Participants included CPG developers, clinicians, implementers, and other users. They were purposefully recruited through a variety of channels including social media and CPG organizations, such as the Guidelines International Network (G-I-N), G-I-N North America regional community, Knowledge Translation (KT) Canada, Canadian Agency for Drugs and Technologies in Health (CADTH), Canadian Partnership Against Cancer, and Cancer Care Ontario, as well as investigators known in the CPG research community. The study was also advertised on the AGREE social media accounts (Facebook and Twitter), and registered users of My AGREE PLUS (online platform for appraising CPGs with the AGREE II tool, www.agreetrust.org) were invited to participate.

Contributions to the literature
- We applied the AGREE II and the recently developed AGREE-REX draft tool to assess the quality, credibility, and implementability of 161 international clinical practice guidelines (CPGs). The AGREE-REX draft tool was applied by 322 guideline developers, users, and researchers from 51 countries.
- The scores of the AGREE-REX draft tool were higher for the items related to the quality of the evidence and the clinical relevance. The items related to patients and population relevance and implementation relevance scored in the mid-range, while the items related to patients/population or policy values, the alignment of values, the local applicability, and the resources, tools, and capacity scored low.
- CPGs produced by government-supported organizations scored higher on all the items of the AGREE-REX draft tool than those produced by professional societies or other types of groups, and CPGs produced in the UK and Canada scored higher on selected items in comparison to USA and international CPGs.
- The correlations between the overall AGREE-REX draft tool and AGREE II domains were low, except for the applicability domain, where the correlation was modest.

CPGs
CPGs in multiple clinical specialty areas were collected from the Agency for Healthcare Research and Quality (AHRQ) National Guideline Clearinghouse database [11]. Using the database's advanced search function, we identified CPGs that were (1) published between 2013 and 2015, (2) written in English, and (3) no more than 50 pages in length for the CPG core document. The resulting list of CPGs was reviewed and the following were excluded: guidelines addressing organizational rather than clinical topics; technology assessments; CPGs not freely available to the public; and CPGs for which the link in the database was not functional. Descriptive information was extracted from each CPG, including type of authoring organization (government supported vs. professional society vs. other/not clear), disease topic (cancer vs. non-cancer), and country of authoring group (USA, UK, Canada, or international).

Procedure
Participants received individualized password-protected access to the study materials, which included links to a downloadable PDF of the AGREE-REX-D, the CPG to which they were randomly assigned, and the online survey platform (LimeSurvey) to record their scores. Participants were asked to review the AGREE-REX-D manual and items, read the CPG, and then evaluate it by applying the tool and recording their item ratings in LimeSurvey. Participants received no formal training on or orientation to the tool from members of the team. The AGREE-REX-D manual provided definitions of the items and instructions on how to assess and score them. An email reminder was sent 2 weeks after the participant's initial start date informing them that their deadline was in 1 week. Deadline extensions were given when requested. Evaluations were completed between May 2016 and March 2017. Participants were offered a $50 CAD pre-paid virtual gift card for completing the study. All communication with participants was handled by the staff of the AGREE Scientific Office.

AGREE-REX-D scores
The prototype of AGREE-REX-D comprised 11 items within 4 themes (Table 1). Each item was rated using a 7-point scale applied to two quality attributes, with higher scores reflecting higher quality. The two attributes were (1) the extent to which quality features were documented in the CPG and (2) the extent to which quality features were considered in formulating the recommendations.
The instrument concludes with two general quality assessments: overall credibility and overall implementability of the CPG recommendations.

AGREE II evaluations
For exploratory purposes, the CPGs were also assessed, independently, using the AGREE II by two members of the AGREE Scientific team. The AGREE II includes 23 items within 6 domains and 2 overall assessments [5]. The 23 items are assessed with a 7-point scale (1 = strongly disagree; 7 = strongly agree), with high scores reflecting more favorable quality results. Discrepancies in scoring were resolved by consensus when required.

Scoring
For each CPG, an AGREE-REX-D item score was derived for each of the 11 items by averaging scores on the 7-point scale between the two raters. A mean overall AGREE-REX-D score was calculated for each CPG by averaging across the 11 items. Finally, mean scores for overall credibility and overall implementability items were derived by averaging scores between the two raters.
AGREE II tool mean domain scores were derived by summing the scores across the two appraisers and standardizing them as a percentage of the maximum possible score a CPG could achieve for that domain [5]. Before these scores were summed and calculated, the independent appraisers were required to reach a consensus on any AGREE II item scores that were two or more points apart on the 7-point scale.
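The scoring rules above can be sketched in a few lines of Python. This is only an illustration, not the study's analysis code: the function names are hypothetical, the ratings are made-up values rather than study data, and the AGREE II scaling is assumed to follow the formula in the AGREE II user manual, (obtained score − minimum possible score) / (maximum possible score − minimum possible score).

```python
from statistics import mean


def rex_item_scores(rater_a, rater_b):
    """Average each item's 1-7 rating between the two raters of one CPG."""
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must score the same items")
    return [(a + b) / 2 for a, b in zip(rater_a, rater_b)]


def rex_overall_score(item_scores):
    """Mean overall score for one CPG: the average across its item scores."""
    return mean(item_scores)


def needs_consensus(rater_a, rater_b, threshold=2):
    """Indices of items whose two ratings differ by `threshold` points or more."""
    return [i for i, (a, b) in enumerate(zip(rater_a, rater_b))
            if abs(a - b) >= threshold]


def agree2_domain_score(ratings_by_appraiser, scale_min=1, scale_max=7):
    """Scaled AGREE II domain score (0-100).

    Assumption: uses the AGREE II manual scaling,
    (obtained - min possible) / (max possible - min possible) * 100.
    """
    n_appraisers = len(ratings_by_appraiser)
    n_items = len(ratings_by_appraiser[0])
    obtained = sum(sum(ratings) for ratings in ratings_by_appraiser)
    min_possible = scale_min * n_items * n_appraisers
    max_possible = scale_max * n_items * n_appraisers
    return 100 * (obtained - min_possible) / (max_possible - min_possible)


# Hypothetical ratings for one CPG (11 items, two raters); not study data.
rater_a = [6, 5, 4, 4, 3, 5, 3, 4, 2, 3, 3]
rater_b = [5, 6, 4, 5, 3, 4, 2, 4, 3, 2, 4]
items = rex_item_scores(rater_a, rater_b)
overall = rex_overall_score(items)

# Hypothetical three-item AGREE II domain scored by two appraisers.
domain = agree2_domain_score([[5, 6, 6], [6, 7, 6]])
flagged = needs_consensus([5, 6, 6], [6, 7, 3])
```

With this scaling, a domain in which every item receives a 1 from both appraisers scores 0 and one in which every item receives a 7 scores 100, which is why AGREE II domain scores are reported as percentages while AGREE-REX-D scores stay on the 7-point scale.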

Sample size calculation
The sample size calculation was based on a separate methodological goal to conduct a reliability study of the AGREE-REX-D tool based on the interrater reliability outcome. Based on consensus by the team, we made the following assumptions: two raters per CPG, an intraclass correlation coefficient of 0.6, and a confidence interval from 0.5 to 0.7. We determined that we required 316 participants to appraise 158 CPGs: each participant rated one CPG using the AGREE-REX-D and each CPG was rated by two independent raters. Additional information on the details of the sample size calculation can be found elsewhere [8].

Analytical framework
Descriptive measurements were used to summarize the AGREE-REX item and overall scores. A series of one-way ANOVA tests was used to examine mean differences in the AGREE-REX-D item scores and the overall score as a function of the

Participants
Descriptive statistics of the participants are listed in Tables 2 and 3.

CPGs
We appraised 161 CPGs. The CPGs targeted a range of diseases and clinical problems including cancer, infectious diseases, pregnancy and childbirth, mental health, nervous system disorders, respiratory, digestive, genitourinary, blood and endocrine disorders, and musculoskeletal disorders, among others. With the exception of cancer (n = 38), the number of CPGs for each unique disease was small (< 8), making other comparisons by disease topic not viable. CPGs were developed by 67 different organizations.

AGREE-REX (see Table 4)

AGREE-REX-D performance for all CPGs
The mean overall AGREE-REX score across the 161 CPGs was 4.23 (SD 1.14). There was variability in performance across the individual 11 items, with 6 that scored above the middle point of 4.0 on the response scale. The mean overall credibility and overall implementability assessments were 4.78 (SD 1.24) and 4.19 (SD 1.23), respectively.

AGREE-REX-D performance by type of organization
Statistically significant differences (i.e., p < 0.05) were found as a function of organization type for each of the mean AGREE-REX-D items, the mean overall AGREE-REX-D score, and the overall implementability and overall credibility assessments. In each case, more favorable ratings were found among CPGs produced by government-supported organizations. The item scores of CPGs produced by government-supported organizations (n = 46) ranged from 4.41 (SD 1.11) to 5.95 (SD 0.8); the scores of CPGs produced by professional societies (n = 109) ranged from 2.99 (SD 1.46) to 5.24 (SD 1.26); and the scores of CPGs produced by other types of organizations (n = 6) ranged from 3.00 (SD 0.89) to 6.17 (SD 0.68). Of note, in 5 of the 11 cases, the AGREE-REX-D item means across the organization types fell within the positive end of the response scale (m ≥ 4) despite there being statistically significant differences between them. In contrast, in 6 of the 11 cases, the overall means of the AGREE-REX-D items straddled the mid-point of the scale, suggesting that some organizations tended to perform below the mid-point and others above it.
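A one-way ANOVA of the kind used for these group comparisons can be illustrated with a minimal Python sketch. The function name and the three small groups of scores are hypothetical and are not study data; the sketch computes only the F statistic and its degrees of freedom, and obtaining a p value would additionally require the F distribution from a statistics package.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` is a list of lists, one list of scores per group.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-group sum of squares: spread of scores around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical item scores for three organization types; not study data.
government = [5.0, 5.5, 4.5]
society = [3.5, 4.0, 3.0]
other = [4.0, 4.5, 3.5]
f_stat, df_b, df_w = one_way_anova_f([government, society, other])
```

A large F indicates that the group means differ by more than the within-group variability would suggest, but, as the results above show, a statistically significant difference does not by itself say on which side of the scale mid-point the group means fall.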

AGREE-REX-D performance by country of CPG authoring group
AGREE-REX-D quality scores also differed by the country of the authoring organization. Statistically significant differences (i.e., p < 0.05) were found for five AGREE-REX items (implementation relevance, target user values, policy values, local applicability, and resources, tools, and capacity) and for the mean overall AGREE-REX score. Differences as a function of authoring group approached, but did not reach, statistical significance for the overall implementability assessment. For each of these comparisons, the CPGs produced in the UK and Canada showed higher scores. The item scores of CPGs from the UK ranged from 3.66 (SD 1.26) to 5.74 (SD 0.90); from Canada, from 3.42 (SD 1.0) to 5.87 (SD 0.64); from the USA, from 3.08 (SD 1.47) to 5.06 (SD 1.39); and from international organizations, from 2.96 (SD 1.39) to 5.18 (SD 1.44). In all but one case where there was a significant difference between the groups, the overall AGREE-REX-D item means straddled the mid-point of the scale.

AGREE-REX-D performance by disease
No significant differences emerged between cancer and non-cancer CPG scores; this held true for each of the AGREE-REX items and the mean overall AGREE-REX-D score (p > 0.5; means not presented).

AGREE II (see Table 5)
The AGREE II domain scores for the CPGs are displayed in Table 5. Scope and purpose, and clarity of presentation were the domains with the highest scores, while the applicability domain had the lowest score.

AGREE II and AGREE-REX
The correlations between the overall AGREE-REX-D and AGREE II domains were low (r < 0.30) except for the applicability domain where the correlation was modest at r = 0.38 [8]. Overall, AGREE-REX scores were higher among appraisers with no AGREE II experience compared to those with AGREE II experience.
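For readers who wish to run a similar comparison on their own appraisals, the following Python sketch computes a correlation coefficient between overall AGREE-REX-D scores and an AGREE II domain score. It assumes a Pearson product-moment coefficient, which the text above does not specify; the function name and the paired values are hypothetical and are not study data.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of deviations from the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Square roots of the sums of squared deviations.
    dev_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cross / (dev_x * dev_y)


# Hypothetical paired scores for four CPGs; not study data.
rex_overall = [3.0, 4.0, 5.0, 4.0]          # AGREE-REX-D overall (1-7 scale)
agree2_applicability = [40, 50, 60, 70]     # AGREE II domain score (0-100)
r = pearson_r(rex_overall, agree2_applicability)
```

Because the coefficient is scale-free, the 7-point AGREE-REX-D scores and the percentage-based AGREE II domain scores can be compared directly without rescaling either one.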

Discussion
We appraised 161 CPGs with the prototype of the AGREE-REX-D tool and the AGREE II tool. The most favorable AGREE-REX ratings (means > 5.0) were found for the evidence and clinical relevance items; ratings that fell in the more moderate range of the scale (means > 4.0 and < 5.0) were found for the patient/population relevance, implementation relevance, developers' values and users' values items; and the least favorable ratings, which fell below the mid-point of the scale (means < 4.0), were found for the patients/population values, policy values, alignment of values, local applicability and resources, tools and capacity items. CPGs produced by government-supported organizations scored higher on all the items of the AGREE-REX-D than those produced by professional societies or other types of groups, and CPGs produced in the UK and Canada scored higher on selected items in comparison to USA and international CPGs. The confidence intervals around the mean AGREE-REX scores were large.
The distribution of the mean scores across the 11 items is not surprising. CPG methods research has focused largely on issues directly relevant to creating the evidence base. As a consequence, some AGREE-REX concepts are easier to address successfully because there exist tools and resources to support their operationalization (e.g., tools designed by the GRADE working group [12]). In contrast, resources to operationalize other concepts are more elusive. For example, continued methodological development is needed to measure and report values across diverse stakeholder groups in a way that is reliable, valid, and usable. Similarly, systematic strategies to incorporate these perspectives into the framing of recommendations are required [13].
As previously reported with the evaluation of the AGREE II [14], lower scores on some AGREE-REX-D items may reflect inadequate reporting and not poor quality in methodological execution [6]. Developers may have followed appropriate steps but not reported them in the CPG documentation; as a consequence, these steps could not be assessed. Also, it is possible that some conceptual elements reflected in the AGREE-REX-D (e.g., concepts related to implementation activities) are not the direct responsibility of the CPG developer, but perhaps of another party or group within their specific settings [12]. Thus, the AGREE-REX could provide a signal to individuals who are ultimately responsible for action about where gaps and barriers to this goal exist so that corrective action can be taken.
Differences in mean overall AGREE-REX-D scores as a function of the type of organization may reflect the greater interest or greater capacity of government-supported organizations to seek out a broader range of values or invest in additional methodological steps that lead to higher quality scores than do other types of development groups. These data align with initial appraisal findings using the original AGREE instrument, in which CPGs developed by government-supported organizations also had the most favorable quality scores [15]. Greater resources (financial and access to skilled methodologists) confer quality benefits on CPG panels, and setting quality standards too high may have the unintended consequence of increasing the disparities between the "have much" and "have less" jurisdictions. Similar differences and similar concerns were raised in the assessment of CPGs with the original AGREE instrument [15].
Our study has several limitations. First, we only included English-language CPGs in the analysis. As a result, we have no data on the unique strengths or limitations related to credibility and implementability of non-English CPGs. This provides an opportunity for future research studies. Additionally, in order to optimize the feasibility of the study and candidates' interest in participating, we only included CPGs that were less than 50 pages in length (excluding appendices and tables). Although the length of the CPG document is not necessarily associated with quality, credibility, and implementability, the restriction we imposed may have resulted in the exclusion of lengthy CPGs that may contain more information and perhaps could have scored higher. In addition, while 161 CPGs were evaluated, they were not from 161 unique developers. This could potentially be a source of confounding. Finally, the penultimate prototype of the AGREE-REX-D was used and not the final version. While there is considerable overlap between the two, future status reports must account for these differences when reflecting on changes in scores over time.

Conclusion
As part of the development of the AGREE-REX tool, we assessed the recommendations of 161 CPGs from different organizations around the world using the draft version of the tool. We found that there is significant room for improvement in some CPG recommendation elements. The most unfavorable ratings were found for the following items: patients/population values, policy values, alignment of values, local applicability and resources, tools and capacity. It should also be noted that statistically significantly higher scores were found in guidelines developed by government-supported organizations (in comparison to those produced by professional or specialist societies or others), and in guidelines developed in the UK and Canada (in comparison to those produced in the USA and internationally).
Since the AGREE-REX can be used as a methodological blueprint to inform the development and reporting of high-quality recommendations, our findings may be used as a baseline upon which to measure future improvements in the quality of CPG recommendations.