Assessment of the quality of recommendations from 161 clinical practice guidelines using the Appraisal of Guidelines for Research and Evaluation–Recommendations Excellence (AGREE-REX) instrument shows there is room for improvement
Implementation Science volume 15, Article number: 79 (2020)
To assess the quality of recommendations from 161 clinical practice guidelines (CPGs) using AGREE-REX-D (Appraisal of Guidelines REsearch and Evaluation-Recommendations Excellence Draft).
International CPG community.
Three hundred twenty-two international CPG developers, users, and researchers.
Participants were assigned to appraise one of 161 CPGs selected for the study using the AGREE-REX-D tool.
Main outcome measures
AGREE-REX-D scores of 161 CPGs (7-point scale, maximum 7).
Recommendations from 161 CPGs were appraised by 322 participants using the AGREE-REX-D. CPGs were developed by 67 different organizations. The total overall average score of the CPG recommendations was 4.23 (standard deviation (SD) = 1.14). AGREE-REX-D items that scored the highest were (mean; SD): evidence (5.51; 1.14), clinical relevance (5.95; 0.8), and patients/population relevance (4.87; 1.33), while the lowest scores were observed for the policy values (3.44; 1.53), local applicability (3.56; 1.47), and resources, tools, and capacity (3.49; 1.44) items. CPGs developed by government-supported organizations and developed in the UK and Canada had significantly higher recommendation quality scores with the AGREE-REX-D tool (p < 0.05) than their comparators.
We found that there is significant room for improvement of some CPGs such as the considerations of patient/population values, policy values, local applicability and resources, tools, and capacity. These findings may be considered a baseline upon which to measure future improvements in the quality of CPGs.
Clinical practice guidelines (CPGs) are systematically developed statements informed by a systematic review of evidence and an assessment of the benefits and harms of alternative care options with the aim of optimizing patient care [1,2,3]. However, concerns about variation in the quality of CPGs and their resultant recommendations exist in the literature [1, 3, 4]. The AGREE II is an established instrument, used internationally, to evaluate the overall methodological quality of CPGs and to serve as a methodological blueprint to inform CPG development and reporting [5,6,7]. The AGREE II focuses on the entire CPG development process. As its complement, the AGREE-REX (Appraisal of Guidelines REsearch and Evaluation-Recommendations EXcellence) was designed to focus specifically on CPG recommendations and the justifications that underpin them. Its development was in response to data demonstrating that high-quality CPG processes, although necessary, are not always sufficient to yield individual CPG recommendations that are clinically credible and implementable [9, 10].
The prototype of the AGREE-REX (the AGREE-REX-D) and the AGREE II were applied to 161 guidelines. In this article, we present the results of this assessment, identify areas for CPG recommendation improvement, and compare the evaluative information garnered by both tools.
Materials and methods
This study represents a component of a larger program of research designed to create the AGREE-REX version 1 (AGREE-REX-D); the technical components of this program of research are reported elsewhere. Our main study was designed to create the AGREE-REX tool following a mixed-methods project, and this manuscript presents the cross-sectional study that summarizes the assessment of the selected CPGs during the development of the AGREE-REX-D. This study received ethics approval from the Hamilton Integrated Research Ethics Board (project #13-700).
Participants included CPG developers, clinicians, implementers, and other users. They were purposefully recruited through a variety of channels including social media and CPG organizations, such as the Guidelines International Network (G-I-N), G-I-N North America regional community, Knowledge Translation (KT) Canada, Canadian Agency for Drugs and Technologies in Health (CADTH), Canadian Partnership Against Cancer, and Cancer Care Ontario, and through investigators known in the CPG research community. The study was also advertised on the AGREE social media accounts (Facebook and Twitter), and registered users of My AGREE PLUS (the online platform for appraising CPGs with the AGREE II tool, www.agreetrust.org) were invited to participate.
CPGs in multiple clinical specialty areas were collected from the Agency for Healthcare Research and Quality (AHRQ) National Guidelines Clearinghouse database. Using the database’s advanced search function, we identified CPGs that were (1) published between 2013 and 2015; (2) written in English; and (3) no more than 50 pages in length for the CPG core document. The resulting list of CPGs was reviewed and the following were excluded: guidelines addressing organizational rather than clinical topics; technology assessments; CPGs not available for free to the public; and CPGs for which the link in the database was not functional. Descriptive information was extracted from each CPG, including type of authoring organization (government supported vs. professional society vs. other/not clear), disease topic (cancer vs. non-cancer), and country of authoring group (USA, UK, Canada, or international).
Participants received individualized password-protected access to the study materials, which included links to a downloadable PDF of the AGREE-REX-D, the CPG to which they were randomly assigned, and the online survey platform (LimeSurvey) to record their scores. Participants were asked to review the AGREE-REX-D manual and items, read the CPG, and then evaluate it by applying the tool and recording their item ratings in LimeSurvey. Participants were provided with no formal training or orientation to the tool by members of the team. The AGREE-REX-D manual provided definitions of the items and instructions on how to assess and score them. An email reminder was sent at 2 weeks from the participant’s initial start date informing them of their deadline in 1 week. Deadline extensions were given when requested. Evaluations were completed between May 2016 and March 2017. Participants were offered a $50 CAD pre-paid virtual gift card for completing the study. All communication with participants was done by the staff of the AGREE Scientific Office.
The prototype of AGREE-REX-D comprised 11 items within 4 themes (Table 1). Each item was rated using a 7-point scale applied to two quality attributes, with higher scores reflecting higher quality. The two attributes were the following:
Extent to which quality features were documented in the CPG
Extent to which quality features were considered in formulating the recommendations.
The instrument concludes with two general quality assessments: overall credibility and overall implementability of the CPG recommendations.
AGREE II evaluations
For exploratory purposes, the CPGs were also assessed, independently, using the AGREE II by two members of the AGREE Scientific team. The AGREE II includes 23 items within 6 domains and 2 overall assessments. The 23 items are assessed with a 7-point scale (1 = strongly disagree; 7 = strongly agree), with high scores reflecting more favorable quality results. Discrepancies in scoring were resolved by consensus when required.
For each CPG, an AGREE-REX-D item score was derived for each of the 11 items by averaging scores on the 7-point scale between the two raters. A mean overall AGREE-REX-D score was calculated for each CPG by averaging across the 11 items. Finally, mean scores for overall credibility and overall implementability items were derived by averaging scores between the two raters.
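The averaging described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's analysis code: the ratings are hypothetical, the function names are invented for the example, and it assumes each rater contributes a single 7-point rating per item (the source does not say how the two quality attributes were combined).

```python
from statistics import mean


def item_scores(rater_a, rater_b):
    """Average two raters' 7-point ratings item by item."""
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must score the same items")
    return [mean(pair) for pair in zip(rater_a, rater_b)]


def overall_score(per_item):
    """Mean overall score for one CPG: the average across its item scores."""
    return mean(per_item)


# Hypothetical ratings for one CPG on the 11 items (not study data).
rater_a = [6, 6, 5, 5, 4, 4, 3, 3, 4, 3, 3]
rater_b = [5, 6, 5, 4, 4, 5, 3, 4, 4, 4, 4]

per_item = item_scores(rater_a, rater_b)
overall = overall_score(per_item)
print(per_item)
print(round(overall, 2))
```

The same two-rater average would apply to the overall credibility and overall implementability ratings.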
AGREE II tool mean domain scores were derived by summing the scores across the two appraisers and standardizing them as a percentage of the maximum possible score a CPG could achieve for that domain. Before these scores were summed and calculated, the independent appraisers were required to reach a consensus on any AGREE II item scores that were two or more points apart on the 7-point scale.
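As an illustration of this standardization, the sketch below assumes the scaled domain score convention of the AGREE II user manual, in which the summed ratings are rescaled between the minimum and maximum possible totals: (obtained - minimum) / (maximum - minimum) x 100. The function name and the ratings are hypothetical.

```python
def scaled_domain_score(ratings, scale_min=1, scale_max=7):
    """Scaled domain score as a percentage.

    ratings: one list of item ratings per appraiser, all for the same domain.
    Assumes the (obtained - min) / (max - min) convention.
    """
    n_items = len(ratings[0])
    if any(len(r) != n_items for r in ratings):
        raise ValueError("every appraiser must rate the same items")
    n_cells = n_items * len(ratings)           # items x appraisers
    obtained = sum(sum(r) for r in ratings)    # summed across appraisers
    lowest = scale_min * n_cells               # minimum possible total
    highest = scale_max * n_cells              # maximum possible total
    return 100 * (obtained - lowest) / (highest - lowest)


# Hypothetical three-item domain scored by two appraisers (not study data).
domain_ratings = [[5, 6, 6], [4, 5, 7]]
score = scaled_domain_score(domain_ratings)
print(score)
```

Under this convention a domain rated 1 on every item scores 0% and a domain rated 7 on every item scores 100%.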
Sample size calculation
The sample size calculation was based on a separate methodological goal to conduct a reliability study of the AGREE-REX-D tool based on the interrater reliability outcome. Based on consensus by the team, we made the following assumptions: two raters per CPG, an intraclass correlation coefficient of 0.6, and a confidence interval from 0.5 to 0.7. We determined that we required 316 participants to appraise 158 CPGs: each participant rated one CPG using the AGREE-REX-D and each CPG was rated by two independent raters. Additional information on the details of the sample size calculation can be found elsewhere.
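The text does not state which formula or confidence level was used. As a rough check only, the sketch below assumes a 95% confidence level and Bonett's approximation for the number of subjects needed to estimate an intraclass correlation with a given confidence interval width; both are assumptions, and the function name is hypothetical.

```python
from statistics import NormalDist


def icc_sample_size(icc, ci_width, raters, confidence=0.95):
    """Approximate number of subjects (here, CPGs) needed to estimate an
    ICC with a confidence interval of total width ci_width.

    Uses Bonett's approximation (an assumption, not confirmed by the source):
    n = 8 z^2 (1 - icc)^2 (1 + (k - 1) icc)^2 / (k (k - 1) w^2) + 1
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    numerator = 8 * z**2 * (1 - icc) ** 2 * (1 + (raters - 1) * icc) ** 2
    denominator = raters * (raters - 1) * ci_width**2
    return numerator / denominator + 1


# Stated assumptions: ICC 0.6, interval 0.5 to 0.7 (width 0.2), two raters.
n_subjects = icc_sample_size(icc=0.6, ci_width=0.2, raters=2)
print(round(n_subjects, 1))
```

With these inputs the approximation returns a value between 158 and 159, which is in line with the 158 CPGs (316 participants at two raters each) reported above.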
Descriptive measurements were used to summarize the AGREE-REX-D item and overall scores. A series of one-way ANOVA tests was used to examine mean differences in the AGREE-REX-D item scores and the overall score as a function of the following characteristics: type of authoring organization (government-supported vs. professional societies vs. other), disease topic (cancer vs. not cancer), and country of development (USA vs. UK vs. Canada vs. international). The international guidelines category included guidelines co-developed by two or more countries or developed by international organizations or societies. Descriptive measures were used to summarize AGREE II domain scores. Finally, correlations between mean overall AGREE-REX-D scores and AGREE II domain scores were calculated. Analyses were performed using Stata 15.0 (StataCorp. 2017. Stata Statistical Software: Release 15. College Station, TX: StataCorp LLC).
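The study ran its analyses in Stata; the Python sketch below only illustrates what a one-way ANOVA computes when comparing mean scores across groups such as organization types. The group values are hypothetical, and the function returns the F statistic with its degrees of freedom; the p-value would then be read from the F distribution.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    groups: list of lists, one list of scores per group.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more scores than groups")
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Within-group sum of squares: spread of scores around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical overall scores for three organization types (not study data).
government = [5.0, 6.0, 7.0]
professional = [2.0, 3.0, 4.0]
other = [1.0, 2.0, 3.0]
f_stat, df_b, df_w = one_way_anova_f([government, professional, other])
print(f_stat, df_b, df_w)
```

A large F relative to the F distribution with (df_between, df_within) degrees of freedom indicates that at least one group mean differs from the others.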
We appraised 161 CPGs. The CPGs targeted a range of diseases and clinical problems including cancer, infectious diseases, pregnancy and childbirth, mental health, nervous system disorders, respiratory, digestive, genitourinary, blood and endocrine disorders, and musculoskeletal, among others. With the exception of cancer (n = 38), the number of CPGs for each unique disease was small (< 8), making other comparisons by disease topic not viable. CPGs were developed by 67 different international organizations (see Additional file 1 Appendix 1). Organizations that produced the CPGs were government-supported in less than a third of cases (n = 46; 28.6%), and they were authored by groups most often located in the USA (n = 89; 55.3%) or the UK (n = 46; 28.6%). CPGs were all published between 2013 and 2015. The list of appraised CPGs can be accessed in the supplementary file.
AGREE-REX (see Table 4)
AGREE-REX-D performance for all CPGs
The mean overall AGREE-REX score across the 161 CPGs was 4.23 (SD 1.14). There was variability in performance across the individual 11 items, with 6 that scored above the middle point of 4.0 on the response scale. The mean overall credibility and overall implementability assessments were 4.78 (SD 1.24) and 4.19 (SD 1.23), respectively.
AGREE-REX-D performance by type of organization
Statistically significant differences (i.e., p < 0.05) were found as a function of organization type for each of the mean AGREE-REX-D items, the mean overall AGREE-REX-D score, and the overall implementability and overall credibility assessments. In each case, more favorable ratings were found among CPGs produced by government-supported organizations. The item scores of CPGs produced by government-supported organizations (n = 46) ranged from 4.41 (SD 1.11) to 5.95 (SD 0.8); the scores of CPGs produced by professional societies (n = 109) ranged from 2.99 (SD 1.46) to 5.24 (SD 1.26); and the scores of CPGs produced by other types of organizations (n = 6) ranged from 3.00 (SD 0.89) to 6.17 (SD 0.68). Of note, in 5 of the 11 cases, the AGREE-REX-D item means across the organization types fell within the positive end of the response scale (m ≥ 4) despite there being statistically significant differences between them. In contrast, in 6 of the 11 cases, the overall means of the AGREE-REX-D items straddled the mid-point of the scale, suggesting that some organizations tended to perform lower than the mid-point and others higher than the mid-point of the scale.
AGREE-REX-D performance by country of CPG authoring group
The country of the authoring CPG organization showed differences in AGREE-REX-D quality scores as well. Statistically significant differences (i.e., p < 0.05) were found for five AGREE-REX-D items (implementation relevance, target user values, policy values, local applicability, and resources, tools, and capacity) and for the mean overall AGREE-REX-D score. Differences as a function of authoring group approached, but did not reach, statistical significance for the overall implementability assessment. For each of these comparisons, the CPGs produced in the UK and Canada showed higher scores. The item scores of CPGs published from the UK ranged from 3.66 (SD 1.26) to 5.74 (SD 0.90); from Canada ranged from 3.42 (SD 1.0) to 5.87 (SD 0.64); from the USA ranged from 3.08 (SD 1.47) to 5.06 (SD 1.39); and from international organizations ranged from 2.96 (SD 1.39) to 5.18 (SD 1.44). In all but one case, overall AGREE-REX-D item means straddled the mid-point of the scale where there was a significant difference between the groups.
AGREE-REX-D performance by disease
No significant differences emerged between cancer and non-cancer CPG scores; this held true for each of the AGREE-REX-D items and the mean overall AGREE-REX-D score (p > 0.5; means not presented).
AGREE II (see Table 5)
The AGREE II domain scores for the CPGs are displayed in Table 5. Scope and purpose, and clarity of presentation were the domains with the highest scores, while the applicability domain had the lowest score.
AGREE II and AGREE-REX
The correlations between the overall AGREE-REX-D and AGREE II domains were low (r < 0.30) except for the applicability domain where the correlation was modest at r = 0.38. Overall, AGREE-REX-D scores were higher among appraisers with no AGREE II experience compared to those with AGREE II experience.
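For readers who want to reproduce this kind of comparison on their own appraisals, the sketch below computes a Pearson correlation between per-CPG overall scores and one AGREE II domain score. It assumes a Pearson coefficient (the text reports r but does not name the coefficient), and the data and variable names are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical values for five CPGs (not study data):
# overall score on the 7-point scale and applicability domain score in percent.
rex_overall = [3.2, 4.5, 5.1, 4.0, 3.8]
applicability_pct = [20.0, 55.0, 40.0, 35.0, 30.0]
r = pearson_r(rex_overall, applicability_pct)
print(round(r, 2))
```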
We appraised 161 CPGs with the prototype of the AGREE-REX-D tool and the AGREE II tool. The most favorable AGREE-REX-D ratings (means > 5.0) were found for the evidence and clinical relevance items; ratings that fell in the more moderate range of the scale (means > 4.0 and < 5.0) were found for the patient/population relevance, implementation relevance, developers’ values and users’ values items; and least favorable ratings that fell below the mid-point of the scale (means < 4.0) were found for patients/population values, policy values, alignment of values, local applicability and resources, tools and capacity items. CPGs produced by government-supported organizations scored higher on all the items of the AGREE-REX-D than those produced by professional societies or other types of groups, and CPGs produced in the UK and Canada scored higher in selected items in comparison to USA and international CPGs. The confidence intervals around the mean AGREE-REX-D scores were large.
The distribution of the mean scores across the 11 items is not surprising. CPG methods research has focused largely on issues directly relevant to creating the evidence base. As a consequence, some AGREE-REX concepts are easier to achieve because tools and resources exist to support their operationalization (e.g., tools designed by the GRADE working group). In contrast, resources to operationalize other concepts are more elusive. For example, continued methodological development is needed to adequately measure and report values across diverse stakeholder groups so that they are reliable, valid, and usable. Similarly, systematic strategies to incorporate these perspectives into the framing of recommendations are required.
As previously reported with the evaluation of the AGREE II, lower scores with some AGREE-REX-D items may reflect inadequate reporting and not poor quality in methodological execution. Developers may have followed appropriate steps but not reported them in the CPG documentation; as a consequence, those steps could not be assessed. Also, it is possible that some conceptual elements reflected in the AGREE-REX-D (e.g., concepts related to implementation activities) are not the responsibility of the CPG developer directly, but perhaps of another party or group within their specific settings. Thus, the AGREE-REX could provide a signal to individuals who are ultimately responsible for action about where gaps and barriers to this goal exist so that corrective action can be taken.
Differences in mean overall AGREE-REX-D scores as a function of the type of organization may reflect the greater interest or greater capacity of government-supported organizations to seek out a broader range of values or invest in additional methodological steps that lead to higher quality scores than do other types of development groups. These data align with initial appraisal findings using the original AGREE instrument, in which CPGs developed by government-supported organizations also had the most favorable quality scores. Having more resources (financial and access to skilled methodologists) confers quality benefits on CPG panels, and setting quality standards too high may have the unintended consequence of increasing the disparities between the “have much” and “have less” jurisdictions. Similar differences and similar concerns were raised in the assessment of CPGs with the original AGREE instrument.
Our study has several limitations. First, we only included English-language CPGs in the analysis. As a result, we have no data on the unique strengths or limitations related to credibility and implementability of non-English CPGs. This provides an opportunity for future research studies. Additionally, in order to optimize the feasibility of the study and candidates’ interests to participate, we only included CPGs that were less than 50 pages in length (excluding appendices and tables). Although the length of the CPG document is not necessarily associated with the quality, credibility and implementability, the restriction we imposed may have resulted in the exclusion of lengthy CPGs that may have more information and perhaps could have been scored higher. In addition, while 161 CPGs were evaluated, they were not from 161 unique developers. This could potentially be a source of confounding. Finally, the penultimate prototype of the AGREE-REX-D was used and not the final version. While there is considerable overlap between the two, future status reports must account for these differences when reflecting on changes in scores over time.
As part of the development of the AGREE-REX tool, we assessed the recommendations of 161 CPGs from different organizations around the world using the draft version of the tool. We found that there is significant room for improvement in some CPG recommendation elements. The most unfavorable ratings were found in the following items: patients/population values, policy values, alignment of values, local applicability and resources, tools and capacity. It should also be noted that statistically significantly higher scores were found in guidelines developed by government-supported organizations (in comparison to those produced by professional or specialist societies or others), and in guidelines developed in the UK and Canada (in comparison to those produced in the USA and internationally).
Since the AGREE-REX can be used as a methodological blueprint to inform the development and reporting of high-quality recommendations, our findings may be used as a baseline upon which to measure future improvements in the quality of CPG recommendations.
Availability of data and materials
The analyses are available from the corresponding author.
Shiffman RN, Shekelle P, Overhage JM, et al. Standardized reporting of clinical practice guidelines: a proposal from the conference on guideline standardization. Ann Intern Med. 2003;139(6):493–8.
Qaseem A, Forland F, Macbeth F, et al. Guidelines international network: toward international standards for clinical practice guidelines. Ann Intern Med. 2012;156:525–31.
Institute of Medicine. Clinical practice guidelines we can trust. Washington, DC: The National Academies Press; 2011. https://doi.org/10.17226/13058.
Grilli R, Magrini N, Penna A, et al. Practice guidelines developed by specialty societies: the need for critical appraisal. Lancet. 2000;355:103–6.
Brouwers M, Kho ME, Browman GP, et al. AGREE II: advancing guideline development, reporting and evaluation in healthcare. CMAJ. 2010;182:E839–42.
Brouwers MC, Kho ME, Browman GP, et al. Performance, usefulness and areas for improvement: development steps towards the AGREE II - part 1. CMAJ. 2010;182:1045–52.
Brouwers MC, Kho ME, Browman GP, et al. Validity assessment of items and tools to support application: development steps towards the AGREE II – part 2. CMAJ. 2010;182:E472–8.
Brouwers MC, Spithoff K, Kerkvliet K, Alonso-Coello P, Burgers J, Cluzeau F, et al. Development and validation of a tool to assess the quality of clinical practice guideline recommendations. JAMA Netw Open. 2020;3(5):e205535.
Nuckols TK, Lim YW, Wynn BO, et al. Rigorous development does not ensure that guidelines are acceptable to a panel of knowledgeable providers. J Gen Intern Med. 2008;23:37–44.
Watine J, Friedberg B, Nagy E, et al. Conflict between guideline methodologic quality and recommendation validity: a potential problem for practitioners. Clin Chem. 2006;52:65–72.
The Agency for Healthcare Research and Quality (AHRQ) National CPG Clearinghouse database. https://www.guideline.gov. Accessed 20 May 2013.
Guyatt GH, Oxman AD, Vist GE, et al. GRADE: an emerging consensus on rating quality of evidence and strength of recommendations. BMJ (Clinical research ed). 2008;336(7650):924–6.
Brouwers MC, Makarski J, Kastner M, on behalf of the GUIDE-M Research Team, et al. The Guideline Implementability Decision Excellence Model (GUIDE-M): a mixed methods approach to create an international resource to advance the practice guideline field. Implement Sci. 2015;10(1):36.
Fervers B, Burgers JS, Haugh MC, et al. Predictors of high quality clinical practice guidelines: examples in oncology. Int J Qual Health Care. 2005;17(2):123–32.
Burgers JS, Cluzeau FA, Hanna SE, et al. Characteristics of high-quality guidelines: evaluation of 86 clinical guidelines developed in ten European countries and Canada. Int J Technol Assess Health Care. 2003;19(1):148–57.
The authors thank the following AGREE-REX Research Team members and collaborators for their input: Onil Bhattacharyya; George Browman; Anna Gagliardi; Peter Littlejohns; Holger Schunemann; and Louise Zitzelsberger. The authors also thank the study participants.
The lead author affirms that this manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as planned (and, if relevant, registered) have been explained.
This project was funded by the Canadian Institutes of Health Research (CIHR), grant #201209MOP-285689-KTR-CEBA-40598. JG holds a Canada Research Chair in Health Knowledge Transfer and Uptake. The funding body did not influence the design of the study, the collection, analysis, and interpretation of the data, or the writing of the manuscript.
Ethics approval and consent to participate
This study has been approved by the Hamilton Integrated Research Ethics Board (project number: 13-700).
Consent for publication
All authors have completed the ICMJE uniform disclosure form at www.icmje.org/coi_disclosure.pdf and declare: KS and KK had financial support from the CIHR grant that funded the submitted work; no financial relationships with any organizations that might have an interest in the submitted work in the previous 3 years; MCB is the grant holder of the CIHR funding that supported this work; no other relationships or activities that could appear to have influenced the submitted work.
Florez, I.D., Brouwers, M.C., Kerkvliet, K. et al. Assessment of the quality of recommendations from 161 clinical practice guidelines using the Appraisal of Guidelines for Research and Evaluation–Recommendations Excellence (AGREE-REX) instrument shows there is room for improvement. Implementation Sci 15, 79 (2020). https://doi.org/10.1186/s13012-020-01036-5