Clinical Guideline Development Part 2: GRADE, AGREE, and the Alphabet Soup of Evidence Appraisal

From Hippocrates to Delphi: A Practical Guide to Clinical Guideline Development - banner image

Gordon Guyatt, the father of evidence-based medicine. Photo from: https://www.cdnmedhall.ca/laureates/gordonguyatt
In 1990, while serving as director of the internal medicine residency program at McMaster University in Hamilton, Ontario, Gordon Guyatt introduced a new approach to teaching at the bedside. He called it “Scientific Medicine.” The idea was straightforward: clinical decisions should be grounded in a critical appraisal of the best available evidence, not in intuition, authority, or the way things had always been done. His colleagues were not enthusiastic. The implication that established clinical practice was not always grounded in systematic evidence was not universally well received ^1-3.

So, Gordon Guyatt changed the name. He called it “Evidence-Based Medicine.” The term first appeared in a 1991 editorial in the ACP Journal Club, and a landmark 1992 JAMA paper co-authored with the Evidence-Based Medicine Working Group brought the concept to global attention ^1,4. Within a decade, evidence-based medicine had reshaped medical education, clinical research, and health policy worldwide. In a 2007 BMJ reader poll of the most important medical milestones since 1840, evidence-based medicine ranked among the top ten, ahead of both the computer and medical imaging ^5,6.

But Guyatt had created a problem he would spend the next decade solving. If clinical practice was now supposed to be evidence-based, someone needed a consistent way to rate the quality of that evidence and the strength of the recommendations derived from it. By the early 2000s, more than a dozen competing systems existed for doing exactly this: the Oxford Levels of Evidence, the USPSTF grading system, the SIGN methodology, the old AHCPR framework. None of them agreed with each other. The same randomised trial could be classified as “Level I” evidence in one framework and receive a different strength designation in another. Two guidelines on the same clinical question, developed by two different societies using two different grading systems, could produce different recommendation strengths based on the framework, not the data.

Recognising the confusion created by competing evidence-grading systems, Gordon Guyatt, Andrew Oxman, Holger Schünemann, and an international group of clinicians, epidemiologists, and guideline methodologists launched what became the GRADE Working Group around 2000. Their first foundational publications appeared in 2004, introducing a unified framework for rating the certainty of evidence and the strength of recommendations ⁷. By 2008, a landmark BMJ publication helped establish GRADE as the emerging international standard ⁸. Today, GRADE is used by leading healthcare organisations worldwide, including the World Health Organization, the Cochrane Collaboration, and UpToDate. The man from Hamilton who named the movement also helped build the system that tells us how much to trust its outputs.

This article covers the frameworks that anyone involved in guideline development needs to understand: GRADE for appraising evidence and formulating recommendations, AGREE II for evaluating guideline quality, RIGHT for reporting completeness, and the IOM standards that underpin the definition of a trustworthy guideline.

GRADE: Grading of Recommendations Assessment, Development and Evaluation - visual guideline.

GRADE: Separating evidence quality from recommendation strength

GRADE’s central contribution is the separation of two judgements that are frequently conflated: how confident we are in the evidence (evidence quality) and what we think clinicians should do (recommendation strength).

Evidence quality is rated across four levels: high, moderate, low, and very low, reflecting the degree of confidence that the estimated effect is close to the true effect. Randomised controlled trials begin as high-quality evidence and can be downgraded based on five factors: study limitations, inconsistency of results, indirectness of evidence, imprecision, and publication bias. Observational studies begin as low-quality evidence but can be upgraded when effect sizes are large, a dose-response gradient exists, or plausible confounders would only reduce the observed effect ^8,9.
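The starting points and adjustment factors described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not an official GRADE algorithm: in practice each downgrade or upgrade is a structured panel judgement rather than arithmetic. The function name `rate_evidence`, the factor labels, and the rule that each factor moves the rating by one or two levels are assumptions made for this sketch.

```python
# Simplified sketch of GRADE evidence rating (illustrative assumptions throughout).
LEVELS = ("very low", "low", "moderate", "high")

# Starting level by study design, as described in the text.
START = {"rct": 3, "observational": 1}

DOWNGRADE_FACTORS = {
    "study limitations", "inconsistency", "indirectness",
    "imprecision", "publication bias",
}
UPGRADE_FACTORS = {
    "large effect", "dose-response gradient", "plausible confounding",
}


def rate_evidence(design, downgrades=None, upgrades=None):
    """Return a quality level from a study design and judged adjustments.

    `downgrades` and `upgrades` map a factor name to the number of
    levels (1 or 2) the panel decided to move the rating.
    """
    downgrades = downgrades or {}
    upgrades = upgrades or {}
    if design not in START:
        raise ValueError(f"unknown design: {design!r}")
    # Assumption for this sketch: upgrading applies to observational studies only.
    if upgrades and design != "observational":
        raise ValueError("upgrades are only modelled for observational studies")
    for allowed, given in ((DOWNGRADE_FACTORS, downgrades),
                           (UPGRADE_FACTORS, upgrades)):
        for factor, steps in given.items():
            if factor not in allowed:
                raise ValueError(f"unknown factor: {factor!r}")
            if steps not in (1, 2):
                raise ValueError("each factor moves the rating by 1 or 2 levels")
    index = START[design] - sum(downgrades.values()) + sum(upgrades.values())
    # Clamp to the four-level scale: nothing is rated below very low or above high.
    return LEVELS[max(0, min(len(LEVELS) - 1, index))]
```

For example, `rate_evidence("rct", downgrades={"imprecision": 1})` returns `"moderate"`, and `rate_evidence("observational", upgrades={"large effect": 1})` also returns `"moderate"`. The same final rating can be reached from different starting points, which is why the reasoning behind each adjustment needs to be reported alongside the rating.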

Recommendation strength is binary: strong or conditional (sometimes termed weak). A strong recommendation (e.g. “we recommend”) indicates that the benefits clearly outweigh the harms or vice versa and that most patients should receive (or avoid) the intervention. A conditional recommendation (e.g. “we suggest”) indicates that the balance is closer, that uncertainty exists, or that patient values and preferences play a larger role in the decision ^8,10,11.

This separation produces an important and often misunderstood possibility: a strong recommendation can be appropriately issued on the basis of low-quality evidence when the benefit-harm balance is unequivocal. Recommending against a therapy with clear hepatotoxicity signals does not require high-quality RCT data; the potential for serious harm is sufficient. Conversely, high-quality evidence may support only a conditional recommendation when benefits and harms are closely balanced ¹¹.

GRADE encourages consistent, active-voice language to signal recommendation strength. “We recommend” for strong and “we suggest” for conditional is the most widely adopted convention, though alternatives such as “clinicians should”/“clinicians might” are also used ¹⁰.

The most common pitfall in guideline development is adopting GRADE language and ratings without implementing the GRADE process. If a document uses “we recommend” and “we suggest,” assigns quality ratings to each recommendation, but describes its methodology as “narrative review and expert consensus,” the reader is being given a confidence signal that the methodology does not support. This mismatch is readily identified by peer reviewers and journal editors, and it undermines recommendations that may be perfectly sound. If GRADE was not used, the recommendation language should not imply that it was.

AGREE II: Appraisal of Guidelines for Research and Evaluation II - visual guideline.

AGREE II: Appraising guideline quality

GRADE provides the framework for building a guideline, while AGREE II describes how to evaluate one. The AGREE II instrument (Appraisal of Guidelines for Research and Evaluation) is the most widely used and validated tool for assessing clinical practice guideline quality. It consists of 23 items organised across six domains: Scope and Purpose, Stakeholder Involvement, Rigour of Development, Clarity of Presentation, Applicability, and Editorial Independence ¹².

Each item is rated on a 7-point scale by at least two independent appraisers, and domain scores are calculated as standardised percentages. The instrument does not produce a single pass/fail score but rather a profile of strengths and weaknesses.
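The standardised percentage for a domain can be sketched as follows. The calculation scales the summed ratings between the minimum and maximum possible totals for that domain: (obtained score − minimum possible) / (maximum possible − minimum possible) × 100. The function name `scaled_domain_score` and the input layout (one list of item ratings per appraiser) are assumptions made for this illustration.

```python
def scaled_domain_score(ratings):
    """Return the standardised AGREE II domain score as a percentage.

    `ratings` holds one list per appraiser, each containing that
    appraiser's 1-7 ratings for every item in the domain.
    """
    if len(ratings) < 2:
        raise ValueError("at least two independent appraisers are required")
    n_items = len(ratings[0])
    if n_items == 0 or any(len(r) != n_items for r in ratings):
        raise ValueError("every appraiser must rate the same, non-empty item set")
    if any(not 1 <= score <= 7 for r in ratings for score in r):
        raise ValueError("item ratings must be on the 7-point scale (1-7)")

    n_appraisers = len(ratings)
    obtained = sum(sum(r) for r in ratings)
    max_possible = 7 * n_items * n_appraisers  # everyone rates every item 7
    min_possible = 1 * n_items * n_appraisers  # everyone rates every item 1
    return 100 * (obtained - min_possible) / (max_possible - min_possible)


# Four appraisers rating a three-item domain.
example = [[5, 6, 6], [6, 6, 7], [2, 4, 3], [3, 3, 2]]
print(round(scaled_domain_score(example)))  # 57
```

In the example the summed ratings are 53, against a minimum of 12 and a maximum of 84, giving (53 − 12) / (84 − 12) ≈ 57%. Because the score is computed per domain, a guideline ends up with six percentages rather than one overall figure.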

Domain 3, Rigour of Development, is where most guidelines succeed or fail. It evaluates whether systematic methods were used to search for evidence, whether selection criteria are clearly described, whether the strengths and limitations of the body of evidence are described, whether the methods for formulating recommendations are transparent, whether health benefits and risks were considered, whether there is an explicit link between recommendations and supporting evidence, whether the guideline was externally reviewed, and whether a procedure for updating is provided ¹³.

A critical distinction: AGREE II is a quality appraisal tool, not a development methodology. It tells you whether a finished guideline meets quality standards; it does not prescribe how to develop one. However, building its domains into development from the outset is considerably more efficient than discovering gaps at the appraisal stage and working backwards.

RIGHT: Reporting completeness

The RIGHT checklist (Reporting Items for practice Guidelines in HealThcare) complements AGREE II by focusing specifically on the completeness of guideline reporting rather than the quality of the development process ¹⁴.

RIGHT includes 22 items covering basic information (title, executive summary), background (rationale, objectives, target population), evidence (search strategy, assessment methods), recommendations (basis, strength, clarity), review and quality assurance, funding and declaration of interests, and other information.

The distinction from AGREE II is subtle but meaningful: a guideline can be well developed but poorly reported, or thoroughly reported but methodologically weak. Both tools are needed to fully evaluate a guideline, and developers who address both during manuscript preparation save themselves revision cycles during peer review.

IOM standards: The definitional foundation

The Institute of Medicine’s 2011 report Clinical Practice Guidelines We Can Trust established eight standards for trustworthy CPG development that continue to serve as the foundational reference for the field ¹⁵. These standards cover transparency, conflict of interest management, group composition, systematic review, evidence foundations and rating of recommendation strength, articulation of recommendations, external review, and updating.

Two aspects deserve particular attention because they are frequently underappreciated. First, the conflict-of-interest standard requires that the majority of the guideline development group be free of relevant conflicts and that the chair be unconflicted; disclosure alone does not constitute management. Second, the standard on group composition calls for multidisciplinary expertise and inclusion of patient representatives, a requirement that many guideline panels omit.

How to choose the right framework for the right purpose: each serves a different role in guideline development, so use the right tool for the job.

Choosing the right framework for the document type

Not every guidance document requires GRADE. The framework was designed for clinical practice guidelines grounded in systematic evidence review. Consensus statements developed through Delphi or nominal group methodology, where the evidence base is limited and expert judgement is the primary input, do not need, and should not claim, a formal GRADE process. What they do need is a transparent description of their consensus methodology and a clear explanation of how recommendation confidence is conveyed through language ^8,16.

The decision tree is practical: if the methodology supports formal evidence appraisal, use GRADE. If it does not, use declarative recommendation language with implicit strength signalling (“should,” “may be considered,” “is not recommended”) and explain the convention in the methods section so the reader can interpret the gradient.
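The decision tree, together with the language pitfall described earlier, can be sketched as a small check. This is an illustrative sketch only: the function names, the dictionary layout, and the simple phrase matching are assumptions, and a real editorial review would read the methods section rather than search for strings.

```python
# Phrases that, by convention, signal a GRADE-based recommendation.
GRADE_PHRASES = ("we recommend", "we suggest")


def choose_convention(formal_evidence_appraisal):
    """Pick a recommendation-language convention for a guidance document."""
    if formal_evidence_appraisal:
        return {
            "framework": "GRADE",
            "phrases": {"strong": "we recommend", "conditional": "we suggest"},
        }
    return {
        "framework": "consensus",
        "phrases": ("should", "may be considered", "is not recommended"),
        # The convention must be explained in the methods section.
        "explain_in_methods": True,
    }


def implies_grade_without_process(recommendation, formal_evidence_appraisal):
    """Flag GRADE-style wording in a document that did not use GRADE."""
    text = recommendation.lower()
    uses_grade_wording = any(phrase in text for phrase in GRADE_PHRASES)
    return uses_grade_wording and not formal_evidence_appraisal


# A consensus statement that borrows GRADE wording is flagged.
print(implies_grade_without_process(
    "We suggest annual screening in high-risk adults.",
    formal_evidence_appraisal=False,
))  # True
```

The second function captures the mismatch that reviewers look for: the wording of a recommendation should never signal more methodological rigour than the methods section describes.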

It is worth noting that Guyatt himself has emphasised the importance of this pragmatism. GRADE was not designed to make guideline development harder or more exclusive. It was designed to make the reasoning behind recommendations transparent, to give clinicians and patients a way to understand not just what the panel recommends, but how much confidence that recommendation deserves. When the evidence is strong, GRADE provides the framework to demonstrate it. When it is not, honesty about that uncertainty, rather than the false precision of a grading system that was never applied, serves the reader better.

Gordon Guyatt understood, perhaps better than anyone, that the goal of evidence-based medicine was never certainty. It was transparency.

Disclaimer: The mention of specific companies, products, or organizations in this article is for informational purposes only and does not imply endorsement. The companies whose products were referenced were not consulted, involved in the preparation of this content, nor did they provide any funding or compensation.


References

1. G. Guyatt, J. Cairns, D. Churchill, D. Cook, B. Haynes, J. Hirsh, J. Irvine, M. Levine, M. Levine, J. Nishikawa, D. Sackett, P. Brill-Edwards, H. Gerstein, J. Gibson, R. Jaeschke, A. Kerigan, A. Neville, A. Panju, A. Detsky, M. Enkin, P. Frid, M. Gerrity, A. Laupacis, V. Lawrence, J. Menard, V. Moyer, C. Mulrow, P. Links, A. Oxman, J. Sinclair, P. Tugwell, Evidence-Based Medicine: A New Approach to Teaching the Practice of Medicine, JAMA 268 (1992) 2420–2425. https://doi.org/10.1001/JAMA.1992.03490170092032.
2. R. Smith, D. Rennie, Evidence-based medicine – An oral history, JAMA 311 (2014) 365–367. https://doi.org/10.1001/JAMA.2013.286182.
3. R.L. Sur, P. Dahm, History of evidence-based medicine, Indian J. Urol. 27 (2011) 487–489. https://doi.org/10.4103/0970-1591.91438.
4. G.H. Guyatt, Evidence-based medicine, ACP J. Club 114 (1991) A16. https://doi.org/10.7326/ACPJC-1991-114-2-A16.
5. Greatest Medical Advance: Sanitation – CBS News, (n.d.).
6. K. Dickersin, S.E. Straus, L.A. Bero, Evidence based medicine: increasing, not dictating, choice, BMJ 334 (2007) s10–s10. https://doi.org/10.1136/BMJ.39062.639444.94.
7. GRADE Working Group, Grading quality of evidence and strength of recommendations, BMJ 328 (2004) 1490–1494. https://doi.org/10.1136/BMJ.328.7454.1490.
8. G.H. Guyatt, A.D. Oxman, G.E. Vist, R. Kunz, Y. Falck-Ytter, P. Alonso-Coello, H.J. Schünemann, GRADE: an emerging consensus on rating quality of evidence and strength of recommendations, BMJ 336 (2008) 924–926. https://doi.org/10.1136/BMJ.39489.470347.AD.
9. G. Guyatt, A.D. Oxman, E.A. Akl, R. Kunz, G. Vist, J. Brozek, S. Norris, Y. Falck-Ytter, P. Glasziou, H. Debeer, R. Jaeschke, D. Rind, J. Meerpohl, P. Dahm, H.J. Schünemann, GRADE guidelines: 1. Introduction – GRADE evidence profiles and summary of findings tables, J. Clin. Epidemiol. 64 (2011) 383–394. https://doi.org/10.1016/j.jclinepi.2010.04.026.
10. J. Andrews, G. Guyatt, A.D. Oxman, P. Alderson, P. Dahm, Y. Falck-Ytter, M. Nasser, J. Meerpohl, P.N. Post, R. Kunz, J. Brozek, G. Vist, D. Rind, E.A. Akl, H.J. Schünemann, GRADE guidelines: 14. Going from evidence to recommendations: The significance and presentation of recommendations, J. Clin. Epidemiol. 66 (2013) 719–725. https://doi.org/10.1016/j.jclinepi.2012.03.013.
11. J.C. Andrews, H.J. Schünemann, A.D. Oxman, K. Pottie, J.J. Meerpohl, P.A. Coello, D. Rind, V.M. Montori, J.P. Brito, S. Norris, M. Elbarbary, P. Post, M. Nasser, V. Shukla, R. Jaeschke, J. Brozek, B. Djulbegovic, G. Guyatt, GRADE guidelines: 15. Going from evidence to recommendation – Determinants of a recommendation’s direction and strength, J. Clin. Epidemiol. 66 (2013) 726–735. https://doi.org/10.1016/j.jclinepi.2013.02.003.
12. M.C. Brouwers, M.E. Kho, G.P. Browman, J.S. Burgers, F. Cluzeau, G. Feder, B. Fervers, I.D. Graham, J. Grimshaw, S.E. Hanna, P. Littlejohns, J. Makarski, L. Zitzelsberger, AGREE II: advancing guideline development, reporting and evaluation in health care, CMAJ 182 (2010). https://doi.org/10.1503/CMAJ.090449.
13. M.C. Brouwers, M.E. Kho, G.P. Browman, J.S. Burgers, F. Cluzeau, G. Feder, B. Fervers, I.D. Graham, S.E. Hanna, J. Makarski, Development of the AGREE II, part 1: performance, usefulness and areas for improvement, CMAJ 182 (2010) 1045. https://doi.org/10.1503/CMAJ.091714.
14. Y. Chen, K. Yang, A. Marušić, A. Qaseem, J.J. Meerpohl, S. Flottorp, E.A. Akl, H.J. Schünemann, E.S.Y. Chan, Y. Falck-Ytter, F. Ahmed, S. Barber, C. Chen, M. Zhang, B. Xu, J. Tian, F. Song, H. Shang, K. Tang, Q. Wang, S.L. Norris, H. Li, Y. Hu, B. Zhang, H. Shen, L. Jiang, S. Zhai, X. Luo, Y. Ma, A Reporting Tool for Practice Guidelines in Health Care: The RIGHT Statement, Ann. Intern. Med. 166 (2017) 128–132. https://doi.org/10.7326/M16-1565.
15. R. Graham, M. Mancher, D.M. Wolman, S. Greenfield, E. Steinberg (eds.), Clinical Practice Guidelines We Can Trust, National Academies Press, Washington, DC, 2011. https://doi.org/10.17226/13058.
16. Murphy, Black, Lamping, McKee, Sanderson, Askham, Marteau, Consensus development methods, and their use in clinical guideline development, Health Technol. Assess. 2 (1998). https://doi.org/10.3310/HTA2030.
Hippocrates: The “Greek Miracle” in Medicine. Accessed May 12, 2026. https://www.homepages.ucl.ac.uk/~ucgajpd/medicina antiqua/sa_hippint.html
