Leveraging Large Language Models for More Efficient Systematic Reviews in Medicine and Beyond

March 25, 2025


New work from Vector Applied Scientist David Emerson highlights how generalist large language models (LLMs) may be used to automate systematic review screening. “Development of prompt templates for LLM-driven screening in systematic reviews,” co-authored by Emerson, Christian Cao, Jason Sang, and Rohit Arora, demonstrates how sophisticated prompting strategies can dramatically enhance LLM performance for text classification tasks critical to systematic review processes. By creating generalized prompting templates (as opposed to review-specific solutions), the researchers establish an accessible approach to systematic review automation that delivers significant cost and time savings.


Prior applications of LLMs to systematic review screening and related text classification tasks relied primarily on zero-shot prompting, an approach that substantially understates what LLMs can do on these downstream tasks. As a result, LLMs had until now been characterized as incapable of performing these tasks well enough to be used effectively for systematic reviews. By applying prompting best practices, along with novel prompting techniques, the authors were able to draw out the full capabilities of LLMs.

Background and Motivation

In medicine, systematic reviews (SRs) summarize the results of clinical and research studies, providing evidence of an intervention’s effectiveness. They are the gold standard for evidence-based practice, but they are exceptionally resource-intensive, typically requiring a year and over $100,000 to complete. The goal of an SR is to sift through large collections of research, identify relevant articles, and synthesize insights that, taken together, answer an important and specific question. A simplified example might be, “Does intervention X improve patient outcomes?” The initial screening phase of an SR is particularly demanding, requiring two investigators to independently review articles against eligibility criteria in two stages. First, investigators review article abstracts as a first-pass filter. Second, they review the entire article (‘full text’) to determine whether it meets the predetermined criteria for final inclusion.

Despite existing tools, SR automation remains elusive as current solutions only supplement human workflows, lack necessary performance for independent decision-making, and require extensive historical training data. The emergence of LLMs presents new opportunities for SR screening automation, potentially reducing time and resource requirements significantly.

Methodology

In this work, the researchers developed generic prompt templates for LLM-driven systematic review screening for both abstracts and full-text articles. They first created BenchSR, a database of 10 previously published systematic reviews spanning nine clinical domains, which served as the testing ground. Through iterative experimentation with various prompting techniques, they developed “Framework Chain-of-Thought,” a novel approach that guides LLMs to reason systematically against predefined criteria—mimicking human cognitive processes by evaluating each criterion before making final decisions. For full-text screening, they discovered that repeating prompt instructions at both the beginning and end of prompts significantly improved performance with lengthy documents, effectively addressing the “lost-in-the-middle” phenomenon where LLMs typically struggle with retaining information from the middle of large texts. These insights culminated in two optimized templates: “Abstract ScreenPrompt” for abstract screening and “ISO-ScreenPrompt” (Instruction-Structure-Optimized) for full-text screening, both designed to maximize sensitivity while maintaining acceptable specificity across diverse systematic review types.
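
The paper’s published templates are not reproduced here, but a minimal sketch may help illustrate the two ideas described above: a Framework Chain-of-Thought structure that walks the model through each eligibility criterion before a final decision, and repeating the instructions before and after a long document. The function names and wording below are illustrative assumptions, not the Abstract ScreenPrompt or ISO-ScreenPrompt templates themselves.

```python
# Illustrative sketch only: a Framework Chain-of-Thought style screening
# prompt that asks the model to assess each eligibility criterion in turn
# before a final INCLUDE/EXCLUDE decision. Wording and names are assumptions,
# not the published Abstract ScreenPrompt or ISO-ScreenPrompt templates.

def build_screening_prompt(criteria: list[str], article_text: str,
                           repeat_instructions: bool = True) -> str:
    criteria_block = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(criteria))
    instructions = (
        "You are screening an article for a systematic review.\n"
        "Assess the article against EACH criterion below, one at a time,\n"
        "stating whether it is met, not met, or unclear. Then give a final\n"
        "decision: INCLUDE or EXCLUDE.\n"
        f"Eligibility criteria:\n{criteria_block}"
    )
    parts = [instructions, "Article:\n" + article_text]
    if repeat_instructions:
        # Repeating the instructions after a long document is meant to
        # mitigate the "lost-in-the-middle" effect described above.
        parts.append(instructions)
    return "\n\n".join(parts)


prompt = build_screening_prompt(
    criteria=["Reports a randomized controlled trial",
              "Enrols adult participants",
              "Measures the outcome of interest"],
    article_text="<title, abstract, or full text goes here>",
)
```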

Experimental Setup

The study compared multiple prompting approaches including zero-shot, few-shot, and Chain-of-Thought techniques across a wide array of LLMs (GPT-3.5, GPT-4 variants, Gemini Pro, Mixtral, Mistral, and Claude-3.5-Sonnet). Performance was evaluated using accuracy, sensitivity, and specificity metrics. The final human screening decisions from the original systematic reviews served as the gold standard reference against which all model decisions were compared, for both the abstract and full-article screening phases. The experimental protocol followed training-validation-testing workflows: prompts were initially optimized using training samples from a single systematic review (SeroTracker), validated on a separate SeroTracker sample, and finally thoroughly tested on a SeroTracker test set and nine additional systematic review datasets. For abstract screening, researchers examined the complete set of titles/abstracts from original searches, while full-text screening evaluated all freely accessible PubMed Central articles. The team also conducted time and cost analyses comparing LLM-based screening with traditional human approaches, providing practical implementation insights for research teams.
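
For readers less familiar with these metrics, sensitivity is the fraction of truly eligible articles the model correctly includes, and specificity is the fraction of ineligible articles it correctly excludes, both measured against the human gold standard. A minimal sketch of that computation, with illustrative names and toy data, might look like this:

```python
# Illustrative sketch: score LLM include/exclude decisions against the human
# gold standard. A "positive" is an article the human reviewers included.

def screening_metrics(llm_includes: list[bool], human_includes: list[bool]) -> dict:
    pairs = list(zip(llm_includes, human_includes))
    tp = sum(m and h for m, h in pairs)          # correctly included
    tn = sum(not m and not h for m, h in pairs)  # correctly excluded
    fp = sum(m and not h for m, h in pairs)      # included but ineligible
    fn = sum(not m and h for m, h in pairs)      # missed eligible article
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Toy example with made-up decisions for five articles
print(screening_metrics([True, True, False, False, True],
                        [True, False, False, False, True]))
```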

Key Findings

  • The optimized Abstract ScreenPrompt template achieved high performance across 10 diverse reviews (97.7% weighted sensitivity, 85.2% weighted specificity), significantly outperforming zero-shot prompting (49.0% weighted sensitivity, 97.9% weighted specificity) and previous screening tools. High sensitivity with acceptable specificity is the priority for abstract screening, since a falsely excluded article is lost to the review while a falsely included one is simply caught at the next stage. (A sketch of one way to pool per-review figures into weighted averages follows this list.)
  • The ISO-ScreenPrompt template for full-text screening demonstrated similar high performance (96.5% weighted sensitivity, 91.2% weighted specificity). 
  • Performance of both abstract and full-text prompt templates surpassed previous literature estimates of single human-reviewer performance (86.6% sensitivity and 79.2% specificity). 
  • LLM-based screening substantially reduced costs and time requirements. Depending on the SR, Abstract ScreenPrompt cost $16.74-$157.02 versus $194.83-$1,666.67 for single-human abstract screening, while ISO-ScreenPrompt cost $14.53-$622.12 versus $676.35-$25,956.40 for human full-text screening. Both LLM approaches completed screening within 24 hours, compared with 9.74-83.33 hours (abstracts) and 33.82-1,297.82 hours (full texts) for human reviewers.
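
The post does not spell out how the per-review results are combined into the weighted figures above; one plausible convention, assumed here purely for illustration, is to weight each review’s sensitivity by its number of truly eligible articles (and its specificity by its number of ineligible ones):

```python
# Illustrative assumption, not the paper's exact method: pool per-review
# sensitivities into one weighted figure by weighting each review by its
# number of truly eligible articles (use ineligible counts for specificity).

def weighted_metric(per_review_values: list[float], weights: list[int]) -> float:
    return sum(v * w for v, w in zip(per_review_values, weights)) / sum(weights)


# Three hypothetical reviews with 40, 120, and 15 eligible articles
print(weighted_metric([0.95, 0.99, 0.93], [40, 120, 15]))  # ~0.976
```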

Conclusion and Implications

This research demonstrates that well-engineered LLM prompts can achieve high sensitivity and specificity for systematic review screening across diverse reviews without requiring model fine-tuning or labeled training data. The study offers immediate implementation pathways: LLMs can serve as independent single reviewers, complement human reviewers to halve the screening workload, or act as pre-screening tools that reduce human screening volume by 66-95%. Future research will validate these templates across a broader spectrum of systematic reviews, evaluate their performance against human reviewers in prospective studies, and explore applying similar prompting techniques to other criteria-based tasks in the medical sciences.
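
As one concrete illustration of the pre-screening pathway, a team could let the LLM exclude clearly irrelevant records and send only the LLM-included ones to human reviewers. The sketch below assumes a hypothetical llm_screen callable wrapping whichever LLM API and prompt template a team adopts; it is not code from the paper.

```python
# Illustrative sketch of the pre-screening pathway: the LLM screens every
# record first and humans review only what the LLM would include. The
# llm_screen callable is a hypothetical wrapper around an LLM call using a
# template such as Abstract ScreenPrompt; it is not an API from the paper.
from typing import Callable


def prescreen(records: list[dict],
              llm_screen: Callable[[str], bool]) -> tuple[list[dict], list[dict]]:
    for_humans, auto_excluded = [], []
    for record in records:
        if llm_screen(record["abstract"]):  # True means the LLM would include it
            for_humans.append(record)
        else:
            auto_excluded.append(record)
    return for_humans, auto_excluded
```

The fraction of records landing in auto_excluded is what drives the 66-95% reduction in human screening volume cited above.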

Created by AI, edited by humans, about AI

This blog post is part of our ‘ANDERS – AI Noteworthy Developments Explained & Research Simplified’ series. Here we use AI agents to create initial drafts from research papers, which are then carefully edited and refined by our human editors. The goal is to bring you clear, concise explanations of cutting-edge research conducted by Vector researchers. Through ANDERS, we strive to bridge the gap between complex scientific advancements and everyday understanding, highlighting why these developments are important and how they impact our world.
