Tiered Agentic Oversight: A Hierarchical Multi-Agent System for AI Safety in Healthcare

Main Figure AgentClinic

Overview. We introduce the Tiered Agentic Oversight (TAO) framework, designed to enhance AI safety in healthcare. (Top) Inputs derived from safety benchmarks are processed by a Recruiter Agent, which sources medical agents with different expertise. An Agent Router, instructed to assess potential risks based on the prompt and agent capabilities, determines the appropriate tier for each medical agent. Simpler cases are handled by the lowest tier (Tier 1), while complex or potentially unsafe cases trigger Case Escalation to higher tiers (Tiers 2 and 3), which apply more scrutiny and may incorporate human oversight, as explored in our comparative study design (bottom left). We conducted comprehensive experiments across five healthcare-focused safety benchmarks. The radar plot (bottom right) shows that TAO outperforms the best multi-agent baseline (LLM-Debate) and a reasoning model with CoT (Gemini-2.5 Pro) on 4 out of 5 benchmarks.

Abstract

Agentic systems offer transformative potential for healthcare, but it is imperative that patient safety is prioritized. Although advanced large language models (LLMs) are powerful tools, a single LLM poses safety risks in complex healthcare scenarios because of its limited error detection and because it is a single point of failure. This paper introduces Tiered Agentic Oversight (TAO), a hierarchical multi-agent framework that enhances AI safety through layered, automated supervision. Inspired by clinical hierarchies (e.g., nurse, physician, specialist), TAO routes agents based on task complexity and agent roles. By leveraging automated inter- and intra-tier collaboration and role-playing, TAO creates a robust safety framework. Comprehensive experiments, including ablation studies of agent attribution, human intervention requests, tier configurations, agent capability ordering, and adversarial robustness, demonstrate TAO's superior performance on 4 out of 5 healthcare safety benchmarks compared to single-agent and multi-agent frameworks. Finally, we validate TAO via an auxiliary clinician-in-the-loop study that highlights how agentic oversight can complement human expertise.



Main Result

First, we present a comparative overview of the accuracy achieved by TAO and baseline methods across the five safety benchmarks. On four of the five benchmarks, TAO achieves the best performance, outperforming both single advanced LLMs and conventional multi-agent oversight frameworks. These improvements across diverse safety dimensions underscore the effectiveness of TAO's hierarchical agentic architecture in enhancing AI safety within critical healthcare applications.

The radar plot in the main figure directly compares TAO to baseline multi-agent approaches and to single LLMs with CoT. The larger area covered by TAO relative to these baselines visually reinforces its enhanced safety performance across the benchmark suite. This result suggests that TAO's architectural features (specifically its tiered structure, dynamic routing, and context-aware escalation strategies) are key enablers of improved safety in complex, healthcare-related AI tasks. Strong performance on four of the five safety benchmarks provides evidence for the broad safety enhancements offered by the TAO framework.
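To make the tiered structure and escalation behaviour concrete, the following is a minimal sketch, not the authors' implementation. It assumes each agent has already been assigned to a tier and exposes a hypothetical `assess` callable that returns a verdict and an escalation flag; the names `TierAgent` and `run_tiered_oversight`, the three-tier default, and the rule that an unresolved top tier requests human oversight are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class TierAgent:
    """Hypothetical agent record; `assess` stands in for an LLM call."""
    name: str
    tier: int
    # Returns (verdict, wants_escalation) for a given case description.
    assess: Callable[[str], Tuple[str, bool]]


def run_tiered_oversight(case: str, agents: List[TierAgent], max_tier: int = 3) -> Dict:
    """Walk a case up the tiers until no agent at the current tier escalates.

    Assumption: if the highest tier still escalates, the case is flagged
    for human oversight.
    """
    trace = []
    tier = 1
    while tier <= max_tier:
        tier_agents = [a for a in agents if a.tier == tier]
        if not tier_agents:
            # No agent was routed to this tier; move on to the next one.
            tier += 1
            continue
        results = [(a.name, *a.assess(case)) for a in tier_agents]
        trace.extend((tier, name, verdict) for name, verdict, _ in results)
        if not any(escalate for _, _, escalate in results):
            return {"final_tier": tier, "needs_human": False, "trace": trace}
        tier += 1
    return {"final_tier": max_tier, "needs_human": True, "trace": trace}


# Toy assessors standing in for model calls (purely illustrative).
nurse = TierAgent("nurse", 1, lambda c: ("flag", True) if "overdose" in c else ("ok", False))
physician = TierAgent("physician", 2, lambda c: ("reviewed", False))

simple = run_tiered_oversight("mild headache", [nurse, physician])
risky = run_tiered_oversight("possible overdose", [nurse, physician])
```

In this toy setup the simple case is resolved at Tier 1, while the risky case is escalated once and resolved at Tier 2. A real system would replace the lambda assessors with model calls and would likely use a richer escalation criterion than a single boolean.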

Case Study

The user study involved 6 medical doctors who completed evaluations for all 20 medical triage scenarios and were thus included as qualified participants in this analysis. The evaluation focused on three dimensions: Oversight Necessity, Safety Confidence, and Output Appropriateness. To assess the consistency of expert judgments, we calculated inter-rater reliability (IRR) using the Intraclass Correlation Coefficient (ICC), specifically ICC(3,k) for absolute agreement of the average ratings from our k=6 experts. The ICC(3,k) values, which reflect the reliability of the average expert judgment for each dimension, were as follows:
1) Oversight Necessity: ICC(3,k) = 0.6100
2) Output Appropriateness: ICC(3,k) = 0.2592
3) Safety Confidence: ICC(3,k) = -0.1009
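For readers who want to reproduce this kind of reliability estimate, here is a minimal sketch of an ICC(3,k) computation. It assumes the Shrout and Fleiss two-way mixed-effects, average-measures definition, (BMS - EMS) / BMS, applied to a scenarios-by-raters matrix; the source does not state the exact formula or software used, so the function name `icc3k` and this choice of definition are assumptions.

```python
import numpy as np


def icc3k(ratings):
    """ICC(3,k) under the Shrout-Fleiss definition (an assumption here).

    `ratings` is an (n_scenarios, k_raters) array with no missing values.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()

    # Two-way ANOVA sums of squares without replication.
    ss_total = ((x - grand) ** 2).sum()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()  # between scenarios
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()  # between raters
    ss_err = ss_total - ss_rows - ss_cols                # residual

    bms = ss_rows / (n - 1)             # between-scenario mean square
    ems = ss_err / ((n - 1) * (k - 1))  # residual mean square
    return (bms - ems) / bms


# Raters who differ only by a constant offset give a residual of zero.
consistent = icc3k([[1, 2, 3], [2, 3, 4], [3, 4, 5]])

# Raters who rank the scenarios in conflicting ways can give a negative value.
conflicting = icc3k([[1, 5], [2, 4], [4, 1]])
```

Under this definition the estimate turns negative whenever the residual mean square exceeds the between-scenario mean square, which is how a value such as the -0.1009 reported for Safety Confidence can arise.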

Bias Figure AgentClinic