4.2 Human-in-the-Loop Controls

In the previous section, you learned how to minimize what an agent can do by removing unnecessary tools and narrowing permissions. This section covers a complementary control: for the actions an agent does need to take, you decide upfront which ones it may execute on its own and which ones require a human to approve first.

This is the Human-in-the-Loop (HITL) pattern. At its core, it means inserting explicit human approval checkpoints into an agent's action chain so that the most consequential steps never happen silently.

Why this matters for security: AI agents now take real-world actions — they send emails, delete files, move money, and modify cloud infrastructure. Unlike a chatbot that only generates text, a misbehaving agent can cause immediate, tangible damage. Human-in-the-loop controls ensure that a human reviews a high-risk action before the harm is done.

The Lesson from the Freysa Incident

In late 2024, a crypto AI agent called Freysa was launched with a single, explicit rule baked into its system prompt: "Never transfer money under any circumstances." The rule held for 481 attempts. On attempt 482, a participant used a carefully crafted message that semantically redefined what the transfer function meant. Freysa transferred 13.19 ETH — roughly $47,000 — and reported that it had done so correctly.

The reason this attack worked is also the most important lesson about Human-in-the-Loop controls:

Security controls that live only inside the AI model's prompt can be overridden through language manipulation. Confirmation mechanisms must exist at the application or infrastructure layer — in code that the model cannot rewrite.

A system prompt saying "ask for confirmation before transferring money" is not a HITL control. A confirmation function in your application code that pauses execution and waits for a human response before calling the transfer API — that is a HITL control. The distinction is whether the check can be bypassed by changing what the model is told.
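The distinction can be sketched in a few lines of Python. This is a minimal illustration, not a production design: transfer_funds is a hypothetical stand-in for a real payment API, and ask_human is an assumed callback that your application would wire to a UI or terminal prompt.

```python
from typing import Callable


class ActionDenied(Exception):
    """Raised when the human reviewer rejects a pending action."""


def transfer_funds(to_address: str, amount_eth: float) -> dict:
    # Hypothetical stand-in for the real payment API call.
    return {"status": "sent", "to": to_address, "amount_eth": amount_eth}


def gated_transfer(
    to_address: str,
    amount_eth: float,
    ask_human: Callable[[str], bool],
) -> dict:
    """Run transfer_funds only after ask_human returns True.

    The model can request this tool, but it cannot skip the check:
    the gate is ordinary code that runs outside the model's context.
    """
    summary = f"Transfer {amount_eth} ETH to {to_address}"
    if not ask_human(summary):
        raise ActionDenied(summary)
    return transfer_funds(to_address, amount_eth)
```

The agent is only ever given gated_transfer as a tool, never transfer_funds itself. No wording in a prompt can change what the if statement does.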

Prompt-Level Guard vs. Application-Level HITL


The Three-Tier Action Classification Framework

Not every action needs human review. Requiring approval for a file read would make agents unusably slow and create alert fatigue, a state where reviewers see so many prompts that they start approving all of them without reading any. The goal is to require confirmation only where the risk justifies the friction.

A practical approach is to classify every action an agent can take into one of three tiers and apply different autonomy rules at each tier.

Three-tier risk classification for agent actions

Tier	Characteristics	Action examples	Policy
Tier 1 — Autonomous	Read-only, non-destructive, no external effects, easily reversed or sandboxed	Read a file, query a database with SELECT, search documentation, browse a URL, list directory contents, run tests in a sandbox	Execute immediately. Log the action for audit purposes.
Tier 2 — Log and Notify	Writes to internal data, reversible with effort, no effect outside the system, no communication to external parties	Write or modify a file in the application workspace, add a record to an internal database, post an internal Slack message, create a pull request (not merge it), update a configuration file	Execute, but write a detailed audit entry and send a real-time notification to a monitoring channel. A human can review and roll back.
Tier 3 — Require Explicit Confirmation	Irreversible, has external effects, involves communication outside the system, or touches credentials and infrastructure	Delete a database record, send an email to a customer, make a payment, execute arbitrary shell commands, merge to a production branch, modify cloud infrastructure, read from a secrets manager, export bulk data	Hard block before execution. Surface the action, its parameters, and the agent's reasoning to a human. Wait for explicit approval or denial.

When uncertain about which tier an action belongs to, classify it up. It is safer to require confirmation for a reversible action than to silently allow an irreversible one.
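One way to encode the "classify up" rule is a lookup table that falls back to the strictest tier. The sketch below assumes a hand-maintained mapping; the tool names are hypothetical examples, not part of any particular framework.

```python
from enum import IntEnum


class Tier(IntEnum):
    AUTONOMOUS = 1
    LOG_AND_NOTIFY = 2
    REQUIRE_CONFIRMATION = 3


# Hypothetical tool names, classified by hand.
TOOL_TIERS: dict[str, Tier] = {
    "read_file": Tier.AUTONOMOUS,
    "search_docs": Tier.AUTONOMOUS,
    "write_workspace_file": Tier.LOG_AND_NOTIFY,
    "create_pull_request": Tier.LOG_AND_NOTIFY,
    "send_email": Tier.REQUIRE_CONFIRMATION,
    "delete_record": Tier.REQUIRE_CONFIRMATION,
}


def classify_tool(tool_name: str) -> Tier:
    """Return the tool's tier; unclassified tools get the strictest tier."""
    return TOOL_TIERS.get(tool_name, Tier.REQUIRE_CONFIRMATION)
```

With this default, a newly added tool that nobody has classified yet is gated behind confirmation until someone explicitly assigns it a lower tier.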

Action Classification Decision Flow


Tier 1: Autonomous Actions

Tier 1 actions are safe to execute without human approval. Their defining property is that a mistake produces no lasting side effect: a file read that returns the wrong file can simply be corrected by reading the right one; a sandboxed test that fails causes no production impact.

The control at this tier is logging, not confirmation. Every Tier 1 action should be recorded with enough detail to reconstruct what happened during an incident investigation. Logging also enables anomaly detection: an unusual spike in file-read volume — say, an agent reading hundreds of files per minute — is a signal worth investigating even if each individual read was harmless.
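A read-volume spike like the one described above can be detected with a sliding-window counter. The following is a rough sketch; the default threshold of 100 reads per 60 seconds is an arbitrary assumption, and the caller is assumed to pass in the current time in seconds.

```python
from collections import deque


class ReadRateMonitor:
    """Flags sessions whose read volume exceeds a limit within a time window."""

    def __init__(self, max_reads: int = 100, window_seconds: float = 60.0) -> None:
        self.max_reads = max_reads
        self.window_seconds = window_seconds
        self._timestamps: deque[float] = deque()

    def record_read(self, now: float) -> bool:
        """Record one read at time `now` (seconds); return True on a spike."""
        self._timestamps.append(now)
        # Drop reads that have fallen out of the sliding window
        while self._timestamps and now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_reads
```

In practice the caller would pass time.monotonic() as now and raise an alert, or pause the session, when record_read returns True.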

What to log for Tier 1 actions:

Action name and parameters (for example: read_file("/app/reports/q1.csv"))
Timestamp and session identifier
Whether the action succeeded or failed

Do not log full file contents, query results, or anything that could contain PII or credentials. Logging the name of a file read is useful for auditing. Logging the contents of every file read fills your logs with sensitive data.
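As an illustration, a Tier 1 log entry might be built like this. The field names mirror the Tier 2 example later in this section and are otherwise assumptions, not a required schema.

```python
import json
from datetime import datetime, timezone


def tier1_log_entry(
    action: str, parameters: dict, session_id: str, succeeded: bool
) -> str:
    """Build a JSON audit line for a Tier 1 action.

    Only the action name, its parameters, and the outcome are recorded.
    The function never receives the action's result, so file contents
    and query rows cannot end up in the log by accident.
    """
    return json.dumps({
        "event": "agent_action",
        "tier": 1,
        "action": action,
        "parameters": parameters,
        "session_id": session_id,
        "succeeded": succeeded,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })
```

Parameters can themselves carry sensitive values (for example, a search string containing a customer name), so a real deployment may also need to redact selected parameter fields before logging.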

Tier 2: Log and Notify

Tier 2 actions change state but in ways that a human could detect and reverse. The control is an audit trail plus a real-time notification — the agent acts, but immediately reports what it did to a channel where a human will see it.

A pull request created by an agent is a good example of a Tier 2 action done well. The agent writes code and opens a PR; a developer reviews the PR before merging. No confirmation was needed before the PR was created, but a human must take explicit action before the code reaches production. The agent's work is visible and reviewable before it has real-world consequences.

Practical Tier 2 pattern — writing to internal state:

import logging
import json
from datetime import datetime, timezone

def agent_create_record(record_data: dict, session_id: str) -> dict:
    # Execute the write
    created = db.insert("support_notes", record_data)

    # Audit log — structured so it can be queried and alerted on
    logging.info(json.dumps({
        "event": "agent_action",
        "tier": 2,
        "action": "create_record",
        "table": "support_notes",
        "record_id": created["id"],
        "session_id": session_id,
        "timestamp": datetime.now(timezone.utc).isoformat()
    }))

    # Real-time notification to the monitoring channel
    notify_ops_channel(
        f"Agent created support note {created['id']} in session {session_id}"
    )

    return created

The notification does not block the action — it informs humans in near-real time so they can intervene if something looks wrong. The key requirement is that this notification reaches someone who is actually paying attention, not just a log file that nobody opens.

Tier 3: Require Explicit Confirmation

Tier 3 is where the hard block lives. Before any Tier 3 action executes, the agent must:

Stop execution and surface the pending action to a human.
Display the action clearly: what it is, the exact parameters, and the agent's stated reason for wanting to take it.
Wait for an explicit approve or deny decision.
Respect the decision: if denied, the agent must not find an alternative path to the same outcome.
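The last step, respecting a denial, is easier to enforce in code than to request in a prompt. The sketch below is one possible approach under simplifying assumptions: it remembers each denied tool call per session and blocks exact repeats. The class and method names are hypothetical.

```python
import json


class DenialRegistry:
    """Remembers denied actions per session so retries are blocked in code."""

    def __init__(self) -> None:
        self._denied: set[tuple[str, str, str]] = set()

    @staticmethod
    def _key(session_id: str, tool: str, parameters: dict) -> tuple[str, str, str]:
        # Sort keys so the same parameters always produce the same fingerprint
        return (session_id, tool, json.dumps(parameters, sort_keys=True, default=str))

    def record_denial(self, session_id: str, tool: str, parameters: dict) -> None:
        self._denied.add(self._key(session_id, tool, parameters))

    def is_denied(self, session_id: str, tool: str, parameters: dict) -> bool:
        return self._key(session_id, tool, parameters) in self._denied
```

This only catches identical retries. An agent that reaches the same outcome through a different tool or slightly different parameters would pass this check, so a stricter variant might key denials on the affected resource (for example, the recipient address or file path) rather than on the full call.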

A real incident illustrates the cost of skipping this gate. In June 2025, a vulnerability in Microsoft 365 Copilot (CVE-2025-32711, also known as "EchoLeak") showed how a crafted email could cause Copilot to jump from a low-risk summarization task to exfiltrating data to an external endpoint — with no checkpoint in between. A properly implemented Tier 3 gate on any outbound data transfer would have paused execution before the exfiltration and asked the user: "I am about to send data to an external URL. Do you want to allow this?"

What a Tier 3 confirmation should show the human:

Required information in a Tier 3 confirmation prompt

Field	Example	Why it matters
Action name	Send email	The human needs to know what category of action this is
Exact parameters	To: client@example.com, Subject: Invoice #4821, Body: [preview]	Parameters are where attacks hide — showing them lets the human catch a prompt-injected recipient address
Agent's reasoning	'User asked me to send the invoice summary from this support thread'	Lets the human verify that the agent's interpretation of the task matches their intention
Risk indicator	This action sends a message outside the system and cannot be undone	Helps non-technical users understand why this confirmation exists
Approve / Deny	Two clearly distinct actions, no default selection	Confirmation dialogs with a pre-selected 'OK' are easily bypassed by users who click through without reading
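The fields in the table can be rendered as a plain-text prompt, as in the sketch below. The layout and the names PendingAction, render_confirmation, and parse_decision are illustrative assumptions. The one behavior that matters is that empty input never counts as approval.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PendingAction:
    action_name: str
    parameters: dict
    agent_reasoning: str
    risk_note: str


def render_confirmation(pending: PendingAction) -> str:
    """Format every field the reviewer needs into one prompt."""
    lines = [
        "=== HIGH-RISK ACTION: CONFIRMATION REQUIRED ===",
        f"Action:     {pending.action_name}",
        "Parameters:",
        json.dumps(pending.parameters, indent=2),
        f"Reasoning:  {pending.agent_reasoning}",
        f"Risk:       {pending.risk_note}",
        "Type 'approve' or 'deny'.",
    ]
    return "\n".join(lines)


def parse_decision(raw: str) -> Optional[bool]:
    """Return True for approve, False for deny, None for anything else.

    Empty input returns None, so pressing Enter never approves.
    """
    answer = raw.strip().lower()
    if answer == "approve":
        return True
    if answer == "deny":
        return False
    return None
```

A caller would keep re-prompting while parse_decision returns None, and execute the action only when it returns True.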

Agent That Sends External Messages Without Confirmation

Severity: High

An agent that sends emails or posts external messages without a human approval checkpoint can be manipulated through prompt injection to exfiltrate data from a trusted domain.

Ch 4 · AI Agent Security · OWASP LLM06:2025

The agent calls the email-sending tool directly without any confirmation step. A prompt injection that causes the agent to compose a message also causes it to send immediately.

Implementation Patterns

Pattern 1: Per-Tool Confirmation Callback

The most direct way to implement Tier 3 controls is a callback function that fires before a specific tool executes. The Claude Agent SDK provides a canUseTool callback that receives the tool name and its exact parameters. The tool call does not proceed until the callback returns, so the callback can hold execution while it waits for a human response. This pattern works well for interactive agents where a human is present at the UI.

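// TypeScript: Claude Agent SDK canUseTool callback
import { query } from "@anthropic-ai/claude-agent-sdk";
import * as readline from "node:readline/promises";

// Minimal terminal prompt helper; an interactive UI would replace this
async function promptUser(question: string): Promise<string> {
  const rl = readline.createInterface({ input: process.stdin, output: process.stdout });
  try {
    return await rl.question(question);
  } finally {
    rl.close();
  }
}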

const HIGH_RISK_TOOLS = new Set([
  "send_email",
  "delete_record",
  "execute_shell",
  "modify_infrastructure",
  "read_secret",
]);

for await (const message of query({
  prompt: userInput,
  options: {
    canUseTool: async (toolName, toolInput) => {
      // Tier 1 and 2: allow without prompting
      if (!HIGH_RISK_TOOLS.has(toolName)) {
        return { behavior: "allow", updatedInput: toolInput };
      }

      // Tier 3: surface to human before executing
      console.log("\n--- Agent wants to take a high-risk action ---");
      console.log(`Tool:       ${toolName}`);
      console.log(`Parameters: ${JSON.stringify(toolInput, null, 2)}`);
      console.log("----------------------------------------------");

      const response = await promptUser("Allow this action? (yes/no): ");

      if (response.trim().toLowerCase() === "yes") {
        return { behavior: "allow", updatedInput: toolInput };
      }

      // Deny with a reason so the agent understands and can adapt
      return {
        behavior: "deny",
        message: "User denied this action. Do not retry it.",
      };
    },
  },
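})) {
  // Consume the stream; here only the final result message is printed
  if (message.type === "result") {
    console.log(message);
  }
}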

The callback can also return an updatedInput with modified parameters. For example, a human can approve a file deletion but narrow the path from /data/2024 to /data/2024/tmp so that only temporary files are removed. This "approve with changes" capability makes HITL more powerful than a simple yes/no gate.

Pattern 2: Approval Queue for Background Agents

When an agent runs asynchronously — for example, an overnight batch job, a long-running research task, or an automated pipeline — blocking on real-time human input is not practical. Background agents need an approval queue: when the agent reaches a Tier 3 action, it submits an approval request and pauses that action, then continues with other work while a human reviews and responds at their own pace.

Approval Queue Pattern for Background Agents


Critical implementation requirement: approvals must expire. An approval request sitting in a queue for 48 hours is a security hazard — the context may have changed, the file list may be different, and the circumstances may have shifted entirely. Set a TTL (time-to-live) on every pending approval: a fixed time window after which the request automatically expires. When an approval expires, the agent must re-request it with fresh parameters rather than acting on a stale decision.

# Python: approval queue with TTL enforcement
from datetime import datetime, timedelta, timezone
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ApprovalRequest:
    id: str
    tool_name: str
    parameters: dict
    agent_reasoning: str
    session_id: str
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    # Approvals expire after 2 hours; never act on a stale approval
    ttl_hours: int = 2
    # Stays "pending" until a human sets it to "approved" or "denied"
    status: str = "pending"

    @property
    def expires_at(self) -> datetime:
        return self.created_at + timedelta(hours=self.ttl_hours)

    @property
    def is_expired(self) -> bool:
        return datetime.now(timezone.utc) > self.expires_at

class ApprovalQueue:
    # get(), mark_expired(), and dispatch_tool() depend on your storage
    # backend and tool registry, so they are omitted here.
    def execute_if_approved(self, approval_id: str) -> Optional[dict]:
        request = self.get(approval_id)

        if request is None:
            raise ValueError(f"Approval {approval_id} not found")

        if request.is_expired:
            self.mark_expired(approval_id)
            # Never execute an expired approval
            raise PermissionError(
                f"Approval {approval_id} expired at {request.expires_at.isoformat()}. "
                "The agent must re-request with current parameters."
            )

        # A request that is still pending or was denied must never run
        if request.status != "approved":
            raise PermissionError(
                f"Approval {approval_id} has status '{request.status}', not 'approved'."
            )

        # Only reaches here if approved by a human and within TTL
        return self.dispatch_tool(request.tool_name, request.parameters)

Pattern 3: Dry Run Mode

A dry run (sometimes called preview mode or plan mode) shows a human all the actions an agent plans to take — listed in order with their parameters — before executing any of them. This is especially useful for destructive operations where the full scope of impact is not obvious from a single action in isolation.

Instead of approving one deletion at a time, the human sees "I am planning to delete the following 14 files: [list]" and can review the full plan before anything is committed. If you have used Terraform, this is the same idea behind its plan command: it describes every resource it will create, update, or destroy before apply executes anything. The same principle applies to AI agents — show everything first, then act.

# Python — dry run mode for a batch agent
from typing import NamedTuple

class PlannedAction(NamedTuple):
    tool: str
    parameters: dict
    risk_tier: int
    reasoning: str

def run_agent_with_dry_run(goal: str, dry_run: bool = True) -> list[PlannedAction]:
    """
    In dry_run=True mode, returns the plan without executing.
    The human reviews the plan and re-runs with dry_run=False to execute.
    """
    planned_actions: list[PlannedAction] = []

    for action in agent.plan(goal):
        tier = classify_risk_tier(action.tool, action.parameters)

        if dry_run:
            # Collect the plan — do not execute
            planned_actions.append(PlannedAction(
                tool=action.tool,
                parameters=action.parameters,
                risk_tier=tier,
                reasoning=action.reasoning
            ))
        else:
            # Execute Tier 1 and 2 immediately
            # Tier 3 goes through the approval queue
            if tier < 3:
                action.execute()
            else:
                approval_queue.submit(action)

    if dry_run:
        print_plan_for_human_review(planned_actions)

    return planned_actions

def print_plan_for_human_review(actions: list[PlannedAction]) -> None:
    """Display the proposed plan in a human-readable format."""
    # Build the label lookup once instead of on every loop iteration
    tier_label = {1: "Autonomous", 2: "Log+Notify", 3: "⚠ Requires Confirmation"}
    print(f"\nAgent plan ({len(actions)} actions):\n")
    for i, action in enumerate(actions, 1):
        print(f"  {i}. [{tier_label[action.risk_tier]}] {action.tool}")
        print(f"     Parameters: {action.parameters}")
        print(f"     Reasoning:  {action.reasoning}\n")
    print("To execute this plan, re-run with dry_run=False.")

Avoiding Alert Fatigue

OWASP explicitly identifies alert fatigue as an attack vector against HITL systems. Alert fatigue occurs when confirmation prompts appear so frequently that reviewers start approving them automatically without actually reading them — which defeats the entire purpose of the gate. Adversaries can deliberately exploit this by flooding the review queue with low-stakes requests before slipping in a high-stakes one.

Three design principles prevent alert fatigue:

Tier 3 actions must be rare. If your agent is generating dozens of confirmation requests per session, either the tier classification is wrong (too many actions are classified as Tier 3) or the agent's tool list is too broad and needs pruning (Section 4.1).
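A simple per-session counter can make this principle measurable. The sketch below is an assumption-laden illustration: the budget of 5 requests per session is arbitrary, and what happens when the budget is exceeded (alerting, pausing the session) is left to the caller.

```python
from collections import defaultdict


class Tier3RateMonitor:
    """Counts Tier 3 confirmation requests per session."""

    def __init__(self, max_per_session: int = 5) -> None:
        self.max_per_session = max_per_session
        self._counts: dict[str, int] = defaultdict(int)

    def record_request(self, session_id: str) -> bool:
        """Count one Tier 3 request; return True if the session is over budget."""
        self._counts[session_id] += 1
        return self._counts[session_id] > self.max_per_session
```

A session that trips this check is a cue to revisit the tier classification or the tool list. It can also indicate a deliberate attempt to flood the reviewer before a high-stakes request.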

Make Tier 3 confirmations visually distinct. High-risk confirmations should not look the same as routine notifications. A red border, a distinct sound, or a different layout are all reasonable ways to signal that this prompt requires real attention. Pre-selecting "OK" as the default is an anti-pattern — it makes accidental approval too easy.

Apply progressive trust to repeated, well-understood actions. For actions that have been reviewed and approved many times without incident, consider relaxing them from Tier 3 to Tier 2 with enhanced logging. This reduces the reviewer's burden while preserving the audit trail. The key constraint is that a human administrator must grant this change explicitly; the agent should never be able to raise its own trust level.
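The administrator-only constraint can be enforced in the same place the tiers are stored. The sketch below is a hypothetical design: TierPolicy, relax, and the admin list are illustrative names, and granted_by is assumed to come from an authenticated admin session, never from agent output.

```python
from dataclasses import dataclass, field


@dataclass
class TierPolicy:
    base_tiers: dict[str, int]
    admins: frozenset[str]
    overrides: dict[str, int] = field(default_factory=dict)

    def relax(self, tool: str, granted_by: str) -> None:
        """Move a Tier 3 tool to Tier 2; only a listed admin may do this."""
        if granted_by not in self.admins:
            raise PermissionError(f"{granted_by} may not change trust tiers")
        if self.base_tiers.get(tool) != 3:
            raise ValueError(f"{tool} is not a Tier 3 tool")
        self.overrides[tool] = 2

    def tier_for(self, tool: str) -> int:
        # Unknown tools fall back to the strictest tier
        return self.overrides.get(tool, self.base_tiers.get(tool, 3))
```

Because the agent has no tool that calls relax, and the caller identity is checked against the admin list, a prompt injection cannot talk the system into a lower tier.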

What HITL Does Not Protect Against

Human-in-the-loop controls are powerful but not a complete solution on their own. Understanding their limits helps you apply them correctly.

HITL does not replace Least Privilege. If an agent has Tier 1 read access to your entire database, an attacker who triggers repeated reads can still exfiltrate data silently — no Tier 3 gate is ever triggered. Least Privilege (Section 4.1) limits what the agent can access; HITL controls what it can change. Both layers are necessary.

HITL does not prevent all prompt injection damage. A successful prompt injection that triggers only Tier 1 and Tier 2 actions can still cause harm: reading sensitive files, creating misleading internal records, or draining API quotas all fall within tiers that execute without prior approval. Defense in depth, meaning input validation, output filtering, and context isolation (Section 4.4), is needed alongside HITL.

HITL confirmation quality depends on human attention. A reviewer who approves a deletion request without reading the parameters is not providing meaningful oversight. Confirmation dialogs should display the predicted impact of the action (for example, "this will permanently delete 847 records"), not just the action name. The goal is to provide the reviewer with enough information to make a genuine decision in the time available.
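A predicted-impact message can often be computed with a read-only query before the destructive one runs. The sketch below uses Python's built-in sqlite3 module. The support_notes table and its customer_id column are assumptions made for illustration.

```python
import sqlite3


def describe_delete_impact(conn: sqlite3.Connection, customer_id: int) -> str:
    """Count the rows a delete would remove, without deleting anything."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM support_notes WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    noun = "record" if count == 1 else "records"
    return f"This will permanently delete {count} {noun} from support_notes."
```

The returned sentence would be shown in the Tier 3 confirmation next to the exact parameters, so the reviewer sees the scale of the change and not just the action name.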

Summary: HITL Controls Checklist

Human-in-the-Loop implementation checklist

Item	What to verify
Classify every tool	Each tool in the agent's list is assigned a risk tier. The classification is reviewed when the agent's purpose changes.
Tier 3 gate is in application code	Confirmation for high-risk actions is enforced by a function in your code — not by a prompt instruction that a model can be talked out of.
Confirmation shows exact parameters	The human reviewer sees the recipient address, file paths, SQL query, or API payload — not just the action name.
Approval requests have TTLs	Pending approvals expire after a defined window. Expired approvals cannot be acted on without re-requesting.
Background agents use an approval queue	Asynchronous agents do not silently skip confirmation for Tier 3 actions just because no human is watching in real time.
Dry run is available for batch operations	Agents that will take many sequential actions can show the full plan before executing any of it.
Tier 3 confirmations look different	High-risk confirmation prompts are visually distinct from routine notifications. No pre-selected 'OK' button.
Denials are final	When a human denies a Tier 3 action, the agent does not find an alternative path to the same outcome in the same session.

Section 4.3 covers how HITL controls extend to multi-agent systems, where one AI orchestrates others and the question of whose confirmation counts becomes more complex.
