Agent Behavior Defense
Agent Behavior Defense protects agent behavior at runtime. Where Prompt Defense, Content Moderation, and Data Leakage Prevention screen the content flowing through an agent, Agent Behavior Defense screens what the agent is doing: whether its tool use serves the user’s intent, and whether the tools it calls are allowed at all.
It contains two capabilities, both configured in your policy and enforced through Guard API screening requests:
- The Off-Task Action detector
- The Tool Allow/Deny List
Off-Task Action detector
The Off-Task Action detector flags tool calls that are inconsistent with the user’s intent in the conversation. With agents, the same action can be legitimate in one moment and harmful in another: send_email is fine when the user asked for a report to be shared, and not fine when a poisoned tool response triggered it. The detector judges each tool call against the conversation history rather than against a fixed rule.
For example, with conversation history showing a user asking about flight booking:
- A
tool_callforsearch_flightsis consistent with the user’s intent and is not flagged. - A
tool_callfortransfer_fundsdoes not serve the user’s request and is flagged as off-task, with reason text explaining the inconsistency.
To use it, enable the Off-Task Action detector in your policy and include the conversation history and the tool call in your Guard API request. Flagged events appear in the logs with the reason text, and are counted in analytics.
The Off-Task Action detector ships in a conservative configuration: it favors a low false-positive rate over catching every marginal case. Run it in monitoring mode first and review what it flags on your real traffic before enforcing.
Tool Allow/Deny List
The Tool Allow/Deny List enforces which tools an agent may call at runtime, at the moment of tool invocation:
- Allow list: tool calls for any tool not on the list are flagged. Use this when an agent has a known, fixed set of tools it should ever use.
- Deny list: tool calls for tools on the list are flagged. Use this to block specific high-risk tools while leaving the rest unrestricted.
The outcome is deterministic: a denied tool call is flagged regardless of content, and the reason text names the tool and the list that caused the flag. Denied tools should be flagged consistently every time, so a useful rollout check is attempting a denied tool call, confirming it is flagged, and attempting a permitted one, confirming it is not.
The Tool Allow/Deny List controls which tools an agent may call at runtime. It is separate from the content Allow and Deny Lists, which override flagging decisions for specific screened content.
Screening tool responses and tool descriptions
Prompt attacks against agents often arrive through tools rather than through the user: a poisoned tool response, or a malicious instruction embedded in a tool’s description. The existing guardrails extend to these interaction points:
- Tool responses: pass tool results as
toolrole messages in the Guard API request. Tool and developer messages are screened as untrusted content, so Prompt Defense and Data Leakage detection run on them according to your policy. - Tool descriptions: tool definitions can carry injected instructions. Screen them by passing the description as content in a Guard API call, for example when a new tool or MCP server is added, rather than on every interaction.
See Agent and Tool Integration for the message format.
Rolling out enforcement
Running the evaluation against your own agents follows the same staged approach as the other guardrails:
- Observe first: run runtime protection in monitoring mode, where detections are logged without blocking, and review what is flagged against real traffic.
- Tune: adjust the policy and detector thresholds to your traffic and data patterns, and report false positives so the models can be calibrated to your use case.
- Enforce: once detection accuracy is validated, act on flags to block or modify interactions.
At production scale, with calibration, customers typically see a false-positive rate below 0.5%. Accuracy measured on a small or untuned setup is not representative: early results improve significantly with policy tuning and calibration cycles.
Detection results, including Off-Task Action and Tool Allow/Deny List flags, are returned in the Guard API response (use "breakdown": true for per-detector results), appear in the logs and analytics, and can be exported to your SIEM.