AI Agent Security Evaluation
This guide sets out best practices for evaluating Check Point AI Agent Security: how to understand what the product covers, test its capabilities effectively, and run a proof-of-value that gathers the information needed for a buying decision. For evaluating the runtime guardrails on their own, see the AI Guardrails Evaluation; this guide adds the agent-specific layers on top.
AI Agent Security is an early access release. Threat coverage and risk frameworks are expanding quickly, so a proof-of-value should assess the product's trajectory as well as its current state. Evaluations are typically run in collaboration with Check Point, so new coverage often lands during the evaluation window.
What a proof-of-value should answer
- Does discovery find the agents you actually have, across the agent platforms and cloud you use?
- Do the risk assessments surface real risks, with explanations your team can act on?
- Are the risk ratings and contributing factors consistent with your own reading of each agent?
- Does runtime protection detect the attacks and policy violations relevant to your use cases, at acceptable false-positive rates?
- Can your team operate it: policies, projects, monitoring, and escalation?
- Does it integrate with your architecture without disrupting how agents are built and deployed?
Defining what a good answer looks like for each, before testing begins, keeps the evaluation focused.
The two layers under evaluation
- Posture: the structural state of an agent — its tools and toolsets, connected MCP servers, model, authentication, and level of autonomy. Assessed from configuration. Covered by discovery and risk assessment.
- Runtime: live behavior — protection at agent runtime through the Guard API, with policies configured per use case.
Keep results attributed to the right layer: posture findings tell you about standing risk, runtime findings about live protection. The two are evaluated differently and produce different kinds of evidence.
Setting up the evaluation
- Scope: choose the platforms and agents that matter most. Starting with the agents that carry your most sensitive data or hold the broadest permissions tends to produce the most informative results.
- Access: connect each platform under AI Integrations — a cloud role, an OAuth authorization, or an API key per platform — and create a project with a policy for runtime testing.
- Stage the rollout: run runtime protection in monitoring mode first, so detections are logged without blocking, and review what is flagged against real traffic before enforcing. Early, untuned results are not representative; expect calibration cycles.
Evaluating discovery and visibility
The core test is inventory against ground truth:
- Connect the platforms you use and let discovery complete.
- Compare the inventory against what you believe you have on each platform. Unexpected agents are often the most valuable output: they usually indicate ownership gaps rather than tooling errors.
- Open a sample of agents and confirm the captured detail: owner, tools, connected MCP servers, model, and recent activity.
- Review the MCP and tool inventory, including any external or remote servers your agents connect to.
Discovery depth varies by platform because it depends on what each platform’s APIs expose — see Agent Discovery for per-platform notes. Useful questions to ask: which platforms discover continuously versus on a scheduled scan, and which agents in your estate sit on platforms that are not yet connected.
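The inventory-against-ground-truth comparison can be sketched as a simple set difference. This is a minimal illustration, assuming you can export both lists as agent names; the function name, bucket labels, and agent names are hypothetical and are not part of the product.

```python
def compare_inventory(expected: set[str], discovered: set[str]) -> dict[str, list[str]]:
    """Bucket agents by whether they appear in your own list, the discovered list, or both."""
    return {
        "matched": sorted(expected & discovered),
        "missed_by_discovery": sorted(expected - discovered),
        "unexpected": sorted(discovered - expected),
    }


# Hypothetical agent names, for illustration only.
expected = {"billing-assistant", "hr-helpdesk", "sales-copilot"}
discovered = {"billing-assistant", "sales-copilot", "legacy-export-bot"}

report = compare_inventory(expected, discovered)
for bucket, names in report.items():
    print(f"{bucket}: {names}")
```

Agents in the "unexpected" bucket are candidates for an ownership review; agents in "missed_by_discovery" are worth raising as questions about platform coverage or scan timing.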
Evaluating risk assessment
- Per-agent risk review: select a few agents you understand well. Confirm the risks surfaced are ones you expected, that the explanations match your reading of the agent, and that nothing you consider risky about the agent is missing. See Risk Assessment for the risk types surfaced.
- Toxic combinations: check whether agents with the lethal trifecta (simultaneous access to confidential data, exposure to untrusted content, and the ability to communicate externally) are identified and clearly explained.
- Prioritizing across agents: use the risk-types view to confirm the cross-agent rollup supports prioritization: severity, the number of agents affected, and the external-framework mapping (OWASP, MITRE ATLAS) for each risk type.
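To check the toxic-combination results independently, it can help to write down your own reading of each agent first and derive which ones you expect to be flagged. The sketch below assumes three yes/no judgments per agent; the class, field names, and agents are hypothetical and do not reflect how the product models posture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentPosture:
    """Your own assessment of one agent (hypothetical fields, not a product schema)."""
    name: str
    reads_confidential_data: bool
    ingests_untrusted_content: bool
    communicates_externally: bool


def has_lethal_trifecta(agent: AgentPosture) -> bool:
    """True only when all three conditions hold at the same time."""
    return (
        agent.reads_confidential_data
        and agent.ingests_untrusted_content
        and agent.communicates_externally
    )


# Hypothetical agents, for illustration only.
agents = [
    AgentPosture("support-triage", True, True, True),
    AgentPosture("internal-wiki-search", True, False, False),
    AgentPosture("web-summarizer", False, True, True),
]

expected_flagged = [a.name for a in agents if has_lethal_trifecta(a)]
print(expected_flagged)
```

Comparing this expected list with what the risk assessment reports gives a concrete check on both missed and over-reported combinations.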
Evaluating runtime protection
Runtime evaluation follows the AI Guardrails Evaluation for the core guardrails: prompt attack detection, content moderation, data leakage, benign-data false-positive testing, and latency. For agents, extend that evaluation to the agent interaction points:
- Integration and screening flow: integrate a Guard API call at the agent’s interaction points and confirm the screening flow works end to end — see Agent and Tool Integration.
- Prompt injection in tool responses: enable Prompt Defense on the tool role and confirm injected instructions in tool results are flagged.
- Data leakage in tool use: enable Data Leakage detection on tool interactions and confirm sensitive data passing through tools is flagged.
- Off-Task Action: with conversation history on one topic, confirm a clearly unrelated tool call is flagged as off-task with reason text, and that an on-task tool call is not. See Agent Behavior Defense.
- Tool Allow/Deny List: configure an allow or deny list and confirm denied tool calls are flagged deterministically and permitted ones pass.
- Accuracy and latency: measure detection rates and false-positive rates with the policy you intend to use in production, and judge them against your acceptable thresholds. At production scale, with calibration, customers typically see a false-positive rate below 0.5%. For latency, compare your measurements against the documented p95 reference figures, noting the content length and detector count behind each figure.
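The accuracy and latency measurements can be summarized with a few lines of analysis code. This is a minimal sketch, assuming you have recorded a ground-truth label, the screening outcome, and the latency for each test item; the record fields and sample values are hypothetical, and the p95 here uses the nearest-rank method, which may differ from how the reference figures were computed.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreeningResult:
    is_attack: bool    # ground-truth label from your own test set
    flagged: bool      # whether screening flagged the item
    latency_ms: float  # measured round-trip latency


def detection_rate(results: list[ScreeningResult]) -> float:
    """Share of ground-truth attacks that were flagged."""
    attacks = [r for r in results if r.is_attack]
    return sum(r.flagged for r in attacks) / len(attacks)


def false_positive_rate(results: list[ScreeningResult]) -> float:
    """Share of benign items that were flagged."""
    benign = [r for r in results if not r.is_attack]
    return sum(r.flagged for r in benign) / len(benign)


def p95(values: list[float]) -> float:
    """95th percentile using the nearest-rank method."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


# Hypothetical results, for illustration only; real runs need far more data.
results = [
    ScreeningResult(True, True, 10.0),
    ScreeningResult(True, True, 20.0),
    ScreeningResult(True, True, 30.0),
    ScreeningResult(True, True, 40.0),
    ScreeningResult(True, False, 50.0),
    ScreeningResult(False, False, 60.0),
    ScreeningResult(False, False, 70.0),
    ScreeningResult(False, False, 80.0),
    ScreeningResult(False, False, 90.0),
    ScreeningResult(False, True, 100.0),
]

summary = {
    "detection_rate": detection_rate(results),
    "false_positive_rate": false_positive_rate(results),
    "p95_latency_ms": p95([r.latency_ms for r in results]),
}
print(summary)
```

Keeping the policy, content length, and detector count alongside each summary makes later comparisons between runs meaningful.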
Avoiding misleading results
The most common ways an evaluation produces the wrong picture:
- Small or hand-picked test datasets: accuracy and false-positive figures from a handful of examples are unreliable; use representative data.
- Testing with the default policy: it is intentionally strict. Configure a policy before testing accuracy and false positives.
- Mixing system instructions into user content: the most common cause of false positives. Screen clean, original content with correct message roles.
- Reading early posture results as final: discovery and risk coverage expand during the evaluation; re-test before drawing conclusions.
- Testing only attacks: run benign traffic too, or the false-positive picture is missing entirely.
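The point about small datasets can be made concrete with a confidence interval. The sketch below uses the Wilson score interval, a standard statistical method chosen here for illustration rather than anything prescribed by the product, to show how wide the plausible range for a false-positive rate remains when the benign sample is small.

```python
import math


def wilson_interval(false_positives: int, benign_total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% confidence."""
    if benign_total <= 0:
        raise ValueError("benign_total must be positive")
    p = false_positives / benign_total
    denom = 1 + z**2 / benign_total
    center = (p + z**2 / (2 * benign_total)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / benign_total + z**2 / (4 * benign_total**2)
    ) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Zero observed false positives, at two different benign sample sizes.
small_low, small_high = wilson_interval(0, 20)
large_low, large_high = wilson_interval(0, 2000)

print(f"20 benign samples:   upper bound {small_high:.2%}")
print(f"2000 benign samples: upper bound {large_high:.2%}")
```

With zero false positives observed in 20 benign samples, the upper bound is still around 16%, so the result says little about a target such as 0.5%. With 2000 benign samples and the same observation, the upper bound falls below 0.5%.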
Producing the decision
A completed evaluation should leave you with a clear answer to each of the proof-of-value questions above, backed by evidence:
- Evidence that your platforms connected successfully, with the discovered inventory compared against ground truth.
- A list of agents and risks discovered, including any you were not previously tracking.
- A view of whether the risk explanations were accurate and actionable for your team.
- Measured detection, false-positive, and latency results for runtime protection on representative data, with the policy, content length, and detector count stated.
- A view on day-to-day operability: policy management, monitoring, alerting, and reporting.
Need help?
Evaluations are typically run with hands-on support from Check Point, including evaluation frameworks, policy recommendations, and calibration. Contact our team to plan an evaluation.