Lakera Guard Guardrails

Lakera Guard provides real-time visibility and security for GenAI applications. It does this through a combination of Lakera managed and custom guardrails.

Lakera offers guardrails for the following defenses:

  • Prompt Defense - Detect and respond to direct and indirect prompt attacks. This includes jailbreaks, prompt injections and any attempts to manipulate and exploit AI models through malicious or unintentionally troublesome instructions, preventing potential harm to your application.
  • Content Moderation - Ensure your GenAI applications do not violate your organization’s policies by detecting and stopping harmful and unwanted content.
  • Data Leakage Prevention - Safeguard Personally Identifiable Information, prevent system prompt leakage, and avoid costly leakage of sensitive data, ensuring compliance with data protection and privacy regulations.
  • Malicious link detection - Prevent attackers from manipulating the LLM into displaying malicious or phishing links to your users by flagging unknown links.

GenAI faces novel threats

Large Language Models and other generative AI technologies are introducing brand new cybersecurity threats that existing cybersecurity tools can’t address. The number of potential attackers of LLMs is also massively larger than for traditional software: no specialist technical skills are required, so anyone who can write can exploit LLMs using natural language.

This means the attack surface of GenAI applications is orders of magnitude larger than, and fundamentally different from, that of traditional software, and securing it requires a paradigm shift in cybersecurity.

One of the major components of AI security is a real-time AI application firewall for any application using LLMs. This solution integrates into the application and screens any user input or referenced content passed to the LLM, as well as the LLM’s output response. Any threats detected can then be handled in real time, blocking attackers and preventing harm to end users, the application, and the organization running the application.
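
As a rough illustration of this pattern, the Python sketch below wraps an LLM call with a screening step on both the input and the output. The endpoint URL, request shape, response field, and the call_llm helper are assumptions for illustration only; refer to the Lakera Guard API reference for the actual request and response contract.

```python
import os
import requests

# Assumed endpoint and payload shape for illustration only; check the
# Lakera Guard API reference for the actual contract of your deployment.
GUARD_URL = "https://api.lakera.ai/v2/guard"
HEADERS = {"Authorization": f"Bearer {os.environ['LAKERA_GUARD_API_KEY']}"}


def is_flagged(text: str, role: str = "user") -> bool:
    """Screen a piece of content and return True if any guardrail flags it."""
    response = requests.post(
        GUARD_URL,
        json={"messages": [{"role": role, "content": text}]},
        headers=HEADERS,
        timeout=5,
    )
    response.raise_for_status()
    # Assumed response field: a top-level boolean indicating a detection.
    return response.json().get("flagged", False)


def call_llm(prompt: str) -> str:
    """Placeholder for your actual LLM client call."""
    return "LLM response goes here"


def answer(user_input: str) -> str:
    # 1. Screen the input before it ever reaches the LLM.
    if is_flagged(user_input, role="user"):
        return "Sorry, I can't help with that request."

    # 2. Call the LLM as usual.
    llm_output = call_llm(user_input)

    # 3. Screen the output before returning it to the end user.
    if is_flagged(llm_output, role="assistant"):
        return "Sorry, I can't share that response."

    return llm_output
```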

Overview

When securing a GenAI application, follow these general steps to set up defenses:

  1. System prompt - Create a robust and securely written system prompt to ensure the AI behaves securely and as intended. For help, see our guide Crafting Secure System Prompts for LLM and GenAI Applications.
  2. Prompt defenses - Set up prompt defenses for all LLM inputs, both direct user input and referenced content, as even trusted sources can contain inadvertent prompt attacks.
  3. Additional guardrails - Add guardrail defenses, such as content moderation, data leakage prevention, or malicious link detection, to catch dangerous or sensitive content in LLM inputs or outputs in real time.

Lakera managed guardrails

Lakera managed guardrails use a combination of machine learning models and language models with rule-based filters to detect threats within the contents submitted to Lakera Guard for screening.

Guardrails are designed to tackle specific types of threats. Lakera Guard can be customized according to the threat profile of your application by selecting the relevant guardrails to use for screening in the Guard policy.

Lakera’s guardrails are updated on a daily basis to incorporate defenses against new attacks and reduce false positives. We offer the option to fine-tune Lakera guardrails based on customer data or targeted feedback.

For details on each of the guardrails, please see the documentation linked above for each defense.

We are always actively improving our detectors and working to increase accuracy and reduce bias. We are also continuously improving our controls and interface to empower customers to effectively and easily secure their applications. If you experience any issues or would like to provide feedback, please reach out to support@lakera.ai.

Custom guardrails

Customers can enforce their own bespoke security or content policies by using custom guardrails. These enable you to define, in natural language or with regular expressions, the content types, patterns, or specific keywords that you want Lakera Guard to detect and flag.

Within a policy, you can create your own additional custom detectors for content moderation and detecting PII or sensitive data. These use regular expressions to flag specific words, strings or text patterns when screening.

These can be used to add custom defenses, such as preventing your GenAI application from discussing unwanted topics or screening for additional types of sensitive data.

For example, you could create a custom PII detector for internal employee IDs or other forms of national ID. Or you could have a custom content moderation detector that flags any time one of your competitors’ names is mentioned, to avoid your GenAI application being tricked into talking about them.
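
As a purely illustrative example, the snippet below shows the kind of regular expressions such custom detectors might use; the employee ID format and competitor names are invented for this example.

```python
import re

# Hypothetical internal employee ID format: "EMP-" followed by six digits.
employee_id_pattern = re.compile(r"\bEMP-\d{6}\b")

# Hypothetical competitor names to flag in LLM inputs or outputs.
competitor_pattern = re.compile(r"\b(acme corp|globex|initech)\b", re.IGNORECASE)

text = "Please email the report for EMP-204518 and compare us to Globex."
print(bool(employee_id_pattern.search(text)))  # True
print(bool(competitor_pattern.search(text)))   # True
```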

For more information on writing regular expressions, plus guides to creating your own, please see this useful website, or reach out to support@lakera.ai for help.

Fine-tuning guardrails

Lakera Guard guardrails can be customized within your policies to make them more or less aggressive in flagging potential threats. This is done via threshold levels. These set the confidence level the detector needs to reach in order to flag the screened contents.

For example, if you have a high risk tolerance for one use case, you can set a guardrail to only flag very high confidence detections in order to keep false positives low. Or, if you have a use case where you want to be really sure the LLM isn’t manipulated, even at the cost of potential impact on user experience, you could set the guardrail to flag anything that the detector thinks could potentially be a threat.

Lakera Guard uses the following threshold levels, in line with OWASP’s paranoia level definitions for WAFs:

  1. L1 - Lenient, very few false positives, if any.
  2. L2 - Balanced, some false positives.
  3. L3 - Stricter, expect false positives but very low false negatives.
  4. L4 - Paranoid, higher false positives but very few false negatives, if any. This is our default confidence threshold.

Setting a guardrail to a threshold level in the policy means that the detector will flag whenever it has that level of confidence, or higher, that the screened contents contain a threat of that type.

The higher the threshold level the stricter the guardrail will be, reducing the probability that a potential threat slips through but at the potential risk of higher false positives flagging benign interactions.

Note that the threshold levels fine-tune the required confidence of the detector, not the severity of the threat.
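
To make these semantics concrete, here is a small illustrative sketch (not Lakera Guard code) of the “flag at this confidence level or higher” rule, assuming the detector reports its confidence as one of the four bands:

```python
# Illustrative only: models the "flag at this confidence level or higher" rule.
# l1 is the highest-confidence band a detector can report, l4 the lowest.
LEVELS = ["l1", "l2", "l3", "l4"]


def should_flag(detector_confidence: str, policy_threshold: str) -> bool:
    """Flag if the detector's confidence band is at or above the policy threshold."""
    return LEVELS.index(detector_confidence) <= LEVELS.index(policy_threshold)


# A policy set to l2 flags l1 and l2 detections, but not l3 or l4.
assert should_flag("l1", "l2") is True
assert should_flag("l3", "l2") is False
# A policy set to l4 (paranoid) flags every detection band.
assert should_flag("l4", "l4") is True
```

In practice, you set the threshold per guardrail in your Guard policy; the sketch above only mirrors the decision rule described in this section.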

We would love any feedback on the threshold levels to make sure they’re calibrated correctly and give you the control you need for your use cases. If you experience any issues or would like to provide feedback, please reach out to support@lakera.ai.

Allow and Deny Lists

Lakera Guard also provides the ability to create custom allow and deny lists to temporarily override model flagging decisions. This feature helps customers quickly address false positives or false negatives while waiting for model improvements. These lists are designed as a temporary measure for addressing urgent edge cases that impact critical workflows, not as a permanent security solution.

Overriding Lakera Guard’s guardrails with custom lists can introduce security loopholes. We recommend using this feature only as a temporary measure while reporting misclassified prompts to Lakera for robust fixes.

For implementation details, see the Allow and Deny Lists documentation.