Lakera Guard Guardrails

Lakera Guard is a real-time GenAI application firewall that protects your GenAI applications and users through four built-in guardrails:

  • Prompt Defense - Detect and respond to direct and indirect prompt attacks. This includes jailbreaks, prompt injections and any attempts to manipulate and exploit AI models through malicious or unintentionally troublesome instructions, preventing potential harm to your application.
  • Content Moderation - Ensure your GenAI applications do not violate your organization’s policies by detecting harmful and unwanted content.
  • Data Leakage Prevention - Safeguard Personally Identifiable Information, prevent system prompt leakage, and avoid costly leakage of sensitive data, ensuring compliance with data protection and privacy regulations.
  • Unknown links detection - Prevent attackers manipulating the LLM into displaying malicious or phishing links to your users.

GenAI faces novel threats

Large Language Models and other generative AI technologies are introducing brand new cybersecurity threats that existing cybersecurity tools can’t address. The number of potential attackers of LLMs is also massively larger than traditional software since you don’t need specialist technical skills, anyone who can write can exploit LLMs just using natural language.

This means the attack surface of GenAIs is orders of magnitudes larger and fundamentally different than traditional software and requires a paradigm shift in cybersecurity to secure.

One of the major components of AI security is a real-time AI application firewall for any applications using LLMs. This solution integrates into the application and screens any user or reference contents passed into an LLM and the output response from the LLM. Any threats detected can then be handled in real-time, blocking attackers and preventing harm to end users, the application, and to the organization running the application.

Lakera Guard detectors

Each Lakera Guard defense consists of detectors that use a combination of machine learning models and rule-based filters to detect threats within the contents submitted to Lakera Guard for screening.

Detectors are designed to tackle specific types of threats. Lakera Guard can be customized according to the threat profile of your application by setting the relevant detectors to use in screening in the Guard policy.

For details on each of the detectors, please see the documentation linked above for each defense.

We are always actively improving our detectors and working to increase accuracy and reduce bias. We are also continuously improving our controls and interface to empower customers to effectively and easily secure their applications. If you experience any issues or would like to provide feedback, please reach out to support@lakera.ai.

Fine-tuning detectors

Lakera Guard detectors can be customized within your policies to make them more or less aggressive in flagging potential threats. This is done via threshold levels. These set the confidence level the detector needs to reach in order to flag the screened contents.

For example, if you had a high risk tolerance for one use case you can set a detector to only flag very high confidence detections in order to have low false positives. Or, if you had a use case where you wanted to be really sure the LLM wasn’t manipulated, even at the cost of potential impact on user experience, you could set the detector to flag anything that the detector thinks could potentially be a detection.

Lakera Guard uses the following threshold levels, in line with OWASP’s paranoia level definitions for WAFs:

  1. L1 - Lenient, very few false positives, if any.
  2. L2 - Balanced, some false positives.
  3. L3 - Stricter, expect false positives but very low false negatives.
  4. L4 - Paranoid, higher false positives but very few false negatives, if any. This is our default confidence threshold.

Setting a detector to a threshold level in the policy means that the detector will flag whenever it has that level of confidence, or higher, that the screened contents contain a threat of that type.

The higher the threshold level the stricter the detector will be, reducing the probability that a potential threat slips through but at the potential risk of higher false positives flagging benign interactions.

Note that the threshold levels fine-tune the required confidence of the detector, not the severity of the threat.

We would love any feedback on the threshold levels to make sure they’re calibrated correctly and give you the control you need for your use cases. If you experience any issues or would like to provide feedback, please reach out to support@lakera.ai.

Custom detectors

Within a policy, you can also create your own additional custom detectors for content moderation and detecting PII or sensitive data. These use regular expressions to flag specific words, strings or text patterns when screening.

These can be used to add custom defenses, preventing your GenAI talking about unwanted topics or screening for additional types of sensitive data.

For example, you could create a custom PII detector for internal employee IDs, or other forms of national ID. Or you could have a custom content moderation detector that flags any time one of your competitors’ names are mentioned to avoid your GenAI application being tricked into talking about them.

For more information on writing regular expressions plus guides to creating your own please see this useful website, or reach out to support@lakera.ai for help.

Allow and Deny Lists

Lakera Guard also provides the ability to create custom allow and deny lists to temporarily override model flagging decisions. This feature helps customers quickly address false positives or false negatives while waiting for model improvements. These lists are designed as a temporary measure for addressing urgent edge cases that impact critical workflows, not as a permanent security solution.

Overriding Lakera Guard’s detectors with custom lists can introduce security loopholes. We recommend using this feature only as a temporary measure while reporting misclassified prompts to Lakera for robust fixes.

For implementation details, see the Allow and Deny Lists documentation.

1

Prompt defenses

Set up prompt defenses for all LLM inputs, as even trusted sources can contain inadvertent prompt attacks.

2

System prompt

Create a robust and securely written system prompt to ensure the AI behaves securely and as intended.

3

Additional defenses

Add defenses to prevent dangerous or sensitive content in LLM inputs or outputs, such as content moderation, data leakage prevention or malicious link detection.