Configuring policies in self-hosted Guard

Enterprise self-hosting customers can configure and dynamically update the Guard detectors and thresholds used for screening by individual apps and integrations. This is done using a policy file hosted in an S3 bucket (or S3-compatible storage) or on a local filesystem within your own environment.

The policy file is used to specify and update both your projects and policies configurations. It also specifies the assignment mapping between projects and policies.

Each project must have one policy assigned. The same policy can be assigned to multiple projects.

Policies can be checked for validity via the policy linter in the Guard platform or via the /v2/policies/lint or /v2/policies/health endpoints.

Policies can either be shared through an S3-compatible bucket or through a local filesystem that is mounted in the running containers.

For information on how to set up and self-host Guard, please refer to the Self-Hosting documentation.

Set up requirements for S3

First, create a bucket in S3, or an S3-compatible alternative, in your environment. Add a directory in the bucket to contain the policies. Policies must reside in a directory rather than in the root of the bucket.

Create one or more JSON policy files (details on the file contents are outlined below). The files can be named anything as long as they end with .json. Upload the policy files to the directory that was created in the S3 bucket.
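For example, a minimal sketch using the AWS CLI; the bucket name policies, the directory prod, and the file name policy.json are illustrative placeholders:

# Create the bucket (skip if it already exists)
aws s3 mb s3://policies
# Upload the policy file into the prod/ directory of the bucket
aws s3 cp policy.json s3://policies/prod/policy.json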

Then, set up feature flags for the Guard container via the environment variables:

export LLM_GUARD_POLICY_ENABLED=true
export POLICY_BUCKET_NAME=[Name of the bucket, e.g. policies]
export POLICY_ENDPOINT_URL=[S3 URL including region, e.g. https://s3.eu-central-1.amazonaws.com]
export POLICY_PREFIX=[Path to the policy within the bucket, e.g. the directory where the policies reside]
export AWS_ACCESS_KEY_ID=$YOUR_ACCESS_KEY
export AWS_SECRET_ACCESS_KEY=$YOUR_SECRET_ACCESS_KEY
export LAKERA_GUARD_LICENSE=$YOUR_LICENSE

The final S3 URI will be ${POLICY_ENDPOINT_URL}/${POLICY_BUCKET_NAME}/${POLICY_PREFIX}/, e.g. https://s3.eu-central-1.amazonaws.com/policies/prod/*.json.

If you are using a non-AWS S3 provider, the access key environment variables S3_ACCESS_KEY_ID and S3_SECRET_ACCESS_KEY can be used as variable names instead of the AWS ones.
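For example, when pointing at a non-AWS provider the relevant variables might look like this; the endpoint URL is an illustrative placeholder for your provider's S3-compatible endpoint:

# S3-compatible provider endpoint instead of an AWS regional endpoint
export POLICY_ENDPOINT_URL=https://storage.example-provider.com
# Provider credentials, using the S3_* variable names instead of the AWS ones
export S3_ACCESS_KEY_ID=$YOUR_ACCESS_KEY
export S3_SECRET_ACCESS_KEY=$YOUR_SECRET_ACCESS_KEY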

Set up requirements for a local filesystem

A local filesystem which contains policy files can be mounted into a self-hosted container to share policy files through local filesystem storage. This filesystem can be read-only from the container’s perspective, allowing the same policy filesystem to be shared across multiple containers.

Create one or more JSON policy files (details on the file contents are outlined below). The files can be named anything as long as they end with .json. Store the policy files in the filesystem which is mounted to the self-hosted container.

The policy files must reside in a directory pointed to by the LOCAL_POLICY_DIR environment variable.

Then, set up the feature flags for the Guard container via the environment variables:

export LLM_GUARD_POLICY_ENABLED=true
export LOCAL_POLICY_DIR=[The directory that is mounted in the Guard container]
export LAKERA_GUARD_LICENSE=$YOUR_LICENSE

Docker command

When running the Docker command, the environment variables need to be passed to the container. For example:

docker run \
  -e LLM_GUARD_POLICY_ENABLED \
  -e POLICY_BUCKET_NAME \
  -e POLICY_PREFIX \
  -e POLICY_ENDPOINT_URL \
  -e AWS_ACCESS_KEY_ID \
  -e AWS_SECRET_ACCESS_KEY \
  -e LAKERA_GUARD_LICENSE \
  -p 8000:8000 \
  registry.gitlab.com/lakeraai/llm-guard/guard:latest

Depending on your AWS account type, it may be necessary to also include -e AWS_SESSION_TOKEN.
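If you are sharing policies via a local filesystem instead of S3, a sketch of the equivalent command might look like the following; the host path /srv/guard-policies and the in-container mount point /policies are illustrative assumptions:

# Mount the host policy directory read-only into the container
docker run \
  -e LLM_GUARD_POLICY_ENABLED=true \
  -e LOCAL_POLICY_DIR=/policies \
  -e LAKERA_GUARD_LICENSE \
  -v /srv/guard-policies:/policies:ro \
  -p 8000:8000 \
  registry.gitlab.com/lakeraai/llm-guard/guard:latest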

Technical considerations

The container will crash if the policy file is invalid. When this happens, it will report errors explaining the issues with the policy file.

Policies are auto-updated when changes are made to the policy file. This is triggered when any new screening requests come in for that policy. Note that it may take a couple of minutes for policy updates to take effect.

The policy file can be stored in a folder or prefix in the S3 bucket or S3-compatible storage.

To change the location of the policy file or update the credentials, update the environment variables and restart the container.

Policy schema

The policy schema is made up of three sections:

  1. projects - defines each project's ID name and assigns it to a policy.
  2. policies - defines each policy's ID name and maps it to its detectors.
  3. detectors - defines each detector's ID name, its detector type, and its confidence threshold for flagging.

Note that project, policy and detector ID names can be whatever you want as long as they follow the naming convention set out for each section below. It is recommended to make them clear and descriptive to ease interpretation when they are returned in logs and API responses.

The policy file should always start with specifying the schema version: "schema_version": 1.
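Putting the pieces together, a minimal policy file skeleton looks like the following, with the three sections left empty as placeholders to be populated as described in the sections below (complete worked examples are given at the end of this page):

{
  "schema_version": 1,
  "projects": [],
  "policies": [],
  "detectors": []
}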

Projects section

Project IDs must start with project-.

Each project can only map to one policy. Multiple projects can map to the same policy. This is specified by policy_id.

For information on how to use projects to set, configure, track, understand, and compare the security and threat profiles of each of your GenAI applications and components, please refer to the Projects documentation.

projects example:

1"projects": [
2 {
3 "id": "project-chatbot",
4 "policy_id": "policy-strict-defense"
5 },
6 {
7 "id": "project-internal-RAG",
8 "policy_id": "policy-just-hate-speech"
9 }
10 ],

Policies section

Policy IDs must start with policy-.

Each policy can map to multiple detectors of different types. This is specified via lists of detector IDs:

  • detectors - these are detectors that should be run on all screening requests using this policy, regardless of whether they’re user input, reference materials passed to the LLM, or a response coming from the LLM.
  • input_detectors - these are detectors that should be run when screening LLM inputs, i.e. only when the role is set as user in the API request.
  • output_detectors - these are detectors that should be run when screening LLM outputs, i.e. only when the role is set as assistant in the API request.

Note that a policy can’t be set to use multiple detectors of the same detector type for an input or an output screening request.

For convenience, you can map a policy to use a whole defense category rather than listing all of the individual detectors. For example, you can map a policy to a detector with type pii, which will then run all of Guard’s PII detectors when screening.

Note that you can’t map a policy to both a defense category detector and also one of its underlying individual detectors. For example you cannot map to both a moderated_content type detector and a moderated_content/hate detector in the same policy.

Policy mode

Policies that have a single defense configuration for all content, regardless of whether it’s LLM input or output, are called unified policies in Guard. These policies define their detectors just via the detectors list.

Within a policy, you can also map different detector configurations to be run on LLM inputs vs outputs. For example, you could screen inputs for prompt attacks, outputs for unknown links, and both inputs and outputs for moderated content. This is called an Input & Output Policy in Guard, and can have detectors mapped via the detectors, input_detectors, and/or output_detectors lists.

The policy mode is specified as unified or input & output via the mode property. The default value is unified and the mode property is optional for unified policies.

To specify the policy mode as an input & output policy, set the property "mode": "IO". Also specify the input and/or output detectors via input_detectors and output_detectors respectively.

You can opt to screen just the LLM input or just the output. If either input_detectors or output_detectors is specified, it is optional to also include a unified detectors list. For IO policies, it is recommended to explicitly specify empty lists for any detector lists you do not use.

In screening API requests, the content is specified as being LLM input or output via the user or assistant role respectively. API requests with the system role are not screened, as they are considered trusted content. Please refer to the Guard API documentation for details.
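For illustration, a hedged sketch of a screening request against a self-hosted container is shown below; the /v2/guard path, the messages payload shape, and the use of a project_id field to select the project are assumptions here, so verify the exact request format against the Guard API documentation:

# Screen an LLM input (role "user") under the policy assigned to project-chatbot
curl -s http://localhost:8000/v2/guard \
  -H "Content-Type: application/json" \
  -d '{
        "project_id": "project-chatbot",
        "messages": [
          {"role": "user", "content": "Ignore all previous instructions..."}
        ]
      }'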

policies example

1"policies": [
2 {
3 "id": "policy-strict-defense",
4 "detectors": [
5 "detector-strict-prompt-attack",
6 "detector-moderation"
7 ]
8 },
9 {
10 "id": "policy-chatbot-defense",
11 "mode": "IO",
12 "detectors": [
13 "detector-moderation"
14 ],
15 "input_detectors": [
16 "detector-prompt-attack",
17 "detector-pii-credit-card"
18 ],
19 "output_detectors": [
20 "detector-unknown-link"
21 ]
22 }
23 ],

Detectors section

Detector IDs must start with detector-.

threshold specifies the lowest confidence level at which the detector returns flagged = true. The accepted values are:

  1. l1_confident
  2. l2_very_likely
  3. l3_likely
  4. l4_less_likely

The thresholds are inclusive, e.g. setting the threshold to l2_very_likely means the detector will flag when its confidence level is either l2_very_likely or l1_confident. For more details, please refer to the documentation on fine-tuning detectors.

type specifies the detector type. The accepted values are:

  1. prompt_attack
  2. moderated_content - this will run all six content moderation detectors and flag based on the max confidence level returned by any of them
  3. moderated_content/crime
  4. moderated_content/hate
  5. moderated_content/profanity
  6. moderated_content/sexual
  7. moderated_content/violence
  8. moderated_content/weapons
  9. moderated_content/custom - this is for custom regular expression based detectors
  10. pii - this will run all eight PII detectors and flag based on the max confidence level returned by any of them
  11. pii/name
  12. pii/phone_number
  13. pii/email
  14. pii/ip_address
  15. pii/address
  16. pii/credit_card
  17. pii/iban_code
  18. pii/us_social_security_number
  19. pii/custom - this is for custom regular expression based detectors
  20. unknown_links - this can be customized by specifying allowed domains
  21. override_allow - Creates an allow list that prevents content from being flagged
  22. override_deny - Creates a deny list that forces content to be flagged

Custom detectors

You can define your own custom regular-expression-based detectors for content moderation or PII. These can be used to add custom defenses, preventing your GenAI application from talking about unwanted topics or screening for additional types of sensitive data, by flagging specific words, strings, or text patterns during screening.

For example, you could create a custom PII detector for internal employee IDs, or other forms of national ID. Or you could have a custom content moderation detector that flags any time one of your competitors’ names is mentioned, to avoid your GenAI application being tricked into talking about them.

Custom detectors are created by defining a detector with type moderated_content/custom or pii/custom.

Detectors of this type are specified by a custom_matchers list of objects, each containing a label and a list of regexes. When a text string matching any of the regexes is found in the screened contents, the custom detector will flag.

Guard uses Perl-Compatible Regular Expressions (PCRE) syntax.

For example, you can replace <deny list word> in the below regex with a word or phrase and Guard will flag when it appears in contents, in any casing, and as a standalone word rather than part of another word.

1"custom_matchers": [
2 {
3 "label": "Deny list",
4 "regexes": ["(?i)\\b<deny list word>\\b"]
5 }
6]

When writing regular expressions in the policy file, make sure to escape backslashes, e.g. use \\b and \\s etc., otherwise they are interpreted at the JSON string level.

It is recommended to always test custom detectors carefully to make sure the regular expressions are set correctly.
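One lightweight way to sanity-check a pattern before adding it to the policy is with a PCRE-capable tool such as GNU grep (grep -P). This is only a local approximation of the matching Guard performs, and note that on the command line the backslashes are single; the doubling is only needed inside JSON strings:

# Should match: case-insensitive, standalone phrase
echo "We never discuss Acme Corp here" | grep -P '(?i)\bacme corp\b'
# Should not match: "corp" is part of a longer word, so the trailing \b fails
echo "These are acme corporate values" | grep -P '(?i)\bacme corp\b'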

Note that custom detectors cannot be fine-tuned, so they must be specified with a threshold level of l1_confident.

For more information on regular expressions, plus guides to creating your own, please refer to this useful website, or reach out to support@lakera.ai for help.

Unknown links detector

The Unknown Links detector flags URLs that aren’t among the top one million most popular domains. It’s used to prevent malicious links being returned to users by an LLM, e.g. through data poisoning, where an indirect prompt injection hidden in text passed to the LLM tricks the AI into sharing phishing links.

In the policy, you can specify a custom list of safe domains. The Unknown Links detector won’t flag URLs in the domains added to the allow list. These domains might not be included in the million most popular domains but are safe to appear in LLM interactions. As a result, Guard should not flag them. Usually this will include your own domain.

To add domains to the allowed domains list, add an allowed_domains list when defining a detector of type unknown_links. For example:

{
  "id": "detector-phishing-links",
  "type": "unknown_links",
  "threshold": "l1_confident",
  "allowed_domains": [
    "my-trusted-domain.com",
    "another-safe-domain.com"
  ]
}

Do not include prefixes, e.g. http:// or www., or a subdomain, e.g. platform.lakera.ai, when defining your allowed domains, as these are not supported. Please only include the domain name and top-level domain.

Allow and Deny List Detectors

Lakera Guard supports custom allow and deny lists to temporarily override model flagging decisions. These special detector types help you address false positives or false negatives that affect critical workflows while waiting for model improvements.

Temporary Solution Only
Overriding Lakera Guard can introduce security loopholes. These lists should only be used as a temporary measure while reporting misclassified prompts to Lakera for robust fixes.

For details, see the Allow and Deny Lists documentation.

To implement these overrides, use the following detector types:

  1. override_allow - Creates an allow list that prevents content from being flagged
  2. override_deny - Creates a deny list that forces content to be flagged

Both detector types require a list of strings in the override_list parameter and must use the l1_confident threshold level.

Example of allow and deny list detectors
{
  "id": "detector-ban-soft-drinks",
  "type": "override_deny",
  "threshold": "l1_confident",
  "override_list": [
    "coke",
    "pepsi",
    "fanta",
    "redbull"
  ]
},
{
  "id": "detector-allow-dad-jokes",
  "type": "override_allow",
  "threshold": "l1_confident",
  "override_list": [
    "That mineral water was fanta-stic!"
  ]
}

In this example, any content containing references to soft drinks would be flagged, but the specific (and terrible) dad joke about mineral water would be allowed, even though it contains “fanta” which is in the deny list.

detectors example

1"detectors": [
2 {
3 "id": "detector-moderation",
4 "threshold": "l3_likely",
5 "type": "moderated_content/hate"
6 },
7 {
8 "id": "detector-cc",
9 "threshold": "l2_very_likely",
10 "type": "pii/credit_card"
11 },
12 {
13 "id": "detector-competitor",
14 "type": "moderated_content/custom",
15 "threshold": "l1_confident",
16 "custom_matchers": [
17 {
18 "label": "Competitors",
19 "regexes": ["(?i)\\bacme corp\\b", "(?i)\\bgeneric AI startup name\\b"]
20 }
21 ]
22 },
23 {
24 "id": "detector-employee-ids",
25 "type": "pii/custom",
26 "threshold": "l1_confident",
27 "custom_matchers": [
28 {
29 "label": "Old ID",
30 "regexes": ["(AG|AI|AR|BE|BL|BS|...|ZG|ZH)\\s*[-.•]?\\s*[0-9]{1,6}"]
31 },
32 {
33 "label": "New ID",
34 "regexes": ["[A-Z0-9]*CD[A-Z0-9]"]
35 }
36 ]
37 },
38 {
39 "id": "detector-phishing-links",
40 "type": "unknown_links",
41 "threshold": "l1_confident",
42 "allowed_domains": [
43 "my-trusted-domain.com",
44 "another-safe-domain.com"
45 ]
46 }
47]

Checking the policy file is valid

Policy linter interface

The easiest way to check that your policy file is valid is via the policy linter tool in the Guard platform. This editor provides basic linting for your policy: it checks for common errors and provides suggestions for improvement.

The tool is run locally in the browser, so any policies or JSON entered in it are not saved anywhere. Note this also means that the contents are lost if you close the page.

Policy linter endpoint

Alternatively, you can use the /v2/policies/lint endpoint within the container to send one or more policy JSON files and have them checked for correctness. It will return whether the policy passed (true or false) and a list of any errors or warnings.
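As a hedged sketch, assuming the endpoint accepts the policy JSON as the request body (the exact request shape isn't documented here, so verify it against your Guard API reference):

# Ask the running container to lint a local policy file
curl -s -X POST http://localhost:8000/v2/policies/lint \
  -H "Content-Type: application/json" \
  --data-binary @policy.json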

Policy health check endpoint

You can use the /v2/policies/health endpoint to check the validity of the policy configuration for a given project.

If the policy is syntactically correct, the endpoint will return an ok status and the linter will be marked as passed. If the policy linter reports any warnings or errors these will be returned in the response.

You need to specify the desired project’s ID explicitly in the request. If the project ID isn’t passed then the health endpoint will report the status of the Lakera Default Policy. If the project ID does not exist, or has been deleted, the health endpoint will return an error status.
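A hedged sketch, assuming the project ID is passed as a query parameter (check the endpoint reference for the exact parameter name):

# Check the policy health for a specific project
curl -s "http://localhost:8000/v2/policies/health?project_id=project-chatbot"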

The health check also reports whether the project is using the Lakera Default Policy, or a custom policy is defined and assigned to it in the policy file.

If you have tried to configure a custom policy for the project, make sure the health check doesn’t say it’s using the Lakera Default Policy. Otherwise, the status report reflects the health of Lakera Guard’s Default Policy rather than your policy configuration.

Audit history

For self-hosting customers, the audit history of the policy file must be set up, tracked and managed by the customer. The Lakera Guard container does not maintain an audit history of policies.
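One possible approach, if you store policies in S3, is to enable bucket versioning so that every change to a policy file is retained (shown here with the AWS CLI and the illustrative bucket name policies):

# Keep previous versions of every policy file that is overwritten
aws s3api put-bucket-versioning \
  --bucket policies \
  --versioning-configuration Status=Enabled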

Policy examples

Defense category example

In this example, only the Content Moderation defense is used for screening requests for the project project-chatbot. This project has been assigned the policy policy-just-moderation, and the policy is mapped to the moderated content detector with a confidence threshold of l3_likely.

This means that if any of the content moderation detectors (crime, hate speech, profanity, sexual content, violence, and weapons) identifies something that looks like a violation in the user’s input or the chatbot’s output, it will trigger a Guard response of “flagged”.

{
  "schema_version": 1,
  "projects": [
    {
      "id": "project-chatbot",
      "policy_id": "policy-just-moderation"
    }
  ],
  "policies": [
    {
      "id": "policy-just-moderation",
      "detectors": [
        "detector-moderation"
      ]
    }
  ],
  "detectors": [
    {
      "id": "detector-moderation",
      "threshold": "l3_likely",
      "type": "moderated_content"
    }
  ]
}

Individual detector example

In this example, only the individual hate detector within the content moderation defense category is used for the specified project’s screening requests. The confidence threshold is set to l2_very_likely, which represents a balanced risk tolerance.

{
  "schema_version": 1,
  "projects": [
    {
      "id": "project-chatbot-output",
      "policy_id": "policy-only-hate"
    }
  ],
  "policies": [
    {
      "id": "policy-only-hate",
      "detectors": [
        "detector-hate"
      ]
    }
  ],
  "detectors": [
    {
      "id": "detector-hate",
      "threshold": "l2_very_likely",
      "type": "moderated_content/hate"
    }
  ]
}

Multiple detectors example

In this example, LLM inputs and outputs will both be screened for moderated content of any kind. User inputs will additionally be screened for prompt attacks, credit card numbers and (via a custom detector) internal employee IDs.

{
  "schema_version": 1,
  "projects": [
    {
      "id": "project-chatbot",
      "policy_id": "policy-pinj-moderation-cc"
    }
  ],
  "policies": [
    {
      "id": "policy-pinj-moderation-cc",
      "mode": "IO",
      "detectors": [
        "detector-moderation"
      ],
      "input_detectors": [
        "detector-pinj",
        "detector-credit-card",
        "detector-employee-ids"
      ],
      "output_detectors": []
    }
  ],
  "detectors": [
    {
      "id": "detector-moderation",
      "threshold": "l2_very_likely",
      "type": "moderated_content"
    },
    {
      "id": "detector-pinj",
      "threshold": "l4_less_likely",
      "type": "prompt_attack"
    },
    {
      "id": "detector-credit-card",
      "threshold": "l1_confident",
      "type": "pii/credit_card"
    },
    {
      "id": "detector-employee-ids",
      "type": "pii/custom",
      "threshold": "l1_confident",
      "custom_matchers": [
        {
          "label": "Old Company ID",
          "regexes": ["(AG|AI|AR|BE|BL|BS|...|ZG|ZH)\\s*[-.•]?\\s*[0-9]{1,6}"]
        },
        {
          "label": "New Company ID",
          "regexes": ["[A-Z0-9]*CD[A-Z0-9]"]
        }
      ]
    }
  ]
}

Multiple policies

In this example, there are three apps: an external customer chatbot, an internal chatbot, and an internal Q&A tool. The customer chatbot has its own policy with separate detector configurations for screening the LLM input and output. The internal apps both use the same policy.

{
  "schema_version": 1,
  "projects": [
    {
      "id": "project-customer-chatbot",
      "policy_id": "policy-strict-chatbot"
    },
    {
      "id": "project-internal-chatbot",
      "policy_id": "policy-internal"
    },
    {
      "id": "project-internal-q&a",
      "policy_id": "policy-internal"
    }
  ],
  "policies": [
    {
      "id": "policy-strict-chatbot",
      "mode": "IO",
      "detectors": [],
      "input_detectors": [
        "detector-pinj"
      ],
      "output_detectors": [
        "detector-moderation",
        "detector-credit-card"
      ]
    },
    {
      "id": "policy-internal",
      "detectors": [
        "detector-moderation",
        "detector-pii"
      ]
    }
  ],
  "detectors": [
    {
      "id": "detector-moderation",
      "threshold": "l2_very_likely",
      "type": "moderated_content"
    },
    {
      "id": "detector-pinj",
      "threshold": "l3_likely",
      "type": "prompt_attack"
    },
    {
      "id": "detector-credit-card",
      "threshold": "l2_very_likely",
      "type": "pii/credit_card"
    },
    {
      "id": "detector-pii",
      "threshold": "l1_confident",
      "type": "pii"
    }
  ]
}