Self-Hosting
Self-hosting Lakera Guard allows organizations to keep user data in their own infrastructure. The Lakera Guard container enables self-hosting on-premises or in a private cloud.
Prerequisites
Before you start, you will need the following:
- the
docker
command line interface (CLI) - a valid Lakera Guard Enterprise license
- an
ACCESS_TOKEN
andSECRET_TOKEN
for the Lakera Guard container registry - the
REGISTRY_URL
for the container registry - the
CONTAINER_PATH
for the container in the container registry - the
LAKERA_GUARD_LICENSE
for running the Lakera Guard container
The LAKERA_GUARD_LICENSE
, ACCESS_TOKEN
, SECRET_TOKEN
, REGISTRY_URL
, and CONTAINER_PATH
are provided by Lakera. If you are an Enterprise customer who plans to self-host and haven’t received these credentials, please reach out to support@lakera.ai.
Login to the container registry
Once you’ve received your credentials, you can log in to the Lakera Guard container registry using the docker
CLI.
If you’re logging in to the container registry locally and the SECRET_TOKEN
isn’t available as an environment variable, use the --password-stdin
option to enter the SECRET_TOKEN
securely and avoid exposing it in your shell history.
Pull the Lakera Guard container
After logging in to the container registry, you can pull the Lakera Guard container.
Export License key as an environment variable
To run the Lakera Guard container, make sure to have the license key ready and can be exported as an environment variable
The container will only load if the license key is found and not expired.
When a successful license key if found:
Example error messages
When no License key is found:
If the License key is expired:
When the License key is close to expiration:
This is a warning message, and the container will continue to boot.
Run the Lakera Guard container
The Lakera Guard container needs to be bound to port 8000
:
Container versioning
The Lakera Guard container follows semantic versioning for its tags. If you need to pin your implementation to a specific version, replace stable
with the desired version number.
Semantic version uses a MAJOR
.MINOR
.PATCH
scheme where each version number is updated based on the scope of changes:
MAJOR
: incremented when we make incompatible API changes or potentially breaking model changesMINOR
: incremented when we add functionality that is backwards compatible or significantly improve model performancePATCH
: incremented when we make backwards compatible bug fixes, add minor functionality, or make small improvements in model performance
Stable builds
Stable builds are recommended for most use cases. These are thoroughly tested for compatibility and updated every two weeks.
Nightly builds
If you want to opt-in to bleeding edge, nightly builds you can use the latest
tag, which will correspond to the most recently shipped changes. Lakera Guard’s defenses are updated every day.
Nightly builds include the latest improvements to our defenses for emerging attacks and vulnerabilities.
Version pinning
If you need to pin your implementation to a specific version, replace stable
with the desired version number.
This is not recommended unless you have to for compliance or have been instructed to pin to a specific version by Lakera’s support team.
Deployment guides
Our team has documented deployment guides for some popular platforms:
If you need assistance deploying Lakera Guard, please reach out to support@lakera.ai.
Environment variables
The container has the following environment variables that can be configured:
-
LAKERA_NO_ANALYTICS
: disable Sentry crash reporting by setting this to1
By default, the Lakera Guard container reports crashes to Sentry. No request input is included in these crash reports. To prevent any data egress from the container, set the
LAKERA_NO_ANALYTICS
environment variable to1
. -
NUM_WORKERS
: optional number of parallel workers to run; set to 1 by default - increase if needed based on the available resources (see resources and scaling) -
MAX_INPUT_TOKENS
: maximum number of tokens in the input as measured by OpenAI’stiktoken
tokenizer; defaults to16385
tokens -
MAX_WARMUP_TOKENS
: maximum number of tokens used during model warmups; defaults to 50000 (orMAX_INPUT_TOKENS
if that is smaller). This configuration reduces the latency of first requests, but increases the container startup duration. -
POLICY_RELOAD_INTERVAL_SECONDS
: delay in seconds between policy reloading, must be an integer. Default is 60s. If 0 is given, it means infinity (never reload after initial loading at startup)
Configuring the token limit
Depending on the resources available to your container and the number of tokens the Large Language Model (LLM) you plan to leverage can handle, you may need to adjust the MAX_INPUT_TOKENS
value.
Tokens are different from characters or words, but a rough estimate is that one token is approximately four characters or 3/4 of a word. For example, 16,385 tokens is roughly equivalent to 65,000 characters or 12,000 words. You can explore how text is tokenized using OpenAI’s tokenizer tool.
The MAX_INPUT_TOKENS
should generally be the same as the context window for the LLM you plan to use. For example, the gpt-4
model has a context window of 8192
tokens, so you should set MAX_INPUT_TOKENS
to 8192
, and gpt-4-turbo-preview
has a context window of 128000
tokens, so you should set MAX_INPUT_TOKENS
to 128000
.
If you plan to leverage multiple models with varying context windows, you can set MAX_INPUT_TOKENS
to the maximum context window across all models you plan to use.
If you raise the MAX_INPUT_TOKENS
too high, you may encounter situations where users can send extremely long requests that could lead to performance bottlenecks or a Denial of Service (DoS) attack.
Input processing time should increase linearly with the number of tokens, so it’s best to set a reasonable token limit based on your use case, model provider, and the resources available to your container.
Common models and their context windows
The table below includes a non-exhaustive list of popular models and their context windows for quick reference. Check for the latest context window for your desired model by referring to the model provider’s documentation.
Last Updated: 2024-03-18
* The Guard Platform uses OpenAI’s tiktoken
tokenizer to calculate tokens, so the MAX_INPUT_TOKENS
for models from other providers might be different from the published value of the model’s context window depending on the tokenization method used by the model provider.
† claude-3-*
refers to the entire Claude 3 family of models, which all share a 200,000
token context window.
‡ mistral-*
refers to the entire Mistral family of models, including Mixtral, which all share a 32,000
token context window.
Resources and scaling
The Lakera Guard platform requires at least 4 GB of Memory and 2 CPU cores to run smoothly. For increased performance, you can scale up the number of replicas of the Lakera Guard container.
Resource requirements
Testing your self-hosted Lakera Guard
Once your container is running, you can replace the https://api.lakera.ai
in any of our API examples with the URL and port of your self-hosted Lakera Guard instance and ignore the Authorization
header.
For example, if you’re running the container on localhost
, you can use http://localhost:8000
as the base URL for your API requests.
Example usage
Here are some examples of using your self-hosted prompt injection endpoint: