Sizing Guide

This document describes the resources required to run Lakera Guard, provides current data on latency characteristics, and details an architecture which can be used to scale Lakera Guard. This guide is for Self-Hosted deployments.

Resources

Lakera Guard can run in any OCI-compliant orchestration system. When bringing up the Lakera Guard container, memory and compute are the most important resources to define. For a single pod we recommend:

  • Memory: 20GB
  • CPU: 4 cores
  • GPU: none

Lakera Guard is CPU-optimized and does not require GPUs. Individual CPU core performance determines the amount of horizontal scaling required: faster CPUs process requests more quickly and need fewer pods, whereas slower CPUs may require more pods to hit the same latency targets.
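For reference, the sketch below expresses these per-pod recommendations using the official Kubernetes Python client; the image reference, names, and namespace are placeholders for illustration, not official values.

```python
from kubernetes import client, config

# A minimal sketch: a single-container Deployment requesting the recommended
# ~20GB of memory and 4 CPU cores per Lakera Guard pod (no GPU).
# The image reference, labels, and namespace below are placeholders.
def build_guard_deployment() -> client.V1Deployment:
    container = client.V1Container(
        name="lakera-guard",
        image="YOUR_REGISTRY/lakera-guard:latest",  # placeholder image reference
        resources=client.V1ResourceRequirements(
            requests={"memory": "20Gi", "cpu": "4"},  # ~20GB memory, 4 cores
            limits={"memory": "20Gi", "cpu": "4"},
        ),
    )
    template = client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(labels={"app": "lakera-guard"}),
        spec=client.V1PodSpec(containers=[container]),
    )
    spec = client.V1DeploymentSpec(
        replicas=1,
        selector=client.V1LabelSelector(match_labels={"app": "lakera-guard"}),
        template=template,
    )
    return client.V1Deployment(
        metadata=client.V1ObjectMeta(name="lakera-guard"),
        spec=spec,
    )

if __name__ == "__main__":
    config.load_kube_config()  # or load_incluster_config() inside the cluster
    client.AppsV1Api().create_namespaced_deployment(
        namespace="default", body=build_guard_deployment()
    )
```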

Latency

The input length of a request (measured in characters or tokens) contributes to latency, especially at large request sizes. Internal networking configuration for self-hosted containers may also affect latency. Table 1 shows the latency of a single request across multiple prompt sizes. These values were collected on AWS m5.2xlarge nodes running against a Lakera Guard container in AWS EKS.


| Input Length (Characters) | Latency (ms, P95) |
| --- | --- |
| 1,000 | <20 |
| 10,000 | <100 |
| 20,000 | <150 |
| 30,000 | <200 |
| 50,000 | >250 |
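If you want to reproduce this kind of measurement against your own deployment, a minimal script along the lines of the sketch below can collect P95 numbers. The endpoint URL and payload shape shown are placeholders and must be replaced with the screening endpoint and request schema your Lakera Guard deployment exposes.

```python
import statistics
import time

import requests

# Placeholder endpoint and payload shape; adjust to your self-hosted
# Lakera Guard service address and the screening API you call.
GUARD_URL = "http://lakera-guard.internal:8000/v1/guard"

def measure_p95_latency_ms(prompt_chars: int, samples: int = 100) -> float:
    """Send `samples` sequential requests with a prompt of `prompt_chars`
    characters and return the observed P95 latency in milliseconds."""
    prompt = "a" * prompt_chars
    latencies_ms = []
    for _ in range(samples):
        start = time.perf_counter()
        requests.post(GUARD_URL, json={"input": prompt}, timeout=10)
        latencies_ms.append((time.perf_counter() - start) * 1000)
    # quantiles(..., n=20)[18] is the 95th percentile cut point
    return statistics.quantiles(latencies_ms, n=20)[18]

if __name__ == "__main__":
    for size in (1_000, 10_000, 20_000, 30_000, 50_000):
        print(f"{size:>6} chars: p95 = {measure_p95_latency_ms(size):.1f} ms")
```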

Input Length and Request Volume

Another factor in latency is the number of requests per second (RQS) sent to the container. Latency rises as the request rate increases, and there is a hockey-stick increase once the container becomes overloaded. Larger prompts show a steeper slope that starts at lower request rates. Input length and request volume together drive the observed latency. The table below combines the two variables to provide guidelines on where to target RQS and prompt length for given latency goals. We have seen many use cases where prompts are typically under 5,000 characters (P95). From this table, targeting 20 RQS or less per container provides good latency for prompts of up to 10,000 characters.


| Latency (P95) | Input Length (Characters) | RQS |
| --- | --- | --- |
| <30ms | < 1,000 | < 100 |
| <30ms | < 2,000 | < 60 |
| 100 - 250ms | < 2,000 | < 80 |
| 100 - 250ms | < 5,000 | < 40 |
| 100 - 250ms | < 10,000 | < 20 |
| 250 - 350ms | < 2,000 | < 100 |

A few examples help illustrate how the two variables interact; a sizing sketch based on the table follows the list.

  1. If the target is very low latency (less than 30ms), prompts need to be kept below 2,000 characters, while RQS can be driven up to 60. This shows that prompt length is the dominant factor at the lowest latency targets.
  2. Given a larger prompt of 10,000 characters, an RQS of under 20 is required to hit latencies between 100ms and 300ms.
  3. A 2,000-character prompt length supports a wide range of acceptable RQS rates across the latency targets.
  4. At 20 RQS, prompt lengths up to 10,000 characters are handled well while still keeping latencies between 100ms and 300ms.
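As a planning aid, the guideline table can be encoded in a small lookup helper like the sketch below; the thresholds simply restate the table above and are not an exact performance model.

```python
# A rough sizing helper that restates the guideline table above.
# It is a planning sketch, not an exact performance model.
GUIDELINES = [
    # (max input length in characters, max RQS per pod, expected P95 latency)
    (1_000, 100, "<30ms"),
    (2_000, 60, "<30ms"),
    (2_000, 80, "100-250ms"),
    (5_000, 40, "100-250ms"),
    (10_000, 20, "100-250ms"),
    (2_000, 100, "250-350ms"),
]

def expected_latency(prompt_chars: int, rqs: int) -> str:
    """Return the first guideline band that covers the given prompt size and
    per-pod request rate, or a warning if nothing in the table applies."""
    for max_chars, max_rqs, band in GUIDELINES:
        if prompt_chars <= max_chars and rqs <= max_rqs:
            return band
    return "outside guideline table: reduce RQS, shorten prompts, or add pods"

print(expected_latency(1_500, 50))   # <30ms
print(expected_latency(8_000, 15))   # 100-250ms
```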

In sizing a system, it is important to plan for tail latency. In a typical use case, prompt lengths will be mixed, with some portion falling into the larger sizes. Some use cases will see a higher proportion of these large prompts, whereas others will see fewer. These larger prompts can often be absorbed by oversubscribing the number of Lakera Guard containers, ensuring there are always free cycles to process incoming prompts. To pull down tail latency for very large inputs, chunking the input prompts is often useful; see Prompt Length Chunking for details.

Scaling

Lakera Guard is distributed as an OCI container, typically orchestrated in a Kubernetes environment. The container instances (i.e. pods) are stateless and can be easily scaled within the environment.

Lakera recommends scaling horizontally to optimize latency, increasing the number of pods to account for load. We strongly recommend keeping the number of workers per pod at 1; Lakera Guard is not optimized for vertical scaling. The prompt length and request rate dimensions can be considered independently. Use the tables in the prior section to determine where to set each dimension. For example, given a P95 latency requirement under 300ms, prompt lengths should remain under 10,000 characters with a request rate per Lakera Guard instance of 20 RQS.
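To turn a per-pod RQS target into a pod count, a simple heuristic (a sketch, assuming a load balancer spreads traffic evenly across pods) is to divide the expected aggregate request rate by the per-pod RQS target from the guidelines above, with some headroom for bursts.

```python
import math

def pods_needed(total_rqs: float, per_pod_rqs: float, headroom: float = 1.2) -> int:
    """Estimate how many Lakera Guard pods are needed for a given aggregate
    request rate. `per_pod_rqs` comes from the guideline table (e.g. 20 RQS
    per pod for prompts up to 10,000 characters at roughly 300ms P95).
    `headroom` oversubscribes slightly to absorb bursts and tail latency."""
    return math.ceil(total_rqs * headroom / per_pod_rqs)

# Example: 150 requests per second overall with 10,000-character prompts,
# targeting the <300ms P95 band (20 RQS per pod from the table).
print(pods_needed(total_rqs=150, per_pod_rqs=20))  # 9 pods
```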

For simplicity, it is often acceptable to scale based only on RQS, allowing for additional tail latency with very large prompts. In use cases where the average prompt length is too large to achieve acceptable latencies even at a low RQS, chunking of the prompts may be necessary.

Pod Elasticity

To keep the RQS rate per pod at a desired level, we suggest enabling auto-scaling. In a Kubernetes environment, a horizontal pod autoscaler (HPA) in conjunction with a load balancer that spreads requests across multiple pods is an effective approach: the load balancer distributes requests across the active Lakera Guard pods, and the HPA provides the elasticity to scale the number of pods up or down as needed.

We recommend using a latency metric across the Lakera Guard pods as the scaling metric for the HPA. As latency across the pods increases, more pods should be deployed; likewise, when latency drops, pods can be reaped. Tracking latency as the scaling metric gives better results than tracking CPU load.
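Conceptually, scaling on latency follows the standard Kubernetes HPA rule, desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric). The sketch below illustrates that calculation for a P95 latency metric; it assumes the latency metric is already exposed to the HPA, for example through a custom metrics adapter, and the 250ms target is only an example value.

```python
import math

def desired_replicas(current_replicas: int,
                     current_p95_latency_ms: float,
                     target_p95_latency_ms: float = 250.0) -> int:
    """Replica count the standard Kubernetes HPA formula would request when
    scaling on an averaged latency metric:
        desired = ceil(current * currentMetric / targetMetric)
    The 250ms target used here is only an example."""
    ratio = current_p95_latency_ms / target_p95_latency_ms
    return max(1, math.ceil(current_replicas * ratio))

print(desired_replicas(4, 400.0))  # latency above target -> scale up to 7
print(desired_replicas(4, 120.0))  # latency well below target -> scale down to 2
```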

Prompt Length Chunking

To address tail latency for very large prompts, or in cases where the average prompt size is large, a chunking mechanism can be added to the dispatch layer within your deployment. Before calling into the Lakera Guard endpoint, a large prompt may be split into smaller chunks. Each chunk is submitted to the Lakera Guard pods in parallel and the results are logically OR'ed together, i.e. if any chunk is flagged, the whole prompt is flagged. Note that the chunking and OR operation do not happen automatically; they need to be implemented in the dispatch layer outside of Lakera Guard.

For best results, we recommend chunking on semantic boundaries, e.g. sentence or paragraph boundaries. We have found little need for overlap between chunk windows.
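A minimal dispatch-layer sketch along these lines is shown below; the endpoint URL, request payload, and the `flagged` response field are placeholder assumptions and must be adapted to the screening endpoint and response schema of your deployment.

```python
from concurrent.futures import ThreadPoolExecutor

import requests

# Placeholder endpoint and response schema; adapt to your deployment.
GUARD_URL = "http://lakera-guard.internal:8000/v1/guard"
CHUNK_SIZE = 5_000  # characters per chunk; tune to your latency targets

def chunk_on_paragraphs(prompt: str, max_chars: int = CHUNK_SIZE) -> list[str]:
    """Greedily group paragraphs into chunks of at most `max_chars` characters,
    so chunk boundaries fall on semantic (paragraph) boundaries where possible.
    A single paragraph longer than `max_chars` becomes its own oversized chunk."""
    chunks, current = [], ""
    for paragraph in prompt.split("\n\n"):
        if current and len(current) + len(paragraph) + 2 > max_chars:
            chunks.append(current)
            current = paragraph
        else:
            current = f"{current}\n\n{paragraph}" if current else paragraph
    if current:
        chunks.append(current)
    return chunks

def is_flagged(chunk: str) -> bool:
    response = requests.post(GUARD_URL, json={"input": chunk}, timeout=10)
    response.raise_for_status()
    return bool(response.json().get("flagged"))  # assumed response field

def screen_prompt(prompt: str) -> bool:
    """Screen a large prompt by chunking it, submitting the chunks in parallel,
    and OR-ing the results: the prompt is flagged if any chunk is flagged."""
    chunks = chunk_on_paragraphs(prompt)
    with ThreadPoolExecutor(max_workers=max(1, len(chunks))) as pool:
        return any(pool.map(is_flagged, chunks))
```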