Deploying Self-hosted Guard with GPUs
Lakera Guard supports using either CPUs or GPUs for running its internal AI models.
When running the Guard container on GPUs, we observed latency improvements for prompts longer than 1,000 words, as well as a lower total cost of operation at a fixed throughput.
Hardware requirements
Lakera Guard supports all GPUs compatible with CUDA 12.x.
Deployment guide
The same Guard container image can be used for both CPU and GPU deployments. If suitable GPUs are available at runtime, Guard will automatically use them.
Locally running container
To run the container locally with GPU support, the --gpus flag can be used, assuming that the NVIDIA Container Toolkit and suitable CUDA drivers are installed.
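For example, assuming the Guard image is available under a tag such as lakera/guard:latest and the service listens on port 8000 (both are placeholders, substitute the values from your installation), a run command could look like the following sketch:

```bash
# Expose all available GPUs to the container via the NVIDIA Container Toolkit.
# Image name, tag, and port mapping are illustrative placeholders.
docker run --gpus all -p 8000:8000 lakera/guard:latest
```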
Kubernetes deployment
On Kubernetes, the correct drivers must be installed on the underlying compute nodes, and the available GPUs must be declared as nvidia.com/gpu resources. This can be done, for example, by using the GPU Operator or the NVIDIA Device Plugin.
With this setup in place, GPUs can be added to the deployment configuration by specifying the nvidia.com/gpu resource in the container configuration:
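A minimal sketch of the relevant container section of a Deployment spec is shown below; the container name, image reference, and CPU/memory values are illustrative placeholders and should be adapted to your installation.

```yaml
# Illustrative container spec: request one GPU through the nvidia.com/gpu resource.
containers:
  - name: guard
    image: lakera/guard:latest   # placeholder image reference
    resources:
      limits:
        nvidia.com/gpu: 1        # makes one GPU available to the container
      requests:
        cpu: "4"
        memory: 8Gi
```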
Scaling Considerations
When utilizing GPUs, the following scaling parameters should be taken into account:
CPU Cores
To achieve higher resource utilization, a single GPU can be shared by multiple worker processes. In contrast to CPU-based Lakera Guard instances, which scale best horizontally, we have found that a GPU instance scales well vertically by adding additional CPU cores.
When using GPU Lakera Guard instances, it is recommended to add additional CPU cores to the instance and to increase the NUM_WORKERS environment variable beyond the single worker recommended for CPU instances. The number of CPU cores should be at least twice the number of workers.
The optimal configuration of NUM_WORKERS and CPU cores depends on hardware, traffic volume, prompt size, and policy. Tuning can start with 2 workers and 4 cores and gradually increase as resources allow, as long as performance improves with the additional resources. An illustrative starting configuration is sketched below.
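Expressed in the container spec, this starting point could look roughly as follows; the values are a starting point for tuning, not a universal recommendation.

```yaml
# Starting point: 2 workers sharing one GPU, with at least twice as many CPU cores as workers.
env:
  - name: NUM_WORKERS
    value: "2"
resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    cpu: "4"
```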
GPU Memory
The GPU memory requirements of Guard scale linearly with the number of workers and the inference batch size. The batch size can be configured through the LAKERA_INFERENCE_BATCH_SIZE environment variable and has a default value of 128. The default configuration corresponds to a memory requirement of 4.5 GB per worker process. Higher batch sizes can lead to higher throughput, but also require more memory.
It is recommended to scale the batch size based on the available GPU memory.
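For example, at the default batch size of 128, two workers require roughly 2 × 4.5 GB = 9 GB of GPU memory; a GPU with less memory would need a smaller batch size or fewer workers. One way to lower the batch size in the container spec (the value shown is illustrative) is:

```yaml
# Reduce the inference batch size to lower per-worker GPU memory usage.
env:
  - name: LAKERA_INFERENCE_BATCH_SIZE
    value: "64"   # illustrative value; tune to the available GPU memory
```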