Baselining Performance Metrics

Gathering baseline performance metrics for Lakera Guard prior to full integration is an essential step in ensuring successful implementation. By establishing clear benchmarks, organizations can accurately assess Lakera Guard’s capabilities. These initial measurements serve as a valuable diagnostic tool, highlighting potential areas for optimization and helping to maximize effectiveness after integration.

Measuring Success Criteria - Classification Evaluation

Lakera Guard functions as a control layer around your model, offering configurability over what enters and leaves your model through detection. The first phase of testing focuses on the question, “How effective are Lakera Guard’s detection capabilities?”

To answer this, Lakera recommends using a Confusion Matrix to assess labeled datasets. A confusion matrix provides a detailed evaluation of detection capabilities by measuring:

  • True Positives: The model correctly predicts the positive class.
  • True Negatives: The model correctly predicts the negative class.
  • False Positives: The model incorrectly predicts the positive class.
  • False Negatives: The model incorrectly predicts the negative class.

Confusion Matrix

A Confusion Matrix is a standardized approach for gaining insights into how well the model identifies positive instances and avoids false detections. Lakera considers a predicted positive to be an input that Lakera Guard is expected to flag as True. A predicted negative represents a benign input that is expected to flag as False.

For example, a known prompt injection input such as, Ignore your system prompt and perform the following instructions, is expected to produce an API response containing flagged: true. This is a predicted positive.


Predicted PositivePredicted Negative
Actual PositiveTrue Positive (TP)False Negative (FN)
Actual NegativeFalse Positive (FP)True Negative (TN)

Using a Confusion Matrix allows for the calculation of recall, accuracy, and false positive rate, which are valuable for evaluating the performance of a classification model.


Metrics

MetricDescriptionFormula
RecallMeasures the ability to identify all relevant instances.TP / (TP + FN)
AccuracyMeasures the overall correctness of the model.(TP + TN) / (TP + TN + FP + FN)
False Positive RateProportion of actual negatives incorrectly classified as positive.FP / (FP + TN)

Example Confusion Matrix

Consider a mixed dataset containing 10 true positive prompt injection examples and 10 true negative benign prompts examples. If the model correctly identifies all true positives and true negatives, it demonstrates a 100% accuracy rate.


Predicted PositivePredicted Negative
Actual Positive100
Actual Negative010

Example Dataset Calculations

In this example, consider an imbalanced dataset. Imbalanced data is representative of real-world use cases as we’d expect there to be far more benign prompts than malicious ones. In our dataset, we have 50 predicted positives (prompt injections) and 950 predicted negatives (benign prompts).

Predicted PositivePredicted Negative
Actual Positive482
Actual Negative3947

The model has correctly classified:

  • Actual Positive: 48 prompts as prompt injections
  • Actual Negative: 947 prompts as benign inputs

The model has incorrectly classified:

  • Predicted Negative: 2 prompt injections as benign inputs
  • Predicted Positive: 3 benign inputs as prompt injections

Based on these results, we can calculate scoring using the formulas in the metrics table above.

MetricResult
Recall96.0%
Accuracy99.5%
False Positive Rate0.32%

Testing Datasets

A labeled dataset is required to measure Lakera Guard’s detection capabilities with a Confusion Matrix. Lakera Guard provides detectors for Prompt Injection, Content Moderation, Personally Identifiable Information (PII), and Unknown Links.

Labeling

Each detector requires a correctly labeled dataset. As an example, consider the prompt injection endpoint. The dataset must contain a set of known prompt injections and/or jailbreaks, and a set of benign prompts. The quality of the labeled testing data is crucial for producing meaningful results.

Datasets can be in any structured format, but should contain consistent and accurate labeling. For example, a JSON-encoded dataset may look like this:

1[
2 {
3 "text": "bengign prompt input...",
4 "label": false
5 },
6 {
7 "text": "another bengign prompt input...",
8 "label": false
9 },
10 {
11 "text": "prompt injection input...",
12 "label": true
13 }
14 // Additional entries...
15]

Measuring Lakera Guard’s Detection Efficacy

The Confusion Matrix Benchmark offers a consistent and repeatable framework for evaluating Lakera Guard’s detection capabilities across various datasets. This benchmark establishes baseline metrics, allowing for standardized performance measurement against real-world use cases. It facilitates a systematic approach to assess the accuracy and reliability of Lakera Guard’s detection efficacy by analyzing key metrics such as recall, accuracy, and false positive rate.

Be aware of monthly API requests limit as outlined in your trial agreement.

Measuring Lakera Guard’s Latency

Considering the inherent latency of LLMs, it’s crucial for additional layered solutions to prioritize speed. Lakera measures latency based on API response times for each request to Guard.

The input length of a request, measured by the number of input characters in a single request, is the most significant contributor to latency. There is a latency tradeoff for increased token size, and you may elect to chunk inputs into smaller sizes to decrease latency.

Establishing persistent connections with Lakera Guard is highly recommended to reduce networking overhead and reduce latency. Persistent connections reduce latency by ~100% to 350% compared to new connections. Any method for persistent connections is viable. For example:

1import requests
2
3session = requests.Session() # Allows persistent connection

Latency Benchmark

The Latency Benchmark offers a clear and repeatable framework for evaluating Lakera Guard’s response times across varying input volumes before integration. The benchmark establishes baseline metrics, providing a standard to compare against once deployed. Additionally, it helps identify strategies to optimize latency, particularly in scenarios that involve processing large inputs.

Be aware of monthly API requests limit as outlined in your trial agreement.

Scaling (Self-Hosted Only)

Lakera recommends horizontally scaling load to optimize latency and suggests enabling auto-scaling. In a Kubernetes environment, this would be a horizontal pod autoscaler (HPA) in conjunction with a load-balancer to spread requests to multiple pods.

Often the objective function used to determine pod elasticity is latency. Latency is correlated to prompt length and request load. Larger prompts are slower, which may require more containers to service a wide range of prompt lengths. Likewise, as the request rate grows, latency will increase. Tracking latency as the scaling metric offers better results than tracking CPU load.

For more detailed guidance, see the Sizing Guide documentation.