This tutorial uses our legacy v1 endpoints. It will be updated soon.

A Confusion Matrix is a standardized approach for gaining insights into how well the model identifies positive instances and avoids false detections. The Lakera Guard Confusion Matrix Benchmark offers a streamlined framework for collecting True Positive, True Negative, False Positive, False Negative data. Categorical scoring allows for the calculation of Accuracy, Recall, and False Positive Rate.

The following steps are written for testing the Lakera Guard SaaS API. The benchmark is also applicable for self-hosted evaluation, with minor changes.

The complete SaaS and Self-Hosted Benchmarks are available at the bottom of this page.

Prerequisites

For testing the Lakera Guard SaaS API, you’ll need to obtain an API key.

Environment Variables

Set the LAKERA_GUARD_API_KEY environment variable with your API key:

$ export LAKERA_GUARD_API_KEY='your_api_key_here'

Install Dependencies

Next, you’ll need to install the required packages - we recommend using a Python Virtual Environment - to avoid conflicts with other projects.

$ pip install --upgrade datasets tqdm requests pandas

Import Dependencies

Then create a new Python file.

$ touch confusion_matrix_benchmark.py

Next, import the required packages and read in API Key.

1 import json
2 import requests
3 import os
4 import argparse
5 import pandas as pd
6 from tqdm import tqdm
7 from typing import Callable, Optional
8 
9 def get_env_var(var_name):
10     value = os.getenv(var_name)
11     if value is None:
12         raise ValueError(f"Environment variable {var_name} not set")
13     return value
14 
15 # Ensure Lakera Guard API key is set
16 lakera_guard_api_key = get_env_var('LAKERA_GUARD_API_KEY')

Prepare a Dataset

The Confusion Matrix Benchmark expects data in JSON format. The dataset should be structured as a list of dictionaries, where each dictionary represents an individual data point. For example, the structure of a prompt injection dataset should look like this:

1 [
2   {
3     "text": "bengign prompt input...",
4     "label": false
5   },
6   {
7     "text": "another bengign prompt input...",
8     "label": false
9   },
10   {
11     "text": "prompt injection input...",
12     "label": true
13   }
14   // Additional entries...
15 ]

Load and Validate JSON File

Next we load our data, ensuring it meets the specific structure outlined above.

1 def load_and_transform_json(json_file_path: str) -> pd.DataFrame:
2     with open(json_file_path, 'r') as file:
3         data = json.load(file)
4 
5     if not isinstance(data, list) or not all(isinstance(item, dict) and 'text' in item and 'label' in item for item in data):
6         raise ValueError("Invalid JSON file structure. Ensure it is a list of dictionaries with 'text' and 'label' keys.")
7 
8     return pd.DataFrame(data)

Interacting with Lakera Guard

We define a function to authenticate and establish a persistent connection with the Lakera Guard API. We also create evaluate_lakera_guard to send a prompt and evaluate the result. We are interested in whether Lakera Guard returns flagged:true or flagged:false in its response. Remember, based on our labeled dataset we expect a predicted positive to return true and a predicted negative to return false.

1 # Setting up the Lakera Guard session
2 def setup_lakera_session(api_key: str) -> requests.Session:
3     session = requests.Session()
4     session.headers.update({"Authorization": f'Bearer {api_key}'})
5     return session
6 
7 lakera_session = setup_lakera_session(lakera_guard_api_key)
8 
9 # Evaluate a single prompt with Lakera Guard
10 def evaluate_lakera_guard(prompt: str, session: requests.Session, endpoint: str) -> dict:
11     valid_endpoints = ["prompt_injection", "moderation", "pii", "unknown_links"]
12 
13     if endpoint not in valid_endpoints:
14         raise ValueError(f"Invalid endpoint specified. Valid endpoints are: {', '.join(valid_endpoints)}")
15 
16     request_json = {"input": prompt}
17     response = session.post(f"https://api.lakera.ai/v1/{endpoint}", json=request_json)
18     response.raise_for_status()
19     result = response.json()
20     return {"input": prompt, "response": result, "flagged": result["results"][0]["flagged"]}

Confusion Matrix

Now we’ll create an evaluate_dataset function to analyze True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).

1 def evaluate_lakera_guard(prompt: str, session: requests.Session, endpoint: str) -> dict:
2     valid_endpoints = ["prompt_injection", "moderation", "pii", "unknown_links"]
3 
4     if endpoint not in valid_endpoints:
5         raise ValueError(f"Invalid endpoint specified. Valid endpoints are: {', '.join(valid_endpoints)}")
6 
7     request_json = {"input": prompt}
8     response = session.post(f"https://api.lakera.ai/v1/{endpoint}", json=request_json)
9     response.raise_for_status()
10     result = response.json()
11     return {"input": prompt, "response": result, "flagged": result["results"][0]["flagged"]}
12 
13 def evaluate_dataset(df: pd.DataFrame, eval_function: Callable, session: requests.Session, endpoint: str) -> tuple:
14     false_positives = []
15     false_negatives = []
16 
17     predictions = []
18     for i, row in tqdm(df.iterrows(), total=len(df), desc="Evaluating prompts"):
19         result = eval_function(row["text"], session, endpoint)
20         predictions.append(result["flagged"])
21         if result["flagged"] and row["label"] == 0:
22             false_positives.append(result)
23         elif not result["flagged"] and row["label"] == 1:
24             false_negatives.append(result)
25 
26     df["prediction"] = predictions
27     df["correct"] = (df["prediction"] & (df["label"] == 1)) | (~df["prediction"] & (df["label"] == 0))
28 
29     TP = ((df["prediction"] == True) & (df["label"] == 1)).sum()
30     FP = ((df["prediction"] == True) & (df["label"] == 0)).sum()
31     FN = ((df["prediction"] == False) & (df["label"] == 1)).sum()
32     TN = ((df["prediction"] == False) & (df["label"] == 0)).sum()
33 
34     return df, TP, FP, FN, TN, false_positives, false_negatives

Running the Benchmark

We’ve now established all the core functionality needed to evluate the Lakera Guard API with a Confusion Matrix. In our completed Benchmark script we will also define functions for logging ouput and calculating Accuracy, Recall, and False Positive Rate.

Lakera Guard Confusion Matrix Benchmark (SaaS)

Ensure you’ve created a confusion_matrix_benchmark.py file with the complete code outlined below. To run the benchmark, we must pass in the following flags:

-f: Path to the input JSON file
s: Sample size for evaluation. Be aware of monthly API requests limit as outlined in your trial agreement when setting this value.
e: API endoint. Valid options are prompt_injection, moderation, pii, unknown_links

Example Usage

The following will send 100 requests, extracted from file.json to the Prompt Injection endpoint.

$ python confusion_matrix_benchmark.py -f path/to/your/file.json -s 100 -e prompt_injection

`confusion_matrix_benchmark.py`

1 import json
2 import requests
3 import os
4 import argparse
5 import pandas as pd
6 from tqdm import tqdm
7 from typing import Callable, Optional
8 
9 def get_env_var(var_name):
10     value = os.getenv(var_name)
11     if value is None:
12         raise ValueError(f"Environment variable {var_name} not set")
13     return value
14 
15 # Ensure Lakera Guard API key is set
16 lakera_guard_api_key = get_env_var('LAKERA_GUARD_API_KEY')
17 
18 # Validate and load JSON file
19 def load_and_transform_json(json_file_path: str) -> pd.DataFrame:
20     with open(json_file_path, 'r') as file:
21         data = json.load(file)
22 
23     if not isinstance(data, list) or not all(isinstance(item, dict) and 'text' in item and 'label' in item for item in data):
24         raise ValueError("Invalid JSON file structure. Ensure it is a list of dictionaries with 'text' and 'label' keys.")
25 
26     return pd.DataFrame(data)
27 
28 # Setting up the Lakera Guard session
29 def setup_lakera_session(api_key: str) -> requests.Session:
30     session = requests.Session()
31     session.headers.update({"Authorization": f'Bearer {api_key}'})
32     return session
33 
34 lakera_session = setup_lakera_session(lakera_guard_api_key)
35 
36 # Evaluate a single prompt with Lakera Guard
37 def evaluate_lakera_guard(prompt: str, session: requests.Session, endpoint: str) -> dict:
38     valid_endpoints = ["prompt_injection", "moderation", "pii", "unknown_links"]
39 
40     if endpoint not in valid_endpoints:
41         raise ValueError(f"Invalid endpoint specified. Valid endpoints are: {', '.join(valid_endpoints)}")
42 
43     request_json = {"input": prompt}
44     response = session.post(f"https://api.lakera.ai/v1/{endpoint}", json=request_json)
45     response.raise_for_status()
46     result = response.json()
47     return {"input": prompt, "response": result, "flagged": result["results"][0]["flagged"]}
48 
49 def evaluate_dataset(df: pd.DataFrame, eval_function: Callable, session: requests.Session, endpoint: str) -> tuple:
50     false_positives = []
51     false_negatives = []
52 
53     predictions = []
54     for i, row in tqdm(df.iterrows(), total=len(df), desc="Evaluating prompts"):
55         result = eval_function(row["text"], session, endpoint)
56         predictions.append(result["flagged"])
57         if result["flagged"] and row["label"] == 0:
58             false_positives.append(result)
59         elif not result["flagged"] and row["label"] == 1:
60             false_negatives.append(result)
61 
62     df["prediction"] = predictions
63     df["correct"] = (df["prediction"] & (df["label"] == 1)) | (~df["prediction"] & (df["label"] == 0))
64 
65     TP = ((df["prediction"] == True) & (df["label"] == 1)).sum()
66     FP = ((df["prediction"] == True) & (df["label"] == 0)).sum()
67     FN = ((df["prediction"] == False) & (df["label"] == 1)).sum()
68     TN = ((df["prediction"] == False) & (df["label"] == 0)).sum()
69 
70     return df, TP, FP, FN, TN, false_positives, false_negatives
71 
72 def print_and_save_results(total_requests, total_accurate, TP, FP, FN, TN, recall, accuracy, false_positive_rate, summary_filename):
73     summary = f"""
74 Summary
75 =======
76 Total requests made: {total_requests}
77 Total number of accurately classified prompts: {total_accurate}
78 Total number of false positives: {FP}
79 Total number of false negatives: {FN}
80 
81 Metrics
82 =======
83 Recall: {recall * 100:.2f}%
84 Accuracy: {accuracy * 100:.2f}%
85 False Positive Rate: {false_positive_rate * 100:.2f}%
86 
87 Confusion Matrix
88 ================
89                    | Predicted Positive | Predicted Negative
90 -------------------+--------------------+-------------------
91 Actual Positive    | {TP:<20} | {FN:<19}
92 Actual Negative    | {FP:<20} | {TN:<19}
93 """
94 
95     print(summary)
96 
97     with open(summary_filename, 'w') as file:
98         file.write(summary)
99     print(f"Summary and metrics saved to {summary_filename}")
100 
101 def save_results(results: list, filename: str, result_type: str):
102     with open(filename, 'w') as file:
103         json.dump(results, file, indent=4)
104     print(f"\n{result_type} saved to {filename}")
105 
106 def confusion_matrix_benchmark(
107     json_file_path: str,
108     sample_size: int,
109     endpoint: str,
110     eval_function: Callable = evaluate_lakera_guard,
111     session: requests.Session = lakera_session,
112     quiet: bool = False
113 ) -> Optional[tuple]:
114     # Load and transform JSON file
115     df = load_and_transform_json(json_file_path)
116 
117     # Sample the specified number of random rows
118     df_sample = df.sample(n=sample_size).reset_index(drop=True)
119 
120     # Initialize progress bar and evaluate dataset
121     df_sample, TP, FP, FN, TN, false_positives, false_negatives = evaluate_dataset(df_sample, eval_function, session, endpoint)
122 
123     total_requests = len(df_sample)
124     total_accurate = TP + TN
125     recall = TP / (TP + FN) if (TP + FN) > 0 else 0
126     accuracy = (TP + TN) / (TP + TN + FP + FN) if (TP + TN + FP + FN) > 0 else 0
127     false_positive_rate = FP / (FP + TN) if (FP + TN) > 0 else 0
128 
129     summary_filename = f"{endpoint}_summary.txt"
130     if not quiet:
131         print_and_save_results(total_requests, total_accurate, TP, FP, FN, TN, recall, accuracy, false_positive_rate, summary_filename)
132 
133     # Save results with endpoint name in the filename
134     save_results(false_positives, f"{endpoint}_false_positives.json", "False Positives")
135     save_results(false_negatives, f"{endpoint}_false_negatives.json", "False Negatives")
136 
137     if quiet:
138         return recall, accuracy, false_positive_rate, df_sample
139 
140 if __name__ == "__main__":
141     parser = argparse.ArgumentParser(
142         description="Lakera Guard API Confusion Matrix Benchmark",
143         formatter_class=argparse.ArgumentDefaultsHelpFormatter
144     )
145     parser.add_argument('-f', '--file', type=str, required=True, help="Path to the input JSON file")
146     parser.add_argument('-s', '--sample_size', type=int, default=100, help="Sample size for evaluation")
147     parser.add_argument('-e', '--endpoint', type=str, required=True, help="API endpoint. Valid options are: prompt_injection, moderation, pii, unknown_links")
148 
149     args = parser.parse_args()
150 
151     confusion_matrix_benchmark(
152         json_file_path=args.file,
153         sample_size=args.sample_size,
154         endpoint=args.endpoint
155     )

Lakera Guard Confusion Matrix Benchmark (Self-Hosted)

The Self-Hosted Benchmark works in the same way, with minor changes as the authentication is not needed. We also provide the ability to pass in custom hostnames. Ensure you’ve created a confusion_matrix_benchmark.py file with the complete code outlined below. To run the benchmark, we must pass in the following flags:

-f: Path to the input JSON file
s: Sample size for evaluation. Be aware of monthly API requests limit as outlined in your trial agreement when setting this value.
e: API endoint. Valid options are prompt_injection, moderation, pii, unknown_links
-u or --base_url: (Optional) Base URL for the API. Default is http://localhost:8000/v1.

Example Usage

The following will send 100 requests, extracted from file.json to the Prompt Injection endpoint.

$ python confusion_matrix_benchmark.py -f path/to/your/file.json -s 100 -e prompt_injection -u 'http://custom.api.url:8000/v1/'

`confusion_matrix_benchmark.py`

1 import json
2 import requests
3 import os
4 import argparse
5 import pandas as pd
6 from tqdm import tqdm
7 from typing import Callable, Optional
8 
9 def get_env_var(var_name):
10     value = os.getenv(var_name)
11     if value is None:
12         raise ValueError(f"Environment variable {var_name} not set")
13     return value
14 
15 # Ensure Lakera Guard API key is set
16 lakera_guard_api_key = get_env_var('LAKERA_GUARD_API_KEY')
17 
18 # Validate and load JSON file
19 def load_and_transform_json(json_file_path: str) -> pd.DataFrame:
20     with open(json_file_path, 'r') as file:
21         data = json.load(file)
22 
23     if not isinstance(data, list) or not all(isinstance(item, dict) and 'text' in item and 'label' in item for item in data):
24         raise ValueError("Invalid JSON file structure. Ensure it is a list of dictionaries with 'text' and 'label' keys.")
25 
26     return pd.DataFrame(data)
27 
28 # Setting up the Lakera Guard session
29 def setup_lakera_session(api_key: str) -> requests.Session:
30     session = requests.Session()
31     session.headers.update({"Authorization": f'Bearer {api_key}'})
32     return session
33 
34 lakera_session = setup_lakera_session(lakera_guard_api_key)
35 
36 # Evaluate a single prompt with Lakera Guard
37 def evaluate_lakera_guard(prompt: str, session: requests.Session, endpoint: str, base_url: str) -> dict:
38     valid_endpoints = ["prompt_injection", "moderation", "pii", "unknown_links"]
39 
40     if endpoint not in valid_endpoints:
41         raise ValueError(f"Invalid endpoint specified. Valid endpoints are: {', '.join(valid_endpoints)}")
42 
43     request_json = {"input": prompt}
44     response = session.post(f"{base_url}/{endpoint}", json=request_json)
45     response.raise_for_status()
46     result = response.json()
47     return {"input": prompt, "response": result, "flagged": result["results"][0]["flagged"]}
48 
49 def evaluate_dataset(df: pd.DataFrame, eval_function: Callable, session: requests.Session, endpoint: str, base_url: str) -> tuple:
50     false_positives = []
51     false_negatives = []
52 
53     predictions = []
54     for i, row in tqdm(df.iterrows(), total=len(df), desc="Evaluating prompts"):
55         result = eval_function(row["text"], session, endpoint, base_url)
56         predictions.append(result["flagged"])
57         if result["flagged"] and row["label"] == 0:
58             false_positives.append(result)
59         elif not result["flagged"] and row["label"] == 1:
60             false_negatives.append(result)
61 
62     df["prediction"] = predictions
63     df["correct"] = (df["prediction"] & (df["label"] == 1)) | (~df["prediction"] & (df["label"] == 0))
64 
65     TP = ((df["prediction"] == True) & (df["label"] == 1)).sum()
66     FP = ((df["prediction"] == True) & (df["label"] == 0)).sum()
67     FN = ((df["prediction"] == False) & (df["label"] == 1)).sum()
68     TN = ((df["prediction"] == False) & (df["label"] == 0)).sum()
69 
70     return df, TP, FP, FN, TN, false_positives, false_negatives
71 
72 def print_and_save_results(total_requests, total_accurate, TP, FP, FN, TN, recall, accuracy, false_positive_rate, summary_filename):
73     summary = f"""
74 Summary
75 =======
76 Total requests made: {total_requests}
77 Total number of accurately classified prompts: {total_accurate}
78 Total number of false positives: {FP}
79 Total number of false negatives: {FN}
80 
81 Metrics
82 =======
83 Recall: {recall * 100:.2f}%
84 Accuracy: {accuracy * 100:.2f}%
85 False Positive Rate: {false_positive_rate * 100:.2f}%
86 
87 Confusion Matrix
88 ================
89                    | Predicted Positive | Predicted Negative
90 -------------------+--------------------+-------------------
91 Actual Positive    | {TP:<20} | {FN:<19}
92 Actual Negative    | {FP:<20} | {TN:<19}
93 """
94 
95     print(summary)
96 
97     with open(summary_filename, 'w') as file:
98         file.write(summary)
99     print(f"Summary and metrics saved to {summary_filename}")
100 
101 def save_results(results: list, filename: str, result_type: str):
102     with open(filename, 'w') as file:
103         json.dump(results, file, indent=4)
104     print(f"\n{result_type} saved to {filename}")
105 
106 def confusion_matrix_benchmark(
107     json_file_path: str,
108     sample_size: int,
109     endpoint: str,
110     base_url: str,
111     eval_function: Callable = evaluate_lakera_guard,
112     session: requests.Session = lakera_session,
113     quiet: bool = False
114 ) -> Optional[tuple]:
115     # Load and transform JSON file
116     df = load_and_transform_json(json_file_path)
117 
118     # Sample the specified number of random rows
119     df_sample = df.sample(n=sample_size).reset_index(drop=True)
120 
121     # Initialize progress bar and evaluate dataset
122     df_sample, TP, FP, FN, TN, false_positives, false_negatives = evaluate_dataset(df_sample, eval_function, session, endpoint, base_url)
123 
124     total_requests = len(df_sample)
125     total_accurate = TP + TN
126     recall = TP / (TP + FN) if (TP + FN) > 0 else 0
127     accuracy = (TP + TN) / (TP + TN + FP + FN) if (TP + TN + FP + FN) > 0 else 0
128     false_positive_rate = FP / (FP + TN) if (FP + TN) > 0 else 0
129 
130     summary_filename = f"{endpoint}_summary.txt"
131     if not quiet:
132         print_and_save_results(total_requests, total_accurate, TP, FP, FN, TN, recall, accuracy, false_positive_rate, summary_filename)
133 
134     # Save results with endpoint name in the filename
135     save_results(false_positives, f"{endpoint}_false_positives.json", "False Positives")
136     save_results(false_negatives, f"{endpoint}_false_negatives.json", "False Negatives")
137 
138     if quiet:
139         return recall, accuracy, false_positive_rate, df_sample
140 
141 if __name__ == "__main__":
142     parser = argparse.ArgumentParser(
143         description="Lakera Guard API Confusion Matrix Benchmark",
144         formatter_class=argparse.ArgumentDefaultsHelpFormatter
145     )
146     parser.add_argument('-f', '--file', type=str, required=True, help="Path to the input JSON file")
147     parser.add_argument('-s', '--sample_size', type=int, default=100, help="Sample size for evaluation")
148     parser.add_argument('-e', '--endpoint', type=str, required=True, help="API endpoint. Valid options are: prompt_injection, moderation, pii, unknown_links")
149     parser.add_argument('-u', '--base_url', type=str, default="http://localhost:8000/v1", help="Base URL for the API. Default is http://localhost:8000/v1")
150 
151     args = parser.parse_args()
152 
153     confusion_matrix_benchmark(
154         json_file_path=args.file,
155         sample_size=args.sample_size,
156         endpoint=args.endpoint,
157         base_url=args.base_url
158     )