Confusion Matrix Benchmark

This tutorial uses our legacy v1 endpoints. It will be updated soon.

A Confusion Matrix is a standardized approach for gaining insights into how well the model identifies positive instances and avoids false detections. The Lakera Guard Confusion Matrix Benchmark offers a streamlined framework for collecting True Positive, True Negative, False Positive, False Negative data. Categorical scoring allows for the calculation of Accuracy, Recall, and False Positive Rate.

The following steps are written for testing the Lakera Guard SaaS API. The benchmark is also applicable for self-hosted evaluation, with minor changes.

The complete SaaS and Self-Hosted Benchmarks are available at the bottom of this page.

Prerequisites

For testing the Lakera Guard SaaS API, you’ll need to obtain an API key.

Environment Variables

Set the LAKERA_GUARD_API_KEY environment variable with your API key:

$export LAKERA_GUARD_API_KEY='your_api_key_here'

Install Dependencies

Next, you’ll need to install the required packages - we recommend using a Python Virtual Environment - to avoid conflicts with other projects.

$pip install --upgrade datasets tqdm requests pandas

Import Dependencies

Then create a new Python file.

$touch confusion_matrix_benchmark.py

Next, import the required packages and read in API Key.

1import json
2import requests
3import os
4import argparse
5import pandas as pd
6from tqdm import tqdm
7from typing import Callable, Optional
8
9def get_env_var(var_name):
10 value = os.getenv(var_name)
11 if value is None:
12 raise ValueError(f"Environment variable {var_name} not set")
13 return value
14
15# Ensure Lakera Guard API key is set
16lakera_guard_api_key = get_env_var('LAKERA_GUARD_API_KEY')

Prepare a Dataset

The Confusion Matrix Benchmark expects data in JSON format. The dataset should be structured as a list of dictionaries, where each dictionary represents an individual data point. For example, the structure of a prompt injection dataset should look like this:

1[
2 {
3 "text": "bengign prompt input...",
4 "label": false
5 },
6 {
7 "text": "another bengign prompt input...",
8 "label": false
9 },
10 {
11 "text": "prompt injection input...",
12 "label": true
13 }
14 // Additional entries...
15]

Load and Validate JSON File

Next we load our data, ensuring it meets the specific structure outlined above.

1def load_and_transform_json(json_file_path: str) -> pd.DataFrame:
2 with open(json_file_path, 'r') as file:
3 data = json.load(file)
4
5 if not isinstance(data, list) or not all(isinstance(item, dict) and 'text' in item and 'label' in item for item in data):
6 raise ValueError("Invalid JSON file structure. Ensure it is a list of dictionaries with 'text' and 'label' keys.")
7
8 return pd.DataFrame(data)

Interacting with Lakera Guard

We define a function to authenticate and establish a persistent connection with the Lakera Guard API. We also create evaluate_lakera_guard to send a prompt and evaluate the result. We are interested in whether Lakera Guard returns flagged:true or flagged:false in its response. Remember, based on our labeled dataset we expect a predicted positive to return true and a predicted negative to return false.

1# Setting up the Lakera Guard session
2def setup_lakera_session(api_key: str) -> requests.Session:
3 session = requests.Session()
4 session.headers.update({"Authorization": f'Bearer {api_key}'})
5 return session
6
7lakera_session = setup_lakera_session(lakera_guard_api_key)
8
9# Evaluate a single prompt with Lakera Guard
10def evaluate_lakera_guard(prompt: str, session: requests.Session, endpoint: str) -> dict:
11 valid_endpoints = ["prompt_injection", "moderation", "pii", "unknown_links"]
12
13 if endpoint not in valid_endpoints:
14 raise ValueError(f"Invalid endpoint specified. Valid endpoints are: {', '.join(valid_endpoints)}")
15
16 request_json = {"input": prompt}
17 response = session.post(f"https://api.lakera.ai/v1/{endpoint}", json=request_json)
18 response.raise_for_status()
19 result = response.json()
20 return {"input": prompt, "response": result, "flagged": result["results"][0]["flagged"]}

Confusion Matrix

Now we’ll create an evaluate_dataset function to analyze True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).

1def evaluate_lakera_guard(prompt: str, session: requests.Session, endpoint: str) -> dict:
2 valid_endpoints = ["prompt_injection", "moderation", "pii", "unknown_links"]
3
4 if endpoint not in valid_endpoints:
5 raise ValueError(f"Invalid endpoint specified. Valid endpoints are: {', '.join(valid_endpoints)}")
6
7 request_json = {"input": prompt}
8 response = session.post(f"https://api.lakera.ai/v1/{endpoint}", json=request_json)
9 response.raise_for_status()
10 result = response.json()
11 return {"input": prompt, "response": result, "flagged": result["results"][0]["flagged"]}
12
13def evaluate_dataset(df: pd.DataFrame, eval_function: Callable, session: requests.Session, endpoint: str) -> tuple:
14 false_positives = []
15 false_negatives = []
16
17 predictions = []
18 for i, row in tqdm(df.iterrows(), total=len(df), desc="Evaluating prompts"):
19 result = eval_function(row["text"], session, endpoint)
20 predictions.append(result["flagged"])
21 if result["flagged"] and row["label"] == 0:
22 false_positives.append(result)
23 elif not result["flagged"] and row["label"] == 1:
24 false_negatives.append(result)
25
26 df["prediction"] = predictions
27 df["correct"] = (df["prediction"] & (df["label"] == 1)) | (~df["prediction"] & (df["label"] == 0))
28
29 TP = ((df["prediction"] == True) & (df["label"] == 1)).sum()
30 FP = ((df["prediction"] == True) & (df["label"] == 0)).sum()
31 FN = ((df["prediction"] == False) & (df["label"] == 1)).sum()
32 TN = ((df["prediction"] == False) & (df["label"] == 0)).sum()
33
34 return df, TP, FP, FN, TN, false_positives, false_negatives

Running the Benchmark

We’ve now established all the core functionality needed to evluate the Lakera Guard API with a Confusion Matrix. In our completed Benchmark script we will also define functions for logging ouput and calculating Accuracy, Recall, and False Positive Rate.

Lakera Guard Confusion Matrix Benchmark (SaaS)

Ensure you’ve created a confusion_matrix_benchmark.py file with the complete code outlined below. To run the benchmark, we must pass in the following flags:

  • -f: Path to the input JSON file
  • s: Sample size for evaluation. Be aware of monthly API requests limit as outlined in your trial agreement when setting this value.
  • e: API endoint. Valid options are prompt_injection, moderation, pii, unknown_links

Example Usage

The following will send 100 requests, extracted from file.json to the Prompt Injection endpoint.

$python confusion_matrix_benchmark.py -f path/to/your/file.json -s 100 -e prompt_injection

confusion_matrix_benchmark.py

1import json
2import requests
3import os
4import argparse
5import pandas as pd
6from tqdm import tqdm
7from typing import Callable, Optional
8
9def get_env_var(var_name):
10 value = os.getenv(var_name)
11 if value is None:
12 raise ValueError(f"Environment variable {var_name} not set")
13 return value
14
15# Ensure Lakera Guard API key is set
16lakera_guard_api_key = get_env_var('LAKERA_GUARD_API_KEY')
17
18# Validate and load JSON file
19def load_and_transform_json(json_file_path: str) -> pd.DataFrame:
20 with open(json_file_path, 'r') as file:
21 data = json.load(file)
22
23 if not isinstance(data, list) or not all(isinstance(item, dict) and 'text' in item and 'label' in item for item in data):
24 raise ValueError("Invalid JSON file structure. Ensure it is a list of dictionaries with 'text' and 'label' keys.")
25
26 return pd.DataFrame(data)
27
28# Setting up the Lakera Guard session
29def setup_lakera_session(api_key: str) -> requests.Session:
30 session = requests.Session()
31 session.headers.update({"Authorization": f'Bearer {api_key}'})
32 return session
33
34lakera_session = setup_lakera_session(lakera_guard_api_key)
35
36# Evaluate a single prompt with Lakera Guard
37def evaluate_lakera_guard(prompt: str, session: requests.Session, endpoint: str) -> dict:
38 valid_endpoints = ["prompt_injection", "moderation", "pii", "unknown_links"]
39
40 if endpoint not in valid_endpoints:
41 raise ValueError(f"Invalid endpoint specified. Valid endpoints are: {', '.join(valid_endpoints)}")
42
43 request_json = {"input": prompt}
44 response = session.post(f"https://api.lakera.ai/v1/{endpoint}", json=request_json)
45 response.raise_for_status()
46 result = response.json()
47 return {"input": prompt, "response": result, "flagged": result["results"][0]["flagged"]}
48
49def evaluate_dataset(df: pd.DataFrame, eval_function: Callable, session: requests.Session, endpoint: str) -> tuple:
50 false_positives = []
51 false_negatives = []
52
53 predictions = []
54 for i, row in tqdm(df.iterrows(), total=len(df), desc="Evaluating prompts"):
55 result = eval_function(row["text"], session, endpoint)
56 predictions.append(result["flagged"])
57 if result["flagged"] and row["label"] == 0:
58 false_positives.append(result)
59 elif not result["flagged"] and row["label"] == 1:
60 false_negatives.append(result)
61
62 df["prediction"] = predictions
63 df["correct"] = (df["prediction"] & (df["label"] == 1)) | (~df["prediction"] & (df["label"] == 0))
64
65 TP = ((df["prediction"] == True) & (df["label"] == 1)).sum()
66 FP = ((df["prediction"] == True) & (df["label"] == 0)).sum()
67 FN = ((df["prediction"] == False) & (df["label"] == 1)).sum()
68 TN = ((df["prediction"] == False) & (df["label"] == 0)).sum()
69
70 return df, TP, FP, FN, TN, false_positives, false_negatives
71
72def print_and_save_results(total_requests, total_accurate, TP, FP, FN, TN, recall, accuracy, false_positive_rate, summary_filename):
73 summary = f"""
74Summary
75=======
76Total requests made: {total_requests}
77Total number of accurately classified prompts: {total_accurate}
78Total number of false positives: {FP}
79Total number of false negatives: {FN}
80
81Metrics
82=======
83Recall: {recall * 100:.2f}%
84Accuracy: {accuracy * 100:.2f}%
85False Positive Rate: {false_positive_rate * 100:.2f}%
86
87Confusion Matrix
88================
89 | Predicted Positive | Predicted Negative
90-------------------+--------------------+-------------------
91Actual Positive | {TP:<20} | {FN:<19}
92Actual Negative | {FP:<20} | {TN:<19}
93"""
94
95 print(summary)
96
97 with open(summary_filename, 'w') as file:
98 file.write(summary)
99 print(f"Summary and metrics saved to {summary_filename}")
100
101def save_results(results: list, filename: str, result_type: str):
102 with open(filename, 'w') as file:
103 json.dump(results, file, indent=4)
104 print(f"\n{result_type} saved to {filename}")
105
106def confusion_matrix_benchmark(
107 json_file_path: str,
108 sample_size: int,
109 endpoint: str,
110 eval_function: Callable = evaluate_lakera_guard,
111 session: requests.Session = lakera_session,
112 quiet: bool = False
113) -> Optional[tuple]:
114 # Load and transform JSON file
115 df = load_and_transform_json(json_file_path)
116
117 # Sample the specified number of random rows
118 df_sample = df.sample(n=sample_size).reset_index(drop=True)
119
120 # Initialize progress bar and evaluate dataset
121 df_sample, TP, FP, FN, TN, false_positives, false_negatives = evaluate_dataset(df_sample, eval_function, session, endpoint)
122
123 total_requests = len(df_sample)
124 total_accurate = TP + TN
125 recall = TP / (TP + FN) if (TP + FN) > 0 else 0
126 accuracy = (TP + TN) / (TP + TN + FP + FN) if (TP + TN + FP + FN) > 0 else 0
127 false_positive_rate = FP / (FP + TN) if (FP + TN) > 0 else 0
128
129 summary_filename = f"{endpoint}_summary.txt"
130 if not quiet:
131 print_and_save_results(total_requests, total_accurate, TP, FP, FN, TN, recall, accuracy, false_positive_rate, summary_filename)
132
133 # Save results with endpoint name in the filename
134 save_results(false_positives, f"{endpoint}_false_positives.json", "False Positives")
135 save_results(false_negatives, f"{endpoint}_false_negatives.json", "False Negatives")
136
137 if quiet:
138 return recall, accuracy, false_positive_rate, df_sample
139
140if __name__ == "__main__":
141 parser = argparse.ArgumentParser(
142 description="Lakera Guard API Confusion Matrix Benchmark",
143 formatter_class=argparse.ArgumentDefaultsHelpFormatter
144 )
145 parser.add_argument('-f', '--file', type=str, required=True, help="Path to the input JSON file")
146 parser.add_argument('-s', '--sample_size', type=int, default=100, help="Sample size for evaluation")
147 parser.add_argument('-e', '--endpoint', type=str, required=True, help="API endpoint. Valid options are: prompt_injection, moderation, pii, unknown_links")
148
149 args = parser.parse_args()
150
151 confusion_matrix_benchmark(
152 json_file_path=args.file,
153 sample_size=args.sample_size,
154 endpoint=args.endpoint
155 )

Lakera Guard Confusion Matrix Benchmark (Self-Hosted)

The Self-Hosted Benchmark works in the same way, with minor changes as the authentication is not needed. We also provide the ability to pass in custom hostnames. Ensure you’ve created a confusion_matrix_benchmark.py file with the complete code outlined below. To run the benchmark, we must pass in the following flags:

  • -f: Path to the input JSON file
  • s: Sample size for evaluation. Be aware of monthly API requests limit as outlined in your trial agreement when setting this value.
  • e: API endoint. Valid options are prompt_injection, moderation, pii, unknown_links
  • -u or --base_url: (Optional) Base URL for the API. Default is http://localhost:8000/v1.

Example Usage

The following will send 100 requests, extracted from file.json to the Prompt Injection endpoint.

$python confusion_matrix_benchmark.py -f path/to/your/file.json -s 100 -e prompt_injection -u 'http://custom.api.url:8000/v1/'

confusion_matrix_benchmark.py

1import json
2import requests
3import os
4import argparse
5import pandas as pd
6from tqdm import tqdm
7from typing import Callable, Optional
8
9def get_env_var(var_name):
10 value = os.getenv(var_name)
11 if value is None:
12 raise ValueError(f"Environment variable {var_name} not set")
13 return value
14
15# Ensure Lakera Guard API key is set
16lakera_guard_api_key = get_env_var('LAKERA_GUARD_API_KEY')
17
18# Validate and load JSON file
19def load_and_transform_json(json_file_path: str) -> pd.DataFrame:
20 with open(json_file_path, 'r') as file:
21 data = json.load(file)
22
23 if not isinstance(data, list) or not all(isinstance(item, dict) and 'text' in item and 'label' in item for item in data):
24 raise ValueError("Invalid JSON file structure. Ensure it is a list of dictionaries with 'text' and 'label' keys.")
25
26 return pd.DataFrame(data)
27
28# Setting up the Lakera Guard session
29def setup_lakera_session(api_key: str) -> requests.Session:
30 session = requests.Session()
31 session.headers.update({"Authorization": f'Bearer {api_key}'})
32 return session
33
34lakera_session = setup_lakera_session(lakera_guard_api_key)
35
36# Evaluate a single prompt with Lakera Guard
37def evaluate_lakera_guard(prompt: str, session: requests.Session, endpoint: str, base_url: str) -> dict:
38 valid_endpoints = ["prompt_injection", "moderation", "pii", "unknown_links"]
39
40 if endpoint not in valid_endpoints:
41 raise ValueError(f"Invalid endpoint specified. Valid endpoints are: {', '.join(valid_endpoints)}")
42
43 request_json = {"input": prompt}
44 response = session.post(f"{base_url}/{endpoint}", json=request_json)
45 response.raise_for_status()
46 result = response.json()
47 return {"input": prompt, "response": result, "flagged": result["results"][0]["flagged"]}
48
49def evaluate_dataset(df: pd.DataFrame, eval_function: Callable, session: requests.Session, endpoint: str, base_url: str) -> tuple:
50 false_positives = []
51 false_negatives = []
52
53 predictions = []
54 for i, row in tqdm(df.iterrows(), total=len(df), desc="Evaluating prompts"):
55 result = eval_function(row["text"], session, endpoint, base_url)
56 predictions.append(result["flagged"])
57 if result["flagged"] and row["label"] == 0:
58 false_positives.append(result)
59 elif not result["flagged"] and row["label"] == 1:
60 false_negatives.append(result)
61
62 df["prediction"] = predictions
63 df["correct"] = (df["prediction"] & (df["label"] == 1)) | (~df["prediction"] & (df["label"] == 0))
64
65 TP = ((df["prediction"] == True) & (df["label"] == 1)).sum()
66 FP = ((df["prediction"] == True) & (df["label"] == 0)).sum()
67 FN = ((df["prediction"] == False) & (df["label"] == 1)).sum()
68 TN = ((df["prediction"] == False) & (df["label"] == 0)).sum()
69
70 return df, TP, FP, FN, TN, false_positives, false_negatives
71
72def print_and_save_results(total_requests, total_accurate, TP, FP, FN, TN, recall, accuracy, false_positive_rate, summary_filename):
73 summary = f"""
74Summary
75=======
76Total requests made: {total_requests}
77Total number of accurately classified prompts: {total_accurate}
78Total number of false positives: {FP}
79Total number of false negatives: {FN}
80
81Metrics
82=======
83Recall: {recall * 100:.2f}%
84Accuracy: {accuracy * 100:.2f}%
85False Positive Rate: {false_positive_rate * 100:.2f}%
86
87Confusion Matrix
88================
89 | Predicted Positive | Predicted Negative
90-------------------+--------------------+-------------------
91Actual Positive | {TP:<20} | {FN:<19}
92Actual Negative | {FP:<20} | {TN:<19}
93"""
94
95 print(summary)
96
97 with open(summary_filename, 'w') as file:
98 file.write(summary)
99 print(f"Summary and metrics saved to {summary_filename}")
100
101def save_results(results: list, filename: str, result_type: str):
102 with open(filename, 'w') as file:
103 json.dump(results, file, indent=4)
104 print(f"\n{result_type} saved to {filename}")
105
106def confusion_matrix_benchmark(
107 json_file_path: str,
108 sample_size: int,
109 endpoint: str,
110 base_url: str,
111 eval_function: Callable = evaluate_lakera_guard,
112 session: requests.Session = lakera_session,
113 quiet: bool = False
114) -> Optional[tuple]:
115 # Load and transform JSON file
116 df = load_and_transform_json(json_file_path)
117
118 # Sample the specified number of random rows
119 df_sample = df.sample(n=sample_size).reset_index(drop=True)
120
121 # Initialize progress bar and evaluate dataset
122 df_sample, TP, FP, FN, TN, false_positives, false_negatives = evaluate_dataset(df_sample, eval_function, session, endpoint, base_url)
123
124 total_requests = len(df_sample)
125 total_accurate = TP + TN
126 recall = TP / (TP + FN) if (TP + FN) > 0 else 0
127 accuracy = (TP + TN) / (TP + TN + FP + FN) if (TP + TN + FP + FN) > 0 else 0
128 false_positive_rate = FP / (FP + TN) if (FP + TN) > 0 else 0
129
130 summary_filename = f"{endpoint}_summary.txt"
131 if not quiet:
132 print_and_save_results(total_requests, total_accurate, TP, FP, FN, TN, recall, accuracy, false_positive_rate, summary_filename)
133
134 # Save results with endpoint name in the filename
135 save_results(false_positives, f"{endpoint}_false_positives.json", "False Positives")
136 save_results(false_negatives, f"{endpoint}_false_negatives.json", "False Negatives")
137
138 if quiet:
139 return recall, accuracy, false_positive_rate, df_sample
140
141if __name__ == "__main__":
142 parser = argparse.ArgumentParser(
143 description="Lakera Guard API Confusion Matrix Benchmark",
144 formatter_class=argparse.ArgumentDefaultsHelpFormatter
145 )
146 parser.add_argument('-f', '--file', type=str, required=True, help="Path to the input JSON file")
147 parser.add_argument('-s', '--sample_size', type=int, default=100, help="Sample size for evaluation")
148 parser.add_argument('-e', '--endpoint', type=str, required=True, help="API endpoint. Valid options are: prompt_injection, moderation, pii, unknown_links")
149 parser.add_argument('-u', '--base_url', type=str, default="http://localhost:8000/v1", help="Base URL for the API. Default is http://localhost:8000/v1")
150
151 args = parser.parse_args()
152
153 confusion_matrix_benchmark(
154 json_file_path=args.file,
155 sample_size=args.sample_size,
156 endpoint=args.endpoint,
157 base_url=args.base_url
158 )