Data Leakage Prevention
Lakera Guard can prevent data leakage by screening LLM inputs and outputs for personally identifiable information, system prompts, trigger words, or custom entity types. It can either block the interaction or mask sensitive information. Additionally, Lakera Guard can stop end-user PII from being sent to third-party LLM providers.
Personally Identifiable Information (PII) is any private data that could lead to the identification of an individual.
Organizations that handle PII must safeguard it to prevent unauthorized access or disclosure. Laws such as the General Data Protection Regulation (GDPR) in the European Union, and the Gramm-Leach-Bliley Act (GLBA) and the Health Insurance Portability and Accountability Act (HIPAA) in the United States, impose strict requirements on the handling and protection of PII and strict penalties for non-compliance.
PII can show up in applications powered by Large Language Models (LLMs) in a variety of ways, including:
- A user could enter their own PII, or the PII of another person
- An application that uses retrieval augmented generation (RAG) could retrieve content from a document that unknowingly contains PII and share it with another end user
- Your GenAI application could send customer PII to the third-party LLM provider that powers it, even though your data policy does not allow sharing that data
Lakera Guard will not train on any PII.
PII Detectors
Lakera Guard can be used to identify the following entities:
- Full names
- United States mailing addresses
- Phone numbers
- Email addresses
- Internet Protocol (IP) addresses
- Credit card numbers
- International Bank Account Numbers (IBANs)
- United States Social Security Numbers (SSNs)
Name
The name detector identifies full names of individuals from any cultural background, including names with a middle letter or a full middle name. It is resilient to common typos and punctuation errors.
Examples
Robert Neruda
John C Smith
Rafael Mora
Francis Shawn Key
Yukihiro Ozawa
Aishwarya Rajan
Zainab Malik
Goerge Mller (typo)
Counterexamples
It does not flag single names:
Louise
Ahmed
Maria
It does not flag common test names:
John Doe
Jane Doe
Mailing Addresses
The mailing address detector identifies US mailing addresses that include a street address and possibly one or more of city, state, and zip code. It supports abbreviations for states and common street suffixes. It is resilient to common typos and punctuation errors.
Examples
402 Johnson Street Ozaukee County Port Washington 53074 WI
402 Johnson Street Ozaukee County
1229 COGGIN AVE WASHINGTON CHIPLEY
1990-A Gildersleeve Ave, Bronx, NY 12345
1501 Skyland Blvd E Tuscaloosa AL 35405
777 Brockton Avenue Abington
1000 Highland Colony Pkwy, Ridgeland, MS 39157
Counterexamples
It does not flag non-US postal addresses:
Bahnofstrasse 23, 8001, Zurich, Switzerland
1-1-1 Marunouchi, Chiyoda-ku, Tokyo, Japan
It does not flag names of cities or states, as these are not considered identifying information:
New York
Austin, Texas
Phone numbers
The phone number detector identifies phone numbers that follow the standard US format, with or without the area code. To reduce the occurrence of false positives, only the standard US format is recognized.
Examples
(145) 123-1853
+1 (145) 123-1853 (area code is ignored)
787-124-5123
(145)123-1853 (ignoring spaces is allowed)
Counterexamples
Deviations from the standard format are deliberately not recognized.
(145) 123 1853 (second dash is required)
+1 (145) 123-1853 (area code is ignored)
787-124-515
787-124-51588 (trailing numbers are not allowed)
+41 796548327 (non-US numbers are not recognized)
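As a rough illustration of why a strict format keeps false positives down, the pattern below matches only the dashed US layout shown in the examples. It is a minimal sketch, not Lakera Guard's implementation: the regular expression and the name `looks_like_us_phone` are assumptions made for this example, and the handling of a leading `+1` is not modelled.

```python
import re

# Hypothetical pattern: optional area code written as "(ddd)", "(ddd) " or "ddd-",
# followed by the required "ddd-dddd" part.
US_PHONE = re.compile(r"(?:\([0-9]{3}\) ?|[0-9]{3}-)?[0-9]{3}-[0-9]{4}")


def looks_like_us_phone(text: str) -> bool:
    """Return True only if the whole string is in the strict dashed US format."""
    return US_PHONE.fullmatch(text) is not None
```

Because the whole string has to match, a missing dash, a short final group, or trailing digits all cause the candidate to be rejected.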
Email addresses
The email address detector identifies email addresses that follow a standard format, including the @ symbol and a domain with a top-level domain (TLD) identifier. It supports periods, underscores, plus signs, and dashes in the local part of the address, accounts for subdomains, and allows for [DOT] and [AT] to be used in place of . and @.
Note that the confidence threshold level of the email address detector cannot be fine-tuned.
Examples
abc@lakera.ai
abc@lakera [DOT] ai
abc [AT] lakera [DOT] ai
abc@platform.lakera.ai
abc+spam@platform.lakera.ai
Counterexamples
The detector does not identify invalid email addresses or addresses that use unsupported characters, such as emoji in the local part or the domain:
john@google
john@@gmail.com
👋@💌.kz
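The format rules above can be approximated with a normalization step followed by a pattern match. This is only a sketch under stated assumptions: the regular expression, `normalize_obfuscation`, and `looks_like_email` are hypothetical names chosen for illustration and do not describe how Lakera Guard detects email addresses.

```python
import re

# Hypothetical pattern: local part with letters, digits, ".", "_", "+", "-",
# then "@", then one or more dot-separated labels ending in an alphabetic TLD.
EMAIL = re.compile(r"[A-Za-z0-9._+-]+@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}")


def normalize_obfuscation(text: str) -> str:
    """Rewrite "[AT]" and "[DOT]" (with optional surrounding spaces) as "@" and "."."""
    text = re.sub(r"\s*\[AT\]\s*", "@", text)
    return re.sub(r"\s*\[DOT\]\s*", ".", text)


def looks_like_email(text: str) -> bool:
    """Return True if the normalized string is a plain ASCII email address."""
    return EMAIL.fullmatch(normalize_obfuscation(text)) is not None
```

Requiring at least one dot-separated label before the TLD is what rejects `john@google`, and restricting the character classes to ASCII is what rejects emoji addresses.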
IP addresses
The IP address detector identifies IPv4 and IPv6 addresses that follow a standard format with dot separators (.) for IPv4 and colon separators (:) for IPv6 addresses.
Only public, non-multicast IP addresses are detected.
The detector does not report common DNS addresses like 8.8.8.8 (Google’s public DNS) or reserved IP addresses as PII.
Note that the confidence threshold level of the IP address detector cannot be fine-tuned.
Examples
109.202.218.238
2a02:168:6385:0:606d:1692:689f:1049
2.168.0.1
Counterexamples
The detector does not identify invalid IP addresses, addresses that contain typos, private or reserved addresses, or common DNS addresses:
10.920a.218.238
257.168.0.1
10.0.0.1
::1
8.8.8.8
127.0.0.1
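The "public, non-multicast" rule can be illustrated with Python's standard `ipaddress` module. This is a minimal sketch rather than Lakera Guard's implementation: `COMMON_DNS` is a hypothetical one-entry allowlist and `is_reportable_ip` is a name invented for this example.

```python
import ipaddress

# Hypothetical allowlist of well-known public DNS resolvers (illustration only).
COMMON_DNS = {"8.8.8.8"}


def is_reportable_ip(text: str) -> bool:
    """Return True for syntactically valid, public, non-multicast addresses."""
    try:
        addr = ipaddress.ip_address(text)
    except ValueError:
        # Typos and out-of-range octets are rejected by the parser.
        return False
    if text in COMMON_DNS:
        return False
    # is_global excludes private, loopback, and other reserved ranges.
    return addr.is_global and not addr.is_multicast
```

Parsing first and classifying second means malformed strings such as `257.168.0.1` never reach the range checks.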
Credit card numbers
The credit card detector identifies credit card numbers written without separators or grouped with whitespace characters or dashes, in the standard 16-digit format, the American Express (15-digit) format, the 19-digit format, and the Diners Club (14-digit) format. Credit card numbers are validated using the Luhn algorithm to ensure they are valid card numbers before being flagged as PII.
Examples
4242424242424242
5200 8282 8282 8210
3782 822463 10005
3622 720627 1667
4111-1111-1111-1111
Counterexamples
The detector does not identify credit card numbers that use a non-standard format, include punctuation or typos, consist of zeroes only, or are not valid card numbers according to the Luhn algorithm:
411 1 111 1 1111 11 11
411-1---111-1-1111-11-11
41 11 11111111 11 a 11
4111-1111-1111-1112
0000 0000 0000 0000
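The Luhn check mentioned above doubles every second digit counting from the right, subtracts 9 from any result above 9, and requires the total to be divisible by 10. The sketch below combines it with a format check. The groupings in `GROUPINGS` (including 4-4-4-4-3 for 19-digit numbers) and the function names are assumptions made for illustration, not Lakera Guard's actual rules.

```python
import re

# Assumed digit groupings: 16-digit, American Express, Diners Club, 19-digit.
GROUPINGS = [(4, 4, 4, 4), (4, 6, 5), (4, 6, 4), (4, 4, 4, 4, 3)]
GROUPED = re.compile(
    "|".join("[ -]".join("[0-9]{%d}" % n for n in g) for g in GROUPINGS)
)
UNGROUPED = re.compile(r"[0-9]{14}|[0-9]{15}|[0-9]{16}|[0-9]{19}")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, test mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def looks_like_card_number(text: str) -> bool:
    """Check format first, then reject all-zero numbers and Luhn failures."""
    if not (UNGROUPED.fullmatch(text) or GROUPED.fullmatch(text)):
        return False
    digits = re.sub(r"[ -]", "", text)
    if set(digits) == {"0"}:
        return False
    return luhn_valid(digits)
```

The explicit all-zero check is needed because a string of zeroes passes the Luhn checksum on its own.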
IBANs
The International Bank Account Number (IBAN) detector identifies valid IBANs in the standard format with spaces as separators (AA BB BBBB BBBB BBBB BBBB BBBB) or no separators (AABBBBBBBBBBBBBBBBBBBBB).
Note that the confidence threshold level of the IBAN detector cannot be fine-tuned.
Examples
CH 9300762011623852957
CH93 0076 2011 6238 5295 7
DE89 3704 0044 0532 0130 00
FR76 3000 6000 0112 3456 7890 189
IT60 X054 2811 1010 0000 0123 456
ES91 2100 0418 4502 0005 1332
Counterexamples
The detector does not identify IBANs that use a non-standard format, include punctuation or typos, or are invalid:
CH 9300762011623852951
C H 93007620116238529 57
GB29 NWBK 6016 1331 9268 19A
DE89 3704 0044 0532 0130 0
FR76 3000 6000 0112 3456 7890 1891
IT60 X054 2811 1010 0000 0123 4567
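What makes an IBAN "valid" can be illustrated with the standard mod-97 check: move the first four characters to the end, replace each letter with its number (A=10 through Z=35), and test that the resulting integer leaves a remainder of 1 when divided by 97. The sketch below covers only length and checksum. The function name and the partial `IBAN_LENGTHS` table are assumptions for this example, and the separator rules described above are not modelled.

```python
import re

# Partial per-country length table (illustrative; the full registry is much longer).
IBAN_LENGTHS = {"CH": 21, "DE": 22, "ES": 24, "FR": 27, "GB": 22, "IT": 27}


def iban_checksum_ok(iban: str) -> bool:
    """Strip spaces, check the country length if known, then apply mod 97."""
    compact = iban.replace(" ", "").upper()
    if not re.fullmatch(r"[A-Z]{2}[0-9]{2}[A-Z0-9]+", compact):
        return False
    expected = IBAN_LENGTHS.get(compact[:2])
    if expected is not None and len(compact) != expected:
        return False
    # Move country code and check digits to the end, then map letters to 10..35.
    rearranged = compact[4:] + compact[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1
```

A single changed digit always breaks the checksum, and an added or dropped character is caught by the length table for the countries it lists.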
US Social Security numbers
The US Social Security Number (SSN) detector identifies valid SSNs in the standard format with dashes as separators (AAA-GG-SSSS), spaces as separators (AAA GG SSSS), or a combination of the two.
Note that the confidence threshold level of the US social security number detector cannot be fine-tuned.
Examples
778-62-8144
030 72 7381
003 06-8815
003-06 8815
Counterexamples
The detector does not identify SSNs that use a non-standard format, include punctuation or typos, or are invalid:
6-4327-4363
241532634
45.356-5678
64-a27-4363
999-45-6789
666-45-6789
000-62-8144
778-00-8144
778-62-0000
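The counterexamples above are consistent with the usual SSN validity rules: the area number (AAA) cannot be 000, 666, or start with 9, the group number (GG) cannot be 00, and the serial number (SSSS) cannot be 0000. The following is a minimal sketch of those rules, assuming a hypothetical `looks_like_ssn` helper. It is not Lakera Guard's implementation.

```python
import re

# Three groups of digits, each separated by a single dash or a single space.
SSN_PATTERN = re.compile(r"([0-9]{3})[- ]([0-9]{2})[- ]([0-9]{4})")


def looks_like_ssn(text: str) -> bool:
    """Check the AAA-GG-SSSS layout and reject never-issued number ranges."""
    match = SSN_PATTERN.fullmatch(text)
    if match is None:
        return False
    area, group, serial = match.groups()
    if area in ("000", "666") or area.startswith("9"):
        return False
    return group != "00" and serial != "0000"
```

Requiring a separator between the groups is why an unformatted nine-digit run such as `241532634` is not matched.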
Guides
To help you learn more about integrating application data and protecting users’ PII, we’ve created some guides.
Other Resources
If you’re still looking for more:
- Read what our CEO, David Haber, had to say about the EU’s AI Act in Fortune
- Learn more about how LLM training data memorization could put PII in training data at risk
- Learn more about private data leakage in LLMs