Content Moderation
Moderating the content passed to or generated by your Large Language Model (LLM) applications is an important component to securing your GenAI application and protecting your users.
Lakera Guard can be used to ensure that your applications are not generating harmful or embarrassing content, and to flag users who are trying to use your application to produce offensive content or to assist with dangerous or illicit activities.
Our content moderation works best when combined with Lakera Guard’s prompt defense to holistically secure LLM interactions. This defense in depth ensures that the LLM isn’t being manipulated or hacked into outputting moderated content.
Due to the sensitive nature of the contents, we do not document specific examples of what our content moderation flags here.
We are always actively improving our detectors and working to increase accuracy and reduce bias. If you experience any issues or would like to provide feedback, please reach out to support@lakera.ai.
Lakera Guard currently supports content moderation in English. We are working on expanding our language coverage. Please reach out if there’s a language that is a priority for you.
Detectors
Lakera Guard’s content moderation consists of six detectors:
crime
: detects content that mentions criminal activities, including theft, fraud, cyber crime, counterfeiting, violent crimes, and other illegal activities.

hate
: detects harassment and hate speech.

profanity
: detects obscene or vulgar language, such as cursing and offensive profanities.

sexual
: detects sexually explicit or commercial sexual content, including sex education and wellness materials.

violence
: detects content describing acts of violence, physical injury, death, self-harm, or accidents.

weapons
: detects content that mentions weapons or weapon usage, including firearms, knives, and personal weapons.
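As a sketch of how an application might screen text against these six detectors, the snippet below builds a request payload and interprets a response. The payload fields and response shape here are illustrative assumptions for the sketch, not the documented Lakera Guard API; consult the API reference for the exact endpoint and schema.

```python
# Sketch only: the payload structure and response shape below are assumptions,
# not the documented Lakera Guard API.

CATEGORIES = ["crime", "hate", "profanity", "sexual", "violence", "weapons"]

def build_request(text: str) -> dict:
    """Assemble a hypothetical screening payload for the given text."""
    return {"messages": [{"role": "user", "content": text}]}

def flagged_categories(response: dict) -> list:
    """Extract which detectors flagged the content, assuming a response
    of the (hypothetical) form {"results": {"hate": true, ...}}."""
    results = response.get("results", {})
    return [c for c in CATEGORIES if results.get(c)]

# Example with a mocked response (no network call is made here):
mock_response = {"results": {"profanity": True, "hate": False}}
print(flagged_categories(mock_response))  # ['profanity']
```

In practice you would send the payload from `build_request` to the Lakera Guard API with your API key and pass the real response to `flagged_categories`.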
You can also create custom content moderation detectors to flag any trigger words or phrases you wish using policies.
Crime and illicit activity
Text that discusses or mentions illicit activity. This includes activities such as fraud, terrorism, criminal planning, cyber crimes (e.g., phishing, hacking, piracy), extraction of confidential information, illegal drug activity, child exploitation and abuse, human trafficking, counterfeiting, harassment, stalking, blackmail, and violent crimes (e.g., murder).
Hate speech and harassment
Hate speech and harassment includes any content that expresses, incites, or promotes harassing language directed at any target. This includes, but is not limited to, harassment content that involves violence or harm toward any target, and discrimination through the use of slurs.
Hate speech and harassment do not include:
- Non-targeted misogynistic or misandristic content.
- Non-targeted content that is rude or disrespectful.
- Slurs that are non-discriminatory, although these may be flagged as profanities.
Out of Scope Examples
Chess players are all nerds.
This is not flagged as hate speech. People associated with a hobby, sport, or other activities are not protected groups or widely considered vulnerable to hate speech. Describing someone as a nerd also isn’t sufficiently offensive to be classed as harassment.
I really hate offensive language.
Content that does not mention individuals, protected groups or attributes could be considered toxic, but is not flagged as hate speech.
Profanity
Profanity includes text that utilizes cursing, cussing, swearing, strong language, foul language, or expletives. It also encompasses instances where explicit language is used to emphasize a point.
Lakera Guard will also flag profane content that’s been obfuscated, e.g. through techniques like using leet script or intentional typos. This prevents attempts to circumvent exact matching algorithms for offensive words.
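To illustrate why obfuscation defeats exact matching, here is a toy normalization sketch. This is not Lakera Guard's detection logic, and the character map and blocklist are made up for the example; it only shows the class of problem that leet-script substitutions create for naive word lists.

```python
# Toy illustration of why exact string matching misses obfuscated words.
# NOT Lakera Guard's detection logic - the map and blocklist are invented.

LEET_MAP = str.maketrans({"1": "i", "3": "e", "4": "a", "0": "o",
                          "5": "s", "7": "t", "@": "a", "$": "s"})

def normalize(text: str) -> str:
    """Map common leet-script substitutions back to plain letters."""
    return text.lower().translate(LEET_MAP)

blocklist = {"damn"}  # mild stand-in for a real offensive-word list

def exact_match(text: str) -> bool:
    """Naive matching: compare raw tokens against the blocklist."""
    return any(word in blocklist for word in text.lower().split())

print(exact_match("d4mn"))             # False - obfuscation slips past
print(normalize("d4mn") in blocklist)  # True - normalization recovers it
```

A production detector handles far more variation than a single substitution table (intentional typos, spacing tricks, Unicode look-alikes), which is why Lakera Guard's detectors go beyond exact matching.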
Profanity does not include:
- Non-vulgar insults, though these may be detected by the hate detector.
- Sexual terms that are not considered offensive, though these will be flagged by the sexual content detector.
Sexual content
Sexual content refers to any material that describes or depicts sexual organs, acts, behavior, or sexuality. It encompasses erotic and sexually explicit content, as well as educational and sexual wellness material.
References to sexual services are also flagged by this detector. This includes encouraging or advertising services such as sex work, prostitution, escort services, or any other form of commercial sexual activity.
Text containing vulgar language, obscenities, or explicit terms that are sexual in nature will be flagged.
Sexual content does not include:
- Profanity and vulgar words that are not sexual in nature. This is instead covered by the profanity detector.
- Misogyny
Descriptions of violence
Descriptions of violence include any text that mentions the death or injury of a person or animal. This includes, but is not limited to, violent threats, graphic war reports, self-harm and suicide, accounts of physical harm or murder, descriptions of accidents, and brutal scenes from books, movies, and video games.
Descriptions of violence do not include:
- Threats, or hate speech that does not depict acts of violence
Weapons and weapon usage
Any text that mentions any type of weapon or weapon usage. This includes, but is not limited to, firearms and explosives such as guns, bombs, missiles, and ammunition; bladed weapons such as knives; and any other type of personal weapon.
Custom content moderation
Using policies, you can create custom regular-expression-based detectors for content moderation. These can be used to create lists of trigger words, phrases, or strings that will be flagged.
For example, you might want to add a custom detector for the names of your competitors to prevent your chatbot from referring to them. Or you might want to add some extra defense in depth to flag if users or the LLM use any local slang for illegal drugs.
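A trigger list of this kind boils down to a case-insensitive, word-boundary regex match. The sketch below shows the matching behavior such a policy might produce; the competitor names are hypothetical, and how custom detectors are actually configured is described in the policies documentation.

```python
import re

# Sketch of a regex-based trigger list, analogous to what a custom policy
# might configure. The names below are hypothetical placeholders.
competitors = ["AcmeGuard", "ShieldCorp"]

# \b word boundaries avoid flagging substrings inside unrelated words.
pattern = re.compile(
    r"\b(" + "|".join(map(re.escape, competitors)) + r")\b",
    re.IGNORECASE,
)

def is_flagged(text: str) -> bool:
    """Return True if any trigger phrase appears in the text."""
    return pattern.search(text) is not None

print(is_flagged("Have you tried acmeguard instead?"))  # True
print(is_flagged("Our product stands on its own."))     # False
```

Escaping each entry with `re.escape` keeps literal strings safe to use even when they contain regex metacharacters.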
Learn More
Other Resources
- Learn more about detecting undesired content in the real world