Content Moderation
Moderating the content passed to or generated by your Large Language Model (LLM) applications is an important part of securing your GenAI application and protecting your users.
Check Point AI Guardrails can be used to ensure that your applications are not generating harmful or embarrassing content, and to flag when users try to use your application to produce offensive content or to get help with dangerous or illicit activities.
Our content moderation works best when combined with AI Guardrails’ prompt defense to holistically secure LLM interactions. This defense in depth ensures that the LLM isn’t being manipulated or hacked into outputting moderated content.
Because of the sensitive nature of this content, we do not document specific examples of what our content moderation flags here.
We are always actively improving our detectors and working to increase accuracy and reduce bias. If you experience any issues or would like to provide feedback, please reach out to support@lakera.ai.
Threshold levels
Content moderation detectors use the same L1–L4 threshold levels as other AI Guardrails detectors. The threshold you set determines how sensitive the detector is: higher thresholds flag more content (including milder mentions), while lower thresholds only flag the most severe cases.
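The relationship between threshold and sensitivity can be pictured with a small sketch. It is illustrative only: the `Level` enum, the `is_flagged` function, and the idea that each piece of content has a single level at which it first gets flagged are assumptions made for this example, not the AI Guardrails API or its internal scoring.

```python
from enum import IntEnum


class Level(IntEnum):
    """Threshold levels named in the text; the numeric ordering is an assumption."""
    L1 = 1
    L2 = 2
    L3 = 3
    L4 = 4


def is_flagged(first_flagged_at: Level, threshold: Level) -> bool:
    """Return True when the configured threshold is at least as sensitive
    as the (hypothetical) level at which this content starts being flagged."""
    return threshold >= first_flagged_at


if __name__ == "__main__":
    severe = Level.L1  # hypothetical: severe content, flagged at every threshold
    mild = Level.L4    # hypothetical: mild mention, flagged only at the highest threshold
    for threshold in Level:
        print(threshold.name, is_flagged(severe, threshold), is_flagged(mild, threshold))
```

In this sketch, raising the threshold from L1 to L4 never un-flags content; it only widens what is caught, which matches the behavior described above.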
Detectors
AI Guardrails’ content moderation covers seven types of content:
- Crime: content that mentions criminal activities, including theft, fraud, cyber crime, counterfeiting, violent crimes and other illegal activities.
- Hate: harassment and hate speech.
- Profanity: obscene or vulgar language, such as cursing and offensive profanities.
- Sexual: sexually explicit or commercial sexual content, including sex education and wellness materials.
- Violence: content describing acts of violence, physical injury, death, self-harm or accidents.
- Weapons: content that mentions weapons or weapon usage, including firearms, knives, and personal weapons.
- Self Harm: content that relates to self-harm, suicide, or self-destructive behaviors that could put individuals at risk.
You can also create custom content moderation guardrails within AI Guardrails to flag any other content type, or specific trigger words or phrases.
Crime and illicit activity
Text that discusses or mentions illicit activity. This includes activities such as fraud, terrorism, criminal planning, cyber crimes (e.g., phishing, hacking, piracy), extraction of confidential information, illegal drug activity, child exploitation and abuse, human trafficking, counterfeiting, stalking, blackmail, and violent crimes (e.g., murder).
Hate speech and harassment
Hate speech and harassment includes any content that expresses, incites, or promotes harassing language directed at identity groups or their members. This includes, but is not limited to, content that involves violence or harm toward a protected group, and discrimination through the use of slurs.
Hate speech and harassment do not include:
- Non-targeted misogynistic or misandristic content.
- Non-targeted content that is rude or disrespectful.
- Slurs that are non-discriminatory, although these may be flagged as profanities.
Out of Scope Examples
Chess players are all nerds.
This is not flagged as hate speech. People associated with a hobby, sport, or other activities are not protected groups or widely considered vulnerable to hate speech. Describing someone as a nerd also isn’t sufficiently offensive to be classed as harassment.
I really hate offensive language.
Content that does not mention individuals, protected groups or attributes could be considered toxic, but is not flagged as hate speech.
Profanity
Profanity includes text that utilizes cursing, cussing, swearing, strong language, foul language, or expletives. It also encompasses instances where explicit language is used to emphasize a point.
AI Guardrails will also flag profane content that’s been obfuscated, e.g. through techniques like using leet script or intentional typos. This prevents attempts to circumvent exact matching algorithms for offensive words.
Profanity does not include:
- Non-vulgar insults. These may be detected by the hate detector though.
- Sexual terms that are not considered offensive. These will be flagged by the sexual content detector though.
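To illustrate why obfuscated profanity slips past exact matching, here is a minimal sketch that compares an exact word lookup with a lookup after undoing common leet substitutions. This is not how AI Guardrails detects profanity; the substitution table, the function names, and the mild placeholder words in `BLOCKLIST` are all assumptions chosen for the example.

```python
# Hypothetical substitution table; real obfuscation is far more varied.
LEET_MAP = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t", "@": "a", "$": "s"}
)

# Mild placeholder words standing in for a real blocklist.
BLOCKLIST = {"darn", "heck"}


def normalize(token: str) -> str:
    """Strip surrounding punctuation, lowercase, and undo leet substitutions."""
    return token.strip(".,!?").lower().translate(LEET_MAP)


def exact_match(text: str) -> bool:
    """Naive check: flags only words that appear verbatim in the blocklist."""
    return any(word in BLOCKLIST for word in text.lower().split())


def normalized_match(text: str) -> bool:
    """Check each word against the blocklist after normalization."""
    return any(normalize(word) in BLOCKLIST for word in text.split())


if __name__ == "__main__":
    sample = "what the h3ck"
    print(exact_match(sample), normalized_match(sample))
```

The exact lookup misses `h3ck`, while the normalized lookup catches it. A fixed table like this is easy to evade with new spellings, which is one reason a simple word list is not enough on its own.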
Sexual content
Sexual content refers to any material that describes sexual organs, acts, behavior, or sexuality, or that encourages sexual acts or behavior. It encompasses erotic and sexually explicit content, as well as educational and sexual wellness material.
References to sexual services are also flagged by this detector. This includes encouraging or advertising services such as sex work, prostitution, escort services, or any other form of commercial sexual activity.
Text containing vulgar language, obscenities, or explicit terms that are sexual in nature will be flagged.
Sexual content does not include:
- Profanity and vulgar words that are not sexual in nature. This is instead covered by the profanity detector.
- Misogyny
Descriptions of violence
Descriptions of violence include any text that mentions the death or injury of a person or animal. This includes, but is not limited to, violent threats, graphic war reports, self-harm and suicide, accounts of physical harm or murder, descriptions of accidents, and brutal scenes from books, movies, and video games.
Descriptions of violence do not include:
- Threats or hate speech that do not depict acts of violence.
Weapons and weapon usage
Any text that mentions any type of weapon or weapon usage. This includes, but is not limited to, firearms and explosives such as guns, bombs, missiles, and ammunition, bladed weapons such as knives, and any type of personal weapon.
Self harm
Any text that mentions, encourages, advocates for, or details self-harm, suicide, or other dangerous activities or self-destructive behaviors that could put individuals at risk.
Custom content moderation
Using policies, you can create custom regular-expression-based detectors for content moderation. These can be used to create lists of trigger words, phrases, or strings that will be flagged.
For example, you might want to add a custom detector for the names of your competitors to prevent your chatbot from referring to them. Or you might want to add some extra defense in depth to flag if users or the LLM use any local slang for illegal drugs.
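Before adding a trigger list to a policy, it can help to prototype the regular expression locally. The sketch below uses Python's standard `re` module and assumes nothing about the AI Guardrails policy format; the helper names and the competitor names `Acme Corp` and `Globex` are hypothetical.

```python
import re
from typing import Iterable


def build_trigger_pattern(terms: Iterable[str]) -> re.Pattern[str]:
    """Compile one case-insensitive pattern that matches any term as a whole word or phrase."""
    # Escape each term so punctuation is matched literally; try longer terms first.
    escaped = sorted((re.escape(term) for term in terms), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def find_triggers(text: str, pattern: re.Pattern[str]) -> list[str]:
    """Return every trigger term found in the text, in order of appearance."""
    return [match.group(0) for match in pattern.finditer(text)]


# Hypothetical competitor names, used purely for illustration.
COMPETITORS = build_trigger_pattern(["Acme Corp", "Globex"])

if __name__ == "__main__":
    print(find_triggers("How do you compare to globex or ACME corp?", COMPETITORS))
```

The word boundaries (`\b`) keep the pattern from firing inside longer words, and escaping the terms means a name containing characters such as `.` or `+` is matched literally rather than interpreted as regex syntax.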
Learn More
Other Resources
- Learn more about detecting undesired content in the real world