Content Moderation | Lakera API documentation

Moderating the content passed to or generated by your Large Language Model (LLM) applications is an important part of securing your GenAI application and protecting your users.

Lakera Guard can be used to ensure that your applications are not generating harmful or embarrassing content, and to flag when users try to use your application to produce offensive content or to get help with dangerous or illicit activities.

Our content moderation works best when combined with Lakera Guard’s prompt defense to holistically secure LLM interactions. This defense in depth ensures that the LLM isn’t being manipulated or hacked into outputting moderated content.

Because of the sensitive nature of this content, we do not document specific examples of what our content moderation flags.

We are continually improving our detectors, working to increase accuracy and reduce bias. If you experience any issues or would like to provide feedback, please reach out to support@lakera.ai.

Lakera Guard currently supports content moderation in English. We are working on expanding our language coverage. Please reach out if there’s a language that is a priority for you.

Detectors

Lakera Guard’s content moderation covers six types of content:

Crime: content that mentions criminal activities, including theft, fraud, cyber crime, counterfeiting, violent crimes and other illegal activities.
Hate: harassment and hate speech.
Profanity: obscene or vulgar language, such as cursing and offensive profanities.
Sexual: sexually explicit content and references to commercial sexual services, as well as sex education and sexual wellness material.
Violence: content describing acts of violence, physical injury, death, self-harm or accidents.
Weapons: content that mentions weapons or weapon usage, including firearms, knives, and personal weapons.

You can also create custom content moderation guardrails within Guard to flag any other content type, or specific trigger words or phrases.

Crime and illicit activity

Text that discusses or mentions illicit activity. This includes activities such as fraud, terrorism, criminal planning, cyber crimes (e.g., phishing, hacking, piracy), extraction of confidential information, illegal drug activity, child exploitation and abuse, human trafficking, counterfeiting, harassment, stalking, blackmail, and violent crimes (e.g., murder).

Hate speech and harassment

Hate speech and harassment includes any content that expresses, incites, or promotes harassing language directed at any target. This includes, but is not limited to, harassment content that involves violence or harm toward any target, and discrimination through the use of slurs.

Hate speech and harassment does not include:

Non-targeted misogynistic or misandristic content.
Non-targeted content that is rude or disrespectful.
Slurs that are non-discriminatory, although these may be flagged as profanities.

Out of Scope Examples

Chess players are all nerds.

This is not flagged as hate speech. People associated with a hobby, sport, or other activities are not protected groups or widely considered vulnerable to hate speech. Describing someone as a nerd also isn’t sufficiently offensive to be classed as harassment.

I really hate offensive language.

Content that does not mention individuals, protected groups or attributes could be considered toxic, but is not flagged as hate speech.

Profanity

Profanity includes text that utilizes cursing, cussing, swearing, strong language, foul language, or expletives. It also encompasses instances where explicit language is used to emphasize a point.

Lakera Guard will also flag profane content that has been obfuscated, for example with leetspeak or intentional typos. This prevents attempts to circumvent exact-match filters for offensive words.
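
To illustrate why exact matching alone is easy to circumvent, here is a minimal sketch of normalizing leetspeak before checking a word list. It is not how Lakera Guard detects profanity. The substitution table, the placeholder word list, and the function names are all assumptions made for this example, and the sketch handles only simple character swaps and stretched letters, not typos in general.

```python
import re

# Hypothetical substitution table for common leetspeak swaps.
# This is an illustrative assumption, not Lakera Guard's mapping.
LEET_MAP = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s",
})

# Mild placeholder words standing in for a real block list.
BLOCKED_WORDS = {"darn", "heck"}


def normalize(text: str) -> str:
    """Lowercase, undo common substitutions, collapse stretched letters."""
    text = text.lower().translate(LEET_MAP)
    # Collapse runs of three or more identical characters ("heeeck" -> "heck").
    return re.sub(r"(.)\1{2,}", r"\1", text)


def contains_blocked_word(text: str) -> bool:
    """Return True if any normalized token is on the block list."""
    tokens = re.findall(r"[a-z]+", normalize(text))
    return any(token in BLOCKED_WORDS for token in tokens)


def exact_match(text: str) -> bool:
    """Naive baseline: match tokens without any normalization."""
    tokens = re.findall(r"[a-z0-9@$]+", text.lower())
    return any(token in BLOCKED_WORDS for token in tokens)
```

With this sketch, `exact_match("what the h3ck")` misses the obfuscated word, while `contains_blocked_word("what the h3ck")` catches it after normalization.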

Profanity does not include:

Non-vulgar insults. These may still be detected by the hate detector.
Sexual terms that are not considered offensive. These are flagged by the sexual content detector instead.

Sexual content

Sexual content refers to any material that describes sexual organs or sexuality, or that describes or encourages sexual acts or behavior. It encompasses erotic and sexually explicit content, as well as educational and sexual wellness material.

References to sexual services are also flagged by this detector. This includes encouraging or advertising services such as sex work, prostitution, escort services, or any other form of commercial sexual activity.

Text containing vulgar language, obscenities, or explicit terms that are sexual in nature will be flagged.

Sexual content does not include:

Profanity and vulgar words that are not sexual in nature. This is instead covered by the profanity detector.
Misogyny.

Descriptions of violence

Descriptions of violence include any text that mentions the death or injury of a person or animal. This includes, but is not limited to, violent threats, graphic war reports, self-harm and suicide, accounts of physical harm or murder, descriptions of accidents, and brutal scenes from books, movies, and video games.

Descriptions of violence do not include:

Threats or hate speech that do not depict acts of violence.

Weapons and weapon usage

Any text that mentions any type of weapon or weapon usage. This includes, but is not limited to, firearms and munitions such as guns, bombs, missiles, and ammunition, bladed weapons such as knives, and any type of personal weapon.

Custom content moderation

Using policies, you can create custom regular-expression-based detectors for content moderation. These can be used to create lists of trigger words, phrases, or strings that will be flagged.

For example, you might want to add a custom detector for the names of your competitors to prevent your chatbot from referring to them. Or you might want to add some extra defense in depth by flagging any local slang for illegal drugs used by your users or the LLM.
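
Before adding patterns to a policy, it can help to prototype them locally. The following is a minimal sketch using Python's standard `re` module. It does not show how patterns are registered in Lakera Guard, which this page does not cover. The detector names, the example patterns, and the `flag` helper are hypothetical placeholders.

```python
import re

# Hypothetical trigger lists. Replace the placeholders with your own
# competitor names or slang terms.
TRIGGER_PATTERNS = {
    "competitor": [r"acme\s+corp", r"globex"],
    "drug_slang": [r"example-slang-term"],
}


def compile_detectors(patterns):
    """Build one case-insensitive, whole-word regex per detector name."""
    compiled = {}
    for name, alternatives in patterns.items():
        joined = "|".join(f"(?:{p})" for p in alternatives)
        # \b keeps "globex" from matching inside a longer word.
        compiled[name] = re.compile(rf"\b(?:{joined})\b", re.IGNORECASE)
    return compiled


DETECTORS = compile_detectors(TRIGGER_PATTERNS)


def flag(text):
    """Return the sorted names of all detectors that match the text."""
    return sorted(name for name, rx in DETECTORS.items() if rx.search(text))
```

Calling `flag("Have you tried Acme Corp instead?")` returns the `competitor` detector, while text with no trigger terms returns an empty list. Word boundaries and case-insensitive matching are design choices worth testing against your own content, since overly broad patterns produce false positives.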

Learn More

Datasets

Find datasets to evaluate Lakera Guard with content moderation use cases

Other Resources

Learn more about detecting undesired content in the real world