Red teaming

red-teaming

Also called adversarial testing/ red team

Red teaming is structured adversarial testing: people (or other AI systems) try hard to make a model misbehave. The findings inform safety training, prompt engineering, and deployment decisions. Every major lab does internal red-teaming before launch and many run external red-team programs too.

Photo: Pedro Dias / Pexels

Red teaming originated in cybersecurity (where the "red team" attacks the "blue team" defenses) and got adapted for AI systems around 2022. The goal is to find the model's failure modes before the public does. A red-teamer might try every angle: jailbreaks, role-play exploits, indirect prompt injection, manipulation through emotional framing, exploiting niche knowledge gaps, encoded payloads.

Internal red teams at the labs are full-time operations. Anthropic, OpenAI, Google DeepMind, and Meta all run them. Findings go directly into the safety training pipeline: a successful jailbreak today becomes a refusal target in the next training round.

External red team programs supplement the internal work. OpenAI ran a public red-teamer network for GPT-4. Anthropic has invited researchers to probe Claude. The DEF CON Generative Red Team event in 2023 brought thousands of testers to attack frontier models in coordinated waves; it produced a public dataset of attack patterns that fed into the field's evaluation infrastructure.

What red teaming covers:

Safety violations: getting the model to produce harmful content (weapons, fraud, etc.)
Manipulation resistance: getting the model to ignore prior instructions, leak system prompts, or impersonate other entities
Hallucination under pressure: pushing the model to confidently invent facts
Bias and fairness: surfacing systematic errors across demographic groups
Privacy leakage: getting the model to recall and emit training data verbatim
Tool misuse: in agentic systems, getting the model to call tools in dangerous combinations

The output is a report, often public for major launches, listing tested categories and pass rates. Anthropic's "model card" and OpenAI's "system card" both include red-team summaries.

For a smaller AI product, red teaming is what you do before wider release: run a structured set of adversarial prompts, document the failures, fix or document them, ship.

Related concepts

Want the rest?

There are 40 terms total.

See the full glossary