Red teaming
red-teaming
Red teaming is structured adversarial testing: people (or other AI systems) try hard to make a model misbehave. The findings inform safety training, prompt engineering, and deployment decisions. Every major lab does internal red-teaming before launch and many run external red-team programs too.
Red teaming originated in cybersecurity (where the "red team" attacks the "blue team" defenses) and got adapted for AI systems around 2022. The goal is to find the model's failure modes before the public does. A red-teamer might try every angle: jailbreaks, role-play exploits, indirect prompt injection, manipulation through emotional framing, exploiting niche knowledge gaps, encoded payloads.
Internal red teams at the labs are full-time operations. Anthropic, OpenAI, Google DeepMind, and Meta all run them. Findings go directly into the safety training pipeline: a successful jailbreak today becomes a refusal target in the next training round.
External red team programs supplement the internal work. OpenAI ran a public red-teamer network for GPT-4. Anthropic has invited researchers to probe Claude. The DEF CON Generative Red Team event in 2023 brought thousands of testers to attack frontier models in coordinated waves; it produced a public dataset of attack patterns that fed into the field's evaluation infrastructure.
What red teaming covers:
- Safety violations: getting the model to produce harmful content (weapons, fraud, etc.)
- Manipulation resistance: getting the model to ignore prior instructions, leak system prompts, or impersonate other entities
- Hallucination under pressure: pushing the model to confidently invent facts
- Bias and fairness: surfacing systematic errors across demographic groups
- Privacy leakage: getting the model to recall and emit training data verbatim
- Tool misuse: in agentic systems, getting the model to call tools in dangerous combinations
The output is a report, often public for major launches, listing tested categories and pass rates. Anthropic's "model card" and OpenAI's "system card" both include red-team summaries.
For a smaller AI product, red teaming is what you do before wider release: run a structured set of adversarial prompts, document the failures, fix or document them, ship.
Related concepts