AI Red Teaming
The structured practice of adversarially testing AI systems — probing for jailbreaks, prompt injection, harmful or biased outputs, privacy leakage, and unsafe tool use — to discover failures before attackers or real users do. It combines human creativity with automated attack generation and is encouraged or required for high-risk systems by frameworks such as the NIST AI Risk Management Framework and the EU AI Act. AI red teaming is covered in AAISM, AICP, and AIGP governance domains.
Why It Matters
In practice, AI red teaming is critical because AI systems fail in ways traditional security testing misses — through natural-language manipulation, emergent behaviors, bias, and unsafe autonomous actions rather than classic code vulnerabilities. Major AI labs and regulators now treat red teaming as a release gate for frontier and high-risk models. Effective programs blend manual adversarial prompting, automated and transferable attack suites, and scenario-based testing of the full application (tools, RAG sources, and guardrails), then feed findings back into alignment, filtering, and monitoring. On exams such as AAISM, AICP, and AIGP, expect questions on when red teaming is mandated, what it covers beyond traditional pen testing, and how findings map to the AI risk-management lifecycle.
Related AI Security terms
Prompt Injection
An attack against large language model (LLM) applications in which crafted input manipulates the model into ignoring its original instructions or system prompt and performing attacker-controlled actions. Direct prompt injection embeds malicious instructions in user input ("ignore previous instructions and..."), while indirect prompt injection hides instructions in external content the model ingests (web pages, documents, emails) during retrieval or tool use. It ranks as the #1 risk in the OWASP Top 10 for LLM Applications. Prompt injection is a core topic in AI security and governance certifications such as AIGP, AICP, and AAISM.
Jailbreaking (LLM)
Techniques that bypass an AI model's safety guardrails and content policies to elicit prohibited outputs such as instructions for weapons, malware, or disallowed content. Common methods include role-play framing ("act as an unrestricted assistant"), obfuscation and encoding, many-shot priming, and adversarial suffixes discovered through optimization. Jailbreaking differs from prompt injection: jailbreaking targets the model's safety alignment, whereas prompt injection hijacks an application's surrounding instructions. It is central to red-teaming generative AI and appears in AICP, AIGP, and AAISM study domains.
Adversarial Examples
Inputs deliberately perturbed with small, often human-imperceptible changes that cause a machine learning model to misclassify them — for example altering a few pixels so an image classifier reads a stop sign as a speed-limit sign, or crafting audio that a voice assistant transcribes as a hidden command. Adversarial machine learning is the broader field studying such evasion attacks alongside poisoning and extraction across the ML lifecycle. NIST formalizes the taxonomy in NIST AI 100-2. Covered in AAISM, AICP, and AIGP.
Data Poisoning
An attack in which adversaries inject malicious or mislabeled data into a model's training set to degrade performance, cause targeted misclassifications, or implant a backdoor that activates on a specific trigger. Poisoning can target the pre-training corpus, fine-tuning data, or a retrieval (RAG) knowledge base. Because modern models train on large, often web-scraped datasets, even a small fraction of poisoned samples can have outsized effects. It appears in the OWASP Top 10 for LLM Applications and NIST AI 100-2, and is tested in AAISM, AICP, and AIGP.
Model Inversion Attack
A privacy attack that reconstructs sensitive training data, or attributes of it, by repeatedly querying a model and analyzing its outputs — for example recovering recognizable face images from a facial-recognition model or inferring private attributes of individuals in the training set. Model inversion undermines the confidentiality of the data a model was trained on and can breach privacy regulations such as GDPR. It is a key privacy risk in AI governance and is covered in AICP, AIGP, and AAISM.
Membership Inference Attack
A privacy attack that determines whether a specific data record was part of a model's training set, by exploiting differences in the model's confidence or behavior on data it has seen versus unseen data. It can reveal, for instance, that a particular person's medical record was used to train a model — a confidentiality breach in its own right. Membership inference is closely related to model inversion and is relevant to AICP, AIGP, and privacy-focused AI governance.