Jailbreaking (LLM)
Techniques that bypass an AI model's safety guardrails and content policies to elicit prohibited outputs, such as weapons instructions, malware, or other disallowed content. Common methods include role-play framing ("act as an unrestricted assistant"), obfuscation and encoding, many-shot priming, and adversarial suffixes discovered through optimization. Jailbreaking differs from prompt injection: jailbreaking targets the model's safety alignment, whereas prompt injection hijacks an application's surrounding instructions. Jailbreaking is central to red-teaming generative AI and appears in AICP, AIGP, and AAISM study domains.
Why It Matters
Safety alignment is probabilistic rather than a hard control, so new bypasses are discovered continually and are often published before vendors can patch them. Organizations exposing chatbots to the public face brand, legal, and safety harm when models are coerced into producing toxic, defamatory, or dangerous content. Automated jailbreak generators and transferable adversarial suffixes mean an attack crafted against one model often works against others. Defense is therefore continuous and layered: ongoing red-teaming, refusal training, output moderation classifiers, and monitoring for anomalous prompt patterns, rather than reliance on the base model's alignment alone. On exams such as AICP and AAISM, expect questions contrasting jailbreaking with prompt injection and selecting defense-in-depth controls.
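The layered-defense idea can be sketched in a few lines of Python. This is a minimal illustration, not a production control: the regex patterns, the Verdict type, and the function names are hypothetical, and the model and the output classifier are passed in as plain callables so that no particular vendor API is assumed.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical patterns, for illustration only. A production input filter
# would normally be a trained classifier, not a short regex list.
SUSPICIOUS_PROMPT_PATTERNS = [
    re.compile(r"ignore (all |any |the )?(previous|prior) instructions", re.I),
    re.compile(r"act as an? unrestricted", re.I),
]


@dataclass
class Verdict:
    allowed: bool
    stage: str      # which layer made the decision
    text: str = ""  # model reply, only populated when allowed


def prompt_is_suspicious(prompt: str) -> bool:
    """Layer 1: cheap screening of the incoming prompt."""
    return any(p.search(prompt) for p in SUSPICIOUS_PROMPT_PATTERNS)


def guarded_generate(
    prompt: str,
    model: Callable[[str], str],
    output_is_unsafe: Callable[[str], bool],
) -> Verdict:
    """Run the model behind an input filter and an output moderation check."""
    if prompt_is_suspicious(prompt):
        return Verdict(False, "input_filter")
    reply = model(prompt)
    # Layer 2: moderate the reply, because the input filter will miss novel bypasses.
    if output_is_unsafe(reply):
        return Verdict(False, "output_moderation")
    return Verdict(True, "ok", reply)
```

The point of the second layer is that a prompt the input filter has never seen can still be stopped when its output is classified as unsafe, which is why the controls are stacked rather than chosen between.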
Related AI Security terms
Prompt Injection
An attack against large language model (LLM) applications in which crafted input manipulates the model into ignoring its original instructions or system prompt and performing attacker-controlled actions. Direct prompt injection embeds malicious instructions in user input ("ignore previous instructions and..."), while indirect prompt injection hides instructions in external content the model ingests (web pages, documents, emails) during retrieval or tool use. It ranks as the #1 risk in the OWASP Top 10 for LLM Applications. Prompt injection is a core topic in AI security and governance certifications such as AIGP, AICP, and AAISM.
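The direct/indirect distinction comes down to which channel carries the attacker's instructions. The sketch below makes that explicit by scanning the user input and the retrieved documents separately. The regex and the function name are hypothetical, and pattern matching of this kind is only a coarse heuristic that real attacks readily evade.

```python
import re
from typing import Sequence

# Hypothetical heuristic: an "ignore/disregard ... instructions" style phrase.
INSTRUCTION_LIKE = re.compile(
    r"\b(ignore|disregard)\b.{0,40}\b(instructions|system prompt)\b", re.I
)


def classify_injection(user_input: str, retrieved_docs: Sequence[str]) -> str:
    """Report which channel, if any, carries instruction-like text."""
    if INSTRUCTION_LIKE.search(user_input):
        return "direct"    # the attacker typed it into the chat
    if any(INSTRUCTION_LIKE.search(doc) for doc in retrieved_docs):
        return "indirect"  # it arrived inside ingested external content
    return "none"
```

Treating retrieved content as untrusted data rather than as instructions is the design choice this separation is meant to encourage.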
Adversarial Examples
Inputs deliberately perturbed with small, often human-imperceptible changes that cause a machine learning model to misclassify them — for example altering a few pixels so an image classifier reads a stop sign as a speed-limit sign, or crafting audio that a voice assistant transcribes as a hidden command. Adversarial machine learning is the broader field studying such evasion attacks alongside poisoning and extraction across the ML lifecycle. NIST formalizes the taxonomy in NIST AI 100-2. Covered in AAISM, AICP, and AIGP.
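One well-known way to build such a perturbation is the fast gradient sign method (FGSM), which moves each input feature by a small step eps in the direction that increases the model's loss. The sketch below applies it to a toy logistic-regression classifier. The weights, input, and eps are made-up illustrative values, not data from any real system.

```python
import math


def predict(w, b, x):
    """Probability of class 1 under logistic regression."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(w, b, x, y, eps):
    """Return x + eps * sign(gradient of the cross-entropy loss w.r.t. x)."""
    p = predict(w, b, x)
    # For logistic regression the input gradient of the loss is (p - y) * w.
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


# Toy values, assumed for illustration.
w, b = [2.0, -3.0, 1.0], 0.0
x, y = [0.3, 0.1, 0.2], 1
x_adv = fgsm(w, b, x, y, eps=0.2)
```

Here the clean input is scored above 0.5 for class 1, while the perturbed input, which differs by at most eps in each feature, is scored below 0.5. On images the same bounded-step idea is what keeps the change hard for a human to notice.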
Data Poisoning
An attack in which adversaries inject malicious or mislabeled data into a model's training set to degrade performance, cause targeted misclassifications, or implant a backdoor that activates on a specific trigger. Poisoning can target the pre-training corpus, fine-tuning data, or a retrieval (RAG) knowledge base. Because modern models train on large, often web-scraped datasets, even a small fraction of poisoned samples can have outsized effects. It appears in the OWASP Top 10 for LLM Applications and NIST AI 100-2, and is tested in AAISM, AICP, and AIGP.
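A toy example shows how a single mislabeled sample can move a decision boundary. The classifier here is a deliberately simple one-dimensional nearest-mean rule, and the data points are invented for illustration. Real poisoning attacks target far larger models and datasets.

```python
def fit_threshold(samples):
    """Fit a 1-D nearest-mean classifier: the midpoint between class means.

    samples is a list of (x, label) pairs with labels 0 and 1; the sketch
    assumes class 0 has the lower mean.
    """
    xs0 = [x for x, label in samples if label == 0]
    xs1 = [x for x, label in samples if label == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2


def predict(threshold, x):
    return 1 if x > threshold else 0


# Invented training data: class 0 clusters near 2, class 1 near 8.
clean = [(1.0, 0), (2.0, 0), (3.0, 0), (7.0, 1), (8.0, 1), (9.0, 1)]
# One injected outlier deliberately mislabeled as class 0.
poisoned = clean + [(18.0, 0)]

clean_threshold = fit_threshold(clean)        # 5.0
poisoned_threshold = fit_threshold(poisoned)  # 7.0
```

With the clean threshold an input of 6.5 is assigned to class 1, but after the one poisoned sample shifts the threshold it is assigned to class 0.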
Model Inversion Attack
A privacy attack that reconstructs sensitive training data, or attributes of it, by repeatedly querying a model and analyzing its outputs — for example recovering recognizable face images from a facial-recognition model or inferring private attributes of individuals in the training set. Model inversion undermines the confidentiality of the data a model was trained on and can breach privacy regulations such as GDPR. It is a key privacy risk in AI governance and is covered in AICP, AIGP, and AAISM.
Membership Inference Attack
A privacy attack that determines whether a specific data record was part of a model's training set, by exploiting differences in the model's confidence or behavior on data it has seen versus unseen data. It can reveal, for instance, that a particular person's medical record was used to train a model — a confidentiality breach in its own right. Membership inference is closely related to model inversion and is relevant to AICP, AIGP, and privacy-focused AI governance.
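The simplest form of this attack is a confidence threshold: records on which the model is unusually confident are guessed to be training members. The sketch below assumes the attacker can observe the model's top confidence score per record. The threshold and the confidence values are hypothetical, chosen only to show the mechanics.

```python
def infer_membership(confidence, threshold=0.9):
    """Guess 'member' when the model is at least `threshold` confident."""
    return confidence >= threshold


def attack_accuracy(member_conf, nonmember_conf, threshold=0.9):
    """Fraction of records whose membership the threshold rule guesses correctly."""
    correct = sum(infer_membership(c, threshold) for c in member_conf)
    correct += sum(not infer_membership(c, threshold) for c in nonmember_conf)
    return correct / (len(member_conf) + len(nonmember_conf))


# Hypothetical confidences: models tend to be more confident on data they trained on.
members = [0.99, 0.97, 0.95, 0.80]
nonmembers = [0.85, 0.70, 0.92, 0.60]
accuracy = attack_accuracy(members, nonmembers)  # 0.75
```

An accuracy meaningfully above 0.5 on balanced data indicates the model leaks membership information. Defenses such as regularization or differentially private training aim to narrow the confidence gap this rule exploits.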
Model Extraction
An attack, also called model stealing, in which an adversary queries a deployed model (often a paid API) enough times to train a substitute model that replicates its functionality, stealing intellectual property and enabling cheaper offline experimentation and adversarial attacks. Extraction undermines the confidentiality and commercial value of proprietary models and is listed among LLM supply-chain and theft risks. Covered in AAISM and AICP.
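For a linear model the substitute can be recovered exactly with very few queries, which makes it a convenient illustration. The sketch below assumes a black-box victim that returns a raw score w·x + b. The function names and parameter values are hypothetical, and real extraction against nonlinear models instead trains a surrogate on many query/response pairs.

```python
def make_victim(weights, bias):
    """Build a black-box scoring function; the attacker only ever calls query()."""
    def query(x):
        return sum(w * xi for w, xi in zip(weights, x)) + bias
    return query


def extract_linear(query, dim):
    """Recover a linear model's parameters with dim + 1 queries."""
    bias = query([0.0] * dim)  # the all-zero input reveals the bias
    weights = []
    for i in range(dim):
        basis = [0.0] * dim
        basis[i] = 1.0         # each unit vector reveals one weight
        weights.append(query(basis) - bias)
    return weights, bias


# Hypothetical proprietary parameters hidden behind the API.
victim = make_victim([2.0, -1.0, 0.5], 0.25)
stolen_weights, stolen_bias = extract_linear(victim, dim=3)
```

Rate limiting, query monitoring, and returning labels instead of raw scores all raise the cost of this kind of attack, although none of them eliminates it.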