AI Security

AI Hallucination

Fluent, confident output from a generative AI model that is factually incorrect, fabricated, or unsupported by its sources, such as invented citations, statistics, software packages, or legal cases. Hallucination is an inherent property of probabilistic language models rather than a simple bug, and it becomes a security and safety risk when outputs are trusted and acted upon without verification. Mitigations include retrieval-augmented generation (RAG), grounding with citations, and human review. It is a core reliability and safety topic in AIGP, AICP, and AAISM.

Why It Matters

Hallucination matters in practice because users over-trust fluent answers, which leads to real harm: fabricated legal citations that have drawn court sanctions for the lawyers who filed them, incorrect medical or financial guidance, and insecure code. A notable security twist is "package hallucination" (slopsquatting), where models invent plausible but nonexistent software dependencies and attackers then register those names and publish malware under them, turning a reliability flaw into a supply-chain attack. Organizations that embed LLMs into workflows without verification face liability and operational risk. Mitigations include retrieval-augmented generation (RAG) with source citations, output validation against authoritative data, constraining models to retrieved context, confidence/uncertainty signaling, and mandatory human review for high-stakes decisions. On exams such as AIGP and AICP, expect questions on overreliance, grounding techniques, and human oversight requirements.
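
One way to picture "output validation against authoritative data" for the package-hallucination case is a check that flags any model-suggested dependency not found on a vetted list. The sketch below is illustrative only: the allowlist contents, the function names, and the package name `fastjson-utils-pro` are hypothetical, and a real pipeline would more likely consult an internal registry or lockfile than a hard-coded set.

```python
import re

# Hypothetical allowlist; in practice this would come from a vetted
# internal registry or lockfile, not a hard-coded set.
KNOWN_PACKAGES = {"requests", "numpy", "flask"}


def normalize(name):
    """Lowercase a package name and collapse runs of -, _ and . into one dash."""
    return re.sub(r"[-_.]+", "-", name).lower()


def requirement_name(line):
    """Extract the package name from a requirements-style line, or None."""
    line = line.split("#", 1)[0].strip()  # drop trailing comments
    if not line:
        return None
    match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
    return normalize(match.group(0)) if match else None


def unverified_packages(requirements, allowlist):
    """Return the names that do not appear on the allowlist."""
    allowed = {normalize(name) for name in allowlist}
    names = (requirement_name(line) for line in requirements)
    return [name for name in names if name and name not in allowed]


# Dependencies as a model might suggest them; the last name is made up.
suggested = [
    "requests>=2.31",
    "numpy",
    "fastjson-utils-pro==1.0  # suggested by the model",
]
flagged = unverified_packages(suggested, KNOWN_PACKAGES)
print(flagged)
```

Anything in `flagged` would go to human review before installation rather than being installed automatically.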

Related AI Security terms

Prompt Injection

An attack against large language model (LLM) applications in which crafted input manipulates the model into ignoring its original instructions or system prompt and performing attacker-controlled actions. Direct prompt injection embeds malicious instructions in user input ("ignore previous instructions and..."), while indirect prompt injection hides instructions in external content the model ingests (web pages, documents, emails) during retrieval or tool use. It ranks as the #1 risk in the OWASP Top 10 for LLM Applications. Prompt injection is a core topic in AI security and governance certifications such as AIGP, AICP, and AAISM.

Jailbreaking (LLM)

Techniques that bypass an AI model's safety guardrails and content policies to elicit prohibited outputs such as instructions for weapons, malware, or disallowed content. Common methods include role-play framing ("act as an unrestricted assistant"), obfuscation and encoding, many-shot priming, and adversarial suffixes discovered through optimization. Jailbreaking differs from prompt injection: jailbreaking targets the model's safety alignment, whereas prompt injection hijacks an application's surrounding instructions. It is central to red-teaming generative AI and appears in AICP, AIGP, and AAISM study domains.

Adversarial Examples

Inputs deliberately perturbed with small, often human-imperceptible changes that cause a machine learning model to misclassify them — for example altering a few pixels so an image classifier reads a stop sign as a speed-limit sign, or crafting audio that a voice assistant transcribes as a hidden command. Adversarial machine learning is the broader field studying such evasion attacks alongside poisoning and extraction across the ML lifecycle. NIST formalizes the taxonomy in NIST AI 100-2. Covered in AAISM, AICP, and AIGP.
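
As a rough illustration of an evasion attack, the sketch below perturbs the input to a tiny linear classifier by a bounded amount per feature, stepping against the sign of the gradient (the idea behind the fast gradient sign method). The weights, input values, and function names are invented for the example; real attacks target far larger models and compute gradients by backpropagation.

```python
def score(weights, bias, x):
    """Linear decision score; a positive score means class 1."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def predict(weights, bias, x):
    return 1 if score(weights, bias, x) > 0 else 0


def sign(value):
    return (value > 0) - (value < 0)


def perturb(weights, x, label, eps):
    """Shift every feature by at most eps in the direction that hurts `label`.

    For a linear model the gradient of the score with respect to the input
    is simply the weight vector, so its sign gives the step direction.
    """
    direction = -1 if label == 1 else 1
    return [xi + direction * eps * sign(w) for xi, w in zip(x, weights)]


# Toy model and input (illustrative values only).
weights = [0.5, -0.25, 0.75, -0.5]
bias = 0.0
x = [0.5, 0.25, 0.5, 0.25]

original_label = predict(weights, bias, x)
x_adv = perturb(weights, x, original_label, eps=0.25)
adversarial_label = predict(weights, bias, x_adv)
print(original_label, adversarial_label)
```

No feature moves by more than `eps`, yet the predicted class changes, which is the defining property of an adversarial example.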

Data Poisoning

An attack in which adversaries inject malicious or mislabeled data into a model's training set to degrade performance, cause targeted misclassifications, or implant a backdoor that activates on a specific trigger. Poisoning can target the pre-training corpus, fine-tuning data, or a retrieval (RAG) knowledge base. Because modern models train on large, often web-scraped datasets, even a small fraction of poisoned samples can have outsized effects. It appears in the OWASP Top 10 for LLM Applications and NIST AI 100-2, and is tested in AAISM, AICP, and AIGP.
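
The effect of mislabeled training samples can be sketched with a one-dimensional nearest-centroid classifier. Everything here is a toy assumption: the data points, the classifier, and the function names are invented, and at this scale the poisoned fraction is much larger than it would be in a realistic attack on a web-scale corpus.

```python
def centroids(samples):
    """Compute the mean feature value per label from (value, label) pairs."""
    sums, counts = {}, {}
    for value, label in samples:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def classify(model, value):
    """Assign the label whose centroid is closest to the value."""
    return min(model, key=lambda label: abs(value - model[label]))


# Clean training set: class 0 clusters near 2, class 1 near 8.
clean = [(1.0, 0), (2.0, 0), (3.0, 0), (7.0, 1), (8.0, 1), (9.0, 1)]

# The attacker injects points that look like class 0 but carry label 1,
# dragging the class 1 centroid toward the target input.
poison = [(1.0, 1), (1.0, 1)]

target = 4.5
clean_prediction = classify(centroids(clean), target)
poisoned_prediction = classify(centroids(clean + poison), target)
print(clean_prediction, poisoned_prediction)
```

The target input is classified as class 0 by the clean model and as class 1 after poisoning, even though the target itself was never modified.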

Model Inversion Attack

A privacy attack that reconstructs sensitive training data, or attributes of it, by repeatedly querying a model and analyzing its outputs — for example recovering recognizable face images from a facial-recognition model or inferring private attributes of individuals in the training set. Model inversion undermines the confidentiality of the data a model was trained on and can breach privacy regulations such as GDPR. It is a key privacy risk in AI governance and is covered in AICP, AIGP, and AAISM.
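
The "repeatedly querying a model and analyzing its outputs" step can be sketched with a toy model whose confidence rises as a query gets closer to a memorized training record. This is a deliberately simplified assumption: the confidence function, the secret record, and the coordinate-by-coordinate search are all invented for illustration, and real inversion attacks typically rely on gradient-based optimization against much noisier signals.

```python
def make_model(secret_record):
    """Return a query function that leaks closeness to a memorized record."""
    def confidence(x):
        distance = sum((a - b) ** 2 for a, b in zip(x, secret_record))
        return 1.0 / (1.0 + distance)
    return confidence


def invert(model, dims, levels):
    """Recover a record using only query access to `model`.

    Each coordinate is searched in turn, keeping the value that yields
    the highest confidence while the other coordinates stay fixed.
    """
    guess = [levels[0]] * dims
    for i in range(dims):
        best_value, best_conf = guess[i], model(guess)
        for value in levels:
            candidate = guess[:i] + [value] + guess[i + 1:]
            conf = model(candidate)
            if conf > best_conf:
                best_value, best_conf = value, conf
        guess[i] = best_value
    return guess


# The attacker never sees this record directly, only the model's outputs.
model = make_model([3, 7, 1, 9])
recovered = invert(model, dims=4, levels=list(range(10)))
print(recovered)
```

The attacker reconstructs the record from confidence scores alone, which is why limiting or coarsening the outputs a model exposes is a commonly discussed defense.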

Membership Inference Attack

A privacy attack that determines whether a specific data record was part of a model's training set, by exploiting differences in the model's confidence or behavior on data it has seen versus unseen data. It can reveal, for instance, that a particular person's medical record was used to train a model — a confidentiality breach in its own right. Membership inference is closely related to model inversion and is relevant to AICP, AIGP, and privacy-focused AI governance.
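
The confidence gap described above can be sketched with a toy model that memorizes its training data, so records it has seen receive noticeably higher confidence than records it has not. The model, the data values, and the 0.9 threshold are assumptions made for the example; practical attacks usually calibrate the threshold with shadow models rather than picking it by hand.

```python
def make_model(training_values):
    """Return a query function for a model that has memorized its training set."""
    def confidence(x):
        nearest = min(abs(x - t) for t in training_values)
        return 1.0 / (1.0 + nearest)  # equals 1.0 exactly on training records
    return confidence


def guess_membership(model, record, threshold=0.9):
    """Guess 'member' when the model is unusually confident on the record."""
    return model(record) >= threshold


training_values = [1.0, 4.0, 9.0]
model = make_model(training_values)

# The attacker only calls `model`; it has no access to `training_values`.
seen = [guess_membership(model, r) for r in [1.0, 4.0, 9.0]]
unseen = [guess_membership(model, r) for r in [2.5, 6.0, 12.0]]
print(seen, unseen)
```

Every training record is flagged as a member and every unseen record is not, because the toy model is maximally overfit; well-regularized models narrow this gap but rarely remove it entirely.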