What is AI data poisoning?

AI data poisoning is a deliberate attempt to introduce bias into an AI model's training data so that the model's outputs are skewed.

Learning Objectives

After reading this article you will be able to:

  • Explain how an AI data poisoning attack works
  • Describe the types of AI and LLM data poisoning attacks
  • List data poisoning prevention methods


What is AI data poisoning?

Artificial intelligence (AI) data poisoning is when an attacker manipulates the outputs of an AI or machine learning model by changing its training data. The attacker's goal in an AI data poisoning attack is to get the model to produce biased or dangerous results during inference.

AI and machine learning* models have two primary ingredients: training data and algorithms. Think of an algorithm as being like the engine of a car, and training data as the gasoline that gives the engine something to burn: data makes an AI model go. A data poisoning attack is akin to contaminating that gasoline with an additive that makes the car run poorly.

The potential consequences of AI data poisoning have become more severe as more companies and people begin to rely on AI in their everyday activities. A successful AI data poisoning attack can permanently alter a model's output in a way that favors the person behind the attack.

AI data poisoning is of particular concern for large language models (LLMs). Data poisoning is listed in the OWASP Top 10 for LLMs, and in recent years researchers have warned of data poisoning vulnerabilities affecting healthcare, code generation, and text generation models.

*"Machine learning" and "artificial intelligence" are sometimes used interchangeably, although the two terms refer to slightly different sets of computational capabilities. Machine learning, however, is a type of AI.

How does a data poisoning attack work?

AI developers use vast amounts of data to train their models. Essentially, the training data set provides the models with examples, and the models then learn to generalize from those examples. The more examples there are in the data set, the more refined and accurate the model becomes — so long as the data is correct and relatively unbiased.

Data poisoning deliberately introduces bias into the training data set, changing the starting point for the model's algorithms so that its results differ from what its developers originally intended.

Imagine a teacher writes a math problem on a chalkboard for her students to solve: for example, "47 * (18 + 5) = ?". The answer is 1,081. But if a student sneaks behind her back and changes "47" to "46," then the answer is no longer 1,081, but 1,058. Data poisoning attacks are like that sneaky student: if the starting data changes slightly, the answer is also changed.
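
To make the analogy concrete, here is a minimal, hypothetical Python sketch. The data and the "model" are invented for illustration, not taken from any real system: the model simply learns the average rating of a product from its training examples, and a single poisoned example changes what it learns.

    # A toy "model" whose training step is just computing an average.
    def train_average_model(ratings):
        return sum(ratings) / len(ratings)

    clean_data = [4, 5, 4, 5, 4, 5, 4, 5]       # honest training examples
    poisoned_data = clean_data[:-1] + [-50]     # an attacker swaps in one bad example

    print(train_average_model(clean_data))      # 4.5    -> the model "believes" the product is good
    print(train_average_model(poisoned_data))   # -2.375 -> one poisoned example skews the result

Real models learn far more complex patterns than an average, but the principle is the same: whatever the training data contains, the model absorbs.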

How do AI data poisoning attacks happen?

Unauthorized alterations to training data can come from a number of sources.

Insider attack: Someone with legitimate access to the training data can introduce bias, false data, or other alterations that corrupt outputs. These attacks are more difficult to detect and stop than attacks by an external third party without authorized access to the data.

Supply chain attack: Most AI and machine learning models rely on data sets from a variety of sources to train their models. One or more of those sources could contain "poisoned" data that affects any model that uses it for training or fine-tuning.

Unauthorized access: There are any number of ways an attacker could gain access to a training data set, from lateral movement after a previous compromise, to phishing for a developer's credentials, to many other techniques in between.

What are the two main categories of data poisoning attack?

  • Direct (or targeted) attacks: These attacks aim to skew or alter a model's output only in response to particular queries or actions. Such an attack leaves the model otherwise unaltered, giving expected responses to almost all queries. For example, an attacker might want to trick an AI-based email security filter into allowing certain malicious URLs through while the filter otherwise performs as expected (a simplified sketch of this kind of targeted attack follows this list).
  • Indirect (or nontargeted) attacks: These attacks aim to affect a model's performance in general. An indirect attack may aim to simply slow down the performance of the model as a whole, or to bias it towards giving particular kinds of answers. A foreign adversary, for instance, might want to bias general-use LLMs towards giving out misinformation within a particular country for propaganda purposes.
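
As a simplified, hypothetical sketch of a direct (targeted) attack, consider a toy keyword-based spam filter written in Python. The data set, the domain name, and the scoring rule are all invented for illustration; the point is that a handful of mislabeled examples mentioning only the attacker's domain changes the filter's behavior for that domain while leaving everything else untouched.

    from collections import Counter

    def train_spam_scores(examples):
        # Count how often each token appears in spam vs. legitimate mail.
        spam_counts, ham_counts = Counter(), Counter()
        for text, label in examples:
            target = spam_counts if label == "spam" else ham_counts
            target.update(text.lower().split())
        return spam_counts, ham_counts

    def looks_spammy(token, spam_counts, ham_counts):
        # A token is treated as spammy if it appears more often in spam.
        return spam_counts[token] > ham_counts[token]

    clean_training_set = [
        ("click evil-example.com for free prizes", "spam"),
        ("urgent offer at evil-example.com", "spam"),
        ("meeting notes attached", "legit"),
        ("lunch tomorrow?", "legit"),
    ]

    # Targeted poisoning: inject mislabeled examples that mention only the
    # attacker's domain, leaving the rest of the data untouched.
    poisoned_training_set = clean_training_set + [
        ("quarterly report via evil-example.com", "legit"),
        ("team photos hosted on evil-example.com", "legit"),
        ("evil-example.com password reset", "legit"),
    ]

    for name, data in [("clean", clean_training_set), ("poisoned", poisoned_training_set)]:
        spam, ham = train_spam_scores(data)
        print(name, "filter flags evil-example.com:", looks_spammy("evil-example.com", spam, ham))
    # clean filter flags evil-example.com: True
    # poisoned filter flags evil-example.com: False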

What are the types of AI data poisoning attacks?

There are several ways an attacker can poison an AI model's data for their own purposes. Some of the most important techniques to know include:

  • Backdoor poisoning: This attack introduces a hidden vulnerability into the model so that, in response to certain specific triggers known to the attacker, it behaves in an unsafe way. Backdoor poisoning is particularly dangerous because an AI model with a hidden backdoor will otherwise behave normally.
  • Mislabeling: An attacker can change the way data is labeled within the training data set of a model, leading the model to misidentify items after it has been trained.
  • Data injection and manipulation: Such an attack alters, adds to, or removes data from a data set. These attacks aim to make the AI model biased in a certain direction (a small numeric sketch of this kind of injection follows this list).
  • Availability attack: This attack aims to slow down or crash the model by injecting data that degrades its overall performance.
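
As a rough illustration of data injection and mislabeling, the hypothetical sketch below (which assumes scikit-learn and NumPy are installed; the data is synthetic) injects points that sit in one class's region of the feature space but carry the other class's label, degrading an otherwise accurate classifier.

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Two well-separated clusters: class 0 and class 1.
    X, y = make_blobs(n_samples=400, centers=[[-3, 0], [3, 0]], cluster_std=1.0, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    def test_accuracy(train_X, train_y):
        # Train on the given (possibly poisoned) data, evaluate on clean test data.
        return LogisticRegression().fit(train_X, train_y).score(X_test, y_test)

    print("clean accuracy:", test_accuracy(X_train, y_train))

    # Poisoning: inject points located in class 0's region but labeled as class 1,
    # dragging the learned decision boundary into class 0's territory.
    rng = np.random.default_rng(0)
    X_poison = rng.normal(loc=[-3, 0], scale=1.0, size=(200, 2))
    y_poison = np.ones(200, dtype=int)

    print("poisoned accuracy:", test_accuracy(np.vstack([X_train, X_poison]),
                                              np.concatenate([y_train, y_poison])))
    # The clean model is near-perfect on this toy data, while the poisoned model
    # misclassifies much of class 0; exact numbers depend on the random seed.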

How to prevent data poisoning

Data validation: Before training, data sets should be analyzed to identify malicious, suspicious, or outlier data.
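
For numeric data, validation can be as simple as screening for statistical outliers before training. A minimal, hypothetical sketch in pure Python (the threshold and data are invented):

    import statistics

    def flag_suspicious(values, threshold=3.5):
        # Flag values with a large modified z-score (median/MAD based), which is
        # robust to the very outliers it is trying to catch.
        med = statistics.median(values)
        mad = statistics.median(abs(v - med) for v in values)
        if mad == 0:
            return []
        return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

    ratings = [4, 5, 4, 5, 4, 5, 4, 5, -50]   # one suspicious, possibly poisoned value
    print(flag_suspicious(ratings))            # [-50]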

Principle of least privilege: Only the persons and systems that absolutely need access to training data should have it. The principle of least privilege is a core tenet of a Zero Trust approach to security, which can help prevent lateral movement and credential compromise.

Diverse data sources: Drawing data from a wider range of sources can help reduce the impact of bias introduced into any single data set.

Monitoring and auditing: Tracking and recording who changed training data, what was changed, and when it was changed enables developers to identify suspicious patterns, or to trace an attacker's activity after the data set has been poisoned.
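
One simple building block for such auditing is recording a cryptographic fingerprint of the training data so that unexpected changes stand out. A hypothetical Python sketch (the file path is an invented placeholder):

    import hashlib

    def fingerprint(path):
        # Return the SHA-256 digest of a training data file.
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # Store the digest alongside an audit log entry (who changed what, and when);
    # before the next training run, recompute it and investigate any mismatch.
    # print(fingerprint("training_data.csv"))  # hypothetical path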

Adversarial training: This involves training an AI model to recognize intentionally misleading inputs.

Other application defense measures like firewalls can also be applied to AI models. To prevent data poisoning and other attacks, Cloudflare offers Firewall for AI, which can be deployed in front of LLMs to identify and block abuse before it reaches them. Learn more about Firewall for AI.