AI data poisoning is a deliberate attempt to introduce bias into an AI model's training data so that its outputs are skewed.
Artificial intelligence (AI) data poisoning is when an attacker manipulates the outputs of an AI or machine learning model by changing its training data. The attacker's goal in an AI data poisoning attack is to get the model to produce biased or dangerous results during inference.
AI and machine learning* models have two primary ingredients: training data and algorithms. Think of the algorithm as the engine of a car, and training data as the gasoline that gives the engine something to burn: data makes an AI model go. A data poisoning attack is as if someone added a contaminant to that gasoline, causing the car to run poorly.
The potential consequences of AI data poisoning have become more severe as more companies and people begin to rely on AI in their everyday activities. A successful AI data poisoning attack can permanently alter a model's output in a way that favors the person behind the attack.
AI data poisoning is of particular concern for large language models (LLMs). Data poisoning is listed in the OWASP Top 10 for LLMs, and in recent years researchers have warned of data poisoning vulnerabilities affecting healthcare, code generation, and text generation models.
*"Machine learning" and "artificial intelligence" are sometimes used interchangeably, although the two terms refer to slightly different sets of computational capabilities. Machine learning, however, is a type of AI.
AI developers use vast amounts of data to train their models. Essentially, the training data set provides the models with examples, and the models then learn to generalize from those examples. The more examples there are in the data set, the more refined and accurate the model becomes — so long as the data is correct and relatively unbiased.
Data poisoning deliberately introduces bias into the training data set, changing the starting point for the model's algorithms so that its results come out differently than its developers originally intended.
Imagine a teacher writes a math problem on a chalkboard for her students to solve: for example, "47 * (18 + 5) = ?". The answer is 1,081. But if a student sneaks behind her back and changes "47" to "46," then the answer is no longer 1,081, but 1,058. Data poisoning attacks are like that sneaky student: if the starting data changes slightly, the answer is also changed.
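The same effect can be shown in code. The short Python sketch below is a hypothetical illustration, not a real training pipeline: it fits the simplest possible model (a least-squares slope) to a small data set, then adds one poisoned example and fits again. The toy data and the fit_slope helper are assumptions chosen for clarity.

```python
# Hypothetical illustration: how one poisoned example skews a simple model.
# The data and the model are toy assumptions, not a real training pipeline.

def fit_slope(points):
    """Least-squares slope through the origin: sum(x*y) / sum(x*x)."""
    return sum(x * y for x, y in points) / sum(x * x for x, _ in points)

# Clean training data roughly follows y = 2x
clean = [(1, 2), (2, 4), (3, 6), (4, 8)]

# An attacker injects a single mislabeled example
poisoned = clean + [(5, -20)]

print(fit_slope(clean))     # 2.0    -> the model learns y = 2x
print(fit_slope(poisoned))  # ~-0.73 -> one bad point flips the model's predictions
```

In a real model with millions of parameters the shift is usually subtler, but the principle is the same: corrupted training data changes what the model learns.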
Unauthorized alterations to training data can come from a number of sources.
Insider attack: Someone with legitimate access to the training data can introduce bias, false data, or other alterations that corrupt outputs. These attacks are more difficult to detect and stop than attacks by an external third party without authorized access to the data.
Supply chain attack: Most AI and machine learning models are trained and fine-tuned on data sets drawn from a variety of sources. One or more of those sources could contain "poisoned" data that affects any model using that data for training or fine-tuning.
Unauthorized access: An attacker could gain access to a training data set in any number of ways, from lateral movement after a previous compromise, to obtaining a developer's credentials via phishing, to many other potential attacks in between.
There are several ways an attacker can poison an AI model's data for their own purposes. To defend against these attacks, developers can take a number of precautions, including:
Data validation: Before training, data sets should be analyzed to identify malicious, suspicious, or outlier data (a simple sketch of this check appears after this list).
Principle of least privilege: In other words, only those persons and systems that absolutely need access to training data should have it. The principle of least privilege is a core tenet of a Zero Trust approach to security, which can help prevent lateral movement and credential compromise.
Diverse data sources: Drawing from a wider range of sources for data can help reduce the impacts of bias in a given data set.
Monitoring and auditing: Tracking and recording who changed training data, what was changed, and when it was changed enables developers to identify suspicious patterns, or to trace an attacker's activity after the data set has been poisoned.
Adversarial training: This involves training an AI model to recognize intentionally misleading inputs.
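As a rough sketch of the data validation step mentioned above, the hypothetical Python example below flags training records whose labels sit far from the rest of the data set, using a median-based outlier test. The record format, the threshold, and the flag_suspicious_labels helper are illustrative assumptions; production pipelines typically add provenance checks, deduplication, and schema validation.

```python
# Minimal sketch (assumptions: numeric labels, a robust median-based check).
# Real validation pipelines also verify provenance, schema, and duplicates.
import statistics

def flag_suspicious_labels(records, threshold=3.5):
    """Flag records whose label is far from the median, using the
    median absolute deviation (MAD) -- robust to the outliers themselves."""
    labels = [label for _, label in records]
    median = statistics.median(labels)
    mad = statistics.median(abs(label - median) for label in labels)
    if mad == 0:
        return []
    # 0.6745 scales the MAD so the score is comparable to a z-score
    return [
        (features, label)
        for features, label in records
        if 0.6745 * abs(label - median) / mad > threshold
    ]

# Hypothetical training records: (features, label)
training_data = [((1.0, 0.5), 10.0), ((1.2, 0.4), 11.0),
                 ((0.9, 0.6), 9.5), ((1.1, 0.5), 500.0)]  # last label looks poisoned

for record in flag_suspicious_labels(training_data):
    print("Review before training:", record)
```

Flagged records are candidates for human review rather than automatic deletion, since legitimate but unusual data can also look like an outlier.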
Other application defense measures like firewalls can also be applied to AI models. To prevent data poisoning and other attacks, Cloudflare offers Firewall for AI, which can be deployed in front of LLMs to identify and block abuse before it reaches them. Learn more about Firewall for AI.