Protecting data from AI

Pros and cons of AI-enhanced development

AI is changing the development landscape

AI has enabled organizations to build and enhance applications at impressive speed and scale. This evolution in software development is powered by the rapid adoption of generative AI tools like ChatGPT and GitHub Copilot.

Among its many use cases, AI can generate code quickly (and, to a large degree, accurately), clean up existing code, pinpoint useful algorithms, spin up software documentation, and accelerate the manual coding process.

Put simply, AI can be a powerful development tool: When given specific, carefully scripted prompts, it can produce quality output that saves significant time and labor.

However, all technology comes with limitations, and in the case of AI, we’ve seen serious security and data privacy risks that can outweigh its efficiency benefits, from failing to spot critical errors to exposing proprietary code. One way to combat these risks is data loss prevention (DLP), which helps organizations detect the movement of sensitive data, comply with data and privacy regulations, and counteract data exfiltration.
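
One way to picture how DLP works is as a scanner that checks outbound content for sensitive patterns before it leaves the organization. The Python sketch below is a minimal illustration only; the pattern names and regular expressions are assumptions, and real DLP products rely on far more robust techniques such as checksums, exact data matching, and machine-learning classifiers.

```python
import re

# Illustrative detection patterns; real DLP engines use much more
# sophisticated matching than these simple regular expressions.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}

def scan(text: str) -> list[str]:
    """Return the names of sensitive-data patterns found in outbound text."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(text)]
```

A DLP control could then block, redact, or log any prompt or file upload in which a scan like this finds a match.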

But, given how new AI tools are, many traditional security solutions are not equipped to mitigate the risks and unknowns they pose to organizational data. Instead, organizations looking to use AI in the development process can safely enable these tools by implementing an AI-resilient data protection strategy. Modern data protection helps prevent the compromise of confidential information, compliance violations, adversarial attacks, and the loss of intellectual property.

The risks of using generative AI in coding

AI-powered development can help organizations drive innovation at scale. However, when organizations use these tools without being mindful of their inherent limitations and risks, the tools can not only hamper the development process, but also harm the organizations using them.

1. AI can expose (and reproduce) proprietary code

Generative AI tools ingest the information that is fed to them, then use that data to identify patterns and structures that enable them to generate new content. The more data these large language models (LLMs) are given, the more sophisticated and expansive in scope they become.

This raises important concerns when it comes to proprietary data. Take Samsung, for example, which banned the use of ChatGPT after an engineer accidentally uploaded internal source code to the tool. While that data was not leaked in the traditional sense, data shared with AI tools is often stored on servers outside of an organization’s control, and the organization then loses the ability to control how that data is used and distributed.

One of the most common concerns for organizations is the way AI platforms collect user data in order to further train their LLMs. Popular AI tools, like OpenAI’s ChatGPT and GitHub Copilot, train their models on the data they receive, and on multiple occasions have reproduced that data when generating outputs for other users. This creates a risk of proprietary code, sensitive data, or personally identifiable information (PII) being publicly exposed.

At the end of the day, sharing data with AI platforms is like sharing data with any other company. Users trust these platforms to safeguard their inputs, often without realizing that data security is not a core feature, and that the more data these platforms amass, the more lucrative a target they become.

2. AI can introduce vulnerabilities

Many of the leaks connected to AI tools have been accidental: An engineer uploads code that should not have been released outside of internal environments, or an organization discovers ChatGPT responses that closely resemble confidential company data.

Other instances of compromise are more insidious. FraudGPT and WormGPT are two AI tools specifically trained on stolen data for the sole intent of creating phishing campaigns, automating malware, and carrying out more sophisticated and human-appearing social engineering attacks. While most AI platforms are predominantly used for beneficial purposes, the powerful technology underpinning them can be trained to accelerate and drive attacks.

In addition to tools built on stolen data, even benign AI tools can generate insecure code. In one study, roughly 40% of the code generated by GitHub Copilot contained at least one of the 25 most dangerous software weaknesses identified by MITRE (the CWE Top 25). The study’s authors attributed this to Copilot being trained on open-source code hosted on GitHub, to which any user can upload code.
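
To make this concrete, the Python sketch below shows the kind of weakness such studies describe — SQL injection (CWE-89, one of the CWE Top 25) — next to the parameterized query that avoids it. The table, data, and function names are invented for illustration.

```python
import sqlite3

def find_user_unsafe(conn, username):
    # Vulnerable (CWE-89): attacker-controlled input is interpolated
    # directly into the SQL statement.
    return conn.execute(
        f"SELECT id FROM users WHERE name = '{username}'"
    ).fetchall()

def find_user_safe(conn, username):
    # Safe: a parameterized query keeps input out of the SQL grammar.
    return conn.execute(
        "SELECT id FROM users WHERE name = ?", (username,)
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])

# A classic injection payload matches every row in the unsafe version,
# but matches nothing when passed as a bound parameter.
payload = "' OR '1'='1"
```

Code review and automated scanning should treat AI suggestions like any other untrusted contribution, since patterns like the unsafe version above appear frequently in training data.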

Finally, AI tools themselves may be targeted by attackers. In one case, ChatGPT suffered a data breach in which over 100,000 accounts were compromised. Names, email addresses, payment addresses, and partial credit card information were exposed in the breach, as were confidential chat titles and messages created with the tool.

3. AI can bypass data privacy controls

The ease with which AI tools may be manipulated raises questions about the extent to which organizations will be able to fully protect user data when using these technologies. Whether inadvertently or maliciously, using AI software may open the door to data exposure and create widespread compliance issues.

For example, researchers discovered a critical flaw in Nvidia’s AI software that allowed them to bypass its intentional data privacy and security restrictions. In less than a day, they were able to trick the AI framework into revealing PII.

Investing in AI requires a security-first mindset

When protecting sensitive data from AI risks, it may be helpful to think of AI as one of the more dangerous types of shadow IT. Put simply, using third-party AI tools often comes with a critical lack of visibility into how data is being processed, stored, and distributed.

Since many AI tools were not built with security and data privacy as core design priorities, the onus falls on organizations to proactively defend their systems, code, and user data from compromise. Short of banning the use of AI completely, organizations can employ several strategies to minimize these risks, including:

Use proactive risk identification

Before introducing new third-party AI tools, assess planned use cases for AI. Will AI be used to suggest natural language documentation? Develop low-code or no-code software applications? Evaluate and remediate flaws in existing code? Integrate into internal applications or public-facing products?

Once these use cases are prioritized, it is important to evaluate potential risks that may be introduced or exacerbated by exposure to AI tools. Because AI risks exist on a wide spectrum, organizations need to establish clear guidelines for preventing and patching any vulnerabilities that arise. It may also be helpful to reference existing documentation of vulnerabilities that are connected to specific AI software.

Develop protocols around AI usage

It goes without saying that organizations should not provide carte blanche access to AI, especially when proprietary information and user data is at stake. Beyond security and data privacy concerns, AI tools raise questions of bias and transparency, which may further impact the benefits of AI-enhanced development.

For this reason, organizations should develop guidelines and protocols for third-party AI usage. Determine what data can be shared with AI tools, in what context that data can be shared, and which AI tools can access it. Investigate potential biases that AI tools introduce, document how AI is used within the organization, and set standards for the quality of AI-generated output that is gathered.
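
Such protocols are easier to enforce when encoded as policy rather than left to individual judgment. The Python sketch below is a minimal, hypothetical policy check; the tool names and data classifications are placeholders that an organization would replace with its own.

```python
# Hypothetical policy: which approved tools may receive which data classes.
ALLOWED_CLASSIFICATIONS = {
    "github-copilot": {"public", "internal"},
    "chatgpt": {"public"},
}

def may_share(tool: str, classification: str) -> bool:
    """Return True only if the tool is approved for this data class.

    Unlisted tools are denied by default.
    """
    allowed = ALLOWED_CLASSIFICATIONS.get(tool)
    return allowed is not None and classification in allowed
```

A check like this could sit in a forward proxy or developer tooling, denying by default any tool or data classification the policy does not explicitly allow.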

Implement and fine-tune AI controls

AI is constantly evolving, and as such, needs to be monitored on an ongoing basis. When utilizing AI models, adjust existing protocols and data restrictions as new use cases emerge. By continually assessing AI-generated code and functions, organizations may be able to detect potential risks more easily and minimize the likelihood of compromise.

Internal checks should be supplemented by regular evaluations of third-party AI tools. As new vulnerabilities are logged in ChatGPT, Copilot, or other AI software, reconsider the type of data that is being fed into those tools — or, if necessary, revoke access to tools until bugs have been patched.

Invest in data protection that can anticipate AI risks

Traditional data protection solutions are not adaptive or flexible enough to keep up with evolving AI data risks. Many standard data loss prevention (DLP) products are complex to set up and maintain, and they introduce negative user experiences, so in practice DLP controls are often underutilized or bypassed entirely. Whether deployed as a standalone platform or integrated into other security services, DLP alone is often too inefficient and inflexible to adapt to the various ways AI can be exploited.

Instead, organizations need to invest in data protection technology designed to be agile enough to mitigate AI risks and protect proprietary information and user data from misuse, compromise, and attacks. When evaluating modern data protection solutions, opt for one that is architected to secure developer code across all the locations where valuable data resides, while evolving alongside an organization’s changing security and privacy needs.

Cloudflare helps minimize AI risks

Businesses are just scratching the surface for how to leverage generative AI. Even in its early days, AI has already exposed data and introduced privacy risks. Today, effectively minimizing these risks requires strategic coordination across people, processes, and technology.

Cloudflare is designed to stay at the forefront of distinctly modern data risks like emerging AI tools. Cloudflare One converges multiple data protection point solutions onto a single SSE platform for simpler management and enforces controls everywhere — across all web, SaaS, and private environments — with speed and consistency. With all services built on the Cloudflare programmable network, new capabilities are built quickly and deployed across all 320 network locations.

This approach helps organizations with their data protection strategy such that:

  • Security teams can protect data more effectively by simplifying connectivity, with flexible inline and API-based options to send traffic to Cloudflare to enforce data controls.

  • Employees can improve productivity through reliable, consistent user experiences that are faster than those of competitors.

  • Organizations can increase agility by innovating rapidly to meet evolving data security and privacy requirements.

This article is part of a series on the latest trends and topics impacting today’s technology decision-makers.

Dive deeper into this topic.

Get the Simplifying the way we protect SaaS applications whitepaper, to see how Cloudflare helps organizations protect their applications and data with a Zero Trust approach.

Key takeaways

After reading this article, you will be able to understand:

  • How AI puts proprietary data at risk

  • Where legacy data protection falls short

  • Strategies to minimize AI risks — while maximizing productivity
