Combating shadow AI

Implementing controls for government use of AI

AI legislation is on the rise

The White House Office of Management and Budget issued Memorandum M-24-10 to all federal agencies and departments, addressing governance, innovation, and risk management in the use of artificial intelligence. The three-part focus of the memorandum is to:

  • Strengthen AI governance

  • Advance responsible AI innovation

  • Manage the risks from the use of AI

Last year, 25 states introduced legislation focused on some aspect of AI, and 18 states and Puerto Rico enacted some form of AI legislation. These efforts range from initial study and evaluation of AI use, through governance of its use by employees, to required controls that mitigate malicious or unintended consequences of AI.

Broadly speaking, this new body of legislation creates new compliance, consumption, and control requirements for government and other public sector organizations.

In this article, we will review some of the challenges facing organizations, both in protecting public-facing properties and in identifying and crafting governance for the consumption of AI models.

Challenge #1: Protecting public Internet properties from AI bots

Crawlers can be both beneficial and problematic for agencies. In some contexts, responsible crawlers and indexers use publicly accessible data to enhance citizens’ ability to find relevant online services and information.

On the other hand, poorly developed or malicious AI crawlers can scrape content to train public AI platforms without consideration for the privacy of that content.

There are numerous intellectual property and privacy concerns if this data ends up in the training data for models that back public AI platforms. If unchecked, these bots can also hamper the performance of public websites for all users by consuming resources needed for legitimate interactions.

Control 1: Deploy application-side protections

Several server-side protections can help control how bots interact with the server. One example is a robots.txt file. In a nutshell, this file tells crawlers how they may interact with various sections of the site and the data therein. The file is deployed in the root of the site and defines which agents (bots) can crawl the site and which resources they can access.
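
As a rough illustration of how a compliant crawler reads such a file, the sketch below parses a hypothetical robots.txt with Python's standard `urllib.robotparser` module. The paths, the `ExampleSearchBot` name, and the decision to disallow `GPTBot` are assumptions chosen for the example, not a recommended policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an agency site. The crawler names and paths
# are illustrative assumptions only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /internal/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content the way a well-behaved crawler would."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The named AI crawler is asked to stay away from the whole site.
print(parser.can_fetch("GPTBot", "/services/benefits"))            # False
# Any other crawler may index public pages...
print(parser.can_fetch("ExampleSearchBot", "/services/benefits"))  # True
# ...but is asked to skip the /internal/ section.
print(parser.can_fetch("ExampleSearchBot", "/internal/memo"))      # False
```

Note that the check runs on the crawler's side. Nothing in this file stops a bot that never performs it.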

There are a couple of challenges with this approach. The first and most obvious is that the crawler must respect the robots.txt file. While this is a general best practice for “respectable” bots, let’s face it: not everyone follows the rules. There are also non-malicious bots that may simply misinterpret the syntax and end up interacting with elements that agencies want to keep hidden.

In short, while it is a common approach, leveraging robots.txt or similar .htaccess (Apache) strategies is not foolproof protection. These strategies are, however, part of a holistic approach to governing how legitimate bots interact with application content.

Control 2: Deploy Bot Mitigation within a Web Application Firewall

Web Application Firewalls (WAFs) and bot mitigation solutions are table stakes in today’s world for public web applications. These controls help organizations protect their public digital properties from DDoS threat vectors, shadow and insecure APIs, and various other threats delivered in the form of bot technology.

Any bot mitigation strategy today should include the ability to programmatically identify and classify bots that scrape content for AI training data. This classification mechanism is a critical capability: it lets an agency allow or limit only legitimate, verified AI crawlers, or block AI crawlers altogether until it determines how these bots should be allowed to interact with government websites. The Cloudflare WAF not only identifies crawlers but also indicates whether they have been developed using industry best practices.
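
To make the classify-then-decide idea concrete, here is a minimal sketch in Python. It is not how the Cloudflare WAF or any specific product works. The user-agent tokens are an illustrative, incomplete list, and the `verified` flag is a hypothetical stand-in for whatever verification a real bot management service performs (user-agent strings alone are trivial to spoof).

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    RATE_LIMIT = "rate_limit"
    BLOCK = "block"


# Illustrative user-agent tokens for crawlers associated with AI training.
# This list is an assumption for the sketch; it is neither complete nor current.
AI_CRAWLER_TOKENS = ("gptbot", "ccbot", "claudebot", "bytespider")


@dataclass(frozen=True)
class Request:
    user_agent: str
    verified: bool  # hypothetical: set upstream by a real verification step


def classify(request: Request, verified_ai_action: Action = Action.BLOCK) -> Action:
    """Decide how to treat a request under the agency's AI crawler policy.

    verified_ai_action is the policy for verified AI crawlers. It defaults
    to BLOCK, the conservative stance until a policy has been decided.
    """
    ua = request.user_agent.lower()
    is_ai_crawler = any(token in ua for token in AI_CRAWLER_TOKENS)
    if not is_ai_crawler:
        # Other bot and threat checks are out of scope for this sketch.
        return Action.ALLOW
    if not request.verified:
        # Claims to be an AI crawler but could not be verified.
        return Action.BLOCK
    return verified_ai_action
```

An agency that later decides to admit verified AI crawlers at a reduced rate would only change the policy argument, for example `classify(request, Action.RATE_LIMIT)`.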

Last summer, António Guterres, Secretary-General of the United Nations, noting that AI has been compared to the printing press, observed that, while it took more than 50 years for printed books to become widely available across Europe, “ChatGPT reached 100 million users in just two months.” The scale and unprecedented growth of AI platforms correlate directly with the growing number of AI bots searching for any publicly exposed data sets for training.

This leads to the second major consideration in implementing these WAF and bot management controls. The architecture of these platforms must be able to scale in a distributed global environment. Cloudflare’s network architecture delivers a centrally managed, globally distributed bot mitigation capability that is deployed in 320 cities in over 120 countries. Backed by one of the largest networks on the Internet with 280 Tbps edge capacity, the Cloudflare connectivity cloud can absorb, block, filter, and rate limit threats on a truly global scale, close to the source of attack rather than close to your origins.

Challenge #2: Shadow AI: Unsanctioned consumption of public AI models

Let’s face it: public AI platforms have enabled users to accelerate everything from writing a memo to writing complex code. State and federal agencies see AI as critical to solving complex social problems such as healthcare, access to citizen services, and food and water safety, among others. However, without governance, organizations may be complicit in leaking regulated data sets into the training data of insecure public language models.

In the same way that organizations have leveraged tools to get a handle on the consumption of unsanctioned cloud applications or “Shadow IT”, they now need to understand the scope of Shadow AI consumption within their organizations.

The increase of “Shadow AI” is making headlines. A 3Gem study of over 11,500 employees worldwide showed that 57% of employees used public generative AI tools in the office at least once a week. In the same study, 39% of respondents agreed that there is a risk of sensitive data being leaked through these interactions.

This information is sometimes even being unknowingly shared across AI models, given the increase in AI models being trained on data produced by other models as opposed to traditionally sourced content.

Control 1: Determine appropriate use

Any comprehensive approach needs to include the determination of acceptable use of public AI models and, more specifically, which roles need access to those models. Establishing these guardrails is a critical first step. In fact, one major theme in the rising new legislation on AI in government is the review of appropriate use of AI within agencies and which models should be allowed.

Control 2: Deploy controlled access

Once those determinations have been made, agencies must then develop controls for enforcing those policies. Zero Trust Network Access (ZTNA) principles enable the development and enforcement of those policies to restrict unsanctioned access.

For example, you may allow only authorized users from specific administrative groups to access public AI models. Even for an authorized user, ZTNA allows additional posture checks, such as ensuring that corporate devices are up to date with patches or that the device has government-approved endpoint management agents running, prior to allowing access.
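
The sketch below shows the shape of such a decision in Python: group membership plus device posture, with access denied unless both hold. It is only an illustration of the logic, not any ZTNA product's policy language. The group name and the posture fields are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Hypothetical administrative group permitted to reach public AI models.
AUTHORIZED_GROUPS: FrozenSet[str] = frozenset({"ai-authorized-admins"})


@dataclass(frozen=True)
class User:
    name: str
    groups: FrozenSet[str]


@dataclass(frozen=True)
class Device:
    patches_current: bool          # device is up to date with patches
    endpoint_agent_running: bool   # approved endpoint management agent is active


def may_access_public_ai(user: User, device: Device) -> bool:
    """Allow access only for an authorized user on a healthy device."""
    in_authorized_group = bool(user.groups & AUTHORIZED_GROUPS)
    posture_ok = device.patches_current and device.endpoint_agent_running
    return in_authorized_group and posture_ok
```

Evaluating identity and posture together means that an authorized user on an unpatched device is denied just as an unauthorized user would be.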

In this way, governments can enforce and restrict who can access these public AI models while operating on government assets.

Control 3: Determine what data is appropriate for disclosure to AI platforms

Acceptable use is not only a matter of defining which users can access AI platforms. Governments also need to understand the data that is posted or submitted to AI platforms.

Even something as innocuous as a department memorandum could have non-public or sensitive data points. Once those data points are submitted to an LLM, there is a risk of that data being exposed.

Integrated data loss prevention (DLP) controls should be developed to ensure that proprietary information, such as sensitive application code or even citizen data, does not become part of an unsecured training data set for an AI platform.

Let’s take the example of an “AI development group” that needs to interact with both public and private (in-house) AI platforms.

An agency could allow the consumption of both public (e.g. ChatGPT) and private (e.g. AWS Bedrock) AI platforms. Only approved users in the “AI development group” are allowed access to these platforms. General users are blocked from both platforms.

However, even for approved “AI development group” users, a DLP rule examines the data being posted to these platforms to ensure that non-public sensitive data can only be posted to the internal private AI platform.
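
The combined rule can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the group name is hypothetical, and the two regular expressions stand in for the much richer detection profiles a real DLP engine would use.

```python
import re
from enum import Enum
from typing import AbstractSet


class Platform(Enum):
    PUBLIC = "public"    # e.g. a public AI service
    PRIVATE = "private"  # e.g. an in-house or privately hosted model


AI_DEV_GROUP = "ai-development-group"  # hypothetical group name

# Hypothetical detectors for non-public data; real DLP profiles are far broader.
SENSITIVE_PATTERNS = (
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),                      # SSN-like number
    re.compile(r"\bnot for public release\b", re.IGNORECASE),  # document marking
)


def contains_sensitive_data(text: str) -> bool:
    return any(pattern.search(text) for pattern in SENSITIVE_PATTERNS)


def allow_post(user_groups: AbstractSet[str], platform: Platform, payload: str) -> bool:
    """Apply the access rule first, then the DLP rule."""
    if AI_DEV_GROUP not in user_groups:
        return False  # general users are blocked from both platforms
    if platform is Platform.PUBLIC and contains_sensitive_data(payload):
        return False  # sensitive data may only go to the private platform
    return True
```

With this rule, a developer can send a prompt containing an SSN-like value to the private platform, while the same prompt to the public platform is blocked.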

Protecting constituents

Governance must start from a policy or mission perspective rather than a technology perspective. Understanding the role of AI in government programs from both a benefit and risk perspective takes intentionality by leadership to appoint focused teams that can evaluate the potential intersections of AI platforms and the mission of the agency.

The increase in public engagement through technology creates an accessible, rich set of data that AI platforms can use to train their models. Organizations may choose a more conservative approach by blocking all AI crawlers until the impact of allowing those interactions is understood. For those entities that see benefit in legitimate crawling of public properties, the ability to allow controlled access by verified AI crawlers while blocking malicious ones is critical in today’s environment.

From within the organization, establishing which roles and tasks require access to AI platforms is a critical early step in getting ahead of increased regulations. Mapping those needs to a set of controls that determine who gets access and when, as well as what kinds of data can be posted to these models, ultimately allows the removal of Shadow AI without sacrificing the tangible benefits these technologies provide.

Cloudflare Bot Management and Zero Trust are core to helping government entities reduce risk in the face of proliferating AI usage. Protecting public web properties and delivering the control mechanisms for responsible consumption of these technologies are critical controls that should be top of mind when developing mitigation strategies.

AI may solve, and in some ways is already solving, many complex social problems. However, governments must also protect their constituents as they wade into these new technologies.

This article is part of a series on the latest trends and topics impacting today’s technology decision-makers.

Dive deeper into this topic.

Learn more about how Zero Trust can reduce risk in the face of proliferating AI usage in the complete guide, “A roadmap to Zero Trust architecture.”


Scottie Ray — @scottieray
Principal Solutions Architect, Cloudflare

Key takeaways

After reading this article you will be able to understand:

  • The emerging state of AI-focused legislation

  • Two primary challenges AI presents

  • Controls that help agencies achieve legislative compliance
