What Is a Web Crawler?

What is a web crawler bot?

A web crawler, spider, or search engine bot is a software program that accesses, downloads, and/or indexes content from all over the Internet. Web crawler operators may seek to learn what (almost) every webpage on the web is about, so that the information can be retrieved when it's needed. Search engine operators may use these bots to find relevant pages to display in search results. The bots are called "web crawlers" because crawling is the technical term for automatically accessing a website and obtaining data via a software program.

AI web crawlers are a separate, but related, type of crawler bot. They access content on the web either to help train large language models (LLMs), or to help AI assistants provide information to users. Many search providers also operate AI crawlers.

Search engine web crawlers

By applying a search algorithm to the data collected by web crawlers, search engines can provide relevant links in response to user search queries, generating the list of webpages that show up after a user types a search into Google or Bing (or another search engine).

A search engine web crawler bot is like someone who goes through all the books in a disorganized library and puts together a card catalog so that anyone who visits the library can quickly and easily find the information they need. To help categorize and sort the library's books by topic, the organizer will read the title, summary, and some of the internal text of each book to figure out what it's about.

However, unlike a library, the Internet is not composed of physical piles of books, and that makes it hard to tell if all the necessary information has been indexed properly, or if vast quantities of it are being overlooked. To try to find all the relevant information the Internet has to offer, a web crawler bot will start with a certain set of known webpages and then follow hyperlinks from those pages to other pages, follow hyperlinks from those other pages to additional pages, and so on.
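
The link-following step can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine does it: the `LinkExtractor` class, the `extract_links` helper, and the sample page are hypothetical names and data, and only the standard library is used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return the hyperlinks found in html, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content, for illustration only.
page = '<p><a href="/about">About</a> <a href="https://example.org/">Other</a></p>'
links = extract_links("https://example.com/index.html", page)
```

Each URL in `links` is a candidate for the crawler to visit next.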

It is unknown how much of the publicly available Internet is actually crawled by search engine bots. Some sources estimate that only 40–70% of the Internet is indexed for search — and that's billions of webpages.

AI web crawlers

AI web crawlers serve three main purposes:

Training data for LLMs: LLMs need large quantities of content in order to better refine their models and provide more useful and accurate responses to users. New content helps them continue to improve. AI crawlers look over websites for new content. They copy and save any content they find so it can be used for training.
Live retrieval of information for users: AI assistants sometimes complement the answers they generate with content from external sources. To do so, they may incorporate the web content their crawler bots discover into their responses.
Indexing content: Like search engines, AI models need to know where on the Internet they can find valuable content. Otherwise they cannot, for example, perform live retrieval in response to user prompts.

People increasingly receive answers to their queries via AI tools, and AI crawling activity now exceeds that of search engine crawlers. Unfortunately for content creators, who often rely on people visiting their websites to make money, AI tools refer users to the websites they have crawled far less often than traditional search engines do.

What is search indexing?

Search indexing is like creating a library card catalog for the Internet so that a search engine knows where on the Internet to retrieve information when a person searches for it. It can also be compared to the index in the back of a book, which lists all the places in the book where a certain topic or phrase is mentioned.

Indexing focuses mostly on the text that appears on the page, and on the metadata* about the page that users don't see. When most search engines index a page, they add all the words on the page to the index — except for words like "a," "an," and "the" in Google's case. When users search for those words, the search engine goes through its index of all the pages where those words appear and selects the most relevant ones.

*In the context of search indexing, metadata is data that tells search engines what a webpage is about. Often the meta title and meta description are what will appear on search engine results pages, as opposed to content from the webpage that's visible to users.
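
The "index in the back of a book" idea is often implemented as an inverted index: a mapping from each word to the pages that contain it. The sketch below is a simplified illustration under stated assumptions, not a description of any real search engine: the `build_index` and `search` helpers and the sample pages are hypothetical, the skipped words are just the three examples mentioned above, and no relevance ranking is attempted.

```python
import re
from collections import defaultdict

# Words left out of the index; only the examples named in the text.
STOP_WORDS = {"a", "an", "the"}


def tokenize(text):
    """Lowercase the text and split it into indexable words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]


def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every indexed word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical pages, for illustration only.
pages = {
    "https://example.com/spiders": "The spider crawls a web",
    "https://example.com/bots": "A bot crawls the Internet",
}
index = build_index(pages)
```

A real search engine would then rank the matching pages by relevance; here `search` simply returns every match.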

How do web crawlers work?

The Internet is constantly changing and expanding. Because it is not possible to know how many total webpages there are on the Internet, web crawler bots start from a seed, or a list of known URLs. They crawl the webpages at those URLs first. As they crawl those webpages, they will find hyperlinks to other URLs, and they add those to the list of pages to crawl next.
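
The seed-and-follow process described above amounts to a breadth-first traversal of a queue of URLs, often called the crawl frontier. The following is a minimal sketch under simplifying assumptions: `crawl`, `get_links`, and the in-memory `TOY_WEB` are hypothetical stand-ins, and a real crawler would download each page and extract its links rather than look them up in a dictionary.

```python
from collections import deque


def crawl(seeds, get_links, max_pages=100):
    """Visit pages breadth-first, starting from the seed URLs.

    get_links(url) must return the URLs that the page at url links to.
    """
    frontier = deque()
    seen = set()
    for url in seeds:
        if url not in seen:
            seen.add(url)
            frontier.append(url)

    crawled = []
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        crawled.append(url)
        for link in get_links(url):
            # Queue each newly discovered URL exactly once.
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return crawled


# A toy, in-memory "web" (hypothetical URLs, for illustration only).
TOY_WEB = {
    "a.example/": ["a.example/1", "b.example/"],
    "a.example/1": ["a.example/"],
    "b.example/": ["c.example/"],
    "c.example/": [],
}
order = crawl(["a.example/"], lambda url: TOY_WEB.get(url, []))
```

The `seen` set keeps the crawler from looping forever when pages link back to each other, and `max_pages` is one crude way to stop a process that could otherwise go on almost indefinitely.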

Given the vast number of webpages on the Internet that could be indexed for search, this process could go on almost indefinitely. However, a web crawler will follow certain policies that make it more selective about which pages to crawl, in what order to crawl them, and how often they should crawl them again to check for content updates.

**The relative importance of each webpage:** Most web crawlers don't crawl the entire publicly available Internet and aren't intended to; instead they decide which pages to crawl first based on the number of other pages that link to that page, the number of visitors that page gets, and other factors that signify the page's likelihood of containing important information.

The idea is that a webpage that is cited by a lot of other webpages and gets a lot of visitors is likely to contain high-quality, authoritative information, so it's especially important that a search engine has it indexed — just as a library might make sure to keep plenty of copies of a book that gets checked out by lots of people.
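
One of the signals mentioned above, the number of other pages linking to a page, can be turned into a crawl order with a priority queue. This sketch is an assumption-laden illustration: real crawlers combine many signals in proprietary ways, and `crawl_order` and the sample link graph are hypothetical.

```python
import heapq
from collections import Counter


def crawl_order(link_graph, candidates):
    """Order candidate URLs by how many other pages link to them."""
    inbound = Counter()
    for source, targets in link_graph.items():
        for target in set(targets):
            if target != source:  # a page linking to itself does not count
                inbound[target] += 1

    # heapq is a min-heap, so negate the count to pop the most-linked page first.
    # Ties are broken alphabetically by URL.
    heap = [(-inbound[url], url) for url in candidates]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


# Hypothetical link graph: each page maps to the pages it links to.
link_graph = {
    "home": ["docs", "blog"],
    "blog": ["docs", "pricing"],
    "pricing": ["docs"],
    "docs": [],
}
order = crawl_order(link_graph, ["blog", "pricing", "docs"])
```

Here "docs" is crawled first because three other pages link to it.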

**Revisiting webpages:** Content on the Web is continually being updated, removed, or moved to new locations. Web crawlers will periodically need to revisit pages to make sure the latest version of the content is indexed.
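
One possible way to decide how often to revisit a page is to compare a fingerprint of its content between visits: revisit sooner if it changed, later if it did not. The halving and doubling rule below is an assumption made purely for illustration, not a documented policy of any crawler, and `fingerprint` and `next_interval` are hypothetical helpers.

```python
import hashlib


def fingerprint(content):
    """Return a short, stable digest of a page's content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def next_interval(old_fingerprint, new_content, interval_hours,
                  min_hours=1, max_hours=24 * 30):
    """Return (changed, new_interval_hours) after revisiting a page.

    Assumed heuristic: halve the interval when the page changed,
    double it when it did not, and stay within the given bounds.
    """
    changed = fingerprint(new_content) != old_fingerprint
    if changed:
        interval_hours = max(min_hours, interval_hours / 2)
    else:
        interval_hours = min(max_hours, interval_hours * 2)
    return changed, interval_hours


# Hypothetical page versions, for illustration only.
stored = fingerprint("version 1 of the page")
unchanged_result = next_interval(stored, "version 1 of the page", 24)
changed_result = next_interval(stored, "version 2 of the page", 24)
```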

**Robots.txt preferences:** Web crawlers also may decide which pages to crawl based on the robots.txt protocol (also known as the robots exclusion protocol). Before crawling a webpage, they usually will check the robots.txt file hosted by that page's web server. A robots.txt file is a text file that specifies the rules for any bots accessing the hosted website or application. These rules define which pages the website operator gives the bots permission to crawl, and which links they are permitted to follow. As an example, check out the Cloudflare.com robots.txt file.
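
A well-behaved crawler's robots.txt check can be sketched with Python's standard `urllib.robotparser` module. The rules, the `ExampleBot` user agent, and the URLs below are made up for illustration; a real crawler would fetch the file from the site's `/robots.txt` path instead of using an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="ExampleBot"):
    """Return True if the rules permit user_agent to crawl url."""
    return parser.can_fetch(user_agent, url)


blog_ok = allowed("https://example.com/blog/post")
search_ok = allowed("https://example.com/search")
private_ok = allowed("https://example.com/private/report.html")
```

Only the blog URL is permitted under these rules. Nothing in the file itself enforces this; the crawler has to choose to perform the check.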

All these factors are weighted differently within the proprietary algorithms that each search engine builds into its spider bots. Web crawlers from different search engines will behave slightly differently, although the end goal is the same: to download and index content from webpages. Not all web crawlers obey the instructions set forth in robots.txt files.

Why are web crawlers called 'spiders'?

The Internet, or at least the part that most users access, is also known as the World Wide Web — in fact that's where the "www" part of most website URLs comes from. It was only natural to call search engine bots "spiders," because they crawl all over the Web, just as real spiders crawl on spiderwebs.

Should web crawler bots always be allowed to access web properties?

That is up to the web property, and it depends on a number of factors. Web crawlers require server resources in order to index content — they make requests that the server needs to respond to, just like a user visiting a website or other bots accessing a website. Depending on the amount of content on each page or the number of pages on the site, it could be in the website operator's best interests not to allow search indexing too often, since too much indexing could overtax the server, drive up bandwidth costs, or both.

Developers or companies may not want some webpages to be discoverable via search unless a user already has been given a link to the page (without putting the page behind a paywall or a login). One example of such a case for enterprises is when they create a dedicated landing page for a marketing campaign, but they don't want anyone not targeted by the campaign to access the page. In this way they can tailor the messaging or precisely measure the page's performance. In such cases the enterprise can add a "noindex" tag to the landing page, and it won't show up in search engine results. They can also add a "disallow" rule for the page in the robots.txt file, and search engine spiders that honor the file won't crawl it at all.
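
The "noindex" tag is usually written as a robots meta tag in the page's HTML, for example `<meta name="robots" content="noindex">`. The sketch below shows how a crawler might detect it using only the standard library. `RobotsMetaParser`, `may_index`, and the sample pages are hypothetical, and the sketch ignores other ways of sending the same signal, such as HTTP response headers.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for directive in attr.get("content", "").split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def may_index(html):
    """Return False if the page asks not to be indexed."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" not in parser.directives


# Hypothetical pages, for illustration only.
landing_page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
public_page = "<html><head><title>Welcome</title></head></html>"
```

With these samples, `may_index(landing_page)` is false and `may_index(public_page)` is true.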

Also, some web administrators may not want LLMs to be trained on their content. Website content could be proprietary or under copyright. In some cases, harvesting web content for training data may disrupt that website's business model — for instance, if the website hosts unique content and sells ad space to generate revenue. For such websites, administrators would want to specifically limit AI crawler bot activity, or charge for it, while still allowing search engine bots to crawl for free.
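
One way to express that preference is a robots.txt file with separate rules per user agent. The file below is a hypothetical example that disallows OpenAI's GPTBot while allowing everyone else, checked here with Python's standard `urllib.robotparser`. It only states the site's wishes and has no effect on crawlers that ignore robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one AI crawler, allow all other bots.
AI_RULES = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rules = RobotFileParser()
rules.parse(AI_RULES.splitlines())

article = "https://example.com/article"
gptbot_allowed = rules.can_fetch("GPTBot", article)
googlebot_allowed = rules.can_fetch("Googlebot", article)
```

Under these rules `gptbot_allowed` is false and `googlebot_allowed` is true.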

Website owners may not want web crawler bots to crawl part or all of their sites for a variety of other reasons as well. For instance, a website that offers users the ability to search within the site may want to block the search results pages, as these are not useful for most users. Other auto-generated pages that are only helpful for one user or a few specific users should also be blocked.

What is the difference between web crawling and web scraping?

Web scraping, data scraping, or content scraping is when a bot downloads the content on a website without permission, often with the intention of using that content for a malicious purpose.

Web scraping is usually much more targeted than web crawling. Web scrapers may be after specific pages or specific websites only, while web crawlers will keep following links and crawling pages continuously.

Also, web scraper bots may disregard the strain they put on web servers, while web crawlers, especially those from major search engines, are more likely to obey the robots.txt file and limit their requests so as not to overtax the web server.
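
Limiting requests so as not to overtax a server is sometimes called crawl politeness. A minimal sketch, assuming a fixed minimum delay between requests to the same host: the `PolitenessGate` class and the two-second delay are hypothetical, and the current time is passed in explicitly so the logic can be followed without actually sleeping.

```python
from urllib.parse import urlsplit


class PolitenessGate:
    """Spaces out requests so each host is contacted at most once per delay."""

    def __init__(self, min_delay_seconds=1.0):
        self.min_delay = min_delay_seconds
        self.next_allowed = {}  # host -> earliest time of the next request

    def wait_time(self, url, now):
        """Return how long to wait before requesting url, and reserve that slot."""
        host = urlsplit(url).netloc
        ready_at = self.next_allowed.get(host, now)
        wait = max(0.0, ready_at - now)
        self.next_allowed[host] = now + wait + self.min_delay
        return wait


gate = PolitenessGate(min_delay_seconds=2.0)
first = gate.wait_time("https://example.com/a", now=0.0)
second = gate.wait_time("https://example.com/b", now=0.5)
other_host = gate.wait_time("https://example.org/", now=0.5)
```

The second request to the same host has to wait 1.5 seconds, while the request to a different host can go out immediately.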

How do web crawlers affect SEO?

SEO stands for search engine optimization, and it is the discipline of readying content for search indexing so that a website shows up higher in search engine results.

If spider bots do not crawl a website, then it cannot be indexed, and it will not show up in search results. For this reason, if a website owner wants to get organic traffic from search results, it is very important that they do not block web crawler bots.

However, the relationship between SEO and web traffic has changed. The increased use of AI chatbots and AI-generated results reduces traffic even for high-ranking pages. Meanwhile AI crawler bots request web content significantly more often than traditional search engine crawlers. Web crawlers still offer benefits to websites, but websites that rely on web traffic for revenue may be negatively impacted by AI crawlers.

List of search web crawlers

The bots from the major search engines are called:

Google: Googlebot (actually two crawlers, Googlebot Desktop and Googlebot Smartphone, for desktop and mobile searches)
Bing: Bingbot
DuckDuckGo: DuckDuckBot
Yahoo! Search: Slurp
Yandex: YandexBot
Baidu: Baiduspider
Exalead: ExaBot

There are also many other web crawler bots, some of which are not associated with any search engine.

List of AI crawlers

These are some of the most common AI crawler bots that collect data for LLMs:

OpenAI: GPTBot
OpenAI: ChatGPT-User (for live retrieval)
Meta: Meta-ExternalAgent
Google: GoogleOther
Huawei: PetalBot
Amazon: Amazonbot
ByteDance: Bytespider
Anthropic: ClaudeBot

See Cloudflare's list of verified bots.

Why is it important for bot management to take web crawling into account?

Bad bots can cause a lot of damage, from poor user experiences to server crashes to data theft. However, in blocking bad bots, it's important to still allow good bots, such as search engine web crawlers, to access web properties. Cloudflare Bot Management allows good bots to keep accessing websites while still mitigating malicious bot traffic. The product maintains an automatically updated allowlist of good bots, like web crawlers, to ensure they are not blocked.

While websites can still benefit from search engine crawling, search engines and AI tools alike often answer user questions without directing users to websites. This cuts down significantly on the amount of traffic a website receives. AI crawlers also tend to crawl considerably more often than search engine bots, which can drive up costs for websites. To protect content creators, Cloudflare enables website owners to choose between allowing AI crawlers, blocking them altogether, or charging them for accessing their content using a feature called pay per crawl.

FAQs

What is a web crawler?

A web crawler, also known as a spider, is an automated program or bot predominantly used by search engines like Google and Bing to explore and catalog web content across the Internet. Its primary functions are to gather the content of nearly every webpage and to facilitate retrieval of that content in search results.

How do web crawlers determine which pages to visit and index?

Web crawlers begin their journey from a predefined list of known website addresses, or URLs. As they process these initial pages, they identify and add new hyperlinks to their list of pages to crawl. Since the Internet is vast, crawlers prioritize pages based on factors like how many other pages link to them and how much traffic they receive, as these often indicate valuable content. They also read and follow instructions in robots.txt files, which are created by website owners and specify which parts of their site bots are permitted to access.

What is the purpose of search indexing?

Search indexing is akin to creating a comprehensive library catalog for the Internet. This process allows search engines to quickly locate and present relevant information when a user performs a search. The indexing process mainly focuses on the text visible on a page and its metadata.

How do AI web crawlers function, and for what purposes are they used?

AI web crawlers are a specific kind of bot that accesses web content for three main reasons. First, they gather vast amounts of content to train large language models (LLMs), helping these models improve their accuracy and utility in generating responses. Second, some AI crawlers are used by AI assistants to pull live information from the web to supplement the answers they provide to users. Third, like search engine crawlers, they index content so that AI tools know where on the Internet to find it when retrieving information for users.

Why might a website owner choose to restrict web crawler access, and how can they do it?

Website owners might limit crawler access to conserve server resources, as crawling consumes bandwidth and requires server responses. They might also restrict access to pages not intended for public search, such as specific marketing landing pages where they want to control access or measure precise performance. In addition, some administrators might want to prevent AI models from training on their copyrighted or proprietary content that generates revenue through advertising. Owners can prevent specific pages from appearing in search results by adding a "noindex" tag to those pages, or block crawling completely with a "disallow" rule in the robots.txt file.

What is the distinction between web crawling and web scraping?

Web crawling is generally performed by legitimate bots, such as those from search engines, to index content for search results. Web scraping, however, might involve illicitly collecting website content. These scrapers might ignore robots.txt rules, disregard the strain placed on servers from their requests, and facilitate the use of original content in unauthorized ways. AI and search engine companies using web scrapers should obtain permission to scrape content and pay content creators to use their content.

Why is managing web crawler bots important for search engine optimization (SEO)?

Effective bot management is crucial for SEO because if web crawlers are blocked from accessing a website, the site cannot be indexed and, consequently, will not appear in search results. For website owners seeking organic traffic, ensuring that good bots like search engine crawlers can access and index their content is vital.