What are good bots?
A bot is a computer program that automates interactions with web properties over the Internet. A "good" bot is any bot that performs useful or helpful tasks without degrading a user's experience on the Internet. Because good bots can share characteristics with malicious bots, the challenge is ensuring that good bots aren't blocked when putting together a bot management strategy.
There are many kinds of good bots, each designed for different tasks. Here are some examples:
- Search engine bots: Also known as web crawlers or spiders, these bots "crawl," or review, content on almost every website on the Internet, then index that content so it can appear in search engine results for relevant user queries. They're operated by search engines like Google, Bing, or Yandex.
- Copyright bots: Bots that crawl platforms or websites looking for content that may violate copyright law. These bots can be operated by any person or company who owns copyrighted material. Copyright bots can look for duplicated text, music, images, or even videos.
- Site monitoring bots: These bots monitor website metrics – for example, monitoring for backlinks or system outages – and can alert users of major changes or downtime. For instance, Cloudflare operates a crawler bot called Always Online that tells the Cloudflare network to serve a cached version of a webpage if the origin server is down.
- Commercial bots: Bots operated by commercial companies that crawl the Internet for information. These bots may be operated by market research companies monitoring news reports or customer reviews, ad networks optimizing the places where they display ads, or SEO agencies that crawl clients' websites.
- Feed bots: These bots crawl the Internet looking for newsworthy content to add to a platform's news feed. Content aggregator sites or social media networks may operate these bots.
- Chatbots: Chatbots imitate human conversation by answering users with preprogrammed responses. Some chatbots are complex enough to carry on lengthy conversations.
- Personal assistant bots: Programs like Siri or Alexa. Although these programs are far more advanced than the typical bot, they are bots nonetheless: computer programs that browse the web for data.
Good bots vs. bad bots
Web properties need to make sure they aren't blocking these kinds of bots as they attempt to filter out malicious bot traffic. It's especially important that search engine web crawler bots don't get blocked, because without them a website can't show up in search results.
Bad bots can steal data, break into user accounts, submit junk data through online forms, and perform other malicious activities. Types of bad bots include credential stuffing bots, content scraping bots, spam bots, and click fraud bots.
What is robots.txt?
Good bot management starts with properly setting up rules in a website's robots.txt file. A robots.txt file is a text file that lives on a web server and specifies the rules for any bots accessing the hosted website or application. These rules define which pages the bots can and can't crawl, which links they should and shouldn't follow, and other requirements for bot behavior.
Good bots will follow these rules. For instance, if a website owner doesn't want a certain page on their site to show up in Google search results, they can write a rule in the robots.txt file, and Google web crawler bots won't index that page. Although the robots.txt file cannot actually enforce these rules, good bots are programmed to look for the file and follow the rules before they do anything else.
Bad bots, however, will often either disregard the robots.txt file or will read it to learn what content a website is trying to keep off-limits from bots, then access that content. Thus, managing bots requires a more active approach than simply laying out the rules for bot behavior in the robots.txt file.
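As a concrete illustration, a minimal robots.txt file might look like the following (the directory and the blocked crawler name are hypothetical examples, not drawn from any real site):

```txt
# Allow all bots to crawl the site, except the /private/ directory
User-agent: *
Disallow: /private/

# Block one specific crawler entirely (hypothetical bot name)
User-agent: ExampleScraperBot
Disallow: /
```

Good bots read this file before crawling and honor the `Disallow` rules; as noted above, nothing in the file itself can force a bot to comply.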
What is a whitelist?
Think of a whitelist as being like the guest list for an event. If someone who isn't on the guest list tries to enter the event, security personnel will prevent them from entering. Anyone who's on the list can freely enter the event. Such an approach is necessary because uninvited guests may behave badly and ruin the party for everyone else.
For bot management, that's basically how whitelists work. A whitelist is a list of bots that are allowed to access a web property. (A whitelist is the opposite of a blacklist; hence the name.) Typically this works via something called the "user agent," the bot's IP address, or a combination of the two. A user agent is a string of text that identifies the type of user (or bot) to a web server.
By maintaining a list of allowed good-bot user agents, such as those belonging to search engines, and blocking any bot not on the list, a web server can ensure that good bots retain access. Web servers can also maintain a blacklist of known bad bots.
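As a sketch of the user-agent side of this approach, the snippet below checks an incoming User-Agent string against a small whitelist of bot tokens. The token list and function name are hypothetical; as discussed below, a real deployment would also verify IP addresses, because user agents are trivial to spoof:

```python
# Hypothetical whitelist of substrings that identify allowed good bots.
ALLOWED_BOT_TOKENS = ("Googlebot", "bingbot", "YandexBot")

def is_whitelisted(user_agent: str) -> bool:
    """Return True if the User-Agent string contains an allowed bot token."""
    return any(token in user_agent for token in ALLOWED_BOT_TOKENS)

# Googlebot advertises itself with a User-Agent string like this one:
googlebot_ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
print(is_whitelisted(googlebot_ua))          # True
print(is_whitelisted("UnknownScraper/1.0"))  # False
```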
What is a blacklist?
A blacklist, in the context of networking, is a list of IP addresses, user agents, or other indicators of online identity that are not allowed to access a server, network, or web property. This is a slightly different approach than using a whitelist: a bot management strategy based around blacklisting will block those specific bots and allow all other bots through, while a whitelisting strategy only allows specified bots through and blocks all others.
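To make the contrast concrete, here is the same kind of check inverted into a blacklist policy. The blocked tokens are made-up examples; everything not on the list is allowed through:

```python
# Hypothetical blacklist of substrings that identify known bad bots.
BLOCKED_BOT_TOKENS = ("EvilScraperBot", "SpamBot9000")

def blacklist_allows(user_agent: str) -> bool:
    """Return True unless the User-Agent matches a blocked bot token."""
    return not any(token in user_agent for token in BLOCKED_BOT_TOKENS)

print(blacklist_allows("Mozilla/5.0 (Windows NT 10.0)"))  # True: not listed
print(blacklist_allows("EvilScraperBot/3.1"))             # False: blocked
```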
Are whitelists enough for letting good bots in and keeping bad bots out?
It is possible for a bad bot to fake its user agent string so that it looks like a good bot, at least initially – just as a thief might use a fake ID card to pretend to be on the guest list and sneak into an event.
Therefore, whitelists of good bots have to be combined with other approaches to detect spoofing, such as behavioral analysis or machine learning. This helps proactively identify both bad bots and unknown good bots, in addition to simply allowing known good bots.
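One widely used spoofing check is a reverse-then-forward DNS lookup: resolve the connecting IP to a hostname, confirm the hostname falls under the crawler's published domain, then resolve that hostname back and confirm it matches the original IP. The sketch below uses Python's standard socket module; the domain suffixes follow the search engines' own crawler-verification guidance, but treat the exact mapping as illustrative rather than exhaustive:

```python
import socket

# Claimed bot name -> hostname suffixes its IPs should reverse-resolve to.
# (Illustrative mapping; consult each search engine's verification docs.)
GOOD_BOT_DOMAINS = {
    "Googlebot": (".googlebot.com", ".google.com"),
    "bingbot": (".search.msn.com",),
}

def hostname_matches(hostname: str, suffixes: tuple) -> bool:
    """Pure suffix check, split out so it can be tested without DNS."""
    return hostname.endswith(tuple(suffixes))

def verify_crawler(ip: str, claimed_bot: str) -> bool:
    """Reverse-then-forward DNS confirmation of a claimed crawler identity."""
    suffixes = GOOD_BOT_DOMAINS.get(claimed_bot)
    if not suffixes:
        return False
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)      # reverse lookup
        if not hostname_matches(hostname, suffixes):
            return False
        return socket.gethostbyname(hostname) == ip    # forward confirmation
    except OSError:
        return False
```

A spoofed bot can copy Googlebot's user agent, but it cannot make an arbitrary IP reverse-resolve into the googlebot.com domain, which is why this check is stronger than user-agent matching alone.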
What does a bot manager solution do?
A bot manager product allows good bots to access a web property while blocking bad bots. Cloudflare Bot Management uses machine learning and behavioral analysis of traffic across its entire network to detect bad bots while automatically and continually whitelisting good bots.