What is data scraping?
Data scraping, in its most general form, refers to a technique in which a computer program extracts data from output generated from another program. Data scraping is commonly manifest in web scraping, the process of using an application to extract valuable information from a website.
Why scrape website data?
Typically companies do not want their unique content to be downloaded and reused for unauthorized purposes. As a result, they don’t expose all data via a consumable API or other easily accessible resource. Scraper bots, on the other hand, are interested in getting website data regardless of any attempt at limiting access. As a result, a cat-and-mouse game exists between web scraping bots and various content protection strategies, with each trying to outmaneuver the other.
The process of web scraping is fairly simple, though the implementation can be complex. Web scraping occurs in 3 steps:
- First the piece of code used to pull the information, which we call a scraper bot, sends an HTTP GET request to a specific website.
- When the website responds, the scraper parses the HTML document for a specific pattern of data.
- Once the data is extracted, it is converted into whatever specific format the scraper bot’s author designed.
Scraper bots can be designed for many purposes, such as:
- Content scraping - content can be pulled from the website in order to site in order to replicate the unique advantage of a particular product or service that relies on content. For example, a product like Yelp relies on reviews; a competitor could scrape all the review content from Yelp and reproduce the content on their own site, pretending the content is original.
- Price scraping - by scraping pricing data, competitors are able to aggregate information about their competition. This can allow them to formulate a unique advantage.
- Contact scraping - a lot of websites contain email addresses and phone numbers in plaintext. By scraping locations like an online employee directory, a scraper is able to aggregate contact details for bulk mailing lists, robo calls, or malicious social engineering attempts. This is one of the primary methods both spammers and scammers use to find new targets.
How is web scraping mitigated?
Typically, all content a website visitor is able to see must be transferred onto the visitor’s machine, and any information a visitor is able to access can be scraped by a bot.
Efforts can be made to limit the amount of web scraping that can occur. Here are 3 methods of limiting exposure to data scraping efforts:
- Rate limit requests - for a human visitor clicking through a series of webpages on a website, the speed of interaction with the website is fairly predictable; you’ll never have a human browsing 100 webpages a second, for example. Computers, on the other hand, can make requests orders of magnitude faster than a human, and novice data scrapers may use unthrottled scraping techniques to attempt to scrape an entire website very quickly. By rate limiting the maximum number of requests a particular IP address is able to make over a given window of time, websites are able to protect themselves from exploitative requests and limit the amount of data scraping that can occur within a certain window.
- Modify HTML markup at regular intervals - data scraping bots rely on consistent formatting in order to effectively traverse website content and parse out and save useful data. One method of interrupting this workflow is to regularly change elements of the HTML markup so that consistent scraping becomes more complicated. By nesting HTML elements, or changing other aspects of the markup, simple data scraping efforts will be hindered or thwarted. For some websites, each time a webpage is rendered, some form of content protection modifications are randomized and implemented. Other websites will change up their markup code occasionally to prevent longer-term data scraping efforts.
- Use CAPTCHAs for high-volume requesters - in addition to using a rate limiting solution, another useful step in slowing content scrapers is the requirement that a website visitor answers a challenge that’s difficult for a computer to surmount. While a human can reasonably answer the challenge, a headless browser* engaging in data scraping most likely cannot, and certainly will not consistently across many instances of the challenge. However, constant CAPTCHA challenges can negatively impact the user experience.
Another less common method of mitigation calls for embedding content inside media objects like images. Because the content does not exist in a string of characters, copying the content is far more complex, requiring optical character recognition (OCR) to pull the data from an image file. But this can also hinder web users who need to copy content such as an address or phone number off a website instead of memorizing or retyping it.
*A headless browser is a type of web browser, much like Chrome or Firefox, but it doesn’t have a visual user interface by default, allowing it to move much faster than a typical web browser. By essentially running at the level of a command line, a headless browser is able to avoid rendering entire web applications. Data scrapers write bots that use headless browsers to request data more quickly, as there is no human viewing each page being scraped.
How is web scraping stopped completely?
The only way to totally stop web scraping is to avoid putting content on a website entirely. However, using an advanced bot management solution can help websites eliminate access for scraper bots almost completely.
What is the difference between data scraping and data crawling?
Crawling refers to the process large search engines like Google undertake when they send their robot crawlers, such as Googlebot, out into the network to index Internet content. Scraping, on the other hand, is typically structured specifically to extract data from a particular website.
Here are 3 of the practices a scraper bot will engage in that are different from a web crawler bot’s behavior:
- Scraper bots will pretend to be web browsers, while a crawler bot will indicate its purpose and not attempt to trick a website into thinking it’s something it is not.
- Sometimes scrapers will take advanced actions like filling out forms, or otherwise engaging in behaviors to reach a certain part of the website. Crawlers will not.
- Scrapers typically have no regard for the robots.txt file, which is a text file containing information specifically designed to tell web crawlers what data to parse and what areas of the site to avoid. Because a scraper is designed to pull specific content, it may be designed to pull content explicitly marked to be ignored.
Cloudflare Bot Management uses machine learning and behavioral analysis to identify malicious bots such as scrapersprotecting unique content and preventing bots from abusing a web property.