What is data scraping?
Data scraping, in its most general form, refers to a technique in which a computer program extracts data from output generated by another program. Data scraping most commonly takes the form of web scraping, the process of using an application to extract valuable information from a website.
Why scrape website data?
Typically companies do not want their unique content to be downloaded and reused for unauthorized purposes. As a result, they don’t expose all data via a consumable API or other easily accessible resource. Scrapers, on the other hand, are interested in getting website data regardless of any attempt at limiting access. As a result, a cat-and-mouse game exists between web scraping and various content protection strategies, with each trying to outmaneuver the other.
The process of web scraping is fairly simple, though the implementation can be complex. Web scraping occurs in 3 steps:
- First, the piece of code used to pull the information, which we call a scraper, sends an HTTP GET request to a specific website.
- When the website responds, the scraper parses the HTML document for a specific pattern of data.
- Once the data is extracted, it is converted into whatever specific format the scraper’s author designed.
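The three steps above can be sketched with Python's standard library. This is a minimal illustration, not a production scraper: the `price` class name, the `example-scraper/0.1` User-Agent string, the sample HTML, and the JSON output format are all assumptions chosen for the example. The network step is defined but not called, so the snippet runs offline.

```python
import json
from html.parser import HTMLParser
from typing import List
from urllib.request import Request, urlopen


def fetch_html(url: str) -> str:
    """Step 1: send an HTTP GET request and return the response body as text."""
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class PriceParser(HTMLParser):
    """Step 2: collect the text of every <span class="price"> element."""

    def __init__(self) -> None:
        super().__init__()
        self.prices: List[str] = []
        self._in_price = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples.
        if tag == "span" and ("class", "price") in attrs:
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())


def extract_prices(html: str) -> List[str]:
    parser = PriceParser()
    parser.feed(html)
    parser.close()
    return parser.prices


def to_json(prices: List[str]) -> str:
    """Step 3: convert the extracted values into the chosen output format."""
    return json.dumps({"prices": prices})


# Hypothetical page content standing in for a real HTTP response.
SAMPLE_HTML = (
    '<ul><li><span class="price">$10</span></li>'
    '<li><span class="price">$25</span></li></ul>'
)
print(to_json(extract_prices(SAMPLE_HTML)))
```

Against a live site, `fetch_html(url)` would supply the string passed to `extract_prices`. Note how tightly the parser depends on the exact tag and class name. That dependence is what makes scrapers sensitive to markup changes.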
Scrapers can be designed for many purposes, such as:
- Content scraping - content can be pulled from a website in order to replicate the unique advantage of a particular product or service that relies on content. For example, a product like Yelp relies on reviews; a competitor could scrape all the review content from Yelp and reproduce it on their own site, pretending the content is original.
- Price scraping - by scraping pricing data, competitors are able to aggregate information about their competition. This can allow them to formulate a unique advantage.
- Contact scraping - a lot of websites contain email addresses and phone numbers in plaintext. By scraping locations like an online employee directory, a scraper is able to aggregate contact details for bulk mailing lists, robo calls, or malicious social engineering attempts. This is one of the primary methods both spammers and scammers use to find new targets.
How is web scraping mitigated?
The reality is that there is no way to stop web scraping; given enough time, a resourceful web scraper can extract an entire public-facing website, page by page. This is because any information visible inside a web browser must be downloaded in order to be rendered. In other words, all content a visitor is able to see must be transferred onto the visitor’s machine, and any information a visitor is able to access can be scraped.
Efforts can be made to limit the amount of web scraping that can occur. The primary methods of limiting exposure to data scraping efforts include:
- Rate limit requests - for a human visitor clicking through a series of webpages on a website, the speed of interaction with the website is fairly predictable; you’ll never have a human browsing 100 webpages a second, for example. Computers, on the other hand, can make requests orders of magnitude faster than a human, and novice data scrapers may use unthrottled scraping techniques to attempt to scrape an entire website very quickly. By rate limiting the maximum number of requests a particular IP address is able to make over a given window of time, websites are able to protect themselves from exploitative requests and limit the amount of data scraping that can occur within a certain window.
- Modify HTML markup at regular intervals - data scraping software relies on consistent formatting in order to effectively traverse website content and parse out and save useful data. One method of interrupting this workflow is to regularly change elements of the HTML markup so that consistent scraping becomes more complicated. Nesting HTML elements, or changing other aspects of the markup, hinders or thwarts simple data scraping efforts. Some websites randomize and apply content protection modifications each time a webpage is rendered, while others change their markup occasionally to prevent longer-term data scraping efforts.
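As a rough illustration of the rate limiting method described above, the sketch below caps how many requests a single IP address may make within a sliding time window. The class name `SlidingWindowRateLimiter`, the limit of 3 requests per second, and the IP addresses (taken from documentation-reserved ranges) are all assumptions made for the example. A real deployment would typically enforce this at a proxy or firewall rather than in application code.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowRateLimiter:
    """Allow at most max_requests per IP address within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Maps each IP address to the timestamps of its recent allowed requests.
        self._hits = defaultdict(deque)

    def allow(self, ip: str, now: Optional[float] = None) -> bool:
        """Return True if the request may proceed, False if it is rate limited."""
        if now is None:
            now = time.monotonic()
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


# Simulate one client sending five requests 100 ms apart.
limiter = SlidingWindowRateLimiter(max_requests=3, window_seconds=1.0)
decisions = [limiter.allow("203.0.113.7", now=0.1 * i) for i in range(5)]
print(decisions)  # [True, True, True, False, False]
```

The `now` parameter exists only so the example can be driven with fixed timestamps; in normal use it is omitted and the monotonic clock is used. Rejected requests are not recorded, so a blocked client regains access as soon as its oldest allowed request ages out of the window.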
Another less common method of mitigation calls for embedding content inside media objects like images. Because the content does not exist as a string of characters, copying the content is far more complex, requiring optical character recognition (OCR) to pull the data from an image file. This can also hinder web users who need to copy content such as an address or phone number off a website instead of memorizing or retyping it.
*A headless browser is a type of web browser much like Chrome or Firefox, but it doesn’t have a visual user interface by default, allowing it to move much faster than a typical web browser. By essentially running at the level of a command line, a headless browser is able to avoid rendering entire web applications. Data scrapers write scripts to use headless browsers to request data more quickly, as there is no human viewing each page being scraped.
How is web scraping stopped completely?
The only way to stop web scraping of content is to avoid putting content on a website entirely. More realistic methods include hiding important content behind user authentication, where it’s easier to track users and highlight nefarious behavior.
What is the difference between data scraping and data crawling?
Crawling essentially refers to the process large search engines like Google undertake when they send their robot crawlers, such as Googlebot, out into the network to index Internet content. Scraping, on the other hand, is typically structured specifically to extract data from a particular website.
Here are 3 of the practices a scraper will engage in that are different from a web crawler’s behavior:
- Scrapers will pretend to be web browsers, whereas a crawler will indicate its purpose and not attempt to trick a website into thinking it’s something it is not.
- Sometimes scrapers will take advanced actions like filling out forms, or otherwise engaging in behaviors to reach a certain part of the website. Crawlers will not.
- Scrapers typically have no regard for the robots.txt file, which is a text file containing information specifically designed to tell web crawlers what data to parse and what areas of the site to avoid. Because a scraper is designed to pull specific content, it may be designed to pull content explicitly marked to be ignored.
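To make the robots.txt behavior concrete, the sketch below shows how a well-behaved crawler might check a URL against a site's rules before requesting it, using Python's standard `urllib.robotparser` module. The rules, the `ExampleCrawler/1.0` user agent, and the `example.com` URLs are hypothetical. A scraper that disregards robots.txt simply skips this check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# A crawler identifies itself honestly instead of imitating a web browser.
USER_AGENT = "ExampleCrawler/1.0"


def build_parser(robots_txt: str) -> RobotFileParser:
    """Load robots.txt rules from a string rather than over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
for url in (
    "https://example.com/public/index.html",
    "https://example.com/private/data.html",
):
    # can_fetch reports whether the rules permit this user agent to request the URL.
    print(url, parser.can_fetch(USER_AGENT, url))
```

Nothing enforces the result of `can_fetch`: robots.txt is advisory, which is why it deters crawlers that choose to honor it but not scrapers that do not.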
Cloudflare’s WAF can help rate limit and filter out scrapers, protecting unique content and preventing bots from abusing a web property.