What is data scraping?

Malicious actors can use web scraping tools to access data much more quickly than intended. As a result, the data may end up being used for unauthorized purposes.

Learning objectives

After reading this article you will be able to:

  • Define data scraping
  • Explain the purposes of data scraping
  • Understand methods of mitigating data scraping
  • Differentiate between data scraping and data crawling


What is data scraping?

Data scraping, in its most general form, refers to a technique in which a computer program extracts data from output generated from another program. Data scraping is commonly manifested in web scraping, the process of using an application to extract valuable information from a website.


What are the different types of web scraping? Why scrape website data?

Scrapers can be designed for many different purposes, such as:

  1. Content scraping - a website’s content is pulled in order to replicate the unique advantage of a particular product or service that relies on content. Take a restaurant review site, for instance; a competitor could scrape all the reviews, then reproduce the content on their own website, pretending the content is original (and reaping the benefits).
  2. Price scraping - by scraping pricing data, a business can aggregate pricing information about its competitors. This can give it a unique advantage, namely the ability to undercut competitors and take their business.
  3. Contact scraping - many websites contain email addresses and phone numbers in plaintext. By scraping pages such as online employee directories, a scraper can aggregate contact details for bulk mailing lists, robocalls, or malicious social engineering attempts. This is one of the primary methods both spammers and scammers use to find new targets.

What is the difference between data scraping and data crawling?

Crawling refers to the process large search engines like Google undertake when they send their robot crawlers, such as Googlebot, out into the network to index Internet content. Scraping, on the other hand, is typically structured specifically to extract data from a particular website.

Here are three differences in behavioral practice between scraper bots and web crawler bots:

  1. Honesty/transparency - a scraper bot will pretend to be a web browser to get past any efforts to block scrapers, while a crawler bot will indicate its purpose and won’t attempt to trick a website into thinking the crawler is something it’s not.
  2. Advanced maneuvers - a scraper bot can take advanced actions such as filling out forms in order to access gated information, while a crawler bot will not try to access gated parts of a website.
  3. Respecting robots.txt - a scraper bot typically has no regard for robots.txt, meaning it can pull content explicitly against the website owner’s wishes, while a crawler bot respects robots.txt, abiding by the website owner’s wishes about what data to parse and what areas of the website to avoid.
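To make the robots.txt distinction concrete, here is a minimal sketch of the check a well-behaved crawler performs before fetching a page, using Python’s standard-library robotparser module; the user agent string and URLs are illustrative placeholders, not references to any real crawler.

```python
from urllib import robotparser

# A well-behaved crawler consults robots.txt before fetching any page.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # download and parse the site's robots.txt

# Only fetch the page if the site owner allows this user agent to do so.
if rp.can_fetch("ExampleCrawler", "https://example.com/reviews"):
    print("Allowed: fetch and index the page")
else:
    print("Disallowed: skip this page")
```

A scraper bot, by contrast, would typically skip this check entirely and request the page regardless of what robots.txt says.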

How are websites scraped?

The process of web scraping is fairly simple, though the implementation can be complex. We can summarize the process in three steps (a minimal sketch follows the list):

  1. First, the piece of code used to pull the information (the scraper bot) sends an HTTP GET request to a specific website.
  2. When the website responds, the scraper parses the HTML document for a specific pattern of data.
  3. Once extracted, the data is converted into whatever specific format the scraper’s author designed.
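As a concrete illustration of these three steps, here is a minimal scraper sketch using only Python’s standard library; the URL, the choice to target <h2> elements, and the output file name are assumptions made for this example, not part of any particular scraping workflow.

```python
import json
import urllib.request
from html.parser import HTMLParser

# Step 1: send an HTTP GET request to a specific website.
url = "https://example.com/products"
with urllib.request.urlopen(url) as response:
    html = response.read().decode("utf-8")

# Step 2: parse the HTML document for a specific pattern of data.
# Here we collect the text of every <h2> element as a stand-in for
# whatever data the scraper's author is actually targeting.
class HeadingScraper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2 and data.strip():
            self.headings.append(data.strip())

parser = HeadingScraper()
parser.feed(html)

# Step 3: convert the extracted data into the author's chosen format,
# in this case a JSON file.
with open("scraped.json", "w") as f:
    json.dump(parser.headings, f, indent=2)
```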

Typically, companies do not want their unique content to be downloaded and reused for unauthorized purposes, so they often avoid exposing all of their data via a consumable API or other easily accessible resource. Scraper bots, on the other hand, are interested in getting website data regardless of any attempt at limiting access. As a result, a cat-and-mouse game exists between web scraping bots and various content protection strategies, with each trying to outmaneuver the other.

How is web scraping mitigated?

Smart scraping strategies require smart mitigation strategies. Methods of limiting exposure to data scraping efforts include the following:

  1. Rate limit requests - for a human visitor clicking through a series of webpages on a website, the speed of interaction with the website is fairly predictable; you’ll never have a human browsing 100 webpages a second, for example. Computers, on the other hand, can make requests orders of magnitude faster than a human, and novice data scrapers may use unthrottled scraping techniques to attempt to scrape an entire website very quickly. By rate limiting the maximum number of requests a particular IP address can make over a given window of time, websites are able to protect themselves from exploitative requests and limit the amount of data scraping that can occur within that window (a minimal sketch follows this list).
  2. Modify HTML markup at regular intervals - data scraping bots rely on consistent formatting in order to effectively traverse website content and parse out data. One method of interrupting this workflow is to regularly change elements of the HTML markup. By nesting HTML elements, or changing other aspects of the markup, simple data scraping efforts will be hindered or thwarted. For instance, some websites will randomize some form of content protection modification every single time a webpage is rendered; others may update their front-end every few weeks to prevent longer-term data scraping efforts (see the second sketch below).
  3. Use challenges for high-volume requesters - another useful step in slowing content scrapers is requiring website visitors to answer a challenge that’s difficult for a computer to surmount. While a human can reasonably answer the challenge, a headless browser* most likely can’t, certainly not across many instances of the challenge.
  4. Embed content in media objects - a less common mitigation method is embedding content inside media objects such as images. Because the content does not exist as a string of characters, copying it is far more complex, requiring optical character recognition (OCR) to pull the data out of an image file.

*A headless browser is a type of web browser, much like Chrome or Firefox, but it doesn’t have a visual user interface by default, allowing it to move much faster than a typical web browser. By essentially running at the level of a command line, a headless browser is able to avoid rendering entire web applications. Data scrapers write bots that use headless browsers to request data more quickly, as there is no human viewing each page being scraped.
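To make the rate-limiting approach in item 1 concrete, here is a minimal sliding-window limiter sketch in Python; the per-IP keying, the 100-requests-per-minute threshold, and the allow_request function name are assumptions for illustration. A production deployment would enforce this in shared infrastructure (for example, at the proxy or CDN layer) rather than in in-process memory.

```python
import time
from collections import defaultdict, deque

MAX_REQUESTS = 100    # assumed policy: at most 100 requests...
WINDOW_SECONDS = 60   # ...per 60-second window, per IP address

_request_log = defaultdict(deque)  # ip -> timestamps of recent requests

def allow_request(ip: str) -> bool:
    """Return True if this IP is under the limit, False to reject."""
    now = time.monotonic()
    window = _request_log[ip]

    # Drop timestamps that have aged out of the current window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()

    if len(window) >= MAX_REQUESTS:
        return False  # over the limit: reject (e.g., with HTTP 429)

    window.append(now)
    return True

# A burst of 150 rapid requests from one IP: only the first 100 pass.
allowed = sum(allow_request("203.0.113.7") for _ in range(150))
print(f"{allowed} of 150 requests allowed")
```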
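Item 2 can likewise be sketched in a few lines. The example below randomizes CSS class names on each deployment so scrapers cannot key off stable selectors; the class names, the string-replacement approach, and the randomize_markup helper are hypothetical simplifications (a real implementation would rewrite the markup and the corresponding stylesheets together, typically in the build pipeline or templating layer).

```python
import secrets

# Map stable, guessable class names to per-deployment random tokens.
CLASS_MAP = {
    "price": f"c-{secrets.token_hex(4)}",
    "review": f"c-{secrets.token_hex(4)}",
}

def randomize_markup(html: str) -> str:
    """Replace stable class names with their randomized equivalents."""
    for stable, randomized in CLASS_MAP.items():
        html = html.replace(f'class="{stable}"', f'class="{randomized}"')
    return html

print(randomize_markup('<span class="price">$12.99</span>'))
# e.g. <span class="c-3fa91c2d">$12.99</span>
```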

How can web scraping be stopped completely?

The only way to guarantee a full stop to web scraping is to stop putting content on a website entirely. However, using an advanced bot management solution can help websites eliminate access for scraper bots.

Protect against scraping attacks with Cloudflare

Cloudflare Bot Management uses machine learning and behavioral analysis to identify malicious scraping activity, protecting unique content and preventing bots from abusing a web property. Similarly, Super Bot Fight Mode is designed to help smaller organizations defend against scrapers and other malicious bot activity, while giving them more visibility into their bot traffic.