Is web scraping legal? A short guide on scraping under EU law

Even though web scraping is ubiquitous, its legal status remains unclear, because whether a given scraping activity is legal depends on many factors. Currently, web scraping is not per se prohibited in the European Union, but the use of data mining tools is legally risky.

So, before you jump into your daily dose of scraping, read this guide to save yourself from some nasty situations.

Please note that this article is for informational purposes only and that any information contained below does not constitute legal advice. Therefore, before engaging in any scraping activities, you should get appropriate professional legal advice regarding your specific situation.

I. INTELLECTUAL PROPERTY BREACH

First of all, web scraping could infringe on intellectual property rights.

I know: you’re scraping to collect some hot data, not to copy art of doubtful value. However, I kid you not, even such ‘simple’ data collection could result in an intellectual property infringement.

Firstly, you could violate an author’s rights in a copyrighted text by reproducing even a single sentence (if it is sufficiently original). See the judgment of the Court of Justice of the European Union of 16 July 2009, C-5/08, Infopaq International A/S v. Danske Dagblades Forening, which found that even an excerpt of 11 words might be protected (see in particular paragraph 47).

Secondly, in the EU we have laws (Directive 96/9/EC) that provide specific protection with regard to databases. To avoid problems connected with such protection, take into account the following factors.

1) Check if the database from which you’re scraping is protected and what it means for you.

Copyright protection

Basically, a database is copyrighted if the structure of the database is an original intellectual creation.

Clear? No? I know, it isn’t clear to the lawyers either.

The Court of Justice of the European Union comes to clarify the situation (a little bit). According to the CJEU judgment of 1 March 2012, Case C-604/10, Football Dataco Ltd vs. Yahoo! UK Ltd: the ‘criterion of originality is satisfied when, through the selection or arrangement of the data which it contains, its author expresses his creative ability in an original manner by making free and creative choices (see, by analogy, Infopaq International, paragraph 45; Bezpečnostní softwarová asociace, paragraph 50; and Painer, paragraph 89) and thus stamps his “personal touch” (Painer, paragraph 92). By contrast, that criterion is not satisfied when the setting up of the database is dictated by technical considerations, rules or constraints which leave no room for creative freedom’.

What does it mean? Databases that are regular and technical, laid out simply to present specific data as clearly as possible (‘dictated by technical considerations’) and not organized creatively, shouldn’t be copyrighted. So many databases probably won’t be copyrighted. However, it’s extremely difficult to guess what a court would say about a given work’s originality. Therefore, it’s actually safer to assume the database from which you’re scraping is copyrighted.

How does it influence scraping? As copyright protects the structure and organization of the database (and not the data included therein), your scraping must not result in copying and, for example, republishing the original database’s structure (or a substantial part of it). There is also good news: even if the database is copyrighted, you may be allowed to copy it (and use it for your own purposes, not for republishing or selling) if your actions fall under the TDM exception (see the second point below).

Sui generis protection

Even if the database isn’t original, it still may be protected. Chapter III of the Directive 96/9/EC granted a ‘sui generis’ protection to the EU “maker of a database which shows that there has been qualitatively and/or quantitatively a substantial investment in either the obtaining, verification or presentation of the contents”. In particular, the database creator is entitled to prevent extraction and/or re-utilization of the whole or a substantial part, evaluated qualitatively and/or quantitatively, of the contents of that database (see Article 7 of the mentioned Directive).

Therefore, the sui generis database right protects the content of a database. If the requisites are met, the protection is granted automatically for 15 years starting either from the creation date or from when the database was first made publicly available.

What does it mean for web scrapers? That you can scrape such data (and, therefore, copy and collect contents of the protected database – which falls under the definition of “extraction” under the analyzed Directive) as long as (a) you don’t scrape a ‘substantial part, evaluated qualitatively and/or quantitatively, of the contents of that database’ and you don’t re-use it (meaning basically selling or publishing it); or (b) scraping falls under TDM exception described below; or (c) you’ve received an appropriate licence.

If you’d like to have an example of what NOT to do, check out the 77m Ltd v Ordnance Survey Ltd case.

2) Check if you fall under the Text and Data Mining (‘TDM’) exception

If the EU protection of databases seems discouraging, note that Europe recognizes the importance of data mining activities, which often involve extracting data from the Internet (web scraping). In fact, the Old Continent considers itself a ‘data-driven economy’ and acknowledges the current ‘data revolution’. Therefore, the EU wishes to protect data mining and, by extension, scraping that doesn’t harm others (too much). Accordingly, the new law that all EU countries had to implement by 7 June 2021 (Directive (EU) 2019/790 on copyright and related rights in the Digital Single Market or ‘DSM Directive’) provides, in its Article 4, an exception from the rights of the database owner mentioned above for ‘reproductions and extractions of lawfully accessible works and other subject matter for the purposes of text and data mining’, on condition that ‘the use of works and other subject matter referred to in that paragraph has not been expressly reserved by their rightholders in an appropriate manner, such as machine-readable means in the case of content made publicly available online’.

It may sound confusing so let me break it down for you.

Generally, the new DSM Directive allows for scraping (reproduction and extraction) of data from the databases for the purpose of text and data mining (defined in Recital 8 of the DSM Directive as “automated computational analysis of information in digital form”) even if they are granted copyright or sui generis protection. That doesn’t mean you will be allowed to further publish or sell such data – that still might be illegitimate if you violate the rights of the database owners.

However, the TDM exception is limited: database owners are granted the possibility to restrict the reproduction and extraction of their databases and content. That restriction must be expressed in a way that bots, crawlers and similar tools can detect (for example, a machine-readable signal on the website telling visiting scraping programs that scraping is prohibited). In any case, such a restriction cannot block scraping carried out for scientific research purposes (see art. 3(1) and 7(1) of the DSM Directive).
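
To give an idea of what checking a machine-readable reservation could look like in practice, here is a minimal Python sketch. It assumes that the website expresses its reservation through a robots.txt file, which is only one possible ‘machine-readable means’ (the DSM Directive does not name a specific format). The sample robots.txt content, the bot name example-tdm-bot and the example.com URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a real download.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /members/

User-agent: example-tdm-bot
Disallow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text does not disallow this fetch."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(may_fetch(SAMPLE_ROBOTS_TXT, "other-bot", "https://example.com/public/page"))
    print(may_fetch(SAMPLE_ROBOTS_TXT, "other-bot", "https://example.com/members/list"))
    print(may_fetch(SAMPLE_ROBOTS_TXT, "example-tdm-bot", "https://example.com/public/page"))
```

A real scraper would first download the site’s actual robots.txt and run a check like this before every request. Whether a robots.txt rule counts as a valid reservation in a specific case remains a legal question that this sketch does not answer.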

Therefore, unless you take a lot of data/structure and later republish it or sell it etc., there is a great chance you won’t violate any intellectual property rights.

II. CONTRACT BREACH

But there are more traps on your way. One of them is the possibility of breaching the website’s Terms of Use if they prohibit web scraping.

Should you, therefore, verify the terms and conditions of hundreds or thousands of websites? Yeah… that hardly sounds practical. The point is that, to make you liable for such a breach, the website owner would have to prove that the contract was validly concluded (or that the Terms of Use were validly accepted by the user). That may be problematic, as there is neither legislation nor a single, clear line of case law determining when a website’s terms and conditions are binding on its users.

The question gets even more complicated in the case of web scraping activities. Specifically, scraping is carried out primarily by computer programs, not by humans. It is therefore unclear whether, in such a case, the terms and conditions may be considered accepted by anyone.

I think we can find a hint as to the approach of the EU legislator in the previous section regarding the TDM exception. We’ve seen that scraping will be allowed (in the context of database protection) unless such a possibility is ‘reserved by their rightholders in an appropriate manner, such as machine-readable means in the case of content made publicly available online’. Therefore, in my opinion, it’s highly doubtful that a website’s Terms of Use, written for human readers, will be considered “appropriate”. On the other hand, you could probably be considered in breach of such Terms of Use if the prohibition of web scraping activities is communicated appropriately, in a manner readable by computer programs.

As the situation is highly uncertain, it is advisable to be careful and, if possible, to avoid breaching terms of use made available in any form.

III. PRIVACY CONCERNS

(to apply when you scrape personal data of EU/EEA residents)

Web scrapers that scrape personal data also need to pay attention to privacy laws. In the European Union that would be, in particular, the General Data Protection Regulation (GDPR). Contrary to common belief, the GDPR does not prohibit data scraping. However, you must follow its rules while mining personal data on the Internet.

1) Lawful grounds for the processing

Collecting data, forming a database and then using the data for any purpose all count as data processing. You cannot process any personal data of EU residents without legal grounds to do so. You may, therefore, need to obtain the consent of the person whose data you’re going to process, to be performing a task carried out in the public interest or, as a controller or even as a third party, to have a legitimate interest in processing the data, where such processing is necessary to achieve that interest (see art. 6 of GDPR).

Let’s see in order if any of these grounds may apply to you.

1. Consent

The fact that Internet users make their data public does not mean they have given everyone on the Internet consent to process such data. Therefore, you should not think: ‘oh, these Facebook users published their data without anyone forcing them to do so. They are not only giving their consent for others to process such data but they are basically begging the world to do that’. No. You still need either to obtain express consent or to find other legitimate grounds for processing.

Objectively, if you’re scraping data from the web, you just want to collect data automatically and you won’t ask for the consent of thousands of Internet users – it would block the whole operation. Therefore, probably, the scraping of personal data will need to be justified by one of the following circumstances.

2. Public interest

According to Article 6 of GDPR, personal data processing is legitimate when “necessary for the performance of a task carried out in the public interest or the exercise of official authority vested in the controller”. I will assume you’re not a public authority. Therefore, the more common public interest could apply to you if you’re scraping personal data for purposes such as journalism, scientific research, art and literature.

In any case, the processing of personal data must always remain balanced, taking into account the overriding interests of the data subjects. Therefore, in certain cases, you could, for example, use such data to contact people whose data you scraped to get some information. On the other hand, it would rarely be justified to, for example, publish such scraped personal data, even in the fields mentioned above, without receiving the appropriate consent of the data subject.

3. Legitimate interest

The GDPR does not specify what exactly the legitimate interest of a data controller means. However, in Recital 47 it gives certain clues, stating that such legitimate interest may be a ground for processing provided that ‘the interests or the fundamental rights and freedoms of the data subject are not overriding, taking into consideration the reasonable expectations of data subjects based on their relationship with the controller’.

Therefore, taking into account Article 6 and Recital 47 of the GDPR, to use the legitimate interest grounds we must balance the following factors: (i) the legitimate interest of the data scraper; (ii) the necessity of the processing to achieve that interest; (iii) the interests of data subjects to have their overriding fundamental rights and freedoms protected; and (iv) reasonable expectations of the data subject.

In that context, the legitimate interest justifying data processing could occur for example in the event of analyzing employees’ or clients’ data; intra-group administrative transfers… but also (wait for it) in case of direct marketing! Yes, I’m serious.

Direct marketing without the consent of the data subject is generally avoided in the EU because it is normally believed to be illegitimate under the GDPR. It seems the authorities tend to forget that Recital 47 of GDPR states no more and no less than the following: ‘the processing of personal data for direct marketing purposes may be regarded as carried out for a legitimate interest’. Therefore, if we carry out a balancing exercise and find that a web scraper has a legitimate interest in promoting its services through cold calls or cold emails, that direct marketing is necessary to achieve its goals, that such processing doesn’t override the fundamental rights of data subjects and that the data subjects could reasonably expect their data to be used to contact them, then, by all means, direct marketing should be fully legitimate.

Of course, the reasonable expectations of the data subject might be a problem here. The main problem is the fact that many users do not really THINK while making their data public, so it’s hard to expect they have ANY expectations. In any case, the reasonable expectations of the users could be, to a certain extent, inferred from the character of the websites. If you make your email publicly visible on websites such as Facebook or your blog, you somehow invite people to contact you. Therefore, it seems possible to assume that data subjects, by making certain data public, expect to a certain extent that these data may be used to identify and to contact them. Furthermore, it may be useful to take into account the terms and conditions of the website on which the data were made public: the users may have certain specific expectations based on such terms and conditions as to how their data will be processed. And such reasonable expectations, if contrary to your activity, may make your processing illegitimate.

Attention! Notwithstanding the considerations made above, many believe GDPR prohibits web scraping for direct marketing entirely unless explicit consent is given. In particular, the prevailing argument is that such use misses the ‘taking into consideration the reasonable expectations of data subjects’ requisite. One of the sources denying the possibility of direct marketing based on scraped data is the French Data Protection Authority’s (CNIL) guidance.

Therefore, even though it is not, in my opinion, legally prohibited in all situations, you may want to avoid explaining these arguments to the authorities that may see the situation differently.

And what about publishing or selling the databases created from the extracted data? Well, the legitimate interest ground is unlikely to apply to those. We must not forget that any processing that is not strictly linked to pursuing the legitimate interest, and that is not necessary for pursuing it, will be unlawful. Therefore, for example, organizing data in a database to pursue the legitimate interest of the controller may be lawful. On the other hand, making such a database public, in my opinion, exceeds the legitimate interest grounds. In particular, making the database public is likely unnecessary to pursue the legitimate interest and, more importantly, makes it easy for persons without such an interest to abuse the rights of the data subject. Consequently, “the interests or the fundamental rights and freedoms of the data subject” are “overriding” (see Article 6(1)(f) and Recital 47 of GDPR). Similarly, it is very doubtful that scraping to sell a database to others may be considered a balanced, legitimate ground.

2) Information obligations

No, that’s not all. There are other things you must consider when scraping personal data. One of them is the rather broad information obligations of controllers.

If you’re processing personal data without consent, you need to inform the data subject about such processing and her or his rights, UNLESS it requires disproportionate effort (see Article 14 of the GDPR).

Therefore:

If you’re processing information about data subjects in order to contact them, you should provide all required information regarding the processing and the data subject’s rights in the first communication (invoking the disproportionate effort exception would seem quite absurd). Please remember that one of the most important rights of the data subject is the right to OPT OUT of the processing. Consequently, you not only need to inform the data subject about that right but you also need to make it possible to stop the processing.

On the other hand, if you’re processing the data without communicating with the data subjects, in certain cases you might be able to invoke the disproportionate effort exception. In that context, the European Data Protection Board, in its Transparency guidelines, stated that ‘you should carry out a balancing exercise to assess the effort involved for the data controller to provide the information to the data subject against the impact and effects on the data subject if he or she was not provided with the information.’

In any case, be very careful when relying on exemptions. For example, the high operational cost of providing the information to the data subjects may not be a justifiable reason for all authorities (it wasn’t for the Polish Data Protection Authority; that decision was later only partially overturned by the court, which in any case confirmed that financial difficulties cannot count as disproportionate effort: https://iapp.org/news/a/polish-court-overturns-dpas-first-gdpr-fine/).
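
On the technical side, making an opt-out effective could be as simple as keeping a suppression list and checking it before every contact. Below is a minimal Python sketch of that idea. The class and function names are hypothetical, and storing only a hash of the address is a design choice of this example, not something the GDPR prescribes.

```python
import hashlib


def fingerprint(email: str) -> str:
    """Normalize an email address and return its SHA-256 hex digest."""
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class SuppressionList:
    """Hypothetical in-memory list of people who opted out of processing."""

    def __init__(self) -> None:
        self._opted_out = set()

    def opt_out(self, email: str) -> None:
        # Record the opt-out so the address is never contacted again.
        self._opted_out.add(fingerprint(email))

    def is_suppressed(self, email: str) -> bool:
        return fingerprint(email) in self._opted_out

    def filter_contacts(self, emails):
        # Keep only the addresses that have not opted out, preserving order.
        return [e for e in emails if not self.is_suppressed(e)]


if __name__ == "__main__":
    suppression = SuppressionList()
    suppression.opt_out(" Jane.Doe@Example.com ")
    print(suppression.filter_contacts(["jane.doe@example.com", "john@example.org"]))
```

In a real project the list would live in persistent storage, and the same check would run before any further processing of that person’s data, not only before sending messages.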

If you determine a disproportionate effort could apply in your situation, you should at least publish the privacy information and carry out a data protection impact assessment (see below).

3) Performing Data Protection Impact Assessments (“DPIA”)

GDPR in Article 35 provides that ‘where a type of processing in particular using new technologies, and taking into account the nature, scope, context and purposes of the processing, is likely to result in a high risk to the rights and freedoms of natural persons, the controller shall, prior to the processing, carry out an assessment of the impact of the envisaged processing operations on the protection of personal data’.

Therefore, a DPIA is a process aimed at helping you identify and minimise the data protection risks of a project. A DPIA is required, for example, in case of a systematic and extensive evaluation of the personal aspects of an individual, including profiling; in case of processing of sensitive data on a large scale; and in the event of systematic monitoring of public areas on a large scale. But that is not all. It could also apply to web scraping, especially where large volumes of personal data are processed. The Working Party on the Protection of Individuals with Regard to the Processing of Personal Data prepared guidelines that can help you with the DPIA. They advise that, if in doubt, it’s better to carry out a DPIA.

IV. HOW TO SCRAPE AND NOT BE SUED?

As mentioned, the EU sees itself as a data-driven economy, and some even speak of a fourth industrial revolution. It’s not in the interest of the EU Member States to block web scraping. On the other hand, certain fundamental rights cannot be overridden in the process. Consequently, businesses wishing to take advantage of web scraping should do their best to comply with the laws and, maybe even more importantly, act in good faith, following best practices when it comes to scraping.

To minimize concerns, scrape discreetly, respect websites’ terms of service, and check whether sites use the robots.txt protocol to communicate that scraping is prohibited. Avoid scraping personal data and, where it is necessary, make sure you don’t violate the GDPR. Never scrape private or classified information. If possible, it is advisable to get a licence for scraping.

Furthermore, if you’re commissioning web scraping for your company, it could also be useful to make sure your service provider complies with all applicable laws, has obtained the necessary licences and is ready to indemnify you against any harm resulting from third parties’ claims.
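
One practical way to avoid collecting personal data you don’t need is to strip it from scraped text before storing anything. The Python sketch below does this for email addresses only. The regular expression is a deliberately simple approximation (it will not catch every valid address), and the function name and placeholder text are hypothetical.

```python
import re

# Simplified pattern for typical email addresses; an assumption of this
# example, not a complete implementation of the email address syntax.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str, placeholder: str = "[email removed]") -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_PATTERN.sub(placeholder, text)


if __name__ == "__main__":
    scraped = "Contact jane.doe@example.com for details."
    print(redact_emails(scraped))
```

The same approach could be extended to other identifiers such as phone numbers, but names and other free-text personal data are much harder to detect automatically, so a filter like this reduces the risk rather than removing it.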

Finally, one last piece of advice (warning: dad joke!): mine your own business! (pun intended)