Data leaks and web scraping – the thin border between legal and illegal

A hacker fishing for data in others' cell phones

I. RECENT CASES OF ‘DATA LEAKS’ AND WEB SCRAPING

Management of one’s personal data has become more than just a subjective right. Personal data have developed into a currency with which you can pay for numerous smartphone applications and internet services. And that is not all: privacy laws aside, personal data may also be traded by others, legally or not. It is therefore no surprise that we so often hear about ‘data leaks’ or simply about data being scraped from various social media platforms. The two are rarely the same thing. But let’s start with examples.

COMB

You might have heard of what is probably the biggest data leak of all time, the so-called Compilation of Many Breaches (COMB). On 2 February 2021, more than 3.2 billion unique pairs of emails and passwords were leaked on a popular hacking forum. As the name suggests, it appears to be a compilation of old breaches rather than a new breach. It may be compared to the Breach Compilation of 2017, in which 1.4 billion credentials were leaked. In fact, like the latter, COMB is organized in alphabetical order and comes with the same scripts for querying emails and passwords.

COMB contains user credentials from, among others, Netflix, LinkedIn and Bitcoin. To a large extent the data are old (some date back to 2012). However, as we all know, we sometimes take a while to change that password we’ve had for 5 years. Therefore, the breach is still relevant.

Facebook

In early April 2021, a database with records of 533 million Facebook users from 106 countries was made available online. The database mainly contains information such as users’ full names, gender, location, phone numbers and, sometimes, e-mail addresses, marital status and other information from the users’ About me section. According to Insider, a Facebook spokesperson said that the data had been scraped through a vulnerability that the company patched in 2019. Here you can find the list of the countries affected.

LinkedIn

A few days after the Facebook leak, we heard about another one, involving LinkedIn. Specifically, an archive containing data of 500 million LinkedIn users was put up for sale on one of the popular hacker forums. As reported by CyberNews, the post author leaked another 2 million records as a proof-of-concept sample. In its statement, LinkedIn announced: “we have investigated an alleged set of LinkedIn data that has been posted for sale and have determined that it is actually an aggregation of data from a number of websites and companies. It does include publicly viewable member profile data that appears to have been scraped from LinkedIn. This was not a LinkedIn data breach, and no private member account data from LinkedIn was included in what we’ve been able to review”.

Clubhouse

More or less within the same week in which the Facebook and LinkedIn leaks were announced, we heard about yet another ‘data leak’, this time much smaller in terms of the number of users affected.

According to CyberNews, a SQL database containing 1.3 million Clubhouse user records was leaked for free on a popular hacker forum. The database contains users’ names, photo URLs, usernames, Twitter handles, Instagram handles, number of followers, and the number of people followed by the users. It seems, therefore, that the database does not contain information such as the users’ phone numbers and e-mail addresses.

Instagram/TikTok/YouTube

2021 has seen several major data ‘leaks’. However, the situation is in no way new. Massive databases with social media users’ personal data are made public every now and then. That also happened to nearly 235 million Instagram, TikTok and YouTube users when, on 1 August 2020, Comparitech researchers discovered an unsecured database containing their data. Comparitech reported that one in five records contained either a telephone number or an email address. In addition, every record included some, and sometimes all, of the following information: profile name, full real name, profile photo, account description, statistics about follower engagement, last post timestamp, age and gender.

II. ARE ALL OF THOSE REALLY ‘DATA LEAKS’?

Based on these situations, there are basically two types of personal data that are being made public or sold in aggregate: (i) data scraped from public accounts, made available by the users themselves; and (ii) data that were obtained through a data breach, usually by hackers. And although all of them are called ‘data leaks’, that name really suits only the second category.

A data leak (or leakage) may be defined as an unintentional (on the part of the data controller) disclosure of confidential or, in any case, secured data, to the public. It is also referred to as a data breach.

Consequently, the ‘data leak’ label suits only the first two situations mentioned above. The first one is the leakage of the COMB database (Compilation of Many Breaches). It clearly contains confidential data, specifically users’ passwords collected through numerous hacking events in the past. The other one is, to some extent, the leak of the database with records of 533 million Facebook users. The status of the data extracted from Facebook is ambiguous: Facebook has stated that the data had been scraped through a vulnerability that the company patched in 2019. It seems, therefore, that the data were exposed to scrapers (so there was no hacking), but not all of them had been made public by the users themselves.

On the other hand, the remaining databases contain information that cannot be classified as confidential or even private. In fact, all data contained therein were simply scraped from the users’ public accounts. Therefore, formally, they shouldn’t be called ‘leaks’.

If you aren’t familiar with the topic, web scraping (or web harvesting) means extracting data available on websites, usually through dedicated software (bots), and organizing them into structured files. Collecting data that are publicly available is far from “data leaking”, as the scraped data aren’t secret or otherwise hidden. The data scrapers “only” put together the data that anyone could access anyway.
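To make the idea more concrete, here is a minimal sketch in Python of what "extracting data and creating organized files" can look like. It is only an illustration under stated assumptions: the HTML snippet, the `profile`, `name` and `followers` class names, and the `ProfileParser` and `scrape_to_csv` helpers are all hypothetical and do not reflect how any of the platforms mentioned above are built or how the databases described here were actually produced. The sketch works on an inline string and makes no network requests.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page markup (an assumption for illustration only): each public
# profile is a <div class="profile"> holding a name and a follower count.
SAMPLE_HTML = """
<div class="profile"><span class="name">Alice Example</span><span class="followers">120</span></div>
<div class="profile"><span class="name">Bob Example</span><span class="followers">45</span></div>
"""


class ProfileParser(HTMLParser):
    """Collects the text of <span class="name"> and <span class="followers">."""

    FIELDS = ("name", "followers")

    def __init__(self):
        super().__init__()
        self.rows = []        # one dict per profile found
        self._current = None  # profile currently being filled in
        self._field = None    # field whose text we are currently inside

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "div" and css_class == "profile":
            self._current = {}
        elif tag == "span" and css_class in self.FIELDS and self._current is not None:
            self._field = css_class

    def handle_data(self, data):
        # Only keep text that sits inside one of the fields we care about.
        if self._field is not None:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "div" and self._current is not None:
            self.rows.append(self._current)
            self._current = None


def scrape_to_csv(html):
    """Turn the profile markup into CSV text (the 'organized file')."""
    parser = ProfileParser()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=ProfileParser.FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(parser.rows)
    return out.getvalue()


print(scrape_to_csv(SAMPLE_HTML))
```

The point of the sketch is that nothing in it breaks into anything: it reads markup that is already visible and rearranges it into rows and columns. What changes is the form of the data, not their availability.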

III. IS WEB SCRAPING BAD?

Can it bring any harm to anyone? Well… as always, it depends on how we use the tools we’ve obtained.

Data scraping isn’t bad in itself. It may be used for noble, or at least harmless, purposes: scientific research, journalism, market analysis etc. Furthermore, if it is done carefully and in compliance with the applicable laws, it can even be legal (see: Is web scraping legal?).

However, the events described above are a different story. As indicated by Paul Bischoff, Comparitech editor, “even though the data is publicly accessible, the fact that it was leaked in aggregate as a well-structured database makes it much more valuable than each profile would be in isolation”.

Indeed, such data collected into one database may be used to the detriment of the users. For example, the extracted data may be used for spam marketing and phishing scams. Furthermore, the data scraped from social media profiles could be used to create fake profiles that attract followers and promote scams. Finally, the users’ photos could also be used for face recognition purposes (see for example the Clearview AI case here [https://www.buzzfeednews.com/article/ryanmac/clearview-ai-fbi-ice-global-law-enforcement]).

To check whether your data have been scraped, you may try this link.

The remedies applied so far by social media platforms and governments all over the world are rather disappointing. On the other hand, maybe it would be useful if we all became more accountable for our actions and, in particular, for the data we make public.