The United States Court of Appeals for the Ninth Circuit ruled April 18 that scraping publicly accessible data is legal and does not violate the Computer Fraud and Abuse Act, the statute that defines computer hacking under U.S. law, in the long-running legal battle between LinkedIn and hiQ Labs.
The decision is a major win for academics, researchers and archivists, and it may have important implications for other tech industries as well.
Web scraping itself sounds more complex than it actually is: it's the act of pulling data and code from a website so it can be put into a format that is more usable and readable. So long as the data being grabbed is publicly available, anyone can scrape it for their own use.
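To make that concrete, here is a minimal sketch in Python using the requests and BeautifulSoup libraries. The URL is hypothetical; any publicly accessible page with headings would work the same way.

```python
import requests
from bs4 import BeautifulSoup

# Fetch a public page. "example.com/articles" is a hypothetical URL
# standing in for any publicly accessible site.
response = requests.get("https://example.com/articles", timeout=10)
response.raise_for_status()  # stop early if the page isn't publicly reachable

# Parse the raw HTML and pull every second-level heading out as plain,
# readable text: the "more usable format" the article describes.
soup = BeautifulSoup(response.text, "html.parser")
for heading in soup.find_all("h2"):
    print(heading.get_text(strip=True))
```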
Beyond understanding what web scraping is, its legality matters to any data scientist or researcher who works with online information, and this recent ruling reaffirms that it is legal.
What counts as public versus private data online is a little different from how we might imagine it in person, and it is why LinkedIn sought to have web scraping declared illegal in the first place.
Websites both large and small often have what amounts to a digital gate. Like a gate you'd find on a fancy driveway, it secures the things a website doesn't want other people touching. What data is public or private is ultimately up to the website itself, though most private data is mundane: it keeps the site running and is meant to be touched only by its web developers.
Some websites let users decide whether their data is private or public. Private user data doesn't mean the website itself has no access to it; it means the general public shouldn't.
Public data, meanwhile, is necessary for the internet to function. Scraping is actually how Google and other web search engines work: they crawl the internet to see which sites are publicly available, then match those sites to what users search for.
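A toy crawler in that spirit might look like the sketch below. The seed URL is hypothetical, and the robots.txt check reflects the convention real crawlers follow when a site asks to be left alone.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

import requests
from bs4 import BeautifulSoup

seed = "https://example.com/"  # hypothetical starting point

# Real crawlers consult robots.txt before fetching a page.
robots = RobotFileParser()
robots.set_url(urljoin(seed, "/robots.txt"))
robots.read()

to_visit, seen = [seed], set()
while to_visit and len(seen) < 20:  # small cap to keep the sketch finite
    url = to_visit.pop()
    if url in seen or not robots.can_fetch("*", url):
        continue  # skip pages the site has asked crawlers to leave alone
    seen.add(url)
    page = requests.get(url, timeout=10)
    # Follow each link on the page to discover more public pages.
    for link in BeautifulSoup(page.text, "html.parser").find_all("a", href=True):
        target = urljoin(url, link["href"])
        if target.startswith(seed):  # stay on one site for this sketch
            to_visit.append(target)

print(f"Discovered {len(seen)} publicly reachable pages")
```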
Some data has to be public, such as profiles on social media sites. The details of someone’s profile, such as age or phone number, may be private data, but the profile name itself is often public so that users can find others online.
This is where LinkedIn comes in. HiQ Labs, a data analytics company, scrapes public data from LinkedIn profiles to compute metrics for its clients. These metrics identify which employees are most likely to quit and which are most likely to be targeted by recruiters.
LinkedIn sent hiQ a cease-and-desist letter, and in 2019, the court ruled that hiQ's scraping was not illegal and that forbidding the company from scraping web data would destroy its business. LinkedIn appealed, but was again turned down in favor of hiQ Labs. Like the 2019 decision, this ruling holds that any data that is public-facing online is available for anyone to use.
HiQ Labs isn't the first company to depend on public data from popular social media sites. Clearview AI, a company with questionable business practices, scrapes billions of social media profiles from the web to build its facial recognition database.
The practice isn't limited to businesses, either, and individuals who master these techniques are in demand: web scraping is part of the essential toolbox of any research analyst or data scientist.
Research analysts can be hired by businesses to scrape the web for product data and run the pricing analyses that help determine affordable prices for consumers. They can also scrape web posts, which is particularly valuable for publicly funded and academic research on important topics such as disinformation.
Data scientists do much more than simply scrape the web, but a major part of their role in the data ecosystem is creating and managing large datasets for others to digest. They can work inside a company, or in an educational setting, where they provide the data for research analysts to evaluate safely. Their focus is on the side of computers that users don't see, and they use web scraping tools to work out how computer systems could be improved or tweaked.
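A minimal sketch of that hand-off might look like the following: scraped pages are collected into a structured dataset and saved for analysts to work with. The URLs and column names here are hypothetical.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Scrape a few hypothetical public pages and record simple facts
# about each one in a structured row.
records = []
for url in ["https://example.com/posts/1", "https://example.com/posts/2"]:
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    records.append({
        "url": url,
        "title": soup.title.get_text(strip=True) if soup.title else "",
        "paragraphs": len(soup.find_all("p")),
    })

# Hand the cleaned, tabular data to research analysts as a CSV file.
pd.DataFrame(records).to_csv("scraped_posts.csv", index=False)
```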
Because of this, web scraping has become a vital tool for research and business alike. Without it, universities could not safely study how to prevent the spread of misinformation, and academic fields would be severely limited in what they could investigate.
It is up to websites themselves to handle their data responsibly, and to ensure that anything they don't want the world to see stays behind their pearly gates.