Ethical web scraping and data rights [Q&A]

Web scraping, automatically harvesting and extracting data from websites, can be a useful tool for businesses to learn about their customers.

But it's easy to fall into the trap of harvesting data just because it's there, leading to information overload, not to mention privacy concerns for the consumer. To find out more about web scraping and how it can be used in an ethical way, we spoke to Neil Emeigh, founder and CEO of Rayobyte.

BN: What is ethical scraping and how is it being used to collect consumer data?

NE: Believe it or not, web scraping is something all of us do every day. You can even do it without software. If you're a social media user who regularly checks in on the number of likes your posts get, or someone selling a product who regularly checks the prices of their competitors, you are in effect scraping, because you are collecting specific real-time data from a public website.

Now let's say you're an agency managing a hundred social media accounts, or an eCommerce seller with thousands of competing products. It would obviously take you far too long to observe and collect all that information yourself, and by the time you did, it would be out of date. That's why most of us opt to use a piece of software to find that information for us. This is called 'scraping' because the software scrapes the information you're REALLY looking for -- let's say price data -- from a page with a lot of other information you're not interested in.
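To make that idea concrete, here is a minimal sketch in Python of software pulling only the price data out of a page full of other content. The sample page, the "price" class name and the function names are assumptions made up for illustration; real sites are marked up differently, and this is not a description of any particular scraping product.

```python
from html.parser import HTMLParser

# Hypothetical product page. The markup and the "price" class are
# assumptions for this example, not taken from any real site.
SAMPLE_PAGE = """
<html><body>
  <h1>Widget Store</h1>
  <p>Free shipping on orders over $50!</p>
  <div class="product"><span class="name">Blue Widget</span><span class="price">$19.99</span></div>
  <div class="product"><span class="name">Red Widget</span><span class="price">$24.50</span></div>
  <footer>Contact us for bulk pricing.</footer>
</body></html>
"""


class PriceScraper(HTMLParser):
    """Collects the text of every <span class="price"> and ignores the rest."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "span" and ("class", "price") in attrs:
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        # Only text that sits inside a price span is kept.
        if self._in_price:
            self.prices.append(float(data.strip().lstrip("$")))


def scrape_prices(html):
    """Return the prices found in the given HTML as a list of floats."""
    parser = PriceScraper()
    parser.feed(html)
    return parser.prices


if __name__ == "__main__":
    print(scrape_prices(SAMPLE_PAGE))
```

Note that the "$50" in the shipping banner is skipped, because it does not sit inside a price element. That is the sense in which the software scrapes only the data you are after.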

Even if you're not scraping directly, chances are that anyone with a business today is relying on scraping in some form. The big SEO tools scrape information from search engine result pages, social book and movie review apps pull information from databases to make sure they have the most extensive list of titles, and scraping is even the engine that powers all search engine results! So as you can see, scraping has existed for years and it's not going anywhere soon.

The question of ethics relates to two factors. Firstly: usage. Are you only scraping publicly-available data that's non-identifiable and free for anyone to use? Are you following all local laws about data collection? And secondly: the ethics of the scraping tools themselves. This point gets a bit technical.

All scrapers require proxy IP addresses, which is what my company sells. That's because when most websites detect a scraping bot, they will ban that bot's IP address. So to scrape millions of pages effectively, you need a large number of IP addresses -- ideally, IP addresses that are associated with a real internet service provider, or better yet a real user. Many of my fellow proxy providers have, at various times in the industry's history, sourced proxies without the knowledge of those real users, and without compensating them. A lot of 'proxy networks' are actually advanced botnets, obtained illegally and/or used to collect personal private data about consumers.
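As a rough illustration of why a pool of addresses matters, the sketch below hands out proxies in turn and retires any that a site has banned. The ProxyPool class and its method names are hypothetical, the addresses come from a range reserved for documentation, and no network requests are made; it assumes the proxies themselves were sourced with their owners' knowledge and consent.

```python
from collections import deque


class ProxyPool:
    """Hands out proxies round-robin and drops any that have been banned."""

    def __init__(self, proxies):
        self._pool = deque(proxies)

    def next_proxy(self):
        """Return the next proxy in rotation."""
        if not self._pool:
            raise RuntimeError("no usable proxies left")
        proxy = self._pool[0]
        self._pool.rotate(-1)  # move the proxy just used to the back of the queue
        return proxy

    def mark_banned(self, proxy):
        """Remove a proxy from rotation, e.g. after a site blocks it."""
        try:
            self._pool.remove(proxy)
        except ValueError:
            pass  # already removed or never in the pool

    def __len__(self):
        return len(self._pool)


if __name__ == "__main__":
    # Placeholder addresses from the 203.0.113.0/24 documentation range.
    pool = ProxyPool(["203.0.113.10:8080", "203.0.113.11:8080", "203.0.113.12:8080"])
    print(pool.next_proxy())
    pool.mark_banned("203.0.113.12:8080")
    print(len(pool))
```

With only a handful of addresses, a few bans empty the pool quickly, which is why large-scale scraping depends on a large supply of proxies.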

So 'ethical scraping' is really about enforcing the ethical usage and acquisition of proxies.

BN: Is web scraping legal and do you expect the activity to become regulated?

NE: This is a complicated question, one I usually find easiest to answer with a comparison. Web scrapers and the proxies that power them are tools, so let's consider another tool: the humble hammer. It is legal to buy and sell hammers. There are many wonderful, legal uses for hammers -- you could build furniture for you and your family, or shelter for a neighbor. On the other hand, you can also use a hammer to hurt or kill someone, which is of course illegal.

So yes, the existence of scrapers is -- in most circumstances, in most places on earth -- quite legal. But the exact lines for how it's legal to use them are being drawn as we speak, by cases like hiQ Labs v. LinkedIn here in the United States, or consumer privacy regulations in states like California, Colorado, and Virginia. It is the duty of scrapers like myself to ensure that my company -- and our clients -- are complying with the law at all times.

I'll admit that this is something that concerns me -- our industry doesn't exist in a vacuum, and data privacy has, quite rightly, become a major public talking point as of late. If the proxy industry can’t get the stink of unethical behavior off of us, we're going to see a lot more moves to regulate our current modes of operation from both the public and private sectors. That's part of why I'm talking to you, and to interviewers like you, to try to help people understand that these technologies have a useful and necessary side as well as the more well-known sketchy use cases.

BN: From the consumer side what can people do to ensure their data is safe and protected?

NE: From ethical scrapers like myself -- who I truly believe make up the majority of our industry -- you're safe from anything that's hidden behind a login. Our tools cannot be used to get your credit card information, your password, etc.

So if you have information that you're worried about being scraped, the safest thing you can do is simple: don't post it! This is common sense, but we should all think very carefully about what we post online. By now I assume most of us know that if you post your phone number on your website, some spam caller's going to find it, or that if you post something embarrassing on Twitter it will exist in screenshot form forever. I myself don't have any personal social media accounts, which is its own form of security.

As for protecting yourself from unethical scrapers who are trying to find personal information, you're talking about hackers at that point -- so the advice is the same as it is for any other kind of malicious attack. Enforce secure passwords across your organization, hire a good security team, restrict access to sensitive information, that sort of thing. Don't use the same password across all your sites. And if you're a site owner who doesn't want to be scraped, put that in your website's terms of service. It obviously won't stop somebody who's really committed to scraping, but it will give you a legal recourse if and when that should happen.

BN: How can web scraping be made less intrusive?

NE: Again, the key in my opinion is to only scrape public information: the data that people themselves put out into a public space.

I also do not feel that personally-identifiable information is ever really necessary -- and I think this is a common misconception many people have about data collection. Our customers are interested in sifting through huge volumes of business data, not the personal browsing habits of Joe Whoever.
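One way to act on that principle is to drop personal fields before scraped data is ever stored. The following is a minimal sketch under stated assumptions: the PII_FIELDS list, the strip_pii function and the sample record are all hypothetical, and a real list of personal fields would depend on the dataset and on local law.

```python
import re

# Assumption: field names treated as personally identifiable in this example.
PII_FIELDS = {"name", "email", "phone", "address"}

# Simple pattern for email addresses embedded in free text (illustrative, not exhaustive).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def strip_pii(record):
    """Return a copy of record without PII fields and with emails in text redacted."""
    cleaned = {}
    for key, value in record.items():
        if key.lower() in PII_FIELDS:
            continue  # drop the whole field
        if isinstance(value, str):
            value = EMAIL_RE.sub("[redacted]", value)
        cleaned[key] = value
    return cleaned


if __name__ == "__main__":
    scraped = {
        "product": "Blue Widget",
        "price": 19.99,
        "email": "joe@example.com",
        "review": "Great widget, contact joe@example.com",
    }
    print(strip_pii(scraped))
```

The business data (product and price) survives, while the personal details of Joe Whoever never reach the dataset.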

Image credit: deyangeorgiev2/depositphotos.com


© 1998-2024 BetaNews, Inc. All Rights Reserved.