Real-time web data -- a new source of competitive intelligence [Q&A]
Gathering real-time public web data for business intelligence is giving some companies a new competitive asset, but little information is available about the use cases for such data.
We spoke to Aleksandras Šulženko, product owner at Oxylabs.io, to learn more about how web data can be a valuable resource for enterprises.
BN: How do businesses employ real-time web intelligence?
AS: Public web data is used by a growing number of companies. For example, recent research by Oxylabs and Censuswide of over 1,000 key decision makers in financial services companies found that almost half of them (44 percent) plan to invest the most in web scraping in the coming years. This is no surprise, since a quarter (26 percent) of respondents said web scraping had the greatest impact on revenue compared to other data gathering methods.
Financial and ecommerce companies are the front runners in competitive web intelligence, but others are catching up, too. The internet offers a plethora of public data perfect for mining unique business insights and boosting decision-making and sales. One well-known use case is travel fare aggregation and comparison: services such as Skyscanner couldn't exist without web scraping technologies, and we wouldn't be able to catch those perfect flight deals, since it is simply impossible to monitor so many different airlines manually.
Ecommerce companies gather real-time price and competitor intelligence to optimize dynamic pricing and assortment or monitor the supply chain. You've probably noticed that prices on major marketplaces can change several times per day -- this is possible only with the help of public competitor intelligence. Financial and investment firms rely on unique insights derived from alternative data to find the most profitable investment opportunities. Marketing agencies gather public web intelligence, such as consumer sentiment data, to understand economic trends or buyer behavior and preferences.
There are many other use cases, including search ranking optimization, cybersecurity, illegal content detection, and anti-counterfeiting. Digitalization of both business and everyday life means that there’s data for almost anything scattered around the internet. It is publicly available to all of us; however, the volumes are so extreme that organizations trying to make sense of web data need state-of-the-art technologies to gather, clean, and process it.
BN: Gathering data at such a scale can require enormous resources. Do companies generally extract web data in-house or outsource it to third-party vendors?
AS: Some companies, such as cybersecurity firms that work with sensitive information, prefer to scrape data in-house. However, they need a robust proxy infrastructure to distribute requests and bypass geo-blocks and anti-scraping measures.
For businesses that need to gather public web data but lack the resources to do it in-house, ready-made scraping solutions are the most cost-effective choice. They should consider Scraper APIs designed for different targets, including search engines and major marketplaces, which allow gathering web data at scale with less coding.
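As a rough illustration of how such a service is typically consumed, the sketch below posts a query to a hypothetical scraper API endpoint and reads back parsed results; the endpoint, parameters, and credentials are placeholders, not any particular vendor's actual interface.

```python
import requests

# Hypothetical scraper-API endpoint and credentials -- placeholders only,
# not any specific vendor's real interface.
API_ENDPOINT = "https://scraper-api.example.com/v1/queries"
AUTH = ("username", "password")

payload = {
    "source": "ecommerce_search",      # which kind of target the service should handle
    "query": "wireless headphones",    # search term to gather results for
    "geo_location": "United States",   # request content as seen from this region
    "parse": True,                     # ask the service to return structured JSON
}

# The provider's infrastructure handles proxies, fingerprints, and retries;
# the client only sends a query and receives parsed results.
response = requests.post(API_ENDPOINT, json=payload, auth=AUTH, timeout=60)
response.raise_for_status()
print(response.json())
```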
Companies that gather web data in-house must overcome various technical difficulties that consume both time and money: managing a proxy infrastructure, running headless browsers, maintaining scraping and parsing pipelines that can break down due to constant changes in web page layouts, and generating custom fingerprints to bypass anti-scraping measures.
BN: What are the main challenges of gathering real-time web data?
AS: Gathering public web data is a challenging process in general. Firstly, to gather any web data, you will need to figure out what URLs you want to access. This can be done either by generating URLs (if they follow a certain pattern) or by crawling a site to figure out what URLs are present on it. Once you have the URLs, you may attempt to fetch the content from the web. The content will usually be in HTML format, so the next step is to parse the HTML into a simpler data structure, such as JSON or CSV, containing only the data points of interest. In the case of real-time data, complexity adds up as there is no room for error: the system must be up and running at all times.
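To make those steps concrete, here is a minimal sketch of the pipeline described above -- generating URLs, fetching the HTML, and parsing it into a simpler structure -- using the Python requests and BeautifulSoup libraries against a made-up site; a production system would add retries, proxies, and error handling on top of this.

```python
import json
import requests
from bs4 import BeautifulSoup

# Step 1: build the URL list. Here the URLs follow a simple pattern
# (page numbers); alternatively, a crawler would discover them.
urls = [f"https://example.com/products?page={page}" for page in range(1, 4)]

records = []
for url in urls:
    # Step 2: fetch the raw HTML content.
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Step 3: parse the HTML into just the data points of interest.
    soup = BeautifulSoup(response.text, "html.parser")
    for item in soup.select("div.product"):  # selector is site-specific
        records.append({
            "name": item.select_one("h2").get_text(strip=True),
            "price": item.select_one("span.price").get_text(strip=True),
        })

# The simplified output can then be stored as JSON (or CSV).
print(json.dumps(records, indent=2))
```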
One of the biggest challenges is gathering accurate data, as incorrect content can come in many different forms. Some scraping responses might seem genuine even though they contain CAPTCHAs or, even worse, false information from so-called honeypots. Websites can also track and block scrapers based on fingerprints, which include the IP address, HTTP headers, cookies, JavaScript fingerprint attributes, and other data.
Anti-scraping measures and browser fingerprinting are becoming increasingly sophisticated. To avoid unwanted interruptions, companies have to play with different parameter combinations for different sites, which again increases the complexity of their data gathering solution. Fortunately, assembling fingerprints that bypass a particular anti-scraping solution can be automated and optimized with the help of machine learning.
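As a simplified illustration of varying those parameter combinations, the sketch below rotates a couple of header values on every request; real fingerprinting covers far more attributes (TLS settings, cookies, JavaScript properties), and the values here are examples only.

```python
import random
import requests

# A small pool of example browser fingerprint parameters to rotate through.
# Real systems vary many more attributes than just these headers.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]
ACCEPT_LANGUAGES = ["en-US,en;q=0.9", "en-GB,en;q=0.8"]

def fetch_with_rotated_headers(url: str) -> requests.Response:
    # Assemble a slightly different header set for every request so that
    # consecutive requests do not share an identical fingerprint.
    headers = {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": random.choice(ACCEPT_LANGUAGES),
        "Accept": "text/html,application/xhtml+xml",
    }
    return requests.get(url, headers=headers, timeout=30)

response = fetch_with_rotated_headers("https://example.com/products")
print(response.status_code)
```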
However, getting blocked by an anti-scraping solution does not mean that web scraping is a bad or illegitimate activity. With anti-scraping measures, websites simply try to protect their servers from request overload and from irresponsible or malicious actors. Distinguishing those malicious actors from legitimate scrapers would be exceedingly difficult, so administrators often apply a blanket ban to both. Sometimes the data is locked behind location: many sites show different content in different countries. Yet if a company is collecting competitor intelligence, for example product prices, it needs to gather public data in various locations, which would be impossible without an extensive proxy network.
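To show why location matters in practice, here is a minimal sketch that fetches the same product page through proxies in different countries; the proxy addresses and the target URL are placeholders for whatever proxy pool and site a company actually works with.

```python
import requests

# Placeholder proxy endpoints, one per country -- in practice these would be
# entries from a commercial or self-managed proxy pool.
PROXIES_BY_COUNTRY = {
    "US": "http://user:pass@us-proxy.example.com:8000",
    "DE": "http://user:pass@de-proxy.example.com:8000",
}

url = "https://example.com/product/12345"

for country, proxy in PROXIES_BY_COUNTRY.items():
    # Route the request through a proxy located in the target country so the
    # site returns the localized price and availability for that region.
    response = requests.get(
        url,
        proxies={"http": proxy, "https": proxy},
        timeout=30,
    )
    print(country, response.status_code, len(response.text))
```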
When parsing data, the main challenge is adapting to the constant layout changes of the web pages. This requires constant maintenance of parsers -- a task that is not particularly difficult but highly time-consuming, especially if the company is scraping many different page types.
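One common way to soften that maintenance burden is to give the parser several candidate selectors for the same data point and treat a complete miss as a signal that the layout changed. The snippet below is a generic sketch of that idea, not any particular production parser.

```python
from bs4 import BeautifulSoup

# Candidate selectors for the same data point, covering old and new layouts.
# When none of them matches, the parser flags the page for maintenance.
PRICE_SELECTORS = ["span.price", "div.product-price", "meta[itemprop=price]"]

def extract_price(html: str) -> str | None:
    soup = BeautifulSoup(html, "html.parser")
    for selector in PRICE_SELECTORS:
        node = soup.select_one(selector)
        if node is not None:
            # Meta tags carry the value in an attribute; other tags in their text.
            return node.get("content") or node.get_text(strip=True)
    # No selector matched: the page layout has probably changed.
    return None

html = "<html><body><div class='product-price'>19.99 USD</div></body></html>"
print(extract_price(html))  # -> "19.99 USD"
```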
Another interesting challenge when gathering public data from ecommerce marketplaces is product mapping. Imagine a company that needs to gather prices and reviews of five different models of Samsung headphones. In different online marketplaces, such products can be listed in different departments and subcategories or have slightly different product names. This makes it difficult to track the same product across multiple ecommerce sites, even with the use of scraping.
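Product mapping often starts with fuzzy matching of listing titles across marketplaces. The snippet below uses Python's standard difflib to score title similarity -- a deliberately simplified stand-in for the more elaborate matching pipelines used in practice.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Normalized ratio between 0 and 1 of how similar two listing titles are.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

reference = "Samsung Galaxy Buds2 Pro Wireless Earbuds, Graphite"
candidates = [
    "Samsung Galaxy Buds 2 Pro (Graphite) - True Wireless Earbuds",
    "Samsung Galaxy Buds Live, Mystic Black",
    "Sony WF-1000XM4 Wireless Noise Cancelling Earbuds",
]

# Pick the listing whose title is most similar to the reference product.
best = max(candidates, key=lambda title: similarity(reference, title))
for title in candidates:
    print(f"{similarity(reference, title):.2f}  {title}")
print("Best match:", best)
```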
BN: Are there any use cases for employing alternative data beyond the business sector?
AS: Even among businesses, public web intelligence collection has started to gain traction only recently. NGOs, the public sector, and academia are still lagging behind, but the interest in public web data is growing there, too. There are ‘avant-garde’ players, such as the Bank of Japan, that do interesting social and economic research based on alternative data analysis. Academics in such fields as psychology have also started to uncover the benefits of web data, scraping public comments and forums for aggregated data to analyze human behavior.
Nonprofit organizations often have interesting research topics that allow employing web scraping technology for the common good.
BN: What will drive the web intelligence industry forward in the upcoming years?
AS: Without a doubt, ML and AI technologies. They allow automating recurring web scraping patterns, minimizing both the developers' workload and the risk of human error. Our Web Unblocker, for example, is based mainly on different ML algorithms that help perform complicated tasks such as proxy management, dynamic fingerprinting, and response recognition.
Interestingly, web scraping is also one of the main drivers behind AI and ML development. ML requires massive amounts of training data to improve algorithmic predictions and accuracy, and buying ready-made datasets from third-party providers is often not enough for modern ML systems. This is where publicly available web data helps, so the two fields reinforce each other.
Photo credit: Maksim Kabakou / Shutterstock