AI crawlers -- what are they and why are they a problem? [Q&A]
Organizations have grappled with business threats posed by various automated bots and crawlers over the years. The latest flavor to take the spotlight is AI crawlers, which scrape proprietary content to feed the AI systems they serve.
We spoke to Eyal Benishti, CEO of IRONSCALES, to discuss AI crawlers and why it's important for security teams to establish boundaries for their use.
BN: What exactly are AI crawlers (what do they do and how do they work)?
EB: From a technological perspective, 'AI crawlers' are very similar to 'web crawlers,' which have been around for quite some time and play a vital role in how popular search engines like Google and Bing work. Web crawlers systematically explore webpages to understand the content of each page on a website, allowing for the indexing, updating, and retrieval of this information by search engines. In essence, they scan the web for us so search engines can then point us in the right direction when we submit a query.
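To make that loop concrete, here is a deliberately minimal sketch in Python (standard library only) of the fetch-parse-queue cycle a basic crawler runs; the seed URL, page limit, and class and function names are purely illustrative and do not correspond to any particular search engine's or AI vendor's crawler.

# Minimal illustrative crawler loop: fetch a page, record its text for
# indexing, and queue the links it contains. Seed URL and limit are examples.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class PageParser(HTMLParser):
    """Collects link targets and visible text from a single HTML page."""
    def __init__(self):
        super().__init__()
        self.links, self.text = [], []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += [v for k, v in attrs if k == "href" and v]
    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

def crawl(seed_url, max_pages=10):
    """Breadth-first crawl; returns a simple index of URL -> page text."""
    index, seen, queue = {}, {seed_url}, deque([seed_url])
    while queue and len(index) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        except Exception:
            continue  # skip pages that fail to download
        parser = PageParser()
        parser.feed(html)
        index[url] = " ".join(parser.text)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return index

A production crawler layers politeness controls (honoring robots.txt, rate limiting), deduplication, and far more sophisticated parsing on top of this, but the core cycle of fetching pages, extracting their content for an index, and queuing the links they contain is the same.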
Where AI crawlers differ is in the breadth of their capabilities and their purpose. AI crawlers are designed to collect and process data from a variety of different sources, including databases, documents, APIs, and other repositories. AI crawlers may also have additional functionality that web crawlers lack, such as semantic analysis, natural language processing, and data extraction. Ultimately, this is because the information scraped by AI crawlers is used not to index the web but to enhance the capabilities of AI systems -- it serves as training data for a variety of AI applications, including popular generative AI tools such as ChatGPT, which use it to improve performance and broaden their knowledge bases.
BN: What security risks do AI crawlers pose to organizations?
EB: Given the enormous scope of information collected by AI crawlers, their use has raised a number of ethical and practical considerations among today’s organizations. These include everything from privacy concerns to potentially biased data collection.
Looking at these tools from a business perspective, the security risks are substantial. First and foremost, the overwhelming majority of businesses today have almost no insight into what types of information are actually being collected by AI crawlers. As a result, sensitive or personally identifiable information that is collected and processed could end up in training data, and the LLMs trained on it could unwittingly expose that information to the public. AI crawlers can also inflict damage in more indirect ways. For example, AI crawlers that inadvertently collect inaccurate or unverified information from the web can lead the LLMs they train to spread misinformation about an organization or ascribe falsehoods to it. Ultimately, without knowing what data is being collected and how it's being used, it's hard to know with any degree of certainty how an organization might be affected.
BN: What do cybersecurity and IT teams need to be aware of regarding AI crawlers and why should they take this threat seriously?
EB: The same unintentional consequences of AI crawlers can become intentional tools of compromise in the hands of cybercriminals. Apart from the potential for intellectual property theft, the sensitive and personally identifiable information these crawlers collect can be used to train LLMs that are then turned against an organization for a variety of social engineering purposes.
One such application is the ability to generate incredibly convincing phishing or spear-phishing emails, and to do so at a pace and scale previously unachievable. With so much information about an organization and its employees consumed by AI crawlers and synthesized by public-facing generative AI tools, bad actors can easily enlist their aid in the research and development of social engineering attacks.
With websites like theorg.com and LinkedIn providing an unprecedented amount of visibility into companies' organizational and personnel details, and AI crawlers scraping all of it and then some, it's not hard to imagine a malicious actor using generative AI tools to quickly and effectively research and craft highly targeted spear-phishing or business email compromise (BEC) emails.
BN: What can organizations do to mitigate the risks associated with AI crawlers?
EB: In the absence of explicit legal or regulatory guidelines dictating the utilization of copyrighted material by AI, websites must, for the time being, take matters into their own hands and create their own boundaries. Thankfully, the owners of some AI crawlers, such as OpenAI, have been responsive to organizations' concerns and have publicized the steps site owners can take to block their crawlers. As a result, as of late 2023, nearly 26 percent of the top 1,000 most visited sites on the Web -- including Amazon.com, TheNewYorkTimes.com, and Reuters.com -- had blocked OpenAI's crawler, GPTBot, from accessing their domains. Moreover, the same approach that blocks one AI crawler typically works for any other crawler that honors the robots.txt convention, and a fairly straightforward pair of directives is usually sufficient (e.g. 'User-agent: GPTBot' followed by 'Disallow: /'), as illustrated below.
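For illustration, a robots.txt file that blocks OpenAI's GPTBot from an entire site, and does the same for Common Crawl's CCBot (whose archives are widely used as AI training data), would look something like this; which crawlers to list is each site's own decision:

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

It's worth remembering that robots.txt is an honor system: reputable operators have said their crawlers respect it, but it does not technically prevent access by crawlers that choose to ignore it.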
BN: How do you see this situation evolving in the near future, for example as it relates to policy, copyright law, organizational security, etc.?
EB: Unfortunately, this is yet another example of our society’s technological capabilities far outpacing its legal system. However, thanks to mounting pressure from rights holders, IP lawyers, and other interests, legislators around the world are beginning to take a much harder look at where training data fits into our current understanding of copyrights and intellectual property.
The fundamental question that must be addressed is whether the collection and processing of data for the purposes of training AI models constitutes fair use or a copyright infringement (or something else entirely). To say the question doesn't fit neatly into our current legal framework would be an understatement. That's why we can be almost certain that new laws and regulations will need to be passed.
Thus far, lawmakers in the EU are leading the charge on this front with their landmark AI Act, which was approved by the EU's member states on February 2nd, 2024. The law now needs final sign-off from the European Parliament, which is expected to happen in April, before going into effect in 2026. The law takes a risk-based approach, placing AI systems into tiers according to their sensitivity and potential for harm. Most consumer-facing, general-purpose systems fall into the lighter-touch tiers but will still be subject to transparency requirements, including the need to publicly summarize the content used to train them and to demonstrate that they respect copyright law.
This call for transparency is undoubtedly a step in the right direction. Only time will tell if it is, in fact, sufficient in and of itself to safeguard organizations against the inherent risks associated with bulk data collection for AI training.
Image credit: weerapat/depositphotos.com