The trouble with open source research on the web


Every open source research project -- no matter how simple or complex -- starts with browsing the internet. But researchers should know that their identity can be obtained through a number of basic techniques, which could have consequences ranging from modified data to directed cyber attacks or worse.

Even the simplest of website visits will expose significant details about your location and your device, and pretty much any site you visit will drop code on your computer to track what you’re doing as you traverse the internet. Most of the time, this exchange is benign, but there can be times when content will be modified or attacks launched based on the identity of the user.


When Tim Berners-Lee released his building blocks for the modern web, they were designed for the academic research community. Like other initiatives of the time, web protocols (and the browsers to support them) were built to share information easily, not for privacy or security. To minimize or even prevent counter-surveillance while conducting open source research, it is important to understand how the underlying protocols exchange information when you visit a web page.

How your identity is broadcast for the web to see

Web server logs have been around as long as the web itself, and they store information that can reveal details about the visitor. Extracting those details can be done with command line tools that query DNS and other resources, with point-and-click website analytics services, or even with third-party marketing tools designed for converting anonymous website visitors into sales leads.
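To make the point about logs concrete, here is a minimal sketch of what a single access-log entry gives away. It assumes the server writes the common "combined" log layout; the sample line, the address and the URLs are made up for illustration.

```python
import re

# Pattern for the widely used "combined" access-log layout (an assumption;
# a given server may be configured to log different fields).
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return the visitor-revealing fields of one log line, or None."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None


# Made-up sample entry; 203.0.113.7 is a documentation-only address.
SAMPLE = (
    '203.0.113.7 - - [10/Oct/2016:13:55:36 +0900] '
    '"GET /reports/q3.html HTTP/1.1" 200 2326 '
    '"https://www.example.com/search?q=money+laundering" '
    '"Mozilla/5.0 (Windows NT 6.3; Win64; x64) Chrome/49.0.2623.87"'
)

if __name__ == "__main__":
    for field, value in parse_log_line(SAMPLE).items():
        print(f"{field:8} {value}")
```

Even this one line carries the visitor's IP address, the page that referred them and a description of their browser and operating system.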

When your browser connects to a website, your connection is logged by the web server, and the originating IP address is typically one assigned by your ISP. In organizations that use a "dirty line" to conduct investigations, too much traffic over that shared resource can in itself be enough to jeopardize a project. That IP address can also be a starting point for reverse DNS queries to reveal your organization’s identity.
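A reverse DNS query of the kind mentioned here takes only a few lines. The sketch below uses Python's standard library; the function names are illustrative, and whether a lookup returns anything depends entirely on how the address owner has published its records.

```python
import ipaddress
import socket


def ptr_name(ip):
    """Build the DNS name that a reverse (PTR) lookup queries for an IP."""
    return ipaddress.ip_address(ip).reverse_pointer


def reverse_lookup(ip):
    """Return the hostname registered for an IP, or None if there is none."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
        return hostname
    except OSError:
        # No PTR record, or the resolver could not be reached.
        return None
```

Calling `ptr_name("203.0.113.7")` yields `7.113.0.203.in-addr.arpa`, the name the resolver actually asks about. `reverse_lookup` needs network access, and for an office connection it may return a hostname containing the organization's own domain.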

Once your IP address has been logged, basic IP analysis can show:

  • Your ISP
  • Your geographic location
  • Any referral servers
  • Various administrative contacts associated with the record

Depending on how the ISP has configured its records and how it represents customer information in its DNS data, your domain name could be exposed. The exposure could come through an admin’s email address and phone number, an address for submitting network abuse reports or another technical contact in your organization.

Once your domain name has been identified, a broader set of queries against your organization’s DNS information can be run. These can be run through command line queries, or via any number of penetration-testing toolkits. With this information, additional tools can reveal more characteristics of your network.

In addition to the originating IP address, each web page request carries certain machine information, and scripts on the page can ask the browser for more. The web server needs to know the capabilities of your browser in order to deliver content that can be formatted properly for your computer. Between the request itself and what page scripts can read, a site can learn:

  • Machine type and OS version (e.g. Windows 8.1)
  • The browser and version number being used (e.g. Chrome 49)
  • Plugins available (e.g. Flash, Java v.x)
  • Language and fonts installed
  • Timezone of your machine (e.g. JST UTC/GMT+9:00)
  • Level of cookies enabled in the session
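As a rough illustration of the passive side of this list, the sketch below pulls the device-describing fields out of a set of request headers. It covers only what arrives in the headers themselves; items such as plugins, fonts and timezone are typically read by scripts running in the page. The function name and the sample headers are assumptions made for the example.

```python
def passive_fingerprint(headers):
    """Collect request headers that describe the visitor's device."""
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    # Accept-Language looks like "ja-JP,ja;q=0.9,en;q=0.5"; keep the tags only.
    languages = [
        part.split(";")[0].strip()
        for part in lowered.get("accept-language", "").split(",")
        if part.strip()
    ]
    return {
        "user_agent": lowered.get("user-agent", ""),
        "languages": languages,
        "cookies_sent": "cookie" in lowered,
    }


# Made-up headers for a Chrome 49 browser on Windows 8.1 (reported as NT 6.3).
SAMPLE_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) Chrome/49.0.2623.87",
    "Accept-Language": "ja-JP,ja;q=0.9,en;q=0.5",
}

if __name__ == "__main__":
    print(passive_fingerprint(SAMPLE_HEADERS))
```

Nothing here requires cooperation from the visitor; the browser volunteers these values with every request.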

Content requests might be honored only if they match specific devices or originate from a specific location. A fingerprint element or request for content formatted in a specific way -- like in a different language or for a particular display -- that doesn’t line up with the expectation of the web service may be blocked or poisoned. This potential disconnect between the digital fingerprint and the claimed location is one of the common ways researchers can be identified, especially when using a proxy. All it takes is for a single step to be skipped or tool to be misconfigured for the identity of the researcher to be revealed.
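The mismatch described above is easy to test for. The following sketch shows how a site operator might compare the country implied by an IP address with the timezone and languages the browser reports. The `EXPECTED` table and the function name are hypothetical and deliberately tiny; a real check would draw on geolocation and timezone databases.

```python
# Hypothetical expectations: country code -> (plausible UTC offsets in hours,
# plausible primary languages). Illustrative values only.
EXPECTED = {
    "JP": ({9}, {"ja"}),
    "DE": ({1, 2}, {"de"}),
    "US": ({-10, -9, -8, -7, -6, -5, -4}, {"en", "es"}),
}


def fingerprint_mismatches(ip_country, utc_offset_hours, languages):
    """List the ways a reported fingerprint disagrees with the IP's country."""
    expected = EXPECTED.get(ip_country)
    if expected is None:
        return []  # No expectation recorded for this country.
    offsets, expected_langs = expected
    problems = []
    if utc_offset_hours not in offsets:
        problems.append("timezone")
    # Compare primary subtags only, so "ja-JP" counts as "ja".
    primary = {tag.split("-")[0].lower() for tag in languages}
    if not primary & expected_langs:
        problems.append("language")
    return problems
```

A researcher in Tokyo who routes traffic through a German proxy but leaves the browser untouched would be flagged on both counts: `fingerprint_mismatches("DE", 9, ["ja-JP", "ja"])` returns `["timezone", "language"]`.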

By combining the domain information from the IP address and the computer information from the query, an analyst on the other side can start to build an extremely detailed and accurate profile on you. And once your identity has been revealed, the open source content you’re after may not be available anymore. Like a peephole in a door, if the host doesn’t like the look of the user making the request, they can refuse to open the door, blocking access to content.

How lax security can lead to compromised data

In addition to sharing information by default, browsers also have architectural flaws that expose users to potentially malicious code. When the browser connects to a web server, it downloads a blob of code to build the page elements. The browser’s rendering engine will then dutifully execute that code in order to present the interactive display of the page to the user.

That payload contains static elements that are harmless like text and images, but it can also include:

  • Active code like JavaScript or HTML5
  • Flash or other format-encoded audio/video content
  • Cookies and other trackers
  • Content cached on the local file system
  • Executable code in the header of rendered objects, such as PDF files

These elements may be innocuous and only used for displaying information or presenting the best web experience. But in today’s world of sophisticated web analytics and ad retargeting, they can persist on your machine as a way for online properties to build profiles of you as a user. Active scripts in a web page can conduct a deeper analysis of your device, even passing your machine name back to the website. With this, your adversary doesn’t even need your IP address to identify your organization. Tracking cookies written to the browser can also share search and browsing activity. For example, if you’re assessing a financial site to collect information, but also searching law enforcement resources on money laundering in another tab, that’s a tell that someone is conducting research. These types of payloads can jeopardize the mission.
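To see how a tracking cookie ties separate tabs together, consider a simplified sketch of the tracker's side. It assumes a third-party tracker embedded on several sites that receives its own cookie, plus the page being viewed, on every request. The cookie name `uid`, the sample IDs and the URLs are all hypothetical.

```python
from collections import defaultdict
from http.cookies import SimpleCookie


def build_profiles(tracker_requests):
    """Group the pages a tracker saw by the ID in its own cookie."""
    profiles = defaultdict(list)
    for cookie_header, page in tracker_requests:
        cookie = SimpleCookie()
        cookie.load(cookie_header)
        if "uid" in cookie:  # "uid" is a hypothetical cookie name.
            profiles[cookie["uid"].value].append(page)
    return dict(profiles)


# Made-up (Cookie header, page viewed) pairs as the tracker would receive them.
REQUESTS = [
    ("uid=a1b2c3; theme=dark", "https://bank.example/wire-transfers"),
    ("uid=a1b2c3", "https://agency.example/money-laundering-guidance"),
    ("uid=zz99", "https://news.example/sports"),
]

if __name__ == "__main__":
    for uid, pages in build_profiles(REQUESTS).items():
        print(uid, pages)
```

The same ID appearing against both the financial site and the law enforcement resource is the "tell" described above, and it holds regardless of which IP address each request came from.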

Or worse. Malicious code can be passed from a website to your device, and through standard browser APIs, reach into other parts of your computer and across your network. From ransomware to remote execution, these browser-delivered exploits have changed the tenor of accessing the web.

For the vast majority of users, plugins, browser fingerprints, cookies and active code enhance the user experience on the web. But the motivation of the open source researcher isn’t to have a good web experience; it is to collect accurate and unadulterated data. The integrity of a research project may depend on remaining anonymous, whether it is a criminal investigation or an academic study.

The cat and mouse activities between researchers and targets can be undermined by the very protocols that enable the access in the first place. It is important for researchers to understand how they may be exposed and to take appropriate measures to compensate.

Photo Credit: rvlsoft/Shutterstock

Scott Petry is Co-Founder and CEO of Authentic8. Prior to Authentic8, Scott founded Postini and served in a variety of C-level roles until its acquisition by Google in 2007. He served as Director of Product Management at Google until 2009. Prior to Postini, Scott was General Manager and Vice President of Cygnus Solutions (acquired by Red Hat), Director of Advanced Messaging Products at SkyTel, and a Product Manager at Apple Computer. He graduated from San Diego State University and was a member of the U.S. National Rowing Team, earning a bronze medal in the world championships.


© 1998-2022 BetaNews, Inc. All Rights Reserved.