How digital DNA can identify developers from their code [Q&A]


Recent high-profile attacks have placed renewed emphasis on the software supply chain and the need to understand where code originates.
A developer's coding style is as unique as their fingerprint and, thanks to artificial intelligence, it's possible to identify an author from just a short code segment. Felix Mächtle, a researcher at the University of Lübeck and a member of the AI Grid research network, has developed a tool that does exactly that. We spoke to him to learn more.
BN: Why is it so important to be able to identify the origin of code changes?
FM: Securing critical infrastructure against supply chain attacks is more important than ever. In today's digital landscape, even minimal code changes can open the door to significant vulnerabilities. Consider incidents like the SolarWinds breach, where a single change led to a cascade of security failures affecting thousands of organizations. By comparing the style fingerprints of software updates to their previous versions, our system can quickly highlight any modifications from an unknown author. If an update has significant stylistic deviations from its established baseline, it can be flagged as potentially compromised. This early detection is essential for securing critical systems in sectors such as energy, transportation, and healthcare, where even small changes can have far-reaching consequences.
Our ability to attribute code to specific developers or threat actors plays a critical role not only in proactive defense, but also in forensic investigations. When similar stylistic fingerprints are identified across multiple cyber incidents, law enforcement can link disparate attacks and build a clearer picture of a broader threat landscape. This traceability is essential in legal proceedings, as it provides concrete evidence to support criminal investigations and prosecutions. In essence, reliably identifying the source of code changes is the foundation upon which robust cybersecurity measures are built. It enables us to isolate, analyze, and ultimately neutralize threats, ensuring that the digital infrastructure we depend on remains secure and trustworthy.
BN: There have been previous attempts to 'fingerprint' source code. How is your approach different?
FM: Traditional code fingerprinting methods typically rely on direct analysis of source code and often require a repository of known samples from each developer. While this approach works well when source code is available, most software applications depend on compiled programs. These dependencies are often optimized during translation to machine code, which can obscure stylistic nuances. As a result, when a dependency is only available as a binary and its author has not been included in the training data of the AI model, traditional methods fall short.
Our approach with OCEAN (Open-World Contrastive Authorship Identification) breaks this mold. Instead of relying on source code, which is often unavailable, we use contrastive learning techniques that compare pairs of code fragments to determine whether they share the same underlying stylistic features. This method works even when the fragments are in machine code. Such a so-called 'open world' methodology means that our system isn't limited to a predefined set of known developers. It can make informed attributions even when it encounters entirely new or unknown coding styles. In essence, while previous attempts have been limited by the need for human-readable source code or known authors, our solution extends fingerprinting capabilities into the more realistic and challenging realm of compiled and optimized code.
BN: How would this work in the real world?
FM: In practice, our method can be integrated into various cybersecurity frameworks to continuously monitor software updates and code repositories. For example, in critical infrastructure sectors such as energy, transportation, or healthcare, the system can automatically flag any code changes that deviate from the development team's established stylistic baseline. If changes in a routine update exhibit stylistic anomalies, an alert is triggered for further investigation, potentially catching unauthorized changes or malicious injections before they cause damage. Autonomous vulnerability detection systems could then prioritize these areas, or the vendor could be contacted to check whether the development team has changed. In essence, our system serves as an early warning mechanism that enables organizations to both detect and quickly respond to security breaches, thereby improving the overall integrity of their software ecosystems.
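As a sketch of that early-warning loop, the snippet below reuses the toy `embed` stand-in from the previous example to compare every function in a new release against the previous release's embeddings, flagging any without a close stylistic match. The threshold and all names are assumptions for illustration, not parameters of the real system:

```python
SIMILARITY_THRESHOLD = 0.8  # assumed cut-off; would be tuned per project

def flag_suspicious_functions(update_funcs: dict, baseline_funcs: list,
                              threshold: float = SIMILARITY_THRESHOLD):
    # Flag functions in a new release whose best stylistic similarity to
    # any function of the previous release falls below the threshold,
    # i.e. code that matches no established contributor's style.
    baseline = [embed(body) for body in baseline_funcs]
    alerts = []
    for name, body in update_funcs.items():
        best = max(float(embed(body) @ b) for b in baseline)
        if best < threshold:
            alerts.append((name, best))  # candidate for manual review
    return alerts

# e.g. alerts = flag_suspicious_functions(new_release_funcs, old_release_funcs)
```

In a real deployment such a check would run automatically on each release, with flagged functions routed to a human reviewer, as described above.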
In addition, law enforcement and digital forensics teams can use this technology to trace the source of cyberattacks and link disparate incidents by revealing that similar coding styles underlie them.
BN: What about open source projects where there may have been multiple contributors?
FM: In open source projects, the diversity of contributions from many developers can make attribution difficult. However, OCEAN is designed to work at the function level, allowing us to attribute individual functions to specific developers based on their distinctive coding styles -- even in a multi-contributor environment.
A key challenge for future research, though, is how to handle cases where multiple authors collaborate on a single function. Currently, OCEAN requires that at least 51 percent of a function be written by a single developer for reliable attribution, so true multi-author attribution within a single function is not yet supported. Overcoming this limitation will be a critical step in refining our approach for even more complex, collaborative coding environments.
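The 51 percent rule itself is easy to state precisely. This minimal sketch, using hypothetical per-line author data such as one might derive from git blame, attributes a function only when a single developer wrote a majority of it and abstains otherwise:

```python
from collections import Counter

def attributable_author(line_authors: list[str]) -> str | None:
    # Attribute the function only if one developer wrote at least 51% of
    # it (here approximated per line, e.g. from git blame); otherwise
    # abstain. Author names below are purely hypothetical.
    author, count = Counter(line_authors).most_common(1)[0]
    return author if count / len(line_authors) >= 0.51 else None

print(attributable_author(["alice"] * 7 + ["bob"] * 3))  # -> alice (70%)
print(attributable_author(["alice"] * 5 + ["bob"] * 5))  # -> None (no majority)
```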
BN: Is it possible to accurately identify code that’s been written by AI rather than humans?
FM: Yes, it should be possible. The core of our method relies on detecting subtle stylistic differences in code, differences that occur in both human-written and AI-generated software. By analyzing these nuances, OCEAN should be able to determine whether a piece of code was produced by a human developer or a generative AI. It is important to note, however, that we have not yet conducted extensive testing specifically on AI-generated code, as our primary focus has been on distinguishing human-written code.
Differentiating between human-written and AI-generated code could be particularly valuable for organizations and academic institutions. For example, enterprises could verify whether AI-generated code is being used in security-critical areas to ensure compliance with industry standards, while universities could confirm that students are completing their assignments independently rather than relying solely on AI tools.
The emergence of AI-generated code also raises important questions. Should its use be clearly labeled? What responsibilities do developers and companies have when incorporating or relying on such code? These considerations underscore the need for clear guidelines and policies to ensure that the use of AI in software development remains transparent and responsible.
Image credit: YAYImages/depositphotos.com