How digital DNA can identify developers from their code [Q&A]


Recent high-profile attacks have placed renewed emphasis on the software supply chain and the need to understand where code originates.
A developer's coding style is as unique as their fingerprint and, thanks to artificial intelligence, it's possible to identify an author from just a short code segment. Felix Mächtle, a researcher at the University of Lübeck and a member of the AI Grid research network, has developed a tool that does exactly that. We spoke to him to learn more.
BN: Why is it so important to be able to identify the origin of code changes?
FM: Securing critical infrastructure against supply chain attacks is more important than ever. In today's digital landscape, even minimal code changes can open the door to significant vulnerabilities. Consider incidents like the SolarWinds breach, where a single change led to a cascade of security failures affecting thousands of organizations. By comparing the style fingerprints of software updates to their previous versions, our system can quickly highlight any modifications from an unknown author. If an update has significant stylistic deviations from its established baseline, it can be flagged as potentially compromised. This early detection is essential for securing critical systems in sectors such as energy, transportation, and healthcare, where even small changes can have far-reaching consequences.
Our ability to attribute code to specific developers or threat actors plays a critical role not only in proactive defense, but also in forensic investigations. When similar stylistic fingerprints are identified across multiple cyber incidents, law enforcement can link disparate attacks and build a clearer picture of a broader threat landscape. This traceability is essential in legal proceedings, as it provides concrete evidence to support criminal investigations and prosecutions. In essence, reliably identifying the source of code changes is the foundation upon which robust cybersecurity measures are built. It enables us to isolate, analyze, and ultimately neutralize threats, ensuring that the digital infrastructure we depend on remains secure and trustworthy.
BN: There have been previous attempts to 'fingerprint' source code. How is your approach different?
FM: Traditional code fingerprinting methods typically rely on direct analysis of source code and often require a repository of known samples from each developer. While this approach works well when source code is available, most software applications depend on compiled programs. These dependencies are often optimized during translation to machine code, which can obscure stylistic nuances. As a result, when a dependency is only available as a binary and its author has not been included in the training data of the AI model, traditional methods fall short.
Our approach with OCEAN (Open-World Contrastive Authorship Identification) breaks this mold. Instead of relying on source code, which is often unavailable, we use contrastive learning techniques that compare pairs of code fragments to determine whether they share the same underlying stylistic features. This method works even when the fragments are in machine code. Such a so-called 'open world' methodology means that our system isn't limited to a predefined set of known developers. It can make informed attributions even when it encounters entirely new or unknown coding styles. In essence, while previous attempts have been limited by the need for human-readable source code or known authors, our solution extends fingerprinting capabilities into the more realistic and challenging realm of compiled and optimized code.
BN: How would this work in the real world?
FM: In practice, our method can be integrated into various cybersecurity frameworks to continuously monitor software updates and code repositories. For example, in critical infrastructure sectors such as energy, transportation, or healthcare, the system can automatically flag any code changes that deviate from the development team's established stylistic baseline. If changes in a routine update exhibit stylistic anomalies, an alert is triggered for further investigation, potentially catching unauthorized changes or malicious injections before they cause damage. Autonomous vulnerability detection systems could then prioritize these areas, or the vendor could be contacted to check whether the development team has changed. In essence, our system serves as an early warning mechanism that enables organizations to both detect and quickly respond to security breaches, thereby improving the overall integrity of their software ecosystems.
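As a sketch of that early-warning loop, the snippet below reuses the toy `embed` stand-in from the previous example to compare every function in a new release against the previous release's embeddings, flagging any without a close stylistic match. The threshold and all names are assumptions for illustration, not parameters of the real system:

```python
SIMILARITY_THRESHOLD = 0.8  # assumed cut-off; would be tuned per project

def flag_suspicious_functions(update_funcs: dict, baseline_funcs: list,
                              threshold: float = SIMILARITY_THRESHOLD):
    # Flag functions in a new release whose best stylistic similarity to
    # any function of the previous release falls below the threshold,
    # i.e. code that matches no established contributor's style.
    baseline = [embed(body) for body in baseline_funcs]
    alerts = []
    for name, body in update_funcs.items():
        best = max(float(embed(body) @ b) for b in baseline)
        if best < threshold:
            alerts.append((name, best))  # candidate for manual review
    return alerts

# e.g. alerts = flag_suspicious_functions(new_release_funcs, old_release_funcs)
```

In a real deployment such a check would run automatically on each release, with flagged functions routed to a human reviewer, as described above.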
In addition, law enforcement and digital forensics teams can use this technology to trace the source of cyberattacks and link disparate incidents by revealing that similar coding styles underlie them.
BN: What about open source projects where there may have been multiple contributors?
FM: In open source projects, the diversity of contributions from many developers can make attribution difficult. However, OCEAN is designed to work at the function level, allowing us to attribute individual functions to specific developers based on their distinctive coding styles -- even in a multi-contributor environment.
A key challenge for future research, though, is how to handle cases where multiple authors collaborate on a single function. Currently, OCEAN requires that at least 51 percent of a function be written by a single developer for reliable attribution, so true multi-author attribution within a single function is not yet supported. Overcoming this limitation will be a critical step in refining our approach for even more complex, collaborative coding environments.
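The 51 percent rule itself is easy to state precisely. This minimal sketch, using hypothetical per-line author data such as one might derive from git blame, attributes a function only when a single developer wrote a majority of it and abstains otherwise:

```python
from collections import Counter

def attributable_author(line_authors: list[str]) -> str | None:
    # Attribute the function only if one developer wrote at least 51% of
    # it (here approximated per line, e.g. from git blame); otherwise
    # abstain. Author names below are purely hypothetical.
    author, count = Counter(line_authors).most_common(1)[0]
    return author if count / len(line_authors) >= 0.51 else None

print(attributable_author(["alice"] * 7 + ["bob"] * 3))  # -> alice (70%)
print(attributable_author(["alice"] * 5 + ["bob"] * 5))  # -> None (no majority)
```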
BN: Is it possible to accurately identify code that’s been written by AI rather than humans?
FM: Yes, it should be possible. The core of our method relies on detecting subtle stylistic differences in code, differences that occur in both human-written and AI-generated software. By analyzing these nuances, OCEAN should be able to determine whether a piece of code was produced by a human developer or a generative AI. It is important to note, however, that we have not yet conducted extensive testing specifically on AI-generated code, as our primary focus has been on distinguishing human-written code.
Differentiating between human-written and AI-generated code could be particularly valuable for organizations and academic institutions. For example, enterprises could verify whether AI-generated code is being used in security-critical areas to ensure compliance with industry standards, while universities could confirm that students are completing their assignments independently rather than relying solely on AI tools.
The emergence of AI-generated code also raises important questions. Should its use be clearly labeled? What responsibilities do developers and companies have when incorporating or relying on such code? These considerations underscore the need for clear guidelines and policies to ensure that the use of AI in software development remains transparent and responsible.
Image credit: YAYImages/depositphotos.com