Programmers tend to have their own distinct styles, but it’s not really feasible to pore over many lines of code looking for telltale cues about a program’s author. Now, that might not be necessary. Researchers have developed a machine learning system that can ‘de-anonymize’ programmers, whether from raw source code or compiled binaries. As explained to Wired, the approach trains an algorithm to recognize a programmer’s coding structure from examples of their work, then looks for the same traits in other code samples. You don’t need large chunks of a given program, either: short snippets are often enough.
In a test using results from Google’s Code Jam, the AI-based technology was relatively accurate, though far from foolproof. With 600 programmers and eight samples each, the system could identify creators 83 percent of the time.
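To make the train-then-match idea concrete, here is a minimal sketch in Python. It is not the researchers' system: it swaps in a simple nearest-profile comparison over a handful of hand-picked surface features (indentation, brace placement, identifier naming), and the function names, the feature set, and the authors "alice" and "bob" are all hypothetical, chosen only for illustration.

```python
import math
import re


def style_features(code: str) -> list[float]:
    """Turn a snippet into a small vector of surface-style ratios.

    The seven features are an assumed, illustrative set, not the
    features used by the system described in the article.
    """
    lines = [ln for ln in code.splitlines() if ln.strip()]
    n_lines = max(len(lines), 1)
    idents = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code)
    n_idents = max(len(idents), 1)
    return [
        sum(ln.startswith("\t") for ln in lines) / n_lines,   # tab-indented lines
        sum(ln.startswith(" ") for ln in lines) / n_lines,    # space-indented lines
        sum(ln.strip() == "{" for ln in lines) / n_lines,     # brace on its own line
        sum(ln.rstrip().endswith("{") and ln.strip() != "{"
            for ln in lines) / n_lines,                       # brace at end of a line
        sum("_" in w for w in idents) / n_idents,             # snake_case names
        sum(re.search(r"[a-z][A-Z]", w) is not None
            for w in idents) / n_idents,                      # camelCase names
        sum(len(w) for w in idents) / n_idents / 10.0,        # mean name length, scaled
    ]


def train(samples: dict[str, list[str]]) -> dict[str, list[float]]:
    """Average each author's feature vectors into one style profile."""
    profiles = {}
    for author, snippets in samples.items():
        vectors = [style_features(s) for s in snippets]
        profiles[author] = [sum(col) / len(vectors) for col in zip(*vectors)]
    return profiles


def identify(profiles: dict[str, list[float]], snippet: str) -> str:
    """Return the author whose profile is closest to the snippet's features."""
    target = style_features(snippet)
    return min(profiles, key=lambda author: math.dist(profiles[author], target))


# Hypothetical training data: two invented authors, two short samples each.
samples = {
    "alice": [
        "int main()\n{\n\tint totalSum = 0;\n\treturn totalSum;\n}\n",
        "void solveCase()\n{\n\tint caseCount = 1;\n}\n",
    ],
    "bob": [
        "int main() {\n    int total_sum = 0;\n    return total_sum;\n}\n",
        "void solve_case() {\n    int case_count = 1;\n}\n",
    ],
}
profiles = train(samples)

unknown = "int readInput()\n{\n\tint lineCount = 3;\n\treturn lineCount;\n}\n"
print(identify(profiles, unknown))
```

With only two invented authors and exaggerated style differences, this is far easier than the Code Jam test. There, a random guess among 600 programmers would be right about once in 600 tries, which is the baseline the 83 percent figure should be read against.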
The technology could be a boon to investigators. It would be useful for identifying malware creators, especially when the perpetrators try to frame someone else. It might also be helpful in plagiarism cases, where machine learning could distinguish purely coincidental similarities from overt copying.
It could be just as much of a curse, however. While it’s feasible to mask the origins of code, the technology might make it difficult to contribute code with true anonymity. Someone could theoretically recognize your open work even if you’re switching accounts or otherwise don’t want to leave a trail. Any possible future implementation might have to strike a careful balance between the desire for security and the need for privacy.