“Trojan Source” hides flaws in source code from humans

Security researchers have revealed a weakness in how compilers handle Unicode text that could be used to slip vulnerabilities into open source projects. The researchers have dubbed the attack Trojan Source and said it is especially potent in software supply chains, the route exploited in the SolarWinds attack.

“If an adversary successfully commits targeted vulnerabilities into open-source code by deceiving human reviewers, downstream software will likely inherit the vulnerability,” said researchers.

Researchers said the attack exploits subtleties in text-encoding standards such as Unicode to produce source code whose tokens are logically encoded in a different order from the one in which they are displayed. The result is a vulnerability that a human reviewer is unlikely to spot on screen.

“These visually reordered tokens can be used to display logic that, while semantically correct, diverges from the logic presented by the logical ordering of source code tokens,” said researchers.

They added that compilers and interpreters adhere to the logical ordering of source code, not the visual order.
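To illustrate the gap between logical and visual order, the following minimal sketch feeds one line to Python's standard `tokenize` module. The sample line is an assumption made for illustration, and the control character is written as an escape (`\u2067`, RIGHT-TO-LEFT ISOLATE) so the listing itself stays safe to read.

```python
import io
import tokenize

RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE, kept as an escape on purpose

# Illustrative line: a docstring that ends right after the RLI character,
# followed by real code. A bidi-aware viewer may draw the tail right-to-left.
line = f"''' Subtract funds from bank account then {RLI}''' ;return\n"

# Collect the tokens in the order the tokenizer reads them (logical order).
tokens = [
    (tokenize.tok_name[tok.type], tok.string)
    for tok in tokenize.generate_tokens(io.StringIO(line).readline)
    if tok.type in (tokenize.STRING, tokenize.OP, tokenize.NAME)
]

for kind, text in tokens:
    print(kind, repr(text))
```

The tokenizer reports a string, then a semicolon, then a `return` keyword, regardless of how an editor chooses to paint the line.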

Hackers can use multiple techniques to exploit the visual reordering of source code tokens, according to researchers.

The first technique is called “Early Returns.” This causes a function to short-circuit by executing a return statement that visually appears to be within a comment.
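An early return can be sketched in Python as follows. This is an illustration under stated assumptions, not the researchers' exact proof of concept: the function name `subtract_funds`, the `bank` dictionary and the amounts are hypothetical, and the source is built from an escape sequence rather than pasted with the raw control character.

```python
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE, written as an escape so it stays visible

# Logical order: the docstring closes immediately after RLI, and ";return"
# follows as live code. On screen the return may appear inside the docstring.
source = (
    "def subtract_funds(bank, account, amount):\n"
    f"    ''' Subtract funds from bank account then {RLI}''' ;return\n"
    "    bank[account] -= amount\n"
    "    return\n"
)

namespace = {}
exec(source, namespace)  # compile the crafted function (hypothetical example)

bank = {"alice": 100}
namespace["subtract_funds"](bank, "alice", 30)

# The function returned before the subtraction, so the balance is unchanged.
print(bank["alice"])
```

Because the interpreter follows the logical order, the subtraction line is never reached.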

The second is “Commenting-Out.” Here, text that visually appears to be live code actually sits inside a comment, so it is never executed.


Lastly, there are “Stretched Strings.” These cause portions of string literals to visually appear as code, which has the same effect as commenting-out and causes string comparisons to fail.
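A stretched string might look like the sketch below. The function name `needs_admin_path` and the comment text are assumptions chosen for illustration; the control characters are again written as escapes.

```python
RLO, LRI, PDI = "\u202e", "\u2066", "\u2069"  # override, isolate, pop isolate

# Logically this is ONE string literal. A bidi-aware viewer may instead draw
# it as "user" followed by what looks like a trailing comment.
literal = f'"user{RLO} {LRI}# Check if admin{PDI} {LRI}"'

source = (
    "def needs_admin_path(access_level):\n"
    f"    return access_level != {literal}\n"
)

namespace = {}
exec(source, namespace)  # compile the crafted comparison (hypothetical example)

# A reviewer expects "user" != "user" to be False, but the literal is longer
# than it looks, so the comparison reports a mismatch.
result = namespace["needs_admin_path"]("user")
print(result)
```

The comparison fails for every ordinary input, because no legitimate caller passes a string containing the hidden characters.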

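There is also a variant that uses homoglyphs, which are characters that look nearly identical to other characters, such as the Latin “H” and the Cyrillic “Н”. Two identifiers that differ only by such a character can look the same on screen while referring to different functions.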

“An attacker can define such homoglyph functions in an upstream package imported into the global namespace of the target, which they then call from the victim code,” said researchers.

This attack variant is tracked as CVE-2021-42694.
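The homoglyph variant can be sketched in Python, which accepts non-Latin letters in identifiers. The function names and return values here are hypothetical; the look-alike character is U+041D, CYRILLIC CAPITAL LETTER EN, written as an escape.

```python
import unicodedata

latin_name = "sayHello"          # ordinary Latin H
spoofed_name = "say\u041dello"   # Cyrillic EN in place of the H

namespace = {}
# The genuine function, as the victim project might define it.
exec(f"def {latin_name}():\n    return 'genuine'\n", namespace)
# A look-alike that an upstream package could define (assumption for the demo).
exec(f"def {spoofed_name}():\n    return 'attacker'\n", namespace)

print(latin_name == spoofed_name)            # the two names are different strings
print(unicodedata.name(spoofed_name[3]))     # official name of the look-alike
print(namespace[latin_name](), namespace[spoofed_name]())
```

Both definitions coexist in the same namespace, so a call site that uses the look-alike name quietly reaches the attacker's function.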

Researchers said that, to defend against such attacks, compilers, interpreters, and build pipelines supporting Unicode should throw errors or warnings for unterminated bidirectional control characters in comments or string literals, and for identifiers with mixed-script confusable characters.
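A build-pipeline check along these lines could start from the rough sketch below. It makes two simplifying assumptions: it inspects whole lines rather than locating comments and string literals with a real lexer, and it guesses a character's script from the first word of its Unicode name, since Python's standard library does not expose the script property. The helper names are hypothetical.

```python
import unicodedata

EMBED_OPENERS = {"\u202a", "\u202b", "\u202d", "\u202e"}  # LRE, RLE, LRO, RLO
ISOLATE_OPENERS = {"\u2066", "\u2067", "\u2068"}          # LRI, RLI, FSI
PDF, PDI = "\u202c", "\u2069"                             # their terminators


def has_unterminated_bidi(line: str) -> bool:
    """Return True if a line opens a bidi control it never closes (simplified)."""
    embeds = isolates = 0
    for ch in line:
        if ch in EMBED_OPENERS:
            embeds += 1
        elif ch in ISOLATE_OPENERS:
            isolates += 1
        elif ch == PDF and embeds:
            embeds -= 1
        elif ch == PDI and isolates:
            isolates -= 1
    return embeds > 0 or isolates > 0


def is_mixed_script(identifier: str) -> bool:
    """Heuristic: letters whose Unicode names start with different script words."""
    scripts = {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in identifier
        if ch.isalpha()
    }
    return len(scripts) > 1


print(has_unterminated_bidi("''' note \u2067''' ;return"))
print(is_mixed_script("say\u041dello"))
```

A production check would more likely rely on the Unicode Consortium's confusables data and the full bidirectional algorithm than on these shortcuts.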

“Language specifications should formally disallow unterminated bidirectional control characters in comments and string literals,” they added. “Code editors and repository frontends should make bidirectional control characters and mixed-script confusable characters perceptible with visual symbols or warnings.”
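One simple way for a tool to make these characters perceptible is to replace each one with a visible marker before display. The sketch below assumes a marker format of `<U+XXXX>`; the function name `reveal_bidi` is hypothetical.

```python
# The nine explicit bidirectional formatting characters.
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # LRE, RLE, PDF, LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",            # LRI, RLI, FSI, PDI
}


def reveal_bidi(text: str) -> str:
    """Replace each bidi control with a visible <U+XXXX> marker."""
    return "".join(
        f"<U+{ord(ch):04X}>" if ch in BIDI_CONTROLS else ch
        for ch in text
    )


print(reveal_bidi("''' note \u2067''' ;return"))
```

With the marker in place, the hidden character shows up as ordinary text in a diff or code review.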

Rene Millman

Rene Millman is a freelance writer and broadcaster who covers cybersecurity, AI, IoT, and the cloud. He also works as a contributing analyst at GigaOm and has previously worked as an analyst for Gartner covering the infrastructure market. He has made numerous television appearances to give his views and expertise on technology trends and companies that affect and shape our lives.