Unicode's Secret Compartment: The Variation Selector Trick

Steganography / January 16, 2025 • 3 min read
Tags: unicode steganography appsec

I recently stumbled upon an article written by Paul Butler. It caught my eye because it described a method for hiding data in text using Unicode variation selectors. The method does not distort the visible text or add any visible characters that might give away the secret message. You can try it out here.

The idea is simple. Variation selectors normally modify the presentation of the preceding character, but when a selector is attached to a character that defines no variation for it, the selector leaves that character unchanged and is ignored during rendering. Those ignored selectors can carry hidden data.

I wonder, how can this be used in malware development or general mischief? 😈

What is Unicode?

Unicode is a universal character encoding standard that represents the world’s writing systems in a consistent way. Each character is assigned a number, known as a code point, which enables computers to handle text in any language uniformly.

Unicode provides a single system that represents everything from English and Spanish to Chinese characters, and emojis of course.
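To make the idea of code points concrete, here is a small sketch that prints the code point of a few characters. The helper name codePointLabel is made up for this illustration.

```javascript
// Format the first code point of a string as U+XXXX (helper name is illustrative).
function codePointLabel(ch) {
  const cp = ch.codePointAt(0);
  return "U+" + cp.toString(16).toUpperCase().padStart(4, "0");
}

// Latin letter, accented letter, Chinese character, emoji.
for (const ch of ["A", "\u00F1", "\u4E2D", "\u{1F608}"]) {
  console.log(ch, codePointLabel(ch));
}
// A U+0041
// ñ U+00F1
// 中 U+4E2D
// 😈 U+1F608
```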

What do variation selectors achieve?

Some writing systems, such as Chinese and Arabic, represent their characters differently from the Latin alphabet.

When using emojis, variation selectors determine whether an emoji should be displayed in color (emoji style) or as a black and white text symbol. For example, the heart symbol can appear either as ❤️ or ❤ depending on the variation selector used.

Variation selectors modify the appearance of the preceding character, creating visual distinctions without requiring separate code points. In East Asian typography, variation selectors help specify different glyph variants of the same character that carry distinct meanings or are used in different contexts. This is crucial for correctly representing characters in Japanese, Chinese, and Korean texts where the same base character might need to appear differently based on usage.
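As a rough illustration of the heart example, the sketch below builds both variants from the same base code point (U+2764) followed by U+FE0E (text presentation) or U+FE0F (emoji presentation). How each one actually renders depends on the font and platform, and the helper name listCodePoints is made up for this example.

```javascript
const HEART = "\u2764";               // base character: heavy black heart
const textHeart = HEART + "\uFE0E";   // VS15 requests text presentation
const emojiHeart = HEART + "\uFE0F";  // VS16 requests emoji presentation

// List the code points in a string as U+XXXX labels (illustrative helper).
function listCodePoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

console.log(textHeart, listCodePoints(textHeart).join(" "));   // U+2764 U+FE0E
console.log(emojiHeart, listCodePoints(emojiHeart).join(" ")); // U+2764 U+FE0F
```

Both strings share the same base character. Only the trailing selector differs.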

Example

If you want to encode the word “SECRET” in the text “This message looks normal”, you map each byte of the secret to a variation selector using the following formula:

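// Map one byte (0-255) to a variation selector (function name is illustrative).
function toVariationSelector(byte) {
    if (byte < 16) {
        return String.fromCodePoint(0xfe00 + byte);         // U+FE00..U+FE0F
    } else {
        return String.fromCodePoint(0xe0100 + (byte - 16)); // U+E0100..U+E01EF
    }
}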

For the first letter, this works out as follows:

S (ASCII value 83):
- 83 ≥ 16, so we use the second formula: 0xe0100 + (83 - 16)
- 83 - 16 = 67, which is 0x43 in hex, so 0xe0100 + 0x43 = 0xe0143
- The variation selector becomes U+E0143
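The same calculation can be applied to every letter of the secret. The sketch below is one possible way to do that. It assumes one hidden byte is attached after each carrier character, in order, and the names toVariationSelector and hideInText are made up for this illustration. It is not necessarily how the original implementation distributes the bytes.

```javascript
// Map one byte (0-255) to a variation selector code point.
function toVariationSelector(byte) {
  if (!Number.isInteger(byte) || byte < 0 || byte > 255) {
    throw new RangeError("byte must be an integer between 0 and 255");
  }
  return byte < 16
    ? String.fromCodePoint(0xfe00 + byte)
    : String.fromCodePoint(0xe0100 + (byte - 16));
}

// Attach the i-th byte of the secret after the i-th character of the carrier.
// Assumption: the carrier has at least as many characters as the secret has bytes.
function hideInText(carrier, secret) {
  const bytes = new TextEncoder().encode(secret); // UTF-8 bytes of the secret
  const chars = Array.from(carrier);              // split by code point
  if (bytes.length > chars.length) {
    throw new RangeError("carrier text is too short for this secret");
  }
  return chars
    .map((ch, i) => (i < bytes.length ? ch + toVariationSelector(bytes[i]) : ch))
    .join("");
}

const encoded = hideInText("This message looks normal", "SECRET");
const hex = Array.from(encoded, (ch) => ch.codePointAt(0).toString(16).toUpperCase());
console.log(encoded);       // should look like the plain carrier text when rendered
console.log(hex.join(" ")); // starts with: 54 E0143 68 E0135 69 E0133 ...
```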

HEX REPRESENTATION OF ENCODED TEXT (Single Line, one value per code point, hidden variation selectors in brackets):
54 [E0143] 68 [E0135] 69 [E0133] 73 [E0142] 20 [E0135] 6D [E0144] 65 73 73 61 67 65 20 6C 6F 6F 6B 73 20 6E 6F 72 6D 61 6C

CHARACTER MAPPING (Character → Hidden Data):
T→S(83) | h→E(69) | i→C(67) | s→R(82) |  →E(69) | m→T(84) | e→(none) | s→(none) | s→(none) | a→(none) | g→(none) | e→(none) |  →(none) | l→(none) | o→(none) | o→(none) | k→(none) | s→(none) |  →(none) | n→(none) | o→(none) | r→(none) | m→(none) | a→(none) | l→(none)

You can find a JavaScript implementation here. Check the source of the page for more information.
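Decoding is the reverse mapping. The following is a minimal sketch rather than the linked implementation, and the names fromVariationSelector and revealHidden are made up for this example. It treats every variation selector in the text as hidden data, so a legitimate selector (for instance U+FE0F after an emoji) would also be read as a byte.

```javascript
// Map a variation selector code point back to its byte, or null if it is not one.
function fromVariationSelector(codePoint) {
  if (codePoint >= 0xfe00 && codePoint <= 0xfe0f) {
    return codePoint - 0xfe00;
  }
  if (codePoint >= 0xe0100 && codePoint <= 0xe01ef) {
    return codePoint - 0xe0100 + 16;
  }
  return null;
}

// Collect the bytes carried by all variation selectors and decode them as UTF-8.
function revealHidden(text) {
  const bytes = [];
  for (const ch of text) { // iterates by code point, not by UTF-16 unit
    const byte = fromVariationSelector(ch.codePointAt(0));
    if (byte !== null) {
      bytes.push(byte);
    }
  }
  return new TextDecoder().decode(new Uint8Array(bytes));
}

// The encoded example text, written with explicit escapes.
const sample =
  "T\u{E0143}h\u{E0135}i\u{E0133}s\u{E0142} \u{E0135}m\u{E0144}essage looks normal";
console.log(revealHidden(sample)); // SECRET
```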

Mischief

Variation selectors could be used to hide commands for a command and control system. An agent (or implant) could post what appears to be normal text on social media, but include hidden commands or data.

Some other possible mischief avenues:

  • Data exfiltration: The attacker appears to be exfiltrating meaningless data, while the real data is embedded within the “fake” data. This could be discovered by inspecting the data as hex.
  • Evading content filters: With a decoder on the receiving side, content hidden in variation selectors might pass an initial content filter that only inspects the visible text.
  • Bypassing hash-based detection: Adding variation selectors to a text-based payload, such as a script, could leave the visible code the same while changing the underlying bytes and therefore the resulting hash.
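From the defender's side, the points above suggest a simple check: look for variation selectors where none are expected. The sketch below is only an illustration with made-up helper names. It counts and strips selectors and shows that the byte length changes even though the visible text does not. Stripping blindly would also remove legitimate selectors, such as U+FE0F in emoji.

```javascript
// Matches U+FE00..U+FE0F and U+E0100..U+E01EF (the two variation selector ranges).
const VARIATION_SELECTORS = /[\uFE00-\uFE0F\u{E0100}-\u{E01EF}]/gu;

function countVariationSelectors(text) {
  return (text.match(VARIATION_SELECTORS) || []).length;
}

function stripVariationSelectors(text) {
  return text.replace(VARIATION_SELECTORS, "");
}

// Number of bytes the string occupies when encoded as UTF-8.
function utf8Length(text) {
  return new TextEncoder().encode(text).length;
}

const clean = "echo hello";
const tampered = "e\u{E0143}cho hello"; // one hidden selector after the first letter

console.log(countVariationSelectors(tampered));           // 1
console.log(utf8Length(clean), utf8Length(tampered));     // 10 14
console.log(stripVariationSelectors(tampered) === clean); // true
```

Because the underlying bytes differ, any hash computed over the tampered text would differ from the hash of the clean text.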

These are some ideas for future exploration.

Conclusion

I want to thank Paul for writing about this subject. It was fun learning more about Unicode!