Encoding Text in Emojis

Unicode represents text as a sequence of codepoints, each of which is essentially a number that the Unicode Consortium has assigned meaning to. A specific codepoint is usually written as U+XXXX, where XXXX is the number in uppercase hexadecimal.

For simple Latin-alphabet text, there is a one-to-one mapping between Unicode codepoints and the characters that appear on-screen. For example, U+0067 represents the character g.

For other writing systems, a single on-screen character may be represented by multiple codepoints. The character की (in Devanagari script) is represented by the consecutive codepoints U+0915 and U+0940.
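
If you want to see the codepoints for yourself, Go makes this easy: ranging over a string yields runes (codepoints), and fmt’s %U verb prints them in the U+XXXX notation. A quick sketch:

package main

import "fmt"

func main() {
	// One codepoint, one character.
	for _, r := range "g" {
		fmt.Printf("%U ", r) // U+0067
	}
	fmt.Println()
	// One character, two codepoints.
	for _, r := range "की" {
		fmt.Printf("%U ", r) // U+0915 U+0940
	}
	fmt.Println()
}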

Variation Selectors

Unicode designates 256 codepoints as variation selectors, named VS-1 to VS-256. These have no on-screen representation of their own, but are used to modify the presentation of the preceding character.

Most Unicode characters do not have variations associated with them. Since Unicode is an evolving standard that aims to be future-compatible, variation selectors are supposed to be preserved during transformations, even when their meaning is unknown to the code handling them. So the codepoint U+0067 (“g”) followed by U+FE01 (VS-2) renders as a lowercase “g”, exactly the same as U+0067 alone. But if you copy and paste it, the variation selector tags along.
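
This is easy to check in Go: the two strings below render identically, but compare as different values. A minimal sketch:

package main

import "fmt"

func main() {
	plain := "g"
	tagged := "g\uFE01" // U+0067 followed by VS-2

	fmt.Println(plain)                 // g
	fmt.Println(tagged)                // g (the selector is invisible)
	fmt.Println(plain == tagged)       // false: the selector is still in the string
	fmt.Printf("%U\n", []rune(tagged)) // [U+0067 U+FE01]
}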

Since 256 is exactly enough variations to represent a single byte, this gives us a way to “hide” one byte of data in any other Unicode codepoint.

As it turns out, the Unicode spec does not specifically say anything about sequences of multiple variation selectors, except to imply that they should be ignored during rendering.

We can concatenate a sequence of variation selectors to represent an arbitrary byte string.

For example, let’s say we want to encode the data [0x68, 0x65, 0x6c, 0x6c, 0x6f], which represents the text “hello”. We can do this by converting each byte into its corresponding variation selector and then concatenating the results.

The variation selectors are split across two ranges of codepoints: the original set of 16 at U+FE00 .. U+FE0F, and the remaining 240 at U+E0100 .. U+E01EF (both ranges inclusive).
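
Working through “hello”: every byte here is at least 16, so each one maps into the second range. 0x68 becomes U+E0158 (0xE0100 + 0x68 - 0x10), 0x65 becomes U+E0155, 0x6c becomes U+E015C (used twice), and 0x6f becomes U+E015F, so the full payload is the invisible sequence U+E0158 U+E0155 U+E015C U+E015C U+E015F appended after some base character.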

Example in Golang

Encoding

package main

import (
	"fmt"
	"strings"
)

func main() {
	encoded := encode('😀', []byte("Good job! Did you use ChatGPT to decode me?"))
	fmt.Printf("%s\n", encoded)
}

// byteToVariationSelector maps a byte to its variation selector:
// values 0..15 land in U+FE00..U+FE0F, values 16..255 in U+E0100..U+E01EF.
func byteToVariationSelector(b byte) rune {
	if b < 16 {
		return rune(0xFE00 + uint32(b))
	}
	return rune(0xE0100 + uint32(b-16))
}

// encode writes the base rune, then appends one variation selector per
// byte of the payload.
func encode(base rune, sentence []byte) string {
	s := new(strings.Builder)
	s.WriteRune(base)
	for _, b := range sentence {
		s.WriteRune(byteToVariationSelector(b))
	}
	return s.String()
}
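
Running this prints what looks like a lone 😀: the 43 variation selectors that follow it have no visible rendering in most terminals and browsers. Copy and paste the emoji somewhere else, though, and the hidden payload travels with it.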

Decoding

package main

import (
	"errors"
	"fmt"
	"strings"
)

// encode and byteToVariationSelector are the functions from the encoding
// example above.
func main() {
	encoded := encode('😀', []byte("Good job! Did you use ChatGPT to decode me?"))
	decoded := decode(encoded)
	fmt.Printf("%s\n", decoded)
}

// variationSelectorToByte reverses byteToVariationSelector: runes in
// U+FE00..U+FE0F map to bytes 0..15, runes in U+E0100..U+E01EF map to
// bytes 16..255, and anything else is an error.
func variationSelectorToByte(vs rune) (byte, error) {
	v := uint32(vs)
	if v >= 0xFE00 && v <= 0xFE0F {
		return byte(v - 0xFE00), nil
	}
	if v >= 0xE0100 && v <= 0xE01EF {
		return byte(v - 0xE0100 + 16), nil
	}
	return 0, errors.New("not a variation selector")
}

// decode recovers one byte per variation selector; any other rune (for
// example the base emoji) is replaced with a newline.
func decode(varSels string) string {
	message := new(strings.Builder)
	for _, vs := range varSels {
		if b, err := variationSelectorToByte(vs); err == nil {
			message.WriteByte(b)
		} else {
			message.WriteByte('\n')
		}
	}
	return message.String()
}
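
Note that decode maps every non-selector rune (including the base 😀) to a newline, so running this prints a blank line followed by the recovered “Good job! Did you use ChatGPT to decode me?”.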

#programming #hacker #tricks #unicode #variation selectors