When does í not equal í?

metadata

keywords:
- software
- Python
published: 2025-01-17
updated: 2025-05-24

Recently I visited Reykjavík for a long weekend … little did I know that upon my return I would break this web site. When I downloaded the photos from my iPhone on to my FreeBSD NAS, I called the folder “2025-01-17 - Reykjavík - Phone”. This web site stores the paths to the original photos in an unpublished JSON file. Once I had added the few photos from the trip that I wanted to publish to that JSON file, the web site generated fine on my MacBook Pro. Later that night, when the cron job on my FreeBSD NAS rebuilt the web site, the build failed: FreeBSD claimed that my photos from Reykjavík did not exist. After a lot of digging I learned about Unicode normalisation. Consider the Python script below:


#!/usr/bin/env python3

# Use the proper idiom in the main module ...
# NOTE: See https://docs.python.org/3.12/library/multiprocessing.html#the-spawn-and-forkserver-start-methods
if __name__ == "__main__":
    # If I run "ls -l" on my FreeBSD NAS then I get the following string in the
    # terminal ...
    disk = '/path/to/Photographs/2025/2025-01-17 - Reykjavík - Phone/EXIF Data.json'

    # If I run "ls -l" on my MacOS laptop (which has mounted my FreeBSD NAS via
    # Samba) then I get the following in the terminal ...
    repo = '/path/to/Photographs/2025/2025-01-17 - Reykjavík - Phone/EXIF Data.json'

    # They are not the same strings, apparently ...
    print(disk == repo)

    # The MacOS one is longer ...
    print(len(disk))
    print(len(repo))

    # The MacOS one uses a different sequence of bytes to encode the album name ...
    print(disk.encode("utf-8"))
    print(repo.encode("utf-8"))

    # This command appears to be a null operation, but text editors (and
    # computing standards) are not what they seem ...
    print(disk == repo.replace(' - Reykjavík - ', ' - Reykjavík - '))

    # FreeBSD uses "composed" Unicode and MacOS uses "decomposed" Unicode.

    # I was doing some web site development locally on my MacOS laptop and the
    # web site generated fine on my MacOS laptop but when I did a "git push" on
    # my MacOS laptop followed by a "git pull" on my FreeBSD NAS the web site
    # stopped generating on my FreeBSD NAS because it claimed that some of the
    # photos were missing. I save the paths to the photos in a JSON in a Git
    # repository.

    # Who allowed there to be two valid ways of expressing the same character
    # such that "í != í"? See:
    #   * https://en.wikipedia.org/wiki/Unicode#Combining_characters
    #   * https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization
    #   * https://en.wikipedia.org/wiki/Unicode_equivalence#Errors_due_to_normalization_differences

    # There is an obscure Python function to (sort of) deal with it, see:
    #   * https://docs.python.org/3/library/unicodedata.html#unicodedata.normalize

    # For example ...
    import unicodedata
    for form in ["NFC", "NFKC", "NFD", "NFKD",]:
        print(form, unicodedata.is_normalized(form, disk), unicodedata.is_normalized(form, repo))
    print(disk == unicodedata.normalize("NFC", disk))
    print(repo == unicodedata.normalize("NFC", repo))
    print(unicodedata.normalize("NFC", disk) == unicodedata.normalize("NFC", repo))

You may also download “initial-problem.py” directly or view “initial-problem.py” on GitHub Gist (you may need to manually checkout the “main” branch).

… which produces the below output:


False
71
72
b'/path/to/Photographs/2025/2025-01-17 - Reykjav\xc3\xadk - Phone/EXIF Data.json'
b'/path/to/Photographs/2025/2025-01-17 - Reykjavi\xcc\x81k - Phone/EXIF Data.json'
True
NFC True False
NFKC True False
NFD False True
NFKD False True
True
False
True

You may also download “initial-problem.out” directly or view “initial-problem.out” on GitHub Gist (you may need to manually checkout the “main” branch).

FreeBSD uses the two bytes \xc3\xad to represent the í character, whereas macOS uses the three bytes i\xcc\x81. In the FreeBSD string the accent is composed with the i into the single code point U+00ED, which UTF-8 encodes as \xc3\xad. In the macOS string the character is kept decomposed: an ordinary, unmodified i followed by the combining acute accent U+0301 (the two bytes \xcc\x81), which describes how the preceding letter is to be modified when displayed. Both sequences render as the same character on your screen, but the bytes differ, so computers treat the strings as different, and FreeBSD cannot find the file path described by the string in the JSON file.
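To look at the two forms without relying on how a terminal or text editor renders them, the characters can be written with explicit escape sequences and inspected one code point at a time. The following is only a minimal sketch: the helper `describe` and the variable names `composed` and `decomposed` are my own and do not appear in the scripts on this page.

```python
#!/usr/bin/env python3

import unicodedata


def describe(text):
    """Return a list of (code point, Unicode name) pairs, one per character."""
    return [(f"U+{ord(char):04X}", unicodedata.name(char)) for char in text]


if __name__ == "__main__":
    # Escape sequences cannot be silently re-normalised by an editor ...
    composed = "\u00ed"       # a single pre-composed code point
    decomposed = "i\u0301"    # a plain "i" followed by a combining accent

    # Print the length, the UTF-8 bytes and the code points of each form ...
    for label, text in [("composed", composed), ("decomposed", decomposed)]:
        print(label, len(text), text.encode("utf-8"), describe(text))
```

Writing test strings with escapes like this also protects them from copy-and-paste, which may quietly convert one form into the other.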

There are four Unicode normalisation forms: NFC, NFKC, NFD and NFKD. The two C forms use the composed byte sequence \xc3\xad and the two D forms use the decomposed byte sequence i\xcc\x81. A K in the name (for “compatibility”) makes no difference to how í is represented, but it does change the … character. Without a K, … is kept as the single Unicode character U+2026 and is represented by the three bytes \xe2\x80\xa6. With a K, … is expanded into three separate . characters and is represented by three bytes, each of them a plain ASCII full stop. This is demonstrated by the Python script below:


#!/usr/bin/env python3

# Use the proper idiom in the main module ...
# NOTE: See https://docs.python.org/3.12/library/multiprocessing.html#the-spawn-and-forkserver-start-methods
if __name__ == "__main__":
    # Import standard modules ...
    import unicodedata

    # Define an example string as raw bytes (to avoid ambiguity) and then
    # convert it to a Unicode string ...
    null_byt = b'Where is \xe2\x80\xa6 Reykjav\xc3\xadk?'
    null_str = null_byt.decode("utf-8")

    # Print summary ...
    print(f"example string as a Unicode string: {null_str}")
    print(f"example string as a bytes sequence: {repr(null_byt)}")

    # Loop over possible Unicode normalizations ...
    for form in ["NFC", "NFKC", "NFD", "NFKD",]:
        # Convert the example Unicode string to the current Unicode
        # normalization and then convert it to raw bytes (to avoid ambiguity) ...
        test_str = unicodedata.normalize(form, null_str)
        test_byt = test_str.encode("utf-8")

        # Print summary ...
        print(f"{form:4s}    {repr(unicodedata.is_normalized(form, null_str)):5s}    {repr(test_byt)}")

You may also download “full-demonstration.py” directly or view “full-demonstration.py” on GitHub Gist (you may need to manually checkout the “main” branch).

… which produces the below output:


example string as a Unicode string: Where is … Reykjavík?
example string as a bytes sequence: b'Where is \xe2\x80\xa6 Reykjav\xc3\xadk?'
NFC     True     b'Where is \xe2\x80\xa6 Reykjav\xc3\xadk?'
NFKC    False    b'Where is ... Reykjav\xc3\xadk?'
NFD     False    b'Where is \xe2\x80\xa6 Reykjavi\xcc\x81k?'
NFKD    False    b'Where is ... Reykjavi\xcc\x81k?'

You may also download “full-demonstration.out” directly or view “full-demonstration.out” on GitHub Gist (you may need to manually checkout the “main” branch).
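One consequence of the output above is worth spelling out: the K forms lose information. Converting to NFD and back to NFC recovers the example string, but once … has been expanded by NFKD there is nothing left to say that the three full stops were ever one character. The sketch below illustrates this; the helper name `round_trip` is hypothetical and is not part of the scripts on this page.

```python
#!/usr/bin/env python3

import unicodedata


def round_trip(text, form):
    """Normalise the text to the given form and then back to NFC."""
    return unicodedata.normalize("NFC", unicodedata.normalize(form, text))


if __name__ == "__main__":
    # The same example string as above, written with explicit escapes ...
    original = "Where is \u2026 Reykjav\u00edk?"

    # The canonical form NFD can be undone, the compatibility form NFKD
    # cannot ...
    for form in ["NFD", "NFKD"]:
        result = round_trip(original, form)
        print(form, result == original, result.encode("utf-8"))
```

This is one reason to prefer NFC over NFKC when the aim is only to make file paths compare equal.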

In summary:

Unicode Normalisation	Byte Sequence Used To Represent í
NFC	`\xc3\xad`
NFKC	`\xc3\xad`
NFD	`i\xcc\x81`
NFKD	`i\xcc\x81`

Unicode Normalisation	Byte Sequence Used To Represent …
NFC	`\xe2\x80\xa6`
NFKC	`...`
NFD	`\xe2\x80\xa6`
NFKD	`...`

From now on I shall be paying more attention to the byte sequences used to represent Unicode characters and, where required, I shall be enforcing NFC normalisation so that different computers (and pieces of software) continue to talk nicely to each other.
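As a rough illustration of what enforcing NFC could look like, the sketch below normalises a path before it is stored and, when reading a path of unknown origin, tries the spelling as given, then NFC, then NFD. The function names `to_nfc` and `resolve` are my own invention, and the sketch assumes that the whole path uses one normalisation form; a path that mixes forms between folders would need a component-by-component search instead.

```python
#!/usr/bin/env python3

import os
import unicodedata


def to_nfc(path):
    """Return the path with every character in composed (NFC) form."""
    return unicodedata.normalize("NFC", path)


def resolve(path):
    """Return a spelling of the path that exists on disk, or None.

    The path is tried as given, then in NFC form, then in NFD form.
    """
    for candidate in (path, to_nfc(path), unicodedata.normalize("NFD", path)):
        if os.path.exists(candidate):
            return candidate
    return None


if __name__ == "__main__":
    # The example path from above, in the decomposed form that macOS
    # reported ...
    repo = "/path/to/Photographs/2025/2025-01-17 - Reykjavi\u0301k - Phone/EXIF Data.json"

    # Normalise before saving to the JSON file ...
    print(to_nfc(repo).encode("utf-8"))

    # Look the file up on disk, whichever form the file system holds ...
    print(resolve(repo))
```

Calling `to_nfc` on every path before it is written to the JSON file keeps the repository consistent, while `resolve` is a fallback for folders whose names were created in the other form.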