Monday, April 3, 2017

The mess with internationalized domain names

While internationalized domain names (DNS names) are not common in the English speaking world, they exist and their use was standardized by IETF's IDNA standards. I first found out the existence of that possibility while reading the IETF's best practices for domain name verification. As english is not my mother tongue I was particularly interested on the topic, and wanted to make sure that GnuTLS would handle such domains correctly both for storing such domains, and verifying them. That proved not to be an easy task. The following text summarizes my brief understanding of the issues in the field (disclaimer: I am far from an expert in software internationalization topics).

How does IDNA work?

To make a long story short, the IDNA protocols are based on a simple principle. They translate domain names typed with unicode characters (UTF-8 or otherwise), to a US-ASCII (English text) representation which becomes the actual domain name. For example the greek name "ένα.gr" would translate to "xn--ixai9a.gr". On Linux systems one can find Simon Josefsson's idn and idn2 tools (more on that below), which can be used to translate from an internationalized string to IDNA format. For example:

    $ echo "ενα.gr"|idn
    xn--mxahy.gr

 

What are the issues with IDNA?

Although there are simple to use libraries (see Libidn) to access IDNA functionality, there is a catch. In 2010, IETF updated the IDNA standards with a new set of standards called IDNA2008, which were "mostly compatible" with the original standard (called IDNA2003). Mostly compatible meant that the majority of strings mapped to the same US-ASCII equivalent, though some didn't. They mapped to a different string. That affected many languages, including the Greek language mappins, and the following table displays the IDNA2003 and IDNA2008 mappings of few "problematic" Greek domain names.

non-English string IDNA2003 IDNA2008
νίκος.gr xn--kxawhku.gr xn--kxawhkp.gr
νίκοσ.gr xn--kxawhku.gr xn--kxawhku.gr
NΊΚΟΣ.gr xn--kxawhku.gr (undefined)

In the above table, we can see the differences in mappings for three strings. All of the above strings can be considered to be equal in the greek language, as the third is the capitalized version of the first, and the second is the "dumb" lower case equivalent of the last.

The problematic character is 'σ' which in Modern Greek is switched with 'ς' when it is present at the end of word. As both characters are considered to be identical in the language, they are both capitalized to the same character 'Σ' (Sigma).

There are two changes in IDNA2008 standard which affect the examples above. The first, is the treatment of the 'ς' and 'σ' characters as different, causing the discrepancy between the mappings in the first and second examples. The second is that IDNA2008 is defined only for a specific set of characters, and there is no pre-processing phase, which causes the undefined state of the third string, that contains capital letters. These changes, create a discrepancy between expectations formed by observing the behavior of domains consisting of US-ASCII strings and the actual reality with Internationalized scripts. Similar cases exist in other languages (e.g., with the treatment of the 'ß' character in German).

Even though some work-arounds of the protocol may seem obvious or intuitive to implement, such as lower-casing characters prior to converting to IDNA format, lower-casing doesn't make sense in all languages. This is the reason that the capitalized version (NΊΚΟΣ.gr) of the first string on the table, is undefined in IDNA2008.

You can verify the mappings I presented above with the idn2 application, which is IDNA2008-compliant. For example:

    echo "NΊΚΟΣ.gr"|idn2
    idn2: lookup: string contains a disallowed character

 

Is there any solution?

To address these issues, a different standards body --the Unicode consortium-- addressed the issue with the Unicode Technical Standard #46 (UTS#46 or TR#46). That standard was published in 2016 to clarify few aspects of IDNA2008 and propose a compatible with IDNA2003 behavior.

UTS#46 proposes two modes of IDNA2008, the transitional, which results to problematic characters being mapped to their IDNA2003 equivalents and the non-transitional mode, which is identical to the original IDNA2008 standard. In addition it requires the internationalized input to be pre-processed with the CaseFold algorithm which allows handling upper-case domain names such as "ΝΊΚΟΣ.gr" under IDNA2008.

 

Switching to IDNA2008

Unfortunately even with UTS#46, we are left with two IDNA2008 variants. The transitional which is IDNA2003 compatible and the non-transitional which is IDNA2008 incompatible. Some NICs and registrars have already switched to IDNA2008 non-transitional, but not all software has followed up.

A problem is, that UTS#46 does not define a period for the use of transitional encodings, something that makes their intended use questionable. Nevertheless, as the end-goal is to switch to the non-transitional IDNA2008, it still makes it practical to switch to it, by clarifying several undefined parts of the original protocol (e.g., adds a pre-processing phase). As a result, few browsers (e.g., Firefox) have already switched to it. It is also possible for software based on libidn, which only supports IDNA2003, to switch. The libidn2 2.0.0 release includes a libidn compatible APIs making it possible to switch to IDNA2008 (transitional or not).

 

Should I do the switch?

There are few important aspects of the IDNA2008 (non-transitional) domain names, which have to be taken into account prior to switching. As we saw above, the semantics of entering a domain in upper case, and expecting it to be translated to the proper web-site address wouldn't work for internationalized domain names. If one enters the domain "ΝΊΚΟΣ.gr", it would translate to the domain xn--kxawhku.gr (i.e., "νίκοσ.gr"), which is a misspelled version of the correct in Greek language "νίκος.gr".

Moreover, as few software has switched to IDNA2008 non-transitional processing of domain names, there is always the discrepancy between the IDNA2003 mapping and the IDNA2008 mapping. That is, a domain owner would have to be prepared to register both the IDNA2003 version of the name and the IDNA2008 version of it, to ensure all users are properly redirected to his intended site. This is apparent on the following real domains.
  • http://fass.de
  • http://faß.de
If you are a German speaker you most likely consider them equivalent, as the 'ß' character is often expanded to 'ss'. That is how IDNA2003 treated that character, however, that's not how IDNA2008 treats it. If you use the Chrome browser which at the moment uses IDNA2003 (or more precisely IDNA2008 transitional), both of these URIs you will be re-directed to the same web-site, fass.de. However, if you use Firefox, which uses IDNA2008, you will be re-directed to two different web sites. The first being the fass.de and the second the xn--fa-hia.de.

That discrepancy was treated as a security issue by the curl and wget projects and was assigned CVE-2016-8625. Both projects switched to non-transitional IDNA2008.

 

What about certificates, can they address the issue above?

Unfortunately the above situation, cannot be fixed with X.509 certificates and in fact such a situation undermines the trust in them. The operation of X.509 certificates for web site authentication, is based on the uniqueness of domain names. In english language we can be sure that a domain name, whether entered in upper or lower case will be mapped to unique web-site. With internationalized names that's no longer the case.

What is unique in internationalized names is the final output domain, e.g., xn--kxawhku.gr, which for authentication purposes is meaningless as it is, so we have to rely on software to do the reverse mapping for us, on the right place. If the software we use uses different mapping rules than the rules applied by the registrar of the domain, users are left helpless as in the case above.

 

What to do now?

Although at this point, we know that IDNA2008 has quite some peculiarities which will be problematic in the future, we have no better option available. IDNA2003 cannot support new unicode standards and is already obsolete, so biting the bullet, and moving to non-transitional IDNA2008 seems like the right way to go. It is better to have a single and a little problematic standard, rather than have two active standards for domain name mapping.