2020-03-03 10:14:40

by lampahome

Subject: Re: why do we need utf8 normalization when compare name?

> Unicode normalisation will take the strings "ñ" (U+00F1) and "n◌̃"
> (U+006E U+0303) and turn them into the same Unicode string. Note that
> there are four kinds of Unicode normalisation (NFD, NFC, NFKD, NFKC), so
> what precise string you end up with depends on which form you're using.
> Linux uses NFD, I believe.


> And yes, once the strings are normalised and encoded as UTF-8 you then
> do a byte-by-byte comparison (if the comparison is case-insensitive then
> fs/unicode/... will case-fold the Unicode symbols during normalisation).
>

What I'm confused about is why the strings are encoded as UTF-8 after
normalization has finished. From the above, normalization turns "ñ" (U+00F1)
and "n◌̃" (U+006E U+0303) into the same Unicode string. Then why should we
just compare the bytes of the normalized strings?


2020-03-03 17:47:05

by Theodore Ts'o

Subject: Re: why do we need utf8 normalization when compare name?

On Tue, Mar 03, 2020 at 06:13:56PM +0800, lampahome wrote:
>
> > And yes, once the strings are normalised and encoded as UTF-8 you then
> > do a byte-by-byte comparison (if the comparison is case-insensitive then
> > fs/unicode/... will case-fold the Unicode symbols during normalisation).
>
> What I'm confused about is why the strings are encoded as UTF-8 after
> normalization has finished. From the above, normalization turns "ñ" (U+00F1)
> and "n◌̃" (U+006E U+0303) into the same Unicode string. Then why should we
> just compare the bytes of the normalized strings?

For the same reason that we don't upcase or downcase all of the letters
in a directory with case-folding. The term for this is
"case-preserving, case-insensitive" matching. So that means that if
you save a file as "Makefile", ls will return "Makefile", and not
"MAKEFILE" or "makefile".

Of course, if you delete or truncate "makefile", it will affect the
file stored in the directory as "Makefile", and the file system will
not allow a directory with case-folding enabled to contain "makefile"
and "Makefile" at the same time.
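
The behaviour described above can be sketched with a toy in-memory
directory. This is only an illustration, not how any file system or
fs/unicode implements it: the CaseFoldDir class and its method names are
hypothetical, and Python's str.casefold() stands in for the kernel's
case-folding.

```python
class CaseFoldDir:
    """Toy directory: names match case-insensitively but are stored as given."""

    def __init__(self):
        # Maps the case-folded lookup key to the name exactly as created.
        self._entries = {}

    def create(self, name):
        key = name.casefold()
        if key in self._entries:
            # "makefile" and "Makefile" cannot coexist in this directory.
            raise FileExistsError(self._entries[key])
        self._entries[key] = name

    def lookup(self, name):
        # Returns the stored (case-preserved) name, or None if absent.
        return self._entries.get(name.casefold())

    def unlink(self, name):
        # Removing "makefile" removes the entry stored as "Makefile".
        del self._entries[name.casefold()]

    def listing(self):
        # What an "ls" would show: the names as they were originally saved.
        return sorted(self._entries.values())
```

The folded key is used only for matching; the listing always returns the
name that was originally saved.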

Similarly, with normalization, we preserve the existing UTF-8 form
(both the composed and decomposed forms are valid UTF-8), but we
compare without taking the composition form into account.
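
As a rough illustration of that comparison, here is a small Python sketch
using the standard unicodedata module. It is not the kernel code: the
same_name() helper is a hypothetical name, and NFD is chosen here only
because it is the form mentioned earlier in the thread.

```python
import unicodedata

composed = "\u00f1"      # "ñ" as a single code point (U+00F1)
decomposed = "n\u0303"   # "n" followed by COMBINING TILDE (U+006E U+0303)

def same_name(a, b, form="NFD"):
    """Compare two names after normalising both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# The stored UTF-8 bytes differ, so a plain byte comparison does not match.
raw_equal = composed.encode("utf-8") == decomposed.encode("utf-8")

# After normalisation the two names compare equal; the stored bytes of
# each name are left exactly as they were written.
normalised_equal = same_name(composed, decomposed)
```

Here raw_equal is False and normalised_equal is True: normalization is
applied only for the comparison, and neither stored name is rewritten.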

Cheers,

- Ted

P.S. Some people may hate this, but if the goal is interoperability
with how Windows and MacOS do things, this is basically what they do
as well. (Well, mostly; MacOS is a little weird for historical
reasons.)

P.P.S. And before you comment on it, as one internationalization
expert once said, I18N *is* complicated. It truly would be easier to
teach all of the world to speak a single language and use it as the
"Federation Standard" language, à la Star Trek. For better or for
worse, that's not happening, and so we deal with the world as it is,
not as we would like it to be. :-)