On 2020-03-02, lampahome <[email protected]> wrote:
> With the case-insensitive lookup support added in kernel 5.2, d_compare
> transforms each string into a normalized form and then compares them.
>
> But why do we need this normalization function? Couldn't we just compare
> the UTF-8 strings directly?
The problem is that there are multiple ways to represent the same glyph
in Unicode -- for instance, you can represent Å (the symbol for
angstrom) as both U+212B and U+0041 U+030A (the latin letter "A"
followed by the ring-above symbol "°"). Different software may choose to
represent the same glyphs in different Unicode forms, hence the need for
normalisation.
[1] is the Wikipedia article that describes this problem and what the
different kinds of Unicode normalisation are.
[1]: https://en.wikipedia.org/wiki/Unicode_equivalence
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
On 2020-03-02, Aleksa Sarai <[email protected]> wrote:
> On 2020-03-02, lampahome <[email protected]> wrote:
> > With the case-insensitive lookup support added in kernel 5.2, d_compare
> > transforms each string into a normalized form and then compares them.
> >
> > But why do we need this normalization function? Couldn't we just compare
> > the UTF-8 strings directly?
>
> The problem is that there are multiple ways to represent the same glyph
> in Unicode -- for instance, you can represent Å (the symbol for
> angstrom) as both U+212B and U+0041 U+030A (the latin letter "A"
> followed by the ring-above symbol "°"). Different software may choose to
> represent the same glyphs in different Unicode forms, hence the need for
> normalisation.
Sorry, a better example would've been "ñ" (U+00F1). You can also
represent it as "n" (U+006E) followed by "◌̃" (U+0303 -- "combining
tilde"). Both forms are defined by Unicode to be canonically equivalent
so it would be incorrect to treat the two Unicode strings differently
(that isn't quite the case for "Å").
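
To make the "ñ" case concrete, here is a minimal userspace sketch in
Python. It uses the standard unicodedata module, not the kernel's own
code, and NFD is picked here only as an example normalisation form. The
variable names are purely illustrative.

```python
import unicodedata

precomposed = "\u00f1"   # "ñ" as a single code point
decomposed = "n\u0303"   # "n" followed by U+0303 (combining tilde)

# The raw UTF-8 encodings differ, so a plain byte comparison fails.
raw_equal = precomposed.encode("utf-8") == decomposed.encode("utf-8")

# After canonical decomposition (NFD) both strings become the same
# code point sequence, and therefore the same UTF-8 bytes.
nfd_a = unicodedata.normalize("NFD", precomposed)
nfd_b = unicodedata.normalize("NFD", decomposed)
normalized_equal = nfd_a.encode("utf-8") == nfd_b.encode("utf-8")

print(raw_equal, normalized_equal)  # prints: False True
```

Without the normalisation step, the two spellings of the same name
would be treated as two different strings.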
> [1] is the Wikipedia article that describes this problem and what the
> different kinds of Unicode normalisation are.
>
> [1]: https://en.wikipedia.org/wiki/Unicode_equivalence
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
> Sorry, a better example would've been "ñ" (U+00F1). You can also
> represent it as "n" (U+006E) followed by "◌̃" (U+0303 -- "combining
> tilde"). Both forms are defined by Unicode to be canonically equivalent
> so it would be incorrect to treat the two Unicode strings differently
> (that isn't quite the case for "Å").
So the UTF-8 normalization will convert both "ñ" (U+00F1) and "n"
(U+006E) followed by "◌̃" (U+0303) into the same normalized sequence,
right? And then the two results are compared byte by byte.
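
That understanding can be checked from userspace with a rough Python
sketch. It assumes the comparison works as "decompose (NFD), case-fold,
then compare bytes"; the kernel does this in C with its own tables, so
fold_name() and names_match() below are hypothetical helpers for
illustration only, not the kernel's API.

```python
import unicodedata

def fold_name(name: str) -> bytes:
    """Hypothetical helper: normalise and case-fold a name to bytes."""
    decomposed = unicodedata.normalize("NFD", name)
    folded = decomposed.casefold()
    # Case folding can produce new characters, so normalise once more.
    return unicodedata.normalize("NFD", folded).encode("utf-8")

def names_match(a: str, b: str) -> bool:
    """Compare two names byte by byte after normalisation and folding."""
    return fold_name(a) == fold_name(b)

print(names_match("\u00f1", "n\u0303"))          # both spellings of "ñ"
print(names_match("MA\u00d1ANA", "man\u0303ana"))  # differs in case and form
print(names_match("n", "\u00f1"))                # genuinely different names
```

The first two calls should report a match and the last one should not,
since "n" and "ñ" remain different after normalisation.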