2020-03-17 22:27:06

Subject: [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF

partial_name_hash() takes an unsigned long, which is wide enough to hold
one Unicode code point. Converting from UTF-32 to UTF-16 is therefore not
needed.

Signed-off-by: Pali Rohár <[email protected]>
---
 fs/exfat/namei.c | 10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/fs/exfat/namei.c b/fs/exfat/namei.c
index a8681d91f569..e0ec4ff366f5 100644
--- a/fs/exfat/namei.c
+++ b/fs/exfat/namei.c
@@ -147,16 +147,10 @@ static int exfat_utf8_d_hash(const struct dentry *dentry, struct qstr *qstr)
 			return charlen;

 		/*
-		 * Convert to UTF-16: code points above U+FFFF are encoded as
-		 * surrogate pairs.
 		 * exfat_toupper() works only for code points up to the U+FFFF.
 		 */
-		if (u > 0xFFFF) {
-			hash = partial_name_hash(exfat_high_surrogate(u), hash);
-			hash = partial_name_hash(exfat_low_surrogate(u), hash);
-		} else {
-			hash = partial_name_hash(exfat_toupper(sb, u), hash);
-		}
+		hash = partial_name_hash(u <= 0xFFFF ? exfat_toupper(sb, u) : u,
+					 hash);
 	}

 	qstr->hash = end_name_hash(hash);
--
2.20.1
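
To illustrate what the patch changes, here is a minimal userspace sketch, not the kernel code itself. It copies partial_name_hash() as it is quoted from stringhash.h in this thread and compares the old two-step surrogate hashing with the new single-step hashing. The helper names high_surrogate(), low_surrogate(), hash_step_old() and hash_step_new() are hypothetical, and case folding via exfat_toupper() is left out because it needs the volume's upcase table.

```c
#include <stdint.h>

/* Userspace copy of partial_name_hash() as quoted from stringhash.h. */
static unsigned long partial_name_hash(unsigned long c, unsigned long prevhash)
{
	return (prevhash + (c << 4) + (c >> 4)) * 11;
}

/* Standard UTF-16 split of a code point above U+FFFF (hypothetical helpers). */
static uint16_t high_surrogate(uint32_t u)
{
	return (uint16_t)(0xD800 + ((u - 0x10000) >> 10));
}

static uint16_t low_surrogate(uint32_t u)
{
	return (uint16_t)(0xDC00 + ((u - 0x10000) & 0x3FF));
}

/* Old behaviour: code points above U+FFFF are hashed as two 16-bit units. */
static unsigned long hash_step_old(uint32_t u, unsigned long hash)
{
	if (u > 0xFFFF) {
		hash = partial_name_hash(high_surrogate(u), hash);
		hash = partial_name_hash(low_surrogate(u), hash);
		return hash;
	}
	return partial_name_hash(u, hash);
}

/* New behaviour: every code point is hashed in a single step. */
static unsigned long hash_step_new(uint32_t u, unsigned long hash)
{
	return partial_name_hash(u, hash);
}
```

For code points up to U+FFFF both variants return the same value. Above U+FFFF the values differ, which should be harmless as long as names that compare equal still hash equal.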

2020-03-18 00:11:04

by Al Viro

Subject: Re: [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF

On Tue, Mar 17, 2020 at 11:25:52PM +0100, Pali Rohár wrote:
> Function partial_name_hash() takes long type value into which can be stored
> one Unicode code point. Therefore conversion from UTF-32 to UTF-16 is not
> needed.

Hmm... You might want to update the comment in stringhash.h...

2020-03-18 09:33:25

by Pali Rohár

Subject: Re: [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF

On Wednesday 18 March 2020 00:09:25 Al Viro wrote:
> On Tue, Mar 17, 2020 at 11:25:52PM +0100, Pali Rohár wrote:
> > Function partial_name_hash() takes long type value into which can be stored
> > one Unicode code point. Therefore conversion from UTF-32 to UTF-16 is not
> > needed.
>
> Hmm... You might want to update the comment in stringhash.h...

Well, initially I did not look at the hashing functions closely. The
hashing function used here is defined in stringhash.h as:

static inline unsigned long
partial_name_hash(unsigned long c, unsigned long prevhash)
{
	return (prevhash + (c << 4) + (c >> 4)) * 11;
}

I guess it was designed for 8-bit types, not for long (64-bit types), and
I'm not sure how effective it is even for the 16-bit types for which it is
already used.

So the question is: what should we do for either a 21-bit number (one
Unicode code point, the equivalent of UTF-32) or a sequence of 16-bit
numbers (UTF-16)?

Any opinion?

2020-03-28 23:43:14

by Pali Rohár

Subject: Re: [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF

On Wednesday 18 March 2020 10:32:51 Pali Rohár wrote:
> On Wednesday 18 March 2020 00:09:25 Al Viro wrote:
> > On Tue, Mar 17, 2020 at 11:25:52PM +0100, Pali Rohár wrote:
> > > Function partial_name_hash() takes long type value into which can be stored
> > > one Unicode code point. Therefore conversion from UTF-32 to UTF-16 is not
> > > needed.
> >
> > Hmm... You might want to update the comment in stringhash.h...
>
> Well, initially I have not looked at hashing functions deeply. Used
> hashing function in stringhash.h is defined as:
>
> static inline unsigned long
> partial_name_hash(unsigned long c, unsigned long prevhash)
> {
> return (prevhash + (c << 4) + (c >> 4)) * 11;
> }
>
> I guess it was designed for 8bit types, not for long (64bit types) and
> I'm not sure how effective it is even for 16bit types for which it is
> already used.
>
> So question is, what should we do for either 21bit number (one Unicode
> code point = equivalent of UTF-32) or for sequence of 16bit numbers
> (UTF-16)?
>
> Any opinion?

So what should we do with that hashing function?

Anyway, "[PATCH 4/4] exfat: Fix discard support" should be reviewed, as
discard support in exfat is currently broken.

2020-04-03 02:21:11

by [email protected]

Subject: Re: [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF

> I guess it was designed for 8bit types, not for long (64bit types) and
> I'm not sure how effective it is even for 16bit types for which it is
> already used.

In partial_name_hash(), when an 8-bit or 16-bit value is passed,
the upper 8-12 bits of the result tend to be 0.

> So question is, what should we do for either 21bit number (one Unicode
> code point = equivalent of UTF-32) or for sequence of 16bit numbers
> (UTF-16)?

If you want to get an unbiased hash value from an 8-bit or 16-bit value,
the hash32() function is a good choice.
ex1: Pre-mix the value with hash32().
hash = partial_name_hash(hash32(val16, 32), hash);
ex2: Use the hash32() function directly.
hash += hash32(val16, 32);

> partial_name_hash(unsigned long c, unsigned long prevhash)
> {
> return (prevhash + (c << 4) + (c >> 4)) * 11;
> }

Another way would be to replace the body of partial_name_hash():

return prevhash + hash32(c, 32);

2020-04-03 20:41:43

by Pali Rohár

Subject: Re: [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF

On Friday 03 April 2020 02:18:15 [email protected] wrote:
> > I guess it was designed for 8bit types, not for long (64bit types) and
> > I'm not sure how effective it is even for 16bit types for which it is
> > already used.
>
> In partial_name_hash (), when 8bit value or 16bit value is specified,
> upper 8-12bits tend to be 0.
>
> > So question is, what should we do for either 21bit number (one Unicode
> > code point = equivalent of UTF-32) or for sequence of 16bit numbers
> > (UTF-16)?
>
> If you want to get an unbiased hash value by specifying an 8 or 16-bit value,

Hello! In exfat we have a sequence of 21-bit values (not 8, not 16).

> the hash32() function is a good choice.
> ex1: Prepare by hash32 () function.
> hash = partial_name_hash (hash32 (val16,32), hash);
> ex2: Use the hash32() function directly.
> hash + = hash32 (val16,32);

Did you mean hash_32() function from linux/hash.h?

> > partial_name_hash(unsigned long c, unsigned long prevhash)
> > {
> > return (prevhash + (c << 4) + (c >> 4)) * 11;
> > }
>
> Another way may replace partial_name_hash().
>
> return prevhash + hash32(c,32)
>

2020-04-06 09:41:49

by [email protected]

Subject: RE: [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF

> > If you want to get an unbiased hash value by specifying an 8 or 16-bit
> > value,
>
> Hello! In exfat we have sequence of 21-bit values (not 8, not 16).

hash_32() generates a less-biased hash, even for 21-bit characters.

For a filename made up of the following kinds of characters, the hash from partial_name_hash() behaves as follows:
- 21-bit (surrogate pair): the upper 3 bits of the hash tend to be 0.
- 16-bit (mostly CJKV): the upper 8 bits of the hash tend to be 0.
- 8-bit (mostly Latin): the upper 16 bits of the hash tend to be 0.
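
The figures in this list can be reproduced for a single hashing step with a small userspace sketch. It assumes a 32-bit hash value and only looks at one call of partial_name_hash() starting from 0, so it says nothing about how the bias evolves over a whole filename. single_step_bits() is a hypothetical helper.

```c
/* Userspace copy of partial_name_hash() as quoted from stringhash.h. */
static unsigned long partial_name_hash(unsigned long c, unsigned long prevhash)
{
	return (prevhash + (c << 4) + (c >> 4)) * 11;
}

/*
 * Number of bits needed to hold the hash after one step from 0.
 * Hypothetical helper, used only to make the bias visible.
 */
static unsigned int single_step_bits(unsigned long c)
{
	unsigned long h = partial_name_hash(c, 0);
	unsigned int bits = 0;

	while (h) {
		bits++;
		h >>= 1;
	}
	return bits;
}
```

For the largest 8-bit, 16-bit and 21-bit inputs (0xFF, 0xFFFF and 0x1FFFFF) this gives 16, 24 and 29 bits, which leaves the upper 16, 8 and 3 bits of a 32-bit value clear.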

I think the more frequently used Latin/CJKV characters matter more than
surrogate pair characters when considering hash efficiency.

The hash of partial_name_hash() for 8/16-bit characters is also biased.
However, it works well.

Surrogate pair characters are used less frequently, and the hash of
partial_name_hash() has less bias than for 8/16 bit characters.

So I think there is no problem with your patch.

> Did you mean hash_32() function from linux/hash.h?

Oops. I forgot '_'.
hash_32() is correct.
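
For reference, a userspace sketch of what the generic hash_32() from linux/hash.h does: a multiplication by a 32-bit golden-ratio constant, keeping the top bits. This mirrors the generic version only (some architectures provide their own), and mixed_step() is a hypothetical name for the "ex1" variant suggested earlier in this thread.

```c
#include <stdint.h>

#define GOLDEN_RATIO_32 0x61C88647u

/* Generic multiplicative hash; bits must be in the range 1..32. */
static uint32_t hash_32(uint32_t val, unsigned int bits)
{
	/* The high bits are the best mixed, so keep those. */
	return (uint32_t)(val * GOLDEN_RATIO_32) >> (32 - bits);
}

/* Userspace copy of partial_name_hash() as quoted from stringhash.h. */
static unsigned long partial_name_hash(unsigned long c, unsigned long prevhash)
{
	return (prevhash + (c << 4) + (c >> 4)) * 11;
}

/* Hypothetical "ex1" variant: pre-mix the character before hashing it. */
static unsigned long mixed_step(uint32_t c, unsigned long prevhash)
{
	return partial_name_hash(hash_32(c, 32), prevhash);
}
```

With bits set to 32 the shift is a no-op, so hash_32(c, 32) is just the product truncated to 32 bits.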

---
Kohada Tetsuhiro <[email protected]>

2020-04-07 10:07:53

by Pali Rohár

Subject: Re: [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF

On Monday 06 April 2020 09:37:38 [email protected] wrote:
> > > If you want to get an unbiased hash value by specifying an 8 or 16-bit
> > > value,
> >
> > Hello! In exfat we have sequence of 21-bit values (not 8, not 16).
>
> hash_32() generates a less-biased hash, even for 21-bit characters.
>
> The hash of partial_name_hash() for the filename with the following character is ...
> - 21-bit(surrogate pair): the upper 3-bits of hash tend to be 0.
> - 16-bit(mostly CJKV): the upper 8-bits of hash tend to be 0.
> - 8-bit(mostly latin): the upper 16-bits of hash tend to be 0.
>
> I think the more frequently used latin/CJKV characters are more important
> when considering the hash efficiency of surrogate pair characters.
>
> The hash of partial_name_hash() for 8/16-bit characters is also biased.
> However, it works well.
>
> Surrogate pair characters are used less frequently, and the hash of
> partial_name_hash() has less bias than for 8/16 bit characters.
>
> So I think there is no problem with your patch.

So partial_name_hash(), as I used it in this patch series, is enough?

2020-04-08 04:03:20

by [email protected]

Subject: RE: [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF

> So partial_name_hash() like I used it in this patch series is enough?

I think partial_name_hash() is enough for 8/16/21bit characters.

Another point, about distinguishing 21-bit characters:
I think the check can be simplified further by moving it into exfat_toupper().

ex: return a < PLANE_SIZE && sbi->vol_utbl[a] ? sbi->vol_utbl[a] : a;
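
A standalone sketch of that one-liner, assuming the upcase table is a flat array of 0x10000 16-bit entries in which 0 means "no mapping". The function name toupper_wide() and passing the table as a plain pointer are assumptions made for illustration; the in-kernel exfat_toupper() takes a super_block instead.

```c
#include <stdint.h>

#define PLANE_SIZE 0x10000u

/*
 * Look up a code point in a 16-bit upcase table. Values at or above
 * PLANE_SIZE, and entries that are 0, are returned unchanged.
 */
static uint32_t toupper_wide(const uint16_t *vol_utbl, uint32_t a)
{
	return (a < PLANE_SIZE && vol_utbl[a]) ? vol_utbl[a] : a;
}
```

With the range check inside the helper, a caller could pass any code point without testing for U+FFFF first.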

---
Kohada Tetsuhiro <[email protected]>

2020-04-08 09:58:57

by Pali Rohár

[permalink] [raw]

Subject: Re: [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF

On Wednesday 08 April 2020 03:59:06 [email protected] wrote:
> > So partial_name_hash() like I used it in this patch series is enough?
>
> I think partial_name_hash() is enough for 8/16/21bit characters.

Great!

Al, could you please take this patch series?

> Another point about the discrimination of 21bit characters:
> I think that checking in exfat_toupper () can be more simplified.
>
> ex: return a < PLANE_SIZE && sbi->vol_utbl[a] ? sbi->vol_utbl[a] : a;

I was thinking about it, but it needs more refactoring. Currently
exfat_toupper() is used in other places for UTF-16 (u16 array) and
therefore it cannot be extended to take more than a 16-bit value.

But I agree that this is another step which can be improved.

2020-04-13 09:15:17

by [email protected]

Subject: RE: [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF

> On Wednesday 08 April 2020 03:59:06 [email protected] wrote:
> > > So partial_name_hash() like I used it in this patch series is enough?
> >
> > I think partial_name_hash() is enough for 8/16/21bit characters.
>
> Great!
>
> Al, could you please take this patch series?

I think it's good.

> > Another point about the discrimination of 21bit characters:
> > I think that checking in exfat_toupper () can be more simplified.
> >
> > ex: return a < PLANE_SIZE && sbi->vol_utbl[a] ? sbi->vol_utbl[a] : a;
>
> I was thinking about it, but it needs more refactoring. Currently
> exfat_toupper() is used on other places for UTF-16 (u16 array) and therefore it cannot be extended to take more then 16
> bit value.

I’m also a little worried that exfat_toupper() is designed only for UTF-16.
Currently, names are converted from UTF-8 to UTF-32 in some places, and from UTF-8 to UTF-16 in others.
Another way would be to unify on UTF-16.

> But I agree that this is another step which can be improved.

Yes.

2020-04-13 10:14:04

by Pali Rohár

Subject: Re: [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF

On Monday 13 April 2020 08:13:45 [email protected] wrote:
> > On Wednesday 08 April 2020 03:59:06 [email protected] wrote:
> > > > So partial_name_hash() like I used it in this patch series is enough?
> > >
> > > I think partial_name_hash() is enough for 8/16/21bit characters.
> >
> > Great!
> >
> > Al, could you please take this patch series?
>
> I think it's good.
>
>
> > > Another point about the discrimination of 21bit characters:
> > > I think that checking in exfat_toupper () can be more simplified.
> > >
> > > ex: return a < PLANE_SIZE && sbi->vol_utbl[a] ? sbi->vol_utbl[a] : a;
> >
> > I was thinking about it, but it needs more refactoring. Currently
> > exfat_toupper() is used on other places for UTF-16 (u16 array) and therefore it cannot be extended to take more then 16
> > bit value.
>
> I’m also a little worried that exfat_toupper() is designed for only utf16.
> Currently, it is converting from utf8 to utf32 in some places, and from utf8 to utf16 in others.
> Another way would be to unify to utf16.
>
> > But I agree that this is another step which can be improved.
>
> Yes.

There are two problems with it:

We do not know how code points above U+FFFF should be converted to upper
case. The exfat specification only lets us deduce this for code points
U+0000 .. U+FFFF. We asked whether we can get an answer from MS, but
I have not received any response yet.

The second problem is that all MS filesystems (vfat, ntfs and exfat) use
neither UCS-2 nor UTF-16, but rather a mix of the two. Basically any
sequence of 16-bit values (except those :/<>... vfat chars) is valid,
even an unpaired surrogate half. So a surrogate pair (two 16-bit values)
represents one Unicode code point (as in UTF-16), but an unpaired
surrogate half is also valid and represents the (invalid) Unicode code
point of its own value. Unicode defines no valid characters for the
values of a single surrogate half.

Therefore, if we talk about encoding UTF-16 vs UTF-32, we first need to
settle how to handle those non-representable values in the VFS encoding
(iocharset=), as UTF-8 is not able to represent them either. One option
is to extend UTF-8 to the WTF-8 encoding [1] (yes, this is real and makes
sense!) and then ideally change exfat_toupper() to UTF-32 without the
restriction on surrogate pair values.

Btw, the same problem with UTF-16 also exists in the vfat, ntfs and
iso/joliet kernel drivers.

[1] - https://simonsapin.github.io/wtf-8/
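
To make the WTF-8 idea concrete, here is a userspace sketch of converting 16-bit units to WTF-8. Properly paired surrogates are combined into one supplementary code point and encoded in four bytes, while an unpaired surrogate half is encoded as a three-byte sequence of its own. The function names put_cp() and utf16_to_wtf8() are hypothetical; this is not an existing kernel API.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point, lone surrogates included, as generalized UTF-8. */
static size_t put_cp(uint32_t cp, uint8_t *out)
{
	if (cp < 0x80) {
		out[0] = (uint8_t)cp;
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (uint8_t)(0xC0 | (cp >> 6));
		out[1] = (uint8_t)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp < 0x10000) {
		out[0] = (uint8_t)(0xE0 | (cp >> 12));
		out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (uint8_t)(0x80 | (cp & 0x3F));
		return 3;
	}
	out[0] = (uint8_t)(0xF0 | (cp >> 18));
	out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
	out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
	out[3] = (uint8_t)(0x80 | (cp & 0x3F));
	return 4;
}

/* Convert n 16-bit units to WTF-8; out must have room for 3 * n bytes. */
static size_t utf16_to_wtf8(const uint16_t *in, size_t n, uint8_t *out)
{
	size_t len = 0;

	for (size_t i = 0; i < n; i++) {
		uint32_t cp = in[i];

		/* Combine a high surrogate with a following low surrogate. */
		if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < n &&
		    in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
			cp = 0x10000 + ((cp - 0xD800) << 10) +
			     (in[i + 1] - 0xDC00);
			i++;
		}
		len += put_cp(cp, out + len);
	}
	return len;
}
```

Because every unpaired half keeps a distinct byte sequence, two names that differ only in such halves stay distinguishable after conversion.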

2020-04-14 15:11:55

by [email protected]

[permalink] [raw]

Subject: RE: [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF

> We do not know how code points above U+FFFF could be converted to upper case.

Code points above U+FFFF do not need to be converted to uppercase.

> Basically from exfat specification can be deduced it only for
> U+0000 .. U+FFFF code points.

The exFAT specification (sec. 7.2.5.1) says:
-- table shall cover the complete Unicode character range (from character codes 0000h to FFFFh inclusive).

UCS-2, UCS-4, and UTF-16 terms do not appear in the exfat specification.
It just says "Unicode".

> Second problem is that all MS filesystems (vfat, ntfs and exfat) do not use UCS-2 nor UTF-16, but rather some mix between
> it. Basically any sequence of 16bit values (except those :/<>... vfat chars) is valid, even unpaired surrogate half. So
> surrogate pair (two 16bit values) represents one unicode code point (as in UTF-16), but one unpaired surrogate half is
> also valid and represent (invalid) unicode code point of its value. In unicode are not defined code points for values
> of single / half surrogate.

Microsoft's file systems use the UTF-16 encoded UCS-4 code set.
The character type is basically 'wchar_t' (16-bit).
The description "0000h to FFFFh" also assumes the use of 'wchar_t'.

This “0000h to FFFFh” range also includes surrogate characters (U+D800 to U+DFFF),
but these should not be converted to upper case.
Passing a surrogate character to RtlUpcaseUnicodeChar() on Windows just returns the same value.
(RtlUpcaseUnicodeChar() is one of the Windows native APIs.)

If the upcase table contains surrogate characters, exfat_toupper() will perform an incorrect conversion.
With the current implementation, the results of exfat_utf8_d_cmp() and exfat_uniname_ncmp() may then differ.

The standard exfat upcase table does not contain surrogate characters, so the problem does not occur in practice.
To be stricter, D800h to DFFFh should be excluded either when loading the upcase table or in exfat_toupper().
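
One way the "exclude when loading" option could look, as a userspace sketch. It assumes the loaded table is a flat array of 0x10000 16-bit entries where 0 means "no mapping", and utbl_drop_surrogates() is a hypothetical helper, not an existing exfat function.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Clear any mapping for D800h..DFFFh so that a later table lookup treats
 * surrogate values as identity. Returns the number of entries cleared.
 */
static size_t utbl_drop_surrogates(uint16_t *utbl)
{
	size_t dropped = 0;

	for (uint32_t i = 0xD800; i <= 0xDFFF; i++) {
		if (utbl[i] != 0) {
			utbl[i] = 0;
			dropped++;
		}
	}
	return dropped;
}
```

A non-zero return value would indicate a non-standard upcase table, which a driver might want to log.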

> Therefore if we talk about encoding UTF-16 vs UTF-32 we first need to fix a way how to handle those non-representative
> values in VFS encoding (iocharset=) as UTF-8 is not able to represent it too. One option is to extend UTF-8 to WTF-8
> encoding [1] (yes, this is a real and make sense!) and then ideally change exfat_toupper() to UTF-32 without restriction
> for surrogate pairs values.

WTF-8 is new to me.
That's an interesting idea, but is it needed for exfat?

For characters over U+FFFF,
- For UTF-32, a value of 0x10000 or more
- For UTF-16, a value from 0xd800 to 0xdfff
I think these are just "don't convert to uppercase."

If the File Name Directory Entry contains illegal surrogate characters (such as an unpaired surrogate half),
they will simply be ignored by utf16s_to_utf8s().
The string after UTF-8 conversion does not contain any illegal byte sequence.

> Btw, same problem with UTF-16 also in vfat, ntfs and also in iso/joliet kernel drivers.

Ugh...

BR
---
Kohada Tetsuhiro <[email protected]>

2020-04-14 15:12:31

by Pali Rohár

Subject: Re: [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF

On Tuesday 14 April 2020 09:29:32 [email protected] wrote:
> > We do not know how code points above U+FFFF could be converted to upper case.
>
> Code points above U+FFFF do not need to be converted to uppercase.
>
> > Basically from exfat specification can be deduced it only for
> > U+0000 .. U+FFFF code points.
>
> exFAT specifications (sec.7.2.5.1) saids ...
> -- table shall cover the complete Unicode character range (from character codes 0000h to FFFFh inclusive).
>
> UCS-2, UCS-4, and UTF-16 terms do not appear in the exfat specification.
> It just says "Unicode".

That is because in the MS world, the term "Unicode" often means UCS-2 or
UTF-16. You need a crystal ball to understand their specifications
correctly.

>
> > Second problem is that all MS filesystems (vfat, ntfs and exfat) do not use UCS-2 nor UTF-16, but rather some mix between
> > it. Basically any sequence of 16bit values (except those :/<>... vfat chars) is valid, even unpaired surrogate half. So
> > surrogate pair (two 16bit values) represents one unicode code point (as in UTF-16), but one unpaired surrogate half is
> > also valid and represent (invalid) unicode code point of its value. In unicode are not defined code points for values
> > of single / half surrogate.
>
> Microsoft's File Systems uses the UTF-16 encoded UCS-4 code set.
> The character type is basically 'wchar_t'(16bit).
> The description "0000h to FFFFh" also assumes the use of 'wchar_t'.
>
> This “0000h to FFFFh” also includes surrogate characters(U+D800 to U+DFFF),
> but these should not be converted to upper case.
> Passing a surrogate character to RtlUpcaseUnicodeChar() on Windows, just returns the same value.
> (* RtlUpcaseUnicodeChar() is one of Windows native API)
>
> If the upcase-table contains surrogate characters, exfat_toupper() will cause incorrect conversion.
> With the current implementation, the results of exfat_utf8_d_cmp() and exfat_uniname_ncmp() may differ.
>
> The normal exfat's upcase-table does not contain surrogate characters, so the problem does not occur.
> To be more strict...
> D800h to DFFFh should be excluded when loading upcase-table or in exfat_toupper().

Exactly, that is why surrogate pairs cannot be passed to any "to upper"
function. Or rather, the "to upper" function needs to be the identity for
them so as not to break anything. "To upper" does not make any sense on
one u16 item from a UTF-16 sequence when you do not have a complete code
point. So the API for a UTF-16 "to upper" function needs to take the full
string, not just one u16.

So for code points above U+FFFF, some other mechanism is needed to
represent the upcase table (e.g. by providing the full UTF-16 pair or the
code point encoded in UTF-32). This is unknown, and it is the reason why
I asked the question which was, IIRC, forwarded to MS.

> > Therefore if we talk about encoding UTF-16 vs UTF-32 we first need to fix a way how to handle those non-representative
> > values in VFS encoding (iocharset=) as UTF-8 is not able to represent it too. One option is to extend UTF-8 to WTF-8
> > encoding [1] (yes, this is a real and make sense!) and then ideally change exfat_toupper() to UTF-32 without restriction
> > for surrogate pairs values.
>
> WTF-8 is new to me.
> That's an interesting idea, but is it needed for exfat?
>
> For characters over U+FFFF,
> -For UTF-32, a value of 0x10000 or more
> -For UTF-16, the value from 0xd800 to 0xdfff
> I think these are just "don't convert to uppercase."
>
> If the File Name Directory Entry contains illegal surrogate characters(such as one unpaired surrogate half),
> it will simply be ignored by utf16s_to_utf8s().

This is an example of why it can be useful for exfat on Linux. An exfat
filename can consist of just a sequence of unpaired halves of surrogate
pairs. Such a name is not representable in UTF-8, but is valid in exfat.
Therefore the current Linux kernel exfat driver with UTF-8 encoding cannot
handle such filenames. But with WTF-8 it is possible.

So if we want userspace to be able to read such files from an exfat
fs, some mechanism for converting "unpaired halves" to a NUL-terminated
char* string suitable for filenames is needed. And WTF-8 seems like a good
choice, as it is backward compatible with UTF-8.

> string after utf8 conversion does not include illegal byte sequence.

Yes, but this is a lossy conversion. If you had two filenames that differ
only in their "surrogate halves", they would be converted to the same file
name, so you would not be able to access both of them.

>
> > Btw, same problem with UTF-16 also in vfat, ntfs and also in iso/joliet kernel drivers.
>
> Ugh...
>
>
> BR
> ---
> Kohada Tetsuhiro <[email protected]>
>

2020-04-15 22:21:55

by [email protected]

Subject: RE: [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF

> > UCS-2, UCS-4, and UTF-16 terms do not appear in the exfat specification.
> > It just says "Unicode".
>
> That is because in MS world, "Unicode" term lot of times means UCS-2 or UTF-16.

For example, the Joliet Specification describes using UCS-2 for character sets.
Similarly, the UDF Specification describes using Unicode Version 2.0 for character sets.
However, Windows file systems also accept UTF-16 encoded UCS-4.
The foundation of their main product (Windows NT) was designed in the era when UTF-16 and UCS-2 were equivalent.
The non-BMP planes were probably not fully considered.

> You need to have a crystal ball to correctly understand their specifications.

Exactly!!
My crystal ball says ...
"They've designed D800-DFFF to be a mysterious area, so it's going through it."

> > Microsoft's File Systems uses the UTF-16 encoded UCS-4 code set.
> > The character type is basically 'wchar_t'(16bit).
> > The description "0000h to FFFFh" also assumes the use of 'wchar_t'.
> >
> > This “0000h to FFFFh” also includes surrogate characters(U+D800 to
> > U+DFFF), but these should not be converted to upper case.
> > Passing a surrogate character to RtlUpcaseUnicodeChar() on Windows, just returns the same value.
> > (* RtlUpcaseUnicodeChar() is one of Windows native API)
> >
> > If the upcase-table contains surrogate characters, exfat_toupper() will cause incorrect conversion.
> > With the current implementation, the results of exfat_utf8_d_cmp() and exfat_uniname_ncmp() may differ.
> >
> > The normal exfat's upcase-table does not contain surrogate characters, so the problem does not occur.
> > To be more strict...
> > D800h to DFFFh should be excluded when loading upcase-table or in exfat_toupper().
>
> Exactly, that is why surrogate pairs cannot be put into any "to upper"
> function. Or rather "to upper" function needs to be identity for them to not break anything. "to upper" does not make
> any sense on one u16 item from UTF-16 sequence when you do not have a complete code point.
> So API for UTF-16 "to upper" function needs to take full string, not just one u16.
>
> So for code points above U+FFFF it is needed some other mechanism how to represent upcase table (e.g. by providing full
> UTF-16 pair or code point encoded in UTF-32). And this is unknown and reason why I put question which was IIRC forwarded
> to MS.

That's exactly the case for a "generic" UTF-16 toupper function.
However, exfat (and the other MS file systems) does not require uppercase conversion for characters in non-BMP planes.
For non-BMP characters, I think it's enough to just do nothing (no skip, no conversion), just like Windows.

> > WTF-8 is new to me.
> > That's an interesting idea, but is it needed for exfat?
> >
> > For characters over U+FFFF,
> > -For UTF-32, a value of 0x10000 or more -For UTF-16, the value from
> > 0xd800 to 0xdfff I think these are just "don't convert to uppercase."
> >
> > If the File Name Directory Entry contains illegal surrogate
> > characters(such as one unpaired surrogate half), it will simply be ignored by utf16s_to_utf8s().
>
> This is the example why it can be useful for exfat on linux. exfat filename can contain just sequence of unpaired halves
> of surrogate pairs. Such thing is not representable in UTF-8, but valid in exfat.
> Therefore current linux kernel exfat driver with UTF-8 encoding cannot handle such filenames. But with WTF-8 it is possible.

In fact, exfat (and the other MS file systems) accepts unpaired surrogate characters.
But this is illegal Unicode.
Also, it is very rarely generated by normal user operation (except for VFAT short names).
Illegal Unicode characters have often been a security risk, and I think they should not be accepted, even if that is possible.

> So if we want that userspace would be able to read such files from exfat fs, some mechanism for converting "unpaired halves"
> to NULL-term char* string suitable for filenames is needed. And WTF-8 seems like a good choice as it is backward compatible
> with UTF-8.

I think there is very little need to access such file names.
It is rare to use non-BMP characters in file names, and it is even rarer to illegally record only half of one.

> > string after utf8 conversion does not include illegal byte sequence.
>
> Yes, but this is loosy conversion. When you would have two filenames with different "surrogate halves" they would be converted
> to same file name. So you would not be able to access both of them.

I also think there is a problem with this conversion.
Illegal byte sequences are stripped off, and behave as if they didn't exist
from the beginning (like a legal UTF-8 string).
I think it's safest to fail the conversion if an illegal byte sequence is detected.
Another popular approach is to replace it with another character (such as '_').
(not perfect, but works reasonably)
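
The three behaviours discussed here (strip, replace, fail) can be compared with a small userspace sketch that works on 16-bit units before any UTF-8 conversion. The enum lone_policy and the function sanitize_utf16() are hypothetical names; this does not reproduce utf16s_to_utf8s() itself.

```c
#include <stddef.h>
#include <stdint.h>

enum lone_policy { LONE_STRIP, LONE_REPLACE, LONE_FAIL };

static int is_high(uint16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
static int is_low(uint16_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }

/*
 * Copy n units from in to out (capacity at least n), applying the given
 * policy to unpaired surrogate halves. Returns the number of units
 * written, or -1 if the policy is LONE_FAIL and a lone half was found.
 */
static long sanitize_utf16(const uint16_t *in, size_t n, uint16_t *out,
			   enum lone_policy policy)
{
	size_t len = 0;

	for (size_t i = 0; i < n; i++) {
		if (is_high(in[i]) && i + 1 < n && is_low(in[i + 1])) {
			/* Well-formed pair: keep both units. */
			out[len++] = in[i];
			out[len++] = in[++i];
		} else if (is_high(in[i]) || is_low(in[i])) {
			if (policy == LONE_FAIL)
				return -1;
			if (policy == LONE_REPLACE)
				out[len++] = '_';
			/* LONE_STRIP: drop the unit silently. */
		} else {
			out[len++] = in[i];
		}
	}
	return (long)len;
}
```

With LONE_STRIP or LONE_REPLACE, two names that differ only in a lone surrogate half end up identical, which is the collision described above. Only LONE_FAIL avoids silently merging them.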

Anyway, we don't need to convert non-BMP characters or unpaired surrogate characters
to uppercase in exfat(and other MS-FSs).

BR
---
Kohada Tetsuhiro <[email protected]>