2020-01-19 22:16:06

by Pali Rohár

Subject: vfat: Broken case-insensitive support for UTF-8

Hello!

I have looked more deeply at how the vfat kernel code handles UTF-8
encoding, and I found out that case-insensitivity is broken, or rather
not implemented at all.

The fat_fill_super() function already has a FIXME comment about this problem:

	/* FIXME: utf8 is using iocharset for upper/lower conversion */
	if (sbi->options.isvfat) {
		sbi->nls_io = load_nls(sbi->options.iocharset);

Basically, vfat always loads an NLS table, which is used for the strnicmp
and tolower functions. When no iocharset is specified, the default
(iso8859-1) is used. This applies even when the utf8=1 mount option is
specified. Also note that the kernel's utf8 NLS table does not implement
the toupper/tolower functions (the kernel's NLS API does not support
tolower/toupper for encodings that are not fixed 8-bit, like UTF-8).

So when UTF-8 is enabled for VFAT, the VFS <--> VFAT conversion uses the
utf16s_to_utf8s() and utf8s_to_utf16s() functions. But the
fat_name_match(), vfat_hashi() and vfat_cmpi() functions use the NLS
table (default iso8859-1) with nls_strnicmp() and nls_tolower().

Which means that fat_name_match(), vfat_hashi() and vfat_cmpi() are
broken for vfat in UTF-8 mode.

I was thinking about how to fix it, and the only possible way is to write
a uni_tolower() function which takes one Unicode code point and returns
the lowercase of that code point. We cannot do any Unicode
normalization, as the VFAT specification does not say anything about it
and the MS reference implementation, fastfat.sys, does not do it either.

So, what would be the best option for implementing that function?

unicode_t uni_tolower(unicode_t u);

Could the new fs/unicode code help with it? Or is it too tied to NFD
normalization, and therefore cannot be easily used or extended?
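
To make the question concrete, here is a minimal sketch of what a
uni_tolower() with that prototype could look like. It is illustrative
only: the typedef is a stand-in for the kernel's unicode_t, and the three
ranges covered are a tiny hand-picked subset of simple 1:1 mappings. It is
not the Unicode case folding data and not the mapping fastfat.sys uses.

```c
#include <stdint.h>

/* Stand-in for the kernel's unicode_t (an assumption for this sketch). */
typedef uint32_t unicode_t;

/*
 * Illustrative only: lowercase one code point using simple 1:1 mappings
 * for three small ranges. A real table would have to cover far more of
 * Unicode and match whatever mapping the native implementation uses.
 */
unicode_t uni_tolower(unicode_t u)
{
	/* ASCII 'A'..'Z' */
	if (u >= 0x41 && u <= 0x5A)
		return u + 0x20;
	/* Latin-1 Supplement U+00C0..U+00DE, skipping U+00D7 (multiplication sign) */
	if (u >= 0xC0 && u <= 0xDE && u != 0xD7)
		return u + 0x20;
	/* Latin Extended-A U+0100..U+012F: even code points are uppercase */
	if (u >= 0x0100 && u <= 0x012F && (u & 1) == 0)
		return u + 1;
	return u;
}
```

With such a function, 'Č' (U+010C) and 'č' (U+010D) both map to U+010D, so
a comparison done on code points instead of bytes would treat them as the
same name.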

The new exfat code, which is under review and hopefully will be merged,
contains its own Unicode upcase table (as defined by the exfat
specification). As exfat is similar to FAT32, maybe reusing it would be a
better option?


========================================================================

Proof that vfat in UTF-8 mode is broken and must be fixed:

$ mount | grep /mnt/fat
/tmp/fat2 on /mnt/fat type vfat
(rw,relatime,uid=1000,gid=1000,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,utf8,errors=remount-ro)
$ ll /mnt/fat/
total 1
drwxr-xr-x 2 pali pali 512 Jan 19 22:50 ./
drwxrwxrwt 4 root root 80 Jan 19 22:45 ../
$ touch /mnt/fat/č
$ ll /mnt/fat/
total 1
drwxr-xr-x 2 pali pali 512 Jan 19 22:50 ./
drwxrwxrwt 4 root root 80 Jan 19 22:45 ../
-rwxr-xr-x 1 pali pali 0 Jan 19 22:50 č*
$ touch /mnt/fat/Č
$ ll /mnt/fat/
total 1
drwxr-xr-x 2 pali pali 512 Jan 19 22:50 ./
drwxrwxrwt 4 root root 80 Jan 19 22:45 ../
-rwxr-xr-x 1 pali pali 0 Jan 19 22:50 Č*
-rwxr-xr-x 1 pali pali 0 Jan 19 22:50 č*
$ touch /mnt/fat/d
$ ll /mnt/fat/
total 1
drwxr-xr-x 2 pali pali 512 Jan 19 22:50 ./
drwxrwxrwt 4 root root 80 Jan 19 22:45 ../
-rwxr-xr-x 1 pali pali 0 Jan 19 22:50 d*
-rwxr-xr-x 1 pali pali 0 Jan 19 22:50 Č*
-rwxr-xr-x 1 pali pali 0 Jan 19 22:50 č*
$ touch /mnt/fat/D
$ ll /mnt/fat/
total 1
drwxr-xr-x 2 pali pali 512 Jan 19 22:50 ./
drwxrwxrwt 4 root root 80 Jan 19 22:45 ../
-rwxr-xr-x 1 pali pali 0 Jan 19 22:51 d*
-rwxr-xr-x 1 pali pali 0 Jan 19 22:50 Č*
-rwxr-xr-x 1 pali pali 0 Jan 19 22:50 č*

As you can see, lowercase 'd' and uppercase 'D' are treated as the same,
but lowercase 'č' and uppercase 'Č' are not. This is because 'č' is the
two-byte sequence 0xc4 0x8d in UTF-8 and the comparison is done with the
Latin1 table. 0xc4 is 'Ä' in Latin1, which is already uppercase. 0x8d is a
control char, so it is not changed by the tolower/toupper functions. 'Č'
is 0xc4 0x8c, and 0x8c is a control char too, so the second bytes never
compare equal.
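
The effect can be reproduced in userspace with a small sketch. The
latin1_tolower() below is a simplified model of an ISO 8859-1 lowercase
table written for this example (it is not the kernel's NLS table), and
latin1_strnicmp() is a hypothetical helper that compares byte by byte the
way an 8-bit NLS comparison would.

```c
#include <stddef.h>

/* Simplified model of an ISO 8859-1 lowercase table (not the kernel's). */
static unsigned char latin1_tolower(unsigned char c)
{
	if (c >= 'A' && c <= 'Z')
		return (unsigned char)(c + 0x20);
	/* 0xC0..0xDE are uppercase letters, except 0xD7 (multiplication sign) */
	if (c >= 0xC0 && c <= 0xDE && c != 0xD7)
		return (unsigned char)(c + 0x20);
	return c;
}

/* Compare two byte strings after lowercasing each byte; 0 means "same". */
int latin1_strnicmp(const unsigned char *a, const unsigned char *b, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++) {
		unsigned char x = latin1_tolower(a[i]);
		unsigned char y = latin1_tolower(b[i]);

		if (x != y)
			return x < y ? -1 : 1;
	}
	return 0;
}
```

Fed the UTF-8 bytes of 'č' (0xc4 0x8d) and 'Č' (0xc4 0x8c), the comparison
reports a difference, while "d" and "D" compare as equal.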

A bigger problem can be with the U+39FF code point. In UTF-8 it is encoded
as bytes 0xe3 0xa7 0xbf (in Latin1 ã§¿). If you convert it with the Latin1
uppercase table you get Ã§¿ (bytes 0xc3 0xa7 0xbf). The first two bytes are
a valid UTF-8 sequence for the character ç = U+00E7.

Therefore U+39FF and U+00E7 may in some cases be treated as the same
character (when comparing just prefixes), differing only in case,
which is completely wrong.
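
The byte arithmetic can be checked with a short sketch that decodes the
three bytes 0xe3 0xa7 0xbf before and after a Latin1 uppercase step. Both
functions are written for this example: latin1_toupper() is a simplified
model of an ISO 8859-1 table, and utf8_decode_bmp() is a hypothetical
strict decoder limited to sequences of one to three bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified model of an ISO 8859-1 uppercase table (not the kernel's). */
unsigned char latin1_toupper(unsigned char c)
{
	if (c >= 'a' && c <= 'z')
		return (unsigned char)(c - 0x20);
	/* 0xE0..0xFE are lowercase letters, except 0xF7 (division sign) */
	if (c >= 0xE0 && c <= 0xFE && c != 0xF7)
		return (unsigned char)(c - 0x20);
	return c;
}

/*
 * Decode one UTF-8 sequence of 1 to 3 bytes (enough for the BMP).
 * Returns the number of bytes consumed, or 0 if the input is malformed.
 */
size_t utf8_decode_bmp(const unsigned char *s, size_t len, uint32_t *out)
{
	if (len >= 1 && s[0] < 0x80) {
		*out = s[0];
		return 1;
	}
	if (len >= 2 && (s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
		*out = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
		return *out >= 0x80 ? 2 : 0;	/* reject overlong forms */
	}
	if (len >= 3 && (s[0] & 0xF0) == 0xE0 &&
	    (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80) {
		*out = ((uint32_t)(s[0] & 0x0F) << 12) |
		       ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
		/* reject overlong forms and surrogates */
		if (*out < 0x800 || (*out >= 0xD800 && *out <= 0xDFFF))
			return 0;
		return 3;
	}
	return 0;
}
```

Before the uppercase step the three bytes decode as a single code point;
afterwards the first two bytes decode as U+00E7 and the trailing 0xbf is a
stray continuation byte.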

--
Pali Rohár
[email protected]



2020-01-19 23:11:17

by Al Viro

Subject: Re: vfat: Broken case-insensitive support for UTF-8

On Sun, Jan 19, 2020 at 11:14:55PM +0100, Pali Rohár wrote:

> So when UTF-8 on VFS for VFAT is enabled, then for VFS <--> VFAT
> conversion are used utf16s_to_utf8s() and utf8s_to_utf16s() functions.
> But in fat_name_match(), vfat_hashi() and vfat_cmpi() functions is used
> NLS table (default iso8859-1) with nls_strnicmp() and nls_tolower().
>
> Which means that fat_name_match(), vfat_hashi() and vfat_cmpi() are
> broken for vfat in UTF-8 mode.
>
> I was thinking how to fix it, and the only possible way is to write a
> uni_tolower() function which takes one Unicode code point and returns
> lowercase of input's Unicode code point. We cannot do any Unicode
> normalization as VFAT specification does not say anything about it and
> MS reference fastfat.sys implementation does not do it neither.

Then how can that possibly be broken? If it matches the native behaviour,
that's it.

> As you can see lowercase 'd' and uppercase 'D' are same, but lowercase
> 'č' and uppercase 'Č' are not same. This is because 'č' is two bytes
> 0xc4 0x8d sequence and comparing is done by Latin1 table. 0xc4 is in
> Latin 'Ä' which is already in uppercase. 0x8d is control char so is not
> changed by tolower/toupper function.

Again, who the hell cares? Does the behaviour match how Windows handles
that thing? "Case" is not something well-defined; the only definition
is "whatever weird crap does the native implementation choose to do".
That's the only reason to support that garbage at all...

2020-01-19 23:35:55

by Pali Rohár

Subject: Re: vfat: Broken case-insensitive support for UTF-8

On Sunday 19 January 2020 23:08:09 Al Viro wrote:
> On Sun, Jan 19, 2020 at 11:14:55PM +0100, Pali Rohár wrote:
>
> > So when UTF-8 on VFS for VFAT is enabled, then for VFS <--> VFAT
> > conversion are used utf16s_to_utf8s() and utf8s_to_utf16s() functions.
> > But in fat_name_match(), vfat_hashi() and vfat_cmpi() functions is used
> > NLS table (default iso8859-1) with nls_strnicmp() and nls_tolower().
> >
> > Which means that fat_name_match(), vfat_hashi() and vfat_cmpi() are
> > broken for vfat in UTF-8 mode.
> >
> > I was thinking how to fix it, and the only possible way is to write a
> > uni_tolower() function which takes one Unicode code point and returns
> > lowercase of input's Unicode code point. We cannot do any Unicode
> > normalization as VFAT specification does not say anything about it and
> > MS reference fastfat.sys implementation does not do it neither.
>
> Then how can that possibly be broken? If it matches the native behaviour,
> that's it.

VFAT is case insensitive.

> > As you can see lowercase 'd' and uppercase 'D' are same, but lowercase
> > 'č' and uppercase 'Č' are not same. This is because 'č' is two bytes
> > 0xc4 0x8d sequence and comparing is done by Latin1 table. 0xc4 is in
> > Latin 'Ä' which is already in uppercase. 0x8d is control char so is not
> > changed by tolower/toupper function.
>
> Again, who the hell cares?

All users who use also non-Linux fat implementations.

> Does the behaviour match how Windows handles that thing?

Linux behavior does not match Windows behavior.

On Windows, FAT32 (fastfat.sys) is case insensitive, and the file names
"č" and "Č" are treated as the same file. Windows does not allow you to
create both files; it says that the file already exists.

> "Case" is not something well-defined; the only definition
> is "whatever weird crap does the native implementation choose to do".

You are right that case sensitivity is not well-defined, but in
Unicode we also have a language-independent and basically well-defined
conversion.

And because VFAT is a Unicode fs (internally UTF-16), it makes sense that
the well-defined Unicode folding should be used.

> That's the only reason to support that garbage at all...

What do you mean by garbage? Where? All filenames which I specified are
valid UTF-8 sequences, valid Unicode code points and therefore have
valid UTF-16 representation stored in VFAT fs.

Sorry, but I did not understand your comment.

--
Pali Rohár
[email protected]



2020-01-20 00:15:16

by Al Viro

Subject: Re: vfat: Broken case-insensitive support for UTF-8

On Mon, Jan 20, 2020 at 12:33:48AM +0100, Pali Rohár wrote:

> > Does the behaviour match how Windows handles that thing?
>
> Linux behavior does not match Windows behavior.
>
> On Windows is FAT32 (fastfat.sys) case insensitive and file names "č"
> and "Č" are treated as same file. Windows does not allow you to create
> both files. It says that file already exists.

So how is the mapping specified in their implementation? That's
obviously the mapping we have to match.

> > That's the only reason to support that garbage at all...
>
> What do you mean by garbage?

Case-insensitive anything... the only reason to have that crap at all
is that native implementations are basically forcing it as fs
image correctness issue. It's worthless on its own merits, but
we can't do something that amounts to corrupting fs image when
we access it for write.

2020-01-20 04:06:30

by OGAWA Hirofumi

Subject: Re: vfat: Broken case-insensitive support for UTF-8

Pali Rohár <[email protected]> writes:

> Which means that fat_name_match(), vfat_hashi() and vfat_cmpi() are
> broken for vfat in UTF-8 mode.

Right. It is a known issue.

> I was thinking how to fix it, and the only possible way is to write a
> uni_tolower() function which takes one Unicode code point and returns
> lowercase of input's Unicode code point. We cannot do any Unicode
> normalization as VFAT specification does not say anything about it and
> MS reference fastfat.sys implementation does not do it neither.
>
> So, what would be the best option for implementing that function?
>
> unicode_t uni_tolower(unicode_t u);
>
> Could a new fs/unicode code help with it? Or it is too tied with NFD
> normalization and therefore cannot be easily used or extended?

To be perfect, the table would have to emulate what Windows uses. It can
be the Unicode standard, or something else. And other filesystems can use
something different from what Windows uses.

So the table would have to be switchable in perfect world (if there is
no consensus to use 1 table). If we use switchable table, I think it
would be better to put in userspace, and loadable like firmware data.

Well, so then it would not be simple work (especially, to be perfect).
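
As a rough illustration of what "switchable" could mean, the sketch below
keeps the mapping as data behind a small handle, so a different table can
be swapped in without touching the lookup code. All names here
(case_pair, case_table, case_table_fold) are hypothetical and invented for
this example; nothing like this exists in the fat driver, and how the data
would actually be loaded is left open.

```c
#include <stddef.h>
#include <stdint.h>

/* One 1:1 mapping entry; a loaded table is an array sorted by 'from'. */
struct case_pair {
	uint16_t from;
	uint16_t to;
};

/* Hypothetical handle for a table supplied from outside the driver. */
struct case_table {
	const struct case_pair *pairs;
	size_t count;
};

/* Binary-search the table; units without an entry map to themselves. */
uint16_t case_table_fold(const struct case_table *t, uint16_t c)
{
	size_t lo = 0, hi = t->count;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (t->pairs[mid].from == c)
			return t->pairs[mid].to;
		if (t->pairs[mid].from < c)
			lo = mid + 1;
		else
			hi = mid;
	}
	return c;
}
```

A sparse sorted list keeps the data small; a flat 64K array indexed by the
16-bit unit would trade memory for a constant-time lookup.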


Also, though it is not directly the same issue, there is a related issue
for case-insensitivity. Even if we use some sort of internal wide char
(e.g. 16 bits, as in nls), the dcache holds the name in the user's
encoding (e.g. utf8). So it is inefficient to convert the cached name to
wide char on each access.

The relatively recent EXT4 case-insensitive support may have tackled
this, though I have not checked it yet.

> New exfat code which is under review and hopefully would be merged,
> contains own unicode upcase table (as defined by exfat specification) so
> as exfat is similar to FAT32, maybe reusing it would be a better option?

exfat just puts a case conversion table in the fs. So I don't think it
helps fatfs.

Thanks.
--
OGAWA Hirofumi <[email protected]>

2020-01-20 07:31:54

by Al Viro

Subject: Re: vfat: Broken case-insensitive support for UTF-8

On Mon, Jan 20, 2020 at 01:04:42PM +0900, OGAWA Hirofumi wrote:

> Also, not directly same issue though. There is related issue for
> case-insensitive. Even if we use some sort of internal wide char
> (e.g. in nls, 16bits), dcache is holding name in user's encode
> (e.g. utf8). So inefficient to convert cached name to wide char for each
> access.
>
> Relatively recent EXT4 case-insensitive may tackled this though, I'm not
> checking it yet.

What's more, comparisons in dcache lookups have to be very careful about
the rename-related issues. You can give false negatives if the name
changes under you; it's not a problem. You can even give a false positive
in case of name change happening in the middle of comparison; ->d_seq
mismatch will get caught and it will have the result discarded before it
causes problems. However, you can't e.g. assume that the string you are
trying to convert from utf8 to 16bit won't be changing right under you.
Again, the wrong result of comparison in such situation is not a problem;
wrong return value is not the worst thing that can happen to a string
function mistakenly assuming that the string is not changing under it.

And you very much need to be careful about the things you can access
there. E.g. something like "oh, I'll just look at the flags in the
inode of the parent of potential match" (as in the recently posted
series) is a bloody bad idea on many levels. Starting with "your
potential match is getting moved right now, and what used to be its
parent becomes negative by the time you get around to fetching its
->d_inode. Dereferencing the resulting NULL to get inode flags
is not pretty".

<checks ext4>
Yup, that bug is there as well, all right. Look:
#ifdef CONFIG_UNICODE
static int ext4_d_compare(const struct dentry *dentry, unsigned int len,
			  const char *str, const struct qstr *name)
{
	struct qstr qstr = {.name = str, .len = len };
	struct inode *inode = dentry->d_parent->d_inode;

	if (!IS_CASEFOLDED(inode) || !EXT4_SB(inode->i_sb)->s_encoding) {

Guess what happens if your (lockless) call of ->d_compare() runs
into the following sequence:
CPU1: ext4_d_compare() fetches ->d_parent
CPU1: takes a hardware interrupt
CPU2: dentry gets evicted by memory pressure; so is its parent, since
it was the only thing that used to keep it pinned. Eviction of the parent
calls dentry_unlink_inode() on the parent, which zeroes its ->d_inode.
CPU1: comes back
CPU1: fetches parent's ->d_inode and gets NULL
CPU1: oopses on null pointer dereference.

It's not impossible to hit. Note that e.g. vfat_cmpi() is not vulnerable
to that problem - ->d_sb is stable and both the superblock and ->nls_io
freeing is RCU-delayed.

I hadn't checked ->d_compare() instances for a while; somebody needs to
do that again, by the look of it. The above definitely is broken;
no idea how many other instances had grown such bugs...

2020-01-20 07:47:04

by Al Viro

Subject: Re: vfat: Broken case-insensitive support for UTF-8

On Mon, Jan 20, 2020 at 07:30:40AM +0000, Al Viro wrote:

> <checks ext4>
> Yup, that bug is there as well, all right. Look:
> #ifdef CONFIG_UNICODE
> static int ext4_d_compare(const struct dentry *dentry, unsigned int len,
> const char *str, const struct qstr *name)
> {
> struct qstr qstr = {.name = str, .len = len };
> struct inode *inode = dentry->d_parent->d_inode;
>
> if (!IS_CASEFOLDED(inode) || !EXT4_SB(inode->i_sb)->s_encoding) {
>
> Guess what happens if your (lockless) call of ->d_compare() runs
> into the following sequence:
> CPU1: ext4_d_compare() fetches ->d_parent
> CPU1: takes a hardware interrupt
> CPU2: dentry gets evicted by memory pressure; so is its parent, since
> it was the only thing that used to keep it pinned. Eviction of the parent
> calls dentry_unlink_inode() on the parent, which zeroes its ->d_inode.
> CPU1: comes back
> CPU1: fetches parent's ->d_inode and gets NULL
> CPU1: oopses on null pointer dereference.
>
> It's not impossible to hit. Note that e.g. vfat_cmpi() is not vulnerable
> to that problem - ->d_sb is stable and both the superblock and ->nls_io
> freeing is RCU-delayed.
>
> I hadn't checked ->d_compare() instances for a while; somebody needs to
> do that again, by the look of it. The above definitely is broken;
> no idea how many other instances had grown such bugs...

f2fs one also has the same bug. Anyway, I'm going down right now, will
check the rest tomorrow morning...

2020-01-20 08:08:33

by Al Viro

Subject: oopsably broken case-insensitive support in ext4 and f2fs (Re: vfat: Broken case-insensitive support for UTF-8)

On Mon, Jan 20, 2020 at 07:45:58AM +0000, Al Viro wrote:
> On Mon, Jan 20, 2020 at 07:30:40AM +0000, Al Viro wrote:
>
> > <checks ext4>
> > Yup, that bug is there as well, all right. Look:
> > #ifdef CONFIG_UNICODE
> > static int ext4_d_compare(const struct dentry *dentry, unsigned int len,
> > const char *str, const struct qstr *name)
> > {
> > struct qstr qstr = {.name = str, .len = len };
> > struct inode *inode = dentry->d_parent->d_inode;
> >
> > if (!IS_CASEFOLDED(inode) || !EXT4_SB(inode->i_sb)->s_encoding) {
> >
> > Guess what happens if your (lockless) call of ->d_compare() runs
> > into the following sequence:
> > CPU1: ext4_d_compare() fetches ->d_parent
> > CPU1: takes a hardware interrupt
> > CPU2: dentry gets evicted by memory pressure; so is its parent, since
> > it was the only thing that used to keep it pinned. Eviction of the parent
> > calls dentry_unlink_inode() on the parent, which zeroes its ->d_inode.
> > CPU1: comes back
> > CPU1: fetches parent's ->d_inode and gets NULL
> > CPU1: oopses on null pointer dereference.
> >
> > It's not impossible to hit. Note that e.g. vfat_cmpi() is not vulnerable
> > to that problem - ->d_sb is stable and both the superblock and ->nls_io
> > freeing is RCU-delayed.
> >
> > I hadn't checked ->d_compare() instances for a while; somebody needs to
> > do that again, by the look of it. The above definitely is broken;
> > no idea how many other instances had grown such bugs...
>
> f2fs one also has the same bug. Anyway, I'm going down right now, will
> check the rest tomorrow morning...

We _probably_ can get away with just checking that inode for NULL and
buggering off if it is (->d_seq mismatch is guaranteed in that case),
but I suspect that we might need READ_ONCE() on both dereferences.
I hate memory barriers...
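
A userspace model of that idea, for illustration only. The structures, the
flag bit and the READ_ONCE() macro below are simplified stand-ins defined
for this sketch (the macro relies on the GCC/Clang __typeof__ extension),
and model_d_compare_guard() is a hypothetical name. The sketch shows only
the shape of the NULL check; it does not model RCU, ->d_seq or the memory
ordering questions raised above.

```c
#include <stddef.h>

/* Userspace stand-in; the kernel macro and structures are far richer. */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

struct inode {
	unsigned int i_flags;
};

struct dentry {
	struct dentry *d_parent;
	struct inode *d_inode;
};

#define MODEL_CASEFOLD 0x1u	/* hypothetical flag bit for this model */

/*
 * Fetch the parent and its inode exactly once each. Returns 1 ("bail out,
 * report no match") when the parent's inode is already gone, otherwise
 * stores whether the parent is case-folded and returns 0.
 */
int model_d_compare_guard(const struct dentry *dentry, int *casefolded)
{
	const struct dentry *parent = READ_ONCE(dentry->d_parent);
	const struct inode *inode = READ_ONCE(parent->d_inode);

	if (!inode)
		return 1;	/* parent went negative under us */
	*casefolded = (inode->i_flags & MODEL_CASEFOLD) != 0;
	return 0;
}
```

The point of the single fetch is that the pointer that gets tested for
NULL is the same one that is dereferenced afterwards.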

2020-01-20 11:06:35

by Pali Rohár

Subject: Re: vfat: Broken case-insensitive support for UTF-8

On Monday 20 January 2020 13:04:42 OGAWA Hirofumi wrote:
> Pali Rohár <[email protected]> writes:
>
> > Which means that fat_name_match(), vfat_hashi() and vfat_cmpi() are
> > broken for vfat in UTF-8 mode.
>
> Right. It is a known issue.

Could this issue be better documented? E.g. in the mount(8) manpage, where
the mount options for vfat are described? I think that people should be
aware of this issue when they use the "utf8=1" mount option.

> > I was thinking how to fix it, and the only possible way is to write a
> > uni_tolower() function which takes one Unicode code point and returns
> > lowercase of input's Unicode code point. We cannot do any Unicode
> > normalization as VFAT specification does not say anything about it and
> > MS reference fastfat.sys implementation does not do it neither.
> >
> > So, what would be the best option for implementing that function?
> >
> > unicode_t uni_tolower(unicode_t u);
> >
> > Could a new fs/unicode code help with it? Or it is too tied with NFD
> > normalization and therefore cannot be easily used or extended?
>
> To be perfect, the table would have to emulate what Windows use. It can
> be unicode standard, or something other.

The Windows FAT32 implementation (fastfat.sys) is open source, so it
should be possible to inspect the code and figure out how it works.

I will try to look at it.

> And other fs can use different what Windows use.
>
> So the table would have to be switchable in perfect world (if there is
> no consensus to use 1 table). If we use switchable table, I think it
> would be better to put in userspace, and loadable like firmware data.
>
> Well, so then it would not be simple work (especially, to be perfect).

A switchable table is not really simple, and I think that as a first step
it would be enough to have one (hardcoded) table for UTF-8, like we have
for all other encodings.

> Also, not directly same issue though. There is related issue for
> case-insensitive. Even if we use some sort of internal wide char
> (e.g. in nls, 16bits), dcache is holding name in user's encode
> (e.g. utf8). So inefficient to convert cached name to wide char for each
> access.

Yes, this is true. But the exFAT implementation already does this
conversion. I think we do not have any other choice if we want a Windows
compatible implementation.

> Relatively recent EXT4 case-insensitive may tackled this though, I'm not
> checking it yet.
>
> > New exfat code which is under review and hopefully would be merged,
> > contains own unicode upcase table (as defined by exfat specification) so
> > as exfat is similar to FAT32, maybe reusing it would be a better option?
>
> exfat just put a case conversion table in fs. So I don't think it helps
> fatfs.

exfat has a fallback conversion table (hardcoded in the driver) which is
used when the fs itself does not have a conversion table. This is mandated
by the exFAT specification, and that default conversion table is part of
the specification.

I was thinking... as both VFAT and exFAT are MS standards and exFAT is
just an evolved FAT32, we could use that exFAT default conversion table
(which is present in that exfat driver).

--
Pali Rohár
[email protected]

2020-01-20 11:20:37

by Pali Rohár

Subject: Re: vfat: Broken case-insensitive support for UTF-8

On Monday 20 January 2020 00:09:31 Al Viro wrote:
> On Mon, Jan 20, 2020 at 12:33:48AM +0100, Pali Rohár wrote:
>
> > > Does the behaviour match how Windows handles that thing?
> >
> > Linux behavior does not match Windows behavior.
> >
> > On Windows is FAT32 (fastfat.sys) case insensitive and file names "č"
> > and "Č" are treated as same file. Windows does not allow you to create
> > both files. It says that file already exists.
>
> So how is the mapping specified in their implementation? That's
> obviously the mapping we have to match.

The FAT specification (fatgen103.doc) is just a parody of a specification.
E.g. it requires you to use pencil and paper during implementation...

About case insensitivity, I found these parts in the specification:

"The UNICODE name passed to the file system is converted to upper case."

"UNICODE solves the case mapping problem prevalent in some OEM code
pages by always providing a translation for lower case characters to a
single, unique upper case character."

Which basically says nothing... I can deduce from it that the Unicode
standard should be used for the mapping table.

But we already know that there are mistakes in that specification, and
what is relevant is the Microsoft FAT implementation (fastfat.sys). It is
now open source on github, so we can inspect how it implements upper case
conversion.

> > > That's the only reason to support that garbage at all...
> >
> > What do you mean by garbage?
>
> Case-insensitive anything... the only reason to have that crap at all
> is that native implementations are basically forcing it as fs
> image correctness issue.

You are right. But we need to deal with it.

> It's worthless on its own merits, but
> we can't do something that amounts to corrupting fs image when
> we access it for write.

If we implement the same upper case conversion as in the reference
implementation (fastfat.sys), then we prevent "corrupting fs".

--
Pali Rohár
[email protected]

2020-01-20 12:08:55

by OGAWA Hirofumi

Subject: Re: vfat: Broken case-insensitive support for UTF-8

Pali Rohár <[email protected]> writes:

>> To be perfect, the table would have to emulate what Windows use. It can
>> be unicode standard, or something other.
>
> Windows FAT32 implementation (fastfat.sys) is opensource. So it should
> be possible to inspect code and figure out how it is working.
>
> I will try to look at it.

I don't think the conversion library is in the fs driver, though checking
the implementation itself would be good.

>> And other fs can use different what Windows use.
>>
>> So the table would have to be switchable in perfect world (if there is
>> no consensus to use 1 table). If we use switchable table, I think it
>> would be better to put in userspace, and loadable like firmware data.
>>
>> Well, so then it would not be simple work (especially, to be perfect).
>
> Switchable table is not really simple and I think as a first step would
> be enough to have one (hardcoded) table for UTF-8. Like we have for all
> other encodings.

Leaving aside whether a utf8 table is good or not: if the table is not
Windows compatible or doesn't satisfy other filesystems' requirements, it
is yet another broken table like the one now (of course, it would likely
be better). Of course, we can define it as a Linux implementation
limitation, though.

So yes, I think this work is not simple.

>> Also, not directly same issue though. There is related issue for
>> case-insensitive. Even if we use some sort of internal wide char
>> (e.g. in nls, 16bits), dcache is holding name in user's encode
>> (e.g. utf8). So inefficient to convert cached name to wide char for each
>> access.
>
> Yes, this is truth. But this conversion is already doing exFAT
> implementation. I think we do not have other choice if we want Windows
> compatible implementation.

For example, we can cache both the display name and the upper/lower case
name. Anyway, at least there are some implementation options.

Thanks.
--
OGAWA Hirofumi <[email protected]>

2020-01-20 15:08:34

by David Laight

Subject: RE: vfat: Broken case-insensitive support for UTF-8

From: Pali Rohár
> Sent: 20 January 2020 11:05
> On Monday 20 January 2020 13:04:42 OGAWA Hirofumi wrote:
> > Pali Rohár <[email protected]> writes:
> >
> > > Which means that fat_name_match(), vfat_hashi() and vfat_cmpi() are
> > > broken for vfat in UTF-8 mode.
> >
> > Right. It is a known issue.
>
> Could be this issue better documented? E.g. in mount(8) manpage where
> are written mount options for vfat? I think that people should be aware
> of this issue when they use "utf8=1" mount option.

What happens if the filesystem has filenames that are invalid UTF8
sequences, or multiple filenames that decode from UTF8 to the same
'wchar' value? Never mind ones that are just case-differences for the
same filename.

UTF8 is just so broken it should never have been allowed to become
a standard.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2020-01-20 15:21:20

by Pali Rohár

Subject: Re: vfat: Broken case-insensitive support for UTF-8

On Monday 20 January 2020 15:07:20 David Laight wrote:
> From: Pali Rohár
> > Sent: 20 January 2020 11:05
> > On Monday 20 January 2020 13:04:42 OGAWA Hirofumi wrote:
> > > Pali Rohár <[email protected]> writes:
> > >
> > > > Which means that fat_name_match(), vfat_hashi() and vfat_cmpi() are
> > > > broken for vfat in UTF-8 mode.
> > >
> > > Right. It is a known issue.
> >
> > Could be this issue better documented? E.g. in mount(8) manpage where
> > are written mount options for vfat? I think that people should be aware
> > of this issue when they use "utf8=1" mount option.
>
> What happens if the filesystem has filenames that invalid UTF8 sequences

Could you please describe what you mean by this question?


The VFAT filesystem stores file names in UTF-16. Therefore you cannot have
UTF-8 on the FS (and therefore you also cannot have invalid UTF-8).

Ehm... UTF-16 is not the full truth: MS FAT32 implementations allow half
of a UTF-16 surrogate pair to be stored in the FS.

Therefore practically, on VFAT you can store any uint16_t[] sequence as
filename, there is no invalid sequence (except those characters like
:<>?... which are invalid in MS-DOS).
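
To show what "half of a surrogate pair" means for a uint16_t[] name, here
is a minimal sketch of a well-formedness check. The function name
utf16_is_well_formed() is made up for this example; it only checks
surrogate pairing and says nothing about which characters MS-DOS forbids.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if every surrogate in the sequence is correctly paired, else 0. */
int utf16_is_well_formed(const uint16_t *s, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++) {
		if (s[i] >= 0xD800 && s[i] <= 0xDBFF) {
			/* a high surrogate must be followed by a low surrogate */
			if (i + 1 >= len || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
				return 0;
			i++;	/* skip the low half of the pair */
		} else if (s[i] >= 0xDC00 && s[i] <= 0xDFFF) {
			return 0;	/* low surrogate with no preceding high one */
		}
	}
	return 1;
}
```

A name for which this returns 0 can still be stored on VFAT, but it has no
valid UTF-8 representation.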



If by "the filesystem has filenames" you do not mean filesystem file
names, but rather Linux VFS file names (e.g. you call creat() with an
invalid UTF-8 sequence), then the function utf8s_to_utf16s() (called in
namei_vfat.c) fails and returns an error. That error should be propagated
to the open() / creat() call, indicating that it is not possible to create
a filename with such a UTF-8 sequence.

> or multiple filenames that decode from UTF8 to the same 'wchar' value.

This is not possible. There is a 1:1 mapping between a UTF-8 sequence and
a Unicode code point. wchar_t in the kernel represents either one Unicode
code point (limited to U+FFFF in NLS framework functions) or 2 bytes of a
UTF-16 sequence (only in the utf8s_to_utf16s() and utf16s_to_utf8s()
functions).

> Never mind ones that are just case-differences for the same filename.
>
> UTF8 is just so broken it should never have been allowed to become
> a standard.

Well, UTF-16 is worse than UTF-8... incompatible with ASCII, variable
length and space consuming.

--
Pali Rohár
[email protected]

2020-01-20 15:50:01

by David Laight

Subject: RE: vfat: Broken case-insensitive support for UTF-8

From: Pali Rohár
> Sent: 20 January 2020 15:20
...
> This is not possible. There is 1:1 mapping between UTF-8 sequence and
> Unicode code point. wchar_t in kernel represent either one Unicode code
> point (limited up to U+FFFF in NLS framework functions) or 2bytes in
> UTF-16 sequence (only in utf8s_to_utf16s() and utf16s_to_utf8s()
> functions).

Unfortunately there is neither a 1:1 mapping of all possible byte sequences
to wchar_t (or unicode code points), nor a 1:1 mapping of all possible
wchar_t values to UTF-8.
Really both need to be defined - even for otherwise 'invalid' sequences.

Even the 16-bit values above 0xd000 can appear on their own in
windows filesystems (according to wikipedia).

It is all too easy to get sequences of values that cannot be converted
to/from UTF-8.

David


2020-01-20 16:14:42

by Al Viro

Subject: Re: vfat: Broken case-insensitive support for UTF-8

On Mon, Jan 20, 2020 at 03:47:22PM +0000, David Laight wrote:
> From: Pali Rohár
> > Sent: 20 January 2020 15:20
> ...
> > This is not possible. There is 1:1 mapping between UTF-8 sequence and
> > Unicode code point. wchar_t in kernel represent either one Unicode code
> > point (limited up to U+FFFF in NLS framework functions) or 2bytes in
> > UTF-16 sequence (only in utf8s_to_utf16s() and utf16s_to_utf8s()
> > functions).
>
> Unfortunately there is neither a 1:1 mapping of all possible byte sequences
> to wchar_t (or unicode code points), nor a 1:1 mapping of all possible
> wchar_t values to UTF-8.
> Really both need to be defined - even for otherwise 'invalid' sequences.

Who. Cares?

Filename is a sequence of octets, not codepoints. Its interpretation is
entirely up to the userland.

Same goes for the notion of "case" (locale-dependent, etc.); some
filesystems impose their (arbitrary) restrictions on the possible
octet sequences (and equally arbitrary equivalence relations between
them) that can be approximated in terms of upper/lower case in some
locale. It does not matter how arbitrary those are, or what stands
behind them:
* don't do that for any new filesystem designs
* for existing filesystem types, the actual behaviour of
native implementation IS THE ONE AND ONLY AUTHORITY. It does not
matter what misguided thought process it has come from;
the absolute requirement is that if you mount a filesystem valid
from the native implementation POV, you must leave it in a state
that would be valid from the native implementation POV. That's
it.

Any talk about normalization, etc. is completely pointless -
for any sane uses it's an opaque stream of octets that filesystem
and VFS should leave the fuck alone. Codepoints, encodings, etc.
come into the game only to an extent they are useful to describe
the weird rules given filesystem might have. And they are just
that - tools to describe externally imposed mappings.

2020-01-20 16:28:13

by Pali Rohár

Subject: Re: vfat: Broken case-insensitive support for UTF-8

On Monday 20 January 2020 15:47:22 David Laight wrote:
> From: Pali Rohár
> > Sent: 20 January 2020 15:20
> ...
> > This is not possible. There is 1:1 mapping between UTF-8 sequence and
> > Unicode code point. wchar_t in kernel represent either one Unicode code
> > point (limited up to U+FFFF in NLS framework functions) or 2bytes in
> > UTF-16 sequence (only in utf8s_to_utf16s() and utf16s_to_utf8s()
> > functions).
>
> Unfortunately there is neither a 1:1 mapping of all possible byte sequences
> to wchar_t (or unicode code points),

I was talking about valid UTF-8 sequences (invalid or ill-formed ones are
out of the game and will certainly always cause problems).

> nor a 1:1 mapping of all possible wchar_t values to UTF-8.

This is not true. There is exactly one way to convert a sequence of
Unicode code points to UTF-8. UTF stands for Unicode Transformation Format
and it defines exactly how Unicode is transformed.

If you have a valid UTF-8 sequence then it describes exactly one sequence
of Unicode code points. And if you have a sequence (of ordinals) of Unicode
code points, there is exactly one representation of it in UTF-8.

I would suggest you read the Unicode standard, section 2.5 Encoding Forms.
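
The one-to-one correspondence for well-formed input can be checked from userland with a round trip. This is only a sketch: it uses Python's built-in strict UTF-8 codec as a stand-in, not the kernel's conversion functions, and the sample string and the helper name is_valid_utf8() are made up for illustration.

```python
# Round trip: code points -> UTF-8 -> code points, using Python's strict codec.
sample = "Roh\u00e1r \u20ac \U0001F600"   # arbitrary test string (assumption)

encoded = sample.encode("utf-8")
decoded = encoded.decode("utf-8")

# Decoding the bytes returns exactly the original code points.
assert decoded == sample


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 according to Python's decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Ill-formed input (a lone 0xA3 byte, an overlong encoding of '/') is rejected
# instead of being mapped to some code point.
print(is_valid_utf8(encoded))           # True
print(is_valid_utf8(b"cost \xa3 5"))    # False
print(is_valid_utf8(b"\xc0\xaf"))       # False
```

The point of the second half is that the mapping is one-to-one only on the set of well-formed sequences; everything else has to be rejected or handled by an explicit policy.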

> Really both need to be defined - even for otherwise 'invalid' sequences.
>
> Even the 16-bit values above 0xd000 can appear on their own in
> windows filesystems (according to wikipedia).

If you are talking about UTF-16 (which is _not_ 16-bit as you wrote),
look at my previous email:

"MS FAT32 implementations allows half of UTF-16 surrogate pair stored in FS."

> It is all to easy to get sequences of values that cannot be converted
> to/from UTF-8.
>
> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)

--
Pali Rohár
[email protected]



2020-01-20 16:44:40

by David Laight

Subject: RE: vfat: Broken case-insensitive support for UTF-8

From: Pali Rohár
> Sent: 20 January 2020 16:27
...
> > Unfortunately there is neither a 1:1 mapping of all possible byte sequences
> > to wchar_t (or unicode code points),
>
> I was talking about valid UTF-8 sequence (invalid, illformed is out of
> game and for sure would always cause problems).

Except that they are always likely to happen.
I've been pissed off by programs crashing because they assume that
an input string (e.g. an email) is UTF-8 but it happens to contain a single
0xa3 byte in the otherwise 7-bit data.

The standard ought to have defined a translation for such sequences
and just a 'warning' from the function(s) that unexpected bytes were
processed.

> > nor a 1:1 mapping of all possible wchar_t values to UTF-8.
>
> This is not truth. There is exactly only one way how to convert sequence
> of Unicode code points to UTF-8. UTF is Unicode Transformation Format
> and has exact definition how is Unicode Transformed.

But a wchar_t can hold lots of values that aren't Unicode code points.
Prior to the 2003 changes half of the 2^32 values could be converted.
Afterwards only a small fraction.

> If you have valid UTF-8 sequence then it describe one exact sequence of
> Unicode code points. And if you have sequence (ordinals) of Unicode code
> points there is exactly one and only one its representation in UTF-8.
>
> I would suggest you to read Unicode standard, section 2.5 Encoding Forms.

That all assumes everyone is playing the correct game.

> > Really both need to be defined - even for otherwise 'invalid' sequences.
> >
> > Even the 16-bit values above 0xd000 can appear on their own in
> > windows filesystems (according to wikipedia).
>
> If you are talking about UTF-16 (which is _not_ 16-bit as you wrote),
> look at my previous email:

UTF-16 is a sequence of 16-bit values....
It can contain 0xd000 to 0xffff (usually in pairs) but they aren't UTF-8 codepoints.

David


2020-01-20 16:54:17

by David Laight

Subject: RE: vfat: Broken case-insensitive support for UTF-8

From: Al Viro
> Sent: 20 January 2020 16:12
> > From: Pali Rohár
> > > Sent: 20 January 2020 15:20
> > ...
> > > This is not possible. There is 1:1 mapping between UTF-8 sequence and
> > > Unicode code point. wchar_t in kernel represent either one Unicode code
> > > point (limited up to U+FFFF in NLS framework functions) or 2bytes in
> > > UTF-16 sequence (only in utf8s_to_utf16s() and utf16s_to_utf8s()
> > > functions).
> >
> > Unfortunately there is neither a 1:1 mapping of all possible byte sequences
> > to wchar_t (or unicode code points), nor a 1:1 mapping of all possible
> > wchar_t values to UTF-8.
> > Really both need to be defined - even for otherwise 'invalid' sequences.
>
> Who. Cares?
>
> Filename is a sequence of octets, not codepoints. Its interpretation is
> entirely up to the userland.

For filesystems that really ought to be true.
It saves a lot of problems in the kernel.

I guess the fat driver has to do something to convert the UTF-16 on-disk filenames
to/from a sequence of octets.

Even Microsoft has made it much easier to have case-sensitive
NTFS filesystems in Windows 10.
(Ever watched the number of different cases in the list of c:/windows/system32/drivers/*.sys
filenames output when Windows boots? They are nearly all different!)


2020-01-20 16:58:31

by Pali Rohár

Subject: Re: vfat: Broken case-insensitive support for UTF-8

On Monday 20 January 2020 16:43:21 David Laight wrote:
> From: Pali Rohár
> > Sent: 20 January 2020 16:27
> ...
> > > Unfortunately there is neither a 1:1 mapping of all possible byte sequences
> > > to wchar_t (or unicode code points),
> >
> > I was talking about valid UTF-8 sequence (invalid, illformed is out of
> > game and for sure would always cause problems).
>
> Except that they are always likely to happen.

As I wrote before, the Linux kernel does not allow such sequences, so
userspace gets an error when it tries to store garbage.

> I've been pissed off by programs crashing because they assume that
> a input string (eg an email) is UTF-8 but happens to contain a single
> 0xa3 byte in the otherwise 7-bit data.
>
> The standard ought to have defined a translation for such sequences
> and just a 'warning' from the function(s) that unexpected bytes were
> processed.

There is an informative part describing how to replace the invalid part
of a sequence with Unicode code point U+FFFD. So if you need to "process
any byte sequence as UTF-8", there is a standardized way to convert it
into one exact sequence of Unicode code points. This is what email
programs should do, and non-broken ones already do it.
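
As a small illustration of that replacement policy, here is a sketch using Python's built-in codec, whose "replace" error handler substitutes U+FFFD for bytes it cannot decode. The sample bytes and the helper name decode_lenient() are made up; this is not code from any mail program.

```python
def decode_lenient(data: bytes) -> str:
    """Decode as UTF-8, replacing every undecodable part with U+FFFD."""
    return data.decode("utf-8", errors="replace")


raw = b"Total: \xa3 5"   # 7-bit text with one stray 0xA3 byte (made-up sample)

# The strict decoder refuses the input...
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# ...while the lenient one always yields a sequence of code points.
print(strict_ok)                                    # False
print(decode_lenient(raw) == "Total: \ufffd 5")     # True
```

A program that wants to accept arbitrary bytes can use the lenient path and report the replacement as a warning rather than crash.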

> > > nor a 1:1 mapping of all possible wchar_t values to UTF-8.
> >
> > This is not truth. There is exactly only one way how to convert sequence
> > of Unicode code points to UTF-8. UTF is Unicode Transformation Format
> > and has exact definition how is Unicode Transformed.
>
> But a wchar_t can hold lots of values that aren't Unicode code points.
> Prior to the 2003 changes half of the 2^32 values could be converted.
> Afterwards only a small fraction.

wchar_t in the kernel can hold only a subset of Unicode code points, up to
U+FFFF (2^16-1).

Halves of surrogate pairs are not valid Unicode code points, but as
stated they are used in MS FAT.

So anything which can be put into the kernel's wchar_t is valid for FAT.

>
> > If you have valid UTF-8 sequence then it describe one exact sequence of
> > Unicode code points. And if you have sequence (ordinals) of Unicode code
> > points there is exactly one and only one its representation in UTF-8.
> >
> > I would suggest you to read Unicode standard, section 2.5 Encoding Forms.
>
> That all assumes everyone is playing the correct game

And why should we not play the correct game? On input we have UTF and
internally we work with Unicode. Unicode code points do not leak from the
kernel, so we can play the correct game and assume that our code in the
kernel is correct (and if not, we can fix it). Plus, when communicating
with the outside world, we just check that the input data are valid (which
we already do for UTF-8 user input).

So I do not see any problem there.

> > > Really both need to be defined - even for otherwise 'invalid' sequences.
> > >
> > > Even the 16-bit values above 0xd000 can appear on their own in
> > > windows filesystems (according to wikipedia).
> >
> > If you are talking about UTF-16 (which is _not_ 16-bit as you wrote),
> > look at my previous email:
>
> UFT-16 is a sequence of 16-bit values....

No, this is not true. UTF-16 encodes each code point either as one 16-bit
unit or as a pair of 16-bit units (32 bits, with further restrictions).
UTF-16 is a variable-length encoding.
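
To make the variable-length point concrete, here is a sketch that lists the 16-bit units Python's UTF-16 codec produces for a few sample characters. The helper name utf16_units() and the samples are chosen only for illustration.

```python
import struct


def utf16_units(text: str) -> list:
    """Return the 16-bit code units of text encoded as UTF-16 (little endian)."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


# Code points up to U+FFFF take one unit, code points above take a surrogate pair.
print([hex(u) for u in utf16_units("A")])            # ['0x41']
print([hex(u) for u in utf16_units("\u20ac")])       # ['0x20ac']
print([hex(u) for u in utf16_units("\U0001F600")])   # ['0xd83d', '0xde00']
```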

> It can contain 0xd000 to 0xffff (usually in pairs) but they aren't UTF-8 codepoints.
>
> David
>

--
Pali Rohár
[email protected]



2020-01-20 17:33:59

by Theodore Ts'o

Subject: Re: vfat: Broken case-insensitive support for UTF-8

On Mon, Jan 20, 2020 at 01:04:42PM +0900, OGAWA Hirofumi wrote:
>
> To be perfect, the table would have to emulate what Windows use. It can
> be unicode standard, or something other. And other fs can use different
> what Windows use.

The big question is *which* version of Windows. vfat has been in use
for over two decades, and vfat predates Windows starting to use Unicode
in 2001. Before that, vfat would have been using whatever code page
its local Windows installation was set to use; and I'm not sure if
there was space in the FAT headers to indicate the codepage in use.

It would be entertaining for someone with ancient versions of Windows
9x to create some floppy images using codepage 437 and 450, and then
see what a modern Windows system does with those VFAT images --- would
it break horribly when it tries to interpret them as UTF-16? Or would
it figure it out? And if so, how? Inquiring minds want to know....

Bonus points if the lack of forwards compatibility causes older
versions of Windows to Blue Screen. :-)

- Ted

P.S. And of course, then there's the question of how do older
versions of Windows handle versions of Unicode which postdate the
release date of that particular version of Windows? After all,
Unicode adds new code points with potential revisions to the case
folding table every 6-12 months. (The most recent version of Unicode
was released in April 2019 to accommodate the new Japanese kanji
character "Rei" for the current era name with the elevation of the new
current reigning emperor of Japan.)

2020-01-20 17:39:29

by Theodore Ts'o

Subject: Re: vfat: Broken case-insensitive support for UTF-8

On Mon, Jan 20, 2020 at 03:07:20PM +0000, David Laight wrote:
> What happens if the filesystem has filenames that invalid UTF8 sequences
> or multiple filenames that decode from UTF8 to the same 'wchar' value.
> Never mind ones that are just case-differences for the same filename.
>
> UTF8 is just so broken it should never have been allowed to become
> a standard.

Internationalization is an overconstrained problem which is impacted
and influenced by human politics, including from the Cold War and who
attended which internal standards bodies meetings. So much so that an
I18N expert (very knowledgeable about the problems in this domain) has
been known to have said (in a bar, late at night, and after much
alcohol) that it would be simpler to teach the entire human race
English.

Unfortunately, that's not going to happen, and if we are going to deal
with the market of "everyone who doesn't speak English", we're going
to have to live with Unicode, warts and all. Seriously speaking,
UTF-8 is the worst encoding, except for all of the others. :-)

- Ted

2020-01-20 17:57:34

by Pali Rohár

Subject: Re: vfat: Broken case-insensitive support for UTF-8

On Monday 20 January 2020 12:32:15 Theodore Y. Ts'o wrote:
> On Mon, Jan 20, 2020 at 01:04:42PM +0900, OGAWA Hirofumi wrote:
> >
> > To be perfect, the table would have to emulate what Windows use. It can
> > be unicode standard, or something other. And other fs can use different
> > what Windows use.
>
> The big question is *which* version of Windows. vfat has been in use
> for over two decades, and vfat predates Window starting to use Unicode
> in 2001. Before that, vfat would have been using whatever code page
> its local Windows installation was set to sue; and I'm not sure if
> there was space in the FAT headers to indicate the codepage in use.

VFAT is an extension to FAT which stores file names in UTF-16. In the
original FAT without the VFAT extension (in all variants: FAT12, FAT16 and
FAT32) the file name is stored "according to the current 8-bit OEM code
page". A VFAT-aware FAT implementation knows whether a particular filename
is really VFAT (UTF-16) or non-VFAT (8-bit OEM code page). There are flags
in FAT which indicate whether an entry is VFAT (UTF-16).

And no, there are no bits in the FAT header which specify the OEM code page.
So if you use "mode con" or "chcp" (or whatever those MS-DOS commands
for changing the OEM codepage were), all non-VFAT filenames would change
on the next read of the FAT directory.

But because every OEM code page is full 8-bit, you always get valid data.
You would just see that your file name is different :D
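
A sketch of that effect, using Python's cp437 and cp850 codecs as stand-ins for two OEM code pages. The name bytes and the helper decode_short_name() are made up for illustration; this is not what fastfat.sys or vfat.ko actually run.

```python
def decode_short_name(raw: bytes, codepage: str) -> str:
    """Interpret raw 8-bit short-name bytes under the given OEM code page."""
    return raw.decode(codepage)


# Same on-disk bytes, two different OEM code pages (bytes are a made-up sample).
raw_name = bytes([0x53, 0x9B, 0x9C, 0x9D])

for codepage in ("cp437", "cp850"):
    print(codepage, decode_short_name(raw_name, codepage))

# Every one of the 256 byte values decodes under both code pages, so the
# data is always "valid"; only the characters shown to the user differ.
all_bytes = bytes(range(256))
print(len(decode_short_name(all_bytes, "cp437")))   # 256
print(len(decode_short_name(all_bytes, "cp850")))   # 256
```

Bytes below 0x80 decode identically in both, which is why plain ASCII names survive a code page change.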

> It would be entertaining for someone with ancient versions of Windows
> 9x to create some floppy images using codepage 437 and 450, and then
> see what a modern Windows system does with those VFAT images --- would

Hehe :-) I did it as part of my investigation of how the FAT volume
label is stored and how different tools read it. The FAT label is *not*
stored as UTF-16 but only in the OEM code page, like old filenames on MS-DOS:
https://www.spinics.net/lists/kernel/msg2640891.html

And what do recent Windows versions do? They decode such filenames (and
therefore also the volume label) via the OEM codepage which belongs to the
current system Language settings. You cannot change the OEM codepage on
recent Windows. You can only change the Regional Language (which then
changes the OEM codepage which belongs to it).

The mapping table between Windows Regional Language and OEM codepage is in
the (still unreleased) fatlabel(8) manpage, section DOS CODEPAGES, here:
https://github.com/dosfstools/dosfstools/blob/master/manpages/fatlabel.8.in

> it break horibbly when it tries to interpret them as UTF-16? Or would

As Windows knows that the filename is stored as 8-bit and not UTF-16,
nothing is broken. Just for characters with the upper bit set you probably
will not see the filenames as you saw them in MS-DOS.

But if you remember which OEM code page you used on MS-DOS, you can
change the Windows Language to one which uses your OEM code page and then
you can read that old FAT fs without any broken file names.

> it figure it out? And if so, how? Inquiring minds want to know....
>
> Bonus points if the lack of forwards compatibility causes older
> versions of Windows to Blue Screen. :-)

I have not got any Blue Screens while reading these older FAT filesystems
created and used by MS-DOS.

On Linux it is easier: just specify the -o codepage= mount option and
vfat.ko translates it correctly.

>
> - Ted
>
> P.S. And of course, then there's the question of how does older
> versions of Windows handle versions of Unicode which postdate the
> release date of that particular version of Windows? After all,

This is not a problem. Windows allows you to store an arbitrary sequence
of uint16[] in a filename (except for disallowed MS-DOS chars like
:?<>...). And when reading a directory you need to expect that it will
return an arbitrary sequence of uint16[].

Windows does not care about valid/invalid/assigned/unassigned code
points. It does not even care about halves of surrogate pairs. So it can
also store one half of an (unpaired) surrogate pair (one uint16).
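
The unpaired-surrogate case can be reproduced with Python's codecs, as a sketch only. The helper name can_encode_strict_utf8() is made up, and the "surrogatepass" handler is Python's own escape hatch, not something the kernel or Windows uses.

```python
def can_encode_strict_utf8(text: str) -> bool:
    """Return True if text can be encoded by Python's strict UTF-8 codec."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


lone = "\ud83d"   # high surrogate without its partner

# As a plain uint16 it is storable: one 16-bit unit, little endian.
print(lone.encode("utf-16-le", errors="surrogatepass").hex())   # 3dd8

# A strict UTF-8 encoder refuses it...
print(can_encode_strict_utf8(lone))                             # False

# ...and only a deliberately lenient one emits the 3-byte form.
print(lone.encode("utf-8", errors="surrogatepass").hex())       # eda0bd
```

This is the gap the thread keeps returning to: such names are legal on disk but have no well-formed UTF-8 representation.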

> Unicode adds new code points with potential revisions to the case
> folding table every 6-12 months. (The most recent version of Unicode
> was released in in April 2019 to accomodate the new Japanese kanji
> character "Rei" for the current era name with the elevation of the new
> current reigning emperor of Japan.)

--
Pali Rohár
[email protected]



2020-01-20 19:37:12

by Al Viro

Subject: Re: oopsably broken case-insensitive support in ext4 and f2fs (Re: vfat: Broken case-insensitive support for UTF-8)

On Mon, Jan 20, 2020 at 08:07:21AM +0000, Al Viro wrote:

> > > I hadn't checked ->d_compare() instances for a while; somebody needs to
> > > do that again, by the look of it. The above definitely is broken;
> > > no idea how many other instaces had grown such bugs...
> >
> > f2fs one also has the same bug. Anyway, I'm going down right now, will
> > check the rest tomorrow morning...
>
> We _probably_ can get away with just checking that inode for NULL and
> buggering off if it is (->d_seq mismatch is guaranteed in that case),
> but I suspect that we might need READ_ONCE() on both dereferences.
> I hate memory barriers...

FWIW, other instances seem to be OK; HFS+ one might or might not be
OK in the face of concurrent rename (wrong result in that case is
no problem; oops would be), but it doesn't play silly buggers with
pointer-chasing.

ext4 and f2fs do, and ->d_compare() is broken in both of them.

2020-01-20 21:43:01

by Pali Rohár

Subject: Re: vfat: Broken case-insensitive support for UTF-8

On Monday 20 January 2020 21:07:12 OGAWA Hirofumi wrote:
> Pali Rohár <[email protected]> writes:
>
> >> To be perfect, the table would have to emulate what Windows use. It can
> >> be unicode standard, or something other.
> >
> > Windows FAT32 implementation (fastfat.sys) is opensource. So it should
> > be possible to inspect code and figure out how it is working.
> >
> > I will try to look at it.
>
> I don't think the conversion library is not in fs driver though,
> checking implement itself would be good.

Ok, I did some research. It took me longer than I thought, as a lot of
stuff is undocumented and it is hard to find all the relevant information.

So... fastfat.sys is using the ntos function RtlUpcaseUnicodeString(), which
takes a UTF-16 string and returns an upper-case UTF-16 string. There is no
mapping table in the fastfat.sys driver itself.

RtlUpcaseUnicodeString() is an ntos kernel function and from my research
it seems that this function uses only the conversion table stored in the
file l_intl.nls (from c:\windows\system32).

The Wine project describes this file as "unicode casing tables" and it
seems that it can parse this file format. Moreover, it distributes its own
version of this file, which looks to be generated from the official
Unicode UnicodeData.txt via the Perl script make_unicode (part of Wine).

So the question is... how much does MS change the l_intl.nls file between
released Windows versions?

I will try to decode the format of l_intl.nls and compare the data in it
across some Windows versions.

Can we reuse the upper case mapping table from that file?

--
Pali Rohár
[email protected]



2020-01-20 22:47:35

by Al Viro

Subject: Re: vfat: Broken case-insensitive support for UTF-8

On Mon, Jan 20, 2020 at 10:40:46PM +0100, Pali Rohár wrote:

> Ok, I did some research. It took me it longer as I thought as lot of
> stuff is undocumented and hard to find all relevant information.
>
> So... fastfat.sys is using ntos function RtlUpcaseUnicodeString() which
> takes UTF-16 string and returns upper case UTF-16 string. There is no
> mapping table in fastfat.sys driver itself.

Er... Surely it's OK to just tabulate that function on 65536 values
and see how could that be packed into something more compact? Whatever
the license of that function might be, this should fall under
interoperability exceptions...

Actually, I wouldn't be surprised if f(x) - x would turn out to be constant
on large enough intervals to provide sufficiently compact representation...
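
A rough sketch of that packing idea. It tabulates a stand-in function over all 65536 values and merges neighbours with the same f(x) - x into runs. The stand-in is Python's str.upper() (restricted to single-unit results), which is an assumption made purely for illustration and is not RtlUpcaseUnicodeString(); the names simple_upper(), build_runs() and upcase() are hypothetical.

```python
from bisect import bisect_right


def simple_upper(cp: int) -> int:
    """Stand-in upcase function: one 16-bit value in, one 16-bit value out."""
    up = chr(cp).upper()
    if len(up) != 1 or ord(up) > 0xFFFF:
        return cp               # ignore multi-character or non-BMP results
    return ord(up)


def build_runs(f, limit=0x10000):
    """Pack f into (first, last, delta) runs where f(x) - x is constant and non-zero."""
    runs = []
    for cp in range(limit):
        delta = f(cp) - cp
        if delta == 0:
            continue
        if runs and runs[-1][1] == cp - 1 and runs[-1][2] == delta:
            runs[-1] = (runs[-1][0], cp, delta)     # extend the current run
        else:
            runs.append((cp, cp, delta))            # start a new run
    return runs


RUNS = build_runs(simple_upper)
STARTS = [first for first, _, _ in RUNS]


def upcase(cp: int) -> int:
    """Look up cp in the packed table using binary search over run starts."""
    i = bisect_right(STARTS, cp) - 1
    if i >= 0 and RUNS[i][0] <= cp <= RUNS[i][1]:
        return cp + RUNS[i][2]
    return cp


print(upcase(ord("a")) == ord("A"))     # True
print(len(RUNS))                        # number of runs needed for this stand-in table
```

Alphabets where upper and lower case alternate code point by code point produce runs of length one, so how compact the result is depends on the real table.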

What am I missing here?

2020-01-20 23:58:55

by Pali Rohár

Subject: Re: vfat: Broken case-insensitive support for UTF-8

On Monday 20 January 2020 22:46:25 Al Viro wrote:
> On Mon, Jan 20, 2020 at 10:40:46PM +0100, Pali Rohár wrote:
>
> > Ok, I did some research. It took me it longer as I thought as lot of
> > stuff is undocumented and hard to find all relevant information.
> >
> > So... fastfat.sys is using ntos function RtlUpcaseUnicodeString() which
> > takes UTF-16 string and returns upper case UTF-16 string. There is no
> > mapping table in fastfat.sys driver itself.
>
> Er... Surely it's OK to just tabulate that function on 65536 values
> and see how could that be packed into something more compact?

It is OK, but too complicated. That function is in the NT kernel, so you
need to build a new kernel module and also decide where to put the output
of that function. It has been a long time since I did any NT kernel hacking,
and nowadays you need to download 10GB+ of Visual Studio, then the addons
for building kernel modules, figure out how to write and compile a simple
kernel module via Visual Studio, write an ini install file, try to load it,
and then you fail anyway, as recent Windows kernels refuse to load kernel
modules which are not signed...

So it was easier for me to look at different sources (WRK, ReactOS, Wine,
github, OSR forums, ...) and figure out what RtlUpcaseUnicodeString()
is doing here.

> Whatever
> the license of that function might be, this should fall under
> interoperability exceptions...
>
> Actually, I wouldn't be surprised if f(x) - x would turn out to be constant
> on large enough intervals to provide sufficiently compact representation...
>
> What am I missing here?

--
Pali Rohár
[email protected]



2020-01-21 00:08:19

by Al Viro

Subject: Re: vfat: Broken case-insensitive support for UTF-8

On Tue, Jan 21, 2020 at 12:57:45AM +0100, Pali Rohár wrote:
> On Monday 20 January 2020 22:46:25 Al Viro wrote:
> > On Mon, Jan 20, 2020 at 10:40:46PM +0100, Pali Rohár wrote:
> >
> > > Ok, I did some research. It took me it longer as I thought as lot of
> > > stuff is undocumented and hard to find all relevant information.
> > >
> > > So... fastfat.sys is using ntos function RtlUpcaseUnicodeString() which
> > > takes UTF-16 string and returns upper case UTF-16 string. There is no
> > > mapping table in fastfat.sys driver itself.
> >
> > Er... Surely it's OK to just tabulate that function on 65536 values
> > and see how could that be packed into something more compact?
>
> It is OK, but too complicated. That function is in nt kernel. So you
> need to build a new kernel module and also decide where to put output of
> that function. It is a long time since I did some nt kernel hacking and
> nowadays you need to download 10GB+ of Visual Studio code, then addons
> for building kernel modules, figure out how to write and compile simple
> kernel module via Visual Studio, write ini install file, try to load it
> and then you even fail as recent Windows kernels refuse to load kernel
> modules which are not signed...

Wait a sec... From NT userland, on a mounted VFAT:
	for all s in single-codepoint strings
		open s for append
		if failed
			print s on stderr, along with error value
		write s to the opened file, adding to its tail
		close the file
then for each equivalence class you'll get a single file, with all
members of that class written to it. In addition you'll get the
list of prohibited codepoints.

Why bother with any kind of kernel modules? IDGI...
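
A userland sketch of that experiment in Python. It is untested on Windows and on a real VFAT mount; the function names, the hex-line file format and the demo code points are all made up for illustration. On a case-sensitive filesystem every code point simply gets its own file.

```python
import os
import tempfile


def probe_case_classes(directory, codepoints):
    """Append each code point to the file named by it; return the rejected ones."""
    rejected = []
    for cp in codepoints:
        try:
            # Names the filesystem treats as equal end up in the same file.
            with open(os.path.join(directory, chr(cp)), "a", encoding="ascii") as f:
                f.write("%04X\n" % cp)
        except (OSError, ValueError) as err:
            rejected.append((cp, repr(err)))    # prohibited name, keep the error
    return rejected


def read_classes(directory):
    """Return one sorted list of code points per file found in directory."""
    classes = []
    for entry in os.scandir(directory):
        if entry.is_file():
            with open(entry.path, encoding="ascii") as f:
                classes.append(sorted(int(line, 16) for line in f))
    return sorted(classes)


def demo():
    """Run the probe on a scratch directory with a handful of code points."""
    with tempfile.TemporaryDirectory() as scratch:
        rejected = probe_case_classes(scratch, [0x61, 0x41, 0x62, 0x2F, 0x00])
        return read_classes(scratch), rejected


if __name__ == "__main__":
    classes, rejected = demo()
    print(classes)
    print([hex(cp) for cp, _ in rejected])
```

For the real run the directory would be on the VFAT volume and the code point list would be range(0x10000); the per-directory entry limit then has to be taken into account.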

2020-01-21 03:54:57

by OGAWA Hirofumi

Subject: Re: vfat: Broken case-insensitive support for UTF-8

"Theodore Y. Ts'o" <[email protected]> writes:

> On Mon, Jan 20, 2020 at 01:04:42PM +0900, OGAWA Hirofumi wrote:
>>
>> To be perfect, the table would have to emulate what Windows use. It can
>> be unicode standard, or something other. And other fs can use different
>> what Windows use.
>
> The big question is *which* version of Windows. vfat has been in use
> for over two decades, and vfat predates Window starting to use Unicode
> in 2001. Before that, vfat would have been using whatever code page
> its local Windows installation was set to sue; and I'm not sure if
> there was space in the FAT headers to indicate the codepage in use.
>
> It would be entertaining for someone with ancient versions of Windows
> 9x to create some floppy images using codepage 437 and 450, and then
> see what a modern Windows system does with those VFAT images --- would
> it break horibbly when it tries to interpret them as UTF-16? Or would
> it figure it out? And if so, how? Inquiring minds want to know....

A perfect encoding converter would have to support all versions if Windows
changed the table. However, you're right: normal users would be OK with the
current Unicode standard and don't care about subtle differences. But a
strict custom system will care about subtle differences; that is why I'm
saying *perfect*.

I'm not against using the current Unicode standard. Just noting it.


BTW, VFAT has to store both the shortname (codepage) and the longname
(UTF-16), and uses both names to open a file. So even the latest Windows
should be using the current locale codepage to make the shortname for VFAT.

And before vfat (in the Linux fs driver, msdos), only the shortname
(codepage) is used.

Thanks.
--
OGAWA Hirofumi <[email protected]>

2020-01-21 11:02:05

by Pali Rohár

Subject: Re: vfat: Broken case-insensitive support for UTF-8

On Tuesday 21 January 2020 12:52:50 OGAWA Hirofumi wrote:
> BTW, VFAT has to store the both of shortname (codepage) and longname
> (UTF16), and using both names to open a file. So Windows should be using
> current locale codepage to make shortname even latest Windows for VFAT.

fastfat.sys stores only 7-bit characters in shortnames, which are the same
in all OEM codepages. Non-7-bit characters are replaced by an underscore,
or the name is shortened with the "~N" syntax. According to the source code
of fastfat.sys, it has some registry option to also allow usage of the full
8-bit OEM codepage.

So the default behavior seems to be safe.

--
Pali Rohár
[email protected]

2020-01-21 12:28:44

by OGAWA Hirofumi

Subject: Re: vfat: Broken case-insensitive support for UTF-8

Pali Rohár <[email protected]> writes:

> On Tuesday 21 January 2020 12:52:50 OGAWA Hirofumi wrote:
>> BTW, VFAT has to store the both of shortname (codepage) and longname
>> (UTF16), and using both names to open a file. So Windows should be using
>> current locale codepage to make shortname even latest Windows for VFAT.
>
> fastfat.sys stores into shortnames only 7bit characters. Which is same
> in all OEM codepages. Non-7bit are replaced by underline or shortened by
> "~N" syntax. According to source code of fastfat.sys it has some
> registry option to allow usage also of full 8bit OEM codepage.
>
> So default behavior seems to be safe.

Are you sure the default is 7-bit only? I'm pretty sure that, at least,
old Windows versions stored 8-bit chars in a default install.

Thanks.
--
OGAWA Hirofumi <[email protected]>

2020-01-21 12:45:36

by David Laight

Subject: RE: vfat: Broken case-insensitive support for UTF-8

From: Pali Rohár
> Sent: 20 January 2020 23:58
...
> It is OK, but too complicated. That function is in nt kernel. So you
> need to build a new kernel module and also decide where to put output of
> that function. It is a long time since I did some nt kernel hacking and
> nowadays you need to download 10GB+ of Visual Studio code, then addons
> for building kernel modules, figure out how to write and compile simple
> kernel module via Visual Studio, write ini install file, try to load it
> and then you even fail as recent Windows kernels refuse to load kernel
> modules which are not signed...

Actually it isn't quite that hard.
You can download Windbg.exe (without too much other junk if you find the right place).
Use bcdedit to let it look at the current kernel (and reboot).
Then you can grovel through the live system kernel with almost no restriction.
Point it at the 'symbol server' and it knows the layouts of a lot of kernel structures.
OTOH its command syntax is spectacularly horrid.

There is also a boot flag to let you load 'test signed' drivers.
OTOH signing drivers for a release is now a real PITA.

David


2020-01-21 20:35:19

by Pali Rohár

Subject: Re: vfat: Broken case-insensitive support for UTF-8

On Tuesday 21 January 2020 00:07:01 Al Viro wrote:
> On Tue, Jan 21, 2020 at 12:57:45AM +0100, Pali Rohár wrote:
> > On Monday 20 January 2020 22:46:25 Al Viro wrote:
> > > On Mon, Jan 20, 2020 at 10:40:46PM +0100, Pali Rohár wrote:
> > >
> > > > Ok, I did some research. It took me it longer as I thought as lot of
> > > > stuff is undocumented and hard to find all relevant information.
> > > >
> > > > So... fastfat.sys is using ntos function RtlUpcaseUnicodeString() which
> > > > takes UTF-16 string and returns upper case UTF-16 string. There is no
> > > > mapping table in fastfat.sys driver itself.
> > >
> > > Er... Surely it's OK to just tabulate that function on 65536 values
> > > and see how could that be packed into something more compact?
> >
> > It is OK, but too complicated. That function is in nt kernel. So you
> > need to build a new kernel module and also decide where to put output of
> > that function. It is a long time since I did some nt kernel hacking and
> > nowadays you need to download 10GB+ of Visual Studio code, then addons
> > for building kernel modules, figure out how to write and compile simple
> > kernel module via Visual Studio, write ini install file, try to load it
> > and then you even fail as recent Windows kernels refuse to load kernel
> > modules which are not signed...
>
> Wait a sec... From NT userland, on a mounted VFAT:
> for all s in single-codepoint strings
> open s for append
> if failed
> print s on stderr, along with error value
> write s to the opened file, adding to its tail
> close the file
> the for each equivalence class you'll get a single file, with all
> members of that class written to it. In addition you'll get the
> list of prohibited codepoints.
>
> Why bother with any kind of kernel modules? IDGI...

This is a great idea to get FAT equivalence classes. Thank you!

Now I quickly tried it... and it failed. FAT has a restriction on the
number of files in a directory, so I would have to do it in a more clever
way, e.g. prepare N directories and then try to create/open a file for each
single-codepoint string in every directory until it succeeds, or fails in
every one.

--
Pali Rohár
[email protected]



2020-01-21 21:38:14

by Al Viro

Subject: Re: vfat: Broken case-insensitive support for UTF-8

On Tue, Jan 21, 2020 at 09:34:05PM +0100, Pali Rohár wrote:

> This is a great idea to get FAT equivalence classes. Thank you!
>
> Now I quickly tried it... and it failed. FAT has restriction for number
> of files in a directory, so I would have to do it in more clever way,
> e.g prepare N directories and then try to create/open file for each
> single-point string in every directory until it success or fail in every
> one.

IIRC, the limitation in root directory was much harder than in
subdirectories... Not sure, though - it had been a long time since
I had to touch *FAT for any reasons...

2020-01-21 22:17:10

by Al Viro

Subject: Re: vfat: Broken case-insensitive support for UTF-8

On Tue, Jan 21, 2020 at 09:36:25PM +0000, Al Viro wrote:
> On Tue, Jan 21, 2020 at 09:34:05PM +0100, Pali Rohár wrote:
>
> > This is a great idea to get FAT equivalence classes. Thank you!
> >
> > Now I quickly tried it... and it failed. FAT has restriction for number
> > of files in a directory, so I would have to do it in more clever way,
> > e.g prepare N directories and then try to create/open file for each
> > single-point string in every directory until it success or fail in every
> > one.
>
> IIRC, the limitation in root directory was much harder than in
> subdirectories... Not sure, though - it had been a long time since
> I had to touch *FAT for any reasons...

Interesting... FWIW, Linux vfat happily creates 65536 files in root
directory. What are the native limits?

2020-01-21 22:47:44

by Pali Rohár

Subject: Re: vfat: Broken case-insensitive support for UTF-8

On Tuesday 21 January 2020 22:14:47 Al Viro wrote:
> On Tue, Jan 21, 2020 at 09:36:25PM +0000, Al Viro wrote:
> > On Tue, Jan 21, 2020 at 09:34:05PM +0100, Pali Rohár wrote:
> >
> > > This is a great idea to get FAT equivalence classes. Thank you!
> > >
> > > Now I quickly tried it... and it failed. FAT has restriction for number
> > > of files in a directory, so I would have to do it in more clever way,
> > > e.g prepare N directories and then try to create/open file for each
> > > single-point string in every directory until it success or fail in every
> > > one.
> >
> > IIRC, the limitation in root directory was much harder than in
> > subdirectories... Not sure, though - it had been a long time since
> > I had to touch *FAT for any reasons...

IIRC the limit on root directory entries applied only to FAT12 and
FAT16. But I already used subdirectories. Also, a VFAT name occupies at
least two directory entries (the short name plus one VFAT long-name
entry).

> Interesting... FWIW, Linux vfat happily creates 65536 files in root
> directory. What are the native limits?

Interesting... When I tried to create a new file with Linux vfat in the
directory where Windows had created 32794 files, Linux vfat returned the
error "No space left on device" even though the FS has only 39% of its
space used. Linux vfat can put a new file into the parent directory
without any problem.
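
The numbers in this exchange can be checked with some directory-entry arithmetic. The sketch below assumes the commonly cited FAT limits: 32-byte directory entries, at most 65536 entries per directory, and 13 UTF-16 units per long-name entry. The function names are made up for illustration.

```python
import math

DIR_ENTRY_SIZE = 32          # bytes per FAT directory entry
MAX_DIR_ENTRIES = 65536      # assumed per-directory cap (2 MiB of entries)
LFN_CHARS_PER_ENTRY = 13     # UTF-16 units stored in one long-name entry


def entries_for_name(utf16_len, needs_long_name=True):
    """Directory entries used by one file: the short-name entry plus LFN entries."""
    if not needs_long_name:
        return 1
    return 1 + math.ceil(utf16_len / LFN_CHARS_PER_ENTRY)


def max_files(utf16_len, needs_long_name=True, reserved=0):
    """Upper bound on files in one directory if all names have the same length."""
    return (MAX_DIR_ENTRIES - reserved) // entries_for_name(utf16_len, needs_long_name)
```

Under these assumptions a subdirectory (two entries reserved for "." and "..") holds at most 32767 single-character names that all need a long-name entry, while 65536 files fit only if every name is a plain short name. The 32794 files reported above are slightly over the first bound, which would fit if a few dozen of the names needed only a short entry; that is a guess, the thread does not confirm it.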

--
Pali Rohár



2020-01-22 00:26:36

by Gabriel Krisman Bertazi

Subject: Re: vfat: Broken case-insensitive support for UTF-8

Pali Rohár writes:

> On Monday 20 January 2020 21:07:12 OGAWA Hirofumi wrote:
>> Pali Rohár writes:
>>
>> >> To be perfect, the table would have to emulate what Windows use. It can
>> >> be unicode standard, or something other.
>> >
>> > Windows FAT32 implementation (fastfat.sys) is opensource. So it should
>> > be possible to inspect code and figure out how it is working.
>> >
>> > I will try to look at it.
>>
>> I don't think the conversion library is not in fs driver though,
>> checking implement itself would be good.
>
> Ok, I did some research. It took me it longer as I thought as lot of
> stuff is undocumented and hard to find all relevant information.
>
> So... fastfat.sys is using ntos function RtlUpcaseUnicodeString() which
> takes UTF-16 string and returns upper case UTF-16 string. There is no
> mapping table in fastfat.sys driver itself.
>
> RtlUpcaseUnicodeString() is a ntos kernel function and after my research
> it seems that this function is using only conversion table stored in
> file l_intl.nls (from c:\windows\system32).
>
> Project wine describe this file as "unicode casing tables" and seems
> that it can parse this file format. Even more it distributes its own
> version of this file which looks like to be generated from official
> Unicode UnicodeData.txt via Perl script make_unicode (part of wine).
>
> So question is... how much is MS changing l_intl.nls file in their
> released Windows versions?
>
> I would try to decode what is format of that file l_intl.nls and try to
> compare data in it from some Windows versions.
>
> Can we reuse upper case mapping table from that file?

Regarding fs/unicode, we have some infrastructure to parse UCD files,
handle Unicode versioning, and store the data in a more compact
structure. See the mkutf8data script.

Right now, we only store the mapping of each code point to its NFD + full
casefold, but it would be possible to extend the parsing script to store
the un-normalized uppercase version in the data structure. So, if
l_intl.nls is generated from UnicodeData.txt, you might consider
extending fs/unicode to store it. We store the code points in a format
optimized for decoding UTF-8, but the infrastructure is halfway there
already.
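
To make the "un-normalized uppercase" idea concrete, here is a minimal userspace sketch that pulls the Simple_Uppercase_Mapping (field 12) out of UnicodeData.txt lines. It is not how mkutf8data works and says nothing about the contents of l_intl.nls; `parse_simple_uppercase` and `uni_toupper` are hypothetical names, and the sample lines are copied from the UCD.

```python
def parse_simple_uppercase(lines):
    """Return {code point: simple uppercase code point} from UnicodeData.txt lines."""
    table = {}
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 15 or not fields[12]:
            continue                      # no Simple_Uppercase_Mapping
        table[int(fields[0], 16)] = int(fields[12], 16)
    return table


def uni_toupper(table, cp):
    """Un-normalized, one-to-one uppercase; unmapped code points map to themselves."""
    return table.get(cp, cp)


SAMPLE = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
    "00B5;MICRO SIGN;Ll;0;L;<compat> 03BC;;;;N;;;039C;;039C",
    "00F9;LATIN SMALL LETTER U WITH GRAVE;Ll;0;L;0075 0300;;;;N;LATIN SMALL LETTER U GRAVE;;00D9;;00D9",
]
```

Only the one-to-one mappings are kept, so no normalization or multi-character folding is involved, which is what a per-code-point VFAT comparison would need.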

--
Gabriel Krisman Bertazi

2020-01-24 04:35:46

by Eric Biggers

Subject: Re: oopsably broken case-insensitive support in ext4 and f2fs (Re: vfat: Broken case-insensitive support for UTF-8)

On Mon, Jan 20, 2020 at 07:35:58PM +0000, Al Viro wrote:
> On Mon, Jan 20, 2020 at 08:07:21AM +0000, Al Viro wrote:
>
> > > > I hadn't checked ->d_compare() instances for a while; somebody needs to
> > > > do that again, by the look of it. The above definitely is broken;
> > > > no idea how many other instaces had grown such bugs...
> > >
> > > f2fs one also has the same bug. Anyway, I'm going down right now, will
> > > check the rest tomorrow morning...
> >
> > We _probably_ can get away with just checking that inode for NULL and
> > buggering off if it is (->d_seq mismatch is guaranteed in that case),
> > but I suspect that we might need READ_ONCE() on both dereferences.
> > I hate memory barriers...
>
> FWIW, other instances seem to be OK; HFS+ one might or might not be
> OK in the face of concurrent rename (wrong result in that case is
> no problem; oops would be), but it doesn't play silly buggers with
> pointer-chasing.
>
> ext4 and f2fs do, and ->d_compare() is broken in both of them.

Thanks Al. I sent out fixes for this:

ext4: https://lore.kernel.org/r/[email protected]
f2fs: https://lore.kernel.org/r/[email protected]

Note that ->d_hash() was broken too. In fact, that was much easier to
reproduce.

- Eric

2020-01-24 20:58:06

by Linus Torvalds

Subject: Re: oopsably broken case-insensitive support in ext4 and f2fs (Re: vfat: Broken case-insensitive support for UTF-8)

On Thu, Jan 23, 2020 at 8:29 PM Eric Biggers wrote:
>
> Thanks Al. I sent out fixes for this:

How did that f2fs_d_compare() function ever work? It was doing the
memcmp on completely the wrong thing.

Linus

2020-01-24 20:59:30

by Jaegeuk Kim

Subject: Re: oopsably broken case-insensitive support in ext4 and f2fs (Re: vfat: Broken case-insensitive support for UTF-8)

On 01/24, Linus Torvalds wrote:
> On Thu, Jan 23, 2020 at 8:29 PM Eric Biggers wrote:
> >
> > Thanks Al. I sent out fixes for this:
>
> How did that f2fs_d_compare() function ever work? It was doing the
> memcmp on completely the wrong thing.

Urg.. my bad. I didn't stress-test the casefolding case enough; it is
only activated given "mkfs -C utf8:strict". And Android hasn't enabled
it yet.

>
> Linus

2020-01-24 21:00:55

by Eric Biggers

Subject: Re: oopsably broken case-insensitive support in ext4 and f2fs (Re: vfat: Broken case-insensitive support for UTF-8)

On Fri, Jan 24, 2020 at 10:03:23AM -0800, Jaegeuk Kim wrote:
> On 01/24, Linus Torvalds wrote:
> > On Thu, Jan 23, 2020 at 8:29 PM Eric Biggers wrote:
> > >
> > > Thanks Al. I sent out fixes for this:
> >
> > How did that f2fs_d_compare() function ever work? It was doing the
> > memcmp on completely the wrong thing.
>
> Urg.. my bad. I didn't do enough stress test on casefolding case which
> is only activated given "mkfs -C utf8:strict". And Android hasn't enabled
> it yet.
>

It also didn't cause *really* obvious breakage because in practice it only
caused ->d_compare() to incorrectly return false, and that just caused new
dentries to be created rather than the existing ones reused. So most things
continued to work.

It can be noticed by way of deleting files not actually freeing up space... Or
the way I noticed it is that my reproducer for the ->d_hash() crash worked on
ext4 but not f2fs.

- Eric

2020-01-26 23:11:38

by Pali Rohár

Subject: Re: vfat: Broken case-insensitive support for UTF-8

On Tuesday 21 January 2020 21:34:05 Pali Rohár wrote:
> On Tuesday 21 January 2020 00:07:01 Al Viro wrote:
> > On Tue, Jan 21, 2020 at 12:57:45AM +0100, Pali Rohár wrote:
> > > On Monday 20 January 2020 22:46:25 Al Viro wrote:
> > > > On Mon, Jan 20, 2020 at 10:40:46PM +0100, Pali Rohár wrote:
> > > >
> > > > > Ok, I did some research. It took me it longer as I thought as lot of
> > > > > stuff is undocumented and hard to find all relevant information.
> > > > >
> > > > > So... fastfat.sys is using ntos function RtlUpcaseUnicodeString() which
> > > > > takes UTF-16 string and returns upper case UTF-16 string. There is no
> > > > > mapping table in fastfat.sys driver itself.
> > > >
> > > > Er... Surely it's OK to just tabulate that function on 65536 values
> > > > and see how could that be packed into something more compact?
> > >
> > > It is OK, but too complicated. That function is in nt kernel. So you
> > > need to build a new kernel module and also decide where to put output of
> > > that function. It is a long time since I did some nt kernel hacking and
> > > nowadays you need to download 10GB+ of Visual Studio code, then addons
> > > for building kernel modules, figure out how to write and compile simple
> > > kernel module via Visual Studio, write ini install file, try to load it
> > > and then you even fail as recent Windows kernels refuse to load kernel
> > > modules which are not signed...
> >
> > Wait a sec... From NT userland, on a mounted VFAT:
> > for all s in single-codepoint strings
> > open s for append
> > if failed
> > print s on stderr, along with error value
> > write s to the opened file, adding to its tail
> > close the file
> > the for each equivalence class you'll get a single file, with all
> > members of that class written to it. In addition you'll get the
> > list of prohibited codepoints.
> >
> > Why bother with any kind of kernel modules? IDGI...
>
> This is a great idea to get FAT equivalence classes. Thank you!
>
> Now I quickly tried it... and it failed. FAT has restriction for number
> of files in a directory, so I would have to do it in more clever way,
> e.g prepare N directories and then try to create/open file for each
> single-point string in every directory until it success or fail in every
> one.

Now I have done the test with more directories and it finally passed. I
ran it on WinXP with different configurations, and the results are
interesting...

First important thing: the DOS OEM codepage is implicitly configured by
the option "Language for non-Unicode programs" found in "Regional and
Language Options" on the "Advanced" tab (run: intl.cpl). It is *not*
affected by the "Standards and formats" language and also *not* by the
"Location" language. The description of "Language for non-Unicode
programs" says: "It does not affect Unicode programs", which is clearly
untrue, as it affects all Unicode programs which store data on a FAT fs.

Second thing: the equivalence classes depend on the OEM codepage, and
they differ between codepages. Note that some languages share one
codepage.

CP850 (languages: English UK, Afrikaans, ...) has 614 non-trivial (*)
equivalence classes, CP852 (Slavic languages) has 619 and CP437 (English
USA) has only 586.

The biggest equivalence class is the one for 'U'. It has the following
elements:

CP437:
0x0055 0x0075 0x00d9 0x00da 0x00db 0x00f9 0x00fa 0x00fb 0x0168 0x0169
0x016a 0x016b 0x016c 0x016d 0x016e 0x016f 0x0170 0x0171 0x0172 0x0173
0x01af 0x01b0 0x01d3 0x01d4 0x01d5 0x01d6 0x01d7 0x01d8 0x01d9 0x01da
0x01db 0x01dc 0xff35 0xff55

CP852:
0x0055 0x0075 0x00b5 0x00d9 0x00db 0x00f9 0x00fb 0x0168 0x0169 0x016a
0x016b 0x016c 0x016d 0x0172 0x0173 0x01af 0x01b0 0x01d3 0x01d4 0x01d5
0x01d6 0x01d7 0x01d8 0x01d9 0x01da 0x01db 0x01dc 0x03bc 0xff35 0xff55

CP850:
0x0055 0x0075 0x0168 0x0169 0x016a 0x016b 0x016c 0x016d 0x016e 0x016f
0x0170 0x0171 0x0172 0x0173 0x01af 0x01b0 0x01d3 0x01d4 0x01d5 0x01d6
0x01d7 0x01d8 0x01d9 0x01da 0x01db 0x01dc 0xff35 0xff55

Note that the elements are Unicode code points.

It is interesting that for English USA (CP437) "U" and "Ù" are in the
same equivalence class, but for English UK (CP850) "U" and "Ù" are in
different classes. CP850 has "Ù" in a two-member class: 0x00d9 0x00f9

Are there any cultural, regional or linguistic reasons why the English
USA and English UK languages/regions should treat "Ù" differently?

So, third thing: how should we handle this complicated situation in our
VFAT implementation in the Linux kernel when using UTF-8 encoding for
userspace?

For fixing case-insensitivity for UTF-8 I see the following options:

Option 1) Create the intersection of the equivalence classes from all
codepages and use it for the Linux VFAT uppercase function. This would
ensure that whatever codepage/language Windows uses, Linux VFAT does not
create files that are inaccessible to Windows (see PPS).

Option 2) As the equivalence classes depend on the codepage, and VFAT
already needs to know the codepage when mounting/accessing short names,
we can calculate a "common" uppercase table (which would be the same for
all codepages, ideally the one from option 1) and then the differences
between the "common" uppercase table and each codepage's equivalence
classes. The kernel already has uppercase tables for NLS codepages, so
we can store these "differences" in them. In this case VFAT would know
the uppercase function for the specified codepage (which is already
passed as a mount parameter).

Option 3) Ignore this MS shit nonsense (see PPS for how it is broken)
and define the uppercase table from the Unicode standard. This would be
the behavior userspace most expects, but it is incompatible with the MS
FAT32 implementation.

Option 4) Use the uppercase table from the Unicode standard (as in
option 3), but also add the definitions from option 1. This would ensure
that all files created by VFAT are accessible on any Windows system (see
PPS), plus there would be uppercase definitions from the Unicode
standard (but only those which do not break the definitions from
option 1 with respect to the PPS).

Option 5) Create a kernel <---> userspace API which would allow
userspace to define the mapping table (or equivalence classes), moving
this problem from the kernel to userspace. But as we already discussed,
this is hard, plus without proper configuration from userspace, the
kernel's VFAT driver could modify the FS in a way that MS would not be
able to use it.
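
As a rough illustration of options 1 and 2, the three 'U' classes listed above can be intersected and the per-codepage remainders computed. This only covers the single class quoted in this mail; a full table would need the same treatment for every class, and the helper names are hypothetical.

```python
# The 'U' equivalence class per codepage, as listed in this mail.
U_CLASS = {
    "CP437": {0x55, 0x75, 0xd9, 0xda, 0xdb, 0xf9, 0xfa, 0xfb,
              *range(0x168, 0x174), 0x1af, 0x1b0, *range(0x1d3, 0x1dd),
              0xff35, 0xff55},
    "CP852": {0x55, 0x75, 0xb5, 0xd9, 0xdb, 0xf9, 0xfb,
              *range(0x168, 0x16e), 0x172, 0x173, 0x1af, 0x1b0,
              *range(0x1d3, 0x1dd), 0x3bc, 0xff35, 0xff55},
    "CP850": {0x55, 0x75, *range(0x168, 0x174), 0x1af, 0x1b0,
              *range(0x1d3, 0x1dd), 0xff35, 0xff55},
}


def common_class(classes):
    """Members that belong to the class under every codepage (option 1)."""
    return set.intersection(*classes.values())


def per_codepage_extras(classes):
    """What each codepage adds on top of the common class (option 2)."""
    common = common_class(classes)
    return {cp: members - common for cp, members in classes.items()}
```

With this data the common class has 24 members, and U+00D9 falls outside it because CP850 keeps it in a separate class. The CP850 remainder is just 0x016e to 0x0171, while CP852 adds 0x00b5, 0x00d9, 0x00db, 0x00f9, 0x00fb and 0x03bc.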

Or do you have a better idea how to handle this problem?



(*) - with more than one element

PS: If somebody is interested, I can share my complete results and the
source code of the testing application.

PPS: If you create two files "U" and "Ù" on English UK (you can do that
as these code points are in different equivalence classes) and then
attach this FAT32 fs to English USA, you will not be able to access the
"Ù" file. Windows English USA lists both files "U" and "Ù", but whichever
one you open, Windows always gives you the content of file "U". "Ù" is
therefore inaccessible until you change the language to English UK.
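
The shadowing described in the PPS can be modelled with a toy lookup. The sketch assumes that a lookup returns the first directory entry whose folded name matches, which is a guess about the observed behavior and not a description of fastfat.sys; `make_fold` and `lookup` are hypothetical helpers.

```python
def make_fold(classes):
    """Build a fold function from equivalence classes given as lists of characters."""
    rep = {ch: cls[0] for cls in classes for ch in cls}
    return lambda name: "".join(rep.get(ch, ch) for ch in name)


def lookup(directory, name, fold):
    """Return the content of the first entry whose folded name matches."""
    for entry_name, content in directory:
        if fold(entry_name) == fold(name):
            return content
    return None


# Partial classes taken from the lists above; everything else folds to itself.
fold_uk = make_fold([["U", "u"], ["\u00d9", "\u00f9"]])   # CP850
fold_us = make_fold([["U", "u", "\u00d9", "\u00f9"]])     # CP437

# Both files exist on disk, in creation order.
directory = [("U", "content of U"), ("\u00d9", "content of \u00d9")]
```

Listing `directory` shows both names under either fold, but `lookup(directory, "\u00d9", fold_us)` stops at the "U" entry, while the same call with `fold_uk` reaches the "Ù" entry.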

--
Pali Rohár

