Hello!
While I was reading a new patch series for the exfat filesystem driver I
saw the proposed implementation for converting exfat's UTF-16LE
filenames for userspace, so I decided to investigate what the filesystems
which are already part of the Linux kernel are doing.
I looked at filesystems supported by the Linux kernel which do not store
filenames as a plain sequence of octets, but rather expect the on-disk
filenames to follow some encoding.
Below is a list of these filesystems with their native encodings:
befs      UTF-8
cifs      UTF-16LE
msdos     unspecified OEM codepage
vfat      unspecified OEM codepage or UTF-16LE
hfs       octets
hfsplus   UTF-16BE-NFD-Apple
isofs     octets or UTF-16BE
jfs       UTF-16LE
ntfs      UTF-16LE
udf       Latin1 or UTF-16BE
The filesystems msdos, vfat, hfs and isofs are bogus in this regard, as
their on-disk structure does not say in which encoding filenames are
stored. For vfat and isofs there is only information whether it is
UTF-16LE or some unspecified encoding. A user who accesses such a
filesystem must know in which encoding the data was stored on it. For
this purpose vfat and hfs have the mount option codepage=<codepage>.
All the other filesystems store the encoding of filenames in their
structures, either implicitly (hfsplus is always UTF-16BE with Apple's
modified NFD normalization) or explicitly (UDF has a byte which says
whether a filename is in Latin1 or in UTF-16BE).
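For illustration only (the real code lives in fs/udf/unicode.c and the
function name here is made up), reading that explicit marker in UDF's OSTA
Compressed Unicode names is basically just:
#include <linux/types.h>
#include <linux/errno.h>
/* Illustration only, not the real fs/udf code: in UDF (OSTA Compressed
 * Unicode) the first byte of an on-disk name is a compression ID which
 * declares the encoding of the remaining bytes. */
static int udf_name_encoding(const u8 *dstring, int len)
{
        if (len < 1)
                return -EINVAL;
        switch (dstring[0]) {
        case 8:         /* one byte per character (Latin1) */
                return 8;
        case 16:        /* two big-endian bytes per character (UTF-16BE) */
                return 16;
        default:
                return -EINVAL;
        }
}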
As passing UTF-16(LE|BE) buffers via null-terminated strings is not
possible for any VFS syscall, the Linux kernel translates these Unicode
filenames to some charset, controlled by various mount options. I looked
at which mount options are understood by our Linux filesystem
implementations. In all the following paragraphs, by filesystem I mean the
Linux driver implementation (and not the structure on disk), so do not be
confused. Below is the table:
befs      iocharset=<charset>
cifs      iocharset=<charset>
msdos     (unsupported)
vfat      utf8=0|no|false|1|yes|true OR utf8 OR iocharset=<charset>
hfs       iocharset=<charset>
hfsplus   nls=<charset>
isofs     iocharset=<charset> OR utf8
jfs       iocharset=<charset>
ntfs      nls=<charset> OR iocharset=<charset> OR utf8
udf       utf8 OR iocharset=<charset>
The msdos filesystem does not support specifying the OEM codepage
encoding. It passes the 8-bit buffer through to userspace and expects that
userspace understands the correct OEM codepage. There is no support for
re-encoding it to UTF-8 (or any other charset). The same applies to isofs
when no Joliet structure is stored on the filesystem.
The vfat filesystem has the most obscure way of specifying a charset.
Details are in the mount(8) manual page. What is important: the option
iocharset=utf8 is buggy and may break filesystem consistency (it allows
creating two directory entries which differ only in case, which is not
allowed by the FAT specification). Due to this problem there is a fix,
the mount option utf8=1 (or utf8=yes or utf8=true or just utf8), which
does what you would expect from iocharset=utf8 if it were not buggy.
The ntfs filesystem has the option iocharset=<charset>, which is just an
alias for nls=<charset>, and says that iocharset= is deprecated. The same
applies to the option utf8, which is just an alias for nls=utf8.
The isofs and udf filesystems have two ways to specify the UTF-8
encoding: the first is via the utf8 mount option and the second via the
iocharset=utf8 option. It looks like there is only one difference:
iocharset=utf8 supports only Unicode code points up to U+FFFF (limited to
3-byte-long UTF-8 sequences, like the utf8/utf8mb3 encoding in
MySQL/MariaDB), while the utf8 option also supports code points above
U+FFFF, so full Unicode and not just a limited subset.
The cifs filesystem in UTF-8 mode (via iocharset=utf8) always supports
code points above U+FFFF. But the remaining filesystems befs, hfs,
hfsplus, jfs and ntfs seem to support only Unicode code points up to
U+FFFF, so effectively they do not support UTF-16, just UCS-2. This
limitation comes from the kernel NLS table framework/API, which is limited
to 16-bit integers and therefore the maximal Unicode code point is U+FFFF.
The cifs, isofs, udf and vfat filesystems have their own special code to
work with surrogate pairs and do not use the limited NLS table functions.
There are also the functions utf8s_to_utf16s() and utf16s_to_utf8s() for
this purpose.
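A minimal sketch of what these helpers do with a code point above U+FFFF
(the function and enum exist in fs/nls/nls_base.c, everything around the
call is just example code):
#include <linux/nls.h>
/* Sketch only: U+1F600 is the 4-byte UTF-8 sequence F0 9F 98 80.
 * utf8s_to_utf16s() expands it into the surrogate pair 0xD83D 0xDE00,
 * which the 16-bit NLS char2uni()/uni2char() hooks cannot express. */
static int example_utf8_to_utf16(void)
{
        const u8 name[4] = { 0xF0, 0x9F, 0x98, 0x80 };  /* U+1F600 */
        wchar_t out[4];
        int units;
        units = utf8s_to_utf16s(name, 4, UTF16_HOST_ENDIAN, out, 4);
        /* units == 2, out[0] == 0xD83D, out[1] == 0xDE00 */
        return units;
}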
And here I see these possible improvements for all the above filesystems:
1) Unify mount options for specifying charset.
Currently all filesystems except msdos and hfsplus have the mount option
iocharset=<charset>. hfsplus has nls=<charset> and msdos does not
implement re-encoding support. Plus vfat, udf and isofs have a broken
iocharset=utf8 option (but a working utf8 option), and ntfs has a
deprecated iocharset=<charset> option.
I would suggest the following changes for unification:
* Add a new alias iocharset= for hfsplus which would do the same as nls=
* Make the iocharset=utf8 option for vfat, udf and isofs do the same as utf8
* Un-deprecate the iocharset=<charset> option for ntfs
As a result, all filesystems would have an iocharset=<charset> option
which would work for any charset, including iocharset=utf8. It would
also fix the broken iocharset=utf8 for vfat, udf and isofs.
2) Add support for Unicode code points above U+FFFF for the filesystems
befs, hfs, hfsplus, jfs and ntfs, so the iocharset=utf8 option would also
work with filenames in userspace that use 4-byte-long UTF-8 sequences.
3) Add support for the iocharset= and codepage= options for the msdos
filesystem. It shares a lot of code with the vfat driver.
What do you think about these improvements? The first one should be
relatively simple, and if we agree that this unification of the iocharset=
mount option makes sense, I could do it.
--
Pali Rohár
[email protected]
On Thu 02-01-20 22:18:55, Pali Rohár wrote:
> 1) Unify mount options for specifying charset.
>
> Currently all filesystems except msdos and hfsplus have the mount option
> iocharset=<charset>. hfsplus has nls=<charset> and msdos does not
> implement re-encoding support. Plus vfat, udf and isofs have a broken
> iocharset=utf8 option (but a working utf8 option), and ntfs has a
> deprecated iocharset=<charset> option.
>
> I would suggest the following changes for unification:
>
> * Add a new alias iocharset= for hfsplus which would do the same as nls=
> * Make the iocharset=utf8 option for vfat, udf and isofs do the same as utf8
> * Un-deprecate the iocharset=<charset> option for ntfs
>
> As a result, all filesystems would have an iocharset=<charset> option
> which would work for any charset, including iocharset=utf8. It would
> also fix the broken iocharset=utf8 for vfat, udf and isofs.
Makes sense to me.
> 2) Add support for Unicode code points above U+FFFF for the filesystems
> befs, hfs, hfsplus, jfs and ntfs, so the iocharset=utf8 option would also
> work with filenames in userspace that use 4-byte-long UTF-8 sequences.
Also looks good but when doing this, I'd suggest we extend NLS to support
full UTF-8 rather than implementing it by hand like e.g. we did for UDF.
> 3) Add support for the iocharset= and codepage= options for the msdos
> filesystem. It shares a lot of code with the vfat driver.
I guess this is for msdos filesystem maintainers to decide.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Tuesday 07 January 2020 14:32:33 Jan Kara wrote:
> On Thu 02-01-20 22:18:55, Pali Rohár wrote:
> > 1) Unify mount options for specifying charset.
> >
> > Currently all filesystems except msdos and hfsplus have the mount option
> > iocharset=<charset>. hfsplus has nls=<charset> and msdos does not
> > implement re-encoding support. Plus vfat, udf and isofs have a broken
> > iocharset=utf8 option (but a working utf8 option), and ntfs has a
> > deprecated iocharset=<charset> option.
> >
> > I would suggest the following changes for unification:
> >
> > * Add a new alias iocharset= for hfsplus which would do the same as nls=
> > * Make the iocharset=utf8 option for vfat, udf and isofs do the same as utf8
> > * Un-deprecate the iocharset=<charset> option for ntfs
> >
> > As a result, all filesystems would have an iocharset=<charset> option
> > which would work for any charset, including iocharset=utf8. It would
> > also fix the broken iocharset=utf8 for vfat, udf and isofs.
>
> Makes sense to me.
Ok!
> > 2) Add support for Unicode code points above U+FFFF for the filesystems
> > befs, hfs, hfsplus, jfs and ntfs, so the iocharset=utf8 option would also
> > work with filenames in userspace that use 4-byte-long UTF-8 sequences.
>
> Also looks good but when doing this, I'd suggest we extend NLS to support
> full UTF-8 rather than implementing it by hand like e.g. we did for UDF.
The current kernel NLS framework API supports upper-case / lower-case
conversion only for single-byte encodings, so there is no case-insensitive
support for the UTF-8 encoding. And for Unicode conversion it supports
only UCS-2, therefore code points up to U+FFFF, so for UTF-8 at most
3-byte-long sequences.
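For reference, the conversion hooks in include/linux/nls.h look roughly
like this (trimmed, the exact layout may differ between kernel versions);
the single 16-bit wchar_t is what limits it to UCS-2:
/* Trimmed from include/linux/nls.h -- the per-charset operations.
 * Both conversion hooks work on one 16-bit wchar_t code point, so
 * anything above U+FFFF simply cannot be represented. */
struct nls_table {
        const char *charset;
        const char *alias;
        int (*uni2char)(wchar_t uni, unsigned char *out, int boundlen);
        int (*char2uni)(const unsigned char *rawstring, int boundlen,
                        wchar_t *uni);
        const unsigned char *charset2lower;     /* 256-entry case tables */
        const unsigned char *charset2upper;
        struct module *owner;
        struct nls_table *next;
};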
This really is not possible to fix without rewriting the existing
filesystems which use the NLS API.
One hacky option would be to extend the NLS API from UCS-2 to UTF-16 and
fix all users of the NLS API to expect UTF-16 surrogate pairs.
But I dislike UTF-16 and would rather use unicode_t (UTF-32), which is
already present in the kernel. However, because the existing filesystem
drivers pass their UCS-2/UTF-16 buffers from the FS to the NLS API, it is
not easy to change the whole NLS API from UCS-2 to UTF-32.
And even then this change would not add support for case-insensitivity,
so it would be useless for all the MS filesystems (msdos, vfat, ntfs),
which are the majority.
The kernel already provides functions for converting between UTF-8 and
UTF-16, so this seems to be the easiest way to provide full UTF-8 support
for filesystems which internally use UTF-16, similar to how it is
implemented in UDF.
Moreover, all NLS encodings except UTF-8 are single-byte encodings and
map into Plane 0, so they can be represented by the currently used UCS-2
encoding. Therefore conversion to Unicode works correctly, and so do
their case-insensitivity functions (or rather tables).
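This is also visible in the inline helpers in include/linux/nls.h, which
are roughly:
/* Simplified copy for illustration: case folding in NLS is a plain
 * 256-entry byte lookup, which only works for single-byte charsets. */
static inline unsigned char nls_tolower(struct nls_table *t, unsigned char c)
{
        unsigned char nc = t->charset2lower[c];
        return nc ? nc : c;
}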
Adding support for case-insensitivity to the UTF-8 NLS encoding would mean
creating a completely new kernel NLS API (which would support
variable-length encodings) and rewriting all NLS filesystems to use this
new API. Also, all existing NLS encodings would need to be ported to this
new API.
Is it really something which has value? Just because of UTF-8?
To me it looks like a better option would be to remove the UTF-8 NLS
encoding, as it is broken. Some filesystems already do not use the NLS API
for their UTF-8 support (e.g. vfat, udf or the newly prepared exfat), and
the others could be modified/extended/fixed in a similar way.
> > 3) Add support for the iocharset= and codepage= options for the msdos
> > filesystem. It shares a lot of code with the vfat driver.
>
> I guess this is for msdos filesystem maintainers to decide.
Yes!
--
Pali Rohár
[email protected]
On Tue, Jan 07, 2020 at 06:38:42PM +0100, Pali Rohár wrote:
> Adding support for case-insensitivity to the UTF-8 NLS encoding would mean
> creating a completely new kernel NLS API (which would support
> variable-length encodings) and rewriting all NLS filesystems to use this
> new API. Also, all existing NLS encodings would need to be ported to this
> new API.
>
> Is it really something which has value? Just because of UTF-8?
>
> To me it looks like a better option would be to remove the UTF-8 NLS
> encoding, as it is broken. Some filesystems already do not use the NLS API
> for their UTF-8 support (e.g. vfat, udf or the newly prepared exfat), and
> the others could be modified/extended/fixed in a similar way.
You didn't mention ext4 and f2fs, which use the Unicode code in
fs/unicode for their case-folding and normalization support. Ext4 and
f2fs only support UTF-8, so using the NLS API would have added no
value --- and, as you pointed out, the NLS API doesn't support
variable-length encodings anyway. In contrast, the fs/unicode functions
have support for full Unicode case folding and normalization, and
currently have the latest Unicode 12.1 tables (released May 2019).
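For reference, the interface exported by fs/unicode currently looks
roughly like this (see include/linux/unicode.h; exact signatures may
change between releases):
struct unicode_map *utf8_load(const char *version);
void utf8_unload(struct unicode_map *um);
int utf8_casefold(const struct unicode_map *um, const struct qstr *str,
                  unsigned char *dest, size_t dlen);
int utf8_normalize(const struct unicode_map *um, const struct qstr *str,
                   unsigned char *dest, size_t dlen);
int utf8_strncasecmp(const struct unicode_map *um, const struct qstr *s1,
                     const struct qstr *s2);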
What I'd suggest is to create a new API, enhancing the functions in
fs/unicode, to support those file systems that need to deal with
UTF-16 and UTF-32 for their on-disk directory format, and that we
assume that for the most part, userspace *will* be using a UTF-8
encoding for the user<->kernel interface. We can keep the existing
NLS interface and mount options for legacy support, but in my opinion
it's not worth the effort to try to do anything else.
- Ted
On Tuesday 07 January 2020 15:03:01 Theodore Y. Ts'o wrote:
> On Tue, Jan 07, 2020 at 06:38:42PM +0100, Pali Rohár wrote:
> > Adding support for case-insensitivity to the UTF-8 NLS encoding would mean
> > creating a completely new kernel NLS API (which would support
> > variable-length encodings) and rewriting all NLS filesystems to use this
> > new API. Also, all existing NLS encodings would need to be ported to this
> > new API.
> >
> > Is it really something which has value? Just because of UTF-8?
> >
> > To me it looks like a better option would be to remove the UTF-8 NLS
> > encoding, as it is broken. Some filesystems already do not use the NLS API
> > for their UTF-8 support (e.g. vfat, udf or the newly prepared exfat), and
> > the others could be modified/extended/fixed in a similar way.
>
> You didn't mention ext4 and f2fs, which use the Unicode code in
> fs/unicode for their case-folding and normalization support.
Hi! I did not mention them because I took only filesystems which use NLS.
And I also forgot that ext4 now has a unicode flag which basically puts
this filesystem into the group where the FS "enforces" the encoding of the
on-disk structure.
> Ext4 and f2fs only support UTF-8, so using the NLS API would have added
> no value --- and, as you pointed out, the NLS API doesn't support
> variable-length encodings anyway.
Theoretically, using the NLS API could have value if userspace is
configured to work in e.g. the Latin1 encoding and you want to use
ext4/f2fs with the unicode flag in this userspace (the NLS API in this
case could convert 3-byte UTF-8 to Latin1). But it is a very theoretical
and limited use case.
> In contrast, the fs/unicode functions have support for full Unicode
> case folding and normalization, and currently have the latest Unicode
> 12.1 tables (released May 2019).
That is great!
But, for example, even this is not enough for exfat. exfat has the upcase
table stored directly in the on-disk FS, to ensure that every implementation
of an exfat driver has the same rules for converting a character (code
point) to upper case or lower case (case folding). The upcase table is
stored in the FS itself when formatting. And MS decided that no Unicode
normalization would be used for exfat. So this whole fs/unicode code is not
usable for exfat.
> What I'd suggest is to create a new API, enhancing the functions in
> fs/unicode, to support those file systems that need to deal with
> UTF-16 and UTF-32 for their on-disk directory format, and that we
> assume that for the most part, userspace *will* be using a UTF-8
> encoding for the user<->kernel interface.
I do not see a use case for such a new API. The kernel already has these
API functions:
int utf8_to_utf32(const u8 *s, int len, unicode_t *pu);
int utf32_to_utf8(unicode_t u, u8 *s, int maxlen);
int utf8s_to_utf16s(const u8 *s, int len, enum utf16_endian endian, wchar_t *pwcs, int maxlen);
int utf16s_to_utf8s(const wchar_t *pwcs, int len, enum utf16_endian endian, u8 *s, int maxlen);
which are basically enough for all the mentioned filesystems. Maybe in
some cases a utf16 to utf32 function (and vice versa) would be useful, but
that is all. fs/unicode does not bring new value or simplification.
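If such a helper were wanted, it would only be a couple of lines on top of
the existing primitives; a purely hypothetical sketch (no such function
exists in the kernel today):
/* Hypothetical helper, not an existing kernel function: combine a UTF-16
 * surrogate pair (already in host endianness) into one UTF-32 code point. */
static unicode_t utf16_pair_to_utf32(wchar_t high, wchar_t low)
{
        return 0x10000 + ((((unicode_t)high & 0x3ff) << 10) |
                          ((unicode_t)low & 0x3ff));
}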
The mentioned filesystems are in most cases either case-sensitive (UDF),
have their own case folding (exfat), use their own special normalization
incompatible with anything else (hfsplus), or do not enforce any
normalization (cifs, vfat, ntfs, isofs+joliet). So the result is that a
simple UTF-8 to UTF-16LE/BE conversion function is enough and the
filesystem module then implements its own specific rules (special upcase
table, incompatible normalization).
And I do not think that it makes sense to extend fs/unicode for every
stupid functionality which those filesystems have and need to handle. I
see this as unique filesystem-specific code.
> We can keep the existing
> NLS interface and mount options for legacy support, but in my opinion
> it's not worth the effort to try to do anything else.
The NLS interface is a crucial part of VFAT. The reason is that in VFAT
filenames can be stored either as UTF-16LE or as 8-bit in some CP encoding.
The Linux kernel stores new non-7-bit-ASCII filenames as UTF-16LE, but it
has to be able to read 8-bit filenames which were not stored as UTF-16LE,
but rather as 8-bit in a CP encoding. Therefore the mount option codepage=,
which specifies it, needs to be implemented. It says how vfat.ko should
handle the on-disk structure, not which encoding is exported to userspace
(those are two different things).
And the current vfat implementation uses the NLS API for it. The default
codepage= mount option is specified via CONFIG_* (CP437, if you do not
specify one explicitly at mount time). And because FAT is a required part
of UEFI, the Linux kernel will have to support this stuff forever (or at
least as long as it supports UEFI). I think this cannot be marked as
"legacy". It is a pity, but true.
--
Pali Rohár
[email protected]
Pali Rohár <[email protected]> writes:
>> > 3) Add support for the iocharset= and codepage= options for the msdos
>> > filesystem. It shares a lot of code with the vfat driver.
>>
>> I guess this is for msdos filesystem maintainers to decide.
>
> Yes!
Of course, it's OK to add though. If someone really wants to use it, and
someone works on it...
Thanks.
--
OGAWA Hirofumi <[email protected]>
Pali Rohár <[email protected]> writes:
> On Tuesday 07 January 2020 15:03:01 Theodore Y. Ts'o wrote:
>> On Tue, Jan 07, 2020 at 06:38:42PM +0100, Pali Rohár wrote:
>
>> In contrast, the fs/unicode functions have support for full Unicode
>> case folding and normalization, and currently have the latest Unicode
>> 12.1 tables (released May 2019).
>
> That is great!
>
> But, for example, even this is not enough for exfat. exfat has the upcase
> table stored directly in the on-disk FS, to ensure that every implementation
> of an exfat driver has the same rules for converting a character (code
> point) to upper case or lower case (case folding). The upcase table is
> stored in the FS itself when formatting. And MS decided that no Unicode
> normalization would be used for exfat. So this whole fs/unicode code is not
> usable for exfat.
>
>> What I'd suggest is to create a new API, enhancing the functions in
>> fs/unicode, to support those file systems that need to deal with
>> UTF-16 and UTF-32 for their on-disk directory format, and that we
>> assume that for the most part, userspace *will* be using a UTF-8
>> encoding for the user<->kernel interface.
>
> I do not see a use case for such a new API. The kernel already has these
> API functions:
>
> int utf8_to_utf32(const u8 *s, int len, unicode_t *pu);
> int utf32_to_utf8(unicode_t u, u8 *s, int maxlen);
> int utf8s_to_utf16s(const u8 *s, int len, enum utf16_endian endian, wchar_t *pwcs, int maxlen);
> int utf16s_to_utf8s(const wchar_t *pwcs, int len, enum utf16_endian endian, u8 *s, int maxlen);
>
> which are basically enough for all the mentioned filesystems. Maybe in
> some cases a utf16 to utf32 function (and vice versa) would be useful, but
> that is all. fs/unicode does not bring new value or simplification.
>
> The mentioned filesystems are in most cases either case-sensitive (UDF),
> have their own case folding (exfat), use their own special normalization
> incompatible with anything else (hfsplus), or do not enforce any
> normalization (cifs, vfat, ntfs, isofs+joliet). So the result is that a
> simple UTF-8 to UTF-16LE/BE conversion function is enough and the
> filesystem module then implements its own specific rules (special upcase
> table, incompatible normalization).
>
> And I do not think that it makes sense to extend fs/unicode for every
> stupid functionality which those filesystems have and need to handle. I
> see this as unique filesystem-specific code.
>
>> We can keep the existing
>> NLS interface and mount options for legacy support, but in my opinion
>> it's not worth the effort to try to do anything else.
>
> The NLS interface is a crucial part of VFAT. The reason is that in VFAT
> filenames can be stored either as UTF-16LE or as 8-bit in some CP encoding.
> The Linux kernel stores new non-7-bit-ASCII filenames as UTF-16LE, but it
> has to be able to read 8-bit filenames which were not stored as UTF-16LE,
> but rather as 8-bit in a CP encoding. Therefore the mount option codepage=,
> which specifies it, needs to be implemented. It says how vfat.ko should
> handle the on-disk structure, not which encoding is exported to userspace
> (those are two different things).
>
> And the current vfat implementation uses the NLS API for it. The default
> codepage= mount option is specified via CONFIG_* (CP437, if you do not
> specify one explicitly at mount time). And because FAT is a required part
> of UEFI, the Linux kernel will have to support this stuff forever (or at
> least as long as it supports UEFI). I think this cannot be marked as
> "legacy". It is a pity, but true.
FWIW, what I imagined but never tried to implement in the past is iconv
(or something like it, if you know a better API). To support complete
codepages, IIRC, there are differences between OSes (e.g. mac, old
windows, current windows, the Unicode standard).
So the table would be loaded from userspace like firmware data. (Some
code in the kernel for special conversion cases would still be required,
but the tables might be shareable with glibc.)
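Just to illustrate the idea (everything below is hypothetical, no such
interface exists today), the kernel-side format could be as dumb as:
/* Hypothetical format for a codepage table supplied by userspace (e.g.
 * via the firmware loader); nothing like this exists in the kernel. A
 * single-byte codepage needs only 256 mappings; multi-byte codepages
 * (cp932, ...) would need a richer format. */
struct user_codepage_table {
        char name[32];                  /* e.g. "cp1250-windows" */
        u32  byte_to_unicode[256];      /* byte value -> Unicode code point */
};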
But this would be a lot of work.
Thanks.
--
OGAWA Hirofumi <[email protected]>