2022-12-26 14:49:49

by Pali Rohár

[permalink] [raw]
Subject: [RFC PATCH v2 00/18] fs: Remove usage of broken nls_utf8 and drop it

Module nls_utf8 is broken in several ways. It does not support (full)
UTF-8, despite its name. It cannot handle 4-byte UTF-8 sequences and
tolower/toupper table is not implemented at all. Which means that it is
not suitable for usage in case-insensitive filesystems or UTF-16
filesystems (because of e.g. missing UTF-16 surrogate pairs processing).

This is RFC v2 patch series which unify and fix iocharset=utf8 mount
option in all fs drivers and converts all remaining fs drivers to use
utf8s_to_utf16s(), utf16s_to_utf8s(), utf8_to_utf32(), utf32_to_utf8
functions for implementing UTF-8 support instead of nls_utf8.

So at the end it allows to completely drop this broken nls_utf8 module.

For more details look at email thread where was discussed fs unification:
https://lore.kernel.org/linux-fsdevel/20200102211855.gg62r7jshp742d6i@pali/t/#u

This patch series is mostly untested and presented as RFC. Please let me
know what do you think about it and if is the correct way how to fix
broken UTF-8 support in fs drivers. As explained in above email thread I
think it does not make sense to try fixing whole NLS framework and it is
easier to just drop this nls_utf8 module.

Note: this patch series does not address UTF-8 fat case-sensitivity issue:
https://lore.kernel.org/linux-fsdevel/20200119221455.bac7dc55g56q2l4r@pali/

Changes since RFC v1:
* Dropped already merged udf and isofs patches
* Addressed review comments:
- updated documentation
- usage of seq_puts
- some code moved to local variables
- usage of true/false instead of 1/0
- rebased on top of master branch

Link to RFC v1:
https://lore.kernel.org/linux-fsdevel/[email protected]/

Pali Rohár (18):
fat: Fix iocharset=utf8 mount option
hfsplus: Add iocharset= mount option as alias for nls=
ntfs: Undeprecate iocharset= mount option
ntfs: Fix error processing when load_nls() fails
befs: Fix printing iocharset= mount option
befs: Rename enum value Opt_charset to Opt_iocharset to match mount
option
befs: Fix error processing when load_nls() fails
befs: Allow to use native UTF-8 mode
hfs: Explicitly set hsb->nls_disk when hsb->nls_io is set
hfs: Do not use broken utf8 NLS table for iocharset=utf8 mount option
hfsplus: Do not use broken utf8 NLS table for iocharset=utf8 mount
option
jfs: Remove custom iso8859-1 implementation
jfs: Fix buffer overflow in jfs_strfromUCS_le() function
jfs: Do not use broken utf8 NLS table for iocharset=utf8 mount option
ntfs: Do not use broken utf8 NLS table for iocharset=utf8 mount option
cifs: Do not use broken utf8 NLS table for iocharset=utf8 mount option
cifs: Remove usage of load_nls_default() calls
nls: Drop broken nls_utf8 module

Documentation/filesystems/hfsplus.rst | 3 +
Documentation/filesystems/ntfs.rst | 5 +-
Documentation/filesystems/vfat.rst | 13 +--
fs/befs/linuxvfs.c | 24 +++--
fs/cifs/cifs_unicode.c | 128 +++++++++++++++++---------
fs/cifs/cifs_unicode.h | 2 +-
fs/cifs/cifsfs.c | 2 +
fs/cifs/cifssmb.c | 8 +-
fs/cifs/connect.c | 8 +-
fs/cifs/dfs_cache.c | 24 ++---
fs/cifs/dir.c | 28 ++++--
fs/cifs/smb2pdu.c | 18 +---
fs/cifs/winucase.c | 14 ++-
fs/fat/Kconfig | 19 +---
fs/fat/dir.c | 17 ++--
fs/fat/fat.h | 22 +++++
fs/fat/inode.c | 28 +++---
fs/fat/namei_vfat.c | 26 ++++--
fs/hfs/super.c | 62 +++++++++++--
fs/hfs/trans.c | 62 +++++++------
fs/hfsplus/dir.c | 7 +-
fs/hfsplus/options.c | 39 +++++---
fs/hfsplus/super.c | 7 +-
fs/hfsplus/unicode.c | 31 ++++++-
fs/hfsplus/xattr.c | 20 ++--
fs/hfsplus/xattr_security.c | 6 +-
fs/jfs/jfs_dtree.c | 13 ++-
fs/jfs/jfs_unicode.c | 35 +++----
fs/jfs/jfs_unicode.h | 2 +-
fs/jfs/super.c | 29 ++++--
fs/nls/Kconfig | 9 --
fs/nls/Makefile | 1 -
fs/nls/nls_utf8.c | 67 --------------
fs/ntfs/dir.c | 6 +-
fs/ntfs/inode.c | 5 +-
fs/ntfs/super.c | 60 ++++++------
fs/ntfs/unistr.c | 29 +++++-
37 files changed, 493 insertions(+), 386 deletions(-)
delete mode 100644 fs/nls/nls_utf8.c

--
2.20.1


2022-12-26 14:49:55

by Pali Rohár

[permalink] [raw]
Subject: [RFC PATCH v2 01/18] fat: Fix iocharset=utf8 mount option

Currently iocharset=utf8 mount option is broken and error is printed to
dmesg when it is used. To use UTF-8 as iocharset, it is required to use
utf8=1 mount option.

Fix iocharset=utf8 mount option to use be equivalent to the utf8=1 mount
option and remove printing error from dmesg.

FAT by definition is case-insensitive but current Linux implementation is
case-sensitive for non-ASCII characters when UTF-8 is used. This patch does
not change this UTF-8 behavior. Only more comments in fat_utf8_strnicmp()
function are added about it.

After this patch iocharset=utf8 starts working, so there is no need to have
separate config option FAT_DEFAULT_UTF8 as FAT_DEFAULT_IOCHARSET for utf8
also starts working. So remove redundant config option FAT_DEFAULT_UTF8.

Signed-off-by: Pali Rohár <[email protected]>
---
Documentation/filesystems/vfat.rst | 13 ++-----------
fs/fat/Kconfig | 19 +------------------
fs/fat/dir.c | 17 +++++++----------
fs/fat/fat.h | 22 ++++++++++++++++++++++
fs/fat/inode.c | 28 +++++++++++-----------------
fs/fat/namei_vfat.c | 26 +++++++++++++++++++-------
6 files changed, 62 insertions(+), 63 deletions(-)

diff --git a/Documentation/filesystems/vfat.rst b/Documentation/filesystems/vfat.rst
index 760a4d83fdf9..28cacba7bb9a 100644
--- a/Documentation/filesystems/vfat.rst
+++ b/Documentation/filesystems/vfat.rst
@@ -65,19 +65,10 @@ VFAT MOUNT OPTIONS
in Unicode format, but Unix for the most part doesn't
know how to deal with Unicode.
By default, FAT_DEFAULT_IOCHARSET setting is used.
-
- There is also an option of doing UTF-8 translations
- with the utf8 option.
-
-.. note:: ``iocharset=utf8`` is not recommended. If unsure, you should consider
- the utf8 option instead.
+ **utf8** is supported too and recommended to use.

**utf8=<bool>**
- UTF-8 is the filesystem safe version of Unicode that
- is used by the console. It can be enabled or disabled
- for the filesystem with this option.
- If 'uni_xlate' gets set, UTF-8 gets disabled.
- By default, FAT_DEFAULT_UTF8 setting is used.
+ Alias for ``iocharset=utf8`` mount option.

**uni_xlate=<bool>**
Translate unhandled Unicode characters to special
diff --git a/fs/fat/Kconfig b/fs/fat/Kconfig
index 238cc55f84c4..e98aaa3bb55b 100644
--- a/fs/fat/Kconfig
+++ b/fs/fat/Kconfig
@@ -93,29 +93,12 @@ config FAT_DEFAULT_IOCHARSET
like FAT to use. It should probably match the character set
that most of your FAT filesystems use, and can be overridden
with the "iocharset" mount option for FAT filesystems.
- Note that "utf8" is not recommended for FAT filesystems.
- If unsure, you shouldn't set "utf8" here - select the next option
- instead if you would like to use UTF-8 encoded file names by default.
+ "utf8" is supported too and recommended to use.
See <file:Documentation/filesystems/vfat.rst> for more information.

Enable any character sets you need in File Systems/Native Language
Support.

-config FAT_DEFAULT_UTF8
- bool "Enable FAT UTF-8 option by default"
- depends on VFAT_FS
- default n
- help
- Set this if you would like to have "utf8" mount option set
- by default when mounting FAT filesystems.
-
- Even if you say Y here can always disable UTF-8 for
- particular mount by adding "utf8=0" to mount options.
-
- Say Y if you use UTF-8 encoding for file names, N otherwise.
-
- See <file:Documentation/filesystems/vfat.rst> for more information.
-
config FAT_KUNIT_TEST
tristate "Unit Tests for FAT filesystems" if !KUNIT_ALL_TESTS
depends on KUNIT && FAT_FS
diff --git a/fs/fat/dir.c b/fs/fat/dir.c
index 00235b8a1823..15166b0c4a6e 100644
--- a/fs/fat/dir.c
+++ b/fs/fat/dir.c
@@ -33,11 +33,6 @@
#define FAT_MAX_UNI_CHARS ((MSDOS_SLOTS - 1) * 13 + 1)
#define FAT_MAX_UNI_SIZE (FAT_MAX_UNI_CHARS * sizeof(wchar_t))

-static inline unsigned char fat_tolower(unsigned char c)
-{
- return ((c >= 'A') && (c <= 'Z')) ? c+32 : c;
-}
-
static inline loff_t fat_make_i_pos(struct super_block *sb,
struct buffer_head *bh,
struct msdos_dir_entry *de)
@@ -258,10 +253,12 @@ static inline int fat_name_match(struct msdos_sb_info *sbi,
if (a_len != b_len)
return 0;

- if (sbi->options.name_check != 's')
- return !nls_strnicmp(sbi->nls_io, a, b, a_len);
- else
+ if (sbi->options.name_check == 's')
return !memcmp(a, b, a_len);
+ else if (sbi->options.utf8)
+ return !fat_utf8_strnicmp(a, b, a_len);
+ else
+ return !nls_strnicmp(sbi->nls_io, a, b, a_len);
}

enum { PARSE_INVALID = 1, PARSE_NOT_LONGNAME, PARSE_EOF, };
@@ -384,7 +381,7 @@ static int fat_parse_short(struct super_block *sb,
de->lcase & CASE_LOWER_BASE);
if (chl <= 1) {
if (!isvfat)
- ptname[i] = nocase ? c : fat_tolower(c);
+ ptname[i] = nocase ? c : fat_ascii_to_lower(c);
i++;
if (c != ' ') {
name_len = i;
@@ -421,7 +418,7 @@ static int fat_parse_short(struct super_block *sb,
if (chl <= 1) {
k++;
if (!isvfat)
- ptname[i] = nocase ? c : fat_tolower(c);
+ ptname[i] = nocase ? c : fat_ascii_to_lower(c);
i++;
if (c != ' ') {
name_len = i;
diff --git a/fs/fat/fat.h b/fs/fat/fat.h
index a415c02ede39..6b0a6041c8d7 100644
--- a/fs/fat/fat.h
+++ b/fs/fat/fat.h
@@ -311,6 +311,28 @@ static inline void fatwchar_to16(__u8 *dst, const wchar_t *src, size_t len)
#endif
}

+static inline unsigned char fat_ascii_to_lower(unsigned char c)
+{
+ return ((c >= 'A') && (c <= 'Z')) ? c+32 : c;
+}
+
+static inline int fat_utf8_strnicmp(const unsigned char *a,
+ const unsigned char *b,
+ int len)
+{
+ int i;
+
+ /*
+ * FIXME: UTF-8 doesn't provide FAT semantics
+ * Case-insensitive support is only for 7-bit ASCII characters
+ */
+ for (i = 0; i < len; i++) {
+ if (fat_ascii_to_lower(a[i]) != fat_ascii_to_lower(b[i]))
+ return 1;
+ }
+ return 0;
+}
+
/* fat/cache.c */
extern void fat_cache_inval_inode(struct inode *inode);
extern int fat_get_cluster(struct inode *inode, int cluster,
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index d99b8549ec8f..818c9c37b419 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -956,7 +956,9 @@ static int fat_show_options(struct seq_file *m, struct dentry *root)
/* strip "cp" prefix from displayed option */
seq_printf(m, ",codepage=%s", &sbi->nls_disk->charset[2]);
if (isvfat) {
- if (sbi->nls_io)
+ if (opts->utf8)
+ seq_puts(m, ",iocharset=utf8");
+ else if (sbi->nls_io)
seq_printf(m, ",iocharset=%s", sbi->nls_io->charset);

switch (opts->shortname) {
@@ -993,8 +995,6 @@ static int fat_show_options(struct seq_file *m, struct dentry *root)
if (opts->nocase)
seq_puts(m, ",nocase");
} else {
- if (opts->utf8)
- seq_puts(m, ",utf8");
if (opts->unicode_xlate)
seq_puts(m, ",uni_xlate");
if (!opts->numtail)
@@ -1156,8 +1156,6 @@ static int parse_options(struct super_block *sb, char *options, int is_vfat,
opts->errors = FAT_ERRORS_RO;
*debug = 0;

- opts->utf8 = IS_ENABLED(CONFIG_FAT_DEFAULT_UTF8) && is_vfat;
-
if (!options)
goto out;

@@ -1318,10 +1316,14 @@ static int parse_options(struct super_block *sb, char *options, int is_vfat,
| VFAT_SFN_CREATE_WIN95;
break;
case Opt_utf8_no: /* 0 or no or false */
- opts->utf8 = 0;
+ fat_reset_iocharset(opts);
break;
case Opt_utf8_yes: /* empty or 1 or yes or true */
- opts->utf8 = 1;
+ fat_reset_iocharset(opts);
+ iocharset = kstrdup("utf8", GFP_KERNEL);
+ if (!iocharset)
+ return -ENOMEM;
+ opts->iocharset = iocharset;
break;
case Opt_uni_xl_no: /* 0 or no or false */
opts->unicode_xlate = 0;
@@ -1359,18 +1361,11 @@ static int parse_options(struct super_block *sb, char *options, int is_vfat,
}

out:
- /* UTF-8 doesn't provide FAT semantics */
- if (!strcmp(opts->iocharset, "utf8")) {
- fat_msg(sb, KERN_WARNING, "utf8 is not a recommended IO charset"
- " for FAT filesystems, filesystem will be "
- "case sensitive!");
- }
+ opts->utf8 = !strcmp(opts->iocharset, "utf8") && is_vfat;

/* If user doesn't specify allow_utime, it's initialized from dmask. */
if (opts->allow_utime == (unsigned short)-1)
opts->allow_utime = ~opts->fs_dmask & (S_IWGRP | S_IWOTH);
- if (opts->unicode_xlate)
- opts->utf8 = 0;
if (opts->nfs == FAT_NFS_NOSTALE_RO) {
sb->s_flags |= SB_RDONLY;
sb->s_export_op = &fat_export_ops_nostale;
@@ -1828,8 +1823,7 @@ int fat_fill_super(struct super_block *sb, void *data, int silent, int isvfat,
goto out_fail;
}

- /* FIXME: utf8 is using iocharset for upper/lower conversion */
- if (sbi->options.isvfat) {
+ if (sbi->options.isvfat && !sbi->options.utf8) {
sbi->nls_io = load_nls(sbi->options.iocharset);
if (!sbi->nls_io) {
fat_msg(sb, KERN_ERR, "IO charset %s not found",
diff --git a/fs/fat/namei_vfat.c b/fs/fat/namei_vfat.c
index 21620054e1c4..5c585e79d863 100644
--- a/fs/fat/namei_vfat.c
+++ b/fs/fat/namei_vfat.c
@@ -134,6 +134,7 @@ static int vfat_hash(const struct dentry *dentry, struct qstr *qstr)
static int vfat_hashi(const struct dentry *dentry, struct qstr *qstr)
{
struct nls_table *t = MSDOS_SB(dentry->d_sb)->nls_io;
+ int utf8 = MSDOS_SB(dentry->d_sb)->options.utf8;
const unsigned char *name;
unsigned int len;
unsigned long hash;
@@ -142,8 +143,17 @@ static int vfat_hashi(const struct dentry *dentry, struct qstr *qstr)
len = vfat_striptail_len(qstr);

hash = init_name_hash(dentry);
- while (len--)
- hash = partial_name_hash(nls_tolower(t, *name++), hash);
+ if (utf8) {
+ /*
+ * FIXME: UTF-8 doesn't provide FAT semantics
+ * Case-insensitive support is only for 7-bit ASCII characters
+ */
+ while (len--)
+ hash = partial_name_hash(fat_ascii_to_lower(*name++), hash);
+ } else {
+ while (len--)
+ hash = partial_name_hash(nls_tolower(t, *name++), hash);
+ }
qstr->hash = end_name_hash(hash);

return 0;
@@ -156,16 +166,18 @@ static int vfat_cmpi(const struct dentry *dentry,
unsigned int len, const char *str, const struct qstr *name)
{
struct nls_table *t = MSDOS_SB(dentry->d_sb)->nls_io;
+ int utf8 = MSDOS_SB(dentry->d_sb)->options.utf8;
unsigned int alen, blen;

/* A filename cannot end in '.' or we treat it like it has none */
alen = vfat_striptail_len(name);
blen = __vfat_striptail_len(len, str);
- if (alen == blen) {
- if (nls_strnicmp(t, name->name, str, alen) == 0)
- return 0;
- }
- return 1;
+ if (alen != blen)
+ return 1;
+ else if (utf8)
+ return fat_utf8_strnicmp(name->name, str, alen);
+ else
+ return nls_strnicmp(t, name->name, str, alen);
}

/*
--
2.20.1

2022-12-26 14:50:03

by Pali Rohár

[permalink] [raw]
Subject: [RFC PATCH v2 14/18] jfs: Do not use broken utf8 NLS table for iocharset=utf8 mount option

NLS table for utf8 is broken and cannot be fixed.

So instead of broken utf8 nls functions char2uni() and uni2char() use
functions utf8s_to_utf16s() and utf16s_to_utf8s() which implements correct
conversion between UTF-16 and UTF-8.

These functions implements also correct processing of UTF-16 surrogate
pairs and therefore after this change jfs driver would be able to correctly
handle also file names with 4-byte UTF-8 sequences.

When iochatset=utf8 is used then set sbi->nls_tab to NULL and use it for
distinguish between the fact if NLS table or native UTF-8 functions should
be used.

Signed-off-by: Pali Rohár <[email protected]>
---
fs/jfs/jfs_unicode.c | 17 +++++++++++++++--
fs/jfs/super.c | 24 +++++++++++++++---------
2 files changed, 30 insertions(+), 11 deletions(-)

diff --git a/fs/jfs/jfs_unicode.c b/fs/jfs/jfs_unicode.c
index 2db923872bf1..0b0b80063a98 100644
--- a/fs/jfs/jfs_unicode.c
+++ b/fs/jfs/jfs_unicode.c
@@ -46,6 +46,9 @@ int jfs_strfromUCS_le(char *to, int maxlen, const __le16 * from,
}
}
}
+ } else {
+ outlen = utf16s_to_utf8s((const wchar_t *)from, len,
+ UTF16_LITTLE_ENDIAN, to, maxlen-1);
}
to[outlen] = 0;
return outlen;
@@ -61,6 +64,7 @@ static int jfs_strtoUCS(wchar_t * to, const unsigned char *from, int len,
struct nls_table *codepage)
{
int charlen;
+ int outlen;
int i;

if (codepage) {
@@ -75,10 +79,19 @@ static int jfs_strtoUCS(wchar_t * to, const unsigned char *from, int len,
return charlen;
}
}
+ outlen = i;
+ } else {
+ outlen = utf8s_to_utf16s(from, len, UTF16_LITTLE_ENDIAN,
+ to, len);
+ if (outlen < 1) {
+ jfs_err("jfs_strtoUCS: utf8s_to_utf16s returned %d.",
+ outlen);
+ return outlen;
+ }
}

- to[i] = 0;
- return i;
+ to[outlen] = 0;
+ return outlen;
}

/*
diff --git a/fs/jfs/super.c b/fs/jfs/super.c
index a2bb3d5d3f69..f26460147b62 100644
--- a/fs/jfs/super.c
+++ b/fs/jfs/super.c
@@ -261,16 +261,20 @@ static int parse_options(char *options, struct super_block *sb, s64 *newLVSize,
/* Don't do anything ;-) */
break;
case Opt_iocharset:
- if (nls_map && nls_map != (void *) -1)
+ if (nls_map && nls_map != (void *) -1) {
unload_nls(nls_map);
- /* compatibility alias none means ISO-8859-1 */
- if (strcmp(args[0].from, "none") == 0)
- nls_map = load_nls("iso8859-1");
- else
- nls_map = load_nls(args[0].from);
- if (!nls_map) {
- pr_err("JFS: charset not found\n");
- goto cleanup;
+ nls_map = NULL;
+ }
+ if (strcmp(args[0].from, "utf8") != 0) {
+ /* compatibility alias none means ISO-8859-1 */
+ if (strcmp(args[0].from, "none") == 0)
+ nls_map = load_nls("iso8859-1");
+ else
+ nls_map = load_nls(args[0].from);
+ if (!nls_map) {
+ pr_err("JFS: charset not found\n");
+ goto cleanup;
+ }
}
break;
case Opt_resize:
@@ -713,6 +717,8 @@ static int jfs_show_options(struct seq_file *seq, struct dentry *root)
seq_printf(seq, ",discard=%u", sbi->minblks_trim);
if (sbi->nls_tab)
seq_printf(seq, ",iocharset=%s", sbi->nls_tab->charset);
+ else
+ seq_puts(seq, ",iocharset=utf8");
if (sbi->flag & JFS_ERR_CONTINUE)
seq_printf(seq, ",errors=continue");
if (sbi->flag & JFS_ERR_PANIC)
--
2.20.1

2022-12-26 14:51:36

by Pali Rohár

[permalink] [raw]
Subject: [RFC PATCH v2 16/18] cifs: Do not use broken utf8 NLS table for iocharset=utf8 mount option

NLS table for utf8 is broken and cannot be fixed.

So instead of broken utf8 nls functions char2uni() and uni2char() use
functions utf8s_to_utf16s() and utf16s_to_utf8s() which implements correct
conversion between UTF-16 and UTF-8.

When iochatset=utf8 is used then set ctx->iocharset to NULL and use it for
distinguish between the fact if NLS table or native UTF-8 functions should
be used.

Signed-off-by: Pali Rohár <[email protected]>
---
fs/cifs/cifs_unicode.c | 128 +++++++++++++++++++++++++++--------------
fs/cifs/cifs_unicode.h | 2 +-
fs/cifs/cifsfs.c | 2 +
fs/cifs/connect.c | 8 ++-
fs/cifs/dir.c | 28 +++++++--
fs/cifs/winucase.c | 14 +++--
6 files changed, 124 insertions(+), 58 deletions(-)

diff --git a/fs/cifs/cifs_unicode.c b/fs/cifs/cifs_unicode.c
index e7582dd79179..94b861e666e3 100644
--- a/fs/cifs/cifs_unicode.c
+++ b/fs/cifs/cifs_unicode.c
@@ -130,20 +130,17 @@ cifs_mapchar(char *target, const __u16 *from, const struct nls_table *cp,
convert_sfu_char(src_char, target))
return len;

- /* if character not one of seven in special remap set */
- len = cp->uni2char(src_char, target, NLS_MAX_CHARSET_SIZE);
- if (len <= 0)
- goto surrogate_pair;
-
- return len;
+ if (cp) {
+ /* if character not one of seven in special remap set */
+ len = cp->uni2char(src_char, target, NLS_MAX_CHARSET_SIZE);
+ if (len <= 0)
+ goto unknown;
+ } else {
+ len = utf16s_to_utf8s(from, 3, UTF16_LITTLE_ENDIAN, target, 6);
+ if (len <= 0)
+ goto unknown;
+ }

-surrogate_pair:
- /* convert SURROGATE_PAIR and IVS */
- if (strcmp(cp->charset, "utf8"))
- goto unknown;
- len = utf16s_to_utf8s(from, 3, UTF16_LITTLE_ENDIAN, target, 6);
- if (len <= 0)
- goto unknown;
return len;

unknown:
@@ -239,6 +236,37 @@ cifs_from_utf16(char *to, const __le16 *from, int tolen, int fromlen,
return outlen;
}

+static int cifs_utf8s_to_utf16s(const char *s, int inlen, __le16 *pwcs)
+{
+ __le16 *op;
+ int size;
+ unicode_t u;
+
+ op = pwcs;
+ while (inlen > 0 && *s) {
+ if (*s & 0x80) {
+ size = utf8_to_utf32(s, inlen, &u);
+ if (size <= 0) {
+ u = 0x003f; /* A question mark */
+ size = 1;
+ }
+ s += size;
+ inlen -= size;
+ if (u >= 0x10000) {
+ u -= 0x10000;
+ *op++ = __cpu_to_le16(0xd800 | ((u >> 10) & 0x03ff));
+ *op++ = __cpu_to_le16(0xdc00 | (u & 0x03ff));
+ } else {
+ *op++ = __cpu_to_le16(u);
+ }
+ } else {
+ *op++ = __cpu_to_le16(*s++);
+ inlen--;
+ }
+ }
+ return op - pwcs;
+}
+
/*
* NAME: cifs_strtoUTF16()
*
@@ -254,24 +282,14 @@ cifs_strtoUTF16(__le16 *to, const char *from, int len,
wchar_t wchar_to; /* needed to quiet sparse */

/* special case for utf8 to handle no plane0 chars */
- if (!strcmp(codepage->charset, "utf8")) {
+ if (!codepage) {
/*
* convert utf8 -> utf16, we assume we have enough space
* as caller should have assumed conversion does not overflow
- * in destination len is length in wchar_t units (16bits)
- */
- i = utf8s_to_utf16s(from, len, UTF16_LITTLE_ENDIAN,
- (wchar_t *) to, len);
-
- /* if success terminate and exit */
- if (i >= 0)
- goto success;
- /*
- * if fails fall back to UCS encoding as this
- * function should not return negative values
- * currently can fail only if source contains
- * invalid encoded characters
+ * in destination len is length in __le16 units
*/
+ i = cifs_utf8s_to_utf16s(from, len, to);
+ goto success;
}

for (i = 0; len && *from; i++, from += charlen, len -= charlen) {
@@ -502,25 +520,29 @@ cifsConvertToUTF16(__le16 *target, const char *source, int srclen,
* as they use backslash as separator.
*/
if (dst_char == 0) {
- charlen = cp->char2uni(source + i, srclen - i, &tmp);
- dst_char = cpu_to_le16(tmp);
-
- /*
- * if no match, use question mark, which at least in
- * some cases serves as wild card
- */
- if (charlen > 0)
- goto ctoUTF16;
-
- /* convert SURROGATE_PAIR */
- if (strcmp(cp->charset, "utf8") || !wchar_to)
- goto unknown;
- if (*(source + i) & 0x80) {
- charlen = utf8_to_utf32(source + i, 6, &u);
- if (charlen < 0)
+ if (cp) {
+ charlen = cp->char2uni(source + i, srclen - i, &tmp);
+ dst_char = cpu_to_le16(tmp);
+
+ /*
+ * if no match, use question mark, which at least in
+ * some cases serves as wild card
+ */
+ if (charlen > 0)
+ goto ctoUTF16;
+ else
goto unknown;
- } else
+ }
+
+ /* UTF-8 to UTF-16 conversion */
+
+ if (!wchar_to)
goto unknown;
+
+ charlen = utf8_to_utf32(source + i, 6, &u);
+ if (charlen < 0)
+ goto unknown;
+
ret = utf8s_to_utf16s(source + i, charlen,
UTF16_LITTLE_ENDIAN,
wchar_to, 6);
@@ -589,8 +611,26 @@ cifs_local_to_utf16_bytes(const char *from, int len,
{
int charlen;
int i;
+ int outlen;
+ unicode_t u_to;
wchar_t wchar_to;

+ if (!codepage) {
+ outlen = 0;
+ for (i = 0; len && *from; i++, from += charlen, len -= charlen) {
+ charlen = utf8_to_utf32(from, len, &u_to);
+ /* Failed conversion defaults to a question mark */
+ if (charlen < 1) {
+ charlen = 1;
+ outlen += 2;
+ } else if (u_to <= 0xFFFF)
+ outlen += 2;
+ else
+ outlen += 4;
+ }
+ return outlen;
+ }
+
for (i = 0; len && *from; i++, from += charlen, len -= charlen) {
charlen = codepage->char2uni(from, len, &wchar_to);
/* Failed conversion defaults to a question mark */
diff --git a/fs/cifs/cifs_unicode.h b/fs/cifs/cifs_unicode.h
index 80b3d845419f..b9a3290faaf7 100644
--- a/fs/cifs/cifs_unicode.h
+++ b/fs/cifs/cifs_unicode.h
@@ -106,7 +106,7 @@ extern __le16 *cifs_strndup_to_utf16(const char *src, const int maxlen,
int remap);
#endif

-wchar_t cifs_toupper(wchar_t in);
+unicode_t cifs_toupper(unicode_t in);

/*
* UniStrcat: Concatenate the second string to the first
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 10e00c624922..1537bc8bb698 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -591,6 +591,8 @@ cifs_show_options(struct seq_file *s, struct dentry *root)
cifs_sb->ctx->dir_mode);
if (cifs_sb->ctx->iocharset)
seq_printf(s, ",iocharset=%s", cifs_sb->ctx->iocharset);
+ else
+ seq_puts(s, ",iocharset=utf8");
if (tcon->seal)
seq_puts(s, ",seal");
else if (tcon->ses->server->ignore_signature)
diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index d371259d6808..fb841a7baef6 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -2676,7 +2676,11 @@ compare_mount_options(struct super_block *sb, struct cifs_mnt_data *mnt_data)
old->ctx->dir_mode != new->ctx->dir_mode)
return 0;

- if (strcmp(old->local_nls->charset, new->local_nls->charset))
+ if (old->local_nls && !new->local_nls)
+ return 0;
+ if (!old->local_nls && new->local_nls)
+ return 0;
+ if (old->local_nls && new->local_nls && strcmp(old->local_nls->charset, new->local_nls->charset))
return 0;

if (old->ctx->acregmax != new->ctx->acregmax)
@@ -3162,7 +3166,7 @@ int cifs_setup_cifs_sb(struct cifs_sb_info *cifs_sb)
if (ctx->iocharset == NULL) {
/* load_nls_default cannot return null */
cifs_sb->local_nls = load_nls_default();
- } else {
+ } else if (strcmp(ctx->iocharset, "utf8") != 0) {
cifs_sb->local_nls = load_nls(ctx->iocharset);
if (cifs_sb->local_nls == NULL) {
cifs_dbg(VFS, "CIFS mount error: iocharset %s not found\n",
diff --git a/fs/cifs/dir.c b/fs/cifs/dir.c
index ad4208bf1e32..83deba65e188 100644
--- a/fs/cifs/dir.c
+++ b/fs/cifs/dir.c
@@ -804,16 +804,22 @@ static int cifs_ci_hash(const struct dentry *dentry, struct qstr *q)
{
struct nls_table *codepage = CIFS_SB(dentry->d_sb)->local_nls;
unsigned long hash;
+ unicode_t u;
wchar_t c;
int i, charlen;

hash = init_name_hash(dentry);
for (i = 0; i < q->len; i += charlen) {
- charlen = codepage->char2uni(&q->name[i], q->len - i, &c);
+ if (codepage) {
+ charlen = codepage->char2uni(&q->name[i], q->len - i, &c);
+ if (likely(charlen > 0))
+ u = c;
+ } else
+ charlen = utf8_to_utf32(&q->name[i], q->len - i, &u);
/* error out if we can't convert the character */
if (unlikely(charlen < 0))
return charlen;
- hash = partial_name_hash(cifs_toupper(c), hash);
+ hash = partial_name_hash(cifs_toupper(u), hash);
}
q->hash = end_name_hash(hash);

@@ -824,6 +830,7 @@ static int cifs_ci_compare(const struct dentry *dentry,
unsigned int len, const char *str, const struct qstr *name)
{
struct nls_table *codepage = CIFS_SB(dentry->d_sb)->local_nls;
+ unicode_t u1, u2;
wchar_t c1, c2;
int i, l1, l2;

@@ -837,9 +844,18 @@ static int cifs_ci_compare(const struct dentry *dentry,
return 1;

for (i = 0; i < len; i += l1) {
- /* Convert characters in both strings to UTF-16. */
- l1 = codepage->char2uni(&str[i], len - i, &c1);
- l2 = codepage->char2uni(&name->name[i], name->len - i, &c2);
+ /* Convert characters in both strings to UTF-32. */
+ if (codepage) {
+ l1 = codepage->char2uni(&str[i], len - i, &c1);
+ l2 = codepage->char2uni(&name->name[i], name->len - i, &c2);
+ if (likely(l1 > 0))
+ u1 = c1;
+ if (likely(l2 > 0))
+ u2 = c2;
+ } else {
+ l1 = utf8_to_utf32(&str[i], len - i, &u1);
+ l2 = utf8_to_utf32(&name->name[i], name->len - i, &u2);
+ }

/*
* If we can't convert either character, just declare it to
@@ -860,7 +876,7 @@ static int cifs_ci_compare(const struct dentry *dentry,
return 1;

/* Now compare uppercase versions of these characters */
- if (cifs_toupper(c1) != cifs_toupper(c2))
+ if (cifs_toupper(u1) != cifs_toupper(u2))
return 1;
}

diff --git a/fs/cifs/winucase.c b/fs/cifs/winucase.c
index 2f075b5b50df..b3647b35a7e1 100644
--- a/fs/cifs/winucase.c
+++ b/fs/cifs/winucase.c
@@ -17,7 +17,7 @@

#include <linux/nls.h>

-wchar_t cifs_toupper(wchar_t in); /* quiet sparse */
+unicode_t cifs_toupper(unicode_t in); /* quiet sparse */

static const wchar_t t2_00[256] = {
0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
@@ -615,20 +615,24 @@ static const wchar_t *const toplevel[256] = {
};

/**
- * cifs_toupper - convert a wchar_t from lower to uppercase
+ * cifs_toupper - convert a unicode_t from lower to uppercase
* @in: character to convert from lower to uppercase
*
- * This function consults the static tables above to convert a wchar_t from
+ * This function consults the static tables above to convert a unicode_t from
* lower to uppercase. In the event that there is no mapping, the original
* "in" character is returned.
*/
-wchar_t
-cifs_toupper(wchar_t in)
+unicode_t
+cifs_toupper(unicode_t in)
{
unsigned char idx;
const wchar_t *tbl;
wchar_t out;

+ /* cifs_toupper table has only defines for plane-0 */
+ if (in > 0xffff)
+ return in;
+
/* grab upper byte */
idx = (in & 0xff00) >> 8;

--
2.20.1

2023-01-10 09:34:29

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: [RFC PATCH v2 01/18] fat: Fix iocharset=utf8 mount option

Pali Roh?r <[email protected]> writes:

> Currently iocharset=utf8 mount option is broken and error is printed to
> dmesg when it is used. To use UTF-8 as iocharset, it is required to use
> utf8=1 mount option.
>
> Fix iocharset=utf8 mount option to use be equivalent to the utf8=1 mount
> option and remove printing error from dmesg.

[...]

> -
> - There is also an option of doing UTF-8 translations
> - with the utf8 option.
> -
> -.. note:: ``iocharset=utf8`` is not recommended. If unsure, you should consider
> - the utf8 option instead.
> + **utf8** is supported too and recommended to use.
>
> **utf8=<bool>**
> - UTF-8 is the filesystem safe version of Unicode that
> - is used by the console. It can be enabled or disabled
> - for the filesystem with this option.
> - If 'uni_xlate' gets set, UTF-8 gets disabled.
> - By default, FAT_DEFAULT_UTF8 setting is used.
> + Alias for ``iocharset=utf8`` mount option.
>
> **uni_xlate=<bool>**
> Translate unhandled Unicode characters to special
> diff --git a/fs/fat/Kconfig b/fs/fat/Kconfig
> index 238cc55f84c4..e98aaa3bb55b 100644
> --- a/fs/fat/Kconfig
> +++ b/fs/fat/Kconfig
> @@ -93,29 +93,12 @@ config FAT_DEFAULT_IOCHARSET
> like FAT to use. It should probably match the character set
> that most of your FAT filesystems use, and can be overridden
> with the "iocharset" mount option for FAT filesystems.
> - Note that "utf8" is not recommended for FAT filesystems.
> - If unsure, you shouldn't set "utf8" here - select the next option
> - instead if you would like to use UTF-8 encoded file names by default.
> + "utf8" is supported too and recommended to use.

This patch fixes the issue of utf-8 partially only. I think we can't
still recommend only partially working one.

[...]

> - opts->utf8 = IS_ENABLED(CONFIG_FAT_DEFAULT_UTF8) && is_vfat;
> -
> if (!options)
> goto out;
>
> @@ -1318,10 +1316,14 @@ static int parse_options(struct super_block *sb, char *options, int is_vfat,
> | VFAT_SFN_CREATE_WIN95;
> break;
> case Opt_utf8_no: /* 0 or no or false */
> - opts->utf8 = 0;
> + fat_reset_iocharset(opts);

This changes the behavior of "iocharset=iso8859-1,utf8=no" for
example. Do we need this user visible change here?

> break;
> case Opt_utf8_yes: /* empty or 1 or yes or true */
> - opts->utf8 = 1;
> + fat_reset_iocharset(opts);
> + iocharset = kstrdup("utf8", GFP_KERNEL);
> + if (!iocharset)
> + return -ENOMEM;
> + opts->iocharset = iocharset;
> break;
> case Opt_uni_xl_no: /* 0 or no or false */
> opts->unicode_xlate = 0;
> @@ -1359,18 +1361,11 @@ static int parse_options(struct super_block *sb, char *options, int is_vfat,
> }
>
> out:
> - /* UTF-8 doesn't provide FAT semantics */
> - if (!strcmp(opts->iocharset, "utf8")) {
> - fat_msg(sb, KERN_WARNING, "utf8 is not a recommended IO charset"
> - " for FAT filesystems, filesystem will be "
> - "case sensitive!");
> - }
> + opts->utf8 = !strcmp(opts->iocharset, "utf8") && is_vfat;

Still broken, so I think we still need the warning here (would be
tweaked warning).

> /* If user doesn't specify allow_utime, it's initialized from dmask. */
> if (opts->allow_utime == (unsigned short)-1)
> opts->allow_utime = ~opts->fs_dmask & (S_IWGRP | S_IWOTH);
> - if (opts->unicode_xlate)
> - opts->utf8 = 0;

unicode_xlate option is exclusive with utf8, need to adjust
somewhere. (with this patch, unicode_xlate and utf8 will shows by
show_options())

> + else if (utf8)
> + return fat_utf8_strnicmp(name->name, str, alen);
> + else
> + return nls_strnicmp(t, name->name, str, alen);
> }

Not strong opinion though, maybe we better to consolidate this to a
(inline) function? (FWIW, it may be better to refactor to provide some
filename functions to hide the detail of handling nls/utf8)

Thanks.
--
OGAWA Hirofumi <[email protected]>

2023-02-04 10:57:17

by Pali Rohár

[permalink] [raw]
Subject: Re: [RFC PATCH v2 01/18] fat: Fix iocharset=utf8 mount option

On Tuesday 10 January 2023 18:17:05 OGAWA Hirofumi wrote:
> Pali Rohár <[email protected]> writes:
>
> > Currently iocharset=utf8 mount option is broken and error is printed to
> > dmesg when it is used. To use UTF-8 as iocharset, it is required to use
> > utf8=1 mount option.
> >
> > Fix iocharset=utf8 mount option to use be equivalent to the utf8=1 mount
> > option and remove printing error from dmesg.
>
> [...]
>
> > -
> > - There is also an option of doing UTF-8 translations
> > - with the utf8 option.
> > -
> > -.. note:: ``iocharset=utf8`` is not recommended. If unsure, you should consider
> > - the utf8 option instead.
> > + **utf8** is supported too and recommended to use.
> >
> > **utf8=<bool>**
> > - UTF-8 is the filesystem safe version of Unicode that
> > - is used by the console. It can be enabled or disabled
> > - for the filesystem with this option.
> > - If 'uni_xlate' gets set, UTF-8 gets disabled.
> > - By default, FAT_DEFAULT_UTF8 setting is used.
> > + Alias for ``iocharset=utf8`` mount option.
> >
> > **uni_xlate=<bool>**
> > Translate unhandled Unicode characters to special
> > diff --git a/fs/fat/Kconfig b/fs/fat/Kconfig
> > index 238cc55f84c4..e98aaa3bb55b 100644
> > --- a/fs/fat/Kconfig
> > +++ b/fs/fat/Kconfig
> > @@ -93,29 +93,12 @@ config FAT_DEFAULT_IOCHARSET
> > like FAT to use. It should probably match the character set
> > that most of your FAT filesystems use, and can be overridden
> > with the "iocharset" mount option for FAT filesystems.
> > - Note that "utf8" is not recommended for FAT filesystems.
> > - If unsure, you shouldn't set "utf8" here - select the next option
> > - instead if you would like to use UTF-8 encoded file names by default.
> > + "utf8" is supported too and recommended to use.
>
> This patch fixes the issue of utf-8 partially only. I think we can't
> still recommend only partially working one.

With this patch FAT_DEFAULT_IOCHARSET=utf8 is same what was
FAT_DEFAULT_UTF8=y without this patch. And option FAT_DEFAULT_UTF8 was
recommended in description before "select the next option instead if you
would like to use UTF-8 encoded file names by default."

> [...]
>
> > - opts->utf8 = IS_ENABLED(CONFIG_FAT_DEFAULT_UTF8) && is_vfat;
> > -
> > if (!options)
> > goto out;
> >
> > @@ -1318,10 +1316,14 @@ static int parse_options(struct super_block *sb, char *options, int is_vfat,
> > | VFAT_SFN_CREATE_WIN95;
> > break;
> > case Opt_utf8_no: /* 0 or no or false */
> > - opts->utf8 = 0;
> > + fat_reset_iocharset(opts);
>
> This changes the behavior of "iocharset=iso8859-1,utf8=no" for
> example. Do we need this user visible change here?

Ok, I agree, we do not want to change iocharset when
"iocharset=iso8859-1,utf8=no" was specified. It should stay on
iso8859-1.

> > break;
> > case Opt_utf8_yes: /* empty or 1 or yes or true */
> > - opts->utf8 = 1;
> > + fat_reset_iocharset(opts);
> > + iocharset = kstrdup("utf8", GFP_KERNEL);
> > + if (!iocharset)
> > + return -ENOMEM;
> > + opts->iocharset = iocharset;
> > break;
> > case Opt_uni_xl_no: /* 0 or no or false */
> > opts->unicode_xlate = 0;
> > @@ -1359,18 +1361,11 @@ static int parse_options(struct super_block *sb, char *options, int is_vfat,
> > }
> >
> > out:
> > - /* UTF-8 doesn't provide FAT semantics */
> > - if (!strcmp(opts->iocharset, "utf8")) {
> > - fat_msg(sb, KERN_WARNING, "utf8 is not a recommended IO charset"
> > - " for FAT filesystems, filesystem will be "
> > - "case sensitive!");
> > - }
> > + opts->utf8 = !strcmp(opts->iocharset, "utf8") && is_vfat;
>
> Still broken, so I think we still need the warning here (would be
> tweaked warning).

There was no warning before for utf8=1. And with this patch
iocharset=utf8 should have same behavior as what was utf8=1 before this
patch.

So if we should show some warning for utf8=1 then it is somehow not
related to this patch and it should be done separately, possible also to
the current codebase and before this patch.

> > /* If user doesn't specify allow_utime, it's initialized from dmask. */
> > if (opts->allow_utime == (unsigned short)-1)
> > opts->allow_utime = ~opts->fs_dmask & (S_IWGRP | S_IWOTH);
> > - if (opts->unicode_xlate)
> > - opts->utf8 = 0;
>
> unicode_xlate option is exclusive with utf8, need to adjust
> somewhere. (with this patch, unicode_xlate and utf8 will shows by
> show_options())

Ok, seems that there is more work to handle unicode_xlate option
correctly.

> > + else if (utf8)
> > + return fat_utf8_strnicmp(name->name, str, alen);
> > + else
> > + return nls_strnicmp(t, name->name, str, alen);
> > }
>
> Not strong opinion though, maybe we better to consolidate this to a
> (inline) function? (FWIW, it may be better to refactor to provide some
> filename functions to hide the detail of handling nls/utf8)

This looks like an another cleanup / improvement which can be done. I'm
not sure if it is a good idea to put into this patch series (it is
already big).

2023-02-08 10:11:16

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: [RFC PATCH v2 01/18] fat: Fix iocharset=utf8 mount option

Pali Roh?r <[email protected]> writes:

>> This patch fixes the issue of utf-8 partially only. I think we can't
>> still recommend only partially working one.
>
> With this patch FAT_DEFAULT_IOCHARSET=utf8 is same what was
> FAT_DEFAULT_UTF8=y without this patch. And option FAT_DEFAULT_UTF8 was
> recommended in description before "select the next option instead if you
> would like to use UTF-8 encoded file names by default."

It is not recommending to use UTF-8 as default, right? I wanted to say
no warning and recommend has big difference, and I can't recommend the
incompatible behavior that creates the case sensitive filename.

>> Still broken, so I think we still need the warning here (would be
>> tweaked warning).
>
> There was no warning before for utf8=1. And with this patch
> iocharset=utf8 should have same behavior as what was utf8=1 before this
> patch.
>
> So if we should show some warning for utf8=1 then it is somehow not
> related to this patch and it should be done separately, possible also to
> the current codebase and before this patch.

Sure, you are right.

Thanks.
--
OGAWA Hirofumi <[email protected]>