by Darrick J. Wong

[permalink] [raw]

Subject: Re: [PATCH RESEND v2 00/25] Ext4 Encoding and Case-insensitive support

On Mon, Sep 24, 2018 at 05:56:30PM -0400, Gabriel Krisman Bertazi wrote:
> [Resend, but also rebased on top of Ted's dev branch.]
>
> Hi Ted,
>
> This patchset implements encoding support as a superblock feature, valid
> for the entire disk, and Case-insensitive lookups support as a
> per-directory feature, configured as an inode flag. Alongside the
> addition of the casefolding patch, which was not part of v1, I fixed
> some bugs and addressed the concerns raised regarding invalid sequences
> and what normalization we use.
>
> My approach to solve the two problems above is to make things more
> flexible. An encoding flags field in the superblock registers what kind
> of normalization is used by the filesystem. Right now, the only
> encoding supported for utf8 is NFKD, for instance, but if we want to
> support a different normalization function in the future, we can be

Hmmm, I'm curious, why pick NFKD specifically? AFAICT Linux userspace
environments (I only tried with GNOME and KDE) use NF[K]C. I saved a
file with the name "caf?.txt" (c-a-f-compose-e-'-dot-t-x-t), and this is
what the ls output looked like:

$ /bin/ls caf?.txt | od -tx1 -Ad -c
0000000 63 61 66 c3 a9 2e 74 78 74 0a
c a f 303 251 . t x t \n
0000010

and xfs_db confirms that's what's in the file name:

# xfs_db /tmp/a.img
xfs_db> sb 0
xfs_db> addr rootino
xfs_db> p u3.sfdir3
u3.sfdir3.hdr.count = 1
u3.sfdir3.hdr.i8count = 0
u3.sfdir3.hdr.parent.i4 = 128
u3.sfdir3.list[0].namelen = 9
u3.sfdir3.list[0].offset = 0x60
u3.sfdir3.list[0].name = "caf\303\251.txt"
u3.sfdir3.list[0].inumber.i4 = 131
u3.sfdir3.list[0].filetype = 1

Is there a particular reason you picked NFKD? Ohhh, right, because this
series is a derivative of the ~2014 XFS case folding patchset. Hmm, so
looking at the ext4 changes, I guess what you do is add a custom ->d_hash
function so that the dentries are hashed by hash(nfkd(fname))? Which
makes it easy to have link() look for names that will conflict after
normalization? Which I guess is also how casefolding happens for
lookups? And I suppose that's why the superblock also has to store the
version of Unicode used and the folding options? Hmm, ok, sounds
reasonable to me so far.

> backward compatible. Likewise, superblock flags also define the
> behavior when dealing with invalid sequences. The default behavior is
> to just treat invalid sequences as opaque byte sequences, falling back
> to the original behavior. Alternatively if the strict flag is enabled,
> the kernel will reject invalid sequences as soon as they are detected.
> All these flags, can only be set at mkfs time, using the encoding_flags
> parameter.
>
> Each encoding has its default flags, when creating the filesystem in
> mkfs, regarding normalization/casefold functions and how to deal with
> opaque sequences.
>
> The NLS patches implement generic casefold and normalization operation
> (defined as tolower() and identity(), respectively), allowing us to use
> any NLS charset in the kernel. We store the NLS charset as a magic
> number in the superblock, and for that only a few charsets are defined.
> If you want to force a different charset, the encoding and
> encoding_flags mount options are also provided.
>
> This patchset also includes the NLS changes that I am proposing,
> including the entire utf8 normalization implementation. Differently
> from the previous version, I merged utf8 and utf8n, allowing the user to
> request a specific normalization (or no normalization at all) using the
> flags. As usual, I did not include the source ucd files because they
> would bounce in the list, but a completely functional implementation can
> be found at:
>
> https://gitlab.collabora.com/krisman/linux.git -b ext4-ci-directory
>
> I am also not supporting encoding with encrypted directories, given the
> cost of searching encrypted directories where the diggested name is not
> normalized. This means that we need to decrypt each filename beforehand,
> so I decided to simply skip this for now. If the user tries to mount
> with encoding a directory that has the encryption feature, we simply
> bail out saying it is not supported.
>
> The patchset survives without failures the smoke tests of xfstests, with
> the obvious exception of generic/453. This test, which verifies that
> multiple files having equivalent names (which would match in utf8
> normalization) are not the same file, doesn't really make sense with
> this patchset, since it basically verifies the fs is *not* encoding
> aware.

Yep. This probably needs a _require_unnormalized_dnames() helper to
detect when filename normalization is happening. IIRC this test also
fails on HFS+, as expected.

Do you normalize attr names? I surmise not since you didn't complain
about generic/454.

> We also developed encoding and casefolding tests for xfstests, allowing
> us to quickly verify the implementation. I will be sumitting them
> upstream too, but for now they are available at:
>
> https://gitlab.collabora.com/krisman/xfstests.git -b encoding

Er... those common/encoding helpers are going to have to deal with other
filesystems. :)

> These tests validate the usage on inline dirs, on dx directories, when
> dealing with dcache and more. I am happy with the coverage we have now,
> but if you have specific concerns I can add more tests.
>
> A modified version of e2fsprogs is necessary to run the tests, and to
> enable the feature. Support for mkfs, fsck, lsattr, chattr is
> available. I also hacked tune2fs to prevent it from setting the
> encryption flag when encoding is enabled. This tune2fs change is
> temporary until we are able to support these two features together.
>
> https://gitlab.collabora.com/krisman/e2fsprogs.git -b encoding-feature
>
> Gabriel Krisman Bertazi (21):
> nls: Wrap uni2char/char2uni callers
> nls: Wrap charset field access
> nls: Wrap charset hooks in ops structure
> nls: Split default charset from NLS core
> nls: Split struct nls_charset from struct nls_table
> nls: Add support for multiple versions of an encoding
> nls: Implement NLS_STRICT_MODE flag
> nls: Let charsets define the behavior of tolower/toupper
> nls: Add new interface for string comparisons
> nls: Add optional normalization and casefold hooks
> nls: ascii: Support validation and normalization operations
> nls: utf8: Move nls-utf8{,-core}.c
> nls: utf8: Integrate utf8 normalization code with utf8 charset
> nls: utf8: Introduce test module for normalized utf8 implementation
> vfs: Handle case-exact lookup in d_add_ci
> ext4: Include encoding information in the superblock
> ext4: Add encoding mount options
> ext4: Support encoding-aware file name lookups
> ext4: Implement encoding-aware dcache hooks
> ext4: Implement EXT4_CASEFOLD_FL flag
> docs: ext4.rst: Document encoding and case-insensitive lookups
>
> Olaf Weber (4):
> nls: utf8n: Add unicode character database files
> scripts: add trie generator for UTF-8
> nls: utf8: Introduce code for UTF-8 normalization
> nls: utf8n: reduce the size of utf8data[]
>
> Documentation/filesystems/ext4/ext4.rst | 38 +
> fs/befs/linuxvfs.c | 8 +-
> fs/cifs/cifs_unicode.c | 15 +-
> fs/cifs/cifsfs.c | 2 +-
> fs/cifs/connect.c | 2 +-
> fs/cifs/dir.c | 7 +-
> fs/dcache.c | 33 +-
> fs/ext4/dir.c | 59 +
> fs/ext4/ext4.h | 33 +-
> fs/ext4/hash.c | 38 +-
> fs/ext4/ialloc.c | 2 +-
> fs/ext4/inline.c | 2 +-
> fs/ext4/inode.c | 4 +-
> fs/ext4/ioctl.c | 18 +
> fs/ext4/namei.c | 99 +-
> fs/ext4/super.c | 170 ++
> fs/fat/dir.c | 13 +-
> fs/fat/inode.c | 6 +-
> fs/fat/namei_vfat.c | 6 +-
> fs/hfs/super.c | 6 +-
> fs/hfs/trans.c | 9 +-
> fs/hfsplus/options.c | 2 +-
> fs/hfsplus/unicode.c | 6 +-
> fs/isofs/inode.c | 5 +-
> fs/isofs/joliet.c | 3 +-
> fs/jfs/jfs_unicode.c | 9 +-
> fs/jfs/super.c | 3 +-
> fs/nls/Kconfig | 15 +
> fs/nls/Makefile | 20 +
> fs/nls/mac-celtic.c | 34 +-
> fs/nls/mac-centeuro.c | 34 +-
> fs/nls/mac-croatian.c | 34 +-
> fs/nls/mac-cyrillic.c | 34 +-
> fs/nls/mac-gaelic.c | 34 +-
> fs/nls/mac-greek.c | 34 +-
> fs/nls/mac-iceland.c | 34 +-
> fs/nls/mac-inuit.c | 34 +-
> fs/nls/mac-roman.c | 34 +-
> fs/nls/mac-romanian.c | 34 +-
> fs/nls/mac-turkish.c | 34 +-
> fs/nls/nls_ascii.c | 84 +-
> fs/nls/nls_core.c | 163 ++
> fs/nls/nls_cp1250.c | 34 +-
> fs/nls/nls_cp1251.c | 34 +-
> fs/nls/nls_cp1255.c | 36 +-
> fs/nls/nls_cp437.c | 34 +-
> fs/nls/nls_cp737.c | 34 +-
> fs/nls/nls_cp775.c | 34 +-
> fs/nls/nls_cp850.c | 34 +-
> fs/nls/nls_cp852.c | 34 +-
> fs/nls/nls_cp855.c | 34 +-
> fs/nls/nls_cp857.c | 34 +-
> fs/nls/nls_cp860.c | 34 +-
> fs/nls/nls_cp861.c | 34 +-
> fs/nls/nls_cp862.c | 34 +-
> fs/nls/nls_cp863.c | 34 +-
> fs/nls/nls_cp864.c | 34 +-
> fs/nls/nls_cp865.c | 34 +-
> fs/nls/nls_cp866.c | 34 +-
> fs/nls/nls_cp869.c | 34 +-
> fs/nls/nls_cp874.c | 36 +-
> fs/nls/nls_cp932.c | 36 +-
> fs/nls/nls_cp936.c | 36 +-
> fs/nls/nls_cp949.c | 36 +-
> fs/nls/nls_cp950.c | 36 +-
> fs/nls/{nls_base.c => nls_default.c} | 124 +-
> fs/nls/nls_euc-jp.c | 29 +-
> fs/nls/nls_iso8859-1.c | 34 +-
> fs/nls/nls_iso8859-13.c | 34 +-
> fs/nls/nls_iso8859-14.c | 34 +-
> fs/nls/nls_iso8859-15.c | 34 +-
> fs/nls/nls_iso8859-2.c | 34 +-
> fs/nls/nls_iso8859-3.c | 34 +-
> fs/nls/nls_iso8859-4.c | 34 +-
> fs/nls/nls_iso8859-5.c | 34 +-
> fs/nls/nls_iso8859-6.c | 34 +-
> fs/nls/nls_iso8859-7.c | 34 +-
> fs/nls/nls_iso8859-9.c | 34 +-
> fs/nls/nls_koi8-r.c | 34 +-
> fs/nls/nls_koi8-ru.c | 30 +-
> fs/nls/nls_koi8-u.c | 34 +-
> fs/nls/nls_utf8-core.c | 333 +++
> fs/nls/nls_utf8-norm.c | 797 ++++++
> fs/nls/nls_utf8-selftest.c | 307 ++
> fs/nls/nls_utf8.c | 67 -
> fs/nls/ucd/README | 33 +
> fs/nls/utf8n.h | 117 +
> fs/ntfs/inode.c | 2 +-
> fs/ntfs/super.c | 6 +-
> fs/ntfs/unistr.c | 13 +-
> fs/udf/super.c | 3 +-
> fs/udf/unicode.c | 4 +-
> include/linux/fs.h | 2 +
> include/linux/nls.h | 293 +-
> scripts/Makefile | 1 +
> scripts/mkutf8data.c | 3464 +++++++++++++++++++++++
> 96 files changed, 7482 insertions(+), 633 deletions(-)
> create mode 100644 fs/nls/nls_core.c
> rename fs/nls/{nls_base.c => nls_default.c} (89%)
> create mode 100644 fs/nls/nls_utf8-core.c
> create mode 100644 fs/nls/nls_utf8-norm.c
> create mode 100644 fs/nls/nls_utf8-selftest.c
> delete mode 100644 fs/nls/nls_utf8.c
> create mode 100644 fs/nls/ucd/README
> create mode 100644 fs/nls/utf8n.h
> create mode 100644 scripts/mkutf8data.c
>
> --
> 2.19.0
>