Hi,
V3 applies the fixes suggested by Eric Biggers (thank you for your
review!). Changelog inlined in the patches.
Retested with xfstests for ext4 and f2fs.
--
cover letter from v1.
This patchset enables negative dentries for case-insensitive directories
in ext4/f2fs. It solves the corner cases for this feature, including
those already tested by fstests (generic/556). It also solves an
existing bug with the existing implementation where old negative
dentries are left behind after a directory conversion to
case-insensitive.
Testing-wise, I ran sanity checks to show it properly uses the created
negative dentries, observed the expected performance increase of the
dentry cache hit, and showed it survives the quick group in fstests on
both f2fs and ext4 without regressions.
* Background
Negative dentries have always been disabled in case-insensitive
directories because, in their current form, they can't provide enough
assurances that all the case variations of a filename won't exist in a
directory, and the name-preserving case-insenstive semantics
during file creation prevents some negative dentries from being
instantiated unmodified.
Nevertheless, for the general case, the existing implementation would
already work with negative dentries, even though they are fully
disabled. That is: if the original lookup that created the dentry was
done in a case-insensitive way, the negative dentry can usually be
validated, since it assures that no other dcache entry exists, *and*
that no variation of the file exists on disk (since the lookup
failed). A following lookup would then be executed with the
case-insensitive-aware d_hash and d_lookup, which would find the right
negative dentry and use it.
The first corner case arises when a case-insensitive directory has
negative dentries that were created before the directory was flipped to
case-insensitive. A directory must be empty to be converted, but it
doesn't mean the directory doesn't have negative dentry children. If
that happens, the dangling dentries left behind can't assure that no
case-variation of the name exists. They only mean the exact name
doesn't exist. A further lookup would incorrectly validate them.
The code below demonstrates the problem. In this example $1 and $2 are
two strings, where:
(i) $1 != $2
(ii) casefold($1) == casefold($2)
(iii) hash($1) == hash($2) == hash(casefold($1))
Then, the following sequence could potentially return a ENOENT, even
though the case-insensitive lookup should exist:
mkdir d <- Case-sensitive directory
touch d/$1
touch d/$2
unlink d/$1 <- leaves negative dentry behind.
unlink d/$2 <- leaves *another* negative dentry behind.
chattr +F d <- make 'd' case-insensitive.
touch d/$1 <- Both negative dentries could match. finds one of them,
and instantiate
access d/$1 <- Find the other negative dentry, get -ENOENT.
In fact, this is a problem even on the current implementation, where
negative dentries for CI are disabled. There was a bug reported by Al
Viro in 2020, where a directory might end up with dangling negative
dentries created during a case-sensitive lookup, because they existed
before the +F attribute was set.
It is hard to trigger the issue, because condition (iii) is hard to test
on an unmodified kernel. By hacking the kernel to force the hash
collision, there are a few ways we can trigger this bizarre behavior in
case-insensitive directories through the insertion of negative dentries.
Another problem exists when turning a negative dentry to positive. If
the negative dentry has a different case than what is currently being
used for lookup, the dentry cannot be reused without changing its name,
in order to guarantee filename-preserving semantics to userspace. We
need to either change the name or invalidate the dentry. This issue is
currently avoided in mainline, since the negative dentry mechanism is
disabled.
* Proposal
The main idea is to differentiate negative dentries created in a
case-insensitive context from those created during a case-sensitive
lookup via a new dentry flag, D_CASEFOLD_LOOKUP, set by the filesystem
the d_lookup hook. Since the former can be used (except for the
name-preserving issue), d_revalidate will just check the flag to
quickly accept or reject the dentry.
A different solution would be to guarantee no negative dentry exists
during the case-sensitive to case-insensitive directory conversion (the
other direction is safe). It has the following problems:
1) It is not trivial to implement a race-free mechanism to ensure
negative dentries won't be recreated immediately after invalidation
while converting the directory.
2) The knowledge whether the negative dentry is valid (i.e. comes from
a case-insensitive lookup) is implicit on the fact that we are
correctly invalidating dentries when converting the directory.
Having a D_CASEFOLD_LOOKUP avoids both issues, and seems to be a cheap
solution to the problem.
But, as explained above, due to the filename preserving semantics, we
cannot just validate based on D_CASEFOLD_LOOKUP.
For that, one solution would be to invalidate the negative dentry when
it is decided to turn it positive, instead of reusing it. I implemented
that in the past (2018) but Al Viro made it clear we don't want to incur
costs on the VFS critical path for filesystems who don't care about
case-insensitiveness.
Instead, this patch invalidates negative dentries in casefold
directories in d_revalidate during creation lookups, iff the lookup name
is not exactly what is cached. Other kinds of lookups wouldn't need
this limitation.
* caveats
1) Encryption
Negative dentries on case-insensitive encrypted directories are also
disabled. No semantic change for them is intended in
this patchset; we just bypass the revalidation directly to fscrypt, for
positive dentries. Encryption support is future work.
2) revalidate the cached dentry using the name under lookup
Validating based on the lookup name is strange for a cache. the new
semantic is implemented by d_revalidate, to stay out of the critical
path of filesystems who don't care about case-insensitiveness, as much
as possible. The only change is the addition of a new flavor of
d_revalidate.
* Tests
There are a tests in place for most of the corner cases in generic/556.
They mainly verify the name-preserving semantics. The invalidation when
converting the directory is harder to test, because it is hard to force
the invalidation of specific cached dentries that occlude a dangling
invalid dentry. I tested it with forcing the positive dentries to be
removed, but I'm not sure how to write an upstreamable test.
It also survives fstests quick group regression testing on both ext4 and
f2fs.
* Performance
The latency of lookups of non-existing files is obviously improved, as
would be expected. The following numbers compare the execution time of 10^6
lookups of a non-existing file in a case-insensitive directory
pre-populated with 100k files in ext4.
Without the patch: 10.363s / 0.349s / 9.920s (real/user/sys)
With the patch: 1.752s / 0.276s / 1.472s (real/user/sys)
* patchset
Patch 1 introduces a new flavor of d_revalidate to provide the
filesystem with the name under lookup; Patch 2 introduces the new flag
to signal the dentry creation context; Patch 3 introduces a libfs helper
to revalidate negative dentries on case-insensitive directories; Patch 4
deals with encryption; Patch 5 cleans up the now redundant dentry
operations for case-insensitive with and without encryption; Finally,
Patch 6 and 7 enable support on case-insensitive directories
for ext4 and f2fs, respectively.
Gabriel Krisman Bertazi (7):
fs: Expose name under lookup to d_revalidate hook
fs: Add DCACHE_CASEFOLDED_NAME flag
libfs: Validate negative dentries in case-insensitive directories
libfs: Chain encryption checks after case-insensitive revalidation
libfs: Merge encrypted_ci_dentry_ops and ci_dentry_ops
ext4: Enable negative dentries on case-insensitive lookup
f2fs: Enable negative dentries on case-insensitive lookup
Documentation/filesystems/locking.rst | 3 +
Documentation/filesystems/vfs.rst | 12 ++++
fs/dcache.c | 10 ++-
fs/ext4/namei.c | 35 ++--------
fs/f2fs/namei.c | 25 ++-----
fs/libfs.c | 93 ++++++++++++++++++---------
fs/namei.c | 23 ++++---
include/linux/dcache.h | 9 +++
8 files changed, 116 insertions(+), 94 deletions(-)
--
2.41.0
From: Gabriel Krisman Bertazi <[email protected]>
Now that casefold needs d_revalidate and calls fscrypt_d_revalidate
itself, generic_encrypt_ci_dentry_ops and generic_ci_dentry_ops are now
equivalent. Merge them together and simplify the setup code.
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
changes since v2:
- reword comment for clarity (Eric)
---
fs/libfs.c | 45 +++++++++++++--------------------------------
1 file changed, 13 insertions(+), 32 deletions(-)
diff --git a/fs/libfs.c b/fs/libfs.c
index 0e5d3bb1dddc..73cd06e7ff90 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -1509,7 +1509,7 @@ static inline int generic_ci_d_revalidate(struct dentry *dentry,
return fscrypt_d_revalidate(dentry, flags);
}
-static const struct dentry_operations generic_ci_dentry_ops = {
+static const struct dentry_operations generic_encrypted_ci_dentry_ops = {
.d_hash = generic_ci_d_hash,
.d_compare = generic_ci_d_compare,
.d_revalidate_name = generic_ci_d_revalidate,
@@ -1522,26 +1522,19 @@ static const struct dentry_operations generic_encrypted_dentry_ops = {
};
#endif
-#if defined(CONFIG_FS_ENCRYPTION) && IS_ENABLED(CONFIG_UNICODE)
-static const struct dentry_operations generic_encrypted_ci_dentry_ops = {
- .d_hash = generic_ci_d_hash,
- .d_compare = generic_ci_d_compare,
- .d_revalidate_name = generic_ci_d_revalidate,
-};
-#endif
-
/**
* generic_set_encrypted_ci_d_ops - helper for setting d_ops for given dentry
* @dentry: dentry to set ops on
*
- * Casefolded directories need d_hash and d_compare set, so that the dentries
- * contained in them are handled case-insensitively. Note that these operations
- * are needed on the parent directory rather than on the dentries in it, and
- * while the casefolding flag can be toggled on and off on an empty directory,
- * dentry_operations can't be changed later. As a result, if the filesystem has
- * casefolding support enabled at all, we have to give all dentries the
- * casefolding operations even if their inode doesn't have the casefolding flag
- * currently (and thus the casefolding ops would be no-ops for now).
+ * Casefolded directories need some dentry_operations set, so that the dentries
+ * contained in them are handled case-insensitively. Note that d_hash and
+ * d_compare are needed on the parent directory rather than on the dentries in
+ * it, and while the casefolding flag can be toggled on and off on an empty
+ * directory, dentry_operations can't be changed later. As a result, if the
+ * filesystem has casefolding support enabled at all, we have to give all
+ * dentries the casefolding operations even if their inode doesn't have the
+ * casefolding flag currently (and thus the casefolding ops would be no-ops for
+ * now).
*
* Encryption works differently in that the only dentry operation it needs is
* d_revalidate, which it only needs on dentries that have the no-key name flag.
@@ -1550,34 +1543,22 @@ static const struct dentry_operations generic_encrypted_ci_dentry_ops = {
* Finally, to maximize compatibility with overlayfs (which isn't compatible
* with certain dentry operations) and to avoid taking an unnecessary
* performance hit, we use custom dentry_operations for each possible
- * combination rather than always installing all operations.
+ * combination of operations rather than always installing them.
*/
void generic_set_encrypted_ci_d_ops(struct dentry *dentry)
{
-#ifdef CONFIG_FS_ENCRYPTION
- bool needs_encrypt_ops = dentry->d_flags & DCACHE_NOKEY_NAME;
-#endif
#if IS_ENABLED(CONFIG_UNICODE)
- bool needs_ci_ops = dentry->d_sb->s_encoding;
-#endif
-#if defined(CONFIG_FS_ENCRYPTION) && IS_ENABLED(CONFIG_UNICODE)
- if (needs_encrypt_ops && needs_ci_ops) {
+ if (dentry->d_sb->s_encoding) {
d_set_d_op(dentry, &generic_encrypted_ci_dentry_ops);
return;
}
#endif
#ifdef CONFIG_FS_ENCRYPTION
- if (needs_encrypt_ops) {
+ if (dentry->d_flags & DCACHE_NOKEY_NAME) {
d_set_d_op(dentry, &generic_encrypted_dentry_ops);
return;
}
#endif
-#if IS_ENABLED(CONFIG_UNICODE)
- if (needs_ci_ops) {
- d_set_d_op(dentry, &generic_ci_dentry_ops);
- return;
- }
-#endif
}
EXPORT_SYMBOL(generic_set_encrypted_ci_d_ops);
--
2.41.0
Sorry, one more thing...
On Wed, Jul 19, 2023 at 06:19:11PM -0400, Gabriel Krisman Bertazi wrote:
>
> Another problem exists when turning a negative dentry to positive. If
> the negative dentry has a different case than what is currently being
> used for lookup, the dentry cannot be reused without changing its name,
> in order to guarantee filename-preserving semantics to userspace. We
> need to either change the name or invalidate the dentry. This issue is
> currently avoided in mainline, since the negative dentry mechanism is
> disabled.
Are you sure this problem even needs to be solved?
It actually isn't specific to negative dentries. If you have a file "foo"
that's not in the dcache, and you open it (or look it up in any other way) as
"FOO", then the positive dentry that gets created is named "FOO".
As a result, the name that shows up in /proc/$pid/fd/ for anyone who has the
file open is "FOO", not the true name "foo". This is true even for processes
that open it as "foo", as long as the dentry remains in the dcache.
No negative dentries involved at all!
Is your thinking that you just don't want to increase the number of ways in
which this behavior can occur?
Or, it looks like the positive dentry case is solvable using d_add_ci().
So maybe you are planning to do that? It's not clear to me.
- Eric
Eric Biggers <[email protected]> writes:
>> Another problem exists when turning a negative dentry to positive. If
>> the negative dentry has a different case than what is currently being
>> used for lookup, the dentry cannot be reused without changing its name,
>> in order to guarantee filename-preserving semantics to userspace. We
>> need to either change the name or invalidate the dentry. This issue is
>> currently avoided in mainline, since the negative dentry mechanism is
>> disabled.
>
> Are you sure this problem even needs to be solved?
Yes, because we promise name-preserving semantics. If we don't do it,
the name on the disk might be different than what was asked for, and tools
that rely on it (they exist) will break. During initial testing, I've
had tools breaking with case-insensitive ext4 because they created a
file, did getdents and wanted to see exactly the name they used.
> It actually isn't specific to negative dentries. If you have a file "foo"
> that's not in the dcache, and you open it (or look it up in any other way) as
> "FOO", then the positive dentry that gets created is named "FOO".
>
> As a result, the name that shows up in /proc/$pid/fd/ for anyone who has the
> file open is "FOO", not the true name "foo". This is true even for processes
> that open it as "foo", as long as the dentry remains in the dcache.
>
> No negative dentries involved at all!
I totally agree it is goes beyond negative dentries, but this case is
particularly important because it is the only one (that I know of) where
the incorrect case might actually be written to the disk. other cases,
like /proc/<pid>/fd/ can just display a different case to userspace,
which is confusing. Still, the disk has the right version, exactly as
originally created.
I see the current /proc/$pid/fd/ semantics as a bug. In fact, I have/had
a bug report for bwrap/flatkpak breaking because it was mounting
something and immediately checking /proc/mounts to confirm it worked. A
former team member tried fixing it a while ago, but we didn't follow up,
and I don't really love the way they did it. I need to look into that.
> Or, it looks like the positive dentry case is solvable using d_add_ci().
> So maybe you are planning to do that? It's not clear to me.
I want to use d_add_ci for the future, yes. It is not hard, but not
trivial, because there is an infinite recursion if d_add_ci uses
->d_compare() when doing the lookup for a duplicate. We sent a patch to
fix d_add_ci a while ago, but it was rejected. I need to revisit.
--
Gabriel Krisman Bertazi
On Thu, Jul 20, 2023 at 01:35:37PM -0400, Gabriel Krisman Bertazi wrote:
> Eric Biggers <[email protected]> writes:
>
> >> Another problem exists when turning a negative dentry to positive. If
> >> the negative dentry has a different case than what is currently being
> >> used for lookup, the dentry cannot be reused without changing its name,
> >> in order to guarantee filename-preserving semantics to userspace. We
> >> need to either change the name or invalidate the dentry. This issue is
> >> currently avoided in mainline, since the negative dentry mechanism is
> >> disabled.
> >
> > Are you sure this problem even needs to be solved?
>
> Yes, because we promise name-preserving semantics. If we don't do it,
> the name on the disk might be different than what was asked for, and tools
> that rely on it (they exist) will break. During initial testing, I've
> had tools breaking with case-insensitive ext4 because they created a
> file, did getdents and wanted to see exactly the name they used.
>
> > It actually isn't specific to negative dentries. If you have a file "foo"
> > that's not in the dcache, and you open it (or look it up in any other way) as
> > "FOO", then the positive dentry that gets created is named "FOO".
> >
> > As a result, the name that shows up in /proc/$pid/fd/ for anyone who has the
> > file open is "FOO", not the true name "foo". This is true even for processes
> > that open it as "foo", as long as the dentry remains in the dcache.
> >
> > No negative dentries involved at all!
>
> I totally agree it is goes beyond negative dentries, but this case is
> particularly important because it is the only one (that I know of) where
> the incorrect case might actually be written to the disk. other cases,
> like /proc/<pid>/fd/ can just display a different case to userspace,
> which is confusing. Still, the disk has the right version, exactly as
> originally created.
>
> I see the current /proc/$pid/fd/ semantics as a bug. In fact, I have/had
> a bug report for bwrap/flatkpak breaking because it was mounting
> something and immediately checking /proc/mounts to confirm it worked. A
> former team member tried fixing it a while ago, but we didn't follow up,
> and I don't really love the way they did it. I need to look into that.
>
> > Or, it looks like the positive dentry case is solvable using d_add_ci().
> > So maybe you are planning to do that? It's not clear to me.
>
> I want to use d_add_ci for the future, yes. It is not hard, but not
> trivial, because there is an infinite recursion if d_add_ci uses
> ->d_compare() when doing the lookup for a duplicate. We sent a patch to
> fix d_add_ci a while ago, but it was rejected. I need to revisit.
>
Thanks, I missed that choosing a different-case dentry actually changes the name
given to the new file. This is because the filesystem is told about the name of
the file to create via the negative dentry that gets found/created -- not via
the original user-specified string. It would help if you made this clear in a
code comment. Taking the comment I suggested, I'd maybe revise it to:
/*
* If the lookup is for creation, then a negative dentry
* can only be reused if it's a case-sensitive match,
* not just a case-insensitive one. This is needed to
* make the new file be created with the name the user
* specified, preserving case.
*
- Eric