From: Gabriel Krisman Bertazi <[email protected]>
Introduce a dentry revalidation helper to be used by case-insensitive
filesystems to check if it is safe to reuse a negative dentry.
A negative dentry is safe to be reused on a case-insensitive lookup if
it was created during a case-insensitive lookup and this is not a lookup
that will instantiate a dentry. If this is a creation lookup, we also
need to make sure the name matches sensitively the name under lookup in
order to assure the name preserving semantics.
dentry->d_name is only checked by the case-insensitive d_revalidate hook
in the LOOKUP_CREATE/LOOKUP_RENAME_TARGET case since, for these cases,
d_revalidate is always called with the parent inode locked, and
therefore the name cannot change from under us.
d_revalidate is only called in 4 places: lookup_dcache, __lookup_slow,
lookup_open and lookup_fast:
- lookup_dcache always calls it with zeroed flags, with the exception
of when coming from __lookup_hash, which needs the parent locked
already, for instance in the open/creation path, which is locked in
open_last_lookups.
- In __lookup_slow, either the parent inode is locked by the
caller (lookup_slow), or it is called with no
flags (lookup_one/lookup_one_len).
- lookup_open also requires the parent to be locked in the creation
case, which is done in open_last_lookups.
- lookup_fast will indeed be called with the parent unlocked, but it
shouldn't be called with LOOKUP_CREATE. Either it is called in the
link_path_walk, where nd->flags doesn't have LOOKUP_CREATE yet or in
open_last_lookups. But, in this case, it also never has LOOKUP_CREATE,
because it is only called on the !O_CREAT case, which means op->intent
doesn't have LOOKUP_CREAT (set in build_open_flags only if O_CREAT is
set).
Finally, for the LOOKUP_RENAME_TARGET, we are doing a rename, so the
parents inodes are also be locked.
Reviewed-by: Theodore Ts'o <[email protected]>
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
Changes since v2:
- Add comments to all rejection cases (eric)
- safeguard against filesystem creating dentries without LOOKUP flags
---
fs/libfs.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 48 insertions(+)
diff --git a/fs/libfs.c b/fs/libfs.c
index 5b851315eeed..dd213f446427 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -1462,9 +1462,57 @@ static int generic_ci_d_hash(const struct dentry *dentry, struct qstr *str)
return 0;
}
+static inline int generic_ci_d_revalidate(struct dentry *dentry,
+ const struct qstr *name,
+ unsigned int flags)
+{
+ if (d_is_negative(dentry)) {
+ const struct dentry *parent = READ_ONCE(dentry->d_parent);
+ const struct inode *dir = READ_ONCE(parent->d_inode);
+
+ if (dir && needs_casefold(dir)) {
+ /*
+ * Filesystems will call into d_revalidate without
+ * setting LOOKUP_ flags even for file creation(see
+ * lookup_one* variants). Reject negative dentries in
+ * this case, since we can't know for sure it won't be
+ * used for creation.
+ */
+ if (!flags)
+ return 0;
+
+ /*
+ * Negative dentries created prior to turning the
+ * directory case-insensitive cannot be trusted, since
+ * they don't ensure any possible case version of the
+ * filename doesn't exist.
+ */
+ if (!d_is_casefold_lookup(dentry))
+ return 0;
+
+ if (flags & (LOOKUP_CREATE | LOOKUP_RENAME_TARGET)) {
+ /*
+ * ->d_name won't change from under us in the
+ * creation path only, since d_revalidate during
+ * creation and renames is always called with
+ * the parent inode locked. It isn't the case
+ * for all lookup callpaths, so ->d_name must
+ * not be touched outside
+ * (LOOKUP_CREATE|LOOKUP_RENAME_TARGET) context.
+ */
+ if (dentry->d_name.len != name->len ||
+ memcmp(dentry->d_name.name, name->name, name->len))
+ return 0;
+ }
+ }
+ }
+ return 1;
+}
+
static const struct dentry_operations generic_ci_dentry_ops = {
.d_hash = generic_ci_d_hash,
.d_compare = generic_ci_d_compare,
+ .d_revalidate_name = generic_ci_d_revalidate,
};
#endif
--
2.41.0
On Wed, Jul 19, 2023 at 06:19:14PM -0400, Gabriel Krisman Bertazi wrote:
> +static inline int generic_ci_d_revalidate(struct dentry *dentry,
> + const struct qstr *name,
> + unsigned int flags)
This shouldn't be 'inline', since it can't be inlined into anywhere anyway.
> + if (dir && needs_casefold(dir)) {
> + /*
> + * Filesystems will call into d_revalidate without
> + * setting LOOKUP_ flags even for file creation(see
> + * lookup_one* variants). Reject negative dentries in
> + * this case, since we can't know for sure it won't be
> + * used for creation.
> + */
> + if (!flags)
> + return 0;
> +
> + /*
> + * Negative dentries created prior to turning the
> + * directory case-insensitive cannot be trusted, since
> + * they don't ensure any possible case version of the
> + * filename doesn't exist.
> + */
> + if (!d_is_casefold_lookup(dentry))
> + return 0;
> +
> + if (flags & (LOOKUP_CREATE | LOOKUP_RENAME_TARGET)) {
> + /*
> + * ->d_name won't change from under us in the
> + * creation path only, since d_revalidate during
> + * creation and renames is always called with
> + * the parent inode locked. It isn't the case
> + * for all lookup callpaths, so ->d_name must
> + * not be touched outside
> + * (LOOKUP_CREATE|LOOKUP_RENAME_TARGET) context.
> + */
> + if (dentry->d_name.len != name->len ||
> + memcmp(dentry->d_name.name, name->name, name->len))
> + return 0;
> + }
> + }
This would be easier to follow if the '!flags' and 'flags & (LOOKUP_CREATE |
LOOKUP_RENAME_TARGET)' sections were adjacent to each other. They group
together logically, since they both deal with the following problem: "when the
lookup might be for creation, invalidate the negative dentry if it might not be
a case-sensitive match". (And it would help if there was a comment explaining
that problem.) The d_is_casefold_lookup() check solves a different problem.
I'm also having trouble understanding exactly when ->d_name is stable here.
AFAICS, unfortunately the VFS has an edge case where a dentry can be moved
without its parent's ->i_rwsem being held. It happens when a subdirectory is
"found" under multiple names. The VFS doesn't support directory hard links, so
if it finds a second link to a directory, it just moves the whole dentry tree to
the new location. This can happen if a filesystem image is corrupted and
contains directory hard links. Coincidentally, it can also happen in an
encrypted directory due to the no-key name => normal name transition...
But, negative dentries are never moved, at all. (__d_move() even WARNs if you
ask it to move a negative dentry.) That feels like that would make everything
else irrelevant here, though your patchset doesn't mention this. I suppose the
problem would be what if the dentry is made positive concurrently.
I'm struggling to find an ideal solution here. Maybe ->d_lock should just be
taken for the name comparison. That is legal here, and it reliably synchronizes
with the dentry being moved, without having to consider anything else.
So, the following is probably what I'd do. I'd be interested to hear your
thoughts, though:
/*
* A negative dentry that was created before the
* directory became case-insensitive can't be trusted,
* since it doesn't ensure any possible case version of
* the filename doesn't exist.
*/
if (!d_is_casefold_lookup(dentry))
return 0;
/*
* If the lookup is for creation, then a negative dentry
* can only be reused if it's a case-sensitive match,
* not just a case-insensitive one. This is required to
* provide case-preserving semantics.
*
* In some cases (lookup_one_*()), the lookup intent is
* unknown, resulting in flags==0 here. So we have to
* assume that flags==0 is potentially a creation.
*/
if (flags == 0 ||
(flags & (LOOKUP_CREATE | LOOKUP_RENAME_TARGET))) {
bool match;
spin_lock(&dentry->d_lock);
match = (dentry->d_name.len == name->len &&
memcmp(dentry->d_name.name, name->name,
name->len) == 0);
spin_unlock(&dentry->d_lock);
if (!match)
return 0;
}
On Wed, Jul 19, 2023 at 11:06:57PM -0700, Eric Biggers wrote:
>
> I'm also having trouble understanding exactly when ->d_name is stable here.
> AFAICS, unfortunately the VFS has an edge case where a dentry can be moved
> without its parent's ->i_rwsem being held. It happens when a subdirectory is
> "found" under multiple names. The VFS doesn't support directory hard links, so
> if it finds a second link to a directory, it just moves the whole dentry tree to
> the new location. This can happen if a filesystem image is corrupted and
> contains directory hard links. Coincidentally, it can also happen in an
> encrypted directory due to the no-key name => normal name transition...
Sorry, I think I got this slightly wrong. The move does happen with the
parent's ->i_rwsem held, but it's for read, not for write. First, before
->lookup is called, the ->i_rwsem of the parent directory is taken for read.
->lookup() calls d_splice_alias() which can call __d_unalias() which does the
__d_move(). If the old alias is in a different directory (which cannot happen
in that fscrypt case, but can happen in the general "directory hard links"
case), __d_unalias() takes that directory's ->i_rwsem for read too.
So it looks like the parent's ->i_rwsem does indeed exclude moves of child
dentries, but only if it's taken for *write*. So I guess you can rely on that;
it's just a bit more subtle than it first appears. Though, some of your
explanation seems to assume that a read lock is sufficient ("In __lookup_slow,
either the parent inode is locked by the caller (lookup_slow) ..."), so maybe
there is still a problem.
- Eric
Eric Biggers <[email protected]> writes:
> On Wed, Jul 19, 2023 at 11:06:57PM -0700, Eric Biggers wrote:
>>
>> I'm also having trouble understanding exactly when ->d_name is stable here.
>> AFAICS, unfortunately the VFS has an edge case where a dentry can be moved
>> without its parent's ->i_rwsem being held. It happens when a subdirectory is
>> "found" under multiple names. The VFS doesn't support directory hard links, so
>> if it finds a second link to a directory, it just moves the whole dentry tree to
>> the new location. This can happen if a filesystem image is corrupted and
>> contains directory hard links. Coincidentally, it can also happen in an
>> encrypted directory due to the no-key name => normal name transition...
>
> Sorry, I think I got this slightly wrong. The move does happen with the
> parent's ->i_rwsem held, but it's for read, not for write. First, before
> ->lookup is called, the ->i_rwsem of the parent directory is taken for read.
> ->lookup() calls d_splice_alias() which can call __d_unalias() which does the
> __d_move(). If the old alias is in a different directory (which cannot happen
> in that fscrypt case, but can happen in the general "directory hard links"
> case), __d_unalias() takes that directory's ->i_rwsem for read too.
>
> So it looks like the parent's ->i_rwsem does indeed exclude moves of child
> dentries, but only if it's taken for *write*. So I guess you can rely on that;
> it's just a bit more subtle than it first appears. Though, some of your
> explanation seems to assume that a read lock is sufficient ("In __lookup_slow,
> either the parent inode is locked by the caller (lookup_slow) ..."), so maybe
> there is still a problem.
I think I'm missing something on your clarification. I see your point
about __d_unalias, and I see in the case where alias->d_parent !=
dentry->d_parent we acquire the parent inode read lock:
static int __d_unalias(struct inode *inode,
struct dentry *dentry, struct dentry *alias)
{
...
m1 = &dentry->d_sb->s_vfs_rename_mutex;
if (!inode_trylock_shared(alias->d_parent->d_inode))
goto out_err;
}
And it seems to use that for __d_move. In this case, __d_move changes
from under us even with a read lock, which is dangerous. I think I
agree with your first email more than the clarification.
In the lookup_slow then:
lookup_slow()
d_lookup()
d_splice_alias()
__d_unalias()
__d_move()
this __d_move Can do a dentry move and race with d_revalidate even
though it has the parent read lock.
> So it looks like the parent's ->i_rwsem does indeed exclude moves of child
> dentries, but only if it's taken for *write*. So I guess you can rely on that;
We can get away of it with acquiring the d_lock as you suggested, I
think. But can you clarify the above? I wanna make sure I didn't miss
anything. I am indeed relying only on the read lock here, as you can see.
--
Gabriel Krisman Bertazi
On Fri, Jul 21, 2023 at 04:16:30PM -0400, Gabriel Krisman Bertazi wrote:
> Eric Biggers <[email protected]> writes:
>
> > On Wed, Jul 19, 2023 at 11:06:57PM -0700, Eric Biggers wrote:
> >>
> >> I'm also having trouble understanding exactly when ->d_name is stable here.
> >> AFAICS, unfortunately the VFS has an edge case where a dentry can be moved
> >> without its parent's ->i_rwsem being held. It happens when a subdirectory is
> >> "found" under multiple names. The VFS doesn't support directory hard links, so
> >> if it finds a second link to a directory, it just moves the whole dentry tree to
> >> the new location. This can happen if a filesystem image is corrupted and
> >> contains directory hard links. Coincidentally, it can also happen in an
> >> encrypted directory due to the no-key name => normal name transition...
> >
> > Sorry, I think I got this slightly wrong. The move does happen with the
> > parent's ->i_rwsem held, but it's for read, not for write. First, before
> > ->lookup is called, the ->i_rwsem of the parent directory is taken for read.
> > ->lookup() calls d_splice_alias() which can call __d_unalias() which does the
> > __d_move(). If the old alias is in a different directory (which cannot happen
> > in that fscrypt case, but can happen in the general "directory hard links"
> > case), __d_unalias() takes that directory's ->i_rwsem for read too.
> >
> > So it looks like the parent's ->i_rwsem does indeed exclude moves of child
> > dentries, but only if it's taken for *write*. So I guess you can rely on that;
> > it's just a bit more subtle than it first appears. Though, some of your
> > explanation seems to assume that a read lock is sufficient ("In __lookup_slow,
> > either the parent inode is locked by the caller (lookup_slow) ..."), so maybe
> > there is still a problem.
>
> I think I'm missing something on your clarification. I see your point
> about __d_unalias, and I see in the case where alias->d_parent !=
> dentry->d_parent we acquire the parent inode read lock:
>
> static int __d_unalias(struct inode *inode,
> struct dentry *dentry, struct dentry *alias)
> {
> ...
> m1 = &dentry->d_sb->s_vfs_rename_mutex;
> if (!inode_trylock_shared(alias->d_parent->d_inode))
> goto out_err;
> }
>
> And it seems to use that for __d_move. In this case, __d_move changes
> from under us even with a read lock, which is dangerous. I think I
> agree with your first email more than the clarification.
>
> In the lookup_slow then:
>
> lookup_slow()
> d_lookup()
> d_splice_alias()
> __d_unalias()
> __d_move()
>
> this __d_move Can do a dentry move and race with d_revalidate even
> though it has the parent read lock.
>
> > So it looks like the parent's ->i_rwsem does indeed exclude moves of child
> > dentries, but only if it's taken for *write*. So I guess you can rely on that;
>
> We can get away of it with acquiring the d_lock as you suggested, I
> think. But can you clarify the above? I wanna make sure I didn't miss
> anything. I am indeed relying only on the read lock here, as you can see.
In my first email I thought that __d_move() can be called without the parent
inode's i_rwsem held at all. In my second email I realized that it is always
called with at least a read (shared) lock.
The question is what kind of parent i_rwsem lock is guaranteed to be held by the
caller of ->d_revalidate() when the name comparison is done. Based on the
above, it needs to be at least a write (exclusive) lock to exclude __d_move()
without taking d_lock. However, your analysis (in the commit message of "libfs:
Validate negative dentries in case-insensitive directories") only talks about
i_rwsem being "taken", without saying whether it's for read or write. One thing
you mentioned as taking i_rwsem is lookup_slow, but that only takes it for read.
That seems like a problem, as it makes your analysis not correct.
- Eric
Eric Biggers <[email protected]> writes:
> On Fri, Jul 21, 2023 at 04:16:30PM -0400, Gabriel Krisman Bertazi wrote:
>> Eric Biggers <[email protected]> writes:
>>
>> > On Wed, Jul 19, 2023 at 11:06:57PM -0700, Eric Biggers wrote:
>> >>
>> >> I'm also having trouble understanding exactly when ->d_name is stable here.
>> >> AFAICS, unfortunately the VFS has an edge case where a dentry can be moved
>> >> without its parent's ->i_rwsem being held. It happens when a subdirectory is
>> >> "found" under multiple names. The VFS doesn't support directory hard links, so
>> >> if it finds a second link to a directory, it just moves the whole dentry tree to
>> >> the new location. This can happen if a filesystem image is corrupted and
>> >> contains directory hard links. Coincidentally, it can also happen in an
>> >> encrypted directory due to the no-key name => normal name transition...
>> >
>> > Sorry, I think I got this slightly wrong. The move does happen with the
>> > parent's ->i_rwsem held, but it's for read, not for write. First, before
>> > ->lookup is called, the ->i_rwsem of the parent directory is taken for read.
>> > ->lookup() calls d_splice_alias() which can call __d_unalias() which does the
>> > __d_move(). If the old alias is in a different directory (which cannot happen
>> > in that fscrypt case, but can happen in the general "directory hard links"
>> > case), __d_unalias() takes that directory's ->i_rwsem for read too.
>> >
>> > So it looks like the parent's ->i_rwsem does indeed exclude moves of child
>> > dentries, but only if it's taken for *write*. So I guess you can rely on that;
>> > it's just a bit more subtle than it first appears. Though, some of your
>> > explanation seems to assume that a read lock is sufficient ("In __lookup_slow,
>> > either the parent inode is locked by the caller (lookup_slow) ..."), so maybe
>> > there is still a problem.
>>
>> I think I'm missing something on your clarification. I see your point
>> about __d_unalias, and I see in the case where alias->d_parent !=
>> dentry->d_parent we acquire the parent inode read lock:
>>
>> static int __d_unalias(struct inode *inode,
>> struct dentry *dentry, struct dentry *alias)
>> {
>> ...
>> m1 = &dentry->d_sb->s_vfs_rename_mutex;
>> if (!inode_trylock_shared(alias->d_parent->d_inode))
>> goto out_err;
>> }
>> this __d_move Can do a dentry move and race with d_revalidate even
>> though it has the parent read lock.
>>
>> > So it looks like the parent's ->i_rwsem does indeed exclude moves of child
>> > dentries, but only if it's taken for *write*. So I guess you can rely on that;
>>
>> We can get away of it with acquiring the d_lock as you suggested, I
>> think. But can you clarify the above? I wanna make sure I didn't miss
>> anything. I am indeed relying only on the read lock here, as you can see.
>
> In my first email I thought that __d_move() can be called without the parent
> inode's i_rwsem held at all. In my second email I realized that it is always
> called with at least a read (shared) lock.
I see. Thank you. We are on the same page now. I was confused by
this part of your second comment:
>> > I guess you can rely on that; it's just a bit more subtle than it
>> > first appears. Though, some of your explanation seems to assume
>> > that a read lock is sufficient ("In __lookup_slow, either the
>> > parent inode is locked by the caller (lookup_slow) ..."),
...because I was then failing to see, after learning about the __d_move
case, how I could rely on the inode read lock. But, as I now realize,
__d_move is not called for negative dentries, so lookup_slow is indeed
safe.
> The question is what kind of parent i_rwsem lock is guaranteed to be held by the
> caller of ->d_revalidate() when the name comparison is done. Based on the
> above, it needs to be at least a write (exclusive) lock to exclude __d_move()
> without taking d_lock. However, your analysis (in the commit message of "libfs:
> Validate negative dentries in case-insensitive directories") only talks about
> i_rwsem being "taken", without saying whether it's for read or write. One thing
> you mentioned as taking i_rwsem is lookup_slow, but that only takes it for read.
> That seems like a problem, as it makes your analysis not correct.
My understanding and explanation was that a read lock should be enough
at all times, despite the __d_move case. Any time d_revalidate is
called for a (LOOKUP_CREATE | LOOKUP_RENAME_TARGET), it holds at least
the read lock, preventing concurrent changes to d_name of negative
dentries.
I will review the places that touch ->d_name again and I will keep the
patch as-is and update my explanation to include this case.
--
Gabriel Krisman Bertazi