2019-04-20 00:32:05

by Hou Tao

[permalink] [raw]
Subject: [PATCH] dcache: ensure d_flags & d_inode are consistent in lookup_fast()

After extending the size of dentry from 192-bytes to 208-bytes
under aarch64, we got oops during the running of xfstests generic/429:

Unable to handle kernel NULL pointer dereference at virtual address 0000000000000002
CPU: 3 PID: 2725 Comm: t_encrypted_d_r Tainted: G D 5.1.0-rc4
pc : inode_permission+0x28/0x160
lr : link_path_walk.part.11+0x27c/0x528
......
Call trace:
inode_permission+0x28/0x160
link_path_walk.part.11+0x27c/0x528
path_lookupat+0x64/0x208
filename_lookup+0xa0/0x178
user_path_at_empty+0x58/0x70
vfs_statx+0x94/0x118
__se_sys_newfstatat+0x58/0x98
__arm64_sys_newfstatat+0x24/0x30
el0_svc_common+0x7c/0x148
el0_svc_handler+0x38/0x88
el0_svc+0x8/0xc

If we revert the size extension of dentry, the oops will be gone.
However if we just move the d_inode field from the begin of dentry
struct to the end of dentry struct (poorly simulate the way how
__randomize_layout works), the oops will reoccur.

The following scenario illustrates the problem:

precondition:
* dentry A has just been unlinked and becomes a negative dentry
* dentry A is encrypted, so it has d_revalidate hook: fscrypt_d_revalidate()
* lookup process is looking A/file, and creation process is creating A

lookup process: creation process:

lookup_fast
__d_lookup_rcu returns dentry A

d_revalidate returns -ECHILD

d_revalidate again succeed
__d_set_inode_and_type
dentry->d_inode = inode
WRITE_ONCE(dentry->d_flags, flags)

d_is_negative(dentry) return false
follow_managed doesn't nothing
// inconsistent with d_flags
d_backing_inode() return NULL
nd->inode = NULL

may_lookup()
// oops occurs
inode_permission(nd->inode

The root cause is the inconsistency between d_flags & d_inode
during the REF-walk in lookup_fast(): d_is_negative(dentry)
returns false, but d_backing_inode() still returns a NULL pointer.

The RCU-walk path in lookup_fast() uses d_seq to ensure d_flags & d_inode
are consistent, and lookup_slow() use inode lock to ensure that, so only
the REF-walk path in lookup_fast() is problematic.

Fixing it by adding a paired smp_rmb/smp_wmb between the reading/writing
of d_inode & d_flags to ensure the consistency.

Signed-off-by: Hou Tao <[email protected]>
---
fs/dcache.c | 2 ++
fs/namei.c | 7 +++++++
2 files changed, 9 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index aac41adf4743..1eb85f9fcb0f 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -316,6 +316,8 @@ static inline void __d_set_inode_and_type(struct dentry *dentry,
unsigned flags;

dentry->d_inode = inode;
+ /* paired with smp_rmb() in lookup_fast() */
+ smp_wmb();
flags = READ_ONCE(dentry->d_flags);
flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
flags |= type_flags;
diff --git a/fs/namei.c b/fs/namei.c
index dede0147b3f6..833f760c70b2 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1628,6 +1628,13 @@ static int lookup_fast(struct nameidata *nd,
return -ENOENT;
}

+ /*
+ * Paired with smp_wmb() in __d_set_inode_and_type() to ensure
+ * d_backing_inode is not NULL after the checking of d_flags
+ * in d_is_negative() completes.
+ */
+ smp_rmb();
+
path->mnt = mnt;
path->dentry = dentry;
err = follow_managed(path, nd);
--
2.16.2.dirty


2019-04-22 15:39:25

by Hou Tao

[permalink] [raw]
Subject: Re: [PATCH] dcache: ensure d_flags & d_inode are consistent in lookup_fast()

ping ?

On 2019/4/19 16:48, Hou Tao wrote:
> After extending the size of dentry from 192-bytes to 208-bytes
> under aarch64, we got oops during the running of xfstests generic/429:
>
> Unable to handle kernel NULL pointer dereference at virtual address 0000000000000002
> CPU: 3 PID: 2725 Comm: t_encrypted_d_r Tainted: G D 5.1.0-rc4
> pc : inode_permission+0x28/0x160
> lr : link_path_walk.part.11+0x27c/0x528
> ......
> Call trace:
> inode_permission+0x28/0x160
> link_path_walk.part.11+0x27c/0x528
> path_lookupat+0x64/0x208
> filename_lookup+0xa0/0x178
> user_path_at_empty+0x58/0x70
> vfs_statx+0x94/0x118
> __se_sys_newfstatat+0x58/0x98
> __arm64_sys_newfstatat+0x24/0x30
> el0_svc_common+0x7c/0x148
> el0_svc_handler+0x38/0x88
> el0_svc+0x8/0xc
>
> If we revert the size extension of dentry, the oops will be gone.
> However if we just move the d_inode field from the begin of dentry
> struct to the end of dentry struct (poorly simulate the way how
> __randomize_layout works), the oops will reoccur.
>
> The following scenario illustrates the problem:
>
> precondition:
> * dentry A has just been unlinked and becomes a negative dentry
> * dentry A is encrypted, so it has d_revalidate hook: fscrypt_d_revalidate()
> * lookup process is looking A/file, and creation process is creating A
>
> lookup process: creation process:
>
> lookup_fast
> __d_lookup_rcu returns dentry A
>
> d_revalidate returns -ECHILD
>
> d_revalidate again succeed
> __d_set_inode_and_type
> dentry->d_inode = inode
> WRITE_ONCE(dentry->d_flags, flags)
>
> d_is_negative(dentry) return false
> follow_managed doesn't nothing
> // inconsistent with d_flags
> d_backing_inode() return NULL
> nd->inode = NULL
>
> may_lookup()
> // oops occurs
> inode_permission(nd->inode
>
> The root cause is the inconsistency between d_flags & d_inode
> during the REF-walk in lookup_fast(): d_is_negative(dentry)
> returns false, but d_backing_inode() still returns a NULL pointer.
>
> The RCU-walk path in lookup_fast() uses d_seq to ensure d_flags & d_inode
> are consistent, and lookup_slow() use inode lock to ensure that, so only
> the REF-walk path in lookup_fast() is problematic.
>
> Fixing it by adding a paired smp_rmb/smp_wmb between the reading/writing
> of d_inode & d_flags to ensure the consistency.
>
> Signed-off-by: Hou Tao <[email protected]>
> ---
> fs/dcache.c | 2 ++
> fs/namei.c | 7 +++++++
> 2 files changed, 9 insertions(+)
>
> diff --git a/fs/dcache.c b/fs/dcache.c
> index aac41adf4743..1eb85f9fcb0f 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -316,6 +316,8 @@ static inline void __d_set_inode_and_type(struct dentry *dentry,
> unsigned flags;
>
> dentry->d_inode = inode;
> + /* paired with smp_rmb() in lookup_fast() */
> + smp_wmb();
> flags = READ_ONCE(dentry->d_flags);
> flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
> flags |= type_flags;
> diff --git a/fs/namei.c b/fs/namei.c
> index dede0147b3f6..833f760c70b2 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -1628,6 +1628,13 @@ static int lookup_fast(struct nameidata *nd,
> return -ENOENT;
> }
>
> + /*
> + * Paired with smp_wmb() in __d_set_inode_and_type() to ensure
> + * d_backing_inode is not NULL after the checking of d_flags
> + * in d_is_negative() completes.
> + */
> + smp_rmb();
> +
> path->mnt = mnt;
> path->dentry = dentry;
> err = follow_managed(path, nd);
>

2019-04-22 15:57:04

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH] dcache: ensure d_flags & d_inode are consistent in lookup_fast()

On Fri, Apr 19, 2019 at 04:48:10PM +0800, Hou Tao wrote:

> The root cause is the inconsistency between d_flags & d_inode
> during the REF-walk in lookup_fast(): d_is_negative(dentry)
> returns false, but d_backing_inode() still returns a NULL pointer.
>
> The RCU-walk path in lookup_fast() uses d_seq to ensure d_flags & d_inode
> are consistent, and lookup_slow() use inode lock to ensure that, so only
> the REF-walk path in lookup_fast() is problematic.
>
> Fixing it by adding a paired smp_rmb/smp_wmb between the reading/writing
> of d_inode & d_flags to ensure the consistency.

The problem is real, but I'm not sure I like the proposed fix ;-/
We could simply use d_really_is_negative() there, avoiding all that
mess. If and when we get around to whiteouts-in-dcache (i.e. if
unionfs series gets resurrected), we can revisit that...