If the inode of an inline symlink is released (dropped) while a path walk
that is in rcu-walk mode is occuring the inode ->get_link() method can be
NULL when VFS dereferences it causing a crash. But, since the release can
occur at any time there's a small but finite possibility the link path
text could be freed while it's being used.
Changes since v1:
- don't bother trying to rcu-free the link path since there could
be side effects from the xfs reclaim code.
---
Ian Kent (2):
vfs: check dentry is still valid in get_link()
xfs: make sure link path does not go away at access
fs/xfs/xfs_iops.c | 3 +++
1 file changed, 3 insertions(+)
--
Ian
When following a trailing symlink in rcu-walk mode it's possible for
the dentry to become invalid between the last dentry seq lock check
and getting the link (eg. an unlink) leading to a backtrace similar
to this:
crash> bt
PID: 10964 TASK: ffff951c8aa92f80 CPU: 3 COMMAND: "TaniumCX"
…
#7 [ffffae44d0a6fbe0] page_fault at ffffffff8d6010fe
[exception RIP: unknown or invalid address]
RIP: 0000000000000000 RSP: ffffae44d0a6fc90 RFLAGS: 00010246
RAX: ffffffff8da3cc80 RBX: ffffae44d0a6fd30 RCX: 0000000000000000
RDX: ffffae44d0a6fd98 RSI: ffff951aa9af3008 RDI: 0000000000000000
RBP: 0000000000000000 R8: ffffae44d0a6fb94 R9: 0000000000000000
R10: ffff951c95d8c318 R11: 0000000000080000 R12: ffffae44d0a6fd98
R13: ffff951aa9af3008 R14: ffff951c8c9eb840 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#8 [ffffae44d0a6fc90] trailing_symlink at ffffffff8cf24e61
#9 [ffffae44d0a6fcc8] path_lookupat at ffffffff8cf261d1
#10 [ffffae44d0a6fd28] filename_lookup at ffffffff8cf2a700
#11 [ffffae44d0a6fe40] vfs_statx at ffffffff8cf1dbc4
#12 [ffffae44d0a6fe98] __do_sys_newstat at ffffffff8cf1e1f9
#13 [ffffae44d0a6ff38] do_syscall_64 at ffffffff8cc0420b
Most of the time this is not a problem because the inode is unchanged
while the rcu read lock is held.
But xfs can re-use inodes which can result in the inode ->get_link()
method becoming invalid (or NULL).
This case needs to be checked for in fs/namei.c:get_link() and if
detected the walk re-started.
Signed-off-by: Ian Kent <[email protected]>
---
0 files changed
diff --git a/fs/namei.c b/fs/namei.c
index 1946d9667790..9a48a6106516 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1760,8 +1760,11 @@ static const char *pick_link(struct nameidata *nd, struct path *link,
if (!res) {
const char * (*get)(struct dentry *, struct inode *,
struct delayed_call *);
- get = inode->i_op->get_link;
+ get = READ_ONCE(inode->i_op->get_link);
if (nd->flags & LOOKUP_RCU) {
+ /* Does the inode still match the associated dentry? */
+ if (unlikely(read_seqcount_retry(&link->dentry->d_seq, last->seq)))
+ return ERR_PTR(-ECHILD);
res = get(NULL, inode, &last->done);
if (res == ERR_PTR(-ECHILD) && try_to_unlazy(nd))
res = get(link->dentry, inode, &last->done);
When following an inline symlink in rcu-walk mode it's possible to
succeed in getting the ->get_link() method pointer but the link path
string be deallocated while it's being used.
This is becuase of the xfs inode reclaim mechanism. While rcu freeing
the link path can prevent it from being freed during use the inode
reclaim could assign a new value to the field at any time outside of
the path walk and result in an invalid link path pointer being
returned. Admittedly a very small race window but possible.
The best way to mitigate this risk is to return -ECHILD to the VFS
if the inline symlink method, ->get_link(), is called in rcu-walk mode
so the VFS can switch to ref-walk mode or redo the walk if the inode
has become invalid.
If it's discovered that staying in rcu-walk mode gives a worth while
performance improvement (unlikely) then the link path could be freed
under rcu once potential side effects of the xfs inode reclaim
sub-system have been analysed and dealt with if needed.
Signed-off-by: Ian Kent <[email protected]>
---
fs/xfs/xfs_iops.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index a607d6aca5c4..0a96183c5381 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -520,6 +520,9 @@ xfs_vn_get_link_inline(
struct xfs_inode *ip = XFS_I(inode);
char *link;
+ if (!dentry)
+ return ERR_PTR(-ECHILD);
+
ASSERT(ip->i_df.if_format == XFS_DINODE_FMT_LOCAL);
/*
On Mon, Nov 15, 2021 at 10:24:22AM +0800, Ian Kent wrote:
> When following a trailing symlink in rcu-walk mode it's possible for
> the dentry to become invalid between the last dentry seq lock check
> and getting the link (eg. an unlink) leading to a backtrace similar
> to this:
>
> crash> bt
> PID: 10964 TASK: ffff951c8aa92f80 CPU: 3 COMMAND: "TaniumCX"
> …
> #7 [ffffae44d0a6fbe0] page_fault at ffffffff8d6010fe
> [exception RIP: unknown or invalid address]
> RIP: 0000000000000000 RSP: ffffae44d0a6fc90 RFLAGS: 00010246
> RAX: ffffffff8da3cc80 RBX: ffffae44d0a6fd30 RCX: 0000000000000000
> RDX: ffffae44d0a6fd98 RSI: ffff951aa9af3008 RDI: 0000000000000000
> RBP: 0000000000000000 R8: ffffae44d0a6fb94 R9: 0000000000000000
> R10: ffff951c95d8c318 R11: 0000000000080000 R12: ffffae44d0a6fd98
> R13: ffff951aa9af3008 R14: ffff951c8c9eb840 R15: 0000000000000000
> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> #8 [ffffae44d0a6fc90] trailing_symlink at ffffffff8cf24e61
> #9 [ffffae44d0a6fcc8] path_lookupat at ffffffff8cf261d1
> #10 [ffffae44d0a6fd28] filename_lookup at ffffffff8cf2a700
> #11 [ffffae44d0a6fe40] vfs_statx at ffffffff8cf1dbc4
> #12 [ffffae44d0a6fe98] __do_sys_newstat at ffffffff8cf1e1f9
> #13 [ffffae44d0a6ff38] do_syscall_64 at ffffffff8cc0420b
>
> Most of the time this is not a problem because the inode is unchanged
> while the rcu read lock is held.
>
> But xfs can re-use inodes which can result in the inode ->get_link()
> method becoming invalid (or NULL).
>
> This case needs to be checked for in fs/namei.c:get_link() and if
> detected the walk re-started.
>
> Signed-off-by: Ian Kent <[email protected]>
> ---
Reviewed-by: Brian Foster <[email protected]>
> 0 files changed
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 1946d9667790..9a48a6106516 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -1760,8 +1760,11 @@ static const char *pick_link(struct nameidata *nd, struct path *link,
> if (!res) {
> const char * (*get)(struct dentry *, struct inode *,
> struct delayed_call *);
> - get = inode->i_op->get_link;
> + get = READ_ONCE(inode->i_op->get_link);
> if (nd->flags & LOOKUP_RCU) {
> + /* Does the inode still match the associated dentry? */
> + if (unlikely(read_seqcount_retry(&link->dentry->d_seq, last->seq)))
> + return ERR_PTR(-ECHILD);
> res = get(NULL, inode, &last->done);
> if (res == ERR_PTR(-ECHILD) && try_to_unlazy(nd))
> res = get(link->dentry, inode, &last->done);
>
>
On Mon, Nov 15, 2021 at 10:24:28AM +0800, Ian Kent wrote:
> When following an inline symlink in rcu-walk mode it's possible to
> succeed in getting the ->get_link() method pointer but the link path
> string be deallocated while it's being used.
>
> This is becuase of the xfs inode reclaim mechanism. While rcu freeing
> the link path can prevent it from being freed during use the inode
> reclaim could assign a new value to the field at any time outside of
> the path walk and result in an invalid link path pointer being
> returned. Admittedly a very small race window but possible.
>
> The best way to mitigate this risk is to return -ECHILD to the VFS
> if the inline symlink method, ->get_link(), is called in rcu-walk mode
> so the VFS can switch to ref-walk mode or redo the walk if the inode
> has become invalid.
>
> If it's discovered that staying in rcu-walk mode gives a worth while
> performance improvement (unlikely) then the link path could be freed
> under rcu once potential side effects of the xfs inode reclaim
> sub-system have been analysed and dealt with if needed.
>
> Signed-off-by: Ian Kent <[email protected]>
> ---
Reviewed-by: Brian Foster <[email protected]>
> fs/xfs/xfs_iops.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index a607d6aca5c4..0a96183c5381 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -520,6 +520,9 @@ xfs_vn_get_link_inline(
> struct xfs_inode *ip = XFS_I(inode);
> char *link;
>
> + if (!dentry)
> + return ERR_PTR(-ECHILD);
> +
> ASSERT(ip->i_df.if_format == XFS_DINODE_FMT_LOCAL);
>
> /*
>
>
On Mon, Nov 15, 2021 at 10:24:28AM +0800, Ian Kent wrote:
> When following an inline symlink in rcu-walk mode it's possible to
> succeed in getting the ->get_link() method pointer but the link path
> string be deallocated while it's being used.
>
> This is becuase of the xfs inode reclaim mechanism. While rcu freeing
> the link path can prevent it from being freed during use the inode
> reclaim could assign a new value to the field at any time outside of
> the path walk and result in an invalid link path pointer being
> returned. Admittedly a very small race window but possible.
>
> The best way to mitigate this risk is to return -ECHILD to the VFS
> if the inline symlink method, ->get_link(), is called in rcu-walk mode
> so the VFS can switch to ref-walk mode or redo the walk if the inode
> has become invalid.
>
> If it's discovered that staying in rcu-walk mode gives a worth while
> performance improvement (unlikely) then the link path could be freed
> under rcu once potential side effects of the xfs inode reclaim
> sub-system have been analysed and dealt with if needed.
>
> Signed-off-by: Ian Kent <[email protected]>
> ---
> fs/xfs/xfs_iops.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index a607d6aca5c4..0a96183c5381 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -520,6 +520,9 @@ xfs_vn_get_link_inline(
> struct xfs_inode *ip = XFS_I(inode);
> char *link;
>
> + if (!dentry)
> + return ERR_PTR(-ECHILD);
> +
> ASSERT(ip->i_df.if_format == XFS_DINODE_FMT_LOCAL);
NACK. As I just mentioned in the original thread, we can fix this
inode reuse within the RCU grace period problem realtively easily
without needing to turn off lockless pathwalk support for inline
symlinks.
Cheers,
Dave.
--
Dave Chinner
[email protected]