2022-10-18 02:44:54

by Ian Kent

[permalink] [raw]
Subject: [PATCH 0/2] kernfs: remove i_lock usage that isn't needed

In the patch series to change the global kernfs mutex to a read-write
semaphore the inode i_lock is used in several functions.

But after thinking about it the i_lock is in fact not needed.

Signed-off-by: Ian Kent <[email protected]>
---

Ian Kent (2):
kernfs: dont take i_lock on inode attr read
kernfs: dont take i_lock on revalidate


fs/kernfs/dir.c | 24 +++++++++++++++++-------
fs/kernfs/inode.c | 4 ----
2 files changed, 17 insertions(+), 11 deletions(-)

--
Ian


2022-10-18 02:51:40

by Ian Kent

[permalink] [raw]
Subject: [PATCH 1/2] kernfs: dont take i_lock on inode attr read

The kernfs write lock is held when the kernfs node inode attributes
are updated. Therefore, when either kernfs_iop_getattr() or
kernfs_iop_permission() are called the kernfs node inode attributes
won't change.

Consequently concurrent kernfs_refresh_inode() calls always copy the
same values from the kernfs node.

So there's no need to take the inode i_lock to get consistent values
for generic_fillattr() and generic_permission(), the kernfs read lock
is sufficient.

Signed-off-by: Ian Kent <[email protected]>
---
fs/kernfs/inode.c | 4 ----
1 file changed, 4 deletions(-)

diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 3d783d80f5da..74f3453f4639 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -190,10 +190,8 @@ int kernfs_iop_getattr(struct user_namespace *mnt_userns,
struct kernfs_root *root = kernfs_root(kn);

down_read(&root->kernfs_rwsem);
- spin_lock(&inode->i_lock);
kernfs_refresh_inode(kn, inode);
generic_fillattr(&init_user_ns, inode, stat);
- spin_unlock(&inode->i_lock);
up_read(&root->kernfs_rwsem);

return 0;
@@ -288,10 +286,8 @@ int kernfs_iop_permission(struct user_namespace *mnt_userns,
root = kernfs_root(kn);

down_read(&root->kernfs_rwsem);
- spin_lock(&inode->i_lock);
kernfs_refresh_inode(kn, inode);
ret = generic_permission(&init_user_ns, inode, mask);
- spin_unlock(&inode->i_lock);
up_read(&root->kernfs_rwsem);

return ret;


2022-10-18 03:06:52

by Ian Kent

[permalink] [raw]
Subject: [PATCH 2/2] kernfs: dont take i_lock on revalidate

In kernfs_dop_revalidate() when the passed in dentry is negative the
dentry directory is checked to see if it has changed and if so the
negative dentry is discarded so it can refreshed. During this check
the dentry inode i_lock is taken to mitigate against a possible
concurrent rename.

But if it's racing with a rename, becuase the dentry is negative, it
can't be the source it must be the target and it must be going to do
a d_move() otherwise the rename will return an error.

In this case the parent dentry of the target will not change, it will
be the same over the d_move(), only the source dentry parent may change
so the inode i_lock isn't needed.

Signed-off-by: Ian Kent <[email protected]>
---
fs/kernfs/dir.c | 24 +++++++++++++++++-------
1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 3990f3e270cb..6acd9c3d4cff 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1073,20 +1073,30 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)

/* If the kernfs parent node has changed discard and
* proceed to ->lookup.
+ *
+ * There's nothing special needed here when getting the
+ * dentry parent, even if a concurrent rename is in
+ * progress. That's because the dentry is negative so
+ * it can only be the target of the rename and it will
+ * be doing a d_move() not a replace. Consequently the
+ * dentry d_parent won't change over the d_move().
+ *
+ * Also kernfs negative dentries transitioning from
+ * negative to positive during revalidate won't happen
+ * because they are invalidated on containing directory
+ * changes and the lookup re-done so that a new positive
+ * dentry can be properly created.
*/
- spin_lock(&dentry->d_lock);
+ root = kernfs_root_from_sb(dentry->d_sb);
+ down_read(&root->kernfs_rwsem);
parent = kernfs_dentry_node(dentry->d_parent);
if (parent) {
- spin_unlock(&dentry->d_lock);
- root = kernfs_root(parent);
- down_read(&root->kernfs_rwsem);
if (kernfs_dir_changed(parent, dentry)) {
up_read(&root->kernfs_rwsem);
return 0;
}
- up_read(&root->kernfs_rwsem);
- } else
- spin_unlock(&dentry->d_lock);
+ }
+ up_read(&root->kernfs_rwsem);

/* The kernfs parent node hasn't changed, leave the
* dentry negative and return success.


2022-10-24 09:04:15

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 1/2] kernfs: dont take i_lock on inode attr read

On Tue, 18 Oct 2022 at 04:32, Ian Kent <[email protected]> wrote:
>
> The kernfs write lock is held when the kernfs node inode attributes
> are updated. Therefore, when either kernfs_iop_getattr() or
> kernfs_iop_permission() are called the kernfs node inode attributes
> won't change.
>
> Consequently concurrent kernfs_refresh_inode() calls always copy the
> same values from the kernfs node.
>
> So there's no need to take the inode i_lock to get consistent values
> for generic_fillattr() and generic_permission(), the kernfs read lock
> is sufficient.
>
> Signed-off-by: Ian Kent <[email protected]>

Reviewed-by: Miklos Szeredi <[email protected]>

2022-10-24 09:18:37

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 2/2] kernfs: dont take i_lock on revalidate

On Tue, 18 Oct 2022 at 04:32, Ian Kent <[email protected]> wrote:
>
> In kernfs_dop_revalidate() when the passed in dentry is negative the
> dentry directory is checked to see if it has changed and if so the
> negative dentry is discarded so it can refreshed. During this check
> the dentry inode i_lock is taken to mitigate against a possible
> concurrent rename.
>
> But if it's racing with a rename, becuase the dentry is negative, it
> can't be the source it must be the target and it must be going to do
> a d_move() otherwise the rename will return an error.
>
> In this case the parent dentry of the target will not change, it will
> be the same over the d_move(), only the source dentry parent may change
> so the inode i_lock isn't needed.
>
> Signed-off-by: Ian Kent <[email protected]>

Reviewed-by: Miklos Szeredi <[email protected]>

2022-10-31 22:39:44

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 1/2] kernfs: dont take i_lock on inode attr read

On Tue, Oct 18, 2022 at 10:32:42AM +0800, Ian Kent wrote:
> The kernfs write lock is held when the kernfs node inode attributes
> are updated. Therefore, when either kernfs_iop_getattr() or
> kernfs_iop_permission() are called the kernfs node inode attributes
> won't change.
>
> Consequently concurrent kernfs_refresh_inode() calls always copy the
> same values from the kernfs node.
>
> So there's no need to take the inode i_lock to get consistent values
> for generic_fillattr() and generic_permission(), the kernfs read lock
> is sufficient.
>
> Signed-off-by: Ian Kent <[email protected]>

Acked-by: Tejun Heo <[email protected]>

Thanks.

--
tejun

2022-10-31 22:50:07

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 2/2] kernfs: dont take i_lock on revalidate

On Tue, Oct 18, 2022 at 10:32:49AM +0800, Ian Kent wrote:
> In kernfs_dop_revalidate() when the passed in dentry is negative the
> dentry directory is checked to see if it has changed and if so the
> negative dentry is discarded so it can refreshed. During this check
> the dentry inode i_lock is taken to mitigate against a possible
> concurrent rename.
>
> But if it's racing with a rename, becuase the dentry is negative, it
> can't be the source it must be the target and it must be going to do
> a d_move() otherwise the rename will return an error.
>
> In this case the parent dentry of the target will not change, it will
> be the same over the d_move(), only the source dentry parent may change
> so the inode i_lock isn't needed.
>
> Signed-off-by: Ian Kent <[email protected]>

Acked-by: Tejun Heo <[email protected]>

Thanks.

--
tejun

2022-11-01 08:02:50

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH 2/2] kernfs: dont take i_lock on revalidate

On Tue, Oct 18, 2022 at 5:58 AM Ian Kent <[email protected]> wrote:
>
> In kernfs_dop_revalidate() when the passed in dentry is negative the
> dentry directory is checked to see if it has changed and if so the
> negative dentry is discarded so it can refreshed. During this check
> the dentry inode i_lock is taken to mitigate against a possible
> concurrent rename.
>
> But if it's racing with a rename, becuase the dentry is negative, it
> can't be the source it must be the target and it must be going to do
> a d_move() otherwise the rename will return an error.
>
> In this case the parent dentry of the target will not change, it will
> be the same over the d_move(), only the source dentry parent may change
> so the inode i_lock isn't needed.

You meant d_lock.
Same for the commit title.

Thanks,
Amir.

2022-11-01 08:46:09

by Ian Kent

[permalink] [raw]
Subject: Re: [PATCH 2/2] kernfs: dont take i_lock on revalidate


On 1/11/22 15:46, Amir Goldstein wrote:
> On Tue, Oct 18, 2022 at 5:58 AM Ian Kent <[email protected]> wrote:
>> In kernfs_dop_revalidate() when the passed in dentry is negative the
>> dentry directory is checked to see if it has changed and if so the
>> negative dentry is discarded so it can refreshed. During this check
>> the dentry inode i_lock is taken to mitigate against a possible
>> concurrent rename.
>>
>> But if it's racing with a rename, becuase the dentry is negative, it
>> can't be the source it must be the target and it must be going to do
>> a d_move() otherwise the rename will return an error.
>>
>> In this case the parent dentry of the target will not change, it will
>> be the same over the d_move(), only the source dentry parent may change
>> so the inode i_lock isn't needed.
> You meant d_lock.
> Same for the commit title.

Ha, well how do you like that, such an obvious mistake, how

did I not see it?


Not sure what to do about it now though ...

Any suggestions anyone?


Ian



2022-12-21 13:52:09

by Anders Roxell

[permalink] [raw]
Subject: Re: [PATCH 1/2] kernfs: dont take i_lock on inode attr read

On 2022-10-31 12:30, Tejun Heo wrote:
> On Tue, Oct 18, 2022 at 10:32:42AM +0800, Ian Kent wrote:
> > The kernfs write lock is held when the kernfs node inode attributes
> > are updated. Therefore, when either kernfs_iop_getattr() or
> > kernfs_iop_permission() are called the kernfs node inode attributes
> > won't change.
> >
> > Consequently concurrent kernfs_refresh_inode() calls always copy the
> > same values from the kernfs node.
> >
> > So there's no need to take the inode i_lock to get consistent values
> > for generic_fillattr() and generic_permission(), the kernfs read lock
> > is sufficient.
> >
> > Signed-off-by: Ian Kent <[email protected]>
>
> Acked-by: Tejun Heo <[email protected]>

Hi,

Building an allmodconfig arm64 kernel on yesterdays next-20221220 and
booting that in qemu I see the following "BUG: KCSAN: data-race in
set_nlink / set_nlink".


==================================================================
[ 1540.388669][ T123] BUG: KCSAN: data-race in set_nlink / set_nlink
[ 1540.392779][ T123]
[ 1540.394302][ T123] write to 0xffff00000adcc3e4 of 4 bytes by task 126 on cpu 0:
[ 1540.398828][ T123] set_nlink (/home/anders/src/kernel/next/fs/inode.c:371)
[ 1540.401609][ T123] kernfs_refresh_inode (/home/anders/src/kernel/next/fs/kernfs/inode.c:181)
[ 1540.404925][ T123] kernfs_iop_getattr (/home/anders/src/kernel/next/fs/kernfs/inode.c:194)
[ 1540.408088][ T123] vfs_getattr_nosec (/home/anders/src/kernel/next/fs/stat.c:133)
[ 1540.411334][ T123] vfs_statx (/home/anders/src/kernel/next/fs/stat.c:170)
[ 1540.414224][ T123] vfs_fstatat (/home/anders/src/kernel/next/fs/stat.c:276)
[ 1540.417166][ T123] __do_sys_newfstatat (/home/anders/src/kernel/next/fs/stat.c:446)
[ 1540.420539][ T123] __arm64_sys_newfstatat (/home/anders/src/kernel/next/fs/stat.c:440 /home/anders/src/kernel/next/fs/stat.c:440)
[ 1540.424003][ T123] el0_svc_common.constprop.0 (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:142)
[ 1540.427648][ T123] do_el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:197)
[ 1540.430378][ T123] el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:142 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:638)
[ 1540.433053][ T123] el0t_64_sync_handler (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:656)
[ 1540.436421][ T123] el0t_64_sync (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:584)
[ 1540.439303][ T123]
[ 1540.440828][ T123] 1 lock held by systemd-udevd/126:
[ 1540.444055][ T123] #0: ffff00000609b960 (&root->kernfs_rwsem){++++}-{3:3}, at: kernfs_iop_getattr (/home/anders/src/kernel/next/fs/kernfs/inode.c:193)
[ 1540.450699][ T123] irq event stamp: 185034
[ 1540.453373][ T123] hardirqs last enabled at (185034): seqcount_lockdep_reader_access (/home/anders/src/kernel/next/mm/page_alloc.c:5302)
[ 1540.460087][ T123] hardirqs last disabled at (185033): seqcount_lockdep_reader_access (/home/anders/src/kernel/next/include/linux/seqlock.h:103 (discriminator 4))
[ 1540.466686][ T123] softirqs last enabled at (185001): fpsimd_restore_current_state (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1780)
[ 1540.473310][ T123] softirqs last disabled at (184999): fpsimd_restore_current_state (/home/anders/src/kernel/next/include/linux/bottom_half.h:20 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1773)
[ 1540.479872][ T123]
[ 1540.481406][ T123] read to 0xffff00000adcc3e4 of 4 bytes by task 123 on cpu 0:
[ 1540.485893][ T123] set_nlink (/home/anders/src/kernel/next/fs/inode.c:368)
[ 1540.488663][ T123] kernfs_refresh_inode (/home/anders/src/kernel/next/fs/kernfs/inode.c:181)
[ 1540.491961][ T123] kernfs_iop_permission (/home/anders/src/kernel/next/fs/kernfs/inode.c:290)
[ 1540.495260][ T123] inode_permission (/home/anders/src/kernel/next/fs/namei.c:458 /home/anders/src/kernel/next/fs/namei.c:525)
[ 1540.498450][ T123] link_path_walk (/home/anders/src/kernel/next/fs/namei.c:1715 /home/anders/src/kernel/next/fs/namei.c:2262)
[ 1540.501552][ T123] path_lookupat (/home/anders/src/kernel/next/fs/namei.c:2473 (discriminator 2))
[ 1540.504592][ T123] filename_lookup (/home/anders/src/kernel/next/fs/namei.c:2503)
[ 1540.507740][ T123] user_path_at_empty (/home/anders/src/kernel/next/fs/namei.c:2876)
[ 1540.511010][ T123] do_readlinkat (/home/anders/src/kernel/next/fs/stat.c:477)
[ 1540.514097][ T123] __arm64_sys_readlinkat (/home/anders/src/kernel/next/fs/stat.c:504 /home/anders/src/kernel/next/fs/stat.c:501 /home/anders/src/kernel/next/fs/stat.c:501)
[ 1540.517598][ T123] el0_svc_common.constprop.0 (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:142)
[ 1540.521319][ T123] do_el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:197)
[ 1540.524125][ T123] el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:142 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:638)
[ 1540.526795][ T123] el0t_64_sync_handler (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:656)
[ 1540.530224][ T123] el0t_64_sync (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:584)
[ 1540.533176][ T123]
[ 1540.534710][ T123] 1 lock held by systemd-udevd/123:
[ 1540.537977][ T123] #0: ffff00000609b960 (&root->kernfs_rwsem){++++}-{3:3}, at: kernfs_iop_permission (/home/anders/src/kernel/next/fs/kernfs/inode.c:289)
[ 1540.544892][ T123] irq event stamp: 216564
[ 1540.547603][ T123] hardirqs last enabled at (216564): seqcount_lockdep_reader_access (/home/anders/src/kernel/next/mm/page_alloc.c:5302)
[ 1540.554368][ T123] hardirqs last disabled at (216563): seqcount_lockdep_reader_access (/home/anders/src/kernel/next/include/linux/seqlock.h:103 (discriminator 4))
[ 1540.561107][ T123] softirqs last enabled at (216533): fpsimd_restore_current_state (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1780)
[ 1540.567833][ T123] softirqs last disabled at (216531): fpsimd_restore_current_state (/home/anders/src/kernel/next/include/linux/bottom_half.h:20 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1773)
[ 1540.574496][ T123]
[ 1540.576050][ T123] Reported by Kernel Concurrency Sanitizer on:
[ 1540.587925][ T123] Hardware name: linux,dummy-virt (DT)
[ 1540.591282][ T123] ==================================================================


Reverting this patch I don't see the data race anymore.

Cheers,
Anders

2022-12-22 23:33:39

by Ian Kent

[permalink] [raw]
Subject: Re: [PATCH 1/2] kernfs: dont take i_lock on inode attr read


On 21/12/22 21:34, Anders Roxell wrote:
> On 2022-10-31 12:30, Tejun Heo wrote:
>> On Tue, Oct 18, 2022 at 10:32:42AM +0800, Ian Kent wrote:
>>> The kernfs write lock is held when the kernfs node inode attributes
>>> are updated. Therefore, when either kernfs_iop_getattr() or
>>> kernfs_iop_permission() are called the kernfs node inode attributes
>>> won't change.
>>>
>>> Consequently concurrent kernfs_refresh_inode() calls always copy the
>>> same values from the kernfs node.
>>>
>>> So there's no need to take the inode i_lock to get consistent values
>>> for generic_fillattr() and generic_permission(), the kernfs read lock
>>> is sufficient.
>>>
>>> Signed-off-by: Ian Kent <[email protected]>
>> Acked-by: Tejun Heo <[email protected]>
> Hi,
>
> Building an allmodconfig arm64 kernel on yesterdays next-20221220 and
> booting that in qemu I see the following "BUG: KCSAN: data-race in
> set_nlink / set_nlink".


I'll check if I missed any places where set_link() could be

called where the link count could be different.


If there aren't any the question will then be can writing the

same value to this location in multiple concurrent threads

corrupt it?


Ian

>
>
> ==================================================================
> [ 1540.388669][ T123] BUG: KCSAN: data-race in set_nlink / set_nlink
> [ 1540.392779][ T123]
> [ 1540.394302][ T123] write to 0xffff00000adcc3e4 of 4 bytes by task 126 on cpu 0:
> [ 1540.398828][ T123] set_nlink (/home/anders/src/kernel/next/fs/inode.c:371)
> [ 1540.401609][ T123] kernfs_refresh_inode (/home/anders/src/kernel/next/fs/kernfs/inode.c:181)
> [ 1540.404925][ T123] kernfs_iop_getattr (/home/anders/src/kernel/next/fs/kernfs/inode.c:194)
> [ 1540.408088][ T123] vfs_getattr_nosec (/home/anders/src/kernel/next/fs/stat.c:133)
> [ 1540.411334][ T123] vfs_statx (/home/anders/src/kernel/next/fs/stat.c:170)
> [ 1540.414224][ T123] vfs_fstatat (/home/anders/src/kernel/next/fs/stat.c:276)
> [ 1540.417166][ T123] __do_sys_newfstatat (/home/anders/src/kernel/next/fs/stat.c:446)
> [ 1540.420539][ T123] __arm64_sys_newfstatat (/home/anders/src/kernel/next/fs/stat.c:440 /home/anders/src/kernel/next/fs/stat.c:440)
> [ 1540.424003][ T123] el0_svc_common.constprop.0 (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:142)
> [ 1540.427648][ T123] do_el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:197)
> [ 1540.430378][ T123] el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:142 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:638)
> [ 1540.433053][ T123] el0t_64_sync_handler (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:656)
> [ 1540.436421][ T123] el0t_64_sync (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:584)
> [ 1540.439303][ T123]
> [ 1540.440828][ T123] 1 lock held by systemd-udevd/126:
> [ 1540.444055][ T123] #0: ffff00000609b960 (&root->kernfs_rwsem){++++}-{3:3}, at: kernfs_iop_getattr (/home/anders/src/kernel/next/fs/kernfs/inode.c:193)
> [ 1540.450699][ T123] irq event stamp: 185034
> [ 1540.453373][ T123] hardirqs last enabled at (185034): seqcount_lockdep_reader_access (/home/anders/src/kernel/next/mm/page_alloc.c:5302)
> [ 1540.460087][ T123] hardirqs last disabled at (185033): seqcount_lockdep_reader_access (/home/anders/src/kernel/next/include/linux/seqlock.h:103 (discriminator 4))
> [ 1540.466686][ T123] softirqs last enabled at (185001): fpsimd_restore_current_state (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1780)
> [ 1540.473310][ T123] softirqs last disabled at (184999): fpsimd_restore_current_state (/home/anders/src/kernel/next/include/linux/bottom_half.h:20 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1773)
> [ 1540.479872][ T123]
> [ 1540.481406][ T123] read to 0xffff00000adcc3e4 of 4 bytes by task 123 on cpu 0:
> [ 1540.485893][ T123] set_nlink (/home/anders/src/kernel/next/fs/inode.c:368)
> [ 1540.488663][ T123] kernfs_refresh_inode (/home/anders/src/kernel/next/fs/kernfs/inode.c:181)
> [ 1540.491961][ T123] kernfs_iop_permission (/home/anders/src/kernel/next/fs/kernfs/inode.c:290)
> [ 1540.495260][ T123] inode_permission (/home/anders/src/kernel/next/fs/namei.c:458 /home/anders/src/kernel/next/fs/namei.c:525)
> [ 1540.498450][ T123] link_path_walk (/home/anders/src/kernel/next/fs/namei.c:1715 /home/anders/src/kernel/next/fs/namei.c:2262)
> [ 1540.501552][ T123] path_lookupat (/home/anders/src/kernel/next/fs/namei.c:2473 (discriminator 2))
> [ 1540.504592][ T123] filename_lookup (/home/anders/src/kernel/next/fs/namei.c:2503)
> [ 1540.507740][ T123] user_path_at_empty (/home/anders/src/kernel/next/fs/namei.c:2876)
> [ 1540.511010][ T123] do_readlinkat (/home/anders/src/kernel/next/fs/stat.c:477)
> [ 1540.514097][ T123] __arm64_sys_readlinkat (/home/anders/src/kernel/next/fs/stat.c:504 /home/anders/src/kernel/next/fs/stat.c:501 /home/anders/src/kernel/next/fs/stat.c:501)
> [ 1540.517598][ T123] el0_svc_common.constprop.0 (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:142)
> [ 1540.521319][ T123] do_el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:197)
> [ 1540.524125][ T123] el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:142 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:638)
> [ 1540.526795][ T123] el0t_64_sync_handler (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:656)
> [ 1540.530224][ T123] el0t_64_sync (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:584)
> [ 1540.533176][ T123]
> [ 1540.534710][ T123] 1 lock held by systemd-udevd/123:
> [ 1540.537977][ T123] #0: ffff00000609b960 (&root->kernfs_rwsem){++++}-{3:3}, at: kernfs_iop_permission (/home/anders/src/kernel/next/fs/kernfs/inode.c:289)
> [ 1540.544892][ T123] irq event stamp: 216564
> [ 1540.547603][ T123] hardirqs last enabled at (216564): seqcount_lockdep_reader_access (/home/anders/src/kernel/next/mm/page_alloc.c:5302)
> [ 1540.554368][ T123] hardirqs last disabled at (216563): seqcount_lockdep_reader_access (/home/anders/src/kernel/next/include/linux/seqlock.h:103 (discriminator 4))
> [ 1540.561107][ T123] softirqs last enabled at (216533): fpsimd_restore_current_state (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1780)
> [ 1540.567833][ T123] softirqs last disabled at (216531): fpsimd_restore_current_state (/home/anders/src/kernel/next/include/linux/bottom_half.h:20 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1773)
> [ 1540.574496][ T123]
> [ 1540.576050][ T123] Reported by Kernel Concurrency Sanitizer on:
> [ 1540.587925][ T123] Hardware name: linux,dummy-virt (DT)
> [ 1540.591282][ T123] ==================================================================
>
>
> Reverting this patch I don't see the data race anymore.
>
> Cheers,
> Anders

2022-12-29 09:42:46

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 1/2] kernfs: dont take i_lock on inode attr read

On Fri, Dec 23, 2022, at 00:11, Ian Kent wrote:
> On 21/12/22 21:34, Anders Roxell wrote:
>> On 2022-10-31 12:30, Tejun Heo wrote:
>>> On Tue, Oct 18, 2022 at 10:32:42AM +0800, Ian Kent wrote:
>>>> The kernfs write lock is held when the kernfs node inode attributes
>>>> are updated. Therefore, when either kernfs_iop_getattr() or
>>>> kernfs_iop_permission() are called the kernfs node inode attributes
>>>> won't change.
>>>>
>>>> Consequently concurrent kernfs_refresh_inode() calls always copy the
>>>> same values from the kernfs node.
>>>>
>>>> So there's no need to take the inode i_lock to get consistent values
>>>> for generic_fillattr() and generic_permission(), the kernfs read lock
>>>> is sufficient.
>>>>
>>>> Signed-off-by: Ian Kent <[email protected]>
>>> Acked-by: Tejun Heo <[email protected]>
>> Hi,
>>
>> Building an allmodconfig arm64 kernel on yesterdays next-20221220 and
>> booting that in qemu I see the following "BUG: KCSAN: data-race in
>> set_nlink / set_nlink".
>
>
> I'll check if I missed any places where set_link() could be
> called where the link count could be different.
>
>
> If there aren't any the question will then be can writing the
> same value to this location in multiple concurrent threads
> corrupt it?

I think the race that is getting reported for set_nlink()
is about this bit getting called simulatenously on multiple
CPUs with only the read lock held for the inode:

/* Yes, some filesystems do change nlink from zero to one */
if (inode->i_nlink == 0)
atomic_long_dec(&inode->i_sb->s_remove_count);
inode->__i_nlink = nlink;

Since i_nlink and __i_nlink refer to the same memory location,
the 'inode->i_nlink == 0' check can be true for all of them
before the nonzero nlink value gets set, and this results in
s_remove_count being decremented more than once.

Arnd

2022-12-29 13:28:12

by Ian Kent

[permalink] [raw]
Subject: Re: [PATCH 1/2] kernfs: dont take i_lock on inode attr read


On 29/12/22 17:20, Arnd Bergmann wrote:
> On Fri, Dec 23, 2022, at 00:11, Ian Kent wrote:
>> On 21/12/22 21:34, Anders Roxell wrote:
>>> On 2022-10-31 12:30, Tejun Heo wrote:
>>>> On Tue, Oct 18, 2022 at 10:32:42AM +0800, Ian Kent wrote:
>>>>> The kernfs write lock is held when the kernfs node inode attributes
>>>>> are updated. Therefore, when either kernfs_iop_getattr() or
>>>>> kernfs_iop_permission() are called the kernfs node inode attributes
>>>>> won't change.
>>>>>
>>>>> Consequently concurrent kernfs_refresh_inode() calls always copy the
>>>>> same values from the kernfs node.
>>>>>
>>>>> So there's no need to take the inode i_lock to get consistent values
>>>>> for generic_fillattr() and generic_permission(), the kernfs read lock
>>>>> is sufficient.
>>>>>
>>>>> Signed-off-by: Ian Kent <[email protected]>
>>>> Acked-by: Tejun Heo <[email protected]>
>>> Hi,
>>>
>>> Building an allmodconfig arm64 kernel on yesterdays next-20221220 and
>>> booting that in qemu I see the following "BUG: KCSAN: data-race in
>>> set_nlink / set_nlink".
>>
>> I'll check if I missed any places where set_link() could be
>> called where the link count could be different.
>>
>>
>> If there aren't any the question will then be can writing the
>> same value to this location in multiple concurrent threads
>> corrupt it?
> I think the race that is getting reported for set_nlink()
> is about this bit getting called simulatenously on multiple
> CPUs with only the read lock held for the inode:
>
> /* Yes, some filesystems do change nlink from zero to one */
> if (inode->i_nlink == 0)
> atomic_long_dec(&inode->i_sb->s_remove_count);
> inode->__i_nlink = nlink;
>
> Since i_nlink and __i_nlink refer to the same memory location,
> the 'inode->i_nlink == 0' check can be true for all of them
> before the nonzero nlink value gets set, and this results in
> s_remove_count being decremented more than once.


Thanks for the comment Arnd.


I'll certainly have a close look at that.

It will be a little while though, given the season, ;).


Ian

2023-01-23 03:12:04

by Ian Kent

[permalink] [raw]
Subject: Re: [PATCH 1/2] kernfs: dont take i_lock on inode attr read


On 29/12/22 21:07, Ian Kent wrote:
>
> On 29/12/22 17:20, Arnd Bergmann wrote:
>> On Fri, Dec 23, 2022, at 00:11, Ian Kent wrote:
>>> On 21/12/22 21:34, Anders Roxell wrote:
>>>> On 2022-10-31 12:30, Tejun Heo wrote:
>>>>> On Tue, Oct 18, 2022 at 10:32:42AM +0800, Ian Kent wrote:
>>>>>> The kernfs write lock is held when the kernfs node inode attributes
>>>>>> are updated. Therefore, when either kernfs_iop_getattr() or
>>>>>> kernfs_iop_permission() are called the kernfs node inode attributes
>>>>>> won't change.
>>>>>>
>>>>>> Consequently concurrent kernfs_refresh_inode() calls always copy the
>>>>>> same values from the kernfs node.
>>>>>>
>>>>>> So there's no need to take the inode i_lock to get consistent values
>>>>>> for generic_fillattr() and generic_permission(), the kernfs read
>>>>>> lock
>>>>>> is sufficient.
>>>>>>
>>>>>> Signed-off-by: Ian Kent <[email protected]>
>>>>> Acked-by: Tejun Heo <[email protected]>
>>>> Hi,
>>>>
>>>> Building an allmodconfig arm64 kernel on yesterdays next-20221220 and
>>>> booting that in qemu I see the following "BUG: KCSAN: data-race in
>>>> set_nlink / set_nlink".
>>>
>>> I'll check if I missed any places where set_link() could be
>>> called where the link count could be different.
>>>
>>>
>>> If there aren't any the question will then be can writing the
>>> same value to this location in multiple concurrent threads
>>> corrupt it?
>> I think the race that is getting reported for set_nlink()
>> is about this bit getting called simulatenously on multiple
>> CPUs with only the read lock held for the inode:
>>
>>       /* Yes, some filesystems do change nlink from zero to one */
>>       if (inode->i_nlink == 0)
>> atomic_long_dec(&inode->i_sb->s_remove_count);
>>       inode->__i_nlink = nlink;
>>
>> Since i_nlink and __i_nlink refer to the same memory location,
>> the 'inode->i_nlink == 0' check can be true for all of them
>> before the nonzero nlink value gets set, and this results in
>> s_remove_count being decremented more than once.
>
>
> Thanks for the comment Arnd.


Hello all,


I've been looking at this and after consulting Miklos and his pointing

out that it looks like a false positive the urgency dropped off a bit. So

apologies for taking so long to report back.


Anyway it needs some description of conclusions reached so far.


I'm still looking around but in short, kernfs will set directories to <# of

directory entries> + 2 unconditionally for directories. I can't yet find

any other places where i_nlink is set or changed and if there are none

then i_nlink will never be set to zero so the race should not occur.


Consequently my claim is this is a real false positive.


There are the file system operations that may be passed at mount time

but given the way kernfs sets i_nlink it pretty much dictates those
operations

(if there were any that modify it and there don't appear to be any) leave it

alone.


So it just doesn't make sense for users of kernfs to fiddle with i_nlink ...


Ian






2023-07-18 19:25:45

by Anders Roxell

[permalink] [raw]
Subject: Re: [PATCH 1/2] kernfs: dont take i_lock on inode attr read

On 2023-01-23 11:11, Ian Kent wrote:
>
> On 29/12/22 21:07, Ian Kent wrote:
> >
> > On 29/12/22 17:20, Arnd Bergmann wrote:
> > > On Fri, Dec 23, 2022, at 00:11, Ian Kent wrote:
> > > > On 21/12/22 21:34, Anders Roxell wrote:
> > > > > On 2022-10-31 12:30, Tejun Heo wrote:
> > > > > > On Tue, Oct 18, 2022 at 10:32:42AM +0800, Ian Kent wrote:
> > > > > > > The kernfs write lock is held when the kernfs node inode attributes
> > > > > > > are updated. Therefore, when either kernfs_iop_getattr() or
> > > > > > > kernfs_iop_permission() are called the kernfs node inode attributes
> > > > > > > won't change.
> > > > > > >
> > > > > > > Consequently concurrent kernfs_refresh_inode() calls always copy the
> > > > > > > same values from the kernfs node.
> > > > > > >
> > > > > > > So there's no need to take the inode i_lock to get consistent values
> > > > > > > for generic_fillattr() and generic_permission(), the
> > > > > > > kernfs read lock
> > > > > > > is sufficient.
> > > > > > >
> > > > > > > Signed-off-by: Ian Kent <[email protected]>
> > > > > > Acked-by: Tejun Heo <[email protected]>
> > > > > Hi,
> > > > >
> > > > > Building an allmodconfig arm64 kernel on yesterdays next-20221220 and
> > > > > booting that in qemu I see the following "BUG: KCSAN: data-race in
> > > > > set_nlink / set_nlink".
> > > >
> > > > I'll check if I missed any places where set_link() could be
> > > > called where the link count could be different.
> > > >
> > > >
> > > > If there aren't any the question will then be can writing the
> > > > same value to this location in multiple concurrent threads
> > > > corrupt it?
> > > I think the race that is getting reported for set_nlink()
> > > is about this bit getting called simulatenously on multiple
> > > CPUs with only the read lock held for the inode:
> > >
> > >       /* Yes, some filesystems do change nlink from zero to one */
> > >       if (inode->i_nlink == 0)
> > > atomic_long_dec(&inode->i_sb->s_remove_count);
> > >       inode->__i_nlink = nlink;
> > >
> > > Since i_nlink and __i_nlink refer to the same memory location,
> > > the 'inode->i_nlink == 0' check can be true for all of them
> > > before the nonzero nlink value gets set, and this results in
> > > s_remove_count being decremented more than once.
> >
> >
> > Thanks for the comment Arnd.
>
>
> Hello all,
>
>
> I've been looking at this and after consulting Miklos and his pointing
>
> out that it looks like a false positive the urgency dropped off a bit. So
>
> apologies for taking so long to report back.
>
>
> Anyway it needs some description of conclusions reached so far.
>
>
> I'm still looking around but in short, kernfs will set directories to <# of
>
> directory entries> + 2 unconditionally for directories. I can't yet find
>
> any other places where i_nlink is set or changed and if there are none
>
> then i_nlink will never be set to zero so the race should not occur.
>
>
> Consequently my claim is this is a real false positive.
>
>
> There are the file system operations that may be passed at mount time
>
> but given the way kernfs sets i_nlink it pretty much dictates those
> operations
>
> (if there were any that modify it and there don't appear to be any) leave it
>
> alone.
>
>
> So it just doesn't make sense for users of kernfs to fiddle with i_nlink ...

On todays next tag, next-20230718 this KCSAN BUG poped up again. When I
built an allmodconfig arm64 kernel and booted it in QEMU. Full log can
be found http://ix.io/4AUd

[ 1694.987789][ T137] BUG: KCSAN: data-race in inode_permission / kernfs_refresh_inode
[ 1694.992912][ T137]
[ 1694.994532][ T137] write to 0xffff00000bab6070 of 2 bytes by task 104 on cpu 0:
[ 1694.999269][ T137] kernfs_refresh_inode (/home/anders/src/kernel/next/fs/kernfs/inode.c:171)
[ 1695.002707][ T137] kernfs_iop_permission (/home/anders/src/kernel/next/fs/kernfs/inode.c:289)
[ 1695.006148][ T137] inode_permission (/home/anders/src/kernel/next/fs/namei.c:461 /home/anders/src/kernel/next/fs/namei.c:528)
[ 1695.009420][ T137] link_path_walk (/home/anders/src/kernel/next/fs/namei.c:1720 /home/anders/src/kernel/next/fs/namei.c:2267)
[ 1695.012643][ T137] path_lookupat (/home/anders/src/kernel/next/fs/namei.c:2478 (discriminator 2))
[ 1695.015781][ T137] filename_lookup (/home/anders/src/kernel/next/fs/namei.c:2508)
[ 1695.019059][ T137] vfs_statx (/home/anders/src/kernel/next/fs/stat.c:238)
[ 1695.022024][ T137] vfs_fstatat (/home/anders/src/kernel/next/fs/stat.c:276)
[ 1695.025067][ T137] __do_sys_newfstatat (/home/anders/src/kernel/next/fs/stat.c:446)
[ 1695.028497][ T137] __arm64_sys_newfstatat (/home/anders/src/kernel/next/fs/stat.c:440 /home/anders/src/kernel/next/fs/stat.c:440)
[ 1695.032080][ T137] el0_svc_common.constprop.0 (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:139)
[ 1695.035916][ T137] do_el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:188)
[ 1695.038796][ T137] el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:144 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:648)
[ 1695.041468][ T137] el0t_64_sync_handler (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:666)
[ 1695.044889][ T137] el0t_64_sync (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:591)
[ 1695.047904][ T137]
[ 1695.049511][ T137] 1 lock held by systemd-udevd/104:
[ 1695.052837][ T137] #0: ffff000006681e08 (&root->kernfs_iattr_rwsem){++++}-{3:3}, at: kernfs_iop_permission (/home/anders/src/kernel/next/fs/kernfs/inode.c:288)
[ 1695.060241][ T137] irq event stamp: 82902
[ 1695.063006][ T137] hardirqs last enabled at (82901): _raw_spin_unlock_irqrestore (/home/anders/src/kernel/next/arch/arm64/include/asm/alternative-macros.h:250 /home/anders/src/kernel/next/arch/arm64/include/asm/irqflags.h:27 /home/anders/src/kernel/next/arch/arm64/include/asm/irqflags.h:140 /home/anders/src/kernel/next/include/linux/spinlock_api_smp.h:151 /home/anders/src/kernel/next/kernel/locking/spinlock.c:194)
[ 1695.069673][ T137] hardirqs last disabled at (82902): el1_interrupt (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:472 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:488)
[ 1695.075474][ T137] softirqs last enabled at (82792): fpsimd_restore_current_state (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1791)
[ 1695.082319][ T137] softirqs last disabled at (82790): fpsimd_restore_current_state (/home/anders/src/kernel/next/include/linux/bottom_half.h:20 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1784)
[ 1695.089049][ T137]
[ 1695.090659][ T137] read to 0xffff00000bab6070 of 2 bytes by task 137 on cpu 0:
[ 1695.095374][ T137] inode_permission (/home/anders/src/kernel/next/fs/namei.c:532)
[ 1695.098655][ T137] link_path_walk (/home/anders/src/kernel/next/fs/namei.c:1720 /home/anders/src/kernel/next/fs/namei.c:2267)
[ 1695.101857][ T137] path_openat (/home/anders/src/kernel/next/fs/namei.c:3789 (discriminator 2))
[ 1695.104885][ T137] do_filp_open (/home/anders/src/kernel/next/fs/namei.c:3820)
[ 1695.108006][ T137] do_sys_openat2 (/home/anders/src/kernel/next/fs/open.c:1418)
[ 1695.111290][ T137] __arm64_sys_openat (/home/anders/src/kernel/next/fs/open.c:1433)
[ 1695.114825][ T137] el0_svc_common.constprop.0 (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:139)
[ 1695.118662][ T137] do_el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:188)
[ 1695.121555][ T137] el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:144 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:648)
[ 1695.124207][ T137] el0t_64_sync_handler (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:666)
[ 1695.127590][ T137] el0t_64_sync (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:591)
[ 1695.130641][ T137]
[ 1695.132241][ T137] no locks held by systemd-udevd/137.
[ 1695.135618][ T137] irq event stamp: 3246
[ 1695.138519][ T137] hardirqs last enabled at (3245): seqcount_lockdep_reader_access (/home/anders/src/kernel/next/include/linux/seqlock.h:105)
[ 1695.145825][ T137] hardirqs last disabled at (3246): el1_interrupt (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:472 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:488)
[ 1695.151942][ T137] softirqs last enabled at (3208): fpsimd_restore_current_state (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1791)
[ 1695.158950][ T137] softirqs last disabled at (3206): fpsimd_restore_current_state (/home/anders/src/kernel/next/include/linux/bottom_half.h:20 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1784)
[ 1695.166036][ T137]
[ 1695.167621][ T137] Reported by Kernel Concurrency Sanitizer on:
[ 1695.179990][ T137] Hardware name: linux,dummy-virt (DT)
[ 1695.183687][ T137] ==================================================================

[...]

[ 1738.053819][ T104] BUG: KCSAN: data-race in set_nlink / set_nlink
[ 1738.058223][ T104]
[ 1738.059865][ T104] read to 0xffff00000bab6918 of 4 bytes by task 108 on cpu 0:
[ 1738.064916][ T104] set_nlink (/home/anders/src/kernel/next/fs/inode.c:369)
[ 1738.067845][ T104] kernfs_refresh_inode (/home/anders/src/kernel/next/fs/kernfs/inode.c:180)
[ 1738.071607][ T104] kernfs_iop_permission (/home/anders/src/kernel/next/fs/kernfs/inode.c:289)
[ 1738.075467][ T104] inode_permission (/home/anders/src/kernel/next/fs/namei.c:461 /home/anders/src/kernel/next/fs/namei.c:528)
[ 1738.078868][ T104] link_path_walk (/home/anders/src/kernel/next/fs/namei.c:1720 /home/anders/src/kernel/next/fs/namei.c:2267)
[ 1738.082270][ T104] path_lookupat (/home/anders/src/kernel/next/fs/namei.c:2478 (discriminator 2))
[ 1738.085488][ T104] filename_lookup (/home/anders/src/kernel/next/fs/namei.c:2508)
[ 1738.089101][ T104] user_path_at_empty (/home/anders/src/kernel/next/fs/namei.c:2907)
[ 1738.092469][ T104] do_readlinkat (/home/anders/src/kernel/next/fs/stat.c:477)
[ 1738.095970][ T104] __arm64_sys_readlinkat (/home/anders/src/kernel/next/fs/stat.c:504 /home/anders/src/kernel/next/fs/stat.c:501 /home/anders/src/kernel/next/fs/stat.c:501)
[ 1738.099529][ T104] el0_svc_common.constprop.0 (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:139)
[ 1738.103696][ T104] do_el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:188)
[ 1738.106560][ T104] el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:144 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:648)
[ 1738.109613][ T104] el0t_64_sync_handler (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:666)
[ 1738.113035][ T104] el0t_64_sync (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:591)
[ 1738.116346][ T104]
[ 1738.117924][ T104] 1 lock held by systemd-udevd/108:
[ 1738.121580][ T104] #0: ffff000006681e08 (&root->kernfs_iattr_rwsem){++++}-{3:3}, at: kernfs_iop_permission (/home/anders/src/kernel/next/fs/kernfs/inode.c:288)
[ 1738.129355][ T104] irq event stamp: 31000
[ 1738.132088][ T104] hardirqs last enabled at (31000): seqcount_lockdep_reader_access (/home/anders/src/kernel/next/include/linux/seqlock.h:105)
[ 1738.139417][ T104] hardirqs last disabled at (30999): seqcount_lockdep_reader_access (/home/anders/src/kernel/next/include/linux/seqlock.h:104)
[ 1738.146781][ T104] softirqs last enabled at (30973): fpsimd_restore_current_state (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1791)
[ 1738.153891][ T104] softirqs last disabled at (30971): fpsimd_restore_current_state (/home/anders/src/kernel/next/include/linux/bottom_half.h:20 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1784)
[ 1738.161012][ T104]
[ 1738.162663][ T104] write to 0xffff00000bab6918 of 4 bytes by task 104 on cpu 0:
[ 1738.167730][ T104] set_nlink (/home/anders/src/kernel/next/fs/inode.c:372)
[ 1738.170559][ T104] kernfs_refresh_inode (/home/anders/src/kernel/next/fs/kernfs/inode.c:180)
[ 1738.174355][ T104] kernfs_iop_permission (/home/anders/src/kernel/next/fs/kernfs/inode.c:289)
[ 1738.177829][ T104] inode_permission (/home/anders/src/kernel/next/fs/namei.c:461 /home/anders/src/kernel/next/fs/namei.c:528)
[ 1738.181403][ T104] link_path_walk (/home/anders/src/kernel/next/fs/namei.c:1720 /home/anders/src/kernel/next/fs/namei.c:2267)
[ 1738.184738][ T104] path_lookupat (/home/anders/src/kernel/next/fs/namei.c:2478 (discriminator 2))
[ 1738.188268][ T104] filename_lookup (/home/anders/src/kernel/next/fs/namei.c:2508)
[ 1738.191865][ T104] vfs_statx (/home/anders/src/kernel/next/fs/stat.c:238)
[ 1738.196236][ T104] vfs_fstatat (/home/anders/src/kernel/next/fs/stat.c:276)
[ 1738.200120][ T104] __do_sys_newfstatat (/home/anders/src/kernel/next/fs/stat.c:446)
[ 1738.204095][ T104] __arm64_sys_newfstatat (/home/anders/src/kernel/next/fs/stat.c:440 /home/anders/src/kernel/next/fs/stat.c:440)
[ 1738.207676][ T104] el0_svc_common.constprop.0 (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:139)
[ 1738.211820][ T104] do_el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:188)
[ 1738.214815][ T104] el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:144 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:648)
[ 1738.217709][ T104] el0t_64_sync_handler (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:666)
[ 1738.221239][ T104] el0t_64_sync (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:591)
[ 1738.224502][ T104]
[ 1738.226090][ T104] 1 lock held by systemd-udevd/104:
[ 1738.229747][ T104] #0: ffff000006681e08 (&root->kernfs_iattr_rwsem){++++}-{3:3}, at: kernfs_iop_permission (/home/anders/src/kernel/next/fs/kernfs/inode.c:288)
[ 1738.237504][ T104] irq event stamp: 108353
[ 1738.240262][ T104] hardirqs last enabled at (108353): seqcount_lockdep_reader_access (/home/anders/src/kernel/next/include/linux/seqlock.h:105)
[ 1738.247443][ T104] hardirqs last disabled at (108352): seqcount_lockdep_reader_access (/home/anders/src/kernel/next/include/linux/seqlock.h:104)
[ 1738.254510][ T104] softirqs last enabled at (108326): fpsimd_restore_current_state (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1791)
[ 1738.262187][ T104] softirqs last disabled at (108324): fpsimd_restore_current_state (/home/anders/src/kernel/next/include/linux/bottom_half.h:20 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1784)
[ 1738.270239][ T104]
[ 1738.272140][ T104] Reported by Kernel Concurrency Sanitizer on:
[ 1738.285185][ T104] Hardware name: linux,dummy-virt (DT)
[ 1738.288703][ T104] ==================================================================


Cheers,
Anders

2023-07-19 05:10:42

by Ian Kent

[permalink] [raw]
Subject: Re: [PATCH 1/2] kernfs: dont take i_lock on inode attr read

On 19/7/23 03:00, Anders Roxell wrote:
> On 2023-01-23 11:11, Ian Kent wrote:
>> On 29/12/22 21:07, Ian Kent wrote:
>>> On 29/12/22 17:20, Arnd Bergmann wrote:
>>>> On Fri, Dec 23, 2022, at 00:11, Ian Kent wrote:
>>>>> On 21/12/22 21:34, Anders Roxell wrote:
>>>>>> On 2022-10-31 12:30, Tejun Heo wrote:
>>>>>>> On Tue, Oct 18, 2022 at 10:32:42AM +0800, Ian Kent wrote:
>>>>>>>> The kernfs write lock is held when the kernfs node inode attributes
>>>>>>>> are updated. Therefore, when either kernfs_iop_getattr() or
>>>>>>>> kernfs_iop_permission() are called the kernfs node inode attributes
>>>>>>>> won't change.
>>>>>>>>
>>>>>>>> Consequently concurrent kernfs_refresh_inode() calls always copy the
>>>>>>>> same values from the kernfs node.
>>>>>>>>
>>>>>>>> So there's no need to take the inode i_lock to get consistent values
>>>>>>>> for generic_fillattr() and generic_permission(), the
>>>>>>>> kernfs read lock
>>>>>>>> is sufficient.
>>>>>>>>
>>>>>>>> Signed-off-by: Ian Kent <[email protected]>
>>>>>>> Acked-by: Tejun Heo <[email protected]>
>>>>>> Hi,
>>>>>>
>>>>>> Building an allmodconfig arm64 kernel on yesterdays next-20221220 and
>>>>>> booting that in qemu I see the following "BUG: KCSAN: data-race in
>>>>>> set_nlink / set_nlink".
>>>>> I'll check if I missed any places where set_link() could be
>>>>> called where the link count could be different.
>>>>>
>>>>>
>>>>> If there aren't any the question will then be can writing the
>>>>> same value to this location in multiple concurrent threads
>>>>> corrupt it?
>>>> I think the race that is getting reported for set_nlink()
>>>> is about this bit getting called simulatenously on multiple
>>>> CPUs with only the read lock held for the inode:
>>>>
>>>>       /* Yes, some filesystems do change nlink from zero to one */
>>>>       if (inode->i_nlink == 0)
>>>> atomic_long_dec(&inode->i_sb->s_remove_count);
>>>>       inode->__i_nlink = nlink;
>>>>
>>>> Since i_nlink and __i_nlink refer to the same memory location,
>>>> the 'inode->i_nlink == 0' check can be true for all of them
>>>> before the nonzero nlink value gets set, and this results in
>>>> s_remove_count being decremented more than once.
>>>
>>> Thanks for the comment Arnd.
>>
>> Hello all,
>>
>>
>> I've been looking at this and after consulting Miklos and his pointing
>>
>> out that it looks like a false positive the urgency dropped off a bit. So
>>
>> apologies for taking so long to report back.
>>
>>
>> Anyway it needs some description of conclusions reached so far.
>>
>>
>> I'm still looking around but in short, kernfs will set directories to <# of
>>
>> directory entries> + 2 unconditionally for directories. I can't yet find
>>
>> any other places where i_nlink is set or changed and if there are none
>>
>> then i_nlink will never be set to zero so the race should not occur.
>>
>>
>> Consequently my claim is this is a real false positive.
>>
>>
>> There are the file system operations that may be passed at mount time
>>
>> but given the way kernfs sets i_nlink it pretty much dictates those
>> operations
>>
>> (if there were any that modify it and there don't appear to be any) leave it
>>
>> alone.
>>
>>
>> So it just doesn't make sense for users of kernfs to fiddle with i_nlink ...
> On todays next tag, next-20230718 this KCSAN BUG poped up again. When I
> built an allmodconfig arm64 kernel and booted it in QEMU. Full log can
> be found http://ix.io/4AUd
>
> [ 1694.987789][ T137] BUG: KCSAN: data-race in inode_permission / kernfs_refresh_inode
> [ 1694.992912][ T137]
> [ 1694.994532][ T137] write to 0xffff00000bab6070 of 2 bytes by task 104 on cpu 0:
> [ 1694.999269][ T137] kernfs_refresh_inode (/home/anders/src/kernel/next/fs/kernfs/inode.c:171)
> [ 1695.002707][ T137] kernfs_iop_permission (/home/anders/src/kernel/next/fs/kernfs/inode.c:289)
> [ 1695.006148][ T137] inode_permission (/home/anders/src/kernel/next/fs/namei.c:461 /home/anders/src/kernel/next/fs/namei.c:528)
> [ 1695.009420][ T137] link_path_walk (/home/anders/src/kernel/next/fs/namei.c:1720 /home/anders/src/kernel/next/fs/namei.c:2267)
> [ 1695.012643][ T137] path_lookupat (/home/anders/src/kernel/next/fs/namei.c:2478 (discriminator 2))
> [ 1695.015781][ T137] filename_lookup (/home/anders/src/kernel/next/fs/namei.c:2508)
> [ 1695.019059][ T137] vfs_statx (/home/anders/src/kernel/next/fs/stat.c:238)
> [ 1695.022024][ T137] vfs_fstatat (/home/anders/src/kernel/next/fs/stat.c:276)
> [ 1695.025067][ T137] __do_sys_newfstatat (/home/anders/src/kernel/next/fs/stat.c:446)
> [ 1695.028497][ T137] __arm64_sys_newfstatat (/home/anders/src/kernel/next/fs/stat.c:440 /home/anders/src/kernel/next/fs/stat.c:440)
> [ 1695.032080][ T137] el0_svc_common.constprop.0 (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:139)
> [ 1695.035916][ T137] do_el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:188)
> [ 1695.038796][ T137] el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:144 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:648)
> [ 1695.041468][ T137] el0t_64_sync_handler (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:666)
> [ 1695.044889][ T137] el0t_64_sync (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:591)
> [ 1695.047904][ T137]
> [ 1695.049511][ T137] 1 lock held by systemd-udevd/104:
> [ 1695.052837][ T137] #0: ffff000006681e08 (&root->kernfs_iattr_rwsem){++++}-{3:3}, at: kernfs_iop_permission (/home/anders/src/kernel/next/fs/kernfs/inode.c:288)
> [ 1695.060241][ T137] irq event stamp: 82902
> [ 1695.063006][ T137] hardirqs last enabled at (82901): _raw_spin_unlock_irqrestore (/home/anders/src/kernel/next/arch/arm64/include/asm/alternative-macros.h:250 /home/anders/src/kernel/next/arch/arm64/include/asm/irqflags.h:27 /home/anders/src/kernel/next/arch/arm64/include/asm/irqflags.h:140 /home/anders/src/kernel/next/include/linux/spinlock_api_smp.h:151 /home/anders/src/kernel/next/kernel/locking/spinlock.c:194)
> [ 1695.069673][ T137] hardirqs last disabled at (82902): el1_interrupt (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:472 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:488)
> [ 1695.075474][ T137] softirqs last enabled at (82792): fpsimd_restore_current_state (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1791)
> [ 1695.082319][ T137] softirqs last disabled at (82790): fpsimd_restore_current_state (/home/anders/src/kernel/next/include/linux/bottom_half.h:20 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1784)
> [ 1695.089049][ T137]
> [ 1695.090659][ T137] read to 0xffff00000bab6070 of 2 bytes by task 137 on cpu 0:
> [ 1695.095374][ T137] inode_permission (/home/anders/src/kernel/next/fs/namei.c:532)
> [ 1695.098655][ T137] link_path_walk (/home/anders/src/kernel/next/fs/namei.c:1720 /home/anders/src/kernel/next/fs/namei.c:2267)
> [ 1695.101857][ T137] path_openat (/home/anders/src/kernel/next/fs/namei.c:3789 (discriminator 2))
> [ 1695.104885][ T137] do_filp_open (/home/anders/src/kernel/next/fs/namei.c:3820)
> [ 1695.108006][ T137] do_sys_openat2 (/home/anders/src/kernel/next/fs/open.c:1418)
> [ 1695.111290][ T137] __arm64_sys_openat (/home/anders/src/kernel/next/fs/open.c:1433)
> [ 1695.114825][ T137] el0_svc_common.constprop.0 (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:139)
> [ 1695.118662][ T137] do_el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:188)
> [ 1695.121555][ T137] el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:144 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:648)
> [ 1695.124207][ T137] el0t_64_sync_handler (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:666)
> [ 1695.127590][ T137] el0t_64_sync (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:591)
> [ 1695.130641][ T137]
> [ 1695.132241][ T137] no locks held by systemd-udevd/137.
> [ 1695.135618][ T137] irq event stamp: 3246
> [ 1695.138519][ T137] hardirqs last enabled at (3245): seqcount_lockdep_reader_access (/home/anders/src/kernel/next/include/linux/seqlock.h:105)
> [ 1695.145825][ T137] hardirqs last disabled at (3246): el1_interrupt (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:472 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:488)
> [ 1695.151942][ T137] softirqs last enabled at (3208): fpsimd_restore_current_state (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1791)
> [ 1695.158950][ T137] softirqs last disabled at (3206): fpsimd_restore_current_state (/home/anders/src/kernel/next/include/linux/bottom_half.h:20 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1784)
> [ 1695.166036][ T137]
> [ 1695.167621][ T137] Reported by Kernel Concurrency Sanitizer on:
> [ 1695.179990][ T137] Hardware name: linux,dummy-virt (DT)
> [ 1695.183687][ T137] ==================================================================


This one is different to the original.


I can't see where the problem is here, can someone help me out

please.


I don't see any shared data values used by the call

devcgroup_inode_permission(inode, mask) in fs/namei.c:inode_permission()

that might be a problem during the read except possibly inode->i_mode.


I'll check on that ... maybe something's been missed when kernfs_rwsem

was changed to a separate lock (kernfs_iattr_rwsem).


>
> [...]
>
> [ 1738.053819][ T104] BUG: KCSAN: data-race in set_nlink / set_nlink
> [ 1738.058223][ T104]
> [ 1738.059865][ T104] read to 0xffff00000bab6918 of 4 bytes by task 108 on cpu 0:
> [ 1738.064916][ T104] set_nlink (/home/anders/src/kernel/next/fs/inode.c:369)
> [ 1738.067845][ T104] kernfs_refresh_inode (/home/anders/src/kernel/next/fs/kernfs/inode.c:180)
> [ 1738.071607][ T104] kernfs_iop_permission (/home/anders/src/kernel/next/fs/kernfs/inode.c:289)
> [ 1738.075467][ T104] inode_permission (/home/anders/src/kernel/next/fs/namei.c:461 /home/anders/src/kernel/next/fs/namei.c:528)
> [ 1738.078868][ T104] link_path_walk (/home/anders/src/kernel/next/fs/namei.c:1720 /home/anders/src/kernel/next/fs/namei.c:2267)
> [ 1738.082270][ T104] path_lookupat (/home/anders/src/kernel/next/fs/namei.c:2478 (discriminator 2))
> [ 1738.085488][ T104] filename_lookup (/home/anders/src/kernel/next/fs/namei.c:2508)
> [ 1738.089101][ T104] user_path_at_empty (/home/anders/src/kernel/next/fs/namei.c:2907)
> [ 1738.092469][ T104] do_readlinkat (/home/anders/src/kernel/next/fs/stat.c:477)
> [ 1738.095970][ T104] __arm64_sys_readlinkat (/home/anders/src/kernel/next/fs/stat.c:504 /home/anders/src/kernel/next/fs/stat.c:501 /home/anders/src/kernel/next/fs/stat.c:501)
> [ 1738.099529][ T104] el0_svc_common.constprop.0 (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:139)
> [ 1738.103696][ T104] do_el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:188)
> [ 1738.106560][ T104] el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:144 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:648)
> [ 1738.109613][ T104] el0t_64_sync_handler (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:666)
> [ 1738.113035][ T104] el0t_64_sync (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:591)
> [ 1738.116346][ T104]
> [ 1738.117924][ T104] 1 lock held by systemd-udevd/108:
> [ 1738.121580][ T104] #0: ffff000006681e08 (&root->kernfs_iattr_rwsem){++++}-{3:3}, at: kernfs_iop_permission (/home/anders/src/kernel/next/fs/kernfs/inode.c:288)
> [ 1738.129355][ T104] irq event stamp: 31000
> [ 1738.132088][ T104] hardirqs last enabled at (31000): seqcount_lockdep_reader_access (/home/anders/src/kernel/next/include/linux/seqlock.h:105)
> [ 1738.139417][ T104] hardirqs last disabled at (30999): seqcount_lockdep_reader_access (/home/anders/src/kernel/next/include/linux/seqlock.h:104)
> [ 1738.146781][ T104] softirqs last enabled at (30973): fpsimd_restore_current_state (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1791)
> [ 1738.153891][ T104] softirqs last disabled at (30971): fpsimd_restore_current_state (/home/anders/src/kernel/next/include/linux/bottom_half.h:20 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1784)
> [ 1738.161012][ T104]
> [ 1738.162663][ T104] write to 0xffff00000bab6918 of 4 bytes by task 104 on cpu 0:
> [ 1738.167730][ T104] set_nlink (/home/anders/src/kernel/next/fs/inode.c:372)
> [ 1738.170559][ T104] kernfs_refresh_inode (/home/anders/src/kernel/next/fs/kernfs/inode.c:180)
> [ 1738.174355][ T104] kernfs_iop_permission (/home/anders/src/kernel/next/fs/kernfs/inode.c:289)
> [ 1738.177829][ T104] inode_permission (/home/anders/src/kernel/next/fs/namei.c:461 /home/anders/src/kernel/next/fs/namei.c:528)
> [ 1738.181403][ T104] link_path_walk (/home/anders/src/kernel/next/fs/namei.c:1720 /home/anders/src/kernel/next/fs/namei.c:2267)
> [ 1738.184738][ T104] path_lookupat (/home/anders/src/kernel/next/fs/namei.c:2478 (discriminator 2))
> [ 1738.188268][ T104] filename_lookup (/home/anders/src/kernel/next/fs/namei.c:2508)
> [ 1738.191865][ T104] vfs_statx (/home/anders/src/kernel/next/fs/stat.c:238)
> [ 1738.196236][ T104] vfs_fstatat (/home/anders/src/kernel/next/fs/stat.c:276)
> [ 1738.200120][ T104] __do_sys_newfstatat (/home/anders/src/kernel/next/fs/stat.c:446)
> [ 1738.204095][ T104] __arm64_sys_newfstatat (/home/anders/src/kernel/next/fs/stat.c:440 /home/anders/src/kernel/next/fs/stat.c:440)
> [ 1738.207676][ T104] el0_svc_common.constprop.0 (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:139)
> [ 1738.211820][ T104] do_el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:188)
> [ 1738.214815][ T104] el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:144 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:648)
> [ 1738.217709][ T104] el0t_64_sync_handler (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:666)
> [ 1738.221239][ T104] el0t_64_sync (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:591)
> [ 1738.224502][ T104]
> [ 1738.226090][ T104] 1 lock held by systemd-udevd/104:
> [ 1738.229747][ T104] #0: ffff000006681e08 (&root->kernfs_iattr_rwsem){++++}-{3:3}, at: kernfs_iop_permission (/home/anders/src/kernel/next/fs/kernfs/inode.c:288)
> [ 1738.237504][ T104] irq event stamp: 108353
> [ 1738.240262][ T104] hardirqs last enabled at (108353): seqcount_lockdep_reader_access (/home/anders/src/kernel/next/include/linux/seqlock.h:105)
> [ 1738.247443][ T104] hardirqs last disabled at (108352): seqcount_lockdep_reader_access (/home/anders/src/kernel/next/include/linux/seqlock.h:104)
> [ 1738.254510][ T104] softirqs last enabled at (108326): fpsimd_restore_current_state (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1791)
> [ 1738.262187][ T104] softirqs last disabled at (108324): fpsimd_restore_current_state (/home/anders/src/kernel/next/include/linux/bottom_half.h:20 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1784)
> [ 1738.270239][ T104]
> [ 1738.272140][ T104] Reported by Kernel Concurrency Sanitizer on:
> [ 1738.285185][ T104] Hardware name: linux,dummy-virt (DT)
> [ 1738.288703][ T104] ==================================================================

This looks just like the original except a different lock is being

used now.


The link count can't be less than two if set_nlink() is called.


Maybe I am missing something but the directory count is changed only

while holding the root->kernfs_iattr_rwsem so the value used by

set_nlink() will not change during concurrent calls to refresh_inode().

Still looks like a false positive, I'll check the write operations
again just to be sure.

Ian


2023-07-20 02:33:18

by Ian Kent

[permalink] [raw]
Subject: Re: [PATCH 1/2] kernfs: dont take i_lock on inode attr read

On Wed, 2023-07-19 at 12:23 +0800, Ian Kent wrote:
> On 19/7/23 03:00, Anders Roxell wrote:
> > On 2023-01-23 11:11, Ian Kent wrote:
> > > On 29/12/22 21:07, Ian Kent wrote:
> > > > On 29/12/22 17:20, Arnd Bergmann wrote:
> > > > > On Fri, Dec 23, 2022, at 00:11, Ian Kent wrote:
> > > > > > On 21/12/22 21:34, Anders Roxell wrote:
> > > > > > > On 2022-10-31 12:30, Tejun Heo wrote:
> > > > > > > > On Tue, Oct 18, 2022 at 10:32:42AM +0800, Ian Kent
> > > > > > > > wrote:
> > > > > > > > > The kernfs write lock is held when the kernfs node
> > > > > > > > > inode attributes
> > > > > > > > > are updated. Therefore, when either
> > > > > > > > > kernfs_iop_getattr() or
> > > > > > > > > kernfs_iop_permission() are called the kernfs node
> > > > > > > > > inode attributes
> > > > > > > > > won't change.
> > > > > > > > >
> > > > > > > > > Consequently concurrent kernfs_refresh_inode() calls
> > > > > > > > > always copy the
> > > > > > > > > same values from the kernfs node.
> > > > > > > > >
> > > > > > > > > So there's no need to take the inode i_lock to get
> > > > > > > > > consistent values
> > > > > > > > > for generic_fillattr() and generic_permission(), the
> > > > > > > > > kernfs read lock
> > > > > > > > > is sufficient.
> > > > > > > > >
> > > > > > > > > Signed-off-by: Ian Kent <[email protected]>
> > > > > > > > Acked-by: Tejun Heo <[email protected]>
> > > > > > > Hi,
> > > > > > >
> > > > > > > Building an allmodconfig arm64 kernel on yesterdays next-
> > > > > > > 20221220 and
> > > > > > > booting that in qemu I see the following "BUG: KCSAN:
> > > > > > > data-race in
> > > > > > > set_nlink / set_nlink".
> > > > > > I'll check if I missed any places where set_link() could be
> > > > > > called where the link count could be different.
> > > > > >
> > > > > >
> > > > > > If there aren't any the question will then be can writing
> > > > > > the
> > > > > > same value to this location in multiple concurrent threads
> > > > > > corrupt it?
> > > > > I think the race that is getting reported for set_nlink()
> > > > > is about this bit getting called simulatenously on multiple
> > > > > CPUs with only the read lock held for the inode:
> > > > >
> > > > > ?????? /* Yes, some filesystems do change nlink from zero to
> > > > > one */
> > > > > ?????? if (inode->i_nlink == 0)
> > > > > atomic_long_dec(&inode->i_sb->s_remove_count);
> > > > > ?????? inode->__i_nlink = nlink;
> > > > >
> > > > > Since i_nlink and __i_nlink refer to the same memory
> > > > > location,
> > > > > the 'inode->i_nlink == 0' check can be true for all of them
> > > > > before the nonzero nlink value gets set, and this results in
> > > > > s_remove_count being decremented more than once.
> > > >
> > > > Thanks for the comment Arnd.
> > >
> > > Hello all,
> > >
> > >
> > > I've been looking at this and after consulting Miklos and his
> > > pointing
> > >
> > > out that it looks like a false positive the urgency dropped off a
> > > bit. So
> > >
> > > apologies for taking so long to report back.
> > >
> > >
> > > Anyway it needs some description of conclusions reached so far.
> > >
> > >
> > > I'm still looking around but in short, kernfs will set
> > > directories to <# of
> > >
> > > directory entries> + 2 unconditionally for directories. I can't
> > > yet find
> > >
> > > any other places where i_nlink is set or changed and if there are
> > > none
> > >
> > > then i_nlink will never be set to zero so the race should not
> > > occur.
> > >
> > >
> > > Consequently my claim is this is a real false positive.
> > >
> > >
> > > There are the file system operations that may be passed at mount
> > > time
> > >
> > > but given the way kernfs sets i_nlink it pretty much dictates
> > > those
> > > operations
> > >
> > > (if there were any that modify it and there don't appear to be
> > > any) leave it
> > >
> > > alone.
> > >
> > >
> > > So it just doesn't make sense for users of kernfs to fiddle with
> > > i_nlink ...
> > On todays next tag, next-20230718 this KCSAN BUG poped up again.
> > When I
> > built an allmodconfig arm64 kernel and booted it in QEMU. Full log
> > can
> > be found http://ix.io/4AUd
> >
> > [ 1694.987789][? T137] BUG: KCSAN: data-race in inode_permission /
> > kernfs_refresh_inode
> > [ 1694.992912][? T137]
> > [ 1694.994532][? T137] write to 0xffff00000bab6070 of 2 bytes by
> > task 104 on cpu 0:
> > [ 1694.999269][ T137] kernfs_refresh_inode
> > (/home/anders/src/kernel/next/fs/kernfs/inode.c:171)
> > [ 1695.002707][ T137] kernfs_iop_permission
> > (/home/anders/src/kernel/next/fs/kernfs/inode.c:289)
> > [ 1695.006148][ T137] inode_permission
> > (/home/anders/src/kernel/next/fs/namei.c:461
> > /home/anders/src/kernel/next/fs/namei.c:528)
> > [ 1695.009420][ T137] link_path_walk
> > (/home/anders/src/kernel/next/fs/namei.c:1720
> > /home/anders/src/kernel/next/fs/namei.c:2267)
> > [ 1695.012643][ T137] path_lookupat
> > (/home/anders/src/kernel/next/fs/namei.c:2478 (discriminator 2))
> > [ 1695.015781][ T137] filename_lookup
> > (/home/anders/src/kernel/next/fs/namei.c:2508)
> > [ 1695.019059][ T137] vfs_statx
> > (/home/anders/src/kernel/next/fs/stat.c:238)
> > [ 1695.022024][ T137] vfs_fstatat
> > (/home/anders/src/kernel/next/fs/stat.c:276)
> > [ 1695.025067][ T137] __do_sys_newfstatat
> > (/home/anders/src/kernel/next/fs/stat.c:446)
> > [ 1695.028497][ T137] __arm64_sys_newfstatat
> > (/home/anders/src/kernel/next/fs/stat.c:440
> > /home/anders/src/kernel/next/fs/stat.c:440)
> > [ 1695.032080][ T137] el0_svc_common.constprop.0
> > (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38
> > /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52
> > /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:139)
> > [ 1695.035916][ T137] do_el0_svc
> > (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:188)
> > [ 1695.038796][ T137] el0_svc
> > (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133
> > /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:144
> > /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:648)
> > [ 1695.041468][ T137] el0t_64_sync_handler
> > (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:666)
> > [ 1695.044889][ T137] el0t_64_sync
> > (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:591)
> > [ 1695.047904][? T137]
> > [ 1695.049511][? T137] 1 lock held by systemd-udevd/104:
> > [ 1695.052837][ T137] #0: ffff000006681e08 (&root-
> > >kernfs_iattr_rwsem){++++}-{3:3}, at: kernfs_iop_permission
> > (/home/anders/src/kernel/next/fs/kernfs/inode.c:288)
> > [ 1695.060241][? T137] irq event stamp: 82902
> > [ 1695.063006][ T137] hardirqs last enabled at (82901):
> > _raw_spin_unlock_irqrestore
> > (/home/anders/src/kernel/next/arch/arm64/include/asm/alternative-
> > macros.h:250
> > /home/anders/src/kernel/next/arch/arm64/include/asm/irqflags.h:27
> > /home/anders/src/kernel/next/arch/arm64/include/asm/irqflags.h:140
> > /home/anders/src/kernel/next/include/linux/spinlock_api_smp.h:151
> > /home/anders/src/kernel/next/kernel/locking/spinlock.c:194)
> > [ 1695.069673][ T137] hardirqs last disabled at (82902):
> > el1_interrupt
> > (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:472
> > /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:488)
> > [ 1695.075474][ T137] softirqs last enabled at (82792):
> > fpsimd_restore_current_state
> > (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264
> > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1791)
> > [ 1695.082319][ T137] softirqs last disabled at (82790):
> > fpsimd_restore_current_state
> > (/home/anders/src/kernel/next/include/linux/bottom_half.h:20
> > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242
> > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1784)
> > [ 1695.089049][? T137]
> > [ 1695.090659][? T137] read to 0xffff00000bab6070 of 2 bytes by
> > task 137 on cpu 0:
> > [ 1695.095374][ T137] inode_permission
> > (/home/anders/src/kernel/next/fs/namei.c:532)
> > [ 1695.098655][ T137] link_path_walk
> > (/home/anders/src/kernel/next/fs/namei.c:1720
> > /home/anders/src/kernel/next/fs/namei.c:2267)
> > [ 1695.101857][ T137] path_openat
> > (/home/anders/src/kernel/next/fs/namei.c:3789 (discriminator 2))
> > [ 1695.104885][ T137] do_filp_open
> > (/home/anders/src/kernel/next/fs/namei.c:3820)
> > [ 1695.108006][ T137] do_sys_openat2
> > (/home/anders/src/kernel/next/fs/open.c:1418)
> > [ 1695.111290][ T137] __arm64_sys_openat
> > (/home/anders/src/kernel/next/fs/open.c:1433)
> > [ 1695.114825][ T137] el0_svc_common.constprop.0
> > (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38
> > /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52
> > /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:139)
> > [ 1695.118662][ T137] do_el0_svc
> > (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:188)
> > [ 1695.121555][ T137] el0_svc
> > (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133
> > /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:144
> > /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:648)
> > [ 1695.124207][ T137] el0t_64_sync_handler
> > (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:666)
> > [ 1695.127590][ T137] el0t_64_sync
> > (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:591)
> > [ 1695.130641][? T137]
> > [ 1695.132241][? T137] no locks held by systemd-udevd/137.
> > [ 1695.135618][? T137] irq event stamp: 3246
> > [ 1695.138519][ T137] hardirqs last enabled at (3245):
> > seqcount_lockdep_reader_access
> > (/home/anders/src/kernel/next/include/linux/seqlock.h:105)
> > [ 1695.145825][ T137] hardirqs last disabled at (3246):
> > el1_interrupt
> > (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:472
> > /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:488)
> > [ 1695.151942][ T137] softirqs last enabled at (3208):
> > fpsimd_restore_current_state
> > (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264
> > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1791)
> > [ 1695.158950][ T137] softirqs last disabled at (3206):
> > fpsimd_restore_current_state
> > (/home/anders/src/kernel/next/include/linux/bottom_half.h:20
> > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242
> > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1784)
> > [ 1695.166036][? T137]
> > [ 1695.167621][? T137] Reported by Kernel Concurrency Sanitizer on:
> > [ 1695.179990][? T137] Hardware name: linux,dummy-virt (DT)
> > [ 1695.183687][? T137]
> > ==================================================================
>
>
> This one is different to the original.
>
>
> I can't see where the problem is here, can someone help me out
>
> please.
>
>
> I don't see any shared data values used by the call
>
> devcgroup_inode_permission(inode, mask) in
> fs/namei.c:inode_permission()
>
> that might be a problem during the read except possibly inode-
> >i_mode.
>
>
> I'll check on that ... maybe something's been missed when
> kernfs_rwsem
>
> was changed to a separate lock (kernfs_iattr_rwsem).
>
>
> >
> > [...]
> >
> > [ 1738.053819][? T104] BUG: KCSAN: data-race in set_nlink /
> > set_nlink
> > [ 1738.058223][? T104]
> > [ 1738.059865][? T104] read to 0xffff00000bab6918 of 4 bytes by
> > task 108 on cpu 0:
> > [ 1738.064916][ T104] set_nlink
> > (/home/anders/src/kernel/next/fs/inode.c:369)
> > [ 1738.067845][ T104] kernfs_refresh_inode
> > (/home/anders/src/kernel/next/fs/kernfs/inode.c:180)
> > [ 1738.071607][ T104] kernfs_iop_permission
> > (/home/anders/src/kernel/next/fs/kernfs/inode.c:289)
> > [ 1738.075467][ T104] inode_permission
> > (/home/anders/src/kernel/next/fs/namei.c:461
> > /home/anders/src/kernel/next/fs/namei.c:528)
> > [ 1738.078868][ T104] link_path_walk
> > (/home/anders/src/kernel/next/fs/namei.c:1720
> > /home/anders/src/kernel/next/fs/namei.c:2267)
> > [ 1738.082270][ T104] path_lookupat
> > (/home/anders/src/kernel/next/fs/namei.c:2478 (discriminator 2))
> > [ 1738.085488][ T104] filename_lookup
> > (/home/anders/src/kernel/next/fs/namei.c:2508)
> > [ 1738.089101][ T104] user_path_at_empty
> > (/home/anders/src/kernel/next/fs/namei.c:2907)
> > [ 1738.092469][ T104] do_readlinkat
> > (/home/anders/src/kernel/next/fs/stat.c:477)
> > [ 1738.095970][ T104] __arm64_sys_readlinkat
> > (/home/anders/src/kernel/next/fs/stat.c:504
> > /home/anders/src/kernel/next/fs/stat.c:501
> > /home/anders/src/kernel/next/fs/stat.c:501)
> > [ 1738.099529][ T104] el0_svc_common.constprop.0
> > (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38
> > /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52
> > /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:139)
> > [ 1738.103696][ T104] do_el0_svc
> > (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:188)
> > [ 1738.106560][ T104] el0_svc
> > (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133
> > /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:144
> > /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:648)
> > [ 1738.109613][ T104] el0t_64_sync_handler
> > (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:666)
> > [ 1738.113035][ T104] el0t_64_sync
> > (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:591)
> > [ 1738.116346][? T104]
> > [ 1738.117924][? T104] 1 lock held by systemd-udevd/108:
> > [ 1738.121580][ T104] #0: ffff000006681e08 (&root-
> > >kernfs_iattr_rwsem){++++}-{3:3}, at: kernfs_iop_permission
> > (/home/anders/src/kernel/next/fs/kernfs/inode.c:288)
> > [ 1738.129355][? T104] irq event stamp: 31000
> > [ 1738.132088][ T104] hardirqs last enabled at (31000):
> > seqcount_lockdep_reader_access
> > (/home/anders/src/kernel/next/include/linux/seqlock.h:105)
> > [ 1738.139417][ T104] hardirqs last disabled at (30999):
> > seqcount_lockdep_reader_access
> > (/home/anders/src/kernel/next/include/linux/seqlock.h:104)
> > [ 1738.146781][ T104] softirqs last enabled at (30973):
> > fpsimd_restore_current_state
> > (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264
> > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1791)
> > [ 1738.153891][ T104] softirqs last disabled at (30971):
> > fpsimd_restore_current_state
> > (/home/anders/src/kernel/next/include/linux/bottom_half.h:20
> > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242
> > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1784)
> > [ 1738.161012][? T104]
> > [ 1738.162663][? T104] write to 0xffff00000bab6918 of 4 bytes by
> > task 104 on cpu 0:
> > [ 1738.167730][ T104] set_nlink
> > (/home/anders/src/kernel/next/fs/inode.c:372)
> > [ 1738.170559][ T104] kernfs_refresh_inode
> > (/home/anders/src/kernel/next/fs/kernfs/inode.c:180)
> > [ 1738.174355][ T104] kernfs_iop_permission
> > (/home/anders/src/kernel/next/fs/kernfs/inode.c:289)
> > [ 1738.177829][ T104] inode_permission
> > (/home/anders/src/kernel/next/fs/namei.c:461
> > /home/anders/src/kernel/next/fs/namei.c:528)
> > [ 1738.181403][ T104] link_path_walk
> > (/home/anders/src/kernel/next/fs/namei.c:1720
> > /home/anders/src/kernel/next/fs/namei.c:2267)
> > [ 1738.184738][ T104] path_lookupat
> > (/home/anders/src/kernel/next/fs/namei.c:2478 (discriminator 2))
> > [ 1738.188268][ T104] filename_lookup
> > (/home/anders/src/kernel/next/fs/namei.c:2508)
> > [ 1738.191865][ T104] vfs_statx
> > (/home/anders/src/kernel/next/fs/stat.c:238)
> > [ 1738.196236][ T104] vfs_fstatat
> > (/home/anders/src/kernel/next/fs/stat.c:276)
> > [ 1738.200120][ T104] __do_sys_newfstatat
> > (/home/anders/src/kernel/next/fs/stat.c:446)
> > [ 1738.204095][ T104] __arm64_sys_newfstatat
> > (/home/anders/src/kernel/next/fs/stat.c:440
> > /home/anders/src/kernel/next/fs/stat.c:440)
> > [ 1738.207676][ T104] el0_svc_common.constprop.0
> > (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38
> > /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52
> > /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:139)
> > [ 1738.211820][ T104] do_el0_svc
> > (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:188)
> > [ 1738.214815][ T104] el0_svc
> > (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133
> > /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:144
> > /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:648)
> > [ 1738.217709][ T104] el0t_64_sync_handler
> > (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:666)
> > [ 1738.221239][ T104] el0t_64_sync
> > (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:591)
> > [ 1738.224502][? T104]
> > [ 1738.226090][? T104] 1 lock held by systemd-udevd/104:
> > [ 1738.229747][ T104] #0: ffff000006681e08 (&root-
> > >kernfs_iattr_rwsem){++++}-{3:3}, at: kernfs_iop_permission
> > (/home/anders/src/kernel/next/fs/kernfs/inode.c:288)
> > [ 1738.237504][? T104] irq event stamp: 108353
> > [ 1738.240262][ T104] hardirqs last enabled at (108353):
> > seqcount_lockdep_reader_access
> > (/home/anders/src/kernel/next/include/linux/seqlock.h:105)
> > [ 1738.247443][ T104] hardirqs last disabled at (108352):
> > seqcount_lockdep_reader_access
> > (/home/anders/src/kernel/next/include/linux/seqlock.h:104)
> > [ 1738.254510][ T104] softirqs last enabled at (108326):
> > fpsimd_restore_current_state
> > (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264
> > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1791)
> > [ 1738.262187][ T104] softirqs last disabled at (108324):
> > fpsimd_restore_current_state
> > (/home/anders/src/kernel/next/include/linux/bottom_half.h:20
> > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242
> > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1784)
> > [ 1738.270239][? T104]
> > [ 1738.272140][? T104] Reported by Kernel Concurrency Sanitizer on:
> > [ 1738.285185][? T104] Hardware name: linux,dummy-virt (DT)
> > [ 1738.288703][? T104]
> > ==================================================================
>
> This looks just like the original except a different lock is being
>
> used now.
>
>
> The link count can't be less than two if set_nlink() is called.
>
>
> Maybe I am missing something but the directory count is changed only
>
> while holding the root->kernfs_iattr_rwsem so the value used by
>
> set_nlink() will not change during concurrent calls to
> refresh_inode().
>
> Still looks like a false positive, I'll check the write operations
> again just to be sure.

I do see a problem with recent changes.

I'll send this off to Greg after I've done some testing (primarily just
compile and function).

Here's a patch which describes what I found.

Comments are, of course, welcome, ;)

kernfs: fix missing kernfs_iattr_rwsem locking

From: Ian Kent <[email protected]>

When the kernfs_iattr_rwsem was introduced a case was missed.

The update of the kernfs directory node child count was also protected
by the kernfs_rwsem and needs to be included in the change so that the
child count (and so the inode n_link attribute) does not change while
holding the rwsem for read.

Fixes: 9caf696142 (kernfs: Introduce separate rwsem to protect inode
attributes)

Signed-off-by: Ian Kent <[email protected]>
Cc: Anders Roxell <[email protected]>
Cc: Imran Khan <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Eric Sandeen <[email protected]>
---
fs/kernfs/dir.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 45b6919903e6..6e84bb69602e 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -383,9 +383,11 @@ static int kernfs_link_sibling(struct kernfs_node
*kn)
rb_insert_color(&kn->rb, &kn->parent->dir.children);

/* successfully added, account subdir number */
+ down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
if (kernfs_type(kn) == KERNFS_DIR)
kn->parent->dir.subdirs++;
kernfs_inc_rev(kn->parent);
+ up_write(&kernfs_root(kn)->kernfs_iattr_rwsem);

return 0;
}
@@ -408,9 +410,11 @@ static bool kernfs_unlink_sibling(struct
kernfs_node *kn)
if (RB_EMPTY_NODE(&kn->rb))
return false;

+ down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
if (kernfs_type(kn) == KERNFS_DIR)
kn->parent->dir.subdirs--;
kernfs_inc_rev(kn->parent);
+ up_write(&kernfs_root(kn)->kernfs_iattr_rwsem);

rb_erase(&kn->rb, &kn->parent->dir.children);
RB_CLEAR_NODE(&kn->rb);


2023-07-26 14:38:29

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 1/2] kernfs: dont take i_lock on inode attr read

On Thu, 20 Jul 2023 at 04:03, Ian Kent <[email protected]> wrote:
>
> On Wed, 2023-07-19 at 12:23 +0800, Ian Kent wrote:
> > On 19/7/23 03:00, Anders Roxell wrote:
> > > On 2023-01-23 11:11, Ian Kent wrote:
> > > > On 29/12/22 21:07, Ian Kent wrote:
> > > > > On 29/12/22 17:20, Arnd Bergmann wrote:
> > > > > > On Fri, Dec 23, 2022, at 00:11, Ian Kent wrote:
> > > > > > > On 21/12/22 21:34, Anders Roxell wrote:
> > > > > > > > On 2022-10-31 12:30, Tejun Heo wrote:
> > > > > > > > > On Tue, Oct 18, 2022 at 10:32:42AM +0800, Ian Kent
> > > > > > > > > wrote:
> > > > > > > > > > The kernfs write lock is held when the kernfs node
> > > > > > > > > > inode attributes
> > > > > > > > > > are updated. Therefore, when either
> > > > > > > > > > kernfs_iop_getattr() or
> > > > > > > > > > kernfs_iop_permission() are called the kernfs node
> > > > > > > > > > inode attributes
> > > > > > > > > > won't change.
> > > > > > > > > >
> > > > > > > > > > Consequently concurrent kernfs_refresh_inode() calls
> > > > > > > > > > always copy the
> > > > > > > > > > same values from the kernfs node.
> > > > > > > > > >
> > > > > > > > > > So there's no need to take the inode i_lock to get
> > > > > > > > > > consistent values
> > > > > > > > > > for generic_fillattr() and generic_permission(), the
> > > > > > > > > > kernfs read lock
> > > > > > > > > > is sufficient.
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Ian Kent <[email protected]>
> > > > > > > > > Acked-by: Tejun Heo <[email protected]>
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > Building an allmodconfig arm64 kernel on yesterdays next-
> > > > > > > > 20221220 and
> > > > > > > > booting that in qemu I see the following "BUG: KCSAN:
> > > > > > > > data-race in
> > > > > > > > set_nlink / set_nlink".
> > > > > > > I'll check if I missed any places where set_link() could be
> > > > > > > called where the link count could be different.
> > > > > > >
> > > > > > >
> > > > > > > If there aren't any the question will then be can writing
> > > > > > > the
> > > > > > > same value to this location in multiple concurrent threads
> > > > > > > corrupt it?
> > > > > > I think the race that is getting reported for set_nlink()
> > > > > > is about this bit getting called simulatenously on multiple
> > > > > > CPUs with only the read lock held for the inode:
> > > > > >
> > > > > > /* Yes, some filesystems do change nlink from zero to
> > > > > > one */
> > > > > > if (inode->i_nlink == 0)
> > > > > > atomic_long_dec(&inode->i_sb->s_remove_count);
> > > > > > inode->__i_nlink = nlink;
> > > > > >
> > > > > > Since i_nlink and __i_nlink refer to the same memory
> > > > > > location,
> > > > > > the 'inode->i_nlink == 0' check can be true for all of them
> > > > > > before the nonzero nlink value gets set, and this results in
> > > > > > s_remove_count being decremented more than once.
> > > > >
> > > > > Thanks for the comment Arnd.
> > > >
> > > > Hello all,
> > > >
> > > >
> > > > I've been looking at this and after consulting Miklos and his
> > > > pointing
> > > >
> > > > out that it looks like a false positive the urgency dropped off a
> > > > bit. So
> > > >
> > > > apologies for taking so long to report back.
> > > >
> > > >
> > > > Anyway it needs some description of conclusions reached so far.
> > > >
> > > >
> > > > I'm still looking around but in short, kernfs will set
> > > > directories to <# of
> > > >
> > > > directory entries> + 2 unconditionally for directories. I can't
> > > > yet find
> > > >
> > > > any other places where i_nlink is set or changed and if there are
> > > > none
> > > >
> > > > then i_nlink will never be set to zero so the race should not
> > > > occur.
> > > >
> > > >
> > > > Consequently my claim is this is a real false positive.
> > > >
> > > >
> > > > There are the file system operations that may be passed at mount
> > > > time
> > > >
> > > > but given the way kernfs sets i_nlink it pretty much dictates
> > > > those
> > > > operations
> > > >
> > > > (if there were any that modify it and there don't appear to be
> > > > any) leave it
> > > >
> > > > alone.
> > > >
> > > >
> > > > So it just doesn't make sense for users of kernfs to fiddle with
> > > > i_nlink ...
> > > On todays next tag, next-20230718 this KCSAN BUG poped up again.
> > > When I
> > > built an allmodconfig arm64 kernel and booted it in QEMU. Full log
> > > can
> > > be found http://ix.io/4AUd
> > >
> > > [ 1694.987789][ T137] BUG: KCSAN: data-race in inode_permission /
> > > kernfs_refresh_inode
> > > [ 1694.992912][ T137]
> > > [ 1694.994532][ T137] write to 0xffff00000bab6070 of 2 bytes by
> > > task 104 on cpu 0:
> > > [ 1694.999269][ T137] kernfs_refresh_inode
> > > (/home/anders/src/kernel/next/fs/kernfs/inode.c:171)
> > > [ 1695.002707][ T137] kernfs_iop_permission
> > > (/home/anders/src/kernel/next/fs/kernfs/inode.c:289)
> > > [ 1695.006148][ T137] inode_permission
> > > (/home/anders/src/kernel/next/fs/namei.c:461
> > > /home/anders/src/kernel/next/fs/namei.c:528)
> > > [ 1695.009420][ T137] link_path_walk
> > > (/home/anders/src/kernel/next/fs/namei.c:1720
> > > /home/anders/src/kernel/next/fs/namei.c:2267)
> > > [ 1695.012643][ T137] path_lookupat
> > > (/home/anders/src/kernel/next/fs/namei.c:2478 (discriminator 2))
> > > [ 1695.015781][ T137] filename_lookup
> > > (/home/anders/src/kernel/next/fs/namei.c:2508)
> > > [ 1695.019059][ T137] vfs_statx
> > > (/home/anders/src/kernel/next/fs/stat.c:238)
> > > [ 1695.022024][ T137] vfs_fstatat
> > > (/home/anders/src/kernel/next/fs/stat.c:276)
> > > [ 1695.025067][ T137] __do_sys_newfstatat
> > > (/home/anders/src/kernel/next/fs/stat.c:446)
> > > [ 1695.028497][ T137] __arm64_sys_newfstatat
> > > (/home/anders/src/kernel/next/fs/stat.c:440
> > > /home/anders/src/kernel/next/fs/stat.c:440)
> > > [ 1695.032080][ T137] el0_svc_common.constprop.0
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38
> > > /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52
> > > /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:139)
> > > [ 1695.035916][ T137] do_el0_svc
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:188)
> > > [ 1695.038796][ T137] el0_svc
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133
> > > /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:144
> > > /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:648)
> > > [ 1695.041468][ T137] el0t_64_sync_handler
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:666)
> > > [ 1695.044889][ T137] el0t_64_sync
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:591)
> > > [ 1695.047904][ T137]
> > > [ 1695.049511][ T137] 1 lock held by systemd-udevd/104:
> > > [ 1695.052837][ T137] #0: ffff000006681e08 (&root-
> > > >kernfs_iattr_rwsem){++++}-{3:3}, at: kernfs_iop_permission
> > > (/home/anders/src/kernel/next/fs/kernfs/inode.c:288)
> > > [ 1695.060241][ T137] irq event stamp: 82902
> > > [ 1695.063006][ T137] hardirqs last enabled at (82901):
> > > _raw_spin_unlock_irqrestore
> > > (/home/anders/src/kernel/next/arch/arm64/include/asm/alternative-
> > > macros.h:250
> > > /home/anders/src/kernel/next/arch/arm64/include/asm/irqflags.h:27
> > > /home/anders/src/kernel/next/arch/arm64/include/asm/irqflags.h:140
> > > /home/anders/src/kernel/next/include/linux/spinlock_api_smp.h:151
> > > /home/anders/src/kernel/next/kernel/locking/spinlock.c:194)
> > > [ 1695.069673][ T137] hardirqs last disabled at (82902):
> > > el1_interrupt
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:472
> > > /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:488)
> > > [ 1695.075474][ T137] softirqs last enabled at (82792):
> > > fpsimd_restore_current_state
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264
> > > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1791)
> > > [ 1695.082319][ T137] softirqs last disabled at (82790):
> > > fpsimd_restore_current_state
> > > (/home/anders/src/kernel/next/include/linux/bottom_half.h:20
> > > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242
> > > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1784)
> > > [ 1695.089049][ T137]
> > > [ 1695.090659][ T137] read to 0xffff00000bab6070 of 2 bytes by
> > > task 137 on cpu 0:
> > > [ 1695.095374][ T137] inode_permission
> > > (/home/anders/src/kernel/next/fs/namei.c:532)
> > > [ 1695.098655][ T137] link_path_walk
> > > (/home/anders/src/kernel/next/fs/namei.c:1720
> > > /home/anders/src/kernel/next/fs/namei.c:2267)
> > > [ 1695.101857][ T137] path_openat
> > > (/home/anders/src/kernel/next/fs/namei.c:3789 (discriminator 2))
> > > [ 1695.104885][ T137] do_filp_open
> > > (/home/anders/src/kernel/next/fs/namei.c:3820)
> > > [ 1695.108006][ T137] do_sys_openat2
> > > (/home/anders/src/kernel/next/fs/open.c:1418)
> > > [ 1695.111290][ T137] __arm64_sys_openat
> > > (/home/anders/src/kernel/next/fs/open.c:1433)
> > > [ 1695.114825][ T137] el0_svc_common.constprop.0
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38
> > > /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52
> > > /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:139)
> > > [ 1695.118662][ T137] do_el0_svc
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:188)
> > > [ 1695.121555][ T137] el0_svc
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133
> > > /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:144
> > > /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:648)
> > > [ 1695.124207][ T137] el0t_64_sync_handler
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:666)
> > > [ 1695.127590][ T137] el0t_64_sync
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:591)
> > > [ 1695.130641][ T137]
> > > [ 1695.132241][ T137] no locks held by systemd-udevd/137.
> > > [ 1695.135618][ T137] irq event stamp: 3246
> > > [ 1695.138519][ T137] hardirqs last enabled at (3245):
> > > seqcount_lockdep_reader_access
> > > (/home/anders/src/kernel/next/include/linux/seqlock.h:105)
> > > [ 1695.145825][ T137] hardirqs last disabled at (3246):
> > > el1_interrupt
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:472
> > > /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:488)
> > > [ 1695.151942][ T137] softirqs last enabled at (3208):
> > > fpsimd_restore_current_state
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264
> > > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1791)
> > > [ 1695.158950][ T137] softirqs last disabled at (3206):
> > > fpsimd_restore_current_state
> > > (/home/anders/src/kernel/next/include/linux/bottom_half.h:20
> > > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242
> > > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1784)
> > > [ 1695.166036][ T137]
> > > [ 1695.167621][ T137] Reported by Kernel Concurrency Sanitizer on:
> > > [ 1695.179990][ T137] Hardware name: linux,dummy-virt (DT)
> > > [ 1695.183687][ T137]
> > > ==================================================================
> >
> >
> > This one is different to the original.
> >
> >
> > I can't see where the problem is here, can someone help me out
> >
> > please.
> >
> >
> > I don't see any shared data values used by the call
> >
> > devcgroup_inode_permission(inode, mask) in
> > fs/namei.c:inode_permission()
> >
> > that might be a problem during the read except possibly inode-
> > >i_mode.
> >
> >
> > I'll check on that ... maybe something's been missed when
> > kernfs_rwsem
> >
> > was changed to a separate lock (kernfs_iattr_rwsem).
> >
> >
> > >
> > > [...]
> > >
> > > [ 1738.053819][ T104] BUG: KCSAN: data-race in set_nlink /
> > > set_nlink
> > > [ 1738.058223][ T104]
> > > [ 1738.059865][ T104] read to 0xffff00000bab6918 of 4 bytes by
> > > task 108 on cpu 0:
> > > [ 1738.064916][ T104] set_nlink
> > > (/home/anders/src/kernel/next/fs/inode.c:369)
> > > [ 1738.067845][ T104] kernfs_refresh_inode
> > > (/home/anders/src/kernel/next/fs/kernfs/inode.c:180)
> > > [ 1738.071607][ T104] kernfs_iop_permission
> > > (/home/anders/src/kernel/next/fs/kernfs/inode.c:289)
> > > [ 1738.075467][ T104] inode_permission
> > > (/home/anders/src/kernel/next/fs/namei.c:461
> > > /home/anders/src/kernel/next/fs/namei.c:528)
> > > [ 1738.078868][ T104] link_path_walk
> > > (/home/anders/src/kernel/next/fs/namei.c:1720
> > > /home/anders/src/kernel/next/fs/namei.c:2267)
> > > [ 1738.082270][ T104] path_lookupat
> > > (/home/anders/src/kernel/next/fs/namei.c:2478 (discriminator 2))
> > > [ 1738.085488][ T104] filename_lookup
> > > (/home/anders/src/kernel/next/fs/namei.c:2508)
> > > [ 1738.089101][ T104] user_path_at_empty
> > > (/home/anders/src/kernel/next/fs/namei.c:2907)
> > > [ 1738.092469][ T104] do_readlinkat
> > > (/home/anders/src/kernel/next/fs/stat.c:477)
> > > [ 1738.095970][ T104] __arm64_sys_readlinkat
> > > (/home/anders/src/kernel/next/fs/stat.c:504
> > > /home/anders/src/kernel/next/fs/stat.c:501
> > > /home/anders/src/kernel/next/fs/stat.c:501)
> > > [ 1738.099529][ T104] el0_svc_common.constprop.0
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38
> > > /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52
> > > /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:139)
> > > [ 1738.103696][ T104] do_el0_svc
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:188)
> > > [ 1738.106560][ T104] el0_svc
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133
> > > /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:144
> > > /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:648)
> > > [ 1738.109613][ T104] el0t_64_sync_handler
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:666)
> > > [ 1738.113035][ T104] el0t_64_sync
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:591)
> > > [ 1738.116346][ T104]
> > > [ 1738.117924][ T104] 1 lock held by systemd-udevd/108:
> > > [ 1738.121580][ T104] #0: ffff000006681e08 (&root-
> > > >kernfs_iattr_rwsem){++++}-{3:3}, at: kernfs_iop_permission
> > > (/home/anders/src/kernel/next/fs/kernfs/inode.c:288)
> > > [ 1738.129355][ T104] irq event stamp: 31000
> > > [ 1738.132088][ T104] hardirqs last enabled at (31000):
> > > seqcount_lockdep_reader_access
> > > (/home/anders/src/kernel/next/include/linux/seqlock.h:105)
> > > [ 1738.139417][ T104] hardirqs last disabled at (30999):
> > > seqcount_lockdep_reader_access
> > > (/home/anders/src/kernel/next/include/linux/seqlock.h:104)
> > > [ 1738.146781][ T104] softirqs last enabled at (30973):
> > > fpsimd_restore_current_state
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264
> > > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1791)
> > > [ 1738.153891][ T104] softirqs last disabled at (30971):
> > > fpsimd_restore_current_state
> > > (/home/anders/src/kernel/next/include/linux/bottom_half.h:20
> > > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242
> > > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1784)
> > > [ 1738.161012][ T104]
> > > [ 1738.162663][ T104] write to 0xffff00000bab6918 of 4 bytes by
> > > task 104 on cpu 0:
> > > [ 1738.167730][ T104] set_nlink
> > > (/home/anders/src/kernel/next/fs/inode.c:372)
> > > [ 1738.170559][ T104] kernfs_refresh_inode
> > > (/home/anders/src/kernel/next/fs/kernfs/inode.c:180)
> > > [ 1738.174355][ T104] kernfs_iop_permission
> > > (/home/anders/src/kernel/next/fs/kernfs/inode.c:289)
> > > [ 1738.177829][ T104] inode_permission
> > > (/home/anders/src/kernel/next/fs/namei.c:461
> > > /home/anders/src/kernel/next/fs/namei.c:528)
> > > [ 1738.181403][ T104] link_path_walk
> > > (/home/anders/src/kernel/next/fs/namei.c:1720
> > > /home/anders/src/kernel/next/fs/namei.c:2267)
> > > [ 1738.184738][ T104] path_lookupat
> > > (/home/anders/src/kernel/next/fs/namei.c:2478 (discriminator 2))
> > > [ 1738.188268][ T104] filename_lookup
> > > (/home/anders/src/kernel/next/fs/namei.c:2508)
> > > [ 1738.191865][ T104] vfs_statx
> > > (/home/anders/src/kernel/next/fs/stat.c:238)
> > > [ 1738.196236][ T104] vfs_fstatat
> > > (/home/anders/src/kernel/next/fs/stat.c:276)
> > > [ 1738.200120][ T104] __do_sys_newfstatat
> > > (/home/anders/src/kernel/next/fs/stat.c:446)
> > > [ 1738.204095][ T104] __arm64_sys_newfstatat
> > > (/home/anders/src/kernel/next/fs/stat.c:440
> > > /home/anders/src/kernel/next/fs/stat.c:440)
> > > [ 1738.207676][ T104] el0_svc_common.constprop.0
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38
> > > /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52
> > > /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:139)
> > > [ 1738.211820][ T104] do_el0_svc
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:188)
> > > [ 1738.214815][ T104] el0_svc
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133
> > > /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:144
> > > /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:648)
> > > [ 1738.217709][ T104] el0t_64_sync_handler
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:666)
> > > [ 1738.221239][ T104] el0t_64_sync
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:591)
> > > [ 1738.224502][ T104]
> > > [ 1738.226090][ T104] 1 lock held by systemd-udevd/104:
> > > [ 1738.229747][ T104] #0: ffff000006681e08 (&root-
> > > >kernfs_iattr_rwsem){++++}-{3:3}, at: kernfs_iop_permission
> > > (/home/anders/src/kernel/next/fs/kernfs/inode.c:288)
> > > [ 1738.237504][ T104] irq event stamp: 108353
> > > [ 1738.240262][ T104] hardirqs last enabled at (108353):
> > > seqcount_lockdep_reader_access
> > > (/home/anders/src/kernel/next/include/linux/seqlock.h:105)
> > > [ 1738.247443][ T104] hardirqs last disabled at (108352):
> > > seqcount_lockdep_reader_access
> > > (/home/anders/src/kernel/next/include/linux/seqlock.h:104)
> > > [ 1738.254510][ T104] softirqs last enabled at (108326):
> > > fpsimd_restore_current_state
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264
> > > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1791)
> > > [ 1738.262187][ T104] softirqs last disabled at (108324):
> > > fpsimd_restore_current_state
> > > (/home/anders/src/kernel/next/include/linux/bottom_half.h:20
> > > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242
> > > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1784)
> > > [ 1738.270239][ T104]
> > > [ 1738.272140][ T104] Reported by Kernel Concurrency Sanitizer on:
> > > [ 1738.285185][ T104] Hardware name: linux,dummy-virt (DT)
> > > [ 1738.288703][ T104]
> > > ==================================================================
> >
> > This looks just like the original except a different lock is being
> >
> > used now.
> >
> >
> > The link count can't be less than two if set_nlink() is called.
> >
> >
> > Maybe I am missing something but the directory count is changed only
> >
> > while holding the root->kernfs_iattr_rwsem so the value used by
> >
> > set_nlink() will not change during concurrent calls to
> > refresh_inode().
> >
> > Still looks like a false positive, I'll check the write operations
> > again just to be sure.
>
> I do see a problem with recent changes.
>
> I'll send this off to Greg after I've done some testing (primarily just
> compile and function).
>
> Here's a patch which describes what I found.
>
> Comments are, of course, welcome, ;)
>
> kernfs: fix missing kernfs_iattr_rwsem locking
>
> From: Ian Kent <[email protected]>
>
> When the kernfs_iattr_rwsem was introduced a case was missed.
>
> The update of the kernfs directory node child count was also protected
> by the kernfs_rwsem and needs to be included in the change so that the
> child count (and so the inode n_link attribute) does not change while
> holding the rwsem for read.
>
> Fixes: 9caf696142 (kernfs: Introduce separate rwsem to protect inode
> attributes)
>
> Signed-off-by: Ian Kent <[email protected]>
> Cc: Anders Roxell <[email protected]>
> Cc: Imran Khan <[email protected]>
> Cc: Arnd Bergmann <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: Eric Sandeen <[email protected]>

Looks good.

Acked-by: Miklos Szeredi <[email protected]>

2023-07-27 01:00:23

by Ian Kent

[permalink] [raw]
Subject: Re: [PATCH 1/2] kernfs: dont take i_lock on inode attr read

On 20/7/23 10:03, Ian Kent wrote:
> On Wed, 2023-07-19 at 12:23 +0800, Ian Kent wrote:
>> On 19/7/23 03:00, Anders Roxell wrote:
>>> On 2023-01-23 11:11, Ian Kent wrote:
>>>> On 29/12/22 21:07, Ian Kent wrote:
>>>>> On 29/12/22 17:20, Arnd Bergmann wrote:
>>>>>> On Fri, Dec 23, 2022, at 00:11, Ian Kent wrote:
>>>>>>> On 21/12/22 21:34, Anders Roxell wrote:
>>>>>>>> On 2022-10-31 12:30, Tejun Heo wrote:
>>>>>>>>> On Tue, Oct 18, 2022 at 10:32:42AM +0800, Ian Kent
>>>>>>>>> wrote:
>>>>>>>>>> The kernfs write lock is held when the kernfs node
>>>>>>>>>> inode attributes
>>>>>>>>>> are updated. Therefore, when either
>>>>>>>>>> kernfs_iop_getattr() or
>>>>>>>>>> kernfs_iop_permission() are called the kernfs node
>>>>>>>>>> inode attributes
>>>>>>>>>> won't change.
>>>>>>>>>>
>>>>>>>>>> Consequently concurrent kernfs_refresh_inode() calls
>>>>>>>>>> always copy the
>>>>>>>>>> same values from the kernfs node.
>>>>>>>>>>
>>>>>>>>>> So there's no need to take the inode i_lock to get
>>>>>>>>>> consistent values
>>>>>>>>>> for generic_fillattr() and generic_permission(), the
>>>>>>>>>> kernfs read lock
>>>>>>>>>> is sufficient.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Ian Kent <[email protected]>
>>>>>>>>> Acked-by: Tejun Heo <[email protected]>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Building an allmodconfig arm64 kernel on yesterdays next-
>>>>>>>> 20221220 and
>>>>>>>> booting that in qemu I see the following "BUG: KCSAN:
>>>>>>>> data-race in
>>>>>>>> set_nlink / set_nlink".
>>>>>>> I'll check if I missed any places where set_link() could be
>>>>>>> called where the link count could be different.
>>>>>>>
>>>>>>>
>>>>>>> If there aren't any the question will then be can writing
>>>>>>> the
>>>>>>> same value to this location in multiple concurrent threads
>>>>>>> corrupt it?
>>>>>> I think the race that is getting reported for set_nlink()
>>>>>> is about this bit getting called simulatenously on multiple
>>>>>> CPUs with only the read lock held for the inode:
>>>>>>
>>>>>>        /* Yes, some filesystems do change nlink from zero to
>>>>>> one */
>>>>>>        if (inode->i_nlink == 0)
>>>>>> atomic_long_dec(&inode->i_sb->s_remove_count);
>>>>>>        inode->__i_nlink = nlink;
>>>>>>
>>>>>> Since i_nlink and __i_nlink refer to the same memory
>>>>>> location,
>>>>>> the 'inode->i_nlink == 0' check can be true for all of them
>>>>>> before the nonzero nlink value gets set, and this results in
>>>>>> s_remove_count being decremented more than once.
>>>>> Thanks for the comment Arnd.
>>>> Hello all,
>>>>
>>>>
>>>> I've been looking at this and after consulting Miklos and his
>>>> pointing
>>>>
>>>> out that it looks like a false positive the urgency dropped off a
>>>> bit. So
>>>>
>>>> apologies for taking so long to report back.
>>>>
>>>>
>>>> Anyway it needs some description of conclusions reached so far.
>>>>
>>>>
>>>> I'm still looking around but in short, kernfs will set
>>>> directories to <# of
>>>>
>>>> directory entries> + 2 unconditionally for directories. I can't
>>>> yet find
>>>>
>>>> any other places where i_nlink is set or changed and if there are
>>>> none
>>>>
>>>> then i_nlink will never be set to zero so the race should not
>>>> occur.
>>>>
>>>>
>>>> Consequently my claim is this is a real false positive.
>>>>
>>>>
>>>> There are the file system operations that may be passed at mount
>>>> time
>>>>
>>>> but given the way kernfs sets i_nlink it pretty much dictates
>>>> those
>>>> operations
>>>>
>>>> (if there were any that modify it and there don't appear to be
>>>> any) leave it
>>>>
>>>> alone.
>>>>
>>>>
>>>> So it just doesn't make sense for users of kernfs to fiddle with
>>>> i_nlink ...
>>> On todays next tag, next-20230718 this KCSAN BUG poped up again.
>>> When I
>>> built an allmodconfig arm64 kernel and booted it in QEMU. Full log
>>> can
>>> be found http://ix.io/4AUd
>>>
>>> [ 1694.987789][  T137] BUG: KCSAN: data-race in inode_permission /
>>> kernfs_refresh_inode
>>> [ 1694.992912][  T137]
>>> [ 1694.994532][  T137] write to 0xffff00000bab6070 of 2 bytes by
>>> task 104 on cpu 0:
>>> [ 1694.999269][ T137] kernfs_refresh_inode
>>> (/home/anders/src/kernel/next/fs/kernfs/inode.c:171)
>>> [ 1695.002707][ T137] kernfs_iop_permission
>>> (/home/anders/src/kernel/next/fs/kernfs/inode.c:289)
>>> [ 1695.006148][ T137] inode_permission
>>> (/home/anders/src/kernel/next/fs/namei.c:461
>>> /home/anders/src/kernel/next/fs/namei.c:528)
>>> [ 1695.009420][ T137] link_path_walk
>>> (/home/anders/src/kernel/next/fs/namei.c:1720
>>> /home/anders/src/kernel/next/fs/namei.c:2267)
>>> [ 1695.012643][ T137] path_lookupat
>>> (/home/anders/src/kernel/next/fs/namei.c:2478 (discriminator 2))
>>> [ 1695.015781][ T137] filename_lookup
>>> (/home/anders/src/kernel/next/fs/namei.c:2508)
>>> [ 1695.019059][ T137] vfs_statx
>>> (/home/anders/src/kernel/next/fs/stat.c:238)
>>> [ 1695.022024][ T137] vfs_fstatat
>>> (/home/anders/src/kernel/next/fs/stat.c:276)
>>> [ 1695.025067][ T137] __do_sys_newfstatat
>>> (/home/anders/src/kernel/next/fs/stat.c:446)
>>> [ 1695.028497][ T137] __arm64_sys_newfstatat
>>> (/home/anders/src/kernel/next/fs/stat.c:440
>>> /home/anders/src/kernel/next/fs/stat.c:440)
>>> [ 1695.032080][ T137] el0_svc_common.constprop.0
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38
>>> /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52
>>> /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:139)
>>> [ 1695.035916][ T137] do_el0_svc
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:188)
>>> [ 1695.038796][ T137] el0_svc
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133
>>> /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:144
>>> /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:648)
>>> [ 1695.041468][ T137] el0t_64_sync_handler
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:666)
>>> [ 1695.044889][ T137] el0t_64_sync
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:591)
>>> [ 1695.047904][  T137]
>>> [ 1695.049511][  T137] 1 lock held by systemd-udevd/104:
>>> [ 1695.052837][ T137] #0: ffff000006681e08 (&root-
>>>> kernfs_iattr_rwsem){++++}-{3:3}, at: kernfs_iop_permission
>>> (/home/anders/src/kernel/next/fs/kernfs/inode.c:288)
>>> [ 1695.060241][  T137] irq event stamp: 82902
>>> [ 1695.063006][ T137] hardirqs last enabled at (82901):
>>> _raw_spin_unlock_irqrestore
>>> (/home/anders/src/kernel/next/arch/arm64/include/asm/alternative-
>>> macros.h:250
>>> /home/anders/src/kernel/next/arch/arm64/include/asm/irqflags.h:27
>>> /home/anders/src/kernel/next/arch/arm64/include/asm/irqflags.h:140
>>> /home/anders/src/kernel/next/include/linux/spinlock_api_smp.h:151
>>> /home/anders/src/kernel/next/kernel/locking/spinlock.c:194)
>>> [ 1695.069673][ T137] hardirqs last disabled at (82902):
>>> el1_interrupt
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:472
>>> /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:488)
>>> [ 1695.075474][ T137] softirqs last enabled at (82792):
>>> fpsimd_restore_current_state
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264
>>> /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1791)
>>> [ 1695.082319][ T137] softirqs last disabled at (82790):
>>> fpsimd_restore_current_state
>>> (/home/anders/src/kernel/next/include/linux/bottom_half.h:20
>>> /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242
>>> /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1784)
>>> [ 1695.089049][  T137]
>>> [ 1695.090659][  T137] read to 0xffff00000bab6070 of 2 bytes by
>>> task 137 on cpu 0:
>>> [ 1695.095374][ T137] inode_permission
>>> (/home/anders/src/kernel/next/fs/namei.c:532)
>>> [ 1695.098655][ T137] link_path_walk
>>> (/home/anders/src/kernel/next/fs/namei.c:1720
>>> /home/anders/src/kernel/next/fs/namei.c:2267)
>>> [ 1695.101857][ T137] path_openat
>>> (/home/anders/src/kernel/next/fs/namei.c:3789 (discriminator 2))
>>> [ 1695.104885][ T137] do_filp_open
>>> (/home/anders/src/kernel/next/fs/namei.c:3820)
>>> [ 1695.108006][ T137] do_sys_openat2
>>> (/home/anders/src/kernel/next/fs/open.c:1418)
>>> [ 1695.111290][ T137] __arm64_sys_openat
>>> (/home/anders/src/kernel/next/fs/open.c:1433)
>>> [ 1695.114825][ T137] el0_svc_common.constprop.0
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38
>>> /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52
>>> /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:139)
>>> [ 1695.118662][ T137] do_el0_svc
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:188)
>>> [ 1695.121555][ T137] el0_svc
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133
>>> /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:144
>>> /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:648)
>>> [ 1695.124207][ T137] el0t_64_sync_handler
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:666)
>>> [ 1695.127590][ T137] el0t_64_sync
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:591)
>>> [ 1695.130641][  T137]
>>> [ 1695.132241][  T137] no locks held by systemd-udevd/137.
>>> [ 1695.135618][  T137] irq event stamp: 3246
>>> [ 1695.138519][ T137] hardirqs last enabled at (3245):
>>> seqcount_lockdep_reader_access
>>> (/home/anders/src/kernel/next/include/linux/seqlock.h:105)
>>> [ 1695.145825][ T137] hardirqs last disabled at (3246):
>>> el1_interrupt
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:472
>>> /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:488)
>>> [ 1695.151942][ T137] softirqs last enabled at (3208):
>>> fpsimd_restore_current_state
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264
>>> /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1791)
>>> [ 1695.158950][ T137] softirqs last disabled at (3206):
>>> fpsimd_restore_current_state
>>> (/home/anders/src/kernel/next/include/linux/bottom_half.h:20
>>> /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242
>>> /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1784)
>>> [ 1695.166036][  T137]
>>> [ 1695.167621][  T137] Reported by Kernel Concurrency Sanitizer on:
>>> [ 1695.179990][  T137] Hardware name: linux,dummy-virt (DT)
>>> [ 1695.183687][  T137]
>>> ==================================================================
>>
>> This one is different to the original.
>>
>>
>> I can't see where the problem is here, can someone help me out
>>
>> please.
>>
>>
>> I don't see any shared data values used by the call
>>
>> devcgroup_inode_permission(inode, mask) in
>> fs/namei.c:inode_permission()
>>
>> that might be a problem during the read except possibly inode-
>>> i_mode.
>>
>> I'll check on that ... maybe something's been missed when
>> kernfs_rwsem
>>
>> was changed to a separate lock (kernfs_iattr_rwsem).
>>
>>
>>> [...]
>>>
>>> [ 1738.053819][  T104] BUG: KCSAN: data-race in set_nlink /
>>> set_nlink
>>> [ 1738.058223][  T104]
>>> [ 1738.059865][  T104] read to 0xffff00000bab6918 of 4 bytes by
>>> task 108 on cpu 0:
>>> [ 1738.064916][ T104] set_nlink
>>> (/home/anders/src/kernel/next/fs/inode.c:369)
>>> [ 1738.067845][ T104] kernfs_refresh_inode
>>> (/home/anders/src/kernel/next/fs/kernfs/inode.c:180)
>>> [ 1738.071607][ T104] kernfs_iop_permission
>>> (/home/anders/src/kernel/next/fs/kernfs/inode.c:289)
>>> [ 1738.075467][ T104] inode_permission
>>> (/home/anders/src/kernel/next/fs/namei.c:461
>>> /home/anders/src/kernel/next/fs/namei.c:528)
>>> [ 1738.078868][ T104] link_path_walk
>>> (/home/anders/src/kernel/next/fs/namei.c:1720
>>> /home/anders/src/kernel/next/fs/namei.c:2267)
>>> [ 1738.082270][ T104] path_lookupat
>>> (/home/anders/src/kernel/next/fs/namei.c:2478 (discriminator 2))
>>> [ 1738.085488][ T104] filename_lookup
>>> (/home/anders/src/kernel/next/fs/namei.c:2508)
>>> [ 1738.089101][ T104] user_path_at_empty
>>> (/home/anders/src/kernel/next/fs/namei.c:2907)
>>> [ 1738.092469][ T104] do_readlinkat
>>> (/home/anders/src/kernel/next/fs/stat.c:477)
>>> [ 1738.095970][ T104] __arm64_sys_readlinkat
>>> (/home/anders/src/kernel/next/fs/stat.c:504
>>> /home/anders/src/kernel/next/fs/stat.c:501
>>> /home/anders/src/kernel/next/fs/stat.c:501)
>>> [ 1738.099529][ T104] el0_svc_common.constprop.0
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38
>>> /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52
>>> /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:139)
>>> [ 1738.103696][ T104] do_el0_svc
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:188)
>>> [ 1738.106560][ T104] el0_svc
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133
>>> /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:144
>>> /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:648)
>>> [ 1738.109613][ T104] el0t_64_sync_handler
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:666)
>>> [ 1738.113035][ T104] el0t_64_sync
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:591)
>>> [ 1738.116346][  T104]
>>> [ 1738.117924][  T104] 1 lock held by systemd-udevd/108:
>>> [ 1738.121580][ T104] #0: ffff000006681e08 (&root-
>>>> kernfs_iattr_rwsem){++++}-{3:3}, at: kernfs_iop_permission
>>> (/home/anders/src/kernel/next/fs/kernfs/inode.c:288)
>>> [ 1738.129355][  T104] irq event stamp: 31000
>>> [ 1738.132088][ T104] hardirqs last enabled at (31000):
>>> seqcount_lockdep_reader_access
>>> (/home/anders/src/kernel/next/include/linux/seqlock.h:105)
>>> [ 1738.139417][ T104] hardirqs last disabled at (30999):
>>> seqcount_lockdep_reader_access
>>> (/home/anders/src/kernel/next/include/linux/seqlock.h:104)
>>> [ 1738.146781][ T104] softirqs last enabled at (30973):
>>> fpsimd_restore_current_state
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264
>>> /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1791)
>>> [ 1738.153891][ T104] softirqs last disabled at (30971):
>>> fpsimd_restore_current_state
>>> (/home/anders/src/kernel/next/include/linux/bottom_half.h:20
>>> /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242
>>> /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1784)
>>> [ 1738.161012][  T104]
>>> [ 1738.162663][  T104] write to 0xffff00000bab6918 of 4 bytes by
>>> task 104 on cpu 0:
>>> [ 1738.167730][ T104] set_nlink
>>> (/home/anders/src/kernel/next/fs/inode.c:372)
>>> [ 1738.170559][ T104] kernfs_refresh_inode
>>> (/home/anders/src/kernel/next/fs/kernfs/inode.c:180)
>>> [ 1738.174355][ T104] kernfs_iop_permission
>>> (/home/anders/src/kernel/next/fs/kernfs/inode.c:289)
>>> [ 1738.177829][ T104] inode_permission
>>> (/home/anders/src/kernel/next/fs/namei.c:461
>>> /home/anders/src/kernel/next/fs/namei.c:528)
>>> [ 1738.181403][ T104] link_path_walk
>>> (/home/anders/src/kernel/next/fs/namei.c:1720
>>> /home/anders/src/kernel/next/fs/namei.c:2267)
>>> [ 1738.184738][ T104] path_lookupat
>>> (/home/anders/src/kernel/next/fs/namei.c:2478 (discriminator 2))
>>> [ 1738.188268][ T104] filename_lookup
>>> (/home/anders/src/kernel/next/fs/namei.c:2508)
>>> [ 1738.191865][ T104] vfs_statx
>>> (/home/anders/src/kernel/next/fs/stat.c:238)
>>> [ 1738.196236][ T104] vfs_fstatat
>>> (/home/anders/src/kernel/next/fs/stat.c:276)
>>> [ 1738.200120][ T104] __do_sys_newfstatat
>>> (/home/anders/src/kernel/next/fs/stat.c:446)
>>> [ 1738.204095][ T104] __arm64_sys_newfstatat
>>> (/home/anders/src/kernel/next/fs/stat.c:440
>>> /home/anders/src/kernel/next/fs/stat.c:440)
>>> [ 1738.207676][ T104] el0_svc_common.constprop.0
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38
>>> /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52
>>> /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:139)
>>> [ 1738.211820][ T104] do_el0_svc
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:188)
>>> [ 1738.214815][ T104] el0_svc
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133
>>> /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:144
>>> /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:648)
>>> [ 1738.217709][ T104] el0t_64_sync_handler
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:666)
>>> [ 1738.221239][ T104] el0t_64_sync
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:591)
>>> [ 1738.224502][  T104]
>>> [ 1738.226090][  T104] 1 lock held by systemd-udevd/104:
>>> [ 1738.229747][ T104] #0: ffff000006681e08 (&root-
>>>> kernfs_iattr_rwsem){++++}-{3:3}, at: kernfs_iop_permission
>>> (/home/anders/src/kernel/next/fs/kernfs/inode.c:288)
>>> [ 1738.237504][  T104] irq event stamp: 108353
>>> [ 1738.240262][ T104] hardirqs last enabled at (108353):
>>> seqcount_lockdep_reader_access
>>> (/home/anders/src/kernel/next/include/linux/seqlock.h:105)
>>> [ 1738.247443][ T104] hardirqs last disabled at (108352):
>>> seqcount_lockdep_reader_access
>>> (/home/anders/src/kernel/next/include/linux/seqlock.h:104)
>>> [ 1738.254510][ T104] softirqs last enabled at (108326):
>>> fpsimd_restore_current_state
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264
>>> /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1791)
>>> [ 1738.262187][ T104] softirqs last disabled at (108324):
>>> fpsimd_restore_current_state
>>> (/home/anders/src/kernel/next/include/linux/bottom_half.h:20
>>> /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242
>>> /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1784)
>>> [ 1738.270239][  T104]
>>> [ 1738.272140][  T104] Reported by Kernel Concurrency Sanitizer on:
>>> [ 1738.285185][  T104] Hardware name: linux,dummy-virt (DT)
>>> [ 1738.288703][  T104]
>>> ==================================================================
>> This looks just like the original except a different lock is being
>>
>> used now.
>>
>>
>> The link count can't be less than two if set_nlink() is called.
>>
>>
>> Maybe I am missing something but the directory count is changed only
>>
>> while holding the root->kernfs_iattr_rwsem so the value used by
>>
>> set_nlink() will not change during concurrent calls to
>> refresh_inode().
>>
>> Still looks like a false positive, I'll check the write operations
>> again just to be sure.
> I do see a problem with recent changes.
>
> I'll send this off to Greg after I've done some testing (primarily just
> compile and function).
>
> Here's a patch which describes what I found.
>
> Comments are, of course, welcome, ;)

Anders I was hoping you would check if/what lockdep trace

you get with this patch.


Imran, I was hoping you would comment on my change as it

relates to the kernfs_iattr_rwsem changes.


Ian

>
> kernfs: fix missing kernfs_iattr_rwsem locking
>
> From: Ian Kent <[email protected]>
>
> When the kernfs_iattr_rwsem was introduced a case was missed.
>
> The update of the kernfs directory node child count was also protected
> by the kernfs_rwsem and needs to be included in the change so that the
> child count (and so the inode n_link attribute) does not change while
> holding the rwsem for read.
>
> Fixes: 9caf696142 (kernfs: Introduce separate rwsem to protect inode
> attributes)
>
> Signed-off-by: Ian Kent <[email protected]>
> Cc: Anders Roxell <[email protected]>
> Cc: Imran Khan <[email protected]>
> Cc: Arnd Bergmann <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: Eric Sandeen <[email protected]>
> ---
> fs/kernfs/dir.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index 45b6919903e6..6e84bb69602e 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -383,9 +383,11 @@ static int kernfs_link_sibling(struct kernfs_node
> *kn)
> rb_insert_color(&kn->rb, &kn->parent->dir.children);
>
> /* successfully added, account subdir number */
> + down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
> if (kernfs_type(kn) == KERNFS_DIR)
> kn->parent->dir.subdirs++;
> kernfs_inc_rev(kn->parent);
> + up_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>
> return 0;
> }
> @@ -408,9 +410,11 @@ static bool kernfs_unlink_sibling(struct
> kernfs_node *kn)
> if (RB_EMPTY_NODE(&kn->rb))
> return false;
>
> + down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
> if (kernfs_type(kn) == KERNFS_DIR)
> kn->parent->dir.subdirs--;
> kernfs_inc_rev(kn->parent);
> + up_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>
> rb_erase(&kn->rb, &kn->parent->dir.children);
> RB_CLEAR_NODE(&kn->rb);
>

2023-07-27 04:36:17

by Imran Khan

[permalink] [raw]
Subject: Re: [PATCH 1/2] kernfs: dont take i_lock on inode attr read

Hello Ian,
Sorry for late reply. I was about to reply this week.

On 27/7/2023 10:38 am, Ian Kent wrote:
> On 20/7/23 10:03, Ian Kent wrote:
>> On Wed, 2023-07-19 at 12:23 +0800, Ian Kent wrote:

[...]
>> I do see a problem with recent changes.
>>
>> I'll send this off to Greg after I've done some testing (primarily just
>> compile and function).
>>
>> Here's a patch which describes what I found.
>>
>> Comments are, of course, welcome, ;)
>
> Anders I was hoping you would check if/what lockdep trace
>
> you get with this patch.
>
>
> Imran, I was hoping you would comment on my change as it
>
> relates to the kernfs_iattr_rwsem changes.
>
>
> Ian
>
>>
>> kernfs: fix missing kernfs_iattr_rwsem locking
>>
>> From: Ian Kent <[email protected]>
>>
>> When the kernfs_iattr_rwsem was introduced a case was missed.
>>
>> The update of the kernfs directory node child count was also protected
>> by the kernfs_rwsem and needs to be included in the change so that the
>> child count (and so the inode n_link attribute) does not change while
>> holding the rwsem for read.
>>

kernfs direcytory node's child count changes in kernfs_(un)link_sibling and
these are getting invoked while adding (kernfs_add_one),
removing(__kernfs_remove) or moving (kernfs_rename_ns)a node. Each of these
operations proceed under kernfs_rwsem and I see each invocation of
kernfs_link/unlink_sibling during the above mentioned operations is happening
under kernfs_rwsem.
So the child count should still be protected by kernfs_rwsem and we should not
need to acquire kernfs_iattr_rwsem in kernfs_link/unlink_sibling.

Kindly let me know your thoughts. I would still like to see new lockdep traces
with this change.

Thanks,
Imran

>> Fixes: 9caf696142 (kernfs: Introduce separate rwsem to protect inode
>> attributes)
>>
>> Signed-off-by: Ian Kent <[email protected]>
>> Cc: Anders Roxell <[email protected]>
>> Cc: Imran Khan <[email protected]>
>> Cc: Arnd Bergmann <[email protected]>
>> Cc: Minchan Kim <[email protected]>
>> Cc: Eric Sandeen <[email protected]>
>> ---
>>   fs/kernfs/dir.c |    4 ++++
>>   1 file changed, 4 insertions(+)
>>
>> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
>> index 45b6919903e6..6e84bb69602e 100644
>> --- a/fs/kernfs/dir.c
>> +++ b/fs/kernfs/dir.c
>> @@ -383,9 +383,11 @@ static int kernfs_link_sibling(struct kernfs_node
>> *kn)
>>       rb_insert_color(&kn->rb, &kn->parent->dir.children);
>>         /* successfully added, account subdir number */
>> +    down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>       if (kernfs_type(kn) == KERNFS_DIR)
>>           kn->parent->dir.subdirs++;
>>       kernfs_inc_rev(kn->parent);
>> +    up_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>         return 0;
>>   }
>> @@ -408,9 +410,11 @@ static bool kernfs_unlink_sibling(struct
>> kernfs_node *kn)
>>       if (RB_EMPTY_NODE(&kn->rb))
>>           return false;
>>   +    down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>       if (kernfs_type(kn) == KERNFS_DIR)
>>           kn->parent->dir.subdirs--;
>>       kernfs_inc_rev(kn->parent);
>> +    up_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>         rb_erase(&kn->rb, &kn->parent->dir.children);
>>       RB_CLEAR_NODE(&kn->rb);
>>

2023-07-27 07:08:09

by Imran Khan

[permalink] [raw]
Subject: Re: [PATCH 1/2] kernfs: dont take i_lock on inode attr read

Hello again Ian,
I take back my previous comment :).

On 27/7/2023 2:30 pm, Imran Khan wrote:
> Hello Ian,
> Sorry for late reply. I was about to reply this week.
>
> On 27/7/2023 10:38 am, Ian Kent wrote:
>> On 20/7/23 10:03, Ian Kent wrote:
>>> On Wed, 2023-07-19 at 12:23 +0800, Ian Kent wrote:
>
> [...]
>>> I do see a problem with recent changes.
>>>
>>> I'll send this off to Greg after I've done some testing (primarily just
>>> compile and function).
>>>
>>> Here's a patch which describes what I found.
>>>
>>> Comments are, of course, welcome, ;)
>>
>> Anders I was hoping you would check if/what lockdep trace
>>
>> you get with this patch.
>>
>>
>> Imran, I was hoping you would comment on my change as it
>>
>> relates to the kernfs_iattr_rwsem changes.
>>
>>
>> Ian
>>
>>>
>>> kernfs: fix missing kernfs_iattr_rwsem locking
>>>
>>> From: Ian Kent <[email protected]>
>>>
>>> When the kernfs_iattr_rwsem was introduced a case was missed.
>>>
>>> The update of the kernfs directory node child count was also protected
>>> by the kernfs_rwsem and needs to be included in the change so that the
>>> child count (and so the inode n_link attribute) does not change while
>>> holding the rwsem for read.
>>>
>
> kernfs direcytory node's child count changes in kernfs_(un)link_sibling and
> these are getting invoked while adding (kernfs_add_one),
> removing(__kernfs_remove) or moving (kernfs_rename_ns)a node. Each of these
> operations proceed under kernfs_rwsem and I see each invocation of
> kernfs_link/unlink_sibling during the above mentioned operations is happening
> under kernfs_rwsem.
> So the child count should still be protected by kernfs_rwsem and we should not
> need to acquire kernfs_iattr_rwsem in kernfs_link/unlink_sibling.
>

kernfs_refresh_inode can still race against kernfs_link/unlink_siblings. So your
change looks good to me.
My tests are not showing any issues either. ( I tested on 4.14 and 5.4 kernels
as well).

Fee free to add my RB.

Reviewed-by: Imran Khan <[email protected]>

> Kindly let me know your thoughts. I would still like to see new lockdep traces
> with this change.
>
> Thanks,
> Imran
>
>>> Fixes: 9caf696142 (kernfs: Introduce separate rwsem to protect inode
>>> attributes)
>>>
>>> Signed-off-by: Ian Kent <[email protected]>
>>> Cc: Anders Roxell <[email protected]>
>>> Cc: Imran Khan <[email protected]>
>>> Cc: Arnd Bergmann <[email protected]>
>>> Cc: Minchan Kim <[email protected]>
>>> Cc: Eric Sandeen <[email protected]>
>>> ---
>>>   fs/kernfs/dir.c |    4 ++++
>>>   1 file changed, 4 insertions(+)
>>>
>>> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
>>> index 45b6919903e6..6e84bb69602e 100644
>>> --- a/fs/kernfs/dir.c
>>> +++ b/fs/kernfs/dir.c
>>> @@ -383,9 +383,11 @@ static int kernfs_link_sibling(struct kernfs_node
>>> *kn)
>>>       rb_insert_color(&kn->rb, &kn->parent->dir.children);
>>>         /* successfully added, account subdir number */
>>> +    down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>>       if (kernfs_type(kn) == KERNFS_DIR)
>>>           kn->parent->dir.subdirs++;
>>>       kernfs_inc_rev(kn->parent);
>>> +    up_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>>         return 0;
>>>   }
>>> @@ -408,9 +410,11 @@ static bool kernfs_unlink_sibling(struct
>>> kernfs_node *kn)
>>>       if (RB_EMPTY_NODE(&kn->rb))
>>>           return false;
>>>   +    down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>>       if (kernfs_type(kn) == KERNFS_DIR)
>>>           kn->parent->dir.subdirs--;
>>>       kernfs_inc_rev(kn->parent);
>>> +    up_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>>         rb_erase(&kn->rb, &kn->parent->dir.children);
>>>       RB_CLEAR_NODE(&kn->rb);
>>>

2023-07-28 01:16:36

by Ian Kent

[permalink] [raw]
Subject: Re: [PATCH 1/2] kernfs: dont take i_lock on inode attr read

On 28/7/23 08:00, Ian Kent wrote:
> On 27/7/23 12:30, Imran Khan wrote:
>> Hello Ian,
>> Sorry for late reply. I was about to reply this week.
>>
>> On 27/7/2023 10:38 am, Ian Kent wrote:
>>> On 20/7/23 10:03, Ian Kent wrote:
>>>> On Wed, 2023-07-19 at 12:23 +0800, Ian Kent wrote:
>> [...]
>>>> I do see a problem with recent changes.
>>>>
>>>> I'll send this off to Greg after I've done some testing (primarily
>>>> just
>>>> compile and function).
>>>>
>>>> Here's a patch which describes what I found.
>>>>
>>>> Comments are, of course, welcome, ;)
>>> Anders I was hoping you would check if/what lockdep trace
>>>
>>> you get with this patch.
>>>
>>>
>>> Imran, I was hoping you would comment on my change as it
>>>
>>> relates to the kernfs_iattr_rwsem changes.
>>>
>>>
>>> Ian
>>>
>>>> kernfs: fix missing kernfs_iattr_rwsem locking
>>>>
>>>> From: Ian Kent <[email protected]>
>>>>
>>>> When the kernfs_iattr_rwsem was introduced a case was missed.
>>>>
>>>> The update of the kernfs directory node child count was also protected
>>>> by the kernfs_rwsem and needs to be included in the change so that the
>>>> child count (and so the inode n_link attribute) does not change while
>>>> holding the rwsem for read.
>>>>
>> kernfs direcytory node's child count changes in
>> kernfs_(un)link_sibling and
>> these are getting invoked while adding (kernfs_add_one),
>> removing(__kernfs_remove) or moving (kernfs_rename_ns)a node. Each of
>> these
>> operations proceed under kernfs_rwsem and I see each invocation of
>> kernfs_link/unlink_sibling during the above mentioned operations is
>> happening
>> under kernfs_rwsem.
>> So the child count should still be protected by kernfs_rwsem and we
>> should not
>> need to acquire kernfs_iattr_rwsem in kernfs_link/unlink_sibling.
>
> Yes, that's exactly what I intended (assuming you mean write lock in
> those cases)
>
> when I did it so now I wonder what I saw that lead to my patch, I'll
> need to look
>
> again ...

Ahh, I see why I thought this ...

It's the hunk:

@@ -285,10 +285,10 @@ int kernfs_iop_permission(struct mnt_idmap *idmap,
        kn = inode->i_private;
        root = kernfs_root(kn);

-       down_read(&root->kernfs_rwsem);
+       down_read(&root->kernfs_iattr_rwsem);
        kernfs_refresh_inode(kn, inode);
        ret = generic_permission(&nop_mnt_idmap, inode, mask);
-       up_read(&root->kernfs_rwsem);
+       up_read(&root->kernfs_iattr_rwsem);

        return ret;
 }

which takes away the kernfs_rwsem and introduces the possibility of

the change. It may be more instructive to add back taking the read

lock of kernfs_rwsem in .permission() than altering the sibling link

and unlink functions, I mean I even caught myself on it.


Ian

>
>
>>
>> Kindly let me know your thoughts. I would still like to see new
>> lockdep traces
>> with this change.
>
> Indeed, I hope Anders can find time to get the trace.
>
>
> Ian
>
>>
>> Thanks,
>> Imran
>>
>>>> Fixes: 9caf696142 (kernfs: Introduce separate rwsem to protect inode
>>>> attributes)
>>>>
>>>> Signed-off-by: Ian Kent <[email protected]>
>>>> Cc: Anders Roxell <[email protected]>
>>>> Cc: Imran Khan <[email protected]>
>>>> Cc: Arnd Bergmann <[email protected]>
>>>> Cc: Minchan Kim <[email protected]>
>>>> Cc: Eric Sandeen <[email protected]>
>>>> ---
>>>>    fs/kernfs/dir.c |    4 ++++
>>>>    1 file changed, 4 insertions(+)
>>>>
>>>> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
>>>> index 45b6919903e6..6e84bb69602e 100644
>>>> --- a/fs/kernfs/dir.c
>>>> +++ b/fs/kernfs/dir.c
>>>> @@ -383,9 +383,11 @@ static int kernfs_link_sibling(struct kernfs_node
>>>> *kn)
>>>>        rb_insert_color(&kn->rb, &kn->parent->dir.children);
>>>>          /* successfully added, account subdir number */
>>>> + down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>>>        if (kernfs_type(kn) == KERNFS_DIR)
>>>>            kn->parent->dir.subdirs++;
>>>>        kernfs_inc_rev(kn->parent);
>>>> +    up_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>>>          return 0;
>>>>    }
>>>> @@ -408,9 +410,11 @@ static bool kernfs_unlink_sibling(struct
>>>> kernfs_node *kn)
>>>>        if (RB_EMPTY_NODE(&kn->rb))
>>>>            return false;
>>>>    + down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>>>        if (kernfs_type(kn) == KERNFS_DIR)
>>>>            kn->parent->dir.subdirs--;
>>>>        kernfs_inc_rev(kn->parent);
>>>> +    up_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>>>          rb_erase(&kn->rb, &kn->parent->dir.children);
>>>>        RB_CLEAR_NODE(&kn->rb);
>>>>

2023-07-28 01:21:55

by Ian Kent

[permalink] [raw]
Subject: Re: [PATCH 1/2] kernfs: dont take i_lock on inode attr read

On 27/7/23 12:30, Imran Khan wrote:
> Hello Ian,
> Sorry for late reply. I was about to reply this week.
>
> On 27/7/2023 10:38 am, Ian Kent wrote:
>> On 20/7/23 10:03, Ian Kent wrote:
>>> On Wed, 2023-07-19 at 12:23 +0800, Ian Kent wrote:
> [...]
>>> I do see a problem with recent changes.
>>>
>>> I'll send this off to Greg after I've done some testing (primarily just
>>> compile and function).
>>>
>>> Here's a patch which describes what I found.
>>>
>>> Comments are, of course, welcome, ;)
>> Anders I was hoping you would check if/what lockdep trace
>>
>> you get with this patch.
>>
>>
>> Imran, I was hoping you would comment on my change as it
>>
>> relates to the kernfs_iattr_rwsem changes.
>>
>>
>> Ian
>>
>>> kernfs: fix missing kernfs_iattr_rwsem locking
>>>
>>> From: Ian Kent <[email protected]>
>>>
>>> When the kernfs_iattr_rwsem was introduced a case was missed.
>>>
>>> The update of the kernfs directory node child count was also protected
>>> by the kernfs_rwsem and needs to be included in the change so that the
>>> child count (and so the inode n_link attribute) does not change while
>>> holding the rwsem for read.
>>>
> kernfs direcytory node's child count changes in kernfs_(un)link_sibling and
> these are getting invoked while adding (kernfs_add_one),
> removing(__kernfs_remove) or moving (kernfs_rename_ns)a node. Each of these
> operations proceed under kernfs_rwsem and I see each invocation of
> kernfs_link/unlink_sibling during the above mentioned operations is happening
> under kernfs_rwsem.
> So the child count should still be protected by kernfs_rwsem and we should not
> need to acquire kernfs_iattr_rwsem in kernfs_link/unlink_sibling.

Yes, that's exactly what I intended (assuming you mean write lock in
those cases)

when I did it so now I wonder what I saw that lead to my patch, I'll
need to look

again ...


>
> Kindly let me know your thoughts. I would still like to see new lockdep traces
> with this change.

Indeed, I hope Anders can find time to get the trace.


Ian

>
> Thanks,
> Imran
>
>>> Fixes: 9caf696142 (kernfs: Introduce separate rwsem to protect inode
>>> attributes)
>>>
>>> Signed-off-by: Ian Kent <[email protected]>
>>> Cc: Anders Roxell <[email protected]>
>>> Cc: Imran Khan <[email protected]>
>>> Cc: Arnd Bergmann <[email protected]>
>>> Cc: Minchan Kim <[email protected]>
>>> Cc: Eric Sandeen <[email protected]>
>>> ---
>>>   fs/kernfs/dir.c |    4 ++++
>>>   1 file changed, 4 insertions(+)
>>>
>>> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
>>> index 45b6919903e6..6e84bb69602e 100644
>>> --- a/fs/kernfs/dir.c
>>> +++ b/fs/kernfs/dir.c
>>> @@ -383,9 +383,11 @@ static int kernfs_link_sibling(struct kernfs_node
>>> *kn)
>>>       rb_insert_color(&kn->rb, &kn->parent->dir.children);
>>>         /* successfully added, account subdir number */
>>> +    down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>>       if (kernfs_type(kn) == KERNFS_DIR)
>>>           kn->parent->dir.subdirs++;
>>>       kernfs_inc_rev(kn->parent);
>>> +    up_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>>         return 0;
>>>   }
>>> @@ -408,9 +410,11 @@ static bool kernfs_unlink_sibling(struct
>>> kernfs_node *kn)
>>>       if (RB_EMPTY_NODE(&kn->rb))
>>>           return false;
>>>   +    down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>>       if (kernfs_type(kn) == KERNFS_DIR)
>>>           kn->parent->dir.subdirs--;
>>>       kernfs_inc_rev(kn->parent);
>>> +    up_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>>         rb_erase(&kn->rb, &kn->parent->dir.children);
>>>       RB_CLEAR_NODE(&kn->rb);
>>>

2023-07-28 01:33:34

by Ian Kent

[permalink] [raw]
Subject: Re: [PATCH 1/2] kernfs: dont take i_lock on inode attr read

On 28/7/23 09:06, Imran Khan wrote:
> Hello Ian,
>
> On 28/7/2023 10:16 am, Ian Kent wrote:
>> On 28/7/23 08:00, Ian Kent wrote:
>>> On 27/7/23 12:30, Imran Khan wrote:
>>>> Hello Ian,
>>>> Sorry for late reply. I was about to reply this week.
>>>>
>>>> On 27/7/2023 10:38 am, Ian Kent wrote:
>>>>> On 20/7/23 10:03, Ian Kent wrote:
>>>>>> On Wed, 2023-07-19 at 12:23 +0800, Ian Kent wrote:
>>>> [...]
>>>>>> I do see a problem with recent changes.
>>>>>>
>>>>>> I'll send this off to Greg after I've done some testing (primarily just
>>>>>> compile and function).
>>>>>>
>>>>>> Here's a patch which describes what I found.
>>>>>>
>>>>>> Comments are, of course, welcome, ;)
>>>>> Anders I was hoping you would check if/what lockdep trace
>>>>>
>>>>> you get with this patch.
>>>>>
>>>>>
>>>>> Imran, I was hoping you would comment on my change as it
>>>>>
>>>>> relates to the kernfs_iattr_rwsem changes.
>>>>>
>>>>>
>>>>> Ian
>>>>>
>>>>>> kernfs: fix missing kernfs_iattr_rwsem locking
>>>>>>
>>>>>> From: Ian Kent <[email protected]>
>>>>>>
>>>>>> When the kernfs_iattr_rwsem was introduced a case was missed.
>>>>>>
>>>>>> The update of the kernfs directory node child count was also protected
>>>>>> by the kernfs_rwsem and needs to be included in the change so that the
>>>>>> child count (and so the inode n_link attribute) does not change while
>>>>>> holding the rwsem for read.
>>>>>>
>>>> kernfs direcytory node's child count changes in kernfs_(un)link_sibling and
>>>> these are getting invoked while adding (kernfs_add_one),
>>>> removing(__kernfs_remove) or moving (kernfs_rename_ns)a node. Each of these
>>>> operations proceed under kernfs_rwsem and I see each invocation of
>>>> kernfs_link/unlink_sibling during the above mentioned operations is happening
>>>> under kernfs_rwsem.
>>>> So the child count should still be protected by kernfs_rwsem and we should not
>>>> need to acquire kernfs_iattr_rwsem in kernfs_link/unlink_sibling.
>>> Yes, that's exactly what I intended (assuming you mean write lock in those cases)
>>>
>>> when I did it so now I wonder what I saw that lead to my patch, I'll need to look
>>>
>>> again ...
>> Ahh, I see why I thought this ...
>>
>> It's the hunk:
>>
>> @@ -285,10 +285,10 @@ int kernfs_iop_permission(struct mnt_idmap *idmap,
>>         kn = inode->i_private;
>>         root = kernfs_root(kn);
>>
>> -       down_read(&root->kernfs_rwsem);
>> +       down_read(&root->kernfs_iattr_rwsem);
>>         kernfs_refresh_inode(kn, inode);
>>         ret = generic_permission(&nop_mnt_idmap, inode, mask);
>> -       up_read(&root->kernfs_rwsem);
>> +       up_read(&root->kernfs_iattr_rwsem);
>>
>>         return ret;
>>  }
>>
>> which takes away the kernfs_rwsem and introduces the possibility of
>>
>> the change. It may be more instructive to add back taking the read
>>
>> lock of kernfs_rwsem in .permission() than altering the sibling link
>>
>> and unlink functions, I mean I even caught myself on it.
>>
> Yes this was the block I referred to in my second comment [1]. However adding
> back read lock of kernfs_rwsem in .permission() will again introduce the
> bottleneck that I mentioned at [2]. So I think taking taking the locks in
> link/unlink is the better option.

Yes, sorry, I always fall into the trap of not reading through all my

mail before commenting, oops!


The fact that .permission() is called so much more than the create/remove

functions also occurred to be me too just after I posted my comment (and

is probably why I originally did it that way).


I'll forward the patch for merge but would really like to see the lockdep

trace so I'll wait a little while ...

> I understand having one lock to synchronize everything makes it easier
> debug/development wise but sometimes (such as the case mentioned in [2]), it is
> not optimum performance wise.

Indeed, the performance improvement work that has been happening over

quite some time now is very good.


I had seen some opportunities around the file open path long ago but

hadn't got to doing anything there as you have, the work looks good

to me, thanks for doing it.


Ian

> Thoughts ?
>
> Thanks,
> Imran
>
> [1]: https://lore.kernel.org/all/[email protected]/
> [2]: https://lore.kernel.org/all/[email protected]/
>> Ian
>>
>>>
>>>> Kindly let me know your thoughts. I would still like to see new lockdep traces
>>>> with this change.
>>> Indeed, I hope Anders can find time to get the trace.
>>>
>>>
>>> Ian
>>>
>>>> Thanks,
>>>> Imran
>>>>
>>>>>> Fixes: 9caf696142 (kernfs: Introduce separate rwsem to protect inode
>>>>>> attributes)
>>>>>>
>>>>>> Signed-off-by: Ian Kent <[email protected]>
>>>>>> Cc: Anders Roxell <[email protected]>
>>>>>> Cc: Imran Khan <[email protected]>
>>>>>> Cc: Arnd Bergmann <[email protected]>
>>>>>> Cc: Minchan Kim <[email protected]>
>>>>>> Cc: Eric Sandeen <[email protected]>
>>>>>> ---
>>>>>>    fs/kernfs/dir.c |    4 ++++
>>>>>>    1 file changed, 4 insertions(+)
>>>>>>
>>>>>> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
>>>>>> index 45b6919903e6..6e84bb69602e 100644
>>>>>> --- a/fs/kernfs/dir.c
>>>>>> +++ b/fs/kernfs/dir.c
>>>>>> @@ -383,9 +383,11 @@ static int kernfs_link_sibling(struct kernfs_node
>>>>>> *kn)
>>>>>>        rb_insert_color(&kn->rb, &kn->parent->dir.children);
>>>>>>          /* successfully added, account subdir number */
>>>>>> + down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>>>>>        if (kernfs_type(kn) == KERNFS_DIR)
>>>>>>            kn->parent->dir.subdirs++;
>>>>>>        kernfs_inc_rev(kn->parent);
>>>>>> +    up_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>>>>>          return 0;
>>>>>>    }
>>>>>> @@ -408,9 +410,11 @@ static bool kernfs_unlink_sibling(struct
>>>>>> kernfs_node *kn)
>>>>>>        if (RB_EMPTY_NODE(&kn->rb))
>>>>>>            return false;
>>>>>>    + down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>>>>>        if (kernfs_type(kn) == KERNFS_DIR)
>>>>>>            kn->parent->dir.subdirs--;
>>>>>>        kernfs_inc_rev(kn->parent);
>>>>>> +    up_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>>>>>          rb_erase(&kn->rb, &kn->parent->dir.children);
>>>>>>        RB_CLEAR_NODE(&kn->rb);
>>>>>>

2023-07-28 02:09:53

by Imran Khan

[permalink] [raw]
Subject: Re: [PATCH 1/2] kernfs: dont take i_lock on inode attr read

Hello Ian,

On 28/7/2023 10:16 am, Ian Kent wrote:
> On 28/7/23 08:00, Ian Kent wrote:
>> On 27/7/23 12:30, Imran Khan wrote:
>>> Hello Ian,
>>> Sorry for late reply. I was about to reply this week.
>>>
>>> On 27/7/2023 10:38 am, Ian Kent wrote:
>>>> On 20/7/23 10:03, Ian Kent wrote:
>>>>> On Wed, 2023-07-19 at 12:23 +0800, Ian Kent wrote:
>>> [...]
>>>>> I do see a problem with recent changes.
>>>>>
>>>>> I'll send this off to Greg after I've done some testing (primarily just
>>>>> compile and function).
>>>>>
>>>>> Here's a patch which describes what I found.
>>>>>
>>>>> Comments are, of course, welcome, ;)
>>>> Anders I was hoping you would check if/what lockdep trace
>>>>
>>>> you get with this patch.
>>>>
>>>>
>>>> Imran, I was hoping you would comment on my change as it
>>>>
>>>> relates to the kernfs_iattr_rwsem changes.
>>>>
>>>>
>>>> Ian
>>>>
>>>>> kernfs: fix missing kernfs_iattr_rwsem locking
>>>>>
>>>>> From: Ian Kent <[email protected]>
>>>>>
>>>>> When the kernfs_iattr_rwsem was introduced a case was missed.
>>>>>
>>>>> The update of the kernfs directory node child count was also protected
>>>>> by the kernfs_rwsem and needs to be included in the change so that the
>>>>> child count (and so the inode n_link attribute) does not change while
>>>>> holding the rwsem for read.
>>>>>
>>> kernfs direcytory node's child count changes in kernfs_(un)link_sibling and
>>> these are getting invoked while adding (kernfs_add_one),
>>> removing(__kernfs_remove) or moving (kernfs_rename_ns)a node. Each of these
>>> operations proceed under kernfs_rwsem and I see each invocation of
>>> kernfs_link/unlink_sibling during the above mentioned operations is happening
>>> under kernfs_rwsem.
>>> So the child count should still be protected by kernfs_rwsem and we should not
>>> need to acquire kernfs_iattr_rwsem in kernfs_link/unlink_sibling.
>>
>> Yes, that's exactly what I intended (assuming you mean write lock in those cases)
>>
>> when I did it so now I wonder what I saw that lead to my patch, I'll need to look
>>
>> again ...
>
> Ahh, I see why I thought this ...
>
> It's the hunk:
>
> @@ -285,10 +285,10 @@ int kernfs_iop_permission(struct mnt_idmap *idmap,
>         kn = inode->i_private;
>         root = kernfs_root(kn);
>
> -       down_read(&root->kernfs_rwsem);
> +       down_read(&root->kernfs_iattr_rwsem);
>         kernfs_refresh_inode(kn, inode);
>         ret = generic_permission(&nop_mnt_idmap, inode, mask);
> -       up_read(&root->kernfs_rwsem);
> +       up_read(&root->kernfs_iattr_rwsem);
>
>         return ret;
>  }
>
> which takes away the kernfs_rwsem and introduces the possibility of
>
> the change. It may be more instructive to add back taking the read
>
> lock of kernfs_rwsem in .permission() than altering the sibling link
>
> and unlink functions, I mean I even caught myself on it.
>

Yes this was the block I referred to in my second comment [1]. However adding
back read lock of kernfs_rwsem in .permission() will again introduce the
bottleneck that I mentioned at [2]. So I think taking taking the locks in
link/unlink is the better option.
I understand having one lock to synchronize everything makes it easier
debug/development wise but sometimes (such as the case mentioned in [2]), it is
not optimum performance wise.
Thoughts ?

Thanks,
Imran

[1]: https://lore.kernel.org/all/[email protected]/
[2]: https://lore.kernel.org/all/[email protected]/
>
> Ian
>
>>
>>
>>>
>>> Kindly let me know your thoughts. I would still like to see new lockdep traces
>>> with this change.
>>
>> Indeed, I hope Anders can find time to get the trace.
>>
>>
>> Ian
>>
>>>
>>> Thanks,
>>> Imran
>>>
>>>>> Fixes: 9caf696142 (kernfs: Introduce separate rwsem to protect inode
>>>>> attributes)
>>>>>
>>>>> Signed-off-by: Ian Kent <[email protected]>
>>>>> Cc: Anders Roxell <[email protected]>
>>>>> Cc: Imran Khan <[email protected]>
>>>>> Cc: Arnd Bergmann <[email protected]>
>>>>> Cc: Minchan Kim <[email protected]>
>>>>> Cc: Eric Sandeen <[email protected]>
>>>>> ---
>>>>>    fs/kernfs/dir.c |    4 ++++
>>>>>    1 file changed, 4 insertions(+)
>>>>>
>>>>> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
>>>>> index 45b6919903e6..6e84bb69602e 100644
>>>>> --- a/fs/kernfs/dir.c
>>>>> +++ b/fs/kernfs/dir.c
>>>>> @@ -383,9 +383,11 @@ static int kernfs_link_sibling(struct kernfs_node
>>>>> *kn)
>>>>>        rb_insert_color(&kn->rb, &kn->parent->dir.children);
>>>>>          /* successfully added, account subdir number */
>>>>> + down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>>>>        if (kernfs_type(kn) == KERNFS_DIR)
>>>>>            kn->parent->dir.subdirs++;
>>>>>        kernfs_inc_rev(kn->parent);
>>>>> +    up_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>>>>          return 0;
>>>>>    }
>>>>> @@ -408,9 +410,11 @@ static bool kernfs_unlink_sibling(struct
>>>>> kernfs_node *kn)
>>>>>        if (RB_EMPTY_NODE(&kn->rb))
>>>>>            return false;
>>>>>    + down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>>>>        if (kernfs_type(kn) == KERNFS_DIR)
>>>>>            kn->parent->dir.subdirs--;
>>>>>        kernfs_inc_rev(kn->parent);
>>>>> +    up_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>>>>          rb_erase(&kn->rb, &kn->parent->dir.children);
>>>>>        RB_CLEAR_NODE(&kn->rb);
>>>>>