2023-09-05 16:20:58

by Mateusz Guzik

[permalink] [raw]
Subject: [RFC PATCH] vfs: add inode lockdep assertions

Thread "Use exclusive lock for file_remove_privs" [1] reports an issue
which should have been found by asserts -- inode not write locked by the
caller.

It did not happen because the attempt to do it in notify_change:
WARN_ON_ONCE(!inode_is_locked(inode));

passes if the inode is only read-locked:
static inline int rwsem_is_locked(struct rw_semaphore *sem)
{
return atomic_long_read(&sem->count) != 0;
}

According to git blame this regressed from 2 commits:
1. 5955102c9984 ("wrappers for ->i_mutex access") which replaced a
bunch of mutex_is_locked with inode_is_locked
2. 9902af79c01a ("parallel lookups: actual switch to rwsem") which
implemented inode_is_locked as a mere check on the semaphore being
held in *any* manner

In order to remedy this I'm proposing lockdep-ing the check with 2
helpers: inode_assert_locked and inode_assert_write_locked

Below I'm adding the helpers and converting *some* of the spots modified
by the first patch. I boot tested it and nothing blow up on ext4, but
btrfs should cause a complaint.

I can finish the other spots originally touched by 1 and touch up the 3
uses I grepped in fs/namei.c, but ultimately filesystem maintainers are
going to have to patch their code at their leasure. On top of that there
are probably quite a few places which should assert, but don't.

Comments?

Link: https://lore.kernel.org/linux-fsdevel/[email protected]/

---
fs/attr.c | 2 +-
fs/btrfs/xattr.c | 2 +-
fs/ext4/ext4.h | 4 ++--
fs/ext4/extents.c | 4 ++--
fs/ext4/inode.c | 4 ++--
include/linux/fs.h | 10 ++++++++++
6 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/fs/attr.c b/fs/attr.c
index a8ae5f6d9b16..90dec999a952 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -387,7 +387,7 @@ int notify_change(struct mnt_idmap *idmap, struct dentry *dentry,
struct timespec64 now;
unsigned int ia_valid = attr->ia_valid;

- WARN_ON_ONCE(!inode_is_locked(inode));
+ inode_assert_write_locked(inode);

error = may_setattr(idmap, inode, ia_valid);
if (error)
diff --git a/fs/btrfs/xattr.c b/fs/btrfs/xattr.c
index 96828a13dd43..46b268a433dd 100644
--- a/fs/btrfs/xattr.c
+++ b/fs/btrfs/xattr.c
@@ -120,7 +120,7 @@ int btrfs_setxattr(struct btrfs_trans_handle *trans, struct inode *inode,
* locks the inode's i_mutex before calling setxattr or removexattr.
*/
if (flags & XATTR_REPLACE) {
- ASSERT(inode_is_locked(inode));
+ inode_assert_write_locked(inode);
di = btrfs_lookup_xattr(NULL, root, path,
btrfs_ino(BTRFS_I(inode)), name, name_len, 0);
if (!di)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 481491e892df..df428f22f624 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3364,8 +3364,8 @@ do { \
/* Update i_disksize. Requires i_rwsem to avoid races with truncate */
static inline void ext4_update_i_disksize(struct inode *inode, loff_t newsize)
{
- WARN_ON_ONCE(S_ISREG(inode->i_mode) &&
- !inode_is_locked(inode));
+ if (S_ISREG(inode->i_mode))
+ inode_assert_write_locked(inode);
down_write(&EXT4_I(inode)->i_data_sem);
if (newsize > EXT4_I(inode)->i_disksize)
WRITE_ONCE(EXT4_I(inode)->i_disksize, newsize);
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 202c76996b62..149783ecfe16 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -5588,8 +5588,8 @@ ext4_swap_extents(handle_t *handle, struct inode *inode1,

BUG_ON(!rwsem_is_locked(&EXT4_I(inode1)->i_data_sem));
BUG_ON(!rwsem_is_locked(&EXT4_I(inode2)->i_data_sem));
- BUG_ON(!inode_is_locked(inode1));
- BUG_ON(!inode_is_locked(inode2));
+ inode_assert_write_locked(inode1);
+ inode_assert_write_locked(inode2);

ext4_es_remove_extent(inode1, lblk1, count);
ext4_es_remove_extent(inode2, lblk2, count);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 89737d5a1614..2ecdef6ddc88 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3797,7 +3797,7 @@ int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,

loff_t size = i_size_read(inode);

- WARN_ON(!inode_is_locked(inode));
+ inode_assert_write_locked(inode);
if (offset > size || offset + len < size)
return 0;

@@ -4068,7 +4068,7 @@ int ext4_truncate(struct inode *inode)
* have i_rwsem locked because it's not necessary.
*/
if (!(inode->i_state & (I_NEW|I_FREEING)))
- WARN_ON(!inode_is_locked(inode));
+ inode_assert_write_locked(inode);
trace_ext4_truncate_enter(inode);

if (!ext4_can_truncate(inode))
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c8ff4156a0a1..93d48b6b9f67 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -842,6 +842,16 @@ static inline void inode_lock_shared_nested(struct inode *inode, unsigned subcla
down_read_nested(&inode->i_rwsem, subclass);
}

+static inline void inode_assert_locked(struct inode *inode)
+{
+ lockdep_assert_held(&inode->i_rwsem);
+}
+
+static inline void inode_assert_write_locked(struct inode *inode)
+{
+ lockdep_assert_held_write(&inode->i_rwsem);
+}
+
static inline void filemap_invalidate_lock(struct address_space *mapping)
{
down_write(&mapping->invalidate_lock);
--
2.39.2


2023-09-06 19:11:18

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [RFC PATCH] vfs: add inode lockdep assertions

On Wed, Sep 06, 2023 at 05:00:14PM +0100, Matthew Wilcox wrote:
> On Wed, Sep 06, 2023 at 08:29:48AM -0700, Darrick J. Wong wrote:
> > Or hoist the XFS mrlock, because it actually /does/ know if the rwsem is
> > held in shared or exclusive mode.
>
> ... or to put it another way, if we had rwsem_is_write_locked(),
> we could get rid of mrlock?
>
> diff --git a/fs/xfs/mrlock.h b/fs/xfs/mrlock.h
> index 79155eec341b..5530f03aaed1 100644
> --- a/fs/xfs/mrlock.h
> +++ b/fs/xfs/mrlock.h
> @@ -10,18 +10,10 @@
>
> typedef struct {
> struct rw_semaphore mr_lock;
> -#if defined(DEBUG) || defined(XFS_WARN)
> - int mr_writer;
> -#endif
> } mrlock_t;
>
> -#if defined(DEBUG) || defined(XFS_WARN)
> -#define mrinit(mrp, name) \
> - do { (mrp)->mr_writer = 0; init_rwsem(&(mrp)->mr_lock); } while (0)
> -#else
> #define mrinit(mrp, name) \
> do { init_rwsem(&(mrp)->mr_lock); } while (0)
> -#endif
>
> #define mrlock_init(mrp, t,n,s) mrinit(mrp, n)
> #define mrfree(mrp) do { } while (0)
> @@ -34,9 +26,6 @@ static inline void mraccess_nested(mrlock_t *mrp, int subclass)
> static inline void mrupdate_nested(mrlock_t *mrp, int subclass)
> {
> down_write_nested(&mrp->mr_lock, subclass);
> -#if defined(DEBUG) || defined(XFS_WARN)
> - mrp->mr_writer = 1;
> -#endif
> }
>
> static inline int mrtryaccess(mrlock_t *mrp)
> @@ -48,17 +37,11 @@ static inline int mrtryupdate(mrlock_t *mrp)
> {
> if (!down_write_trylock(&mrp->mr_lock))
> return 0;
> -#if defined(DEBUG) || defined(XFS_WARN)
> - mrp->mr_writer = 1;
> -#endif
> return 1;
> }
>
> static inline void mrunlock_excl(mrlock_t *mrp)
> {
> -#if defined(DEBUG) || defined(XFS_WARN)
> - mrp->mr_writer = 0;
> -#endif
> up_write(&mrp->mr_lock);
> }
>
> @@ -69,9 +52,6 @@ static inline void mrunlock_shared(mrlock_t *mrp)
>
> static inline void mrdemote(mrlock_t *mrp)
> {
> -#if defined(DEBUG) || defined(XFS_WARN)
> - mrp->mr_writer = 0;
> -#endif
> downgrade_write(&mrp->mr_lock);
> }
>
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 9e62cc500140..b99c3bd78c5e 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -361,7 +361,7 @@ xfs_isilocked(
> {
> if (lock_flags & (XFS_ILOCK_EXCL|XFS_ILOCK_SHARED)) {
> if (!(lock_flags & XFS_ILOCK_SHARED))
> - return !!ip->i_lock.mr_writer;
> + return rwsem_is_write_locked(&ip->i_lock.mr_lock);

You'd be better off converting this to:

return __xfs_rwsem_islocked(&ip->i_lock.mr_lock,
(lock_flags & XFS_ILOCK_SHARED));

And then fixing __xfs_rwsem_islocked to do:

static inline bool
__xfs_rwsem_islocked(
struct rw_semaphore *rwsem,
bool shared)
{
if (!debug_locks) {
if (!shared)
return rwsem_is_write_locked(rwsem);
return rwsem_is_locked(rwsem);
}

...
}

(and then getting rid of mrlock_t.h entirely)

> return rwsem_is_locked(&ip->i_lock.mr_lock);
> }
>
> diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> index e05e167dbd16..277b8c96bbf9 100644
> --- a/include/linux/mmap_lock.h
> +++ b/include/linux/mmap_lock.h
> @@ -69,7 +69,7 @@ static inline void mmap_assert_locked(struct mm_struct *mm)
> static inline void mmap_assert_write_locked(struct mm_struct *mm)
> {
> lockdep_assert_held_write(&mm->mmap_lock);
> - VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
> + VM_BUG_ON_MM(!rwsem_is_write_locked(&mm->mmap_lock), mm);
> }
>
> #ifdef CONFIG_PER_VMA_LOCK
> diff --git a/include/linux/rwbase_rt.h b/include/linux/rwbase_rt.h
> index 1d264dd08625..3c25b14edc05 100644
> --- a/include/linux/rwbase_rt.h
> +++ b/include/linux/rwbase_rt.h
> @@ -31,6 +31,11 @@ static __always_inline bool rw_base_is_locked(struct rwbase_rt *rwb)
> return atomic_read(&rwb->readers) != READER_BIAS;
> }
>
> +static __always_inline bool rw_base_is_write_locked(struct rwbase_rt *rwb)
> +{
> + return atomic_read(&rwb->readers) == WRITER_BIAS;
> +}
> +
> static __always_inline bool rw_base_is_contended(struct rwbase_rt *rwb)
> {
> return atomic_read(&rwb->readers) > 0;
> diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
> index 1dd530ce8b45..241a12c6019e 100644
> --- a/include/linux/rwsem.h
> +++ b/include/linux/rwsem.h
> @@ -72,6 +72,11 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
> return atomic_long_read(&sem->count) != 0;
> }
>
> +static inline int rwsem_is_write_locked(struct rw_semaphore *sem)
> +{
> + return atomic_long_read(&sem->count) & 1;


atomic_long_read(&sem->count) & RWSEM_WRITER_LOCKED ?

In one of the past "replace mrlock_t" threads I complained about
hardcoding this value instead of using the symbol. I saw it go by,
neglected to copy the url, and ten minutes later I can't find it. :(

--D

> +}
> +
> #define RWSEM_UNLOCKED_VALUE 0L
> #define __RWSEM_COUNT_INIT(name) .count = ATOMIC_LONG_INIT(RWSEM_UNLOCKED_VALUE)
>
> @@ -157,6 +162,11 @@ static __always_inline int rwsem_is_locked(struct rw_semaphore *sem)
> return rw_base_is_locked(&sem->rwbase);
> }
>
> +static __always_inline int rwsem_is_write_locked(struct rw_semaphore *sem)
> +{
> + return rw_base_is_write_locked(&sem->rwbase);
> +}
> +
> static __always_inline int rwsem_is_contended(struct rw_semaphore *sem)
> {
> return rw_base_is_contended(&sem->rwbase);

2023-09-07 01:53:24

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [RFC PATCH] vfs: add inode lockdep assertions

On Wed, Sep 06, 2023 at 10:10:25PM +0100, Matthew Wilcox wrote:
> On Wed, Sep 06, 2023 at 10:07:24AM -0700, Darrick J. Wong wrote:
> > You'd be better off converting this to:
> >
> > return __xfs_rwsem_islocked(&ip->i_lock.mr_lock,
> > (lock_flags & XFS_ILOCK_SHARED));
> >
> > And then fixing __xfs_rwsem_islocked to do:
> >
> > static inline bool
> > __xfs_rwsem_islocked(
> > struct rw_semaphore *rwsem,
> > bool shared)
> > {
> > if (!debug_locks) {
> > if (!shared)
> > return rwsem_is_write_locked(rwsem);
> > return rwsem_is_locked(rwsem);
> > }
> >
> > ...
> > }
>
> so ... I did that. And then many isilocked() calls started failing.
> generic/017 was the first one:
>
> 00030 XFS: Assertion failed: xfs_isilocked(ip, XFS_ILOCK_EXCL), file: fs/xfs/libxfs/xfs_trans_inode.c, line: 93
> 00030 ------------[ cut here ]------------
> 00030 WARNING: CPU: 5 PID: 77 at fs/xfs/xfs_message.c:104 assfail+0x2c/0x40
> 00030 Modules linked in:
> 00030 CPU: 5 PID: 77 Comm: kworker/5:1 Kdump: loaded Not tainted 6.5.0-00006-g88a61c17df8f-dirty #14
> 00030 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> 00030 Workqueue: xfsalloc xfs_btree_split_worker
> 00030 RIP: 0010:assfail+0x2c/0x40
> 00030 Code: 89 d0 41 89 c9 48 c7 c2 80 70 0f 82 48 89 f1 48 89 e5 48 89 fe 48 c7
> c7 08 4f 07 82 e8 fd fd ff ff 80 3d 26 cc ed 00 00 75 04 <0f> 0b 5d c3 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 55 48
> 00030 RSP: 0018:ffff888003f0bc28 EFLAGS: 00010246
> 00030 RAX: 00000000ffffffea RBX: ffff88800d9a0000 RCX: 000000007fffffff
> 00030 RDX: 0000000000000021 RSI: 0000000000000000 RDI: ffffffff82074f08
> 00030 RBP: ffff888003f0bc28 R08: 0000000000000000 R09: 000000000000000a
> 00030 R10: 000000000000000a R11: 0fffffffffffffff R12: ffff888009ff6000
> 00030 R13: ffff88800ac8a000 R14: 0000000000000001 R15: ffff88800af0a000
> 00030 FS: 0000000000000000(0000) GS:ffff88807d940000(0000) knlGS:0000000000000000
> 00030 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> 00030 CR2: 00007ff177b05d58 CR3: 0000000007aa2002 CR4: 0000000000770ea0
> 00030 PKRU: 55555554
> 00030 Call Trace:
> 00030 <TASK>
> 00030 ? show_regs+0x5c/0x70
> 00030 ? __warn+0x84/0x180
> 00030 ? assfail+0x2c/0x40
> 00030 ? report_bug+0x18e/0x1c0
> 00030 ? handle_bug+0x3e/0x70
> 00030 ? exc_invalid_op+0x18/0x70
> 00030 ? asm_exc_invalid_op+0x1b/0x20
> 00030 ? assfail+0x2c/0x40
> 00030 ? assfail+0x23/0x40
> 00030 xfs_trans_log_inode+0xf9/0x120
> 00030 xfs_bmbt_alloc_block+0xf0/0x1c0
> 00030 __xfs_btree_split+0xf8/0x6c0
> 00030 ? __this_cpu_preempt_check+0x13/0x20
> 00030 ? lock_acquire+0xc8/0x280
> 00030 xfs_btree_split_worker+0x8a/0x110
> 00030 process_one_work+0x23f/0x510
> 00030 worker_thread+0x4d/0x3b0
> 00030 ? process_one_work+0x510/0x510
> 00030 kthread+0x106/0x140
> 00030 ? kthread_complete_and_exit+0x20/0x20
> 00030 ret_from_fork+0x31/0x50
> 00030 ? kthread_complete_and_exit+0x20/0x20
> 00030 ret_from_fork_asm+0x11/0x20
> 00030 </TASK>
> 00030 irq event stamp: 2901
> 00030 hardirqs last enabled at (2909): [<ffffffff810e4d83>] __up_console_sem+0x53/0x60
> 00030 hardirqs last disabled at (2916): [<ffffffff810e4d68>] __up_console_sem+0x38/0x60
> 00030 softirqs last enabled at (0): [<ffffffff81067bd0>] copy_process+0x830/0x1c10
> 00030 softirqs last disabled at (0): [<0000000000000000>] 0x0
> 00030 ---[ end trace 0000000000000000 ]---
>
> but here's the thing, I have lockdep enabled. So it's not testing my
> new rwsem_is_write_locked() code, it's testing the current lockdep
> stuff.
>
> So I have a feeling that we're not telling lockdep that we've acquired
> the i_lock. Or something? Seems unlikely that this is a real bug;
> surely we'd've noticed before now.
>
> Code here:
> https://git.infradead.org/users/willy/pagecache.git/shortlog/refs/heads/mrlock
>
> ie git clone git://git.infradead.org/users/willy/pagecache.git mrlock
>
> You don't need the top commit. 754fb6a68dae is enough to provoke it,
> as long as you have CONFIG_LOCKDEP enabled.

Yeah. The lockdep assertions in __xfs_rwsem_islocked are broken because
the fsstress thread takes the ILOCK_EXCL, decides to defer a bmap btree
split to a workqueue to avoid overflowing the kernel stack, and doesn't
tell lockdep that the workqueue is (effectively) the rwsem owner until
it completes.

The last time we tried to get rid of mrlock_t, dchinner suggested using
lockdep_assert_held_non_owner to fix the assertions, but (a) that
doesn't seem to exist in TOT 6.5 and (b) the inode rwsems protect the
inode data structure, not threads. Hence you could just get rid of the
lockdep assertions and use the rwsem_* versions unconditionally.

--D

2023-09-14 22:37:33

by Mateusz Guzik

[permalink] [raw]
Subject: Re: [RFC PATCH] vfs: add inode lockdep assertions

On 9/6/23, Matthew Wilcox <[email protected]> wrote:
> On Wed, Sep 06, 2023 at 05:23:42PM +0200, Bernd Schubert wrote:
>>
>>
>> On 9/6/23 17:20, Matthew Wilcox wrote:
>> > On Thu, Aug 31, 2023 at 05:14:14PM +0200, Mateusz Guzik wrote:
>> > > +++ b/include/linux/fs.h
>> > > @@ -842,6 +842,16 @@ static inline void
>> > > inode_lock_shared_nested(struct inode *inode, unsigned subcla
>> > > down_read_nested(&inode->i_rwsem, subclass);
>> > > }
>> > > +static inline void inode_assert_locked(struct inode *inode)
>> > > +{
>> > > + lockdep_assert_held(&inode->i_rwsem);
>> > > +}
>> > > +
>> > > +static inline void inode_assert_write_locked(struct inode *inode)
>> > > +{
>> > > + lockdep_assert_held_write(&inode->i_rwsem);
>> > > +}
>> >
>> > This mirrors what we have in mm, but it's only going to trigger on
>> > builds that have lockdep enabled. Lockdep is very expensive; it
>> > easily doubles the time it takes to run xfstests on my laptop, so
>> > I don't generally enable it. So what we also have in MM is:
>> >
>> > static inline void mmap_assert_write_locked(struct mm_struct *mm)
>> > {
>> > lockdep_assert_held_write(&mm->mmap_lock);
>> > VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
>> > }
>> >
>> > Now if you have lockdep enabled, you get the lockdep check which
>> > gives you all the lovely lockdep information, but if you don't, you
>> > at least get the cheap check that someone is holding the lock at all.
>> >
>> > ie I would make this:
>> >
>> > +static inline void inode_assert_write_locked(struct inode *inode)
>> > +{
>> > + lockdep_assert_held_write(&inode->i_rwsem);
>> > + WARN_ON_ONCE(!inode_is_locked(inode));
>> > +}
>> >
>> > Maybe the locking people could give us a rwsem_is_write_locked()
>> > predicate, but until then, this is the best solution we came up with.
>>
>>
>> Which is exactly what I had suggested in the other thread :)
>
> Yes, but apparently comments in that thread don't count :eyeroll:
>

Pretty weird reaction mate, they very much *do* count which is why I'm
confused why you resent an e-mail with the bogus is_locked check
(which I explicitly pointed out).

Since you posted a separate patch to add write-locking check to rwsem
I'm going to wait for that bit to get sorted out (unless it stalls for
a long time).

fwiw I'm confused what's up with people making kernel changes without
running lockdep. If it adds too much overhead for use in normal
development someone(tm) should have fixed that aspect long time ago.

--
Mateusz Guzik <mjguzik gmail.com>