LinuxLists.cc - Crash with PREEMPT

[permalink] [raw]

Subject: Re: Crash with PREEMPT_RT on aarch64 machine

On 2022-11-03 12:54:44 [+0100], Jan Kara wrote:
> Hello,
Hi,

> I was tracking down the following crash with 6.0 kernel with
> patch-6.0.5-rt14.patch applied:
>
> [ T6611] ------------[ cut here ]------------
> [ T6611] kernel BUG at fs/inode.c:625!

seems like an off-by-one ;)

> The machine is aarch64 architecture, kernel config is attached. I have seen
> the crashes also with 5.14-rt kernel so it is not a new thing. The crash is
> triggered relatively reliably (on two different aarch64 machines) by our
> performance testing framework when running dbench benchmark against an XFS
> filesystem.

different aarch64 machines as in different SoC? Or the same CPU twice.
And no trouble on x86-64 I guess?

> Now originally I thought this is some problem with XFS or writeback code
> but after debugging this for some time I don't think that anymore.
> clear_inode() complains about inode->i_wb_list being non-empty. In fact
> looking at the list_head, I can see it is corrupted. In all the occurences
> of the problem ->prev points back to the list_head itself but ->next points
> to some list_head that used to be part of the sb->s_inodes_wb list (or
> actually that list spliced in wait_sb_inodes() because I've seen a pointer to
> the stack as ->next pointer as well).

so you assume a delete and add operation in parallel?

> This is not just some memory ordering issue with the check in
> clear_inode(). If I add sb->s_inode_wblist_lock locking around the check in
> clear_inode(), the problem still reproduces.

What about dropping the list_empty() check in sb_mark_inode_writeback()
and sb_clear_inode_writeback() so that the check operation always
happens within the locked section? Either way, missing an add/delete
should result in consistent pointers.

> If I enable CONFIG_DEBUG_LIST or if I convert sb->s_inode_wblist_lock to
> raw_spinlock_t, the problem disappears.
>
> Finally, I'd note that the list is modified from three places which makes
> audit relatively simple. sb_mark_inode_writeback(),
> sb_clear_inode_writeback(), and wait_sb_inodes(). All these places hold
> sb->s_inode_wblist_lock when modifying the list. So at this point I'm at
> loss what could be causing this. As unlikely as it seems to me I've started
> wondering whether it is not some subtle issue with RT spinlocks on aarch64
> possibly in combination with interrupts (because sb_clear_inode_writeback()
> may be called from an interrupt).

This should be modified from a threaded interrupt so interrupts and
preemption should be enabled at this point.
If preemption and or interrupts are disabled at some point then
CONFIG_DEBUG_ATOMIC_SLEEP should complain about it.

spinlock_t and raw_spinlock_t differ slightly in terms of locking.
rt_spin_lock() has the fast path via try_cmpxchg_acquire(). If you
enable CONFIG_DEBUG_RT_MUTEXES then you would force the slow path which
always acquires the rt_mutex_base::wait_lock (which is a raw_spinlock_t)
while the actual lock is modified via cmpxchg.

> Any ideas?
>
> Honza

Sebastian

2022-11-07 13:04:46

[permalink] [raw]

Subject: Re: Crash with PREEMPT_RT on aarch64 machine

On Fri 04-11-22 16:06:37, Hillf Danton wrote:
> On 3 Nov 2022 12:54:44 +0100 Jan Kara <[email protected]>
> > Hello,
> >
> > I was tracking down the following crash with 6.0 kernel with
> > patch-6.0.5-rt14.patch applied:
> >
> > [ T6611] ------------[ cut here ]------------
> > [ T6611] kernel BUG at fs/inode.c:625!
> > [ T6611] Internal error: Oops - BUG: 0 [#1] PREEMPT_RT SMP
> > [ T6611] Modules linked in: xfs(E) af_packet(E) iscsi_ibft(E) iscsi_boot_sysfs(E) rfkill(E) mlx5_ib(E) ib_uverbs(E) ib_core(E) arm_spe_pmu(E) mlx5_core(E) sunrpc(E) mlxfw(E) pci_hyperv_intf(E) nls_iso8859_1(E) acpi_ipmi(E) nls_cp437(E) ipmi_ssif(E) vfat(E) ipmi_devintf(E) tls(E) igb(E) psample(E) button(E) arm_cmn(E) arm_dmc620_pmu(E) ipmi_msghandler(E) fat(E) cppc_cpufreq(E) arm_dsu_pmu(E) fuse(E) ip_tables(E) x_tables(E) ast(E) i2c_algo_bit(E) drm_vram_helper(E) aes_ce_blk(E) aes_ce_cipher(E) crct10dif_ce(E) ghash_ce(E) gf128mul(E) nvme(E) drm_kms_helper(E) sha2_ce(E) syscopyarea(E) sha256_arm64(E) sysfillrect(E) xhci_pci(E) sha1_ce(E) sysimgblt(E) nvme_core(E) xhci_pci_renesas(E) fb_sys_fops(E) nvme_common(E) drm_ttm_helper(E) sbsa_gwdt(E) t10_pi(E) ttm(E) xhci_hcd(E) crc64_rocksoft_generic(E) crc64_rocksoft(E) usbcore(E) crc64(E) drm(E) usb_common(E) i2c_designware_platform(E) i2c_designware_core(E) btrfs(E) blake2b_generic(E) libcrc32c(E) xor(E) xor_neon(E)
> > [ T6611] raid6_pq(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) scsi_common(E)
> > [ T6611] CPU: 11 PID: 6611 Comm: dbench Tainted: G E 6.0.0-rt14-rt+ #1 4a18df02c109f1e703cf2ff86b77cf9cd9d5a188
> > [ T6611] Hardware name: GIGABYTE R272-P30-JG/MP32-AR0-JG, BIOS F16f (SCP: 1.06.20210615) 07/01/2021
> > [ T6611] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > [ T6611] pc : clear_inode+0xa0/0xc0
> > [ T6611] lr : clear_inode+0x38/0xc0
> > [ T6611] sp : ffff80000f4f3cd0
> > [ T6611] x29: ffff80000f4f3cd0 x28: ffff07ff92142000 x27: 0000000000000000
> > [ T6611] x26: ffff08012aef6058 x25: 0000000000000002 x24: ffffb657395e8000
> > [ T6611] x23: ffffb65739072008 x22: ffffb656e0bed0a8 x21: ffff08012aef6190
> > [ T6611] x20: ffff08012aef61f8 x19: ffff08012aef6058 x18: 0000000000000014
> > [ T6611] x17: 00000000f0d86255 x16: ffffb65737dfdb00 x15: 0100000004000000
> > [ T6611] x14: 644d000008090000 x13: 644d000008090000 x12: ffff80000f4f3b20
> > [ T6611] x11: 0000000000000002 x10: ffff083f5ffbe1c0 x9 : ffffb657388284a4
> > [ T6611] x8 : fffffffffffffffe x7 : ffff80000f4f3b20 x6 : ffff80000f4f3b20
> > [ T6611] x5 : ffff08012aef6210 x4 : ffff08012aef6210 x3 : 0000000000000000
> > [ T6611] x2 : ffff08012aef62d8 x1 : ffff07ff8fbbf690 x0 : ffff08012aef61a0
> > [ T6611] Call trace:
> > [ T6611] clear_inode+0xa0/0xc0
> > [ T6611] evict+0x160/0x180
> > [ T6611] iput+0x154/0x240
> > [ T6611] do_unlinkat+0x184/0x300
> > [ T6611] __arm64_sys_unlinkat+0x48/0xc0
> > [ T6611] el0_svc_common.constprop.4+0xe4/0x2c0
> > [ T6611] do_el0_svc+0xac/0x100
> > [ T6611] el0_svc+0x78/0x200
> > [ T6611] el0t_64_sync_handler+0x9c/0xc0
> > [ T6611] el0t_64_sync+0x19c/0x1a0
> > [ T6611] Code: d4210000 d503201f d4210000 d503201f (d4210000)
> > [ T6611] ---[ end trace 0000000000000000 ]---
> >
> > The machine is aarch64 architecture, kernel config is attached. I have seen
> > the crashes also with 5.14-rt kernel so it is not a new thing. The crash is
> > triggered relatively reliably (on two different aarch64 machines) by our
> > performance testing framework when running dbench benchmark against an XFS
> > filesystem.
> >
> > Now originally I thought this is some problem with XFS or writeback code
> > but after debugging this for some time I don't think that anymore.
> > clear_inode() complains about inode->i_wb_list being non-empty. In fact
> > looking at the list_head, I can see it is corrupted. In all the occurences
> > of the problem ->prev points back to the list_head itself but ->next points
> > to some list_head that used to be part of the sb->s_inodes_wb list (or
> > actually that list spliced in wait_sb_inodes() because I've seen a pointer to
> > the stack as ->next pointer as well).
> >
> > This is not just some memory ordering issue with the check in
> > clear_inode(). If I add sb->s_inode_wblist_lock locking around the check in
> > clear_inode(), the problem still reproduces.
> >
> > If I enable CONFIG_DEBUG_LIST or if I convert sb->s_inode_wblist_lock to
> > raw_spinlock_t, the problem disappears.
> >
> > Finally, I'd note that the list is modified from three places which makes
> > audit relatively simple. sb_mark_inode_writeback(),
> > sb_clear_inode_writeback(), and wait_sb_inodes(). All these places hold
> > sb->s_inode_wblist_lock when modifying the list. So at this point I'm at
> > loss what could be causing this. As unlikely as it seems to me I've started
> > wondering whether it is not some subtle issue with RT spinlocks on aarch64
> > possibly in combination with interrupts (because sb_clear_inode_writeback()
> > may be called from an interrupt).
> >
> > Any ideas?
>
> Feel free to collect debug info ONLY in your spare cycles, given your
> relatively reliable reproducer.

So in fact I made sure (by debug counters) that sb_mark_inode_writeback()
and sb_clear_inode_writeback() get called the same number of times before
evict() gets called. So your debug patch would change nothing AFAICT...

Honza

> +++ b/fs/fs-writeback.c
> @@ -1256,6 +1256,7 @@ void sb_mark_inode_writeback(struct inod
> if (list_empty(&inode->i_wb_list)) {
> spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
> if (list_empty(&inode->i_wb_list)) {
> + ihold(inode);
> list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
> trace_sb_mark_inode_writeback(inode);
> }
> @@ -1272,12 +1273,19 @@ void sb_clear_inode_writeback(struct ino
> unsigned long flags;
>
> if (!list_empty(&inode->i_wb_list)) {
> + int put = 0;
> spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
> if (!list_empty(&inode->i_wb_list)) {
> + put = 1;
> list_del_init(&inode->i_wb_list);
> trace_sb_clear_inode_writeback(inode);
> }
> spin_unlock_irqrestore(&sb->s_inode_wblist_lock, flags);
> + if (put) {
> + ihold(inode);
> + iput(inode);
> + iput(inode);
> + }
> }
> }
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-11-07 14:18:30

[permalink] [raw]

Subject: Re: Crash with PREEMPT_RT on aarch64 machine

On Fri 04-11-22 17:30:29, Sebastian Andrzej Siewior wrote:
> On 2022-11-03 12:54:44 [+0100], Jan Kara wrote:
> > Hello,
> Hi,
>
> > I was tracking down the following crash with 6.0 kernel with
> > patch-6.0.5-rt14.patch applied:
> >
> > [ T6611] ------------[ cut here ]------------
> > [ T6611] kernel BUG at fs/inode.c:625!
>
> seems like an off-by-one ;)
>
> > The machine is aarch64 architecture, kernel config is attached. I have seen
> > the crashes also with 5.14-rt kernel so it is not a new thing. The crash is
> > triggered relatively reliably (on two different aarch64 machines) by our
> > performance testing framework when running dbench benchmark against an XFS
> > filesystem.
>
> different aarch64 machines as in different SoC? Or the same CPU twice.
> And no trouble on x86-64 I guess?

The same CPU it appears, just different machines. The problem never
happened on x86-64, that is correct. /proc/cpuinfo from the two machines
is:

processor : 0
BogoMIPS : 50.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x3
CPU part : 0xd0c
CPU revision : 1

...

there are 80 cpus in total in the machine.

> > Now originally I thought this is some problem with XFS or writeback code
> > but after debugging this for some time I don't think that anymore.
> > clear_inode() complains about inode->i_wb_list being non-empty. In fact
> > looking at the list_head, I can see it is corrupted. In all the occurences
> > of the problem ->prev points back to the list_head itself but ->next points
> > to some list_head that used to be part of the sb->s_inodes_wb list (or
> > actually that list spliced in wait_sb_inodes() because I've seen a pointer to
> > the stack as ->next pointer as well).
>
> so you assume a delete and add operation in parallel?

Yes, I assume sb_clear_inode_writeback() was deleting inode from the list
while wait_sb_inodes() was doing list_move_tail() operation on the same list.

> > This is not just some memory ordering issue with the check in
> > clear_inode(). If I add sb->s_inode_wblist_lock locking around the check in
> > clear_inode(), the problem still reproduces.
>
> What about dropping the list_empty() check in sb_mark_inode_writeback()
> and sb_clear_inode_writeback() so that the check operation always
> happens within the locked section? Either way, missing an add/delete
> should result in consistent pointers.

I've tested removing the list_empty() checks from sb_mark_inode_writeback()
and sb_clear_inode_writeback() but it didn't change a bit. The corruption
still happened.

> > If I enable CONFIG_DEBUG_LIST or if I convert sb->s_inode_wblist_lock to
> > raw_spinlock_t, the problem disappears.
> >
> > Finally, I'd note that the list is modified from three places which makes
> > audit relatively simple. sb_mark_inode_writeback(),
> > sb_clear_inode_writeback(), and wait_sb_inodes(). All these places hold
> > sb->s_inode_wblist_lock when modifying the list. So at this point I'm at
> > loss what could be causing this. As unlikely as it seems to me I've started
> > wondering whether it is not some subtle issue with RT spinlocks on aarch64
> > possibly in combination with interrupts (because sb_clear_inode_writeback()
> > may be called from an interrupt).
>
> This should be modified from a threaded interrupt so interrupts and
> preemption should be enabled at this point.
> If preemption and or interrupts are disabled at some point then
> CONFIG_DEBUG_ATOMIC_SLEEP should complain about it.

I see.

> spinlock_t and raw_spinlock_t differ slightly in terms of locking.
> rt_spin_lock() has the fast path via try_cmpxchg_acquire(). If you
> enable CONFIG_DEBUG_RT_MUTEXES then you would force the slow path which
> always acquires the rt_mutex_base::wait_lock (which is a raw_spinlock_t)
> while the actual lock is modified via cmpxchg.

So I've tried enabling CONFIG_DEBUG_RT_MUTEXES and indeed the corruption
stops happening as well. So do you suspect some bug in the CPU itself?

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-11-07 16:00:03

[permalink] [raw]

Subject: Re: Crash with PREEMPT_RT on aarch64 machine

+ locking, arm64

On 2022-11-07 14:56:36 [+0100], Jan Kara wrote:
> > spinlock_t and raw_spinlock_t differ slightly in terms of locking.
> > rt_spin_lock() has the fast path via try_cmpxchg_acquire(). If you
> > enable CONFIG_DEBUG_RT_MUTEXES then you would force the slow path which
> > always acquires the rt_mutex_base::wait_lock (which is a raw_spinlock_t)
> > while the actual lock is modified via cmpxchg.
>
> So I've tried enabling CONFIG_DEBUG_RT_MUTEXES and indeed the corruption
> stops happening as well. So do you suspect some bug in the CPU itself?

If it is only enabling CONFIG_DEBUG_RT_MUTEXES (and not whole lockdep)
then it looks very suspicious.
CONFIG_DEBUG_RT_MUTEXES enables a few additional checks but the main
part is that rt_mutex_cmpxchg_acquire() + rt_mutex_cmpxchg_release()
always fail (and so the slowpath under a raw_spinlock_t is done).

So if it is really the fast path (rt_mutex_cmpxchg_acquire()) then it
somehow smells like the CPU is misbehaving.

Could someone from the locking/arm64 department check if the locking in
RT-mutex (rtlock_lock()) is correct?

rtmutex locking uses try_cmpxchg_acquire(, ptr, ptr) for the fastpath
(and try_cmpxchg_release(, ptr, ptr) for unlock).
Now looking at it again, I don't see much difference compared to what
queued_spin_trylock() does except the latter always operates on 32bit
value instead a pointer.

> Honza
>

Sebastian

2022-11-07 16:41:11

[permalink] [raw]

Subject: Re: Crash with PREEMPT_RT on aarch64 machine

On Mon 07-11-22 16:10:34, Sebastian Andrzej Siewior wrote:
> + locking, arm64
>
> On 2022-11-07 14:56:36 [+0100], Jan Kara wrote:
> > > spinlock_t and raw_spinlock_t differ slightly in terms of locking.
> > > rt_spin_lock() has the fast path via try_cmpxchg_acquire(). If you
> > > enable CONFIG_DEBUG_RT_MUTEXES then you would force the slow path which
> > > always acquires the rt_mutex_base::wait_lock (which is a raw_spinlock_t)
> > > while the actual lock is modified via cmpxchg.
> >
> > So I've tried enabling CONFIG_DEBUG_RT_MUTEXES and indeed the corruption
> > stops happening as well. So do you suspect some bug in the CPU itself?
>
> If it is only enabling CONFIG_DEBUG_RT_MUTEXES (and not whole lockdep)
> then it looks very suspicious.

Just to confirm, CONFIG_DEBUG_RT_MUTEXES is the only thing I've enabled and
the list corruption disappeared.

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-11-07 17:28:21

by Waiman Long

[permalink] [raw]

Subject: Re: Crash with PREEMPT_RT on aarch64 machine

On 11/7/22 10:10, Sebastian Andrzej Siewior wrote:
> + locking, arm64
>
> On 2022-11-07 14:56:36 [+0100], Jan Kara wrote:
>>> spinlock_t and raw_spinlock_t differ slightly in terms of locking.
>>> rt_spin_lock() has the fast path via try_cmpxchg_acquire(). If you
>>> enable CONFIG_DEBUG_RT_MUTEXES then you would force the slow path which
>>> always acquires the rt_mutex_base::wait_lock (which is a raw_spinlock_t)
>>> while the actual lock is modified via cmpxchg.
>> So I've tried enabling CONFIG_DEBUG_RT_MUTEXES and indeed the corruption
>> stops happening as well. So do you suspect some bug in the CPU itself?
> If it is only enabling CONFIG_DEBUG_RT_MUTEXES (and not whole lockdep)
> then it looks very suspicious.
> CONFIG_DEBUG_RT_MUTEXES enables a few additional checks but the main
> part is that rt_mutex_cmpxchg_acquire() + rt_mutex_cmpxchg_release()
> always fail (and so the slowpath under a raw_spinlock_t is done).
>
> So if it is really the fast path (rt_mutex_cmpxchg_acquire()) then it
> somehow smells like the CPU is misbehaving.
>
> Could someone from the locking/arm64 department check if the locking in
> RT-mutex (rtlock_lock()) is correct?
>
> rtmutex locking uses try_cmpxchg_acquire(, ptr, ptr) for the fastpath
> (and try_cmpxchg_release(, ptr, ptr) for unlock).
> Now looking at it again, I don't see much difference compared to what
> queued_spin_trylock() does except the latter always operates on 32bit
> value instead a pointer.

Both the fast path of queued spinlock and rt_spin_lock are using
try_cmpxchg_acquire(), the only difference I saw is the size of the data
to be cmpxchg'ed. qspinlock uses 32-bit integer whereas rt_spin_lock
uses 64-bit pointer. So I believe it is more on how the arm64 does
cmpxchg. I believe there are two different ways of doing it depending on
whether LSE atomics is available in the platform. So exactly what arm64
system is being used here and what hardware capability does it have?

Cheers,
Longman

2022-11-07 18:02:58

[permalink] [raw]

Subject: Re: Crash with PREEMPT_RT on aarch64 machine

On 2022-11-07 17:30:16 [+0100], Jan Kara wrote:
> On Mon 07-11-22 16:10:34, Sebastian Andrzej Siewior wrote:
> > + locking, arm64
> >
> > On 2022-11-07 14:56:36 [+0100], Jan Kara wrote:
> > > > spinlock_t and raw_spinlock_t differ slightly in terms of locking.
> > > > rt_spin_lock() has the fast path via try_cmpxchg_acquire(). If you
> > > > enable CONFIG_DEBUG_RT_MUTEXES then you would force the slow path which
> > > > always acquires the rt_mutex_base::wait_lock (which is a raw_spinlock_t)
> > > > while the actual lock is modified via cmpxchg.
> > >
> > > So I've tried enabling CONFIG_DEBUG_RT_MUTEXES and indeed the corruption
> > > stops happening as well. So do you suspect some bug in the CPU itself?
> >
> > If it is only enabling CONFIG_DEBUG_RT_MUTEXES (and not whole lockdep)
> > then it looks very suspicious.
>
> Just to confirm, CONFIG_DEBUG_RT_MUTEXES is the only thing I've enabled and
> the list corruption disappeared.

I don't know if this works but:
if you tell task_struct_cachep to be created with SLAB_CACHE_DMA32 then
the pointer should only have the lower 32bit set. With this could make
rt_mutex_base::owner an atomic_t type. You could then replace
try_cmpxchg_acquire() with atomic_try_cmpxchg_acquire() and do the 32bit
cmpxchg. You would then need set the const upper 32bit of the pointer
while returning the pointer.
I have no idea how much sense it makes but you would avoid the 64bit
cmpxchg making those two a little more alike :)

> Honza
>

Sebastian

2022-11-08 12:08:54

by Mark Rutland

[permalink] [raw]

Subject: Re: Crash with PREEMPT_RT on aarch64 machine

On Mon, Nov 07, 2022 at 11:49:01AM -0500, Waiman Long wrote:
> On 11/7/22 10:10, Sebastian Andrzej Siewior wrote:
> > + locking, arm64
> >
> > On 2022-11-07 14:56:36 [+0100], Jan Kara wrote:
> > > > spinlock_t and raw_spinlock_t differ slightly in terms of locking.
> > > > rt_spin_lock() has the fast path via try_cmpxchg_acquire(). If you
> > > > enable CONFIG_DEBUG_RT_MUTEXES then you would force the slow path which
> > > > always acquires the rt_mutex_base::wait_lock (which is a raw_spinlock_t)
> > > > while the actual lock is modified via cmpxchg.
> > > So I've tried enabling CONFIG_DEBUG_RT_MUTEXES and indeed the corruption
> > > stops happening as well. So do you suspect some bug in the CPU itself?
> > If it is only enabling CONFIG_DEBUG_RT_MUTEXES (and not whole lockdep)
> > then it looks very suspicious.
> > CONFIG_DEBUG_RT_MUTEXES enables a few additional checks but the main
> > part is that rt_mutex_cmpxchg_acquire() + rt_mutex_cmpxchg_release()
> > always fail (and so the slowpath under a raw_spinlock_t is done).
> >
> > So if it is really the fast path (rt_mutex_cmpxchg_acquire()) then it
> > somehow smells like the CPU is misbehaving.
> >
> > Could someone from the locking/arm64 department check if the locking in
> > RT-mutex (rtlock_lock()) is correct?
> >
> > rtmutex locking uses try_cmpxchg_acquire(, ptr, ptr) for the fastpath
> > (and try_cmpxchg_release(, ptr, ptr) for unlock).
> > Now looking at it again, I don't see much difference compared to what
> > queued_spin_trylock() does except the latter always operates on 32bit
> > value instead a pointer.
>
> Both the fast path of queued spinlock and rt_spin_lock are using
> try_cmpxchg_acquire(), the only difference I saw is the size of the data to
> be cmpxchg'ed. qspinlock uses 32-bit integer whereas rt_spin_lock uses
> 64-bit pointer. So I believe it is more on how the arm64 does cmpxchg. I
> believe there are two different ways of doing it depending on whether LSE
> atomics is available in the platform. So exactly what arm64 system is being
> used here and what hardware capability does it have?

From the /proc/cpuinfo output earlier, this is a Neoverse N1 system, with the
LSE atomics. Assuming the kernel was built with support for atomics in-kernel
(which is selected by default), it'll be using the LSE version.

Mark.

2022-11-08 17:56:49

[permalink] [raw]

Subject: Re: Crash with PREEMPT_RT on aarch64 machine

On Tue 08-11-22 10:53:40, Mark Rutland wrote:
> On Mon, Nov 07, 2022 at 11:49:01AM -0500, Waiman Long wrote:
> > On 11/7/22 10:10, Sebastian Andrzej Siewior wrote:
> > > + locking, arm64
> > >
> > > On 2022-11-07 14:56:36 [+0100], Jan Kara wrote:
> > > > > spinlock_t and raw_spinlock_t differ slightly in terms of locking.
> > > > > rt_spin_lock() has the fast path via try_cmpxchg_acquire(). If you
> > > > > enable CONFIG_DEBUG_RT_MUTEXES then you would force the slow path which
> > > > > always acquires the rt_mutex_base::wait_lock (which is a raw_spinlock_t)
> > > > > while the actual lock is modified via cmpxchg.
> > > > So I've tried enabling CONFIG_DEBUG_RT_MUTEXES and indeed the corruption
> > > > stops happening as well. So do you suspect some bug in the CPU itself?
> > > If it is only enabling CONFIG_DEBUG_RT_MUTEXES (and not whole lockdep)
> > > then it looks very suspicious.
> > > CONFIG_DEBUG_RT_MUTEXES enables a few additional checks but the main
> > > part is that rt_mutex_cmpxchg_acquire() + rt_mutex_cmpxchg_release()
> > > always fail (and so the slowpath under a raw_spinlock_t is done).
> > >
> > > So if it is really the fast path (rt_mutex_cmpxchg_acquire()) then it
> > > somehow smells like the CPU is misbehaving.
> > >
> > > Could someone from the locking/arm64 department check if the locking in
> > > RT-mutex (rtlock_lock()) is correct?
> > >
> > > rtmutex locking uses try_cmpxchg_acquire(, ptr, ptr) for the fastpath
> > > (and try_cmpxchg_release(, ptr, ptr) for unlock).
> > > Now looking at it again, I don't see much difference compared to what
> > > queued_spin_trylock() does except the latter always operates on 32bit
> > > value instead a pointer.
> >
> > Both the fast path of queued spinlock and rt_spin_lock are using
> > try_cmpxchg_acquire(), the only difference I saw is the size of the data to
> > be cmpxchg'ed. qspinlock uses 32-bit integer whereas rt_spin_lock uses
> > 64-bit pointer. So I believe it is more on how the arm64 does cmpxchg. I
> > believe there are two different ways of doing it depending on whether LSE
> > atomics is available in the platform. So exactly what arm64 system is being
> > used here and what hardware capability does it have?
>
> From the /proc/cpuinfo output earlier, this is a Neoverse N1 system, with the
> LSE atomics. Assuming the kernel was built with support for atomics in-kernel
> (which is selected by default), it'll be using the LSE version.

So I was able to reproduce the corruption both with LSE atomics enabled &
disabled in the kernel. It seems the problem takes considerably longer to
reproduce with LSE atomics enabled but it still does happen.

BTW, I've tried to reproduced the problem on another aarch64 machine with
CPU from a different vendor:

processor : 0
BogoMIPS : 200.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm
CPU implementer : 0x48
CPU architecture: 8
CPU variant : 0x1
CPU part : 0xd01
CPU revision : 0

And there the problem does not reproduce. So might it be a genuine bug in
the CPU implementation?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-11-09 10:13:34

by Mark Rutland

[permalink] [raw]

Subject: Re: Crash with PREEMPT_RT on aarch64 machine

On Tue, Nov 08, 2022 at 06:45:29PM +0100, Jan Kara wrote:
> On Tue 08-11-22 10:53:40, Mark Rutland wrote:
> > On Mon, Nov 07, 2022 at 11:49:01AM -0500, Waiman Long wrote:
> > > On 11/7/22 10:10, Sebastian Andrzej Siewior wrote:
> > > > + locking, arm64
> > > >
> > > > On 2022-11-07 14:56:36 [+0100], Jan Kara wrote:
> > > > > > spinlock_t and raw_spinlock_t differ slightly in terms of locking.
> > > > > > rt_spin_lock() has the fast path via try_cmpxchg_acquire(). If you
> > > > > > enable CONFIG_DEBUG_RT_MUTEXES then you would force the slow path which
> > > > > > always acquires the rt_mutex_base::wait_lock (which is a raw_spinlock_t)
> > > > > > while the actual lock is modified via cmpxchg.
> > > > > So I've tried enabling CONFIG_DEBUG_RT_MUTEXES and indeed the corruption
> > > > > stops happening as well. So do you suspect some bug in the CPU itself?
> > > > If it is only enabling CONFIG_DEBUG_RT_MUTEXES (and not whole lockdep)
> > > > then it looks very suspicious.
> > > > CONFIG_DEBUG_RT_MUTEXES enables a few additional checks but the main
> > > > part is that rt_mutex_cmpxchg_acquire() + rt_mutex_cmpxchg_release()
> > > > always fail (and so the slowpath under a raw_spinlock_t is done).
> > > >
> > > > So if it is really the fast path (rt_mutex_cmpxchg_acquire()) then it
> > > > somehow smells like the CPU is misbehaving.
> > > >
> > > > Could someone from the locking/arm64 department check if the locking in
> > > > RT-mutex (rtlock_lock()) is correct?
> > > >
> > > > rtmutex locking uses try_cmpxchg_acquire(, ptr, ptr) for the fastpath
> > > > (and try_cmpxchg_release(, ptr, ptr) for unlock).
> > > > Now looking at it again, I don't see much difference compared to what
> > > > queued_spin_trylock() does except the latter always operates on 32bit
> > > > value instead a pointer.
> > >
> > > Both the fast path of queued spinlock and rt_spin_lock are using
> > > try_cmpxchg_acquire(), the only difference I saw is the size of the data to
> > > be cmpxchg'ed. qspinlock uses 32-bit integer whereas rt_spin_lock uses
> > > 64-bit pointer. So I believe it is more on how the arm64 does cmpxchg. I
> > > believe there are two different ways of doing it depending on whether LSE
> > > atomics is available in the platform. So exactly what arm64 system is being
> > > used here and what hardware capability does it have?
> >
> > From the /proc/cpuinfo output earlier, this is a Neoverse N1 system, with the
> > LSE atomics. Assuming the kernel was built with support for atomics in-kernel
> > (which is selected by default), it'll be using the LSE version.
>
> So I was able to reproduce the corruption both with LSE atomics enabled &
> disabled in the kernel. It seems the problem takes considerably longer to
> reproduce with LSE atomics enabled but it still does happen.
>
> BTW, I've tried to reproduced the problem on another aarch64 machine with
> CPU from a different vendor:
>
> processor : 0
> BogoMIPS : 200.00
> Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm
> CPU implementer : 0x48
> CPU architecture: 8
> CPU variant : 0x1
> CPU part : 0xd01
> CPU revision : 0
>
> And there the problem does not reproduce. So might it be a genuine bug in
> the CPU implementation?

Perhaps, though I suspect it's more likely that we have an ordering bug in the
kernel code, and it shows up on CPUs with legitimate but more relaxed ordering.
We've had a couple of those show up on Apple M1, so it might be worth trying on
one of those.

How easy is this to reproduce? What's necessary?

Thanks,
Mark.

2022-11-09 10:56:44

[permalink] [raw]

Subject: Re: Crash with PREEMPT_RT on aarch64 machine

On 11/9/22 10:55, Mark Rutland wrote:
> On Tue, Nov 08, 2022 at 06:45:29PM +0100, Jan Kara wrote:
>> On Tue 08-11-22 10:53:40, Mark Rutland wrote:
>>> On Mon, Nov 07, 2022 at 11:49:01AM -0500, Waiman Long wrote:
>>>> On 11/7/22 10:10, Sebastian Andrzej Siewior wrote:
>>>>> + locking, arm64
>>>>>
>>>>> On 2022-11-07 14:56:36 [+0100], Jan Kara wrote:
>>>>>>> spinlock_t and raw_spinlock_t differ slightly in terms of locking.
>>>>>>> rt_spin_lock() has the fast path via try_cmpxchg_acquire(). If you
>>>>>>> enable CONFIG_DEBUG_RT_MUTEXES then you would force the slow path which
>>>>>>> always acquires the rt_mutex_base::wait_lock (which is a raw_spinlock_t)
>>>>>>> while the actual lock is modified via cmpxchg.
>>>>>> So I've tried enabling CONFIG_DEBUG_RT_MUTEXES and indeed the corruption
>>>>>> stops happening as well. So do you suspect some bug in the CPU itself?
>>>>> If it is only enabling CONFIG_DEBUG_RT_MUTEXES (and not whole lockdep)
>>>>> then it looks very suspicious.
>>>>> CONFIG_DEBUG_RT_MUTEXES enables a few additional checks but the main
>>>>> part is that rt_mutex_cmpxchg_acquire() + rt_mutex_cmpxchg_release()
>>>>> always fail (and so the slowpath under a raw_spinlock_t is done).
>>>>>
>>>>> So if it is really the fast path (rt_mutex_cmpxchg_acquire()) then it
>>>>> somehow smells like the CPU is misbehaving.
>>>>>
>>>>> Could someone from the locking/arm64 department check if the locking in
>>>>> RT-mutex (rtlock_lock()) is correct?
>>>>>
>>>>> rtmutex locking uses try_cmpxchg_acquire(, ptr, ptr) for the fastpath
>>>>> (and try_cmpxchg_release(, ptr, ptr) for unlock).
>>>>> Now looking at it again, I don't see much difference compared to what
>>>>> queued_spin_trylock() does except the latter always operates on 32bit
>>>>> value instead a pointer.
>>>>
>>>> Both the fast path of queued spinlock and rt_spin_lock are using
>>>> try_cmpxchg_acquire(), the only difference I saw is the size of the data to
>>>> be cmpxchg'ed. qspinlock uses 32-bit integer whereas rt_spin_lock uses
>>>> 64-bit pointer. So I believe it is more on how the arm64 does cmpxchg. I
>>>> believe there are two different ways of doing it depending on whether LSE
>>>> atomics is available in the platform. So exactly what arm64 system is being
>>>> used here and what hardware capability does it have?
>>>
>>> From the /proc/cpuinfo output earlier, this is a Neoverse N1 system, with the
>>> LSE atomics. Assuming the kernel was built with support for atomics in-kernel
>>> (which is selected by default), it'll be using the LSE version.
>>
>> So I was able to reproduce the corruption both with LSE atomics enabled &
>> disabled in the kernel. It seems the problem takes considerably longer to
>> reproduce with LSE atomics enabled but it still does happen.
>>
>> BTW, I've tried to reproduced the problem on another aarch64 machine with
>> CPU from a different vendor:
>>
>> processor : 0
>> BogoMIPS : 200.00
>> Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm
>> CPU implementer : 0x48
>> CPU architecture: 8
>> CPU variant : 0x1
>> CPU part : 0xd01
>> CPU revision : 0
>>
>> And there the problem does not reproduce. So might it be a genuine bug in
>> the CPU implementation?
>
> Perhaps, though I suspect it's more likely that we have an ordering bug in the
> kernel code, and it shows up on CPUs with legitimate but more relaxed ordering.
> We've had a couple of those show up on Apple M1, so it might be worth trying on
> one of those.
>
> How easy is this to reproduce? What's necessary?
>
> Thanks,
> Mark.

It is possible to reproduce it on an Ampere Altra, which has the following cpuinfo:
processor : 0
BogoMIPS : 50.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x3
CPU part : 0xd0c
CPU revision : 1

Command used:
dbench 80 -t 300 --clients-per-process=8

With the diff [2] applied, I get [1]. So sb_clear_inode_writeback()
seems to be called on a CPU with a null i_count. Maybe:
- CPUx deletes the inode via iput().
- CPUy calls sb_clear_inode_writeback() to update inode->i_wb_list
at the same time.

[1]
[ 165.003036] ------------[ cut here ]------------
[ 165.003042] kernel BUG at fs/fs-writeback.c:1294!
[ 165.003047] Internal error: Oops - BUG: 0 [#1] PREEMPT_RT SMP
[ 165.003052] Modules linked in: [...]
[ 165.003131] CPU: 87 PID: 9 Comm: kworker/u320:0 Not tainted 6.0.5-rt14-[...] #92
[ 165.003137] Hardware name: WIWYNN Mt.Jade Server System B81.03001.0005/Mt.Jade Motherboard, BIOS 1.08.20220218 (SCP: 1.08.20220218) 2022/02/18
[ 165.003138] Workqueue: ext4-rsv-conversion ext4_end_io_rsv_work
[ 165.003148] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 165.003151] pc : sb_clear_inode_writeback (fs/fs-writeback.c:1294 (discriminator 1))
[ 165.003159] lr : sb_clear_inode_writeback (./include/linux/atomic/atomic-instrumented.h:28 fs/fs-writeback.c:1292)
[...]
[ 165.003219] Call trace:
[ 165.003222] sb_clear_inode_writeback (fs/fs-writeback.c:1294 (discriminator 1))
[ 165.003226] __folio_end_writeback (./include/linux/spinlock_rt.h:123 mm/page-writeback.c:2942)
[ 165.003231] folio_end_writeback (mm/filemap.c:1620)
[ 165.003234] end_page_writeback (mm/folio-compat.c:27)
[ 165.003237] ext4_finish_bio (fs/ext4/page-io.c:145)
[ 165.003240] ext4_release_io_end (fs/ext4/page-io.c:161 (discriminator 3))
[ 165.003243] ext4_end_io_rsv_work (./include/linux/list.h:292 fs/ext4/page-io.c:254 fs/ext4/page-io.c:273)
[ 165.003246] process_one_work (kernel/workqueue.c:2294)
[ 165.003250] worker_thread (./include/linux/list.h:292 kernel/workqueue.c:2437)
[ 165.003252] kthread (kernel/kthread.c:376)
[ 165.003255] ret_from_fork (arch/arm64/kernel/entry.S:861)
[ 165.003260] Code: 54fff8a0 942cf1d4 17ffffc3 d4210000 (d4210000)
[...]
[ 165.245010] ------------[ cut here ]------------

[2]
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 443f83382b9b..0edb03eb43a4 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1252,13 +1252,27 @@ void sb_mark_inode_writeback(struct inode *inode)
{
struct super_block *sb = inode->i_sb;
unsigned long flags;
+ int local_count;

if (list_empty(&inode->i_wb_list)) {
spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
+
+ local_count = atomic_read(&inode->count) + 1;
+ BUG_ON(local_count != atomic_inc_return(&inode->count));
+ BUG_ON(!atomic_read(&inode->i_count));
+
if (list_empty(&inode->i_wb_list)) {
+
+ BUG_ON(!atomic_read(&inode->i_count));
+
list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
trace_sb_mark_inode_writeback(inode);
}
+
+ local_count = atomic_read(&inode->count) + 1;
+ BUG_ON(local_count != atomic_inc_return(&inode->count));
+ BUG_ON(!atomic_read(&inode->i_count));
+
spin_unlock_irqrestore(&sb->s_inode_wblist_lock, flags);
}
}
@@ -1270,13 +1284,27 @@ void sb_clear_inode_writeback(struct inode *inode)
{
struct super_block *sb = inode->i_sb;
unsigned long flags;
+ int local_count;

if (!list_empty(&inode->i_wb_list)) {
spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
+
+ local_count = atomic_read(&inode->count) - 1;
+ BUG_ON(local_count != atomic_dec_return(&inode->count));
+ BUG_ON(!atomic_read(&inode->i_count));
+
if (!list_empty(&inode->i_wb_list)) {
+
+ BUG_ON(!atomic_read(&inode->i_count));
+
list_del_init(&inode->i_wb_list);
trace_sb_clear_inode_writeback(inode);
}
+
+ local_count = atomic_read(&inode->count) - 1;
+ BUG_ON(local_count != atomic_dec_return(&inode->count));
+ BUG_ON(!atomic_read(&inode->i_count));
+
spin_unlock_irqrestore(&sb->s_inode_wblist_lock, flags);
}
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 56a4b4b02477..67027d4973a1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -658,6 +658,7 @@ struct inode {
struct list_head i_lru; /* inode LRU list */
struct list_head i_sb_list;
struct list_head i_wb_list; /* backing dev writeback list */
+ atomic_t count;
union {
struct hlist_head i_dentry;
struct rcu_head i_rcu;

2022-11-09 11:11:02

[permalink] [raw]

Subject: Re: Crash with PREEMPT_RT on aarch64 machine

On Wed 09-11-22 09:55:07, Mark Rutland wrote:
> On Tue, Nov 08, 2022 at 06:45:29PM +0100, Jan Kara wrote:
> > On Tue 08-11-22 10:53:40, Mark Rutland wrote:
> > > On Mon, Nov 07, 2022 at 11:49:01AM -0500, Waiman Long wrote:
> > > > On 11/7/22 10:10, Sebastian Andrzej Siewior wrote:
> > > > > + locking, arm64
> > > > >
> > > > > On 2022-11-07 14:56:36 [+0100], Jan Kara wrote:
> > > > > > > spinlock_t and raw_spinlock_t differ slightly in terms of locking.
> > > > > > > rt_spin_lock() has the fast path via try_cmpxchg_acquire(). If you
> > > > > > > enable CONFIG_DEBUG_RT_MUTEXES then you would force the slow path which
> > > > > > > always acquires the rt_mutex_base::wait_lock (which is a raw_spinlock_t)
> > > > > > > while the actual lock is modified via cmpxchg.
> > > > > > So I've tried enabling CONFIG_DEBUG_RT_MUTEXES and indeed the corruption
> > > > > > stops happening as well. So do you suspect some bug in the CPU itself?
> > > > > If it is only enabling CONFIG_DEBUG_RT_MUTEXES (and not whole lockdep)
> > > > > then it looks very suspicious.
> > > > > CONFIG_DEBUG_RT_MUTEXES enables a few additional checks but the main
> > > > > part is that rt_mutex_cmpxchg_acquire() + rt_mutex_cmpxchg_release()
> > > > > always fail (and so the slowpath under a raw_spinlock_t is done).
> > > > >
> > > > > So if it is really the fast path (rt_mutex_cmpxchg_acquire()) then it
> > > > > somehow smells like the CPU is misbehaving.
> > > > >
> > > > > Could someone from the locking/arm64 department check if the locking in
> > > > > RT-mutex (rtlock_lock()) is correct?
> > > > >
> > > > > rtmutex locking uses try_cmpxchg_acquire(, ptr, ptr) for the fastpath
> > > > > (and try_cmpxchg_release(, ptr, ptr) for unlock).
> > > > > Now looking at it again, I don't see much difference compared to what
> > > > > queued_spin_trylock() does except the latter always operates on 32bit
> > > > > value instead a pointer.
> > > >
> > > > Both the fast path of queued spinlock and rt_spin_lock are using
> > > > try_cmpxchg_acquire(), the only difference I saw is the size of the data to
> > > > be cmpxchg'ed. qspinlock uses 32-bit integer whereas rt_spin_lock uses
> > > > 64-bit pointer. So I believe it is more on how the arm64 does cmpxchg. I
> > > > believe there are two different ways of doing it depending on whether LSE
> > > > atomics is available in the platform. So exactly what arm64 system is being
> > > > used here and what hardware capability does it have?
> > >
> > > From the /proc/cpuinfo output earlier, this is a Neoverse N1 system, with the
> > > LSE atomics. Assuming the kernel was built with support for atomics in-kernel
> > > (which is selected by default), it'll be using the LSE version.
> >
> > So I was able to reproduce the corruption both with LSE atomics enabled &
> > disabled in the kernel. It seems the problem takes considerably longer to
> > reproduce with LSE atomics enabled but it still does happen.
> >
> > BTW, I've tried to reproduced the problem on another aarch64 machine with
> > CPU from a different vendor:
> >
> > processor : 0
> > BogoMIPS : 200.00
> > Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm
> > CPU implementer : 0x48
> > CPU architecture: 8
> > CPU variant : 0x1
> > CPU part : 0xd01
> > CPU revision : 0
> >
> > And there the problem does not reproduce. So might it be a genuine bug in
> > the CPU implementation?
>
> Perhaps, though I suspect it's more likely that we have an ordering bug in the
> kernel code, and it shows up on CPUs with legitimate but more relaxed ordering.
> We've had a couple of those show up on Apple M1, so it might be worth trying on
> one of those.
>
> How easy is this to reproduce? What's necessary?

As Pierre writes, on Ampere Altra machine running dbench benchmark on XFS
filesystem triggers this relatively easily (it takes it about 10 minutes to
trigger without atomics and about 30 minutes to trigger with the atomics
enabled).

Running the benchmark on XFS somehow seems to be important, we didn't see
the crash happen on ext4 (which may just mean it is less frequent on ext4
and didn't trigger in our initial testing after which we've started to
investigate crashes with XFS).

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-11-09 11:12:43

[permalink] [raw]

Subject: Re: Crash with PREEMPT_RT on aarch64 machine

On Wed 09-11-22 11:11:46, Pierre Gondois wrote:
> On 11/9/22 10:55, Mark Rutland wrote:
> > On Tue, Nov 08, 2022 at 06:45:29PM +0100, Jan Kara wrote:
> > > On Tue 08-11-22 10:53:40, Mark Rutland wrote:
> > > > On Mon, Nov 07, 2022 at 11:49:01AM -0500, Waiman Long wrote:
> > > > > On 11/7/22 10:10, Sebastian Andrzej Siewior wrote:
> > > > > > + locking, arm64
> > > > > >
> > > > > > On 2022-11-07 14:56:36 [+0100], Jan Kara wrote:
> > > > > > > > spinlock_t and raw_spinlock_t differ slightly in terms of locking.
> > > > > > > > rt_spin_lock() has the fast path via try_cmpxchg_acquire(). If you
> > > > > > > > enable CONFIG_DEBUG_RT_MUTEXES then you would force the slow path which
> > > > > > > > always acquires the rt_mutex_base::wait_lock (which is a raw_spinlock_t)
> > > > > > > > while the actual lock is modified via cmpxchg.
> > > > > > > So I've tried enabling CONFIG_DEBUG_RT_MUTEXES and indeed the corruption
> > > > > > > stops happening as well. So do you suspect some bug in the CPU itself?
> > > > > > If it is only enabling CONFIG_DEBUG_RT_MUTEXES (and not whole lockdep)
> > > > > > then it looks very suspicious.
> > > > > > CONFIG_DEBUG_RT_MUTEXES enables a few additional checks but the main
> > > > > > part is that rt_mutex_cmpxchg_acquire() + rt_mutex_cmpxchg_release()
> > > > > > always fail (and so the slowpath under a raw_spinlock_t is done).
> > > > > >
> > > > > > So if it is really the fast path (rt_mutex_cmpxchg_acquire()) then it
> > > > > > somehow smells like the CPU is misbehaving.
> > > > > >
> > > > > > Could someone from the locking/arm64 department check if the locking in
> > > > > > RT-mutex (rtlock_lock()) is correct?
> > > > > >
> > > > > > rtmutex locking uses try_cmpxchg_acquire(, ptr, ptr) for the fastpath
> > > > > > (and try_cmpxchg_release(, ptr, ptr) for unlock).
> > > > > > Now looking at it again, I don't see much difference compared to what
> > > > > > queued_spin_trylock() does except the latter always operates on 32bit
> > > > > > value instead a pointer.
> > > > >
> > > > > Both the fast path of queued spinlock and rt_spin_lock are using
> > > > > try_cmpxchg_acquire(), the only difference I saw is the size of the data to
> > > > > be cmpxchg'ed. qspinlock uses 32-bit integer whereas rt_spin_lock uses
> > > > > 64-bit pointer. So I believe it is more on how the arm64 does cmpxchg. I
> > > > > believe there are two different ways of doing it depending on whether LSE
> > > > > atomics is available in the platform. So exactly what arm64 system is being
> > > > > used here and what hardware capability does it have?
> > > >
> > > > From the /proc/cpuinfo output earlier, this is a Neoverse N1 system, with the
> > > > LSE atomics. Assuming the kernel was built with support for atomics in-kernel
> > > > (which is selected by default), it'll be using the LSE version.
> > >
> > > So I was able to reproduce the corruption both with LSE atomics enabled &
> > > disabled in the kernel. It seems the problem takes considerably longer to
> > > reproduce with LSE atomics enabled but it still does happen.
> > >
> > > BTW, I've tried to reproduced the problem on another aarch64 machine with
> > > CPU from a different vendor:
> > >
> > > processor : 0
> > > BogoMIPS : 200.00
> > > Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm
> > > CPU implementer : 0x48
> > > CPU architecture: 8
> > > CPU variant : 0x1
> > > CPU part : 0xd01
> > > CPU revision : 0
> > >
> > > And there the problem does not reproduce. So might it be a genuine bug in
> > > the CPU implementation?
> >
> > Perhaps, though I suspect it's more likely that we have an ordering bug in the
> > kernel code, and it shows up on CPUs with legitimate but more relaxed ordering.
> > We've had a couple of those show up on Apple M1, so it might be worth trying on
> > one of those.
> >
> > How easy is this to reproduce? What's necessary?
> >
> > Thanks,
> > Mark.
>
> It is possible to reproduce it on an Ampere Altra, which has the following cpuinfo:
> processor : 0
> BogoMIPS : 50.00
> Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
> CPU implementer : 0x41
> CPU architecture: 8
> CPU variant : 0x3
> CPU part : 0xd0c
> CPU revision : 1
>
> Command used:
> dbench 80 -t 300 --clients-per-process=8
>
> With the diff [2] applied, I get [1]. So sb_clear_inode_writeback()
> seems to be called on a CPU with a null i_count. Maybe:
> - CPUx deletes the inode via iput().
> - CPUy calls sb_clear_inode_writeback() to update inode->i_wb_list
> at the same time.

Yes, this is actually expected. inode->i_count can reach 0 while writeback
is in progress. Once i_count reaches 0, we get to iput_final() and either
add the inode to the LRU or decide to remove the inode and go to evict()
where truncate_inode_pages_final() is going to wait for all PageWriteback
bits and thus for all outstanding writeback for the inode.

If you want to independently check whether the inode is not getting freed
while still being under writeback, you can add assertion about your
inode->count being 0 to clear_inode(). In fact in my testing I have added
this assertion and it never triggered despite the list corruption was
happening.

Honza

> [1]
> [ 165.003036] ------------[ cut here ]------------
> [ 165.003042] kernel BUG at fs/fs-writeback.c:1294!
> [ 165.003047] Internal error: Oops - BUG: 0 [#1] PREEMPT_RT SMP
> [ 165.003052] Modules linked in: [...]
> [ 165.003131] CPU: 87 PID: 9 Comm: kworker/u320:0 Not tainted 6.0.5-rt14-[...] #92
> [ 165.003137] Hardware name: WIWYNN Mt.Jade Server System B81.03001.0005/Mt.Jade Motherboard, BIOS 1.08.20220218 (SCP: 1.08.20220218) 2022/02/18
> [ 165.003138] Workqueue: ext4-rsv-conversion ext4_end_io_rsv_work
> [ 165.003148] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [ 165.003151] pc : sb_clear_inode_writeback (fs/fs-writeback.c:1294 (discriminator 1))
> [ 165.003159] lr : sb_clear_inode_writeback (./include/linux/atomic/atomic-instrumented.h:28 fs/fs-writeback.c:1292)
> [...]
> [ 165.003219] Call trace:
> [ 165.003222] sb_clear_inode_writeback (fs/fs-writeback.c:1294 (discriminator 1))
> [ 165.003226] __folio_end_writeback (./include/linux/spinlock_rt.h:123 mm/page-writeback.c:2942)
> [ 165.003231] folio_end_writeback (mm/filemap.c:1620)
> [ 165.003234] end_page_writeback (mm/folio-compat.c:27)
> [ 165.003237] ext4_finish_bio (fs/ext4/page-io.c:145)
> [ 165.003240] ext4_release_io_end (fs/ext4/page-io.c:161 (discriminator 3))
> [ 165.003243] ext4_end_io_rsv_work (./include/linux/list.h:292 fs/ext4/page-io.c:254 fs/ext4/page-io.c:273)
> [ 165.003246] process_one_work (kernel/workqueue.c:2294)
> [ 165.003250] worker_thread (./include/linux/list.h:292 kernel/workqueue.c:2437)
> [ 165.003252] kthread (kernel/kthread.c:376)
> [ 165.003255] ret_from_fork (arch/arm64/kernel/entry.S:861)
> [ 165.003260] Code: 54fff8a0 942cf1d4 17ffffc3 d4210000 (d4210000)
> [...]
> [ 165.245010] ------------[ cut here ]------------
>
>
> [2]
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 443f83382b9b..0edb03eb43a4 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1252,13 +1252,27 @@ void sb_mark_inode_writeback(struct inode *inode)
> {
> struct super_block *sb = inode->i_sb;
> unsigned long flags;
> + int local_count;
> if (list_empty(&inode->i_wb_list)) {
> spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
> +
> + local_count = atomic_read(&inode->count) + 1;
> + BUG_ON(local_count != atomic_inc_return(&inode->count));
> + BUG_ON(!atomic_read(&inode->i_count));
> +
> if (list_empty(&inode->i_wb_list)) {
> +
> + BUG_ON(!atomic_read(&inode->i_count));
> +
> list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
> trace_sb_mark_inode_writeback(inode);
> }
> +
> + local_count = atomic_read(&inode->count) + 1;
> + BUG_ON(local_count != atomic_inc_return(&inode->count));
> + BUG_ON(!atomic_read(&inode->i_count));
> +
> spin_unlock_irqrestore(&sb->s_inode_wblist_lock, flags);
> }
> }
> @@ -1270,13 +1284,27 @@ void sb_clear_inode_writeback(struct inode *inode)
> {
> struct super_block *sb = inode->i_sb;
> unsigned long flags;
> + int local_count;
> if (!list_empty(&inode->i_wb_list)) {
> spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
> +
> + local_count = atomic_read(&inode->count) - 1;
> + BUG_ON(local_count != atomic_dec_return(&inode->count));
> + BUG_ON(!atomic_read(&inode->i_count));
> +
> if (!list_empty(&inode->i_wb_list)) {
> +
> + BUG_ON(!atomic_read(&inode->i_count));
> +
> list_del_init(&inode->i_wb_list);
> trace_sb_clear_inode_writeback(inode);
> }
> +
> + local_count = atomic_read(&inode->count) - 1;
> + BUG_ON(local_count != atomic_dec_return(&inode->count));
> + BUG_ON(!atomic_read(&inode->i_count));
> +
> spin_unlock_irqrestore(&sb->s_inode_wblist_lock, flags);
> }
> }
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 56a4b4b02477..67027d4973a1 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -658,6 +658,7 @@ struct inode {
> struct list_head i_lru; /* inode LRU list */
> struct list_head i_sb_list;
> struct list_head i_wb_list; /* backing dev writeback list */
> + atomic_t count;
> union {
> struct hlist_head i_dentry;
> struct rcu_head i_rcu;
>
>
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-11-09 13:12:56

by Will Deacon

[permalink] [raw]

Subject: Re: Crash with PREEMPT_RT on aarch64 machine

On Mon, Nov 07, 2022 at 11:49:01AM -0500, Waiman Long wrote:
> On 11/7/22 10:10, Sebastian Andrzej Siewior wrote:
> > + locking, arm64
> >
> > On 2022-11-07 14:56:36 [+0100], Jan Kara wrote:
> > > > spinlock_t and raw_spinlock_t differ slightly in terms of locking.
> > > > rt_spin_lock() has the fast path via try_cmpxchg_acquire(). If you
> > > > enable CONFIG_DEBUG_RT_MUTEXES then you would force the slow path which
> > > > always acquires the rt_mutex_base::wait_lock (which is a raw_spinlock_t)
> > > > while the actual lock is modified via cmpxchg.
> > > So I've tried enabling CONFIG_DEBUG_RT_MUTEXES and indeed the corruption
> > > stops happening as well. So do you suspect some bug in the CPU itself?
> > If it is only enabling CONFIG_DEBUG_RT_MUTEXES (and not whole lockdep)
> > then it looks very suspicious.
> > CONFIG_DEBUG_RT_MUTEXES enables a few additional checks but the main
> > part is that rt_mutex_cmpxchg_acquire() + rt_mutex_cmpxchg_release()
> > always fail (and so the slowpath under a raw_spinlock_t is done).
> >
> > So if it is really the fast path (rt_mutex_cmpxchg_acquire()) then it
> > somehow smells like the CPU is misbehaving.
> >
> > Could someone from the locking/arm64 department check if the locking in
> > RT-mutex (rtlock_lock()) is correct?
> >
> > rtmutex locking uses try_cmpxchg_acquire(, ptr, ptr) for the fastpath
> > (and try_cmpxchg_release(, ptr, ptr) for unlock).
> > Now looking at it again, I don't see much difference compared to what
> > queued_spin_trylock() does except the latter always operates on 32bit
> > value instead a pointer.
>
> Both the fast path of queued spinlock and rt_spin_lock are using
> try_cmpxchg_acquire(), the only difference I saw is the size of the data to
> be cmpxchg'ed. qspinlock uses 32-bit integer whereas rt_spin_lock uses
> 64-bit pointer. So I believe it is more on how the arm64 does cmpxchg. I
> believe there are two different ways of doing it depending on whether LSE
> atomics is available in the platform. So exactly what arm64 system is being
> used here and what hardware capability does it have?

I'd be more inclined to be suspicious of the slowpath tbh, as we need to
make sure that we have acquire semantics on all paths where the lock can
be taken. Looking at the rtmutex code, this really isn't obvious to me --
for example, try_to_take_rt_mutex() appears to be able to return via the
'takeit' label without acquire semantics and it looks like we might be
relying on the caller's subsequent _unlock_ of the wait_lock for ordering,
but that will give us release semantics which aren't correct.

As a quick hack, can you try chucking a barrier into rt_mutex_set_owner()?

Will

--->8

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 7779ee8abc2a..dd6a66c90f53 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -98,6 +98,7 @@ rt_mutex_set_owner(struct rt_mutex_base *lock, struct task_struct *owner)
val |= RT_MUTEX_HAS_WAITERS;

WRITE_ONCE(lock->owner, (struct task_struct *)val);
+ smp_mb();
}

static __always_inline void clear_rt_mutex_waiters(struct rt_mutex_base *lock)

2022-11-09 14:48:39

[permalink] [raw]

Subject: Re: Crash with PREEMPT_RT on aarch64 machine

On 11/9/22 12:01, Jan Kara wrote:
> On Wed 09-11-22 09:55:07, Mark Rutland wrote:
>> On Tue, Nov 08, 2022 at 06:45:29PM +0100, Jan Kara wrote:
>>> On Tue 08-11-22 10:53:40, Mark Rutland wrote:
>>>> On Mon, Nov 07, 2022 at 11:49:01AM -0500, Waiman Long wrote:
>>>>> On 11/7/22 10:10, Sebastian Andrzej Siewior wrote:
>>>>>> + locking, arm64
>>>>>>
>>>>>> On 2022-11-07 14:56:36 [+0100], Jan Kara wrote:
>>>>>>>> spinlock_t and raw_spinlock_t differ slightly in terms of locking.
>>>>>>>> rt_spin_lock() has the fast path via try_cmpxchg_acquire(). If you
>>>>>>>> enable CONFIG_DEBUG_RT_MUTEXES then you would force the slow path which
>>>>>>>> always acquires the rt_mutex_base::wait_lock (which is a raw_spinlock_t)
>>>>>>>> while the actual lock is modified via cmpxchg.
>>>>>>> So I've tried enabling CONFIG_DEBUG_RT_MUTEXES and indeed the corruption
>>>>>>> stops happening as well. So do you suspect some bug in the CPU itself?
>>>>>> If it is only enabling CONFIG_DEBUG_RT_MUTEXES (and not whole lockdep)
>>>>>> then it looks very suspicious.
>>>>>> CONFIG_DEBUG_RT_MUTEXES enables a few additional checks but the main
>>>>>> part is that rt_mutex_cmpxchg_acquire() + rt_mutex_cmpxchg_release()
>>>>>> always fail (and so the slowpath under a raw_spinlock_t is done).
>>>>>>
>>>>>> So if it is really the fast path (rt_mutex_cmpxchg_acquire()) then it
>>>>>> somehow smells like the CPU is misbehaving.
>>>>>>
>>>>>> Could someone from the locking/arm64 department check if the locking in
>>>>>> RT-mutex (rtlock_lock()) is correct?
>>>>>>
>>>>>> rtmutex locking uses try_cmpxchg_acquire(, ptr, ptr) for the fastpath
>>>>>> (and try_cmpxchg_release(, ptr, ptr) for unlock).
>>>>>> Now looking at it again, I don't see much difference compared to what
>>>>>> queued_spin_trylock() does except the latter always operates on 32bit
>>>>>> value instead a pointer.
>>>>>
>>>>> Both the fast path of queued spinlock and rt_spin_lock are using
>>>>> try_cmpxchg_acquire(), the only difference I saw is the size of the data to
>>>>> be cmpxchg'ed. qspinlock uses 32-bit integer whereas rt_spin_lock uses
>>>>> 64-bit pointer. So I believe it is more on how the arm64 does cmpxchg. I
>>>>> believe there are two different ways of doing it depending on whether LSE
>>>>> atomics is available in the platform. So exactly what arm64 system is being
>>>>> used here and what hardware capability does it have?
>>>>
>>>> From the /proc/cpuinfo output earlier, this is a Neoverse N1 system, with the
>>>> LSE atomics. Assuming the kernel was built with support for atomics in-kernel
>>>> (which is selected by default), it'll be using the LSE version.
>>>
>>> So I was able to reproduce the corruption both with LSE atomics enabled &
>>> disabled in the kernel. It seems the problem takes considerably longer to
>>> reproduce with LSE atomics enabled but it still does happen.
>>>
>>> BTW, I've tried to reproduced the problem on another aarch64 machine with
>>> CPU from a different vendor:
>>>
>>> processor : 0
>>> BogoMIPS : 200.00
>>> Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm
>>> CPU implementer : 0x48
>>> CPU architecture: 8
>>> CPU variant : 0x1
>>> CPU part : 0xd01
>>> CPU revision : 0
>>>
>>> And there the problem does not reproduce. So might it be a genuine bug in
>>> the CPU implementation?
>>
>> Perhaps, though I suspect it's more likely that we have an ordering bug in the
>> kernel code, and it shows up on CPUs with legitimate but more relaxed ordering.
>> We've had a couple of those show up on Apple M1, so it might be worth trying on
>> one of those.
>>
>> How easy is this to reproduce? What's necessary?
>
> As Pierre writes, on Ampere Altra machine running dbench benchmark on XFS
> filesystem triggers this relatively easily (it takes it about 10 minutes to
> trigger without atomics and about 30 minutes to trigger with the atomics
> enabled).
>
> Running the benchmark on XFS somehow seems to be important, we didn't see
> the crash happen on ext4 (which may just mean it is less frequent on ext4
> and didn't trigger in our initial testing after which we've started to
> investigate crashes with XFS).
>
> Honza

It was possible to reproduce on an Ampere eMAG. It takes < 1min to reproduce
once dbench is launched and seems more likely to trigger with the previous diff
applied. It even sometimes triggers without launching dbench on the Altra.

/proc/cpuinfo for eMAG:
processor : 0
BogoMIPS : 80.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
CPU implementer : 0x50
CPU architecture: 8
CPU variant : 0x3
CPU part : 0x000
CPU revision : 2

2022-11-09 14:52:20

[permalink] [raw]

Subject: Re: Crash with PREEMPT_RT on aarch64 machine

On 11/9/22 14:52, Pierre Gondois wrote:
>
>
> On 11/9/22 12:01, Jan Kara wrote:
>> On Wed 09-11-22 09:55:07, Mark Rutland wrote:
>>> On Tue, Nov 08, 2022 at 06:45:29PM +0100, Jan Kara wrote:
>>>> On Tue 08-11-22 10:53:40, Mark Rutland wrote:
>>>>> On Mon, Nov 07, 2022 at 11:49:01AM -0500, Waiman Long wrote:
>>>>>> On 11/7/22 10:10, Sebastian Andrzej Siewior wrote:
>>>>>>> + locking, arm64
>>>>>>>
>>>>>>> On 2022-11-07 14:56:36 [+0100], Jan Kara wrote:
>>>>>>>>> spinlock_t and raw_spinlock_t differ slightly in terms of locking.
>>>>>>>>> rt_spin_lock() has the fast path via try_cmpxchg_acquire(). If you
>>>>>>>>> enable CONFIG_DEBUG_RT_MUTEXES then you would force the slow path which
>>>>>>>>> always acquires the rt_mutex_base::wait_lock (which is a raw_spinlock_t)
>>>>>>>>> while the actual lock is modified via cmpxchg.
>>>>>>>> So I've tried enabling CONFIG_DEBUG_RT_MUTEXES and indeed the corruption
>>>>>>>> stops happening as well. So do you suspect some bug in the CPU itself?
>>>>>>> If it is only enabling CONFIG_DEBUG_RT_MUTEXES (and not whole lockdep)
>>>>>>> then it looks very suspicious.
>>>>>>> CONFIG_DEBUG_RT_MUTEXES enables a few additional checks but the main
>>>>>>> part is that rt_mutex_cmpxchg_acquire() + rt_mutex_cmpxchg_release()
>>>>>>> always fail (and so the slowpath under a raw_spinlock_t is done).
>>>>>>>
>>>>>>> So if it is really the fast path (rt_mutex_cmpxchg_acquire()) then it
>>>>>>> somehow smells like the CPU is misbehaving.
>>>>>>>
>>>>>>> Could someone from the locking/arm64 department check if the locking in
>>>>>>> RT-mutex (rtlock_lock()) is correct?
>>>>>>>
>>>>>>> rtmutex locking uses try_cmpxchg_acquire(, ptr, ptr) for the fastpath
>>>>>>> (and try_cmpxchg_release(, ptr, ptr) for unlock).
>>>>>>> Now looking at it again, I don't see much difference compared to what
>>>>>>> queued_spin_trylock() does except the latter always operates on 32bit
>>>>>>> value instead a pointer.
>>>>>>
>>>>>> Both the fast path of queued spinlock and rt_spin_lock are using
>>>>>> try_cmpxchg_acquire(), the only difference I saw is the size of the data to
>>>>>> be cmpxchg'ed. qspinlock uses 32-bit integer whereas rt_spin_lock uses
>>>>>> 64-bit pointer. So I believe it is more on how the arm64 does cmpxchg. I
>>>>>> believe there are two different ways of doing it depending on whether LSE
>>>>>> atomics is available in the platform. So exactly what arm64 system is being
>>>>>> used here and what hardware capability does it have?
>>>>>
>>>>> From the /proc/cpuinfo output earlier, this is a Neoverse N1 system, with the
>>>>> LSE atomics. Assuming the kernel was built with support for atomics in-kernel
>>>>> (which is selected by default), it'll be using the LSE version.
>>>>
>>>> So I was able to reproduce the corruption both with LSE atomics enabled &
>>>> disabled in the kernel. It seems the problem takes considerably longer to
>>>> reproduce with LSE atomics enabled but it still does happen.
>>>>
>>>> BTW, I've tried to reproduced the problem on another aarch64 machine with
>>>> CPU from a different vendor:
>>>>
>>>> processor : 0
>>>> BogoMIPS : 200.00
>>>> Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm
>>>> CPU implementer : 0x48
>>>> CPU architecture: 8
>>>> CPU variant : 0x1
>>>> CPU part : 0xd01
>>>> CPU revision : 0
>>>>
>>>> And there the problem does not reproduce. So might it be a genuine bug in
>>>> the CPU implementation?
>>>
>>> Perhaps, though I suspect it's more likely that we have an ordering bug in the
>>> kernel code, and it shows up on CPUs with legitimate but more relaxed ordering.
>>> We've had a couple of those show up on Apple M1, so it might be worth trying on
>>> one of those.
>>>
>>> How easy is this to reproduce? What's necessary?
>>
>> As Pierre writes, on Ampere Altra machine running dbench benchmark on XFS
>> filesystem triggers this relatively easily (it takes it about 10 minutes to
>> trigger without atomics and about 30 minutes to trigger with the atomics
>> enabled).
>>
>> Running the benchmark on XFS somehow seems to be important, we didn't see
>> the crash happen on ext4 (which may just mean it is less frequent on ext4
>> and didn't trigger in our initial testing after which we've started to
>> investigate crashes with XFS).
>>
>> Honza
>
> It was possible to reproduce on an Ampere eMAG. It takes < 1min to reproduce
> once dbench is launched and seems more likely to trigger with the previous diff
> applied. It even sometimes triggers without launching dbench on the Altra.
>
> /proc/cpuinfo for eMAG:
> processor : 0
> BogoMIPS : 80.00
> Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
> CPU implementer : 0x50
> CPU architecture: 8
> CPU variant : 0x3
> CPU part : 0x000
> CPU revision : 2
>

I misread the logs, the issue was not reproduced on the eMAG,
Sorry for the noise.

2022-11-09 16:35:24

[permalink] [raw]

Subject: Re: Crash with PREEMPT_RT on aarch64 machine

On Wed 09-11-22 12:57:57, Will Deacon wrote:
> On Mon, Nov 07, 2022 at 11:49:01AM -0500, Waiman Long wrote:
> > On 11/7/22 10:10, Sebastian Andrzej Siewior wrote:
> > > + locking, arm64
> > >
> > > On 2022-11-07 14:56:36 [+0100], Jan Kara wrote:
> > > > > spinlock_t and raw_spinlock_t differ slightly in terms of locking.
> > > > > rt_spin_lock() has the fast path via try_cmpxchg_acquire(). If you
> > > > > enable CONFIG_DEBUG_RT_MUTEXES then you would force the slow path which
> > > > > always acquires the rt_mutex_base::wait_lock (which is a raw_spinlock_t)
> > > > > while the actual lock is modified via cmpxchg.
> > > > So I've tried enabling CONFIG_DEBUG_RT_MUTEXES and indeed the corruption
> > > > stops happening as well. So do you suspect some bug in the CPU itself?
> > > If it is only enabling CONFIG_DEBUG_RT_MUTEXES (and not whole lockdep)
> > > then it looks very suspicious.
> > > CONFIG_DEBUG_RT_MUTEXES enables a few additional checks but the main
> > > part is that rt_mutex_cmpxchg_acquire() + rt_mutex_cmpxchg_release()
> > > always fail (and so the slowpath under a raw_spinlock_t is done).
> > >
> > > So if it is really the fast path (rt_mutex_cmpxchg_acquire()) then it
> > > somehow smells like the CPU is misbehaving.
> > >
> > > Could someone from the locking/arm64 department check if the locking in
> > > RT-mutex (rtlock_lock()) is correct?
> > >
> > > rtmutex locking uses try_cmpxchg_acquire(, ptr, ptr) for the fastpath
> > > (and try_cmpxchg_release(, ptr, ptr) for unlock).
> > > Now looking at it again, I don't see much difference compared to what
> > > queued_spin_trylock() does except the latter always operates on 32bit
> > > value instead a pointer.
> >
> > Both the fast path of queued spinlock and rt_spin_lock are using
> > try_cmpxchg_acquire(), the only difference I saw is the size of the data to
> > be cmpxchg'ed. qspinlock uses 32-bit integer whereas rt_spin_lock uses
> > 64-bit pointer. So I believe it is more on how the arm64 does cmpxchg. I
> > believe there are two different ways of doing it depending on whether LSE
> > atomics is available in the platform. So exactly what arm64 system is being
> > used here and what hardware capability does it have?
>
> I'd be more inclined to be suspicious of the slowpath tbh, as we need to
> make sure that we have acquire semantics on all paths where the lock can
> be taken. Looking at the rtmutex code, this really isn't obvious to me --
> for example, try_to_take_rt_mutex() appears to be able to return via the
> 'takeit' label without acquire semantics and it looks like we might be
> relying on the caller's subsequent _unlock_ of the wait_lock for ordering,
> but that will give us release semantics which aren't correct.
>
> As a quick hack, can you try chucking a barrier into rt_mutex_set_owner()?

Bingo! This patch fixes the crashes for me.

Honza

> --->8
>
> diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
> index 7779ee8abc2a..dd6a66c90f53 100644
> --- a/kernel/locking/rtmutex.c
> +++ b/kernel/locking/rtmutex.c
> @@ -98,6 +98,7 @@ rt_mutex_set_owner(struct rt_mutex_base *lock, struct task_struct *owner)
> val |= RT_MUTEX_HAS_WAITERS;
>
> WRITE_ONCE(lock->owner, (struct task_struct *)val);
> + smp_mb();
> }
>
> static __always_inline void clear_rt_mutex_waiters(struct rt_mutex_base *lock)
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-11-11 15:21:41

[permalink] [raw]

Subject: Re: Crash with PREEMPT_RT on aarch64 machine

On Wed 09-11-22 16:40:23, Jan Kara wrote:
> On Wed 09-11-22 12:57:57, Will Deacon wrote:
> > On Mon, Nov 07, 2022 at 11:49:01AM -0500, Waiman Long wrote:
> > > On 11/7/22 10:10, Sebastian Andrzej Siewior wrote:
> > > > + locking, arm64
> > > >
> > > > On 2022-11-07 14:56:36 [+0100], Jan Kara wrote:
> > > > > > spinlock_t and raw_spinlock_t differ slightly in terms of locking.
> > > > > > rt_spin_lock() has the fast path via try_cmpxchg_acquire(). If you
> > > > > > enable CONFIG_DEBUG_RT_MUTEXES then you would force the slow path which
> > > > > > always acquires the rt_mutex_base::wait_lock (which is a raw_spinlock_t)
> > > > > > while the actual lock is modified via cmpxchg.
> > > > > So I've tried enabling CONFIG_DEBUG_RT_MUTEXES and indeed the corruption
> > > > > stops happening as well. So do you suspect some bug in the CPU itself?
> > > > If it is only enabling CONFIG_DEBUG_RT_MUTEXES (and not whole lockdep)
> > > > then it looks very suspicious.
> > > > CONFIG_DEBUG_RT_MUTEXES enables a few additional checks but the main
> > > > part is that rt_mutex_cmpxchg_acquire() + rt_mutex_cmpxchg_release()
> > > > always fail (and so the slowpath under a raw_spinlock_t is done).
> > > >
> > > > So if it is really the fast path (rt_mutex_cmpxchg_acquire()) then it
> > > > somehow smells like the CPU is misbehaving.
> > > >
> > > > Could someone from the locking/arm64 department check if the locking in
> > > > RT-mutex (rtlock_lock()) is correct?
> > > >
> > > > rtmutex locking uses try_cmpxchg_acquire(, ptr, ptr) for the fastpath
> > > > (and try_cmpxchg_release(, ptr, ptr) for unlock).
> > > > Now looking at it again, I don't see much difference compared to what
> > > > queued_spin_trylock() does except the latter always operates on 32bit
> > > > value instead a pointer.
> > >
> > > Both the fast path of queued spinlock and rt_spin_lock are using
> > > try_cmpxchg_acquire(), the only difference I saw is the size of the data to
> > > be cmpxchg'ed. qspinlock uses 32-bit integer whereas rt_spin_lock uses
> > > 64-bit pointer. So I believe it is more on how the arm64 does cmpxchg. I
> > > believe there are two different ways of doing it depending on whether LSE
> > > atomics is available in the platform. So exactly what arm64 system is being
> > > used here and what hardware capability does it have?
> >
> > I'd be more inclined to be suspicious of the slowpath tbh, as we need to
> > make sure that we have acquire semantics on all paths where the lock can
> > be taken. Looking at the rtmutex code, this really isn't obvious to me --
> > for example, try_to_take_rt_mutex() appears to be able to return via the
> > 'takeit' label without acquire semantics and it looks like we might be
> > relying on the caller's subsequent _unlock_ of the wait_lock for ordering,
> > but that will give us release semantics which aren't correct.
> >
> > As a quick hack, can you try chucking a barrier into rt_mutex_set_owner()?
>
> Bingo! This patch fixes the crashes for me.

So I suppose this is not an official fix, is it? Sebastian, it appears to
be a bug in rtmutex implementation in the end AFAIU ;)

> > diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
> > index 7779ee8abc2a..dd6a66c90f53 100644
> > --- a/kernel/locking/rtmutex.c
> > +++ b/kernel/locking/rtmutex.c
> > @@ -98,6 +98,7 @@ rt_mutex_set_owner(struct rt_mutex_base *lock, struct task_struct *owner)
> > val |= RT_MUTEX_HAS_WAITERS;
> >
> > WRITE_ONCE(lock->owner, (struct task_struct *)val);
> > + smp_mb();
> > }
> >
> > static __always_inline void clear_rt_mutex_waiters(struct rt_mutex_base *lock)

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-11-14 13:23:54

by Will Deacon

[permalink] [raw]

Subject: Re: Crash with PREEMPT_RT on aarch64 machine

On Fri, Nov 11, 2022 at 03:27:42PM +0100, Jan Kara wrote:
> On Wed 09-11-22 16:40:23, Jan Kara wrote:
> > On Wed 09-11-22 12:57:57, Will Deacon wrote:
> > > On Mon, Nov 07, 2022 at 11:49:01AM -0500, Waiman Long wrote:
> > > > On 11/7/22 10:10, Sebastian Andrzej Siewior wrote:
> > > > > + locking, arm64
> > > > >
> > > > > On 2022-11-07 14:56:36 [+0100], Jan Kara wrote:
> > > > > > > spinlock_t and raw_spinlock_t differ slightly in terms of locking.
> > > > > > > rt_spin_lock() has the fast path via try_cmpxchg_acquire(). If you
> > > > > > > enable CONFIG_DEBUG_RT_MUTEXES then you would force the slow path which
> > > > > > > always acquires the rt_mutex_base::wait_lock (which is a raw_spinlock_t)
> > > > > > > while the actual lock is modified via cmpxchg.
> > > > > > So I've tried enabling CONFIG_DEBUG_RT_MUTEXES and indeed the corruption
> > > > > > stops happening as well. So do you suspect some bug in the CPU itself?
> > > > > If it is only enabling CONFIG_DEBUG_RT_MUTEXES (and not whole lockdep)
> > > > > then it looks very suspicious.
> > > > > CONFIG_DEBUG_RT_MUTEXES enables a few additional checks but the main
> > > > > part is that rt_mutex_cmpxchg_acquire() + rt_mutex_cmpxchg_release()
> > > > > always fail (and so the slowpath under a raw_spinlock_t is done).
> > > > >
> > > > > So if it is really the fast path (rt_mutex_cmpxchg_acquire()) then it
> > > > > somehow smells like the CPU is misbehaving.
> > > > >
> > > > > Could someone from the locking/arm64 department check if the locking in
> > > > > RT-mutex (rtlock_lock()) is correct?
> > > > >
> > > > > rtmutex locking uses try_cmpxchg_acquire(, ptr, ptr) for the fastpath
> > > > > (and try_cmpxchg_release(, ptr, ptr) for unlock).
> > > > > Now looking at it again, I don't see much difference compared to what
> > > > > queued_spin_trylock() does except the latter always operates on 32bit
> > > > > value instead a pointer.
> > > >
> > > > Both the fast path of queued spinlock and rt_spin_lock are using
> > > > try_cmpxchg_acquire(), the only difference I saw is the size of the data to
> > > > be cmpxchg'ed. qspinlock uses 32-bit integer whereas rt_spin_lock uses
> > > > 64-bit pointer. So I believe it is more on how the arm64 does cmpxchg. I
> > > > believe there are two different ways of doing it depending on whether LSE
> > > > atomics is available in the platform. So exactly what arm64 system is being
> > > > used here and what hardware capability does it have?
> > >
> > > I'd be more inclined to be suspicious of the slowpath tbh, as we need to
> > > make sure that we have acquire semantics on all paths where the lock can
> > > be taken. Looking at the rtmutex code, this really isn't obvious to me --
> > > for example, try_to_take_rt_mutex() appears to be able to return via the
> > > 'takeit' label without acquire semantics and it looks like we might be
> > > relying on the caller's subsequent _unlock_ of the wait_lock for ordering,
> > > but that will give us release semantics which aren't correct.
> > >
> > > As a quick hack, can you try chucking a barrier into rt_mutex_set_owner()?
> >
> > Bingo! This patch fixes the crashes for me.
>
> So I suppose this is not an official fix, is it? Sebastian, it appears to
> be a bug in rtmutex implementation in the end AFAIU ;)

Right, somebody needs to go audit all the acquisition paths on the slow-path
and make sure they all have acquire semantics. The trick is doing that
without incurring unnecessary overhead, e.g. by making use of dependency
ordering where it already exists.

Will

>
> > > diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
> > > index 7779ee8abc2a..dd6a66c90f53 100644
> > > --- a/kernel/locking/rtmutex.c
> > > +++ b/kernel/locking/rtmutex.c
> > > @@ -98,6 +98,7 @@ rt_mutex_set_owner(struct rt_mutex_base *lock, struct task_struct *owner)
> > > val |= RT_MUTEX_HAS_WAITERS;
> > >
> > > WRITE_ONCE(lock->owner, (struct task_struct *)val);
> > > + smp_mb();
> > > }
> > >
> > > static __always_inline void clear_rt_mutex_waiters(struct rt_mutex_base *lock)
>
> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

2022-11-28 16:16:59

[permalink] [raw]

Subject: Re: Crash with PREEMPT_RT on aarch64 machine

How about this?

- The fast path is easy…

- The slow path first sets the WAITER bits (mark_rt_mutex_waiters()) so
I made that one _acquire so that it is visible by the unlocker forcing
everyone into slow path.

- If the lock is acquired, then the owner is written via
rt_mutex_set_owner(). This happens under wait_lock so it is
serialized and so a WRITE_ONCE() is used to write the final owner. I
replaced it with a cmpxchg_acquire() to have the owner there.
Not sure if I shouldn't make this as you put it:
| e.g. by making use of dependency ordering where it already exists.
The other (locking) CPU needs to see the owner not only the WAITER
bit. I'm not sure if this could be replaced with smp_store_release().

- After the whole operation completes, fixup_rt_mutex_waiters() cleans
the WAITER bit and I kept the _acquire semantic here. Now looking at
it again, I don't think that needs to be done since that shouldn't be
set here.

- There is rtmutex_spin_on_owner() which (as the name implies) spins on
the owner as long as it active. It obtains it via READ_ONCE() and I'm
not sure if any memory barrier is needed. Worst case is that it will
spin while owner isn't set if it observers a stale value.

- The unlock path first clears the waiter bit if there are no waiters
recorded (via simple assignments under the wait_lock (every locker
will fail with the cmpxchg_acquire() and go for the wait_lock)) and
then finally drop it via rt_mutex_cmpxchg_release(,, NULL).
Should there be a wait, it will just store the WAITER bit with
smp_store_release() (setting the owner is NULL but the WAITER bit
forces everyone into the slow path).

- Added rt_mutex_set_owner_pi() which does simple assignment. This is
used from the futex code and here everything happens under a lock.

- I added a smp_load_acquire() to rt_mutex_base_is_locked() since I
*think* want to observe a real waiter and not something stale.

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
include/linux/rtmutex.h | 2 +-
kernel/locking/rtmutex.c | 26 ++++++++++++++++++--------
kernel/locking/rtmutex_api.c | 4 ++--
3 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index 7d049883a08ac..4447e01f723d4 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -41,7 +41,7 @@ struct rt_mutex_base {
*/
static inline bool rt_mutex_base_is_locked(struct rt_mutex_base *lock)
{
- return READ_ONCE(lock->owner) != NULL;
+ return smp_load_acquire(&lock->owner) != NULL;
}

extern void rt_mutex_base_init(struct rt_mutex_base *rtb);
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 7779ee8abc2a0..e3cc673e0c988 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -97,7 +97,7 @@ rt_mutex_set_owner(struct rt_mutex_base *lock, struct task_struct *owner)
if (rt_mutex_has_waiters(lock))
val |= RT_MUTEX_HAS_WAITERS;

- WRITE_ONCE(lock->owner, (struct task_struct *)val);
+ WARN_ON_ONCE(cmpxchg_acquire(&lock->owner, RT_MUTEX_HAS_WAITERS, val) != RT_MUTEX_HAS_WAITERS);
}

static __always_inline void clear_rt_mutex_waiters(struct rt_mutex_base *lock)
@@ -106,6 +106,17 @@ static __always_inline void clear_rt_mutex_waiters(struct rt_mutex_base *lock)
((unsigned long)lock->owner & ~RT_MUTEX_HAS_WAITERS);
}

+static __always_inline void
+rt_mutex_set_owner_pi(struct rt_mutex_base *lock, struct task_struct *owner)
+{
+ unsigned long val = (unsigned long)owner;
+
+ if (rt_mutex_has_waiters(lock))
+ val |= RT_MUTEX_HAS_WAITERS;
+
+ WRITE_ONCE(lock->owner, val);
+}
+
static __always_inline void fixup_rt_mutex_waiters(struct rt_mutex_base *lock)
{
unsigned long owner, *p = (unsigned long *) &lock->owner;
@@ -173,7 +184,7 @@ static __always_inline void fixup_rt_mutex_waiters(struct rt_mutex_base *lock)
*/
owner = READ_ONCE(*p);
if (owner & RT_MUTEX_HAS_WAITERS)
- WRITE_ONCE(*p, owner & ~RT_MUTEX_HAS_WAITERS);
+ cmpxchg_acquire(p, owner, owner & ~RT_MUTEX_HAS_WAITERS);
}

/*
@@ -196,17 +207,16 @@ static __always_inline bool rt_mutex_cmpxchg_release(struct rt_mutex_base *lock,
}

/*
- * Callers must hold the ->wait_lock -- which is the whole purpose as we force
- * all future threads that attempt to [Rmw] the lock to the slowpath. As such
- * relaxed semantics suffice.
+ * Callers hold the ->wait_lock. This needs to be visible to the lockowner,
+ * forcing him into the slow path for the unlock operation.
*/
static __always_inline void mark_rt_mutex_waiters(struct rt_mutex_base *lock)
{
unsigned long owner, *p = (unsigned long *) &lock->owner;

do {
- owner = *p;
- } while (cmpxchg_relaxed(p, owner,
+ owner = READ_ONCE(*p);
+ } while (cmpxchg_acquire(p, owner,
owner | RT_MUTEX_HAS_WAITERS) != owner);
}

@@ -1218,7 +1228,7 @@ static void __sched mark_wakeup_next_waiter(struct rt_wake_q_head *wqh,
* slow path making sure no task of lower priority than
* the top waiter can steal this lock.
*/
- lock->owner = (void *) RT_MUTEX_HAS_WAITERS;
+ smp_store_release(&lock->owner, (void *) RT_MUTEX_HAS_WAITERS);

/*
* We deboosted before waking the top waiter task such that we don't
diff --git a/kernel/locking/rtmutex_api.c b/kernel/locking/rtmutex_api.c
index 900220941caac..9acc176f93d38 100644
--- a/kernel/locking/rtmutex_api.c
+++ b/kernel/locking/rtmutex_api.c
@@ -249,7 +249,7 @@ void __sched rt_mutex_init_proxy_locked(struct rt_mutex_base *lock,
* recursion. Give the futex/rtmutex wait_lock a separate key.
*/
lockdep_set_class(&lock->wait_lock, &pi_futex_key);
- rt_mutex_set_owner(lock, proxy_owner);
+ rt_mutex_set_owner_pi(lock, proxy_owner);
}

/**
@@ -267,7 +267,7 @@ void __sched rt_mutex_init_proxy_locked(struct rt_mutex_base *lock,
void __sched rt_mutex_proxy_unlock(struct rt_mutex_base *lock)
{
debug_rt_mutex_proxy_unlock(lock);
- rt_mutex_set_owner(lock, NULL);
+ rt_mutex_set_owner_pi(lock, NULL);
}

/**
--
2.38.1

2022-11-28 21:00:15

[permalink] [raw]

Subject: Re: Re: Crash with PREEMPT_RT on aarch64 machine

Attachments:

(No filename) (5.74 kB)
config (288.28 kB)
Download all attachments

2022-11-28 21:27:45

[permalink] [raw]

Subject: Re: Re: Crash with PREEMPT_RT on aarch64 machine

Hi Sebastian,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on tip/locking/core]
[also build test WARNING on linus/master v6.1-rc7 next-20221128]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Sebastian-Andrzej-Siewior/Re-Crash-with-PREEMPT_RT-on-aarch64-machine/20221129-000000
patch link: https://lore.kernel.org/r/Y4Tapja2qq8HiHBZ%40linutronix.de
patch subject: Re: Crash with PREEMPT_RT on aarch64 machine
config: alpha-allyesconfig
compiler: alpha-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/d4aea399f0b5472dc15a4c623d99b4ae79005050
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Sebastian-Andrzej-Siewior/Re-Crash-with-PREEMPT_RT-on-aarch64-machine/20221129-000000
git checkout d4aea399f0b5472dc15a4c623d99b4ae79005050
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=alpha SHELL=/bin/bash kernel/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>

All warnings (new ones prefixed by >>):

In file included from include/asm-generic/bug.h:7,
from arch/alpha/include/asm/bug.h:23,
from include/linux/bug.h:5,
from include/linux/thread_info.h:13,
from include/asm-generic/preempt.h:5,
from ./arch/alpha/include/generated/asm/preempt.h:1,
from include/linux/preempt.h:78,
from include/linux/spinlock.h:56,
from kernel/locking/rtmutex_api.c:5:
kernel/locking/rtmutex.c: In function 'rt_mutex_set_owner':
>> arch/alpha/include/asm/cmpxchg.h:59:34: warning: initialization of 'struct task_struct *' from 'long unsigned int' makes pointer from integer without a cast [-Wint-conversion]
59 | __typeof__(*(ptr)) _o_ = (o); \
| ^
include/linux/once_lite.h:28:41: note: in definition of macro 'DO_ONCE_LITE_IF'
28 | bool __ret_do_once = !!(condition); \
| ^~~~~~~~~
kernel/locking/rtmutex.c:100:9: note: in expansion of macro 'WARN_ON_ONCE'
100 | WARN_ON_ONCE(cmpxchg_acquire(&lock->owner, RT_MUTEX_HAS_WAITERS, val) != RT_MUTEX_HAS_WAITERS);
| ^~~~~~~~~~~~
include/linux/atomic/atomic-arch-fallback.h:35:30: note: in expansion of macro 'arch_cmpxchg'
35 | #define arch_cmpxchg_acquire arch_cmpxchg
| ^~~~~~~~~~~~
include/linux/atomic/atomic-instrumented.h:1923:9: note: in expansion of macro 'arch_cmpxchg_acquire'
1923 | arch_cmpxchg_acquire(__ai_ptr, __VA_ARGS__); \
| ^~~~~~~~~~~~~~~~~~~~
kernel/locking/rtmutex.c:100:22: note: in expansion of macro 'cmpxchg_acquire'
100 | WARN_ON_ONCE(cmpxchg_acquire(&lock->owner, RT_MUTEX_HAS_WAITERS, val) != RT_MUTEX_HAS_WAITERS);
| ^~~~~~~~~~~~~~~
arch/alpha/include/asm/cmpxchg.h:59:34: note: (near initialization for '__ret_do_once')
59 | __typeof__(*(ptr)) _o_ = (o); \
| ^
include/linux/once_lite.h:28:41: note: in definition of macro 'DO_ONCE_LITE_IF'
28 | bool __ret_do_once = !!(condition); \
| ^~~~~~~~~
kernel/locking/rtmutex.c:100:9: note: in expansion of macro 'WARN_ON_ONCE'
100 | WARN_ON_ONCE(cmpxchg_acquire(&lock->owner, RT_MUTEX_HAS_WAITERS, val) != RT_MUTEX_HAS_WAITERS);
| ^~~~~~~~~~~~
include/linux/atomic/atomic-arch-fallback.h:35:30: note: in expansion of macro 'arch_cmpxchg'
35 | #define arch_cmpxchg_acquire arch_cmpxchg
| ^~~~~~~~~~~~
include/linux/atomic/atomic-instrumented.h:1923:9: note: in expansion of macro 'arch_cmpxchg_acquire'
1923 | arch_cmpxchg_acquire(__ai_ptr, __VA_ARGS__); \
| ^~~~~~~~~~~~~~~~~~~~
kernel/locking/rtmutex.c:100:22: note: in expansion of macro 'cmpxchg_acquire'
100 | WARN_ON_ONCE(cmpxchg_acquire(&lock->owner, RT_MUTEX_HAS_WAITERS, val) != RT_MUTEX_HAS_WAITERS);
| ^~~~~~~~~~~~~~~
arch/alpha/include/asm/cmpxchg.h:60:34: warning: initialization of 'struct task_struct *' from 'long unsigned int' makes pointer from integer without a cast [-Wint-conversion]
60 | __typeof__(*(ptr)) _n_ = (n); \
| ^
include/linux/once_lite.h:28:41: note: in definition of macro 'DO_ONCE_LITE_IF'
28 | bool __ret_do_once = !!(condition); \
| ^~~~~~~~~
kernel/locking/rtmutex.c:100:9: note: in expansion of macro 'WARN_ON_ONCE'
100 | WARN_ON_ONCE(cmpxchg_acquire(&lock->owner, RT_MUTEX_HAS_WAITERS, val) != RT_MUTEX_HAS_WAITERS);
| ^~~~~~~~~~~~
include/linux/atomic/atomic-arch-fallback.h:35:30: note: in expansion of macro 'arch_cmpxchg'
35 | #define arch_cmpxchg_acquire arch_cmpxchg
| ^~~~~~~~~~~~
include/linux/atomic/atomic-instrumented.h:1923:9: note: in expansion of macro 'arch_cmpxchg_acquire'
1923 | arch_cmpxchg_acquire(__ai_ptr, __VA_ARGS__); \
| ^~~~~~~~~~~~~~~~~~~~
kernel/locking/rtmutex.c:100:22: note: in expansion of macro 'cmpxchg_acquire'
100 | WARN_ON_ONCE(cmpxchg_acquire(&lock->owner, RT_MUTEX_HAS_WAITERS, val) != RT_MUTEX_HAS_WAITERS);
| ^~~~~~~~~~~~~~~
arch/alpha/include/asm/cmpxchg.h:60:34: note: (near initialization for '__ret_do_once')
60 | __typeof__(*(ptr)) _n_ = (n); \
| ^
include/linux/once_lite.h:28:41: note: in definition of macro 'DO_ONCE_LITE_IF'
28 | bool __ret_do_once = !!(condition); \
| ^~~~~~~~~
kernel/locking/rtmutex.c:100:9: note: in expansion of macro 'WARN_ON_ONCE'
100 | WARN_ON_ONCE(cmpxchg_acquire(&lock->owner, RT_MUTEX_HAS_WAITERS, val) != RT_MUTEX_HAS_WAITERS);
| ^~~~~~~~~~~~
include/linux/atomic/atomic-arch-fallback.h:35:30: note: in expansion of macro 'arch_cmpxchg'
35 | #define arch_cmpxchg_acquire arch_cmpxchg
| ^~~~~~~~~~~~
include/linux/atomic/atomic-instrumented.h:1923:9: note: in expansion of macro 'arch_cmpxchg_acquire'
1923 | arch_cmpxchg_acquire(__ai_ptr, __VA_ARGS__); \
| ^~~~~~~~~~~~~~~~~~~~
kernel/locking/rtmutex.c:100:22: note: in expansion of macro 'cmpxchg_acquire'
100 | WARN_ON_ONCE(cmpxchg_acquire(&lock->owner, RT_MUTEX_HAS_WAITERS, val) != RT_MUTEX_HAS_WAITERS);
| ^~~~~~~~~~~~~~~
kernel/locking/rtmutex.c:100:79: warning: comparison between pointer and integer
100 | WARN_ON_ONCE(cmpxchg_acquire(&lock->owner, RT_MUTEX_HAS_WAITERS, val) != RT_MUTEX_HAS_WAITERS);
| ^~
include/linux/once_lite.h:28:41: note: in definition of macro 'DO_ONCE_LITE_IF'
28 | bool __ret_do_once = !!(condition); \
| ^~~~~~~~~
kernel/locking/rtmutex.c:100:9: note: in expansion of macro 'WARN_ON_ONCE'
100 | WARN_ON_ONCE(cmpxchg_acquire(&lock->owner, RT_MUTEX_HAS_WAITERS, val) != RT_MUTEX_HAS_WAITERS);
| ^~~~~~~~~~~~
In file included from arch/alpha/include/asm/rwonce.h:33,
from include/linux/compiler.h:246,
from include/linux/build_bug.h:5,
from include/linux/container_of.h:5,
from include/linux/list.h:5,
from include/linux/preempt.h:11:
kernel/locking/rtmutex.c: In function 'rt_mutex_set_owner_pi':
include/asm-generic/rwonce.h:55:37: warning: assignment to 'struct task_struct *' from 'long unsigned int' makes pointer from integer without a cast [-Wint-conversion]
55 | *(volatile typeof(x) *)&(x) = (val); \
| ^
include/asm-generic/rwonce.h:61:9: note: in expansion of macro '__WRITE_ONCE'
61 | __WRITE_ONCE(x, val); \
| ^~~~~~~~~~~~
kernel/locking/rtmutex.c:117:9: note: in expansion of macro 'WRITE_ONCE'
117 | WRITE_ONCE(lock->owner, val);
| ^~~~~~~~~~

vim +59 arch/alpha/include/asm/cmpxchg.h

5ba840f9da1ff9 Paul Gortmaker 2012-04-02 55
96d330aff7060f Mark Rutland 2021-05-25 56 #define arch_cmpxchg(ptr, o, n) \
5ba840f9da1ff9 Paul Gortmaker 2012-04-02 57 ({ \
fbfcd019917098 Andrea Parri 2018-02-27 58 __typeof__(*(ptr)) __ret; \
5ba840f9da1ff9 Paul Gortmaker 2012-04-02 @59 __typeof__(*(ptr)) _o_ = (o); \
5ba840f9da1ff9 Paul Gortmaker 2012-04-02 60 __typeof__(*(ptr)) _n_ = (n); \
fbfcd019917098 Andrea Parri 2018-02-27 61 smp_mb(); \
fbfcd019917098 Andrea Parri 2018-02-27 62 __ret = (__typeof__(*(ptr))) __cmpxchg((ptr), \
fbfcd019917098 Andrea Parri 2018-02-27 63 (unsigned long)_o_, (unsigned long)_n_, sizeof(*(ptr)));\
fbfcd019917098 Andrea Parri 2018-02-27 64 smp_mb(); \
fbfcd019917098 Andrea Parri 2018-02-27 65 __ret; \
5ba840f9da1ff9 Paul Gortmaker 2012-04-02 66 })
5ba840f9da1ff9 Paul Gortmaker 2012-04-02 67

--
0-DAY CI Kernel Test Service
https://01.org/lkp

Attachments:

(No filename) (10.45 kB)
config (320.58 kB)
Download all attachments

2022-11-29 05:56:41

[permalink] [raw]

Subject: Re: Re: Crash with PREEMPT_RT on aarch64 machine

Hi Sebastian,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tip/locking/core]
[also build test ERROR on linus/master v6.1-rc7 next-20221128]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Sebastian-Andrzej-Siewior/Re-Crash-with-PREEMPT_RT-on-aarch64-machine/20221129-000000
patch link: https://lore.kernel.org/r/Y4Tapja2qq8HiHBZ%40linutronix.de
patch subject: Re: Crash with PREEMPT_RT on aarch64 machine
config: hexagon-randconfig-r041-20221128
compiler: clang version 16.0.0 (https://github.com/llvm/llvm-project 6e4cea55f0d1104408b26ac574566a0e4de48036)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/d4aea399f0b5472dc15a4c623d99b4ae79005050
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Sebastian-Andrzej-Siewior/Re-Crash-with-PREEMPT_RT-on-aarch64-machine/20221129-000000
git checkout d4aea399f0b5472dc15a4c623d99b4ae79005050
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=hexagon SHELL=/bin/bash kernel/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>

All error/warnings (new ones prefixed by >>):

In file included from kernel/locking/rtmutex_api.c:9:
>> kernel/locking/rtmutex.c:100:15: error: incompatible integer to pointer conversion initializing 'typeof (*(__ai_ptr))' (aka 'struct task_struct *') with an expression of type 'unsigned long' [-Wint-conversion]
WARN_ON_ONCE(cmpxchg_acquire(&lock->owner, RT_MUTEX_HAS_WAITERS, val) != RT_MUTEX_HAS_WAITERS);
~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/atomic/atomic-instrumented.h:1923:2: note: expanded from macro 'cmpxchg_acquire'
arch_cmpxchg_acquire(__ai_ptr, __VA_ARGS__); \
^
include/linux/atomic/atomic-arch-fallback.h:35:30: note: expanded from macro 'arch_cmpxchg_acquire'
#define arch_cmpxchg_acquire arch_cmpxchg
^
arch/hexagon/include/asm/cmpxchg.h:57:21: note: expanded from macro 'arch_cmpxchg'
__typeof__(*(ptr)) __old = (old); \
^
include/asm-generic/bug.h:180:41: note: expanded from macro 'WARN_ON_ONCE'
#define WARN_ON_ONCE(condition) WARN_ON(condition)
~~~~~~~~^~~~~~~~~~
include/asm-generic/bug.h:167:25: note: expanded from macro 'WARN_ON'
int __ret_warn_on = !!(condition); \
^~~~~~~~~
In file included from kernel/locking/rtmutex_api.c:9:
>> kernel/locking/rtmutex.c:100:15: error: incompatible integer to pointer conversion initializing 'typeof (*(__ai_ptr))' (aka 'struct task_struct *') with an expression of type 'unsigned long' [-Wint-conversion]
WARN_ON_ONCE(cmpxchg_acquire(&lock->owner, RT_MUTEX_HAS_WAITERS, val) != RT_MUTEX_HAS_WAITERS);
~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/atomic/atomic-instrumented.h:1923:2: note: expanded from macro 'cmpxchg_acquire'
arch_cmpxchg_acquire(__ai_ptr, __VA_ARGS__); \
^
include/linux/atomic/atomic-arch-fallback.h:35:30: note: expanded from macro 'arch_cmpxchg_acquire'
#define arch_cmpxchg_acquire arch_cmpxchg
^
arch/hexagon/include/asm/cmpxchg.h:58:21: note: expanded from macro 'arch_cmpxchg'
__typeof__(*(ptr)) __new = (new); \
^
include/asm-generic/bug.h:180:41: note: expanded from macro 'WARN_ON_ONCE'
#define WARN_ON_ONCE(condition) WARN_ON(condition)
~~~~~~~~^~~~~~~~~~
include/asm-generic/bug.h:167:25: note: expanded from macro 'WARN_ON'
int __ret_warn_on = !!(condition); \
^~~~~~~~~
In file included from kernel/locking/rtmutex_api.c:9:
>> kernel/locking/rtmutex.c:100:72: warning: comparison between pointer and integer ('typeof (*(__ai_ptr))' (aka 'struct task_struct *') and 'unsigned long') [-Wpointer-integer-compare]
WARN_ON_ONCE(cmpxchg_acquire(&lock->owner, RT_MUTEX_HAS_WAITERS, val) != RT_MUTEX_HAS_WAITERS);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
include/asm-generic/bug.h:180:41: note: expanded from macro 'WARN_ON_ONCE'
#define WARN_ON_ONCE(condition) WARN_ON(condition)
~~~~~~~~^~~~~~~~~~
include/asm-generic/bug.h:167:25: note: expanded from macro 'WARN_ON'
int __ret_warn_on = !!(condition); \
^~~~~~~~~
In file included from kernel/locking/rtmutex_api.c:9:
>> kernel/locking/rtmutex.c:117:2: error: incompatible integer to pointer conversion assigning to 'volatile typeof (lock->owner)' (aka 'struct task_struct *volatile') from 'unsigned long' [-Wint-conversion]
WRITE_ONCE(lock->owner, val);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/asm-generic/rwonce.h:61:2: note: expanded from macro 'WRITE_ONCE'
__WRITE_ONCE(x, val); \
^~~~~~~~~~~~~~~~~~~~
include/asm-generic/rwonce.h:55:30: note: expanded from macro '__WRITE_ONCE'
*(volatile typeof(x) *)&(x) = (val); \
^ ~~~~~
1 warning and 3 errors generated.

vim +100 kernel/locking/rtmutex.c

64
65 /*
66 * lock->owner state tracking:
67 *
68 * lock->owner holds the task_struct pointer of the owner. Bit 0
69 * is used to keep track of the "lock has waiters" state.
70 *
71 * owner bit0
72 * NULL 0 lock is free (fast acquire possible)
73 * NULL 1 lock is free and has waiters and the top waiter
74 * is going to take the lock*
75 * taskpointer 0 lock is held (fast release possible)
76 * taskpointer 1 lock is held and has waiters**
77 *
78 * The fast atomic compare exchange based acquire and release is only
79 * possible when bit 0 of lock->owner is 0.
80 *
81 * (*) It also can be a transitional state when grabbing the lock
82 * with ->wait_lock is held. To prevent any fast path cmpxchg to the lock,
83 * we need to set the bit0 before looking at the lock, and the owner may be
84 * NULL in this small time, hence this can be a transitional state.
85 *
86 * (**) There is a small time when bit 0 is set but there are no
87 * waiters. This can happen when grabbing the lock in the slow path.
88 * To prevent a cmpxchg of the owner releasing the lock, we need to
89 * set this bit before looking at the lock.
90 */
91
92 static __always_inline void
93 rt_mutex_set_owner(struct rt_mutex_base *lock, struct task_struct *owner)
94 {
95 unsigned long val = (unsigned long)owner;
96
97 if (rt_mutex_has_waiters(lock))
98 val |= RT_MUTEX_HAS_WAITERS;
99
> 100 WARN_ON_ONCE(cmpxchg_acquire(&lock->owner, RT_MUTEX_HAS_WAITERS, val) != RT_MUTEX_HAS_WAITERS);
101 }
102
103 static __always_inline void clear_rt_mutex_waiters(struct rt_mutex_base *lock)
104 {
105 lock->owner = (struct task_struct *)
106 ((unsigned long)lock->owner & ~RT_MUTEX_HAS_WAITERS);
107 }
108
109 static __always_inline void
110 rt_mutex_set_owner_pi(struct rt_mutex_base *lock, struct task_struct *owner)
111 {
112 unsigned long val = (unsigned long)owner;
113
114 if (rt_mutex_has_waiters(lock))
115 val |= RT_MUTEX_HAS_WAITERS;
116
> 117 WRITE_ONCE(lock->owner, val);
118 }
119

--
0-DAY CI Kernel Test Service
https://01.org/lkp

Attachments:

(No filename) (8.38 kB)
config (139.97 kB)
Download all attachments

2022-11-29 06:27:46

[permalink] [raw]

Subject: Re: Re: Crash with PREEMPT_RT on aarch64 machine

Attachments:

(No filename) (3.79 kB)
config (123.97 kB)
Download all attachments

2022-11-29 07:06:46

[permalink] [raw]

Subject: Re: Re: Crash with PREEMPT_RT on aarch64 machine

Attachments:

(No filename) (4.43 kB)
config (119.72 kB)
Download all attachments

2022-11-29 07:49:23

[permalink] [raw]

Subject: Re: Re: Crash with PREEMPT_RT on aarch64 machine

Attachments:

(No filename) (3.32 kB)
config (127.38 kB)
Download all attachments

2022-11-30 19:14:18

[permalink] [raw]

Subject: Re: Crash with PREEMPT_RT on aarch64 machine

On 11/28/22 16:58, Sebastian Andrzej Siewior wrote:
> How about this?
>
> - The fast path is easy…
>
> - The slow path first sets the WAITER bits (mark_rt_mutex_waiters()) so
> I made that one _acquire so that it is visible by the unlocker forcing
> everyone into slow path.
>
> - If the lock is acquired, then the owner is written via
> rt_mutex_set_owner(). This happens under wait_lock so it is
> serialized and so a WRITE_ONCE() is used to write the final owner. I
> replaced it with a cmpxchg_acquire() to have the owner there.
> Not sure if I shouldn't make this as you put it:
> | e.g. by making use of dependency ordering where it already exists.
> The other (locking) CPU needs to see the owner not only the WAITER
> bit. I'm not sure if this could be replaced with smp_store_release().
>
> - After the whole operation completes, fixup_rt_mutex_waiters() cleans
> the WAITER bit and I kept the _acquire semantic here. Now looking at
> it again, I don't think that needs to be done since that shouldn't be
> set here.
>
> - There is rtmutex_spin_on_owner() which (as the name implies) spins on
> the owner as long as it active. It obtains it via READ_ONCE() and I'm
> not sure if any memory barrier is needed. Worst case is that it will
> spin while owner isn't set if it observers a stale value.
>
> - The unlock path first clears the waiter bit if there are no waiters
> recorded (via simple assignments under the wait_lock (every locker
> will fail with the cmpxchg_acquire() and go for the wait_lock)) and
> then finally drop it via rt_mutex_cmpxchg_release(,, NULL).
> Should there be a wait, it will just store the WAITER bit with
> smp_store_release() (setting the owner is NULL but the WAITER bit
> forces everyone into the slow path).
>
> - Added rt_mutex_set_owner_pi() which does simple assignment. This is
> used from the futex code and here everything happens under a lock.
>
> - I added a smp_load_acquire() to rt_mutex_base_is_locked() since I
> *think* want to observe a real waiter and not something stale.
>
> Signed-off-by: Sebastian Andrzej Siewior <[email protected]>

Hello,
Just to share some debug attempts, I tried Sebastian's patch and could not
reproduce the error. While trying to understand the solution, I could not
reproduce the error if I only took the changes made to
mark_rt_mutex_waiters(), or to rt_mutex_set_owner_pi(). I am not sure I
understand why this would be a rt-mutex issue.

Without Sebastian's patch, to try adding some synchronization around the
'i_wb_list', I did the following:

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 443f83382b9b..42ce1f7f8aef 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1271,10 +1271,10 @@ void sb_clear_inode_writeback(struct inode *inode)
struct super_block *sb = inode->i_sb;
unsigned long flags;

- if (!list_empty(&inode->i_wb_list)) {
+ if (!list_empty_careful(&inode->i_wb_list)) {
spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
- if (!list_empty(&inode->i_wb_list)) {
- list_del_init(&inode->i_wb_list);
+ if (!list_empty_careful(&inode->i_wb_list)) {
+ list_del_init_careful(&inode->i_wb_list);
trace_sb_clear_inode_writeback(inode);
}
spin_unlock_irqrestore(&sb->s_inode_wblist_lock, flags);
diff --git a/fs/inode.c b/fs/inode.c
index b608528efd3a..fbe6b4fe5831 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -621,7 +621,7 @@ void clear_inode(struct inode *inode)
BUG_ON(!list_empty(&inode->i_data.private_list));
BUG_ON(!(inode->i_state & I_FREEING));
BUG_ON(inode->i_state & I_CLEAR);
- BUG_ON(!list_empty(&inode->i_wb_list));
+ BUG_ON(!list_empty_careful(&inode->i_wb_list));
/* don't need i_lock here, no concurrent mods to i_state */
inode->i_state = I_FREEING | I_CLEAR;
}

I never stepped on the:
BUG_ON(!list_empty_careful(&inode->i_wb_list))
statement again, but got the dump at [2]. I also regularly end-up
with the following endless logs when trying other things, when rebooting:

EXT4-fs (nvme0n1p3): sb orphan head is 2840597
sb_info orphan list:
inode nvme0n1p3:3958579 at 00000000b5934dff: mode 100664, nlink 1, next 0
inode nvme0n1p3:3958579 at 00000000b5934dff: mode 100664, nlink 1, next 0
...

Also, Jan said that the issue was reproducible on 'two different aarch64
machines', cf [1]. Would it be possible to know which one ?

Regards,
Pierre

[1]
https://lore.kernel.org/all/20221103115444.m2rjglbkubydidts@quack3/

[2]
EXT4-fs (nvme0n1p3): Inode 2834153 (0000000051c7b29b): orphan list check failed!
0000000051c7b29b: 0000f30a 00000004 00000000 00000000 ................
00000000a0792dde: 00000000 00000000 00000000 00000000 ................
0000000065a25e3d: 00000000 00000000 00000000 00000000 ................
00000000b2085d6d: 00000000 00000000 00000000 002b3fe8 .............?+.
0000000088f2c42f: 00000000 00000000 00000159 00000000 ........Y.......
00000000b9d22813: 00080000 00000040 80000000 00000000 ....@...........
0000000004f93ec7: 00000000 00000000 00000000 00000000 ................
00000000d8e00df6: 00000000 00000000 00000000 00000000 ................
00000000d29047af: ba813320 ffff07ff 0d822240 ffff4001 3......@"...@..
00000000e9708e5a: 0d822250 ffff4001 0d822250 ffff4001 P"[email protected]"...@..
000000001f08ff8a: 0d822260 ffff4001 0d822260 ffff4001 `"...@..`"...@..
000000003231aba6: 00000000 00000000 00000000 00000000 ................
00000000df5b63ba: 00000000 00000000 00000000 00000000 ................
00000000367f58f3: 00000000 00000000 00000000 00000000 ................
000000008a1af872: 0d8222a0 ffff4001 0d8222a0 ffff4001 ."...@..."...@..
000000002a8dc95e: 00000000 00000000 00000000 00000000 ................
00000000e4c5bdb4: 00000000 00000000 00000000 00000000 ................
000000004f0a738d: 00000000 00000000 80000000 00000000 ................
0000000082a9e5b2: 00000000 00000000 00000000 00000000 ................
0000000064dce462: 00000000 00000000 00000000 00000000 ................
00000000fe106bb0: 000d8180 000040ab 000040ab 00000000 .....@...@......
00000000946bcd55: 00000000 00000000 00000000 00000000 ................
000000001d64e9fd: c1390a80 ffffdab5 87d6c800 ffff07ff ..9.............
000000002afa83ff: 0d822488 ffff4001 54d25060 ffff4002 .$...@..`P.T.@..
0000000075bfe8c6: 002b3ee9 00000000 00000000 00000000 .>+.............
00000000580feb22: 00000000 00000000 63877e76 00000000 ........v~.c....
00000000cca606aa: 010d8f0e 00000000 63877e76 00000000 ........v~.c....
00000000d446eaef: 010d8f0e 00000000 63877e76 00000000 ........v~.c....
0000000072243536: 010d8f0e 00000000 00000000 00000000 ................
000000007cbeccb9: 00000000 00000000 00000000 00000000 ................
0000000026d5ad72: 00000000 00000000 000c0000 00000000 ................
00000000e28ac20a: 00000000 00000000 00000060 00000000 ........`.......
0000000076ed32fb: 80000000 00000000 00000000 00000000 ................
00000000cd183175: 00000000 00000000 00000000 00000000 ................
00000000ecafc825: 00000000 00000000 00000000 00000000 ................
00000000408fde6f: 00000000 00000000 00000000 00000000 ................
000000004d7c3704: 00000000 00000000 0d822408 ffff4001 .........$...@..
000000007a24c141: 0d822408 ffff4001 0d822418 ffff4001 .$...@...$...@..
000000007ce51788: 0d822418 ffff4001 0d822428 ffff4001 .$...@..($...@..
00000000efd9c162: 0d822428 ffff4001 0d822438 ffff4001 ([email protected]$...@..
0000000013f0626e: 0d822438 ffff4001 00000000 00000000 8$...@..........
00000000e8fc5904: 00000000 00000000 00000002 00000000 ................
000000006533e04b: 00000000 00000000 00000000 00000000 ................
000000009a33c9d5: 00000000 00000000 c1390b40 ffffdab5 [email protected].....
00000000b743e93e: 00000000 00000000 0d822300 ffff4001 .........#...@..
0000000041c5a701: 00000000 00000000 00000000 00000000 ................
000000009f872e56: 00000000 00000000 00000000 00000000 ................
00000000f6ca0703: 00000021 00000000 00000000 00000000 !...............
00000000e79eacb9: 80000000 00000000 00000000 00000000 ................
000000003afb3989: 00000000 00000000 00000000 00000000 ................
00000000164436d4: 00000000 00000000 00100cca 00000000 ................
00000000d63a3021: 00000000 00000000 00000000 00000000 ................
00000000e0e1ace3: 80000000 00000000 00000000 00000000 ................
0000000043ac9c19: 00000000 00000000 00000000 00000000 ................
00000000f3870564: 00000000 00000000 00000000 00000000 ................
0000000082c87bf9: 00000000 00000000 c1391310 ffffdab5 ..........9.....
00000000f5524c75: 00000010 00000000 00000000 00000000 ................
00000000dfda4192: 00000000 00000000 00000000 00000000 ................
00000000863650dd: 00000000 00000000 00000000 00000000 ................
00000000b709ac61: 0d822570 ffff4001 0d822570 ffff4001 p%[email protected]%...@..
000000009f316d71: 00000000 00000000 0d822588 ffff4001 .........%...@..
00000000f15fb4ed: 0d822588 ffff4001 00000000 00000000 .%...@..........
000000000027077a: 6401ffec 00000000 00000000 00000000 ...d............
00000000c828fe47: 00000000 00000000 00000000 00000000 ................
00000000b5f575af: 00000000 00000000 00000000 00000000 ................
000000001466bf98: 00000000 00000000 00000000 00000000 ................
0000000097025855: 00000000 00000000 00000000 00000000 ................
000000002557bcf0: 63877e76 00000000 010d8f0e 00000000 v~.c............
00000000b128f3c5: 00000000 00000000 0d822608 ffff4001 .........&...@..
000000005b473b40: 0d822608 ffff4001 00000000 00000000 .&...@..........
000000000913f445: 00000000 00000000 00000000 00000000 ................
000000003023853f: 00000000 00000000 00000000 00000000 ................
0000000025fcdffb: 00000000 00000000 80000000 00000000 ................
00000000334c5dc4: 00000000 00000000 00000000 00000000 ................
00000000f1ada795: 00000000 00000000 00000000 00000000 ................
0000000030e27dd3: 00000000 00000000 0d822678 ffff4001 ........x&...@..
0000000033a6a483: 0d822678 ffff4001 00000000 00000000 x&...@..........
00000000eb614a98: 00000000 ffffffff 00000000 00000000 ................
00000000df74e25f: 00000000 00000000 00000020 00000000 ........ .......
00000000efcf717a: 00000000 00000000 00000000 00000000 ................
00000000f84ffeba: 00000000 00000000 00000000 00000000 ................
00000000651071c7: 00000000 00000000 0d8226d8 ffff4001 .........&...@..
00000000a748241c: 0d8226d8 ffff4001 ffffffe0 0000000f .&...@..........
000000004c1557b3: 0d8226f0 ffff4001 0d8226f0 ffff4001 .&...@...&...@..
00000000dccbb716: c08cf6b0 ffffdab5 00000000 00000000 ................
000000009d2fb057: 00000000 00000000 00000000 00000000 ................
00000000f6f284e6: 00000000 00000000 00000000 00000000 ................
000000003bb74f4c: 00414087 00414087 00000000 00000000 .@A..@A.........
00000000e0601ec8: 00000000 00000000 00000000 00000000 ................
000000006a017bfb: cef69528 00000000 (.......
CPU: 125 PID: 3898 Comm: dbench Not tainted 6.0.5-rt14-[...]
Hardware name: WIWYNN Mt.Jade Server System B81.03001.0005/Mt.Jade Motherboard, BIOS 1.08.20220218 (SCP: 1.08.20220218) 2022/02/18
Call trace:
[...]
ext4_destroy_inode+0xc8/0xd0
destroy_inode+0x48/0x80
evict+0x148/0x190
iput+0x184/0x250
do_unlinkat+0x1d8/0x290
__arm64_sys_unlinkat+0x48/0x90
invoke_syscall+0x78/0x100
el0_svc_common.constprop.0+0x54/0x194
do_el0_svc+0x38/0xd0
el0_svc+0x34/0x160
el0t_64_sync_handler+0xbc/0x13c
el0t_64_sync+0x1a0/0x1a4

2022-11-30 20:33:54

by Mel Gorman

[permalink] [raw]

Subject: Re: Crash with PREEMPT_RT on aarch64 machine

On Mon, Nov 28, 2022 at 04:58:30PM +0100, Sebastian Andrzej Siewior wrote:
> How about this?
>
> - The fast path is easy???
>
> - The slow path first sets the WAITER bits (mark_rt_mutex_waiters()) so
> I made that one _acquire so that it is visible by the unlocker forcing
> everyone into slow path.
>
> - If the lock is acquired, then the owner is written via
> rt_mutex_set_owner(). This happens under wait_lock so it is
> serialized and so a WRITE_ONCE() is used to write the final owner. I
> replaced it with a cmpxchg_acquire() to have the owner there.
> Not sure if I shouldn't make this as you put it:
> | e.g. by making use of dependency ordering where it already exists.
> The other (locking) CPU needs to see the owner not only the WAITER
> bit. I'm not sure if this could be replaced with smp_store_release().
>
> - After the whole operation completes, fixup_rt_mutex_waiters() cleans
> the WAITER bit and I kept the _acquire semantic here. Now looking at
> it again, I don't think that needs to be done since that shouldn't be
> set here.
>
> - There is rtmutex_spin_on_owner() which (as the name implies) spins on
> the owner as long as it active. It obtains it via READ_ONCE() and I'm
> not sure if any memory barrier is needed. Worst case is that it will
> spin while owner isn't set if it observers a stale value.
>
> - The unlock path first clears the waiter bit if there are no waiters
> recorded (via simple assignments under the wait_lock (every locker
> will fail with the cmpxchg_acquire() and go for the wait_lock)) and
> then finally drop it via rt_mutex_cmpxchg_release(,, NULL).
> Should there be a wait, it will just store the WAITER bit with
> smp_store_release() (setting the owner is NULL but the WAITER bit
> forces everyone into the slow path).
>
> - Added rt_mutex_set_owner_pi() which does simple assignment. This is
> used from the futex code and here everything happens under a lock.
>
> - I added a smp_load_acquire() to rt_mutex_base_is_locked() since I
> *think* want to observe a real waiter and not something stale.
>
> Signed-off-by: Sebastian Andrzej Siewior <[email protected]>

include/linux/rtmutex.h needs to also include asm/barrier.h to resolve
some build problems. Once that was resolved, 10 iterations of the dbench
work completed successfully and without the patch, 1 iteration could not
complete.

Review is trickier as I'm not spent any reasonable amount of time on locking
primitives. I'd have to defer to Peter in that regard but I skimmed it at
least before wrapping up for the evening.

> ---
> include/linux/rtmutex.h | 2 +-
> kernel/locking/rtmutex.c | 26 ++++++++++++++++++--------
> kernel/locking/rtmutex_api.c | 4 ++--
> 3 files changed, 21 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
> index 7d049883a08ac..4447e01f723d4 100644
> --- a/include/linux/rtmutex.h
> +++ b/include/linux/rtmutex.h
> @@ -41,7 +41,7 @@ struct rt_mutex_base {
> */
> static inline bool rt_mutex_base_is_locked(struct rt_mutex_base *lock)
> {
> - return READ_ONCE(lock->owner) != NULL;
> + return smp_load_acquire(&lock->owner) != NULL;
> }
>

I don't believe this is necessary. It's only needed when checking if a
lock is acquired or not and it's inherently race-prone. It's harmless if
a stale value is observed and it does not pair with a release. Mostly it's
useful for debugging checks.

> extern void rt_mutex_base_init(struct rt_mutex_base *rtb);
> diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
> index 7779ee8abc2a0..e3cc673e0c988 100644
> --- a/kernel/locking/rtmutex.c
> +++ b/kernel/locking/rtmutex.c
> @@ -97,7 +97,7 @@ rt_mutex_set_owner(struct rt_mutex_base *lock, struct task_struct *owner)
> if (rt_mutex_has_waiters(lock))
> val |= RT_MUTEX_HAS_WAITERS;
>
> - WRITE_ONCE(lock->owner, (struct task_struct *)val);
> + WARN_ON_ONCE(cmpxchg_acquire(&lock->owner, RT_MUTEX_HAS_WAITERS, val) != RT_MUTEX_HAS_WAITERS);
> }
>
> static __always_inline void clear_rt_mutex_waiters(struct rt_mutex_base *lock)
> @@ -106,6 +106,17 @@ static __always_inline void clear_rt_mutex_waiters(struct rt_mutex_base *lock)
> ((unsigned long)lock->owner & ~RT_MUTEX_HAS_WAITERS);
> }
>
> +static __always_inline void
> +rt_mutex_set_owner_pi(struct rt_mutex_base *lock, struct task_struct *owner)
> +{

What does pi mean in this context? I think the naming here might
misleading. rt_mutex_set_owner_pi is used when initialising and when
clearing the owner. rt_mutex_set_owner is set when acquiring the lock.

Consider renaming rt_mutex_set_owner_pi to rt_mutex_clear_owner. The init
could still use rt_mutex_set_owner as an extra barrier is not a big deal
during init if the straight assignment was unpopular. The init could also
do a plain assignment because it cannot have any waiters yet.

What is less obvious is if rt_mutex_clear_owner should have explicit release
semantics to pair with rt_mutex_set_owner. It looks like it might not
matter because at least some paths end up having release semantics anyway
due to a spinlock but I didn't check all cases and it's potentially fragile.

> + unsigned long val = (unsigned long)owner;
> +
> + if (rt_mutex_has_waiters(lock))
> + val |= RT_MUTEX_HAS_WAITERS;
> +
> + WRITE_ONCE(lock->owner, val);
> +}
> +
> static __always_inline void fixup_rt_mutex_waiters(struct rt_mutex_base *lock)
> {
> unsigned long owner, *p = (unsigned long *) &lock->owner;
> @@ -173,7 +184,7 @@ static __always_inline void fixup_rt_mutex_waiters(struct rt_mutex_base *lock)
> */
> owner = READ_ONCE(*p);
> if (owner & RT_MUTEX_HAS_WAITERS)
> - WRITE_ONCE(*p, owner & ~RT_MUTEX_HAS_WAITERS);
> + cmpxchg_acquire(p, owner, owner & ~RT_MUTEX_HAS_WAITERS);
> }
>
> /*
> @@ -196,17 +207,16 @@ static __always_inline bool rt_mutex_cmpxchg_release(struct rt_mutex_base *lock,
> }
>
> /*
> - * Callers must hold the ->wait_lock -- which is the whole purpose as we force
> - * all future threads that attempt to [Rmw] the lock to the slowpath. As such
> - * relaxed semantics suffice.
> + * Callers hold the ->wait_lock. This needs to be visible to the lockowner,
> + * forcing him into the slow path for the unlock operation.
> */
> static __always_inline void mark_rt_mutex_waiters(struct rt_mutex_base *lock)
> {
> unsigned long owner, *p = (unsigned long *) &lock->owner;
>
> do {
> - owner = *p;
> - } while (cmpxchg_relaxed(p, owner,
> + owner = READ_ONCE(*p);
> + } while (cmpxchg_acquire(p, owner,
> owner | RT_MUTEX_HAS_WAITERS) != owner);
> }
>

Not 100% sure although I see it's to cover an exit path from
try_to_take_rt_mutex. I'm undecided if rt_mutex_set_owner having acquire
semantics and rt_mutex_clear_owner having clear semantics would be
sufficient. try_to_take_rt_mutex can still return with release semantics
but only in the case where it fails to acquire the lock.

> @@ -1218,7 +1228,7 @@ static void __sched mark_wakeup_next_waiter(struct rt_wake_q_head *wqh,
> * slow path making sure no task of lower priority than
> * the top waiter can steal this lock.
> */
> - lock->owner = (void *) RT_MUTEX_HAS_WAITERS;
> + smp_store_release(&lock->owner, (void *) RT_MUTEX_HAS_WAITERS);
>
> /*
> * We deboosted before waking the top waiter task such that we don't

This is within a locked section and would definitely see a barrier if
rt_mutex_wake_q_add_task calls wake_q_add but it's less clear if the
optimisation in rt_mutex_wake_q_add_task could race so I'm undecided if
it's necessary or not.

> diff --git a/kernel/locking/rtmutex_api.c b/kernel/locking/rtmutex_api.c
> index 900220941caac..9acc176f93d38 100644
> --- a/kernel/locking/rtmutex_api.c
> +++ b/kernel/locking/rtmutex_api.c
> @@ -249,7 +249,7 @@ void __sched rt_mutex_init_proxy_locked(struct rt_mutex_base *lock,
> * recursion. Give the futex/rtmutex wait_lock a separate key.
> */
> lockdep_set_class(&lock->wait_lock, &pi_futex_key);
> - rt_mutex_set_owner(lock, proxy_owner);
> + rt_mutex_set_owner_pi(lock, proxy_owner);
> }
>
> /**
> @@ -267,7 +267,7 @@ void __sched rt_mutex_init_proxy_locked(struct rt_mutex_base *lock,
> void __sched rt_mutex_proxy_unlock(struct rt_mutex_base *lock)
> {
> debug_rt_mutex_proxy_unlock(lock);
> - rt_mutex_set_owner(lock, NULL);
> + rt_mutex_set_owner_pi(lock, NULL);
> }
>
> /**
> --
> 2.38.1
>

--
Mel Gorman
SUSE Labs

2022-12-01 12:59:45