2022-07-08 16:40:22

by John Keeping

Subject: [PATCH v3] sched/core: Always flush pending blk_plug

With CONFIG_PREEMPT_RT, it is possible to hit a deadlock between two
normal priority tasks (SCHED_OTHER, nice level zero):

INFO: task kworker/u8:0:8 blocked for more than 491 seconds.
Not tainted 5.15.49-rt46 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u8:0 state:D stack: 0 pid: 8 ppid: 2 flags:0x00000000
Workqueue: writeback wb_workfn (flush-7:0)
[<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134)
[<c08a3d84>] (schedule) from [<c08a65a0>] (rt_mutex_slowlock_block.constprop.0+0xb8/0x174)
[<c08a65a0>] (rt_mutex_slowlock_block.constprop.0) from [<c08a6708>]
+(rt_mutex_slowlock.constprop.0+0xac/0x174)
[<c08a6708>] (rt_mutex_slowlock.constprop.0) from [<c0374d60>] (fat_write_inode+0x34/0x54)
[<c0374d60>] (fat_write_inode) from [<c0297304>] (__writeback_single_inode+0x354/0x3ec)
[<c0297304>] (__writeback_single_inode) from [<c0297998>] (writeback_sb_inodes+0x250/0x45c)
[<c0297998>] (writeback_sb_inodes) from [<c0297c20>] (__writeback_inodes_wb+0x7c/0xb8)
[<c0297c20>] (__writeback_inodes_wb) from [<c0297f24>] (wb_writeback+0x2c8/0x2e4)
[<c0297f24>] (wb_writeback) from [<c0298c40>] (wb_workfn+0x1a4/0x3e4)
[<c0298c40>] (wb_workfn) from [<c0138ab8>] (process_one_work+0x1fc/0x32c)
[<c0138ab8>] (process_one_work) from [<c0139120>] (worker_thread+0x22c/0x2d8)
[<c0139120>] (worker_thread) from [<c013e6e0>] (kthread+0x16c/0x178)
[<c013e6e0>] (kthread) from [<c01000fc>] (ret_from_fork+0x14/0x38)
Exception stack(0xc10e3fb0 to 0xc10e3ff8)
3fa0: 00000000 00000000 00000000 00000000
3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
3fe0: 00000000 00000000 00000000 00000000 00000013 00000000

INFO: task tar:2083 blocked for more than 491 seconds.
Not tainted 5.15.49-rt46 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:tar state:D stack: 0 pid: 2083 ppid: 2082 flags:0x00000000
[<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134)
[<c08a3d84>] (schedule) from [<c08a41b0>] (io_schedule+0x14/0x24)
[<c08a41b0>] (io_schedule) from [<c08a455c>] (bit_wait_io+0xc/0x30)
[<c08a455c>] (bit_wait_io) from [<c08a441c>] (__wait_on_bit_lock+0x54/0xa8)
[<c08a441c>] (__wait_on_bit_lock) from [<c08a44f4>] (out_of_line_wait_on_bit_lock+0x84/0xb0)
[<c08a44f4>] (out_of_line_wait_on_bit_lock) from [<c0371fb0>] (fat_mirror_bhs+0xa0/0x144)
[<c0371fb0>] (fat_mirror_bhs) from [<c0372a68>] (fat_alloc_clusters+0x138/0x2a4)
[<c0372a68>] (fat_alloc_clusters) from [<c0370b14>] (fat_alloc_new_dir+0x34/0x250)
[<c0370b14>] (fat_alloc_new_dir) from [<c03787c0>] (vfat_mkdir+0x58/0x148)
[<c03787c0>] (vfat_mkdir) from [<c0277b60>] (vfs_mkdir+0x68/0x98)
[<c0277b60>] (vfs_mkdir) from [<c027b484>] (do_mkdirat+0xb0/0xec)
[<c027b484>] (do_mkdirat) from [<c0100060>] (ret_fast_syscall+0x0/0x1c)
Exception stack(0xc2e1bfa8 to 0xc2e1bff0)
bfa0: 01ee42f0 01ee4208 01ee42f0 000041ed 00000000 00004000
bfc0: 01ee42f0 01ee4208 00000000 00000027 01ee4302 00000004 000dcb00 01ee4190
bfe0: 000dc368 bed11924 0006d4b0 b6ebddfc

Here the kworker is waiting on msdos_sb_info::s_lock which is held by
tar which is in turn waiting for a buffer which is locked waiting to be
flushed, but this operation is plugged in the kworker.

The lock is a normal struct mutex, so tsk_is_pi_blocked() will always
return false on !RT and thus the behaviour changes for RT.

It seems that the intent here is to skip blk_flush_plug() in the case
where a non-preemptible lock (such as a spinlock) has been converted to
a rtmutex on RT, which is the case covered by the SM_RTLOCK_WAIT
schedule flag. But sched_submit_work() is only called from schedule()
which is never called in this scenario, so the check can simply be
deleted.
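The two entry points can be sketched roughly as follows (simplified from
the kernel around this era, with details elided):

```c
/*
 * schedule()                      rtlock slowpath (RT spinlock/rwlock)
 *   sched_submit_work()            schedule_rtlock()
 *     blk_flush_plug()               __schedule(SM_RTLOCK_WAIT)
 *   __schedule(SM_NONE)                 /* never runs sched_submit_work(),
 *                                        * so no plug flush on this path */
 */
```

Blocking on an RT-converted spinlock goes through schedule_rtlock() and
bypasses sched_submit_work() entirely, so a check for that case inside
sched_submit_work() can never fire.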

Looking at the history of the -rt patchset, in fact this change was
present from v5.9.1-rt20 until being dropped in v5.13-rt1 as it was part
of a larger patch [1] most of which was replaced by commit b4bfa3fcfe3b
("sched/core: Rework the __schedule() preempt argument").

As described in [1]:

The schedule process must distinguish between blocking on a regular
sleeping lock (rwsem and mutex) and a RT-only sleeping lock (spinlock
and rwlock):
- rwsem and mutex must flush block requests (blk_schedule_flush_plug())
even if blocked on a lock. This can not deadlock because this also
happens for non-RT.
There should be a warning if the scheduling point is within a RCU read
section.

- spinlock and rwlock must not flush block requests. This will deadlock
if the callback attempts to acquire a lock which is already acquired.
Similarly to being preempted, there should be no warning if the
scheduling point is within a RCU read section.

and with the tsk_is_pi_blocked() in the scheduler path, we hit the first
issue.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0022-locking-rtmutex-Use-custom-scheduling-function-for-s.patch?h=linux-5.10.y-rt-patches

Cc: Sebastian Andrzej Siewior <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Reviewed-by: Steven Rostedt (Google) <[email protected]>
Signed-off-by: John Keeping <[email protected]>
---
v3:
- Add SCHED_WARN_ON(current->__state & TASK_RTLOCK_WAIT) as suggested by
Peter
v2:
- Add Steven's R-b and update the commit message with his suggested
quote from [1]

include/linux/sched/rt.h | 8 --------
kernel/sched/core.c | 8 ++++++--
2 files changed, 6 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
index e5af028c08b49..994c25640e156 100644
--- a/include/linux/sched/rt.h
+++ b/include/linux/sched/rt.h
@@ -39,20 +39,12 @@ static inline struct task_struct *rt_mutex_get_top_task(struct task_struct *p)
}
extern void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task);
extern void rt_mutex_adjust_pi(struct task_struct *p);
-static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
-{
- return tsk->pi_blocked_on != NULL;
-}
#else
static inline struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
{
return NULL;
}
# define rt_mutex_adjust_pi(p) do { } while (0)
-static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
-{
- return false;
-}
#endif

extern void normalize_rt_tasks(void);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1d4660a1915b3..71d6385ece83f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6578,8 +6578,12 @@ static inline void sched_submit_work(struct task_struct *tsk)
io_wq_worker_sleeping(tsk);
}

- if (tsk_is_pi_blocked(tsk))
- return;
+ /*
+ * spinlock and rwlock must not flush block requests. This will
+ * deadlock if the callback attempts to acquire a lock which is
+ * already acquired.
+ */
+ SCHED_WARN_ON(current->__state & TASK_RTLOCK_WAIT);

/*
* If we are going to sleep and we have plugged IO queued,
--
2.37.0


2022-07-09 16:22:51

by Paul Gortmaker

Subject: Re: [PATCH v3] sched/core: Always flush pending blk_plug

[[PATCH v3] sched/core: Always flush pending blk_plug] On 08/07/2022 (Fri 17:27) John Keeping wrote:

> With CONFIG_PREEMPT_RT, it is possible to hit a deadlock between two
> normal priority tasks (SCHED_OTHER, nice level zero):
>
> INFO: task kworker/u8:0:8 blocked for more than 491 seconds.
> Not tainted 5.15.49-rt46 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:kworker/u8:0 state:D stack: 0 pid: 8 ppid: 2 flags:0x00000000
> Workqueue: writeback wb_workfn (flush-7:0)

This is great. I was on-and-off looking at an ext4 deadlock triggered
by LTP's "ftest01" - and was just putting my notes together this morning
to clarify for myself what I thought was going on before I forgot.

Which is good, because I could then recognize the very similar patterns
in what you saw to what I'd observed/deduced. Otherwise your backtraces
probably wouldn't have grabbed my attention at all.

Long story short, I applied this to v5.19-rc3-rt5 and it resolves the
LTP/ext4 deadlock issue. So...

Tested-by: Paul Gortmaker <[email protected]>

As I'd already put together my ext4 debug info, I'll include it below
in case it helps future people searching for LTP and ftest01.

Thanks,
Paul.

-----------------


There is a reproducible deadlock in ext4 on preempt-rt triggered by LTP
"ftest01".

Reproducer details are at the bottom. What seems to happen is as follows:

The kworker doing writeback wb_workfn (pid 213 below) stalls on the
one and only down_write in ext4_map_blocks() and never moves again.

/*
* New blocks allocate and/or writing to unwritten extent
* will possibly result in updating i_data, so we take
* the write lock of i_data_sem, and call get_block()
* with create == 1 flag.
*/
down_write(&EXT4_I(inode)->i_data_sem);

So who is blocking the kworker? There are two of the ftest01 processes
(pid 818 and pid 821) doing ext4 truncate which look like this:

ext4_truncate+0x46a/0x660
--> ext4_ext_truncate+0xd6/0xf0
--> ext4_ext_remove_space+0x59f/0x1830
--> jbd2_journal_get_write_access+0x51/0x80
--> do_get_write_access+0x2ad/0x550
--> complete_all+0x60/0x60
--> bit_wait_io+0x11/0x60

They too are stalled in D state waiting there forever - presumably
waiting on the kworker to make the progress they are waiting for.

Here is where I think lockdep shows us the problem:

5 locks held by ftest01/818:
[...]
#4: ffff957c2a679958 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660

5 locks held by ftest01/821:
[...]
#4: ffff957c2a67bb78 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660

Yes - that is the one the kworker was trying the down_write() on.

Also note that the bit_wait is a big telltale sign here - there is only
one in the jbd2's do_get_write_access() function:

/*
* There is one case we have to be very careful about. If the
* committing transaction is currently writing this buffer out to disk
* and has NOT made a copy-out, then we cannot modify the buffer
* contents at all right now. The essence of copy-out is that it is
* the extra copy, not the primary copy, which gets journaled. If the
* primary copy is already going to disk then we cannot do copy-out
* here.
*/
if (buffer_shadow(bh)) {
JBUFFER_TRACE(jh, "on shadow: sleep");
spin_unlock(&jh->b_state_lock);
wait_on_bit_io(&bh->b_state, BH_Shadow, TASK_UNINTERRUPTIBLE);
goto repeat;
}

Presumably this is a corner case that preempt-rt makes easier to trigger?

The full trace with interspersed lockdep info follows.

Various detailed reproducer information follows that.

-----------------------

sysrq-w backtrace information interspersed with sysrq-d lockdep info
on 5.19-rc2-rt3 kernel:

task:kworker/u129:34 state:D stack: 0 pid: 213 ppid: 2 flags:0x00004000
Workqueue: writeback wb_workfn (flush-7:0)
Call Trace:
<TASK>
__schedule+0x329/0xae0
? __this_cpu_preempt_check+0x13/0x20
? __rt_mutex_slowlock.isra.21.constprop.25+0xc0e/0xc40
schedule+0x48/0xc0
__rt_mutex_slowlock.isra.21.constprop.25+0x1b8/0xc40
rwbase_write_lock+0x98/0x6e0
down_write+0x46/0x50
? ext4_map_blocks+0x156/0x5c0
ext4_map_blocks+0x156/0x5c0
ext4_writepages+0xa1a/0x1290
? __this_cpu_preempt_check+0x13/0x20
? lockdep_hardirqs_on+0xcd/0x150
do_writepages+0xd2/0x1a0
__writeback_single_inode+0x64/0x850
writeback_sb_inodes+0x22d/0x5b0
__writeback_inodes_wb+0x67/0xe0
wb_writeback+0x269/0x5f0
? get_nr_inodes+0x49/0x70
wb_workfn+0x43d/0x780
? __this_cpu_preempt_check+0x13/0x20
process_one_work+0x286/0x710
worker_thread+0x3c/0x3d0
? process_one_work+0x710/0x710
kthread+0x13b/0x150
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30
</TASK>

6 locks held by kworker/u129:34/213:
#0: ffff957c25e9ef28 ((wq_completion)writeback){+.+.}-{0:0}, at: process_one_work+0x201/0x710
#1: ffffb7e9c3797e68 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: process_one_work+0x201/0x710
#2: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: trylock_super+0x1b/0x50
#3: ffff957b8a9b2d18 (&sbi->s_writepages_rwsem){.+.+}-{0:0}, at: do_writepages+0xd2/0x1a0
#4: ffff957b9c9b5be8 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0x13e/0x5f0
#5: ffff957c2a679958 (&ei->i_data_sem){++++}-{3:3}, at: ext4_map_blocks+0x156/0x5c0


task:jbd2/loop0-8 state:D stack: 0 pid: 679 ppid: 2 flags:0x00004000
Call Trace:
<TASK>
__schedule+0x329/0xae0
schedule+0x48/0xc0
jbd2_journal_wait_updates+0x7c/0xf0
? schedstat_stop+0x10/0x10
jbd2_journal_commit_transaction+0x284/0x20e0
? __this_cpu_preempt_check+0x13/0x20
? find_held_lock+0x35/0xa0
? _raw_spin_unlock_irqrestore+0x86/0x8f
? _raw_spin_unlock_irqrestore+0x86/0x8f
? try_to_del_timer_sync+0x53/0x80
? trace_hardirqs_on+0x3b/0x120
? try_to_del_timer_sync+0x53/0x80
kjournald2+0xd5/0x2b0
? schedstat_stop+0x10/0x10
? commit_timeout+0x20/0x20
kthread+0x13b/0x150
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30
</TASK>


task:ftest01 state:D stack: 0 pid: 817 ppid: 816 flags:0x00004002
Call Trace:
<TASK>
__schedule+0x329/0xae0
? __x64_sys_tee+0xd0/0xd0
? __x64_sys_tee+0xd0/0xd0
schedule+0x48/0xc0
wb_wait_for_completion+0x62/0x90
? schedstat_stop+0x10/0x10
sync_inodes_sb+0xdd/0x460
? __this_cpu_preempt_check+0x13/0x20
? __x64_sys_tee+0xd0/0xd0
sync_inodes_one_sb+0x15/0x20
iterate_supers+0x94/0x100
ksys_sync+0x42/0xa0
__do_sys_sync+0xe/0x20
do_syscall_64+0x3f/0x90
entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7f3f9dae2cdb
RSP: 002b:00007ffe06430318 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f3f9dae2cdb
RDX: 0000000000000040 RSI: 0000561f57de52f0 RDI: 0000561f57de5340
RBP: 00000000000ff000 R08: 0000000000000000 R09: 00007f3f9dbce0a0
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000207 R14: 0000000000000049 R15: 0000000000000003
</TASK>

1 lock held by ftest01/817:
#0: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: iterate_supers+0x7d/0x100


task:ftest01 state:D stack: 0 pid: 818 ppid: 816 flags:0x00004002
Call Trace:
<TASK>
__schedule+0x329/0xae0
? bit_wait+0x60/0x60
schedule+0x48/0xc0
io_schedule+0x16/0x40
bit_wait_io+0x11/0x60
__wait_on_bit_lock+0x5e/0xb0
out_of_line_wait_on_bit_lock+0x93/0xb0
? complete_all+0x60/0x60
__lock_buffer+0x3f/0x50
do_get_write_access+0x2ad/0x550
jbd2_journal_get_write_access+0x51/0x80
__ext4_journal_get_write_access+0xf9/0x1a0
ext4_ext_get_access.isra.32+0x34/0x50
ext4_ext_remove_space+0x59f/0x1830
? ext4_ext_truncate+0xa2/0xf0
? ext4_ext_truncate+0x5e/0xf0
ext4_ext_truncate+0xd6/0xf0
ext4_truncate+0x46a/0x660
ext4_setattr+0x6db/0xa40
notify_change+0x3e5/0x540
do_truncate+0x7b/0xc0
? do_truncate+0x7b/0xc0
do_sys_ftruncate+0x164/0x2f0
__x64_sys_ftruncate+0x1b/0x20
do_syscall_64+0x3f/0x90
entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7f3f9dae3f7b
RSP: 002b:00007ffe06430318 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
RAX: ffffffffffffffda RBX: 00000000000001a5 RCX: 00007f3f9dae3f7b
RDX: 00000000000001a5 RSI: 00000000000d2800 RDI: 0000000000000003
RBP: 00000000000001a5 R08: 00007f3f9dbce090 R09: 00007f3f9dbce0a0
R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000000
R13: 0000000000000207 R14: 0000000000000043 R15: 0000000000000003
</TASK>

5 locks held by ftest01/818:
#0: ffff957b9c9b74e8 (sb_writers#3){.+.+}-{0:0}, at: __x64_sys_ftruncate+0x1b/0x20
#1: ffff957c2a679b00 (&sb->s_type->i_mutex_key#10){+.+.}-{3:3}, at: do_truncate+0x6a/0xc0
#2: ffff957c2a679cc0 (mapping.invalidate_lock){++++}-{3:3}, at: ext4_setattr+0x39a/0xa40
#3: ffff957b9c9b5be8 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0x13e/0x5f0
#4: ffff957c2a679958 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660


task:ftest01 state:D stack: 0 pid: 819 ppid: 816 flags:0x00004002
Call Trace:
<TASK>
__schedule+0x329/0xae0
? __x64_sys_tee+0xd0/0xd0
? __x64_sys_tee+0xd0/0xd0
schedule+0x48/0xc0
wb_wait_for_completion+0x62/0x90
? schedstat_stop+0x10/0x10
sync_inodes_sb+0xdd/0x460
? __this_cpu_preempt_check+0x13/0x20
? __x64_sys_tee+0xd0/0xd0
sync_inodes_one_sb+0x15/0x20
iterate_supers+0x94/0x100
ksys_sync+0x42/0xa0
__do_sys_sync+0xe/0x20
do_syscall_64+0x3f/0x90
entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7f3f9dae2cdb
RSP: 002b:00007ffe06430318 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f3f9dae2cdb
RDX: 0000000000000040 RSI: 0000561f57de52f0 RDI: 0000561f57de5340
RBP: 00000000000c2800 R08: 00000000fffff800 R09: 0000000000004000
R10: 0000000000000010 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000207 R14: 0000000000000075 R15: 0000000000000003
</TASK>

1 lock held by ftest01/819:
#0: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: iterate_supers+0x7d/0x100


task:ftest01 state:D stack: 0 pid: 820 ppid: 816 flags:0x00004002
Call Trace:
<TASK>
__schedule+0x329/0xae0
? __x64_sys_tee+0xd0/0xd0
? __x64_sys_tee+0xd0/0xd0
schedule+0x48/0xc0
wb_wait_for_completion+0x62/0x90
? schedstat_stop+0x10/0x10
sync_inodes_sb+0xdd/0x460
? __this_cpu_preempt_check+0x13/0x20
? __x64_sys_tee+0xd0/0xd0
sync_inodes_one_sb+0x15/0x20
iterate_supers+0x94/0x100
ksys_sync+0x42/0xa0
__do_sys_sync+0xe/0x20
do_syscall_64+0x3f/0x90
entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7f3f9dae2cdb
RSP: 002b:00007ffe06430318 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f3f9dae2cdb
RDX: 0000000000000040 RSI: 0000561f57de52f0 RDI: 0000561f57de5340
RBP: 000000000006a800 R08: 00000000fffff800 R09: 0000000000004000
R10: 0000000000000010 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000207 R14: 0000000000000069 R15: 0000000000000003
</TASK>

1 lock held by ftest01/820:
#0: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: iterate_supers+0x7d/0x100


task:ftest01 state:D stack: 0 pid: 821 ppid: 816 flags:0x00000002
Call Trace:
<TASK>
__schedule+0x329/0xae0
? bit_wait+0x60/0x60
schedule+0x48/0xc0
io_schedule+0x16/0x40
bit_wait_io+0x11/0x60
__wait_on_bit_lock+0x5e/0xb0
out_of_line_wait_on_bit_lock+0x93/0xb0
? complete_all+0x60/0x60
__lock_buffer+0x3f/0x50
do_get_write_access+0x2ad/0x550
jbd2_journal_get_write_access+0x51/0x80
__ext4_journal_get_write_access+0xf9/0x1a0
ext4_ext_get_access.isra.32+0x34/0x50
ext4_ext_remove_space+0x59f/0x1830
? ext4_ext_truncate+0xa2/0xf0
? ext4_ext_truncate+0x5e/0xf0
ext4_ext_truncate+0xd6/0xf0
ext4_truncate+0x46a/0x660
ext4_setattr+0x6db/0xa40
notify_change+0x3e5/0x540
do_truncate+0x7b/0xc0
? do_truncate+0x7b/0xc0
do_sys_ftruncate+0x164/0x2f0
__x64_sys_ftruncate+0x1b/0x20
do_syscall_64+0x3f/0x90
entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7f3f9dae3f7b
RSP: 002b:00007ffe06430318 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
RAX: ffffffffffffffda RBX: 000000000000016a RCX: 00007f3f9dae3f7b
RDX: 000000000000016a RSI: 00000000000b5000 RDI: 0000000000000003
RBP: 000000000000016a R08: 00007f3f9dbce080 R09: 00007f3f9dbce0a0
R10: 0000000000000002 R11: 0000000000000202 R12: 0000000000000000
R13: 0000000000000207 R14: 0000000000000040 R15: 0000000000000003
</TASK>

5 locks held by ftest01/821:
#0: ffff957b9c9b74e8 (sb_writers#3){.+.+}-{0:0}, at: __x64_sys_ftruncate+0x1b/0x20
#1: ffff957c2a67bd20 (&sb->s_type->i_mutex_key#10){+.+.}-{3:3}, at: do_truncate+0x6a/0xc0
#2: ffff957c2a67bee0 (mapping.invalidate_lock){++++}-{3:3}, at: ext4_setattr+0x39a/0xa40
#3: ffff957b9c9b5be8 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0x13e/0x5f0
#4: ffff957c2a67bb78 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660

--------------------------------------

Reproducer:

Get latest preempt-rt kernel:
git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git
checkout v5.19-rc3-rt5 (or linux-5.19.y-rt branch)
make defconfig
manually enable CONFIG_EXPERT and then CONFIG_PREEMPT_RT
make oldconfig and take defaults for other options hidden behind EXPERT
can optionally enable LOCKDEP options so sysrq-d works.

Get LTP ; build + install
https://github.com/linux-test-project/ltp.git

Boot and then run reproducer:
dd if=/dev/zero of=/root/tmp_mount_file bs=1M count=512
mkfs.ext4 -F /root/tmp_mount_file
mount /root/tmp_mount_file /mnt/
/opt/ltp/runltp -f fs -s ftest01 -d /mnt/
umount /mnt

The loop device isn't strictly necessary, but it contains the fs fallout
when the inevitable reboot is required to get rid of D state processes.

Test completes on v5.10-rt, v5.12-rt - should take about 2-3m total.
Test hangs forever on v5.15-rt and latest v5.19-rc-rt. Test threads
in D state - any other i/o directed at /mnt also hangs in D state.

2022-07-12 19:40:20

by Paul Gortmaker

Subject: Re: [PATCH v3] sched/core: Always flush pending blk_plug

[Re: [PATCH v3] sched/core: Always flush pending blk_plug] On 10/07/2022 (Sun 09:01) Hillf Danton wrote:

> Looks like a ABBA deadlock that should be fed to lockdep, given what was
> blocking flusher (pid 213).

[...]

> Make lock_buffer have mutex_lock() semantics in attempt to catch the
> deadlock above.
>
> Only for thoughts now.

Thanks for the test patch - I'm keeping it as a reference for future
use.

So, preempt-rt issues an early lockdep splat at partition probe bootup,
both with and without the original v3 patch from this thread.

Of course then I figured I'd try the debug patch on vanilla v5.19-rc3
and we get pretty much the same lockdep complaint.


sd 1:0:0:0: Attached scsi generic sg1 type 0
sd 1:0:0:0: [sdb] Preferred minimum I/O size 512 bytes
scsi 2:0:0:0: CD-ROM HL-DT-ST DVD-ROM DH30N A101 PQ: 0 ANSI: 5

=====================================
WARNING: bad unlock balance detected!
5.19.0-rc3-dirty #2 Not tainted
-------------------------------------
swapper/2/0 is trying to release lock (buffer_head_lock) at:
[<ffffffffbc494a6b>] end_buffer_async_read+0x5b/0x180
but there are no more locks to release!

other info that might help us debug this:
1 lock held by swapper/2/0:
#0: ffff8bee27744080 (&ret->b_uptodate_lock){....}-{2:2}, at: end_buffer_async_read+0x47/0x180

stack backtrace:
CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.19.0-rc3-dirty #2
Hardware name: Dell Inc. Precision WorkStation T5500 /0D883F, BIOS A16 05/28/2013
Call Trace:
<IRQ>
dump_stack_lvl+0x40/0x5c
lock_release+0x245/0x3f0
unlock_buffer+0x15/0x30
end_buffer_async_read+0x5b/0x180
end_bio_bh_io_sync+0x1e/0x40
blk_update_request+0x9a/0x470
scsi_end_request+0x27/0x190
scsi_io_completion+0x3c/0x580
blk_complete_reqs+0x39/0x50
__do_softirq+0x11d/0x344
irq_exit_rcu+0xa9/0xc0
common_interrupt+0xa5/0xc0
</IRQ>
<TASK>
asm_common_interrupt+0x27/0x40
RIP: 0010:cpuidle_enter_state+0x12d/0x3f0
Code: 49 43 0f 84 b7 02 00 00 31 ff e8 fe 1c 74 ff 45 84 ff 74 12 9c 58 f6 c4 02 0f 85 8e 02 00 00 31 ff e8 97 95 7a ff fb 45 85 f6 <0f> 88 12 01 00 00 49 63 d6 4c 2b 24 24 48 8d 04 52 48 8d 04 82 49
RSP: 0000:ffffa2890013fe90 EFLAGS: 00000206
RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000002
RDX: 0000000080000002 RSI: ffffffffbd78d911 RDI: 00000000ffffffff
RBP: ffff8bee2739a400 R08: 0000000000000000 R09: 00000000001faf40
R10: ffffa2890013fdc8 R11: 0000000000000000 R12: 00000000a7b149d6
R13: ffffffffbdbee3a0 R14: 0000000000000003 R15: 0000000000000001
cpuidle_enter+0x24/0x40
do_idle+0x1e3/0x230
cpu_startup_entry+0x14/0x20
start_secondary+0xe8/0xf0
secondary_startup_64_no_verify+0xe0/0xeb
</TASK>
sda: sda1 sda2 sda3
sd 0:0:0:0: [sda] Attached SCSI disk
sdb: sdb1 sdb2
sd 1:0:0:0: [sdb] Attached SCSI disk

Not quite sure what to make of that.

Paul.
--

Subject: [tip: sched/core] sched/core: Always flush pending blk_plug

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 401e4963bf45c800e3e9ea0d3a0289d738005fd4
Gitweb: https://git.kernel.org/tip/401e4963bf45c800e3e9ea0d3a0289d738005fd4
Author: John Keeping <[email protected]>
AuthorDate: Fri, 08 Jul 2022 17:27:02 +01:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Wed, 13 Jul 2022 11:29:17 +02:00

sched/core: Always flush pending blk_plug

With CONFIG_PREEMPT_RT, it is possible to hit a deadlock between two
normal priority tasks (SCHED_OTHER, nice level zero):

INFO: task kworker/u8:0:8 blocked for more than 491 seconds.
Not tainted 5.15.49-rt46 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u8:0 state:D stack: 0 pid: 8 ppid: 2 flags:0x00000000
Workqueue: writeback wb_workfn (flush-7:0)
[<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134)
[<c08a3d84>] (schedule) from [<c08a65a0>] (rt_mutex_slowlock_block.constprop.0+0xb8/0x174)
[<c08a65a0>] (rt_mutex_slowlock_block.constprop.0) from [<c08a6708>]
+(rt_mutex_slowlock.constprop.0+0xac/0x174)
[<c08a6708>] (rt_mutex_slowlock.constprop.0) from [<c0374d60>] (fat_write_inode+0x34/0x54)
[<c0374d60>] (fat_write_inode) from [<c0297304>] (__writeback_single_inode+0x354/0x3ec)
[<c0297304>] (__writeback_single_inode) from [<c0297998>] (writeback_sb_inodes+0x250/0x45c)
[<c0297998>] (writeback_sb_inodes) from [<c0297c20>] (__writeback_inodes_wb+0x7c/0xb8)
[<c0297c20>] (__writeback_inodes_wb) from [<c0297f24>] (wb_writeback+0x2c8/0x2e4)
[<c0297f24>] (wb_writeback) from [<c0298c40>] (wb_workfn+0x1a4/0x3e4)
[<c0298c40>] (wb_workfn) from [<c0138ab8>] (process_one_work+0x1fc/0x32c)
[<c0138ab8>] (process_one_work) from [<c0139120>] (worker_thread+0x22c/0x2d8)
[<c0139120>] (worker_thread) from [<c013e6e0>] (kthread+0x16c/0x178)
[<c013e6e0>] (kthread) from [<c01000fc>] (ret_from_fork+0x14/0x38)
Exception stack(0xc10e3fb0 to 0xc10e3ff8)
3fa0: 00000000 00000000 00000000 00000000
3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
3fe0: 00000000 00000000 00000000 00000000 00000013 00000000

INFO: task tar:2083 blocked for more than 491 seconds.
Not tainted 5.15.49-rt46 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:tar state:D stack: 0 pid: 2083 ppid: 2082 flags:0x00000000
[<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134)
[<c08a3d84>] (schedule) from [<c08a41b0>] (io_schedule+0x14/0x24)
[<c08a41b0>] (io_schedule) from [<c08a455c>] (bit_wait_io+0xc/0x30)
[<c08a455c>] (bit_wait_io) from [<c08a441c>] (__wait_on_bit_lock+0x54/0xa8)
[<c08a441c>] (__wait_on_bit_lock) from [<c08a44f4>] (out_of_line_wait_on_bit_lock+0x84/0xb0)
[<c08a44f4>] (out_of_line_wait_on_bit_lock) from [<c0371fb0>] (fat_mirror_bhs+0xa0/0x144)
[<c0371fb0>] (fat_mirror_bhs) from [<c0372a68>] (fat_alloc_clusters+0x138/0x2a4)
[<c0372a68>] (fat_alloc_clusters) from [<c0370b14>] (fat_alloc_new_dir+0x34/0x250)
[<c0370b14>] (fat_alloc_new_dir) from [<c03787c0>] (vfat_mkdir+0x58/0x148)
[<c03787c0>] (vfat_mkdir) from [<c0277b60>] (vfs_mkdir+0x68/0x98)
[<c0277b60>] (vfs_mkdir) from [<c027b484>] (do_mkdirat+0xb0/0xec)
[<c027b484>] (do_mkdirat) from [<c0100060>] (ret_fast_syscall+0x0/0x1c)
Exception stack(0xc2e1bfa8 to 0xc2e1bff0)
bfa0: 01ee42f0 01ee4208 01ee42f0 000041ed 00000000 00004000
bfc0: 01ee42f0 01ee4208 00000000 00000027 01ee4302 00000004 000dcb00 01ee4190
bfe0: 000dc368 bed11924 0006d4b0 b6ebddfc

Here the kworker is waiting on msdos_sb_info::s_lock which is held by
tar which is in turn waiting for a buffer which is locked waiting to be
flushed, but this operation is plugged in the kworker.

The lock is a normal struct mutex, so tsk_is_pi_blocked() will always
return false on !RT and thus the behaviour changes for RT.

It seems that the intent here is to skip blk_flush_plug() in the case
where a non-preemptible lock (such as a spinlock) has been converted to
a rtmutex on RT, which is the case covered by the SM_RTLOCK_WAIT
schedule flag. But sched_submit_work() is only called from schedule()
which is never called in this scenario, so the check can simply be
deleted.

Looking at the history of the -rt patchset, in fact this change was
present from v5.9.1-rt20 until being dropped in v5.13-rt1 as it was part
of a larger patch [1] most of which was replaced by commit b4bfa3fcfe3b
("sched/core: Rework the __schedule() preempt argument").

As described in [1]:

The schedule process must distinguish between blocking on a regular
sleeping lock (rwsem and mutex) and a RT-only sleeping lock (spinlock
and rwlock):
- rwsem and mutex must flush block requests (blk_schedule_flush_plug())
even if blocked on a lock. This can not deadlock because this also
happens for non-RT.
There should be a warning if the scheduling point is within a RCU read
section.

- spinlock and rwlock must not flush block requests. This will deadlock
if the callback attempts to acquire a lock which is already acquired.
Similarly to being preempted, there should be no warning if the
scheduling point is within a RCU read section.

and with the tsk_is_pi_blocked() in the scheduler path, we hit the first
issue.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0022-locking-rtmutex-Use-custom-scheduling-function-for-s.patch?h=linux-5.10.y-rt-patches

Signed-off-by: John Keeping <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Steven Rostedt (Google) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
include/linux/sched/rt.h | 8 --------
kernel/sched/core.c | 8 ++++++--
2 files changed, 6 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
index e5af028..994c256 100644
--- a/include/linux/sched/rt.h
+++ b/include/linux/sched/rt.h
@@ -39,20 +39,12 @@ static inline struct task_struct *rt_mutex_get_top_task(struct task_struct *p)
}
extern void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task);
extern void rt_mutex_adjust_pi(struct task_struct *p);
-static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
-{
- return tsk->pi_blocked_on != NULL;
-}
#else
static inline struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
{
return NULL;
}
# define rt_mutex_adjust_pi(p) do { } while (0)
-static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
-{
- return false;
-}
#endif

extern void normalize_rt_tasks(void);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c703d17..a463dbc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6470,8 +6470,12 @@ static inline void sched_submit_work(struct task_struct *tsk)
io_wq_worker_sleeping(tsk);
}

- if (tsk_is_pi_blocked(tsk))
- return;
+ /*
+ * spinlock and rwlock must not flush block requests. This will
+ * deadlock if the callback attempts to acquire a lock which is
+ * already acquired.
+ */
+ SCHED_WARN_ON(current->__state & TASK_RTLOCK_WAIT);

/*
* If we are going to sleep and we have plugged IO queued,

Subject: Re: [PATCH v3] sched/core: Always flush pending blk_plug

On 2022-07-08 17:27:02 [+0100], John Keeping wrote:
> Looking at the history of the -rt patchset, in fact this change was
> present from v5.9.1-rt20 until being dropped in v5.13-rt1 as it was part
> of a larger patch [1] most of which was replaced by commit b4bfa3fcfe3b
> ("sched/core: Rework the __schedule() preempt argument").

Thanks for spotting that. The plan was to get rid of this with the
introduction of schedule_rtlock() but it somehow remained resilient.

Sebastian