Dear Linux folks,
On a Dell PowerEdge T630/0NT78X, BIOS 2.8.0 05/23/2018 with Linux
5.4.57, we twice saw a self-detected stall on a CPU (October 27th, 2020,
January 18th, 2021).
Both times, the workqueue is `xfs-conv/md0 xfs_end_io`.
```
[ 0.000000] Linux version 5.4.57.mx64.340
([email protected]) (gcc version 7.5.0 (GCC)) #1 SMP Tue
Aug 11 13:20:33 CEST 2020
[…]
[48962.981257] rcu: INFO: rcu_sched self-detected stall on CPU
[48962.987511] rcu: 4-....: (20999 ticks this GP)
idle=fe6/1/0x4000000000000002 softirq=3630188/3630188 fqs=4696
[48962.998805] (t=21017 jiffies g=14529009 q=32263)
[48963.004074] Task dump for CPU 4:
[48963.007689] kworker/4:2 R running task 0 25587 2
0x80004008
[48963.015591] Workqueue: xfs-conv/md0 xfs_end_io
[48963.020570] Call Trace:
[48963.023311] <IRQ>
[48963.025560] sched_show_task+0x11e/0x150
[48963.029957] rcu_dump_cpu_stacks+0x70/0xa0
[48963.034545] rcu_sched_clock_irq+0x502/0x770
[48963.039322] ? tick_sched_do_timer+0x60/0x60
[48963.044106] update_process_times+0x24/0x60
[48963.048791] tick_sched_timer+0x37/0x70
[48963.053089] __hrtimer_run_queues+0x11f/0x2b0
[48963.057960] ? recalibrate_cpu_khz+0x10/0x10
[48963.062744] hrtimer_interrupt+0xe5/0x240
[48963.067235] smp_apic_timer_interrupt+0x6f/0x130
[48963.072407] apic_timer_interrupt+0xf/0x20
[48963.076994] </IRQ>
[48963.079347] RIP: 0010:_raw_spin_unlock_irqrestore+0xa/0x10
[48963.085491] Code: f3 90 83 e8 01 75 e8 65 8b 3d 42 0f 56 7e e8 ed ea
5e ff 48 29 e8 4c 39 e8 76 cf 80 0b 08 eb 8c 0f 1f 44 00 00 c6 07 00 56
9d <c3> 0f 1f 44 00 00 0f 1f 44 00 00 b8 00 fe ff ff f0 0f c1 07 56 9d
[48963.106524] RSP: 0018:ffffc9000738fd40 EFLAGS: 00000202 ORIG_RAX:
ffffffffffffff13
[48963.115003] RAX: ffffffff82407588 RBX: ffffffff82407580 RCX:
ffffffff82407588
[48963.122994] RDX: ffffffff82407588 RSI: 0000000000000202 RDI:
ffffffff82407580
[48963.130989] RBP: 0000000000000202 R08: ffffffff8203ea00 R09:
0000000000000001
[48963.138982] R10: ffffc9000738fbb8 R11: 0000000000000001 R12:
ffffffff82407588
[48963.146976] R13: ffffea005e7ae600 R14: ffff8897b7e5a040 R15:
ffffea005e7ae600
[48963.154971] wake_up_page_bit+0xe0/0x100
[48963.159366] xfs_destroy_ioend+0xce/0x1c0
[48963.163857] xfs_end_ioend+0xcf/0x1a0
[48963.167958] xfs_end_io+0xa4/0xd0
[48963.171672] process_one_work+0x1e5/0x410
[48963.176163] worker_thread+0x2d/0x3c0
[48963.180265] ? cancel_delayed_work+0x90/0x90
[48963.185048] kthread+0x117/0x130
[48963.188663] ? kthread_create_worker_on_cpu+0x70/0x70
[48963.194321] ret_from_fork+0x35/0x40
```
As it’s just log level INFO, is there anything what should be done, or
was the system probably just “overloaded”?
Kind regards,
Paul
On Fri, Feb 05, 2021 at 08:29:06AM +0100, Paul Menzel wrote:
> Dear Linux folks,
>
>
> On a Dell PowerEdge T630/0NT78X, BIOS 2.8.0 05/23/2018 with Linux 5.4.57, we
> twice saw a self-detected stall on a CPU (October 27th, 2020, January 18th,
> 2021).
>
> Both times, the workqueue is `xfs-conv/md0 xfs_end_io`.
>
> ```
> [ 0.000000] Linux version 5.4.57.mx64.340
> ([email protected]) (gcc version 7.5.0 (GCC)) #1 SMP Tue Aug 11
> 13:20:33 CEST 2020
> […]
> [48962.981257] rcu: INFO: rcu_sched self-detected stall on CPU
> [48962.987511] rcu: 4-....: (20999 ticks this GP)
> idle=fe6/1/0x4000000000000002 softirq=3630188/3630188 fqs=4696
> [48962.998805] (t=21017 jiffies g=14529009 q=32263)
> [48963.004074] Task dump for CPU 4:
> [48963.007689] kworker/4:2 R running task 0 25587 2
> 0x80004008
> [48963.015591] Workqueue: xfs-conv/md0 xfs_end_io
> [48963.020570] Call Trace:
> [48963.023311] <IRQ>
> [48963.025560] sched_show_task+0x11e/0x150
> [48963.029957] rcu_dump_cpu_stacks+0x70/0xa0
> [48963.034545] rcu_sched_clock_irq+0x502/0x770
> [48963.039322] ? tick_sched_do_timer+0x60/0x60
> [48963.044106] update_process_times+0x24/0x60
> [48963.048791] tick_sched_timer+0x37/0x70
> [48963.053089] __hrtimer_run_queues+0x11f/0x2b0
> [48963.057960] ? recalibrate_cpu_khz+0x10/0x10
> [48963.062744] hrtimer_interrupt+0xe5/0x240
> [48963.067235] smp_apic_timer_interrupt+0x6f/0x130
> [48963.072407] apic_timer_interrupt+0xf/0x20
> [48963.076994] </IRQ>
> [48963.079347] RIP: 0010:_raw_spin_unlock_irqrestore+0xa/0x10
> [48963.085491] Code: f3 90 83 e8 01 75 e8 65 8b 3d 42 0f 56 7e e8 ed ea 5e
> ff 48 29 e8 4c 39 e8 76 cf 80 0b 08 eb 8c 0f 1f 44 00 00 c6 07 00 56 9d <c3>
> 0f 1f 44 00 00 0f 1f 44 00 00 b8 00 fe ff ff f0 0f c1 07 56 9d
> [48963.106524] RSP: 0018:ffffc9000738fd40 EFLAGS: 00000202 ORIG_RAX:
> ffffffffffffff13
> [48963.115003] RAX: ffffffff82407588 RBX: ffffffff82407580 RCX:
> ffffffff82407588
> [48963.122994] RDX: ffffffff82407588 RSI: 0000000000000202 RDI:
> ffffffff82407580
> [48963.130989] RBP: 0000000000000202 R08: ffffffff8203ea00 R09:
> 0000000000000001
> [48963.138982] R10: ffffc9000738fbb8 R11: 0000000000000001 R12:
> ffffffff82407588
> [48963.146976] R13: ffffea005e7ae600 R14: ffff8897b7e5a040 R15:
> ffffea005e7ae600
> [48963.154971] wake_up_page_bit+0xe0/0x100
> [48963.159366] xfs_destroy_ioend+0xce/0x1c0
> [48963.163857] xfs_end_ioend+0xcf/0x1a0
> [48963.167958] xfs_end_io+0xa4/0xd0
> [48963.171672] process_one_work+0x1e5/0x410
> [48963.176163] worker_thread+0x2d/0x3c0
> [48963.180265] ? cancel_delayed_work+0x90/0x90
> [48963.185048] kthread+0x117/0x130
> [48963.188663] ? kthread_create_worker_on_cpu+0x70/0x70
> [48963.194321] ret_from_fork+0x35/0x40
> ```
>
> As it’s just log level INFO, is there anything what should be done, or was
> the system probably just “overloaded”?
I am assuming that you are building your kernel with CONFIG_PREEMPT_NONE=y
rather than CONFIG_PREEMPT_VOLUNTARY=y.
If so, and if the problem is that you are temporarily overdriving xfs I/O,
one approach would be as follows:
------------------------------------------------------------------------
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index f16d5f1..06be426 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -390,6 +390,7 @@ xfs_end_io(
list_del_init(&ioend->io_list);
xfs_ioend_try_merge(ioend, &completion_list);
xfs_end_ioend(ioend);
+ cond_resched();
}
}
------------------------------------------------------------------------
If you have instead built with CONFIG_PREEMPT_VOLUNTARY=y, then your
problem is likely massive lock contention in wake_up_page_bit(), or
perhaps someone having failed to release that lock. The usual way to
work this out is by enabling lockdep (CONFIG_PROVE_LOCKING=y), but this
is often not what you want enabled in production.
Darrick, thoughts from an xfs perspective?
Thanx, Paul
On Fri, Feb 05, 2021 at 09:12:40AM -0800, Paul E. McKenney wrote:
> On Fri, Feb 05, 2021 at 08:29:06AM +0100, Paul Menzel wrote:
> > Dear Linux folks,
> >
> >
> > On a Dell PowerEdge T630/0NT78X, BIOS 2.8.0 05/23/2018 with Linux 5.4.57, we
> > twice saw a self-detected stall on a CPU (October 27th, 2020, January 18th,
> > 2021).
> >
> > Both times, the workqueue is `xfs-conv/md0 xfs_end_io`.
> >
> > ```
> > [ 0.000000] Linux version 5.4.57.mx64.340
> > ([email protected]) (gcc version 7.5.0 (GCC)) #1 SMP Tue Aug 11
> > 13:20:33 CEST 2020
> > […]
> > [48962.981257] rcu: INFO: rcu_sched self-detected stall on CPU
> > [48962.987511] rcu: 4-....: (20999 ticks this GP)
> > idle=fe6/1/0x4000000000000002 softirq=3630188/3630188 fqs=4696
> > [48962.998805] (t=21017 jiffies g=14529009 q=32263)
> > [48963.004074] Task dump for CPU 4:
> > [48963.007689] kworker/4:2 R running task 0 25587 2
> > 0x80004008
> > [48963.015591] Workqueue: xfs-conv/md0 xfs_end_io
> > [48963.020570] Call Trace:
> > [48963.023311] <IRQ>
> > [48963.025560] sched_show_task+0x11e/0x150
> > [48963.029957] rcu_dump_cpu_stacks+0x70/0xa0
> > [48963.034545] rcu_sched_clock_irq+0x502/0x770
> > [48963.039322] ? tick_sched_do_timer+0x60/0x60
> > [48963.044106] update_process_times+0x24/0x60
> > [48963.048791] tick_sched_timer+0x37/0x70
> > [48963.053089] __hrtimer_run_queues+0x11f/0x2b0
> > [48963.057960] ? recalibrate_cpu_khz+0x10/0x10
> > [48963.062744] hrtimer_interrupt+0xe5/0x240
> > [48963.067235] smp_apic_timer_interrupt+0x6f/0x130
> > [48963.072407] apic_timer_interrupt+0xf/0x20
> > [48963.076994] </IRQ>
> > [48963.079347] RIP: 0010:_raw_spin_unlock_irqrestore+0xa/0x10
> > [48963.085491] Code: f3 90 83 e8 01 75 e8 65 8b 3d 42 0f 56 7e e8 ed ea 5e
> > ff 48 29 e8 4c 39 e8 76 cf 80 0b 08 eb 8c 0f 1f 44 00 00 c6 07 00 56 9d <c3>
> > 0f 1f 44 00 00 0f 1f 44 00 00 b8 00 fe ff ff f0 0f c1 07 56 9d
> > [48963.106524] RSP: 0018:ffffc9000738fd40 EFLAGS: 00000202 ORIG_RAX:
> > ffffffffffffff13
> > [48963.115003] RAX: ffffffff82407588 RBX: ffffffff82407580 RCX:
> > ffffffff82407588
> > [48963.122994] RDX: ffffffff82407588 RSI: 0000000000000202 RDI:
> > ffffffff82407580
> > [48963.130989] RBP: 0000000000000202 R08: ffffffff8203ea00 R09:
> > 0000000000000001
> > [48963.138982] R10: ffffc9000738fbb8 R11: 0000000000000001 R12:
> > ffffffff82407588
> > [48963.146976] R13: ffffea005e7ae600 R14: ffff8897b7e5a040 R15:
> > ffffea005e7ae600
> > [48963.154971] wake_up_page_bit+0xe0/0x100
> > [48963.159366] xfs_destroy_ioend+0xce/0x1c0
> > [48963.163857] xfs_end_ioend+0xcf/0x1a0
> > [48963.167958] xfs_end_io+0xa4/0xd0
> > [48963.171672] process_one_work+0x1e5/0x410
> > [48963.176163] worker_thread+0x2d/0x3c0
> > [48963.180265] ? cancel_delayed_work+0x90/0x90
> > [48963.185048] kthread+0x117/0x130
> > [48963.188663] ? kthread_create_worker_on_cpu+0x70/0x70
> > [48963.194321] ret_from_fork+0x35/0x40
> > ```
> >
> > As it’s just log level INFO, is there anything what should be done, or was
> > the system probably just “overloaded”?
>
> I am assuming that you are building your kernel with CONFIG_PREEMPT_NONE=y
> rather than CONFIG_PREEMPT_VOLUNTARY=y.
>
> If so, and if the problem is that you are temporarily overdriving xfs I/O,
> one approach would be as follows:
>
> ------------------------------------------------------------------------
>
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index f16d5f1..06be426 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -390,6 +390,7 @@ xfs_end_io(
> list_del_init(&ioend->io_list);
> xfs_ioend_try_merge(ioend, &completion_list);
> xfs_end_ioend(ioend);
> + cond_resched();
> }
> }
>
> ------------------------------------------------------------------------
FWIW, this looks quite similar to the problem I attempted to fix with
these patches:
https://lore.kernel.org/linux-xfs/[email protected]/
Brian
>
> If you have instead built with CONFIG_PREEMPT_VOLUNTARY=y, then your
> problem is likely massive lock contention in wake_up_page_bit(), or
> perhaps someone having failed to release that lock. The usual way to
> work this out is by enabling lockdep (CONFIG_PROVE_LOCKING=y), but this
> is often not what you want enabled in production.
>
> Darrick, thoughts from an xfs perspective?
>
> Thanx, Paul
>
On Mon, Feb 08, 2021 at 09:07:24AM -0500, Brian Foster wrote:
> On Fri, Feb 05, 2021 at 09:12:40AM -0800, Paul E. McKenney wrote:
> > On Fri, Feb 05, 2021 at 08:29:06AM +0100, Paul Menzel wrote:
> > > Dear Linux folks,
> > >
> > >
> > > On a Dell PowerEdge T630/0NT78X, BIOS 2.8.0 05/23/2018 with Linux 5.4.57, we
> > > twice saw a self-detected stall on a CPU (October 27th, 2020, January 18th,
> > > 2021).
> > >
> > > Both times, the workqueue is `xfs-conv/md0 xfs_end_io`.
> > >
> > > ```
> > > [ 0.000000] Linux version 5.4.57.mx64.340
> > > ([email protected]) (gcc version 7.5.0 (GCC)) #1 SMP Tue Aug 11
> > > 13:20:33 CEST 2020
> > > […]
> > > [48962.981257] rcu: INFO: rcu_sched self-detected stall on CPU
> > > [48962.987511] rcu: 4-....: (20999 ticks this GP)
> > > idle=fe6/1/0x4000000000000002 softirq=3630188/3630188 fqs=4696
> > > [48962.998805] (t=21017 jiffies g=14529009 q=32263)
> > > [48963.004074] Task dump for CPU 4:
> > > [48963.007689] kworker/4:2 R running task 0 25587 2
> > > 0x80004008
> > > [48963.015591] Workqueue: xfs-conv/md0 xfs_end_io
> > > [48963.020570] Call Trace:
> > > [48963.023311] <IRQ>
> > > [48963.025560] sched_show_task+0x11e/0x150
> > > [48963.029957] rcu_dump_cpu_stacks+0x70/0xa0
> > > [48963.034545] rcu_sched_clock_irq+0x502/0x770
> > > [48963.039322] ? tick_sched_do_timer+0x60/0x60
> > > [48963.044106] update_process_times+0x24/0x60
> > > [48963.048791] tick_sched_timer+0x37/0x70
> > > [48963.053089] __hrtimer_run_queues+0x11f/0x2b0
> > > [48963.057960] ? recalibrate_cpu_khz+0x10/0x10
> > > [48963.062744] hrtimer_interrupt+0xe5/0x240
> > > [48963.067235] smp_apic_timer_interrupt+0x6f/0x130
> > > [48963.072407] apic_timer_interrupt+0xf/0x20
> > > [48963.076994] </IRQ>
> > > [48963.079347] RIP: 0010:_raw_spin_unlock_irqrestore+0xa/0x10
> > > [48963.085491] Code: f3 90 83 e8 01 75 e8 65 8b 3d 42 0f 56 7e e8 ed ea 5e
> > > ff 48 29 e8 4c 39 e8 76 cf 80 0b 08 eb 8c 0f 1f 44 00 00 c6 07 00 56 9d <c3>
> > > 0f 1f 44 00 00 0f 1f 44 00 00 b8 00 fe ff ff f0 0f c1 07 56 9d
> > > [48963.106524] RSP: 0018:ffffc9000738fd40 EFLAGS: 00000202 ORIG_RAX:
> > > ffffffffffffff13
> > > [48963.115003] RAX: ffffffff82407588 RBX: ffffffff82407580 RCX:
> > > ffffffff82407588
> > > [48963.122994] RDX: ffffffff82407588 RSI: 0000000000000202 RDI:
> > > ffffffff82407580
> > > [48963.130989] RBP: 0000000000000202 R08: ffffffff8203ea00 R09:
> > > 0000000000000001
> > > [48963.138982] R10: ffffc9000738fbb8 R11: 0000000000000001 R12:
> > > ffffffff82407588
> > > [48963.146976] R13: ffffea005e7ae600 R14: ffff8897b7e5a040 R15:
> > > ffffea005e7ae600
> > > [48963.154971] wake_up_page_bit+0xe0/0x100
> > > [48963.159366] xfs_destroy_ioend+0xce/0x1c0
> > > [48963.163857] xfs_end_ioend+0xcf/0x1a0
> > > [48963.167958] xfs_end_io+0xa4/0xd0
> > > [48963.171672] process_one_work+0x1e5/0x410
> > > [48963.176163] worker_thread+0x2d/0x3c0
> > > [48963.180265] ? cancel_delayed_work+0x90/0x90
> > > [48963.185048] kthread+0x117/0x130
> > > [48963.188663] ? kthread_create_worker_on_cpu+0x70/0x70
> > > [48963.194321] ret_from_fork+0x35/0x40
> > > ```
> > >
> > > As it’s just log level INFO, is there anything what should be done, or was
> > > the system probably just “overloaded”?
> >
> > I am assuming that you are building your kernel with CONFIG_PREEMPT_NONE=y
> > rather than CONFIG_PREEMPT_VOLUNTARY=y.
> >
> > If so, and if the problem is that you are temporarily overdriving xfs I/O,
> > one approach would be as follows:
> >
> > ------------------------------------------------------------------------
> >
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index f16d5f1..06be426 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -390,6 +390,7 @@ xfs_end_io(
> > list_del_init(&ioend->io_list);
> > xfs_ioend_try_merge(ioend, &completion_list);
> > xfs_end_ioend(ioend);
> > + cond_resched();
> > }
> > }
> >
> > ------------------------------------------------------------------------
>
> FWIW, this looks quite similar to the problem I attempted to fix with
> these patches:
>
> https://lore.kernel.org/linux-xfs/[email protected]/
Looks plausible to me! Do you plan to re-post taking the feedback
into account?
Thanx, Paul
> Brian
>
> >
> > If you have instead built with CONFIG_PREEMPT_VOLUNTARY=y, then your
> > problem is likely massive lock contention in wake_up_page_bit(), or
> > perhaps someone having failed to release that lock. The usual way to
> > work this out is by enabling lockdep (CONFIG_PROVE_LOCKING=y), but this
> > is often not what you want enabled in production.
> >
> > Darrick, thoughts from an xfs perspective?
> >
> > Thanx, Paul
> >
>
On Mon, Feb 08, 2021 at 06:57:24AM -0800, Paul E. McKenney wrote:
> On Mon, Feb 08, 2021 at 09:07:24AM -0500, Brian Foster wrote:
> > On Fri, Feb 05, 2021 at 09:12:40AM -0800, Paul E. McKenney wrote:
> > > On Fri, Feb 05, 2021 at 08:29:06AM +0100, Paul Menzel wrote:
> > > > Dear Linux folks,
> > > >
> > > >
> > > > On a Dell PowerEdge T630/0NT78X, BIOS 2.8.0 05/23/2018 with Linux 5.4.57, we
> > > > twice saw a self-detected stall on a CPU (October 27th, 2020, January 18th,
> > > > 2021).
> > > >
> > > > Both times, the workqueue is `xfs-conv/md0 xfs_end_io`.
> > > >
> > > > ```
> > > > [ 0.000000] Linux version 5.4.57.mx64.340
> > > > ([email protected]) (gcc version 7.5.0 (GCC)) #1 SMP Tue Aug 11
> > > > 13:20:33 CEST 2020
> > > > […]
> > > > [48962.981257] rcu: INFO: rcu_sched self-detected stall on CPU
> > > > [48962.987511] rcu: 4-....: (20999 ticks this GP)
> > > > idle=fe6/1/0x4000000000000002 softirq=3630188/3630188 fqs=4696
> > > > [48962.998805] (t=21017 jiffies g=14529009 q=32263)
> > > > [48963.004074] Task dump for CPU 4:
> > > > [48963.007689] kworker/4:2 R running task 0 25587 2
> > > > 0x80004008
> > > > [48963.015591] Workqueue: xfs-conv/md0 xfs_end_io
> > > > [48963.020570] Call Trace:
> > > > [48963.023311] <IRQ>
> > > > [48963.025560] sched_show_task+0x11e/0x150
> > > > [48963.029957] rcu_dump_cpu_stacks+0x70/0xa0
> > > > [48963.034545] rcu_sched_clock_irq+0x502/0x770
> > > > [48963.039322] ? tick_sched_do_timer+0x60/0x60
> > > > [48963.044106] update_process_times+0x24/0x60
> > > > [48963.048791] tick_sched_timer+0x37/0x70
> > > > [48963.053089] __hrtimer_run_queues+0x11f/0x2b0
> > > > [48963.057960] ? recalibrate_cpu_khz+0x10/0x10
> > > > [48963.062744] hrtimer_interrupt+0xe5/0x240
> > > > [48963.067235] smp_apic_timer_interrupt+0x6f/0x130
> > > > [48963.072407] apic_timer_interrupt+0xf/0x20
> > > > [48963.076994] </IRQ>
> > > > [48963.079347] RIP: 0010:_raw_spin_unlock_irqrestore+0xa/0x10
> > > > [48963.085491] Code: f3 90 83 e8 01 75 e8 65 8b 3d 42 0f 56 7e e8 ed ea 5e
> > > > ff 48 29 e8 4c 39 e8 76 cf 80 0b 08 eb 8c 0f 1f 44 00 00 c6 07 00 56 9d <c3>
> > > > 0f 1f 44 00 00 0f 1f 44 00 00 b8 00 fe ff ff f0 0f c1 07 56 9d
> > > > [48963.106524] RSP: 0018:ffffc9000738fd40 EFLAGS: 00000202 ORIG_RAX:
> > > > ffffffffffffff13
> > > > [48963.115003] RAX: ffffffff82407588 RBX: ffffffff82407580 RCX:
> > > > ffffffff82407588
> > > > [48963.122994] RDX: ffffffff82407588 RSI: 0000000000000202 RDI:
> > > > ffffffff82407580
> > > > [48963.130989] RBP: 0000000000000202 R08: ffffffff8203ea00 R09:
> > > > 0000000000000001
> > > > [48963.138982] R10: ffffc9000738fbb8 R11: 0000000000000001 R12:
> > > > ffffffff82407588
> > > > [48963.146976] R13: ffffea005e7ae600 R14: ffff8897b7e5a040 R15:
> > > > ffffea005e7ae600
> > > > [48963.154971] wake_up_page_bit+0xe0/0x100
> > > > [48963.159366] xfs_destroy_ioend+0xce/0x1c0
> > > > [48963.163857] xfs_end_ioend+0xcf/0x1a0
> > > > [48963.167958] xfs_end_io+0xa4/0xd0
> > > > [48963.171672] process_one_work+0x1e5/0x410
> > > > [48963.176163] worker_thread+0x2d/0x3c0
> > > > [48963.180265] ? cancel_delayed_work+0x90/0x90
> > > > [48963.185048] kthread+0x117/0x130
> > > > [48963.188663] ? kthread_create_worker_on_cpu+0x70/0x70
> > > > [48963.194321] ret_from_fork+0x35/0x40
> > > > ```
> > > >
> > > > As it’s just log level INFO, is there anything what should be done, or was
> > > > the system probably just “overloaded”?
> > >
> > > I am assuming that you are building your kernel with CONFIG_PREEMPT_NONE=y
> > > rather than CONFIG_PREEMPT_VOLUNTARY=y.
> > >
> > > If so, and if the problem is that you are temporarily overdriving xfs I/O,
> > > one approach would be as follows:
> > >
> > > ------------------------------------------------------------------------
> > >
> > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > > index f16d5f1..06be426 100644
> > > --- a/fs/xfs/xfs_aops.c
> > > +++ b/fs/xfs/xfs_aops.c
> > > @@ -390,6 +390,7 @@ xfs_end_io(
> > > list_del_init(&ioend->io_list);
> > > xfs_ioend_try_merge(ioend, &completion_list);
> > > xfs_end_ioend(ioend);
> > > + cond_resched();
> > > }
> > > }
> > >
> > > ------------------------------------------------------------------------
> >
> > FWIW, this looks quite similar to the problem I attempted to fix with
> > these patches:
> >
> > https://lore.kernel.org/linux-xfs/[email protected]/
>
> Looks plausible to me! Do you plan to re-post taking the feedback
> into account?
>
There was a v2 inline that incorporated some directed feedback.
Otherwise there were questions and ideas about making the whole thing
faster, but I've no idea if that addresses the problem or not (if so,
that would be an entirely different set of patches). I'll wait and see
what Darrick thinks about this and rebase/repost if the approach is
agreeable..
Brian
> Thanx, Paul
>
> > Brian
> >
> > >
> > > If you have instead built with CONFIG_PREEMPT_VOLUNTARY=y, then your
> > > problem is likely massive lock contention in wake_up_page_bit(), or
> > > perhaps someone having failed to release that lock. The usual way to
> > > work this out is by enabling lockdep (CONFIG_PROVE_LOCKING=y), but this
> > > is often not what you want enabled in production.
> > >
> > > Darrick, thoughts from an xfs perspective?
> > >
> > > Thanx, Paul
> > >
> >
>
On Mon, Feb 08, 2021 at 10:44:58AM -0500, Brian Foster wrote:
> On Mon, Feb 08, 2021 at 06:57:24AM -0800, Paul E. McKenney wrote:
> > On Mon, Feb 08, 2021 at 09:07:24AM -0500, Brian Foster wrote:
> > > On Fri, Feb 05, 2021 at 09:12:40AM -0800, Paul E. McKenney wrote:
> > > > On Fri, Feb 05, 2021 at 08:29:06AM +0100, Paul Menzel wrote:
> > > > > Dear Linux folks,
> > > > >
> > > > >
> > > > > On a Dell PowerEdge T630/0NT78X, BIOS 2.8.0 05/23/2018 with Linux 5.4.57, we
> > > > > twice saw a self-detected stall on a CPU (October 27th, 2020, January 18th,
> > > > > 2021).
> > > > >
> > > > > Both times, the workqueue is `xfs-conv/md0 xfs_end_io`.
> > > > >
> > > > > ```
> > > > > [ 0.000000] Linux version 5.4.57.mx64.340
> > > > > ([email protected]) (gcc version 7.5.0 (GCC)) #1 SMP Tue Aug 11
> > > > > 13:20:33 CEST 2020
> > > > > […]
> > > > > [48962.981257] rcu: INFO: rcu_sched self-detected stall on CPU
> > > > > [48962.987511] rcu: 4-....: (20999 ticks this GP)
> > > > > idle=fe6/1/0x4000000000000002 softirq=3630188/3630188 fqs=4696
> > > > > [48962.998805] (t=21017 jiffies g=14529009 q=32263)
> > > > > [48963.004074] Task dump for CPU 4:
> > > > > [48963.007689] kworker/4:2 R running task 0 25587 2
> > > > > 0x80004008
> > > > > [48963.015591] Workqueue: xfs-conv/md0 xfs_end_io
> > > > > [48963.020570] Call Trace:
> > > > > [48963.023311] <IRQ>
> > > > > [48963.025560] sched_show_task+0x11e/0x150
> > > > > [48963.029957] rcu_dump_cpu_stacks+0x70/0xa0
> > > > > [48963.034545] rcu_sched_clock_irq+0x502/0x770
> > > > > [48963.039322] ? tick_sched_do_timer+0x60/0x60
> > > > > [48963.044106] update_process_times+0x24/0x60
> > > > > [48963.048791] tick_sched_timer+0x37/0x70
> > > > > [48963.053089] __hrtimer_run_queues+0x11f/0x2b0
> > > > > [48963.057960] ? recalibrate_cpu_khz+0x10/0x10
> > > > > [48963.062744] hrtimer_interrupt+0xe5/0x240
> > > > > [48963.067235] smp_apic_timer_interrupt+0x6f/0x130
> > > > > [48963.072407] apic_timer_interrupt+0xf/0x20
> > > > > [48963.076994] </IRQ>
> > > > > [48963.079347] RIP: 0010:_raw_spin_unlock_irqrestore+0xa/0x10
> > > > > [48963.085491] Code: f3 90 83 e8 01 75 e8 65 8b 3d 42 0f 56 7e e8 ed ea 5e
> > > > > ff 48 29 e8 4c 39 e8 76 cf 80 0b 08 eb 8c 0f 1f 44 00 00 c6 07 00 56 9d <c3>
> > > > > 0f 1f 44 00 00 0f 1f 44 00 00 b8 00 fe ff ff f0 0f c1 07 56 9d
> > > > > [48963.106524] RSP: 0018:ffffc9000738fd40 EFLAGS: 00000202 ORIG_RAX:
> > > > > ffffffffffffff13
> > > > > [48963.115003] RAX: ffffffff82407588 RBX: ffffffff82407580 RCX:
> > > > > ffffffff82407588
> > > > > [48963.122994] RDX: ffffffff82407588 RSI: 0000000000000202 RDI:
> > > > > ffffffff82407580
> > > > > [48963.130989] RBP: 0000000000000202 R08: ffffffff8203ea00 R09:
> > > > > 0000000000000001
> > > > > [48963.138982] R10: ffffc9000738fbb8 R11: 0000000000000001 R12:
> > > > > ffffffff82407588
> > > > > [48963.146976] R13: ffffea005e7ae600 R14: ffff8897b7e5a040 R15:
> > > > > ffffea005e7ae600
> > > > > [48963.154971] wake_up_page_bit+0xe0/0x100
> > > > > [48963.159366] xfs_destroy_ioend+0xce/0x1c0
> > > > > [48963.163857] xfs_end_ioend+0xcf/0x1a0
> > > > > [48963.167958] xfs_end_io+0xa4/0xd0
> > > > > [48963.171672] process_one_work+0x1e5/0x410
> > > > > [48963.176163] worker_thread+0x2d/0x3c0
> > > > > [48963.180265] ? cancel_delayed_work+0x90/0x90
> > > > > [48963.185048] kthread+0x117/0x130
> > > > > [48963.188663] ? kthread_create_worker_on_cpu+0x70/0x70
> > > > > [48963.194321] ret_from_fork+0x35/0x40
> > > > > ```
> > > > >
> > > > > As it’s just log level INFO, is there anything what should be done, or was
> > > > > the system probably just “overloaded”?
> > > >
> > > > I am assuming that you are building your kernel with CONFIG_PREEMPT_NONE=y
> > > > rather than CONFIG_PREEMPT_VOLUNTARY=y.
> > > >
> > > > If so, and if the problem is that you are temporarily overdriving xfs I/O,
> > > > one approach would be as follows:
> > > >
> > > > ------------------------------------------------------------------------
> > > >
> > > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > > > index f16d5f1..06be426 100644
> > > > --- a/fs/xfs/xfs_aops.c
> > > > +++ b/fs/xfs/xfs_aops.c
> > > > @@ -390,6 +390,7 @@ xfs_end_io(
> > > > list_del_init(&ioend->io_list);
> > > > xfs_ioend_try_merge(ioend, &completion_list);
> > > > xfs_end_ioend(ioend);
> > > > + cond_resched();
> > > > }
> > > > }
> > > >
> > > > ------------------------------------------------------------------------
> > >
> > > FWIW, this looks quite similar to the problem I attempted to fix with
> > > these patches:
> > >
> > > https://lore.kernel.org/linux-xfs/[email protected]/
> >
> > Looks plausible to me! Do you plan to re-post taking the feedback
> > into account?
>
> There was a v2 inline that incorporated some directed feedback.
> Otherwise there were questions and ideas about making the whole thing
> faster, but I've no idea if that addresses the problem or not (if so,
> that would be an entirely different set of patches). I'll wait and see
> what Darrick thinks about this and rebase/repost if the approach is
> agreeable..
There is always the school of thought that says that the best way to
get people to focus on this is to rebase and repost. Otherwise, they
are all too likely to assume that you lost interest in this.
Thanx, Paul
> Brian
>
> > Thanx, Paul
> >
> > > Brian
> > >
> > > >
> > > > If you have instead built with CONFIG_PREEMPT_VOLUNTARY=y, then your
> > > > problem is likely massive lock contention in wake_up_page_bit(), or
> > > > perhaps someone having failed to release that lock. The usual way to
> > > > work this out is by enabling lockdep (CONFIG_PROVE_LOCKING=y), but this
> > > > is often not what you want enabled in production.
> > > >
> > > > Darrick, thoughts from an xfs perspective?
> > > >
> > > > Thanx, Paul
> > > >
> > >
> >
>
On Mon, Feb 09, 2021 at 09:11:40AM -0800, Paul E. McKenney wrote:
> On Mon, Feb 08, 2021 at 10:44:58AM -0500, Brian Foster wrote:
> > On Mon, Feb 08, 2021 at 06:57:24AM -0800, Paul E. McKenney wrote:
> > > On Mon, Feb 08, 2021 at 09:07:24AM -0500, Brian Foster wrote:
> > > > On Fri, Feb 05, 2021 at 09:12:40AM -0800, Paul E. McKenney wrote:
> > > > > On Fri, Feb 05, 2021 at 08:29:06AM +0100, Paul Menzel wrote:
> > > > > > Dear Linux folks,
> > > > > >
> > > > > >
> > > > > > On a Dell PowerEdge T630/0NT78X, BIOS 2.8.0 05/23/2018 with Linux 5.4.57, we
> > > > > > twice saw a self-detected stall on a CPU (October 27th, 2020, January 18th,
> > > > > > 2021).
> > > > > >
> > > > > > Both times, the workqueue is `xfs-conv/md0 xfs_end_io`.
> > > > > >
> > > > > > ```
> > > > > > [ 0.000000] Linux version 5.4.57.mx64.340
> > > > > > ([email protected]) (gcc version 7.5.0 (GCC)) #1 SMP Tue Aug 11
> > > > > > 13:20:33 CEST 2020
> > > > > > […]
> > > > > > [48962.981257] rcu: INFO: rcu_sched self-detected stall on CPU
> > > > > > [48962.987511] rcu: 4-....: (20999 ticks this GP)
> > > > > > idle=fe6/1/0x4000000000000002 softirq=3630188/3630188 fqs=4696
> > > > > > [48962.998805] (t=21017 jiffies g=14529009 q=32263)
> > > > > > [48963.004074] Task dump for CPU 4:
> > > > > > [48963.007689] kworker/4:2 R running task 0 25587 2
> > > > > > 0x80004008
> > > > > > [48963.015591] Workqueue: xfs-conv/md0 xfs_end_io
> > > > > > [48963.020570] Call Trace:
> > > > > > [48963.023311] <IRQ>
> > > > > > [48963.025560] sched_show_task+0x11e/0x150
> > > > > > [48963.029957] rcu_dump_cpu_stacks+0x70/0xa0
> > > > > > [48963.034545] rcu_sched_clock_irq+0x502/0x770
> > > > > > [48963.039322] ? tick_sched_do_timer+0x60/0x60
> > > > > > [48963.044106] update_process_times+0x24/0x60
> > > > > > [48963.048791] tick_sched_timer+0x37/0x70
> > > > > > [48963.053089] __hrtimer_run_queues+0x11f/0x2b0
> > > > > > [48963.057960] ? recalibrate_cpu_khz+0x10/0x10
> > > > > > [48963.062744] hrtimer_interrupt+0xe5/0x240
> > > > > > [48963.067235] smp_apic_timer_interrupt+0x6f/0x130
> > > > > > [48963.072407] apic_timer_interrupt+0xf/0x20
> > > > > > [48963.076994] </IRQ>
> > > > > > [48963.079347] RIP: 0010:_raw_spin_unlock_irqrestore+0xa/0x10
> > > > > > [48963.085491] Code: f3 90 83 e8 01 75 e8 65 8b 3d 42 0f 56 7e e8 ed ea 5e
> > > > > > ff 48 29 e8 4c 39 e8 76 cf 80 0b 08 eb 8c 0f 1f 44 00 00 c6 07 00 56 9d <c3>
> > > > > > 0f 1f 44 00 00 0f 1f 44 00 00 b8 00 fe ff ff f0 0f c1 07 56 9d
> > > > > > [48963.106524] RSP: 0018:ffffc9000738fd40 EFLAGS: 00000202 ORIG_RAX:
> > > > > > ffffffffffffff13
> > > > > > [48963.115003] RAX: ffffffff82407588 RBX: ffffffff82407580 RCX:
> > > > > > ffffffff82407588
> > > > > > [48963.122994] RDX: ffffffff82407588 RSI: 0000000000000202 RDI:
> > > > > > ffffffff82407580
> > > > > > [48963.130989] RBP: 0000000000000202 R08: ffffffff8203ea00 R09:
> > > > > > 0000000000000001
> > > > > > [48963.138982] R10: ffffc9000738fbb8 R11: 0000000000000001 R12:
> > > > > > ffffffff82407588
> > > > > > [48963.146976] R13: ffffea005e7ae600 R14: ffff8897b7e5a040 R15:
> > > > > > ffffea005e7ae600
> > > > > > [48963.154971] wake_up_page_bit+0xe0/0x100
> > > > > > [48963.159366] xfs_destroy_ioend+0xce/0x1c0
> > > > > > [48963.163857] xfs_end_ioend+0xcf/0x1a0
> > > > > > [48963.167958] xfs_end_io+0xa4/0xd0
> > > > > > [48963.171672] process_one_work+0x1e5/0x410
> > > > > > [48963.176163] worker_thread+0x2d/0x3c0
> > > > > > [48963.180265] ? cancel_delayed_work+0x90/0x90
> > > > > > [48963.185048] kthread+0x117/0x130
> > > > > > [48963.188663] ? kthread_create_worker_on_cpu+0x70/0x70
> > > > > > [48963.194321] ret_from_fork+0x35/0x40
> > > > > > ```
> > > > > >
> > > > > > As it’s just log level INFO, is there anything what should be done, or was
> > > > > > the system probably just “overloaded”?
> > > > >
> > > > > I am assuming that you are building your kernel with CONFIG_PREEMPT_NONE=y
> > > > > rather than CONFIG_PREEMPT_VOLUNTARY=y.
> > > > >
> > > > > If so, and if the problem is that you are temporarily overdriving xfs I/O,
> > > > > one approach would be as follows:
> > > > >
> > > > > ------------------------------------------------------------------------
> > > > >
> > > > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > > > > index f16d5f1..06be426 100644
> > > > > --- a/fs/xfs/xfs_aops.c
> > > > > +++ b/fs/xfs/xfs_aops.c
> > > > > @@ -390,6 +390,7 @@ xfs_end_io(
> > > > > list_del_init(&ioend->io_list);
> > > > > xfs_ioend_try_merge(ioend, &completion_list);
> > > > > xfs_end_ioend(ioend);
> > > > > + cond_resched();
> > > > > }
> > > > > }
> > > > >
> > > > > ------------------------------------------------------------------------
> > > >
> > > > FWIW, this looks quite similar to the problem I attempted to fix with
> > > > these patches:
> > > >
> > > > https://lore.kernel.org/linux-xfs/[email protected]/
> > >
> > > Looks plausible to me! Do you plan to re-post taking the feedback
> > > into account?
> >
> > There was a v2 inline that incorporated some directed feedback.
> > Otherwise there were questions and ideas about making the whole thing
> > faster, but I've no idea if that addresses the problem or not (if so,
> > that would be an entirely different set of patches). I'll wait and see
> > what Darrick thinks about this and rebase/repost if the approach is
> > agreeable..
>
> There is always the school of thought that says that the best way to
> get people to focus on this is to rebase and repost. Otherwise, they
> are all too likely to assume that you lost interest in this.
I was hoping that a better solution would emerge for clearing
PageWriteback on hundreds of thousands of pages, but nothing easy popped
out.
The hardcoded threshold in "[PATCH v2 2/2] xfs: kick extra large ioends
to completion workqueue" gives me unease because who's to say if marking
262,144 pages on a particular CPU will actually stall it long enough to
trip the hangcheck? Is the number lower on (say) some pokey NAS box
with a lot of storage but a slow CPU?
That said, /some/ threshold is probably better than no threshold. Could
someone try to confirm if that series of Brian's fixes this problem too?
--D
> Thanx, Paul
>
> > Brian
> >
> > > Thanx, Paul
> > >
> > > > Brian
> > > >
> > > > >
> > > > > If you have instead built with CONFIG_PREEMPT_VOLUNTARY=y, then your
> > > > > problem is likely massive lock contention in wake_up_page_bit(), or
> > > > > perhaps someone having failed to release that lock. The usual way to
> > > > > work this out is by enabling lockdep (CONFIG_PROVE_LOCKING=y), but this
> > > > > is often not what you want enabled in production.
> > > > >
> > > > > Darrick, thoughts from an xfs perspective?
> > > > >
> > > > > Thanx, Paul
> > > > >
> > > >
> > >
> >
On Mon, Feb 08, 2021 at 09:28:24AM -0800, Darrick J. Wong wrote:
> On Mon, Feb 09, 2021 at 09:11:40AM -0800, Paul E. McKenney wrote:
> > On Mon, Feb 08, 2021 at 10:44:58AM -0500, Brian Foster wrote:
> > > On Mon, Feb 08, 2021 at 06:57:24AM -0800, Paul E. McKenney wrote:
> > > > On Mon, Feb 08, 2021 at 09:07:24AM -0500, Brian Foster wrote:
> > > > > On Fri, Feb 05, 2021 at 09:12:40AM -0800, Paul E. McKenney wrote:
> > > > > > On Fri, Feb 05, 2021 at 08:29:06AM +0100, Paul Menzel wrote:
> > > > > > > Dear Linux folks,
> > > > > > >
> > > > > > >
> > > > > > > On a Dell PowerEdge T630/0NT78X, BIOS 2.8.0 05/23/2018 with Linux 5.4.57, we
> > > > > > > twice saw a self-detected stall on a CPU (October 27th, 2020, January 18th,
> > > > > > > 2021).
> > > > > > >
> > > > > > > Both times, the workqueue is `xfs-conv/md0 xfs_end_io`.
> > > > > > >
> > > > > > > ```
> > > > > > > [ 0.000000] Linux version 5.4.57.mx64.340
> > > > > > > ([email protected]) (gcc version 7.5.0 (GCC)) #1 SMP Tue Aug 11
> > > > > > > 13:20:33 CEST 2020
> > > > > > > […]
> > > > > > > [48962.981257] rcu: INFO: rcu_sched self-detected stall on CPU
> > > > > > > [48962.987511] rcu: 4-....: (20999 ticks this GP)
> > > > > > > idle=fe6/1/0x4000000000000002 softirq=3630188/3630188 fqs=4696
> > > > > > > [48962.998805] (t=21017 jiffies g=14529009 q=32263)
> > > > > > > [48963.004074] Task dump for CPU 4:
> > > > > > > [48963.007689] kworker/4:2 R running task 0 25587 2
> > > > > > > 0x80004008
> > > > > > > [48963.015591] Workqueue: xfs-conv/md0 xfs_end_io
> > > > > > > [48963.020570] Call Trace:
> > > > > > > [48963.023311] <IRQ>
> > > > > > > [48963.025560] sched_show_task+0x11e/0x150
> > > > > > > [48963.029957] rcu_dump_cpu_stacks+0x70/0xa0
> > > > > > > [48963.034545] rcu_sched_clock_irq+0x502/0x770
> > > > > > > [48963.039322] ? tick_sched_do_timer+0x60/0x60
> > > > > > > [48963.044106] update_process_times+0x24/0x60
> > > > > > > [48963.048791] tick_sched_timer+0x37/0x70
> > > > > > > [48963.053089] __hrtimer_run_queues+0x11f/0x2b0
> > > > > > > [48963.057960] ? recalibrate_cpu_khz+0x10/0x10
> > > > > > > [48963.062744] hrtimer_interrupt+0xe5/0x240
> > > > > > > [48963.067235] smp_apic_timer_interrupt+0x6f/0x130
> > > > > > > [48963.072407] apic_timer_interrupt+0xf/0x20
> > > > > > > [48963.076994] </IRQ>
> > > > > > > [48963.079347] RIP: 0010:_raw_spin_unlock_irqrestore+0xa/0x10
> > > > > > > [48963.085491] Code: f3 90 83 e8 01 75 e8 65 8b 3d 42 0f 56 7e e8 ed ea 5e
> > > > > > > ff 48 29 e8 4c 39 e8 76 cf 80 0b 08 eb 8c 0f 1f 44 00 00 c6 07 00 56 9d <c3>
> > > > > > > 0f 1f 44 00 00 0f 1f 44 00 00 b8 00 fe ff ff f0 0f c1 07 56 9d
> > > > > > > [48963.106524] RSP: 0018:ffffc9000738fd40 EFLAGS: 00000202 ORIG_RAX:
> > > > > > > ffffffffffffff13
> > > > > > > [48963.115003] RAX: ffffffff82407588 RBX: ffffffff82407580 RCX:
> > > > > > > ffffffff82407588
> > > > > > > [48963.122994] RDX: ffffffff82407588 RSI: 0000000000000202 RDI:
> > > > > > > ffffffff82407580
> > > > > > > [48963.130989] RBP: 0000000000000202 R08: ffffffff8203ea00 R09:
> > > > > > > 0000000000000001
> > > > > > > [48963.138982] R10: ffffc9000738fbb8 R11: 0000000000000001 R12:
> > > > > > > ffffffff82407588
> > > > > > > [48963.146976] R13: ffffea005e7ae600 R14: ffff8897b7e5a040 R15:
> > > > > > > ffffea005e7ae600
> > > > > > > [48963.154971] wake_up_page_bit+0xe0/0x100
> > > > > > > [48963.159366] xfs_destroy_ioend+0xce/0x1c0
> > > > > > > [48963.163857] xfs_end_ioend+0xcf/0x1a0
> > > > > > > [48963.167958] xfs_end_io+0xa4/0xd0
> > > > > > > [48963.171672] process_one_work+0x1e5/0x410
> > > > > > > [48963.176163] worker_thread+0x2d/0x3c0
> > > > > > > [48963.180265] ? cancel_delayed_work+0x90/0x90
> > > > > > > [48963.185048] kthread+0x117/0x130
> > > > > > > [48963.188663] ? kthread_create_worker_on_cpu+0x70/0x70
> > > > > > > [48963.194321] ret_from_fork+0x35/0x40
> > > > > > > ```
> > > > > > >
> > > > > > > As it’s just log level INFO, is there anything what should be done, or was
> > > > > > > the system probably just “overloaded”?
> > > > > >
> > > > > > I am assuming that you are building your kernel with CONFIG_PREEMPT_NONE=y
> > > > > > rather than CONFIG_PREEMPT_VOLUNTARY=y.
> > > > > >
> > > > > > If so, and if the problem is that you are temporarily overdriving xfs I/O,
> > > > > > one approach would be as follows:
> > > > > >
> > > > > > ------------------------------------------------------------------------
> > > > > >
> > > > > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > > > > > index f16d5f1..06be426 100644
> > > > > > --- a/fs/xfs/xfs_aops.c
> > > > > > +++ b/fs/xfs/xfs_aops.c
> > > > > > @@ -390,6 +390,7 @@ xfs_end_io(
> > > > > > list_del_init(&ioend->io_list);
> > > > > > xfs_ioend_try_merge(ioend, &completion_list);
> > > > > > xfs_end_ioend(ioend);
> > > > > > + cond_resched();
> > > > > > }
> > > > > > }
> > > > > >
> > > > > > ------------------------------------------------------------------------
> > > > >
> > > > > FWIW, this looks quite similar to the problem I attempted to fix with
> > > > > these patches:
> > > > >
> > > > > https://lore.kernel.org/linux-xfs/[email protected]/
> > > >
> > > > Looks plausible to me! Do you plan to re-post taking the feedback
> > > > into account?
> > >
> > > There was a v2 inline that incorporated some directed feedback.
> > > Otherwise there were questions and ideas about making the whole thing
> > > faster, but I've no idea if that addresses the problem or not (if so,
> > > that would be an entirely different set of patches). I'll wait and see
> > > what Darrick thinks about this and rebase/repost if the approach is
> > > agreeable..
> >
> > There is always the school of thought that says that the best way to
> > get people to focus on this is to rebase and repost. Otherwise, they
> > are all too likely to assume that you lost interest in this.
>
> I was hoping that a better solution would emerge for clearing
> PageWriteback on hundreds of thousands of pages, but nothing easy popped
> out.
>
> The hardcoded threshold in "[PATCH v2 2/2] xfs: kick extra large ioends
> to completion workqueue" gives me unease because who's to say if marking
> 262,144 pages on a particular CPU will actually stall it long enough to
> trip the hangcheck? Is the number lower on (say) some pokey NAS box
> with a lot of storage but a slow CPU?
>
> That said, /some/ threshold is probably better than no threshold. Could
> someone try to confirm if that series of Brian's fixes this problem too?
>
Note that this particular report looks like it's already in wq context,
so patch 1 by itself might be sufficient...
Brian
> --D
>
> > Thanx, Paul
> >
> > > Brian
> > >
> > > > Thanx, Paul
> > > >
> > > > > Brian
> > > > >
> > > > > >
> > > > > > If you have instead built with CONFIG_PREEMPT_VOLUNTARY=y, then your
> > > > > > problem is likely massive lock contention in wake_up_page_bit(), or
> > > > > > perhaps someone having failed to release that lock. The usual way to
> > > > > > work this out is by enabling lockdep (CONFIG_PROVE_LOCKING=y), but this
> > > > > > is often not what you want enabled in production.
> > > > > >
> > > > > > Darrick, thoughts from an xfs perspective?
> > > > > >
> > > > > > Thanx, Paul
> > > > > >
> > > > >
> > > >
> > >
>
On Mon, Feb 08, 2021 at 09:28:24AM -0800, Darrick J. Wong wrote:
> On Mon, Feb 09, 2021 at 09:11:40AM -0800, Paul E. McKenney wrote:
> > On Mon, Feb 08, 2021 at 10:44:58AM -0500, Brian Foster wrote:
> > > There was a v2 inline that incorporated some directed feedback.
> > > Otherwise there were questions and ideas about making the whole thing
> > > faster, but I've no idea if that addresses the problem or not (if so,
> > > that would be an entirely different set of patches). I'll wait and see
> > > what Darrick thinks about this and rebase/repost if the approach is
> > > agreeable..
> >
> > There is always the school of thought that says that the best way to
> > get people to focus on this is to rebase and repost. Otherwise, they
> > are all too likely to assume that you lost interest in this.
>
> I was hoping that a better solution would emerge for clearing
> PageWriteback on hundreds of thousands of pages, but nothing easy popped
> out.
>
> The hardcoded threshold in "[PATCH v2 2/2] xfs: kick extra large ioends
> to completion workqueue" gives me unease because who's to say if marking
> 262,144 pages on a particular CPU will actually stall it long enough to
> trip the hangcheck? Is the number lower on (say) some pokey NAS box
> with a lot of storage but a slow CPU?
It's also not the right thing to do given the IO completion
workqueue is a bound workqueue. Anything that is doing large amounts
of CPU intensive work should be on a unbound workqueue so that the
scheduler can bounce it around different CPUs as needed.
Quite frankly, the problem is a huge long ioend chain being built by
the submission code. We need to keep ioend completion overhead down.
It runs in either softirq or bound workqueue context and so
individual items of work that are performed in this context must not
be -unbounded- in size or time. Unbounded ioend chains are bad for
IO latency, they are bad for memory reclaim and they are bad for CPU
scheduling.
As I've said previously, we gain nothing by aggregating ioends past
a few tens of megabytes of submitted IO. The batching gains are
completely diminished once we've got enough IO in flight to keep the
submission queue full. We're talking here about gigabytes of
sequential IOs in a single ioend chain which are 2-3 orders of
magnitude larger than needed for optimal background IO submission
and completion efficiency and throughput. IOWs, we really should be
limiting the ioend chain length at submission time, not trying to
patch over bad completion behaviour that results from sub-optimal IO
submission behaviour...
> That said, /some/ threshold is probably better than no threshold. Could
> someone try to confirm if that series of Brian's fixes this problem too?
262144 pages is still too much work to be doing in a single softirq
IO completion callback. It's likely to be too much work for a bound
workqueue, too, especially when you consider that the workqueue
completion code will merge sequential ioends into one ioend, hence
making the IO completion loop counts bigger and latency problems worse
rather than better...
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Tue, Feb 09, 2021 at 07:43:14AM +1100, Dave Chinner wrote:
> On Mon, Feb 08, 2021 at 09:28:24AM -0800, Darrick J. Wong wrote:
> > On Mon, Feb 09, 2021 at 09:11:40AM -0800, Paul E. McKenney wrote:
> > > On Mon, Feb 08, 2021 at 10:44:58AM -0500, Brian Foster wrote:
> > > > There was a v2 inline that incorporated some directed feedback.
> > > > Otherwise there were questions and ideas about making the whole thing
> > > > faster, but I've no idea if that addresses the problem or not (if so,
> > > > that would be an entirely different set of patches). I'll wait and see
> > > > what Darrick thinks about this and rebase/repost if the approach is
> > > > agreeable..
> > >
> > > There is always the school of thought that says that the best way to
> > > get people to focus on this is to rebase and repost. Otherwise, they
> > > are all too likely to assume that you lost interest in this.
> >
> > I was hoping that a better solution would emerge for clearing
> > PageWriteback on hundreds of thousands of pages, but nothing easy popped
> > out.
> >
> > The hardcoded threshold in "[PATCH v2 2/2] xfs: kick extra large ioends
> > to completion workqueue" gives me unease because who's to say if marking
> > 262,144 pages on a particular CPU will actually stall it long enough to
> > trip the hangcheck? Is the number lower on (say) some pokey NAS box
> > with a lot of storage but a slow CPU?
>
> It's also not the right thing to do given the IO completion
> workqueue is a bound workqueue. Anything that is doing large amounts
> of CPU intensive work should be on a unbound workqueue so that the
> scheduler can bounce it around different CPUs as needed.
>
> Quite frankly, the problem is a huge long ioend chain being built by
> the submission code. We need to keep ioend completion overhead down.
> It runs in either softirq or bound workqueue context and so
> individual items of work that are performed in this context must not
> be -unbounded- in size or time. Unbounded ioend chains are bad for
> IO latency, they are bad for memory reclaim and they are bad for CPU
> scheduling.
>
> As I've said previously, we gain nothing by aggregating ioends past
> a few tens of megabytes of submitted IO. The batching gains are
> completely diminished once we've got enough IO in flight to keep the
> submission queue full. We're talking here about gigabytes of
> sequential IOs in a single ioend chain which are 2-3 orders of
> magnitude larger than needed for optimal background IO submission
> and completion efficiency and throughput. IOWs, we really should be
> limiting the ioend chain length at submission time, not trying to
> patch over bad completion behaviour that results from sub-optimal IO
> submission behaviour...
>
That was the patch I posted prior to the aforementioned set. Granted, it
was an RFC, but for reference:
https://lore.kernel.org/linux-fsdevel/20200825144917.GA321765@bfoster/
(IIRC, you also had a variant that was essentially the same change.)
The discussion that followed in that thread was around the preference to
move completion of large chains into workqueue context instead of
breaking up the chains. The series referenced in my first reply fell out
of that as a targeted fix for the stall warning.
> > That said, /some/ threshold is probably better than no threshold. Could
> > someone try to confirm if that series of Brian's fixes this problem too?
>
> 262144 pages is still too much work to be doing in a single softirq
> IO completion callback. It's likely to be too much work for a bound
> workqueue, too, especially when you consider that the workqueue
> completion code will merge sequential ioends into one ioend, hence
> making the IO completion loop counts bigger and latency problems worse
> rather than better...
>
That was just a conservative number picked based on observation of the
original report (10+ GB ioends IIRC). I figured the review cycle would
involve narrowing it down to something more generically reasonable
(10s-100s of MB?) once we found an acceptable approach (and hopefully
received some testing feedback), but we've never really got to that
point..
Brian
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
>