It's possible that the CPU doing nohz idle balance hasn't had its own
load updated for many seconds. This can lead to huge deltas between
rq->avg_stamp and rq->clock when rebalancing, and has been seen to
cause the following crash:
divide error: 0000 [#1] SMP
Call Trace:
[<ffffffff810bcba8>] update_sd_lb_stats+0xe8/0x560
[<ffffffff810bd04d>] find_busiest_group+0x2d/0x4b0
[<ffffffff810bd640>] load_balance+0x170/0x950
[<ffffffff810be3ff>] rebalance_domains+0x13f/0x290
[<ffffffff810852bc>] __do_softirq+0xec/0x300
[<ffffffff8108578a>] irq_exit+0xfa/0x110
[<ffffffff816167d9>] reschedule_interrupt+0xc9/0xd0
Make sure we update the rq clock and load before balancing.
Cc: Ingo Molnar <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Matt Fleming <[email protected]>
---
kernel/sched/fair.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2f0a0be4d344..2c81662c858a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9597,6 +9597,16 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
*/
smp_mb();
+ /*
+ * Ensure this_rq's clock and load are up-to-date before we
+ * rebalance since it's possible that they haven't been
+ * updated for multiple schedule periods, i.e. many seconds.
+ */
+ raw_spin_lock_irq(&this_rq->lock);
+ update_rq_clock(this_rq);
+ cpu_load_update_idle(this_rq);
+ raw_spin_unlock_irq(&this_rq->lock);
+
for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
continue;
--
2.13.6
FYI, we noticed the following commit (built with gcc-7):
commit: fbd51884933192c9cada60628892024495942482 ("[PATCH] sched/fair: Avoid divide by zero when rebalancing domains")
url: https://github.com/0day-ci/linux/commits/Matt-Fleming/sched-fair-Avoid-divide-by-zero-when-rebalancing-domains/20180705-024633
in testcase: trinity
with following parameters:
runtime: 300s
test-description: Trinity is a linux system call fuzz tester.
test-url: http://codemonkey.org.uk/projects/trinity/
on test machine: qemu-system-x86_64 -enable-kvm -cpu host -smp 2 -m 1G
caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
+-------------------------------------------------------+------------+------------+
| | f83ee19be4 | fbd5188493 |
+-------------------------------------------------------+------------+------------+
| boot_successes | 9 | 4 |
| boot_failures | 4 | 8 |
| kernel_BUG_at_lib/list_debug.c | 4 | 5 |
| invalid_opcode:#[##] | 4 | 5 |
| RIP:__list_add_valid | 4 | 5 |
| RIP:separate_adjacent_colors | 3 | |
| Kernel_panic-not_syncing:Fatal_exception_in_interrupt | 4 | 5 |
| RIP:rb_next | 1 | |
| WARNING:inconsistent_lock_state | 0 | 8 |
| inconsistent{IN-HARDIRQ-W}->{HARDIRQ-ON-W}usage | 0 | 8 |
| RIP:smp_call_function_single | 0 | 5 |
| BUG:kernel_hang_in_boot_stage | 0 | 1 |
+-------------------------------------------------------+------------+------------+
[ 0.335612] WARNING: inconsistent lock state
[ 0.337205] 4.18.0-rc3-00016-gfbd5188 #1 Not tainted
[ 0.337206] --------------------------------
[ 0.337208] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
[ 0.337211] kworker/u4:0/7 [HC0[0]:SC0[0]:HE1:SE1] takes:
[ 0.337212] (____ptrval____) (&rq->lock){?.-.}, at: pick_next_task_fair+0x402/0xb20
[ 0.337226] {IN-HARDIRQ-W} state was registered at:
[ 0.338559] lock_acquire+0x59/0x7c
[ 0.338559] _raw_spin_lock+0x2a/0x40
[ 0.338559] scheduler_tick+0x4c/0xf0
[ 0.338559] update_process_times+0x42/0x50
[ 0.338559] tick_periodic+0x9d/0xa0
[ 0.338559] tick_handle_periodic+0x20/0x70
[ 0.338559] timer_interrupt+0x10/0x20
[ 0.338559] __handle_irq_event_percpu+0x35/0xf0
[ 0.338559] handle_irq_event_percpu+0x2d/0x70
[ 0.338559] handle_irq_event+0x34/0x60
[ 0.338559] handle_level_irq+0xb9/0x110
[ 0.338559] handle_irq+0x18/0x20
[ 0.338559] do_IRQ+0x7b/0x100
[ 0.338559] ret_from_intr+0x0/0x2f
[ 0.338559] native_restore_fl+0x6/0x10
[ 0.338559] _raw_spin_unlock_irqrestore+0x4c/0x60
[ 0.338559] __setup_irq+0x440/0x660
[ 0.338559] setup_irq+0x51/0x80
[ 0.338559] hpet_time_init+0x20/0x32
[ 0.338559] x86_late_time_init+0xa/0x17
[ 0.338559] start_kernel+0x458/0x4d8
[ 0.338559] x86_64_start_reservations+0x2a/0x2c
[ 0.338559] x86_64_start_kernel+0x77/0x7a
[ 0.338559] secondary_startup_64+0xa5/0xb0
[ 0.338559] irq event stamp: 39
[ 0.338559] hardirqs last enabled at (39): [<ffffffff81cf3a97>] _raw_spin_unlock_irq+0x27/0x40
[ 0.338559] hardirqs last disabled at (38): [<ffffffff81ced501>] __schedule+0x181/0x7b0
[ 0.338559] softirqs last enabled at (0): [<ffffffff810ba5c7>] copy_process+0x467/0x1ac0
[ 0.338559] softirqs last disabled at (0): [<0000000000000000>] (null)
[ 0.338559]
[ 0.338559] other info that might help us debug this:
[ 0.338559] Possible unsafe locking scenario:
[ 0.338559]
[ 0.338559] CPU0
[ 0.338559] ----
[ 0.338559] lock(&rq->lock);
[ 0.338559] <Interrupt>
[ 0.338559] lock(&rq->lock);
[ 0.338559]
[ 0.338559] *** DEADLOCK ***
[ 0.338559]
[ 0.338559] no locks held by kworker/u4:0/7.
[ 0.338559]
[ 0.338559] stack backtrace:
[ 0.338559] CPU: 1 PID: 7 Comm: kworker/u4:0 Not tainted 4.18.0-rc3-00016-gfbd5188 #1
[ 0.338559] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[ 0.338559] Call Trace:
[ 0.338559] dump_stack+0x8e/0xd5
[ 0.338559] print_usage_bug+0x26a/0x280
[ 0.338559] mark_lock+0x361/0x5a0
[ 0.338559] ? check_usage_forwards+0x120/0x120
[ 0.338559] __lock_acquire+0x35c/0x17c0
[ 0.338559] ? __lock_acquire+0x27d/0x17c0
[ 0.338559] ? _nohz_idle_balance+0xa4/0x4f0
[ 0.338559] ? sched_clock_local+0x17/0x90
[ 0.338559] ? trace_hardirqs_on_caller+0x178/0x1a0
[ 0.338559] lock_acquire+0x59/0x7c
[ 0.338559] ? lock_acquire+0x59/0x7c
[ 0.338559] ? pick_next_task_fair+0x402/0xb20
[ 0.338559] _raw_spin_lock+0x2a/0x40
[ 0.338559] ? pick_next_task_fair+0x402/0xb20
[ 0.338559] pick_next_task_fair+0x402/0xb20
[ 0.338559] ? dequeue_task_fair+0x4f5/0x680
[ 0.338559] ? __lock_is_held+0x4b/0x90
[ 0.338559] __schedule+0x3e3/0x7b0
[ 0.338559] schedule+0x1f/0x40
[ 0.338559] worker_thread+0x2f4/0x3b0
[ 0.338559] ? __kthread_parkme+0x4d/0x90
[ 0.338559] kthread+0x121/0x130
[ 0.338559] ? process_one_work+0x3d0/0x3d0
[ 0.338559] ? __kthread_bind_mask+0x60/0x60
[ 0.338559] ret_from_fork+0x24/0x30
To reproduce:
git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email
Thanks,
Xiaolong
Hi,
On 07/05/2018 10:02 AM, kernel test robot wrote:
>
> FYI, we noticed the following commit (built with gcc-7):
>
> commit: fbd51884933192c9cada60628892024495942482 ("[PATCH] sched/fair: Avoid divide by zero when rebalancing domains")
> url: https://github.com/0day-ci/linux/commits/Matt-Fleming/sched-fair-Avoid-divide-by-zero-when-rebalancing-domains/20180705-024633
>
>
> in testcase: trinity
> with following parameters:
>
> runtime: 300s
>
> test-description: Trinity is a linux system call fuzz tester.
> test-url: http://codemonkey.org.uk/projects/trinity/
>
>
> on test machine: qemu-system-x86_64 -enable-kvm -cpu host -smp 2 -m 1G
[...]
> [ 0.335612] WARNING: inconsistent lock state
I get the same on arm64 (juno r0) during boot consistently:
[ 1.458414] ================================
[ 1.462641] WARNING: inconsistent lock state
[ 1.466870] 4.18.0-rc3-00016-g1b05c8317958 #2 Not tainted
[ 1.472215] --------------------------------
[ 1.476440] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
[ 1.482389] rcu_preempt/10 [HC0[0]:SC0[0]:HE1:SE1] takes:
[ 1.487733] (____ptrval____) (&rq->lock){?.-.}, at: pick_next_task_fair+0x234/0x8e8
[ 1.495342] {IN-HARDIRQ-W} state was registered at:
[ 1.500174] lock_acquire+0xc8/0x290
[ 1.503802] _raw_spin_lock+0x44/0x58
[ 1.507517] scheduler_tick+0x5c/0x118
[ 1.511316] update_process_times+0x48/0x60
[ 1.515545] tick_periodic+0x50/0x108
[ 1.519256] tick_handle_periodic+0x38/0xa8
[ 1.523485] arch_timer_handler_phys+0x3c/0x50
[ 1.527973] handle_percpu_devid_irq+0xcc/0x4a0
[ 1.532543] generic_handle_irq+0x34/0x50
[ 1.536598] __handle_domain_irq+0x68/0xc0
[ 1.540738] gic_handle_irq+0x60/0xb8
[ 1.544448] el1_irq+0xb4/0x130
[ 1.547644] start_kernel+0x34c/0x490
[ 1.551353] irq event stamp: 1601
[ 1.554637] hardirqs last enabled at (1601): [<ffff000008d64764>] _raw_spin_unlock_irqrestore+0x74/0xa8
[ 1.564026] hardirqs last disabled at (1600): [<ffff000008d64514>] _raw_spin_lock_irqsave+0x2c/0x70
[ 1.572986] softirqs last enabled at (0): [<ffff0000080e9c20>] copy_process.isra.4.part.5+0x430/0x18f0
[ 1.582285] softirqs last disabled at (0): [<0000000000000000>] (null)
[ 1.589606]
[ 1.589606] other info that might help us debug this:
[ 1.596067] Possible unsafe locking scenario:
[ 1.596067]
[ 1.601926] CPU0
[ 1.604344] ----
[ 1.606761] lock(&rq->lock);
[ 1.609788] <Interrupt>
[ 1.612377] lock(&rq->lock);
[ 1.615576]
[ 1.615576] *** DEADLOCK ***
[ 1.615576]
[ 1.621438] no locks held by rcu_preempt/10.
[ 1.625661]
[ 1.625661] stack backtrace:
[ 1.629977] CPU: 2 PID: 10 Comm: rcu_preempt Not tainted 4.18.0-rc3-00016-g1b05c8317958 #2
[ 1.638160] Hardware name: ARM Juno development board (r0) (DT)
[ 1.644018] Call trace:
[ 1.646439] dump_backtrace+0x0/0x170
[ 1.650063] show_stack+0x24/0x30
[ 1.653345] dump_stack+0xac/0xe4
[ 1.656625] print_usage_bug+0x208/0x2a8
[ 1.660506] mark_lock+0x5c0/0x668
[ 1.663871] __lock_acquire+0x550/0x1990
[ 1.667752] lock_acquire+0xc8/0x290
[ 1.671290] _raw_spin_lock+0x44/0x58
[ 1.674914] pick_next_task_fair+0x234/0x8e8
[ 1.679142] __schedule+0x160/0xc40
[ 1.682595] schedule+0x50/0xc0
[ 1.685702] schedule_timeout+0x1f4/0x568
[ 1.689673] rcu_gp_kthread+0x4fc/0x860
[ 1.693470] kthread+0x100/0x130
[ 1.696664] ret_from_fork+0x10/0x1c
On 07/05/2018 10:58 AM, Dietmar Eggemann wrote:
> Hi,
>
> On 07/05/2018 10:02 AM, kernel test robot wrote:
>>
>> FYI, we noticed the following commit (built with gcc-7):
>>
>> commit: fbd51884933192c9cada60628892024495942482 ("[PATCH] sched/fair: Avoid divide by zero when rebalancing domains")
>> url: https://github.com/0day-ci/linux/commits/Matt-Fleming/sched-fair-Avoid-divide-by-zero-when-rebalancing-domains/20180705-024633
>>
>>
>> in testcase: trinity
>> with following parameters:
>>
>> runtime: 300s
>>
>> test-description: Trinity is a linux system call fuzz tester.
>> test-url: http://codemonkey.org.uk/projects/trinity/
>>
>>
>> on test machine: qemu-system-x86_64 -enable-kvm -cpu host -smp 2 -m 1G
>
> [...]
>
>> [ 0.335612] WARNING: inconsistent lock state
>
> I get the same on arm64 (juno r0) during boot consistently:
Moving the code from _nohz_idle_balance() to nohz_idle_balance() makes it disappear:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 02be51c9dcc1..070924f07c68 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9596,16 +9596,6 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
*/
smp_mb();
- /*
- * Ensure this_rq's clock and load are up-to-date before we
- * rebalance since it's possible that they haven't been
- * updated for multiple schedule periods, i.e. many seconds.
- */
- raw_spin_lock_irq(&this_rq->lock);
- update_rq_clock(this_rq);
- cpu_load_update_idle(this_rq);
- raw_spin_unlock_irq(&this_rq->lock);
-
for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
continue;
@@ -9701,6 +9691,16 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
if (!(flags & NOHZ_KICK_MASK))
return false;
+ /*
+ * Ensure this_rq's clock and load are up-to-date before we
+ * rebalance since it's possible that they haven't been
+ * updated for multiple schedule periods, i.e. many seconds.
+ */
+ raw_spin_lock_irq(&this_rq->lock);
+ update_rq_clock(this_rq);
+ cpu_load_update_idle(this_rq);
+ raw_spin_unlock_irq(&this_rq->lock);
+
_nohz_idle_balance(this_rq, flags, idle);
return true;
Hi,
On 04/07/18 15:24, Matt Fleming wrote:
> It's possible that the CPU doing nohz idle balance hasn't had its own
> load updated for many seconds. This can lead to huge deltas between
> rq->avg_stamp and rq->clock when rebalancing, and has been seen to
> cause the following crash:
>
> divide error: 0000 [#1] SMP
> Call Trace:
> [<ffffffff810bcba8>] update_sd_lb_stats+0xe8/0x560
> [<ffffffff810bd04d>] find_busiest_group+0x2d/0x4b0
> [<ffffffff810bd640>] load_balance+0x170/0x950
> [<ffffffff810be3ff>] rebalance_domains+0x13f/0x290
> [<ffffffff810852bc>] __do_softirq+0xec/0x300
> [<ffffffff8108578a>] irq_exit+0xfa/0x110
> [<ffffffff816167d9>] reschedule_interrupt+0xc9/0xd0
>
Do you have some sort of reproducer for that crash? If not I guess I can cook
something up with a quiet userspace & rt-app, though I've never seen that one
on arm64.
> Make sure we update the rq clock and load before balancing.
>
> Cc: Ingo Molnar <[email protected]>
> Cc: Mike Galbraith <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Signed-off-by: Matt Fleming <[email protected]>
> ---
> kernel/sched/fair.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2f0a0be4d344..2c81662c858a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9597,6 +9597,16 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
> */
> smp_mb();
>
> + /*
> + * Ensure this_rq's clock and load are up-to-date before we
> + * rebalance since it's possible that they haven't been
> + * updated for multiple schedule periods, i.e. many seconds.
> + */
> + raw_spin_lock_irq(&this_rq->lock);
> + update_rq_clock(this_rq);
> + cpu_load_update_idle(this_rq);
> + raw_spin_unlock_irq(&this_rq->lock);
> +
I'm failing to understand why the updates further down below are seemingly
not enough. After we've potentially done
update_rq_clock(rq);
cpu_load_update_idle(rq);
for all nohz cpus != this_cpu, we still end up doing:
if (idle != CPU_NEWLY_IDLE) {
update_blocked_averages(this_cpu);
has_blocked_load |= this_rq->has_blocked_load;
}
which should properly update this_rq's clock and load before we attempt to do
any balancing on it.
> for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
> if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
> continue;
>
On Thu, 05 Jul, at 11:52:21AM, Dietmar Eggemann wrote:
>
> Moving the code from _nohz_idle_balance to nohz_idle_balance let it disappear:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 02be51c9dcc1..070924f07c68 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9596,16 +9596,6 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
> */
> smp_mb();
>
> - /*
> - * Ensure this_rq's clock and load are up-to-date before we
> - * rebalance since it's possible that they haven't been
> - * updated for multiple schedule periods, i.e. many seconds.
> - */
> - raw_spin_lock_irq(&this_rq->lock);
> - update_rq_clock(this_rq);
> - cpu_load_update_idle(this_rq);
> - raw_spin_unlock_irq(&this_rq->lock);
> -
> for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
> if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
> continue;
> @@ -9701,6 +9691,16 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
> if (!(flags & NOHZ_KICK_MASK))
> return false;
>
> + /*
> + * Ensure this_rq's clock and load are up-to-date before we
> + * rebalance since it's possible that they haven't been
> + * updated for multiple schedule periods, i.e. many seconds.
> + */
> + raw_spin_lock_irq(&this_rq->lock);
> + update_rq_clock(this_rq);
> + cpu_load_update_idle(this_rq);
> + raw_spin_unlock_irq(&this_rq->lock);
> +
> _nohz_idle_balance(this_rq, flags, idle);
>
> return true;
>
Hmm.. it still looks to me like we should be saving and restoring IRQs
since this can be called from IRQ context, no?
The patch was a forward-port from one of our SLE kernels, and I messed
up the IRQ flag balancing for the v4.18-rc3 code :-(
On Thu, 05 Jul, at 11:10:42AM, Valentin Schneider wrote:
> Hi,
>
> On 04/07/18 15:24, Matt Fleming wrote:
> > It's possible that the CPU doing nohz idle balance hasn't had its own
> > load updated for many seconds. This can lead to huge deltas between
> > rq->avg_stamp and rq->clock when rebalancing, and has been seen to
> > cause the following crash:
> >
> > divide error: 0000 [#1] SMP
> > Call Trace:
> > [<ffffffff810bcba8>] update_sd_lb_stats+0xe8/0x560
> > [<ffffffff810bd04d>] find_busiest_group+0x2d/0x4b0
> > [<ffffffff810bd640>] load_balance+0x170/0x950
> > [<ffffffff810be3ff>] rebalance_domains+0x13f/0x290
> > [<ffffffff810852bc>] __do_softirq+0xec/0x300
> > [<ffffffff8108578a>] irq_exit+0xfa/0x110
> > [<ffffffff816167d9>] reschedule_interrupt+0xc9/0xd0
> >
>
> Do you have some sort of reproducer for that crash? If not I guess I can cook
> something up with a quiet userspace & rt-app, though I've never seen that one
> on arm64.
Unfortunately no, I don't have a reproduction case. Would love to have
one to verify the patch though.
> > Make sure we update the rq clock and load before balancing.
> >
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Mike Galbraith <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Signed-off-by: Matt Fleming <[email protected]>
> > ---
> > kernel/sched/fair.c | 10 ++++++++++
> > 1 file changed, 10 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 2f0a0be4d344..2c81662c858a 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9597,6 +9597,16 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
> > */
> > smp_mb();
> >
> > + /*
> > + * Ensure this_rq's clock and load are up-to-date before we
> > + * rebalance since it's possible that they haven't been
> > + * updated for multiple schedule periods, i.e. many seconds.
> > + */
> > + raw_spin_lock_irq(&this_rq->lock);
> > + update_rq_clock(this_rq);
> > + cpu_load_update_idle(this_rq);
> > + raw_spin_unlock_irq(&this_rq->lock);
> > +
>
> I'm failing to understand why the updates further down below are seemingly
> not enough. After we've potentially done
>
> update_rq_clock(rq);
> cpu_load_update_idle(rq);
>
> for all nohz cpus != this_cpu, we still end up doing:
>
> if (idle != CPU_NEWLY_IDLE) {
> update_blocked_averages(this_cpu);
> has_blocked_load |= this_rq->has_blocked_load;
> }
>
> which should properly update this_rq's clock and load before we attempt to do
> any balancing on it.
But cpu_load_update_idle() and update_blocked_averages() are not the same
thing.
On Thu, 05 Jul, at 02:24:58PM, Matt Fleming wrote:
>
> Hmm.. it still looks to me like we should be saving and restoring IRQs
> since this can be called from IRQ context, no?
>
> The patch was a forward-port from one of our SLE kernels, and I messed
> up the IRQ flag balancing for the v4.18-rc3 code :-(
Something like this?
---->8----
From 9b152d8dadec04ac631300d86a92552e57e81db5 Mon Sep 17 00:00:00 2001
From: Matt Fleming <[email protected]>
Date: Wed, 4 Jul 2018 14:22:51 +0100
Subject: [PATCH v2] sched/fair: Avoid divide by zero when rebalancing domains
It's possible that the CPU doing nohz idle balance hasn't had its own
load updated for many seconds. This can lead to huge deltas between
rq->avg_stamp and rq->clock when rebalancing, and has been seen to
cause the following crash:
divide error: 0000 [#1] SMP
Call Trace:
[<ffffffff810bcba8>] update_sd_lb_stats+0xe8/0x560
[<ffffffff810bd04d>] find_busiest_group+0x2d/0x4b0
[<ffffffff810bd640>] load_balance+0x170/0x950
[<ffffffff810be3ff>] rebalance_domains+0x13f/0x290
[<ffffffff810852bc>] __do_softirq+0xec/0x300
[<ffffffff8108578a>] irq_exit+0xfa/0x110
[<ffffffff816167d9>] reschedule_interrupt+0xc9/0xd0
Make sure we update the rq clock and load before balancing.
Cc: Ingo Molnar <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Signed-off-by: Matt Fleming <[email protected]>
---
kernel/sched/fair.c | 11 +++++++++++
1 file changed, 11 insertions(+)
Changes in v2: Balance IRQ flags properly.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2f0a0be4d344..150b92c7c9d1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9676,6 +9676,7 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
{
int this_cpu = this_rq->cpu;
unsigned int flags;
+ struct rq_flags rf;
if (!(atomic_read(nohz_flags(this_cpu)) & NOHZ_KICK_MASK))
return false;
@@ -9692,6 +9693,16 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
if (!(flags & NOHZ_KICK_MASK))
return false;
+ /*
+ * Ensure this_rq's clock and load are up-to-date before we
+ * rebalance since it's possible that they haven't been
+ * updated for multiple schedule periods, i.e. many seconds.
+ */
+ rq_lock_irqsave(this_rq, &rf);
+ update_rq_clock(this_rq);
+ cpu_load_update_idle(this_rq);
+ rq_unlock_irqrestore(this_rq, &rf);
+
_nohz_idle_balance(this_rq, flags, idle);
return true;
--
2.13.6
On 07/05/2018 03:24 PM, Matt Fleming wrote:
> On Thu, 05 Jul, at 11:52:21AM, Dietmar Eggemann wrote:
>>
>> Moving the code from _nohz_idle_balance to nohz_idle_balance let it disappear:
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 02be51c9dcc1..070924f07c68 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -9596,16 +9596,6 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
>> */
>> smp_mb();
>>
>> - /*
>> - * Ensure this_rq's clock and load are up-to-date before we
>> - * rebalance since it's possible that they haven't been
>> - * updated for multiple schedule periods, i.e. many seconds.
>> - */
>> - raw_spin_lock_irq(&this_rq->lock);
>> - update_rq_clock(this_rq);
>> - cpu_load_update_idle(this_rq);
>> - raw_spin_unlock_irq(&this_rq->lock);
>> -
>> for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
>> if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
>> continue;
>> @@ -9701,6 +9691,16 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>> if (!(flags & NOHZ_KICK_MASK))
>> return false;
>>
>> + /*
>> + * Ensure this_rq's clock and load are up-to-date before we
>> + * rebalance since it's possible that they haven't been
>> + * updated for multiple schedule periods, i.e. many seconds.
>> + */
>> + raw_spin_lock_irq(&this_rq->lock);
>> + update_rq_clock(this_rq);
>> + cpu_load_update_idle(this_rq);
>> + raw_spin_unlock_irq(&this_rq->lock);
>> +
>> _nohz_idle_balance(this_rq, flags, idle);
>>
>> return true;
>>
>
> Hmm.. it still looks to me like we should be saving and restoring IRQs
> since this can be called from IRQ context, no?
You mean in nohz_idle_balance()? Yes.
I just wanted to hint that since you need this cpu_load_update_idle() in
the nohz idle balance case and not in idle balance, you could just
update it in nohz_idle_balance() rather than in _nohz_idle_balance()
(which is also called from nohz_newidle_balance()), thereby avoiding
this inconsistent lock state warning.
[...]
On 05/07/18 14:27, Matt Fleming wrote:
> On Thu, 05 Jul, at 11:10:42AM, Valentin Schneider wrote:
>> Hi,
>>
>> On 04/07/18 15:24, Matt Fleming wrote:
>>> It's possible that the CPU doing nohz idle balance hasn't had its own
>>> load updated for many seconds. This can lead to huge deltas between
>>> rq->avg_stamp and rq->clock when rebalancing, and has been seen to
>>> cause the following crash:
>>>
>>> divide error: 0000 [#1] SMP
>>> Call Trace:
>>> [<ffffffff810bcba8>] update_sd_lb_stats+0xe8/0x560
My confusion comes from not seeing where that crash happens. Would you mind
sharing the associated line number? I can feel the "how did I not see this"
from there but it can't be helped :(
>>> [<ffffffff810bd04d>] find_busiest_group+0x2d/0x4b0
>>> [<ffffffff810bd640>] load_balance+0x170/0x950
>>> [<ffffffff810be3ff>] rebalance_domains+0x13f/0x290
>>> [<ffffffff810852bc>] __do_softirq+0xec/0x300
>>> [<ffffffff8108578a>] irq_exit+0xfa/0x110
>>> [<ffffffff816167d9>] reschedule_interrupt+0xc9/0xd0
>>>
>>
>> Do you have some sort of reproducer for that crash? If not I guess I can cook
>> something up with a quiet userspace & rt-app, though I've never seen that one
>> on arm64.
>
> Unfortunately no, I don't have a reproduction case. Would love to have
> one to verify the patch though.
>
>>> Make sure we update the rq clock and load before balancing.
>>>
>>> Cc: Ingo Molnar <[email protected]>
>>> Cc: Mike Galbraith <[email protected]>
>>> Cc: Peter Zijlstra <[email protected]>
>>> Signed-off-by: Matt Fleming <[email protected]>
>>> ---
>>> kernel/sched/fair.c | 10 ++++++++++
>>> 1 file changed, 10 insertions(+)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 2f0a0be4d344..2c81662c858a 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -9597,6 +9597,16 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
>>> */
>>> smp_mb();
>>>
>>> + /*
>>> + * Ensure this_rq's clock and load are up-to-date before we
>>> + * rebalance since it's possible that they haven't been
>>> + * updated for multiple schedule periods, i.e. many seconds.
>>> + */
>>> + raw_spin_lock_irq(&this_rq->lock);
>>> + update_rq_clock(this_rq);
>>> + cpu_load_update_idle(this_rq);
>>> + raw_spin_unlock_irq(&this_rq->lock);
>>> +
>>
>> I'm failing to understand why the updates further down below are seemingly
>> not enough. After we've potentially done
>>
>> update_rq_clock(rq);
>> cpu_load_update_idle(rq);
>>
>> for all nohz cpus != this_cpu, we still end up doing:
>>
>> if (idle != CPU_NEWLY_IDLE) {
>> update_blocked_averages(this_cpu);
>> has_blocked_load |= this_rq->has_blocked_load;
>> }
>>
>> which should properly update this_rq's clock and load before we attempt to do
>> any balancing on it.
>
> But cpu_load_update_idle() and update_blocked_averages() are not the same
> thing.
>
Right, we don't do any rq->cpu_load[] update for this_rq in the current nohz
code (i.e. by using update_blocked_averages()), which I think we do want to
do. I'm just miserably failing to find how not doing this leads to a div by 0.
On Thu, 05 Jul, at 05:54:02PM, Valentin Schneider wrote:
> On 05/07/18 14:27, Matt Fleming wrote:
> > On Thu, 05 Jul, at 11:10:42AM, Valentin Schneider wrote:
> >> Hi,
> >>
> >> On 04/07/18 15:24, Matt Fleming wrote:
> >>> It's possible that the CPU doing nohz idle balance hasn't had its own
> >>> load updated for many seconds. This can lead to huge deltas between
> >>> rq->avg_stamp and rq->clock when rebalancing, and has been seen to
> >>> cause the following crash:
> >>>
> >>> divide error: 0000 [#1] SMP
> >>> Call Trace:
> >>> [<ffffffff810bcba8>] update_sd_lb_stats+0xe8/0x560
>
> My confusion comes from not seeing where that crash happens. Would you mind
> sharing the associated line number? I can feel the "how did I not see this"
> from there but it can't be helped :(
The divide by zero comes from scale_rt_capacity() where 'total' is a
u64 but gets truncated when passed to div_u64() since the divisor
parameter is u32.
Sure, you could use div64_u64() instead, but the real issue is that
the load hasn't been updated for a very long time and that we're
trying to balance the domains with stale data.
Hi,
On 17/08/18 11:27, Matt Fleming wrote:
> On Thu, 05 Jul, at 05:54:02PM, Valentin Schneider wrote:
>> On 05/07/18 14:27, Matt Fleming wrote:
>>> On Thu, 05 Jul, at 11:10:42AM, Valentin Schneider wrote:
>>>> Hi,
>>>>
>>>> On 04/07/18 15:24, Matt Fleming wrote:
>>>>> It's possible that the CPU doing nohz idle balance hasn't had its own
>>>>> load updated for many seconds. This can lead to huge deltas between
>>>>> rq->avg_stamp and rq->clock when rebalancing, and has been seen to
>>>>> cause the following crash:
>>>>>
>>>>> divide error: 0000 [#1] SMP
>>>>> Call Trace:
>>>>> [<ffffffff810bcba8>] update_sd_lb_stats+0xe8/0x560
>>
>> My confusion comes from not seeing where that crash happens. Would you mind
>> sharing the associated line number? I can feel the "how did I not see this"
>> from there but it can't be helped :(
>
> The divide by zero comes from scale_rt_capacity() where 'total' is a
> u64 but gets truncated when passed to div_u64() since the divisor
> parameter is u32.
>
Ah, nasty one. Interestingly enough that bit has been changed quite recently,
so I don't think you can get a div by 0 in there anymore - see
523e979d3164 ("sched/core: Use PELT for scale_rt_capacity()") and subsequent
cleanups.
> Sure, you could use div64_u64() instead, but the real issue is that
> the load hasn't been updated for a very long time and that we're
> trying to balance the domains with stale data.
>
Yeah I agree with that. However, the problem is with cpu_load - blocked load
on nohz CPUs will be periodically updated until entirely decayed. And if we
end up getting rid of cpu_load (depends on how [1] goes), then there's
nothing left to do. But we're not there yet...
[1]: https://lore.kernel.org/lkml/[email protected]/