2018-04-17 14:23:20

by Matt Fleming

Subject: cpu stopper threads and load balancing leads to deadlock

Hi guys,

We've seen a bug in one of our SLE kernels where the cpu stopper
thread ("migration/15") is entering idle balance. This then triggers
active load balance.

At the same time, a task on another CPU triggers a page fault and NUMA
balancing kicks in to try and migrate the task closer to the NUMA node
for that page (we're inside stop_two_cpus()). This faulting task is
spinning in try_to_wake_up() (inside smp_cond_load_acquire(&p->on_cpu,
!VAL)), waiting for "migration/15" to context switch.

Unfortunately, because "migration/15" is doing active load balance
it's spinning waiting for the NUMA-page-faulting CPU's stopper lock,
which is already held (since it's inside stop_two_cpus()).

Deadlock ensues.

This seems like a situation that should be prohibited, but I cannot
find any code to prevent it. Is it OK for stopper threads to load
balance? Is there something that should prevent this situation from
happening?


2018-04-18 05:50:20

by Mike Galbraith

Subject: Re: cpu stopper threads and load balancing leads to deadlock

On Tue, 2018-04-17 at 15:21 +0100, Matt Fleming wrote:
> Hi guys,
>
> We've seen a bug in one of our SLE kernels where the cpu stopper
> thread ("migration/15") is entering idle balance. This then triggers
> active load balance.
>
> At the same time, a task on another CPU triggers a page fault and NUMA
> balancing kicks in to try and migrate the task closer to the NUMA node
> for that page (we're inside stop_two_cpus()). This faulting task is
> spinning in try_to_wake_up() (inside smp_cond_load_acquire(&p->on_cpu,
> !VAL)), waiting for "migration/15" to context switch.
>
> Unfortunately, because "migration/15" is doing active load balance
> it's spinning waiting for the NUMA-page-faulting CPU's stopper lock,
> which is already held (since it's inside stop_two_cpus()).
>
> Deadlock ensues.
>
> This seems like a situation that should be prohibited, but I cannot
> find any code to prevent it. Is it OK for stopper threads to load
> balance? Is there something that should prevent this situation from
> happening?

I don't see anything to stop the deadlock either; I'd exclude the stop
class from playing idle balancer entirely, though I suppose you could
check for the caller being stop class in need_active_balance(). I don't
think any RT class playing idle balancer is particularly wonderful.

-Mike

2018-04-19 05:40:40

by Mike Galbraith

Subject: Re: cpu stopper threads and load balancing leads to deadlock

On Wed, 2018-04-18 at 07:47 +0200, Mike Galbraith wrote:
> On Tue, 2018-04-17 at 15:21 +0100, Matt Fleming wrote:
> > Hi guys,
> >
> > We've seen a bug in one of our SLE kernels where the cpu stopper
> > thread ("migration/15") is entering idle balance. This then triggers
> > active load balance.
> >
> > At the same time, a task on another CPU triggers a page fault and NUMA
> > balancing kicks in to try and migrate the task closer to the NUMA node
> > for that page (we're inside stop_two_cpus()). This faulting task is
> > spinning in try_to_wake_up() (inside smp_cond_load_acquire(&p->on_cpu,
> > !VAL)), waiting for "migration/15" to context switch.
> >
> > Unfortunately, because "migration/15" is doing active load balance
> > it's spinning waiting for the NUMA-page-faulting CPU's stopper lock,
> > which is already held (since it's inside stop_two_cpus()).
> >
> > Deadlock ensues.
> >
> > This seems like a situation that should be prohibited, but I cannot
> > find any code to prevent it. Is it OK for stopper threads to load
> > balance? Is there something that should prevent this situation from
> > happening?
>
> I don't see anything to stop the deadlock either; I'd exclude the stop
> class from playing idle balancer entirely...

Bah, insufficient: __do_softirq() -> rebalance_domains() still bites.

2018-04-20 09:51:43

by Peter Zijlstra

Subject: Re: cpu stopper threads and load balancing leads to deadlock

On Tue, Apr 17, 2018 at 03:21:19PM +0100, Matt Fleming wrote:
> Hi guys,
>
> We've seen a bug in one of our SLE kernels where the cpu stopper
> thread ("migration/15") is entering idle balance. This then triggers
> active load balance.
>
> At the same time, a task on another CPU triggers a page fault and NUMA
> balancing kicks in to try and migrate the task closer to the NUMA node
> for that page (we're inside stop_two_cpus()). This faulting task is
> spinning in try_to_wake_up() (inside smp_cond_load_acquire(&p->on_cpu,
> !VAL)), waiting for "migration/15" to context switch.
>
> Unfortunately, because "migration/15" is doing active load balance
> it's spinning waiting for the NUMA-page-faulting CPU's stopper lock,
> which is already held (since it's inside stop_two_cpus()).
>
> Deadlock ensues.


So if I read that right, something like the following happens:

CPU0                                    CPU1

schedule(.prev=migrate/0)               <fault>
  pick_next_task                          ...
    idle_balance                            migrate_swap()
      active_balance                          stop_two_cpus()
                                                spin_lock(stopper0->lock)
                                                spin_lock(stopper1->lock)
                                                ttwu(migrate/0)
                                                  smp_cond_load_acquire() -- waits for schedule()
        stop_one_cpu(1)
          spin_lock(stopper1->lock) -- waits for stopper lock


Fix _this_ deadlock by taking the wakeups out from under stopper->lock.
I'm not entirely sure there aren't more dragons here, but this particular
one seems fixable by doing that.

Is there any way you can reproduce/test this?

Maybe-signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
kernel/stop_machine.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index b7591261652d..64c0291b579c 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -21,6 +21,7 @@
#include <linux/smpboot.h>
#include <linux/atomic.h>
#include <linux/nmi.h>
+#include <linux/sched/wake_q.h>

/*
* Structure to determine completion condition and record errors. May
@@ -65,27 +66,31 @@ static void cpu_stop_signal_done(struct cpu_stop_done *done)
}

static void __cpu_stop_queue_work(struct cpu_stopper *stopper,
- struct cpu_stop_work *work)
+ struct cpu_stop_work *work,
+ struct wake_q_head *wakeq)
{
list_add_tail(&work->list, &stopper->works);
- wake_up_process(stopper->thread);
+ wake_q_add(wakeq, stopper->thread);
}

/* queue @work to @stopper. if offline, @work is completed immediately */
static bool cpu_stop_queue_work(unsigned int cpu, struct cpu_stop_work *work)
{
struct cpu_stopper *stopper = &per_cpu(cpu_stopper, cpu);
+ DEFINE_WAKE_Q(wakeq);
unsigned long flags;
bool enabled;

spin_lock_irqsave(&stopper->lock, flags);
enabled = stopper->enabled;
if (enabled)
- __cpu_stop_queue_work(stopper, work);
+ __cpu_stop_queue_work(stopper, work, &wakeq);
else if (work->done)
cpu_stop_signal_done(work->done);
spin_unlock_irqrestore(&stopper->lock, flags);

+ wake_up_q(&wakeq);
+
return enabled;
}

@@ -229,6 +234,7 @@ static int cpu_stop_queue_two_works(int cpu1, struct cpu_stop_work *work1,
{
struct cpu_stopper *stopper1 = per_cpu_ptr(&cpu_stopper, cpu1);
struct cpu_stopper *stopper2 = per_cpu_ptr(&cpu_stopper, cpu2);
+ DEFINE_WAKE_Q(wakeq);
int err;
retry:
spin_lock_irq(&stopper1->lock);
@@ -252,8 +258,8 @@ static int cpu_stop_queue_two_works(int cpu1, struct cpu_stop_work *work1,
goto unlock;

err = 0;
- __cpu_stop_queue_work(stopper1, work1);
- __cpu_stop_queue_work(stopper2, work2);
+ __cpu_stop_queue_work(stopper1, work1, &wakeq);
+ __cpu_stop_queue_work(stopper2, work2, &wakeq);
unlock:
spin_unlock(&stopper2->lock);
spin_unlock_irq(&stopper1->lock);
@@ -263,6 +269,9 @@ static int cpu_stop_queue_two_works(int cpu1, struct cpu_stop_work *work1,
cpu_relax();
goto retry;
}
+
+ wake_up_q(&wakeq);
+
return err;
}
/**

2018-04-24 13:35:40

by Matt Fleming

Subject: Re: cpu stopper threads and load balancing leads to deadlock

On Fri, 20 Apr, at 11:50:05AM, Peter Zijlstra wrote:
> On Tue, Apr 17, 2018 at 03:21:19PM +0100, Matt Fleming wrote:
> > Hi guys,
> >
> > We've seen a bug in one of our SLE kernels where the cpu stopper
> > thread ("migration/15") is entering idle balance. This then triggers
> > active load balance.
> >
> > At the same time, a task on another CPU triggers a page fault and NUMA
> > balancing kicks in to try and migrate the task closer to the NUMA node
> > for that page (we're inside stop_two_cpus()). This faulting task is
> > spinning in try_to_wake_up() (inside smp_cond_load_acquire(&p->on_cpu,
> > !VAL)), waiting for "migration/15" to context switch.
> >
> > Unfortunately, because "migration/15" is doing active load balance
> > it's spinning waiting for the NUMA-page-faulting CPU's stopper lock,
> > which is already held (since it's inside stop_two_cpus()).
> >
> > Deadlock ensues.
>
>
> So if I read that right, something like the following happens:
>
> CPU0                                    CPU1
>
> schedule(.prev=migrate/0)               <fault>
>   pick_next_task                          ...
>     idle_balance                            migrate_swap()
>       active_balance                          stop_two_cpus()
>                                                 spin_lock(stopper0->lock)
>                                                 spin_lock(stopper1->lock)
>                                                 ttwu(migrate/0)
>                                                   smp_cond_load_acquire() -- waits for schedule()
>         stop_one_cpu(1)
>           spin_lock(stopper1->lock) -- waits for stopper lock

Yep, that's exactly right.

> Fix _this_ deadlock by taking the wakeups out from under stopper->lock.
> I'm not entirely sure there aren't more dragons here, but this particular
> one seems fixable by doing that.
>
> Is there any way you can reproduce/test this?

I'm afraid I don't have any way to test this, but I can ask the
customer that reported it if they can.

Either way, this fix looks good to me.

Subject: [tip:sched/urgent] stop_machine, sched: Fix migrate_swap() vs. active_balance() deadlock

Commit-ID: 0b26351b910fb8fe6a056f8a1bbccabe50c0e19f
Gitweb: https://git.kernel.org/tip/0b26351b910fb8fe6a056f8a1bbccabe50c0e19f
Author: Peter Zijlstra <[email protected]>
AuthorDate: Fri, 20 Apr 2018 11:50:05 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 3 May 2018 07:38:03 +0200

stop_machine, sched: Fix migrate_swap() vs. active_balance() deadlock

Matt reported the following deadlock:

CPU0                                    CPU1

schedule(.prev=migrate/0)               <fault>
  pick_next_task()                        ...
    idle_balance()                          migrate_swap()
      active_balance()                        stop_two_cpus()
                                                spin_lock(stopper0->lock)
                                                spin_lock(stopper1->lock)
                                                ttwu(migrate/0)
                                                  smp_cond_load_acquire() -- waits for schedule()
        stop_one_cpu(1)
          spin_lock(stopper1->lock) -- waits for stopper lock

Fix this deadlock by taking the wakeups out from under stopper->lock.
This allows the active_balance() to queue the stop work and finish the
context switch, which in turn allows the wakeup from migrate_swap() to
observe the context switch and complete the wakeup.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reported-by: Matt Fleming <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Matt Fleming <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/stop_machine.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index b7591261652d..64c0291b579c 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -21,6 +21,7 @@
#include <linux/smpboot.h>
#include <linux/atomic.h>
#include <linux/nmi.h>
+#include <linux/sched/wake_q.h>

/*
* Structure to determine completion condition and record errors. May
@@ -65,27 +66,31 @@ static void cpu_stop_signal_done(struct cpu_stop_done *done)
}

static void __cpu_stop_queue_work(struct cpu_stopper *stopper,
- struct cpu_stop_work *work)
+ struct cpu_stop_work *work,
+ struct wake_q_head *wakeq)
{
list_add_tail(&work->list, &stopper->works);
- wake_up_process(stopper->thread);
+ wake_q_add(wakeq, stopper->thread);
}

/* queue @work to @stopper. if offline, @work is completed immediately */
static bool cpu_stop_queue_work(unsigned int cpu, struct cpu_stop_work *work)
{
struct cpu_stopper *stopper = &per_cpu(cpu_stopper, cpu);
+ DEFINE_WAKE_Q(wakeq);
unsigned long flags;
bool enabled;

spin_lock_irqsave(&stopper->lock, flags);
enabled = stopper->enabled;
if (enabled)
- __cpu_stop_queue_work(stopper, work);
+ __cpu_stop_queue_work(stopper, work, &wakeq);
else if (work->done)
cpu_stop_signal_done(work->done);
spin_unlock_irqrestore(&stopper->lock, flags);

+ wake_up_q(&wakeq);
+
return enabled;
}

@@ -229,6 +234,7 @@ static int cpu_stop_queue_two_works(int cpu1, struct cpu_stop_work *work1,
{
struct cpu_stopper *stopper1 = per_cpu_ptr(&cpu_stopper, cpu1);
struct cpu_stopper *stopper2 = per_cpu_ptr(&cpu_stopper, cpu2);
+ DEFINE_WAKE_Q(wakeq);
int err;
retry:
spin_lock_irq(&stopper1->lock);
@@ -252,8 +258,8 @@ retry:
goto unlock;

err = 0;
- __cpu_stop_queue_work(stopper1, work1);
- __cpu_stop_queue_work(stopper2, work2);
+ __cpu_stop_queue_work(stopper1, work1, &wakeq);
+ __cpu_stop_queue_work(stopper2, work2, &wakeq);
unlock:
spin_unlock(&stopper2->lock);
spin_unlock_irq(&stopper1->lock);
@@ -263,6 +269,9 @@ unlock:
cpu_relax();
goto retry;
}
+
+ wake_up_q(&wakeq);
+
return err;
}
/**

2018-05-03 12:13:18

by Mike Galbraith

Subject: Re: cpu stopper threads and load balancing leads to deadlock

On Tue, 2018-04-24 at 14:33 +0100, Matt Fleming wrote:
> On Fri, 20 Apr, at 11:50:05AM, Peter Zijlstra wrote:
> > On Tue, Apr 17, 2018 at 03:21:19PM +0100, Matt Fleming wrote:
> > > Hi guys,
> > >
> > > We've seen a bug in one of our SLE kernels where the cpu stopper
> > > thread ("migration/15") is entering idle balance. This then triggers
> > > active load balance.
> > >
> > > At the same time, a task on another CPU triggers a page fault and NUMA
> > > balancing kicks in to try and migrate the task closer to the NUMA node
> > > for that page (we're inside stop_two_cpus()). This faulting task is
> > > spinning in try_to_wake_up() (inside smp_cond_load_acquire(&p->on_cpu,
> > > !VAL)), waiting for "migration/15" to context switch.
> > >
> > > Unfortunately, because "migration/15" is doing active load balance
> > > it's spinning waiting for the NUMA-page-faulting CPU's stopper lock,
> > > which is already held (since it's inside stop_two_cpus()).
> > >
> > > Deadlock ensues.
> >
> >
> > So if I read that right, something like the following happens:
> >
> > CPU0                                    CPU1
> >
> > schedule(.prev=migrate/0)               <fault>
> >   pick_next_task                          ...
> >     idle_balance                            migrate_swap()
> >       active_balance                          stop_two_cpus()
> >                                                 spin_lock(stopper0->lock)
> >                                                 spin_lock(stopper1->lock)
> >                                                 ttwu(migrate/0)
> >                                                   smp_cond_load_acquire() -- waits for schedule()
> >         stop_one_cpu(1)
> >           spin_lock(stopper1->lock) -- waits for stopper lock
>
> Yep, that's exactly right.
>
> > Fix _this_ deadlock by taking the wakeups out from under stopper->lock.
> > I'm not entirely sure there aren't more dragons here, but this particular
> > one seems fixable by doing that.
> >
> > Is there any way you can reproduce/test this?
>
> I'm afraid I don't have any way to test this, but I can ask the
> customer that reported it if they can.
>
> Either way, this fix looks good to me.

Seems there's another problem with hotplug. Virgin tip...

[ 122.147601] smpboot: CPU 4 is now offline
[ 122.189701] smpboot: CPU 5 is now offline
[ 122.225612] smpboot: CPU 6 is now offline
[ 122.257760] smpboot: CPU 7 is now offline
[ 124.172418] smpboot: CPU 2 is now offline
[ 124.209121] smpboot: CPU 3 is now offline
[ 124.215810] smpboot: Booting Node 0 Processor 2 APIC 0x4

[ 124.216939] =============================
[ 124.216939] WARNING: suspicious RCU usage
[ 124.216941] 4.17.0.g66d489e-tip-default #82 Tainted: G E
[ 124.216941] -----------------------------
[ 124.216943] kernel/sched/core.c:1614 suspicious rcu_dereference_check() usage!
[ 124.216944]
other info that might help us debug this:

[ 124.216945]
RCU used illegally from offline CPU!
rcu_scheduler_active = 2, debug_locks = 0
[ 124.216946] 4 locks held by swapper/2/0:
[ 124.216947] #0: 000000001f9fa447 (stop_cpus_mutex){+.+.}, at: stop_machine_from_inactive_cpu+0x86/0x130
[ 124.216953] #1: 000000004cb07b3b (&stopper->lock){..-.}, at: cpu_stop_queue_work+0x2d/0x80
[ 124.216958] #2: 00000000d3a46b90 (&p->pi_lock){-.-.}, at: try_to_wake_up+0x2d/0x5f0
[ 124.216964] #3: 00000000f360767b (rcu_read_lock){....}, at: rcu_read_lock+0x0/0x80
[ 124.216969]
stack backtrace:
[ 124.216971] CPU: 2 PID: 0 Comm: swapper/2 Kdump: loaded Tainted: G E 4.17.0.g66d489e-tip-default #82
[ 124.216972] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013
[ 124.216973] Call Trace:
[ 124.216977] dump_stack+0x78/0xb3
[ 124.216979] ttwu_stat+0x121/0x130
[ 124.216983] try_to_wake_up+0x2c2/0x5f0
[ 124.216988] ? cpu_stop_park+0x30/0x30
[ 124.216990] cpu_stop_queue_work+0x7c/0x80
[ 124.216993] queue_stop_cpus_work+0x61/0xb0
[ 124.216997] stop_machine_from_inactive_cpu+0xd3/0x130
[ 124.216999] ? mtrr_restore+0x80/0x80
[ 124.217005] mtrr_ap_init+0x62/0x70
[ 124.217008] identify_secondary_cpu+0x18/0x80
[ 124.217011] smp_store_cpu_info+0x44/0x50
[ 124.217014] start_secondary+0x9a/0x1e0
[ 124.217017] secondary_startup_64+0xa5/0xb0

[ 124.218433] ======================================================
[ 124.218433] WARNING: possible circular locking dependency detected
[ 124.218433] 4.17.0.g66d489e-tip-default #82 Tainted: G E
[ 124.218434] ------------------------------------------------------
[ 124.218434] swapper/2/0 is trying to acquire lock:
[ 124.218434] 000000006b311cf8 ((console_sem).lock){-...}, at: down_trylock+0xf/0x30

[ 124.218436] but task is already holding lock:
[ 124.218436] 00000000d3a46b90 (&p->pi_lock){-.-.}, at: try_to_wake_up+0x2d/0x5f0

[ 124.218438] which lock already depends on the new lock.


[ 124.218438] the existing dependency chain (in reverse order) is:

[ 124.218439] -> #1 (&p->pi_lock){-.-.}:
[ 124.218440] lock_acquire+0xbd/0x200
[ 124.218440] _raw_spin_lock_irqsave+0x4c/0x60
[ 124.218441] try_to_wake_up+0x2d/0x5f0
[ 124.218441] up+0x40/0x50
[ 124.218441] __up_console_sem+0x41/0x80
[ 124.218441] console_unlock+0x398/0x6b0
[ 124.218442] do_con_write.part.20+0x71b/0x9d0
[ 124.218442] con_write+0x55/0x60
[ 124.218442] do_output_char+0x181/0x200
[ 124.218442] n_tty_write+0x20a/0x460
[ 124.218443] tty_write+0x14a/0x2b0
[ 124.218443] do_iter_write+0x144/0x180
[ 124.218443] vfs_writev+0x71/0xd0
[ 124.218444] do_writev+0x51/0xd0
[ 124.218444] do_syscall_64+0x60/0x210
[ 124.218444] entry_SYSCALL_64_after_hwframe+0x49/0xbe

[ 124.218445] -> #0 ((console_sem).lock){-...}:
[ 124.218446] __lock_acquire+0x436/0x770
[ 124.218446] lock_acquire+0xbd/0x200
[ 124.218446] _raw_spin_lock_irqsave+0x4c/0x60
[ 124.218447] down_trylock+0xf/0x30
[ 124.218447] __down_trylock_console_sem+0x37/0xa0
[ 124.218447] console_trylock+0x13/0x60
[ 124.218447] vprintk_emit+0x237/0x460
[ 124.218448] printk+0x48/0x4a
[ 124.218448] lockdep_rcu_suspicious+0x26/0x100
[ 124.218448] ttwu_stat+0x121/0x130
[ 124.218449] try_to_wake_up+0x2c2/0x5f0
[ 124.218449] cpu_stop_queue_work+0x7c/0x80
[ 124.218449] queue_stop_cpus_work+0x61/0xb0
[ 124.218450] stop_machine_from_inactive_cpu+0xd3/0x130
[ 124.218450] mtrr_ap_init+0x62/0x70
[ 124.218450] identify_secondary_cpu+0x18/0x80
[ 124.218450] smp_store_cpu_info+0x44/0x50
[ 124.218451] start_secondary+0x9a/0x1e0
[ 124.218451] secondary_startup_64+0xa5/0xb0

[ 124.218451] other info that might help us debug this:

[ 124.218452] Possible unsafe locking scenario:

[ 124.218452]        CPU0                    CPU1
[ 124.218453]        ----                    ----
[ 124.218453]   lock(&p->pi_lock);
[ 124.218454]                                lock((console_sem).lock);
[ 124.218454]                                lock(&p->pi_lock);
[ 124.218455]   lock((console_sem).lock);

[ 124.218456] *** DEADLOCK ***

[ 124.218456] 4 locks held by swapper/2/0:
[ 124.218456] #0: 000000001f9fa447 (stop_cpus_mutex){+.+.}, at: stop_machine_from_inactive_cpu+0x86/0x130
[ 124.218458] #1: 000000004cb07b3b (&stopper->lock){..-.}, at: cpu_stop_queue_work+0x2d/0x80
[ 124.218459] #2: 00000000d3a46b90 (&p->pi_lock){-.-.}, at: try_to_wake_up+0x2d/0x5f0
[ 124.218460] #3: 00000000f360767b (rcu_read_lock){....}, at: rcu_read_lock+0x0/0x80

[ 124.218462] stack backtrace:
[ 124.218462] CPU: 2 PID: 0 Comm: swapper/2 Kdump: loaded Tainted: G E 4.17.0.g66d489e-tip-default #82
[ 124.218462] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013
[ 124.218463] Call Trace:
[ 124.218463] dump_stack+0x78/0xb3
[ 124.218463] print_circular_bug.isra.38+0x1f3/0x200
[ 124.218463] check_prev_add.constprop.46+0x717/0x730
[ 124.218464] validate_chain.isra.42+0x652/0x980
[ 124.218464] __lock_acquire+0x436/0x770
[ 124.218464] lock_acquire+0xbd/0x200
[ 124.218464] ? down_trylock+0xf/0x30
[ 124.218465] ? vprintk_emit+0x237/0x460
[ 124.218465] _raw_spin_lock_irqsave+0x4c/0x60
[ 124.218465] ? down_trylock+0xf/0x30
[ 124.218466] down_trylock+0xf/0x30
[ 124.218466] __down_trylock_console_sem+0x37/0xa0
[ 124.218466] console_trylock+0x13/0x60
[ 124.218466] vprintk_emit+0x237/0x460
[ 124.218467] printk+0x48/0x4a
[ 124.218467] lockdep_rcu_suspicious+0x26/0x100
[ 124.218467] ttwu_stat+0x121/0x130
[ 124.218467] try_to_wake_up+0x2c2/0x5f0
[ 124.218468] ? cpu_stop_park+0x30/0x30
[ 124.218468] cpu_stop_queue_work+0x7c/0x80
[ 124.218468] queue_stop_cpus_work+0x61/0xb0
[ 124.218468] stop_machine_from_inactive_cpu+0xd3/0x130
[ 124.218469] ? mtrr_restore+0x80/0x80
[ 124.218469] mtrr_ap_init+0x62/0x70
[ 124.218469] identify_secondary_cpu+0x18/0x80
[ 124.218469] smp_store_cpu_info+0x44/0x50
[ 124.218470] start_secondary+0x9a/0x1e0
[ 124.218470] secondary_startup_64+0xa5/0xb0
[ 124.478753] smpboot: Booting Node 0 Processor 3 APIC 0x6

2018-05-03 12:28:44

by Peter Zijlstra

Subject: Re: cpu stopper threads and load balancing leads to deadlock

On Thu, May 03, 2018 at 02:12:22PM +0200, Mike Galbraith wrote:
> [ 124.216939] =============================
> [ 124.216939] WARNING: suspicious RCU usage
> [ 124.216941] 4.17.0.g66d489e-tip-default #82 Tainted: G E
> [ 124.216941] -----------------------------
> [ 124.216943] kernel/sched/core.c:1614 suspicious rcu_dereference_check() usage!
> [ 124.216944]
> other info that might help us debug this:
>
> [ 124.216945]
> RCU used illegally from offline CPU!
> rcu_scheduler_active = 2, debug_locks = 0
> [ 124.216946] 4 locks held by swapper/2/0:
> [ 124.216947] #0: 000000001f9fa447 (stop_cpus_mutex){+.+.}, at: stop_machine_from_inactive_cpu+0x86/0x130
> [ 124.216953] #1: 000000004cb07b3b (&stopper->lock){..-.}, at: cpu_stop_queue_work+0x2d/0x80
> [ 124.216958] #2: 00000000d3a46b90 (&p->pi_lock){-.-.}, at: try_to_wake_up+0x2d/0x5f0
> [ 124.216964] #3: 00000000f360767b (rcu_read_lock){....}, at: rcu_read_lock+0x0/0x80
> [ 124.216969]
> stack backtrace:
> [ 124.216971] CPU: 2 PID: 0 Comm: swapper/2 Kdump: loaded Tainted: G E 4.17.0.g66d489e-tip-default #82
> [ 124.216972] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013
> [ 124.216973] Call Trace:
> [ 124.216977] dump_stack+0x78/0xb3
> [ 124.216979] ttwu_stat+0x121/0x130
> [ 124.216983] try_to_wake_up+0x2c2/0x5f0
> [ 124.216988] ? cpu_stop_park+0x30/0x30
> [ 124.216990] cpu_stop_queue_work+0x7c/0x80
> [ 124.216993] queue_stop_cpus_work+0x61/0xb0
> [ 124.216997] stop_machine_from_inactive_cpu+0xd3/0x130
> [ 124.216999] ? mtrr_restore+0x80/0x80
> [ 124.217005] mtrr_ap_init+0x62/0x70
> [ 124.217008] identify_secondary_cpu+0x18/0x80
> [ 124.217011] smp_store_cpu_info+0x44/0x50
> [ 124.217014] start_secondary+0x9a/0x1e0
> [ 124.217017] secondary_startup_64+0xa5/0xb0

Hurm.. I don't see how this is 'new'. We moved the wakeup out from under
stopper lock, but that should not affect the RCU state.

The warning is of course valid; stop_machine_from_inactive_cpu()
explicitly runs on an 'offline' CPU. The patch didn't change this.

2018-05-03 12:41:57

by Mike Galbraith

Subject: Re: cpu stopper threads and load balancing leads to deadlock

On Thu, 2018-05-03 at 14:28 +0200, Peter Zijlstra wrote:
>
> Hurm.. I don't see how this is 'new'. We moved the wakeup out from under
> stopper lock, but that should not affect the RCU state.

No, not new, just additional woes from the same spot.

-Mike


2018-05-03 12:50:18

by Peter Zijlstra

Subject: Re: cpu stopper threads and load balancing leads to deadlock

On Thu, May 03, 2018 at 02:40:21PM +0200, Mike Galbraith wrote:
> On Thu, 2018-05-03 at 14:28 +0200, Peter Zijlstra wrote:
> >
> > Hurm.. I don't see how this is 'new'. We moved the wakeup out from under
> > stopper lock, but that should not affect the RCU state.
>
> No, not new, just additional woes from the same spot.

Ah, OK. Does something like this make it go away?


diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index f89014a2c238..a32518c2ba4a 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -650,8 +650,10 @@ int stop_machine_from_inactive_cpu(cpu_stop_fn_t fn, void *data,
/* Schedule work on other CPUs and execute directly for local CPU */
set_state(&msdata, MULTI_STOP_PREPARE);
cpu_stop_init_done(&done, num_active_cpus());
- queue_stop_cpus_work(cpu_active_mask, multi_cpu_stop, &msdata,
- &done);
+
+ RCU_NONIDLE(queue_stop_cpus_work(cpu_active_mask, multi_cpu_stop,
+ &msdata, &done));
+
ret = multi_cpu_stop(&msdata);

/* Busy wait for completion. */

2018-05-03 13:34:56

by Mike Galbraith

Subject: Re: cpu stopper threads and load balancing leads to deadlock

On Thu, 2018-05-03 at 14:49 +0200, Peter Zijlstra wrote:
> On Thu, May 03, 2018 at 02:40:21PM +0200, Mike Galbraith wrote:
> > On Thu, 2018-05-03 at 14:28 +0200, Peter Zijlstra wrote:
> > >
> > > Hurm.. I don't see how this is 'new'. We moved the wakeup out from under
> > > stopper lock, but that should not affect the RCU state.
> >
> > No, not new, just additional woes from the same spot.
>
> Ah, OK. Does something like this make it go away?

Dang. With $subject fix applied as well..

[ 151.103732] smpboot: Booting Node 0 Processor 2 APIC 0x4
[ 151.104908] =============================
[ 151.104909] WARNING: suspicious RCU usage
[ 151.104910] 4.17.0.g66d489e-tip-default #84 Tainted: G E
[ 151.104911] -----------------------------
[ 151.104912] kernel/sched/core.c:1625 suspicious rcu_dereference_check() usage!
[ 151.104913]
other info that might help us debug this:

[ 151.104914]
RCU used illegally from offline CPU!
rcu_scheduler_active = 2, debug_locks = 0
[ 151.104916] 3 locks held by swapper/2/0:
[ 151.104916] #0: 00000000560adb60 (stop_cpus_mutex){+.+.}, at: stop_machine_from_inactive_cpu+0x86/0x140
[ 151.104923] #1: 00000000e4fb0238 (&p->pi_lock){-.-.}, at: try_to_wake_up+0x2d/0x5f0
[ 151.104929] #2: 000000003341403b (rcu_read_lock){....}, at: rcu_read_lock+0x0/0x80
[ 151.104934]
stack backtrace:
[ 151.104937] CPU: 2 PID: 0 Comm: swapper/2 Kdump: loaded Tainted: G E 4.17.0.g66d489e-tip-default #84
[ 151.104938] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013
[ 151.104938] Call Trace:
[ 151.104942] dump_stack+0x78/0xb3
[ 151.104945] ttwu_stat+0x121/0x130
[ 151.104949] try_to_wake_up+0x2c2/0x5f0
[ 151.104953] ? cpu_stop_park+0x30/0x30
[ 151.104956] wake_up_q+0x4a/0x70
[ 151.104959] cpu_stop_queue_work+0x6b/0xa0
[ 151.104963] queue_stop_cpus_work+0x61/0xb0
[ 151.104968] stop_machine_from_inactive_cpu+0xd8/0x140
[ 151.104970] ? mtrr_restore+0x80/0x80
[ 151.104976] mtrr_ap_init+0x62/0x70
[ 151.104979] identify_secondary_cpu+0x18/0x80
[ 151.104982] smp_store_cpu_info+0x44/0x50
[ 151.104985] start_secondary+0x9a/0x1e0
[ 151.104988] secondary_startup_64+0xa5/0xb0

> diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
> index f89014a2c238..a32518c2ba4a 100644
> --- a/kernel/stop_machine.c
> +++ b/kernel/stop_machine.c
> @@ -650,8 +650,10 @@ int stop_machine_from_inactive_cpu(cpu_stop_fn_t fn, void *data,
> /* Schedule work on other CPUs and execute directly for local CPU */
> set_state(&msdata, MULTI_STOP_PREPARE);
> cpu_stop_init_done(&done, num_active_cpus());
> - queue_stop_cpus_work(cpu_active_mask, multi_cpu_stop, &msdata,
> - &done);
> +
> + RCU_NONIDLE(queue_stop_cpus_work(cpu_active_mask, multi_cpu_stop,
> + &msdata, &done));
> +
> ret = multi_cpu_stop(&msdata);
>
> /* Busy wait for completion. */

2018-05-03 13:57:18

by Peter Zijlstra

Subject: Re: cpu stopper threads and load balancing leads to deadlock

On Thu, May 03, 2018 at 03:32:39PM +0200, Mike Galbraith wrote:

> Dang. With $subject fix applied as well..

That's a NO then... :-(

> [ 151.103732] smpboot: Booting Node 0 Processor 2 APIC 0x4
> [ 151.104908] =============================
> [ 151.104909] WARNING: suspicious RCU usage
> [ 151.104910] 4.17.0.g66d489e-tip-default #84 Tainted: G E
> [ 151.104911] -----------------------------
> [ 151.104912] kernel/sched/core.c:1625 suspicious rcu_dereference_check() usage!
> [ 151.104913]
> other info that might help us debug this:
>
> [ 151.104914]
> RCU used illegally from offline CPU!
> rcu_scheduler_active = 2, debug_locks = 0
> [ 151.104916] 3 locks held by swapper/2/0:
> [ 151.104916] #0: 00000000560adb60 (stop_cpus_mutex){+.+.}, at: stop_machine_from_inactive_cpu+0x86/0x140
> [ 151.104923] #1: 00000000e4fb0238 (&p->pi_lock){-.-.}, at: try_to_wake_up+0x2d/0x5f0
> [ 151.104929] #2: 000000003341403b (rcu_read_lock){....}, at: rcu_read_lock+0x0/0x80
> [ 151.104934]
> stack backtrace:
> [ 151.104937] CPU: 2 PID: 0 Comm: swapper/2 Kdump: loaded Tainted: G E 4.17.0.g66d489e-tip-default #84
> [ 151.104938] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013
> [ 151.104938] Call Trace:
> [ 151.104942] dump_stack+0x78/0xb3
> [ 151.104945] ttwu_stat+0x121/0x130
> [ 151.104949] try_to_wake_up+0x2c2/0x5f0
> [ 151.104953] ? cpu_stop_park+0x30/0x30
> [ 151.104956] wake_up_q+0x4a/0x70
> [ 151.104959] cpu_stop_queue_work+0x6b/0xa0
> [ 151.104963] queue_stop_cpus_work+0x61/0xb0
> [ 151.104968] stop_machine_from_inactive_cpu+0xd8/0x140

> > diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
> > index f89014a2c238..a32518c2ba4a 100644
> > --- a/kernel/stop_machine.c
> > +++ b/kernel/stop_machine.c
> > @@ -650,8 +650,10 @@ int stop_machine_from_inactive_cpu(cpu_stop_fn_t fn, void *data,
> > /* Schedule work on other CPUs and execute directly for local CPU */
> > set_state(&msdata, MULTI_STOP_PREPARE);
> > cpu_stop_init_done(&done, num_active_cpus());
> > - queue_stop_cpus_work(cpu_active_mask, multi_cpu_stop, &msdata,
> > - &done);
> > +
> > + RCU_NONIDLE(queue_stop_cpus_work(cpu_active_mask, multi_cpu_stop,
> > + &msdata, &done));
> > +
> > ret = multi_cpu_stop(&msdata);

Paul, any clue on what else to try here? The whole MTRR setup is
radically crazy but it's something we're stuck with (yay hardware) :/

So the issue is that we're doing wakeups from an offline CPU (very early
during bringup) and RCU (rightfully) complains about that. I thought
RCU_NONIDLE() was the magic incantation that makes RCU 'watch', but
clearly it's not enough here.

2018-05-03 14:18:23

by Mike Galbraith

Subject: Re: cpu stopper threads and load balancing leads to deadlock

On Thu, 2018-05-03 at 15:56 +0200, Peter Zijlstra wrote:
> On Thu, May 03, 2018 at 03:32:39PM +0200, Mike Galbraith wrote:
>
> > Dang. With $subject fix applied as well..
>
> That's a NO then... :-(

Could say who cares about oddball offline wakeup stat. <cringe>

2018-05-03 14:39:15

by Paul E. McKenney

Subject: Re: cpu stopper threads and load balancing leads to deadlock

On Thu, May 03, 2018 at 03:56:17PM +0200, Peter Zijlstra wrote:
> On Thu, May 03, 2018 at 03:32:39PM +0200, Mike Galbraith wrote:
>
> > Dang. With $subject fix applied as well..
>
> That's a NO then... :-(
>
> > [ 151.103732] smpboot: Booting Node 0 Processor 2 APIC 0x4
> > [ 151.104908] =============================
> > [ 151.104909] WARNING: suspicious RCU usage
> > [ 151.104910] 4.17.0.g66d489e-tip-default #84 Tainted: G E
> > [ 151.104911] -----------------------------
> > [ 151.104912] kernel/sched/core.c:1625 suspicious rcu_dereference_check() usage!
> > [ 151.104913]
> > other info that might help us debug this:
> >
> > [ 151.104914]
> > RCU used illegally from offline CPU!
> > rcu_scheduler_active = 2, debug_locks = 0
> > [ 151.104916] 3 locks held by swapper/2/0:
> > [ 151.104916] #0: 00000000560adb60 (stop_cpus_mutex){+.+.}, at: stop_machine_from_inactive_cpu+0x86/0x140
> > [ 151.104923] #1: 00000000e4fb0238 (&p->pi_lock){-.-.}, at: try_to_wake_up+0x2d/0x5f0
> > [ 151.104929] #2: 000000003341403b (rcu_read_lock){....}, at: rcu_read_lock+0x0/0x80
> > [ 151.104934]
> > stack backtrace:
> > [ 151.104937] CPU: 2 PID: 0 Comm: swapper/2 Kdump: loaded Tainted: G E 4.17.0.g66d489e-tip-default #84
> > [ 151.104938] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013
> > [ 151.104938] Call Trace:
> > [ 151.104942] dump_stack+0x78/0xb3
> > [ 151.104945] ttwu_stat+0x121/0x130
> > [ 151.104949] try_to_wake_up+0x2c2/0x5f0
> > [ 151.104953] ? cpu_stop_park+0x30/0x30
> > [ 151.104956] wake_up_q+0x4a/0x70
> > [ 151.104959] cpu_stop_queue_work+0x6b/0xa0
> > [ 151.104963] queue_stop_cpus_work+0x61/0xb0
> > [ 151.104968] stop_machine_from_inactive_cpu+0xd8/0x140
>
> > > diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
> > > index f89014a2c238..a32518c2ba4a 100644
> > > --- a/kernel/stop_machine.c
> > > +++ b/kernel/stop_machine.c
> > > @@ -650,8 +650,10 @@ int stop_machine_from_inactive_cpu(cpu_stop_fn_t fn, void *data,
> > > /* Schedule work on other CPUs and execute directly for local CPU */
> > > set_state(&msdata, MULTI_STOP_PREPARE);
> > > cpu_stop_init_done(&done, num_active_cpus());
> > > - queue_stop_cpus_work(cpu_active_mask, multi_cpu_stop, &msdata,
> > > - &done);
> > > +
> > > + RCU_NONIDLE(queue_stop_cpus_work(cpu_active_mask, multi_cpu_stop,
> > > + &msdata, &done));
> > > +
> > > ret = multi_cpu_stop(&msdata);
>
> Paul, any clue on what else to try here? The whole MTRR setup is
> radically crazy but it's something we're stuck with (yay hardware) :/
>
> So the issue is that we're doing wakeups from an offline CPU (very early
> during bringup) and RCU (rightfully) complains about that. I thought
> RCU_NONIDLE() was the magic incantation that makes RCU 'watch', but
> clearly it's not enough here.

Huh.

No, RCU_NONIDLE() only works for idle, not for offline.

Maybe... Let me take a look. There must be some way to mark a
specific lock acquisition and release as being lockdep-invisible...

Another approach would be to have an architecture-specific thing that
caused RCU to be enabled way earlier on x86.

Thanx, Paul


2018-05-03 14:45:39

by Peter Zijlstra

Subject: Re: cpu stopper threads and load balancing leads to deadlock

On Thu, May 03, 2018 at 04:16:55PM +0200, Mike Galbraith wrote:
> On Thu, 2018-05-03 at 15:56 +0200, Peter Zijlstra wrote:
> > On Thu, May 03, 2018 at 03:32:39PM +0200, Mike Galbraith wrote:
> >
> > > Dang. With $subject fix applied as well..
> >
> > That's a NO then... :-(
>
> Could say who cares about oddball offline wakeup stat. <cringe>

Yeah, nobody.. but I don't want to have to change the wakeup code to
deal with this if at all possible. That'd just add conditions that are
'always' false, except in this exceedingly rare circumstance.

So ideally we manage to tell RCU that it needs to pay attention while
we're doing this here thing, which is what I thought RCU_NONIDLE() was
about.

2018-05-03 14:54:21

by Peter Zijlstra

Subject: Re: cpu stopper threads and load balancing leads to deadlock

On Thu, May 03, 2018 at 07:39:41AM -0700, Paul E. McKenney wrote:
> Huh.
>
> No, RCU_NONIDLE() only works for idle, not for offline.

Oh bummer..

> Maybe... Let me take a look. There must be some way to mark a
> specific lock acquisition and release as being lockdep-invisible...

But I suspect it flags a real issue. Shutting it up isn't the issue as
such. There could be a concurrent sched domain rebuild while we do this
setup, and if we free that domain while we're looking at it, bad things
happen.

... I wonder what brought this failure to pass now. This code has been
like this for a long time.

2018-05-03 16:11:49

by Paul E. McKenney

Subject: Re: cpu stopper threads and load balancing leads to deadlock

On Thu, May 03, 2018 at 04:44:50PM +0200, Peter Zijlstra wrote:
> On Thu, May 03, 2018 at 04:16:55PM +0200, Mike Galbraith wrote:
> > On Thu, 2018-05-03 at 15:56 +0200, Peter Zijlstra wrote:
> > > On Thu, May 03, 2018 at 03:32:39PM +0200, Mike Galbraith wrote:
> > >
> > > > Dang. With $subject fix applied as well..
> > >
> > > That's a NO then... :-(
> >
> > Could say who cares about oddball offline wakeup stat. <cringe>
>
> Yeah, nobody.. but I don't want to have to change the wakeup code to
> deal with this if at all possible. That'd just add conditions that are
> 'always' false, except in this exceedingly rare circumstance.
>
> So ideally we manage to tell RCU that it needs to pay attention while
> we're doing this here thing, which is what I thought RCU_NONIDLE() was
> about.

One straightforward approach would be to provide an arch-specific
Kconfig option that tells notify_cpu_starting() not to bother invoking
rcu_cpu_starting(). Then x86 selects this Kconfig option and invokes
rcu_cpu_starting() itself early enough to avoid splats.

See the (untested, probably does not even build) patch below.

I have no idea where to insert either the "select" or the call to
rcu_cpu_starting(), so I left those out. I know that putting the
call too early will cause trouble, but I have no idea what constitutes
"too early". :-/

Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/cpu.c b/kernel/cpu.c
index 0db8938fbb23..58f7ea1de247 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -948,7 +948,8 @@ void notify_cpu_starting(unsigned int cpu)
enum cpuhp_state target = min((int)st->target, CPUHP_AP_ONLINE);
int ret;

- rcu_cpu_starting(cpu); /* Enables RCU usage on this CPU. */
+ if (!IS_ENABLED(CONFIG_RCU_CPU_ONLINE_EARLY))
+ rcu_cpu_starting(cpu); /* Enables RCU usage on this CPU. */
while (st->state < target) {
st->state++;
ret = cpuhp_invoke_callback(cpu, st->state, true, NULL, NULL);
diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index 9210379c0353..a874c0d74797 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -238,4 +238,7 @@ config RCU_NOCB_CPU
Say Y here if you want to help to debug reduced OS jitter.
Say N here if you are unsure.

+config RCU_CPU_ONLINE_EARLY
+ bool
+
endmenu # "RCU Subsystem"


2018-05-03 16:45:39

by Peter Zijlstra

Subject: Re: cpu stopper threads and load balancing leads to deadlock

On Thu, May 03, 2018 at 09:12:31AM -0700, Paul E. McKenney wrote:
> On Thu, May 03, 2018 at 04:44:50PM +0200, Peter Zijlstra wrote:
> > On Thu, May 03, 2018 at 04:16:55PM +0200, Mike Galbraith wrote:
> > > On Thu, 2018-05-03 at 15:56 +0200, Peter Zijlstra wrote:
> > > > On Thu, May 03, 2018 at 03:32:39PM +0200, Mike Galbraith wrote:
> > > >
> > > > > Dang. With $subject fix applied as well..
> > > >
> > > > That's a NO then... :-(
> > >
> > > Could say who cares about oddball offline wakeup stat. <cringe>
> >
> > Yeah, nobody.. but I don't want to have to change the wakeup code to
> > deal with this if at all possible. That'd just add conditions that are
> > 'always' false, except in this exceedingly rare circumstance.
> >
> > So ideally we manage to tell RCU that it needs to pay attention while
> > we're doing this here thing, which is what I thought RCU_NONIDLE() was
> > about.
>
> One straightforward approach would be to provide an arch-specific
> Kconfig option that tells notify_cpu_starting() not to bother invoking
> rcu_cpu_starting(). Then x86 selects this Kconfig option and invokes
> rcu_cpu_starting() itself early enough to avoid splats.
>
> See the (untested, probably does not even build) patch below.
>
> I have no idea where to insert either the "select" or the call to
> rcu_cpu_starting(), so I left those out. I know that putting the
> call too early will cause trouble, but I have no idea what constitutes
> "too early". :-/

Something like so perhaps? Mike, can you play around with that? Could
burn your granny and eat your cookies.


diff --git a/arch/x86/kernel/cpu/mtrr/main.c b/arch/x86/kernel/cpu/mtrr/main.c
index 7468de429087..07360523c3ce 100644
--- a/arch/x86/kernel/cpu/mtrr/main.c
+++ b/arch/x86/kernel/cpu/mtrr/main.c
@@ -793,6 +793,9 @@ void mtrr_ap_init(void)

if (!use_intel() || mtrr_aps_delayed_init)
return;
+
+ rcu_cpu_starting(smp_processor_id());
+
/*
* Ideally we should hold mtrr_mutex here to avoid mtrr entries
* changed, but this routine will be called in cpu boot time,
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 2a734692a581..4dab46950fdb 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3775,6 +3775,8 @@ int rcutree_dead_cpu(unsigned int cpu)
return 0;
}

+static DEFINE_PER_CPU(int, rcu_cpu_started);
+
/*
* Mark the specified CPU as being online so that subsequent grace periods
* (both expedited and normal) will wait on it. Note that this means that
@@ -3796,6 +3798,11 @@ void rcu_cpu_starting(unsigned int cpu)
struct rcu_node *rnp;
struct rcu_state *rsp;

+ if (per_cpu(rcu_cpu_started, cpu))
+ return;
+
+ per_cpu(rcu_cpu_started, cpu) = 1;
+
for_each_rcu_flavor(rsp) {
rdp = per_cpu_ptr(rsp->rda, cpu);
rnp = rdp->mynode;
@@ -3852,6 +3859,8 @@ void rcu_report_dead(unsigned int cpu)
preempt_enable();
for_each_rcu_flavor(rsp)
rcu_cleanup_dying_idle_cpu(cpu, rsp);
+
+ per_cpu(rcu_cpu_started, cpu) = 0;
}

/* Migrate the dead CPU's callbacks to the current CPU. */

2018-05-03 17:19:09

by Paul E. McKenney

Subject: Re: cpu stopper threads and load balancing leads to deadlock

On Thu, May 03, 2018 at 06:45:08PM +0200, Peter Zijlstra wrote:
> On Thu, May 03, 2018 at 09:12:31AM -0700, Paul E. McKenney wrote:
> > On Thu, May 03, 2018 at 04:44:50PM +0200, Peter Zijlstra wrote:
> > > On Thu, May 03, 2018 at 04:16:55PM +0200, Mike Galbraith wrote:
> > > > On Thu, 2018-05-03 at 15:56 +0200, Peter Zijlstra wrote:
> > > > > On Thu, May 03, 2018 at 03:32:39PM +0200, Mike Galbraith wrote:
> > > > >
> > > > > > Dang. With $subject fix applied as well..
> > > > >
> > > > > That's a NO then... :-(
> > > >
> > > > Could say who cares about oddball offline wakeup stat. <cringe>
> > >
> > > Yeah, nobody.. but I don't want to have to change the wakeup code to
> > > deal with this if at all possible. That'd just add conditions that are
> > > 'always' false, except in this exceedingly rare circumstance.
> > >
> > > So ideally we manage to tell RCU that it needs to pay attention while
> > > we're doing this here thing, which is what I thought RCU_NONIDLE() was
> > > about.
> >
> > One straightforward approach would be to provide an arch-specific
> > Kconfig option that tells notify_cpu_starting() not to bother invoking
> > rcu_cpu_starting(). Then x86 selects this Kconfig option and invokes
> > rcu_cpu_starting() itself early enough to avoid splats.
> >
> > See the (untested, probably does not even build) patch below.
> >
> > I have no idea where to insert either the "select" or the call to
> > rcu_cpu_starting(), so I left those out. I know that putting the
> > call too early will cause trouble, but I have no idea what constitutes
> > "too early". :-/
>
> Something like so perhaps? Mike, can you play around with that? Could
> burn your granny and eat your cookies.
>
>
> diff --git a/arch/x86/kernel/cpu/mtrr/main.c b/arch/x86/kernel/cpu/mtrr/main.c
> index 7468de429087..07360523c3ce 100644
> --- a/arch/x86/kernel/cpu/mtrr/main.c
> +++ b/arch/x86/kernel/cpu/mtrr/main.c
> @@ -793,6 +793,9 @@ void mtrr_ap_init(void)
>
> if (!use_intel() || mtrr_aps_delayed_init)
> return;
> +
> + rcu_cpu_starting(smp_processor_id());
> +
> /*
> * Ideally we should hold mtrr_mutex here to avoid mtrr entries
> * changed, but this routine will be called in cpu boot time,
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 2a734692a581..4dab46950fdb 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3775,6 +3775,8 @@ int rcutree_dead_cpu(unsigned int cpu)
> return 0;
> }
>
> +static DEFINE_PER_CPU(int, rcu_cpu_started);
> +
> /*
> * Mark the specified CPU as being online so that subsequent grace periods
> * (both expedited and normal) will wait on it. Note that this means that
> @@ -3796,6 +3798,11 @@ void rcu_cpu_starting(unsigned int cpu)
> struct rcu_node *rnp;
> struct rcu_state *rsp;
>
> + if (per_cpu(rcu_cpu_started, cpu))

I would log a non-splat dmesg the first time this happened, just for my
future sanity, but otherwise looks fine. I am a bit concerned about
calls to rcu_cpu_starting() getting sprinkled all through the code.
Or am I being excessively paranoid?

Thanx, Paul

> + return;
> +
> + per_cpu(rcu_cpu_started, cpu) = 1;
> +
> for_each_rcu_flavor(rsp) {
> rdp = per_cpu_ptr(rsp->rda, cpu);
> rnp = rdp->mynode;
> @@ -3852,6 +3859,8 @@ void rcu_report_dead(unsigned int cpu)
> preempt_enable();
> for_each_rcu_flavor(rsp)
> rcu_cleanup_dying_idle_cpu(cpu, rsp);
> +
> + per_cpu(rcu_cpu_started, cpu) = 0;
> }
>
> /* Migrate the dead CPU's callbacks to the current CPU. */
>


2018-05-03 17:55:44

by Peter Zijlstra

Subject: Re: cpu stopper threads and load balancing leads to deadlock

On Thu, May 03, 2018 at 10:18:50AM -0700, Paul E. McKenney wrote:
> > + if (per_cpu(rcu_cpu_started, cpu))
>
> I would log a non-splat dmesg the first time this happened, just for my
> future sanity, but otherwise looks fine. I am a bit concerned about
> calls to rcu_cpu_starting() getting sprinkled all through the code.
> Or am I being excessively paranoid?

Don't think so, but lets first see if this at all works. If not we'll
need to figure out something else anyway.

2018-05-03 18:23:33

by Paul E. McKenney

Subject: Re: cpu stopper threads and load balancing leads to deadlock

On Thu, May 03, 2018 at 07:54:56PM +0200, Peter Zijlstra wrote:
> On Thu, May 03, 2018 at 10:18:50AM -0700, Paul E. McKenney wrote:
> > > + if (per_cpu(rcu_cpu_started, cpu))
> >
> > I would log a non-splat dmesg the first time this happened, just for my
> > future sanity, but otherwise looks fine. I am a bit concerned about
> > calls to rcu_cpu_starting() getting sprinkled all through the code.
> > Or am I being excessively paranoid?
>
> Don't think so, but lets first see if this at all works. If not we'll
> need to figure out something else anyway.

Works for me!

Thanx, Paul


2018-05-04 03:39:14

by Mike Galbraith

Subject: Re: cpu stopper threads and load balancing leads to deadlock

On Thu, 2018-05-03 at 18:45 +0200, Peter Zijlstra wrote:
>
> Something like so perhaps? Mike, can you play around with that? Could
> burn your granny and eat your cookies.

That worked, and nothing entertaining has happened.. yet. Hm, I could
use this kernel to update my backup drive, if there's a cookie monster
lurking, that might get its attention :)

> diff --git a/arch/x86/kernel/cpu/mtrr/main.c b/arch/x86/kernel/cpu/mtrr/main.c
> index 7468de429087..07360523c3ce 100644
> --- a/arch/x86/kernel/cpu/mtrr/main.c
> +++ b/arch/x86/kernel/cpu/mtrr/main.c
> @@ -793,6 +793,9 @@ void mtrr_ap_init(void)
>
> if (!use_intel() || mtrr_aps_delayed_init)
> return;
> +
> + rcu_cpu_starting(smp_processor_id());
> +
> /*
> * Ideally we should hold mtrr_mutex here to avoid mtrr entries
> * changed, but this routine will be called in cpu boot time,
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 2a734692a581..4dab46950fdb 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3775,6 +3775,8 @@ int rcutree_dead_cpu(unsigned int cpu)
> return 0;
> }
>
> +static DEFINE_PER_CPU(int, rcu_cpu_started);
> +
> /*
> * Mark the specified CPU as being online so that subsequent grace periods
> * (both expedited and normal) will wait on it. Note that this means that
> @@ -3796,6 +3798,11 @@ void rcu_cpu_starting(unsigned int cpu)
> struct rcu_node *rnp;
> struct rcu_state *rsp;
>
> + if (per_cpu(rcu_cpu_started, cpu))
> + return;
> +
> + per_cpu(rcu_cpu_started, cpu) = 1;
> +
> for_each_rcu_flavor(rsp) {
> rdp = per_cpu_ptr(rsp->rda, cpu);
> rnp = rdp->mynode;
> @@ -3852,6 +3859,8 @@ void rcu_report_dead(unsigned int cpu)
> preempt_enable();
> for_each_rcu_flavor(rsp)
> rcu_cleanup_dying_idle_cpu(cpu, rsp);
> +
> + per_cpu(rcu_cpu_started, cpu) = 0;
> }
>
> /* Migrate the dead CPU's callbacks to the current CPU. */

2018-05-15 04:31:26

by Mike Galbraith

Subject: Re: cpu stopper threads and load balancing leads to deadlock

On Thu, 2018-05-03 at 18:45 +0200, Peter Zijlstra wrote:
> On Thu, May 03, 2018 at 09:12:31AM -0700, Paul E. McKenney wrote:
> > On Thu, May 03, 2018 at 04:44:50PM +0200, Peter Zijlstra wrote:
> > > On Thu, May 03, 2018 at 04:16:55PM +0200, Mike Galbraith wrote:
> > > > On Thu, 2018-05-03 at 15:56 +0200, Peter Zijlstra wrote:
> > > > > On Thu, May 03, 2018 at 03:32:39PM +0200, Mike Galbraith wrote:
> > > > >
> > > > > > Dang. With $subject fix applied as well..
> > > > >
> > > > > That's a NO then... :-(
> > > >
> > > > Could say who cares about oddball offline wakeup stat. <cringe>
> > >
> > > Yeah, nobody.. but I don't want to have to change the wakeup code to
> > > deal with this if at all possible. That'd just add conditions that are
> > > 'always' false, except in this exceedingly rare circumstance.
> > >
> > > So ideally we manage to tell RCU that it needs to pay attention while
> > > we're doing this here thing, which is what I thought RCU_NONIDLE() was
> > > about.
> >
> > One straightforward approach would be to provide an arch-specific
> > Kconfig option that tells notify_cpu_starting() not to bother invoking
> > rcu_cpu_starting(). Then x86 selects this Kconfig option and invokes
> > rcu_cpu_starting() itself early enough to avoid splats.
> >
> > See the (untested, probably does not even build) patch below.
> >
> > I have no idea where to insert either the "select" or the call to
> > rcu_cpu_starting(), so I left those out. I know that putting the
> > call too early will cause trouble, but I have no idea what constitutes
> > "too early". :-/
>
> Something like so perhaps? Mike, can you play around with that? Could
> burn your granny and eat your cookies.

Did this get queued anywhere?

> diff --git a/arch/x86/kernel/cpu/mtrr/main.c b/arch/x86/kernel/cpu/mtrr/main.c
> index 7468de429087..07360523c3ce 100644
> --- a/arch/x86/kernel/cpu/mtrr/main.c
> +++ b/arch/x86/kernel/cpu/mtrr/main.c
> @@ -793,6 +793,9 @@ void mtrr_ap_init(void)
>
> if (!use_intel() || mtrr_aps_delayed_init)
> return;
> +
> + rcu_cpu_starting(smp_processor_id());
> +
> /*
> * Ideally we should hold mtrr_mutex here to avoid mtrr entries
> * changed, but this routine will be called in cpu boot time,
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 2a734692a581..4dab46950fdb 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3775,6 +3775,8 @@ int rcutree_dead_cpu(unsigned int cpu)
> return 0;
> }
>
> +static DEFINE_PER_CPU(int, rcu_cpu_started);
> +
> /*
> * Mark the specified CPU as being online so that subsequent grace periods
> * (both expedited and normal) will wait on it. Note that this means that
> @@ -3796,6 +3798,11 @@ void rcu_cpu_starting(unsigned int cpu)
> struct rcu_node *rnp;
> struct rcu_state *rsp;
>
> + if (per_cpu(rcu_cpu_started, cpu))
> + return;
> +
> + per_cpu(rcu_cpu_started, cpu) = 1;
> +
> for_each_rcu_flavor(rsp) {
> rdp = per_cpu_ptr(rsp->rda, cpu);
> rnp = rdp->mynode;
> @@ -3852,6 +3859,8 @@ void rcu_report_dead(unsigned int cpu)
> preempt_enable();
> for_each_rcu_flavor(rsp)
> rcu_cleanup_dying_idle_cpu(cpu, rsp);
> +
> + per_cpu(rcu_cpu_started, cpu) = 0;
> }
>
> /* Migrate the dead CPU's callbacks to the current CPU. */

2018-05-17 14:03:24

by Paul E. McKenney

Subject: Re: cpu stopper threads and load balancing leads to deadlock

On Tue, May 15, 2018 at 06:30:26AM +0200, Mike Galbraith wrote:
> On Thu, 2018-05-03 at 18:45 +0200, Peter Zijlstra wrote:
> > On Thu, May 03, 2018 at 09:12:31AM -0700, Paul E. McKenney wrote:
> > > On Thu, May 03, 2018 at 04:44:50PM +0200, Peter Zijlstra wrote:
> > > > On Thu, May 03, 2018 at 04:16:55PM +0200, Mike Galbraith wrote:
> > > > > On Thu, 2018-05-03 at 15:56 +0200, Peter Zijlstra wrote:
> > > > > > On Thu, May 03, 2018 at 03:32:39PM +0200, Mike Galbraith wrote:
> > > > > >
> > > > > > > Dang. With $subject fix applied as well..
> > > > > >
> > > > > > That's a NO then... :-(
> > > > >
> > > > > Could say who cares about oddball offline wakeup stat. <cringe>
> > > >
> > > > Yeah, nobody.. but I don't want to have to change the wakeup code to
> > > > deal with this if at all possible. That'd just add conditions that are
> > > > 'always' false, except in this exceedingly rare circumstance.
> > > >
> > > > So ideally we manage to tell RCU that it needs to pay attention while
> > > > we're doing this here thing, which is what I thought RCU_NONIDLE() was
> > > > about.
> > >
> > > One straightforward approach would be to provide an arch-specific
> > > Kconfig option that tells notify_cpu_starting() not to bother invoking
> > > rcu_cpu_starting(). Then x86 selects this Kconfig option and invokes
> > > rcu_cpu_starting() itself early enough to avoid splats.
> > >
> > > See the (untested, probably does not even build) patch below.
> > >
> > > I have no idea where to insert either the "select" or the call to
> > > rcu_cpu_starting(), so I left those out. I know that putting the
> > > call too early will cause trouble, but I have no idea what constitutes
> > > "too early". :-/
> >
> > Something like so perhaps? Mike, can you play around with that? Could
> > burn your granny and eat your cookies.
>
> Did this get queued anywhere?

I have not queued it, but given Peter's Signed-off-by and your Tested-by
I would be happy to do so.

Thanx, Paul

> > diff --git a/arch/x86/kernel/cpu/mtrr/main.c b/arch/x86/kernel/cpu/mtrr/main.c
> > index 7468de429087..07360523c3ce 100644
> > --- a/arch/x86/kernel/cpu/mtrr/main.c
> > +++ b/arch/x86/kernel/cpu/mtrr/main.c
> > @@ -793,6 +793,9 @@ void mtrr_ap_init(void)
> >
> > if (!use_intel() || mtrr_aps_delayed_init)
> > return;
> > +
> > + rcu_cpu_starting(smp_processor_id());
> > +
> > /*
> > * Ideally we should hold mtrr_mutex here to avoid mtrr entries
> > * changed, but this routine will be called in cpu boot time,
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 2a734692a581..4dab46950fdb 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3775,6 +3775,8 @@ int rcutree_dead_cpu(unsigned int cpu)
> > return 0;
> > }
> >
> > +static DEFINE_PER_CPU(int, rcu_cpu_started);
> > +
> > /*
> > * Mark the specified CPU as being online so that subsequent grace periods
> > * (both expedited and normal) will wait on it. Note that this means that
> > @@ -3796,6 +3798,11 @@ void rcu_cpu_starting(unsigned int cpu)
> > struct rcu_node *rnp;
> > struct rcu_state *rsp;
> >
> > + if (per_cpu(rcu_cpu_started, cpu))
> > + return;
> > +
> > + per_cpu(rcu_cpu_started, cpu) = 1;
> > +
> > for_each_rcu_flavor(rsp) {
> > rdp = per_cpu_ptr(rsp->rda, cpu);
> > rnp = rdp->mynode;
> > @@ -3852,6 +3859,8 @@ void rcu_report_dead(unsigned int cpu)
> > preempt_enable();
> > for_each_rcu_flavor(rsp)
> > rcu_cleanup_dying_idle_cpu(cpu, rsp);
> > +
> > + per_cpu(rcu_cpu_started, cpu) = 0;
> > }
> >
> > /* Migrate the dead CPU's callbacks to the current CPU. */
>


2018-05-17 14:11:32

by Mike Galbraith

Subject: Re: cpu stopper threads and load balancing leads to deadlock

On Thu, 2018-05-17 at 07:03 -0700, Paul E. McKenney wrote:
> On Tue, May 15, 2018 at 06:30:26AM +0200, Mike Galbraith wrote:
>
> > > Something like so perhaps? Mike, can you play around with that? Could
> > > burn your granny and eat your cookies.
> >
> > Did this get queued anywhere?
>
> I have not queued it, but given Peter's Signed-off-by and your Tested-by
> I would be happy to do so.

Here's the latter. Tested-by: Mike Galbraith <[email protected]>

> > > diff --git a/arch/x86/kernel/cpu/mtrr/main.c b/arch/x86/kernel/cpu/mtrr/main.c
> > > index 7468de429087..07360523c3ce 100644
> > > --- a/arch/x86/kernel/cpu/mtrr/main.c
> > > +++ b/arch/x86/kernel/cpu/mtrr/main.c
> > > @@ -793,6 +793,9 @@ void mtrr_ap_init(void)
> > >
> > > if (!use_intel() || mtrr_aps_delayed_init)
> > > return;
> > > +
> > > + rcu_cpu_starting(smp_processor_id());
> > > +
> > > /*
> > > * Ideally we should hold mtrr_mutex here to avoid mtrr entries
> > > * changed, but this routine will be called in cpu boot time,
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 2a734692a581..4dab46950fdb 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -3775,6 +3775,8 @@ int rcutree_dead_cpu(unsigned int cpu)
> > > return 0;
> > > }
> > >
> > > +static DEFINE_PER_CPU(int, rcu_cpu_started);
> > > +
> > > /*
> > > * Mark the specified CPU as being online so that subsequent grace periods
> > > * (both expedited and normal) will wait on it. Note that this means that
> > > @@ -3796,6 +3798,11 @@ void rcu_cpu_starting(unsigned int cpu)
> > > struct rcu_node *rnp;
> > > struct rcu_state *rsp;
> > >
> > > + if (per_cpu(rcu_cpu_started, cpu))
> > > + return;
> > > +
> > > + per_cpu(rcu_cpu_started, cpu) = 1;
> > > +
> > > for_each_rcu_flavor(rsp) {
> > > rdp = per_cpu_ptr(rsp->rda, cpu);
> > > rnp = rdp->mynode;
> > > @@ -3852,6 +3859,8 @@ void rcu_report_dead(unsigned int cpu)
> > > preempt_enable();
> > > for_each_rcu_flavor(rsp)
> > > rcu_cleanup_dying_idle_cpu(cpu, rsp);
> > > +
> > > + per_cpu(rcu_cpu_started, cpu) = 0;
> > > }
> > >
> > > /* Migrate the dead CPU's callbacks to the current CPU. */
> >
>

2018-05-17 14:25:22

by Peter Zijlstra

Subject: Re: cpu stopper threads and load balancing leads to deadlock

On Thu, May 17, 2018 at 07:03:45AM -0700, Paul E. McKenney wrote:
> On Tue, May 15, 2018 at 06:30:26AM +0200, Mike Galbraith wrote:
> I have not queued it, but given Peter's Signed-off-by and your Tested-by
> I would be happy to do so.

And a Changelog of course :-)

---
From: Peter Zijlstra <[email protected]>
Subject: rcu/x86: Provide early rcu_cpu_starting() callback

The x86/mtrr code does horrific things because hardware. It uses
stop_machine_from_inactive_cpu(), which does a wakeup (of the stopper
thread on another CPU), which uses RCU, all before the CPU is onlined.

RCU complains about this, because wakeups use RCU and RCU does
(rightfully) not consider offline CPUs for grace-periods.

Fix this by initializing RCU way early in the MTRR case.

Tested-by: Mike Galbraith <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---

> > > diff --git a/arch/x86/kernel/cpu/mtrr/main.c b/arch/x86/kernel/cpu/mtrr/main.c
> > > index 7468de429087..07360523c3ce 100644
> > > --- a/arch/x86/kernel/cpu/mtrr/main.c
> > > +++ b/arch/x86/kernel/cpu/mtrr/main.c
> > > @@ -793,6 +793,9 @@ void mtrr_ap_init(void)
> > >
> > > if (!use_intel() || mtrr_aps_delayed_init)
> > > return;
> > > +
> > > + rcu_cpu_starting(smp_processor_id());
> > > +
> > > /*
> > > * Ideally we should hold mtrr_mutex here to avoid mtrr entries
> > > * changed, but this routine will be called in cpu boot time,
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 2a734692a581..4dab46950fdb 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -3775,6 +3775,8 @@ int rcutree_dead_cpu(unsigned int cpu)
> > > return 0;
> > > }
> > >
> > > +static DEFINE_PER_CPU(int, rcu_cpu_started);
> > > +
> > > /*
> > > * Mark the specified CPU as being online so that subsequent grace periods
> > > * (both expedited and normal) will wait on it. Note that this means that
> > > @@ -3796,6 +3798,11 @@ void rcu_cpu_starting(unsigned int cpu)
> > > struct rcu_node *rnp;
> > > struct rcu_state *rsp;
> > >
> > > + if (per_cpu(rcu_cpu_started, cpu))
> > > + return;
> > > +
> > > + per_cpu(rcu_cpu_started, cpu) = 1;
> > > +
> > > for_each_rcu_flavor(rsp) {
> > > rdp = per_cpu_ptr(rsp->rda, cpu);
> > > rnp = rdp->mynode;
> > > @@ -3852,6 +3859,8 @@ void rcu_report_dead(unsigned int cpu)
> > > preempt_enable();
> > > for_each_rcu_flavor(rsp)
> > > rcu_cleanup_dying_idle_cpu(cpu, rsp);
> > > +
> > > + per_cpu(rcu_cpu_started, cpu) = 0;
> > > }
> > >
> > > /* Migrate the dead CPU's callbacks to the current CPU. */
> >
>

2018-05-17 14:55:34

by Paul E. McKenney

Subject: Re: cpu stopper threads and load balancing leads to deadlock

On Thu, May 17, 2018 at 04:23:22PM +0200, Peter Zijlstra wrote:
> On Thu, May 17, 2018 at 07:03:45AM -0700, Paul E. McKenney wrote:
> > On Tue, May 15, 2018 at 06:30:26AM +0200, Mike Galbraith wrote:
> > I have not queued it, but given Peter's Signed-off-by and your Tested-by
> > I would be happy to do so.
>
> And a Changelog of course :-)

Thank you both!

Is this OK for 4.19, or do we need it for 4.18? I am guessing 4.18,
but figured I should ask before wildly rebasing the 82 commits destined
for 4.19. ;-)

Thanx, Paul

> ---
> From: Peter Zijlstra <[email protected]>
> Subject: rcu/x86: Provide early rcu_cpu_starting() callback
>
> The x86/mtrr code does horrific things because hardware. It uses
> stop_machine_from_inactive_cpu(), which does a wakeup (of the stopper
> thread on another CPU), which uses RCU, all before the CPU is onlined.
>
> RCU complains about this, because wakeups use RCU and RCU does
> (rightfully) not consider offline CPUs for grace-periods.
>
> Fix this by initializing RCU way early in the MTRR case.
>
> Tested-by: Mike Galbraith <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> ---
>
> > > > diff --git a/arch/x86/kernel/cpu/mtrr/main.c b/arch/x86/kernel/cpu/mtrr/main.c
> > > > index 7468de429087..07360523c3ce 100644
> > > > --- a/arch/x86/kernel/cpu/mtrr/main.c
> > > > +++ b/arch/x86/kernel/cpu/mtrr/main.c
> > > > @@ -793,6 +793,9 @@ void mtrr_ap_init(void)
> > > >
> > > > if (!use_intel() || mtrr_aps_delayed_init)
> > > > return;
> > > > +
> > > > + rcu_cpu_starting(smp_processor_id());
> > > > +
> > > > /*
> > > > * Ideally we should hold mtrr_mutex here to avoid mtrr entries
> > > > * changed, but this routine will be called in cpu boot time,
> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > index 2a734692a581..4dab46950fdb 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -3775,6 +3775,8 @@ int rcutree_dead_cpu(unsigned int cpu)
> > > > return 0;
> > > > }
> > > >
> > > > +static DEFINE_PER_CPU(int, rcu_cpu_started);
> > > > +
> > > > /*
> > > > * Mark the specified CPU as being online so that subsequent grace periods
> > > > * (both expedited and normal) will wait on it. Note that this means that
> > > > @@ -3796,6 +3798,11 @@ void rcu_cpu_starting(unsigned int cpu)
> > > > struct rcu_node *rnp;
> > > > struct rcu_state *rsp;
> > > >
> > > > + if (per_cpu(rcu_cpu_started, cpu))
> > > > + return;
> > > > +
> > > > + per_cpu(rcu_cpu_started, cpu) = 1;
> > > > +
> > > > for_each_rcu_flavor(rsp) {
> > > > rdp = per_cpu_ptr(rsp->rda, cpu);
> > > > rnp = rdp->mynode;
> > > > @@ -3852,6 +3859,8 @@ void rcu_report_dead(unsigned int cpu)
> > > > preempt_enable();
> > > > for_each_rcu_flavor(rsp)
> > > > rcu_cleanup_dying_idle_cpu(cpu, rsp);
> > > > +
> > > > + per_cpu(rcu_cpu_started, cpu) = 0;
> > > > }
> > > >
> > > > /* Migrate the dead CPU's callbacks to the current CPU. */
> > >
> >
>


2018-05-22 17:08:33

by Paul E. McKenney

Subject: Re: cpu stopper threads and load balancing leads to deadlock

On Thu, May 17, 2018 at 07:56:14AM -0700, Paul E. McKenney wrote:
> On Thu, May 17, 2018 at 04:23:22PM +0200, Peter Zijlstra wrote:
> > On Thu, May 17, 2018 at 07:03:45AM -0700, Paul E. McKenney wrote:
> > > On Tue, May 15, 2018 at 06:30:26AM +0200, Mike Galbraith wrote:
> > > I have not queued it, but given Peter's Signed-off-by and your Tested-by
> > > I would be happy to do so.
> >
> > And a Changelog of course :-)
>
> Thank you both!
>
> Is this OK for 4.19, or do we need it for 4.18? I am guessing 4.18,
> but figured I should ask before wildly rebasing the 82 commits destined
> for 4.19. ;-)

Hearing nothing countering my assumption, I have queued this just after
my v4.18 merge point. Please see below for the commit, and please let
me know if I messed anything up.

Thanx, Paul

------------------------------------------------------------------------

commit 354238a30982b817262fe6f9d00f808451694366
Author: Peter Zijlstra <[email protected]>
Date: Tue May 22 09:50:53 2018 -0700

rcu/x86: Provide early rcu_cpu_starting() callback

The x86/mtrr code does horrific things because hardware. It uses
stop_machine_from_inactive_cpu(), which does a wakeup (of the stopper
thread on another CPU), which uses RCU, all before the CPU is onlined.

RCU complains about this, because wakeups use RCU and RCU does
(rightfully) not consider offline CPUs for grace-periods.

Fix this by initializing RCU way early in the MTRR case.

Tested-by: Mike Galbraith <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/arch/x86/kernel/cpu/mtrr/main.c b/arch/x86/kernel/cpu/mtrr/main.c
index 7468de429087..07360523c3ce 100644
--- a/arch/x86/kernel/cpu/mtrr/main.c
+++ b/arch/x86/kernel/cpu/mtrr/main.c
@@ -793,6 +793,9 @@ void mtrr_ap_init(void)
 
 	if (!use_intel() || mtrr_aps_delayed_init)
 		return;
+
+	rcu_cpu_starting(smp_processor_id());
+
 	/*
 	 * Ideally we should hold mtrr_mutex here to avoid mtrr entries
 	 * changed, but this routine will be called in cpu boot time,
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 4fccdfa25716..aa7cade1b9f3 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3665,6 +3665,8 @@ int rcutree_dead_cpu(unsigned int cpu)
 	return 0;
 }
 
+static DEFINE_PER_CPU(int, rcu_cpu_started);
+
 /*
  * Mark the specified CPU as being online so that subsequent grace periods
  * (both expedited and normal) will wait on it.  Note that this means that
@@ -3686,6 +3688,11 @@ void rcu_cpu_starting(unsigned int cpu)
 	struct rcu_node *rnp;
 	struct rcu_state *rsp;
 
+	if (per_cpu(rcu_cpu_started, cpu))
+		return;
+
+	per_cpu(rcu_cpu_started, cpu) = 1;
+
 	for_each_rcu_flavor(rsp) {
 		rdp = per_cpu_ptr(rsp->rda, cpu);
 		rnp = rdp->mynode;
@@ -3742,6 +3749,8 @@ void rcu_report_dead(unsigned int cpu)
 	preempt_enable();
 	for_each_rcu_flavor(rsp)
 		rcu_cleanup_dying_idle_cpu(cpu, rsp);
+
+	per_cpu(rcu_cpu_started, cpu) = 0;
 }
 
 /* Migrate the dead CPU's callbacks to the current CPU. */