Hey,
tglx says I have something for ya :-)
======================================================
WARNING: possible circular locking dependency detected
4.13.0-rc6+ #1 Not tainted
------------------------------------------------------
watchdog/3/27 is trying to acquire lock:
(cpu_hotplug_lock.rw_sem){++++}, at: [<ffffffff8100c489>] release_ds_buffers+0x29/0xd0
but now in release context of a crosslock acquired at the following:
((complete)&self->parked){+.+.}, at: [<ffffffff810895f6>] kthread_park+0x46/0x60
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 ((complete)&self->parked){+.+.}:
__lock_acquire+0x10af/0x1110
lock_acquire+0xea/0x1f0
wait_for_completion+0x3b/0x130
kthread_park+0x46/0x60
__smpboot_create_thread.part.5+0x7d/0xf0
smpboot_register_percpu_thread_cpumask+0xa2/0x100
spawn_ksoftirqd+0x3b/0x45
do_one_initcall+0x52/0x198
kernel_init_freeable+0x6f/0x1a1
kernel_init+0xe/0x100
ret_from_fork+0x2a/0x40
-> #1 (smpboot_threads_lock){+.+.}:
__lock_acquire+0x10af/0x1110
lock_acquire+0xea/0x1f0
__mutex_lock+0x6c/0x940
mutex_lock_nested+0x1b/0x20
smpboot_register_percpu_thread_cpumask+0x42/0x100
spawn_ksoftirqd+0x3b/0x45
do_one_initcall+0x52/0x198
kernel_init_freeable+0x6f/0x1a1
kernel_init+0xe/0x100
ret_from_fork+0x2a/0x40
-> #0 (cpu_hotplug_lock.rw_sem){++++}:
cpus_read_lock+0x2a/0x90
release_ds_buffers+0x29/0xd0
x86_release_hardware+0x8f/0xa0
hw_perf_event_destroy+0xe/0x20
_free_event+0xa7/0x250
other info that might help us debug this:
Chain exists of:
cpu_hotplug_lock.rw_sem --> smpboot_threads_lock --> (complete)&self->parked
Possible unsafe locking scenario by crosslock:
CPU0 CPU1
---- ----
lock(smpboot_threads_lock);
lock((complete)&self->parked);
lock(cpu_hotplug_lock.rw_sem);
unlock((complete)&self->parked);
*** DEADLOCK ***
1 lock held by watchdog/3/27:
#0: (&x->wait){....}, at: [<ffffffff810b115d>] complete+0x1d/0x60
stack backtrace:
CPU: 3 PID: 27 Comm: watchdog/3 Not tainted 4.13.0-rc6+ #1
Hardware name: LENOVO 2320CTO/2320CTO, BIOS G2ET86WW (2.06 ) 11/13/2012
Call Trace:
dump_stack+0x86/0xcf
print_circular_bug+0x1fa/0x2f0
check_prev_add+0x3be/0x700
? __lock_acquire+0x4c6/0x1110
? trace_event_raw_event_lock+0xf0/0xf0
lock_commit_crosslock+0x40d/0x590
? lock_commit_crosslock+0x40d/0x590
complete+0x29/0x60
__kthread_parkme+0x54/0x80
kthread_parkme+0x24/0x40
smpboot_thread_fn+0x95/0x230
kthread+0x147/0x180
? sort_range+0x30/0x30
? kthread_create_on_node+0x40/0x40
ret_from_fork+0x2a/0x40
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
.config
On 2017-08-25 12:03:04 [+0200], Borislav Petkov wrote:
> ======================================================
> WARNING: possible circular locking dependency detected
> 4.13.0-rc6+ #1 Not tainted
> ------------------------------------------------------
While looking at this, I stumbled upon another one also enabled by
"completion annotation" in the TIP:
| ======================================================
| WARNING: possible circular locking dependency detected
| 4.13.0-rc6-00758-gd80d4177391f-dirty #112 Not tainted
| ------------------------------------------------------
| cpu-off.sh/426 is trying to acquire lock:
| ((complete)&st->done){+.+.}, at: [<ffffffff810cb344>] takedown_cpu+0x84/0xf0
|
| but task is already holding lock:
| (sparse_irq_lock){+.+.}, at: [<ffffffff811220f2>] irq_lock_sparse+0x12/0x20
|
| which lock already depends on the new lock.
|
| the existing dependency chain (in reverse order) is:
|
| -> #1 (sparse_irq_lock){+.+.}:
| __mutex_lock+0x88/0x9a0
| mutex_lock_nested+0x16/0x20
| irq_lock_sparse+0x12/0x20
| irq_affinity_online_cpu+0x13/0xd0
| cpuhp_invoke_callback+0x4a/0x130
|
| -> #0 ((complete)&st->done){+.+.}:
| check_prev_add+0x351/0x700
| __lock_acquire+0x114a/0x1220
| lock_acquire+0x47/0x70
| wait_for_completion+0x5c/0x180
| takedown_cpu+0x84/0xf0
| cpuhp_invoke_callback+0x4a/0x130
| cpuhp_down_callbacks+0x3d/0x80
…
|
| other info that might help us debug this:
|
| Possible unsafe locking scenario:
| CPU0 CPU1
| ---- ----
| lock(sparse_irq_lock);
| lock((complete)&st->done);
| lock(sparse_irq_lock);
| lock((complete)&st->done);
|
| *** DEADLOCK ***
We hold the sparse_irq_lock lock while waiting for the completion in the
CPU-down case and in the CPU-up case we acquire the sparse_irq_lock lock
while the other CPU is waiting for the completion.
This is not an issue if my interpretation of lockdep here is correct.
How do we annotate this?
Sebastian
On Fri, Aug 25, 2017 at 11:47 PM, Sebastian Andrzej Siewior
<[email protected]> wrote:
> On 2017-08-25 12:03:04 [+0200], Borislav Petkov wrote:
>> ======================================================
>> WARNING: possible circular locking dependency detected
>> 4.13.0-rc6+ #1 Not tainted
>> ------------------------------------------------------
>
> While looking at this, I stumbled upon another one also enabled by
> "completion annotation" in the TIP:
>
> | ======================================================
> | WARNING: possible circular locking dependency detected
> | 4.13.0-rc6-00758-gd80d4177391f-dirty #112 Not tainted
> | ------------------------------------------------------
> | cpu-off.sh/426 is trying to acquire lock:
> | ((complete)&st->done){+.+.}, at: [<ffffffff810cb344>] takedown_cpu+0x84/0xf0
> |
> | but task is already holding lock:
> | (sparse_irq_lock){+.+.}, at: [<ffffffff811220f2>] irq_lock_sparse+0x12/0x20
> |
> | which lock already depends on the new lock.
> |
> | the existing dependency chain (in reverse order) is:
> |
> | -> #1 (sparse_irq_lock){+.+.}:
> | __mutex_lock+0x88/0x9a0
> | mutex_lock_nested+0x16/0x20
> | irq_lock_sparse+0x12/0x20
> | irq_affinity_online_cpu+0x13/0xd0
> | cpuhp_invoke_callback+0x4a/0x130
> |
> | -> #0 ((complete)&st->done){+.+.}:
> | check_prev_add+0x351/0x700
> | __lock_acquire+0x114a/0x1220
> | lock_acquire+0x47/0x70
> | wait_for_completion+0x5c/0x180
> | takedown_cpu+0x84/0xf0
> | cpuhp_invoke_callback+0x4a/0x130
> | cpuhp_down_callbacks+0x3d/0x80
> …
> |
> | other info that might help us debug this:
> |
> | Possible unsafe locking scenario:
> | CPU0 CPU1
> | ---- ----
> | lock(sparse_irq_lock);
> | lock((complete)&st->done);
> | lock(sparse_irq_lock);
> | lock((complete)&st->done);
> |
> | *** DEADLOCK ***
>
> We hold the sparse_irq_lock lock while waiting for the completion in the
> CPU-down case and in the CPU-up case we acquire the sparse_irq_lock lock
> while the other CPU is waiting for the completion.
> This is not an issue if my interpretation of lockdep here is correct.
Hello Sebastian,
I think you parsed the message correctly.
The message is saying that, for example:
context A (maybe being up?)
--
lock(sparse_irq_lock) // wait for sparse_irq_lock in B to be released
complete(st->done) // impossible to hit here
context B (maybe wanting to synchronize with the cpu being up?)
--
lock(sparse_irq_lock) // acquired successfully
wait_for_completion(st->done) // wait for completion of st->done in A
unlock(sparse_irq_lock) // impossible to hit here
I cannot check the kernel code at the moment. I wonder whether this scenario is
really impossible. Could you confirm?
--
Thanks,
Byungchul
On Sat, 26 Aug 2017, Byungchul Park wrote:
> On Fri, Aug 25, 2017 at 11:47 PM, Sebastian Andrzej Siewior
> <[email protected]> wrote:
> > We hold the sparse_irq_lock lock while waiting for the completion in the
> > CPU-down case and in the CPU-up case we acquire the sparse_irq_lock lock
> > while the other CPU is waiting for the completion.
> > This is not an issue if my interpretation of lockdep here is correct.
>
> Hello Sebastian,
>
> I think you parsed the message correctly.
>
> The message is saying that, for example:
>
> context A (maybe being up?)
> --
> lock(sparse_irq_lock) // wait for sparse_irq_lock in B to be released
> complete(st->done) // impossible to hit here
>
> context B (maybe wanting to synchronize with the cpu being up?)
> --
> lock(sparse_irq_lock) // acquired successfully
> wait_for_completion(st->done) // wait for completion of st->done in A
> unlock(sparse_irq_lock) // impossible to hit here
>
> I cannot check the kernel code at the moment.. I wonder if this scenario is
> impossible. Could you answer it?
Yes, it's impossible because cpu hotplug is globally serialized. So the cpu
down scenario cannot happen in parallel with the cpu up scenario.
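For reference, the serialization in question looks roughly like this
(simplified sketch of kernel/cpu.c; error paths and the suspend variants
are elided):

/* Outer lock for CPU hotplug, taken by both the up and the down path. */
static DEFINE_MUTEX(cpu_add_remove_lock);

void cpu_maps_update_begin(void)
{
	mutex_lock(&cpu_add_remove_lock);
}

void cpu_maps_update_done(void)
{
	mutex_unlock(&cpu_add_remove_lock);
}

static int do_cpu_up(unsigned int cpu, enum cpuhp_state target)
{
	int err;

	cpu_maps_update_begin();	/* serializes against cpu_down() */
	err = _cpu_up(cpu, 0, target);
	cpu_maps_update_done();
	return err;
}

So an up and a down operation can never be in flight at the same time.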
Thanks,
tglx
On 2017-08-25 12:03:04 [+0200], Borislav Petkov wrote:
> Hey,
Hi Borislav,
> tglx says I have something for ya :-)
:)
> ======================================================
> WARNING: possible circular locking dependency detected
> 4.13.0-rc6+ #1 Not tainted
> ------------------------------------------------------
> watchdog/3/27 is trying to acquire lock:
> (cpu_hotplug_lock.rw_sem){++++}, at: [<ffffffff8100c489>] release_ds_buffers+0x29/0xd0
>
> but now in release context of a crosslock acquired at the following:
> ((complete)&self->parked){+.+.}, at: [<ffffffff810895f6>] kthread_park+0x46/0x60
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #2 ((complete)&self->parked){+.+.}:
> __lock_acquire+0x10af/0x1110
> lock_acquire+0xea/0x1f0
> wait_for_completion+0x3b/0x130
> kthread_park+0x46/0x60
> __smpboot_create_thread.part.5+0x7d/0xf0
> smpboot_register_percpu_thread_cpumask+0xa2/0x100
> spawn_ksoftirqd+0x3b/0x45
> do_one_initcall+0x52/0x198
> kernel_init_freeable+0x6f/0x1a1
> kernel_init+0xe/0x100
> ret_from_fork+0x2a/0x40
>
> -> #1 (smpboot_threads_lock){+.+.}:
> __lock_acquire+0x10af/0x1110
> lock_acquire+0xea/0x1f0
> __mutex_lock+0x6c/0x940
> mutex_lock_nested+0x1b/0x20
> smpboot_register_percpu_thread_cpumask+0x42/0x100
> spawn_ksoftirqd+0x3b/0x45
> do_one_initcall+0x52/0x198
> kernel_init_freeable+0x6f/0x1a1
> kernel_init+0xe/0x100
> ret_from_fork+0x2a/0x40
>
> -> #0 (cpu_hotplug_lock.rw_sem){++++}:
> cpus_read_lock+0x2a/0x90
> release_ds_buffers+0x29/0xd0
> x86_release_hardware+0x8f/0xa0
> hw_perf_event_destroy+0xe/0x20
> _free_event+0xa7/0x250
>
> other info that might help us debug this:
>
> Chain exists of:
> cpu_hotplug_lock.rw_sem --> smpboot_threads_lock --> (complete)&self->parked
>
> Possible unsafe locking scenario by crosslock:
>
> CPU0 CPU1
> ---- ----
> lock(smpboot_threads_lock);
> lock((complete)&self->parked);
> lock(cpu_hotplug_lock.rw_sem);
> unlock((complete)&self->parked);
>
> *** DEADLOCK ***
>
> 1 lock held by watchdog/3/27:
> #0: (&x->wait){....}, at: [<ffffffff810b115d>] complete+0x1d/0x60
>
> stack backtrace:
> CPU: 3 PID: 27 Comm: watchdog/3 Not tainted 4.13.0-rc6+ #1
> Hardware name: LENOVO 2320CTO/2320CTO, BIOS G2ET86WW (2.06 ) 11/13/2012
> Call Trace:
> dump_stack+0x86/0xcf
> print_circular_bug+0x1fa/0x2f0
> check_prev_add+0x3be/0x700
> ? __lock_acquire+0x4c6/0x1110
> ? trace_event_raw_event_lock+0xf0/0xf0
> lock_commit_crosslock+0x40d/0x590
> ? lock_commit_crosslock+0x40d/0x590
> complete+0x29/0x60
> __kthread_parkme+0x54/0x80
> kthread_parkme+0x24/0x40
> smpboot_thread_fn+0x95/0x230
> kthread+0x147/0x180
> ? sort_range+0x30/0x30
> ? kthread_create_on_node+0x40/0x40
> ret_from_fork+0x2a/0x40
I don't really get that one. I *think* it complains about
cpu_hotplug_lock.rw_sem, which was held while waiting for the ->parked
completion (this is smpboot_register_percpu_thread() -> kthread_park() ->
wait_for_completion()), *and* the other case where kthread_park() waits
for the thread (while holding cpu_hotplug_lock.rw_sem) while, out of
nowhere, release_ds_buffers() runs on that CPU and asks for
cpu_hotplug_lock.rw_sem. So something like this happens:
<...>-1 [002] .... 1.050786: kthread_park: watchdog/3
<...>-1 [002] .... 1.050794: kthread_park: wait… <- wait_for_completion()
<...>-27 [003] .... 1.050844: release_ds_buffers: before
<...>-27 [003] .... 1.050857: release_ds_buffers: after
<...>-27 [003] .... 1.050905: release_ds_buffers: released
<...>-27 [003] .... 1.050908: __kthread_parkme: do complete <- complete()
<...>-27 [003] d..1 1.050916: print_circular_bug:
<...>-1 [002] .... 1.459364: kthread_park: wait… done
We can change the timing by adding a printk to kthread_parkme() and then
lockdep won't see release_ds_buffers() within the wait_for_completion() section:
<...>-1 [003] .... 1.065596: kthread_park: watchdog/3
<...>-27 [003] .... 1.065657: release_ds_buffers: before
<...>-27 [003] .... 1.065669: release_ds_buffers: after
<...>-27 [003] .... 1.065718: release_ds_buffers: released
<...>-27 [003] .... 1.072390: __kthread_parkme: do complete <- complete
<...>-1 [003] .... 1.072402: kthread_park: wait… <- wait_for_completion()
<...>-1 [003] .... 1.072404: kthread_park: wait… done
Different timing, no complaints.
I don't think that this is a bug, because we can have multiple readers of
the rwsem, and *if* someone is holding the write lock then
release_ds_buffers() would simply be deferred until smpboot_.* and the
write-lock holder are done.
If my interpretation of the situation is correct, how do we teach this to
lockdep?
Sebastian
On Fri, Aug 25, 2017 at 04:47:55PM +0200, Sebastian Andrzej Siewior wrote:
> On 2017-08-25 12:03:04 [+0200], Borislav Petkov wrote:
> > ======================================================
> > WARNING: possible circular locking dependency detected
> > 4.13.0-rc6+ #1 Not tainted
> > ------------------------------------------------------
>
> While looking at this, I stumbled upon another one also enabled by
> "completion annotation" in the TIP:
>
> | ======================================================
> | WARNING: possible circular locking dependency detected
> | 4.13.0-rc6-00758-gd80d4177391f-dirty #112 Not tainted
> | ------------------------------------------------------
> | cpu-off.sh/426 is trying to acquire lock:
> | ((complete)&st->done){+.+.}, at: [<ffffffff810cb344>] takedown_cpu+0x84/0xf0
> |
> | but task is already holding lock:
> | (sparse_irq_lock){+.+.}, at: [<ffffffff811220f2>] irq_lock_sparse+0x12/0x20
> |
> | which lock already depends on the new lock.
> |
> | the existing dependency chain (in reverse order) is:
> |
> | -> #1 (sparse_irq_lock){+.+.}:
> | __mutex_lock+0x88/0x9a0
> | mutex_lock_nested+0x16/0x20
> | irq_lock_sparse+0x12/0x20
> | irq_affinity_online_cpu+0x13/0xd0
> | cpuhp_invoke_callback+0x4a/0x130
> |
> | -> #0 ((complete)&st->done){+.+.}:
> | check_prev_add+0x351/0x700
> | __lock_acquire+0x114a/0x1220
> | lock_acquire+0x47/0x70
> | wait_for_completion+0x5c/0x180
> | takedown_cpu+0x84/0xf0
> | cpuhp_invoke_callback+0x4a/0x130
> | cpuhp_down_callbacks+0x3d/0x80
> …
> |
> | other info that might help us debug this:
> |
> | Possible unsafe locking scenario:
> | CPU0 CPU1
> | ---- ----
> | lock(sparse_irq_lock);
> | lock((complete)&st->done);
> | lock(sparse_irq_lock);
> | lock((complete)&st->done);
> |
> | *** DEADLOCK ***
>
> We hold the sparse_irq_lock lock while waiting for the completion in the
> CPU-down case and in the CPU-up case we acquire the sparse_irq_lock lock
> while the other CPU is waiting for the completion.
> This is not an issue if my interpretation of lockdep here is correct.
>
> How do we annotate this?
Does something like so work?
---
include/linux/completion.h | 15 ++++++++++++---
kernel/kthread.c | 14 +++++++++++++-
kernel/sched/completion.c | 18 +++++++++++++-----
3 files changed, 38 insertions(+), 9 deletions(-)
diff --git a/include/linux/completion.h b/include/linux/completion.h
index 791f053f28b7..0eccd2d44c85 100644
--- a/include/linux/completion.h
+++ b/include/linux/completion.h
@@ -34,9 +34,9 @@ struct completion {
};
#ifdef CONFIG_LOCKDEP_COMPLETIONS
-static inline void complete_acquire(struct completion *x)
+static inline void complete_acquire(struct completion *x, int subclass)
{
- lock_acquire_exclusive((struct lockdep_map *)&x->map, 0, 0, NULL, _RET_IP_);
+ lock_acquire_exclusive((struct lockdep_map *)&x->map, subclass, 0, NULL, _RET_IP_);
}
static inline void complete_release(struct completion *x)
@@ -59,7 +59,7 @@ do { \
} while (0)
#else
#define init_completion(x) __init_completion(x)
-static inline void complete_acquire(struct completion *x) {}
+static inline void complete_acquire(struct completion *x, int subclass) {}
static inline void complete_release(struct completion *x) {}
static inline void complete_release_commit(struct completion *x) {}
#endif
@@ -132,6 +132,15 @@ static inline void reinit_completion(struct completion *x)
}
extern void wait_for_completion(struct completion *);
+
+#ifndef CONFIG_LOCKDEP
+static inline void
+wait_for_completion_nested(struct completion *x, int subclass)
+{
+ wait_for_completion(x);
+}
+#endif
+
extern void wait_for_completion_io(struct completion *);
extern int wait_for_completion_interruptible(struct completion *x);
extern int wait_for_completion_killable(struct completion *x);
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 26db528c1d88..6092702dd908 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -485,7 +485,19 @@ int kthread_park(struct task_struct *k)
set_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
if (k != current) {
wake_up_process(k);
- wait_for_completion(&kthread->parked);
+ /*
+ * CPU-UP CPU-DOWN
+ *
+ * cpu_hotplug_lock
+ * wait_for_completion()
+ * cpu_hotplug_lock
+ * complete()
+ *
+ * Which normally spells deadlock, except of course
+ * that up and down are globally serialized so the
+ * above cannot in fact happen concurrently.
+ */
+ wait_for_completion_nested(&kthread->parked, 1);
}
}
diff --git a/kernel/sched/completion.c b/kernel/sched/completion.c
index cc873075c3bd..18ca9b7ef677 100644
--- a/kernel/sched/completion.c
+++ b/kernel/sched/completion.c
@@ -101,11 +101,11 @@ do_wait_for_common(struct completion *x,
static inline long __sched
__wait_for_common(struct completion *x,
- long (*action)(long), long timeout, int state)
+ long (*action)(long), long timeout, int state, int subclass)
{
might_sleep();
- complete_acquire(x);
+ complete_acquire(x, subclass);
spin_lock_irq(&x->wait.lock);
timeout = do_wait_for_common(x, action, timeout, state);
@@ -117,9 +117,9 @@ __wait_for_common(struct completion *x,
}
static long __sched
-wait_for_common(struct completion *x, long timeout, int state)
+wait_for_common(struct completion *x, long timeout, int state, int subclass)
{
- return __wait_for_common(x, schedule_timeout, timeout, state);
+ return __wait_for_common(x, schedule_timeout, timeout, state, subclass);
}
static long __sched
@@ -140,10 +140,18 @@ wait_for_common_io(struct completion *x, long timeout, int state)
*/
void __sched wait_for_completion(struct completion *x)
{
- wait_for_common(x, MAX_SCHEDULE_TIMEOUT, TASK_UNINTERRUPTIBLE);
+ wait_for_common(x, MAX_SCHEDULE_TIMEOUT, TASK_UNINTERRUPTIBLE, 0);
}
EXPORT_SYMBOL(wait_for_completion);
+#ifdef CONFIG_LOCKDEP
+void __sched wait_for_completion_nested(struct completion *x, int subclass)
+{
+ wait_for_common(x, MAX_SCHEDULE_TIMEOUT, TASK_UNINTERRUPTIBLE, subclass);
+}
+EXPORT_SYMBOL(wait_for_completion_nested);
+#endif
+
/**
* wait_for_completion_timeout: - waits for completion of a task (w/timeout)
* @x: holds the state of this particular completion
On Fri, Aug 25, 2017 at 04:47:55PM +0200, Sebastian Andrzej Siewior wrote:
> On 2017-08-25 12:03:04 [+0200], Borislav Petkov wrote:
> | ======================================================
> | WARNING: possible circular locking dependency detected
> | 4.13.0-rc6-00758-gd80d4177391f-dirty #112 Not tainted
> | ------------------------------------------------------
> | cpu-off.sh/426 is trying to acquire lock:
> | ((complete)&st->done){+.+.}, at: [<ffffffff810cb344>] takedown_cpu+0x84/0xf0
> |
> | but task is already holding lock:
> | (sparse_irq_lock){+.+.}, at: [<ffffffff811220f2>] irq_lock_sparse+0x12/0x20
> |
> | which lock already depends on the new lock.
> |
> | the existing dependency chain (in reverse order) is:
> |
> | -> #1 (sparse_irq_lock){+.+.}:
> | __mutex_lock+0x88/0x9a0
> | mutex_lock_nested+0x16/0x20
> | irq_lock_sparse+0x12/0x20
> | irq_affinity_online_cpu+0x13/0xd0
> | cpuhp_invoke_callback+0x4a/0x130
> |
> | -> #0 ((complete)&st->done){+.+.}:
> | check_prev_add+0x351/0x700
> | __lock_acquire+0x114a/0x1220
> | lock_acquire+0x47/0x70
> | wait_for_completion+0x5c/0x180
> | takedown_cpu+0x84/0xf0
> | cpuhp_invoke_callback+0x4a/0x130
> | cpuhp_down_callbacks+0x3d/0x80
> …
> |
> | other info that might help us debug this:
> |
> | Possible unsafe locking scenario:
> | CPU0 CPU1
> | ---- ----
> | lock(sparse_irq_lock);
> | lock((complete)&st->done);
> | lock(sparse_irq_lock);
> | lock((complete)&st->done);
> |
> | *** DEADLOCK ***
>
> We hold the sparse_irq_lock lock while waiting for the completion in the
> CPU-down case and in the CPU-up case we acquire the sparse_irq_lock lock
> while the other CPU is waiting for the completion.
> This is not an issue if my interpretation of lockdep here is correct.
>
> How do we annotate this?
Right, so I'm also seeing this one on CPU offline. It's not making sense to
me: wait_for_completion() cannot be #0, we don't hold any other locks
(except the scheduler locks) while we have the complete_acquire() thing.
Now, I have enough hackery here to make it go away, but I'm not in fact
sure I understand things yet :-(
On Fri, Aug 25, 2017 at 12:03:04PM +0200, Borislav Petkov wrote:
> Hey,
>
> tglx says I have something for ya :-)
>
> ======================================================
> WARNING: possible circular locking dependency detected
> 4.13.0-rc6+ #1 Not tainted
> ------------------------------------------------------
> watchdog/3/27 is trying to acquire lock:
> (cpu_hotplug_lock.rw_sem){++++}, at: [<ffffffff8100c489>] release_ds_buffers+0x29/0xd0
>
> but now in release context of a crosslock acquired at the following:
> ((complete)&self->parked){+.+.}, at: [<ffffffff810895f6>] kthread_park+0x46/0x60
So I'm thinking this one is an actual deadlock.
So, as far as I can tell this ends up being:
CPU0 CPU1
(smpboot_register_percpu_thread_cpumask)
get_online_cpus()
__smpboot_create_thread()
kthread_park();
wait_for_completion(&X)
(smpboot_thread_fn)
->park() := watchdog_disable()
watchdog_nmi_disable()
perf_event_release_kernel();
put_event()
_free_event()
->destroy() := hw_perf_event_destroy()
x86_release_hardware()
release_ds_buffers()
get_online_cpus()
kthread_parkme()
complete(&X)
So CPU0 holds cpu_hotplug_lock while wait_for_completion() and CPU1
needs to acquire it before complete().
So if, in between, CPU2 does down_write(), things will get stuck.
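I.e. something like this (assuming the usual percpu-rwsem semantics, where
new readers block once a writer is pending):

CPU0: percpu_down_read(&cpu_hotplug_lock)	// get_online_cpus()
      wait_for_completion(&X)			// waits for CPU1
CPU2: percpu_down_write(&cpu_hotplug_lock)	// blocks on CPU0's reader,
						// blocks all new readers
CPU1: percpu_down_read(&cpu_hotplug_lock)	// blocks behind CPU2
      complete(&X)				// never reached

and nothing makes progress anymore.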
What's worse, there's also:
cpus_write_lock()
...
takedown_cpu()
smpboot_park_threads()
smpboot_park_thread()
kthread_park()
->park() := watchdog_disable()
watchdog_nmi_disable()
perf_event_release_kernel();
put_event()
_free_event()
->destroy() := hw_perf_event_destroy()
x86_release_hardware()
release_ds_buffers()
get_online_cpus()
which as far as I can tell, spells instant deadlock..
On Mon, Aug 28, 2017 at 04:58:08PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 25, 2017 at 12:03:04PM +0200, Borislav Petkov wrote:
> > Hey,
> >
> > tglx says I have something for ya :-)
> >
> > ======================================================
> > WARNING: possible circular locking dependency detected
> > 4.13.0-rc6+ #1 Not tainted
> > ------------------------------------------------------
> > watchdog/3/27 is trying to acquire lock:
> > (cpu_hotplug_lock.rw_sem){++++}, at: [<ffffffff8100c489>] release_ds_buffers+0x29/0xd0
> >
> > but now in release context of a crosslock acquired at the following:
> > ((complete)&self->parked){+.+.}, at: [<ffffffff810895f6>] kthread_park+0x46/0x60
>
>
> So I'm thinking this one is an actual deadlock.
>
> So, as far as I can tell this ends up being:
>
> CPU0 CPU1
>
> (smpboot_register_percpu_thread_cpumask)
>
> get_online_cpus()
> __smpboot_create_thread()
> kthread_park();
> wait_for_completion(&X)
>
>
> (smpboot_thread_fn)
>
> ->park() := watchdog_disable()
> watchdog_nmi_disable()
> perf_event_release_kernel();
> put_event()
> _free_event()
> ->destroy() := hw_perf_event_destroy()
> x86_release_hardware()
> release_ds_buffers()
> get_online_cpus()
>
>
> kthread_parkme()
> complete(&X)
>
>
>
> So CPU0 holds cpu_hotplug_lock while wait_for_completion() and CPU1
> needs to acquire it before complete().
>
> So if, in between, CPU2 does down_write(), things will get stuck.
>
> What's worse, there's also:
>
> cpus_write_lock()
> ...
> takedown_cpu()
> smpboot_park_threads()
> smpboot_park_thread()
> kthread_park()
> ->park() := watchdog_disable()
> watchdog_nmi_disable()
> perf_event_release_kernel();
> put_event()
> _free_event()
> ->destroy() := hw_perf_event_destroy()
> x86_release_hardware()
> release_ds_buffers()
> get_online_cpus()
>
> which as far as I can tell, spells instant deadlock..
Aah, but that latter will never happen, because each CPU's watchdog event
holds a reference on &pmc_refcount and we can't unplug _all_ CPUs.
So the first one will only ever happen on boot, where we park() the very
first watchdog thread; that is a potential deadlock, but it won't happen
because nobody is around to do down_write() just yet.
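For reference, the refcounting in question, simplified from
arch/x86/events/core.c:

static atomic_t pmc_refcount;
static DEFINE_MUTEX(pmc_reserve_mutex);

void x86_release_hardware(void)
{
	/*
	 * Only the last event to go away drops the final reference, and
	 * only then do we end up in release_ds_buffers(), which takes
	 * get_online_cpus().
	 */
	if (atomic_dec_and_mutex_lock(&pmc_refcount, &pmc_reserve_mutex)) {
		release_pmc_hardware();
		release_ds_buffers();
		mutex_unlock(&pmc_reserve_mutex);
	}
}

Every online CPU's watchdog event holds one of those references, which is
why offlining alone can never drop the last one.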
argh!
On Mon, Aug 28, 2017 at 05:06:17PM +0200, Peter Zijlstra wrote:
> On Mon, Aug 28, 2017 at 04:58:08PM +0200, Peter Zijlstra wrote:
> > On Fri, Aug 25, 2017 at 12:03:04PM +0200, Borislav Petkov wrote:
> > > Hey,
> > >
> > > tglx says I have something for ya :-)
> > >
> > > ======================================================
> > > WARNING: possible circular locking dependency detected
> > > 4.13.0-rc6+ #1 Not tainted
> > > ------------------------------------------------------
> > > watchdog/3/27 is trying to acquire lock:
> > > (cpu_hotplug_lock.rw_sem){++++}, at: [<ffffffff8100c489>] release_ds_buffers+0x29/0xd0
> > >
> > > but now in release context of a crosslock acquired at the following:
> > > ((complete)&self->parked){+.+.}, at: [<ffffffff810895f6>] kthread_park+0x46/0x60
> >
> >
> > So I'm thinking this one is an actual deadlock.
> >
> > So, as far as I can tell this ends up being:
> >
> > CPU0 CPU1
> >
> > (smpboot_register_percpu_thread_cpumask)
> >
> > get_online_cpus()
> > __smpboot_create_thread()
> > kthread_park();
> > wait_for_completion(&X)
> >
> >
> > (smpboot_thread_fn)
> >
> > ->park() := watchdog_disable()
> > watchdog_nmi_disable()
> > perf_event_release_kernel();
> > put_event()
> > _free_event()
> > ->destroy() := hw_perf_event_destroy()
> > x86_release_hardware()
> > release_ds_buffers()
> > get_online_cpus()
> >
> >
> > kthread_parkme()
> > complete(&X)
> >
> >
> >
> > So CPU0 holds cpu_hotplug_lock while wait_for_completion() and CPU1
> > needs to acquire it before complete().
> >
> > So if, in between, CPU2 does down_write(), things will get stuck.
> >
> > What's worse, there's also:
> >
> > cpus_write_lock()
> > ...
> > takedown_cpu()
> > smpboot_park_threads()
> > smpboot_park_thread()
> > kthread_park()
> > ->park() := watchdog_disable()
> > watchdog_nmi_disable()
> > perf_event_release_kernel();
> > put_event()
> > _free_event()
> > ->destroy() := hw_perf_event_destroy()
> > x86_release_hardware()
> > release_ds_buffers()
> > get_online_cpus()
> >
> > which as far as I can tell, spells instant deadlock..
>
> Aah, but that latter will never happen.. because each CPU will have a
> &pmc_refcount and we can't unplug _all_ CPUs.
>
> So the first one will only ever happen on boot, where we park() the very
> first watchdog thread and is a potential deadlock, but won't happen
> because nobody is around to do down_write() just yet.
I suspect, however, that it is possible to interleave:
sysctl.kernel.nmi_watchdog = 0
proc_watchdog_common()
with:
hot un-plug
watchdog_disable()
to tickle that exact problem. Just needs a bit of luck.
On Mon, 28 Aug 2017, Peter Zijlstra wrote:
> What's worse, there's also:
>
> cpus_write_lock()
> ...
> takedown_cpu()
> smpboot_park_threads()
> smpboot_park_thread()
> kthread_park()
> ->park() := watchdog_disable()
> watchdog_nmi_disable()
> perf_event_release_kernel();
> put_event()
> _free_event()
> ->destroy() := hw_perf_event_destroy()
> x86_release_hardware()
> release_ds_buffers()
> get_online_cpus()
>
> which as far as I can tell, spells instant deadlock..
Yes, it does if the destroyed event has the last reference on pmc_refcount.
All it needs for that is to shutdown the watchdog on CPU0 via the sysctl
cpumask and then offline all other CPUs. Works^Wdeadlocks like a charm.
That's not a new deadlock, it's been there forever. Just now lockdep tells
us that it's a potential deadlock _before_ we actually hit it.
The user space interface one has been there as well before the cpu lock
rework. That one was not covered by lockdep either.
None of this is easy to fixup. I started to tackle the unholy mess in the
watchdog code, but now I'm stuck in yet another circular dependency hell.
One solution I'm looking into right now is to reverse the lock order and
actually make the hotplug code do:
watchdog_lock();
cpu_write_lock();
....
cpu_write_unlock();
watchdog_unlock();
and get rid of cpu_read_(un)lock() in the sysctl interface completely. I
know it's ugly, but we have other locks we take in the hotplug path as
well.
That solves that part of the issue, but it does not solve the
release_ds_buffers() problem. Though with the watchdog_lock() mechanism, it
allows me to do:
->park() := watchdog_disable()
perf_event_disable(percpuevt);
cleanup_event = percpuevt;
percpuevt = NULL;
and then
watchdog_unlock()
if (cleanup_event) {
perf_event_release_kernel(cleanup_event);
cleanup_event = NULL;
}
mutex_unlock(&watchdog_mutex);
That should do the trick nicely for both user space functions and the cpu
hotplug machinery.
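In C the idea would look roughly like this (hypothetical sketch with
made-up names, just to illustrate the ordering, not an actual patch):

static DEFINE_MUTEX(watchdog_mutex);
static DEFINE_PER_CPU(struct perf_event *, watchdog_ev);
static struct perf_event *cleanup_event;

static void watchdog_lock(void)
{
	mutex_lock(&watchdog_mutex);
}

static void watchdog_unlock(void)
{
	/*
	 * With the lock order watchdog_mutex -> cpu_hotplug_lock the
	 * release path may take cpus_read_lock() again, via
	 * release_ds_buffers(), without deadlocking; in the hotplug path
	 * we only get here after cpu_write_unlock().
	 */
	if (cleanup_event) {
		perf_event_release_kernel(cleanup_event);
		cleanup_event = NULL;
	}
	mutex_unlock(&watchdog_mutex);
}

/* ->park() path, called with watchdog_mutex held */
static void watchdog_disable(unsigned int cpu)
{
	struct perf_event *event = per_cpu(watchdog_ev, cpu);

	if (event) {
		perf_event_disable(event);
		per_cpu(watchdog_ev, cpu) = NULL;
		cleanup_event = event;	/* released in watchdog_unlock() */
	}
}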
Though it's quite a rewrite of that mess, which is particularly non-trivial
because that extra non-perf implementation in arch/powerpc, which has its
own NMI watchdog thingy, wants its calls preserved. But AFAICT so far it
should just work. Famous last words....
Thoughts?
Thanks,
tglx
On Fri, Aug 25, 2017 at 04:47:55PM +0200, Sebastian Andrzej Siewior wrote:
> On 2017-08-25 12:03:04 [+0200], Borislav Petkov wrote:
> > ======================================================
> > WARNING: possible circular locking dependency detected
> > 4.13.0-rc6+ #1 Not tainted
> > ------------------------------------------------------
>
> While looking at this, I stumbled upon another one also enabled by
> "completion annotation" in the TIP:
>
> | ======================================================
> | WARNING: possible circular locking dependency detected
> | 4.13.0-rc6-00758-gd80d4177391f-dirty #112 Not tainted
> | ------------------------------------------------------
> | cpu-off.sh/426 is trying to acquire lock:
> | ((complete)&st->done){+.+.}, at: [<ffffffff810cb344>] takedown_cpu+0x84/0xf0
> |
> | but task is already holding lock:
> | (sparse_irq_lock){+.+.}, at: [<ffffffff811220f2>] irq_lock_sparse+0x12/0x20
> |
> | which lock already depends on the new lock.
> |
> | the existing dependency chain (in reverse order) is:
> |
> | -> #1 (sparse_irq_lock){+.+.}:
> | __mutex_lock+0x88/0x9a0
> | mutex_lock_nested+0x16/0x20
> | irq_lock_sparse+0x12/0x20
> | irq_affinity_online_cpu+0x13/0xd0
> | cpuhp_invoke_callback+0x4a/0x130
> |
> | -> #0 ((complete)&st->done){+.+.}:
> | check_prev_add+0x351/0x700
> | __lock_acquire+0x114a/0x1220
> | lock_acquire+0x47/0x70
> | wait_for_completion+0x5c/0x180
> | takedown_cpu+0x84/0xf0
> | cpuhp_invoke_callback+0x4a/0x130
> | cpuhp_down_callbacks+0x3d/0x80
> …
> |
> | other info that might help us debug this:
> |
> | Possible unsafe locking scenario:
> | CPU0 CPU1
> | ---- ----
> | lock(sparse_irq_lock);
> | lock((complete)&st->done);
> | lock(sparse_irq_lock);
> | lock((complete)&st->done);
> |
> | *** DEADLOCK ***
>
> We hold the sparse_irq_lock lock while waiting for the completion in the
> CPU-down case and in the CPU-up case we acquire the sparse_irq_lock lock
> while the other CPU is waiting for the completion.
> This is not an issue if my interpretation of lockdep here is correct.
>
> How do we annotate this?
The below is the nicest thing I could come up with. This results in
_cpu_down and _cpu_up having a different class for st->done.
It's a bit weird, and would probably need a wee comment to explain
things, but it boots and avoids the splat on hotplug.
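The trick is that, with the completion annotations, every init_completion()
call site gets its own lock class. Roughly, the annotated macro in the TIP
tree looks like this (quoted from memory, details may differ):

#define init_completion(x)						\
do {									\
	static struct lock_class_key __key;				\
	lockdep_init_map_crosslock((struct lockdep_map *)&(x)->map,	\
				   "(complete)" #x, &__key, 0);		\
	__init_completion(x);						\
} while (0)

Moving the init_completion() from cpuhp_create() into _cpu_down() and
_cpu_up() therefore gives each path its own class for st->done.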
---
kernel/cpu.c | 18 ++++++++----------
1 file changed, 8 insertions(+), 10 deletions(-)
diff --git a/kernel/cpu.c b/kernel/cpu.c
index acf5308fad51..d93df21c5cfb 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -375,13 +375,6 @@ static int cpuhp_up_callbacks(unsigned int cpu, struct cpuhp_cpu_state *st,
/*
* The cpu hotplug threads manage the bringup and teardown of the cpus
*/
-static void cpuhp_create(unsigned int cpu)
-{
- struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
-
- init_completion(&st->done);
-}
-
static int cpuhp_should_run(unsigned int cpu)
{
struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
@@ -520,7 +513,6 @@ static int cpuhp_kick_ap_work(unsigned int cpu)
static struct smp_hotplug_thread cpuhp_threads = {
.store = &cpuhp_state.thread,
- .create = &cpuhp_create,
.thread_should_run = cpuhp_should_run,
.thread_fn = cpuhp_thread_fun,
.thread_comm = "cpuhp/%u",
@@ -687,7 +679,7 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,
enum cpuhp_state target)
{
struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
- int prev_state, ret = 0;
+ int i, prev_state, ret = 0;
if (num_online_cpus() == 1)
return -EBUSY;
@@ -697,6 +689,9 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,
cpus_write_lock();
+ for_each_possible_cpu(i)
+ init_completion(&per_cpu_ptr(&cpuhp_state, i)->done);
+
cpuhp_tasks_frozen = tasks_frozen;
prev_state = st->state;
@@ -802,10 +797,13 @@ static int _cpu_up(unsigned int cpu, int tasks_frozen, enum cpuhp_state target)
{
struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
struct task_struct *idle;
- int ret = 0;
+ int i, ret = 0;
cpus_write_lock();
+ for_each_possible_cpu(i)
+ init_completion(&per_cpu_ptr(&cpuhp_state, i)->done);
+
if (!cpu_present(cpu)) {
ret = -EINVAL;
goto out;
On Tue, Aug 29, 2017 at 07:40:44PM +0200, Thomas Gleixner wrote:
> One solution I'm looking into right now is to reverse the lock order and
> actually make the hotplug code do:
>
> watchdog_lock();
> cpu_write_lock();
>
> ....
> cpu_write_unlock();
> watchdog_unlock();
>
> and get rid of cpu_read_(un)lock() in the sysctl interface completely. I
> know it's ugly, but we have other locks we take in the hotplug path as
> well.
This is to serialize the sysctl against hotplug? I'm not immediately
seeing why watchdog_lock needs to be the outermost lock; is that
because of vfs locks or something?
> That solves that part of the issue, but it does not solve the
> release_ds_buffers() problem. Though with the watchdog_lock() mechanism, it
> allows me to do:
>
> ->park() := watchdog_disable()
> perf_event_disable(percpuevt);
> cleanup_event = percpuevt;
> percpuevt = NULL;
> and then
>
> watchdog_unlock()
> if (cleanup_event) {
> perf_event_release_ebent(cleanup_event);
> cleanup_event = NULL;
> }
> mutex_unlock(&watchdog_mutex);
>
> That should do the trick nicely for both user space functions and the cpu
> hotplug machinery.
>
> Though it's quite a rewrite of that mess, which is particularly non trivial
> because that extra non perf implementation in arch/powerpc which has its
> own NMI watchdog thingy wants its calls preserved. But AFAICT so far it
> should just work. Famous last words....
>
> Thoughts?
So I have a patch _somewhere_ that preserves the event<->cpu relation
across hotplug, so that disable/enable would be sufficient. If you want I
can try and dig that out and make it work again.
That would avoid having to do the destroy/create cycle of the watchdog
events.
On Tue, 29 Aug 2017, Peter Zijlstra wrote:
> On Tue, Aug 29, 2017 at 07:40:44PM +0200, Thomas Gleixner wrote:
>
> > One solution I'm looking into right now is to reverse the lock order and
> > actually make the hotplug code do:
> >
> > watchdog_lock();
> > cpu_write_lock();
> >
> > ....
> > cpu_write_unlock();
> > watchdog_unlock();
> >
> > and get rid of cpu_read_(un)lock() in the sysctl interface completely. I
> > know it's ugly, but we have other locks we take in the hotplug path as
> > well.
>
> This is to serialize the sysctl against hotplug? I'm not immediately
> seeing why watchdog_lock needs to be the outer most lock, is that
> because of vfs locks or something?
Well, the watchdog sysctls serialization today is:
cpus_read_lock();
mutex_lock(&watchdog_mutex);
do_stuff()
access -> online_cpu_mask
do_stuff()
...
...
cpus_read_unlock();
So we need
watchdog_mutex -> cpuhotplug_rwsem
lock order all over the place.
> > Though it's quite a rewrite of that mess, which is particularly non trivial
> > because that extra non perf implementation in arch/powerpc which has its
> > own NMI watchdog thingy wants its calls preserved. But AFAICT so far it
> > should just work. Famous last words....
> >
> > Thoughts?
>
> So I have a patch _somewhere_ that preserves the event<->cpu relation
> across hotplug and disable/enable would be sufficient. If you want I can
> try and dig that out and make it work again.
>
> That would avoid having to do the destroy/create cycle of the watchdog
> events.
Yes, that would solve the x86_release_hardware() issue, but lots of the
other rework is still required in one way or the other.
I'm currently trying to avoid that extra lock mess in the cpu hotplug code,
which would just open the door for everybody to add his extra locks there,
so we end up taking a gazillion locks before we can hotplug :)
I think I have an idea how to solve that cleanly, but certainly your offer
of preserving the event<->cpu relation across hotplug would help
tremendously.
Thanks,
tglx
On Tue, Aug 29, 2017 at 10:10:37PM +0200, Thomas Gleixner wrote:
> On Tue, 29 Aug 2017, Peter Zijlstra wrote:
> > So I have a patch _somewhere_ that preserves the event<->cpu relation
> > across hotplug and disable/enable would be sufficient. If you want I can
> > try and dig that out and make it work again.
> >
> > That would avoid having to do the destroy/create cycle of the watchdog
> > events.
>
> Yes, that would solve the x86_release_hw() issue, but still lots of the
> other rework is required in one way or the other.
>
> I'm currently trying to avoid that extra lock mess in the cpu hotplug code,
> which would just open the door for everybody to add his extra locks there,
> so we end up taking a gazillion locks before we can hotplug :)
>
> I think I have an idea how to solve that cleanly, but certainly your offer
> of preserving the event - cpu relation accross hotplug would help
> tremendously.
I think something like the below ought to work. Compile tested only.
On offline it basically does perf_event_disable() for all CPU context
events, and then adds HOTPLUG_OFFSET (-32) to arrive at: OFF +
HOTPLUG_OFFSET = -33.
That's smaller than ERROR and thus perf_event_enable() no-ops on events
for offline CPUs (maybe we should try and plumb an error return for
IOC_ENABLE).
On online we subtract the HOTPLUG_OFFSET again and the event becomes a
regular OFF, after which perf_event_enable() should work again.
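In other words (enum values as in perf_event.h; the enable-side check is
quoted from memory, so details may differ):

/*
 *   PERF_EVENT_STATE_ERROR = -2
 *   PERF_EVENT_STATE_OFF   = -1
 *
 * offline: OFF + HOTPLUG_OFFSET = -1 + -32 = -33   (below ERROR)
 * online:  -33 - HOTPLUG_OFFSET = -1 = OFF         (enable works again)
 */
if (event->state >= PERF_EVENT_STATE_INACTIVE ||
    event->state <  PERF_EVENT_STATE_ERROR) {
	raw_spin_unlock_irq(&ctx->lock);
	return;		/* perf_event_enable() no-ops while the CPU is offline */
}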
---
include/linux/perf_event.h | 2 ++
kernel/events/core.c | 51 +++++++++++++++++++++++++++++++++-------------
2 files changed, 39 insertions(+), 14 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 9bac4bfa5e1a..7b39ceeb206b 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -497,6 +497,8 @@ enum perf_event_active_state {
PERF_EVENT_STATE_OFF = -1,
PERF_EVENT_STATE_INACTIVE = 0,
PERF_EVENT_STATE_ACTIVE = 1,
+
+ PERF_EVENT_STATE_HOTPLUG_OFFSET = -32,
};
struct file;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f77c97477e08..b277c27fd81e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -11025,19 +11025,21 @@ void perf_swevent_init_cpu(unsigned int cpu)
}
#if defined CONFIG_HOTPLUG_CPU || defined CONFIG_KEXEC_CORE
-static void __perf_event_exit_context(void *__info)
+static void __perf_event_exit_cpu(void *__info)
{
- struct perf_event_context *ctx = __info;
- struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
+ struct perf_cpu_context *cpuctx = __info;
+ struct perf_event_context *ctx = &cpuctx->ctx;
struct perf_event *event;
raw_spin_lock(&ctx->lock);
- list_for_each_entry(event, &ctx->event_list, event_entry)
- __perf_remove_from_context(event, cpuctx, ctx, (void *)DETACH_GROUP);
+ list_for_each_entry(event, &ctx->event_list, event_entry) {
+ __perf_event_disable(event, cpuctx, ctx, NULL);
+ event->state += PERF_EVENT_STATE_HOTPLUG_OFFSET;
+ }
raw_spin_unlock(&ctx->lock);
}
-static void perf_event_exit_cpu_context(int cpu)
+int perf_event_exit_cpu(unsigned int cpu)
{
struct perf_cpu_context *cpuctx;
struct perf_event_context *ctx;
@@ -11049,17 +11051,43 @@ static void perf_event_exit_cpu_context(int cpu)
ctx = &cpuctx->ctx;
mutex_lock(&ctx->mutex);
- smp_call_function_single(cpu, __perf_event_exit_context, ctx, 1);
+ smp_call_function_single(cpu, __perf_event_exit_cpu, cpuctx, 1);
cpuctx->online = 0;
mutex_unlock(&ctx->mutex);
}
cpumask_clear_cpu(cpu, perf_online_mask);
mutex_unlock(&pmus_lock);
+
+ return 0;
+}
+
+static void __perf_event_init_cpu(void *__info)
+{
+ struct perf_cpu_context *cpuctx = __info;
+ struct perf_event_context *ctx = &cpuctx->ctx;
+ struct perf_event *event;
+
+ raw_spin_lock(&ctx->lock);
+ list_for_each_entry(event, &ctx->event_list, event_entry)
+ event->state -= PERF_EVENT_STATE_HOTPLUG_OFFSET;
+ raw_spin_unlock(&ctx->lock);
+}
+
+static void _perf_event_init_cpu(int cpu, struct perf_cpu_context *cpuctx)
+{
+ smp_call_function_single(cpu, __perf_event_init_cpu, cpuctx, 1);
}
+
#else
-static void perf_event_exit_cpu_context(int cpu) { }
+int perf_event_exit_cpu(unsigned int cpu)
+{
+ return 0;
+}
+static void _perf_event_init_cpu(int cpu, struct perf_cpu_context *cpuctx)
+{
+}
#endif
int perf_event_init_cpu(unsigned int cpu)
@@ -11078,6 +11106,7 @@ int perf_event_init_cpu(unsigned int cpu)
mutex_lock(&ctx->mutex);
cpuctx->online = 1;
+ _perf_event_init_cpu(cpu, cpuctx);
mutex_unlock(&ctx->mutex);
}
mutex_unlock(&pmus_lock);
@@ -11085,12 +11114,6 @@ int perf_event_init_cpu(unsigned int cpu)
return 0;
}
-int perf_event_exit_cpu(unsigned int cpu)
-{
- perf_event_exit_cpu_context(cpu);
- return 0;
-}
-
static int
perf_reboot(struct notifier_block *notifier, unsigned long val, void *v)
{
On Wed, 30 Aug 2017, Peter Zijlstra wrote:
> On offline it basically does perf_event_disable() for all CPU context
> events, and then adds HOTPLUG_OFFSET (-32) to arrive at: OFF +
> HOTPLUG_OFFSET = -33.
>
> That's smaller than ERROR and thus perf_event_enable() no-ops on events
> for offline CPUs (maybe we should try and plumb an error return for
> IOC_ENABLE).
>
> On online we subtract the HOTPLUG_OFFSET again and the event becomes a
> regular OFF, after which perf_event_enable() should work again.
I haven't gotten around to testing that, as I was busy cleaning up the
unholy mess in the watchdog code.
One other thing I stumbled over is:
perf_event_create()
....
x86_hw_reserve(event)
if (__x86_pmu_event_init(event) < 0)
event->destroy(event);
x86_hw_release()
....
cpus_read_lock();
If that happens from a hotplug function, we are doomed.
I mean, that particular watchdog event won't fail if the watchdog code
verifies that already at init time (which it soon will), but in general
event creation during hotplug is dangerous.
Thanks,
tglx
On Thu, Aug 31, 2017 at 09:08:05AM +0200, Thomas Gleixner wrote:
> On Wed, 30 Aug 2017, Peter Zijlstra wrote:
> > On offline it basically does perf_event_disable() for all CPU context
> > events, and then adds HOTPLUG_OFFSET (-32) to arrive at: OFF +
> > HOTPLUG_OFFSET = -33.
> >
> > That's smaller than ERROR and thus perf_event_enable() no-ops on events
> > for offline CPUs (maybe we should try and plumb an error return for
> > IOC_ENABLE).
> >
> > On online we subtract the HOTPLUG_OFFSET again and the event becomes a
> > regular OFF, after which perf_event_enable() should work again.
>
> I haven't come around to test that as I was busy cleaning up the unholy
> mess in the watchdog code.
>
> One other thing I stumbled over is:
>
> perf_event_create()
> ....
> x86_hw_reserve(event)
>
> if (__x86_pmu_event_init(event) < 0)
> event->destroy(event);
> x86_hw_release()
> ....
> cpus_read_lock();
>
> If that happens from a hotplug function, we are doomed.
>
> I mean, that particular watchdog event won't fail if the watchdog code
> would verify that already at init time (which it does soon), but in general
> event creation during hotplug is dangerous.
Arghh!!!
And allowing us to create events for offline CPUs (possible I think, but
maybe slightly tricky) won't solve that, because we're already holding
the hotplug_lock during PREPARE.
I'll try and think...
On Thu, 31 Aug 2017, Peter Zijlstra wrote:
> On Thu, Aug 31, 2017 at 09:08:05AM +0200, Thomas Gleixner wrote:
> > On Wed, 30 Aug 2017, Peter Zijlstra wrote:
> > > On offline it basically does perf_event_disable() for all CPU context
> > > events, and then adds HOTPLUG_OFFSET (-32) to arrive at: OFF +
> > > HOTPLUG_OFFSET = -33.
> > >
> > > That's smaller than ERROR and thus perf_event_enable() no-ops on events
> > > for offline CPUs (maybe we should try and plumb an error return for
> > > IOC_ENABLE).
> > >
> > > On online we subtract the HOTPLUG_OFFSET again and the event becomes a
> > > regular OFF, after which perf_event_enable() should work again.
> >
> > I haven't come around to test that as I was busy cleaning up the unholy
> > mess in the watchdog code.
> >
> > One other thing I stumbled over is:
> >
> > perf_event_create()
> > ....
> > x86_hw_reserve(event)
> >
> > if (__x86_pmu_event_init(event) < 0)
> > event->destroy(event);
> > x86_hw_release()
> > ....
> > cpus_read_lock();
> >
> > If that happens from a hotplug function, we are doomed.
> >
> > I mean, that particular watchdog event won't fail if the watchdog code
> > would verify that already at init time (which it does soon), but in general
> > event creation during hotplug is dangerous.
>
> Arghh!!!
>
> And allowing us to create events for offline CPUs (possible I think, but
> maybe slightly tricky) won't solve that, because we're already holding
> the hotplug_lock during PREPARE.
There are two ways to cure that:
1) Have a pre cpus_write_lock() stage which is serialized via
cpus_add_remove_lock, which is the outer lock for hotplug.
There we can sanely create stuff and fail with all consequences.
2) Have some deferred mechanism, which is destroying the event after
failure, but that might be tricky as RCU and workqueues might end up
being flushed during hotplug, which creates the same mess again.
Thanks,
tglx
On Thu, Aug 31, 2017 at 09:55:57AM +0200, Thomas Gleixner wrote:
> > Arghh!!!
> >
> > And allowing us to create events for offline CPUs (possible I think, but
> > maybe slightly tricky) won't solve that, because we're already holding
> > the hotplug_lock during PREPARE.
>
> There are two ways to cure that:
>
> 1) Have a pre cpus_write_lock() stage which is serialized via
> cpus_add_remove_lock, which is the outer lock for hotplug.
>
> There we can sanely create stuff and fail with all consequences.
True, if you're willing to add more state to that hotplug thing I'll try
and make that perf patch that allows attaching to offline CPUs.
On Thu, 31 Aug 2017, Peter Zijlstra wrote:
> On Thu, Aug 31, 2017 at 09:55:57AM +0200, Thomas Gleixner wrote:
> > > Arghh!!!
> > >
> > > And allowing us to create events for offline CPUs (possible I think, but
> > > maybe slightly tricky) won't solve that, because we're already holding
> > > the hotplug_lock during PREPARE.
> >
> > There are two ways to cure that:
> >
> > 1) Have a pre cpus_write_lock() stage which is serialized via
> > cpus_add_remove_lock, which is the outer lock for hotplug.
> >
> > There we can sanely create stuff and fail with all consequences.
>
> True, if you're willing to add more state to that hotplug thing I'll try
> and make that perf patch that allows attaching to offline CPUs.
Now that I think more about it, that's going to be an interesting exercise
vs. the hotplug state registration, which relies on cpus_read_lock()
serialization...
Thanks,
tglx
On Thu, 31 Aug 2017, Thomas Gleixner wrote:
> On Thu, 31 Aug 2017, Peter Zijlstra wrote:
>
> > On Thu, Aug 31, 2017 at 09:55:57AM +0200, Thomas Gleixner wrote:
> > > > Arghh!!!
> > > >
> > > > And allowing us to create events for offline CPUs (possible I think, but
> > > > maybe slightly tricky) won't solve that, because we're already holding
> > > > the hotplug_lock during PREPARE.
> > >
> > > There are two ways to cure that:
> > >
> > > 1) Have a pre cpus_write_lock() stage which is serialized via
> > > cpus_add_remove_lock, which is the outer lock for hotplug.
> > >
> > > There we can sanely create stuff and fail with all consequences.
> >
> > True, if you're willing to add more state to that hotplug thing I'll try
> > and make that perf patch that allows attaching to offline CPUs.
>
> Now that I think more about it. That's going to be an interesting exercise
> vs. the hotplug state registration which relies on cpus_read_lock()
> serialization.....
We could have that for built-in stuff which is guaranteed never to be
unregistered. Pretty restricted, but for cases like that it could
work. Famous last words ...
Thanks,
tglx
On Thu, Aug 31, 2017 at 11:24:13PM +0200, Thomas Gleixner wrote:
> On Thu, 31 Aug 2017, Thomas Gleixner wrote:
> > On Thu, 31 Aug 2017, Peter Zijlstra wrote:
> >
> > > On Thu, Aug 31, 2017 at 09:55:57AM +0200, Thomas Gleixner wrote:
> > > > > Arghh!!!
> > > > >
> > > > > And allowing us to create events for offline CPUs (possible I think, but
> > > > > maybe slightly tricky) won't solve that, because we're already holding
> > > > > the hotplug_lock during PREPARE.
> > > >
> > > > There are two ways to cure that:
> > > >
> > > > 1) Have a pre cpus_write_lock() stage which is serialized via
> > > > cpus_add_remove_lock, which is the outer lock for hotplug.
> > > >
> > > > There we can sanely create stuff and fail with all consequences.
> > >
> > > True, if you're willing to add more state to that hotplug thing I'll try
> > > and make that perf patch that allows attaching to offline CPUs.
> >
> > Now that I think more about it. That's going to be an interesting exercise
> > vs. the hotplug state registration which relies on cpus_read_lock()
> > serialization.....
>
> We could have that for built-in stuff which is guaranteed to be never
> unregistered. Pretty restricted, but for cases like that it could
> work. Famous last work ...
I think something like this (on top of the previous patch that preserves
the event<->cpu relation over hotplug) should allow
perf_event_create_kernel_counter() to create events on offline CPUs.
It will create them as if perf_event_attr::disabled=1 and requires
perf_event_enable() after the CPU comes online (mirroring how offline
does an implicit perf_event_disable() as per the previous patch).
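Usage would then be something like this (sketch; the attr and the overflow
handler are placeholders for whatever the caller, e.g. the watchdog code,
passes in):

struct perf_event *ev;

/* may now be called for an offline CPU; the event starts out disabled */
ev = perf_event_create_kernel_counter(&attr, cpu, NULL,
				      overflow_handler, NULL);
if (IS_ERR(ev))
	return PTR_ERR(ev);

/* ... later, once the CPU is (back) online: */
perf_event_enable(ev);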
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2427,6 +2427,24 @@ perf_install_in_context(struct perf_even
smp_store_release(&event->ctx, ctx);
if (!task) {
+ /*
+ * Check if the @cpu we're creating an event for is offline.
+ *
+ * We use the perf_cpu_context::ctx::mutex to serialize against
+ * the hotplug notifiers. See perf_event_{init,exit}_cpu().
+ */
+ struct perf_cpu_context *cpuctx =
+ container_of(ctx, struct perf_cpu_context, ctx);
+
+ if (!cpuctx->online) {
+ raw_spin_lock_irq(&ctx->lock);
+ add_event_to_ctx(event, ctx);
+ event->state = PERF_EVENT_STATE_OFF +
+ PERF_EVENT_STATE_HOTPLUG_OFFSET;
+ raw_spin_unlock_irq(&ctx->lock);
+ return;
+ }
+
cpu_function_call(cpu, __perf_install_in_context, event);
return;
}
@@ -10181,6 +10199,8 @@ SYSCALL_DEFINE5(perf_event_open,
*
* We use the perf_cpu_context::ctx::mutex to serialize against
* the hotplug notifiers. See perf_event_{init,exit}_cpu().
+ *
+ * XXX not strictly required, preserves existing behaviour.
*/
struct perf_cpu_context *cpuctx =
container_of(ctx, struct perf_cpu_context, ctx);
@@ -10368,21 +10388,6 @@ perf_event_create_kernel_counter(struct
goto err_unlock;
}
- if (!task) {
- /*
- * Check if the @cpu we're creating an event for is online.
- *
- * We use the perf_cpu_context::ctx::mutex to serialize against
- * the hotplug notifiers. See perf_event_{init,exit}_cpu().
- */
- struct perf_cpu_context *cpuctx =
- container_of(ctx, struct perf_cpu_context, ctx);
- if (!cpuctx->online) {
- err = -ENODEV;
- goto err_unlock;
- }
- }
-
if (!exclusive_event_installable(event, ctx)) {
err = -EBUSY;
goto err_unlock;