I use linux 5.1.0-rc3 on s390 and got this WARN_ON_ONCE message:
WARNING: CPU: 15 PID: 0 at kernel/events/core.c:330
event_function_local.constprop.79+0xe2/0xe8
which was introduced with
commit cca2094605ef ("perf/core: Fix event_function_local()").
This is the WARN_ON_ONCE message, which sometimes shows up in the kernel
log:
[ 4598.316519] WARNING: CPU: 15 PID: 0 at kernel/events/core.c:330
event_function_local.constprop.79+0xe2/0xe8
[ 4598.316524] Kernel panic - not syncing: panic_on_warn set ...
[ 4598.316527] CPU: 15 PID: 0 Comm: swapper/15 Tainted: G OE
5.1.0-20190402.rc3.git0.5e7a8ca31926.300.fc29.s390x+git #1
[ 4598.316529] Hardware name: IBM 2964 NC9 712 (LPAR)
[ 4598.316531] Call Trace:
[ 4598.316534] ([<0000000000112eb8>] show_stack+0x58/0x70)
[ 4598.316538] [<0000000000a820aa>] dump_stack+0x7a/0xa8
[ 4598.316541] [<0000000000143b52>] panic+0x11a/0x2d0
[ 4598.316543] [<00000000001439b0>] __warn+0xf8/0x118
[ 4598.316545] [<0000000000a811a8>] report_bug+0xd8/0x150
[ 4598.316547] [<00000000001014ac>] do_report_trap+0xc4/0xe0
[ 4598.316549] [<0000000000101680>] illegal_op+0x138/0x150
[ 4598.316552] [<0000000000aa233c>] pgm_check_handler+0x1cc/0x220
[ 4598.316554] [<00000000002a6d02>] event_function_local.constprop.79+
0xe2/0xe8
[ 4598.316556] ([<00000000002a6c7a>] event_function_local.constprop.79+
0x5a/0xe8)
[ 4598.316559] [<00000000002aa400>] perf_pending_event+0x88/0xb0
[ 4598.316561] [<0000000000271a8c>] irq_work_run_list+0x8c/0xb8
[ 4598.316563] [<0000000000271d78>] irq_work_tick+0x48/0x68
[ 4598.316566] [<00000000001d6058>] update_process_times+0x68/0x80
[ 4598.316568] [<00000000001e6c10>] tick_sched_handle.isra.6+0x50/0x60
[ 4598.316570] [<00000000001e6c7e>] tick_sched_timer+0x5e/0xb8
[ 4598.316573] [<00000000001d6b6a>] __hrtimer_run_queues+0x10a/0x2c0
[ 4598.316575] [<00000000001d7a88>] hrtimer_interrupt+0x138/0x2a8
[ 4598.316577] [<000000000010c3e4>] do_IRQ+0xac/0xb0
[ 4598.316597] [<0000000000aa2744>] ext_int_handler+0x128/0x12c
[ 4598.316600] [<00000000001034f6>] enabled_wait+0x46/0xd0
[ 4598.316602] ([<000003e000d2fe10>] 0x3e000d2fe10)
[ 4598.316604] [<0000000000103842>] arch_cpu_idle+0x3a/0x50
[ 4598.316606] [<000000000017b808>] do_idle+0x180/0x1b0
[ 4598.316608] [<000000000017ba06>] cpu_startup_entry+0x36/0x40
[ 4598.316611] [<0000000000115d16>] smp_init_secondary+0xd6/0xf0
[ 4598.316613] [<000000000011521e>] smp_start_secondary+0x86/0x98
Any ideas or hints how to avoid/fix this warning?
I ran a user space program which uses
1. perf_event_open() system call with type HARDWARE
and PERF_COUNT_HW_CPU_CYCLES in sampling mode.
2. Assigns signal SIGIO to the file descriptor returned from
perf_event_open() using:
rc = fcntl(fd, F_SETFL, O_RDWR | O_NONBLOCK | O_ASYNC);
rc |= fcntl(fd, F_SETSIG, signo);
rc |= fcntl(fd, F_SETOWN, getpid());
3. The signal handler increments some variables and issues
ioctl(fd, PERF_EVENT_IOC_REFRESH, 1);
to restart signal delivery.
I can send you the test program if needed.
Thanks a lot for your help.
--
Thomas Richter, Dept 3252, IBM s390 Linux Development, Boeblingen, Germany
--
Vorsitzender des Aufsichtsrats: Matthias Hartmann
Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht Stuttgart, HRB 243294
On Wed, Apr 03, 2019 at 11:47:00AM +0200, Thomas-Mich Richter wrote:
> I use linux 5.1.0-rc3 on s390 and got this WARN_ON_ONCE message:
>
> WARNING: CPU: 15 PID: 0 at kernel/events/core.c:330
> event_function_local.constprop.79+0xe2/0xe8
>
> which was introduced with
> commit cca2094605ef ("perf/core: Fix event_function_local()").
>
> This is the WARN_ON_ONCE message, which sometimes shows up in the kernel
> log:
> [ 4598.316519] WARNING: CPU: 15 PID: 0 at kernel/events/core.c:330
> event_function_local.constprop.79+0xe2/0xe8
> [ 4598.316524] Kernel panic - not syncing: panic_on_warn set ...
> [ 4598.316527] CPU: 15 PID: 0 Comm: swapper/15 Tainted: G OE
> 5.1.0-20190402.rc3.git0.5e7a8ca31926.300.fc29.s390x+git #1
> [ 4598.316529] Hardware name: IBM 2964 NC9 712 (LPAR)
> [ 4598.316531] Call Trace:
> [ 4598.316534] ([<0000000000112eb8>] show_stack+0x58/0x70)
> [ 4598.316538] [<0000000000a820aa>] dump_stack+0x7a/0xa8
> [ 4598.316541] [<0000000000143b52>] panic+0x11a/0x2d0
> [ 4598.316543] [<00000000001439b0>] __warn+0xf8/0x118
> [ 4598.316545] [<0000000000a811a8>] report_bug+0xd8/0x150
> [ 4598.316547] [<00000000001014ac>] do_report_trap+0xc4/0xe0
> [ 4598.316549] [<0000000000101680>] illegal_op+0x138/0x150
> [ 4598.316552] [<0000000000aa233c>] pgm_check_handler+0x1cc/0x220
> [ 4598.316554] [<00000000002a6d02>] event_function_local.constprop.79+
> 0xe2/0xe8
> [ 4598.316556] ([<00000000002a6c7a>] event_function_local.constprop.79+
> 0x5a/0xe8)
> [ 4598.316559] [<00000000002aa400>] perf_pending_event+0x88/0xb0
> [ 4598.316561] [<0000000000271a8c>] irq_work_run_list+0x8c/0xb8
> [ 4598.316563] [<0000000000271d78>] irq_work_tick+0x48/0x68
> [ 4598.316566] [<00000000001d6058>] update_process_times+0x68/0x80
> [ 4598.316568] [<00000000001e6c10>] tick_sched_handle.isra.6+0x50/0x60
> [ 4598.316570] [<00000000001e6c7e>] tick_sched_timer+0x5e/0xb8
> [ 4598.316573] [<00000000001d6b6a>] __hrtimer_run_queues+0x10a/0x2c0
> [ 4598.316575] [<00000000001d7a88>] hrtimer_interrupt+0x138/0x2a8
> [ 4598.316577] [<000000000010c3e4>] do_IRQ+0xac/0xb0
> [ 4598.316597] [<0000000000aa2744>] ext_int_handler+0x128/0x12c
> [ 4598.316600] [<00000000001034f6>] enabled_wait+0x46/0xd0
> [ 4598.316602] ([<000003e000d2fe10>] 0x3e000d2fe10)
> [ 4598.316604] [<0000000000103842>] arch_cpu_idle+0x3a/0x50
> [ 4598.316606] [<000000000017b808>] do_idle+0x180/0x1b0
> [ 4598.316608] [<000000000017ba06>] cpu_startup_entry+0x36/0x40
> [ 4598.316611] [<0000000000115d16>] smp_init_secondary+0xd6/0xf0
> [ 4598.316613] [<000000000011521e>] smp_start_secondary+0x86/0x98
>
> Any ideas or hints how to avoid/fix this warning?
Some thoughts here:
https://lkml.kernel.org/r/[email protected]
tl;dr, I've no frigging clue.
> I ran a user space program which uses
>
> 1. perf_event_open() system call with type HARDWARE
> and PERF_COUNT_HW_CPU_CYCLES in sampling mode.
> 2. Assigns signal SIGIO to the file descriptor returned from
> perf_event_open() using:
> rc = fcntl(fd, F_SETFL, O_RDWR | O_NONBLOCK | O_ASYNC);
> rc |= fcntl(fd, F_SETSIG, signo);
> rc |= fcntl(fd, F_SETOWN, getpid());
>
> 3. The signal handler increments some variables and issues
> ioctl(fd, PERF_EVENT_IOC_REFRESH, 1);
> to restart signal delivery.
>
> I can send you the test program if needed.
Does it reproduce on x86 without virt on? I don't have a PPC LPAR to
test things on.
On 4/3/19 12:41 PM, Peter Zijlstra wrote:
> On Wed, Apr 03, 2019 at 11:47:00AM +0200, Thomas-Mich Richter wrote:
>> I use linux 5.1.0-rc3 on s390 and got this WARN_ON_ONCE message:
>>
>> WARNING: CPU: 15 PID: 0 at kernel/events/core.c:330
>> event_function_local.constprop.79+0xe2/0xe8
>>
>> which was introduced with
>> commit cca2094605ef ("perf/core: Fix event_function_local()").
>>
>> This is the WARN_ON_ONCE message, which sometimes shows up in the kernel
>> log:
>> [ 4598.316519] WARNING: CPU: 15 PID: 0 at kernel/events/core.c:330
>> event_function_local.constprop.79+0xe2/0xe8
>> [ 4598.316524] Kernel panic - not syncing: panic_on_warn set ...
>> [ 4598.316527] CPU: 15 PID: 0 Comm: swapper/15 Tainted: G OE
>> 5.1.0-20190402.rc3.git0.5e7a8ca31926.300.fc29.s390x+git #1
>> [ 4598.316529] Hardware name: IBM 2964 NC9 712 (LPAR)
>> [ 4598.316531] Call Trace:
>> [ 4598.316534] ([<0000000000112eb8>] show_stack+0x58/0x70)
>> [ 4598.316538] [<0000000000a820aa>] dump_stack+0x7a/0xa8
>> [ 4598.316541] [<0000000000143b52>] panic+0x11a/0x2d0
>> [ 4598.316543] [<00000000001439b0>] __warn+0xf8/0x118
>> [ 4598.316545] [<0000000000a811a8>] report_bug+0xd8/0x150
>> [ 4598.316547] [<00000000001014ac>] do_report_trap+0xc4/0xe0
>> [ 4598.316549] [<0000000000101680>] illegal_op+0x138/0x150
>> [ 4598.316552] [<0000000000aa233c>] pgm_check_handler+0x1cc/0x220
>> [ 4598.316554] [<00000000002a6d02>] event_function_local.constprop.79+
>> 0xe2/0xe8
>> [ 4598.316556] ([<00000000002a6c7a>] event_function_local.constprop.79+
>> 0x5a/0xe8)
>> [ 4598.316559] [<00000000002aa400>] perf_pending_event+0x88/0xb0
>> [ 4598.316561] [<0000000000271a8c>] irq_work_run_list+0x8c/0xb8
>> [ 4598.316563] [<0000000000271d78>] irq_work_tick+0x48/0x68
>> [ 4598.316566] [<00000000001d6058>] update_process_times+0x68/0x80
>> [ 4598.316568] [<00000000001e6c10>] tick_sched_handle.isra.6+0x50/0x60
>> [ 4598.316570] [<00000000001e6c7e>] tick_sched_timer+0x5e/0xb8
>> [ 4598.316573] [<00000000001d6b6a>] __hrtimer_run_queues+0x10a/0x2c0
>> [ 4598.316575] [<00000000001d7a88>] hrtimer_interrupt+0x138/0x2a8
>> [ 4598.316577] [<000000000010c3e4>] do_IRQ+0xac/0xb0
>> [ 4598.316597] [<0000000000aa2744>] ext_int_handler+0x128/0x12c
>> [ 4598.316600] [<00000000001034f6>] enabled_wait+0x46/0xd0
>> [ 4598.316602] ([<000003e000d2fe10>] 0x3e000d2fe10)
>> [ 4598.316604] [<0000000000103842>] arch_cpu_idle+0x3a/0x50
>> [ 4598.316606] [<000000000017b808>] do_idle+0x180/0x1b0
>> [ 4598.316608] [<000000000017ba06>] cpu_startup_entry+0x36/0x40
>> [ 4598.316611] [<0000000000115d16>] smp_init_secondary+0xd6/0xf0
>> [ 4598.316613] [<000000000011521e>] smp_start_secondary+0x86/0x98
>>
>> Any ideas or hints how to avoid/fix this warning?
>
> Some thoughts here:
>
> https://lkml.kernel.org/r/[email protected]
>
> tl;dr, I've no frigging clue.
>
>> I ran a user space program which uses
>>
>> 1. perf_event_open() system call with type HARDWARE
>> and PERF_COUNT_HW_CPU_CYCLES in sampling mode.
>> 2. Assigns signal SIGIO to the file descriptor returned from
>> perf_event_open() using:
>> rc = fcntl(fd, F_SETFL, O_RDWR | O_NONBLOCK | O_ASYNC);
>> rc |= fcntl(fd, F_SETSIG, signo);
>> rc |= fcntl(fd, F_SETOWN, getpid());
>>
>> 3. The signal handler increments some variables and issues
>> ioctl(fd, PERF_EVENT_IOC_REFRESH, 1);
>> to restart signal delivery.
>>
>> I can send you the test program if needed.
>
> Does it reproduce on x86 without virt on? I don't have a PPC LPAR to
> test things on.
>
It happens on an s390 mainframe (not PowerPC) without KVM or z/VM virtualisation.
I'll try to repro it on x86.
PS: Put my team members Heiko and Hendrik on cc.
--
Thomas Richter, Dept 3252, IBM s390 Linux Development, Boeblingen, Germany
--
Vorsitzender des Aufsichtsrats: Matthias Hartmann
Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht Stuttgart, HRB 243294
On 4/3/19 12:41 PM, Peter Zijlstra wrote:
> On Wed, Apr 03, 2019 at 11:47:00AM +0200, Thomas-Mich Richter wrote:
>> I use linux 5.1.0-rc3 on s390 and got this WARN_ON_ONCE message:
>>
>> WARNING: CPU: 15 PID: 0 at kernel/events/core.c:330
>> event_function_local.constprop.79+0xe2/0xe8
>>
>> which was introduced with
>> commit cca2094605ef ("perf/core: Fix event_function_local()").
>> ..snip....
>>
>> Any ideas or hints how to avoid/fix this warning?
>
> Some thoughts here:
>
> https://lkml.kernel.org/r/[email protected]
>
> tl;dr, I've no frigging clue.
>
I have read this thread and at the end you mentioned:
Humm, but in that case:
context_switch()
prepare_task_switch()
perf_event_task_sched_out()
__perf_event_task_sched_out()
perf_event_context_sched_out()
task_ctx_sched_out()
ctx_sched_out()
group_sched_out()
event_sched_out()
if (event->pending_disable)
Would have already cleared the pending_disable state, so the IPI would
not have run perf_event_disable_local() in the first place.
Our test system is configured to panic in WARN_ON_ONCE(). I looked
at the dump. The event triggering WARN_ON_ONCE:
crash> struct perf_event.oncpu 0x1f9b24800
oncpu = 0
crash> struct perf_event.state 0x1f9b24800
state = PERF_EVENT_STATE_ACTIVE
crash>
This means the code in
static void event_sched_out(....)
{
....
event->pmu->del(event, 0);
event->oncpu = -1;
if (event->pending_disable) {
event->pending_disable = 0;
state = PERF_EVENT_STATE_OFF;
}
perf_event_set_state(event, state);
...
}
has not finished and returned from this function. So the task was not completely context-switched
out from CPU 0 while the interrupt handler was executing on CPU 15:
static void perf_pending_event(...)
{
....
if (event->pending_disable) {
event->pending_disable = 0;
perf_event_disable_local(event); <--- Causes the WARN_ON_ONCE()
}
.....
}
I think there is a race, especially when the interrupt handler on CPU 15
was invoked via timer interrupt and runs on a different CPU.
>
> Does it reproduce on x86 without virt on? I don't have a PPC LPAR to
> test things on.
>
s390 LPARs run under hypervisor control; no chance to run any OS without it.
--
Thomas Richter, Dept 3252, IBM s390 Linux Development, Boeblingen, Germany
--
Vorsitzender des Aufsichtsrats: Matthias Hartmann
Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht Stuttgart, HRB 243294
On Thu, Apr 04, 2019 at 11:15:39AM +0200, Thomas-Mich Richter wrote:
> On 4/3/19 12:41 PM, Peter Zijlstra wrote:
> > On Wed, Apr 03, 2019 at 11:47:00AM +0200, Thomas-Mich Richter wrote:
> >> I use linux 5.1.0-rc3 on s390 and got this WARN_ON_ONCE message:
> >>
> >> WARNING: CPU: 15 PID: 0 at kernel/events/core.c:330
> >> event_function_local.constprop.79+0xe2/0xe8
> >>
> >> which was introduced with
> >> commit cca2094605ef ("perf/core: Fix event_function_local()").
> >> ..snip....
> >>
> >> Any ideas or hints how to avoid/fix this warning?
> >
> > Some thoughts here:
> >
> > https://lkml.kernel.org/r/[email protected]
> >
> > tl;dr, I've no frigging clue.
> >
>
> I have read this thread and at the end you mentioned:
>
> Humm, but in that case:
>
> context_switch()
> prepare_task_switch()
> perf_event_task_sched_out()
> __perf_event_task_sched_out()
> perf_event_context_sched_out()
> task_ctx_sched_out()
> ctx_sched_out()
> group_sched_out()
> event_sched_out()
> if (event->pending_disable)
>
> Would have already cleared the pending_disable state, so the IPI would
> not have run perf_event_disable_local() in the first place.
>
> Our test system is configured to panic in WARN_ON_ONCE(). I looked
> at the dump. The event triggering WARN_ON_ONCE:
>
> crash> struct perf_event.oncpu 0x1f9b24800
> oncpu = 0
> crash> struct perf_event.state 0x1f9b24800
> state = PERF_EVENT_STATE_ACTIVE
> crash>
>
> This means the code in
> static void event_sched_out(....)
> {
> ....
> event->pmu->del(event, 0);
> event->oncpu = -1;
>
> if (event->pending_disable) {
> event->pending_disable = 0;
> state = PERF_EVENT_STATE_OFF;
> }
> perf_event_set_state(event, state);
> ...
> }
>
> has not finished and returned from this function. So the task was not completely context-switched
> out from CPU 0 while the interrupt handler was executing on CPU 15:
>
> static void perf_pending_event(...)
> {
> ....
> if (event->pending_disable) {
> event->pending_disable = 0;
> perf_event_disable_local(event); <--- Causes the WARN_ON_ONCE()
> }
> .....
> }
>
> I think there is a race, especially when the interrupt handler on CPU 15
> was invoked via timer interrupt and runs on a different CPU.
That is not entirely the scenario I talked about, but *groan*.
So what I meant was:
CPU-0 CPU-n
__schedule()
local_irq_disable()
...
deactivate_task(prev);
try_to_wake_up(@p)
...
smp_cond_load_acquire(&p->on_cpu, !VAL);
<PMI>
..
perf_event_disable_inatomic()
event->pending_disable = 1;
irq_work_queue() /* self-IPI */
</PMI>
context_switch()
prepare_task_switch()
perf_event_task_sched_out()
// the above chain that clears pending_disable
finish_task_switch()
finish_task()
smp_store_release(prev->on_cpu, 0);
/* finally.... */
// take woken
// context_switch to @p
finish_lock_switch()
raw_spin_unlock_irq()
/* w00t, IRQs enabled, self-IPI time */
<self-IPI>
perf_pending_event()
// event->pending_disable == 0
</self-IPI>
What you're suggesting, is that the time between:
smp_store_release(prev->on_cpu, 0);
and
<self-IPI>
on CPU-0 is sufficient for CPU-n to context switch to the task, enable
the event there, trigger a PMI that calls perf_event_disable_inatomic()
_again_ (this would mean irq_work_queue() failing, which we don't check)
(and schedule out again, although that's not required).
This being virt that might actually be possible if (v)CPU-0 takes a nap
I suppose.
Let me think about this a little more...
> > Does it reproduce on x86 without virt on? I don't have a PPC LPAR to
> > test things on.
> >
>
> s390 LPARs run under hypervisor control; no chance to run any OS without it.
Yah, I know.. Same difference though; I also don't have an s390.
On Thu, Apr 04, 2019 at 01:09:09PM +0200, Peter Zijlstra wrote:
> That is not entirely the scenario I talked about, but *groan*.
>
> So what I meant was:
>
> CPU-0 CPU-n
>
> __schedule()
> local_irq_disable()
>
> ...
> deactivate_task(prev);
>
> try_to_wake_up(@p)
> ...
> smp_cond_load_acquire(&p->on_cpu, !VAL);
>
> <PMI>
> ..
> perf_event_disable_inatomic()
> event->pending_disable = 1;
> irq_work_queue() /* self-IPI */
> </PMI>
>
> context_switch()
> prepare_task_switch()
> perf_event_task_sched_out()
> // the above chain that clears pending_disable
>
> finish_task_switch()
> finish_task()
> smp_store_release(prev->on_cpu, 0);
> /* finally.... */
> // take woken
> // context_switch to @p
> finish_lock_switch()
> raw_spin_unlock_irq()
> /* w00t, IRQs enabled, self-IPI time */
> <self-IPI>
> perf_pending_event()
> // event->pending_disable == 0
> </self-IPI>
>
>
> What you're suggesting, is that the time between:
>
> smp_store_release(prev->on_cpu, 0);
>
> and
>
> <self-IPI>
>
> on CPU-0 is sufficient for CPU-n to context switch to the task, enable
> the event there, trigger a PMI that calls perf_event_disable_inatomic()
> _again_ (this would mean irq_work_queue() failing, which we don't check)
> (and schedule out again, although that's not required).
>
> This being virt that might actually be possible if (v)CPU-0 takes a nap
> I suppose.
>
> Let me think about this a little more...
Arghh... s390 doesn't implement arch_irq_work_raise(), which makes it
far far worse.
I have a hack that might've cured it, were it not for that. Let me think
more still..
On Thu, Apr 04, 2019 at 02:02:14PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 04, 2019 at 01:09:09PM +0200, Peter Zijlstra wrote:
> > That is not entirely the scenario I talked about, but *groan*.
> >
> > So what I meant was:
> >
> > CPU-0 CPU-n
> >
> > __schedule()
> > local_irq_disable()
> >
> > ...
> > deactivate_task(prev);
> >
> > try_to_wake_up(@p)
> > ...
> > smp_cond_load_acquire(&p->on_cpu, !VAL);
> >
> > <PMI>
> > ..
> > perf_event_disable_inatomic()
> > event->pending_disable = 1;
> > irq_work_queue() /* self-IPI */
> > </PMI>
> >
> > context_switch()
> > prepare_task_switch()
> > perf_event_task_sched_out()
> > // the above chain that clears pending_disable
> >
> > finish_task_switch()
> > finish_task()
> > smp_store_release(prev->on_cpu, 0);
> > /* finally.... */
> > // take woken
> > // context_switch to @p
> > finish_lock_switch()
> > raw_spin_unlock_irq()
> > /* w00t, IRQs enabled, self-IPI time */
> > <self-IPI>
> > perf_pending_event()
> > // event->pending_disable == 0
> > </self-IPI>
> >
> >
> > What you're suggesting, is that the time between:
> >
> > smp_store_release(prev->on_cpu, 0);
> >
> > and
> >
> > <self-IPI>
> >
> > on CPU-0 is sufficient for CPU-n to context switch to the task, enable
> > the event there, trigger a PMI that calls perf_event_disable_inatomic()
> > _again_ (this would mean irq_work_queue() failing, which we don't check)
> > (and schedule out again, although that's not required).
> >
> > This being virt that might actually be possible if (v)CPU-0 takes a nap
> > I suppose.
> >
> > Let me think about this a little more...
>
> Arghh... s390 doesn't implement arch_irq_work_raise(), which makes it
> far far worse.
>
> I have a hack that might've cured it, were it not for that. Let me think
> more still..
Could you educate me on the s390 PMU? Afaict only the SF one has a
sampling interrupt (cpumf_measurement_alert); is that NMI-like or a
regular IRQ?
On Thu, Apr 04, 2019 at 01:09:09PM +0200, Peter Zijlstra wrote:
> That is not entirely the scenario I talked about, but *groan*.
>
> So what I meant was:
>
> CPU-0 CPU-n
>
> __schedule()
> local_irq_disable()
>
> ...
> deactivate_task(prev);
>
> try_to_wake_up(@p)
> ...
> smp_cond_load_acquire(&p->on_cpu, !VAL);
>
> <PMI>
> ..
> perf_event_disable_inatomic()
> event->pending_disable = 1;
> irq_work_queue() /* self-IPI */
> </PMI>
>
> context_switch()
> prepare_task_switch()
> perf_event_task_sched_out()
> // the above chain that clears pending_disable
>
> finish_task_switch()
> finish_task()
> smp_store_release(prev->on_cpu, 0);
> /* finally.... */
> // take woken
> // context_switch to @p
> finish_lock_switch()
> raw_spin_unlock_irq()
> /* w00t, IRQs enabled, self-IPI time */
> <self-IPI>
> perf_pending_event()
> // event->pending_disable == 0
> </self-IPI>
>
>
> What you're suggesting, is that the time between:
>
> smp_store_release(prev->on_cpu, 0);
>
> and
>
> <self-IPI>
>
> on CPU-0 is sufficient for CPU-n to context switch to the task, enable
> the event there, trigger a PMI that calls perf_event_disable_inatomic()
> _again_ (this would mean irq_work_queue() failing, which we don't check)
> (and schedule out again, although that's not required).
>
> This being virt that might actually be possible if (v)CPU-0 takes a nap
> I suppose.
>
> Let me think about this a little more...
Does the below cure things? It's not exactly pretty, but it could just
do the trick.
---
diff --git a/kernel/events/core.c b/kernel/events/core.c
index dfc4bab0b02b..d496e6911442 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2009,8 +2009,8 @@ event_sched_out(struct perf_event *event,
event->pmu->del(event, 0);
event->oncpu = -1;
- if (event->pending_disable) {
- event->pending_disable = 0;
+ if (event->pending_disable == smp_processor_id()) {
+ event->pending_disable = -1;
state = PERF_EVENT_STATE_OFF;
}
perf_event_set_state(event, state);
@@ -2198,7 +2198,7 @@ EXPORT_SYMBOL_GPL(perf_event_disable);
void perf_event_disable_inatomic(struct perf_event *event)
{
- event->pending_disable = 1;
+ event->pending_disable = smp_processor_id();
irq_work_queue(&event->pending);
}
@@ -5822,8 +5822,8 @@ static void perf_pending_event(struct irq_work *entry)
* and we won't recurse 'further'.
*/
- if (event->pending_disable) {
- event->pending_disable = 0;
+ if (event->pending_disable == smp_processor_id()) {
+ event->pending_disable = -1;
perf_event_disable_local(event);
}
@@ -10236,6 +10236,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
init_waitqueue_head(&event->waitq);
+ event->pending_disable = -1;
init_irq_work(&event->pending, perf_pending_event);
mutex_init(&event->mmap_mutex);
On 4/4/19 3:03 PM, Peter Zijlstra wrote:
> On Thu, Apr 04, 2019 at 01:09:09PM +0200, Peter Zijlstra wrote:
>
>> That is not entirely the scenario I talked about, but *groan*.
>>
>> So what I meant was:
>>
>> CPU-0 CPU-n
>>
>> __schedule()
>> local_irq_disable()
>>
>> ...
>> deactivate_task(prev);
>>
>> try_to_wake_up(@p)
>> ...
>> smp_cond_load_acquire(&p->on_cpu, !VAL);
>>
>> <PMI>
>> ..
>> perf_event_disable_inatomic()
>> event->pending_disable = 1;
>> irq_work_queue() /* self-IPI */
>> </PMI>
>>
>> context_switch()
>> prepare_task_switch()
>> perf_event_task_sched_out()
>> // the above chain that clears pending_disable
>>
>> finish_task_switch()
>> finish_task()
>> smp_store_release(prev->on_cpu, 0);
>> /* finally.... */
>> // take woken
>> // context_switch to @p
>> finish_lock_switch()
>> raw_spin_unlock_irq()
>> /* w00t, IRQs enabled, self-IPI time */
>> <self-IPI>
>> perf_pending_event()
>> // event->pending_disable == 0
>> </self-IPI>
>>
>>
>> What you're suggesting, is that the time between:
>>
>> smp_store_release(prev->on_cpu, 0);
>>
>> and
>>
>> <self-IPI>
>>
>> on CPU-0 is sufficient for CPU-n to context switch to the task, enable
>> the event there, trigger a PMI that calls perf_event_disable_inatomic()
>> _again_ (this would mean irq_work_queue() failing, which we don't check)
>> (and schedule out again, although that's not required).
>>
>> This being virt that might actually be possible if (v)CPU-0 takes a nap
>> I suppose.
>>
>> Let me think about this a little more...
>
> Does the below cure things? It's not exactly pretty, but it could just
> do the trick.
>
> ---
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index dfc4bab0b02b..d496e6911442 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2009,8 +2009,8 @@ event_sched_out(struct perf_event *event,
> event->pmu->del(event, 0);
> event->oncpu = -1;
>
> - if (event->pending_disable) {
> - event->pending_disable = 0;
> + if (event->pending_disable == smp_processor_id()) {
> + event->pending_disable = -1;
> state = PERF_EVENT_STATE_OFF;
> }
> perf_event_set_state(event, state);
> @@ -2198,7 +2198,7 @@ EXPORT_SYMBOL_GPL(perf_event_disable);
>
> void perf_event_disable_inatomic(struct perf_event *event)
> {
> - event->pending_disable = 1;
> + event->pending_disable = smp_processor_id();
> irq_work_queue(&event->pending);
> }
>
> @@ -5822,8 +5822,8 @@ static void perf_pending_event(struct irq_work *entry)
> * and we won't recurse 'further'.
> */
>
> - if (event->pending_disable) {
> - event->pending_disable = 0;
> + if (event->pending_disable == smp_processor_id()) {
> + event->pending_disable = -1;
> perf_event_disable_local(event);
> }
>
> @@ -10236,6 +10236,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
>
>
> init_waitqueue_head(&event->waitq);
> + event->pending_disable = -1;
> init_irq_work(&event->pending, perf_pending_event);
>
> mutex_init(&event->mmap_mutex);
>
Thanks for the quick reply, will give it a try...
--
Thomas Richter, Dept 3252, IBM s390 Linux Development, Boeblingen, Germany
--
Vorsitzender des Aufsichtsrats: Matthias Hartmann
Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht Stuttgart, HRB 243294
On 4/4/19 3:03 PM, Peter Zijlstra wrote:
> On Thu, Apr 04, 2019 at 01:09:09PM +0200, Peter Zijlstra wrote:
>
>> That is not entirely the scenario I talked about, but *groan*.
>>
>> So what I meant was:
>>
>> CPU-0 CPU-n
>>
>> __schedule()
>> local_irq_disable()
>>
>> ...
>> deactivate_task(prev);
>>
>> try_to_wake_up(@p)
>> ...
>> smp_cond_load_acquire(&p->on_cpu, !VAL);
>>
>> <PMI>
>> ..
>> perf_event_disable_inatomic()
>> event->pending_disable = 1;
>> irq_work_queue() /* self-IPI */
>> </PMI>
>>
>> context_switch()
>> prepare_task_switch()
>> perf_event_task_sched_out()
>> // the above chain that clears pending_disable
>>
>> finish_task_switch()
>> finish_task()
>> smp_store_release(prev->on_cpu, 0);
>> /* finally.... */
>> // take woken
>> // context_switch to @p
>> finish_lock_switch()
>> raw_spin_unlock_irq()
>> /* w00t, IRQs enabled, self-IPI time */
>> <self-IPI>
>> perf_pending_event()
>> // event->pending_disable == 0
>> </self-IPI>
>>
>>
>> What you're suggesting, is that the time between:
>>
>> smp_store_release(prev->on_cpu, 0);
>>
>> and
>>
>> <self-IPI>
>>
>> on CPU-0 is sufficient for CPU-n to context switch to the task, enable
>> the event there, trigger a PMI that calls perf_event_disable_inatomic()
>> _again_ (this would mean irq_work_queue() failing, which we don't check)
>> (and schedule out again, although that's not required).
>>
>> This being virt that might actually be possible if (v)CPU-0 takes a nap
>> I suppose.
>>
>> Let me think about this a little more...
>
> Does the below cure things? It's not exactly pretty, but it could just
> do the trick.
>
Thanks a lot for the patch; I have built a new kernel and will let it run over the weekend.
s390 does not have a PMI; all interrupts (including the measurement interrupts from
the PMU) are normal, maskable interrupts.
--
Thomas Richter, Dept 3252, IBM s390 Linux Development, Boeblingen, Germany
--
Vorsitzender des Aufsichtsrats: Matthias Hartmann
Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht Stuttgart, HRB 243294
On Fri, Apr 05, 2019 at 12:18:54PM +0200, Thomas-Mich Richter wrote:
> On 4/4/19 3:03 PM, Peter Zijlstra wrote:
> > On Thu, Apr 04, 2019 at 01:09:09PM +0200, Peter Zijlstra wrote:
> >
> >> That is not entirely the scenario I talked about, but *groan*.
> >>
> >> So what I meant was:
> >>
> >> CPU-0 CPU-n
> >>
> >> __schedule()
> >> local_irq_disable()
> >>
> >> ...
> >> deactivate_task(prev);
> >>
> >> try_to_wake_up(@p)
> >> ...
> >> smp_cond_load_acquire(&p->on_cpu, !VAL);
> >>
> >> <PMI>
> >> ..
> >> perf_event_disable_inatomic()
> >> event->pending_disable = 1;
> >> irq_work_queue() /* self-IPI */
> >> </PMI>
> >>
> >> context_switch()
> >> prepare_task_switch()
> >> perf_event_task_sched_out()
> >> // the above chain that clears pending_disable
> >>
> >> finish_task_switch()
> >> finish_task()
> >> smp_store_release(prev->on_cpu, 0);
> >> /* finally.... */
> >> // take woken
> >> // context_switch to @p
> >> finish_lock_switch()
> >> raw_spin_unlock_irq()
> >> /* w00t, IRQs enabled, self-IPI time */
> >> <self-IPI>
> >> perf_pending_event()
> >> // event->pending_disable == 0
> >> </self-IPI>
> >>
> >>
> >> What you're suggesting, is that the time between:
> >>
> >> smp_store_release(prev->on_cpu, 0);
> >>
> >> and
> >>
> >> <self-IPI>
> >>
> >> on CPU-0 is sufficient for CPU-n to context switch to the task, enable
> >> the event there, trigger a PMI that calls perf_event_disable_inatomic()
> >> _again_ (this would mean irq_work_queue() failing, which we don't check)
> >> (and schedule out again, although that's not required).
> >>
> >> This being virt that might actually be possible if (v)CPU-0 takes a nap
> >> I suppose.
> >>
> >> Let me think about this a little more...
> >
> > Does the below cure things? It's not exactly pretty, but it could just
> > do the trick.
> >
>
> Thanks a lot for the patch; I have built a new kernel and will let it run over the weekend.
>
> s390 does not have a PMI; all interrupts (including the measurement interrupts from
> the PMU) are normal, maskable interrupts.
Please implement arch_irq_work_raise() for s390 though; traditionally
that was only required if you had NMI-like PMIs, and IRQ-based PMIs
would have to run irq_work_run() at the end of their handler.
The s390-sf handler does not do this, but even if it were to do that, it
wouldn't be sufficient, since you also drain the buffer from
pmu::stop().
Not doing this will get you some 'weird' behaviour. As in, your signals
could be delivered very late etc.
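For what it's worth, a rough editor-added sketch of the pattern described above (a generic illustration, not actual s390 code; the handler name and the self-interrupt mechanism are placeholders):

#include <linux/irq_work.h>

/*
 * Architectures that can send themselves an interrupt provide this hook;
 * irq_work_queue() then uses it so queued work (such as perf's pending
 * disable/wakeup) runs promptly instead of waiting for the next tick.
 */
void arch_irq_work_raise(void)
{
        /* raise a self-interrupt here -- mechanism is architecture-specific */
}

/*
 * Alternatively, an IRQ-based PMU overflow handler is expected to flush
 * pending irq_work before it returns (placeholder function name):
 */
static void example_pmu_overflow_handler_tail(void)
{
        irq_work_run();
}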
On 4/4/19 3:03 PM, Peter Zijlstra wrote:
> On Thu, Apr 04, 2019 at 01:09:09PM +0200, Peter Zijlstra wrote:
>
>> That is not entirely the scenario I talked about, but *groan*.
>>
>> So what I meant was:
>>
>> CPU-0 CPU-n
>>
>> __schedule()
>> local_irq_disable()
>>
>> ...
>> deactivate_task(prev);
>>
>> try_to_wake_up(@p)
>> ...
>> smp_cond_load_acquire(&p->on_cpu, !VAL);
>>
>> <PMI>
>> ..
>> perf_event_disable_inatomic()
>> event->pending_disable = 1;
>> irq_work_queue() /* self-IPI */
>> </PMI>
>>
>> context_switch()
>> prepare_task_switch()
>> perf_event_task_sched_out()
>> // the above chain that clears pending_disable
>>
>> finish_task_switch()
>> finish_task()
>> smp_store_release(prev->on_cpu, 0);
>> /* finally.... */
>> // take woken
>> // context_switch to @p
>> finish_lock_switch()
>> raw_spin_unlock_irq()
>> /* w00t, IRQs enabled, self-IPI time */
>> <self-IPI>
>> perf_pending_event()
>> // event->pending_disable == 0
>> </self-IPI>
>>
>>
>> What you're suggesting, is that the time between:
>>
>> smp_store_release(prev->on_cpu, 0);
>>
>> and
>>
>> <self-IPI>
>>
>> on CPU-0 is sufficient for CPU-n to context switch to the task, enable
>> the event there, trigger a PMI that calls perf_event_disable_inatomic()
>> _again_ (this would mean irq_work_queue() failing, which we don't check)
>> (and schedule out again, although that's not required).
>>
>> This being virt that might actually be possible if (v)CPU-0 takes a nap
>> I suppose.
>>
>> Let me think about this a little more...
>
> Does the below cure things? It's not exactly pretty, but it could just
> do the trick.
>
> ---
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index dfc4bab0b02b..d496e6911442 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2009,8 +2009,8 @@ event_sched_out(struct perf_event *event,
> event->pmu->del(event, 0);
> event->oncpu = -1;
>
> - if (event->pending_disable) {
> - event->pending_disable = 0;
> + if (event->pending_disable == smp_processor_id()) {
> + event->pending_disable = -1;
> state = PERF_EVENT_STATE_OFF;
> }
> perf_event_set_state(event, state);
> @@ -2198,7 +2198,7 @@ EXPORT_SYMBOL_GPL(perf_event_disable);
>
> void perf_event_disable_inatomic(struct perf_event *event)
> {
> - event->pending_disable = 1;
> + event->pending_disable = smp_processor_id();
> irq_work_queue(&event->pending);
> }
>
> @@ -5822,8 +5822,8 @@ static void perf_pending_event(struct irq_work *entry)
> * and we won't recurse 'further'.
> */
>
> - if (event->pending_disable) {
> - event->pending_disable = 0;
> + if (event->pending_disable == smp_processor_id()) {
> + event->pending_disable = -1;
> perf_event_disable_local(event);
> }
>
> @@ -10236,6 +10236,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
>
>
> init_waitqueue_head(&event->waitq);
> + event->pending_disable = -1;
> init_irq_work(&event->pending, perf_pending_event);
>
> mutex_init(&event->mmap_mutex);
>
Peter,
very good news, your fix ran over the weekend without any hit!!!
Thanks very much for your help. Do you submit this patch to the kernel mailing list?
--
Thomas Richter, Dept 3252, IBM s390 Linux Development, Boeblingen, Germany
--
Vorsitzender des Aufsichtsrats: Matthias Hartmann
Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht Stuttgart, HRB 243294
On Mon, Apr 08, 2019 at 09:12:28AM +0200, Thomas-Mich Richter wrote:
> > Does the below cure things? It's not exactly pretty, but it could just
> > do the trick.
> >
> > ---
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index dfc4bab0b02b..d496e6911442 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -2009,8 +2009,8 @@ event_sched_out(struct perf_event *event,
> > event->pmu->del(event, 0);
> > event->oncpu = -1;
> >
> > - if (event->pending_disable) {
> > - event->pending_disable = 0;
> > + if (event->pending_disable == smp_processor_id()) {
> > + event->pending_disable = -1;
> > state = PERF_EVENT_STATE_OFF;
> > }
> > perf_event_set_state(event, state);
> > @@ -2198,7 +2198,7 @@ EXPORT_SYMBOL_GPL(perf_event_disable);
> >
> > void perf_event_disable_inatomic(struct perf_event *event)
> > {
> > - event->pending_disable = 1;
> > + event->pending_disable = smp_processor_id();
> > irq_work_queue(&event->pending);
> > }
> >
> > @@ -5822,8 +5822,8 @@ static void perf_pending_event(struct irq_work *entry)
> > * and we won't recurse 'further'.
> > */
> >
> > - if (event->pending_disable) {
> > - event->pending_disable = 0;
> > + if (event->pending_disable == smp_processor_id()) {
> > + event->pending_disable = -1;
> > perf_event_disable_local(event);
> > }
> >
> > @@ -10236,6 +10236,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
> >
> >
> > init_waitqueue_head(&event->waitq);
> > + event->pending_disable = -1;
> > init_irq_work(&event->pending, perf_pending_event);
> >
> > mutex_init(&event->mmap_mutex);
> >
>
> Peter,
>
> very good news, your fix ran over the weekend without any hit!!!
>
> Thanks very much for your help. Do you submit this patch to the kernel mailing list?
Most excellent, let me go write a Changelog.
Could I convince you to implement arch_irq_work_raise() for s390?
On 4/8/19 10:22 AM, Peter Zijlstra wrote:
> On Mon, Apr 08, 2019 at 09:12:28AM +0200, Thomas-Mich Richter wrote:
>>> Does the below cure things? It's not exactly pretty, but it could just
>>> do the trick.
>>>
>>> ---
>>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>>> index dfc4bab0b02b..d496e6911442 100644
>>> --- a/kernel/events/core.c
>>> +++ b/kernel/events/core.c
>>> @@ -2009,8 +2009,8 @@ event_sched_out(struct perf_event *event,
>>> event->pmu->del(event, 0);
>>> event->oncpu = -1;
>>>
>>> - if (event->pending_disable) {
>>> - event->pending_disable = 0;
>>> + if (event->pending_disable == smp_processor_id()) {
>>> + event->pending_disable = -1;
>>> state = PERF_EVENT_STATE_OFF;
>>> }
>>> perf_event_set_state(event, state);
>>> @@ -2198,7 +2198,7 @@ EXPORT_SYMBOL_GPL(perf_event_disable);
>>>
>>> void perf_event_disable_inatomic(struct perf_event *event)
>>> {
>>> - event->pending_disable = 1;
>>> + event->pending_disable = smp_processor_id();
>>> irq_work_queue(&event->pending);
>>> }
>>>
>>> @@ -5822,8 +5822,8 @@ static void perf_pending_event(struct irq_work *entry)
>>> * and we won't recurse 'further'.
>>> */
>>>
>>> - if (event->pending_disable) {
>>> - event->pending_disable = 0;
>>> + if (event->pending_disable == smp_processor_id()) {
>>> + event->pending_disable = -1;
>>> perf_event_disable_local(event);
>>> }
>>>
>>> @@ -10236,6 +10236,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
>>>
>>>
>>> init_waitqueue_head(&event->waitq);
>>> + event->pending_disable = -1;
>>> init_irq_work(&event->pending, perf_pending_event);
>>>
>>> mutex_init(&event->mmap_mutex);
>>>
>>
>> Peter,
>>
>> very good news, your fix ran over the weekend without any hit!!!
>>
>> Thanks very much for your help. Do you submit this patch to the kernel mailing list?
>
> Most excellent, let me go write a Changelog.
>
> Could I convince you to implement arch_irq_work_raise() for s390?
>
Yes, I am convinced; however, I need to discuss this with the s390 maintainers
Martin Schwidefsky and Heiko Carstens first.
--
Thomas Richter, Dept 3252, IBM s390 Linux Development, Boeblingen, Germany
--
Vorsitzender des Aufsichtsrats: Matthias Hartmann
Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht Stuttgart, HRB 243294
On Mon, Apr 08, 2019 at 10:22:29AM +0200, Peter Zijlstra wrote:
> On Mon, Apr 08, 2019 at 09:12:28AM +0200, Thomas-Mich Richter wrote:
> > very good news, your fix ran over the weekend without any hit!!!
> >
> > Thanks very much for your help. Do you submit this patch to the kernel mailing list?
>
> Most excellent, let me go write a Changelog.
Hi Thomas, find below.
Sadly, while writing the Changelog I ended up with a 'completely'
different patch again; could I bother you to test this one too?
---
Subject: perf: Fix perf_event_disable_inatomic()
From: Peter Zijlstra <[email protected]>
Date: Thu, 4 Apr 2019 15:03:00 +0200
Thomas-Mich Richter reported he triggered a WARN from event_function_local()
on his s390. The problem boils down to:
CPU-A CPU-B
perf_event_overflow()
perf_event_disable_inatomic()
@pending_disable = 1
irq_work_queue();
sched-out
event_sched_out()
@pending_disable = 0
sched-in
perf_event_overflow()
perf_event_disable_inatomic()
@pending_disable = 1;
irq_work_queue(); // FAILS
irq_work_run()
perf_pending_event()
if (@pending_disable)
perf_event_disable_local(); // WHOOPS
The problem exists in generic code, but s390 is particularly sensitive
because it doesn't implement arch_irq_work_raise(), nor does it call
irq_work_run() from its PMU interrupt handler (nor would that be
sufficient in this case, because s390 also generates
perf_event_overflow() from pmu::stop). Add to that the fact that s390
is a virtual architecture and (virtual) CPU-A can stall long enough
for the above race to happen, even if it would self-IPI.
Adding an irq_work_syn() to event_sched_in() would work for all hardware
PMUs that properly use irq_work_run() but fails for software PMUs.
Instead encode the CPU number in @pending_disable, such that we can
tell which CPU requested the disable. This then allows us to detect
the above scenario and even redirect the IPI to make up for the failed
queue.
Cc: Heiko Carstens <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Hendrik Brueckner <[email protected]>
Cc: [email protected]
Cc: Martin Schwidefsky <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Jiri Olsa <[email protected]>
Reported-by: Thomas-Mich Richter <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
kernel/events/core.c | 52 ++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 43 insertions(+), 9 deletions(-)
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2009,8 +2009,8 @@ event_sched_out(struct perf_event *event
event->pmu->del(event, 0);
event->oncpu = -1;
- if (event->pending_disable) {
- event->pending_disable = 0;
+ if (READ_ONCE(event->pending_disable) >= 0) {
+ WRITE_ONCE(event->pending_disable, -1);
state = PERF_EVENT_STATE_OFF;
}
perf_event_set_state(event, state);
@@ -2198,7 +2198,8 @@ EXPORT_SYMBOL_GPL(perf_event_disable);
void perf_event_disable_inatomic(struct perf_event *event)
{
- event->pending_disable = 1;
+ WRITE_ONCE(event->pending_disable, smp_processor_id());
+ /* can fail, see perf_pending_event_disable() */
irq_work_queue(&event->pending);
}
@@ -5810,10 +5811,45 @@ void perf_event_wakeup(struct perf_event
}
}
+static void perf_pending_event_disable(struct perf_event *event)
+{
+ int cpu = READ_ONCE(event->pending_disable);
+
+ if (cpu < 0)
+ return;
+
+ if (cpu == smp_processor_id()) {
+ WRITE_ONCE(event->pending_disable, -1);
+ perf_event_disable_local(event);
+ return;
+ }
+
+ /*
+ * CPU-A CPU-B
+ *
+ * perf_event_disable_inatomic()
+ * @pending_disable = CPU-A;
+ * irq_work_queue();
+ *
+ * sched-out
+ * @pending_disable = -1;
+ *
+ * sched-in
+ * perf_event_disable_inatomic()
+ * @pending_disable = CPU-B;
+ * irq_work_queue(); // FAILS
+ *
+ * irq_work_run()
+ * perf_pending_event()
+ *
+ * But the event runs on CPU-B and wants disabling there.
+ */
+ irq_work_queue_on(&event->pending, cpu);
+}
+
static void perf_pending_event(struct irq_work *entry)
{
- struct perf_event *event = container_of(entry,
- struct perf_event, pending);
+ struct perf_event *event = container_of(entry, struct perf_event, pending);
int rctx;
rctx = perf_swevent_get_recursion_context();
@@ -5822,10 +5858,7 @@ static void perf_pending_event(struct ir
* and we won't recurse 'further'.
*/
- if (event->pending_disable) {
- event->pending_disable = 0;
- perf_event_disable_local(event);
- }
+ perf_pending_event_disable(event);
if (event->pending_wakeup) {
event->pending_wakeup = 0;
@@ -10236,6 +10269,7 @@ perf_event_alloc(struct perf_event_attr
init_waitqueue_head(&event->waitq);
+ event->pending_disable = -1;
init_irq_work(&event->pending, perf_pending_event);
mutex_init(&event->mmap_mutex);
On 4/8/19 11:50 AM, Peter Zijlstra wrote:
> On Mon, Apr 08, 2019 at 10:22:29AM +0200, Peter Zijlstra wrote:
>> On Mon, Apr 08, 2019 at 09:12:28AM +0200, Thomas-Mich Richter wrote:
>
>>> very good news, your fix ran over the weekend without any hit!!!
>>>
>>> Thanks very much for your help. Do you submit this patch to the kernel mailing list?
>>
>> Most excellent, let me go write a Changelog.
>
> Hi Thomas, find below.
>
> Sadly, while writing the Changelog I ended up with a 'completely'
> different patch again; could I bother you to test this one too?
>
Hi Peter, no problem, the test is now running overnight.
I'll let you know the outcome tomorrow...
--
Thomas Richter, Dept 3252, IBM s390 Linux Development, Boeblingen, Germany
--
Vorsitzender des Aufsichtsrats: Matthias Hartmann
Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht Stuttgart, HRB 243294
On 4/8/19 11:50 AM, Peter Zijlstra wrote:
> On Mon, Apr 08, 2019 at 10:22:29AM +0200, Peter Zijlstra wrote:
>> On Mon, Apr 08, 2019 at 09:12:28AM +0200, Thomas-Mich Richter wrote:
>
>>> very good news, your fix ran over the weekend without any hit!!!
>>>
>>> Thanks very much for your help. Do you submit this patch to the kernel mailing list?
>>
>> Most excellent, let me go write a Changelog.
>
> Hi Thomas, find below.
>
> Sadly, while writing the Changelog I ended up with a 'completely'
> different patch again; could I bother you to test this one too?
>
Hi Peter,
I verified your second patch, it ran overnight without any message in the
kernel log. Thanks again.
Tested-by: Thomas Richter <[email protected]>
--
Thomas Richter, Dept 3252, IBM s390 Linux Development, Boeblingen, Germany
--
Vorsitzender des Aufsichtsrats: Matthias Hartmann
Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht Stuttgart, HRB 243294
On Tue, Apr 09, 2019 at 08:07:49AM +0200, Thomas-Mich Richter wrote:
> On 4/8/19 11:50 AM, Peter Zijlstra wrote:
> > On Mon, Apr 08, 2019 at 10:22:29AM +0200, Peter Zijlstra wrote:
> >> On Mon, Apr 08, 2019 at 09:12:28AM +0200, Thomas-Mich Richter wrote:
> >
> >>> very good news, your fix ran over the weekend without any hit!!!
> >>>
> >>> Thanks very much for your help. Do you submit this patch to the kernel mailing list?
> >>
> >> Most excellent, let me go write a Changelog.
> >
> > Hi Thomas, find below.
> >
> > Sadly, while writing the Changelog I ended up with a 'completely'
> > different patch again; could I bother you to test this one too?
> >
>
> Hi Peter,
>
> I verified your second patch, it ran overnight without any message in the
> kernel log. Thanks again.
>
> Tested-by: Thomas Richter <[email protected]>
Most excellent, thanks for that!
On Mon, Apr 08, 2019 at 11:50:31AM +0200, Peter Zijlstra wrote:
> On Mon, Apr 08, 2019 at 10:22:29AM +0200, Peter Zijlstra wrote:
> > On Mon, Apr 08, 2019 at 09:12:28AM +0200, Thomas-Mich Richter wrote:
>
> > > very good news, your fix ran over the weekend without any hit!!!
> > >
> > > Thanks very much for your help. Do you submit this patch to the kernel mailing list?
> >
> > Most excellent, let me go write a Changelog.
>
> Hi Thomas, find below.
>
> Sadly, while writing the Changelog I ended up with a 'completely'
> different patch again; could I bother you to test this one too?
>
> ---
> Subject: perf: Fix perf_event_disable_inatomic()
> From: Peter Zijlstra <[email protected]>
> Date: Thu, 4 Apr 2019 15:03:00 +0200
>
> Thomas-Mich Richter reported he triggered a WARN from event_function_local()
> on his s390. The problem boils down to:
>
> CPU-A CPU-B
>
> perf_event_overflow()
> perf_event_disable_inatomic()
> @pending_disable = 1
> irq_work_queue();
>
> sched-out
> event_sched_out()
> @pending_disable = 0
>
> sched-in
> perf_event_overflow()
> perf_event_disable_inatomic()
> @pending_disable = 1;
> irq_work_queue(); // FAILS
>
> irq_work_run()
> perf_pending_event()
> if (@pending_disable)
> perf_event_disable_local(); // WHOOPS
>
> The problem exists in generic code, but s390 is particularly sensitive
> because it doesn't implement arch_irq_work_raise(), nor does it call
> irq_work_run() from its PMU interrupt handler (nor would that be
> sufficient in this case, because s390 also generates
> perf_event_overflow() from pmu::stop). Add to that the fact that s390
> is a virtual architecture and (virtual) CPU-A can stall long enough
> for the above race to happen, even if it would self-IPI.
>
> Adding an irq_work_syn() to event_sched_in() would work for all hardware
> PMUs that properly use irq_work_run() but fails for software PMUs.
Typo: s/syn/sync/
>
> Instead encode the CPU number in @pending_disable, such that we can
> tell which CPU requested the disable. This then allows us to detect
> the above scenario and even redirect the IPI to make up for the failed
> queue.
>
> Cc: Heiko Carstens <[email protected]>
> Cc: Kees Cook <[email protected]>
> Cc: Hendrik Brueckner <[email protected]>
> Cc: [email protected]
> Cc: Martin Schwidefsky <[email protected]>
> Cc: Mark Rutland <[email protected]>
> Cc: Jiri Olsa <[email protected]>
> Reported-by: Thomas-Mich Richter <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
I can't think of a nicer way of handling this, so FWIW:
Acked-by: Mark Rutland <[email protected]>
Mark.
> ---
> kernel/events/core.c | 52 ++++++++++++++++++++++++++++++++++++++++++---------
> 1 file changed, 43 insertions(+), 9 deletions(-)
>
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2009,8 +2009,8 @@ event_sched_out(struct perf_event *event
> event->pmu->del(event, 0);
> event->oncpu = -1;
>
> - if (event->pending_disable) {
> - event->pending_disable = 0;
> + if (READ_ONCE(event->pending_disable) >= 0) {
> + WRITE_ONCE(event->pending_disable, -1);
> state = PERF_EVENT_STATE_OFF;
> }
> perf_event_set_state(event, state);
> @@ -2198,7 +2198,8 @@ EXPORT_SYMBOL_GPL(perf_event_disable);
>
> void perf_event_disable_inatomic(struct perf_event *event)
> {
> - event->pending_disable = 1;
> + WRITE_ONCE(event->pending_disable, smp_processor_id());
> + /* can fail, see perf_pending_event_disable() */
> irq_work_queue(&event->pending);
> }
>
> @@ -5810,10 +5811,45 @@ void perf_event_wakeup(struct perf_event
> }
> }
>
> +static void perf_pending_event_disable(struct perf_event *event)
> +{
> + int cpu = READ_ONCE(event->pending_disable);
> +
> + if (cpu < 0)
> + return;
> +
> + if (cpu == smp_processor_id()) {
> + WRITE_ONCE(event->pending_disable, -1);
> + perf_event_disable_local(event);
> + return;
> + }
> +
> + /*
> + * CPU-A CPU-B
> + *
> + * perf_event_disable_inatomic()
> + * @pending_disable = CPU-A;
> + * irq_work_queue();
> + *
> + * sched-out
> + * @pending_disable = -1;
> + *
> + * sched-in
> + * perf_event_disable_inatomic()
> + * @pending_disable = CPU-B;
> + * irq_work_queue(); // FAILS
> + *
> + * irq_work_run()
> + * perf_pending_event()
> + *
> + * But the event runs on CPU-B and wants disabling there.
> + */
> + irq_work_queue_on(&event->pending, cpu);
> +}
> +
> static void perf_pending_event(struct irq_work *entry)
> {
> - struct perf_event *event = container_of(entry,
> - struct perf_event, pending);
> + struct perf_event *event = container_of(entry, struct perf_event, pending);
> int rctx;
>
> rctx = perf_swevent_get_recursion_context();
> @@ -5822,10 +5858,7 @@ static void perf_pending_event(struct ir
> * and we won't recurse 'further'.
> */
>
> - if (event->pending_disable) {
> - event->pending_disable = 0;
> - perf_event_disable_local(event);
> - }
> + perf_pending_event_disable(event);
>
> if (event->pending_wakeup) {
> event->pending_wakeup = 0;
> @@ -10236,6 +10269,7 @@ perf_event_alloc(struct perf_event_attr
>
>
> init_waitqueue_head(&event->waitq);
> + event->pending_disable = -1;
> init_irq_work(&event->pending, perf_pending_event);
>
> mutex_init(&event->mmap_mutex);
On Wed, Apr 10, 2019 at 03:51:24PM +0200, Thomas-Mich Richter wrote:
> Thanks for the fix with commit id 86071b11317550d994b55ce5e31aa06bcad783b5.
>
> However doing an fgrep on the pending_disable member of struct perf_event
> reveals two more hits in file kernel/events/ringbuffer.c when events
> have auxiliary data.
>
> Not sure if this needs to be addressed too, just wanted to let you know.
*groan* indeed, and yes that would need fixing too. I completely missed
the AUX stuff using that too.
I think we can simply do the below on top, Alexander?
---
kernel/events/ring_buffer.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 678ccec60d8f..ab2b7b38adc5 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -392,7 +392,7 @@ void *perf_aux_output_begin(struct perf_output_handle *handle,
* store that will be enabled on successful return
*/
if (!handle->size) { /* A, matches D */
- event->pending_disable = 1;
+ event->pending_disable = smp_processor_id();
perf_output_wakeup(handle);
local_set(&rb->aux_nest, 0);
goto err_put;
@@ -480,7 +480,7 @@ void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size)
if (wakeup) {
if (handle->aux_flags & PERF_AUX_FLAG_TRUNCATED)
- handle->event->pending_disable = 1;
+ handle->event->pending_disable = smp_processor_id();
perf_output_wakeup(handle);
}
On 4/9/19 10:53 AM, Mark Rutland wrote:
> On Mon, Apr 08, 2019 at 11:50:31AM +0200, Peter Zijlstra wrote:
>> On Mon, Apr 08, 2019 at 10:22:29AM +0200, Peter Zijlstra wrote:
>>> On Mon, Apr 08, 2019 at 09:12:28AM +0200, Thomas-Mich Richter wrote:
>>
.....
>>
>> Instead encode the CPU number in @pending_disable, such that we can
>> tell which CPU requested the disable. This then allows us to detect
>> the above scenario and even redirect the IPI to make up for the failed
>> queue.
>>
>> Cc: Heiko Carstens <[email protected]>
>> Cc: Kees Cook <[email protected]>
>> Cc: Hendrik Brueckner <[email protected]>
>> Cc: [email protected]
>> Cc: Martin Schwidefsky <[email protected]>
>> Cc: Mark Rutland <[email protected]>
>> Cc: Jiri Olsa <[email protected]>
>> Reported-by: Thomas-Mich Richter <[email protected]>
>> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
>
> I can't think of a nicer way of handling this, so FWIW:
>
> Acked-by: Mark Rutland <[email protected]>
>
> Mark.
>
>> ---
Thanks for the fix with commit id 86071b11317550d994b55ce5e31aa06bcad783b5.
However doing an fgrep on the pending_disable member of struct perf_event
reveals two more hits in file kernel/events/ring_buffer.c when events
have auxiliary data.
Not sure if this needs to be addressed too, just wanted to let you know.
--
Thomas Richter, Dept 3252, IBM s390 Linux Development, Boeblingen, Germany
--
Vorsitzender des Aufsichtsrats: Matthias Hartmann
Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht Stuttgart, HRB 243294
Peter Zijlstra <[email protected]> writes:
> On Wed, Apr 10, 2019 at 03:51:24PM +0200, Thomas-Mich Richter wrote:
>> Thanks for the fix with commit id 86071b11317550d994b55ce5e31aa06bcad783b5.
>>
>> However doing an fgrep on the pending_disable member of struct perf_event
>> reveals two more hits in file kernel/events/ring_buffer.c when events
>> have auxiliary data.
>>
>> Not sure if this needs to be addressed too, just wanted to let you know.
>
> *groan* indeed, and yes that would need fixing too. I completely missed
> the AUX stuff using that too.
>
> I think we can simply do the below on top, Alexander?
Looks good. FWIW, renaming pending_disable in the first place to match
the fact that it's a cpu# now would probably have caught all these.
> ---
> kernel/events/ring_buffer.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index 678ccec60d8f..ab2b7b38adc5 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -392,7 +392,7 @@ void *perf_aux_output_begin(struct perf_output_handle *handle,
> * store that will be enabled on successful return
> */
> if (!handle->size) { /* A, matches D */
> - event->pending_disable = 1;
> + event->pending_disable = smp_processor_id();
> perf_output_wakeup(handle);
> local_set(&rb->aux_nest, 0);
> goto err_put;
> @@ -480,7 +480,7 @@ void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size)
>
> if (wakeup) {
> if (handle->aux_flags & PERF_AUX_FLAG_TRUNCATED)
> - handle->event->pending_disable = 1;
> + handle->event->pending_disable = smp_processor_id();
> perf_output_wakeup(handle);
> }
Thanks,
--
Alex