2019-04-03 15:01:57

by Vince Weaver

[permalink] [raw]
Subject: perf: perf_fuzzer crashes on Pentium 4 systems


so moving this to its own thread.

There was a two-part question asked.
1. Can the perf-fuzzer crash a Pentium 4 system
2. Does anyone care anymore?

The answer to #1 turns out to be "yes"
I'm not sure about #2 (but it's telling my p4 test system hadn't been
turned on in over 3 years).

In any case the perf_fuzzer can crash my p4 system within an hour or so.
The debugging from this isn't great, I forget what the preferred debug
things to enable in the kernel hacking menu are.

Here is one crash that just happened:

The instruction at RIP is unhelpfully
./arch/x86/include/asm/processor.h:400
which is
DECLARE_PER_CPU_FIRST(union irq_stack_union, irq_stack_union) __visible;

Though looking at the assembly it looks like
p4_pmu_enable_event() is called with NULL as the paramater.

[ 1930.122902] BUG: unable to handle kernel NULL pointer dereference at 0000000000000158
[ 1930.130715] #PF error: [normal kernel read fault]
[ 1930.135402] PGD 0 P4D 0
[ 1930.137928] Oops: 0000 [#1] SMP PTI
[ 1930.141405] CPU: 0 PID: 30179 Comm: perf_fuzzer Tainted: G W 5.1.0-rc3+ #6
[ 1930.149555] Hardware name: LENOVO 88088NU/LENOVO, BIOS 2JKT37AUS 07/12/2007
[ 1930.156497] RIP: 0010:p4_pmu_enable_event+0x10/0x160
[ 1930.161443] Code: 89 f0 0f 30 31 c0 8b 15 e6 2e 0f 01 85 d2 7f 01 c3 89 c2 89 cf e9 70 65 3b 00 0f 1f 44 00 00 41 56 41 55 41 54 49 89 fc 55 53 <48> 8b 9f 58 01 00 00 48 89 dd 48 89 da 48 c1 ed 20 48 c1 ea 3f 89
[ 1930.180155] RSP: 0018:ffffc90001f57d50 EFLAGS: 00010017
[ 1930.185361] RAX: 0000000000000000 RBX: 000000000000000c RCX: 0000000000000360
[ 1930.192472] RDX: 0000000000000000 RSI: 0000000000000400 RDI: 0000000000000000
[ 1930.199582] RBP: ffff88803e40f620 R08: 00000000ffffffff R09: 0000000c00000000
[ 1930.206691] R10: 4801fe0000fc0000 R11: 8000000fce030200 R12: 0000000000000000
[ 1930.213802] R13: ffff888035c4a0c0 R14: ffff88803e429300 R15: 0000000000000402
[ 1930.220913] FS: 00007ff3b934a540(0000) GS:ffff88803e400000(0000) knlGS:0000000000000000
[ 1930.228976] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1930.234700] CR2: 0000000000000158 CR3: 000000003a72e000 CR4: 00000000000007f0
[ 1930.241811] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1930.248921] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 1930.256030] Call Trace:
[ 1930.258472] p4_pmu_enable_all+0x3c/0x50
[ 1930.262384] __perf_event_task_sched_in+0x174/0x1a0
[ 1930.267247] ? __switch_to_asm+0x34/0x70
[ 1930.271155] ? __switch_to_asm+0x40/0x70
[ 1930.275064] ? __switch_to_asm+0x34/0x70
[ 1930.278971] ? __switch_to_asm+0x40/0x70
[ 1930.282882] finish_task_switch+0x10a/0x290
[ 1930.287053] __schedule+0x207/0x800
[ 1930.290530] ? event_function_call+0x85/0x100
[ 1930.294873] ? ctx_resched+0xc0/0xc0
[ 1930.298437] preempt_schedule_common+0xa/0x20
[ 1930.302777] _cond_resched+0x1d/0x30
[ 1930.306340] mutex_lock+0xe/0x30
[ 1930.309558] perf_event_ctx_lock_nested.isra.89+0x46/0x90
[ 1930.314939] ? _perf_event_disable+0x40/0x40
[ 1930.319193] perf_event_task_enable+0x3f/0xa0
[ 1930.323537] __x64_sys_prctl+0x1b2/0x560
[ 1930.327448] do_syscall_64+0x4f/0xf0
[ 1930.331011] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1930.336045] RIP: 0033:0x7ff3b928240a


2019-04-03 15:16:09

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: perf: perf_fuzzer crashes on Pentium 4 systems

On Wed, Apr 03, 2019 at 10:59:32AM -0400, Vince Weaver wrote:
>
> so moving this to its own thread.
>
> There was a two-part question asked.
> 1. Can the perf-fuzzer crash a Pentium 4 system
> 2. Does anyone care anymore?
>
> The answer to #1 turns out to be "yes"
> I'm not sure about #2 (but it's telling my p4 test system hadn't been
> turned on in over 3 years).
>
> In any case the perf_fuzzer can crash my p4 system within an hour or so.
> The debugging from this isn't great, I forget what the preferred debug
> things to enable in the kernel hacking menu are.
>
> Here is one crash that just happened:
>
> The instruction at RIP is unhelpfully
> ./arch/x86/include/asm/processor.h:400
> which is
> DECLARE_PER_CPU_FIRST(union irq_stack_union, irq_stack_union) __visible;
>
> Though looking at the assembly it looks like
> p4_pmu_enable_event() is called with NULL as the paramater.
>

Interesting! I'll look more carefully at evening. As far as I remember we
rely on active_mask bit set completely, not sure how it could happen that
we get nil here. Thanks for pointing!

2019-04-03 19:22:00

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: perf: perf_fuzzer crashes on Pentium 4 systems

On Wed, Apr 03, 2019 at 10:59:32AM -0400, Vince Weaver wrote:
>
> so moving this to its own thread.
>
> There was a two-part question asked.
> 1. Can the perf-fuzzer crash a Pentium 4 system
> 2. Does anyone care anymore?
>
> The answer to #1 turns out to be "yes"
> I'm not sure about #2 (but it's telling my p4 test system hadn't been
> turned on in over 3 years).
>
> In any case the perf_fuzzer can crash my p4 system within an hour or so.
> The debugging from this isn't great, I forget what the preferred debug
> things to enable in the kernel hacking menu are.
>
> Here is one crash that just happened:
>
> The instruction at RIP is unhelpfully
> ./arch/x86/include/asm/processor.h:400
> which is
> DECLARE_PER_CPU_FIRST(union irq_stack_union, irq_stack_union) __visible;
>
> Though looking at the assembly it looks like
> p4_pmu_enable_event() is called with NULL as the paramater.
>

You know, seems I got what happened -- p4_general_events do
not cover all general events, they stop at PERF_COUNT_HW_BUS_CYCLES,
while more 3 general event left. This is 'cause I've not been following
pmu evolution in code. I will try to cover this events hopefully more
less soon and send you a patch to test (if you don't mind).

2019-04-03 20:32:39

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: perf: perf_fuzzer crashes on Pentium 4 systems

On Wed, Apr 03, 2019 at 10:19:44PM +0300, Cyrill Gorcunov wrote:
>
> You know, seems I got what happened -- p4_general_events do
> not cover all general events, they stop at PERF_COUNT_HW_BUS_CYCLES,
> while more 3 general event left. This is 'cause I've not been following
> pmu evolution in code. I will try to cover this events hopefully more
> less soon and send you a patch to test (if you don't mind).

Still this should not cause nil deref, continue investigating. Vince
could oyu please apply the patch below, I doubt if it help with nil
issue but worth having anyway
---
From: Cyrill Gorcunov <[email protected]>
Subject: [PATCH] perf/x86/intel/p4: Limit p4_general_events down to real ones

p4_general_events are allocated up to PERF_COUNT_HW_MAX while this constant
is bigger than the number of general events we do support by now. Thus the
all other entries are equal to zero and maps to P4_EVENT_TC_DELIVER_MODE
which is wrong of course. Instead drop off PERF_COUNT_HW_MAX constant
from declaration, we use ARRAY_SIZE for max_events.

Signed-off-by: Cyrill Gorcunov <[email protected]>
---
arch/x86/events/intel/p4.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-tip.git/arch/x86/events/intel/p4.c
===================================================================
--- linux-tip.git.orig/arch/x86/events/intel/p4.c
+++ linux-tip.git/arch/x86/events/intel/p4.c
@@ -648,7 +648,7 @@ static u64 p4_get_alias_event(u64 config
return config_match | (config & P4_CONFIG_EVENT_ALIAS_IMMUTABLE_BITS);
}

-static u64 p4_general_events[PERF_COUNT_HW_MAX] = {
+static u64 p4_general_events[] = {
/* non-halted CPU clocks */
[PERF_COUNT_HW_CPU_CYCLES] =
p4_config_pack_escr(P4_ESCR_EVENT(P4_EVENT_GLOBAL_POWER_EVENTS) |

2019-04-04 13:27:17

by Vince Weaver

[permalink] [raw]
Subject: Re: perf: perf_fuzzer crashes on Pentium 4 systems

On Wed, 3 Apr 2019, Cyrill Gorcunov wrote:

> On Wed, Apr 03, 2019 at 10:19:44PM +0300, Cyrill Gorcunov wrote:
> >
> > You know, seems I got what happened -- p4_general_events do
> > not cover all general events, they stop at PERF_COUNT_HW_BUS_CYCLES,
> > while more 3 general event left. This is 'cause I've not been following
> > pmu evolution in code. I will try to cover this events hopefully more
> > less soon and send you a patch to test (if you don't mind).
>
> Still this should not cause nil deref, continue investigating. Vince
> could oyu please apply the patch below, I doubt if it help with nil
> issue but worth having anyway


It looks like there are at least two bugs here, one that's a full
hardlockup with nothing on serial console. The other is the NULL
dereference.

Just ran with your patch applied and it hit the hard lockup case.

I'll have to see if things are reproducible and I can try to see if I can
get a reproducible value for what even caused the issue. perf_fuzzer has
some infrastructure for determining that but it's hit or miss if you can
get anything useful from it.

I'll keep running things, but I'm a bit busy at work here the next few
days so there might be some delay in the results.

Vince

2019-04-04 13:36:10

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: perf: perf_fuzzer crashes on Pentium 4 systems

On Thu, Apr 04, 2019 at 09:25:47AM -0400, Vince Weaver wrote:
>
> It looks like there are at least two bugs here, one that's a full
> hardlockup with nothing on serial console. The other is the NULL
> dereference.
>

Seems so. I've spent plenty of time yesterday trying to figure out how
we even reach the case when event = nil but without much luck.

> Just ran with your patch applied and it hit the hard lockup case.
>
> I'll have to see if things are reproducible and I can try to see if I can
> get a reproducible value for what even caused the issue. perf_fuzzer has
> some infrastructure for determining that but it's hit or miss if you can
> get anything useful from it.

At least the sequence of events migh give us some ideas, maybe not indeed.
Thanks a huge, Vince!

> I'll keep running things, but I'm a bit busy at work here the next few
> days so there might be some delay in the results.

Sure, take your time. I think this problem with p4 is not urgent. I'll ping
you if I get some more ideas.

2019-04-04 16:39:33

by Vince Weaver

[permalink] [raw]
Subject: Re: perf: perf_fuzzer crashes on Pentium 4 systems

On Thu, 4 Apr 2019, Cyrill Gorcunov wrote:

> On Thu, Apr 04, 2019 at 09:25:47AM -0400, Vince Weaver wrote:
> >
> > It looks like there are at least two bugs here, one that's a full
> > hardlockup with nothing on serial console. The other is the NULL
> > dereference.

OK, it turns out the hard-lock and the null pointer dereference might be
the same, I have a random seed for the fuzzer from a hard-lock crash that
reproduces and it generated the null pointer crash. (This is with your
patch applied).

I can try to see if I can bisect down to a specific event sequence that
triggers this, but that can be tricky sometimes if things lock up so fast
that the event log doesn't get written out before the crash.

Vince


2019-04-04 16:47:56

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: perf: perf_fuzzer crashes on Pentium 4 systems

On Thu, Apr 04, 2019 at 12:37:18PM -0400, Vince Weaver wrote:
> On Thu, 4 Apr 2019, Cyrill Gorcunov wrote:
>
> > On Thu, Apr 04, 2019 at 09:25:47AM -0400, Vince Weaver wrote:
> > >
> > > It looks like there are at least two bugs here, one that's a full
> > > hardlockup with nothing on serial console. The other is the NULL
> > > dereference.
>
> OK, it turns out the hard-lock and the null pointer dereference might be
> the same, I have a random seed for the fuzzer from a hard-lock crash that
> reproduces and it generated the null pointer crash. (This is with your
> patch applied).

I see. My patch simply eliminates wrong event for unimplemented general
events, but it definitely won't help with nil deref, so it mostly to
eliminate some potential side effects.

> I can try to see if I can bisect down to a specific event sequence that
> triggers this, but that can be tricky sometimes if things lock up so fast
> that the event log doesn't get written out before the crash.

Oh, Vince, I suspect such kind of bisection might consume a lot of your
time :( Maybe we could update perf fuzzer so that it would send events
to some net-storage first then write them to the counters, iow to automatize
this all stuff somehow?

2019-04-04 19:02:26

by Vince Weaver

[permalink] [raw]
Subject: Re: perf: perf_fuzzer crashes on Pentium 4 systems

On Thu, 4 Apr 2019, Cyrill Gorcunov wrote:

> On Thu, Apr 04, 2019 at 12:37:18PM -0400, Vince Weaver wrote:
>
> Oh, Vince, I suspect such kind of bisection might consume a lot of your
> time :( Maybe we could update perf fuzzer so that it would send events
> to some net-storage first then write them to the counters, iow to automatize
> this all stuff somehow?

I do have a lot of this automated already from tracking down past bugs,
but it turns out that most of the fuzzer-found bugs aren't deterministic
so it doesn't always work.

For example this bug, while I can easily repeat it, doesn't happen at
the same time each time. I suspect something corrupts things, but the
crash doesn't trigger until a context switch happens.

For what it's worth I've put code in p4_pmu_enable_all() to see what's
going on when the NULL dereference happens, and sure enough the printk is
triggered where I'd expect.

[ 138.132889] VMW: p4_pmu_enable_all: idx 4 is NULL
[ 138.171380] VMW: p4_pmu_enable_all: idx 4 is NULL
[ 138.212588] VMW: p4_pmu_enable_all: idx 4 is NULL
[ 138.263761] VMW: p4_pmu_enable_all: idx 4 is NULL
[ 138.279944] VMW: p4_pmu_enable_all: idx 4 is NULL

static void p4_pmu_enable_all(int added)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
int idx;

for (idx = 0; idx < x86_pmu.num_counters; idx++) {
struct perf_event *event = cpuc->events[idx];
if (!test_bit(idx, cpuc->active_mask))
continue;
if (event==NULL) {
printk("VMW: p4_pmu_enable_all: idx %d is NULL\n",idx);
} else {
p4_pmu_enable_event(event);
}
}
}


the machine still crashes after this, but not right away.

Vince

2019-04-04 21:08:58

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: perf: perf_fuzzer crashes on Pentium 4 systems

On Thu, Apr 04, 2019 at 03:01:14PM -0400, Vince Weaver wrote:
>
> I do have a lot of this automated already from tracking down past bugs,
> but it turns out that most of the fuzzer-found bugs aren't deterministic
> so it doesn't always work.
>
> For example this bug, while I can easily repeat it, doesn't happen at
> the same time each time. I suspect something corrupts things, but the
> crash doesn't trigger until a context switch happens.

I fear so, I've readin code around to figure out where it might came
from but without much luck yet.

> For what it's worth I've put code in p4_pmu_enable_all() to see what's
> going on when the NULL dereference happens, and sure enough the printk is
> triggered where I'd expect.
>
> [ 138.132889] VMW: p4_pmu_enable_all: idx 4 is NULL
...
>
> the machine still crashes after this, but not right away.

yes, exactly, if look into disasm code we will see that 0x158
offset points to hwc from event. Vince, gimme some time, probably
the weekend so I would dive into the perf code more deeply and
will try to make some debugging patch for more precise tracking
of events. The kernel you're running is the latest -tip?

2019-04-07 20:22:16

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: perf: perf_fuzzer crashes on Pentium 4 systems

On Thu, Apr 04, 2019 at 11:20:38PM +0300, Cyrill Gorcunov wrote:
> > the machine still crashes after this, but not right away.
>
> yes, exactly, if look into disasm code we will see that 0x158
> offset points to hwc from event. Vince, gimme some time, probably
> the weekend so I would dive into the perf code more deeply and
> will try to make some debugging patch for more precise tracking
> of events. The kernel you're running is the latest -tip?

Vince, could you please disable alias events and see if it change
anything, once you have time? Note once we've aliases disabled the
counter for cpu cycles get used for NMI watchdog so PERF_COUNT_HW_CPU_CYCLES
won't be available in "perf" tool itself, but I guess perf_fuzzer uses
direct kernel syscall.
---
arch/x86/events/intel/p4.c | 2 ++
1 file changed, 2 insertions(+)

Index: linux-tip.git/arch/x86/events/intel/p4.c
===================================================================
--- linux-tip.git.orig/arch/x86/events/intel/p4.c
+++ linux-tip.git/arch/x86/events/intel/p4.c
@@ -622,6 +622,8 @@ static u64 p4_get_alias_event(u64 config
u64 config_match;
int i;

+ return 0;
+
/*
* Only event with special mark is allowed,
* we're to be sure it didn't come as malformed

2019-04-09 16:39:34

by Vince Weaver

[permalink] [raw]
Subject: Re: perf: perf_fuzzer crashes on Pentium 4 systems

On Sun, 7 Apr 2019, Cyrill Gorcunov wrote:

> Vince, could you please disable alias events and see if it change
> anything, once you have time? Note once we've aliases disabled the
> counter for cpu cycles get used for NMI watchdog so PERF_COUNT_HW_CPU_CYCLES
> won't be available in "perf" tool itself, but I guess perf_fuzzer uses
> direct kernel syscall.
> ---
> arch/x86/events/intel/p4.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> Index: linux-tip.git/arch/x86/events/intel/p4.c
> ===================================================================
> --- linux-tip.git.orig/arch/x86/events/intel/p4.c
> +++ linux-tip.git/arch/x86/events/intel/p4.c
> @@ -622,6 +622,8 @@ static u64 p4_get_alias_event(u64 config
> u64 config_match;
> int i;
>
> + return 0;
> +
> /*
> * Only event with special mark is allowed,
> * we're to be sure it didn't come as malformed
>

It still crashes at the same place with this patch and my reproducible
test case.

Vince


2019-04-09 17:01:52

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: perf: perf_fuzzer crashes on Pentium 4 systems

On Tue, Apr 09, 2019 at 12:38:25PM -0400, Vince Weaver wrote:
>
> It still crashes at the same place with this patch and my reproducible
> test case.

Thank for testing, Vince! At least now we know it is unrelated to
alias events. Will ping you once I get more ideas.