2018-12-04 15:56:07

by Vince Weaver

Subject: perf: perf_fuzzer triggers GPF in perf_prepare_sample

Hello,

I was able to trigger another oops with the perf_fuzzer on current git.

This is 4.20-rc5, after the fix for the very similar oops I reported
previously was committed.

It seems to point to the same location in the source as before;
maybe it is being triggered a different way?

Unfortunately, unlike the last one, this crash is not easily reproducible.

kernel/events/core.c:6393

    if (sample_type & PERF_SAMPLE_CALLCHAIN) {
        int size = 1;

        if (!(sample_type & __PERF_SAMPLE_CALLCHAIN_EARLY))
            data->callchain = perf_callchain(event, regs);

>>>>>>>>> size += data->callchain->nr;

        header->size += size * sizeof(u64);
    }
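The quoted branch only calls perf_callchain() when __PERF_SAMPLE_CALLCHAIN_EARLY is clear; when it is set, the code trusts whatever data->callchain already holds and dereferences it unchecked. The userspace model below is a minimal sketch of that control flow. The bit values are taken from the v4.20 uapi header, but the struct and function names are hypothetical stand-ins, not kernel code.

```c
#include <stdint.h>

/* Bit values from include/uapi/linux/perf_event.h as of v4.20. */
#define PERF_SAMPLE_CALLCHAIN         (1ULL << 5)
#define __PERF_SAMPLE_CALLCHAIN_EARLY (1ULL << 63)

/* Hypothetical stand-ins for struct perf_callchain_entry and
 * struct perf_sample_data; only the fields used here are modelled. */
struct model_callchain { uint64_t nr; uint64_t ip[8]; };
struct model_sample_data { struct model_callchain *callchain; };

/* Stand-in for perf_callchain(): always hands back a valid entry. */
static struct model_callchain model_fresh_chain = { .nr = 3 };
static struct model_callchain *model_perf_callchain(void)
{
	return &model_fresh_chain;
}

/* Mirrors the quoted branch: returns how many u64 words the callchain
 * contributes to the sample (0 if PERF_SAMPLE_CALLCHAIN is unset). */
static int model_callchain_words(uint64_t sample_type,
				 struct model_sample_data *data)
{
	int size = 1; /* the leading nr word */

	if (!(sample_type & PERF_SAMPLE_CALLCHAIN))
		return 0;

	/* Only the non-early path (re)computes the callchain here. */
	if (!(sample_type & __PERF_SAMPLE_CALLCHAIN_EARLY))
		data->callchain = model_perf_callchain();

	/* Early path: whatever data->callchain already holds is
	 * dereferenced with no check; in the oops this is the faulting load. */
	size += (int)data->callchain->nr;
	return size;
}
```

With the early bit set, correctness depends entirely on the caller having filled in data->callchain beforehand; a stale or uninitialised pointer is dereferenced at exactly the marked line.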


Vince

[45050.698745] general protection fault: 0000 [#1] SMP PTI
[45050.698745] CPU: 5 PID: 13475 Comm: perf_fuzzer Tainted: G W 4.20.0-rc5 #124
[45050.698746] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
[45050.698746] RIP: 0010:perf_prepare_sample+0x82/0x4a0
[45050.698746] Code: 06 4c 89 ea 4c 89 e6 e8 3c 54 ff ff 40 f6 c5 01 0f 85 28 01 00 00 40 f6 c5 20 74 1c 48 85 ed 0f 89 04 01 00 00 49 8b 44 24 70 <48> 8b 00 8d 04 c5 08 00 00 00 66 01 43 06 f7 c5 00 04 00 00 74 41
[45050.698747] RSP: 0000:ffffc900206bfb00 EFLAGS: 00010082
[45050.698747] RAX: dead000000000200 RBX: ffffc900206bfb58 RCX: 000000000000001f
[45050.698747] RDX: 0000000000000000 RSI: 0000000025bbf56f RDI: 0000000000000000
[45050.698748] RBP: 8000000000000275 R08: 0000000000000002 R09: 00000000000215c0
[45050.698748] R10: 00008b25b2e2f5c8 R11: 0000000000000000 R12: ffffc900206bfc40
[45050.698748] R13: ffff8880cf6d7800 R14: ffffc900206bfb98 R15: ffff88811ab4f420
[45050.698748] FS: 00007fab66133500(0000) GS:ffff88811ab40000(0000) knlGS:0000000000000000
[45050.698749] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[45050.698749] CR2: 00007fab66133480 CR3: 00000000811aa004 CR4: 00000000001607e0
[45050.698749] DR0: 0000000000000000 DR1: 000000008e8e8000 DR2: 0000000000000000
[45050.698749] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[45050.698750] Call Trace:
[45050.698750] intel_pmu_drain_bts_buffer+0x151/0x220
[45050.698750] ? mem_cgroup_commit_charge+0x7a/0x510
[45050.698750] ? wp_page_copy+0x39e/0x650
[45050.698750] ? reuse_swap_page+0x129/0x340
[45050.698751] ? _raw_spin_unlock+0xa/0x10
[45050.698751] ? do_wp_page+0x30f/0x4d0
[45050.698751] ? finish_mkwrite_fault+0x140/0x140
[45050.698751] ? __handle_mm_fault+0xb22/0x12c0
[45050.698751] intel_pmu_handle_irq+0x6d/0x160
[45050.698752] perf_event_nmi_handler+0x2d/0x50
[45050.698752] nmi_handle+0x63/0x110
[45050.698752] default_do_nmi+0x4e/0x100
[45050.698752] do_nmi+0x112/0x170
[45050.698752] nmi+0x8b/0xd4
[45050.698753] RIP: 0033:0x558a6a6366c3
[45050.698753] Code: 01 d0 48 c1 e0 06 48 89 c2 48 8d 05 cf 93 23 00 48 8b 04 02 48 85 c0 74 11 8b 45 f8 3b 45 f4 75 05 8b 45 fc eb 16 83 45 f8 01 <83> 45 fc 01 81 7d fc 9f 86 01 00 7e 96 b8 ff ff ff ff c9 c3 55 48
[45050.698753] RSP: 002b:00007ffc9f521660 EFLAGS: 00000246
[45050.698754] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000030
[45050.698754] RDX: 000000000000e740 RSI: 00007ffc9f521634 RDI: 00007fab6612c740
[45050.698754] RBP: 00007ffc9f521670 R08: 00007fab6612c1f0 R09: 00007fab6612c240
[45050.698754] R10: 00007fab661337d0 R11: 0000000000000246 R12: 0000558a6a6364c0
[45050.698755] R13: 00007ffc9f523ad0 R14: 0000000000000000 R15: 0000000000000000
[45050.698755] Modules linked in: intel_rapl x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel coretemp tpm_tis snd_hda_codec snd_hda_core kvm_intel tpm_tis_core i915 snd_hwdep kvm tpm snd_pcm rng_core wmi_bmof mei_me sg iosf_mbi irqbypass drm_kms_helper evdev crct10dif_pclmul drm mei iTCO_wdt i2c_algo_bit iTCO_vendor_support snd_timer pcc_cpufreq crc32_pclmul ghash_clmulni_intel aesni_intel snd video aes_x86_64 crypto_simd cryptd glue_helper soundcore pcspkr wmi button binfmt_misc ip_tables x_tables autofs4 sr_mod sd_mod cdrom ahci libahci ehci_pci xhci_pci libata xhci_hcd ehci_hcd lpc_ich mfd_core crc32c_intel scsi_mod e1000e i2c_i801 usbcore usb_common fan thermal
[45051.027024] ---[ end trace 9565944010fbdf23 ]---
[45051.027024] RIP: 0010:perf_prepare_sample+0x82/0x4a0
[45051.027025] Code: 06 4c 89 ea 4c 89 e6 e8 3c 54 ff ff 40 f6 c5 01 0f 85 28 01 00 00 40 f6 c5 20 74 1c 48 85 ed 0f 89 04 01 00 00 49 8b 44 24 70 <48> 8b 00 8d 04 c5 08 00 00 00 66 01 43 06 f7 c5 00 04 00 00 74 41
[45051.027025] RSP: 0000:ffffc900206bfb00 EFLAGS: 00010082
[45051.027025] RAX: dead000000000200 RBX: ffffc900206bfb58 RCX: 000000000000001f
[45051.027025] RDX: 0000000000000000 RSI: 0000000025bbf56f RDI: 0000000000000000
[45051.027026] RBP: 8000000000000275 R08: 0000000000000002 R09: 00000000000215c0
[45051.027026] R10: 00008b25b2e2f5c8 R11: 0000000000000000 R12: ffffc900206bfc40
[45051.027026] R13: ffff8880cf6d7800 R14: ffffc900206bfb98 R15: ffff88811ab4f420
[45051.027027] FS: 00007fab66133500(0000) GS:ffff88811ab40000(0000) knlGS:0000000000000000
[45051.027027] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[45051.027027] CR2: 00007fab66133480 CR3: 00000000811aa004 CR4: 00000000001607e0
[45051.027027] DR0: 0000000000000000 DR1: 000000008e8e8000 DR2: 0000000000000000
[45051.027027] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[45051.027028] Kernel panic - not syncing: Fatal exception in interrupt
[45051.027051] Kernel Offset: disabled
[45051.149441] ---[ end Kernel panic - not syncing: Fatal exception in interrupt]---
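The Code: line suggests which registers matter, assuming this reading of the disassembly is right: `40 f6 c5 20` is `test $0x20,%bpl` (PERF_SAMPLE_CALLCHAIN), `48 85 ed` followed by `0f 89` is `test %rbp,%rbp; jns` (bit 63, __PERF_SAMPLE_CALLCHAIN_EARLY), `49 8b 44 24 70` loads a pointer from R12+0x70 into RAX (presumably data->callchain), and the faulting `<48> 8b 00` dereferences it. On that reading, RBP holds sample_type and RAX is the bad callchain pointer. A small decoder for the RBP value, sketched in plain C (the helper names are hypothetical; the bit names follow the v4.20 uapi header):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PERF_SAMPLE_CALLCHAIN         (1ULL << 5)
#define __PERF_SAMPLE_CALLCHAIN_EARLY (1ULL << 63)

/* PERF_SAMPLE_* names for bits 0..19, in uapi order as of v4.20. */
static const char *const sample_bit_names[] = {
	"IP", "TID", "TIME", "ADDR", "READ", "CALLCHAIN", "ID", "CPU",
	"PERIOD", "STREAM_ID", "RAW", "BRANCH_STACK", "REGS_USER",
	"STACK_USER", "WEIGHT", "DATA_SRC", "IDENTIFIER", "TRANSACTION",
	"REGS_INTR", "PHYS_ADDR",
};

/* True when perf_prepare_sample() would skip perf_callchain() and
 * trust the data->callchain pointer it was handed. */
static bool uses_early_callchain(uint64_t sample_type)
{
	return (sample_type & PERF_SAMPLE_CALLCHAIN) &&
	       (sample_type & __PERF_SAMPLE_CALLCHAIN_EARLY);
}

/* Writes "NAME|NAME|..." for every set bit into buf (len must be > 0). */
static void decode_sample_type(uint64_t st, char *buf, size_t len)
{
	const size_t nnames = sizeof(sample_bit_names) / sizeof(sample_bit_names[0]);

	buf[0] = '\0';
	for (unsigned i = 0; i < 64; i++) {
		const char *name;

		if (!(st & (1ULL << i)))
			continue;
		if (i < nnames)
			name = sample_bit_names[i];
		else if (i == 63)
			name = "CALLCHAIN_EARLY";
		else
			name = "?";
		if (buf[0] != '\0')
			strncat(buf, "|", len - strlen(buf) - 1);
		strncat(buf, name, len - strlen(buf) - 1);
	}
}
```

Decoding RBP from this oops (0x8000000000000275) gives IP|TIME|READ|CALLCHAIN|ID|STREAM_ID|CALLCHAIN_EARLY. The RBP in the later oops in this thread (0x80000000000bb068) also has both CALLCHAIN and bit 63 set, so both crashes appear to have taken the early-callchain path.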


2018-12-05 12:46:40

by Jiri Olsa

Subject: Re: perf: perf_fuzzer triggers GPF in perf_prepare_sample

On Tue, Dec 04, 2018 at 10:54:55AM -0500, Vince Weaver wrote:
> Hello,
>
> I was able to trigger another oops with the perf_fuzzer with current git.
>
> This is 4.20-rc5 after the fix for the very similar oops I previously
> reported got committed.
>
> It seems to be pointing to the same location in the source as
> before, I guess maybe triggered a different way?

nice.. yep, looks the same

>
> Unfortunately this crash is not easily reproducible like the last one was.

will check

jirka


2018-12-05 16:41:16

by Jiri Olsa

Subject: Re: perf: perf_fuzzer triggers GPF in perf_prepare_sample

On Wed, Dec 05, 2018 at 01:45:38PM +0100, Jiri Olsa wrote:
> On Tue, Dec 04, 2018 at 10:54:55AM -0500, Vince Weaver wrote:
> > Hello,
> >
> > I was able to trigger another oops with the perf_fuzzer with current git.
> >
> > This is 4.20-rc5 after the fix for the very similar oops I previously
> > reported got committed.
> >
> > It seems to be pointing to the same location in the source as
> > before, I guess maybe triggered a different way?
>
> nice.. yep, looks the same
>
> >
> > Unfortunately this crash is not easily reproducible like the last one was.
>
> will check

What model are you hitting this on?

jirka

2018-12-05 17:12:46

by Vince Weaver

Subject: Re: perf: perf_fuzzer triggers GPF in perf_prepare_sample

On Wed, 5 Dec 2018, Jiri Olsa wrote:

> On Wed, Dec 05, 2018 at 01:45:38PM +0100, Jiri Olsa wrote:
> > On Tue, Dec 04, 2018 at 10:54:55AM -0500, Vince Weaver wrote:
> > > Hello,
> > >
> > > I was able to trigger another oops with the perf_fuzzer with current git.
> > >
> > > This is 4.20-rc5 after the fix for the very similar oops I previously
> > > reported got committed.
> > >
> > > It seems to be pointing to the same location in the source as
> > > before, I guess maybe triggered a different way?
> >
> > nice.. yep, looks the same
> >
> > >
> > > Unfortunately this crash is not easily reproducible like the last one was.
> >
> > will check
>
> what model are hitting this on?

Haswell, 6/60/3 (family/model/stepping).
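The "6/60/3" triple is the CPU family/model/stepping, as shown in the "cpu family", "model" and "stepping" fields of /proc/cpuinfo; model 60 is 0x3C. Those numbers come from CPUID leaf 1, EAX, with the extended-model field folded in for family 6. A minimal decoder, assuming the standard Intel encoding; the raw EAX of this machine isn't in the thread, so 0x306C3 is used only because it encodes 6/60/3:

```c
#include <stdint.h>

struct cpu_sig {
	unsigned family;
	unsigned model;
	unsigned stepping;
};

/* Decode CPUID leaf 1 EAX into the family/model/stepping triple
 * reported in /proc/cpuinfo. */
static struct cpu_sig decode_cpuid_eax(uint32_t eax)
{
	unsigned base_family = (eax >> 8) & 0xf;
	struct cpu_sig s;

	s.stepping = eax & 0xf;
	s.model = (eax >> 4) & 0xf;
	s.family = base_family;

	/* The extended family field only applies to family 0xf. */
	if (base_family == 0xf)
		s.family += (eax >> 20) & 0xff;
	/* The extended model field applies to families 6 and 0xf. */
	if (base_family == 0x6 || base_family == 0xf)
		s.model += ((eax >> 16) & 0xf) << 4;

	return s;
}
```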

While I can't trigger this deterministically, the fuzzer usually hits it
within an hour or two. Are there any debug or printk messages I could
add that would help figure out what's going on?

Vince



2018-12-05 18:35:37

by Jiri Olsa

Subject: Re: perf: perf_fuzzer triggers GPF in perf_prepare_sample

On Wed, Dec 05, 2018 at 12:11:19PM -0500, Vince Weaver wrote:
> On Wed, 5 Dec 2018, Jiri Olsa wrote:
>
> > On Wed, Dec 05, 2018 at 01:45:38PM +0100, Jiri Olsa wrote:
> > > On Tue, Dec 04, 2018 at 10:54:55AM -0500, Vince Weaver wrote:
> > > > Hello,
> > > >
> > > > I was able to trigger another oops with the perf_fuzzer with current git.
> > > >
> > > > This is 4.20-rc5 after the fix for the very similar oops I previously
> > > > reported got committed.
> > > >
> > > > It seems to be pointing to the same location in the source as
> > > > before, I guess maybe triggered a different way?
> > >
> > > nice.. yep, looks the same
> > >
> > > >
> > > > Unfortunately this crash is not easily reproducible like the last one was.
> > >
> > > will check
> >
> > what model are hitting this on?
>
> Haswell. 6/60/3.
>
> While I can't deterministically trigger this, the fuzzer usually hits it
> within an hour or two. Is there any debug or printk messages I can
> add that would help figure out what's going on?

I can't see how we could end up with that config other than through
some corruption. The only way I can see is that we touch the
cpuc->events array without checking its active_mask bit.

But that does not explain why the crash happened in the same
place as before.

jirka


---
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index ecc3e34ca955..9a2fd5a68d87 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2404,7 +2404,7 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
struct cpu_hw_events *cpuc;
int loops;
u64 status;
- int handled;
+ int handled = 0;
int pmu_enabled;

cpuc = this_cpu_ptr(&cpu_hw_events);
@@ -2423,8 +2423,10 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
intel_bts_disable_local();
cpuc->enabled = 0;
__intel_pmu_disable_all();
- handled = intel_pmu_drain_bts_buffer();
- handled += intel_bts_interrupt();
+ if (test_bit(INTEL_PMC_IDX_FIXED_BTS, cpuc->active_mask)) {
+ handled += intel_pmu_drain_bts_buffer();
+ handled += intel_bts_interrupt();
+ }
status = intel_pmu_get_status();
if (!status)
goto done;
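The patch changes only the NMI path: intel_pmu_handle_irq() now skips the BTS drain unless the BTS slot's active_mask bit is set, which is why handled must start at 0. A userspace model of that guard follows; all names are hypothetical, MODEL_BTS_IDX merely stands in for INTEL_PMC_IDX_FIXED_BTS, and a single uint64_t replaces the kernel's bitmap and test_bit().

```c
#include <stdint.h>

/* Hypothetical index standing in for INTEL_PMC_IDX_FIXED_BTS. */
#define MODEL_BTS_IDX 48
#define MODEL_MAX_IDX 64

struct model_event { int pending; };

struct model_cpu_hw_events {
	uint64_t active_mask;                    /* bit i set => events[i] is live */
	struct model_event *events[MODEL_MAX_IDX];
};

/* Stand-in for the BTS drain: consumes pending records. */
static int model_drain_bts(struct model_event *ev)
{
	int n = ev->pending;

	ev->pending = 0;
	return n;
}

/* Shape of the patched handler: handled starts at 0 because the drain
 * is now conditional, and events[] is only touched when the matching
 * active_mask bit is set. */
static int model_handle_irq(struct model_cpu_hw_events *cpuc)
{
	int handled = 0;

	if (cpuc->active_mask & (UINT64_C(1) << MODEL_BTS_IDX))
		handled += model_drain_bts(cpuc->events[MODEL_BTS_IDX]);

	return handled;
}
```

Note that the guard covers only this entry point. In the later trace in this thread, intel_pmu_drain_bts_buffer() appears to be reached from x86_pmu_stop() during a context switch, a path this check does not touch.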

2018-12-06 15:37:47

by Vince Weaver

Subject: Re: perf: perf_fuzzer triggers GPF in perf_prepare_sample

On Wed, 5 Dec 2018, Jiri Olsa wrote:

> On Wed, Dec 05, 2018 at 12:11:19PM -0500, Vince Weaver wrote:
> > On Wed, 5 Dec 2018, Jiri Olsa wrote:
> >
> > > On Wed, Dec 05, 2018 at 01:45:38PM +0100, Jiri Olsa wrote:
> > > > On Tue, Dec 04, 2018 at 10:54:55AM -0500, Vince Weaver wrote:
> > > > > Hello,
> > > > >
> > > > > I was able to trigger another oops with the perf_fuzzer with current git.
> > > > >
> > > > > This is 4.20-rc5 after the fix for the very similar oops I previously
> > > > > reported got committed.
> > > > >
> > > > > It seems to be pointing to the same location in the source as
> > > > > before, I guess maybe triggered a different way?
> > > >
> > > > nice.. yep, looks the same
> > > >
> > > > >
> > > > > Unfortunately this crash is not easily reproducible like the last one was.
> > > >
> > > > will check
> > >
> > > what model are hitting this on?
> >
> > Haswell. 6/60/3.
> >
> > While I can't deterministically trigger this, the fuzzer usually hits it
> > within an hour or two. Is there any debug or printk messages I can
> > add that would help figure out what's going on?
>
> I can't see how we could end up with that config other than
> some corruption.. the only way I see could be that we touch
> cpu->events array without checking its active_mask bit
>
> but that does not explain why the crash happened in the same
> place as before

Maybe it is a corruption issue. I had applied my own debug patch that
would dump some info if data->callchain was NULL.

But my debug code didn't trigger this time because it looks like
data->callchain was "1" rather than "0".
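A NULL-only check misses a value of 1, and it would also miss RAX from the first report (0xdead000000000200, which happens to match the x86-64 LIST_POISON2 value). The values also explain the two crash flavours: 0 and 1 sit in the normally unmapped first page and give a page fault ("unable to handle kernel NULL pointer dereference"), while 0xdead000000000200 is non-canonical and raises a general protection fault instead. A broader debug check could classify the pointer before it is dereferenced. The helper below is hypothetical, written as plain userspace C, and assumes x86-64 with 4-level paging (48-bit virtual addresses):

```c
#include <stdint.h>

enum ptr_class {
	PTR_NULLISH,       /* 0, 1, small offsets */
	PTR_NONCANONICAL,  /* would raise #GP, e.g. poison values */
	PTR_USER_HALF,     /* canonical, but not a kernel address */
	PTR_KERNEL_HALF,   /* plausible kernel pointer */
};

/* Hypothetical classifier for a debug printk; assumes 48-bit
 * virtual addresses, so bits 63..47 must all be equal. */
static enum ptr_class classify_ptr(uint64_t p)
{
	uint64_t top = p >> 47;

	if (p < 4096)
		return PTR_NULLISH;
	if (top != 0 && top != 0x1ffff)
		return PTR_NONCANONICAL;
	if (top == 0)
		return PTR_USER_HALF;
	return PTR_KERNEL_HALF;
}
```

In a debug patch, anything other than PTR_KERNEL_HALF for data->callchain would be worth dumping, which would have caught 0, 1 and the poison value seen in this thread.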

[27764.840179] BUG: unable to handle kernel NULL pointer dereference at 0000000000000001
[27764.840179] PGD 0 P4D 0
[27764.840180] Oops: 0000 [#1] SMP PTI
[27764.840180] CPU: 1 PID: 18687 Comm: perf_fuzzer Tainted: G W 4.20.0-rc5+ #125
[27764.840180] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014

Vince

2018-12-06 15:46:17

by Jiri Olsa

Subject: Re: perf: perf_fuzzer triggers GPF in perf_prepare_sample

On Thu, Dec 06, 2018 at 10:35:28AM -0500, Vince Weaver wrote:
> On Wed, 5 Dec 2018, Jiri Olsa wrote:
>
> > On Wed, Dec 05, 2018 at 12:11:19PM -0500, Vince Weaver wrote:
> > > On Wed, 5 Dec 2018, Jiri Olsa wrote:
> > >
> > > > On Wed, Dec 05, 2018 at 01:45:38PM +0100, Jiri Olsa wrote:
> > > > > On Tue, Dec 04, 2018 at 10:54:55AM -0500, Vince Weaver wrote:
> > > > > > Hello,
> > > > > >
> > > > > > I was able to trigger another oops with the perf_fuzzer with current git.
> > > > > >
> > > > > > This is 4.20-rc5 after the fix for the very similar oops I previously
> > > > > > reported got committed.
> > > > > >
> > > > > > It seems to be pointing to the same location in the source as
> > > > > > before, I guess maybe triggered a different way?
> > > > >
> > > > > nice.. yep, looks the same
> > > > >
> > > > > >
> > > > > > Unfortunately this crash is not easily reproducible like the last one was.
> > > > >
> > > > > will check
> > > >
> > > > what model are hitting this on?
> > >
> > > Haswell. 6/60/3.
> > >
> > > While I can't deterministically trigger this, the fuzzer usually hits it
> > > within an hour or two. Is there any debug or printk messages I can
> > > add that would help figure out what's going on?
> >
> > I can't see how we could end up with that config other than
> > some corruption.. the only way I see could be that we touch
> > cpu->events array without checking its active_mask bit
> >
> > but that does not explain why the crash happened in the same
> > place as before
>
> Maybe it is a corruption issue. I had applied my own debug patch that
> would dump some info if data->callchain was NULL.
>
> But my debug code didn't trigger this time because it looks like
> data->callchain was "1" rather than "0".
>
> [27764.840179] BUG: unable to handle kernel NULL pointer dereference at 0000000000000001
> [27764.840179] PGD 0 P4D 0
> [27764.840180] Oops: 0000 [#1] SMP PTI
> [27764.840180] CPU: 1 PID: 18687 Comm: perf_fuzzer Tainted: G W 4.20.0-rc5+ #125
> [27764.840180] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014

Actually, could you try the patch from my previous email?

thanks,
jirka

2018-12-09 02:09:34

by Vince Weaver

Subject: Re: perf: perf_fuzzer triggers GPF in perf_prepare_sample

On Thu, 6 Dec 2018, Jiri Olsa wrote:

> On Thu, Dec 06, 2018 at 10:35:28AM -0500, Vince Weaver wrote:
> > On Wed, 5 Dec 2018, Jiri Olsa wrote:
> > Maybe it is a corruption issue. I had applied my own debug patch that
> > would dump some info if data->callchain was NULL.
> >
> > But my debug code didn't trigger this time because it looks like
> > data->callchain was "1" rather than "0".
> >
> > [27764.840179] BUG: unable to handle kernel NULL pointer dereference at 0000000000000001
> > [27764.840179] PGD 0 P4D 0
> > [27764.840180] Oops: 0000 [#1] SMP PTI
> > [27764.840180] CPU: 1 PID: 18687 Comm: perf_fuzzer Tainted: G W 4.20.0-rc5+ #125
> > [27764.840180] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
>
> actually, you could try that patch from my previous email?
>
It still crashes with your patch (see below).

I've also been able to replicate this crash on a Skylake machine in
addition to the Haswell machine.

Vince

[28269.147232] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[28269.155628] PGD 0 P4D 0
[28269.158360] Oops: 0000 [#1] SMP PTI
[28269.162087] CPU: 0 PID: 1189 Comm: perf_fuzzer Tainted: G W 4.20.0-rc5+ #128
[28269.171011] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
[28269.178935] RIP: 0010:perf_prepare_sample+0x82/0x4a0
[28269.184239] Code: 06 4c 89 ea 4c 89 e6 e8 3c 54 ff ff 40 f6 c5 01 0f 85 28 01 00 00 40 f6 c5 20 74 1c 48 85 ed 0f 89 04 01 00 00 49 8b 44 24 70 <48> 8b 00 8d 04 c5 08 00 00 00 66 01 43 06 f7 c5 00 04 00 00 74 41
[28269.204249] RSP: 0000:ffffc9000aca7a40 EFLAGS: 00010082
[28269.209832] RAX: 0000000000000000 RBX: ffffc9000aca7a98 RCX: ffffc9000aca7ad8
[28269.217484] RDX: 0000000000000000 RSI: ffffc9000aca7b80 RDI: ffffc9000aca7a9e
[28269.225129] RBP: 80000000000bb068 R08: 0000000000000002 R09: 00000000000215c0
[28269.232760] R10: ffff8880ce552000 R11: 0000000000000000 R12: ffffc9000aca7b80
[28269.240380] R13: ffff88803696c800 R14: ffffc9000aca7ad8 R15: ffffe8ffffc06300
[28269.248014] FS: 00007f5927fe7500(0000) GS:ffff88811aa00000(0000) knlGS:0000000000000000
[28269.256606] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[28269.262739] CR2: 0000000000000000 CR3: 0000000116d98001 CR4: 00000000001607f0
[28269.270349] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[28269.277968] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[28269.285639] Call Trace:
[28269.288266] intel_pmu_drain_bts_buffer+0x151/0x220
[28269.293476] ? radix_tree_delete_item+0x69/0xc0
[28269.298378] x86_pmu_stop+0x3b/0x90
[28269.302113] x86_pmu_del+0x57/0x160
[28269.305840] event_sched_out.isra.106+0x81/0x170
[28269.310780] group_sched_out.part.108+0x51/0xc0
[28269.315634] ctx_sched_out+0xf8/0x220
[28269.319551] __perf_event_task_sched_out+0x18d/0x3f0
[28269.324866] ? pick_next_task_fair+0x60a/0x660
[28269.329639] __schedule+0x4b9/0x820
[28269.333367] ? kill_pid_info+0x34/0x50
[28269.337360] schedule+0x28/0x80
[28269.340725] exit_to_usermode_loop+0x4e/0xc0
[28269.345272] prepare_exit_to_usermode+0x53/0x80
[28269.350109] retint_user+0x8/0x8
[28269.353541] RIP: 0033:0x56154980b6c3
[28269.357346] Code: 01 d0 48 c1 e0 06 48 89 c2 48 8d 05 cf 93 23 00 48 8b 04 02 48 85 c0 74 11 8b 45 f8 3b 45 f4 75 05 8b 45 fc eb 16 83 45 f8 01 <83> 45 fc 01 81 7d fc 9f 86 01 00 7e 96 b8 ff ff ff ff c9 c3 55 48
[28269.377462] RSP: 002b:00007ffc6a1540a0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[28269.385562] RAX: 0000000000000000 RBX: 000000000000000c RCX: 000000000000003c
[28269.393182] RDX: 0000000000b895c0 RSI: 00007ffc6a154074 RDI: 00007f5927fe0740
[28269.400835] RBP: 00007ffc6a1540b0 R08: 00007f5927fe01f0 R09: 00007f5927fe0240
[28269.408452] R10: 0000000000000000 R11: 0000000000000246 R12: 000056154980b4c0
[28269.416080] R13: 00007ffc6a156510 R14: 0000000000000000 R15: 0000000000000000
[28269.423723] Modules linked in: snd_hda_codec_hdmi intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm i915 irqbypass crct10dif_pclmul crc32_pclmul iosf_mbi ghash_clmulni_intel drm_kms_helper aesni_intel snd_hda_codec_realtek aes_x86_64 crypto_simd drm cryptd snd_hda_codec_generic i2c_algo_bit snd_hda_intel evdev glue_helper snd_hda_codec snd_hda_core iTCO_wdt mei_me mei wmi_bmof tpm_tis snd_hwdep tpm_tis_core pcc_cpufreq pcspkr iTCO_vendor_support snd_pcm tpm sg rng_core button snd_timer video snd soundcore wmi binfmt_misc ip_tables x_tables autofs4 sr_mod sd_mod cdrom ahci xhci_pci ehci_pci libahci xhci_hcd ehci_hcd libata usbcore lpc_ich mfd_core e1000e scsi_mod i2c_i801 crc32c_intel usb_common fan thermal
[28269.492702] CR2: 0000000000000000
[28269.496246] ---[ end trace 6775846bfda0f18b ]---
[28269.501186] RIP: 0010:perf_prepare_sample+0x82/0x4a0
[28269.506482] Code: 06 4c 89 ea 4c 89 e6 e8 3c 54 ff ff 40 f6 c5 01 0f 85 28 01 00 00 40 f6 c5 20 74 1c 48 85 ed 0f 89 04 01 00 00 49 8b 44 24 70 <48> 8b 00 8d 04 c5 08 00 00 00 66 01 43 06 f7 c5 00 04 00 00 74 41
[28269.526587] RSP: 0000:ffffc9000aca7a40 EFLAGS: 00010082
[28269.532176] RAX: 0000000000000000 RBX: ffffc9000aca7a98 RCX: ffffc9000aca7ad8
[28269.539805] RDX: 0000000000000000 RSI: ffffc9000aca7b80 RDI: ffffc9000aca7a9e
[28269.547450] RBP: 80000000000bb068 R08: 0000000000000002 R09: 00000000000215c0
[28269.555075] R10: ffff8880ce552000 R11: 0000000000000000 R12: ffffc9000aca7b80
[28269.562694] R13: ffff88803696c800 R14: ffffc9000aca7ad8 R15: ffffe8ffffc06300
[28269.570329] FS: 00007f5927fe7500(0000) GS:ffff88811aa00000(0000) knlGS:0000000000000000
[28269.578960] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[28269.585123] CR2: 0000000000000000 CR3: 0000000116d98001 CR4: 00000000001607f0
[28269.592740] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[28269.600358] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600

2018-12-09 11:56:17

by Jiri Olsa

Subject: Re: perf: perf_fuzzer triggers GPF in perf_prepare_sample

On Sat, Dec 08, 2018 at 09:08:28PM -0500, Vince Weaver wrote:
> On Thu, 6 Dec 2018, Jiri Olsa wrote:
>
> > On Thu, Dec 06, 2018 at 10:35:28AM -0500, Vince Weaver wrote:
> > > On Wed, 5 Dec 2018, Jiri Olsa wrote:
> > > Maybe it is a corruption issue. I had applied my own debug patch that
> > > would dump some info if data->callchain was NULL.
> > >
> > > But my debug code didn't trigger this time because it looks like
> > > data->callchain was "1" rather than "0".
> > >
> > > [27764.840179] BUG: unable to handle kernel NULL pointer dereference at 0000000000000001
> > > [27764.840179] PGD 0 P4D 0
> > > [27764.840180] Oops: 0000 [#1] SMP PTI
> > > [27764.840180] CPU: 1 PID: 18687 Comm: perf_fuzzer Tainted: G W 4.20.0-rc5+ #125
> > > [27764.840180] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
> >
> > actually, you could try that patch from my previous email?
> >
> still crashes with your patch (see below)
>
> I've also been able to replicate this crash on a skylake machine in
> addition to the haswell machine.
>
> Vince
>
> [28269.147232] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> [28269.155628] PGD 0 P4D 0
> [28269.158360] Oops: 0000 [#1] SMP PTI
> [28269.162087] CPU: 0 PID: 1189 Comm: perf_fuzzer Tainted: G W 4.20.0-rc5+ #128
> [28269.171011] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
> [28269.178935] RIP: 0010:perf_prepare_sample+0x82/0x4a0
> [28269.184239] Code: 06 4c 89 ea 4c 89 e6 e8 3c 54 ff ff 40 f6 c5 01 0f 85 28 01 00 00 40 f6 c5 20 74 1c 48 85 ed 0f 89 04 01 00 00 49 8b 44 24 70 <48> 8b 00 8d 04 c5 08 00 00 00 66 01 43 06 f7 c5 00 04 00 00 74 41
> [28269.204249] RSP: 0000:ffffc9000aca7a40 EFLAGS: 00010082
> [28269.209832] RAX: 0000000000000000 RBX: ffffc9000aca7a98 RCX: ffffc9000aca7ad8
> [28269.217484] RDX: 0000000000000000 RSI: ffffc9000aca7b80 RDI: ffffc9000aca7a9e
> [28269.225129] RBP: 80000000000bb068 R08: 0000000000000002 R09: 00000000000215c0
> [28269.232760] R10: ffff8880ce552000 R11: 0000000000000000 R12: ffffc9000aca7b80
> [28269.240380] R13: ffff88803696c800 R14: ffffc9000aca7ad8 R15: ffffe8ffffc06300
> [28269.248014] FS: 00007f5927fe7500(0000) GS:ffff88811aa00000(0000) knlGS:0000000000000000
> [28269.256606] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [28269.262739] CR2: 0000000000000000 CR3: 0000000116d98001 CR4: 00000000001607f0
> [28269.270349] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [28269.277968] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
> [28269.285639] Call Trace:
> [28269.288266] intel_pmu_drain_bts_buffer+0x151/0x220
> [28269.293476] ? radix_tree_delete_item+0x69/0xc0
> [28269.298378] x86_pmu_stop+0x3b/0x90
> [28269.302113] x86_pmu_del+0x57/0x160

Nice, at least it's in a different call stack context; that might help.

thanks,
jirka