LinuxLists.cc - Linux 5.8-rc1 BUG unable to handle page fault (snd

2020-06-15 18:50:51

Subject: Linux 5.8-rc1 BUG unable to handle page fault (snd_pcm)

I am seeing the following problem on my system. I haven't started debug
yet. Is this a known issue?

[ 9.791309] BUG: unable to handle page fault for address: ffffb1e78165d000
[ 9.791328] #PF: supervisor write access in kernel mode
[ 9.791330] #PF: error_code(0x000b) - reserved bit violation
[ 9.791332] PGD 23dd5c067 P4D 23dd5c067 PUD 23dd5d067 PMD 22ba8e067 PTE 80001a3681509163
[ 9.791337] Oops: 000b [#1] SMP NOPTI
[ 9.791340] CPU: 7 PID: 866 Comm: pulseaudio Not tainted 5.8.0-rc1 #1
[ 9.791341] Hardware name: LENOVO 10VGCTO1WW/3130, BIOS M1XKT45A 08/21/2019
[ 9.791346] RIP: 0010:__memset+0x24/0x30
[ 9.791348] Code: cc cc cc cc cc cc 0f 1f 44 00 00 49 89 f9 48 89 d1 83 e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 <f3> 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 f3
[ 9.791350] RSP: 0018:ffffb1e7817a7dd0 EFLAGS: 00010216
[ 9.791352] RAX: 0000000000000000 RBX: ffff97b32dfd7400 RCX: 00000000000008a0
[ 9.791354] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffb1e78165d000
[ 9.791356] RBP: ffffb1e7817a7e00 R08: ffffb1e780000000 R09: ffffb1e78165d000
[ 9.791358] R10: ffffffffffffffff R11: ffffb1e78165d000 R12: 0000000000000000
[ 9.791359] R13: ffff97b32dfd3000 R14: ffffffffc0b48880 R15: ffff97b33aa42600
[ 9.791361] FS: 00007fa11cb34ec0(0000) GS:ffff97b33edc0000(0000) knlGS:0000000000000000
[ 9.791363] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9.791365] CR2: ffffb1e78165d000 CR3: 0000000210db6000 CR4: 00000000003406e0
[ 9.791367] Call Trace:
[ 9.791377] ? snd_pcm_hw_params+0x3ca/0x440 [snd_pcm]
[ 9.791383] snd_pcm_common_ioctl+0x173/0xf20 [snd_pcm]
[ 9.791389] ? snd_ctl_ioctl+0x1c5/0x710 [snd]
[ 9.791394] snd_pcm_ioctl+0x27/0x40 [snd_pcm]
[ 9.791398] ksys_ioctl+0x9d/0xd0
[ 9.791400] __x64_sys_ioctl+0x1a/0x20
[ 9.791404] do_syscall_64+0x49/0xc0
[ 9.791406] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 9.791408] RIP: 0033:0x7fa11d4c137b
[ 9.791410] Code: Bad RIP value.
[ 9.791412] RSP: 002b:00007ffe2fb4e308 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 9.791414] RAX: ffffffffffffffda RBX: 00007ffe2fb4e510 RCX: 00007fa11d4c137b
[ 9.791415] RDX: 00007ffe2fb4e510 RSI: 00000000c2604111 RDI: 0000000000000012
[ 9.791417] RBP: 000055e99f65a890 R08: 0000000000000000 R09: 0000000000000000
[ 9.791418] R10: 0000000000000004 R11: 0000000000000246 R12: 000055e99f65a810
[ 9.791420] R13: 00007ffe2fb4e344 R14: 0000000000000000 R15: 00007ffe2fb4e510
[ 9.791422] Modules linked in: cmac algif_hash algif_skcipher af_alg
bnep binfmt_misc nls_iso8859_1 snd_hda_codec_realtek
snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel
snd_usb_audio snd_intel_dspcfg snd_usbmidi_lib snd_hda_codec amdgpu mc
snd_hda_core snd_hwdep snd_pcm edac_mce_amd iommu_v2 ath10k_pci
snd_seq_midi gpu_sched snd_seq_midi_event kvm_amd kvm ttm snd_rawmidi
ath10k_core irqbypass drm_kms_helper snd_seq cec i2c_algo_bit
fb_sys_fops ath snd_seq_device syscopyarea snd_timer mac80211
crct10dif_pclmul ghash_clmulni_intel btusb aesni_intel btrtl btbcm
crypto_simd cryptd btintel serio_raw input_leds sysfillrect glue_helper
bluetooth efi_pstore k10temp snd pl2303 wmi_bmof ecdh_generic ecc
snd_pci_acp3x sysimgblt cfg80211 soundcore ccp libarc4 ipmi_devintf
ipmi_msghandler mac_hid sch_fq_codel parport_pc ppdev lp parport drm
ip_tables x_tables autofs4 hid_generic usbhid hid crc32_pclmul nvme ahci
psmouse i2c_piix4 libahci nvme_core r8169 realtek wmi video
[ 9.791463] CR2: ffffb1e78165d000
[ 9.791465] ---[ end trace 7b22a028ccaf2e75 ]---

thanks,
-- Shuah

2020-06-15 19:50:22

by David Rientjes

Subject: Re: Linux 5.8-rc1 BUG unable to handle page fault (snd_pcm)

On Mon, 15 Jun 2020, Shuah Khan wrote:

> I am seeing the following problem on my system. I haven't started debug
> yet. Is this a known issue?
>
> [ 9.791309] BUG: unable to handle page fault for address: ffffb1e78165d000
> [ 9.791328] #PF: supervisor write access in kernel mode
> [ 9.791330] #PF: error_code(0x000b) - reserved bit violation
> [ 9.791332] PGD 23dd5c067 P4D 23dd5c067 PUD 23dd5d067 PMD 22ba8e067 PTE
> 80001a3681509163
> [ 9.791337] Oops: 000b [#1] SMP NOPTI
> [ 9.791340] CPU: 7 PID: 866 Comm: pulseaudio Not tainted 5.8.0-rc1 #1
> [ 9.791341] Hardware name: LENOVO 10VGCTO1WW/3130, BIOS M1XKT45A 08/21/2019
> [ 9.791346] RIP: 0010:__memset+0x24/0x30
> [ 9.791348] Code: cc cc cc cc cc cc 0f 1f 44 00 00 49 89 f9 48 89 d1 83 e2
> 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 <f3> 48
> ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 f3
> [ 9.791350] RSP: 0018:ffffb1e7817a7dd0 EFLAGS: 00010216
> [ 9.791352] RAX: 0000000000000000 RBX: ffff97b32dfd7400 RCX:
> 00000000000008a0
> [ 9.791354] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> ffffb1e78165d000
> [ 9.791356] RBP: ffffb1e7817a7e00 R08: ffffb1e780000000 R09:
> ffffb1e78165d000
> [ 9.791358] R10: ffffffffffffffff R11: ffffb1e78165d000 R12:
> 0000000000000000
> [ 9.791359] R13: ffff97b32dfd3000 R14: ffffffffc0b48880 R15:
> ffff97b33aa42600
> [ 9.791361] FS: 00007fa11cb34ec0(0000) GS:ffff97b33edc0000(0000)
> knlGS:0000000000000000
> [ 9.791363] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 9.791365] CR2: ffffb1e78165d000 CR3: 0000000210db6000 CR4:
> 00000000003406e0
> [ 9.791367] Call Trace:
> [ 9.791377] ? snd_pcm_hw_params+0x3ca/0x440 [snd_pcm]
> [ 9.791383] snd_pcm_common_ioctl+0x173/0xf20 [snd_pcm]
> [ 9.791389] ? snd_ctl_ioctl+0x1c5/0x710 [snd]
> [ 9.791394] snd_pcm_ioctl+0x27/0x40 [snd_pcm]
> [ 9.791398] ksys_ioctl+0x9d/0xd0
> [ 9.791400] __x64_sys_ioctl+0x1a/0x20
> [ 9.791404] do_syscall_64+0x49/0xc0
> [ 9.791406] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 9.791408] RIP: 0033:0x7fa11d4c137b
> [ 9.791410] Code: Bad RIP value.

Hi Shuah, do you have CONFIG_AMD_MEM_ENCRYPT enabled?

If so, could you try
http://git.infradead.org/users/hch/dma-mapping.git/commitdiff/dbed452a078d56bc7f1abecc3edd6a75e8e4484e

2020-06-15 19:51:44

by Linus Torvalds

Subject: Re: Linux 5.8-rc1 BUG unable to handle page fault (snd_pcm)

On Mon, Jun 15, 2020 at 11:48 AM Shuah Khan <[email protected]> wrote:
>
> I am seeing the following problem on my system. I haven't started debug
> yet. Is this a known issue?
>
> [ 9.791309] BUG: unable to handle page fault for address:
> ffffb1e78165d000
> [ 9.791328] #PF: supervisor write access in kernel mode
> [ 9.791330] #PF: error_code(0x000b) - reserved bit violation

Hmm. "reserved bit violation" sounds like the page tables themselves
are corrupt.
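
For readers less familiar with the encoding, the error code in the oops can be unpacked bit by bit. The following is a minimal Python sketch based on the architectural x86 page-fault error code layout; the helper name `decode_pf_error` is made up for illustration and is not a kernel function.

```python
# Low bits of the x86 #PF error code: (mask, meaning if set, meaning if clear).
PF_BITS = [
    (0x01, "protection violation (page was present)", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "supervisor mode"),
    (0x08, "reserved bit set in a paging entry", None),
    (0x10, "instruction fetch", None),
    (0x20, "protection-key violation", None),
]


def decode_pf_error(code):
    """Return a list of human-readable descriptions for a #PF error code."""
    descriptions = []
    for mask, if_set, if_clear in PF_BITS:
        if code & mask:
            descriptions.append(if_set)
        elif if_clear is not None:
            descriptions.append(if_clear)
    return descriptions


if __name__ == "__main__":
    for line in decode_pf_error(0x000B):
        print(line)
```

For 0x000b this gives a present page, a write, supervisor mode, and the reserved-bit flag, which matches the two "#PF:" lines in the report.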

> [ 9.791332] PGD 23dd5c067 P4D 23dd5c067 PUD 23dd5d067 PMD 22ba8e067
> PTE 80001a3681509163

PTE low 12 bits 163 is "global", "dirty+accessed" + "kernel
read-write", so that part looks fine. The top bit is NX. I'm not
seeing any reserved bits set.

The page directory bits look sane too (067 is just the normal state
for page tables).

The PTE does have bit 44 set. I think that's what triggers the
problem. This is presumably on a machine with 44 physical address
bits?
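
The bit reading above can be reproduced mechanically. This is a small Python sketch, assuming the standard x86-64 4 KiB PTE layout (flags in the low 12 bits, frame address in bits 12-51, NX in bit 63); the function names are illustrative only, and the sketch says nothing about how many physical address bits this particular CPU implements.

```python
PTE = 0x80001A3681509163  # PTE value printed in the oops

# Flag bits of a 4 KiB x86-64 page table entry.
FLAGS = {
    0: "present",
    1: "read-write",
    2: "user",
    3: "write-through",
    4: "cache-disable",
    5: "accessed",
    6: "dirty",
    7: "PAT",
    8: "global",
    63: "NX",
}

FRAME_MASK = 0x000FFFFFFFFFF000  # bits 12..51 hold the physical frame address


def decode_pte(pte):
    """Return (list of set flag names, physical frame address)."""
    flags = [name for bit, name in sorted(FLAGS.items()) if (pte >> bit) & 1]
    return flags, pte & FRAME_MASK


def highest_address_bit(pte):
    """Index of the highest bit set in the frame address field."""
    return (pte & FRAME_MASK).bit_length() - 1


if __name__ == "__main__":
    flags, frame = decode_pte(PTE)
    print(flags)
    print(hex(frame), "highest address bit:", highest_address_bit(PTE))
```

The highest set address bit comes out as 44, so whether the entry is legal depends on how many physical address bits the CPU treats as valid.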

The faulting code is all in memset, and it's just doing "rep stosq" to
fill memory with zeroes, and we have

RAX: 0000000000000000 (the zero pattern)
RCX: 00000000000008a0 (repeat count)
RDI: ffffb1e78165d000 (the target address)

and that target address looks odd. If I read it right, it's at the
41TB mark in the direct-mapped area.

But I am probably mis-reading this.

Better bring in a few more x86 people. We did have some page table
work this time around, with both the entry code changes but also the
vmalloc faulting removal.

It doesn't _look_ like it's in the vmalloc range, though. But with
that RCX value, it's certainly doing more than a single page.
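
As a quick check of that last point: "rep stosq" stores 8 bytes per iteration, so the remaining length follows directly from RCX. The Python sketch below assumes 4 KiB pages; `stosq_span` is a made-up helper name.

```python
PAGE_SIZE = 4096  # assuming 4 KiB pages


def stosq_span(rdi, rcx):
    """Return (bytes still to be written, number of pages they touch)."""
    nbytes = rcx * 8  # rep stosq writes one 8-byte quadword per count
    first_page = rdi // PAGE_SIZE
    last_page = (rdi + nbytes - 1) // PAGE_SIZE
    return nbytes, last_page - first_page + 1


if __name__ == "__main__":
    nbytes, pages = stosq_span(0xFFFFB1E78165D000, 0x8A0)
    print(nbytes, "bytes across", pages, "pages")
```

RDI also equals R09 and CR2 in the register dump, which suggests the fault hit on the very first store of the buffer.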

> [ 9.791367] Call Trace:
> [ 9.791377] ? snd_pcm_hw_params+0x3ca/0x440 [snd_pcm]
> [ 9.791383] snd_pcm_common_ioctl+0x173/0xf20 [snd_pcm]
> [ 9.791389] ? snd_ctl_ioctl+0x1c5/0x710 [snd]
> [ 9.791394] snd_pcm_ioctl+0x27/0x40 [snd_pcm]
> [ 9.791398] ksys_ioctl+0x9d/0xd0
> [ 9.791400] __x64_sys_ioctl+0x1a/0x20
> [ 9.791404] do_syscall_64+0x49/0xc0
> [ 9.791406] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Can you re-create it with CONFIG_DEBUG_INFO enabled, and run it
through scripts/decode_stacktrace.sh to give more details on where it
happens.

Linus

2020-06-15 20:00:40

by Takashi Iwai

Subject: Re: Linux 5.8-rc1 BUG unable to handle page fault (snd_pcm)

On Mon, 15 Jun 2020 20:48:11 +0200,
Shuah Khan wrote:
>
> I am seeing the following problem on my system. I haven't started debug
> yet. Is this a known issue?

Yes, the recent fix by David should paper over it:
http://lore.kernel.org/r/alpine.DEB.2.22.394.2006110025250.13899@chino.kir.corp.google.com
IIRC, Christoph already merged it in his tree.

I also made another fix series in the sound driver side (found in
topic/dma-fix2 branch in my sound.git tree).
But I guess I'll queue it for 5.9 after more testing.

thanks,

Takashi

> [ 9.791309] BUG: unable to handle page fault for address:
> ffffb1e78165d000
> [ 9.791328] #PF: supervisor write access in kernel mode
> [ 9.791330] #PF: error_code(0x000b) - reserved bit violation
> [ 9.791332] PGD 23dd5c067 P4D 23dd5c067 PUD 23dd5d067 PMD 22ba8e067
> PTE 80001a3681509163
> [ 9.791337] Oops: 000b [#1] SMP NOPTI
> [ 9.791340] CPU: 7 PID: 866 Comm: pulseaudio Not tainted 5.8.0-rc1 #1
> [ 9.791341] Hardware name: LENOVO 10VGCTO1WW/3130, BIOS M1XKT45A
> 08/21/2019
> [ 9.791346] RIP: 0010:__memset+0x24/0x30
> [ 9.791348] Code: cc cc cc cc cc cc 0f 1f 44 00 00 49 89 f9 48 89
> d1 83 e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48
> 0f af c6 <f3> 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89
> d1 f3
> [ 9.791350] RSP: 0018:ffffb1e7817a7dd0 EFLAGS: 00010216
> [ 9.791352] RAX: 0000000000000000 RBX: ffff97b32dfd7400 RCX:
> 00000000000008a0
> [ 9.791354] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> ffffb1e78165d000
> [ 9.791356] RBP: ffffb1e7817a7e00 R08: ffffb1e780000000 R09:
> ffffb1e78165d000
> [ 9.791358] R10: ffffffffffffffff R11: ffffb1e78165d000 R12:
> 0000000000000000
> [ 9.791359] R13: ffff97b32dfd3000 R14: ffffffffc0b48880 R15:
> ffff97b33aa42600
> [ 9.791361] FS: 00007fa11cb34ec0(0000) GS:ffff97b33edc0000(0000)
> knlGS:0000000000000000
> [ 9.791363] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 9.791365] CR2: ffffb1e78165d000 CR3: 0000000210db6000 CR4:
> 00000000003406e0
> [ 9.791367] Call Trace:
> [ 9.791377] ? snd_pcm_hw_params+0x3ca/0x440 [snd_pcm]
> [ 9.791383] snd_pcm_common_ioctl+0x173/0xf20 [snd_pcm]
> [ 9.791389] ? snd_ctl_ioctl+0x1c5/0x710 [snd]
> [ 9.791394] snd_pcm_ioctl+0x27/0x40 [snd_pcm]
> [ 9.791398] ksys_ioctl+0x9d/0xd0
> [ 9.791400] __x64_sys_ioctl+0x1a/0x20
> [ 9.791404] do_syscall_64+0x49/0xc0
> [ 9.791406] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 9.791408] RIP: 0033:0x7fa11d4c137b
> [ 9.791410] Code: Bad RIP value.
> [ 9.791412] RSP: 002b:00007ffe2fb4e308 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000010
> [ 9.791414] RAX: ffffffffffffffda RBX: 00007ffe2fb4e510 RCX:
> 00007fa11d4c137b
> [ 9.791415] RDX: 00007ffe2fb4e510 RSI: 00000000c2604111 RDI:
> 0000000000000012
> [ 9.791417] RBP: 000055e99f65a890 R08: 0000000000000000 R09:
> 0000000000000000
> [ 9.791418] R10: 0000000000000004 R11: 0000000000000246 R12:
> 000055e99f65a810
> [ 9.791420] R13: 00007ffe2fb4e344 R14: 0000000000000000 R15:
> 00007ffe2fb4e510
> [ 9.791422] Modules linked in: cmac algif_hash algif_skcipher
> af_alg bnep binfmt_misc nls_iso8859_1 snd_hda_codec_realtek
> snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel
> snd_usb_audio snd_intel_dspcfg snd_usbmidi_lib snd_hda_codec amdgpu mc
> snd_hda_core snd_hwdep snd_pcm edac_mce_amd iommu_v2 ath10k_pci
> snd_seq_midi gpu_sched snd_seq_midi_event kvm_amd kvm ttm snd_rawmidi
> ath10k_core irqbypass drm_kms_helper snd_seq cec i2c_algo_bit
> fb_sys_fops ath snd_seq_device syscopyarea snd_timer mac80211
> crct10dif_pclmul ghash_clmulni_intel btusb aesni_intel btrtl btbcm
> crypto_simd cryptd btintel serio_raw input_leds sysfillrect
> glue_helper bluetooth efi_pstore k10temp snd pl2303 wmi_bmof
> ecdh_generic ecc snd_pci_acp3x sysimgblt cfg80211 soundcore ccp
> libarc4 ipmi_devintf ipmi_msghandler mac_hid sch_fq_codel parport_pc
> ppdev lp parport drm ip_tables x_tables autofs4 hid_generic usbhid hid
> crc32_pclmul nvme ahci psmouse i2c_piix4 libahci nvme_core r8169
> realtek wmi video
> [ 9.791463] CR2: ffffb1e78165d000
> [ 9.791465] ---[ end trace 7b22a028ccaf2e75 ]---
>
> thanks,
> -- Shuah
>

2020-06-15 20:44:37

by Shuah Khan

Subject: Re: Linux 5.8-rc1 BUG unable to handle page fault (snd_pcm)

On 6/15/20 1:48 PM, Linus Torvalds wrote:
> On Mon, Jun 15, 2020 at 11:48 AM Shuah Khan <[email protected]> wrote:
>>
>> I am seeing the following problem on my system. I haven't started debug
>> yet. Is this a known issue?
>>
>> [ 9.791309] BUG: unable to handle page fault for address:
>> ffffb1e78165d000
>> [ 9.791328] #PF: supervisor write access in kernel mode
>> [ 9.791330] #PF: error_code(0x000b) - reserved bit violation
>
> Hmm. "reserved bit violation" sounds like the page tables themselves
> are corrupt.
>
>> [ 9.791332] PGD 23dd5c067 P4D 23dd5c067 PUD 23dd5d067 PMD 22ba8e067
>> PTE 80001a3681509163
>
> PTE low 12 bits 163 is "global", "dirty+accessed" + "kernel
> read-write", so that part looks fine. The top bit is NX. I'm not
> seeing any reserved bits set.
>
> The page directory bits look sane too (067 is just the normal state
> for page tables).
>
> The PTE does have bit 44 set. I think that's what triggers the
> problem. This is presumably on a machine with 44 physical address
> bits?
>
> The faulting code is all in memset, and it's just doing "rep stosq" to
> fill memory with zeroes, and we have
>
> RAX: 0000000000000000 (the zero pattern)
> RCX: 00000000000008a0 (repeat count)
> RDI: ffffb1e78165d000 (the target address)
>
> and that target address looks odd. If I read it right, it's at the
> 41TB mark in the direct-mapped area.
>
> But I am probably mis-reading this.
>
> Better bring in a few more x86 people. We did have some page table
> work this time around, with both the entry code changes but also the
> vmalloc faulting removal.
>
> It doesn't _look_ like it's in the vmalloc range, though. But with
> that RCX value, it's certainly doing more than a single page.
>
>> [ 9.791367] Call Trace:
>> [ 9.791377] ? snd_pcm_hw_params+0x3ca/0x440 [snd_pcm]
>> [ 9.791383] snd_pcm_common_ioctl+0x173/0xf20 [snd_pcm]
>> [ 9.791389] ? snd_ctl_ioctl+0x1c5/0x710 [snd]
>> [ 9.791394] snd_pcm_ioctl+0x27/0x40 [snd_pcm]
>> [ 9.791398] ksys_ioctl+0x9d/0xd0
>> [ 9.791400] __x64_sys_ioctl+0x1a/0x20
>> [ 9.791404] do_syscall_64+0x49/0xc0
>> [ 9.791406] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Can you re-create it with CONFIG_DEBUG_INFO enabled, and run it
> through scripts/decode_stacktrace.sh to give more details on where it
> happens.
>

I have CONFIG_DEBUG_INFO enabled. Ran the stack trace through
scripts/decode_stacktrace.sh

Log below.

-- Shuah

------------------------------------------------------------------------

[ 15.341211] BUG: unable to handle page fault for address: ffffb1e782ba5000
[ 15.341217] #PF: supervisor write access in kernel mode
[ 15.341218] #PF: error_code(0x000b) - reserved bit violation
[ 15.341220] PGD 23dd5c067 P4D 23dd5c067 PUD 23dd5d067 PMD 1fc3aa067 PTE 80001a36827a9163
[ 15.341225] Oops: 000b [#5] SMP NOPTI
[ 15.341229] CPU: 5 PID: 1213 Comm: pulseaudio Tainted: G D 5.8.0-rc1 #1
[ 15.341231] Hardware name: LENOVO 10VGCTO1WW/3130, BIOS M1XKT45A 08/21/2019
[ 15.341237] RIP: 0010:__memset (arch/x86/lib/memset_64.S:41)
[ 15.341239] Code: cc cc cc cc cc cc 0f 1f 44 00 00 49 89 f9 48 89 d1 83 e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 <f3> 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 f3
All code
========
   0:   cc                      int3
   1:   cc                      int3
   2:   cc                      int3
   3:   cc                      int3
   4:   cc                      int3
   5:   cc                      int3
   6:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
   b:   49 89 f9                mov    %rdi,%r9
   e:   48 89 d1                mov    %rdx,%rcx
  11:   83 e2 07                and    $0x7,%edx
  14:   48 c1 e9 03             shr    $0x3,%rcx
  18:   40 0f b6 f6             movzbl %sil,%esi
  1c:   48 b8 01 01 01 01 01    movabs $0x101010101010101,%rax
  23:   01 01 01
  26:   48 0f af c6             imul   %rsi,%rax
  2a:*  f3 48 ab                rep stos %rax,%es:(%rdi)    <-- trapping instruction
  2d:   89 d1                   mov    %edx,%ecx
  2f:   f3 aa                   rep stos %al,%es:(%rdi)
  31:   4c 89 c8                mov    %r9,%rax
  34:   c3                      retq
  35:   90                      nop
  36:   49 89 f9                mov    %rdi,%r9
  39:   40 88 f0                mov    %sil,%al
  3c:   48 89 d1                mov    %rdx,%rcx
  3f:   f3                      repz

Code starting with the faulting instruction
===========================================
   0:   f3 48 ab                rep stos %rax,%es:(%rdi)
   3:   89 d1                   mov    %edx,%ecx
   5:   f3 aa                   rep stos %al,%es:(%rdi)
   7:   4c 89 c8                mov    %r9,%rax
   a:   c3                      retq
   b:   90                      nop
   c:   49 89 f9                mov    %rdi,%r9
   f:   40 88 f0                mov    %sil,%al
  12:   48 89 d1                mov    %rdx,%rcx
  15:   f3                      repz
[ 15.341242] RSP: 0018:ffffb1e7827dbdd0 EFLAGS: 00010216
[ 15.341244] RAX: 0000000000000000 RBX: ffff97b30e341800 RCX: 00000000000008a0
[ 15.341246] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffb1e782ba5000
[ 15.341247] RBP: ffffb1e7827dbe00 R08: ffffb1e780000000 R09: ffffb1e782ba5000
[ 15.341249] R10: ffffffffffffffff R11: ffffb1e782ba5000 R12: 0000000000000000
[ 15.341251] R13: ffff97b30e347000 R14: ffffffffc0b48880 R15: ffff97b33aa42600
[ 15.341253] FS: 00007fcbaebd4ec0(0000) GS:ffff97b33ed40000(0000) knlGS:0000000000000000
[ 15.341256] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 15.341257] CR2: ffffb1e782ba5000 CR3: 0000000229eb4000 CR4: 00000000003406e0
[ 15.341259] Call Trace:
[ 15.341267] ? snd_pcm_hw_params+0x3ca/0x440 snd_pcm
[ 15.341272] snd_pcm_common_ioctl+0x173/0xf20 snd_pcm
[ 15.341277] ? snd_ctl_ioctl+0x1c5/0x710 snd
[ 15.341282] snd_pcm_ioctl+0x27/0x40 snd_pcm
[ 15.341285] ksys_ioctl (fs/ioctl.c:49 /home/shuah/lkml/linux_5.8/fs/ioctl.c:753)
[ 15.341288] __x64_sys_ioctl (fs/ioctl.c:760)
[ 15.341291] do_syscall_64 (arch/x86/entry/common.c:359)
[ 15.341294] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:124)
[ 15.341296] RIP: 0033:0x7fcbaf56137b
[ 15.341297] Code: Bad RIP value.
objdump: '/tmp/tmp.NDQZh43uz9.o': No such file

Code starting with the faulting instruction
===========================================
[ 15.341298] RSP: 002b:00007ffde0397558 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 15.341300] RAX: ffffffffffffffda RBX: 00007ffde0397760 RCX: 00007fcbaf56137b
[ 15.341301] RDX: 00007ffde0397760 RSI: 00000000c2604111 RDI: 0000000000000013
[ 15.341302] RBP: 0000555aeb4bcc10 R08: 0000000000000000 R09: 0000000000000000
[ 15.341303] R10: 0000000000000004 R11: 0000000000000246 R12: 0000555aeb4bcb90
[ 15.341304] R13: 00007ffde0397594 R14: 0000000000000000 R15: 00007ffde0397760
[ 15.341307] Modules linked in: ccm cmac algif_hash algif_skcipher
af_alg bnep binfmt_misc nls_iso8859_1 snd_hda_codec_realtek
snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel
snd_usb_audio snd_intel_dspcfg snd_usbmidi_lib snd_hda_codec amdgpu mc
snd_hda_core snd_hwdep snd_pcm edac_mce_amd iommu_v2 ath10k_pci
snd_seq_midi gpu_sched snd_seq_midi_event kvm_amd kvm ttm snd_rawmidi
ath10k_core irqbypass drm_kms_helper snd_seq cec i2c_algo_bit
fb_sys_fops ath snd_seq_device syscopyarea snd_timer mac80211
crct10dif_pclmul ghash_clmulni_intel btusb aesni_intel btrtl btbcm
crypto_simd cryptd btintel serio_raw input_leds sysfillrect glue_helper
bluetooth efi_pstore k10temp snd pl2303 wmi_bmof ecdh_generic ecc
snd_pci_acp3x sysimgblt cfg80211 soundcore ccp libarc4 ipmi_devintf
ipmi_msghandler mac_hid sch_fq_codel parport_pc ppdev lp parport drm
ip_tables x_tables autofs4 hid_generic usbhid hid crc32_pclmul nvme ahci
psmouse i2c_piix4 libahci nvme_core r8169 realtek wmi video
[ 15.341342] CR2: ffffb1e782ba5000
[ 15.341344] ---[ end trace 7b22a028ccaf2e79 ]---
------------------------------------------------------------------------

2020-06-15 20:59:36

by Shuah Khan

Subject: Re: Linux 5.8-rc1 BUG unable to handle page fault (snd_pcm)

On 6/15/20 1:57 PM, Takashi Iwai wrote:
> On Mon, 15 Jun 2020 20:48:11 +0200,
> Shuah Khan wrote:
>>
>> I am seeing the following problem on my system. I haven't started debug
>> yet. Is this a known issue?
>
> Yes, the recent fix by David should paper over it:
> http://lore.kernel.org/r/alpine.DEB.2.22.394.2006110025250.13899@chino.kir.corp.google.com
> IIRC, Christoph already merged it in his tree.
>
> I also made another fix series in the sound driver side (found in
> topic/dma-fix2 branch in my sound.git tree).
> But I guess I'll queue it for 5.9 after more testing.
>
>

David and Takashi,

I applied the patch David pointed me to.

http://git.infradead.org/users/hch/dma-mapping.git/commitdiff/dbed452a078d56bc7f1abecc3edd6a75e8e4484e

I have CONFIG_AMD_MEM_ENCRYPT enabled. Building now. Will keep
you updated.

thanks,
-- Shuah

2020-06-15 21:00:08

by Linus Torvalds

Subject: Re: Linux 5.8-rc1 BUG unable to handle page fault (snd_pcm)

On Mon, Jun 15, 2020 at 1:41 PM Shuah Khan <[email protected]> wrote:
>
> I have CONFIG_DEBUG_INFO enabled. Ran the stack trace through
> scripts/decode_stacktrace.sh

Thanks. It looks like it isn't needed and people already know what the cause is.

Also, sadly the stack trace decoding didn't end up as useful as it
could have been because it looks like it doesn't know how to do the
nice address lookups for modules.

So this:

> [ 15.341237] RIP: 0010:__memset (arch/x86/lib/memset_64.S:41)

gets nicely pinpointed to the source, but the most critical part of
the call trace is in modules, and there we end up having just

> [ 15.341259] Call Trace:
> [ 15.341267] ? snd_pcm_hw_params+0x3ca/0x440 snd_pcm
> [ 15.341272] snd_pcm_common_ioctl+0x173/0xf20 snd_pcm
> [ 15.341277] ? snd_ctl_ioctl+0x1c5/0x710 snd
> [ 15.341282] snd_pcm_ioctl+0x27/0x40 snd_pcm

without then looking at the debug info in the snd_pcm module to figure that out.

Then when the call trace gets back to non-module code, it looks good again:

> [ 15.341285] ksys_ioctl (fs/ioctl.c:49 /home/shuah/lkml/linux_5.8/fs/ioctl.c:753)
> [ 15.341288] __x64_sys_ioctl (fs/ioctl.c:760)
> [ 15.341291] do_syscall_64 (arch/x86/entry/common.c:359)
> [ 15.341294] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:124)

with pinpointing to exactly where the calls are.

I note that Konstantin Khlebnikov did add support to do the module
parts too back in 2016, but it requires people to know to give the
module path too.

Adding him and Sasha to the participants in case there are ideas on
how to improve on this (and partly just because I want to once again
give scripts/decode_stacktrace.sh some more mention, because a lot of
people seem to be unaware of how useful it can be to make oopses and
traces more readable).

Maybe even just a warning about lacking a module path when there are
module symbols present?
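
One way to picture such a check is a pre-pass over the raw oops that lists the modules whose frames will need a module path. The Python sketch below is only an illustration of the idea, not how scripts/decode_stacktrace.sh works; the function name and regular expression are assumptions, and it expects the undecoded dmesg form where the module appears in square brackets.

```python
import re

# Matches "symbol+0xoff/0xsize [module]" as printed in a raw call trace.
MODULE_FRAME = re.compile(r"(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+ \[(\w+)\]")


def modules_needing_path(trace_text):
    """Return the sorted names of modules that own frames in the trace."""
    return sorted({m.group(2) for m in MODULE_FRAME.finditer(trace_text)})


if __name__ == "__main__":
    sample = """\
[ 9.791377] ? snd_pcm_hw_params+0x3ca/0x440 [snd_pcm]
[ 9.791383] snd_pcm_common_ioctl+0x173/0xf20 [snd_pcm]
[ 9.791389] ? snd_ctl_ioctl+0x1c5/0x710 [snd]
[ 9.791394] snd_pcm_ioctl+0x27/0x40 [snd_pcm]
[ 9.791398] ksys_ioctl+0x9d/0xd0
"""
    mods = modules_needing_path(sample)
    if mods:
        print("module path needed for:", ", ".join(mods))
```

Frames from the core kernel image, such as ksys_ioctl, carry no module tag and are not reported.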

Linus

2020-06-15 21:06:19

by Shuah Khan

Subject: Re: Linux 5.8-rc1 BUG unable to handle page fault (snd_pcm)

On 6/15/20 2:53 PM, Shuah Khan wrote:
> On 6/15/20 1:57 PM, Takashi Iwai wrote:
>> On Mon, 15 Jun 2020 20:48:11 +0200,
>> Shuah Khan wrote:
>>>
>>> I am seeing the following problem on my system. I haven't started debug
>>> yet. Is this a known issue?
>>
>> Yes, the recent fix by David should paper over it:
>>
>> http://lore.kernel.org/r/alpine.DEB.2.22.394.2006110025250.13899@chino.kir.corp.google.com
>>
>> IIRC, Christoph already merged it in his tree.
>>
>> I also made another fix series in the sound driver side (found in
>> topic/dma-fix2 branch in my sound.git tree).
>> But I guess I'll queue it for 5.9 after more testing.
>>
>>
>
> David and Takashi,
>
> I applied the patch David pointed me to.
>
> http://git.infradead.org/users/hch/dma-mapping.git/commitdiff/dbed452a078d56bc7f1abecc3edd6a75e8e4484e
>
>
> I have CONFIG_AMD_MEM_ENCRYPT enabled. Building now. Will keep
> you updated.
>

This patch took care of the problem.

thanks,
-- Shuah

2020-06-15 21:21:51

by Shuah Khan

Subject: Re: Linux 5.8-rc1 BUG unable to handle page fault (snd_pcm)

On 6/15/20 2:55 PM, Linus Torvalds wrote:
> On Mon, Jun 15, 2020 at 1:41 PM Shuah Khan <[email protected]> wrote:
>>
>> I have CONFIG_DEBUG_INFO enabled. Ran the stack trace through
>> scripts/decode_stacktrace.sh
>
> Thanks. It looks like it isn't needed and people already know what the cause is.
>
> Also, sadly the stack trace decoding didn't end up as useful as it
> could have been because it looks like it doesn't know how to do the
> nice address lookups for modules.
>
> So this:
>
>> [ 15.341237] RIP: 0010:__memset (arch/x86/lib/memset_64.S:41)
>
> gets nicely pinpointed to the source, but the most critical part of
> the call trace is in modules, and there we end up having just
>
>> [ 15.341259] Call Trace:
>> [ 15.341267] ? snd_pcm_hw_params+0x3ca/0x440 snd_pcm
>> [ 15.341272] snd_pcm_common_ioctl+0x173/0xf20 snd_pcm
>> [ 15.341277] ? snd_ctl_ioctl+0x1c5/0x710 snd
>> [ 15.341282] snd_pcm_ioctl+0x27/0x40 snd_pcm
>
> without then looking at the debug info in the snd_pcm module to figure that out.
>
> Then when the call trace gets back to non-module code, it looks good again:
>
>> [ 15.341285] ksys_ioctl (fs/ioctl.c:49 /home/shuah/lkml/linux_5.8/fs/ioctl.c:753)
>> [ 15.341288] __x64_sys_ioctl (fs/ioctl.c:760)
>> [ 15.341291] do_syscall_64 (arch/x86/entry/common.c:359)
>> [ 15.341294] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:124)
>
> with pinpointing to exactly where the calls are.
>
> I note that Konstantin Khlebnikov did add support to do the module
> parts too back in 2016, but it requires people to know to give the
> module path too.
>
> Adding him and Sasha to the participants in case there are ideas on
> how to improve on this (and partly just because I want to once again
> give scripts/decode_stacktrace.sh some more mention, because a lot of
> people seem to be unaware of how useful it can be to make oopses and
> traces more readable).
>

Thanks. I usually decode all of this by hand. This script saves a lot
of time. Very cool.

Yeah. I should have thought about adding module path. With module path
added, I get better results:

[ 15.341267] ? snd_pcm_hw_params (./include/linux/string.h:391 /home/shuah/lkml/linux_5.8/sound/core/pcm_native.c:759) snd_pcm
[ 15.341272] snd_pcm_common_ioctl (sound/core/pcm_native.c:792 /home/shuah/lkml/linux_5.8/sound/core/pcm_native.c:3210) snd_pcm
[ 15.341277] ? snd_ctl_ioctl+0x1c5/0x710 snd
[ 15.341282] snd_pcm_ioctl (sound/core/pcm_native.c:3297) snd_pcm
[ 15.341285] ksys_ioctl (fs/ioctl.c:49 /home/shuah/lkml/linux_5.8/fs/ioctl.c:753)
[ 15.341288] __x64_sys_ioctl (fs/ioctl.c:760)
[ 15.341291] do_syscall_64 (arch/x86/entry/common.c:359)
[ 15.341294] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:124)
[ 15.341296] RIP: 0033:0x7fcbaf56137b
[ 15.341297] Code: Bad RIP value.

> Maybe even just a warning about lacking a module path when there are
> module symbols present?
>

It does tell you the usage.

Usage:
scripts/decode_stacktrace.sh [vmlinux] [base path] [modulespath]

It would be useful to add a warning.

thanks,
-- Shuah

2020-06-15 22:27:27

by Linus Torvalds

Subject: Re: Linux 5.8-rc1 BUG unable to handle page fault (snd_pcm)

On Mon, Jun 15, 2020 at 2:18 PM Shuah Khan <[email protected]> wrote:
>
> Yeah. I should have thought about adding module path. With module path
> added, I get better results:
>
> [ 15.341267] ? snd_pcm_hw_params (./include/linux/string.h:391
> /home/shuah/lkml/linux_5.8/sound/core/pcm_native.c:759) snd_pcm
> [ 15.341272] snd_pcm_common_ioctl (sound/core/pcm_native.c:792
> /home/shuah/lkml/linux_5.8/sound/core/pcm_native.c:3210) snd_pcm

Yeah, now it gives the complete path and you see exactly which
memset() it ends up being, ie it's that

        /* clear the buffer for avoiding possible kernel info leaks */
        if (runtime->dma_area && !substream->ops->copy_user)
                memset(runtime->dma_area, 0, runtime->dma_bytes);

Quite often with all the inlining the compiler does it can be really
hard to figure out where things come from when you just see the symbol
and offset.

Ok, in this case there aren't that many calls to memset() in that
file, and it might have been obvious which one it was in this case.
But sometimes they just come from various inline helper functions too,
and when automation can give us the answer easily, it's the thing to
do.

> > Maybe even just a warning about lacking a module path when there are
> > module symbols present?
> >
>
> It does tell you the usage.

Yes, it's more that it's very easy to overlook it, and then get a
partial decode.

Once you know how to use that script, it's very convenient, but the
problem tends to be that too few people are aware of it in the first
place.

Linus

2020-06-15 22:27:31

by Sasha Levin

Subject: [PATCH] scripts/decode_stacktrace: warn when modpath is needed but is unset

When a user tries to parse a symbol located inside a module he must have
modpath set. Otherwise, decode_stacktrace won't be able to parse the
symbol correctly.

Right now the failure is silent and easily missed by the user. What's
worse is that by the time the user realizes what happened (or someone on
LKML asks him to add the modpath and re-run), he might have already got
rid of the vmlinux/modules.

Signed-off-by: Sasha Levin <[email protected]>
---
 scripts/decode_stacktrace.sh | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/scripts/decode_stacktrace.sh b/scripts/decode_stacktrace.sh
index 13e5fbafdf2f..2c9ee4beb545 100755
--- a/scripts/decode_stacktrace.sh
+++ b/scripts/decode_stacktrace.sh
@@ -27,7 +27,10 @@ parse_symbol() {
 	elif [[ "${modcache[$module]+isset}" == "isset" ]]; then
 		local objfile=${modcache[$module]}
 	else
-		[[ $modpath == "" ]] && return
+		if [[ $modpath == "" ]]; then
+			echo "WARNING! Modules path isn't set, but is needed to parse this symbol" >&2
+			return
+		fi
 		local objfile=$(find "$modpath" -name "${module//_/[-_]}.ko*" -print -quit)
 		[[ $objfile == "" ]] && return
 		modcache[$module]=$objfile
--
2.25.1
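
For readers who do not read bash fluently, the lookup touched by this patch can be paraphrased in Python. This is a rough sketch of the same logic under stated assumptions, not a port of the script: `find_module_object` is a made-up name, the caching in `modcache` is left out, and `os.walk` may visit directories in a different order than `find` does.

```python
import os
import re
import sys


def find_module_object(modpath, module):
    """Return the path of the first <module>.ko* file under modpath, or None.

    Underscores in the module name match either '-' or '_' in the file
    name, mirroring the ${module//_/[-_]} substitution in the shell script.
    """
    if not modpath:
        print("WARNING! Modules path isn't set, but is needed to parse this symbol",
              file=sys.stderr)
        return None
    name_re = re.compile(
        "".join("[-_]" if ch == "_" else re.escape(ch) for ch in module) + r"\.ko.*"
    )
    for root, _dirs, files in os.walk(modpath):
        for name in files:
            if name_re.fullmatch(name):
                return os.path.join(root, name)
    return None
```

With an empty module path the function prints the warning on stderr and returns nothing, which corresponds to the new branch in the patch.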

2020-06-15 22:40:22

by Linus Torvalds

Subject: Re: [PATCH] scripts/decode_stacktrace: warn when modpath is needed but is unset

On Mon, Jun 15, 2020 at 3:24 PM Sasha Levin <[email protected]> wrote:
>
> When a user tries to parse a symbol located inside a module he must have
> modpath set. Otherwise, decode_stacktrace won't be able to parse the
> symbol correctly.
>
> Right now the failure is silent and easily missed by the user. What's
> worse is that by the time the user realizes what happened (or someone on
> LKML asks him to add the modpath and re-run), he might have already got
> rid of the vmlinux/modules.

Well, that looks straightforward.

Applied,

Linus

2020-06-15 22:46:53

by Linus Torvalds

Subject: Re: [PATCH] scripts/decode_stacktrace: warn when modpath is needed but is unset

On Mon, Jun 15, 2020 at 3:37 PM Linus Torvalds
<[email protected]> wrote:
>
> Well, that looks straightforward.

Hmm. Decided to test it. It warns for every case: a bit excessive,
perhaps, but I guess it won't hurt.

So Shuah's thing results in

[ 9.791367] Call Trace:
WARNING! Modules path isn't set, but is needed to parse this symbol
[ 9.791377] ? snd_pcm_hw_params+0x3ca/0x440 snd_pcm
WARNING! Modules path isn't set, but is needed to parse this symbol
[ 9.791383] snd_pcm_common_ioctl+0x173/0xf20 snd_pcm
WARNING! Modules path isn't set, but is needed to parse this symbol
[ 9.791389] ? snd_ctl_ioctl+0x1c5/0x710 snd
WARNING! Modules path isn't set, but is needed to parse this symbol
[ 9.791394] snd_pcm_ioctl+0x27/0x40 snd_pcm

which looks a bit redundant, but maybe that just means people _really_ notice.

So the patch stays.

Linus

2020-06-15 23:28:02

by Sasha Levin

Subject: Re: [PATCH] scripts/decode_stacktrace: warn when modpath is needed but is unset

On Mon, Jun 15, 2020 at 03:43:31PM -0700, Linus Torvalds wrote:
>On Mon, Jun 15, 2020 at 3:37 PM Linus Torvalds
><[email protected]> wrote:
>>
>> Well, that looks straightforward.
>
>Hmm. Decided to test it. It warns for every case: a bit excessive,
>perhaps, but I guess it won't hurt.
>
>So Shuah's thing results in
>
>[ 9.791367] Call Trace:
>WARNING! Modules path isn't set, but is needed to parse this symbol
>[ 9.791377] ? snd_pcm_hw_params+0x3ca/0x440 snd_pcm
>WARNING! Modules path isn't set, but is needed to parse this symbol
>[ 9.791383] snd_pcm_common_ioctl+0x173/0xf20 snd_pcm
>WARNING! Modules path isn't set, but is needed to parse this symbol
>[ 9.791389] ? snd_ctl_ioctl+0x1c5/0x710 snd
>WARNING! Modules path isn't set, but is needed to parse this symbol
>[ 9.791394] snd_pcm_ioctl+0x27/0x40 snd_pcm
>
>which looks a bit redundant, but maybe that just means people _really_ notice.

I figured it's a good balance between warning only once (which can get
lost in a longer trace) vs just exiting on error (as it prevents the
user from ignoring this issue if he doesn't care or just doesn't have
the modules).

--
Thanks,
Sasha