2020-06-19 06:54:17

by Christoph Hellwig

[permalink] [raw]
Subject: Re: Commit 25f12ae45fc1 ("maccess: rename probe_kernel_address to get_kernel_nofault") causing several OOPSes

That is indeed really strange, as that commit is just a rename.
Well, Linus also added swapping of the argument order, but again it
shouldn't change much. Do you see any compiler warnings or something
odd in the kernel log before the actual crash?

On Thu, Jun 18, 2020 at 06:21:37PM -0700, Kenneth R. Crudup wrote:
>
> My system oopses in all manner of places with it in when/if my Thunderbolt
> adapter is connected (and clears up if I revert just this).
>
> Here's a couple examples:
>
> ----
> <6>[ 43.136942][ T37] r8152 6-2.1.2:1.0 eth0: carrier on
> <4>[ 45.400675][ T235] general protection fault, probably for non-canonical address 0xdead000000000122: 0000 [#1] PREEMPT SMP NOPTI
> <4>[ 45.400680][ T235] CPU: 6 PID: 235 Comm: kworker/u16:4 Tainted: G U 5.8.0-rc1-XPS-Kenny+ #29
> <4>[ 45.400681][ T235] Hardware name: Dell Inc. XPS 13 7390 2-in-1/06CDVY, BIOS 1.3.1 03/02/2020
> <4>[ 45.400719][ T235] Workqueue: i915-dp i915_digport_work_func [i915]
> <4>[ 45.400723][ T235] RIP: 0010:drm_dp_mst_hpd_irq+0x7a8/0x1040
> <4>[ 45.400725][ T235] Code: 7d c0 4c 89 55 b8 e8 a7 c3 ff ff 4c 8b 55 b8 4c 89 d7 e8 7b 89 57 00 49 8b 17 49 8b 47 08 41 c7 47 14 03 00 00 00 48 89 42 08 <48> 89 10 48 b8 00 01 00 00 00 00 ad de 49 89 07 48 83 c0 22 4c 8b
> Oops#1 Part1
> <4>[ 45.400727][ T235] RSP: 0018:ffff978d808c7d70 EFLAGS: 00010246
> <4>[ 45.400728][ T235] RAX: dead000000000122 RBX: ffff8b49230cc7e8 RCX: 0000000000000002
> <4>[ 45.400729][ T235] RDX: ffff8b48ba9df140 RSI: ffffffff9ad85a32 RDI: ffff8b49230ccb68
> <4>[ 45.400730][ T235] RBP: ffff978d808c7dc0 R08: 0000000000000006 R09: ffffffff9ad85e70
> <4>[ 45.400731][ T235] R10: ffff8b49230ccb68 R11: 0000000000000014 R12: ffff978d808c7df2
> <4>[ 45.400732][ T235] R13: ffff978d808c7df1 R14: 0000000000000140 R15: ffff8b48ba9dc140
> <4>[ 45.400733][ T235] FS: 0000000000000000(0000) GS:ffff8b492f780000(0000) knlGS:0000000000000000
> <4>[ 45.400734][ T235] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> <4>[ 45.400735][ T235] CR2: 00007f3d3f1ebd90 CR3: 000000082460c006 CR4: 0000000000760ee0
> <4>[ 45.400736][ T235] PKRU: 55555554
> <4>[ 45.400737][ T235] Call Trace:
> <4>[ 45.400762][ T235] ? intel_dp_hpd_pulse+0x2b8/0x440 [i915]
> <4>[ 45.400785][ T235] intel_dp_hpd_pulse+0x2b8/0x440 [i915]
> <4>[ 45.400809][ T235] i915_digport_work_func+0xcb/0x160 [i915]
> <4>[ 45.400814][ T235] process_one_work+0x1ac/0x300
> <4>[ 45.400815][ T235] worker_thread+0x49/0x3c0
> <4>[ 45.400817][ T235] ? process_one_work+0x300/0x300
> <4>[ 45.400819][ T235] kthread+0x10e/0x150
> <4>[ 45.400821][ T235] ? kthread_park+0x90/0x90
> <4>[ 45.400823][ T235] ret_from_fork+0x1f/0x30
> <4>[ 45.400825][ T235] Modules linked in: r8152 snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic btusb btintel vmw_vmci iwlmvm mac80211 snd_hda_intel iwlwifi snd_intel_dspcfg snd_hda_codec snd_hda_core cfg80211 mei_hdcp i915 intel_gtt
> <4>[ 45.400834][ T235] ---[ end trace e3e9561107d7861e ]---
> ----
>
> ----
> <3>[ 93.961454][ T32] i915 0000:00:02.0: [drm] *ERROR* failed to enable link training
> <4>[ 104.048809][ T47] general protection fault, probably for non-canonical address 0xdead000000000100: 0000 [#1] PRE
> EMPT SMP NOPTI
> <4>[ 104.051127][ T47] CPU: 7 PID: 47 Comm: kworker/7:0 Tainted: G U 5.8.0-rc1-XPS-Kenny+ #29
> <4>[ 104.052493][ T47] Hardware name: Dell Inc. XPS 13 7390 2-in-1/06CDVY, BIOS 1.3.1 03/02/2020
> <4>[ 104.054387][ T47] Workqueue: events drm_dp_delayed_destroy_work
> Oops#1 Part1
> <4>[ 104.055639][ T47] RIP: 0010:drm_dp_delayed_destroy_work+0x16b/0x2d0
> <4>[ 104.056924][ T47] Code: 00 00 48 89 f1 48 8b 34 24 48 3b b2 38 01 00 00 75 32 48 8b ba 40 01 00 00 48 8b b2 48 0
> 1 00 00 c7 82 54 01 00 00 04 00 00 00 <48> 89 77 08 48 89 3e 4c 89 aa 40 01 00 00 4c 89 a2 48 01 00 00 bb
> <4>[ 104.058343][ T47] RSP: 0018:ffffb68100257e18 EFLAGS: 00010246
> <4>[ 104.059857][ T47] RAX: ffff916c625f2b88 RBX: 0000000000000001 RCX: deacffffffffffc0
> <4>[ 104.061593][ T47] RDX: ffff916c47ae1800 RSI: dead000000000122 RDI: dead000000000100
> <4>[ 104.064004][ T47] RBP: ffff916c625f2c50 R08: 0000000000000001 R09: ffff916c6c006a00
> <4>[ 104.066821][ T47] R10: ffff916c6a4d4000 R11: ffff916c6bc00e48 R12: dead000000000122
> <4>[ 104.069857][ T47] R13: dead000000000100 R14: ffff916c625f2af8 R15: ffff916c52259a28
> <4>[ 104.072820][ T47] FS: 0000000000000000(0000) GS:ffff916c6f7c0000(0000) knlGS:0000000000000000
> <4>[ 104.075671][ T47] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> <4>[ 104.078493][ T47] CR2: 000055f3668bf058 CR3: 00000008407de003 CR4: 0000000000760ee0
> <4>[ 104.081331][ T47] PKRU: 55555554
> <4>[ 104.084125][ T47] Call Trace:
> <4>[ 104.086726][ T47] process_one_work+0x1ac/0x300
> <4>[ 104.089355][ T47] worker_thread+0x49/0x3c0
> <4>[ 104.091899][ T47] ? process_one_work+0x300/0x300
> <4>[ 104.094377][ T47] kthread+0x10e/0x150
> <4>[ 104.096662][ T47] ? kthread_park+0x90/0x90
> <4>[ 104.098952][ T47] ret_from_fork+0x1f/0x30
> <4>[ 104.101248][ T47] Modules linked in: r8152 snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic btusb btintel vmw_vmci iwlmvm snd_hda_intel mac80211 snd_intel_dspcfg mei_hdcp snd_hda_codec iwlwifi snd_hda_core i915 cfg80211 intel_gtt
> <4>[ 104.106196][ T47] ---[ end trace 9d53a6fd1074c5d2 ]---
> ----
>
> ----
> <6>[ 39.395350][ T935] usb 5-2.1.4: new high-speed USB device number 10 using xhci_hcd
> <0>[ 39.447609][ T17] BUG: stack guard page was hit at 000000001d3357a4 (stack is 00000000d8c3d9e8..000000009e4ee10e)
> <4>[ 39.447619][ T17] kernel stack overflow (page fault): 0000 [#1] PREEMPT SMP NOPTI
> Oops#1 Part1
> <4>[ 39.447625][ T17] CPU: 1 PID: 17 Comm: kworker/1:0 Tainted: G U 5.8.0-rc1-XPS-Kenny+ #29
> <4>[ 39.447628][ T17] Hardware name: Dell Inc. XPS 13 7390 2-in-1/06CDVY, BIOS 1.3.1 03/02/2020
> <4>[ 39.447637][ T17] Workqueue: events_long drm_dp_tx_work
> <4>[ 39.447644][ T17] RIP: 0010:process_single_tx_qlock+0x3bb/0x5f0
> <4>[ 39.447649][ T17] Code: 48 83 e0 e0 48 89 c2 31 c0 4d 8b 0c 03 4d 8b 44 03 08 49 8b 7c 03 10 49 8b 74 03 18 4c 89 0c 01 4c 89 44 01 08 48 89 7c 01 10 <48> 89 74 01 18 48 83 c0 20 48 39 d0 72 d1 e9 b5 fe ff ff 48 8b 4f
> <4>[ 39.447652][ T17] RSP: 0018:ffffbb0a0014fd98 EFLAGS: 00010203
> <4>[ 39.447656][ T17] RAX: 0000000000000200 RBX: 00000000fffffffc RCX: ffffbb0a0014fde8
> <4>[ 39.447660][ T17] RDX: ffffffffffffffe0 RSI: 0000000000000000 RDI: 0000000000000000
> <4>[ 39.447662][ T17] RBP: ffff9c87712ea800 R08: 0000000000000000 R09: 0000000000000000
> <4>[ 39.447665][ T17] R10: 0000000000000000 R11: ffff9c87712ea809 R12: ffff9c87a3b417e8
> <4>[ 39.447668][ T17] R13: 0000000000000000 R14: 0000000000000003 R15: ffffbb0a0014fde3
> <4>[ 39.447672][ T17] FS: 0000000000000000(0000) GS:ffff9c87af640000(0000) knlGS:0000000000000000
> <4>[ 39.447675][ T17] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> <4>[ 39.447679][ T17] CR2: ffffbb0a00150000 CR3: 000000025760a001 CR4: 0000000000760ee0
> <4>[ 39.447682][ T17] PKRU: 55555554
> <4>[ 39.447684][ T17] Call Trace:
> <4>[ 39.447693][ T17] Modules linked in: r8152 snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic btusb btintel vmw_vmci iwlmvm mei_hdcp snd_hda_intel mac80211 snd_intel_dspcfg snd_hda_codec snd_hda_core i915 iwlwifi cfg80211 intel_gtt
> <4>[ 39.447712][ T17] ---[ end trace 60af41073edc305f ]---
> ----
>
> ----
> <0>[ 38.555935][ T76] BUG: stack guard page was hit at 00000000244b28bd (stack is 000000000d692c63..00000000613959ea)
> <4>[ 38.555949][ T76] kernel stack overflow (page fault): 0000 [#1] PREEMPT SMP NOPTI
> Panic#2 Part1
> <4>[ 38.556022][ T76] PKRU: 55555554
> <4>[ 38.556025][ T76] Call Trace:
> <4>[ 38.556036][ T76] Modules linked in: r8152 snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic btusb btintel vmw_vmci iwlmvm snd_hda_intel snd_intel_dspcfg mac80211 snd_hda_codec snd_hda_core iwlwifi i915 mei_hdcp cfg80211 intel_gtt
> <4>[ 38.556058][ T76] ---[ end trace d76b60500c18efde ]---
> <4>[ 38.920175][ T76] RIP: 0010:process_single_tx_qlock+0x329/0x480
> <4>[ 38.920181][ T76] Code: c0 01 89 44 24 04 eb 8e 4a 8b 54 30 f8 49 89 54 07 f8 48 83 e8 01 48 83 f8 08 0f 82 38 ff ff ff 48 83 e0 f8 31 d2 49 8b 0c 16 <49> 89 0c 17 48 83 c2 08 48 39 c2 72 ef e9 1c ff ff ff c6 44 24 3e
> <4>[ 38.920184][ T76] RSP: 0018:ffffc170c0347da0 EFLAGS: 00010203
> <4>[ 38.920188][ T76] RAX: fffffffffffffff8 RBX: 00000000fffffffc RCX: 0000000000000000
> <4>[ 38.920191][ T76] RDX: 0000000000000218 RSI: 0000000000000000 RDI: ffffc170c0347de0
> <4>[ 38.920193][ T76] RBP: ffff9fa7c34b4000 R08: 0000000000000002 R09: 000000000000000a
> <4>[ 38.920195][ T76] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9fa80a3007e8
> <4>[ 38.920197][ T76] R13: 0000000000000003 R14: ffff9fa7c34b4004 R15: ffffc170c0347de3
> <4>[ 38.920200][ T76] FS: 0000000000000000(0000) GS:ffff9fa80f440000(0000) knlGS:0000000000000000
> <4>[ 38.920202][ T76] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> <4>[ 38.920203][ T76] CR2: ffffc170c0348000 CR3: 000000084872e004 CR4: 0000000000760ee0
> <4>[ 38.920205][ T76] PKRU: 55555554
> <0>[ 38.920207][ T76] Kernel panic - not syncing: Fatal exception
> <0>[ 38.920218][ T76] Kernel Offset: 0x9200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> Oops#1 Part1
> <4>[ 38.555956][ T76] CPU: 1 PID: 76 Comm: kworker/1:1 Tainted: G U 5.8.0-rc1-XPS-Kenny+ #32
> <4>[ 38.555960][ T76] Hardware name: Dell Inc. XPS 13 7390 2-in-1/06CDVY, BIOS 1.3.1 03/02/2020
> <4>[ 38.555971][ T76] Workqueue: events_long drm_dp_tx_work
> <4>[ 38.555980][ T76] RIP: 0010:process_single_tx_qlock+0x329/0x480
> <4>[ 38.555985][ T76] Code: c0 01 89 44 24 04 eb 8e 4a 8b 54 30 f8 49 89 54 07 f8 48 83 e8 01 48 83 f8 08 0f 82 38 ff ff ff 48 83 e0 f8 31 d2 49 8b 0c 16 <49> 89 0c 17 48 83 c2 08 48 39 c2 72 ef e9 1c ff ff ff c6 44 24 3e
> <4>[ 38.555990][ T76] RSP: 0018:ffffc170c0347da0 EFLAGS: 00010203
> <4>[ 38.555995][ T76] RAX: fffffffffffffff8 RBX: 00000000fffffffc RCX: 0000000000000000
> <4>[ 38.555998][ T76] RDX: 0000000000000218 RSI: 0000000000000000 RDI: ffffc170c0347de0
> <4>[ 38.556001][ T76] RBP: ffff9fa7c34b4000 R08: 0000000000000002 R09: 000000000000000a
> <4>[ 38.556004][ T76] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9fa80a3007e8
> <4>[ 38.556007][ T76] R13: 0000000000000003 R14: ffff9fa7c34b4004 R15: ffffc170c0347de3
> <4>[ 38.556012][ T76] FS: 0000000000000000(0000) GS:ffff9fa80f440000(0000) knlGS:0000000000000000
> <4>[ 38.556015][ T76] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> <4>[ 38.556019][ T76] CR2: ffffc170c0348000 CR3: 000000084872e004 CR4: 0000000000760ee0
> <4>[ 38.556022][ T76] PKRU: 55555554
> <4>[ 38.556025][ T76] Call Trace:
> <4>[ 38.556036][ T76] Modules linked in: r8152 snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic btusb btintel vmw_vmci iwlmvm snd_hda_intel snd_intel_dspcfg mac80211 snd_hda_codec snd_hda_core iwlwifi i915 mei_hdcp cfg80211 intel_gtt
> <4>[ 38.556058][ T76] ---[ end trace d76b60500c18efde ]---
> ---
>
> -Kenny
>
> --
> Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Silicon Valley
---end quoted text---


2020-06-19 07:17:43

by Kenneth Crudup

[permalink] [raw]
Subject: Re: Commit 25f12ae45fc1 ("maccess: rename probe_kernel_address to get_kernel_nofault") causing several OOPSes


On Fri, 19 Jun 2020, Christoph Hellwig wrote:

> That is indeed really strange, as that commit is just a rename.
> Well, Linus also added swapping of the argument order, but again it
> shouldn't change much.

Thing is, there's other examples of the previous version in the kernel tree- any
chance there's a usage conflict (Thunderbolt has a ROM in it, maybe something in
"probe_roms.c"? (Just guessing, no idea):

----
afind probe_kernel_address
./lib/test_lockup.c: probe_kernel_address(ptr, buf) ||
./lib/test_lockup.c: probe_kernel_address(ptr + size - 1, buf)) {
./lib/test_lockup.c: if (probe_kernel_address(ptr, magic) || magic != expected) {
./arch/arm64/kernel/traps.c: if (probe_kernel_address((__force __le32 *)pc, instr_le))
./arch/sh/kernel/traps.c: if (probe_kernel_address((insn_size_t *)addr, opcode))
./arch/x86/kernel/traps.c: if (probe_kernel_address((unsigned short *)addr, ud))
./arch/x86/kernel/probe_roms.c: if (probe_kernel_address(rom_list, device) != 0)
./arch/x86/kernel/probe_roms.c: if (probe_kernel_address(rom + 0x18, offset) != 0)
./arch/x86/kernel/probe_roms.c: if (probe_kernel_address(rom + offset + 0x4, vendor) != 0)
./arch/x86/kernel/probe_roms.c: if (probe_kernel_address(rom + offset + 0x6, device) != 0)
./arch/x86/kernel/probe_roms.c: if (probe_kernel_address(rom + offset + 0x8, list) == 0 &&
./arch/x86/kernel/probe_roms.c: probe_kernel_address(rom + offset + 0xc, rev) == 0 &&
./arch/x86/kernel/probe_roms.c: return probe_kernel_address(ptr, sig) == 0 && sig == ROMSIGNATURE;
./arch/x86/kernel/probe_roms.c: for (sum = 0; length && probe_kernel_address(rom++, c) == 0; length--)
./arch/x86/kernel/probe_roms.c: if (probe_kernel_address(rom + 2, c) != 0)
./arch/x86/kernel/probe_roms.c: if (probe_kernel_address(rom + 2, c) != 0)
./arch/x86/mm/fault.c: if (probe_kernel_address(instr, opcode))
./arch/x86/mm/fault.c: if (probe_kernel_address(instr, opcode))
./arch/x86/mm/fault.c: return probe_kernel_address((unsigned long *)p, dummy);
./arch/x86/pci/pcbios.c: if (probe_kernel_address(&check->fields.signature, sig))
./arch/arm/mm/alignment.c: fault = probe_kernel_address(ip, instr);
./arch/arm/mm/alignment.c: fault = probe_kernel_address(ip, instr);
./arch/s390/mm/fault.c: return probe_kernel_address((unsigned long *)p, dummy);
./arch/powerpc/kernel/process.c: probe_kernel_address((const void *)pc, instr)) {
./arch/powerpc/kernel/kprobes.c: if (probe_kernel_address(addr, instr))
./arch/powerpc/sysdev/fsl_pci.c: ret = probe_kernel_address((void *)regs->nip, inst);
./arch/riscv/kernel/kgdb.c: if (probe_kernel_address((void *)pc, op_code))
./arch/riscv/kernel/kgdb.c: error = probe_kernel_address((void *)addr, stepped_opcode);
./arch/riscv/kernel/traps.c: if (probe_kernel_address((bug_insn_t *)pc, insn))
./arch/riscv/kernel/traps.c: if (probe_kernel_address((bug_insn_t *)pc, insn))
----

> Do you see any compiler warnings or something
> odd in the kernel log before the actual crash?

Not that I could see, but I'll try building again later on.

-Kenny

--
Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Silicon Valley

2020-06-19 07:45:18

by Christoph Hellwig

[permalink] [raw]
Subject: Re: Commit 25f12ae45fc1 ("maccess: rename probe_kernel_address to get_kernel_nofault") causing several OOPSes

On Thu, Jun 18, 2020 at 11:57:27PM -0700, Kenneth R. Crudup wrote:
>
> On Fri, 19 Jun 2020, Christoph Hellwig wrote:
>
> > That is indeed really strange, as that commit is just a rename.
> > Well, Linus also added swapping of the argument order, but again it
> > shouldn't change much.
>
> Thing is, there's other examples of the previous version in the kernel tree- any
> chance there's a usage conflict (Thunderbolt has a ROM in it, maybe something in
> "probe_roms.c"? (Just guessing, no idea):

Maybe. But nothing looks strange there. Just to re-reconfirm, you had to
revert "maccess: rename probe_kernel_address to get_kernel_nofault",
"maccess: make get_kernel_nofault() check for minimal type compatibility"
wasn't enough?

Below is a patch to do a "partial revert" for probe_roms.c. I'd be
totally surprised if it changes anything from staring at it for while,
but anyway..

diff --git a/arch/x86/kernel/probe_roms.c b/arch/x86/kernel/probe_roms.c
index 9e1def3744f220..70ff3267b4f373 100644
--- a/arch/x86/kernel/probe_roms.c
+++ b/arch/x86/kernel/probe_roms.c
@@ -22,6 +22,9 @@
#include <asm/io.h>
#include <asm/setup_arch.h>

+#define probe_kernel_address(addr, retval) \
+ copy_from_kernel_nofault(&retval, addr, sizeof(retval))
+
static struct resource system_rom_resource = {
.name = "System ROM",
.start = 0xf0000,
@@ -99,7 +102,7 @@ static bool probe_list(struct pci_dev *pdev, unsigned short vendor,
unsigned short device;

do {
- if (get_kernel_nofault(device, rom_list) != 0)
+ if (probe_kernel_address(rom_list, device) != 0)
device = 0;

if (device && match_id(pdev, vendor, device))
@@ -125,13 +128,13 @@ static struct resource *find_oprom(struct pci_dev *pdev)
break;

rom = isa_bus_to_virt(res->start);
- if (get_kernel_nofault(offset, rom + 0x18) != 0)
+ if (probe_kernel_address(rom + 0x18, offset) != 0)
continue;

- if (get_kernel_nofault(vendor, rom + offset + 0x4) != 0)
+ if (probe_kernel_address(rom + offset + 0x4, vendor) != 0)
continue;

- if (get_kernel_nofault(device, rom + offset + 0x6) != 0)
+ if (probe_kernel_address(rom + offset + 0x6, device) != 0)
continue;

if (match_id(pdev, vendor, device)) {
@@ -139,8 +142,8 @@ static struct resource *find_oprom(struct pci_dev *pdev)
break;
}

- if (get_kernel_nofault(list, rom + offset + 0x8) == 0 &&
- get_kernel_nofault(rev, rom + offset + 0xc) == 0 &&
+ if (probe_kernel_address(rom + offset + 0x8, list) == 0 &&
+ probe_kernel_address(rom + offset + 0xc, rev) == 0 &&
rev >= 3 && list &&
probe_list(pdev, vendor, rom + offset + list)) {
oprom = res;
@@ -183,14 +186,14 @@ static int __init romsignature(const unsigned char *rom)
const unsigned short * const ptr = (const unsigned short *)rom;
unsigned short sig;

- return get_kernel_nofault(sig, ptr) == 0 && sig == ROMSIGNATURE;
+ return probe_kernel_address(ptr, sig) == 0 && sig == ROMSIGNATURE;
}

static int __init romchecksum(const unsigned char *rom, unsigned long length)
{
unsigned char sum, c;

- for (sum = 0; length && get_kernel_nofault(c, rom++) == 0; length--)
+ for (sum = 0; length && probe_kernel_address(rom++, c) == 0; length--)
sum += c;
return !length && !sum;
}
@@ -211,7 +214,7 @@ void __init probe_roms(void)

video_rom_resource.start = start;

- if (get_kernel_nofault(c, rom + 2) != 0)
+ if (probe_kernel_address(rom + 2, c) != 0)
continue;

/* 0 < length <= 0x7f * 512, historically */
@@ -249,7 +252,7 @@ void __init probe_roms(void)
if (!romsignature(rom))
continue;

- if (get_kernel_nofault(c, rom + 2) != 0)
+ if (probe_kernel_address(rom + 2, c) != 0)
continue;

/* 0 < length <= 0x7f * 512, historically */

2020-06-19 07:50:31

by Kenneth Crudup

[permalink] [raw]
Subject: Re: Commit 25f12ae45fc1 ("maccess: rename probe_kernel_address to get_kernel_nofault") causing several OOPSes


On Fri, 19 Jun 2020, Christoph Hellwig wrote:

> Just to re-reconfirm, you had to
> revert "maccess: rename probe_kernel_address to get_kernel_nofault",
> "maccess: make get_kernel_nofault() check for minimal type compatibility"
> wasn't enough?

Yeah. the bisect turned up the commit in the subject: and reverting it
(only) stopped the crashes. This was this afternoon's (US PST) Linus' master.

> Below is a patch to do a "partial revert" for probe_roms.c.

I'll give it a shot later, thanks.

-Kenny

--
Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Orange County CA

2020-06-20 13:49:36

by Kenneth Crudup

[permalink] [raw]
Subject: Re: Commit 25f12ae45fc1 ("maccess: rename probe_kernel_address to get_kernel_nofault") causing several OOPSes


> > Thing is, there's other examples of the previous version in the kernel tree- any
> > chance there's a usage conflict (Thunderbolt has a ROM in it, maybe something in
> > "probe_roms.c"? (Just guessing, no idea):

On Fri, 19 Jun 2020, Christoph Hellwig wrote:

> Maybe. But nothing looks strange there. Just to re-reconfirm, you had to
> revert "maccess: rename probe_kernel_address to get_kernel_nofault",
> "maccess: make get_kernel_nofault() check for minimal type compatibility"
> wasn't enough?

Yeah, the only commit I had to revert was this one. BUT:

> Below is a patch to do a "partial revert" for probe_roms.c. I'd be
> totally surprised if it changes anything from staring at it for while,
> but anyway..

So, be totally surprised :) I've just booted with "maccess: rename
probe_kernel_address to get_kernel_nofault" intact and your probe_roms.c
patch with no issues.

(Perhaps there's some sort of compiler optimization going on?)

-Kenny

--
Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Orange County CA

2020-06-20 16:53:49

by Linus Torvalds

[permalink] [raw]
Subject: Re: Commit 25f12ae45fc1 ("maccess: rename probe_kernel_address to get_kernel_nofault") causing several OOPSes

On Sat, Jun 20, 2020 at 6:46 AM Kenneth R. Crudup <[email protected]> wrote:
>
> So, be totally surprised :) I've just booted with "maccess: rename
> probe_kernel_address to get_kernel_nofault" intact and your probe_roms.c
> patch with no issues.
>
> (Perhaps there's some sort of compiler optimization going on?)

Hmm.

Very strange. I was a tiny bit worried about that part of the patch,
because I also changed the types (from "unsigned char *" to "void *"),
but pointer arithmetic in "unsigned char *" and "void *" is the same,
and Christoph's partial revert patch doesn't even revert that part.

But I really don't see what Christoph's revert would really even
change It switches the order of the arguments back..

It does re-introduce a bug in that macro that I fixed. This macro is
buggy garbage:

+#define probe_kernel_address(addr, retval) \
+ copy_from_kernel_nofault(&retval, addr, sizeof(retval))

in case 'retval' is a complex expression, becasue of possibly changing
the C order of operations. So it needs to be "&(retval)" in the macro
body.

But that is never the case for 'retval'. For 'addr', yes, but 'addr'
is only used simply (and copy_from_kernel_nofault() isn't a macro).

I'm staring at that opatch and not seeing how it could _possibly_ make
any difference in code generation.

Which is the obvious next step: would you mind compiling that file
with and without the patch and sending me the two object files?

Linus

2020-06-20 17:55:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: Commit 25f12ae45fc1 ("maccess: rename probe_kernel_address to get_kernel_nofault") causing several OOPSes

On Sat, Jun 20, 2020 at 9:49 AM Linus Torvalds
<[email protected]> wrote:
>
> I'm staring at that opatch and not seeing how it could _possibly_ make
> any difference in code generation.
>
> Which is the obvious next step: would you mind compiling that file
> with and without the patch and sending me the two object files?

Hmm. I did that here (well, I did the probe_roms.s files) and with gcc
I didn't see any difference in the generated code what-so-ever. The
assembler comments changed (because the line numbers changed), but the
code didn't.

With clang, the only difference is a slight code generation change in
the end condition of the top for-loop in 'probe_roms()'.

But that clang difference literally seems to be about instruction
scheduling, not about any semantic change, although I find the code
hard to read because clang has rotated the loop so it looks entirely
different.

Anyway, even after going through a compiler, the difference that
Christoph's patch could possibly make is not obvious.

So yeah, I'd like to see what the code generation difference is for
you, since it seems to matter to your install for some odd reason.

Linus

2020-06-21 19:32:14

by Kenneth Crudup

[permalink] [raw]
Subject: Re: Commit 25f12ae45fc1 ("maccess: rename probe_kernel_address to get_kernel_nofault") causing several OOPSes


On Sat, Jun 20, 2020 at 6:46 AM Kenneth R. Crudup <[email protected]> wrote:

> > So, be totally surprised :) I've just booted with "maccess: rename
> > probe_kernel_address to get_kernel_nofault" intact and your probe_roms.c
> > patch with no issues.
> > (Perhaps there's some sort of compiler optimization going on?)

On Sat, 20 Jun 2020, Linus Torvalds wrote:


> I'm staring at that opatch and not seeing how it could _possibly_ make
> any difference in code generation.

> Which is the obvious next step: would you mind compiling that file
> with and without the patch and sending me the two object files?

It looks like you had already, do you still need me to do this?

FWIW, here's my gcc info:

$ gcc --version
gcc (Ubuntu 9.3.0-13ubuntu1) 9.3.0

OH- I did change arch/x86/Makefile in my own builds- maybe this could matter?
Doubtful, but I could test later tonight or tomorrow (gotta do some ;l work in the
meantime).

----
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 00e378de8bc0..37aff76f3067 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -123,7 +123,8 @@ else
cflags-$(CONFIG_MPSC) += $(call cc-option,-march=nocona)

cflags-$(CONFIG_MCORE2) += \
- $(call cc-option,-march=core2,$(call cc-option,-mtune=generic))
+ $(call cc-option,-march=icelake-client,$(call cc-option,-mtune=native)) \
+ $(call cc-option,-mtune=icelake-client,$(call cc-option,-mtune=native))
cflags-$(CONFIG_MATOM) += $(call cc-option,-march=atom) \
$(call cc-option,-mtune=atom,$(call cc-option,-mtune=generic))
cflags-$(CONFIG_GENERIC_CPU) += $(call cc-option,-mtune=generic)
----

-Kenny

--
Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Orange County CA

2020-06-21 19:47:49

by Linus Torvalds

[permalink] [raw]
Subject: Re: Commit 25f12ae45fc1 ("maccess: rename probe_kernel_address to get_kernel_nofault") causing several OOPSes

On Sun, Jun 21, 2020 at 12:30 PM Kenneth R. Crudup <[email protected]> wrote:
>
> > Which is the obvious next step: would you mind compiling that file
> > with and without the patch and sending me the two object files?
>
> It looks like you had already, do you still need me to do this?

Yes please. For me that patch makes no difference. With gcc, it
generates the exact same code.

> FWIW, here's my gcc info:
>
> $ gcc --version
> gcc (Ubuntu 9.3.0-13ubuntu1) 9.3.0

Yeah, well, I've got

gcc version 10.1.1 20200507 (Red Hat 10.1.1-1) (GCC)

but it's not like yours is any known problematic one.

But please compile that probe_roms file with Chrisptoph's patch that
makes things work for you, and without it - but otherwise identically
and with the same config options that work with that patch applied.

> OH- I did change arch/x86/Makefile in my own builds- maybe this could matter?

Doubtful, but who knows. The fact that so far yours seems to be the
only report of this, and it looks inexplicable, maybe it has that
tuning difference that triggers it.

But don't change it now. If it is what triggers the issue, it's still
an interesting datapoint.

You might test (separately) if removing that change from your build
makes a difference, but I'd like to see what the code generation
difference is with the patch and without, since there _shouldn't_ be
any.

Linus

2020-06-26 08:38:26

by Kenneth Crudup

[permalink] [raw]
Subject: Re: Commit 25f12ae45fc1 ("maccess: rename probe_kernel_address to get_kernel_nofault") causing several OOPSes


On Sun, 21 Jun 2020, Linus Torvalds wrote:

> For me that patch makes no difference.

So ... it looks like this original issue (random crashes if this commit was
made and the ROMs thing isn't patched) is a red herring.

For some time now, my Thunderbolt adapter (Lenovo 2nd-Gen TB dock) sometimes
... "acts up" on reboots/powerons with my laptop (Dell XPS 13 2-in-1) if I
boot with it connected- I'll get random disconnects and/or lack of external
video and/or its internal devices will drop offline. I suspect these crashes
were related to the PCIe scribbling(?) somewhere(?) (even though I have IOMMU
turned on). I can't track it down, and I think the issue may lie somewhere
either with the Thunderbolt subsystem (my TB domain UUID gets regenerated on
every boot and is always changing) or a bug in this Dell's BIOS (probably
the likely culprit).

... in any case, I booted a fresh branch that does contain the commit in the
Subject: line, and w/o Christoph's ROMs patch and as long as I don't boot
with the TB adapter attached (or from a cold boot- SOMEtimes) it seems to
work. So, my apologies for wasting everyone's time.

-Kenny

--
Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Orange County CA