LinuxLists.cc - Machine check recovery broken in v6.9-rc1

2024-04-04 22:09:26

Subject: Machine check recovery broken in v6.9-rc1

v6.8 works. v6.9-rc1 fails with:

kernel BUG at include/linux/swapops.h:119

117 static inline unsigned long swp_offset_pfn(swp_entry_t entry)
118 {
119 VM_BUG_ON(!is_pfn_swap_entry(entry));
120 return swp_offset(entry) & SWP_PFN_MASK;
121 }

I've messed up the bisection three times now and ended up
pointing at some innocent looking commit. So I wonder if
anyone else has noticed and tried debugging.

-Tony

2024-04-04 23:40:05

by Tony Luck

[permalink] [raw]

Subject: Re: Machine check recovery broken in v6.9-rc1

On Thu, Apr 04, 2024 at 03:05:33PM -0700, Tony Luck wrote:
> v6.8 works. v6.9-rc1 fails with:
>
> kernel BUG at include/linux/swapops.h:119
>
> 117 static inline unsigned long swp_offset_pfn(swp_entry_t entry)
> 118 {
> 119 VM_BUG_ON(!is_pfn_swap_entry(entry));
> 120 return swp_offset(entry) & SWP_PFN_MASK;
> 121 }
>
>
> I've messed up the bisection three times now and ended up
> pointing at some innocent looking commit. So I wonder if
> anyone else has noticed and tried debugging.

I think I know why I messed up the bisection. I thought
the failure happens every time. But it seems that it only
happens most of the time. I must have accidentally marked
a build as "good" because I got "lucky" and my test passed.

Back to the bisect grindstone :-(

-Tony

2024-04-05 07:19:21

by David Hildenbrand

[permalink] [raw]

Subject: Re: Machine check recovery broken in v6.9-rc1

On 05.04.24 00:05, Tony Luck wrote:
> v6.8 works. v6.9-rc1 fails with:
>
> kernel BUG at include/linux/swapops.h:119
>
> 117 static inline unsigned long swp_offset_pfn(swp_entry_t entry)
> 118 {
> 119 VM_BUG_ON(!is_pfn_swap_entry(entry));
> 120 return swp_offset(entry) & SWP_PFN_MASK;
> 121 }
>

Do you have the full backtrace?

--
Cheers,

David / dhildenb

2024-04-05 15:06:24

by Tony Luck

[permalink] [raw]

Subject: RE: Machine check recovery broken in v6.9-rc1

>> kernel BUG at include/linux/swapops.h:119
>>
>> 117 static inline unsigned long swp_offset_pfn(swp_entry_t entry)
>> 118 {
>> 119 VM_BUG_ON(!is_pfn_swap_entry(entry));
>> 120 return swp_offset(entry) & SWP_PFN_MASK;
>> 121 }
>>
>
> Do you have the full backtrace?

It's very garbled because multiple CPUs seem to be logging to the console together.

I've also got some more details.

1) Problems with recovery go back further than v6.8. Still looking to pin down where it started.
v6.0 seems good (can run hundreds of injections + consumptions).
2) It's connected to a BIOS setting that enables APEI logging. If I disable that, current
kernels don't have problems. v6.0 runs OK with this setting on or off.

I'm bisecting to find where the APEI issue begins.

-Tony

2024-04-05 23:59:30

by Tony Luck

[permalink] [raw]

Subject: Re: Machine check recovery broken in v6.9-rc1

On Fri, Apr 05, 2024 at 03:05:17PM +0000, Luck, Tony wrote:
> >> kernel BUG at include/linux/swapops.h:119
> >>
> >> 117 static inline unsigned long swp_offset_pfn(swp_entry_t entry)
> >> 118 {
> >> 119 VM_BUG_ON(!is_pfn_swap_entry(entry));
> >> 120 return swp_offset(entry) & SWP_PFN_MASK;
> >> 121 }
> >>
> >
> > Do you have the full backtrace?
>
> It's very garbled because multiple CPUs seem to be logging to the console together.
>
> I've also got some more details.
>
> 1) Problems with recovery go back further than v6.8. Still looking to pin down where it started.
> v6.0 seems good (can run hundreds of injections + consumptions).
> 2) It's connected to a BIOS setting that enables APEI logging. If I disable that, current
> kernels don't have problems. v6.0 runs OK with this setting on or off.
>
> I'm bisecting to find where the APEI issue begins.

So a problem (not necessarily the current upstream problem given the
amount of additional changes that happened) began way back in this
commit from v6.1:

0d206b5d2e0d ("mm/swap: add swp_offset_pfn() to fetch PFN from swap entry")

This commit gives the "kernel BUG at include/linux/swapops.h:88!" while
its predecessor runs without errors).

The test case simply injects an uncorrectable memory error to DRAM, then
reads that location to consume the poison and generate a machine check.

memory_failure() must have been called multiple times (see the "already hardware
poisoned" message). Twice is pretty common. Intel eystems can report an
error twice. Once by the memory controller when it sees the ECC bits
don't add up. Second time by the CPU core when trying to consume the
poisoned data. But I can't explain four calls.

I unloaded the acpi_extlog and i10nm_edac modules before running the
test to get a less jumbled console log.

[ 159.345790] EINJ: Error INJection is initialized.
[ 183.645231] Disabling lock debugging due to kernel taint
[ 183.645243] mce: [Hardware Error]: Machine check events logged
[ 183.645940] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[ 183.645945] {1}[Hardware Error]: event severity: recoverable
[ 183.645947] {1}[Hardware Error]: Error 0, type: recoverable
[ 183.645949] {1}[Hardware Error]: fru_text: Card02, ChnA, DIMM0
[ 183.645951] {1}[Hardware Error]: section_type: memory error
[ 183.645953] {1}[Hardware Error]: error_status: Storage error in DRAM memory (0x0000000000000400)
[ 183.645955] {1}[Hardware Error]: physical_address: 0x00000008a398a400
[ 183.645958] {1}[Hardware Error]: node:1 card:0 module:0 rank:0 bank:2 device:0 row:1136 column:584
[ 183.645959] {1}[Hardware Error]: error_type: 14, scrub uncorrected error
[ 183.645961] {1}[Hardware Error]: DIMM location: NODE 4 CPU1_DIMM_A1
[ 183.645962] {1}[Hardware Error]: Error 1, type: recoverable
[ 183.645963] {1}[Hardware Error]: section_type: memory error
[ 183.645964] {1}[Hardware Error]: error_status: Storage error in DRAM memory (0x0000000000000400)
[ 183.645965] {1}[Hardware Error]: physical_address: 0x00000008a398a400
[ 183.645965] {1}[Hardware Error]: node:1
[ 183.645966] {1}[Hardware Error]: error_type: 3, multi-bit ECC
[ 183.650625] mce: Uncorrected hardware memory error in user-access at 8a398a400
[ 183.763844] Memory failure: 0x8a398a: recovery action for dirty LRU page: Recovered
[ 183.778756] Memory failure: 0x8a398a: already hardware poisoned
[ 183.784695] Memory failure: 0x8a398a: already hardware poisoned
[ 183.790657] mce: [Hardware Error]: Machine check events logged
[ 183.790746] Memory failure: 0x8a398a: already hardware poisoned
[ 183.802491] ------------[ cut here ]------------
[ 183.807126] kernel BUG at include/linux/swapops.h:88!
[ 183.812199] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 183.817436] CPU: 139 PID: 4501 Comm: einj_mem_uc Tainted: G M 6.0.0-rc3+ #443
[ 183.825961] Hardware name: Intel Corporation WilsonCity/WilsonCity, BIOS WLYDCRB1.SYS.0021.P06.2104260458 04/26/2021
[ 183.836476] RIP: 0010:check_hwpoisoned_entry+0xba/0xc0
[ 183.841632] Code: 31 c0 c3 cc cc cc cc 48 83 e7 fd 41 89 c3 41 21 fb eb ce 48 83 e7 f7 41 89 c3 41 21 fb 41 f6 c3 02 74 be eb e2 0f 1f 44 00 00 <0f> 0b 0f 1f 40 00 0f 1f 44 00 00 4c 89 c0 48 89 d6 4d 8b 40 28 48
[ 183.860382] RSP: 0000:ff500c69ca963c90 EFLAGS: 00010246
[ 183.865608] RAX: 0000000000000016 RBX: 00007f8fb2a69000 RCX: 00000000008a398a
[ 183.872741] RDX: 0000000000000000 RSI: 00007f8fb2a69000 RDI: b7fffffeeb8cea00
[ 183.879874] RBP: ff486f47df6f9ca8 R08: ff500c69ca963e10 R09: 0000000000000000
[ 183.887004] R10: 000000000000000c R11: 0000000000000000 R12: ff486f47d17ca348
[ 183.894139] R13: 00007f8fb2a6a000 R14: ff486f48010c8b00 R15: ff500c69ca963e10
[ 183.901278] FS: 00007f8fb2874740(0000) GS:ff486f4f9fac0000(0000) knlGS:0000000000000000
[ 183.909362] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 183.915110] CR2: 00000000016b02f8 CR3: 00000008b81ce003 CR4: 0000000000771ee0
[ 183.922240] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 183.929373] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 183.936505] PKRU: 55555554
[ 183.939218] Call Trace:
[ 183.941670] <TASK>
[ 183.943776] hwpoison_pte_range+0x2ad/0x300
[ 183.947963] __walk_page_range+0x5f7/0xc80
[ 183.952072] ? tick_nohz_tick_stopped+0x12/0x30
[ 183.956609] ? rcu_read_lock_sched_held+0x10/0x80
[ 183.961327] walk_page_range+0xc6/0x180
[ 183.965173] kill_accessing_process+0xca/0x13d
[ 183.969628] memory_failure.cold+0x9a/0xbc
[ 183.973737] kill_me_maybe+0x3b/0x9a
[ 183.977324] task_work_run+0x59/0x90
[ 183.980910] exit_to_user_mode_prepare+0x22c/0x250
[ 183.985710] irqentry_exit_to_user_mode+0x5/0x30
[ 183.990340] noist_exc_machine_check+0x55/0xa0
[ 183.994792] asm_exc_machine_check+0x2f/0x40
[ 183.999073] RIP: 0033:0x405bff
[ 184.002140] Code: 00 e8 a5 c5 ff ff 55 53 48 83 ec 08 48 63 05 40 5d 00 00 48 8d 2c 07 bf 10 80 40 00 e8 ba c4 ff ff b8 00 00 00 00 89 c1 0f a2 <0f> be 45 00 48 83 c4 08 5b 5d c3 53 48 83 ec 20 48 c7 44 24 18 00
[ 184.020886] RSP: 002b:00007ffcfa807e80 EFLAGS: 00010206
[ 184.026111] RAX: 000000000000001b RBX: 00000000756e6547 RCX: 000000006c65746e
[ 184.033241] RDX: 0000000049656e69 RSI: 00000000016ae4a0 RDI: 00007f8fb2a4ea30
[ 184.040376] RBP: 00007f8fb2a69400 R08: 00000000016ae2a0 R09: 0000000000000001
[ 184.047508] R10: 00007ffcfa807e60 R11: 0000000000000202 R12: 0000000000000003
[ 184.054639] R13: 0000000000000000 R14: 00007f8fb2a9d000 R15: 000000000040ae00
[ 184.061774] </TASK>
[ 184.063973] Modules linked in: einj snd_seq_dummy snd_hrtimer snd_seq snd_seq_device snd_timer snd soundcore nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables nfnetlink qrtr intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp ipmi_ssif sunrpc kvm_intel binfmt_misc kvm irqbypass vfat rapl fat intel_cstate intel_uncore pcspkr isst_if_mbox_pci isst_if_mmio isst_if_common i2c_i801 i2c_smbus mei_me acpi_ipmi mei intel_pch_thermal intel_vsec ioatdma ipmi_si ipmi_devintf joydev ipmi_msghandler loop zram crct10dif_pclmul ast crc32_pclmul crc32c_intel polyval_clmulni drm_vram_helper polyval_generic ixgbe drm_ttm_helper igb mdio ttm ghash_clmulni_intel dca wmi ip6_tables ip_tables fuse
[ 184.064048] Unloaded tainted modules: i10nm_edac():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1
[ 184.149160] acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1
[ 184.236447] acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1
[ 184.323640] acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1
[ 184.410309] acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1
[ 184.497416] acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1
[ 184.584254] pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1
[ 184.671099] pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 fjes():1 acpi_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 fjes():1 acpi_cpufreq():1 pcc_cpufreq():1 fjes():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 fjes():1 fjes():1 fjes():1 [last unloaded: i10nm_edac]
[ 184.796172] ---[ end trace 0000000000000000 ]---
[ 184.806944] RIP: 0010:check_hwpoisoned_entry+0xba/0xc0
[ 184.812107] Code: 31 c0 c3 cc cc cc cc 48 83 e7 fd 41 89 c3 41 21 fb eb ce 48 83 e7 f7 41 89 c3 41 21 fb 41 f6 c3 02 74 be eb e2 0f 1f 44 00 00 <0f> 0b 0f 1f 40 00 0f 1f 44 00 00 4c 89 c0 48 89 d6 4d 8b 40 28 48
[ 184.830868] RSP: 0000:ff500c69ca963c90 EFLAGS: 00010246
[ 184.836111] RAX: 0000000000000016 RBX: 00007f8fb2a69000 RCX: 00000000008a398a
[ 184.843261] RDX: 0000000000000000 RSI: 00007f8fb2a69000 RDI: b7fffffeeb8cea00
[ 184.850410] RBP: ff486f47df6f9ca8 R08: ff500c69ca963e10 R09: 0000000000000000
[ 184.857560] R10: 000000000000000c R11: 0000000000000000 R12: ff486f47d17ca348
[ 184.864709] R13: 00007f8fb2a6a000 R14: ff486f48010c8b00 R15: ff500c69ca963e10
[ 184.871855] FS: 00007f8fb2874740(0000) GS:ff486f4f9fac0000(0000) knlGS:0000000000000000
[ 184.879950] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 184.885704] CR2: 00000000016b02f8 CR3: 00000008b81ce003 CR4: 0000000000771ee0
[ 184.892847] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 184.899995] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 184.907135] PKRU: 55555554
[ 184.909857] note: einj_mem_uc[4501] exited with preempt_count 1

2024-04-06 03:17:32

by Oscar Salvador

[permalink] [raw]

Subject: Re: Machine check recovery broken in v6.9-rc1

On Fri, Apr 05, 2024 at 04:58:55PM -0700, Tony Luck wrote:
> So a problem (not necessarily the current upstream problem given the
> amount of additional changes that happened) began way back in this
> commit from v6.1:
>
> 0d206b5d2e0d ("mm/swap: add swp_offset_pfn() to fetch PFN from swap entry")
>
> This commit gives the "kernel BUG at include/linux/swapops.h:88!" while
> its predecessor runs without errors).

Tony, could you try the following patch, to see if that goes away?
It is against 6.1, but current kernel has the same problem AFAICS.

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 578212fbf2be..575d428393c0 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -416,7 +416,7 @@ static inline bool is_pfn_swap_entry(swp_entry_t entry)
BUILD_BUG_ON(SWP_TYPE_SHIFT < SWP_PFN_BITS);

return is_migration_entry(entry) || is_device_private_entry(entry) ||
- is_device_exclusive_entry(entry);
+ is_device_exclusive_entry(entry) || is_hwpsoion_entry(entry);
}

struct page_vma_mapped_walk;

Thanks

--
Oscar Salvador
SUSE Labs

2024-04-06 03:55:04

by Oscar Salvador

[permalink] [raw]

Subject: Re: Machine check recovery broken in v6.9-rc1

On Sat, Apr 06, 2024 at 04:18:41AM +0200, Oscar Salvador wrote:
> Tony, could you try the following patch, to see if that goes away?
> It is against 6.1, but current kernel has the same problem AFAICS.
>
>
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 578212fbf2be..575d428393c0 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -416,7 +416,7 @@ static inline bool is_pfn_swap_entry(swp_entry_t entry)
> BUILD_BUG_ON(SWP_TYPE_SHIFT < SWP_PFN_BITS);
>
> return is_migration_entry(entry) || is_device_private_entry(entry) ||
> - is_device_exclusive_entry(entry);
> + is_device_exclusive_entry(entry) || is_hwpsoion_entry(entry);
> }
>
> struct page_vma_mapped_walk;

Sorry, I was still sleepy and I made a typo.
Plus I realized we need to move some definitions to make them available.

This should have been (compile tested only):

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 48b700ba1d18..201efbf441d6 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -390,6 +390,35 @@ static inline bool is_migration_entry_dirty(swp_entry_t entry)
}
#endif /* CONFIG_MIGRATION */

+#ifdef CONFIG_MEMORY_FAILURE
+
+/*
+ * Support for hardware poisoned pages
+ */
+static inline swp_entry_t make_hwpoison_entry(struct page *page)
+{
+ BUG_ON(!PageLocked(page));
+ return swp_entry(SWP_HWPOISON, page_to_pfn(page));
+}
+
+static inline int is_hwpoison_entry(swp_entry_t entry)
+{
+ return swp_type(entry) == SWP_HWPOISON;
+}
+
+#else
+
+static inline swp_entry_t make_hwpoison_entry(struct page *page)
+{
+ return swp_entry(0, 0);
+}
+
+static inline int is_hwpoison_entry(swp_entry_t swp)
+{
+ return 0;
+}
+#endif
+
typedef unsigned long pte_marker;

#define PTE_MARKER_UFFD_WP BIT(0)
@@ -492,7 +521,7 @@ static inline bool is_pfn_swap_entry(swp_entry_t entry)
BUILD_BUG_ON(SWP_TYPE_SHIFT < SWP_PFN_BITS);

return is_migration_entry(entry) || is_device_private_entry(entry) ||
- is_device_exclusive_entry(entry);
+ is_device_exclusive_entry(entry) || is_hwpoison_entry(entry);
}

struct page_vma_mapped_walk;
@@ -561,35 +590,6 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
}
#endif /* CONFIG_ARCH_ENABLE_THP_MIGRATION */

-#ifdef CONFIG_MEMORY_FAILURE
-
-/*
- * Support for hardware poisoned pages
- */
-static inline swp_entry_t make_hwpoison_entry(struct page *page)
-{
- BUG_ON(!PageLocked(page));
- return swp_entry(SWP_HWPOISON, page_to_pfn(page));
-}
-
-static inline int is_hwpoison_entry(swp_entry_t entry)
-{
- return swp_type(entry) == SWP_HWPOISON;
-}
-
-#else
-
-static inline swp_entry_t make_hwpoison_entry(struct page *page)
-{
- return swp_entry(0, 0);
-}
-
-static inline int is_hwpoison_entry(swp_entry_t swp)
-{
- return 0;
-}
-#endif
-
static inline int non_swap_entry(swp_entry_t entry)
{
return swp_type(entry) >= MAX_SWAPFILES;

--
Oscar Salvador
SUSE Labs

2024-04-06 04:14:18

by Oscar Salvador

[permalink] [raw]

Subject: Re: Machine check recovery broken in v6.9-rc1

On Sat, Apr 06, 2024 at 05:54:47AM +0200, Oscar Salvador wrote:
> Sorry, I was still sleepy and I made a typo.
> Plus I realized we need to move some definitions to make them available.
>
> This should have been (compile tested only):

This one is against 6.1 (previous one was against v6.9-rc2):
Again, compile tested only

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 578212fbf2be..5d720f15e811 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -313,6 +313,55 @@ static inline int is_readable_migration_entry(swp_entry_t entry)

#endif /* CONFIG_MIGRATION */

+#ifdef CONFIG_MEMORY_FAILURE
+
+extern atomic_long_t num_poisoned_pages __read_mostly;
+
+/*
+ * Support for hardware poisoned pages
+ */
+static inline swp_entry_t make_hwpoison_entry(struct page *page)
+{
+ BUG_ON(!PageLocked(page));
+ return swp_entry(SWP_HWPOISON, page_to_pfn(page));
+}
+
+static inline int is_hwpoison_entry(swp_entry_t entry)
+{
+ return swp_type(entry) == SWP_HWPOISON;
+}
+
+static inline void num_poisoned_pages_inc(void)
+{
+ atomic_long_inc(&num_poisoned_pages);
+}
+
+static inline void num_poisoned_pages_sub(long i)
+{
+ atomic_long_sub(i, &num_poisoned_pages);
+}
+
+#else /* CONFIG_MEMORY_FAILURE */
+
+static inline swp_entry_t make_hwpoison_entry(struct page *page)
+{
+ return swp_entry(0, 0);
+}
+
+static inline int is_hwpoison_entry(swp_entry_t swp)
+{
+ return 0;
+}
+
+static inline void num_poisoned_pages_inc(void)
+{
+}
+
+static inline void num_poisoned_pages_sub(long i)
+{
+}
+#endif /* CONFIG_MEMORY_FAILURE */
+
typedef unsigned long pte_marker;

#define PTE_MARKER_UFFD_WP BIT(0)
@@ -416,7 +465,7 @@ static inline bool is_pfn_swap_entry(swp_entry_t entry)
BUILD_BUG_ON(SWP_TYPE_SHIFT < SWP_PFN_BITS);

return is_migration_entry(entry) || is_device_private_entry(entry) ||
- is_device_exclusive_entry(entry);
+ is_device_exclusive_entry(entry) || is_hwpoison_entry(entry);
}

struct page_vma_mapped_walk;
@@ -485,55 +534,6 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
}
#endif /* CONFIG_ARCH_ENABLE_THP_MIGRATION */

-#ifdef CONFIG_MEMORY_FAILURE
-
-extern atomic_long_t num_poisoned_pages __read_mostly;
-
-/*
- * Support for hardware poisoned pages
- */
-static inline swp_entry_t make_hwpoison_entry(struct page *page)
-{
- BUG_ON(!PageLocked(page));
- return swp_entry(SWP_HWPOISON, page_to_pfn(page));
-}
-
-static inline int is_hwpoison_entry(swp_entry_t entry)
-{
- return swp_type(entry) == SWP_HWPOISON;
-}
-
-static inline void num_poisoned_pages_inc(void)
-{
- atomic_long_inc(&num_poisoned_pages);
-}
-
-static inline void num_poisoned_pages_sub(long i)
-{
- atomic_long_sub(i, &num_poisoned_pages);
-}
-
-#else /* CONFIG_MEMORY_FAILURE */
-
-static inline swp_entry_t make_hwpoison_entry(struct page *page)
-{
- return swp_entry(0, 0);
-}
-
-static inline int is_hwpoison_entry(swp_entry_t swp)
-{
- return 0;
-}
-
-static inline void num_poisoned_pages_inc(void)
-{
-}
-
-static inline void num_poisoned_pages_sub(long i)
-{
-}
-#endif /* CONFIG_MEMORY_FAILURE */
-
static inline int non_swap_entry(swp_entry_t entry)
{
return swp_type(entry) >= MAX_SWAPFILES;

--
Oscar Salvador
SUSE Labs

2024-04-07 00:08:46

by Tony Luck

[permalink] [raw]

Subject: RE: Machine check recovery broken in v6.9-rc1

> This one is against 6.1 (previous one was against v6.9-rc2):
> Again, compile tested only

Oscar.

Both the 6.1 and 6.9-rc2 patches make the BUG (and subsequent issues) go away.

Here's what's happening.

When the machine check occurs there's a scramble from various subsystems
to report the memory error.

ghes_do_memory_failure() calls memory_failure_queue() which later
calls memory_failure() from a kernel thread. Side note: this happens TWICE
for each error. Not sure yet if this is a BIOS issue logging more than once.
or some Linux issues in acpi/apei/ghes.c code.

uc_decode_notifier() [called from a different kernel thread] also calls
do_memory_failure()

Finally kill_me_maybe() [called from task_work on return to the application
when returning from the machine check handler] also calls memory_failure()

do_memory_failure() is somewhat prepared for multiple reports of the same
error. It uses an atomic test and set operation to mark the page as poisoned.

First called to report the error does all the real work. Late arrivals take a
shorter path, but may still take some action(s) depending on the "flags"
passed in:

if (TestSetPageHWPoison(p)) {
pr_err("%#lx: already hardware poisoned\n", pfn);
res = -EHWPOISON;
if (flags & MF_ACTION_REQUIRED)
res = kill_accessing_process(current, pfn, flags);
if (flags & MF_COUNT_INCREASED)
put_page(p);
goto unlock_mutex;
}

In this case the last to arrive has MF_ACTION_REQUIRED set, so calls
kill_accessing_process() ... which is in the stack trace that led to the:

kernel BUG at include/linux/swapops.h:88!

I'm not sure that I fully understand your patch. I guess that it is making sure to
handle the case that the page has already been marked as poisoned?

Anyway ... thanks for the quick fix. I hope the above helps write a good
commit message to get this applied and backported to stable.

Tested-by: Tony Luck <[email protected]>

-Tony

2024-04-07 03:59:45

by Miaohe Lin

[permalink] [raw]

Subject: Re: Machine check recovery broken in v6.9-rc1

On 2024/4/7 8:08, Luck, Tony wrote:
>> This one is against 6.1 (previous one was against v6.9-rc2):
>> Again, compile tested only
>
> Oscar.
>
> Both the 6.1 and 6.9-rc2 patches make the BUG (and subsequent issues) go away.
>
> Here's what's happening.
>
> When the machine check occurs there's a scramble from various subsystems
> to report the memory error.
>
> ghes_do_memory_failure() calls memory_failure_queue() which later
> calls memory_failure() from a kernel thread. Side note: this happens TWICE
> for each error. Not sure yet if this is a BIOS issue logging more than once.
> or some Linux issues in acpi/apei/ghes.c code.
>
> uc_decode_notifier() [called from a different kernel thread] also calls
> do_memory_failure()
>
> Finally kill_me_maybe() [called from task_work on return to the application
> when returning from the machine check handler] also calls memory_failure()
>
> do_memory_failure() is somewhat prepared for multiple reports of the same
> error. It uses an atomic test and set operation to mark the page as poisoned.
>
> First called to report the error does all the real work. Late arrivals take a
> shorter path, but may still take some action(s) depending on the "flags"
> passed in:
>
> if (TestSetPageHWPoison(p)) {
> pr_err("%#lx: already hardware poisoned\n", pfn);
> res = -EHWPOISON;
> if (flags & MF_ACTION_REQUIRED)
> res = kill_accessing_process(current, pfn, flags);
> if (flags & MF_COUNT_INCREASED)
> put_page(p);
> goto unlock_mutex;
> }
>
> In this case the last to arrive has MF_ACTION_REQUIRED set, so calls
> kill_accessing_process() ... which is in the stack trace that led to the:
>
> kernel BUG at include/linux/swapops.h:88!
>
> I'm not sure that I fully understand your patch. I guess that it is making sure to
> handle the case that the page has already been marked as poisoned?
>
>
> Anyway ... thanks for the quick fix. I hope the above helps write a good
> commit message to get this applied and backported to stable.

Sorry for late. I was just back from my vacation.

>
> Tested-by: Tony Luck <[email protected]>

Thanks for both. This should be a issue introduced from commit:

0d206b5d2e0d ("mm/swap: add swp_offset_pfn() to fetch PFN from swap entry")

hwpoison_entry_to_pfn() is replaced with swp_offset_pfn() which might not be intended to be used
with hwpoison entry:

/*
* A pfn swap entry is a special type of swap entry that always has a pfn stored
* in the swap offset. *They are used to represent unaddressable device memory*
* *and to restrict access to a page undergoing migration*
*/
static inline bool is_pfn_swap_entry(swp_entry_t entry)
{
/* Make sure the swp offset can always store the needed fields */
BUILD_BUG_ON(SWP_TYPE_SHIFT < SWP_PFN_BITS);

return is_migration_entry(entry) || is_device_private_entry(entry) ||
is_device_exclusive_entry(entry);
}

I think Oscar's patch is the right fix and it will be better to amend the corresponding comment too.

Thanks.

>
> -Tony
>
>
>
>
>
> .
>

2024-04-07 04:52:10

by Oscar Salvador

[permalink] [raw]

Subject: Re: Machine check recovery broken in v6.9-rc1

On Sun, Apr 07, 2024 at 12:08:30AM +0000, Luck, Tony wrote:
> Oscar.
>
> Both the 6.1 and 6.9-rc2 patches make the BUG (and subsequent issues) go away.

Thanks for the switf test Tony!

> Here's what's happening.
>
> When the machine check occurs there's a scramble from various subsystems
> to report the memory error.
>
> ghes_do_memory_failure() calls memory_failure_queue() which later
> calls memory_failure() from a kernel thread. Side note: this happens TWICE
> for each error. Not sure yet if this is a BIOS issue logging more than once.
> or some Linux issues in acpi/apei/ghes.c code.
>
> uc_decode_notifier() [called from a different kernel thread] also calls
> do_memory_failure()
>
> Finally kill_me_maybe() [called from task_work on return to the application
> when returning from the machine check handler] also calls memory_failure()
>
> do_memory_failure() is somewhat prepared for multiple reports of the same
> error. It uses an atomic test and set operation to mark the page as poisoned.
>
> First called to report the error does all the real work. Late arrivals take a
> shorter path, but may still take some action(s) depending on the "flags"
> passed in:
>
> if (TestSetPageHWPoison(p)) {
> pr_err("%#lx: already hardware poisoned\n", pfn);
> res = -EHWPOISON;
> if (flags & MF_ACTION_REQUIRED)
> res = kill_accessing_process(current, pfn, flags);
> if (flags & MF_COUNT_INCREASED)
> put_page(p);
> goto unlock_mutex;
> }

Thanks for the detailed explanation.

> In this case the last to arrive has MF_ACTION_REQUIRED set, so calls
> kill_accessing_process() ... which is in the stack trace that led to the:
>
> kernel BUG at include/linux/swapops.h:88!
>
> I'm not sure that I fully understand your patch. I guess that it is making sure to
> handle the case that the page has already been marked as poisoned?

Basically what is happening is:

1) We mark the page as HWPoison
2) We see that the page is mapped by someone
3) We try to unmap it, and in the process we create a hwpoison swap entry.
See the following chunk from try_to_unmap_one():

"
if (PageHWPoison(subpage) && (flags & TTU_HWPOISON)) {
pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
if (folio_test_hugetlb(folio)) {
hugetlb_count_sub(folio_nr_pages(folio), mm);
set_huge_pte_at(mm, address, pvmw.pte, pteval,
hsz);
} else {
dec_mm_counter(mm, mm_counter(folio));
set_pte_at(mm, address, pvmw.pte, pteval);
}
...
}
"
4) Now there is a second memory event (maybe the previous one has
already finished, I do not think it matters for the sake of this
problem)
5) The second event sees that the page has already been marked as
HWPoison but since it has MF_ACTION_REQUIRED specified, it
goes to kill_accessing_process() to do what its name says.
6) We walk the page tables of the accessing process to see if it has
the poisoned pfn.
7) check_hwpoisoned_entry()
(which is called from
walk_page_range()->walk_{pgd,p4d,pud,pmd}_range()->ops->pmd_entry())
checks whether any of the ptes is poisoned.
8) Since the previous MCE event unmapped the page, pte_present() == 0,
so we want to get the swap entry, and this is where it falls off the
cliff.
See check_hwpoisoned_entry()

static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift,
unsigned long poisoned_pfn, struct to_kill *tk)
{
unsigned long pfn = 0;

if (pte_present(pte)) {
pfn = pte_pfn(pte);
} else {
swp_entry_t swp = pte_to_swp_entry(pte);

if (is_hwpoison_entry(swp))
pfn = swp_offset_pfn(swp);
}
...
}

is_hwpoison_entry() returns true (remember the make_hwpoison_entry()
call we did?)
But when we try to get the pfn from the swap entry, we stumble upon the
VM_BUG_ON(), because is_pfn_swap_entry() only checks for:
is_migration_entry()
is_device_private_entry()
is_device_exclusive_entry()

but it should also check for is_hwpoison_entry().
Since it does not, is_pfn_swap_entry() returns false in our case,
leading to the VM_BUG_ON.

Note that this should only matter in environments where CONFIG_DEBUG_VM
is set.

I hope I shed some light in here.

> Anyway ... thanks for the quick fix. I hope the above helps write a good
> commit message to get this applied and backported to stable.
>
> Tested-by: Tony Luck <[email protected]>

Thanks again Tony, much appreciated.

I will write the patch and most likely send it out either today in the
afternoon or tomorrow early in the
morning.

--
Oscar Salvador
SUSE Labs