2020-05-21 20:37:05

by Thomas Gleixner

Subject: [patch V9 10/39] x86/entry: Provide helpers for execute on irqstack

From: Thomas Gleixner <[email protected]>

Device interrupt handlers and system vector handlers are executed on the
interrupt stack. The stack switch happens in the low level assembly entry
code. This conflicts with the efforts to consolidate the exit code in C to
ensure correctness vs. RCU and tracing.

As there is no way to move #DB away from IST due to the MOV SS issue, the
requirements vs. #DB and NMI for switching to the interrupt stack do not
exist anymore. The only requirement is that interrupts are disabled.

That allows the moving of the stack switching to C code, which simplifies the
entry/exit handling further, because it allows the switching of stacks after
handling the entry and on exit before handling RCU, returning to usermode and
kernel preemption in the same way as for regular exceptions.

The initial attempt of having the stack switching in inline ASM caused too
much headache vs. objtool and the unwinder. After analysing the use cases
it was agreed on that having the stack switch in ASM for the price of an
indirect call is acceptable as the main users are indirect call heavy
anyway and the few system vectors which are empty shells (scheduler IPI and
KVM posted interrupt vectors) can run from the regular stack.

Provide helper functions to check whether the interrupt stack is already
active and whether stack switching is required.

64-bit only for now, as 32-bit has a variant of that already. Once this is
cleaned up, the two implementations might be consolidated as an additional
cleanup on top.

Signed-off-by: Thomas Gleixner <[email protected]>
---
V9: Moved the conditions into an inline to avoid code duplication
---
arch/x86/entry/entry_64.S | 39 ++++++++++++++++++++++++++++
arch/x86/include/asm/irq_stack.h | 53 +++++++++++++++++++++++++++++++++++++++
2 files changed, 92 insertions(+)

--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1106,6 +1106,45 @@ SYM_CODE_START_LOCAL_NOALIGN(.Lbad_gs)
SYM_CODE_END(.Lbad_gs)
.previous

+/*
+ * rdi: New stack pointer points to the top word of the stack
+ * rsi: Function pointer
+ * rdx: Function argument (can be NULL if none)
+ */
+SYM_FUNC_START(asm_call_on_stack)
+        /*
+         * Save the frame pointer unconditionally. This allows the ORC
+         * unwinder to handle the stack switch.
+         */
+        pushq   %rbp
+        mov     %rsp, %rbp
+
+        /*
+         * The unwinder relies on the word at the top of the new stack
+         * page linking back to the previous RSP.
+         */
+        mov     %rsp, (%rdi)
+        mov     %rdi, %rsp
+        /* Move the argument to the right place */
+        mov     %rdx, %rdi
+
+1:
+        .pushsection .discard.instr_begin
+        .long 1b - .
+        .popsection
+
+        CALL_NOSPEC     rsi
+
+2:
+        .pushsection .discard.instr_end
+        .long 2b - .
+        .popsection
+
+        /* Restore the previous stack pointer from RBP. */
+        leaveq
+        ret
+SYM_FUNC_END(asm_call_on_stack)
+
/* Call softirq on interrupt stack. Interrupts are off. */
.pushsection .text, "ax"
SYM_FUNC_START(do_softirq_own_stack)
--- /dev/null
+++ b/arch/x86/include/asm/irq_stack.h
@@ -0,0 +1,53 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_IRQ_STACK_H
+#define _ASM_X86_IRQ_STACK_H
+
+#include <linux/ptrace.h>
+
+#include <asm/processor.h>
+
+#ifdef CONFIG_X86_64
+static __always_inline bool irqstack_active(void)
+{
+        return __this_cpu_read(irq_count) != -1;
+}
+
+void asm_call_on_stack(void *sp, void *func, void *arg);
+
+static __always_inline void __run_on_irqstack(void *func, void *arg)
+{
+        void *tos = __this_cpu_read(hardirq_stack_ptr);
+
+        __this_cpu_add(irq_count, 1);
+        asm_call_on_stack(tos - 8, func, arg);
+        __this_cpu_sub(irq_count, 1);
+}
+
+#else /* CONFIG_X86_64 */
+static inline bool irqstack_active(void) { return false; }
+static inline void __run_on_irqstack(void *func, void *arg) { }
+#endif /* !CONFIG_X86_64 */
+
+static __always_inline bool irq_needs_irq_stack(struct pt_regs *regs)
+{
+        if (IS_ENABLED(CONFIG_X86_32))
+                return false;
+        if (!regs)
+                return !irqstack_active();
+        return !user_mode(regs) && !irqstack_active();
+}
+
+static __always_inline void run_on_irqstack_cond(void *func, void *arg,
+                                                 struct pt_regs *regs)
+{
+        void (*__func)(void *arg) = func;
+
+        lockdep_assert_irqs_disabled();
+
+        if (irq_needs_irq_stack(regs))
+                __run_on_irqstack(__func, arg);
+        else
+                __func(arg);
+}
+
+#endif
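
As a usage illustration (a minimal sketch only, not part of the patch: the
real call sites are presumably added by later patches in this series, and the
function names below are made up), a C-level interrupt entry point could use
the new helpers like this:

#include <asm/irq_stack.h>
#include <asm/ptrace.h>

/* Hypothetical handler; stands in for the real device interrupt dispatch. */
static void example_handler(void *arg)
{
        struct pt_regs *regs = arg;

        /* The actual interrupt handling would run here, on the irq stack. */
        (void)regs;
}

static void example_dispatch(struct pt_regs *regs)
{
        /*
         * Run example_handler() on the per-CPU hardirq stack unless the CPU
         * is already on it (nested interrupt) or the entry came from user
         * mode, in which case the almost empty task stack is used directly.
         * The caller must have interrupts disabled.
         */
        run_on_irqstack_cond(example_handler, regs, regs);
}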


Subject: [tip: x86/entry] x86/entry: Provide helpers for executing on the irqstack

The following commit has been merged into the x86/entry branch of tip:

Commit-ID: 0aa4dbb2808991f53396df8d2deb390d4f880abb
Gitweb: https://git.kernel.org/tip/0aa4dbb2808991f53396df8d2deb390d4f880abb
Author: Thomas Gleixner <[email protected]>
AuthorDate: Thu, 21 May 2020 22:05:23 +02:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Tue, 26 May 2020 19:06:27 +02:00

x86/entry: Provide helpers for executing on the irqstack

Device interrupt handlers and system vector handlers are executed on the
interrupt stack. The stack switch happens in the low level assembly entry
code. This conflicts with the efforts to consolidate the exit code in C to
ensure correctness vs. RCU and tracing.

As there is no way to move #DB away from IST due to the MOV SS issue, the
requirements vs. #DB and NMI for switching to the interrupt stack do not
exist anymore. The only requirement is that interrupts are disabled.

That allows the moving of the stack switching to C code, which simplifies the
entry/exit handling further, because it allows the switching of stacks after
handling the entry and on exit before handling RCU, returning to usermode and
kernel preemption in the same way as for regular exceptions.

The initial attempt of having the stack switching in inline ASM caused too
much headache vs. objtool and the unwinder. After analysing the use cases
it was agreed on that having the stack switch in ASM for the price of an
indirect call is acceptable, as the main users are indirect call heavy
anyway and the few system vectors which are empty shells (scheduler IPI and
KVM posted interrupt vectors) can run from the regular stack.

Provide helper functions to check whether the interrupt stack is already
active and whether stack switching is required.

64-bit only for now, as 32-bit has a variant of that already. Once this is
cleaned up, the two implementations might be consolidated as an additional
cleanup on top.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/entry/entry_64.S | 39 +++++++++++++++++++++++-
arch/x86/include/asm/irq_stack.h | 53 +++++++++++++++++++++++++++++++-
2 files changed, 92 insertions(+)
create mode 100644 arch/x86/include/asm/irq_stack.h

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index d983a0d..1597370 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1106,6 +1106,45 @@ SYM_CODE_START_LOCAL_NOALIGN(.Lbad_gs)
SYM_CODE_END(.Lbad_gs)
.previous

+/*
+ * rdi: New stack pointer points to the top word of the stack
+ * rsi: Function pointer
+ * rdx: Function argument (can be NULL if none)
+ */
+SYM_FUNC_START(asm_call_on_stack)
+        /*
+         * Save the frame pointer unconditionally. This allows the ORC
+         * unwinder to handle the stack switch.
+         */
+        pushq   %rbp
+        mov     %rsp, %rbp
+
+        /*
+         * The unwinder relies on the word at the top of the new stack
+         * page linking back to the previous RSP.
+         */
+        mov     %rsp, (%rdi)
+        mov     %rdi, %rsp
+        /* Move the argument to the right place */
+        mov     %rdx, %rdi
+
+1:
+        .pushsection .discard.instr_begin
+        .long 1b - .
+        .popsection
+
+        CALL_NOSPEC     rsi
+
+2:
+        .pushsection .discard.instr_end
+        .long 2b - .
+        .popsection
+
+        /* Restore the previous stack pointer from RBP. */
+        leaveq
+        ret
+SYM_FUNC_END(asm_call_on_stack)
+
/* Call softirq on interrupt stack. Interrupts are off. */
.pushsection .text, "ax"
SYM_FUNC_START(do_softirq_own_stack)
diff --git a/arch/x86/include/asm/irq_stack.h b/arch/x86/include/asm/irq_stack.h
new file mode 100644
index 0000000..4ae66f0
--- /dev/null
+++ b/arch/x86/include/asm/irq_stack.h
@@ -0,0 +1,53 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_IRQ_STACK_H
+#define _ASM_X86_IRQ_STACK_H
+
+#include <linux/ptrace.h>
+
+#include <asm/processor.h>
+
+#ifdef CONFIG_X86_64
+static __always_inline bool irqstack_active(void)
+{
+        return __this_cpu_read(irq_count) != -1;
+}
+
+void asm_call_on_stack(void *sp, void *func, void *arg);
+
+static __always_inline void __run_on_irqstack(void *func, void *arg)
+{
+        void *tos = __this_cpu_read(hardirq_stack_ptr);
+
+        __this_cpu_add(irq_count, 1);
+        asm_call_on_stack(tos - 8, func, arg);
+        __this_cpu_sub(irq_count, 1);
+}
+
+#else /* CONFIG_X86_64 */
+static inline bool irqstack_active(void) { return false; }
+static inline void __run_on_irqstack(void *func, void *arg) { }
+#endif /* !CONFIG_X86_64 */
+
+static __always_inline bool irq_needs_irq_stack(struct pt_regs *regs)
+{
+        if (IS_ENABLED(CONFIG_X86_32))
+                return false;
+        if (!regs)
+                return !irqstack_active();
+        return !user_mode(regs) && !irqstack_active();
+}
+
+static __always_inline void run_on_irqstack_cond(void *func, void *arg,
+                                                 struct pt_regs *regs)
+{
+        void (*__func)(void *arg) = func;
+
+        lockdep_assert_irqs_disabled();
+
+        if (irq_needs_irq_stack(regs))
+                __run_on_irqstack(__func, arg);
+        else
+                __func(arg);
+}
+
+#endif

2020-06-05 17:20:52

by Qian Cai

Subject: Re: [patch V9 10/39] x86/entry: Provide helpers for execute on irqstack

On Thu, May 21, 2020 at 10:05:23PM +0200, Thomas Gleixner wrote:
> From: Thomas Gleixner <[email protected]>
>
> [...]
>
> Signed-off-by: Thomas Gleixner <[email protected]>

Reverting this commit and the rest of the series (with a trivial fixup), as
well as the two dependencies [1],

8449e768dcb8 ("x86/entry: Remove debug IDT frobbing")
029149180d1d ("x86/entry: Rename trace_hardirqs_off_prepare()")

fixed the warning under some memory pressure on AMD NUMA servers.

[ 9371.959858] asm_call_on_stack+0x12/0x20
asm_call_on_stack at arch/x86/entry/entry_64.S:710

The .config (in case it ever matters),
https://raw.githubusercontent.com/cailca/linux-mm/master/x86.config

[ 9371.260161] ------------[ cut here ]------------
[ 9371.267143] Stack depot reached limit capacity
[ 9371.267193] WARNING: CPU: 19 PID: 1181 at lib/stackdepot.c:115 stack_depot_save+0x3d9/0x57d
[ 9371.281470] Modules linked in: brd vfat fat ext4 crc16 mbcache jbd2 loop kvm_amd kvm ses enclosure dax_pmem irqbypass dax_pmem_core acpi_cpufreq ip_tables x_tables xfs sd_mod bnxt_en smartpqi scsi_transport_sas tg3 i40e libphy firmware_class dm_mirror dm_region_hash dm_log dm_mod [last unloaded: dummy_del_mod]
[ 9371.310176] CPU: 19 PID: 1181 Comm: systemd-journal Tainted: G O 5.7.0-next-20200604+ #1
[ 9371.320700] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 03/09/2018
[ 9371.329987] RIP: 0010:stack_depot_save+0x3d9/0x57d
[ 9371.335513] Code: 1d 9b bc 68 01 80 fb 01 0f 87 c0 01 00 00 80 e3 01 75 1f 4c 89 45 c0 c6 05 82 bc 68 01 01 48 c7 c7 e0 85 63 9f e8 9d 74 9d ff <0f> 0b 90 90 4c 8b 45 c0 48 c7 c7 80 1a d0 9f 4c 89 c6 e8 b0 2b 46
[ 9371.355426] RSP: 0018:ffffc90007260490 EFLAGS: 00010082
[ 9371.361387] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff9ed3207f
[ 9371.369544] RDX: 0000000000000007 RSI: dffffc0000000000 RDI: 0000000000000000
[ 9371.377428] RBP: ffffc900072604f8 R08: fffffbfff3f37539 R09: fffffbfff3f37539
[ 9371.385310] R10: ffffffff9f9ba9c3 R11: fffffbfff3f37538 R12: ffffc90007260508
[ 9371.393521] R13: 0000000000000036 R14: 0000000000000000 R15: 000000000009fb52
[ 9371.401403] FS: 00007fc9849f2980(0000) GS:ffff88942fb80000(0000) knlGS:0000000000000000
[ 9371.410244] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9371.417007] CR2: 00007f20d02c3000 CR3: 0000000440b9c000 CR4: 00000000003406e0
[ 9371.424889] Call Trace:
[ 9371.428054] <IRQ>
[ 9371.436315] save_stack+0x3f/0x50
[ 9371.734034] kasan_slab_free+0xe/0x10
[ 9371.738798] slab_free_freelist_hook+0x5d/0x1c0
[ 9371.748886] kmem_cache_free+0x10c/0x390
[ 9371.758450] mempool_free_slab+0x17/0x20
[ 9371.763522] mempool_free+0x65/0x170
[ 9371.767825] bio_free+0x14c/0x210
[ 9371.771864] bio_put+0x59/0x70
[ 9371.775644] end_swap_bio_write+0x199/0x250
[ 9371.780556] bio_endio+0x22c/0x4e0
[ 9371.784709] dec_pending+0x1bf/0x3e0 [dm_mod]
[ 9371.790207] clone_endio+0x129/0x3d0 [dm_mod]
[ 9371.800746] bio_endio+0x22c/0x4e0
[ 9371.809262] blk_update_request+0x3bb/0x980
[ 9371.814605] scsi_end_request+0x53/0x420
[ 9371.824002] scsi_io_completion+0x10a/0x830
[ 9371.844445] scsi_finish_command+0x1b9/0x250
[ 9371.849445] scsi_softirq_done+0x1ab/0x1f0
[ 9371.854272] blk_mq_force_complete_rq+0x217/0x250
[ 9371.859708] blk_mq_complete_request+0xe/0x20
[ 9371.865171] scsi_mq_done+0xc1/0x220
[ 9371.869479] pqi_aio_io_complete+0x83/0x2c0 [smartpqi]
[ 9371.881764] pqi_irq_handler+0x1fc/0x13f0 [smartpqi]
[ 9371.914115] __handle_irq_event_percpu+0x81/0x550
[ 9371.924289] handle_irq_event_percpu+0x70/0x100
[ 9371.945559] handle_irq_event+0x5a/0x8b
[ 9371.950121] handle_edge_irq+0x10c/0x370
[ 9371.950121] handle_edge_irq+0x10c/0x370
[ 9371.959858] asm_call_on_stack+0x12/0x20
asm_call_on_stack at arch/x86/entry/entry_64.S:710
[ 9371.964899] </IRQ>
[ 9371.967716] common_interrupt+0x185/0x2a0
[ 9371.972455] asm_common_interrupt+0x1e/0x40
[ 9371.977368] RIP: 0010:__asan_load4+0x8/0xa0
[ 9371.982281] Code: 00 e8 5c f4 ff ff 5d c3 40 38 f0 0f 9e c0 84 c0 75 e5 5d c3 48 c1 e8 03 80 3c 10 00 75 ed 5d c3 66 90 55 48 89 e5 48 8b 4d 08 <48> 83 ff fb 77 6c eb 3a 0f 1f 00 48 b8 00 00 00 00 00 00 00 ff 48
[ 9372.002246] RSP: 0018:ffffc9000b12f0e8 EFLAGS: 00000202
[ 9372.008207] RAX: 0000000000000000 RBX: ffffc9000b12f150 RCX: ffffffff9ed32487
[ 9372.016489] RDX: 0000000000000007 RSI: 0000000000000002 RDI: ffffffff9ff92a94
[ 9372.024370] RBP: ffffc9000b12f0e8 R08: fffffbfff3ff1d4d R09: fffffbfff3ff1d4d
[ 9372.032252] R10: ffffffff9ff8ea67 R11: fffffbfff3ff1d4c R12: 0000000000000000
[ 9372.040490] R13: ffffc9000b12f358 R14: 000000000000000c R15: 0000000000000005
[ 9372.053899] debug_lockdep_rcu_enabled+0x27/0x60
[ 9372.059248] rcu_read_lock_held_common+0x12/0x60
[ 9372.064989] rcu_read_lock_sched_held+0x60/0xe0
[ 9372.080338] shrink_active_list+0xbfd/0xc30
[ 9372.120481] shrink_lruvec+0xbf1/0x11b0
[ 9372.150410] shrink_node+0x344/0xd10
[ 9372.154719] do_try_to_free_pages+0x263/0xa00
[ 9372.169860] try_to_free_pages+0x239/0x570
[ 9372.179356] __alloc_pages_slowpath.constprop.59+0x5dd/0x1880
[ 9372.215762] __alloc_pages_nodemask+0x562/0x670
[ 9372.232686] alloc_pages_current+0x9c/0x110
[ 9372.237599] alloc_slab_page+0x355/0x530
[ 9372.242755] allocate_slab+0x485/0x5a0
[ 9372.247232] new_slab+0x46/0x70
[ 9372.251095] ___slab_alloc+0x35f/0x810
[ 9372.274485] __slab_alloc+0x43/0x70
[ 9372.287650] kmem_cache_alloc+0x257/0x3d0
[ 9372.298182] prepare_creds+0x26/0x130
[ 9372.302571] do_faccessat+0x255/0x3e0
[ 9372.321591] __x64_sys_access+0x38/0x40
[ 9372.326154] do_syscall_64+0x64/0x340
[ 9372.330542] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 9372.336329] RIP: 0033:0x7fc983a21bfb
[ 9372.341050] Code: Bad RIP value.
[ 9372.345000] RSP: 002b:00007fffd543cde8 EFLAGS: 00000246 ORIG_RAX: 0000000000000015
[ 9372.353319] RAX: ffffffffffffffda RBX: 00007fffd543fa40 RCX: 00007fc983a21bfb
[ 9372.361199] RDX: 00007fc983cf1c00 RSI: 0000000000000000 RDI: 00005573ed458090
[ 9372.369512] RBP: 00007fffd543cf30 R08: 00005573ed44ab99 R09: 0000000000000007
[ 9372.377393] R10: 0000000000000041 R11: 0000000000000246 R12: 0000000000000000
[ 9372.385275] R13: 0000000000000000 R14: 00007fffd543cea0 R15: 00005573eecd2990
[ 9372.393608] irq event stamp: 27274496
[ 9372.397999] hardirqs last enabled at (27274495): [<ffffffff9e635b8e>] free_unref_page_list+0x2ee/0x400
[ 9372.408152] hardirqs last disabled at (27274496): [<ffffffff9ed2c22b>] idtentry_enter_cond_rcu+0x1b/0x50
[ 9372.418695] softirqs last enabled at (27272816): [<ffffffff9f000478>] __do_softirq+0x478/0x784
[ 9372.428154] softirqs last disabled at (27272807): [<ffffffff9e2d0b41>] irq_exit_rcu+0xd1/0xe0
[ 9372.437435] ---[ end trace d2ebac1fad6e452e ]---

[1]
git revert --no-edit 355e1262d603..5a7462b1f9c1

> [...]

2020-06-05 17:39:50

by Peter Zijlstra

Subject: Re: [patch V9 10/39] x86/entry: Provide helpers for execute on irqstack

On Fri, Jun 05, 2020 at 01:18:16PM -0400, Qian Cai wrote:
> On Thu, May 21, 2020 at 10:05:23PM +0200, Thomas Gleixner wrote:
> > From: Thomas Gleixner <[email protected]>
> >
> > [...]
> >
> > Signed-off-by: Thomas Gleixner <[email protected]>
>
> Reverted this commit and the rest of series (with trivial fixup) as well
> as the two dependencies [1],
>
> 8449e768dcb8 ("x86/entry: Remove debug IDT frobbing")
> 029149180d1d ("x86/entry: Rename trace_hardirqs_off_prepare()")
>
> fixed the warning under some memory pressure on AMD NUMA servers.

The warning ?

> [ 9371.959858] asm_call_on_stack+0x12/0x20
> asm_call_on_stack at arch/x86/entry/entry_64.S:710

^^ what's that?

> The .config (if ever matters),
> https://raw.githubusercontent.com/cailca/linux-mm/master/x86.config
>
> [ 9371.260161] ------------[ cut here ]------------
> [ 9371.267143] Stack depot reached limit capacity
> [ 9371.267193] WARNING: CPU: 19 PID: 1181 at lib/stackdepot.c:115 stack_depot_save+0x3d9/0x57d

Is this _the_ warning?

Maybe it really is running out of storage, I don't think this patch is
to blame. The unwind here looks good, so why would the stack-depot
unwind be any different?

> [...]

2020-06-05 17:54:55

by Qian Cai

Subject: Re: [patch V9 10/39] x86/entry: Provide helpers for execute on irqstack

On Fri, Jun 05, 2020 at 07:36:22PM +0200, Peter Zijlstra wrote:
> On Fri, Jun 05, 2020 at 01:18:16PM -0400, Qian Cai wrote:
> > On Thu, May 21, 2020 at 10:05:23PM +0200, Thomas Gleixner wrote:
> > > From: Thomas Gleixner <[email protected]>
> > >
> > > [...]
> > >
> > > Signed-off-by: Thomas Gleixner <[email protected]>
> >
> > Reverted this commit and the rest of series (with trivial fixup) as well
> > as the two dependencies [1],
> >
> > 8449e768dcb8 ("x86/entry: Remove debug IDT frobbing")
> > 029149180d1d ("x86/entry: Rename trace_hardirqs_off_prepare()")
> >
> > fixed the warning under some memory pressure on AMD NUMA servers.
>
> The warning ?

Obviously, the warning below.

>
> > [ 9371.959858] asm_call_on_stack+0x12/0x20
> > asm_call_on_stack at arch/x86/entry/entry_64.S:710

This is one piece of the warning's call trace that was introduced by the
patch, which led me to revert the commit in the first place. It may or
may not be the real culprit, but I just wanted to highlight it in case.

>
> ^^ what's that?
>
> > The .config (if ever matters),
> > https://raw.githubusercontent.com/cailca/linux-mm/master/x86.config
> >
> > [ 9371.260161] ------------[ cut here ]------------
> > [ 9371.267143] Stack depot reached limit capacity
> > [ 9371.267193] WARNING: CPU: 19 PID: 1181 at lib/stackdepot.c:115 stack_depot_save+0x3d9/0x57d
>
> Is this _the_ warning?
>
> Maybe it really is running out of storage, I don't think this patch is
> to blame. The unwind here looks good, so why would the stack-depot
> unwind be any different?

I don't know. It is always reproducible with the same warning and call
traces on multiple similar AMD servers. Can this patch (or series)
increase the storage use that much, or introduce some kind of recursion?
Otherwise, I never saw this problem before the series was introduced.
Those machines have been running almost the same tests for years.

>
> > [...]

2020-06-07 12:05:14

by Thomas Gleixner

Subject: Re: [patch V9 10/39] x86/entry: Provide helpers for execute on irqstack


CC:+ Alexander

Qian Cai <[email protected]> writes:
> On Fri, Jun 05, 2020 at 07:36:22PM +0200, Peter Zijlstra wrote:
>> > [ 9371.959858] asm_call_on_stack+0x12/0x20
>> > asm_call_on_stack at arch/x86/entry/entry_64.S:710
>
> This is one piece of call from the warning call traces that introduced
> by the patch which leads me to revert the commit in the first place. It
> may or may not be the real culprit, but just wanted to highlight it in
> case.

Oh well. The warning comes from a storage limit check in the stack depot
code, i.e. the stack depot ran out of storage space.

Even if that commit causes stack traces to be larger, that revert does
not make any sense at all, and handwaving about recursions does not help
either. If that commit had introduced a recursion, it would have worse
effects than triggering this warning.

The difference made by the interrupt stack switching introduced by this
commit is that it generates one more entry in the stack trace compared to
the state before it, which obviously has an effect on the storage
requirements in the stack depot.

Thanks,

tglx

2020-06-07 18:32:58

by Qian Cai

Subject: Re: [patch V9 10/39] x86/entry: Provide helpers for execute on irqstack

On Sun, Jun 07, 2020 at 01:59:53PM +0200, Thomas Gleixner wrote:
>
> CC:+ Alexander
>
> Qian Cai <[email protected]> writes:
> > On Fri, Jun 05, 2020 at 07:36:22PM +0200, Peter Zijlstra wrote:
> >> > [ 9371.959858] asm_call_on_stack+0x12/0x20
> >> > asm_call_on_stack at arch/x86/entry/entry_64.S:710
> >
> > This is one piece of call from the warning call traces that introduced
> > by the patch which leads me to revert the commit in the first place. It
> > may or may not be the real culprit, but just wanted to highlight it in
> > case.
>
> Oh well. The warning is a storage check in the stack depot code,
> i.e. stack depot ran out of storage space.
>
> Even if that commit causes stack traces to be larger that revert does
> not make any sense at all and handwaving about recursions does not help
> either. If that commit introduced a recursion then that would have worse
> effects than triggering this warning.
>
> The difference between the interrupt stack switching introduced by this
> commit is that it generates another entry in the stack trace compared to
> the state before it, which obviously has an effect on the storage
> requirements in the stack depot.

Sorry, I did not mean to propose the revert as a final solution to the
problem, but rather as a data point. My problem is that I know so little
about this kernel area, so I wanted to throw out something similar to what
I have seen while working in other areas over the years. Of course, my
"random" tips could be wrong.

The effect is quite bad for our CI because it disables lockdep in the
middle of the tests, which could miss regression bugs lockdep might find
later. Anyway, I am not motivated enough yet to measure how much of "an
effect" is due to this patchset. I'll trim our .config and tests to
hopefully avoid running out of the storage this early.

2020-06-08 16:06:21

by Qian Cai

Subject: Re: [patch V9 10/39] x86/entry: Provide helpers for execute on irqstack

On Fri, Jun 05, 2020 at 07:36:22PM +0200, Peter Zijlstra wrote:
> On Fri, Jun 05, 2020 at 01:18:16PM -0400, Qian Cai wrote:
> > On Thu, May 21, 2020 at 10:05:23PM +0200, Thomas Gleixner wrote:
> > > From: Thomas Gleixner <[email protected]>
> > >
> > > [...]
> > >
> > > Signed-off-by: Thomas Gleixner <[email protected]>
> >
> > Reverted this commit and the rest of series (with trivial fixup) as well
> > as the two dependencies [1],
> >
> > 8449e768dcb8 ("x86/entry: Remove debug IDT frobbing")
> > 029149180d1d ("x86/entry: Rename trace_hardirqs_off_prepare()")
> >
> > fixed the warning under some memory pressure on AMD NUMA servers.
>
> The warning ?
>
> > [ 9371.959858] asm_call_on_stack+0x12/0x20
> > asm_call_on_stack at arch/x86/entry/entry_64.S:710

Even after I trimmed the .config [1] down to the bare minimum, a subset of the
LTP tests is still unable to finish on those AMD servers with page_owner=on.

[1]:
https://raw.githubusercontent.com/cailca/linux-mm/master/x86.config

It looks like it is because of this new IRQ entry introduced by this patch,

</IRQ>
asm_call_on_stack at arch/x86/entry/entry_64.S:710
handle_edge_irq at kernel/irq/chip.c:832

which will run into the stack depot limit due to the nested loops
below.

which has this loop,

        do {
                ...
                handle_irq_event(desc);
                ...
        } while ((desc->istate & IRQS_PENDING) &&
                 !irqd_irq_disabled(&desc->irq_data));

handle_irq_event at kernel/irq/handle.c:215
__handle_irq_event_percpu at kernel/irq/handle.c:156

Here is a nested loop,

        for_each_action_of_desc(desc, action) {
                ...
                res = action->handler(irq, action->dev_id);
                ...
        }

pqi_irq_handler at drivers/scsi/smartpqi/smartpqi_init.c:3462
pqi_process_io_intr at drivers/scsi/smartpqi/smartpqi_init.c:2992

Here is another nested loop,

        while (1) {
                ...
                io_request->io_complete_callback(io_request,
                        io_request->context);
                ...
        }

scsi_mq_done at drivers/scsi/scsi_lib.c:1602
blk_mq_complete_request at block/blk-mq.c:677
blk_mq_force_complete_rq at block/blk-mq.c:637
scsi_io_completion at drivers/scsi/scsi_lib.c:934
scsi_end_request at drivers/scsi/scsi_lib.c:584
scsi_mq_uninit_cmd at drivers/scsi/scsi_lib.c:547
scsi_free_sgtables at drivers/scsi/scsi_lib.c:537
__sg_free_table at lib/scatterlist.c:225

Here is one more nested loop,

        while (table->orig_nents) {
                ...
                free_fn(sgl, alloc_size);
                ...
        }

free_fn() will call kmem_cache_free(). Since we have page_owner=on, it
will call save_stack() to save the stack trace of each free.

> >
> > [ 9371.260161] ------------[ cut here ]------------
> > [ 9371.267143] Stack depot reached limit capacity
> > [ 9371.267193] WARNING: CPU: 19 PID: 1181 at lib/stackdepot.c:115 stack_depot_save+0x3d9/0x57d
>
> Is this _the_ warning?
>
> Maybe it really is running out of storage, I don't think this patch is
> to blame. The unwind here looks good, so why would the stack-depot
> unwind be any different?
>
> > [...]

2020-06-08 22:23:42

by Thomas Gleixner

Subject: Re: [patch V9 10/39] x86/entry: Provide helpers for execute on irqstack

Qian,

can you please ensure that people who got cc'ed because the problem
affects their subsystem are included on your replies even if you are
replying to a different subthread?

I explicitly did:

Cc:+ Alexander

at the very beginning of my reply:

https://lore.kernel.org/r/[email protected]

to make you aware of that.

Yes, email sucks, but it sucks even more when people are careless.

Qian Cai <[email protected]> writes:
> On Fri, Jun 05, 2020 at 07:36:22PM +0200, Peter Zijlstra wrote:
>
> Even after I trimmed the .config [1] to the barely minimal, a subset of LTP
> test still unable to finish on those AMD servers with page_owner=on.

What a surprise...

> [1]:
> https://raw.githubusercontent.com/cailca/linux-mm/master/x86.config
>
> It looks like because this new IRQ entry introduced by this patch,
>
> </IRQ>
> asm_call_on_stack at arch/x86/entry/entry_64.S:710
> handle_edge_irq at kernel/irq/chip.c:832
>
> which will running out of the stack depot limit due to nested loops
> below.
>
> which has this loop,
>
> do {
> ...
> handle_irq_event(desc);
> ...
> } while ((desc->istate & IRQS_PENDING) &&
> !irqd_irq_disabled(&desc->irq_data));

This loop has absolutely nothing to do with stack entry usage.

        foo()
        {
                do {
                        bar();
                } while (condition);
        }

If you take a stack trace inside bar() it will be the same stack trace
for every single loop iteration. And that stack trace will not be any
different from:

        foo()
        {
                bar();
        }

assuming that the call chain leading to foo() is the same in both cases.

And you can add even more loops in subsequent call chains within
bar(). They do not matter at all.

> Here has a nested loop,
>
> for_each_action_of_desc(desc, action) {
> ...
> res = action->handler(irq, action->dev_id);
> ...
> }

And this one is completely irrelevant because the interrupt which we are
looking at is a PCI interrupt which CANNOT be shared. IOW, the number of
loop iterations and the number of handlers invoked is exactly ONE.

I seriously have no idea what you are trying to demonstrate by finding
loops in a SINGLE callchain.

> Here has one more nested loop,
>
> while (table->orig_nents) {
> ...
> free_fn(sgl, alloc_size);
> ...
> }
>
> free_fn() will call kmem_cache_free(). Since we have page_owner=on, it
> will call save_stack() to save the each free stack trace.

That stack trace for each invocation of free_fn() in this loop is
exactly the same stack trace. The same stack trace is not eating up any
memory because the hash matches, i.e. the stack trace in the depot is
already known.
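
For illustration, that deduplication is visible directly at the stack depot
API level (a minimal sketch only, not taken from any patch in this thread):
saving the identical set of entries twice yields the same handle and
consumes no additional depot storage.

#include <linux/kernel.h>
#include <linux/stacktrace.h>
#include <linux/stackdepot.h>
#include <linux/gfp.h>

static depot_stack_handle_t save_current_stack(void)
{
        unsigned long entries[16];
        depot_stack_handle_t h1, h2;
        unsigned int nr;

        nr = stack_trace_save(entries, ARRAY_SIZE(entries), 0);

        /*
         * The second call finds the existing record via its hash and
         * returns the same handle instead of allocating new depot space.
         */
        h1 = stack_depot_save(entries, nr, GFP_NOWAIT);
        h2 = stack_depot_save(entries, nr, GFP_NOWAIT);
        WARN_ON(h1 != h2);

        return h1;
}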

Here is the simplified difference between the old code and the new code:

        Old                     New

        handle_edge_irq         handle_edge_irq
        do_IRQ                  asm_call_on_stack
        common_interrupt        common_interrupt
                                asm_common_interrupt

IOW, for every _UNIQUE_ interrupt related call chain, there is exactly
ONE stack entry more than before.

For a loop which generates the exact same stack trace for every
iteration this extra entry is not a problem.

But what matters is that interrupts can hit any random code path. So the
number of possible unique call chains is pretty much unlimited. And
with a high number of unique call chains the extra entry starts to
matter.

It's trivial math, isn't it?

        TS = Total Size of depot
        as = average size of all stored unique stack traces
        ms = maximum number of unique stack traces which fit in TS

        ms_old = TS / as

Let's further assume that the vast majority of stack traces are taken
from interrupt context. That means with the new code this results in:

        ms_new = TS / (as + 1)

        ==> ms_new = ms_old * as / (as + 1)

Depending on the value of 'as' the +1 can shave off a significant
percentage of capacity. IOW, the capacity is simply too small now for
the test scenario you are running. Truly a surprising outcome, right?

To get facts instead of useless loop theories, can you please apply the
patch below, enable DEBUGFS and provide the output of

/sys/kernel/debug/stackdepot/info

for a kernel before that change and after? Please read out that file
periodically, at roughly the same points in time after starting your test
scenario.

Note that I doubled the size of the stack depot so that we get real
numbers and not the cutoff by the size limit. IOW, the warning should
not trigger anymore. If it triggers nevertheless then the numbers will
still tell us an interesting story.

Thanks,

tglx
---
lib/stackdepot.c | 43 ++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 42 insertions(+), 1 deletion(-)

--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -31,6 +31,7 @@
#include <linux/stackdepot.h>
#include <linux/string.h>
#include <linux/types.h>
+#include <linux/debugfs.h>

#define DEPOT_STACK_BITS (sizeof(depot_stack_handle_t) * 8)

@@ -42,7 +43,7 @@
STACK_ALLOC_ALIGN)
#define STACK_ALLOC_INDEX_BITS (DEPOT_STACK_BITS - \
STACK_ALLOC_NULL_PROTECTION_BITS - STACK_ALLOC_OFFSET_BITS)
-#define STACK_ALLOC_SLABS_CAP 8192
+#define STACK_ALLOC_SLABS_CAP 16384
#define STACK_ALLOC_MAX_SLABS \
(((1LL << (STACK_ALLOC_INDEX_BITS)) < STACK_ALLOC_SLABS_CAP) ? \
(1LL << (STACK_ALLOC_INDEX_BITS)) : STACK_ALLOC_SLABS_CAP)
@@ -70,6 +71,7 @@ static void *stack_slabs[STACK_ALLOC_MAX
static int depot_index;
static int next_slab_inited;
static size_t depot_offset;
+static unsigned long unique_stacks;
static DEFINE_SPINLOCK(depot_lock);

static bool init_stack_slab(void **prealloc)
@@ -138,6 +140,7 @@ static struct stack_record *depot_alloc_
stack->handle.valid = 1;
memcpy(stack->entries, entries, size * sizeof(unsigned long));
depot_offset += required_size;
+ unique_stacks++;

return stack;
}
@@ -340,3 +343,41 @@ unsigned int filter_irq_stacks(unsigned
return nr_entries;
}
EXPORT_SYMBOL_GPL(filter_irq_stacks);
+
+static int debug_show(struct seq_file *m, void *p)
+{
+ unsigned long unst;
+ int didx, doff;
+
+ spin_lock_irq(&depot_lock);
+ unst = unique_stacks;
+ didx = depot_index;
+ doff = depot_offset;
+ spin_unlock_irq(&depot_lock);
+
+ seq_printf(m, "Unique stacks: %lu\n", unst);
+ seq_printf(m, "Depot index: %d\n", didx);
+ seq_printf(m, "Depot offset: %d\n", doff);
+ return 0;
+}
+
+static int debug_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, debug_show, inode->i_private);
+}
+
+static const struct file_operations dfs_ops = {
+ .open = debug_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static int __init debugfs_init(void)
+{
+ struct dentry *root_dir = debugfs_create_dir("stackdepot", NULL);
+
+ debugfs_create_file("info", 0444, root_dir, NULL, &dfs_ops);
+ return 0;
+}
+__initcall(debugfs_init);


2020-06-09 02:38:12

by Qian Cai

[permalink] [raw]
Subject: Re: [patch V9 10/39] x86/entry: Provide helpers for execute on irqstack

On Tue, Jun 09, 2020 at 12:20:06AM +0200, Thomas Gleixner wrote:
> Qian,
>
> can you please ensure that people who got cc'ed because the problem
> affects their subsystem are included on your replies even if you are
> replying to a different subthread?
>
> I explicitly did:
>
> Cc:+ Alexander
>
> at the very beginning of my reply:
>
> https://lore.kernel.org/r/[email protected]
>
> to make you aware of that.
>
> Yes, email sucks, but it sucks even more when people are careless.

Sorry, I will remember that next time.

[]
> To get facts instead of useless loop theories, can you please apply the
> patch below, enable DEBUGFS and provide the output of
>
> /sys/kernel/debug/stackdepot/info
>
> for a kernel before that change and after? Please read out that file at
> periodically roughly the same amounts of time after starting your test
> scenario.
>
> Note, that I doubled the size of the stack depot so that we get real
> numbers and not the cutoff by the size limit. IOW, the warning should
> not trigger anymore. If it triggers nevertheless then the numbers will
> still tell us an interesting story.

Instead of running the whole testsuite, I just picked the single LTP
oom02 test which usually seems to trigger it within the testsuite. Let me
know if this is insufficient (it does indeed show a big difference in
"Unique stacks"), and I am happy to run the whole thing.

BAD: next-20200608
GOOD: next-20200528 (which does not include this series)

BAD (after boot)
# cat /sys/kernel/debug/stackdepot/info
Unique stacks: 33547
Depot index: 359
Depot offset: 6752

BAD (after oom02)
# cat /sys/kernel/debug/stackdepot/info
Unique stacks: 140476
Depot index: 2555
Depot offset: 9168

GOOD (after boot)
# cat /sys/kernel/debug/stackdepot/info
Unique stacks: 31112
Depot index: 317
Depot offset: 14384

GOOD (after oom02)
# cat /sys/kernel/debug/stackdepot/info
Unique stacks: 34176
Depot index: 354
Depot offset: 4032

BTW, I am happy to run another one using next-20200608 with just this
series reverted if you suspect there is something else going on between
those two trees.

2020-06-09 20:42:58

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [patch V9 10/39] x86/entry: Provide helpers for execute on irqstack

Qian,

Qian Cai <[email protected]> writes:
> On Tue, Jun 09, 2020 at 12:20:06AM +0200, Thomas Gleixner wrote:
>> Note, that I doubled the size of the stack depot so that we get real
>> numbers and not the cutoff by the size limit. IOW, the warning should
>> not trigger anymore. If it triggers nevertheless then the numbers will
>> still tell us an interesting story.
>
> Instead of running the whole testsuite, I just picked this single LTP
> oom02 test which seems usually trigger it within the testsuite. Let me
> know if this is insufficient (which indeed tell the big difference in
> "Unique stacks"), and I am happy to run the whole things.

thanks for providing the data.

> BAD: next-20200608
> GOOD: next-20200528 (which does not include this series)
>
> BAD (after boot)
> # cat /sys/kernel/debug/stackdepot/info
> Unique stacks: 33547
> Depot index: 359
> Depot offset: 6752
>
> BAD (after oom02)
> # cat /sys/kernel/debug/stackdepot/info
> Unique stacks: 140476

That's indeed odd. I'll try to reproduce it and figure out what really
breaks here.

Thanks,

tglx

2020-06-09 21:01:59

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [patch V9 10/39] x86/entry: Provide helpers for execute on irqstack

Qian,

Thomas Gleixner <[email protected]> writes:
>> BAD (after oom02)
>> # cat /sys/kernel/debug/stackdepot/info
>> Unique stacks: 140476
>
> That's indeed odd. I try to reproduce and figure out what really breaks
> here.

I checked your config file and I think I know where this comes from. Can
you please disable KASAN just for testing purposes and compare before and
after again?

Thanks,

tglx

2020-06-10 12:44:57

by Qian Cai

[permalink] [raw]
Subject: Re: [patch V9 10/39] x86/entry: Provide helpers for execute on irqstack

On Tue, Jun 09, 2020 at 10:50:50PM +0200, Thomas Gleixner wrote:
> Qian,
>
> Thomas Gleixner <[email protected]> writes:
> >> BAD (after oom02)
> >> # cat /sys/kernel/debug/stackdepot/info
> >> Unique stacks: 140476
> >
> > That's indeed odd. I try to reproduce and figure out what really breaks
> > here.
>
> I checked your config file and I think I know where this comes from. Can
> you plase disable KASAN just for testing purposes and compare before
> after again?

It turns out I'll need a few days to be able to get ahold of those
affected systems again. I'll be reporting back as soon as possible.

2020-06-10 20:20:47

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [patch V9 10/39] x86/entry: Provide helpers for execute on irqstack

Qian Cai <[email protected]> writes:
> On Tue, Jun 09, 2020 at 10:50:50PM +0200, Thomas Gleixner wrote:
>> Thomas Gleixner <[email protected]> writes:
>> >> BAD (after oom02)
>> >> # cat /sys/kernel/debug/stackdepot/info
>> >> Unique stacks: 140476
>> >
>> > That's indeed odd. I try to reproduce and figure out what really breaks
>> > here.
>>
>> I checked your config file and I think I know where this comes from. Can
>> you plase disable KASAN just for testing purposes and compare before
>> after again?
>
> It turns out I'll need a few days to be able to get ahold of those
> affected systems again. I'll be reporting back as soon as possible.

I figured it out. It has nothing to do with asm_call_on...(). It's also
unrelated to page_owner. It's purely a KASAN issue.

KASAN trims stack traces with the help of filter_irq_stacks(), which
relies on the __irqentry_text_start/end section markers.

The entry rework moved the interrupt entry points out of the irqentry
section, which breaks that filter function.
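
For context, that filter boils down to roughly the following. This is a
simplified sketch, not the verbatim lib/stackdepot.c code (the real
helper also covers the softirq entry section): it walks the trace and
cuts it off at the first address that lies inside the irq entry text
section.

/* Section boundaries provided by the kernel linker script */
extern char __irqentry_text_start[], __irqentry_text_end[];

static bool in_irqentry_text(unsigned long ptr)
{
    return ptr >= (unsigned long)__irqentry_text_start &&
           ptr <  (unsigned long)__irqentry_text_end;
}

unsigned int filter_irq_stacks(unsigned long *entries, unsigned int nr_entries)
{
    unsigned int i;

    for (i = 0; i < nr_entries; i++) {
        /* Keep the irq entry point itself, drop everything below it */
        if (in_irqentry_text(entries[i]))
            return i + 1;
    }
    return nr_entries;
}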

I made it work again. Here are the numbers which make that obvious:

Mainline:
Unique stacks: 23403
Depot index: 244
Depot offset: 4208

x86/entry:
Unique stacks: 38835
Depot index: 464
Depot offset: 7616

x86/entry + fix:
Unique stacks: 23607
Depot index: 247
Depot offset: 14224

So with the non-working trimming this generates more unique stacks, and
because they are not trimmed they become larger and eat more storage
space. The resulting average size per stack shows that:

Mainline: 171 bytes per stack
x86/entry: 195 bytes per stack
x86/entry + fix: 172 bytes per stack
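
(Assuming 16 KiB depot slabs, i.e. PAGE_SIZE << STACK_ALLOC_ORDER with 4K
pages, these averages follow directly from the numbers above as
(Depot index * 16384 + Depot offset) / Unique stacks, e.g.
(244 * 16384 + 4208) / 23403 ~ 171 for mainline.)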

I'll point you to a test branch shortly.

Thanks,

tglx

2020-06-13 14:00:04

by Qian Cai

[permalink] [raw]
Subject: Re: [patch V9 10/39] x86/entry: Provide helpers for execute on irqstack

On Wed, Jun 10, 2020 at 09:38:56PM +0200, Thomas Gleixner wrote:
> Qian Cai <[email protected]> writes:
> > On Tue, Jun 09, 2020 at 10:50:50PM +0200, Thomas Gleixner wrote:
> >> Thomas Gleixner <[email protected]> writes:
> >> >> BAD (after oom02)
> >> >> # cat /sys/kernel/debug/stackdepot/info
> >> >> Unique stacks: 140476
> >> >
> >> > That's indeed odd. I try to reproduce and figure out what really breaks
> >> > here.
> >>
> >> I checked your config file and I think I know where this comes from. Can
> >> you plase disable KASAN just for testing purposes and compare before
> >> after again?
> >
> > It turns out I'll need a few days to be able to get ahold of those
> > affected systems again. I'll be reporting back as soon as possible.
>
> I figured it out. It has nothing to do with asm_call_on...(). It's also
> unrelated to page_owner. It's purely a KASAN issue.
>
> KASAN trims stack traces with the help of filter_irq_stacks() which
> relies on __irqentry_text_start/end
>
> The entry rework moved the interrupt entry points out of the irqentry
> section, which breaks that filter function.
>
> I made it work again. Here are the numbers which make that obvious:
>
> Mainline:
> Unique stacks: 23403
> Depot index: 244
> Depot offset: 4208
>
> x86/entry:
> Unique stacks: 38835
> Depot index: 464
> Depot offset: 7616
>
> x86/entry + fix:
> Unique stacks: 23607
> Depot index: 247
> Depot offset: 14224
>
> So with the non-working trimming this generates more unique stacks and
> because they are not trimmed they become larger and eat more storage
> space. The resulting average per stack shows that:
>
> Mainline: 171 bytes per stack
> x86/entry: 195 bytes per stack
> x86/entry + fix: 172 bytes per stack
>
> I'll point you to a test branch shortly.

Thomas, I have gotten hold of one of the affected systems now if you want
some testing there.

2020-06-13 14:06:00

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [patch V9 10/39] x86/entry: Provide helpers for execute on irqstack

Qian,

Qian Cai <[email protected]> writes:
> On Wed, Jun 10, 2020 at 09:38:56PM +0200, Thomas Gleixner wrote:
>
> Thomas, I get ahold of one of the affected systems now if you want some
> testing there.

git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/entry

Thanks,

tglx

2020-06-13 21:43:38

by Qian Cai

[permalink] [raw]
Subject: Re: [patch V9 10/39] x86/entry: Provide helpers for execute on irqstack

On Sat, Jun 13, 2020 at 04:03:02PM +0200, Thomas Gleixner wrote:
> Qian,
>
> Qian Cai <[email protected]> writes:
> > On Wed, Jun 10, 2020 at 09:38:56PM +0200, Thomas Gleixner wrote:
> >
> > Thomas, I get ahold of one of the affected systems now if you want some
> > testing there.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/entry

Looks good.

<start>
# cat /sys/kernel/debug/stackdepot/info
Unique stacks: 27481
Depot index: 285
Depot offset: 3024

<after oom02>
# cat /sys/kernel/debug/stackdepot/info
Unique stacks: 29242
Depot index: 308
Depot offset: 6608

<after the full testsuite>
# cat /sys/kernel/debug/stackdepot/info
Unique stacks: 49221
Depot index: 530
Depot offset: 5888

2020-06-14 09:04:57

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [patch V9 10/39] x86/entry: Provide helpers for execute on irqstack

Qian Cai <[email protected]> writes:
> On Sat, Jun 13, 2020 at 04:03:02PM +0200, Thomas Gleixner wrote:
>> Qian,
>>
>> Qian Cai <[email protected]> writes:
>> > On Wed, Jun 10, 2020 at 09:38:56PM +0200, Thomas Gleixner wrote:
>> >
>> > Thomas, I get ahold of one of the affected systems now if you want some
>> > testing there.
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/entry
>
> Looks good.
>
> <start>
> # cat /sys/kernel/debug/stackdepot/info
> Unique stacks: 27481
> Depot index: 285
> Depot offset: 3024
>
> <after oom02>
> # cat /sys/kernel/debug/stackdepot/info
> Unique stacks: 29242
> Depot index: 308
> Depot offset: 6608
>
> <after the full testsuite>
> # cat /sys/kernel/debug/stackdepot/info
> Unique stacks: 49221
> Depot index: 530
> Depot offset: 5888

Thanks for confirming!