Hi,
We are debugging some WARNING with LBR, precise_ip=2 and bpf_get_stackid(), like:
[36000.334284] WARNING: stack recursion on stack type 1
[36000.334288] WARNING: can't access registers at syscall_return_via_sysret+0x12/0x7f
This happens when we attach BPF program to perf_event with:
struct perf_event_attr attr = {
.type = PERF_TYPE_HARDWARE,
.config = PERF_COUNT_HW_CPU_CYCLES,
.precise_ip = 2,
.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK,
.branch_sample_type = PERF_SAMPLE_BRANCH_USER |
PERF_SAMPLE_BRANCH_NO_FLAGS |
PERF_SAMPLE_BRANCH_NO_CYCLES |
PERF_SAMPLE_BRANCH_CALL_STACK,
.sample_period = 500000,
.size = sizeof(struct perf_event_attr),
};
and calls bpf_get_stackid() from the BPF program. I pushed a reproducer to
https://git.kernel.org/pub/scm/linux/kernel/git/song/linux.git/log/?h=lbr_issue
under tools/bpf/ministrobe/. It requires latest CLANG to build. I also included
the binary that should just run on CentOS 7.
This warning is usually harmless, but it may also cause double fault #DF. Here are
some analysis from Tejun:
"""
The exact pathway is still under investigation but it triggered a sanity warning in
the kernel backtrace acquisition code looking like the following:
[ 2051.040745] WARNING: stack recursion on stack type 1
This is from `arch/x86/kernel/dumpstack_64.c::get_stack_info()` and is warning that
while walking the stack to obtain backtrace it encountered the same type of stack
more than once. The warning in itself isn't critical. It's a sanity check and when
the check fails, it just stops walking the stack and it is likely that the warning
is being triggered spuriously given that more machines which trigger these warnings
continue to run fine than not, which is unlikely with actual stack corruptions. A
possibility is that stack acquisition is happening from a context that the code
can't handle.
However, on some machines, this caused #DF (double fault) and thus immediate crash.
Backtracing a DF'd kernel is a bit cumbersome, so the following contains spurious
entries, but it does show what happened:
PID: 80430 TASK: ffff888d92c62a80 CPU: 24 COMMAND: "25_scheduler"
#0 [fffffe00004cfd88] machine_kexec at ffffffff8104a646
#1 [fffffe00004cfdd8] __crash_kexec at ffffffff8114a82f
#2 [fffffe00004cfea0] panic at ffffffff810ba99c
#3 [fffffe00004cff20] df_debug at ffffffff8104e21d
#4 [fffffe00004cff30] do_double_fault at ffffffff8101c9b4
#5 [fffffe00004cff50] double_fault at ffffffff81c00b3e
[exception RIP: vsnprintf+14]
RIP: ffffffff81a3ee6e RSP: fffffe00004d0ff8 RFLAGS: 00010082
RAX: fffffe00004d1070 RBX: fffffe00004d1101 RCX: fffffe00004d1050
RDX: ffffffff822c42f5 RSI: 000000007fffffff RDI: fffffe00004d111a
RBP: fffffe00004d10a0 R8: 0000000000000067 R9: ffffffff82209d05
R10: ffffffff822a5fd0 R11: ffffffff822a6358 R12: 0000000000000019
R13: 0000000000000001 R14: 0000000000000019 R15: ffffffff822b5fd6
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <DOUBLEFAULT exception stack> ---
#6 [fffffe00004d0ff8] vsnprintf at ffffffff81a3ee6e
bt: cannot transition from exception stack to current process stack:
exception stack pointer: fffffe00004cfd88
process stack pointer: fffffe00004d1048
current stack base: ffffc900241a0000
0xfffffe00004d1040 kallsyms_token_index
0xfffffe00004d1048 sprintf
0xfffffe00004d1078 kallsyms_lookup
0xfffffe00004d1098 kallsyms_names
0xfffffe00004d10a8 __sprint_symbol
0xfffffe00004d10d8 textbuf.47672
0xfffffe00004d10e0 always_kmsg_dump
0xfffffe00004d10f8 symbol_string
0xfffffe00004d11c0 textbuf.47672
0xfffffe00004d11e8 textbuf.47672
0xfffffe00004d11f8 textbuf.47672
0xfffffe00004d1200 always_kmsg_dump
0xfffffe00004d1210 kallsyms_token_index
0xfffffe00004d1218 vsnprintf
0xfffffe00004d1220 textbuf.47672
0xfffffe00004d1258 kallsyms_token_index
0xfffffe00004d1270 vscnprintf
0xfffffe00004d1280 vprintk_store
0xfffffe00004d1290 wake_up_klogd
0xfffffe00004d12b0 kallsyms_token_index
0xfffffe00004d12c8 vprintk_emit
0xfffffe00004d1300 entry_SYSCALL_64
0xfffffe00004d1318 vprintk_deferred
0xfffffe00004d1328 printk_deferred
0xfffffe00004d1360 entry_SYSCALL_64
0xfffffe00004d1380 __start_orc_unwind
0xfffffe00004d1388 unwind_next_frame.cold.7
0xfffffe00004d13c8 perf_callchain_kernel
0xfffffe00004d1418 entry_SYSCALL_64
0xfffffe00004d1450 get_perf_callchain
0xfffffe00004d14b0 bpf_get_stack
0xfffffe00004d1730 bpf_overflow_handler
0xfffffe00004d1788 __perf_event_overflow
0xfffffe00004d17a0 x86_pmu
0xfffffe00004d17b8 __intel_pmu_pebs_event
0xfffffe00004d17e0 setup_pebs_fixed_sample_data
0xfffffe00004d1890 entry_SYSCALL_64
0xfffffe00004d1ab8 intel_pmu_drain_pebs_nhm
0xfffffe00004d1ac0 setup_pebs_fixed_sample_data
0xfffffe00004d1bb8 handle_pmi_common
0xfffffe00004d1d00 insn_get_sib.part.5
0xfffffe00004d1d10 insn_get_displacement.part.6
0xfffffe00004d1d20 insn_get_immediate.part.7
0xfffffe00004d1d38 __set_pte_vaddr
0xfffffe00004d1d78 __native_set_fixmap
0xfffffe00004d1d80 ghes_copy_tofrom_phys
0xfffffe00004d1e08 intel_pmu_handle_irq
0xfffffe00004d1e48 perf_event_nmi_handler
0xfffffe00004d1e50 perf_event_nmi_handler_na.47078
0xfffffe00004d1e58 nmi_reason_lock
0xfffffe00004d1e60 nmi_handle
0xfffffe00004d1e70 nmi_desc
0xfffffe00004d1e90 nmi_reason_lock
0xfffffe00004d1eb8 default_do_nmi
0xfffffe00004d1ed8 do_nmi
0xfffffe00004d1ef0 nmi
0xfffffe00004d1f78 entry_SYSCALL_64
0xfffffe00004d1fa0 entry_SYSCALL_64
0xfffffe00004d1fd8 entry_SYSCALL_64
The kernel is running the normal NMI path and triggers the above mentioned warnings,
which calls into string formatting path for the warning messages. However, NMIs run
off of its dedicated exception stack which is 4k in size, which is too small for the
full string formatting path to run, so it ends up overflowing the stack while running
NMI and the machine enters an unrecoverable state.
"""
We found the issue in multiple versions of kernel, from 4.16 to 5.8-rc1. The exact
WARNING message varies among different versions. Latest kernel put these warnings
under WARN_ONCE(), so they are not flushing dmesg.
We noticed that this only happens with precise_ip >= 2. This is caused by
setup_pebs_fixed_sample_data() using pens->real_ip:
set_linear_ip(regs, pebs->real_ip);
For our use case, we do need precise_ip=2. So we would like suggestions to fix the
warning and/or to avoid double fault.
Please share your thoughts on this.
Thanks in advance!
Song
On Thu, Jun 18, 2020 at 05:25:24PM +0000, Song Liu wrote:
> PID: 80430 TASK: ffff888d92c62a80 CPU: 24 COMMAND: "25_scheduler"
> #0 [fffffe00004cfd88] machine_kexec at ffffffff8104a646
> #1 [fffffe00004cfdd8] __crash_kexec at ffffffff8114a82f
> #2 [fffffe00004cfea0] panic at ffffffff810ba99c
> #3 [fffffe00004cff20] df_debug at ffffffff8104e21d
> #4 [fffffe00004cff30] do_double_fault at ffffffff8101c9b4
> #5 [fffffe00004cff50] double_fault at ffffffff81c00b3e
> [exception RIP: vsnprintf+14]
> RIP: ffffffff81a3ee6e RSP: fffffe00004d0ff8 RFLAGS: 00010082
> RAX: fffffe00004d1070 RBX: fffffe00004d1101 RCX: fffffe00004d1050
> RDX: ffffffff822c42f5 RSI: 000000007fffffff RDI: fffffe00004d111a
> RBP: fffffe00004d10a0 R8: 0000000000000067 R9: ffffffff82209d05
> R10: ffffffff822a5fd0 R11: ffffffff822a6358 R12: 0000000000000019
> R13: 0000000000000001 R14: 0000000000000019 R15: ffffffff822b5fd6
> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> --- <DOUBLEFAULT exception stack> ---
> #6 [fffffe00004d0ff8] vsnprintf at ffffffff81a3ee6e
> bt: cannot transition from exception stack to current process stack:
> exception stack pointer: fffffe00004cfd88
> process stack pointer: fffffe00004d1048
> current stack base: ffffc900241a0000
>
> 0xfffffe00004d1040 kallsyms_token_index
> 0xfffffe00004d1048 sprintf
> 0xfffffe00004d1078 kallsyms_lookup
> 0xfffffe00004d1098 kallsyms_names
> 0xfffffe00004d10a8 __sprint_symbol
> 0xfffffe00004d10d8 textbuf.47672
> 0xfffffe00004d10e0 always_kmsg_dump
> 0xfffffe00004d10f8 symbol_string
> 0xfffffe00004d11c0 textbuf.47672
> 0xfffffe00004d11e8 textbuf.47672
> 0xfffffe00004d11f8 textbuf.47672
> 0xfffffe00004d1200 always_kmsg_dump
> 0xfffffe00004d1210 kallsyms_token_index
> 0xfffffe00004d1218 vsnprintf
> 0xfffffe00004d1220 textbuf.47672
> 0xfffffe00004d1258 kallsyms_token_index
> 0xfffffe00004d1270 vscnprintf
> 0xfffffe00004d1280 vprintk_store
> 0xfffffe00004d1290 wake_up_klogd
> 0xfffffe00004d12b0 kallsyms_token_index
> 0xfffffe00004d12c8 vprintk_emit
> 0xfffffe00004d1300 entry_SYSCALL_64
> 0xfffffe00004d1318 vprintk_deferred
> 0xfffffe00004d1328 printk_deferred
> 0xfffffe00004d1360 entry_SYSCALL_64
> 0xfffffe00004d1380 __start_orc_unwind
> 0xfffffe00004d1388 unwind_next_frame.cold.7
> 0xfffffe00004d13c8 perf_callchain_kernel
> 0xfffffe00004d1418 entry_SYSCALL_64
> 0xfffffe00004d1450 get_perf_callchain
> 0xfffffe00004d14b0 bpf_get_stack
This, this is where it gets buggerd. This is a BPF problem, not a perf
problem.
You'll note that if you ask perf for a PERF_SAMPLE_CALLCHAIN we'll do
that early in setup_pebs_fixed_sample_data().
> 0xfffffe00004d1730 bpf_overflow_handler
> 0xfffffe00004d1788 __perf_event_overflow
> 0xfffffe00004d17a0 x86_pmu
> 0xfffffe00004d17b8 __intel_pmu_pebs_event
> 0xfffffe00004d17e0 setup_pebs_fixed_sample_data
> 0xfffffe00004d1890 entry_SYSCALL_64
> 0xfffffe00004d1ab8 intel_pmu_drain_pebs_nhm
> 0xfffffe00004d1ac0 setup_pebs_fixed_sample_data
> 0xfffffe00004d1bb8 handle_pmi_common
> 0xfffffe00004d1d00 insn_get_sib.part.5
> 0xfffffe00004d1d10 insn_get_displacement.part.6
> 0xfffffe00004d1d20 insn_get_immediate.part.7
> 0xfffffe00004d1d38 __set_pte_vaddr
> 0xfffffe00004d1d78 __native_set_fixmap
> 0xfffffe00004d1d80 ghes_copy_tofrom_phys
> 0xfffffe00004d1e08 intel_pmu_handle_irq
> 0xfffffe00004d1e48 perf_event_nmi_handler
> 0xfffffe00004d1e50 perf_event_nmi_handler_na.47078
> 0xfffffe00004d1e58 nmi_reason_lock
> 0xfffffe00004d1e60 nmi_handle
> 0xfffffe00004d1e70 nmi_desc
> 0xfffffe00004d1e90 nmi_reason_lock
> 0xfffffe00004d1eb8 default_do_nmi
> 0xfffffe00004d1ed8 do_nmi
> 0xfffffe00004d1ef0 nmi
> 0xfffffe00004d1f78 entry_SYSCALL_64
> 0xfffffe00004d1fa0 entry_SYSCALL_64
> 0xfffffe00004d1fd8 entry_SYSCALL_64
>
> The kernel is running the normal NMI path and triggers the above mentioned warnings,
> which calls into string formatting path for the warning messages.
> However, NMIs run
> off of its dedicated exception stack which is 4k in size, which is too small for the
> full string formatting path to run,
Say what ?! Has printk() grown into such a freakish monster that it
cannot run on 4k stacks anymore?
I suppose that KSYM_SYMBOL_LEN on-stack buffer in symbol_string()
doesn't help.
> We noticed that this only happens with precise_ip >= 2. This is caused by
> setup_pebs_fixed_sample_data() using pens->real_ip:
>
> set_linear_ip(regs, pebs->real_ip);
>
> For our use case, we do need precise_ip=2. So we would like suggestions to fix the
> warning and/or to avoid double fault.
It all works as advertised. The BPF program is broken for calling
bpf_get_stack() on a PEBS event.
> On Jun 18, 2020, at 11:33 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Jun 18, 2020 at 05:25:24PM +0000, Song Liu wrote:
>
[...]
>> 0xfffffe00004d1380 __start_orc_unwind
>> 0xfffffe00004d1388 unwind_next_frame.cold.7
>> 0xfffffe00004d13c8 perf_callchain_kernel
>> 0xfffffe00004d1418 entry_SYSCALL_64
>> 0xfffffe00004d1450 get_perf_callchain
>> 0xfffffe00004d14b0 bpf_get_stack
>
> This, this is where it gets buggerd. This is a BPF problem, not a perf
> problem.
>
> You'll note that if you ask perf for a PERF_SAMPLE_CALLCHAIN we'll do
> that early in setup_pebs_fixed_sample_data().
We did notice perf works fine in similar scenario. I guess need similar
logic in BPF.
Thanks,
Song
> We noticed that this only happens with precise_ip >= 2. This is caused by
> setup_pebs_fixed_sample_data() using pens->real_ip:
>
> set_linear_ip(regs, pebs->real_ip);
With precise_ip == 1 the IP will be one instruction behind.
That's the only difference to 2 or 3.
So something about the actual instruction confuses the unwinder.
I would check if there is anything special about these IPs:
> 0xfffffe00004d1f78 entry_SYSCALL_64
> 0xfffffe00004d1fa0 entry_SYSCALL_64
> 0xfffffe00004d1fd8 entry_SYSCALL_64
>
> For our use case, we do need precise_ip=2. So we would like suggestions to fix the
> warning and/or to avoid double fault.
As a minimum, I would just add an && !in_nmi() to the warning. Clearly
it doesn't make sense to print a warning that overflows the stack.
-Andi
> On Jun 18, 2020, at 3:35 PM, Andi Kleen <[email protected]> wrote:
>
>> We noticed that this only happens with precise_ip >= 2. This is caused by
>> setup_pebs_fixed_sample_data() using pens->real_ip:
>>
>> set_linear_ip(regs, pebs->real_ip);
>
> With precise_ip == 1 the IP will be one instruction behind.
> That's the only difference to 2 or 3.
>
> So something about the actual instruction confuses the unwinder.
>
> I would check if there is anything special about these IPs:
>
>> 0xfffffe00004d1f78 entry_SYSCALL_64
>> 0xfffffe00004d1fa0 entry_SYSCALL_64
>> 0xfffffe00004d1fd8 entry_SYSCALL_64
I don't have this system at the moment. For the system I have, I got:
[36000.163945] WARNING: can't access registers at error_entry+0x74/0xe0
[36000.243879] WARNING: can't access registers at error_entry+0x65/0xe0
[36000.334288] WARNING: can't access registers at syscall_return_via_sysret+0x12/0x7f
[36000.540940] WARNING: can't access registers at syscall_return_via_sysret+0x6/0x7f
[58419.640387] WARNING: can't access registers at syscall_return_via_sysret+0xb/0x7f
[58423.082418] WARNING: can't access registers at syscall_return_via_sysret+0xf/0x7f
[58423.155705] WARNING: can't access registers at syscall_return_via_sysret+0x6/0x7f
And
(gdb) x/5i error_entry+0x65
0xffffffff81c01075 <error_entry+101>: pop %r12
0xffffffff81c01077 <error_entry+103>: mov %rsp,%rdi
0xffffffff81c0107a <error_entry+106>: callq 0xffffffff81aa7a30 <sync_regs>
0xffffffff81c0107f <error_entry+111>: mov %rax,%rsp
0xffffffff81c01082 <error_entry+114>: push %r12
(gdb) x/5i error_entry+0x74
0xffffffff81c01084 <error_entry+116>: retq
0xffffffff81c01085 <error_entry+117>: nop
0xffffffff81c01086 <error_entry+118>: nop
0xffffffff81c01087 <error_entry+119>: nop
0xffffffff81c01088 <error_entry+120>: retq
(gdb) x/5i syscall_return_via_sysret+0xb
0xffffffff81c000ec <entry_SYSCALL_64+236>: pop %r10
0xffffffff81c000ee <entry_SYSCALL_64+238>: pop %r9
0xffffffff81c000f0 <entry_SYSCALL_64+240>: pop %r8
0xffffffff81c000f2 <entry_SYSCALL_64+242>: pop %rax
0xffffffff81c000f3 <entry_SYSCALL_64+243>: pop %rsi
(gdb) x/5i syscall_return_via_sysret+0xa
0xffffffff81c000eb <entry_SYSCALL_64+235>: pop %rsi
0xffffffff81c000ec <entry_SYSCALL_64+236>: pop %r10
0xffffffff81c000ee <entry_SYSCALL_64+238>: pop %r9
0xffffffff81c000f0 <entry_SYSCALL_64+240>: pop %r8
0xffffffff81c000f2 <entry_SYSCALL_64+242>: pop %rax
(gdb) x/5i syscall_return_via_sysret+0x12
0xffffffff81c000f3 <entry_SYSCALL_64+243>: pop %rsi
0xffffffff81c000f4 <entry_SYSCALL_64+244>: pop %rdx
0xffffffff81c000f5 <entry_SYSCALL_64+245>: pop %rsi
0xffffffff81c000f6 <entry_SYSCALL_64+246>: mov %rsp,%rdi
0xffffffff81c000f9 <entry_SYSCALL_64+249>: mov %gs:0x6004,%rsp
(gdb) x/5i syscall_return_via_sysret+0x6
0xffffffff81c000e7 <entry_SYSCALL_64+231>: pop %r12
0xffffffff81c000e9 <entry_SYSCALL_64+233>: pop %rbp
0xffffffff81c000ea <entry_SYSCALL_64+234>: pop %rbx
0xffffffff81c000eb <entry_SYSCALL_64+235>: pop %rsi
0xffffffff81c000ec <entry_SYSCALL_64+236>: pop %r10
I am not sure whether there is anything special. Maybe "pop"?
Thanks,
Song
On Thu, Jun 18, 2020 at 03:35:55PM -0700, Andi Kleen wrote:
> > We noticed that this only happens with precise_ip >= 2. This is caused by
> > setup_pebs_fixed_sample_data() using pens->real_ip:
> >
> > set_linear_ip(regs, pebs->real_ip);
>
> With precise_ip == 1 the IP will be one instruction behind.
> That's the only difference to 2 or 3.
>
> So something about the actual instruction confuses the unwinder.
It's trying to do an unwind with the PEBS IP (or worse the whole PEBS
regset), which doesn't match the interrupt IP or the stack layout at
that time. Of course it doesn't work.