Hi Alexei Starovoitov and other,
I triggered a kernel panic when developing my 'perf bpf' facility. The
call stack is listed at the bottom of
this mail.
I attached two bpf programs on 'kmem_cache_free%return' and
'__alloc_pages_nodemask'. The programs is very simple.
The panic is raised after closing the bpf program and the perf event
file. Looks like the panic is caused
by racing between closing perf event fd and bpf program fd. I'm unable
to reproduce this problem with similar
operations.
Following is the exact instruction cause the panic.
ffffffff8111cf70 <bpf_prog_put>:
void bpf_prog_put(struct bpf_prog *prog)
{
ffffffff8111cf70: e8 fb a1 49 00 callq ffffffff815b7170
<__fentry__>
ffffffff8111cf75: 55 push %rbp
ffffffff8111cf76: 48 89 e5 mov %rsp,%rbp
ffffffff8111cf79: 53 push %rbx
ffffffff8111cf7a: 48 89 fb mov %rdi,%rbx
ffffffff8111cf7d: 48 83 ec 08 sub $0x8,%rsp
ffffffff8111cf81: 48 8b 47 10 mov 0x10(%rdi),%rax
<-- *panic at this instruction*
ffffffff8111cf85: f0 ff 08 lock decl (%rax)
ffffffff8111cf88: 74 0e je ffffffff8111cf98
<bpf_prog_put+0x28>
if (atomic_dec_and_test(&prog->aux->refcnt)) {
free_used_maps(prog->aux);
bpf_prog_free(prog);
}
}
ffffffff8111cf8a: 48 83 c4 08 add $0x8,%rsp
ffffffff8111cf8e: 5b pop %rbx
ffffffff8111cf8f: 5d pop %rbp
ffffffff8111cf90: c3 retq
Thank you.
----------------------- KERNEL PANIC -------------------
[ 261.839750] BUG: unable to handle kernel NULL pointer dereference at
00000000000006d0
[ 261.839750] IP: [<ffffffff8111cf81>] bpf_prog_put+0x11/0x50
[ 261.839750] PGD 7f7d0067 PUD 7f74d067 PMD 0
[ 261.839750] Oops: 0000 [#1] SMP
[ 261.839750] Modules linked in:
[ 261.839750] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.0.0+ #11
[ 261.839750] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org
04/01/2014
[ 261.839750] task: ffffffff81a114a0 ti: ffffffff81a00000 task.ti:
ffffffff81a00000
[ 261.839750] RIP: 0010:[<ffffffff8111cf81>] [<ffffffff8111cf81>]
bpf_prog_put+0x11/0x50
[ 261.839750] RSP: 0018:ffff88007ea03e68 EFLAGS: 00000292
[ 261.839750] RAX: ffff880076e35d20 RBX: 00000000000006c0 RCX:
ffffffff81123d60
[ 261.839750] RDX: 00000001000d000b RSI: 0000000000000000 RDI:
00000000000006c0
[ 261.839750] RBP: ffff88007ea03e78 R08: ffff88007f10c3c0 R09:
ffff88007ea189c0
[ 261.839750] R10: ffff88007aa68290 R11: ffff88007ea0800d R12:
ffff88007643a000
[ 261.839750] R13: 000000000000000a R14: 0000000000000125 R15:
ffff88007ea16540
[ 261.839750] FS: 0000000000000000(0000) GS:ffff88007ea00000(0000)
knlGS:0000000000000000
[ 261.839750] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 261.839750] CR2: 00000000000006d0 CR3: 0000000078aa5000 CR4:
00000000000006f0
[ 261.839750] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 261.839750] DR3: 0000000000000000 DR6: 0000000000000000 DR7:
0000000000000000
[ 261.839750] Stack:
[ 261.839750] ffff88007ea03e78 ffff88007643a320 ffff88007ea03e98
ffffffff81123dac
[ 261.839750] ffffffff81a38380 ffff88007f7de000 ffff88007ea03f08
ffffffff810a2d0b
[ 261.839750] ffffffff81ced238 ffff88007b911508 ffff88007ea16570
ffffffff81a114a0
[ 261.839750] Call Trace:
[ 261.839750] <IRQ>
[ 261.839750] [<ffffffff81123dac>] free_event_rcu+0x4c/0x60
[ 261.839750] [<ffffffff810a2d0b>] rcu_process_callbacks+0x25b/0x5a0
[ 261.839750] [<ffffffff8105749d>] __do_softirq+0xed/0x280
[ 261.839750] [<ffffffff810577fd>] irq_exit+0x4d/0x60
[ 261.839750] [<ffffffff815b7b3a>] smp_apic_timer_interrupt+0x4a/0x60
[ 261.839750] [<ffffffff815b617b>] apic_timer_interrupt+0x6b/0x70
[ 261.839750] <EOI>
[ 261.839750] [<ffffffff8100cd10>] ? default_idle+0x20/0xb0
[ 261.839750] [<ffffffff8100d69f>] arch_cpu_idle+0xf/0x20
[ 261.839750] [<ffffffff8108c117>] cpu_startup_entry+0x2f7/0x400
[ 261.839750] [<ffffffff815a4e87>] rest_init+0x77/0x80
[ 261.839750] [<ffffffff81ae7fcd>] start_kernel+0x423/0x430
[ 261.839750] [<ffffffff81ae799a>] ? set_init_arg+0x56/0x56
[ 261.839750] [<ffffffff81ae75ad>] x86_64_start_reservations+0x2a/0x2c
[ 261.839750] [<ffffffff81ae769b>] x86_64_start_kernel+0xec/0xf0
[ 261.839750] Code: 24 72 e7 49 8b 7d 00 e8 8e ce 05 00 48 83 c4 08 5b
41 5c 41 5d 5d c3 0f 1f 00 66 66 66 66 90 55 48 89 e5 53 48 89 fb 48 83
ec 08 <48> 8b 47 10 3e ff 08 74 0e 48 83 c4 08 5b 5d c3 0f 1f 80 00 00
[ 261.839750] RIP [<ffffffff8111cf81>] bpf_prog_put+0x11/0x50
[ 261.839750] RSP <ffff88007ea03e68>
[ 261.839750] CR2: 00000000000006d0
[ 261.839750] ---[ end trace dddf4ec721745b49 ]---
[ 261.839750] Kernel panic - not syncing: Fatal exception in interrupt
[ 261.839750] Kernel Offset: disabled
[ 261.839750] ---[ end Kernel panic - not syncing: Fatal exception in
interrupt
On 5/14/15 8:54 PM, Wangnan (F) wrote:
> Hi Alexei Starovoitov and other,
>
> I triggered a kernel panic when developing my 'perf bpf' facility. The
> call stack is listed at the bottom of
> this mail.
>
> I attached two bpf programs on 'kmem_cache_free%return' and
> '__alloc_pages_nodemask'. The programs is very simple.
> The panic is raised after closing the bpf program and the perf event
> file. Looks like the panic is caused
> by racing between closing perf event fd and bpf program fd. I'm unable
> to reproduce this problem with similar
> operations.
>
> Following is the exact instruction cause the panic.
thanks for the report.
Looks like pointer 'prog == 0x6c0' is passed into bpf_prog_put,
which means that event->tp_event was freed and memory reused before
free_event_rcu() was called.
I think it's not perf_event_fd racing with prog_fd, but rather
with kprobe freeing:
__free_event()
event->destroy(event)
perf_trace_destroy
perf_trace_event_unreg
which is dropping event->tp_event->perf_refcount
that allows kprobe freeing to proceed in:
unregister_kprobe_event
trace_remove_event_call
probe_remove_event_call
and eventually tp_event to get freed.
I think calling perf_event_free_bpf_prog()
from __free_event() instead of free_event_rcu() will fix the race,
but please double check my analysis.
Also please send me a reproducer script. I'd like to see it crashing
first before the fix and not crashing afterwards.
在 2015/5/15 13:37, Alexei Starovoitov 写道:
> On 5/14/15 8:54 PM, Wangnan (F) wrote:
>> Hi Alexei Starovoitov and other,
>>
>> I triggered a kernel panic when developing my 'perf bpf' facility. The
>> call stack is listed at the bottom of
>> this mail.
>>
>> I attached two bpf programs on 'kmem_cache_free%return' and
>> '__alloc_pages_nodemask'. The programs is very simple.
>> The panic is raised after closing the bpf program and the perf event
>> file. Looks like the panic is caused
>> by racing between closing perf event fd and bpf program fd. I'm unable
>> to reproduce this problem with similar
>> operations.
>>
>> Following is the exact instruction cause the panic.
>
> thanks for the report.
> Looks like pointer 'prog == 0x6c0' is passed into bpf_prog_put,
> which means that event->tp_event was freed and memory reused before
> free_event_rcu() was called.
>
> I think it's not perf_event_fd racing with prog_fd, but rather
> with kprobe freeing:
> __free_event()
> event->destroy(event)
> perf_trace_destroy
> perf_trace_event_unreg
> which is dropping event->tp_event->perf_refcount
> that allows kprobe freeing to proceed in:
> unregister_kprobe_event
> trace_remove_event_call
> probe_remove_event_call
> and eventually tp_event to get freed.
>
> I think calling perf_event_free_bpf_prog()
> from __free_event() instead of free_event_rcu() will fix the race,
> but please double check my analysis.
> Also please send me a reproducer script. I'd like to see it crashing
> first before the fix and not crashing afterwards.
>
I triggered the problem with my 'perf bpf' patch series, and reproduced
once.
The bpf program is attached.
What I do is to use
# perf bpf record --object /root/sample_bpf_program.o -- sleep 4
to start recording, then press C-c before sleep finish after about 3
seconds.
The second call trace is identical to the previous one.
My environment is qemu with v4.1-rc3 kernel.
Thank you.
-------------------------------------------------
#include <uapi/linux/bpf.h>
#include <linux/version.h>
#include <uapi/linux/ptrace.h>
#define SEC(NAME) __attribute__((section(NAME), used))
static int (*bpf_map_delete_elem)(void *map, void *key) =
(void *) BPF_FUNC_map_delete_elem;
static int (*bpf_trace_printk)(const char *fmt, int fmt_size, ...) =
(void *) BPF_FUNC_trace_printk;
struct bpf_map_def {
unsigned int type;
unsigned int key_size;
unsigned int value_size;
unsigned int max_entries;
};
struct pair {
u64 val;
u64 ip;
};
struct bpf_map_def SEC("maps") my_map = {
.type = BPF_MAP_TYPE_HASH,
.key_size = sizeof(long),
.value_size = sizeof(struct pair),
.max_entries = 1000000,
};
struct bpf_map_def SEC("maps") my_map2 = {
.type = BPF_MAP_TYPE_HASH,
.key_size = sizeof(long),
.value_size = sizeof(struct pair),
.max_entries = 1000000,
};
SEC("cache_free=kmem_cache_free%return")
int bpf_prog1(struct pt_regs *ctx)
{
long ptr = ctx->r14;
bpf_map_delete_elem(&my_map2, &ptr);
return 0;
}
SEC("mybpfprog=__alloc_pages_nodemask")
int bpf_prog_my(struct pt_regs *ctx)
{
char fmt[] = "Haha\n";
long ptr = ctx->r14;
bpf_trace_printk(fmt, sizeof(fmt));
bpf_map_delete_elem(&my_map, &ptr);
return 0;
}
char _license[] SEC("license") = "GPL";
u32 _version SEC("version") = LINUX_VERSION_CODE;