bpf stackmap with build-id lookup (BPF_F_STACK_BUILD_ID) can trigger A-A
deadlock on rq_lock():
rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[...]
Call Trace:
try_to_wake_up+0x1ad/0x590
wake_up_q+0x54/0x80
rwsem_wake+0x8a/0xb0
bpf_get_stack+0x13c/0x150
bpf_prog_fbdaf42eded9fe46_on_event+0x5e3/0x1000
bpf_overflow_handler+0x60/0x100
__perf_event_overflow+0x4f/0xf0
perf_swevent_overflow+0x99/0xc0
___perf_sw_event+0xe7/0x120
__schedule+0x47d/0x620
schedule+0x29/0x90
futex_wait_queue_me+0xb9/0x110
futex_wait+0x139/0x230
do_futex+0x2ac/0xa50
__x64_sys_futex+0x13c/0x180
do_syscall_64+0x42/0x100
entry_SYSCALL_64_after_hwframe+0x44/0xa9
The bpf program runs from a perf event that fires inside __schedule() with
rq_lock held; bpf_get_stack() with build-id then does up_read() on mmap_sem,
and the resulting rwsem_wake() -> wake_up_q() -> try_to_wake_up() tries to
take the same rq_lock again.

This can be reproduced by:
1. Start a multi-threaded program that does parallel mmap() and malloc();
2. taskset the program to 2 CPUs;
3. Attach a bpf program to trace_sched_switch and gather stackmaps with
   build-id, e.g. with trace.py from bcc tools:
trace.py -U -p <pid> -s <some-bin,some-lib> t:sched:sched_switch
A sample reproducer is attached at the end.
Fix this by checking whether the current CPU's rq_lock is already held. If
so, postpone the up_read() to irq_work, just as in the in_nmi() case.
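Note: the diff below only touches kernel/bpf/stackmap.c, so this_rq_is_locked()
is assumed to be provided by a separate scheduler-side change. A minimal sketch
of such a helper (hypothetical, for illustration only) could look like:

	/* kernel/sched/core.c -- sketch, not part of this patch */
	bool this_rq_is_locked(void)
	{
		struct rq *rq;

		rq = this_rq();
		return raw_spin_is_locked(&rq->lock);
	}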
Fixes: 615755a77b24 ("bpf: extend stackmap to save binary_build_id+offset instead of address")
Cc: [email protected] # v4.17+
Cc: Peter Zijlstra <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Reported-by: Tejun Heo <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Reproducer:
============================ 8< ============================
/* Build with something like: gcc -O2 -pthread repro.c -o repro
 * Run as: ./repro <path-to-some-binary-or-library>
 * (build command and THREAD_COUNT value are not from the original posting;
 *  any reasonably large thread count should do)
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define THREAD_COUNT 40

char *filename;

void *worker(void *p)
{
	void *ptr;
	int fd;
	char *pptr;

	fd = open(filename, O_RDONLY);
	if (fd < 0)
		return NULL;

	/* Hammer mmap()/munmap() and malloc()/free() so the threads keep
	 * taking mmap_sem and scheduling in and out.
	 */
	while (1) {
		struct timespec ts = {0, 1000 + rand() % 2000};

		ptr = mmap(NULL, 4096 * 64, PROT_READ, MAP_PRIVATE, fd, 0);
		usleep(1);
		if (ptr == MAP_FAILED) {
			printf("failed to mmap\n");
			break;
		}
		munmap(ptr, 4096 * 64);
		usleep(1);

		pptr = malloc(1);
		usleep(1);
		pptr[0] = 1;
		usleep(1);
		free(pptr);
		usleep(1);

		nanosleep(&ts, NULL);
	}
	close(fd);
	return NULL;
}

int main(int argc, char *argv[])
{
	int i;
	pthread_t threads[THREAD_COUNT];

	if (argc < 2)
		return 0;
	filename = argv[1];

	for (i = 0; i < THREAD_COUNT; i++) {
		if (pthread_create(threads + i, NULL, worker, NULL)) {
			fprintf(stderr, "Error creating thread\n");
			return 0;
		}
	}

	for (i = 0; i < THREAD_COUNT; i++)
		pthread_join(threads[i], NULL);
	return 0;
}
============================ 8< ============================
---
kernel/bpf/stackmap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 052580c33d26..3b278f6b0c3e 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -287,7 +287,7 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
 	bool irq_work_busy = false;
 	struct stack_map_irq_work *work = NULL;
 
-	if (in_nmi()) {
+	if (in_nmi() || this_rq_is_locked()) {
 		work = this_cpu_ptr(&up_read_work);
 		if (work->irq_work.flags & IRQ_WORK_BUSY)
 			/* cannot queue more up_read, fallback */
--
2.17.1
On Wed, Oct 09, 2019 at 11:19:16PM -0700, Song Liu wrote:
> bpf stackmap with build-id lookup (BPF_F_STACK_BUILD_ID) can trigger A-A
> deadlock on rq_lock():
>
> rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> [...]
> Call Trace:
> try_to_wake_up+0x1ad/0x590
> wake_up_q+0x54/0x80
> rwsem_wake+0x8a/0xb0
> bpf_get_stack+0x13c/0x150
> bpf_prog_fbdaf42eded9fe46_on_event+0x5e3/0x1000
> bpf_overflow_handler+0x60/0x100
> __perf_event_overflow+0x4f/0xf0
> perf_swevent_overflow+0x99/0xc0
> ___perf_sw_event+0xe7/0x120
> __schedule+0x47d/0x620
> schedule+0x29/0x90
> futex_wait_queue_me+0xb9/0x110
> futex_wait+0x139/0x230
> do_futex+0x2ac/0xa50
> __x64_sys_futex+0x13c/0x180
> do_syscall_64+0x42/0x100
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
> index 052580c33d26..3b278f6b0c3e 100644
> --- a/kernel/bpf/stackmap.c
> +++ b/kernel/bpf/stackmap.c
> @@ -287,7 +287,7 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
> bool irq_work_busy = false;
> struct stack_map_irq_work *work = NULL;
>
> - if (in_nmi()) {
> + if (in_nmi() || this_rq_is_locked()) {
> work = this_cpu_ptr(&up_read_work);
> if (work->irq_work.flags & IRQ_WORK_BUSY)
> /* cannot queue more up_read, fallback */
This is horrific crap. Just say no to that get_build_id_offset()
trainwreck.
On 10/10/19 12:36 AM, Peter Zijlstra wrote:
> On Wed, Oct 09, 2019 at 11:19:16PM -0700, Song Liu wrote:
>> bpf stackmap with build-id lookup (BPF_F_STACK_BUILD_ID) can trigger A-A
>> deadlock on rq_lock():
>>
>> [...]
>>
>
>> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
>> index 052580c33d26..3b278f6b0c3e 100644
>> --- a/kernel/bpf/stackmap.c
>> +++ b/kernel/bpf/stackmap.c
>> @@ -287,7 +287,7 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
>> bool irq_work_busy = false;
>> struct stack_map_irq_work *work = NULL;
>>
>> - if (in_nmi()) {
>> + if (in_nmi() || this_rq_is_locked()) {
>> work = this_cpu_ptr(&up_read_work);
>> if (work->irq_work.flags & IRQ_WORK_BUSY)
>> /* cannot queue more up_read, fallback */
>
> This is horrific crap. Just say no to that get_build_id_offset()
> trainwreck.
this is not a helpful comment.
What issues do you see with this approach?
On Thu, Oct 10, 2019 at 05:19:01PM +0000, Alexei Starovoitov wrote:
> On 10/10/19 12:36 AM, Peter Zijlstra wrote:
> > On Wed, Oct 09, 2019 at 11:19:16PM -0700, Song Liu wrote:
> >> bpf stackmap with build-id lookup (BPF_F_STACK_BUILD_ID) can trigger A-A
> >> deadlock on rq_lock():
> >>
> >> [...]
> >
> > This is horrific crap. Just say no to that get_build_id_offset()
> > trainwreck.
>
> this is not a helpful comment.
> What issues do you see with this approach?
It will still generate deadlocks if I place a tracepoint inside a lock
that nests inside rq->lock, and it won't ever be able to detect that.
Say you do the very same thing on trace_hrtimer_start(), which is under
cpu_base->lock, which nests inside rq->lock. That should give you an
AB-BA.
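Spelled out, the two lock orders would roughly be:

  scheduler:   rq->lock -> cpu_base->lock   (hrtimer started with rq->lock held)
  tracepoint:  cpu_base->lock -> rq->lock   (trace_hrtimer_start() -> bpf prog ->
                                             up_read() -> rwsem_wake() ->
                                             wake_up_q() -> try_to_wake_up())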
tracepoints / perf-overflow should _never_ take locks.
All of stack_map_get_build_id_offset() is just disgusting games; I did
tell you guys how to do lockless vma lookups a few years ago -- and yes,
that is invasive core mm surgery. But this is just disgusting hacks for
not wanting to do it right.
Basically the only semi-sane thing to do with that trainwreck is
s/in_nmi()/true/ and pray.
On top of that I just hate buildids in general.
On 10/10/19 10:46 AM, Peter Zijlstra wrote:
> On Thu, Oct 10, 2019 at 05:19:01PM +0000, Alexei Starovoitov wrote:
>> On 10/10/19 12:36 AM, Peter Zijlstra wrote:
>>> On Wed, Oct 09, 2019 at 11:19:16PM -0700, Song Liu wrote:
>>>> bpf stackmap with build-id lookup (BPF_F_STACK_BUILD_ID) can trigger A-A
>>>> deadlock on rq_lock():
>>>>
>>>> [...]
>>>
>>> This is horrific crap. Just say no to that get_build_id_offset()
>>> trainwreck.
>>
>> this is not a helpful comment.
>> What issues do you see with this approach?
>
> It will still generate deadlocks if I place a tracepoint inside a lock
> that nests inside rq->lock, and it won't ever be able to detect that.
> Say you do the very same thing on trace_hrtimer_start(), which is under
> cpu_base->lock, which nests inside rq->lock. That should give you an
> AB-BA.
>
> tracepoints / perf-overflow should _never_ take locks.
>
> All of stack_map_get_build_id_offset() is just disgusting games; I did
> tell you guys how to do lockless vma lookups a few years ago -- and yes,
> that is invasive core mm surgery. But this is just disgusting hacks for
> not wanting to do it right.
you mean speculative page fault stuff?
That was my hope as well and I offered Laurent all the help to land it.
Yet after a year since we've talked the patches are not any closer
to landing.
Any other 'invasive mm surgery' you have in mind?
> Basically the only semi-sane thing to do with that trainwreck is
> s/in_nmi()/true/ and pray.
>
> On top of that I just hate buildids in general.
Emotions aside... build_id is useful and used in production.
It's used widely because it solves real problems.
This deadlock is from real servers and not from some sanitizer wannabe.
Hence we need to fix it as cleanly as possible and quickly.
s/in_nmi/true/ is certainly an option.
I'm worried about the overhead of doing irq_work_queue() all the time.
But I'm not familiar with the mechanism enough to justify the concerns.
Would it make sense to do s/in_nmi/irqs_disabled/ instead?
On Thu, Oct 10, 2019 at 06:06:14PM +0000, Alexei Starovoitov wrote:
> On 10/10/19 10:46 AM, Peter Zijlstra wrote:
> > All of stack_map_get_build_id_offset() is just disgusting games; I did
> > tell you guys how to do lockless vma lookups a few years ago -- and yes,
> > that is invasive core mm surgery. But this is just disgusting hacks for
> > not wanting to do it right.
>
> you mean speculative page fault stuff?
> That was my hope as well and I offered Laurent all the help to land it.
> Yet after a year since we've talked the patches are not any closer
> to landing.
> Any other 'invasive mm surgery' you have in mind?
Indeed that series. It had RCU managed VMAs and lockless VMA lookups,
which is exactly what you need here.
> > Basically the only semi-sane thing to do with that trainwreck is
> > s/in_nmi()/true/ and pray.
> >
> > On top of that I just hate buildids in general.
>
> Emotions aside... build_id is useful and used in production.
> It's used widely because it solves real problems.
AFAIU it solves the problem of you not knowing what version of the
binary runs where; which I was hoping your cloud infrastructure thing
would actually know already.
Anyway, I know what it does, I just don't necessarily agree it is the
right way around that particular problem (also, the way I'm personally
affected is that perf-record is dead slow by default due to build-id
post-processing).
And it obviously leads to horrible hacks like the code currently under
discussion :/
> This deadlock is from real servers and not from some sanitizer wannabe.
If you enable CFS bandwidth control and run this function on the
trace_hrtimer_start() tracepoint, you should be able to trigger a real
AB-BA lockup.
> Hence we need to fix it as cleanly as possible and quickly.
> s/in_nmi/true/ is certainly an option.
That is the best option; because tracepoints / perf-overflow handlers
really should not be taking any locks.
> I'm worried about the overhead of doing irq_work_queue() all the time.
> But I'm not familiar with the mechanism enough to justify the concerns.
> Would it make sense to do s/in_nmi/irqs_disabled/ instead?
irqs_disabled() should work in this particular case because rq->lock
(and therefore all its nested locks) are IRQ-safe.
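IOW the stackmap check would presumably become something like:

-	if (in_nmi()) {
+	if (irqs_disabled()) {

(or in_nmi() || irqs_disabled(), if plain irqs_disabled() turns out not to
cover the NMI path).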
Thanks Peter!
> On Oct 14, 2019, at 2:09 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Oct 10, 2019 at 06:06:14PM +0000, Alexei Starovoitov wrote:
>> On 10/10/19 10:46 AM, Peter Zijlstra wrote:
>
>>> All of stack_map_get_build_id_offset() is just disgusting games; I did
>>> tell you guys how to do lockless vma lookups a few years ago -- and yes,
>>> that is invasive core mm surgery. But this is just disgusting hacks for
>>> not wanting to do it right.
>>
>> you mean speculative page fault stuff?
>> That was my hope as well and I offered Laurent all the help to land it.
>> Yet after a year since we've talked the patches are not any closer
>> to landing.
>> Any other 'invasive mm surgery' you have in mind?
>
> Indeed that series. It had RCU managed VMAs and lockless VMA lookups,
> which is exactly what you need here.
Lockless VMA lookups will be really useful. It would resolve all the
pains we are having here.
I remember Google folks also mentioned at LPC that they would like
a better mechanism to confirm build-ids in perf.
>
> [...]
>
>> Hence we need to fix it as cleanly as possible and quickly.
>> s/in_nmi/true/ is certainly an option.
>
> That is the best option; because tracepoints / perf-overflow handlers
> really should not be taking any locks.
>
>> I'm worried about the overhead of doing irq_work_queue() all the time.
>> But I'm not familiar with the mechanism enough to justify the concerns.
>> Would it make sense to do s/in_nmi/irqs_disabled/ instead?
>
> irqs_disabled() should work in this particular case because rq->lock
> (and therefore all its nested locks) are IRQ-safe.
We worry about the overhead of irq_work for every single stackmap
lookup. So we would like to go with the irqs_disabled() check. I just
sent v2 of the patch.
Thanks again,
Song