Hi Paul,
I've been debugging the issue reported by Daniel:
http://thread.gmane.org/gmane.linux.kernel/1974304/focus=1974304
and it seems I have narrowed it down to a recursive call_rcu().
From trace_preempt_on() I'm doing:
e = kmalloc(sizeof(struct elem), GFP_ATOMIC)
kfree_rcu(e, rcu)
which is causing all sorts of corruption, like:
[ 2.074175] WARNING: CPU: 0 PID: 3 at ../lib/debugobjects.c:263
debug_print_object+0x8c/0xb0()
[ 2.075567] ODEBUG: active_state not available (active state 0)
object type: rcu_head hint: (null)
[ 2.102141] WARNING: CPU: 0 PID: 3 at ../lib/debugobjects.c:263
debug_print_object+0x8c/0xb0()
[ 2.103547] ODEBUG: deactivate not available (active state 0) object
type: rcu_head hint: (null)
[ 2.253995] WARNING: CPU: 0 PID: 7 at ../kernel/rcu/tree.c:2976
__call_rcu.constprop.67+0x1e5/0x350()
[ 2.255510] __call_rcu(): Leaked duplicate callback
Sometimes stack looks like:
[ 2.145163] WARNING: CPU: 0 PID: 102 at ../lib/debugobjects.c:263
debug_print_object+0x8c/0xb0()
[ 2.147465] ODEBUG: active_state not available (active state 0)
object type: rcu_head hint: (null)
[ 2.148022] Modules linked in:
[ 2.148022] CPU: 0 PID: 102 Comm: systemd-udevd Not tainted
4.1.0-rc7+ #653
[ 2.148022] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.7.5-rc1-0-gb1d4dc9-20140515_140003-nilsson.home.kraxel.org
04/01/2014
[ 2.148022] ffffffff81a34f77 ffff88000fc03d18 ffffffff81781ed4
0000000000000105
[ 2.148022] ffff88000fc03d68 ffff88000fc03d58 ffffffff81064e57
0000000000000000
[ 2.148022] ffff88000fc03e20 ffffffff81c50f00 ffffffff81a34fdf
0000000000000286
[ 2.148022] Call Trace:
[ 2.148022] <IRQ> [<ffffffff81781ed4>] dump_stack+0x4f/0x7b
[ 2.148022] [<ffffffff81064e57>] warn_slowpath_common+0x97/0xe0
[ 2.148022] [<ffffffff81064f56>] warn_slowpath_fmt+0x46/0x50
[ 2.148022] [<ffffffff813d305c>] debug_print_object+0x8c/0xb0
[ 2.148022] [<ffffffff813d3cf6>] ? debug_object_active_state+0x66/0x160
[ 2.148022] [<ffffffff813d3d81>] debug_object_active_state+0xf1/0x160
[ 2.148022] [<ffffffff810e33b1>] rcu_process_callbacks+0x301/0xae0
[ 2.148022] [<ffffffff810e3397>] ? rcu_process_callbacks+0x2e7/0xae0
[ 2.148022] [<ffffffff810e9f28>] ? run_timer_softirq+0x218/0x4c0
[ 2.148022] [<ffffffff81069dff>] __do_softirq+0x14f/0x670
[ 2.148022] [<ffffffff8106a5f5>] irq_exit+0xa5/0xb0
[ 2.148022] [<ffffffff8178ea6a>] smp_apic_timer_interrupt+0x4a/0x60
[ 2.148022] [<ffffffff8178cea0>] apic_timer_interrupt+0x70/0x80
[ 2.148022] <EOI> [<ffffffff813d363c>] ?
debug_object_activate+0x9c/0x1e0
[ 2.148022] [<ffffffff8178bd47>] ? _raw_spin_unlock_irqrestore+0x67/0x80
[ 2.148022] [<ffffffff813d36f6>] debug_object_activate+0x156/0x1e0
[ 2.148022] [<ffffffff810de037>] rcuhead_fixup_activate+0x37/0x40
[ 2.148022] [<ffffffff813d36a1>] debug_object_activate+0x101/0x1e0
[ 2.148022] [<ffffffff8178bd2b>] ? _raw_spin_unlock_irqrestore+0x4b/0x80
[ 2.148022] [<ffffffff810e2d26>] __call_rcu.constprop.67+0x46/0x350
[ 2.148022] [<ffffffff813d3524>] ? __debug_object_init+0x3f4/0x430
[ 2.148022] [<ffffffff8178bd2b>] ? _raw_spin_unlock_irqrestore+0x4b/0x80
[ 2.148022] [<ffffffff810e308a>] kfree_call_rcu+0x1a/0x20
[ 2.148022] [<ffffffff8115e230>] trace_preempt_on+0x180/0x290
[ 2.148022] [<ffffffff8115e17e>] ? trace_preempt_on+0xce/0x290
[ 2.148022] [<ffffffff810991c3>] preempt_count_sub+0x73/0xf0
[ 2.148022] [<ffffffff8178bd2b>] _raw_spin_unlock_irqrestore+0x4b/0x80
[ 2.148022] [<ffffffff813d3524>] __debug_object_init+0x3f4/0x430
[ 2.148022] [<ffffffff8115e23c>] ? trace_preempt_on+0x18c/0x290
[ 2.148022] [<ffffffff813d357b>] debug_object_init+0x1b/0x20
[ 2.148022] [<ffffffff810de028>] rcuhead_fixup_activate+0x28/0x40
[ 2.148022] [<ffffffff813d36a1>] debug_object_activate+0x101/0x1e0
[ 2.148022] [<ffffffff812036b0>] ? get_max_files+0x20/0x20
[ 2.148022] [<ffffffff810e2d26>] __call_rcu.constprop.67+0x46/0x350
[ 2.148022] [<ffffffff810e30a7>] call_rcu+0x17/0x20
[ 2.148022] [<ffffffff81203903>] __fput+0x183/0x200
[ 2.148022] [<ffffffff812039ce>] ____fput+0xe/0x10
[ 2.148022] [<ffffffff81088dd5>] task_work_run+0xb5/0xe0
[ 2.148022] [<ffffffff81002c74>] do_notify_resume+0x64/0x80
[ 2.148022] [<ffffffff8178c26c>] int_signal+0x12/0x17
My reading of the code is that the debug_object_*() bits are reporting a
real problem. In the above trace the call
debug_rcu_head_unqueue(list);
from rcu_do_batch() is not finding 'list' in the tracked objects.
I know that doing call_rcu() from trace_preempt is ill-advised,
but I still want to understand why call_rcu() corrupts the memory.
Attaching a patch that I'm using for debugging.
It does a recursion/preemption check, so the number of nested call_rcu()
invocations is no more than 2.
Also, if I replace the kfree_rcu() in this patch with a regular kfree(),
everything works fine.
I'm seeing these crashes in a VM with a _single_ CPU.
Kernel is built with CONFIG_PREEMPT, CONFIG_PREEMPT_TRACER and
CONFIG_DEBUG_OBJECTS_RCU_HEAD.
It is also interesting that the size of
struct elem {
	u64 pad[32];
	struct rcu_head rcu;
};
that I'm using in kmalloc()/kfree_rcu() changes the crash:
with zero padding the kernel just locks up, with pad[1] I see one type
of odebug warning, and with pad[32] another.
Any advice on where to look is greatly appreciated.
Thanks!
On Mon, Jun 15, 2015 at 03:24:29PM -0700, Alexei Starovoitov wrote:
> Hi Paul,
>
> I've been debugging the issue reported by Daniel:
> http://thread.gmane.org/gmane.linux.kernel/1974304/focus=1974304
> and it seems I narrowed it down to recursive call_rcu.
By "recursive call_rcu()", you mean invoking call_rcu() twice in a row
on the same memory, like this?
call_rcu(&p->rh, some_callback_function);
do_something_quick();
call_rcu(&p->rh, another_callback_function);
Because this is perfectly legal:
void endpoint_callback_function(struct rcu_head *p)
{
	struct foo *fp = container_of(p, struct foo, rh);

	kfree(fp);
}

void recirculating_callback_function(struct rcu_head *p)
{
	call_rcu(p, endpoint_callback_function);
}

...

call_rcu(&fp->rh, recirculating_callback_function);
This sort of thing is actually used in some situations involving
RCU and reference counters.
> From trace_preempt_on() I'm doing:
> e = kmalloc(sizeof(struct elem), GFP_ATOMIC)
> kfree_rcu(e, rcu)
As written, this should be OK, assuming that "rcu" is a field of type
"struct rcu_head" (not a pointer!) within "struct elem".
> which causing all sorts of corruptions like:
> [ 2.074175] WARNING: CPU: 0 PID: 3 at ../lib/debugobjects.c:263
> debug_print_object+0x8c/0xb0()
> [ 2.075567] ODEBUG: active_state not available (active state 0)
> object type: rcu_head hint: (null)
>
> [ 2.102141] WARNING: CPU: 0 PID: 3 at ../lib/debugobjects.c:263
> debug_print_object+0x8c/0xb0()
> [ 2.103547] ODEBUG: deactivate not available (active state 0)
> object type: rcu_head hint: (null)
>
> [ 2.253995] WARNING: CPU: 0 PID: 7 at ../kernel/rcu/tree.c:2976
> __call_rcu.constprop.67+0x1e5/0x350()
> [ 2.255510] __call_rcu(): Leaked duplicate callback
>
> Sometimes stack looks like:
> [ 2.145163] WARNING: CPU: 0 PID: 102 at ../lib/debugobjects.c:263
> debug_print_object+0x8c/0xb0()
> [ 2.147465] ODEBUG: active_state not available (active state 0)
> object type: rcu_head hint: (null)
> [ 2.148022] Modules linked in:
> [ 2.148022] CPU: 0 PID: 102 Comm: systemd-udevd Not tainted
> 4.1.0-rc7+ #653
> [ 2.148022] Hardware name: QEMU Standard PC (i440FX + PIIX,
> 1996), BIOS
> rel-1.7.5-rc1-0-gb1d4dc9-20140515_140003-nilsson.home.kraxel.org
> 04/01/2014
> [ 2.148022] ffffffff81a34f77 ffff88000fc03d18 ffffffff81781ed4
> 0000000000000105
> [ 2.148022] ffff88000fc03d68 ffff88000fc03d58 ffffffff81064e57
> 0000000000000000
> [ 2.148022] ffff88000fc03e20 ffffffff81c50f00 ffffffff81a34fdf
> 0000000000000286
> [ 2.148022] Call Trace:
> [ 2.148022] <IRQ> [<ffffffff81781ed4>] dump_stack+0x4f/0x7b
> [ 2.148022] [<ffffffff81064e57>] warn_slowpath_common+0x97/0xe0
> [ 2.148022] [<ffffffff81064f56>] warn_slowpath_fmt+0x46/0x50
> [ 2.148022] [<ffffffff813d305c>] debug_print_object+0x8c/0xb0
> [ 2.148022] [<ffffffff813d3cf6>] ? debug_object_active_state+0x66/0x160
> [ 2.148022] [<ffffffff813d3d81>] debug_object_active_state+0xf1/0x160
> [ 2.148022] [<ffffffff810e33b1>] rcu_process_callbacks+0x301/0xae0
> [ 2.148022] [<ffffffff810e3397>] ? rcu_process_callbacks+0x2e7/0xae0
> [ 2.148022] [<ffffffff810e9f28>] ? run_timer_softirq+0x218/0x4c0
> [ 2.148022] [<ffffffff81069dff>] __do_softirq+0x14f/0x670
> [ 2.148022] [<ffffffff8106a5f5>] irq_exit+0xa5/0xb0
> [ 2.148022] [<ffffffff8178ea6a>] smp_apic_timer_interrupt+0x4a/0x60
> [ 2.148022] [<ffffffff8178cea0>] apic_timer_interrupt+0x70/0x80
> [ 2.148022] <EOI> [<ffffffff813d363c>] ?
> debug_object_activate+0x9c/0x1e0
> [ 2.148022] [<ffffffff8178bd47>] ? _raw_spin_unlock_irqrestore+0x67/0x80
> [ 2.148022] [<ffffffff813d36f6>] debug_object_activate+0x156/0x1e0
> [ 2.148022] [<ffffffff810de037>] rcuhead_fixup_activate+0x37/0x40
> [ 2.148022] [<ffffffff813d36a1>] debug_object_activate+0x101/0x1e0
> [ 2.148022] [<ffffffff8178bd2b>] ? _raw_spin_unlock_irqrestore+0x4b/0x80
> [ 2.148022] [<ffffffff810e2d26>] __call_rcu.constprop.67+0x46/0x350
> [ 2.148022] [<ffffffff813d3524>] ? __debug_object_init+0x3f4/0x430
> [ 2.148022] [<ffffffff8178bd2b>] ? _raw_spin_unlock_irqrestore+0x4b/0x80
> [ 2.148022] [<ffffffff810e308a>] kfree_call_rcu+0x1a/0x20
> [ 2.148022] [<ffffffff8115e230>] trace_preempt_on+0x180/0x290
> [ 2.148022] [<ffffffff8115e17e>] ? trace_preempt_on+0xce/0x290
> [ 2.148022] [<ffffffff810991c3>] preempt_count_sub+0x73/0xf0
> [ 2.148022] [<ffffffff8178bd2b>] _raw_spin_unlock_irqrestore+0x4b/0x80
> [ 2.148022] [<ffffffff813d3524>] __debug_object_init+0x3f4/0x430
> [ 2.148022] [<ffffffff8115e23c>] ? trace_preempt_on+0x18c/0x290
> [ 2.148022] [<ffffffff813d357b>] debug_object_init+0x1b/0x20
> [ 2.148022] [<ffffffff810de028>] rcuhead_fixup_activate+0x28/0x40
> [ 2.148022] [<ffffffff813d36a1>] debug_object_activate+0x101/0x1e0
> [ 2.148022] [<ffffffff812036b0>] ? get_max_files+0x20/0x20
> [ 2.148022] [<ffffffff810e2d26>] __call_rcu.constprop.67+0x46/0x350
> [ 2.148022] [<ffffffff810e30a7>] call_rcu+0x17/0x20
> [ 2.148022] [<ffffffff81203903>] __fput+0x183/0x200
> [ 2.148022] [<ffffffff812039ce>] ____fput+0xe/0x10
> [ 2.148022] [<ffffffff81088dd5>] task_work_run+0xb5/0xe0
> [ 2.148022] [<ffffffff81002c74>] do_notify_resume+0x64/0x80
> [ 2.148022] [<ffffffff8178c26c>] int_signal+0x12/0x17
>
> My reading of the code is debug_object_*() bits are reporting real
> problem. In the above trace the call
> debug_rcu_head_unqueue(list);
> from rcu_do_batch() is not finding 'list' in tracked objects.
>
> I know that doing call_rcu() from trace_preempt is ill advised,
> but I still want to understand why call_rcu corrupts the memory.
Hmmm... This is what I would expect if you invoked call_rcu()
(or kfree_rcu(), for that matter) on some memory, then freed it
before the grace period ended. This would cause the same problems
as any other use-after-free error. Might this be happening?
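For instance, a pattern like this (purely illustrative) would produce this
kind of debug-objects splat:

	kfree_rcu(e, rcu);	/* e->rcu is now queued on a per-CPU list    */
	kfree(e);		/* freeing e now lets the allocator reuse it */
				/* while RCU still owns e->rcu -> corruption */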
> Attaching a patch that I'm using for debugging.
> It's doing recursion preemption check, so number of nested call_rcu
> is no more than 2.
Oh... One important thing is that both call_rcu() and kfree_rcu()
use per-CPU variables, managing a per-CPU linked list. This is why
they disable interrupts. If you do another call_rcu() in the middle
of the first one in just the wrong place, you will have two entities
concurrently manipulating the same linked list, which will not go well.
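A highly simplified illustration of the hazard (this is not the real
rcu_data layout, just the shape of the problem):

	struct cb_list {
		struct rcu_head *head;
		struct rcu_head **tail;		/* points at the last ->next */
	};

	static void enqueue(struct cb_list *l, struct rcu_head *rhp)
	{
		rhp->next = NULL;
		*l->tail = rhp;			/* if a nested enqueue runs between   */
		l->tail = &rhp->next;		/* these two stores, one callback is  */
	}					/* leaked and ->tail no longer points */
						/* at the real end of the list.       */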
Maybe mark call_rcu() and the things it calls as notrace? Or you
could maintain a separate per-CPU linked list that gathered up the
stuff to be kfree()ed after a grace period, and some time later
feed them to kfree_rcu()?
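Roughly along these lines (a hypothetical sketch only; the per-CPU llist,
the union layout, and the function names are made up for illustration):

	#include <linux/llist.h>
	#include <linux/percpu.h>
	#include <linux/slab.h>

	struct elem {
		u64 pad[32];
		union {
			struct rcu_head rcu;
			struct llist_node llnode;
		};
	};

	static DEFINE_PER_CPU(struct llist_head, deferred_free);

	/* Unsafe context (e.g. trace_preempt_on()): just stage the element.
	 * Assumes preemption is already disabled by the caller. */
	static void defer_free(struct elem *e)
	{
		llist_add(&e->llnode, this_cpu_ptr(&deferred_free));
	}

	/* Later, from a context where call_rcu()/kfree_rcu() is safe. */
	static void drain_deferred_free(void)
	{
		struct llist_node *first = llist_del_all(this_cpu_ptr(&deferred_free));
		struct elem *e, *tmp;

		llist_for_each_entry_safe(e, tmp, first, llnode)
			kfree_rcu(e, rcu);
	}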
> Also if I replace kfree_rcu is this patch with a regular kfree,
> all works fine.
>
> I'm seeing this crashes in VM with _single_ cpu.
> Kernel is built with CONFIG_PREEMPT, CONFIG_PREEMPT_TRACER and
> CONFIG_DEBUG_OBJECTS_RCU_HEAD.
No surprise -- when you have lists hanging off of per-CPU variables,
it only takes one CPU to tangle the lists.
> Also interesting that size of
> struct elem {
> u64 pad[32];
> struct rcu_head rcu;
> };
> that I'm using in kmalloc/kfree_rcu changes the crash.
> If padding is zero, kernel just locksup, if pad[1] I see
> one type of odebug warnings, if pad[32] - another.
The usual consequence of racing a pair of callback insertions on the
same CPU would be that one of them gets leaked, and possibly all
subsequent callbacks. So the lockup is no surprise. And there are a
lot of other assumptions in nearby code paths about only one execution
at a time from a given CPU.
> Any advise on where to look is greatly appreciated.
What I don't understand is exactly what you are trying to do. Have more
complex tracers that dynamically allocate memory? If so, having a per-CPU
list that stages memory to be freed so that it can be passed to call_rcu()
in a safe environment might make sense. Of course, that list would need
to be managed carefully!
Or am I missing the point of the code below?
Thanx, Paul
> Thanks!
>
> diff --git a/kernel/trace/trace_irqsoff.c b/kernel/trace/trace_irqsoff.c
> index 8523ea345f2b..89433a83dd2d 100644
> --- a/kernel/trace/trace_irqsoff.c
> +++ b/kernel/trace/trace_irqsoff.c
> @@ -13,6 +13,7 @@
>  #include <linux/uaccess.h>
>  #include <linux/module.h>
>  #include <linux/ftrace.h>
> +#include <linux/slab.h>
>  
>  #include "trace.h"
>  
> @@ -510,8 +511,42 @@ EXPORT_SYMBOL(trace_hardirqs_off_caller);
>  #endif /* CONFIG_IRQSOFF_TRACER */
>  
>  #ifdef CONFIG_PREEMPT_TRACER
> +struct elem {
> +	u64 pad[32];
> +	struct rcu_head rcu;
> +};
> +
> +static DEFINE_PER_CPU(int, prog_active);
> +static void * test_alloc(void)
> +{
> +	struct elem *e = NULL;
> +
> +	if (in_nmi())
> +		return e;
> +
> +	preempt_disable_notrace();
> +	if (unlikely(__this_cpu_inc_return(prog_active) != 1))
> +		goto out;
> +
> +	rcu_read_lock();
> +	e = kmalloc(sizeof(struct elem), GFP_ATOMIC);
> +	rcu_read_unlock();
> +	if (!e)
> +		goto out;
> +
> +	kfree_rcu(e, rcu);
> +out:
> +	__this_cpu_dec(prog_active);
> +	preempt_enable_no_resched_notrace();
> +	return e;
> +}
> +
>  void trace_preempt_on(unsigned long a0, unsigned long a1)
>  {
> +	void * buf = 0;
> +	static int cnt = 0;
> +	if (cnt++ > 3000000)
> +		buf = test_alloc();
>  	if (preempt_trace() && !irq_trace())
>  		stop_critical_timing(a0, a1);
>  }
On 6/15/15 4:07 PM, Paul E. McKenney wrote:
>
> Oh... One important thing is that both call_rcu() and kfree_rcu()
> use per-CPU variables, managing a per-CPU linked list. This is why
> they disable interrupts. If you do another call_rcu() in the middle
> of the first one in just the wrong place, you will have two entities
> concurrently manipulating the same linked list, which will not go well.
yes. I'm trying to find that 'wrong place'.
The trace.patch does a kmalloc/kfree_rcu for every preempt_enable, so any
spin_unlock() called by the first call_rcu() will trigger a 2nd, recursive
call_rcu(). But as far as I can tell, the RCU code looks ok everywhere:
call_rcu
  debug_rcu_head_[un]queue
    debug_object_activate
      spin_unlock
and debug_rcu_head*() seems to be called from safe places
where local_irq is enabled.
> Maybe mark call_rcu() and the things it calls as notrace? Or you
> could maintain a separate per-CPU linked list that gathered up the
> stuff to be kfree()ed after a grace period, and some time later
> feed them to kfree_rcu()?
yeah, I can think of this or 10 other ways to fix it within the
kprobe+bpf area, but I think something like call_rcu_notrace()
may be a better solution.
Or maybe a single generic 'fix' for call_rcu() will be enough,
if it doesn't affect all other users.
> The usual consequence of racing a pair of callback insertions on the
> same CPU would be that one of them gets leaked, and possible all
> subsequent callbacks. So the lockup is no surprise. And there are a
> lot of other assumptions in nearby code paths about only one execution
> at a time from a given CPU.
yes, I don't think calling a 2nd call_rcu() from preempt_enable() violates
these assumptions. local_irq does its job. No extra stuff is called when
interrupts are disabled.
>> Any advise on where to look is greatly appreciated.
>
> What I don't understand is exactly what you are trying to do. Have more
> complex tracers that dynamically allocate memory? If so, having a per-CPU
> list that stages memory to be freed so that it can be passed to call_rcu()
> in a safe environment might make sense. Of course, that list would need
> to be managed carefully!
yes. We tried to compute the time the kernel spends between
preempt_disable->preempt_enable and plot a histogram of latencies.
> Or am I missing the point of the code below?
this trace.patch is a reproducer of the call_rcu() crashes; it does:
preempt_enable
  trace_preempt_on
    kfree_call_rcu
The real call stack is:
preempt_enable
  trace_preempt_on
    kprobe_int3_handler
      trace_call_bpf
        bpf_map_update_elem
          htab_map_update_elem
            kfree_call_rcu
On Mon, Jun 15, 2015 at 06:09:56PM -0700, Alexei Starovoitov wrote:
> On 6/15/15 4:07 PM, Paul E. McKenney wrote:
> >
> >Oh... One important thing is that both call_rcu() and kfree_rcu()
> >use per-CPU variables, managing a per-CPU linked list. This is why
> >they disable interrupts. If you do another call_rcu() in the middle
> >of the first one in just the wrong place, you will have two entities
> >concurrently manipulating the same linked list, which will not go well.
>
> yes. I'm trying to find that 'wrong place'.
> The trace.patch is doing kmalloc/kfree_rcu for every preempt_enable.
> So any spin_unlock called by first call_rcu will be triggering
> 2nd recursive to call_rcu.
> But as far as I could understand rcu code that looks ok everywhere.
> call_rcu
> debug_rcu_head_[un]queue
> debug_object_activate
> spin_unlock
>
> and debug_rcu_head* seems to be called from safe places
> where local_irq is enabled.
I do sympathize, but your own testing does demonstrate that it is
very much not OK. ;-)
> >Maybe mark call_rcu() and the things it calls as notrace? Or you
> >could maintain a separate per-CPU linked list that gathered up the
> >stuff to be kfree()ed after a grace period, and some time later
> >feed them to kfree_rcu()?
>
> yeah, I can think of this or 10 other ways to fix it within
> kprobe+bpf area, but I think something like call_rcu_notrace()
> may be a better solution.
> Or may be single generic 'fix' for call_rcu will be enough if
> it doesn't affect all other users.
Why do you believe that it is better to fix it within call_rcu()?
> >The usual consequence of racing a pair of callback insertions on the
> >same CPU would be that one of them gets leaked, and possible all
> >subsequent callbacks. So the lockup is no surprise. And there are a
> >lot of other assumptions in nearby code paths about only one execution
> >at a time from a given CPU.
>
> yes, I don't think calling 2nd call_rcu from preempt_enable violates
> this assumptions. local_irq does it job. No extra stuff is called when
> interrupts are disabled.
Perhaps you are self-deadlocking within __call_rcu_core(). If you have
not already done so, please try running with CONFIG_PROVE_LOCKING=y.
(Yes, I see your point about not calling extra stuff when interrupts
are disabled, and I remember that __call_rcu() avoids that path when
interrupts are disabled, but -something- is clearly going wrong!
Or maybe something momentarily enables interrupts somewhere, and RCU
has just been getting lucky. Or...)
> >>Any advise on where to look is greatly appreciated.
> >
> >What I don't understand is exactly what you are trying to do. Have more
> >complex tracers that dynamically allocate memory? If so, having a per-CPU
> >list that stages memory to be freed so that it can be passed to call_rcu()
> >in a safe environment might make sense. Of course, that list would need
> >to be managed carefully!
>
> yes. We tried to compute the time the kernel spends between
> preempt_disable->preempt_enable and plot a histogram of latencies.
>
> >Or am I missing the point of the code below?
>
> this trace.patch is reproducer of call_rcu crashes that doing:
> preempt_enable
> trace_preempt_on
> kfree_call_rcu
>
> The real call stack is:
> preempt_enable
> trace_preempt_on
> kprobe_int3_handler
> trace_call_bpf
> bpf_map_update_elem
> htab_map_update_elem
> kfree_call_rcu
I suspect that your problem may range quite a bit further than just
call_rcu(). For example, in your stack trace, you have a recursive
call to debug_object_activate(), which might not be such good thing.
Thanx, Paul
On 6/15/15 7:14 PM, Paul E. McKenney wrote:
>
> Why do you believe that it is better to fix it within call_rcu()?
found it:
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 8cf7304b2867..a3be09d482ae 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -935,9 +935,9 @@ bool notrace rcu_is_watching(void)
 {
 	bool ret;
 
-	preempt_disable();
+	preempt_disable_notrace();
 	ret = __rcu_is_watching();
-	preempt_enable();
+	preempt_enable_notrace();
 	return ret;
 }
the rcu_is_watching() and __rcu_is_watching() are already marked
notrace, so imo it's a good 'fix'.
What was happening is that the above preempt_enable() was triggering a
recursive call_rcu() that was indeed messing with the 'rdp' that had been
prepared by __call_rcu(), before __call_rcu_core() could use it.
btw, I also noticed that the local_irq_save() done by note_gp_changes()
is partially redundant: in the __call_rcu_core() path the irqs are
already disabled.
> Perhaps you are self-deadlocking within __call_rcu_core(). If you have
> not already done so, please try running with CONFIG_PROVE_LOCKING=y.
yes, I had CONFIG_PROVE_LOCKING on.
> I suspect that your problem may range quite a bit further than just
> call_rcu(). For example, in your stack trace, you have a recursive
> call to debug_object_activate(), which might not be such good thing.
nope :) recursive debug_object_activate() is fine.
with the above 'fix' the trace.patch is now passing.
Why am I digging into all of this? Well, to find out when it's safe
to finally do call_rcu(). If I use a deferred-kfree approach in bpf
maps, I need to know when it's safe to finally call_rcu(), and that's
not an easy answer.
kprobes can potentially be placed in any part of the call_rcu() stack,
so things can get messy quickly. So it helps to understand the call_rcu()
logic well enough to come up with a good solution.
On 06/16/2015 07:45 AM, Alexei Starovoitov wrote:
> On 6/15/15 7:14 PM, Paul E. McKenney wrote:
>>
>> Why do you believe that it is better to fix it within call_rcu()?
>
> found it:
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 8cf7304b2867..a3be09d482ae 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -935,9 +935,9 @@ bool notrace rcu_is_watching(void)
> {
> bool ret;
>
> - preempt_disable();
> + preempt_disable_notrace();
> ret = __rcu_is_watching();
> - preempt_enable();
> + preempt_enable_notrace();
> return ret;
> }
>
> the rcu_is_watching() and __rcu_is_watching() are already marked
> notrace, so imo it's a good 'fix'.
> What was happening is that the above preempt_enable was triggering
> recursive call_rcu that was indeed messing 'rdp' that was
> prepared by __call_rcu and before __call_rcu_core could use that.
>
> btw, also noticed that local_irq_save done by note_gp_changes
> is partially redundant. In __call_rcu_core path the irqs are
> already disabled.
>
>> Perhaps you are self-deadlocking within __call_rcu_core(). If you have
>> not already done so, please try running with CONFIG_PROVE_LOCKING=y.
>
> yes, I had CONFIG_PROVE_LOCKING on.
>
>> I suspect that your problem may range quite a bit further than just
>> call_rcu(). For example, in your stack trace, you have a recursive
>> call to debug_object_activate(), which might not be such good thing.
>
> nope :) recursive debug_object_activate() is fine.
> with the above 'fix' the trace.patch is now passing.
It still crashes for me with the original test program:
[ 145.908013] [<ffffffff810d1da1>] ? __rcu_reclaim+0x101/0x3d0
[ 145.908013] [<ffffffff810d1ca0>] ? rcu_barrier_func+0x250/0x250
[ 145.908013] [<ffffffff810abc03>] ? trace_hardirqs_on_caller+0xf3/0x240
[ 145.908013] [<ffffffff810d9afa>] rcu_do_batch+0x2ea/0x6b0
[ 145.908013] [<ffffffff8151a803>] ? __this_cpu_preempt_check+0x13/0x20
[ 145.908013] [<ffffffff810abc03>] ? trace_hardirqs_on_caller+0xf3/0x240
[ 145.921092] [<ffffffff81b6f072>] ? _raw_spin_unlock_irqrestore+0x42/0x80
[ 145.921092] [<ffffffff810d2794>] ? rcu_report_qs_rnp+0x1b4/0x3f0
[ 145.921092] [<ffffffff8151a803>] ? __this_cpu_preempt_check+0x13/0x20
[ 145.921092] [<ffffffff810d9f96>] rcu_process_callbacks+0xd6/0x6a0
[ 145.921092] [<ffffffff81060042>] __do_softirq+0xe2/0x670
[ 145.921092] [<ffffffff810605ef>] run_ksoftirqd+0x1f/0x60
[ 145.921092] [<ffffffff81081843>] smpboot_thread_fn+0x193/0x2a0
[ 145.921092] [<ffffffff810816b0>] ? sort_range+0x30/0x30
[ 145.921092] [<ffffffff8107da12>] kthread+0xf2/0x110
[ 145.921092] [<ffffffff81b6a523>] ? wait_for_completion+0xc3/0x120
[ 145.921092] [<ffffffff8108a77b>] ? preempt_count_sub+0xab/0xf0
[ 145.921092] [<ffffffff8107d920>] ? kthread_create_on_node+0x240/0x240
[ 145.921092] [<ffffffff81b6ff02>] ret_from_fork+0x42/0x70
[ 145.921092] [<ffffffff8107d920>] ? kthread_create_on_node+0x240/0x240
> Why I'm digging into all of these? Well, to find out when
> it's safe to finally do call_rcu. If I will use deferred kfree
> approach in bpf maps, I need to know when it's safe to finally
> call_rcu and it's not an easy answer.
> kprobes potentially can be placed in any part of call_rcu stack,
> so things can go messy quickly. So it helps to understand the call_rcu
> logic well enough to come up with good solution.
On 6/15/15 11:06 PM, Daniel Wagner wrote:
>> with the above 'fix' the trace.patch is now passing.
> It still crashes for me with the original test program
>
> [ 145.908013] [<ffffffff810d1da1>] ? __rcu_reclaim+0x101/0x3d0
> [ 145.908013] [<ffffffff810d1ca0>] ? rcu_barrier_func+0x250/0x250
> [ 145.908013] [<ffffffff810abc03>] ? trace_hardirqs_on_caller+0xf3/0x240
> [ 145.908013] [<ffffffff810d9afa>] rcu_do_batch+0x2ea/0x6b0
yes. full bpf test still crashes.
That's why I said trace.patch is passing ;)
There is something else in there. One 'fix' at a time.
On 06/16/2015 08:25 AM, Alexei Starovoitov wrote:
> On 6/15/15 11:06 PM, Daniel Wagner wrote:
>>> with the above 'fix' the trace.patch is now passing.
>> It still crashes for me with the original test program
>>
>> [ 145.908013] [<ffffffff810d1da1>] ? __rcu_reclaim+0x101/0x3d0
>> [ 145.908013] [<ffffffff810d1ca0>] ? rcu_barrier_func+0x250/0x250
>> [ 145.908013] [<ffffffff810abc03>] ?
>> trace_hardirqs_on_caller+0xf3/0x240
>> [ 145.908013] [<ffffffff810d9afa>] rcu_do_batch+0x2ea/0x6b0
>
> yes. full bpf test still crashes.
> That's why I said trace.patch is passing ;)
> There is something else in there. One 'fix' at a time.
Ah, sorry, I read it as if it were working now :) Anyway, I'll keep
looking as well.
Yesterday I wrote a small torture program for the map implementation,
just to rule out memory corruption there.
Are you interested in it? If yes, I could clean it up a bit.
It is based on Paul's RCU torturing (finally I can write
this word without a typo :))
Here is a rough/dirty version:
#include <linux/types.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/err.h>
#include <linux/module.h>
#include <linux/random.h>
#include <linux/bpf.h>
#include <linux/torture.h>
#include <linux/slab.h>
struct bpf_test {
	struct bpf_map *map;
	const struct bpf_map_ops *ops;
};
static struct bpf_test test;
static struct task_struct **torture_reads;
static struct task_struct **torture_writes;
torture_param(int, nreads, -1, "Number of BPF torture threads");
torture_param(int, nwrites, -1, "Number of BPF torture threads");
torture_param(int, onoff_holdoff, 0, "Time after boot before CPU hotplugs (s)");
torture_param(int, onoff_interval, 0, "Time between CPU hotplugs (s), 0=disable");
torture_param(int, shutdown_secs, 0, "Shutdown time (s), <= zero to disable.");
torture_param(int, stutter, 5, "Number of seconds to run/halt test");
torture_param(bool, verbose, true,
"Enable verbose debugging printk()s");
static int nrealreads;
static int nrealwrites;
static int torture_runnable = 1;
module_param(torture_runnable, int, 0444);
MODULE_PARM_DESC(torture_runnable, "Start rcutorture at boot");
static char *torture_type = "bpf";
module_param(torture_type, charp, 0444);
MODULE_PARM_DESC(torture_type, "Type of BPF to torture ( ...)");
static int bpf_torture_writes(void *arg)
{
	int i, v;

	do {
		v = 1234;
		rcu_read_lock();
		for (i = 0; i < 1000; i++) {
			test.ops->map_update_elem(test.map, &v, &v, BPF_ANY);
			v = next_pseudo_random32(v);
		}
		rcu_read_unlock();
		stutter_wait("bpf_torture_write");
	} while (!torture_must_stop());

	torture_kthread_stopping("bpf_torture_write");
	return 0;
}
static int bpf_torture_reads(void *arg)
{
	int *r;
	int key, next_key;

	do {
		key = -1;
		rcu_read_lock();
		while (test.ops->map_get_next_key(test.map, &key, &next_key) == 0) {
			r = test.ops->map_lookup_elem(test.map, &next_key);
			test.ops->map_delete_elem(test.map, &next_key);
			key = next_key;
		}
		rcu_read_unlock();
		stutter_wait("bpf_torture_read");
	} while (!torture_must_stop());

	torture_kthread_stopping("bpf_torture_read");
	return 0;
}
static void bpf_torture_cleanup(void)
{
	int i;

	torture_cleanup_begin();

	if (torture_reads) {
		for (i = 0; i < nrealreads; i++)
			torture_stop_kthread(bpf_torture_reads,
					     torture_reads[i]);
		kfree(torture_reads);
	}

	if (torture_writes) {
		for (i = 0; i < nrealwrites; i++)
			torture_stop_kthread(bpf_torture_writes,
					     torture_writes[i]);
		kfree(torture_writes);
	}

	if (test.map) {
		test.ops->map_free(test.map);
		test.map = NULL;
	}

	torture_cleanup_end();
}
static int __init bpf_torture_init(void)
{
	union bpf_attr attr;
	int i, err;

	if (nreads >= 0) {
		nrealreads = nreads;
	} else {
		nrealreads = num_online_cpus() - 1;
		if (nrealreads <= 0)
			nrealreads = 1;
	}

	if (nwrites >= 0) {
		nrealwrites = nwrites;
	} else {
		nrealwrites = num_online_cpus() - 1;
		if (nrealwrites <= 0)
			nrealwrites = 1;
	}

	if (!torture_init_begin(torture_type, verbose, &torture_runnable))
		return -EBUSY;

	torture_reads = kzalloc(nrealreads * sizeof(torture_reads[0]), GFP_KERNEL);
	if (!torture_reads)
		return -ENOMEM;
	torture_writes = kzalloc(nrealwrites * sizeof(torture_writes[0]), GFP_KERNEL);
	if (!torture_writes)
		return -ENOMEM;

	attr.map_type = BPF_MAP_TYPE_HASH;
	attr.key_size = sizeof(u32);
	attr.value_size = sizeof(u32);
	attr.max_entries = 1000;

	test.ops = bpf_get_map_ops(attr.map_type);
	if (!test.ops)
		return -ENOENT;

	test.map = test.ops->map_alloc(&attr);
	if (IS_ERR(test.map))
		return PTR_ERR(test.map);

	err = torture_stutter_init(stutter * HZ);
	if (err)
		return err;

	for (i = 0; i < nrealreads; i++) {
		err = torture_create_kthread(bpf_torture_reads,
					     NULL, torture_reads[i]);
		if (err)
			return err;
	}

	for (i = 0; i < nrealwrites; i++) {
		err = torture_create_kthread(bpf_torture_writes,
					     NULL, torture_writes[i]);
		if (err)
			return err;
	}

	err = torture_shutdown_init(shutdown_secs, bpf_torture_cleanup);
	if (err)
		return err;

	err = torture_onoff_init(onoff_holdoff * HZ, onoff_interval * HZ);
	if (err)
		return err;

	torture_init_end();

	return 0;
}
module_init(bpf_torture_init);
module_exit(bpf_torture_cleanup);
MODULE_LICENSE("GPL v2");
MODULE_AUTHOR("Daniel Wagner <[email protected]>");
MODULE_DESCRIPTION("BPF torture tests");
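Assuming it gets built as something like bpf_torture.ko (module name made
up here), it can be loaded with for example:

	modprobe bpf_torture nreads=2 nwrites=2 stutter=5 shutdown_secs=60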
On 6/15/15 11:34 PM, Daniel Wagner wrote:
> On 06/16/2015 08:25 AM, Alexei Starovoitov wrote:
>> On 6/15/15 11:06 PM, Daniel Wagner wrote:
>>>> with the above 'fix' the trace.patch is now passing.
>>> It still crashes for me with the original test program
>>>
>>> [ 145.908013] [<ffffffff810d1da1>] ? __rcu_reclaim+0x101/0x3d0
>>> [ 145.908013] [<ffffffff810d1ca0>] ? rcu_barrier_func+0x250/0x250
>>> [ 145.908013] [<ffffffff810abc03>] ?
>>> trace_hardirqs_on_caller+0xf3/0x240
>>> [ 145.908013] [<ffffffff810d9afa>] rcu_do_batch+0x2ea/0x6b0
>>
>> yes. full bpf test still crashes.
>> That's why I said trace.patch is passing ;)
>> There is something else in there. One 'fix' at a time.
>
> Ah, sorry, I read it is working now :) Anyway, I'll keep looking
> as well.
>
> Yesterday I wrote a small torture program for the map
> implementation. Just to rule out memory corruption there.
> Are you interested in it? If yes I could clean it a bit.
of course!
We already have samples/bpf/test_maps.c that stresses map
access from user space and lib/test_bpf.c that stress JIT
and interpreter from the kernel.
Looking at your test, I think it doesn't buy us much doing it
from the kernel?
If so, I think it would be great to add it to test_maps.c.
Will read it more carefully tomorrow.
On 06/16/2015 08:46 AM, Alexei Starovoitov wrote:
> On 6/15/15 11:34 PM, Daniel Wagner wrote:
>> On 06/16/2015 08:25 AM, Alexei Starovoitov wrote:
>>> On 6/15/15 11:06 PM, Daniel Wagner wrote:
>>>>> with the above 'fix' the trace.patch is now passing.
>>>> It still crashes for me with the original test program
>>>>
>>>> [ 145.908013] [<ffffffff810d1da1>] ? __rcu_reclaim+0x101/0x3d0
>>>> [ 145.908013] [<ffffffff810d1ca0>] ? rcu_barrier_func+0x250/0x250
>>>> [ 145.908013] [<ffffffff810abc03>] ?
>>>> trace_hardirqs_on_caller+0xf3/0x240
>>>> [ 145.908013] [<ffffffff810d9afa>] rcu_do_batch+0x2ea/0x6b0
>>>
>>> yes. full bpf test still crashes.
>>> That's why I said trace.patch is passing ;)
>>> There is something else in there. One 'fix' at a time.
>>
>> Ah, sorry, I read it is working now :) Anyway, I'll keep looking
>> as well.
>>
>> Yesterday I wrote a small torture program for the map
>> implementation. Just to rule out memory corruption there.
>> Are you interested in it? If yes I could clean it a bit.
>
> of course!
> We already have samples/bpf/test_maps.c that stresses map
> access from user space and lib/test_bpf.c that stress JIT
> and interpreter from the kernel.
> Looking at your test, I think it doesn't buy as much doing it
> from the kernel?
No, it doesn't really buy you anything compared to test_maps.c. I was
just ignorant of the fact that test_maps.c exists. My goal was to reduce
the problem to the smallest unit and therefore I tried to do it inside the
kernel without any additional layers.
> If so, I think would be great to add it to test_maps.c
>
> Will read it more carefully tomorrow.
No need, test_maps.c is far better.
On Mon, Jun 15, 2015 at 10:45:05PM -0700, Alexei Starovoitov wrote:
> On 6/15/15 7:14 PM, Paul E. McKenney wrote:
> >
> >Why do you believe that it is better to fix it within call_rcu()?
>
> found it:
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 8cf7304b2867..a3be09d482ae 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -935,9 +935,9 @@ bool notrace rcu_is_watching(void)
> {
> bool ret;
>
> - preempt_disable();
> + preempt_disable_notrace();
> ret = __rcu_is_watching();
> - preempt_enable();
> + preempt_enable_notrace();
> return ret;
> }
>
> the rcu_is_watching() and __rcu_is_watching() are already marked
> notrace, so imo it's a good 'fix'.
> What was happening is that the above preempt_enable was triggering
> recursive call_rcu that was indeed messing 'rdp' that was
> prepared by __call_rcu and before __call_rcu_core could use that.
> btw, also noticed that local_irq_save done by note_gp_changes
> is partially redundant. In __call_rcu_core path the irqs are
> already disabled.
But you said earlier that nothing happened when interrupts were
disabled. And interrupts are disabled across the call to
rcu_is_watching() in __call_rcu_core(). So why did those calls
to preempt_disable() and preempt_enable() cause trouble?
That said, the patch looks inoffensive to me, adding Steven for his
trace expertise.
Still, I do need to understand what was really happening. Did interrupts
get enabled somehow? Or is your code that ignores calls when interrupts
are disabled incomplete in some way? Something else?
> >Perhaps you are self-deadlocking within __call_rcu_core(). If you have
> >not already done so, please try running with CONFIG_PROVE_LOCKING=y.
>
> yes, I had CONFIG_PROVE_LOCKING on.
Good! ;-)
> >I suspect that your problem may range quite a bit further than just
> >call_rcu(). For example, in your stack trace, you have a recursive
> >call to debug_object_activate(), which might not be such good thing.
>
> nope :) recursive debug_object_activate() is fine.
> with the above 'fix' the trace.patch is now passing.
>
> Why I'm digging into all of these? Well, to find out when
> it's safe to finally do call_rcu. If I will use deferred kfree
> approach in bpf maps, I need to know when it's safe to finally
> call_rcu and it's not an easy answer.
Given that reentrant calls to call_rcu() and/or kfree_rcu() were not
in any way considered during design and implementation, it is not a
surprise that the answer is not easy. The reason I need to understand
what your code does in interrupt-disabled situations is to work out
whether or not it makes sense to agree to support reentrancy in call_rcu()
and kfree_rcu().
> kprobes potentially can be placed in any part of call_rcu stack,
> so things can go messy quickly. So it helps to understand the call_rcu
> logic well enough to come up with good solution.
Indeed, I do have some concerns about that sort of thing, as it is not
at all clear that designing call_rcu() and kfree_rcu() for unrestricted
reentrancy is a win.
Thanx, Paul
On 06/16/2015 02:27 PM, Paul E. McKenney wrote:
> On Mon, Jun 15, 2015 at 10:45:05PM -0700, Alexei Starovoitov wrote:
>> On 6/15/15 7:14 PM, Paul E. McKenney wrote:
>>>
>>> Why do you believe that it is better to fix it within call_rcu()?
>>
>> found it:
>> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>> index 8cf7304b2867..a3be09d482ae 100644
>> --- a/kernel/rcu/tree.c
>> +++ b/kernel/rcu/tree.c
>> @@ -935,9 +935,9 @@ bool notrace rcu_is_watching(void)
>> {
>> bool ret;
>>
>> - preempt_disable();
>> + preempt_disable_notrace();
>> ret = __rcu_is_watching();
>> - preempt_enable();
>> + preempt_enable_notrace();
>> return ret;
>> }
>>
>> the rcu_is_watching() and __rcu_is_watching() are already marked
>> notrace, so imo it's a good 'fix'.
>> What was happening is that the above preempt_enable was triggering
>> recursive call_rcu that was indeed messing 'rdp' that was
>> prepared by __call_rcu and before __call_rcu_core could use that.
>
>> btw, also noticed that local_irq_save done by note_gp_changes
>> is partially redundant. In __call_rcu_core path the irqs are
>> already disabled.
>
> But you said earlier that nothing happened when interrupts were
> disabled. And interrupts are disabled across the call to
> rcu_is_watching() in __call_rcu_core(). So why did those calls
> to preempt_disable() and preempt_enable() cause trouble?
Just for the record: Using a thread for freeing the memory is curing the
problem without the need to modify rcu_is_watching.
Here is the hack:
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 83c209d..8d73be3 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -13,6 +13,8 @@
 #include <linux/jhash.h>
 #include <linux/filter.h>
 #include <linux/vmalloc.h>
+#include <linux/kthread.h>
+#include <linux/spinlock.h>
 
 struct bpf_htab {
 	struct bpf_map map;
@@ -27,10 +29,14 @@ struct bpf_htab {
 struct htab_elem {
 	struct hlist_node hash_node;
 	struct rcu_head rcu;
+	struct list_head list;
 	u32 hash;
 	char key[0] __aligned(8);
 };
 
+static LIST_HEAD(elem_freelist);
+static DEFINE_SPINLOCK(elem_freelist_lock);
+
 /* Called from syscall */
 static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 {
@@ -228,6 +234,7 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 	memcpy(l_new->key + round_up(key_size, 8), value, map->value_size);
 
 	l_new->hash = htab_map_hash(l_new->key, key_size);
+	INIT_LIST_HEAD(&l_new->list);
 
 	/* bpf_map_update_elem() can be called in_irq() */
 	spin_lock_irqsave(&htab->lock, flags);
@@ -300,11 +307,17 @@ static int htab_map_delete_elem(struct bpf_map *map, void *key)
 	if (l) {
 		hlist_del_rcu(&l->hash_node);
 		htab->count--;
-		kfree_rcu(l, rcu);
+		/* kfree_rcu(l, rcu); */
 		ret = 0;
 	}
 
 	spin_unlock_irqrestore(&htab->lock, flags);
+
+	if (l) {
+		spin_lock_irqsave(&elem_freelist_lock, flags);
+		list_add(&l->list, &elem_freelist);
+		spin_unlock_irqrestore(&elem_freelist_lock, flags);
+	}
 	return ret;
 }
 
@@ -359,9 +372,31 @@ static struct bpf_map_type_list htab_type __read_mostly = {
 	.type = BPF_MAP_TYPE_HASH,
 };
 
+static int free_thread(void *arg)
+{
+	unsigned long flags;
+	struct htab_elem *l;
+
+	while (!kthread_should_stop()) {
+		spin_lock_irqsave(&elem_freelist_lock, flags);
+		while (!list_empty(&elem_freelist)) {
+			l = list_entry(elem_freelist.next,
+				       struct htab_elem, list);
+			list_del(&l->list);
+			kfree(l);
+		}
+		spin_unlock_irqrestore(&elem_freelist_lock, flags);
+	}
+
+	return 0;
+}
+
 static int __init register_htab_map(void)
 {
 	bpf_register_map_type(&htab_type);
+
+	kthread_run(free_thread, NULL, "free_thread");
+
 	return 0;
 }
 late_initcall(register_htab_map);
On Tue, Jun 16, 2015 at 02:38:53PM +0200, Daniel Wagner wrote:
> On 06/16/2015 02:27 PM, Paul E. McKenney wrote:
> > On Mon, Jun 15, 2015 at 10:45:05PM -0700, Alexei Starovoitov wrote:
> >> On 6/15/15 7:14 PM, Paul E. McKenney wrote:
> >>>
> >>> Why do you believe that it is better to fix it within call_rcu()?
> >>
> >> found it:
> >> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> >> index 8cf7304b2867..a3be09d482ae 100644
> >> --- a/kernel/rcu/tree.c
> >> +++ b/kernel/rcu/tree.c
> >> @@ -935,9 +935,9 @@ bool notrace rcu_is_watching(void)
> >> {
> >> bool ret;
> >>
> >> - preempt_disable();
> >> + preempt_disable_notrace();
> >> ret = __rcu_is_watching();
> >> - preempt_enable();
> >> + preempt_enable_notrace();
> >> return ret;
> >> }
> >>
> >> the rcu_is_watching() and __rcu_is_watching() are already marked
> >> notrace, so imo it's a good 'fix'.
> >> What was happening is that the above preempt_enable was triggering
> >> recursive call_rcu that was indeed messing 'rdp' that was
> >> prepared by __call_rcu and before __call_rcu_core could use that.
> >
> >> btw, also noticed that local_irq_save done by note_gp_changes
> >> is partially redundant. In __call_rcu_core path the irqs are
> >> already disabled.
> >
> > But you said earlier that nothing happened when interrupts were
> > disabled. And interrupts are disabled across the call to
> > rcu_is_watching() in __call_rcu_core(). So why did those calls
> > to preempt_disable() and preempt_enable() cause trouble?
>
> Just for the record: Using a thread for freeing the memory is curing the
> problem without the need to modify rcu_is_watching.
I must confess to liking this approach better than guaranteeing full-up
reentrancy in call_rcu() and kfree_rcu(). ;-)
Thanx, Paul
> Here is the hack:
>
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 83c209d..8d73be3 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -13,6 +13,8 @@
> #include <linux/jhash.h>
> #include <linux/filter.h>
> #include <linux/vmalloc.h>
> +#include <linux/kthread.h>
> +#include <linux/spinlock.h>
>
> struct bpf_htab {
> struct bpf_map map;
> @@ -27,10 +29,14 @@ struct bpf_htab {
> struct htab_elem {
> struct hlist_node hash_node;
> struct rcu_head rcu;
> + struct list_head list;
> u32 hash;
> char key[0] __aligned(8);
> };
>
> +static LIST_HEAD(elem_freelist);
> +static DEFINE_SPINLOCK(elem_freelist_lock);
> +
> /* Called from syscall */
> static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
> {
> @@ -228,6 +234,7 @@ static int htab_map_update_elem(struct bpf_map *map,
> void *key, void *value,
> memcpy(l_new->key + round_up(key_size, 8), value, map->value_size);
>
> l_new->hash = htab_map_hash(l_new->key, key_size);
> + INIT_LIST_HEAD(&l_new->list);
>
> /* bpf_map_update_elem() can be called in_irq() */
> spin_lock_irqsave(&htab->lock, flags);
> @@ -300,11 +307,17 @@ static int htab_map_delete_elem(struct bpf_map
> *map, void *key)
> if (l) {
> hlist_del_rcu(&l->hash_node);
> htab->count--;
> - kfree_rcu(l, rcu);
> + /* kfree_rcu(l, rcu); */
> ret = 0;
> }
>
> spin_unlock_irqrestore(&htab->lock, flags);
> +
> + if (l) {
> + spin_lock_irqsave(&elem_freelist_lock, flags);
> + list_add(&l->list, &elem_freelist);
> + spin_unlock_irqrestore(&elem_freelist_lock, flags);
> + }
> return ret;
> }
>
> @@ -359,9 +372,31 @@ static struct bpf_map_type_list htab_type
> __read_mostly = {
> .type = BPF_MAP_TYPE_HASH,
> };
>
> +static int free_thread(void *arg)
> +{
> + unsigned long flags;
> + struct htab_elem *l;
> +
> + while (!kthread_should_stop()) {
> + spin_lock_irqsave(&elem_freelist_lock, flags);
> + while (!list_empty(&elem_freelist)) {
> + l = list_entry(elem_freelist.next,
> + struct htab_elem, list);
> + list_del(&l->list);
> + kfree(l);
> + }
> + spin_unlock_irqrestore(&elem_freelist_lock, flags);
> + }
> +
> + return 0;
> +}
> +
> static int __init register_htab_map(void)
> {
> bpf_register_map_type(&htab_type);
> +
> + kthread_run(free_thread, NULL, "free_thread");
> +
> return 0;
> }
> late_initcall(register_htab_map);
>
On Tue, 16 Jun 2015 05:27:33 -0700
"Paul E. McKenney" <[email protected]> wrote:
> On Mon, Jun 15, 2015 at 10:45:05PM -0700, Alexei Starovoitov wrote:
> > On 6/15/15 7:14 PM, Paul E. McKenney wrote:
> > >
> > >Why do you believe that it is better to fix it within call_rcu()?
> >
> > found it:
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 8cf7304b2867..a3be09d482ae 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -935,9 +935,9 @@ bool notrace rcu_is_watching(void)
> > {
> > bool ret;
> >
> > - preempt_disable();
> > + preempt_disable_notrace();
> > ret = __rcu_is_watching();
> > - preempt_enable();
> > + preempt_enable_notrace();
> > return ret;
> > }
> >
> > the rcu_is_watching() and __rcu_is_watching() are already marked
> > notrace, so imo it's a good 'fix'.
> > What was happening is that the above preempt_enable was triggering
> > recursive call_rcu that was indeed messing 'rdp' that was
> > prepared by __call_rcu and before __call_rcu_core could use that.
>
> > btw, also noticed that local_irq_save done by note_gp_changes
> > is partially redundant. In __call_rcu_core path the irqs are
> > already disabled.
>
If rcu_is_watching() and __rcu_is_watching() are both marked as
notrace, it makes sense to use preempt_disable/enable_notrace() as it
otherwise defeats the purpose of the notrace markers on rcu_is_watching.
That is regardless of what the rest of this thread is about.
-- Steve
On Tue, 16 Jun 2015 14:38:53 +0200
Daniel Wagner <[email protected]> wrote:
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 83c209d..8d73be3 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -13,6 +13,8 @@
> #include <linux/jhash.h>
> #include <linux/filter.h>
> #include <linux/vmalloc.h>
> +#include <linux/kthread.h>
> +#include <linux/spinlock.h>
>
> struct bpf_htab {
> struct bpf_map map;
> @@ -27,10 +29,14 @@ struct bpf_htab {
> struct htab_elem {
> struct hlist_node hash_node;
> struct rcu_head rcu;
> + struct list_head list;
> u32 hash;
> char key[0] __aligned(8);
> };
>
> +static LIST_HEAD(elem_freelist);
> +static DEFINE_SPINLOCK(elem_freelist_lock);
> +
> /* Called from syscall */
> static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
> {
> @@ -228,6 +234,7 @@ static int htab_map_update_elem(struct bpf_map *map,
> void *key, void *value,
> memcpy(l_new->key + round_up(key_size, 8), value, map->value_size);
>
> l_new->hash = htab_map_hash(l_new->key, key_size);
> + INIT_LIST_HEAD(&l_new->list);
>
> /* bpf_map_update_elem() can be called in_irq() */
> spin_lock_irqsave(&htab->lock, flags);
> @@ -300,11 +307,17 @@ static int htab_map_delete_elem(struct bpf_map
> *map, void *key)
> if (l) {
> hlist_del_rcu(&l->hash_node);
> htab->count--;
> - kfree_rcu(l, rcu);
> + /* kfree_rcu(l, rcu); */
So this kfree_rcu() is only being used to defer a free, and has nothing
to do with having to free 'l' from rcu?
> ret = 0;
> }
>
> spin_unlock_irqrestore(&htab->lock, flags);
> +
> + if (l) {
> + spin_lock_irqsave(&elem_freelist_lock, flags);
> + list_add(&l->list, &elem_freelist);
> + spin_unlock_irqrestore(&elem_freelist_lock, flags);
> + }
> return ret;
> }
>
> @@ -359,9 +372,31 @@ static struct bpf_map_type_list htab_type
> __read_mostly = {
> .type = BPF_MAP_TYPE_HASH,
> };
>
> +static int free_thread(void *arg)
> +{
> + unsigned long flags;
> + struct htab_elem *l;
> +
> + while (!kthread_should_stop()) {
> + spin_lock_irqsave(&elem_freelist_lock, flags);
> + while (!list_empty(&elem_freelist)) {
> + l = list_entry(elem_freelist.next,
> + struct htab_elem, list);
> + list_del(&l->list);
> + kfree(l);
> + }
> + spin_unlock_irqrestore(&elem_freelist_lock, flags);
Wow! This is burning up CPU isn't it?
If you just need to delay the kfree, why not use irq_work for that job?
-- Steve
> + }
> +
> + return 0;
> +}
> +
> static int __init register_htab_map(void)
> {
> bpf_register_map_type(&htab_type);
> +
> + kthread_run(free_thread, NULL, "free_thread");
> +
> return 0;
> }
> late_initcall(register_htab_map);
On Tue, 16 Jun 2015 07:16:26 -0700
"Paul E. McKenney" <[email protected]> wrote:
> > Just for the record: Using a thread for freeing the memory is curing the
> > problem without the need to modify rcu_is_watching.
>
> I must confess to liking this approach better than guaranteeing full-up
> reentrancy in call_rcu() and kfree_rcu(). ;-)
>
Then reentrancy must be really bad if you prefer a spinning thread that
polls constantly just to free an item ;-)
-- Steve
On Tue, 16 Jun 2015 11:41:51 -0400
Steven Rostedt <[email protected]> wrote:
> > if (l) {
> > hlist_del_rcu(&l->hash_node);
> > htab->count--;
> > - kfree_rcu(l, rcu);
> > + /* kfree_rcu(l, rcu); */
>
> So this kfree_rcu() is only being used to defer a free, and has nothing
> to do with having to free 'l' from rcu?
>
If you still require kfree_rcu() but cannot call it here, still use an
irq_work, and have the irq_work handler call kfree_rcu().
-- Steve
On Tue, Jun 16, 2015 at 11:37:38AM -0400, Steven Rostedt wrote:
> On Tue, 16 Jun 2015 05:27:33 -0700
> "Paul E. McKenney" <[email protected]> wrote:
>
> > On Mon, Jun 15, 2015 at 10:45:05PM -0700, Alexei Starovoitov wrote:
> > > On 6/15/15 7:14 PM, Paul E. McKenney wrote:
> > > >
> > > >Why do you believe that it is better to fix it within call_rcu()?
> > >
> > > found it:
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 8cf7304b2867..a3be09d482ae 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -935,9 +935,9 @@ bool notrace rcu_is_watching(void)
> > > {
> > > bool ret;
> > >
> > > - preempt_disable();
> > > + preempt_disable_notrace();
> > > ret = __rcu_is_watching();
> > > - preempt_enable();
> > > + preempt_enable_notrace();
> > > return ret;
> > > }
> > >
> > > the rcu_is_watching() and __rcu_is_watching() are already marked
> > > notrace, so imo it's a good 'fix'.
> > > What was happening is that the above preempt_enable was triggering
> > > recursive call_rcu that was indeed messing 'rdp' that was
> > > prepared by __call_rcu and before __call_rcu_core could use that.
> >
> > > btw, also noticed that local_irq_save done by note_gp_changes
> > > is partially redundant. In __call_rcu_core path the irqs are
> > > already disabled.
> >
>
> If rcu_is_watching() and __rcu_is_watching() are both marked as
> notrace, it makes sense to use preempt_disable/enable_notrace() as it
> otherwise defeats the purpose of the notrace markers on rcu_is_watching.
>
> That is regardless of what the rest of this thread is about.
Good enough! Alexei, are you OK with my adding your Signed-off-by
to the above patch? (Still not sold on reentrant call_rcu() and
kfree_rcu(), but getting notrace set up correctly is worthwhile.)
Thanx, Paul
On Tue, Jun 16, 2015 at 11:43:42AM -0400, Steven Rostedt wrote:
> On Tue, 16 Jun 2015 07:16:26 -0700
> "Paul E. McKenney" <[email protected]> wrote:
>
> > > Just for the record: Using a thread for freeing the memory is curing the
> > > problem without the need to modify rcu_is_watching.
> >
> > I must confess to liking this approach better than guaranteeing full-up
> > reentrancy in call_rcu() and kfree_rcu(). ;-)
>
> Then reentrancy must be really bad if you prefer a spinning thread that
> polls constantly just to free an item ;-)
I was (perhaps naively) assuming that they would use a less aggressive
approach at some point. ;-)
Thanx, Paul
On 06/16/2015 05:41 PM, Steven Rostedt wrote:
> On Tue, 16 Jun 2015 14:38:53 +0200
> Daniel Wagner <[email protected]> wrote:
>> *map, void *key)
>> if (l) {
>> hlist_del_rcu(&l->hash_node);
>> htab->count--;
>> - kfree_rcu(l, rcu);
>> + /* kfree_rcu(l, rcu); */
>
> So this kfree_rcu() is only being used to defer a free, and has nothing
> to do with having to free 'l' from rcu?
Not 100% sure but I got the impression kfree_rcu only defers the free.
>> +static int free_thread(void *arg)
>> +{
>> + unsigned long flags;
>> + struct htab_elem *l;
>> +
>> + while (!kthread_should_stop()) {
>> + spin_lock_irqsave(&elem_freelist_lock, flags);
>> + while (!list_empty(&elem_freelist)) {
>> + l = list_entry(elem_freelist.next,
>> + struct htab_elem, list);
>> + list_del(&l->list);
>> + kfree(l);
>> + }
>> + spin_unlock_irqrestore(&elem_freelist_lock, flags);
>
> Wow! This is burning up CPU isn't it?
Sure, this is a very busy thread :) I was just experimenting to see if
deferring it to a thread would paper over the problem.
> If you just need to delay the kfree, why not use irq_work for that job?
Good point. I'll try that tomorrow.
cheers,
daniel
On 06/16/2015 06:07 PM, Paul E. McKenney wrote:
> On Tue, Jun 16, 2015 at 11:43:42AM -0400, Steven Rostedt wrote:
>> On Tue, 16 Jun 2015 07:16:26 -0700
>> "Paul E. McKenney" <[email protected]> wrote:
>>
>>>> Just for the record: Using a thread for freeing the memory is curing the
>>>> problem without the need to modify rcu_is_watching.
>>>
>>> I must confess to liking this approach better than guaranteeing full-up
>>> reentrancy in call_rcu() and kfree_rcu(). ;-)
>>
>> Then reentrancy must be really bad if you prefer a spinning thread that
>> polls constantly just to free an item ;-)
>
> I was (perhaps naively) assuming that they would use a less aggressive
> approach at some point. ;-)
Yes, this was just playing around :)
On 6/16/15 9:05 AM, Paul E. McKenney wrote:
> On Tue, Jun 16, 2015 at 11:37:38AM -0400, Steven Rostedt wrote:
>> On Tue, 16 Jun 2015 05:27:33 -0700
>> "Paul E. McKenney" <[email protected]> wrote:
>>
>>> On Mon, Jun 15, 2015 at 10:45:05PM -0700, Alexei Starovoitov wrote:
>>>> On 6/15/15 7:14 PM, Paul E. McKenney wrote:
>>>>>
>>>>> Why do you believe that it is better to fix it within call_rcu()?
>>>>
>>>> found it:
>>>> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>>>> index 8cf7304b2867..a3be09d482ae 100644
>>>> --- a/kernel/rcu/tree.c
>>>> +++ b/kernel/rcu/tree.c
>>>> @@ -935,9 +935,9 @@ bool notrace rcu_is_watching(void)
>>>> {
>>>> bool ret;
>>>>
>>>> - preempt_disable();
>>>> + preempt_disable_notrace();
>>>> ret = __rcu_is_watching();
>>>> - preempt_enable();
>>>> + preempt_enable_notrace();
>>>> return ret;
>>>> }
>>>>
>>>> the rcu_is_watching() and __rcu_is_watching() are already marked
>>>> notrace, so imo it's a good 'fix'.
>>>> What was happening is that the above preempt_enable was triggering
>>>> recursive call_rcu that was indeed messing 'rdp' that was
>>>> prepared by __call_rcu and before __call_rcu_core could use that.
>>>
>>>> btw, also noticed that local_irq_save done by note_gp_changes
>>>> is partially redundant. In __call_rcu_core path the irqs are
>>>> already disabled.
>>>
>>
>> If rcu_is_watching() and __rcu_is_watching() are both marked as
>> notrace, it makes sense to use preempt_disable/enable_notrace() as it
>> otherwise defeats the purpose of the notrace markers on rcu_is_watching.
>>
>> That is regardless of what the rest of this thread is about.
>
> Good enough! Alexei, are you OK with my adding your Signed-off-by
> to the above patch?
sure.
Signed-off-by: Alexei Starovoitov <[email protected]>
> (Still not sold on reentrant call_rcu() and
> kfree_rcu(), but getting notrace set up correctly is worthwhile.)
I'm not sold on it either. So far trying to understand
all consequences.
On 6/16/15 5:38 AM, Daniel Wagner wrote:
> static int free_thread(void *arg)
> +{
> + unsigned long flags;
> + struct htab_elem *l;
> +
> + while (!kthread_should_stop()) {
> + spin_lock_irqsave(&elem_freelist_lock, flags);
> + while (!list_empty(&elem_freelist)) {
> + l = list_entry(elem_freelist.next,
> + struct htab_elem, list);
> + list_del(&l->list);
> + kfree(l);
that's not right, since such a thread defeats the rcu protection of the lookup.
We need either kfree_rcu()/call_rcu() or synchronize_rcu().
Obviously the former is preferred; that's why I'm still digging into it.
Probably a thread that does kfree_rcu() would be ok, but we shouldn't
be doing it unconditionally. For all networking programs and 99%
of tracing programs the existing code is fine and I don't want to
slow it down to tackle this corner case.
An extra spin_lock just to add the element to the list is also quite costly.
On Tue, 16 Jun 2015 10:20:05 -0700
Alexei Starovoitov <[email protected]> wrote:
> On 6/16/15 5:38 AM, Daniel Wagner wrote:
> > static int free_thread(void *arg)
> > +{
> > + unsigned long flags;
> > + struct htab_elem *l;
> > +
> > + while (!kthread_should_stop()) {
> > + spin_lock_irqsave(&elem_freelist_lock, flags);
> > + while (!list_empty(&elem_freelist)) {
> > + l = list_entry(elem_freelist.next,
> > + struct htab_elem, list);
> > + list_del(&l->list);
> > + kfree(l);
>
> that's not right, since such thread defeats rcu protection of lookup.
> We need either kfree_rcu/call_rcu or synchronize_rcu.
> Obviously the former is preferred that's why I'm still digging into it.
> Probably a thread that does kfree_rcu would be ok, but we shouldn't
> be doing it unconditionally. For all networking programs and 99%
> of tracing programs the existing code is fine and I don't want to
> slow it down to tackle the corner case.
> Extra spin_lock just to add it to the list is also quite costly.
Use a irq_work() handler to do the kfree_rcu(), and use llist (lockless
list) to add items to the list.
-- Steve
On Tue, Jun 16, 2015 at 10:14:08AM -0700, Alexei Starovoitov wrote:
> On 6/16/15 9:05 AM, Paul E. McKenney wrote:
> >On Tue, Jun 16, 2015 at 11:37:38AM -0400, Steven Rostedt wrote:
> >>On Tue, 16 Jun 2015 05:27:33 -0700
> >>"Paul E. McKenney" <[email protected]> wrote:
> >>
> >>>On Mon, Jun 15, 2015 at 10:45:05PM -0700, Alexei Starovoitov wrote:
> >>>>On 6/15/15 7:14 PM, Paul E. McKenney wrote:
> >>>>>
> >>>>>Why do you believe that it is better to fix it within call_rcu()?
> >>>>
> >>>>found it:
> >>>>diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> >>>>index 8cf7304b2867..a3be09d482ae 100644
> >>>>--- a/kernel/rcu/tree.c
> >>>>+++ b/kernel/rcu/tree.c
> >>>>@@ -935,9 +935,9 @@ bool notrace rcu_is_watching(void)
> >>>> {
> >>>> bool ret;
> >>>>
> >>>>- preempt_disable();
> >>>>+ preempt_disable_notrace();
> >>>> ret = __rcu_is_watching();
> >>>>- preempt_enable();
> >>>>+ preempt_enable_notrace();
> >>>> return ret;
> >>>> }
> >>>>
> >>>>the rcu_is_watching() and __rcu_is_watching() are already marked
> >>>>notrace, so imo it's a good 'fix'.
> >>>>What was happening is that the above preempt_enable was triggering
> >>>>recursive call_rcu that was indeed messing 'rdp' that was
> >>>>prepared by __call_rcu and before __call_rcu_core could use that.
> >>>
> >>>>btw, also noticed that local_irq_save done by note_gp_changes
> >>>>is partially redundant. In __call_rcu_core path the irqs are
> >>>>already disabled.
> >>>
> >>
> >>If rcu_is_watching() and __rcu_is_watching() are both marked as
> >>notrace, it makes sense to use preempt_disable/enable_notrace() as it
> >>otherwise defeats the purpose of the notrace markers on rcu_is_watching.
And __rcu_is_watching() is marked notrace as well.
> >>That is regardless of what the rest of this thread is about.
> >
> >Good enough! Alexei, are you OK with my adding your Signed-off-by
> >to the above patch?
>
> sure.
> Signed-off-by: Alexei Starovoitov <[email protected]>
>
> >(Still not sold on reentrant call_rcu() and
> >kfree_rcu(), but getting notrace set up correctly is worthwhile.)
>
> I'm not sold on it either. So far trying to understand
> all consequences.
Here is the updated patch. Steven, I added your "Acked-by" based
on your positive comments above, please let me know if you would
like me to remove it.
Thanx, Paul
------------------------------------------------------------------------
commit 9611f225d383a2edbdf74ca7f00c8d0b1e56dc45
Author: Alexei Starovoitov <[email protected]>
Date: Tue Jun 16 10:35:18 2015 -0700
rcu: Make rcu_is_watching() really notrace
Although rcu_is_watching() is marked notrace, it invokes preempt_disable()
and preempt_enable(), both of which can be traced. This defeats the
purpose of the notrace on rcu_is_watching(), so this commit substitutes
preempt_disable_notrace() and preempt_enable_notrace().
Signed-off-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
Acked-by: Steven Rostedt <[email protected]>
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index fc0385380e97..c844ef3c2fae 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -973,9 +973,9 @@ bool notrace rcu_is_watching(void)
{
bool ret;
- preempt_disable();
+ preempt_disable_notrace();
ret = __rcu_is_watching();
- preempt_enable();
+ preempt_enable_notrace();
return ret;
}
EXPORT_SYMBOL_GPL(rcu_is_watching);
On Tue, 16 Jun 2015 10:39:42 -0700
"Paul E. McKenney" <[email protected]> wrote:
> > >>If rcu_is_watching() and __rcu_is_watching() are both marked as
> > >>notrace, it makes sense to use preempt_disable/enable_notrace() as it
> > >>otherwise defeats the purpose of the notrace markers on rcu_is_watching.
>
> And __rcu_is_watching() is marked notrace as well.
Isn't that what I said?
>
> > >>That is regardless of what the rest of this thread is about.
> > >
> > >Good enough! Alexei, are you OK with my adding your Signed-off-by
> > >to the above patch?
> >
> > sure.
> > Signed-off-by: Alexei Starovoitov <[email protected]>
> >
> > >(Still not sold on reentrant call_rcu() and
> > >kfree_rcu(), but getting notrace set up correctly is worthwhile.)
> >
> > I'm not sold on it either. So far trying to understand
> > all consequences.
>
> Here is the updated patch. Steven, I added your "Acked-by" based
> on your positive comments above, please let me know if you would
> like me to remove it.
I'm fine with it. But doesn't Acked-by go above Signed-off-by?
-- Steve
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> commit 9611f225d383a2edbdf74ca7f00c8d0b1e56dc45
> Author: Alexei Starovoitov <[email protected]>
> Date: Tue Jun 16 10:35:18 2015 -0700
>
> rcu: Make rcu_is_watching() really notrace
>
> Although rcu_is_watching() is marked notrace, it invokes preempt_disable()
> and preempt_enable(), both of which can be traced. This defeats the
> purpose of the notrace on rcu_is_watching(), so this commit substitutes
> preempt_disable_notrace() and preempt_enable_notrace().
>
> Signed-off-by: Alexei Starovoitov <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>
> Acked-by: Steven Rostedt <[email protected]>
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index fc0385380e97..c844ef3c2fae 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -973,9 +973,9 @@ bool notrace rcu_is_watching(void)
> {
> bool ret;
>
> - preempt_disable();
> + preempt_disable_notrace();
> ret = __rcu_is_watching();
> - preempt_enable();
> + preempt_enable_notrace();
> return ret;
> }
> EXPORT_SYMBOL_GPL(rcu_is_watching);
On Tue, Jun 16, 2015 at 02:57:44PM -0400, Steven Rostedt wrote:
> On Tue, 16 Jun 2015 10:39:42 -0700
> "Paul E. McKenney" <[email protected]> wrote:
>
> > > >>If rcu_is_watching() and __rcu_is_watching() are both marked as
> > > >>notrace, it makes sense to use preempt_disable/enable_notrace() as it
> > > >>otherwise defeats the purpose of the notrace markers on rcu_is_watching.
> >
> > And __rcu_is_watching() is marked notrace as well.
>
> Isn't that what I said?
You did say "if", so I checked. ;-)
> > > >>That is regardless of what the rest of this thread is about.
> > > >
> > > >Good enough! Alexei, are you OK with my adding your Signed-off-by
> > > >to the above patch?
> > >
> > > sure.
> > > Signed-off-by: Alexei Starovoitov <[email protected]>
> > >
> > > >(Still not sold on reentrant call_rcu() and
> > > >kfree_rcu(), but getting notrace set up correctly is worthwhile.)
> > >
> > > I'm not sold on it either. So far trying to understand
> > > all consequences.
> >
> > Here is the updated patch. Steven, I added your "Acked-by" based
> > on your positive comments above, please let me know if you would
> > like me to remove it.
>
> I'm fine with it. But doesn't Acked-by go above Signed-off-by?
I have done it both ways, usually in time order.
Thanx, Paul
> -- Steve
>
> >
> > Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > commit 9611f225d383a2edbdf74ca7f00c8d0b1e56dc45
> > Author: Alexei Starovoitov <[email protected]>
> > Date: Tue Jun 16 10:35:18 2015 -0700
> >
> > rcu: Make rcu_is_watching() really notrace
> >
> > Although rcu_is_watching() is marked notrace, it invokes preempt_disable()
> > and preempt_enable(), both of which can be traced. This defeats the
> > purpose of the notrace on rcu_is_watching(), so this commit substitutes
> > preempt_disable_notrace() and preempt_enable_notrace().
> >
> > Signed-off-by: Alexei Starovoitov <[email protected]>
> > Signed-off-by: Paul E. McKenney <[email protected]>
> > Acked-by: Steven Rostedt <[email protected]>
> >
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index fc0385380e97..c844ef3c2fae 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -973,9 +973,9 @@ bool notrace rcu_is_watching(void)
> > {
> > bool ret;
> >
> > - preempt_disable();
> > + preempt_disable_notrace();
> > ret = __rcu_is_watching();
> > - preempt_enable();
> > + preempt_enable_notrace();
> > return ret;
> > }
> > EXPORT_SYMBOL_GPL(rcu_is_watching);
>
On Tue, 16 Jun 2015 12:20:55 -0700
"Paul E. McKenney" <[email protected]> wrote:
> > I'm fine with it. But doesn't Acked-by go above Signed-off-by?
>
> I have done it both ways, usually in time order.
You're either a Big-endian, or a Little-endian? You can't be both!
-- Steve
On Tue, Jun 16, 2015 at 03:29:25PM -0400, Steven Rostedt wrote:
> On Tue, 16 Jun 2015 12:20:55 -0700
> "Paul E. McKenney" <[email protected]> wrote:
>
> > > I'm fine with it. But doesn't Acked-by go above Signed-off-by?
> >
> > I have done it both ways, usually in time order.
>
> You're either a Big-endian, or a Little-endian? You can't be both!
We specify at boot time, on each boot. Can be different for different
guest OSes. ;-)
Thanx, Paul
On 6/16/15 10:37 AM, Steven Rostedt wrote:
>>> + kfree(l);
>> >
>> >that's not right, since such thread defeats rcu protection of lookup.
>> >We need either kfree_rcu/call_rcu or synchronize_rcu.
>> >Obviously the former is preferred that's why I'm still digging into it.
>> >Probably a thread that does kfree_rcu would be ok, but we shouldn't
>> >be doing it unconditionally. For all networking programs and 99%
>> >of tracing programs the existing code is fine and I don't want to
>> >slow it down to tackle the corner case.
>> >Extra spin_lock just to add it to the list is also quite costly.
> Use a irq_work() handler to do the kfree_rcu(), and use llist (lockless
> list) to add items to the list.
have been studying irq_work and llist... it will work, but it's quite
costly too. Every kfree_rcu will be replaced with irq_work_queue(),
which is irq_work_claim() with one lock_cmpxchg plus another
lock_cmpxchg in llist_add, plus another lock_cmpxchg for our own llist
of 'to be kfree_rcu-ed htab elements'. That's a lot.
There must be a better solution. Need to explore more.
On Tue, 16 Jun 2015 17:33:24 -0700
Alexei Starovoitov <[email protected]> wrote:
> On 6/16/15 10:37 AM, Steven Rostedt wrote:
> >>> + kfree(l);
> >> >
> >> >that's not right, since such thread defeats rcu protection of lookup.
> >> >We need either kfree_rcu/call_rcu or synchronize_rcu.
> >> >Obviously the former is preferred that's why I'm still digging into it.
> >> >Probably a thread that does kfree_rcu would be ok, but we shouldn't
> >> >be doing it unconditionally. For all networking programs and 99%
> >> >of tracing programs the existing code is fine and I don't want to
> >> >slow it down to tackle the corner case.
> >> >Extra spin_lock just to add it to the list is also quite costly.
> > Use a irq_work() handler to do the kfree_rcu(), and use llist (lockless
> > list) to add items to the list.
>
> have been studying irq_work and llist... it will work, but it's quite
> costly too. Every kfree_rcu will be replaced with irq_work_queue(),
> which is irq_work_claim() with one lock_cmpxchg plus another
> lock_cmpxchg in llist_add, plus another lock_cmpxchg for our own llist
> of 'to be kfree_rcu-ed htab elements'. That's a lot.
> The must be better solution. Need to explore more.
Do what I do in tracing. Use a bit (per cpu?) test.
Add the element to the list (that will be a cmpxchg, but I'm not sure
you can avoid it), then check the bit to see if the irq work has already
been activated. If not, then activate the irq work and set the bit.
Then you will not have any more cmpxchg in the fast path.
In your irq work handler, you clear the bit, process all the entries
until the list is empty, check if the bit is set again, and repeat.
I haven't looked at the thread before I was added to the Cc, so I'm
answering this out of context.
-- Steve
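A rough sketch of that pattern applied to the hash map free path (per-CPU
state; the llist member in htab_elem and all names here are hypothetical,
the per-CPU init_irq_work() setup is omitted, and, as the thread shows later,
calling kfree_rcu() from the irq_work handler still re-enters call_rcu(), so
this only trims the cmpxchg count, it is not a fix by itself):

static DEFINE_PER_CPU(struct llist_head, elem_free_list);
static DEFINE_PER_CPU(bool, elem_free_pending);
static DEFINE_PER_CPU(struct irq_work, elem_free_work);

static void elem_free_work_cb(struct irq_work *work)
{
	struct llist_node *head;
	struct htab_elem *e, *tmp;

	do {
		/* clear the flag first, then drain, then re-check it */
		this_cpu_write(elem_free_pending, false);
		head = llist_del_all(this_cpu_ptr(&elem_free_list));
		llist_for_each_entry_safe(e, tmp, head, llist)
			kfree_rcu(e, rcu);
	} while (this_cpu_read(elem_free_pending));
}

static void elem_free_deferred(struct htab_elem *e)
{
	/* the llist_add() cmpxchg is the only atomic op in the fast path */
	llist_add(&e->llist, this_cpu_ptr(&elem_free_list));
	if (!this_cpu_read(elem_free_pending)) {
		this_cpu_write(elem_free_pending, true);
		irq_work_queue(this_cpu_ptr(&elem_free_work));
	}
}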
On 6/16/15 5:47 PM, Steven Rostedt wrote:
> Do what I do in tracing. Use a bit (per cpu?) test.
>
> Add the element to the list (that will be a cmpxchg, but I'm not sure
> you can avoid it), then check the bit to see if the irq work is already
> been activated. If not, then activate the irq work and set the bit.
> Then you will not have any more cmpxchg in the fast path.
you mean similar to what rb_wakeups() and friends are doing?
makes sense. starting to study it...
On Tue, 16 Jun 2015 18:04:39 -0700
Alexei Starovoitov <[email protected]> wrote:
>
> you mean similar to what rb_wakeups() and friends are doing?
> makes sense. starting to study it...
Yes, I meant those functions. But yours would be slightly different, as
it would be the code queueing the irq work that also sets the
flag, not a separate entity (like the reader of the ring buffer).
I haven't fully thought it through, but I imagine it should still work.
-- Steve
On 06/16/2015 07:20 PM, Alexei Starovoitov wrote:
> On 6/16/15 5:38 AM, Daniel Wagner wrote:
>> static int free_thread(void *arg)
>> +{
>> + unsigned long flags;
>> + struct htab_elem *l;
>> +
>> + while (!kthread_should_stop()) {
>> + spin_lock_irqsave(&elem_freelist_lock, flags);
>> + while (!list_empty(&elem_freelist)) {
>> + l = list_entry(elem_freelist.next,
>> + struct htab_elem, list);
>> + list_del(&l->list);
>> + kfree(l);
>
> that's not right, since such thread defeats rcu protection of lookup.
> We need either kfree_rcu/call_rcu or synchronize_rcu.
D'oh, I have not seen the elephant in the room.
/me grabs a brown bag.
> Obviously the former is preferred that's why I'm still digging into it.
> Probably a thread that does kfree_rcu would be ok, but we shouldn't
> be doing it unconditionally. For all networking programs and 99%
> of tracing programs the existing code is fine and I don't want to
> slow it down to tackle the corner case.
> Extra spin_lock just to add it to the list is also quite costly.
Anyway, I changed the above kfree() to a kfree_rcu() and it explodes
again, with the same stack trace we have seen before.
Steven's suggestion of deferring the work via irq_work results in the same
stack trace. (Now I get cold feet, without the nice heat from the CPU
busy looping...)
It looks like there is something else hidden somewhere.
cheers,
daniel
From 3215ba9e4f9ecd0c7183f2400d9d579dcd5f4bc3 Mon Sep 17 00:00:00 2001
From: Daniel Wagner <[email protected]>
Date: Wed, 17 Jun 2015 09:52:23 +0200
Subject: [PATCH] bpf: Defer kfree_rcu via irq_work
---
kernel/bpf/hashtab.c | 35 +++++++++++++++++++++++++++++++++--
1 file changed, 33 insertions(+), 2 deletions(-)
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 83c209d..f6f1702 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -13,6 +13,7 @@
#include <linux/jhash.h>
#include <linux/filter.h>
#include <linux/vmalloc.h>
+#include <linux/irq_work.h>
struct bpf_htab {
struct bpf_map map;
@@ -27,10 +28,39 @@ struct bpf_htab {
struct htab_elem {
struct hlist_node hash_node;
struct rcu_head rcu;
+ struct llist_node llist;
u32 hash;
char key[0] __aligned(8);
};
+static struct irq_work free_work;
+static LLIST_HEAD(free_list);
+static bool free_pending;
+
+static void free_work_cb(struct irq_work *work)
+{
+ struct llist_node *n;
+ struct htab_elem *e;
+
+ free_pending = false;
+
+ n = llist_del_all(&free_list);
+ if (!n)
+ return;
+
+ llist_for_each_entry(e, n, llist)
+ kfree_rcu(e, rcu);
+}
+
+static void free_elem(struct htab_elem *e)
+{
+ llist_add(&e->llist, &free_list);
+ if (!free_pending) {
+ free_pending = true;
+ irq_work_queue(&free_work);
+ }
+}
+
/* Called from syscall */
static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
{
@@ -262,7 +292,7 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
hlist_add_head_rcu(&l_new->hash_node, head);
if (l_old) {
hlist_del_rcu(&l_old->hash_node);
- kfree_rcu(l_old, rcu);
+ free_elem(l_old);
} else {
htab->count++;
}
@@ -300,7 +330,7 @@ static int htab_map_delete_elem(struct bpf_map *map, void *key)
if (l) {
hlist_del_rcu(&l->hash_node);
htab->count--;
- kfree_rcu(l, rcu);
+ free_elem(l);
ret = 0;
}
@@ -361,6 +391,7 @@ static struct bpf_map_type_list htab_type __read_mostly = {
static int __init register_htab_map(void)
{
+ init_irq_work(&free_work, free_work_cb);
bpf_register_map_type(&htab_type);
return 0;
}
--
2.1.0
On 06/17/2015 10:11 AM, Daniel Wagner wrote:
> On 06/16/2015 07:20 PM, Alexei Starovoitov wrote:
>> On 6/16/15 5:38 AM, Daniel Wagner wrote:
>>> static int free_thread(void *arg)
>>> +{
>>> + unsigned long flags;
>>> + struct htab_elem *l;
>>> +
>>> + while (!kthread_should_stop()) {
>>> + spin_lock_irqsave(&elem_freelist_lock, flags);
>>> + while (!list_empty(&elem_freelist)) {
>>> + l = list_entry(elem_freelist.next,
>>> + struct htab_elem, list);
>>> + list_del(&l->list);
>>> + kfree(l);
>>
> Anyway, I changed to above kfree() to a kfree_rcu() and it explodes
> again. With the same stack trace we seen.
Correction. I did this without the rcu_is_watching() change. With that
patch applied it works fine again.
> Steven's suggestion deferring the work via irq_work results in the same
> stack trace. (Now I get cold feets, without the nice heat from the CPU
> busy looping...)
That one is still not working. It also makes the system really, really slow.
I guess I am still doing something completely wrong.
On 6/17/15 2:05 AM, Daniel Wagner wrote:
>> >Steven's suggestion deferring the work via irq_work results in the same
>> >stack trace. (Now I get cold feets, without the nice heat from the CPU
>> >busy looping...)
> That one still not working. It also makes the system really really slow.
> I guess I still do something completely wrong.
tried your irq_work patch. It indeed makes the whole system
unresponsive. Ctrl-C of hwlathist no longer works and
it runs out of memory in 20 sec or so of running hwlathist
on an idle system (without parallel hackbench).
It looks like the free_pending flag is racy, so I removed it,
but it didn't help.
Also I've tried all sorts of other things in rcu, including
adding rcu_bpf similar to rcu_sched to make sure that a recursive
call into call_rcu will not be messing up rcu_preempt or rcu_sched
states and instead will be operating on rcu_bpf per-cpu states.
In theory that should have worked flawlessly and it sort-of did.
But multiple hackbench runs still managed to crash it.
So far I think the temporary workaround is to stick with array maps
for probing such low-level things as trace_preempt.
Note that pre-allocation of all elements in the hash map also won't
help, since the problem here is some collision of call_rcu and
rcu_process_callbacks. I'm pretty sure that kfree_rcu with the
rcu_is_watching patch is ready for this type of abuse.
The rcu_process_callbacks() path - not yet. I'm still analyzing it.
On Wed, Jun 17, 2015 at 11:39:29AM -0700, Alexei Starovoitov wrote:
> On 6/17/15 2:05 AM, Daniel Wagner wrote:
> >>>Steven's suggestion deferring the work via irq_work results in the same
> >>>stack trace. (Now I get cold feets, without the nice heat from the CPU
> >>>busy looping...)
> >That one still not working. It also makes the system really really slow.
> >I guess I still do something completely wrong.
>
> tried your irq_work patch. It indeed makes the whole system
> unresponsive. Ctrl-C of hwlathist no longer works and
> it runs out of memory in 20 sec or so of running hwlathist
> on idle system (without parallel hackbench).
> It looks that free_pending flag is racy, so I removed it,
> but it didn't help.
>
> Also I've tried all sort of other things in rcu including
> add rcu_bpf similar to rcu_sched to make sure that recursive
> call into call_rcu will not be messing rcu_preempt or rcu_sched
> states and instead will be operating on rcu_bpf per-cpu states.
> In theory that should have worked flawlessly and it sort-of did.
> But multiple hackbench runs still managed to crash it.
> So far I think the temp workaround is to stick with array maps
> for probing such low level things like trace_preempt.
> Note that pre-allocation of all elements in hash map also won't
> help, since the problem here is some collision of call_rcu and
> rcu_process_callbacks. I'm pretty sure that kfree_rcu with
> rcu_is_watching patch is ready for this type of abuse.
> The rcu_process_callbacks() path - no yet. I'm still analyzing it.
How about if I just gave you a hook in __call_rcu() itself, just before
it returns, just after the local_irq_restore()? You could maintain
recursion flags and test the environment, at some point handling any
memory that needed freeing.
The idea would be to use an atomic queue to accumulate the to-be-freed
data, then kfree_rcu() it in the hook if it was safe to do so.
I really don't trust the re-entrancy, especially not in the long term.
Thanx, Paul
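As a sketch of the kind of hook being proposed here (no such hook exists in
__call_rcu() today; the names, the recursion counter, and the exact placement
are hypothetical, and preemption/migration details are glossed over):

static DEFINE_PER_CPU(struct llist_head, htab_free_list);
static DEFINE_PER_CPU(int, htab_free_recursion);

/* hypothetically invoked at the end of __call_rcu(), after local_irq_restore() */
static void htab_call_rcu_exit_hook(void)
{
	struct llist_node *head;
	struct htab_elem *e, *tmp;

	if (in_interrupt() || this_cpu_read(htab_free_recursion))
		return;		/* not a safe environment; try again later */

	this_cpu_inc(htab_free_recursion);
	head = llist_del_all(this_cpu_ptr(&htab_free_list));
	llist_for_each_entry_safe(e, tmp, head, llist)
		kfree_rcu(e, rcu); /* re-enters __call_rcu(); the guard above
				      stops the nested hook from recursing */
	this_cpu_dec(htab_free_recursion);
}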
On 6/17/15 1:37 PM, Paul E. McKenney wrote:
> On Wed, Jun 17, 2015 at 11:39:29AM -0700, Alexei Starovoitov wrote:
>> On 6/17/15 2:05 AM, Daniel Wagner wrote:
>>>>> Steven's suggestion deferring the work via irq_work results in the same
>>>>> stack trace. (Now I get cold feets, without the nice heat from the CPU
>>>>> busy looping...)
>>> That one still not working. It also makes the system really really slow.
>>> I guess I still do something completely wrong.
>>
>> tried your irq_work patch. It indeed makes the whole system
>> unresponsive. Ctrl-C of hwlathist no longer works and
>> it runs out of memory in 20 sec or so of running hwlathist
>> on idle system (without parallel hackbench).
>> It looks that free_pending flag is racy, so I removed it,
>> but it didn't help.
>>
>> Also I've tried all sort of other things in rcu including
>> add rcu_bpf similar to rcu_sched to make sure that recursive
>> call into call_rcu will not be messing rcu_preempt or rcu_sched
>> states and instead will be operating on rcu_bpf per-cpu states.
>> In theory that should have worked flawlessly and it sort-of did.
>> But multiple hackbench runs still managed to crash it.
>> So far I think the temp workaround is to stick with array maps
>> for probing such low level things like trace_preempt.
>> Note that pre-allocation of all elements in hash map also won't
>> help, since the problem here is some collision of call_rcu and
>> rcu_process_callbacks. I'm pretty sure that kfree_rcu with
>> rcu_is_watching patch is ready for this type of abuse.
>> The rcu_process_callbacks() path - no yet. I'm still analyzing it.
>
> How about if I just gave you a hook in __call_rcu() itself, just before
> it returns, just after the local_irq_restore()? You could maintain
> recursion flags and test the environment, at some point handling any
> memory that needed freeing.
>
> The idea would be to use an atomic queue to accumulate the to-be-freed
> data, then kfree_rcu() it in the hook if it was safe to do so.
I'm not yet seeing how it will look. You mean I'll just locklessly
enqueue into some global llist and it will get kfree-d from
rcu_process_callbacks()?
Something like kfree_rcu_lockless(ptr_to_be_freed) that
does an llist_add and lets the rcu core know that something has to be freed?
I think such a feature would be very useful in general.
Or maybe kfree_rcu_this_cpu(ptr_to_be_freed) that uses a
per-cpu llist of 'to-be-kfreed' objects?
Performance would be great and there would be no need to embed an
rcu_head in every data structure.
btw, irq_work suffers the same re-entrancy problem:
[ 19.914910] [<ffffffff8117c63c>] free_work_cb+0x2c/0x50
[ 19.914910] [<ffffffff81176624>] irq_work_run_list+0x44/0x70
[ 19.914910] [<ffffffff8117667e>] irq_work_run+0x2e/0x50
[ 19.914910] [<ffffffff81008d0e>] smp_irq_work_interrupt+0x2e/0x40
[ 19.914910] [<ffffffff8178dd20>] irq_work_interrupt+0x70/0x80
[ 19.914910] <EOI> [<ffffffff813d3cec>] ?
debug_object_active_state+0xfc/0x150
[ 19.914910] [<ffffffff813d3cec>] ? debug_object_active_state+0xfc/0x150
[ 19.914910] [<ffffffff8178c21b>] ? _raw_spin_unlock_irqrestore+0x4b/0x80
[ 19.914910] [<ffffffff8115e0a7>] ? trace_preempt_on+0x7/0x100
[ 19.914910] [<ffffffff810991c3>] ? preempt_count_sub+0x73/0xf0
[ 19.914910] [<ffffffff8178c21b>] _raw_spin_unlock_irqrestore+0x4b/0x80
[ 19.914910] [<ffffffff813d3cec>] debug_object_active_state+0xfc/0x150
[ 19.914910] [<ffffffff812035f0>] ? get_max_files+0x20/0x20
[ 19.914910] [<ffffffff810e2d2f>] __call_rcu.constprop.67+0x5f/0x350
[ 19.914910] [<ffffffff810e3097>] call_rcu+0x17/0x20
[ 19.914910] [<ffffffff81203843>] __fput+0x183/0x200
so if I do call_rcu from free_work_cb, it's equally bad
as calling call_rcu from trace_preempt_on
On Wed, Jun 17, 2015 at 01:53:17PM -0700, Alexei Starovoitov wrote:
> On 6/17/15 1:37 PM, Paul E. McKenney wrote:
> >On Wed, Jun 17, 2015 at 11:39:29AM -0700, Alexei Starovoitov wrote:
> >>On 6/17/15 2:05 AM, Daniel Wagner wrote:
> >>>>>Steven's suggestion deferring the work via irq_work results in the same
> >>>>>stack trace. (Now I get cold feets, without the nice heat from the CPU
> >>>>>busy looping...)
> >>>That one still not working. It also makes the system really really slow.
> >>>I guess I still do something completely wrong.
> >>
> >>tried your irq_work patch. It indeed makes the whole system
> >>unresponsive. Ctrl-C of hwlathist no longer works and
> >>it runs out of memory in 20 sec or so of running hwlathist
> >>on idle system (without parallel hackbench).
> >>It looks that free_pending flag is racy, so I removed it,
> >>but it didn't help.
> >>
> >>Also I've tried all sort of other things in rcu including
> >>add rcu_bpf similar to rcu_sched to make sure that recursive
> >>call into call_rcu will not be messing rcu_preempt or rcu_sched
> >>states and instead will be operating on rcu_bpf per-cpu states.
> >>In theory that should have worked flawlessly and it sort-of did.
> >>But multiple hackbench runs still managed to crash it.
> >>So far I think the temp workaround is to stick with array maps
> >>for probing such low level things like trace_preempt.
> >>Note that pre-allocation of all elements in hash map also won't
> >>help, since the problem here is some collision of call_rcu and
> >>rcu_process_callbacks. I'm pretty sure that kfree_rcu with
> >>rcu_is_watching patch is ready for this type of abuse.
> >>The rcu_process_callbacks() path - no yet. I'm still analyzing it.
> >
> >How about if I just gave you a hook in __call_rcu() itself, just before
> >it returns, just after the local_irq_restore()? You could maintain
> >recursion flags and test the environment, at some point handling any
> >memory that needed freeing.
> >
> >The idea would be to use an atomic queue to accumulate the to-be-freed
> >data, then kfree_rcu() it in the hook if it was safe to do so.
>
> I'm not yet seeing how it will look. You mean I'll just locklessly
> enqueue into some global llist and it will get kfree-d from
> rcu_process_callbacks() ?
Locklessly enqueue onto a per-CPU list, but yes. The freeing is up to
you -- you get called just before exit from __call_rcu(), and get to
figure out what to do.
My guess would be if not in interrupt and not recursively invoked,
atomically remove all the elements from the list, then pass each to
kfree_rcu(), and finally let things take their course from there.
The llist APIs look like they would work.
> Something like kfree_rcu_lockless(ptr_to_be_freed) that
> llist_adds and let rcu core know that something has to be freed?
> I think such feature would be very useful in general.
> Or may be kfree_rcu_this_cpu(ptr_to_be_freed) that uses
> per-cpu llist of 'to-be-kfreed' objects?
> Performance will be great and not need to embed rcu_head in
> every datastructure.
Well, you do need to have something in each element to allow them to be
tracked. You could indeed use llist_add() to maintain the per-CPU list,
and then use llist_del_all() to bulk-remove all the elements from the per-CPU
list. You can then pass each element in turn to kfree_rcu(). And yes,
I am suggesting that you open-code this, as it is going to be easier to
handle your special case than to provide a fully general solution. For
one thing, the general solution would require a full rcu_head to track
offset and next. In contrast, you can special-case the offset. And
ignore the overload special cases.
Thanx, Paul
> btw, irq_work suffers the same re-entrancy problem:
> [ 19.914910] [<ffffffff8117c63c>] free_work_cb+0x2c/0x50
> [ 19.914910] [<ffffffff81176624>] irq_work_run_list+0x44/0x70
> [ 19.914910] [<ffffffff8117667e>] irq_work_run+0x2e/0x50
> [ 19.914910] [<ffffffff81008d0e>] smp_irq_work_interrupt+0x2e/0x40
> [ 19.914910] [<ffffffff8178dd20>] irq_work_interrupt+0x70/0x80
> [ 19.914910] <EOI> [<ffffffff813d3cec>] ?
> debug_object_active_state+0xfc/0x150
> [ 19.914910] [<ffffffff813d3cec>] ? debug_object_active_state+0xfc/0x150
> [ 19.914910] [<ffffffff8178c21b>] ? _raw_spin_unlock_irqrestore+0x4b/0x80
> [ 19.914910] [<ffffffff8115e0a7>] ? trace_preempt_on+0x7/0x100
> [ 19.914910] [<ffffffff810991c3>] ? preempt_count_sub+0x73/0xf0
> [ 19.914910] [<ffffffff8178c21b>] _raw_spin_unlock_irqrestore+0x4b/0x80
> [ 19.914910] [<ffffffff813d3cec>] debug_object_active_state+0xfc/0x150
> [ 19.914910] [<ffffffff812035f0>] ? get_max_files+0x20/0x20
> [ 19.914910] [<ffffffff810e2d2f>] __call_rcu.constprop.67+0x5f/0x350
> [ 19.914910] [<ffffffff810e3097>] call_rcu+0x17/0x20
> [ 19.914910] [<ffffffff81203843>] __fput+0x183/0x200
>
> so if I do call_rcu from free_work_cb, it's equally bad
> as calling call_rcu from trace_preempt_on
>
>
On 6/17/15 2:36 PM, Paul E. McKenney wrote:
> Well, you do need to have something in each element to allow them to be
> tracked. You could indeed use llist_add() to maintain the per-CPU list,
> and then use llist_del_all() bulk-remove all the elements from the per-CPU
> list. You can then pass each element in turn to kfree_rcu(). And yes,
> I am suggesting that you open-code this, as it is going to be easier to
> handle your special case then to provide a fully general solution. For
> one thing, the general solution would require a full rcu_head to track
> offset and next. In contrast, you can special-case the offset. And
> ignore the overload special cases.
yes. all makes sense.
> Locklessly enqueue onto a per-CPU list, but yes. The freeing is up to
yes. per-cpu llist indeed.
> you -- you get called just before exit from __call_rcu(), and get to
> figure out what to do.
>
> My guess would be if not in interrupt and not recursively invoked,
> atomically remove all the elements from the list, then pass each to
> kfree_rcu(), and finally let things take their course from there.
> The llist APIs look like they would work.
The above, and the 'just before the exit from __call_rcu()' part of the
suggestion, I still don't understand.
To avoid reentry into call_rcu I can either create one or N new kthreads
or a workqueue and do manual wakeups, but that's very specialized and I
don't want to permanently waste them, so I'm thinking of doing llist_add into
per-cpu llists and doing llist_del_all in rcu_process_callbacks() to take
the elements from these llists and call kfree_rcu on them.
The llist_add part will also do:
if (!rcu_is_watching()) invoke_rcu_core();
to raise the softirq when necessary.
So in the end it will look like a two-phase kfree_rcu.
I'll try to code it up and see if it explodes :)
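A sketch of that two-phase scheme (names hypothetical; invoke_rcu_core() is
internal to kernel/rcu/tree.c, so the pieces below assume a hook or export
that does not exist today):

/* phase one: lockless enqueue, callable from the trace_preempt_on() path */
static DEFINE_PER_CPU(struct llist_head, htab_free_list);

static void htab_free_deferred(struct htab_elem *e)
{
	llist_add(&e->llist, this_cpu_ptr(&htab_free_list));
	if (!rcu_is_watching())
		invoke_rcu_core();	/* make sure the softirq gets raised */
}

/* phase two: drained from rcu_process_callbacks(), i.e. RCU_SOFTIRQ */
static void htab_drain_free_list(void)
{
	struct llist_node *head;
	struct htab_elem *e, *tmp;

	head = llist_del_all(this_cpu_ptr(&htab_free_list));
	llist_for_each_entry_safe(e, tmp, head, llist)
		kfree_rcu(e, rcu);	/* no longer nested inside trace_preempt_on() */
}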
On Wed, Jun 17, 2015 at 04:58:48PM -0700, Alexei Starovoitov wrote:
> On 6/17/15 2:36 PM, Paul E. McKenney wrote:
> >Well, you do need to have something in each element to allow them to be
> >tracked. You could indeed use llist_add() to maintain the per-CPU list,
> >and then use llist_del_all() bulk-remove all the elements from the per-CPU
> >list. You can then pass each element in turn to kfree_rcu(). And yes,
> >I am suggesting that you open-code this, as it is going to be easier to
> >handle your special case then to provide a fully general solution. For
> >one thing, the general solution would require a full rcu_head to track
> >offset and next. In contrast, you can special-case the offset. And
> >ignore the overload special cases.
>
> yes. all makes sense.
>
> > Locklessly enqueue onto a per-CPU list, but yes. The freeing is up to
>
> yes. per-cpu llist indeed.
>
> > you -- you get called just before exit from __call_rcu(), and get to
> > figure out what to do.
> >
> > My guess would be if not in interrupt and not recursively invoked,
> > atomically remove all the elements from the list, then pass each to
> > kfree_rcu(), and finally let things take their course from there.
> > The llist APIs look like they would work.
>
> Above and 'just before the exit from __call_rcu()' part of suggestion
> I still don't understand.
> To avoid reentry into call_rcu I can either create 1 or N new kthreads
> or work_queue and do manual wakeups, but that's very specialized and I
> don't want to permanently waste them, so I'm thinking to llist_add into
> per-cpu llists and do llist_del_all in rcu_process_callbacks() to take
> them from these llists and call kfree_rcu on them.
Another option is to drain the lists the next time you do an allocation.
That would avoid hooking both __call_rcu() and rcu_process_callbacks().
Thanx, Paul
> The llist_add part will also do:
> if (!rcu_is_watching()) invoke_rcu_core();
> to raise softirq when necessary.
> So at the end it will look like two phase kfree_rcu.
> I'll try to code it up and see it explodes :)
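For completeness, a sketch of that drain-at-allocation variant, reusing the
per-CPU htab_free_list from the sketches above (hypothetical helper; whether
it is enough depends on the allocation itself not running from the
trace_preempt_on() path, which is exactly the context under discussion):

/* drain this CPU's deferred-free list whenever a new element is allocated */
static void *htab_elem_alloc(size_t size)
{
	struct llist_node *head;
	struct htab_elem *e, *tmp;

	head = llist_del_all(this_cpu_ptr(&htab_free_list));
	llist_for_each_entry_safe(e, tmp, head, llist)
		kfree_rcu(e, rcu);

	return kmalloc(size, GFP_ATOMIC);
}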