Date: Mon, 15 Jun 2015 16:07:02 -0700
From: "Paul E. McKenney"
To: Alexei Starovoitov
Cc: Daniel Wagner, LKML
Subject: Re: call_rcu from trace_preempt
Message-ID: <20150615230702.GB3913@linux.vnet.ibm.com>
Reply-To: paulmck@linux.vnet.ibm.com
References: <557F509D.2000509@plumgrid.com>
In-Reply-To: <557F509D.2000509@plumgrid.com>

On Mon, Jun 15, 2015 at 03:24:29PM -0700, Alexei Starovoitov wrote:
> Hi Paul,
>
> I've been debugging the issue reported by Daniel:
> http://thread.gmane.org/gmane.linux.kernel/1974304/focus=1974304
> and it seems I narrowed it down to recursive call_rcu.

By "recursive call_rcu()", you mean invoking call_rcu() twice in a row
on the same memory, like this?

	call_rcu(&p->rh, some_callback_function);
	do_something_quick();
	call_rcu(&p->rh, another_callback_function);

Because this, by contrast, is perfectly legal:

	void endpoint_callback_function(struct rcu_head *p)
	{
		struct foo *fp = container_of(p, struct foo, rh);

		kfree(fp);
	}

	void recirculating_callback_function(struct rcu_head *p)
	{
		call_rcu(p, endpoint_callback_function);
	}

	...
	call_rcu(&fp->rh, recirculating_callback_function);

This sort of thing is actually used in some situations involving RCU and
reference counters.

> From trace_preempt_on() I'm doing:
> e = kmalloc(sizeof(struct elem), GFP_ATOMIC)
> kfree_rcu(e, rcu)

As written, this should be OK, assuming that "rcu" is a field of type
"struct rcu_head" (not a pointer!) within "struct elem".

> which is causing all sorts of corruptions like:
> [ 2.074175] WARNING: CPU: 0 PID: 3 at ../lib/debugobjects.c:263 debug_print_object+0x8c/0xb0()
> [ 2.075567] ODEBUG: active_state not available (active state 0) object type: rcu_head hint: (null)
>
> [ 2.102141] WARNING: CPU: 0 PID: 3 at ../lib/debugobjects.c:263 debug_print_object+0x8c/0xb0()
> [ 2.103547] ODEBUG: deactivate not available (active state 0) object type: rcu_head hint: (null)
>
> [ 2.253995] WARNING: CPU: 0 PID: 7 at ../kernel/rcu/tree.c:2976 __call_rcu.constprop.67+0x1e5/0x350()
> [ 2.255510] __call_rcu(): Leaked duplicate callback
>
> Sometimes stack looks like:
> [ 2.145163] WARNING: CPU: 0 PID: 102 at ../lib/debugobjects.c:263 debug_print_object+0x8c/0xb0()
> [ 2.147465] ODEBUG: active_state not available (active state 0) object type: rcu_head hint: (null)
> [ 2.148022] Modules linked in:
> [ 2.148022] CPU: 0 PID: 102 Comm: systemd-udevd Not tainted 4.1.0-rc7+ #653
> [ 2.148022] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-rc1-0-gb1d4dc9-20140515_140003-nilsson.home.kraxel.org 04/01/2014
> [ 2.148022] ffffffff81a34f77 ffff88000fc03d18 ffffffff81781ed4 0000000000000105
> [ 2.148022] ffff88000fc03d68 ffff88000fc03d58 ffffffff81064e57 0000000000000000
> [ 2.148022] ffff88000fc03e20 ffffffff81c50f00 ffffffff81a34fdf 0000000000000286
> [ 2.148022] Call Trace:
> [ 2.148022] [] dump_stack+0x4f/0x7b
> [ 2.148022] [] warn_slowpath_common+0x97/0xe0
> [ 2.148022] [] warn_slowpath_fmt+0x46/0x50
> [ 2.148022] [] debug_print_object+0x8c/0xb0
> [ 2.148022] [] ? debug_object_active_state+0x66/0x160
> [ 2.148022] [] debug_object_active_state+0xf1/0x160
> [ 2.148022] [] rcu_process_callbacks+0x301/0xae0
> [ 2.148022] [] ? rcu_process_callbacks+0x2e7/0xae0
> [ 2.148022] [] ? run_timer_softirq+0x218/0x4c0
> [ 2.148022] [] __do_softirq+0x14f/0x670
> [ 2.148022] [] irq_exit+0xa5/0xb0
> [ 2.148022] [] smp_apic_timer_interrupt+0x4a/0x60
> [ 2.148022] [] apic_timer_interrupt+0x70/0x80
> [ 2.148022] [] ? debug_object_activate+0x9c/0x1e0
> [ 2.148022] [] ? _raw_spin_unlock_irqrestore+0x67/0x80
> [ 2.148022] [] debug_object_activate+0x156/0x1e0
> [ 2.148022] [] rcuhead_fixup_activate+0x37/0x40
> [ 2.148022] [] debug_object_activate+0x101/0x1e0
> [ 2.148022] [] ? _raw_spin_unlock_irqrestore+0x4b/0x80
> [ 2.148022] [] __call_rcu.constprop.67+0x46/0x350
> [ 2.148022] [] ? __debug_object_init+0x3f4/0x430
> [ 2.148022] [] ? _raw_spin_unlock_irqrestore+0x4b/0x80
> [ 2.148022] [] kfree_call_rcu+0x1a/0x20
> [ 2.148022] [] trace_preempt_on+0x180/0x290
> [ 2.148022] [] ? trace_preempt_on+0xce/0x290
> [ 2.148022] [] preempt_count_sub+0x73/0xf0
> [ 2.148022] [] _raw_spin_unlock_irqrestore+0x4b/0x80
> [ 2.148022] [] __debug_object_init+0x3f4/0x430
> [ 2.148022] [] ? trace_preempt_on+0x18c/0x290
> [ 2.148022] [] debug_object_init+0x1b/0x20
> [ 2.148022] [] rcuhead_fixup_activate+0x28/0x40
> [ 2.148022] [] debug_object_activate+0x101/0x1e0
> [ 2.148022] [] ? get_max_files+0x20/0x20
> [ 2.148022] [] __call_rcu.constprop.67+0x46/0x350
> [ 2.148022] [] call_rcu+0x17/0x20
> [ 2.148022] [] __fput+0x183/0x200
> [ 2.148022] [] ____fput+0xe/0x10
> [ 2.148022] [] task_work_run+0xb5/0xe0
> [ 2.148022] [] do_notify_resume+0x64/0x80
> [ 2.148022] [] int_signal+0x12/0x17
>
> My reading of the code is debug_object_*() bits are reporting a real
> problem. In the above trace the call
> debug_rcu_head_unqueue(list);
> from rcu_do_batch() is not finding 'list' in tracked objects.
>
> I know that doing call_rcu() from trace_preempt is ill advised,
> but I still want to understand why call_rcu corrupts the memory.

Hmmm... This is what I would expect if you invoked call_rcu() (or
kfree_rcu(), for that matter) on some memory, then freed it before the
grace period ended. This would cause the same problems as any other
use-after-free error. Might this be happening?

> Attaching a patch that I'm using for debugging.
> It's doing recursion preemption check, so number of nested call_rcu
> is no more than 2.

Oh... One important thing is that both call_rcu() and kfree_rcu()
use per-CPU variables, managing a per-CPU linked list. This is why
they disable interrupts. If you do another call_rcu() in the middle
of the first one in just the wrong place, you will have two entities
concurrently manipulating the same linked list, which will not go well.

Maybe mark call_rcu() and the things it calls as notrace? Or you
could maintain a separate per-CPU linked list that gathers up the
stuff to be kfree()ed after a grace period, and some time later
feeds it to kfree_rcu()?

> Also if I replace kfree_rcu in this patch with a regular kfree,
> all works fine.
>
> I'm seeing these crashes in VM with _single_ cpu.
> Kernel is built with CONFIG_PREEMPT, CONFIG_PREEMPT_TRACER and
> CONFIG_DEBUG_OBJECTS_RCU_HEAD.

No surprise -- when you have lists hanging off of per-CPU variables,
it only takes one CPU to tangle the lists.

> Also interesting that size of
> struct elem {
> 	u64 pad[32];
> 	struct rcu_head rcu;
> };
> that I'm using in kmalloc/kfree_rcu changes the crash.
> If padding is zero, kernel just locks up, if pad[1] I see
> one type of odebug warnings, if pad[32] - another.

The usual consequence of racing a pair of callback insertions on the
same CPU would be that one of them gets leaked, and possibly all
subsequent callbacks. So the lockup is no surprise. And there are a
lot of other assumptions in nearby code paths about only one execution
at a time from a given CPU.
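
To make the leak described above concrete, here is a minimal user-space
sketch. It is not the kernel's code: it only assumes that the per-CPU
callback list is a singly linked list with a tail pointer, and every name
in it (struct cb, struct cb_list, cb_enqueue() and so on) is made up for
illustration. The point where the second insertion cuts in is chosen by
hand to be "just the wrong place".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct rcu_head: just a list node. */
struct cb {
	struct cb *next;
};

/* Hypothetical stand-in for one CPU's callback list. */
struct cb_list {
	struct cb *head;
	struct cb **tail;	/* points at the last ->next (or at head) */
};

static void cb_list_init(struct cb_list *l)
{
	l->head = NULL;
	l->tail = &l->head;
}

/* Two-step tail enqueue; correct only if nothing runs in between. */
static void cb_enqueue(struct cb_list *l, struct cb *c)
{
	c->next = NULL;
	*l->tail = c;
	l->tail = &c->next;
}

/*
 * Enqueue 'outer', but let an enqueue of 'nested' run between the two
 * steps, modelling a tracer hook that re-enters on the same CPU.
 */
static void cb_enqueue_interrupted(struct cb_list *l, struct cb *outer,
				   struct cb *nested)
{
	outer->next = NULL;
	*l->tail = outer;		/* step 1: link outer at the tail */
	cb_enqueue(l, nested);		/* re-entry overwrites that link */
	l->tail = &outer->next;		/* step 2: tail now points off-list */
}

/* Count the callbacks still reachable from the list head. */
static int cb_count(const struct cb_list *l)
{
	const struct cb *c;
	int n = 0;

	for (c = l->head; c; c = c->next)
		n++;
	return n;
}

/* Queue three callbacks with one bad interleaving; report survivors. */
static int demo_reachable_after_race(void)
{
	struct cb_list l;
	struct cb a, b, c;

	cb_list_init(&l);
	cb_enqueue_interrupted(&l, &a, &b);	/* 'a' is orphaned here */
	cb_enqueue(&l, &c);			/* 'c' lands behind 'a' */
	assert(l.head == &b);
	return cb_count(&l);
}
```

In this model three callbacks are queued but only one remains reachable
from the head: the interrupted insertion is lost, and so is the one
queued after it, because the tail pointer now refers to the orphaned
node. Anything waiting on the lost callbacks would wait forever.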
> Any advice on where to look is greatly appreciated.

What I don't understand is exactly what you are trying to do. Are you
trying to build more complex tracers that dynamically allocate memory?
If so, having a per-CPU list that stages memory to be freed so that it
can be passed to call_rcu() in a safe environment might make sense.
Of course, that list would need to be managed carefully!

Or am I missing the point of the code below?

							Thanx, Paul

> Thanks!
>
> diff --git a/kernel/trace/trace_irqsoff.c b/kernel/trace/trace_irqsoff.c
> index 8523ea345f2b..89433a83dd2d 100644
> --- a/kernel/trace/trace_irqsoff.c
> +++ b/kernel/trace/trace_irqsoff.c
> @@ -13,6 +13,7 @@
> #include
> #include
> #include
> +#include
>
> #include "trace.h"
>
> @@ -510,8 +511,42 @@ EXPORT_SYMBOL(trace_hardirqs_off_caller);
> #endif /* CONFIG_IRQSOFF_TRACER */
>
> #ifdef CONFIG_PREEMPT_TRACER
> +struct elem {
> +	u64 pad[32];
> +	struct rcu_head rcu;
> +};
> +
> +static DEFINE_PER_CPU(int, prog_active);
> +static void * test_alloc(void)
> +{
> +	struct elem *e = NULL;
> +
> +	if (in_nmi())
> +		return e;
> +
> +	preempt_disable_notrace();
> +	if (unlikely(__this_cpu_inc_return(prog_active) != 1))
> +		goto out;
> +
> +	rcu_read_lock();
> +	e = kmalloc(sizeof(struct elem), GFP_ATOMIC);
> +	rcu_read_unlock();
> +	if (!e)
> +		goto out;
> +
> +	kfree_rcu(e, rcu);
> +out:
> +	__this_cpu_dec(prog_active);
> +	preempt_enable_no_resched_notrace();
> +	return e;
> +}
> +
> void trace_preempt_on(unsigned long a0, unsigned long a1)
> {
> +	void * buf = 0;
> +	static int cnt = 0;
> +	if (cnt++ > 3000000)
> +		buf = test_alloc();
> 	if (preempt_trace() && !irq_trace())
> 		stop_critical_timing(a0, a1);
> }