Hi,
On tip:master, while turning on lock events with perf through
perf lock record
I get random kernel crashes, sometimes weird unaligned
accesses, sometimes the scheduler complaining.
I hope someone has an idea about this.
In three different attempts (I had to force a reboot each time), I got:
First (task_tgid_nr_ns() is called from perf_event_pid()):
[ 565.464201] Kernel unaligned access at TPC[486b74] task_tgid_nr_ns+0x8/0x54
[ 565.475801] sun4v_data_access_exception: ADDR[000060f8b13a2004] CTX[0000] TYPE[0009], going.
[ 565.488610] \|/ ____ \|/
1>[ 565.492705] Unable to handle kernel NULL pointer dereference
[ 565.492719] Unable to handle kernel NULL pointer dereference
[ 565.492733] Unable to handle kernel NULL pointer dereference
[ 565.492747] Unable to handle kernel NULL pointer dereference
[ 565.492761] Unable to handle kernel NULL pointer dereference
[ 565.492776] Unable to handle kernel NULL pointer dereference
[ 565.492790] Unable to handle kernel NULL pointer dereference
[ 565.492804] Unable to handle kernel NULL pointer dereference
[ 565.492818] Unable to handle kernel NULL pointer dereference
[ 565.492832] Unable to handle kernel NULL pointer dereference
[ 565.492847] Unable to handle kernel NULL pointer dereference
Second:
[ 250.508047] Kernel unaligned access at TPC[4d1a0c] perf_swevent_ctx_event+0x16c/0x1b0
(this one happened in asm/processor_64.h: prefetch(), probably while
walking the context's event list)
Third:
[ 60.147895] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
[ 130.637924] sun4v_data_access_exception: ADDR[ffffa010933a6060] CTX[0000] TYPE[0009], going.
[ 130.639329] kernel BUG at kernel/sched.c:1166!
[ 130.639364] \|/ ____ \|/
[ 130.639370] "@'/ .. \`@"
[ 130.639377] /_| \__/ |_\
[ 130.639382] \__U_/
[ 130.639394] swapper(0): Kernel bad sw trap 5 [#1]
[ 130.639430] TSTATE: 0000000080e01605 TPC: 000000000045a02c TNPC: 000000000045a030 Y: 00000000 Tainted: G W
[ 130.639462] TPC: <resched_task+0x44/0xa8>
[ 130.639475] g0: fffff803f685d980 g1: 0000000000000000 g2: 0000000000000027 g3: 0000000000007d0d
[ 130.639493] g4: fffff803f685d980 g5: fffff800160a0000 g6: fffff803f6864000 g7: 0000000000b54c00
[ 130.639511] o0: 0000000000828580 o1: 000000000000048e o2: 0000000000000000 o3: 0000000000000000
[ 130.639528] o4: 0000000000000002 o5: 0000000000000001 sp: fffff803ff932ff1 ret_pc: 000000000045a024
[ 130.639548] RPC: <resched_task+0x3c/0xa8>
[ 130.639561] l0: 0000000000000000 l1: 000000000000000e l2: ffffffffffffffff l3: 0000000000000104
[ 130.639580] l4: fffff803f685d980 l5: 0006000000000000 l6: 000000000000000e l7: 00000000008a4180
[ 130.639597] i0: fffff803f62c6960 i1: fffff803f7fa18e0 i2: 0000000000000001 i3: 0000000000000000
[ 130.639616] i4: 0000000000b28340 i5: 0000000000b25cc0 i6: fffff803ff9330b1 i7: 0000000000461bc8
[ 130.639644] I7: <check_preempt_wakeup+0x148/0x1f8>
[ 130.639655] Instruction DUMP: 9210248e 7fff3f4d 90122180 <91d02005> c25a6008 82086008 0ac84014 92026008 4005c092
[ 130.639703] Kernel panic - not syncing: Aiee, killing interrupt handler!
[ 130.639715] Call Trace:
[ 130.639742] [000000000073795c] panic+0x58/0x124
[ 130.639763] [000000000046fb80] do_exit+0x64/0x770
[ 130.639782] [0000000000427d4c] die_if_kernel+0x264/0x290
[ 130.639801] [000000000042a138] bad_trap+0x78/0xe8
[ 130.639824] [00000000004220b0] tl0_resv104+0x30/0xa0
[ 130.639841] [000000000045a02c] resched_task+0x44/0xa8
[ 130.639861] [0000000000461bc8] check_preempt_wakeup+0x148/0x1f8
[ 130.639883] [0000000000466554] try_to_wake_up+0x484/0x570
[ 130.639902] [000000000046668c] wake_up_process+0xc/0x20
[ 130.639920] [00000000004678b4] load_balance+0xfb4/0x10f0
[ 130.639940] [0000000000467b60] rebalance_domains+0x170/0x204
[ 130.639960] [0000000000467c30] run_rebalance_domains+0x3c/0x100
[ 130.639985] [00000000004734a4] __do_softirq+0x1b8/0x378
[ 130.640004] [000000000042a354] do_softirq+0x8c/0xcc
[ 130.640021] [0000000000472ec0] irq_exit+0x68/0xd0
[ 130.640044] [000000000042f0f8] timer_interrupt+0xb8/0xec
On Sun, Apr 04, 2010 at 02:18:54PM +0200, Frederic Weisbecker wrote:
> Hi,
>
> On tip:master, while turning on lock events with perf through
>
> perf lock record
>
> I get random kernel crashes, sometimes about weird unaligned
> accesses, sometimes about scheduler that complains.
>
> I hope someone has an idea about this.
>
> In three different attempts (got to force reboot each time), I got:
>
> First (task_tgid_nr_ns() is called from perf_event_pid()):
>
> [ 565.464201] Kernel unaligned access at TPC[486b74] task_tgid_nr_ns+0x8/0x54
> [ 565.475801] sun4v_data_access_exception: ADDR[000060f8b13a2004] CTX[0000] TYPE[0009], going.
> [ 565.488610] \|/ ____ \|/
> 1>[ 565.492705] Unable to handle kernel NULL pointer dereference
> [ 565.492719] Unable to handle kernel NULL pointer dereference
> [ 565.492733] Unable to handle kernel NULL pointer dereference
> [ 565.492747] Unable to handle kernel NULL pointer dereference
> [ 565.492761] Unable to handle kernel NULL pointer dereference
> [ 565.492776] Unable to handle kernel NULL pointer dereference
> [ 565.492790] Unable to handle kernel NULL pointer dereference
> [ 565.492804] Unable to handle kernel NULL pointer dereference
> [ 565.492818] Unable to handle kernel NULL pointer dereference
> [ 565.492832] Unable to handle kernel NULL pointer dereference
> [ 565.492847] Unable to handle kernel NULL pointer dereference
>
> Second:
>
> [ 250.508047] Kernel unaligned access at TPC[4d1a0c] perf_swevent_ctx_event+0x16c/0x1b0
>
> (this one happened in asm/processor_64.h: prefetch(), probably while
> walking to the context's event list)
>
> Third:
>
> [ 60.147895] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
> [ 130.637924] sun4v_data_access_exception: ADDR[ffffa010933a6060] CTX[0000] TYPE[0009], going.
> [ 130.639329] kernel BUG at kernel/sched.c:1166!
> [ 130.639364] \|/ ____ \|/
> [ 130.639370] "@'/ .. \`@"
> [ 130.639377] /_| \__/ |_\
> [ 130.639382] \__U_/
> [ 130.639394] swapper(0): Kernel bad sw trap 5 [#1]
> [ 130.639430] TSTATE: 0000000080e01605 TPC: 000000000045a02c TNPC: 000000000045a030 Y: 00000000 Tainted: G
> W
> [ 130.639462] TPC: <resched_task+0x44/0xa8>
> [ 130.639475] g0: fffff803f685d980 g1: 0000000000000000 g2: 0000000000000027 g3: 0000000000007d0d
> [ 130.639493] g4: fffff803f685d980 g5: fffff800160a0000 g6: fffff803f6864000 g7: 0000000000b54c00
> [ 130.639511] o0: 0000000000828580 o1: 000000000000048e o2: 0000000000000000 o3: 0000000000000000
> [ 130.639528] o4: 0000000000000002 o5: 0000000000000001 sp: fffff803ff932ff1 ret_pc: 000000000045a024
> [ 130.639548] RPC: <resched_task+0x3c/0xa8>
> [ 130.639561] l0: 0000000000000000 l1: 000000000000000e l2: ffffffffffffffff l3: 0000000000000104
> [ 130.639580] l4: fffff803f685d980 l5: 0006000000000000 l6: 000000000000000e l7: 00000000008a4180
> [ 130.639597] i0: fffff803f62c6960 i1: fffff803f7fa18e0 i2: 0000000000000001 i3: 0000000000000000
> [ 130.639616] i4: 0000000000b28340 i5: 0000000000b25cc0 i6: fffff803ff9330b1 i7: 0000000000461bc8
> [ 130.639644] I7: <check_preempt_wakeup+0x148/0x1f8>
> [ 130.639655] Instruction DUMP: 9210248e 7fff3f4d 90122180 <91d02005> c25a6008 82086008 0ac84014 92026008
> 4005c092
> [ 130.639703] Kernel panic - not syncing: Aiee, killing interrupt handler!
> [ 130.639715] Call Trace:
> [ 130.639742] [000000000073795c] panic+0x58/0x124
> [ 130.639763] [000000000046fb80] do_exit+0x64/0x770
> [ 130.639782] [0000000000427d4c] die_if_kernel+0x264/0x290
> [ 130.639801] [000000000042a138] bad_trap+0x78/0xe8
> [ 130.639824] [00000000004220b0] tl0_resv104+0x30/0xa0
> [ 130.639841] [000000000045a02c] resched_task+0x44/0xa8
> [ 130.639861] [0000000000461bc8] check_preempt_wakeup+0x148/0x1f8
> [ 130.639883] [0000000000466554] try_to_wake_up+0x484/0x570
> [ 130.639902] [000000000046668c] wake_up_process+0xc/0x20
> [ 130.639920] [00000000004678b4] load_balance+0xfb4/0x10f0
> [ 130.639940] [0000000000467b60] rebalance_domains+0x170/0x204
> [ 130.639960] [0000000000467c30] run_rebalance_domains+0x3c/0x100
> [ 130.639985] [00000000004734a4] __do_softirq+0x1b8/0x378
> [ 130.640004] [000000000042a354] do_softirq+0x8c/0xcc
> [ 130.640021] [0000000000472ec0] irq_exit+0x68/0xd0
> [ 130.640044] [000000000042f0f8] timer_interrupt+0xb8/0xec
>
Forgot the config, it's attached.
Thanks.
From: Frederic Weisbecker <[email protected]>
Date: Sun, 4 Apr 2010 14:21:19 +0200
> Forgot the config, it's in attachment.
I get them too even when just trying to use the function tracer in some
simple way. They occur in random places and eventually the entire
machine wedges and is unusable.
I suspect there is some bug that ends up corrupting memory, so I
started trying to debug this last night when I first saw it.
I'll let you know if I make any progress.
On Sun, Apr 04, 2010 at 06:00:57PM -0700, David Miller wrote:
> From: Frederic Weisbecker <[email protected]>
> Date: Sun, 4 Apr 2010 14:21:19 +0200
>
> > Forgot the config, it's in attachment.
>
> I get them too when even trying to use the function tracer in some
> simple way. They occur in random places and eventually the entire
> machine wedges and is unusuable.
Yeah. But it's not totally random. I often see the same scheduler
crash.
It seems to happen after an unaligned access fixup.
>
> I suspect there is some bug that ends up corrupting memory, so I
> started trying to debug this last night when I first saw it.
Ok. I indeed have the same problem with the function tracer when
it is running.
> I'll let you know if I make any progress.
Thanks!
From: Frederic Weisbecker <[email protected]>
Date: Mon, 5 Apr 2010 08:57:04 +0200
> It seems to happen after an unaligned access fixup.
You shouldn't be getting the unaligned fixup in the first place,
especially in the locations where you see them. I suspect that
once you see the first fixup, all of the registers in the cpu have
been corrupted in one way or another.
I suspect something fundamental gets corrupted, for example the
current register window (%cwp) is corrupted and that screws up all of
the registers so every single function starts accessing garbage.
My suspicions lie in three places: the ftrace mcount()
stubs, stack_trace_flush(), and the new perf_arch_save_caller_regs(),
since those are the three places, offhand, that could
potentially make us return to a function in the wrong
register window.
On Mon, Apr 05, 2010 at 12:22:33PM -0700, David Miller wrote:
> From: Frederic Weisbecker <[email protected]>
> Date: Mon, 5 Apr 2010 08:57:04 +0200
>
> > It seems to happen after an unaligned access fixup.
>
> You shouldn't be getting the unaligned fixup in the first place,
> especially in the locations where you see them. I suspect that
> once you see first fixup, all of the registers in the cpu have
> been corrupted in one way or another.
>
> I suspect something fundamental gets corrupted, for example the
> current register window (%cwp) is corrupted and that screws up all of
> the registers so every single function starts accessing garbage.
>
> My suspicions lie in three places, the ftrace mcount()
> stubs, stack_trace_flush(), or the new perf_arch_save_caller_regs()
> since those are the three places offhand that could make
> us potentially make us return to function in the wrong
> register window.
It happens without CONFIG_FUNCTION_TRACER as well (but it happens
when the function tracer runs). And I didn't have your perf_arch_save_caller_regs()
when I triggered this.
From: Frederic Weisbecker <[email protected]>
Date: Mon, 5 Apr 2010 21:40:58 +0200
> It happens without CONFIG_FUNCTION_TRACER as well (but it happens
> when the function tracer runs). And I hadn't your perf_arch_save_caller_regs()
> when I triggered this.
Good clues, thanks.
From: Frederic Weisbecker <[email protected]>
Date: Mon, 5 Apr 2010 21:40:58 +0200
> It happens without CONFIG_FUNCTION_TRACER as well (but it happens
> when the function tracer runs). And I hadn't your
> perf_arch_save_caller_regs() when I triggered this.
I think there's still something wrong with the ring buffer stuff on
architectures like sparc64.
Steven, I'm looking at the 8-byte alignment fix that was made a few
weeks ago, commit:
commit 2271048d1b3b0aabf83d25b29c20646dcabedc05
Author: Steven Rostedt <[email protected]>
Date: Thu Mar 18 17:54:19 2010 -0400
ring-buffer: Do 8 byte alignment for 64 bit that can not handle 4 byte align
and I'm not so sure it's completely correct.
Originally, the ring buffer entries determine where the entry data
resides (either &event->array[0] or &event->array[1]) based upon the
length.
Beforehand, in all cases:
1) If length could be encoded into event->type_len (ie. <=
RB_MAX_SMALL_DATA) then event->type_len holds the length
and the event data starts at &event->array[0]
2) Otherwise (length > RB_MAX_SMALL_DATA) the length is
encoded into event->array[0] and the event data starts at
&event->array[1]
But now, there is a new semantic when CONFIG_64BIT is true and
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is false. (That isn't the right
test btw: f.e. sparc 32-bit needs this handling just like sparc 64-bit
does, since it uses full 64-bit loads and stores to access u64 objects
and thus will crash without proper alignment. The correct test should
be just CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS being false.)
This new semantic is:
1) Entries always encode the length in ->array[0] and ->type_len
is set to zero.
And then there are special cases like events of type
RINGBUF_TYPE_PADDING which, even though ->type_len is non-zero, encode
a length field in ->array[0] which is used by the ring buffer
iterators such as rb_event_length(), but this applies only if
event->time_delta is non-zero. (Phew!)
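To make the data placement concrete, here is how an iterator ends up
locating the payload as I read it (a sketch based on the description
above and the struct layout in include/linux/ring_buffer.h, not a
patch):

struct ring_buffer_event {
	u32	type_len:5, time_delta:27;
	u32	array[];
};

/* Sketch only: where the payload lives under the rules above. */
static void *rb_event_data_sketch(struct ring_buffer_event *event)
{
	/* Small length encoded directly in ->type_len:
	 * the data starts at ->array[0]. */
	if (event->type_len)
		return &event->array[0];

	/* ->type_len == 0 (including the forced-8-byte-alignment case):
	 * the length is in ->array[0], the data starts at ->array[1]. */
	return &event->array[1];
}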
The commit adjusts the code in rb_calculate_event_length() to force 8
byte chunks when RB_FORCE_8BYTE_ALIGNMENT is set. It also adjusts
the rb_update_event() logic so that it unconditionally uses
event->array[0] for the length on such platforms.
However I don't see any logic added to ring_buffer_event_length()
to handle this forcing. That alone can't explain the crashes
Frederic and I are seeing, since only oprofile seems to use that
helper function, but I can just imagine there might be other
subtle bugs lingering after the above commit.
Anyways, that's just the initial potential problem I've discovered.
I'll start auditing the rest of this code.
I wonder if there's a simpler way to implement this alignment fix such
that we don't have to constantly make sure scores of locations in
ring_buffer.c get this magic exception case correct.
We should probably also BUILD_BUG_ON() if BUF_PAGE_HDR_SIZE is not
a multiple of the necessary alignment, since the ring buffer
entries start at the end of that.
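Something like this, dropped into one of the ring buffer setup paths,
would catch it at build time (illustrative only; RB_ARCH_ALIGNMENT here
just stands in for whatever the required-alignment macro ends up being
called):

	BUILD_BUG_ON(BUF_PAGE_HDR_SIZE % RB_ARCH_ALIGNMENT != 0);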
Also I noticed (painfully :-) that 2.6.33 needs a backport of this
alignment fix too, so we should submit it to -stable (once we sift
out all the bugs of course).
From: Frederic Weisbecker <[email protected]>
Date: Mon, 5 Apr 2010 21:40:58 +0200
> It happens without CONFIG_FUNCTION_TRACER as well (but it happens
> when the function tracer runs). And I hadn't your
> perf_arch_save_caller_regs() when I triggered this.
I figured out the problem, it's NMIs. As soon as I disable all of the
NMI watchdog code, the problem goes away.
This is because some parts of the NMI interrupt handling path are not
marked with "notrace" and the various tracer code paths use
local_irq_disable() (either directly or indirectly) which doesn't work
with sparc64's NMI scheme. These essentially turn NMIs back on in the
NMI handler before the NMI condition has been cleared, and thus we can
re-enter with another NMI interrupt.
We went through this for perf events, and we just made sure that
local_irq_{enable,disable}() never occurs in any of the code paths in
perf events that can be reached via the NMI interrupt handler. (the
only one we had was sched_clock() and that was easily fixed)
So, the first mcount hit we get is for rcu_nmi_enter() via
nmi_enter().
I can see two ways to handle this:
1) Pepper 'notrace' markers onto rcu_nmi_enter(), rcu_nmi_exit()
and whatever else I can see getting hit in the NMI interrupt
handler code paths.
2) Add a hack to __raw_local_irq_save() that keeps it from writing
anything to the interrupt level register if we have NMIs disabled.
(this puts the cost on the entire kernel instead of just the NMI
paths).
#1 seems to be the intent on other platforms; the majority of the NMI
code paths are protected with 'notrace' on x86. I bet nobody noticed
that nmi_enter(), when CONFIG_NO_HZ && !CONFIG_TINY_RCU, ends up calling
a function that does tracing.
The next one we'll hit is atomic_notifier_call_chain() (amusingly
notify_die() is marked 'notrace' but the one thing it calls isn't)
For example, the following are the generic notrace annotations I
would need to get sparc64 ftrace functioning again. (Frederic I will
send you the full patch with the sparc-specific bits under separate
cover so that you can test things...)
--------------------
kernel: Add notrace annotations to common routines invoked via NMI.
This includes the atomic notifier call chain as well as the RCU
specific NMI enter/exit handlers.
Signed-off-by: David S. Miller <[email protected]>
diff --git a/kernel/notifier.c b/kernel/notifier.c
index 2488ba7..ceae89a 100644
--- a/kernel/notifier.c
+++ b/kernel/notifier.c
@@ -71,9 +71,9 @@ static int notifier_chain_unregister(struct notifier_block **nl,
* @returns: notifier_call_chain returns the value returned by the
* last notifier function called.
*/
-static int __kprobes notifier_call_chain(struct notifier_block **nl,
- unsigned long val, void *v,
- int nr_to_call, int *nr_calls)
+static int notrace __kprobes notifier_call_chain(struct notifier_block **nl,
+ unsigned long val, void *v,
+ int nr_to_call, int *nr_calls)
{
int ret = NOTIFY_DONE;
struct notifier_block *nb, *next_nb;
@@ -172,9 +172,9 @@ EXPORT_SYMBOL_GPL(atomic_notifier_chain_unregister);
* Otherwise the return value is the return value
* of the last notifier function called.
*/
-int __kprobes __atomic_notifier_call_chain(struct atomic_notifier_head *nh,
- unsigned long val, void *v,
- int nr_to_call, int *nr_calls)
+int notrace __kprobes __atomic_notifier_call_chain(struct atomic_notifier_head *nh,
+ unsigned long val, void *v,
+ int nr_to_call, int *nr_calls)
{
int ret;
@@ -185,8 +185,8 @@ int __kprobes __atomic_notifier_call_chain(struct atomic_notifier_head *nh,
}
EXPORT_SYMBOL_GPL(__atomic_notifier_call_chain);
-int __kprobes atomic_notifier_call_chain(struct atomic_notifier_head *nh,
- unsigned long val, void *v)
+int notrace __kprobes atomic_notifier_call_chain(struct atomic_notifier_head *nh,
+ unsigned long val, void *v)
{
return __atomic_notifier_call_chain(nh, val, v, -1, NULL);
}
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 3ec8160..d1a44ab 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -286,7 +286,7 @@ void rcu_exit_nohz(void)
* irq handler running, this updates rdtp->dynticks_nmi to let the
* RCU grace-period handling know that the CPU is active.
*/
-void rcu_nmi_enter(void)
+void notrace rcu_nmi_enter(void)
{
struct rcu_dynticks *rdtp = &__get_cpu_var(rcu_dynticks);
@@ -304,7 +304,7 @@ void rcu_nmi_enter(void)
* irq handler running, this updates rdtp->dynticks_nmi to let the
* RCU grace-period handling know that the CPU is no longer active.
*/
-void rcu_nmi_exit(void)
+void notrace rcu_nmi_exit(void)
{
struct rcu_dynticks *rdtp = &__get_cpu_var(rcu_dynticks);
On Tue, Apr 06, 2010 at 02:50:49AM -0700, David Miller wrote:
> From: Frederic Weisbecker <[email protected]>
> Date: Mon, 5 Apr 2010 21:40:58 +0200
>
> > It happens without CONFIG_FUNCTION_TRACER as well (but it happens
> > when the function tracer runs). And I hadn't your
> > perf_arch_save_caller_regs() when I triggered this.
>
> I figured out the problem, it's NMIs. As soon as I disable all of the
> NMI watchdog code, the problem goes away.
>
> This is because some parts of the NMI interrupt handling path are not
> marked with "notrace" and the various tracer code paths use
> local_irq_disable() (either directly or indirectly) which doesn't work
> with sparc64's NMI scheme. These essentially turn NMIs back on in the
> NMI handler before the NMI condition has been cleared, and thus we can
> re-enter with another NMI interrupt.
>
> We went through this for perf events, and we just made sure that
> local_irq_{enable,disable}() never occurs in any of the code paths in
> perf events that can be reached via the NMI interrupt handler. (the
> only one we had was sched_clock() and that was easily fixed)
>
> So, the first mcount hit we get is for rcu_nmi_enter() via
> nmi_enter().
>
> I can see two ways to handle this:
>
> 1) Pepper 'notrace' markers onto rcu_nmi_enter(), rcu_nmi_exit()
> and whatever else I can see getting hit in the NMI interrupt
> handler code paths.
>
> 2) Add a hack to __raw_local_irq_save() that keeps it from writing
> anything to the interrupt level register if we have NMI's disabled.
> (this puts the cost on the entire kernel instead of just the NMI
> paths).
>
> #1 seems to be the intent on other platforms, the majority of the NMI
> code paths are protected with 'notrace' on x86, I bet nobody noticed
> that nmi_enter() when CONFIG_NO_HZ && !CONFIG_TINY_RCU ends up calling
> a function that does tracing.
>
> The next one we'll hit is atomic_notifier_call_chain() (amusingly
> notify_die() is marked 'notrace' but the one thing it calls isn't)
>
> For example, the following are the generic notrace annotations I
> would need to get sparc64 ftrace functioning again. (Frederic I will
> send you the full patch with the sparc specific bits under seperate
> cover in so that you can test things...)
>
> --------------------
> kernel: Add notrace annotations to common routines invoked via NMI.
>
> This includes the atomic notifier call chain as well as the RCU
> specific NMI enter/exit handlers.
Ok, but this looks weird as a cause.
The function tracer handler disables interrupts. I don't remember exactly
why but we also have a no-preempt mode that only disables preemption instead:
(function_trace_call_preempt_only())
It means having such interrupt reentrancy is not a problem. In fact, the
function tracer is not reentrant:
data = tr->data[cpu];
disabled = atomic_inc_return(&data->disabled);
if (likely(disabled == 1))
        trace_function(tr, ip, parent_ip, flags, pc);
atomic_dec(&data->disabled);
We do this just to prevent tracing recursion (in case we have
a traceable function in the inner function tracing path).
NMIs are just supposed to be fine with the function tracer.
From: Frederic Weisbecker <[email protected]>
Date: Tue, 6 Apr 2010 12:19:28 +0200
> It means having such interrupt reentrancy is not a problem.
It's not reentrancy.
It's the fact that local_irq_disable() (read it again, it's the
"disable" that re-enables NMIs on sparc64) turns NMIs back on even in
code where we are still trying to figure out how to service the NMI.
It's because we implement NMIs on sparc64 by having the performance
counter interrupt come in on the level 15 interrupt, and we run the
entire kernel at level 14 when IRQs are "disabled".
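In other words, very roughly (conceptual sketch only, not the real
arch/sparc code; write_pil() is a made-up helper standing in for the
wrpr instruction):

#define PIL_NORMAL_MAX	14	/* "IRQs disabled" still lets level 15 in */
#define PIL_NMI		15	/* the perf counter / NMI interrupt level  */

static inline void local_irq_disable_sketch(void)
{
	/* Writing 14 to %pil from inside the level-15 handler drops us
	 * back below the NMI level, so the still-pending NMI can fire
	 * again before the first one has been serviced. */
	write_pil(PIL_NORMAL_MAX);
}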
On Tue, 2010-04-06 at 02:50 -0700, David Miller wrote:
> From: Frederic Weisbecker <[email protected]>
> Date: Mon, 5 Apr 2010 21:40:58 +0200
>
> > It happens without CONFIG_FUNCTION_TRACER as well (but it happens
> > when the function tracer runs). And I hadn't your
> > perf_arch_save_caller_regs() when I triggered this.
>
> I figured out the problem, it's NMIs. As soon as I disable all of the
> NMI watchdog code, the problem goes away.
>
> This is because some parts of the NMI interrupt handling path are not
> marked with "notrace" and the various tracer code paths use
> local_irq_disable() (either directly or indirectly) which doesn't work
> with sparc64's NMI scheme. These essentially turn NMIs back on in the
> NMI handler before the NMI condition has been cleared, and thus we can
> re-enter with another NMI interrupt.
>
> We went through this for perf events, and we just made sure that
> local_irq_{enable,disable}() never occurs in any of the code paths in
> perf events that can be reached via the NMI interrupt handler. (the
> only one we had was sched_clock() and that was easily fixed)
One thing we can do is make the code WARN about this; how about
something like the below?
---
Subject: lockdep: WARN about local_irq_{en,dis}able in NMI context
Some architectures implement NMIs by using IRQ priority levels and
mixing local_irq_{en,dis}able() with NMIs will give unexpected results.
Hence disallow this in general and WARN about it.
Signed-off-by: Peter Zijlstra <[email protected]>
---
kernel/lockdep.c | 10 ++++++++++
1 files changed, 10 insertions(+), 0 deletions(-)
diff --git a/kernel/lockdep.c b/kernel/lockdep.c
index 08e6f76..06ec1c7 100644
--- a/kernel/lockdep.c
+++ b/kernel/lockdep.c
@@ -2341,6 +2341,11 @@ EXPORT_SYMBOL(trace_hardirqs_on_caller);
void trace_hardirqs_on(void)
{
+ /*
+ * Some architectures can't deal with local_irq_{enable,disable}
+ * from NMI context (SPARC), enforce this.
+ */
+ WARN_ON_ONCE(in_nmi());
trace_hardirqs_on_caller(CALLER_ADDR0);
}
EXPORT_SYMBOL(trace_hardirqs_on);
@@ -2375,6 +2380,11 @@ EXPORT_SYMBOL(trace_hardirqs_off_caller);
void trace_hardirqs_off(void)
{
+ /*
+ * Some architectures can't deal with local_irq_{enable,disable}
+ * from NMI context (SPARC), enforce this.
+ */
+ WARN_ON_ONCE(in_nmi());
trace_hardirqs_off_caller(CALLER_ADDR0);
}
EXPORT_SYMBOL(trace_hardirqs_off);
From: Peter Zijlstra <[email protected]>
Date: Tue, 06 Apr 2010 13:12:01 +0200
> One thing we can do is make the code WARN about this, how about
> something like the below
It's going to warn on every bootup, in cpu_clock() on x86.
On Tue, 2010-04-06 at 04:13 -0700, David Miller wrote:
> From: Peter Zijlstra <[email protected]>
> Date: Tue, 06 Apr 2010 13:12:01 +0200
>
> > One thing we can do is make the code WARN about this, how about
> > something like the below
>
> It's going to warn every bootup in cpu_clock() on x86.
*sigh*, yes, we could hack around that I suppose.. it would be nice to
automate this check though, I bet you don't fancy tracking down more
such splats than you have to.
You could of course insert the debug code into your arch routines but
that would limit the coverage checks to whatever you happen to run.
From: Peter Zijlstra <[email protected]>
Date: Tue, 06 Apr 2010 13:20:19 +0200
> On Tue, 2010-04-06 at 04:13 -0700, David Miller wrote:
>> From: Peter Zijlstra <[email protected]>
>> Date: Tue, 06 Apr 2010 13:12:01 +0200
>>
>> > One thing we can do is make the code WARN about this, how about
>> > something like the below
>>
>> It's going to warn every bootup in cpu_clock() on x86.
>
> *sigh*, yes, we could hack around that I suppose.. it would be nice to
> automate this check though, I bet you don't fancy tracking down more
> such splats than you have to.
>
> You could of course insert the debug code into your arch routines but
> that would limit the coverage checks to whatever you happen to run.
Yes, f.e. you could add local_irq_*_nmi() or similar that don't
complain when called inside of an NMI.
I would certainly welcome this debugging facility, for sure!
It would have saved me two days of work this time, in fact.
On Tue, Apr 06, 2010 at 02:50:49AM -0700, David Miller wrote:
> From: Frederic Weisbecker <[email protected]>
> Date: Mon, 5 Apr 2010 21:40:58 +0200
>
> > It happens without CONFIG_FUNCTION_TRACER as well (but it happens
> > when the function tracer runs). And I hadn't your
> > perf_arch_save_caller_regs() when I triggered this.
>
> I figured out the problem, it's NMIs. As soon as I disable all of the
> NMI watchdog code, the problem goes away.
>
> This is because some parts of the NMI interrupt handling path are not
> marked with "notrace" and the various tracer code paths use
> local_irq_disable() (either directly or indirectly) which doesn't work
> with sparc64's NMI scheme. These essentially turn NMIs back on in the
> NMI handler before the NMI condition has been cleared, and thus we can
> re-enter with another NMI interrupt.
>
> We went through this for perf events, and we just made sure that
> local_irq_{enable,disable}() never occurs in any of the code paths in
> perf events that can be reached via the NMI interrupt handler. (the
> only one we had was sched_clock() and that was easily fixed)
That reminds me we have a new pair of local_irq_disable/enable
in perf_event_task_output(), a path that can be taken by hardware
pmu events.
See this patch:
8bb39f9aa068262732fe44b965d7a6eb5a5a7d67
perf: Fix 'perf sched record' deadlock
On Tue, 2010-04-06 at 13:38 +0200, Frederic Weisbecker wrote:
> On Tue, Apr 06, 2010 at 02:50:49AM -0700, David Miller wrote:
> > From: Frederic Weisbecker <[email protected]>
> > Date: Mon, 5 Apr 2010 21:40:58 +0200
> >
> > > It happens without CONFIG_FUNCTION_TRACER as well (but it happens
> > > when the function tracer runs). And I hadn't your
> > > perf_arch_save_caller_regs() when I triggered this.
> >
> > I figured out the problem, it's NMIs. As soon as I disable all of the
> > NMI watchdog code, the problem goes away.
> >
> > This is because some parts of the NMI interrupt handling path are not
> > marked with "notrace" and the various tracer code paths use
> > local_irq_disable() (either directly or indirectly) which doesn't work
> > with sparc64's NMI scheme. These essentially turn NMIs back on in the
> > NMI handler before the NMI condition has been cleared, and thus we can
> > re-enter with another NMI interrupt.
> >
> > We went through this for perf events, and we just made sure that
> > local_irq_{enable,disable}() never occurs in any of the code paths in
> > perf events that can be reached via the NMI interrupt handler. (the
> > only one we had was sched_clock() and that was easily fixed)
>
>
>
> That reminds me we have a new pair of local_irq_disable/enable
> in perf_event_task_output(), which path can be taken by hardware
> pmu events.
>
> See this patch:
>
> 8bb39f9aa068262732fe44b965d7a6eb5a5a7d67
> perf: Fix 'perf sched record' deadlock
ARGH.. yes
Also, I guess that should live in perf_output_lock/unlock() not in
perf_event_task_output().
Egads, how to fix that
On Tue, 2010-04-06 at 13:51 +0200, Peter Zijlstra wrote:
> On Tue, 2010-04-06 at 13:38 +0200, Frederic Weisbecker wrote:
> > On Tue, Apr 06, 2010 at 02:50:49AM -0700, David Miller wrote:
> > > From: Frederic Weisbecker <[email protected]>
> > > Date: Mon, 5 Apr 2010 21:40:58 +0200
> > >
> > > > It happens without CONFIG_FUNCTION_TRACER as well (but it happens
> > > > when the function tracer runs). And I hadn't your
> > > > perf_arch_save_caller_regs() when I triggered this.
> > >
> > > I figured out the problem, it's NMIs. As soon as I disable all of the
> > > NMI watchdog code, the problem goes away.
> > >
> > > This is because some parts of the NMI interrupt handling path are not
> > > marked with "notrace" and the various tracer code paths use
> > > local_irq_disable() (either directly or indirectly) which doesn't work
> > > with sparc64's NMI scheme. These essentially turn NMIs back on in the
> > > NMI handler before the NMI condition has been cleared, and thus we can
> > > re-enter with another NMI interrupt.
> > >
> > > We went through this for perf events, and we just made sure that
> > > local_irq_{enable,disable}() never occurs in any of the code paths in
> > > perf events that can be reached via the NMI interrupt handler. (the
> > > only one we had was sched_clock() and that was easily fixed)
> >
> >
> >
> > That reminds me we have a new pair of local_irq_disable/enable
> > in perf_event_task_output(), which path can be taken by hardware
> > pmu events.
> >
> > See this patch:
> >
> > 8bb39f9aa068262732fe44b965d7a6eb5a5a7d67
> > perf: Fix 'perf sched record' deadlock
>
> ARGH.. yes
>
> Also, I guess that should live in perf_output_lock/unlock() not in
> perf_event_task_output().
>
> Egads, how to fix that
Damn, so the deadlock fix isn't a fix. No idea.
-Mike
On Tue, 2010-04-06 at 14:54 +0200, Mike Galbraith wrote:
> > ARGH.. yes
> >
> > Also, I guess that should live in perf_output_lock/unlock() not in
> > perf_event_task_output().
> >
> > Egads, how to fix that
>
> Damn, so deadlock fix isn't a fix. No idea.
Well it is, but it breaks sparc..
I'm currently compile-testing a bunch of stuff to fix all that up by doing what
davem suggested: local_irq_save_nmi()/local_irq_restore_nmi().
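Roughly, the generic side would be nothing more than a fallback to the
existing helpers, with sparc64 providing NMI-safe versions of its own
(sketch of the idea only, the actual patch may well end up different):

#ifndef local_irq_save_nmi
# define local_irq_save_nmi(flags)	local_irq_save(flags)
# define local_irq_restore_nmi(flags)	local_irq_restore(flags)
#endif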
David,
It's best to send to my [email protected] account, just like it is
best to send to your [email protected] ;-)
On Mon, 2010-04-05 at 19:15 -0700, David Miller wrote:
> From: Frederic Weisbecker <[email protected]>
> Date: Mon, 5 Apr 2010 21:40:58 +0200
>
> > It happens without CONFIG_FUNCTION_TRACER as well (but it happens
> > when the function tracer runs). And I hadn't your
> > perf_arch_save_caller_regs() when I triggered this.
>
> I think there's still something wrong with the ring buffer stuff on
> architectures like sparc64.
>
> Stephen, I'm looking at the 8-byte alignment fix that was made a few
> weeks ago, commit:
>
> commit 2271048d1b3b0aabf83d25b29c20646dcabedc05
> Author: Steven Rostedt <[email protected]>
> Date: Thu Mar 18 17:54:19 2010 -0400
>
> ring-buffer: Do 8 byte alignment for 64 bit that can not handle 4 byte align
>
> and I'm not so sure it's completely correct.
>
> Originally, the ring buffer entries determine where the entry data
> resides (either &event->array[0] or &event->array[1]) based upon the
> length.
>
> Beforehand, in all cases:
>
> 1) If length could be encoded into event->type_len (ie. <=
> RB_MAX_SMALL_DATA) then event->type_len holds the length
> and the event data starts at &event->array[0]
>
> 2) Otherwise (length > RB_MAX_SMALL_DATA) the length is
> encoded into event->array[0] and the event data starts at
> &event->array[1]
>
> But now, there is a new semantic when CONFIG_64BIT is true and
> CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is false (which isn't the right
> test btw, f.e. sparc 32-bit needs this handling just like sparc 64-bit
> does since it uses full 64-bit loads and stores to access u64 objects
> and thus will crash without proper alignment, the correct test should
> be just CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS being false).
OK, so a 64 bit word still needs 64 bit alignment when storing to a
data pointer.
I wonder if we should just have a special copy in this case for the
events and remove this patch in the ring buffer. That is:
__assign_word(__entry->word, value);
And have in !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS be:
#define __assign_word(dest, src) \
	memcpy(&(dest), &(src), sizeof(src))
This would fix it for all.
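For the record, the full pair would look something like this (sketch
only; __assign_word() is just the name proposed here, it doesn't exist
yet):

#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
# define __assign_word(dest, src)	((dest) = (src))
#else
/* Platforms that trap on unaligned 64-bit stores: copy byte-wise. */
# define __assign_word(dest, src)	memcpy(&(dest), &(src), sizeof(src))
#endif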
>
> This new semantic is:
>
> 1) Entries always encode the length in ->array[0] and ->type_len
> is set to zero.
>
> And then there are special cases like events of type
> RINGBUF_TYPE_PADDING which, even though ->type_len is non-zero, encode
> a length field in ->array[0] which is used by the ring buffer
> iterators such as rb_event_length(), but this only applies only if
> event->time_delta is non-zero. (Phew!)
The RINGBUF_TYPE_PADDING is used to either fill the rest of the sub
buffer, where alignment does not matter, or to replace a deleted event
that had another event after it, which should already be aligned.
>
> The commit adjusts the code in rb_calculate_event_length() to force 8
> byte chunks when RB_FORCE_8BYTE_ALIGNMENT is set. It also adjusted
> the rb_update_event() logic so that it unconditionally uses
> event->array[0] for the length on such platforms.
>
> However I don't see any logic added to ring_buffer_event_length()
> to handle this forcing. That alone can't explain the crashes
> Frederic and I are seeing, since only oprofile seems to use that
> helper function, but I can just imagine there might be other
> subtle bugs linering after the above commit.
>
> Anyways, that's just the inital potential problem I've discovered.
> I'll start auditing the rest of this code.
>
> I wonder if there's a simpler way to implement this alignment fix such
> that we don't have to constantly make sure scores of locations in
> ring_buffer.c get this magic exception case correct.
>
> We should probably also BUILD_BUG_ON() if BUF_PAGE_HDR_SIZE is not
> a multiple of the necessary alignment, since the ring buffer
> entries start at the end of that.
>
> Also I noticed (painfully :-) that 2.6.33 needs a backport of this
> alignment fix too, so we should submit it to -stable (once we sift
> out all the bugs of course).
What about removing the logic from the ring buffer and moving it to the
TRACE_EVENT() macros as I suggested above?
We would probably need a way to read the buffers too.
I also know that Mathieu has some strange tricks to force alignment but
I'm still not convinced it would make things any less fragile than what
is already there.
-- Steve
From: Steven Rostedt <[email protected]>
Date: Tue, 06 Apr 2010 09:41:04 -0400
> It's best to send to my [email protected] account, just like it is
> best to send to your [email protected] ;-)
Ok then you should, like me, use it in your commits :-)
> OK, so the a 64 bit word still needs 64 bit alignment when storing to a
> data pointer.
>
> I wonder if we should just have a special copy in this case for the
> events and remove this patch in the ring buffer. That is:
>
> __assign_word(__entry->word, value);
>
> And have in !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS be:
>
> #define __assgin_word(dest, src) \
> memcpy(&(dest), &(src), sizeof(src));
>
> This would fix it for all.
Doesn't work, first of all, because if gcc can see the types
it will emit inline aligned loads and stores.
Secondly, the accessors that read these entries need them to be
aligned too.
> What about removing the logic from the ring buffer and moving it to the
> TRACE_EVENT() macros as I suggested above?
No, that makes no sense, just align the data types properly
instead of making every single access site pay the price.
Just fix the CPP test to handle 32-bit platforms with slow
mis-alignment handling, everything else seems to be working properly
after I fixed the NMI issues on sparc64.
On Tue, Apr 06, 2010 at 02:50:49AM -0700, David Miller wrote:
> From: Frederic Weisbecker <[email protected]>
> Date: Mon, 5 Apr 2010 21:40:58 +0200
>
> > It happens without CONFIG_FUNCTION_TRACER as well (but it happens
> > when the function tracer runs). And I hadn't your
> > perf_arch_save_caller_regs() when I triggered this.
>
> I figured out the problem, it's NMIs. As soon as I disable all of the
> NMI watchdog code, the problem goes away.
>
> This is because some parts of the NMI interrupt handling path are not
> marked with "notrace" and the various tracer code paths use
> local_irq_disable() (either directly or indirectly) which doesn't work
> with sparc64's NMI scheme. These essentially turn NMIs back on in the
> NMI handler before the NMI condition has been cleared, and thus we can
> re-enter with another NMI interrupt.
>
> We went through this for perf events, and we just made sure that
> local_irq_{enable,disable}() never occurs in any of the code paths in
> perf events that can be reached via the NMI interrupt handler. (the
> only one we had was sched_clock() and that was easily fixed)
>
> So, the first mcount hit we get is for rcu_nmi_enter() via
> nmi_enter().
>
> I can see two ways to handle this:
>
> 1) Pepper 'notrace' markers onto rcu_nmi_enter(), rcu_nmi_exit()
> and whatever else I can see getting hit in the NMI interrupt
> handler code paths.
>
> 2) Add a hack to __raw_local_irq_save() that keeps it from writing
> anything to the interrupt level register if we have NMI's disabled.
> (this puts the cost on the entire kernel instead of just the NMI
> paths).
>
> #1 seems to be the intent on other platforms, the majority of the NMI
> code paths are protected with 'notrace' on x86, I bet nobody noticed
> that nmi_enter() when CONFIG_NO_HZ && !CONFIG_TINY_RCU ends up calling
> a function that does tracing.
>
> The next one we'll hit is atomic_notifier_call_chain() (amusingly
> notify_die() is marked 'notrace' but the one thing it calls isn't)
>
> For example, the following are the generic notrace annotations I
> would need to get sparc64 ftrace functioning again. (Frederic I will
> send you the full patch with the sparc specific bits under seperate
> cover in so that you can test things...)
>
> --------------------
> kernel: Add notrace annotations to common routines invoked via NMI.
>
> This includes the atomic notifier call chain as well as the RCU
> specific NMI enter/exit handlers.
Assuming that static inline functions don't need the notrace flag:
Reviewed-by: Paul E. McKenney <[email protected]>
(If static inline functions -do- need notrace, then the definitions
in include/linux/hardirq.h need to be tagged as well.)
> Signed-off-by: David S. Miller <[email protected]>
>
> diff --git a/kernel/notifier.c b/kernel/notifier.c
> index 2488ba7..ceae89a 100644
> --- a/kernel/notifier.c
> +++ b/kernel/notifier.c
> @@ -71,9 +71,9 @@ static int notifier_chain_unregister(struct notifier_block **nl,
> * @returns: notifier_call_chain returns the value returned by the
> * last notifier function called.
> */
> -static int __kprobes notifier_call_chain(struct notifier_block **nl,
> - unsigned long val, void *v,
> - int nr_to_call, int *nr_calls)
> +static int notrace __kprobes notifier_call_chain(struct notifier_block **nl,
> + unsigned long val, void *v,
> + int nr_to_call, int *nr_calls)
> {
> int ret = NOTIFY_DONE;
> struct notifier_block *nb, *next_nb;
> @@ -172,9 +172,9 @@ EXPORT_SYMBOL_GPL(atomic_notifier_chain_unregister);
> * Otherwise the return value is the return value
> * of the last notifier function called.
> */
> -int __kprobes __atomic_notifier_call_chain(struct atomic_notifier_head *nh,
> - unsigned long val, void *v,
> - int nr_to_call, int *nr_calls)
> +int notrace __kprobes __atomic_notifier_call_chain(struct atomic_notifier_head *nh,
> + unsigned long val, void *v,
> + int nr_to_call, int *nr_calls)
> {
> int ret;
>
> @@ -185,8 +185,8 @@ int __kprobes __atomic_notifier_call_chain(struct atomic_notifier_head *nh,
> }
> EXPORT_SYMBOL_GPL(__atomic_notifier_call_chain);
>
> -int __kprobes atomic_notifier_call_chain(struct atomic_notifier_head *nh,
> - unsigned long val, void *v)
> +int notrace __kprobes atomic_notifier_call_chain(struct atomic_notifier_head *nh,
> + unsigned long val, void *v)
> {
> return __atomic_notifier_call_chain(nh, val, v, -1, NULL);
> }
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index 3ec8160..d1a44ab 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -286,7 +286,7 @@ void rcu_exit_nohz(void)
> * irq handler running, this updates rdtp->dynticks_nmi to let the
> * RCU grace-period handling know that the CPU is active.
> */
> -void rcu_nmi_enter(void)
> +void notrace rcu_nmi_enter(void)
> {
> struct rcu_dynticks *rdtp = &__get_cpu_var(rcu_dynticks);
>
> @@ -304,7 +304,7 @@ void rcu_nmi_enter(void)
> * irq handler running, this updates rdtp->dynticks_nmi to let the
> * RCU grace-period handling know that the CPU is no longer active.
> */
> -void rcu_nmi_exit(void)
> +void notrace rcu_nmi_exit(void)
> {
> struct rcu_dynticks *rdtp = &__get_cpu_var(rcu_dynticks);
>
>
>
On Tue, 2010-04-06 at 10:46 -0700, David Miller wrote:
> From: Steven Rostedt <[email protected]>
> Date: Tue, 06 Apr 2010 09:41:04 -0400
>
> > It's best to send to my [email protected] account, just like it is
> > best to send to your [email protected] ;-)
>
> Ok then you should, like me, use it in your commits :-)
>
My SoB is [email protected], my commits are [email protected] just
because I want to give credit to the one that pays me for making those
commits ;-)
> > What about removing the logic from the ring buffer and moving it to the
> > TRACE_EVENT() macros as I suggested above?
>
> No, that makes no sense, just align the data types properly
> instead of making every single access site pay the price.
OK
>
> Just fix the CPP test to handle 32-bit platforms with slow
> mis-alignment handling, everything else seems to be working properly
> after I fixed the NMI issues on sparc64.
A little more work will need to be done since on 32bit, the page headers
are 12 bytes, not 16, so we start off without being aligned to 8 bytes. But
that too is fixable.
-- Steve
From: Steven Rostedt <[email protected]>
Date: Tue, 06 Apr 2010 14:15:58 -0400
> On Tue, 2010-04-06 at 10:46 -0700, David Miller wrote:
>> Just fix the CPP test to handle 32-bit platforms with slow
>> mis-alignment handling, everything else seems to be working properly
>> after I fixed the NMI issues on sparc64.
>
> A little more work will need to be done since on 32bit, the page headers
> are 12 bytes, not 16, so we start off without being aligned 8bytes. But
> that too is fixable.
Oh yeah the buffer_data_page, good point.