Hit this while trying to configure the irq-tracer.
I ran cat trace before doing a "echo 0 > tracing_on" by mistake.
Shot myself in the foot, but it still shouldn't oops.
Dave
BUG: unable to handle kernel NULL pointer dereference at 0000000000000064
IP: [<ffffffff8110b72d>] ring_buffer_resize+0xad/0x430
PGD 141eaf067 PUD 13f6d3067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
CPU 6
Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables crc32c_intel usb_debug ghash_clmulni_intel microcode serio_raw pcspkr i2c_i801 iTCO_wdt iTCO_vendor_support e1000e nfsd nfs_acl auth_rpcgss lockd sunrpc i915 video i2c_algo_bit drm_kms_helper drm i2c_core [last unloaded: scsi_wait_scan]
Pid: 731, comm: bash Not tainted 3.4.0+ #29
RIP: 0010:[<ffffffff8110b72d>] [<ffffffff8110b72d>] ring_buffer_resize+0xad/0x430
RSP: 0018:ffff88013f83bdb8 EFLAGS: 00010217
RAX: ffff880145d60390 RBX: 00000000001609e0 RCX: 0000000000000006
RDX: 0000000000000008 RSI: ffffffff8110b714 RDI: ffffffff8165790a
RBP: ffff88013f83bdf8 R08: 2222222222222222 R09: 2222222222222222
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801462f0920
R13: 0000000000000162 R14: 0000000000000008 R15: 0000000000000000
FS: 00007fd389ddb740(0000) GS:ffff880148800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000064 CR3: 000000013e66d000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process bash (pid: 731, threadinfo ffff88013f83a000, task ffff880140220000)
Stack:
ffff8801462f0940 ffff8801417c4310 ffff8801417c4310 0000000000000008
000000000000000c 00000000001609e0 ffff8801417c4310 ffffffff81c7b2f0
ffff88013f83be48 ffffffff81114451 ffffffff81d34880 ffffffff81c7b260
Call Trace:
[<ffffffff81114451>] tracing_set_tracer+0x221/0x320
[<ffffffff81114615>] tracing_set_trace_write+0xc5/0x100
[<ffffffff810b0b38>] ? trace_hardirqs_off_caller+0x28/0xc0
[<ffffffff810860f1>] ? get_parent_ip+0x11/0x50
[<ffffffff812ad55c>] ? security_file_permission+0x2c/0xb0
[<ffffffff811a725f>] vfs_write+0xaf/0x190
[<ffffffff811a759d>] sys_write+0x4d/0x90
[<ffffffff81662912>] system_call_fastpath+0x16/0x1b
Code: 8d 44 24 20 31 f6 48 89 c7 48 89 45 c0 e8 3c bf 54 00 41 83 fe ff 0f 84 05 01 00 00 49 8b 84 24 c8 00 00 00 49 63 d6 4c 8b 3c d0 <41> 8b 47 64 44 39 e8 74 7a 44 89 ef 49 8d b7 f8 00 00 00 29 c7
On Thu, 2012-05-24 at 12:01 -0400, Dave Jones wrote:
> Hit this while trying to configure the irq-tracer.
> I ran cat trace before doing a "echo 0 > tracing_on" by mistake.
> Shot myself in the foot, but it still shouldn't oops.
>
Hi Dave,
Thanks for reporting. Could you try this patch to see if it fixes it for
you.
http://groups.google.com/group/linux.kernel/msg/4294e8d564d66997?dmode=source
Below is the patch again, as the one in the link has GPG tags that might
screw with patch.
-- Steve
commit 6a31e1f135d1abfb5137697f889c8cd5d72eb522
Author: Steven Rostedt <[email protected]>
Date: Wed May 23 15:35:17 2012 -0400
ring-buffer: Check for valid buffer before changing size
On some machines the number of possible CPUS is not the same as the
number of CPUs that is on the machine. Ftrace uses possible_cpus to
update the tracing structures but the ring buffer only allocates
per cpu buffers for online CPUs when they come up.
When the wakeup tracer was enabled in such a case, the ftrace code
enabled all possible cpu buffers, but the code in ring_buffer_resize()
did not check to see if the buffer in question was allocated. Since
boot up CPUs did not match possible CPUs it caused the following
crash:
BUG: unable to handle kernel NULL pointer dereference at 00000020
IP: [<c1097851>] ring_buffer_resize+0x16a/0x28d
*pde = 00000000
Oops: 0000 [#1] PREEMPT SMP
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in: [last unloaded: scsi_wait_scan]
Pid: 1387, comm: bash Not tainted 3.4.0-test+ #13 /DG965MQ
EIP: 0060:[<c1097851>] EFLAGS: 00010217 CPU: 0
EIP is at ring_buffer_resize+0x16a/0x28d
EAX: f5a14340 EBX: f6026b80 ECX: 00000ff4 EDX: 00000ff3
ESI: 00000000 EDI: 00000002 EBP: f4275ecc ESP: f4275eb0
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
CR0: 80050033 CR2: 00000020 CR3: 34396000 CR4: 000007d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
Process bash (pid: 1387, ti=f4274000 task=f4380cb0 task.ti=f4274000)
Stack:
c109cf9a f6026b98 00000162 00160f68 00000006 00160f68 00000002 f4275ef0
c109d013 f4275ee8 c123b72a c1c0bf00 c1cc81dc 00000005 f4275f98 00000007
f4275f70 c109d0c7 7700000e 75656b61 00000070 f5e90900 f5c4e198 00000301
Call Trace:
[<c109cf9a>] ? tracing_set_tracer+0x115/0x1e9
[<c109d013>] tracing_set_tracer+0x18e/0x1e9
[<c123b72a>] ? _copy_from_user+0x30/0x46
[<c109d0c7>] tracing_set_trace_write+0x59/0x7f
[<c10ec01e>] ? fput+0x18/0x1c6
[<c11f8732>] ? security_file_permission+0x27/0x2b
[<c10eaacd>] ? rw_verify_area+0xcf/0xf2
[<c10ec01e>] ? fput+0x18/0x1c6
[<c109d06e>] ? tracing_set_tracer+0x1e9/0x1e9
[<c10ead77>] vfs_write+0x8b/0xe3
[<c10ebead>] ? fget_light+0x30/0x81
[<c10eaf54>] sys_write+0x42/0x63
[<c1834fbf>] sysenter_do_call+0x12/0x28
This happens with the latency tracer as the ftrace code updates the
saved max buffer via its cpumask and not with a global setting.
Adding a check in ring_buffer_resize() to make sure the buffer being resized
exists, fixes the problem.
Cc: Vaibhav Nagarnaik <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 6420cda..1d0f6a8 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1486,6 +1486,11 @@ int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size,
if (!buffer)
return size;
+ /* Make sure the requested buffer exists */
+ if (cpu_id != RING_BUFFER_ALL_CPUS &&
+ !cpumask_test_cpu(cpu_id, buffer->cpumask))
+ return size;
+
size = DIV_ROUND_UP(size, BUF_PAGE_SIZE);
size *= BUF_PAGE_SIZE;
On Thu, May 24, 2012 at 12:19:58PM -0400, Steven Rostedt wrote:
> On Thu, 2012-05-24 at 12:01 -0400, Dave Jones wrote:
> > Hit this while trying to configure the irq-tracer.
> > I ran cat trace before doing a "echo 0 > tracing_on" by mistake.
> > Shot myself in the foot, but it still shouldn't oops.
> >
>
> Hi Dave,
>
> Thanks for reporting. Could you try this patch to see if it fixes it for
> you.
>
> http://groups.google.com/group/linux.kernel/msg/4294e8d564d66997?dmode=source
>
> Below is the patch again, as the one in the link has GPG tags that might
> screw with patch.
hmm, this time I didn't even get past 'echo irqsoff > current_tracer'
Dave
[ 1013.243754] BUG: unable to handle kernel NULL pointer dereference at 0000000000000002
[ 1013.272665] IP: [<ffff880145cc0000>] 0xffff880145cbffff
[ 1013.285186] PGD 1401b2067 PUD 14324c067 PMD 0
[ 1013.298832] Oops: 0010 [#1] PREEMPT SMP
[ 1013.310600] CPU 2
[ 1013.317904] Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables crc32c_intel ghash_clmulni_intel microcode usb_debug serio_raw pcspkr iTCO_wdt i2c_i801 iTCO_vendor_support e1000e nfsd nfs_acl auth_rpcgss lockd sunrpc i915 video i2c_algo_bit drm_kms_helper drm i2c_core [last unloaded: scsi_wait_scan]
[ 1013.401848]
[ 1013.407399] Pid: 112, comm: kworker/2:1 Not tainted 3.4.0+ #30
[ 1013.437943] RIP: 8eb8:[<ffff88014630a000>] [<ffff88014630a000>] 0xffff880146309fff
[ 1013.459871] RSP: ffffffff8165e919:ffff88014780f408 EFLAGS: 00010046
[ 1013.477909] RAX: 0000000000000001 RBX: ffffffff81104020 RCX: 0000000000000000
[ 1013.499458] RDX: ffff880148008ea8 RSI: ffffffff8131ef40 RDI: ffffffff82203b20
[ 1013.521612] RBP: ffffffff81005751 R08: 0000000000000000 R09: 0000000000000000
[ 1013.543121] R10: ffffffff82cdc318 R11: 0000000000000000 R12: ffff880145cc0000
[ 1013.564614] R13: ffff880148008eb8 R14: 0000000000000002 R15: ffff88014780cb40
[ 1013.586108] FS: 0000000000000000(0000) GS:ffff880148000000(0000) knlGS:0000000000000000
[ 1013.609458] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1013.627420] CR2: 0000000000000002 CR3: 0000000141f10000 CR4: 00000000001407e0
[ 1013.649051] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1013.670724] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1013.692376] Process kworker/2:1 (pid: 112, threadinfo ffff88013fe0e000, task ffff88014020a6a0)
[ 1013.717028] Stack:
[ 1013.724131] ffff88014780f570 ffff880145cc0000 0000400000004000 0000000000000000
[ 1013.745918] cccccccccccccccc ffff88014780cca8 ffffffff811072bb ffffffff81651627
[ 1013.767870] ffffffff8118f8a7 ffffffff811072bb ffffffff81f2b6c5 ffffffff81f11bdb
[ 1013.790021] Call Trace:
[ 1013.800701] Code: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a <e7> d7 64 81 ff ff ff ff 01 00 00 00 00 00 00 00 65 d9 64 81 ff
[ 1013.861443] RIP [<ffff88014630a000>] 0xffff880146309fff
[ 1013.884466] RSP <ffff88014780f408>
[ 1013.901507] CR2: 0000000000000002
[ 1013.918628] ---[ end trace d0e6e5741aaf56ac ]---
[ 1013.940297] BUG: sleeping function called from invalid context at kernel/rwsem.c:20
[ 1013.972340] in_atomic(): 0, irqs_disabled(): 1, pid: 112, name: kworker/2:1
[ 1014.001470] INFO: lockdep is turned off.
[ 1014.021285] irq event stamp: 146784
[ 1014.039375] hardirqs last enabled at (146783): [<ffffffff8165b4f4>] paranoid_restore+0x10/0x66
[ 1014.074744] hardirqs last disabled at (146784): [<ffffffff8165b380>] int3+0x10/0x40
[ 1014.106164] softirqs last enabled at (146418): [<ffffffff81052054>] __do_softirq+0x144/0x3d0
[ 1014.141604] softirqs last disabled at (146407): [<ffffffff81663e4c>] call_softirq+0x1c/0x30
[ 1014.174829] Pid: 112, comm: kworker/2:1 Tainted: G D 3.4.0+ #30
[ 1014.204753] Call Trace:
[ 1014.220719] <#DB> [<ffffffff810b3eb0>] ? print_irqtrace_events+0xd0/0xe0
[ 1014.251033] [<ffffffff8107f5c0>] __might_sleep+0x140/0x1f0
[ 1014.276482] [<ffffffff81657fc6>] down_read+0x26/0x93
[ 1014.301499] [<ffffffff810624e4>] exit_signals+0x24/0x130
[ 1014.326264] [<ffffffff8104e95c>] do_exit+0xbc/0xb70
[ 1014.348709] [<ffffffff8104b8b3>] ? kmsg_dump+0x83/0x2b0
[ 1014.373726] [<ffffffff8164d5e3>] ? printk+0x61/0x63
[ 1014.396878] [<ffffffff8165c007>] oops_end+0x97/0xe0
[ 1014.420306] [<ffffffff8164cece>] no_context+0x258/0x283
[ 1014.446640] [<ffffffff8164d0bd>] __bad_area_nosemaphore+0x1c4/0x1e3
[ 1014.474615] [<ffffffff8164d0ef>] bad_area_nosemaphore+0x13/0x15
[ 1014.505249] [<ffffffff8165e540>] do_page_fault+0x3c0/0x4b0
[ 1014.531368] [<ffffffff8106f708>] ? __kernel_text_address+0x58/0x80
[ 1014.560247] [<ffffffff810057cf>] ? print_context_stack+0x8f/0xf0
[ 1014.589070] [<ffffffff810860f1>] ? get_parent_ip+0x11/0x50
[ 1014.614948] [<ffffffff8165b613>] ? error_sti+0x5/0x6
[ 1014.641105] [<ffffffff810b0b38>] ? trace_hardirqs_off_caller+0x28/0xc0
[ 1014.671882] [<ffffffff81005751>] ? print_context_stack+0x11/0xf0
[ 1014.700148] [<ffffffff8131ac5d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 1014.730928] [<ffffffff81005751>] ? print_context_stack+0x11/0xf0
[ 1014.759028] [<ffffffff8165b42f>] page_fault+0x1f/0x30
[ 1014.784924] [<ffffffff81005751>] ? print_context_stack+0x11/0xf0
[ 1014.812910] [<ffffffff81104020>] ? __add_hash_entry+0x90/0x90
[ 1014.840108] [<ffffffff8131ef40>] ? bsearch+0x60/0x90
[ 1014.864867] [<ffffffff81104020>] ? __add_hash_entry+0x90/0x90
[ 1014.891647] [<ffffffff8165e919>] ? sub_preempt_count+0x9/0xd0
[ 1014.918212] [<ffffffff8165b466>] ? paranoid_exit+0x6/0x1b
[ 1014.942506] [<ffffffff810b0b38>] ? trace_hardirqs_off_caller+0x28/0xc0
[ 1014.971053] [<ffffffff8131ac5d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 1014.999788] [<ffffffff81104020>] ? __add_hash_entry+0x90/0x90
[ 1015.026354] [<ffffffff8165e919>] ? sub_preempt_count+0x9/0xd0
[ 1015.052818] [<ffffffff8165b466>] ? paranoid_exit+0x6/0x1b
[ 1015.077099] [<ffffffff81005751>] ? print_context_stack+0x11/0xf0
[ 1015.103546] [<ffffffff8165e91e>] ? sub_preempt_count+0xe/0xd0
[ 1015.131390] <<EOE>>
[ 1015.145014] BUG: unable to handle kernel paging request at fffffffffffffff8
[ 1015.169954] IP: [<ffffffff81072ee1>] kthread_data+0x11/0x20
[ 1015.190713] PGD 1c0d067 PUD 1c0e067 PMD 0
[ 1015.207586] Oops: 0000 [#2] PREEMPT SMP
[ 1015.224452] CPU 2
[ 1015.236313] Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables crc32c_intel ghash_clmulni_intel microcode usb_debug serio_raw pcspkr iTCO_wdt i2c_i801 iTCO_vendor_support e1000e nfsd nfs_acl auth_rpcgss lockd sunrpc i915 video i2c_algo_bit drm_kms_helper drm i2c_core [last unloaded: scsi_wait_scan]
[ 1015.339235]
[ 1015.349349] Pid: 112, comm: kworker/2:1 Tainted: G D 3.4.0+ #30
[ 1015.389489] RIP: 0010:[<ffffffff81072ee1>] [<ffffffff81072ee1>] kthread_data+0x11/0x20
[ 1015.419063] RSP: 0018:ffff880148008aa0 EFLAGS: 00010092
[ 1015.440762] RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000002
[ 1015.468267] RDX: ffffffff820a0940 RSI: 0000000000000002 RDI: ffff88014020a6a0
[ 1015.495782] RBP: ffff880148008ab8 R08: ffff88014020a710 R09: 0000000000000400
[ 1015.523231] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8801481d3080
[ 1015.550722] R13: 0000000000000002 R14: ffff8801480086c0 R15: ffff88014020a690
[ 1015.578205] FS: 0000000000000000(0000) GS:ffff880148000000(0000) knlGS:0000000000000000
[ 1015.607725] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1015.631370] CR2: fffffffffffffff8 CR3: 0000000141f10000 CR4: 00000000001407e0
[ 1015.658912] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1015.686428] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1015.713990] Process kworker/2:1 (pid: 112, threadinfo ffff88013fe0e000, task ffff88014020a6a0)
[ 1015.744998] Stack:
[ 1015.756808] ffffffff8106d615 ffff880148008ab8 ffff88014020ac60 ffff880148008b38
[ 1015.784512] ffffffff81658ca8 ffff880148008ae8 ffffffff8165add2 ffff88014020a6a0
[ 1015.812232] ffff88013fe0ffd8 ffff88013fe0ffd8 ffff88013fe0ffd8 ffff880148008b18
[ 1015.839896] Call Trace:
[ 1015.853916] <#DB>
[ 1015.866497] [<ffffffff8106d615>] ? wq_worker_sleeping+0x15/0xa0
[ 1015.891848] [<ffffffff81658ca8>] __schedule+0x758/0x9d0
[ 1015.915078] [<ffffffff8165add2>] ? _raw_spin_unlock_irqrestore+0x42/0x80
[ 1015.942459] [<ffffffff81658fa9>] schedule+0x29/0x70
[ 1015.963674] [<ffffffff8104ef00>] do_exit+0x660/0xb70
[ 1015.987893] [<ffffffff8104b8b3>] ? kmsg_dump+0x83/0x2b0
[ 1016.011435] [<ffffffff8164d5e3>] ? printk+0x61/0x63
[ 1016.032614] [<ffffffff8165c007>] oops_end+0x97/0xe0
[ 1016.053714] [<ffffffff8164cece>] no_context+0x258/0x283
[ 1016.076944] [<ffffffff8164d0bd>] __bad_area_nosemaphore+0x1c4/0x1e3
[ 1016.102353] [<ffffffff8164d0ef>] bad_area_nosemaphore+0x13/0x15
[ 1016.127714] [<ffffffff8165e540>] do_page_fault+0x3c0/0x4b0
[ 1016.150977] [<ffffffff8106f708>] ? __kernel_text_address+0x58/0x80
[ 1016.176389] [<ffffffff810057cf>] ? print_context_stack+0x8f/0xf0
[ 1016.201792] [<ffffffff810860f1>] ? get_parent_ip+0x11/0x50
[ 1016.225106] [<ffffffff8165b613>] ? error_sti+0x5/0x6
[ 1016.248342] [<ffffffff810b0b38>] ? trace_hardirqs_off_caller+0x28/0xc0
[ 1016.275871] [<ffffffff81005751>] ? print_context_stack+0x11/0xf0
[ 1016.301182] [<ffffffff8131ac5d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 1016.328674] [<ffffffff81005751>] ? print_context_stack+0x11/0xf0
[ 1016.354020] [<ffffffff8165b42f>] page_fault+0x1f/0x30
[ 1016.377291] [<ffffffff81005751>] ? print_context_stack+0x11/0xf0
[ 1016.402573] [<ffffffff81104020>] ? __add_hash_entry+0x90/0x90
[ 1016.427845] [<ffffffff8131ef40>] ? bsearch+0x60/0x90
[ 1016.451031] [<ffffffff81104020>] ? __add_hash_entry+0x90/0x90
[ 1016.476310] [<ffffffff8165e919>] ? sub_preempt_count+0x9/0xd0
[ 1016.501673] [<ffffffff8165b466>] ? paranoid_exit+0x6/0x1b
[ 1016.524849] [<ffffffff810b0b38>] ? trace_hardirqs_off_caller+0x28/0xc0
[ 1016.552487] [<ffffffff8131ac5d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 1016.579892] [<ffffffff81104020>] ? __add_hash_entry+0x90/0x90
[ 1016.605229] [<ffffffff8165e919>] ? sub_preempt_count+0x9/0xd0
[ 1016.630546] [<ffffffff8165b466>] ? paranoid_exit+0x6/0x1b
[ 1016.653694] [<ffffffff81005751>] ? print_context_stack+0x11/0xf0
[ 1016.679048] [<ffffffff8165e91e>] ? sub_preempt_count+0xe/0xd0
[ 1016.704311] <<EOE>>
[ 1016.718280] Code: 81 e8 34 e9 03 00 85 c0 0f 85 6c fe ff ff eb a1 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 cc f7 f7 5e 00 48 8b 87 68 05 00 00 5d <48> 8b 40 f8 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 cc d7
[ 1016.780403] RIP [<ffffffff81072ee1>] kthread_data+0x11/0x20
[ 1016.801375] RSP <ffff880148008aa0>
[ 1016.816881] CR2: fffffffffffffff8
[ 1016.832146] ---[ end trace d0e6e5741aaf56ad ]---
[ 1016.851464] Fixing recursive fault but reboot is needed!
[ 1034.086221] ------------[ cut here ]------------
[ 1034.095568] WARNING: at kernel/watchdog.c:241 watchdog_overflow_callback+0x9a/0xc0()
[ 1034.105026] Hardware name: 2012 Client Platform
[ 1034.114418] Watchdog detected hard LOCKUP on cpu 2
[ 1034.118880] Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables crc32c_intel ghash_clmulni_intel microcode usb_debug serio_raw pcspkr iTCO_wdt i2c_i801 iTCO_vendor_support e1000e nfsd nfs_acl auth_rpcgss lockd sunrpc i915 video i2c_algo_bit drm_kms_helper drm i2c_core [last unloaded: scsi_wait_scan]
[ 1034.132195] Pid: 112, comm: kworker/2:1 Tainted: G D 3.4.0+ #30
[ 1034.136525] Call Trace:
[ 1034.140761] <NMI> [<ffffffff8104891f>] warn_slowpath_common+0x7f/0xc0
[ 1034.150641] [<ffffffff81048a16>] warn_slowpath_fmt+0x46/0x50
[ 1034.160462] [<ffffffff810f2689>] ? watchdog_overflow_callback+0x9/0xc0
[ 1034.170299] [<ffffffff810f2680>] ? touch_nmi_watchdog+0x90/0x90
[ 1034.180139] [<ffffffff810f271a>] watchdog_overflow_callback+0x9a/0xc0
[ 1034.189984] [<ffffffff81133d0d>] __perf_event_overflow+0x9d/0x310
[ 1034.199783] [<ffffffff81014385>] ? x86_perf_event_set_period+0xd5/0x150
[ 1034.209617] [<ffffffff81134f84>] perf_event_overflow+0x14/0x20
[ 1034.219422] [<ffffffff8101960f>] intel_pmu_handle_irq+0x18f/0x310
[ 1034.229257] [<ffffffff8165cc3d>] perf_event_nmi_handler+0x1d/0x20
[ 1034.239042] [<ffffffff8165c1fe>] nmi_handle.isra.2+0xae/0x1f0
[ 1034.248822] [<ffffffff8165c150>] ? __die+0x100/0x100
[ 1034.258550] [<ffffffff810fe40e>] ? rcu_nmi_enter+0xe/0x80
[ 1034.268345] [<ffffffff8165c458>] do_nmi+0x118/0x380
[ 1034.278104] [<ffffffff8165b7b2>] end_repeat_nmi+0x1a/0x1e
[ 1034.287916] [<ffffffff81319834>] ? delay_tsc+0x74/0xe0
[ 1034.297662] [<ffffffff81319834>] ? delay_tsc+0x74/0xe0
[ 1034.307432] [<ffffffff81319834>] ? delay_tsc+0x74/0xe0
[ 1034.317208] <<EOE>> <#DB> [<ffffffff8131971f>] __delay+0xf/0x20
[ 1034.326979] [<ffffffff81320d2e>] do_raw_spin_lock+0x7e/0x140
[ 1034.336743] [<ffffffff8165a236>] _raw_spin_lock_irq+0x66/0x80
[ 1034.340966] [<ffffffff8165863f>] ? __schedule+0xef/0x9d0
[ 1034.345181] [<ffffffff8165863f>] __schedule+0xef/0x9d0
[ 1034.349398] [<ffffffff81658fa9>] schedule+0x29/0x70
[ 1034.353579] [<ffffffff8104f301>] do_exit+0xa61/0xb70
[ 1034.357772] [<ffffffff8104b8b3>] ? kmsg_dump+0x83/0x2b0
[ 1034.361953] [<ffffffff8164d5e3>] ? printk+0x61/0x63
[ 1034.366148] [<ffffffff8165c007>] oops_end+0x97/0xe0
[ 1034.370331] [<ffffffff8164cece>] no_context+0x258/0x283
[ 1034.374540] [<ffffffff8164d0bd>] __bad_area_nosemaphore+0x1c4/0x1e3
[ 1034.384325] [<ffffffff8164d0ef>] bad_area_nosemaphore+0x13/0x15
[ 1034.394059] [<ffffffff8165e540>] do_page_fault+0x3c0/0x4b0
[ 1034.403809] [<ffffffff8118ea9f>] ? cmpxchg_double_slab.isra.40+0x13f/0x220
[ 1034.413576] [<ffffffff8165b613>] ? error_sti+0x5/0x6
[ 1034.423289] [<ffffffff810b0b38>] ? trace_hardirqs_off_caller+0x28/0xc0
[ 1034.433070] [<ffffffff8131ac5d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 1034.442850] [<ffffffff8165b42f>] page_fault+0x1f/0x30
[ 1034.452588] [<ffffffff81072ee1>] ? kthread_data+0x11/0x20
[ 1034.462311] [<ffffffff8106d615>] ? wq_worker_sleeping+0x15/0xa0
[ 1034.472070] [<ffffffff81658ca8>] __schedule+0x758/0x9d0
[ 1034.481826] [<ffffffff8165add2>] ? _raw_spin_unlock_irqrestore+0x42/0x80
[ 1034.491606] [<ffffffff81658fa9>] schedule+0x29/0x70
[ 1034.495794] [<ffffffff8104ef00>] do_exit+0x660/0xb70
[ 1034.499982] [<ffffffff8104b8b3>] ? kmsg_dump+0x83/0x2b0
[ 1034.504171] [<ffffffff8164d5e3>] ? printk+0x61/0x63
[ 1034.508381] [<ffffffff8165c007>] oops_end+0x97/0xe0
[ 1034.518107] [<ffffffff8164cece>] no_context+0x258/0x283
[ 1034.527863] [<ffffffff8164d0bd>] __bad_area_nosemaphore+0x1c4/0x1e3
[ 1034.537539] [<ffffffff8164d0ef>] bad_area_nosemaphore+0x13/0x15
[ 1034.541512] [<ffffffff8165e540>] do_page_fault+0x3c0/0x4b0
[ 1034.545475] [<ffffffff8106f708>] ? __kernel_text_address+0x58/0x80
[ 1034.549465] [<ffffffff810057cf>] ? print_context_stack+0x8f/0xf0
[ 1034.553452] [<ffffffff810860f1>] ? get_parent_ip+0x11/0x50
[ 1034.562960] [<ffffffff8165b613>] ? error_sti+0x5/0x6
[ 1034.572446] [<ffffffff810b0b38>] ? trace_hardirqs_off_caller+0x28/0xc0
[ 1034.581996] [<ffffffff81005751>] ? print_context_stack+0x11/0xf0
[ 1034.591538] [<ffffffff8131ac5d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 1034.595562] [<ffffffff81005751>] ? print_context_stack+0x11/0xf0
[ 1034.599553] [<ffffffff8165b42f>] page_fault+0x1f/0x30
[ 1034.603473] [<ffffffff81005751>] ? print_context_stack+0x11/0xf0
[ 1034.612922] [<ffffffff81104020>] ? __add_hash_entry+0x90/0x90
[ 1034.616813] [<ffffffff8131ef40>] ? bsearch+0x60/0x90
[ 1034.620667] [<ffffffff81104020>] ? __add_hash_entry+0x90/0x90
[ 1034.624529] [<ffffffff8165e9e9>] ? add_preempt_count+0x9/0xe0
[ 1034.628378] [<ffffffff8165b466>] ? paranoid_exit+0x6/0x1b
[ 1034.637760] [<ffffffff810b0b38>] ? trace_hardirqs_off_caller+0x28/0xc0
[ 1034.647158] [<ffffffff8131ac5d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 1034.651037] [<ffffffff81104020>] ? __add_hash_entry+0x90/0x90
[ 1034.654882] [<ffffffff8165e9e9>] ? add_preempt_count+0x9/0xe0
[ 1034.658726] [<ffffffff8165b466>] ? paranoid_exit+0x6/0x1b
[ 1034.662576] [<ffffffff8165e9ee>] ? add_preempt_count+0xe/0xe0
[ 1034.671903] <<EOE>>
[ 1034.671903] ---[ end trace d0e6e5741aaf56ae ]---
[ 1042.542768] BUG: soft lockup - CPU#4 stuck for 22s! [bash:736]
[ 1042.550033] Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables crc32c_intel ghash_clmulni_intel microcode usb_debug serio_raw pcspkr iTCO_wdt i2c_i801 iTCO_vendor_support e1000e nfsd nfs_acl auth_rpcgss lockd sunrpc i915 video i2c_algo_bit drm_kms_helper drm i2c_core [last unloaded: scsi_wait_scan]
[ 1042.562286] irq event stamp: 109562
[ 1042.569389] hardirqs last enabled at (109561): [<ffffffff8165b4f4>] paranoid_restore+0x10/0x66
[ 1042.576591] hardirqs last disabled at (109562): [<ffffffff8165a1e9>] _raw_spin_lock_irq+0x19/0x80
[ 1042.583959] softirqs last enabled at (109554): [<ffffffff81052054>] __do_softirq+0x144/0x3d0
[ 1042.591204] softirqs last disabled at (109533): [<ffffffff81663e4c>] call_softirq+0x1c/0x30
[ 1042.598508] CPU 4
[ 1042.606084] Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables crc32c_intel ghash_clmulni_intel microcode usb_debug serio_raw pcspkr iTCO_wdt i2c_i801 iTCO_vendor_support e1000e nfsd nfs_acl auth_rpcgss lockd sunrpc i915 video i2c_algo_bit drm_kms_helper drm i2c_core [last unloaded: scsi_wait_scan]
[ 1042.622155]
[ 1042.629673] Pid: 736, comm: bash Tainted: G D W 3.4.0+ #30
[ 1042.637406] RIP: 0010:[<ffffffff810be800>] [<ffffffff810be800>] smp_call_function_many+0x1f0/0x260
[ 1042.645077] RSP: 0018:ffff88014251bbf8 EFLAGS: 00000202
[ 1042.652634] RAX: 0000000000000001 RBX: ffffffff8165b21d RCX: 0000000000000000
[ 1042.660277] RDX: 0000000000000004 RSI: ffffffff8118eab5 RDI: ffffffff810b6edd
[ 1042.667798] RBP: ffff88014251bc38 R08: ffff880143a1f2d8 R09: 0000000000000000
[ 1042.675538] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88014251bb68
[ 1042.683235] R13: ffff8801485d3080 R14: ffff88014251a000 R15: ffff88014251bfd8
[ 1042.690960] FS: 00007f206db43740(0000) GS:ffff880148400000(0000) knlGS:0000000000000000
[ 1042.698725] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1042.706516] CR2: 00000037682bad10 CR3: 0000000141f10000 CR4: 00000000001407e0
[ 1042.714256] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1042.722048] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1042.729836] Process bash (pid: 736, threadinfo ffff88014251a000, task ffff880145834d40)
[ 1042.734323] Stack:
[ 1042.741936] 0000000000000000 ffffffff81029970 0000000000000296 ffffffff81029970
[ 1042.746452] 0000000000000000 0000000000000001 00000000000062f9 ffffffff816626d0
[ 1042.750989] ffff88014251bc68 ffffffff810be976 ffff88014251bc68 0000000000000001
[ 1042.755485] Call Trace:
[ 1042.759885] [<ffffffff81029970>] ? apic_send_IPI_self+0x50/0x50
[ 1042.764367] [<ffffffff81029970>] ? apic_send_IPI_self+0x50/0x50
[ 1042.768858] [<ffffffff816626d0>] ? mcount+0x10/0x10
[ 1042.773295] [<ffffffff810be976>] smp_call_function+0x46/0x80
[ 1042.777761] [<ffffffff81029970>] ? apic_send_IPI_self+0x50/0x50
[ 1042.785902] [<ffffffff810bea1c>] on_each_cpu+0x3c/0xb0
[ 1042.794051] [<ffffffff81029bde>] run_sync+0x1e/0x50
[ 1042.802138] [<ffffffff8102a021>] ftrace_replace_code+0x291/0x380
[ 1042.806621] [<ffffffff811079ca>] ftrace_modify_all_code+0x6a/0x80
[ 1042.811074] [<ffffffff8102a120>] arch_ftrace_update_code+0x10/0x20
[ 1042.815520] [<ffffffff81107a42>] ftrace_run_update_code+0x22/0xa0
[ 1042.819982] [<ffffffff81107af9>] ftrace_startup_enable+0x39/0x50
[ 1042.824413] [<ffffffff811081b6>] ftrace_startup+0x46/0x80
[ 1042.828836] [<ffffffff81108233>] register_ftrace_function+0x43/0x70
[ 1042.833253] [<ffffffff81118299>] start_irqsoff_tracer.isra.1+0x19/0x70
[ 1042.837695] [<ffffffff8111833c>] __irqsoff_tracer_init+0x4c/0x70
[ 1042.842104] [<ffffffff811183b3>] irqsoff_tracer_init+0x13/0x20
[ 1042.846508] [<ffffffff81113f46>] tracer_init+0x26/0x30
[ 1042.850913] [<ffffffff811144dd>] tracing_set_tracer+0x27d/0x320
[ 1042.855332] [<ffffffff81114645>] tracing_set_trace_write+0xc5/0x100
[ 1042.859768] [<ffffffff810b0b38>] ? trace_hardirqs_off_caller+0x28/0xc0
[ 1042.864194] [<ffffffff810860f1>] ? get_parent_ip+0x11/0x50
[ 1042.868600] [<ffffffff812ad58c>] ? security_file_permission+0x2c/0xb0
[ 1042.873036] [<ffffffff811a728f>] vfs_write+0xaf/0x190
[ 1042.877429] [<ffffffff811a75cd>] sys_write+0x4d/0x90
[ 1042.881836] [<ffffffff81662952>] system_call_fastpath+0x16/0x1b
[ 1042.886746] Code: c0 81 e8 b4 c5 59 00 0f ae f0 49 8b 7e 30 ff 15 7f df b5 00 45 84 ff 0f 84 91 fe ff ff 41 f6 46 20 01 0f 84 86 fe ff ff 0f 1f 00 <f3> 90 41 f6 46 20 01 75 f7 e9 75 fe ff ff 66 90 48 8b 55 c0 48
[ 1070.519234] BUG: soft lockup - CPU#4 stuck for 22s! [bash:736]
[ 1070.527311] Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables crc32c_intel ghash_clmulni_intel microcode usb_debug serio_raw pcspkr iTCO_wdt i2c_i801 iTCO_vendor_support e1000e nfsd nfs_acl auth_rpcgss lockd sunrpc i915 video i2c_algo_bit drm_kms_helper drm i2c_core [last unloaded: scsi_wait_scan]
[ 1070.541727] irq event stamp: 109562
[ 1070.546312] hardirqs last enabled at (109561): [<ffffffff8165b4f4>] paranoid_restore+0x10/0x66
[ 1070.551010] hardirqs last disabled at (109562): [<ffffffff8165a1e9>] _raw_spin_lock_irq+0x19/0x80
[ 1070.558796] softirqs last enabled at (109554): [<ffffffff81052054>] __do_softirq+0x144/0x3d0
[ 1070.566674] softirqs last disabled at (109533): [<ffffffff81663e4c>] call_softirq+0x1c/0x30
[ 1070.574547] CPU 4
[ 1070.582620] Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables crc32c_intel ghash_clmulni_intel microcode usb_debug serio_raw pcspkr iTCO_wdt i2c_i801 iTCO_vendor_support e1000e nfsd nfs_acl auth_rpcgss lockd sunrpc i915 video i2c_algo_bit drm_kms_helper drm i2c_core [last unloaded: scsi_wait_scan]
[ 1070.600333]
[ 1070.608261] Pid: 736, comm: bash Tainted: G D W 3.4.0+ #30
[ 1070.616428] RIP: 0010:[<ffffffff810be807>] [<ffffffff810be807>] smp_call_function_many+0x1f7/0x260
[ 1070.624512] RSP: 0018:ffff88014251bbf8 EFLAGS: 00000202
[ 1070.632485] RAX: 0000000000000001 RBX: ffffffff8165b21d RCX: 0000000000000000
[ 1070.640481] RDX: 0000000000000004 RSI: ffffffff8118eab5 RDI: ffffffff810b6edd
[ 1070.648436] RBP: ffff88014251bc38 R08: ffff880143a1f2d8 R09: 0000000000000000
[ 1070.656434] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88014251bb68
[ 1070.661106] R13: ffff8801485d3080 R14: ffff88014251a000 R15: ffff88014251bfd8
[ 1070.665787] FS: 00007f206db43740(0000) GS:ffff880148400000(0000) knlGS:0000000000000000
[ 1070.673817] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1070.681771] CR2: 00000037682bad10 CR3: 0000000141f10000 CR4: 00000000001407e0
[ 1070.686462] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1070.691116] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1070.695776] Process bash (pid: 736, threadinfo ffff88014251a000, task ffff880145834d40)
[ 1070.703590] Stack:
[ 1070.711420] 0000000000000000 ffffffff81029970 0000000000000296 ffffffff81029970
[ 1070.719481] 0000000000000000 0000000000000001 00000000000062f9 ffffffff816626d0
[ 1070.727507] ffff88014251bc68 ffffffff810be976 ffff88014251bc68 0000000000000001
[ 1070.732175] Call Trace:
[ 1070.736713] [<ffffffff81029970>] ? apic_send_IPI_self+0x50/0x50
[ 1070.741355] [<ffffffff81029970>] ? apic_send_IPI_self+0x50/0x50
[ 1070.745956] [<ffffffff816626d0>] ? mcount+0x10/0x10
[ 1070.750539] [<ffffffff810be976>] smp_call_function+0x46/0x80
[ 1070.755146] [<ffffffff81029970>] ? apic_send_IPI_self+0x50/0x50
[ 1070.759755] [<ffffffff810bea1c>] on_each_cpu+0x3c/0xb0
[ 1070.768201] [<ffffffff81029bde>] run_sync+0x1e/0x50
[ 1070.776475] [<ffffffff8102a021>] ftrace_replace_code+0x291/0x380
[ 1070.781125] [<ffffffff811079ca>] ftrace_modify_all_code+0x6a/0x80
[ 1070.785743] [<ffffffff8102a120>] arch_ftrace_update_code+0x10/0x20
[ 1070.790372] [<ffffffff81107a42>] ftrace_run_update_code+0x22/0xa0
[ 1070.794988] [<ffffffff81107af9>] ftrace_startup_enable+0x39/0x50
[ 1070.799614] [<ffffffff811081b6>] ftrace_startup+0x46/0x80
[ 1070.804221] [<ffffffff81108233>] register_ftrace_function+0x43/0x70
[ 1070.808842] [<ffffffff81118299>] start_irqsoff_tracer.isra.1+0x19/0x70
[ 1070.813494] [<ffffffff8111833c>] __irqsoff_tracer_init+0x4c/0x70
[ 1070.821897] [<ffffffff811183b3>] irqsoff_tracer_init+0x13/0x20
[ 1070.830209] [<ffffffff81113f46>] tracer_init+0x26/0x30
[ 1070.838475] [<ffffffff811144dd>] tracing_set_tracer+0x27d/0x320
[ 1070.846763] [<ffffffff81114645>] tracing_set_trace_write+0xc5/0x100
[ 1070.851413] [<ffffffff810b0b38>] ? trace_hardirqs_off_caller+0x28/0xc0
[ 1070.856060] [<ffffffff810860f1>] ? get_parent_ip+0x11/0x50
[ 1070.860663] [<ffffffff812ad58c>] ? security_file_permission+0x2c/0xb0
[ 1070.865317] [<ffffffff811a728f>] vfs_write+0xaf/0x190
[ 1070.869913] [<ffffffff811a75cd>] sys_write+0x4d/0x90
[ 1070.874531] [<ffffffff81662952>] system_call_fastpath+0x16/0x1b
[ 1070.883313] Code: 0f ae f0 49 8b 7e 30 ff 15 7f df b5 00 45 84 ff 0f 84 91 fe ff ff 41 f6 46 20 01 0f 84 86 fe ff ff 0f 1f 00 f3 90 41 f6 46 20 01 <75> f7 e9 75 fe ff ff 66 90 48 8b 55 c0 48 8b 75 c8 41 0f b6 cf
[ 1098.495701] BUG: soft lockup - CPU#4 stuck for 23s! [bash:736]
[ 1098.503796] Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables crc32c_intel ghash_clmulni_intel microcode usb_debug serio_raw pcspkr iTCO_wdt i2c_i801 iTCO_vendor_support e1000e nfsd nfs_acl auth_rpcgss lockd sunrpc i915 video i2c_algo_bit drm_kms_helper drm i2c_core [last unloaded: scsi_wait_scan]
[ 1098.521248] irq event stamp: 109562
[ 1098.529054] hardirqs last enabled at (109561): [<ffffffff8165b4f4>] paranoid_restore+0x10/0x66
[ 1098.533770] hardirqs last disabled at (109562): [<ffffffff8165a1e9>] _raw_spin_lock_irq+0x19/0x80
[ 1098.538489] softirqs last enabled at (109554): [<ffffffff81052054>] __do_softirq+0x144/0x3d0
[ 1098.543228] softirqs last disabled at (109533): [<ffffffff81663e4c>] call_softirq+0x1c/0x30
[ 1098.551115] CPU 4
[ 1098.559187] Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables crc32c_intel ghash_clmulni_intel microcode usb_debug serio_raw pcspkr iTCO_wdt i2c_i801 iTCO_vendor_support e1000e nfsd nfs_acl auth_rpcgss lockd sunrpc i915 video i2c_algo_bit drm_kms_helper drm i2c_core [last unloaded: scsi_wait_scan]
[ 1098.576879]
[ 1098.584800] Pid: 736, comm: bash Tainted: G D W 3.4.0+ #30
[ 1098.589602] RIP: 0010:[<ffffffff810be807>] [<ffffffff810be807>] smp_call_function_many+0x1f7/0x260
[ 1098.594390] RSP: 0018:ffff88014251bbf8 EFLAGS: 00000202
[ 1098.602434] RAX: 0000000000000001 RBX: ffffffff8165b21d RCX: 0000000000000000
[ 1098.610433] RDX: 0000000000000004 RSI: ffffffff8118eab5 RDI: ffffffff810b6edd
[ 1098.618362] RBP: ffff88014251bc38 R08: ffff880143a1f2d8 R09: 0000000000000000
[ 1098.626485] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88014251bb68
[ 1098.631200] R13: ffff8801485d3080 R14: ffff88014251a000 R15: ffff88014251bfd8
[ 1098.635898] FS: 00007f206db43740(0000) GS:ffff880148400000(0000) knlGS:0000000000000000
[ 1098.640604] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1098.645262] CR2: 00000037682bad10 CR3: 0000000141f10000 CR4: 00000000001407e0
[ 1098.649957] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1098.657983] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1098.662678] Process bash (pid: 736, threadinfo ffff88014251a000, task ffff880145834d40)
[ 1098.667361] Stack:
[ 1098.671941] 0000000000000000 ffffffff81029970 0000000000000296 ffffffff81029970
[ 1098.680070] 0000000000000000 0000000000000001 00000000000062f9 ffffffff816626d0
[ 1098.688135] ffff88014251bc68 ffffffff810be976 ffff88014251bc68 0000000000000001
[ 1098.696133] Call Trace:
[ 1098.704135] [<ffffffff81029970>] ? apic_send_IPI_self+0x50/0x50
[ 1098.708787] [<ffffffff81029970>] ? apic_send_IPI_self+0x50/0x50
[ 1098.713402] [<ffffffff816626d0>] ? mcount+0x10/0x10
[ 1098.717993] [<ffffffff810be976>] smp_call_function+0x46/0x80
[ 1098.726269] [<ffffffff81029970>] ? apic_send_IPI_self+0x50/0x50
[ 1098.730928] [<ffffffff810bea1c>] on_each_cpu+0x3c/0xb0
[ 1098.735545] [<ffffffff81029bde>] run_sync+0x1e/0x50
[ 1098.743887] [<ffffffff8102a021>] ftrace_replace_code+0x291/0x380
[ 1098.752191] [<ffffffff811079ca>] ftrace_modify_all_code+0x6a/0x80
[ 1098.756840] [<ffffffff8102a120>] arch_ftrace_update_code+0x10/0x20
[ 1098.761471] [<ffffffff81107a42>] ftrace_run_update_code+0x22/0xa0
[ 1098.769736] [<ffffffff81107af9>] ftrace_startup_enable+0x39/0x50
[ 1098.778035] [<ffffffff811081b6>] ftrace_startup+0x46/0x80
[ 1098.786299] [<ffffffff81108233>] register_ftrace_function+0x43/0x70
[ 1098.794641] [<ffffffff81118299>] start_irqsoff_tracer.isra.1+0x19/0x70
[ 1098.799297] [<ffffffff8111833c>] __irqsoff_tracer_init+0x4c/0x70
[ 1098.803952] [<ffffffff811183b3>] irqsoff_tracer_init+0x13/0x20
[ 1098.808601] [<ffffffff81113f46>] tracer_init+0x26/0x30
[ 1098.816936] [<ffffffff811144dd>] tracing_set_tracer+0x27d/0x320
[ 1098.825254] [<ffffffff81114645>] tracing_set_trace_write+0xc5/0x100
[ 1098.833579] [<ffffffff810b0b38>] ? trace_hardirqs_off_caller+0x28/0xc0
[ 1098.841906] [<ffffffff810860f1>] ? get_parent_ip+0x11/0x50
[ 1098.846546] [<ffffffff812ad58c>] ? security_file_permission+0x2c/0xb0
[ 1098.851204] [<ffffffff811a728f>] vfs_write+0xaf/0x190
[ 1098.855838] [<ffffffff811a75cd>] sys_write+0x4d/0x90
[ 1098.864166] [<ffffffff81662952>] system_call_fastpath+0x16/0x1b
[ 1098.872969] Code: 0f ae f0 49 8b 7e 30 ff 15 7f df b5 00 45 84 ff 0f 84 91 fe ff ff 41 f6 46 20 01 0f 84 86 fe ff ff 0f 1f 00 f3 90 41 f6 46 20 01 <75> f7 e9 75 fe ff ff 66 90 48 8b 55 c0 48 8b 75 c8 41 0f b6 cf
[ 1126.472169] BUG: soft lockup - CPU#4 stuck for 23s! [bash:736]
[ 1126.480251] Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables crc32c_intel ghash_clmulni_intel microcode usb_debug serio_raw pcspkr iTCO_wdt i2c_i801 iTCO_vendor_support e1000e nfsd nfs_acl auth_rpcgss lockd sunrpc i915 video i2c_algo_bit drm_kms_helper drm i2c_core [last unloaded: scsi_wait_scan]
[ 1126.494669] irq event stamp: 109562
[ 1126.502447] hardirqs last enabled at (109561): [<ffffffff8165b4f4>] paranoid_restore+0x10/0x66
[ 1126.507139] hardirqs last disabled at (109562): [<ffffffff8165a1e9>] _raw_spin_lock_irq+0x19/0x80
[ 1126.511860] softirqs last enabled at (109554): [<ffffffff81052054>] __do_softirq+0x144/0x3d0
[ 1126.519665] softirqs last disabled at (109533): [<ffffffff81663e4c>] call_softirq+0x1c/0x30
[ 1126.527532] CPU 4
[ 1126.535574] Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables crc32c_intel ghash_clmulni_intel microcode usb_debug serio_raw pcspkr iTCO_wdt i2c_i801 iTCO_vendor_support e1000e nfsd nfs_acl auth_rpcgss lockd sunrpc i915 video i2c_algo_bit drm_kms_helper drm i2c_core [last unloaded: scsi_wait_scan]
[ 1126.553312]
[ 1126.561265] Pid: 736, comm: bash Tainted: G D W 3.4.0+ #30
[ 1126.566064] RIP: 0010:[<ffffffff810be800>] [<ffffffff810be800>] smp_call_function_many+0x1f0/0x260
[ 1126.570845] RSP: 0018:ffff88014251bbf8 EFLAGS: 00000202
[ 1126.575491] RAX: 0000000000000001 RBX: ffffffff8165b21d RCX: 0000000000000000
[ 1126.580161] RDX: 0000000000000004 RSI: ffffffff8118eab5 RDI: ffffffff810b6edd
[ 1126.584818] RBP: ffff88014251bc38 R08: ffff880143a1f2d8 R09: 0000000000000000
[ 1126.592642] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88014251bb68
[ 1126.600621] R13: ffff8801485d3080 R14: ffff88014251a000 R15: ffff88014251bfd8
[ 1126.608614] FS: 00007f206db43740(0000) GS:ffff880148400000(0000) knlGS:0000000000000000
[ 1126.616664] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1126.621333] CR2: 00000037682bad10 CR3: 0000000141f10000 CR4: 00000000001407e0
[ 1126.626014] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1126.630706] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1126.635392] Process bash (pid: 736, threadinfo ffff88014251a000, task ffff880145834d40)
[ 1126.640072] Stack:
[ 1126.644634] 0000000000000000 ffffffff81029970 0000000000000296 ffffffff81029970
[ 1126.652702] 0000000000000000 0000000000000001 00000000000062f9 ffffffff816626d0
[ 1126.657407] ffff88014251bc68 ffffffff810be976 ffff88014251bc68 0000000000000001
[ 1126.662081] Call Trace:
[ 1126.666629] [<ffffffff81029970>] ? apic_send_IPI_self+0x50/0x50
[ 1126.674930] [<ffffffff81029970>] ? apic_send_IPI_self+0x50/0x50
[ 1126.679558] [<ffffffff816626d0>] ? mcount+0x10/0x10
[ 1126.684141] [<ffffffff810be976>] smp_call_function+0x46/0x80
[ 1126.688747] [<ffffffff81029970>] ? apic_send_IPI_self+0x50/0x50
[ 1126.693382] [<ffffffff810bea1c>] on_each_cpu+0x3c/0xb0
[ 1126.697991] [<ffffffff81029bde>] run_sync+0x1e/0x50
[ 1126.702594] [<ffffffff8102a021>] ftrace_replace_code+0x291/0x380
[ 1126.707242] [<ffffffff811079ca>] ftrace_modify_all_code+0x6a/0x80
[ 1126.711867] [<ffffffff8102a120>] arch_ftrace_update_code+0x10/0x20
[ 1126.720179] [<ffffffff81107a42>] ftrace_run_update_code+0x22/0xa0
[ 1126.724823] [<ffffffff81107af9>] ftrace_startup_enable+0x39/0x50
[ 1126.729459] [<ffffffff811081b6>] ftrace_startup+0x46/0x80
[ 1126.734078] [<ffffffff81108233>] register_ftrace_function+0x43/0x70
[ 1126.742457] [<ffffffff81118299>] start_irqsoff_tracer.isra.1+0x19/0x70
[ 1126.747139] [<ffffffff8111833c>] __irqsoff_tracer_init+0x4c/0x70
[ 1126.751772] [<ffffffff811183b3>] irqsoff_tracer_init+0x13/0x20
[ 1126.756408] [<ffffffff81113f46>] tracer_init+0x26/0x30
[ 1126.764618] [<ffffffff811144dd>] tracing_set_tracer+0x27d/0x320
[ 1126.772879] [<ffffffff81114645>] tracing_set_trace_write+0xc5/0x100
[ 1126.781233] [<ffffffff810b0b38>] ? trace_hardirqs_off_caller+0x28/0xc0
[ 1126.789573] [<ffffffff810860f1>] ? get_parent_ip+0x11/0x50
[ 1126.797825] [<ffffffff812ad58c>] ? security_file_permission+0x2c/0xb0
[ 1126.806145] [<ffffffff811a728f>] vfs_write+0xaf/0x190
[ 1126.810757] [<ffffffff811a75cd>] sys_write+0x4d/0x90
[ 1126.815386] [<ffffffff81662952>] system_call_fastpath+0x16/0x1b
[ 1126.820508] Code: c0 81 e8 b4 c5 59 00 0f ae f0 49 8b 7e 30 ff 15 7f df b5 00 45 84 ff 0f 84 91 fe ff ff 41 f6 46 20 01 0f 84 86 fe ff ff 0f 1f 00 <f3> 90 41 f6 46 20 01 75 f7 e9 75 fe ff ff 66 90 48 8b 55 c0 48
and then a bunch more CPU stuck for the other cpus..
On Thu, 2012-05-24 at 13:22 -0400, Dave Jones wrote:
> On Thu, May 24, 2012 at 12:19:58PM -0400, Steven Rostedt wrote:
> > On Thu, 2012-05-24 at 12:01 -0400, Dave Jones wrote:
> > > Hit this while trying to configure the irq-tracer.
> > > I ran cat trace before doing a "echo 0 > tracing_on" by mistake.
> > > Shot myself in the foot, but it still shouldn't oops.
> > >
> >
> > Hi Dave,
> >
> > Thanks for reporting. Could you try this patch to see if it fixes it for
> > you.
> >
> > http://groups.google.com/group/linux.kernel/msg/4294e8d564d66997?dmode=source
> >
> > Below is the patch again, as the one in the link has GPG tags that might
> > screw with patch.
>
> hmm, this time I didn't even get past 'echo irqsoff > current_tracer'
This looks different. In fact, I had the similar bug but! (see below)
>
> Dave
>
> [ 1013.243754] BUG: unable to handle kernel NULL pointer dereference at 0000000000000002
> [ 1013.272665] IP: [<ffff880145cc0000>] 0xffff880145cbffff
> [ 1013.285186] PGD 1401b2067 PUD 14324c067 PMD 0
> [ 1013.298832] Oops: 0010 [#1] PREEMPT SMP
> [ 1013.310600] CPU 2
> [ 1013.317904] Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables crc32c_intel ghash_clmulni_intel microcode usb_debug serio_raw pcspkr iTCO_wdt i2c_i801 iTCO_vendor_support e1000e nfsd nfs_acl auth_rpcgss lockd sunrpc i915 video i2c_algo_bit drm_kms_helper drm i2c_core [last unloaded: scsi_wait_scan]
> [ 1013.401848]
> [ 1013.407399] Pid: 112, comm: kworker/2:1 Not tainted 3.4.0+ #30
> [ 1013.437943] RIP: 8eb8:[<ffff88014630a000>] [<ffff88014630a000>] 0xffff880146309fff
The RIP is meaningless.
> [ 1013.459871] RSP: ffffffff8165e919:ffff88014780f408 EFLAGS: 00010046
> [ 1013.477909] RAX: 0000000000000001 RBX: ffffffff81104020 RCX: 0000000000000000
> [ 1013.499458] RDX: ffff880148008ea8 RSI: ffffffff8131ef40 RDI: ffffffff82203b20
> [ 1013.521612] RBP: ffffffff81005751 R08: 0000000000000000 R09: 0000000000000000
> [ 1013.543121] R10: ffffffff82cdc318 R11: 0000000000000000 R12: ffff880145cc0000
> [ 1013.564614] R13: ffff880148008eb8 R14: 0000000000000002 R15: ffff88014780cb40
> [ 1013.586108] FS: 0000000000000000(0000) GS:ffff880148000000(0000) knlGS:0000000000000000
> [ 1013.609458] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 1013.627420] CR2: 0000000000000002 CR3: 0000000141f10000 CR4: 00000000001407e0
> [ 1013.649051] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1013.670724] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 1013.692376] Process kworker/2:1 (pid: 112, threadinfo ffff88013fe0e000, task ffff88014020a6a0)
> [ 1013.717028] Stack:
> [ 1013.724131] ffff88014780f570 ffff880145cc0000 0000400000004000 0000000000000000
> [ 1013.745918] cccccccccccccccc ffff88014780cca8 ffffffff811072bb ffffffff81651627
> [ 1013.767870] ffffffff8118f8a7 ffffffff811072bb ffffffff81f2b6c5 ffffffff81f11bdb
> [ 1013.790021] Call Trace:
> [ 1013.800701] Code: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a <e7> d7 64 81 ff ff ff ff 01 00 00 00 00 00 00 00 65 d9 64 81 ff
> [ 1013.861443] RIP [<ffff88014630a000>] 0xffff880146309fff
> [ 1013.884466] RSP <ffff88014780f408>
> [ 1013.901507] CR2: 0000000000000002
> [ 1013.918628] ---[ end trace d0e6e5741aaf56ac ]---
> [ 1013.940297] BUG: sleeping function called from invalid context at kernel/rwsem.c:20
> [ 1013.972340] in_atomic(): 0, irqs_disabled(): 1, pid: 112, name: kworker/2:1
> [ 1014.001470] INFO: lockdep is turned off.
> [ 1014.021285] irq event stamp: 146784
> [ 1014.039375] hardirqs last enabled at (146783): [<ffffffff8165b4f4>] paranoid_restore+0x10/0x66
> [ 1014.074744] hardirqs last disabled at (146784): [<ffffffff8165b380>] int3+0x10/0x40
> [ 1014.106164] softirqs last enabled at (146418): [<ffffffff81052054>] __do_softirq+0x144/0x3d0
> [ 1014.141604] softirqs last disabled at (146407): [<ffffffff81663e4c>] call_softirq+0x1c/0x30
> [ 1014.174829] Pid: 112, comm: kworker/2:1 Tainted: G D 3.4.0+ #30
> [ 1014.204753] Call Trace:
> [ 1014.220719] <#DB> [<ffffffff810b3eb0>] ? print_irqtrace_events+0xd0/0xe0
> [ 1014.251033] [<ffffffff8107f5c0>] __might_sleep+0x140/0x1f0
> [ 1014.276482] [<ffffffff81657fc6>] down_read+0x26/0x93
> [ 1014.301499] [<ffffffff810624e4>] exit_signals+0x24/0x130
> [ 1014.326264] [<ffffffff8104e95c>] do_exit+0xbc/0xb70
> [ 1014.348709] [<ffffffff8104b8b3>] ? kmsg_dump+0x83/0x2b0
> [ 1014.373726] [<ffffffff8164d5e3>] ? printk+0x61/0x63
> [ 1014.396878] [<ffffffff8165c007>] oops_end+0x97/0xe0
> [ 1014.420306] [<ffffffff8164cece>] no_context+0x258/0x283
> [ 1014.446640] [<ffffffff8164d0bd>] __bad_area_nosemaphore+0x1c4/0x1e3
> [ 1014.474615] [<ffffffff8164d0ef>] bad_area_nosemaphore+0x13/0x15
> [ 1014.505249] [<ffffffff8165e540>] do_page_fault+0x3c0/0x4b0
> [ 1014.531368] [<ffffffff8106f708>] ? __kernel_text_address+0x58/0x80
> [ 1014.560247] [<ffffffff810057cf>] ? print_context_stack+0x8f/0xf0
> [ 1014.589070] [<ffffffff810860f1>] ? get_parent_ip+0x11/0x50
> [ 1014.614948] [<ffffffff8165b613>] ? error_sti+0x5/0x6
> [ 1014.641105] [<ffffffff810b0b38>] ? trace_hardirqs_off_caller+0x28/0xc0
> [ 1014.671882] [<ffffffff81005751>] ? print_context_stack+0x11/0xf0
> [ 1014.700148] [<ffffffff8131ac5d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
> [ 1014.730928] [<ffffffff81005751>] ? print_context_stack+0x11/0xf0
> [ 1014.759028] [<ffffffff8165b42f>] page_fault+0x1f/0x30
> [ 1014.784924] [<ffffffff81005751>] ? print_context_stack+0x11/0xf0
> [ 1014.812910] [<ffffffff81104020>] ? __add_hash_entry+0x90/0x90
> [ 1014.840108] [<ffffffff8131ef40>] ? bsearch+0x60/0x90
> [ 1014.864867] [<ffffffff81104020>] ? __add_hash_entry+0x90/0x90
> [ 1014.891647] [<ffffffff8165e919>] ? sub_preempt_count+0x9/0xd0
> [ 1014.918212] [<ffffffff8165b466>] ? paranoid_exit+0x6/0x1b
> [ 1014.942506] [<ffffffff810b0b38>] ? trace_hardirqs_off_caller+0x28/0xc0
> [ 1014.971053] [<ffffffff8131ac5d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
> [ 1014.999788] [<ffffffff81104020>] ? __add_hash_entry+0x90/0x90
> [ 1015.026354] [<ffffffff8165e919>] ? sub_preempt_count+0x9/0xd0
> [ 1015.052818] [<ffffffff8165b466>] ? paranoid_exit+0x6/0x1b
Hmm, I wonder if this is the break point change :-/
Do you have NMIs enabled? I didn't have 'paranoid_exit' in my backtrace
when I saw the bug.
I saw this but after doing a cp .config /tmp; make mrproper;
cp /tmp/.config .; make
The problem went away.
I'll start running a bunch of high stress tests using NMIs again (on
this box) and see if I can force this to happen again :-/
-- Steve
> [ 1015.077099] [<ffffffff81005751>] ? print_context_stack+0x11/0xf0
> [ 1015.103546] [<ffffffff8165e91e>] ? sub_preempt_count+0xe/0xd0
> [ 1015.131390] <<EOE>>
> [ 1015.145014] BUG: unable to handle kernel paging request at fffffffffffffff8
> [ 1015.169954] IP: [<ffffffff81072ee1>] kthread_data+0x11/0x20
> [ 1015.190713] PGD 1c0d067 PUD 1c0e067 PMD 0
> [ 1015.207586] Oops: 0000 [#2] PREEMPT SMP
> [ 1015.224452] CPU 2
> [ 1015.236313] Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables crc32c_intel ghash_clmulni_intel microcode usb_debug serio_raw pcspkr iTCO_wdt i2c_i801 iTCO_vendor_support e1000e nfsd nfs_acl auth_rpcgss lockd sunrpc i915 video i2c_algo_bit drm_kms_helper drm i2c_core [last unloaded: scsi_wait_scan]
> [ 1015.339235]
On Thu, May 24, 2012 at 01:35:14PM -0400, Steven Rostedt wrote:
> Hmm, I wonder if this is the break point change :-/
>
> Do you have NMIs enabled? I didn't have 'paranoid_exit' in my backtrace
> when I saw the bug.
$ grep NMI .config
CONFIG_OPROFILE_NMI_TIMER=y
CONFIG_HAVE_PERF_EVENTS_NMI=y
CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG=y
CONFIG_HPWDT_NMI_DECODING=y
# CONFIG_DEBUG_NMI_SELFTEST is not set
Dave
On Thu, 2012-05-24 at 14:47 -0400, Dave Jones wrote:
> On Thu, May 24, 2012 at 01:35:14PM -0400, Steven Rostedt wrote:
>
> > Hmm, I wonder if this is the break point change :-/
> >
> > Do you have NMIs enabled? I didn't have 'paranoid_exit' in my backtrace
> > when I saw the bug.
>
> $ grep NMI .config
> CONFIG_OPROFILE_NMI_TIMER=y
> CONFIG_HAVE_PERF_EVENTS_NMI=y
> CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG=y
> CONFIG_HPWDT_NMI_DECODING=y
> # CONFIG_DEBUG_NMI_SELFTEST is not set
Could you email your entire config. I haven't been able to trigger the
bug again. Maybe your config might do it :-/
Also, how reproducible is it? When it triggered for me, it was constant.
Every boot and test failed. But after I did the make mrproper, I have
not been able to trigger it again. This is why I put it down as a bad
build.
-- Steve
On Thu, May 24, 2012 at 02:54:32PM -0400, Steven Rostedt wrote:
> On Thu, 2012-05-24 at 14:47 -0400, Dave Jones wrote:
> > On Thu, May 24, 2012 at 01:35:14PM -0400, Steven Rostedt wrote:
> >
> > > Hmm, I wonder if this is the break point change :-/
> > >
> > > Do you have NMIs enabled? I didn't have 'paranoid_exit' in my backtrace
> > > when I saw the bug.
> >
> > $ grep NMI .config
> > CONFIG_OPROFILE_NMI_TIMER=y
> > CONFIG_HAVE_PERF_EVENTS_NMI=y
> > CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG=y
> > CONFIG_HPWDT_NMI_DECODING=y
> > # CONFIG_DEBUG_NMI_SELFTEST is not set
>
> Could you email your entire config. I haven't been able to trigger the
> bug again. Maybe your config might do it :-/
http://fpaste.org/BTN7/raw/
> Also, how reproducible is it? When it triggered for me, it was constant.
> Every boot and test failed. But after I did the make mrproper, I have
> not been able to trigger it again. This is why I put it down as a bad
> build.
happens every time for me so far.
Dave
On Thu, 2012-05-24 at 15:11 -0400, Dave Jones wrote:
> > Also, how reproducible is it? When it triggered for me, it was constant.
> > Every boot and test failed. But after I did the make mrproper, I have
> > not been able to trigger it again. This is why I put it down as a bad
> > build.
>
> happens every time for me so far.
Did you also do a make mrproper and try again?
-- Steve
On Thu, May 24, 2012 at 03:24:55PM -0400, Steven Rostedt wrote:
> On Thu, 2012-05-24 at 15:11 -0400, Dave Jones wrote:
>
> > > Also, how reproducible is it? When it triggered for me, it was constant.
> > > Every boot and test failed. But after I did the make mrproper, I have
> > > not been able to trigger it again. This is why I put it down as a bad
> > > build.
> >
> > happens every time for me so far.
>
> Did you also do a make mrproper and try again?
ok, that's nasty. Works fine after a make clean.
Dave
On Thu, 2012-05-24 at 16:05 -0400, Dave Jones wrote:
> On Thu, May 24, 2012 at 03:24:55PM -0400, Steven Rostedt wrote:
> > On Thu, 2012-05-24 at 15:11 -0400, Dave Jones wrote:
> >
> > > > Also, how reproducible is it? When it triggered for me, it was constant.
> > > > Every boot and test failed. But after I did the make mrproper, I have
> > > > not been able to trigger it again. This is why I put it down as a bad
> > > > build.
> > >
> > > happens every time for me so far.
> >
> > Did you also do a make mrproper and try again?
>
> ok, that's nasty. Works fine after a make clean.
>
Thanks for verifying. I'm thinking this is a build bug. Something's not
cleaning up properly. It may be with the recordmcount code. Perhaps
objects needed to change where the mcount calls are and did not?
I'll try other things to see if I can trigger this again. Then I'll save
off the build tree, do a make clean, rebuild, and compare what's
different.
-- Steve
On Thu, 2012-05-24 at 16:18 -0400, Steven Rostedt wrote:
> On Thu, 2012-05-24 at 16:05 -0400, Dave Jones wrote:
> > On Thu, May 24, 2012 at 03:24:55PM -0400, Steven Rostedt wrote:
> > > On Thu, 2012-05-24 at 15:11 -0400, Dave Jones wrote:
> > >
> > > > > Also, how reproducible is it? When it triggered for me, it was constant.
> > > > > Every boot and test failed. But after I did the make mrproper, I have
> > > > > not been able to trigger it again. This is why I put it down as a bad
> > > > > build.
> > > >
> > > > happens every time for me so far.
> > >
> > > Did you also do a make mrproper and try again?
> >
> > ok, that's nasty. Works fine after a make clean.
> >
>
> Thanks for verifying. I'm thinking this is a build bug. Something's not
> cleaning up properly. It may be with the recordmcount code. Perhaps
> objects needed to change where the mcount calls are and did not?
>
> I'll try other things to see if I can trigger this again. Then I'll save
> off the build tree, do a make clean, rebuild, and compare what's
> different.
>
I'm kicking off the ktest that initially caused this issue hoping that
it reproduces the bad tree again.
BTW, did you happen to upgrade gcc or anything? The ktest I run does
test against different gcc's and I don't think I had it do a make clean
between the switch. That may be the cause of this problem too :-?
-- Steve
On Thu, May 24, 2012 at 04:33:06PM -0400, Steven Rostedt wrote:
> I'm kicking off the ktest that initially caused this issue hoping that
> it reproduces the bad tree again.
>
> BTW, did you happen to upgrade gcc or anything? The ktest I run does
> test against different gcc's and I don't think I had it do a make clean
> between the switch. That may be the cause of this problem too :-?
No gcc updates. It's a Fedora 17 box, so there's been hardly any updates this
last week or so as things have been frozen leading into the release.
Dave
On Thu, 2012-05-24 at 17:15 -0400, Dave Jones wrote:
> On Thu, May 24, 2012 at 04:33:06PM -0400, Steven Rostedt wrote:
>
> > I'm kicking off the ktest that initially caused this issue hoping that
> > it reproduces the bad tree again.
> >
> > BTW, did you happen to upgrade gcc or anything? The ktest I run does
> > test against different gcc's and I don't think I had it do a make clean
> > between the switch. That may be the cause of this problem too :-?
>
> No gcc updates. It's a Fedora 17 box, so there's been hardly any updates this
> last week or so as things have been frozen leading into the release.
>
I'm currently running a ktest that is similar to what I think I ran when
the bug hit (I didn't save off the config file :-( )
But it looks like it did do a mrproper between switching gcc's so I no
longer think that was the issue.
Thanks,
-- Steve
On Thu, 2012-05-24 at 16:05 -0400, Dave Jones wrote:
> On Thu, May 24, 2012 at 03:24:55PM -0400, Steven Rostedt wrote:
> > On Thu, 2012-05-24 at 15:11 -0400, Dave Jones wrote:
> >
> > > > Also, how reproducible is it? When it triggered for me, it was constant.
> > > > Every boot and test failed. But after I did the make mrproper, I have
> > > > not been able to trigger it again. This is why I put it down as a bad
> > > > build.
> > >
> > > happens every time for me so far.
> >
> > Did you also do a make mrproper and try again?
>
> ok, that's nasty. Works fine after a make clean.
>
Hmm, I did two make mrproper builds and they both failed. So this isn't
a build issue. Must just be some nasty race. I'm still leaning that this
has to do with the breakpoint function tracing updates.
Looking deeper into it.
-- Steve
On Thu, May 24, 2012 at 06:49:41PM -0400, Steven Rostedt wrote:
> On Thu, 2012-05-24 at 16:05 -0400, Dave Jones wrote:
> > On Thu, May 24, 2012 at 03:24:55PM -0400, Steven Rostedt wrote:
> > > On Thu, 2012-05-24 at 15:11 -0400, Dave Jones wrote:
> > >
> > > > > Also, how reproducible is it? When it triggered for me, it was constant.
> > > > > Every boot and test failed. But after I did the make mrproper, I have
> > > > > not been able to trigger it again. This is why I put it down as a bad
> > > > > build.
> > > >
> > > > happens every time for me so far.
> > >
> > > Did you also do a make mrproper and try again?
> >
> > ok, that's nasty. Works fine after a make clean.
> >
>
> Hmm, I did two make mrproper builds and they both failed. So this isn't
> a build issue. Must just be some nasty race. I'm still leaning that this
> has to do with the breakpoint function tracing updates.
>
> Looking deeper into it.
>
> -- Steve
Possibly unrelated, but I found that it only worked once per boot. Is that expected ?
echo irqsoff > current_tracer
echo latency-format > trace_options
echo 0 > tracing_max_latency
echo 1 > tracing_on
blah
echo 0 > tracing_on
cat trace
showed me a trace as I expected
repeating the same commands then got me an empty trace.
what gives ?
Dave
On Thu, 2012-05-24 at 13:22 -0400, Dave Jones wrote:
I found a clue!
> [ 1013.243754] BUG: unable to handle kernel NULL pointer dereference at 0000000000000002
> [ 1013.272665] IP: [<ffff880145cc0000>] 0xffff880145cbffff
> [ 1013.285186] PGD 1401b2067 PUD 14324c067 PMD 0
> [ 1013.298832] Oops: 0010 [#1] PREEMPT SMP
> [ 1013.310600] CPU 2
> [ 1013.317904] Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables crc32c_intel ghash_clmulni_intel microcode usb_debug serio_raw pcspkr iTCO_wdt i2c_i801 iTCO_vendor_support e1000e nfsd nfs_acl auth_rpcgss lockd sunrpc i915 video i2c_algo_bit drm_kms_helper drm i2c_core [last unloaded: scsi_wait_scan]
> [ 1013.401848]
> [ 1013.407399] Pid: 112, comm: kworker/2:1 Not tainted 3.4.0+ #30
> [ 1013.437943] RIP: 8eb8:[<ffff88014630a000>] [<ffff88014630a000>] 0xffff880146309fff
RIP is always near the GS segment. As GS points to the per_cpu area, we
may somehow be getting our GS screwed up. I'm not sure why that would
affect the RIP. Maybe stacks are not being processed properly somewhere?
It's strange because I can either trigger it on the first try, or it
never triggers at all??
> [ 1013.459871] RSP: ffffffff8165e919:ffff88014780f408 EFLAGS: 00010046
> [ 1013.477909] RAX: 0000000000000001 RBX: ffffffff81104020 RCX: 0000000000000000
> [ 1013.499458] RDX: ffff880148008ea8 RSI: ffffffff8131ef40 RDI: ffffffff82203b20
> [ 1013.521612] RBP: ffffffff81005751 R08: 0000000000000000 R09: 0000000000000000
> [ 1013.543121] R10: ffffffff82cdc318 R11: 0000000000000000 R12: ffff880145cc0000
> [ 1013.564614] R13: ffff880148008eb8 R14: 0000000000000002 R15: ffff88014780cb40
> [ 1013.586108] FS: 0000000000000000(0000) GS:ffff880148000000(0000) knlGS:0000000000000000
Your GS is ffff880148000000 and the RIP was: <ffff880145cc0000>
Here's some of the crashes I've had:
RIP: ffff88007dc042a0 GS:ffff88007e200000
RIP: ffff88007de08ea8 GS:ffff88007de00000
Interesting, the above is one of the few that the RIP is greater than GS.
RIP: ffff88007dc042a0 GS:ffff88007e280000
RIP: ffff88007dc042a0 GS:ffff88007e280000
The above are the same but two different runs, actually, a third run had
the same as well.
Dave, I'm looking into your other issue (irqsoff), that looks totally
unrelated.
Andi and Peter, to bring you up to speed, we are getting this very
strange crash when enabling function tracing. At first it looked like a
kernel compiler issue as after doing a make clean the problem went away.
But I've now done two make cleans and have had the build still produce
the error. But it either errors on the first instantiation of the
function tracer or it never errors on it. Well, that's not totally true.
I do have the ftrace start up tests enabled so the function tracer is
enabled on boot up for testing. But that never seems to error. Only when
the system is live does it produce this error.
Now this code now has the new "breakpoint" logic. That is, instead of
running stop machine, it adds a break point to the first byte of the
nop, does a IPI to all CPUS (to sync), modify the rest of the nop, send
IPI, and then remove the break point.
But on some boxes, this crashes. But the symptom is always the same. The
RIP is that of something similar to the GS. Perhaps we have a stack
mismatch? But this still doesn't make sense, as the GS is never saved on
the stack.
Any ideas?
-- Steve
> [ 1013.609458] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 1013.627420] CR2: 0000000000000002 CR3: 0000000141f10000 CR4: 00000000001407e0
> [ 1013.649051] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1013.670724] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 1013.692376] Process kworker/2:1 (pid: 112, threadinfo ffff88013fe0e000, task ffff88014020a6a0)
> [ 1013.717028] Stack:
> [ 1013.724131] ffff88014780f570 ffff880145cc0000 0000400000004000 0000000000000000
> [ 1013.745918] cccccccccccccccc ffff88014780cca8 ffffffff811072bb ffffffff81651627
> [ 1013.767870] ffffffff8118f8a7 ffffffff811072bb ffffffff81f2b6c5 ffffffff81f11bdb
> [ 1013.790021] Call Trace:
> [ 1013.800701] Code: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a <e7> d7 64 81 ff ff ff ff 01 00 00 00 00 00 00 00 65 d9 64 81 ff
> [ 1013.861443] RIP [<ffff88014630a000>] 0xffff880146309fff
> [ 1013.884466] RSP <ffff88014780f408>
> [ 1013.901507] CR2: 0000000000000002
> [ 1013.918628] ---[ end trace d0e6e5741aaf56ac ]---
> [ 1013.940297] BUG: sleeping function called from invalid context at kernel/rwsem.c:20
> [ 1013.972340] in_atomic(): 0, irqs_disabled(): 1, pid: 112, name: kworker/2:1
> [ 1014.001470] INFO: lockdep is turned off.
> [ 1014.021285] irq event stamp: 146784
> [ 1014.039375] hardirqs last enabled at (146783): [<ffffffff8165b4f4>] paranoid_restore+0x10/0x66
> [ 1014.074744] hardirqs last disabled at (146784): [<ffffffff8165b380>] int3+0x10/0x40
> [ 1014.106164] softirqs last enabled at (146418): [<ffffffff81052054>] __do_softirq+0x144/0x3d0
> [ 1014.141604] softirqs last disabled at (146407): [<ffffffff81663e4c>] call_softirq+0x1c/0x30
> [ 1014.174829] Pid: 112, comm: kworker/2:1 Tainted: G D 3.4.0+ #30
> [ 1014.204753] Call Trace:
> [ 1014.220719] <#DB> [<ffffffff810b3eb0>] ? print_irqtrace_events+0xd0/0xe0
> [ 1014.251033] [<ffffffff8107f5c0>] __might_sleep+0x140/0x1f0
> [ 1014.276482] [<ffffffff81657fc6>] down_read+0x26/0x93
> [ 1014.301499] [<ffffffff810624e4>] exit_signals+0x24/0x130
> [ 1014.326264] [<ffffffff8104e95c>] do_exit+0xbc/0xb70
> [ 1014.348709] [<ffffffff8104b8b3>] ? kmsg_dump+0x83/0x2b0
> [ 1014.373726] [<ffffffff8164d5e3>] ? printk+0x61/0x63
> [ 1014.396878] [<ffffffff8165c007>] oops_end+0x97/0xe0
> [ 1014.420306] [<ffffffff8164cece>] no_context+0x258/0x283
> [ 1014.446640] [<ffffffff8164d0bd>] __bad_area_nosemaphore+0x1c4/0x1e3
> [ 1014.474615] [<ffffffff8164d0ef>] bad_area_nosemaphore+0x13/0x15
> [ 1014.505249] [<ffffffff8165e540>] do_page_fault+0x3c0/0x4b0
> [ 1014.531368] [<ffffffff8106f708>] ? __kernel_text_address+0x58/0x80
> [ 1014.560247] [<ffffffff810057cf>] ? print_context_stack+0x8f/0xf0
> [ 1014.589070] [<ffffffff810860f1>] ? get_parent_ip+0x11/0x50
> [ 1014.614948] [<ffffffff8165b613>] ? error_sti+0x5/0x6
> [ 1014.641105] [<ffffffff810b0b38>] ? trace_hardirqs_off_caller+0x28/0xc0
> [ 1014.671882] [<ffffffff81005751>] ? print_context_stack+0x11/0xf0
> [ 1014.700148] [<ffffffff8131ac5d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
> [ 1014.730928] [<ffffffff81005751>] ? print_context_stack+0x11/0xf0
> [ 1014.759028] [<ffffffff8165b42f>] page_fault+0x1f/0x30
> [ 1014.784924] [<ffffffff81005751>] ? print_context_stack+0x11/0xf0
> [ 1014.812910] [<ffffffff81104020>] ? __add_hash_entry+0x90/0x90
> [ 1014.840108] [<ffffffff8131ef40>] ? bsearch+0x60/0x90
> [ 1014.864867] [<ffffffff81104020>] ? __add_hash_entry+0x90/0x90
> [ 1014.891647] [<ffffffff8165e919>] ? sub_preempt_count+0x9/0xd0
> [ 1014.918212] [<ffffffff8165b466>] ? paranoid_exit+0x6/0x1b
> [ 1014.942506] [<ffffffff810b0b38>] ? trace_hardirqs_off_caller+0x28/0xc0
> [ 1014.971053] [<ffffffff8131ac5d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
> [ 1014.999788] [<ffffffff81104020>] ? __add_hash_entry+0x90/0x90
> [ 1015.026354] [<ffffffff8165e919>] ? sub_preempt_count+0x9/0xd0
> [ 1015.052818] [<ffffffff8165b466>] ? paranoid_exit+0x6/0x1b
> [ 1015.077099] [<ffffffff81005751>] ? print_context_stack+0x11/0xf0
On 05/24/2012 04:40 PM, Steven Rostedt wrote:
> On Thu, 2012-05-24 at 13:22 -0400, Dave Jones wrote:
>
> I found a clue!
>
>
>> [ 1013.243754] BUG: unable to handle kernel NULL pointer dereference at 0000000000000002
>> [ 1013.272665] IP: [<ffff880145cc0000>] 0xffff880145cbffff
>> [ 1013.285186] PGD 1401b2067 PUD 14324c067 PMD 0
>> [ 1013.298832] Oops: 0010 [#1] PREEMPT SMP
>> [ 1013.310600] CPU 2
>> [ 1013.317904] Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables crc32c_intel ghash_clmulni_intel microcode usb_debug serio_raw pcspkr iTCO_wdt i2c_i801 iTCO_vendor_support e1000e nfsd nfs_acl auth_rpcgss lockd sunrpc i915 video i2c_algo_bit drm_kms_helper drm i2c_core [last unloaded: scsi_wait_scan]
>> [ 1013.401848]
>> [ 1013.407399] Pid: 112, comm: kworker/2:1 Not tainted 3.4.0+ #30
>> [ 1013.437943] RIP: 8eb8:[<ffff88014630a000>] [<ffff88014630a000>] 0xffff880146309fff
>
> RIP is always near the GS segment. As GS points to the per_cpu area, we
> may somehow be getting our GS screwed up. I'm not sure why that would
> affect the RIP. Maybe stacks are not being processed properly somewhere?
>
> It's strange because I can either trigger it on the first try, or it
> never triggers at all??
>
Much more fundamentally, RIP should never leave the range [-2G, 0).
What is happening here is almost certainly that we jump through
something which isn't a function pointer.
The other thing worth noting is that the code segment is not the
standard Linux code segment, not even close; it *also* doesn't look like
the typical Xen code segment. This makes be believe that we did an IRET
with the stack pointer set to something other than a valid interrupt
stack frame. Specifically, note that the value of R12 is the same
value; R12 is a preserved register and may have been pushed onto the
stack by something that wants to save it.
Waitaminute... this isn't related to your using R12 to save a pointer to
something, is it?
> [ 1013.459871] RSP: ffffffff8165e919:ffff88014780f408 EFLAGS: 00010046
-hpa
On Thu, May 24, 2012 at 07:40:16PM -0400, Steven Rostedt wrote:
> On Thu, 2012-05-24 at 13:22 -0400, Dave Jones wrote:
>
> I found a clue!
>
>
> > [ 1013.243754] BUG: unable to handle kernel NULL pointer dereference at 0000000000000002
> > [ 1013.272665] IP: [<ffff880145cc0000>] 0xffff880145cbffff
> > [ 1013.285186] PGD 1401b2067 PUD 14324c067 PMD 0
> > [ 1013.298832] Oops: 0010 [#1] PREEMPT SMP
> > [ 1013.310600] CPU 2
> > [ 1013.317904] Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables crc32c_intel ghash_clmulni_intel microcode usb_debug serio_raw pcspkr iTCO_wdt i2c_i801 iTCO_vendor_support e1000e nfsd nfs_acl auth_rpcgss lockd sunrpc i915 video i2c_algo_bit drm_kms_helper drm i2c_core [last unloaded: scsi_wait_scan]
> > [ 1013.401848]
> > [ 1013.407399] Pid: 112, comm: kworker/2:1 Not tainted 3.4.0+ #30
> > [ 1013.437943] RIP: 8eb8:[<ffff88014630a000>] [<ffff88014630a000>] 0xffff880146309fff
>
> RIP is always near the GS segment. As GS points to the per_cpu area, we
> may somehow be getting our GS screwed up. I'm not sure why that would
> affect the RIP. Maybe stacks are not being processed properly somewhere?
>
> It's strange because I can either trigger it on the first try, or it
> never triggers at all??
I think this could happen if you get your SWAPGS state screwed up
(so you do a mismatched swapgs) In the early days of the port I fought a
lot with this.
One easy way to debug it is to read the GS msr early and double
check it's as expected.
-Andi
On Thu, 2012-05-24 at 17:14 -0700, Andi Kleen wrote:
> On Thu, May 24, 2012 at 07:40:16PM -0400, Steven Rostedt wrote:
> > On Thu, 2012-05-24 at 13:22 -0400, Dave Jones wrote:
> >
> > I found a clue!
> >
> >
> > > [ 1013.243754] BUG: unable to handle kernel NULL pointer dereference at 0000000000000002
> > > [ 1013.272665] IP: [<ffff880145cc0000>] 0xffff880145cbffff
> > > [ 1013.285186] PGD 1401b2067 PUD 14324c067 PMD 0
> > > [ 1013.298832] Oops: 0010 [#1] PREEMPT SMP
> > > [ 1013.310600] CPU 2
> > > [ 1013.317904] Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables crc32c_intel ghash_clmulni_intel microcode usb_debug serio_raw pcspkr iTCO_wdt i2c_i801 iTCO_vendor_support e1000e nfsd nfs_acl auth_rpcgss lockd sunrpc i915 video i2c_algo_bit drm_kms_helper drm i2c_core [last unloaded: scsi_wait_scan]
> > > [ 1013.401848]
> > > [ 1013.407399] Pid: 112, comm: kworker/2:1 Not tainted 3.4.0+ #30
> > > [ 1013.437943] RIP: 8eb8:[<ffff88014630a000>] [<ffff88014630a000>] 0xffff880146309fff
> >
> > RIP is always near the GS segment. As GS points to the per_cpu area, we
> > may somehow be getting our GS screwed up. I'm not sure why that would
> > affect the RIP. Maybe stacks are not being processed properly somewhere?
> >
> > It's strange because I can either trigger it on the first try, or it
> > never triggers at all??
>
> I think this could happen if you get your SWAPGS state screwed up
> (so you do a mismatched swapgs) In the early days of the port I fought a
> lot with this.
>
> One easy way to debug it is to read the GS msr early and double
> check it's as expected.
It's the RIP that's screwed up, not the GS. Thus, I don't think SWAPGS
is the issue. Seems to be something else that is screwing up RIP.
-- Steve
On Thu, 2012-05-24 at 16:53 -0700, H. Peter Anvin wrote:
> Waitaminute... this isn't related to your using R12 to save a pointer to
> something, is it?
We talked about using r12, but that patch hasn't made it out into the
wild yet. I haven't even pushed it into git yet. It still lives on a
quilt queue. Unless someone else abused r12, it wasn't from what I was
going to use it for.
-- Steve
>
> > [ 1013.459871] RSP: ffffffff8165e919:ffff88014780f408 EFLAGS: 00010046
>
> -hpa
On Thu, 2012-05-24 at 16:53 -0700, H. Peter Anvin wrote:
> Much more fundamentally, RIP should never leave the range [-2G, 0).
> What is happening here is almost certainly that we jump through
> something which isn't a function pointer.
>
> The other thing worth noting is that the code segment is not the
> standard Linux code segment, not even close; it *also* doesn't look like
> the typical Xen code segment. This makes be believe that we did an IRET
> with the stack pointer set to something other than a valid interrupt
> stack frame.
I was thinking the same. But not from an NMI. Seems it may be a
breakpoint IRET that is the issue. Could also be a nesting issue with
the stack, as breakpoints have a single stack as well.
I may modify the code a little to see if I can trigger it on a normal
config. Right now, I can only trigger it with an allmodconfig. That may
also be the issue. It may have added some debugging that causes
something to be traced (and breakpoint added) that isn't normally
traced.
> Specifically, note that the value of R12 is the same
> value; R12 is a preserved register and may have been pushed onto the
> stack by something that wants to save it.
I don't see R12 as the same value:
RIP: <ffff88014630a000>
GS: ffff880148000000
R12: ffff880145cc0000
Close but not quite. Even R13 and R15 are close:
R13: ffff880148008eb8
R15: ffff88014780cb40
But this probably does show that the stack is screwed up and did a bad
iret.
-- Steve
On Thu, 2012-05-24 at 21:39 -0400, Steven Rostedt wrote:
> > Specifically, note that the value of R12 is the same
> > value; R12 is a preserved register and may have been pushed onto the
> > stack by something that wants to save it.
>
> I don't see R12 as the same value:
>
> RIP: <ffff88014630a000>
> GS: ffff880148000000
> R12: ffff880145cc0000
>
> Close but not quite. Even R13 and R15 are close:
>
> R13: ffff880148008eb8
> R15: ffff88014780cb40
>
> But this probably does show that the stack is screwed up and did a bad
> iret.
>
Here's the stack of the first time I saw this crash:
RIP: 92dc:[<ffff88007dc042a0>] [<ffff88007dc042a0>] 0xffff88007dc0429f
RSP: 0018:0000000000000576 EFLAGS: 00000575
RAX: 0000000000000001 RBX: ffffffff812b4e6e RCX: ffff88007e209e00
RDX: 0000000000005495 RSI: ffffffff826165b0 RDI: ffff88007e209eb8
RBP: ffff88007e209ed8 R08: 0000000000000000 R09: 0000000000000000
R10: ffff880078d6fa00 R11: ffff880078d6fa00 R12: ffff880078c3dfa0
R13: ffffffff810b92dc R14: 00000000000000e6 R15: 00000000000000e8
FS: 0000000000000000(0000) GS:ffff88007e200000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff88007dc042a0 CR3: 0000000069ec4000 CR4: 00000000000007f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
RCX, RDI, RBP, R10, R11, and R12 are all like GS and RIP.
-- Steve
Updated status (been up to 2am debugging this :-( )
I did find a missing smp_mb() that could definitely cause issues. But
that alone is not fixing the problem. But it did change it slightly.
Here's a new dump:
[ 45.605085] BUG: unable to handle kernel NULL pointer dereference at 0000000000000001
[ 45.605682] IP: [<ffffffff8220b0c0>] lock_classes+0x3d8c0/0x3dfe20
Not associated with GS anymore. But that is still a data segment.
[ 45.605682] PGD 609a3067 PUD 60a7c067 PMD 0
[ 45.605682] Oops: 0002 [#1] PREEMPT SMP
[ 45.605682] Dumping ftrace buffer:
[ 45.605682] (ftrace buffer empty)
[ 45.605682] CPU 1
[ 45.605682] Modules linked in: sunrpc nf_conntrack_ipv4
nf_defrag_ipv4 ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state
nf_conntrack ip6table_filter ip6_tables binfmt_misc uinput
snd_hda_codec_idt i915 drm_kms_helper snd_hda_intel snd_hda_codec
snd_hwdep snd_seq snd_seq_device snd_pcm drm snd_timer ata_generic snd
firewire_ohci ppdev soundcore i2c_algo_bit firewire_core pata_marvell
i2c_i801 i2c_core snd_page_alloc pata_acpi iTCO_wdt iTCO_vendor_support
microcode crc_itu_t video pcspkr e1000e parport_pc parport [last
unloaded: scsi_wait_scan]
[ 45.605682]
[ 45.605682] Pid: 30, comm: kworker/1:1 Not tainted 3.4.0-rc5-test+ #6 /DG965MQ
[ 45.605682] RIP: ae10:[<ffffffff8220b0c0>] [<ffffffff8220b0c0>] lock_classes+0x3d8c0/0x3dfe20
[ 45.605682] RSP: fffffb52:ffffffff816161a9 EFLAGS: ffffffff816161a9
[ 45.605682] RAX: ffff88007de08e98 RBX: 00000000000000b4 RCX: ffffffff812f2028
[ 45.605682] RDX: 0000000000002e90 RSI: ffffffff810f3270 RDI: ffff88007abd9b20
[ 45.605682] RBP: ffff88007de08e88 R08: 0000000000000000 R09: 0000000000000000
[ 45.605682] R10: 0000000000000001 R11: 0000000000000001 R12: ffff880077ab0988
[ 45.605682] R13: ffff88007d80ae10 R14: ffffffff816161a9 R15: ffffffff816161a9
[ 45.605682] FS: 0000000000000000(0000) GS:ffff88007de00000(0000) knlGS:0000000000000000
[ 45.605682] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 45.605682] CR2: 0000000000000001 CR3: 00000000762bd000 CR4: 00000000000007e0
[ 45.605682] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 45.605682] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 45.605682] Process kworker/1:1 (pid: 30, threadinfo ffff880078a22000, task ffff880077ab0000)
[ 45.605682] Stack:
[ 45.605682] 8b4c6590666666cc 89480000ba80252c 6348002efbfee8fb ba4fc7c748c689d0
Look at the stack! 8b4c6590666666cc That has: cc66666690, which is a
ftrace nop with a 'cc' breakpoint on it!
[ 45.605682] fc00d5148b48819b 98a38b4cc0318206 3ee8d18948000000 000b5fe4e800600b
[ 45.605682] df894800000001be 4d8b49ffffc877e8 000004a4958b4108 3100000658b58d49
[ 45.605682] Call Trace:
[ 45.605682] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 <00> 00 00 00 00 00 00 00 d0 51 7c 82 ff ff ff ff 00 00 00 00 00
[ 45.605682] RIP [<ffffffff8220b0c0>] lock_classes+0x3d8c0/0x3dfe20
[ 45.605682] RSP <ffffffff816161a9>
[ 45.605682] CR2: 0000000000000001
RSP <ffffffff816161a9> which leads to:
(gdb) li *0xffffffff816161a9
0xffffffff816161a9 is in sub_preempt_count (/work/autotest/nobackup/linux-test.git/kernel/sched/core.c:3070).
3065 trace_preempt_off(CALLER_ADDR0, get_parent_ip(CALLER_ADDR1));
3066 }
3067 EXPORT_SYMBOL(add_preempt_count);
3068
3069 void __kprobes sub_preempt_count(int val)
3070 {
3071 #ifdef CONFIG_DEBUG_PREEMPT
3072 /*
3073 * Underflow?
3074 */
Looks like we set RSP to code. Again pointing to a corrupted iretq.
Maybe we are having nested debug stack usage, where we are hitting a
breakpoint before setting the idt to not change the stack?
-- Steve
On Fri, 2012-05-25 at 10:31 -0400, Steven Rostedt wrote:
> Looks like we set RSP to code. Again pointing to a corrupted iretq.
> Maybe we are having nested debug stack usage, where we are hitting a
> breakpoint before setting the idt to not change the stack?
Another clue. If I do not trace the following functions:
func_ptr_is_kernel_text
kprobe_exceptions_notify
hw_breakpoint_exceptions_notify
notifier_call_chain*
it works fine.
# echo func_ptr_is_kernel_text kprobe_exceptions_notify \
hw_breakpoint_exceptions_notify notifier_call_chain* > set_ftrace_notrace
# echo function > current_tracer
works!
These notifiers are being called by the breakpoint. So perhaps the
breakpoint is still being called by int3 when it shouldn't be. It
shouldn't because we have:
dotraplinkage void __kprobes notrace do_int3(struct pt_regs *regs, long error_code)
{
#ifdef CONFIG_DYNAMIC_FTRACE
/* ftrace must be first, everything else may cause a recursive crash */
if (unlikely(modifying_ftrace_code) && ftrace_int3_handler(regs))
return;
#endif
The fix I added (but hasn't fixed it completely) was:
void arch_ftrace_update_code(int command)
{
modifying_ftrace_code++;
+ /*
+ * Make sure that all CPUs see this before we start
+ * adding breakpoints.
+ */
+ smp_mb();
ftrace_modify_all_code(command);
+ /* Finish all breakpoints before clearing */
+ smp_mb();
+
modifying_ftrace_code--;
}
This would make sense for this bug, as if modifying_ftrace_code was not
seen by other CPUs, it wouldn't go into the ftrace_int3_handler() path.
That would cause this issue. But the bug remains after the smp_mb()'s
were put in place. Although it behaves a little differently not. Maybe
there's something else I missed?
-- Steve
On 05/25/2012 08:29 AM, Steven Rostedt wrote:
>
> This would make sense for this bug, as if modifying_ftrace_code was not
> seen by other CPUs, it wouldn't go into the ftrace_int3_handler() path.
> That would cause this issue. But the bug remains after the smp_mb()'s
> were put in place. Although it behaves a little differently not. Maybe
> there's something else I missed?
>
Perhaps you should make the modifying_ftrace_code modification atomic...
it seems odd to have it not be atomic when it is clearly accessed across
CPUs that way.
-hpa
On Fri, 2012-05-25 at 10:40 -0700, H. Peter Anvin wrote:
> On 05/25/2012 08:29 AM, Steven Rostedt wrote:
> >
> > This would make sense for this bug, as if modifying_ftrace_code was not
> > seen by other CPUs, it wouldn't go into the ftrace_int3_handler() path.
> > That would cause this issue. But the bug remains after the smp_mb()'s
> > were put in place. Although it behaves a little differently not. Maybe
> > there's something else I missed?
> >
>
> Perhaps you should make the modifying_ftrace_code modification atomic...
> it seems odd to have it not be atomic when it is clearly accessed across
> CPUs that way.
I guess I can make it atomic. Not really a big deal as this (and soon
one other place) is the only place that changes its value.
I've found another place that may be causing harm, and I'm currently
working on fixing it. Hopefully after that's done, this problem will go
away.
Thanks,
-- Steve
On Fri, 2012-05-25 at 14:46 -0400, Steven Rostedt wrote:
> On Fri, 2012-05-25 at 10:40 -0700, H. Peter Anvin wrote:
> > On 05/25/2012 08:29 AM, Steven Rostedt wrote:
> > >
> > > This would make sense for this bug, as if modifying_ftrace_code was not
> > > seen by other CPUs, it wouldn't go into the ftrace_int3_handler() path.
> > > That would cause this issue. But the bug remains after the smp_mb()'s
> > > were put in place. Although it behaves a little differently not. Maybe
> > > there's something else I missed?
> > >
> >
> > Perhaps you should make the modifying_ftrace_code modification atomic...
> > it seems odd to have it not be atomic when it is clearly accessed across
> > CPUs that way.
>
> I guess I can make it atomic. Not really a big deal as this (and soon
> one other place) is the only place that changes its value.
>
> I've found another place that may be causing harm, and I'm currently
> working on fixing it. Hopefully after that's done, this problem will go
> away.
Grumble, still faults, and the notifier that I was tracking down
happened to be for DIE_OOPS :-p
/me continues his search.
-- Steve
On Fri, 2012-05-25 at 16:51 -0400, Steven Rostedt wrote:
> On Fri, 2012-05-25 at 14:46 -0400, Steven Rostedt wrote:
> > On Fri, 2012-05-25 at 10:40 -0700, H. Peter Anvin wrote:
> > > On 05/25/2012 08:29 AM, Steven Rostedt wrote:
> > > >
> > > > This would make sense for this bug, as if modifying_ftrace_code was not
> > > > seen by other CPUs, it wouldn't go into the ftrace_int3_handler() path.
> > > > That would cause this issue. But the bug remains after the smp_mb()'s
> > > > were put in place. Although it behaves a little differently not. Maybe
> > > > there's something else I missed?
> > > >
> > >
> > > Perhaps you should make the modifying_ftrace_code modification atomic...
> > > it seems odd to have it not be atomic when it is clearly accessed across
> > > CPUs that way.
> >
> > I guess I can make it atomic. Not really a big deal as this (and soon
> > one other place) is the only place that changes its value.
> >
> > I've found another place that may be causing harm, and I'm currently
> > working on fixing it. Hopefully after that's done, this problem will go
> > away.
>
OK, here's another clue.
I've added tons of debug, and filtering out functions to trace. I even
added a 64 entry buffer that records what spots are being hit by the
breakpoint (it always dies with just the breakpoints, it never makes it
to the update instruction part).
Basically it always crashes with this path:
[ 45.050159] [ffffffff8109c1a9 ktime_get+0x19/0xe0]
[ 45.050159] [ffffffff810a3921 update_ts_time_stats+0x11/0xa0]
[ 45.050159] [ffffffff8104ef29 ns_to_timeval+0x9/0x40]
[ 45.050159] [ffffffff8104eea9 ns_to_timespec+0x9/0x80]
[ 45.050159] [ffffffff8104ef29 ns_to_timeval+0x9/0x40]
[ 45.050159] [ffffffff8104eea9 ns_to_timespec+0x9/0x80]
[ 45.050159] [ffffffff814bda45 __cpufreq_driver_getavg+0x15/0x80]
[ 45.050159] [ffffffff814bd735 cpufreq_cpu_get+0x15/0xd0]
[ 45.050159] [ffffffff810b6aad try_module_get+0x1d/0x140]
[ 45.050159] [ffffffff814c387c cpufreq_get_measured_perf+0xc/0xa0]
[ 45.050159] [ffffffff810b2dbd smp_call_function_single+0x1d/0x1c0]
[ 45.050159] [ffffffff814c391a read_measured_perf_ctrs+0xa/0x70]
[ 45.050159] [ffffffff814bcfe5 cpufreq_cpu_put+0x5/0x30]
[ 45.050159] [ffffffff810b67ad module_put+0x1d/0x130]
[ 45.050159] [ffffffff814bcf55 __cpufreq_driver_target+0x15/0xa0]
[ 45.050159] [ffffffff814c3ef2 acpi_cpufreq_target+0x12/0x380]
[ 45.050159] [ffffffff814c21b2 cpufreq_frequency_table_target+0x12/0x1a0]
[ 45.050159] [ffffffff814beced cpufreq_notify_transition+0x1d/0x2c0]
[ 45.050159] [ffffffff81074365 srcu_notifier_call_chain+0x5/0x20]
[ 45.050159] [ffffffff810742bd __srcu_notifier_call_chain+0x1d/0xc0]
[ 45.050159] [ffffffff810736d8 __srcu_read_lock+0x8/0x70]
[ 45.050159] [ffffffff81616072 notifier_call_chain.isra.1+0x12/0xb0]
[ 45.050159] [ffffffff81010efa save_stack_trace+0xa/0x50]
I see it always hitting the cpufreq code just before the crash.
Actually, it never finishes the cpufreq code. Something about this code
causes issues with breakpoints.
/me continues the hunt.
-- Steve
On Fri, 2012-05-25 at 16:51 -0400, Steven Rostedt wrote:
>
> /me continues his search.
Found it!
The int3 handler uses paranoidzeroentry_ist:
.macro paranoidzeroentry_ist sym do_sym ist
ENTRY(\sym)
INTR_FRAME
PARAVIRT_ADJUST_EXCEPTION_FRAME
pushq_cfi $-1 /* ORIG_RAX: no syscall to restart */
subq $ORIG_RAX-R15, %rsp
CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
call save_paranoid
TRACE_IRQS_OFF
movq %rsp,%rdi /* pt_regs pointer */
xorl %esi,%esi /* no error code */
subq $EXCEPTION_STKSZ, INIT_TSS_IST(\ist)
call \do_sym
addq $EXCEPTION_STKSZ, INIT_TSS_IST(\ist)
jmp paranoid_exit /* %ebx: no swapgs flag */
CFI_ENDPROC
END(\sym)
Which calls paranoid_exit, which does:
paranoid_restore:
TRACE_IRQS_IRETQ 0
RESTORE_ALL 8
jmp irq_return
The problem is with TRACE_IRQS_IRETQ which happens to call into lockdep.
Now we are still using the debug stack here outside that little
subtraction trick of the INIT_TSS_IST.
.macro TRACE_IRQS_IRETQ offset=ARGOFFSET
#ifdef CONFIG_TRACE_IRQFLAGS
bt $9,EFLAGS-\offset(%rsp) /* interrupts off? */
jnc 1f
TRACE_IRQS_ON
1:
#endif
.endm
# define TRACE_IRQS_ON call trace_hardirqs_on_thunk;
THUNK trace_hardirqs_on_thunk,trace_hardirqs_on_caller,1
which eventually leads to:
trace_hardirqs_on_caller {
__trace_hardirqs_on_caller(ip) {
mark_locks_held() {
mark_lock() {
save_trace() {
save_stack_trace()...
Unfortunately, the save_stack_trace() is traced by the function tracer.
Which means that it will hit a breakpoint and jump into the breakpoint
code. But here it will reset the stack and corrupt the current stack,
causing strange hard-to-debug bugs.
There's no reason to function trace stack dumps, and this stops the bug
from triggering when I apply it.
Dave, can you give this a try too?
Thanks!
-- Steve
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 532d2e0..0026999 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -14,6 +14,10 @@ CFLAGS_REMOVE_paravirt-spinlocks.o = -pg
CFLAGS_REMOVE_pvclock.o = -pg
CFLAGS_REMOVE_kvmclock.o = -pg
CFLAGS_REMOVE_ftrace.o = -pg
+CFLAGS_REMOVE_dumpstack.o = -pg
+CFLAGS_REMOVE_dumpstack_32.o = -pg
+CFLAGS_REMOVE_dumpstack_64.o = -pg
+CFLAGS_REMOVE_stacktrace.o = -pg
CFLAGS_REMOVE_early_printk.o = -pg
endif
On Tue, 2012-05-29 at 07:37 -0400, Steven Rostedt wrote:
> trace_hardirqs_on_caller {
> __trace_hardirqs_on_caller(ip) {
> mark_locks_held() {
> mark_lock() {
> save_trace() {
> save_stack_trace()...
>
>
> Unfortunately, the save_stack_trace() is traced by the function tracer.
> Which means that it will hit a breakpoint and jump into the breakpoint
> code. But here it will reset the stack and corrupt the current stack,
> causing strange hard-to-debug bugs.
>
Bah, unfortunately this still isn't enough, as save_stack_trace() calls
a lot more that happens to be traced. I'm looking into seeing if I can
change lockdep not to blindly do records when enabling interrupts.
-- Steve
> There's no reason to function trace stack dumps, and this stops the bug
> from triggering when I apply it.
>
> Dave, can you give this a try too?
>
> Thanks!
>
> -- Steve
>
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index 532d2e0..0026999 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -14,6 +14,10 @@ CFLAGS_REMOVE_paravirt-spinlocks.o = -pg
> CFLAGS_REMOVE_pvclock.o = -pg
> CFLAGS_REMOVE_kvmclock.o = -pg
> CFLAGS_REMOVE_ftrace.o = -pg
> +CFLAGS_REMOVE_dumpstack.o = -pg
> +CFLAGS_REMOVE_dumpstack_32.o = -pg
> +CFLAGS_REMOVE_dumpstack_64.o = -pg
> +CFLAGS_REMOVE_stacktrace.o = -pg
> CFLAGS_REMOVE_early_printk.o = -pg
> endif
>
>