2011-05-30 16:10:22

by Witold Baryluk

[permalink] [raw]
Subject: Kernel OOPS in function_graph_tracer due to the 44259b1abfaa8bb819d25d41d71e8e33e25dd36a commit (x86-64: Move vread_tsc into a new file with sensible options)

Hi,

I found yesterday a problem when booting system on Pentium-M, 32-bit.

I got approximetly this

[ 0.000000] Fast TSC calibration using PIT
[ 0.000000] Detected 3199.363 MHz processor.
[ 0.020197] Calibrating delay loop (skipped), value calculated using timer frequency.. 6398.72 BogoMIPS (lpj=31993630)
[ 0.020763] pid_max: default: 32768 minimum: 301
[ 0.022972] Mount-cache hash table entries: 512
[ 0.040440] CPU: Intel Pentium III (Katmai) stepping 03
[ 0.042789] ftrace: allocating 6263 entries in 13 pages
[ 0.062128] Performance Events:
[ 0.062505] no APIC, boot with the "lapic" boot parameter to force-enable it.
[ 0.062818] no hardware sampling interrupt available.
[ 0.063038] Broken PMU hardware detected, using software events only.
[ 0.072671] Testing tracer nop: PASSED
[ 0.073477] NMI watchdog disabled (cpu0): hardware events not enabled
[ 0.077682] devtmpfs: initialized
[ 0.090118] kworker/u:0 used greatest stack depth: 6840 bytes left
[ 0.105247] atomic64 test passed for i586+ platform with CX8 and with SSE
[ 0.112515] kworker/u:0 used greatest stack depth: 6700 bytes left
[ 0.198527] bio: create slab <bio-0> at 0
[ 0.212671] Switching to clocksource pit
[ 0.882202] kworker/u:0 used greatest stack depth: 6636 bytes left
[ 1.070749] platform rtc_cmos: registered platform RTC device (no PNP device found)
[ 1.114000] Initializing RT-Tester: OK
[ 1.119256] Testing tracer function: PASSED
[ 1.268289] Testing dynamic ftrace: PASSED
[ 1.517992] Testing dynamic ftrace ops #1: (1 0 1 1 0) (1 1 2 1 0) (2 1 3 1 35342) (2 2 4 1 35634) PASSED
[ 1.591847] Testing dynamic ftrace ops #2: (1 0 1 35341 0) (1 1 2 35612 0) (2 1 3 1 8) (2 2 4 249 256) PASSED
[ 1.707799] Testing tracer irqsoff: PASSED
[ 1.733435] Testing tracer preemptoff: PASSED
[ 1.752663] Testing tracer preemptirqsoff: PASSED
[ 1.769764] Testing tracer wakeup:
[ 2.068581] Switching to clocksource tsc
[ 2.108718] PASSED
[ 2.108928] Testing tracer wakeup_rt: PASSED
[ 2.459170] Testing tracer function_graph:
[ 2.466979] BUG: unable to handle kernel paging request at e421cc10
[ 2.466979] IP: [<c10850bc>] ftrace_push_return_trace+0x6c/0xd0
[ 2.466979] *pdpt = 00000000013b4001 *pde = 0000000000000000
[ 2.466979] Thread overran stack, or stack corrupted
[ 2.466979] Oops: 0002 [#1] PREEMPT
[ 2.466979]
[ 2.466979] Pid: 1, comm: swapper Not tainted 2.6.39-t43+ #40 Bochs Bochs
[ 2.466979] EIP: 0060:[<c10850bc>] EFLAGS: 00000006 CPU: 0
[ 2.466979] EIP is at ftrace_push_return_trace+0x6c/0xd0
[ 2.466979] EAX: c7833480 EBX: c7832020 ECX: c1070aac EDX: 1c9e9790
[ 2.466979] ESI: 986b5466 EDI: 00000000 EBP: c78335ac ESP: c7833590
[ 2.466979] DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
[ 2.466979] Process swapper (pid: 1, ti=c7832000 task=c7832020 task.ti=c7834000)
[ 2.466979] Stack:
[ 2.466979] c105a9fc c78335c0 c1007d26 c1070aac c78335f4 c1007d26 c1070aac c78335d0
[ 2.466979] c10150a2 c7833600 c7833600 c1007d26 00000000 c7832020 c1007d26 c7910800
[ 2.466979] c78335f0 c117e212 c1007d26 c7833638 c7832000 c1007d2b 00000046 c13551c0
[ 2.466979] Call Trace:
[ 2.466979] [<c105a9fc>] ? trace_hardirqs_off+0xc/0x10
[ 2.466979] [<c1007d26>] ? native_sched_clock+0x6/0x70
[ 2.466979] [<c1070aac>] ? trace_clock_local+0x1c/0x50
[ 2.466979] [<c1007d26>] ? native_sched_clock+0x6/0x70
[ 2.466979] [<c1070aac>] ? trace_clock_local+0x1c/0x50
[ 2.466979] [<c10150a2>] prepare_ftrace_return+0x72/0xa0
[ 2.466979] [<c1007d26>] ? native_sched_clock+0x6/0x70
[ 2.466979] [<c1007d26>] ? native_sched_clock+0x6/0x70
[ 2.466979] [<c117e212>] ftrace_graph_caller+0x22/0x26
[ 2.466979] [<c1007d26>] ? native_sched_clock+0x6/0x70
[ 2.466979] [<c1007d2b>] ? native_sched_clock+0xb/0x70
[ 2.466979] [<c117e216>] ftrace_graph_caller+0x26/0x26
[ 2.466979] BUG: unable to handle kernel NULL pointer dereference at 0000032d
[ 2.466979] IP: [<c1005529>] print_context_stack+0xd9/0x110
[ 2.466979] *pdpt = 0000000000000000 *pde = f000ff53f000ff53
[ 2.466979] Thread overran stack, or stack corrupted
[ 2.466979] Oops: 0000 [#2] PREEMPT
[ 2.466979]
[ 2.466979] Pid: 1, comm: swapper Not tainted 2.6.39-t43+ #40 Bochs Bochs
[ 2.466979] EIP: 0060:[<c1005529>] EFLAGS: 00000002 CPU: 0
[ 2.466979] EIP is at print_context_stack+0xd9/0x110
[ 2.466979] EAX: 0000001d EBX: c78335f4 ECX: 00000001 EDX: 00000310
[ 2.466979] ESI: 00000000 EDI: c11806d4 EBP: c7833380 ESP: c783335c
[ 2.466979] DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
[ 2.466979] Process swapper (pid: 1, ti=c7832000 task=c7832020 task.ti=c7834000)
[ 2.466979] Stack:
[ 2.466979] c117e216 c7833ffc c7833600 ffffe000 c7832000 c7833600 c7833590 c11806d4
[ 2.466979] c7832000 c78333ac c10040dd c11806d4 c11d6c66 00000000 c783339c 00000018
[ 2.466979] 00000000 00000000 c11806d4 00000018 c78333d0 c10052ec c78335ac c11806d4
[ 2.466979] Call Trace:
[ 2.466979] [<c117e216>] ? ftrace_graph_caller+0x26/0x26
[ 2.466979] BUG: unable to handle kernel NULL pointer dereference at 0000032d
[ 2.466979] IP: [<c1005529>] print_context_stack+0xd9/0x110
[ 2.466979] *pdpt = 0000000000000000 *pde = f000ff53f000ff53
[ 2.466979] Thread overran stack, or stack corrupted
[ 2.466979] Oops: 0000 [#3] PREEMPT
[ 2.466979]
[ 2.466979] Pid: 1, comm: swapper Not tainted 2.6.39-t43+ #40 Bochs Bochs
[ 2.466979] EIP: 0060:[<c1005529>] EFLAGS: 00000002 CPU: 0
[ 2.466979] EIP is at print_context_stack+0xd9/0x110
[ 2.466979] EAX: 0000001d EBX: c783335c ECX: 00000001 EDX: 00000310
[ 2.466979] ESI: 00000000 EDI: c11806d4 EBP: c783314c ESP: c7833128
[ 2.466979] DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
[ 2.466979] Process swapper (pid: 1, ti=c7832000 task=c7832020 task.ti=c7834000)
[ 2.466979] Stack:
[ 2.466979] c117e216 c7833ffc c7833380 ffffe000 c7832000 c7833380 c783335c c11806d4
[ 2.466979] c7832000 c7833178 c10040dd c11806d4 c11d6c66 00000000 c7833168 00000018
[ 2.466979] 00000000 00000000 c11806d4 00000018 c783319c c10052ec c7833380 c11806d4
[ 2.466979] Call Trace:
[ 2.466979] [<c117e216>] ? ftrace_graph_caller+0x26/0x26
[ 2.466979] BUG: unable to handle kernel NULL pointer dereference at 0000032d
[ 2.466979] IP: [<c1005529>] print_context_stack+0xd9/0x110
[ 2.466979] *pdpt = 0000000000000000 *pde = f000ff53f000ff53
[ 2.466979] Thread overran stack, or stack corrupted
[ 2.466979] Oops: 0000 [#4] PREEMPT
[ 2.466979]
[ 2.466979] Pid: 1, comm: swapper Not tainted 2.6.39-t43+ #40 Bochs Bochs
[ 2.466979] EIP: 0060:[<c1005529>] EFLAGS: 00000002 CPU: 0
[ 2.466979] EIP is at print_context_stack+0xd9/0x110
[ 2.466979] EAX: 0000001d EBX: c7833128 ECX: 00000001 EDX: 00000310
[ 2.466979] ESI: 00000000 EDI: c11806d4 EBP: c7832f18 ESP: c7832ef4
[ 2.466979] DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
[ 2.466979] Process swapper (pid: 1, ti=c7832000 task=c7832020 task.ti=c7834000)
[ 2.466979] Stack:
[ 2.466979] c117e216 c7833ffc c783314c ffffe000 c7832000 c783314c c7833128 c11806d4
[ 2.466979] c7832000 c7832f44 c10040dd c11806d4 c11d6c66 00000000 c7832f34 00000018


full kernel log in attachment.

By bisection I got

commit 44259b1abfaa8bb819d25d41d71e8e33e25dd36a
Author: Andy Lutomirski <[email protected]>
Date: Mon May 23 09:31:28 2011 -0400

x86-64: Move vread_tsc into a new file with sensible options


I tested on i386, 32-bit architecture. Both gcc 4.4 and 4.6 result
in same problem. Also -O2 vs -O2 doesn't change anything.

Sometimes bug doesn't appear, or appear differently, for example

[ 1.114660] Initializing RT-Tester: OK
[ 1.118077] Testing tracer function: PASSED
[ 1.272665] Testing dynamic ftrace: PASSED
[ 1.524057] Testing dynamic ftrace ops #1: (1 0 1 1 0) (1 1 2 1 0) (2 1 3 1 44876) (2 2 4 1 45184) PASSED
[ 1.600066] Testing dynamic ftrace ops #2: (1 0 1 44944 0) (1 1 2 45228 0) (2 1 3 1 8) (2 2 4 265 272) PASSED
[ 1.717279] Testing tracer irqsoff: PASSED
[ 1.737368] Testing tracer preemptoff: PASSED
[ 1.759463] Testing tracer preemptirqsoff:
[ 1.773645] ------------[ cut here ]------------
[ 1.774220] WARNING: at kernel/lockdep.c:3351 check_flags+0x126/0x157()
[ 1.774723] Hardware name: Bochs
[ 1.775020] Pid: 1, comm: swapper Not tainted 2.6.39-t43+ #42
[ 1.775332] Call Trace:
[ 1.775560] [<c102213e>] warn_slowpath_common+0x77/0x8c
[ 1.775862] [<c104728d>] ? check_flags+0x126/0x157
[ 1.776138] [<c104728d>] ? check_flags+0x126/0x157
[ 1.776411] [<c1064f5a>] ? T.482+0x114/0x16e
[ 1.776664] [<c1022175>] warn_slowpath_null+0x22/0x24
[ 1.776950] [<c104728d>] check_flags+0x126/0x157
[ 1.777236] [<c104b0a6>] lock_acquire+0x34/0xe7
[ 1.777541] [<c1126672>] _raw_spin_lock_irqsave+0x4e/0x7e
[ 1.777839] [<c1064f5a>] ? T.482+0x114/0x16e
[ 1.778092] [<c1064f5a>] T.482+0x114/0x16e
[ 1.778341] [<c105d62c>] ? ring_buffer_unlock_commit+0x21/0x2e
[ 1.778654] [<c1026ff6>] ? __do_softirq+0x17e/0x191
[ 1.778931] [<c1026ff6>] ? __do_softirq+0x17e/0x191
[ 1.779208] [<c106502a>] stop_critical_timing+0x76/0x87
[ 1.779497] [<c1026d47>] ? __local_bh_enable+0x82/0x85
[ 1.779784] [<c106505a>] trace_preempt_on+0x1f/0x21
[ 1.780061] [<c101c17d>] sub_preempt_count+0x8b/0x98
[ 1.780341] [<c1026d47>] __local_bh_enable+0x82/0x85
[ 1.780621] [<c1026ff6>] __do_softirq+0x17e/0x191
[ 1.780892] [<c1026e78>] ? local_bh_enable+0x12/0x12
[ 1.781160] <IRQ> [<c1026c06>] ? irq_exit+0x44/0x66
[ 1.781536] [<c1003589>] ? do_IRQ+0x75/0x88
[ 1.781789] [<c1127b6e>] ? common_interrupt+0x2e/0x34
[ 1.782076] [<c106151c>] ? trace_selftest_startup_preemptirqsoff+0xae/0x1a9
[ 1.782435] [<c1061914>] ? register_tracer+0x12d/0x1fe
[ 1.782724] [<c1310344>] ? setup_command_line+0x73/0x73
[ 1.783066] [<c131c1b4>] ? init_irqsoff_tracer+0x21/0x25
[ 1.783390] [<c13101fa>] ? do_one_initcall+0x86/0x138
[ 1.783390] [<c131c193>] ? init_function_trace+0x35/0x35
[ 1.783390] [<c1310344>] ? setup_command_line+0x73/0x73
[ 1.783390] [<c13103cb>] ? kernel_init+0x87/0x11b
[ 1.783390] [<c1127b7a>] ? kernel_thread_helper+0x6/0x10
[ 1.783390] ---[ end trace 4eaa2a86a8e2da22 ]---
[ 1.783390] irq event stamp: 133534
[ 1.783390] hardirqs last enabled at (133532): [<c1126cbd>] _raw_spin_unlock_irq+0x27/0x4a
[ 1.783390] hardirqs last disabled at (133533): [<c1026fbf>] __do_softirq+0x147/0x191
[ 1.783390] softirqs last enabled at (133534): [<c1026ff6>] __do_softirq+0x17e/0x191
[ 1.783390] softirqs last disabled at (133529): [<c10037b7>] do_softirq+0x55/0xa6
[ 1.798767] PASSED
[ 1.798767] Testing tracer wakeup:
[ 2.063646] Switching to clocksource tsc
[ 2.133963] PASSED


On recent kernels it looks like

[ 1.159875] Switching to clocksource tsc
[ 1.439760] PASSED
[ 1.440212] Testing tracer function_graph:
[ 1.449049] BUG: unable to handle kernel paging request at e42d01bc
[ 1.449049] IP: [<c107586c>] ftrace_push_return_trace+0x6c/0xd0
[ 1.449049] *pdpt = 0000000001221001 *pde = 0000000000000000
[ 1.449049] Thread overran stack, or stack corrupted
[ 1.449049] Oops: 0002 [#1] PREEMPT
[ 1.449049]
[ 1.449049] Pid: -947754028, comm:
[ 1.449049] ������k������ Not tainted 3.0.0-rc1-t43 #45 Bochs Bochs
[ 1.449049] EIP: 0060:[<c107586c>] EFLAGS: 00000016 CPU: 0
[ 1.449049] EIP is at ftrace_push_return_trace+0x6c/0xd0
[ 1.449049] EAX: c7826e50 EBX: c7826ae0 ECX: c106178c EDX: 1caa936c
[ 1.449049] ESI: 5656002d EDI: 00000000 EBP: c7826f84 ESP: c7826f68
[ 1.449049] DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
[ 1.449049] Process � (pid: -947753988, ti=c7826000 task=c7826ae0 task.ti=c7828000)
[ 1.449049] Stack:
[ 1.449049] c106178d c7826f98 c1007a26 c106178c c7826fcc c1007a26 c106178c c7826fa8
[ 1.449049] c1014982 c7826fd8 c7826fd8 c1007a26 00000000 c7826ae0 c1007a26 c7941800
[ 1.449049] c7826fc8 c1162192 c1007a26 c7827010 c7826000 c1007a2b c783e95c 00000021
[ 1.449049] Call Trace:
[ 1.449049] [<c106178d>] ? trace_clock_local+0x1d/0x50
[ 1.449049] [<c1007a26>] ? native_sched_clock+0x6/0x70
[ 1.449049] [<c106178c>] ? trace_clock_local+0x1c/0x50
[ 1.449049] [<c1007a26>] ? native_sched_clock+0x6/0x70
[ 1.449049] [<c106178c>] ? trace_clock_local+0x1c/0x50
[ 1.449049] [<c1014982>] prepare_ftrace_return+0x72/0xa0
[ 1.449049] [<c1007a26>] ? native_sched_clock+0x6/0x70
[ 1.449049] [<c1007a26>] ? native_sched_clock+0x6/0x70
...

or

[ 0.126297] Testing tracer function: PASSED
[ 0.260366] Testing dynamic ftrace: PASSED
[ 0.498381] Testing dynamic ftrace ops #1: (1 0 1 1 0) (1 1 2 1 0) (2 1 3 1 6015) (2 2 4 1 6190) PASSED
[ 0.557775] Testing dynamic ftrace ops #2: (1 0 1 6069 0) (1 1 2 6234 0) (2 1 3 1 2) (2 2 4 152 153) PASSED
[ 0.637461] Testing tracer irqsoff: PASSED
[ 0.650073] Testing tracer wakeup: PASSED
[ 0.979100] Testing tracer wakeup_rt:
[ 1.118393] Switching to clocksource tsc
[ 1.308726] PASSED
[ 1.309162] Testing tracer function_graph:
[ 1.317873] Tracing recursion: depth[576]:HC[1023]:SC[255]:NMI[67108864]
[ 1.317873] ------------[ cut here ]------------
[ 1.317873] WARNING: at kernel/trace/ring_buffer.c:2224 trace_recursive_fail+0x30/0x90()
[ 1.317873] Hardware name: Bochs
[ 1.317873] Pid: 1, comm: Not tainted 3.0.0-rc1-t43 #46
[ 1.317873] Call Trace:
[ 1.317873] [<c1022fbd>] warn_slowpath_common+0x6d/0xa0
[ 1.317873] [<c105b3c0>] ? trace_recursive_fail+0x30/0x90
[ 1.317873] [<c105b3c0>] ? trace_recursive_fail+0x30/0x90
[ 1.317873] [<c1023010>] warn_slowpath_null+0x20/0x30
[ 1.317873] [<c105b3c0>] trace_recursive_fail+0x30/0x90
[ 1.317873] [<c105ce8c>] ring_buffer_lock_reserve+0x8c/0xc0
[ 1.317873] [<c106177a>] trace_buffer_lock_reserve+0x1a/0x50
[ 1.317873] [<c1069357>] __trace_graph_return+0x37/0x90
[ 1.317873] <IRQ>
[ 1.317873] BUG: unable to handle kernel NULL pointer dereference at 00000040
[ 1.317873] IP: [<c1004f95>] print_context_stack+0x55/0x110
[ 1.317873] *pde = 00000000
[ 1.317873] BUG: unable to handle kernel NULL pointer dereference at 0000004a
[ 1.317873] IP: [<c1015e0e>] no_context+0x6e/0x160
[ 1.317873] *pde = 00000000
[ 1.317873] Thread overran stack, or stack corrupted
[ 1.317873] Oops: 0000 [#1]
[ 1.317873]
[ 1.317873] Pid: -947741524, comm: Not tainted 3.0.0-rc1-t43 #46 Bochs Bochs
[ 1.317873] EIP: 0060:[<c1015e0e>] EFLAGS: 00000006 CPU: 0
[ 1.317873] EIP is at no_context+0x6e/0x160
[ 1.317873] EAX: 00000002 EBX: c7829e78 ECX: c1023e56 EDX: c7829bc0
[ 1.317873] ESI: 00000040 EDI: 00000000 EBP: c7829da8 ESP: c7829d88
[ 1.317873] DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
[ 1.317873] Process (pid: -947741524, ti=c7828000 task=c7829bc0 task.ti=c10019a0)
[ 1.317873] Stack:
[ 1.317873] c117be6d 00000040 00000046 c7829bc0 c7829de8 00000000 00000040 c7829bc0
[ 1.317873] c7829de8 c1015f98 c1143805 c7829dd8 c1013c3d 00000000 0000004a 0000000e
[ 1.317873] 00000000 c1408242 00000040 c140820e c7829e78 c1408242 00000040 c140820e
[ 1.317873] Call Trace:
[ 1.317873] [<c1015f98>] __bad_area_nosemaphore+0x98/0x140
[ 1.317873] [<c1143805>] ? error_code+0x5d/0x64
[ 1.317873] [<c1013c3d>] ? prepare_ftrace_return+0x4d/0xa0
[ 1.317873] [<c1016057>] bad_area_nosemaphore+0x17/0x20
[ 1.317873] [<c10164ae>] do_page_fault+0x27e/0x410
[ 1.317873] [<c1023928>] ? console_unlock+0x1c8/0x1f0
[ 1.317873] [<c1016230>] ? mm_fault_error+0x130/0x130
[ 1.317873] [<c1016230>] ? mm_fault_error+0x130/0x130
...

Depending on the luck and boot time. All above was registered using qemu.
But I have this (or slightly different - depdnding on boot,
or kernel configuration - messages, just after trace function_graph testing
started) alsoon my laptop.

Reverting commit 44259b1abfaa8bb819d25d41d71e8e33e25dd36a on top of current
kernel make bug disapear.

Disabling CONFIG_FUNCTION_GRAPH_TRACER also make bug dispear.

Bug is preserved at multiple kernel configs.
I send one which is pretty minimized.


Regards,
Witek

--
Witold Baryluk


Attachments:
(No filename) (0.00 B)
signature.asc (198.00 B)
Digital signature
Download all attachments

2011-05-30 20:14:47

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: Kernel OOPS in function_graph_tracer due to the 44259b1abfaa8bb819d25d41d71e8e33e25dd36a commit (x86-64: Move vread_tsc into a new file with sensible options)

On Mon, May 30, 2011 at 12:10 PM, Witold Baryluk
<[email protected]> wrote:
> Hi,
>
> I found yesterday a problem when booting system on Pentium-M, 32-bit.
>
> I got approximetly this
>
> [ ? ?2.459170] Testing tracer function_graph:
> [ ? ?2.466979] BUG: unable to handle kernel paging request at e421cc10

>
> Reverting commit 44259b1abfaa8bb819d25d41d71e8e33e25dd36a on top of current
> kernel make bug disapear.
>
> Disabling CONFIG_FUNCTION_GRAPH_TRACER also make bug dispear.
>

Of course, the most trivial of my patches was the one with the most
significant bug. Can you try this fix:

http://git.kernel.org/?p=linux/kernel/git/x86/linux-2.6-tip.git;a=commitdiff;h=89e1be50c68eb5e58b873dce87bbac627ee18d1f

--Andy

2011-05-30 22:19:12

by Witold Baryluk

[permalink] [raw]
Subject: Re: Kernel OOPS in function_graph_tracer due to the 44259b1abfaa8bb819d25d41d71e8e33e25dd36a commit (x86-64: Move vread_tsc into a new file with sensible options)

On 05-30 16:14, Andrew Lutomirski wrote:
> On Mon, May 30, 2011 at 12:10 PM, Witold Baryluk
> <[email protected]> wrote:
> > Hi,
> >
> > I found yesterday a problem when booting system on Pentium-M, 32-bit.
> >
> > I got approximetly this
> >
> > [ ? ?2.459170] Testing tracer function_graph:
> > [ ? ?2.466979] BUG: unable to handle kernel paging request at e421cc10
>
> >
> > Reverting commit 44259b1abfaa8bb819d25d41d71e8e33e25dd36a on top of current
> > kernel make bug disapear.
> >
> > Disabling CONFIG_FUNCTION_GRAPH_TRACER also make bug dispear.
> >
>
> Of course, the most trivial of my patches was the one with the most
> significant bug. Can you try this fix:
>
> http://git.kernel.org/?p=linux/kernel/git/x86/linux-2.6-tip.git;a=commitdiff;h=89e1be50c68eb5e58b873dce87bbac627ee18d1f
>
> --Andy

Fix works. :)

Compiled with patch and CONFIG_FUNCTION_GRAPH_TRACER enabled, and
booted 4 times without problem

[ 6.611505] Testing dynamic ftrace ops #1:
[ 6.771532] Refined TSC clocksource calibration: 3199.859 MHz.
[ 6.772337] Switching to clocksource tsc
[ 6.919805] (1 0 1 1 0) (1 1 2 1 0) (2 1 3 1 215146) (2 2 4 1 215428) PASSED
[ 7.058209] Testing dynamic ftrace ops #2: (1 0 1 213422 0) (1 1 2 213704 0) (2 1 3 1 184) (2 2 4 270 453) PASSED
[ 7.646890] Testing tracer irqsoff: PASSED
[ 7.754403] Testing tracer preemptoff: PASSED
[ 7.860605] Testing tracer preemptirqsoff: PASSED
[ 7.970927] Testing tracer wakeup: PASSED
[ 8.363620] Testing tracer wakeup_rt: PASSED
[ 8.763273] Testing tracer function_graph: PASSED
[ 9.052030] HugeTLB registered 2 MB page size, pre-allocated 0 pages
[ 9.061322] VFS: Disk quotas dquot_6.5.2
[ 9.065226] Dquot-cache hash table entries: 1024 (order 0, 4096 bytes)
[ 9.095324] Registering the id_resolver key type
[ 9.097664] FS-Cache: Netfs 'nfs' registered for caching
[ 9.103503] nfs4filelayout_init: NFSv4 File Layout Driver Registering...
[ 9.111907] fuse init (API version 7.16)
[ 9.116706] msgmni has been set to 207
[ 9.143692] cryptomgr_test used greatest stack depth: 6452 bytes left
[ 9.163685] alg: No test for fcrypt (fcrypt-generic)
[ 9.205037] alg: No test for stdrng (krng)
[ 9.452480] NET: Registered protocol family 38
[ 9.456719] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 253)
[ 9.458912] io scheduler noop registered
[ 9.459136] io scheduler deadline registered (default)
[ 9.459822] io scheduler cfq registered
[ 9.460551] start plist test
[ 9.465413] end plist test



Thanks you.

I forgot to check linux-tip/next.

Good work, hope it will land quickly in Linus' tree.


--
Witold Baryluk


Attachments:
(No filename) (2.62 kB)
signature.asc (198.00 B)
Digital signature
Download all attachments

2011-05-30 23:46:59

by Witold Baryluk

[permalink] [raw]
Subject: Re: Kernel OOPS in function_graph_tracer due to the 44259b1. More oopses in tracing...

On 05-30 16:14, Andrew Lutomirski wrote:
> On Mon, May 30, 2011 at 12:10 PM, Witold Baryluk
> <[email protected]> wrote:
> > Hi,
> >
> > I found yesterday a problem when booting system on Pentium-M, 32-bit.
> >
> > I got approximetly this
> >
> > [ ? ?2.459170] Testing tracer function_graph:
> > [ ? ?2.466979] BUG: unable to handle kernel paging request at e421cc10
>
> >
> > Reverting commit 44259b1abfaa8bb819d25d41d71e8e33e25dd36a on top of current
> > kernel make bug disapear.
> >
> > Disabling CONFIG_FUNCTION_GRAPH_TRACER also make bug dispear.
> >
>
> Of course, the most trivial of my patches was the one with the most
> significant bug. Can you try this fix:
>
> http://git.kernel.org/?p=linux/kernel/git/x86/linux-2.6-tip.git;a=commitdiff;h=89e1be50c68eb5e58b873dce87bbac627ee18d1f
>
> --Andy

Well, to add more.

It fixed most of crashes and definietl one with function graph tracer.

However in 1/10 of boots I still got some kind of crash, oops or panic.

I repeated 50 times a boot cycle on qemu, and got oopses in attachments.

...
[ 0.036201] CPU: Intel Pentium III (Katmai) stepping 03
[ 0.038551] ftrace: allocating 6263 entries in 13 pages
[ 0.050411] BUG: unable to handle kernel NULL pointer dereference at 00000012
[ 0.051079] IP: [<c106a07f>] note_interrupt+0xf/0x1c0
[ 0.051743] *pdpt = 0000000000000000 *pde = f000ff53f000ff53
...
CRASH


....
[ 0.035682] CPU: Intel Pentium III (Katmai) stepping 03
[ 0.038048] ftrace: allocating 6263 entries in 13 pages
[ 0.050386] BUG: unable to handle kernel paging request at 8a51553a
[ 0.051031] IP: [<c10587cd>] tick_handle_periodic+0x1d/0x90
[ 0.051705] *pdpt = 0000000000000000 *pde = f000ff53f000ff53
...
CRASH

...
[ 1.758224] Testing tracer preemptirqsoff:
[ 1.772287] ------------[ cut here ]------------
[ 1.773193] WARNING: at kernel/lockdep.c:3351 check_flags+0x186/0x190()
...
BOOTING (up to mounting root - not tested more)

...
[ 0.045767] CPU: Intel Pentium III (Katmai) stepping 03
[ 0.048323] ftrace: allocating 6263 entries in 13 pages
CRASH


Details in attachments.


Regards,
Witek

--
Witold Baryluk


Attachments:
(No filename) (0.00 B)
signature.asc (198.00 B)
Digital signature
Download all attachments

2011-05-31 01:24:09

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: Kernel OOPS in function_graph_tracer due to the 44259b1. More oopses in tracing...

On Mon, May 30, 2011 at 7:46 PM, Witold Baryluk
<[email protected]> wrote:
> On 05-30 16:14, Andrew Lutomirski wrote:
>> On Mon, May 30, 2011 at 12:10 PM, Witold Baryluk
>> <[email protected]> wrote:
>> > Hi,
>> >
>> > I found yesterday a problem when booting system on Pentium-M, 32-bit.
>> >
>> > I got approximetly this
>> >
>> > [ ? ?2.459170] Testing tracer function_graph:
>> > [ ? ?2.466979] BUG: unable to handle kernel paging request at e421cc10
>>
>> >
>> > Reverting commit 44259b1abfaa8bb819d25d41d71e8e33e25dd36a on top of current
>> > kernel make bug disapear.
>> >
>> > Disabling CONFIG_FUNCTION_GRAPH_TRACER also make bug dispear.
>> >
>>
>> Of course, the most trivial of my patches was the one with the most
>> significant bug. ?Can you try this fix:
>>
>> http://git.kernel.org/?p=linux/kernel/git/x86/linux-2.6-tip.git;a=commitdiff;h=89e1be50c68eb5e58b873dce87bbac627ee18d1f
>>
>> --Andy
>
> Well, to add more.
>
> It fixed most of crashes and definietl one with function graph tracer.
>
> However in 1/10 of boots I still got some kind of crash, oops or panic.
>


>
> ....
> [ ? ?0.035682] CPU: Intel Pentium III (Katmai) stepping 03
> [ ? ?0.038048] ftrace: allocating 6263 entries in 13 pages
> [ ? ?0.050386] BUG: unable to handle kernel paging request at 8a51553a
> [ ? ?0.051031] IP: [<c10587cd>] tick_handle_periodic+0x1d/0x90
> [ ? ?0.051705] *pdpt = 0000000000000000 *pde = f000ff53f000ff53
> ...
> CRASH

This is oops1.txt. The faulting code is:

000003a0 <tick_handle_periodic>:
3a0: 55 push %ebp
3a1: 89 e5 mov %esp,%ebp
3a3: 57 push %edi
3a4: 56 push %esi
3a5: 53 push %ebx
3a6: 83 ec 0c sub $0xc,%esp
3a9: e8 fc ff ff ff call 3aa <tick_handle_periodic+0xa>
3ae: 89 c7 mov %eax,%edi
3b0: e8 fc ff ff ff call 3b1 <tick_handle_periodic+0x11>
3b5: 89 45 f0 mov %eax,-0x10(%ebp)
3b8: e8 63 ff ff ff call 320 <tick_periodic>
3bd: 83 7f 28 03 cmpl $0x3,0x28(%edi)

^^^ fault was in the dereference of edi + 0x28.

3c1: 74 0d je 3d0 <tick_handle_periodic+0x30>
3c3: 83 c4 0c add $0xc,%esp
3c6: 5b pop %ebx
3c7: 5e pop %esi
3c8: 5f pop %edi
3c9: 5d pop %ebp
3ca: c3 ret

The stack trace is garbage, though.

The offending C code is probably this:

if (dev->mode != CLOCK_EVT_MODE_ONESHOT)
return;

I would guess that this isn't related to the vdso changes, and I'm
mostly out of ideas.

--Andy