2015-02-16 07:20:52

by Fengguang Wu

[permalink] [raw]
Subject: [sched/deadline] kernel BUG at kernel/sched/deadline.c:805!

Hi Peter,

We got the below dmesg and the first bad commit is

git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git revert-d347efeb16d3d5150cb7f8d50b05f388b572840e-1019a359d3dc4b64d0e1e5a5efcb725d5e83994d

commit 1019a359d3dc4b64d0e1e5a5efcb725d5e83994d
Author: Peter Zijlstra <[email protected]>
AuthorDate: Wed Nov 26 08:44:03 2014 +0800
Commit: Ingo Molnar <[email protected]>
CommitDate: Wed Feb 4 07:52:26 2015 +0100

sched/deadline: Fix stale yield state

When we fail to start the deadline timer in update_curr_dl(), we
forget to clear ->dl_yielded, resulting in wrecked time keeping.

Since the natural place to clear both ->dl_yielded and ->dl_throttled
is in replenish_dl_entity(); both are after all waiting for that event;
make it so.

Luckily since 67dfa1b756f2 ("sched/deadline: Implement
cancel_dl_timer() to use in switched_from_dl()") the
task_on_rq_queued() condition in dl_task_timer() must be true, and can
therefore call enqueue_task_dl() unconditionally.

Reported-by: Wanpeng Li <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

===================================================
PARENT COMMIT NOT CLEAN. LOOK OUT FOR WRONG BISECT!
===================================================

+------------------------------------------------+------------+------------+------------+
| | a7bebf4887 | 1019a359d3 | 58e0e18cb8 |
+------------------------------------------------+------------+------------+------------+
| boot_successes | 886 | 54 | 10 |
| boot_failures | 314 | 246 | 4 |
| BUG:kernel_boot_crashed | 48 | 20 | 1 |
| BUG:kernel_test_crashed | 16 | 2 | |
| BUG:kernel_boot_hang | 213 | 43 | |
| Kernel_panic-not_syncing:softlockup:hung_tasks | 37 | 3 | |
| backtrace:ring_buffer_consumer_thread | 17 | 3 | |
| EIP_is_at_get_node_by_name | 4 | | |
| backtrace:gcov_enable_events | 17 | | |
| backtrace:gcov_fs_init | 17 | | |
| backtrace:kernel_init_freeable | 20 | 178 | 3 |
| EIP_is_at_strcmp | 7 | | |
| EIP_is_at__slab_alloc | 1 | | |
| backtrace:__ring_buffer_alloc | 1 | | |
| backtrace:ring_buffer_benchmark_init | 1 | | |
| EIP_is_at_lock_release | 1 | | |
| backtrace:debug_mutex_init | 1 | | |
| EIP_is_at_strncmp | 1 | | |
| EIP_is_at_do_raw_spin_unlock | 1 | | |
| EIP_is_at__do_softirq | 6 | 2 | |
| EIP_is_at_gcov_info_filename | 1 | | |
| EIP_is_at_ring_buffer_consume | 8 | 1 | |
| EIP_is_at_raw_spin_unlock_irqrestore | 1 | | |
| EIP_is_at_ftrace_ops_list_func | 1 | | |
| backtrace:event_trace_self_tests | 2 | | |
| backtrace:event_trace_self_tests_init | 2 | | |
| EIP_is_at_ftrace_likely_update | 1 | | |
| EIP_is_at_rb_reserve_next_event | 1 | | |
| EIP_is_at_raw_spin_unlock_irq | 1 | | |
| kernel_BUG_at_kernel/sched/deadline.c | 0 | 178 | 3 |
| invalid_opcode | 0 | 178 | 3 |
| EIP_is_at_enqueue_task_dl | 0 | 178 | 3 |
| Kernel_panic-not_syncing:Fatal_exception | 0 | 178 | 3 |
| backtrace:register_tracer | 0 | 178 | 3 |
| backtrace:init_wakeup_tracer | 0 | 178 | 3 |
+------------------------------------------------+------------+------------+------------+

[ 6.432010] Testing ftrace regs: PASSED
[ 6.530393] Testing tracer wakeup:
[ 7.648615] ------------[ cut here ]------------
[ 7.649000] kernel BUG at kernel/sched/deadline.c:805!
[ 7.649000] invalid opcode: 0000 [#1] SMP
[ 7.649000] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.19.0-rc6-00041-g1019a35 #17
[ 7.649000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[ 7.649000] task: d3348000 ti: d3344000 task.ti: d3344000
[ 7.649000] EIP: 0060:[<c11969a1>] EFLAGS: 00010046 CPU: 0
[ 7.649000] EIP is at enqueue_task_dl+0x341/0x1860
[ 7.649000] EAX: d2940101 EBX: 00000001 ECX: 00000000 EDX: 0085a491
[ 7.649000] ESI: d2940000 EDI: d2940101 EBP: d3345e08 ESP: d3345dc0
[ 7.649000] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 7.649000] CR0: 8005003b CR2: ffffffff CR3: 04e8b000 CR4: 000006d0
[ 7.649000] Stack:
[ 7.649000] c1170353 00000002 d3345de0 c11ab53c d391ec10 00000046 d33485c0 d3348000
[ 7.649000] 00000002 d29401b8 d391ec00 d29401b8 00000046 00000005 00000000 d2940000
[ 7.649000] d391ec00 c36efc80 d3345e20 c115991a 00000005 00000000 d391ec00 00000000
[ 7.649000] Call Trace:
[ 7.649000] [<c1170353>] ? sched_clock_cpu+0x293/0x2e0
[ 7.649000] [<c11ab53c>] ? __lock_is_held+0x6c/0xa0
[ 7.649000] [<c115991a>] enqueue_task+0xba/0x220
[ 7.649000] [<c115aa4c>] activate_task+0x4c/0x80
[ 7.649000] [<c115aaa1>] ttwu_activate+0x21/0x90
[ 7.649000] [<c115bd84>] ttwu_do_activate+0x64/0x90
[ 7.649000] [<c11689b6>] try_to_wake_up+0x366/0x8b0
[ 7.649000] [<c1169000>] wake_up_process+0xa0/0xe0
[ 7.649000] [<c114095b>] kthread_stop+0x21b/0xd90
[ 7.649000] [<c12c0b27>] trace_selftest_startup_wakeup+0x3b7/0x4b0
[ 7.649000] [<c36b1690>] ? wait_for_completion+0x1f0/0x290
[ 7.649000] [<c12c1d04>] register_tracer+0x2d4/0x700
[ 7.649000] [<c4bd11a4>] init_wakeup_tracer+0x1d/0xea
[ 7.649000] [<c4b8f12e>] do_one_initcall+0x278/0x44c
[ 7.649000] [<c4bd1187>] ? init_function_trace+0x1f8/0x1f8
[ 7.649000] [<c4b8f565>] kernel_init_freeable+0x263/0x424
[ 7.649000] [<c3663b4c>] kernel_init+0x1c/0x2a0
[ 7.649000] [<c36c2181>] ret_from_kernel_thread+0x21/0x30
[ 7.649000] [<c3663b30>] ? rest_init+0x330/0x330
[ 7.649000] Code: da e8 94 96 13 00 8b 04 9d 84 f6 77 c4 83 c0 01 89 04 9d 84 f6 77 c4 89 f8 84 c0 74 4d 83 05 c8 14 f2 c4 01 83 15 cc 14 f2 c4 00 <0f> 0b 83 05 d0 14 f2 c4 01 83 15 d4 14 f2 c4 00 8d b4 26 00 00
[ 7.649000] EIP: [<c11969a1>] enqueue_task_dl+0x341/0x1860 SS:ESP 0068:d3345dc0
[ 7.649000] ---[ end trace cb4bdebe33135894 ]---
[ 7.649000] Kernel panic - not syncing: Fatal exception

git bisect start 58e0e18cb8b1c3351cb70b3cf78d4e3785a181fe bfa76d49576599a4b9f9b7a71f23d73d6dcff735 --
git bisect bad 30496a920e97b97e16d078c520eef323344ef997 # 23:32 0- 67 Merge 'amirv/for-upstream' into devel-xian-smoke-201502152116
git bisect good d170f4905ee6f93eb12f87d018dd4b9acea18f2c # 00:27 300+ 120 0day base guard for 'devel-xian-smoke-201502152116'
git bisect bad c5ce28df0e7c01a1de23c36ebdefcd803f2b6cbb # 01:11 0- 72 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
git bisect good 57ee062e4baa9532e5a3835a2d185590a3fc8bd3 # 01:51 300+ 94 Merge branch 'r8152'
git bisect bad c08f8467939e7d2eebcba7cf2330242c4f53f2f7 # 02:10 0- 103 Merge tag 'pci-v3.20-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
git bisect bad 5b9b28a63f2e47dac5ff3a2503bfe3ade8796aa0 # 02:24 0- 5 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good 30d46827c2744f56bb31460007f2d16455f10720 # 02:41 300+ 107 Merge tag 'regulator-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
git bisect good 8308756f45a12e2ff4f7749c2694fc83cdef0be9 # 03:11 300+ 101 Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good 4ac30cf74b308fb01338e660d3471cd490a7958a # 03:36 300+ 25 perf tools: Do not use __perf_session__process_events() directly
git bisect bad 9659e1eeee28f7025b6545934d644d19e9c6e603 # 03:48 0- 2 sched/deadline: Remove cpu_active_mask from cpudl_find()
git bisect good 3847b272248a3a4ed70d20392cc0454917f7713b # 04:23 300+ 89 Merge branch 'sched/urgent' into sched/core
git bisect good 4c195c8a1967ff8bee13a811518a99db04618ab7 # 04:59 300+ 118 Merge branch 'sched/urgent' into sched/core, to merge fixes before applying new patches
git bisect bad 1019a359d3dc4b64d0e1e5a5efcb725d5e83994d # 05:10 39- 193 sched/deadline: Fix stale yield state
git bisect good a7bebf488791aa1036f3e6629daf01d01f705dcb # 05:34 300+ 103 sched/deadline: Fix hrtick for a non-leftmost task
# first bad commit: [1019a359d3dc4b64d0e1e5a5efcb725d5e83994d] sched/deadline: Fix stale yield state
git bisect good a7bebf488791aa1036f3e6629daf01d01f705dcb # 05:55 900+ 314 sched/deadline: Fix hrtick for a non-leftmost task
# extra tests with DEBUG_INFO
git bisect good 1019a359d3dc4b64d0e1e5a5efcb725d5e83994d # 06:11 900+ 0 sched/deadline: Fix stale yield state
# extra tests on HEAD of linux-devel/devel-xian-smoke-201502152116
git bisect bad 58e0e18cb8b1c3351cb70b3cf78d4e3785a181fe # 06:11 0- 4 0day head guard for 'devel-xian-smoke-201502152116'
# extra tests on tree/branch linus/master
git bisect bad a9724125ad014decf008d782e60447c811391326 # 06:11 0- 1 Merge tag 'tty-3.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
# extra tests with first bad commit reverted
# extra tests on tree/branch linus/master
git bisect bad a9724125ad014decf008d782e60447c811391326 # 06:16 0- 1 Merge tag 'tty-3.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
# extra tests on tree/branch next/master
git bisect bad b8acf73194186a5cba86812eb4ba17b897f0e13e # 06:26 0- 35 Add linux-next specific files for 20150213


This script may reproduce the error.

----------------------------------------------------------------------------
#!/bin/bash

kernel=$1

kvm=(
qemu-system-x86_64
-cpu kvm64
-enable-kvm
-kernel $kernel
-m 320
-smp 1
-net nic,vlan=1,model=e1000
-net user,vlan=1
-boot order=nc
-no-reboot
-watchdog i6300esb
-rtc base=localtime
-serial stdio
-display none
-monitor null
)

append=(
hung_task_panic=1
earlyprintk=ttyS0,115200
rd.udev.log-priority=err
systemd.log_target=journal
systemd.log_level=warning
debug
apic=debug
sysrq_always_enabled
rcupdate.rcu_cpu_stall_timeout=100
panic=-1
softlockup_panic=1
nmi_watchdog=panic
oops=panic
load_ramdisk=2
prompt_ramdisk=0
console=ttyS0,115200
console=tty0
vga=normal
root=/dev/ram0
rw
drbd.minor_count=8
)

"${kvm[@]}" --append "${append[*]}"
----------------------------------------------------------------------------

Thanks,
Fengguang


Attachments:
(No filename) (11.86 kB)
dmesg-yocto-client9-7:20150216050530:i386-randconfig-x1-02152111:3.19.0-rc6-00041-g1019a35:17 (31.71 kB)
Download all attachments

2015-02-16 11:52:42

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [sched/deadline] kernel BUG at kernel/sched/deadline.c:805!

16.02.2015, 10:21, "Fengguang Wu" <[email protected]>:
> Hi Peter,
>
> We got the below dmesg and the first bad commit is
>
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git revert-d347efeb16d3d5150cb7f8d50b05f388b572840e-1019a359d3dc4b64d0e1e5a5efcb725d5e83994d
>
> commit 1019a359d3dc4b64d0e1e5a5efcb725d5e83994d
> Author: ????Peter Zijlstra <[email protected]>
> AuthorDate: Wed Nov 26 08:44:03 2014 +0800
> Commit: ????Ingo Molnar <[email protected]>
> CommitDate: Wed Feb 4 07:52:26 2015 +0100
>
> ????sched/deadline: Fix stale yield state
>
> ????When we fail to start the deadline timer in update_curr_dl(), we
> ????forget to clear ->dl_yielded, resulting in wrecked time keeping.
>
> ????Since the natural place to clear both ->dl_yielded and ->dl_throttled
> ????is in replenish_dl_entity(); both are after all waiting for that event;
> ????make it so.
>
> ????Luckily since 67dfa1b756f2 ("sched/deadline: Implement
> ????cancel_dl_timer() to use in switched_from_dl()") the
> ????task_on_rq_queued() condition in dl_task_timer() must be true, and can
> ????therefore call enqueue_task_dl() unconditionally.
>
> ????Reported-by: Wanpeng Li <[email protected]>
> ????Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> ????Cc: Kirill Tkhai <[email protected]>
> ????Cc: Juri Lelli <[email protected]>
> ????Cc: Linus Torvalds <[email protected]>
> ????Link: http://lkml.kernel.org/r/[email protected]
> ????Signed-off-by: Ingo Molnar <[email protected]>
>
> ===================================================
> PARENT COMMIT NOT CLEAN. LOOK OUT FOR WRONG BISECT!
> ===================================================
>
> +------------------------------------------------+------------+------------+------------+
> | ???????????????????????????????????????????????| a7bebf4887 | 1019a359d3 | 58e0e18cb8 |
> +------------------------------------------------+------------+------------+------------+
> | boot_successes ????????????????????????????????| 886 ???????| 54 ????????| 10 ????????|
> | boot_failures ?????????????????????????????????| 314 ???????| 246 ???????| 4 ?????????|
> | BUG:kernel_boot_crashed ???????????????????????| 48 ????????| 20 ????????| 1 ?????????|
> | BUG:kernel_test_crashed ???????????????????????| 16 ????????| 2 ?????????| ???????????|
> | BUG:kernel_boot_hang ??????????????????????????| 213 ???????| 43 ????????| ???????????|
> | Kernel_panic-not_syncing:softlockup:hung_tasks | 37 ????????| 3 ?????????| ???????????|
> | backtrace:ring_buffer_consumer_thread ?????????| 17 ????????| 3 ?????????| ???????????|
> | EIP_is_at_get_node_by_name ????????????????????| 4 ?????????| ???????????| ???????????|
> | backtrace:gcov_enable_events ??????????????????| 17 ????????| ???????????| ???????????|
> | backtrace:gcov_fs_init ????????????????????????| 17 ????????| ???????????| ???????????|
> | backtrace:kernel_init_freeable ????????????????| 20 ????????| 178 ???????| 3 ?????????|
> | EIP_is_at_strcmp ??????????????????????????????| 7 ?????????| ???????????| ???????????|
> | EIP_is_at__slab_alloc ?????????????????????????| 1 ?????????| ???????????| ???????????|
> | backtrace:__ring_buffer_alloc ?????????????????| 1 ?????????| ???????????| ???????????|
> | backtrace:ring_buffer_benchmark_init ??????????| 1 ?????????| ???????????| ???????????|
> | EIP_is_at_lock_release ????????????????????????| 1 ?????????| ???????????| ???????????|
> | backtrace:debug_mutex_init ????????????????????| 1 ?????????| ???????????| ???????????|
> | EIP_is_at_strncmp ?????????????????????????????| 1 ?????????| ???????????| ???????????|
> | EIP_is_at_do_raw_spin_unlock ??????????????????| 1 ?????????| ???????????| ???????????|
> | EIP_is_at__do_softirq ?????????????????????????| 6 ?????????| 2 ?????????| ???????????|
> | EIP_is_at_gcov_info_filename ??????????????????| 1 ?????????| ???????????| ???????????|
> | EIP_is_at_ring_buffer_consume ?????????????????| 8 ?????????| 1 ?????????| ???????????|
> | EIP_is_at_raw_spin_unlock_irqrestore ??????????| 1 ?????????| ???????????| ???????????|
> | EIP_is_at_ftrace_ops_list_func ????????????????| 1 ?????????| ???????????| ???????????|
> | backtrace:event_trace_self_tests ??????????????| 2 ?????????| ???????????| ???????????|
> | backtrace:event_trace_self_tests_init ?????????| 2 ?????????| ???????????| ???????????|
> | EIP_is_at_ftrace_likely_update ????????????????| 1 ?????????| ???????????| ???????????|
> | EIP_is_at_rb_reserve_next_event ???????????????| 1 ?????????| ???????????| ???????????|
> | EIP_is_at_raw_spin_unlock_irq ?????????????????| 1 ?????????| ???????????| ???????????|
> | kernel_BUG_at_kernel/sched/deadline.c ?????????| 0 ?????????| 178 ???????| 3 ?????????|
> | invalid_opcode ????????????????????????????????| 0 ?????????| 178 ???????| 3 ?????????|
> | EIP_is_at_enqueue_task_dl ?????????????????????| 0 ?????????| 178 ???????| 3 ?????????|
> | Kernel_panic-not_syncing:Fatal_exception ??????| 0 ?????????| 178 ???????| 3 ?????????|
> | backtrace:register_tracer ?????????????????????| 0 ?????????| 178 ???????| 3 ?????????|
> | backtrace:init_wakeup_tracer ??????????????????| 0 ?????????| 178 ???????| 3 ?????????|
> +------------------------------------------------+------------+------------+------------+
>
> [ ???6.432010] Testing ftrace regs: PASSED
> [ ???6.530393] Testing tracer wakeup:
> [ ???7.648615] ------------[ cut here ]------------
> [ ???7.649000] kernel BUG at kernel/sched/deadline.c:805!
> [ ???7.649000] invalid opcode: 0000 [#1] SMP
> [ ???7.649000] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.19.0-rc6-00041-g1019a35 #17
> [ ???7.649000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
> [ ???7.649000] task: d3348000 ti: d3344000 task.ti: d3344000
> [ ???7.649000] EIP: 0060:[<c11969a1>] EFLAGS: 00010046 CPU: 0
> [ ???7.649000] EIP is at enqueue_task_dl+0x341/0x1860
> [ ???7.649000] EAX: d2940101 EBX: 00000001 ECX: 00000000 EDX: 0085a491
> [ ???7.649000] ESI: d2940000 EDI: d2940101 EBP: d3345e08 ESP: d3345dc0
> [ ???7.649000] ?DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
> [ ???7.649000] CR0: 8005003b CR2: ffffffff CR3: 04e8b000 CR4: 000006d0
> [ ???7.649000] Stack:
> [ ???7.649000] ?c1170353 00000002 d3345de0 c11ab53c d391ec10 00000046 d33485c0 d3348000
> [ ???7.649000] ?00000002 d29401b8 d391ec00 d29401b8 00000046 00000005 00000000 d2940000
> [ ???7.649000] ?d391ec00 c36efc80 d3345e20 c115991a 00000005 00000000 d391ec00 00000000
> [ ???7.649000] Call Trace:
> [ ???7.649000] ?[<c1170353>] ? sched_clock_cpu+0x293/0x2e0
> [ ???7.649000] ?[<c11ab53c>] ? __lock_is_held+0x6c/0xa0
> [ ???7.649000] ?[<c115991a>] enqueue_task+0xba/0x220
> [ ???7.649000] ?[<c115aa4c>] activate_task+0x4c/0x80
> [ ???7.649000] ?[<c115aaa1>] ttwu_activate+0x21/0x90
> [ ???7.649000] ?[<c115bd84>] ttwu_do_activate+0x64/0x90
> [ ???7.649000] ?[<c11689b6>] try_to_wake_up+0x366/0x8b0
> [ ???7.649000] ?[<c1169000>] wake_up_process+0xa0/0xe0
> [ ???7.649000] ?[<c114095b>] kthread_stop+0x21b/0xd90
> [ ???7.649000] ?[<c12c0b27>] trace_selftest_startup_wakeup+0x3b7/0x4b0
> [ ???7.649000] ?[<c36b1690>] ? wait_for_completion+0x1f0/0x290
> [ ???7.649000] ?[<c12c1d04>] register_tracer+0x2d4/0x700
> [ ???7.649000] ?[<c4bd11a4>] init_wakeup_tracer+0x1d/0xea
> [ ???7.649000] ?[<c4b8f12e>] do_one_initcall+0x278/0x44c
> [ ???7.649000] ?[<c4bd1187>] ? init_function_trace+0x1f8/0x1f8
> [ ???7.649000] ?[<c4b8f565>] kernel_init_freeable+0x263/0x424
> [ ???7.649000] ?[<c3663b4c>] kernel_init+0x1c/0x2a0
> [ ???7.649000] ?[<c36c2181>] ret_from_kernel_thread+0x21/0x30
> [ ???7.649000] ?[<c3663b30>] ? rest_init+0x330/0x330
> [ ???7.649000] Code: da e8 94 96 13 00 8b 04 9d 84 f6 77 c4 83 c0 01 89 04 9d 84 f6 77 c4 89 f8 84 c0 74 4d 83 05 c8 14 f2 c4 01 83 15 cc 14 f2 c4 00 <0f> 0b 83 05 d0 14 f2 c4 01 83 15 d4 14 f2 c4 00 8d b4 26 00 00
> [ ???7.649000] EIP: [<c11969a1>] enqueue_task_dl+0x341/0x1860 SS:ESP 0068:d3345dc0
> [ ???7.649000] ---[ end trace cb4bdebe33135894 ]---
> [ ???7.649000] Kernel panic - not syncing: Fatal exception
>
> git bisect start 58e0e18cb8b1c3351cb70b3cf78d4e3785a181fe bfa76d49576599a4b9f9b7a71f23d73d6dcff735 --
> git bisect ?bad 30496a920e97b97e16d078c520eef323344ef997 ?# 23:32 ?????0- ????67 ?Merge 'amirv/for-upstream' into devel-xian-smoke-201502152116
> git bisect good d170f4905ee6f93eb12f87d018dd4b9acea18f2c ?# 00:27 ???300+ ???120 ?0day base guard for 'devel-xian-smoke-201502152116'
> git bisect ?bad c5ce28df0e7c01a1de23c36ebdefcd803f2b6cbb ?# 01:11 ?????0- ????72 ?Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
> git bisect good 57ee062e4baa9532e5a3835a2d185590a3fc8bd3 ?# 01:51 ???300+ ????94 ?Merge branch 'r8152'
> git bisect ?bad c08f8467939e7d2eebcba7cf2330242c4f53f2f7 ?# 02:10 ?????0- ???103 ?Merge tag 'pci-v3.20-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
> git bisect ?bad 5b9b28a63f2e47dac5ff3a2503bfe3ade8796aa0 ?# 02:24 ?????0- ?????5 ?Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> git bisect good 30d46827c2744f56bb31460007f2d16455f10720 ?# 02:41 ???300+ ???107 ?Merge tag 'regulator-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
> git bisect good 8308756f45a12e2ff4f7749c2694fc83cdef0be9 ?# 03:11 ???300+ ???101 ?Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> git bisect good 4ac30cf74b308fb01338e660d3471cd490a7958a ?# 03:36 ???300+ ????25 ?perf tools: Do not use __perf_session__process_events() directly
> git bisect ?bad 9659e1eeee28f7025b6545934d644d19e9c6e603 ?# 03:48 ?????0- ?????2 ?sched/deadline: Remove cpu_active_mask from cpudl_find()
> git bisect good 3847b272248a3a4ed70d20392cc0454917f7713b ?# 04:23 ???300+ ????89 ?Merge branch 'sched/urgent' into sched/core
> git bisect good 4c195c8a1967ff8bee13a811518a99db04618ab7 ?# 04:59 ???300+ ???118 ?Merge branch 'sched/urgent' into sched/core, to merge fixes before applying new patches
> git bisect ?bad 1019a359d3dc4b64d0e1e5a5efcb725d5e83994d ?# 05:10 ????39- ???193 ?sched/deadline: Fix stale yield state
> git bisect good a7bebf488791aa1036f3e6629daf01d01f705dcb ?# 05:34 ???300+ ???103 ?sched/deadline: Fix hrtick for a non-leftmost task
> # first bad commit: [1019a359d3dc4b64d0e1e5a5efcb725d5e83994d] sched/deadline: Fix stale yield state
> git bisect good a7bebf488791aa1036f3e6629daf01d01f705dcb ?# 05:55 ???900+ ???314 ?sched/deadline: Fix hrtick for a non-leftmost task
> # extra tests with DEBUG_INFO
> git bisect good 1019a359d3dc4b64d0e1e5a5efcb725d5e83994d ?# 06:11 ???900+ ?????0 ?sched/deadline: Fix stale yield state
> # extra tests on HEAD of linux-devel/devel-xian-smoke-201502152116
> git bisect ?bad 58e0e18cb8b1c3351cb70b3cf78d4e3785a181fe ?# 06:11 ?????0- ?????4 ?0day head guard for 'devel-xian-smoke-201502152116'
> # extra tests on tree/branch linus/master
> git bisect ?bad a9724125ad014decf008d782e60447c811391326 ?# 06:11 ?????0- ?????1 ?Merge tag 'tty-3.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
> # extra tests with first bad commit reverted
> # extra tests on tree/branch linus/master
> git bisect ?bad a9724125ad014decf008d782e60447c811391326 ?# 06:16 ?????0- ?????1 ?Merge tag 'tty-3.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
> # extra tests on tree/branch next/master
> git bisect ?bad b8acf73194186a5cba86812eb4ba17b897f0e13e ?# 06:26 ?????0- ????35 ?Add linux-next specific files for 20150213
>
> This script may reproduce the error.
>
> ----------------------------------------------------------------------------
> #!/bin/bash
>
> kernel=$1
>
> kvm=(
> ????????qemu-system-x86_64
> ????????-cpu kvm64
> ????????-enable-kvm
> ????????-kernel $kernel
> ????????-m 320
> ????????-smp 1
> ????????-net nic,vlan=1,model=e1000
> ????????-net user,vlan=1
> ????????-boot order=nc
> ????????-no-reboot
> ????????-watchdog i6300esb
> ????????-rtc base=localtime
> ????????-serial stdio
> ????????-display none
> ????????-monitor null
> )
>
> append=(
> ????????hung_task_panic=1
> ????????earlyprintk=ttyS0,115200
> ????????rd.udev.log-priority=err
> ????????systemd.log_target=journal
> ????????systemd.log_level=warning
> ????????debug
> ????????apic=debug
> ????????sysrq_always_enabled
> ????????rcupdate.rcu_cpu_stall_timeout=100
> ????????panic=-1
> ????????softlockup_panic=1
> ????????nmi_watchdog=panic
> ????????oops=panic
> ????????load_ramdisk=2
> ????????prompt_ramdisk=0
> ????????console=ttyS0,115200
> ????????console=tty0
> ????????vga=normal
> ????????root=/dev/ram0
> ????????rw
> ????????drbd.minor_count=8
> )
>
> "${kvm[@]}" --append "${append[*]}"
> ----------------------------------------------------------------------------

Don't we have the following problem?

current->state = TASK_INTERRUPTIBLE;
schedule()
deactivate_task()
dequeue_task_dl()
update_curr_dl()
start_dl_timer()
__dequeue_task_dl()

...timer fires...

dl_task_timer()
enqueue_task_dl() /* queues on dl_rq */

...try_to_wake_up()

enqueue_dl_entity()
BUG_ON(on_dl_rq())

Compile-tested fix:

kernel/sched/deadline.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index a027799..1b78598 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -541,6 +541,10 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)

sched_clock_tick();
update_rq_clock(rq);
+ if (!p->on_rq) {
+ replenish_dl_entity(dl_se, dl_se);
+ goto unlock;
+ }
enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
if (dl_task(rq->curr))
check_preempt_curr_dl(rq, p, 0);

2015-02-16 12:46:44

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [sched/deadline] kernel BUG at kernel/sched/deadline.c:805!

16.02.2015, 14:52, "Kirill Tkhai" <[email protected]>:
> 16.02.2015, 10:21, "Fengguang Wu" <[email protected]>:
>> ?Hi Peter,
>>
>> ?We got the below dmesg and the first bad commit is
>>
>> ?git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git revert-d347efeb16d3d5150cb7f8d50b05f388b572840e-1019a359d3dc4b64d0e1e5a5efcb725d5e83994d
>>
>> ?commit 1019a359d3dc4b64d0e1e5a5efcb725d5e83994d
>> ?Author: ????Peter Zijlstra <[email protected]>
>> ?AuthorDate: Wed Nov 26 08:44:03 2014 +0800
>> ?Commit: ????Ingo Molnar <[email protected]>
>> ?CommitDate: Wed Feb 4 07:52:26 2015 +0100
>>
>> ?????sched/deadline: Fix stale yield state
>>
>> ?????When we fail to start the deadline timer in update_curr_dl(), we
>> ?????forget to clear ->dl_yielded, resulting in wrecked time keeping.
>>
>> ?????Since the natural place to clear both ->dl_yielded and ->dl_throttled
>> ?????is in replenish_dl_entity(); both are after all waiting for that event;
>> ?????make it so.
>>
>> ?????Luckily since 67dfa1b756f2 ("sched/deadline: Implement
>> ?????cancel_dl_timer() to use in switched_from_dl()") the
>> ?????task_on_rq_queued() condition in dl_task_timer() must be true, and can
>> ?????therefore call enqueue_task_dl() unconditionally.
>>
>> ?????Reported-by: Wanpeng Li <[email protected]>
>> ?????Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
>> ?????Cc: Kirill Tkhai <[email protected]>
>> ?????Cc: Juri Lelli <[email protected]>
>> ?????Cc: Linus Torvalds <[email protected]>
>> ?????Link: http://lkml.kernel.org/r/[email protected]
>> ?????Signed-off-by: Ingo Molnar <[email protected]>
>>
>> ?===================================================
>> ?PARENT COMMIT NOT CLEAN. LOOK OUT FOR WRONG BISECT!
>> ?===================================================
>>
>> ?+------------------------------------------------+------------+------------+------------+
>> ?| ???????????????????????????????????????????????| a7bebf4887 | 1019a359d3 | 58e0e18cb8 |
>> ?+------------------------------------------------+------------+------------+------------+
>> ?| boot_successes ????????????????????????????????| 886 ???????| 54 ????????| 10 ????????|
>> ?| boot_failures ?????????????????????????????????| 314 ???????| 246 ???????| 4 ?????????|
>> ?| BUG:kernel_boot_crashed ???????????????????????| 48 ????????| 20 ????????| 1 ?????????|
>> ?| BUG:kernel_test_crashed ???????????????????????| 16 ????????| 2 ?????????| ???????????|
>> ?| BUG:kernel_boot_hang ??????????????????????????| 213 ???????| 43 ????????| ???????????|
>> ?| Kernel_panic-not_syncing:softlockup:hung_tasks | 37 ????????| 3 ?????????| ???????????|
>> ?| backtrace:ring_buffer_consumer_thread ?????????| 17 ????????| 3 ?????????| ???????????|
>> ?| EIP_is_at_get_node_by_name ????????????????????| 4 ?????????| ???????????| ???????????|
>> ?| backtrace:gcov_enable_events ??????????????????| 17 ????????| ???????????| ???????????|
>> ?| backtrace:gcov_fs_init ????????????????????????| 17 ????????| ???????????| ???????????|
>> ?| backtrace:kernel_init_freeable ????????????????| 20 ????????| 178 ???????| 3 ?????????|
>> ?| EIP_is_at_strcmp ??????????????????????????????| 7 ?????????| ???????????| ???????????|
>> ?| EIP_is_at__slab_alloc ?????????????????????????| 1 ?????????| ???????????| ???????????|
>> ?| backtrace:__ring_buffer_alloc ?????????????????| 1 ?????????| ???????????| ???????????|
>> ?| backtrace:ring_buffer_benchmark_init ??????????| 1 ?????????| ???????????| ???????????|
>> ?| EIP_is_at_lock_release ????????????????????????| 1 ?????????| ???????????| ???????????|
>> ?| backtrace:debug_mutex_init ????????????????????| 1 ?????????| ???????????| ???????????|
>> ?| EIP_is_at_strncmp ?????????????????????????????| 1 ?????????| ???????????| ???????????|
>> ?| EIP_is_at_do_raw_spin_unlock ??????????????????| 1 ?????????| ???????????| ???????????|
>> ?| EIP_is_at__do_softirq ?????????????????????????| 6 ?????????| 2 ?????????| ???????????|
>> ?| EIP_is_at_gcov_info_filename ??????????????????| 1 ?????????| ???????????| ???????????|
>> ?| EIP_is_at_ring_buffer_consume ?????????????????| 8 ?????????| 1 ?????????| ???????????|
>> ?| EIP_is_at_raw_spin_unlock_irqrestore ??????????| 1 ?????????| ???????????| ???????????|
>> ?| EIP_is_at_ftrace_ops_list_func ????????????????| 1 ?????????| ???????????| ???????????|
>> ?| backtrace:event_trace_self_tests ??????????????| 2 ?????????| ???????????| ???????????|
>> ?| backtrace:event_trace_self_tests_init ?????????| 2 ?????????| ???????????| ???????????|
>> ?| EIP_is_at_ftrace_likely_update ????????????????| 1 ?????????| ???????????| ???????????|
>> ?| EIP_is_at_rb_reserve_next_event ???????????????| 1 ?????????| ???????????| ???????????|
>> ?| EIP_is_at_raw_spin_unlock_irq ?????????????????| 1 ?????????| ???????????| ???????????|
>> ?| kernel_BUG_at_kernel/sched/deadline.c ?????????| 0 ?????????| 178 ???????| 3 ?????????|
>> ?| invalid_opcode ????????????????????????????????| 0 ?????????| 178 ???????| 3 ?????????|
>> ?| EIP_is_at_enqueue_task_dl ?????????????????????| 0 ?????????| 178 ???????| 3 ?????????|
>> ?| Kernel_panic-not_syncing:Fatal_exception ??????| 0 ?????????| 178 ???????| 3 ?????????|
>> ?| backtrace:register_tracer ?????????????????????| 0 ?????????| 178 ???????| 3 ?????????|
>> ?| backtrace:init_wakeup_tracer ??????????????????| 0 ?????????| 178 ???????| 3 ?????????|
>> ?+------------------------------------------------+------------+------------+------------+
>>
>> ?[ ???6.432010] Testing ftrace regs: PASSED
>> ?[ ???6.530393] Testing tracer wakeup:
>> ?[ ???7.648615] ------------[ cut here ]------------
>> ?[ ???7.649000] kernel BUG at kernel/sched/deadline.c:805!
>> ?[ ???7.649000] invalid opcode: 0000 [#1] SMP
>> ?[ ???7.649000] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.19.0-rc6-00041-g1019a35 #17
>> ?[ ???7.649000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
>> ?[ ???7.649000] task: d3348000 ti: d3344000 task.ti: d3344000
>> ?[ ???7.649000] EIP: 0060:[<c11969a1>] EFLAGS: 00010046 CPU: 0
>> ?[ ???7.649000] EIP is at enqueue_task_dl+0x341/0x1860
>> ?[ ???7.649000] EAX: d2940101 EBX: 00000001 ECX: 00000000 EDX: 0085a491
>> ?[ ???7.649000] ESI: d2940000 EDI: d2940101 EBP: d3345e08 ESP: d3345dc0
>> ?[ ???7.649000] ?DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
>> ?[ ???7.649000] CR0: 8005003b CR2: ffffffff CR3: 04e8b000 CR4: 000006d0
>> ?[ ???7.649000] Stack:
>> ?[ ???7.649000] ?c1170353 00000002 d3345de0 c11ab53c d391ec10 00000046 d33485c0 d3348000
>> ?[ ???7.649000] ?00000002 d29401b8 d391ec00 d29401b8 00000046 00000005 00000000 d2940000
>> ?[ ???7.649000] ?d391ec00 c36efc80 d3345e20 c115991a 00000005 00000000 d391ec00 00000000
>> ?[ ???7.649000] Call Trace:
>> ?[ ???7.649000] ?[<c1170353>] ? sched_clock_cpu+0x293/0x2e0
>> ?[ ???7.649000] ?[<c11ab53c>] ? __lock_is_held+0x6c/0xa0
>> ?[ ???7.649000] ?[<c115991a>] enqueue_task+0xba/0x220
>> ?[ ???7.649000] ?[<c115aa4c>] activate_task+0x4c/0x80
>> ?[ ???7.649000] ?[<c115aaa1>] ttwu_activate+0x21/0x90
>> ?[ ???7.649000] ?[<c115bd84>] ttwu_do_activate+0x64/0x90
>> ?[ ???7.649000] ?[<c11689b6>] try_to_wake_up+0x366/0x8b0
>> ?[ ???7.649000] ?[<c1169000>] wake_up_process+0xa0/0xe0
>> ?[ ???7.649000] ?[<c114095b>] kthread_stop+0x21b/0xd90
>> ?[ ???7.649000] ?[<c12c0b27>] trace_selftest_startup_wakeup+0x3b7/0x4b0
>> ?[ ???7.649000] ?[<c36b1690>] ? wait_for_completion+0x1f0/0x290
>> ?[ ???7.649000] ?[<c12c1d04>] register_tracer+0x2d4/0x700
>> ?[ ???7.649000] ?[<c4bd11a4>] init_wakeup_tracer+0x1d/0xea
>> ?[ ???7.649000] ?[<c4b8f12e>] do_one_initcall+0x278/0x44c
>> ?[ ???7.649000] ?[<c4bd1187>] ? init_function_trace+0x1f8/0x1f8
>> ?[ ???7.649000] ?[<c4b8f565>] kernel_init_freeable+0x263/0x424
>> ?[ ???7.649000] ?[<c3663b4c>] kernel_init+0x1c/0x2a0
>> ?[ ???7.649000] ?[<c36c2181>] ret_from_kernel_thread+0x21/0x30
>> ?[ ???7.649000] ?[<c3663b30>] ? rest_init+0x330/0x330
>> ?[ ???7.649000] Code: da e8 94 96 13 00 8b 04 9d 84 f6 77 c4 83 c0 01 89 04 9d 84 f6 77 c4 89 f8 84 c0 74 4d 83 05 c8 14 f2 c4 01 83 15 cc 14 f2 c4 00 <0f> 0b 83 05 d0 14 f2 c4 01 83 15 d4 14 f2 c4 00 8d b4 26 00 00
>> ?[ ???7.649000] EIP: [<c11969a1>] enqueue_task_dl+0x341/0x1860 SS:ESP 0068:d3345dc0
>> ?[ ???7.649000] ---[ end trace cb4bdebe33135894 ]---
>> ?[ ???7.649000] Kernel panic - not syncing: Fatal exception
>>
>> ?git bisect start 58e0e18cb8b1c3351cb70b3cf78d4e3785a181fe bfa76d49576599a4b9f9b7a71f23d73d6dcff735 --
>> ?git bisect ?bad 30496a920e97b97e16d078c520eef323344ef997 ?# 23:32 ?????0- ????67 ?Merge 'amirv/for-upstream' into devel-xian-smoke-201502152116
>> ?git bisect good d170f4905ee6f93eb12f87d018dd4b9acea18f2c ?# 00:27 ???300+ ???120 ?0day base guard for 'devel-xian-smoke-201502152116'
>> ?git bisect ?bad c5ce28df0e7c01a1de23c36ebdefcd803f2b6cbb ?# 01:11 ?????0- ????72 ?Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
>> ?git bisect good 57ee062e4baa9532e5a3835a2d185590a3fc8bd3 ?# 01:51 ???300+ ????94 ?Merge branch 'r8152'
>> ?git bisect ?bad c08f8467939e7d2eebcba7cf2330242c4f53f2f7 ?# 02:10 ?????0- ???103 ?Merge tag 'pci-v3.20-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
>> ?git bisect ?bad 5b9b28a63f2e47dac5ff3a2503bfe3ade8796aa0 ?# 02:24 ?????0- ?????5 ?Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
>> ?git bisect good 30d46827c2744f56bb31460007f2d16455f10720 ?# 02:41 ???300+ ???107 ?Merge tag 'regulator-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
>> ?git bisect good 8308756f45a12e2ff4f7749c2694fc83cdef0be9 ?# 03:11 ???300+ ???101 ?Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
>> ?git bisect good 4ac30cf74b308fb01338e660d3471cd490a7958a ?# 03:36 ???300+ ????25 ?perf tools: Do not use __perf_session__process_events() directly
>> ?git bisect ?bad 9659e1eeee28f7025b6545934d644d19e9c6e603 ?# 03:48 ?????0- ?????2 ?sched/deadline: Remove cpu_active_mask from cpudl_find()
>> ?git bisect good 3847b272248a3a4ed70d20392cc0454917f7713b ?# 04:23 ???300+ ????89 ?Merge branch 'sched/urgent' into sched/core
>> ?git bisect good 4c195c8a1967ff8bee13a811518a99db04618ab7 ?# 04:59 ???300+ ???118 ?Merge branch 'sched/urgent' into sched/core, to merge fixes before applying new patches
>> ?git bisect ?bad 1019a359d3dc4b64d0e1e5a5efcb725d5e83994d ?# 05:10 ????39- ???193 ?sched/deadline: Fix stale yield state
>> ?git bisect good a7bebf488791aa1036f3e6629daf01d01f705dcb ?# 05:34 ???300+ ???103 ?sched/deadline: Fix hrtick for a non-leftmost task
>> ?# first bad commit: [1019a359d3dc4b64d0e1e5a5efcb725d5e83994d] sched/deadline: Fix stale yield state
>> ?git bisect good a7bebf488791aa1036f3e6629daf01d01f705dcb ?# 05:55 ???900+ ???314 ?sched/deadline: Fix hrtick for a non-leftmost task
>> ?# extra tests with DEBUG_INFO
>> ?git bisect good 1019a359d3dc4b64d0e1e5a5efcb725d5e83994d ?# 06:11 ???900+ ?????0 ?sched/deadline: Fix stale yield state
>> ?# extra tests on HEAD of linux-devel/devel-xian-smoke-201502152116
>> ?git bisect ?bad 58e0e18cb8b1c3351cb70b3cf78d4e3785a181fe ?# 06:11 ?????0- ?????4 ?0day head guard for 'devel-xian-smoke-201502152116'
>> ?# extra tests on tree/branch linus/master
>> ?git bisect ?bad a9724125ad014decf008d782e60447c811391326 ?# 06:11 ?????0- ?????1 ?Merge tag 'tty-3.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
>> ?# extra tests with first bad commit reverted
>> ?# extra tests on tree/branch linus/master
>> ?git bisect ?bad a9724125ad014decf008d782e60447c811391326 ?# 06:16 ?????0- ?????1 ?Merge tag 'tty-3.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
>> ?# extra tests on tree/branch next/master
>> ?git bisect ?bad b8acf73194186a5cba86812eb4ba17b897f0e13e ?# 06:26 ?????0- ????35 ?Add linux-next specific files for 20150213
>>
>> ?This script may reproduce the error.
>>
>> ?----------------------------------------------------------------------------
>> ?#!/bin/bash
>>
>> ?kernel=$1
>>
>> ?kvm=(
>> ?????????qemu-system-x86_64
>> ?????????-cpu kvm64
>> ?????????-enable-kvm
>> ?????????-kernel $kernel
>> ?????????-m 320
>> ?????????-smp 1
>> ?????????-net nic,vlan=1,model=e1000
>> ?????????-net user,vlan=1
>> ?????????-boot order=nc
>> ?????????-no-reboot
>> ?????????-watchdog i6300esb
>> ?????????-rtc base=localtime
>> ?????????-serial stdio
>> ?????????-display none
>> ?????????-monitor null
>> ?)
>>
>> ?append=(
>> ?????????hung_task_panic=1
>> ?????????earlyprintk=ttyS0,115200
>> ?????????rd.udev.log-priority=err
>> ?????????systemd.log_target=journal
>> ?????????systemd.log_level=warning
>> ?????????debug
>> ?????????apic=debug
>> ?????????sysrq_always_enabled
>> ?????????rcupdate.rcu_cpu_stall_timeout=100
>> ?????????panic=-1
>> ?????????softlockup_panic=1
>> ?????????nmi_watchdog=panic
>> ?????????oops=panic
>> ?????????load_ramdisk=2
>> ?????????prompt_ramdisk=0
>> ?????????console=ttyS0,115200
>> ?????????console=tty0
>> ?????????vga=normal
>> ?????????root=/dev/ram0
>> ?????????rw
>> ?????????drbd.minor_count=8
>> ?)
>>
>> ?"${kvm[@]}" --append "${append[*]}"
>> ?----------------------------------------------------------------------------
>
> Don't we have the following problem?
>
> current->state = TASK_INTERRUPTIBLE;
> schedule()
> ????deactivate_task()
> ????????dequeue_task_dl()
> ????????????update_curr_dl()
> ????????????????start_dl_timer()
> ????????????__dequeue_task_dl()
>
> ...timer fires...
>
> dl_task_timer()
> ????enqueue_task_dl() /* queues on dl_rq */
>
> ...try_to_wake_up()
>
> ???enqueue_dl_entity()
> ???????BUG_ON(on_dl_rq())
>
> Compile-tested fix:
>
> ?kernel/sched/deadline.c | 4 ++++
> ?1 file changed, 4 insertions(+)
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index a027799..1b78598 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -541,6 +541,10 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
>
> ?????????sched_clock_tick();
> ?????????update_rq_clock(rq);
> + if (!p->on_rq) {
> + replenish_dl_entity(dl_se, dl_se);
> + goto unlock;
> + }
> ?????????enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
> ?????????if (dl_task(rq->curr))
> ?????????????????check_preempt_curr_dl(rq, p, 0);

We shouldn't enqueue migrating tasks. Please, try this one instead ;)

[PATCH] sched/dl: Prevent enqueue of a sleeping task in dl_task_timer()

A deadline task may be throttled and dequeued at the same time.
This happens, when it becomes throttled in schedule(), which
is called to go to sleep:

current->state = TASK_INTERRUPTIBLE;
schedule()
deactivate_task()
dequeue_task_dl()
update_curr_dl()
start_dl_timer()
__dequeue_task_dl()
prev->on_rq = 0;

Later the timer fires, but the task is still dequeued:

dl_task_timer()
enqueue_task_dl() /* queues on dl_rq; on_rq remains 0 */

Someone wakes it up:

try_to_wake_up()

enqueue_dl_entity()
BUG_ON(on_dl_rq())

Patch fixes this problem, it prevents queueing !on_rq tasks
on dl_rq.

Signed-off-by: Kirill Tkhai <[email protected]>

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index a027799..784ba68 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -541,6 +541,10 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)

sched_clock_tick();
update_rq_clock(rq);
+ if (!task_on_rq_queued(p)) {
+ replenish_dl_entity(dl_se, dl_se);
+ goto unlock;
+ }
enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
if (dl_task(rq->curr))
check_preempt_curr_dl(rq, p, 0);

2015-02-16 12:49:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [sched/deadline] kernel BUG at kernel/sched/deadline.c:805!

On Mon, Feb 16, 2015 at 03:20:38PM +0800, Fengguang Wu wrote:
> Elapsed time: 25
> qemu-system-x86_64 -cpu kvm64 -enable-kvm -kernel /kernel/i386-randconfig-x1-02152111/1019a359d3dc4b64d0e1e5a5efcb725d5e83994d/vmlinuz-3.19.0-rc6-00041-g1019a35

But you 'forgot' to include the .config for this particular randconfig ^

Now I could probably figure it out from the reported crash and dmesg,
but please, just always include the used .config.

2015-02-16 13:08:33

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [sched/deadline] kernel BUG at kernel/sched/deadline.c:805!

On Mon, Feb 16, 2015 at 03:38:34PM +0300, Kirill Tkhai wrote:
> We shouldn't enqueue migrating tasks. Please, try this one instead ;)

Ha, we should amend that task-rq-lock loop for that. See below.

I've not yet tested; going to try and reconstruct a .config that
triggers the oops.

---
Subject: sched/dl: Prevent enqueue of a sleeping task in dl_task_timer()
From: Kirill Tkhai <[email protected]>
Date: Mon, 16 Feb 2015 15:38:34 +0300

A deadline task may be throttled and dequeued at the same time.
This happens, when it becomes throttled in schedule(), which
is called to go to sleep:

current->state = TASK_INTERRUPTIBLE;
schedule()
deactivate_task()
dequeue_task_dl()
update_curr_dl()
start_dl_timer()
__dequeue_task_dl()
prev->on_rq = 0;

Later the timer fires, but the task is still dequeued:

dl_task_timer()
enqueue_task_dl() /* queues on dl_rq; on_rq remains 0 */

Someone wakes it up:

try_to_wake_up()

enqueue_dl_entity()
BUG_ON(on_dl_rq())

Patch fixes this problem, it prevents queueing !on_rq tasks
on dl_rq.

Also teach the rq-lock loop about TASK_ON_RQ_MIGRATING as per
cca26e8009d1 ("sched: Teach scheduler to understand
TASK_ON_RQ_MIGRATING state").

Fixes: 1019a359d3dc ("sched/deadline: Fix stale yield state")
Cc: Ingo Molnar <[email protected]>
Cc: Juri Lelli <[email protected]>
Reported-by: Fengguang Wu <[email protected]>
Signed-off-by: Kirill Tkhai <[email protected]>
[peterz: Wrote comment; fixed task-rq-lock loop]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
---
kernel/sched/deadline.c | 25 ++++++++++++++++++++++---
1 file changed, 22 insertions(+), 3 deletions(-)

--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -515,9 +515,8 @@ static enum hrtimer_restart dl_task_time
again:
rq = task_rq(p);
raw_spin_lock(&rq->lock);
-
- if (rq != task_rq(p)) {
- /* Task was moved, retrying. */
+ if (rq != task_rq(p) || task_on_rq_migrating(p)) {
+ /* Task was move{d,ing}, retry */
raw_spin_unlock(&rq->lock);
goto again;
}
@@ -541,6 +540,26 @@ static enum hrtimer_restart dl_task_time

sched_clock_tick();
update_rq_clock(rq);
+
+ /*
+ * If the throttle happened during sched-out; like:
+ *
+ * schedule()
+ * deactivate_task()
+ * dequeue_task_dl()
+ * update_curr_dl()
+ * start_dl_timer()
+ * __dequeue_task_dl()
+ * prev->on_rq = 0;
+ *
+ * We can be both throttled and !queued. Replenish the counter
+ * but do not enqueue -- wait for our wakeup to do that.
+ */
+ if (!task_on_rq_queued(p)) {
+ replenish_dl_entity(dl_se, dl_se);
+ goto unlock;
+ }
+
enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
if (dl_task(rq->curr))
check_preempt_curr_dl(rq, p, 0);

2015-02-16 13:26:12

by Fengguang Wu

[permalink] [raw]
Subject: Re: [sched/deadline] kernel BUG at kernel/sched/deadline.c:805!

On Mon, Feb 16, 2015 at 01:49:50PM +0100, Peter Zijlstra wrote:
> On Mon, Feb 16, 2015 at 03:20:38PM +0800, Fengguang Wu wrote:
> > Elapsed time: 25
> > qemu-system-x86_64 -cpu kvm64 -enable-kvm -kernel /kernel/i386-randconfig-x1-02152111/1019a359d3dc4b64d0e1e5a5efcb725d5e83994d/vmlinuz-3.19.0-rc6-00041-g1019a35
>
> But you 'forgot' to include the .config for this particular randconfig ^
>
> Now I could probably figure it out from the reported crash and dmesg,
> but please, just always include the used .config.

Sorry it looks the script does not reliably include the config.
Here it is.

Thanks,
Fengguang


Attachments:
(No filename) (616.00 B)
.config (91.97 kB)
Download all attachments

2015-02-16 13:29:54

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [sched/deadline] kernel BUG at kernel/sched/deadline.c:805!

On Mon, Feb 16, 2015 at 09:25:57PM +0800, Fengguang Wu wrote:
> On Mon, Feb 16, 2015 at 01:49:50PM +0100, Peter Zijlstra wrote:
> > On Mon, Feb 16, 2015 at 03:20:38PM +0800, Fengguang Wu wrote:
> > > Elapsed time: 25
> > > qemu-system-x86_64 -cpu kvm64 -enable-kvm -kernel /kernel/i386-randconfig-x1-02152111/1019a359d3dc4b64d0e1e5a5efcb725d5e83994d/vmlinuz-3.19.0-rc6-00041-g1019a35
> >
> > But you 'forgot' to include the .config for this particular randconfig ^
> >
> > Now I could probably figure it out from the reported crash and dmesg,
> > but please, just always include the used .config.
>
> Sorry it looks the script does not reliably include the config.
> Here it is.

Awesome, thanks!

2015-02-16 13:37:38

by Fengguang Wu

[permalink] [raw]
Subject: Re: [sched/deadline] kernel BUG at kernel/sched/deadline.c:805!

On Mon, Feb 16, 2015 at 09:25:57PM +0800, Fengguang Wu wrote:
> On Mon, Feb 16, 2015 at 01:49:50PM +0100, Peter Zijlstra wrote:
> > On Mon, Feb 16, 2015 at 03:20:38PM +0800, Fengguang Wu wrote:
> > > Elapsed time: 25
> > > qemu-system-x86_64 -cpu kvm64 -enable-kvm -kernel /kernel/i386-randconfig-x1-02152111/1019a359d3dc4b64d0e1e5a5efcb725d5e83994d/vmlinuz-3.19.0-rc6-00041-g1019a35
> >
> > But you 'forgot' to include the .config for this particular randconfig ^
> >
> > Now I could probably figure it out from the reported crash and dmesg,
> > but please, just always include the used .config.
>
> Sorry it looks the script does not reliably include the config.

It's the mailing system going wrong somewhere between the robot and me.
The original email copy was good, however attachments get truncated/lost
when reach me.

Thanks,
Fengguang

2015-02-16 14:44:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [sched/deadline] kernel BUG at kernel/sched/deadline.c:805!

On Mon, Feb 16, 2015 at 02:08:21PM +0100, Peter Zijlstra wrote:
> On Mon, Feb 16, 2015 at 03:38:34PM +0300, Kirill Tkhai wrote:
> > We shouldn't enqueue migrating tasks. Please, try this one instead ;)
>
> Ha, we should amend that task-rq-lock loop for that. See below.
>
> I've not yet tested; going to try and reconstruct a .config that
> triggers the oops.

OK, I had to remove -enable-kvm from Wu's script, otherwise the test
would not hit the CBS limit, and then I had to remove the panic on
softlockup because well, qemu is so slow you're going to hit that :-)

But the good news is that I could reproduce and this patch does indeed
seem to solve it.

2015-02-16 15:12:33

by Juri Lelli

[permalink] [raw]
Subject: Re: [sched/deadline] kernel BUG at kernel/sched/deadline.c:805!

Hi,

On 16/02/15 14:44, Peter Zijlstra wrote:
> On Mon, Feb 16, 2015 at 02:08:21PM +0100, Peter Zijlstra wrote:
>> On Mon, Feb 16, 2015 at 03:38:34PM +0300, Kirill Tkhai wrote:
>>> We shouldn't enqueue migrating tasks. Please, try this one instead ;)
>>
>> Ha, we should amend that task-rq-lock loop for that. See below.
>>
>> I've not yet tested; going to try and reconstruct a .config that
>> triggers the oops.
>
> OK, I had to remove -enable-kvm from Wu's script, otherwise the test
> would not hit the CBS limit, and then I had to remove the panic on
> softlockup because well, qemu is so slow you're going to hit that :-)
>
> But the good news is that I could reproduce and this patch does indeed
> seem to solve it.
>

I also tested this on top of tip, with kvm actually. So, 1019a359d3dc
seemed to reintroduce what aee38ea95419 "sched/deadline: Fix races
between rt_mutex_setprio() and dl_task_timer()" fixed. But, this patch
seems to fix it again :).

Thanks,

- Juri

Subject: [tip:sched/core] sched/dl: Prevent enqueue of a sleeping task in dl_task_timer()

Commit-ID: a79ec89fd8459f0de850898f432a2a57d60e64de
Gitweb: http://git.kernel.org/tip/a79ec89fd8459f0de850898f432a2a57d60e64de
Author: Kirill Tkhai <[email protected]>
AuthorDate: Mon, 16 Feb 2015 15:38:34 +0300
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 18 Feb 2015 14:27:31 +0100

sched/dl: Prevent enqueue of a sleeping task in dl_task_timer()

A deadline task may be throttled and dequeued at the same time.
This happens, when it becomes throttled in schedule(), which
is called to go to sleep:

current->state = TASK_INTERRUPTIBLE;
schedule()
deactivate_task()
dequeue_task_dl()
update_curr_dl()
start_dl_timer()
__dequeue_task_dl()
prev->on_rq = 0;

Later the timer fires, but the task is still dequeued:

dl_task_timer()
enqueue_task_dl() /* queues on dl_rq; on_rq remains 0 */

Someone wakes it up:

try_to_wake_up()

enqueue_dl_entity()
BUG_ON(on_dl_rq())

Patch fixes this problem, it prevents queueing !on_rq tasks
on dl_rq.

Reported-by: Fengguang Wu <[email protected]>
Signed-off-by: Kirill Tkhai <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
[ Wrote comment. ]
Cc: Juri Lelli <[email protected]>
Fixes: 1019a359d3dc ("sched/deadline: Fix stale yield state")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/deadline.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index e88847d..9908c95 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -535,6 +535,26 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)

sched_clock_tick();
update_rq_clock(rq);
+
+ /*
+ * If the throttle happened during sched-out; like:
+ *
+ * schedule()
+ * deactivate_task()
+ * dequeue_task_dl()
+ * update_curr_dl()
+ * start_dl_timer()
+ * __dequeue_task_dl()
+ * prev->on_rq = 0;
+ *
+ * We can be both throttled and !queued. Replenish the counter
+ * but do not enqueue -- wait for our wakeup to do that.
+ */
+ if (!task_on_rq_queued(p)) {
+ replenish_dl_entity(dl_se, dl_se);
+ goto unlock;
+ }
+
enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
if (dl_task(rq->curr))
check_preempt_curr_dl(rq, p, 0);

2015-02-18 18:00:27

by Ingo Molnar

[permalink] [raw]
Subject: Re: [sched/deadline] kernel BUG at kernel/sched/deadline.c:805!


* Fengguang Wu <[email protected]> wrote:

> On Mon, Feb 16, 2015 at 09:25:57PM +0800, Fengguang Wu wrote:
> > On Mon, Feb 16, 2015 at 01:49:50PM +0100, Peter Zijlstra wrote:
> > > On Mon, Feb 16, 2015 at 03:20:38PM +0800, Fengguang Wu wrote:
> > > > Elapsed time: 25
> > > > qemu-system-x86_64 -cpu kvm64 -enable-kvm -kernel /kernel/i386-randconfig-x1-02152111/1019a359d3dc4b64d0e1e5a5efcb725d5e83994d/vmlinuz-3.19.0-rc6-00041-g1019a35
> > >
> > > But you 'forgot' to include the .config for this particular randconfig ^
> > >
> > > Now I could probably figure it out from the reported crash and dmesg,
> > > but please, just always include the used .config.
> >
> > Sorry it looks the script does not reliably include the config.
>
> It's the mailing system going wrong somewhere between the
> robot and me. The original email copy was good, however
> attachments get truncated/lost when reach me.

it might make sense to just append them in the body,
instead of an attachment?

Thanks,

Ingo