Greetings,
I got the below dmesg and the first bad commit is
git://git.linaro.org/people/vireshk/linux timer-cleanup-for-tglx
commit 6378cb51af5f4743db0dcb3cbcf862eac5908754
Author: Viresh Kumar <[email protected]>
AuthorDate: Thu Mar 20 14:29:02 2014 +0530
Commit: Viresh Kumar <[email protected]>
CommitDate: Wed Apr 2 14:54:57 2014 +0530
timer: don't migrate pinned timers
migrate_timer() is called when a CPU goes down and its timers are required to be
migrated to some other CPU. Its the responsibility of the users of the timer to
remove it before control reaches to migrate_timers().
As these were the pinned timers, the best we can do is: don't migrate these and
report to the user as well.
That's all this patch does.
Signed-off-by: Viresh Kumar <[email protected]>
===================================================
PARENT COMMIT NOT CLEAN. LOOK OUT FOR WRONG BISECT!
===================================================
Attached one more dmesg for the NULL pointer bug in parent commit.
+--------------------------------------------------------+------------+------------+------------+
| | 5a8530b7c3 | 6378cb51af | 7caf71f403 |
+--------------------------------------------------------+------------+------------+------------+
| boot_successes | 103 | 14 | 10 |
| boot_failures | 17 | 18 | 13 |
| BUG:unable_to_handle_kernel_NULL_pointer_dereference | 16 | | |
| Oops:SMP | 16 | | |
| Kernel_panic-not_syncing:Fatal_exception | 16 | | |
| backtrace:vfs_read | 16 | | |
| backtrace:SyS_read | 16 | | |
| BUG:kernel_test_crashed | 1 | | |
| WARNING:CPU:PID:at_kernel/timer.c:migrate_timer_list() | 0 | 17 | 12 |
| backtrace:vfs_write | 0 | 17 | 12 |
| backtrace:SyS_write | 0 | 17 | 12 |
| BUG:kernel_early_hang_without_any_printk_output | 0 | 1 | 1 |
+--------------------------------------------------------+------------+------------+------------+
[ 74.242293] Unregister pv shared memory for cpu 1
[ 74.273280] smpboot: CPU 1 is now offline
[ 74.274685] ------------[ cut here ]------------
[ 74.275524] WARNING: CPU: 0 PID: 1935 at kernel/timer.c:1621 migrate_timer_list+0xd6/0xf0()
[ 74.275524] migrate_timer_list: can't migrate pinned timer: ffffffff81f06a60, deactivating it
[ 74.275524] Modules linked in:
[ 74.275524] CPU: 0 PID: 1935 Comm: 01-cpu-hotplug Not tainted 3.14.0-rc1-00087-g6378cb5 #1
[ 74.275524] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 74.275524] 0000000000000009 ffff88001bcabc38 ffffffff817237bd ffff88001bcabc80
[ 74.275524] ffff88001bcabc70 ffffffff8106a1dd 00000000000000f0 ffffffff81f06a60
[ 74.275524] ffff88001e04d120 ffffffff81e3d4c0 ffff88001e04d030 ffff88001bcabcd0
[ 74.275524] Call Trace:
[ 74.275524] [<ffffffff817237bd>] dump_stack+0x4d/0x66
[ 74.275524] [<ffffffff8106a1dd>] warn_slowpath_common+0x7d/0xa0
[ 74.275524] [<ffffffff8106a24c>] warn_slowpath_fmt+0x4c/0x50
[ 74.275524] [<ffffffff810761c3>] ? __internal_add_timer+0x113/0x130
[ 74.275524] [<ffffffff81077536>] migrate_timer_list+0xd6/0xf0
[ 74.275524] [<ffffffff810782dc>] timer_cpu_notify+0xfc/0x1f0
[ 74.275524] [<ffffffff8173046c>] notifier_call_chain+0x4c/0x70
[ 74.275524] [<ffffffff8109340e>] __raw_notifier_call_chain+0xe/0x10
[ 74.275524] [<ffffffff8106a3f3>] cpu_notify+0x23/0x50
[ 74.275524] [<ffffffff8106a44e>] cpu_notify_nofail+0xe/0x20
[ 74.275524] [<ffffffff81712a5d>] _cpu_down+0x1ad/0x2e0
[ 74.275524] [<ffffffff81712bc4>] cpu_down+0x34/0x50
[ 74.275524] [<ffffffff813fec54>] cpu_subsys_offline+0x14/0x20
[ 74.275524] [<ffffffff813f9f65>] device_offline+0x95/0xc0
[ 74.275524] [<ffffffff813fa060>] online_store+0x40/0x90
[ 74.275524] [<ffffffff813f75d8>] dev_attr_store+0x18/0x30
[ 74.275524] [<ffffffff8123309d>] sysfs_kf_write+0x3d/0x50
[ 74.275524] [<ffffffff81236f12>] kernfs_fop_write+0xd2/0x140
[ 74.275524] [<ffffffff811be22a>] vfs_write+0xba/0x1e0
[ 74.275524] [<ffffffff811bec09>] SyS_write+0x49/0xa0
[ 74.275524] [<ffffffff81735129>] system_call_fastpath+0x16/0x1b
[ 74.275524] ---[ end trace 9cf1ed50b307bb1a ]---
[ 74.630194] blk-mq: CPU -> queue map
git bisect start 7caf71f403b4758e8e2b2fef1d2e2d7f7a0c5db4 b97f0291a2504291aef850077f98cab68a5a2f33 --
git bisect good fe04ef37d949ea92dbc37164eff116fe3c4a0930 # 01:45 32+ 42 hrtimer: make enqueue_hrtimer() return void
git bisect good de1f862abff16ab0a1058181bb541a55b691e64d # 03:35 32+ 3 hrtimer: Use for_each_active_base() to iterate over active clock bases
git bisect bad 6378cb51af5f4743db0dcb3cbcf862eac5908754 # 04:02 2- 2 timer: don't migrate pinned timers
git bisect good 6b4e6938e2f62e8b7f02453cabff5640f6b4bbdf # 04:10 40+ 2 hrtimer: call switch_hrtimer_base() after setting new expiry time
git bisect good e5535a330ccddbe95760a52bba837703ee16926f # 04:36 40+ 1 hrtimer: fix routine names in comments
git bisect good 5a8530b7c3b1889861949d4e52b2da82d0aff242 # 05:03 40+ 0 timer: track pinned timers with TIMER_PINNED flag
# first bad commit: [6378cb51af5f4743db0dcb3cbcf862eac5908754] timer: don't migrate pinned timers
git bisect good 5a8530b7c3b1889861949d4e52b2da82d0aff242 # 05:13 120+ 17 timer: track pinned timers with TIMER_PINNED flag
git bisect bad 7caf71f403b4758e8e2b2fef1d2e2d7f7a0c5db4 # 05:13 0- 13 sched: don't queue timers on quiesced CPUs
git bisect good 4b22efdd5595f0acb48f02bf664a451ee98f9a2e # 05:29 120+ 1 Add linux-next specific files for 20140403
Thanks,
Fengguang
Thanks Fengguang,
On 4 April 2014 08:49, Fengguang Wu <[email protected]> wrote:
> Greetings,
>
> I got the below dmesg and the first bad commit is
>
> git://git.linaro.org/people/vireshk/linux timer-cleanup-for-tglx
>
> commit 6378cb51af5f4743db0dcb3cbcf862eac5908754
> Author: Viresh Kumar <[email protected]>
> AuthorDate: Thu Mar 20 14:29:02 2014 +0530
> Commit: Viresh Kumar <[email protected]>
> CommitDate: Wed Apr 2 14:54:57 2014 +0530
>
> timer: don't migrate pinned timers
>
> migrate_timer() is called when a CPU goes down and its timers are required to be
> migrated to some other CPU. Its the responsibility of the users of the timer to
> remove it before control reaches to migrate_timers().
>
> As these were the pinned timers, the best we can do is: don't migrate these and
> report to the user as well.
>
> That's all this patch does.
>
> Signed-off-by: Viresh Kumar <[email protected]>
>
> ===================================================
> PARENT COMMIT NOT CLEAN. LOOK OUT FOR WRONG BISECT!
> ===================================================
> Attached one more dmesg for the NULL pointer bug in parent commit.
>
> +--------------------------------------------------------+------------+------------+------------+
> | | 5a8530b7c3 | 6378cb51af | 7caf71f403 |
> +--------------------------------------------------------+------------+------------+------------+
> | boot_successes | 103 | 14 | 10 |
> | boot_failures | 17 | 18 | 13 |
> | BUG:unable_to_handle_kernel_NULL_pointer_dereference | 16 | | |
> | Oops:SMP | 16 | | |
> | Kernel_panic-not_syncing:Fatal_exception | 16 | | |
> | backtrace:vfs_read | 16 | | |
> | backtrace:SyS_read | 16 | | |
> | BUG:kernel_test_crashed | 1 | | |
> | WARNING:CPU:PID:at_kernel/timer.c:migrate_timer_list() | 0 | 17 | 12 |
> | backtrace:vfs_write | 0 | 17 | 12 |
> | backtrace:SyS_write | 0 | 17 | 12 |
> | BUG:kernel_early_hang_without_any_printk_output | 0 | 1 | 1 |
> +--------------------------------------------------------+------------+------------+------------+
>
> [ 74.242293] Unregister pv shared memory for cpu 1
> [ 74.273280] smpboot: CPU 1 is now offline
> [ 74.274685] ------------[ cut here ]------------
> [ 74.275524] WARNING: CPU: 0 PID: 1935 at kernel/timer.c:1621 migrate_timer_list+0xd6/0xf0()
> [ 74.275524] migrate_timer_list: can't migrate pinned timer: ffffffff81f06a60, deactivating it
Hmm, nice. So, my patch hasn't created a bug, but just highlighted it.
I have added this piece of code while migrating timers away:
if (unlikely(WARN(is_pinned,
"%s: can't migrate pinned timer: %p, deactivating it\n",
__func__, timer)))
Which means, migrate all timers to other CPUs when a CPU is going down.
But obviously we can't migrate the pinned timers. And it looks like we
actually were doing that before this commit and things went unnoticed.
But just due to this print, we are highlighting an existing issue here.
@Thomas: So, in a sense my patch is doing some good work now :)
Now, we need to fix the code which queued this pinned timer.
@Fengguang: As I don't have the facilities to reproduce this, can you
help me debugging this? Probably just change this print message to
print the address of timer->function as well and then we can find
the name of routine with help of objdump..
What we need to do finally is to remove this timer before CPU is
going down. I can fix the driver in question once we know which
driver it is.
--
viresh
Hi Viresh,
I changed your print message as you suggested.
diff --git a/kernel/timer.c b/kernel/timer.c
index 6c3a371..193101d 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -1617,8 +1617,8 @@ static void migrate_timer_list(struct tvec_base *new_base, struct list_head *hea
/* Check if CPU still has pinned timers */
if (unlikely(WARN(is_pinned,
- "%s: can't migrate pinned timer: %p, deactivating it\n",
- __func__, timer)))
+ "%s: can't migrate pinned timer: %p, timer->function: %p,deactivating it\n",
+ __func__, timer, timer->function)))
continue;
Then I reproduced the issue, and got the dmesg output,
[ 37.918406] migrate_timer_list: can't migrate pinned timer: ffffffff81f06a60, timer->function: ffffffff810d7010,deactivating it
We reproduced this issue for several times in our LKP system. The address of timer ffffffff81f06a60 is very constant. So is timer->function, I believe.
Hope this information will help you. Please feel free to tell me what else I can do to help you.
Thanks,
Jet
On 04/04/2014 12:56 PM, Viresh Kumar wrote:
> Thanks Fengguang,
>
> On 4 April 2014 08:49, Fengguang Wu <[email protected]> wrote:
>> Greetings,
>>
>> I got the below dmesg and the first bad commit is
>>
>> git://git.linaro.org/people/vireshk/linux timer-cleanup-for-tglx
>>
>> commit 6378cb51af5f4743db0dcb3cbcf862eac5908754
>> Author: Viresh Kumar <[email protected]>
>> AuthorDate: Thu Mar 20 14:29:02 2014 +0530
>> Commit: Viresh Kumar <[email protected]>
>> CommitDate: Wed Apr 2 14:54:57 2014 +0530
>>
>> timer: don't migrate pinned timers
>>
>> migrate_timer() is called when a CPU goes down and its timers are required to be
>> migrated to some other CPU. Its the responsibility of the users of the timer to
>> remove it before control reaches to migrate_timers().
>>
>> As these were the pinned timers, the best we can do is: don't migrate these and
>> report to the user as well.
>>
>> That's all this patch does.
>>
>> Signed-off-by: Viresh Kumar <[email protected]>
>>
>> ===================================================
>> PARENT COMMIT NOT CLEAN. LOOK OUT FOR WRONG BISECT!
>> ===================================================
>> Attached one more dmesg for the NULL pointer bug in parent commit.
>>
>> +--------------------------------------------------------+------------+------------+------------+
>> | | 5a8530b7c3 | 6378cb51af | 7caf71f403 |
>> +--------------------------------------------------------+------------+------------+------------+
>> | boot_successes | 103 | 14 | 10 |
>> | boot_failures | 17 | 18 | 13 |
>> | BUG:unable_to_handle_kernel_NULL_pointer_dereference | 16 | | |
>> | Oops:SMP | 16 | | |
>> | Kernel_panic-not_syncing:Fatal_exception | 16 | | |
>> | backtrace:vfs_read | 16 | | |
>> | backtrace:SyS_read | 16 | | |
>> | BUG:kernel_test_crashed | 1 | | |
>> | WARNING:CPU:PID:at_kernel/timer.c:migrate_timer_list() | 0 | 17 | 12 |
>> | backtrace:vfs_write | 0 | 17 | 12 |
>> | backtrace:SyS_write | 0 | 17 | 12 |
>> | BUG:kernel_early_hang_without_any_printk_output | 0 | 1 | 1 |
>> +--------------------------------------------------------+------------+------------+------------+
>>
>> [ 74.242293] Unregister pv shared memory for cpu 1
>> [ 74.273280] smpboot: CPU 1 is now offline
>> [ 74.274685] ------------[ cut here ]------------
>> [ 74.275524] WARNING: CPU: 0 PID: 1935 at kernel/timer.c:1621 migrate_timer_list+0xd6/0xf0()
>> [ 74.275524] migrate_timer_list: can't migrate pinned timer: ffffffff81f06a60, deactivating it
>
> Hmm, nice. So, my patch hasn't created a bug, but just highlighted it.
> I have added this piece of code while migrating timers away:
>
> if (unlikely(WARN(is_pinned,
> "%s: can't migrate pinned timer: %p, deactivating it\n",
> __func__, timer)))
>
> Which means, migrate all timers to other CPUs when a CPU is going down.
> But obviously we can't migrate the pinned timers. And it looks like we
> actually were doing that before this commit and things went unnoticed.
>
> But just due to this print, we are highlighting an existing issue here.
> @Thomas: So, in a sense my patch is doing some good work now :)
>
> Now, we need to fix the code which queued this pinned timer.
> @Fengguang: As I don't have the facilities to reproduce this, can you
> help me debugging this? Probably just change this print message to
> print the address of timer->function as well and then we can find
> the name of routine with help of objdump..
>
> What we need to do finally is to remove this timer before CPU is
> going down. I can fix the driver in question once we know which
> driver it is.
>
> --
> viresh
>
On 4 April 2014 13:16, Jet Chen <[email protected]> wrote:
> Hi Viresh,
>
> I changed your print message as you suggested.
>
> diff --git a/kernel/timer.c b/kernel/timer.c
> index 6c3a371..193101d 100644
> --- a/kernel/timer.c
> +++ b/kernel/timer.c
> @@ -1617,8 +1617,8 @@ static void migrate_timer_list(struct tvec_base
> *new_base, struct list_head *hea
>
> /* Check if CPU still has pinned timers */
> if (unlikely(WARN(is_pinned,
> - "%s: can't migrate pinned timer: %p,
> deactivating it\n",
> - __func__, timer)))
> + "%s: can't migrate pinned timer: %p,
> timer->function: %p,deactivating it\n",
> + __func__, timer, timer->function)))
> continue;
>
> Then I reproduced the issue, and got the dmesg output,
>
> [ 37.918406] migrate_timer_list: can't migrate pinned timer:
> ffffffff81f06a60, timer->function: ffffffff810d7010,deactivating it
>
> We reproduced this issue for several times in our LKP system. The address of
> timer ffffffff81f06a60 is very constant. So is timer->function, I believe.
>
> Hope this information will help you. Please feel free to tell me what else I
> can do to help you.
Hi Jet,
Thanks a lot. Yes that's pretty helpful.. But I need some more help from you..
I don't have any idea which function has this address in your kernel:
ffffffff810d7010 :)
Can you please debug that a bit more? You need to find which function
this address belongs to. You can try that using objdump on your vmlinux.
Some help can be found here: Documentation/BUG-HUNTING
Thanks in Advance.
On 04/04/2014 03:52 PM, Viresh Kumar wrote:
> On 4 April 2014 13:16, Jet Chen <[email protected]> wrote:
>> Hi Viresh,
>>
>> I changed your print message as you suggested.
>>
>> diff --git a/kernel/timer.c b/kernel/timer.c
>> index 6c3a371..193101d 100644
>> --- a/kernel/timer.c
>> +++ b/kernel/timer.c
>> @@ -1617,8 +1617,8 @@ static void migrate_timer_list(struct tvec_base
>> *new_base, struct list_head *hea
>>
>> /* Check if CPU still has pinned timers */
>> if (unlikely(WARN(is_pinned,
>> - "%s: can't migrate pinned timer: %p,
>> deactivating it\n",
>> - __func__, timer)))
>> + "%s: can't migrate pinned timer: %p,
>> timer->function: %p,deactivating it\n",
>> + __func__, timer, timer->function)))
>> continue;
>>
>> Then I reproduced the issue, and got the dmesg output,
>>
>> [ 37.918406] migrate_timer_list: can't migrate pinned timer:
>> ffffffff81f06a60, timer->function: ffffffff810d7010,deactivating it
>>
>> We reproduced this issue for several times in our LKP system. The address of
>> timer ffffffff81f06a60 is very constant. So is timer->function, I believe.
>>
>> Hope this information will help you. Please feel free to tell me what else I
>> can do to help you.
>
> Hi Jet,
>
> Thanks a lot. Yes that's pretty helpful.. But I need some more help from you..
> I don't have any idea which function has this address in your kernel:
> ffffffff810d7010 :)
>
> Can you please debug that a bit more? You need to find which function
> this address belongs to. You can try that using objdump on your vmlinux.
>
> Some help can be found here: Documentation/BUG-HUNTING
>
> Thanks in Advance.
>
vmlinuz from our build system doesn't have debug information. It is hard
to use objdump to identify which routine is timer->function.
But after several times trials, I get below dmesg messages.
It is clear to see address of "timer->function" is 0xffffffff810d7010.
In calling stack, " [<ffffffff810d7010>] ?
clocksource_watchdog_kthread+0x40/0x40 ". So I guess timer->function is
clocksource_watchdog_kthread.
I manually disable CONFIG_CLOCKSOURCE_WATCHDOG, then I never see this
oops again (But see other oops for other reason :( )
[ 37.918345] WARNING: CPU: 0 PID: 1932 at kernel/timer.c:1621
migrate_timer_list+0xdb/0xf0()
[ 37.918406] migrate_timer_list: can't migrate pinned timer:
ffffffff81f06a60, timer->function: ffffffff810d7010,deactivating it
[ 37.918406] Modules linked in:
[ 37.918406] CPU: 0 PID: 1932 Comm: 01-cpu-hotplug Not tainted
3.14.0-rc1-00088-gab3c4fd #4
[ 37.918406] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 37.918406] 0000000000000009 ffff88001d407c38 ffffffff817237bd
ffff88001d407c80
[ 37.918406] ffff88001d407c70 ffffffff8106a1dd 0000000000000010
ffffffff81f06a60
[ 37.918406] ffff88001e04d040 ffffffff81e3d4c0 ffff88001e04d030
ffff88001d407cd0
[ 37.918406] Call Trace:
[ 37.918406] [<ffffffff817237bd>] dump_stack+0x4d/0x66
[ 37.918406] [<ffffffff8106a1dd>] warn_slowpath_common+0x7d/0xa0
[ 37.918406] [<ffffffff8106a24c>] warn_slowpath_fmt+0x4c/0x50
[ 37.918406] [<ffffffff810761c3>] ? __internal_add_timer+0x113/0x130
[ 37.918406] [<ffffffff810d7010>] ?
clocksource_watchdog_kthread+0x40/0x40
[ 37.918406] [<ffffffff8107753b>] migrate_timer_list+0xdb/0xf0
[ 37.918406] [<ffffffff810782dc>] timer_cpu_notify+0xfc/0x1f0
[ 37.918406] [<ffffffff8173046c>] notifier_call_chain+0x4c/0x70
[ 37.918406] [<ffffffff8109340e>] __raw_notifier_call_chain+0xe/0x10
[ 37.918406] [<ffffffff8106a3f3>] cpu_notify+0x23/0x50
[ 37.918406] [<ffffffff8106a44e>] cpu_notify_nofail+0xe/0x20
[ 37.918406] [<ffffffff81712a5d>] _cpu_down+0x1ad/0x2e0
[ 37.918406] [<ffffffff81712bc4>] cpu_down+0x34/0x50
[ 37.918406] [<ffffffff813fec54>] cpu_subsys_offline+0x14/0x20
[ 37.918406] [<ffffffff813f9f65>] device_offline+0x95/0xc0
[ 37.918406] [<ffffffff813fa060>] online_store+0x40/0x90
[ 37.918406] [<ffffffff813f75d8>] dev_attr_store+0x18/0x30
[ 37.918406] [<ffffffff8123309d>] sysfs_kf_write+0x3d/0x50
On 5 April 2014 10:00, Jet Chen <[email protected]> wrote:
> vmlinuz from our build system doesn't have debug information. It is hard to
> use objdump to identify which routine is timer->function.
I see...
> But after several times trials, I get below dmesg messages.
> It is clear to see address of "timer->function" is 0xffffffff810d7010.
> In calling stack, " [<ffffffff810d7010>] ?
> clocksource_watchdog_kthread+0x40/0x40 ". So I guess timer->function is
> clocksource_watchdog_kthread.
Hmm.. not exactly this function as this isn't timer->function for any timer.
But I think I have found the right function with this hint:
clocksource_watchdog()
Can you please try to test the attached patch, which must fix it.
Untested. I will
then post it with your Tested-by :)
--
viresh
On 04/05/2014 02:26 PM, Viresh Kumar wrote:
> On 5 April 2014 10:00, Jet Chen <[email protected]> wrote:
>> vmlinuz from our build system doesn't have debug information. It is hard to
>> use objdump to identify which routine is timer->function.
>
> I see...
>
>> But after several times trials, I get below dmesg messages.
>> It is clear to see address of "timer->function" is 0xffffffff810d7010.
>> In calling stack, " [<ffffffff810d7010>] ?
>> clocksource_watchdog_kthread+0x40/0x40 ". So I guess timer->function is
>> clocksource_watchdog_kthread.
>
> Hmm.. not exactly this function as this isn't timer->function for any timer.
> But I think I have found the right function with this hint:
> clocksource_watchdog()
>
> Can you please try to test the attached patch, which must fix it.
Your patch fixes it!
> Untested. I will
> then post it with your Tested-by :)
Thank you
>
> --
> viresh
>
On 5 April 2014 12:48, Jet Chen <[email protected]> wrote:
> Your patch fixes it!
Thanks for all your efforts on reporting/testing this patch.. I have
sent the final patch to list.