I'm getting a BUG with current kernels from
kernel/time/clockevents.c:263 when halting the system - a restart
behaves normally. I don't have a good camera handy at the moment to
capture the call stack on screen, but the call sequence is:
clockevents_notify
hrtimer_cpu_notify
notifier_call_chain
raw_notifier_call_chain
_cpu_down
disable_nonboot_cpus
kernel_power_off
sys_reboot
I bisected it down to commit e2912009: sched: Ensure set_task_cpu() is
never called on blocked tasks. There were a few commits tested along
the way where I got a freeze (with the power still on) instead of a
BUG. Reverting that commit from the current kernel doesn't look
trivial, but the commit immediately preceding this one does halt fine.
Marc
On Fri, 2010-01-01 at 19:27 -0500, Marc Dionne wrote:
> I'm getting a BUG with current kernels from
> kernel/time/clockevents.c:263 when halting the system - a restart
> behaves normally. I don't have a good camera handy at the moment to
> capture the call stack on screen, but the call sequence is:
>
> clockevents_notify
> hrtimer_cpu_notify
> notifier_call_chain
> raw_notifier_call_chain
> _cpu_down
> disable_nonboot_cpus
> kernel_power_off
> sys_reboot
>
> I bisected it down to commit e2912009: sched: Ensure set_task_cpu() is
> never called on blocked tasks. There were a few commits tested along
> the way where I got a freeze (with the power still on) instead of a
> BUG. Reverting that commit from the current kernel doesn't look
> trivial, but the commit immediately preceding this one does halt fine.
We somehow seem to trip up the below patch, which doesn't really make
sense, as I can't find how task placement would affect the below error.
It seems to purely test against the hot-unplugged cpu, not a cpu the
task is running on.
---
commit bb6eddf7676e1c1f3e637aa93c5224488d99036f
Author: Thomas Gleixner <[email protected]>
Date: Thu Dec 10 15:35:10 2009 +0100
clockevents: Prevent clockevent_devices list corruption on cpu hotplug
Xiaotian Feng triggered a list corruption in the clock events list on
CPU hotplug and debugged the root cause.
If a CPU registers more than one per cpu clock event device, then only
the active clock event device is removed on CPU_DEAD. The unused
devices are kept in the clock events device list.
On CPU up the clock event devices are registered again, which means
that we list_add an already enqueued list_head. That results in list
corruption.
Resolve this by removing all devices which are associated to the dead
CPU on CPU_DEAD.
Reported-by: Xiaotian Feng <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Xiaotian Feng <[email protected]>
Cc: [email protected]
diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index 20a8920..91db2e3 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -238,8 +238,9 @@ void clockevents_exchange_device(struct clock_event_device *old,
*/
void clockevents_notify(unsigned long reason, void *arg)
{
- struct list_head *node, *tmp;
+ struct clock_event_device *dev, *tmp;
unsigned long flags;
+ int cpu;
spin_lock_irqsave(&clockevents_lock, flags);
clockevents_do_notify(reason, arg);
@@ -250,8 +251,19 @@ void clockevents_notify(unsigned long reason, void *arg)
* Unregister the clock event devices which were
* released from the users in the notify chain.
*/
- list_for_each_safe(node, tmp, &clockevents_released)
- list_del(node);
+ list_for_each_entry_safe(dev, tmp, &clockevents_released, list)
+ list_del(&dev->list);
+ /*
+ * Now check whether the CPU has left unused per cpu devices
+ */
+ cpu = *((int *)arg);
+ list_for_each_entry_safe(dev, tmp, &clockevent_devices, list) {
+ if (cpumask_test_cpu(cpu, dev->cpumask) &&
+ cpumask_weight(dev->cpumask) == 1) {
+ BUG_ON(dev->mode != CLOCK_EVT_MODE_UNUSED);
+ list_del(&dev->list);
+ }
+ }
break;
default:
break;
On Fri, Jan 1, 2010 at 7:42 PM, Peter Zijlstra <[email protected]> wrote:
> On Fri, 2010-01-01 at 19:27 -0500, Marc Dionne wrote:
>> I'm getting a BUG with current kernels from
>> kernel/time/clockevents.c:263 when halting the system - a restart
>> behaves normally. ?I don't have a good camera handy at the moment to
>> capture the call stack on screen, but the call sequence is:
>>
>> clockevents_notify
>> hrtimer_cpu_notify
>> notifier_call_chain
>> raw_notifier_call_chain
>> _cpu_down
>> disable_nonboot_cpus
>> kernel_power_off
>> sys_reboot
>>
>> I bisected it down to commit e2912009: sched: Ensure set_task_cpu() is
>> never called on blocked tasks. ?There were a few commits tested along
>> the way where I got a freeze (with the power still on) instead of a
>> BUG. Reverting that commit from the current kernel doesn't look
>> trivial, but the commit immediately preceding this one does halt fine.
>
> We somehow seem to trip up the below patch, which doesn't really make
> sense, as I can't find how task placement would affect the below error.
>
> It seems to purely test against the hot-unplugged cpu, not a cpu the
> task is running on.
>
> ---
> commit bb6eddf7676e1c1f3e637aa93c5224488d99036f
> Author: Thomas Gleixner <[email protected]>
> Date: ? Thu Dec 10 15:35:10 2009 +0100
Probably predictable but worth testing, reverting that patch does
allow my system to shutdown cleanly.
Marc
On 01/05/2010 02:43 AM, Marc Dionne wrote:
> On Fri, Jan 1, 2010 at 7:42 PM, Peter Zijlstra<[email protected]> wrote:
>> On Fri, 2010-01-01 at 19:27 -0500, Marc Dionne wrote:
>>> I'm getting a BUG with current kernels from
>>> kernel/time/clockevents.c:263 when halting the system - a restart
>>> behaves normally. I don't have a good camera handy at the moment to
>>> capture the call stack on screen, but the call sequence is:
>>>
>>> clockevents_notify
>>> hrtimer_cpu_notify
>>> notifier_call_chain
>>> raw_notifier_call_chain
>>> _cpu_down
>>> disable_nonboot_cpus
>>> kernel_power_off
>>> sys_reboot
>>>
>>> I bisected it down to commit e2912009: sched: Ensure set_task_cpu() is
>>> never called on blocked tasks. There were a few commits tested along
>>> the way where I got a freeze (with the power still on) instead of a
>>> BUG. Reverting that commit from the current kernel doesn't look
>>> trivial, but the commit immediately preceding this one does halt fine.
>>
>> We somehow seem to trip up the below patch, which doesn't really make
>> sense, as I can't find how task placement would affect the below error.
>>
>> It seems to purely test against the hot-unplugged cpu, not a cpu the
>> task is running on.
>>
>> ---
>> commit bb6eddf7676e1c1f3e637aa93c5224488d99036f
>> Author: Thomas Gleixner<[email protected]>
>> Date: Thu Dec 10 15:35:10 2009 +0100
>
> Probably predictable but worth testing, reverting that patch does
> allow my system to shutdown cleanly.
That BUG_ON was removed by reverting that patch, so you can shutdown
cleanly.
Could you please attach you kernel config file? I'm a little confused
about how do you revert e2912009, manually? I can't see any connections
between e2912009 and bb6eddf7, could you please show me your timer list
(cat /proc/timer_list)
>
> Marc
>
On Mon, Jan 4, 2010 at 9:56 PM, Xiaotian Feng <[email protected]> wrote:
> On 01/05/2010 02:43 AM, Marc Dionne wrote:
>>
>> On Fri, Jan 1, 2010 at 7:42 PM, Peter Zijlstra<[email protected]>
>> ?wrote:
>>>
>>> On Fri, 2010-01-01 at 19:27 -0500, Marc Dionne wrote:
>>>>
>>>> I'm getting a BUG with current kernels from
>>>> kernel/time/clockevents.c:263 when halting the system - a restart
>>>> behaves normally. ?I don't have a good camera handy at the moment to
>>>> capture the call stack on screen, but the call sequence is:
>>>>
>>>> clockevents_notify
>>>> hrtimer_cpu_notify
>>>> notifier_call_chain
>>>> raw_notifier_call_chain
>>>> _cpu_down
>>>> disable_nonboot_cpus
>>>> kernel_power_off
>>>> sys_reboot
>>>>
>>>> I bisected it down to commit e2912009: sched: Ensure set_task_cpu() is
>>>> never called on blocked tasks. ?There were a few commits tested along
>>>> the way where I got a freeze (with the power still on) instead of a
>>>> BUG. Reverting that commit from the current kernel doesn't look
>>>> trivial, but the commit immediately preceding this one does halt fine.
>>>
>>> We somehow seem to trip up the below patch, which doesn't really make
>>> sense, as I can't find how task placement would affect the below error.
>>>
>>> It seems to purely test against the hot-unplugged cpu, not a cpu the
>>> task is running on.
>>>
>>> ---
>>> commit bb6eddf7676e1c1f3e637aa93c5224488d99036f
>>> Author: Thomas Gleixner<[email protected]>
>>> Date: ? Thu Dec 10 15:35:10 2009 +0100
>>
>> Probably predictable but worth testing, reverting that patch does
>> allow my system to shutdown cleanly.
>
> That BUG_ON was removed by reverting that patch, so you can shutdown
> cleanly.
>
> Could you please attach you kernel config file? I'm a little confused about
> how do you revert e2912009, manually? I can't see any connections between
> e2912009 and bb6eddf7, could you please show me your timer list (cat
> /proc/timer_list)
config is attached, and the output of cat /proc/timers is also
attached (it's rather large).
To recap:
- Reverting bb6eddf7 gives me a clean shutdown - predictable of course
since it removes the BUG_ON
- I wasn't able to trivially revert e2912009 from a current kernel.
But it fails to shutdown while the preceding commit is OK.
So it would seem that e2912009 is triggering something that the check
in bb6eddf7 is catching.
With more recent kernels (but not the ones around e2912009), I do get
these timer-related warnings in dmesg (and briefly on screen) :
PCSP: Timer resolution is not sufficient (999848nS)
PCSP: Make sure you have HPET and ACPI enabled.
PCSP: Turned into nopcm mode.
Marc
On 01/05/2010 11:23 AM, Marc Dionne wrote:
> On Mon, Jan 4, 2010 at 9:56 PM, Xiaotian Feng<[email protected]> wrote:
>> On 01/05/2010 02:43 AM, Marc Dionne wrote:
>>>
>>> On Fri, Jan 1, 2010 at 7:42 PM, Peter Zijlstra<[email protected]>
>>> wrote:
>>>>
>>>> On Fri, 2010-01-01 at 19:27 -0500, Marc Dionne wrote:
>>>>>
>>>>> I'm getting a BUG with current kernels from
>>>>> kernel/time/clockevents.c:263 when halting the system - a restart
>>>>> behaves normally. I don't have a good camera handy at the moment to
>>>>> capture the call stack on screen, but the call sequence is:
>>>>>
>>>>> clockevents_notify
>>>>> hrtimer_cpu_notify
>>>>> notifier_call_chain
>>>>> raw_notifier_call_chain
>>>>> _cpu_down
>>>>> disable_nonboot_cpus
>>>>> kernel_power_off
>>>>> sys_reboot
>>>>>
>>>>> I bisected it down to commit e2912009: sched: Ensure set_task_cpu() is
>>>>> never called on blocked tasks. There were a few commits tested along
>>>>> the way where I got a freeze (with the power still on) instead of a
>>>>> BUG. Reverting that commit from the current kernel doesn't look
>>>>> trivial, but the commit immediately preceding this one does halt fine.
>>>>
>>>> We somehow seem to trip up the below patch, which doesn't really make
>>>> sense, as I can't find how task placement would affect the below error.
>>>>
>>>> It seems to purely test against the hot-unplugged cpu, not a cpu the
>>>> task is running on.
>>>>
>>>> ---
>>>> commit bb6eddf7676e1c1f3e637aa93c5224488d99036f
>>>> Author: Thomas Gleixner<[email protected]>
>>>> Date: Thu Dec 10 15:35:10 2009 +0100
>>>
>>> Probably predictable but worth testing, reverting that patch does
>>> allow my system to shutdown cleanly.
>>
>> That BUG_ON was removed by reverting that patch, so you can shutdown
>> cleanly.
>>
>> Could you please attach you kernel config file? I'm a little confused about
>> how do you revert e2912009, manually? I can't see any connections between
>> e2912009 and bb6eddf7, could you please show me your timer list (cat
>> /proc/timer_list)
>
> config is attached, and the output of cat /proc/timers is also
> attached (it's rather large).
>
> To recap:
> - Reverting bb6eddf7 gives me a clean shutdown - predictable of course
> since it removes the BUG_ON
> - I wasn't able to trivially revert e2912009 from a current kernel.
> But it fails to shutdown while the preceding commit is OK.
>
> So it would seem that e2912009 is triggering something that the check
> in bb6eddf7 is catching.
>
> With more recent kernels (but not the ones around e2912009), I do get
> these timer-related warnings in dmesg (and briefly on screen) :
>
> PCSP: Timer resolution is not sufficient (999848nS)
> PCSP: Make sure you have HPET and ACPI enabled.
> PCSP: Turned into nopcm mode.
>
This is outputed by sound module, but it will not affect clockevents,
could you please try following patch and let me know the output before
BUG_ON happens? We can gather more information on the BUG_ON. Thank you.
diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index 6f740d9..7c945e8 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -260,6 +260,9 @@ void clockevents_notify(unsigned long reason, void *arg)
list_for_each_entry_safe(dev, tmp, &clockevent_devices,
list) {
if (cpumask_test_cpu(cpu, dev->cpumask) &&
cpumask_weight(dev->cpumask) == 1) {
+ if (dev->mode != CLOCK_EVT_MODE_UNUSED)
+ printk("invalid dev %s mode %d
on cpu %d\n", dev->name,
+ dev->mode, cpu);
BUG_ON(dev->mode != CLOCK_EVT_MODE_UNUSED);
list_del(&dev->list);
> Marc
On Tue, Jan 5, 2010 at 5:18 AM, Xiaotian Feng <[email protected]> wrote:
> This is outputed by sound module, but it will not affect clockevents, could
> you please try following patch and let me know the output before BUG_ON
> happens? We can gather more information on the BUG_ON. Thank you.
>
> diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
> index 6f740d9..7c945e8 100644
> --- a/kernel/time/clockevents.c
> +++ b/kernel/time/clockevents.c
> @@ -260,6 +260,9 @@ void clockevents_notify(unsigned long reason, void *arg)
> ? ? ? ? ? ? ? ?list_for_each_entry_safe(dev, tmp, &clockevent_devices, list)
> {
> ? ? ? ? ? ? ? ? ? ? ? ?if (cpumask_test_cpu(cpu, dev->cpumask) &&
> ? ? ? ? ? ? ? ? ? ? ? ? ? ?cpumask_weight(dev->cpumask) == 1) {
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? if (dev->mode != CLOCK_EVT_MODE_UNUSED)
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? printk("invalid dev %s mode %d on
> cpu %d\n", dev->name,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? dev->mode, cpu);
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?BUG_ON(dev->mode != CLOCK_EVT_MODE_UNUSED);
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?list_del(&dev->list);
I don't get anything on screen from the printk - is there a trick
needed to getting printk output at that stage of shutting down? I
tried inserting an mdelay() before the BUG, which delayed the bug
output but still didn't print the invalid dev message.
Thanks,
Marc
On 01/06/2010 06:58 AM, Marc Dionne wrote:
> On Tue, Jan 5, 2010 at 5:18 AM, Xiaotian Feng<[email protected]> wrote:
>> This is outputed by sound module, but it will not affect clockevents, could
>> you please try following patch and let me know the output before BUG_ON
>> happens? We can gather more information on the BUG_ON. Thank you.
>>
>> diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
>> index 6f740d9..7c945e8 100644
>> --- a/kernel/time/clockevents.c
>> +++ b/kernel/time/clockevents.c
>> @@ -260,6 +260,9 @@ void clockevents_notify(unsigned long reason, void *arg)
>> list_for_each_entry_safe(dev, tmp,&clockevent_devices, list)
>> {
>> if (cpumask_test_cpu(cpu, dev->cpumask)&&
>> cpumask_weight(dev->cpumask) == 1) {
>> + if (dev->mode != CLOCK_EVT_MODE_UNUSED)
>> + printk("invalid dev %s mode %d on
>> cpu %d\n", dev->name,
>> + dev->mode, cpu);
>> BUG_ON(dev->mode != CLOCK_EVT_MODE_UNUSED);
>> list_del(&dev->list);
>
> I don't get anything on screen from the printk - is there a trick
> needed to getting printk output at that stage of shutting down? I
> tried inserting an mdelay() before the BUG, which delayed the bug
> output but still didn't print the invalid dev message.
Did you notice this BUG when you're doing suspend/resume?
Does the BUG still appear if we changed BUG_ON line to BUG_ON(dev->mode
!= CLOCK_EVT_MODE_UNUSED && dev->mode != CLOCK_EVT_MODE_SHUTDOWN)?
>
> Thanks,
> Marc
>
On Wed, Jan 6, 2010 at 4:42 AM, Xiaotian Feng <[email protected]> wrote:
> On 01/06/2010 06:58 AM, Marc Dionne wrote:
>>
>> On Tue, Jan 5, 2010 at 5:18 AM, Xiaotian Feng<[email protected]> ?wrote:
>>>
>>> This is outputed by sound module, but it will not affect clockevents,
>>> could
>>> you please try following patch and let me know the output before BUG_ON
>>> happens? We can gather more information on the BUG_ON. Thank you.
>>>
>>> diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
>>> index 6f740d9..7c945e8 100644
>>> --- a/kernel/time/clockevents.c
>>> +++ b/kernel/time/clockevents.c
>>> @@ -260,6 +260,9 @@ void clockevents_notify(unsigned long reason, void
>>> *arg)
>>> ? ? ? ? ? ? ? ?list_for_each_entry_safe(dev, tmp,&clockevent_devices,
>>> list)
>>> {
>>> ? ? ? ? ? ? ? ? ? ? ? ?if (cpumask_test_cpu(cpu, dev->cpumask)&&
>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ?cpumask_weight(dev->cpumask) == 1) {
>>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? if (dev->mode != CLOCK_EVT_MODE_UNUSED)
>>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? printk("invalid dev %s mode %d on
>>> cpu %d\n", dev->name,
>>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? dev->mode, cpu);
>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?BUG_ON(dev->mode !=
>>> CLOCK_EVT_MODE_UNUSED);
>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?list_del(&dev->list);
>>
>> I don't get anything on screen from the printk - is there a trick
>> needed to getting printk output at that stage of shutting down? ?I
>> tried inserting an mdelay() before the BUG, which delayed the bug
>> output but still didn't print the invalid dev message.
>
> Did you notice this BUG when you're doing suspend/resume?
>
> Does the BUG still appear if we changed BUG_ON line to BUG_ON(dev->mode !=
> CLOCK_EVT_MODE_UNUSED && dev->mode != CLOCK_EVT_MODE_SHUTDOWN)?
I only see the BUG on halt - reboot works normally and suspend
actually freezes and doesn't suspend, but that's perhaps unrelated.
I managed to get your suggested printk to work by adding KERN_CRIT
(otherwise I got no output), and the offending dev is:
"hpet", mode 3 (CLOCK_EVT_MODE_ONESHOT?), cpu 4.
Marc
On 01/07/2010 08:44 AM, Marc Dionne wrote:
> On Wed, Jan 6, 2010 at 4:42 AM, Xiaotian Feng<[email protected]> wrote:
>> On 01/06/2010 06:58 AM, Marc Dionne wrote:
>>>
>>> On Tue, Jan 5, 2010 at 5:18 AM, Xiaotian Feng<[email protected]> wrote:
>>>>
>>>> This is outputed by sound module, but it will not affect clockevents,
>>>> could
>>>> you please try following patch and let me know the output before BUG_ON
>>>> happens? We can gather more information on the BUG_ON. Thank you.
>>>>
>>>> diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
>>>> index 6f740d9..7c945e8 100644
>>>> --- a/kernel/time/clockevents.c
>>>> +++ b/kernel/time/clockevents.c
>>>> @@ -260,6 +260,9 @@ void clockevents_notify(unsigned long reason, void
>>>> *arg)
>>>> list_for_each_entry_safe(dev, tmp,&clockevent_devices,
>>>> list)
>>>> {
>>>> if (cpumask_test_cpu(cpu, dev->cpumask)&&
>>>> cpumask_weight(dev->cpumask) == 1) {
>>>> + if (dev->mode != CLOCK_EVT_MODE_UNUSED)
>>>> + printk("invalid dev %s mode %d on
>>>> cpu %d\n", dev->name,
>>>> + dev->mode, cpu);
>>>> BUG_ON(dev->mode !=
>>>> CLOCK_EVT_MODE_UNUSED);
>>>> list_del(&dev->list);
>>>
>>> I don't get anything on screen from the printk - is there a trick
>>> needed to getting printk output at that stage of shutting down? I
>>> tried inserting an mdelay() before the BUG, which delayed the bug
>>> output but still didn't print the invalid dev message.
>>
>> Did you notice this BUG when you're doing suspend/resume?
>>
>> Does the BUG still appear if we changed BUG_ON line to BUG_ON(dev->mode !=
>> CLOCK_EVT_MODE_UNUSED&& dev->mode != CLOCK_EVT_MODE_SHUTDOWN)?
>
> I only see the BUG on halt - reboot works normally and suspend
> actually freezes and doesn't suspend, but that's perhaps unrelated.
>
> I managed to get your suggested printk to work by adding KERN_CRIT
> (otherwise I got no output), and the offending dev is:
> "hpet", mode 3 (CLOCK_EVT_MODE_ONESHOT?), cpu 4.
It looks like kernel is trying to remove broadcast device, could you
please try following patch?
diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index 6f740d9..d7395fd 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -259,7 +259,8 @@ void clockevents_notify(unsigned long reason, void *arg)
cpu = *((int *)arg);
list_for_each_entry_safe(dev, tmp, &clockevent_devices,
list) {
if (cpumask_test_cpu(cpu, dev->cpumask) &&
- cpumask_weight(dev->cpumask) == 1) {
+ cpumask_weight(dev->cpumask) == 1 &&
+ !tick_is_broadcast_device(dev)) {
BUG_ON(dev->mode != CLOCK_EVT_MODE_UNUSED);
list_del(&dev->list);
}
>
> Marc
>
On Wed, Jan 6, 2010 at 9:51 PM, Xiaotian Feng <[email protected]> wrote:
> On 01/07/2010 08:44 AM, Marc Dionne wrote:
>>
>> On Wed, Jan 6, 2010 at 4:42 AM, Xiaotian Feng<[email protected]> ?wrote:
>>>
>>> On 01/06/2010 06:58 AM, Marc Dionne wrote:
>>>>
>>>> On Tue, Jan 5, 2010 at 5:18 AM, Xiaotian Feng<[email protected]>
>>>> ?wrote:
>>>>>
>>>>> This is outputed by sound module, but it will not affect clockevents,
>>>>> could
>>>>> you please try following patch and let me know the output before BUG_ON
>>>>> happens? We can gather more information on the BUG_ON. Thank you.
>>>>>
>>>>> diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
>>>>> index 6f740d9..7c945e8 100644
>>>>> --- a/kernel/time/clockevents.c
>>>>> +++ b/kernel/time/clockevents.c
>>>>> @@ -260,6 +260,9 @@ void clockevents_notify(unsigned long reason, void
>>>>> *arg)
>>>>> ? ? ? ? ? ? ? ?list_for_each_entry_safe(dev, tmp,&clockevent_devices,
>>>>> list)
>>>>> {
>>>>> ? ? ? ? ? ? ? ? ? ? ? ?if (cpumask_test_cpu(cpu, dev->cpumask)&&
>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ?cpumask_weight(dev->cpumask) == 1) {
>>>>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? if (dev->mode != CLOCK_EVT_MODE_UNUSED)
>>>>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? printk("invalid dev %s mode %d
>>>>> on
>>>>> cpu %d\n", dev->name,
>>>>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? dev->mode, cpu);
>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?BUG_ON(dev->mode !=
>>>>> CLOCK_EVT_MODE_UNUSED);
>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?list_del(&dev->list);
>>>>
>>>> I don't get anything on screen from the printk - is there a trick
>>>> needed to getting printk output at that stage of shutting down? ?I
>>>> tried inserting an mdelay() before the BUG, which delayed the bug
>>>> output but still didn't print the invalid dev message.
>>>
>>> Did you notice this BUG when you're doing suspend/resume?
>>>
>>> Does the BUG still appear if we changed BUG_ON line to BUG_ON(dev->mode
>>> !=
>>> CLOCK_EVT_MODE_UNUSED&& ?dev->mode != CLOCK_EVT_MODE_SHUTDOWN)?
>>
>> I only see the BUG on halt - reboot works normally and suspend
>> actually freezes and doesn't suspend, but that's perhaps unrelated.
>>
>> I managed to get your suggested printk to work by adding KERN_CRIT
>> (otherwise I got no output), and the offending dev is:
>> ?"hpet", mode 3 (CLOCK_EVT_MODE_ONESHOT?), cpu 4.
>
> It looks like kernel is trying to remove broadcast device, could you please
> try following patch?
>
> diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
> index 6f740d9..d7395fd 100644
> --- a/kernel/time/clockevents.c
> +++ b/kernel/time/clockevents.c
> @@ -259,7 +259,8 @@ void clockevents_notify(unsigned long reason, void *arg)
> ? ? ? ? ? ? ? ?cpu = *((int *)arg);
> ? ? ? ? ? ? ? ?list_for_each_entry_safe(dev, tmp, &clockevent_devices, list)
> {
> ? ? ? ? ? ? ? ? ? ? ? ?if (cpumask_test_cpu(cpu, dev->cpumask) &&
> - ? ? ? ? ? ? ? ? ? ? ? ? ? cpumask_weight(dev->cpumask) == 1) {
> + ? ? ? ? ? ? ? ? ? ? ? ? ? cpumask_weight(dev->cpumask) == 1 &&
> + ? ? ? ? ? ? ? ? ? ? ? ? ? !tick_is_broadcast_device(dev)) {
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?BUG_ON(dev->mode != CLOCK_EVT_MODE_UNUSED);
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?list_del(&dev->list);
> ? ? ? ? ? ? ? ? ? ? ? ?}
That works - no problem shutting down with that patch applied.
Marc
On Wed, Jan 6, 2010 at 10:07 PM, Marc Dionne <[email protected]> wrote:
> On Wed, Jan 6, 2010 at 9:51 PM, Xiaotian Feng <[email protected]> wrote:
>> On 01/07/2010 08:44 AM, Marc Dionne wrote:
>>>
>>> On Wed, Jan 6, 2010 at 4:42 AM, Xiaotian Feng<[email protected]> ?wrote:
>>>>
>>>> On 01/06/2010 06:58 AM, Marc Dionne wrote:
>>>>>
>>>>> On Tue, Jan 5, 2010 at 5:18 AM, Xiaotian Feng<[email protected]>
>>>>> ?wrote:
>>>>>>
>>>>>> This is outputed by sound module, but it will not affect clockevents,
>>>>>> could
>>>>>> you please try following patch and let me know the output before BUG_ON
>>>>>> happens? We can gather more information on the BUG_ON. Thank you.
>>>>>>
>>>>>> diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
>>>>>> index 6f740d9..7c945e8 100644
>>>>>> --- a/kernel/time/clockevents.c
>>>>>> +++ b/kernel/time/clockevents.c
>>>>>> @@ -260,6 +260,9 @@ void clockevents_notify(unsigned long reason, void
>>>>>> *arg)
>>>>>> ? ? ? ? ? ? ? ?list_for_each_entry_safe(dev, tmp,&clockevent_devices,
>>>>>> list)
>>>>>> {
>>>>>> ? ? ? ? ? ? ? ? ? ? ? ?if (cpumask_test_cpu(cpu, dev->cpumask)&&
>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ?cpumask_weight(dev->cpumask) == 1) {
>>>>>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? if (dev->mode != CLOCK_EVT_MODE_UNUSED)
>>>>>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? printk("invalid dev %s mode %d
>>>>>> on
>>>>>> cpu %d\n", dev->name,
>>>>>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? dev->mode, cpu);
>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?BUG_ON(dev->mode !=
>>>>>> CLOCK_EVT_MODE_UNUSED);
>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?list_del(&dev->list);
>>>>>
>>>>> I don't get anything on screen from the printk - is there a trick
>>>>> needed to getting printk output at that stage of shutting down? ?I
>>>>> tried inserting an mdelay() before the BUG, which delayed the bug
>>>>> output but still didn't print the invalid dev message.
>>>>
>>>> Did you notice this BUG when you're doing suspend/resume?
>>>>
>>>> Does the BUG still appear if we changed BUG_ON line to BUG_ON(dev->mode
>>>> !=
>>>> CLOCK_EVT_MODE_UNUSED&& ?dev->mode != CLOCK_EVT_MODE_SHUTDOWN)?
>>>
>>> I only see the BUG on halt - reboot works normally and suspend
>>> actually freezes and doesn't suspend, but that's perhaps unrelated.
>>>
>>> I managed to get your suggested printk to work by adding KERN_CRIT
>>> (otherwise I got no output), and the offending dev is:
>>> ?"hpet", mode 3 (CLOCK_EVT_MODE_ONESHOT?), cpu 4.
>>
>> It looks like kernel is trying to remove broadcast device, could you please
>> try following patch?
>>
>> diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
>> index 6f740d9..d7395fd 100644
>> --- a/kernel/time/clockevents.c
>> +++ b/kernel/time/clockevents.c
>> @@ -259,7 +259,8 @@ void clockevents_notify(unsigned long reason, void *arg)
>> ? ? ? ? ? ? ? ?cpu = *((int *)arg);
>> ? ? ? ? ? ? ? ?list_for_each_entry_safe(dev, tmp, &clockevent_devices, list)
>> {
>> ? ? ? ? ? ? ? ? ? ? ? ?if (cpumask_test_cpu(cpu, dev->cpumask) &&
>> - ? ? ? ? ? ? ? ? ? ? ? ? ? cpumask_weight(dev->cpumask) == 1) {
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? cpumask_weight(dev->cpumask) == 1 &&
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? !tick_is_broadcast_device(dev)) {
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?BUG_ON(dev->mode != CLOCK_EVT_MODE_UNUSED);
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?list_del(&dev->list);
>> ? ? ? ? ? ? ? ? ? ? ? ?}
>
> That works - no problem shutting down with that patch applied.
And after doing a bit more testing, it turns out that applying this
patch also makes suspend/resume work normally again, so it looks like
the hang on suspend was also related to this.
Thanks,
Marc
On 01/07/2010 11:20 AM, Marc Dionne wrote:
> On Wed, Jan 6, 2010 at 10:07 PM, Marc Dionne<[email protected]> wrote:
>> On Wed, Jan 6, 2010 at 9:51 PM, Xiaotian Feng<[email protected]> wrote:
>>> On 01/07/2010 08:44 AM, Marc Dionne wrote:
>>>>
>>>> On Wed, Jan 6, 2010 at 4:42 AM, Xiaotian Feng<[email protected]> wrote:
>>>>>
>>>>> On 01/06/2010 06:58 AM, Marc Dionne wrote:
>>>>>>
>>>>>> On Tue, Jan 5, 2010 at 5:18 AM, Xiaotian Feng<[email protected]>
>>>>>> wrote:
>>>>>>>
>>>>>>> This is outputed by sound module, but it will not affect clockevents,
>>>>>>> could
>>>>>>> you please try following patch and let me know the output before BUG_ON
>>>>>>> happens? We can gather more information on the BUG_ON. Thank you.
>>>>>>>
>>>>>>> diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
>>>>>>> index 6f740d9..7c945e8 100644
>>>>>>> --- a/kernel/time/clockevents.c
>>>>>>> +++ b/kernel/time/clockevents.c
>>>>>>> @@ -260,6 +260,9 @@ void clockevents_notify(unsigned long reason, void
>>>>>>> *arg)
>>>>>>> list_for_each_entry_safe(dev, tmp,&clockevent_devices,
>>>>>>> list)
>>>>>>> {
>>>>>>> if (cpumask_test_cpu(cpu, dev->cpumask)&&
>>>>>>> cpumask_weight(dev->cpumask) == 1) {
>>>>>>> + if (dev->mode != CLOCK_EVT_MODE_UNUSED)
>>>>>>> + printk("invalid dev %s mode %d
>>>>>>> on
>>>>>>> cpu %d\n", dev->name,
>>>>>>> + dev->mode, cpu);
>>>>>>> BUG_ON(dev->mode !=
>>>>>>> CLOCK_EVT_MODE_UNUSED);
>>>>>>> list_del(&dev->list);
>>>>>>
>>>>>> I don't get anything on screen from the printk - is there a trick
>>>>>> needed to getting printk output at that stage of shutting down? I
>>>>>> tried inserting an mdelay() before the BUG, which delayed the bug
>>>>>> output but still didn't print the invalid dev message.
>>>>>
>>>>> Did you notice this BUG when you're doing suspend/resume?
>>>>>
>>>>> Does the BUG still appear if we changed BUG_ON line to BUG_ON(dev->mode
>>>>> !=
>>>>> CLOCK_EVT_MODE_UNUSED&& dev->mode != CLOCK_EVT_MODE_SHUTDOWN)?
>>>>
>>>> I only see the BUG on halt - reboot works normally and suspend
>>>> actually freezes and doesn't suspend, but that's perhaps unrelated.
>>>>
>>>> I managed to get your suggested printk to work by adding KERN_CRIT
>>>> (otherwise I got no output), and the offending dev is:
>>>> "hpet", mode 3 (CLOCK_EVT_MODE_ONESHOT?), cpu 4.
>>>
>>> It looks like kernel is trying to remove broadcast device, could you please
>>> try following patch?
>>>
>>> diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
>>> index 6f740d9..d7395fd 100644
>>> --- a/kernel/time/clockevents.c
>>> +++ b/kernel/time/clockevents.c
>>> @@ -259,7 +259,8 @@ void clockevents_notify(unsigned long reason, void *arg)
>>> cpu = *((int *)arg);
>>> list_for_each_entry_safe(dev, tmp,&clockevent_devices, list)
>>> {
>>> if (cpumask_test_cpu(cpu, dev->cpumask)&&
>>> - cpumask_weight(dev->cpumask) == 1) {
>>> + cpumask_weight(dev->cpumask) == 1&&
>>> + !tick_is_broadcast_device(dev)) {
>>> BUG_ON(dev->mode != CLOCK_EVT_MODE_UNUSED);
>>> list_del(&dev->list);
>>> }
>>
>> That works - no problem shutting down with that patch applied.
>
> And after doing a bit more testing, it turns out that applying this
> patch also makes suspend/resume work normally again, so it looks like
> the hang on suspend was also related to this.
Thanks for the testing, patch is sent to upstream :-)
>
> Thanks,
> Marc
>