2009-12-10 13:08:35

by Xiaotian Feng

Subject: [RFC PATCH 0/4] clockevents: fix clockevent_devices list corruption after cpu hotplug

I've hit a list_del corruption, which was reported in
http://lkml.org/lkml/2009/11/27/45. Since there was no response, I tried
to debug it myself.

After adding some printks to show all elements in clockevent_devices, I
found that the kernel hangs when I try to resume from s2ram.

In clockevents_register_device(), the CLOCK_EVT_NOTIFY_ADD notification
is always followed by clockevents_notify_released(). Although the ADD
notification uses tick_check_new_device() to install the new device and
move replaced devices to the clockevents_released list,
clockevents_notify_released() then adds them back to the
clockevent_devices list.

My system is a quad-core x86_64 with apic and hpet enabled. After boot,
the elements in the clockevent_devices list are:
clockevent_devices -> lapic(3) -> hpet5(3) -> lapic(2) -> hpet4(2) ->
lapic(1) -> hpet3(1) -> lapic(0) -> hpet2(0) -> hpet(0)
* (n) denotes the cpu id

But the active clock_event_devices are hpet2, hpet3, hpet4 and hpet5.
During s2ram, cpus 1, 2 and 3 go down, the CLOCK_EVT_NOTIFY_CPU_DEAD
notification calls tick_shutdown(), and hpet3, hpet4 and hpet5 are
deleted from the clockevent_devices list. So after s2ram, the elements
in the clockevent_devices list are:
clockevent_devices -> lapic(3) -> lapic(2) -> lapic(1) -> lapic(0) ->
hpet2(0) -> hpet(0)

Then at resume, cpus 1, 2 and 3 come up and register their lapics again,
which list_add()s each lapic onto the clockevent_devices list even though
it is already there. E.g. after list_add of lapic(1) on the above list,
lapic(1) moves to clockevent_devices->next, but lapic(2)->next still
points to lapic(1), so the list becomes circular and is corrupted.
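
The effect is easy to reproduce outside the kernel. A minimal userspace
sketch, with hypothetical stand-ins for the kernel's struct list_head
and list_add() (illustration only):

#include <stdio.h>

/* Minimal stand-ins for the kernel's struct list_head and list_add(). */
struct list_head {
        struct list_head *next, *prev;
};

static void list_add(struct list_head *new, struct list_head *head)
{
        new->next = head->next;
        new->prev = head;
        head->next->prev = new;
        head->next = new;
}

int main(void)
{
        struct list_head head = { &head, &head };
        struct list_head lapic1, lapic2;

        list_add(&lapic1, &head);       /* head -> lapic1 */
        list_add(&lapic2, &head);       /* head -> lapic2 -> lapic1 */

        /* Re-register lapic1 without deleting it first, as happens on
         * cpu-up after s2ram: */
        list_add(&lapic1, &head);

        /* Now head -> lapic1 -> lapic2 -> lapic1 -> ...: lapic2->next
         * still points at lapic1, so a forward walk never gets back to
         * head. Both prints show 1. */
        printf("lapic1.next == &lapic2: %d\n", lapic1.next == &lapic2);
        printf("lapic2.next == &lapic1: %d\n", lapic2.next == &lapic1);
        return 0;
}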

This patchset fixes the above behaviour by:
- on clockevents_register_device, if the ADD notify succeeds, adding the
new device to the clockevent_devices list, otherwise putting it on the
clockevents_released list.
- on clockevents_notify_released, the same behaviour as above.
- on clockevents_notify CPU_DEAD, removing the devices of the dead cpu
from the clockevents_released list.

This makes sure that only the active devices of each cpu are on the
clockevent_devices list. With this patchset the list_del corruption
disappears, and suspend/resume and cpu hotplug work fine on my system.


2009-12-10 13:08:17

by Xiaotian Feng

Subject: [PATCH 1/4] clockevents: use list_for_each_entry_safe

The iterating behaviour is the same as before, but later patches will
benefit from this conversion.

Signed-off-by: Xiaotian Feng <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Magnus Damm <[email protected]>
Cc: H Hartley Sweeten <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Venkatesh Pallipadi <[email protected]>
---
kernel/time/clockevents.c | 12 +++++-------
1 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index 20a8920..1896d9d 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -163,11 +163,9 @@ static void clockevents_do_notify(unsigned long reason, void *dev)
*/
static void clockevents_notify_released(void)
{
- struct clock_event_device *dev;
+ struct clock_event_device *dev, *tmp;

- while (!list_empty(&clockevents_released)) {
- dev = list_entry(clockevents_released.next,
- struct clock_event_device, list);
+ list_for_each_entry_safe(dev, tmp, &clockevents_released, list) {
list_del(&dev->list);
list_add(&dev->list, &clockevent_devices);
clockevents_do_notify(CLOCK_EVT_NOTIFY_ADD, dev);
@@ -238,7 +236,7 @@ void clockevents_exchange_device(struct clock_event_device *old,
*/
void clockevents_notify(unsigned long reason, void *arg)
{
- struct list_head *node, *tmp;
+ struct clock_event_device *dev, *tmp;
unsigned long flags;

spin_lock_irqsave(&clockevents_lock, flags);
@@ -250,8 +248,8 @@ void clockevents_notify(unsigned long reason, void *arg)
* Unregister the clock event devices which were
* released from the users in the notify chain.
*/
- list_for_each_safe(node, tmp, &clockevents_released)
- list_del(node);
+ list_for_each_entry_safe(dev, tmp, &clockevents_released, list)
+ list_del(&dev->list);
break;
default:
break;
--
1.6.5.2
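
The conversion matters because the plain list_for_each_entry() reads the
next pointer of the current entry after the loop body has run, while the
_safe variant caches it up front. A kernel-style sketch of the
difference, assuming <linux/list.h> and <linux/clockchips.h>
(illustration only, not part of the patch):

#include <linux/list.h>
#include <linux/clockchips.h>

/*
 * Broken: list_for_each_entry() fetches dev->list.next AFTER the body,
 * i.e. after list_del() has already poisoned it:
 *
 *      list_for_each_entry(dev, head, list)
 *              list_del(&dev->list);   // iterator then follows
 *                                      // LIST_POISON1
 *
 * Safe: the successor is cached in tmp before the body runs, so
 * deleting the current entry is fine:
 */
static void drop_all(struct list_head *head)
{
        struct clock_event_device *dev, *tmp;

        list_for_each_entry_safe(dev, tmp, head, list)
                list_del(&dev->list);
}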

2009-12-10 13:08:15

by Xiaotian Feng

Subject: [PATCH 2/4] clockevents: convert clockevents_do_notify to int

We need the result of the CLOCK_EVT_NOTIFY_ADD notification, so convert
clockevents_do_notify() to return int; later patches will use this.

Signed-off-by: Xiaotian Feng <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Magnus Damm <[email protected]>
Cc: H Hartley Sweeten <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Venkatesh Pallipadi <[email protected]>
---
kernel/time/clockevents.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index 1896d9d..980c2c0 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -152,9 +152,9 @@ int clockevents_register_notifier(struct notifier_block *nb)
* Notify about a clock event change. Called with clockevents_lock
* held.
*/
-static void clockevents_do_notify(unsigned long reason, void *dev)
+static int clockevents_do_notify(unsigned long reason, void *dev)
{
- raw_notifier_call_chain(&clockevents_chain, reason, dev);
+ return raw_notifier_call_chain(&clockevents_chain, reason, dev);
}

/*
--
1.6.5.2
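
raw_notifier_call_chain() returns the value of the last callback it
invoked; a callback that claims the event returns NOTIFY_STOP, which
also stops the chain. A minimal sketch of that contract, with
hypothetical example_* names (illustration only):

#include <linux/notifier.h>

static RAW_NOTIFIER_HEAD(example_chain);

/* Hypothetical callback: claim the event and stop further processing. */
static int example_notify(struct notifier_block *nb, unsigned long reason,
                          void *dev)
{
        return NOTIFY_STOP;     /* "I took this device" */
}

static struct notifier_block example_nb = {
        .notifier_call = example_notify,
};

static int example_add(void *dev)
{
        int ret;

        raw_notifier_chain_register(&example_chain, &example_nb);
        /* NOTIFY_STOP back from the chain means a callback accepted
         * the device; anything else means nobody took it. */
        ret = raw_notifier_call_chain(&example_chain, 0, dev);
        return ret == NOTIFY_STOP ? 0 : -ENODEV;
}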

2009-12-10 13:08:08

by Xiaotian Feng

Subject: [PATCH 3/4] clockevents: add device to clockevent_devices list if notify ADD success

When we register a new clock_event_device, CLOCK_EVT_NOTIFY_ADD is
notified, but tick_check_new_device() may decline to use the new device.
We should only add the device to the clockevent_devices list when
tick_check_new_device() succeeds.

Signed-off-by: Xiaotian Feng <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Magnus Damm <[email protected]>
Cc: H Hartley Sweeten <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Venkatesh Pallipadi <[email protected]>
---
kernel/time/clockevents.c | 20 ++++++++++++++------
1 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index 980c2c0..69d6c58 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -164,11 +164,14 @@ static int clockevents_do_notify(unsigned long reason, void *dev)
static void clockevents_notify_released(void)
{
struct clock_event_device *dev, *tmp;
+ int ret;

list_for_each_entry_safe(dev, tmp, &clockevents_released, list) {
- list_del(&dev->list);
- list_add(&dev->list, &clockevent_devices);
- clockevents_do_notify(CLOCK_EVT_NOTIFY_ADD, dev);
+ ret = clockevents_do_notify(CLOCK_EVT_NOTIFY_ADD, dev);
+ if (ret == NOTIFY_STOP) {
+ list_del(&dev->list);
+ list_add(&dev->list, &clockevent_devices);
+ }
}
}

@@ -179,15 +182,19 @@ static void clockevents_notify_released(void)
void clockevents_register_device(struct clock_event_device *dev)
{
unsigned long flags;
+ int ret;

BUG_ON(dev->mode != CLOCK_EVT_MODE_UNUSED);
BUG_ON(!dev->cpumask);

spin_lock_irqsave(&clockevents_lock, flags);

- list_add(&dev->list, &clockevent_devices);
- clockevents_do_notify(CLOCK_EVT_NOTIFY_ADD, dev);
- clockevents_notify_released();
+ ret = clockevents_do_notify(CLOCK_EVT_NOTIFY_ADD, dev);
+ if (ret == NOTIFY_STOP) {
+ list_add(&dev->list, &clockevent_devices);
+ clockevents_notify_released();
+ } else
+ list_add(&dev->list, &clockevents_released);

spin_unlock_irqrestore(&clockevents_lock, flags);
}

2009-12-10 13:08:28

by Xiaotian Feng

Subject: [PATCH 4/4] clockevents: remove related device from clockevents_released list when cpu is DEAD

We have converted clockevent_devices to hold all active devices, and
clockevents_released to hold all devices that failed to be added or were
replaced. So when a cpu goes offline, we should clear the clockevent
devices belonging to that cpu from the clockevents_released list.

Signed-off-by: Xiaotian Feng <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Magnus Damm <[email protected]>
Cc: H Hartley Sweeten <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Venkatesh Pallipadi <[email protected]>
---
kernel/time/clockevents.c | 5 ++++-
1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index babcb30..d27fcba 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -246,6 +246,7 @@ void clockevents_notify(unsigned long reason, void *arg)
{
struct clock_event_device *dev, *tmp;
unsigned long flags;
+ int *cpup;

spin_lock_irqsave(&clockevents_lock, flags);
clockevents_do_notify(reason, arg);
@@ -256,8 +257,10 @@ void clockevents_notify(unsigned long reason, void *arg)
* Unregister the clock event devices which were
* released from the users in the notify chain.
*/
+ cpup = (int *)arg;
list_for_each_entry_safe(dev, tmp, &clockevents_released, list)
- list_del(&dev->list);
+ if (cpumask_test_cpu(*cpup, dev->cpumask))
+ list_del(&dev->list);
break;
default:
break;
--
1.6.5.2
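
The int * cast relies on the convention that the CPU_DEAD notification
passes a pointer to the dead cpu's id as arg. A hypothetical call site,
sketching how the hotplug path issues the notification (illustration
only):

#include <linux/clockchips.h>

/* Hypothetical caller in the cpu-down path: */
static void example_cpu_dead(int cpu)
{
        /* arg points at the id of the cpu that just went down */
        clockevents_notify(CLOCK_EVT_NOTIFY_CPU_DEAD, &cpu);
}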

2009-12-10 14:35:40

by Thomas Gleixner

Subject: Re: [RFC PATCH 0/4] clockevents: fix clockevent_devices list corruption after cpu hotplug

On Thu, 10 Dec 2009, Xiaotian Feng wrote:
> I've hit a list_del corruption, which was reported in
> http://lkml.org/lkml/2009/11/27/45. Since there was no response, I tried
> to debug it myself.
>
> After adding some printks to show all elements in clockevent_devices, I
> found that the kernel hangs when I try to resume from s2ram.
>
> In clockevents_register_device(), the CLOCK_EVT_NOTIFY_ADD notification
> is always followed by clockevents_notify_released(). Although the ADD
> notification uses tick_check_new_device() to install the new device and
> move replaced devices to the clockevents_released list,
> clockevents_notify_released() then adds them back to the
> clockevent_devices list.
>
> My system is a quad-core x86_64 with apic and hpet enabled. After boot,
> the elements in the clockevent_devices list are:
> clockevent_devices -> lapic(3) -> hpet5(3) -> lapic(2) -> hpet4(2) ->
> lapic(1) -> hpet3(1) -> lapic(0) -> hpet2(0) -> hpet(0)
> * (n) denotes the cpu id
>
> But the active clock_event_devices are hpet2, hpet3, hpet4 and hpet5.
> During s2ram, cpus 1, 2 and 3 go down, the CLOCK_EVT_NOTIFY_CPU_DEAD
> notification calls tick_shutdown(), and hpet3, hpet4 and hpet5 are
> deleted from the clockevent_devices list. So after s2ram, the elements
> in the clockevent_devices list are:
> clockevent_devices -> lapic(3) -> lapic(2) -> lapic(1) -> lapic(0) ->
> hpet2(0) -> hpet(0)
>
> Then at resume, cpus 1, 2 and 3 come up and register their lapics again,
> which list_add()s each lapic onto the clockevent_devices list even though
> it is already there. E.g. after list_add of lapic(1) on the above list,
> lapic(1) moves to clockevent_devices->next, but lapic(2)->next still
> points to lapic(1), so the list becomes circular and is corrupted.

Great detective work!

> This patchset fixes the above behaviour by:
> - on clockevents_register_device, if the ADD notify succeeds, adding the
> new device to the clockevent_devices list, otherwise putting it on the
> clockevents_released list.
> - on clockevents_notify_released, the same behaviour as above.
> - on clockevents_notify CPU_DEAD, removing the devices of the dead cpu
> from the clockevents_released list.
>
> This makes sure that only the active devices of each cpu are on the
> clockevent_devices list. With this patchset the list_del corruption
> disappears, and suspend/resume and cpu hotplug work fine on my system.

I'm not happy about that churn. Why don't we simply scan the
clockevent_devices list for leftovers of the dead CPU?

Untested patch below solves the same problem.

Thanks,

tglx
----
diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index 20a8920..5dd857f 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -238,8 +238,9 @@ void clockevents_exchange_device(struct clock_event_device *old,
*/
void clockevents_notify(unsigned long reason, void *arg)
{
- struct list_head *node, *tmp;
+ struct clock_event_device *dev, *tmp;
unsigned long flags;
+ int cpu;

spin_lock_irqsave(&clockevents_lock, flags);
clockevents_do_notify(reason, arg);
@@ -250,8 +251,19 @@ void clockevents_notify(unsigned long reason, void *arg)
* Unregister the clock event devices which were
* released from the users in the notify chain.
*/
- list_for_each_safe(node, tmp, &clockevents_released)
- list_del(node);
+ list_for_each_entry_safe(dev, tmp, &clockevents_released, list)
+ list_del(&dev->list);
+ /*
+ * Now check whether the CPU has left unused per cpu devices
+ */
+ cpu = *((int *)arg);
+ list_for_each_entry_safe(dev, tmp, &clockevent_devices, list) {
+ if (cpumask_test_cpu(cpu, dev->cpumask) &&
+ cpumask_weight(dev->cpumask) == 1) {
+ BUG_ON(dev->mode != CLOCK_EVT_MODE_UNUSED);
+ list_del(&dev->list);
+ }
+ }
break;
default:
break;
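
The cpumask_weight() == 1 test singles out per-cpu devices: a private
tick device has exactly one bit set in its cpumask, while a broadcast
device shared by several cpus has more. A hypothetical helper spelling
out the classification (sketch only):

#include <linux/clockchips.h>
#include <linux/cpumask.h>

/*
 * Sketch: true if @dev is the private tick device of @cpu.
 * cpumask_weight() counts the set bits, so weight == 1 matches only
 * per-cpu devices and leaves devices shared by several cpus alone.
 */
static bool leftover_of_dead_cpu(struct clock_event_device *dev, int cpu)
{
        return cpumask_test_cpu(cpu, dev->cpumask) &&
               cpumask_weight(dev->cpumask) == 1;
}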

2009-12-11 02:29:53

by Xiaotian Feng

Subject: Re: [RFC PATCH 0/4] clockevents: fix clockevent_devices list corruption after cpu hotplug

On 12/10/2009 10:35 PM, Thomas Gleixner wrote:
> On Thu, 10 Dec 2009, Xiaotian Feng wrote:
>> I've hit a list_del corruption, which was reported in
>> http://lkml.org/lkml/2009/11/27/45. Since there was no response, I tried
>> to debug it myself.
>>
>> After adding some printks to show all elements in clockevent_devices, I
>> found that the kernel hangs when I try to resume from s2ram.
>>
>> In clockevents_register_device(), the CLOCK_EVT_NOTIFY_ADD notification
>> is always followed by clockevents_notify_released(). Although the ADD
>> notification uses tick_check_new_device() to install the new device and
>> move replaced devices to the clockevents_released list,
>> clockevents_notify_released() then adds them back to the
>> clockevent_devices list.
>>
>> My system is a quad-core x86_64 with apic and hpet enabled. After boot,
>> the elements in the clockevent_devices list are:
>> clockevent_devices -> lapic(3) -> hpet5(3) -> lapic(2) -> hpet4(2) ->
>> lapic(1) -> hpet3(1) -> lapic(0) -> hpet2(0) -> hpet(0)
>> * (n) denotes the cpu id
>>
>> But the active clock_event_devices are hpet2, hpet3, hpet4 and hpet5.
>> During s2ram, cpus 1, 2 and 3 go down, the CLOCK_EVT_NOTIFY_CPU_DEAD
>> notification calls tick_shutdown(), and hpet3, hpet4 and hpet5 are
>> deleted from the clockevent_devices list. So after s2ram, the elements
>> in the clockevent_devices list are:
>> clockevent_devices -> lapic(3) -> lapic(2) -> lapic(1) -> lapic(0) ->
>> hpet2(0) -> hpet(0)
>>
>> Then at resume, cpus 1, 2 and 3 come up and register their lapics again,
>> which list_add()s each lapic onto the clockevent_devices list even though
>> it is already there. E.g. after list_add of lapic(1) on the above list,
>> lapic(1) moves to clockevent_devices->next, but lapic(2)->next still
>> points to lapic(1), so the list becomes circular and is corrupted.
>
> Great detective work!
>
>> This patchset fixes the above behaviour by:
>> - on clockevents_register_device, if the ADD notify succeeds, adding the
>> new device to the clockevent_devices list, otherwise putting it on the
>> clockevents_released list.
>> - on clockevents_notify_released, the same behaviour as above.
>> - on clockevents_notify CPU_DEAD, removing the devices of the dead cpu
>> from the clockevents_released list.
>>
>> This makes sure that only the active devices of each cpu are on the
>> clockevent_devices list. With this patchset the list_del corruption
>> disappears, and suspend/resume and cpu hotplug work fine on my system.
>
> I'm not happy about that churn. Why don't we simply scan the
> clockevent_devices list for leftovers of the dead CPU?
My only thought was to make the clockevent_devices list store only the
active devices of each cpu, with all other inactive devices on
clockevents_released, but that makes things complex. Your patch is better.
>
> Untested patch below solves the same problem.
Yes, this also resolves my list_del warning. Thanks
>
> Thanks,
>
> tglx
> ----
> diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
> index 20a8920..5dd857f 100644
> --- a/kernel/time/clockevents.c
> +++ b/kernel/time/clockevents.c
> @@ -238,8 +238,9 @@ void clockevents_exchange_device(struct clock_event_device *old,
> */
> void clockevents_notify(unsigned long reason, void *arg)
> {
> - struct list_head *node, *tmp;
> + struct clock_event_device *dev, *tmp;
> unsigned long flags;
> + int cpu;
>
> spin_lock_irqsave(&clockevents_lock, flags);
> clockevents_do_notify(reason, arg);
> @@ -250,8 +251,19 @@ void clockevents_notify(unsigned long reason, void *arg)
> * Unregister the clock event devices which were
> * released from the users in the notify chain.
> */
> - list_for_each_safe(node, tmp, &clockevents_released)
> - list_del(node);
> + list_for_each_entry_safe(dev, tmp, &clockevents_released, list)
> + list_del(&dev->list);
> + /*
> + * Now check whether the CPU has left unused per cpu devices
> + */
> + cpu = *((int *)arg);
> + list_for_each_entry_safe(dev, tmp, &clockevent_devices, list) {
> + if (cpumask_test_cpu(cpu, dev->cpumask) &&
> + cpumask_weight(dev->cpumask) == 1) {
> + BUG_ON(dev->mode != CLOCK_EVT_MODE_UNUSED);
> + list_del(&dev->list);
> + }
> + }
> break;
> default:
> break;
>
>
>

2010-01-17 09:28:04

by Ozan Çağlayan

Subject: Re: [RFC PATCH 0/4] clockevents: fix clockevent_devices list corruption after cpu hotplug

Thomas Gleixner wrote:

>
> I'm not happy about that churn. Why don't we simply scan the
> clockevent_devices list for leftovers of the dead CPU?
>
> Untested patch below solves the same problem.


Hi, I have 3 bug reports about a clockevent failure while halting the system (2.6.31.11 kernel).
They pinpoint exactly line 262, which changed when this patch was merged into 2.6.31.10 (and also 2.6.32.3):

[4406.986777] kernel BUG at kernel/time/clockevents.c:262!
[4406.986777] invalid opcode: 0000 [#1] SMP
[4406.986777] last sysfs file: /sysfs/module/ip_tables/initstate


All 3 systems have Pentium 4 processors with different clock speeds.

Here's a screenshot of the full stacktrace:
http://bugs.pardus.org.tr/attachment.cgi?id=4820

and the relevant bugzilla report:
http://bugzilla.kernel.org/show_bug.cgi?id=15073

Thanks,
Ozan Caglayan
Pardus Linux -- http://www.pardus.org.tr/eng

2010-01-18 02:31:53

by Xiaotian Feng

Subject: Re: [RFC PATCH 0/4] clockevents: fix clockevent_devices list corruption after cpu hotplug

On 01/17/2010 05:28 PM, Ozan Çağlayan wrote:
> Thomas Gleixner wrote:
>
>>
>> I'm not happy about that churn. Why don't we simply scan the
>> clockevent_devices list for leftovers of the dead CPU?
>>
>> Untested patch below solves the same problem.
>
>
> Hi, I have 3 bug reports about a clockevent failure while halting the system (2.6.31.11 kernel).
> They pinpoint exactly line 262, which changed when this patch was merged into 2.6.31.10 (and also 2.6.32.3):
>
> [4406.986777] kernel BUG at kernel/time/clockevents.c:262!
> [4406.986777] invalid opcode: 0000 [#1] SMP
> [4406.986777] last sysfs file: /sysfs/module/ip_tables/initstate

I think this one is a duplicate of
http://bugzilla.kernel.org/show_bug.cgi?id=15037
Thomas is working on the fix. I've sent a workaround patch at
http://patchwork.kernel.org/patch/71537/

>
>
> All 3 systems have Pentium 4 processors with different clock speeds.
>
> Here's a screenshot of the full stacktrace:
> http://bugs.pardus.org.tr/attachment.cgi?id=4820
>
> and the relevant bugzilla report:
> http://bugzilla.kernel.org/show_bug.cgi?id=15073
>
> Thanks,
> Ozan Caglayan
> Pardus Linux -- http://www.pardus.org.tr/eng
>

2010-01-18 13:52:00

by Thomas Gleixner

Subject: Re: [RFC PATCH 0/4] clockevents: fix clockevent_devices list corruption after cpu hotplug

On Mon, 18 Jan 2010, Xiaotian Feng wrote:

> On 01/17/2010 05:28 PM, Ozan Çağlayan wrote:
> > Thomas Gleixner wrote:
> >
> > >
> > > I'm not happy about that churn. Why don't we simply scan the
> > > clockevent_devices list for leftovers of the dead CPU?
> > >
> > > Untested patch below solves the same problem.
> >
> >
> > Hi, I have 3 bug reports about a clockevent failure while halting the system
> > (2.6.31.11 kernel).
> > They pinpoint exactly line 262, which changed when this patch was merged
> > into 2.6.31.10 (and also 2.6.32.3):
> >
> > [4406.986777] kernel BUG at kernel/time/clockevents.c:262!
> > [4406.986777] invalid opcode: 0000 [#1] SMP
> > [4406.986777] last sysfs file: /sysfs/module/ip_tables/initstate
>
> I think this one is a duplicate of
> http://bugzilla.kernel.org/show_bug.cgi?id=15037
> Thomas is working on the fix. I've sent a workaround patch at
> http://patchwork.kernel.org/patch/71537/

I just applied your patch, but kept the cpumask weight check. This is the
least intrusive solution for now. The logic needs an overhaul, but
that's neither rc4 nor stable material.

Thanks,

tglx