2005-03-12 13:11:23

by J. Bruce Fields

[permalink] [raw]
Subject: spin_lock error in arch/i386/kernel/time.c on APM resume

On APM resume this morning on my Thinkpad X31, I got a "spin_lock is
already locked" error; see below. This doesn't happen on every resume,
though it's happened before. The kernel is 2.6.11 plus a bunch of
(hopefully unrelated...) NFS patches.

Any ideas?

--Bruce Fields

Mar 12 07:07:29 puzzle kernel: PCI: Setting latency timer of device 0000:00:1d.0 to 64
Mar 12 07:07:31 puzzle kernel: PCI: Setting latency timer of device 0000:00:1d.1 to 64
Mar 12 07:07:31 puzzle kernel: PCI: Setting latency timer of device 0000:00:1d.2 to 64
Mar 12 07:07:31 puzzle kernel: PCI: cache line size of 32 is not supported by device 0000:00:1d.7
Mar 12 07:07:31 puzzle kernel: ehci_hcd 0000:00:1d.7: USB 2.0 restarted, EHCI 1.00, driver 10 Dec 2004
Mar 12 07:07:31 puzzle kernel: PCI: Found IRQ 11 for device 0000:00:1f.1
Mar 12 07:07:31 puzzle kernel: PCI: Sharing IRQ 11 with 0000:00:1d.2
Mar 12 07:07:31 puzzle kernel: PCI: Sharing IRQ 11 with 0000:02:00.2
Mar 12 07:07:31 puzzle kernel: PCI: Sharing IRQ 11 with 0000:02:02.0
Mar 12 07:07:31 puzzle kernel: PCI: Found IRQ 11 for device 0000:00:1f.5
Mar 12 07:07:31 puzzle kernel: PCI: Sharing IRQ 11 with 0000:00:1f.3
Mar 12 07:07:31 puzzle kernel: PCI: Sharing IRQ 11 with 0000:00:1f.6
Mar 12 07:07:31 puzzle kernel: PCI: Sharing IRQ 11 with 0000:02:00.1
Mar 12 07:07:31 puzzle kernel: PCI: Setting latency timer of device 0000:00:1f.5 to 64
Mar 12 07:07:31 puzzle kernel: arch/i386/kernel/time.c:179: spin_lock(arch/i386/kernel/time.c:c0603c28) already locked by arch/i386/kernel/time.c/309
Mar 12 07:07:31 puzzle kernel: arch/i386/kernel/time.c:316: spin_unlock(arch/i386/kernel/time.c:c0603c28) not locked
Mar 12 07:07:31 puzzle kernel: PCI: Found IRQ 11 for device 0000:01:00.0
Mar 12 07:07:31 puzzle kernel: PCI: Sharing IRQ 11 with 0000:00:1d.0
Mar 12 07:07:31 puzzle kernel: PCI: Sharing IRQ 11 with 0000:02:00.0
Mar 12 07:07:31 puzzle kernel: PCI: Found IRQ 11 for device 0000:02:00.2
Mar 12 07:07:31 puzzle kernel: PCI: Sharing IRQ 11 with 0000:00:1d.2
Mar 12 07:07:31 puzzle kernel: PCI: Sharing IRQ 11 with 0000:00:1f.1
Mar 12 07:07:31 puzzle kernel: PCI: Sharing IRQ 11 with 0000:02:02.0
Mar 12 07:07:31 puzzle kernel: PCI: Found IRQ 11 for device 0000:02:08.0
Mar 12 07:07:31 puzzle kernel: agpgart: Found an AGP 2.0 compliant device at 0000:00:00.0.
Mar 12 07:07:31 puzzle kernel: agpgart: Putting AGP V2 device at 0000:00:00.0 into 1x mode
Mar 12 07:07:31 puzzle kernel: agpgart: Putting AGP V2 device at 0000:01:00.0 into 1x mode


2005-03-12 15:20:50

by Zwane Mwaikambo

[permalink] [raw]
Subject: [PATCH] APM: fix interrupts enabled in device_power_up

On Sat, 12 Mar 2005, J. Bruce Fields wrote:

> On APM resume this morning on my Thinkpad X31, I got a "spin_lock is
> already locked" error; see below. This doesn't happen on every resume,
> though it's happened before. The kernel is 2.6.11 plus a bunch of
> (hopefully unrelated...) NFS patches.
>
> Mar 12 07:07:31 puzzle kernel: PCI: Setting latency timer of device 0000:00:1f.5 to 64
> Mar 12 07:07:31 puzzle kernel: arch/i386/kernel/time.c:179: spin_lock(arch/i386/kernel/time.c:c0603c28) already locked by arch/i386/kernel/time.c/309
> Mar 12 07:07:31 puzzle kernel: arch/i386/kernel/time.c:316: spin_unlock(arch/i386/kernel/time.c:c0603c28) not locked

APM was calling device_power_down and device_power_up with interrupts
enabled, resulting in a few calls to get_cmos_time being done with
interrupts enabled (rtc_lock needs to be acquired with interrupts
disabled).

Signed-off-by: Zwane Mwaikambo <[email protected]>

===== arch/i386/kernel/apm.c 1.72 vs edited =====
--- 1.72/arch/i386/kernel/apm.c 2005-01-20 22:02:11 -07:00
+++ edited/arch/i386/kernel/apm.c 2005-03-12 08:17:52 -07:00
@@ -1202,10 +1202,11 @@
}

device_suspend(PMSG_SUSPEND);
+ local_irq_disable();
device_power_down(PMSG_SUSPEND);

/* serialize with the timer interrupt */
- write_seqlock_irq(&xtime_lock);
+ write_seqlock(&xtime_lock);

/* protect against access to timer chip registers */
spin_lock(&i8253_lock);
@@ -1216,20 +1217,22 @@
* We'll undo any timer changes due to interrupts below.
*/
spin_unlock(&i8253_lock);
- write_sequnlock_irq(&xtime_lock);
+ write_sequnlock(&xtime_lock);
+ local_irq_enable();

save_processor_state();
err = set_system_power_state(APM_STATE_SUSPEND);
restore_processor_state();

- write_seqlock_irq(&xtime_lock);
+ local_irq_disable();
+ write_seqlock(&xtime_lock);
spin_lock(&i8253_lock);
reinit_timer();
set_time();
ignore_normal_resume = 1;

spin_unlock(&i8253_lock);
- write_sequnlock_irq(&xtime_lock);
+ write_sequnlock(&xtime_lock);

if (err == APM_NO_ERROR)
err = APM_SUCCESS;
@@ -1237,6 +1240,7 @@
apm_error("suspend", err);
err = (err == APM_SUCCESS) ? 0 : -EIO;
device_power_up();
+ local_irq_enable();
device_resume();
pm_send_all(PM_RESUME, (void *)0);
queue_event(APM_NORMAL_RESUME, NULL);
@@ -1255,17 +1259,22 @@
{
int err;

+ local_irq_disable();
device_power_down(PMSG_SUSPEND);
/* serialize with the timer interrupt */
- write_seqlock_irq(&xtime_lock);
+ write_seqlock(&xtime_lock);
/* If needed, notify drivers here */
get_time_diff();
- write_sequnlock_irq(&xtime_lock);
+ write_sequnlock(&xtime_lock);
+ local_irq_enable();

err = set_system_power_state(APM_STATE_STANDBY);
if ((err != APM_SUCCESS) && (err != APM_NO_ERROR))
apm_error("standby", err);
+
+ local_irq_disable();
device_power_up();
+ local_irq_enable();
}

static apm_event_t get_event(void)

2005-03-12 16:04:18

by George Anzinger

[permalink] [raw]
Subject: Re: spin_lock error in arch/i386/kernel/time.c on APM resume

Source: MontaVista Software, Inc.
Type: Defect Fix
Disposition: Pending
Description:

I was not happy with the locking on this. Two changes:
1) Turn off irq while setting the clock.
2) Call the timer code only through the timer interface
(set a short timer to do it from the ntp call).

Signed-off-by: George Anzinger <[email protected]>

time.c | 10 +++++-----
1 files changed, 5 insertions(+), 5 deletions(-)

Index: linux-2.6.12-rc/arch/i386/kernel/time.c
===================================================================
--- linux-2.6.12-rc.orig/arch/i386/kernel/time.c
+++ linux-2.6.12-rc/arch/i386/kernel/time.c
@@ -176,12 +176,12 @@ static int set_rtc_mmss(unsigned long no
int retval;

/* gets recalled with irq locally disabled */
- spin_lock(&rtc_lock);
+ spin_lock_irq(&rtc_lock);
if (efi_enabled)
retval = efi_set_rtc_mmss(nowtime);
else
retval = mach_set_rtc_mmss(nowtime);
- spin_unlock(&rtc_lock);
+ spin_unlock_irq(&rtc_lock);

return retval;
}
@@ -282,14 +282,14 @@ unsigned long get_cmos_time(void)
{
unsigned long retval;

- spin_lock(&rtc_lock);
+ spin_lock_irq(&rtc_lock);

if (efi_enabled)
retval = efi_get_time();
else
retval = mach_get_cmos_time();

- spin_unlock(&rtc_lock);
+ spin_unlock_irq(&rtc_lock);

return retval;
}
@@ -338,7 +338,7 @@ static void sync_cmos_clock(unsigned lon
}
void notify_arch_cmos_timer(void)
{
- sync_cmos_clock(0);
+ mod_timer(&sync_cmos_timer, jiffies + 1);
}
static long clock_cmos_diff, sleep_start;


Attachments:
cmos_time_lock.patch (1.48 kB)

2005-03-12 16:24:38

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: spin_lock error in arch/i386/kernel/time.c on APM resume

On Sat, 12 Mar 2005, George Anzinger wrote:

> Looks like we need the irq on the read clock also. This is true both before
> and after the prior cmos_time changes.
>
> The attached replaces the patch I sent yesterday.
>
> For those wanting to fix the kernel with out those patches, all that is needed
> its the chunk that applies, i.e. the _irq on the get_cmos_time() spinlocks.
>
> And more... That this occures implies we are attempting to update the cmos
> clock on resume seems wrong. One would presume that the time is wrong at this
> time and we are about to save that wrong time. Possibly the APM code should
> change time_status to STA_UNSYNC on the way into the sleep (or what ever it is
> called). Who should we ping with this?

timer_resume, which appears to be the problem, wants to calculate amount
of time was spent suspended, also your unconditional irq enable in
get_cmos_time breaks the atomicity of device_power_up and would deadlock
in sections of code which call get_time_diff() with xtime_lock held. I
sent a patch subject "APM: fix interrupts enabled in device_power_up"
which should address this.

Thanks,
Zwane

2005-03-12 16:39:22

by Pallipadi, Venkatesh

[permalink] [raw]
Subject: Re: spin_lock error in arch/i386/kernel/time.c on APM resume

On Sat, Mar 12, 2005 at 09:25:13AM -0700, Zwane Mwaikambo wrote:
> On Sat, 12 Mar 2005, George Anzinger wrote:
>
> > And more... That this occures implies we are attempting to update the cmos
> > clock on resume seems wrong. One would presume that the time is wrong at this
> > time and we are about to save that wrong time. Possibly the APM code should
> > change time_status to STA_UNSYNC on the way into the sleep (or what ever it is
> > called). Who should we ping with this?
>
> timer_resume, which appears to be the problem, wants to calculate amount
> of time was spent suspended, also your unconditional irq enable in
> get_cmos_time breaks the atomicity of device_power_up and would deadlock
> in sections of code which call get_time_diff() with xtime_lock held. I
> sent a patch subject "APM: fix interrupts enabled in device_power_up"
> which should address this.
>

How about this patch? Also fixes one other use of rtc_lock in acpi/sleep/proc.c

Thanks,
Venki


rtc_lock is held during timer interrupts. So, we should block interrupts
while holding it.

Signed-off-by: Venkatesh Pallipadi <[email protected]>

--- linux-2.6.10/arch/i386/kernel/time.c.org 2005-03-12 10:38:23.000000000 -0800
+++ linux-2.6.10/arch/i386/kernel/time.c 2005-03-12 10:40:26.000000000 -0800
@@ -305,15 +305,16 @@ irqreturn_t timer_interrupt(int irq, voi
unsigned long get_cmos_time(void)
{
unsigned long retval;
+ unsigned long flags;

- spin_lock(&rtc_lock);
+ spin_lock_irqsave(&rtc_lock, flags);

if (efi_enabled)
retval = efi_get_time();
else
retval = mach_get_cmos_time();

- spin_unlock(&rtc_lock);
+ spin_unlock_restore(&rtc_lock, flags);

return retval;
}
--- linux-2.6.10/drivers/acpi/sleep/proc.c.org 2005-03-12 10:50:40.000000000 -0800
+++ linux-2.6.10/drivers/acpi/sleep/proc.c 2005-03-12 10:53:08.000000000 -0800
@@ -84,10 +84,11 @@ static int acpi_system_alarm_seq_show(st
u32 sec, min, hr;
u32 day, mo, yr;
unsigned char rtc_control = 0;
+ unsigned long flags;

ACPI_FUNCTION_TRACE("acpi_system_alarm_seq_show");

- spin_lock(&rtc_lock);
+ spin_lock_irqsave(&rtc_lock, flags);

sec = CMOS_READ(RTC_SECONDS_ALARM);
min = CMOS_READ(RTC_MINUTES_ALARM);
@@ -109,7 +110,7 @@ static int acpi_system_alarm_seq_show(st
else
yr = CMOS_READ(RTC_YEAR);

- spin_unlock(&rtc_lock);
+ spin_unlock_restore(&rtc_lock, flags);

if (!(rtc_control & RTC_DM_BINARY) || RTC_ALWAYS_BCD) {
BCD_TO_BIN(sec);

2005-03-12 16:46:31

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: spin_lock error in arch/i386/kernel/time.c on APM resume

On Sat, 12 Mar 2005, Venkatesh Pallipadi wrote:

> On Sat, Mar 12, 2005 at 09:25:13AM -0700, Zwane Mwaikambo wrote:
> > On Sat, 12 Mar 2005, George Anzinger wrote:
> >
> > > And more... That this occures implies we are attempting to update the cmos
> > > clock on resume seems wrong. One would presume that the time is wrong at this
> > > time and we are about to save that wrong time. Possibly the APM code should
> > > change time_status to STA_UNSYNC on the way into the sleep (or what ever it is
> > > called). Who should we ping with this?
> >
> > timer_resume, which appears to be the problem, wants to calculate amount
> > of time was spent suspended, also your unconditional irq enable in
> > get_cmos_time breaks the atomicity of device_power_up and would deadlock
> > in sections of code which call get_time_diff() with xtime_lock held. I
> > sent a patch subject "APM: fix interrupts enabled in device_power_up"
> > which should address this.
> >
>
> How about this patch? Also fixes one other use of rtc_lock in acpi/sleep/proc.c
>
> rtc_lock is held during timer interrupts. So, we should block interrupts
> while holding it.

It's certainly a lot safer with saving/restoring eflags and the
drivers/acpi/sleep/proc.c change is a good catch, but i think the
get_cmos_time() callers should take care of the interrupt disabling. btw,
s/spin_unlock_restore/spin_unlock_irqrestore/. Please submit the proc.c
change.

Thanks,
Zwane

> Signed-off-by: Venkatesh Pallipadi <[email protected]>
>
> --- linux-2.6.10/arch/i386/kernel/time.c.org 2005-03-12 10:38:23.000000000 -0800
> +++ linux-2.6.10/arch/i386/kernel/time.c 2005-03-12 10:40:26.000000000 -0800
> @@ -305,15 +305,16 @@ irqreturn_t timer_interrupt(int irq, voi
> unsigned long get_cmos_time(void)
> {
> unsigned long retval;
> + unsigned long flags;
>
> - spin_lock(&rtc_lock);
> + spin_lock_irqsave(&rtc_lock, flags);
>
> if (efi_enabled)
> retval = efi_get_time();
> else
> retval = mach_get_cmos_time();
>
> - spin_unlock(&rtc_lock);
> + spin_unlock_restore(&rtc_lock, flags);
>
> return retval;
> }
> --- linux-2.6.10/drivers/acpi/sleep/proc.c.org 2005-03-12 10:50:40.000000000 -0800
> +++ linux-2.6.10/drivers/acpi/sleep/proc.c 2005-03-12 10:53:08.000000000 -0800
> @@ -84,10 +84,11 @@ static int acpi_system_alarm_seq_show(st
> u32 sec, min, hr;
> u32 day, mo, yr;
> unsigned char rtc_control = 0;
> + unsigned long flags;
>
> ACPI_FUNCTION_TRACE("acpi_system_alarm_seq_show");
>
> - spin_lock(&rtc_lock);
> + spin_lock_irqsave(&rtc_lock, flags);
>
> sec = CMOS_READ(RTC_SECONDS_ALARM);
> min = CMOS_READ(RTC_MINUTES_ALARM);
> @@ -109,7 +110,7 @@ static int acpi_system_alarm_seq_show(st
> else
> yr = CMOS_READ(RTC_YEAR);
>
> - spin_unlock(&rtc_lock);
> + spin_unlock_restore(&rtc_lock, flags);
>
> if (!(rtc_control & RTC_DM_BINARY) || RTC_ALWAYS_BCD) {
> BCD_TO_BIN(sec);
>

2005-03-12 17:47:27

by Pallipadi, Venkatesh

[permalink] [raw]
Subject: Re: spin_lock error in arch/i386/kernel/time.c on APM resume

On Sat, Mar 12, 2005 at 09:46:54AM -0700, Zwane Mwaikambo wrote:
> On Sat, 12 Mar 2005, Venkatesh Pallipadi wrote:
>
> > On Sat, Mar 12, 2005 at 09:25:13AM -0700, Zwane Mwaikambo wrote:
> >
> > How about this patch? Also fixes one other use of rtc_lock in acpi/sleep/proc.c
> >
> > rtc_lock is held during timer interrupts. So, we should block interrupts
> > while holding it.
>
> It's certainly a lot safer with saving/restoring eflags and the
> drivers/acpi/sleep/proc.c change is a good catch, but i think the
> get_cmos_time() callers should take care of the interrupt disabling. btw,
> s/spin_unlock_restore/spin_unlock_irqrestore/. Please submit the proc.c
> change.
>

So, I will assume get_cmos_time issue will be fixed in some other way.
Sending the proper proc.c change this time.
Andrew: Please apply.

Thanks,
Venki

rtc_lock is held during timer interrupts. So, we should block interrupts
while holding it.

Signed-off-by: Venkatesh Pallipadi <[email protected]>

--- linux-2.6.10/drivers/acpi/sleep/proc.c.org 2005-03-12 10:50:40.000000000 -0800
+++ linux-2.6.10/drivers/acpi/sleep/proc.c 2005-03-12 10:53:08.000000000 -0800
@@ -84,10 +84,11 @@ static int acpi_system_alarm_seq_show(st
u32 sec, min, hr;
u32 day, mo, yr;
unsigned char rtc_control = 0;
+ unsigned long flags;

ACPI_FUNCTION_TRACE("acpi_system_alarm_seq_show");

- spin_lock(&rtc_lock);
+ spin_lock_irqsave(&rtc_lock, flags);

sec = CMOS_READ(RTC_SECONDS_ALARM);
min = CMOS_READ(RTC_MINUTES_ALARM);
@@ -109,7 +110,7 @@ static int acpi_system_alarm_seq_show(st
else
yr = CMOS_READ(RTC_YEAR);

- spin_unlock(&rtc_lock);
+ spin_unlock_irqrestore(&rtc_lock, flags);

if (!(rtc_control & RTC_DM_BINARY) || RTC_ALWAYS_BCD) {
BCD_TO_BIN(sec);

2005-03-12 18:04:17

by George Anzinger

[permalink] [raw]
Subject: Re: spin_lock error in arch/i386/kernel/time.c on APM resume

Zwane Mwaikambo wrote:
> On Sat, 12 Mar 2005, George Anzinger wrote:
>
>
>>Looks like we need the irq on the read clock also. This is true both before
>>and after the prior cmos_time changes.
>>
>>The attached replaces the patch I sent yesterday.
>>
>>For those wanting to fix the kernel with out those patches, all that is needed
>>its the chunk that applies, i.e. the _irq on the get_cmos_time() spinlocks.
>>
>>And more... That this occures implies we are attempting to update the cmos
>>clock on resume seems wrong. One would presume that the time is wrong at this
>>time and we are about to save that wrong time. Possibly the APM code should
>>change time_status to STA_UNSYNC on the way into the sleep (or what ever it is
>>called). Who should we ping with this?
>
>
> timer_resume, which appears to be the problem, wants to calculate amount
> of time was spent suspended, also your unconditional irq enable in
> get_cmos_time breaks the atomicity of device_power_up and would deadlock
> in sections of code which call get_time_diff() with xtime_lock held. I
> sent a patch subject "APM: fix interrupts enabled in device_power_up"
> which should address this.

I agree. Still in all that follows, no one has addressed the apparent race
described above. The reason the system reported the errors that started this
thread is that the APM restore code was trying to read the cmos clock (I assume
to set the xtime clock) WHILE the timer interrupt code what trying to set the
cmos clock from xtime. In other words, it is destroying the time it is trying
to read. I repeat "Possibly the APM code should change time_status to
STA_UNSYNC on the way into the sleep." I am not sure how ntp is supposed to
react to the resume but I suspect that the system time is rather out of sync...
--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/

2005-03-12 19:59:10

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: spin_lock error in arch/i386/kernel/time.c on APM resume

On Sat, 12 Mar 2005, George Anzinger wrote:

> I agree. Still in all that follows, no one has addressed the apparent race
> described above. The reason the system reported the errors that started this
> thread is that the APM restore code was trying to read the cmos clock (I
> assume to set the xtime clock) WHILE the timer interrupt code what trying to
> set the cmos clock from xtime.

Doesn't my reply explain the actual problem? The code path being;

arch/i386/kernel/apm.c
suspend()
write_seqlock_irq(xtime_lock)
...
write_sequnlock_irq(xtime_lock)
<interrupts enabled>
device_power_up()
timer_resume()
get_cmos_time();

So this covers the problem that the reporter reported, so yes it's setting
xtime but we shouldn't be taking interrupts in the first place, so i
posted the patch to cover that. APM was clearly violating PM resume
procedures.

Thanks,
Zwane


2005-03-12 20:25:38

by George Anzinger

[permalink] [raw]
Subject: Re: spin_lock error in arch/i386/kernel/time.c on APM resume

Zwane Mwaikambo wrote:
> On Sat, 12 Mar 2005, George Anzinger wrote:
>
>
>>I agree. Still in all that follows, no one has addressed the apparent race
>>described above. The reason the system reported the errors that started this
>>thread is that the APM restore code was trying to read the cmos clock (I
>>assume to set the xtime clock) WHILE the timer interrupt code what trying to
>>set the cmos clock from xtime.
>
>
> Doesn't my reply explain the actual problem? The code path being;

Sorry, I just didn't look at the apm code. My bad.

-g
>
> arch/i386/kernel/apm.c
> suspend()
> write_seqlock_irq(xtime_lock)
> ...
> write_sequnlock_irq(xtime_lock)
> <interrupts enabled>
> device_power_up()
> timer_resume()
> get_cmos_time();

S
> So this covers the problem that the reporter reported, so yes it's setting
> xtime but we shouldn't be taking interrupts in the first place, so i
> posted the patch to cover that. APM was clearly violating PM resume
> procedures.
>
> Thanks,
> Zwane

--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/

2005-03-12 21:15:12

by Barry K. Nathan

[permalink] [raw]
Subject: Re: spin_lock error in arch/i386/kernel/time.c on APM resume

On Sat, Mar 12, 2005 at 07:56:10AM -0800, George Anzinger wrote:
> Yesterday's night mare, todays bug :(

Actually, I think people have been hitting this bug for a few months on
Fedora Core, so it's not really *today's* bug... This might be the first
time it's getting discussed on LKML though. (I haven't been reading LKML
closely enough to say that with 100% certainty, however.)

Summary: kernel-2.6.9-1.724_FC3 breaks APM suspend on Thinkpad
https://bugzilla.redhat.com/beta/show_bug.cgi?id=144415

-Barry K. Nathan <[email protected]>

2005-03-12 21:19:16

by Lee Revell

[permalink] [raw]
Subject: Re: spin_lock error in arch/i386/kernel/time.c on APM resume

On Sat, 2005-03-12 at 07:56 -0800, George Anzinger wrote:
> J. Bruce Fields wrote:
> > On APM resume this morning on my Thinkpad X31, I got a "spin_lock is
> > already locked" error; see below. This doesn't happen on every resume,
> > though it's happened before. The kernel is 2.6.11 plus a bunch of
> > (hopefully unrelated...) NFS patches.
> >
> > Any ideas?
> >
> Yesterday's night mare, todays bug :(
>

Wait, so this is the same theoretical bug you discussed in the
do_timer_interrupt thread?

Wow, a real schroedinbug...

Lee

2005-03-13 18:36:33

by Pavel Machek

[permalink] [raw]
Subject: Re: spin_lock error in arch/i386/kernel/time.c on APM resume

Hi!

> >>And more... That this occures implies we are attempting to update the cmos
> >>clock on resume seems wrong. One would presume that the time is wrong at
> >>this
> >>time and we are about to save that wrong time. Possibly the APM code
> >>should
> >>change time_status to STA_UNSYNC on the way into the sleep (or what ever
> >>it is
> >>called). Who should we ping with this?
> >
> >
> >timer_resume, which appears to be the problem, wants to calculate amount
> >of time was spent suspended, also your unconditional irq enable in
> >get_cmos_time breaks the atomicity of device_power_up and would deadlock
> >in sections of code which call get_time_diff() with xtime_lock held. I
> >sent a patch subject "APM: fix interrupts enabled in device_power_up"
> >which should address this.
>
> I agree. Still in all that follows, no one has addressed the apparent race
> described above. The reason the system reported the errors that started
> this thread is that the APM restore code was trying to read the cmos clock
> (I assume to set the xtime clock) WHILE the timer interrupt code what
> trying to set the cmos clock from xtime. In other words, it is destroying
> the time it is trying to read. I repeat "Possibly the APM code should
> change time_status to STA_UNSYNC on the way into the sleep." I am not sure
> how ntp is supposed to react to the resume but I suspect that the system
> time is rather out of sync...

It needs to work without NTP, too. You don't get NTP on plane (etc)
where suspend is most usefull.

We have CMOS clock, it should be possible to get time from there
without resorting to NTP..
Pavel

--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

2005-03-14 23:49:38

by George Anzinger

[permalink] [raw]
Subject: Re: spin_lock error in arch/i386/kernel/time.c on APM resume

Pavel Machek wrote:
> Hi!
>
>
>>>>And more... That this occures implies we are attempting to update the cmos
>>>>clock on resume seems wrong. One would presume that the time is wrong at
>>>>this
>>>>time and we are about to save that wrong time. Possibly the APM code
>>>>should
>>>>change time_status to STA_UNSYNC on the way into the sleep (or what ever
>>>>it is
>>>>called). Who should we ping with this?
>>>
>>>
>>>timer_resume, which appears to be the problem, wants to calculate amount
>>>of time was spent suspended, also your unconditional irq enable in
>>>get_cmos_time breaks the atomicity of device_power_up and would deadlock
>>>in sections of code which call get_time_diff() with xtime_lock held. I
>>>sent a patch subject "APM: fix interrupts enabled in device_power_up"
>>>which should address this.
>>
>>I agree. Still in all that follows, no one has addressed the apparent race
>>described above. The reason the system reported the errors that started
>>this thread is that the APM restore code was trying to read the cmos clock
>>(I assume to set the xtime clock) WHILE the timer interrupt code what
>>trying to set the cmos clock from xtime. In other words, it is destroying
>>the time it is trying to read. I repeat "Possibly the APM code should
>>change time_status to STA_UNSYNC on the way into the sleep." I am not sure
>>how ntp is supposed to react to the resume but I suspect that the system
>>time is rather out of sync...
>
>
> It needs to work without NTP, too. You don't get NTP on plane (etc)
> where suspend is most usefull.
>
> We have CMOS clock, it should be possible to get time from there
> without resorting to NTP..

Eh... sure, but... the bug was reported because the system was attempting to
update the cmos clock (which it does every ~11 min.) during APM exit. It does
this IF AND ONLY IF the system is synced to an external source as indicated by
the STA_UNSYNC bit being cleared in the time_state. Now, I don't know what or
how APM and NTP are supposed to play together, but I suspect that on entry to
APM time is no longer synced, thus my comment.

As to your comment, the bug would never have shown its ugly face if the system
wasn't using NTP.

--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/

2005-03-15 00:13:31

by Pavel Machek

[permalink] [raw]
Subject: Re: spin_lock error in arch/i386/kernel/time.c on APM resume

Hi!

> >>I agree. Still in all that follows, no one has addressed the apparent
> >>race described above. The reason the system reported the errors that
> >>started this thread is that the APM restore code was trying to read the
> >>cmos clock (I assume to set the xtime clock) WHILE the timer interrupt
> >>code what trying to set the cmos clock from xtime. In other words, it is
> >>destroying the time it is trying to read. I repeat "Possibly the APM
> >>code should change time_status to STA_UNSYNC on the way into the sleep."
> >>I am not sure how ntp is supposed to react to the resume but I suspect
> >>that the system time is rather out of sync...
> >
> >
> >It needs to work without NTP, too. You don't get NTP on plane (etc)
> >where suspend is most usefull.
> >
> >We have CMOS clock, it should be possible to get time from there
> >without resorting to NTP..
>
> Eh... sure, but... the bug was reported because the system was attempting
> to update the cmos clock (which it does every ~11 min.) during APM exit.
> It does this IF AND ONLY IF the system is synced to an external source as
> indicated by the STA_UNSYNC bit being cleared in the time_state. Now, I
> don't know what or how APM and NTP are supposed to play together, but I
> suspect that on entry to APM time is no longer synced, thus my comment.
>
> As to your comment, the bug would never have shown its ugly face if the
> system wasn't using NTP.

Uh, ok, you are right. We should set time to STA_UNSYNC so that we do
not write back to CMOS during/shortly after resume. I did not realize
what STA_UNSYNC means. Perhaps you have patch to do that somewhere?
;-))))
Pavel

--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

2005-03-15 09:05:09

by George Anzinger

[permalink] [raw]
Subject: Re: spin_lock error in arch/i386/kernel/time.c on APM resume

Pavel Machek wrote:
> Hi!
>
>
>>>>I agree. Still in all that follows, no one has addressed the apparent
>>>>race described above. The reason the system reported the errors that
>>>>started this thread is that the APM restore code was trying to read the
>>>>cmos clock (I assume to set the xtime clock) WHILE the timer interrupt
>>>>code what trying to set the cmos clock from xtime. In other words, it is
>>>>destroying the time it is trying to read. I repeat "Possibly the APM
>>>>code should change time_status to STA_UNSYNC on the way into the sleep."
>>>>I am not sure how ntp is supposed to react to the resume but I suspect
>>>>that the system time is rather out of sync...
>>>
>>>
>>>It needs to work without NTP, too. You don't get NTP on plane (etc)
>>>where suspend is most usefull.
>>>
>>>We have CMOS clock, it should be possible to get time from there
>>>without resorting to NTP..
>>
>>Eh... sure, but... the bug was reported because the system was attempting
>>to update the cmos clock (which it does every ~11 min.) during APM exit.
>>It does this IF AND ONLY IF the system is synced to an external source as
>>indicated by the STA_UNSYNC bit being cleared in the time_state. Now, I
>>don't know what or how APM and NTP are supposed to play together, but I
>>suspect that on entry to APM time is no longer synced, thus my comment.
>>
>>As to your comment, the bug would never have shown its ugly face if the
>>system wasn't using NTP.
>
>
> Uh, ok, you are right. We should set time to STA_UNSYNC so that we do
> not write back to CMOS during/shortly after resume. I did not realize
> what STA_UNSYNC means. Perhaps you have patch to do that somewhere?
> ;-))))

Zwane has convinced me that the real problem is doing the right thing (tm) in
the APM code, i.e. not allowing the timer interrupt until after reading the cmos
clock, for which he has a patch tendered.

--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/

2005-03-15 22:36:43

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH] APM: fix interrupts enabled in device_power_up

On Sat, Mar 12, 2005 at 08:21:29AM -0700, Zwane Mwaikambo wrote:
> On Sat, 12 Mar 2005, J. Bruce Fields wrote:
>
> > On APM resume this morning on my Thinkpad X31, I got a "spin_lock is
> > already locked" error; see below. This doesn't happen on every resume,
> > though it's happened before. The kernel is 2.6.11 plus a bunch of
> > (hopefully unrelated...) NFS patches.
> >
> > Mar 12 07:07:31 puzzle kernel: PCI: Setting latency timer of device 0000:00:1f.5 to 64
> > Mar 12 07:07:31 puzzle kernel: arch/i386/kernel/time.c:179: spin_lock(arch/i386/kernel/time.c:c0603c28) already locked by arch/i386/kernel/time.c/309
> > Mar 12 07:07:31 puzzle kernel: arch/i386/kernel/time.c:316: spin_unlock(arch/i386/kernel/time.c:c0603c28) not locked
>
> APM was calling device_power_down and device_power_up with interrupts
> enabled, resulting in a few calls to get_cmos_time being done with
> interrupts enabled (rtc_lock needs to be acquired with interrupts
> disabled).

Thanks, I haven't been following the discussion carefully, but for
what's it worth I did apply that patch and now (a few suspend-resume
cycles later) haven't seen the spin_lock warning or seen any other ill
effects.

Let me know if there's any testing it would be useful for me to do.

--Bruce Fields

> Signed-off-by: Zwane Mwaikambo <[email protected]>
>
> ===== arch/i386/kernel/apm.c 1.72 vs edited =====
> --- 1.72/arch/i386/kernel/apm.c 2005-01-20 22:02:11 -07:00
> +++ edited/arch/i386/kernel/apm.c 2005-03-12 08:17:52 -07:00
> @@ -1202,10 +1202,11 @@
> }
>
> device_suspend(PMSG_SUSPEND);
> + local_irq_disable();
> device_power_down(PMSG_SUSPEND);
>
> /* serialize with the timer interrupt */
> - write_seqlock_irq(&xtime_lock);
> + write_seqlock(&xtime_lock);
>
> /* protect against access to timer chip registers */
> spin_lock(&i8253_lock);
> @@ -1216,20 +1217,22 @@
> * We'll undo any timer changes due to interrupts below.
> */
> spin_unlock(&i8253_lock);
> - write_sequnlock_irq(&xtime_lock);
> + write_sequnlock(&xtime_lock);
> + local_irq_enable();
>
> save_processor_state();
> err = set_system_power_state(APM_STATE_SUSPEND);
> restore_processor_state();
>
> - write_seqlock_irq(&xtime_lock);
> + local_irq_disable();
> + write_seqlock(&xtime_lock);
> spin_lock(&i8253_lock);
> reinit_timer();
> set_time();
> ignore_normal_resume = 1;
>
> spin_unlock(&i8253_lock);
> - write_sequnlock_irq(&xtime_lock);
> + write_sequnlock(&xtime_lock);
>
> if (err == APM_NO_ERROR)
> err = APM_SUCCESS;
> @@ -1237,6 +1240,7 @@
> apm_error("suspend", err);
> err = (err == APM_SUCCESS) ? 0 : -EIO;
> device_power_up();
> + local_irq_enable();
> device_resume();
> pm_send_all(PM_RESUME, (void *)0);
> queue_event(APM_NORMAL_RESUME, NULL);
> @@ -1255,17 +1259,22 @@
> {
> int err;
>
> + local_irq_disable();
> device_power_down(PMSG_SUSPEND);
> /* serialize with the timer interrupt */
> - write_seqlock_irq(&xtime_lock);
> + write_seqlock(&xtime_lock);
> /* If needed, notify drivers here */
> get_time_diff();
> - write_sequnlock_irq(&xtime_lock);
> + write_sequnlock(&xtime_lock);
> + local_irq_enable();
>
> err = set_system_power_state(APM_STATE_STANDBY);
> if ((err != APM_SUCCESS) && (err != APM_NO_ERROR))
> apm_error("standby", err);
> +
> + local_irq_disable();
> device_power_up();
> + local_irq_enable();
> }
>
> static apm_event_t get_event(void)

2005-03-15 22:57:03

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: [PATCH] APM: fix interrupts enabled in device_power_up

On Tue, 15 Mar 2005, J. Bruce Fields wrote:

> On Sat, Mar 12, 2005 at 08:21:29AM -0700, Zwane Mwaikambo wrote:
> > On Sat, 12 Mar 2005, J. Bruce Fields wrote:
> >
> > > On APM resume this morning on my Thinkpad X31, I got a "spin_lock is
> > > already locked" error; see below. This doesn't happen on every resume,
> > > though it's happened before. The kernel is 2.6.11 plus a bunch of
> > > (hopefully unrelated...) NFS patches.
> > >
> > > Mar 12 07:07:31 puzzle kernel: PCI: Setting latency timer of device 0000:00:1f.5 to 64
> > > Mar 12 07:07:31 puzzle kernel: arch/i386/kernel/time.c:179: spin_lock(arch/i386/kernel/time.c:c0603c28) already locked by arch/i386/kernel/time.c/309
> > > Mar 12 07:07:31 puzzle kernel: arch/i386/kernel/time.c:316: spin_unlock(arch/i386/kernel/time.c:c0603c28) not locked
> >
> > APM was calling device_power_down and device_power_up with interrupts
> > enabled, resulting in a few calls to get_cmos_time being done with
> > interrupts enabled (rtc_lock needs to be acquired with interrupts
> > disabled).
>
> Thanks, I haven't been following the discussion carefully, but for
> what's it worth I did apply that patch and now (a few suspend-resume
> cycles later) haven't seen the spin_lock warning or seen any other ill
> effects.
>
> Let me know if there's any testing it would be useful for me to do.

Thanks for testing it, suspend/resume cycles are what i was most
interested in.

Zwane