2005-01-01 17:25:29

by Pavel Machek

Subject: Re: 2.6.10: e100 network broken after swsusp/resume

Hi!

> similarly to other people's reports of hardware troubles after swsusp, my
> thinkpad r40's e100 nic doesn't fully function after resume.
>
> ifplugd can see the link status change when i plug and unplug the cable,
> but the dhclient it runs just tries and retries to get an ip without
> success.
>
> i've tried reloading e100, mii, and even af_packet, but only a reboot
> fixes it.

e100 seems to have some suspend/resume support [but if even reloading
e100 does not help, the fault is not in e100]. Are you running with APIC
enabled? Try noapic. Try acpi=off.

Pavel

--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!


2005-01-01 22:26:12

by John M Flinchbaugh

Subject: Re: 2.6.10: e100 network broken after swsusp/resume

On Sat, Jan 01, 2005 at 06:23:44PM +0100, Pavel Machek wrote:
> e100 seems to have some suspend/resume support [but if even reloading
> e100 does not help, the fault is not in e100]. Are you running with APIC
> enabled? Try noapic. Try acpi=off.

it had been fine in 2.6.9. i think i had switched to using apic back
with 2.6.9 (to facilitate nmi_watchdog, maybe).

i'll try these options. ultimately, though, i'm going to need acpi. :)

thanks.
--
John M Flinchbaugh
[email protected]


2005-01-01 22:53:02

by Pavel Machek

Subject: Re: 2.6.10: e100 network broken after swsusp/resume

Hi!

> > e100 seems to have some suspend/resume support [but if even reloading
> > e100 does not help, the fault is not in e100]. Are you running with APIC
> > enabled? Try noapic. Try acpi=off.
>
> it had been fine in 2.6.9. i think i had switched to using apic back
> with 2.6.9 (to facilitate nmi_watchdog, maybe).
>
> i'll try these options. ultimately, though, i'm going to need acpi. :)

Ok, so if everything else fails, just find which changeset broke it
for you.
Pavel

2005-01-01 23:14:59

by Matthew Garrett

Subject: Re: 2.6.10: e100 network broken after swsusp/resume

John M Flinchbaugh <[email protected]> wrote:

> it had been fine in 2.6.9. i think i had switched to using apic back
> with 2.6.9 (to facilitate nmi_watchdog, maybe).
>
> i'll try these options. ultimately, though, i'm going to need acpi. :)

Does pci=routeirq make any difference?

--
Matthew Garrett | [email protected]

2005-01-02 03:46:14

by Barry K. Nathan

Subject: Re: 2.6.10: e100 network broken after swsusp/resume

On Sat, Jan 01, 2005 at 11:14:57PM +0000, Matthew Garrett wrote:
> Does pci=routeirq make any difference?

I'm not the original poster, and I haven't read this whole thread yet,
but I may have some useful input...

I think I'm seeing this problem (with the same symptoms) with both e100
and 8139too, on two different machines. It started with 2.6.10-rc1-bk24;
bk23 works fine.

I haven't tested my e100 system (laptop) as thoroughly as my 8139too system
(desktop), but this is what I'm seeing with 8139too:

Adding pci=routeirq makes the problem go away. Using acpi=off *instead* of
pci=routeirq also makes the problem go away. If I use "noapic" instead
of acpi=off or pci=routeirq, I get a different variant of the problem:
Almost immediately after resume, there's a kernel log message that the
NIC's interrupt has been disabled (I forget the exact wording). Checking
/proc/interrupts shows that the NIC's interrupt (which is not shared
with any other devices) has shot up to 100000 interrupts. (This
phenomenon does not happen if I do not specify noapic.)
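
A quick way to watch for the interrupt storm described above is to sum the per-CPU counts on the NIC's line of /proc/interrupts before and after resume. The following is a minimal C sketch, not code from this thread: the helper name `irq_line_total` is hypothetical, and it assumes the usual layout of an IRQ number, a colon, one decimal count per CPU, and then the controller type and device names.

```c
#include <ctype.h>
#include <stdlib.h>

/*
 * Parse one line of /proc/interrupts, for example
 *   " 11:     100000          0   IO-APIC-level  eth0"
 * On success, store the IRQ number in *irq and the sum of the
 * per-CPU counts in *total, and return the number of count columns
 * read. Return -1 for lines without a numeric IRQ (the CPU header,
 * NMI:, LOC:, ERR:, ...). Hypothetical helper; the column layout is
 * an assumption.
 */
int irq_line_total(const char *line, int *irq, unsigned long long *total)
{
    char *end;
    long n;
    int cols = 0;

    while (isspace((unsigned char)*line))
        line++;
    if (!isdigit((unsigned char)*line))
        return -1;
    n = strtol(line, &end, 10);
    if (*end != ':')
        return -1;
    *irq = (int)n;
    *total = 0;
    line = end + 1;

    /* Add up consecutive decimal fields; stop at the first field that
     * is not a number (the controller type, e.g. "IO-APIC-level"). */
    for (;;) {
        while (isspace((unsigned char)*line))
            line++;
        if (!isdigit((unsigned char)*line))
            break;
        *total += strtoull(line, &end, 10);
        cols++;
        line = end;
    }
    return cols;
}
```

Reading the file with fgets() and calling the helper on each line is enough to compare one IRQ's total across a suspend/resume cycle.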

Now I'll go and read the rest of this thread. If there's any more
information I need to provide or anything else I need to try, let me
know.

-Barry K. Nathan <[email protected]>

2005-01-02 05:57:58

by Barry K. Nathan

Subject: Re: 2.6.10: e100 network broken after swsusp/resume

On Sat, Jan 01, 2005 at 06:23:44PM +0100, Pavel Machek wrote:
> e100 seems to have some suspend/resume support [but if even reloading
> e100 does not help, the fault is not in e100]. Are you running with APIC
> enabled? Try noapic. Try acpi=off.

Reloading doesn't help, with either e100 or 8139too. I forgot to mention
that in my other e-mail in this thread. (As I previously mentioned, on
my system with 8139too, noapic makes matters worse, and the problem goes
away if I use *either* pci=routeirq or acpi=off. I haven't tried using
both.)

-Barry K. Nathan <[email protected]>

2005-01-02 18:42:48

by John M Flinchbaugh

Subject: Re: 2.6.10: e100 network broken after swsusp/resume

On Sat, Jan 01, 2005 at 09:57:53PM -0800, Barry K. Nathan wrote:
> > e100 does not help, the fault is not in e100]. Are you running with APIC
> > enabled? Try noapic. Try acpi=off.
> Reloading doesn't help, with either e100 or 8139too. I forgot to mention
> that in my other e-mail in this thread. (As I previously mentioned, on
> my system with 8139too, noapic makes matters worse, and the problem goes
> away if I use *either* pci=routeirq or acpi=off. I haven't tried using
> both.)

pci=routeirq worked for me to get my e100 working again after resume.

so what's that mean? what's the trade-off for using this option?

thanks for the guidance.
--
John M Flinchbaugh
[email protected]


2005-01-02 20:09:10

by Håkan Lindqvist

Subject: Re: 2.6.10: e100 network broken after swsusp/resume

On Sun, 2005-01-02 at 13:42 -0500, John M Flinchbaugh wrote:
> pci=routeirq worked for me to get my e100 working again after resume.

For the record: It works around my problems with e100 and snd-intel8x0,
too.

>
> so what's that mean? what's the trade-off for using this option?


Documentation/kernel-parameters.txt says this about pci=routeirq:
"Do IRQ routing for all PCI devices. This is normally done in
pci_enable_device(), so this option is a temporary workaround for broken
drivers that don't call it."

I.e., it doesn't sound too bad to use it until the problem is solved.
And I don't know if this particular issue is a case of broken drivers,
but that was what the parameter was added to work around.


/Håkan

2005-01-03 05:10:53

by Barry K. Nathan

Subject: Re: 2.6.10: e100 network broken after swsusp/resume

On Sun, Jan 02, 2005 at 09:09:16PM +0100, Håkan Lindqvist wrote:
> On Sun, 2005-01-02 at 13:42 -0500, John M Flinchbaugh wrote:
> > pci=routeirq worked for me to get my e100 working again after resume.
>
> For the record: It works around my problems with e100 and snd-intel8x0,
> too.

I previously mentioned that "pci=routeirq" works to fix my 8139too
problems. However, I just figured out that if I use "acpi=noirq" or
"pci=noacpi" instead of "pci=routeirq", that works too. (This is with
2.6.10-bk4.)

[snip]
> Documentation/kernel-parameters.txt says this about pci=routeirq:
> "Do IRQ routing for all PCI devices. This is normally done in
> pci_enable_device(), so this option is a temporary workaround for broken
> drivers that don't call it."
>
> I.e., it doesn't sound too bad to use it until the problem is solved.
> And I don't know if this particular issue is a case of broken drivers,
> but that was what the parameter was added to work around.

I don't think this is a case of broken drivers. So far in this thread, it's
been seen with e100, 8139too, snd-intel8x0, and probably one of the USB
drivers too. And the problem happens even if the module is unloaded and
reloaded -- unless I'm seriously missing something, this probably means
pci_enable_device() is unable to do its job properly for some reason --
but only after a swsusp resume.

It would also be informative to examine the kernel command line options
that are making the problem go away:

pci=routeirq
acpi=off
acpi=noirq
pci=noacpi

What do they all have in common? ACPI. (AFAICT from my reading of the
source code, on i386 pci=routeirq only has an effect if ACPI is being
used for IRQ routing.)
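
To confirm which of these options a running kernel was actually booted with, one can look for the exact token in /proc/cmdline. A minimal sketch; `cmdline_has_param` is a hypothetical helper, it matches only whole space-separated tokens, and it does not split comma-combined options such as `pci=noacpi,routeirq`.

```c
#include <string.h>

/*
 * Return 1 if `param` (e.g. "pci=routeirq") appears as a whole
 * space-separated token in `cmdline` (the contents of /proc/cmdline,
 * which ends in a newline), otherwise 0. Hypothetical helper.
 */
int cmdline_has_param(const char *cmdline, const char *param)
{
    size_t len = strlen(param);
    const char *p = cmdline;

    if (len == 0)
        return 0;
    while ((p = strstr(p, param)) != NULL) {
        /* Reject partial matches such as "acpi=offx" or "xnoapic". */
        int starts = (p == cmdline || p[-1] == ' ');
        int ends = (p[len] == '\0' || p[len] == ' ' || p[len] == '\n');
        if (starts && ends)
            return 1;
        p++;
    }
    return 0;
}
```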

So, I think this bug probably lies in ACPI or swsusp. I highly *highly*
doubt it's driver bugs. Hopefully I'll have time later tonight or
tomorrow morning to see if I can figure anything else out...

-Barry K. Nathan <[email protected]>

2005-01-03 08:32:10

by Barry K. Nathan

Subject: Re: 2.6.10: e100 network broken after swsusp/resume

On Sun, Jan 02, 2005 at 09:10:18PM -0800, Barry K. Nathan wrote:
> So, I think this bug probably lies in ACPI or swsusp. I highly *highly*
> doubt it's driver bugs. Hopefully I'll have time later tonight or
> tomorrow morning to see if I can figure anything else out...

The following patch is a ridiculously dirty kludge which (very arguably)
improves the situation somewhat:

--- linux-2.6.10-bk4/arch/i386/kernel/mpparse.c 2004-12-14 03:17:21.723010806 -0800
+++ linux-2.6.10-bk4-bkn1/arch/i386/kernel/mpparse.c 2005-01-02 23:43:13.647613575 -0800
@@ -1091,9 +1091,10 @@
return gsi;
}
if ((1<<bit) & mp_ioapic_routing[ioapic].pin_programmed[idx]) {
- Dprintk(KERN_DEBUG "Pin %d-%d already programmed\n",
+ printk(KERN_DEBUG "Pin %d-%d already programmed\n",
mp_ioapic_routing[ioapic].apic_id, ioapic_pin);
- return gsi;
+ /* return gsi; */
+ printk(KERN_DEBUG "However, I will reprogram it anyway.\n");
}

mp_ioapic_routing[ioapic].pin_programmed[idx] |= (1<<bit);


With this patch, unloading and reloading 8139too will make it work again
after a resume -- as long as I boot *without* "noapic". This doesn't fix
the actual problem (it's still broken after resume, and reloading the
module still doesn't work for "noapic") but it might provide clues.

More specifically, this shows that the
mp_ioapic_routing[ioapic].pin_programmed[] array is inconsistent with
the IO-APIC's real configuration after the resume.

I think the reason that "pci=routeirq" works is that, with that option,
the kernel sets up everything on the IO-APIC early in bootup and leaves
nothing to be done later on -- that way, the IO-APIC ends up having the
same setup after the resume that it did at suspend time. At least, that's
what I suspect; I don't think I've proven it yet. I wouldn't be
surprised if a similar phenomenon is happening with acpi=off.
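
The mechanism described in the last two paragraphs can be illustrated with a small userspace model. It is not kernel code: `struct ioapic_model`, `program_pin()` and its `force` flag are hypothetical stand-ins for mp_ioapic_routing[].pin_programmed[] and the "already programmed" early return in mpparse.c, and the model assumes that once the image is restored the chip holds only what the boot kernel programmed. Under those assumptions a driver reload is skipped, bypassing the check (as the kludge patch does) reprograms the pin, and a boot kernel that programs every pin up front (Barry's hypothesis for pci=routeirq) leaves nothing stale.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NR_PINS 128   /* 4 x 32-bit words, like pin_programmed[4] */

/* Hypothetical model of one IO-APIC: what the kernel believes
 * (the pin_programmed bitmap) versus what the chip actually holds. */
struct ioapic_model {
    uint32_t pin_programmed[NR_PINS / 32]; /* kernel's cache */
    int hw_vector[NR_PINS];                /* chip state, 0 = unprogrammed */
};

static int is_cached(const struct ioapic_model *a, int pin)
{
    return (a->pin_programmed[pin / 32] >> (pin % 32)) & 1u;
}

/* Mirrors the early return: skip pins the cache says are done,
 * unless `force` is set (the effect of the kludge patch). */
void program_pin(struct ioapic_model *a, int pin, int vector, int force)
{
    assert(pin >= 0 && pin < NR_PINS);
    if (is_cached(a, pin) && !force)
        return;                          /* "Pin already programmed" */
    a->pin_programmed[pin / 32] |= 1u << (pin % 32);
    a->hw_vector[pin] = vector;
}

/* Resume: the suspended kernel's memory, including the cache, comes
 * back from the image, but the chip holds only what the boot kernel
 * set up (`boot_hw`). */
void resume_from_image(struct ioapic_model *a, const int *boot_hw)
{
    memcpy(a->hw_vector, boot_hw, sizeof a->hw_vector);
    /* a->pin_programmed is left exactly as it was in the image */
}
```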

Anyway, I'm going to keep working on this and see if I can figure it out
some more...

-Barry K. Nathan <[email protected]>

2005-01-03 08:47:31

by Pavel Machek

Subject: Re: 2.6.10: e100 network broken after swsusp/resume

Hi!

> > Documentation/kernel-parameters.txt says this about pci=routeirq:
> > "Do IRQ routing for all PCI devices. This is normally done in
> > pci_enable_device(), so this option is a temporary workaround for broken
> > drivers that don't call it."
> >
> > I.e., it doesn't sound too bad to use it until the problem is solved.
> > And I don't know if this particular issue is a case of broken drivers,
> > but that was what the parameter was added to work around.
>
> I don't think this is a case of broken drivers. So far in this thread, it's
> been seen with e100, 8139too, snd-intel8x0, and probably one of the USB
> drivers too. And the problem happens even if the module is unloaded and
> reloaded -- unless I'm seriously missing something, this probably means
> pci_enable_device() is unable to do its job properly for some reason --
> but only after a swsusp resume.
...
> So, I think this bug probably lies in ACPI or swsusp. I highly *highly*
> doubt it's driver bugs. Hopefully I'll have time later tonight or
> tomorrow morning to see if I can figure anything else out...

Actually, as you found out in your earlier mail, the problem is in the driver;
but it is the interrupt controller driver.

The right solution is to save the APIC state during sysdev_suspend() and
restore it during sysdev_resume().
Pavel

2005-01-03 08:53:32

by Pavel Machek

Subject: Re: 2.6.10: e100 network broken after swsusp/resume

Hi!

> > So, I think this bug probably lies in ACPI or swsusp. I highly *highly*
> > doubt it's driver bugs. Hopefully I'll have time later tonight or
> > tomorrow morning to see if I can figure anything else out...
>
> The following patch is a ridiculously dirty kludge which (very arguably)
> improves the situation somewhat:
>
> --- linux-2.6.10-bk4/arch/i386/kernel/mpparse.c 2004-12-14 03:17:21.723010806 -0800
> +++ linux-2.6.10-bk4-bkn1/arch/i386/kernel/mpparse.c 2005-01-02 23:43:13.647613575 -0800
> @@ -1091,9 +1091,10 @@
> return gsi;
> }
> if ((1<<bit) & mp_ioapic_routing[ioapic].pin_programmed[idx]) {
> - Dprintk(KERN_DEBUG "Pin %d-%d already programmed\n",
> + printk(KERN_DEBUG "Pin %d-%d already programmed\n",
> mp_ioapic_routing[ioapic].apic_id, ioapic_pin);
> - return gsi;
> + /* return gsi; */
> + printk(KERN_DEBUG "However, I will reprogram it anyway.\n");
> }
>
> mp_ioapic_routing[ioapic].pin_programmed[idx] |= (1<<bit);

A less dirty version of this would be adding the __nosavedata attribute to
mp_ioapic_routing... like this?

--- clean/arch/i386/kernel/mpparse.c 2004-12-25 13:34:57.000000000 +0100
+++ linux/arch/i386/kernel/mpparse.c 2005-01-03 09:51:07.000000000 +0100
@@ -868,7 +868,9 @@
int gsi_base;
int gsi_end;
u32 pin_programmed[4];
-} mp_ioapic_routing[MAX_IO_APICS];
+};
+
+static struct mp_ioapic_routing __nosavedata mp_ioapic_routing[MAX_IO_APICS];


static int mp_find_ioapic (
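
One way to picture why `__nosavedata` helps: data in the nosave section is not written to the image, so after resume it still holds what the boot kernel wrote, which is what the boot kernel actually programmed into the IO-APIC. A hypothetical userspace model, in which `struct kernel_memory`, `take_snapshot()` and `restore_snapshot()` stand in for swsusp's image handling; it illustrates the idea behind the patch and is not a test of it.

```c
/* Hypothetical model: ordinary data goes into the suspend image,
 * data marked __nosavedata does not. */
struct kernel_memory {
    unsigned int routing_cache;         /* ordinary variable */
    unsigned int routing_cache_nosave;  /* same data, marked __nosavedata */
};

struct image {
    unsigned int routing_cache;         /* only saved data is captured */
};

void take_snapshot(const struct kernel_memory *m, struct image *img)
{
    img->routing_cache = m->routing_cache;
}

/* Runs in the boot kernel: saved data is overwritten from the image,
 * nosave data keeps whatever the boot kernel left there. */
void restore_snapshot(struct kernel_memory *m, const struct image *img)
{
    m->routing_cache = img->routing_cache;
}
```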

> With this patch, unloading and reloading 8139too will make it work again
> after a resume -- as long as I boot *without* "noapic". This doesn't fix
> the actual problem (it's still broken after resume, and reloading the
> module still doesn't work for "noapic") but it might provide clues.
>
> More specifically, this shows that the
> mp_ioapic_routing[ioapic].pin_programmed[] array is inconsistent with
> the IO-APIC's real configuration after the resume.

Agreed. Also it would be nice if drivers did not have to reinitialize
the interrupts... Proper suspend/resume support for APIC would help
there, too.
Pavel

2005-01-03 10:14:32

by Barry K. Nathan

Subject: Re: 2.6.10: e100 network broken after swsusp/resume

On Mon, Jan 03, 2005 at 09:47:13AM +0100, Pavel Machek wrote:
> Actually, as you found out in your earlier mail, the problem is in the driver;
> but it is the interrupt controller driver.
>
> The right solution is to save the APIC state during sysdev_suspend() and
> restore it during sysdev_resume().

AFAICT proper support is *already* there in sysdev_suspend() and
sysdev_resume().

However, at least on my system, neither of those functions is getting
called! I put BUG()s at the top of both functions, and neither of those
BUGs is being hit in a suspend/resume cycle.

-Barry K. Nathan <[email protected]>

2005-01-03 15:04:37

by Barry K. Nathan

Subject: [PATCH] swsusp: properly suspend and resume *all* devices

swsusp does not suspend and resume *all* devices, including system
devices. This has been the case since at least 2.6.9, if not earlier.

One effect of this is that resuming fails to properly reconfigure
interrupt routers. In 2.6.9 this was obscured by other kernel code,
but in 2.6.10 this often causes post-resume APIC errors and near-total
failure of some PCI devices (e.g. network, sound and USB controllers).

On at least one of my systems, without this patch I also have to "ifdown
eth0;ifup eth0" to get networking to function after resuming, even after
working around the interrupt routing problem mentioned above. With this
patch, networking simply works after a resume, and the ifdown/ifup is
no longer needed.

This patch is against 2.6.10-mm1, although it applies with an offset to
2.6.10-bk4 as well. I have tested it against 2.6.10-mm1 and 2.6.10-bk4,
with and without "noapic", with and without "acpi=off". However, I have
not tested it on a highmem system.

I believe this patch fixes a severe problem in swsusp; I would like to
see this patch (or at least *some* kind of fix for this problem) tested
more widely and committed to mainline before the 2.6.11 release.

Signed-off-by: Barry K. Nathan <[email protected]>

--- linux-2.6.10-mm1/kernel/power/swsusp.c 2005-01-03 02:16:15.175265255 -0800
+++ linux-2.6.10-mm1-bkn3/kernel/power/swsusp.c 2005-01-03 06:27:07.753344731 -0800
@@ -843,11 +843,22 @@
if ((error = arch_prepare_suspend()))
return error;
local_irq_disable();
+ /* At this point, device_suspend() has been called, but *not*
+ * device_power_down(). We *must* device_power_down() now.
+ * Otherwise, drivers for some devices (e.g. interrupt controllers)
+ * become desynchronized with the actual state of the hardware
+ * at resume time, and evil weirdness ensues.
+ */
+ if ((error = device_power_down(PM_SUSPEND_DISK))) {
+ local_irq_enable();
+ return error;
+ }
save_processor_state();
error = swsusp_arch_suspend();
/* Restore control flow magically appears here */
restore_processor_state();
restore_highmem();
+ device_power_up();
local_irq_enable();
return error;
}
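
The ordering this patch establishes, including its error path, can be checked with a userspace model in which each call only appends a name to a log. All function bodies below are hypothetical stubs that borrow the kernel names for readability, `power_down_fails` simulates device_power_down() returning an error, and arch_prepare_suspend() and restore_highmem() are left out.

```c
#include <string.h>

static char call_log[256];     /* reset between runs by the caller */
static int power_down_fails;   /* simulate device_power_down() failing */

static void log_call(const char *name)
{
    if (call_log[0] != '\0')
        strcat(call_log, " ");
    strcat(call_log, name);
}

/* Stubs standing in for the real kernel functions. */
static void local_irq_disable(void)       { log_call("irq_off"); }
static void local_irq_enable(void)        { log_call("irq_on"); }
static int  device_power_down(int state)  { (void)state; log_call("power_down"); return power_down_fails ? -1 : 0; }
static void device_power_up(void)         { log_call("power_up"); }
static void save_processor_state(void)    { log_call("save_cpu"); }
static int  swsusp_arch_suspend(void)     { log_call("arch_suspend"); return 0; }
static void restore_processor_state(void) { log_call("restore_cpu"); }

#define PM_SUSPEND_DISK 4   /* value is arbitrary in this model */

/* Same shape as the patched suspend path. */
int model_suspend(void)
{
    int error;

    local_irq_disable();
    if ((error = device_power_down(PM_SUSPEND_DISK))) {
        local_irq_enable();   /* undo only what was done so far */
        return error;
    }
    save_processor_state();
    error = swsusp_arch_suspend();
    restore_processor_state();
    device_power_up();        /* pairs with device_power_down() */
    local_irq_enable();
    return error;
}
```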

2005-01-03 17:14:39

by Pavel Machek

Subject: Re: [PATCH] swsusp: properly suspend and resume *all* devices

Hi!

> swsusp does not suspend and resume *all* devices, including system
> devices. This has been the case since at least 2.6.9, if not earlier.
>
> One effect of this is that resuming fails to properly reconfigure
> interrupt routers. In 2.6.9 this was obscured by other kernel code,
> but in 2.6.10 this often causes post-resume APIC errors and near-total
> failure of some PCI devices (e.g. network, sound and USB controllers).
>
> On at least one of my systems, without this patch I also have to "ifdown
> eth0;ifup eth0" to get networking to function after resuming, even after
> working around the interrupt routing problem mentioned above. With this
> patch, networking simply works after a resume, and the ifdown/ifup is
> no longer needed.
>
> This patch is against 2.6.10-mm1, although it applies with an offset to
> 2.6.10-bk4 as well. I have tested it against 2.6.10-mm1 and 2.6.10-bk4,
> with and without "noapic", with and without "acpi=off". However, I have
> not tested it on a highmem system.
>
> I believe this patch fixes a severe problem in swsusp; I would like to
> see this patch (or at least *some* kind of fix for this problem) tested
> more widely and committed to mainline before the 2.6.11 release.
>
> Signed-off-by: Barry K. Nathan <[email protected]>

Ack. [I have a similar patch in my tree, but yours is better in the error
checking area. Please push it to akpm.]
Pavel


> --- linux-2.6.10-mm1/kernel/power/swsusp.c 2005-01-03 02:16:15.175265255 -0800
> +++ linux-2.6.10-mm1-bkn3/kernel/power/swsusp.c 2005-01-03 06:27:07.753344731 -0800
> @@ -843,11 +843,22 @@
> if ((error = arch_prepare_suspend()))
> return error;
> local_irq_disable();
> + /* At this point, device_suspend() has been called, but *not*
> + * device_power_down(). We *must* device_power_down() now.
> + * Otherwise, drivers for some devices (e.g. interrupt controllers)
> + * become desynchronized with the actual state of the hardware
> + * at resume time, and evil weirdness ensues.
> + */
> + if ((error = device_power_down(PM_SUSPEND_DISK))) {
> + local_irq_enable();
> + return error;
> + }
> save_processor_state();
> error = swsusp_arch_suspend();
> /* Restore control flow magically appears here */
> restore_processor_state();
> restore_highmem();
> + device_power_up();
> local_irq_enable();
> return error;
> }

2005-01-03 18:52:04

by Pavel Machek

Subject: Re: [PATCH] swsusp: properly suspend and resume *all* devices

On Mon 03-01-05 18:08:07, Pavel Machek wrote:
> Hi!
>
> > swsusp does not suspend and resume *all* devices, including system
> > devices. This has been the case since at least 2.6.9, if not earlier.
> >
> > One effect of this is that resuming fails to properly reconfigure
> > interrupt routers. In 2.6.9 this was obscured by other kernel code,
> > but in 2.6.10 this often causes post-resume APIC errors and near-total
> > failure of some PCI devices (e.g. network, sound and USB controllers).
> >
> > On at least one of my systems, without this patch I also have to "ifdown
> > eth0;ifup eth0" to get networking to function after resuming, even after
> > working around the interrupt routing problem mentioned above. With this
> > patch, networking simply works after a resume, and the ifdown/ifup is
> > no longer needed.
> >
> > This patch is against 2.6.10-mm1, although it applies with an offset to
> > 2.6.10-bk4 as well. I have tested it against 2.6.10-mm1 and 2.6.10-bk4,
> > with and without "noapic", with and without "acpi=off". However, I have
> > not tested it on a highmem system.
> >
> > I believe this patch fixes a severe problem in swsusp; I would like to
> > see this patch (or at least *some* kind of fix for this problem) tested
> > more widely and committed to mainline before the 2.6.11 release.
> >
> > Signed-off-by: Barry K. Nathan <[email protected]>
>
> Ack. [I have a similar patch in my tree, but yours is better in the error
> checking area. Please push it to akpm.]

Actually, you missed the second half: the same code should be added around
swsusp_arch_resume. It is not too critical there, but it's the right thing
to do.
Pavel

> > --- linux-2.6.10-mm1/kernel/power/swsusp.c 2005-01-03 02:16:15.175265255 -0800
> > +++ linux-2.6.10-mm1-bkn3/kernel/power/swsusp.c 2005-01-03 06:27:07.753344731 -0800
> > @@ -843,11 +843,22 @@
> > if ((error = arch_prepare_suspend()))
> > return error;
> > local_irq_disable();
> > + /* At this point, device_suspend() has been called, but *not*
> > + * device_power_down(). We *must* device_power_down() now.
> > + * Otherwise, drivers for some devices (e.g. interrupt controllers)
> > + * become desynchronized with the actual state of the hardware
> > + * at resume time, and evil weirdness ensues.
> > + */
> > + if ((error = device_power_down(PM_SUSPEND_DISK))) {
> > + local_irq_enable();
> > + return error;
> > + }
> > save_processor_state();
> > error = swsusp_arch_suspend();
> > /* Restore control flow magically appears here */
> > restore_processor_state();
> > restore_highmem();
> > + device_power_up();
> > local_irq_enable();
> > return error;
> > }
>

2005-01-04 05:17:47

by Barry K. Nathan

Subject: Re: [PATCH] swsusp: properly suspend and resume *all* devices

On Mon, Jan 03, 2005 at 07:33:18PM +0100, Pavel Machek wrote:
> > Ack. [I have a similar patch in my tree, but yours is better in the error
> > checking area. Please push it to akpm.]
>
> Actually, you missed the second half: the same code should be added around
> swsusp_arch_resume. It is not too critical there, but it's the right thing
> to do.

Hmmm... I'm not sure how necessary it is, and I think it slows down
resume a tiny bit. However, the more I think about it the more correct it
seems, so here's the follow-up patch. (Andrew, even if this patch is
rejected, please commit my first one to the next -mm release. That patch
alone is still an improvement over the current code.)

Signed-off-by: Barry K. Nathan <[email protected]>

--- linux-2.6.10-mm1-bkn3/kernel/power/swsusp.c 2005-01-03 06:27:07.753344731 -0800
+++ linux-2.6.10-mm1-bkn4/kernel/power/swsusp.c 2005-01-03 20:19:06.737439106 -0800
@@ -878,6 +878,7 @@
{
int error;
local_irq_disable();
+ device_power_down(PM_SUSPEND_DISK);
/* We'll ignore saved state, but this gets preempt count (etc) right */
save_processor_state();
error = swsusp_arch_resume();
@@ -887,6 +888,7 @@
BUG_ON(!error);
restore_processor_state();
restore_highmem();
+ device_power_up();
local_irq_enable();
return error;
}
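
One subtlety on the resume side: when swsusp_arch_resume() succeeds it does not return; execution continues after swsusp_arch_suspend() in the restored image, so the matching device_power_up() is the one the first patch added to the suspend path, and the device_power_up() added here only runs on failure (hence the BUG_ON(!error) above it). A hypothetical userspace model of that bracket, in which a return value of 0 stands for "jumped into the image" and the `_stub` functions replace the real calls:

```c
/* Hypothetical model of the patched resume path. */
static int powered_down;     /* 1 while devices are powered down */
static int arch_resume_ok;   /* 1: swsusp_arch_resume() "succeeds" */

static void device_power_down_stub(void) { powered_down = 1; }
static void device_power_up_stub(void)   { powered_down = 0; }

/* 0 means control left for the restored image; otherwise an error. */
static int swsusp_arch_resume_stub(void) { return arch_resume_ok ? 0 : -1; }

int model_resume(void)
{
    int error;

    device_power_down_stub();
    error = swsusp_arch_resume_stub();
    if (!error)
        /* In the kernel this point is never reached: the restored
         * image's suspend path calls device_power_up() instead. */
        return 0;
    device_power_up_stub();   /* failure: undo before re-enabling IRQs */
    return error;
}
```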

2005-01-04 05:19:00

by Barry K. Nathan

Subject: Re: [PATCH] swsusp: properly suspend and resume *all* devices

On Mon, Jan 03, 2005 at 09:15:30PM -0800, Barry K. Nathan wrote:
> Hmmm... I'm not sure how necessary it is, and I think it slows down
> resume a tiny bit. However, the more I think about it the more correct it
> seems, so here's the follow-up patch. (Andrew, even if this patch is
> rejected, please commit my first one to the next -mm release. That patch
> alone is still an improvement over the current code.)

Ugh. I forgot to mention in my previous mail that the patch is against
2.6.10-mm1 + my previous patch (perhaps that was obvious), and that I've
lightly tested the patch (that probably wasn't obvious, to say the
least).

-Barry K. Nathan <[email protected]>

2005-01-04 08:51:22

by Martin Lucina

Subject: Re: [PATCH] swsusp: properly suspend and resume *all* devices

Hi Barry,

Barry K. Nathan <barryn <at> pobox.com> writes:

> swsusp does not suspend and resume *all* devices, including system
> devices. This has been the case since at least 2.6.9, if not earlier.
>
> One effect of this is that resuming fails to properly reconfigure
> interrupt routers. In 2.6.9 this was obscured by other kernel code,
> but in 2.6.10 this often causes post-resume APIC errors and near-total
> failure of some PCI devices (e.g. network, sound and USB controllers).

I'm seeing a variation (?) of this problem with 2.6.10. I have the same symptoms
as you describe above, but on a machine without an APIC, using APM for
suspend/resume. (Toshiba Portege 7220cte, which has an Intel 440BX chipset)

Obviously, I don't get the APIC errors, but everything else is the same, random
devices fail and need to be reloaded (3c59x and uhci-hcd in particular), plus
the system appears to panic somewhere along the way to resume occasionally (as I
assume from the hung machine and blinking CAPS LOCK), which didn't happen
previously (2.6.9, 2.6.8.1, ...). I also see lots of

drivers/usb/input/hid-core.c: input irq status -84 received

until I do a 'rmmod uhci_hcd; modprobe uhci_hcd'. This used to happen with 2.6.9
as well, but the system would recover after about 20 messages or so like this
after a resume.

Any suggestions about where to look to track this down?

-mato

2005-01-05 00:11:37

by Pavel Machek

Subject: Re: [PATCH] swsusp: properly suspend and resume *all* devices

Hi!

> > devices. This has been the case since at least 2.6.9, if not earlier.
> >
> > One effect of this is that resuming fails to properly reconfigure
> > interrupt routers. In 2.6.9 this was obscured by other kernel code,
> > but in 2.6.10 this often causes post-resume APIC errors and near-total
> > failure of some PCI devices (e.g. network, sound and USB controllers).
>
> I'm seeing a variation (?) of this problem with 2.6.10. I have the same symptoms
> as you describe above, but on a machine without an APIC, using APM for
> suspend/resume. (Toshiba Portege 7220cte, which has an Intel 440BX chipset)
>
> Obviously, I don't get the APIC errors, but everything else is the same, random
> devices fail and need to be reloaded (3c59x and uhci-hcd in particular), plus
> the system appears to panic somewhere along the way to resume occasionally (as I
> assume from the hung machine and blinking CAPS LOCK), which didn't happen
> previously (2.6.9, 2.6.8.1, ...). I also see lots of
>
> drivers/usb/input/hid-core.c: input irq status -84 received
>
> until I do a 'rmmod uhci_hcd; modprobe uhci_hcd'. This used to happen with 2.6.9
> as well, but the system would recover after about 20 messages or so like this
> after a resume.
>
> Any suggestions about where to look to track this down?

USB stuff should be discussed on the USB mailing list. Unload uhci_hcd
before suspend and reload it after resume to make sure it does not
interfere.

Check if 3c59x has suspend/resume support. If not, add it.

Panic... we really need to know why it panicked. VESAFB does not
support blanking; just switch to VESAFB and you should be able to see
the messages.
Pavel

2005-01-05 16:04:01

by Lion Vollnhals

Subject: Re: [PATCH] swsusp: properly suspend and resume *all* devices

>>Obviously, I don't get the APIC errors, but everything else is the same, random
>>devices fail and need to be reloaded (3c59x and uhci-hcd in particular), plus
>>the system appears to panic somewhere along the way to resume occasionally (as I
>>assume from the hung machine and blinking CAPS LOCK), which didn't happen
>>previously (2.6.9, 2.6.8.1, ...). I also see lots of
>>
>>drivers/usb/input/hid-core.c: input irq status -84 received
>>
>>until I do a 'rmmod uhci_hcd; modprobe uhci_hcd'. This used to happen with 2.6.9
>>as well, but the system would recover after about 20 messages or so like this
>>after a resume.
>>
>>Any suggestions about where to look to track this down?
>
>
> Check if 3c59x has suspend/resume support. If not, add it.
>
> Panic... we really need to know why it panicked. VESAFB does not
> support blanking; just switch to VESAFB and you should be able to see
> the messages.
> Pavel

I have a problem with net-devices, ne2000 in particular, in 2.6.9 and
2.6.10, too. After a resume the ne2000-device doesn't work anymore. I
have to restart it using the initscripts.

How do I add suspend/resume support (to ISA devices, like my ne2000)?
Can you point me to some information/tutorial?

Lion Vollnhals

2005-01-06 22:30:32

by Pavel Machek

Subject: Re: [PATCH] swsusp: properly suspend and resume *all* devices

Hi!

> I have a problem with net-devices, ne2000 in particular, in 2.6.9 and
> 2.6.10, too. After a resume the ne2000-device doesn't work anymore. I
> have to restart it using the initscripts.
>
> How do I add suspend/resume support (to ISA devices, like my ne2000)?
> Can you point me to some information/tutorial?

Look how i8042 suspend/resume support is done and do it in similar
way...
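
For an ISA card such as the ne2000, the general shape Pavel points to is a pair of callbacks: save (or be able to recompute) the chip's configuration when suspending and write it back when resuming. The following userspace model only sketches that pattern; `struct model_dev`, `struct model_driver` and the register values are hypothetical, and it deliberately does not reproduce the 2.6.10 driver-model callback signatures, which should be taken from drivers/input/serio/i8042.c itself.

```c
#include <string.h>

#define NR_REGS 4

/* Hypothetical model of an ISA NIC: a few registers that lose their
 * contents when the machine powers off during suspend-to-disk. */
struct model_dev {
    unsigned char regs[NR_REGS];   /* "hardware" registers */
    unsigned char saved[NR_REGS];  /* driver-private copy */
};

/* Callback table in the style of a driver's suspend/resume hooks. */
struct model_driver {
    int (*suspend)(struct model_dev *dev);
    int (*resume)(struct model_dev *dev);
};

static int nic_suspend(struct model_dev *dev)
{
    memcpy(dev->saved, dev->regs, NR_REGS);  /* save chip configuration */
    return 0;
}

static int nic_resume(struct model_dev *dev)
{
    memcpy(dev->regs, dev->saved, NR_REGS);  /* reprogram the chip */
    return 0;
}

const struct model_driver nic_driver = {
    .suspend = nic_suspend,
    .resume  = nic_resume,
};

/* What the power cut does to the chip in this model. */
void power_cycle(struct model_dev *dev)
{
    memset(dev->regs, 0, NR_REGS);
}
```
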
Pavel

2005-01-07 13:47:19

by Takashi Iwai

Subject: Re: [PATCH] swsusp: properly suspend and resume *all* devices

At Thu, 6 Jan 2005 23:29:27 +0100,
Pavel Machek wrote:
>
> Hi!
>
> > I have a problem with net-devices, ne2000 in particular, in 2.6.9 and
> > 2.6.10, too. After a resume the ne2000-device doesn't work anymore. I
> > have to restart it using the initscripts.
> >
> > How do I add suspend/resume support (to ISA devices, like my ne2000)?
> > Can you point me to some information/tutorial?
>
> Look how i8042 suspend/resume support is done and do it in similar
> way...

Yep, it's fairly easy to implement it that way (I did so for ALSA).

But i8042 also has pm_register(), which mentions APM. Isn't that
redundant?


Takashi

2005-01-07 13:54:51

by Pavel Machek

Subject: Re: [PATCH] swsusp: properly suspend and resume *all* devices

Hi!

> > > I have a problem with net-devices, ne2000 in particular, in 2.6.9 and
> > > 2.6.10, too. After a resume the ne2000-device doesn't work anymore. I
> > > have to restart it using the initscripts.
> > >
> > > How do I add suspend/resume support (to ISA devices, like my ne2000)?
> > > Can you point me to some information/tutorial?
> >
> > Look how i8042 suspend/resume support is done and do it in similar
> > way...
>
> Yep, it's fairly easy to implement it that way (I did so for ALSA).
>
> But i8042 also has pm_register(), which mentions APM. Isn't that
> redundant?

Yes, it looks redundant. Vojtech, could you check why this is still
needed? It should not be.
Pavel

2005-01-07 14:49:03

by Dmitry Torokhov

Subject: Re: [PATCH] swsusp: properly suspend and resume *all* devices

On Fri, 7 Jan 2005 14:54:18 +0100, Pavel Machek <[email protected]> wrote:
> Hi!
>
> > > > I have a problem with net-devices, ne2000 in particular, in 2.6.9 and
> > > > 2.6.10, too. After a resume the ne2000-device doesn't work anymore. I
> > > > have to restart it using the initscripts.
> > > >
> > > > How do I add suspend/resume support (to ISA devices, like my ne2000)?
> > > > Can you point me to some information/tutorial?
> > >
> > > Look how i8042 suspend/resume support is done and do it in similar
> > > way...
> >
> > Yep, it's fairly easy to implement it that way (I did so for ALSA).
> >
> > But i8042 also has pm_register(), which mentions APM. Isn't that
> > redundant?
>
> Yes, it looks redundant. Vojtech, could you check why this is still
> needed? It should not be.

It is removed in -bk.

--
Dmitry

2005-01-07 15:34:28

by Vojtech Pavlik

Subject: Re: [PATCH] swsusp: properly suspend and resume *all* devices

On Fri, Jan 07, 2005 at 02:54:18PM +0100, Pavel Machek wrote:
> Hi!
>
> > > > I have a problem with net-devices, ne2000 in particular, in 2.6.9 and
> > > > 2.6.10, too. After a resume the ne2000-device doesn't work anymore. I
> > > > have to restart it using the initscripts.
> > > >
> > > > How do I add suspend/resume support (to ISA devices, like my ne2000)?
> > > > Can you point me to some information/tutorial?
> > >
> > > Look how i8042 suspend/resume support is done and do it in similar
> > > way...
> >
> > Yep, it's fairly easy to implement it that way (I did so for ALSA).
> >
> > But i8042 also has pm_register(), which mentions APM. Isn't that
> > redundant?
>
> Yes, it looks redundant. Vojtech, could you check why this is still
> needed? It should not be.

We already have a patch removing that in the queue.

--
Vojtech Pavlik
SuSE Labs, SuSE CR

2005-01-07 15:58:42

by Lion Vollnhals

Subject: Re: [PATCH] swsusp: properly suspend and resume *all* devices

Pavel Machek wrote:
> Hi!
>
>
>>I have a problem with net-devices, ne2000 in particular, in 2.6.9 and
>>2.6.10, too. After a resume the ne2000-device doesn't work anymore. I
>>have to restart it using the initscripts.
>>
>>How do I add suspend/resume support (to ISA devices, like my ne2000)?
>>Can you point me to some information/tutorial?
>
>
> Look how i8042 suspend/resume support is done and do it in similar
> way...
> Pavel

thx, i will do that.

--
Lion Vollnhals