2009-10-29 18:35:08

by Rafael J. Wysocki

Subject: Re: [linux-pm] intermittent suspend problem again

On Thursday 29 October 2009, Ferenc Wagner wrote:
> "Rafael J. Wysocki" <[email protected]> writes:
>
> > On Wednesday 28 October 2009, Ferenc Wagner wrote:
> >
> >> Something similar to http://bugzilla.kernel.org/show_bug.cgi?id=13894
> >> raised its ugly head again, please see my last comments on that bug.
> >
> > This very well may be a separate bug, so please file a new bugzilla report
> > on this and mark it as a regression.
>
> Done.

Which number is this?

> >> 2.6.32-rc5 feels particularly bad, with frequent failures to switch
> >> off the machine after "S|" or freezes after "Snapshotting system".
> >> The former does not cause much trouble in itself, as the machine can
> >> be switched off and resumed all right, but the latter is nasty.
> >> Suspend to RAM works all the time. The issue is not reproducible,
> >> unfortunately, and the kernel change happened almost together with a
> >> BIOS upgrade. Yesterday I switched back to 2.6.31 to see whether it
> >> still works stably with the new BIOS. I'll report back my findings in
> >> a couple of days.
> >
> > OK, thanks.
> >
> > Still, I'm really afraid we won't be able to debug it any further without a
> > reproducible test case.
>
> I've got another, fully reproducible but nevertheless neglected ACPI
> problem, already mentioned in #13894:
> https://bugs.freedesktop.org/show_bug.cgi?id=22126.

A side note: I'm totally unhappy with _kernel_ bugs being handled at
bugs.freedesktop.org without a notice anywhere else. Even though they are
related to the graphics, the kernel developers in general at least deserve the
information that the bugs have been reported.

In this particular case, the bug is clearly related to ACPI and linux-acpi
should have received a notification about it.

> Well, it's probably far-fetched, but maybe the two are somehow related...

Very well may be.

> Can't you perhaps suggest a way forward there? Or some tricks to create a
> reproducible test case here?

Well, you can test if the problem is reproducible in the "shutdown" mode of
hibernation.

> Btw. my gut feeling is that hibernation
> is getting slower with each kernel release. I didn't measure it, and
> didn't even care about comparable initial states... But could anything
> explain this, or is it sheer impatience?

Which part of it is getting slower? Saving the image, suspending devices or
the entire hibernation overall?

Rafael


2009-10-29 22:32:19

by Ferenc Wagner

Subject: Re: [linux-pm] intermittent suspend problem again

"Rafael J. Wysocki" <[email protected]> writes:

> On Thursday 29 October 2009, Ferenc Wagner wrote:
>> "Rafael J. Wysocki" <[email protected]> writes:
>>
>>> On Wednesday 28 October 2009, Ferenc Wagner wrote:
>>>
>>>> Something similar to http://bugzilla.kernel.org/show_bug.cgi?id=13894
>>>> raised its ugly head again, please see my last comments on that bug.
>>>
>>> This very well may be a separate bug, so please file a new bugzilla report
>>> on this and mark it as a regression.
>>
>> Done.
>
> Which number is this?

http://bugzilla.kernel.org/show_bug.cgi?id=14504
Submitted containing the following paragraph only:

>>>> 2.6.32-rc5 feels particularly bad, with frequent failures to switch
>>>> off the machine after "S|" or freezes after "Snapshotting system".
>>>> The former does not cause much trouble in itself, as the machine can
>>>> be switched off and resumed all right, but the latter is nasty.
>>>> Suspend to RAM works all the time. The issue is not reproducible,
>>>> unfortunately, and the kernel change happened almost together with a
>>>> BIOS upgrade. Yesterday I switched back to 2.6.31 to see whether it
>>>> still works stably with the new BIOS. I'll report back my findings in
>>>> a couple of days.
>>>
>>> OK, thanks.
>>>
>>> Still, I'm really afraid we won't be able to debug it any further without a
>>> reproducible test case.
>>
>> I've got another, fully reproducible but nevertheless neglected ACPI
>> problem, already mentioned in #13894:
>> https://bugs.freedesktop.org/show_bug.cgi?id=22126.
>
> A side note: I'm totally unhappy with _kernel_ bugs being handled at
> bugs.freedesktop.org without a notice anywhere else. Even though they are
> related to the graphics, the kernel developers in general at least deserve the
> information that the bugs have been reported.
>
> In this particular case, the bug is clearly related to ACPI and linux-acpi
> should have received a notification about it.

When the ACPI relation became clear to me, I notified linux-acpi, see
http://thread.gmane.org/gmane.linux.acpi.devel/42172/focus=42230

>> Well, it's probably far-fetched, but maybe the two are somehow related...
>
> Very well may be.
>
>> Can't you perhaps suggest a way forward there? Or some tricks to create a
>> reproducible test case here?
>
> Well, you can test if the problem is reproducible in the "shutdown" mode of
> hibernation.

Ok, I'll go back to 2.6.32-rc5 for testing that. Does that make any
difference in the "Snapshotting system" phase? Freezes happen that
time, too, before writing out the image.

>> Btw. my gut feeling is that hibernation is getting slower with each
>> kernel release. I didn't measure it, and didn't even care about
>> comparable initial states... But could anything explain this, or is
>> it sheer impatience?
>
> Which part of it is getting slower? Saving the image, suspending
> devices or the entire hibernation overall?

"Snapshotting system" before saving the image and saving the image as
well. If s2disk didn't report funny huge negative ratios all the
time, I'd probably have tried to correlate this with the number of
saved pages or similar... But anyway, this is a minor nit, it's still
far from being unbearable. If only it worked all the time!
--
Thanks,
Feri.

2009-10-30 18:17:15

by Rafael J. Wysocki

Subject: Re: [linux-pm] intermittent suspend problem again

On Thursday 29 October 2009, Ferenc Wagner wrote:
> "Rafael J. Wysocki" <[email protected]> writes:
>
> > On Thursday 29 October 2009, Ferenc Wagner wrote:
> >> "Rafael J. Wysocki" <[email protected]> writes:
> >>
> >>> On Wednesday 28 October 2009, Ferenc Wagner wrote:
> >>>
> >>>> Something similar to http://bugzilla.kernel.org/show_bug.cgi?id=13894
> >>>> raised its ugly head again, please see my last comments on that bug.
> >>>
> >>> This very well may be a separate bug, so please file a new bugzilla report
> >>> on this and mark it as a regression.
> >>
> >> Done.
> >
> > Which number is this?
>
> http://bugzilla.kernel.org/show_bug.cgi?id=14504

Thanks.

> Submitted containing the following paragraph only:

That should be sufficient.

> >>>> 2.6.32-rc5 feels particularly bad, with frequent failures to switch
> >>>> off the machine after "S|" or freezes after "Snapshotting system".
> >>>> The former does not cause much trouble in itself, as the machine can
> >>>> be switched off and resumed all right, but the latter is nasty.
> >>>> Suspend to RAM works all the time. The issue is not reproducible,
> >>>> unfortunately, and the kernel change happened almost together with a
> >>>> BIOS upgrade. Yesterday I switched back to 2.6.31 to see whether it
> >>>> still works stably with the new BIOS. I'll report back my findings in
> >>>> a couple of days.
> >>>
> >>> OK, thanks.
> >>>
> >>> Still, I'm really afraid we won't be able to debug it any further without a
> >>> reproducible test case.
> >>
> >> I've got another, fully reproducible but nevertheless neglected ACPI
> >> problem, already mentioned in #13894:
> >> https://bugs.freedesktop.org/show_bug.cgi?id=22126.
> >
> > A side note: I'm totally unhappy with _kernel_ bugs being handled at
> > bugs.freedesktop.org without a notice anywhere else. Even though they are
> > related to the graphics, the kernel developers in general at least deserve the
> > information that the bugs have been reported.
> >
> > In this particular case, the bug is clearly related to ACPI and linux-acpi
> > should have received a notification about it.
>
> When the ACPI relation became clear to me, I notified linux-acpi, see
> http://thread.gmane.org/gmane.linux.acpi.devel/42172/focus=42230

OK, thanks.

> >> Well, it's probably far-fetched, but maybe the two are somehow related...
> >
> > Very well may be.
> >
> >> Can't you perhaps suggest a way forward there? Or some tricks to create a
> >> reproducible test case here?
> >
> > Well, you can test if the problem is reproducible in the "shutdown" mode of
> > hibernation.
>
> Ok, I'll go back to 2.6.32-rc5 for testing that. Does that make any
> difference in the "Snapshotting system" phase?

Yes, it does.

> Freezes happen that time, too, before writing out the image.
>
> >> Btw. my gut feeling is that hibernation is getting slower with each
> >> kernel release. I didn't measure it, and didn't even care about
> >> comparable initial states... But could anything explain this, or is
> >> it sheer impatience?
> >
> > Which part of it is getting slower? Saving the image, suspending
> > devices or the entire hibernation overall?
>
> "Snapshotting system" before saving the image

That may be a result of changing the way in which image memory is reserved.
How much memory is there in your machine?

> and saving the image as well. If s2disk didn't report funny huge negative
> ratios all the time,

Hmm. This looks like a bug in s2disk.

> I'd probably have tried to correlate this with the number of
> saved pages or similar... But anyway, this is a minor nit, it's still
> far from being unbearable. If only it worked all the time!

It should.

Thanks,
Rafael

2009-10-30 19:41:37

by Ferenc Wagner

Subject: Re: [linux-pm] intermittent suspend problem again

"Rafael J. Wysocki" <[email protected]> writes:

> On Thursday 29 October 2009, Ferenc Wagner wrote:
>
>> "Rafael J. Wysocki" <[email protected]> writes:
>>
>>> Which part of it is getting slower? Saving the image, suspending
>>> devices or the entire hibernation overall?
>>
>> "Snapshotting system" before saving the image
>
> That may be a result of changing the way in which image memory is reserved.
> How much memory is there in your machine?

512 MB.

>> and saving the image as well. If s2disk didn't report funny huge negative
>> ratios all the time,
>
> Hmm. This looks like a bug in s2disk.

Definitely. Do you also experience this? Probably an easy one, but
I've never had the chance to check the CVS version (running 0.7 at the
moment). I can probably give 0.8 a spin if you deem necessary. I
always thought it wasn't more than a cosmetic flaw.
--
Thanks,
Feri.

2009-10-30 20:36:43

by Rafael J. Wysocki

Subject: Re: [linux-pm] intermittent suspend problem again

On Friday 30 October 2009, Ferenc Wagner wrote:
> "Rafael J. Wysocki" <[email protected]> writes:
>
> > On Thursday 29 October 2009, Ferenc Wagner wrote:
> >
> >> "Rafael J. Wysocki" <[email protected]> writes:
> >>
> >>> Which part of it is getting slower? Saving the image, suspending
> >>> devices or the entire hibernation overall?
> >>
> >> "Snapshotting system" before saving the image
> >
> > That may be a result of changing the way in which image memory is reserved.
> > How much memory is there in your machine?
>
> 512 MB.

So it's likely the slowdown results from the memory management rework.
Hopefully, it'll improve in the future.

> >> and saving the image as well. If s2disk didn't report funny huge negative
> >> ratios all the time,
> >
> > Hmm. This looks like a bug in s2disk.
>
> Definitely. Do you also experience this?

Not really, but I use newer versions.

> Probably an easy one, but I've never had the chance to check the CVS version
> (running 0.7 at the moment). I can probably give 0.8 a spin if you deem
> necessary. I always thought it wasn't more than a cosmetic flaw.

It probably is. You can try the current version from my git tree at:
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-utils.git

Thanks,
Rafael

2009-10-31 12:02:03

by Alan Jenkins

Subject: Re: [linux-pm] intermittent suspend problem again

On 10/30/09, Rafael J. Wysocki <[email protected]> wrote:
> On Friday 30 October 2009, Ferenc Wagner wrote:
>> "Rafael J. Wysocki" <[email protected]> writes:
>>
>> > On Thursday 29 October 2009, Ferenc Wagner wrote:
>> >
>> >> "Rafael J. Wysocki" <[email protected]> writes:

>> >> and saving the image as well. If s2disk didn't report funny huge
>> >> negative
>> >> ratios all the time,
>> >
>> > Hmm. This looks like a bug in s2disk.
>>
>> Definitely. Do you also experience this?
>
> Not really, but I use newer versions.
>
>> Probably an easy one, but I've never had the chance to check the CVS
>> version
>> (running 0.7 at the moment). I can probably give 0.8 a spin if you deem
>> necessary. I always thought it wasn't more than a cosmetic flaw.
>
> It probably is. You can try the current version from my git tree at:
> git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-utils.git
>
> Thanks,
> Rafael

I seem to recall reporting this and finding that the latest version
fixed the bug simply by removing the code which printed the ratio :-).

Regards
Alan

2009-10-31 14:06:26

by Ferenc Wagner

Subject: Re: [linux-pm] intermittent suspend problem again

Alan Jenkins <[email protected]> writes:

> On 10/30/09, Rafael J. Wysocki <[email protected]> wrote:
>> On Friday 30 October 2009, Ferenc Wagner wrote:
>>> "Rafael J. Wysocki" <[email protected]> writes:
>>>
>>>> On Thursday 29 October 2009, Ferenc Wagner wrote:
>>>>
>>>>> and saving the image as well. If s2disk didn't report funny huge
>>>>> negative ratios all the time,
>>>>
>>>> Hmm. This looks like a bug in s2disk.
>>>
>>> Definitely. Do you also experience this?
>>
>> Not really, but I use newer versions.
>>
>>> Probably an easy one, but I've never had the chance to check the CVS
>>> version (running 0.7 at the moment). I can probably give 0.8 a spin
>>> if you deem necessary. I always thought it wasn't more than a
>>> cosmetic flaw.
>>
>> It probably is. You can try the current version from my git tree at:
>> git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-utils.git
>
> I seem to recall reporting this and finding that the latest version
> fixed the bug simply by removing the code which printed the ratio :-).

Heh, maybe, but the version compiled from Rafael's git tree prints the
ratio all right, and its value is even positive and less than 1. So I
confirm that the ratio issue is fixed. I'll be running with this
version from now on, and if it exhibits the same original issue, I'll
switch to shutdown mode and gather some experience running that.
--
Thanks,
Feri.

2009-10-31 19:09:46

by Rafael J. Wysocki

Subject: Re: [linux-pm] intermittent suspend problem again

On Saturday 31 October 2009, Ferenc Wagner wrote:
> Alan Jenkins <[email protected]> writes:
>
> > On 10/30/09, Rafael J. Wysocki <[email protected]> wrote:
> >> On Friday 30 October 2009, Ferenc Wagner wrote:
> >>> "Rafael J. Wysocki" <[email protected]> writes:
> >>>
> >>>> On Thursday 29 October 2009, Ferenc Wagner wrote:
> >>>>
> >>>>> and saving the image as well. If s2disk didn't report funny huge
> >>>>> negative ratios all the time,
> >>>>
> >>>> Hmm. This looks like a bug in s2disk.
> >>>
> >>> Definitely. Do you also experience this?
> >>
> >> Not really, but I use newer versions.
> >>
> >>> Probably an easy one, but I've never had the chance to check the CVS
> >>> version (running 0.7 at the moment). I can probably give 0.8 a spin
> >>> if you deem necessary. I always thought it wasn't more than a
> >>> cosmetic flaw.
> >>
> >> It probably is. You can try the current version from my git tree at:
> >> git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-utils.git
> >
> > I seem to recall reporting this and finding that the latest version
> > fixed the bug simply by removing the code which printed the ratio :-).
>
> Heh, maybe, but the version compiled from Rafael's git tree prints the
> ratio all right, and its value is even positive and less than 1. So I
> confirm that the ratio issue is fixed. I'll be running with this
> version from now on, and if it exhibits the same original issue, I'll
> switch to shutdown mode and gather some experience running that.

Well, the problem you reported is a kernel issue and switching to the newer
user space is not likely to help.

Best,
Rafael

2009-11-01 21:53:41

by Ferenc Wagner

Subject: Re: [linux-pm] intermittent suspend problem again

"Rafael J. Wysocki" <[email protected]> writes:

> On Saturday 31 October 2009, Ferenc Wagner wrote:
>
>> Heh, maybe, but the version compiled from Rafael's git tree prints the
>> ratio all right, and its value is even positive and less than 1. So I
>> confirm that the ratio issue is fixed. I'll be running with this
>> version from now on, and if it exhibits the same original issue, I'll
>> switch to shutdown mode and gather some experience running that.
>
> Well, the problem you reported is a kernel issue and switching to the newer
> user space is not likely to help.

Well, I haven't had my hopes overly high, but wanted to have a concrete
baseline. So: with the new s2disk I got a freeze again after S|. After
a manual power off and a successful resume, I switched to shutdown mode
and hibernated again, and got the exact same freeze (apart from a
slightly different image size). Power off, resume, switch to reboot
mode, hibernate, and this worked. Switch back to shutdown, now that
worked as well... Eh. In my earlier bug report I think I noted that
after such a hibernation failure a straight shutdown didn't power off
the computer as it otherwise does, which feels consistent with the
above.

With the uswsusp 0.7, a typical freeze looked like this:

s2disk: Snapshotting system
s2disk: System snapshot ready. Preparing to write
s2disk: Image size: 240872 kilobytes
s2disk: Free swap: 1333596 kilobytes
s2disk: Saving 60217 image data pages (press backspace to abort) ... 100% done (60217 pages)
s2disk: Compression ratio -63208.85
S|

With the new version the ratio is 0.42 with similar numbers, which
sounds sane at least. However, 60217 * 4 = 240868 = 240872 - 4, wasn't
the number of saved pages one off? The new version seems to get this
right, though.
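
The page arithmetic above can be checked with a short sketch. It assumes
4 kB pages, which is typical for x86 but is not stated anywhere in this
thread; the function name is made up for illustration.

```python
# Assumption: 4 kB pages (typical for x86); the thread does not state the page size.
PAGE_SIZE_KB = 4


def image_size_gap_kb(image_size_kb: int, saved_pages: int) -> int:
    """Return how many kilobytes of the reported image size are not
    accounted for by the reported number of saved pages."""
    return image_size_kb - saved_pages * PAGE_SIZE_KB


# Figures taken from the uswsusp 0.7 console output quoted above.
gap = image_size_gap_kb(240872, 60217)
print(gap, "kB unaccounted for =", gap // PAGE_SIZE_KB, "page(s)")
```

With the quoted figures the gap is exactly one page, which is what the
off-by-one suspicion amounts to.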

Now something else, which may or may not be related. I supervise a
computing farm running a very old OS: Debian Sarge. The kernel was
somewhat newer: 2.6.24 until recently, when new machines arrived at the
lab, which couldn't boot that 2.6.24 kernel. So I upgraded to 2.6.31,
which works quite well, apart from one thing: halt doesn't power off the
machines anymore. All the same under 2.6.32-rc5: they simply freeze
after reaching halt -d -f -i -h -p in the shutdown sequence.

I'm pretty much stumped here, but will try to get some SysRq dumps out
of these machines at least.
--
Thanks,
Feri.

2009-11-03 11:03:18

by Ferenc Wagner

Subject: Re: [linux-pm] intermittent suspend problem again

Ferenc Wagner <[email protected]> writes:

A)

> So: with the new s2disk I got a freeze again after S|. After a manual
> power off and a successful resume, I switched to shutdown mode and
> hibernated again, and got the exact same freeze (apart from a slightly
> different image size). Power off, resume, switch to reboot mode,
> hibernate, and this worked. Switch back to shutdown, now that worked
> as well... Eh. In my earlier bug report I think I noted that after
> such a hibernation failure a straight shutdown didn't power off the
> computer as it otherwise does, which feels consistent with the above.

B)

> Now something else, which may or may not be related. I supervise a
> computing farm running a very old OS: Debian Sarge. The kernel was
> somewhat newer: 2.6.24 until recently, when new machines arrived at the
> lab, which couldn't boot that 2.6.24 kernel. So I upgraded to 2.6.31,
> which works quite well, apart from one thing: halt doesn't power off the
> machines anymore. All the same under 2.6.32-rc5: they simply freeze
> after reaching halt -d -f -i -h -p in the shutdown sequence.

Hi,

so now I've got three ACPI and/or PM related problems: A) and B) above
with hibernation and halt/poweroff, and the C) graphics and suspend
related described at http://bugs.freedesktop.org/show_bug.cgi?id=22126
(which turned out to be ACPI related recently).

B) and C) are 100% reproducible, although the machines exhibiting C) are
at a different geographic location, so I have to bug a remote admin for
power switching.

A) and C) are exhibited by my laptop.

I'd be grateful for debugging tips for any of the above. While A) is
not reproducible, it happens often enough with the platform method (I
haven't got enough data with the shutdown method yet). I'm willing to
recompile kernels, read up on documentation and code and have parallel
port LEDs handy (for the laptop). But I've got no experience with ACPI
or PM, sadly. However, 2.6.32 is currently nominated as more than a
couple of distros' choice for long term stable support, so I'm willing
to invest substantial effort into fixing these issues.

If anybody has debugging tips, please share (and note if further
questions aren't welcome).
--
Thanks,
Feri.

2009-11-11 11:30:18

by Ferenc Wagner

Subject: Re: [linux-pm] intermittent suspend problem again

"Rafael J. Wysocki" <[email protected]> writes:

> On Thursday 29 October 2009, Ferenc Wagner wrote:
>
>> "Rafael J. Wysocki" <[email protected]> writes:
>>
>>> On Wednesday 28 October 2009, Ferenc Wagner wrote:
>>>
>>>> 2.6.32-rc5 feels particularly bad, with frequent failures to switch
>>>> off the machine after "S|" or freezes after "Snapshotting system".
>>>> The former does not cause much trouble in itself, as the machine can
>>>> be switched off and resumed all right, but the latter is nasty.
>>>> Suspend to RAM works all the time. The issue is not reproducible,
>>>> unfortunately, and the kernel change happened almost together with a
>>>> BIOS upgrade. Yesterday I switched back to 2.6.31 to see whether it
>>>> still works stably with the new BIOS. I'll report back my findings in
>>>> a couple of days.
>>>
>>> OK, thanks.
>>>
>>> Still, I'm really afraid we won't be able to debug it any further without a
>>> reproducible test case.
>>
>> Can't you perhaps suggest a way forward there? Or some tricks to create a
>> reproducible test case here?
>
> Well, you can test if the problem is reproducible in the "shutdown" mode of
> hibernation.

Well, both failure modes happen with "shutdown" mode as well (the S|
freeze with yesterday's git, too), but still not reproducibly. When
s2disk is stuck in "Snapshotting system", the system is not completely
dead, it echoes line feeds and Ctrl-C at least (as added to #14504).

I wonder what you would do if the issue were reproducible... Is that totally
inapplicable if the problem happens with 10% probability only? Slow,
sure, but until I manage to set up an automated testing bench...
--
Thanks,
Feri.

2009-11-11 11:36:51

by Rafael J. Wysocki

Subject: Re: [linux-pm] intermittent suspend problem again

On Wednesday 11 November 2009, Ferenc Wagner wrote:
> "Rafael J. Wysocki" <[email protected]> writes:
>
> > On Thursday 29 October 2009, Ferenc Wagner wrote:
> >
> >> "Rafael J. Wysocki" <[email protected]> writes:
> >>
> >>> On Wednesday 28 October 2009, Ferenc Wagner wrote:
> >>>
> >>>> 2.6.32-rc5 feels particularly bad, with frequent failures to switch
> >>>> off the machine after "S|" or freezes after "Snapshotting system".
> >>>> The former does not cause much trouble in itself, as the machine can
> >>>> be switched off and resumed all right, but the latter is nasty.
> >>>> Suspend to RAM works all the time. The issue is not reproducible,
> >>>> unfortunately, and the kernel change happened almost together with a
> >>>> BIOS upgrade. Yesterday I switched back to 2.6.31 to see whether it
> >>>> still works stably with the new BIOS. I'll report back my findings in
> >>>> a couple of days.
> >>>
> >>> OK, thanks.
> >>>
> >>> Still, I'm really afraid we won't be able to debug it any further without a
> >>> reproducible test case.
> >>
> >> Can't you perhaps suggest a way forward there? Or some tricks to create a
> >> reproducible test case here?
> >
> > Well, you can test if the problem is reproducible in the "shutdown" mode of
> > hibernation.
>
> Well, both failure modes happen with "shutdown" mode as well (the S|
> freeze with yesterday's git, too), but still not reproducibly. When
> s2disk is stuck in "Snapshotting system", the system is not completely
> dead, it echoes line feeds and Ctrl-C at least (as added to #14504).
>
> I wonder what you would do if the issue were reproducible... Is that totally
> inapplicable if the problem happens with 10% probability only? Slow,
> sure, but until I manage to set up an automated testing bench...

I would try to identify the commit that made the problem appear using git
bisection. However, this is really difficult with problems that are not
reliably reproducible.
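
To give a rough idea of why bisection gets slow here, the sketch below
estimates how many clean hibernation cycles are needed before a kernel
can be marked "good" at each bisection step. It assumes independent
cycles and a constant failure rate, taking the 10% figure quoted above;
the 5% tolerance and the function name are illustrative choices, not
something from this thread.

```python
import math


def runs_needed(failure_rate: float, max_false_good: float) -> int:
    """Smallest number of consecutive clean cycles after which the chance
    that a bad kernel survived all of them is at most max_false_good.

    Assumes independent cycles with a constant per-cycle failure rate.
    """
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < max_false_good < 1.0:
        raise ValueError("max_false_good must be strictly between 0 and 1")
    # A bad kernel passes n cycles with probability (1 - failure_rate) ** n.
    return math.ceil(math.log(max_false_good) / math.log(1.0 - failure_rate))


print(runs_needed(0.10, 0.05))  # cycles per bisection step under these assumptions
```

Under these assumptions each bisection step needs about 29 clean cycles,
and a single wrong "good" verdict sends the bisection down the wrong
branch.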

Failing that, I would add some instrumentation to the code to identify the
exact place where it hangs.

BTW, did you carry out the /sys/power/pm_test "core" test on the box?

Rafael

2009-11-11 13:29:45

by Ferenc Wagner

Subject: Re: [linux-pm] intermittent suspend problem again

"Rafael J. Wysocki" <[email protected]> writes:

> On Wednesday 11 November 2009, Ferenc Wagner wrote:
>
>> "Rafael J. Wysocki" <[email protected]> writes:
>>
>>> On Thursday 29 October 2009, Ferenc Wagner wrote:
>>>
>>>> "Rafael J. Wysocki" <[email protected]> writes:
>>>>
>>>>> On Wednesday 28 October 2009, Ferenc Wagner wrote:
>>>>>
>>>>>> 2.6.32-rc5 feels particularly bad, with frequent failures to switch
>>>>>> off the machine after "S|" or freezes after "Snapshotting system".
>>>>>> The former does not cause much trouble in itself, as the machine can
>>>>>> be switched off and resumed all right, but the latter is nasty.
>>>>>> Suspend to RAM works all the time. The issue is not reproducible,
>>>>>> unfortunately, and the kernel change happened almost together with a
>>>>>> BIOS upgrade. Yesterday I switched back to 2.6.31 to see whether it
>>>>>> still works stably with the new BIOS. I'll report back my findings in
>>>>>> a couple of days.
>>>>>
>>>>> OK, thanks.
>>>>>
>>>>> Still, I'm really afraid we won't be able to debug it any further without a
>>>>> reproducible test case.
>>>>
>>>> Can't you perhaps suggest a way forward there? Or some tricks to create a
>>>> reproducible test case here?
>>>
>>> Well, you can test if the problem is reproducible in the "shutdown" mode of
>>> hibernation.
>>
>> Well, both failure modes happen with "shutdown" mode as well (the S|
>> freeze with yesterday's git, too), but still not reproducibly. When
>> s2disk is stuck in "Snapshotting system", the system is not completely
>> dead, it echoes line feeds and Ctrl-C at least (as added to #14504).
>>
>> I wonder what you would do if the issue were reproducible... Is that totally
>> inapplicable if the problem happens with 10% probability only? Slow,
>> sure, but until I manage to set up an automated testing bench...
>
> I would try to identify the commit that made the problem appear using git
> bisection. However, this is really difficult with problems that are not
> reliably reproducible.

Indeed. I'm thinking about setting up a script that does nothing but
hibernate the laptop in a loop, and getting my router to provide a constant
stream of WOL packets to restart it. If it always freezes in bounded
time that will make bisecting possible, if slow.
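
The sending side of such a setup could look roughly like the sketch
below. It builds the standard Wake-on-LAN magic packet (six 0xFF bytes
followed by the target MAC address repeated 16 times) and broadcasts it
over UDP. The MAC address, broadcast address and port are placeholders,
and it assumes the laptop's NIC is configured to wake the machine from
the powered-off state.

```python
import socket


def magic_packet(mac: str) -> bytes:
    """Build a Wake-on-LAN magic packet: 6 bytes of 0xFF followed by the
    6-byte target MAC address repeated 16 times."""
    raw = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(raw) != 6:
        raise ValueError("MAC address must be exactly 6 bytes")
    return b"\xff" * 6 + raw * 16


def send_wol(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the magic packet over UDP.

    Port 9 is a common convention for Wake-on-LAN, not a requirement.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(magic_packet(mac), (broadcast, port))
```

Calling send_wol() with the laptop's MAC address every few seconds from
the router would provide the constant stream of packets.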

> Failing that, I would add some instrumentation to the code to identify the
> exact place where it hangs.

I managed to achieve this with my STR problem, see
http://bugs.freedesktop.org/show_bug.cgi?id=22126#c17, but maybe that
status = acpi_evaluate_object(NULL, METHOD_NAME__PTS, &arg_list, NULL);
wasn't deep enough, as it got no followup. How deep should one go to be
useful?

I can probably do so again, if slower; but this case may also be easier
if I can depend on working console output. Which are the interesting
parts for instrumentation? Can those parts produce console output to
VGA or netconsole? Wouldn't switching on ACPI debugging before invoking
s2disk be useful? Which parts of it (to avoid it spitting out MBs of
useless characters)?

> BTW, did you carry out the /sys/power/pm_test "core" test on the box?

I'm not clear on how to do that with user space suspend. Simply set it
to "core" before invoking s2disk? I already did the test for STR (see
http://bugs.freedesktop.org/show_bug.cgi?id=22126#c3), but will redo
with the current kernel tonight.
--
Thanks,
Feri.

2009-11-11 14:46:19

by Rafael J. Wysocki

Subject: Re: [linux-pm] intermittent suspend problem again

On Wednesday 11 November 2009, Ferenc Wagner wrote:
> "Rafael J. Wysocki" <[email protected]> writes:
>
> > On Wednesday 11 November 2009, Ferenc Wagner wrote:
> >
> >> "Rafael J. Wysocki" <[email protected]> writes:
> >>
> >>> On Thursday 29 October 2009, Ferenc Wagner wrote:
> >>>
> >>>> "Rafael J. Wysocki" <[email protected]> writes:
> >>>>
> >>>>> On Wednesday 28 October 2009, Ferenc Wagner wrote:
> >>>>>
> >>>>>> 2.6.32-rc5 feels particularly bad, with frequent failures to switch
> >>>>>> off the machine after "S|" or freezes after "Snapshotting system".
> >>>>>> The former does not cause much trouble in itself, as the machine can
> >>>>>> be switched off and resumed all right, but the latter is nasty.
> >>>>>> Suspend to RAM works all the time. The issue is not reproducible,
> >>>>>> unfortunately, and the kernel change happened almost together with a
> >>>>>> BIOS upgrade. Yesterday I switched back to 2.6.31 to see whether it
> >>>>>> still works stably with the new BIOS. I'll report back my findings in
> >>>>>> a couple of days.
> >>>>>
> >>>>> OK, thanks.
> >>>>>
> >>>>> Still, I'm really afraid we won't be able to debug it any further without a
> >>>>> reproducible test case.
> >>>>
> >>>> Can't you perhaps suggest a way forward there? Or some tricks to create a
> >>>> reproducible test case here?
> >>>
> >>> Well, you can test if the problem is reproducible in the "shutdown" mode of
> >>> hibernation.
> >>
> >> Well, both failure modes happen with "shutdown" mode as well (the S|
> >> freeze with yesterday's git, too), but still not reproducibly. When
> >> s2disk is stuck in "Snapshotting system", the system is not completely
> >> dead, it echoes line feeds and Ctrl-C at least (as added to #14504).
> >>
> >> I wonder what you would do if the issue were reproducible... Is that totally
> >> inapplicable if the problem happens with 10% probability only? Slow,
> >> sure, but until I manage to set up an automated testing bench...
> >
> > I would try to identify the commit that made the problem appear using git
> > bisection. However, this is really difficult with problems that are not
> > reliably reproducible.
>
> Indeed. I'm thinking about setting up a script that does nothing but
> hibernate the laptop in a loop, and getting my router to provide a constant
> stream of WOL packets to restart it. If it always freezes in bounded
> time that will make bisecting possible, if slow.

Alternatively, you can use the RTC alarm to wake up the machine.
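
A minimal sketch of the RTC approach, assuming the RTC driver exposes the
sysfs wakealarm file under rtc0 and that the RTC keeps UTC (if it keeps
local time the value needs adjusting). The function name, the path
default and the delay are illustrative only.

```python
import time
from pathlib import Path
from typing import Optional

# Assumption: the first RTC exposes the sysfs wakealarm attribute.
WAKEALARM = Path("/sys/class/rtc/rtc0/wakealarm")


def set_rtc_wakeup(delay_s: int, path: Path = WAKEALARM,
                   now: Optional[float] = None) -> int:
    """Arm the RTC alarm delay_s seconds from now (needs root on a real
    system) and return the value that was written."""
    wake_at = int(time.time() if now is None else now) + delay_s
    path.write_text("0\n")            # clear any alarm that is already armed
    path.write_text(f"{wake_at}\n")   # absolute time, seconds since the epoch
    return wake_at
```

In a test loop this would be called right before each s2disk run, with a
delay long enough for the image to be written and the machine to power
off.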

> > Failing that, I would add some instrumentation to the code to identify the
> > exact place where it hangs.
>
> I managed to achieve this with my STR problem, see
> http://bugs.freedesktop.org/show_bug.cgi?id=22126#c17, but maybe that
> status = acpi_evaluate_object(NULL, METHOD_NAME__PTS, &arg_list, NULL);
> wasn't deep enough, as it got no followup. How deep should one go to be
> useful?

No, this is deep enough and indicates a BIOS issue.

> I can probably do so again, if slower; but this case may also be easier
> if I can depend on working console output. Which are the interesting
> parts for instrumentation? Can those parts produce console output to
> VGA or netconsole? Wouldn't switching on ACPI debugging before invoking
> s2disk be useful? Which parts of it (to avoid it spitting out MBs of
> useless characters)?

I usually don't do that and if the issue is reproducible in the "shutdown"
mode, ACPI is most probably not involved.

> > BTW, did you carry out the /sys/power/pm_test "core" test on the box?
>
> I'm not clear on how to do that with user space suspend. Simply set it
> to "core" before invoking s2disk?

Yes, echo "core" to /sys/power/pm_test before executing s2disk.
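
For repeated runs, the same step can be scripted. This is only a sketch: it
assumes a kernel built with PM debugging support, so that /sys/power/pm_test
exists and lists its levels with the active one in square brackets. The
helper names are invented for the example.

```python
from pathlib import Path

PM_TEST = Path("/sys/power/pm_test")

# Test levels accepted by the pm_test interface.
LEVELS = ("none", "core", "processors", "platform", "devices", "freezer")


def current_level(text):
    """Return the active level, i.e. the bracketed word in the file contents."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


def set_pm_test(level, path=PM_TEST):
    """Select a test level; a misspelled level is rejected before writing."""
    if level not in LEVELS:
        raise ValueError(f"unknown pm_test level: {level!r}")
    path.write_text(level + "\n")


if __name__ == "__main__":
    set_pm_test("core")
    print(current_level(PM_TEST.read_text()))
    # Run s2disk here, then restore normal behaviour with set_pm_test("none").
```

Writing "none" back afterwards matters, because the setting persists until
it is changed and would otherwise turn the next real hibernation into a test
run.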

> I already did the test for STR (see
> http://bugs.freedesktop.org/show_bug.cgi?id=22126#c3), but will redo
> with the current kernel tonight.

OK, thanks.

Best,
Rafael

2009-11-13 16:36:43

by Ferenc Wagner

Subject: Re: [linux-pm] intermittent suspend problem again

"Rafael J. Wysocki" <[email protected]> writes:

> Yes, echo "core" to /sys/power/pm_test before executing s2disk.

It snapshots the system and returns, producing the same console output
as s2ram (is this the expected behaviour?) I ran this several times in
a loop, and experienced no problems at all. Maybe it depends on the
amount of memory used... I saw a freeze saying "99% done" (ie. not
100%), btw. Are other pm_test values meaningful with s2disk? Is this
handled explicitly in s2disk, or does simply the kernel act as if it was
resumed instead of providing the system image after SNAPSHOT_CREATE_IMAGE?

> On Wednesday 11 November 2009, Ferenc Wagner wrote:
>
>> I already did the test for STR (see
>> http://bugs.freedesktop.org/show_bug.cgi?id=22126#c3), but will redo
>> with the current kernel tonight.
>
> OK, thanks.

No change on this front, FWIW. But rc7 is out now, I'll test again.
--
Thanks,
Feri.

2009-11-13 19:58:59

by Rafael J. Wysocki

Subject: Re: [linux-pm] intermittent suspend problem again

On Friday 13 November 2009, Ferenc Wagner wrote:
> "Rafael J. Wysocki" <[email protected]> writes:
>
> > Yes, echo "core" to /sys/power/pm_test before executing s2disk.
>
> It snapshots the system and returns, producing the same console output
> as s2ram (is this the expected behaviour?) I ran this several times in
> a loop, and experienced no problems at all. Maybe it depends on the
> amount of memory used... I saw a freeze saying "99% done" (ie. not
> 100%), btw.

The number is not always accurate because of rounding errors. I think we can
safely assume that it always happens after the entire image has been written.

> Are other pm_test values meaningful with s2disk? Is this
> handled explicitly in s2disk, or does simply the kernel act as if it was
> resumed instead of providing the system image after SNAPSHOT_CREATE_IMAGE?

The latter.

> > On Wednesday 11 November 2009, Ferenc Wagner wrote:
> >
> >> I already did the test for STR (see
> >> http://bugs.freedesktop.org/show_bug.cgi?id=22126#c3), but will redo
> >> with the current kernel tonight.
> >
> > OK, thanks.
>
> No change on this front, FWIW. But rc7 is out now, I'll test again.

Not sure if that's going to work, but yes please test it.

Thanks,
Rafael

2009-11-14 01:50:57

by Ferenc Wagner

Subject: Re: [linux-pm] intermittent suspend problem again

"Rafael J. Wysocki" <[email protected]> writes:

> On Friday 13 November 2009, Ferenc Wagner wrote:
>> "Rafael J. Wysocki" <[email protected]> writes:
>>
>>> Yes, echo "core" to /sys/power/pm_test before executing s2disk.
>>
>> It snapshots the system and returns, producing the same console output
>> as s2ram (is this the expected behaviour?) I ran this several times in
>> a loop, and experienced no problems at all. Maybe it depends on the
>> amount of memory used... I saw a freeze saying "99% done" (ie. not
>> 100%), btw.
>
> The number is not always accurate because of rounding errors. I think we can
> safely assume that it always happens after the entire image has been written.

Probably, "done" isn't output otherwise.

>> Are other pm_test values meaningful with s2disk? Is this
>> handled explicitly in s2disk, or does simply the kernel act as if it was
>> resumed instead of providing the system image after SNAPSHOT_CREATE_IMAGE?
>
> The latter.

Ok, I found the code. Are other pm_test values meaningful, or possibly
harmful? I think I tried freezer, which resulted in a seemingly perfect
suspend, but the machine didn't try to resume afterwards, but booted
normally instead...

>>> On Wednesday 11 November 2009, Ferenc Wagner wrote:
>>>
>>>> I already did the test for STR (see
>>>> http://bugs.freedesktop.org/show_bug.cgi?id=22126#c3), but will redo
>>>> with the current kernel tonight.
>>>
>>> OK, thanks.
>>
>> No change on this front, FWIW. But rc7 is out now, I'll test again.
>
> Not sure if that's going to work, but yes please test it.

The KMS related STR freeze (evaluating the _PTS method) is still there.
I'm continuing testing s2disk with the platform method under rc7 (with
some instrumentation added).

Btw, s2ram -f works fine otherwise (no KMS), and my machine is not in
the whitelist. I'm not sure whether the KMS problem disqualifies it
(shall I report it to suspend-devel?), but it can be identified by:
sys_vendor = "IBM"
sys_product = "1834S5G"
sys_version = "ThinkPad R50e"
bios_version = "1WET90WW (2.10 )"
--
Thanks,
Feri.

2009-11-14 18:51:26

by Rafael J. Wysocki

Subject: Re: [linux-pm] intermittent suspend problem again

On Saturday 14 November 2009, Ferenc Wagner wrote:
> "Rafael J. Wysocki" <[email protected]> writes:
>
> > On Friday 13 November 2009, Ferenc Wagner wrote:
> >> "Rafael J. Wysocki" <[email protected]> writes:
> >>
> >>> Yes, echo "core" to /sys/power/pm_test before executing s2disk.
> >>
> >> It snapshots the system and returns, producing the same console output
> >> as s2ram (is this the expected behaviour?) I ran this several times in
> >> a loop, and experienced no problems at all. Maybe it depends on the
> >> amount of memory used... I saw a freeze saying "99% done" (ie. not
> >> 100%), btw.
> >
> > The number is not always accurate because of rounding errors. I think we can
> > safely assume that it always happens after the entire image has been written.
>
> Probably, "done" isn't output otherwise.
>
> >> Are other pm_test values meaningful with s2disk? Is this
> >> handled explicitly in s2disk, or does simply the kernel act as if it was
> >> resumed instead of providing the system image after SNAPSHOT_CREATE_IMAGE?
> >
> > The latter.
>
> Ok, I found the code. Are other pm_test values meaningful, or possibly
> harmful?

They are supposed to work as for suspend.

> I think I tried freezer, which resulted in a seemingly perfect
> suspend, but the machine didn't try to resume afterwards, but booted
> normally instead...

So this sounds like there's a bug (will check).

> >>> On Wednesday 11 November 2009, Ferenc Wagner wrote:
> >>>
> >>>> I already did the test for STR (see
> >>>> http://bugs.freedesktop.org/show_bug.cgi?id=22126#c3), but will redo
> >>>> with the current kernel tonight.
> >>>
> >>> OK, thanks.
> >>
> >> No change on this front, FWIW. But rc7 is out now, I'll test again.
> >
> > Not sure if that's going to work, but yes please test it.
>
> The KMS related STR freeze (evaluating the _PTS method) is still there.
> I'm continuing testing s2disk with the platform method under rc7 (with
> some instrumentation added).
>
> Btw, s2ram -f works fine otherwise (no KMS), and my machine is not in
> the whitelist. I'm not sure whether the KMS problem disqualifies it

No, it doesn't.

> (shall I report it to suspend-devel?),

Yes, please.

> but it can be identified by:
> sys_vendor = "IBM"
> sys_product = "1834S5G"
> sys_version = "ThinkPad R50e"
> bios_version = "1WET90WW (2.10 )"

Thanks,
Rafael