2014-07-16 01:51:43

by Markus Gutschke

Subject: 3.16-rcX crashes on resume from Suspend-To-RAM

My Dell M4400 has been pretty well supported by Linux for a couple of
years now, but recent 3.16-rcX kernels cause hard crashes when resuming
from Suspend-to-RAM.

This is tricky to debug, as device drivers are not yet restored by the
time that the crash happens. So, I can't use Page-Up to scroll the
screen and see the full crash information. I also cannot use the
netconsole; the ethernet device is still suspended. For similar
reasons, crash kernels don't seem to work either.

After about a day of false starts and a lengthy bisecting session, I
finally narrowed things down to this change list:

eec15edbb0e14485998635ea7c62e30911b465f0 is the first bad commit
commit eec15edbb0e14485998635ea7c62e30911b465f0
Author: Zhang Rui <[email protected]>
Date: Fri May 30 04:23:01 2014 +0200

ACPI / PNP: use device ID list for PNPACPI device enumeration

ACPI can be used to enumerate PNP devices, but the code does not
handle this in the right way currently. Namely, if an ACPI device
object
1. Has a _CRS method,
2. Has an identification of
"three capital characters followed by four hex digits",
3. Is not in the excluded IDs list,
it will be enumerated to PNP bus (that is, a PNP device object will
be create for it). This means that, actually, the PNP bus type is
used as the default bus type for enumerating _HID devices in ACPI.

However, more and more _HID devices need to be enumerated to the
platform bus instead (that is, platform device objects need to be
created for them). As a result, the device ID list in acpi_platform.c
is used to enforce creating platform device objects rather than PNP
device objects for matching devices. That list has been continuously
growing recently, unfortunately, and it is pretty much guaranteed to
grow even more in the future.

To address that problem it is better to enumerate _HID devices
as platform devices by default. To this end, change the way of
enumerating PNP devices by adding a PNP ACPI scan handler that
will use a device ID list to create PNP devices for the ACPI
device objects whose device IDs are present in that list.

The initial device ID list in the PNP ACPI scan handler contains
all of the pnp_device_id strings from all the existing PNP drivers,
so this change should be transparent to the PNP core and all of the
PNP drivers. Still, in the future it should be possible to reduce
its size by converting PNP drivers that need not be PNP for any
technical reasons into platform drivers.

Signed-off-by: Zhang Rui <[email protected]>
[rjw: Rewrote the changelog, modified the PNP ACPI scan handler code]
Signed-off-by: Rafael J. Wysocki <[email protected]>
Reviewed-by: Mika Westerberg <[email protected]>

:040000 040000 b7c07232aa46ae7b6faf9a907fb7274a02e4680f c2e05b31a61dccd087c554adecc89a43a1ed81f7 M drivers
:040000 040000 4eda970292fffbeebe167f9210502527df4e8ab4 21e9e6fd84c780a34bf3d48b5e7618b551da3b1a M include

I took a photo of the crash. It feels silly to do, but I couldn't
think of a better solution. You can find it at
https://drive.google.com/file/d/0B8SxqKDe4hyheTlTLXY2YThkMXM

As I mentioned earlier, a bunch of information has already scrolled
off the screen, but hopefully what is visible is somewhat helpful.

I will have only limited internet access for the next couple of weeks.
But I wanted to make sure I at least got the result of the bisection out
to LKML. I will make every effort to collect additional data, if
asked to do so; but some of it might be delayed for a little bit,
until I can get access to reasonably powerful hardware or reasonably
fast internet.


Markus

P.S.: Please keep me cc'd on all responses, as I am not subscribed to
the firehose that is LKML.


2014-07-17 06:50:54

by Markus Gutschke

Subject: Re: 3.16-rcX crashes on resume from Suspend-To-RAM

Adding the reviewers of the faulty change list to the cc list for this
e-mail. I hope that is considered proper etiquette for the LKML.


2014-07-17 08:58:25

by Zhang, Rui

Subject: Re: 3.16-rcX crashes on resume from Suspend-To-RAM

Hi, Markus,

Can you please attach
1. the acpidump output
2. dmesg output after boot in 3.16-rc
3. the output of
   a) "grep . /sys/bus/pnp/devices/*/firmware_node/*"
   b) "grep . /sys/bus/pnp/devices/*/*"
   c) "grep . /sys/bus/platform/devices/*/firmware_node/*"
   d) "grep . /sys/bus/platform/devices/*/*"
BOTH before and after this commit?
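
The four listings in item 3 could be captured in one pass with something
like the sketch below. The function name collect_sysfs, its arguments, and
the output file names are illustrative assumptions, not part of the request;
the grep commands themselves are the ones listed above.

```shell
#!/bin/sh
# Sketch only: save the four sysfs listings under a common tag.
# Usage: collect_sysfs TAG [OUTDIR] [SYSROOT]
#   TAG      label for this kernel, e.g. "before" or "after" (assumed naming)
#   OUTDIR   directory for the .txt files (default: current directory)
#   SYSROOT  sysfs mount point (default: /sys); overridable for dry runs
collect_sysfs() {
    tag=$1
    outdir=${2:-.}
    sysroot=${3:-/sys}
    for bus in pnp platform; do
        # Attributes of each device's ACPI companion (firmware_node).
        grep . "$sysroot"/bus/$bus/devices/*/firmware_node/* \
            > "$outdir/$tag-$bus-devices-firmware-node.txt" 2>/dev/null || true
        # Attributes of the device itself. grep complains about directories
        # and unreadable files on stderr, which is discarded here.
        grep . "$sysroot"/bus/$bus/devices/*/* \
            > "$outdir/$tag-$bus-devices.txt" 2>/dev/null || true
    done
    return 0
}
```

Calling "collect_sysfs before" on a kernel without the commit and
"collect_sysfs after" on one with it leaves eight files behind; running
"diff -u before-pnp-devices.txt after-pnp-devices.txt" on a pair should
then show which devices changed bus.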

thanks,
rui


2014-07-22 00:22:40

by Zhang, Rui

Subject: Re: 3.16-rcX crashes on resume from Suspend-To-RAM

On Thu, 2014-07-17 at 10:27 -0700, Markus Gutschke wrote:
> Please note the crash in "dmesg" right after booting. This looks relevant:
>
I think the same crash log also appears with the 3.15 kernel. Can you
please verify this?

> https://medusa.gutschke.com/markus/acpi/after-dmesg.txt
> https://medusa.gutschke.com/markus/acpi/acpidump.txt
> https://medusa.gutschke.com/markus/acpi/before-platform-devices-firmware-node.txt
> https://medusa.gutschke.com/markus/acpi/before-platform-devices.txt
> https://medusa.gutschke.com/markus/acpi/before-pnp-devices-firmware-node.txt
> https://medusa.gutschke.com/markus/acpi/before-pnp-devices.txt
> https://medusa.gutschke.com/markus/acpi/after-platform-devices-firmware-node.txt
> https://medusa.gutschke.com/markus/acpi/after-platform-devices.txt
> https://medusa.gutschke.com/markus/acpi/after-pnp-devices-firmware-node.txt
> https://medusa.gutschke.com/markus/acpi/after-pnp-devices.txt

Are you building the kernel with
eec15edbb0e14485998635ea7c62e30911b465f0 as the last commit?

If yes, please rebuild your 3.16 kernel with
b04c58b1ed26317bfb4b33d3a2d16377fc6acd0f as the last commit, and
re-attach the after-* logs.

Please also apply the attached debug patch on top of commit
b04c58b1ed26317bfb4b33d3a2d16377fc6acd0f, and see if the problem still
exists.

thanks,
rui


Attachments:
0001-debug-patch.patch (638.00 B)

2014-07-26 15:52:56

by Markus Gutschke

Subject: Re: 3.16-rcX crashes on resume from Suspend-To-RAM

Sorry for the delay. Remotely debugging kernels over a shared and
flaky 1MBps terrestrial wireless connection is quite a new experience
for me.

In any case, I was able to collect all the data that you asked for. I
then used "pm-suspend" to put the machine to sleep and asked a helper
to physically press the power button to wake the computer back up. My
helper told me that it crashed just as before, and they had to
power-cycle the machine to bring it back to life.

Please let me know what other data I can get for you. And thank you
very much for putting up with my slow turn-around. I should have much
better response time again in about two to three weeks when I return
to civilization.

# Startup log file for stock 3.15 kernel
https://medusa.gutschke.com/markus/3.15-dmesg.txt

# Startup log file for b04c58b1ed26317bfb4b33d3a2d16377fc6acd0f
https://medusa.gutschke.com/markus/3.15-rc8-dmesg.txt

# /sys/* files for b04c58b1ed26317bfb4b33d3a2d16377fc6acd0f
https://medusa.gutschke.com/markus/3.15-rc8-platform-devices-firmware-node.txt
https://medusa.gutschke.com/markus/3.15-rc8-platform-devices.txt
https://medusa.gutschke.com/markus/3.15-rc8-pnp-devices-firmware-node.txt
https://medusa.gutschke.com/markus/3.15-rc8-pnp-devices.txt

# Startup log file and /sys/* files for b04c58b1ed26317bfb4b33d3a2d16377fc6acd0f with patch applied
https://medusa.gutschke.com/markus/3.15-rc8-patched-dmesg.txt
https://medusa.gutschke.com/markus/3.15-rc8-patched-platform-devices-firmware-node.txt
https://medusa.gutschke.com/markus/3.15-rc8-patched-platform-devices.txt
https://medusa.gutschke.com/markus/3.15-rc8-patched-pnp-devices-firmware-node.txt
https://medusa.gutschke.com/markus/3.15-rc8-patched-pnp-devices.txt