Hello,
My Lenovo Thinkpad X301 has failed to power off after saving the
hibernation image ever since v3.19-rc1. This is a regression since
v3.18. The regression is still present i v4.0-rc1.
The symptoms are: Hibernation proceeds as usual, writing a complete
image. But instead of powering off the laptop stays on in a "dead"
state, with fans running and the "sleep" LED blinking. The only way out
of this state is by hard poweroff (pressing the power button for 4+
seconds). The system can successfully resume after this, proving that
the hibernation was complete.
I consider the bug somewhat critical as the system will continue to draw
power until it is forcibly powered off, or the battery is completely
dead. This causes unexpected battery usage and unnecessary battery wear
if the power-off failure goes unnoticed.
I finally took the time to bisect the problem, and ended up with this
surprisingly obvious result:
da2bc1b9db3351addd293e5b82757efe1f77ed1d is the first bad commit
commit da2bc1b9db3351addd293e5b82757efe1f77ed1d
Author: Imre Deak <[email protected]>
Date: Thu Oct 23 19:23:26 2014 +0300
drm/i915: add poweroff_late handler
The suspend_late handler saves some registers and powers off the device,
so it doesn't have a big overhead. Calling it at S4 poweroff_late time
makes the power off handling identical to the S3 suspend and S4 freeze
handling, so do this for consistency.
Signed-off-by: Imre Deak <[email protected]>
Reviewed-by: Ville Syrjälä <[email protected]>
Signed-off-by: Daniel Vetter <[email protected]>
:040000 040000 367eee899c6c2b2a669e2e46f68529dad0e1f7a3 78c7571e2b18dc0fb77161b8a3e32288bd4cbee8 M drivers
The bisect log was:
# bad: [97bf6af1f928216fd6c5a66e8a57bfa95a659672] Linux 3.19-rc1
# good: [b2776bf7149bddd1f4161f14f79520f17fc1d71d] Linux 3.18
git bisect start 'v3.19-rc1' 'v3.18'
# good: [70e71ca0af244f48a5dcf56dc435243792e3a495] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
git bisect good 70e71ca0af244f48a5dcf56dc435243792e3a495
# bad: [988adfdffdd43cfd841df734664727993076d7cb] Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linux
git bisect bad 988adfdffdd43cfd841df734664727993076d7cb
# good: [e7cf773d431a63a2417902696fcc9e0ebdc83bbe] Merge tag 'usb-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
git bisect good e7cf773d431a63a2417902696fcc9e0ebdc83bbe
# bad: [1a92b7a241dcf06a92d84219b4124dcf420ae315] Merge branch 'linux-3.19' of git://anongit.freedesktop.org/git/nouveau/linux-2.6 into drm-next
git bisect bad 1a92b7a241dcf06a92d84219b4124dcf420ae315
# bad: [fd172d0c47fddff801d998e38c3efdd236ed082f] Merge tag 'drm-intel-next-2014-11-07-fixups' of git://anongit.freedesktop.org/drm-intel into drm-next
git bisect bad fd172d0c47fddff801d998e38c3efdd236ed082f
# bad: [820d2d77482810702758381808bdbb64595298e2] drm/i915/audio: pass intel_encoder on to platform specific ELD functions
git bisect bad 820d2d77482810702758381808bdbb64595298e2
# good: [11c9b6c628c646894e6ef53f92cfd33a814ee553] drm/i915: Tighting frontbuffer tracking around flips
git bisect good 11c9b6c628c646894e6ef53f92cfd33a814ee553
# good: [0b14cbd2f58199a024acbe2994bb27533c97d756] drm/i915: remove dead code from legacy suspend handler
git bisect good 0b14cbd2f58199a024acbe2994bb27533c97d756
# good: [f4a12ead50580c17c3641ac1a453e68b5a5195dd] drm/i915: remove unused restore_gtt_mappings optimization during suspend
git bisect good f4a12ead50580c17c3641ac1a453e68b5a5195dd
# bad: [aff437667b93c3d65576b02628885687c72e1b3b] drm/i915: Move flags describing VMA mappings into the VMA
git bisect bad aff437667b93c3d65576b02628885687c72e1b3b
# bad: [da2bc1b9db3351addd293e5b82757efe1f77ed1d] drm/i915: add poweroff_late handler
git bisect bad da2bc1b9db3351addd293e5b82757efe1f77ed1d
# good: [f2476ae65e6159b41168bc41c630e9fbb1d72dde] drm/i915: disable/re-enable PCI device around S4 freeze/thaw
git bisect good f2476ae65e6159b41168bc41c630e9fbb1d72dde
# good: [5e365c391aeffe8b53d6952c28a68bd5fc856390] drm/i915: sanitize suspend/resume helper function names
git bisect good 5e365c391aeffe8b53d6952c28a68bd5fc856390
My rather old system has a GM45 chip, but I would expect the bug to
show up on most (all?) i915 systems:
nemi:/tmp# lspci -vvvnns 2
00:02.0 VGA compatible controller [0300]: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller [8086:2a42] (rev 07) (prog-if 00 [VGA controller])
Subsystem: Lenovo Device [17aa:20e4]
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 28
Region 0: Memory at f0000000 (64-bit, non-prefetchable) [size=4M]
Region 2: Memory at d0000000 (64-bit, prefetchable) [size=256M]
Region 4: I/O ports at 1800 [size=8]
Expansion ROM at <unassigned> [disabled]
Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit-
Address: fee0100c Data: 41c2
Capabilities: [d0] Power Management version 3
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Kernel driver in use: i915
00:02.1 Display controller [0380]: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller [8086:2a43] (rev 07)
Subsystem: Lenovo Device [17aa:20e4]
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Region 0: Memory at f0400000 (64-bit, non-prefetchable) [size=1M]
Capabilities: [d0] Power Management version 3
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
I'll follow up with a patch reverting the buggy commit. The patch has
been successfully verified to fix the problem on top of v4.0-rc1, but
should also go into the v3.19 stable series.
Bjørn
This fixes a bug where hibernation completes, but the system
fails to power off after the image has been saved.
Bisection lead to commit da2bc1b9db33 ("drm/i915: add poweroff_late
handler") which added a .poweroff_late hook pointing to the same
function as the .freeze_late hook, without any justification or
explanation why this would be correct, except "for consistency".
The assumption that this "makes the power off handling identical to
the S3 suspend and S4 freeze handling" is completely bogus. As clearly
documented in Documentation/power/devices.txt, the poweroff* hooks
are part of the hibernation API along with the freeze* hooks. The
phases involved in hibernation are:
prepare, freeze, freeze_late, freeze_noirq, thaw_noirq, thaw_early,
thaw, complete, prepare, poweroff, poweroff_late, poweroff_noirq
With the above sequence in mind, it should be fairly obvious that you
cannot save registers and disable the device in both the freeze_late
and poweroff_late phases. Or as Documentation/power/devices.txt
explains it:
The poweroff, poweroff_late and poweroff_noirq callbacks should do essentially
the same things as the suspend, suspend_late and suspend_noirq callbacks,
respectively. The only notable difference is that they need not store the
device register values, because the registers should already have been stored
during the freeze, freeze_late or freeze_noirq phases.
The "consistency" between suspend and hibernate was already in
place, with freeze_late pointing to the same function as suspend_late.
There is no need for any poweroff_late hook in this driver.
This reverts commit da2bc1b9db33.
Fixes: da2bc1b9db33 ("drm/i915: add poweroff_late handler")
Cc: <[email protected]>
Cc: Imre Deak <[email protected]>
Cc: Ville Syrjälä <[email protected]>
Signed-off-by: Bjørn Mork <[email protected]>
---
drivers/gpu/drm/i915/i915_drv.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 8039cec71fc2..9cd695710f93 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1520,7 +1520,6 @@ static const struct dev_pm_ops i915_pm_ops = {
.thaw_early = i915_pm_resume_early,
.thaw = i915_pm_resume,
.poweroff = i915_pm_suspend,
- .poweroff_late = i915_pm_suspend_late,
.restore_early = i915_pm_resume_early,
.restore = i915_pm_resume,
--
1.7.10.4
On ti, 2015-02-24 at 15:58 +0100, Bjørn Mork wrote:
> This fixes a bug where hibernation completes, but the system
> fails to power off after the image has been saved.
>
> Bisection lead to commit da2bc1b9db33 ("drm/i915: add poweroff_late
> handler") which added a .poweroff_late hook pointing to the same
> function as the .freeze_late hook, without any justification or
> explanation why this would be correct, except "for consistency".
>
> The assumption that this "makes the power off handling identical to
> the S3 suspend and S4 freeze handling" is completely bogus. As clearly
> documented in Documentation/power/devices.txt, the poweroff* hooks
> are part of the hibernation API along with the freeze* hooks. The
> phases involved in hibernation are:
>
> prepare, freeze, freeze_late, freeze_noirq, thaw_noirq, thaw_early,
> thaw, complete, prepare, poweroff, poweroff_late, poweroff_noirq
>
> With the above sequence in mind, it should be fairly obvious that you
> cannot save registers and disable the device in both the freeze_late
> and poweroff_late phases. Or as Documentation/power/devices.txt
> explains it:
>
> The poweroff, poweroff_late and poweroff_noirq callbacks should do essentially
> the same things as the suspend, suspend_late and suspend_noirq callbacks,
> respectively. The only notable difference is that they need not store the
> device register values, because the registers should already have been stored
> during the freeze, freeze_late or freeze_noirq phases.
>
> The "consistency" between suspend and hibernate was already in
> place, with freeze_late pointing to the same function as suspend_late.
> There is no need for any poweroff_late hook in this driver.
The poweroff handlers undo the actions of the thaw handlers. As the
original commit stated saving the registers is not needed there, but
it's also not a big overhead and there should be no problem doing it. We
are planning to optimize the hibernation sequence by for example not
shutting down the display between freeze and thaw, and also getting rid
of unnecessary steps at the power off phase. But before that we want to
actually unify things rather than having special cases, as maintaining
the special code paths caused already quite a lot of problems for us so
far.
Reverting the commit may hide some other issue, so before doing that
could you try the following patch:
http://lists.freedesktop.org/archives/intel-gfx/2015-February/060529.html
If with that you still get the hang could you try on top of that the
patch below, first having only pci_set_power_state uncommented, then
both pci_set_power_state and pci_disable_device uncommented?
Thanks,
Imre
diff --git a/drivers/gpu/drm/i915/i915_drv.c
b/drivers/gpu/drm/i915/i915_drv.c
index 4badb23..dc91d4b 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -968,6 +968,23 @@ static int i915_pm_suspend_late(struct device *dev)
return i915_drm_suspend_late(drm_dev);
}
+static int i915_pm_poweroff_late(struct device *dev)
+{
+ struct drm_device *drm_dev = dev_to_i915(dev)->dev;
+ struct drm_i915_private *dev_priv = drm_dev->dev_private;
+ int ret;
+
+ ret = intel_suspend_complete(dev_priv);
+
+ if (ret)
+ return ret;
+
+ pci_disable_device(drm_dev->pdev);
+// pci_set_power_state(drm_dev->pdev, PCI_D3hot);
+
+ return 0;
+}
+
static int i915_pm_resume_early(struct device *dev)
{
struct drm_device *drm_dev = dev_to_i915(dev)->dev;
@@ -1535,7 +1552,7 @@ static const struct dev_pm_ops i915_pm_ops = {
.thaw_early = i915_pm_resume_early,
.thaw = i915_pm_resume,
.poweroff = i915_pm_suspend,
- .poweroff_late = i915_pm_suspend_late,
+ .poweroff_late = i915_pm_poweroff_late,
.restore_early = i915_pm_resume_early,
.restore = i915_pm_resume,
Imre Deak <[email protected]> writes:
> The poweroff handlers undo the actions of the thaw handlers. As the
> original commit stated saving the registers is not needed there, but
> it's also not a big overhead and there should be no problem doing it. We
> are planning to optimize the hibernation sequence by for example not
> shutting down the display between freeze and thaw, and also getting rid
> of unnecessary steps at the power off phase. But before that we want to
> actually unify things rather than having special cases, as maintaining
> the special code paths caused already quite a lot of problems for us so
> far.
That sounds like a worthy goal. I don't understand what you hope to
achieve by having a poweroff_late hook, since there aren't really
anything useful left to do at the point it is called, but if you want a
dummy callback there for code structure reasons then fine.
But you cannot just run around breaking stuff while slowly moving
towards this goal over multiple releases... v3.19 is currently broken
and that seems totally unnecessary.
In any case: You should have noticed this problem while testing your
patches. The breakage is 100% reproducible. Unfortunately I had to do a
bisect to realize what you had done to the i915 driver, something I
unfortunately didn't find time to do before v3.19 was released. But I
do find it unnecessary to release with such bugs. Any attempt to
exercise the code path you modified would have revealed the bug.
> Reverting the commit may hide some other issue, so before doing that
> could you try the following patch:
> http://lists.freedesktop.org/archives/intel-gfx/2015-February/060529.html
Makes no difference. I assume that patch fixes an unrelated bug? The
age and reported symptoms indicates so. Note that I am reporting a
regression introduced after v3.18, while that seems to fix a bug
introduced in v3.17. Both v3.17 and v3.18 (including v3.18.6), as
well as earlier releases, work fine for me.
> If with that you still get the hang could you try on top of that the
> patch below, first having only pci_set_power_state uncommented, then
> both pci_set_power_state and pci_disable_device uncommented?
That patch fixes the problem, with only pci_set_power_state commented
out. Do you still want me to try with pci_disable_device() commented
out as well?
Bjørn
On ti, 2015-02-24 at 20:00 +0100, Bjørn Mork wrote:
> Imre Deak <[email protected]> writes:
>
> > The poweroff handlers undo the actions of the thaw handlers. As the
> > original commit stated saving the registers is not needed there, but
> > it's also not a big overhead and there should be no problem doing it. We
> > are planning to optimize the hibernation sequence by for example not
> > shutting down the display between freeze and thaw, and also getting rid
> > of unnecessary steps at the power off phase. But before that we want to
> > actually unify things rather than having special cases, as maintaining
> > the special code paths caused already quite a lot of problems for us so
> > far.
>
> That sounds like a worthy goal. I don't understand what you hope to
> achieve by having a poweroff_late hook, since there aren't really
> anything useful left to do at the point it is called, but if you want a
> dummy callback there for code structure reasons then fine.
There are the following reasons for shutting down the device properly
during hibernation:
- ACPI mandates that the OSPM (the kernel in our case) puts all devices
into D3 that are not wake-up sources (i915 is not) (Kudos to Ville for
pointing this out)
- Embedded panels have a well defined shutdown sequence. We don't have
any good reason to not follow this, in fact for some panels the
subsequent reinitialization could be problematic in case of a hard
power-off. (Thanks to Jani for this info)
In fact the only reason why we didn't put the device into PCI D3 before
the patch you bisected, is kind of arbitrary. The PCI core puts every
device into D3 unless its driver saves the device's PCI state on its
own. Since before that patch we did save the state, but haven't put the
device into D3 it stayed in D0. That was definitely not intentional and
as such we depended on the BIOS to correct this for us afterwards.
> But you cannot just run around breaking stuff while slowly moving
> towards this goal over multiple releases... v3.19 is currently broken
> and that seems totally unnecessary.
>
> In any case: You should have noticed this problem while testing your
> patches. The breakage is 100% reproducible. Unfortunately I had to do a
> bisect to realize what you had done to the i915 driver, something I
> unfortunately didn't find time to do before v3.19 was released. But I
> do find it unnecessary to release with such bugs. Any attempt to
> exercise the code path you modified would have revealed the bug.
We tested these patches on numerous platforms and haven't seen the issue
you reported. Based on your feedback the current assumption is that this
is a bug in your BIOS, which tries to access the device despite it being
powered down. We're trying our best to test each change we commit on a
big set of platforms, but - especially on older hardware with random
BIOS versions/configurations - full coverage is not possible. So we
depend on reports like yours a lot to fill in the gaps.
> > Reverting the commit may hide some other issue, so before doing that
> > could you try the following patch:
> > http://lists.freedesktop.org/archives/intel-gfx/2015-February/060529.html
>
> Makes no difference. I assume that patch fixes an unrelated bug? The
> age and reported symptoms indicates so. Note that I am reporting a
> regression introduced after v3.18, while that seems to fix a bug
> introduced in v3.17. Both v3.17 and v3.18 (including v3.18.6), as
> well as earlier releases, work fine for me.
Ok, thanks for trying.
> > If with that you still get the hang could you try on top of that the
> > patch below, first having only pci_set_power_state uncommented, then
> > both pci_set_power_state and pci_disable_device uncommented?
>
> That patch fixes the problem, with only pci_set_power_state commented
> out. Do you still want me to try with pci_disable_device() commented
> out as well?
No, but it would help if you could still try the two attached patch
separately, without any of the previous workarounds. Based on the
result, we'll follow up with a fix that adds for your specific platform
either of the attached workarounds or simply avoids putting the device
into D3 (corresponding to the patch you already tried).
Thanks,
Imre
Imre Deak <[email protected]> writes:
>> That patch fixes the problem, with only pci_set_power_state commented
>> out. Do you still want me to try with pci_disable_device() commented
>> out as well?
>
> No, but it would help if you could still try the two attached patch
> separately, without any of the previous workarounds. Based on the
> result, we'll follow up with a fix that adds for your specific platform
> either of the attached workarounds or simply avoids putting the device
> into D3 (corresponding to the patch you already tried).
None of those patches made any difference. The laptop still hangs at
power-off.
Not really surprising, is it? Previous testing shows that the hang
occurs at the pci_set_power_state(drm_dev->pdev, PCI_D3hot) call in the
poweroff_late hook. It is hard to see how replacing it by an attempt to
set D3cold, or adding any call after this point, could possibly change
anything. The system is stil hanging at the pci_set_power_state() call.
The generic pci-driver code will put the i915 device into PCI_D3hot for
you, won't it? Why do you need to duplicate that in the driver,
*knowing* that doing so breaks (at least some) systems?
I honestly don't think this "let's try some random code" is the proper
way to fix this bug (or any other bug for that matter). You need to
start understanding the code you write, and the first step is by
actually explaining the changes you make.
I also believe that you completely miss the fact that this bug has
survived a full release cycle before you became aware of it, and the
implications this has wrt other affected systems/users. I assume you
understand that my system isn't one-of-a-kind, This means that there are
other affected users with identical/similar systems. Now, if none of
those users reported the bug to you (we all know why: Linux kernel
development is currently limited by the available testing resources, NOT
by the available developer resources), then how do you know that there
aren't a number of other systems affected as well?
Let me answer that for you: You don't.
Which is why you must explain the mechanism triggering the bug, proving
that it is a chipset/system specific issue. Because that's the only way
you will *know* that you have solved the problem not only for me, but for
all affected users.
IMHO, the only safe and sane solution at the moment is the revert patch
I posted. It's a simple fix, reverting back to the *known* working
state before this regression was introduced.
Then you can start over from there, trying to implement this properly.
Thanks,
Bjørn
On to, 2015-02-26 at 10:34 +0100, Bjørn Mork wrote:
> Imre Deak <[email protected]> writes:
>
> >> That patch fixes the problem, with only pci_set_power_state commented
> >> out. Do you still want me to try with pci_disable_device() commented
> >> out as well?
> >
> > No, but it would help if you could still try the two attached patch
> > separately, without any of the previous workarounds. Based on the
> > result, we'll follow up with a fix that adds for your specific platform
> > either of the attached workarounds or simply avoids putting the device
> > into D3 (corresponding to the patch you already tried).
>
> None of those patches made any difference. The laptop still hangs at
> power-off.
>
> Not really surprising, is it? Previous testing shows that the hang
> occurs at the pci_set_power_state(drm_dev->pdev, PCI_D3hot) call in the
> poweroff_late hook. It is hard to see how replacing it by an attempt to
> set D3cold, or adding any call after this point, could possibly change
> anything. The system is stil hanging at the pci_set_power_state() call.
Judging from the blinking LED, I assume that it's not
pci_set_power_state() that hangs the machine, but the hang happens in
BIOS code.
> The generic pci-driver code will put the i915 device into PCI_D3hot for
> you, won't it? Why do you need to duplicate that in the driver,
> *knowing* that doing so breaks (at least some) systems?
Letting the pci core put the device into D3 wouldn't get rid of the problem.
It's putting the device into D3 in the first place what causes it.
> I honestly don't think this "let's try some random code" is the proper
> way to fix this bug (or any other bug for that matter). You need to
> start understanding the code you write, and the first step is by
> actually explaining the changes you make.
We have a good understanding about the issue: the BIOS on your platform
does something unexpected behind the back of the driver/kernel. In that
sense the patches were not random, for instance the first one is based on
an existing workaround for an issue that resembles quite a lot yours, see
pci_pm_poweroff_noirq().
> I also believe that you completely miss the fact that this bug has
> survived a full release cycle before you became aware of it, and the
> implications this has wrt other affected systems/users. I assume you
> understand that my system isn't one-of-a-kind, This means that there are
> other affected users with identical/similar systems. Now, if none of
> those users reported the bug to you (we all know why: Linux kernel
> development is currently limited by the available testing resources, NOT
> by the available developer resources), then how do you know that there
> aren't a number of other systems affected as well?
>
> Let me answer that for you: You don't.
>
> Which is why you must explain the mechanism triggering the bug, proving
> that it is a chipset/system specific issue. Because that's the only way
> you will *know* that you have solved the problem not only for me, but for
> all affected users.
>
> IMHO, the only safe and sane solution at the moment is the revert patch
> I posted. It's a simple fix, reverting back to the *known* working
> state before this regression was introduced.
>
> Then you can start over from there, trying to implement this properly.
The current way is the proper one that we want for the generic case. The issue
on your platform is the exception, so working around that is a sensible choice.
Attached is the proposed fix for this issue.
--Imre
Imre Deak <[email protected]> writes:
> Attached is the proposed fix for this issue.
>
> --Imre
>
> From 5c23657bc168db12a1ba100ab65fabd305c89c8a Mon Sep 17 00:00:00 2001
> From: Imre Deak <[email protected]>
> Date: Thu, 26 Feb 2015 18:38:53 +0200
> Subject: [PATCH] drm/i915: gm45: work around hang during hibernation
I can confirm that this patch solves the problem for me. Thanks.
Bjørn
On Thu, Feb 26, 2015 at 08:50:48PM +0200, Imre Deak wrote:
> On to, 2015-02-26 at 10:34 +0100, Bj?rn Mork wrote:
> > Imre Deak <[email protected]> writes:
> >
> > >> That patch fixes the problem, with only pci_set_power_state commented
> > >> out. Do you still want me to try with pci_disable_device() commented
> > >> out as well?
> > >
> > > No, but it would help if you could still try the two attached patch
> > > separately, without any of the previous workarounds. Based on the
> > > result, we'll follow up with a fix that adds for your specific platform
> > > either of the attached workarounds or simply avoids putting the device
> > > into D3 (corresponding to the patch you already tried).
> >
> > None of those patches made any difference. The laptop still hangs at
> > power-off.
> >
> > Not really surprising, is it? Previous testing shows that the hang
> > occurs at the pci_set_power_state(drm_dev->pdev, PCI_D3hot) call in the
> > poweroff_late hook. It is hard to see how replacing it by an attempt to
> > set D3cold, or adding any call after this point, could possibly change
> > anything. The system is stil hanging at the pci_set_power_state() call.
>
> Judging from the blinking LED, I assume that it's not
> pci_set_power_state() that hangs the machine, but the hang happens in
> BIOS code.
>
> > The generic pci-driver code will put the i915 device into PCI_D3hot for
> > you, won't it? Why do you need to duplicate that in the driver,
> > *knowing* that doing so breaks (at least some) systems?
>
> Letting the pci core put the device into D3 wouldn't get rid of the problem.
> It's putting the device into D3 in the first place what causes it.
>
> > I honestly don't think this "let's try some random code" is the proper
> > way to fix this bug (or any other bug for that matter). You need to
> > start understanding the code you write, and the first step is by
> > actually explaining the changes you make.
>
> We have a good understanding about the issue: the BIOS on your platform
> does something unexpected behind the back of the driver/kernel. In that
> sense the patches were not random, for instance the first one is based on
> an existing workaround for an issue that resembles quite a lot yours, see
> pci_pm_poweroff_noirq().
>
> > I also believe that you completely miss the fact that this bug has
> > survived a full release cycle before you became aware of it, and the
> > implications this has wrt other affected systems/users. I assume you
> > understand that my system isn't one-of-a-kind, This means that there are
> > other affected users with identical/similar systems. Now, if none of
> > those users reported the bug to you (we all know why: Linux kernel
> > development is currently limited by the available testing resources, NOT
> > by the available developer resources), then how do you know that there
> > aren't a number of other systems affected as well?
> >
> > Let me answer that for you: You don't.
> >
> > Which is why you must explain the mechanism triggering the bug, proving
> > that it is a chipset/system specific issue. Because that's the only way
> > you will *know* that you have solved the problem not only for me, but for
> > all affected users.
> >
> > IMHO, the only safe and sane solution at the moment is the revert patch
> > I posted. It's a simple fix, reverting back to the *known* working
> > state before this regression was introduced.
> >
> > Then you can start over from there, trying to implement this properly.
>
> The current way is the proper one that we want for the generic case. The issue
> on your platform is the exception, so working around that is a sensible choice.
>
> Attached is the proposed fix for this issue.
>
> --Imre
>
> From 5c23657bc168db12a1ba100ab65fabd305c89c8a Mon Sep 17 00:00:00 2001
> From: Imre Deak <[email protected]>
> Date: Thu, 26 Feb 2015 18:38:53 +0200
> Subject: [PATCH] drm/i915: gm45: work around hang during hibernation
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
>
> Bj?rn reported that his machine hang during hibernation and eventually
> bisected the problem to the following commit:
>
> commit da2bc1b9db3351addd293e5b82757efe1f77ed1d
> Author: Imre Deak <[email protected]>
> Date: Thu Oct 23 19:23:26 2014 +0300
>
> drm/i915: add poweroff_late handler
>
> The problem seems to be that after the kernel puts the device into D3
> the BIOS still tries to access it, or otherwise assumes that it's in D0.
> This is clearly bogus, since ACPI mandates that devices are put into D3
> by the OSPM if they are not wake-up sources. In the future we want to
> unify more of the driver's runtime and system suspend paths, for example
> by skipping all the system suspend/hibernation hooks if the device is
> runtime suspended already. Accordingly for all other platforms the goal
> is still to properly power down the device during hibernation.
>
> Reference: http://lists.freedesktop.org/archives/intel-gfx/2015-February/060633.html
> Reported-and-bisected-by: Bj?rn Mork <[email protected]>
> Signed-off-by: Imre Deak <[email protected]>
> ---
> drivers/gpu/drm/i915/i915_drv.c | 27 ++++++++++++++++++++++-----
> 1 file changed, 22 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index 4badb23..67d212b 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -637,7 +637,7 @@ static int i915_drm_suspend(struct drm_device *dev)
> return 0;
> }
>
> -static int i915_drm_suspend_late(struct drm_device *drm_dev)
> +static int i915_drm_suspend_late(struct drm_device *drm_dev, bool hibernation)
> {
> struct drm_i915_private *dev_priv = drm_dev->dev_private;
> int ret;
> @@ -651,7 +651,14 @@ static int i915_drm_suspend_late(struct drm_device *drm_dev)
> }
>
> pci_disable_device(drm_dev->pdev);
> - pci_set_power_state(drm_dev->pdev, PCI_D3hot);
> + /*
> + * During hibernation on some GM45 platforms the BIOS may try to access
> + * the device even though it's already in D3 and hang the machine. So
> + * leave the device in D0 on those platforms and hope the BIOS will
> + * power down the device properly.
> + */
> + if (!(hibernation && IS_GM45(dev_priv)))
> + pci_set_power_state(drm_dev->pdev, PCI_D3hot);
If we see more of these troublesome machines we might need to apply an
even bigger hammer like gen < 5 or so. But as long as there's just 1
report I think this is the right approach.
Bjorn, as soon as we have your tested-by that this work we can apply this
and shovel it through the backporting machinery.
Thanks, Daniel
>
> return 0;
> }
> @@ -677,7 +684,7 @@ int i915_suspend_legacy(struct drm_device *dev, pm_message_t state)
> if (error)
> return error;
>
> - return i915_drm_suspend_late(dev);
> + return i915_drm_suspend_late(dev, false);
> }
>
> static int i915_drm_resume(struct drm_device *dev)
> @@ -965,7 +972,17 @@ static int i915_pm_suspend_late(struct device *dev)
> if (drm_dev->switch_power_state == DRM_SWITCH_POWER_OFF)
> return 0;
>
> - return i915_drm_suspend_late(drm_dev);
> + return i915_drm_suspend_late(drm_dev, false);
> +}
> +
> +static int i915_pm_poweroff_late(struct device *dev)
> +{
> + struct drm_device *drm_dev = dev_to_i915(dev)->dev;
> +
> + if (drm_dev->switch_power_state == DRM_SWITCH_POWER_OFF)
> + return 0;
> +
> + return i915_drm_suspend_late(drm_dev, true);
> }
>
> static int i915_pm_resume_early(struct device *dev)
> @@ -1535,7 +1552,7 @@ static const struct dev_pm_ops i915_pm_ops = {
> .thaw_early = i915_pm_resume_early,
> .thaw = i915_pm_resume,
> .poweroff = i915_pm_suspend,
> - .poweroff_late = i915_pm_suspend_late,
> + .poweroff_late = i915_pm_poweroff_late,
> .restore_early = i915_pm_resume_early,
> .restore = i915_pm_resume,
>
> --
> 2.1.0
>
> _______________________________________________
> Intel-gfx mailing list
> [email protected]
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
On Thu, Feb 26, 2015 at 08:50:48PM +0200, Imre Deak wrote:
> On to, 2015-02-26 at 10:34 +0100, Bj?rn Mork wrote:
> > Imre Deak <[email protected]> writes:
> >
> > >> That patch fixes the problem, with only pci_set_power_state commented
> > >> out. Do you still want me to try with pci_disable_device() commented
> > >> out as well?
> > >
> > > No, but it would help if you could still try the two attached patch
> > > separately, without any of the previous workarounds. Based on the
> > > result, we'll follow up with a fix that adds for your specific platform
> > > either of the attached workarounds or simply avoids putting the device
> > > into D3 (corresponding to the patch you already tried).
> >
> > None of those patches made any difference. The laptop still hangs at
> > power-off.
> >
> > Not really surprising, is it? Previous testing shows that the hang
> > occurs at the pci_set_power_state(drm_dev->pdev, PCI_D3hot) call in the
> > poweroff_late hook. It is hard to see how replacing it by an attempt to
> > set D3cold, or adding any call after this point, could possibly change
> > anything. The system is stil hanging at the pci_set_power_state() call.
>
> Judging from the blinking LED, I assume that it's not
> pci_set_power_state() that hangs the machine, but the hang happens in
> BIOS code.
>
> > The generic pci-driver code will put the i915 device into PCI_D3hot for
> > you, won't it? Why do you need to duplicate that in the driver,
> > *knowing* that doing so breaks (at least some) systems?
>
> Letting the pci core put the device into D3 wouldn't get rid of the problem.
> It's putting the device into D3 in the first place what causes it.
>
> > I honestly don't think this "let's try some random code" is the proper
> > way to fix this bug (or any other bug for that matter). You need to
> > start understanding the code you write, and the first step is by
> > actually explaining the changes you make.
>
> We have a good understanding about the issue: the BIOS on your platform
> does something unexpected behind the back of the driver/kernel. In that
> sense the patches were not random, for instance the first one is based on
> an existing workaround for an issue that resembles quite a lot yours, see
> pci_pm_poweroff_noirq().
>
> > I also believe that you completely miss the fact that this bug has
> > survived a full release cycle before you became aware of it, and the
> > implications this has wrt other affected systems/users. I assume you
> > understand that my system isn't one-of-a-kind, This means that there are
> > other affected users with identical/similar systems. Now, if none of
> > those users reported the bug to you (we all know why: Linux kernel
> > development is currently limited by the available testing resources, NOT
> > by the available developer resources), then how do you know that there
> > aren't a number of other systems affected as well?
> >
> > Let me answer that for you: You don't.
> >
> > Which is why you must explain the mechanism triggering the bug, proving
> > that it is a chipset/system specific issue. Because that's the only way
> > you will *know* that you have solved the problem not only for me, but for
> > all affected users.
> >
> > IMHO, the only safe and sane solution at the moment is the revert patch
> > I posted. It's a simple fix, reverting back to the *known* working
> > state before this regression was introduced.
> >
> > Then you can start over from there, trying to implement this properly.
>
> The current way is the proper one that we want for the generic case. The issue
> on your platform is the exception, so working around that is a sensible choice.
>
> Attached is the proposed fix for this issue.
>
> --Imre
>
> From 5c23657bc168db12a1ba100ab65fabd305c89c8a Mon Sep 17 00:00:00 2001
> From: Imre Deak <[email protected]>
> Date: Thu, 26 Feb 2015 18:38:53 +0200
> Subject: [PATCH] drm/i915: gm45: work around hang during hibernation
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
>
> Bj?rn reported that his machine hang during hibernation and eventually
> bisected the problem to the following commit:
>
> commit da2bc1b9db3351addd293e5b82757efe1f77ed1d
> Author: Imre Deak <[email protected]>
> Date: Thu Oct 23 19:23:26 2014 +0300
>
> drm/i915: add poweroff_late handler
>
> The problem seems to be that after the kernel puts the device into D3
> the BIOS still tries to access it, or otherwise assumes that it's in D0.
> This is clearly bogus, since ACPI mandates that devices are put into D3
> by the OSPM if they are not wake-up sources. In the future we want to
> unify more of the driver's runtime and system suspend paths, for example
> by skipping all the system suspend/hibernation hooks if the device is
> runtime suspended already. Accordingly for all other platforms the goal
> is still to properly power down the device during hibernation.
>
> Reference: http://lists.freedesktop.org/archives/intel-gfx/2015-February/060633.html
> Reported-and-bisected-by: Bj?rn Mork <[email protected]>
> Signed-off-by: Imre Deak <[email protected]>
> ---
> drivers/gpu/drm/i915/i915_drv.c | 27 ++++++++++++++++++++++-----
> 1 file changed, 22 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index 4badb23..67d212b 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -637,7 +637,7 @@ static int i915_drm_suspend(struct drm_device *dev)
> return 0;
> }
>
> -static int i915_drm_suspend_late(struct drm_device *drm_dev)
> +static int i915_drm_suspend_late(struct drm_device *drm_dev, bool hibernation)
> {
> struct drm_i915_private *dev_priv = drm_dev->dev_private;
> int ret;
> @@ -651,7 +651,14 @@ static int i915_drm_suspend_late(struct drm_device *drm_dev)
> }
>
> pci_disable_device(drm_dev->pdev);
> - pci_set_power_state(drm_dev->pdev, PCI_D3hot);
> + /*
> + * During hibernation on some GM45 platforms the BIOS may try to access
> + * the device even though it's already in D3 and hang the machine. So
> + * leave the device in D0 on those platforms and hope the BIOS will
> + * power down the device properly.
Please include the model of the known bad machine in this comment, to
help future archaeologists.
> + */
> + if (!(hibernation && IS_GM45(dev_priv)))
> + pci_set_power_state(drm_dev->pdev, PCI_D3hot);
>
> return 0;
> }
> @@ -677,7 +684,7 @@ int i915_suspend_legacy(struct drm_device *dev, pm_message_t state)
> if (error)
> return error;
>
> - return i915_drm_suspend_late(dev);
> + return i915_drm_suspend_late(dev, false);
> }
>
> static int i915_drm_resume(struct drm_device *dev)
> @@ -965,7 +972,17 @@ static int i915_pm_suspend_late(struct device *dev)
> if (drm_dev->switch_power_state == DRM_SWITCH_POWER_OFF)
> return 0;
>
> - return i915_drm_suspend_late(drm_dev);
> + return i915_drm_suspend_late(drm_dev, false);
> +}
> +
> +static int i915_pm_poweroff_late(struct device *dev)
> +{
> + struct drm_device *drm_dev = dev_to_i915(dev)->dev;
> +
> + if (drm_dev->switch_power_state == DRM_SWITCH_POWER_OFF)
> + return 0;
> +
> + return i915_drm_suspend_late(drm_dev, true);
> }
>
> static int i915_pm_resume_early(struct device *dev)
> @@ -1535,7 +1552,7 @@ static const struct dev_pm_ops i915_pm_ops = {
> .thaw_early = i915_pm_resume_early,
> .thaw = i915_pm_resume,
> .poweroff = i915_pm_suspend,
> - .poweroff_late = i915_pm_suspend_late,
> + .poweroff_late = i915_pm_poweroff_late,
> .restore_early = i915_pm_resume_early,
> .restore = i915_pm_resume,
>
> --
> 2.1.0
>
--
Ville Syrj?l?
Intel OTC
Ville Syrjälä <[email protected]> writes:
>> @@ -651,7 +651,14 @@ static int i915_drm_suspend_late(struct drm_device *drm_dev)
>> }
>>
>> pci_disable_device(drm_dev->pdev);
>> - pci_set_power_state(drm_dev->pdev, PCI_D3hot);
>> + /*
>> + * During hibernation on some GM45 platforms the BIOS may try to access
>> + * the device even though it's already in D3 and hang the machine. So
>> + * leave the device in D0 on those platforms and hope the BIOS will
>> + * power down the device properly.
>
> Please include the model of the known bad machine in this comment, to
> help future archaeologists.
Here are some details:
bjorn@nemi:~$ grep . /sys/class/dmi/id/{bios,product}* 2>/dev/null
/sys/class/dmi/id/bios_date:12/19/2011
/sys/class/dmi/id/bios_vendor:LENOVO
/sys/class/dmi/id/bios_version:6EET55WW (3.15 )
/sys/class/dmi/id/product_name:2776LEG
/sys/class/dmi/id/product_version:ThinkPad X301
Please let me know if you need some other data.
Bjørn
On Thu, Feb 26, 2015 at 09:01:27PM +0100, Daniel Vetter wrote:
> On Thu, Feb 26, 2015 at 08:50:48PM +0200, Imre Deak wrote:
[snip]
> >
> > The problem seems to be that after the kernel puts the device into D3
> > the BIOS still tries to access it, or otherwise assumes that it's in D0.
> > This is clearly bogus, since ACPI mandates that devices are put into D3
> > by the OSPM if they are not wake-up sources. In the future we want to
> > unify more of the driver's runtime and system suspend paths, for example
> > by skipping all the system suspend/hibernation hooks if the device is
> > runtime suspended already. Accordingly for all other platforms the goal
> > is still to properly power down the device during hibernation.
[snip]
>
> If we see more of these troublesome machines we might need to apply an
> even bigger hammer like gen < 5 or so. But as long as there's just 1
> report I think this is the right approach.
>
> Bjorn, as soon as we have your tested-by that this work we can apply this
> and shovel it through the backporting machinery.
Isn't there a risk that this will negatively impact machines using gen4
that *don't* have a buggy BIOS? Wouldn't a quirk tied to the laptop
or BIOS version be better suited -- or possibly a parameter that can be
passed to the driver, which would make it easier to test if others
suffering from similar symptoms on other systems suffer from the same
issue or not?
Just my 2?.
Kind regards, David Weinehall
On Fri, 2015-02-27 at 14:15 +0200, David Weinehall wrote:
> On Thu, Feb 26, 2015 at 09:01:27PM +0100, Daniel Vetter wrote:
> > On Thu, Feb 26, 2015 at 08:50:48PM +0200, Imre Deak wrote:
>
> [snip]
> > >
> > > The problem seems to be that after the kernel puts the device into D3
> > > the BIOS still tries to access it, or otherwise assumes that it's in D0.
> > > This is clearly bogus, since ACPI mandates that devices are put into D3
> > > by the OSPM if they are not wake-up sources. In the future we want to
> > > unify more of the driver's runtime and system suspend paths, for example
> > > by skipping all the system suspend/hibernation hooks if the device is
> > > runtime suspended already. Accordingly for all other platforms the goal
> > > is still to properly power down the device during hibernation.
>
> [snip]
> >
> > If we see more of these troublesome machines we might need to apply an
> > even bigger hammer like gen < 5 or so. But as long as there's just 1
> > report I think this is the right approach.
> >
> > Bjorn, as soon as we have your tested-by that this work we can apply this
> > and shovel it through the backporting machinery.
>
> Isn't there a risk that this will negatively impact machines using gen4
> that *don't* have a buggy BIOS? Wouldn't a quirk tied to the laptop
> or BIOS version be better suited -- or possibly a parameter that can be
> passed to the driver, which would make it easier to test if others
> suffering from similar symptoms on other systems suffer from the same
> issue or not?
>
> Just my 2¢.
We've checked today with Ville a few machines we've found from GEN2 to
GEN5+. There was one Thinkpad x61s (GEN4) where I could reproduce the
exact same problem and get rid of it using the same workaround. All the
others were non-Lenovo (including GEN4) mobile and desktop platforms and
none of these had the problem. I also tried it on a Lenovo IVB
ultrabook, which also works fine. So the current theory is that it's
related to GEN4 Thinkpad BIOSes, and so we need to change the condition
to match that at least.
--Imre