2013-05-01 17:11:31

by Jonas Heinrich

[permalink] [raw]
Subject: Re: [Bisected] 3.7-rc1 can't resume (still present in 3.9)

Hello,
I tried the newest kernel, 3.9 today but the bug is still present.
Applying the attached patch solves the bug for me.

Best regards,
Jonas Heinrich

On 03-20 14:32, Jonas Heinrich wrote:
> Hello Peter,
> sorry for responding that late to your advice ...
>
> On 02-23 13:54, H. Peter Anvin wrote:
> > So to bisect anything between
> > 73201dbec64aebf6b0dca855b523f437972dc7bb and
> > 1396adc3c2bdc556d4cdd1cf107aa0b6d59fbb1e in a meaningful way you
> > will have to apply 1396adc3c2bdc556d4cdd1cf107aa0b6d59fbb1e on top.
>
> Hope I got you right:
>
> git bisect start
> # good: [73201dbec64aebf6b0dca855b523f437972dc7bb] x86, suspend: On
> # wakeup always initialize cr4 and EFER
> git bisect good 73201dbec64aebf6b0dca855b523f437972dc7bb
> # bad: [1396adc3c2bdc556d4cdd1cf107aa0b6d59fbb1e] x86, suspend: Correct
> # the restore of CR4, EFER; skip computing EFLAGS.ID
> git bisect bad 1396adc3c2bdc556d4cdd1cf107aa0b6d59fbb1e
> # cherry-picking
> git cherry-pick 1396adc3c2bdc556d4cdd1cf107aa0b6d59fbb1e
> # bad: [a4a4fd9c1b2fa3687fa80177d5de7c551851906d] x86, suspend: Correct the restore of CR4, EFER; skip computing EFLAGS.ID
> git bisect bad a4a4fd9c1b2fa3687fa80177d5de7c551851906d
> # cherry-picking
> git cherry-pick 1396adc3c2bdc556d4cdd1cf107aa0b6d59fbb1e
> # bad: [be74ee62657cd7a51519946da7c4bedf7695b0da] x86, suspend: Correct the restore of CR4, EFER; skip computing EFLAGS.ID
> git bisect bad be74ee62657cd7a51519946da7c4bedf7695b0da
> # cherry-picking
> git cherry-pick 1396adc3c2bdc556d4cdd1cf107aa0b6d59fbb1e
> # bad: [10df153d76caf2464adb076d475c5c3bfce2c584] x86, suspend: Correct the restore of CR4, EFER; skip computing EFLAGS.ID
> git bisect bad 10df153d76caf2464adb076d475c5c3bfce2c584
> # cherry-picking
> git cherry-pick 1396adc3c2bdc556d4cdd1cf107aa0b6d59fbb1e
> # bad: [5ccf64462e2ea53f6b446aea61c308f57224ad6c] x86, suspend: Correct the restore of CR4, EFER; skip computing EFLAGS.ID
> git bisect bad 5ccf64462e2ea53f6b446aea61c308f57224ad6c
> # cherry-picking
> git cherry-pick 1396adc3c2bdc556d4cdd1cf107aa0b6d59fbb1e
> # bad: [6337a0046893906ef8dba2db30e36d7360101871] x86, suspend: Correct the restore of CR4, EFER; skip computing EFLAGS.ID
> git bisect bad 6337a0046893906ef8dba2db30e36d7360101871
> # cherry-picking
> git cherry-pick 1396adc3c2bdc556d4cdd1cf107aa0b6d59fbb1e
> # bad: [49eeaeaf09fde353766ae0cc548c4cef752d90a9] x86, suspend: Correct the restore of CR4, EFER; skip computing EFLAGS.ID
> git bisect bad 49eeaeaf09fde353766ae0cc548c4cef752d90a9
> # cherry-picking
> git cherry-pick 1396adc3c2bdc556d4cdd1cf107aa0b6d59fbb1e
> # bad: [43d880b7fbaa831d5bab9bd3215d3053f7c69e97] x86, suspend: Correct the restore of CR4, EFER; skip computing EFLAGS.ID
> git bisect bad 43d880b7fbaa831d5bab9bd3215d3053f7c69e97
>
> Well, don't know how to proceed here. As you could see, none of these kernels worked for me.
>
> > If 73201dbec64aebf6b0dca855b523f437972dc7b with
> > 1396adc3c2bdc556d4cdd1cf107aa0b6d59fbb1e on top fails whereas the
> > previous one does, then that is very interesting and something we
> > can work with.
>
> git bisect reset
> git pull
> git checkout 73201dbec64aebf6b0dca855b523f437972dc7b
> git cherry-pick 1396adc3c2bdc556d4cdd1cf107aa0b6d59fbb1e
>
> This kernel also does not work :(
>
> Best regards,
> Jonas Heinrich



Attachments:
(No filename) (0.00 B)
(No filename) (490.00 B)
Download all attachments

2013-05-01 17:33:58

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [Bisected] 3.7-rc1 can't resume (still present in 3.9)

On 05/01/2013 10:01 AM, Jonas Heinrich wrote:
> Hello, I tried the newest kernel, 3.9 today but the bug is still
> present. Applying the attached patch solves the bug for me.
>
> Best regards, Jonas Heinrich

Okay... WTF is going on here? Does pmode_behavior just not get set up
correctly? Since it seems you can get it to wake up with your patch,
perhaps we can get read out the value of pmode_behavior and print it...

-hpa

2013-05-01 18:52:53

by Jonas Heinrich

[permalink] [raw]
Subject: Re: [Bisected] 3.7-rc1 can't resume (still present in 3.9)

Well, you could give me instructions on how to debug this (I'll do
everything ;)) or I could ship you the Thinkpad T43. I guess this would
worth the effort since this bug is somehow critical.

Best regards,
Jonas

On 05-01 10:33, H. Peter Anvin wrote:
> On 05/01/2013 10:01 AM, Jonas Heinrich wrote:
> > Hello, I tried the newest kernel, 3.9 today but the bug is still
> > present. Applying the attached patch solves the bug for me.
> >
> > Best regards, Jonas Heinrich
>
> Okay... WTF is going on here? Does pmode_behavior just not get set up
> correctly? Since it seems you can get it to wake up with your patch,
> perhaps we can get read out the value of pmode_behavior and print it...
>
> -hpa
>
>


Attachments:
(No filename) (710.00 B)
(No filename) (490.00 B)
Download all attachments

2013-05-01 18:55:40

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [Bisected] 3.7-rc1 can't resume (still present in 3.9)

On 05/01/2013 11:51 AM, Jonas Heinrich wrote:
> Well, you could give me instructions on how to debug this (I'll do
> everything ;)) or I could ship you the Thinkpad T43. I guess this
> would worth the effort since this bug is somehow critical.
>
> Best regards, Jonas

I'll put together a debug patch unless I can trick Rafael into doing
it first...

-hpa

2013-05-02 00:37:31

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [Bisected] 3.7-rc1 can't resume (still present in 3.9)

On Wednesday, May 01, 2013 11:55:10 AM H. Peter Anvin wrote:
> On 05/01/2013 11:51 AM, Jonas Heinrich wrote:
> > Well, you could give me instructions on how to debug this (I'll do
> > everything ;)) or I could ship you the Thinkpad T43. I guess this
> > would worth the effort since this bug is somehow critical.
> >
> > Best regards, Jonas
>
> I'll put together a debug patch unless I can trick Rafael into doing
> it first...

I'm afraid that code has changed quite a bit since I looked at it last time.
[Jarkko Sakkinen seems to have worked on it lately, CCed.]

Jonas, I wonder what happens if you drop the first hunk of the patch (it just
uses a different register, which shouldn't matter)? Does it still help then?

If so, there are still a few things you can do to it, e.g:
(1) drop the

- btl $WAKEUP_BEHAVIOR_RESTORE_CR4, %edi
- jnc 1f

lines,
(2) drop the

- btl $WAKEUP_BEHAVIOR_RESTORE_EFER, %edi
- jnc 1f

lines,
(3) drop the

+ jecxz 1f

line,
(4) drop the

+ movl %eax, %ecx
+ orl %edx, %ecx
+ jz 1f

lines and see what the minimal patch needed for things to work again is.

Thanks,
Rafael


--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

2013-05-02 20:34:29

by Jonas Heinrich

[permalink] [raw]
Subject: Re: [Bisected] 3.7-rc1 can't resume (still present in 3.9)

On 05-02 02:45, Rafael J. Wysocki wrote:
> On Wednesday, May 01, 2013 11:55:10 AM H. Peter Anvin wrote:
> > On 05/01/2013 11:51 AM, Jonas Heinrich wrote:
> > > Well, you could give me instructions on how to debug this (I'll do
> > > everything ;)) or I could ship you the Thinkpad T43. I guess this
> > > would worth the effort since this bug is somehow critical.
> > >
> > > Best regards, Jonas
> >
> > I'll put together a debug patch unless I can trick Rafael into doing
> > it first...
>
> I'm afraid that code has changed quite a bit since I looked at it last time.
> [Jarkko Sakkinen seems to have worked on it lately, CCed.]
>
> Jonas, I wonder what happens if you drop the first hunk of the patch (it just
> uses a different register, which shouldn't matter)? Does it still help then?

Hello Rafel, first of all, thank you for helping me out :)
You're right, the patch still solves the suspend bug, after removing the first
hunk of the patch and applying it (see attachement:
suspendfix_first_hunk_dropped.patch).

>
> If so, there are still a few things you can do to it, e.g:
> (1) drop the
>
> - btl $WAKEUP_BEHAVIOR_RESTORE_CR4, %edi
> - jnc 1f
>

Still works :) (used suspendfix_1.patch)

> lines,
> (2) drop the
>
> - btl $WAKEUP_BEHAVIOR_RESTORE_EFER, %edi
> - jnc 1f
>
> lines,

Still works :) (used suspendfix_2.patch)

> (3) drop the
>
> + jecxz 1f
>

Still works :) (used suspendfix_3.patch)

> line,
> (4) drop the
>
> + movl %eax, %ecx
> + orl %edx, %ecx
> + jz 1f
>

At this point, the bug reoccurs (used suspendfix_4.patch)!
But that doesn't mean these lines are the only critical, because the more
minimal patch

@@ -119,6 +119,9 @@
jnc 1f
movl pmode_efer, %eax
movl pmode_efer + 4, %edx
+ movl %eax, %ecx
+ orl %edx, %ecx
+ jz 1f
movl $MSR_EFER, %ecx
wrmsr
1:


with removing this part

- movl pmode_cr4, %eax
- movl %eax, %cr4
+ movl pmode_cr4, %ecx
+ movl %ecx, %cr4

also doesn't fix the issue (see suspendfix_5.patch).

> lines and see what the minimal patch needed for things to work again is.
>

So the most minimal working patch is suspendfix_3.patch.

> Thanks,
> Rafael

Thank you and best regards,
Jonas

>
>
> --
> I speak only for myself.
> Rafael J. Wysocki, Intel Open Source Technology Center.


Attachments:
(No filename) (0.00 B)
(No filename) (490.00 B)
Download all attachments

2013-05-02 23:20:52

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [Bisected] 3.7-rc1 can't resume (still present in 3.9)

On Thursday, May 02, 2013 08:32:30 PM Jonas Heinrich wrote:
> On 05-02 02:45, Rafael J. Wysocki wrote:
> > On Wednesday, May 01, 2013 11:55:10 AM H. Peter Anvin wrote:
> > > On 05/01/2013 11:51 AM, Jonas Heinrich wrote:
> > > > Well, you could give me instructions on how to debug this (I'll do
> > > > everything ;)) or I could ship you the Thinkpad T43. I guess this
> > > > would worth the effort since this bug is somehow critical.
> > > >
> > > > Best regards, Jonas
> > >
> > > I'll put together a debug patch unless I can trick Rafael into doing
> > > it first...
> >
> > I'm afraid that code has changed quite a bit since I looked at it last time.
> > [Jarkko Sakkinen seems to have worked on it lately, CCed.]
> >
> > Jonas, I wonder what happens if you drop the first hunk of the patch (it just
> > uses a different register, which shouldn't matter)? Does it still help then?
>
> Hello Rafel, first of all, thank you for helping me out :)
> You're right, the patch still solves the suspend bug, after removing the first
> hunk of the patch and applying it (see attachement:
> suspendfix_first_hunk_dropped.patch).
>
> >
> > If so, there are still a few things you can do to it, e.g:
> > (1) drop the
> >
> > - btl $WAKEUP_BEHAVIOR_RESTORE_CR4, %edi
> > - jnc 1f
> >
>
> Still works :) (used suspendfix_1.patch)
>
> > lines,
> > (2) drop the
> >
> > - btl $WAKEUP_BEHAVIOR_RESTORE_EFER, %edi
> > - jnc 1f
> >
> > lines,
>
> Still works :) (used suspendfix_2.patch)
>
> > (3) drop the
> >
> > + jecxz 1f
> >
>
> Still works :) (used suspendfix_3.patch)
>
> > line,
> > (4) drop the
> >
> > + movl %eax, %ecx
> > + orl %edx, %ecx
> > + jz 1f
> >
>
> At this point, the bug reoccurs (used suspendfix_4.patch)!
> But that doesn't mean these lines are the only critical, because the more
> minimal patch
>
> @@ -119,6 +119,9 @@
> jnc 1f
> movl pmode_efer, %eax
> movl pmode_efer + 4, %edx
> + movl %eax, %ecx
> + orl %edx, %ecx
> + jz 1f
> movl $MSR_EFER, %ecx
> wrmsr
> 1:
>
>
> with removing this part
>
> - movl pmode_cr4, %eax
> - movl %eax, %cr4
> + movl pmode_cr4, %ecx
> + movl %ecx, %cr4
>
> also doesn't fix the issue (see suspendfix_5.patch).
>
> > lines and see what the minimal patch needed for things to work again is.
> >
>
> So the most minimal working patch is suspendfix_3.patch.

Thanks for doing that detective work!

The only explanation of why this particular patch can help that seems viable to
us at the moment is that we have a memory corruption in the code region modified
by it and the patch simply changes the alignment of the instructions that don't
get corrupted.

It looks like this may be verified by putting a bunch of nops into the region
in question, so can you please check if the attached patch helps too?

Rafael


--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.


Attachments:
i386-resume-crash-debug.patch (694.00 B)

2013-05-03 11:08:58

by Jonas Heinrich

[permalink] [raw]
Subject: Re: [Bisected] 3.7-rc1 can't resume (still present in 3.9)

On 05-03 01:29, Rafael J. Wysocki wrote:
> On Thursday, May 02, 2013 08:32:30 PM Jonas Heinrich wrote:
> > On 05-02 02:45, Rafael J. Wysocki wrote:
> > > On Wednesday, May 01, 2013 11:55:10 AM H. Peter Anvin wrote:
> > > > On 05/01/2013 11:51 AM, Jonas Heinrich wrote:
> > > > > Well, you could give me instructions on how to debug this (I'll do
> > > > > everything ;)) or I could ship you the Thinkpad T43. I guess this
> > > > > would worth the effort since this bug is somehow critical.
> > > > >
> > > > > Best regards, Jonas
> > > >
> > > > I'll put together a debug patch unless I can trick Rafael into doing
> > > > it first...
> > >
> > > I'm afraid that code has changed quite a bit since I looked at it last time.
> > > [Jarkko Sakkinen seems to have worked on it lately, CCed.]
> > >
> > > Jonas, I wonder what happens if you drop the first hunk of the patch (it just
> > > uses a different register, which shouldn't matter)? Does it still help then?
> >
> > Hello Rafel, first of all, thank you for helping me out :)
> > You're right, the patch still solves the suspend bug, after removing the first
> > hunk of the patch and applying it (see attachement:
> > suspendfix_first_hunk_dropped.patch).
> >
> > >
> > > If so, there are still a few things you can do to it, e.g:
> > > (1) drop the
> > >
> > > - btl $WAKEUP_BEHAVIOR_RESTORE_CR4, %edi
> > > - jnc 1f
> > >
> >
> > Still works :) (used suspendfix_1.patch)
> >
> > > lines,
> > > (2) drop the
> > >
> > > - btl $WAKEUP_BEHAVIOR_RESTORE_EFER, %edi
> > > - jnc 1f
> > >
> > > lines,
> >
> > Still works :) (used suspendfix_2.patch)
> >
> > > (3) drop the
> > >
> > > + jecxz 1f
> > >
> >
> > Still works :) (used suspendfix_3.patch)
> >
> > > line,
> > > (4) drop the
> > >
> > > + movl %eax, %ecx
> > > + orl %edx, %ecx
> > > + jz 1f
> > >
> >
> > At this point, the bug reoccurs (used suspendfix_4.patch)!
> > But that doesn't mean these lines are the only critical, because the more
> > minimal patch
> >
> > @@ -119,6 +119,9 @@
> > jnc 1f
> > movl pmode_efer, %eax
> > movl pmode_efer + 4, %edx
> > + movl %eax, %ecx
> > + orl %edx, %ecx
> > + jz 1f
> > movl $MSR_EFER, %ecx
> > wrmsr
> > 1:
> >
> >
> > with removing this part
> >
> > - movl pmode_cr4, %eax
> > - movl %eax, %cr4
> > + movl pmode_cr4, %ecx
> > + movl %ecx, %cr4
> >
> > also doesn't fix the issue (see suspendfix_5.patch).
> >
> > > lines and see what the minimal patch needed for things to work again is.
> > >
> >
> > So the most minimal working patch is suspendfix_3.patch.
>
> Thanks for doing that detective work!
>
> The only explanation of why this particular patch can help that seems viable to
> us at the moment is that we have a memory corruption in the code region modified
> by it and the patch simply changes the alignment of the instructions that don't
> get corrupted.
>
> It looks like this may be verified by putting a bunch of nops into the region
> in question, so can you please check if the attached patch helps too?

Unfortunately, your attached patch doesn't seem to fix the bug.
Hope you still have some ideas to address this issue :)

- Jonas
>
> Rafael
>
>
> --
> I speak only for myself.
> Rafael J. Wysocki, Intel Open Source Technology Center.


Attachments:
(No filename) (3.37 kB)
(No filename) (490.00 B)
Download all attachments

2013-05-03 11:29:30

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [Bisected] 3.7-rc1 can't resume (still present in 3.9)

On Friday, May 03, 2013 11:07:05 AM Jonas Heinrich wrote:
> On 05-03 01:29, Rafael J. Wysocki wrote:
> > On Thursday, May 02, 2013 08:32:30 PM Jonas Heinrich wrote:
> > > On 05-02 02:45, Rafael J. Wysocki wrote:
> > > > On Wednesday, May 01, 2013 11:55:10 AM H. Peter Anvin wrote:
> > > > > On 05/01/2013 11:51 AM, Jonas Heinrich wrote:
> > > > > > Well, you could give me instructions on how to debug this (I'll do
> > > > > > everything ;)) or I could ship you the Thinkpad T43. I guess this
> > > > > > would worth the effort since this bug is somehow critical.
> > > > > >
> > > > > > Best regards, Jonas
> > > > >
> > > > > I'll put together a debug patch unless I can trick Rafael into doing
> > > > > it first...
> > > >
> > > > I'm afraid that code has changed quite a bit since I looked at it last time.
> > > > [Jarkko Sakkinen seems to have worked on it lately, CCed.]
> > > >
> > > > Jonas, I wonder what happens if you drop the first hunk of the patch (it just
> > > > uses a different register, which shouldn't matter)? Does it still help then?
> > >
> > > Hello Rafel, first of all, thank you for helping me out :)
> > > You're right, the patch still solves the suspend bug, after removing the first
> > > hunk of the patch and applying it (see attachement:
> > > suspendfix_first_hunk_dropped.patch).
> > >
> > > >
> > > > If so, there are still a few things you can do to it, e.g:
> > > > (1) drop the
> > > >
> > > > - btl $WAKEUP_BEHAVIOR_RESTORE_CR4, %edi
> > > > - jnc 1f
> > > >
> > >
> > > Still works :) (used suspendfix_1.patch)
> > >
> > > > lines,
> > > > (2) drop the
> > > >
> > > > - btl $WAKEUP_BEHAVIOR_RESTORE_EFER, %edi
> > > > - jnc 1f
> > > >
> > > > lines,
> > >
> > > Still works :) (used suspendfix_2.patch)
> > >
> > > > (3) drop the
> > > >
> > > > + jecxz 1f
> > > >
> > >
> > > Still works :) (used suspendfix_3.patch)
> > >
> > > > line,
> > > > (4) drop the
> > > >
> > > > + movl %eax, %ecx
> > > > + orl %edx, %ecx
> > > > + jz 1f
> > > >
> > >
> > > At this point, the bug reoccurs (used suspendfix_4.patch)!
> > > But that doesn't mean these lines are the only critical, because the more
> > > minimal patch
> > >
> > > @@ -119,6 +119,9 @@
> > > jnc 1f
> > > movl pmode_efer, %eax
> > > movl pmode_efer + 4, %edx
> > > + movl %eax, %ecx
> > > + orl %edx, %ecx
> > > + jz 1f
> > > movl $MSR_EFER, %ecx
> > > wrmsr
> > > 1:
> > >
> > >
> > > with removing this part
> > >
> > > - movl pmode_cr4, %eax
> > > - movl %eax, %cr4
> > > + movl pmode_cr4, %ecx
> > > + movl %ecx, %cr4
> > >
> > > also doesn't fix the issue (see suspendfix_5.patch).
> > >
> > > > lines and see what the minimal patch needed for things to work again is.
> > > >
> > >
> > > So the most minimal working patch is suspendfix_3.patch.
> >
> > Thanks for doing that detective work!
> >
> > The only explanation of why this particular patch can help that seems viable to
> > us at the moment is that we have a memory corruption in the code region modified
> > by it and the patch simply changes the alignment of the instructions that don't
> > get corrupted.
> >
> > It looks like this may be verified by putting a bunch of nops into the region
> > in question, so can you please check if the attached patch helps too?
>
> Unfortunately, your attached patch doesn't seem to fix the bug.
> Hope you still have some ideas to address this issue :)

Well, not really.

We'll need a dump of the wakeup structure from your system I suppose.
I'll try to write a patch to get that over the weekend.

Thanks,
Rafael


--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

2013-05-03 12:16:28

by Sakkinen, Jarkko

[permalink] [raw]
Subject: Re: [Bisected] 3.7-rc1 can't resume (still present in 3.9)

Hi

On Friday, May 03, 2013 02:07:05 PM Jonas Heinrich wrote:
> On 05-03 01:29, Rafael J. Wysocki wrote:
> > On Thursday, May 02, 2013 08:32:30 PM Jonas Heinrich wrote:
> > > On 05-02 02:45, Rafael J. Wysocki wrote:
> > > > On Wednesday, May 01, 2013 11:55:10 AM H. Peter Anvin wrote:
> > > > > On 05/01/2013 11:51 AM, Jonas Heinrich wrote:
> > > > > > Well, you could give me instructions on how to debug this (I'll
> > > > > > do everything ;)) or I could ship you the Thinkpad T43. I guess
> > > > > > this would worth the effort since this bug is somehow critical.
> > > > > >
> > > > > > Best regards, Jonas
> > > > >
> > > > > I'll put together a debug patch unless I can trick Rafael into
> > > > > doing it first...
> > > >
> > > > I'm afraid that code has changed quite a bit since I looked at it
> > > > last time. [Jarkko Sakkinen seems to have worked on it lately,
> > > > CCed.]
> > > >
> > > > Jonas, I wonder what happens if you drop the first hunk of the patch
> > > > (it just uses a different register, which shouldn't matter)? Does
> > > > it still help then?
> > >
> > > Hello Rafel, first of all, thank you for helping me out :)
> > > You're right, the patch still solves the suspend bug, after removing
> > > the first hunk of the patch and applying it (see attachement:
> > > suspendfix_first_hunk_dropped.patch).
> > >
> > > > If so, there are still a few things you can do to it, e.g:
> > > > (1) drop the
> > > >
> > > > - btl $WAKEUP_BEHAVIOR_RESTORE_CR4, %edi
> > > > - jnc 1f
> > >
> > > Still works :) (used suspendfix_1.patch)
> > >
> > > > lines,
> > > > (2) drop the
> > > >
> > > > - btl $WAKEUP_BEHAVIOR_RESTORE_EFER, %edi
> > > > - jnc 1f
> > > >
> > > > lines,
> > >
> > > Still works :) (used suspendfix_2.patch)
> > >
> > > > (3) drop the
> > > >
> > > > + jecxz 1f
> > >
> > > Still works :) (used suspendfix_3.patch)
> > >
> > > > line,
> > > > (4) drop the
> > > >
> > > > + movl %eax, %ecx
> > > > + orl %edx, %ecx
> > > > + jz 1f
> > >
> > > At this point, the bug reoccurs (used suspendfix_4.patch)!
> > > But that doesn't mean these lines are the only critical, because the
> > > more minimal patch
> > >
> > > @@ -119,6 +119,9 @@
> > >
> > > jnc 1f
> > > movl pmode_efer, %eax
> > > movl pmode_efer + 4, %edx
> > >
> > > + movl %eax, %ecx
> > > + orl %edx, %ecx
> > > + jz 1f
> > >
> > > movl $MSR_EFER, %ecx
> > > wrmsr
> > >
> > > 1:
> > > with removing this part
> > >
> > > - movl pmode_cr4, %eax
> > > - movl %eax, %cr4
> > > + movl pmode_cr4, %ecx
> > > + movl %ecx, %cr4
> > >
> > > also doesn't fix the issue (see suspendfix_5.patch).
> > >
> > > > lines and see what the minimal patch needed for things to work again
> > > > is.
> > >
> > > So the most minimal working patch is suspendfix_3.patch.
> >
> > Thanks for doing that detective work!
> >
> > The only explanation of why this particular patch can help that seems
> > viable to us at the moment is that we have a memory corruption in the
> > code region modified by it and the patch simply changes the alignment of
> > the instructions that don't get corrupted.
> >
> > It looks like this may be verified by putting a bunch of nops into the
> > region in question, so can you please check if the attached patch helps
> > too?
>
> Unfortunately, your attached patch doesn't seem to fix the bug.
> Hope you still have some ideas to address this issue :)

Kind of had to experiment with this since I don't have access to
T43. Did you already try:

- EFER handling only is reverted as it was before 73201dbe.
- CR4 handling only is reverted as it was before 73201dbe.

Thanks.

/Jarkko

>
> - Jonas
>
> > Rafael

2013-05-28 21:27:48

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [Bisected] 3.7-rc1 can't resume (still present in 3.9)

On Friday, May 03, 2013 01:37:45 PM Rafael J. Wysocki wrote:
> On Friday, May 03, 2013 11:07:05 AM Jonas Heinrich wrote:
> > On 05-03 01:29, Rafael J. Wysocki wrote:
> > > On Thursday, May 02, 2013 08:32:30 PM Jonas Heinrich wrote:
> > > > On 05-02 02:45, Rafael J. Wysocki wrote:
> > > > > On Wednesday, May 01, 2013 11:55:10 AM H. Peter Anvin wrote:
> > > > > > On 05/01/2013 11:51 AM, Jonas Heinrich wrote:
> > > > > > > Well, you could give me instructions on how to debug this (I'll do
> > > > > > > everything ;)) or I could ship you the Thinkpad T43. I guess this
> > > > > > > would worth the effort since this bug is somehow critical.
> > > > > > >
> > > > > > > Best regards, Jonas
> > > > > >
> > > > > > I'll put together a debug patch unless I can trick Rafael into doing
> > > > > > it first...
> > > > >
> > > > > I'm afraid that code has changed quite a bit since I looked at it last time.
> > > > > [Jarkko Sakkinen seems to have worked on it lately, CCed.]
> > > > >
> > > > > Jonas, I wonder what happens if you drop the first hunk of the patch (it just
> > > > > uses a different register, which shouldn't matter)? Does it still help then?
> > > >
> > > > Hello Rafel, first of all, thank you for helping me out :)
> > > > You're right, the patch still solves the suspend bug, after removing the first
> > > > hunk of the patch and applying it (see attachement:
> > > > suspendfix_first_hunk_dropped.patch).
> > > >
> > > > >
> > > > > If so, there are still a few things you can do to it, e.g:
> > > > > (1) drop the
> > > > >
> > > > > - btl $WAKEUP_BEHAVIOR_RESTORE_CR4, %edi
> > > > > - jnc 1f
> > > > >
> > > >
> > > > Still works :) (used suspendfix_1.patch)
> > > >
> > > > > lines,
> > > > > (2) drop the
> > > > >
> > > > > - btl $WAKEUP_BEHAVIOR_RESTORE_EFER, %edi
> > > > > - jnc 1f
> > > > >
> > > > > lines,
> > > >
> > > > Still works :) (used suspendfix_2.patch)
> > > >
> > > > > (3) drop the
> > > > >
> > > > > + jecxz 1f
> > > > >
> > > >
> > > > Still works :) (used suspendfix_3.patch)
> > > >
> > > > > line,
> > > > > (4) drop the
> > > > >
> > > > > + movl %eax, %ecx
> > > > > + orl %edx, %ecx
> > > > > + jz 1f
> > > > >
> > > >
> > > > At this point, the bug reoccurs (used suspendfix_4.patch)!
> > > > But that doesn't mean these lines are the only critical, because the more
> > > > minimal patch
> > > >
> > > > @@ -119,6 +119,9 @@
> > > > jnc 1f
> > > > movl pmode_efer, %eax
> > > > movl pmode_efer + 4, %edx
> > > > + movl %eax, %ecx
> > > > + orl %edx, %ecx
> > > > + jz 1f
> > > > movl $MSR_EFER, %ecx
> > > > wrmsr
> > > > 1:
> > > >
> > > >
> > > > with removing this part
> > > >
> > > > - movl pmode_cr4, %eax
> > > > - movl %eax, %cr4
> > > > + movl pmode_cr4, %ecx
> > > > + movl %ecx, %cr4
> > > >
> > > > also doesn't fix the issue (see suspendfix_5.patch).
> > > >
> > > > > lines and see what the minimal patch needed for things to work again is.
> > > > >
> > > >
> > > > So the most minimal working patch is suspendfix_3.patch.
> > >
> > > Thanks for doing that detective work!
> > >
> > > The only explanation of why this particular patch can help that seems viable to
> > > us at the moment is that we have a memory corruption in the code region modified
> > > by it and the patch simply changes the alignment of the instructions that don't
> > > get corrupted.
> > >
> > > It looks like this may be verified by putting a bunch of nops into the region
> > > in question, so can you please check if the attached patch helps too?
> >
> > Unfortunately, your attached patch doesn't seem to fix the bug.
> > Hope you still have some ideas to address this issue :)
>
> Well, not really.
>
> We'll need a dump of the wakeup structure from your system I suppose.
> I'll try to write a patch to get that over the weekend.

Can you try 3.10-rc3 by the way? There is a fix in there that *might* make
a difference (although the odds are quite significant).

Thanks,
Rafael


--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

2013-07-08 10:01:28

by Jonas Heinrich

[permalink] [raw]
Subject: Re: [Bisected] 3.7-rc1 can't resume (still present in 3.9)

On 05-03 15:15, Jarkko Sakkinen wrote:
> Hi
>
> On Friday, May 03, 2013 02:07:05 PM Jonas Heinrich wrote:
> > On 05-03 01:29, Rafael J. Wysocki wrote:
> > > On Thursday, May 02, 2013 08:32:30 PM Jonas Heinrich wrote:
> > > > On 05-02 02:45, Rafael J. Wysocki wrote:
> > > > > On Wednesday, May 01, 2013 11:55:10 AM H. Peter Anvin wrote:
> > > > > > On 05/01/2013 11:51 AM, Jonas Heinrich wrote:
> > > > > > > Well, you could give me instructions on how to debug this (I'll
> > > > > > > do everything ;)) or I could ship you the Thinkpad T43. I guess
> > > > > > > this would worth the effort since this bug is somehow critical.
> > > > > > >
> > > > > > > Best regards, Jonas
> > > > > >
> > > > > > I'll put together a debug patch unless I can trick Rafael into
> > > > > > doing it first...
> > > > >
> > > > > I'm afraid that code has changed quite a bit since I looked at it
> > > > > last time. [Jarkko Sakkinen seems to have worked on it lately,
> > > > > CCed.]
> > > > >
> > > > > Jonas, I wonder what happens if you drop the first hunk of the patch
> > > > > (it just uses a different register, which shouldn't matter)? Does
> > > > > it still help then?
> > > >
> > > > Hello Rafel, first of all, thank you for helping me out :)
> > > > You're right, the patch still solves the suspend bug, after removing
> > > > the first hunk of the patch and applying it (see attachement:
> > > > suspendfix_first_hunk_dropped.patch).
> > > >
> > > > > If so, there are still a few things you can do to it, e.g:
> > > > > (1) drop the
> > > > >
> > > > > - btl $WAKEUP_BEHAVIOR_RESTORE_CR4, %edi
> > > > > - jnc 1f
> > > >
> > > > Still works :) (used suspendfix_1.patch)
> > > >
> > > > > lines,
> > > > > (2) drop the
> > > > >
> > > > > - btl $WAKEUP_BEHAVIOR_RESTORE_EFER, %edi
> > > > > - jnc 1f
> > > > >
> > > > > lines,
> > > >
> > > > Still works :) (used suspendfix_2.patch)
> > > >
> > > > > (3) drop the
> > > > >
> > > > > + jecxz 1f
> > > >
> > > > Still works :) (used suspendfix_3.patch)
> > > >
> > > > > line,
> > > > > (4) drop the
> > > > >
> > > > > + movl %eax, %ecx
> > > > > + orl %edx, %ecx
> > > > > + jz 1f
> > > >
> > > > At this point, the bug reoccurs (used suspendfix_4.patch)!
> > > > But that doesn't mean these lines are the only critical, because the
> > > > more minimal patch
> > > >
> > > > @@ -119,6 +119,9 @@
> > > >
> > > > jnc 1f
> > > > movl pmode_efer, %eax
> > > > movl pmode_efer + 4, %edx
> > > >
> > > > + movl %eax, %ecx
> > > > + orl %edx, %ecx
> > > > + jz 1f
> > > >
> > > > movl $MSR_EFER, %ecx
> > > > wrmsr
> > > >
> > > > 1:
> > > > with removing this part
> > > >
> > > > - movl pmode_cr4, %eax
> > > > - movl %eax, %cr4
> > > > + movl pmode_cr4, %ecx
> > > > + movl %ecx, %cr4
> > > >
> > > > also doesn't fix the issue (see suspendfix_5.patch).
> > > >
> > > > > lines and see what the minimal patch needed for things to work again
> > > > > is.
> > > >
> > > > So the most minimal working patch is suspendfix_3.patch.
> > >
> > > Thanks for doing that detective work!
> > >
> > > The only explanation of why this particular patch can help that seems
> > > viable to us at the moment is that we have a memory corruption in the
> > > code region modified by it and the patch simply changes the alignment of
> > > the instructions that don't get corrupted.
> > >
> > > It looks like this may be verified by putting a bunch of nops into the
> > > region in question, so can you please check if the attached patch helps
> > > too?
> >
> > Unfortunately, your attached patch doesn't seem to fix the bug.
> > Hope you still have some ideas to address this issue :)
>
> Kind of had to experiment with this since I don't have access to
> T43. Did you already try:
>
> - EFER handling only is reverted as it was before 73201dbe.
> - CR4 handling only is reverted as it was before 73201dbe.
Hi Jarkko,
thank you for your response!
Can you please be more specific about that instruction? I don't really
know what to do, sorry :/

@Rafael: Bug still present in kernel 3.10 final :(

- Jonas

>
> Thanks.
>
> /Jarkko
>
> >
> > - Jonas
> >
> > > Rafael


Attachments:
(No filename) (4.23 kB)
(No filename) (490.00 B)
Download all attachments

2013-07-08 12:56:05

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [Bisected] 3.7-rc1 can't resume (still present in 3.9)

On Monday, July 08, 2013 09:50:43 AM Jonas Heinrich wrote:
> On 05-03 15:15, Jarkko Sakkinen wrote:
> > Hi
> >
> > On Friday, May 03, 2013 02:07:05 PM Jonas Heinrich wrote:
> > > On 05-03 01:29, Rafael J. Wysocki wrote:
> > > > On Thursday, May 02, 2013 08:32:30 PM Jonas Heinrich wrote:
> > > > > On 05-02 02:45, Rafael J. Wysocki wrote:
> > > > > > On Wednesday, May 01, 2013 11:55:10 AM H. Peter Anvin wrote:
> > > > > > > On 05/01/2013 11:51 AM, Jonas Heinrich wrote:
> > > > > > > > Well, you could give me instructions on how to debug this (I'll
> > > > > > > > do everything ;)) or I could ship you the Thinkpad T43. I guess
> > > > > > > > this would worth the effort since this bug is somehow critical.
> > > > > > > >
> > > > > > > > Best regards, Jonas
> > > > > > >
> > > > > > > I'll put together a debug patch unless I can trick Rafael into
> > > > > > > doing it first...
> > > > > >
> > > > > > I'm afraid that code has changed quite a bit since I looked at it
> > > > > > last time. [Jarkko Sakkinen seems to have worked on it lately,
> > > > > > CCed.]
> > > > > >
> > > > > > Jonas, I wonder what happens if you drop the first hunk of the patch
> > > > > > (it just uses a different register, which shouldn't matter)? Does
> > > > > > it still help then?
> > > > >
> > > > > Hello Rafel, first of all, thank you for helping me out :)
> > > > > You're right, the patch still solves the suspend bug, after removing
> > > > > the first hunk of the patch and applying it (see attachement:
> > > > > suspendfix_first_hunk_dropped.patch).
> > > > >
> > > > > > If so, there are still a few things you can do to it, e.g:
> > > > > > (1) drop the
> > > > > >
> > > > > > - btl $WAKEUP_BEHAVIOR_RESTORE_CR4, %edi
> > > > > > - jnc 1f
> > > > >
> > > > > Still works :) (used suspendfix_1.patch)
> > > > >
> > > > > > lines,
> > > > > > (2) drop the
> > > > > >
> > > > > > - btl $WAKEUP_BEHAVIOR_RESTORE_EFER, %edi
> > > > > > - jnc 1f
> > > > > >
> > > > > > lines,
> > > > >
> > > > > Still works :) (used suspendfix_2.patch)
> > > > >
> > > > > > (3) drop the
> > > > > >
> > > > > > + jecxz 1f
> > > > >
> > > > > Still works :) (used suspendfix_3.patch)
> > > > >
> > > > > > line,
> > > > > > (4) drop the
> > > > > >
> > > > > > + movl %eax, %ecx
> > > > > > + orl %edx, %ecx
> > > > > > + jz 1f
> > > > >
> > > > > At this point, the bug reoccurs (used suspendfix_4.patch)!
> > > > > But that doesn't mean these lines are the only critical, because the
> > > > > more minimal patch
> > > > >
> > > > > @@ -119,6 +119,9 @@
> > > > >
> > > > > jnc 1f
> > > > > movl pmode_efer, %eax
> > > > > movl pmode_efer + 4, %edx
> > > > >
> > > > > + movl %eax, %ecx
> > > > > + orl %edx, %ecx
> > > > > + jz 1f
> > > > >
> > > > > movl $MSR_EFER, %ecx
> > > > > wrmsr
> > > > >
> > > > > 1:
> > > > > with removing this part
> > > > >
> > > > > - movl pmode_cr4, %eax
> > > > > - movl %eax, %cr4
> > > > > + movl pmode_cr4, %ecx
> > > > > + movl %ecx, %cr4
> > > > >
> > > > > also doesn't fix the issue (see suspendfix_5.patch).
> > > > >
> > > > > > lines and see what the minimal patch needed for things to work again
> > > > > > is.
> > > > >
> > > > > So the most minimal working patch is suspendfix_3.patch.
> > > >
> > > > Thanks for doing that detective work!
> > > >
> > > > The only explanation of why this particular patch can help that seems
> > > > viable to us at the moment is that we have a memory corruption in the
> > > > code region modified by it and the patch simply changes the alignment of
> > > > the instructions that don't get corrupted.
> > > >
> > > > It looks like this may be verified by putting a bunch of nops into the
> > > > region in question, so can you please check if the attached patch helps
> > > > too?
> > >
> > > Unfortunately, your attached patch doesn't seem to fix the bug.
> > > Hope you still have some ideas to address this issue :)
> >
> > Kind of had to experiment with this since I don't have access to
> > T43. Did you already try:
> >
> > - EFER handling only is reverted as it was before 73201dbe.
> > - CR4 handling only is reverted as it was before 73201dbe.
> Hi Jarkko,
> thank you for your response!
> Can you please be more specific about that instruction? I don't really
> know what to do, sorry :/
>
> @Rafael: Bug still present in kernel 3.10 final :(

That's because no one knows how to fix it. Actually, we don't even know
what the root cause of it is. Sorry about that.

Are you still able to make things work again by reverting the commit you
bisected to?

Rafael


--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

2013-07-10 21:19:37

by Christian Sünkenberg

[permalink] [raw]
Subject: Re: [Bisected] 3.7-rc1 can't resume (still present in 3.9)

Hello,

On 05/01/2013 07:33 PM, H. Peter Anvin wrote:
> On 05/01/2013 10:01 AM, Jonas Heinrich wrote:
>> Hello, I tried the newest kernel, 3.9 today but the bug is still
>> present. Applying the attached patch solves the bug for me.
>>
>> Best regards, Jonas Heinrich
>
> Okay... WTF is going on here? Does pmode_behavior just not get set up
> correctly? Since it seems you can get it to wake up with your patch,
> perhaps we can get read out the value of pmode_behavior and print it...

indeed, arch/x86/kernel/acpi/sleep.c tries an rdmsr_safe(MSR_EFER, ...)
and sets WAKEUP_BEHAVIOR_RESTORE_EFER bit on success, however,
on 90 nm Pentium M (Family 6, Model 13), reading an invalid MSR
is not guaranteed to trap, see Erratum X4 in "Intel® Pentium® M
Processor on 90 nm Process with 2-MB L2 Cache and Intel® Processor A100
and A110 on 90 nm process with 512-KB L2 Cache Specification Update".
On Jonas' T43, which has an affected Pentium M without EFER,
rdmsr_safe(MSR_EFER, ...) succeeds and WAKEUP_BEHAVIOR_RESTORE_EFER
gets set, while on resume the corresponding wrmsr traps and thus resume
fails.

The pre-3.7 code snippet incidentally catched this by not restoring
EFER when it would be restored to all 0s.

HTH,
Christian


Attachments:
smime.p7s (4.88 kB)
S/MIME Cryptographic Signature

2013-07-10 23:57:54

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [Bisected] 3.7-rc1 can't resume (still present in 3.9)

On 07/10/2013 01:52 PM, Christian Sünkenberg wrote:
> Hello,
>
> On 05/01/2013 07:33 PM, H. Peter Anvin wrote:
>> On 05/01/2013 10:01 AM, Jonas Heinrich wrote:
>>> Hello, I tried the newest kernel, 3.9 today but the bug is still
>>> present. Applying the attached patch solves the bug for me.
>>>
>>> Best regards, Jonas Heinrich
>>
>> Okay... WTF is going on here? Does pmode_behavior just not get set up
>> correctly? Since it seems you can get it to wake up with your patch,
>> perhaps we can get read out the value of pmode_behavior and print it...
>
> indeed, arch/x86/kernel/acpi/sleep.c tries an rdmsr_safe(MSR_EFER, ...)
> and sets WAKEUP_BEHAVIOR_RESTORE_EFER bit on success, however,
> on 90 nm Pentium M (Family 6, Model 13), reading an invalid MSR
> is not guaranteed to trap, see Erratum X4 in "Intel® Pentium® M
> Processor on 90 nm Process with 2-MB L2 Cache and Intel® Processor A100
> and A110 on 90 nm process with 512-KB L2 Cache Specification Update".
> On Jonas' T43, which has an affected Pentium M without EFER,
> rdmsr_safe(MSR_EFER, ...) succeeds and WAKEUP_BEHAVIOR_RESTORE_EFER
> gets set, while on resume the corresponding wrmsr traps and thus resume
> fails.
>
> The pre-3.7 code snippet incidentally catched this by not restoring
> EFER when it would be restored to all 0s.
>

That does seem like a reasonable explanation.

Does this patch fix the problem? (Comment blatantly ripped off from
your email message.)

-hpa



Attachments:
diff (0.98 kB)

2013-07-12 23:36:48

by Christian Sünkenberg

[permalink] [raw]
Subject: Re: [Bisected] 3.7-rc1 can't resume (still present in 3.9)

Hello,

On 07/11/2013 01:57 AM, H. Peter Anvin wrote:
> On 07/10/2013 01:52 PM, Christian Sünkenberg wrote:
>> Hello,
>>
>> On 05/01/2013 07:33 PM, H. Peter Anvin wrote:
>>> On 05/01/2013 10:01 AM, Jonas Heinrich wrote:
>>>> Hello, I tried the newest kernel, 3.9 today but the bug is still
>>>> present. Applying the attached patch solves the bug for me.
>>>>
>>>> Best regards, Jonas Heinrich
>>>
>>> Okay... WTF is going on here? Does pmode_behavior just not get set up
>>> correctly? Since it seems you can get it to wake up with your patch,
>>> perhaps we can get read out the value of pmode_behavior and print it...
>>
>> indeed, arch/x86/kernel/acpi/sleep.c tries an rdmsr_safe(MSR_EFER, ...)
>> and sets WAKEUP_BEHAVIOR_RESTORE_EFER bit on success, however,
>> on 90 nm Pentium M (Family 6, Model 13), reading an invalid MSR
>> is not guaranteed to trap, see Erratum X4 in "Intel® Pentium® M
>> Processor on 90 nm Process with 2-MB L2 Cache and Intel® Processor A100
>> and A110 on 90 nm process with 512-KB L2 Cache Specification Update".
>> On Jonas' T43, which has an affected Pentium M without EFER,
>> rdmsr_safe(MSR_EFER, ...) succeeds and WAKEUP_BEHAVIOR_RESTORE_EFER
>> gets set, while on resume the corresponding wrmsr traps and thus resume
>> fails.
>>
>> The pre-3.7 code snippet incidentally catched this by not restoring
>> EFER when it would be restored to all 0s.
>>
>
> That does seem like a reasonable explanation.
>
> Does this patch fix the problem? (Comment blatantly ripped off from
> your email message.)

Jonas tried your patch and it fixes suspend/resume on his T43, although
IMHO the safest approach would be to just add an exception for
Vendor==Intel && Family==6 && Model==13, or more generally Vendor==Intel
&& !supports_long_mode, as the same erratum also warns about wrmsr
possibly not triggering a GP either.
Anyways, at least on this specific MSR with the Pentium M Jonas tested,
it behaved correctly on every try, so I'd say your patch does the trick,
thank you very much!

As a side note, I found a similar erratum #33 in "Pentium® Processor
Specification Update" for Intel P54C (Family 5, Model 2), which would,
supposed there are P54C systems with ACPI sleep/resume support, result
in MSR 0 (P5_MC_ADDR) to be saved and restored instead of nonexistent EFER.

Kind regards,
Christian


Attachments:
smime.p7s (4.88 kB)
S/MIME Cryptographic Signature

2013-07-12 23:46:26

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [Bisected] 3.7-rc1 can't resume (still present in 3.9)

On 07/12/2013 04:36 PM, Christian Sünkenberg wrote:
>
> Jonas tried your patch and it fixes suspend/resume on his T43, although
> IMHO the safest approach would be to just add an exception for
> Vendor==Intel && Family==6 && Model==13, or more generally Vendor==Intel
> && !supports_long_mode, as the same erratum also warns about wrmsr
> possibly not triggering a GP either.
> Anyways, at least on this specific MSR with the Pentium M Jonas tested,
> it behaved correctly on every try, so I'd say your patch does the trick,
> thank you very much!
>

Using vendor matches is not really a great way to deal with things that
can better be handled analytically.

If WRMSR doesn't fault, it is not a problem...

> As a side note, I found a similar erratum #33 in "Pentium® Processor
> Specification Update" for Intel P54C (Family 5, Model 2), which would,
> supposed there are P54C systems with ACPI sleep/resume support, result
> in MSR 0 (P5_MC_ADDR) to be saved and restored instead of nonexistent EFER.

Doesn't really matter, as we'd only read that one after an #MC.

-hpa

Subject: [tip:x86/urgent] x86, suspend: Handle CPUs which fail to #GP on RDMSR

Commit-ID: 3a783f6e39cc6c89da8846312f29ca47affaa470
Gitweb: http://git.kernel.org/tip/3a783f6e39cc6c89da8846312f29ca47affaa470
Author: H. Peter Anvin <[email protected]>
AuthorDate: Fri, 12 Jul 2013 16:48:12 -0700
Committer: H. Peter Anvin <[email protected]>
CommitDate: Fri, 12 Jul 2013 16:48:12 -0700

x86, suspend: Handle CPUs which fail to #GP on RDMSR

There are CPUs which have errata causing RDMSR of a nonexistent MSR to
not fault. We would then try to WRMSR to restore the value of that
MSR, causing a crash. Specifically, some Pentium M variants would
have this problem trying to save and restore the non-existent EFER,
causing a crash on resume.

Work around this by making sure we can write back the result at
suspend time.

Huge thanks to Christian Sünkenberg for finding the offending erratum
that finally deciphered the mystery.

Reported-and-tested-by: Johan Heinrich <[email protected]>
Debugged-by: Christian Sünkenberg <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Cc: <[email protected]> # v3.7+
---
arch/x86/kernel/acpi/sleep.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/acpi/sleep.c b/arch/x86/kernel/acpi/sleep.c
index 2a34aaf..3312010 100644
--- a/arch/x86/kernel/acpi/sleep.c
+++ b/arch/x86/kernel/acpi/sleep.c
@@ -48,9 +48,20 @@ int x86_acpi_suspend_lowlevel(void)
#ifndef CONFIG_64BIT
native_store_gdt((struct desc_ptr *)&header->pmode_gdt);

+ /*
+ * We have to check that we can write back the value, and not
+ * just read it. At least on 90 nm Pentium M (Family 6, Model
+ * 13), reading an invalid MSR is not guaranteed to trap, see
+ * Erratum X4 in "Intel Pentium M Processor on 90 nm Process
+ * with 2-MB L2 Cache and Intel® Processor A100 and A110 on 90
+ * nm process with 512-KB L2 Cache Specification Update".
+ */
if (!rdmsr_safe(MSR_EFER,
&header->pmode_efer_low,
- &header->pmode_efer_high))
+ &header->pmode_efer_high) &&
+ !wrmsr_safe(MSR_EFER,
+ header->pmode_efer_low,
+ header->pmode_efer_high))
header->pmode_behavior |= (1 << WAKEUP_BEHAVIOR_RESTORE_EFER);
#endif /* !CONFIG_64BIT */

@@ -61,7 +72,10 @@ int x86_acpi_suspend_lowlevel(void)
}
if (!rdmsr_safe(MSR_IA32_MISC_ENABLE,
&header->pmode_misc_en_low,
- &header->pmode_misc_en_high))
+ &header->pmode_misc_en_high) &&
+ !wrmsr_safe(MSR_IA32_MISC_ENABLE,
+ header->pmode_misc_en_low,
+ header->pmode_misc_en_high))
header->pmode_behavior |=
(1 << WAKEUP_BEHAVIOR_RESTORE_MISC_ENABLE);
header->realmode_flags = acpi_realmode_flags;

Subject: [tip:x86/urgent] x86, suspend: Handle CPUs which fail to #GP on RDMSR

Commit-ID: 5ff560fd48d5b3d82fa0c3aff625c9da1a301911
Gitweb: http://git.kernel.org/tip/5ff560fd48d5b3d82fa0c3aff625c9da1a301911
Author: H. Peter Anvin <[email protected]>
AuthorDate: Fri, 12 Jul 2013 16:48:12 -0700
Committer: H. Peter Anvin <[email protected]>
CommitDate: Mon, 15 Jul 2013 13:50:54 -0700

x86, suspend: Handle CPUs which fail to #GP on RDMSR

There are CPUs which have errata causing RDMSR of a nonexistent MSR to
not fault. We would then try to WRMSR to restore the value of that
MSR, causing a crash. Specifically, some Pentium M variants would
have this problem trying to save and restore the non-existent EFER,
causing a crash on resume.

Work around this by making sure we can write back the result at
suspend time.

Huge thanks to Christian Sünkenberg for finding the offending erratum
that finally deciphered the mystery.

Reported-and-tested-by: Johan Heinrich <[email protected]>
Debugged-by: Christian Sünkenberg <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Cc: <[email protected]> # v3.7+
---
arch/x86/kernel/acpi/sleep.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/acpi/sleep.c b/arch/x86/kernel/acpi/sleep.c
index 2a34aaf..3312010 100644
--- a/arch/x86/kernel/acpi/sleep.c
+++ b/arch/x86/kernel/acpi/sleep.c
@@ -48,9 +48,20 @@ int x86_acpi_suspend_lowlevel(void)
#ifndef CONFIG_64BIT
native_store_gdt((struct desc_ptr *)&header->pmode_gdt);

+ /*
+ * We have to check that we can write back the value, and not
+ * just read it. At least on 90 nm Pentium M (Family 6, Model
+ * 13), reading an invalid MSR is not guaranteed to trap, see
+ * Erratum X4 in "Intel Pentium M Processor on 90 nm Process
+ * with 2-MB L2 Cache and Intel® Processor A100 and A110 on 90
+ * nm process with 512-KB L2 Cache Specification Update".
+ */
if (!rdmsr_safe(MSR_EFER,
&header->pmode_efer_low,
- &header->pmode_efer_high))
+ &header->pmode_efer_high) &&
+ !wrmsr_safe(MSR_EFER,
+ header->pmode_efer_low,
+ header->pmode_efer_high))
header->pmode_behavior |= (1 << WAKEUP_BEHAVIOR_RESTORE_EFER);
#endif /* !CONFIG_64BIT */

@@ -61,7 +72,10 @@ int x86_acpi_suspend_lowlevel(void)
}
if (!rdmsr_safe(MSR_IA32_MISC_ENABLE,
&header->pmode_misc_en_low,
- &header->pmode_misc_en_high))
+ &header->pmode_misc_en_high) &&
+ !wrmsr_safe(MSR_IA32_MISC_ENABLE,
+ header->pmode_misc_en_low,
+ header->pmode_misc_en_high))
header->pmode_behavior |=
(1 << WAKEUP_BEHAVIOR_RESTORE_MISC_ENABLE);
header->realmode_flags = acpi_realmode_flags;

2013-07-15 21:12:16

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [Bisected] 3.7-rc1 can't resume (still present in 3.9)

On 07/08/2013 06:05 AM, Rafael J. Wysocki wrote:
>
> That's because no one knows how to fix it. Actually, we don't even know
> what the root cause of it is. Sorry about that.
>
> Are you still able to make things work again by reverting the commit you
> bisected to?
>

Well, we know now. Hopefully we should have the fix in 3.11-rc2 and
then on its way to stable.

-hpa