2005-05-04 05:01:29

by Andrea Arcangeli

[permalink] [raw]
Subject: avoid infinite loop in x86_64 interrupt return

Hello,

A few minutes ago I've got an unkillable task in R state with vanilla
2.6.12-rc3 on x86_64, luckily system was still up with the other cpu (on
the desktop, so I had no kgdb environment set). Debugging revelaed rdi
corrupt when entering retint_signal (not set to $_TIF_WORK_MASK as
expected). This lead the rdx&rdi to return 0x20000 -> infinite loop.
Precisely rdi is set to ffff810030923f58 instead of $_TIF_WORK_MASK. So
it was the combination of ...2xxxx as rsp with TIF_IA32 in the task
flags. After noticing the rdi screwup the bug was quite clear: rdi was
set to pt_regs instead of $_TIF_WORK_MASK. Of course rsp is set to
ffff810030923f58 too (which also means do_notify_resume didn't clobber
rdi even if it could).

The below should fix the problem, I've no idea how to reproduce the
problem but it works on basic testing. The task looping was java (32bit,
that's where the 0x20000 come from), but it wasn't me starting java, it
must have been some random website (java was hanging around with 100%
system time for half an hour once I noticed it).

Signed-off-by: Andrea Arcangeli <[email protected]>

--- 2.6.12-rc3/arch/x86_64/kernel/entry.S.orig 2005-05-04 06:47:02.000000000 +0200
+++ 2.6.12-rc3/arch/x86_64/kernel/entry.S 2005-05-04 06:50:34.000000000 +0200
@@ -489,6 +489,7 @@ retint_signal:
movq %rsp,%rdi # &pt_regs
call do_notify_resume
RESTORE_REST
+ movl $_TIF_WORK_MASK,%edi
cli
GET_THREAD_INFO(%rcx)
jmp retint_check


2005-05-04 09:00:30

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: avoid infinite loop in x86_64 interrupt return

Hi,

On Wednesday, 4 of May 2005 07:01, Andrea Arcangeli wrote:
> Hello,
>
> A few minutes ago I've got an unkillable task in R state with vanilla
> 2.6.12-rc3 on x86_64, luckily system was still up with the other cpu (on
> the desktop, so I had no kgdb environment set). Debugging revelaed rdi
> corrupt when entering retint_signal (not set to $_TIF_WORK_MASK as
> expected). This lead the rdx&rdi to return 0x20000 -> infinite loop.
> Precisely rdi is set to ffff810030923f58 instead of $_TIF_WORK_MASK. So
> it was the combination of ...2xxxx as rsp with TIF_IA32 in the task
> flags. After noticing the rdi screwup the bug was quite clear: rdi was
> set to pt_regs instead of $_TIF_WORK_MASK. Of course rsp is set to
> ffff810030923f58 too (which also means do_notify_resume didn't clobber
> rdi even if it could).
>
> The below should fix the problem, I've no idea how to reproduce the
> problem but it works on basic testing. The task looping was java (32bit,
> that's where the 0x20000 come from), but it wasn't me starting java, it
> must have been some random website (java was hanging around with 100%
> system time for half an hour once I noticed it).
>
> Signed-off-by: Andrea Arcangeli <[email protected]>
>
> --- 2.6.12-rc3/arch/x86_64/kernel/entry.S.orig 2005-05-04 06:47:02.000000000 +0200
> +++ 2.6.12-rc3/arch/x86_64/kernel/entry.S 2005-05-04 06:50:34.000000000 +0200
> @@ -489,6 +489,7 @@ retint_signal:
> movq %rsp,%rdi # &pt_regs
> call do_notify_resume
> RESTORE_REST
> + movl $_TIF_WORK_MASK,%edi
> cli
> GET_THREAD_INFO(%rcx)
> jmp retint_check
> -

You also need to add two missing clis. Please have a look at the attached
patch from Andi.

Greets,
Rafael


--
- Would you tell me, please, which way I ought to go from here?
- That depends a good deal on where you want to get to.
-- Lewis Carroll "Alice's Adventures in Wonderland"


Attachments:
(No filename) (1.83 kB)
2.6.12-rc3-unkillable-java-ak.patch (803.00 B)
Download all attachments

2005-05-04 13:22:18

by Andi Kleen

[permalink] [raw]
Subject: Re: avoid infinite loop in x86_64 interrupt return


It is already fixed in mainline. Actually I think it was a merging
problem I had actually fixed it before the last merge in my tree, but
some patches got lost :/

-Andi

2005-05-04 13:31:20

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: avoid infinite loop in x86_64 interrupt return

On Wed, May 04, 2005 at 11:00:32AM +0200, Rafael J. Wysocki wrote:
> You also need to add two missing clis. Please have a look at the attached
> patch from Andi.

Those two clis seems unrelated, so I don't see why I should mix them in
the same patch. I couldn't trigger the other problems, only the one with
rdi corruption.

Note that those clis seems superflous, cli is only needed before swapgs,
so adding cli before swapgs in retint_swapgs sounds a better idea than
to keep irq off for a longer time for no apparent good reason. But I've
no real idea why those cli are needed so I guess I must be missing
something. there's no commentary attached to your patch that can exlain
why the cli are needed _way_ before calling swapgs.

2005-05-04 13:32:49

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: avoid infinite loop in x86_64 interrupt return

On Wed, May 04, 2005 at 03:22:15PM +0200, Andi Kleen wrote:
>
> It is already fixed in mainline. Actually I think it was a merging
> problem I had actually fixed it before the last merge in my tree, but
> some patches got lost :/

Ok thanks. I'm still not in sync with git so I didn't notice, I looked
into mercurial instead of git infact.

2005-05-04 18:24:38

by Andi Kleen

[permalink] [raw]
Subject: Re: avoid infinite loop in x86_64 interrupt return

On Wed, May 04, 2005 at 03:31:29PM +0200, Andrea Arcangeli wrote:
> On Wed, May 04, 2005 at 11:00:32AM +0200, Rafael J. Wysocki wrote:
> > You also need to add two missing clis. Please have a look at the attached
> > patch from Andi.
>
> Those two clis seems unrelated, so I don't see why I should mix them in
> the same patch. I couldn't trigger the other problems, only the one with
> rdi corruption.

THere was a second patch which essentially got the line you posted
together with the missing clis.

I originally fixed it in a slightly different way (in the version
that got lost), but this one was equivalent.

>
> Note that those clis seems superflous, cli is only needed before swapgs,
> so adding cli before swapgs in retint_swapgs sounds a better idea than
> to keep irq off for a longer time for no apparent good reason. But I've
> no real idea why those cli are needed so I guess I must be missing
> something. there's no commentary attached to your patch that can exlain
> why the cli are needed _way_ before calling swapgs.

To avoid losing schedule events and signals. Between checking for them
and returning to user space interrupts need to be off. When they are
reenabled everything needs to be rechecked.

-Andi

2005-05-04 18:34:09

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: avoid infinite loop in x86_64 interrupt return

On Wed, May 04, 2005 at 08:21:43PM +0200, Andi Kleen wrote:
> To avoid losing schedule events and signals. Between checking for them
> and returning to user space interrupts need to be off. When they are
> reenabled everything needs to be rechecked.

All right good point, the latency issues, they'd wait the timeslice to
expire instead of being processed immediatly. That couldn't be noticed
without some measurement but it's sure worth fixing agreed, thanks!