2005-04-05 06:56:05

by Ingo Molnar

[permalink] [raw]
Subject: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC


the crashes below happen when PAGEALLOC is enabled. It's this
instruction:

movb OLDSS(%esp), %ah

OLDSS is 0x38, esp is f4f83fc8, OLDSS(%esp) is thus f4f84000, which
correctly creates the PAGEALLOC pagefault. esp is off by 4 bytes?

it could be the ESP-16-bit-corruption patch causing this, or it could be
an already existing latent bug getting triggered now: normally only iret
accesses the OLDSS, and we fix any iret faults up, but now that we
explicitly access %esp the esp bug shows up.

so it would be nice to understand why this triggers. It seems to be a
sporadic event - first it hit hotplug, then input.agent. If i disable
PAGEALLOC the system boots up fine. In any case, the ESP-corruption
patch is not safe until this bug is understood, as it right now may read
a random byte off the next page, and possibly doing bogus calls to the
16-bit-fixup code.

Ingo

-------------

BUG: Unable to handle kernel paging request at virtual address f4f84000
printing eip:
c010287c
*pde = 00527067
*pte = 34f84000
Oops: 0000 [#1]
PREEMPT DEBUG_PAGEALLOC
Modules linked in:
CPU: 0
EIP: 0060:[<c010287c>] Not tainted VLI
EFLAGS: 00010046 (2.6.12-rc2-RT-V0.7.43-09)
EIP is at restore_all+0x4/0x18
eax: 00000206 ebx: 00000000 ecx: 00000000 edx: 00000001
esi: 00000000 edi: 009b63f9 ebp: f4f82000 esp: f4f83fc8
ds: 007b es: 007b ss: 0068 preempt: 00000001
Process 10-udev.hotplug (pid: 1264, threadinfo=f4f82000 task=f5034a10)
Stack: 00000000 bfa71dd0 009c0ffc 00000000 009b63f9 bfa71d44 000000c5 0000007b
0000007b ffffffef c01027ba 00000060 00000206 0000007b
Call Trace:
[<c01036ac>] show_stack+0x7a/0x90 (32)
[<c0103835>] show_registers+0x15a/0x1d2 (56)
[<c0103a30>] die+0xf4/0x17e (68)
[<c010f444>] do_page_fault+0x3de/0x60a (212)
[<c01032eb>] error_code+0x4f/0x54 (-8076)

---------------------

BUG: Unable to handle kernel paging request at virtual address f57bc000
printing eip:
c010287c
*pde = 00529067
*pte = 357bc000
Oops: 0000 [#1]
PREEMPT DEBUG_PAGEALLOC
Modules linked in:
CPU: 0
EIP: 0060:[<c010287c>] Not tainted VLI
EFLAGS: 00010046 (2.6.12-rc2-RT-V0.7.43-09)
EIP is at restore_all+0x4/0x18
eax: 00000206 ebx: b7f11000 ecx: 00000000 edx: 00000000
esi: 080e4f28 edi: 00000000 ebp: f57ba000 esp: f57bbfc8
ds: 007b es: 007b ss: 0068 preempt: 00000001
Process input.agent (pid: 1131, threadinfo=f57ba000 task=f57b9a10)
Stack: b7f11000 00001000 009c0ffc 080e4f28 00000000 bfc112c0 0000005b 0000007b
0000007b ffffff00 c01027ba 00000060 00000206 0000007b
Call Trace:
[<c01036ac>] show_stack+0x7a/0x90 (32)
[<c0103835>] show_registers+0x15a/0x1d2 (56)
[<c0103a30>] die+0xf4/0x17e (68)
[<c010f474>] do_page_fault+0x3de/0x60a (212)
[<c01032eb>] error_code+0x4f/0x54 (-8076)


2005-04-05 07:03:32

by Andrew Morton

[permalink] [raw]
Subject: Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC

Ingo Molnar <[email protected]> wrote:
>
> the crashes below happen when PAGEALLOC is enabled. It's this
> instruction:
>
> movb OLDSS(%esp), %ah
>
> OLDSS is 0x38, esp is f4f83fc8, OLDSS(%esp) is thus f4f84000, which
> correctly creates the PAGEALLOC pagefault. esp is off by 4 bytes?
>
> it could be the ESP-16-bit-corruption patch causing this,

Do you have nmis enabled?



From: Akinobu Mita <[email protected]>

With nmi_watchdog=1, I got random Oopses (Unable to handle kernel paging
request, not by the NMI oopser) from many processes. It is not happend
with -rc1.

The following change fixes this problem.

Signed-off-by: Andrew Morton <[email protected]>
---

25-akpm/arch/i386/kernel/entry.S | 2 +-
1 files changed, 1 insertion(+), 1 deletion(-)

diff -puN arch/i386/kernel/entry.S~nmi_stack_correct-fix arch/i386/kernel/entry.S
--- 25/arch/i386/kernel/entry.S~nmi_stack_correct-fix 2005-04-05 00:02:48.000000000 -0700
+++ 25-akpm/arch/i386/kernel/entry.S 2005-04-05 00:02:48.000000000 -0700
@@ -550,7 +550,7 @@ nmi_stack_correct:
xorl %edx,%edx # zero error code
movl %esp,%eax # pt_regs pointer
call do_nmi
- jmp restore_all
+ jmp restore_nocheck

nmi_stack_fixup:
FIX_STACK(12,nmi_stack_correct, 1)
_

2005-04-05 07:14:56

by Ingo Molnar

[permalink] [raw]
Subject: Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC


* Ingo Molnar <[email protected]> wrote:

> it could be the ESP-16-bit-corruption patch causing this, or it could
> be an already existing latent bug getting triggered now: normally only
> iret accesses the OLDSS, and we fix any iret faults up, but now that
> we explicitly access %esp the esp bug shows up.

update: it's a latent condition being exposed by the ESP-corruption-fix
patch. I reverted the ESP patch from 2.6.12-rc2, and applied the patch
below, and it produces the same sort of crashes (attached further
below).

Ingo

--- linux/arch/i386/kernel/entry.S.orig
+++ linux/arch/i386/kernel/entry.S
@@ -123,6 +123,7 @@ VM_MASK = 0x00020000


#define RESTORE_ALL \
+ movl OLDSS(%esp), %eax; \
RESTORE_REGS \
addl $4, %esp; \
1: iret; \

------------->
Freeing unused kernel memory: 188k freed
Unable to handle kernel paging request at virtual address f5bda000
printing eip:
c0102874
*pde = 00517067
*pte = 35bda000
Oops: 0000 [#1]
PREEMPT DEBUG_PAGEALLOC
Modules linked in:
CPU: 0
EIP: 0060:[<c0102874>] Not tainted VLI
EFLAGS: 00010046 (2.6.12-rc2)
EIP is at restore_all+0x0/0x14
eax: 00000260 ebx: 00000000 ecx: 00000000 edx: 00000001
esi: 00000000 edi: 009c0ffc ebp: f5bd8000 esp: f5bd9fc8
ds: 007b es: 007b ss: 0068
Process default.hotplug (pid: 1124, threadinfo=f5bd8000 task=f5999b10)
Stack: 00000000 00000003 00000000 00000000 009c0ffc bf96dd58 000000dd 0000007b
0000007b ffffff00 c01027ba 00000060 00000246 0000007b
Call Trace:
[<c01035dc>] show_stack+0x7a/0x90
[<c0103758>] show_registers+0x14d/0x1c5
[<c0103956>] die+0xf4/0x186
[<c010f15d>] do_page_fault+0x430/0x649
[<c0103283>] error_code+0x2b/0x30
Code: 45 08 81 01 75 7d 3d 21 01 00 00 0f 83 da 00 00 00 ff 14 85 00 39 40 c0 89 44 24 18 fa 8b 4d 08 66 f7 c1 ff fe 0f 85 80 00 00 00 <8b> 44 24 38 5b 59 5a 5e 5f 5d 58 1f 07 83 c4 04 cf 8d 76 00 f6
<1>Unable to handle kernel paging request at virtual address f50a8000
printing eip:
c0102874
*pde = 00515067
*pte = 350a8000
Oops: 0000 [#2]
PREEMPT DEBUG_PAGEALLOC
Modules linked in:
CPU: 0
EIP: 0060:[<c0102874>] Not tainted VLI
EFLAGS: 00010046 (2.6.12-rc2)
EIP is at restore_all+0x0/0x14
eax: 00000260 ebx: 00000003 ecx: 00000000 edx: 00000001
esi: 00000008 edi: 009c0ffc ebp: f50a6000 esp: f50a7fc8
ds: 007b es: 007b ss: 0068
Process default.hotplug (pid: 1181, threadinfo=f50a6000 task=f55f9b10)
Stack: 00000003 bff9500c bff94f7c 00000008 009c0ffc bff94f60 000000ae 0000007b
0000007b ffffff00 c01027ba 00000060 00000286 0000007b
Call Trace:
[<c01035dc>] show_stack+0x7a/0x90
[<c0103758>] show_registers+0x14d/0x1c5
[<c0103956>] die+0xf4/0x186
[<c010f15d>] do_page_fault+0x430/0x649
[<c0103283>] error_code+0x2b/0x30
Code: 45 08 81 01 75 7d 3d 21 01 00 00 0f 83 da 00 00 00 ff 14 85 00 39 40 c0 89 44 24 18 fa 8b 4d 08 66 f7 c1 ff fe 0f 85 80 00 00 00 <8b> 44 24 38 5b 59 5a 5e 5f 5d 58 1f 07 83 c4 04 cf 8d 76 00 f6
<1>Unable to handle kernel paging request at virtual address f553e000
printing eip:
c0102874
*pde = 00516067
*pte = 3553e000
Oops: 0000 [#3]
PREEMPT DEBUG_PAGEALLOC
Modules linked in:
CPU: 0
EIP: 0060:[<c0102874>] Not tainted VLI
EFLAGS: 00010046 (2.6.12-rc2)
EIP is at restore_all+0x0/0x14
eax: 00000260 ebx: 00000000 ecx: 00000000 edx: 00000001
esi: 00000008 edi: 009c0ffc ebp: f553c000 esp: f553dfc8
ds: 007b es: 007b ss: 0068
Process default.hotplug (pid: 1113, threadinfo=f553c000 task=f5a75b10)
Stack: 00000000 00000000 080e1f3c 00000008 009c0ffc bf856de0 000000af 0000007b
0000007b ffffffef c01027ba 00000060 00000246 0000007b
Call Trace:
[<c01035dc>] show_stack+0x7a/0x90
[<c0103758>] show_registers+0x14d/0x1c5
[<c0103956>] die+0xf4/0x186
[<c010f15d>] do_page_fault+0x430/0x649
[<c0103283>] error_code+0x2b/0x30
Code: 45 08 81 01 75 7d 3d 21 01 00 00 0f 83 da 00 00 00 ff 14 85 00 39 40 c0 89 44 24 18 fa 8b 4d 08 66 f7 c1 ff fe 0f 85 80 00 00 00 <8b> 44 24 38 5b 59 5a 5e 5f 5d 58 1f 07 83 c4 04 cf 8d 76 00 f6

2005-04-05 07:15:00

by Ingo Molnar

[permalink] [raw]
Subject: Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC


* Andrew Morton <[email protected]> wrote:

> Do you have nmis enabled?

yeah ...

> call do_nmi
> - jmp restore_all
> + jmp restore_nocheck

and i was about to take a closer look at the NMI path :-)

Ingo

2005-04-05 07:17:49

by Ingo Molnar

[permalink] [raw]
Subject: Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC


* Andrew Morton <[email protected]> wrote:

> From: Akinobu Mita <[email protected]>
>
> With nmi_watchdog=1, I got random Oopses (Unable to handle kernel
> paging request, not by the NMI oopser) from many processes. It is not
> happend with -rc1.
>
> The following change fixes this problem.

this fixed my crashes too.

Ingo

2005-04-05 07:38:20

by Ingo Molnar

[permalink] [raw]
Subject: Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC


* Ingo Molnar <[email protected]> wrote:

>
> * Andrew Morton <[email protected]> wrote:
>
> > From: Akinobu Mita <[email protected]>
> >
> > With nmi_watchdog=1, I got random Oopses (Unable to handle kernel
> > paging request, not by the NMI oopser) from many processes. It is not
> > happend with -rc1.
> >
> > The following change fixes this problem.
>

> this fixed my crashes too.

spoke too soon - they still trigger even with the patch applied.

Ingo

Unable to handle kernel paging request at virtual address f543c000
printing eip:
c010287c
*pde = 00516067
*pte = 3543c000
Oops: 0000 [#1]
PREEMPT DEBUG_PAGEALLOC
Modules linked in:
CPU: 0
EIP: 0060:[<c010287c>] Not tainted VLI
EFLAGS: 00010046 (2.6.12-rc2)
EIP is at restore_all+0x4/0x18
eax: 00000202 ebx: 009b63f9 ecx: 00000000 edx: 00000001
esi: 009b63f9 edi: 00000001 ebp: f543a000 esp: f543bfc8
ds: 007b es: 007b ss: 0068
Process udevd (pid: 1044, threadinfo=f543a000 task=f5412b10)
Stack: 009b63f9 00000000 000001b6 009b63f9 00000001 bf8c503c 00000005 0000007b
0000007b ffffff00 c01027ba 00000060 00000202 0000007b
Call Trace:
[<c010368c>] show_stack+0x7a/0x90
[<c0103808>] show_registers+0x14d/0x1c5
[<c0103a06>] die+0xf4/0x186
[<c010f2ad>] do_page_fault+0x430/0x649
[<c01032eb>] error_code+0x4f/0x54
Code: 00 00 3d 21 01 00 00 0f 83 1a 01 00 00 ff 14 85 00 39 40 c0 89 44 24 18 fa 8b 4d 08 66 f7 c1 ff fe 0f 85 c0 00 00 00 8b 44 24 30 <8a> 64 24 38 8a 44 24 2c 25 03 04 02 00 3d 03 04 00 00 74 0d 5b
printing eip:
*EIP: 0060:[<c010287c>] Not tainted VLI
EFLAGS: 00010046 (2.6.12-rc2)
EIP is at restore_all+0x4/0x18
eax: 00000206 ebx: 080d1502 ecx: 00000000 edx: 00000001
esi: 080ec058 edi: 080e8b6e ebp: f5e38000 esp: f5e39fc8
ds: 007b es: 007b ss: 0068
Process rc.sysinit (pid: 3032, threadinfo=f5e38000 task=f5913b10)
Stack: 080d1502 bff73ff0 009c0ffc 080ec058 080e8b6e bff73fb4 000000c3 0000007b
0000007b ffffff00 c01027ba 00000060 00000206 0000007b
Call Trace:
[<c010368c>] show_stack+0x7a/0x90
[<c0103808>] show_registers+0x14d/0x1c5
[<c0103a06>] die+0xf4/0x186
[<c010f2ad>] do_page_fault+0x430/0x649
[<c01032eb>] error_code+0x4f/0x54
Code: 00 00 3d 21 01 00 00 0f 83 1a 01 00 00 ff 14 85 00 39 40 c0 89 44 24 18 fa 8b 4d 08 66 f7 c1 ff fe 0f 85 c0 00 00 00 8b 44 24 30 <8a> 64 24 38 8a 44 24 2c 25 03 04 02 00 3d 03 04 00 00 74 0d 5b
<6>EXT3 FS on hdb1, internal journal
cdrom: open failed.
Adding 1052216k swap on /dev/hdb2. Priority:-1 extents:1
eth1: link up, 100Mbps, full-duplex, lpa 0x45E1

2005-04-05 08:05:06

by Ingo Molnar

[permalink] [raw]
Subject: Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC


* Ingo Molnar <[email protected]> wrote:

> > this fixed my crashes too.
>
> spoke too soon - they still trigger even with the patch applied.

the patch below fixes the crash, it was related to CONFIG_PREEMPT.

Ingo

--
fix entry.S crash with PREEMPT+PAGEALLOC

Signed-off-by: Ingo Molnar <[email protected]>

--- linux/arch/i386/kernel/entry.S.orig
+++ linux/arch/i386/kernel/entry.S
@@ -165,9 +165,9 @@ ENTRY(resume_kernel)
need_resched:
movl TI_flags(%ebp), %ecx # need_resched set ?
testb $_TIF_NEED_RESCHED, %cl
- jz restore_all
+ jz restore_nocheck
testl $IF_MASK,EFLAGS(%esp) # interrupts off (exception path) ?
- jz restore_all
+ jz restore_nocheck
call preempt_schedule_irq
jmp need_resched
#endif

2005-04-05 09:57:29

by Mikael Pettersson

[permalink] [raw]
Subject: Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC

Ingo Molnar writes:
>
> * Ingo Molnar <[email protected]> wrote:
>
> > > this fixed my crashes too.
> >
> > spoke too soon - they still trigger even with the patch applied.
>
> the patch below fixes the crash, it was related to CONFIG_PREEMPT.
>
> Ingo
>
> --
> fix entry.S crash with PREEMPT+PAGEALLOC
>
> Signed-off-by: Ingo Molnar <[email protected]>
>
> --- linux/arch/i386/kernel/entry.S.orig
> +++ linux/arch/i386/kernel/entry.S
> @@ -165,9 +165,9 @@ ENTRY(resume_kernel)
> need_resched:
> movl TI_flags(%ebp), %ecx # need_resched set ?
> testb $_TIF_NEED_RESCHED, %cl
> - jz restore_all
> + jz restore_nocheck
> testl $IF_MASK,EFLAGS(%esp) # interrupts off (exception path) ?
> - jz restore_all
> + jz restore_nocheck
> call preempt_schedule_irq
> jmp need_resched
> #endif

Is this sufficient or do we also need the s/restore_all/restore_nocheck/
at around line 553 which was in the first posted patch?

/Mikael

2005-04-05 18:17:50

by Ingo Molnar

[permalink] [raw]
Subject: Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC


* Mikael Pettersson <[email protected]> wrote:

> > - jz restore_all
> > + jz restore_nocheck
> > testl $IF_MASK,EFLAGS(%esp) # interrupts off (exception path) ?
> > - jz restore_all
> > + jz restore_nocheck
> > call preempt_schedule_irq
> > jmp need_resched
> > #endif
>
> Is this sufficient or do we also need the
> s/restore_all/restore_nocheck/ at around line 553 which was in the
> first posted patch?

all 3 restore_all->restore_nocheck changes are needed to make the bug
not trigger. I dont think it's a real fix because whatever is being
preempted involuntarily, it ought to have a proper kernel stack to begin
with. I'll do some more debugging.

Ingo

2005-04-05 19:17:27

by Stas Sergeev

[permalink] [raw]
Subject: Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC

Hi Ingo et all.

Ingo Molnar wrote:
> the crashes below happen when PAGEALLOC is enabled. It's this
> instruction:
> movb OLDSS(%esp), %ah
I am really sorry about that screwup :(
I can't do too much right now as I am
reading the mail in a batch mode, and
the next time I'll be reading it will
be 24 hours from now.

Attached is a quick fix, which I'll be
testing to death tomorrow at work.
I had DEBUG_PAGEALLOC disabled, so I
haven't noticed that stupid bug while
optimizing my checks...
Let me know how it goes.


Attachments:
entry.S.diff (596.00 B)

2005-04-05 19:24:50

by Linus Torvalds

[permalink] [raw]
Subject: Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC



On Tue, 5 Apr 2005, Stas Sergeev wrote:
>
> Attached is a quick fix, which I'll be
> testing to death tomorrow at work.

This one can pass through vm86 mode stuff without the high-16-bit fixup,
as far as I can tell.

Also, I think your optimization to optimistically load SS is valid per se,
but we need to find out how some kernel thread gets zero stack associated
with it. They should all have the full "struct pt_reg" as far as I could
see, which means that we should never be _so_ high up the stack that
SS/ESP would be on the next page.

So I'd actually prefer to get that mystery explained..

Linus

2005-04-05 19:48:11

by Stas Sergeev

[permalink] [raw]
Subject: Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC

Hi Linus,

Linus Torvalds wrote:
> This one can pass through vm86 mode stuff without the high-16-bit fixup,
> as far as I can tell.
Yes, but according to Petr, vm86 is not
affected by the bug at all. I did some
rough tests in the past that seem to
confirm that. Also, in any case, the
dependance of vm86 code on the higher word
of %esp would be very, very obscure.

> So I'd actually prefer to get that mystery explained..
IIRC if the interrupt doesn't do the CPL
switch, the interrupt gate doesn't save
the stack, and so there may not be the
full "struct pt_regs" when the kernel
thread is interrupted.
Does this sound any realistic?

So while it would be excellent to hear
that my patch was not guilty at all, I
think it is not the case.

2005-04-05 19:55:56

by Linus Torvalds

[permalink] [raw]
Subject: Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC



On Tue, 5 Apr 2005, Stas Sergeev wrote:
>
> > So I'd actually prefer to get that mystery explained..
> IIRC if the interrupt doesn't do the CPL
> switch, the interrupt gate doesn't save
> the stack, and so there may not be the
> full "struct pt_regs" when the kernel
> thread is interrupted.

Yes. But how do you have _such_ an empty stack when the interrupt comes
in? See what I mean? IOW, that requires that the kernel stack would have
only two words on it when the interrupt happens. How?

It may be a 4kB stack issue or something. Does this happen only with
CONFIG_4KSTACKS, and just after a stack switch to an irq stack, for
example?

So I definitely think the "bug" is in your optimization, I just think it
should be a valid optimization and we should just make sure our kernel
stack is never _so_ empty that "struct pt_regs" is not safe to
dereference.

Linus

2005-04-05 20:58:53

by Ingo Molnar

[permalink] [raw]
Subject: Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC


* Linus Torvalds <[email protected]> wrote:

> > > So I'd actually prefer to get that mystery explained..
> > IIRC if the interrupt doesn't do the CPL
> > switch, the interrupt gate doesn't save
> > the stack, and so there may not be the
> > full "struct pt_regs" when the kernel
> > thread is interrupted.
>
> Yes. But how do you have _such_ an empty stack when the interrupt
> comes in? See what I mean? IOW, that requires that the kernel stack
> would have only two words on it when the interrupt happens. How?
>
> It may be a 4kB stack issue or something. Does this happen only with
> CONFIG_4KSTACKS, and just after a stack switch to an irq stack, for
> example?

i didnt have 4K stacks set. In all crashes, esp had the same pattern:

esi: 009b63f9 edi: 00000001 ebp: f543a000 esp: f543bfc8

i.e. esp & 0xfff was 0xfc8 - while i think it should normally be 0xfc4
(page boundary minus size of pt_regs == 0 - 0x3c == 0xfc4). So somewhere
we lost 4 bytes of esp? An extra popl, or an addl $4, %esp? But why dont
we crash in that case - it ought to shift all the pt_regs structure by a
word, making it a completely senseless return frame. Any task in such a
situation ought to at least segfault. So if this is during thread
startup, i dont know how it survives the first execution.

Ingo

2005-04-05 21:03:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC



On Tue, 5 Apr 2005, Ingo Molnar wrote:
>
> esi: 009b63f9 edi: 00000001 ebp: f543a000 esp: f543bfc8
>
> i.e. esp & 0xfff was 0xfc8 - while i think it should normally be 0xfc4
> (page boundary minus size of pt_regs == 0 - 0x3c == 0xfc4). So somewhere
> we lost 4 bytes of esp? An extra popl, or an addl $4, %esp? But why dont
> we crash in that case

Normally, esp will be immediately reset by any user-land stuff: we'll
forget the old kernel stack entirely, and always re-load esp from the esp0
thing in the TSS.

And as long as we stay in kernel land, we'll obviously never touch the
esp/xss fields of pt_regs (except in this special case of doing the
speculative load of xss), so...

Linus

2005-04-06 15:44:54

by Stas Sergeev

[permalink] [raw]
Subject: Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC

Hi,

Linus Torvalds wrote:
> Yes. But how do you have _such_ an empty stack when the interrupt comes
> in? See what I mean?
Yes, I hope so.

> IOW, that requires that the kernel stack would have
> only two words on it when the interrupt happens. How?
Well, you can simply do something like this:

--- entry.S.old1 2005-04-05 22:54:43.000000000 +0400
+++ entry.S 2005-04-06 19:35:14.000000000 +0400
@@ -179,9 +179,9 @@
ENTRY(sysenter_entry)
movl TSS_sysenter_esp0(%esp),%esp
sysenter_past_esp:
- sti
pushl $(__USER_DS)
pushl %ebp
+ sti
pushfl
pushl $(__USER_CS)
pushl $SYSENTER_RETURN

And this will "elimenate" the problem
(modulo NMI and there could be other places
too, but for me it elimenates it completely).
So I don't think this is something strange.

> So I definitely think the "bug" is in your optimization,
Yes, and I think the patch I posted, can
just work, or are there the problems with
the taken forward jump on a fast path?

> I just think it
> should be a valid optimization
But it is totally bogus, why not should it
crash? It is probably even very good that
it does:)

> and we should just make sure our kernel
> stack is never _so_ empty that "struct pt_regs" is not safe to
> dereference.
I guess you'll just need to adjust the tss.esp0
then, but do you really want this? Accesing
the registers that are simply not there, doesn't
sound too good I think.
Or am I still missing your point?

2005-04-07 08:10:33

by Ingo Molnar

[permalink] [raw]
Subject: Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC


* Stas Sergeev <[email protected]> wrote:

> ENTRY(sysenter_entry)
> movl TSS_sysenter_esp0(%esp),%esp
> sysenter_past_esp:
> - sti
> pushl $(__USER_DS)
> pushl %ebp
> + sti

ah, yes, sysenter. SYSENTER creates a degenerate 'small' stackframe with
an esp0 that is missing the 5 entry words relative to the normal entry
(int80 or irq) esp0 stackframe. These 5 words are: xss, esp, eflags,
xcs, eip. The sysenter code sets them up manually.

now if an interrupt hits at this point, it will set up a 'same privilege
level' stackframe, which has eip/xcs/eflags, i.e. no esp/xss. If upon
irq-return we then examine the stack due to your patch, it will be an
incorrect stackframe -> kaboom.

your patch doesnt remove the condition, it only removes the crash,
because it adds the 2 words space that is needed - but the information
relied on by your irq-return test is still bogus. At this point i'd
suggest to remove the ESP patch altogether.

the correct solution is to always let the sysenter path set up a full
and correct stackframe, before allowing preemption (see the attached
patch). This was a nasty bug in the waiting. (I have not made this
conditional on CONFIG_PREEMPT, to keep it simple and because the impact
to irq latency is small and predictable. There's no runtime overhead.)

so i think with the help of Stas the mystery has been fully explained
and solved. Linus?

Ingo

Signed-off-by: Ingo Molnar <[email protected]>

--- linux/arch/i386/kernel/entry.S.orig
+++ linux/arch/i386/kernel/entry.S
@@ -179,12 +179,17 @@ need_resched:
ENTRY(sysenter_entry)
movl TSS_sysenter_esp0(%esp),%esp
sysenter_past_esp:
- sti
+ #
+ # irqs are disabled: set up an entry stackframe without
+ # allowing irqs to potentially preempt us with an
+ # incomplete entry frame!
+ #
pushl $(__USER_DS)
pushl %ebp
pushfl
pushl $(__USER_CS)
pushl $SYSENTER_RETURN
+ sti

/*
* Load the potential sixth argument from user stack.

2005-04-07 11:10:36

by Andrew Morton

[permalink] [raw]
Subject: Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC

Ingo Molnar <[email protected]> wrote:
>
> --- linux/arch/i386/kernel/entry.S.orig
> +++ linux/arch/i386/kernel/entry.S
> @@ -179,12 +179,17 @@ need_resched:
> ENTRY(sysenter_entry)
> movl TSS_sysenter_esp0(%esp),%esp
> sysenter_past_esp:
> - sti
> + #
> + # irqs are disabled: set up an entry stackframe without
> + # allowing irqs to potentially preempt us with an
> + # incomplete entry frame!
> + #
> pushl $(__USER_DS)
> pushl %ebp
> pushfl
> pushl $(__USER_CS)
> pushl $SYSENTER_RETURN
> + sti
>

Well that bites the big one.


Program received signal SIGTRAP, Trace/breakpoint trap.
0xc015c4ba in __find_get_block (bdev=0xc18cd988, block=2818614, size=4096) at fs/buffer.c:1371
1371 BUG_ON(irqs_disabled());
(gdb) t
[Current thread is 0 (Thread 32768)]
(gdb) bt
#0 0xc015c4ba in __find_get_block (bdev=0xc18cd988, block=2818614, size=4096) at fs/buffer.c:1371
#1 0xc015c57b in __getblk (bdev=0xc18cd988, block=2818614, size=4096) at fs/buffer.c:1487
#2 0xc0199978 in ext3_getblk (handle=0x0, inode=0xcff2a2b4, block=1, create=0, errp=0xcdf61dbc) at include/linux/buffer_head.h:260
#3 0xc019d5a6 in ext3_find_entry (dentry=0xcdaef8d4, res_dir=0xcdf61dfc) at fs/ext3/namei.c:868
#4 0xc019d9ca in ext3_lookup (dir=0x1, dentry=0xcdaef8d4, nd=0xcdf61f58) at fs/ext3/namei.c:988
#5 0xc0166b2d in real_lookup (parent=0xcdaef8d4, name=0xcdf61e8c, nd=0xcdf61f58) at fs/namei.c:416
#6 0xc0166df3 in do_lookup (nd=0xcdf61f58, name=0xcdf61e8c, path=0xcdf61e84) at fs/namei.c:670
#7 0xc0167667 in __link_path_walk (name=0xcdd24011 "", nd=0xcdf61f58) at fs/namei.c:833
#8 0xc0167c28 in link_path_walk (name=0xcdd24000 "/lib/libpcre.so.0", nd=0xcdf61f58) at fs/namei.c:911
#9 0xc0168056 in path_lookup (name=0xcdd24000 "/lib/libpcre.so.0", flags=0, nd=0xcdf61f58) at fs/namei.c:1026
#10 0xc016877d in open_namei (pathname=0xcdd24000 "/lib/libpcre.so.0", flag=1, mode=-1208907480, nd=0xcdf61f58) at fs/namei.c:1420
#11 0xc0159256 in filp_open (filename=0xcdd24000 "/lib/libpcre.so.0", flags=0, mode=-1208907480) at fs/open.c:764
#12 0xc01595e6 in sys_open (filename=0x1 <Address 0x1 out of bounds>, flags=1, mode=1) at fs/open.c:946
#13 0xc0102e91 in syscall_call () at arch/i386/kernel/semaphore.c:177
#14 0xb7f22c8a in ?? ()

Is someone doing a syscall with irqs disabled, or what?

This fixes it...



--- 25/arch/i386/kernel/entry.S~sysenter-irq-atomicity-fix-fix 2005-04-07 04:03:07.000000000 -0700
+++ 25-akpm/arch/i386/kernel/entry.S 2005-04-07 04:05:47.000000000 -0700
@@ -187,6 +187,7 @@ sysenter_past_esp:
pushl $(__USER_DS)
pushl %ebp
pushfl
+ orl $0x200,0(%esp)
pushl $(__USER_CS)
pushl $SYSENTER_RETURN
sti
_

2005-04-07 14:47:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC



On Thu, 7 Apr 2005, Andrew Morton wrote:
>
> Ingo Molnar <[email protected]> wrote:
> >
> > --- linux/arch/i386/kernel/entry.S.orig
> > +++ linux/arch/i386/kernel/entry.S
> > @@ -179,12 +179,17 @@ need_resched:
> > ENTRY(sysenter_entry)
> > movl TSS_sysenter_esp0(%esp),%esp
> > sysenter_past_esp:
> > - sti
> > + #
> > + # irqs are disabled: set up an entry stackframe without
> > + # allowing irqs to potentially preempt us with an
> > + # incomplete entry frame!
> > + #
> > pushl $(__USER_DS)
> > pushl %ebp
> > pushfl
> > pushl $(__USER_CS)
> > pushl $SYSENTER_RETURN
> > + sti
> >
>
> Well that bites the big one.
>>
> Program received signal SIGTRAP, Trace/breakpoint trap.
> 0xc015c4ba in __find_get_block (bdev=0xc18cd988, block=2818614, size=4096) at fs/buffer.c:1371
> 1371 BUG_ON(irqs_disabled());

Yes. With the change in place "entry.S" will always save the wrogn EIP, so
we'll return with interrupts disabled.

Your suggested patch is pretty horrid, though. It's sufficient to enable
interrupts after the two first pushes, since that has already now set up
enough of a kernel stack that any subsequent interrupt will always have at
least a "fake" SS/ESP (ie some values there, although not anything
relevant).

So the sysenter sequence might as well look like

pushl $(__USER_DS)
pushl %ebp
sti
pushfl
..

which actually does three protected pushes thanks to the one-instruction
"interrupt shadow" after an sti.

Another alternative (and to some degree a maybe a better one) is to not
use "pushfl" at all in the sysenter sequence, but instead just push a
fixed default value. At that point, the "sti" can stay where it was.

After all, I very strongly suspect that we don't actually really _care_ if
eflags stays the same over a system call, and I could see that some
dynamic CPU's might prefer not having to push an eflags value that just
got changed by the "sti", so it _might_ save several cycles to avoid that
dependency, and also obviously avoid a subtle dependency on a sw level
that the previous patch clearly introduced.

Anybody willing to time it? ;)

Linus

2005-04-07 14:52:08

by Ingo Molnar

[permalink] [raw]
Subject: Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC


* Linus Torvalds <[email protected]> wrote:

> > > pushfl

> After all, I very strongly suspect that we don't actually really
> _care_ if eflags stays the same over a system call, and I could see
> that some dynamic CPU's might prefer not having to push an eflags
> value that just got changed by the "sti", so it _might_ save several
> cycles to avoid that dependency, and also obviously avoid a subtle
> dependency on a sw level that the previous patch clearly introduced.
>
> Anybody willing to time it? ;)

i can tell you without any measurement that pushfl is slower by a couple
of cycles than a simple pushl $0x00010046, on basically all x86 CPUs.
And since we only support SYENTER from 32-bit mode and we dont guarantee
flags to be saved, it isnt all that incorrect to do?

Ingo

2005-04-07 16:11:42

by Stas Sergeev

[permalink] [raw]
Subject: Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC

Hello.

Ingo Molnar wrote:
> now if an interrupt hits at this point, it will set up a 'same privilege
> level' stackframe, which has eip/xcs/eflags, i.e. no esp/xss.
Yes, that's something I tried to say
when talking about the interrupt gates
(sorry if I wasn't clear).

> If upon
> irq-return we then examine the stack due to your patch, it will be an
> incorrect stackframe -> kaboom.
Yes, and that's where I think my patch is
at fault, i.e. it just shouldn't do this.
Another option is to make it always possible
to access OLDSS(%esp), but I think it is just
my patch have to be fixed to not do this at all.

> your patch doesnt remove the condition, it only removes the crash,
No, that wasn't my point at all. That example
with moving "sti" was *only* to answer Linus's
question of where we have an empty stack.
And I guess I wasn't clear enough also here,
I was in a hurry :(
The real patch I meant, is this one:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0504.0/1287.html
This, I am sure, fixes a real bug. But there
can be the other approaches too.

> because it adds the 2 words space that is needed - but the information
> relied on by your irq-return test is still bogus.
But as an example for demonstrating the problem,
I thought, it could do:)

> At this point i'd
> suggest to remove the ESP patch altogether.
That's probably too heavy-handed. The fix is
really simple, we can either store the right
values by hands (as you propose), or fix my
patch (as I propose).

> the correct solution is to always let the sysenter path set up a full
> and correct stackframe,
But what will this solve? If I understand you
correctly, you will push the %ss/%esp of the
user-space process that did sysenter. Then
you enable the interrupts and get pre-empted.
Now what we have: OLDSS/OLDESP are of the
user-space process, but the EFLAGS/CS/EIP
are of the kernel (where it got pre-empted
on a sysenter path). This will avoid the crash,
but the information on stack is still wrong.
Or am I missing something?

> before allowing preemption (see the attached
> patch).
Hmm, will it work also for NMIs? You move
the sti, you can't be pre-empted, but the
NMI uses the restore_all too, no?
Also, it seems that Linus wants only the
*some* values available on stack, just to
make it not to crash. I think we can simply
adjust the tss.esp0 to point 8 bytes below
the real stack top, and so we are always safe.

2005-04-07 16:33:59

by Linus Torvalds

[permalink] [raw]
Subject: Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC



On Thu, 7 Apr 2005, Stas Sergeev wrote:
>
> > because it adds the 2 words space that is needed - but the information
> > relied on by your irq-return test is still bogus.
>
> But as an example for demonstrating the problem,
> I thought, it could do:)

Ingo: the information is bogus, but you're wrong: the code doesn't "rely"
on it.

The fact is, bogus information is _fine_. That's what speculative work is
all about: working with bogus information, with the assumption that some
later test will ignore it if it's not relevant.

And the later test _will_ ignore it if it isn't relevant. Look for
yourself:

cmpl $((4 << 8) | 3), %eax
je ldt_ss # returning to user-space with LDT SS

notice how the "cmpl" _only_ triggers if the old CS had the low three bits
set and if EFLAGS_VM is clear. So if we return to kernel mode (or vm86)
mode, we _know_ that the SS is bogus, but we don't care. We've tested for
the proper thing, and we only do the special user-space LDT SS case if
- the LDT bit was set in SS (possibly bogus)
AND
- old CS was user space
AND
- old eflags wasn't vm86 mode.

Ie the two second checks are what validates the first (possibly bogus)
one.

So I really think that the _correct_ fix is literally to move the "cli"
in the sysenter path down two lines. It doesn't just "hide" the bug, it
literally fixes is.

Linus

2005-04-07 16:46:14

by Stas Sergeev

[permalink] [raw]
Subject: Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC

Hello.

Linus Torvalds wrote:
> So I really think that the _correct_ fix is literally to move the "cli"
> in the sysenter path down two lines. It doesn't just "hide" the bug, it
> literally fixes is.
OK, Linus, I find it amazing that you
like all my patches, even though I myself
think they are wrong:)
I still have those two questions however:
1. Does the "later sti" fixes the problem
also in case of an NMI? I mean, why can't
you just be NMI'ed before you did sti?
NMI uses the restore_all too IIRC.
2. How can one be sure there are no more
of the like places where the stack is left
empty?

2005-04-07 16:47:52

by Dave Jones

[permalink] [raw]
Subject: Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC

On Thu, Apr 07, 2005 at 07:47:41AM -0700, Linus Torvalds wrote:

> So the sysenter sequence might as well look like
>
> pushl $(__USER_DS)
> pushl %ebp
> sti
> pushfl
> ..
>
> which actually does three protected pushes thanks to the one-instruction
> "interrupt shadow" after an sti.

Is this guaranteed on every x86 variant (or rather, every one
that has SEP). ?

Dave

2005-04-07 16:53:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC



On Thu, 7 Apr 2005, Stas Sergeev wrote:
>
> 1. Does the "later sti" fixes the problem
> also in case of an NMI? I mean, why can't
> you just be NMI'ed before you did sti?
> NMI uses the restore_all too IIRC.

The NMI code had better be really careful, and yeah, I suspect it needs
fixing.

> 2. How can one be sure there are no more
> of the like places where the stack is left
> empty?

That's a good argument, and may be the strongest reason for _not_ doing
the speculation. However, I don't think it really can happen anywhere
else.

If people feel very nervous about the speculation, we can certainly just
make it do the two branches. It _is_ a hotspot, though, and the
optimization of the speculatiom may well be worth it.


Linus

2005-04-07 17:18:40

by linux-os (Dick Johnson)

[permalink] [raw]
Subject: Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC

On Thu, 7 Apr 2005, Dave Jones wrote:

> On Thu, Apr 07, 2005 at 07:47:41AM -0700, Linus Torvalds wrote:
>
> > So the sysenter sequence might as well look like
> >
> > pushl $(__USER_DS)
> > pushl %ebp
> > sti
> > pushfl
> > ..
> >
> > which actually does three protected pushes thanks to the one-instruction
> > "interrupt shadow" after an sti.
>
> Is this guaranteed on every x86 variant (or rather, every one
> that has SEP). ?
>
> Dave

The i486 book says that if the next instruction lets the IF
flag remain enabled, then any interrupts pending occur after
the next instruction. If the next instruction is CLI, no
interrupts will occur. There is a note:

"In case of an NM1, trap, or fault following ST1, the interrupt
will be taken before executing the next sequential instruction
in the code." The use of "NM1" and "ST1" is (sic). I don't
know what NM1 is, it can't be NMI because, by definition NMI
is not maskable so CLI and STI could not affect this pin.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.11 on an i686 machine (5537.79 BogoMips).
Notice : All mail here is now cached for review by Dictator Bush.
98.36% of all statistics are fiction.

2005-04-07 17:21:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC



On Thu, 7 Apr 2005, Dave Jones wrote:
>
> On Thu, Apr 07, 2005 at 07:47:41AM -0700, Linus Torvalds wrote:
>
> > So the sysenter sequence might as well look like
> >
> > pushl $(__USER_DS)
> > pushl %ebp
> > sti
> > pushfl
> > ..
> >
> > which actually does three protected pushes thanks to the one-instruction
> > "interrupt shadow" after an sti.
>
> Is this guaranteed on every x86 variant (or rather, every one
> that has SEP). ?

Well, since we only need two in this case, we don't care, but yes, it's
supposed to be guaranteed by anything that calls itself an x86.

In fact, we _do_ depend on it in a few other sequences. Notably

sti ; hlt

depends on the fact that an interrupt will always finish _after_ the hlt,
and we'll never halt before the hlt (and then re-execute the hlt after the
interrupt), and in

sti ; iret

where we depend on the fact that we don't get recursive interrupt stacks
(since we at that point have re-enabled the interrupt that happened).

Of course, if some future x86 decides that the interrupt shadow only
matters for special instructions (ie it's not so much a general interrupt
shadow as a "instruction combination"), I don't think Linux would care. I
really think there are only a very few valid sti-combinations, and I
suspect the above two are pretty much it.

(The other "magic" x86 behaviour is loading into the SS register, which
creates a one-cycle black hole after it. Linux shouldn't care, and in fact
nothing should care about it outside of old 16-bit non-protected-mode
programs, so I think that's another one that could be retired eventually)

Linus

2005-04-07 18:10:27

by Stas Sergeev

[permalink] [raw]
Subject: Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC

Hi Linus.

Linus Torvalds wrote:
> The NMI code had better be really careful, and yeah, I suspect it needs
> fixing.
And since the NMI return will need a
ESP fixup too, it will require the
"branched" version of the restore_all
checks I suppose.

>> 2. How can one be sure there are no more
>> of the like places where the stack is left
>> empty?
> That's a good argument, and may be the strongest reason for _not_ doing
> the speculation.
I haven't said that explicitly, only
implied:) My another idea was to adjust
the tss.esp0 to always point 8 bytes below
the real top of the stack, so that even in
case of an NMI we are still safe. And in
case of another such instance - too.
This may look more like a hack than shifting
the "sti", but it is probably more reliable.
At least something to consider as soon as we
are at it.

2005-04-10 13:20:23

by Stas Sergeev

[permalink] [raw]
Subject: Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC

Hello.

Linus Torvalds wrote:
>> 2. How can one be sure there are no more
>> of the like places where the stack is left
>> empty?
> That's a good argument, and may be the strongest reason for _not_ doing
> the speculation. However, I don't think it really can happen anywhere
> else.
OK, so how do you feel about the attached
patch? I understand that from some point
of view it may look like a hack, but at
the same time it:
1. Allows to preserve the valueable optimization
2. Works for NMIs
3. Doesn't care whether or not there are more
of the like instances where the stack is left
empty.
4. Seems to work for me without the crashes:)


Attachments:
esp0.diff (399.00 B)

2005-04-10 22:32:45

by Andrew Morton

[permalink] [raw]
Subject: Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC

Stas Sergeev <[email protected]> wrote:
>
> - p->thread.esp0 = (unsigned long) (childregs+1);
> + p->thread.esp0 = (unsigned long) (childregs+1) - 8;

This is utterly obscure - it needs a comment so that readers know what that
"- 8" is doing there.

2005-04-11 17:15:08

by Stas Sergeev

[permalink] [raw]
Subject: Re: crash in entry.S restore_all, 2.6.12-rc2, x86, PAGEALLOC

Hello.

Andrew Morton wrote:
> This is utterly obscure - it needs a comment so that readers know what that
> "- 8" is doing there.
Yes, that was only an RFC thing.
And now since there were not too much
of an FC, I prepared the "polished"
version. But apparently you already
released -mm3:)

Well, at least you can still apply the
comments if you feel like that. Here
they are.

Signed-off-by: Stas Sergeev <[email protected]>


Attachments:
esp0fix.diff (1.28 kB)