2006-09-17 23:08:27

by In Cognito

[permalink] [raw]
Subject: Sysenter crash with Nested Task Bit set

Here's a way to heat up your cpu and crash the rest of the system too:

main(){
asm("pushf\n"
"popl %eax\n"
/* enable the NT bit */
"orl $0x4000, %eax\n"
"pushl %eax\n"
"popf\n"

"sysenter\n"
);
return 0;
}

"If NT equals 1, IRET reverses the operation of a CALL or INT that
caused a task switch. The updated state of the task executing IRET is
saved in its task state segment. If the task is reentered later, the
code that follows IRET is executed."
- Intel 80386 Reference Programmer's Manual


2006-09-18 03:57:15

by Chuck Ebbert

[permalink] [raw]
Subject: Re: Sysenter crash with Nested Task Bit set

In-Reply-To: <[email protected]>

On Sun, 17 Sep 2006 19:08:24 -0400, "In Cognito" wrote:

> Here's a way to heat up your cpu and crash the rest of the system too:
>
> main(){
> asm("pushf\n"
> "popl %eax\n"
> /* enable the NT bit */
> "orl $0x4000, %eax\n"
> "pushl %eax\n"
> "popf\n"
>
> "sysenter\n"
> );
> return 0;
> }

I'll take your word that it crashes.

2.6.9 is fine. I'd guess the iret fixups from 2.6.12 are the problem.

This doesn't crash for me, but it's probably not quite the right fix:

Signed-off-by: Chuck Ebbert <[email protected]>
---
arch/i386/kernel/traps.c | 12 +++++++++++-
1 files changed, 11 insertions(+), 1 deletion(-)

--- 2.6.18-rc6-nb.orig/arch/i386/kernel/traps.c
+++ 2.6.18-rc6-nb/arch/i386/kernel/traps.c
@@ -516,6 +516,16 @@ fastcall void do_##name(struct pt_regs *
do_trap(trapnr, signr, str, 0, regs, error_code, NULL); \
}

+#define DO_TSS_ERROR(trapnr, signr, str, name) \
+fastcall void do_##name(struct pt_regs * regs, long error_code) \
+{ \
+ if (notify_die(DIE_TRAP, str, regs, error_code, trapnr, signr) \
+ == NOTIFY_STOP) \
+ return; \
+ regs->eflags &= ~X86_EFLAGS_NT; \
+ do_trap(trapnr, signr, str, 0, regs, error_code, NULL); \
+}
+
#define DO_ERROR_INFO(trapnr, signr, str, name, sicode, siaddr) \
fastcall void do_##name(struct pt_regs * regs, long error_code) \
{ \
@@ -561,7 +571,7 @@ DO_VM86_ERROR( 4, SIGSEGV, "overflow", o
DO_VM86_ERROR( 5, SIGSEGV, "bounds", bounds)
DO_ERROR_INFO( 6, SIGILL, "invalid opcode", invalid_op, ILL_ILLOPN, regs->eip)
DO_ERROR( 9, SIGFPE, "coprocessor segment overrun", coprocessor_segment_overrun)
-DO_ERROR(10, SIGSEGV, "invalid TSS", invalid_TSS)
+DO_TSS_ERROR(10, SIGSEGV, "invalid TSS", invalid_TSS)
DO_ERROR(11, SIGBUS, "segment not present", segment_not_present)
DO_ERROR(12, SIGBUS, "stack segment", stack_segment)
DO_ERROR_INFO(17, SIGBUS, "alignment check", alignment_check, BUS_ADRALN, 0)
--
Chuck

2006-09-18 05:25:46

by Andrew Morton

[permalink] [raw]
Subject: Re: Sysenter crash with Nested Task Bit set

On Sun, 17 Sep 2006 23:51:45 -0400
Chuck Ebbert <[email protected]> wrote:

> In-Reply-To: <[email protected]>
>
> On Sun, 17 Sep 2006 19:08:24 -0400, "In Cognito" wrote:
>
> > Here's a way to heat up your cpu and crash the rest of the system too:
> >
> > main(){
> > asm("pushf\n"
> > "popl %eax\n"
> > /* enable the NT bit */
> > "orl $0x4000, %eax\n"
> > "pushl %eax\n"
> > "popf\n"
> >
> > "sysenter\n"
> > );
> > return 0;
> > }
>
> I'll take your word that it crashes.

It doesn't for me - I get a segfault.

That's on a PIII. Are recenter CPUs different in this regard?

> 2.6.9 is fine. I'd guess the iret fixups from 2.6.12 are the problem.
>
> This doesn't crash for me, but it's probably not quite the right fix:
>
> Signed-off-by: Chuck Ebbert <[email protected]>
> ---
> arch/i386/kernel/traps.c | 12 +++++++++++-
> 1 files changed, 11 insertions(+), 1 deletion(-)
>
> --- 2.6.18-rc6-nb.orig/arch/i386/kernel/traps.c
> +++ 2.6.18-rc6-nb/arch/i386/kernel/traps.c
> @@ -516,6 +516,16 @@ fastcall void do_##name(struct pt_regs *
> do_trap(trapnr, signr, str, 0, regs, error_code, NULL); \
> }
>
> +#define DO_TSS_ERROR(trapnr, signr, str, name) \
> +fastcall void do_##name(struct pt_regs * regs, long error_code) \
> +{ \
> + if (notify_die(DIE_TRAP, str, regs, error_code, trapnr, signr) \
> + == NOTIFY_STOP) \
> + return; \
> + regs->eflags &= ~X86_EFLAGS_NT; \
> + do_trap(trapnr, signr, str, 0, regs, error_code, NULL); \
> +}
> +
> #define DO_ERROR_INFO(trapnr, signr, str, name, sicode, siaddr) \
> fastcall void do_##name(struct pt_regs * regs, long error_code) \
> { \
> @@ -561,7 +571,7 @@ DO_VM86_ERROR( 4, SIGSEGV, "overflow", o
> DO_VM86_ERROR( 5, SIGSEGV, "bounds", bounds)
> DO_ERROR_INFO( 6, SIGILL, "invalid opcode", invalid_op, ILL_ILLOPN, regs->eip)
> DO_ERROR( 9, SIGFPE, "coprocessor segment overrun", coprocessor_segment_overrun)
> -DO_ERROR(10, SIGSEGV, "invalid TSS", invalid_TSS)
> +DO_TSS_ERROR(10, SIGSEGV, "invalid TSS", invalid_TSS)
> DO_ERROR(11, SIGBUS, "segment not present", segment_not_present)
> DO_ERROR(12, SIGBUS, "stack segment", stack_segment)
> DO_ERROR_INFO(17, SIGBUS, "alignment check", alignment_check, BUS_ADRALN, 0)
> --
> Chuck

2006-09-18 06:32:59

by Mike Galbraith

[permalink] [raw]
Subject: Re: Sysenter crash with Nested Task Bit set

On Sun, 2006-09-17 at 22:25 -0700, Andrew Morton wrote:
> On Sun, 17 Sep 2006 23:51:45 -0400
> Chuck Ebbert <[email protected]> wrote:
>
> > In-Reply-To: <[email protected]>
> >
> > On Sun, 17 Sep 2006 19:08:24 -0400, "In Cognito" wrote:
> >
> > > Here's a way to heat up your cpu and crash the rest of the system too:
> > >
> > > main(){
> > > asm("pushf\n"
> > > "popl %eax\n"
> > > /* enable the NT bit */
> > > "orl $0x4000, %eax\n"
> > > "pushl %eax\n"
> > > "popf\n"
> > >
> > > "sysenter\n"
> > > );
> > > return 0;
> > > }
> >
> > I'll take your word that it crashes.
>
> It doesn't for me - I get a segfault.
>
> That's on a PIII. Are recenter CPUs different in this regard?

I guess so. That proglet does very bad things to my P4/HT. I too get a
segfault, but then init goes insane and may reap the proglet's parent
shell, but then again, it may decide to reap the shell next door.
Thereafter, init is running/gaga. (18-rc7)

Chuck's patch did indeed make it stop doing that.

> > 2.6.9 is fine. I'd guess the iret fixups from 2.6.12 are the problem.
> >
> > This doesn't crash for me, but it's probably not quite the right fix:
> >
> > Signed-off-by: Chuck Ebbert <[email protected]>
> > ---
> > arch/i386/kernel/traps.c | 12 +++++++++++-
> > 1 files changed, 11 insertions(+), 1 deletion(-)
> >
> > --- 2.6.18-rc6-nb.orig/arch/i386/kernel/traps.c
> > +++ 2.6.18-rc6-nb/arch/i386/kernel/traps.c
> > @@ -516,6 +516,16 @@ fastcall void do_##name(struct pt_regs *
> > do_trap(trapnr, signr, str, 0, regs, error_code, NULL); \
> > }
> >
> > +#define DO_TSS_ERROR(trapnr, signr, str, name) \
> > +fastcall void do_##name(struct pt_regs * regs, long error_code) \
> > +{ \
> > + if (notify_die(DIE_TRAP, str, regs, error_code, trapnr, signr) \
> > + == NOTIFY_STOP) \
> > + return; \
> > + regs->eflags &= ~X86_EFLAGS_NT; \
> > + do_trap(trapnr, signr, str, 0, regs, error_code, NULL); \
> > +}
> > +
> > #define DO_ERROR_INFO(trapnr, signr, str, name, sicode, siaddr) \
> > fastcall void do_##name(struct pt_regs * regs, long error_code) \
> > { \
> > @@ -561,7 +571,7 @@ DO_VM86_ERROR( 4, SIGSEGV, "overflow", o
> > DO_VM86_ERROR( 5, SIGSEGV, "bounds", bounds)
> > DO_ERROR_INFO( 6, SIGILL, "invalid opcode", invalid_op, ILL_ILLOPN, regs->eip)
> > DO_ERROR( 9, SIGFPE, "coprocessor segment overrun", coprocessor_segment_overrun)
> > -DO_ERROR(10, SIGSEGV, "invalid TSS", invalid_TSS)
> > +DO_TSS_ERROR(10, SIGSEGV, "invalid TSS", invalid_TSS)
> > DO_ERROR(11, SIGBUS, "segment not present", segment_not_present)
> > DO_ERROR(12, SIGBUS, "stack segment", stack_segment)
> > DO_ERROR_INFO(17, SIGBUS, "alignment check", alignment_check, BUS_ADRALN, 0)
> > --
> > Chuck
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2006-09-18 15:11:47

by Linus Torvalds

[permalink] [raw]
Subject: Re: Sysenter crash with Nested Task Bit set



On Sun, 17 Sep 2006, Andrew Morton wrote:
> >
> > I'll take your word that it crashes.
>
> It doesn't for me - I get a segfault.
>
> That's on a PIII. Are recenter CPUs different in this regard?

No, I don't think they are, but I think your exact machine matters. For
example, an SMP machine will behave differently, because NT will be set
only on one CPU.

I _think_ that the problem is that once NT gets set in kernel space (and
sysenter is the only way to do that, since the normal fault paths will
clear NT), it stays set, even across a context switch. Nothing ever
reloads a kernel eflags, because we "trust" eflags in kernel space.

So Chuck's patch doesn't really fix it - because a preemptible kernel
might context switch long before the original task that set eflags, and
thus NT in eflags might migrate to somebody else, and while we'll
eventually take the "invalid TSS" error and then Chuck's patch will cause
us to clear it, we might be taking the exception for the wrong task (and
thus kill the wrong guy).

And sysenter really is very special because of the weak trap semantics.
Damn. We could either fix it in the sysenter code-path, or in the
task-switch one, and both of them are timing-critical, but task switching
perhaps just a tad less so.

If we fix it in the task-switch code, we shouldn't need any other changes
(ie Chuck's change is unnecessary too), because then the process that sets
NT will happily die (with NT set), but switch away to something else and
nobody else will be affected.

So if I'm right, then this patch _should_ fix it. UNTESTED (and the
"ref_from_fork" special case doesn't clear NT, so it's strictly incompete,
but maybe somebody can test this?)

Hmm? Ingo? Comments?

Andi? I don't know if x86-64 honors NT in 64-bit mode, but if it does, it
needs something similar (assuming this works).

Linus

---
diff --git a/include/asm-i386/system.h b/include/asm-i386/system.h
index 49928eb..f6e4260 100644
--- a/include/asm-i386/system.h
+++ b/include/asm-i386/system.h
@@ -13,7 +13,8 @@ extern struct task_struct * FASTCALL(__s

#define switch_to(prev,next,last) do { \
unsigned long esi,edi; \
- asm volatile("pushl %%ebp\n\t" \
+ asm volatile("pushfl\n\t" /* Save flags */ \
+ "pushl %%ebp\n\t" \
"movl %%esp,%0\n\t" /* save ESP */ \
"movl %5,%%esp\n\t" /* restore ESP */ \
"movl $1f,%1\n\t" /* save EIP */ \
@@ -21,6 +22,7 @@ #define switch_to(prev,next,last) do {
"jmp __switch_to\n" \
"1:\t" \
"popl %%ebp\n\t" \
+ "popfl" \
:"=m" (prev->thread.esp),"=m" (prev->thread.eip), \
"=a" (last),"=S" (esi),"=D" (edi) \
:"m" (next->thread.esp),"m" (next->thread.eip), \

2006-09-18 15:15:16

by Ingo Molnar

[permalink] [raw]
Subject: Re: Sysenter crash with Nested Task Bit set


* Linus Torvalds <[email protected]> wrote:

> And sysenter really is very special because of the weak trap
> semantics. Damn. We could either fix it in the sysenter code-path, or
> in the task-switch one, and both of them are timing-critical, but task
> switching perhaps just a tad less so.

agreed. Context-switching is 10 times less frequent on most workloads
than syscalls, so if it takes 10 cycles in the context-switch path to
eliminate a 1 cycle overhead in the syscall-entry path then we are still
break-even on average. In this case the overhead is similar i think, so
the switch_to() fix is preferable.

Ingo

2006-09-18 15:29:32

by Andi Kleen

[permalink] [raw]
Subject: Re: Sysenter crash with Nested Task Bit set


> If we fix it in the task-switch code, we shouldn't need any other changes
> (ie Chuck's change is unnecessary too), because then the process that sets
> NT will happily die (with NT set), but switch away to something else and
> nobody else will be affected.

Won't it die in the kernel with an oops on the next interrupt?

> So if I'm right, then this patch _should_ fix it. UNTESTED (and the
> "ref_from_fork" special case doesn't clear NT, so it's strictly incompete,
> but maybe somebody can test this?)

Are you sure this handles interrupts or nested syscalls
before the context switch correctly?

I think it really needs to be handled in the sysenter path.

>
> Hmm? Ingo? Comments?
>
> Andi? I don't know if x86-64 honors NT in 64-bit mode, but if it does, it
> needs something similar (assuming this works).

It doesn't task switch, but you would get a #GP in IRET at least.
Leaking that to another process is definitely not good.


> #define switch_to(prev,next,last) do { \
> unsigned long esi,edi; \
> - asm volatile("pushl %%ebp\n\t" \
> + asm volatile("pushfl\n\t" /* Save flags */ \
> + "pushl %%ebp\n\t" \

We used to do that pushfl/popfl some time ago, but Ben removed it because
it was slow on P4. Ok, nobody thought of that case back then.

-Andi

2006-09-18 15:33:25

by Ingo Molnar

[permalink] [raw]
Subject: Re: Sysenter crash with Nested Task Bit set


* Andi Kleen <[email protected]> wrote:

> > #define switch_to(prev,next,last) do { \
> > unsigned long esi,edi; \
> > - asm volatile("pushl %%ebp\n\t" \
> > + asm volatile("pushfl\n\t" /* Save flags */ \
> > + "pushl %%ebp\n\t" \
>
> We used to do that pushfl/popfl some time ago, but Ben removed it
> because it was slow on P4. Ok, nobody thought of that case back then.

I _should_ have thought of that though, because when this change was
done originally i had to reintroduce it both for exec-shield and for
-rt, so apparently it must have triggered in certain circumstances - i
just didnt notice the reason. (but recent exec-shield and -rt patches
dont have this revert, so the condition must have gone away. But it was
a long time ago.)

Ingo

2006-09-18 16:03:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: Sysenter crash with Nested Task Bit set



On Mon, 18 Sep 2006, Andi Kleen wrote:
>
> > If we fix it in the task-switch code, we shouldn't need any other changes
> > (ie Chuck's change is unnecessary too), because then the process that sets
> > NT will happily die (with NT set), but switch away to something else and
> > nobody else will be affected.
>
> Won't it die in the kernel with an oops on the next interrupt?

No. As mentioned, "sysenter" is really special. It should really be the
_only_ entry into the kernel that doesn't change eflags. Everything else
clears NT (and most of them do other things too).

So if a process already runs in kernel mode (or used mode, for that
matter) with NT set, and an interrupt happens, the act of taking the
interrupt will clear NT, and nothing bad happens at all. In fact, we very
much depend on this, exactly because otherwise user mode could just set NT
and just wait for an interrupt, and bad things would happen. They
obviously don't.

So it's literally _only_ the path of

set NT
sysenter
...
iret

that causes problems, because all other paths will clear NT on entry into
the kernel.

> > So if I'm right, then this patch _should_ fix it. UNTESTED (and the
> > "ref_from_fork" special case doesn't clear NT, so it's strictly incompete,
> > but maybe somebody can test this?)
>
> Are you sure this handles interrupts or nested syscalls
> before the context switch correctly?

Yeah, see above. And I have now even tested it slightly (ie I ran one of
my x86 machines with that patch).

> I think it really needs to be handled in the sysenter path.

It really would be much more expensive there (well, the expense would be
the same, but any load that have any amount of either would tend to have
many more system calls than context switches).

The only way to have more context switches than system calls is to run
entirely in user space all the time, and then we don't care - the context
switches will also be so rare that the extra cycles simply don't matter.

> > Andi? I don't know if x86-64 honors NT in 64-bit mode, but if it does, it
> > needs something similar (assuming this works).
>
> It doesn't task switch, but you would get a #GP in IRET at least.
> Leaking that to another process is definitely not good.

Right. Then you need that exact same thing on x86-64 too.

One final note: as I already mentioned, this isn't actually entirely
sufficient. There's the magic special case of "switch to a newly created
thread", which jumps to "ret_from_fork" rather than staying within that
small area. We'll need to add "clear NT" there.

So this (UNTESTED - I tested the previous version, and it works, but this
extends on it) second patch should be more complete. It handles the case
where the NT-dirty task context switches to a newly created task, by just
forcing "eflags" to a known value in the newly created task, rather than
whatever value it had at the time of the context switch.

The addition is fairly obvious, but maybe I screwed something up, so buyer
beware...

Linus

---
diff --git a/arch/i386/kernel/entry.S b/arch/i386/kernel/entry.S
index 37a7d2e..87f9f60 100644
--- a/arch/i386/kernel/entry.S
+++ b/arch/i386/kernel/entry.S
@@ -209,6 +209,10 @@ ENTRY(ret_from_fork)
GET_THREAD_INFO(%ebp)
popl %eax
CFI_ADJUST_CFA_OFFSET -4
+ pushl $0x0202 # Reset kernel eflags
+ CFI_ADJUST_CFA_OFFSET 4
+ popfl
+ CFI_ADJUST_CFA_OFFSET -4
jmp syscall_exit
CFI_ENDPROC

diff --git a/include/asm-i386/system.h b/include/asm-i386/system.h
index 49928eb..defbf12 100644
--- a/include/asm-i386/system.h
+++ b/include/asm-i386/system.h
@@ -13,7 +13,8 @@ extern struct task_struct * FASTCALL(__s

#define switch_to(prev,next,last) do { \
unsigned long esi,edi; \
- asm volatile("pushl %%ebp\n\t" \
+ asm volatile("pushfl\n\t" /* Save flags */ \
+ "pushl %%ebp\n\t" \
"movl %%esp,%0\n\t" /* save ESP */ \
"movl %5,%%esp\n\t" /* restore ESP */ \
"movl $1f,%1\n\t" /* save EIP */ \
@@ -21,6 +22,7 @@ #define switch_to(prev,next,last) do {
"jmp __switch_to\n" \
"1:\t" \
"popl %%ebp\n\t" \
+ "popfl" \
:"=m" (prev->thread.esp),"=m" (prev->thread.eip), \
"=a" (last),"=S" (esi),"=D" (edi) \
:"m" (next->thread.esp),"m" (next->thread.eip), \

2006-09-18 16:11:03

by Linus Torvalds

[permalink] [raw]
Subject: Re: Sysenter crash with Nested Task Bit set



On Mon, 18 Sep 2006, Linus Torvalds wrote:
>
> The addition is fairly obvious, but maybe I screwed something up, so buyer
> beware...

Final note (I promise): now that we save/restore eflags again, we
should probably revert the set_iopl_mask() in task switching too. However,
that apparently has some para-virtualization issues, so I'm going to
ignore that part from now.

However, I'd really like people who know and care about the
paravirtualization to take a good long look at it: because right now, with
the addition of the eflags save/restore, the set_iopl_mask() in
__switch_to() is entirely useless for non-virtualized environments, afaik.

Zack added to the cc. Who else needs to know?

Linus

2006-09-18 16:13:11

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: Sysenter crash with Nested Task Bit set

On Mon, Sep 18, 2006 at 05:29:23PM +0200, Andi Kleen wrote:
> > - asm volatile("pushl %%ebp\n\t" \
> > + asm volatile("pushfl\n\t" /* Save flags */ \
> > + "pushl %%ebp\n\t" \
>
> We used to do that pushfl/popfl some time ago, but Ben removed it because
> it was slow on P4. Ok, nobody thought of that case back then.

It's the pushfl that will be slow on any OoO CPU, as it has dependancies on
any previous instructions that modified the flags, which ends up bringing
all of the memory ordering dependancies into play. Doing a popfl to set the
flags to some known value is much less expensive.

-ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[email protected]>.

2006-09-18 16:24:56

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: Sysenter crash with Nested Task Bit set

Linus Torvalds wrote:
> Final note (I promise): now that we save/restore eflags again, we
> should probably revert the set_iopl_mask() in task switching too. However,
> that apparently has some para-virtualization issues, so I'm going to
> ignore that part from now.
>

I'm wondering if we shouldn't have a (__)switch_to paravirt hook, so we
can wrap the context switch in whatever we like.

> However, I'd really like people who know and care about the
> paravirtualization to take a good long look at it: because right now, with
> the addition of the eflags save/restore, the set_iopl_mask() in
> __switch_to() is entirely useless for non-virtualized environments, afaik.
>

Hm. Zach removed the pushf/popf in switch_to this last Sept, with the
comment "The pushf/popf in switch_to are ONLY used to switch IOPL.
Making this explicit in C code is more clear. This pushf/popf pair was
added as a bugfix for leaking IOPL to unprivileged processes when using
sysenter/sysexit based system calls (sysexit does not restore flags)."


> Zack added to the cc. Who else needs to know?
Rusty, Chris Wright and me.

2006-09-18 16:50:34

by Andi Kleen

[permalink] [raw]
Subject: Re: Sysenter crash with Nested Task Bit set

On Monday 18 September 2006 18:12, Benjamin LaHaise wrote:
> On Mon, Sep 18, 2006 at 05:29:23PM +0200, Andi Kleen wrote:
> > > - asm volatile("pushl %%ebp\n\t" \
> > > + asm volatile("pushfl\n\t" /* Save flags */ \
> > > + "pushl %%ebp\n\t" \
> >
> > We used to do that pushfl/popfl some time ago, but Ben removed it because
> > it was slow on P4. Ok, nobody thought of that case back then.
>
> It's the pushfl that will be slow on any OoO CPU, as it has dependancies on
> any previous instructions that modified the flags, which ends up bringing
> all of the memory ordering dependancies into play. Doing a popfl to set the
> flags to some known value is much less expensive.

Yes it's never fast, but on basically all non P4 CPUs it is still fast enough
to not be a problem. I suppose it causes a trace cache flush or something like
that there.

-Andi

2006-09-18 19:01:06

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: Sysenter crash with Nested Task Bit set

Andi Kleen wrote:
> Yes it's never fast, but on basically all non P4 CPUs it is still fast enough
> to not be a problem. I suppose it causes a trace cache flush or something like
> that there.

I don't think it's that bad, but it might cause a full pipeline flush.
I seem to remember measuring its cost at about 50 cycles.

J