2002-04-17 14:51:32

by Jan Hubicka

[permalink] [raw]
Subject: SSE related security hole

Hi,
while debugging GCC bugreport, I noticed following behaviour of simple
program with no syscalls:

hubicka@nikam:~$ ./a.out
sum of 7 ints: 28
hubicka@nikam:~$ ./a.out
sum of 7 ints: 56
Bad sum (seen with gcc -O -march=pentiumpro -msse)
hubicka@nikam:~$ ./a.out
sum of 7 ints: 84
Bad sum (seen with gcc -O -march=pentiumpro -msse)
hubicka@nikam:~$ ./a.out
sum of 7 ints: 112
Bad sum (seen with gcc -O -march=pentiumpro -msse)
hubicka@nikam:~$ echo

hubicka@nikam:~$ ./a.out
sum of 7 ints: 28


ie it always returns different value, moreover when something else
is run in meantime (verified by loading WWW page served by same machine),
the counter is reinitialized to 28.

I am attaching the source, but it needs to be compiled by cfg-branch GCC
with settings -O2 -march=pentium3 -mfpmath=sse, so I've placed static
binary to http://atrey.karlin.mff.cuni.cz/~hubicka/badsum.bin

The problem appears to be reproducible only on pentium3 and athlon4 systems,
not pentium4 system, where it appears to work as expected. Reproduced on
both 2.4.9-RH and 2.4.16 kernels.

Honza


Attachments:
(No filename) (1.05 kB)
badsum.c (495.00 B)
Download all attachments

2002-04-17 15:23:38

by Jan Hubicka

[permalink] [raw]
Subject: Re: SSE related security hole

Hi,
Jakub asked me to cleanup the source and post assembly file. Here it comes.

#include <stdlib.h>
#include <stdio.h>

int
m ()
{
int i, n = 7;
float comp, sum = 0;
sin(1);
for (i = 1; i <= n; ++i)
sum += i;
printf ("sum of %d ints: %g\n", n, sum);
return 0;
}

main ()
{
m ();
}


Attachments:
(No filename) (302.00 B)
bad3.s (810.00 B)
Download all attachments

2002-04-17 23:43:00

by Doug Ledford

[permalink] [raw]
Subject: Re: SSE related security hole

Subject: Re: SSE related security hole
Reply-To:
In-Reply-To: <[email protected]>; from [email protected] on Wed, Apr 17, 2002 at 04:13:03PM +0100

(NOTE: This may have already been answered by someone else, but I haven't
seen it if it has, so I'm sending this through)

> Hi,
> while debugging GCC bugreport, I noticed following behaviour of simple
> program with no syscalls:
>
> hubicka@nikam:~$ ./a.out
> sum of 7 ints: 28
> hubicka@nikam:~$ ./a.out
> sum of 7 ints: 56
> Bad sum (seen with gcc -O -march=pentiumpro -msse)
> hubicka@nikam:~$ ./a.out
> sum of 7 ints: 84
> Bad sum (seen with gcc -O -march=pentiumpro -msse)
> hubicka@nikam:~$ ./a.out
> sum of 7 ints: 112
> Bad sum (seen with gcc -O -march=pentiumpro -msse)
> hubicka@nikam:~$ echo
>
> hubicka@nikam:~$ ./a.out
> sum of 7 ints: 28
>
>
> ie it always returns different value, moreover when something else
> is run in meantime (verified by loading WWW page served by same machine),
> the counter is reinitialized to 28.
>
> I am attaching the source, but it needs to be compiled by cfg-branch GCC
> with settings -O2 -march=pentium3 -mfpmath=sse, so I've placed static
> binary to http://atrey.karlin.mff.cuni.cz/~hubicka/badsum.bin

Compiling the asm source with a different compiler will also make it fail.

> The problem appears to be reproducible only on pentium3 and athlon4 systems,
> not pentium4 system, where it appears to work as expected. Reproduced on
> both 2.4.9-RH and 2.4.16 kernels.

[ program snipped ]

So there are two different issues at play here. First, the kernel uses
the fninit instruction to initialize the fpu on first use. Nothing in the
Intel manuals says anything about the fninit instruction clearing the mmx
or sse registers, and experimentally we now know for sure that it doesn't.
That means that when the first time your program was ran it left 28 in
register xmm1. The next time the program was run, the fninit did nothing
to clear register xmm1 so it still held 28. Now, the pxor instruction
that is part of the m() function and intended to 0 out the xmm1 register
is an sse2 instruction. It just so happens that it doesn't work on sse
only processors such as P3 CPUs. So, when 28 was left in xmm1, then the
pxor failed to 0 out xmm1, we saved 28 as the starting value for the loop
and then looped through 7 additions until we had, you guessed it, 56. In
fact, if you do a while :;do bad; done loop the it will increment by 28
each time it is run except when something else intervenes. Replacing the
pxor instruction with xorps instead makes it work. So, that's a bug in
gcc I suspect, using sse2 instructions when only called to use sse
instructions. It seems odd to me that the CPU wouldn't generate an
illegal instruction exception, but oh well, it evidently doesn't.

So, we really should change arch/i386/kernel/i387.c something like this:

(WARNING, totally untested and not even compile checked change follows)

--- i387.c.save Wed Apr 17 19:22:47 2002
+++ i387.c Wed Apr 17 19:28:27 2002
@@ -33,8 +33,26 @@
void init_fpu(void)
{
__asm__("fninit");
- if ( cpu_has_xmm )
+ if ( cpu_has_mmx )
+ asm volatile("xorq %%mm0, %%mm0;
+ xorq %%mm1, %%mm1;
+ xorq %%mm2, %%mm2;
+ xorq %%mm3, %%mm3;
+ xorq %%mm4, %%mm4;
+ xorq %%mm5, %%mm5;
+ xorq %%mm6, %%mm6;
+ xorq %%mm7, %%mm7");
+ if ( cpu_has_xmm ) {
+ asm volatile("xorps %%xmm0, %%xmm0;
+ xorps %%xmm1, %%xmm1;
+ xorps %%xmm2, %%xmm2;
+ xorps %%xmm3, %%xmm3;
+ xorps %%xmm4, %%xmm4;
+ xorps %%xmm5, %%xmm5;
+ xorps %%xmm6, %%xmm6;
+ xorps %%xmm7, %%xmm7");
load_mxcsr(0x1f80);
+ }

current->used_math = 1;
}


The rest of the problem is a gcc bug and possibly something that Intel
should make a note of on the p3 processors (that is, that the p3 will
silently fail to execute some sse2 instructions without generating the
expected exception).

--
Doug Ledford <[email protected]> 919-754-3700 x44233
Red Hat, Inc.
1801 Varsity Dr.
Raleigh, NC 27606

2002-04-18 08:22:37

by Andi Kleen

[permalink] [raw]
Subject: Re: SSE related security hole

> --- i387.c.save Wed Apr 17 19:22:47 2002
> +++ i387.c Wed Apr 17 19:28:27 2002
> @@ -33,8 +33,26 @@
> void init_fpu(void)
> {
> __asm__("fninit");
> - if ( cpu_has_xmm )
> + if ( cpu_has_mmx )

Shouldn't that be cpu_has_mmx && !cpu_has_sse2 ?


-Andi

> + asm volatile("xorq %%mm0, %%mm0;
> + xorq %%mm1, %%mm1;
> + xorq %%mm2, %%mm2;

2002-04-18 08:23:55

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: SSE related security hole

On Wed, Apr 17, 2002 at 07:42:49PM -0400, Doug Ledford wrote:
> --- i387.c.save Wed Apr 17 19:22:47 2002
> +++ i387.c Wed Apr 17 19:28:27 2002
> @@ -33,8 +33,26 @@
> void init_fpu(void)
> {
> __asm__("fninit");
> - if ( cpu_has_xmm )
> + if ( cpu_has_mmx )
> + asm volatile("xorq %%mm0, %%mm0;
> + xorq %%mm1, %%mm1;
> + xorq %%mm2, %%mm2;
> + xorq %%mm3, %%mm3;
> + xorq %%mm4, %%mm4;
> + xorq %%mm5, %%mm5;
> + xorq %%mm6, %%mm6;
> + xorq %%mm7, %%mm7");

This mean the mmx isn't really backwards compatible and that's
potentially a problem for all the legacy x86 multiuser operative
systems. That's an hardware design bug, not a software problem. In
short running a 2.[02] kernel on a MMX capable CPU isn't secure, the
same potentially applies to windows NT and other unix, no matter of SSE.

I verified with this simple proggy:

main()
{
long long x = 2;
long long z = 3;

asm volatile("movq %0, %%mm0":: "m" (x));
asm volatile("fninit");
asm volatile("movq %%mm0, %0": "=m" (z):);

printf("%d\n", z);
}

it prints 2 here, while it should print zero or at least random to be
backwards compatible.

SSE was a completly different issue, that is a software bug. SSE is
disabled by non aware OS, and so if we enable it we also must take care
of clearing it at the first math fault.

> + if ( cpu_has_xmm ) {
> + asm volatile("xorps %%xmm0, %%xmm0;
> + xorps %%xmm1, %%xmm1;
> + xorps %%xmm2, %%xmm2;
> + xorps %%xmm3, %%xmm3;
> + xorps %%xmm4, %%xmm4;
> + xorps %%xmm5, %%xmm5;
> + xorps %%xmm6, %%xmm6;
> + xorps %%xmm7, %%xmm7");

The patch has a couple of problems. xorq doesn't exists. Since there are
no params you should also drop one %. Also I think we need an emms after
the mmx operations to remain binary compatible with the x86 ABI.

How does this look?

--- 2.4.19pre7aa1/arch/i386/kernel/i387.c.~1~ Thu Apr 18 05:23:12 2002
+++ 2.4.19pre7aa1/arch/i386/kernel/i387.c Thu Apr 18 07:20:26 2002
@@ -33,8 +33,28 @@
void init_fpu(void)
{
__asm__("fninit");
- if ( cpu_has_xmm )
+ if (cpu_has_mmx) {
+ asm volatile("pxor %mm0, %mm0\n\t"
+ "movq %mm0, %mm1\n\t"
+ "movq %mm0, %mm2\n\t"
+ "movq %mm0, %mm3\n\t"
+ "movq %mm0, %mm4\n\t"
+ "movq %mm0, %mm5\n\t"
+ "movq %mm0, %mm6\n\t"
+ "movq %mm0, %mm7\n\t"
+ "emms\n");
+ }
+ if ( cpu_has_xmm ) {
+ asm volatile("xorps %xmm0, %xmm0\n\t"
+ "xorps %xmm1, %xmm1\n\t"
+ "xorps %xmm2, %xmm2\n\t"
+ "xorps %xmm3, %xmm3\n\t"
+ "xorps %xmm4, %xmm4\n\t"
+ "xorps %xmm5, %xmm5\n\t"
+ "xorps %xmm6, %xmm6\n\t"
+ "xorps %xmm7, %xmm7\n");
load_mxcsr(0x1f80);
+ }

current->used_math = 1;
}

Andrea

2002-04-18 09:10:37

by Arjan van de Ven

[permalink] [raw]
Subject: Re: SSE related security hole

Andrea Arcangeli wrote:
>
> On Wed, Apr 17, 2002 at 07:42:49PM -0400, Doug Ledford wrote:
> > --- i387.c.save Wed Apr 17 19:22:47 2002
> > +++ i387.c Wed Apr 17 19:28:27 2002
> > @@ -33,8 +33,26 @@
> > void init_fpu(void)
> > {
> > __asm__("fninit");
> > - if ( cpu_has_xmm )
> > + if ( cpu_has_mmx )
> > + asm volatile("xorq %%mm0, %%mm0;
> > + xorq %%mm1, %%mm1;
> > + xorq %%mm2, %%mm2;
> > + xorq %%mm3, %%mm3;
> > + xorq %%mm4, %%mm4;
> > + xorq %%mm5, %%mm5;
> > + xorq %%mm6, %%mm6;
> > + xorq %%mm7, %%mm7");
>
> This mean the mmx isn't really backwards compatible and that's
> potentially a problem for all the legacy x86 multiuser operative
> systems. That's an hardware design bug, not a software problem. In
> short running a 2.[02] kernel on a MMX capable CPU isn't secure, the
> same potentially applies to windows NT and other unix, no matter of SSE.
>
> I verified with this simple proggy:
>
> main()
> {
> long long x = 2;
> long long z = 3;
>
> asm volatile("movq %0, %%mm0":: "m" (x));
> asm volatile("fninit");
> asm volatile("movq %%mm0, %0": "=m" (z):);
>
> printf("%d\n", z);
> }
>
> it prints 2 here, while it should print zero or at least random to be
> backwards compatible.
>
> SSE was a completly different issue, that is a software bug. SSE is
> disabled by non aware OS, and so if we enable it we also must take care
> of clearing it at the first math fault.
>
> > + if ( cpu_has_xmm ) {
> > + asm volatile("xorps %%xmm0, %%xmm0;
> > + xorps %%xmm1, %%xmm1;
> > + xorps %%xmm2, %%xmm2;
> > + xorps %%xmm3, %%xmm3;
> > + xorps %%xmm4, %%xmm4;
> > + xorps %%xmm5, %%xmm5;
> > + xorps %%xmm6, %%xmm6;
> > + xorps %%xmm7, %%xmm7");
>
> The patch has a couple of problems. xorq doesn't exists. Since there are
> no params you should also drop one %. Also I think we need an emms after
> the mmx operations to remain binary compatible with the x86 ABI.
>
> How does this look?
>
> --- 2.4.19pre7aa1/arch/i386/kernel/i387.c.~1~ Thu Apr 18 05:23:12 2002
> +++ 2.4.19pre7aa1/arch/i386/kernel/i387.c Thu Apr 18 07:20:26 2002
> @@ -33,8 +33,28 @@
> void init_fpu(void)
> {
> __asm__("fninit");
> - if ( cpu_has_xmm )
> + if (cpu_has_mmx) {
> + asm volatile("pxor %mm0, %mm0\n\t"
> + "movq %mm0, %mm1\n\t"
> + "movq %mm0, %mm2\n\t"
> + "movq %mm0, %mm3\n\t"
> + "movq %mm0, %mm4\n\t"
> + "movq %mm0, %mm5\n\t"
> + "movq %mm0, %mm6\n\t"
> + "movq %mm0, %mm7\n\t"
> + "emms\n");
> + }
> + if ( cpu_has_xmm ) {
> + asm volatile("xorps %xmm0, %xmm0\n\t"
> + "xorps %xmm1, %xmm1\n\t"
> + "xorps %xmm2, %xmm2\n\t"
> + "xorps %xmm3, %xmm3\n\t"
> + "xorps %xmm4, %xmm4\n\t"
> + "xorps %xmm5, %xmm5\n\t"
> + "xorps %xmm6, %xmm6\n\t"
> + "xorps %xmm7, %xmm7\n");
> load_mxcsr(0x1f80);
> + }
>
> current->used_math = 1;
> }

Looks good; I just did the same thing ;)

2002-04-18 09:59:26

by Denis Vlasenko

[permalink] [raw]
Subject: Re: SSE related security hole

On 17 April 2002 13:23, Jan Hubicka wrote:
> #include <stdlib.h>
> #include <stdio.h>
>
> int
> m ()
> {
> int i, n = 7;
> float comp, sum = 0;
^^^^
unused?

> sin(1);

So, removing this sin() stops the bug?

> for (i = 1; i <= n; ++i)
> sum += i;
> printf ("sum of %d ints: %g\n", n, sum);
> return 0;
> }
>
> main ()
> {
> m ();
> }
Can m() body be placed directly in main()?
--
vda

2002-04-18 11:00:42

by Alan

[permalink] [raw]
Subject: Re: SSE related security hole

> This mean the mmx isn't really backwards compatible and that's
> potentially a problem for all the legacy x86 multiuser operative
> systems. That's an hardware design bug, not a software problem. In
> short running a 2.[02] kernel on a MMX capable CPU isn't secure, the
> same potentially applies to windows NT and other unix, no matter of SSE.

That was my initial reaction but when I reread the documentation the
Intel folks are actually saying even back in Pentium MMX days that it isnt
guaranteed that the FP/MMX state are not seperate registers

2002-04-18 11:14:37

by Andi Kleen

[permalink] [raw]
Subject: Re: SSE related security hole

On Thu, Apr 18, 2002 at 12:18:34PM +0100, Alan Cox wrote:
> > This mean the mmx isn't really backwards compatible and that's
> > potentially a problem for all the legacy x86 multiuser operative
> > systems. That's an hardware design bug, not a software problem. In
> > short running a 2.[02] kernel on a MMX capable CPU isn't secure, the
> > same potentially applies to windows NT and other unix, no matter of SSE.
>
> That was my initial reaction but when I reread the documentation the
> Intel folks are actually saying even back in Pentium MMX days that it isnt
> guaranteed that the FP/MMX state are not seperate registers

In this case it would be possible to only do the explicit clear
when the CPU does support sse1. For mmx only it shouldn't be needed.
For sse2 also not.


-Andi

2002-04-18 11:35:18

by Alan

[permalink] [raw]
Subject: Re: SSE related security hole

> > Intel folks are actually saying even back in Pentium MMX days that it isnt
> > guaranteed that the FP/MMX state are not seperate registers
>
> In this case it would be possible to only do the explicit clear
> when the CPU does support sse1. For mmx only it shouldn't be needed.
> For sse2 also not.

Do you have a documentation cite for that claim ?

2002-04-18 11:47:01

by Andi Kleen

[permalink] [raw]
Subject: Re: SSE related security hole

On Thu, Apr 18, 2002 at 12:53:12PM +0100, Alan Cox wrote:
> > > Intel folks are actually saying even back in Pentium MMX days that it isnt
> > > guaranteed that the FP/MMX state are not seperate registers
> >
> > In this case it would be possible to only do the explicit clear
> > when the CPU does support sse1. For mmx only it shouldn't be needed.
> > For sse2 also not.
>
> Do you have a documentation cite for that claim ?

Never mind. It was a bogus suggestion and fninit indeed also doesn't clear
XMM on P4 and other SSE2 implementation.

-Andi

2002-04-18 11:55:18

by Andi Kleen

[permalink] [raw]
Subject: Re: SSE related security hole

On Thu, Apr 18, 2002 at 12:53:12PM +0100, Alan Cox wrote:
> > > Intel folks are actually saying even back in Pentium MMX days that it isnt
> > > guaranteed that the FP/MMX state are not seperate registers
> >
> > In this case it would be possible to only do the explicit clear
> > when the CPU does support sse1. For mmx only it shouldn't be needed.
> > For sse2 also not.
>
> Do you have a documentation cite for that claim ?

Actually I did some more tests:

test program

main()
{
unsigned int i[4], o[4];

i[0] = 1; i[1] = 2; i[2] = 3; i[3] = 4;
asm("movups %1,%%xmm1 ; fninit ; movups %%xmm1,%0" : "=m" (o) : "m" (i));
printf("%x %x %x %x\n",o[0],o[1],o[2],o[3]);

asm("movups %1,%%xmm1 ; movups %%xmm1,%0" : "=m" (o) : "m" (i));
printf("%x %x %x %x\n",o[0],o[1],o[2],o[3]);
}

Result on a pentium4:

./xmm
bffff68c 8048431 8049640 8049660
bffff68c bffff68c bffff68c 8048431

So fninit seems to change something in XMM1.

and pentium 3:

bffff81c 8048431 8049640 8049660
bffff81c bffff81c bffff81c 8048431

changes something different ?

If even Intel cannot agree on this it is probably safest to do an explicit
zeroing like Andrea's patch does. I retract the origina suggestion.

-Andi

2002-04-18 13:44:58

by Doug Ledford

[permalink] [raw]
Subject: Re: SSE related security hole

On Thu, Apr 18, 2002 at 07:26:15AM +0200, Andrea Arcangeli wrote:
> On Wed, Apr 17, 2002 at 07:42:49PM -0400, Doug Ledford wrote:
> > --- i387.c.save Wed Apr 17 19:22:47 2002
> > +++ i387.c Wed Apr 17 19:28:27 2002
> > @@ -33,8 +33,26 @@
> > void init_fpu(void)
> > {
> > __asm__("fninit");
> > - if ( cpu_has_xmm )
> > + if ( cpu_has_mmx )
> > + asm volatile("xorq %%mm0, %%mm0;
> > + xorq %%mm1, %%mm1;
> > + xorq %%mm2, %%mm2;
> > + xorq %%mm3, %%mm3;
> > + xorq %%mm4, %%mm4;
> > + xorq %%mm5, %%mm5;
> > + xorq %%mm6, %%mm6;
> > + xorq %%mm7, %%mm7");
>
> This mean the mmx isn't really backwards compatible and that's
> potentially a problem for all the legacy x86 multiuser operative
> systems. That's an hardware design bug, not a software problem. In
> short running a 2.[02] kernel on a MMX capable CPU isn't secure, the
> same potentially applies to windows NT and other unix, no matter of SSE.

Why is that not backwards compatible? I've never heard of anywhere that
specifies that the starting value in the mmx registers will be anything of
consequence? Also, even though register space is (possibly) shared with
the FP register stack, clearing out the MMX registers does not actually
harm the FP register stack since the fninit already blows the stack away,
which forces the application to load fp data before it can use the fpu
again.

> I verified with this simple proggy:
>
> main()
> {
> long long x = 2;
> long long z = 3;
>
> asm volatile("movq %0, %%mm0":: "m" (x));
> asm volatile("fninit");
> asm volatile("movq %%mm0, %0": "=m" (z):);
>
> printf("%d\n", z);
> }
>
> it prints 2 here, while it should print zero or at least random to be
> backwards compatible.

I verified the same here last night with a similar test.

> The patch has a couple of problems. xorq doesn't exists. Since there are
> no params you should also drop one %. Also I think we need an emms after
> the mmx operations to remain binary compatible with the x86 ABI.

I did mention that it was an untested and even uncompiled patch in my
email after all ;-)

> How does this look?

I'm fine with this.

> --- 2.4.19pre7aa1/arch/i386/kernel/i387.c.~1~ Thu Apr 18 05:23:12 2002
> +++ 2.4.19pre7aa1/arch/i386/kernel/i387.c Thu Apr 18 07:20:26 2002
> @@ -33,8 +33,28 @@
> void init_fpu(void)
> {
> __asm__("fninit");
> - if ( cpu_has_xmm )
> + if (cpu_has_mmx) {
> + asm volatile("pxor %mm0, %mm0\n\t"
> + "movq %mm0, %mm1\n\t"
> + "movq %mm0, %mm2\n\t"
> + "movq %mm0, %mm3\n\t"
> + "movq %mm0, %mm4\n\t"
> + "movq %mm0, %mm5\n\t"
> + "movq %mm0, %mm6\n\t"
> + "movq %mm0, %mm7\n\t"
> + "emms\n");
> + }
> + if ( cpu_has_xmm ) {
> + asm volatile("xorps %xmm0, %xmm0\n\t"
> + "xorps %xmm1, %xmm1\n\t"
> + "xorps %xmm2, %xmm2\n\t"
> + "xorps %xmm3, %xmm3\n\t"
> + "xorps %xmm4, %xmm4\n\t"
> + "xorps %xmm5, %xmm5\n\t"
> + "xorps %xmm6, %xmm6\n\t"
> + "xorps %xmm7, %xmm7\n");
> load_mxcsr(0x1f80);
> + }
>
> current->used_math = 1;
> }
>
> Andrea

--
Doug Ledford <[email protected]> 919-754-3700 x44233
Red Hat, Inc.
1801 Varsity Dr.
Raleigh, NC 27606

2002-04-18 14:02:44

by Matthias Andree

[permalink] [raw]
Subject: [OT: nostalgia] Re: SSE related security hole

On Wed, 17 Apr 2002, Doug Ledford wrote:

> pxor instruction with xorps instead makes it work. So, that's a bug in
> gcc I suspect, using sse2 instructions when only called to use sse
> instructions. It seems odd to me that the CPU wouldn't generate an
> illegal instruction exception, but oh well, it evidently doesn't.

Remember ye goode olde 6502/6510 processors used in the famous Commodore
64 computers? These don't bail out when using undefined opcodes either,
some opcodes actually had undocumented but consistent behaviour and were
used in "my program is shorter than yours" 1000 byte demo contests and
the like.

2002-04-18 18:36:49

by George Spelvin

[permalink] [raw]
Subject: Re: SSE related security hole

Um, people here seem to be assuming that, in the absence of MMX,
fninit *doesn't* leak information.

I thought it was well-known to just clear (set to all-ones) the
tag register and not alter the actual floating-point registers.

Thus, it seems quite feasible to reset the tag word with FLDENV and
store out the FPU registers, even on an 80387.

Isn't this the same security hole? Shouldn't there be 8 FLDZ instructions
(or equivalent) in the processor state initialization?

2002-04-18 18:50:12

by Richard B. Johnson

[permalink] [raw]
Subject: Re: SSE related security hole

On 18 Apr 2002 [email protected] wrote:

> Um, people here seem to be assuming that, in the absence of MMX,
> fninit *doesn't* leak information.
>
> I thought it was well-known to just clear (set to all-ones) the
> tag register and not alter the actual floating-point registers.
>
> Thus, it seems quite feasible to reset the tag word with FLDENV and
> store out the FPU registers, even on an 80387.
>
> Isn't this the same security hole? Shouldn't there be 8 FLDZ instructions
> (or equivalent) in the processor state initialization?

Well, if what's on the internal stack of the FPU can actually leak
information, I think the notion of "leak" has expanded just a bit
too much.

A rogue process could not even know what instruction was about to
be executed, nor what the previous instruction was, nor when since
boot it was executed, nor by whom. The 'data' associated with those
unknown instructions would make a good random number generator,
or a round-about method of obtaining PI (st0 almost always contains it).
That's about all.

Cheers,
Dick Johnson

Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).

Windows-2000/Professional isn't.

2002-04-18 19:20:03

by Pavel Machek

[permalink] [raw]
Subject: Re: SSE related security hole

Hi!

> > > + asm volatile("xorq %%mm0, %%mm0;
> > > + xorq %%mm1, %%mm1;
> > > + xorq %%mm2, %%mm2;
> > > + xorq %%mm3, %%mm3;
> > > + xorq %%mm4, %%mm4;
> > > + xorq %%mm5, %%mm5;
> > > + xorq %%mm6, %%mm6;
> > > + xorq %%mm7, %%mm7");
> >
> > This mean the mmx isn't really backwards compatible and that's
> > potentially a problem for all the legacy x86 multiuser operative
> > systems. That's an hardware design bug, not a software problem. In
> > short running a 2.[02] kernel on a MMX capable CPU isn't secure, the
> > same potentially applies to windows NT and other unix, no matter of SSE.
>
> Why is that not backwards compatible? I've never heard of anywhere that
> specifies that the starting value in the mmx registers will be anything of
> consequence? Also, even though register space is (possibly) shared with
> the FP register stack, clearing out the MMX registers does not actually
> harm the FP register stack since the fninit already blows the stack away,
> which forces the application to load fp data before it can use the fpu
> again.

It introduces security hole: Unrelated tasks now have your top secret
value you stored in one of your registers.
Pavel
--
Casualities in World Trade Center: ~3k dead inside the building,
cryptography in U.S.A. and free speech in Czech Republic.

2002-04-18 19:32:47

by Doug Ledford

[permalink] [raw]
Subject: Re: SSE related security hole

On Thu, Apr 18, 2002 at 09:20:03PM +0200, Pavel Machek wrote:
> It introduces security hole: Unrelated tasks now have your top secret
> value you stored in one of your registers.

Well, that's been my point all along and why I sent the patch. I was not
asking why leaving the registers alone instead of 0ing them out was not a
security hole. I was asking why doing so was not backward compatible?

--
Doug Ledford <[email protected]> 919-754-3700 x44233
Red Hat, Inc.
1801 Varsity Dr.
Raleigh, NC 27606

2002-04-18 21:07:01

by H. Peter Anvin

[permalink] [raw]
Subject: Re: SSE related security hole

Followup to: <[email protected]>
By author: [email protected]
In newsgroup: linux.dev.kernel
>
> Um, people here seem to be assuming that, in the absence of MMX,
> fninit *doesn't* leak information.
>
> I thought it was well-known to just clear (set to all-ones) the
> tag register and not alter the actual floating-point registers.
>
> Thus, it seems quite feasible to reset the tag word with FLDENV and
> store out the FPU registers, even on an 80387.
>
> Isn't this the same security hole? Shouldn't there be 8 FLDZ instructions
> (or equivalent) in the processor state initialization?
>

Perhaps the right thing to do is to have a description in data of the
desired initialization state and just F[NX]RSTOR it?

-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <[email protected]>

2002-04-19 11:05:20

by Alan

[permalink] [raw]
Subject: Re: SSE related security hole

> If the FP/MMX state _are_ separate regs then they must also be
> stored/reloaded separately on a context switch.
> Is that already done? (if yes, where?)

FXSAVE/FXRSTOR

2002-04-19 14:06:20

by Andi Kleen

[permalink] [raw]
Subject: Re: SSE related security hole

"H. Peter Anvin" <[email protected]> writes:
>
> Perhaps the right thing to do is to have a description in data of the
> desired initialization state and just F[NX]RSTOR it?

Sounds like the cleanest solution. The state could be saved at CPU bootup
with just MXCSR initialized.

I'll implement that for x86-64.

-Andi

2002-04-19 18:00:35

by Doug Ledford

[permalink] [raw]
Subject: Re: SSE related security hole

On Fri, Apr 19, 2002 at 04:06:17PM +0200, Andi Kleen wrote:
> "H. Peter Anvin" <[email protected]> writes:
> >
> > Perhaps the right thing to do is to have a description in data of the
> > desired initialization state and just F[NX]RSTOR it?
>
> Sounds like the cleanest solution. The state could be saved at CPU bootup
> with just MXCSR initialized.
>
> I'll implement that for x86-64.

Ummm...last I knew, fxrstor is *expensive*. The fninit/xor regs setup is
likely *very* much faster. Someone should check this before we sacrifice
100 cycles needlessly or something.

--
Doug Ledford <[email protected]> 919-754-3700 x44233
Red Hat, Inc.
1801 Varsity Dr.
Raleigh, NC 27606

2002-04-19 21:03:32

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: SSE related security hole

On Fri, Apr 19, 2002 at 02:00:31PM -0400, Doug Ledford wrote:
> On Fri, Apr 19, 2002 at 04:06:17PM +0200, Andi Kleen wrote:
> > "H. Peter Anvin" <[email protected]> writes:
> > >
> > > Perhaps the right thing to do is to have a description in data of the
> > > desired initialization state and just F[NX]RSTOR it?
> >
> > Sounds like the cleanest solution. The state could be saved at CPU bootup
> > with just MXCSR initialized.
> >
> > I'll implement that for x86-64.
>
> Ummm...last I knew, fxrstor is *expensive*. The fninit/xor regs setup is
> likely *very* much faster. Someone should check this before we sacrifice
> 100 cycles needlessly or something.

most probably yes, fxrestor needs to read ram, pxor also takes some
icache and bytecode ram but it sounds like it will be faster.

Maybe we could also interleave the pxor with the xorps, since they uses
different parts of the cpu, Honza?

diff -urN ref/arch/x86_64/kernel/i387.c xmm/arch/x86_64/kernel/i387.c
--- ref/arch/x86_64/kernel/i387.c Fri Apr 19 19:37:30 2002
+++ xmm/arch/x86_64/kernel/i387.c Fri Apr 19 19:39:02 2002
@@ -34,6 +34,31 @@
struct task_struct *me = current;
__asm__("fninit");
load_mxcsr(0x1f80);
+ asm volatile("pxor %mm0, %mm0\n\t"
+ "pxor %mm1, %mm1\n\t"
+ "pxor %mm2, %mm2\n\t"
+ "pxor %mm3, %mm3\n\t"
+ "pxor %mm4, %mm4\n\t"
+ "pxor %mm5, %mm5\n\t"
+ "pxor %mm6, %mm6\n\t"
+ "pxor %mm7, %mm7\n\t"
+ "emms\n\t"
+ "xorps %xmm0, %xmm0\n\t"
+ "xorps %xmm1, %xmm1\n\t"
+ "xorps %xmm2, %xmm2\n\t"
+ "xorps %xmm3, %xmm3\n\t"
+ "xorps %xmm4, %xmm4\n\t"
+ "xorps %xmm5, %xmm5\n\t"
+ "xorps %xmm6, %xmm6\n\t"
+ "xorps %xmm7, %xmm7\n\t"
+ "xorps %xmm8, %xmm8\n\t"
+ "xorps %xmm9, %xmm9\n\t"
+ "xorps %xmm10, %xmm10\n\t"
+ "xorps %xmm11, %xmm11\n\t"
+ "xorps %xmm12, %xmm12\n\t"
+ "xorps %xmm13, %xmm13\n\t"
+ "xorps %xmm14, %xmm14\n\t"
+ "xorps %xmm15, %xmm15\n");
me->used_math = 1;
}


Andrea

2002-04-19 21:36:24

by H. Peter Anvin

[permalink] [raw]
Subject: Re: SSE related security hole

>>
>> Ummm...last I knew, fxrstor is *expensive*. The fninit/xor regs setup
>> is likely *very* much faster. Someone should check this before we
>> sacrifice 100 cycles needlessly or something.
>
> most probably yes, fxrestor needs to read ram, pxor also takes some
> icache and bytecode ram but it sounds like it will be faster.
>
> Maybe we could also interleave the pxor with the xorps, since they uses
> different parts of the cpu, Honza?
>

You almost certainly should. The reason I suggested FXRSTOR is that it
would initialize the entire FPU, including any state that future
processors may add, thus reducing the likelihood of any funnies in the
future.



2002-04-19 21:42:10

by Andi Kleen

[permalink] [raw]
Subject: Re: SSE related security hole

On Fri, Apr 19, 2002 at 02:35:57PM -0700, H. Peter Anvin wrote:
> >>
> >> Ummm...last I knew, fxrstor is *expensive*. The fninit/xor regs setup
> >> is likely *very* much faster. Someone should check this before we
> >> sacrifice 100 cycles needlessly or something.
> >
> > most probably yes, fxrestor needs to read ram, pxor also takes some
> > icache and bytecode ram but it sounds like it will be faster.
> >
> > Maybe we could also interleave the pxor with the xorps, since they uses
> > different parts of the cpu, Honza?
> >
>
> You almost certainly should. The reason I suggested FXRSTOR is that it
> would initialize the entire FPU, including any state that future
> processors may add, thus reducing the likelihood of any funnies in the
> future.

That's also why I like it.

-Andi

2002-04-19 22:18:57

by Jan Hubicka

[permalink] [raw]
Subject: Re: SSE related security hole

> On Fri, Apr 19, 2002 at 02:00:31PM -0400, Doug Ledford wrote:
> > On Fri, Apr 19, 2002 at 04:06:17PM +0200, Andi Kleen wrote:
> > > "H. Peter Anvin" <[email protected]> writes:
> > > >
> > > > Perhaps the right thing to do is to have a description in data of the
> > > > desired initialization state and just F[NX]RSTOR it?
> > >
> > > Sounds like the cleanest solution. The state could be saved at CPU bootup
> > > with just MXCSR initialized.
> > >
> > > I'll implement that for x86-64.
> >
> > Ummm...last I knew, fxrstor is *expensive*. The fninit/xor regs setup is
> > likely *very* much faster. Someone should check this before we sacrifice
> > 100 cycles needlessly or something.
>
> most probably yes, fxrestor needs to read ram, pxor also takes some
> icache and bytecode ram but it sounds like it will be faster.
>
> Maybe we could also interleave the pxor with the xorps, since they uses
> different parts of the cpu, Honza?

Yes, I guess that should help to at least some chips.
Definitly nothing to loose :)

Honza

2002-04-19 23:14:31

by Brian Gerst

[permalink] [raw]
Subject: [PATCH] Re: SSE related security hole

diff -urN linux-2.5.8/arch/i386/kernel/i387.c linux/arch/i386/kernel/i387.c
--- linux-2.5.8/arch/i386/kernel/i387.c Thu Mar 7 21:18:32 2002
+++ linux/arch/i386/kernel/i387.c Fri Apr 19 18:23:53 2002
@@ -31,13 +31,20 @@
* value at reset if we support XMM instructions and then
* remeber the current task has used the FPU.
*/
-void init_fpu(void)
+void init_fpu(struct task_struct *tsk)
{
- __asm__("fninit");
- if ( cpu_has_xmm )
- load_mxcsr(0x1f80);
-
- current->used_math = 1;
+ if ( cpu_has_xmm ) {
+ memset(&tsk->thread.i387.fxsave, 0, sizeof(struct i387_fxsave_struct));
+ tsk->thread.i387.fxsave.cwd = 0x37f;
+ tsk->thread.i387.fxsave.mxcsr = 0x1f80;
+ } else {
+ memset(&tsk->thread.i387.fsave, 0, sizeof(struct i387_fsave_struct));
+ tsk->thread.i387.fsave.cwd = 0xffff037f;
+ tsk->thread.i387.fsave.swd = 0xffff0000;
+ tsk->thread.i387.fsave.twd = 0xffffffff;
+ tsk->thread.i387.fsave.fos = 0xffff0000;
+ }
+ tsk->used_math = 1;
}

/*
diff -urN linux-2.5.8/arch/i386/kernel/traps.c linux/arch/i386/kernel/traps.c
--- linux-2.5.8/arch/i386/kernel/traps.c Sun Apr 14 23:48:18 2002
+++ linux/arch/i386/kernel/traps.c Fri Apr 19 18:22:12 2002
@@ -757,13 +757,12 @@
*/
asmlinkage void math_state_restore(struct pt_regs regs)
{
+ struct task_struct *tsk = current;
clts(); /* Allow maths ops (or we recurse) */

- if (current->used_math) {
- restore_fpu(current);
- } else {
- init_fpu();
- }
+ if (!tsk->used_math)
+ init_fpu(tsk);
+ restore_fpu(tsk);
set_thread_flag(TIF_USEDFPU); /* So we fnsave on switch_to() */
}

diff -urN linux-2.5.8/include/asm-i386/i387.h linux/include/asm-i386/i387.h
--- linux-2.5.8/include/asm-i386/i387.h Fri Apr 19 18:44:48 2002
+++ linux/include/asm-i386/i387.h Fri Apr 19 18:46:12 2002
@@ -17,7 +17,7 @@
#include <asm/sigcontext.h>
#include <asm/user.h>

-extern void init_fpu(void);
+extern void init_fpu(struct task_struct *);
/*
* FPU lazy state save handling...
*/


Attachments:
fpuclear-1 (1.91 kB)

2002-04-19 23:42:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Re: SSE related security hole



On Fri, 19 Apr 2002, Brian Gerst wrote:
>
> Here's a patch to do just that. It initializes the saved state in the
> task struct and falls through to restore_fpu().

One issue is whether we _can_ restore a "generated" image like this: it's
entirely possible that Intel might save away internal CPU shadow data in
the save-state structure, and future CPU's might be unhappy about loading
data that doesn't conform to what the CPU would save.

That said, the same issue is certainly true for just doing "xorps", since
that will not clear any potential future CPU state.

I get this feeling that Intel screwed up on specifying how to initialize
this whole state.

Linus

2002-04-20 00:01:33

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] Re: SSE related security hole

>
> I get this feeling that Intel screwed up on specifying how to
> initialize this whole state.
>

Indeed. Logically, FNINIT should have been extended to initialize it all -
- it is a security hole that it doesn't initialize MMX properly.
Alternatively, for SSE only, an INITP instruction could have been added
that an SSE-enabled OS can use at the time OSXFSR or whatever that flag is
called is set.



2002-04-20 00:10:01

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Re: SSE related security hole



On Fri, 19 Apr 2002, H. Peter Anvin wrote:
>
> Indeed. Logically, FNINIT should have been extended to initialize it all -
> - it is a security hole that it doesn't initialize MMX properly.

Well, MMX should arguably be initialized with "emms", so the proper
sequence migth be something like

if (sse)
asm("emms");
asm("fninit");

What does emms do to SSE2?

Linus

2002-04-20 00:10:49

by Brian Gerst

[permalink] [raw]
Subject: Re: [PATCH] Re: SSE related security hole

diff -urN linux-2.5.8/arch/i386/kernel/i387.c linux/arch/i386/kernel/i387.c
--- linux-2.5.8/arch/i386/kernel/i387.c Thu Mar 7 21:18:32 2002
+++ linux/arch/i386/kernel/i387.c Fri Apr 19 19:35:14 2002
@@ -31,13 +31,21 @@
* value at reset if we support XMM instructions and then
* remeber the current task has used the FPU.
*/
-void init_fpu(void)
+void init_fpu(struct task_struct *tsk)
{
- __asm__("fninit");
- if ( cpu_has_xmm )
- load_mxcsr(0x1f80);
-
- current->used_math = 1;
+ if (cpu_has_fxsr) {
+ memset(&tsk->thread.i387.fxsave, 0, sizeof(struct i387_fxsave_struct));
+ tsk->thread.i387.fxsave.cwd = 0x37f;
+ if (cpu_has_xmm)
+ tsk->thread.i387.fxsave.mxcsr = 0x1f80;
+ } else {
+ memset(&tsk->thread.i387.fsave, 0, sizeof(struct i387_fsave_struct));
+ tsk->thread.i387.fsave.cwd = 0xffff037f;
+ tsk->thread.i387.fsave.swd = 0xffff0000;
+ tsk->thread.i387.fsave.twd = 0xffffffff;
+ tsk->thread.i387.fsave.fos = 0xffff0000;
+ }
+ tsk->used_math = 1;
}

/*
diff -urN linux-2.5.8/arch/i386/kernel/traps.c linux/arch/i386/kernel/traps.c
--- linux-2.5.8/arch/i386/kernel/traps.c Sun Apr 14 23:48:18 2002
+++ linux/arch/i386/kernel/traps.c Fri Apr 19 18:22:12 2002
@@ -757,13 +757,12 @@
*/
asmlinkage void math_state_restore(struct pt_regs regs)
{
+ struct task_struct *tsk = current;
clts(); /* Allow maths ops (or we recurse) */

- if (current->used_math) {
- restore_fpu(current);
- } else {
- init_fpu();
- }
+ if (!tsk->used_math)
+ init_fpu(tsk);
+ restore_fpu(tsk);
set_thread_flag(TIF_USEDFPU); /* So we fnsave on switch_to() */
}

diff -urN linux-2.5.8/include/asm-i386/i387.h linux/include/asm-i386/i387.h
--- linux-2.5.8/include/asm-i386/i387.h Fri Apr 19 18:44:48 2002
+++ linux/include/asm-i386/i387.h Fri Apr 19 19:34:03 2002
@@ -17,7 +17,7 @@
#include <asm/sigcontext.h>
#include <asm/user.h>

-extern void init_fpu(void);
+extern void init_fpu(struct task_struct *);
/*
* FPU lazy state save handling...
*/


Attachments:
fpuclear-2 (1.92 kB)

2002-04-20 00:13:29

by Brian Gerst

[permalink] [raw]
Subject: Re: [PATCH] Re: SSE related security hole

Linus Torvalds wrote:
>
> On Fri, 19 Apr 2002, H. Peter Anvin wrote:
>
>>Indeed. Logically, FNINIT should have been extended to initialize it all -
>>- it is a security hole that it doesn't initialize MMX properly.
>
>
> Well, MMX should arguably be initialized with "emms", so the proper
> sequence migth be something like
>
> if (sse)
> asm("emms");
> asm("fninit");
>
> What does emms do to SSE2?
>
> Linus
>

All emms does is reset the tag word. It doesn't touch the registers.

--

Brian Gerst

2002-04-20 00:13:27

by Alan

[permalink] [raw]
Subject: Re: [PATCH] Re: SSE related security hole

> Well, MMX should arguably be initialized with "emms", so the proper
> sequence migth be something like

Does emms make any guarantees about the register state ?

2002-04-20 00:19:06

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] Re: SSE related security hole

>
>
> On Fri, 19 Apr 2002, H. Peter Anvin wrote:
>>
>> Indeed. Logically, FNINIT should have been extended to initialize it
>> all - - it is a security hole that it doesn't initialize MMX properly.
>
> Well, MMX should arguably be initialized with "emms", so the proper
> sequence migth be something like
>

Remember that one of the original design goals of MMX was to run on
unmodified operating systems (and it can't be turned off); thus needing
any extra initialization is a bug.

-hpa



2002-04-20 00:21:50

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Re: SSE related security hole



On Fri, 19 Apr 2002, Brian Gerst wrote:
>
> I don't know about Intel, but the Athlon doesn't appear to save anything
> in the "reserved" areas.

Yes, I think it will work with current CPU's, I'm just worried about
future state extensions.

That said, I personally certainly prefer this approach which is guaranteed
to set as much as possible to a known state, even in the presense of
future extensions. If future extensions don't like loading zeroes, at
least the state will be _consistent_, which on the whole is really the
most important thing.

Applied.

Linus

2002-04-20 00:29:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Re: SSE related security hole



On Fri, 19 Apr 2002, H. Peter Anvin wrote:
>
> Remember that one of the original design goals of MMX was to run on
> unmodified operating systems (and it can't be turned off); thus needing
> any extra initialization is a bug.

Oh, I agree 100%. The sanest thing would certainly have been to make sure
"fninit" just clears everything inclusing MMX and full XMM state. Oh,
well.

Linus

2002-04-20 03:21:40

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: SSE related security hole

On Fri, Apr 19, 2002 at 11:42:06PM +0200, Andi Kleen wrote:
> On Fri, Apr 19, 2002 at 02:35:57PM -0700, H. Peter Anvin wrote:
> > would initialize the entire FPU, including any state that future
> > processors may add, thus reducing the likelihood of any funnies in the
> > future.
>
> That's also why I like it.

Trusting the "boot state" of the cpu would require the BIOS to match the
linux ABI. The FPU must be in a known initialized state at the linux
level, not at the BIOS level, as first for the mxcsr, but also the other
registers should be set to zero by default so gcc can exploit that (I
guess that's what gcc is just doing and that's why Honza noticed it). so
if new future processors will add new stuff, the new stuff will have to
be initialized again in the "fxrestor" default payload in linux (so
requiring a modification to the OS), and having to change the default
rxrestor payload for a new cpu is equivalent to add another xor there
(modulo the runtime check on the cpu features that could be avoided with
two separate exception handlers for each cpu revision but it's fast
enough that it doesn't matter at the moment on x86).

Andrea

2002-04-20 04:20:32

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] Re: SSE related security hole

On Fri, Apr 19, 2002 at 08:08:15PM -0400, Brian Gerst wrote:
> diff -urN linux-2.5.8/arch/i386/kernel/i387.c linux/arch/i386/kernel/i387.c
> --- linux-2.5.8/arch/i386/kernel/i387.c Thu Mar 7 21:18:32 2002
> +++ linux/arch/i386/kernel/i387.c Fri Apr 19 19:35:14 2002
> @@ -31,13 +31,21 @@
> * value at reset if we support XMM instructions and then
> * remeber the current task has used the FPU.
> */
> -void init_fpu(void)
> +void init_fpu(struct task_struct *tsk)
> {
> - __asm__("fninit");
> - if ( cpu_has_xmm )
> - load_mxcsr(0x1f80);
> -
> - current->used_math = 1;
> + if (cpu_has_fxsr) {
> + memset(&tsk->thread.i387.fxsave, 0, sizeof(struct i387_fxsave_struct));
> + tsk->thread.i387.fxsave.cwd = 0x37f;
> + if (cpu_has_xmm)
> + tsk->thread.i387.fxsave.mxcsr = 0x1f80;
> + } else {
> + memset(&tsk->thread.i387.fsave, 0, sizeof(struct i387_fsave_struct));
> + tsk->thread.i387.fsave.cwd = 0xffff037f;
> + tsk->thread.i387.fsave.swd = 0xffff0000;
> + tsk->thread.i387.fsave.twd = 0xffffffff;
> + tsk->thread.i387.fsave.fos = 0xffff0000;
> + }
> + tsk->used_math = 1;
> }
>
> /*
> diff -urN linux-2.5.8/arch/i386/kernel/traps.c linux/arch/i386/kernel/traps.c
> --- linux-2.5.8/arch/i386/kernel/traps.c Sun Apr 14 23:48:18 2002
> +++ linux/arch/i386/kernel/traps.c Fri Apr 19 18:22:12 2002
> @@ -757,13 +757,12 @@
> */
> asmlinkage void math_state_restore(struct pt_regs regs)
> {
> + struct task_struct *tsk = current;
> clts(); /* Allow maths ops (or we recurse) */
>
> - if (current->used_math) {
> - restore_fpu(current);
> - } else {
> - init_fpu();
> - }
> + if (!tsk->used_math)
> + init_fpu(tsk);
> + restore_fpu(tsk);
> set_thread_flag(TIF_USEDFPU); /* So we fnsave on switch_to() */
> }
>

I don't think it's good enough for merging yet. If you really want to do
the fxrestor, you should at least do the init_fpu only once during
bootup. The fxrestor is probably just overkill, but the memset + the
initializations is completly superflous in a fast path, I'd also use the
proper set_fpu_cwd and friends instead of doing it by hand. Even better
is to merge the:

/* Simulate an empty FPU. */
set_fpu_cwd(child, 0x037f);
set_fpu_swd(child, 0x0000);
set_fpu_twd(child, 0xffff);
set_fpu_mxcsr(child, 0x1f80);

/* Simulate an empty FPU. */
set_fpu_cwd(child, 0x037f);
set_fpu_swd(child, 0x0000);
set_fpu_twd(child, 0xffff);

in ptrace.c in a single function instead of duplicating functionality by
hand.

I still think the xor will be faster, no dcache pollution at all and
less I/O to ram. Future features can require change to the "empty FPU"
state anyways.

Andrea

2002-04-20 04:36:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Re: SSE related security hole



On Sat, 20 Apr 2002, Andrea Arcangeli wrote:
>
> I don't think it's good enough for merging yet. If you really want to do
> the fxrestor, you should at least do the init_fpu only once during
> bootup.

No, it needs to be done once per process FP state init, ie "flush_thread".

Think about it - we do _not_ copy the FP registers over an execve().

> The fxrestor is probably just overkill, but the memset + the
> initializations is completly superflous in a fast path,

That's no fast path, that's a "this process has never used the FPU before,
so we'd better make sure that it starts off with a really clean slate".

There is not just "one" FP state per kernel - there is one per process.
There _has_ to be.

> I still think the xor will be faster, no dcache pollution at all and
> less I/O to ram. Future features can require change to the "empty FPU"
> state anyways.

But the point is that people may still use a 2.4.x kernel on a P4-SSE3,
which only adds a few new instructions, and which re-uses the old SSE2
save area.

No kernel support necessary - it's transparent to the kernel, which won't
ever even know that some new fields in the save area are now used.

That was the whole point of MMX - it worked without any new OS support.
Intel learnt somewhat from past mistakes and made the save area bigger
than necessary, so that they can add new extensions without needing to
upgrade the OS yet another time.

THAT is the reason we can't just zero the SSE registers - because if we
do, we'll have the same problem next time around.

Linus

2002-04-20 05:05:56

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] Re: SSE related security hole

On Fri, Apr 19, 2002 at 09:35:45PM -0700, Linus Torvalds wrote:
>
>
> On Sat, 20 Apr 2002, Andrea Arcangeli wrote:
> >
> > I don't think it's good enough for merging yet. If you really want to do
> > the fxrestor, you should at least do the init_fpu only once during
> > bootup.
>
> No, it needs to be done once per process FP state init, ie "flush_thread".

Note that with init_fpu I meant the init_fpu written in the patch. All
you need is a:

fxrstor "default fpu state"

You don't need to memset(0) and init your own stack to create the
"default fpu state" at every "flush_thread". You only need to execute an
fxrestor over a piece of ram initialized only once during boot, never
initialized ala init_fpu in the fast path.

>
> Think about it - we do _not_ copy the FP registers over an execve().
>
> > The fxrestor is probably just overkill, but the memset + the
> > initializations is completly superflous in a fast path,
>
> That's no fast path, that's a "this process has never used the FPU before,
> so we'd better make sure that it starts off with a really clean slate".

it's executed by every single task using the fpu, I do use python or bc
to do a simple math from bash and exit, that's a fast path for me, the
math exception should return ASAP, doing a superflous memset in the math
exception over an array looks bad to me.

> There is not just "one" FP state per kernel - there is one per process.
> There _has_ to be.

There is only "one" FP _empty_ state per kernel. That's the only thing
you need to resolve the math exception, no need to replicate it. Then
you set used_math and PF_USEDFPU and the rest of the matchanism will take
care of using the per-process backing store to save/restore it. No need
to memset to create the "empty FPU" state at every exception over and over again.

>
> > I still think the xor will be faster, no dcache pollution at all and
> > less I/O to ram. Future features can require change to the "empty FPU"
> > state anyways.
>
> But the point is that people may still use a 2.4.x kernel on a P4-SSE3,
> which only adds a few new instructions, and which re-uses the old SSE2
> save area.

If there's no new xmm and new control register that's fine. If there's
new control register the 2.4.x kernel will need modifications anyways.

Just adding new instructions is just fine, like between sse and sse2.

I think the only argument for that is that it will potentially clear the
xmm8-15 registers too, if they will be added to an x86 (they're just in
x86-64). The control part doesn't make much sense to be because it will
likely not be zero anyways.

> THAT is the reason we can't just zero the SSE registers - because if we
> do, we'll have the same problem next time around.

You are zeroing the SSE registers with the fxrestor way too. If a new
control register is added zero won't be guaranteed to be the right
initialization for it, most control registers aren't set to 0 by
default.

Unless they provide some future plan so that we can design the OS now to
support new feature it's pointless to spend cycles to try to be
backwards compatible with future cpus. Zeroing registers will be fine if
they add new instructions, everything else won't be backwards compatible
by definition because it wasn't documented that you had to do a certain
think to be backwards compatible.

Andrea

2002-04-20 16:28:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Re: SSE related security hole


On Sat, 20 Apr 2002, Andrea Arcangeli wrote:
>
> Note that with init_fpu I meant the init_fpu written in the patch. All
> you need is a:
>
> fxrstor "default fpu state"

Ok, that I agree with.

> > That's no fast path, that's a "this process has never used the FPU before,
> > so we'd better make sure that it starts off with a really clean slate".
>
> it's executed by every single task using the fpu

Yes. _Once_ in their lifetimes.

> > But the point is that people may still use a 2.4.x kernel on a P4-SSE3,
> > which only adds a few new instructions, and which re-uses the old SSE2
> > save area.
>
> If there's no new xmm and new control register that's fine. If there's
> new control register the 2.4.x kernel will need modifications anyways.
>
> Just adding new instructions is just fine, like between sse and sse2.

If Intel makes the SSE3 registers twice as wide (or creates new ones), the
xorps trick simply will not work.

> I think the only argument for that is that it will potentially clear the
> xmm8-15 registers too, if they will be added to an x86 (they're just in
> x86-64). The control part doesn't make much sense to be because it will
> likely not be zero anyways.

Actually, even control parts likely _will_ be be zero, the way people
work.

> > THAT is the reason we can't just zero the SSE registers - because if we
> > do, we'll have the same problem next time around.
>
> You are zeroing the SSE registers with the fxrestor way too.


Andrea, that's the whole _point_.

> If a new
> control register is added zero won't be guaranteed to be the right
> initialization for it, most control registers aren't set to 0 by
> default.

Even then, having a reliable failure that is easy to pinpoint it a lot
better than random behaviour that has taken us more than two years to even
_find_.

Besides, zeroes for initial values of control registers actually _is_
fairly likely, in my opinion. I've sent off an email to my Intel contacts
to try to make this architected..

Linus

2002-04-20 17:27:55

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] Re: SSE related security hole

On Sat, Apr 20, 2002 at 09:27:11AM -0700, Linus Torvalds wrote:
> If Intel makes the SSE3 registers twice as wide (or creates new ones), the
> xorps trick simply will not work.

Then the thing is different, I expected SSE3 not to mess the xmm layout.
If you just know SSE3 would break with the xorps the fxrestor way is
better. Anyways the problems I have about the implementation remains
(memset and duplicate efforts with ptrace in creating the empty fpu
state).

If they tell you the xmm registers won't change with SSE3 instead I
still prefer the xorps, that's 3bytes x 8 registers = 24 bytes of
icachce, compared to throwing away 512bytes/32 = 16 dcache cachelines so
it should be significantly faster.

Andrea

2002-04-20 17:39:47

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Re: SSE related security hole



On Sat, 20 Apr 2002, Andrea Arcangeli wrote:
>
> Then the thing is different, I expected SSE3 not to mess the xmm layout.
> If you just know SSE3 would break with the xorps the fxrestor way is
> better. Anyways the problems I have about the implementation remains
> (memset and duplicate efforts with ptrace in creating the empty fpu
> state).

Hey, send a clean patch and it will definitely get fixed.. I don't
disagree with that part, although actual numbers are always good to have.

> If they tell you the xmm registers won't change with SSE3 instead I
> still prefer the xorps, that's 3bytes x 8 registers = 24 bytes of
> icachce, compared to throwing away 512bytes/32 = 16 dcache cachelines so
> it should be significantly faster.

Oh, if they promise to not add registers we have an easy time, I agree.

Linus

2002-04-20 18:12:21

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] Re: SSE related security hole

On Sat, Apr 20, 2002 at 10:38:59AM -0700, Linus Torvalds wrote:
>
>
> On Sat, 20 Apr 2002, Andrea Arcangeli wrote:
> >
> > Then the thing is different, I expected SSE3 not to mess the xmm layout.
> > If you just know SSE3 would break with the xorps the fxrestor way is
> > better. Anyways the problems I have about the implementation remains
> > (memset and duplicate efforts with ptrace in creating the empty fpu
> > state).
>
> Hey, send a clean patch and it will definitely get fixed.. I don't
> disagree with that part, although actual numbers are always good to have.

Well, my preferred patch is still this one :)

--- linux/arch/i386/kernel/i387.c.org Thu Apr 18 09:30:26 2002
+++ linux/arch/i386/kernel/i387.c Thu Apr 18 09:38:23 2002
@@ -33,9 +33,30 @@
void init_fpu(void)
{
__asm__("fninit");
- if ( cpu_has_xmm )
+
+ if ( cpu_has_mmx )
+ asm volatile(
+ "pxor %mm0, %mm0 \n"
+ "pxor %mm1, %mm1 \n"
+ "pxor %mm2, %mm2 \n"
+ "pxor %mm3, %mm3 \n"
+ "pxor %mm4, %mm4 \n"
+ "pxor %mm5, %mm5 \n"
+ "pxor %mm6, %mm6 \n"
+ "pxor %mm7, %mm7 \n"
+ "emms \n");
+ if ( cpu_has_xmm ) {
load_mxcsr(0x1f80);
+ asm volatile(
+ "xorps %xmm0, %xmm0 \n"
+ "xorps %xmm1, %xmm1 \n"
+ "xorps %xmm2, %xmm2 \n"
+ "xorps %xmm3, %xmm3 \n"
+ "xorps %xmm4, %xmm4 \n"
+ "xorps %xmm5, %xmm5 \n"
+ "xorps %xmm6, %xmm6 \n"
+ "xorps %xmm7, %xmm7 ");
-
+ }
current->used_math = 1;
}

--- linux/include/asm-i386/processor.h~ Thu Apr 18 10:06:37 2002
+++ linux/include/asm-i386/processor.h Thu Apr 18 10:06:37 2002
@@ -89,6 +89,7 @@
#define cpu_has_vme (test_bit(X86_FEATURE_VME, boot_cpu_data.x86_capability))
#define cpu_has_fxsr (test_bit(X86_FEATURE_FXSR, boot_cpu_data.x86_capability))
#define cpu_has_xmm (test_bit(X86_FEATURE_XMM, boot_cpu_data.x86_capability))
+#define cpu_has_mmx (test_bit(X86_FEATURE_MMX, boot_cpu_data.x86_capability))
#define cpu_has_fpu (test_bit(X86_FEATURE_FPU, boot_cpu_data.x86_capability))
#define cpu_has_apic (test_bit(X86_FEATURE_APIC, boot_cpu_data.x86_capability))



My strongest argument is that even if it's true the fxrestor logic may
not need changes even if they add registers or control words, that
doesn't apply to the kernel as a whole too. The same 2.4 kernel binary
image with the fxrestor patch applied cannot enable SSE3 automatically.
Further patching, recompilation and in turn a new fresh binary image
will be necessary to enable SSE3, regardless of the universal fxrestor
logic. This obviously assuming that they won't release again a chip
that is not backwards compatible 8).

I mean, if they change the registers layout, and so if they require a
different empty FPU state, they must as well add yet another bitflag to
enable SSE3, if they don't the chip isn't backwards compatible.

If they only add new instructions like in sse2 than everything is fine
and the same binary image will work regardless of xorps or
universal-rxrestor.

If the register layout changes, by the time we change the kernel to
enable SSE3, it will be trivial to add yet another series of xoprs in
init_fpu (or a control word initialization ala load_mxcsr) and it will
remain much faster than fxrestor, and actually much simpler as well
IMHO.

This is why I don't feel the need of compilcating the code creating the
"empty fpu" during boot etc..

Andrea

2002-04-20 19:30:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Re: SSE related security hole


On Sat, 20 Apr 2002, Andrea Arcangeli wrote:
>
> Well, my preferred patch is still this one :)

But this one simply will not get applied, for all the reasons already
outlined. It _will_ cause the same problems that it tries to fix, just at
some later time.

Besides, I seriously doubt it is any faster than what is there already.

Time it, and notice how:

- fninit takes about 200 cycles
- fxrstor takes about 215 cycles

and your added 16*(pxor/xorps) likely takes at least 8 cycles.

In short, your "fast" code isn't actually any faster than doing it right.

Linus

2002-04-20 19:41:17

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] Re: SSE related security hole

> Besides, I seriously doubt it is any faster than what is there already.
>
> Time it, and notice how:
>
> - fninit takes about 200 cycles
> - fxrstor takes about 215 cycles

On what CPU?

I checked the Athlon4 optimization manual and fxrstor is listed as 68/108
cycles (i guess depending on whether there is XMM state or not so 68 cycles
probably apply here) and fninit as 91 cycles. It doesn't list the SSE1
timings, but i guess the instructions don't take more than 3 cycles
(MMX instructions take that long). So Andrea's way should be
91+16*3=139+some cycles for emms (or 107 if sse ops take only a single cycle)
vs 68 or 108. So the fxrstor wins well.

On x86-64 the difference is even bigger because it has 16 XMM registers instead
of 8.


> In short, your "fast" code isn't actually any faster than doing it right.

At least on Athlon it should be slower.

-Andi

2002-04-20 21:28:44

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] Re: SSE related security hole

On Sat, Apr 20, 2002 at 09:41:14PM +0200, Andi Kleen wrote:
> > Besides, I seriously doubt it is any faster than what is there already.
> >
> > Time it, and notice how:
> >
> > - fninit takes about 200 cycles
> > - fxrstor takes about 215 cycles
>
> On what CPU?
>
> I checked the Athlon4 optimization manual and fxrstor is listed as 68/108
> cycles (i guess depending on whether there is XMM state or not so 68 cycles
> probably apply here) and fninit as 91 cycles. It doesn't list the SSE1
> timings, but i guess the instructions don't take more than 3 cycles
> (MMX instructions take that long). So Andrea's way should be
> 91+16*3=139+some cycles for emms (or 107 if sse ops take only a single cycle)
> vs 68 or 108. So the fxrstor wins well.
>
> On x86-64 the difference is even bigger because it has 16 XMM registers instead
> of 8.
>
>
> > In short, your "fast" code isn't actually any faster than doing it right.
>
> At least on Athlon it should be slower.

pxor+xorps is definitely faster than fxrestor on athlon-mp.

fxrestor on athlon-mp 1600, on cold cache (the "default fpu state" will
be cold most of the time, it's only ever used at the first math fault of
a task):

dualathlon:/home/andrea # for i in 1 2 3 4 5 6 7 8 9 10 ; do ./a.out ;
done
cycles 3507
cycles 3293
cycles 3179
cycles 3267
cycles 3374
cycles 3213
cycles 2575
cycles 3306
cycles 2915
cycles 3078

pxor + xorps on the same hardware cold cache:

dualathlon:/home/andrea # for i in 1 2 3 4 5 6 7 8 9 10 ; do ./a.out ;
done
cycles 2444
cycles 1619
cycles 2396
cycles 2644
cycles 2322
cycles 2404
cycles 2266
cycles 1657
cycles 2182
cycles 2077

And here it is again on PIII 800mhz for confirmation on true Intel CPU:

fxrestor:

for i in 1 2 3 4 5 6 7 8 9 10 ; do ./a.out ; done
cycles 3461
cycles 3294
cycles 3339
cycles 3276
cycles 3620
cycles 3443
cycles 3401
cycles 3455
cycles 3515
cycles 3327

xor+xorps:

for i in 1 2 3 4 5 6 7 8 9 10 ; do ./a.out ; done
cycles 2288
cycles 2181
cycles 2430
cycles 2486
cycles 2360
cycles 2534
cycles 2243
cycles 2199
cycles 2153
cycles 2396

As said in an earlier email it is a matter of memory bandwith, 59 bytes
of icache and zero data, against 7 bytes of icache and 512bytes of data.
the 512bytes of data are visibly slower, period. Saving mem bandwith is
much more important than reducing the number of instructions, even more
on SMP!

On the x86-64 it will be exactly the same, the mem bandwith is much
higher, but it's still a major bottleneck compared to a few more
instructions to execute (also consider the current ia32 isn't probably
reading the xmm8-15 slots, so on x86-64 fxrestor will be potentially
even slower compared to the in-core xorps that never do any I/O to
memory).

Go ahead and try yourself setting the #if to 0 or 1. The benchmark is
very accurate.

#include <sys/mman.h>
#include <asm/msr.h>

struct i387_fxsave_struct {
unsigned short cwd;
unsigned short swd;
unsigned short twd;
unsigned short fop;
long fip;
long fcs;
long foo;
long fos;
long mxcsr;
long reserved;
long st_space[32]; /* 8*16 bytes for each FP-reg = 128 bytes */
long xmm_space[32]; /* 8*16 bytes for each XMM-reg = 128 bytes */
long padding[56];
} __attribute__ ((aligned (16)));

#define LOOPS 1

#define load_mxcsr( val ) do { \
unsigned long __mxcsr = ((unsigned long)(val) & 0xffbf); \
asm volatile( "ldmxcsr %0" : : "m" (__mxcsr) ); \
} while (0)

struct i387_fxsave_struct i387;
char buf[1024*1024*40];

static void cold_dcache(void)
{
memset(buf, 0, 1024*1024*40);
}

main()
{
unsigned long before, after;
int i;

iopl(3);
mlockall(MCL_FUTURE);

#if 1
cold_dcache();
asm("cli");

rdtscl(before);

__asm__("fninit");

asm volatile(
"pxor %mm0, %mm0 \n"
"xorps %xmm0, %xmm0 \n"
"pxor %mm1, %mm1 \n"
"xorps %xmm1, %xmm1 \n"
"pxor %mm2, %mm2 \n"
"xorps %xmm2, %xmm2 \n"
"pxor %mm3, %mm3 \n"
"xorps %xmm3, %xmm3 \n"
"pxor %mm4, %mm4 \n"
"xorps %xmm4, %xmm4 \n"
"pxor %mm5, %mm5 \n"
"xorps %xmm5, %xmm5 \n"
"pxor %mm6, %mm6 \n"
"xorps %xmm6, %xmm6 \n"
"pxor %mm7, %mm7 \n"
"xorps %xmm7, %xmm7 \n"
"emms \n");
load_mxcsr(0x1f80);
rdtscl(after);
#else
asm volatile("fxsave %0" : : "m" (i387));

cold_dcache();
asm("cli");
rdtscl(before);

asm volatile("fxrstor %0" : "=m" (i387));

rdtscl(after);
#endif

printf("cycles %lu\n", after-before);
}

As said for SSE3 a kernel modification will be required anyways if xorps
isn't enough to initialize the new fpu, so the current 2.4 kernel cannot
cope with it anyways regardless of fxrestor or not.

So I don't buy that fxrestor is better, nor more generic.

Andrea

2002-04-20 22:43:31

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] Re: SSE related security hole

Andrea Arcangeli wrote:
>
> As said in an earlier email it is a matter of memory bandwith, 59 bytes
> of icache and zero data, against 7 bytes of icache and 512bytes of data.
> the 512bytes of data are visibly slower, period. Saving mem bandwith is
> much more important than reducing the number of instructions, even more
> on SMP!
>

It's not 512 bytes of data -- only the part that's actually used is
accessed.

-hpa

2002-04-20 23:14:06

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Re: SSE related security hole



On Sat, 20 Apr 2002, Andi Kleen wrote:

> > Besides, I seriously doubt it is any faster than what is there already.
> >
> > Time it, and notice how:
> >
> > - fninit takes about 200 cycles
> > - fxrstor takes about 215 cycles
>
> On what CPU?

Sorry, should have mentioned. That's a P4.

> I checked the Athlon4 optimization manual and fxrstor is listed as 68/108
> cycles (i guess depending on whether there is XMM state or not so 68 cycles
> probably apply here) and fninit as 91 cycles. It doesn't list the SSE1
> timings, but i guess the instructions don't take more than 3 cycles
> (MMX instructions take that long). So Andrea's way should be
> 91+16*3=139+some cycles for emms (or 107 if sse ops take only a single cycle)
> vs 68 or 108. So the fxrstor wins well.

The athlon should be able to do two MMX / cycle (the throughput is much
lower, but there are no data dependencies here).

> > In short, your "fast" code isn't actually any faster than doing it right.
>
> At least on Athlon it should be slower.

I suspect that the real answer is "the speed difference is _way_ in the
noise".

Linus

2002-04-20 23:24:02

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Re: SSE related security hole



On Sat, 20 Apr 2002, Andrea Arcangeli wrote:
>
> pxor+xorps is definitely faster than fxrestor on athlon-mp.

Andrea, that's not the _comparison_.

The "fxrestor" replaces the "fninit" too, so you have to take that into
account.

> fxrestor on athlon-mp 1600, on cold cache (the "default fpu state" will
> be cold most of the time, it's only ever used at the first math fault of
> a task):

Except it's _never_ cold-cache the way it's coded now. In fact it's always
hot-cache - which are exactly the numbers I posted.

> Go ahead and try yourself setting the #if to 0 or 1. The benchmark is
> very accurate.

It may have high precision, but since it's testing something that has
nothing to do with the problem at hand, it's basically 100% useless.

Linus

2002-04-21 02:10:31

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] Re: SSE related security hole

On Sat, Apr 20, 2002 at 03:43:17PM -0700, H. Peter Anvin wrote:
> Andrea Arcangeli wrote:
> >
> >As said in an earlier email it is a matter of memory bandwith, 59 bytes
> >of icache and zero data, against 7 bytes of icache and 512bytes of data.
> >the 512bytes of data are visibly slower, period. Saving mem bandwith is
> >much more important than reducing the number of instructions, even more
> >on SMP!
> >
>
> It's not 512 bytes of data -- only the part that's actually used is
> accessed.

On current x86 yes (so far), but the x86-64 the whole 512bytes will have
to be read from ram instead.

Andrea

2002-04-21 02:10:28

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] Re: SSE related security hole

On Sat, Apr 20, 2002 at 04:23:50PM -0700, Linus Torvalds wrote:
>
>
> On Sat, 20 Apr 2002, Andrea Arcangeli wrote:
> >
> > pxor+xorps is definitely faster than fxrestor on athlon-mp.
>
> Andrea, that's not the _comparison_.
>
> The "fxrestor" replaces the "fninit" too, so you have to take that into
> account.

Note that I just took it into account:

rdtscl(before);

__asm__("fninit");
^^^^^^^^^^^^^^^^^

asm volatile(
"pxor %mm0, %mm0 \n"
"xorps %xmm0, %xmm0 \n"
"pxor %mm1, %mm1 \n"
"xorps %xmm1, %xmm1 \n"
"pxor %mm2, %mm2 \n"
"xorps %xmm2, %xmm2 \n"
"pxor %mm3, %mm3 \n"
"xorps %xmm3, %xmm3 \n"
"pxor %mm4, %mm4 \n"
"xorps %xmm4, %xmm4 \n"
"pxor %mm5, %mm5 \n"
"xorps %xmm5, %xmm5 \n"
"pxor %mm6, %mm6 \n"
"xorps %xmm6, %xmm6 \n"
"pxor %mm7, %mm7 \n"
"xorps %xmm7, %xmm7 \n"
"emms \n");
load_mxcsr(0x1f80);
rdtscl(after);

>
> > fxrestor on athlon-mp 1600, on cold cache (the "default fpu state" will
> > be cold most of the time, it's only ever used at the first math fault of
> > a task):
>
> Except it's _never_ cold-cache the way it's coded now. In fact it's always
> hot-cache - which are exactly the numbers I posted.

In the common case fork/clone are executed at a frequency that makes it
a cold cache.

Anyways since you are apparently constantly forking off a new task every
10 milliseconds on your machine, on the PIII with the hot cache it's 97
cycles for fxrestor, and 90 cycles of pxor/xorps/fninit so it's still
faster, but not much faster anymore. With the athlon-mp it's 120 cycles
the xorps/pxor/fninit and 85 cycles fxrestor, but that's only because
the fninit is very slow on the athlon-mp, probably not the case for
x86-64 (while the xorps/pxor is just much faster on the athlon-mp
compared to the PIII). The pxor/xorps alone on the athlon-mp takes only
17 cycles! the fninit alone instead takes 99 cycles. 512bytes of ram on
x86-64 (around half on x86) cannot be optimized away by a new cpu, the
bus will get the hit regardless, while fninit has the potential to be
much faster on the new cpu (just like it's much faster on the PIII and
infact it's a win). I guess the P4 will give hot cache results similar
to the PIII with the hot-cache, and anyways there's no doubt on the huge
speedup with the _common_ case cold-cache, even more obviously on x86-64
that will really have to read the _whole_ 512bytes, not (yet) the case
for x86. I'm astonished you prefer to take the hit on the ram bus.

here the hot cache benchmark that I used so you can test yourself:

#include <sys/mman.h>
#include <asm/msr.h>

struct i387_fxsave_struct {
unsigned short cwd;
unsigned short swd;
unsigned short twd;
unsigned short fop;
long fip;
long fcs;
long foo;
long fos;
long mxcsr;
long reserved;
long st_space[32]; /* 8*16 bytes for each FP-reg = 128 bytes */
long xmm_space[32]; /* 8*16 bytes for each XMM-reg = 128 bytes */
long padding[56];
} __attribute__ ((aligned (16)));

#define LOOPS 200

#define load_mxcsr( val ) do { \
unsigned long __mxcsr = ((unsigned long)(val) & 0xffbf); \
asm volatile( "ldmxcsr %0" : : "m" (__mxcsr) ); \
} while (0)

struct i387_fxsave_struct i387;
char buf[1024*1024*40];

static void cold_dcache(void)
{
memset(buf, 0, 1024*1024*40);
}

main()
{
unsigned long before, after;
int i;

#if 1
for (i = 0; i < LOOPS; i++) {
rdtscl(before);

__asm__("fninit");

#if 1
asm volatile("pxor %mm0, %mm0 \n"
"pxor %mm1, %mm1 \n"
"pxor %mm2, %mm2 \n"
"pxor %mm3, %mm3 \n"
"pxor %mm4, %mm4 \n"
"pxor %mm5, %mm5 \n"
"pxor %mm6, %mm6 \n"
"pxor %mm7, %mm7 \n"
"xorps %xmm0, %xmm0 \n"
"xorps %xmm1, %xmm1 \n"
"xorps %xmm2, %xmm2 \n"
"xorps %xmm3, %xmm3 \n"
"xorps %xmm4, %xmm4 \n"
"xorps %xmm5, %xmm5 \n"
"xorps %xmm6, %xmm6 \n"
"xorps %xmm7, %xmm7 \n"
"emms \n");
load_mxcsr(0x1f80);
#endif
rdtscl(after);
}
#else
asm volatile("fxsave %0" : : "m" (i387));

for (i = 0; i < LOOPS; i++) {
rdtscl(before);

asm volatile("fxrstor %0" : "=m" (i387));

rdtscl(after);
}
#endif

printf("cycles %lu\n", after-before);
}

> It may have high precision, but since it's testing something that has
> nothing to do with the problem at hand, it's basically 100% useless.

Andrea

2002-04-21 21:12:23

by Pavel Machek

[permalink] [raw]
Subject: Re: SSE related security hole

Hi!

> > Um, people here seem to be assuming that, in the absence of MMX,
> > fninit *doesn't* leak information.
> >
> > I thought it was well-known to just clear (set to all-ones) the
> > tag register and not alter the actual floating-point registers.
> >
> > Thus, it seems quite feasible to reset the tag word with FLDENV and
> > store out the FPU registers, even on an 80387.
> >
> > Isn't this the same security hole? Shouldn't there be 8 FLDZ instructions
> > (or equivalent) in the processor state initialization?
>
> Well, if what's on the internal stack of the FPU can actually leak
> information, I think the notion of "leak" has expanded just a bit
> too much.
>
> A rogue process could not even know what instruction was about to
> be executed, nor what the previous instruction was, nor when since
> boot it was executed, nor by whom. The 'data' associated with those

If fpu unit was used to memcpy your .ssh/identity, well, you might
change your mind.
Pavel
--
(about SSSCA) "I don't say this lightly. However, I really think that the U.S.
no longer is classifiable as a democracy, but rather as a plutocracy." --hpa

2002-04-21 21:14:13

by Pavel Machek

[permalink] [raw]
Subject: Re: SSE related security hole

Hi!

> > It introduces security hole: Unrelated tasks now have your top secret
> > value you stored in one of your registers.
>
> Well, that's been my point all along and why I sent the patch. I was not
> asking why leaving the registers alone instead of 0ing them out was not a
> security hole. I was asking why doing so was not backward compatible?

Introducing security hole counts as "poor backcompatibility" to me.
Pavel
--
(about SSSCA) "I don't say this lightly. However, I really think that the U.S.
no longer is classifiable as a democracy, but rather as a plutocracy." --hpa

2002-04-21 22:20:28

by daw

[permalink] [raw]
Subject: Re: SSE related security hole

Richard B. Johnson wrote:
>Well, if what's on the internal stack of the FPU can actually leak
>information, I think the notion of "leak" has expanded just a bit
>too much.

Note that some crypto implemenations use the FPU heavily to speed up
the encryption process. Thus, if FPU data can leak, secret keys are
at risk. I don't know about you, but that doesn't sound good to me.

2002-04-22 22:25:05

by Saxena, Sunil

[permalink] [raw]
Subject: RE: SSE related security hole

Hi Everyone,

Sorry it took us some time to respond to this issue.

The Intel prescribed method for correct execution of SSE/SSE2 instructions
requires that software detect the support for these instructions via the
CPUID feature flags bits, prior to executing the instruction (see
<<ftp://download.intel.com/design/perftool/cbts/appnotes/ap900/cpuosid.pdf>>
). Also, Section 2.2 of the IA-32 Intel(R) Architecture Software Developer's
Manual, Vol 2, indicates that the use of the operand size prefix with
MMX/SSE/SSE2 instructions is reserved and may cause unpredictable behavior.


We recognized that there is a discrepancy in the individual instruction
descriptions in Vol 2 where it is indicated that the instruction would
generate a UD#. We will be rectifying this discrepancy in the next revision
of Vol 2 as well as via the monthly Specification Updates.

Thanks
Sunil


-----Original Message-----
From: Doug Ledford [mailto:[email protected]]
Sent: Wednesday, April 17, 2002 4:56 PM
To: [email protected]
Cc: [email protected]; [email protected]; [email protected]; [email protected];
[email protected]
Subject: Re: SSE related security hole


Subject: Re: SSE related security hole
Reply-To:
In-Reply-To: <[email protected]>; from
[email protected] on Wed, Apr 17, 2002 at 04:13:03PM +0100

(NOTE: This may have already been answered by someone else, but I haven't
seen it if it has, so I'm sending this through)

> Hi,
> while debugging GCC bugreport, I noticed following behaviour of simple
> program with no syscalls:
>
> hubicka@nikam:~$ ./a.out
> sum of 7 ints: 28
> hubicka@nikam:~$ ./a.out
> sum of 7 ints: 56
> Bad sum (seen with gcc -O -march=pentiumpro -msse)
> hubicka@nikam:~$ ./a.out
> sum of 7 ints: 84
> Bad sum (seen with gcc -O -march=pentiumpro -msse)
> hubicka@nikam:~$ ./a.out
> sum of 7 ints: 112
> Bad sum (seen with gcc -O -march=pentiumpro -msse)
> hubicka@nikam:~$ echo
>
> hubicka@nikam:~$ ./a.out
> sum of 7 ints: 28
>
>
> ie it always returns different value, moreover when something else
> is run in meantime (verified by loading WWW page served by same machine),
> the counter is reinitialized to 28.
>
> I am attaching the source, but it needs to be compiled by cfg-branch GCC
> with settings -O2 -march=pentium3 -mfpmath=sse, so I've placed static
> binary to http://atrey.karlin.mff.cuni.cz/~hubicka/badsum.bin

Compiling the asm source with a different compiler will also make it fail.

> The problem appears to be reproducible only on pentium3 and athlon4
systems,
> not pentium4 system, where it appears to work as expected. Reproduced on
> both 2.4.9-RH and 2.4.16 kernels.

[ program snipped ]

So there are two different issues at play here. First, the kernel uses
the fninit instruction to initialize the fpu on first use. Nothing in the
Intel manuals says anything about the fninit instruction clearing the mmx
or sse registers, and experimentally we now know for sure that it doesn't.
That means that when the first time your program was ran it left 28 in
register xmm1. The next time the program was run, the fninit did nothing
to clear register xmm1 so it still held 28. Now, the pxor instruction
that is part of the m() function and intended to 0 out the xmm1 register
is an sse2 instruction. It just so happens that it doesn't work on sse
only processors such as P3 CPUs. So, when 28 was left in xmm1, then the
pxor failed to 0 out xmm1, we saved 28 as the starting value for the loop
and then looped through 7 additions until we had, you guessed it, 56. In
fact, if you do a while :;do bad; done loop the it will increment by 28
each time it is run except when something else intervenes. Replacing the
pxor instruction with xorps instead makes it work. So, that's a bug in
gcc I suspect, using sse2 instructions when only called to use sse
instructions. It seems odd to me that the CPU wouldn't generate an
illegal instruction exception, but oh well, it evidently doesn't.

So, we really should change arch/i386/kernel/i387.c something like this:

(WARNING, totally untested and not even compile checked change follows)

--- i387.c.save Wed Apr 17 19:22:47 2002
+++ i387.c Wed Apr 17 19:28:27 2002
@@ -33,8 +33,26 @@
void init_fpu(void)
{
__asm__("fninit");
- if ( cpu_has_xmm )
+ if ( cpu_has_mmx )
+ asm volatile("xorq %%mm0, %%mm0;
+ xorq %%mm1, %%mm1;
+ xorq %%mm2, %%mm2;
+ xorq %%mm3, %%mm3;
+ xorq %%mm4, %%mm4;
+ xorq %%mm5, %%mm5;
+ xorq %%mm6, %%mm6;
+ xorq %%mm7, %%mm7");
+ if ( cpu_has_xmm ) {
+ asm volatile("xorps %%xmm0, %%xmm0;
+ xorps %%xmm1, %%xmm1;
+ xorps %%xmm2, %%xmm2;
+ xorps %%xmm3, %%xmm3;
+ xorps %%xmm4, %%xmm4;
+ xorps %%xmm5, %%xmm5;
+ xorps %%xmm6, %%xmm6;
+ xorps %%xmm7, %%xmm7");
load_mxcsr(0x1f80);
+ }

current->used_math = 1;
}


The rest of the problem is a gcc bug and possibly something that Intel
should make a note of on the p3 processors (that is, that the p3 will
silently fail to execute some sse2 instructions without generating the
expected exception).

--
Doug Ledford <[email protected]> 919-754-3700 x44233
Red Hat, Inc.
1801 Varsity Dr.
Raleigh, NC 27606




_______________________________________________
rhkernel-dept-list mailing list
[email protected]
http://post-office.corp.redhat.com/mailman/listinfo/rhkernel-dept-list

2002-04-22 22:51:35

by Andi Kleen

[permalink] [raw]
Subject: Initial process CPU state was Re: SSE related security hole

"Saxena, Sunil" <[email protected]> writes:

Hallo Sunil,

> We recognized that there is a discrepancy in the individual instruction
> descriptions in Vol 2 where it is indicated that the instruction would
> generate a UD#. We will be rectifying this discrepancy in the next revision
> of Vol 2 as well as via the monthly Specification Updates.

Could you quickly describe what the Intel recommended way is to clear
the whole CPU at the beginning of a process? Is there a better way
than "save state with fxsave at bootup and restore into each
new process"? After all it would be a bit unfortunate to have
instructions which are transparently tolerant to new CPU state (fxsave/fxrstor
for context switching), but no matching way to clear the same state for
security reasons. Using the bootup FXSAVE image would make linux
depend on the BIOS for this (so in the worst case when the bios
doesn't clear e.g. the XMM registers or some future registers each
process could see the state of some previous boot after a warm boot)

Another way would be to do a fxsave after clearing of known state (x87,MMX,
SSE) at OS bootup and then afterwards set all the so far reserved parts of the
FXSAVE image to zero. Then restore this image later into each new process.
This would avoid any BIOS/direct warmboot dependencies. It would work
assuming that all future IA32 state can be safely initialized with zeroes
via FXRSTOR. Is this a safe assumption?

Thanks,
-Andi

2002-04-22 23:29:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: Initial process CPU state was Re: SSE related security hole

In article <[email protected]>,
Andi Kleen <[email protected]> wrote:
>
>Could you quickly describe what the Intel recommended way is to clear
>the whole CPU at the beginning of a process? Is there a better way
>than "save state with fxsave at bootup and restore into each
>new process"?

Note that I will _not_ accept a patch that does the "fxsave at bootup"
thing, because since Linux doesn't actually control the bootup, and
since it gets more an dmore likely that the BIOS will actually use the
SSE etc registers, a boot-time "fxsave" will mean that different
machines will have potentially quite different process initialization.

In fact, even the same machine might act differently depending on how it
was booted up (ie warm vs cold vs resume boot, different BIOS path due
to different BIOS options etc).

Now, that wouldn't be a security hole per se, but it would be hell to
debug problems ("My other machine that is identical doesn't show that
bug").

Basically there _needs_ to be an architected way to ensure that the FP
data is in a known and valid state.

(The "fxsave early" approach results in a valid - but not known -
state).

>Another way would be to do a fxsave after clearing of known state (x87,MMX,
>SSE) at OS bootup and then afterwards set all the so far reserved parts of the
>FXSAVE image to zero.

This is basically what we do right now (ie as of 2.5.9, just released).

Except we set it to zero before, since the state we _do_ know about (ie
current x87, MMX, SSE) is initialized to exactly the state you mention
(by hand), and then the rest is just initialized to zero.

Linus

2002-04-23 19:22:09

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Re: SSE related security hole



On Sat, 20 Apr 2002, Andrea Arcangeli wrote:
>
> I mean, if they change the registers layout, and so if they require a
> different empty FPU state, they must as well add yet another bitflag to
> enable SSE3, if they don't the chip isn't backwards compatible.

I have unofficial confirmation from Intel that the way to architecturally
initialize the FPU state is indeed something like

memset(&fxsave, 0, sizeof(struct i387_fxsave_struct));
fxsave.cwd = 0x37f;
fxsave.mxcsr = 0x1f80;
fxrstor(&fxsave);

and the person in question is trying to make sure this is documented so
that we won't be bitten by this in the future.

Linus

2002-04-23 20:06:01

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] Re: SSE related security hole

Linus Torvalds wrote:
>
> On Sat, 20 Apr 2002, Andrea Arcangeli wrote:
>
>>I mean, if they change the registers layout, and so if they require a
>>different empty FPU state, they must as well add yet another bitflag to
>>enable SSE3, if they don't the chip isn't backwards compatible.
>
>
> I have unofficial confirmation from Intel that the way to architecturally
> initialize the FPU state is indeed something like
>
> memset(&fxsave, 0, sizeof(struct i387_fxsave_struct));
> fxsave.cwd = 0x37f;
> fxsave.mxcsr = 0x1f80;
> fxrstor(&fxsave);
>
> and the person in question is trying to make sure this is documented so
> that we won't be bitten by this in the future.
>

Great!

-hpa


2002-04-23 22:07:47

by Saxena, Sunil

[permalink] [raw]
Subject: RE: Initial process CPU state was Re: SSE related security hole

The correct sequence for initializing MMX/SSE/SSE2 CPU state (I exchanged
mail with Linus) is:

memset(&fxsave, 0, sizeof(struct i387_fxsave_struct));
fxsave.cwd = 0x37f;
fxsave.mxcsr = 0x1f80;

fxrstor(&fxsave);

The above is architectural feature and you can expect this to work in the
future. Intel will work to document this in our Monthly Specification
Updates and update "IA-32 Intel(R) Architecture Software Developer Manual"s.

Thanks
Sunil

-----Original Message-----
From: Andi Kleen [mailto:[email protected]]
Sent: Monday, April 22, 2002 3:51 PM
To: Saxena, Sunil
Cc: [email protected]
Subject: Initial process CPU state was Re: SSE related security hole


"Saxena, Sunil" <[email protected]> writes:

Hallo Sunil,

> We recognized that there is a discrepancy in the individual instruction
> descriptions in Vol 2 where it is indicated that the instruction would
> generate a UD#. We will be rectifying this discrepancy in the next
revision
> of Vol 2 as well as via the monthly Specification Updates.

Could you quickly describe what the Intel recommended way is to clear
the whole CPU at the beginning of a process? Is there a better way
than "save state with fxsave at bootup and restore into each
new process"? After all it would be a bit unfortunate to have
instructions which are transparently tolerant to new CPU state
(fxsave/fxrstor
for context switching), but no matching way to clear the same state for
security reasons. Using the bootup FXSAVE image would make linux
depend on the BIOS for this (so in the worst case when the bios
doesn't clear e.g. the XMM registers or some future registers each
process could see the state of some previous boot after a warm boot)

Another way would be to do a fxsave after clearing of known state (x87,MMX,
SSE) at OS bootup and then afterwards set all the so far reserved parts of
the
FXSAVE image to zero. Then restore this image later into each new process.
This would avoid any BIOS/direct warmboot dependencies. It would work
assuming that all future IA32 state can be safely initialized with zeroes
via FXRSTOR. Is this a safe assumption?

Thanks,
-Andi

2002-04-24 00:33:22

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] Re: SSE related security hole

On Tue, Apr 23, 2002 at 12:21:29PM -0700, Linus Torvalds wrote:
>
>
> On Sat, 20 Apr 2002, Andrea Arcangeli wrote:
> >
> > I mean, if they change the registers layout, and so if they require a
> > different empty FPU state, they must as well add yet another bitflag to
> > enable SSE3, if they don't the chip isn't backwards compatible.
>
> I have unofficial confirmation from Intel that the way to architecturally
> initialize the FPU state is indeed something like
>
> memset(&fxsave, 0, sizeof(struct i387_fxsave_struct));
> fxsave.cwd = 0x37f;
> fxsave.mxcsr = 0x1f80;
> fxrstor(&fxsave);
>
> and the person in question is trying to make sure this is documented so
> that we won't be bitten by this in the future.

Ok, thanks for the info. The advantage (now that it is documented! :) is
that they can add the xmm8-15 reigsters of x86-64 to x86 too without
requiring any change to linux. On the linux side we obviously don't care
if they document it retroactive as an errata of sse docs, so for us it
doesn't matter even if they don't add an additional bitflag before
adding the xmm8-15 registers. Basically they should only deal with the
other operative systems now.

Andrea

2002-04-24 02:11:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Re: SSE related security hole



On Wed, 24 Apr 2002, Andrea Arcangeli wrote:
>
> Basically they should only deal with the
> other operative systems now.

Yeah. And since the security hole is fairly small and insignificant in
comparison to some others, I suspect that the main "other operating
system" just won't care one way or the other.

Linus

2002-04-24 19:21:42

by Bill Davidsen

[permalink] [raw]
Subject: RE: Initial process CPU state was Re: SSE related security hole

On Tue, 23 Apr 2002, Saxena, Sunil wrote:

> The correct sequence for initializing MMX/SSE/SSE2 CPU state (I exchanged
> mail with Linus) is:
>
> memset(&fxsave, 0, sizeof(struct i387_fxsave_struct));
> fxsave.cwd = 0x37f;
> fxsave.mxcsr = 0x1f80;
>
> fxrstor(&fxsave);
>
> The above is architectural feature and you can expect this to work in the
> future. Intel will work to document this in our Monthly Specification
> Updates and update "IA-32 Intel(R) Architecture Software Developer Manual"s.

Sure is nice to see some vendor support!

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-04-26 11:28:48

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] Re: SSE related security hole

Hi!

> > Basically they should only deal with the
> > other operative systems now.
>
> Yeah. And since the security hole is fairly small and insignificant in
> comparison to some others, I suspect that the main "other operating
> system" just won't care one way or the other.

*If* the other system uses fpu unit for memcpy, well, it may get
pretty big and significant.
Pavel
--
(about SSSCA) "I don't say this lightly. However, I really think that the U.S.
no longer is classifiable as a democracy, but rather as a plutocracy." --hpa

2002-04-26 11:55:41

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] Re: SSE related security hole

On Fri, Apr 26, 2002 at 11:13:42AM +0200, Pavel Machek wrote:
> *If* [..]

you missed it, you should read what's written behind the lines too 8)

Andrea