2018-01-26 15:37:15

by Dan Rue

Subject: selftests/x86/fsgsbase_64 test problem

We've noticed that fsgsbase_64 can fail intermittently with the
following error:

[RUN] ARCH_SET_GS(0x0) and clear gs, then schedule to 0x1
Before schedule, set selector to 0x1
other thread: ARCH_SET_GS(0x1) -- sel is 0x0
[FAIL] GS/BASE changed from 0x1/0x0 to 0x0/0x0

This can be reliably reproduced by running fsgsbase_64 in a loop, e.g.:

for i in $(seq 1 10000); do ./fsgsbase_64 || break; done

This problem isn't new - I've reproduced it on the latest mainline and every
release going back to v4.12 (I did not try earlier). This was tested on
a Supermicro board with a Xeon E3-1220 as well as an Intel NUC with an
i3-5010U.

Thanks,
Dan


2018-01-26 16:23:51

by Andy Lutomirski

Subject: Re: selftests/x86/fsgsbase_64 test problem

On Fri, Jan 26, 2018 at 7:36 AM, Dan Rue <[email protected]> wrote:
>
> We've noticed that fsgsbase_64 can fail intermittently with the
> following error:
>
> [RUN] ARCH_SET_GS(0x0) and clear gs, then schedule to 0x1
> Before schedule, set selector to 0x1
> other thread: ARCH_SET_GS(0x1) -- sel is 0x0
> [FAIL] GS/BASE changed from 0x1/0x0 to 0x0/0x0
>
> This can be reliably reproduced by running fsgsbase_64 in a loop, e.g.:
>
> for i in $(seq 1 10000); do ./fsgsbase_64 || break; done
>
> This problem isn't new - I've reproduced it on the latest mainline and every
> release going back to v4.12 (I did not try earlier). This was tested on
> a Supermicro board with a Xeon E3-1220 as well as an Intel NUC with an
> i3-5010U.
>

Hmm, I can reproduce it, too. I'll look in a bit.

2018-01-26 19:00:39

by Andy Lutomirski

Subject: Re: selftests/x86/fsgsbase_64 test problem

On Fri, Jan 26, 2018 at 8:22 AM, Andy Lutomirski <[email protected]> wrote:
> On Fri, Jan 26, 2018 at 7:36 AM, Dan Rue <[email protected]> wrote:
>>
>> We've noticed that fsgsbase_64 can fail intermittently with the
>> following error:
>>
>> [RUN] ARCH_SET_GS(0x0) and clear gs, then schedule to 0x1
>> Before schedule, set selector to 0x1
>> other thread: ARCH_SET_GS(0x1) -- sel is 0x0
>> [FAIL] GS/BASE changed from 0x1/0x0 to 0x0/0x0
>>
>> This can be reliably reproduced by running fsgsbase_64 in a loop, e.g.:
>>
>> for i in $(seq 1 10000); do ./fsgsbase_64 || break; done
>>
>> This problem isn't new - I've reproduced it on the latest mainline and every
>> release going back to v4.12 (I did not try earlier). This was tested on
>> a Supermicro board with a Xeon E3-1220 as well as an Intel NUC with an
>> i3-5010U.
>>
>
> Hmm, I can reproduce it, too. I'll look in a bit.

I'm triggering a different error, and I think what's going on is that
the kernel doesn't currently re-save GSBASE when a task switches out
and that task has saved gsbase != 0 and in-register GS == 0. This is
arguably a bug, but it's not an infoleak, and fixing it could be a wee
bit expensive. I'm not sure what, if anything, to do about this. I
suppose I could add some gross perf hackery to the test to detect this
case and suppress the error.

I can also trigger the problem you're seeing, and I don't know what's
up. It may be related to an old problem I've seen that causes signal
delivery to sometimes corrupt %gs. It's deterministic, but it depends
in some odd way on register state. I can currently reproduce that
issue 100% of the time, and I'm trying to see if I can figure out
what's happening.

2018-01-26 19:48:41

by Andy Lutomirski

Subject: Re: selftests/x86/fsgsbase_64 test problem

On Fri, Jan 26, 2018 at 10:59 AM, Andy Lutomirski <[email protected]> wrote:
> On Fri, Jan 26, 2018 at 8:22 AM, Andy Lutomirski <[email protected]> wrote:
>> On Fri, Jan 26, 2018 at 7:36 AM, Dan Rue <[email protected]> wrote:
>>>
>>> We've noticed that fsgsbase_64 can fail intermittently with the
>>> following error:
>>>
>>> [RUN] ARCH_SET_GS(0x0) and clear gs, then schedule to 0x1
>>> Before schedule, set selector to 0x1
>>> other thread: ARCH_SET_GS(0x1) -- sel is 0x0
>>> [FAIL] GS/BASE changed from 0x1/0x0 to 0x0/0x0
>>>
>>> This can be reliably reproduced by running fsgsbase_64 in a loop, e.g.:
>>>
>>> for i in $(seq 1 10000); do ./fsgsbase_64 || break; done
>>>
>>> This problem isn't new - I've reproduced it on the latest mainline and every
>>> release going back to v4.12 (I did not try earlier). This was tested on
>>> a Supermicro board with a Xeon E3-1220 as well as an Intel NUC with an
>>> i3-5010U.
>>>
>>
>> Hmm, I can reproduce it, too. I'll look in a bit.
>
> I'm triggering a different error, and I think what's going on is that
> the kernel doesn't currently re-save GSBASE when a task switches out
> and that task has saved gsbase != 0 and in-register GS == 0. This is
> arguably a bug, but it's not an infoleak, and fixing it could be a wee
> bit expensive. I'm not sure what, if anything, to do about this. I
> suppose I could add some gross perf hackery to the test to detect this
> case and suppress the error.
>
> I can also trigger the problem you're seeing, and I don't know what's
> up. It may be related to an old problem I've seen that causes signal
> delivery to sometimes corrupt %gs. It's deterministic, but it depends
> in some odd way on register state. I can currently reproduce that
> issue 100% of the time, and I'm trying to see if I can figure out
> what's happening.

I think it's a CPU bug, and I'm a bit mystified. I can trigger the
following, plausibly related issue:

Write a program that writes %gs = 1.
Run that program under gdb
break in which %gs == 1
display/x $gs
si

Under QEMU TCG, gs stays equal to 1. On native or KVM, on Skylake, it
changes to 0.

On KVM or native, I do not observe do_debug getting called with %gs ==
1. On TCG, I do. I don't think that's precisely the problem that's
causing the test to fail, since the test doesn't use TF or ptrace, but
I wouldn't be shocked if it's related.

hpa, any insight?

(NB: if you want to play with this as I've described it, you may need
to make invalid_selector() in ptrace.c always return false. The
current implementation is too strict and causes problems.)

2018-01-26 22:39:30

by Andy Lutomirski

Subject: Re: selftests/x86/fsgsbase_64 test problem

On Fri, Jan 26, 2018 at 11:46 AM, Andy Lutomirski <[email protected]> wrote:
> On Fri, Jan 26, 2018 at 10:59 AM, Andy Lutomirski <[email protected]> wrote:
>> On Fri, Jan 26, 2018 at 8:22 AM, Andy Lutomirski <[email protected]> wrote:
>>> On Fri, Jan 26, 2018 at 7:36 AM, Dan Rue <[email protected]> wrote:
>>>>
>>>> We've noticed that fsgsbase_64 can fail intermittently with the
>>>> following error:
>>>>
>>>> [RUN] ARCH_SET_GS(0x0) and clear gs, then schedule to 0x1
>>>> Before schedule, set selector to 0x1
>>>> other thread: ARCH_SET_GS(0x1) -- sel is 0x0
>>>> [FAIL] GS/BASE changed from 0x1/0x0 to 0x0/0x0
>>>>
>>>> This can be reliably reproduced by running fsgsbase_64 in a loop, e.g.:
>>>>
>>>> for i in $(seq 1 10000); do ./fsgsbase_64 || break; done
>>>>
>>>> This problem isn't new - I've reproduced it on the latest mainline and every
>>>> release going back to v4.12 (I did not try earlier). This was tested on
>>>> a Supermicro board with a Xeon E3-1220 as well as an Intel NUC with an
>>>> i3-5010U.
>>>>
>>>
>>> Hmm, I can reproduce it, too. I'll look in a bit.
>>
>> I'm triggering a different error, and I think what's going on is that
>> the kernel doesn't currently re-save GSBASE when a task switches out
>> and that task has saved gsbase != 0 and in-register GS == 0. This is
>> arguably a bug, but it's not an infoleak, and fixing it could be a wee
>> bit expensive. I'm not sure what, if anything, to do about this. I
>> suppose I could add some gross perf hackery to the test to detect this
>> case and suppress the error.
>>
>> I can also trigger the problem you're seeing, and I don't know what's
>> up. It may be related to an old problem I've seen that causes signal
>> delivery to sometimes corrupt %gs. It's deterministic, but it depends
>> in some odd way on register state. I can currently reproduce that
>> issue 100% of the time, and I'm trying to see if I can figure out
>> what's happening.
>
> I think it's a CPU bug, and I'm a bit mystified. I can trigger the
> following, plausibly related issue:
>
> Write a program that writes %gs = 1.
> Run that program under gdb
> break in which %gs == 1
> display/x $gs
> si
>
> Under QEMU TCG, gs stays equal to 1. On native or KVM, on Skylake, it
> changes to 0.
>
> On KVM or native, I do not observe do_debug getting called with %gs ==
> 1. On TCG, I do. I don't think that's precisely the problem that's
> causing the test to fail, since the test doesn't use TF or ptrace, but
> I wouldn't be shocked if it's related.
>
> hpa, any insight?
>
> (NB: if you want to play with this as I've described it, you may need
> to make invalid_selector() in ptrace.c always return false. The
> current implementation is too strict and causes problems.)

Much simpler test. Run the attached program (gs1). It more or less
just sets %gs to 1 and spins until it stops being 1. Do it on a
kernel with the attached patch applied. I see stuff like this:

# ./gs1
PID = 129
[ 15.703015] pid 129 saved gs = 1
[ 15.703517] pid 129 loaded gs = 1
[ 15.703973] pid 129 prepare_exit_to_usermode: gs = 1
ax = 0, cx = 0, dx = 0

So we're interrupting the program, switching out, switching back in,
setting %gs to 1, observing that %gs is *still* 1 in
prepare_exit_to_usermode(), returning to usermode, and observing %gs
== 0.

Presumably what's happening is that the IRET microcode matches the
SDM's pseudocode, which says:

RETURN-TO-OUTER-PRIVILEGE-LEVEL:
...
FOR each SegReg in (ES, FS, GS, and DS)
DO
tempDesc ← descriptor cache for SegReg (* hidden part of segment register *)
IF tempDesc(DPL) < CPL AND tempDesc(Type) is data or non-conforming code
THEN (* Segment register invalid *)
SegReg ← NULL;
FI;
OD;

But this is very odd. The actual permission checks (in the docs for MOV) are:

IF DS, ES, FS, or GS is loaded with non-NULL selector
THEN
IF segment selector index is outside descriptor table limits
or segment is not a data or readable code segment
or ((segment is a data or nonconforming code segment)
or ((RPL > DPL) and (CPL > DPL))
THEN #GP(selector); FI;

^^^^
This makes no sense. This says that the data segments cannot be
loaded with MOV. Empirically, it seems like MOV works if CPL <= DPL
and RPL <= DPL, but I haven't checked that hard.

IF segment not marked present
THEN #NP(selector);
ELSE
SegmentRegister ← segment selector;
SegmentRegister ← segment descriptor; FI;
FI;

IF DS, ES, FS, or GS is loaded with NULL selector
THEN
SegmentRegister ← segment selector;
SegmentRegister ← segment descriptor;
^^^^
wtf? There is no "segment descriptor". Presumably what actually
gets written to segment.DPL is nonsense.
FI;

Anyway, I think it's nonsense that user code can load a selector using
MOV that is, in turn, rejected by IRET. I don't suppose Intel would
consider fixing this going forward.

Borislav, any chance you could run the attached program on an AMD
machine to see what it does?


Attachments:
gs1.c (588.00 B)

2018-01-26 22:43:42

by Andy Lutomirski

Subject: Re: selftests/x86/fsgsbase_64 test problem

On Fri, Jan 26, 2018 at 2:38 PM, Andy Lutomirski <[email protected]> wrote:
> On Fri, Jan 26, 2018 at 11:46 AM, Andy Lutomirski <[email protected]> wrote:
>> On Fri, Jan 26, 2018 at 10:59 AM, Andy Lutomirski <[email protected]> wrote:
>>> On Fri, Jan 26, 2018 at 8:22 AM, Andy Lutomirski <[email protected]> wrote:
>>>> On Fri, Jan 26, 2018 at 7:36 AM, Dan Rue <[email protected]> wrote:
>>>>>
>>>>> We've noticed that fsgsbase_64 can fail intermittently with the
>>>>> following error:
>>>>>
>>>>> [RUN] ARCH_SET_GS(0x0) and clear gs, then schedule to 0x1
>>>>> Before schedule, set selector to 0x1
>>>>> other thread: ARCH_SET_GS(0x1) -- sel is 0x0
>>>>> [FAIL] GS/BASE changed from 0x1/0x0 to 0x0/0x0
>>>>>
>>>>> This can be reliably reproduced by running fsgsbase_64 in a loop, e.g.:
>>>>>
>>>>> for i in $(seq 1 10000); do ./fsgsbase_64 || break; done
>>>>>
>>>>> This problem isn't new - I've reproduced it on the latest mainline and every
>>>>> release going back to v4.12 (I did not try earlier). This was tested on
>>>>> a Supermicro board with a Xeon E3-1220 as well as an Intel NUC with an
>>>>> i3-5010U.
>>>>>
>>>>
>>>> Hmm, I can reproduce it, too. I'll look in a bit.
>>>
>>> I'm triggering a different error, and I think what's going on is that
>>> the kernel doesn't currently re-save GSBASE when a task switches out
>>> and that task has saved gsbase != 0 and in-register GS == 0. This is
>>> arguably a bug, but it's not an infoleak, and fixing it could be a wee
>>> bit expensive. I'm not sure what, if anything, to do about this. I
>>> suppose I could add some gross perf hackery to the test to detect this
>>> case and suppress the error.
>>>
>>> I can also trigger the problem you're seeing, and I don't know what's
>>> up. It may be related to an old problem I've seen that causes signal
>>> delivery to sometimes corrupt %gs. It's deterministic, but it depends
>>> in some odd way on register state. I can currently reproduce that
>>> issue 100% of the time, and I'm trying to see if I can figure out
>>> what's happening.
>>
>> I think it's a CPU bug, and I'm a bit mystified. I can trigger the
>> following, plausibly related issue:
>>
>> Write a program that writes %gs = 1.
>> Run that program under gdb
>> break in which %gs == 1
>> display/x $gs
>> si
>>
>> Under QEMU TCG, gs stays equal to 1. On native or KVM, on Skylake, it
>> changes to 0.
>>
>> On KVM or native, I do not observe do_debug getting called with %gs ==
>> 1. On TCG, I do. I don't think that's precisely the problem that's
>> causing the test to fail, since the test doesn't use TF or ptrace, but
>> I wouldn't be shocked if it's related.
>>
>> hpa, any insight?
>>
>> (NB: if you want to play with this as I've described it, you may need
>> to make invalid_selector() in ptrace.c always return false. The
>> current implementation is too strict and causes problems.)
>
> Much simpler test. Run the attached program (gs1). It more or less
> just sets %gs to 1 and spins until it stops being 1. Do it on a
> kernel with the attached patch applied. I see stuff like this:
>
> # ./gs1
> PID = 129
> [ 15.703015] pid 129 saved gs = 1
> [ 15.703517] pid 129 loaded gs = 1
> [ 15.703973] pid 129 prepare_exit_to_usermode: gs = 1
> ax = 0, cx = 0, dx = 0
>
> So we're interrupting the program, switching out, switching back in,
> setting %gs to 1, observing that %gs is *still* 1 in
> prepare_exit_to_usermode(), returning to usermode, and observing %gs
> == 0.
>
> Presumably what's happening is that the IRET microcode matches the
> SDM's pseudocode, which says:
>
> RETURN-TO-OUTER-PRIVILEGE-LEVEL:
> ...
> FOR each SegReg in (ES, FS, GS, and DS)
> DO
> tempDesc ← descriptor cache for SegReg (* hidden part of segment register *)
> IF tempDesc(DPL) < CPL AND tempDesc(Type) is data or non-conforming code
> THEN (* Segment register invalid *)
> SegReg ← NULL;
> FI;
> OD;
>
> But this is very odd. The actual permission checks (in the docs for MOV) are:
>
> IF DS, ES, FS, or GS is loaded with non-NULL selector
> THEN
> IF segment selector index is outside descriptor table limits
> or segment is not a data or readable code segment
> or ((segment is a data or nonconforming code segment)
> or ((RPL > DPL) and (CPL > DPL))
> THEN #GP(selector); FI;
>
> ^^^^
> This makes no sense. This says that the data segments cannot be
> loaded with MOV. Empirically, it seems like MOV works if CPL <= DPL
> and RPL <= DPL, but I haven't checked that hard.

Surely Intel meant:

... or ((segment is a data segment or nonconforming code segment) and
((RPL > DPL) or (CPL > DPL))

This would be consistent with the AMD APM #GP condition of "The DS,
ES, FS, or GS register was loaded and the segment pointed to was a
data or non-conforming code segment, but the RPL or CPL was greater
than the DPL."

>
> IF segment not marked present
> THEN #NP(selector);
> ELSE
> SegmentRegister ← segment selector;
> SegmentRegister ← segment descriptor; FI;
> FI;
>
> IF DS, ES, FS, or GS is loaded with NULL selector
> THEN
> SegmentRegister ← segment selector;
> SegmentRegister ← segment descriptor;
> ^^^^
> wtf? There is no "segment descriptor". Presumably what actually
> gets written to segment.DPL is nonsense.
> FI;

I think the bug is here. I think that, when writing a NULL selector
to DS, ES, FS, or GS, Intel CPUs incorrectly set DPL == RPL, whereas
they should set DPL to 3.

>
> Anyway, I think it's nonsense that user code can load a selector using
> MOV that is, in turn, rejected by IRET. I don't suppose Intel would
> consider fixing this going forward.
>
> Borislav, any chance you could run the attached program on an AMD
> machine to see what it does?

2018-01-26 22:56:37

by H. Peter Anvin

Subject: Re: selftests/x86/fsgsbase_64 test problem

On 01/26/18 11:46, Andy Lutomirski wrote:
>
> Under QEMU TCG, gs stays equal to 1. On native or KVM, on Skylake, it
> changes to 0.
>
> On KVM or native, I do not observe do_debug getting called with %gs ==
> 1. On TCG, I do. I don't think that's precisely the problem that's
> causing the test to fail, since the test doesn't use TF or ptrace, but
> I wouldn't be shocked if it's related.
>
> hpa, any insight?
>
> (NB: if you want to play with this as I've described it, you may need
> to make invalid_selector() in ptrace.c always return false. The
> current implementation is too strict and causes problems.)
>

Looking at it. I want to grok this in the general context of fsgsbase
as well, of course.

-hpa

2018-01-26 22:57:05

by Borislav Petkov

Subject: Re: selftests/x86/fsgsbase_64 test problem

On Fri, Jan 26, 2018 at 02:38:28PM -0800, Andy Lutomirski wrote:
> Borislav, any chance you could run the attached program on an AMD
> machine to see what it does?

[boris@pd: /tmp> ./gs1
PID = 3420
ax = 0, cx = 0, dx = 0

This is on 4.15-rc7-ish + tip/master from 2 wks ago.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2018-01-28 19:24:23

by Andy Lutomirski

Subject: Re: selftests/x86/fsgsbase_64 test problem

On Fri, Jan 26, 2018 at 2:56 PM, Borislav Petkov <[email protected]> wrote:
> On Fri, Jan 26, 2018 at 02:38:28PM -0800, Andy Lutomirski wrote:
>> Borislav, any chance you could run the attached program on an AMD
>> machine to see what it does?
>
> [boris@pd: /tmp> ./gs1
> PID = 3420
> ax = 0, cx = 0, dx = 0
>
> This is on 4.15-rc7-ish + tip/master from 2 wks ago.

Phooey :(

2018-01-28 19:30:51

by Andy Lutomirski

Subject: Re: selftests/x86/fsgsbase_64 test problem

On Fri, Jan 26, 2018 at 2:42 PM, Andy Lutomirski <[email protected]> wrote:
> On Fri, Jan 26, 2018 at 2:38 PM, Andy Lutomirski <[email protected]> wrote:
>> On Fri, Jan 26, 2018 at 11:46 AM, Andy Lutomirski <[email protected]> wrote:
>>> On Fri, Jan 26, 2018 at 10:59 AM, Andy Lutomirski <[email protected]> wrote:
>>>> On Fri, Jan 26, 2018 at 8:22 AM, Andy Lutomirski <[email protected]> wrote:
>>>>> On Fri, Jan 26, 2018 at 7:36 AM, Dan Rue <[email protected]> wrote:
>>>>>>
>>>>>> We've noticed that fsgsbase_64 can fail intermittently with the
>>>>>> following error:
>>>>>>
>>>>>> [RUN] ARCH_SET_GS(0x0) and clear gs, then schedule to 0x1
>>>>>> Before schedule, set selector to 0x1
>>>>>> other thread: ARCH_SET_GS(0x1) -- sel is 0x0
>>>>>> [FAIL] GS/BASE changed from 0x1/0x0 to 0x0/0x0
>>>>>>
>>>>>> This can be reliably reproduced by running fsgsbase_64 in a loop, e.g.:
>>>>>>
>>>>>> for i in $(seq 1 10000); do ./fsgsbase_64 || break; done
>>>>>>
>>>>>> This problem isn't new - I've reproduced it on the latest mainline and every
>>>>>> release going back to v4.12 (I did not try earlier). This was tested on
>>>>>> a Supermicro board with a Xeon E3-1220 as well as an Intel NUC with an
>>>>>> i3-5010U.
>>>>>>
>>>>>
>>>>> Hmm, I can reproduce it, too. I'll look in a bit.
>>>>
>>>> I'm triggering a different error, and I think what's going on is that
>>>> the kernel doesn't currently re-save GSBASE when a task switches out
>>>> and that task has saved gsbase != 0 and in-register GS == 0. This is
>>>> arguably a bug, but it's not an infoleak, and fixing it could be a wee
>>>> bit expensive. I'm not sure what, if anything, to do about this. I
>>>> suppose I could add some gross perf hackery to the test to detect this
>>>> case and suppress the error.
>>>>
>>>> I can also trigger the problem you're seeing, and I don't know what's
>>>> up. It may be related to an old problem I've seen that causes signal
>>>> delivery to sometimes corrupt %gs. It's deterministic, but it depends
>>>> in some odd way on register state. I can currently reproduce that
>>>> issue 100% of the time, and I'm trying to see if I can figure out
>>>> what's happening.
>>>
>>> I think it's a CPU bug, and I'm a bit mystified. I can trigger the
>>> following, plausibly related issue:
>>>
>>> Write a program that writes %gs = 1.
>>> Run that program under gdb
>>> break in which %gs == 1
>>> display/x $gs
>>> si
>>>
>>> Under QEMU TCG, gs stays equal to 1. On native or KVM, on Skylake, it
>>> changes to 0.
>>>
>>> On KVM or native, I do not observe do_debug getting called with %gs ==
>>> 1. On TCG, I do. I don't think that's precisely the problem that's
>>> causing the test to fail, since the test doesn't use TF or ptrace, but
>>> I wouldn't be shocked if it's related.
>>>
>>> hpa, any insight?
>>>
>>> (NB: if you want to play with this as I've described it, you may need
>>> to make invalid_selector() in ptrace.c always return false. The
>>> current implementation is too strict and causes problems.)
>>
>> Much simpler test. Run the attached program (gs1). It more or less
>> just sets %gs to 1 and spins until it stops being 1. Do it on a
>> kernel with the attached patch applied. I see stuff like this:
>>
>> # ./gs1
>> PID = 129
>> [ 15.703015] pid 129 saved gs = 1
>> [ 15.703517] pid 129 loaded gs = 1
>> [ 15.703973] pid 129 prepare_exit_to_usermode: gs = 1
>> ax = 0, cx = 0, dx = 0
>>
>> So we're interrupting the program, switching out, switching back in,
>> setting %gs to 1, observing that %gs is *still* 1 in
>> prepare_exit_to_usermode(), returning to usermode, and observing %gs
>> == 0.
>>
>> Presumably what's happening is that the IRET microcode matches the
>> SDM's pseudocode, which says:
>>
>> RETURN-TO-OUTER-PRIVILEGE-LEVEL:
>> ...
>> FOR each SegReg in (ES, FS, GS, and DS)
>> DO
>> tempDesc ← descriptor cache for SegReg (* hidden part of segment register *)
>> IF tempDesc(DPL) < CPL AND tempDesc(Type) is data or non-conforming code
>> THEN (* Segment register invalid *)
>> SegReg ← NULL;
>> FI;
>> OD;
>>
>> But this is very odd. The actual permission checks (in the docs for MOV) are:
>>
>> IF DS, ES, FS, or GS is loaded with non-NULL selector
>> THEN
>> IF segment selector index is outside descriptor table limits
>> or segment is not a data or readable code segment
>> or ((segment is a data or nonconforming code segment)
>> or ((RPL > DPL) and (CPL > DPL))
>> THEN #GP(selector); FI;
>>
>> ^^^^
>> This makes no sense. This says that the data segments cannot be
>> loaded with MOV. Empirically, it seems like MOV works if CPL <= DPL
>> and RPL <= DPL, but I haven't checked that hard.
>
> Surely Intel meant:
>
> ... or ((segment is a data segment or nonconforming code segment) and
> ((RPL > DPL) or (CPL > DPL))
>
> This would be consistent with the AMD APM #GP condition of "The DS,
> ES, FS, or GS register was loaded and the segment pointed to was a
> data or non-conforming code segment, but the RPL or CPL was greater
> than the DPL."
>
>>
>> IF segment not marked present
>> THEN #NP(selector);
>> ELSE
>> SegmentRegister ← segment selector;
>> SegmentRegister ← segment descriptor; FI;
>> FI;
>>
>> IF DS, ES, FS, or GS is loaded with NULL selector
>> THEN
>> SegmentRegister ← segment selector;
>> SegmentRegister ← segment descriptor;
>> ^^^^
>> wtf? There is no "segment descriptor". Presumably what actually
>> gets written to segment.DPL is nonsense.
>> FI;
>
> I think the bug is here. I think that, when writing a NULL selector
> to DS, ES, FS, or GS, Intel CPUs incorrectly set DPL == RPL, whereas
> they should set DPL to 3.

As an experiment, I did this:

DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
+ [0] = { .dpl = 3, },
+

This had no apparent effect. I was hoping that maybe loading NULL
into a selector would copy DPL from gdt[0], but it seems like it
doesn't.

2018-01-29 09:20:37

by H. Peter Anvin

Subject: Re: selftests/x86/fsgsbase_64 test problem

On 01/28/18 11:21, Andy Lutomirski wrote:
>>
>> I think the bug is here. I think that, when writing a NULL selector
>> to DS, ES, FS, or GS, Intel CPUs incorrectly set DPL == RPL, whereas
>> they should set DPL to 3.
>
> As an experiment, I did this:
>
> DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
> + [0] = { .dpl = 3, },
> +
>
> This had no apparent effect. I was hoping that maybe loading NULL
> into a selector would copy DPL from gdt[0], but it seems like it
> doesn't.
>

GDT[0] doesn't actually exist. It is pretty much scratch space; I have
suggested using it for the gsbase once all those issues get sorted out,
because it lets the paranoid code do something like:

rdgsbase %rax
push %rax /* Save old gsbase */
push %rax /* Reserve space on stack */
sgdt -2(%rsp) /* We don't care about the limit */
pop %rax /* %rax <- gdtbase */
mov (%rax),%rax /* GDT[0] holds the gsbase for this cpu */
wrgsbase %rax

2018-01-29 16:38:34

by Andy Lutomirski

Subject: Re: selftests/x86/fsgsbase_64 test problem

On Mon, Jan 29, 2018 at 1:13 AM, H. Peter Anvin <[email protected]> wrote:
> On 01/28/18 11:21, Andy Lutomirski wrote:
>>>
>>> I think the bug is here. I think that, when writing a NULL selector
>>> to DS, ES, FS, or GS, Intel CPUs incorrectly set DPL == RPL, whereas
>>> they should set DPL to 3.
>>
>> As an experiment, I did this:
>>
>> DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
>> + [0] = { .dpl = 3, },
>> +
>>
>> This had no apparent effect. I was hoping that maybe loading NULL
>> into a selector would copy DPL from gdt[0], but it seems like it
>> doesn't.
>>
>
> GDT[0] doesn't actually exist.

That's what I thought, too, and the SDM does say that, but the SDM
says all kinds of not-quite-correct things about segmentation.

> It is pretty much scratch space; I have
> suggested using it for the gsbase once all those issues get sorted out,
> because it lets the paranoid code do something like:
>
> rdgsbase %rax
> push %rax /* Save old gsbase */
> push %rax /* Reserve space on stack */
> sgdt -2(%rsp) /* We don't care about the limit */
> pop %rax /* %rax <- gdtbase */
> mov (%rax),%rax /* GDT[0] holds the gsbase for this cpu */
> wrgsbase %rax

That will utterly suck on non-UMIP machines that have
hypervisor-provided UMIP emulation.

2018-01-29 18:17:42

by H. Peter Anvin

Subject: Re: selftests/x86/fsgsbase_64 test problem

On 01/29/18 08:37, Andy Lutomirski wrote:
>
> That's what I thought, too, and the SDM does say that, but the SDM
> says all kinds of not-quite-correct things about segmentation.
>
>> It is pretty much scratch space; I have
>> suggested using it for the gsbase once all those issues get sorted out,
>> because it lets the paranoid code do something like:
>>
>> rdgsbase %rax
>> push %rax /* Save old gsbase */
>> push %rax /* Reserve space on stack */
>> sgdt -2(%rsp) /* We don't care about the limit */
>> pop %rax /* %rax <- gdtbase */
>> mov (%rax),%rax /* GDT[0] holds the gsbase for this cpu */
>> wrgsbase %rax
>
> That will utterly suck on non-UMIP machines that have
> hypervisor-provided UMIP emulation.
>

Is that a valid thing to optimize for, especially given that paranoid
entries aren't the most common anyway?

-hpa

2018-01-29 18:27:57

by Andy Lutomirski

Subject: Re: selftests/x86/fsgsbase_64 test problem



> On Jan 29, 2018, at 10:12 AM, H. Peter Anvin <[email protected]> wrote:
>
>> On 01/29/18 08:37, Andy Lutomirski wrote:
>>
>> That's what I thought, too, and the SDM does say that, but the SDM
>> says all kinds of not-quite-correct things about segmentation.
>>
>>> It is pretty much scratch space; I have
>>> suggested using it for the gsbase once all those issues get sorted out,
>>> because it lets the paranoid code do something like:
>>>
>>> rdgsbase %rax
>>> push %rax /* Save old gsbase */
>>> push %rax /* Reserve space on stack */
>>> sgdt -2(%rsp) /* We don't care about the limit */
>>> pop %rax /* %rax <- gdtbase */
>>> mov (%rax),%rax /* GDT[0] holds the gsbase for this cpu */
>>> wrgsbase %rax
>>
>> That will utterly suck on non-UMIP machines that have
>> hypervisor-provided UMIP emulation.
>>
>
> Is that a valid thing to optimize for, especially given that paranoid
> entries aren't the most common anyway?
>

A bunch of people seem to care about NMI performance for perf. And the current patch set works without this trick.

FWIW, if we switch all entries to the entry text trampoline, we get direct percpu access for free.

2018-01-29 18:36:24

by H. Peter Anvin

Subject: Re: selftests/x86/fsgsbase_64 test problem

On 01/29/18 10:26, Andy Lutomirski wrote:
>>>
>>> That will utterly suck on non-UMIP machines that have
>>> hypervisor-provided UMIP emulation.
>>
>> Is that a valid thing to optimize for, especially given that paranoid
>> entries aren't the most common anyway?
>
> A bunch of people seem to care about NMI performance for perf.
>

That wasn't really the question...

> And the current patch set works without this trick.

But I believe the tricks it uses are fragile.

> FWIW, if we switch all entries to the entry text trampoline, we get direct percpu access for free.

That might be a better option.

-hpa


2018-02-27 23:01:09

by Dan Rue

Subject: Re: selftests/x86/fsgsbase_64 test problem

On Mon, Jan 29, 2018 at 10:30:05AM -0800, H. Peter Anvin wrote:
> On 01/29/18 10:26, Andy Lutomirski wrote:
> >>>
> >>> That will utterly suck on non-UMIP machines that have
> >>> hypervisor-provided UMIP emulation.
> >>
> >> Is that a valid thing to optimize for, especially given that paranoid
> >> entries aren't the most common anyway?
> >
> > A bunch of people seem to care about NMI performance for perf.
> >
>
> That wasn't really the question...
>
> > And the current patch set works without this trick.
>
> But I believe the tricks it uses are fragile.
>
> > FWIW, if we switch all entries to the entry text trampoline, we get direct percpu access for free.
>
> That might be a better option.

Has there been any conclusion to this thread? I can still reproduce the
issue on mainline and next.

Thanks,
Dan