2015-04-23 12:35:28

by Denys Vlasenko

Subject: [PATCH] x86/asm/entry/32: Restore %ss before SYSRETL if necessary

AMD docs say that SYSRET32 loads the %ss selector with a value from an MSR,
but the *cached descriptor* of %ss is not modified.
(Intel CPUs reset the descriptor to a fixed, valid state.)

It was observed to cause Wine crashes. Conjectured sequence of events
causing it is as follows:

1. Wine process enters kernel via syscall insn.
2. Context switch to any other task.
3. Interrupt or exception happens, CPU loads %ss with 0.
(This happens according to both Intel and AMD docs.)
%ss cached descriptor is set to "invalid" state.
4. Context switch back to Wine.
5. sysret to 32-bit userspace. %ss selector has correct value but its
cached descriptor is still invalid.
6. The very first userspace POP insn after this causes exception 12.

Fix this by checking the %ss selector value. If it is not __KERNEL_DS
(and it really can only be __KERNEL_DS or zero),
then load it with __KERNEL_DS.

We also use SYSRET32 for SYSENTER-based syscalls, but that codepath is
only used by Intel CPUs, which don't have this quirk.

Signed-off-by: Denys Vlasenko <[email protected]>
Reported-by: Brian Gerst <[email protected]>
CC: Brian Gerst <[email protected]>
CC: Linus Torvalds <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: Borislav Petkov <[email protected]>
CC: "H. Peter Anvin" <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Oleg Nesterov <[email protected]>
CC: Frederic Weisbecker <[email protected]>
CC: Alexei Starovoitov <[email protected]>
CC: Will Drewry <[email protected]>
CC: Kees Cook <[email protected]>
CC: [email protected]
CC: [email protected]
---
arch/x86/ia32/ia32entry.S | 16 ++++++++++++++++
1 file changed, 16 insertions(+)

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 0c302d0..9537dcb 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -408,6 +408,18 @@ cstar_dispatch:
sysretl_from_sys_call:
andl $~TS_COMPAT, ASM_THREAD_INFO(TI_status, %rsp, SIZEOF_PTREGS)
RESTORE_RSI_RDI_RDX
+ /*
+ * On AMD, SYSRET32 loads %ss selector, but does not modify its
+ * cached descriptor; and in kernel, %ss can be loaded with 0,
+ * setting cached descriptor to "invalid". This has no effect on
+ * 64-bit mode, but on return to 32-bit mode, it makes stack ops fail.
+ * Fix %ss only if it's wrong: read from %ss takes ~2 cycles,
+ * write to %ss is ~40 cycles.
+ */
+ movl %ss, %ecx
+ cmpl $__KERNEL_DS, %ecx
+ jne reload_ss
+ss_is_good:
movl RIP(%rsp),%ecx
CFI_REGISTER rip,rcx
movl EFLAGS(%rsp),%r11d
@@ -426,6 +438,10 @@ sysretl_from_sys_call:
* does not exist, it merely sets eflags.IF=1).
*/
USERGS_SYSRET32
+reload_ss:
+ movl $__KERNEL_DS, %ecx
+ movl %ecx, %ss
+ jmp ss_is_good

#ifdef CONFIG_AUDITSYSCALL
cstar_auditsys:
--
1.8.1.4


2015-04-23 15:22:44

by Linus Torvalds

Subject: Re: [PATCH] x86/asm/entry/32: Restore %ss before SYSRETL if necessary

On Thu, Apr 23, 2015 at 5:34 AM, Denys Vlasenko <[email protected]> wrote:
>
> It was observed to cause Wine crashes. Conjectured sequence of events
> causing it is as follows:
>
> 1. Wine process enters kernel via syscall insn.
> 2. Context switch to any other task.
> 3. Interrupt or exception happens, CPU loads %ss with 0.
> (This happens according to both Intel and AMD docs.)
> %ss cached descriptor is set to "invalid" state.
> 4. Context switch back to Wine.
> 5. sysret to 32-bit userspace. %ss selector has correct value but its
> cached descriptor is still invalid.

I really don't like the patch, as it just feels very hacky to me.

It is a bit scary to me that apparently we leak %ss values between
processes, so that while we run in the kernel we can randomly have the
ss descriptor either be 0 or __KERNEL_DS. That sounds like an
information leak to me, even in 64-bit mode. The value of %ss may not
*matter* in 64-bit mode, but leaking that difference between processes
sounds nasty. I can't offhand think of any way to actually read the
present bit in the cached descriptor (I was thinking something like
the "LSL" instruction, but that takes a new segment selector, not the
segment itself), but it just smells odd to me.

Also, why does this only happen with Wine? In regular 32-bit mode the
segment valid bit in the cached descriptor should also matter. So how
come this doesn't trigger for any 32-bit user land on a 64-bit kernel?

Linus

2015-04-23 15:50:33

by Andy Lutomirski

Subject: Re: [PATCH] x86/asm/entry/32: Restore %ss before SYSRETL if necessary

On Thu, Apr 23, 2015 at 8:22 AM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Apr 23, 2015 at 5:34 AM, Denys Vlasenko <[email protected]> wrote:
>>
>> It was observed to cause Wine crashes. Conjectured sequence of events
>> causing it is as follows:
>>
>> 1. Wine process enters kernel via syscall insn.
>> 2. Context switch to any other task.
>> 3. Interrupt or exception happens, CPU loads %ss with 0.
>> (This happens according to both Intel and AMD docs.)
>> %ss cached descriptor is set to "invalid" state.
>> 4. Context switch back to Wine.
>> 5. sysret to 32-bit userspace. %ss selector has correct value but its
>> cached descriptor is still invalid.
>
> I really don't like the patch, as it just feels very hacky to me.
>
> It is a bit scary to me that apparently we leak %ss values between
> processes, so that while we run in the kernel we can randomly have the
> ss descriptor either be 0 or __KERNEL_DS. That sounds like an
> information leak to me, even in 64-bit mode. The value of %ss may not
> *matter* in 64-bit mode, but leaking that difference between processes
> sounds nasty. I can't offhand think of any way to actually read the
> present bit in the cached descriptor (I was thinking something like
> the "LSL" instruction, but that takes a new segment selector, not the
> segment itself), but it just smells odd to me.

How about, in 64-bit code: syscall; long jump to a 32-bit cs; mov with
ss override (or push or pop).

>
> Also, why does this only happen with Wine? In regular 32-bit mode the
> segment valid bit in the cached descriptor should also matter. So how
> come this doesn't trigger for any 32-bit user land on a 64-bit kernel?

See other thread. I bet we got poisoned by a concurrently running,
preempted 16-bit or segmented process.

--Andy

2015-04-23 16:05:11

by Linus Torvalds

Subject: Re: [PATCH] x86/asm/entry/32: Restore %ss before SYSRETL if necessary

On Thu, Apr 23, 2015 at 8:50 AM, Andy Lutomirski <[email protected]> wrote:
> On Thu, Apr 23, 2015 at 8:22 AM, Linus Torvalds
>>
>> It is a bit scary to me that apparently we leak %ss values between
>> processes, so that while we run in the kernel we can randomly have the
>> ss descriptor either be 0 or __KERNEL_DS. That sounds like an
>> information leak to me, even in 64-bit mode. The value of %ss may not
>> *matter* in 64-bit mode, but leaking that difference between processes
>> sounds nasty. I can't offhand think of any way to actually read the
>> present bit in the cached descriptor (I was thinking something like
>> the "LSL" instruction, but that takes a new segment selector, not the
>> segment itself), but it just smells odd to me.
>
> How about, in 64-bit code: syscall; long jump to a 32-bit cs; mov with
> ss override (or push or pop).

Yeah. That sounds likely. Something that doesn't reload SS, but gets
us to a point where the present bit matters.

So basically it would allow an "attacker" to see whether it got
scheduled away and an interrupt happened elsewhere.

Which doesn't exactly sound like much of an information leak ("just
read /proc/interrupts instead"), so I guess from a security standpoint
this really doesn't matter. But I could see how it could break odd
code.

So I think even the 64-bit sysret path has this problem, it's just
that any *sane* 64-bit user will never care. But insane ones could.

And presumably Intel doesn't have this problem, because Intel's sysret
properly loads the whole cached ss descriptor (that's certainly what
the Intel documentation says: "However, the CS and SS descriptor
caches are not loaded from the descriptors (in GDT or LDT) referenced
by those selectors. Instead, the descriptor caches are loaded with
fixed values").

So it sounds very much like an AMD bug/misfeature. Because sysret
*should* reset the descriptor cache. Do we know whether this affects
*all* AMD CPU's or just a subset?

Linus

2015-04-23 16:06:45

by Brian Gerst

Subject: Re: [PATCH] x86/asm/entry/32: Restore %ss before SYSRETL if necessary

On Thu, Apr 23, 2015 at 11:22 AM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Apr 23, 2015 at 5:34 AM, Denys Vlasenko <[email protected]> wrote:
>>
>> It was observed to cause Wine crashes. Conjectured sequence of events
>> causing it is as follows:
>>
>> 1. Wine process enters kernel via syscall insn.
>> 2. Context switch to any other task.
>> 3. Interrupt or exception happens, CPU loads %ss with 0.
>> (This happens according to both Intel and AMD docs.)
>> %ss cached descriptor is set to "invalid" state.
>> 4. Context switch back to Wine.
>> 5. sysret to 32-bit userspace. %ss selector has correct value but its
>> cached descriptor is still invalid.
>
> I really don't like the patch, as it just feels very hacky to me.
>
> It is a bit scary to me that apparently we leak %ss values between
> processes, so that while we run in the kernel we can randomly have the
> ss descriptor either be 0 or __KERNEL_DS. That sounds like an
> information leak to me, even in 64-bit mode. The value of %ss may not
> *matter* in 64-bit mode, but leaking that difference between processes
> sounds nasty. I can't offhand think of any way to actually read the
> present bit in the cached descriptor (I was thinking something like
> the "LSL" instruction, but that takes a new segment selector, not the
> segment itself), but it just smells odd to me.

So you are saying we should save and conditionally restore the
kernel's %ss during context switch? That shouldn't be too bad. Half
of the time you would be loading the null selector which is fast (no
GDT access, no validation).

> Also, why does this only happen with Wine? In regular 32-bit mode the
> segment valid bit in the cached descriptor should also matter. So how
> come this doesn't trigger for any 32-bit user land on a 64-bit kernel?

Probably just lack of exposure so far. It only affects AMD CPUs, and
it was just merged. Wine is probably the most common 32-bit app
people will run on a 64-bit kernel. I'll test something other than
Wine that is 32-bit when I get home tonight.

--
Brian Gerst

2015-04-23 16:11:49

by Borislav Petkov

Subject: Re: [PATCH] x86/asm/entry/32: Restore %ss before SYSRETL if necessary

On Thu, Apr 23, 2015 at 09:05:07AM -0700, Linus Torvalds wrote:
> So it sounds very much like an AMD bug/misfeature. Because sysret
> *should* reset the descriptor cache. Do we know whether this affects
> *all* AMD CPU's or just a subset?

I am being told that this might appear to be the case with at least one
uarch. I could run code on the boxes I have here to check if Andy comes
up with some nice little "exploit".

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2015-04-23 16:13:37

by Linus Torvalds

Subject: Re: [PATCH] x86/asm/entry/32: Restore %ss before SYSRETL if necessary

On Thu, Apr 23, 2015 at 9:06 AM, Brian Gerst <[email protected]> wrote:
>
> So you are saying we should save and conditionally restore the
> kernel's %ss during context switch? That shouldn't be too bad. Half
> of the time you would be loading the null selector which is fast (no
> GDT access, no validation).

I'd almost prefer something along those lines, yes. Who knows *what*
leaks? If the present bit state leaks, then likely so does the limit
value etc etc..

Linus

2015-04-23 16:27:27

by Andy Lutomirski

Subject: Re: [PATCH] x86/asm/entry/32: Restore %ss before SYSRETL if necessary

On Thu, Apr 23, 2015 at 9:13 AM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Apr 23, 2015 at 9:06 AM, Brian Gerst <[email protected]> wrote:
>>
>> So you are saying we should save and conditionally restore the
>> kernel's %ss during context switch? That shouldn't be too bad. Half
>> of the time you would be loading the null selector which is fast (no
>> GDT access, no validation).
>
> I'd almost prefer something along those lines, yes. Who knows *what*
> leaks? If the present bit state leaks, then likely so does the limit
> value etc etc..
>

I'll go out on a limb and guess the present bit doesn't leak. If I
were implementing an x86 cpu, I wouldn't have a present bit at all in
the descriptor cache, since you aren't supposed to be able to load a
non-present descriptor in the first place. I bet it's the limit we're
seeing.

But I think I prefer something closer to Denys' approach with
alternatives instead. I think the only case that matters (if my
hare-brained explanation of the actual crash is right) is when we
sysret (q or l) while SS is 0. That only happens if we scheduled
inside a syscall, and I'm guessing that testing if ss is zero and
reloading it on syscall return will be a smaller performance hit than
reloading on all context switches. The latter could happen more than
once per syscall, and it could also affect tasks that aren't doing
syscalls at all and are therefore unaffected.

I'll try to send out a patch and a test case later today, but no
promises -- the test case will be a bit tedious, and I'm already
overcommitted for today :(

A sketch of a reproducer:

Two threads. Thread 1 sets ss to some very-low-limit value, and it
loops doing mov $-1, %eax; int $0x80. Thread 2 is ordinary 32-bit code
doing while(true) usleep(1);

--Andy

2015-04-23 20:01:42

by Denys Vlasenko

Subject: Re: [PATCH] x86/asm/entry/32: Restore %ss before SYSRETL if necessary

On Thu, Apr 23, 2015 at 6:27 PM, Andy Lutomirski <[email protected]> wrote:
> I'll go out on a limb and guess the present bit doesn't leak. If I
> were implementing an x86 cpu, I wouldn't have a present bit at all in
> the descriptor cache, since you aren't supposed to be able to load a
> non-present descriptor in the first place.

There is definitely a present bit in cached descriptors.
It is used to track whether a NULL selector was loaded into that
particular segment register.
The bit is even visible in the SMM save area;
see table 10-1 in 24593_APM.pdf.

Naturally, CS can't be NULL, and up until today
I thought SS also can't. But the bit is probably implemented
for all eight cached descriptors.

--
vda

2015-04-23 21:11:31

by Borislav Petkov

Subject: Re: [PATCH] x86/asm/entry/32: Restore %ss before SYSRETL if necessary

On Thu, Apr 23, 2015 at 10:01:16PM +0200, Denys Vlasenko wrote:
> Naturally, CS can't be NULL, and up until today
> I thought SS also can't. But the bit is probably implemented
> for all eight cached descriptors.

There's a section about the NULL selector in APM v2. It says that NULL
selectors are used to invalidate segment registers, and that software can
load a NULL selector into SS at CPL0.

So if, as you quoted earlier, SS gets set to NULL as a result of an
interrupt, there's that SS leak causing the SS exception.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2015-04-23 21:38:01

by H. Peter Anvin

Subject: Re: [PATCH] x86/asm/entry/32: Restore %ss before SYSRETL if necessary

On 04/23/2015 02:10 PM, Borislav Petkov wrote:
> On Thu, Apr 23, 2015 at 10:01:16PM +0200, Denys Vlasenko wrote:
>> Naturally, CS can't be NULL, and up until today
>> I thought SS also can't. But the bit is probably implemented
>> for all eight cached descriptors.
>
> There's a section about the NULL selector in APM v2. It says that NULL
> selectors are used to invalidate segment registers, and that software can
> load a NULL selector into SS at CPL0.
>
> So if, as you quoted earlier, SS gets set to NULL as a result of an
> interrupt, there's that SS leak causing the SS exception.
>

Yes, the NULL SS is a special thing in 64-bit mode. I agree that
context-switching it is probably the way to go; it should be cheap
enough. We might even be able to conditionalize it on an X86_BUG_ flag.

-hpa

2015-04-23 21:46:30

by Borislav Petkov

Subject: Re: [PATCH] x86/asm/entry/32: Restore %ss before SYSRETL if necessary

On Thu, Apr 23, 2015 at 02:37:27PM -0700, H. Peter Anvin wrote:
> Yes, the NULL SS is a special thing in 64-bit mode. I agree that
> context-switching it is probably the way to go; it should be cheap
> enough. We might even be able to conditionalize it on an X86_BUG_
> flag.

Oh sure, I was thinking of an AMD-specific

"call amd_fixup_ss"

which gets NOPped out on Intel.

We do those for breakfast :-)

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2015-04-23 22:29:53

by Andy Lutomirski

Subject: Re: [PATCH] x86/asm/entry/32: Restore %ss before SYSRETL if necessary

On Thu, Apr 23, 2015 at 2:37 PM, H. Peter Anvin <[email protected]> wrote:
> On 04/23/2015 02:10 PM, Borislav Petkov wrote:
>> On Thu, Apr 23, 2015 at 10:01:16PM +0200, Denys Vlasenko wrote:
>>> Naturally, CS can't be NULL, and up until today
>>> I thought SS also can't. But the bit is probably implemented
>>> for all eight cached descriptors.
>>
>> There's a section about the NULL selector in APM v2. It says that NULL
>> selectors are used to invalidate segment registers, and that software can
>> load a NULL selector into SS at CPL0.
>>
>> So if, as you quoted earlier, SS gets set to NULL as a result of an
>> interrupt, there's that SS leak causing the SS exception.
>>
>
> Yes, the NULL SS is a special thing in 64-bit mode. I agree that
> context-switching it is probably the way to go; it should be cheap
> enough. We might even be able to conditionalize it on an X86_BUG_ flag.

I still don't see why context switches are a better place than just
before sysret, but I could be convinced.

I updated my test at
https://git.kernel.org/cgit/linux/kernel/git/luto/misc-tests.git/. I
want to figure out whether this is a problem for sysretq, too.

--Andy

2015-04-23 22:32:19

by H. Peter Anvin

Subject: Re: [PATCH] x86/asm/entry/32: Restore %ss before SYSRETL if necessary

On 04/23/2015 03:29 PM, Andy Lutomirski wrote:
>>
>> Yes, the NULL SS is a special thing in 64-bit mode. I agree that
>> context-switching it is probably the way to go; it should be cheap
>> enough. We might even be able to conditionalize it on an X86_BUG_ flag.
>
> I still don't see why context switches are a better place than just
> before sysret, but I could be convinced.
>

Because there are way more sysrets than context switches, and Linux is
particularly sensitive to system call latency, by design.

-hpa

2015-04-23 22:39:12

by Andy Lutomirski

Subject: Re: [PATCH] x86/asm/entry/32: Restore %ss before SYSRETL if necessary

On Thu, Apr 23, 2015 at 3:31 PM, H. Peter Anvin <[email protected]> wrote:
> On 04/23/2015 03:29 PM, Andy Lutomirski wrote:
>>>
>>> Yes, the NULL SS is a special thing in 64-bit mode. I agree that
>>> context-switching it is probably the way to go; it should be cheap
>>> enough. We might even be able to conditionalize it on an X86_BUG_ flag.
>>
>> I still don't see why context switches are a better place than just
>> before sysret, but I could be convinced.
>>
>
> Because there are way more sysrets than context switches, and Linux is
> particularly sensitive to system call latency, by design.

I mean sysret, but only when SS might be zero. Denys' approach
apparently needs ~4 cycles to check that (not bad); alternatively, we
could (yuck) set a TI flag on context switch.

But yes, maybe you're right.

--Andy

2015-04-23 22:53:05

by H. Peter Anvin

Subject: Re: [PATCH] x86/asm/entry/32: Restore %ss before SYSRETL if necessary

On 04/23/2015 03:38 PM, Andy Lutomirski wrote:
>>
>> Because there are way more sysrets than context switches, and Linux is
>> particularly sensitive to system call latency, by design.
>

Just to clarify: why would Linux be more sensitive to system call latency
by design? It enables much simpler APIs and avoids hacks like sending down
a syscall task list (which was genuinely proposed at one point.) If
kernel entry/exit is too expensive, then the APIs get more complex
because they *have* to do everything in the smallest number of system calls.

-hpa

2015-04-23 22:55:51

by Andy Lutomirski

Subject: Re: [PATCH] x86/asm/entry/32: Restore %ss before SYSRETL if necessary

On Thu, Apr 23, 2015 at 3:52 PM, H. Peter Anvin <[email protected]> wrote:
> On 04/23/2015 03:38 PM, Andy Lutomirski wrote:
>>>
>>> Because there are way more sysrets than context switches, and Linux is
>>> particularly sensitive to system call latency, by design.
>>
>
> Just to clarify: why would Linux be more sensitive to system call latency
> by design? It enables much simpler APIs and avoids hacks like sending down
> a syscall task list (which was genuinely proposed at one point.) If
> kernel entry/exit is too expensive, then the APIs get more complex
> because they *have* to do everything in the smallest number of system calls.
>

It's a matter of the ratio, right? One cycle of syscall overhead
saved is worth some number of context switch cycles added, and the
ratio probably varies by workload.

If we do syscall, two context switches, and sysret, then we wouldn't
have been better off fixing it on sysret. But maybe most workloads
still prefer the fixup on context switch.

--Andy



--
Andy Lutomirski
AMA Capital Management, LLC

2015-04-23 23:04:58

by Linus Torvalds

Subject: Re: [PATCH] x86/asm/entry/32: Restore %ss before SYSRETL if necessary

On Thu, Apr 23, 2015 at 3:55 PM, Andy Lutomirski <[email protected]> wrote:
>
> It's a matter of the ratio, right? One cycle of syscall overhead
> saved is worth some number of context switch cycles added, and the
> ratio probably varies by workload.

Yeah. That said, I do think Peter is right that there are usually a
*lot* more system calls than there are context switches, even if I
guess you could come up with specific cases where that isn't true (eg
in-kernel file servers etc that never actually go to user space).

And the scheduler approach would have the benefit of not being in asm
code, so we can more easily use things like static keys or
alternative_asm() to make the overhead go away on CPU's that don't
need it. We certainly _can_ do that in an *.S file too, but we don't
have quite the same level of infrastructure help to do it.

Adding some static key protected thing to switch_to() in
arch/x86/kernel/process_64.c wouldn't seem too bad. And at that point,
the only cost is a single no-op on most CPU's (we still don't know
_which_ AMD CPU's are affected, but I guess we could start off with
all of them and see if we can get an exhaustive list some way).

Linus

2015-04-23 23:23:12

by Denys Vlasenko

Subject: Re: [PATCH] x86/asm/entry/32: Restore %ss before SYSRETL if necessary

On Fri, Apr 24, 2015 at 1:04 AM, Linus Torvalds
<[email protected]> wrote:
> And at that point, the only cost is a single no-op on most CPU's (we
> still don't know _which_ AMD CPU's are affected, but I guess we could
> start off with all of them and see if we can get an exhaustive list
> some way).

This seems to be "architectural" for AMD CPUs, similar to Intel's
SYSRET bug with a non-canonical return address. Meaning, both bugs are
documented by the respective companies. Both are officially "not a bug,
but a feature".

2015-04-24 01:00:38

by H. Peter Anvin

Subject: Re: [PATCH] x86/asm/entry/32: Restore %ss before SYSRETL if necessary

On 04/23/2015 03:55 PM, Andy Lutomirski wrote:
> On Thu, Apr 23, 2015 at 3:52 PM, H. Peter Anvin <[email protected]> wrote:
>> On 04/23/2015 03:38 PM, Andy Lutomirski wrote:
>>>>
>>>> Because there are way more sysrets than context switches, and Linux is
>>>> particularly sensitive to system call latency, by design.
>>>
>>
>> Just to clarify: why would Linux be more sensitive to system call latency
>> by design? It enables much simpler APIs and avoids hacks like sending down
>> a syscall task list (which was genuinely proposed at one point.) If
>> kernel entry/exit is too expensive, then the APIs get more complex
>> because they *have* to do everything in the smallest number of system calls.
>>
>
> It's a matter of the ratio, right? One cycle of syscall overhead
> saved is worth some number of context switch cycles added, and the
> ratio probably varies by workload.
>

Correct. For workloads which do *no* system calls it is kind of "special".

> If we do syscall, two context switches, and sysret, then we wouldn't
> have been better off fixing it on sysret. But maybe most workloads
> still prefer the fixup on context switch.
>

There is also the matter of latency, which tends to be more critical for
syscalls.

-hpa