The existing retpoline code carefully and awkwardly retpolinifies
the SYSCALL64 slow path. This stops the fast path from being
particularly fast, and it's IMO rather messy.
Instead, just bypass the fast path entirely if retpolines are on.
This seems to be a speedup on a "minimal" retpoline kernel, mainly
because do_syscall_64() ends up calling the syscall handler without
using a slow retpoline thunk.
As an added benefit, we won't need to apply further Spectre
mitigations to the fast path. The current fast path Spectre
mitigations may have a hole: if the syscall nr is out of bounds, it
is plausible that the CPU would mispredict the bounds check, load
a bogus function pointer, and speculatively execute it right
through the retpoline. If this is indeed a problem, we need to fix
it in the slow paths anyway, but with this patch applied, we can at
least leave the fast path alone.
Cleans-up: 2641f08bb7fc ("x86/retpoline/entry: Convert entry assembler indirect jumps")
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/entry/entry_64.S | 23 +++++++++++++++++------
1 file changed, 17 insertions(+), 6 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 4f8e1d35a97c..b915bad58754 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -245,6 +245,9 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
* If we need to do entry work or if we guess we'll need to do
* exit work, go straight to the slow path.
*/
+#ifdef CONFIG_RETPOLINE
+ ALTERNATIVE "", "jmp entry_SYSCALL64_slow_path", X86_FEATURE_RETPOLINE
+#endif
movq PER_CPU_VAR(current_task), %r11
testl $_TIF_WORK_SYSCALL_ENTRY|_TIF_ALLWORK_MASK, TASK_TI_flags(%r11)
jnz entry_SYSCALL64_slow_path
@@ -270,13 +273,11 @@ entry_SYSCALL_64_fastpath:
* This call instruction is handled specially in stub_ptregs_64.
* It might end up jumping to the slow path. If it jumps, RAX
* and all argument registers are clobbered.
+ *
+ * NB: no retpoline needed -- we don't execute this code with
+ * retpolines enabled.
*/
-#ifdef CONFIG_RETPOLINE
- movq sys_call_table(, %rax, 8), %rax
- call __x86_indirect_thunk_rax
-#else
call *sys_call_table(, %rax, 8)
-#endif
.Lentry_SYSCALL_64_after_fastpath_call:
movq %rax, RAX(%rsp)
@@ -431,6 +432,9 @@ ENTRY(stub_ptregs_64)
* which we achieve by trying again on the slow path. If we are on
* the slow path, the extra regs are already saved.
*
+ * This code is unreachable (even via mispredicted conditional branches)
+ * if we're using retpolines.
+ *
* RAX stores a pointer to the C function implementing the syscall.
* IRQs are on.
*/
@@ -448,12 +452,19 @@ ENTRY(stub_ptregs_64)
jmp entry_SYSCALL64_slow_path
1:
- JMP_NOSPEC %rax /* Called from C */
+ jmp *%rax /* Called from C */
END(stub_ptregs_64)
.macro ptregs_stub func
ENTRY(ptregs_\func)
UNWIND_HINT_FUNC
+#ifdef CONFIG_RETPOLINE
+ /*
+ * If retpolines are enabled, we don't use the syscall fast path,
+ * so just jump straight to the syscall body.
+ */
+ ALTERNATIVE "", __stringify(jmp \func), X86_FEATURE_RETPOLINE
+#endif
leaq \func(%rip), %rax
jmp stub_ptregs_64
END(ptregs_\func)
--
2.13.6
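For context, the C slow path that everything now funnels into is
do_syscall_64(). A simplified sketch of its dispatch (approximate,
based on arch/x86/entry/common.c of this era, with the syscall-trace
hook and the x32 __SYSCALL_MASK handling elided):

__visible void do_syscall_64(struct pt_regs *regs)
{
	unsigned long nr = regs->orig_ax;

	enter_from_user_mode();
	local_irq_enable();

	if (likely(nr < NR_syscalls)) {
		/*
		 * When built with a retpoline-aware compiler, this
		 * indirect call gets the retpoline treatment
		 * automatically; the asm fast path had to hand-roll it.
		 */
		regs->ax = sys_call_table[nr](regs->di, regs->si,
					      regs->dx, regs->r10,
					      regs->r8, regs->r9);
	}

	syscall_return_slowpath(regs);
}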
On Mon, Jan 22, 2018 at 10:04 AM, Andy Lutomirski <[email protected]> wrote:
> The existing retpoline code carefully and awkwardly retpolinifies
> the SYSCALL64 slow path. This stops the fast path from being
> particularly fast, and it's IMO rather messy.
I'm not convinced your patch isn't messier still.. It's certainly
subtle. I had to look at that ptregs stub generator thing twice.
Honestly, I'd rather get rid of the fast-path entirely. Compared to
all the PTI mess, it's not even noticeable.
And if we ever get CPU's that have this all fixed, we can re-visit
introducing the fastpath. But this is all very messy and it doesn't
seem worth it right now.
If we get rid of the fastpath, we can lay out the slow path slightly
better, and get rid of some of those jump-overs. And we'd get rid of
the ptregs hooks entirely.
So we can try to make the "slow" path better while at it, but I really
don't think it matters much now in the post-PTI era. Sadly.
Linus
* Linus Torvalds <[email protected]> wrote:
> On Mon, Jan 22, 2018 at 10:04 AM, Andy Lutomirski <[email protected]> wrote:
> > The existing retpoline code carefully and awkwardly retpolinifies
> > the SYSCALL64 slow path. This stops the fast path from being
> > particularly fast, and it's IMO rather messy.
>
> I'm not convinced your patch isn't messier still.. It's certainly
> subtle. I had to look at that ptregs stub generator thing twice.
>
> Honestly, I'd rather get rid of the fast-path entirely. Compared to
> all the PTI mess, it's not even noticeable.
>
> And if we ever get CPU's that have this all fixed, we can re-visit
> introducing the fastpath. But this is all very messy and it doesn't
> seem worth it right now.
>
> If we get rid of the fastpath, we can lay out the slow path slightly
> better, and get rid of some of those jump-overs. And we'd get rid of
> the ptregs hooks entirely.
>
> So we can try to make the "slow" path better while at it, but I really
> don't think it matters much now in the post-PTI era. Sadly.
Note that there's another advantage to your proposal: should other vulnerabilities
arise in the future, requiring changes in the syscall entry path, we'd be more
flexible to address them in the C space than in the assembly space.
In hindsight a _LOT_ of the PTI complexity and fragility centered around
interacting with x86 kernel entry assembly code - which entry code fortunately got
much simpler (and easier to review) in the past 1-2 years due to the thorough
cleanups and the conversion of most of it to C. But it was still painful.
So I'm fully in favor of that.
Thanks,
Ingo
> And if we ever get CPU's that have this all fixed, we can re-visit
> introducing the fastpath. But this is all very messy and it doesn't
> seem worth it right now.
We already have CPUs it doesn't affect - older Atom for one, and I
believe some or all of the VIA ones (still trying to get a definitive
confirmation).
Alan
On Mon, Jan 22, 2018 at 10:55 AM, Linus Torvalds
<[email protected]> wrote:
>
> Honestly, I'd rather get rid of the fast-path entirely. Compared to
> all the PTI mess, it's not even noticeable.
So I looked at how that would be.
Patch attached. Not really "tested", but I'm running the kernel with
this patch now, and 'strace' etc works, and honestly, it seems very
obvious.
Also, code generation for 'do_syscall_64()' does not look horrible. In
fact, it doesn't look all that much worse than the fast-path ever did.
So the biggest impact of this is the extra register saves
(SAVE_EXTRA_REGS) from setting up the full ptregs. And honestly, I
hate how that stupid macro still uses "movq reg,off(%rsp)" instead of
"pushq %reg".
Considering the diffstat:
2 files changed, 2 insertions(+), 121 deletions(-)
and how those 100+ lines are nasty assembly code, I do think we should
just do it.
Comments?
Linus
On Thu, Jan 25, 2018 at 10:48 AM, Linus Torvalds
<[email protected]> wrote:
>
> So the biggest impact of this is the extra register saves
Actually, the other noticeable part is the reloading of the argument
registers from ptregs. Together with just the extra level of
'call/ret' and the stack setup, I'm guessing we're talking maybe 20
cycles or so.
So there's the extra register saves, and simply the fact that the
fastpath had a flatter calling structure.
It still feels worth it. And if we do decide that we want to do the
register clearing on kernel entry for some paranoid mode, we'd pretty
much have to do this anyway.
Linus
On Thu, Jan 25, 2018 at 2:16 PM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Jan 25, 2018 at 10:48 AM, Linus Torvalds
> <[email protected]> wrote:
>>
>> So the biggest impact of this is the extra register saves
>
> Actually, the other noticeable part is the reloading of the argument
> registers from ptregs. Together with just the extra level of
> 'call/ret' and the stack setup, I'm guessing we're talking maybe 20
> cycles or so.
>
> So there's the extra register saves, and simply the fact that the
> fastpath had a flatter calling structure.
>
> It still feels worth it. And if we do decide that we want to do the
> register clearing on kernel entry for some paranoid mode, we'd pretty
> much have to do this anyway.
>
> Linus
Another extra step the slow path does is checking to see if ptregs is
safe for SYSRET. I think that can be mitigated by moving the check to
the places that do modify ptregs (ptrace, sigreturn, and exec) which
would set a flag to force return with IRET if the modified regs do not
satisfy the criteria for SYSRET.
--
Brian Gerst
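A minimal sketch of that idea (hypothetical: mark_regs_poked() and the
TS_FORCE_IRET flag don't exist, and a real version would also need the
canonical-RIP and flags checks that SYSRET demands):

static void mark_regs_poked(struct task_struct *tsk, struct pt_regs *regs)
{
	/*
	 * Hypothetical helper, called from ptrace, sigreturn, and exec
	 * after they modify ptregs. SYSRET requires RCX == RIP and
	 * R11 == RFLAGS (canonicality checks elided); if the modified
	 * regs violate that, latch a flag so the exit path returns
	 * with IRET instead of re-checking on every syscall.
	 */
	if (regs->cx != regs->ip || regs->r11 != regs->flags)
		tsk->thread.status |= TS_FORCE_IRET;	/* hypothetical flag */
}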
On Thu, Jan 25, 2018 at 12:04 PM, Brian Gerst <[email protected]> wrote:
>
> Another extra step the slow path does is checking to see if ptregs is
> safe for SYSRET. I think that can be mitigated by moving the check to
> the places that do modify ptregs (ptrace, sigreturn, and exec) which
> would set a flag to force return with IRET if the modified regs do not
> satisfy the criteria for SYSRET.
I tried to do some profiling, and none of that shows up for me.
That said, what _also_ doesn't show up is the actual page table switch
on entry. And that seems to be because the per-cpu trampoline code
isn't captured by perf (or at least not shown). Oh well.
What _does_ show up a bit is this in prepare_exit_to_usermode():
#ifdef CONFIG_COMPAT
	/*
	 * Compat syscalls set TS_COMPAT. Make sure we clear it before
	 * returning to user mode. We need to clear it *after* signal
	 * handling, because syscall restart has a fixup for compat
	 * syscalls. The fixup is exercised by the ptrace_syscall_32
	 * selftest.
	 *
	 * We also need to clear TS_REGS_POKED_I386: the 32-bit tracer
	 * special case only applies after poking regs and before the
	 * very next return to user mode.
	 */
	current->thread.status &= ~(TS_COMPAT|TS_I386_REGS_POKED);
#endif
and I think the problem there is that it is unnecessarily dirtying
that cacheline. Afaik, those bits are already clear 99.999% of the
time.
So things would be better if that 'status' were in thread_info
(to keep cachelines close to the other stuff we already touch) and the
code should have something like
if (unlikely(ti->status & (TS_COMPAT|TS_I386_REGS_POKED)))
or whatever.
There might be other similar small tuning issues going on.
So there is room for improvement there in the slow path.
Linus
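Putting those two tweaks together, that exit-path snippet might end up
looking like this (a sketch, assuming 'status' has been moved into
struct thread_info as suggested):

#ifdef CONFIG_COMPAT
	/*
	 * Test before clearing: the bits are almost always clear
	 * already, so the common case no longer dirties the cacheline.
	 */
	if (unlikely(ti->status & (TS_COMPAT|TS_I386_REGS_POKED)))
		ti->status &= ~(TS_COMPAT|TS_I386_REGS_POKED);
#endif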
On Thu, Jan 25, 2018 at 10:48 AM, Linus Torvalds
<[email protected]> wrote:
> On Mon, Jan 22, 2018 at 10:55 AM, Linus Torvalds
> <[email protected]> wrote:
>>
>> Honestly, I'd rather get rid of the fast-path entirely. Compared to
>> all the PTI mess, it's not even noticeable.
>
> So I looked at how that would be.
>
> Patch attached. Not really "tested", but I'm running the kernel with
> this patch now, and 'strace' etc works, and honestly, it seems very
> obvious.
>
> Also, code generation for 'do_syscall_64()' does not look horrible. In
> fact, it doesn't look all that much worse than the fast-path ever did.
>
> So the biggest impact of this is the extra register saves
> (SAVE_EXTRA_REGS) from setting up the full ptregs. And honestly, I
> hate how that stupid macro still uses "movq reg,off(%rsp)" instead of
> "pushq %reg".
>
> Considering the diffstat:
>
> 2 files changed, 2 insertions(+), 121 deletions(-)
>
> and how those 100+ lines are nasty assembly code, I do think we should
> just do it.
Feel free to Acked-by: Andy Lutomirski <[email protected]> that patch.
Or I can grab it and send it to -tip.
Re: the trampoline not showing up: if I find some time, I'll try to
wire it up correctly in kallsyms.
--Andy
On Thu, 25 Jan 2018, Andy Lutomirski wrote:
> On Thu, Jan 25, 2018 at 10:48 AM, Linus Torvalds
> <[email protected]> wrote:
> > On Mon, Jan 22, 2018 at 10:55 AM, Linus Torvalds
> > <[email protected]> wrote:
> >>
> >> Honestly, I'd rather get rid of the fast-path entirely. Compared to
> >> all the PTI mess, it's not even noticeable.
> >
> > So I looked at how that would be.
> >
> > Patch attached. Not really "tested", but I'm running the kernel with
> > this patch now, and 'strace' etc works, and honestly, it seems very
> > obvious.
> >
> > Also, code generation for 'do_syscall_64()' does not look horrible. In
> > fact, it doesn't look all that much worse than the fast-path ever did.
> >
> > So the biggest impact of this is the extra register saves
> > (SAVE_EXTRA_REGS) from setting up the full ptregs. And honestly, I
> > hate how that stupid macro still uses "movq reg,off(%rsp)" instead of
> > "pushq %reg".
> >
> > Considering the diffstat:
> >
> > 2 files changed, 2 insertions(+), 121 deletions(-)
> >
> > and how those 100+ lines are nasty assembly code, I do think we should
> > just do it.
>
> Feel free to Acked-by: Andy Lutomirski <[email protected]> that patch.
>
> Or I can grab it and send it to -tip.
That would be nice, so we can route it through x86/pti which provides it
for backporting cleanly.
Thanks,
tglx
On Thu, Jan 25, 2018 at 1:06 PM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Jan 25, 2018 at 1:02 PM, Andy Lutomirski <[email protected]> wrote:
>>
>> Feel free to Acked-by: Andy Lutomirski <[email protected]> that patch.
>>
>> Or I can grab it and send it to -tip.
>
> I'm not going to apply it for 4.15, I just wanted to see how it
> looked, and do some minimal profiling.
>
> From the profiles, as mentioned, moving 'status' from thread_struct to
> thread_info is probably worth doing. But I didn't look at the impact
> of that at all.
>
> So it should go through all the normal channels in -tip for 4.16.
>
> I'll happily sign off on the patch, but it was really pretty mindless,
> so I'm not sure I need the authorship either.
>
>> Re: the trampoline not showing up: if I find some time, I'll try to
>> wire it up correctly in kallsyms.
>
> That would be lovely. Right now the system call exit shows up pretty
> clearly in profiles, and most of it is (obviously) the cr3 write. So
> the missing entry trampoline is not insignificant.
>
With retpoline, the retpoline in the trampoline sucks. I don't need
perf for that -- I've benchmarked it both ways. It sucks. I'll fix
it, but it'll be kind of complicated.
--Andy
On Thu, Jan 25, 2018 at 1:08 PM, Andy Lutomirski <[email protected]> wrote:
>
> With retpoline, the retpoline in the trampoline sucks. I don't need
> perf for that -- I've benchmarked it both ways. It sucks. I'll fix
> it, but it'll be kind of complicated.
Ahh, I'd forgotten about that (and obviously didn't see it in the profiles).
But yeah, that is fixable even if it does require a page per CPU. Or
did you have some clever scheme in mind?
But that's independent of the slow/fast path.
Linus
On Thu, Jan 25, 2018 at 1:02 PM, Andy Lutomirski <[email protected]> wrote:
>
> Feel free to Acked-by: Andy Lutomirski <[email protected]> that patch.
>
> Or I can grab it and send it to -tip.
I'm not going to apply it for 4.15, I just wanted to see how it
looked, and do some minimal profiling.
From the profiles, as mentioned, moving 'status' from thread_struct to
thread_info is probably worth doing. But I didn't look at the impact
of that at all.
So it should go through all the normal channels in -tip for 4.16.
I'll happily sign off on the patch, but it was really pretty mindless,
so I'm not sure I need the authorship either.
> Re: the trampoline not showing up: if I find some time, I'll try to
> wire it up correctly in kallsyms.
That would be lovely. Right now the system call exit shows up pretty
clearly in profiles, and most of it is (obviously) the cr3 write. So
the missing entry trampoline is not insignificant.
Linus
On Thu, Jan 25, 2018 at 1:20 PM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Jan 25, 2018 at 1:08 PM, Andy Lutomirski <[email protected]> wrote:
>>
>> With retpoline, the retpoline in the trampoline sucks. I don't need
>> perf for that -- I've benchmarked it both ways. It sucks. I'll fix
>> it, but it'll be kind of complicated.
>
> Ahh, I'd forgotten about that (and obviously didn't see it in the profiles).
>
> But yeah, that is fixable even if it does require a page per CPU. Or
> did you have some clever scheme in mind?
Nothing clever. I was going to see if I could get actual
binutils-generated relocations to work in the trampoline. We already
have code to parse ELF relocations and turn them into a simple table,
and it shouldn't be *that* hard to run a separate pass on the entry
trampoline.
Another potentially useful if rather minor optimization would be to
rejigger the SYSCALL_DEFINE macros a bit. Currently we treat all
syscalls like this:
long func(long arg0, long arg1, long arg2, long arg3, long arg4, long arg5);
I wonder if we'd be better off doing:
long func(const struct pt_regs *regs);
and autogenerating:
static long SyS_read(const struct pt_regs *regs)
{
	return sys_read(regs->di, ...);
}
On Thu, Jan 25, 2018 at 1:31 PM, Andy Lutomirski <[email protected]> wrote:
> On Thu, Jan 25, 2018 at 1:20 PM, Linus Torvalds
> <[email protected]> wrote:
>> On Thu, Jan 25, 2018 at 1:08 PM, Andy Lutomirski <[email protected]> wrote:
>>>
>>> With retpoline, the retpoline in the trampoline sucks. I don't need
>>> perf for that -- I've benchmarked it both ways. It sucks. I'll fix
>>> it, but it'll be kind of complicated.
>>
>> Ahh, I'd forgotten about that (and obviously didn't see it in the profiles).
>>
>> But yeah, that is fixable even if it does require a page per CPU. Or
>> did you have some clever scheme in mind?
>
> Nothing clever. I was going to see if I could get actual
> binutils-generated relocations to work in the trampoline. We already
> have code to parse ELF relocations and turn them into a simple table,
> and it shouldn't be *that* hard to run a separate pass on the entry
> trampoline.
>
> Another potentially useful if rather minor optimization would be to
> rejigger the SYSCALL_DEFINE macros a bit. Currently we treat all
> syscalls like this:
>
> long func(long arg0, long arg1, long arg2, long arg3, long arg4, long arg5);
>
> I wonder if we'd be better off doing:
>
> long func(const struct pt_regs *regs);
>
> and autogenerating:
>
> static long SyS_read(const struct pt_regs *regs)
> {
> return sys_read(regs->di, ...);
> }
If you're rejiggering, can we also put in a mechanism for detecting
which registers to clear so that userspace can't inject useful values
into speculation paths?
https://patchwork.kernel.org/patch/10153753/
On Thu, Jan 25, 2018 at 1:39 PM, Dan Williams <[email protected]> wrote:
> On Thu, Jan 25, 2018 at 1:31 PM, Andy Lutomirski <[email protected]> wrote:
>> On Thu, Jan 25, 2018 at 1:20 PM, Linus Torvalds
>> <[email protected]> wrote:
>>> On Thu, Jan 25, 2018 at 1:08 PM, Andy Lutomirski <[email protected]> wrote:
>>>>
>>>> With retpoline, the retpoline in the trampoline sucks. I don't need
>>>> perf for that -- I've benchmarked it both ways. It sucks. I'll fix
>>>> it, but it'll be kind of complicated.
>>>
>>> Ahh, I'd forgotten about that (and obviously didn't see it in the profiles).
>>>
>>> But yeah, that is fixable even if it does require a page per CPU. Or
>>> did you have some clever scheme in mind?
>>
>> Nothing clever. I was going to see if I could get actual
>> binutils-generated relocations to work in the trampoline. We already
>> have code to parse ELF relocations and turn them into a simple table,
>> and it shouldn't be *that* hard to run a separate pass on the entry
>> trampoline.
>>
>> Another potentially useful if rather minor optimization would be to
>> rejigger the SYSCALL_DEFINE macros a bit. Currently we treat all
>> syscalls like this:
>>
>> long func(long arg0, long arg1, long arg2, long arg3, long arg4, long arg5);
>>
>> I wonder if we'd be better off doing:
>>
>> long func(const struct pt_regs *regs);
>>
>> and autogenerating:
>>
>> static long SyS_read(const struct pt_regs *regs)
>> {
>> return sys_read(regs->di, ...);
>> }
>
> If you're rejiggering, can we also put in a mechanism for detecting
> which registers to clear so that userspace can't inject useful values
> into speculation paths?
>
> https://patchwork.kernel.org/patch/10153753/
My SYSCALL_DEFINE rejigger suggestion up-thread does this for free as
a side effect.
That being said, I think this would be more accurately characterized
as "so that userspace has a somewhat harder time injecting useful
values into speculation paths".
On Thu, Jan 25, 2018 at 1:39 PM, Dan Williams <[email protected]> wrote:
>
> If you're rejiggering, can we also put in a mechanism for detecting
> which registers to clear so that userspace can't inject useful values
> into speculation paths?
That actually becomes trivial with just the "no fastpath" patch I sent
out. You can just clear all of them.
Sure, then do_syscall_64() will reload the first six, but since
those are the argument registers anyway, and since they are
caller-clobbered, they have very short lifetimes. So it would
effectively not really be an issue.
But yes, SYSCALL_DEFINEx() rejiggery would close even that tiny hole.
Linus
From: Andy Lutomirski
> Sent: 25 January 2018 21:31
...
> Another potentially useful if rather minor optimization would be to
> rejigger the SYSCALL_DEFINE macros a bit. Currently we treat all
> syscalls like this:
>
> long func(long arg0, long arg1, long arg2, long arg3, long arg4, long arg5);
>
> I wonder if we'd be better off doing:
>
> long func(const struct pt_regs *regs);
>
> and autogenerating:
>
> static long SyS_read(const struct pt_regs *regs)
> {
> return sys_read(regs->di, ...);
> }
Hmmm....
NetBSD (and the other BSD?) defines a structure for the arguments to each syscall.
On systems where all function arguments are passed on the stack, the user
stack that contains the arguments is copied into a kernel buffer.
For amd64 I changed the register save area layout so that the arguments were in
the right order [1]. Then added an extra area for the extra arguments that had to be
read from the user stack.
Just passing a pointer into the save area has to be better than reading
all the values again.
[1] There was some horrid fallout from that :-(
David
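For reference, the NetBSD-style per-syscall argument structure looks
roughly like this (paraphrased from memory rather than quoted from
NetBSD source; syscallarg() pads each field to register width):

struct sys_read_args {
	syscallarg(int) fd;
	syscallarg(void *) buf;
	syscallarg(size_t) nbyte;
};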
> NetBSD (and the other BSD?) defines a structure for the arguments to
> each syscall.
Goes back to v7 or so, but they put the syscall arguments into the uarea
so that no pointers were needed (the uarea being a per-process mapping at
a fixed address), in order to also reduce pointer-dereferencing costs
(not that those matter much on modern processors).
Alan.
On Fri, Jan 26, 2018 at 6:24 AM, Alan Cox <[email protected]> wrote:
>> NetBSD (and the other BSD?) defines a structure for the arguments to
>> each syscall.
>
> Goes back to v7 or so but they put the syscall arguments into the uarea
> so that no pointers were needed (uarea being a per process mapping at a
> fixed address) in order to also reduce pointer dereferencing costs (not
> that those matter much on modern processors)
>
I gave the rearrangement like this a try yesterday and it's a bit of a
mess. Part of the problem is that there are a bunch of pieces of code
that expect sys_xyz() to be actual callable functions. The best way
to deal with that is probably to switch to calling normal functions.
On Fri, Jan 26, 2018 at 7:57 AM, Andy Lutomirski <[email protected]> wrote:
>
> I gave the rearrangement like this a try yesterday and it's a bit of a
> mess. Part of the problem is that there are a bunch of pieces of code
> that expect sys_xyz() to be actual callable functions.
That's not supposed to be a mess.
That's part of why we do that whole indirection through SYSC##xyz to
sys##_xyz: the asm-callable ones will do the full casting of
troublesome arguments (some architectures have C calling sequence
rules that have security issues, so we need to make sure that the
arguments actually follow the right rules and 'int' arguments are
properly sign-extended etc).
So that whole indirection could be made to *also* create another
version of the syscall that instead took the arguments from ptregs.
We already do exactly that for the tracing events: look how
FTRACE_SYSCALLS ends up creating that extra metadata.
The ptreg version should be done the same way: don't make 'sys_xyz()'
take a struct ptregs, instead make those SYSCALL_DEFINE*() macros
create a _new_ function called 'ptregs_xyz()' and then that function
does the argument unpacking.
Then the x86 system call table can just be switched over to call those
ptreg versions instead.
Linus
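The shape being described, sketched for the two-argument case (a
hypothetical simplification of SYSCALL_DEFINEx; the real machinery also
does the sign-extension casts mentioned above and the ftrace metadata):

#define SYSCALL_DEFINE2(name, t1, a1, t2, a2)				\
	static long SYSC_##name(t1 a1, t2 a2);				\
	long ptregs_##name(const struct pt_regs *regs)			\
	{								\
		/* unpack the arguments from the register save area */	\
		return SYSC_##name((t1)regs->di, (t2)regs->si);		\
	}								\
	long sys_##name(t1 a1, t2 a2)					\
	{								\
		return SYSC_##name(a1, a2);				\
	}								\
	static long SYSC_##name(t1 a1, t2 a2)

The x86 syscall table would then point at ptregs_##name() while
sys_##name() stays callable from C.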
On Fri, Jan 26, 2018 at 09:40:23AM -0800, Linus Torvalds wrote:
> On Fri, Jan 26, 2018 at 7:57 AM, Andy Lutomirski <[email protected]> wrote:
> >
> > I gave the rearrangement like this a try yesterday and it's a bit of a
> > mess. Part of the problem is that there are a bunch of pieces of code
> > that expect sys_xyz() to be actual callable functions.
>
> That's not supposed to be a mess.
>
> That's part of why we do that whole indirection through SYSC##xyz to
> sys##_xyz: the asm-callable ones will do the full casting of
> troublesome arguments (some architectures have C calling sequence
> rules that have security issues, so we need to make sure that the
> arguments actually follow the right rules and 'int' arguments are
> properly sign-extended etc).
>
> So that whole indirection could be made to *also* create another
> version of the syscall that instead took the arguments from ptregs.
>
> We already do exactly that for the tracing events: look how
> FTRACE_SYSCALLS ends up creating that extra metadata.
>
> The ptreg version should be done the same way: don't make 'sys_xyz()'
> take a struct ptregs, instead make those SYSCALL_DEFINE*() macros
> create a _new_ function called 'ptregs_xyz()' and then that function
> does the argument unpacking.
>
> Then the x86 system call table can just be switched over to call those
> ptreg versions instead.
Umm... What about other architectures? Or do you want SYSCALL_DEFINE...
to be per-arch? I wonder how much would that "go through pt_regs" hurt
on something like sparc...
On Fri, Jan 26, 2018 at 10:07 AM, Al Viro <[email protected]> wrote:
>
> Umm... What about other architectures? Or do you want SYSCALL_DEFINE...
> to be per-arch? I wonder how much would that "go through pt_regs" hurt
> on something like sparc...
No, but I just talked to Will Deacon about register clearing on entry,
and so I suspect that arm64 might want something similar too.
So I think some opt-in for letting architectures add their own
function would be good. Because it wouldn't be all architectures, but
it probably _would_ be more than just x86.
You need to add architecture-specific "load argX from ptregs" macros anyway.
Linus
On Fri, Jan 26, 2018 at 10:13 AM, Linus Torvalds
<[email protected]> wrote:
> On Fri, Jan 26, 2018 at 10:07 AM, Al Viro <[email protected]> wrote:
>>
>> Umm... What about other architectures? Or do you want SYSCALL_DEFINE...
>> to be per-arch? I wonder how much would that "go through pt_regs" hurt
>> on something like sparc...
>
> No, but I just talked to Will Deacon about register clearing on entry,
> and so I suspect that arm64 might want something similar too.
>
> So I think some opt-in for letting architectures add their own
> function would be good. Because it wouldn't be all architectures, but
> it probably _would_ be more than just x86.
>
> You need to add architecture-specific "load argX from ptregs" macros anyway.
I mocked that up, and it's straightforward. I ended up with something like:
#define __ARCH_SYSCALL_ARGS(n, ...) (regs->di, ...)
(obviously modified so it actually compiles.)
The issue is that doing it this way gives us, effectively:
long sys_foo(int a, int b)
{
	body here;
}

long SyS_foo(const struct pt_regs *regs)
{
	return sys_foo(regs->di, regs->si);
}
whereas what we want is *static* long sys_foo(...). So I could split
the macros into:
DEFINE_SYSCALL2(foo, ....)
and
DEFINE_EXTERN_SYSCALL2(foo, ...)
or I could just fix up all the code that expects calling sys_foo()
across files to work.
My mockup patch doesn't actually work because of compat crap, but
that's manageable.
--Andy
On Fri, Jan 26, 2018 at 10:23 AM, Andy Lutomirski <[email protected]> wrote:
>
> The issue is that doing it this way gives us, effectively:
>
> long sys_foo(int a, int b)
> {
> body here;
> }
>
> long SyS_foo(const struct pt_regs *regs)
> {
> return sys_foo(regs->di, regs->si);
> }
>
> whereas what we want is *static* long sys_foo(...).
How about just marking 'sys_foo()' as being always_inline (but still
not static)? Because the case that _matters_ is the SyS_foo() thing
when this is enabled.
Sure, you'll get two copies of the code (one in SyS_foo(), the other
being the callable-from C 'sys_foo()' that is exported and almost
never used). But that seems a fairly small price to pay. We could
leave it for later to try to get rid of the unused copies entirely.
Linus
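Concretely, something like the following sketch (it leans on the C99
rule that an extern declaration alongside an inline definition still
emits the external symbol, so treat the details as approximate):

long sys_foo(int a, int b);	/* extern declaration keeps the symbol emitted */

inline __attribute__((always_inline)) long sys_foo(int a, int b)
{
	/* body here */
	return 0;
}

long ptregs_foo(const struct pt_regs *regs)
{
	return sys_foo((int)regs->di, (int)regs->si);	/* body gets inlined */
}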
On Fri, Jan 26, 2018 at 10:54 AM, Linus Torvalds
<[email protected]> wrote:
> On Fri, Jan 26, 2018 at 10:23 AM, Andy Lutomirski <[email protected]> wrote:
>>
>> The issue is that doing it this way gives us, effectively:
>>
>> long sys_foo(int a, int b)
>> {
>> body here;
>> }
>>
>> long SyS_foo(const struct pt_regs *regs)
>> {
>> return sys_foo(regs->di, regs->si);
>> }
>>
>> whereas what we want is *static* long sys_foo(...).
>
> How about just marking 'sys_foo()' as being always_inline (but still
> not static)? Because the case that _matters_ is the SyS_foo() thing
> when this is enabled.
>
> Sure, you'll get two copies of the code (one in SyS_foo(), the other
> being the callable-from C 'sys_foo()' that is exported and almost
> never used). But that seems a fairly small price to pay. We could
> leave it for later to try to get rid of the unused copies entirely.
>
I could do that, but Josh Triplett will yell at me.
Anyway, I'll fiddle with it. This isn't exactly high priority.
Hi Andy,
On Fri, Jan 26, 2018 at 10:23:23AM -0800, Andy Lutomirski wrote:
> On Fri, Jan 26, 2018 at 10:13 AM, Linus Torvalds
> <[email protected]> wrote:
> > On Fri, Jan 26, 2018 at 10:07 AM, Al Viro <[email protected]> wrote:
> >>
> >> Umm... What about other architectures? Or do you want SYSCALL_DEFINE...
> >> to be per-arch? I wonder how much would that "go through pt_regs" hurt
> >> on something like sparc...
> >
> > No, but I just talked to Will Deacon about register clearing on entry,
> > and so I suspect that arm64 might want something similar too.
> >
> > So I think some opt-in for letting architectures add their own
> > function would be good. Because it wouldn't be all architectures, but
> > it probably _would_ be more than just x86.
> >
> > You need to add architecture-specific "load argX from ptregs" macros anyway.
>
> I mocked that up, and it's straightforward. I ended up with something like:
>
> #define __ARCH_SYSCALL_ARGS(n, ...) (regs->di, ...)
>
> (obviously modified so it actually compiles.)
>
> The issue is that doing it this way gives us, effectively:
>
> long sys_foo(int a, int b)
> {
> body here;
> }
>
> long SyS_foo(const struct pt_regs *regs)
> {
> return sys_foo(regs->di, regs->si);
> }
>
> whereas what we want is *static* long sys_foo(...). So I could split
> the macros into:
>
> DEFINE_SYSCALL2(foo, ....)
>
> and
>
> DEFINE_EXTERN_SYSCALL2(foo, ...)
>
> or I could just fix up all the code that expects calling sys_foo()
> across files to work.
Another issue with this style of macro definition exists on architectures
where the calling convention needs you to carry state around depending on
how you packed the previous parameters. For example, on 32-bit ARM, 64-bit
values are passed in adjacent pairs of registers but the low numbered
register needs to be even. This is what stopped me from trying to use
existing helpers such as syscall_get_arguments to unpack the pt_regs,
and it generally means that anything that says "get me argument n" is
going to require constructing arguments 0..n-1 first.
To do this properly I think we'll either need to pass back the size and
current register offset to the arch code, or just allow the thing to be
overridden per syscall (the case above isn't especially frequent).
Will
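A concrete instance of the pairing rule (illustrative; the register
assignments follow the 32-bit ARM EABI syscall convention):

/*
 * pread64(fd, buf, count, pos): the 64-bit 'pos' must start in an
 * even register, so r3 is skipped and pos lands in r4/r5:
 *
 *   r0 = fd, r1 = buf, r2 = count, r3 = (unused pad), r4/r5 = pos
 *
 * "Get me argument 3" therefore depends on the sizes of arguments
 * 0..2, which is why a simple per-index accessor falls over.
 */
long sys_pread64(unsigned int fd, char *buf, size_t count, loff_t pos);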
From: Will Deacon
> Sent: 29 January 2018 13:20
...
> Another issue with this style of macro definition exists on architectures
> where the calling convention needs you to carry state around depending on
> how you packed the previous parameters. For example, on 32-bit ARM, 64-bit
> values are passed in adjacent pairs of registers but the low numbered
> register needs to be even. This is what stopped me from trying to use
> existing helpers such as syscall_get_arguments to unpack the pt_regs
> and it generally means that anything that says "get me argument n" is going
> to require constructing arguments 0..n-1 first.
>
> To do this properly I think we'll either need to pass back the size and
> current register offset to the arch code, or just allow the thing to be
> overridden per syscall (the case above isn't especially frequent).
If you generate a structure from the argument list, that might work
'by magic'.
Certainly you can add explicit pads to any structure.
David