Seems to be a recent regression, maybe related to entry/exit work changes.
# ./tools/testing/selftests/x86/ptrace_syscall_32
[RUN] Check int80 return regs
[OK] getpid() preserves regs
[OK] kill(getpid(), SIGUSR1) preserves regs
[RUN] Check AT_SYSINFO return regs
[OK] getpid() preserves regs
[OK] kill(getpid(), SIGUSR1) preserves regs
[RUN] ptrace-induced syscall restart
Child will make one syscall
[RUN] SYSEMU
[FAIL] Initial args are wrong (nr=224, args=10 11 12 13 14 4289172732)
[RUN] Restart the syscall (ip = 0xf7f3b549)
[OK] Restarted nr and args are correct
[RUN] Change nr and args and restart the syscall (ip = 0xf7f3b549)
[OK] Replacement nr and args are correct
[OK] Child exited cleanly
[RUN] kernel syscall restart under ptrace
Child will take a nap until signaled
[RUN] SYSCALL
[FAIL] Initial args are wrong (nr=29, args=0 0 0 0 0 4289172732)
[RUN] SYSCALL
[OK] Args after SIGUSR1 are correct (ax = -514)
[OK] Child got SIGUSR1
[RUN] Step again
[OK] pause(2) restarted correctly
On Sat, Aug 29, 2020 at 12:52 PM Andy Lutomirski <[email protected]> wrote:
>
> Seems to be a recent regression, maybe related to entry/exit work changes.
>
> # ./tools/testing/selftests/x86/ptrace_syscall_32
> [RUN] Check int80 return regs
> [OK] getpid() preserves regs
> [OK] kill(getpid(), SIGUSR1) preserves regs
> [RUN] Check AT_SYSINFO return regs
> [OK] getpid() preserves regs
> [OK] kill(getpid(), SIGUSR1) preserves regs
> [RUN] ptrace-induced syscall restart
> Child will make one syscall
> [RUN] SYSEMU
> [FAIL] Initial args are wrong (nr=224, args=10 11 12 13 14 4289172732)
> [RUN] Restart the syscall (ip = 0xf7f3b549)
> [OK] Restarted nr and args are correct
> [RUN] Change nr and args and restart the syscall (ip = 0xf7f3b549)
> [OK] Replacement nr and args are correct
> [OK] Child exited cleanly
> [RUN] kernel syscall restart under ptrace
> Child will take a nap until signaled
> [RUN] SYSCALL
> [FAIL] Initial args are wrong (nr=29, args=0 0 0 0 0 4289172732)
> [RUN] SYSCALL
> [OK] Args after SIGUSR1 are correct (ax = -514)
> [OK] Child got SIGUSR1
> [RUN] Step again
> [OK] pause(2) restarted correctly
Bisected to commit 0b085e68f407 ("x86/entry: Consolidate 32/64 bit
syscall entry").
It looks like it is because syscall_enter_from_user_mode() is called
before reading the 6th argument from the user stack.
--
Brian Gerst
On Sat, Aug 29, 2020 at 9:40 PM Brian Gerst <[email protected]> wrote:
>
> On Sat, Aug 29, 2020 at 12:52 PM Andy Lutomirski <[email protected]> wrote:
> >
> > Seems to be a recent regression, maybe related to entry/exit work changes.
> >
> > # ./tools/testing/selftests/x86/ptrace_syscall_32
> > [RUN] Check int80 return regs
> > [OK] getpid() preserves regs
> > [OK] kill(getpid(), SIGUSR1) preserves regs
> > [RUN] Check AT_SYSINFO return regs
> > [OK] getpid() preserves regs
> > [OK] kill(getpid(), SIGUSR1) preserves regs
> > [RUN] ptrace-induced syscall restart
> > Child will make one syscall
> > [RUN] SYSEMU
> > [FAIL] Initial args are wrong (nr=224, args=10 11 12 13 14 4289172732)
> > [RUN] Restart the syscall (ip = 0xf7f3b549)
> > [OK] Restarted nr and args are correct
> > [RUN] Change nr and args and restart the syscall (ip = 0xf7f3b549)
> > [OK] Replacement nr and args are correct
> > [OK] Child exited cleanly
> > [RUN] kernel syscall restart under ptrace
> > Child will take a nap until signaled
> > [RUN] SYSCALL
> > [FAIL] Initial args are wrong (nr=29, args=0 0 0 0 0 4289172732)
> > [RUN] SYSCALL
> > [OK] Args after SIGUSR1 are correct (ax = -514)
> > [OK] Child got SIGUSR1
> > [RUN] Step again
> > [OK] pause(2) restarted correctly
>
> Bisected to commit 0b085e68f407 ("x86/entry: Consolidate 32/64 bit
> syscall entry").
> It looks like it is because syscall_enter_from_user_mode() is called
> before reading the 6th argument from the user stack.
Ugh. I caught, in review, a potential related issue with exit (not a
problem in current kernels), but I missed the entry version.
Thomas, can we revert the syscall_enter() and syscall_exit() part of
the series? I think that they almost work for x86, but not quite as
indicated by this bug. Even if we imagine we can somehow hack around
this bug, I imagine we're going to find other problems with this
model, e.g. the potential upcoming exit problem I noted in my review.
I really think the model should be:
void do_syscall_whatever(...)
{
irqentry_enter(...);
instrumentation_begin();
/* Do whatever arch ABI oddities are needed on entry. */
Then either:
syscall_begin(arch, nr, regs);
dispatch the syscall;
syscall_end(arch, nr, regs);
Or just:
generic_do_syscall(arch, nr, regs);
/* Do whatever arch ABI oddities are needed on exit from the syscall. */
instrumentation_end();
irqentry_exit(...);
}
x86 has an ABI oddity needed on entry: this fast syscall argument
fixup. We also might end up with ABI oddities on exit if we ever try
to make single-stepping of syscalls work fully correctly. x86 sort of
gets away without specifying arch because the arch helpers that get
called for audit, etc can deduce the arch, but this is kind of gross.
I suppose we could omit arch as an explicit parameter.
Or I suppose we could try to rejigger the API in time for 5.9.
Fortunately only x86 uses the new APIs so far. I cc'd a bunch of
other arch maintainers to see if other architectures fit well in the
new syscall_enter() model, but I feel like the fact that x86 is
already broken indicates that we messed it up a bit.
--Andy
On Sun, Aug 30 2020 at 08:52, Andy Lutomirski wrote:
>> > [RUN] SYSCALL
>> > [FAIL] Initial args are wrong (nr=29, args=0 0 0 0 0 4289172732)
>> > [RUN] SYSCALL
>> > [OK] Args after SIGUSR1 are correct (ax = -514)
>> > [OK] Child got SIGUSR1
>> > [RUN] Step again
>> > [OK] pause(2) restarted correctly
>>
>> Bisected to commit 0b085e68f407 ("x86/entry: Consolidate 32/64 bit
>> syscall entry").
>> It looks like it is because syscall_enter_from_user_mode() is called
>> before reading the 6th argument from the user stack.
Bah.I don't know how I managed to miss that part and interestingly
enough that none of the robots caught that either
> Thomas, can we revert the syscall_enter() and syscall_exit() part of
> the series?
Hrm.
> I think that they almost work for x86, but not quite as
> indicated by this bug. Even if we imagine we can somehow hack around
> this bug, I imagine we're going to find other problems with this
> model, e.g. the potential upcoming exit problem I noted in my review.
What's the upcoming problem?
> I really think the model should be:
>
> void do_syscall_whatever(...)
> {
> irqentry_enter(...);
> instrumentation_begin();
>
> /* Do whatever arch ABI oddities are needed on entry. */
>
> Then either:
> syscall_begin(arch, nr, regs);
> dispatch the syscall;
> syscall_end(arch, nr, regs);
>
> Or just:
> generic_do_syscall(arch, nr, regs);
>
> /* Do whatever arch ABI oddities are needed on exit from the syscall. */
>
> instrumentation_end();
> irqentry_exit(...);
> }
I don't think we want that in general. The current variant is perfectly
fine for everything except the 32bit fast syscall nonsense. Also
irqentry_entry/exit is not equivalent to the syscall_enter/exit
counterparts.
> x86 has an ABI oddity needed on entry: this fast syscall argument
> fixup. We also might end up with ABI oddities on exit if we ever try
> to make single-stepping of syscalls work fully correctly. x86 sort of
> gets away without specifying arch because the arch helpers that get
> called for audit, etc can deduce the arch, but this is kind of gross.
> I suppose we could omit arch as an explicit parameter.
I had that in one of the early versions and was advised to drop that.
> Or I suppose we could try to rejigger the API in time for 5.9.
> Fortunately only x86 uses the new APIs so far. I cc'd a bunch of
> other arch maintainers to see if other architectures fit well in the
> new syscall_enter() model, but I feel like the fact that x86 is
> already broken indicates that we messed it up a bit.
It's not unfixable and the fix is close to what you suggested above
except that it preserves the straight forward stuff for the !32bit fast
syscall case. Completely untested patch below. I run proper tests
tomorrow with brain awake.
Thanks,
tglx
---
arch/x86/entry/common.c | 29 ++++++++++++++++--------
include/linux/entry-common.h | 51 +++++++++++++++++++++++++++++++++++--------
kernel/entry/common.c | 35 ++++++++++++++++++++++++-----
3 files changed, 91 insertions(+), 24 deletions(-)
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -60,16 +60,10 @@
#if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
static __always_inline unsigned int syscall_32_enter(struct pt_regs *regs)
{
- unsigned int nr = (unsigned int)regs->orig_ax;
-
if (IS_ENABLED(CONFIG_IA32_EMULATION))
current_thread_info()->status |= TS_COMPAT;
- /*
- * Subtlety here: if ptrace pokes something larger than 2^32-1 into
- * orig_ax, the unsigned int return value truncates it. This may
- * or may not be necessary, but it matches the old asm behavior.
- */
- return (unsigned int)syscall_enter_from_user_mode(regs, nr);
+
+ return (unsigned int)regs->orig_ax;
}
/*
@@ -91,15 +85,29 @@ static __always_inline void do_syscall_3
{
unsigned int nr = syscall_32_enter(regs);
+ /*
+ * Subtlety here: if ptrace pokes something larger than 2^32-1 into
+ * orig_ax, the unsigned int return value truncates it. This may
+ * or may not be necessary, but it matches the old asm behavior.
+ */
+ nr = (unsigned int)syscall_enter_from_user_mode(regs, nr);
+
do_syscall_32_irqs_on(regs, nr);
syscall_exit_to_user_mode(regs);
}
static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
{
- unsigned int nr = syscall_32_enter(regs);
+ unsigned int nr = syscall_32_enter(regs);
int res;
+ /*
+ * This cannot use syscall_enter_from_user_mode() as it has to
+ * fetch EBP before invoking any of the syscall entry work
+ * functions.
+ */
+ syscall_enter_from_user_mode_prepare(regs);
+
instrumentation_begin();
/* Fetch EBP from where the vDSO stashed it. */
if (IS_ENABLED(CONFIG_X86_64)) {
@@ -122,6 +130,9 @@ static noinstr bool __do_fast_syscall_32
return false;
}
+ /* The case truncates any ptrace induced syscall nr > 2^32 -1 */
+ nr = (unsigned int)syscall_enter_from_user_mode_work(regs, nr);
+
/* Now this is just like a normal syscall. */
do_syscall_32_irqs_on(regs, nr);
syscall_exit_to_user_mode(regs);
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -110,15 +110,30 @@ static inline __must_check int arch_sysc
#endif
/**
- * syscall_enter_from_user_mode - Check and handle work before invoking
- * a syscall
+ * syscall_enter_from_user_mode_prepare - Establish state and enable interrupts
* @regs: Pointer to currents pt_regs
- * @syscall: The syscall number
*
* Invoked from architecture specific syscall entry code with interrupts
* disabled. The calling code has to be non-instrumentable. When the
- * function returns all state is correct and the subsequent functions can be
- * instrumented.
+ * function returns all state is correct, interrupts are enabled and the
+ * subsequent functions can be instrumented.
+ *
+ * This handles lockdep, RCU (context tracking) and tracing state.
+ *
+ * This is invoked when there is extra architecture specific functionality
+ * to be done between establishing state and handling user mode entry work.
+ */
+void syscall_enter_from_user_mode_prepare(struct pt_regs *regs);
+
+/**
+ * syscall_enter_from_user_mode_work - Check and handle work before invoking
+ * a syscall
+ * @regs: Pointer to currents pt_regs
+ * @syscall: The syscall number
+ *
+ * Invoked from architecture specific syscall entry code with interrupts
+ * enabled after invoking syscall_enter_from_user_mode_prepare() and extra
+ * architecture specific work.
*
* Returns: The original or a modified syscall number
*
@@ -127,12 +142,30 @@ static inline __must_check int arch_sysc
* syscall_set_return_value() first. If neither of those are called and -1
* is returned, then the syscall will fail with ENOSYS.
*
- * The following functionality is handled here:
+ * It handles the following work items:
*
- * 1) Establish state (lockdep, RCU (context tracking), tracing)
- * 2) TIF flag dependent invocations of arch_syscall_enter_tracehook(),
+ * 1) TIF flag dependent invocations of arch_syscall_enter_tracehook(),
* __secure_computing(), trace_sys_enter()
- * 3) Invocation of audit_syscall_entry()
+ * 2) Invocation of audit_syscall_entry()
+ */
+long syscall_enter_from_user_mode_work(struct pt_regs *regs, long syscall);
+
+/**
+ * syscall_enter_from_user_mode - Establish state and check and handle work
+ * before invoking a syscall
+ * @regs: Pointer to currents pt_regs
+ * @syscall: The syscall number
+ *
+ * Invoked from architecture specific syscall entry code with interrupts
+ * disabled. The calling code has to be non-instrumentable. When the
+ * function returns all state is correct, interrupts are enabled and the
+ * subsequent functions can be instrumented.
+ *
+ * This is combination of syscall_enter_from_user_mode_prepare() and
+ * syscall_enter_from_user_mode_work().
+ *
+ * Returns: The original or a modified syscall number. See
+ * syscall_enter_from_user_mode_work() for further explanation.
*/
long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall);
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -68,22 +68,45 @@ static long syscall_trace_enter(struct p
return ret ? : syscall;
}
-noinstr long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall)
+static __always_inline long
+__syscall_enter_from_user_work(struct pt_regs *regs, long syscall)
{
unsigned long ti_work;
- enter_from_user_mode(regs);
- instrumentation_begin();
-
- local_irq_enable();
ti_work = READ_ONCE(current_thread_info()->flags);
if (ti_work & SYSCALL_ENTER_WORK)
syscall = syscall_trace_enter(regs, syscall, ti_work);
- instrumentation_end();
return syscall;
}
+long syscall_enter_from_user_mode_work(struct pt_regs *regs, long syscall)
+{
+ return __syscall_enter_from_user_work(regs, syscall);
+}
+
+noinstr long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall)
+{
+ long ret;
+
+ enter_from_user_mode(regs);
+
+ instrumentation_begin();
+ local_irq_enable();
+ ret = __syscall_enter_from_user_work(regs, syscall);
+ instrumentation_end();
+
+ return ret;
+}
+
+noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs)
+{
+ enter_from_user_mode(regs);
+ instrumentation_begin();
+ local_irq_enable();
+ instrumentation_end();
+}
+
/**
* exit_to_user_mode - Fixup state when exiting to user mode
*
On Tue, Sep 1, 2020 at 4:50 PM Thomas Gleixner <[email protected]> wrote:
>
> On Sun, Aug 30 2020 at 08:52, Andy Lutomirski wrote:
> >> > [RUN] SYSCALL
> >> > [FAIL] Initial args are wrong (nr=29, args=0 0 0 0 0 4289172732)
> >> > [RUN] SYSCALL
> >> > [OK] Args after SIGUSR1 are correct (ax = -514)
> >> > [OK] Child got SIGUSR1
> >> > [RUN] Step again
> >> > [OK] pause(2) restarted correctly
> >>
> >> Bisected to commit 0b085e68f407 ("x86/entry: Consolidate 32/64 bit
> >> syscall entry").
> >> It looks like it is because syscall_enter_from_user_mode() is called
> >> before reading the 6th argument from the user stack.
>
> Bah.I don't know how I managed to miss that part and interestingly
> enough that none of the robots caught that either
>
> > Thomas, can we revert the syscall_enter() and syscall_exit() part of
> > the series?
>
> Hrm.
>
> > I think that they almost work for x86, but not quite as
> > indicated by this bug. Even if we imagine we can somehow hack around
> > this bug, I imagine we're going to find other problems with this
> > model, e.g. the potential upcoming exit problem I noted in my review.
>
> What's the upcoming problem?
If we ever want to get single-stepping fully correct across syscalls,
we might need to inject SIGTRAP on syscall return. This would be more
awkward if we can't run instrumentable code after the syscall part of
the syscall is done.
>
> > I really think the model should be:
> >
> > void do_syscall_whatever(...)
> > {
> > irqentry_enter(...);
> > instrumentation_begin();
> >
> > /* Do whatever arch ABI oddities are needed on entry. */
> >
> > Then either:
> > syscall_begin(arch, nr, regs);
> > dispatch the syscall;
> > syscall_end(arch, nr, regs);
> >
> > Or just:
> > generic_do_syscall(arch, nr, regs);
> >
> > /* Do whatever arch ABI oddities are needed on exit from the syscall. */
> >
> > instrumentation_end();
> > irqentry_exit(...);
> > }
>
> I don't think we want that in general. The current variant is perfectly
> fine for everything except the 32bit fast syscall nonsense. Also
> irqentry_entry/exit is not equivalent to the syscall_enter/exit
> counterparts.
If there are any architectures in which actual work is needed to
figure out whether something is a syscall in the first place, they'll
want to do the usual kernel entry work before the syscall entry work.
Maybe your patch actually makes this possible -- I haven't digested
all the details yet.
Who advised you to drop the arch parameter?
> ---
> arch/x86/entry/common.c | 29 ++++++++++++++++--------
> include/linux/entry-common.h | 51 +++++++++++++++++++++++++++++++++++--------
> kernel/entry/common.c | 35 ++++++++++++++++++++++++-----
> 3 files changed, 91 insertions(+), 24 deletions(-)
>
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -60,16 +60,10 @@
> #if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
> static __always_inline unsigned int syscall_32_enter(struct pt_regs *regs)
> {
> - unsigned int nr = (unsigned int)regs->orig_ax;
> -
> if (IS_ENABLED(CONFIG_IA32_EMULATION))
> current_thread_info()->status |= TS_COMPAT;
> - /*
> - * Subtlety here: if ptrace pokes something larger than 2^32-1 into
> - * orig_ax, the unsigned int return value truncates it. This may
> - * or may not be necessary, but it matches the old asm behavior.
> - */
> - return (unsigned int)syscall_enter_from_user_mode(regs, nr);
> +
> + return (unsigned int)regs->orig_ax;
> }
>
> /*
> @@ -91,15 +85,29 @@ static __always_inline void do_syscall_3
> {
> unsigned int nr = syscall_32_enter(regs);
>
> + /*
> + * Subtlety here: if ptrace pokes something larger than 2^32-1 into
> + * orig_ax, the unsigned int return value truncates it. This may
> + * or may not be necessary, but it matches the old asm behavior.
> + */
> + nr = (unsigned int)syscall_enter_from_user_mode(regs, nr);
> +
> do_syscall_32_irqs_on(regs, nr);
> syscall_exit_to_user_mode(regs);
> }
>
> static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
> {
> - unsigned int nr = syscall_32_enter(regs);
> + unsigned int nr = syscall_32_enter(regs);
> int res;
>
> + /*
> + * This cannot use syscall_enter_from_user_mode() as it has to
> + * fetch EBP before invoking any of the syscall entry work
> + * functions.
> + */
> + syscall_enter_from_user_mode_prepare(regs);
I'm getting lost in all these "enter" functions...
On Tue, Sep 01 2020 at 17:09, Andy Lutomirski wrote:
> On Tue, Sep 1, 2020 at 4:50 PM Thomas Gleixner <[email protected]> wrote:
>> > I think that they almost work for x86, but not quite as
>> > indicated by this bug. Even if we imagine we can somehow hack around
>> > this bug, I imagine we're going to find other problems with this
>> > model, e.g. the potential upcoming exit problem I noted in my review.
>>
>> What's the upcoming problem?
>
> If we ever want to get single-stepping fully correct across syscalls,
> we might need to inject SIGTRAP on syscall return. This would be more
> awkward if we can't run instrumentable code after the syscall part of
> the syscall is done.
We run a lot of instrumentable code after sys_foo() returns. Otherwise
all the TIF work would not be possible at all.
But you might tell me where exactly you want to inject the SIGTRAP in
the syscall exit code flow.
>> I don't think we want that in general. The current variant is perfectly
>> fine for everything except the 32bit fast syscall nonsense. Also
>> irqentry_entry/exit is not equivalent to the syscall_enter/exit
>> counterparts.
>
> If there are any architectures in which actual work is needed to
> figure out whether something is a syscall in the first place, they'll
> want to do the usual kernel entry work before the syscall entry work.
That's low level entry code which does not require RCU, lockdep, tracing
or whatever muck we setup before actual work can be done.
arch_asm_entry()
...
arch_c_entry(cause) {
switch(cause) {
case EXCEPTION: arch_c_exception(...);
case SYSCALL: arch_c_syscall(...);
...
}
You really want to differentiate between exception and syscall
entry/exit.
The splitting of syscall_enter_from_user_mode() is only necessary for
that 32bit fast syscall thing on x86 and there is no point to open code
it with two calls for e.g. do_syscall_64().
> Maybe your patch actually makes this possible -- I haven't digested
> all the details yet.
>
> Who advised you to drop the arch parameter?
Kees, IIRC, but I would have to search through the gazillions of mail
threads to be sure.
>> + syscall_enter_from_user_mode_prepare(regs);
>
> I'm getting lost in all these "enter" functions...
It's not that hard.
syscall_enter_from_user_mode_prepare()
+ syscall_enter_from_user_mode_work()
= syscall_enter_from_user_mode()
That's exactly what you suggested just with the difference that it is
explicit for syscalls and not using irqentry_enter/exit().
If we would do that then instead of having a single call for sane
syscall pathes:
arch_c_entry()
nr = syscall_enter_from_user_mode();
or for that 32bit fast syscall nonsense the split variant:
arch_c_entry()
syscall_enter_from_user_mode_prepare();
do_fast_syscall_muck();
nr = syscall_enter_from_user_mode_work();
we'd have:
arch_c_entry()
irqentry_enter();
local_irq_enble();
nr = syscall_enter_from_user_mode_work();
...
which enforces two calls for sane entries and more code in arch/....
Thanks,
tglx
On Wed, Sep 2, 2020 at 1:29 AM Thomas Gleixner <[email protected]> wrote:
>
> On Tue, Sep 01 2020 at 17:09, Andy Lutomirski wrote:
> > On Tue, Sep 1, 2020 at 4:50 PM Thomas Gleixner <[email protected]> wrote:
> >> > I think that they almost work for x86, but not quite as
> >> > indicated by this bug. Even if we imagine we can somehow hack around
> >> > this bug, I imagine we're going to find other problems with this
> >> > model, e.g. the potential upcoming exit problem I noted in my review.
> >>
> >> What's the upcoming problem?
> >
> > If we ever want to get single-stepping fully correct across syscalls,
> > we might need to inject SIGTRAP on syscall return. This would be more
> > awkward if we can't run instrumentable code after the syscall part of
> > the syscall is done.
>
> We run a lot of instrumentable code after sys_foo() returns. Otherwise
> all the TIF work would not be possible at all.
>
> But you might tell me where exactly you want to inject the SIGTRAP in
> the syscall exit code flow.
It would be a bit complicated. Definitely after any signals from the
syscall are delivered. Right now, I think that we don't deliver a
SIGTRAP on the instruction boundary after SYSCALL while
single-stepping. (I think we used to, but only sometimes, and now we
are at least consistent.) This is because IRET will not trap if it
starts with TF clear and ends up setting it. (I asked Intel to
document this, and I think they finally did, although I haven't gotten
around to reading the new docs. Certainly the old docs as of a year
or two ago had no description whatsoever of how TF changes worked.)
Deciding exactly *when* a trap should occur would be nontrivial -- we
can't trap on sigreturn() from a SIGTRAP, for example.
So this isn't fully worked out.
>
> >> I don't think we want that in general. The current variant is perfectly
> >> fine for everything except the 32bit fast syscall nonsense. Also
> >> irqentry_entry/exit is not equivalent to the syscall_enter/exit
> >> counterparts.
> >
> > If there are any architectures in which actual work is needed to
> > figure out whether something is a syscall in the first place, they'll
> > want to do the usual kernel entry work before the syscall entry work.
>
> That's low level entry code which does not require RCU, lockdep, tracing
> or whatever muck we setup before actual work can be done.
>
> arch_asm_entry()
> ...
> arch_c_entry(cause) {
> switch(cause) {
> case EXCEPTION: arch_c_exception(...);
> case SYSCALL: arch_c_syscall(...);
> ...
> }
You're assuming that figuring out the cause doesn't need the kernel
entry code to run first. In the case of the 32-bit vDSO fast
syscalls, we arguably don't know whether an entry is a syscall until
we have done a user memory access. Logically, we're doing:
if (get_user() < 0) {
/* Not a syscall. This is actually a silly operation that sets AX =
-EFAULT and returns. Do not audit or invoke ptrace. */
} else {
/* This actually is a syscall. */
}
So we really do want to stick arch code between the
enter_from_user_mode() and the audit check. We *can't* audit because
we don't know the syscall args. Now maybe we could invent new
semantics for this in which a fault here is still somehow a syscall,
but I think that would be a real ABI change and would want very
careful thought. And it would be weird -- syscalls are supposed to
actually call the syscall handler, aren't they? (Arguably we should
go back in time and make this a SIGSEGV. We have the infrastructure
to do this cleanly, but when I wrote the code I just copied the ABI
from code that was before my time. Even so, it would be an exception,
not a syscall.)
>
> You really want to differentiate between exception and syscall
> entry/exit.
>
Why do we want to distinguish between exception and syscall
entry/exit? For the enter part, AFAICS the exception case boils down
to enter_from_user_mode() and the syscall case is:
enter_from_user_mode(regs);
instrumentation_begin();
local_irq_enable();
ti_work = READ_ONCE(current_thread_info()->flags);
if (ti_work & SYSCALL_ENTER_WORK)
syscall = syscall_trace_enter(regs, syscall, ti_work);
instrumentation_end();
Which would decompose quite nicely as a regular (non-syscall) entry
plus the syscall part later.
> The splitting of syscall_enter_from_user_mode() is only necessary for
> that 32bit fast syscall thing on x86 and there is no point to open code
> it with two calls for e.g. do_syscall_64().
>
> > Maybe your patch actually makes this possible -- I haven't digested
> > all the details yet.
> >
> > Who advised you to drop the arch parameter?
>
> Kees, IIRC, but I would have to search through the gazillions of mail
> threads to be sure.
>
> >> + syscall_enter_from_user_mode_prepare(regs);
> >
> > I'm getting lost in all these "enter" functions...
>
> It's not that hard.
>
> syscall_enter_from_user_mode_prepare()
> + syscall_enter_from_user_mode_work()
> = syscall_enter_from_user_mode()
>
> That's exactly what you suggested just with the difference that it is
> explicit for syscalls and not using irqentry_enter/exit().
>
> If we would do that then instead of having a single call for sane
> syscall pathes:
>
> arch_c_entry()
> nr = syscall_enter_from_user_mode();
>
> or for that 32bit fast syscall nonsense the split variant:
>
> arch_c_entry()
> syscall_enter_from_user_mode_prepare();
> do_fast_syscall_muck();
> nr = syscall_enter_from_user_mode_work();
>
> we'd have:
>
> arch_c_entry()
> irqentry_enter();
> local_irq_enble();
> nr = syscall_enter_from_user_mode_work();
> ...
>
> which enforces two calls for sane entries and more code in arch/....
This is why I still like my:
arch_c_entry()
irqentry_enter_from_user_mode();
generic_syscall();
exit...
}
Andy,
On Wed, Sep 02 2020 at 09:49, Andy Lutomirski wrote:
> On Wed, Sep 2, 2020 at 1:29 AM Thomas Gleixner <[email protected]> wrote:
>>
>> But you might tell me where exactly you want to inject the SIGTRAP in
>> the syscall exit code flow.
>
> It would be a bit complicated. Definitely after any signals from the
> syscall are delivered. Right now, I think that we don't deliver a
> SIGTRAP on the instruction boundary after SYSCALL while
> single-stepping. (I think we used to, but only sometimes, and now we
> are at least consistent.) This is because IRET will not trap if it
> starts with TF clear and ends up setting it. (I asked Intel to
> document this, and I think they finally did, although I haven't gotten
> around to reading the new docs. Certainly the old docs as of a year
> or two ago had no description whatsoever of how TF changes worked.)
>
> Deciding exactly *when* a trap should occur would be nontrivial -- we
> can't trap on sigreturn() from a SIGTRAP, for example.
>
> So this isn't fully worked out.
Oh well.
>> >> I don't think we want that in general. The current variant is perfectly
>> >> fine for everything except the 32bit fast syscall nonsense. Also
>> >> irqentry_entry/exit is not equivalent to the syscall_enter/exit
>> >> counterparts.
>> >
>> > If there are any architectures in which actual work is needed to
>> > figure out whether something is a syscall in the first place, they'll
>> > want to do the usual kernel entry work before the syscall entry work.
>>
>> That's low level entry code which does not require RCU, lockdep, tracing
>> or whatever muck we setup before actual work can be done.
>>
>> arch_asm_entry()
>> ...
>> arch_c_entry(cause) {
>> switch(cause) {
>> case EXCEPTION: arch_c_exception(...);
>> case SYSCALL: arch_c_syscall(...);
>> ...
>> }
>
> You're assuming that figuring out the cause doesn't need the kernel
> entry code to run first. In the case of the 32-bit vDSO fast
> syscalls, we arguably don't know whether an entry is a syscall until
> we have done a user memory access. Logically, we're doing:
>
> if (get_user() < 0) {
> /* Not a syscall. This is actually a silly operation that sets AX =
> -EFAULT and returns. Do not audit or invoke ptrace. */
> } else {
> /* This actually is a syscall. */
> }
Yes, that's what I've addressed with providing split interfaces.
>> You really want to differentiate between exception and syscall
>> entry/exit.
>>
>
> Why do we want to distinguish between exception and syscall
> entry/exit? For the enter part, AFAICS the exception case boils down
> to enter_from_user_mode() and the syscall case is:
>
> enter_from_user_mode(regs);
> instrumentation_begin();
>
> local_irq_enable();
> ti_work = READ_ONCE(current_thread_info()->flags);
> if (ti_work & SYSCALL_ENTER_WORK)
> syscall = syscall_trace_enter(regs, syscall, ti_work);
> instrumentation_end();
>
> Which would decompose quite nicely as a regular (non-syscall) entry
> plus the syscall part later.
There is a difference between syscall entry and exception entry at least
in my view:
syscall:
enter_from_user_mode(regs);
local_irq_enable();
exception:
enter_from_user_mode(regs);
>> we'd have:
>>
>> arch_c_entry()
>> irqentry_enter();
>> local_irq_enble();
>> nr = syscall_enter_from_user_mode_work();
>> ...
>>
>> which enforces two calls for sane entries and more code in arch/....
>
> This is why I still like my:
>
> arch_c_entry()
> irqentry_enter_from_user_mode();
> generic_syscall();
> exit...
So what we have now (with my patch applied) is either:
1) arch_c_entry()
nr = syscall_enter_from_user_mode();
arch_handle_syscall(nr);
syscall_exit_to_user_mode();
or for that extra 32bit fast syscall thing:
2) arch_c_entry()
syscall_enter_from_user_mode_prepare();
arch_do_stuff();
nr = syscall_enter_from_user_mode_work();
arch_handle_syscall(nr);
syscall_exit_to_user_mode();
So for sane cases you just use #1.
Ideally we'd not need arch_handle_syscall(nr) at all, but that does not
work with multiple ABIs supported, i.e. the compat muck.
The only way we could make that work is to have:
syscall_enter_exit(regs, mode)
nr = syscall_enter_from_user_mode();
arch_handle_syscall(mode, nr);
syscall_exit_to_user_mode();
and then arch_c_entry() becomes:
syscall_enter_exit(regs, mode);
which means that arch_handle_syscall() would have to evaluate the mode
and chose the appropriate syscall table. Not sure whether that's a win.
Thanks,
tglx
The following commit has been merged into the x86/urgent branch of tip:
Commit-ID: 4facb95b7adaf77e2da73aafb9ba60996fe42a12
Gitweb: https://git.kernel.org/tip/4facb95b7adaf77e2da73aafb9ba60996fe42a12
Author: Thomas Gleixner <[email protected]>
AuthorDate: Wed, 02 Sep 2020 01:50:54 +02:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Fri, 04 Sep 2020 15:50:14 +02:00
x86/entry: Unbreak 32bit fast syscall
Andy reported that the syscall treacing for 32bit fast syscall fails:
# ./tools/testing/selftests/x86/ptrace_syscall_32
...
[RUN] SYSEMU
[FAIL] Initial args are wrong (nr=224, args=10 11 12 13 14 4289172732)
...
[RUN] SYSCALL
[FAIL] Initial args are wrong (nr=29, args=0 0 0 0 0 4289172732)
The eason is that the conversion to generic entry code moved the retrieval
of the sixth argument (EBP) after the point where the syscall entry work
runs, i.e. ptrace, seccomp, audit...
Unbreak it by providing a split up version of syscall_enter_from_user_mode().
- syscall_enter_from_user_mode_prepare() establishes state and enables
interrupts
- syscall_enter_from_user_mode_work() runs the entry work
Replace the call to syscall_enter_from_user_mode() in the 32bit fast
syscall C-entry with the split functions and stick the EBP retrieval
between them.
Fixes: 27d6b4d14f5c ("x86/entry: Use generic syscall entry function")
Reported-by: Andy Lutomirski <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/entry/common.c | 29 +++++++++++++-------
include/linux/entry-common.h | 51 ++++++++++++++++++++++++++++-------
kernel/entry/common.c | 35 +++++++++++++++++++-----
3 files changed, 91 insertions(+), 24 deletions(-)
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 48512c7..2f84c7c 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -60,16 +60,10 @@ __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
#if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
static __always_inline unsigned int syscall_32_enter(struct pt_regs *regs)
{
- unsigned int nr = (unsigned int)regs->orig_ax;
-
if (IS_ENABLED(CONFIG_IA32_EMULATION))
current_thread_info()->status |= TS_COMPAT;
- /*
- * Subtlety here: if ptrace pokes something larger than 2^32-1 into
- * orig_ax, the unsigned int return value truncates it. This may
- * or may not be necessary, but it matches the old asm behavior.
- */
- return (unsigned int)syscall_enter_from_user_mode(regs, nr);
+
+ return (unsigned int)regs->orig_ax;
}
/*
@@ -91,15 +85,29 @@ __visible noinstr void do_int80_syscall_32(struct pt_regs *regs)
{
unsigned int nr = syscall_32_enter(regs);
+ /*
+ * Subtlety here: if ptrace pokes something larger than 2^32-1 into
+ * orig_ax, the unsigned int return value truncates it. This may
+ * or may not be necessary, but it matches the old asm behavior.
+ */
+ nr = (unsigned int)syscall_enter_from_user_mode(regs, nr);
+
do_syscall_32_irqs_on(regs, nr);
syscall_exit_to_user_mode(regs);
}
static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
{
- unsigned int nr = syscall_32_enter(regs);
+ unsigned int nr = syscall_32_enter(regs);
int res;
+ /*
+ * This cannot use syscall_enter_from_user_mode() as it has to
+ * fetch EBP before invoking any of the syscall entry work
+ * functions.
+ */
+ syscall_enter_from_user_mode_prepare(regs);
+
instrumentation_begin();
/* Fetch EBP from where the vDSO stashed it. */
if (IS_ENABLED(CONFIG_X86_64)) {
@@ -122,6 +130,9 @@ static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
return false;
}
+ /* The case truncates any ptrace induced syscall nr > 2^32 -1 */
+ nr = (unsigned int)syscall_enter_from_user_mode_work(regs, nr);
+
/* Now this is just like a normal syscall. */
do_syscall_32_irqs_on(regs, nr);
syscall_exit_to_user_mode(regs);
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index efebbff..159c747 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -110,15 +110,30 @@ static inline __must_check int arch_syscall_enter_tracehook(struct pt_regs *regs
#endif
/**
- * syscall_enter_from_user_mode - Check and handle work before invoking
- * a syscall
+ * syscall_enter_from_user_mode_prepare - Establish state and enable interrupts
* @regs: Pointer to currents pt_regs
- * @syscall: The syscall number
*
* Invoked from architecture specific syscall entry code with interrupts
* disabled. The calling code has to be non-instrumentable. When the
- * function returns all state is correct and the subsequent functions can be
- * instrumented.
+ * function returns all state is correct, interrupts are enabled and the
+ * subsequent functions can be instrumented.
+ *
+ * This handles lockdep, RCU (context tracking) and tracing state.
+ *
+ * This is invoked when there is extra architecture specific functionality
+ * to be done between establishing state and handling user mode entry work.
+ */
+void syscall_enter_from_user_mode_prepare(struct pt_regs *regs);
+
+/**
+ * syscall_enter_from_user_mode_work - Check and handle work before invoking
+ * a syscall
+ * @regs: Pointer to currents pt_regs
+ * @syscall: The syscall number
+ *
+ * Invoked from architecture specific syscall entry code with interrupts
+ * enabled after invoking syscall_enter_from_user_mode_prepare() and extra
+ * architecture specific work.
*
* Returns: The original or a modified syscall number
*
@@ -127,12 +142,30 @@ static inline __must_check int arch_syscall_enter_tracehook(struct pt_regs *regs
* syscall_set_return_value() first. If neither of those are called and -1
* is returned, then the syscall will fail with ENOSYS.
*
- * The following functionality is handled here:
+ * It handles the following work items:
*
- * 1) Establish state (lockdep, RCU (context tracking), tracing)
- * 2) TIF flag dependent invocations of arch_syscall_enter_tracehook(),
+ * 1) TIF flag dependent invocations of arch_syscall_enter_tracehook(),
* __secure_computing(), trace_sys_enter()
- * 3) Invocation of audit_syscall_entry()
+ * 2) Invocation of audit_syscall_entry()
+ */
+long syscall_enter_from_user_mode_work(struct pt_regs *regs, long syscall);
+
+/**
+ * syscall_enter_from_user_mode - Establish state and check and handle work
+ * before invoking a syscall
+ * @regs: Pointer to currents pt_regs
+ * @syscall: The syscall number
+ *
+ * Invoked from architecture specific syscall entry code with interrupts
+ * disabled. The calling code has to be non-instrumentable. When the
+ * function returns all state is correct, interrupts are enabled and the
+ * subsequent functions can be instrumented.
+ *
+ * This is combination of syscall_enter_from_user_mode_prepare() and
+ * syscall_enter_from_user_mode_work().
+ *
+ * Returns: The original or a modified syscall number. See
+ * syscall_enter_from_user_mode_work() for further explanation.
*/
long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall);
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index fcae019..1868359 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -69,22 +69,45 @@ static long syscall_trace_enter(struct pt_regs *regs, long syscall,
return ret ? : syscall_get_nr(current, regs);
}
-noinstr long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall)
+static __always_inline long
+__syscall_enter_from_user_work(struct pt_regs *regs, long syscall)
{
unsigned long ti_work;
- enter_from_user_mode(regs);
- instrumentation_begin();
-
- local_irq_enable();
ti_work = READ_ONCE(current_thread_info()->flags);
if (ti_work & SYSCALL_ENTER_WORK)
syscall = syscall_trace_enter(regs, syscall, ti_work);
- instrumentation_end();
return syscall;
}
+long syscall_enter_from_user_mode_work(struct pt_regs *regs, long syscall)
+{
+ return __syscall_enter_from_user_work(regs, syscall);
+}
+
+noinstr long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall)
+{
+ long ret;
+
+ enter_from_user_mode(regs);
+
+ instrumentation_begin();
+ local_irq_enable();
+ ret = __syscall_enter_from_user_work(regs, syscall);
+ instrumentation_end();
+
+ return ret;
+}
+
+noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs)
+{
+ enter_from_user_mode(regs);
+ instrumentation_begin();
+ local_irq_enable();
+ instrumentation_end();
+}
+
/**
* exit_to_user_mode - Fixup state when exiting to user mode
*