The 32-bit ptrace syscall on a 64-bit kernel (32-bit debugger on
32-bit task) behaves differently than a native 32-bit kernel. When
setting a register state of orig_eax>=0 and eax=-ERESTART* when the
debugged task is NOT on its way out of a 32-bit syscall, the task will
fail to do the syscall restart logic that it should do.
Test case available at http://sources.redhat.com/cgi-bin/cvsweb.cgi/~checkout~/tests/ptrace-tests/tests/erestartsys-trap.c?cvsroot=systemtap
This happens because the 32-bit ptrace syscall sets eax=0xffffffff
when it sets orig_eax>=0. The resuming task will not sign-extend this
for the -ERESTART* check because TS_COMPAT is not set. (So the task
thinks it is restarting after a 64-bit syscall, not a 32-bit one.)
The fix is to have 32-bit ptrace calls sign-extend eax when orig_eax>=0.
The long comment in the change explains the scenarios and caveats fully.
Reported-by: [email protected]
Signed-off-by: Roland McGrath <[email protected]>
Reviewed-by: Oleg Nesterov <[email protected]>
---
arch/x86/kernel/ptrace.c | 55 +++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 54 insertions(+), 1 deletions(-)
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 8d7d5c9..ecb7a49 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1120,7 +1120,6 @@ static int putreg32(struct task_struct *child, unsigned regno, u32 value)
R32(edi, di);
R32(esi, si);
R32(ebp, bp);
- R32(eax, ax);
R32(eip, ip);
R32(esp, sp);
@@ -1130,6 +1129,60 @@ static int putreg32(struct task_struct *child, unsigned regno, u32 value)
* causes (long)orig_ax < 0 tests to fire correctly.
*/
regs->orig_ax = (long) (s32) value;
+
+ /*
+ * Whenever setting orig_eax to indicate a system call in
+ * progress, make sure an eax value set by the debugger gets
+ * sign-extended so that any ax == -ERESTART* tests fire
+ * correctly.
+ *
+ * When those tests (in handle_signal) are done directly
+ * after an actual 32-bit syscall, then TS_COMPAT is set and
+ * so syscall_get_error() does sign-extension. However, the
+ * debugger sometimes saves that state and then restores it
+ * later with the intent of picking up the old thread state
+ * that can be about to do syscall restart.
+ *
+ * When it's a 32-bit debugger, that truncates ax to 32 bits.
+ * If the debugger restores thread state and resumes after a
+ * ptrace stop when the child was not doing a new syscall, it
+ * will not have TS_COMPAT set to make syscall_get_error()
+ * notice and do the sign-extension.
+ *
+ * We can't have syscall_get_error() always sign-extend,
+ * since that's wrong for 64-bit syscalls. We want it to
+ * check TS_COMPAT rather than TIF_IA32 to avoid a false
+ * positive in the oddball case of a 32-bit task doing a
+ * syscall from a 64-bit code segment. In the "restored
+ * thread state" case, it has no way to know whether the
+ * restored state refers to a 32-bit or 64-bit syscall.
+ *
+ * So we can't win 'em all. We assume that if you are using
+ * a 32-bit debugger, you don't really care about arcane
+ * interference with a child trying to use 64-bit syscalls.
+ * (Just use a 64-bit debugger on it instead!) What we do
+ * here makes a 32-bit debugger fiddling a 32-bit task
+ * consistent with what happens on a native 32-bit kernel.
+ *
+ * NOTE! Since we have no similar logic in putreg(), we
+ * just expect a 64-bit debugger to save/restore the full
+ * 64 bits. If a 64-bit debugger were to treat a 32-bit
+ * task differently and save/restore only 32 bits per
+ * register, it would have to grok orig_eax >= 0 and know
+ * to sign-extend its saved eax when setting it as 64 bits.
+ */
+ if (regs->orig_ax >= 0)
+ regs->ax = (long) (s32) regs->ax;
+ break;
+
+ case offsetof(struct user32, regs.eax):
+ /*
+ * As above, for either order of setting both ax and orig_ax.
+ */
+ if (regs->orig_ax >= 0)
+ regs->ax = (long) (s32) value;
+ else
+ regs->ax = value;
break;
case offsetof(struct user32, regs.eflags):
On Tue, 22 Sep 2009, Roland McGrath wrote:
>
> The 32-bit ptrace syscall on a 64-bit kernel (32-bit debugger on
> 32-bit task) behaves differently than a native 32-bit kernel. When
> setting a register state of orig_eax>=0 and eax=-ERESTART* when the
> debugged task is NOT on its way out of a 32-bit syscall, the task will
> fail to do the syscall restart logic that it should do.
Hmm. This really smells extremely hacky to me.
I see what you're doing, and I understand why, but it all makes me
shudder. I think we already do the wrong thing for 'orig_ax', and that we
should probably simply test just the low 32 bits of it (looks to be easily
done by just making the return type of 'syscall_get_nr()' be 'int') rather
than have that special "let's sign-extend orig_ax" code in ptrace.
I also get the feeling that the TS_COMPAT testing is just hacky to begin
with. I think it was broken to do things that way, and I have this gut
feel that we really should have hidden the "am I a 32-bit or 64-bit
syscall" in that orig_ax field instead (ie make the 64-bit system calls
set a bit or whatever). That's the field we've always used for system call
flagging, and TS_COMPAT was broken.
In this example, the problem for ptrace seems to be that TS_COMPAT ends up
being that "extra" bit of information that isn't visible in the register
set.
But that's a bigger separate cleanup. In the meantime, I really get the
feeling that your patch is nasty, and could be replaced with something
like the following instead.. It just sets the TS_COMPAT bit if we use a
32-bit interface to set the system call number to positive (ie "inside
system call").
THE BELOW IS TOTALLY UNTESTED! I haven't really thought it through. I just
reacted very negatively to your patch, and my gut feel is that the code is
fundamentally doing something wrong. The below may not work, and may be
seriously broken and miss the point, but I think it conceptually comes
closer to how things _should_ work.
Linus
---
arch/x86/include/asm/syscall.h | 2 +-
arch/x86/kernel/ptrace.c | 8 +++++++-
2 files changed, 8 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index d82f39b..f2a0631 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -16,7 +16,7 @@
#include <linux/sched.h>
#include <linux/err.h>
-static inline long syscall_get_nr(struct task_struct *task,
+static inline int syscall_get_nr(struct task_struct *task,
struct pt_regs *regs)
{
/*
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 8d7d5c9..90b67db 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1124,13 +1124,19 @@ static int putreg32(struct task_struct *child, unsigned regno, u32 value)
R32(eip, ip);
R32(esp, sp);
- case offsetof(struct user32, regs.orig_eax):
+ case offsetof(struct user32, regs.orig_eax): {
+ struct thread_info *thread = task_thread_info(child);
/*
* Sign-extend the value so that orig_eax = -1
* causes (long)orig_ax < 0 tests to fire correctly.
*/
regs->orig_ax = (long) (s32) value;
+ if ((s32) value >= 0)
+ thread->status |= TS_COMPAT;
+ else
+ thread->status &= ~TS_COMPAT;
break;
+ }
case offsetof(struct user32, regs.eflags):
return set_flags(child, value);
> Hmm. This really smells extremely hacky to me.
Agreed. Anything with a half-page comment about why it's two-thirds
right and only one-third wrong is not what we'd choose when we could
think of better options.
> I see what you're doing, and I understand why, but it all makes me
> shudder. I think we already do the wrong thing for 'orig_ax', and that we
> should probably simply test just the low 32 bits of it (looks to be easily
> done by just making the return type of 'syscall_get_nr()' be 'int') rather
> than have that special "let's sign-extend orig_ax" code in ptrace.
Yes, that seems mostly equivalent. By having orig_ax sign-extended in
ptrace unconditionally (i.e. even for 64-bit tasks), you've already said
that the semantics of orig_ax are that you only get 32 bits of meaningful
value in there. In fact, that's what you said explicitly when we discussed
that change. There is no reason for syscall_get_nr() not to return int.
I said "mostly equivalent". The only difference I can see is that
today 64-bit ptrace callers (and core dumps, etc.), can see the high
bits of orig_ax and probably expect to see 64 1 bits in the "no
syscall" case. So if we punt all the existing sign-extension and say
that orig_ax has only 32 meaningful bits internally in all uses, then
the ABI-preserving change is to make getreg() always sign-extend the
32-bit orig_ax value so -1 is -1LL in userland.
I wonder if you want to say the same thing about all machines generally,
that the syscall number can only have 32 meaningful bits. That seems
reasonable to me. Then it would be appropriate to change the type in
asm-generic/syscall.h; that file is never really used, but it sets the
function signatures that other arch's files are written to match.
> I also get the feeling that the TS_COMPAT testing is just hacky to begin
> with. I think it was broken to do things that way, and I have this gut
> feel that we really should have hidden the "am I a 32-bit or 64-bit
> syscall" in that orig_ax field instead (ie make the 64-bit system calls
> set a bit or whatever). That's the field we've always used for system call
> flagging, and TS_COMPAT was broken.
On the whole I agree. hpa and I talked about this very idea last
year. We never got around to doing anything about it, since the
previous hack at the time had fixed the known bug. Doing this
internally is probably pretty easy and is nice for all the other cases
with TS_COMPAT checks now. Unfortunately, it's a userland ABI issue
to change this such that any new special meaning of some orig_ax bits
can be seen by userland so as to help this latest case.
What hpa and I discussed in fact was more than just a 32/64 flag, but
a complete "which entry path" indicator. i.e. int80 vs sysenter vs
AMD 32-bit "syscall" (if we ever supported that) vs 64-bit "syscall"
vs non-syscall (exception/interrupt) entries. That would have a bit
or two of information even for 32-bit. (We figured that limiting an
actual syscall # to 24 bits or so would be fine.) But the utility of
distinguishing those paths (aside from 32 vs 64 and syscall vs not) is
purely theoretical, which is to say I don't think we even have any
contrived idea what you'd use it for.
> In this example, the problem for ptrace seems to be that TS_COMPAT ends up
> being that "extra" bit of information that isn't visible in the register
> set.
Agreed. All else being equal, I would prefer to have this bit appear in
the regset layout somewhere. Unfortunately I don't see how we can both
make userland implicitly win by saving and restoring it blindly, and be
ABI-compatible with existing userland that knows the orig_ax meaning today.
> But that's a bigger separate cleanup. In the meantime, I really get the
> feeling that your patch is nasty, and could be replaced with something
> like the following instead.. It just sets the TS_COMPAT bit if we use a
> 32-bit interface to set the system call number to positive (ie "inside
> system call").
I think I may have been tempted toward that but didn't really consider it.
The only real reason is that TS_* bits are "thread-synchronous" and I
thought there was no precedent for touching them on any task != current.
(Given I understand how ptrace is the special situation where it could be
safe.) But, in fact, ptrace can already fiddle TS_USEDFPU. So it's no
worse than that, anyway. So it's better than bad, it's good!
> THE BELOW IS TOTALLY UNTESTED! I haven't really thought it through. I just
> reacted very negatively to your patch, and my gut feel is that the code is
> fundamentally doing something wrong. The below may not work, and may be
> seriously broken and miss the point, but I think it conceptually comes
> closer to how things _should_ work.
I agree.
Thanks,
Roland
Here is the aforementioned other tack on this.
I said earlier that getreg() (i.e. 64-bit ptrace/core-dump fetches)
should sign-extend the low 32 bits of orig_ax up. But I've changed my
mind. It's true that today you can store 0xffffffff via either 64-bit
ptrace or 32-bit ptrace and then read back -1 via 64-bit ptrace. (This
wasn't always so, and so we can hope that no debugger really depends on
it.) What seems more important is that tracing and core dumps correctly
show the full orig_ax value incoming in %rax from userland, since
%rax=__NR_foo|(1UL<<32) behaves differently (i.e. -ENOSYS) than
%eax=_NR_foo in actual fact when user-mode does "syscall" with those
values. In a bogon case like that, you would like to have traces/dumps
tell you why the task is not making a proper syscall rather than lie
about what register bits it entered the kernel with.
Patches 1-3 change no ptrace-tests outcomes, i.e. don't regress on the
test cases that went with the original sign-extension changes. They
reintroduce e.g. the ability to blindly read and write back the whole
regset when at syscall-entry tracing with %rax=__NR_foo|(1UL<<32) and
have that fail with -ENOSYS as it would without tracing rather than
perturb the tracee to call sys_foo instead. (Not that this is useful.)
Patch 4 does Linus's fix for the outstanding bug. I've verified it works.
Thanks,
Roland
---
The following changes since commit 7fa07729e439a6184bd824746d06a49cca553f15:
Linus Torvalds (1):
Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/.../tip/linux-2.6-tip
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/frob/linux-2.6-roland.git x86/orig_ax
Roland McGrath (4):
asm-generic: syscall_get_nr returns int
x86: syscall_get_nr returns int
x86: ptrace: do not sign-extend orig_ax on write
x86: ptrace: set TS_COMPAT when 32-bit ptrace sets orig_eax>=0
arch/x86/include/asm/syscall.h | 14 +++++++-------
arch/x86/kernel/ptrace.c | 21 ++++++++-------------
include/asm-generic/syscall.h | 8 ++++++--
3 files changed, 21 insertions(+), 22 deletions(-)
Only 32 bits of system call number are meaningful, so make the
specification for syscall_get_nr() be to return int, not long.
Signed-off-by: Roland McGrath <[email protected]>
---
include/asm-generic/syscall.h | 8 ++++++--
1 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/include/asm-generic/syscall.h b/include/asm-generic/syscall.h
index ea8087b..5c122ae 100644
--- a/include/asm-generic/syscall.h
+++ b/include/asm-generic/syscall.h
@@ -1,7 +1,7 @@
/*
* Access to user system call parameters and results
*
- * Copyright (C) 2008 Red Hat, Inc. All rights reserved.
+ * Copyright (C) 2008-2009 Red Hat, Inc. All rights reserved.
*
* This copyrighted material is made available to anyone wishing to use,
* modify, copy, or redistribute it subject to the terms and conditions
@@ -32,9 +32,13 @@ struct pt_regs;
* If @task is not executing a system call, i.e. it's blocked
* inside the kernel for a fault or signal, returns -1.
*
+ * Note this returns int even on 64-bit machines. Only 32 bits of
+ * system call number can be meaningful. If the actual arch value
+ * is 64 bits, this truncates to 32 bits so 0xffffffff means -1.
+ *
* It's only valid to call this when @task is known to be blocked.
*/
-long syscall_get_nr(struct task_struct *task, struct pt_regs *regs);
+int syscall_get_nr(struct task_struct *task, struct pt_regs *regs);
/**
* syscall_rollback - roll back registers after an aborted system call
Make syscall_get_nr() return int, so we always sign-extend
the low 32 bits of orig_ax in checks.
Signed-off-by: Roland McGrath <[email protected]>
---
arch/x86/include/asm/syscall.h | 14 +++++++-------
1 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index d82f39b..8d33bc5 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -1,7 +1,7 @@
/*
* Access to user system call parameters and results
*
- * Copyright (C) 2008 Red Hat, Inc. All rights reserved.
+ * Copyright (C) 2008-2009 Red Hat, Inc. All rights reserved.
*
* This copyrighted material is made available to anyone wishing to use,
* modify, copy, or redistribute it subject to the terms and conditions
@@ -16,13 +16,13 @@
#include <linux/sched.h>
#include <linux/err.h>
-static inline long syscall_get_nr(struct task_struct *task,
- struct pt_regs *regs)
+/*
+ * Only the low 32 bits of orig_ax are meaningful, so we return int.
+ * This importantly ignores the high bits on 64-bit, so comparisons
+ * sign-extend the low 32 bits.
+ */
+static inline int syscall_get_nr(struct task_struct *task, struct pt_regs *regs)
{
- /*
- * We always sign-extend a -1 value being set here,
- * so this is always either -1L or a syscall number.
- */
return regs->orig_ax;
}
The high 32 bits of orig_ax will be ignored when it matters,
so don't fiddle them when setting it.
Signed-off-by: Roland McGrath <[email protected]>
---
arch/x86/kernel/ptrace.c | 19 +------------------
1 files changed, 1 insertions(+), 18 deletions(-)
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 8d7d5c9..52222fa 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -325,16 +325,6 @@ static int putreg(struct task_struct *child,
return set_flags(child, value);
#ifdef CONFIG_X86_64
- /*
- * Orig_ax is really just a flag with small positive and
- * negative values, so make sure to always sign-extend it
- * from 32 bits so that it works correctly regardless of
- * whether we come from a 32-bit environment or not.
- */
- case offsetof(struct user_regs_struct, orig_ax):
- value = (long) (s32) value;
- break;
-
case offsetof(struct user_regs_struct,fs_base):
if (value >= TASK_SIZE_OF(child))
return -EIO;
@@ -1121,17 +1111,10 @@ static int putreg32(struct task_struct *child, unsigned regno, u32 value)
R32(esi, si);
R32(ebp, bp);
R32(eax, ax);
+ R32(orig_eax, orig_ax);
R32(eip, ip);
R32(esp, sp);
- case offsetof(struct user32, regs.orig_eax):
- /*
- * Sign-extend the value so that orig_eax = -1
- * causes (long)orig_ax < 0 tests to fire correctly.
- */
- regs->orig_ax = (long) (s32) value;
- break;
-
case offsetof(struct user32, regs.eflags):
return set_flags(child, value);
The 32-bit ptrace syscall on a 64-bit kernel (32-bit debugger on
32-bit task) behaves differently than a native 32-bit kernel. When
setting a register state of orig_eax>=0 and eax=-ERESTART* when the
debugged task is NOT on its way out of a 32-bit syscall, the task will
fail to do the syscall restart logic that it should do.
Test case available at http://sources.redhat.com/cgi-bin/cvsweb.cgi/~checkout~/tests/ptrace-tests/tests/erestartsys-trap.c?cvsroot=systemtap
This happens because the 32-bit ptrace syscall sets eax=0xffffffff
when it sets orig_eax>=0. The resuming task will not sign-extend this
for the -ERESTART* check because TS_COMPAT is not set. (So the task
thinks it is restarting after a 64-bit syscall, not a 32-bit one.)
The fix is to have 32-bit ptrace calls set TS_COMPAT when setting
orig_eax>=0. This ensures that the 32-bit syscall restart logic
will apply when the child resumes.
Signed-off-by: Roland McGrath <[email protected]>
---
arch/x86/kernel/ptrace.c | 14 +++++++++++++-
1 files changed, 13 insertions(+), 1 deletions(-)
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 52222fa..7b058a2 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1111,10 +1111,22 @@ static int putreg32(struct task_struct *child, unsigned regno, u32 value)
R32(esi, si);
R32(ebp, bp);
R32(eax, ax);
- R32(orig_eax, orig_ax);
R32(eip, ip);
R32(esp, sp);
+ case offsetof(struct user32, regs.orig_eax):
+ /*
+ * A 32-bit debugger setting orig_eax means to restore
+ * the state of the task restarting a 32-bit syscall.
+ * Make sure we interpret the -ERESTART* codes correctly
+ * in case the task is not actually still sitting at the
+ * exit from a 32-bit syscall with TS_COMPAT still set.
+ */
+ regs->orig_ax = value;
+ if (syscall_get_nr(child, regs) >= 0)
+ task_thread_info(child)->status |= TS_COMPAT;
+ break;
+
case offsetof(struct user32, regs.eflags):
return set_flags(child, value);
* Roland McGrath <[email protected]> wrote:
> Here is the aforementioned other tack on this.
>
> I said earlier that getreg() (i.e. 64-bit ptrace/core-dump fetches)
> should sign-extend the low 32 bits of orig_ax up. But I've changed my
> mind. It's true that today you can store 0xffffffff via either 64-bit
> ptrace or 32-bit ptrace and then read back -1 via 64-bit ptrace. (This
> wasn't always so, and so we can hope that no debugger really depends on
> it.) What seems more important is that tracing and core dumps correctly
> show the full orig_ax value incoming in %rax from userland, since
> %rax=__NR_foo|(1UL<<32) behaves differently (i.e. -ENOSYS) than
> %eax=_NR_foo in actual fact when user-mode does "syscall" with those
> values. In a bogon case like that, you would like to have traces/dumps
> tell you why the task is not making a proper syscall rather than lie
> about what register bits it entered the kernel with.
>
> Patches 1-3 change no ptrace-tests outcomes, i.e. don't regress on the
> test cases that went with the original sign-extension changes. They
> reintroduce e.g. the ability to blindly read and write back the whole
> regset when at syscall-entry tracing with %rax=__NR_foo|(1UL<<32) and
> have that fail with -ENOSYS as it would without tracing rather than
> perturb the tracee to call sys_foo instead. (Not that this is useful.)
>
> Patch 4 does Linus's fix for the outstanding bug. I've verified it works.
>
>
> Thanks,
> Roland
> ---
> The following changes since commit 7fa07729e439a6184bd824746d06a49cca553f15:
> Linus Torvalds (1):
> Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/.../tip/linux-2.6-tip
>
> are available in the git repository at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/frob/linux-2.6-roland.git x86/orig_ax
>
> Roland McGrath (4):
> asm-generic: syscall_get_nr returns int
> x86: syscall_get_nr returns int
> x86: ptrace: do not sign-extend orig_ax on write
> x86: ptrace: set TS_COMPAT when 32-bit ptrace sets orig_eax>=0
>
> arch/x86/include/asm/syscall.h | 14 +++++++-------
> arch/x86/kernel/ptrace.c | 21 ++++++++-------------
> include/asm-generic/syscall.h | 8 ++++++--
> 3 files changed, 21 insertions(+), 22 deletions(-)
Linus, this series looks good to me. Do you want to pull this directly
or should we test this for a few days in the x86 tree? (either solution
is fine to me)
Ingo
On Wed, 23 Sep 2009, Ingo Molnar wrote:
>
> Linus, this series looks good to me. Do you want to pull this directly
> or should we test this for a few days in the x86 tree? (either solution
> is fine to me)
I already pulled it, since I'm used to pulling ptrace fixes from Roland,
Linus
* Linus Torvalds <[email protected]> wrote:
> On Wed, 23 Sep 2009, Ingo Molnar wrote:
> >
> > Linus, this series looks good to me. Do you want to pull this
> > directly or should we test this for a few days in the x86 tree?
> > (either solution is fine to me)
>
> I already pulled it, since I'm used to pulling ptrace fixes from
> Roland,
Ok, thanks!
Ingo