Hello,
get_nr_restart_syscall() is still buggy, TS_I386_REGS_POKED can't
really help and should probably die.
The fix just adds the __USER32_CS check, but perhaps we can avoid
these "fundamentally broken" checks altogether?
Is __NR_ia32_restart_syscall/__NR_restart_syscall the part of ABI?
OK, we probaly can't remove them, at least right now. But what if
we simply add the new syscall number,
#define __NR_new_restart_syscall 383
#define __NR_ia32_new_restart_syscall 383
so that it doesn't depends on bitness and we can just do
static inline unsigned long get_nr_restart_syscall(const struct pt_regs *regs)
{
BUILD_BUG_ON(__NR_ia32_new_restart_syscall != __NR_new_restart_syscall);
#ifdef CONFIG_X86_X32_ABI
return __NR_new_restart_syscall | (regs->orig_ax & __X32_SYSCALL_BIT);
#else
return __NR_new_restart_syscall;
#endif
}
?
Oleg.
get_nr_restart_syscall() checks TS_I386_REGS_POKED but this bit is only
set if debugger is 32-bit. If a 64-bit debugger restores the registers
of a 32-bit debugee outside of syscall exit path get_nr_restart_syscall()
wrongly returns __NR_restart_syscall.
Test-case:
$ cvs -d :pserver:anoncvs:[email protected]:/cvs/systemtap co ptrace-tests
$ gcc -o erestartsys-trap-debuggee ptrace-tests/tests/erestartsys-trap-debuggee.c --m32
$ gcc -o erestartsys-trap-debugger ptrace-tests/tests/erestartsys-trap-debugger.c -lutil
$ ./erestartsys-trap-debugger
Unexpected: retval 1, errno 22
erestartsys-trap-debugger: ptrace-tests/tests/erestartsys-trap-debugger.c:421
As Jan explains this is what "(gdb) call func()" actually does:
* Tracee calls sleep(2).
* Debugger interrupts the tracee by CTRL-C after 1 sec.
* Save regs by PTRACE_GETREGS.
* Use PTRACE_SETREGS changing %rip to some 'func' and setting %orig_rax=-1
* PTRACE_CONT
* func() uses int3.
* Debugger catches SIGTRAP.
* Restore original regs by PTRACE_SETREGS.
* PTRACE_CONT
Change get_nr_restart_syscall() to take __USER32_CS into account, to me
this looks a bit better than TIF_IA32 check but either way this logic
can't be always right as the comment explains.
Alternatively we could change putreg() to set TS_I386_REGS_POKED just like
putreg32() does if "child" is 32-bit, but this won't fix all the problems
too and I think it would be beter to kill TS_I386_REGS_POKED after this
change.
Reported-by: Jan Kratochvil <[email protected]>
Signed-off-by: Oleg Nesterov <[email protected]>
Cc: [email protected]
---
arch/x86/kernel/signal.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 763af1d..1b05448 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -785,7 +785,8 @@ static inline unsigned long get_nr_restart_syscall(const struct pt_regs *regs)
* than the tracee.
*/
#ifdef CONFIG_IA32_EMULATION
- if (current->thread.status & (TS_COMPAT|TS_I386_REGS_POKED))
+ if ((current->thread.status & (TS_COMPAT|TS_I386_REGS_POKED)) ||
+ regs->cs == __USER32_CS)
return __NR_ia32_restart_syscall;
#endif
#ifdef CONFIG_X86_X32_ABI
--
2.5.0
On Tue, Mar 28, 2017 at 7:54 AM, Oleg Nesterov <[email protected]> wrote:
> get_nr_restart_syscall() checks TS_I386_REGS_POKED but this bit is only
> set if debugger is 32-bit. If a 64-bit debugger restores the registers
> of a 32-bit debugee outside of syscall exit path get_nr_restart_syscall()
> wrongly returns __NR_restart_syscall.
I had sent a patch that introduced a new syscall nr, but it's not
quite safe because it could break seccomp-using programs. But your
patch here is also screwy.
How about we store the syscall arch to be restored in task_struct
along with restart_block? It's not perfect, but it should be 99% of
the way there without heuristics as nasty as yours.
--Andy
P.S. __USER32_CS is the wrong check even if we used your approach.
user_64bit_regs() is much better.
On 03/28, Andy Lutomirski wrote:
>
> On Tue, Mar 28, 2017 at 7:54 AM, Oleg Nesterov <[email protected]> wrote:
> > get_nr_restart_syscall() checks TS_I386_REGS_POKED but this bit is only
> > set if debugger is 32-bit. If a 64-bit debugger restores the registers
> > of a 32-bit debugee outside of syscall exit path get_nr_restart_syscall()
> > wrongly returns __NR_restart_syscall.
>
> I had sent a patch that introduced a new syscall nr, but it's not
> quite safe because it could break seccomp-using programs.
Ah, indeed...
> But your
> patch here is also screwy.
Yes, yes, it doesn't try to solve all possible problems, I even mentioned
this in the changelog.
> How about we store the syscall arch to be restored in task_struct
> along with restart_block?
Yes, perhaps we will have to finally do this. Not really nice too.
> the way there without heuristics as nasty as yours.
I agree it will be better, but I refuse to treat them as mine checks ;)
> P.S. __USER32_CS is the wrong check even if we used your approach.
> user_64bit_regs() is much better.
Yes, thanks. If only I understood what cs == pv_info.extra_user_64bit_cs
actually means...
OK, please ignore this patch, I'll try to make another fix.
Oleg.
On Tue, Mar 28, 2017 at 9:27 AM, Oleg Nesterov <[email protected]> wrote:
> On 03/28, Andy Lutomirski wrote:
>>
>> On Tue, Mar 28, 2017 at 7:54 AM, Oleg Nesterov <[email protected]> wrote:
>> > get_nr_restart_syscall() checks TS_I386_REGS_POKED but this bit is only
>> > set if debugger is 32-bit. If a 64-bit debugger restores the registers
>> > of a 32-bit debugee outside of syscall exit path get_nr_restart_syscall()
>> > wrongly returns __NR_restart_syscall.
>>
>> I had sent a patch that introduced a new syscall nr, but it's not
>> quite safe because it could break seccomp-using programs.
>
> Ah, indeed...
This is, in theory, solvable. It would be ugly and would pollute seccomp a bit.
>
>> But your
>> patch here is also screwy.
>
> Yes, yes, it doesn't try to solve all possible problems, I even mentioned
> this in the changelog.
>
>> How about we store the syscall arch to be restored in task_struct
>> along with restart_block?
>
> Yes, perhaps we will have to finally do this. Not really nice too.
>
>> the way there without heuristics as nasty as yours.
>
> I agree it will be better, but I refuse to treat them as mine checks ;)
:)
>
>> P.S. __USER32_CS is the wrong check even if we used your approach.
>> user_64bit_regs() is much better.
>
> Yes, thanks. If only I understood what cs == pv_info.extra_user_64bit_cs
> actually means...
>
It means that, if Linux is a Xen PV guest, the GDT contains a bunch of
entries supplied by Xen and outside of Linux's control, and one of
those entries is a 64-bit DPL=3 code segment. On the one hand, it's
annoying. On the other hand, it serves a real purpose
performance-wise.
--Andy
On 03/28, Oleg Nesterov wrote:
>
> On 03/28, Andy Lutomirski wrote:
> >
> > How about we store the syscall arch to be restored in task_struct
> > along with restart_block?
>
> Yes, perhaps we will have to finally do this. Not really nice too.
OK, how about the hack below?
I do not want to a new member into task_struct/restart_block, so the
patch below adds a sticky TS_COMPAT bit which logically is a member
of "struct restart_block".
TS_I386_REGS_POKED must die, I think. But this needs another discussion.
Oleg.
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index b83c61c..a94bb5e 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -249,6 +249,17 @@ __visible inline void syscall_return_slowpath(struct pt_regs *regs)
local_irq_enable();
/*
+ * Do this before debugger can change the regs.
+ */
+ if (IS_ENABLED(CONFIG_IA32_EMULATION) &&
+ unlikely(regs->ax == -ERESTART_RESTARTBLOCK)) {
+ if (current->thread.status & TS_COMPAT)
+ current->thread.status |= TS_COMPAT_XXX;
+ else
+ current->thread.status &= ~TS_COMPAT_XXX;
+ }
+
+ /*
* First do one-time work. If these work items are enabled, we
* want to run them exactly once per syscall exit with IRQs on.
*/
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 1be64da..87179ab 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -477,6 +477,7 @@ struct thread_struct {
* have to worry about atomic accesses.
*/
#define TS_COMPAT 0x0002 /* 32bit syscall active (64BIT)*/
+#define TS_COMPAT_XXX 0x0008
/*
* Set IOPL bits in EFLAGS from given mask
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 763af1d..b3b98ff 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -785,7 +785,7 @@ static inline unsigned long get_nr_restart_syscall(const struct pt_regs *regs)
* than the tracee.
*/
#ifdef CONFIG_IA32_EMULATION
- if (current->thread.status & (TS_COMPAT|TS_I386_REGS_POKED))
+ if (current->thread.status & TS_COMPAT_XXX)
return __NR_ia32_restart_syscall;
#endif
#ifdef CONFIG_X86_X32_ABI
I must have missed something, but I simply can't undestand it.
static inline long syscall_get_error(struct task_struct *task,
struct pt_regs *regs)
{
unsigned long error = regs->ax;
#ifdef CONFIG_IA32_EMULATION
/*
* TS_COMPAT is set for 32-bit syscall entries and then
* remains set until we return to user mode.
*/
if (task->thread.status & (TS_COMPAT|TS_I386_REGS_POKED))
/*
* Sign-extend the value so (int)-EFOO becomes (long)-EFOO
* and will match correctly in comparisons.
*/
error = (long) (int) error;
#endif
return IS_ERR_VALUE(error) ? error : 0;
}
Firstly, why do we need the IS_ERR_VALUE() check? This is only used by
do_signal/handle_signal, we do not care if it returns non-zero as long
as the value can't be confused with -ERESTART.* codes.
And why do we need the TS_ checks? IIUC only because a 32-bit debugger
can change regs->ax, and we also assume that if this happens outside of
syscall-exit path (so that TS_COMPAT is not set) then it should also
change regs->orig_ax and this implies TS_I386_REGS_POKED.
Oherwise it is not needed, even the 32-bit syscalls return long, not int.
So why we can't simply change putreg32() to always sign-extend regs->ax
regs->orig_ax and just do
static inline long syscall_get_error(struct task_struct *task,
struct pt_regs *regs)
{
return regs-ax;
}
? Or, better, simply kill it and use syscall_get_return_value() in
arch/x86/kernel/signal.c.
Of course, if the tracee is 64-bit and debugger is 32-bit then the
unconditional sign-extend can be wrong, but do we really care about
this case? This can't really work anyway. And the current code is not
right too. Say, debugger nacks the signal which interrupted syscall
and sets regs->ax = -ERESTARTSYS to force the restart, this won't work
because TS_COMPAT|TS_I386_REGS_POKED are not set.
Oleg.
On Wed, Mar 29, 2017 at 9:33 AM, Oleg Nesterov <[email protected]> wrote:
>
> Firstly, why do we need the IS_ERR_VALUE() check? This is only used by
> do_signal/handle_signal, we do not care if it returns non-zero as long
> as the value can't be confused with -ERESTART.* codes.
There are system calls that can return "negative" values that aren't errors.
Notably mmap() can return a valid pointer with the high bit set.
So syscall_get_error() should return 0 for not just positive return
values, but for those kinds of negative non-error values.
> And why do we need the TS_ checks?
Those may be bogus.
> So why we can't simply change putreg32() to always sign-extend regs->ax
> regs->orig_ax and just do
>
> static inline long syscall_get_error(struct task_struct *task,
> struct pt_regs *regs)
> {
> return regs-ax;
> }
That would be *complete* garbage. Lots of system calls return positive
values that sure as hell aren't errors.
Linus
On 03/29, Linus Torvalds wrote:
>
> On Wed, Mar 29, 2017 at 9:33 AM, Oleg Nesterov <[email protected]> wrote:
> >
> > Firstly, why do we need the IS_ERR_VALUE() check? This is only used by
> > do_signal/handle_signal, we do not care if it returns non-zero as long
> > as the value can't be confused with -ERESTART.* codes.
>
> There are system calls that can return "negative" values that aren't errors.
>
> Notably mmap() can return a valid pointer with the high bit set.
>
> So syscall_get_error() should return 0 for not just positive return
> values, but for those kinds of negative non-error values.
Once again, it is only used in arch/x86/kernel/signal.c by do_signal() and
handle_signal(). We do not care if mmap() returns a valid pointer with the
high bit set, regs-ax can't be confused with -ERESTART code.
> > And why do we need the TS_ checks?
>
> Those may be bogus.
>
> > So why we can't simply change putreg32() to always sign-extend regs->ax
> > regs->orig_ax and just do
> >
> > static inline long syscall_get_error(struct task_struct *task,
> > struct pt_regs *regs)
> > {
> > return regs-ax;
> > }
>
> That would be *complete* garbage. Lots of system calls return positive
> values that sure as hell aren't errors.
See above. And please note that I actually suggest to kill this helper and
just use syscall_get_return_value() in arch/x86/kernel/signal.c.
Oleg.
On Wed, Mar 29, 2017 at 9:45 AM, Linus Torvalds
<[email protected]> wrote:
> On Wed, Mar 29, 2017 at 9:33 AM, Oleg Nesterov <[email protected]> wrote:
>>
>> Firstly, why do we need the IS_ERR_VALUE() check? This is only used by
>> do_signal/handle_signal, we do not care if it returns non-zero as long
>> as the value can't be confused with -ERESTART.* codes.
>
> There are system calls that can return "negative" values that aren't errors.
>
> Notably mmap() can return a valid pointer with the high bit set.
>
> So syscall_get_error() should return 0 for not just positive return
> values, but for those kinds of negative non-error values.
>
>> And why do we need the TS_ checks?
>
> Those may be bogus.
>
>> So why we can't simply change putreg32() to always sign-extend regs->ax
>> regs->orig_ax and just do
>>
>> static inline long syscall_get_error(struct task_struct *task,
>> struct pt_regs *regs)
>> {
>> return regs-ax;
>> }
>
> That would be *complete* garbage. Lots of system calls return positive
> values that sure as hell aren't errors.
Does this cause an observable problem? The only things that care are:
a) 32-bit debugger pokes some value with the high bit and a 64-bit
debugger reads it back. I seriously doubt we care.
b) 32-bit debugger pokes some value with the high bit set and the user
code switches to 64-bit mode and reads RAX. This case is so
terminally broken anyway that we definitely don't care.
c) 32-bit debugger pokes some value with the high bit set and
syscall_get_error happens. Oleg's proposed change won't change what
we do, but it will dramatically simplify the code.
On Wed, Mar 29, 2017 at 9:55 AM, Oleg Nesterov <[email protected]> wrote:
>
> Once again, it is only used in arch/x86/kernel/signal.c by do_signal() and
> handle_signal(). We do not care if mmap() returns a valid pointer with the
> high bit set, regs-ax can't be confused with -ERESTART code.
Immaterial. If the function is called "get_error()", it sure as hell
shouldn't return a random non-error value.
Code should make sense, otherwise it's not going to be maintainable.
Naming matters. If the code doesn't match the name of the function,
that's a bug regardless of whether it has semantic effects or not in
the end - because somebody will eventually depend on the _expected_
semantics.
Linus
On Wed, Mar 29, 2017 at 8:05 AM, Oleg Nesterov <[email protected]> wrote:
> On 03/28, Oleg Nesterov wrote:
>>
>> On 03/28, Andy Lutomirski wrote:
>> >
>> > How about we store the syscall arch to be restored in task_struct
>> > along with restart_block?
>>
>> Yes, perhaps we will have to finally do this. Not really nice too.
>
> OK, how about the hack below?
>
> I do not want to a new member into task_struct/restart_block, so the
> patch below adds a sticky TS_COMPAT bit which logically is a member
> of "struct restart_block".
Okay, but I'd much rather we just added a helper that's called in the
few places that actually write to restart_block.
Or we just add the new syscall nr and see what breaks. The answer
could well be nothing at all.
--Andy
On 03/29, Linus Torvalds wrote:
>
> On Wed, Mar 29, 2017 at 9:55 AM, Oleg Nesterov <[email protected]> wrote:
> >
> > Once again, it is only used in arch/x86/kernel/signal.c by do_signal() and
> > handle_signal(). We do not care if mmap() returns a valid pointer with the
> > high bit set, regs-ax can't be confused with -ERESTART code.
>
> Immaterial. If the function is called "get_error()", it sure as hell
> shouldn't return a random non-error value.
Oh, I agree, and let me repeat the 3rd time that I suggest to kill this
helper and use syscall_get_return_value() in arch/x86/kernel/signal.c,
it has no other callers.
Oleg.
On Wed, Mar 29, 2017 at 10:04 AM, Oleg Nesterov <[email protected]> wrote:
>
> Oh, I agree, and let me repeat the 3rd time that I suggest to kill this
> helper and use syscall_get_return_value() in arch/x86/kernel/signal.c,
> it has no other callers.
That is probably fine, I'm just arguing against the suggested changes
to syscall_get_error().
That said, I'm not sure why you want to change this in the first
place? I think the current syscall_get_error() - with explicit compat
handling and all - is fine.
But if the aim is to just remove syscall_get_error() entirely because
it's so unused, then I'm ok with that.
Linus
On 03/29, Linus Torvalds wrote:
>
> That said, I'm not sure why you want to change this in the first
> place? I think the current syscall_get_error() - with explicit compat
> handling and all - is fine.
To simplify this logic. To kill TS_I386_REGS_POKED (which doesn't really
work and can't) and to remove the subtle dependency on TS_COMPAT in ret-
with-signal paths.
Again, afaics we only need these compat checks because regs->ax could be
changed by 32-bit debugger without sign-extension. And TS_I386_REGS_POKED
means that if TS_COMPAT is not set, then the debugger should have also
changed regs->orig_ax. This mostly works, but imo too fragile.
Currently we have the same check in get_nr_restart_syscall() but it is
even more broken (see the patch/changelog), so it should go away and in
this case it would be nice to avoid these checks in do_signal() path too.
Oleg.
On Wed, Mar 29, 2017 at 11:50 AM, Oleg Nesterov <[email protected]> wrote:
>
> Again, afaics we only need these compat checks because regs->ax could be
> changed by 32-bit debugger without sign-extension.
You don't explain how you were planning on *fixing* that code. You
know why it exists, but then you just say "let's remove it", without
any explanation of what you'd replace it with.
If your suggestion is just that "let's remove it, breaking the known
reason it's there", I really really don't see the upside.
It may be hacky, but it *works*. You seem to be advocating replacing
it with something simpler - "cleaner, but broken".
I really don't see the point of "cleaner, but broken".
The fact is, reality is not "clean". But reality trumps :I wish" and
"make-believe" every single time.
Linus
On 03/29, Linus Torvalds wrote:
>
> On Wed, Mar 29, 2017 at 11:50 AM, Oleg Nesterov <[email protected]> wrote:
> >
> > Again, afaics we only need these compat checks because regs->ax could be
> > changed by 32-bit debugger without sign-extension.
>
> You don't explain how you were planning on *fixing* that code. You
> know why it exists, but then you just say "let's remove it", without
> any explanation of what you'd replace it with.
Hmm. I tried to explain... Let me quote my initial email,
So why we can't simply change putreg32() to always sign-extend regs->ax
regs->orig_ax and just do
static inline long syscall_get_error(struct task_struct *task,
struct pt_regs *regs)
{
return regs-ax;
}
? Or, better, simply kill it and use syscall_get_return_value() in
arch/x86/kernel/signal.c.
Of course, if the tracee is 64-bit and debugger is 32-bit then the
unconditional sign-extend can be wrong, but do we really care about
this case? This can't really work anyway. And the current code is not
right too. Say, debugger nacks the signal which interrupted syscall
and sets regs->ax = -ERESTARTSYS to force the restart, this won't work
because TS_COMPAT|TS_I386_REGS_POKED are not set.
In short. can the patch below work?
Oleg.
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 9cc7d5a..96f21fc 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -917,11 +917,14 @@ static int putreg32(struct task_struct *child, unsigned regno, u32 value)
R32(edi, di);
R32(esi, si);
R32(ebp, bp);
- R32(eax, ax);
R32(eip, ip);
R32(esp, sp);
case offsetof(struct user32, regs.orig_eax):
+ regs->ax = (long) (int) value;
+ break;
+
+ case offsetof(struct user32, regs.orig_eax):
/*
* Warning: bizarre corner case fixup here. A 32-bit
* debugger setting orig_eax to -1 wants to disable
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index b3b98ff..41023f8 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -710,7 +710,7 @@ handle_signal(struct ksignal *ksig, struct pt_regs *regs)
/* Are we from a system call? */
if (syscall_get_nr(current, regs) >= 0) {
/* If so, check system call restarting.. */
- switch (syscall_get_error(current, regs)) {
+ switch (syscall_get_return_value(current, regs)) {
case -ERESTART_RESTARTBLOCK:
case -ERESTARTNOHAND:
regs->ax = -EINTR;
@@ -813,7 +813,7 @@ void do_signal(struct pt_regs *regs)
/* Did we come from a system call? */
if (syscall_get_nr(current, regs) >= 0) {
/* Restart the system call - no handlers present */
- switch (syscall_get_error(current, regs)) {
+ switch (syscall_get_return_value(current, regs)) {
case -ERESTARTNOHAND:
case -ERESTARTSYS:
case -ERESTARTNOINTR:
On 03/29, Andy Lutomirski wrote:
>
> On Wed, Mar 29, 2017 at 8:05 AM, Oleg Nesterov <[email protected]> wrote:
> > On 03/28, Oleg Nesterov wrote:
> >>
> >> On 03/28, Andy Lutomirski wrote:
> >> >
> >> > How about we store the syscall arch to be restored in task_struct
> >> > along with restart_block?
> >>
> >> Yes, perhaps we will have to finally do this. Not really nice too.
> >
> > OK, how about the hack below?
> >
> > I do not want to a new member into task_struct/restart_block, so the
> > patch below adds a sticky TS_COMPAT bit which logically is a member
> > of "struct restart_block".
>
> Okay, but I'd much rather we just added a helper that's called in the
> few places that actually write to restart_block.
Oh, yes, I thought about this too. This obviously needs more changes, and
every arch needs a dummy definition... I was thinking about
static inline long setup_restart_block(void)
{
if (TS_COMPAT)
set TS_COMPAT_XXX;
else
clear TS_COMPAT_XXX;
return -ERESTART_RESTARTBLOCK;
}
so that we can do
- ret = -ERESTART_RESTARTBLOCK;
+ ret = setup_restart_block();
but I don't really like this... Do you strongly prefer it over the
-ERESTART_RESTARTBLOCK check in syscall_return_slowpath? I agree it doesn't
look nice too but it connects to other TS_ magic we do in arch/x86/entry/,
perhaps it is not that bad...
> Or we just add the new syscall nr and see what breaks. The answer
> could well be nothing at all.
Well, strace knows about __NR_restart_syscall. It won't be really broken,
but I guess it will report something like "unknown syscall" rather than
restart_syscall(...).
However, this still looks like a best solution to me, just I have no idea
how much we can confuse user-space.
Oleg.
On 03/30, Oleg Nesterov wrote:
>
> On 03/29, Linus Torvalds wrote:
> >
> > On Wed, Mar 29, 2017 at 11:50 AM, Oleg Nesterov <[email protected]> wrote:
> > >
> > > Again, afaics we only need these compat checks because regs->ax could be
> > > changed by 32-bit debugger without sign-extension.
> >
> > You don't explain how you were planning on *fixing* that code. You
> > know why it exists, but then you just say "let's remove it", without
> > any explanation of what you'd replace it with.
>
> Hmm. I tried to explain... Let me quote my initial email,
>
> So why we can't simply change putreg32() to always sign-extend regs->ax
> regs->orig_ax and just do
>
> static inline long syscall_get_error(struct task_struct *task,
> struct pt_regs *regs)
> {
> return regs-ax;
> }
>
> ? Or, better, simply kill it and use syscall_get_return_value() in
> arch/x86/kernel/signal.c.
>
> Of course, if the tracee is 64-bit and debugger is 32-bit then the
> unconditional sign-extend can be wrong, but do we really care about
> this case? This can't really work anyway. And the current code is not
> right too. Say, debugger nacks the signal which interrupted syscall
> and sets regs->ax = -ERESTARTSYS to force the restart, this won't work
> because TS_COMPAT|TS_I386_REGS_POKED are not set.
>
> In short. can the patch below work?
>
> Oleg.
>
> diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
> index 9cc7d5a..96f21fc 100644
> --- a/arch/x86/kernel/ptrace.c
> +++ b/arch/x86/kernel/ptrace.c
> @@ -917,11 +917,14 @@ static int putreg32(struct task_struct *child, unsigned regno, u32 value)
> R32(edi, di);
> R32(esi, si);
> R32(ebp, bp);
> - R32(eax, ax);
> R32(eip, ip);
> R32(esp, sp);
>
> case offsetof(struct user32, regs.orig_eax):
^^^^^^^^^^^^^^
damn, just in case, of course this is typo, should be
case offsetof(struct user32, regs.eax);
> + regs->ax = (long) (int) value;
> + break;
> +
> + case offsetof(struct user32, regs.orig_eax):
> /*
> * Warning: bizarre corner case fixup here. A 32-bit
> * debugger setting orig_eax to -1 wants to disable
> diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
> index b3b98ff..41023f8 100644
> --- a/arch/x86/kernel/signal.c
> +++ b/arch/x86/kernel/signal.c
> @@ -710,7 +710,7 @@ handle_signal(struct ksignal *ksig, struct pt_regs *regs)
> /* Are we from a system call? */
> if (syscall_get_nr(current, regs) >= 0) {
> /* If so, check system call restarting.. */
> - switch (syscall_get_error(current, regs)) {
> + switch (syscall_get_return_value(current, regs)) {
> case -ERESTART_RESTARTBLOCK:
> case -ERESTARTNOHAND:
> regs->ax = -EINTR;
> @@ -813,7 +813,7 @@ void do_signal(struct pt_regs *regs)
> /* Did we come from a system call? */
> if (syscall_get_nr(current, regs) >= 0) {
> /* Restart the system call - no handlers present */
> - switch (syscall_get_error(current, regs)) {
> + switch (syscall_get_return_value(current, regs)) {
> case -ERESTARTNOHAND:
> case -ERESTARTSYS:
> case -ERESTARTNOINTR:
On Thu, Mar 30, 2017 at 8:49 AM, Oleg Nesterov <[email protected]> wrote:
>>
>> case offsetof(struct user32, regs.orig_eax):
> ^^^^^^^^^^^^^^
> damn, just in case, of course this is typo, should be
>
> case offsetof(struct user32, regs.eax);
So honestly, the first version of your patch I went "Ok, that looks
perfectly reasonable", because orig_eax is special, and always
sign-extending it sounds fine.
The *fixed* version of your patch I go "no, that's obviously complete
garbage, that's completely bogus".
Because the regular eax register is *not* special, and sign-extending
it sounds very dangerous indeed. The user will actually *use* that
register as a register, unlike orig_eax. And eax is no different from
all the other random registers. Yes, it happens to be the return value
for system calls, but this codepath is not at all limited to system
calls, it's absolutely *any* context when you use gdb on a 32-bit
binary.
Now, you may well say that "the upper bits aren't well-defined in that
case anyway". And you'd be mostly right. Except the semantics of
x86-64 is that 32-bit operations zero-extend the upper bits, so
zero-extension really *is* the right thing to do.
For example, let's assume that %eax contains a 32-bit pointer with the
high bit set, and we're using a 32-bit debugger on a 32-bit program
(ie you're just running a 32-bit distro on a 64-bit kernel, which
people have definitely done).
We *really* shouldn't sign-extend that value if the debugger ends up
updating the pointer (or maybe the debugger just reloads previous
values, not really "updating" anything - I think that's what gdb does
when you do a call within the context of the debugged program from
within gdb, for example)
So I really *really* don't think you can just sign-extend %eax. Which
is exactly why we have that nasty odd sign-extension in the signal
path instead, but then have to make it conditional on running a 32-bit
program.
But maybe there is still something I'm not understanding in your
argument. This thread has been a series of mis-understandings.
Linus
On Thu, Mar 30, 2017 at 10:46 AM, Linus Torvalds
<[email protected]> wrote:
> For example, let's assume that %eax contains a 32-bit pointer with the
> high bit set, and we're using a 32-bit debugger on a 32-bit program
> (ie you're just running a 32-bit distro on a 64-bit kernel, which
> people have definitely done).
>
> We *really* shouldn't sign-extend that value if the debugger ends up
> updating the pointer (or maybe the debugger just reloads previous
> values, not really "updating" anything - I think that's what gdb does
> when you do a call within the context of the debugged program from
> within gdb, for example)
Can you think of a case where this would actually matter?
>
> So I really *really* don't think you can just sign-extend %eax. Which
> is exactly why we have that nasty odd sign-extension in the signal
> path instead, but then have to make it conditional on running a 32-bit
> program.
>
> But maybe there is still something I'm not understanding in your
> argument. This thread has been a series of mis-understandings.
As the daft kernel hacker who introduced TS_I386_REGS_POKED in the
first place, I'll try to explain what I think is going on.
TS_I386_REGS_POKED is an enormous kludge, and it serves two purposes.
It avoids a potential security bug that the old code had, and it at
least documents the code paths that are thoroughly broken. (Before
they were TS_COMPAT instead, but most of the TS_COMPAT users are
fine.)
It's used in two places:
--- issue 1 ---
get_nr_restart_syscall() does:
if (current->thread.status & (TS_COMPAT|TS_I386_REGS_POKED))
return __NR_ia32_restart_syscall;
This is very, very buggy. Fixing this appears to require somewhat
some surgery. Proposals include adding new restart_syscall numbers
that match across 32-bit and 64-bit (interacts quite awkwardly with
seccomp) or trying to store syscall bitness along with restart_block
(ick, not actually 100% reliable depending on just how abusing the
debugger is).
--- issue 2 ---
syscall_get_error(). This is available on all arches, but it appears
to be used *only* on x86. It's used to figure out whether we're
restarting a syscall. It could plausibly matter if we have a buggy
compat syscall that returns int instead of long, but the main purpose
is for compatibility with 32-bit debuggers.
Neither Oleg nor I have thought of anything other than this code path
that cares at all about the high bits of RAX on a process that's being
poked using 32-bit ptrace. Sign-extending RAX seems like it would get
rid of this code path entirely to me.
--Andy
On Thu, Mar 30, 2017 at 11:23 AM, Andy Lutomirski <[email protected]> wrote:
> On Thu, Mar 30, 2017 at 10:46 AM, Linus Torvalds
> <[email protected]> wrote:
>>
>> We *really* shouldn't sign-extend that value if the debugger ends up
>> updating the pointer (or maybe the debugger just reloads previous
>> values, not really "updating" anything - I think that's what gdb does
>> when you do a call within the context of the debugged program from
>> within gdb, for example)
>
> Can you think of a case where this would actually matter?
I'd actually be willing to have people try, but then you'd have to
sign-extend *all* registers. No way in hell will I accept a patch that
randomly sign-extends just %eax.
And then actually run such a kernel on a 32-bit distro, and verifying
that things like gdb and strace really work. But it needs real
testing, not some kind of handwaving. It's a *big* change.
> --- issue 1 ---
>
> get_nr_restart_syscall() does:
>
> if (current->thread.status & (TS_COMPAT|TS_I386_REGS_POKED))
> return __NR_ia32_restart_syscall;
>
> This is very, very buggy
Quite frankly, a bug somewhere else is not an excuse for then making
other code buggier. I don't see what "issue 1" has to do with anything
what-so-ever,.
> --- issue 2 ---
>
> syscall_get_error(). This is available on all arches, but it appears
> to be used *only* on x86.
So I think that one is a red herring. We can trivially fix that by
just (a) removing everybody elses syscall_get_error(), and just
inlining the x86 case into the x86 signal handling. Boom, gone.
And in the signal handling path, the sign-extension and test of %eax
is reivially ok. Not because it's ok in general, but because we've
verified using %orig_eax that we're in a system call return path. We
could happily delete the whole TS_I386_REGS_POKED thing, I think.
But then it's very much about the fact that within the particular case
of the signal handling code and system call return detection, the
whole sign extension checking makes sense.
So don't even _try_ to equate this code with the general sign
extension code by ptrace. That is a totally different animal
altogether.
Linus
On Thu, Mar 30, 2017 at 8:28 AM, Oleg Nesterov <[email protected]> wrote:
> On 03/29, Andy Lutomirski wrote:
>>
>> On Wed, Mar 29, 2017 at 8:05 AM, Oleg Nesterov <[email protected]> wrote:
>> > On 03/28, Oleg Nesterov wrote:
>> >>
>> >> On 03/28, Andy Lutomirski wrote:
>> >> >
>> >> > How about we store the syscall arch to be restored in task_struct
>> >> > along with restart_block?
>> >>
>> >> Yes, perhaps we will have to finally do this. Not really nice too.
>> >
>> > OK, how about the hack below?
>> >
>> > I do not want to a new member into task_struct/restart_block, so the
>> > patch below adds a sticky TS_COMPAT bit which logically is a member
>> > of "struct restart_block".
>>
>> Okay, but I'd much rather we just added a helper that's called in the
>> few places that actually write to restart_block.
>
> Oh, yes, I thought about this too. This obviously needs more changes, and
> every arch needs a dummy definition... I was thinking about
>
> static inline long setup_restart_block(void)
> {
> if (TS_COMPAT)
> set TS_COMPAT_XXX;
> else
> clear TS_COMPAT_XXX;
>
> return -ERESTART_RESTARTBLOCK;
> }
>
> so that we can do
>
> - ret = -ERESTART_RESTARTBLOCK;
> + ret = setup_restart_block();
>
> but I don't really like this... Do you strongly prefer it over the
> -ERESTART_RESTARTBLOCK check in syscall_return_slowpath? I agree it doesn't
> look nice too but it connects to other TS_ magic we do in arch/x86/entry/,
> perhaps it is not that bad...
How about:
struct restart_block *restart = set_syscall_restart_fn(do_whatever);
restart->other_stuff = blah.
I'd rather avoid adding stuff to the slow path that runs *that* rarely.
>
>> Or we just add the new syscall nr and see what breaks. The answer
>> could well be nothing at all.
>
> Well, strace knows about __NR_restart_syscall. It won't be really broken,
> but I guess it will report something like "unknown syscall" rather than
> restart_syscall(...).
>
> However, this still looks like a best solution to me, just I have no idea
> how much we can confuse user-space.
Me neither.
On Thu, Mar 30, 2017 at 11:35 AM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Mar 30, 2017 at 11:23 AM, Andy Lutomirski <[email protected]> wrote:
>> On Thu, Mar 30, 2017 at 10:46 AM, Linus Torvalds
>> <[email protected]> wrote:
>>>
>>> We *really* shouldn't sign-extend that value if the debugger ends up
>>> updating the pointer (or maybe the debugger just reloads previous
>>> values, not really "updating" anything - I think that's what gdb does
>>> when you do a call within the context of the debugged program from
>>> within gdb, for example)
>>
>> Can you think of a case where this would actually matter?
>
> I'd actually be willing to have people try, but then you'd have to
> sign-extend *all* registers. No way in hell will I accept a patch that
> randomly sign-extends just %eax.
>
> And then actually run such a kernel on a 32-bit distro, and verifying
> that things like gdb and strace really work. But it needs real
> testing, not some kind of handwaving. It's a *big* change.
I'll offer the following handwave: if there are problems, I'd expect
to see them in mixed-bitness uses, not 32-bit distros. But the 32-bit
case is worth testing, too.
>
>> --- issue 1 ---
>>
>> get_nr_restart_syscall() does:
>>
>> if (current->thread.status & (TS_COMPAT|TS_I386_REGS_POKED))
>> return __NR_ia32_restart_syscall;
>>
>> This is very, very buggy
>
> Quite frankly, a bug somewhere else is not an excuse for then making
> other code buggier. I don't see what "issue 1" has to do with anything
> what-so-ever,.
>
>> --- issue 2 ---
>>
>> syscall_get_error(). This is available on all arches, but it appears
>> to be used *only* on x86.
>
> So I think that one is a red herring. We can trivially fix that by
> just (a) removing everybody elses syscall_get_error(), and just
> inlining the x86 case into the x86 signal handling. Boom, gone.
>
> And in the signal handling path, the sign-extension and test of %eax
> is reivially ok. Not because it's ok in general, but because we've
> verified using %orig_eax that we're in a system call return path. We
> could happily delete the whole TS_I386_REGS_POKED thing, I think.
Then how do we know whether to sign extend? TS_COMPAT does *not* work
because it isn't set on signal delivery.
>
> So don't even _try_ to equate this code with the general sign
> extension code by ptrace. That is a totally different animal
> altogether.
Of course it's different. But doing sign extension in general would
avoid the need for this special case, and it seems that the special
case isn't actually correct in cases that people try to use. Oleg,
can you clarify what's broken?
--Andy
On Thu, Mar 30, 2017 at 11:59 AM, Andy Lutomirski <[email protected]> wrote:
>>
>> And then actually run such a kernel on a 32-bit distro, and verifying
>> that things like gdb and strace really work. But it needs real
>> testing, not some kind of handwaving. It's a *big* change.
>
> I'll offer the following handwave: if there are problems, I'd expect
> to see them in mixed-bitness uses, not 32-bit distros. But the 32-bit
> case is worth testing, too.
I wouldn't worry too much about the mixed case, simply because you
clearly cannot use a 32-bit gdb on a 64-bit process.
So the mixed case already needs to use a 64-bit gdb, which presumably
would never use the 32-bit ptrace paths in the first place, so this
code never triggers.
Of course, the mroe testing the better, but the thing I'd really want
to check is that there isn't some 32-bit distro that might have a
library that is optimized and notices when it's running on a 64-bit
capable CPU and uses REX prefixes to use special optimized versions.
That would probably already break with a 32-bit gdb, but I can at
least in theory imagine code that simply depends on zero extension.
>> And in the signal handling path, the sign-extension and test of %eax
>> is reivially ok. Not because it's ok in general, but because we've
>> verified using %orig_eax that we're in a system call return path. We
>> could happily delete the whole TS_I386_REGS_POKED thing, I think.
>
> Then how do we know whether to sign extend? TS_COMPAT does *not* work
> because it isn't set on signal delivery.
Ok, so we'd still need that nasty TS_I386_REGS_POKED. I still think
that's "ok" within the particular confines of just signal delivery.
Of course, I don't actually know why we don't set that flag
unconditionally when we poke things using that 32-bit ptrace
interface, but that's just more of the "yeah, this code is crazy
hacky"
Cleaning things up is good.
I just don't want people to dismiss that the current code seems to
work well enough in practice. That's *probably* because nobody
actually uses it, but I do know that people used to run 64-bit kernels
with 32-bit user land exactly because they needed the kernel for large
memory machines (PAE really doesn't work for shit) but were carrying
32-bit userland along for whatever lazy legacy reasons.
So it has gotten *some* use. It's probably (hopefully) fading.
.. there's Wine and dosemu that do crazy things too. Things like "we
only set TS_I386_REGS_POKED when modifying orig_eax" might be
something that those kinds of programs have actually noticed and are
actively working around etc. Because while *normal* programs hopefully
don't do insane things on purpose, both wine and dosemu definitely
have been known to very much intentionally do some truly crazy stuff.
Linus
On Thu, Mar 30, 2017 at 12:11 PM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Mar 30, 2017 at 11:59 AM, Andy Lutomirski <[email protected]> wrote:
>>>
>>> And then actually run such a kernel on a 32-bit distro, and verifying
>>> that things like gdb and strace really work. But it needs real
>>> testing, not some kind of handwaving. It's a *big* change.
>>
>> I'll offer the following handwave: if there are problems, I'd expect
>> to see them in mixed-bitness uses, not 32-bit distros. But the 32-bit
>> case is worth testing, too.
>
> I wouldn't worry too much about the mixed case, simply because you
> clearly cannot use a 32-bit gdb on a 64-bit process.
>
> So the mixed case already needs to use a 64-bit gdb, which presumably
> would never use the 32-bit ptrace paths in the first place, so this
> code never triggers.
>
Hah. Hah hah. IIRC 64-bit gdb *does* use the 32-bit paths, or at
least it uses some path that can't see the high regs. I don't fully
recall, but this is the case that seems more likely to break to me.
It's a great big mess.
> Of course, the mroe testing the better, but the thing I'd really want
> to check is that there isn't some 32-bit distro that might have a
> library that is optimized and notices when it's running on a 64-bit
> capable CPU and uses REX prefixes to use special optimized versions.
Huh? Aren't those REX prefixes interpreted as INC instructions or
similar in compat mode? You can't just run 64-bit instructions in a
compat code segment. You *can* use LAR to find a 64-bit code segment
and long-jump to it (and I've written code to do exactly that, and
it's even snuck it's way into linux.git, muahaha), but code like this
is terminally screwed under 32-bit gdb.
On Thu, Mar 30, 2017 at 12:21 PM, Andy Lutomirski <[email protected]> wrote:
>
> Huh? Aren't those REX prefixes interpreted as INC instructions or
> similar in compat mode?
Hmm. I think you're right. So it's not like x32 that runs in long mode
but then would use the 64-bit ptrace interface anyway.
Your statement that 64-bit gdb sometimes uses 32-bit compat ptrace
makes me shudder. Why?
Don't even tell me. I suspect I'm happier not knowing.
Linus