2003-01-06 14:44:35

by Luca Barbieri

[permalink] [raw]
Subject: [PATCH] Set TIF_IRET in more places

This patch adds code to set TIF_IRET in sigsuspend and rt_sigsuspend
(since they change registers to invoke signal handlers) and ptrace
setregs. This prevents clobbering of %ecx and %edx.


diff --exclude-from=/home/ldb/src/exclude -urNdp --exclude='speedtouch.*' --exclude='atmsar.*' linux-2.5.54/arch/i386/kernel/ptrace.c linux-2.5.54-ldb/arch/i386/kernel/ptrace.c
--- linux-2.5.54/arch/i386/kernel/ptrace.c 2003-01-02 04:21:29.000000000 +0100
+++ linux-2.5.54-ldb/arch/i386/kernel/ptrace.c 2003-01-04 19:06:07.000000000 +0100
@@ -74,6 +74,8 @@ static inline int put_stack_long(struct
static int putreg(struct task_struct *child,
unsigned long regno, unsigned long value)
{
+ set_tsk_thread_flag(child, TIF_IRET);
+
switch (regno >> 2) {
case FS:
if (value && (value & 3) != 3)
diff --exclude-from=/home/ldb/src/exclude -urNdp --exclude='speedtouch.*' --exclude='atmsar.*' linux-2.5.54/arch/i386/kernel/signal.c linux-2.5.54-ldb/arch/i386/kernel/signal.c
--- linux-2.5.54/arch/i386/kernel/signal.c 2003-01-02 04:21:53.000000000 +0100
+++ linux-2.5.54-ldb/arch/i386/kernel/signal.c 2003-01-04 19:06:07.000000000 +0100
@@ -44,6 +44,7 @@ sys_sigsuspend(int history0, int history
spin_unlock_irq(&current->sig->siglock);

regs->eax = -EINTR;
+ set_thread_flag(TIF_IRET);
while (1) {
current->state = TASK_INTERRUPTIBLE;
schedule();
@@ -73,6 +74,7 @@ sys_rt_sigsuspend(sigset_t *unewset, siz
spin_unlock_irq(&current->sig->siglock);

regs->eax = -EINTR;
+ set_thread_flag(TIF_IRET);
while (1) {
current->state = TASK_INTERRUPTIBLE;
schedule();


Attachments:
(No filename) (1.55 kB)
(No filename) (189.00 B)
Download all attachments

2003-01-06 16:01:48

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Set TIF_IRET in more places


On Mon, 6 Jan 2003, Luca Barbieri wrote:
>
> This patch adds code to set TIF_IRET in sigsuspend and rt_sigsuspend
> (since they change registers to invoke signal handlers) and ptrace
> setregs. This prevents clobbering of %ecx and %edx.

Hmm.. I explicitly thought about signals for sysenter, and I don't think
any of these are needed. In particular, here's my logic:

- ptrace only acts on a child that is stopped in the signal handling
path, so if signal handling i scorrect, then so is ptrace. So no
special case handling needed in ptrace.c

- [rt_]sigsuspend() has two exit cases: (a) the signal handler we
invoced, and (b) the point that signal handler will eventually return
to (ie the system call return point)

In the eventual return case, we don't care about ecx/edx, since
they will have been saved on the stack by the trampoline anyway.

In the signal handler we invoce, we also don't care about ecx/edx,
since they aren't part of the calling convention, and will in fact
have random values anyway (do_signal() will re-initialize the FP
state, but leaves the integer regs untouched.

- note that for normal asynchronous signals, it _is_ important that we
return with all registers saved, but right now that is handled by the
fact that the signal trampoline we build in do_signal() will always use
"int 0x80" for the sys_sigreturn() call, and will thus use "iret" when
restoring the registers. The synchronous "[rt_]sigsuspend()" really is
a special case in that respect.

So I _think_ these are all unnecessary, but I may have missed something.
So patch not applied, but you can certainly convince me otherwise. And
even if I'm right, maybe it needs a comment?

Linus

2003-01-06 18:16:10

by Luca Barbieri

[permalink] [raw]
Subject: Re: [PATCH] Set TIF_IRET in more places

I've looked again at it and it is actually less problematic that I
first thought but I still see the following two cases:

1. vfork seems to not set any TIF_ flags so a ptracer setting regs
while a vforking task is stopped in ptrace_notify called from vfork
would result in clobbered %ecx and %edx.

2. A ptracer could use %ecx or %edx to pass information to signal
handlers and this would not work with the current [rt_]sigsuspend.

These only need setting TIF_IRET on ptrace setregs though.

There is also the very small advantage of being able to hardcode
SYSENTER_RETURN as the return eip for sysexit if TIF_IRET is set in
all the 3 places.


Attachments:
(No filename) (645.00 B)
(No filename) (189.00 B)
Download all attachments

2003-01-06 18:33:34

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [PATCH] Set TIF_IRET in more places

Luca Barbieri wrote:

> 1. vfork seems to not set any TIF_ flags so a ptracer setting regs
> while a vforking task is stopped in ptrace_notify called from vfork
> would result in clobbered %ecx and %edx.


Just to be clear: the vfork syscall does not use sysenter. It's not
possible with the current implementation. The magic sysenter wrapping
code would have to recognize the syscall and handle it special by saving
the return address. From glibc's POV the %edx register would be free
but I consider this interface far to fragile.

--
--------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------

2003-01-06 18:42:00

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Set TIF_IRET in more places


On Mon, 6 Jan 2003, Luca Barbieri wrote:
>
> 1. vfork seems to not set any TIF_ flags so a ptracer setting regs
> while a vforking task is stopped in ptrace_notify called from vfork
> would result in clobbered %ecx and %edx.

vfork and clone do not work at all with sysenter due to user stack issues.
We should document that (it's already de-facto documented in glibc, but it
should be explicitly documented somewhere).

Btw, the vfork/clone problems aren't sysenter-specific per se: they are
really a generic problem with any non-inlined system call mechanism. In
particular, vfork() really cannot afford to have a stack frame.

> 2. A ptracer could use %ecx or %edx to pass information to signal
> handlers and this would not work with the current [rt_]sigsuspend.

Yes. Although I don't see why it should matter, really.

Linus

2003-01-06 20:25:29

by Luca Barbieri

[permalink] [raw]
Subject: Re: [PATCH] Set TIF_IRET in more places

> vfork and clone do not work at all with sysenter due to user stack issues.
I actually meant CLONE_VFORK, that if used without CLONE_VM can go
through sysenter (if CLONE_VFORK is not used TIF_NEED_RESCHED is set
so there is no problem).

I also think that vfork() could be used with AT_SYSINFO by switching
stacks around the call (with care about recursive vfork and signals
calling vfork inside the vfork stack).


Attachments:
(No filename) (415.00 B)
(No filename) (189.00 B)
Download all attachments

2003-01-06 21:25:40

by Zack Weinberg

[permalink] [raw]
Subject: Re: [PATCH] Set TIF_IRET in more places


Linus Torvalds wrote:

> - note that for normal asynchronous signals, it _is_ important that
> we return with all registers saved, but right now that is handled
> by the fact that the signal trampoline we build in do_signal()
> will always use "int 0x80" for the sys_sigreturn() call, and will
> thus use "iret" when restoring the registers. The synchronous
> "[rt_]sigsuspend()" really is a special case in that respect.

Consider SA_RESTORER - there isn't a guarantee that user space will
use the same code as the kernel's trampoline. glibc happens to, but
only because GDB has a hardwired idea of what a signal trampoline
looks like. Of course, you could simply document that sigreturn() is
another of the system calls that must be made through int 0x80.

It occurs to me that the kernel-provided signal trampoline could go in
the page at 0xffff0000, instead of on the user stack, which would
eliminate the need for glibc to set SA_RESTORER (it's a pure
optimization).

Tangentially, I've seen people claim that the trampoline ought to be
able to avoid entering the kernel, although I'm not convinced (how
does the signal mask get reset, otherwise?)

zw

2003-01-07 11:07:19

by Jamie Lokier

[permalink] [raw]
Subject: Re: [PATCH] Set TIF_IRET in more places

Zack Weinberg wrote:
> Consider SA_RESTORER - there isn't a guarantee that user space will
> use the same code as the kernel's trampoline. glibc happens to, but
> only because GDB has a hardwired idea of what a signal trampoline
> looks like. Of course, you could simply document that sigreturn() is
> another of the system calls that must be made through int 0x80.

Glibc must use the same code as the kernel's trampoline because of
MD_FALLBACK_FRAME_STATE_FOR() in GCC's exception handling... (or
libgcc.so must change).

It explicitly checks for the opcode sequences 0x58b877000000cd80 and
0xb8ad000000cd80 in order to unwind exception frames around a
handled signal. Ugly, isn't it?

> It occurs to me that the kernel-provided signal trampoline could go in
> the page at 0xffff0000, instead of on the user stack, which would
> eliminate the need for glibc to set SA_RESTORER (it's a pure
> optimization).

Yup.

> Tangentially, I've seen people claim that the trampoline ought to be
> able to avoid entering the kernel, although I'm not convinced (how
> does the signal mask get reset, otherwise?)

Welcome to a wonderful if rather unsightly optimisation:

1. libc installs its own handler function for all non-SIG_DFL signals,
and sigaction() mostly updates a table in userspace.

2. The libc signal handler redirects all signals to the application
through a funky trampoline in libc.

3. A signal mask is maintained in userspace. Also, a pending mask
is maintained in userspace.

4. When a signal is delivered, libc's handler function checks the
userspace signal mask. If the signal should be blocked, and it
is possible to block it, it is marked as pending in _userspace_
pending mask, and the userspace signal mask is propagated to the
kernel to prevent further signals queuing up. Any siginfo_t is
also saved for tha signa. Then libc's handler returns, without
calling the application handler (because that is deferred).

5. When a signal is unblocked from the userspace signal mask, if it
is in the userspace pending mask, it is synthetically delivered
by userspace, which creates a context _as if_ the kernel had
delivered the signal.

By this mechanism, calls to unblock signals from the signal mask can
be done without entering the kernel, because the unblocking can be
done lazily.

Voila! sigreturn() can be written to avoid entering the kernel. Note
that this is possible _now_, with no changes to the kernel. It only
requires changes to libc. I think it would work on all architectures,
not just i386. (It may also be possible to do it without libc help,
in the vsyscall page).

-- Jamie

ps. A similar optimisation allows "spin_lock_irqsave" and
"spin_unlock_irqrestore" to avoid using the cli & sti instructions.
Spin locks already modify the preempt_count, so use a bit of that to
hold the synthetic interrupt-disabled flag, at zero cost... :)

2003-01-07 17:39:09

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Set TIF_IRET in more places


On Tue, 7 Jan 2003, Jamie Lokier wrote:
>
> Voila! sigreturn() can be written to avoid entering the kernel. Note
> that this is possible _now_, with no changes to the kernel. It only
> requires changes to libc. I think it would work on all architectures,
> not just i386. (It may also be possible to do it without libc help,
> in the vsyscall page).

I'd certainly prefer playing these kinds of games _without_ any libc
involvement, since making libc play games like this is fragile as hell and
works really badly for threaded applications, for example (delivering
signals to threads is _hard_ to get right, that's one of the things 2.5.x
finally does thanks to Ingo and Uli).

But moving the signal trampoline to vsyscall space to prepare for changes
like this is a good idea. I'll remove the SA_RESTORER logic too.

Linus

2003-01-07 19:18:58

by Zack Weinberg

[permalink] [raw]
Subject: Re: [PATCH] Set TIF_IRET in more places

Jamie Lokier <[email protected]> writes:

> Zack Weinberg wrote:
>> Consider SA_RESTORER - there isn't a guarantee that user space will
>> use the same code as the kernel's trampoline. glibc happens to, but
>> only because GDB has a hardwired idea of what a signal trampoline
>> looks like. Of course, you could simply document that sigreturn() is
>> another of the system calls that must be made through int 0x80.
>
> Glibc must use the same code as the kernel's trampoline because of
> MD_FALLBACK_FRAME_STATE_FOR() in GCC's exception handling... (or
> libgcc.so must change).
>
> It explicitly checks for the opcode sequences 0x58b877000000cd80 and
> 0xb8ad000000cd80 in order to unwind exception frames around a
> handled signal. Ugly, isn't it?

We're open to better ideas ...

>> Tangentially, I've seen people claim that the trampoline ought to be
>> able to avoid entering the kernel, although I'm not convinced (how
>> does the signal mask get reset, otherwise?)
>
> Welcome to a wonderful if rather unsightly optimisation:
[...]

I would want to be very sure that this was actually a performance win
before implementing it, and since it requires data tables in user
space I don't see how it could possibly be done in the vsyscall page.

zw

2003-01-08 01:12:53

by Richard Henderson

[permalink] [raw]
Subject: Re: [PATCH] Set TIF_IRET in more places

On Tue, Jan 07, 2003 at 11:27:32AM -0800, Zack Weinberg wrote:
> > It explicitly checks for the opcode sequences 0x58b877000000cd80 and
> > 0xb8ad000000cd80 in order to unwind exception frames around a
> > handled signal. Ugly, isn't it?
>
> We're open to better ideas ...

Something like having dwarf2 unwind information for the
vsyscall page on the page as well?


r~

2003-01-08 02:30:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Set TIF_IRET in more places


On Tue, 7 Jan 2003, Richard Henderson wrote:
> > We're open to better ideas ...
>
> Something like having dwarf2 unwind information for the
> vsyscall page on the page as well?

What would the unwind info look like? The current BK kernel will put the
signal return into the vsyscall page, so gdb could pick up the info from
there. But I have no idea what the unwind info looks like, or how to tell
gdb about it.

Linus

2003-01-08 03:24:10

by Daniel Jacobowitz

[permalink] [raw]
Subject: Re: [PATCH] Set TIF_IRET in more places

On Tue, Jan 07, 2003 at 06:33:48PM -0800, Linus Torvalds wrote:
>
> On Tue, 7 Jan 2003, Richard Henderson wrote:
> > > We're open to better ideas ...
> >
> > Something like having dwarf2 unwind information for the
> > vsyscall page on the page as well?
>
> What would the unwind info look like? The current BK kernel will put the
> signal return into the vsyscall page, so gdb could pick up the info from
> there. But I have no idea what the unwind info looks like, or how to tell
> gdb about it.

Right now you can't, since GDB won't use the unwind info anyway; but
Richard is probably talking about MD_FALLBACK_FRAME_STATE_FOR, which is
used for exception handling unwinding instead of debugging unwinding.

Someday soon I hope to have GDB properly using this info, too.

--
Daniel Jacobowitz
MontaVista Software Debian GNU/Linux Developer

2003-01-08 16:16:41

by Jamie Lokier

[permalink] [raw]
Subject: Re: [PATCH] Set TIF_IRET in more places

Richard Henderson wrote:
> On Tue, Jan 07, 2003 at 11:27:32AM -0800, Zack Weinberg wrote:
> > > It explicitly checks for the opcode sequences 0x58b877000000cd80 and
> > > 0xb8ad000000cd80 in order to unwind exception frames around a
> > > handled signal. Ugly, isn't it?
> >
> > We're open to better ideas ...
>
> Something like having dwarf2 unwind information for the
> vsyscall page on the page as well?

It would be quite nice just to have dwarf2 unwind information, with an
unwind handler, for the classic non-vsyscall restorer in Glibc.

Then MD_FALLBACK_FRAME_STATE_FOR could be removed from GCC on all
Linux targets, regardless of kernel version.

Once that is working it will be much clearer what exactly to send
Linus for the vsyscall page.

-- Jamie

2003-01-09 06:40:44

by Zack Weinberg

[permalink] [raw]
Subject: Re: [PATCH] Set TIF_IRET in more places

Jamie Lokier <[email protected]> writes:

>
> It would be quite nice just to have dwarf2 unwind information, with an
> unwind handler, for the classic non-vsyscall restorer in Glibc.

A trivial routine that just calls another routine,

extern void g(void);

void f(void) { g(); asm volatile (""); }

[the asm is to prevent sibcall optimization from kicking in] produces
the assembly dump appended to this message when compiled with -O2
-fexceptions -fomit-frame-pointer. I do not know what the stuff put
in .eh_frame means, and it probably isn't exactly right for __restore
or __restore_rt, but it's a start.

> Then MD_FALLBACK_FRAME_STATE_FOR could be removed from GCC on all
> Linux targets, regardless of kernel version.

I think we'd need to keep it around for the sake of older libcs; it
shouldn't do any harm.

zw

.text
.p2align 2,,3
.globl f
.type f,@function
f:
.LFB1:
subl $12, %esp
.LCFI0:
call g
addl $12, %esp
.LCFI1:
ret
.LFE1:
.Lfe1:
.size f,.Lfe1-f
.section .eh_frame,"aw",@progbits
.Lframe1:
.long .LECIE1-.LSCIE1
.LSCIE1:
.long 0x0
.byte 0x1
.string ""
.uleb128 0x1
.sleb128 -4
.byte 0x8
.byte 0xc
.uleb128 0x4
.uleb128 0x4
.byte 0x88
.uleb128 0x1
.align 4
.LECIE1:
.LSFDE1:
.long .LEFDE1-.LASFDE1
.LASFDE1:
.long .LASFDE1-.Lframe1
.long .LFB1
.long .LFE1-.LFB1
.byte 0x4
.long .LCFI0-.LFB1
.byte 0xe
.uleb128 0x10
.byte 0x4
.long .LCFI1-.LCFI0
.byte 0xe
.uleb128 0x4
.align 4
.LEFDE1:
.ident "GCC: (GNU) 3.2.2 20021231 (Debian prerelease)"