Proposal for a sigaltmm()
There is a performance problem when running virtualised environments
(notably user mode linux). The cost of the mprotect calls needed to
handle syscalls and protect the UML kernel from its user space is large,
and the alternatives like a separate process and ptrace are not pretty either.
The cunning plan goes like this:
Add
current->alt_mm
and a per-task flag for 'supervisory' mode
Tasks start with current->alt_mm NULL and the flag set to supervisory
On exec/exit tear down alt_mm as well as mm
Signal delivery checks if alt_mm != NULL && supervisory is clear;
if so it sets supervisory, switches mm/alt_mm, flushes the tlb and
continues handling the signal in the new space
We add
sys_switchmm(address);
This switches to the altmm (creating one, if it doesn't exist, as a copy of
the current mm), flushes the tlb and jumps to the address given.
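To make the shape concrete, a minimal sketch of the delivery check (the
struct layout and names here are stand-ins for whatever a real patch would
use, not existing kernel code):

	struct mm_struct;			/* opaque for the sketch */

	struct task {
		struct mm_struct *mm;		/* current address space */
		struct mm_struct *alt_mm;	/* alternate 'supervisory' space */
		int supervisory;		/* per-task mode flag */
	};

	/* on signal delivery: swap spaces before running the handler */
	static void maybe_enter_supervisory(struct task *t)
	{
		if (t->alt_mm != NULL && !t->supervisory) {
			struct mm_struct *tmp = t->mm;

			t->mm = t->alt_mm;
			t->alt_mm = tmp;
			t->supervisory = 1;
			/* flush the tlb here, then continue delivery */
		}
	}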
Any opinions, spanners to throw in the works?
Alan
[email protected] said:
> We add
> sys_switchmm(address);
> This switches to the altmm (creating one if it doesnt exist as a copy
> of the current mm), flushes the tlb and jumps to the address given.
You didn't explicitly say (and so I had to ask :-) that this is intended
to be the mechanism by which UML returns to userspace, rather than the
normal sigreturn you'd get by just returning from the handler.
So, this would make the entry to userspace look like:
restore registers
.
.
.
sys_switchmm(ip);
The problem with this is that it needs to be atomic wrt signals. There
can't be an interrupt in the middle of that sequence. So, sys_switchmm
would also have to restore the old signal mask, which you'd have to pass
in unless you're going to read it off the signal frame. Also, it would
have to be open coded because you've already restored the stack pointer.
So, the entry to userspace starts looking like:
block signals
restore registers
sys_switchmm(ip, new_sigmask);
Well, except for the blocking signals part, this is sigreturn under a
different name and partly moved into userspace.
Your objection to returning through sigreturn was performance. Is performance
a veto of adding an mm switch to sigreturn, or is it possible to make it
acceptable?
Also, is a new sigreturn_mm() reasonable? This would be close to sys_switchmm,
except that it would restore registers and would be a plug replacement for
sigreturn[_rt]. I don't favor this because it would probably have to choose
whether to be an _rt return or not, and I'd like the option of having UML
register some signals as SA_SIGINFO (currently, they are all non-SA_SIGINFO).
Comments, brickbats, spanners?
Jeff
> can't be an interrupt in the middle of that sequence. So, sys_switchmm
> would also have to restore the old signal mask, which you'd have to pass
> in unless you're going to read it off the signal frame. Also, it would
> have to be open coded because you've already restored the stack pointer.
Uggh.. you are right. You end up needing sigreturn handling.
> Your objection to returning through sigreturn was performance. Is performance
> a veto of adding an mm switch to sigreturn, or is it possible to make it
> acceptable?
It's not a veto. I was trying to avoid having to add any more branches to
the fast paths in the kernel. The remaining sigreturn question is
"how do you get into 'user' mode the first time?"
On Thu, Aug 01, 2002 at 11:40:28PM -0500, Jeff Dike wrote:
>
> Your objection to returning through sigreturn was performance. Is performance
> a veto of adding an mm switch to sigreturn, or is it possible to make it
> acceptable?
I once ported Basilisk to run natively on linux-m68k. It ran
*slow*, so I looked at what the problem was - signal delivery in
Linux is exorbitantly slow. E.g. a SIGILL delivery costs ~1650 cycles
on a 68060; compared to that, sigreturn and getpid are 200-250 and
sched_yield with a context switch around 400.
So sigreturn is not the place I would be looking for the biggest
speedups.
Richard
On Fri, 2002-08-02 at 12:34, Richard Zidlicky wrote:
> I once ported Basilisk to run natively on linux-m68k. It ran
> *slow*, so I looked at what the problem was - signal delivery in
> Linux is exorbitantly slow. E.g. a SIGILL delivery costs ~1650 cycles
> on a 68060; compared to that, sigreturn and getpid are 200-250 and
> sched_yield with a context switch around 400.
The numbers look very different on a real processor. Signal delivery is
indeed not stunningly fast, but relative to a context switch its cost is very
low indeed.
[email protected] said:
> It's not a veto. I was trying to avoid having to add any more branches
> to the fast paths in the kernel.
Unless I'm missing something, a test for altmm and a branch to out of line
mm switching should be about three instructions on x86 including a correctly
predicted branch not taken in the non-altmm case.
> The remaining sigreturn question is
> "how do you get into 'user' mode the first time"
Last night I told you it was by building a signal frame by hand and returning
through it. That's no longer true. Now, every UML thread (except the idle
thread, which I think can reasonably be expected not to try to enter userspace)
is in a host signal handler when in the kernel.
All entrances to userspace happen by returning through that signal frame.
Special userspace returns (exec and fork et al) fiddle the sigcontext in
that frame beforehand. Normal system calls stuff the return value in the
appropriate slot in the sigcontext before returning, as well.
So, there's nothing special about entering userspace for the first time.
Everything is under a signal frame, so any time something needs to enter
userspace, it just returns through it.
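The general trick is easy to demonstrate outside UML. Here is a minimal
sketch using the SA_SIGINFO/ucontext interface (x86-64 register names; use
REG_EIP on i386) rather than UML's raw sigcontext fiddling; the principle -
edit the saved frame, then "enter" new code by returning through it - is
the same:

	#define _GNU_SOURCE
	#include <signal.h>
	#include <unistd.h>
	#include <ucontext.h>

	static void landing(void)
	{
		/* reached via sigreturn, not via a call; the old stack is
		   whatever it was at delivery, so don't return - just exit */
		write(1, "entered via modified signal frame\n", 34);
		_exit(0);
	}

	static void handler(int sig, siginfo_t *si, void *uc_void)
	{
		ucontext_t *uc = uc_void;

		/* point the saved instruction pointer at landing(); the
		   normal return from this handler then "enters" it */
		uc->uc_mcontext.gregs[REG_RIP] = (greg_t)landing;
	}

	int main(void)
	{
		struct sigaction sa = { .sa_sigaction = handler,
					.sa_flags = SA_SIGINFO };

		sigaction(SIGUSR1, &sa, NULL);
		raise(SIGUSR1);
		return 1;	/* not reached */
	}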
> This switches to the altmm (creating one if it doesnt exist as a copy
> of the current mm)
About this business of creating a UML kernel address space for each UML
user thread - I prefer to have a single kernel address space to which all
signals are delivered.
This has the slight disadvantage that the process address space isn't directly
accessible, but I can live with that. A virt_to_phys translation isn't too
painful.
A single separate kernel address space has the following attractions for me:
there are some cases where 3G of KVA would be very useful
it would make the UML kernel completely invisible to processes, which
is important for honeypots
apps which consume huge amounts of VM might run on the host, but
crap out inside a UML
This raises the question of how the process address spaces are created. For
a variety of reasons unrelated to altmm (which I can go into if anyone's
interested), I want address spaces to be separate user-visible objects.
You'd create a new empty one by opening /proc/new-mm or something and get
back a file descriptor as a handle to it. mmap/munmap/mprotect would be
extended to take a file descriptor pointing to the address space to be
changed.
So, altmm would look like this:
When it starts up, UML would call sigaltmm, passing a descriptor to its own
address space and register its signal handlers with a new flag, SA_IN_MM.
sigaction would have an mm field in which this descriptor would be put (and
would contain -1 in the non-altmm case).
The sigcontext would have an extra int in it which would be the descriptor
of the address space to which sigreturn will return.
Like now, UML would arrange that everything is under a host signal handler.
When it enters userspace it would change the address space fd in the sigcontext
if necessary.
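In rough code, UML's setup might look like this - everything here
(sigaltmm(), SA_IN_MM, the sa_mm field) is proposed above and does not
exist today:

	#include <signal.h>
	#include <string.h>

	extern void uml_trap_handler(int sig);

	/* kernel_mm: an fd naming UML's own address space; how such an
	   fd is first obtained is left open above */
	void setup_altmm(int kernel_mm)
	{
		struct sigaction sa;

		sigaltmm(kernel_mm);		/* proposed syscall */

		memset(&sa, 0, sizeof(sa));
		sa.sa_handler = uml_trap_handler;
		sa.sa_flags = SA_IN_MM;		/* proposed flag */
		sa.sa_mm = kernel_mm;		/* proposed field */
		sigaction(SIGSEGV, &sa, NULL);
	}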
Does this sound sane?
Jeff
> So, there's nothing special about entering userspace for the first time.
> Everything is under a signal frame, so any time something needs to enter
> userspace, it just returns through it.
Ok
> This has the slight disadvantage that the process address space isn't directly
> accessible, but I can live with that. A virt_to_phys translation isn't too
> painful.
Right
> This raises the question of how the process address spaces are created. For
> a variety of reasons unrelated to altmm (which I can go into if anyone's
> interested), I want address spaces to be separate user-visible objects.
That really makes all the existing code not work with it. Doing an altmm
is easy in the sense that it doesn't require 20 new syscalls and doesn't
slow down the main kernel paths for a single odd case.
I can see why there is a need to manipulate the other mm. I need to think
about the right way to handle it.
[email protected] said:
> That really makes all the existing code not work with it.
Can you be more specific? If you're thinking I'm talking about breaking
mmap, munmap, and mprotect by adding another argument, I'm not. I'm talking
about adding new syscalls, mmap2, munmap2, mprotect2 (or something more
imaginative), which have the extra argument, having them take -1 as meaning
"fiddle the current address space" and pursuading libc to use them instead
of the current syscalls. Then we would start the current ones on their way
to the happy syscall hunting grounds in the sky.
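Roughly, the signatures would be (names and argument order are guesses from
the description; mm_fd == -1 means "the current address space"):

	void *mmap2(int mm_fd, void *addr, size_t len, int prot,
		    int flags, int fd, off_t off);
	int munmap2(int mm_fd, void *addr, size_t len);
	int mprotect2(int mm_fd, void *addr, size_t len, int prot);
	int switch_mm(int mm_fd);	/* see below */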
> Doing an altmm is easy in the sense that it doesn't require 20 new
> syscalls
I don't think I mentioned 20 new syscalls anywhere :-) If you count the
ones above as replacements and not new, I'm talking about one new syscall -
switch_mm(), which I didn't mention before, that would switch to a given
address space. This would be the basis of UML's switch_mm.
> and doesn't slow down the main kernel paths for a single odd
> case.
Which main kernel paths are you referring to here?
Jeff
> mmap, munmap, and mprotect by adding another argument, I'm not. I'm talking
> about adding new syscalls, mmap2, munmap2, mprotect2 (or something more
> imaginative), which have the extra argument, having them take -1 as meaning
> "fiddle the current address space" and pursuading libc to use them instead
> of the current syscalls. Then we would start the current ones on their way
> to the happy syscall hunting grounds in the sky.
That's a lot more invasive than I want to be.
[email protected] said:
> That's a lot more invasive than I want to be.
OK, that was my best thinking on the subject. I'll be interested to see what
you like.
Jeff
On 2 Aug 2002, Alan Cox wrote:
> The numbers look very different on a real processor. Signal delivery is
> indeed not stunningly fast, but relative to a context switch its cost is very
> low indeed.
actually the opposite is true, on a 2.2 GHz P4:
$ ./lat_sig catch
Signal handler overhead: 3.091 microseconds
$ ./lat_ctx -s 0 2
2 0.90
ie. *process to process* context switches are 3.4 times faster than signal
delivery. Ie. we can switch to a helper thread and back, and still be
faster than a *single* signal.
signals are in essence 'lightweight' threads created and destroyed for the
purpose of a single asynchronous event, it's IMO a very inefficient and
baroque concept for almost anything (but debugging and a number of very
special uses). I'd guess that with a sane threading library a helper
thread is faster for almost everything.
Ingo
> actually the opposite is true, on a 2.2 GHz P4:
>
> $ ./lat_sig catch
> Signal handler overhead: 3.091 microseconds
>
> $ ./lat_ctx -s 0 2
> 2 0.90
>
> ie. *process to process* context switches are 3.4 times faster than signal
> delivery. Ie. we can switch to a helper thread and back, and still be
> faster than a *single* signal.
That's interesting indeed. I'd not tried it with the O(1) scheduler.
> signals are in essence 'lightweight' threads created and destroyed for the
> purpose of a single asynchronous event, it's IMO a very inefficient and
> baroque concept for almost anything (but debugging and a number of very
> special uses). I'd guess that with a sane threading library a helper
> thread is faster for almost everything.
Which would argue UML ought to have a positively microkernel view of
syscalls - sending a message?
[email protected] said:
> Which would argue UML ought to have a positively microkernel view of
> syscalls - sending a message?
Indeed. Ingo's mail got me thinking that
[email protected] said:
> the alternatives like a separate process and ptrace are not pretty either
might not be so bad after all.
All I would need to make this work is for one process to be able to change
the mm of another.
Then, the current UML tracing thread would handle the kernel side of things
and sit in its own address space nicely protected from its processes.
Jeff
Ingo Molnar <[email protected]> writes:
> actually the opposite is true, on a 2.2 GHz P4:
>
> $ ./lat_sig catch
> Signal handler overhead: 3.091 microseconds
>
> $ ./lat_ctx -s 0 2
> 2 0.90
>
> ie. *process to process* context switches are 3.4 times faster than signal
> delivery. Ie. we can switch to a helper thread and back, and still be
> faster than a *single* signal.
This is because the signal save/restore does a lot of unnecessary stuff.
One optimization I implemented at one time was adding a SA_NOFP signal
bit that told the kernel that the signal handler did not intend
to modify floating point state (few signal handlers need FP). It would
not save the FPU state then, which gave quite a speedup in signal
latency.
Linux got a lot slower in signal delivery when the SSE2 support was
added; the SA_NOFP patch got that speed back.
The target were certain applications that use signal handlers for async
IO.
If there is interest I can dig up the old patches. They were really simple.
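From the application side it was just one more sa_flags bit, roughly like
this (the flag value here is made up for the sketch):

	#include <signal.h>
	#include <string.h>

	#define SA_NOFP 0x04000000	/* made-up value */

	void install_nofp_handler(int sig, void (*fn)(int))
	{
		struct sigaction sa;

		memset(&sa, 0, sizeof(sa));
		sa.sa_handler = fn;
		/* promise: this handler touches no FP/SSE state */
		sa.sa_flags = SA_NOFP;
		sigaction(sig, &sa, NULL);
	}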
x86-64 also does it faster, by FXSAVE'ing directly to the user space
frame with exception handling instead of copying manually. But that's
not possible on i386 because it still has to use the baroque iBCS
FP context format on the stack.
-Andi
In article <[email protected]>,
Andi Kleen <[email protected]> wrote:
>Ingo Molnar <[email protected]> writes:
>
>
>> actually the opposite is true, on a 2.2 GHz P4:
>>
>> $ ./lat_sig catch
>> Signal handler overhead: 3.091 microseconds
>>
>> $ ./lat_ctx -s 0 2
>> 2 0.90
>>
>> ie. *process to process* context switches are 3.4 times faster than signal
>> delivery. Ie. we can switch to a helper thread and back, and still be
>> faster than a *single* signal.
>
>This is because the signal save/restore does a lot of unnecessary stuff.
>One optimization I implemented at one time was adding a SA_NOFP signal
>bit that told the kernel that the signal handler did not intend
>to modify floating point state (few signal handlers need FP) It would
>not save the FPU state then and reached quite some speedup in signal
>latency.
>
>Linux got a lot slower in signal delivery when the SSE2 support was
>added. That got this speed back.
This will break _horribly_ when (if) glibc starts using SSE2 for things
like memcpy() etc.
I agree that it is really sad that we have to save/restore FP on
signals, but I think it's unavoidable. Your hack may work for you, but
it just gets really dangerous in general. Having signals randomly
subtly corrupt some SSE2 state just because the signal handler uses
something like memcpy (without even realizing that that could lead to
trouble) is bad, bad, bad.
In other words, "not intending to" does not imply "will not". It's just
potentially too easy to change SSE2 state by mistake.
And yes, this signal handler thing is clearly visible on benchmarks.
MUCH too clearly visible. I just didn't see any safe alternatives
(and I still don't ;( )
Linus
On Mon, Aug 05, 2002 at 05:35:13AM +0000, Linus Torvalds wrote:
> This will break _horribly_ when (if) glibc starts using SSE2 for things
> like memcpy() etc.
Humm, related: wasn't the idea of giving userspace access to the kernel-
optimized versions of memcpy et al through a page with these functions
mapped into the process address space (I don't remember the exact details)
something still being considered?
- Arnaldo
At 05:35 AM 5/08/2002 +0000, Linus Torvalds wrote:
> >Linux got a lot slower in signal delivery when the SSE2 support was
> >added. That got this speed back.
>
>This will break _horribly_ when (if) glibc starts using SSE2 for things
>like memcpy() etc.
>
>I agree that it is really sad that we have to save/restore FP on
>signals, but I think it's unavoidable. Your hack may work for you, but
>it just gets really dangerous in general. having signals randomly
>subtly corrupt some SSE2 state just because the signal handler uses
>something like memcpy (without even realizing that that could lead to
>trouble) is bad, bad, bad.
how about putting the onus on userspace to tell the kernel if/when it uses
extensions that require FP state to be saved/restored?
if/when glibc starts using SSE2, it could then use these extensions.
could be as simple as user-space setting some bit somewhere.
>And yes, this signal handler thing is clearly visible on benchmarks.
>MUCH too clearly visible. I just didn't see any safe alternatives
>(and I still don't ;( )
it probably isn't worthwhile penalising all users of signal just for those
few userspace apps that actually do use SSE2.
cheers,
lincoln.
[email protected] (Linus Torvalds) writes:
> >This is because the signal save/restore does a lot of unnecessary stuff.
> >One optimization I implemented at one time was adding a SA_NOFP signal
> >bit that told the kernel that the signal handler did not intend
> >to modify floating point state (few signal handlers need FP) It would
> >not save the FPU state then and reached quite some speedup in signal
> >latency.
> >
> >Linux got a lot slower in signal delivery when the SSE2 support was
> >added. That got this speed back.
>
> This will break _horribly_ when (if) glibc starts using SSE2 for things
> like memcpy() etc.
>
> I agree that it is really sad that we have to save/restore FP on
> signals, but I think it's unavoidable. Your hack may work for you, but
> it just gets really dangerous in general. having signals randomly
> subtly corrupt some SSE2 state just because the signal handler uses
> something like memcpy (without even realizing that that could lead to
> trouble) is bad, bad, bad.
I think the possibility at least for memcpy is rather remote. Any sane
SSE memcpy would only kick in for really big arguments (for small
memcpys it doesn't make any sense at all because of the context save/possible
reformatting penalty overhead). So only people doing really
big memcpys could possibly be hurt, and that is rather unlikely.
But your point stands, one definitely needs to be very careful with it.
Also for special things like UML, which can ensure their environment is sane,
it could still be a useful optimization. I did it originally for async IO
handling in some project. At least offering the choice does not hurt.
If it could speed up UML I think it would certainly be worth it.
After all Linux should give you enough rope to shoot yourself in the foot ;)
>
> In other words, "not intending to" does not imply "will not". It's just
> potentially too easy to change SSE2 state by mistake.
>
> And yes, this signal handler thing is clearly visible on benchmarks.
> MUCH too clearly visible. I just didn't see any safe alternatives
> (and I still don't ;( )
In theory you could do a superhack: put the FP context into an unmapped
page on the stack and only save it lazily, on FPU use or on access to the
unmapped page. Unfortunately the details get too nasty
(where to find the unmapped page? is the tlb manipulation worth it if the
page was mapped? how to store the address of the unmapped page for nested
signal handlers for the page fault handler?) so I discarded this idea.
-Andi
On 4 Aug 2002, Andi Kleen wrote:
> > actually the opposite is true, on a 2.2 GHz P4:
> >
> > $ ./lat_sig catch
> > Signal handler overhead: 3.091 microseconds
> >
> > $ ./lat_ctx -s 0 2
> > 2 0.90
> >
> > ie. *process to process* context switches are 3.4 times faster than signal
> > delivery. Ie. we can switch to a helper thread and back, and still be
> > faster than a *single* signal.
>
> This is because the signal save/restore does a lot of unnecessary stuff.
> One optimization I implemented at one time was adding a SA_NOFP signal
> bit that told the kernel that the signal handler did not intend to
> modify floating point state (few signal handlers need FP) It would not
> save the FPU state then and reached quite some speedup in signal
> latency.
well, we have an optimization in this area already - if the thread
receiving the signal has not used any FPU registers during its current
scheduled atom yet then we do not save the FPU state into the signal
frame.
lat_sig uses the FPU so this cost is added. If the FPU saving cost is
removed then signal delivery latency is still 2.0 usecs - slightly more
than twice as expensive as a context-switch - so it's not a win. And
threads can do queued events that amortize context switch overhead, while
queued signals generate per-event signal delivery, so signal delivery
costs are not amortized.
(Not that i advocate SIGIO or helper threads for highperformance IO -
Ben's aio interface is the fastest and most correct approach.)
Ingo
[email protected] said:
> Also for special things like UML who can ensure their environment is
> sane it could be still an useful optimization.
I use libc, and I haven't been able to convince myself that it isn't
going to use FP instructions or registers on my behalf. I use it as little
as possible, but it still makes me nervous.
> If it could speed up UML I think it would certainly be
> worth it.
After Ingo's numbers, I like the idea of just having a separate address
space and process for the UML kernel, and have that process ptrace UML
processes and handle system calls and interrupts on their behalf. One
context switch at the start of a system call and one at the end, as opposed
to a signal delivery and sigreturn.
This also solves the jail mode mprotect performance horrors.
The one thing standing in my way is the need for the kernel process to
be able to change the address space of its processes.
I made a proposal for that, and Alan didn't like it. So, we'll see what
he likes better.
Jeff
On Sat, 03 Aug 2002 10:29:42 -0500
Jeff Dike <[email protected]> wrote:
> [email protected] said:
> > the alternatives like a separate process and ptrace are not pretty either
I have implemented a usermode version of the Fiasco µ-kernel that uses
a separate process for the kernel and one process for each task. The kernel
process attaches to all tasks via ptrace.
When the kernel wants to change the MM of a task it puts some trampoline code
on a page mapped into each task's address space and has the task execute that
code on behalf of the kernel.
With that setup we have complete address space protection without all the
trouble of jail at the expense of a few context switches for each mmap, munmap
or mprotect operation.
I would also very much like an extension that would allow one process to modify
the MM of another, possibly via an extended ptrace interface or a new syscall.
Also it would be nice if there was an alternate way to get at the cr2 register,
trap number and error code other than from a SIGSEGV handler.
> All I would need to make this work is for one process to be able to change
> the mm of another.
Yes, exactly.
> Then, the current UML tracing thread would handle the kernel side of things
> and sit in its own address space nicely protected from its processes.
Yes. I already have this part working for our kernel, so it's not just theory.
I believe things could run yet another bit faster if we didn't have to do the
trampoline map operations.
-Udo.
> > > actually the opposite is true, on a 2.2 GHz P4:
> > >
> > > $ ./lat_sig catch
> > > Signal handler overhead: 3.091 microseconds
> > >
> > > $ ./lat_ctx -s 0 2
> > > 2 0.90
> > >
> > > ie. *process to process* context switches are 3.4 times faster than signal
> > > delivery. Ie. we can switch to a helper thread and back, and still be
> > > faster than a *single* signal.
Has someone gone through the lat_ctx.c and lat_sig.c code and convinced
themselves these are measuring things which ought to be compared like this?
When I wrote that code I didn't anticipate this comparison, so somebody
should go look.
I'd suggest that if you want to measure how fast you can communicate using
signals versus pipes (or sockets or whatever), someone write up a test
which has two processes bounce a token between each other using signals
and then compare that with lat_pipe. It's not clear to me that you are
comparing apples to apples.
If someone does write the test, we'll add it to LMbench if it reveals
anything useful. It should be easy enough to do. I can do it if it
isn't obvious.
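A rough sketch of the kind of test I mean (untested; two processes bounce
SIGUSR1 back and forth and we time the round trip, for comparison with
lat_pipe):

	#include <signal.h>
	#include <stdio.h>
	#include <sys/time.h>
	#include <sys/wait.h>
	#include <unistd.h>

	#define ROUNDS 100000

	static void bounce(pid_t peer, int rounds)
	{
		sigset_t set;
		int i, sig;

		sigemptyset(&set);
		sigaddset(&set, SIGUSR1);
		for (i = 0; i < rounds; i++) {
			sigwait(&set, &sig);	/* wait for the token */
			kill(peer, SIGUSR1);	/* pass it back */
		}
	}

	int main(void)
	{
		sigset_t set;
		struct timeval t0, t1;
		double usecs;
		pid_t child;

		sigemptyset(&set);
		sigaddset(&set, SIGUSR1);
		/* block before forking so no token is ever lost */
		sigprocmask(SIG_BLOCK, &set, NULL);

		if ((child = fork()) == 0) {
			bounce(getppid(), ROUNDS);
			_exit(0);
		}
		gettimeofday(&t0, NULL);
		kill(child, SIGUSR1);		/* start the token */
		bounce(child, ROUNDS);
		gettimeofday(&t1, NULL);

		usecs = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
		printf("%f usec per round trip\n", usecs / ROUNDS);
		wait(NULL);
		return 0;
	}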
--
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
Linus Torvalds wrote:
> I agree that it is really sad that we have to save/restore FP on
> signals, but I think it's unavoidable.
Couldn't you mark the FPU as unused for the duration of the
handler, and let the lazy FPU mechanism save the state when it is used
by the signal handler?
> And yes, this signal handler thing is clearly visible on benchmarks.
> MUCH too clearly visible. I just didn't see any safe alternatives
> (and I still don't ;( )
I use SEGVs to trap access to read-only pages for garbage collection,
and I know I'm not the only one. That's a lot of SEGVs...
Fwiw, I have timed SIGSEGV handling on Linux on various Intel CPUs,
on a PA-RISC running HP-UX and on a few Sparcs running Solaris. Linux
came out faster in all cases. Best case: 8 microseconds to trap a page
fault, handle the SEGV and mprotect() one page (600MHz P3). Worst case:
37 microseconds (133MHz Pentium).
That's about 5000 cycles. I'm sure we can do better than that.
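For reference, the pattern being timed is roughly this (a minimal sketch; a
real collector would record the dirtied page rather than just unprotecting
it, and calling mprotect from a handler is a Linux-works practice rather
than something POSIX promises):

	#define _GNU_SOURCE
	#include <signal.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	static long pagesize;

	static void on_segv(int sig, siginfo_t *si, void *uc)
	{
		/* round the faulting address down to its page and
		   unprotect it; the faulting write is then retried */
		void *page = (void *)((uintptr_t)si->si_addr & ~(pagesize - 1));
		mprotect(page, pagesize, PROT_READ | PROT_WRITE);
	}

	int main(void)
	{
		struct sigaction sa = { .sa_sigaction = on_segv,
					.sa_flags = SA_SIGINFO };
		char *heap;

		pagesize = sysconf(_SC_PAGESIZE);
		sigaction(SIGSEGV, &sa, NULL);

		heap = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
			    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		mprotect(heap, pagesize, PROT_READ);	/* arm the barrier */

		heap[0] = 1;	/* faults, handler unprotects, write retries */
		printf("wrote through the barrier: %d\n", heap[0]);
		return 0;
	}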
For sophisticated user space uses, like the above, I'd like to see
a trap handling mechanism that saves only the _minimum_ state.
Userspace can take care of the rest. Maybe even without a sigreturn in
some cases.
-- Jamie
Ingo Molnar wrote:
> And threads can do queued events that amortize context switch
> overhead, while queued signals generate per-event signal delivery, so
> signal delivery costs are not amortized.
>
> (Not that i advocate SIGIO or helper threads for highperformance IO -
> Ben's aio interface is the fastest and most correct approach.)
Isn't the per-event queued signal cost amortised when using sigwaitinfo()?
-- Jamie
Jamie Lokier wrote:
> Ingo Molnar wrote:
> > And threads can do queued events that amortize context switch
> > overhead, while queued signals generate per-event signal delivery, so
> > signal delivery costs are not amortized.
> >
> > (Not that i advocate SIGIO or helper threads for highperformance IO -
> > Ben's aio interface is the fastest and most correct approach.)
>
> Isn't the per-event queued signal cost amortised when using sigwaitinfo()?
Of course I meant:
Isn't the per-event queued signal cost amortised when using sigtimedwait()?
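I.e. something like this sketch: block the signal, then pull every queued
siginfo off in one loop instead of taking a handler dispatch per event (it
assumes the fd was set up with F_SETSIG to deliver a queued signal, so
si_fd is meaningful):

	#include <signal.h>
	#include <time.h>

	extern void handle_io_event(int fd);

	/* sig must already be blocked by the caller */
	void drain_io_signals(int sig)
	{
		sigset_t set;
		siginfo_t si;
		struct timespec zero = { 0, 0 };

		sigemptyset(&set);
		sigaddset(&set, sig);
		/* each call dequeues one siginfo, no handler runs */
		while (sigtimedwait(&set, &si, &zero) == sig)
			handle_io_event(si.si_fd);
	}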
cheers,
-- Jamie
On 5 Aug 2002, Andi Kleen wrote:
>
> I think the possibility at least for memcpy is rather remote. Any sane
> SSE memcpy would only kick in for really big arguments (for small
> memcpys it doesn't make any sense at all because of the context save/possible
> reformatting penalty overhead). So only people doing really
> big memcpys could possibly be hurt, and that is rather unlikely.
And this is why the kernel _has_ to save the FP state.
It's the "only happens in a blue moon" bugs that are the absolute _worst_
bugs. I want to optimize the kernel until I'm blue in the face, but the
kernel must NEVER EVER have a "non-stable" interface.
Signal handlers that don't restore state are hard as _hell_ to debug. Most
of the time it doesn't really matter (unless the lack of restore is
something really major like one of the most common integer registers), but
then depending on what libraries you use, and just _exactly_ when the
signal comes in, you get subtle data corruption that may not show up until
much later.
At which point your programmer wonders if he mistakenly wandered into
MS-Windows land.
No thank you. I'll take slow signal handlers over ones that _sometimes_
don't work.
> After all Linux should give you enough rope to shoot yourself in the foot ;)
On purpose, yes. It's ok to take careful aim, and say "I'm now shooting
myself in the foot".
And yes, it's also ok to say "I don't know what I'm doing, so I may be
shooting myself in the foot" (this is obviously the most common
foot-shooter).
And if you come to me and complain about how drunk you were, and how you
shot yourself in the foot by mistake due to that, I'll just ignore you.
BUT - and this is a big BUT - if you are doing everything right, and you
actually know what you're doing, and you end up shooting yourself in the
foot because the kernel was taking a shortcut, then I think the kernel is
_wrong_.
And I'd rather have a slow kernel that does things right, than a fast
kernel which screws with people.
> In theory you could do a superhack: put the FP context into an unmapped
> page on the stack and only save it lazily, on FPU use or on access to the
> unmapped page.
That would be extremely interesting especially with signal handlers that
do a longjmp() thing.
The real fix for a lot of programs on x86 would be for them to never ever
use FP in the first place, in which case the kernel would be able to just
not save and restore it at all.
However, glibc fiddles with the fpu at startup, even for non-FP programs.
Dunno what to do about that.
Linus
On Mon, 5 Aug 2002, Jamie Lokier wrote:
> Linus Torvalds wrote:
> > I agree that it is really sad that we have to save/restore FP on
> > signals, but I think it's unavoidable.
>
> Couldn't you mark the FPU as unused for the duration of the
> handler, and let the lazy FPU mechanism save the state when it is used
> by the signal handler?
Nope. Believe me, I gave some thought to clever things to do.
The kernel won't even _see_ a longjmp() out of a signal handler, so the
kernel has a really hard time trying to do any clever lazy stuff.
Also, people who play games with FP actually change the FP data on the
stack frame, and depend on signal return to reload it. Admittedly I've
only ever seen this on SIGFPE, but anyway - this is all done with integer
instructions that just touch bitpatterns on the stack.. The kernel can't
catch it sanely.
> For sophisticated user space uses, like the above, I'd like to see
> a trap handling mechanism that saves only the _minimum_ state.
I would not mind an extra per-signal flag that says "don't bother with FP
saves" (the same way we already have "don't restart" etc), but I would be
very nervous if glibc used it by default (even if glibc doesn't use SSE2
in memcpy, gcc itself can do it, and obviously _users_ may just do it
themselves).
So it would have to be explicitly enabled with a SA_NOFPSIGHANDLER flag or
something.
(And yes, it's the FP stuff that takes most of the time. I think the
lmbench numbers for signal delivery tripled when that went in).
Linus
> Also, people who play games with FP actually change the FP data on the
> stack frame, and depend on signal return to reload it. Admittedly I've
> only ever seen this on SIGFPE, but anyway - this is all done with integer
> instructions that just touch bitpatterns on the stack.. The kernel can't
> catch it sanely.
Could the fp state be put on its own page and the dirty bit
evaluated in the decision whether to restore fpu state ?
Regards
Oliver
On Mon, 5 Aug 2002, Oliver Neukum wrote:
>
> > Also, people who play games with FP actually change the FP data on the
> > stack frame, and depend on signal return to reload it. Admittedly I've
> > only ever seen this on SIGFPE, but anyway - this is all done with integer
> > instructions that just touch bitpatterns on the stack.. The kernel can't
> > catch it sanely.
>
> Could the fp state be put on its own page and the dirty bit
> evaluated in the decision whether to restore fpu state ?
I'm sure anything is _possible_, but there are a few problems with that
approach. In particular, playing VM games tends to be quite expensive on
SMP, since you need to make sure that the TLB entry for that page is
invalidated on all the other CPU's before you insert the FPU page.
Also, you'd need to play games with dirty bit handling, since the page
_is_ dirty (it contains FP data), so the VM must know to write it out if
it pages things. That's ok - we have separate per-page and per-TLB-entry
dirty bits anyway, but right now the VM layer knows it can move the TLB
entry dirty bit into the per-page dirty bit and drop it - which wouldn't
be the case if we also have a FPU dirty bit.
That's fixable - we could just make a "software TLB dirty bit" that is
updated whenever the hardware TLB dirty bit is cleared and moved into the
per-page dirty bit.
But the end result sounds rather complicated, especially since all the
page table walking necessary for setting this all up is likely to be about
as expensive as the thing we're trying to avoid..
Rule of thumb: it almost never pays to be "clever".
Linus
hi :)
On Sat, Aug 03, 2002 at 10:29:42AM -0500, Jeff Dike wrote:
> [email protected] said:
> > the alternatives like a separate process and ptrace are not pretty either
>
> might not be so bad after all.
there is already a group at our university doing that:
http://www3.informatik.uni-erlangen.de/Research/Projects/UMLinux/umlinux.html
--
Martin Waitz
On Mon, Aug 05, 2002 at 03:46:07PM +0200, Udo A. Steinberg wrote:
> On Sat, 03 Aug 2002 10:29:42 -0500
> Jeff Dike <[email protected]> wrote:
>
> > [email protected] said:
> > > the alternatives like a separate process and ptrace are not pretty either
>
> I have implemented a usermode version of the Fiasco µ-kernel that uses
> a separate process for the kernel and one process for each task. The kernel
> process attaches to all tasks via ptrace.
> When the kernel wants to change the MM of a task it puts some trampoline code
> on a page mapped into each task's address space and has the task execute that
> code on behalf of the kernel.
> With that setup we have complete address space protection without all the
> trouble of jail at the expense of a few context switches for each mmap, munmap
> or mprotect operation.
very interesting - what is the handiest way to do "syscalls" in this model?
Ptrace is still basically signal driven, so I would expect it still has some
unnecessary overhead?
> I would also very much like an extension that would allow one process to modify
> the MM of another, possibly via an extended ptrace interface or a new syscall.
> Also it would be nice if there was an alternate way to get at the cr2 register,
> trap number and error code other than from a SIGSEGV handler.
that's what signals are for, too bad they are slow.
> > Then, the current UML tracing thread would handle the kernel side of things
> > and sit in its own address space nicely protected from its processes.
>
> Yes. I already have this part working for our kernel, so it's not just theory.
> I believe things could run yet another bit faster if we didn't have to do the
> trampoline map operations.
they are very expensive because of the way ptrace accesses the other process's
memory. Did you try a piece of shared memory?
Richard
On Mon, 5 Aug 2002 22:44:15 +0200
Richard Zidlicky <[email protected]> wrote:
> very interesting, what is the handiest way to do "syscalls" in this model?
> Ptrace is still basically signal driven so I would expect it has still some
> unnecessary overhead?
Task wants to do a syscall (i.e. int 0x30 in Fiasco), the kernel process tracing
the task sees the signal in its SIGCHLD handler. It pulls the registers out of the
task's address space using PTRACE_GETREGS and sets up an interrupt frame on the
kernel stack. EIP and ESP in the saved signal context are frobbed in a way that
the signal handler falls right into the correct interrupt gate when it returns.
iret works the other way round. SIGSEGV handler in the kernel process copies registers
back to task and restarts the task's process after restoring kernel state.
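The register frobbing itself is only a few ptrace calls (i386 sketch;
kernel_entry and kernel_stack stand in for the real interrupt gate and
kernel stack):

	#include <sys/ptrace.h>
	#include <sys/types.h>
	#include <sys/user.h>

	void redirect_task(pid_t task, unsigned long kernel_entry,
			   unsigned long kernel_stack)
	{
		struct user_regs_struct regs;

		ptrace(PTRACE_GETREGS, task, 0, &regs);	/* task is stopped */
		/* ... copy regs into an interrupt frame on the kernel stack ... */
		regs.eip = kernel_entry;		/* resume here */
		regs.esp = kernel_stack;		/* on this stack */
		ptrace(PTRACE_SETREGS, task, 0, &regs);
		ptrace(PTRACE_CONT, task, 0, 0);
	}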
> > I would also very much like an extension that would allow one process to modify
> > the MM of another, possibly via an extended ptrace interface or a new syscall.
> > Also it would be nice if there was an alternate way to get at the cr2 register,
> > trap number and error code other than from a SIGSEGV handler.
>
> that's what signals are for, too bad they are slow.
As it is now, in order to get at the page fault address one has to invoke a SIGSEGV
handler in the task, then look at the task's signal context to determine the pagefault
address, trapno etc. It would be much faster if the kernel could cancel the SIGSEGV signal
in the task's process and read out the pagefault info from the TCB via a ptrace
extension. That saves the cost of running a signal handler in the task and a bunch of context
switches.
> they are very expensive because of the way ptrace accesses the other process
> memory, did you try a piece of shared memory ?
Yes, trampoline page is shared between kernel and task. Nevertheless there are
context switches that wouldn't be necessary if the kernel could tweak the task's
mm directly.
-Udo.
On Mon, 05 Aug 2002 19:42:31 -0500
Jeff Dike <[email protected]> wrote:
> Similarly, with other signals, like the timer, SIGIO, or page faults, it
> would just annul the signal and call into the IRQ system. Although page
> faults will be difficult because of the inability to read err or cr2, as
> you've pointed out.
Jeff,
If my understanding of UML is right, you implement interrupts with socket
pairs where the interrupt handler writes a byte into one end and the other
end receives an async notification (SIGIO). In order to stop the right task
with a SIGIO, you change the socket owner on each context switch using fcntl.
If you have one process per task and a kernel process, the kernel process
cannot change socket ownership over to the next task's process, because it's
not allowed to. Only the process itself could set the ownership to his pid,
but then each task switch would have to be done with a trampoline too.
The issue boils down to how the kernel process can stop a task process in
order to force the task into kernel. You can of course kill (taskpid, SIG)
but that has a race if the task tries to enter kernel at the same time.
SIG will be pending in the task until it is scheduled next.
-Udo.
[email protected] said:
> If my understanding of UML is right, you implement interrupts with
> socket pairs where the interrupt handler writes a byte into one end
> and the other end receives an async notification (SIGIO).
It sounds like you're confusing two mechanisms. Device interrupts are
implemented with something that supports SIGIO (socketpair, tty) with one
end outside UML and one end inside UML generating the SIGIOs.
I use socketpairs in the way you describe to implement context switching.
Out-of-context processes are sleeping in a read on their socket, and are
woken up by a soon-to-be-out-of-context process writing a byte down it.
There's no SIGIO there at all.
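The switch itself is tiny - in sketch form:

	#include <unistd.h>

	/* each process sleeps in read() on its own socket; the outgoing
	   process wakes the incoming one by writing a byte */
	void context_switch(int my_sock, int next_sock)
	{
		char token = 0;

		write(next_sock, &token, 1);	/* wake the incoming process */
		read(my_sock, &token, 1);	/* sleep until we are woken */
	}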
I also use socketpairs with SIGIO to implement IPIs on SMP UML.
> In order to
> stop the right task with a SIGIO, you change the socket owner on each
> context switch using fcntl.
Yup. More precisely, in order to ensure that the correct process receives
SIGIO when input comes in from the outside, I F_SETOWN the descriptors to
the incoming process during a context switch.
> If you have one process per task and a kernel process, the kernel
> process cannot change socket ownership over to the next task's
> process, because it's not allowed to.
Why not? I see nothing at all in the implementation of F_SETOWN that requires
that it be called by the current owner:
	case F_SETOWN:
		lock_kernel();
		filp->f_owner.pid = arg;
		filp->f_owner.uid = current->uid;
		filp->f_owner.euid = current->euid;
		...
There are no general checks earlier in do_fcntl or sys_fcntl either.
Jeff
[email protected] said:
> Task wants to do a syscall (i.e. int 0x30 in Fiasco), the kernel
> process tracing the task sees the signal in its SIGCHLD handler. It
> pulls the registers out of the task's address space using
> PTRACE_GETREGS and sets up an interrupt frame on the kernel stack.
Hmmm, I would have the kernel process let the system call bump it out of
wait() rather than delivering a SIGCHLD. And, I'd be inclined to longjmp
over to the kernel stack.
Or, even better, have it already running on the appropriate kernel stack,
so it can just read the system call from PTRACE_GETREGS and call into the
main kernel.
Similarly, with other signals, like the timer, SIGIO, or page faults, it
would just annul the signal and call into the IRQ system. Although page
faults will be difficult because of the inability to read err or cr2, as
you've pointed out.
Jeff
On Mon, Aug 05, 2002 at 05:35:13AM +0000, Linus Torvalds wrote:
> And yes, this signal handler thing is clearly visible on benchmarks.
> MUCH too clearly visible. I just didn't see any safe alternatives
> (and I still don't ;( )
To some degree, the original approach taken by Intel may be an alternative...
That is, the signal handler is responsible for saving state of all CPU
resources that it intends to use, and restoring state before returning
control to the caller. (the 'interrupt' qualifier from C)
I could see this offered as a GCC optimization, but without the compiler
smarts to detect what is needed and what is not, it would be very difficult
to add this support in a seamless manner.
For example:
typedef void (*__fastsighandler_t) (int) __attribute__ ((signal_handler));

#define signal(number, handler) \
	(__attribute_enabled__((handler, signal_handler)) \
		? __signal_fast(number, handler) \
		: __signal(number, handler))

void __attribute__ ((signal_handler)) handle_sigint (int sig)
{
	sigint_received++;
}
mark
[email protected] said:
> there is already a group at our university doing that: http://
> www3.informatik.uni-erlangen.de/Research/Projects/UMLinux/umlinux.html
Yeah, I know. Hans-Joerg and I have been talking about whether and how
much it makes sense to start sharing code.
Jeff
On Mon, 05 Aug 2002 21:55:05 -0500
Jeff Dike <[email protected]> wrote:
>
> > If you have one process per task and a kernel process, the kernel
> > process cannot change socket ownership over to the next task's
> > process, because it's not allowed to.
>
> Why not? I see nothing at all in the implementation of F_SETOWN that requires
> that it be called by the current owner:
>
> case F_SETOWN:
> lock_kernel();
> filp->f_owner.pid = arg;
> filp->f_owner.uid = current->uid;
> filp->f_owner.euid = current->euid;
> ...
Ok, I was looking at sockets and not tty's and that has the following in
net/core/sock.c
	case F_SETOWN:
		/*
		 *	This is a little restrictive, but it's the only
		 *	way to make sure that you can't send a sigurg to
		 *	another process.
		 */
		if (current->pgrp != -arg &&
		    current->pid != arg &&
		    !capable(CAP_KILL)) return(-EPERM);
		sk->proc = arg;
		return(0);
So it wouldn't work with socketpairs, but with tty's it should.
-Udo.
[email protected] said:
> if (current->pgrp != -arg &&
> current->pid != arg &&
> !capable(CAP_KILL)) return(-EPERM);
What's the problem here? This will let UML do F_SETOWN as well.
Jeff
On Tue, 06 Aug 2002 06:20:52 -0500
Jeff Dike <[email protected]> wrote:
> [email protected] said:
> > if (current->pgrp != -arg &&
> > current->pid != arg &&
> > !capable(CAP_KILL)) return(-EPERM);
>
> What's the problem here? This will let UML do F_SETOWN as well.
It will let the incoming process take over ownership of the socket,
which is probably what you mean and what you currently use.
I'm talking about a setup with the kernel residing in its own process.
On iret it would have to change ownership of the socket to another task,
i.e. process with kernel_pid wants to set task_pid as the owner of the
socket. The above code fragment doesn't permit this, as far as I can see.
What it does permit is the incoming task setting itself to the socket
owner, but that requires that the incoming task always runs a trampoline
first which accomplishes that.
-Udo.
[email protected] said:
> It will let the incoming process take over ownership of the socket,
> which is probably what you mean and what you currently use.
Yup.
> On iret it would have to change ownership of the socket to another
> task, i.e. process with kernel_pid wants to set task_pid as the owner
> of the socket. The above code fragment doesn't permit this, as far as
> I can see.
Why not? There is nothing there that prevents that.
Jeff
On Tue, 06 Aug 2002 08:53:24 -0400
Jeff Dike <[email protected]> wrote:
> > On iret it would have to change ownership of the socket to another
> > task, i.e. process with kernel_pid wants to set task_pid as the owner
> > of the socket. The above code fragment doesn't permit this, as far as
> > I can see.
>
> Why not? There is nothing there that prevents that.
In the following code the parent (i.e. kernel) tries to set the child (i.e. task)
as owner for the socket. Does this work for you? It doesn't for me, for the
reason I described earlier.
#include <stdio.h>		/* for perror() */
#include <sys/types.h>
#include <sys/socket.h>
#include <fcntl.h>
#include <unistd.h>

int main (void) {
	int sockets[2], flags;
	pid_t pid;

	if (socketpair (AF_UNIX, SOCK_STREAM, 0, sockets)) {
		perror ("socketpair");
		return -1;
	}
	switch (pid = fork ()) {
	case -1:
		perror ("fork");
		return -1;
	case 0:			/* child (the "task"): just sleep */
		pause ();
	default:		/* parent (the "kernel") */
		if ((flags = fcntl (sockets[0], F_GETFL)) < 0) {
			perror ("fcntl, GETFL");
			return -1;
		}
		if (fcntl (sockets[0], F_SETFL, flags | O_NONBLOCK | O_ASYNC) < 0) {
			perror ("fcntl, SETFL");
			return -1;
		}
		/* fails with EPERM: the parent may not make the child
		   the socket's owner */
		if (fcntl (sockets[0], F_SETOWN, pid) < 0) {
			perror ("fcntl, SETOWN");
			return -1;
		}
	}
	return 0;
}
[email protected] said:
> Does this work for you?
No :-)
> It doesn't for me, for the reason I described
> earlier.
Indeed. I misread the !capable(CAP_KILL) as "I am not allowed to kill the
other guy", which clearly you are when you just forked it.
This looks like a bug to me. If you own the process, you can send it any
signal you want, so you should be allowed to sign it up for SIGURG/SIGIO via
F_SETOWN.
Jeff
On Tue, 06 Aug 2002 10:12:25 -0400
Jeff Dike <[email protected]> wrote:
> Indeed. I misread the !capable(CAP_KILL) as "I am not allowed to kill the
> other guy", which clearly you are when you just forked it.
> This looks like a bug to me. If you own the process, you can send it any
> signal you want, so you should be allowed to sign it up for SIGURG/SIGIO via
> F_SETOWN.
I'm glad we agree on that one :)
Considering we're not using sockets with broken SIGIO, but pseudo-terminals
like UML instead, there's still a problem:
When the task is registered as socket owner and is just about to enter the
kernel due to a syscall, it will stop with a SIGTRAP and the tracing kernel
process will run sometime and see a SIGCHLD. But after the task stopped and
before the kernel process can change SIGIO ownership back, a new interrupt
could come in and the SIGIO would remain pending in the task's process until
the task was scheduled to run next time.
How do you solve this?
-Udo.
[email protected] said:
> I'm glad we agree on that one :)
Yup, sorry. That test is wrong, and is slated to be fixed at some point.
> When the task is registered as socket owner and is just about to enter
> the kernel due to a syscall, it will stop with a SIGTRAP and the
> tracing kernel process will run sometime and see a SIGCHLD. But after
> the task stopped and before the kernel process can change SIGIO
> ownership back, a new interrupt could come in and the SIGIO would
> remain pending in the task's process until the task was scheduled to
> run next time.
>
> How do you solve this?
A couple of ways. The system call path can call sigio_handler to clear
out any pending IO. The SIGIO that was trapped in the process will cause
another call to sigio_handler which won't turn up any IO, but I don't
consider that to be a problem.
The kernel process can examine the signal pending mask of the process after
it has transferred SIGIO to itself. This can be done either through
/proc/<pid>/status or a ptrace extension, since we're happily postulating
new things for it to do anyway. If there is a SIGIO pending, it calls
sigio_handler.
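The /proc version would be something like this sketch (assuming the current
"SigPnd: <hex mask>" line format in /proc/<pid>/status):

	#include <signal.h>
	#include <stdio.h>

	int sigio_pending(int pid)
	{
		char path[64], line[128];
		unsigned long long mask = 0;
		FILE *f;

		snprintf(path, sizeof(path), "/proc/%d/status", pid);
		if ((f = fopen(path, "r")) == NULL)
			return -1;
		while (fgets(line, sizeof(line), f))
			if (sscanf(line, "SigPnd: %llx", &mask) == 1)
				break;
		fclose(f);
		return (mask >> (SIGIO - 1)) & 1;	/* bit n-1 = signal n */
	}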
Any other possibilities that you see?
Jeff
On Tue, 06 Aug 2002 13:42:18 -0400
Jeff Dike <[email protected]> wrote:
> A couple of ways. The system call path can call sigio_handler to clear
> out any pending IO. The SIGIO that was trapped in the process will cause
> another call to sigio_handler which won't turn up any IO, but I don't
> consider that to be a problem.
It is not a problem at all, just a small performance penalty.
> The kernel process can examine the signal pending mask of the process after
> it has transferred SIGIO to itself. This can be done either through
> /proc/<pid>/status or a ptrace extension, since we're happily postulating
> new things for it to do anyway. If there is a SIGIO pending, it calls
> sigio_handler.
I don't like the idea of having to fiddle with the proc filesystem. Some
people might not even mount it. A ptrace extension to look at and modify
the pending signal mask of a traced process would be very handy.
> Any other possibilities that you see?
Right now I'm doing something hackish. If the process enters with a syscall
(int 0x30 in my case) after the kernel expects it to enter due to an interrupt,
I just restart the task until it enters with the pending interrupt signal (SIGIO).
The task will do that before it can step on the int instruction again, and after
it returns to usermode it will step on the int again. This works well with faults.
The problem is traps, because the EIP points behind the instruction. In that
case the EIP needs to be adjusted. Ugly, I know.
-Udo.
On Tue, 06 Aug 2002 13:42:18 -0400
Jeff Dike <[email protected]> wrote:
>
> The kernel process can examine the signal pending mask of the process after
> it has transferred SIGIO to itself. This can be done either through
> /proc/<pid>/status or a ptrace extension, since we're happily postulating
> new things for it to do anyway. If there is a SIGIO pending, it calls
> sigio_handler.
>
> Any other possibilities that you see?
Another possibility could be the kernel process and the task processes sharing
a pending signal queue, either for one particular signal or all signals. The
kernel process would block SIGIO while the task runs and when the task enters
kernel mode with a SIGIO still trapped in the task process, SIGIO would get
delivered in the kernel and cleared from the shared pending queue, which is
just what we want.
Someone actually already tried implementing it with a clone extension, see
http://www.rhdv.cistron.nl/sigqueue.html
-Udo.
[email protected] said:
> SIGIO would get delivered in the kernel and cleared from the shared
> pending queue, which is just what we want.
Not really. What we really want is for signals not to be delivered at all.
That's why the ptrace signal annulling capability is nice.
I'm not sure if this makes any sense, but coupling the new aio mechanism with
something that queues up siginfos might be interesting. It would be a magic
descriptor that would feed you signals when you read it.
Is that at all sane?
Jeff
On Wed, Aug 07, 2002 at 10:14:42PM -0500, Jeff Dike wrote:
> I'm not sure if this makes any sense, but coupling the new aio mechanism with
> something that queues up siginfos might be interesting. It would be a magic
> descriptor that would feed you signals when you read it.
>
> Is that at all sane?
Delivering signals from aio completion is indeed possible. There is
even a field in the iocb structure for doing this in order to provide
complete posix compatibility (well, except for the fact that structure
initialization is enforced).
-ben
On Wed, 07 Aug 2002 22:14:42 -0500
Jeff Dike <[email protected]> wrote:
>
> Not really. What we really want is for signals not to be delivered at all.
> That's why the ptrace signal annulling capability is nice.
>
> I'm not sure if this makes any sense, but coupling the new aio mechanism with
> something that queues up siginfos might be interesting. It would be a magic
> descriptor that would feed you signals when you read it.
>
> Is that at all sane?
I know that we're trying to avoid signal handlers, because they are expensive.
But the signal would not need to be delivered in the task. We need a mechanism to
stop the task and force it into kernel. The task is uncooperative and doesn't
dequeue signals itself. When it gets a signal it stops. The kernel then sees the
signal and accepts it using sigwaitinfo, at which point it is no longer pending
in the task either. The siginfo structure then provides the necessary info,
i.e. which fd caused the i/o.
When running in a kernel context, you actually need to deliver SIGIO in order
to interrupt the current context.
If you have a magic aio descriptor, how does the task process read signals
from it and stop?
-Udo.
[email protected] said:
> The task is uncooperative and doesn't dequeue signals itself. When it
> gets a signal it stops. The kernel then sees the signal and accepts it
> using sigwaitinfo, at which point it is no longer pending in the task
> either. The siginfo structure then provides the necessary info, i.e.
> which fd caused the i/o.
I think this is more or less what I had in mind. The thing that is missing
is for sigwaitinfo to be able to dequeue another process' signals, which is
where the shared signal queue would come in.
> If you have a magic aio descriptor, how does the task process read
> signals from it and stop?
I was looking at this as a way of dequeueing signals from the other process.
The task process would have the signal queued and wake up the kernel process
as happens now. The kernel process would have /proc/<task-pid>/sigqueue
or something opened and would read siginfos from it. Those would then be
dequeued from the task process.
This almost suffices for getting page fault information, except that, for
some reason, siginfo doesn't say whether the faulting access was a read or
a write.
And now that I'm thinking about it, aio doesn't really come into it. This
would be strictly synchronous.
Jeff