by Scott James Remnant

On 01/10, Scott James Remnant wrote:
>
> On Wed, 2009-01-07 at 12:53 -0800, Roland McGrath wrote:
>
> > New syscall should have gone to linux-api, I think.
> >
> > Do we really need another one for this? How about using signalfd plus
> > setting the child's exit_signal to a queuing (SIGRTMIN+n) signal instead of
> > SIGCHLD? It's slightly more magical for the userland process to know to do
> > that (fork -> clone SIGRTMIN). But compared to adding a syscall we don't
> > really have to add, maybe better.
> >
> This wouldn't help the init daemon case:
>
> - the exit_signal is set on the child, not on the parent.
>
> While the init daemon could clone() every new process and set
> exit_signal, this would not be set for processes reparented to init.
>
> Even if we had a new syscall to change the exit_signal of a given
> process, *and* had the init reparent notification patch, this still
> wouldn't be sufficient; you'd have a race condition between the time
> you were notified of the reparent, and the time you set exit_signal,
> in which the child could die.
>
> Since exit_signal is always reset to SIGCHLD before reparenting, this
> could be done by resetting it to a different signal; but at this point
> we're getting into a rather twisty method full of traps.
>
> - exit_signal is reset to SIGCHLD on exec().
>
> Pretty much a plan-killer ;)

I can't understand why should we change ->exit_signal if we want to
use signalfd. Yes, SIGCHLD is not rt. So what?

We do not need multiple signals in queue if we want to reap multiple
zombies. Once we have a single SIGCHLD (reported by signalfd or
whatever) we can do do_wait(WNOHANG) in a loop.

Confused.

Oleg.
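
The loop described above can be sketched in userspace as follows. This is a minimal illustration, not code from the patch: waitpid() stands in for do_wait(), and the helper names spawn_children() and reap_all() are invented for the example.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork n children that exit immediately.
 * Returns how many were actually started. */
int spawn_children(int n)
{
	int started = 0;

	for (int i = 0; i < n; i++) {
		pid_t pid = fork();

		if (pid == 0)
			_exit(i);	/* child: become a zombie right away */
		if (pid > 0)
			started++;
	}
	return started;
}

/* Reap every child that has already exited, without blocking.
 * Meant to be called once per SIGCHLD notification.
 * Returns the number of zombies collected by this call. */
int reap_all(void)
{
	int reaped = 0;

	for (;;) {
		int status;
		pid_t pid = waitpid(-1, &status, WNOHANG);

		if (pid > 0) {
			reaped++;
			continue;
		}
		if (pid < 0 && errno == EINTR)
			continue;
		/* 0: children exist but none has exited yet;
		 * -1 with ECHILD: no children left at all. */
		break;
	}
	return reaped;
}
```

One coalesced SIGCHLD is enough to trigger reap_all(); a call that finds nothing simply returns 0.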

2009-01-10 16:21:38

by Oleg Nesterov

Subject: Re: [RESEND][RFC PATCH v2] waitfd

On 01/10, Scott James Remnant wrote:
>
> On Thu, 2009-01-08 at 22:39 +0100, Oleg Nesterov wrote:
>
> > Btw. It is not that I am trying to argue against sys_waitfd(), but do
> > you have the "real life" example when it can be useful? Yes, poll().
> > But we have signalfd. SIGCHLD is not rt signal, but afaics this is not
> > the problem actually. Just curious.
> >
> signalfd() can't currently be made to work in the way you describe.

Hmm. Could you clarify?

I am not sure we are talking about the same thing, but afaics poll() +
signalfd can work to (say) reap the childs. Actually, ppoll() alone is
enough.

No?

Oleg.
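
A rough sketch of the ppoll()-only variant mentioned above, assuming Linux with glibc (ppoll() is not in POSIX). SIGCHLD stays blocked except while the process sleeps inside ppoll(), so the handler can only run there. The names setup_sigchld() and wait_child_ppoll() are made up for the example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for ppoll() */
#endif
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t got_sigchld;

static void on_sigchld(int signo)
{
	(void)signo;
	got_sigchld = 1;
}

/* Install the handler and block SIGCHLD; the previous mask is saved
 * in *orig. Returns 0 on success, -1 on error. */
int setup_sigchld(sigset_t *orig)
{
	struct sigaction sa;
	sigset_t block;

	memset(&sa, 0, sizeof sa);
	sa.sa_handler = on_sigchld;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGCHLD, &sa, NULL) < 0)
		return -1;

	sigemptyset(&block);
	sigaddset(&block, SIGCHLD);
	return sigprocmask(SIG_BLOCK, &block, orig);
}

/* Sleep in ppoll() until SIGCHLD has been delivered, then reap one
 * child without blocking. Returns the reaped pid, 0 if the signal was
 * not caused by an exit, or -1 on error. */
pid_t wait_child_ppoll(const sigset_t *orig)
{
	sigset_t during = *orig;

	/* Mask in effect only while ppoll() sleeps: SIGCHLD unblocked. */
	sigdelset(&during, SIGCHLD);

	while (!got_sigchld) {
		if (ppoll(NULL, 0, NULL, &during) < 0 && errno != EINTR)
			return -1;
	}
	got_sigchld = 0;

	return waitpid(-1, NULL, WNOHANG);
}
```

A real main loop would pass its own pollfd array to ppoll() and call waitpid() repeatedly until it stops returning a pid.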

2009-01-10 17:07:45

by Scott James Remnant

Subject: Re: [RESEND][RFC PATCH v2] waitfd

On Sat, 2009-01-10 at 16:57 +0100, Oleg Nesterov wrote:

> I can't understand why should we change ->exit_signal if we want to
> use signalfd. Yes, SIGCHLD is not rt. So what?
>
> We do not need multiple signals in queue if we want to reap multiple
> zombies. Once we have a single SIGCHLD (reported by signalfd or
> whatever) we can do do_wait(WNOHANG) in a loop.
>
Well, a good reason why is that it makes things much easier to do in
userspace. The NOTES sections of many of the syscall manpages are
already too long with strange behaviours that you have to take into
account every time (and which most people fail to do).

You may as well ask why we have signalfd() at all, and what was wrong
with sigaction() and ordinary signal handlers? Well, lots of things
actually; for a start, you couldn't really do much in the signal
handler, so from userspace all we ended up doing was writing a byte to a
pipe so we could wake up our main loop.

You then had to remember to exhaust this pipe *before* checking what
signals were triggered, just in case a signal was delivered while you
were checking (so at least you'd wake the main loop up once more to
check).
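
For readers who have not seen it, the pipe trick described above looks roughly like this. It is a sketch under assumptions: a single signal, a single-threaded process, and invented names (selfpipe_init(), selfpipe_wait()).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static int wake_pipe[2];	/* [0] read end, [1] write end */
static volatile sig_atomic_t signal_seen;

static void on_signal(int signo)
{
	int saved = errno;
	ssize_t n;

	(void)signo;
	signal_seen = 1;
	/* write() is async-signal-safe. A full pipe (EAGAIN) is harmless:
	 * an earlier byte is already waiting to wake the loop. */
	n = write(wake_pipe[1], "x", 1);
	(void)n;
	errno = saved;
}

/* Create the non-blocking pipe and install the handler for signo.
 * Returns 0 on success, -1 on error. */
int selfpipe_init(int signo)
{
	struct sigaction sa;

	if (pipe(wake_pipe) < 0)
		return -1;
	for (int i = 0; i < 2; i++) {
		int fl = fcntl(wake_pipe[i], F_GETFL);

		if (fl < 0 || fcntl(wake_pipe[i], F_SETFL, fl | O_NONBLOCK) < 0)
			return -1;
	}
	memset(&sa, 0, sizeof sa);
	sa.sa_handler = on_signal;
	sigemptyset(&sa.sa_mask);
	return sigaction(signo, &sa, NULL);
}

/* One main-loop iteration: wait up to timeout_ms for a wake-up, drain
 * the pipe, then test and clear the flag.
 * Returns 1 if a signal was seen, 0 otherwise. */
int selfpipe_wait(int timeout_ms)
{
	struct pollfd pfd = { .fd = wake_pipe[0], .events = POLLIN };
	char buf[64];
	int r;

	do {
		r = poll(&pfd, 1, timeout_ms);
	} while (r < 0 && errno == EINTR);

	/* Drain BEFORE looking at the flag: a signal arriving after this
	 * point leaves a fresh byte behind, so the next poll() wakes up
	 * again (possibly with nothing left to do). */
	while (read(wake_pipe[0], buf, sizeof buf) > 0)
		;

	if (!signal_seen)
		return 0;
	signal_seen = 0;
	return 1;
}
```

The occasional wake-up that finds nothing to do is the price of never missing a signal.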

signalfd() improves matters a lot; we don't have to worry about any
strange behaviour, we just read() to know a signal is pending. But
there were two competing implementations: one would just poll for read
if signals were pending, but have no other detail; the other (which is
what we have) gives us a siginfo_t for each pending signal.

Except for non-RT signals of course, in which case it just gives us the
siginfo_t for the first pending signal of that type; later ones are
discarded.
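
The difference can be observed with a small experiment. This is only a sketch: it assumes Linux, a single-threaded process, and a made-up helper name count_delivered(). It sends the same signal to itself several times while the signal is blocked, then counts how many records signalfd() hands back.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <sys/signalfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Send `sends` copies of signo to ourselves while it is blocked, then
 * count the siginfo records read back from a signalfd.
 * Returns the count, or -1 on error. The signal is left blocked. */
int count_delivered(int signo, int sends)
{
	struct signalfd_siginfo info;
	sigset_t mask;
	int fd, count = 0;

	sigemptyset(&mask);
	sigaddset(&mask, signo);
	if (sigprocmask(SIG_BLOCK, &mask, NULL) < 0)
		return -1;

	fd = signalfd(-1, &mask, SFD_NONBLOCK | SFD_CLOEXEC);
	if (fd < 0)
		return -1;

	for (int i = 0; i < sends; i++)
		kill(getpid(), signo);

	/* The non-blocking read fails with EAGAIN once nothing is pending. */
	while (read(fd, &info, sizeof info) == (ssize_t)sizeof info)
		count++;

	close(fd);
	return count;
}
```

A standard signal such as SIGUSR1 is expected to come back once however many times it was sent, while a real-time signal such as SIGRTMIN comes back once per send (up to the queue limit).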

I've suggested before that that's a bug in and of itself, and that signalfd
should always queue signals even if they're non-RT. But since, other
than SIGCHLD, the only other signals with useful data are SIGILL,
SIGFPE, SIGSEGV and SIGBUS (which have si_addr) which you kinda need to
catch in a signal handler; and SIGPOLL (which has si_band and si_fd)
which is negated by being in poll anyway, that's kinda hard to argue.

So let's compare userspace code for trying to reap children using
signalfd(); to save space, I've omitted error handling and the
select()/poll()/etc. call - assuming that the top half is
initialisation, and the bottom half is inside the select() handler.

First, what we have today:

	sigemptyset (&mask);
	sigaddset (&mask, SIGCHLD);

	/* Block normal delivery and receive by sigfd instead */
	sigprocmask (SIG_BLOCK, &mask, NULL);
	sigfd = signalfd (-1, &mask, 0);

	for (;;) {
		read (sigfd, &fd_siginfo, sizeof fd_siginfo);

		/* throw away fd_siginfo, we're reading SIGCHLD and
		 * can't use it :-(
		 */

		/* SIGCHLD means _at_least_one_ child is pending, there
		 * may be more; so we have to loop AND expect to find
		 * nothing
		 */
		for (;;) {
			/* ARGH! waitid returns 0 with WNOHANG if no
			 * child is ready to be reaped.
			 *
			 * AND the structure, despite being logically
			 * the same, isn't the same as the signalfd
			 * one :-/
			 */
			memset (&w_siginfo, 0, sizeof w_siginfo);

			waitid (P_ALL, 0, &w_siginfo,
				WEXITED | WNOHANG);

			/* Did we find anything? */
			if (! w_siginfo.si_pid)
				break;

			/* NOW we have the siginfo_t for a recently
			 * deceased process
			 */

			mourn (&w_siginfo);
		}

		/* Oh-HO!
		 * While we were looping in waitid(), other children may
		 * have died, and we probably already cleaned them up!
		 *
		 * But we'll still have a pending SIGCHLD, it might be
		 * tempting to clear that *BUT* the child could have
		 * died after the last waitid() call but before we clear
		 * it.
		 *
		 * We have no choice in this situation but to loop back
		 * through our entire main loop again, and do nothing.
		 */
	}

Pros:
- code exists today

Cons:
- having siginfo_t returned by read() is pointless, we can't use it
- double loop isn't pretty
- strange waitid() API in case of WNOHANG and no child
- incompatible structures for signalfd()'s read result and waitid(),
despite being logically the same structure! :-/
- can't simultaneously clear pending signal and wait, so we always have
to go back round the main loop if a child dies after the read()

In fact, that's not what userspace code does at all.

Since there's no point listening to SIGCHLD, it's a complete no-op, we
don't respond to it at all. We only need to use it to wake up the main
loop. The wait() loop tends to be at the bottom of the main loop
somewhere, completely outside of the fd processing.

Now, what if signalfd() would always queue pending signals even if
they're non-RT? This would also be the same code if we could change
SIGCHLD to SIGRTMIN for the *waiting* process's children:

	sigemptyset (&mask);
	sigaddset (&mask, SIGCHLD);

	/* Block normal delivery and receive by sigfd instead */
	sigprocmask (SIG_BLOCK, &mask, NULL);
	sigfd = signalfd (-1, &mask, 0);

	for (;;) {
		read (sigfd, &siginfo, sizeof siginfo);

		/* siginfo is immediately useful!
		 */

		mourn (&siginfo);

		/* we didn't clear off the wait queue, but that's easy
		 * since we have the pid from signalfd()
		 */
		waitid (P_PID, siginfo.si_pid, NULL, WEXITED);
	}

Pros:
- single siginfo_t structure type returned by read()
- we get information for each signal, we don't need a signal loop to
find out what the signal is telling us
- exact match between the signal and the wait call
- no need to go round the main loop again just in case!
- child signal can now be processed just like anything else. This
makes "standard main loop functions" (g_main_loop, etc.) much easier
to write.

Cons:
- needs kernel patch

Personally, I think the fact that this solves the case where you wait on
a process that wasn't part of the original set the signal was pending
for, and so have to process an extra SIGCHLD with nothing to do, is a
good enough reason in and of itself.

The overhead of going back through a main loop of a userspace process
just to find out whether you already responded to a notification is a
waste of cycles.

That's pretty neat, much nicer than what we had before. So what about
waitfd() [I think I've slightly changed Casey's API here, or he changed
my proposed one <g>]?

	wfd = waitfd (-1, P_ALL, 0, WEXITED);

	for (;;) {
		read (wfd, &siginfo, sizeof siginfo);

		/* siginfo is immediately useful AND the process has
		 * been reaped!
		 */

		mourn (&siginfo);
	}

Pros:
- we don't need to care about SIGCHLD anymore - why should we listen to
a notification to call wait, if we can just call wait?
- the absolute easiest for a generic main loop; signals, timers and
child death (as well as inotify, netlink, etc.) are now just cases of
"select an fd, read structs of known size, process them"
- possibility to allow waitfd (WNOWAIT) on children outside of your own
usual process tree - notification without reaping? Maybe too much?

Cons:
- needs kernel patch
- new API

Scott
--
Scott James Remnant
[email protected]

2009-01-10 17:10:13

On Sat, 2009-01-10 at 19:21 +0100, Oleg Nesterov wrote:

> On 01/10, Scott James Remnant wrote:
> >
> > On Sat, 2009-01-10 at 17:19 +0100, Oleg Nesterov wrote:
> >
> > > I am not sure we are talking about the same thing, but afaics poll() +
> > > signalfd can work to (say) reap the childs. Actually, ppoll() alone is
> > > enough.
> > >
> > Last time I checked, ppoll() was not actually implemented across all
> > architectures in a manner that solved the race it was intended to solve.
> >
>
> As I said, this is imho unfair. But I mentioned ppoll() "just in case".
>
> My question was why do you think that "signalfd() can't currently be
> made to work in the way you describe". You have dropped this part to
> change the topic?
>
Sorry, I may not be following LKML etiquette correctly. These couple of
recent threads (other than some bugs I found in wait last year) are my
first real attempt to participate here.

I wasn't intending to "change the topic" or to drop the parts about
changing signalfd() to somehow sweep it under the carpet.

Rather than posting repeatedly across the thread, I tried to consolidate
my responses into the other post you've replied to.

You made an interesting point about ppoll here, so I only responded to
that to find out whether the situation of that syscall had been
improved.

Not so much changing the topic, but asking a side-bar question ;)

Scott
--
Scott James Remnant
[email protected]

2009-01-10 20:13:31

Subject: Re: [RESEND][RFC PATCH v2] waitfd

On Sat, Jan 10, 2009 at 4:57 PM, Oleg Nesterov <[email protected]> wrote:
> On 01/10, Scott James Remnant wrote:
>> On Wed, 2009-01-07 at 12:53 -0800, Roland McGrath wrote:
>> > Do we really need another one for this? How about using signalfd plus
>> > setting the child's exit_signal to a queuing (SIGRTMIN+n) signal instead of
>> > SIGCHLD? It's slightly more magical for the userland process to know to do
>> > that (fork -> clone SIGRTMIN). But compared to adding a syscall we don't
>> > really have to add, maybe better.
>> >
>> This wouldn't help the init daemon case:
>>
>> - the exit_signal is set on the child, not on the parent.
>>
>>   While the init daemon could clone() every new process and set
>>   exit_signal, this would not be set for processes reparented to init.
>>
>>   Even if we had a new syscall to change the exit_signal of a given
>>   process, *and* had the init reparent notification patch, this still
>>   wouldn't be sufficient; you'd have a race condition between the time
>>   you were notified of the reparent, and the time you set exit_signal,
>>   in which the child could die.
>>
>>   Since exit_signal is always reset to SIGCHLD before reparenting, this
>>   could be done by resetting it to a different signal; but at this point
>>   we're getting into a rather twisty method full of traps.
>>
>> - exit_signal is reset to SIGCHLD on exec().
>>
>>   Pretty much a plan-killer ;)
>
> I can't understand why should we change ->exit_signal if we want to
> use signalfd. Yes, SIGCHLD is not rt. So what?
>
> We do not need multiple signals in queue if we want to reap multiple
> zombies. Once we have a single SIGCHLD (reported by signalfd or
> whatever) we can do do_wait(WNOHANG) in a loop.
>
> Confused.

I know I am terribly late for the party :)

"do_wait(WNOHANG) in a loop" is a performance problem.

Oleg, do you remember that strace bug when it was swamped
with gazillions of stop notifications from a multithreaded
task, then by dealing with them one-by-one it was causing
unfairness and ultimately "this program never finishes
when run under strace" bug?

And another typical nuisance that running multithreaded
stuff under strace is much slower, even with -e option
which limits the set of decoded syscalls?

Having waitfd would help both cases: strace can gulp
a lot of waitpid notifications in one go, and
batch process them.

--
vda

2011-03-02 14:03:51

by Oleg Nesterov

Subject: Re: [RESEND][RFC PATCH v2] waitfd

On 03/02, Denys Vlasenko wrote:
>
> On Sat, Jan 10, 2009 at 4:57 PM, Oleg Nesterov <[email protected]> wrote:
> >
> > We do not need multiple signals in queue if we want to reap multiple
> > zombies. Once we have a single SIGCHLD (reported by signalfd or
> > whatever) we can do do_wait(WNOHANG) in a loop.
> >
> > Confused.
>
> I know I am terribly late for the party :)
>
> "do_wait(WNOHANG) in a loop" is a performance problem.

Yes.

> Oleg, do you remember that strace bug when it was swamped
> with gazillions of stop notifications from a multithreaded
> task, then by dealing with them one-by-one it was causing
> unfairness and ultimately "this program never finishes
> when run under strace" bug?

Yes. But, iirc, this was not connected to the performance problems
with do_wait(). The problem was, strace did a single do_wait()
instead of wait-them-all.

> And another typical nuisance that running multithreaded
> stuff under strace is much slower, even with -e option
> which limits the set of decoded syscalls?

IIUC, this is also because strace is single-threaded, I mean it
doesn't scale well.

> Having waitfd would help both cases: strace can gulp
> a lot of waitpid notifications in one go, and
> batch process them.

Perhaps.

I do not know how much do_wait() contributes to the slowness
though. And it is not exactly clear how we can implement the
"fast" waitfd.

For example, this patch (iirc!) just calls do_wait() in a loop.
I doubt very much it can really help to improve the performance.

Oh. Can't resist. The real problem is that ptrace API should
not be per-thread, and it should not use wait() at all. But
this is offtopic.

Oleg.