LinuxLists.cc - Better fork() (and possibly others) failure diagnostics

2002-10-15 11:49:24

Subject: Better fork() (and possibly others) failure diagnostics

Hello!

Several times I have had real problems with batch jobs failing with EAGAIN,
printed as "Resource temporarily unavailable". The failure itself is not the
problem; determining the real cause is the painful part. Usually the cause
is a resource limit (rlimit, set by ulimit), but the returned error code is
misleading.

There are two ways to improve this. One is to print something to syslog when
an rlimit is reached. This is already done when the system-wide limit on open
files is reached.

The second is more subtle: define an error code for reaching an rlimit
(possibly one error code for each rlimit) and slightly change the code to
return the correct error code.

What do you think about this subject?

Michal Kara

--
PING 111.111.111.111 (111.111.111.111): 56 data bytes
...
---- Waiting for outstanding packets ----
No outstanding packets received, just two ordinary.

2002-10-15 13:10:18

by jw schultz


Subject: Re: Better fork() (and possibly others) failure diagnostics

On Tue, Oct 15, 2002 at 01:55:17PM +0200, Michal Kara wrote:
> Several times I had real problems with batch jobs failing with EAGAIN,
> printed as "Resource temporarily unavailable". Not with the failure, but to
> determine the real cause is really a pain. Usually, the problem is in
> resource limits (rlimit, set by ulimit), but the returned error code is
> misleading.
>
> There are two ways. One is to print something to syslog, when some rlimit
> is reached. This is already done when limit of open files in system is
> reached.
>
> The second is more subtle - define error code for reaching the rlimit
> (possibly one errorcode for each rlimit) and slightly change the code to
> return correct error code.
>
> What do you think about this subject?

Bad idea at this time. In 1980 it might have been ok.

Take a look at the manpages. It is very clear there that
EAGAIN has two meanings: try again because what you request
isn't available yet, and request exceeds resource limits (at
the moment). Basically POSIX and SUS direct that EAGAIN is
the correct error code for resource limit exceedance.

I agree it would be nice if rlimit caused its own error code
but such a change at this time would break far too many things.

Your alternative of klogging an error is not appropriate
either. Hitting an rlimit is not a system, but a user
error. There is nothing for the admin to do, the message
really needs to go to the user and when an rlimit is hit you
would be likely to get a flurry of messages. It is better
for the user to be notified by the application.

Optimally whatever hit the rlimit should have reported a
more useful message. That many applications don't have
special processing for EAGAIN isn't surprising as it doesn't
occur that often. I suppose a change to the error message
to read "Resource temporarily unavailable or user limits
exceeded" might help newer users but that is a property of
libc.
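
A minimal sketch of the kind of application-level reporting described above, assuming Python's standard os and resource modules on a Unix-like system. The helper names describe_fork_error and fork_or_explain are hypothetical, and the hint about RLIMIT_NPROC is only a guess the application makes, not something the kernel reports:

```python
import errno
import os
import resource


def format_limit(value: int) -> str:
    """Render an rlimit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def describe_fork_error(exc: OSError) -> str:
    """Build a more specific message for a failed fork() (hypothetical helper)."""
    base = f"fork failed: {os.strerror(exc.errno)} (errno {exc.errno})"
    if exc.errno != errno.EAGAIN:
        return base
    # EAGAIN does not say which limit was hit; RLIMIT_NPROC is the usual suspect.
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return (
        f"{base}; possibly RLIMIT_NPROC exceeded "
        f"(soft={format_limit(soft)}, hard={format_limit(hard)})"
    )


def fork_or_explain() -> int:
    """Call fork() and re-raise any failure with the longer message."""
    try:
        return os.fork()
    except OSError as exc:
        raise RuntimeError(describe_fork_error(exc)) from exc
```

A caller would use fork_or_explain() in place of os.fork() and print or log the RuntimeError text.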

--
________________________________________________________________
J.W. Schultz Pegasystems Technologies
email address: [email protected]

Remember Cernan and Schmitt

2002-10-15 15:38:27

by Michal Kara


Subject: Re: Better fork() (and possibly others) failure diagnostics

> Take a look at the manpages. It is very clear there that
> EAGAIN has two meanings: try again because what you request
> isn't available yet, and request exceeds resource limits (at
> the moment). Basically POSIX and SUS direct that EAGAIN is
> the correct error code for resource limit exceedance.

The fork() manpage says:

    EAGAIN fork cannot allocate sufficient memory to copy the
           parent's page tables and allocate a task structure
           for the child.

No word about limits. But that may classify as a manpage problem.

> I agree it would be nice if rlimit caused its own error code
> but such a change at this time would break far too many things.

I can think only of some applications retrying when they get EAGAIN...

> Your alternative of klogging an error is not appropriate
> either. Hitting an rlimit is not a system, but a user
> error.

On a workstation or multi-user server, yes. But not on, say, a web server.
There, hitting the limit is a problem and the administrator should do something
about it. When your nightly processing job hits a limit (and when you run it
by hand, it doesn't), "Something wrong" is not very helpful for solving the
problem.

> Optimally whatever hit the rlimit should have reported a
> more useful message. That many applications don't have
> special processing for EAGAIN isn't surprising as it doesn't
> occur that often. I suppose a change to the error message
> to read "Resource temporarily unavailable or user limits
> exceeded" might help newer users but that is a property of
> libc.

But WHICH limit? That is what this is all about. If there were only one,
it would be OK. And you cannot even display the limits/usage of a running
process to give you a hint.

Michal Kara

--
PING 111.111.111.111 (111.111.111.111): 56 data bytes
...
---- Waiting for outstanding packets ----
No outstanding packets received, just two ordinary.

2002-10-15 16:53:31

by Eduardo Pérez


Subject: fork() wait semantics

On 2002-10-15 13:55:17 +0200, Michal Kara wrote:
> Several times I had real problems with batch jobs failing with EAGAIN,
> printed as "Resource temporarily unavailable". Not with the failure, but to
> determine the real cause is really a pain. Usually, the problem is in
> resource limits (rlimit, set by ulimit), but the returned error code is
> misleading.

I've investigated the use of the fork() system call across some
programs in Debian GNU/Linux and I've found that the current fork
semantics are not very good.

When Linux can't fork, fork() returns an error that the application
usually treats as fatal. Instead of failing, fork() should wait for
resources to become available and thus never return an error. If the
current user has exceeded one of their limits, the program should block
until another program of the same user frees resources. This approach
can deadlock a user's processes, which the kernel should signal.

Since no application waits for fork() to become available, most applications
treat a fork() error as fatal.

A possible fork interface would be waiting for resources to be available
and only returning with error in case of deadlock.

This needs to be fixed in the kernel, as there are no interfaces in user
space to wait for fork() availability or to signal deadlocks.

This interface is also applicable to any other memory-requesting system
call (like brk) that can return ENOMEM: it could wait for memory to
become available and fail only on deadlock.

As an example consider bash. In case of fork() error the program
isn't even run thus causing a fatal error. If fork() waited for
resources to be available there wouldn't be any problem.

Is there a syscall to wait for fork to be available, instead of sleeping
for some amount of time between fork() attempts?

Attached is a bash patch that works, but it is not optimal without a proper
kernel interface, as it waits a fixed amount of time between fork()
attempts and can't detect deadlocks.

Attachments:

bash_no_crash_on_fork.diff (1.56 kB)
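
The contents of the attached patch are not shown in this archive. As an illustration only, a fixed-delay retry of the kind the message describes could be sketched in Python like this. The function name fork_with_retry and its parameters are hypothetical, and the fork and sleep arguments exist only so the loop can be exercised without creating processes:

```python
import errno
import os
import time


def fork_with_retry(attempts=5, delay=1.0, fork=os.fork, sleep=time.sleep):
    """Retry fork() on EAGAIN, sleeping a fixed delay between attempts.

    Like any user-space workaround, this cannot detect a deadlock; it
    simply gives up and re-raises after the last attempt.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fork()
        except OSError as exc:
            # Only EAGAIN is worth retrying; re-raise anything else at once.
            if exc.errno != errno.EAGAIN or attempt == attempts - 1:
                raise
            sleep(delay)
```

The fixed delay is the weak point the message complains about: too short wastes CPU, too long delays the job, and neither tells the caller whether waiting can ever succeed.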

2002-10-15 18:01:58

by Marius Gedminas


Subject: Re: fork() wait semantics

On Tue, Oct 15, 2002 at 04:58:44PM +0000, Eduardo Pérez wrote:
> As an example consider bash. In case of fork() error the program
> isn't even run thus causing a fatal error. If fork() waited for
> resources to be available there wouldn't be any problem.

No, thank you. This happened to me more than once (runaway fetchmail
plugins). An error message about a failing fork() indicates
immediately that I have too many processes, and I can kill them
(thankfully kill is a bash builtin). If bash just waited silently I
wouldn't know what to think.

Marius Gedminas
--
This sentence contradicts itself -- no actually it doesn't.
-- Douglas Hofstadter

2002-10-15 22:51:23

by Eduardo Pérez


Subject: Re: fork() wait semantics

On 2002-10-15 20:07:43 +0200, Marius Gedminas wrote:
> On Tue, Oct 15, 2002 at 04:58:44PM +0000, Eduardo Pérez wrote:
> > As an example consider bash. In case of fork() error the program
> > isn't even run thus causing a fatal error. If fork() waited for
> > resources to be available there wouldn't be any problem.
>
> No, thank you. This happened to me more than once (runaway fetchmail
> plugins). An error message about a failing fork() indicates
> immediately that I have too many processes, and I can kill them
> (thankfully kill is a bash builtin). If bash just waited silently I
> wouldn't know what to think.

But you are talking about buggy software.
If your software has bugs, don't expect it to work properly.

These fork() semantics are for non-buggy software.

2002-10-16 03:05:59

by jw schultz


Subject: Re: Better fork() (and possibly others) failure diagnostics

On Tue, Oct 15, 2002 at 05:46:21PM +0200, Michal Kara wrote:
> > Take a look at the manpages. It is very clear there that
> > EAGAIN has two meanings: try again because what you request
> > isn't available yet, and request exceeds resource limits (at
> > the moment). Basically POSIX and SUS direct that EAGAIN is
> > the correct error code for resource limit exceedance.
>
> The fork() manpage says:
>
>     EAGAIN fork cannot allocate sufficient memory to copy the
>            parent's page tables and allocate a task structure
>            for the child.
>
> No word about limits. But that may classify as a manpage problem.

I'd say so.

Also, I meant that you should survey the manpages that
cite EAGAIN, not just fork(2). The pattern is clear.

> > I agree it would be nice if rlimit caused its own error code
> > but such a change at this time would break far too many things.
>
> I can think only of some applications retrying when they get EAGAIN...

It is the application that you can't think of that will bite
someone else. Further, it isn't just whether they try again.
Some poorly written apps may test errno for known values and
behave oddly if they get an errno that isn't listed in the
manpages. Also, it is common to work around limits. Many
apps are written to economize if they get EAGAIN when
allocating memory.

> > Your alternative of klogging an error is not appropriate
> > either. Hitting an rlimit is not a system, but a user
> > error.
>
> On a workstation or multi-user server, yes. But not on, say, a web server.
> There, hitting the limit is a problem and the administrator should do something
> about it. When your nightly processing job hits a limit (and when you run it
> by hand, it doesn't), "Something wrong" is not very helpful for solving the
> problem.

Which is why your nightly job or server should be logging
its errors from user space.
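
A minimal sketch of such user-space logging, assuming Python's standard logging module. The helper name log_os_error and the logger name nightly-job are hypothetical, and where the log ends up (file, syslog, mail) depends on the handlers the job configures:

```python
import errno
import logging
import os


def log_os_error(logger: logging.Logger, action: str, exc: OSError) -> None:
    """Log a failed system call with its errno name, number and text."""
    name = errno.errorcode.get(exc.errno, "unknown")
    logger.error(
        "%s failed: %s (%s, errno %s)",
        action,
        os.strerror(exc.errno),
        name,
        exc.errno,
    )


def spawn_worker(logger: logging.Logger) -> int:
    """Fork a worker, logging the reason if the fork fails."""
    try:
        return os.fork()
    except OSError as exc:
        log_os_error(logger, "fork", exc)
        raise
```

A job would create its logger once, for example with logging.getLogger("nightly-job"), and pass it to spawn_worker().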

> But WHICH limit. This is what this is all about. If there was only one,
> then it is OK. And you cannot even display the limit/usage for running
> process to give you a hint.

That is unfortunately a deductive process. You can call
getrlimit and getrusage and try to guess, but which limit
caused the problem may be, I'll admit, an unknown.
In reality it is seldom that opaque.

Most of the time it is not hard to tell what caused it by
the syscall. For fork it will be RLIMIT_NPROC.
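
The getrlimit/getrusage inspection mentioned above can be sketched as follows, assuming Python's standard resource module on a Unix-like system. The names snapshot_limits and report are hypothetical. Note that getrusage reports figures such as CPU time and peak memory, not the per-user process count that RLIMIT_NPROC is checked against, so the comparison remains partly guesswork:

```python
import resource

# Every RLIMIT_* constant this platform's resource module exposes.
LIMITS = {
    name: getattr(resource, name)
    for name in dir(resource)
    if name.startswith("RLIMIT_")
}


def format_limit(value: int) -> str:
    """Render an rlimit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def snapshot_limits() -> dict:
    """Return {limit name: (soft, hard)} for the current process."""
    result = {}
    for name, res in sorted(LIMITS.items()):
        try:
            result[name] = resource.getrlimit(res)
        except (ValueError, OSError):
            continue  # limit not supported by the running kernel
    return result


def report() -> None:
    """Print the current limits next to what getrusage can tell us."""
    for name, (soft, hard) in snapshot_limits().items():
        print(f"{name:20} soft={format_limit(soft)} hard={format_limit(hard)}")
    usage = resource.getrusage(resource.RUSAGE_SELF)
    print(f"cpu user={usage.ru_utime:.3f}s sys={usage.ru_stime:.3f}s "
          f"maxrss={usage.ru_maxrss}")
```

Calling report() from the failing program just after the error is the closest user-space approximation to asking which limit was involved.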

--
________________________________________________________________
J.W. Schultz Pegasystems Technologies
email address: [email protected]

Remember Cernan and Schmitt

2002-10-16 08:34:03

by Eric W. Biederman


Subject: Re: fork() wait semantics

Eduardo Pérez <[email protected]> writes:

> On 2002-10-15 20:07:43 +0200, Marius Gedminas wrote:
> > On Tue, Oct 15, 2002 at 04:58:44PM +0000, Eduardo Pérez wrote:
> > > As an example consider bash. In case of fork() error the program
> > > isn't even run thus causing a fatal error. If fork() waited for
> > > resources to be available there wouldn't be any problem.
> >
> > No, thank you. This happened to me more than once (runaway fetchmail
> > plugins). An error message about a failing fork() indicates
> > immediately that I have too many processes, and I can kill them
> > (thankfully kill is a bash builtin). If bash just waited silently I
> > wouldn't know what to think.
>
> But you are talking about buggy software.
> If your software has bugs, don't expect it to work properly.
>
> These fork() semantics are for non-buggy software.

Well, that clinches it: since there is no non-buggy software, we
definitely don't want that behavior.

Eric

2002-10-18 20:04:41

by Pavel Machek


Subject: Re: Better fork() (and possibly others) failure diagnostics

Hi!

> > Several times I had real problems with batch jobs failing with EAGAIN,
> > printed as "Resource temporarily unavailable". Not with the failure, but to
> > determine the real cause is really a pain. Usually, the problem is in
> > resource limits (rlimit, set by ulimit), but the returned error code is
> > misleading.
> >
> > There are two ways. One is to print something to syslog, when some rlimit
> > is reached. This is already done when limit of open files in system is
> > reached.
> >
> > The second is more subtle - define error code for reaching the rlimit
> > (possibly one errorcode for each rlimit) and slightly change the code to
> > return correct error code.
> >
> > What do you think about this subject?
>
> Bad idea at this time. In 1980 it might have been ok.

I believe it is still a good idea.

> Take a look at the manpages. It is very clear there that
> EAGAIN has two meanings: try again because what you request
> isn't available yet, and request exceeds resource limits (at
> the moment). Basically POSIX and SUS direct that EAGAIN is
> the correct error code for resource limit exceedance.
>
> I agree it would be nice if rlimit caused its own error code
> but such a change at this time would break far too many things.

What would break?

> Your alternative of klogging an error is not appropriate
> either. Hitting an rlimit is not a system, but a user
> error. There is nothing for the admin to do, the message

If it is a user error, then it is okay if the system returns something other
than EAGAIN. We could, for example, return EUSERERROR_LIMITREACHED. But
I do not really think it is a user error.
Pavel
--
I'm [email protected]. "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at [email protected]