Hi,
while debugging an unexpected ENOMEM from fork (there was no memory
pressure and OVERCOMMIT_ALWAYS) I have found out that fork returns
ENOMEM even when not short on memory.
In this particular case it was due to depleted pid space which is
documented to return EAGAIN in man pages.
Here is a quick fix up. I am sending that as a RFC because I am not
entirely sure about interaction with pid namespace which might cause
fork to fail. pid_ns_prepare_proc might return few error codes which are
not documented by man page but I am not sure whether that is actually
going to happen on a properly configured system.
There doesn't seem to be a good error code for an already "closed"
namespace (PIDNS_HASH_ADDING disabled) as well. I've chosen EBUSY but
that might be completely wrong. I am also wondering how would this
error case happen in the first place because parent should still exist
while fork is going on so the pid namespace shouldn't go away but I
smell this might have something to do with setns or something similar.
Any feedback would be appreciated.
---
>From 2a576175345fffa04da19cd51a25b7d94187b5f9 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Fri, 30 Jan 2015 14:50:15 +0100
Subject: [PATCH] fork: report pid reservation failure properly
copy_process will report any failure in alloc_pid as ENOMEM currently
which is misleading because the pid allocation might fail not only when
the memory is short but also when the pid space is consumed already.
The current man page even mentions this case:
"
EAGAIN
A system-imposed limit on the number of threads was encountered.
There are a number of limits that may trigger this error: the
RLIMIT_NPROC soft resource limit (set via setrlimit(2)), which
limits the number of processes and threads for a real user ID, was
reached; the kernel's system-wide limit on the number of processes
and threads, /proc/sys/kernel/threads-max, was reached (see
proc(5)); or the maximum number of PIDs, /proc/sys/kernel/pid_max,
was reached (see proc(5)).
"
so the current behavior is also incorrect wrt. documentation. This patch
simply propagates error code from alloc_pid and makes sure we return
-EAGAIN due to reservation failure.
There is one side effect of the change which might be unexpected.
alloc_pid might fail even when the repear in the pid namespace is dead
(the namespace basically disallows all new processes) and there is no
good error code which would match documented ones. I've taken EBUSY
because it felt like a best fit for the condition - namespace is busy
tearing down all the infrastructure.
Signed-off-by: Michal Hocko <[email protected]>
---
kernel/fork.c | 5 +++--
kernel/pid.c | 19 +++++++++++--------
2 files changed, 14 insertions(+), 10 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index 3ce8d57cff09..c37b88a162c5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1405,10 +1405,11 @@ static struct task_struct *copy_process(unsigned long clone_flags,
goto bad_fork_cleanup_io;
if (pid != &init_struct_pid) {
- retval = -ENOMEM;
pid = alloc_pid(p->nsproxy->pid_ns_for_children);
- if (!pid)
+ if (IS_ERR(pid)) {
+ retval = PTR_ERR(pid);
goto bad_fork_cleanup_io;
+ }
}
p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
diff --git a/kernel/pid.c b/kernel/pid.c
index 82430c858d69..c1a5875bd1ab 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -207,7 +207,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
}
pid = mk_pid(pid_ns, map, offset);
}
- return -1;
+ return -EAGAIN;
}
int next_pidmap(struct pid_namespace *pid_ns, unsigned int last)
@@ -298,17 +298,20 @@ struct pid *alloc_pid(struct pid_namespace *ns)
int i, nr;
struct pid_namespace *tmp;
struct upid *upid;
+ int retval;
pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
if (!pid)
- goto out;
+ return ERR_PTR(-ENOMEM);
tmp = ns;
pid->level = ns->level;
for (i = ns->level; i >= 0; i--) {
nr = alloc_pidmap(tmp);
- if (nr < 0)
+ if (IS_ERR_VALUE(nr)) {
+ retval = nr;
goto out_free;
+ }
pid->numbers[i].nr = nr;
pid->numbers[i].ns = tmp;
@@ -316,7 +319,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
}
if (unlikely(is_child_reaper(pid))) {
- if (pid_ns_prepare_proc(ns))
+ if ((retval = pid_ns_prepare_proc(ns)))
goto out_free;
}
@@ -327,8 +330,10 @@ struct pid *alloc_pid(struct pid_namespace *ns)
upid = pid->numbers + ns->level;
spin_lock_irq(&pidmap_lock);
- if (!(ns->nr_hashed & PIDNS_HASH_ADDING))
+ if (!(ns->nr_hashed & PIDNS_HASH_ADDING)) {
+ retval = -EBUSY;
goto out_unlock;
+ }
for ( ; upid >= pid->numbers; --upid) {
hlist_add_head_rcu(&upid->pid_chain,
&pid_hash[pid_hashfn(upid->nr, upid->ns)]);
@@ -336,7 +341,6 @@ struct pid *alloc_pid(struct pid_namespace *ns)
}
spin_unlock_irq(&pidmap_lock);
-out:
return pid;
out_unlock:
@@ -348,8 +352,7 @@ out_free:
free_pidmap(pid->numbers + i);
kmem_cache_free(ns->pid_cachep, pid);
- pid = NULL;
- goto out;
+ return ERR_PTR(retval);
}
void disable_pid_allocation(struct pid_namespace *ns)
--
2.1.4
--
Michal Hocko
SUSE Labs
Hi Michal,
On 3 February 2015 at 16:05, Michal Hocko <[email protected]> wrote:
> Hi,
> while debugging an unexpected ENOMEM from fork (there was no memory
> pressure and OVERCOMMIT_ALWAYS) I have found out that fork returns
> ENOMEM even when not short on memory.
>
> In this particular case it was due to depleted pid space which is
> documented to return EAGAIN in man pages.
>
> Here is a quick fix up.
Could you summarize briefly what the user-space visible change is
here? It is not so obvious from your message. I believe you're turning
some cases of ENOMEM into EAGAIN, right? Note, by the way, that if I
understandwhat you intend, this change would bring the implementation
closer to POSIX, which specifies:
EAGAIN The system lacked the necessary resources to create
another process, or the system-imposed limit on the total
number of processes under execution system-wide or by a
single user {CHILD_MAX} would be exceeded.
Thanks,
Michael
> I am sending that as a RFC because I am not
> entirely sure about interaction with pid namespace which might cause
> fork to fail. pid_ns_prepare_proc might return few error codes which are
> not documented by man page but I am not sure whether that is actually
> going to happen on a properly configured system.
>
> There doesn't seem to be a good error code for an already "closed"
> namespace (PIDNS_HASH_ADDING disabled) as well. I've chosen EBUSY but
> that might be completely wrong. I am also wondering how would this
> error case happen in the first place because parent should still exist
> while fork is going on so the pid namespace shouldn't go away but I
> smell this might have something to do with setns or something similar.
>
> Any feedback would be appreciated.
> ---
> From 2a576175345fffa04da19cd51a25b7d94187b5f9 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Fri, 30 Jan 2015 14:50:15 +0100
> Subject: [PATCH] fork: report pid reservation failure properly
>
> copy_process will report any failure in alloc_pid as ENOMEM currently
> which is misleading because the pid allocation might fail not only when
> the memory is short but also when the pid space is consumed already.
>
> The current man page even mentions this case:
> "
> EAGAIN
>
> A system-imposed limit on the number of threads was encountered.
> There are a number of limits that may trigger this error: the
> RLIMIT_NPROC soft resource limit (set via setrlimit(2)), which
> limits the number of processes and threads for a real user ID, was
> reached; the kernel's system-wide limit on the number of processes
> and threads, /proc/sys/kernel/threads-max, was reached (see
> proc(5)); or the maximum number of PIDs, /proc/sys/kernel/pid_max,
> was reached (see proc(5)).
> "
>
> so the current behavior is also incorrect wrt. documentation. This patch
> simply propagates error code from alloc_pid and makes sure we return
> -EAGAIN due to reservation failure.
>
> There is one side effect of the change which might be unexpected.
> alloc_pid might fail even when the repear in the pid namespace is dead
> (the namespace basically disallows all new processes) and there is no
> good error code which would match documented ones. I've taken EBUSY
> because it felt like a best fit for the condition - namespace is busy
> tearing down all the infrastructure.
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> kernel/fork.c | 5 +++--
> kernel/pid.c | 19 +++++++++++--------
> 2 files changed, 14 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 3ce8d57cff09..c37b88a162c5 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1405,10 +1405,11 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> goto bad_fork_cleanup_io;
>
> if (pid != &init_struct_pid) {
> - retval = -ENOMEM;
> pid = alloc_pid(p->nsproxy->pid_ns_for_children);
> - if (!pid)
> + if (IS_ERR(pid)) {
> + retval = PTR_ERR(pid);
> goto bad_fork_cleanup_io;
> + }
> }
>
> p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 82430c858d69..c1a5875bd1ab 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -207,7 +207,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
> }
> pid = mk_pid(pid_ns, map, offset);
> }
> - return -1;
> + return -EAGAIN;
> }
>
> int next_pidmap(struct pid_namespace *pid_ns, unsigned int last)
> @@ -298,17 +298,20 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> int i, nr;
> struct pid_namespace *tmp;
> struct upid *upid;
> + int retval;
>
> pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
> if (!pid)
> - goto out;
> + return ERR_PTR(-ENOMEM);
>
> tmp = ns;
> pid->level = ns->level;
> for (i = ns->level; i >= 0; i--) {
> nr = alloc_pidmap(tmp);
> - if (nr < 0)
> + if (IS_ERR_VALUE(nr)) {
> + retval = nr;
> goto out_free;
> + }
>
> pid->numbers[i].nr = nr;
> pid->numbers[i].ns = tmp;
> @@ -316,7 +319,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> }
>
> if (unlikely(is_child_reaper(pid))) {
> - if (pid_ns_prepare_proc(ns))
> + if ((retval = pid_ns_prepare_proc(ns)))
> goto out_free;
> }
>
> @@ -327,8 +330,10 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>
> upid = pid->numbers + ns->level;
> spin_lock_irq(&pidmap_lock);
> - if (!(ns->nr_hashed & PIDNS_HASH_ADDING))
> + if (!(ns->nr_hashed & PIDNS_HASH_ADDING)) {
> + retval = -EBUSY;
> goto out_unlock;
> + }
> for ( ; upid >= pid->numbers; --upid) {
> hlist_add_head_rcu(&upid->pid_chain,
> &pid_hash[pid_hashfn(upid->nr, upid->ns)]);
> @@ -336,7 +341,6 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> }
> spin_unlock_irq(&pidmap_lock);
>
> -out:
> return pid;
>
> out_unlock:
> @@ -348,8 +352,7 @@ out_free:
> free_pidmap(pid->numbers + i);
>
> kmem_cache_free(ns->pid_cachep, pid);
> - pid = NULL;
> - goto out;
> + return ERR_PTR(retval);
> }
>
> void disable_pid_allocation(struct pid_namespace *ns)
> --
> 2.1.4
>
> --
> Michal Hocko
> SUSE Labs
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
On Tue 03-02-15 16:33:03, Michael Kerrisk wrote:
> Hi Michal,
>
>
> On 3 February 2015 at 16:05, Michal Hocko <[email protected]> wrote:
> > Hi,
> > while debugging an unexpected ENOMEM from fork (there was no memory
> > pressure and OVERCOMMIT_ALWAYS) I have found out that fork returns
> > ENOMEM even when not short on memory.
> >
> > In this particular case it was due to depleted pid space which is
> > documented to return EAGAIN in man pages.
> >
> > Here is a quick fix up.
>
> Could you summarize briefly what the user-space visible change is
> here?
The user visible change is that the userspace will get EAGAIN when
calling fork and the pid space is depleted because of a system wide
limit as per man page description rather than ENOMEM which we return
currently.
> It is not so obvious from your message. I believe you're turning
> some cases of ENOMEM into EAGAIN, right?
Yes, except for the case mentioned below which discusses a potential
error code for pid namespace triggered failures.
> Note, by the way, that if I understandwhat you intend, this change
> would bring the implementation closer to POSIX, which specifies:
True.
HTH.
> EAGAIN The system lacked the necessary resources to create
> another process, or the system-imposed limit on the total
> number of processes under execution system-wide or by a
> single user {CHILD_MAX} would be exceeded.
>
> Thanks,
>
> Michael
>
> > I am sending that as a RFC because I am not
> > entirely sure about interaction with pid namespace which might cause
> > fork to fail. pid_ns_prepare_proc might return few error codes which are
> > not documented by man page but I am not sure whether that is actually
> > going to happen on a properly configured system.
> >
> > There doesn't seem to be a good error code for an already "closed"
> > namespace (PIDNS_HASH_ADDING disabled) as well. I've chosen EBUSY but
> > that might be completely wrong. I am also wondering how would this
> > error case happen in the first place because parent should still exist
> > while fork is going on so the pid namespace shouldn't go away but I
> > smell this might have something to do with setns or something similar.
> >
> > Any feedback would be appreciated.
> > ---
> > From 2a576175345fffa04da19cd51a25b7d94187b5f9 Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <[email protected]>
> > Date: Fri, 30 Jan 2015 14:50:15 +0100
> > Subject: [PATCH] fork: report pid reservation failure properly
> >
> > copy_process will report any failure in alloc_pid as ENOMEM currently
> > which is misleading because the pid allocation might fail not only when
> > the memory is short but also when the pid space is consumed already.
> >
> > The current man page even mentions this case:
> > "
> > EAGAIN
> >
> > A system-imposed limit on the number of threads was encountered.
> > There are a number of limits that may trigger this error: the
> > RLIMIT_NPROC soft resource limit (set via setrlimit(2)), which
> > limits the number of processes and threads for a real user ID, was
> > reached; the kernel's system-wide limit on the number of processes
> > and threads, /proc/sys/kernel/threads-max, was reached (see
> > proc(5)); or the maximum number of PIDs, /proc/sys/kernel/pid_max,
> > was reached (see proc(5)).
> > "
> >
> > so the current behavior is also incorrect wrt. documentation. This patch
> > simply propagates error code from alloc_pid and makes sure we return
> > -EAGAIN due to reservation failure.
> >
> > There is one side effect of the change which might be unexpected.
> > alloc_pid might fail even when the repear in the pid namespace is dead
> > (the namespace basically disallows all new processes) and there is no
> > good error code which would match documented ones. I've taken EBUSY
> > because it felt like a best fit for the condition - namespace is busy
> > tearing down all the infrastructure.
> >
> > Signed-off-by: Michal Hocko <[email protected]>
> > ---
> > kernel/fork.c | 5 +++--
> > kernel/pid.c | 19 +++++++++++--------
> > 2 files changed, 14 insertions(+), 10 deletions(-)
> >
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 3ce8d57cff09..c37b88a162c5 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -1405,10 +1405,11 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> > goto bad_fork_cleanup_io;
> >
> > if (pid != &init_struct_pid) {
> > - retval = -ENOMEM;
> > pid = alloc_pid(p->nsproxy->pid_ns_for_children);
> > - if (!pid)
> > + if (IS_ERR(pid)) {
> > + retval = PTR_ERR(pid);
> > goto bad_fork_cleanup_io;
> > + }
> > }
> >
> > p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
> > diff --git a/kernel/pid.c b/kernel/pid.c
> > index 82430c858d69..c1a5875bd1ab 100644
> > --- a/kernel/pid.c
> > +++ b/kernel/pid.c
> > @@ -207,7 +207,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
> > }
> > pid = mk_pid(pid_ns, map, offset);
> > }
> > - return -1;
> > + return -EAGAIN;
> > }
> >
> > int next_pidmap(struct pid_namespace *pid_ns, unsigned int last)
> > @@ -298,17 +298,20 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> > int i, nr;
> > struct pid_namespace *tmp;
> > struct upid *upid;
> > + int retval;
> >
> > pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
> > if (!pid)
> > - goto out;
> > + return ERR_PTR(-ENOMEM);
> >
> > tmp = ns;
> > pid->level = ns->level;
> > for (i = ns->level; i >= 0; i--) {
> > nr = alloc_pidmap(tmp);
> > - if (nr < 0)
> > + if (IS_ERR_VALUE(nr)) {
> > + retval = nr;
> > goto out_free;
> > + }
> >
> > pid->numbers[i].nr = nr;
> > pid->numbers[i].ns = tmp;
> > @@ -316,7 +319,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> > }
> >
> > if (unlikely(is_child_reaper(pid))) {
> > - if (pid_ns_prepare_proc(ns))
> > + if ((retval = pid_ns_prepare_proc(ns)))
> > goto out_free;
> > }
> >
> > @@ -327,8 +330,10 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> >
> > upid = pid->numbers + ns->level;
> > spin_lock_irq(&pidmap_lock);
> > - if (!(ns->nr_hashed & PIDNS_HASH_ADDING))
> > + if (!(ns->nr_hashed & PIDNS_HASH_ADDING)) {
> > + retval = -EBUSY;
> > goto out_unlock;
> > + }
> > for ( ; upid >= pid->numbers; --upid) {
> > hlist_add_head_rcu(&upid->pid_chain,
> > &pid_hash[pid_hashfn(upid->nr, upid->ns)]);
> > @@ -336,7 +341,6 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> > }
> > spin_unlock_irq(&pidmap_lock);
> >
> > -out:
> > return pid;
> >
> > out_unlock:
> > @@ -348,8 +352,7 @@ out_free:
> > free_pidmap(pid->numbers + i);
> >
> > kmem_cache_free(ns->pid_cachep, pid);
> > - pid = NULL;
> > - goto out;
> > + return ERR_PTR(retval);
> > }
> >
> > void disable_pid_allocation(struct pid_namespace *ns)
> > --
> > 2.1.4
> >
> > --
> > Michal Hocko
> > SUSE Labs
>
>
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/
--
Michal Hocko
SUSE Labs
Michal Hocko <[email protected]> writes:
> On Tue 03-02-15 16:33:03, Michael Kerrisk wrote:
>> Hi Michal,
>>
>>
>> On 3 February 2015 at 16:05, Michal Hocko <[email protected]> wrote:
>> > Hi,
>> > while debugging an unexpected ENOMEM from fork (there was no memory
>> > pressure and OVERCOMMIT_ALWAYS) I have found out that fork returns
>> > ENOMEM even when not short on memory.
>> >
>> > In this particular case it was due to depleted pid space which is
>> > documented to return EAGAIN in man pages.
>> >
>> > Here is a quick fix up.
>>
>> Could you summarize briefly what the user-space visible change is
>> here?
>
> The user visible change is that the userspace will get EAGAIN when
> calling fork and the pid space is depleted because of a system wide
> limit as per man page description rather than ENOMEM which we return
> currently.
I don't think that EAGAIN is any better than ENOMEM,
nor do I know that it is safe to return EBUSY from fork. What nonsense
will applications do when they see an unexpected error code.
>> It is not so obvious from your message. I believe you're turning
>> some cases of ENOMEM into EAGAIN, right?
>
> Yes, except for the case mentioned below which discusses a potential
> error code for pid namespace triggered failures.
>
>> Note, by the way, that if I understandwhat you intend, this change
>> would bring the implementation closer to POSIX, which specifies:
>
> True.
>
> HTH.
>
>> EAGAIN The system lacked the necessary resources to create
>> another process, or the system-imposed limit on the total
>> number of processes under execution system-wide or by a
>> single user {CHILD_MAX} would be exceeded.
>>
Note. All of those documented errors documented to return EAGAIN
are the kind of errors that if you wait a while you can reasonably
expect fork to succeed later.
With respecting to dealing with errors from fork, fork
is a major pain. Fork only has only two return codes documented,
and fork is one of the most complicated system calls in the kernel with
the most failure modes of any system call I am familiar with. Mapping
a plethora of failure modes onto two error codes is always going to be
problematic from some view point.
EAGAIN is a bad idea in general because that means try again and if you
have hit a fixed limit trying again is wrong.
Frankly I think posix is probably borked to recommend EAGAIN instead of
ENOMEM.
Everyone in the world uses fork which makes is quite tricky to figure
out which assumptions on the return values of fork exist in the wild,
so it is not clear if it is safe to add new more descriptive return
messages.
With respect to the case where PIDNS_HASH_ADDING would cause fork to
fail, that only happens after init has exited in a pid namespace, so it
is very much a permanent failure, and there are no longer any processes
in the specific pid namespace nor will there ever be any more processes
in that pid namespace. EINVAL might actually makes sense. Of course
a sensible error code from fork does not seem to be allowed.
Of the two return codes that are allowed for fork, EAGAIN and ENOMEM
ENOMEM seems to be better as it is a more permanement failure.
I agree it is a little confusing, but I don't see anything that is other
than a little confusing.
Other than someone doing:
unshare(CLONE_NEWPID);
pid = fork();
waitpid(pid);
fork(); /* returns ENOMEM */
Was there any other real world issue that started this goal to fix fork?
I think there is a reasonable argument for digging into the fork return
code situation. Perhaps it is just a matter of returning exotic return
codes for the weird and strange cases like trying to create a pid in a
dead pid namespace.
But what we have works, and I don't know of anything bad that happens
except when people are developing new code they get confused.
Further we can't count on people to read their man pages because this
behavior of returning ENOMEM is documented in pid_namespaces(7). Which
makes me really thinking changing the code to match the manpage is more
likely to break code than to fix code.
Eric
On Tue 03-02-15 14:44:31, Eric W. Biederman wrote:
> Michal Hocko <[email protected]> writes:
>
> > On Tue 03-02-15 16:33:03, Michael Kerrisk wrote:
> >> Hi Michal,
> >>
> >>
> >> On 3 February 2015 at 16:05, Michal Hocko <[email protected]> wrote:
> >> > Hi,
> >> > while debugging an unexpected ENOMEM from fork (there was no memory
> >> > pressure and OVERCOMMIT_ALWAYS) I have found out that fork returns
> >> > ENOMEM even when not short on memory.
> >> >
> >> > In this particular case it was due to depleted pid space which is
> >> > documented to return EAGAIN in man pages.
> >> >
> >> > Here is a quick fix up.
> >>
> >> Could you summarize briefly what the user-space visible change is
> >> here?
> >
> > The user visible change is that the userspace will get EAGAIN when
> > calling fork and the pid space is depleted because of a system wide
> > limit as per man page description rather than ENOMEM which we return
> > currently.
>
> I don't think that EAGAIN is any better than ENOMEM,
Well, EAGAIN is what the documentation says (both POSIX and ours).
I do not want to get to language lawyering but we should either change
the documentation and reason why we differ from POSIX or change the
behavior. I find the later option more viable because ENOMEM is really
hard to debug when there is not a lack of memory.
> nor do I know that it is safe to return EBUSY from fork. What nonsense
> will applications do when they see an unexpected error code.
Agreed, any new error code might be confusing.
> >> It is not so obvious from your message. I believe you're turning
> >> some cases of ENOMEM into EAGAIN, right?
> >
> > Yes, except for the case mentioned below which discusses a potential
> > error code for pid namespace triggered failures.
> >
> >> Note, by the way, that if I understandwhat you intend, this change
> >> would bring the implementation closer to POSIX, which specifies:
> >
> > True.
> >
> > HTH.
> >
> >> EAGAIN The system lacked the necessary resources to create
> >> another process, or the system-imposed limit on the total
> >> number of processes under execution system-wide or by a
> >> single user {CHILD_MAX} would be exceeded.
> >>
>
> Note. All of those documented errors documented to return EAGAIN
> are the kind of errors that if you wait a while you can reasonably
> expect fork to succeed later.
>
> With respecting to dealing with errors from fork, fork
> is a major pain. Fork only has only two return codes documented,
> and fork is one of the most complicated system calls in the kernel with
> the most failure modes of any system call I am familiar with. Mapping
> a plethora of failure modes onto two error codes is always going to be
> problematic from some view point.
Agreed.
> EAGAIN is a bad idea in general because that means try again and if you
> have hit a fixed limit trying again is wrong.
I don't know what was the justification for EAGAIN but it was like that
for ages so I assume there is a code which relies on that.
> Frankly I think posix is probably borked to recommend EAGAIN instead of
> ENOMEM.
>
> Everyone in the world uses fork which makes is quite tricky to figure
> out which assumptions on the return values of fork exist in the wild,
> so it is not clear if it is safe to add new more descriptive return
> messages.
>
> With respect to the case where PIDNS_HASH_ADDING would cause fork to
> fail, that only happens after init has exited in a pid namespace, so it
> is very much a permanent failure, and there are no longer any processes
> in the specific pid namespace nor will there ever be any more processes
> in that pid namespace. EINVAL might actually makes sense. Of course
> a sensible error code from fork does not seem to be allowed.
>
> Of the two return codes that are allowed for fork, EAGAIN and ENOMEM
> ENOMEM seems to be better as it is a more permanement failure.
Yes, if a new error code is really a nogo I would go for ENOMEM.
> I agree it is a little confusing, but I don't see anything that is other
> than a little confusing.
>
> Other than someone doing:
>
> unshare(CLONE_NEWPID);
> pid = fork();
> waitpid(pid);
> fork(); /* returns ENOMEM */
Ohh, I got it finally. unshare itself doesn't move the calling process
into a new namsepace.
> Was there any other real world issue that started this goal to fix fork?
I haven't seen any real world issue wrt. PIDNS_HASH_ADDING. It came out
from the error code propagation.
But I have seen ENOMEM when pid space was depleted and that confused
people and they reported it as a bug. Exactly because there was a ton of
memory free and no overcommit.
> I think there is a reasonable argument for digging into the fork return
> code situation. Perhaps it is just a matter of returning exotic return
> codes for the weird and strange cases like trying to create a pid in a
> dead pid namespace.
>
> But what we have works, and I don't know of anything bad that happens
> except when people are developing new code they get confused.
>
> Further we can't count on people to read their man pages because this
> behavior of returning ENOMEM is documented in pid_namespaces(7). Which
> makes me really thinking changing the code to match the manpage is more
> likely to break code than to fix code.
OK, I will remove the pid namespace related error code propagation and
keep it returning ENOMEM but I think the EAGAIN part is still worth fixing.
Will post a new patch.
--
Michal Hocko
SUSE Labs
copy_process will report any failure in alloc_pid as ENOMEM currently
which is misleading because the pid allocation might fail not only when
the memory is short but also when the pid space is consumed already.
The current man page even mentions this case:
"
EAGAIN
A system-imposed limit on the number of threads was encountered.
There are a number of limits that may trigger this error: the
RLIMIT_NPROC soft resource limit (set via setrlimit(2)), which
limits the number of processes and threads for a real user ID, was
reached; the kernel's system-wide limit on the number of processes
and threads, /proc/sys/kernel/threads-max, was reached (see
proc(5)); or the maximum number of PIDs, /proc/sys/kernel/pid_max,
was reached (see proc(5)).
"
so the current behavior is also incorrect wrt. documentation. POSIX man
page also suggest returing EAGAIN when the process count limit is
reached.
This patch simply propagates error code from alloc_pid and makes sure we
return -EAGAIN due to reservation failure. This will make behavior of
fork closer to both our documentation and POSIX.
alloc_pid might alsoo fail when the reaper in the pid namespace is dead
(the namespace basically disallows all new processes) and there is no
good error code which would match documented ones. We have traditionally
returned ENOMEM for this case which is misleading as well but as per
Eric W. Biederman this behavior is documented in man pid_namespaces(7)
"
If the "init" process of a PID namespace terminates, the kernel
terminates all of the processes in the namespace via a SIGKILL signal.
This behavior reflects the fact that the "init" process is essential for
the correct operation of a PID namespace. In this case, a subsequent
fork(2) into this PID namespace will fail with the error ENOMEM; it is
not possible to create a new processes in a PID namespace whose "init"
process has terminated.
"
and introducing a new error code would be too risky so let's stick to
ENOMEM for this case.
Signed-off-by: Michal Hocko <[email protected]>
---
kernel/fork.c | 5 +++--
kernel/pid.c | 15 ++++++++-------
2 files changed, 11 insertions(+), 9 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index 3ce8d57cff09..c37b88a162c5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1405,10 +1405,11 @@ static struct task_struct *copy_process(unsigned long clone_flags,
goto bad_fork_cleanup_io;
if (pid != &init_struct_pid) {
- retval = -ENOMEM;
pid = alloc_pid(p->nsproxy->pid_ns_for_children);
- if (!pid)
+ if (IS_ERR(pid)) {
+ retval = PTR_ERR(pid);
goto bad_fork_cleanup_io;
+ }
}
p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
diff --git a/kernel/pid.c b/kernel/pid.c
index 82430c858d69..bbb805ccd893 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -179,7 +179,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
spin_unlock_irq(&pidmap_lock);
kfree(page);
if (unlikely(!map->page))
- break;
+ return -ENOMEM;
}
if (likely(atomic_read(&map->nr_free))) {
for ( ; ; ) {
@@ -207,7 +207,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
}
pid = mk_pid(pid_ns, map, offset);
}
- return -1;
+ return -EAGAIN;
}
int next_pidmap(struct pid_namespace *pid_ns, unsigned int last)
@@ -298,17 +298,20 @@ struct pid *alloc_pid(struct pid_namespace *ns)
int i, nr;
struct pid_namespace *tmp;
struct upid *upid;
+ int retval = -ENOMEM;
pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
if (!pid)
- goto out;
+ return ERR_PTR(retval);
tmp = ns;
pid->level = ns->level;
for (i = ns->level; i >= 0; i--) {
nr = alloc_pidmap(tmp);
- if (nr < 0)
+ if (IS_ERR_VALUE(nr)) {
+ retval = nr;
goto out_free;
+ }
pid->numbers[i].nr = nr;
pid->numbers[i].ns = tmp;
@@ -336,7 +339,6 @@ struct pid *alloc_pid(struct pid_namespace *ns)
}
spin_unlock_irq(&pidmap_lock);
-out:
return pid;
out_unlock:
@@ -348,8 +350,7 @@ out_free:
free_pidmap(pid->numbers + i);
kmem_cache_free(ns->pid_cachep, pid);
- pid = NULL;
- goto out;
+ return ERR_PTR(retval);
}
void disable_pid_allocation(struct pid_namespace *ns)
--
2.1.4
ping on this one? Should I just resend (your way Andrew)? Or are there
any objections to the patch as is.
On Wed 04-02-15 11:43:48, Michal Hocko wrote:
> copy_process will report any failure in alloc_pid as ENOMEM currently
> which is misleading because the pid allocation might fail not only when
> the memory is short but also when the pid space is consumed already.
>
> The current man page even mentions this case:
> "
> EAGAIN
>
> A system-imposed limit on the number of threads was encountered.
> There are a number of limits that may trigger this error: the
> RLIMIT_NPROC soft resource limit (set via setrlimit(2)), which
> limits the number of processes and threads for a real user ID, was
> reached; the kernel's system-wide limit on the number of processes
> and threads, /proc/sys/kernel/threads-max, was reached (see
> proc(5)); or the maximum number of PIDs, /proc/sys/kernel/pid_max,
> was reached (see proc(5)).
> "
>
> so the current behavior is also incorrect wrt. documentation. POSIX man
> page also suggest returing EAGAIN when the process count limit is
> reached.
>
> This patch simply propagates error code from alloc_pid and makes sure we
> return -EAGAIN due to reservation failure. This will make behavior of
> fork closer to both our documentation and POSIX.
>
> alloc_pid might alsoo fail when the reaper in the pid namespace is dead
> (the namespace basically disallows all new processes) and there is no
> good error code which would match documented ones. We have traditionally
> returned ENOMEM for this case which is misleading as well but as per
> Eric W. Biederman this behavior is documented in man pid_namespaces(7)
> "
> If the "init" process of a PID namespace terminates, the kernel
> terminates all of the processes in the namespace via a SIGKILL signal.
> This behavior reflects the fact that the "init" process is essential for
> the correct operation of a PID namespace. In this case, a subsequent
> fork(2) into this PID namespace will fail with the error ENOMEM; it is
> not possible to create a new processes in a PID namespace whose "init"
> process has terminated.
> "
> and introducing a new error code would be too risky so let's stick to
> ENOMEM for this case.
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> kernel/fork.c | 5 +++--
> kernel/pid.c | 15 ++++++++-------
> 2 files changed, 11 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 3ce8d57cff09..c37b88a162c5 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1405,10 +1405,11 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> goto bad_fork_cleanup_io;
>
> if (pid != &init_struct_pid) {
> - retval = -ENOMEM;
> pid = alloc_pid(p->nsproxy->pid_ns_for_children);
> - if (!pid)
> + if (IS_ERR(pid)) {
> + retval = PTR_ERR(pid);
> goto bad_fork_cleanup_io;
> + }
> }
>
> p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 82430c858d69..bbb805ccd893 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -179,7 +179,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
> spin_unlock_irq(&pidmap_lock);
> kfree(page);
> if (unlikely(!map->page))
> - break;
> + return -ENOMEM;
> }
> if (likely(atomic_read(&map->nr_free))) {
> for ( ; ; ) {
> @@ -207,7 +207,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
> }
> pid = mk_pid(pid_ns, map, offset);
> }
> - return -1;
> + return -EAGAIN;
> }
>
> int next_pidmap(struct pid_namespace *pid_ns, unsigned int last)
> @@ -298,17 +298,20 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> int i, nr;
> struct pid_namespace *tmp;
> struct upid *upid;
> + int retval = -ENOMEM;
>
> pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
> if (!pid)
> - goto out;
> + return ERR_PTR(retval);
>
> tmp = ns;
> pid->level = ns->level;
> for (i = ns->level; i >= 0; i--) {
> nr = alloc_pidmap(tmp);
> - if (nr < 0)
> + if (IS_ERR_VALUE(nr)) {
> + retval = nr;
> goto out_free;
> + }
>
> pid->numbers[i].nr = nr;
> pid->numbers[i].ns = tmp;
> @@ -336,7 +339,6 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> }
> spin_unlock_irq(&pidmap_lock);
>
> -out:
> return pid;
>
> out_unlock:
> @@ -348,8 +350,7 @@ out_free:
> free_pidmap(pid->numbers + i);
>
> kmem_cache_free(ns->pid_cachep, pid);
> - pid = NULL;
> - goto out;
> + return ERR_PTR(retval);
> }
>
> void disable_pid_allocation(struct pid_namespace *ns)
> --
> 2.1.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Michal Hocko
SUSE Labs
On Mon, 23 Feb 2015 21:17:01 +0100 Michal Hocko <[email protected]> wrote:
> ping on this one? Should I just resend (your way Andrew)? Or are there
> any objections to the patch as is.
Were Eric's concerns all addressed?
Oleg: wake up ;)
Overall it looks like a pretty minor issue?
On Thu 26-02-15 14:49:08, Andrew Morton wrote:
> On Mon, 23 Feb 2015 21:17:01 +0100 Michal Hocko <[email protected]> wrote:
>
> > ping on this one? Should I just resend (your way Andrew)? Or are there
> > any objections to the patch as is.
>
> Were Eric's concerns all addressed?
I hope so. I didn't touch pid namespace parts and they return ENOMEM as
before.
> Oleg: wake up ;)
>
> Overall it looks like a pretty minor issue?
I believe so. It should help deubugging when pid space is exhausted
because getting ENOMEM under that situation is really awkward and the
real reason for failure is hard to find out.
--
Michal Hocko
SUSE Labs