Michal Koutný <[email protected]> wrote:
> The check is currently against the current->cred but since those are
> going to change and we want to check RLIMIT_NPROC condition after the
> switch, supply the capability check with the new cred.
> But since we're checking new_user being INIT_USER any new cred's
> capability-based allowance may be redundant when the check fails and the
> alternative solution would be revert of the commit 2863643fb8b9
> ("set_user: add capability check when rlimit(RLIMIT_NPROC) exceeds")
As of commit 2863643fb8b9 ("set_user: add capability check when
rlimit(RLIMIT_NPROC) exceeds") setting the flag to see if execve
should check RLIMIT_NPROC is buggy, as it tests the capabilites from
before the credential change and not aftwards.
As of commit 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of
ucounts") examining the rlimit is buggy as cred->ucounts has not yet
been properly set in the new credential.
Make the code correct and more robust moving the test to see if
execve() needs to test RLIMIT_NPROC into commit_creds, and defer all
of the rest of the logic into execve() itself.
As the flag only indicateds that RLIMIT_NPROC should be checked
in execve rename it from PF_NPROC_EXCEEDED to PF_NPROC_CHECK.
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Reported-by: Michal Koutný <[email protected]>
Fixes: 2863643fb8b9 ("set_user: add capability check when rlimit(RLIMIT_NPROC) exceeds")
Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/exec.c | 15 ++++++++-------
include/linux/sched.h | 2 +-
kernel/cred.c | 13 +++++++++----
kernel/fork.c | 2 +-
kernel/sys.c | 14 --------------
5 files changed, 19 insertions(+), 27 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 79f2c9483302..1e7f757cbc2c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1875,20 +1875,21 @@ static int do_execveat_common(int fd, struct filename *filename,
return PTR_ERR(filename);
/*
- * We move the actual failure in case of RLIMIT_NPROC excess from
- * set*uid() to execve() because too many poorly written programs
- * don't check setuid() return code. Here we additionally recheck
- * whether NPROC limit is still exceeded.
+ * After calling set*uid() is RLIMT_NPROC exceeded?
+ * This can not be checked in set*uid() because too many programs don't
+ * check the setuid() return code.
*/
- if ((current->flags & PF_NPROC_EXCEEDED) &&
- is_ucounts_overlimit(current_ucounts(), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) {
+ if ((current->flags & PF_NPROC_CHECK) &&
+ is_ucounts_overlimit(current_ucounts(), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)) &&
+ (current_user() != INIT_USER) &&
+ !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN)) {
retval = -EAGAIN;
goto out_ret;
}
/* We're below the limit (still or again), so we don't want to make
* further execve() calls fail. */
- current->flags &= ~PF_NPROC_EXCEEDED;
+ current->flags &= ~PF_NPROC_CHECK;
bprm = alloc_bprm(fd, filename);
if (IS_ERR(bprm)) {
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 75ba8aa60248..6605a262a6be 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1678,7 +1678,7 @@ extern struct pid *cad_pid;
#define PF_DUMPCORE 0x00000200 /* Dumped core */
#define PF_SIGNALED 0x00000400 /* Killed by a signal */
#define PF_MEMALLOC 0x00000800 /* Allocating memory */
-#define PF_NPROC_EXCEEDED 0x00001000 /* set_user() noticed that RLIMIT_NPROC was exceeded */
+#define PF_NPROC_CHECK 0x00001000 /* Check in execve if RLIMIT_NPROC was exceeded */
#define PF_USED_MATH 0x00002000 /* If unset the fpu must be initialized before use */
#define PF_NOFREEZE 0x00008000 /* This thread should not be frozen */
#define PF_FROZEN 0x00010000 /* Frozen for system suspend */
diff --git a/kernel/cred.c b/kernel/cred.c
index 933155c96922..229cff081167 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -490,13 +490,18 @@ int commit_creds(struct cred *new)
if (!gid_eq(new->fsgid, old->fsgid))
key_fsgid_changed(new);
- /* do it
- * RLIMIT_NPROC limits on user->processes have already been checked
- * in set_user().
+ /*
+ * Remember if the NPROC limit may be exceeded. The set*uid() functions
+ * can not fail if the NPROC limit is exceeded as too many programs
+ * don't check the return code. Instead enforce the NPROC limit for
+ * programs doing set*uid()+execve by harmlessly defering the failure
+ * to the execve() stage.
*/
alter_cred_subscribers(new, 2);
- if (new->user != old->user || new->user_ns != old->user_ns)
+ if (new->user != old->user || new->user_ns != old->user_ns) {
inc_rlimit_ucounts(new->ucounts, UCOUNT_RLIMIT_NPROC, 1);
+ task->flags |= PF_NPROC_CHECK;
+ }
rcu_assign_pointer(task->real_cred, new);
rcu_assign_pointer(task->cred, new);
if (new->user != old->user || new->user_ns != old->user_ns)
diff --git a/kernel/fork.c b/kernel/fork.c
index 17d8a8c85e3b..2b6a28a86325 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2031,7 +2031,7 @@ static __latent_entropy struct task_struct *copy_process(
!capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
goto bad_fork_cleanup_count;
}
- current->flags &= ~PF_NPROC_EXCEEDED;
+ current->flags &= ~PF_NPROC_CHECK;
/*
* If multiple threads are within copy_process(), then this check
diff --git a/kernel/sys.c b/kernel/sys.c
index ecc4cf019242..b1ed21d79f3b 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -472,20 +472,6 @@ static int set_user(struct cred *new)
if (!new_user)
return -EAGAIN;
- /*
- * We don't fail in case of NPROC limit excess here because too many
- * poorly written programs don't check set*uid() return code, assuming
- * it never fails if called by root. We may still enforce NPROC limit
- * for programs doing set*uid()+execve() by harmlessly deferring the
- * failure to the execve() stage.
- */
- if (is_ucounts_overlimit(new->ucounts, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)) &&
- new_user != INIT_USER &&
- !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
- current->flags |= PF_NPROC_EXCEEDED;
- else
- current->flags &= ~PF_NPROC_EXCEEDED;
-
free_uid(new->user);
new->user = new_user;
return 0;
--
2.29.2
On Thu, Feb 10, 2022 at 08:13:19PM -0600, Eric W. Biederman wrote:
> As of commit 2863643fb8b9 ("set_user: add capability check when
> rlimit(RLIMIT_NPROC) exceeds") setting the flag to see if execve
> should check RLIMIT_NPROC is buggy, as it tests the capabilites from
> before the credential change and not aftwards.
>
> As of commit 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of
> ucounts") examining the rlimit is buggy as cred->ucounts has not yet
> been properly set in the new credential.
>
> Make the code correct and more robust moving the test to see if
> execve() needs to test RLIMIT_NPROC into commit_creds, and defer all
> of the rest of the logic into execve() itself.
>
> As the flag only indicateds that RLIMIT_NPROC should be checked
> in execve rename it from PF_NPROC_EXCEEDED to PF_NPROC_CHECK.
>
> Cc: [email protected]
> Link: https://lkml.kernel.org/r/[email protected]
> Link: https://lkml.kernel.org/r/[email protected]
> Reported-by: Michal Koutn?? <[email protected]>
> Fixes: 2863643fb8b9 ("set_user: add capability check when rlimit(RLIMIT_NPROC) exceeds")
> Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
> Signed-off-by: "Eric W. Biederman" <[email protected]>
On one hand, this looks good.
On the other, you asked about the Apache httpd suexec scenario in the
other thread, and here's what this means for it (per my code review):
In that scenario, we have two execve(): first from httpd to suexec, then
from suexec to the CGI script. Previously, the limit check only
occurred on the setuid() call by suexec, and its effect was deferred
until execve() of the script. Now wouldn't it occur on both execve()
calls, because commit_creds() is also called on execve() (such as in
case the program is SUID, which suexec actually is)? Since the check is
kind of against real uid (not the euid=0 that suexec gains), it'd apply
the limit against httpd pseudo-user's process count. While it could be
a reasonable kernel policy to impose this limit in more places, this is
a change of behavior for Apache httpd, and is not the intended behavior
there. However, I think the answer to my question earlier in this
paragraph is actually a "no", the check wouldn't occur on the execve()
of suexec, because "new->user != old->user" would be false. Right?
As an alternative, you could keep setting the (renamed and reused) flag
in set_user(). That would avoid the (non-)issue I described above - but
again, your patch is probably fine as-is.
I do see it's logical to have these two lines next to each other:
> inc_rlimit_ucounts(new->ucounts, UCOUNT_RLIMIT_NPROC, 1);
> + task->flags |= PF_NPROC_CHECK;
Of course, someone would need to actually test this.
Alexander
Solar Designer <[email protected]> writes:
> On Thu, Feb 10, 2022 at 08:13:19PM -0600, Eric W. Biederman wrote:
>> As of commit 2863643fb8b9 ("set_user: add capability check when
>> rlimit(RLIMIT_NPROC) exceeds") setting the flag to see if execve
>> should check RLIMIT_NPROC is buggy, as it tests the capabilites from
>> before the credential change and not aftwards.
>>
>> As of commit 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of
>> ucounts") examining the rlimit is buggy as cred->ucounts has not yet
>> been properly set in the new credential.
>>
>> Make the code correct and more robust moving the test to see if
>> execve() needs to test RLIMIT_NPROC into commit_creds, and defer all
>> of the rest of the logic into execve() itself.
>>
>> As the flag only indicateds that RLIMIT_NPROC should be checked
>> in execve rename it from PF_NPROC_EXCEEDED to PF_NPROC_CHECK.
>>
>> Cc: [email protected]
>> Link: https://lkml.kernel.org/r/[email protected]
>> Link: https://lkml.kernel.org/r/[email protected]
>> Reported-by: Michal Koutn?? <[email protected]>
>> Fixes: 2863643fb8b9 ("set_user: add capability check when rlimit(RLIMIT_NPROC) exceeds")
>> Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>
> On one hand, this looks good.
>
> On the other, you asked about the Apache httpd suexec scenario in the
> other thread, and here's what this means for it (per my code review):
>
> In that scenario, we have two execve(): first from httpd to suexec, then
> from suexec to the CGI script. Previously, the limit check only
> occurred on the setuid() call by suexec, and its effect was deferred
> until execve() of the script. Now wouldn't it occur on both execve()
> calls, because commit_creds() is also called on execve() (such as in
> case the program is SUID, which suexec actually is)?
Yes. Moving the check into commit_creds means that the exec after a
suid exec will perform an RLIMIT_NPROC check and could possibly fail. I
would call that a bug. Anything happening in execve should be checked
and handled in execve as execve can fail.
It also points out that our permission checks for increasing
RLIMIT_NPROC are highly inconsistent.
One set of permissions in fork().
Another set of permissions in set*id() and delayed until execve.
No permission checks for the uid change in execve.
Every time I look into the previous behavior of RLIMIT_NPROC I seem
to find issues. Currently I am planning a posting to linux-api
so sorting out what when RLIMIT_NPROC should be enforced and how
RLIMIT_NPROC gets accounted receives review. I am also planning a
feature branch to deal with the historical goofiness.
I really like how cleanly this patch seems to be. Unfortunately it is
wrong.
> Since the check is
> kind of against real uid (not the euid=0 that suexec gains), it'd apply
> the limit against httpd pseudo-user's process count. While it could be
> a reasonable kernel policy to impose this limit in more places, this is
> a change of behavior for Apache httpd, and is not the intended behavior
> there. However, I think the answer to my question earlier in this
> paragraph is actually a "no", the check wouldn't occur on the execve()
> of suexec, because "new->user != old->user" would be false. Right?
>
> As an alternative, you could keep setting the (renamed and reused) flag
> in set_user(). That would avoid the (non-)issue I described above - but
> again, your patch is probably fine as-is.
>
> I do see it's logical to have these two lines next to each other:
>
>> inc_rlimit_ucounts(new->ucounts, UCOUNT_RLIMIT_NPROC, 1);
>> + task->flags |= PF_NPROC_CHECK;
>
> Of course, someone would need to actually test this.
That too.
I am increasingly of the opinion that the process accounting should not
be in cred.c at all. That we just remember the who was charged with the
process when we created it, and then at exec time we can update that
charge, and verify that the new user is solid. At exit time we can look
up who was charged with the process and decrement the count.
Of course at this point my opinion may change after I implement that.
Eric
"Eric W. Biederman" <[email protected]> writes:
> Solar Designer <[email protected]> writes:
>
>> On Thu, Feb 10, 2022 at 08:13:19PM -0600, Eric W. Biederman wrote:
>>> As of commit 2863643fb8b9 ("set_user: add capability check when
>>> rlimit(RLIMIT_NPROC) exceeds") setting the flag to see if execve
>>> should check RLIMIT_NPROC is buggy, as it tests the capabilites from
>>> before the credential change and not aftwards.
>>>
>>> As of commit 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of
>>> ucounts") examining the rlimit is buggy as cred->ucounts has not yet
>>> been properly set in the new credential.
>>>
>>> Make the code correct and more robust moving the test to see if
>>> execve() needs to test RLIMIT_NPROC into commit_creds, and defer all
>>> of the rest of the logic into execve() itself.
>>>
>>> As the flag only indicateds that RLIMIT_NPROC should be checked
>>> in execve rename it from PF_NPROC_EXCEEDED to PF_NPROC_CHECK.
>>>
>>> Cc: [email protected]
>>> Link: https://lkml.kernel.org/r/[email protected]
>>> Link: https://lkml.kernel.org/r/[email protected]
>>> Reported-by: Michal Koutn?? <[email protected]>
>>> Fixes: 2863643fb8b9 ("set_user: add capability check when rlimit(RLIMIT_NPROC) exceeds")
>>> Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
>>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>>
>> On one hand, this looks good.
>>
>> On the other, you asked about the Apache httpd suexec scenario in the
>> other thread, and here's what this means for it (per my code review):
>>
>> In that scenario, we have two execve(): first from httpd to suexec, then
>> from suexec to the CGI script. Previously, the limit check only
>> occurred on the setuid() call by suexec, and its effect was deferred
>> until execve() of the script. Now wouldn't it occur on both execve()
>> calls, because commit_creds() is also called on execve() (such as in
>> case the program is SUID, which suexec actually is)?
>
> Yes. Moving the check into commit_creds means that the exec after a
> suid exec will perform an RLIMIT_NPROC check and could possibly fail. I
> would call that a bug. Anything happening in execve should be checked
> and handled in execve as execve can fail.
>
> It also points out that our permission checks for increasing
> RLIMIT_NPROC are highly inconsistent.
>
> One set of permissions in fork().
> Another set of permissions in set*id() and delayed until execve.
> No permission checks for the uid change in execve.
>
> Every time I look into the previous behavior of RLIMIT_NPROC I seem
> to find issues. Currently I am planning a posting to linux-api
> so sorting out what when RLIMIT_NPROC should be enforced and how
> RLIMIT_NPROC gets accounted receives review. I am also planning a
> feature branch to deal with the historical goofiness.
>
> I really like how cleanly this patch seems to be. Unfortunately it is
> wrong.
Hmm. Maybe not as wrong as I thought. An suid execve does not change
the real user.
Still a bit wrong from a conservative change point of view because the
user namespace can change in setns and CLONE_NEWUSER which will change
the accounting now. Which with the ucount rlimit stuff changes where
things should be accounted.
I am playing with the idea of changing accounting aka (cred->ucounts &
cred->user) to only change in fork (aka clone without CLONE_THREAD) and
exec. I think that would make maintenance and cleaning all of this up
easier.
That would also remove serious complications from RLIMIT_SIGPENDING as
well.
I thought SIGPENDING was only a multi-threaded process issue but from
one signal to the next the set*id() family functions can be called.
Eric
On Mon, Feb 14, 2022 at 09:10:49AM -0600, "Eric W. Biederman" <[email protected]> wrote:
> I really like how cleanly this patch seems to be. Unfortunately it is
> wrong.
It seems [1] so:
setuid() // RLIMIT_NPROC is fine at this moment
... fork()
...
... fork()
execve() // eh, oh
This "punishes" the exec'ing task although the cause is elsewhere.
Michal
[1] The decoupled setuid()+execve() check can be interpretted both ways.
I understood historically the excess at the setuid() moment is relevant.
Michal Koutný <[email protected]> writes:
> On Mon, Feb 14, 2022 at 09:10:49AM -0600, "Eric W. Biederman" <[email protected]> wrote:
>> I really like how cleanly this patch seems to be. Unfortunately it is
>> wrong.
>
> It seems [1] so:
>
> setuid() // RLIMIT_NPROC is fine at this moment
> ... fork()
> ...
> ... fork()
> execve() // eh, oh
>
> This "punishes" the exec'ing task although the cause is elsewhere.
>
> Michal
>
> [1] The decoupled setuid()+execve() check can be interpretted both ways.
> I understood historically the excess at the setuid() moment is
> relevant.
I have been digging into this to understand why we are doing the strange
things we are doing.
Ordinarily for rlimits we are fine with letting things go over limit
until we reach a case where we need the limit (which would be fork in
the RLIMIT_NPROC case). So things like setrlimit do not check your
counts to see if you will be over the limit.
The practical problem with fork in the unix model is that you can not
change limits or do anything for the new process until it is created
(with clone/fork). Making it impossible to set the rlimits and change
the user before the new process is created.
The result is that in applications like apache when they run cgi scripts
(as a different user than the apache process) RLIMIT_NPROC did not work
until a check was placed into set*id() as well. As the typical cgi
script did not fork it just did it's work and exited.
That it was discovered that allowing set*id() to fail was a footgun for
privileged processes. And we have the existing system.
Which leads me to the starting point that set*id() checking rlimits is
a necessary but fundamentally a special case.
As long as the original use case works I think there is some latitude in
the implementation. Maybe we set a flag and perform all of the checks
in exec. Maybe we just send SIGKILL. Maybe we just say it is an ugly
wart but it is our ugly wart and comment it and leave it alone.
I am leaving that decision to a clean-up patchset.