LinuxLists.cc - [RFC PATCH 0/6] RLIMIT

2022-02-08 06:53:08

Subject: [RFC PATCH 0/6] RLIMIT_NPROC in ucounts fixups

This series is a result of looking deeper into breakage of
tools/testing/selftests/rlimits/rlimits-per-userns.c after
https://lore.kernel.org/r/[email protected]/
is applied.

The description of the original problem that led to the RLIMIT_NPROC et al.
ucounts rewrite could be ambiguously interpreted as supporting either
of these cases:
- a never-fork service, or
- a service that forks (RLIMIT_NPROC-1) times.

The scenario is odd anyway, given the existence of the pids controller.

The realization of that scenario relies not only on tracking the number of
processes per user_ns but also newly allows root to override the limit through
set*uid. The commit message didn't mention that, so it's unclear whether it
was intended too.

I also noticed that the RLIMIT_NPROC enforcement in fork seems subject to a
TOCTOU race (check(nr_tasks),...,nr_tasks++), so the limit is rather advisory
(but that's not a new thing related to the ucounts rewrite).
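
The check-then-increment pattern named above can be sketched in userspace as
follows. This is a minimal model, not kernel code: struct nproc_counter and
all function names are hypothetical. It replays the bad interleaving of two
forking tasks in a single thread, and contrasts it with a charge-first
variant that rolls back on overshoot.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct nproc_counter {
	atomic_long nr_tasks;
	long limit;
};

/* Step 1 of the racy pattern: only look at the counter. */
static bool racy_check(struct nproc_counter *c)
{
	return atomic_load(&c->nr_tasks) < c->limit;
}

/* Step 2 of the racy pattern: charge unconditionally after the check. */
static void racy_charge(struct nproc_counter *c)
{
	atomic_fetch_add(&c->nr_tasks, 1);
}

/* Race-free alternative: charge first, roll back if that overshot. */
static bool charge_then_check(struct nproc_counter *c)
{
	long prev = atomic_fetch_add(&c->nr_tasks, 1);

	if (prev >= c->limit) {
		atomic_fetch_sub(&c->nr_tasks, 1);
		return false;
	}
	return true;
}

/* Replay the bad interleaving of two forking tasks in a single thread. */
static long replay_racy_interleaving(long limit, long start)
{
	struct nproc_counter c = { .limit = limit };
	bool a_ok, b_ok;

	atomic_init(&c.nr_tasks, start);
	a_ok = racy_check(&c);	/* task A checks */
	b_ok = racy_check(&c);	/* task B checks before A charges */
	if (a_ok)
		racy_charge(&c);
	if (b_ok)
		racy_charge(&c);
	return atomic_load(&c.nr_tasks);
}
```

With a limit of 4 and 3 tasks already charged, both checks pass in the replay
and the counter ends above the limit, whereas charge_then_check() admits only
one of the two callers.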

This series is an RFC to discuss the relevance of the subtle changes that the
RLIMIT_NPROC to ucounts rewrite introduced.

Michal Koutný (6):
  set_user: Perform RLIMIT_NPROC capability check against new user
    credentials
  set*uid: Check RLIMIT_PROC against new credentials
  cred: Count tasks by their real uid into RLIMIT_NPROC
  ucounts: Allow root to override RLIMIT_NPROC
  selftests: Challenge RLIMIT_NPROC in user namespaces
  selftests: Test RLIMIT_NPROC in clone-created user namespaces

 fs/exec.c                                     |   2 +-
 include/linux/cred.h                          |   2 +-
 kernel/cred.c                                 |  29 ++-
 kernel/fork.c                                 |   2 +-
 kernel/sys.c                                  |  20 +-
 kernel/ucount.c                               |   3 +
 kernel/user_namespace.c                       |   2 +-
 .../selftests/rlimits/rlimits-per-userns.c    | 233 +++++++++++++++---
 8 files changed, 229 insertions(+), 64 deletions(-)

--
2.34.1

2022-02-08 14:14:10

by Michal Koutný

Subject: [RFC PATCH 5/6] selftests: Challenge RLIMIT_NPROC in user namespaces

The services are started in descendant user namespaces; each of them
should honor the RLIMIT_NPROC that's passed during user namespace
creation.

main [user_ns_0]
  ` service [user_ns_1]
    ` worker 1
    ` worker 2
    ...
    ` worker k
  ...
  ` service [user_ns_n]
    ` worker 1
    ` worker 2
    ...
    ` worker k

The test uses explicit synchronization to make sure the original parent's
limit does not interfere with the descendants.

Signed-off-by: Michal Koutný <[email protected]>
---
.../selftests/rlimits/rlimits-per-userns.c | 154 ++++++++++++++----
1 file changed, 125 insertions(+), 29 deletions(-)

diff --git a/tools/testing/selftests/rlimits/rlimits-per-userns.c b/tools/testing/selftests/rlimits/rlimits-per-userns.c
index 26dc949e93ea..54c1b345e42b 100644
--- a/tools/testing/selftests/rlimits/rlimits-per-userns.c
+++ b/tools/testing/selftests/rlimits/rlimits-per-userns.c
@@ -9,7 +9,9 @@
#include <sys/resource.h>
#include <sys/prctl.h>
#include <sys/stat.h>
+#include <sys/socket.h>

+#include <assert.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
@@ -21,38 +23,74 @@
#include <errno.h>
#include <err.h>

-#define NR_CHILDS 2
+#define THE_LIMIT 4
+#define NR_CHILDREN 5
+
+static_assert(NR_CHILDREN >= THE_LIMIT-1, "Need slots for limit-1 children.");

static char *service_prog;
static uid_t user = 60000;
static uid_t group = 60000;
+static struct rlimit saved_limit;
+
+/* Two uses: main and service */
+static pid_t child[NR_CHILDREN];
+static pid_t pid;

static void setrlimit_nproc(rlim_t n)
{
- pid_t pid = getpid();
struct rlimit limit = {
.rlim_cur = n,
.rlim_max = n
};
-
- warnx("(pid=%d): Setting RLIMIT_NPROC=%ld", pid, n);
+ if (getrlimit(RLIMIT_NPROC, &saved_limit) < 0)
+ err(EXIT_FAILURE, "(pid=%d): getrlimit(RLIMIT_NPROC)", pid);

if (setrlimit(RLIMIT_NPROC, &limit) < 0)
err(EXIT_FAILURE, "(pid=%d): setrlimit(RLIMIT_NPROC)", pid);
+
+ warnx("(pid=%d): Set RLIMIT_NPROC=%ld", pid, n);
+}
+
+static void restore_rlimit_nproc(void)
+{
+ if (setrlimit(RLIMIT_NPROC, &saved_limit) < 0)
+ err(EXIT_FAILURE, "(pid=%d): setrlimit(RLIMIT_NPROC, saved)", pid);
+ warnx("(pid=%d) Restored RLIMIT_NPROC", pid);
}

-static pid_t fork_child(void)
+enum msg_sync {
+ UNSHARE,
+ RLIMIT_RESTORE,
+};
+
+static void sync_notify(int fd, enum msg_sync m)
{
- pid_t pid = fork();
+ char tmp = m;
+
+ if (write(fd, &tmp, 1) < 0)
+ warnx("(pid=%d): failed sync-write", pid);
+}

- if (pid < 0)
+static void sync_wait(int fd, enum msg_sync m)
+{
+ char tmp;
+
+ if (read(fd, &tmp, 1) < 0)
+ warnx("(pid=%d): failed sync-read", pid);
+}
+
+static pid_t fork_child(int control_fd)
+{
+ pid_t new_pid = fork();
+
+ if (new_pid < 0)
err(EXIT_FAILURE, "fork");

- if (pid > 0)
- return pid;
+ if (new_pid > 0)
+ return new_pid;

pid = getpid();
-
warnx("(pid=%d): New process starting ...", pid);

if (prctl(PR_SET_PDEATHSIG, SIGKILL) < 0)
@@ -73,6 +111,9 @@ static pid_t fork_child(void)
if (unshare(CLONE_NEWUSER) < 0)
err(EXIT_FAILURE, "unshare(CLONE_NEWUSER)");

+ sync_notify(control_fd, UNSHARE);
+ sync_wait(control_fd, RLIMIT_RESTORE);
+
char *const argv[] = { "service", NULL };
char *const envp[] = { "I_AM_SERVICE=1", NULL };

@@ -82,37 +123,92 @@ static pid_t fork_child(void)
err(EXIT_FAILURE, "(pid=%d): execve", pid);
}

+static void run_service(void)
+{
+ size_t i;
+ int ret = EXIT_SUCCESS;
+ struct rlimit limit;
+ char user_ns[PATH_MAX];
+
+ if (getrlimit(RLIMIT_NPROC, &limit) < 0)
+ err(EXIT_FAILURE, "(pid=%d) failed getrlimit", pid);
+ if (readlink("/proc/self/ns/user", user_ns, PATH_MAX) < 0)
+ err(EXIT_FAILURE, "(pid=%d) failed readlink", pid);
+
+ warnx("(pid=%d) Service instance attempts %i children, limit %lu:%lu, ns=%s",
+ pid, THE_LIMIT, limit.rlim_cur, limit.rlim_max, user_ns);
+
+ /* test rlimit inside the service, effectively THE_LIMIT-1 becaue of service itself */
+ for (i = 0; i < THE_LIMIT; i++) {
+ child[i] = fork();
+ if (child[i] == 0) {
+ /* service child */
+ pause();
+ exit(EXIT_SUCCESS);
+ }
+ if (child[i] < 0) {
+ warnx("(pid=%d) service fork %lu failed, errno = %i", pid, i+1, errno);
+ if (!(i == THE_LIMIT-1 && errno == EAGAIN))
+ ret = EXIT_FAILURE;
+ } else if (i == THE_LIMIT-1) {
+ warnx("(pid=%d) RLIMIT_NPROC not honored", pid);
+ ret = EXIT_FAILURE;
+ }
+ }
+
+ /* service cleanup */
+ for (i = 0; i < THE_LIMIT; i++)
+ if (child[i] > 0)
+ kill(child[i], SIGUSR1);
+
+ for (i = 0; i < THE_LIMIT; i++)
+ if (child[i] > 0)
+ waitpid(child[i], NULL, WNOHANG);
+
+ if (ret)
+ exit(ret);
+ pause();
+}
+
int main(int argc, char **argv)
{
size_t i;
- pid_t child[NR_CHILDS];
- int wstatus[NR_CHILDS];
- int childs = NR_CHILDS;
- pid_t pid;
+ int control_fd[NR_CHILDREN];
+ int wstatus[NR_CHILDREN];
+ int children = NR_CHILDREN;
+ int sockets[2];
+
+ pid = getpid();

if (getenv("I_AM_SERVICE")) {
- pause();
- exit(EXIT_SUCCESS);
+ run_service();
+ exit(EXIT_FAILURE);
}

service_prog = argv[0];
- pid = getpid();

warnx("(pid=%d) Starting testcase", pid);

- /*
- * This rlimit is not a problem for root because it can be exceeded.
- */
- setrlimit_nproc(1);
-
- for (i = 0; i < NR_CHILDS; i++) {
- child[i] = fork_child();
+ setrlimit_nproc(THE_LIMIT);
+ for (i = 0; i < NR_CHILDREN; i++) {
+ if (socketpair(AF_UNIX, SOCK_DGRAM | SOCK_CLOEXEC, 0, sockets) < 0)
+ err(EXIT_FAILURE, "(pid=%d) socketpair failed", pid);
+ control_fd[i] = sockets[0];
+ child[i] = fork_child(sockets[1]);
wstatus[i] = 0;
+ }
+
+ for (i = 0; i < NR_CHILDREN; i++)
+ sync_wait(control_fd[i], UNSHARE);
+ restore_rlimit_nproc();
+
+ for (i = 0; i < NR_CHILDREN; i++) {
+ sync_notify(control_fd[i], RLIMIT_RESTORE);
usleep(250000);
}

while (1) {
- for (i = 0; i < NR_CHILDS; i++) {
+ for (i = 0; i < NR_CHILDREN; i++) {
if (child[i] <= 0)
continue;

@@ -126,22 +222,22 @@ int main(int argc, char **argv)
warn("(pid=%d): waitpid(%d)", pid, child[i]);

child[i] *= -1;
- childs -= 1;
+ children -= 1;
}

- if (!childs)
+ if (!children)
break;

usleep(250000);

- for (i = 0; i < NR_CHILDS; i++) {
+ for (i = 0; i < NR_CHILDREN; i++) {
if (child[i] <= 0)
continue;
kill(child[i], SIGUSR1);
}
}

- for (i = 0; i < NR_CHILDS; i++) {
+ for (i = 0; i < NR_CHILDREN; i++) {
if (WIFEXITED(wstatus[i]))
warnx("(pid=%d): pid %d exited, status=%d",
pid, -child[i], WEXITSTATUS(wstatus[i]));
--
2.34.1
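
The explicit synchronization that patch 5/6 describes can be sketched in
isolation as a two-step handshake over a socketpair. This is a reduced
sketch, not the selftest itself: run_handshake() is a hypothetical name, the
unshare(), setrlimit() and execve() steps are only marked by comments, and it
uses SOCK_STREAM rather than the patch's SOCK_DGRAM so that a peer that exits
early is seen as end-of-file. Unlike the patch's sync_wait(), this version
also verifies which message arrived.

```c
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum msg_sync { UNSHARE, RLIMIT_RESTORE };

/* Send one byte; returns 0 on success. */
static int sync_notify(int fd, enum msg_sync m)
{
	char tmp = (char)m;

	return write(fd, &tmp, 1) == 1 ? 0 : -1;
}

/* Block for one byte and verify it is the expected message. */
static int sync_wait(int fd, enum msg_sync expected)
{
	char tmp;

	if (read(fd, &tmp, 1) != 1)
		return -1;
	return tmp == (char)expected ? 0 : -1;
}

/* Returns 0 if both sides completed the handshake in order. */
static int run_handshake(void)
{
	int sockets[2], status;
	pid_t child;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sockets) < 0)
		return -1;

	child = fork();
	if (child < 0)
		return -1;
	if (child == 0) {
		close(sockets[0]);
		/* The real test calls unshare(CLONE_NEWUSER) here. */
		if (sync_notify(sockets[1], UNSHARE) < 0)
			_exit(1);
		if (sync_wait(sockets[1], RLIMIT_RESTORE) < 0)
			_exit(2);
		/* The real test execve()s the service here. */
		_exit(0);
	}

	close(sockets[1]);
	if (sync_wait(sockets[0], UNSHARE) < 0)
		return -1;
	/* The real test restores the parent's RLIMIT_NPROC here. */
	if (sync_notify(sockets[0], RLIMIT_RESTORE) < 0)
		return -1;
	if (waitpid(child, &status, 0) < 0)
		return -1;
	close(sockets[0]);
	return WIFEXITED(status) && WEXITSTATUS(status) == 0 ? 0 : -1;
}
```

The ordering is the point: the child does not proceed to the service until
the parent has seen UNSHARE and lifted its own limit.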

2022-02-09 06:36:10

by Michal Koutný

Subject: [RFC PATCH 6/6] selftests: Test RLIMIT_NPROC in clone-created user namespaces

Verify RLIMIT_NPROC observance in user namespaces also in the
clone(CLONE_NEWUSER) path.
Note that such a user_ns is created by the privileged user.

Signed-off-by: Michal Koutný <[email protected]>
---
.../selftests/rlimits/rlimits-per-userns.c | 141 +++++++++++++-----
1 file changed, 101 insertions(+), 40 deletions(-)

diff --git a/tools/testing/selftests/rlimits/rlimits-per-userns.c b/tools/testing/selftests/rlimits/rlimits-per-userns.c
index 54c1b345e42b..46f4cff36b30 100644
--- a/tools/testing/selftests/rlimits/rlimits-per-userns.c
+++ b/tools/testing/selftests/rlimits/rlimits-per-userns.c
@@ -1,6 +1,7 @@
// SPDX-License-Identifier: GPL-2.0-or-later
/*
* Author: Alexey Gladkov <[email protected]>
+ * Author: Michal Koutný <[email protected]>
*/
#define _GNU_SOURCE
#include <sys/types.h>
@@ -25,16 +26,25 @@

#define THE_LIMIT 4
#define NR_CHILDREN 5
+#define STACK_SIZE (2 * (1<<20))

-static_assert(NR_CHILDREN >= THE_LIMIT-1, "Need slots for limit-1 children.");
+static_assert(NR_CHILDREN >= THE_LIMIT-1, "Need slots for THE_LIMIT-1 children.");

-static char *service_prog;
static uid_t user = 60000;
static uid_t group = 60000;
static struct rlimit saved_limit;

-/* Two uses: main and service */
-static pid_t child[NR_CHILDREN];
+enum userns_mode {
+ UM_UNSHARE, /* setrlimit,clone(0),setuid,unshare,execve */
+ UM_CLONE_NEWUSER, /* setrlimit,clone(NEWUSER),setuid,execve */
+};
+static struct {
+ int control_fd;
+ char *pathname;
+ enum userns_mode mode;
+} child_args;
+
+/* Cache current pid */
static pid_t pid;

static void setrlimit_nproc(rlim_t n)
@@ -60,6 +70,7 @@ static void restore_rlimit_nproc(void)
}

enum msg_sync {
+ MAP_DEFINE,
UNSHARE,
RLIMIT_RESTORE,
};
@@ -80,15 +91,32 @@ static void sync_wait(int fd, enum msg_sync m)
warnx("(pid=%d): failed sync-read", pid);
}

-static pid_t fork_child(int control_fd)
+static int define_maps(pid_t child_pid)
{
- pid_t new_pid = fork();
+ FILE *f;
+ char filename[PATH_MAX];

- if (new_pid < 0)
- err(EXIT_FAILURE, "fork");
+ if (child_args.mode != UM_CLONE_NEWUSER)
+ return 0;
+
+ snprintf(filename, PATH_MAX, "/proc/%i/uid_map", child_pid);
+ f = fopen(filename, "w");
+ if (fprintf(f, "%i %i 1\n", user, user) < 0)
+ return -1;
+ fclose(f);
+
+ snprintf(filename, PATH_MAX, "/proc/%i/gid_map", child_pid);
+ f = fopen(filename, "w");
+ if (fprintf(f, "%i %i 1\n", group, group) < 0)
+ return -1;
+ fclose(f);
+
+ return 0;
+}

- if (new_pid > 0)
- return new_pid;
+static int setup_and_exec(void *arg)
+{
+ int control_fd = child_args.control_fd;

pid = getpid();
warnx("(pid=%d): New process starting ...", pid);
@@ -98,6 +126,7 @@ static pid_t fork_child(int control_fd)

signal(SIGUSR1, SIG_DFL);

+ sync_wait(control_fd, RLIMIT_RESTORE);
warnx("(pid=%d): Changing to uid=%d, gid=%d", pid, user, group);

if (setgid(group) < 0)
@@ -107,9 +136,11 @@ static pid_t fork_child(int control_fd)

warnx("(pid=%d): Service running ...", pid);

- warnx("(pid=%d): Unshare user namespace", pid);
- if (unshare(CLONE_NEWUSER) < 0)
- err(EXIT_FAILURE, "unshare(CLONE_NEWUSER)");
+ if (child_args.mode == UM_UNSHARE) {
+ warnx("(pid=%d): Unshare user namespace", pid);
+ if (unshare(CLONE_NEWUSER) < 0)
+ err(EXIT_FAILURE, "unshare(CLONE_NEWUSER)");
+ }

sync_notify(control_fd, UNSHARE);
sync_wait(control_fd, RLIMIT_RESTORE);
@@ -119,14 +150,30 @@ static pid_t fork_child(int control_fd)

warnx("(pid=%d): Executing real service ...", pid);

- execve(service_prog, argv, envp);
+ execve(child_args.pathname, argv, envp);
err(EXIT_FAILURE, "(pid=%d): execve", pid);
}

-static void run_service(void)
+static pid_t start_child(char *pathname, int control_fd)
+{
+ char *stack = malloc(STACK_SIZE);
+ int flags = child_args.mode == UM_CLONE_NEWUSER ? CLONE_NEWUSER : 0;
+ pid_t new_pid;
+
+ child_args.control_fd = control_fd;
+ child_args.pathname = pathname;
+
+ new_pid = clone(setup_and_exec, stack+STACK_SIZE-1, flags, NULL);
+ if (new_pid < 0)
+ err(EXIT_FAILURE, "clone");
+
+ free(stack);
+ close(control_fd);
+ return new_pid;
+}
+
+static void dump_context(size_t n_workers)
{
- size_t i;
- int ret = EXIT_SUCCESS;
struct rlimit limit;
char user_ns[PATH_MAX];

@@ -135,44 +182,55 @@ static void run_service(void)
if (readlink("/proc/self/ns/user", user_ns, PATH_MAX) < 0)
err(EXIT_FAILURE, "(pid=%d) failed readlink", pid);

- warnx("(pid=%d) Service instance attempts %i children, limit %lu:%lu, ns=%s",
- pid, THE_LIMIT, limit.rlim_cur, limit.rlim_max, user_ns);
+ warnx("(pid=%d) Service instance attempts %lu workers, limit %lu:%lu, ns=%s",
+ pid, n_workers, limit.rlim_cur, limit.rlim_max, user_ns);
+}
+
+static int run_service(void)
+{
+ size_t i, n_workers = THE_LIMIT;
+ pid_t worker[NR_CHILDREN];
+ int ret = EXIT_SUCCESS;

- /* test rlimit inside the service, effectively THE_LIMIT-1 becaue of service itself */
- for (i = 0; i < THE_LIMIT; i++) {
- child[i] = fork();
- if (child[i] == 0) {
- /* service child */
+ dump_context(n_workers);
+
+ /* test rlimit inside the service, last worker should fail because of service itself */
+ for (i = 0; i < n_workers; i++) {
+ worker[i] = fork();
+ if (worker[i] == 0) {
+ /* service worker */
pause();
exit(EXIT_SUCCESS);
}
- if (child[i] < 0) {
+ if (worker[i] < 0) {
warnx("(pid=%d) service fork %lu failed, errno = %i", pid, i+1, errno);
- if (!(i == THE_LIMIT-1 && errno == EAGAIN))
+ if (!(i == n_workers-1 && errno == EAGAIN))
ret = EXIT_FAILURE;
- } else if (i == THE_LIMIT-1) {
+ } else if (i == n_workers-1) {
warnx("(pid=%d) RLIMIT_NPROC not honored", pid);
ret = EXIT_FAILURE;
}
}

/* service cleanup */
- for (i = 0; i < THE_LIMIT; i++)
- if (child[i] > 0)
- kill(child[i], SIGUSR1);
+ for (i = 0; i < n_workers; i++)
+ if (worker[i] > 0)
+ kill(worker[i], SIGUSR1);

- for (i = 0; i < THE_LIMIT; i++)
- if (child[i] > 0)
- waitpid(child[i], NULL, WNOHANG);
+ for (i = 0; i < n_workers; i++)
+ if (worker[i] > 0)
+ waitpid(worker[i], NULL, WNOHANG);

if (ret)
- exit(ret);
+ return ret;
pause();
+ return EXIT_FAILURE;
}

int main(int argc, char **argv)
{
size_t i;
+ pid_t child[NR_CHILDREN];
int control_fd[NR_CHILDREN];
int wstatus[NR_CHILDREN];
int children = NR_CHILDREN;
@@ -180,12 +238,11 @@ int main(int argc, char **argv)

pid = getpid();

- if (getenv("I_AM_SERVICE")) {
- run_service();
- exit(EXIT_FAILURE);
- }
+ if (getenv("I_AM_SERVICE"))
+ return run_service();

- service_prog = argv[0];
+ if (argc > 1 && *argv[1] == 'c')
+ child_args.mode = UM_CLONE_NEWUSER;

warnx("(pid=%d) Starting testcase", pid);

@@ -194,8 +251,12 @@ int main(int argc, char **argv)
if (socketpair(AF_UNIX, SOCK_DGRAM | SOCK_CLOEXEC, 0, sockets) < 0)
err(EXIT_FAILURE, "(pid=%d) socketpair failed", pid);
control_fd[i] = sockets[0];
- child[i] = fork_child(sockets[1]);
+ child[i] = start_child(argv[0], sockets[1]);
wstatus[i] = 0;
+
+ if (define_maps(child[i]) < 0)
+ err(EXIT_FAILURE, "(pid=%d) user_ns maps definition failed", pid);
+ sync_notify(control_fd[i], MAP_DEFINE);
}

for (i = 0; i < NR_CHILDREN; i++)
--
2.34.1
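
For the map definition step in patch 6/6, a more defensive writer could look
like the sketch below. It is an illustration under assumptions, not a
proposed replacement: write_identity_map() and define_uid_map() are
hypothetical names, and the sketch adds checks on fopen() and fclose() that
define_maps() in the patch does not have.

```c
#include <stdio.h>
#include <sys/types.h>

/*
 * Write a single-entry identity map ("<id> <id> 1") to the given file.
 * Returns 0 on success, -1 if the file cannot be opened or written.
 */
static int write_identity_map(const char *path, unsigned int id)
{
	FILE *f = fopen(path, "w");
	int ret = 0;

	if (!f)
		return -1;
	if (fprintf(f, "%u %u 1\n", id, id) < 0)
		ret = -1;
	/* Buffered write errors may only surface at fclose(). */
	if (fclose(f) != 0)
		ret = -1;
	return ret;
}

/* Hypothetical wrapper building the procfs path for a child. */
static int define_uid_map(pid_t child_pid, uid_t uid)
{
	char path[64];

	snprintf(path, sizeof(path), "/proc/%d/uid_map", (int)child_pid);
	return write_identity_map(path, (unsigned int)uid);
}
```

Checking fclose() matters here because stdio buffers the line, so a write
rejected by the kernel would otherwise go unnoticed.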

2022-02-09 06:42:11

by Michal Koutný

Subject: [RFC PATCH 1/6] set_user: Perform RLIMIT_NPROC capability check against new user credentials

The check is currently performed against current->cred, but since those
credentials are about to change and we want to check the RLIMIT_NPROC
condition after the switch, supply the capability check with the new cred.
However, since we also check whether new_user is INIT_USER, the new cred's
capability-based allowance may be redundant when the check fails; the
alternative solution would be to revert commit 2863643fb8b9
("set_user: add capability check when rlimit(RLIMIT_NPROC) exceeds").

Fixes: 2863643fb8b9 ("set_user: add capability check when rlimit(RLIMIT_NPROC) exceeds")

Cc: Solar Designer <[email protected]>
Cc: Christian Brauner <[email protected]>
Signed-off-by: Michal Koutný <[email protected]>
---
kernel/sys.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index 8ea20912103a..48c90dcceff3 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -481,7 +481,8 @@ static int set_user(struct cred *new)
*/
if (ucounts_limit_cmp(new->ucounts, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)) >= 0 &&
new_user != INIT_USER &&
- !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
+ !security_capable(new, &init_user_ns, CAP_SYS_RESOURCE, CAP_OPT_NONE) &&
+ !security_capable(new, &init_user_ns, CAP_SYS_ADMIN, CAP_OPT_NONE))
current->flags |= PF_NPROC_EXCEEDED;
else
current->flags &= ~PF_NPROC_EXCEEDED;
--
2.34.1
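
The effect of patch 1/6 is that the outcome depends on whose credentials the
capability check consults. The sketch below is a hypothetical userspace model
of that condition only: struct model_cred, the capability bits and the
example values are all assumptions made for illustration, and uid 0 stands in
for INIT_USER.

```c
#include <stdbool.h>

/* Hypothetical userspace model; the kernel's struct cred is far richer. */
enum { CAP_RESOURCE_BIT = 1u << 0, CAP_ADMIN_BIT = 1u << 1 };

struct model_cred {
	unsigned int uid;
	unsigned int caps;	/* bitmask of the two capabilities above */
};

static bool has_cap(const struct model_cred *c, unsigned int cap)
{
	return (c->caps & cap) != 0;
}

/*
 * Model of the condition in set_user(): the flag is raised when the target
 * user is at or over the limit, is not the initial (root) user, and the
 * credentials passed in `checked` hold neither capability.
 */
static bool nproc_exceeded(const struct model_cred *checked,
			   const struct model_cred *new,
			   long nr_tasks, long limit)
{
	return nr_tasks >= limit &&
	       new->uid != 0 &&
	       !has_cap(checked, CAP_RESOURCE_BIT) &&
	       !has_cap(checked, CAP_ADMIN_BIT);
}
```

Passing the caller's credentials as `checked` models the code before the
patch, and passing `new` models the code after it. Whether the new
credentials actually carry fewer capabilities at that point is not something
this sketch establishes.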

2022-02-09 07:04:54

by Michal Koutný

Subject: [RFC PATCH 4/6] ucounts: Allow root to override RLIMIT_NPROC

Call sites of ucounts_limit_cmp() would allow the global root or a capable
user to bypass RLIMIT_NPROC on the bottom level of the user_ns tree by not
looking at ucounts at all.

As the traversal up the user_ns tree continues, the ucounts to which the
task is charged may switch the owning user (to the creator of the user_ns).
If the new chargee is root, we don't really care about RLIMIT_NPROC
observance, so lift the limit to the max.

The result is that an unprivileged user U can globally run more than
RLIMIT_NPROC (of the user_ns) tasks, but within each user_ns it is still
limited to RLIMIT_NPROC (as passed into task->signal->rlim) iff the
user namespaces are created by the privileged user.

Signed-off-by: Michal Koutný <[email protected]>
---
kernel/ucount.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/kernel/ucount.c b/kernel/ucount.c
index 53ccd96387dd..f52b7273a572 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -356,6 +356,9 @@ long ucounts_limit_cmp(struct ucounts *ucounts, enum ucount_type type, unsigned
if (excess > 0)
return excess;
max = READ_ONCE(iter->ns->ucount_max[type]);
+ /* Next ucounts owned by root? RLIMIT_NPROC is moot */
+ if (type == UCOUNT_RLIMIT_NPROC && uid_eq(iter->ns->owner, GLOBAL_ROOT_UID))
+ max = LONG_MAX;
}
return excess;
}
--
2.34.1
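
The traversal that patch 4/6 changes can be modelled in userspace roughly as
below. This is a sketch under assumptions: the struct layout and all names
are hypothetical, and the loop shape is inferred from the few context lines
visible in the hunk rather than copied from the kernel.

```c
#include <limits.h>
#include <stddef.h>

struct model_ns;

/* One level of accounting: a (user, namespace) counter. */
struct model_ucounts {
	long nr_tasks;			/* tasks charged at this level */
	struct model_ns *ns;		/* namespace this counter lives in */
};

struct model_ns {
	unsigned int owner_uid;		/* creator of the namespace */
	long max_nproc;			/* limit stored at namespace creation */
	struct model_ucounts *parent;	/* owner's counter one level up, or NULL */
};

/* Returns > 0 when over some limit, 0 at the limit, < 0 below every limit. */
static long model_limit_cmp(const struct model_ucounts *uc, long max)
{
	const struct model_ucounts *iter;
	long excess = LONG_MIN;

	for (iter = uc; iter; iter = iter->ns->parent) {
		long diff = iter->nr_tasks - max;

		if (diff > excess)
			excess = diff;
		if (excess > 0)
			return excess;
		max = iter->ns->max_nproc;
		/* The three added lines: a root-owned level imposes no limit. */
		if (iter->ns->owner_uid == 0)
			max = LONG_MAX;
	}
	return excess;
}
```

In this model, a service with 3 tasks and a limit of 4 stays under the limit
when its namespace is owned by root, whatever root's own count is. If the
namespace is owned by an unprivileged user who already has 8 tasks charged
one level up against a stored limit of 4, the same call reports an excess.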

2022-02-09 09:57:53

by Michal Koutný

Subject: [RFC PATCH 2/6] set*uid: Check RLIMIT_PROC against new credentials

The general idea is that not even root or a capable user can force a
breach of an unprivileged user's limit. (For historical and security reasons
this check is postponed from set*uid to execve.) During the switch, the
resource consumption of the target user has to be checked. The commits
905ae01c4ae2 ("Add a reference to ucounts for each cred") and
21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts") made the
check in set_user() look at the old user's consumption.

This version of the fix simply moves the check to the place where the
actual switch of the accounting structure happens -- set_cred_ucounts().

The other callers are kept without the check, but with the per-userns
accounting they may be newly subject to the check too.
set_cred_ucounts() becomes inconsistent since task->flags are
passed by the caller while task_rlimit() is implicitly that of `current`.
This patch is meant to illustrate the issue; a nicer implementation is
possible.

Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
Signed-off-by: Michal Koutný <[email protected]>
---
fs/exec.c | 2 +-
include/linux/cred.h | 2 +-
kernel/cred.c | 24 +++++++++++++++++++++---
kernel/fork.c | 2 +-
kernel/sys.c | 21 +++------------------
kernel/user_namespace.c | 2 +-
6 files changed, 28 insertions(+), 25 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index fc598c2652b2..e759e42c61da 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1363,7 +1363,7 @@ int begin_new_exec(struct linux_binprm * bprm)
WRITE_ONCE(me->self_exec_id, me->self_exec_id + 1);
flush_signal_handlers(me, 0);

- retval = set_cred_ucounts(bprm->cred);
+ retval = set_cred_ucounts(bprm->cred, NULL);
if (retval < 0)
goto out_unlock;

diff --git a/include/linux/cred.h b/include/linux/cred.h
index fcbc6885cc09..455525ab380d 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -170,7 +170,7 @@ extern int set_security_override_from_ctx(struct cred *, const char *);
extern int set_create_files_as(struct cred *, struct inode *);
extern int cred_fscmp(const struct cred *, const struct cred *);
extern void __init cred_init(void);
-extern int set_cred_ucounts(struct cred *);
+extern int set_cred_ucounts(struct cred *, unsigned int *);

/*
* check for validity of credentials
diff --git a/kernel/cred.c b/kernel/cred.c
index 473d17c431f3..791cab70b764 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -370,7 +370,7 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
ret = create_user_ns(new);
if (ret < 0)
goto error_put;
- ret = set_cred_ucounts(new);
+ ret = set_cred_ucounts(new, NULL);
if (ret < 0)
goto error_put;
}
@@ -492,7 +492,7 @@ int commit_creds(struct cred *new)

/* do it
* RLIMIT_NPROC limits on user->processes have already been checked
- * in set_user().
+ * in set_cred_ucounts().
*/
alter_cred_subscribers(new, 2);
if (new->user != old->user || new->user_ns != old->user_ns)
@@ -663,7 +663,7 @@ int cred_fscmp(const struct cred *a, const struct cred *b)
}
EXPORT_SYMBOL(cred_fscmp);

-int set_cred_ucounts(struct cred *new)
+int set_cred_ucounts(struct cred *new, unsigned int *nproc_flags)
{
struct task_struct *task = current;
const struct cred *old = task->real_cred;
@@ -685,6 +685,24 @@ int set_cred_ucounts(struct cred *new)
new->ucounts = new_ucounts;
put_ucounts(old_ucounts);

+ if (!nproc_flags)
+ return 0;
+
+ /*
+ * We don't fail in case of NPROC limit excess here because too many
+ * poorly written programs don't check set*uid() return code, assuming
+ * it never fails if called by root. We may still enforce NPROC limit
+ * for programs doing set*uid()+execve() by harmlessly deferring the
+ * failure to the execve() stage.
+ */
+ if (ucounts_limit_cmp(new->ucounts, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)) >= 0 &&
+ new->user != INIT_USER &&
+ !security_capable(new, &init_user_ns, CAP_SYS_RESOURCE, CAP_OPT_NONE) &&
+ !security_capable(new, &init_user_ns, CAP_SYS_ADMIN, CAP_OPT_NONE))
+ *nproc_flags |= PF_NPROC_EXCEEDED;
+ else
+ *nproc_flags &= ~PF_NPROC_EXCEEDED;
+
return 0;
}

diff --git a/kernel/fork.c b/kernel/fork.c
index 7cb21a70737d..a4005c679d29 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -3051,7 +3051,7 @@ int ksys_unshare(unsigned long unshare_flags)
goto bad_unshare_cleanup_cred;

if (new_cred) {
- err = set_cred_ucounts(new_cred);
+ err = set_cred_ucounts(new_cred, NULL);
if (err)
goto bad_unshare_cleanup_cred;
}
diff --git a/kernel/sys.c b/kernel/sys.c
index 48c90dcceff3..4e4eea30e235 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -472,21 +472,6 @@ static int set_user(struct cred *new)
if (!new_user)
return -EAGAIN;

- /*
- * We don't fail in case of NPROC limit excess here because too many
- * poorly written programs don't check set*uid() return code, assuming
- * it never fails if called by root. We may still enforce NPROC limit
- * for programs doing set*uid()+execve() by harmlessly deferring the
- * failure to the execve() stage.
- */
- if (ucounts_limit_cmp(new->ucounts, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)) >= 0 &&
- new_user != INIT_USER &&
- !security_capable(new, &init_user_ns, CAP_SYS_RESOURCE, CAP_OPT_NONE) &&
- !security_capable(new, &init_user_ns, CAP_SYS_ADMIN, CAP_OPT_NONE))
- current->flags |= PF_NPROC_EXCEEDED;
- else
- current->flags &= ~PF_NPROC_EXCEEDED;
-
free_uid(new->user);
new->user = new_user;
return 0;
@@ -560,7 +545,7 @@ long __sys_setreuid(uid_t ruid, uid_t euid)
if (retval < 0)
goto error;

- retval = set_cred_ucounts(new);
+ retval = set_cred_ucounts(new, &current->flags);
if (retval < 0)
goto error;

@@ -622,7 +607,7 @@ long __sys_setuid(uid_t uid)
if (retval < 0)
goto error;

- retval = set_cred_ucounts(new);
+ retval = set_cred_ucounts(new, &current->flags);
if (retval < 0)
goto error;

@@ -701,7 +686,7 @@ long __sys_setresuid(uid_t ruid, uid_t euid, uid_t suid)
if (retval < 0)
goto error;

- retval = set_cred_ucounts(new);
+ retval = set_cred_ucounts(new, &current->flags);
if (retval < 0)
goto error;

diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 6b2e3ca7ee99..f7eec0b0233b 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -1344,7 +1344,7 @@ static int userns_install(struct nsset *nsset, struct ns_common *ns)
put_user_ns(cred->user_ns);
set_cred_user_ns(cred, get_user_ns(user_ns));

- if (set_cred_ucounts(cred) < 0)
+ if (set_cred_ucounts(cred, NULL) < 0)
return -EINVAL;

return 0;
--
2.34.1
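
The deferral that patch 2/6 relies on (record the excess at set*uid, fail
only at execve) can be modelled as a two-stage check. This is a hypothetical
userspace sketch: struct model_task and both functions are invented for
illustration, and the execve-side recheck with a strict comparison reflects
my understanding of the kernel's behaviour rather than anything shown in the
diff.

```c
#include <errno.h>
#include <stdbool.h>

#define MODEL_PF_NPROC_EXCEEDED 0x1u	/* stand-in for PF_NPROC_EXCEEDED */

struct model_task {
	unsigned int flags;
	long nr_tasks;	/* tasks charged to the target user */
	long limit;	/* RLIMIT_NPROC */
};

/* set*uid() stage: never fails on excess, only records it. */
static int model_setuid(struct model_task *t, bool privileged)
{
	if (t->nr_tasks >= t->limit && !privileged)
		t->flags |= MODEL_PF_NPROC_EXCEEDED;
	else
		t->flags &= ~MODEL_PF_NPROC_EXCEEDED;
	t->nr_tasks++;	/* the switching task is now charged to the target */
	return 0;
}

/* execve() stage: fail only if flagged and still over the limit. */
static int model_execve(struct model_task *t)
{
	if ((t->flags & MODEL_PF_NPROC_EXCEEDED) && t->nr_tasks > t->limit)
		return -EAGAIN;
	t->flags &= ~MODEL_PF_NPROC_EXCEEDED;
	return 0;
}
```

A program that ignores the set*uid() return value therefore keeps running
with the new uid, while a set*uid()+execve() sequence sees the failure at the
execve() step.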

2022-02-09 10:05:38

Subject: [RFC PATCH 0/6] RLIMIT_NPROC in ucounts fixups

Subject: [RFC PATCH 5/6] selftests: Challenge RLIMIT_NPROC in user namespaces

Subject: [RFC PATCH 6/6] selftests: Test RLIMIT_NPROC in clone-created user namespaces

Subject: [RFC PATCH 1/6] set_user: Perform RLIMIT_NPROC capability check against new user credentials

Subject: [RFC PATCH 4/6] ucounts: Allow root to override RLIMIT_NPROC

Subject: [RFC PATCH 2/6] set*uid: Check RLIMIT_PROC against new credentials

Subject: Re: [RFC PATCH 0/6] RLIMIT_NPROC in ucounts fixups

Subject: Re: [RFC PATCH 5/6] selftests: Challenge RLIMIT_NPROC in user namespaces

Subject: Re: [RFC PATCH 6/6] selftests: Test RLIMIT_NPROC in clone-created user namespaces

Subject: Re: [RFC PATCH 1/6] set_user: Perform RLIMIT_NPROC capability check against new user credentials

Subject: Re: [RFC PATCH 1/6] set_user: Perform RLIMIT_NPROC capability check against new user credentials

Subject: Re: [RFC PATCH 4/6] ucounts: Allow root to override RLIMIT_NPROC

Subject: [PATCH 0/8] ucounts: RLIMIT_NPROC fixes

Subject: Re: [PATCH 0/8] ucounts: RLIMIT_NPROC fixes

Subject: [PATCH 7/8] rlimit: For RLIMIT_NPROC test the child not the parent for capabilites

Subject: Re: [RFC PATCH 1/6] set_user: Perform RLIMIT_NPROC capability check against new user credentials

Subject: Re: [RFC PATCH 1/6] set_user: Perform RLIMIT_NPROC capability check against new user credentials

Subject: Re: [PATCH 0/8] ucounts: RLIMIT_NPROC fixes

Subject: Re: [RFC PATCH 0/6] RLIMIT_NPROC in ucounts fixups

Subject: Re: [RFC PATCH 6/6] selftests: Test RLIMIT_NPROC in clone-created user namespaces

Subject: Re: [RFC PATCH 1/6] set_user: Perform RLIMIT_NPROC capability check against new user credentials

Subject: Re: [PATCH 0/8] ucounts: RLIMIT_NPROC fixes

Subject: Re: [RFC PATCH 0/6] RLIMIT_NPROC in ucounts fixups

Subject: Re: [RFC PATCH 5/6] selftests: Challenge RLIMIT_NPROC in user namespaces

Subject: [PATCH v2 0/5] ucounts: RLIMIT_NPROC fixes

Subject: [GIT PULL] ucounts: RLIMIT_NPROC fixes for v5.17

Subject: Re: [GIT PULL] ucounts: RLIMIT_NPROC fixes for v5.17

Subject: Re: [RFC PATCH 0/6] RLIMIT_NPROC in ucounts fixups

Subject: How should rlimits, suid exec, and capabilities interact?

Subject: Re: How should rlimits, suid exec, and capabilities interact?

Subject: Re: How should rlimits, suid exec, and capabilities interact?

Subject: RE: How should rlimits, suid exec, and capabilities interact?

Subject: [PATCH] ucounts: Fix systemd LimigtNPROC with private users regression

Subject: Re: [PATCH] ucounts: Fix systemd LimigtNPROC with private users regression

Subject: Re: [PATCH] ucounts: Fix systemd LimigtNPROC with private users regression

Subject: Re: [PATCH] ucounts: Fix systemd LimigtNPROC with private users regression

Subject: [GIT PULL] ucounts: Regression fix for v5.17

Subject: Re: [GIT PULL] ucounts: Regression fix for v5.17