2021-04-22 12:28:52

by Alexey Gladkov

[permalink] [raw]
Subject: [PATCH v11 0/9] Count rlimits in each user namespace

From: Alexey Gladkov <[email protected]>

Preface
-------
These patches are for binding the rlimit counters to a user in user namespace.
This patch set can be applied on top of:

git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git v5.12-rc4

Problem
-------
The RLIMIT_NPROC, RLIMIT_MEMLOCK, RLIMIT_SIGPENDING, RLIMIT_MSGQUEUE rlimits
implementation places the counters in user_struct [1]. These limits are global
between processes and persists for the lifetime of the process, even if
processes are in different user namespaces.

To illustrate the impact of rlimits, let's say there is a program that does not
fork. Some service-A wants to run this program as user X in multiple containers.
Since the program never fork the service wants to set RLIMIT_NPROC=1.

service-A
\- program (uid=1000, container1, rlimit_nproc=1)
\- program (uid=1000, container2, rlimit_nproc=1)

The service-A sets RLIMIT_NPROC=1 and runs the program in container1. When the
service-A tries to run a program with RLIMIT_NPROC=1 in container2 it fails
since user X already has one running process.

The problem is not that the limit from container1 affects container2. The
problem is that limit is verified against the global counter that reflects
the number of processes in all containers.

This problem can be worked around by using different users for each container
but in this case we face a different problem of uid mapping when transferring
files from one container to another.

Eric W. Biederman mentioned this issue [2][3].

Introduced changes
------------------
To address the problem, we bind rlimit counters to user namespace. Each counter
reflects the number of processes in a given uid in a given user namespace. The
result is a tree of rlimit counters with the biggest value at the root (aka
init_user_ns). The limit is considered exceeded if it's exceeded up in the tree.

[1]: https://lore.kernel.org/containers/[email protected]/
[2]: https://lists.linuxfoundation.org/pipermail/containers/2020-August/042096.html
[3]: https://lists.linuxfoundation.org/pipermail/containers/2020-October/042524.html

Changelog
---------
v11:
* Revert most of changes in signal.c to fix performance issues and remove
unnecessary memory allocations.
* Fixed issue found by lkp robot (again).

v10:
* Fixed memory leak in __sigqueue_alloc.
* Handled an unlikely situation when all consumers will return ucounts at once.
* Addressed other review comments from Eric W. Biederman.

v9:
* Used a negative value to check that the ucounts->count is close to overflow.
* Rebased onto v5.12-rc4.

v8:
* Used atomic_t for ucounts reference counting. Also added counter overflow
check (thanks to Linus Torvalds for the idea).
* Fixed other issues found by lkp-tests project in the patch that Reimplements
RLIMIT_MEMLOCK on top of ucounts.

v7:
* Fixed issues found by lkp-tests project in the patch that Reimplements
RLIMIT_MEMLOCK on top of ucounts.

v6:
* Fixed issues found by lkp-tests project.
* Rebased onto v5.11.

v5:
* Split the first commit into two commits: change ucounts.count type to atomic_long_t
and add ucounts to cred. These commits were merged by mistake during the rebase.
* The __get_ucounts() renamed to alloc_ucounts().
* The cred.ucounts update has been moved from commit_creds() as it did not allow
to handle errors.
* Added error handling of set_cred_ucounts().

v4:
* Reverted the type change of ucounts.count to refcount_t.
* Fixed typo in the kernel/cred.c

v3:
* Added get_ucounts() function to increase the reference count. The existing
get_counts() function renamed to __get_ucounts().
* The type of ucounts.count changed from atomic_t to refcount_t.
* Dropped 'const' from set_cred_ucounts() arguments.
* Fixed a bug with freeing the cred structure after calling cred_alloc_blank().
* Commit messages have been updated.
* Added selftest.

v2:
* RLIMIT_MEMLOCK, RLIMIT_SIGPENDING and RLIMIT_MSGQUEUE are migrated to ucounts.
* Added ucounts for pair uid and user namespace into cred.
* Added the ability to increase ucount by more than 1.

v1:
* After discussion with Eric W. Biederman, I increased the size of ucounts to
atomic_long_t.
* Added ucount_max to avoid the fork bomb.

--

Alexey Gladkov (9):
Increase size of ucounts to atomic_long_t
Add a reference to ucounts for each cred
Use atomic_t for ucounts reference counting
Reimplement RLIMIT_NPROC on top of ucounts
Reimplement RLIMIT_MSGQUEUE on top of ucounts
Reimplement RLIMIT_SIGPENDING on top of ucounts
Reimplement RLIMIT_MEMLOCK on top of ucounts
kselftests: Add test to check for rlimit changes in different user
namespaces
ucounts: Set ucount_max to the largest positive value the type can
hold

fs/exec.c | 6 +-
fs/hugetlbfs/inode.c | 16 +-
fs/proc/array.c | 2 +-
include/linux/cred.h | 4 +
include/linux/hugetlb.h | 4 +-
include/linux/mm.h | 4 +-
include/linux/sched/user.h | 7 -
include/linux/shmem_fs.h | 2 +-
include/linux/signal_types.h | 4 +-
include/linux/user_namespace.h | 31 +++-
ipc/mqueue.c | 40 ++---
ipc/shm.c | 26 +--
kernel/cred.c | 50 +++++-
kernel/exit.c | 2 +-
kernel/fork.c | 18 +-
kernel/signal.c | 25 +--
kernel/sys.c | 14 +-
kernel/ucount.c | 116 ++++++++++---
kernel/user.c | 3 -
kernel/user_namespace.c | 9 +-
mm/memfd.c | 4 +-
mm/mlock.c | 22 ++-
mm/mmap.c | 4 +-
mm/shmem.c | 10 +-
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/rlimits/.gitignore | 2 +
tools/testing/selftests/rlimits/Makefile | 6 +
tools/testing/selftests/rlimits/config | 1 +
.../selftests/rlimits/rlimits-per-userns.c | 161 ++++++++++++++++++
29 files changed, 467 insertions(+), 127 deletions(-)
create mode 100644 tools/testing/selftests/rlimits/.gitignore
create mode 100644 tools/testing/selftests/rlimits/Makefile
create mode 100644 tools/testing/selftests/rlimits/config
create mode 100644 tools/testing/selftests/rlimits/rlimits-per-userns.c

--
2.29.3


2021-04-22 12:29:02

by Alexey Gladkov

[permalink] [raw]
Subject: [PATCH v11 1/9] Increase size of ucounts to atomic_long_t

From: Alexey Gladkov <[email protected]>

RLIMIT_MSGQUEUE and RLIMIT_MEMLOCK use unsigned long to store their
counters. As a preparation for moving rlimits based on ucounts, we need
to increase the size of the variable to long.

Signed-off-by: Alexey Gladkov <[email protected]>
---
include/linux/user_namespace.h | 4 ++--
kernel/ucount.c | 16 ++++++++--------
2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 64cf8ebdc4ec..0bb833fd41f4 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -85,7 +85,7 @@ struct user_namespace {
struct ctl_table_header *sysctls;
#endif
struct ucounts *ucounts;
- int ucount_max[UCOUNT_COUNTS];
+ long ucount_max[UCOUNT_COUNTS];
} __randomize_layout;

struct ucounts {
@@ -93,7 +93,7 @@ struct ucounts {
struct user_namespace *ns;
kuid_t uid;
int count;
- atomic_t ucount[UCOUNT_COUNTS];
+ atomic_long_t ucount[UCOUNT_COUNTS];
};

extern struct user_namespace init_user_ns;
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 11b1596e2542..04c561751af1 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -175,14 +175,14 @@ static void put_ucounts(struct ucounts *ucounts)
kfree(ucounts);
}

-static inline bool atomic_inc_below(atomic_t *v, int u)
+static inline bool atomic_long_inc_below(atomic_long_t *v, int u)
{
- int c, old;
- c = atomic_read(v);
+ long c, old;
+ c = atomic_long_read(v);
for (;;) {
if (unlikely(c >= u))
return false;
- old = atomic_cmpxchg(v, c, c+1);
+ old = atomic_long_cmpxchg(v, c, c+1);
if (likely(old == c))
return true;
c = old;
@@ -196,17 +196,17 @@ struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid,
struct user_namespace *tns;
ucounts = get_ucounts(ns, uid);
for (iter = ucounts; iter; iter = tns->ucounts) {
- int max;
+ long max;
tns = iter->ns;
max = READ_ONCE(tns->ucount_max[type]);
- if (!atomic_inc_below(&iter->ucount[type], max))
+ if (!atomic_long_inc_below(&iter->ucount[type], max))
goto fail;
}
return ucounts;
fail:
bad = iter;
for (iter = ucounts; iter != bad; iter = iter->ns->ucounts)
- atomic_dec(&iter->ucount[type]);
+ atomic_long_dec(&iter->ucount[type]);

put_ucounts(ucounts);
return NULL;
@@ -216,7 +216,7 @@ void dec_ucount(struct ucounts *ucounts, enum ucount_type type)
{
struct ucounts *iter;
for (iter = ucounts; iter; iter = iter->ns->ucounts) {
- int dec = atomic_dec_if_positive(&iter->ucount[type]);
+ long dec = atomic_long_dec_if_positive(&iter->ucount[type]);
WARN_ON_ONCE(dec < 0);
}
put_ucounts(ucounts);
--
2.29.3

2021-04-22 12:29:04

by Alexey Gladkov

[permalink] [raw]
Subject: [PATCH v11 5/9] Reimplement RLIMIT_MSGQUEUE on top of ucounts

From: Alexey Gladkov <[email protected]>

The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by the user. However, the value of the previous
user_namespaces cannot be exceeded.

Signed-off-by: Alexey Gladkov <[email protected]>
---
include/linux/sched/user.h | 4 ----
include/linux/user_namespace.h | 1 +
ipc/mqueue.c | 40 ++++++++++++++++++----------------
kernel/fork.c | 1 +
kernel/ucount.c | 1 +
kernel/user_namespace.c | 1 +
6 files changed, 25 insertions(+), 23 deletions(-)

diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index d33d867ad6c1..8a34446681aa 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -18,10 +18,6 @@ struct user_struct {
#endif
#ifdef CONFIG_EPOLL
atomic_long_t epoll_watches; /* The number of file descriptors currently watched */
-#endif
-#ifdef CONFIG_POSIX_MQUEUE
- /* protected by mq_lock */
- unsigned long mq_bytes; /* How many bytes can be allocated to mqueue? */
#endif
unsigned long locked_shm; /* How many pages of mlocked shm ? */
unsigned long unix_inflight; /* How many files in flight in unix sockets */
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index d5bb4abb8f3e..21ad1ad1b990 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -51,6 +51,7 @@ enum ucount_type {
UCOUNT_INOTIFY_WATCHES,
#endif
UCOUNT_RLIMIT_NPROC,
+ UCOUNT_RLIMIT_MSGQUEUE,
UCOUNT_COUNTS,
};

diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 8031464ed4ae..461fcf8c873d 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -144,7 +144,7 @@ struct mqueue_inode_info {
struct pid *notify_owner;
u32 notify_self_exec_id;
struct user_namespace *notify_user_ns;
- struct user_struct *user; /* user who created, for accounting */
+ struct ucounts *ucounts; /* user who created, for accounting */
struct sock *notify_sock;
struct sk_buff *notify_cookie;

@@ -292,7 +292,6 @@ static struct inode *mqueue_get_inode(struct super_block *sb,
struct ipc_namespace *ipc_ns, umode_t mode,
struct mq_attr *attr)
{
- struct user_struct *u = current_user();
struct inode *inode;
int ret = -ENOMEM;

@@ -321,7 +320,7 @@ static struct inode *mqueue_get_inode(struct super_block *sb,
info->notify_owner = NULL;
info->notify_user_ns = NULL;
info->qsize = 0;
- info->user = NULL; /* set when all is ok */
+ info->ucounts = NULL; /* set when all is ok */
info->msg_tree = RB_ROOT;
info->msg_tree_rightmost = NULL;
info->node_cache = NULL;
@@ -371,19 +370,23 @@ static struct inode *mqueue_get_inode(struct super_block *sb,
if (mq_bytes + mq_treesize < mq_bytes)
goto out_inode;
mq_bytes += mq_treesize;
- spin_lock(&mq_lock);
- if (u->mq_bytes + mq_bytes < u->mq_bytes ||
- u->mq_bytes + mq_bytes > rlimit(RLIMIT_MSGQUEUE)) {
+ info->ucounts = get_ucounts(current_ucounts());
+ if (info->ucounts) {
+ long msgqueue;
+
+ spin_lock(&mq_lock);
+ msgqueue = inc_rlimit_ucounts(info->ucounts, UCOUNT_RLIMIT_MSGQUEUE, mq_bytes);
+ if (msgqueue == LONG_MAX || msgqueue > rlimit(RLIMIT_MSGQUEUE)) {
+ dec_rlimit_ucounts(info->ucounts, UCOUNT_RLIMIT_MSGQUEUE, mq_bytes);
+ spin_unlock(&mq_lock);
+ put_ucounts(info->ucounts);
+ info->ucounts = NULL;
+ /* mqueue_evict_inode() releases info->messages */
+ ret = -EMFILE;
+ goto out_inode;
+ }
spin_unlock(&mq_lock);
- /* mqueue_evict_inode() releases info->messages */
- ret = -EMFILE;
- goto out_inode;
}
- u->mq_bytes += mq_bytes;
- spin_unlock(&mq_lock);
-
- /* all is ok */
- info->user = get_uid(u);
} else if (S_ISDIR(mode)) {
inc_nlink(inode);
/* Some things misbehave if size == 0 on a directory */
@@ -497,7 +500,6 @@ static void mqueue_free_inode(struct inode *inode)
static void mqueue_evict_inode(struct inode *inode)
{
struct mqueue_inode_info *info;
- struct user_struct *user;
struct ipc_namespace *ipc_ns;
struct msg_msg *msg, *nmsg;
LIST_HEAD(tmp_msg);
@@ -520,8 +522,7 @@ static void mqueue_evict_inode(struct inode *inode)
free_msg(msg);
}

- user = info->user;
- if (user) {
+ if (info->ucounts) {
unsigned long mq_bytes, mq_treesize;

/* Total amount of bytes accounted for the mqueue */
@@ -533,7 +534,7 @@ static void mqueue_evict_inode(struct inode *inode)
info->attr.mq_msgsize);

spin_lock(&mq_lock);
- user->mq_bytes -= mq_bytes;
+ dec_rlimit_ucounts(info->ucounts, UCOUNT_RLIMIT_MSGQUEUE, mq_bytes);
/*
* get_ns_from_inode() ensures that the
* (ipc_ns = sb->s_fs_info) is either a valid ipc_ns
@@ -543,7 +544,8 @@ static void mqueue_evict_inode(struct inode *inode)
if (ipc_ns)
ipc_ns->mq_queues_count--;
spin_unlock(&mq_lock);
- free_uid(user);
+ put_ucounts(info->ucounts);
+ info->ucounts = NULL;
}
if (ipc_ns)
put_ipc_ns(ipc_ns);
diff --git a/kernel/fork.c b/kernel/fork.c
index d8a4956463ae..85c6094f5a48 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -823,6 +823,7 @@ void __init fork_init(void)
init_user_ns.ucount_max[i] = max_threads/2;

init_user_ns.ucount_max[UCOUNT_RLIMIT_NPROC] = task_rlimit(&init_task, RLIMIT_NPROC);
+ init_user_ns.ucount_max[UCOUNT_RLIMIT_MSGQUEUE] = task_rlimit(&init_task, RLIMIT_MSGQUEUE);

#ifdef CONFIG_VMAP_STACK
cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "fork:vm_stack_cache",
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 6caa56f7dec8..6e6f936a5963 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -80,6 +80,7 @@ static struct ctl_table user_table[] = {
UCOUNT_ENTRY("max_inotify_instances"),
UCOUNT_ENTRY("max_inotify_watches"),
#endif
+ { },
{ },
{ }
};
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 2434b13b02e5..cc90d5203acf 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -122,6 +122,7 @@ int create_user_ns(struct cred *new)
ns->ucount_max[i] = INT_MAX;
}
ns->ucount_max[UCOUNT_RLIMIT_NPROC] = rlimit(RLIMIT_NPROC);
+ ns->ucount_max[UCOUNT_RLIMIT_MSGQUEUE] = rlimit(RLIMIT_MSGQUEUE);
ns->ucounts = ucounts;

/* Inherit USERNS_SETGROUPS_ALLOWED from our parent */
--
2.29.3

2021-04-22 12:29:13

by Alexey Gladkov

[permalink] [raw]
Subject: [PATCH v11 3/9] Use atomic_t for ucounts reference counting

From: Alexey Gladkov <[email protected]>

The current implementation of the ucounts reference counter requires the
use of spin_lock. We're going to use get_ucounts() in more performance
critical areas like a handling of RLIMIT_SIGPENDING.

Now we need to use spin_lock only if we want to change the hashtable.

v10:
* Always try to put ucounts in case we cannot increase ucounts->count.
This will allow to cover the case when all consumers will return
ucounts at once.

v9:
* Use a negative value to check that the ucounts->count is close to
overflow.

Signed-off-by: Alexey Gladkov <[email protected]>
---
include/linux/user_namespace.h | 4 +--
kernel/ucount.c | 53 ++++++++++++----------------------
2 files changed, 21 insertions(+), 36 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index f71b5a4a3e74..d84cc2c0b443 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -92,7 +92,7 @@ struct ucounts {
struct hlist_node node;
struct user_namespace *ns;
kuid_t uid;
- int count;
+ atomic_t count;
atomic_long_t ucount[UCOUNT_COUNTS];
};

@@ -104,7 +104,7 @@ void retire_userns_sysctls(struct user_namespace *ns);
struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid, enum ucount_type type);
void dec_ucount(struct ucounts *ucounts, enum ucount_type type);
struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid);
-struct ucounts *get_ucounts(struct ucounts *ucounts);
+struct ucounts * __must_check get_ucounts(struct ucounts *ucounts);
void put_ucounts(struct ucounts *ucounts);

#ifdef CONFIG_USER_NS
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 50cc1dfb7d28..365865f368ec 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -11,7 +11,7 @@
struct ucounts init_ucounts = {
.ns = &init_user_ns,
.uid = GLOBAL_ROOT_UID,
- .count = 1,
+ .count = ATOMIC_INIT(1),
};

#define UCOUNTS_HASHTABLE_BITS 10
@@ -139,6 +139,15 @@ static void hlist_add_ucounts(struct ucounts *ucounts)
spin_unlock_irq(&ucounts_lock);
}

+struct ucounts *get_ucounts(struct ucounts *ucounts)
+{
+ if (ucounts && atomic_add_negative(1, &ucounts->count)) {
+ put_ucounts(ucounts);
+ ucounts = NULL;
+ }
+ return ucounts;
+}
+
struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid)
{
struct hlist_head *hashent = ucounts_hashentry(ns, uid);
@@ -155,7 +164,7 @@ struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid)

new->ns = ns;
new->uid = uid;
- new->count = 0;
+ atomic_set(&new->count, 1);

spin_lock_irq(&ucounts_lock);
ucounts = find_ucounts(ns, uid, hashent);
@@ -163,33 +172,12 @@ struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid)
kfree(new);
} else {
hlist_add_head(&new->node, hashent);
- ucounts = new;
+ spin_unlock_irq(&ucounts_lock);
+ return new;
}
}
- if (ucounts->count == INT_MAX)
- ucounts = NULL;
- else
- ucounts->count += 1;
spin_unlock_irq(&ucounts_lock);
- return ucounts;
-}
-
-struct ucounts *get_ucounts(struct ucounts *ucounts)
-{
- unsigned long flags;
-
- if (!ucounts)
- return NULL;
-
- spin_lock_irqsave(&ucounts_lock, flags);
- if (ucounts->count == INT_MAX) {
- WARN_ONCE(1, "ucounts: counter has reached its maximum value");
- ucounts = NULL;
- } else {
- ucounts->count += 1;
- }
- spin_unlock_irqrestore(&ucounts_lock, flags);
-
+ ucounts = get_ucounts(ucounts);
return ucounts;
}

@@ -197,15 +185,12 @@ void put_ucounts(struct ucounts *ucounts)
{
unsigned long flags;

- spin_lock_irqsave(&ucounts_lock, flags);
- ucounts->count -= 1;
- if (!ucounts->count)
+ if (atomic_dec_and_test(&ucounts->count)) {
+ spin_lock_irqsave(&ucounts_lock, flags);
hlist_del_init(&ucounts->node);
- else
- ucounts = NULL;
- spin_unlock_irqrestore(&ucounts_lock, flags);
-
- kfree(ucounts);
+ spin_unlock_irqrestore(&ucounts_lock, flags);
+ kfree(ucounts);
+ }
}

static inline bool atomic_long_inc_below(atomic_long_t *v, int u)
--
2.29.3

2021-04-22 12:29:24

by Alexey Gladkov

[permalink] [raw]
Subject: [PATCH v11 6/9] Reimplement RLIMIT_SIGPENDING on top of ucounts

From: Alexey Gladkov <[email protected]>

The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by the user. However, the value of the previous
user_namespaces cannot be exceeded.

Changelog

v11:
* Revert most of changes to fix performance issues.

v10:
* Fix memory leak on get_ucounts failure.

Signed-off-by: Alexey Gladkov <[email protected]>
---
fs/proc/array.c | 2 +-
include/linux/sched/user.h | 1 -
include/linux/signal_types.h | 4 +++-
include/linux/user_namespace.h | 1 +
kernel/fork.c | 1 +
kernel/signal.c | 25 +++++++++++++------------
kernel/ucount.c | 1 +
kernel/user.c | 1 -
kernel/user_namespace.c | 1 +
9 files changed, 21 insertions(+), 16 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index bb87e4d89cd8..74b0ea4b7e38 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -284,7 +284,7 @@ static inline void task_sig(struct seq_file *m, struct task_struct *p)
collect_sigign_sigcatch(p, &ignored, &caught);
num_threads = get_nr_threads(p);
rcu_read_lock(); /* FIXME: is this correct? */
- qsize = atomic_read(&__task_cred(p)->user->sigpending);
+ qsize = get_ucounts_value(task_ucounts(p), UCOUNT_RLIMIT_SIGPENDING);
rcu_read_unlock();
qlim = task_rlimit(p, RLIMIT_SIGPENDING);
unlock_task_sighand(p, &flags);
diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index 8a34446681aa..8ba9cec4fb99 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -12,7 +12,6 @@
*/
struct user_struct {
refcount_t __count; /* reference count */
- atomic_t sigpending; /* How many pending signals does this user have? */
#ifdef CONFIG_FANOTIFY
atomic_t fanotify_listeners;
#endif
diff --git a/include/linux/signal_types.h b/include/linux/signal_types.h
index 68e06c75c5b2..34cb28b8f16c 100644
--- a/include/linux/signal_types.h
+++ b/include/linux/signal_types.h
@@ -13,6 +13,8 @@ typedef struct kernel_siginfo {
__SIGINFO;
} kernel_siginfo_t;

+struct ucounts;
+
/*
* Real Time signals may be queued.
*/
@@ -21,7 +23,7 @@ struct sigqueue {
struct list_head list;
int flags;
kernel_siginfo_t info;
- struct user_struct *user;
+ struct ucounts *ucounts;
};

/* flags values. */
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 21ad1ad1b990..0e961bd8a090 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -52,6 +52,7 @@ enum ucount_type {
#endif
UCOUNT_RLIMIT_NPROC,
UCOUNT_RLIMIT_MSGQUEUE,
+ UCOUNT_RLIMIT_SIGPENDING,
UCOUNT_COUNTS,
};

diff --git a/kernel/fork.c b/kernel/fork.c
index 85c6094f5a48..741f896c156e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -824,6 +824,7 @@ void __init fork_init(void)

init_user_ns.ucount_max[UCOUNT_RLIMIT_NPROC] = task_rlimit(&init_task, RLIMIT_NPROC);
init_user_ns.ucount_max[UCOUNT_RLIMIT_MSGQUEUE] = task_rlimit(&init_task, RLIMIT_MSGQUEUE);
+ init_user_ns.ucount_max[UCOUNT_RLIMIT_SIGPENDING] = task_rlimit(&init_task, RLIMIT_SIGPENDING);

#ifdef CONFIG_VMAP_STACK
cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "fork:vm_stack_cache",
diff --git a/kernel/signal.c b/kernel/signal.c
index f2a1b898da29..92383b890b35 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -414,8 +414,8 @@ static struct sigqueue *
__sigqueue_alloc(int sig, struct task_struct *t, gfp_t flags, int override_rlimit)
{
struct sigqueue *q = NULL;
- struct user_struct *user;
- int sigpending;
+ struct ucounts *ucounts = NULL;
+ long sigpending;

/*
* Protect access to @t credentials. This can go away when all
@@ -426,27 +426,26 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t flags, int override_rlimi
* changes from/to zero.
*/
rcu_read_lock();
- user = __task_cred(t)->user;
- sigpending = atomic_inc_return(&user->sigpending);
+ ucounts = task_ucounts(t);
+ sigpending = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
if (sigpending == 1)
- get_uid(user);
+ ucounts = get_ucounts(ucounts);
rcu_read_unlock();

- if (override_rlimit || likely(sigpending <= task_rlimit(t, RLIMIT_SIGPENDING))) {
+ if (override_rlimit || (sigpending < LONG_MAX && sigpending <= task_rlimit(t, RLIMIT_SIGPENDING))) {
q = kmem_cache_alloc(sigqueue_cachep, flags);
} else {
print_dropped_signal(sig);
}

if (unlikely(q == NULL)) {
- if (atomic_dec_and_test(&user->sigpending))
- free_uid(user);
+ if (ucounts && dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1))
+ put_ucounts(ucounts);
} else {
INIT_LIST_HEAD(&q->list);
q->flags = 0;
- q->user = user;
+ q->ucounts = ucounts;
}
-
return q;
}

@@ -454,8 +453,10 @@ static void __sigqueue_free(struct sigqueue *q)
{
if (q->flags & SIGQUEUE_PREALLOC)
return;
- if (atomic_dec_and_test(&q->user->sigpending))
- free_uid(q->user);
+ if (q->ucounts && dec_rlimit_ucounts(q->ucounts, UCOUNT_RLIMIT_SIGPENDING, 1)) {
+ put_ucounts(q->ucounts);
+ q->ucounts = NULL;
+ }
kmem_cache_free(sigqueue_cachep, q);
}

diff --git a/kernel/ucount.c b/kernel/ucount.c
index 6e6f936a5963..8ce62da6a62c 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -80,6 +80,7 @@ static struct ctl_table user_table[] = {
UCOUNT_ENTRY("max_inotify_instances"),
UCOUNT_ENTRY("max_inotify_watches"),
#endif
+ { },
{ },
{ },
{ }
diff --git a/kernel/user.c b/kernel/user.c
index 7f5ff498207a..6737327f83be 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -98,7 +98,6 @@ static DEFINE_SPINLOCK(uidhash_lock);
/* root_user.__count is 1, for init task cred */
struct user_struct root_user = {
.__count = REFCOUNT_INIT(1),
- .sigpending = ATOMIC_INIT(0),
.locked_shm = 0,
.uid = GLOBAL_ROOT_UID,
.ratelimit = RATELIMIT_STATE_INIT(root_user.ratelimit, 0, 0),
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index cc90d5203acf..df1bed32dd48 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -123,6 +123,7 @@ int create_user_ns(struct cred *new)
}
ns->ucount_max[UCOUNT_RLIMIT_NPROC] = rlimit(RLIMIT_NPROC);
ns->ucount_max[UCOUNT_RLIMIT_MSGQUEUE] = rlimit(RLIMIT_MSGQUEUE);
+ ns->ucount_max[UCOUNT_RLIMIT_SIGPENDING] = rlimit(RLIMIT_SIGPENDING);
ns->ucounts = ucounts;

/* Inherit USERNS_SETGROUPS_ALLOWED from our parent */
--
2.29.3

2021-04-22 12:29:35

by Alexey Gladkov

[permalink] [raw]
Subject: [PATCH v11 7/9] Reimplement RLIMIT_MEMLOCK on top of ucounts

From: Alexey Gladkov <[email protected]>

The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by the user. However, the value of the previous
user_namespaces cannot be exceeded.

Changelog

v11:
* Fix issue found by lkp robot.

v8:
* Fix issues found by lkp-tests project.

v7:
* Keep only ucounts for RLIMIT_MEMLOCK checks instead of struct cred.

v6:
* Fix bug in hugetlb_file_setup() detected by trinity.

Reported-by: kernel test robot <[email protected]>
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Alexey Gladkov <[email protected]>
---
fs/hugetlbfs/inode.c | 16 ++++++++--------
include/linux/hugetlb.h | 4 ++--
include/linux/mm.h | 4 ++--
include/linux/sched/user.h | 1 -
include/linux/shmem_fs.h | 2 +-
include/linux/user_namespace.h | 1 +
ipc/shm.c | 26 +++++++++++++-------------
kernel/fork.c | 1 +
kernel/ucount.c | 1 +
kernel/user.c | 1 -
kernel/user_namespace.c | 1 +
mm/memfd.c | 4 ++--
mm/mlock.c | 22 ++++++++++++++--------
mm/mmap.c | 4 ++--
mm/shmem.c | 10 +++++-----
15 files changed, 53 insertions(+), 45 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 701c82c36138..be519fc9559a 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -1443,7 +1443,7 @@ static int get_hstate_idx(int page_size_log)
* otherwise hugetlb_reserve_pages reserves one less hugepages than intended.
*/
struct file *hugetlb_file_setup(const char *name, size_t size,
- vm_flags_t acctflag, struct user_struct **user,
+ vm_flags_t acctflag, struct ucounts **ucounts,
int creat_flags, int page_size_log)
{
struct inode *inode;
@@ -1455,20 +1455,20 @@ struct file *hugetlb_file_setup(const char *name, size_t size,
if (hstate_idx < 0)
return ERR_PTR(-ENODEV);

- *user = NULL;
+ *ucounts = NULL;
mnt = hugetlbfs_vfsmount[hstate_idx];
if (!mnt)
return ERR_PTR(-ENOENT);

if (creat_flags == HUGETLB_SHMFS_INODE && !can_do_hugetlb_shm()) {
- *user = current_user();
- if (user_shm_lock(size, *user)) {
+ *ucounts = current_ucounts();
+ if (user_shm_lock(size, *ucounts)) {
task_lock(current);
pr_warn_once("%s (%d): Using mlock ulimits for SHM_HUGETLB is deprecated\n",
current->comm, current->pid);
task_unlock(current);
} else {
- *user = NULL;
+ *ucounts = NULL;
return ERR_PTR(-EPERM);
}
}
@@ -1495,9 +1495,9 @@ struct file *hugetlb_file_setup(const char *name, size_t size,

iput(inode);
out:
- if (*user) {
- user_shm_unlock(size, *user);
- *user = NULL;
+ if (*ucounts) {
+ user_shm_unlock(size, *ucounts);
+ *ucounts = NULL;
}
return file;
}
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index cccd1aab69dd..96d63dbdec65 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -434,7 +434,7 @@ static inline struct hugetlbfs_inode_info *HUGETLBFS_I(struct inode *inode)
extern const struct file_operations hugetlbfs_file_operations;
extern const struct vm_operations_struct hugetlb_vm_ops;
struct file *hugetlb_file_setup(const char *name, size_t size, vm_flags_t acct,
- struct user_struct **user, int creat_flags,
+ struct ucounts **ucounts, int creat_flags,
int page_size_log);

static inline bool is_file_hugepages(struct file *file)
@@ -454,7 +454,7 @@ static inline struct hstate *hstate_inode(struct inode *i)
#define is_file_hugepages(file) false
static inline struct file *
hugetlb_file_setup(const char *name, size_t size, vm_flags_t acctflag,
- struct user_struct **user, int creat_flags,
+ struct ucounts **ucounts, int creat_flags,
int page_size_log)
{
return ERR_PTR(-ENOSYS);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 64a71bf20536..7466eab000d0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1658,8 +1658,8 @@ extern bool can_do_mlock(void);
#else
static inline bool can_do_mlock(void) { return false; }
#endif
-extern int user_shm_lock(size_t, struct user_struct *);
-extern void user_shm_unlock(size_t, struct user_struct *);
+extern int user_shm_lock(size_t, struct ucounts *);
+extern void user_shm_unlock(size_t, struct ucounts *);

/*
* Parameter block passed down to zap_pte_range in exceptional cases.
diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index 8ba9cec4fb99..82bd2532da6b 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -18,7 +18,6 @@ struct user_struct {
#ifdef CONFIG_EPOLL
atomic_long_t epoll_watches; /* The number of file descriptors currently watched */
#endif
- unsigned long locked_shm; /* How many pages of mlocked shm ? */
unsigned long unix_inflight; /* How many files in flight in unix sockets */
atomic_long_t pipe_bufs; /* how many pages are allocated in pipe buffers */

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index d82b6f396588..aa77dcd1646f 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -65,7 +65,7 @@ extern struct file *shmem_file_setup_with_mnt(struct vfsmount *mnt,
extern int shmem_zero_setup(struct vm_area_struct *);
extern unsigned long shmem_get_unmapped_area(struct file *, unsigned long addr,
unsigned long len, unsigned long pgoff, unsigned long flags);
-extern int shmem_lock(struct file *file, int lock, struct user_struct *user);
+extern int shmem_lock(struct file *file, int lock, struct ucounts *ucounts);
#ifdef CONFIG_SHMEM
extern const struct address_space_operations shmem_aops;
static inline bool shmem_mapping(struct address_space *mapping)
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 0e961bd8a090..e42efd0ae595 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -53,6 +53,7 @@ enum ucount_type {
UCOUNT_RLIMIT_NPROC,
UCOUNT_RLIMIT_MSGQUEUE,
UCOUNT_RLIMIT_SIGPENDING,
+ UCOUNT_RLIMIT_MEMLOCK,
UCOUNT_COUNTS,
};

diff --git a/ipc/shm.c b/ipc/shm.c
index febd88daba8c..003234fbbd17 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -60,7 +60,7 @@ struct shmid_kernel /* private to the kernel */
time64_t shm_ctim;
struct pid *shm_cprid;
struct pid *shm_lprid;
- struct user_struct *mlock_user;
+ struct ucounts *mlock_ucounts;

/* The task created the shm object. NULL if the task is dead. */
struct task_struct *shm_creator;
@@ -286,10 +286,10 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
shm_rmid(ns, shp);
shm_unlock(shp);
if (!is_file_hugepages(shm_file))
- shmem_lock(shm_file, 0, shp->mlock_user);
- else if (shp->mlock_user)
+ shmem_lock(shm_file, 0, shp->mlock_ucounts);
+ else if (shp->mlock_ucounts)
user_shm_unlock(i_size_read(file_inode(shm_file)),
- shp->mlock_user);
+ shp->mlock_ucounts);
fput(shm_file);
ipc_update_pid(&shp->shm_cprid, NULL);
ipc_update_pid(&shp->shm_lprid, NULL);
@@ -625,7 +625,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)

shp->shm_perm.key = key;
shp->shm_perm.mode = (shmflg & S_IRWXUGO);
- shp->mlock_user = NULL;
+ shp->mlock_ucounts = NULL;

shp->shm_perm.security = NULL;
error = security_shm_alloc(&shp->shm_perm);
@@ -650,7 +650,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
if (shmflg & SHM_NORESERVE)
acctflag = VM_NORESERVE;
file = hugetlb_file_setup(name, hugesize, acctflag,
- &shp->mlock_user, HUGETLB_SHMFS_INODE,
+ &shp->mlock_ucounts, HUGETLB_SHMFS_INODE,
(shmflg >> SHM_HUGE_SHIFT) & SHM_HUGE_MASK);
} else {
/*
@@ -698,8 +698,8 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
no_id:
ipc_update_pid(&shp->shm_cprid, NULL);
ipc_update_pid(&shp->shm_lprid, NULL);
- if (is_file_hugepages(file) && shp->mlock_user)
- user_shm_unlock(size, shp->mlock_user);
+ if (is_file_hugepages(file) && shp->mlock_ucounts)
+ user_shm_unlock(size, shp->mlock_ucounts);
fput(file);
ipc_rcu_putref(&shp->shm_perm, shm_rcu_free);
return error;
@@ -1105,12 +1105,12 @@ static int shmctl_do_lock(struct ipc_namespace *ns, int shmid, int cmd)
goto out_unlock0;

if (cmd == SHM_LOCK) {
- struct user_struct *user = current_user();
+ struct ucounts *ucounts = current_ucounts();

- err = shmem_lock(shm_file, 1, user);
+ err = shmem_lock(shm_file, 1, ucounts);
if (!err && !(shp->shm_perm.mode & SHM_LOCKED)) {
shp->shm_perm.mode |= SHM_LOCKED;
- shp->mlock_user = user;
+ shp->mlock_ucounts = ucounts;
}
goto out_unlock0;
}
@@ -1118,9 +1118,9 @@ static int shmctl_do_lock(struct ipc_namespace *ns, int shmid, int cmd)
/* SHM_UNLOCK */
if (!(shp->shm_perm.mode & SHM_LOCKED))
goto out_unlock0;
- shmem_lock(shm_file, 0, shp->mlock_user);
+ shmem_lock(shm_file, 0, shp->mlock_ucounts);
shp->shm_perm.mode &= ~SHM_LOCKED;
- shp->mlock_user = NULL;
+ shp->mlock_ucounts = NULL;
get_file(shm_file);
ipc_unlock_object(&shp->shm_perm);
rcu_read_unlock();
diff --git a/kernel/fork.c b/kernel/fork.c
index 741f896c156e..a3a5e317c3c0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -825,6 +825,7 @@ void __init fork_init(void)
init_user_ns.ucount_max[UCOUNT_RLIMIT_NPROC] = task_rlimit(&init_task, RLIMIT_NPROC);
init_user_ns.ucount_max[UCOUNT_RLIMIT_MSGQUEUE] = task_rlimit(&init_task, RLIMIT_MSGQUEUE);
init_user_ns.ucount_max[UCOUNT_RLIMIT_SIGPENDING] = task_rlimit(&init_task, RLIMIT_SIGPENDING);
+ init_user_ns.ucount_max[UCOUNT_RLIMIT_MEMLOCK] = task_rlimit(&init_task, RLIMIT_MEMLOCK);

#ifdef CONFIG_VMAP_STACK
cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "fork:vm_stack_cache",
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 8ce62da6a62c..d316bac3e520 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -83,6 +83,7 @@ static struct ctl_table user_table[] = {
{ },
{ },
{ },
+ { },
{ }
};
#endif /* CONFIG_SYSCTL */
diff --git a/kernel/user.c b/kernel/user.c
index 6737327f83be..c82399c1618a 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -98,7 +98,6 @@ static DEFINE_SPINLOCK(uidhash_lock);
/* root_user.__count is 1, for init task cred */
struct user_struct root_user = {
.__count = REFCOUNT_INIT(1),
- .locked_shm = 0,
.uid = GLOBAL_ROOT_UID,
.ratelimit = RATELIMIT_STATE_INIT(root_user.ratelimit, 0, 0),
};
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index df1bed32dd48..5ef0d4b182ba 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -124,6 +124,7 @@ int create_user_ns(struct cred *new)
ns->ucount_max[UCOUNT_RLIMIT_NPROC] = rlimit(RLIMIT_NPROC);
ns->ucount_max[UCOUNT_RLIMIT_MSGQUEUE] = rlimit(RLIMIT_MSGQUEUE);
ns->ucount_max[UCOUNT_RLIMIT_SIGPENDING] = rlimit(RLIMIT_SIGPENDING);
+ ns->ucount_max[UCOUNT_RLIMIT_MEMLOCK] = rlimit(RLIMIT_MEMLOCK);
ns->ucounts = ucounts;

/* Inherit USERNS_SETGROUPS_ALLOWED from our parent */
diff --git a/mm/memfd.c b/mm/memfd.c
index 2647c898990c..081dd33e6a61 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -297,9 +297,9 @@ SYSCALL_DEFINE2(memfd_create,
}

if (flags & MFD_HUGETLB) {
- struct user_struct *user = NULL;
+ struct ucounts *ucounts = NULL;

- file = hugetlb_file_setup(name, 0, VM_NORESERVE, &user,
+ file = hugetlb_file_setup(name, 0, VM_NORESERVE, &ucounts,
HUGETLB_ANONHUGE_INODE,
(flags >> MFD_HUGE_SHIFT) &
MFD_HUGE_MASK);
diff --git a/mm/mlock.c b/mm/mlock.c
index f8f8cc32d03d..dd411aabf695 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -817,9 +817,10 @@ SYSCALL_DEFINE0(munlockall)
*/
static DEFINE_SPINLOCK(shmlock_user_lock);

-int user_shm_lock(size_t size, struct user_struct *user)
+int user_shm_lock(size_t size, struct ucounts *ucounts)
{
unsigned long lock_limit, locked;
+ long memlock;
int allowed = 0;

locked = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
@@ -828,21 +829,26 @@ int user_shm_lock(size_t size, struct user_struct *user)
allowed = 1;
lock_limit >>= PAGE_SHIFT;
spin_lock(&shmlock_user_lock);
- if (!allowed &&
- locked + user->locked_shm > lock_limit && !capable(CAP_IPC_LOCK))
+ memlock = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_MEMLOCK, locked);
+
+ if (!allowed && (memlock == LONG_MAX || memlock > lock_limit) && !capable(CAP_IPC_LOCK)) {
+ dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_MEMLOCK, locked);
+ goto out;
+ }
+ if (!get_ucounts(ucounts)) {
+ dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_MEMLOCK, locked);
goto out;
- get_uid(user);
- user->locked_shm += locked;
+ }
allowed = 1;
out:
spin_unlock(&shmlock_user_lock);
return allowed;
}

-void user_shm_unlock(size_t size, struct user_struct *user)
+void user_shm_unlock(size_t size, struct ucounts *ucounts)
{
spin_lock(&shmlock_user_lock);
- user->locked_shm -= (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_MEMLOCK, (size + PAGE_SIZE - 1) >> PAGE_SHIFT);
spin_unlock(&shmlock_user_lock);
- free_uid(user);
+ put_ucounts(ucounts);
}
diff --git a/mm/mmap.c b/mm/mmap.c
index 3f287599a7a3..99f97d200aa4 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1605,7 +1605,7 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
goto out_fput;
}
} else if (flags & MAP_HUGETLB) {
- struct user_struct *user = NULL;
+ struct ucounts *ucounts = NULL;
struct hstate *hs;

hs = hstate_sizelog((flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
@@ -1621,7 +1621,7 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
*/
file = hugetlb_file_setup(HUGETLB_ANON_FILE, len,
VM_NORESERVE,
- &user, HUGETLB_ANONHUGE_INODE,
+ &ucounts, HUGETLB_ANONHUGE_INODE,
(flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
if (IS_ERR(file))
return PTR_ERR(file);
diff --git a/mm/shmem.c b/mm/shmem.c
index b2db4ed0fbc7..7ee6d27222e9 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2227,7 +2227,7 @@ static struct mempolicy *shmem_get_policy(struct vm_area_struct *vma,
}
#endif

-int shmem_lock(struct file *file, int lock, struct user_struct *user)
+int shmem_lock(struct file *file, int lock, struct ucounts *ucounts)
{
struct inode *inode = file_inode(file);
struct shmem_inode_info *info = SHMEM_I(inode);
@@ -2239,13 +2239,13 @@ int shmem_lock(struct file *file, int lock, struct user_struct *user)
* no serialization needed when called from shm_destroy().
*/
if (lock && !(info->flags & VM_LOCKED)) {
- if (!user_shm_lock(inode->i_size, user))
+ if (!user_shm_lock(inode->i_size, ucounts))
goto out_nomem;
info->flags |= VM_LOCKED;
mapping_set_unevictable(file->f_mapping);
}
- if (!lock && (info->flags & VM_LOCKED) && user) {
- user_shm_unlock(inode->i_size, user);
+ if (!lock && (info->flags & VM_LOCKED) && ucounts) {
+ user_shm_unlock(inode->i_size, ucounts);
info->flags &= ~VM_LOCKED;
mapping_clear_unevictable(file->f_mapping);
}
@@ -4093,7 +4093,7 @@ int shmem_unuse(unsigned int type, bool frontswap,
return 0;
}

-int shmem_lock(struct file *file, int lock, struct user_struct *user)
+int shmem_lock(struct file *file, int lock, struct ucounts *ucounts)
{
return 0;
}
--
2.29.3

2021-04-22 12:30:52

by Alexey Gladkov

[permalink] [raw]
Subject: [PATCH v11 9/9] ucounts: Set ucount_max to the largest positive value the type can hold

From: Alexey Gladkov <[email protected]>

The ns->ucount_max[] is signed long which is less than the rlimit size.
We have to protect ucount_max[] from overflow and only use the largest
value that we can hold.

On 32bit using "long" instead of "unsigned long" to hold the counts has
the downside that RLIMIT_MSGQUEUE and RLIMIT_MEMLOCK are limited to 2GiB
instead of 4GiB. I don't think anyone cares but it should be mentioned
in case someone does.

The RLIMIT_NPROC and RLIMIT_SIGPENDING used atomic_t so their maximum
hasn't changed.

Signed-off-by: Alexey Gladkov <[email protected]>
---
include/linux/user_namespace.h | 6 ++++++
kernel/fork.c | 8 ++++----
kernel/user_namespace.c | 8 ++++----
3 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index e42efd0ae595..2409fd57fd61 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -122,6 +122,12 @@ long inc_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v);
bool dec_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v);
bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, unsigned long max);

+static inline void set_rlimit_ucount_max(struct user_namespace *ns,
+ enum ucount_type type, unsigned long max)
+{
+ ns->ucount_max[type] = max <= LONG_MAX ? max : LONG_MAX;
+}
+
#ifdef CONFIG_USER_NS

static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
diff --git a/kernel/fork.c b/kernel/fork.c
index a3a5e317c3c0..2cd01c443196 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -822,10 +822,10 @@ void __init fork_init(void)
for (i = 0; i < MAX_PER_NAMESPACE_UCOUNTS; i++)
init_user_ns.ucount_max[i] = max_threads/2;

- init_user_ns.ucount_max[UCOUNT_RLIMIT_NPROC] = task_rlimit(&init_task, RLIMIT_NPROC);
- init_user_ns.ucount_max[UCOUNT_RLIMIT_MSGQUEUE] = task_rlimit(&init_task, RLIMIT_MSGQUEUE);
- init_user_ns.ucount_max[UCOUNT_RLIMIT_SIGPENDING] = task_rlimit(&init_task, RLIMIT_SIGPENDING);
- init_user_ns.ucount_max[UCOUNT_RLIMIT_MEMLOCK] = task_rlimit(&init_task, RLIMIT_MEMLOCK);
+ set_rlimit_ucount_max(&init_user_ns, UCOUNT_RLIMIT_NPROC, task_rlimit(&init_task, RLIMIT_NPROC));
+ set_rlimit_ucount_max(&init_user_ns, UCOUNT_RLIMIT_MSGQUEUE, task_rlimit(&init_task, RLIMIT_MSGQUEUE));
+ set_rlimit_ucount_max(&init_user_ns, UCOUNT_RLIMIT_SIGPENDING, task_rlimit(&init_task, RLIMIT_SIGPENDING));
+ set_rlimit_ucount_max(&init_user_ns, UCOUNT_RLIMIT_MEMLOCK, task_rlimit(&init_task, RLIMIT_MEMLOCK));

#ifdef CONFIG_VMAP_STACK
cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "fork:vm_stack_cache",
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 5ef0d4b182ba..df7651935fd5 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -121,10 +121,10 @@ int create_user_ns(struct cred *new)
for (i = 0; i < MAX_PER_NAMESPACE_UCOUNTS; i++) {
ns->ucount_max[i] = INT_MAX;
}
- ns->ucount_max[UCOUNT_RLIMIT_NPROC] = rlimit(RLIMIT_NPROC);
- ns->ucount_max[UCOUNT_RLIMIT_MSGQUEUE] = rlimit(RLIMIT_MSGQUEUE);
- ns->ucount_max[UCOUNT_RLIMIT_SIGPENDING] = rlimit(RLIMIT_SIGPENDING);
- ns->ucount_max[UCOUNT_RLIMIT_MEMLOCK] = rlimit(RLIMIT_MEMLOCK);
+ set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC));
+ set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MSGQUEUE, rlimit(RLIMIT_MSGQUEUE));
+ set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_SIGPENDING, rlimit(RLIMIT_SIGPENDING));
+ set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MEMLOCK, rlimit(RLIMIT_MEMLOCK));
ns->ucounts = ucounts;

/* Inherit USERNS_SETGROUPS_ALLOWED from our parent */
--
2.29.3

2021-04-22 12:31:00

by Alexey Gladkov

[permalink] [raw]
Subject: [PATCH v11 8/9] kselftests: Add test to check for rlimit changes in different user namespaces

From: Alexey Gladkov <[email protected]>

The testcase runs few instances of the program with RLIMIT_NPROC=1 from
user uid=60000, in different user namespaces.

Signed-off-by: Alexey Gladkov <[email protected]>
---
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/rlimits/.gitignore | 2 +
tools/testing/selftests/rlimits/Makefile | 6 +
tools/testing/selftests/rlimits/config | 1 +
.../selftests/rlimits/rlimits-per-userns.c | 161 ++++++++++++++++++
5 files changed, 171 insertions(+)
create mode 100644 tools/testing/selftests/rlimits/.gitignore
create mode 100644 tools/testing/selftests/rlimits/Makefile
create mode 100644 tools/testing/selftests/rlimits/config
create mode 100644 tools/testing/selftests/rlimits/rlimits-per-userns.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 6c575cf34a71..a4ea1481bd9a 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -48,6 +48,7 @@ TARGETS += proc
TARGETS += pstore
TARGETS += ptrace
TARGETS += openat2
+TARGETS += rlimits
TARGETS += rseq
TARGETS += rtc
TARGETS += seccomp
diff --git a/tools/testing/selftests/rlimits/.gitignore b/tools/testing/selftests/rlimits/.gitignore
new file mode 100644
index 000000000000..091021f255b3
--- /dev/null
+++ b/tools/testing/selftests/rlimits/.gitignore
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+rlimits-per-userns
diff --git a/tools/testing/selftests/rlimits/Makefile b/tools/testing/selftests/rlimits/Makefile
new file mode 100644
index 000000000000..03aadb406212
--- /dev/null
+++ b/tools/testing/selftests/rlimits/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+
+CFLAGS += -Wall -O2 -g
+TEST_GEN_PROGS := rlimits-per-userns
+
+include ../lib.mk
diff --git a/tools/testing/selftests/rlimits/config b/tools/testing/selftests/rlimits/config
new file mode 100644
index 000000000000..416bd53ce982
--- /dev/null
+++ b/tools/testing/selftests/rlimits/config
@@ -0,0 +1 @@
+CONFIG_USER_NS=y
diff --git a/tools/testing/selftests/rlimits/rlimits-per-userns.c b/tools/testing/selftests/rlimits/rlimits-per-userns.c
new file mode 100644
index 000000000000..26dc949e93ea
--- /dev/null
+++ b/tools/testing/selftests/rlimits/rlimits-per-userns.c
@@ -0,0 +1,161 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Alexey Gladkov <[email protected]>
+ */
+#define _GNU_SOURCE
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <sys/time.h>
+#include <sys/resource.h>
+#include <sys/prctl.h>
+#include <sys/stat.h>
+
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <sched.h>
+#include <signal.h>
+#include <limits.h>
+#include <fcntl.h>
+#include <errno.h>
+#include <err.h>
+
+#define NR_CHILDS 2
+
+static char *service_prog;
+static uid_t user = 60000;
+static uid_t group = 60000;
+
+static void setrlimit_nproc(rlim_t n)
+{
+ pid_t pid = getpid();
+ struct rlimit limit = {
+ .rlim_cur = n,
+ .rlim_max = n
+ };
+
+ warnx("(pid=%d): Setting RLIMIT_NPROC=%ld", pid, n);
+
+ if (setrlimit(RLIMIT_NPROC, &limit) < 0)
+ err(EXIT_FAILURE, "(pid=%d): setrlimit(RLIMIT_NPROC)", pid);
+}
+
+static pid_t fork_child(void)
+{
+ pid_t pid = fork();
+
+ if (pid < 0)
+ err(EXIT_FAILURE, "fork");
+
+ if (pid > 0)
+ return pid;
+
+ pid = getpid();
+
+ warnx("(pid=%d): New process starting ...", pid);
+
+ if (prctl(PR_SET_PDEATHSIG, SIGKILL) < 0)
+ err(EXIT_FAILURE, "(pid=%d): prctl(PR_SET_PDEATHSIG)", pid);
+
+ signal(SIGUSR1, SIG_DFL);
+
+ warnx("(pid=%d): Changing to uid=%d, gid=%d", pid, user, group);
+
+ if (setgid(group) < 0)
+ err(EXIT_FAILURE, "(pid=%d): setgid(%d)", pid, group);
+ if (setuid(user) < 0)
+ err(EXIT_FAILURE, "(pid=%d): setuid(%d)", pid, user);
+
+ warnx("(pid=%d): Service running ...", pid);
+
+ warnx("(pid=%d): Unshare user namespace", pid);
+ if (unshare(CLONE_NEWUSER) < 0)
+ err(EXIT_FAILURE, "unshare(CLONE_NEWUSER)");
+
+ char *const argv[] = { "service", NULL };
+ char *const envp[] = { "I_AM_SERVICE=1", NULL };
+
+ warnx("(pid=%d): Executing real service ...", pid);
+
+ execve(service_prog, argv, envp);
+ err(EXIT_FAILURE, "(pid=%d): execve", pid);
+}
+
+int main(int argc, char **argv)
+{
+ size_t i;
+ pid_t child[NR_CHILDS];
+ int wstatus[NR_CHILDS];
+ int childs = NR_CHILDS;
+ pid_t pid;
+
+ if (getenv("I_AM_SERVICE")) {
+ pause();
+ exit(EXIT_SUCCESS);
+ }
+
+ service_prog = argv[0];
+ pid = getpid();
+
+ warnx("(pid=%d) Starting testcase", pid);
+
+ /*
+ * This rlimit is not a problem for root because it can be exceeded.
+ */
+ setrlimit_nproc(1);
+
+ for (i = 0; i < NR_CHILDS; i++) {
+ child[i] = fork_child();
+ wstatus[i] = 0;
+ usleep(250000);
+ }
+
+ while (1) {
+ for (i = 0; i < NR_CHILDS; i++) {
+ if (child[i] <= 0)
+ continue;
+
+ errno = 0;
+ pid_t ret = waitpid(child[i], &wstatus[i], WNOHANG);
+
+ if (!ret || (!WIFEXITED(wstatus[i]) && !WIFSIGNALED(wstatus[i])))
+ continue;
+
+ if (ret < 0 && errno != ECHILD)
+ warn("(pid=%d): waitpid(%d)", pid, child[i]);
+
+ child[i] *= -1;
+ childs -= 1;
+ }
+
+ if (!childs)
+ break;
+
+ usleep(250000);
+
+ for (i = 0; i < NR_CHILDS; i++) {
+ if (child[i] <= 0)
+ continue;
+ kill(child[i], SIGUSR1);
+ }
+ }
+
+ for (i = 0; i < NR_CHILDS; i++) {
+ if (WIFEXITED(wstatus[i]))
+ warnx("(pid=%d): pid %d exited, status=%d",
+ pid, -child[i], WEXITSTATUS(wstatus[i]));
+ else if (WIFSIGNALED(wstatus[i]))
+ warnx("(pid=%d): pid %d killed by signal %d",
+ pid, -child[i], WTERMSIG(wstatus[i]));
+
+ if (WIFSIGNALED(wstatus[i]) && WTERMSIG(wstatus[i]) == SIGUSR1)
+ continue;
+
+ warnx("(pid=%d): Test failed", pid);
+ exit(EXIT_FAILURE);
+ }
+
+ warnx("(pid=%d): Test passed", pid);
+ exit(EXIT_SUCCESS);
+}
--
2.29.3

2021-05-10 01:14:33

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v11 0/9] Count rlimits in each user namespace

On Thu, 22 Apr 2021 14:27:07 +0200 [email protected] wrote:

> These patches are for binding the rlimit counters to a user in user namespace.

It's at v11 and no there has been no acking or reviewing activity? Or
have you not been tracking these?

2021-05-12 13:21:10

by Alexey Gladkov

[permalink] [raw]
Subject: Re: [PATCH v11 0/9] Count rlimits in each user namespace

On Sun, May 09, 2021 at 06:12:05PM -0700, Andrew Morton wrote:
> On Thu, 22 Apr 2021 14:27:07 +0200 [email protected] wrote:
>
> > These patches are for binding the rlimit counters to a user in user namespace.
>
> It's at v11 and no there has been no acking or reviewing activity? Or
> have you not been tracking these?

Eric W. Biederman told me he's ready to pick it [1]. But as far as I can
see he hasn't put it in for-next yet.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git/log/?h=ucounts-rlimits-for-v5.13

--
Rgrds, legion

2021-06-02 20:39:28

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v11 0/9] Count rlimits in each user namespace

Andrew Morton <[email protected]> writes:

> On Thu, 22 Apr 2021 14:27:07 +0200 [email protected] wrote:
>
>> These patches are for binding the rlimit counters to a user in user namespace.
>
> It's at v11 and no there has been no acking or reviewing activity? Or
> have you not been tracking these?

Most of the reviews were noticing things that needed to be changed.

For the ack/review by tags I am the one reviewing it and merging the
change so I guess I didn't give Alex any Acked-by or Reviewed-by
tags.

Regardless the changes are sitting in linux-next now and seem to be
doing fine.

Eric

2021-08-17 04:08:28

by Ma Xinjian

[permalink] [raw]
Subject: Re: [PATCH v11 5/9] Reimplement RLIMIT_MSGQUEUE on top of ucounts

Hi Alexey,

When lkp team run kernel selftests, we found after these series of patches, testcase mqueue: mq_perf_tests
in kselftest failed with following message.

If you confirm and fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot [email protected]

```
# selftests: mqueue: mq_perf_tests
#
# Initial system state:
# Using queue path: /mq_perf_tests
# RLIMIT_MSGQUEUE(soft): 819200
# RLIMIT_MSGQUEUE(hard): 819200
# Maximum Message Size: 8192
# Maximum Queue Size: 10
# Nice value: 0
#
# Adjusted system state for testing:
# RLIMIT_MSGQUEUE(soft): (unlimited)
# RLIMIT_MSGQUEUE(hard): (unlimited)
# Maximum Message Size: 16777216
# Maximum Queue Size: 65530
# Nice value: -20
# Continuous mode: (disabled)
# CPUs to pin: 3
# ./mq_perf_tests: mq_open() at 296: Too many open files
not ok 2 selftests: mqueue: mq_perf_tests # exit=1
```

Test env:
rootfs: debian-10
gcc version: 9

------
Thanks
Ma Xinjian

2021-08-17 15:56:10

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v11 5/9] Reimplement RLIMIT_MSGQUEUE on top of ucounts

"Ma, XinjianX" <[email protected]> writes:

> Hi Alexey,
>
> When lkp team run kernel selftests, we found after these series of patches, testcase mqueue: mq_perf_tests
> in kselftest failed with following message.

Which kernel was this run against?

Where can the mq_perf_tests that you ran and had problems with be found?

During your run were you using user namespaces as part of your test
environment?

The error message too many files corresponds to the error code EMFILES
which is the error code that is returned when the rlimit is reached.

One possibility is that your test environment was run in a user
namespace and so you wound up limited by rlimit of the user who created
the user namespace at the point of user namespace creation.

At this point if you can give us enough information to look into this
and attempt to reproduce it that would be appreciated.

> If you confirm and fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot [email protected]
>
> ```
> # selftests: mqueue: mq_perf_tests
> #
> # Initial system state:
> # Using queue path: /mq_perf_tests
> # RLIMIT_MSGQUEUE(soft): 819200
> # RLIMIT_MSGQUEUE(hard): 819200
> # Maximum Message Size: 8192
> # Maximum Queue Size: 10
> # Nice value: 0
> #
> # Adjusted system state for testing:
> # RLIMIT_MSGQUEUE(soft): (unlimited)
> # RLIMIT_MSGQUEUE(hard): (unlimited)
> # Maximum Message Size: 16777216
> # Maximum Queue Size: 65530
> # Nice value: -20
> # Continuous mode: (disabled)
> # CPUs to pin: 3
> # ./mq_perf_tests: mq_open() at 296: Too many open files
> not ok 2 selftests: mqueue: mq_perf_tests # exit=1
> ```
>
> Test env:
> rootfs: debian-10
> gcc version: 9
>
> ------
> Thanks
> Ma Xinjian

Eric

2021-08-18 13:14:46

by Alexey Gladkov

[permalink] [raw]
Subject: Re: [PATCH v11 5/9] Reimplement RLIMIT_MSGQUEUE on top of ucounts

On Tue, Aug 17, 2021 at 10:47:14AM -0500, Eric W. Biederman wrote:
> "Ma, XinjianX" <[email protected]> writes:
>
> > Hi Alexey,
> >
> > When lkp team run kernel selftests, we found after these series of patches, testcase mqueue: mq_perf_tests
> > in kselftest failed with following message.
>
> Which kernel was this run against?
>
> Where can the mq_perf_tests that you ran and had problems with be found?
>
> During your run were you using user namespaces as part of your test
> environment?
>
> The error message too many files corresponds to the error code EMFILES
> which is the error code that is returned when the rlimit is reached.
>
> One possibility is that your test environment was run in a user
> namespace and so you wound up limited by rlimit of the user who created
> the user namespace at the point of user namespace creation.
>
> At this point if you can give us enough information to look into this
> and attempt to reproduce it that would be appreciated.

I was able to reproduce it on master without using user namespace.
I suspect that the maximum value is not assigned here [1]:

set_rlimit_ucount_max(&init_user_ns, UCOUNT_RLIMIT_MSGQUEUE, task_rlimit(&init_task, RLIMIT_MSGQUEUE));

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/fork.c#n832

> > If you confirm and fix the issue, kindly add following tag as appropriate
> > Reported-by: kernel test robot [email protected]
> >
> > ```
> > # selftests: mqueue: mq_perf_tests
> > #
> > # Initial system state:
> > # Using queue path: /mq_perf_tests
> > # RLIMIT_MSGQUEUE(soft): 819200
> > # RLIMIT_MSGQUEUE(hard): 819200
> > # Maximum Message Size: 8192
> > # Maximum Queue Size: 10
> > # Nice value: 0
> > #
> > # Adjusted system state for testing:
> > # RLIMIT_MSGQUEUE(soft): (unlimited)
> > # RLIMIT_MSGQUEUE(hard): (unlimited)
> > # Maximum Message Size: 16777216
> > # Maximum Queue Size: 65530
> > # Nice value: -20
> > # Continuous mode: (disabled)
> > # CPUs to pin: 3
> > # ./mq_perf_tests: mq_open() at 296: Too many open files
> > not ok 2 selftests: mqueue: mq_perf_tests # exit=1
> > ```
> >
> > Test env:
> > rootfs: debian-10
> > gcc version: 9
> >
> > ------
> > Thanks
> > Ma Xinjian
>
> Eric
>

--
Rgrds, legion

2021-08-19 01:52:22

by Ma Xinjian

[permalink] [raw]
Subject: RE: [PATCH v11 5/9] Reimplement RLIMIT_MSGQUEUE on top of ucounts



> -----Original Message-----
> From: Alexey Gladkov <[email protected]>
> Sent: Wednesday, August 18, 2021 9:11 PM
> To: Eric W. Biederman <[email protected]>
> Cc: Ma, XinjianX <[email protected]>; [email protected];
> lkp <[email protected]>; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; kernel-
> [email protected]; [email protected]; linux-
> [email protected]; [email protected]; [email protected]
> Subject: Re: [PATCH v11 5/9] Reimplement RLIMIT_MSGQUEUE on top of
> ucounts
>
> On Tue, Aug 17, 2021 at 10:47:14AM -0500, Eric W. Biederman wrote:
> > "Ma, XinjianX" <[email protected]> writes:
> >
> > > Hi Alexey,
> > >
> > > When lkp team run kernel selftests, we found after these series of
> > > patches, testcase mqueue: mq_perf_tests in kselftest failed with
> following message.
> >
> > Which kernel was this run against?
> >
> > Where can the mq_perf_tests that you ran and had problems with be
> found?
> >
> > During your run were you using user namespaces as part of your test
> > environment?
> >
> > The error message too many files corresponds to the error code EMFILES
> > which is the error code that is returned when the rlimit is reached.
> >
> > One possibility is that your test environment was run in a user
> > namespace and so you wound up limited by rlimit of the user who
> > created the user namespace at the point of user namespace creation.
> >
> > At this point if you can give us enough information to look into this
> > and attempt to reproduce it that would be appreciated.
>
> I was able to reproduce it on master without using user namespace.
> I suspect that the maximum value is not assigned here [1]:
>
> set_rlimit_ucount_max(&init_user_ns, UCOUNT_RLIMIT_MSGQUEUE,
> task_rlimit(&init_task, RLIMIT_MSGQUEUE));
>
> [1]
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kerne
> l/fork.c#n832
Thank you for confirming the issue. And will you plan to fix this issue?
If it's your plan, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>


>
> > > If you confirm and fix the issue, kindly add following tag as
> > > appropriate
> > > Reported-by: kernel test robot [email protected]
> > >
> > > ```
> > > # selftests: mqueue: mq_perf_tests
> > > #
> > > # Initial system state:
> > > # Using queue path: /mq_perf_tests
> > > # RLIMIT_MSGQUEUE(soft): 819200
> > > # RLIMIT_MSGQUEUE(hard): 819200
> > > # Maximum Message Size: 8192
> > > # Maximum Queue Size: 10
> > > # Nice value: 0
> > > #
> > > # Adjusted system state for testing:
> > > # RLIMIT_MSGQUEUE(soft): (unlimited)
> > > # RLIMIT_MSGQUEUE(hard): (unlimited)
> > > # Maximum Message Size: 16777216
> > > # Maximum Queue Size: 65530
> > > # Nice value: -20
> > > # Continuous mode: (disabled)
> > > # CPUs to pin: 3
> > > # ./mq_perf_tests: mq_open() at 296: Too many open files not ok 2
> > > selftests: mqueue: mq_perf_tests # exit=1
> > > ```
> > >
> > > Test env:
> > > rootfs: debian-10
> > > gcc version: 9
> > >
> > > ------
> > > Thanks
> > > Ma Xinjian
> >
> > Eric
> >
>
> --
> Rgrds, legion

2021-08-19 15:12:16

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v11 5/9] Reimplement RLIMIT_MSGQUEUE on top of ucounts

Alexey Gladkov <[email protected]> writes:

> On Tue, Aug 17, 2021 at 10:47:14AM -0500, Eric W. Biederman wrote:
>> "Ma, XinjianX" <[email protected]> writes:
>>
>> > Hi Alexey,
>> >
>> > When lkp team run kernel selftests, we found after these series of patches, testcase mqueue: mq_perf_tests
>> > in kselftest failed with following message.
>>
>> Which kernel was this run against?
>>
>> Where can the mq_perf_tests that you ran and had problems with be found?
>>
>> During your run were you using user namespaces as part of your test
>> environment?
>>
>> The error message too many files corresponds to the error code EMFILES
>> which is the error code that is returned when the rlimit is reached.
>>
>> One possibility is that your test environment was run in a user
>> namespace and so you wound up limited by rlimit of the user who created
>> the user namespace at the point of user namespace creation.
>>
>> At this point if you can give us enough information to look into this
>> and attempt to reproduce it that would be appreciated.
>
> I was able to reproduce it on master without using user namespace.
> I suspect that the maximum value is not assigned here [1]:
>
> set_rlimit_ucount_max(&init_user_ns, UCOUNT_RLIMIT_MSGQUEUE, task_rlimit(&init_task, RLIMIT_MSGQUEUE));
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/fork.c#n832

The rlimits for init_task are set to INIT_RLIMITS.
In INIT_RLIMITS RLIMIT_MSGQUEUE is set to MQ_MAX_BYTES

So that definitely means that as the code is current constructed the
rlimit can not be effectively raised.

So it looks like we are just silly and preventing the initial rlimits
from being raised.

So we probably want to do something like:

diff --git a/kernel/fork.c b/kernel/fork.c
index bc94b2cc5995..557ce0083ba3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -825,13 +825,13 @@ void __init fork_init(void)
init_task.signal->rlim[RLIMIT_SIGPENDING] =
init_task.signal->rlim[RLIMIT_NPROC];

+ /* For non-rlimit ucounts make their default limit max_threads/2 */
for (i = 0; i < MAX_PER_NAMESPACE_UCOUNTS; i++)
init_user_ns.ucount_max[i] = max_threads/2;

- set_rlimit_ucount_max(&init_user_ns, UCOUNT_RLIMIT_NPROC, task_rlimit(&init_task, RLIMIT_NPROC));
- set_rlimit_ucount_max(&init_user_ns, UCOUNT_RLIMIT_MSGQUEUE, task_rlimit(&init_task, RLIMIT_MSGQUEUE));
- set_rlimit_ucount_max(&init_user_ns, UCOUNT_RLIMIT_SIGPENDING, task_rlimit(&init_task, RLIMIT_SIGPENDING));
- set_rlimit_ucount_max(&init_user_ns, UCOUNT_RLIMIT_MEMLOCK, task_rlimit(&init_task, RLIMIT_MEMLOCK));
+ /* In init_user_ns default rlimit to be the only limit */
+ for (; i < UCOUNT_COUNTS; i++)
+ set_rlimit_ucount_max(&init_user_ns, i, RLIMIT_INFINITY);

#ifdef CONFIG_VMAP_STACK
cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "fork:vm_stack_cache",


Eric

2021-08-19 17:28:33

by Alexey Gladkov

[permalink] [raw]
Subject: Re: [PATCH v11 5/9] Reimplement RLIMIT_MSGQUEUE on top of ucounts

On Thu, Aug 19, 2021 at 10:10:26AM -0500, Eric W. Biederman wrote:
> Alexey Gladkov <[email protected]> writes:
>
> > On Tue, Aug 17, 2021 at 10:47:14AM -0500, Eric W. Biederman wrote:
> >> "Ma, XinjianX" <[email protected]> writes:
> >>
> >> > Hi Alexey,
> >> >
> >> > When lkp team run kernel selftests, we found after these series of patches, testcase mqueue: mq_perf_tests
> >> > in kselftest failed with following message.
> >>
> >> Which kernel was this run against?
> >>
> >> Where can the mq_perf_tests that you ran and had problems with be found?
> >>
> >> During your run were you using user namespaces as part of your test
> >> environment?
> >>
> >> The error message too many files corresponds to the error code EMFILES
> >> which is the error code that is returned when the rlimit is reached.
> >>
> >> One possibility is that your test environment was run in a user
> >> namespace and so you wound up limited by rlimit of the user who created
> >> the user namespace at the point of user namespace creation.
> >>
> >> At this point if you can give us enough information to look into this
> >> and attempt to reproduce it that would be appreciated.
> >
> > I was able to reproduce it on master without using user namespace.
> > I suspect that the maximum value is not assigned here [1]:
> >
> > set_rlimit_ucount_max(&init_user_ns, UCOUNT_RLIMIT_MSGQUEUE, task_rlimit(&init_task, RLIMIT_MSGQUEUE));
> >
> > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/fork.c#n832
>
> The rlimits for init_task are set to INIT_RLIMITS.
> In INIT_RLIMITS RLIMIT_MSGQUEUE is set to MQ_MAX_BYTES
>
> So that definitely means that as the code is current constructed the
> rlimit can not be effectively raised.
>
> So it looks like we are just silly and preventing the initial rlimits
> from being raised.
>
> So we probably want to do something like:

Damn, you are faster than me! :)

> diff --git a/kernel/fork.c b/kernel/fork.c
> index bc94b2cc5995..557ce0083ba3 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -825,13 +825,13 @@ void __init fork_init(void)
> init_task.signal->rlim[RLIMIT_SIGPENDING] =
> init_task.signal->rlim[RLIMIT_NPROC];
>
> + /* For non-rlimit ucounts make their default limit max_threads/2 */
> for (i = 0; i < MAX_PER_NAMESPACE_UCOUNTS; i++)
> init_user_ns.ucount_max[i] = max_threads/2;
>
> - set_rlimit_ucount_max(&init_user_ns, UCOUNT_RLIMIT_NPROC, task_rlimit(&init_task, RLIMIT_NPROC));
> - set_rlimit_ucount_max(&init_user_ns, UCOUNT_RLIMIT_MSGQUEUE, task_rlimit(&init_task, RLIMIT_MSGQUEUE));
> - set_rlimit_ucount_max(&init_user_ns, UCOUNT_RLIMIT_SIGPENDING, task_rlimit(&init_task, RLIMIT_SIGPENDING));
> - set_rlimit_ucount_max(&init_user_ns, UCOUNT_RLIMIT_MEMLOCK, task_rlimit(&init_task, RLIMIT_MEMLOCK));
> + /* In init_user_ns default rlimit to be the only limit */
> + for (; i < UCOUNT_COUNTS; i++)
> + set_rlimit_ucount_max(&init_user_ns, i, RLIMIT_INFINITY);

s/RLIMIT_INFINITY/RLIM_INFINITY/

>
> #ifdef CONFIG_VMAP_STACK
> cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "fork:vm_stack_cache",
>

Acked-by: Alexey Gladkov <[email protected]>

I cannot complete this test on my laptop. On 4Gb, the test ends with
oom-killer. But with this patch, the test definitely passes the moment of
the previous fall.

--
Rgrds, legion

2021-08-23 21:11:43

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH] ucounts: Fix regression preventing increasing of rlimits in init_user_ns


"Ma, XinjianX" <[email protected]> reported:

> When lkp team run kernel selftests, we found after these series of patches, testcase mqueue: mq_perf_tests
> in kselftest failed with following message.
>
> # selftests: mqueue: mq_perf_tests
> #
> # Initial system state:
> # Using queue path: /mq_perf_tests
> # RLIMIT_MSGQUEUE(soft): 819200
> # RLIMIT_MSGQUEUE(hard): 819200
> # Maximum Message Size: 8192
> # Maximum Queue Size: 10
> # Nice value: 0
> #
> # Adjusted system state for testing:
> # RLIMIT_MSGQUEUE(soft): (unlimited)
> # RLIMIT_MSGQUEUE(hard): (unlimited)
> # Maximum Message Size: 16777216
> # Maximum Queue Size: 65530
> # Nice value: -20
> # Continuous mode: (disabled)
> # CPUs to pin: 3
> # ./mq_perf_tests: mq_open() at 296: Too many open files
> not ok 2 selftests: mqueue: mq_perf_tests # exit=1
> ```
>
> Test env:
> rootfs: debian-10
> gcc version: 9

After investigation the problem turned out to be that ucount_max for
the rlimits in init_user_ns was being set to the initial rlimit value.
The practical problem is that ucount_max provides a limit that
applications inside the user namespace can not exceed. Which means in
practice that rlimits that have been converted to use the ucount
infrastructure were not able to exceend their initial rlimits.

Solve this by setting the relevant values of ucount_max to
RLIM_INIFINITY. A limit in init_user_ns is pointless so the code
should allow the values to grow as large as possible without riscking
an underflow or an overflow.

As the ltp test case was a bit of a pain I have reproduced the rlimit failure
and tested the fix with the following little C program:
> #include <stdio.h>
> #include <fcntl.h>
> #include <sys/stat.h>
> #include <mqueue.h>
> #include <sys/time.h>
> #include <sys/resource.h>
> #include <errno.h>
> #include <string.h>
> #include <stdlib.h>
> #include <limits.h>
> #include <unistd.h>
>
> int main(int argc, char **argv)
> {
> struct mq_attr mq_attr;
> struct rlimit rlim;
> mqd_t mqd;
> int ret;
>
> ret = getrlimit(RLIMIT_MSGQUEUE, &rlim);
> if (ret != 0) {
> fprintf(stderr, "getrlimit(RLIMIT_MSGQUEUE) failed: %s\n", strerror(errno));
> exit(EXIT_FAILURE);
> }
> printf("RLIMIT_MSGQUEUE %lu %lu\n",
> rlim.rlim_cur, rlim.rlim_max);
> rlim.rlim_cur = RLIM_INFINITY;
> rlim.rlim_max = RLIM_INFINITY;
> ret = setrlimit(RLIMIT_MSGQUEUE, &rlim);
> if (ret != 0) {
> fprintf(stderr, "setrlimit(RLIMIT_MSGQUEUE, RLIM_INFINITY) failed: %s\n", strerror(errno));
> exit(EXIT_FAILURE);
> }
>
> memset(&mq_attr, 0, sizeof(struct mq_attr));
> mq_attr.mq_maxmsg = 65536 - 1;
> mq_attr.mq_msgsize = 16*1024*1024 - 1;
>
> mqd = mq_open("/mq_rlimit_test", O_RDONLY|O_CREAT, 0600, &mq_attr);
> if (mqd == (mqd_t)-1) {
> fprintf(stderr, "mq_open failed: %s\n", strerror(errno));
> exit(EXIT_FAILURE);
> }
> ret = mq_close(mqd);
> if (ret) {
> fprintf(stderr, "mq_close failed; %s\n", strerror(errno));
> exit(EXIT_FAILURE);
> }
>
> return EXIT_SUCCESS;
> }

Fixes: 6e52a9f0532f ("Reimplement RLIMIT_MSGQUEUE on top of ucounts")
Fixes: d7c9e99aee48 ("Reimplement RLIMIT_MEMLOCK on top of ucounts")
Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
Reported-by: kernel test robot [email protected]
Acked-by: Alexey Gladkov <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
---

This is a simplified version of my previous change that I have tested
and will push out to linux-next and then to Linus shortly.

kernel/fork.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index bc94b2cc5995..44f4c2d83763 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -828,10 +828,10 @@ void __init fork_init(void)
for (i = 0; i < MAX_PER_NAMESPACE_UCOUNTS; i++)
init_user_ns.ucount_max[i] = max_threads/2;

- set_rlimit_ucount_max(&init_user_ns, UCOUNT_RLIMIT_NPROC, task_rlimit(&init_task, RLIMIT_NPROC));
- set_rlimit_ucount_max(&init_user_ns, UCOUNT_RLIMIT_MSGQUEUE, task_rlimit(&init_task, RLIMIT_MSGQUEUE));
- set_rlimit_ucount_max(&init_user_ns, UCOUNT_RLIMIT_SIGPENDING, task_rlimit(&init_task, RLIMIT_SIGPENDING));
- set_rlimit_ucount_max(&init_user_ns, UCOUNT_RLIMIT_MEMLOCK, task_rlimit(&init_task, RLIMIT_MEMLOCK));
+ set_rlimit_ucount_max(&init_user_ns, UCOUNT_RLIMIT_NPROC, RLIM_INFINITY);
+ set_rlimit_ucount_max(&init_user_ns, UCOUNT_RLIMIT_MSGQUEUE, RLIM_INFINITY);
+ set_rlimit_ucount_max(&init_user_ns, UCOUNT_RLIMIT_SIGPENDING, RLIM_INFINITY);
+ set_rlimit_ucount_max(&init_user_ns, UCOUNT_RLIMIT_MEMLOCK, RLIM_INFINITY);

#ifdef CONFIG_VMAP_STACK
cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "fork:vm_stack_cache",
--
2.20.1

2021-08-24 01:22:15

by Ma Xinjian

[permalink] [raw]
Subject: RE: [PATCH] ucounts: Fix regression preventing increasing of rlimits in init_user_ns



> -----Original Message-----
> From: Eric W. Biederman <[email protected]>
> Sent: Tuesday, August 24, 2021 5:07 AM
> To: Alexey Gladkov <[email protected]>
> Cc: Ma, XinjianX <[email protected]>; [email protected];
> lkp <[email protected]>; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; kernel-
> [email protected]; [email protected]; linux-
> [email protected]; [email protected]; [email protected]
> Subject: [PATCH] ucounts: Fix regression preventing increasing of rlimits in
> init_user_ns
>
>
> "Ma, XinjianX" <[email protected]> reported:
>
> > When lkp team run kernel selftests, we found after these series of
> > patches, testcase mqueue: mq_perf_tests in kselftest failed with following
> message.
> >
> > # selftests: mqueue: mq_perf_tests
> > #
> > # Initial system state:
> > # Using queue path: /mq_perf_tests
> > # RLIMIT_MSGQUEUE(soft): 819200
> > # RLIMIT_MSGQUEUE(hard): 819200
> > # Maximum Message Size: 8192
> > # Maximum Queue Size: 10
> > # Nice value: 0
> > #
> > # Adjusted system state for testing:
> > # RLIMIT_MSGQUEUE(soft): (unlimited)
> > # RLIMIT_MSGQUEUE(hard): (unlimited)
> > # Maximum Message Size: 16777216
> > # Maximum Queue Size: 65530
> > # Nice value: -20
> > # Continuous mode: (disabled)
> > # CPUs to pin: 3
> > # ./mq_perf_tests: mq_open() at 296: Too many open files not ok 2
> > selftests: mqueue: mq_perf_tests # exit=1 ```
> >
> > Test env:
> > rootfs: debian-10
> > gcc version: 9
>
> After investigation the problem turned out to be that ucount_max for the
> rlimits in init_user_ns was being set to the initial rlimit value.
> The practical problem is that ucount_max provides a limit that applications
> inside the user namespace can not exceed. Which means in practice that
> rlimits that have been converted to use the ucount infrastructure were not
> able to exceend their initial rlimits.
>
> Solve this by setting the relevant values of ucount_max to RLIM_INIFINITY. A
> limit in init_user_ns is pointless so the code should allow the values to grow
> as large as possible without riscking an underflow or an overflow.
>
> As the ltp test case was a bit of a pain I have reproduced the rlimit failure and
> tested the fix with the following little C program:
> > #include <stdio.h>
> > #include <fcntl.h>
> > #include <sys/stat.h>
> > #include <mqueue.h>
> > #include <sys/time.h>
> > #include <sys/resource.h>
> > #include <errno.h>
> > #include <string.h>
> > #include <stdlib.h>
> > #include <limits.h>
> > #include <unistd.h>
> >
> > int main(int argc, char **argv)
> > {
> > struct mq_attr mq_attr;
> > struct rlimit rlim;
> > mqd_t mqd;
> > int ret;
> >
> > ret = getrlimit(RLIMIT_MSGQUEUE, &rlim);
> > if (ret != 0) {
> > fprintf(stderr, "getrlimit(RLIMIT_MSGQUEUE) failed: %s\n",
> strerror(errno));
> > exit(EXIT_FAILURE);
> > }
> > printf("RLIMIT_MSGQUEUE %lu %lu\n",
> > rlim.rlim_cur, rlim.rlim_max);
> > rlim.rlim_cur = RLIM_INFINITY;
> > rlim.rlim_max = RLIM_INFINITY;
> > ret = setrlimit(RLIMIT_MSGQUEUE, &rlim);
> > if (ret != 0) {
> > fprintf(stderr, "setrlimit(RLIMIT_MSGQUEUE, RLIM_INFINITY)
> failed: %s\n", strerror(errno));
> > exit(EXIT_FAILURE);
> > }
> >
> > memset(&mq_attr, 0, sizeof(struct mq_attr));
> > mq_attr.mq_maxmsg = 65536 - 1;
> > mq_attr.mq_msgsize = 16*1024*1024 - 1;
> >
> > mqd = mq_open("/mq_rlimit_test", O_RDONLY|O_CREAT, 0600,
> &mq_attr);
> > if (mqd == (mqd_t)-1) {
> > fprintf(stderr, "mq_open failed: %s\n", strerror(errno));
> > exit(EXIT_FAILURE);
> > }
> > ret = mq_close(mqd);
> > if (ret) {
> > fprintf(stderr, "mq_close failed; %s\n", strerror(errno));
> > exit(EXIT_FAILURE);
> > }
> >
> > return EXIT_SUCCESS;
> > }
>
> Fixes: 6e52a9f0532f ("Reimplement RLIMIT_MSGQUEUE on top of ucounts")
> Fixes: d7c9e99aee48 ("Reimplement RLIMIT_MEMLOCK on top of ucounts")
> Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
> Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
> Reported-by: kernel test robot [email protected]
Sorry, but <> around email address is needed
Reported-by: kernel test robot <[email protected]>

> Acked-by: Alexey Gladkov <[email protected]>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
>
> This is a simplified version of my previous change that I have tested and will
> push out to linux-next and then to Linus shortly.
>
> kernel/fork.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/fork.c b/kernel/fork.c index bc94b2cc5995..44f4c2d83763
> 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -828,10 +828,10 @@ void __init fork_init(void)
> for (i = 0; i < MAX_PER_NAMESPACE_UCOUNTS; i++)
> init_user_ns.ucount_max[i] = max_threads/2;
>
> - set_rlimit_ucount_max(&init_user_ns, UCOUNT_RLIMIT_NPROC,
> task_rlimit(&init_task, RLIMIT_NPROC));
> - set_rlimit_ucount_max(&init_user_ns, UCOUNT_RLIMIT_MSGQUEUE,
> task_rlimit(&init_task, RLIMIT_MSGQUEUE));
> - set_rlimit_ucount_max(&init_user_ns,
> UCOUNT_RLIMIT_SIGPENDING, task_rlimit(&init_task, RLIMIT_SIGPENDING));
> - set_rlimit_ucount_max(&init_user_ns, UCOUNT_RLIMIT_MEMLOCK,
> task_rlimit(&init_task, RLIMIT_MEMLOCK));
> + set_rlimit_ucount_max(&init_user_ns, UCOUNT_RLIMIT_NPROC,
> RLIM_INFINITY);
> + set_rlimit_ucount_max(&init_user_ns, UCOUNT_RLIMIT_MSGQUEUE,
> RLIM_INFINITY);
> + set_rlimit_ucount_max(&init_user_ns,
> UCOUNT_RLIMIT_SIGPENDING, RLIM_INFINITY);
> + set_rlimit_ucount_max(&init_user_ns, UCOUNT_RLIMIT_MEMLOCK,
> RLIM_INFINITY);
>
> #ifdef CONFIG_VMAP_STACK
> cpuhp_setup_state(CPUHP_BP_PREPARE_DYN,
> "fork:vm_stack_cache",
> --
> 2.20.1

2021-08-24 03:28:14

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] ucounts: Fix regression preventing increasing of rlimits in init_user_ns

"Ma, XinjianX" <[email protected]> writes:

>> -----Original Message-----
>> From: Eric W. Biederman <[email protected]>
>> ...
>> Reported-by: kernel test robot [email protected]
> Sorry, but <> around email address is needed
> Reported-by: kernel test robot <[email protected]>

The change is already tested and pushed out so I really don't want to
mess with it. Especially as I am aiming to send it to Linus on
Wednesday after it has had a chance to pass through linux-next and
whatever automated tests are there.

What does copying and pasting the Reported-by: tag as included in
your original report cause to break?

At this point I suspect that the danger of fat fingering something
far outweighs whatever benefits might be gained by surrounding the
email address with <> marks.

Eric