2021-01-15 15:02:05

by Alexey Gladkov

[permalink] [raw]
Subject: [RFC PATCH v3 0/8] Count rlimits in each user namespace

Preface
-------
These patches are for binding the rlimit counters to a user in user namespace.
This patch set can be applied on top of:

git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git v5.11-rc2

Problem
-------
The RLIMIT_NPROC, RLIMIT_MEMLOCK, RLIMIT_SIGPENDING, RLIMIT_MSGQUEUE rlimits
implementation places the counters in user_struct [1]. These limits are global
between processes and persists for the lifetime of the process, even if
processes are in different user namespaces.

To illustrate the impact of rlimits, let's say there is a program that does not
fork. Some service-A wants to run this program as user X in multiple containers.
Since the program never fork the service wants to set RLIMIT_NPROC=1.

service-A
\- program (uid=1000, container1, rlimit_nproc=1)
\- program (uid=1000, container2, rlimit_nproc=1)

The service-A sets RLIMIT_NPROC=1 and runs the program in container1. When the
service-A tries to run a program with RLIMIT_NPROC=1 in container2 it fails
since user X already has one running process.

The problem is not that the limit from container1 affects container2. The
problem is that limit is verified against the global counter that reflects
the number of processes in all containers.

This problem can be worked around by using different users for each container
but in this case we face a different problem of uid mapping when transferring
files from one container to another.

Eric W. Biederman mentioned this issue [2][3].

Introduced changes
------------------
To address the problem, we bind rlimit counters to user namespace. Each counter
reflects the number of processes in a given uid in a given user namespace. The
result is a tree of rlimit counters with the biggest value at the root (aka
init_user_ns). The limit is considered exceeded if it's exceeded up in the tree.

[1] https://lore.kernel.org/containers/[email protected]/
[2] https://lists.linuxfoundation.org/pipermail/containers/2020-August/042096.html
[3] https://lists.linuxfoundation.org/pipermail/containers/2020-October/042524.html

Changelog
---------
v3:
* Added get_ucounts() function to increase the reference count. The existing
get_counts() function renamed to __get_ucounts().
* The type of ucounts.count changed from atomic_t to refcount_t.
* Dropped 'const' from set_cred_ucounts() arguments.
* Fixed a bug with freeing the cred structure after calling cred_alloc_blank().
* Commit messages have been updated.
* Added selftest.

v2:
* RLIMIT_MEMLOCK, RLIMIT_SIGPENDING and RLIMIT_MSGQUEUE are migrated to ucounts.
* Added ucounts for pair uid and user namespace into cred.
* Added the ability to increase ucount by more than 1.

v1:
* After discussion with Eric W. Biederman, I increased the size of ucounts to
atomic_long_t.
* Added ucount_max to avoid the fork bomb.

--

Alexey Gladkov (8):
Use refcount_t for ucounts reference counting
Add a reference to ucounts for each cred
Move RLIMIT_NPROC counter to ucounts
Move RLIMIT_MSGQUEUE counter to ucounts
Move RLIMIT_SIGPENDING counter to ucounts
Move RLIMIT_MEMLOCK counter to ucounts
Move RLIMIT_NPROC check to the place where we increment the counter
kselftests: Add test to check for rlimit changes in different user
namespaces

fs/exec.c | 2 +-
fs/hugetlbfs/inode.c | 17 +-
fs/io-wq.c | 22 ++-
fs/io-wq.h | 2 +-
fs/io_uring.c | 2 +-
fs/proc/array.c | 2 +-
include/linux/cred.h | 3 +
include/linux/hugetlb.h | 3 +-
include/linux/mm.h | 4 +-
include/linux/sched/user.h | 6 -
include/linux/shmem_fs.h | 2 +-
include/linux/signal_types.h | 4 +-
include/linux/user_namespace.h | 31 +++-
ipc/mqueue.c | 29 ++--
ipc/shm.c | 31 ++--
kernel/cred.c | 46 ++++-
kernel/exit.c | 2 +-
kernel/fork.c | 12 +-
kernel/signal.c | 53 +++---
kernel/sys.c | 13 --
kernel/ucount.c | 111 +++++++++---
kernel/user.c | 2 -
kernel/user_namespace.c | 7 +-
mm/memfd.c | 4 +-
mm/mlock.c | 35 ++--
mm/mmap.c | 3 +-
mm/shmem.c | 8 +-
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/rlimits/.gitignore | 2 +
tools/testing/selftests/rlimits/Makefile | 6 +
tools/testing/selftests/rlimits/config | 1 +
.../selftests/rlimits/rlimits-per-userns.c | 161 ++++++++++++++++++
32 files changed, 445 insertions(+), 182 deletions(-)
create mode 100644 tools/testing/selftests/rlimits/.gitignore
create mode 100644 tools/testing/selftests/rlimits/Makefile
create mode 100644 tools/testing/selftests/rlimits/config
create mode 100644 tools/testing/selftests/rlimits/rlimits-per-userns.c

--
2.29.2


2021-01-15 15:02:06

by Alexey Gladkov

[permalink] [raw]
Subject: [RFC PATCH v3 4/8] Move RLIMIT_MSGQUEUE counter to ucounts

Signed-off-by: Alexey Gladkov <[email protected]>
---
include/linux/sched/user.h | 4 ----
include/linux/user_namespace.h | 1 +
ipc/mqueue.c | 29 +++++++++++++++--------------
kernel/fork.c | 1 +
kernel/ucount.c | 1 +
kernel/user_namespace.c | 1 +
6 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index d33d867ad6c1..8a34446681aa 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -18,10 +18,6 @@ struct user_struct {
#endif
#ifdef CONFIG_EPOLL
atomic_long_t epoll_watches; /* The number of file descriptors currently watched */
-#endif
-#ifdef CONFIG_POSIX_MQUEUE
- /* protected by mq_lock */
- unsigned long mq_bytes; /* How many bytes can be allocated to mqueue? */
#endif
unsigned long locked_shm; /* How many pages of mlocked shm ? */
unsigned long unix_inflight; /* How many files in flight in unix sockets */
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index bca6d28c85ce..ff96a906d7da 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -51,6 +51,7 @@ enum ucount_type {
UCOUNT_INOTIFY_WATCHES,
#endif
UCOUNT_RLIMIT_NPROC,
+ UCOUNT_RLIMIT_MSGQUEUE,
UCOUNT_COUNTS,
};

diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index beff0cfcd1e8..05fcf067131f 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -144,7 +144,7 @@ struct mqueue_inode_info {
struct pid *notify_owner;
u32 notify_self_exec_id;
struct user_namespace *notify_user_ns;
- struct user_struct *user; /* user who created, for accounting */
+ struct ucounts *ucounts; /* user who created, for accounting */
struct sock *notify_sock;
struct sk_buff *notify_cookie;

@@ -292,7 +292,6 @@ static struct inode *mqueue_get_inode(struct super_block *sb,
struct ipc_namespace *ipc_ns, umode_t mode,
struct mq_attr *attr)
{
- struct user_struct *u = current_user();
struct inode *inode;
int ret = -ENOMEM;

@@ -309,6 +308,8 @@ static struct inode *mqueue_get_inode(struct super_block *sb,
if (S_ISREG(mode)) {
struct mqueue_inode_info *info;
unsigned long mq_bytes, mq_treesize;
+ struct ucounts *ucounts;
+ bool overlimit;

inode->i_fop = &mqueue_file_operations;
inode->i_size = FILENT_SIZE;
@@ -321,7 +322,7 @@ static struct inode *mqueue_get_inode(struct super_block *sb,
info->notify_owner = NULL;
info->notify_user_ns = NULL;
info->qsize = 0;
- info->user = NULL; /* set when all is ok */
+ info->ucounts = NULL; /* set when all is ok */
info->msg_tree = RB_ROOT;
info->msg_tree_rightmost = NULL;
info->node_cache = NULL;
@@ -371,19 +372,19 @@ static struct inode *mqueue_get_inode(struct super_block *sb,
if (mq_bytes + mq_treesize < mq_bytes)
goto out_inode;
mq_bytes += mq_treesize;
+ ucounts = current_ucounts();
spin_lock(&mq_lock);
- if (u->mq_bytes + mq_bytes < u->mq_bytes ||
- u->mq_bytes + mq_bytes > rlimit(RLIMIT_MSGQUEUE)) {
+ overlimit = inc_rlimit_ucounts_and_test(ucounts, UCOUNT_RLIMIT_MSGQUEUE,
+ mq_bytes, rlimit(RLIMIT_MSGQUEUE));
+ if (overlimit) {
+ dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_MSGQUEUE, mq_bytes);
spin_unlock(&mq_lock);
/* mqueue_evict_inode() releases info->messages */
ret = -EMFILE;
goto out_inode;
}
- u->mq_bytes += mq_bytes;
spin_unlock(&mq_lock);
-
- /* all is ok */
- info->user = get_uid(u);
+ info->ucounts = get_ucounts(ucounts);
} else if (S_ISDIR(mode)) {
inc_nlink(inode);
/* Some things misbehave if size == 0 on a directory */
@@ -497,7 +498,7 @@ static void mqueue_free_inode(struct inode *inode)
static void mqueue_evict_inode(struct inode *inode)
{
struct mqueue_inode_info *info;
- struct user_struct *user;
+ struct ucounts *ucounts;
struct ipc_namespace *ipc_ns;
struct msg_msg *msg, *nmsg;
LIST_HEAD(tmp_msg);
@@ -520,8 +521,8 @@ static void mqueue_evict_inode(struct inode *inode)
free_msg(msg);
}

- user = info->user;
- if (user) {
+ ucounts = info->ucounts;
+ if (ucounts) {
unsigned long mq_bytes, mq_treesize;

/* Total amount of bytes accounted for the mqueue */
@@ -533,7 +534,7 @@ static void mqueue_evict_inode(struct inode *inode)
info->attr.mq_msgsize);

spin_lock(&mq_lock);
- user->mq_bytes -= mq_bytes;
+ dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_MSGQUEUE, mq_bytes);
/*
* get_ns_from_inode() ensures that the
* (ipc_ns = sb->s_fs_info) is either a valid ipc_ns
@@ -543,7 +544,7 @@ static void mqueue_evict_inode(struct inode *inode)
if (ipc_ns)
ipc_ns->mq_queues_count--;
spin_unlock(&mq_lock);
- free_uid(user);
+ put_ucounts(ucounts);
}
if (ipc_ns)
put_ipc_ns(ipc_ns);
diff --git a/kernel/fork.c b/kernel/fork.c
index ef7936daeeda..f61a5a3dc02f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -824,6 +824,7 @@ void __init fork_init(void)
}

init_user_ns.ucount_max[UCOUNT_RLIMIT_NPROC] = task_rlimit(&init_task, RLIMIT_NPROC);
+ init_user_ns.ucount_max[UCOUNT_RLIMIT_MSGQUEUE] = task_rlimit(&init_task, RLIMIT_MSGQUEUE);

#ifdef CONFIG_VMAP_STACK
cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "fork:vm_stack_cache",
diff --git a/kernel/ucount.c b/kernel/ucount.c
index ee683cc088af..daeb657d4989 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -75,6 +75,7 @@ static struct ctl_table user_table[] = {
UCOUNT_ENTRY("max_inotify_instances"),
UCOUNT_ENTRY("max_inotify_watches"),
#endif
+ { },
{ },
{ }
};
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 974f10da072c..9ace2a45a25d 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -122,6 +122,7 @@ int create_user_ns(struct cred *new)
ns->ucount_max[i] = INT_MAX;
}
ns->ucount_max[UCOUNT_RLIMIT_NPROC] = rlimit(RLIMIT_NPROC);
+ ns->ucount_max[UCOUNT_RLIMIT_MSGQUEUE] = rlimit(RLIMIT_MSGQUEUE);
ns->ucounts = ucounts;

/* Inherit USERNS_SETGROUPS_ALLOWED from our parent */
--
2.29.2

2021-01-15 15:02:31

by Alexey Gladkov

[permalink] [raw]
Subject: [RFC PATCH v3 2/8] Add a reference to ucounts for each cred

For RLIMIT_NPROC and some other rlimits the user_struct that holds the
global limit is kept alive for the lifetime of a process by keeping it
in struct cred. Add a ucounts reference to struct cred, so that
RLIMIT_NPROC can switch from using a per user limit to using a per user
per user namespace limit.

Signed-off-by: Alexey Gladkov <[email protected]>
---
include/linux/cred.h | 1 +
include/linux/user_namespace.h | 13 +++++++++++--
kernel/cred.c | 20 ++++++++++++++++++--
kernel/ucount.c | 30 ++++++++++++++++++++----------
kernel/user_namespace.c | 1 +
5 files changed, 51 insertions(+), 14 deletions(-)

diff --git a/include/linux/cred.h b/include/linux/cred.h
index 18639c069263..307744fcc387 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -144,6 +144,7 @@ struct cred {
#endif
struct user_struct *user; /* real user ID subscription */
struct user_namespace *user_ns; /* user_ns the caps and keyrings are relative to. */
+ struct ucounts *ucounts;
struct group_info *group_info; /* supplementary groups for euid/fsgid */
/* RCU deletion */
union {
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index f84fc2d9ce20..9a3ba69e9223 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -85,7 +85,7 @@ struct user_namespace {
struct ctl_table_header *sysctls;
#endif
struct ucounts *ucounts;
- int ucount_max[UCOUNT_COUNTS];
+ long ucount_max[UCOUNT_COUNTS];
} __randomize_layout;

struct ucounts {
@@ -93,7 +93,7 @@ struct ucounts {
struct user_namespace *ns;
kuid_t uid;
refcount_t count;
- atomic_t ucount[UCOUNT_COUNTS];
+ atomic_long_t ucount[UCOUNT_COUNTS];
};

extern struct user_namespace init_user_ns;
@@ -102,6 +102,15 @@ bool setup_userns_sysctls(struct user_namespace *ns);
void retire_userns_sysctls(struct user_namespace *ns);
struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid, enum ucount_type type);
void dec_ucount(struct ucounts *ucounts, enum ucount_type type);
+void put_ucounts(struct ucounts *ucounts);
+void set_cred_ucounts(struct cred *cred, struct user_namespace *ns, kuid_t uid);
+
+static inline struct ucounts *get_ucounts(struct ucounts *ucounts)
+{
+ if (ucounts)
+ refcount_inc(&ucounts->count);
+ return ucounts;
+}

#ifdef CONFIG_USER_NS

diff --git a/kernel/cred.c b/kernel/cred.c
index 421b1149c651..a27d725c7c79 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -119,6 +119,8 @@ static void put_cred_rcu(struct rcu_head *rcu)
if (cred->group_info)
put_group_info(cred->group_info);
free_uid(cred->user);
+ if (cred->ucounts)
+ put_ucounts(cred->ucounts);
put_user_ns(cred->user_ns);
kmem_cache_free(cred_jar, cred);
}
@@ -144,6 +146,9 @@ void __put_cred(struct cred *cred)
BUG_ON(cred == current->cred);
BUG_ON(cred == current->real_cred);

+ if (cred->ucounts);
+ BUG_ON(cred->ucounts->ns != cred->user_ns);
+
if (cred->non_rcu)
put_cred_rcu(&cred->rcu);
else
@@ -270,6 +275,7 @@ struct cred *prepare_creds(void)
get_group_info(new->group_info);
get_uid(new->user);
get_user_ns(new->user_ns);
+ get_ucounts(new->ucounts);

#ifdef CONFIG_KEYS
key_get(new->session_keyring);
@@ -363,6 +369,7 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
ret = create_user_ns(new);
if (ret < 0)
goto error_put;
+ set_cred_ucounts(new, new->user_ns, new->euid);
}

#ifdef CONFIG_KEYS
@@ -485,8 +492,11 @@ int commit_creds(struct cred *new)
* in set_user().
*/
alter_cred_subscribers(new, 2);
- if (new->user != old->user)
- atomic_inc(&new->user->processes);
+ if (new->user != old->user || new->user_ns != old->user_ns) {
+ if (new->user != old->user)
+ atomic_inc(&new->user->processes);
+ set_cred_ucounts(new, new->user_ns, new->euid);
+ }
rcu_assign_pointer(task->real_cred, new);
rcu_assign_pointer(task->cred, new);
if (new->user != old->user)
@@ -661,6 +671,11 @@ void __init cred_init(void)
/* allocate a slab in which we can store credentials */
cred_jar = kmem_cache_create("cred_jar", sizeof(struct cred), 0,
SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
+ /*
+ * This is needed here because this is the first cred and there is no
+ * ucount reference to copy.
+ */
+ set_cred_ucounts(&init_cred, &init_user_ns, GLOBAL_ROOT_UID);
}

/**
@@ -704,6 +719,7 @@ struct cred *prepare_kernel_cred(struct task_struct *daemon)
get_uid(new->user);
get_user_ns(new->user_ns);
get_group_info(new->group_info);
+ get_ucounts(new->ucounts);

#ifdef CONFIG_KEYS
new->session_keyring = NULL;
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 82acd2226460..0b4e956d87bb 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -125,7 +125,7 @@ static struct ucounts *find_ucounts(struct user_namespace *ns, kuid_t uid, struc
return NULL;
}

-static struct ucounts *get_ucounts(struct user_namespace *ns, kuid_t uid)
+static struct ucounts *__get_ucounts(struct user_namespace *ns, kuid_t uid)
{
struct hlist_head *hashent = ucounts_hashentry(ns, uid);
struct ucounts *ucounts, *new;
@@ -158,7 +158,7 @@ static struct ucounts *get_ucounts(struct user_namespace *ns, kuid_t uid)
return ucounts;
}

-static void put_ucounts(struct ucounts *ucounts)
+void put_ucounts(struct ucounts *ucounts)
{
unsigned long flags;

@@ -169,14 +169,24 @@ static void put_ucounts(struct ucounts *ucounts)
}
}

-static inline bool atomic_inc_below(atomic_t *v, int u)
+void set_cred_ucounts(struct cred *cred, struct user_namespace *ns, kuid_t uid)
{
- int c, old;
- c = atomic_read(v);
+ struct ucounts *old = cred->ucounts;
+ if (old && old->ns == ns && uid_eq(old->uid, uid))
+ return;
+ cred->ucounts = __get_ucounts(ns, uid);
+ if (old)
+ put_ucounts(old);
+}
+
+static inline bool atomic_long_inc_below(atomic_long_t *v, int u)
+{
+ long c, old;
+ c = atomic_long_read(v);
for (;;) {
if (unlikely(c >= u))
return false;
- old = atomic_cmpxchg(v, c, c+1);
+ old = atomic_long_cmpxchg(v, c, c+1);
if (likely(old == c))
return true;
c = old;
@@ -188,19 +198,19 @@ struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid,
{
struct ucounts *ucounts, *iter, *bad;
struct user_namespace *tns;
- ucounts = get_ucounts(ns, uid);
+ ucounts = __get_ucounts(ns, uid);
for (iter = ucounts; iter; iter = tns->ucounts) {
int max;
tns = iter->ns;
max = READ_ONCE(tns->ucount_max[type]);
- if (!atomic_inc_below(&iter->ucount[type], max))
+ if (!atomic_long_inc_below(&iter->ucount[type], max))
goto fail;
}
return ucounts;
fail:
bad = iter;
for (iter = ucounts; iter != bad; iter = iter->ns->ucounts)
- atomic_dec(&iter->ucount[type]);
+ atomic_long_dec(&iter->ucount[type]);

put_ucounts(ucounts);
return NULL;
@@ -210,7 +220,7 @@ void dec_ucount(struct ucounts *ucounts, enum ucount_type type)
{
struct ucounts *iter;
for (iter = ucounts; iter; iter = iter->ns->ucounts) {
- int dec = atomic_dec_if_positive(&iter->ucount[type]);
+ int dec = atomic_long_dec_if_positive(&iter->ucount[type]);
WARN_ON_ONCE(dec < 0);
}
put_ucounts(ucounts);
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index af612945a4d0..4b8a4468d391 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -1280,6 +1280,7 @@ static int userns_install(struct nsset *nsset, struct ns_common *ns)

put_user_ns(cred->user_ns);
set_cred_user_ns(cred, get_user_ns(user_ns));
+ set_cred_ucounts(cred, user_ns, cred->euid);

return 0;
}
--
2.29.2

2021-01-15 15:03:05

by Alexey Gladkov

[permalink] [raw]
Subject: [RFC PATCH v3 3/8] Move RLIMIT_NPROC counter to ucounts

RLIMIT_NPROC is implemented on top of ucounts. The process counter is
tied to the user in the user namespace. Therefore, there is no longer
one single counter for the user. Instead, there is now one counter for
each user namespace. Thus, getting the RLIMIT_NPROC counter value to
check the rlimit becomes meaningless.

We cannot use existing inc_ucounts / dec_ucounts because they do not
allow us to exceed the maximum for the counter. Some rlimits can be
overlimited if the user has the appropriate capability.

Signed-off-by: Alexey Gladkov <[email protected]>
---
fs/exec.c | 2 +-
fs/io-wq.c | 22 ++++++-------
fs/io-wq.h | 2 +-
fs/io_uring.c | 2 +-
include/linux/cred.h | 2 ++
include/linux/sched/user.h | 1 -
include/linux/user_namespace.h | 13 ++++++++
kernel/cred.c | 10 +++---
kernel/exit.c | 2 +-
kernel/fork.c | 9 ++---
kernel/sys.c | 2 +-
kernel/ucount.c | 60 ++++++++++++++++++++++++++++++++++
kernel/user.c | 1 -
kernel/user_namespace.c | 3 +-
14 files changed, 102 insertions(+), 29 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 5d4d52039105..f62fd2632104 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1870,7 +1870,7 @@ static int do_execveat_common(int fd, struct filename *filename,
* whether NPROC limit is still exceeded.
*/
if ((current->flags & PF_NPROC_EXCEEDED) &&
- atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
+ is_ucounts_overlimit(current_ucounts(), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) {
retval = -EAGAIN;
goto out_ret;
}
diff --git a/fs/io-wq.c b/fs/io-wq.c
index a564f36e260c..5b6940c90c61 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -20,6 +20,7 @@
#include <linux/blk-cgroup.h>
#include <linux/audit.h>
#include <linux/cpu.h>
+#include <linux/user_namespace.h>

#include "../kernel/sched/sched.h"
#include "io-wq.h"
@@ -120,7 +121,7 @@ struct io_wq {
io_wq_work_fn *do_work;

struct task_struct *manager;
- struct user_struct *user;
+ const struct cred *cred;
refcount_t refs;
struct completion done;

@@ -234,7 +235,7 @@ static void io_worker_exit(struct io_worker *worker)
if (worker->flags & IO_WORKER_F_RUNNING)
atomic_dec(&acct->nr_running);
if (!(worker->flags & IO_WORKER_F_BOUND))
- atomic_dec(&wqe->wq->user->processes);
+ dec_rlimit_ucounts(wqe->wq->cred->ucounts, UCOUNT_RLIMIT_NPROC, 1);
worker->flags = 0;
preempt_enable();

@@ -364,15 +365,15 @@ static void __io_worker_busy(struct io_wqe *wqe, struct io_worker *worker,
worker->flags |= IO_WORKER_F_BOUND;
wqe->acct[IO_WQ_ACCT_UNBOUND].nr_workers--;
wqe->acct[IO_WQ_ACCT_BOUND].nr_workers++;
- atomic_dec(&wqe->wq->user->processes);
+ dec_rlimit_ucounts(wqe->wq->cred->ucounts, UCOUNT_RLIMIT_NPROC, 1);
} else {
worker->flags &= ~IO_WORKER_F_BOUND;
wqe->acct[IO_WQ_ACCT_UNBOUND].nr_workers++;
wqe->acct[IO_WQ_ACCT_BOUND].nr_workers--;
- atomic_inc(&wqe->wq->user->processes);
+ inc_rlimit_ucounts(wqe->wq->cred->ucounts, UCOUNT_RLIMIT_NPROC, 1);
}
io_wqe_inc_running(wqe, worker);
- }
+ }
}

/*
@@ -707,7 +708,7 @@ static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
raw_spin_unlock_irq(&wqe->lock);

if (index == IO_WQ_ACCT_UNBOUND)
- atomic_inc(&wq->user->processes);
+ inc_rlimit_ucounts(wq->cred->ucounts, UCOUNT_RLIMIT_NPROC, 1);

refcount_inc(&wq->refs);
wake_up_process(worker->task);
@@ -838,7 +839,7 @@ static bool io_wq_can_queue(struct io_wqe *wqe, struct io_wqe_acct *acct,
if (free_worker)
return true;

- if (atomic_read(&wqe->wq->user->processes) >= acct->max_workers &&
+ if (is_ucounts_overlimit(wqe->wq->cred->ucounts, UCOUNT_RLIMIT_NPROC, acct->max_workers) &&
!(capable(CAP_SYS_RESOURCE) || capable(CAP_SYS_ADMIN)))
return false;

@@ -1074,7 +1075,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
wq->do_work = data->do_work;

/* caller must already hold a reference to this */
- wq->user = data->user;
+ wq->cred = data->cred;

ret = -ENOMEM;
for_each_node(node) {
@@ -1090,10 +1091,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
wqe->node = alloc_node;
wqe->acct[IO_WQ_ACCT_BOUND].max_workers = bounded;
atomic_set(&wqe->acct[IO_WQ_ACCT_BOUND].nr_running, 0);
- if (wq->user) {
- wqe->acct[IO_WQ_ACCT_UNBOUND].max_workers =
- task_rlimit(current, RLIMIT_NPROC);
- }
+ wqe->acct[IO_WQ_ACCT_UNBOUND].max_workers = task_rlimit(current, RLIMIT_NPROC);
atomic_set(&wqe->acct[IO_WQ_ACCT_UNBOUND].nr_running, 0);
wqe->wq = wq;
raw_spin_lock_init(&wqe->lock);
diff --git a/fs/io-wq.h b/fs/io-wq.h
index b158f8addcf3..4130e247c556 100644
--- a/fs/io-wq.h
+++ b/fs/io-wq.h
@@ -111,7 +111,7 @@ typedef void (free_work_fn)(struct io_wq_work *);
typedef struct io_wq_work *(io_wq_work_fn)(struct io_wq_work *);

struct io_wq_data {
- struct user_struct *user;
+ const struct cred *cred;

io_wq_work_fn *do_work;
free_work_fn *free_work;
diff --git a/fs/io_uring.c b/fs/io_uring.c
index ca46f314640b..7e463dd5f3d0 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -7985,7 +7985,7 @@ static int io_init_wq_offload(struct io_ring_ctx *ctx,
unsigned int concurrency;
int ret = 0;

- data.user = ctx->user;
+ data.cred = ctx->creds;
data.free_work = io_free_work;
data.do_work = io_wq_submit_work;

diff --git a/include/linux/cred.h b/include/linux/cred.h
index 307744fcc387..2ca545b8669b 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -371,6 +371,7 @@ static inline void put_cred(const struct cred *_cred)

#define task_uid(task) (task_cred_xxx((task), uid))
#define task_euid(task) (task_cred_xxx((task), euid))
+#define task_ucounts(task) (task_cred_xxx((task), ucounts))

#define current_cred_xxx(xxx) \
({ \
@@ -387,6 +388,7 @@ static inline void put_cred(const struct cred *_cred)
#define current_fsgid() (current_cred_xxx(fsgid))
#define current_cap() (current_cred_xxx(cap_effective))
#define current_user() (current_cred_xxx(user))
+#define current_ucounts() (current_cred_xxx(ucounts))

extern struct user_namespace init_user_ns;
#ifdef CONFIG_USER_NS
diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index a8ec3b6093fc..d33d867ad6c1 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -12,7 +12,6 @@
*/
struct user_struct {
refcount_t __count; /* reference count */
- atomic_t processes; /* How many processes does this user have? */
atomic_t sigpending; /* How many pending signals does this user have? */
#ifdef CONFIG_FANOTIFY
atomic_t fanotify_listeners;
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 9a3ba69e9223..bca6d28c85ce 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -50,9 +50,12 @@ enum ucount_type {
UCOUNT_INOTIFY_INSTANCES,
UCOUNT_INOTIFY_WATCHES,
#endif
+ UCOUNT_RLIMIT_NPROC,
UCOUNT_COUNTS,
};

+#define MAX_PER_NAMESPACE_UCOUNTS UCOUNT_RLIMIT_NPROC
+
struct user_namespace {
struct uid_gid_map uid_map;
struct uid_gid_map gid_map;
@@ -112,6 +115,16 @@ static inline struct ucounts *get_ucounts(struct ucounts *ucounts)
return ucounts;
}

+static inline long get_ucounts_value(struct ucounts *ucounts, enum ucount_type type)
+{
+ return atomic_long_read(&ucounts->ucount[type]);
+}
+
+bool inc_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v);
+bool inc_rlimit_ucounts_and_test(struct ucounts *ucounts, enum ucount_type type, long v, long max);
+void dec_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v);
+bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, long max);
+
#ifdef CONFIG_USER_NS

static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
diff --git a/kernel/cred.c b/kernel/cred.c
index a27d725c7c79..c43e30407d22 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -357,7 +357,7 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
kdebug("share_creds(%p{%d,%d})",
p->cred, atomic_read(&p->cred->usage),
read_cred_subscribers(p->cred));
- atomic_inc(&p->cred->user->processes);
+ inc_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);
return 0;
}

@@ -391,8 +391,8 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
}
#endif

- atomic_inc(&new->user->processes);
p->cred = p->real_cred = get_cred(new);
+ inc_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);
alter_cred_subscribers(new, 2);
validate_creds(new);
return 0;
@@ -493,14 +493,13 @@ int commit_creds(struct cred *new)
*/
alter_cred_subscribers(new, 2);
if (new->user != old->user || new->user_ns != old->user_ns) {
- if (new->user != old->user)
- atomic_inc(&new->user->processes);
set_cred_ucounts(new, new->user_ns, new->euid);
+ inc_rlimit_ucounts(new->ucounts, UCOUNT_RLIMIT_NPROC, 1);
}
rcu_assign_pointer(task->real_cred, new);
rcu_assign_pointer(task->cred, new);
if (new->user != old->user)
- atomic_dec(&old->user->processes);
+ dec_rlimit_ucounts(old->ucounts, UCOUNT_RLIMIT_NPROC, 1);
alter_cred_subscribers(old, -2);

/* send notifications */
@@ -676,6 +675,7 @@ void __init cred_init(void)
* ucount reference to copy.
*/
set_cred_ucounts(&init_cred, &init_user_ns, GLOBAL_ROOT_UID);
+ inc_rlimit_ucounts(init_cred.ucounts, UCOUNT_RLIMIT_NPROC, 1);
}

/**
diff --git a/kernel/exit.c b/kernel/exit.c
index 04029e35e69a..61c0fe902b50 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -188,7 +188,7 @@ void release_task(struct task_struct *p)
/* don't need to get the RCU readlock here - the process is dead and
* can't be modifying its own credentials. But shut RCU-lockdep up */
rcu_read_lock();
- atomic_dec(&__task_cred(p)->user->processes);
+ dec_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);
rcu_read_unlock();

cgroup_release(p);
diff --git a/kernel/fork.c b/kernel/fork.c
index 37720a6d04ea..ef7936daeeda 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -819,10 +819,12 @@ void __init fork_init(void)
init_task.signal->rlim[RLIMIT_SIGPENDING] =
init_task.signal->rlim[RLIMIT_NPROC];

- for (i = 0; i < UCOUNT_COUNTS; i++) {
+ for (i = 0; i < MAX_PER_NAMESPACE_UCOUNTS; i++) {
init_user_ns.ucount_max[i] = max_threads/2;
}

+ init_user_ns.ucount_max[UCOUNT_RLIMIT_NPROC] = task_rlimit(&init_task, RLIMIT_NPROC);
+
#ifdef CONFIG_VMAP_STACK
cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "fork:vm_stack_cache",
NULL, free_vm_stack_cache);
@@ -1964,8 +1966,7 @@ static __latent_entropy struct task_struct *copy_process(
DEBUG_LOCKS_WARN_ON(!p->softirqs_enabled);
#endif
retval = -EAGAIN;
- if (atomic_read(&p->real_cred->user->processes) >=
- task_rlimit(p, RLIMIT_NPROC)) {
+ if (is_ucounts_overlimit(task_ucounts(p), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) {
if (p->real_cred->user != INIT_USER &&
!capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
goto bad_fork_free;
@@ -2368,7 +2369,7 @@ static __latent_entropy struct task_struct *copy_process(
#endif
delayacct_tsk_free(p);
bad_fork_cleanup_count:
- atomic_dec(&p->cred->user->processes);
+ dec_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);
exit_creds(p);
bad_fork_free:
p->state = TASK_DEAD;
diff --git a/kernel/sys.c b/kernel/sys.c
index 51f00fe20e4d..c2734ab9474e 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -474,7 +474,7 @@ static int set_user(struct cred *new)
* for programs doing set*uid()+execve() by harmlessly deferring the
* failure to the execve() stage.
*/
- if (atomic_read(&new_user->processes) >= rlimit(RLIMIT_NPROC) &&
+ if (is_ucounts_overlimit(new->ucounts, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)) &&
new_user != INIT_USER)
current->flags |= PF_NPROC_EXCEEDED;
else
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 0b4e956d87bb..ee683cc088af 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -7,6 +7,7 @@
#include <linux/hash.h>
#include <linux/kmemleak.h>
#include <linux/user_namespace.h>
+#include <linux/security.h>

#define UCOUNTS_HASHTABLE_BITS 10
static struct hlist_head ucounts_hashtable[(1 << UCOUNTS_HASHTABLE_BITS)];
@@ -74,6 +75,7 @@ static struct ctl_table user_table[] = {
UCOUNT_ENTRY("max_inotify_instances"),
UCOUNT_ENTRY("max_inotify_watches"),
#endif
+ { },
{ }
};
#endif /* CONFIG_SYSCTL */
@@ -193,6 +195,19 @@ static inline bool atomic_long_inc_below(atomic_long_t *v, int u)
}
}

+static inline long atomic_long_dec_value(atomic_long_t *v, long n)
+{
+ long c, old;
+ c = atomic_long_read(v);
+ for (;;) {
+ old = atomic_long_cmpxchg(v, c, c - n);
+ if (likely(old == c))
+ return c;
+ c = old;
+ }
+ return c;
+}
+
struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid,
enum ucount_type type)
{
@@ -226,6 +241,51 @@ void dec_ucount(struct ucounts *ucounts, enum ucount_type type)
put_ucounts(ucounts);
}

+bool inc_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v)
+{
+ struct ucounts *iter;
+ bool overlimit = false;
+
+ for (iter = ucounts; iter; iter = iter->ns->ucounts) {
+ long max = READ_ONCE(iter->ns->ucount_max[type]);
+ if (atomic_long_add_return(v, &iter->ucount[type]) > max)
+ overlimit = true;
+ }
+
+ return overlimit;
+}
+
+bool inc_rlimit_ucounts_and_test(struct ucounts *ucounts, enum ucount_type type,
+ long v, long max)
+{
+ bool overlimit = inc_rlimit_ucounts(ucounts, type, v);
+ if (!overlimit && get_ucounts_value(ucounts, type) > max)
+ overlimit = true;
+ return overlimit;
+}
+
+void dec_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v)
+{
+ struct ucounts *iter;
+ for (iter = ucounts; iter; iter = iter->ns->ucounts) {
+ long dec = atomic_long_dec_value(&iter->ucount[type], v);
+ WARN_ON_ONCE(dec < 0);
+ }
+}
+
+bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, long max)
+{
+ struct ucounts *iter;
+ if (get_ucounts_value(ucounts, type) > max)
+ return true;
+ for (iter = ucounts; iter; iter = iter->ns->ucounts) {
+ max = READ_ONCE(iter->ns->ucount_max[type]);
+ if (get_ucounts_value(iter, type) > max)
+ return true;
+ }
+ return false;
+}
+
static __init int user_namespace_sysctl_init(void)
{
#ifdef CONFIG_SYSCTL
diff --git a/kernel/user.c b/kernel/user.c
index a2478cddf536..7f5ff498207a 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -98,7 +98,6 @@ static DEFINE_SPINLOCK(uidhash_lock);
/* root_user.__count is 1, for init task cred */
struct user_struct root_user = {
.__count = REFCOUNT_INIT(1),
- .processes = ATOMIC_INIT(1),
.sigpending = ATOMIC_INIT(0),
.locked_shm = 0,
.uid = GLOBAL_ROOT_UID,
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 4b8a4468d391..974f10da072c 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -118,9 +118,10 @@ int create_user_ns(struct cred *new)
ns->owner = owner;
ns->group = group;
INIT_WORK(&ns->work, free_user_ns);
- for (i = 0; i < UCOUNT_COUNTS; i++) {
+ for (i = 0; i < MAX_PER_NAMESPACE_UCOUNTS; i++) {
ns->ucount_max[i] = INT_MAX;
}
+ ns->ucount_max[UCOUNT_RLIMIT_NPROC] = rlimit(RLIMIT_NPROC);
ns->ucounts = ucounts;

/* Inherit USERNS_SETGROUPS_ALLOWED from our parent */
--
2.29.2

2021-01-15 15:03:08

by Alexey Gladkov

[permalink] [raw]
Subject: [RFC PATCH v3 5/8] Move RLIMIT_SIGPENDING counter to ucounts

Signed-off-by: Alexey Gladkov <[email protected]>
---
fs/proc/array.c | 2 +-
include/linux/sched/user.h | 1 -
include/linux/signal_types.h | 4 ++-
include/linux/user_namespace.h | 1 +
kernel/fork.c | 1 +
kernel/signal.c | 53 ++++++++++++++--------------------
kernel/ucount.c | 1 +
kernel/user.c | 1 -
kernel/user_namespace.c | 1 +
9 files changed, 30 insertions(+), 35 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index bb87e4d89cd8..74b0ea4b7e38 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -284,7 +284,7 @@ static inline void task_sig(struct seq_file *m, struct task_struct *p)
collect_sigign_sigcatch(p, &ignored, &caught);
num_threads = get_nr_threads(p);
rcu_read_lock(); /* FIXME: is this correct? */
- qsize = atomic_read(&__task_cred(p)->user->sigpending);
+ qsize = get_ucounts_value(task_ucounts(p), UCOUNT_RLIMIT_SIGPENDING);
rcu_read_unlock();
qlim = task_rlimit(p, RLIMIT_SIGPENDING);
unlock_task_sighand(p, &flags);
diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index 8a34446681aa..8ba9cec4fb99 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -12,7 +12,6 @@
*/
struct user_struct {
refcount_t __count; /* reference count */
- atomic_t sigpending; /* How many pending signals does this user have? */
#ifdef CONFIG_FANOTIFY
atomic_t fanotify_listeners;
#endif
diff --git a/include/linux/signal_types.h b/include/linux/signal_types.h
index 68e06c75c5b2..34cb28b8f16c 100644
--- a/include/linux/signal_types.h
+++ b/include/linux/signal_types.h
@@ -13,6 +13,8 @@ typedef struct kernel_siginfo {
__SIGINFO;
} kernel_siginfo_t;

+struct ucounts;
+
/*
* Real Time signals may be queued.
*/
@@ -21,7 +23,7 @@ struct sigqueue {
struct list_head list;
int flags;
kernel_siginfo_t info;
- struct user_struct *user;
+ struct ucounts *ucounts;
};

/* flags values. */
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index ff96a906d7da..852b7bc40318 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -52,6 +52,7 @@ enum ucount_type {
#endif
UCOUNT_RLIMIT_NPROC,
UCOUNT_RLIMIT_MSGQUEUE,
+ UCOUNT_RLIMIT_SIGPENDING,
UCOUNT_COUNTS,
};

diff --git a/kernel/fork.c b/kernel/fork.c
index f61a5a3dc02f..a7be5790392e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -825,6 +825,7 @@ void __init fork_init(void)

init_user_ns.ucount_max[UCOUNT_RLIMIT_NPROC] = task_rlimit(&init_task, RLIMIT_NPROC);
init_user_ns.ucount_max[UCOUNT_RLIMIT_MSGQUEUE] = task_rlimit(&init_task, RLIMIT_MSGQUEUE);
+ init_user_ns.ucount_max[UCOUNT_RLIMIT_SIGPENDING] = task_rlimit(&init_task, RLIMIT_SIGPENDING);

#ifdef CONFIG_VMAP_STACK
cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "fork:vm_stack_cache",
diff --git a/kernel/signal.c b/kernel/signal.c
index 5736c55aaa1a..b01c2007a282 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -412,49 +412,40 @@ void task_join_group_stop(struct task_struct *task)
static struct sigqueue *
__sigqueue_alloc(int sig, struct task_struct *t, gfp_t flags, int override_rlimit)
{
- struct sigqueue *q = NULL;
- struct user_struct *user;
- int sigpending;
+ struct sigqueue *q = kmem_cache_alloc(sigqueue_cachep, flags);

- /*
- * Protect access to @t credentials. This can go away when all
- * callers hold rcu read lock.
- *
- * NOTE! A pending signal will hold on to the user refcount,
- * and we get/put the refcount only when the sigpending count
- * changes from/to zero.
- */
- rcu_read_lock();
- user = __task_cred(t)->user;
- sigpending = atomic_inc_return(&user->sigpending);
- if (sigpending == 1)
- get_uid(user);
- rcu_read_unlock();
+ if (likely(q != NULL)) {
+ bool overlimit;

- if (override_rlimit || likely(sigpending <= task_rlimit(t, RLIMIT_SIGPENDING))) {
- q = kmem_cache_alloc(sigqueue_cachep, flags);
- } else {
- print_dropped_signal(sig);
- }
-
- if (unlikely(q == NULL)) {
- if (atomic_dec_and_test(&user->sigpending))
- free_uid(user);
- } else {
INIT_LIST_HEAD(&q->list);
q->flags = 0;
- q->user = user;
+
+ /*
+ * Protect access to @t credentials. This can go away when all
+ * callers hold rcu read lock.
+ */
+ rcu_read_lock();
+ q->ucounts = get_ucounts(task_ucounts(t));
+ overlimit = inc_rlimit_ucounts_and_test(q->ucounts, UCOUNT_RLIMIT_SIGPENDING,
+ 1, task_rlimit(t, RLIMIT_SIGPENDING));
+
+ if (override_rlimit || likely(!overlimit)) {
+ rcu_read_unlock();
+ return q;
+ }
+ rcu_read_unlock();
}

- return q;
+ print_dropped_signal(sig);
+ return NULL;
}

static void __sigqueue_free(struct sigqueue *q)
{
if (q->flags & SIGQUEUE_PREALLOC)
return;
- if (atomic_dec_and_test(&q->user->sigpending))
- free_uid(q->user);
+ dec_rlimit_ucounts(q->ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
+ put_ucounts(q->ucounts);
kmem_cache_free(sigqueue_cachep, q);
}

diff --git a/kernel/ucount.c b/kernel/ucount.c
index daeb657d4989..86f6c67ec147 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -75,6 +75,7 @@ static struct ctl_table user_table[] = {
UCOUNT_ENTRY("max_inotify_instances"),
UCOUNT_ENTRY("max_inotify_watches"),
#endif
+ { },
{ },
{ },
{ }
diff --git a/kernel/user.c b/kernel/user.c
index 7f5ff498207a..6737327f83be 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -98,7 +98,6 @@ static DEFINE_SPINLOCK(uidhash_lock);
/* root_user.__count is 1, for init task cred */
struct user_struct root_user = {
.__count = REFCOUNT_INIT(1),
- .sigpending = ATOMIC_INIT(0),
.locked_shm = 0,
.uid = GLOBAL_ROOT_UID,
.ratelimit = RATELIMIT_STATE_INIT(root_user.ratelimit, 0, 0),
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 9ace2a45a25d..eeff7f6d81c0 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -123,6 +123,7 @@ int create_user_ns(struct cred *new)
}
ns->ucount_max[UCOUNT_RLIMIT_NPROC] = rlimit(RLIMIT_NPROC);
ns->ucount_max[UCOUNT_RLIMIT_MSGQUEUE] = rlimit(RLIMIT_MSGQUEUE);
+ ns->ucount_max[UCOUNT_RLIMIT_SIGPENDING] = rlimit(RLIMIT_SIGPENDING);
ns->ucounts = ucounts;

/* Inherit USERNS_SETGROUPS_ALLOWED from our parent */
--
2.29.2

2021-01-15 15:03:31

by Alexey Gladkov

[permalink] [raw]
Subject: [RFC PATCH v3 7/8] Move RLIMIT_NPROC check to the place where we increment the counter

After calling set_user(), we always have to call commit_creds() to apply
new credentials upon the current task. There is no need to separate
limit check and counter incrementing.

Signed-off-by: Alexey Gladkov <[email protected]>
---
kernel/cred.c | 22 +++++++++++++++++-----
kernel/sys.c | 13 -------------
2 files changed, 17 insertions(+), 18 deletions(-)

diff --git a/kernel/cred.c b/kernel/cred.c
index c43e30407d22..991c43559ee8 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -487,14 +487,26 @@ int commit_creds(struct cred *new)
if (!gid_eq(new->fsgid, old->fsgid))
key_fsgid_changed(new);

- /* do it
- * RLIMIT_NPROC limits on user->processes have already been checked
- * in set_user().
- */
alter_cred_subscribers(new, 2);
if (new->user != old->user || new->user_ns != old->user_ns) {
+ bool overlimit;
+
set_cred_ucounts(new, new->user_ns, new->euid);
- inc_rlimit_ucounts(new->ucounts, UCOUNT_RLIMIT_NPROC, 1);
+
+ overlimit = inc_rlimit_ucounts_and_test(new->ucounts, UCOUNT_RLIMIT_NPROC,
+ 1, rlimit(RLIMIT_NPROC));
+
+ /*
+ * We don't fail in case of NPROC limit excess here because too many
+ * poorly written programs don't check set*uid() return code, assuming
+ * it never fails if called by root. We may still enforce NPROC limit
+ * for programs doing set*uid()+execve() by harmlessly deferring the
+ * failure to the execve() stage.
+ */
+ if (overlimit && new->user != INIT_USER)
+ current->flags |= PF_NPROC_EXCEEDED;
+ else
+ current->flags &= ~PF_NPROC_EXCEEDED;
}
rcu_assign_pointer(task->real_cred, new);
rcu_assign_pointer(task->cred, new);
diff --git a/kernel/sys.c b/kernel/sys.c
index c2734ab9474e..180c4e06064f 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -467,19 +467,6 @@ static int set_user(struct cred *new)
if (!new_user)
return -EAGAIN;

- /*
- * We don't fail in case of NPROC limit excess here because too many
- * poorly written programs don't check set*uid() return code, assuming
- * it never fails if called by root. We may still enforce NPROC limit
- * for programs doing set*uid()+execve() by harmlessly deferring the
- * failure to the execve() stage.
- */
- if (is_ucounts_overlimit(new->ucounts, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)) &&
- new_user != INIT_USER)
- current->flags |= PF_NPROC_EXCEEDED;
- else
- current->flags &= ~PF_NPROC_EXCEEDED;
-
free_uid(new->user);
new->user = new_user;
return 0;
--
2.29.2

2021-01-15 15:03:59

by Alexey Gladkov

[permalink] [raw]
Subject: [RFC PATCH v3 8/8] kselftests: Add test to check for rlimit changes in different user namespaces

The testcase runs few instances of the program with RLIMIT_NPROC=1 from
user uid=60000, in different user namespaces.

Signed-off-by: Alexey Gladkov <[email protected]>
---
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/rlimits/.gitignore | 2 +
tools/testing/selftests/rlimits/Makefile | 6 +
tools/testing/selftests/rlimits/config | 1 +
.../selftests/rlimits/rlimits-per-userns.c | 161 ++++++++++++++++++
5 files changed, 171 insertions(+)
create mode 100644 tools/testing/selftests/rlimits/.gitignore
create mode 100644 tools/testing/selftests/rlimits/Makefile
create mode 100644 tools/testing/selftests/rlimits/config
create mode 100644 tools/testing/selftests/rlimits/rlimits-per-userns.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index afbab4aeef3c..4dbeb5686f7b 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -46,6 +46,7 @@ TARGETS += proc
TARGETS += pstore
TARGETS += ptrace
TARGETS += openat2
+TARGETS += rlimits
TARGETS += rseq
TARGETS += rtc
TARGETS += seccomp
diff --git a/tools/testing/selftests/rlimits/.gitignore b/tools/testing/selftests/rlimits/.gitignore
new file mode 100644
index 000000000000..091021f255b3
--- /dev/null
+++ b/tools/testing/selftests/rlimits/.gitignore
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+rlimits-per-userns
diff --git a/tools/testing/selftests/rlimits/Makefile b/tools/testing/selftests/rlimits/Makefile
new file mode 100644
index 000000000000..03aadb406212
--- /dev/null
+++ b/tools/testing/selftests/rlimits/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+
+CFLAGS += -Wall -O2 -g
+TEST_GEN_PROGS := rlimits-per-userns
+
+include ../lib.mk
diff --git a/tools/testing/selftests/rlimits/config b/tools/testing/selftests/rlimits/config
new file mode 100644
index 000000000000..416bd53ce982
--- /dev/null
+++ b/tools/testing/selftests/rlimits/config
@@ -0,0 +1 @@
+CONFIG_USER_NS=y
diff --git a/tools/testing/selftests/rlimits/rlimits-per-userns.c b/tools/testing/selftests/rlimits/rlimits-per-userns.c
new file mode 100644
index 000000000000..26dc949e93ea
--- /dev/null
+++ b/tools/testing/selftests/rlimits/rlimits-per-userns.c
@@ -0,0 +1,161 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Alexey Gladkov <[email protected]>
+ */
+#define _GNU_SOURCE
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <sys/time.h>
+#include <sys/resource.h>
+#include <sys/prctl.h>
+#include <sys/stat.h>
+
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <sched.h>
+#include <signal.h>
+#include <limits.h>
+#include <fcntl.h>
+#include <errno.h>
+#include <err.h>
+
+#define NR_CHILDS 2
+
+static char *service_prog;
+static uid_t user = 60000;
+static uid_t group = 60000;
+
+static void setrlimit_nproc(rlim_t n)
+{
+ pid_t pid = getpid();
+ struct rlimit limit = {
+ .rlim_cur = n,
+ .rlim_max = n
+ };
+
+ warnx("(pid=%d): Setting RLIMIT_NPROC=%ld", pid, n);
+
+ if (setrlimit(RLIMIT_NPROC, &limit) < 0)
+ err(EXIT_FAILURE, "(pid=%d): setrlimit(RLIMIT_NPROC)", pid);
+}
+
+static pid_t fork_child(void)
+{
+ pid_t pid = fork();
+
+ if (pid < 0)
+ err(EXIT_FAILURE, "fork");
+
+ if (pid > 0)
+ return pid;
+
+ pid = getpid();
+
+ warnx("(pid=%d): New process starting ...", pid);
+
+ if (prctl(PR_SET_PDEATHSIG, SIGKILL) < 0)
+ err(EXIT_FAILURE, "(pid=%d): prctl(PR_SET_PDEATHSIG)", pid);
+
+ signal(SIGUSR1, SIG_DFL);
+
+ warnx("(pid=%d): Changing to uid=%d, gid=%d", pid, user, group);
+
+ if (setgid(group) < 0)
+ err(EXIT_FAILURE, "(pid=%d): setgid(%d)", pid, group);
+ if (setuid(user) < 0)
+ err(EXIT_FAILURE, "(pid=%d): setuid(%d)", pid, user);
+
+ warnx("(pid=%d): Service running ...", pid);
+
+ warnx("(pid=%d): Unshare user namespace", pid);
+ if (unshare(CLONE_NEWUSER) < 0)
+ err(EXIT_FAILURE, "unshare(CLONE_NEWUSER)");
+
+ char *const argv[] = { "service", NULL };
+ char *const envp[] = { "I_AM_SERVICE=1", NULL };
+
+ warnx("(pid=%d): Executing real service ...", pid);
+
+ execve(service_prog, argv, envp);
+ err(EXIT_FAILURE, "(pid=%d): execve", pid);
+}
+
+int main(int argc, char **argv)
+{
+ size_t i;
+ pid_t child[NR_CHILDS];
+ int wstatus[NR_CHILDS];
+ int childs = NR_CHILDS;
+ pid_t pid;
+
+ if (getenv("I_AM_SERVICE")) {
+ pause();
+ exit(EXIT_SUCCESS);
+ }
+
+ service_prog = argv[0];
+ pid = getpid();
+
+ warnx("(pid=%d) Starting testcase", pid);
+
+ /*
+ * This rlimit is not a problem for root because it can be exceeded.
+ */
+ setrlimit_nproc(1);
+
+ for (i = 0; i < NR_CHILDS; i++) {
+ child[i] = fork_child();
+ wstatus[i] = 0;
+ usleep(250000);
+ }
+
+ while (1) {
+ for (i = 0; i < NR_CHILDS; i++) {
+ if (child[i] <= 0)
+ continue;
+
+ errno = 0;
+ pid_t ret = waitpid(child[i], &wstatus[i], WNOHANG);
+
+ if (!ret || (!WIFEXITED(wstatus[i]) && !WIFSIGNALED(wstatus[i])))
+ continue;
+
+ if (ret < 0 && errno != ECHILD)
+ warn("(pid=%d): waitpid(%d)", pid, child[i]);
+
+ child[i] *= -1;
+ childs -= 1;
+ }
+
+ if (!childs)
+ break;
+
+ usleep(250000);
+
+ for (i = 0; i < NR_CHILDS; i++) {
+ if (child[i] <= 0)
+ continue;
+ kill(child[i], SIGUSR1);
+ }
+ }
+
+ for (i = 0; i < NR_CHILDS; i++) {
+ if (WIFEXITED(wstatus[i]))
+ warnx("(pid=%d): pid %d exited, status=%d",
+ pid, -child[i], WEXITSTATUS(wstatus[i]));
+ else if (WIFSIGNALED(wstatus[i]))
+ warnx("(pid=%d): pid %d killed by signal %d",
+ pid, -child[i], WTERMSIG(wstatus[i]));
+
+ if (WIFSIGNALED(wstatus[i]) && WTERMSIG(wstatus[i]) == SIGUSR1)
+ continue;
+
+ warnx("(pid=%d): Test failed", pid);
+ exit(EXIT_FAILURE);
+ }
+
+ warnx("(pid=%d): Test passed", pid);
+ exit(EXIT_SUCCESS);
+}
--
2.29.2

2021-01-15 15:04:04

by Alexey Gladkov

[permalink] [raw]
Subject: [RFC PATCH v3 6/8] Move RLIMIT_MEMLOCK counter to ucounts

Signed-off-by: Alexey Gladkov <[email protected]>
---
fs/hugetlbfs/inode.c | 17 ++++++++---------
include/linux/hugetlb.h | 3 +--
include/linux/mm.h | 4 ++--
include/linux/shmem_fs.h | 2 +-
include/linux/user_namespace.h | 1 +
ipc/shm.c | 31 ++++++++++++++++--------------
kernel/fork.c | 1 +
kernel/ucount.c | 1 +
kernel/user_namespace.c | 1 +
mm/memfd.c | 4 +---
mm/mlock.c | 35 +++++++++++++---------------------
mm/mmap.c | 3 +--
mm/shmem.c | 8 ++++----
13 files changed, 52 insertions(+), 59 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index b5c109703daa..82298412f020 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -1451,34 +1451,35 @@ static int get_hstate_idx(int page_size_log)
* otherwise hugetlb_reserve_pages reserves one less hugepages than intended.
*/
struct file *hugetlb_file_setup(const char *name, size_t size,
- vm_flags_t acctflag, struct user_struct **user,
+ vm_flags_t acctflag,
int creat_flags, int page_size_log)
{
struct inode *inode;
struct vfsmount *mnt;
int hstate_idx;
struct file *file;
+ const struct cred *cred;

hstate_idx = get_hstate_idx(page_size_log);
if (hstate_idx < 0)
return ERR_PTR(-ENODEV);

- *user = NULL;
mnt = hugetlbfs_vfsmount[hstate_idx];
if (!mnt)
return ERR_PTR(-ENOENT);

if (creat_flags == HUGETLB_SHMFS_INODE && !can_do_hugetlb_shm()) {
- *user = current_user();
- if (user_shm_lock(size, *user)) {
+ cred = current_cred();
+ if (user_shm_lock(size, cred)) {
task_lock(current);
pr_warn_once("%s (%d): Using mlock ulimits for SHM_HUGETLB is deprecated\n",
current->comm, current->pid);
task_unlock(current);
} else {
- *user = NULL;
return ERR_PTR(-EPERM);
}
+ } else {
+ cred = NULL;
}

file = ERR_PTR(-ENOSPC);
@@ -1503,10 +1504,8 @@ struct file *hugetlb_file_setup(const char *name, size_t size,

iput(inode);
out:
- if (*user) {
- user_shm_unlock(size, *user);
- *user = NULL;
- }
+ if (cred)
+ user_shm_unlock(size, cred);
return file;
}

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ebca2ef02212..fbd36c452648 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -434,8 +434,7 @@ static inline struct hugetlbfs_inode_info *HUGETLBFS_I(struct inode *inode)
extern const struct file_operations hugetlbfs_file_operations;
extern const struct vm_operations_struct hugetlb_vm_ops;
struct file *hugetlb_file_setup(const char *name, size_t size, vm_flags_t acct,
- struct user_struct **user, int creat_flags,
- int page_size_log);
+ int creat_flags, int page_size_log);

static inline bool is_file_hugepages(struct file *file)
{
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..30a37aef1ab9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1628,8 +1628,8 @@ extern bool can_do_mlock(void);
#else
static inline bool can_do_mlock(void) { return false; }
#endif
-extern int user_shm_lock(size_t, struct user_struct *);
-extern void user_shm_unlock(size_t, struct user_struct *);
+extern int user_shm_lock(size_t, const struct cred *);
+extern void user_shm_unlock(size_t, const struct cred *);

/*
* Parameter block passed down to zap_pte_range in exceptional cases.
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index d82b6f396588..10f50b1c4e0e 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -65,7 +65,7 @@ extern struct file *shmem_file_setup_with_mnt(struct vfsmount *mnt,
extern int shmem_zero_setup(struct vm_area_struct *);
extern unsigned long shmem_get_unmapped_area(struct file *, unsigned long addr,
unsigned long len, unsigned long pgoff, unsigned long flags);
-extern int shmem_lock(struct file *file, int lock, struct user_struct *user);
+extern int shmem_lock(struct file *file, int lock, const struct cred *cred);
#ifdef CONFIG_SHMEM
extern const struct address_space_operations shmem_aops;
static inline bool shmem_mapping(struct address_space *mapping)
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 852b7bc40318..701903a8beeb 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -53,6 +53,7 @@ enum ucount_type {
UCOUNT_RLIMIT_NPROC,
UCOUNT_RLIMIT_MSGQUEUE,
UCOUNT_RLIMIT_SIGPENDING,
+ UCOUNT_RLIMIT_MEMLOCK,
UCOUNT_COUNTS,
};

diff --git a/ipc/shm.c b/ipc/shm.c
index febd88daba8c..40c566cd6f7a 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -60,7 +60,7 @@ struct shmid_kernel /* private to the kernel */
time64_t shm_ctim;
struct pid *shm_cprid;
struct pid *shm_lprid;
- struct user_struct *mlock_user;
+ const struct cred *mlock_cred;

/* The task created the shm object. NULL if the task is dead. */
struct task_struct *shm_creator;
@@ -286,10 +286,10 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
shm_rmid(ns, shp);
shm_unlock(shp);
if (!is_file_hugepages(shm_file))
- shmem_lock(shm_file, 0, shp->mlock_user);
- else if (shp->mlock_user)
+ shmem_lock(shm_file, 0, shp->mlock_cred);
+ else if (shp->mlock_cred)
user_shm_unlock(i_size_read(file_inode(shm_file)),
- shp->mlock_user);
+ shp->mlock_cred);
fput(shm_file);
ipc_update_pid(&shp->shm_cprid, NULL);
ipc_update_pid(&shp->shm_lprid, NULL);
@@ -625,7 +625,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)

shp->shm_perm.key = key;
shp->shm_perm.mode = (shmflg & S_IRWXUGO);
- shp->mlock_user = NULL;
+ shp->mlock_cred = NULL;

shp->shm_perm.security = NULL;
error = security_shm_alloc(&shp->shm_perm);
@@ -650,8 +650,9 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
if (shmflg & SHM_NORESERVE)
acctflag = VM_NORESERVE;
file = hugetlb_file_setup(name, hugesize, acctflag,
- &shp->mlock_user, HUGETLB_SHMFS_INODE,
+ HUGETLB_SHMFS_INODE,
(shmflg >> SHM_HUGE_SHIFT) & SHM_HUGE_MASK);
+ shp->mlock_cred = current_cred();
} else {
/*
* Do not allow no accounting for OVERCOMMIT_NEVER, even
@@ -663,8 +664,10 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
file = shmem_kernel_file_setup(name, size, acctflag);
}
error = PTR_ERR(file);
- if (IS_ERR(file))
+ if (IS_ERR(file)) {
+ shp->mlock_cred = NULL;
goto no_file;
+ }

shp->shm_cprid = get_pid(task_tgid(current));
shp->shm_lprid = NULL;
@@ -698,8 +701,8 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
no_id:
ipc_update_pid(&shp->shm_cprid, NULL);
ipc_update_pid(&shp->shm_lprid, NULL);
- if (is_file_hugepages(file) && shp->mlock_user)
- user_shm_unlock(size, shp->mlock_user);
+ if (is_file_hugepages(file) && shp->mlock_cred)
+ user_shm_unlock(size, shp->mlock_cred);
fput(file);
ipc_rcu_putref(&shp->shm_perm, shm_rcu_free);
return error;
@@ -1105,12 +1108,12 @@ static int shmctl_do_lock(struct ipc_namespace *ns, int shmid, int cmd)
goto out_unlock0;

if (cmd == SHM_LOCK) {
- struct user_struct *user = current_user();
+ const struct cred *cred = current_cred();

- err = shmem_lock(shm_file, 1, user);
+ err = shmem_lock(shm_file, 1, cred);
if (!err && !(shp->shm_perm.mode & SHM_LOCKED)) {
shp->shm_perm.mode |= SHM_LOCKED;
- shp->mlock_user = user;
+ shp->mlock_cred = cred;
}
goto out_unlock0;
}
@@ -1118,9 +1121,9 @@ static int shmctl_do_lock(struct ipc_namespace *ns, int shmid, int cmd)
/* SHM_UNLOCK */
if (!(shp->shm_perm.mode & SHM_LOCKED))
goto out_unlock0;
- shmem_lock(shm_file, 0, shp->mlock_user);
+ shmem_lock(shm_file, 0, shp->mlock_cred);
shp->shm_perm.mode &= ~SHM_LOCKED;
- shp->mlock_user = NULL;
+ shp->mlock_cred = NULL;
get_file(shm_file);
ipc_unlock_object(&shp->shm_perm);
rcu_read_unlock();
diff --git a/kernel/fork.c b/kernel/fork.c
index a7be5790392e..8104870f67c0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -826,6 +826,7 @@ void __init fork_init(void)
init_user_ns.ucount_max[UCOUNT_RLIMIT_NPROC] = task_rlimit(&init_task, RLIMIT_NPROC);
init_user_ns.ucount_max[UCOUNT_RLIMIT_MSGQUEUE] = task_rlimit(&init_task, RLIMIT_MSGQUEUE);
init_user_ns.ucount_max[UCOUNT_RLIMIT_SIGPENDING] = task_rlimit(&init_task, RLIMIT_SIGPENDING);
+ init_user_ns.ucount_max[UCOUNT_RLIMIT_MEMLOCK] = task_rlimit(&init_task, RLIMIT_MEMLOCK);

#ifdef CONFIG_VMAP_STACK
cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "fork:vm_stack_cache",
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 86f6c67ec147..803b6ec2075b 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -78,6 +78,7 @@ static struct ctl_table user_table[] = {
{ },
{ },
{ },
+ { },
{ }
};
#endif /* CONFIG_SYSCTL */
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index eeff7f6d81c0..a634ce74988c 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -124,6 +124,7 @@ int create_user_ns(struct cred *new)
ns->ucount_max[UCOUNT_RLIMIT_NPROC] = rlimit(RLIMIT_NPROC);
ns->ucount_max[UCOUNT_RLIMIT_MSGQUEUE] = rlimit(RLIMIT_MSGQUEUE);
ns->ucount_max[UCOUNT_RLIMIT_SIGPENDING] = rlimit(RLIMIT_SIGPENDING);
+ ns->ucount_max[UCOUNT_RLIMIT_MEMLOCK] = rlimit(RLIMIT_MEMLOCK);
ns->ucounts = ucounts;

/* Inherit USERNS_SETGROUPS_ALLOWED from our parent */
diff --git a/mm/memfd.c b/mm/memfd.c
index 2647c898990c..9f80f162791a 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -297,9 +297,7 @@ SYSCALL_DEFINE2(memfd_create,
}

if (flags & MFD_HUGETLB) {
- struct user_struct *user = NULL;
-
- file = hugetlb_file_setup(name, 0, VM_NORESERVE, &user,
+ file = hugetlb_file_setup(name, 0, VM_NORESERVE,
HUGETLB_ANONHUGE_INODE,
(flags >> MFD_HUGE_SHIFT) &
MFD_HUGE_MASK);
diff --git a/mm/mlock.c b/mm/mlock.c
index 55b3b3672977..2d49d1afd7e0 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -812,15 +812,10 @@ SYSCALL_DEFINE0(munlockall)
return ret;
}

-/*
- * Objects with different lifetime than processes (SHM_LOCK and SHM_HUGETLB
- * shm segments) get accounted against the user_struct instead.
- */
-static DEFINE_SPINLOCK(shmlock_user_lock);
-
-int user_shm_lock(size_t size, struct user_struct *user)
+int user_shm_lock(size_t size, const struct cred *cred)
{
unsigned long lock_limit, locked;
+ bool overlimit;
int allowed = 0;

locked = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
@@ -828,22 +823,18 @@ int user_shm_lock(size_t size, struct user_struct *user)
if (lock_limit == RLIM_INFINITY)
allowed = 1;
lock_limit >>= PAGE_SHIFT;
- spin_lock(&shmlock_user_lock);
- if (!allowed &&
- locked + user->locked_shm > lock_limit && !capable(CAP_IPC_LOCK))
- goto out;
- get_uid(user);
- user->locked_shm += locked;
- allowed = 1;
-out:
- spin_unlock(&shmlock_user_lock);
- return allowed;
+
+ overlimit = inc_rlimit_ucounts_and_test(cred->ucounts, UCOUNT_RLIMIT_MEMLOCK,
+ locked, lock_limit);
+
+ if (!allowed && overlimit && !capable(CAP_IPC_LOCK)) {
+ dec_rlimit_ucounts(cred->ucounts, UCOUNT_RLIMIT_MEMLOCK, locked);
+ return 0;
+ }
+ return 1;
}

-void user_shm_unlock(size_t size, struct user_struct *user)
+void user_shm_unlock(size_t size, const struct cred *cred)
{
- spin_lock(&shmlock_user_lock);
- user->locked_shm -= (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
- spin_unlock(&shmlock_user_lock);
- free_uid(user);
+ dec_rlimit_ucounts(cred->ucounts, UCOUNT_RLIMIT_MEMLOCK, (size + PAGE_SIZE - 1) >> PAGE_SHIFT);
}
diff --git a/mm/mmap.c b/mm/mmap.c
index dc7206032387..e7980e2c18e8 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1607,7 +1607,6 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
goto out_fput;
}
} else if (flags & MAP_HUGETLB) {
- struct user_struct *user = NULL;
struct hstate *hs;

hs = hstate_sizelog((flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
@@ -1623,7 +1622,7 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
*/
file = hugetlb_file_setup(HUGETLB_ANON_FILE, len,
VM_NORESERVE,
- &user, HUGETLB_ANONHUGE_INODE,
+ HUGETLB_ANONHUGE_INODE,
(flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
if (IS_ERR(file))
return PTR_ERR(file);
diff --git a/mm/shmem.c b/mm/shmem.c
index 7c6b6d8f6c39..de9bf6866f51 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2225,7 +2225,7 @@ static struct mempolicy *shmem_get_policy(struct vm_area_struct *vma,
}
#endif

-int shmem_lock(struct file *file, int lock, struct user_struct *user)
+int shmem_lock(struct file *file, int lock, const struct cred *cred)
{
struct inode *inode = file_inode(file);
struct shmem_inode_info *info = SHMEM_I(inode);
@@ -2237,13 +2237,13 @@ int shmem_lock(struct file *file, int lock, struct user_struct *user)
* no serialization needed when called from shm_destroy().
*/
if (lock && !(info->flags & VM_LOCKED)) {
- if (!user_shm_lock(inode->i_size, user))
+ if (!user_shm_lock(inode->i_size, cred))
goto out_nomem;
info->flags |= VM_LOCKED;
mapping_set_unevictable(file->f_mapping);
}
- if (!lock && (info->flags & VM_LOCKED) && user) {
- user_shm_unlock(inode->i_size, user);
+ if (!lock && (info->flags & VM_LOCKED) && cred) {
+ user_shm_unlock(inode->i_size, cred);
info->flags &= ~VM_LOCKED;
mapping_clear_unevictable(file->f_mapping);
}
--
2.29.2

2021-01-15 15:04:46

by Alexey Gladkov

[permalink] [raw]
Subject: [RFC PATCH v3 1/8] Use refcount_t for ucounts reference counting

Signed-off-by: Alexey Gladkov <[email protected]>
---
include/linux/user_namespace.h | 2 +-
kernel/ucount.c | 20 +++++++-------------
2 files changed, 8 insertions(+), 14 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 64cf8ebdc4ec..f84fc2d9ce20 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -92,7 +92,7 @@ struct ucounts {
struct hlist_node node;
struct user_namespace *ns;
kuid_t uid;
- int count;
+ refcount_t count;
atomic_t ucount[UCOUNT_COUNTS];
};

diff --git a/kernel/ucount.c b/kernel/ucount.c
index 11b1596e2542..82acd2226460 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -141,7 +141,8 @@ static struct ucounts *get_ucounts(struct user_namespace *ns, kuid_t uid)

new->ns = ns;
new->uid = uid;
- new->count = 0;
+
+ refcount_set(&new->count, 0);

spin_lock_irq(&ucounts_lock);
ucounts = find_ucounts(ns, uid, hashent);
@@ -152,10 +153,7 @@ static struct ucounts *get_ucounts(struct user_namespace *ns, kuid_t uid)
ucounts = new;
}
}
- if (ucounts->count == INT_MAX)
- ucounts = NULL;
- else
- ucounts->count += 1;
+ refcount_inc(&ucounts->count);
spin_unlock_irq(&ucounts_lock);
return ucounts;
}
@@ -164,15 +162,11 @@ static void put_ucounts(struct ucounts *ucounts)
{
unsigned long flags;

- spin_lock_irqsave(&ucounts_lock, flags);
- ucounts->count -= 1;
- if (!ucounts->count)
+ if (refcount_dec_and_lock_irqsave(&ucounts->count, &ucounts_lock, &flags)) {
hlist_del_init(&ucounts->node);
- else
- ucounts = NULL;
- spin_unlock_irqrestore(&ucounts_lock, flags);
-
- kfree(ucounts);
+ spin_unlock_irqrestore(&ucounts_lock, flags);
+ kfree(ucounts);
+ }
}

static inline bool atomic_inc_below(atomic_t *v, int u)
--
2.29.2

2021-01-18 06:04:23

by Oliver Sang

[permalink] [raw]
Subject: c25050162e: WARNING:at_lib/refcount.c:#refcount_warn_saturate


Greeting,

FYI, we noticed the following commit (built with gcc-9):

commit: c25050162e76334c7ec2d23bf1b3ed73aae84744 ("[RFC PATCH v3 1/8] Use refcount_t for ucounts reference counting")
url: https://github.com/0day-ci/linux/commits/Alexey-Gladkov/Count-rlimits-in-each-user-namespace/20210115-230051
base: https://git.kernel.org/cgit/linux/kernel/git/shuah/linux-kselftest.git next

in testcase: boot

on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 8G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):


+---------------------------------------------------+------------+------------+
| | df00d02989 | c25050162e |
+---------------------------------------------------+------------+------------+
| boot_successes | 4 | 0 |
| boot_failures | 0 | 4 |
| WARNING:at_lib/refcount.c:#refcount_warn_saturate | 0 | 4 |
| RIP:refcount_warn_saturate | 0 | 4 |
+---------------------------------------------------+------------+------------+


If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>


[ 0.411125] WARNING: CPU: 0 PID: 0 at lib/refcount.c:25 refcount_warn_saturate (kbuild/src/consumer/lib/refcount.c:25 (discriminator 3))
[ 0.411125] Modules linked in:
[ 0.411125] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.11.0-rc2-00003-gc25050162e76 #1
[ 0.411125] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 0.411125] RIP: 0010:refcount_warn_saturate (kbuild/src/consumer/lib/refcount.c:25 (discriminator 3))
[ 0.411125] Code: 05 64 40 66 01 01 e8 b5 5d 63 00 0f 0b c3 80 3d 54 40 66 01 00 75 d3 48 c7 c7 c8 0c 3b 82 c6 05 44 40 66 01 01 e8 96 5d 63 00 <0f> 0b c3 80 3d 37 40 66 01 00 75 b4 48 c7 c7 a0 0c 3b 82 c6 05 27
All code
========
0: 05 64 40 66 01 add $0x1664064,%eax
5: 01 e8 add %ebp,%eax
7: b5 5d mov $0x5d,%ch
9: 63 00 movslq (%rax),%eax
b: 0f 0b ud2
d: c3 retq
e: 80 3d 54 40 66 01 00 cmpb $0x0,0x1664054(%rip) # 0x1664069
15: 75 d3 jne 0xffffffffffffffea
17: 48 c7 c7 c8 0c 3b 82 mov $0xffffffff823b0cc8,%rdi
1e: c6 05 44 40 66 01 01 movb $0x1,0x1664044(%rip) # 0x1664069
25: e8 96 5d 63 00 callq 0x635dc0
2a:* 0f 0b ud2 <-- trapping instruction
2c: c3 retq
2d: 80 3d 37 40 66 01 00 cmpb $0x0,0x1664037(%rip) # 0x166406b
34: 75 b4 jne 0xffffffffffffffea
36: 48 c7 c7 a0 0c 3b 82 mov $0xffffffff823b0ca0,%rdi
3d: c6 .byte 0xc6
3e: 05 .byte 0x5
3f: 27 (bad)

Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: c3 retq
3: 80 3d 37 40 66 01 00 cmpb $0x0,0x1664037(%rip) # 0x1664041
a: 75 b4 jne 0xffffffffffffffc0
c: 48 c7 c7 a0 0c 3b 82 mov $0xffffffff823b0ca0,%rdi
13: c6 .byte 0xc6
14: 05 .byte 0x5
15: 27 (bad)
[ 0.411125] RSP: 0000:ffffffff82603e50 EFLAGS: 00010082
[ 0.411125] RAX: 0000000000000000 RBX: 0000000000000000 RCX: c0000000ffff7fff
[ 0.411125] RDX: ffffffff82603c70 RSI: 00000000ffff7fff RDI: 0000000000000046
[ 0.411125] RBP: 0000000000000005 R08: 0000000000000000 R09: ffffffff82603c68
[ 0.411125] R10: 0000000000000001 R11: 0000000000000001 R12: ffff888100134360
[ 0.411125] R13: 00000000000003e7 R14: ffffffff833a6300 R15: ffffffff8265e380
[ 0.411125] FS: 0000000000000000(0000) GS:ffff88823fc00000(0000) knlGS:0000000000000000
[ 0.411125] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.411125] CR2: ffff88823ffff000 CR3: 000000000260a000 CR4: 00000000000406b0
[ 0.411125] Call Trace:
[ 0.411125] inc_ucount (kbuild/src/consumer/include/linux/refcount.h:199 kbuild/src/consumer/include/linux/refcount.h:250 kbuild/src/consumer/include/linux/refcount.h:267 kbuild/src/consumer/kernel/ucount.c:156 kbuild/src/consumer/kernel/ucount.c:191)
[ 0.411125] alloc_mnt_ns (kbuild/src/consumer/fs/namespace.c:3261)
[ 0.411125] mnt_init (kbuild/src/consumer/fs/namespace.c:3798 kbuild/src/consumer/fs/namespace.c:3849)
[ 0.411125] vfs_caches_init (kbuild/src/consumer/fs/dcache.c:3242)
[ 0.411125] start_kernel (kbuild/src/consumer/init/main.c:1042)
[ 0.411125] secondary_startup_64_no_verify (kbuild/src/consumer/arch/x86/kernel/head_64.S:283)
[ 0.411125] ---[ end trace 5b3ffa3578b7d906 ]---
[ 0.411525] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
[ 0.412130] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
[ 0.413133] Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
[ 0.414132] Spectre V2 : Mitigation: Full generic retpoline
[ 0.415129] Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch
[ 0.416129] Speculative Store Bypass: Vulnerable
[ 0.417133] MDS: Vulnerable: Clear CPU buffers attempted, no microcode
[ 0.418333] Freeing SMP alternatives memory: 44K
[ 0.422600] smpboot: CPU0: Intel Xeon E312xx (Sandy Bridge) (family: 0x6, model: 0x2a, stepping: 0x1)
[ 0.423317] Performance Events: unsupported p6 CPU model 42 no PMU driver, software events only.
[ 0.424198] rcu: Hierarchical SRCU implementation.
[ 0.425646] NMI watchdog: Perf NMI watchdog permanently disabled
[ 0.426242] smp: Bringing up secondary CPUs ...
[ 0.427313] x86: Booting SMP configuration:
[ 0.428132] .... node #0, CPUs: #1
[ 0.127154] kvm-clock: cpu 1, msr 337d041, secondary cpu clock
[ 0.127154] masked ExtINT on CPU#1
[ 0.127154] smpboot: CPU 1 Converting physical 0 to logical die 1
[ 0.453531] kvm-guest: stealtime: cpu 1, msr 23fd18540
[ 0.454218] smp: Brought up 1 node, 2 CPUs
[ 0.455134] smpboot: Max logical packages: 2
[ 0.456112] smpboot: Total of 2 processors activated (11999.99 BogoMIPS)
[ 0.457900] ------------[ cut here ]------------
[ 0.458125] refcount_t: saturated; leaking memory.


To reproduce:

# build kernel
cd linux
cp config-5.11.0-rc2-00003-gc25050162e76 .config
make HOSTCC=gcc-9 CC=gcc-9 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email



Thanks,
Oliver Sang


Attachments:
(No filename) (6.77 kB)
config-5.11.0-rc2-00003-gc25050162e76 (194.55 kB)
job-script (4.26 kB)
dmesg.xz (14.32 kB)
Download all attachments

2021-01-18 07:00:09

by Oliver Sang

[permalink] [raw]
Subject: 14c3c8a27f: kernel_BUG_at_kernel/cred.c


Greeting,

FYI, we noticed the following commit (built with gcc-9):

commit: 14c3c8a27f70d6d6b7c1d64a9af899eb80169495 ("[RFC PATCH v3 2/8] Add a reference to ucounts for each cred")
url: https://github.com/0day-ci/linux/commits/Alexey-Gladkov/Count-rlimits-in-each-user-namespace/20210115-230051
base: https://git.kernel.org/cgit/linux/kernel/git/shuah/linux-kselftest.git next

in testcase: trinity
version: trinity-i386
with following parameters:

runtime: 300s

test-description: Trinity is a linux system call fuzz tester.
test-url: http://codemonkey.org.uk/projects/trinity/


on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 8G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):


+---------------------------------------------------+------------+------------+
| | c25050162e | 14c3c8a27f |
+---------------------------------------------------+------------+------------+
| boot_successes | 0 | 0 |
| boot_failures | 8 | 8 |
| WARNING:at_lib/refcount.c:#refcount_warn_saturate | 7 | 8 |
| EIP:refcount_warn_saturate | 7 | 8 |
| BUG:kernel_hang_in_boot_stage | 1 | |
| kernel_BUG_at_kernel/cred.c | 0 | 3 |
| invalid_opcode:#[##] | 0 | 3 |
| EIP:__put_cred | 0 | 7 |
| Kernel_panic-not_syncing:Fatal_exception | 0 | 7 |
| BUG:kernel_NULL_pointer_dereference,address | 0 | 4 |
| Oops:#[##] | 0 | 4 |
+---------------------------------------------------+------------+------------+


If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>


[ 77.068709] kernel BUG at kernel/cred.c:150!
[ 77.069392] invalid opcode: 0000 [#1] SMP
[ 77.070035] CPU: 1 PID: 895 Comm: trinity-c7 Tainted: G W 5.11.0-rc2-00004-g14c3c8a27f70 #1
[ 77.071425] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 77.072871] EIP: __put_cred (kbuild/src/consumer/kernel/cred.c:150 (discriminator 1))
[ 77.073493] Code: 66 90 ba 90 b4 e7 c3 89 c8 e8 f4 6e 04 00 5d c3 66 90 0f 0b 8d b6 00 00 00 00 0f 0b 8d b6 00 00 00 00 0f 0b 8d b6 00 00 00 00 <0f> 0b 8d b6 00 00 00 00 89 c2 64 8b 0d cc 66 0e c5 8b 81 a8 04 00
All code
========
0: 66 90 xchg %ax,%ax
2: ba 90 b4 e7 c3 mov $0xc3e7b490,%edx
7: 89 c8 mov %ecx,%eax
9: e8 f4 6e 04 00 callq 0x46f02
e: 5d pop %rbp
f: c3 retq
10: 66 90 xchg %ax,%ax
12: 0f 0b ud2
14: 8d b6 00 00 00 00 lea 0x0(%rsi),%esi
1a: 0f 0b ud2
1c: 8d b6 00 00 00 00 lea 0x0(%rsi),%esi
22: 0f 0b ud2
24: 8d b6 00 00 00 00 lea 0x0(%rsi),%esi
2a:* 0f 0b ud2 <-- trapping instruction
2c: 8d b6 00 00 00 00 lea 0x0(%rsi),%esi
32: 89 c2 mov %eax,%edx
34: 64 8b 0d cc 66 0e c5 mov %fs:-0x3af19934(%rip),%ecx # 0xffffffffc50e6707
3b: 8b .byte 0x8b
3c: 81 .byte 0x81
3d: a8 04 test $0x4,%al
...

Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: 8d b6 00 00 00 00 lea 0x0(%rsi),%esi
8: 89 c2 mov %eax,%edx
a: 64 8b 0d cc 66 0e c5 mov %fs:-0x3af19934(%rip),%ecx # 0xffffffffc50e66dd
11: 8b .byte 0x8b
12: 81 .byte 0x81
13: a8 04 test $0x4,%al
...
[ 77.076068] EAX: de3ef880 EBX: de2af080 ECX: 00000000 EDX: 00000000
[ 77.076997] ESI: de3ef880 EDI: 00000000 EBP: de349f74 ESP: de349f50
[ 77.077896] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010282
[ 77.078914] CR0: 80050033 CR2: b7cb2ff0 CR3: 030a4000 CR4: 000406d0
[ 77.079858] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 77.080745] DR6: fffe0ff0 DR7: 00000400
[ 77.081373] Call Trace:
[ 77.081834] ? keyctl_session_to_parent (kbuild/src/consumer/security/keys/keyctl.c:1711)
[ 77.082629] __ia32_sys_keyctl (kbuild/src/consumer/security/keys/keyctl.c:1951 kbuild/src/consumer/security/keys/keyctl.c:1869 kbuild/src/consumer/security/keys/keyctl.c:1869)
[ 77.083320] __do_fast_syscall_32 (kbuild/src/consumer/arch/x86/entry/common.c:78 kbuild/src/consumer/arch/x86/entry/common.c:137)
[ 77.084032] do_fast_syscall_32 (kbuild/src/consumer/arch/x86/entry/common.c:160)
[ 77.084704] do_SYSENTER_32 (kbuild/src/consumer/arch/x86/entry/common.c:204)
[ 77.085316] entry_SYSENTER_32 (kbuild/src/consumer/arch/x86/entry/entry_32.S:953)
[ 77.085973] EIP: 0xb7f04549
[ 77.086493] Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76
All code
========
0: 03 74 c0 01 add 0x1(%rax,%rax,8),%esi
4: 10 05 03 74 b8 01 adc %al,0x1b87403(%rip) # 0x1b8740d
a: 10 06 adc %al,(%rsi)
c: 03 74 b4 01 add 0x1(%rsp,%rsi,4),%esi
10: 10 07 adc %al,(%rdi)
12: 03 74 b0 01 add 0x1(%rax,%rsi,4),%esi
16: 10 08 adc %cl,(%rax)
18: 03 74 d8 01 add 0x1(%rax,%rbx,8),%esi
1c: 00 00 add %al,(%rax)
1e: 00 00 add %al,(%rax)
20: 00 51 52 add %dl,0x52(%rcx)
23: 55 push %rbp
24: 89 e5 mov %esp,%ebp
26: 0f 34 sysenter
28: cd 80 int $0x80
2a:* 5d pop %rbp <-- trapping instruction
2b: 5a pop %rdx
2c: 59 pop %rcx
2d: c3 retq
2e: 90 nop
2f: 90 nop
30: 90 nop
31: 90 nop
32: 8d 76 00 lea 0x0(%rsi),%esi
35: 58 pop %rax
36: b8 77 00 00 00 mov $0x77,%eax
3b: cd 80 int $0x80
3d: 90 nop
3e: 8d .byte 0x8d
3f: 76 .byte 0x76

Code starting with the faulting instruction
===========================================
0: 5d pop %rbp
1: 5a pop %rdx
2: 59 pop %rcx
3: c3 retq
4: 90 nop
5: 90 nop
6: 90 nop
7: 90 nop
8: 8d 76 00 lea 0x0(%rsi),%esi
b: 58 pop %rax
c: b8 77 00 00 00 mov $0x77,%eax
11: cd 80 int $0x80
13: 90 nop
14: 8d .byte 0x8d
15: 76 .byte 0x76
[ 77.089120] EAX: ffffffda EBX: 00000012 ECX: ffff8a8b EDX: ffffffff
[ 77.090075] ESI: 7d7d7d7d EDI: 000000a3 EBP: 426bb44d ESP: bfaa6c8c
[ 77.091007] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000296
[ 77.092031] Modules linked in:
[ 77.092629] ---[ end trace 66869751d0fb6313 ]---
[ 77.093388] EIP: __put_cred (kbuild/src/consumer/kernel/cred.c:150 (discriminator 1))
[ 77.094000] Code: 66 90 ba 90 b4 e7 c3 89 c8 e8 f4 6e 04 00 5d c3 66 90 0f 0b 8d b6 00 00 00 00 0f 0b 8d b6 00 00 00 00 0f 0b 8d b6 00 00 00 00 <0f> 0b 8d b6 00 00 00 00 89 c2 64 8b 0d cc 66 0e c5 8b 81 a8 04 00
All code
========
0: 66 90 xchg %ax,%ax
2: ba 90 b4 e7 c3 mov $0xc3e7b490,%edx
7: 89 c8 mov %ecx,%eax
9: e8 f4 6e 04 00 callq 0x46f02
e: 5d pop %rbp
f: c3 retq
10: 66 90 xchg %ax,%ax
12: 0f 0b ud2
14: 8d b6 00 00 00 00 lea 0x0(%rsi),%esi
1a: 0f 0b ud2
1c: 8d b6 00 00 00 00 lea 0x0(%rsi),%esi
22: 0f 0b ud2
24: 8d b6 00 00 00 00 lea 0x0(%rsi),%esi
2a:* 0f 0b ud2 <-- trapping instruction
2c: 8d b6 00 00 00 00 lea 0x0(%rsi),%esi
32: 89 c2 mov %eax,%edx
34: 64 8b 0d cc 66 0e c5 mov %fs:-0x3af19934(%rip),%ecx # 0xffffffffc50e6707
3b: 8b .byte 0x8b
3c: 81 .byte 0x81
3d: a8 04 test $0x4,%al
...

Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: 8d b6 00 00 00 00 lea 0x0(%rsi),%esi
8: 89 c2 mov %eax,%edx
a: 64 8b 0d cc 66 0e c5 mov %fs:-0x3af19934(%rip),%ecx # 0xffffffffc50e66dd
11: 8b .byte 0x8b
12: 81 .byte 0x81
13: a8 04 test $0x4,%al


To reproduce:

# build kernel
cd linux
cp config-5.11.0-rc2-00004-g14c3c8a27f70 .config
make HOSTCC=gcc-9 CC=gcc-9 ARCH=i386 olddefconfig prepare modules_prepare bzImage

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email



Thanks,
Oliver Sang


Attachments:
(No filename) (9.74 kB)
config-5.11.0-rc2-00004-g14c3c8a27f70 (124.91 kB)
job-script (4.00 kB)
dmesg.xz (12.46 kB)
Download all attachments

2021-01-18 19:29:04

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC PATCH v3 1/8] Use refcount_t for ucounts reference counting

On Fri, Jan 15, 2021 at 6:59 AM Alexey Gladkov <[email protected]> wrote:
>
> @@ -152,10 +153,7 @@ static struct ucounts *get_ucounts(struct user_namespace *ns, kuid_t uid)
> ucounts = new;
> }
> }
> - if (ucounts->count == INT_MAX)
> - ucounts = NULL;
> - else
> - ucounts->count += 1;
> + refcount_inc(&ucounts->count);
> spin_unlock_irq(&ucounts_lock);
> return ucounts;
> }

This is wrong.

It used to return NULL when the count saturated.

Now it just silently saturates.

I'm not sure how many people care, but that NULL return ends up being
returned quite widely (through "inc_uncount()" and friends).

The fact that this has no commit message at all to explain what it is
doing and why is also a grounds for just NAK.

Linus

2021-01-18 19:50:57

by Alexey Gladkov

[permalink] [raw]
Subject: Re: [RFC PATCH v3 1/8] Use refcount_t for ucounts reference counting

On Mon, Jan 18, 2021 at 11:14:48AM -0800, Linus Torvalds wrote:
> On Fri, Jan 15, 2021 at 6:59 AM Alexey Gladkov <[email protected]> wrote:
> >
> > @@ -152,10 +153,7 @@ static struct ucounts *get_ucounts(struct user_namespace *ns, kuid_t uid)
> > ucounts = new;
> > }
> > }
> > - if (ucounts->count == INT_MAX)
> > - ucounts = NULL;
> > - else
> > - ucounts->count += 1;
> > + refcount_inc(&ucounts->count);
> > spin_unlock_irq(&ucounts_lock);
> > return ucounts;
> > }
>
> This is wrong.
>
> It used to return NULL when the count saturated.
>
> Now it just silently saturates.
>
> I'm not sure how many people care, but that NULL return ends up being
> returned quite widely (through "inc_uncount()" and friends).
>
> The fact that this has no commit message at all to explain what it is
> doing and why is also a grounds for just NAK.

Sorry about that. I thought that this code is not needed when switching
from int to refcount_t. I was wrong. I'll think about how best to check
it.

--
Rgrds, legion

2021-01-18 20:59:52

by Alexey Gladkov

[permalink] [raw]
Subject: Re: [RFC PATCH v3 1/8] Use refcount_t for ucounts reference counting

On Mon, Jan 18, 2021 at 12:34:29PM -0800, Linus Torvalds wrote:
> On Mon, Jan 18, 2021 at 11:46 AM Alexey Gladkov
> <[email protected]> wrote:
> >
> > Sorry about that. I thought that this code is not needed when switching
> > from int to refcount_t. I was wrong.
>
> Well, you _may_ be right. I personally didn't check how the return
> value is used.
>
> I only reacted to "it certainly _may_ be used, and there is absolutely
> no comment anywhere about why it wouldn't matter".

I have not found examples where checked the overflow after calling
refcount_inc/refcount_add.

For example in kernel/fork.c:2298 :

current->signal->nr_threads++;
atomic_inc(&current->signal->live);
refcount_inc(&current->signal->sigcnt);

$ semind search signal_struct.sigcnt
def include/linux/sched/signal.h:83 refcount_t sigcnt;
m-- kernel/fork.c:723 put_signal_struct if (refcount_dec_and_test(&sig->sigcnt))
m-- kernel/fork.c:1571 copy_signal refcount_set(&sig->sigcnt, 1);
m-- kernel/fork.c:2298 copy_process refcount_inc(&current->signal->sigcnt);

It seems to me that the only way is to use __refcount_inc and then compare
the old value with REFCOUNT_MAX

Since I have not seen examples of such checks, I thought that this is
acceptable. Sorry once again. I have not tried to hide these changes.

--
Rgrds, legion

2021-01-19 01:13:38

by Alexey Gladkov

[permalink] [raw]
Subject: [PATCH v4 2/8] Add a reference to ucounts for each cred

For RLIMIT_NPROC and some other rlimits the user_struct that holds the
global limit is kept alive for the lifetime of a process by keeping it
in struct cred. Add a ucounts reference to struct cred, so that
RLIMIT_NPROC can switch from using a per user limit to using a per user
per user namespace limit.

Changelog
---------
v4:
* Fixed typo in the kernel/cred.c

Signed-off-by: Alexey Gladkov <[email protected]>
---
include/linux/cred.h | 1 +
include/linux/user_namespace.h | 13 +++++++++++--
kernel/cred.c | 20 ++++++++++++++++++--
kernel/ucount.c | 30 ++++++++++++++++++++----------
kernel/user_namespace.c | 1 +
5 files changed, 51 insertions(+), 14 deletions(-)

diff --git a/include/linux/cred.h b/include/linux/cred.h
index 18639c069263..307744fcc387 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -144,6 +144,7 @@ struct cred {
#endif
struct user_struct *user; /* real user ID subscription */
struct user_namespace *user_ns; /* user_ns the caps and keyrings are relative to. */
+ struct ucounts *ucounts;
struct group_info *group_info; /* supplementary groups for euid/fsgid */
/* RCU deletion */
union {
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index f84fc2d9ce20..9a3ba69e9223 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -85,7 +85,7 @@ struct user_namespace {
struct ctl_table_header *sysctls;
#endif
struct ucounts *ucounts;
- int ucount_max[UCOUNT_COUNTS];
+ long ucount_max[UCOUNT_COUNTS];
} __randomize_layout;

struct ucounts {
@@ -93,7 +93,7 @@ struct ucounts {
struct user_namespace *ns;
kuid_t uid;
refcount_t count;
- atomic_t ucount[UCOUNT_COUNTS];
+ atomic_long_t ucount[UCOUNT_COUNTS];
};

extern struct user_namespace init_user_ns;
@@ -102,6 +102,15 @@ bool setup_userns_sysctls(struct user_namespace *ns);
void retire_userns_sysctls(struct user_namespace *ns);
struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid, enum ucount_type type);
void dec_ucount(struct ucounts *ucounts, enum ucount_type type);
+void put_ucounts(struct ucounts *ucounts);
+void set_cred_ucounts(struct cred *cred, struct user_namespace *ns, kuid_t uid);
+
+static inline struct ucounts *get_ucounts(struct ucounts *ucounts)
+{
+ if (ucounts)
+ refcount_inc(&ucounts->count);
+ return ucounts;
+}

#ifdef CONFIG_USER_NS

diff --git a/kernel/cred.c b/kernel/cred.c
index 421b1149c651..9473e71e784c 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -119,6 +119,8 @@ static void put_cred_rcu(struct rcu_head *rcu)
if (cred->group_info)
put_group_info(cred->group_info);
free_uid(cred->user);
+ if (cred->ucounts)
+ put_ucounts(cred->ucounts);
put_user_ns(cred->user_ns);
kmem_cache_free(cred_jar, cred);
}
@@ -144,6 +146,9 @@ void __put_cred(struct cred *cred)
BUG_ON(cred == current->cred);
BUG_ON(cred == current->real_cred);

+ if (cred->ucounts)
+ BUG_ON(cred->ucounts->ns != cred->user_ns);
+
if (cred->non_rcu)
put_cred_rcu(&cred->rcu);
else
@@ -270,6 +275,7 @@ struct cred *prepare_creds(void)
get_group_info(new->group_info);
get_uid(new->user);
get_user_ns(new->user_ns);
+ get_ucounts(new->ucounts);

#ifdef CONFIG_KEYS
key_get(new->session_keyring);
@@ -363,6 +369,7 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
ret = create_user_ns(new);
if (ret < 0)
goto error_put;
+ set_cred_ucounts(new, new->user_ns, new->euid);
}

#ifdef CONFIG_KEYS
@@ -485,8 +492,11 @@ int commit_creds(struct cred *new)
* in set_user().
*/
alter_cred_subscribers(new, 2);
- if (new->user != old->user)
- atomic_inc(&new->user->processes);
+ if (new->user != old->user || new->user_ns != old->user_ns) {
+ if (new->user != old->user)
+ atomic_inc(&new->user->processes);
+ set_cred_ucounts(new, new->user_ns, new->euid);
+ }
rcu_assign_pointer(task->real_cred, new);
rcu_assign_pointer(task->cred, new);
if (new->user != old->user)
@@ -661,6 +671,11 @@ void __init cred_init(void)
/* allocate a slab in which we can store credentials */
cred_jar = kmem_cache_create("cred_jar", sizeof(struct cred), 0,
SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
+ /*
+ * This is needed here because this is the first cred and there is no
+ * ucount reference to copy.
+ */
+ set_cred_ucounts(&init_cred, &init_user_ns, GLOBAL_ROOT_UID);
}

/**
@@ -704,6 +719,7 @@ struct cred *prepare_kernel_cred(struct task_struct *daemon)
get_uid(new->user);
get_user_ns(new->user_ns);
get_group_info(new->group_info);
+ get_ucounts(new->ucounts);

#ifdef CONFIG_KEYS
new->session_keyring = NULL;
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 82acd2226460..0b4e956d87bb 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -125,7 +125,7 @@ static struct ucounts *find_ucounts(struct user_namespace *ns, kuid_t uid, struc
return NULL;
}

-static struct ucounts *get_ucounts(struct user_namespace *ns, kuid_t uid)
+static struct ucounts *__get_ucounts(struct user_namespace *ns, kuid_t uid)
{
struct hlist_head *hashent = ucounts_hashentry(ns, uid);
struct ucounts *ucounts, *new;
@@ -158,7 +158,7 @@ static struct ucounts *get_ucounts(struct user_namespace *ns, kuid_t uid)
return ucounts;
}

-static void put_ucounts(struct ucounts *ucounts)
+void put_ucounts(struct ucounts *ucounts)
{
unsigned long flags;

@@ -169,14 +169,24 @@ static void put_ucounts(struct ucounts *ucounts)
}
}

-static inline bool atomic_inc_below(atomic_t *v, int u)
+void set_cred_ucounts(struct cred *cred, struct user_namespace *ns, kuid_t uid)
{
- int c, old;
- c = atomic_read(v);
+ struct ucounts *old = cred->ucounts;
+ if (old && old->ns == ns && uid_eq(old->uid, uid))
+ return;
+ cred->ucounts = __get_ucounts(ns, uid);
+ if (old)
+ put_ucounts(old);
+}
+
+static inline bool atomic_long_inc_below(atomic_long_t *v, int u)
+{
+ long c, old;
+ c = atomic_long_read(v);
for (;;) {
if (unlikely(c >= u))
return false;
- old = atomic_cmpxchg(v, c, c+1);
+ old = atomic_long_cmpxchg(v, c, c+1);
if (likely(old == c))
return true;
c = old;
@@ -188,19 +198,19 @@ struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid,
{
struct ucounts *ucounts, *iter, *bad;
struct user_namespace *tns;
- ucounts = get_ucounts(ns, uid);
+ ucounts = __get_ucounts(ns, uid);
for (iter = ucounts; iter; iter = tns->ucounts) {
int max;
tns = iter->ns;
max = READ_ONCE(tns->ucount_max[type]);
- if (!atomic_inc_below(&iter->ucount[type], max))
+ if (!atomic_long_inc_below(&iter->ucount[type], max))
goto fail;
}
return ucounts;
fail:
bad = iter;
for (iter = ucounts; iter != bad; iter = iter->ns->ucounts)
- atomic_dec(&iter->ucount[type]);
+ atomic_long_dec(&iter->ucount[type]);

put_ucounts(ucounts);
return NULL;
@@ -210,7 +220,7 @@ void dec_ucount(struct ucounts *ucounts, enum ucount_type type)
{
struct ucounts *iter;
for (iter = ucounts; iter; iter = iter->ns->ucounts) {
- int dec = atomic_dec_if_positive(&iter->ucount[type]);
+ int dec = atomic_long_dec_if_positive(&iter->ucount[type]);
WARN_ON_ONCE(dec < 0);
}
put_ucounts(ucounts);
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index af612945a4d0..4b8a4468d391 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -1280,6 +1280,7 @@ static int userns_install(struct nsset *nsset, struct ns_common *ns)

put_user_ns(cred->user_ns);
set_cred_user_ns(cred, get_user_ns(user_ns));
+ set_cred_ucounts(cred, user_ns, cred->euid);

return 0;
}
--
2.29.2

2021-01-19 05:17:48

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC PATCH v3 1/8] Use refcount_t for ucounts reference counting

On Mon, Jan 18, 2021 at 11:46 AM Alexey Gladkov
<[email protected]> wrote:
>
> Sorry about that. I thought that this code is not needed when switching
> from int to refcount_t. I was wrong.

Well, you _may_ be right. I personally didn't check how the return
value is used.

I only reacted to "it certainly _may_ be used, and there is absolutely
no comment anywhere about why it wouldn't matter".

Linus

2021-01-19 06:10:05

by Kaiwan N Billimoria

[permalink] [raw]
Subject: Re: [RFC PATCH v3 1/8] Use refcount_t for ucounts reference counting

(Sorry for the gmail client)
My 0.2, HTH:
a) AFAIK, refcount_inc() (and similar friends) don't return any value
b) they're designed to just WARN() if they saturate or if you're
attempting to increment the value 0 (as it's possibly a UAF bug)
c) refcount_inc_checked() is documented as "Similar to atomic_inc(),
but will saturate at UINT_MAX and WARN"
d) we should avoid using the __foo() when foo() 's present as far as
is sanely possible...

So is one expected to just fix things when they break? - as signalled
by the WARN firing?

--
Regards, kaiwan.


On Tue, Jan 19, 2021 at 2:26 AM Alexey Gladkov <[email protected]> wrote:
>
> On Mon, Jan 18, 2021 at 12:34:29PM -0800, Linus Torvalds wrote:
> > On Mon, Jan 18, 2021 at 11:46 AM Alexey Gladkov
> > <[email protected]> wrote:
> > >
> > > Sorry about that. I thought that this code is not needed when switching
> > > from int to refcount_t. I was wrong.
> >
> > Well, you _may_ be right. I personally didn't check how the return
> > value is used.
> >
> > I only reacted to "it certainly _may_ be used, and there is absolutely
> > no comment anywhere about why it wouldn't matter".
>
> I have not found examples where checked the overflow after calling
> refcount_inc/refcount_add.
>
> For example in kernel/fork.c:2298 :
>
> current->signal->nr_threads++;
> atomic_inc(&current->signal->live);
> refcount_inc(&current->signal->sigcnt);
>
> $ semind search signal_struct.sigcnt
> def include/linux/sched/signal.h:83 refcount_t sigcnt;
> m-- kernel/fork.c:723 put_signal_struct if (refcount_dec_and_test(&sig->sigcnt))
> m-- kernel/fork.c:1571 copy_signal refcount_set(&sig->sigcnt, 1);
> m-- kernel/fork.c:2298 copy_process refcount_inc(&current->signal->sigcnt);
>
> It seems to me that the only way is to use __refcount_inc and then compare
> the old value with REFCOUNT_MAX
>
> Since I have not seen examples of such checks, I thought that this is
> acceptable. Sorry once again. I have not tried to hide these changes.
>
> --
> Rgrds, legion
>
>

2021-01-20 02:02:01

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC PATCH v3 1/8] Use refcount_t for ucounts reference counting

Alexey Gladkov <[email protected]> writes:

> On Mon, Jan 18, 2021 at 12:34:29PM -0800, Linus Torvalds wrote:
>> On Mon, Jan 18, 2021 at 11:46 AM Alexey Gladkov
>> <[email protected]> wrote:
>> >
>> > Sorry about that. I thought that this code is not needed when switching
>> > from int to refcount_t. I was wrong.
>>
>> Well, you _may_ be right. I personally didn't check how the return
>> value is used.
>>
>> I only reacted to "it certainly _may_ be used, and there is absolutely
>> no comment anywhere about why it wouldn't matter".
>
> I have not found examples where checked the overflow after calling
> refcount_inc/refcount_add.
>
> For example in kernel/fork.c:2298 :
>
> current->signal->nr_threads++;
> atomic_inc(&current->signal->live);
> refcount_inc(&current->signal->sigcnt);
>
> $ semind search signal_struct.sigcnt
> def include/linux/sched/signal.h:83 refcount_t sigcnt;
> m-- kernel/fork.c:723 put_signal_struct if (refcount_dec_and_test(&sig->sigcnt))
> m-- kernel/fork.c:1571 copy_signal refcount_set(&sig->sigcnt, 1);
> m-- kernel/fork.c:2298 copy_process refcount_inc(&current->signal->sigcnt);
>
> It seems to me that the only way is to use __refcount_inc and then compare
> the old value with REFCOUNT_MAX
>
> Since I have not seen examples of such checks, I thought that this is
> acceptable. Sorry once again. I have not tried to hide these changes.

The current ucount code does check for overflow and fails the increment
in every case.

So arguably it will be a regression and inferior error handling behavior
if the code switches to the ``better'' refcount_t data structure.

I originally didn't use refcount_t because silently saturating and not
bothering to handle the error makes me uncomfortable.

Not having to acquire the ucounts_lock every time seems nice. Perhaps
the path forward would be to start with stupid/correct code that always
takes the ucounts_lock for every increment of ucounts->count, that is
later replaced with something more optimal.

Not impacting performance in the non-namespace cases and having good
performance in the other cases is a fundamental requirement of merging
code like this.

Eric

2021-01-20 02:02:49

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC PATCH v3 1/8] Use refcount_t for ucounts reference counting

[email protected] (Eric W. Biederman) writes:

> Alexey Gladkov <[email protected]> writes:
>
>> On Mon, Jan 18, 2021 at 12:34:29PM -0800, Linus Torvalds wrote:
>>> On Mon, Jan 18, 2021 at 11:46 AM Alexey Gladkov
>>> <[email protected]> wrote:
>>> >
>>> > Sorry about that. I thought that this code is not needed when switching
>>> > from int to refcount_t. I was wrong.
>>>
>>> Well, you _may_ be right. I personally didn't check how the return
>>> value is used.
>>>
>>> I only reacted to "it certainly _may_ be used, and there is absolutely
>>> no comment anywhere about why it wouldn't matter".
>>
>> I have not found examples where checked the overflow after calling
>> refcount_inc/refcount_add.
>>
>> For example in kernel/fork.c:2298 :
>>
>> current->signal->nr_threads++;
>> atomic_inc(&current->signal->live);
>> refcount_inc(&current->signal->sigcnt);
>>
>> $ semind search signal_struct.sigcnt
>> def include/linux/sched/signal.h:83 refcount_t sigcnt;
>> m-- kernel/fork.c:723 put_signal_struct if (refcount_dec_and_test(&sig->sigcnt))
>> m-- kernel/fork.c:1571 copy_signal refcount_set(&sig->sigcnt, 1);
>> m-- kernel/fork.c:2298 copy_process refcount_inc(&current->signal->sigcnt);
>>
>> It seems to me that the only way is to use __refcount_inc and then compare
>> the old value with REFCOUNT_MAX
>>
>> Since I have not seen examples of such checks, I thought that this is
>> acceptable. Sorry once again. I have not tried to hide these changes.
>
> The current ucount code does check for overflow and fails the increment
> in every case.
>
> So arguably it will be a regression and inferior error handling behavior
> if the code switches to the ``better'' refcount_t data structure.
>
> I originally didn't use refcount_t because silently saturating and not
> bothering to handle the error makes me uncomfortable.
>
> Not having to acquire the ucounts_lock every time seems nice. Perhaps
> the path forward would be to start with stupid/correct code that always
> takes the ucounts_lock for every increment of ucounts->count, that is
> later replaced with something more optimal.
>
> Not impacting performance in the non-namespace cases and having good
> performance in the other cases is a fundamental requirement of merging
> code like this.

So starting with something easy to comprehend and simple, may make it
easier to figure out how to optimize the code.

Eric

2021-01-21 12:09:25

by Alexey Gladkov

[permalink] [raw]
Subject: Re: [RFC PATCH v3 1/8] Use refcount_t for ucounts reference counting

On Tue, Jan 19, 2021 at 07:57:36PM -0600, Eric W. Biederman wrote:
> Alexey Gladkov <[email protected]> writes:
>
> > On Mon, Jan 18, 2021 at 12:34:29PM -0800, Linus Torvalds wrote:
> >> On Mon, Jan 18, 2021 at 11:46 AM Alexey Gladkov
> >> <[email protected]> wrote:
> >> >
> >> > Sorry about that. I thought that this code is not needed when switching
> >> > from int to refcount_t. I was wrong.
> >>
> >> Well, you _may_ be right. I personally didn't check how the return
> >> value is used.
> >>
> >> I only reacted to "it certainly _may_ be used, and there is absolutely
> >> no comment anywhere about why it wouldn't matter".
> >
> > I have not found examples where checked the overflow after calling
> > refcount_inc/refcount_add.
> >
> > For example in kernel/fork.c:2298 :
> >
> > current->signal->nr_threads++;
> > atomic_inc(&current->signal->live);
> > refcount_inc(&current->signal->sigcnt);
> >
> > $ semind search signal_struct.sigcnt
> > def include/linux/sched/signal.h:83 refcount_t sigcnt;
> > m-- kernel/fork.c:723 put_signal_struct if (refcount_dec_and_test(&sig->sigcnt))
> > m-- kernel/fork.c:1571 copy_signal refcount_set(&sig->sigcnt, 1);
> > m-- kernel/fork.c:2298 copy_process refcount_inc(&current->signal->sigcnt);
> >
> > It seems to me that the only way is to use __refcount_inc and then compare
> > the old value with REFCOUNT_MAX
> >
> > Since I have not seen examples of such checks, I thought that this is
> > acceptable. Sorry once again. I have not tried to hide these changes.
>
> The current ucount code does check for overflow and fails the increment
> in every case.
>
> So arguably it will be a regression and inferior error handling behavior
> if the code switches to the ``better'' refcount_t data structure.
>
> I originally didn't use refcount_t because silently saturating and not
> bothering to handle the error makes me uncomfortable.
>
> Not having to acquire the ucounts_lock every time seems nice. Perhaps
> the path forward would be to start with stupid/correct code that always
> takes the ucounts_lock for every increment of ucounts->count, that is
> later replaced with something more optimal.
>
> Not impacting performance in the non-namespace cases and having good
> performance in the other cases is a fundamental requirement of merging
> code like this.

Did I understand your suggestion correctly that you suggest to use
spin_lock for atomic_read and atomic_inc ?

If so, then we are already incrementing the counter under ucounts_lock.

...
if (atomic_read(&ucounts->count) == INT_MAX)
ucounts = NULL;
else
atomic_inc(&ucounts->count);
spin_unlock_irq(&ucounts_lock);
return ucounts;

something like this ?

--
Rgrds, legion

2021-01-21 15:58:05

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC PATCH v3 1/8] Use refcount_t for ucounts reference counting

Alexey Gladkov <[email protected]> writes:

> On Tue, Jan 19, 2021 at 07:57:36PM -0600, Eric W. Biederman wrote:
>> Alexey Gladkov <[email protected]> writes:
>>
>> > On Mon, Jan 18, 2021 at 12:34:29PM -0800, Linus Torvalds wrote:
>> >> On Mon, Jan 18, 2021 at 11:46 AM Alexey Gladkov
>> >> <[email protected]> wrote:
>> >> >
>> >> > Sorry about that. I thought that this code is not needed when switching
>> >> > from int to refcount_t. I was wrong.
>> >>
>> >> Well, you _may_ be right. I personally didn't check how the return
>> >> value is used.
>> >>
>> >> I only reacted to "it certainly _may_ be used, and there is absolutely
>> >> no comment anywhere about why it wouldn't matter".
>> >
>> > I have not found examples where checked the overflow after calling
>> > refcount_inc/refcount_add.
>> >
>> > For example in kernel/fork.c:2298 :
>> >
>> > current->signal->nr_threads++;
>> > atomic_inc(&current->signal->live);
>> > refcount_inc(&current->signal->sigcnt);
>> >
>> > $ semind search signal_struct.sigcnt
>> > def include/linux/sched/signal.h:83 refcount_t sigcnt;
>> > m-- kernel/fork.c:723 put_signal_struct if (refcount_dec_and_test(&sig->sigcnt))
>> > m-- kernel/fork.c:1571 copy_signal refcount_set(&sig->sigcnt, 1);
>> > m-- kernel/fork.c:2298 copy_process refcount_inc(&current->signal->sigcnt);
>> >
>> > It seems to me that the only way is to use __refcount_inc and then compare
>> > the old value with REFCOUNT_MAX
>> >
>> > Since I have not seen examples of such checks, I thought that this is
>> > acceptable. Sorry once again. I have not tried to hide these changes.
>>
>> The current ucount code does check for overflow and fails the increment
>> in every case.
>>
>> So arguably it will be a regression and inferior error handling behavior
>> if the code switches to the ``better'' refcount_t data structure.
>>
>> I originally didn't use refcount_t because silently saturating and not
>> bothering to handle the error makes me uncomfortable.
>>
>> Not having to acquire the ucounts_lock every time seems nice. Perhaps
>> the path forward would be to start with stupid/correct code that always
>> takes the ucounts_lock for every increment of ucounts->count, that is
>> later replaced with something more optimal.
>>
>> Not impacting performance in the non-namespace cases and having good
>> performance in the other cases is a fundamental requirement of merging
>> code like this.
>
> Did I understand your suggestion correctly that you suggest to use
> spin_lock for atomic_read and atomic_inc ?
>
> If so, then we are already incrementing the counter under ucounts_lock.
>
> ...
> if (atomic_read(&ucounts->count) == INT_MAX)
> ucounts = NULL;
> else
> atomic_inc(&ucounts->count);
> spin_unlock_irq(&ucounts_lock);
> return ucounts;
>
> something like this ?

Yes. But without atomics. Something a bit more like:
> ...
> if (ucounts->count == INT_MAX)
> ucounts = NULL;
> else
> ucounts->count++;
> spin_unlock_irq(&ucounts_lock);
> return ucounts;

I do believe at some point we will want to say using the spin_lock for
ucounts->count is cumbersome, and suboptimal and we want to change it to
get a better performing implementation.

Just for getting the semantics correct we should be able to use just
ucounts_lock for locking. Then when everything is working we can
profile and optimize the code.

I just don't want figuring out what is needed to get hung up over little
details that we can change later.

Eric

2021-01-21 16:16:01

by Alexey Gladkov

[permalink] [raw]
Subject: Re: [RFC PATCH v3 1/8] Use refcount_t for ucounts reference counting

On Thu, Jan 21, 2021 at 09:50:34AM -0600, Eric W. Biederman wrote:
> >> The current ucount code does check for overflow and fails the increment
> >> in every case.
> >>
> >> So arguably it will be a regression and inferior error handling behavior
> >> if the code switches to the ``better'' refcount_t data structure.
> >>
> >> I originally didn't use refcount_t because silently saturating and not
> >> bothering to handle the error makes me uncomfortable.
> >>
> >> Not having to acquire the ucounts_lock every time seems nice. Perhaps
> >> the path forward would be to start with stupid/correct code that always
> >> takes the ucounts_lock for every increment of ucounts->count, that is
> >> later replaced with something more optimal.
> >>
> >> Not impacting performance in the non-namespace cases and having good
> >> performance in the other cases is a fundamental requirement of merging
> >> code like this.
> >
> > Did I understand your suggestion correctly that you suggest to use
> > spin_lock for atomic_read and atomic_inc ?
> >
> > If so, then we are already incrementing the counter under ucounts_lock.
> >
> > ...
> > if (atomic_read(&ucounts->count) == INT_MAX)
> > ucounts = NULL;
> > else
> > atomic_inc(&ucounts->count);
> > spin_unlock_irq(&ucounts_lock);
> > return ucounts;
> >
> > something like this ?
>
> Yes. But without atomics. Something a bit more like:
> > ...
> > if (ucounts->count == INT_MAX)
> > ucounts = NULL;
> > else
> > ucounts->count++;
> > spin_unlock_irq(&ucounts_lock);
> > return ucounts;

This is the original code.

> I do believe at some point we will want to say using the spin_lock for
> ucounts->count is cumbersome, and suboptimal and we want to change it to
> get a better performing implementation.
>
> Just for getting the semantics correct we should be able to use just
> ucounts_lock for locking. Then when everything is working we can
> profile and optimize the code.
>
> I just don't want figuring out what is needed to get hung up over little
> details that we can change later.

OK. So I will drop this my change for now.

--
Rgrds, legion