2021-02-15 12:44:33

by Alexey Gladkov

[permalink] [raw]
Subject: [PATCH v6 0/7] Count rlimits in each user namespace

Preface
-------
These patches are for binding the rlimit counters to a user in user namespace.
This patch set can be applied on top of:

git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git v5.11

Problem
-------
The RLIMIT_NPROC, RLIMIT_MEMLOCK, RLIMIT_SIGPENDING, RLIMIT_MSGQUEUE rlimits
implementation places the counters in user_struct [1]. These limits are global
between processes and persists for the lifetime of the process, even if
processes are in different user namespaces.

To illustrate the impact of rlimits, let's say there is a program that does not
fork. Some service-A wants to run this program as user X in multiple containers.
Since the program never fork the service wants to set RLIMIT_NPROC=1.

service-A
\- program (uid=1000, container1, rlimit_nproc=1)
\- program (uid=1000, container2, rlimit_nproc=1)

The service-A sets RLIMIT_NPROC=1 and runs the program in container1. When the
service-A tries to run a program with RLIMIT_NPROC=1 in container2 it fails
since user X already has one running process.

The problem is not that the limit from container1 affects container2. The
problem is that limit is verified against the global counter that reflects
the number of processes in all containers.

This problem can be worked around by using different users for each container
but in this case we face a different problem of uid mapping when transferring
files from one container to another.

Eric W. Biederman mentioned this issue [2][3].

Introduced changes
------------------
To address the problem, we bind rlimit counters to user namespace. Each counter
reflects the number of processes in a given uid in a given user namespace. The
result is a tree of rlimit counters with the biggest value at the root (aka
init_user_ns). The limit is considered exceeded if it's exceeded up in the tree.

[1] https://lore.kernel.org/containers/[email protected]/
[2] https://lists.linuxfoundation.org/pipermail/containers/2020-August/042096.html
[3] https://lists.linuxfoundation.org/pipermail/containers/2020-October/042524.html

Changelog
---------
v6:
* Fixed issues found by lkp-tests project.
* Rebased onto v5.11.

v5:
* Split the first commit into two commits: change ucounts.count type to atomic_long_t
and add ucounts to cred. These commits were merged by mistake during the rebase.
* The __get_ucounts() renamed to alloc_ucounts().
* The cred.ucounts update has been moved from commit_creds() as it did not allow
to handle errors.
* Added error handling of set_cred_ucounts().

v4:
* Reverted the type change of ucounts.count to refcount_t.
* Fixed typo in the kernel/cred.c

v3:
* Added get_ucounts() function to increase the reference count. The existing
get_counts() function renamed to __get_ucounts().
* The type of ucounts.count changed from atomic_t to refcount_t.
* Dropped 'const' from set_cred_ucounts() arguments.
* Fixed a bug with freeing the cred structure after calling cred_alloc_blank().
* Commit messages have been updated.
* Added selftest.

v2:
* RLIMIT_MEMLOCK, RLIMIT_SIGPENDING and RLIMIT_MSGQUEUE are migrated to ucounts.
* Added ucounts for pair uid and user namespace into cred.
* Added the ability to increase ucount by more than 1.

v1:
* After discussion with Eric W. Biederman, I increased the size of ucounts to
atomic_long_t.
* Added ucount_max to avoid the fork bomb.

--

Alexey Gladkov (7):
Increase size of ucounts to atomic_long_t
Add a reference to ucounts for each cred
Reimplement RLIMIT_NPROC on top of ucounts
Reimplement RLIMIT_MSGQUEUE on top of ucounts
Reimplement RLIMIT_SIGPENDING on top of ucounts
Reimplement RLIMIT_MEMLOCK on top of ucounts
kselftests: Add test to check for rlimit changes in different user
namespaces

fs/exec.c | 6 +-
fs/hugetlbfs/inode.c | 16 +-
fs/io-wq.c | 22 ++-
fs/io-wq.h | 2 +-
fs/io_uring.c | 2 +-
fs/proc/array.c | 2 +-
include/linux/cred.h | 4 +
include/linux/hugetlb.h | 4 +-
include/linux/mm.h | 4 +-
include/linux/sched/user.h | 7 -
include/linux/shmem_fs.h | 2 +-
include/linux/signal_types.h | 4 +-
include/linux/user_namespace.h | 24 ++-
ipc/mqueue.c | 29 ++--
ipc/shm.c | 30 ++--
kernel/cred.c | 50 +++++-
kernel/exit.c | 2 +-
kernel/fork.c | 18 +-
kernel/signal.c | 53 +++---
kernel/sys.c | 14 +-
kernel/ucount.c | 120 +++++++++++--
kernel/user.c | 3 -
kernel/user_namespace.c | 9 +-
mm/memfd.c | 5 +-
mm/mlock.c | 35 ++--
mm/mmap.c | 4 +-
mm/shmem.c | 8 +-
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/rlimits/.gitignore | 2 +
tools/testing/selftests/rlimits/Makefile | 6 +
tools/testing/selftests/rlimits/config | 1 +
.../selftests/rlimits/rlimits-per-userns.c | 161 ++++++++++++++++++
32 files changed, 495 insertions(+), 155 deletions(-)
create mode 100644 tools/testing/selftests/rlimits/.gitignore
create mode 100644 tools/testing/selftests/rlimits/Makefile
create mode 100644 tools/testing/selftests/rlimits/config
create mode 100644 tools/testing/selftests/rlimits/rlimits-per-userns.c

--
2.29.2


2021-02-15 12:44:40

by Alexey Gladkov

[permalink] [raw]
Subject: [PATCH v6 2/7] Add a reference to ucounts for each cred

For RLIMIT_NPROC and some other rlimits the user_struct that holds the
global limit is kept alive for the lifetime of a process by keeping it
in struct cred. Adding a pointer to ucounts in the struct cred will
allow to track RLIMIT_NPROC not only for user in the system, but for
user in the user_namespace.

Updating ucounts may require memory allocation which may fail. So, we
cannot change cred.ucounts in the commit_creds() because this function
cannot fail and it should always return 0. For this reason, we modify
cred.ucounts before calling the commit_creds().

Changelog

v6:
* Fix null-ptr-deref in is_ucounts_overlimit() detected by trinity. This
error was caused by the fact that cred_alloc_blank() left the ucounts
pointer empty.

Reported-by: kernel test robot <[email protected]>
Signed-off-by: Alexey Gladkov <[email protected]>
---
fs/exec.c | 4 ++++
include/linux/cred.h | 2 ++
include/linux/user_namespace.h | 4 ++++
kernel/cred.c | 40 ++++++++++++++++++++++++++++++++++
kernel/fork.c | 6 +++++
kernel/sys.c | 12 ++++++++++
kernel/ucount.c | 40 +++++++++++++++++++++++++++++++---
kernel/user_namespace.c | 3 +++
8 files changed, 108 insertions(+), 3 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 5d4d52039105..0371a3400be5 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1360,6 +1360,10 @@ int begin_new_exec(struct linux_binprm * bprm)
WRITE_ONCE(me->self_exec_id, me->self_exec_id + 1);
flush_signal_handlers(me, 0);

+ retval = set_cred_ucounts(bprm->cred);
+ if (retval < 0)
+ goto out_unlock;
+
/*
* install the new credentials for this executable
*/
diff --git a/include/linux/cred.h b/include/linux/cred.h
index 18639c069263..ad160e5fe5c6 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -144,6 +144,7 @@ struct cred {
#endif
struct user_struct *user; /* real user ID subscription */
struct user_namespace *user_ns; /* user_ns the caps and keyrings are relative to. */
+ struct ucounts *ucounts;
struct group_info *group_info; /* supplementary groups for euid/fsgid */
/* RCU deletion */
union {
@@ -170,6 +171,7 @@ extern int set_security_override_from_ctx(struct cred *, const char *);
extern int set_create_files_as(struct cred *, struct inode *);
extern int cred_fscmp(const struct cred *, const struct cred *);
extern void __init cred_init(void);
+extern int set_cred_ucounts(struct cred *);

/*
* check for validity of credentials
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 0bb833fd41f4..f71b5a4a3e74 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -97,11 +97,15 @@ struct ucounts {
};

extern struct user_namespace init_user_ns;
+extern struct ucounts init_ucounts;

bool setup_userns_sysctls(struct user_namespace *ns);
void retire_userns_sysctls(struct user_namespace *ns);
struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid, enum ucount_type type);
void dec_ucount(struct ucounts *ucounts, enum ucount_type type);
+struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid);
+struct ucounts *get_ucounts(struct ucounts *ucounts);
+void put_ucounts(struct ucounts *ucounts);

#ifdef CONFIG_USER_NS

diff --git a/kernel/cred.c b/kernel/cred.c
index 421b1149c651..58a8a9e24347 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -60,6 +60,7 @@ struct cred init_cred = {
.user = INIT_USER,
.user_ns = &init_user_ns,
.group_info = &init_groups,
+ .ucounts = &init_ucounts,
};

static inline void set_cred_subscribers(struct cred *cred, int n)
@@ -119,6 +120,8 @@ static void put_cred_rcu(struct rcu_head *rcu)
if (cred->group_info)
put_group_info(cred->group_info);
free_uid(cred->user);
+ if (cred->ucounts)
+ put_ucounts(cred->ucounts);
put_user_ns(cred->user_ns);
kmem_cache_free(cred_jar, cred);
}
@@ -222,6 +225,7 @@ struct cred *cred_alloc_blank(void)
#ifdef CONFIG_DEBUG_CREDENTIALS
new->magic = CRED_MAGIC;
#endif
+ new->ucounts = get_ucounts(&init_ucounts);

if (security_cred_alloc_blank(new, GFP_KERNEL_ACCOUNT) < 0)
goto error;
@@ -284,6 +288,11 @@ struct cred *prepare_creds(void)

if (security_prepare_creds(new, old, GFP_KERNEL_ACCOUNT) < 0)
goto error;
+
+ new->ucounts = get_ucounts(new->ucounts);
+ if (!new->ucounts)
+ goto error;
+
validate_creds(new);
return new;

@@ -363,6 +372,8 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
ret = create_user_ns(new);
if (ret < 0)
goto error_put;
+ if (set_cred_ucounts(new) < 0)
+ goto error_put;
}

#ifdef CONFIG_KEYS
@@ -653,6 +664,31 @@ int cred_fscmp(const struct cred *a, const struct cred *b)
}
EXPORT_SYMBOL(cred_fscmp);

+int set_cred_ucounts(struct cred *new)
+{
+ struct task_struct *task = current;
+ const struct cred *old = task->real_cred;
+ struct ucounts *old_ucounts = new->ucounts;
+
+ if (new->user == old->user && new->user_ns == old->user_ns)
+ return 0;
+
+ /*
+ * This optimization is needed because alloc_ucounts() uses locks
+ * for table lookups.
+ */
+ if (old_ucounts && old_ucounts->ns == new->user_ns && uid_eq(old_ucounts->uid, new->euid))
+ return 0;
+
+ if (!(new->ucounts = alloc_ucounts(new->user_ns, new->euid)))
+ return -EAGAIN;
+
+ if (old_ucounts)
+ put_ucounts(old_ucounts);
+
+ return 0;
+}
+
/*
* initialise the credentials stuff
*/
@@ -719,6 +755,10 @@ struct cred *prepare_kernel_cred(struct task_struct *daemon)
if (security_prepare_creds(new, old, GFP_KERNEL_ACCOUNT) < 0)
goto error;

+ new->ucounts = get_ucounts(new->ucounts);
+ if (!new->ucounts)
+ goto error;
+
put_cred(old);
validate_creds(new);
return new;
diff --git a/kernel/fork.c b/kernel/fork.c
index d66cd1014211..40a5da7d3d70 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2957,6 +2957,12 @@ int ksys_unshare(unsigned long unshare_flags)
if (err)
goto bad_unshare_cleanup_cred;

+ if (new_cred) {
+ err = set_cred_ucounts(new_cred);
+ if (err)
+ goto bad_unshare_cleanup_cred;
+ }
+
if (new_fs || new_fd || do_sysvsem || new_cred || new_nsproxy) {
if (do_sysvsem) {
/*
diff --git a/kernel/sys.c b/kernel/sys.c
index 51f00fe20e4d..373def7debe8 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -553,6 +553,10 @@ long __sys_setreuid(uid_t ruid, uid_t euid)
if (retval < 0)
goto error;

+ retval = set_cred_ucounts(new);
+ if (retval < 0)
+ goto error;
+
return commit_creds(new);

error:
@@ -611,6 +615,10 @@ long __sys_setuid(uid_t uid)
if (retval < 0)
goto error;

+ retval = set_cred_ucounts(new);
+ if (retval < 0)
+ goto error;
+
return commit_creds(new);

error:
@@ -686,6 +694,10 @@ long __sys_setresuid(uid_t ruid, uid_t euid, uid_t suid)
if (retval < 0)
goto error;

+ retval = set_cred_ucounts(new);
+ if (retval < 0)
+ goto error;
+
return commit_creds(new);

error:
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 04c561751af1..50cc1dfb7d28 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -8,6 +8,12 @@
#include <linux/kmemleak.h>
#include <linux/user_namespace.h>

+struct ucounts init_ucounts = {
+ .ns = &init_user_ns,
+ .uid = GLOBAL_ROOT_UID,
+ .count = 1,
+};
+
#define UCOUNTS_HASHTABLE_BITS 10
static struct hlist_head ucounts_hashtable[(1 << UCOUNTS_HASHTABLE_BITS)];
static DEFINE_SPINLOCK(ucounts_lock);
@@ -125,7 +131,15 @@ static struct ucounts *find_ucounts(struct user_namespace *ns, kuid_t uid, struc
return NULL;
}

-static struct ucounts *get_ucounts(struct user_namespace *ns, kuid_t uid)
+static void hlist_add_ucounts(struct ucounts *ucounts)
+{
+ struct hlist_head *hashent = ucounts_hashentry(ucounts->ns, ucounts->uid);
+ spin_lock_irq(&ucounts_lock);
+ hlist_add_head(&ucounts->node, hashent);
+ spin_unlock_irq(&ucounts_lock);
+}
+
+struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid)
{
struct hlist_head *hashent = ucounts_hashentry(ns, uid);
struct ucounts *ucounts, *new;
@@ -160,7 +174,26 @@ static struct ucounts *get_ucounts(struct user_namespace *ns, kuid_t uid)
return ucounts;
}

-static void put_ucounts(struct ucounts *ucounts)
+struct ucounts *get_ucounts(struct ucounts *ucounts)
+{
+ unsigned long flags;
+
+ if (!ucounts)
+ return NULL;
+
+ spin_lock_irqsave(&ucounts_lock, flags);
+ if (ucounts->count == INT_MAX) {
+ WARN_ONCE(1, "ucounts: counter has reached its maximum value");
+ ucounts = NULL;
+ } else {
+ ucounts->count += 1;
+ }
+ spin_unlock_irqrestore(&ucounts_lock, flags);
+
+ return ucounts;
+}
+
+void put_ucounts(struct ucounts *ucounts)
{
unsigned long flags;

@@ -194,7 +227,7 @@ struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid,
{
struct ucounts *ucounts, *iter, *bad;
struct user_namespace *tns;
- ucounts = get_ucounts(ns, uid);
+ ucounts = alloc_ucounts(ns, uid);
for (iter = ucounts; iter; iter = tns->ucounts) {
long max;
tns = iter->ns;
@@ -237,6 +270,7 @@ static __init int user_namespace_sysctl_init(void)
BUG_ON(!user_header);
BUG_ON(!setup_userns_sysctls(&init_user_ns));
#endif
+ hlist_add_ucounts(&init_ucounts);
return 0;
}
subsys_initcall(user_namespace_sysctl_init);
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index af612945a4d0..516db53166ab 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -1281,6 +1281,9 @@ static int userns_install(struct nsset *nsset, struct ns_common *ns)
put_user_ns(cred->user_ns);
set_cred_user_ns(cred, get_user_ns(user_ns));

+ if (set_cred_ucounts(cred) < 0)
+ return -EINVAL;
+
return 0;
}

--
2.29.2

2021-02-15 12:46:00

by Alexey Gladkov

[permalink] [raw]
Subject: [PATCH v6 4/7] Reimplement RLIMIT_MSGQUEUE on top of ucounts

The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by the user. However, the value of the previous
user_namespaces cannot be exceeded.

Signed-off-by: Alexey Gladkov <[email protected]>
---
include/linux/sched/user.h | 4 ----
include/linux/user_namespace.h | 1 +
ipc/mqueue.c | 29 +++++++++++++++--------------
kernel/fork.c | 1 +
kernel/ucount.c | 1 +
kernel/user_namespace.c | 1 +
6 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index d33d867ad6c1..8a34446681aa 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -18,10 +18,6 @@ struct user_struct {
#endif
#ifdef CONFIG_EPOLL
atomic_long_t epoll_watches; /* The number of file descriptors currently watched */
-#endif
-#ifdef CONFIG_POSIX_MQUEUE
- /* protected by mq_lock */
- unsigned long mq_bytes; /* How many bytes can be allocated to mqueue? */
#endif
unsigned long locked_shm; /* How many pages of mlocked shm ? */
unsigned long unix_inflight; /* How many files in flight in unix sockets */
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 0a27cd049404..52453143fe23 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -51,6 +51,7 @@ enum ucount_type {
UCOUNT_INOTIFY_WATCHES,
#endif
UCOUNT_RLIMIT_NPROC,
+ UCOUNT_RLIMIT_MSGQUEUE,
UCOUNT_COUNTS,
};

diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index beff0cfcd1e8..05fcf067131f 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -144,7 +144,7 @@ struct mqueue_inode_info {
struct pid *notify_owner;
u32 notify_self_exec_id;
struct user_namespace *notify_user_ns;
- struct user_struct *user; /* user who created, for accounting */
+ struct ucounts *ucounts; /* user who created, for accounting */
struct sock *notify_sock;
struct sk_buff *notify_cookie;

@@ -292,7 +292,6 @@ static struct inode *mqueue_get_inode(struct super_block *sb,
struct ipc_namespace *ipc_ns, umode_t mode,
struct mq_attr *attr)
{
- struct user_struct *u = current_user();
struct inode *inode;
int ret = -ENOMEM;

@@ -309,6 +308,8 @@ static struct inode *mqueue_get_inode(struct super_block *sb,
if (S_ISREG(mode)) {
struct mqueue_inode_info *info;
unsigned long mq_bytes, mq_treesize;
+ struct ucounts *ucounts;
+ bool overlimit;

inode->i_fop = &mqueue_file_operations;
inode->i_size = FILENT_SIZE;
@@ -321,7 +322,7 @@ static struct inode *mqueue_get_inode(struct super_block *sb,
info->notify_owner = NULL;
info->notify_user_ns = NULL;
info->qsize = 0;
- info->user = NULL; /* set when all is ok */
+ info->ucounts = NULL; /* set when all is ok */
info->msg_tree = RB_ROOT;
info->msg_tree_rightmost = NULL;
info->node_cache = NULL;
@@ -371,19 +372,19 @@ static struct inode *mqueue_get_inode(struct super_block *sb,
if (mq_bytes + mq_treesize < mq_bytes)
goto out_inode;
mq_bytes += mq_treesize;
+ ucounts = current_ucounts();
spin_lock(&mq_lock);
- if (u->mq_bytes + mq_bytes < u->mq_bytes ||
- u->mq_bytes + mq_bytes > rlimit(RLIMIT_MSGQUEUE)) {
+ overlimit = inc_rlimit_ucounts_and_test(ucounts, UCOUNT_RLIMIT_MSGQUEUE,
+ mq_bytes, rlimit(RLIMIT_MSGQUEUE));
+ if (overlimit) {
+ dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_MSGQUEUE, mq_bytes);
spin_unlock(&mq_lock);
/* mqueue_evict_inode() releases info->messages */
ret = -EMFILE;
goto out_inode;
}
- u->mq_bytes += mq_bytes;
spin_unlock(&mq_lock);
-
- /* all is ok */
- info->user = get_uid(u);
+ info->ucounts = get_ucounts(ucounts);
} else if (S_ISDIR(mode)) {
inc_nlink(inode);
/* Some things misbehave if size == 0 on a directory */
@@ -497,7 +498,7 @@ static void mqueue_free_inode(struct inode *inode)
static void mqueue_evict_inode(struct inode *inode)
{
struct mqueue_inode_info *info;
- struct user_struct *user;
+ struct ucounts *ucounts;
struct ipc_namespace *ipc_ns;
struct msg_msg *msg, *nmsg;
LIST_HEAD(tmp_msg);
@@ -520,8 +521,8 @@ static void mqueue_evict_inode(struct inode *inode)
free_msg(msg);
}

- user = info->user;
- if (user) {
+ ucounts = info->ucounts;
+ if (ucounts) {
unsigned long mq_bytes, mq_treesize;

/* Total amount of bytes accounted for the mqueue */
@@ -533,7 +534,7 @@ static void mqueue_evict_inode(struct inode *inode)
info->attr.mq_msgsize);

spin_lock(&mq_lock);
- user->mq_bytes -= mq_bytes;
+ dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_MSGQUEUE, mq_bytes);
/*
* get_ns_from_inode() ensures that the
* (ipc_ns = sb->s_fs_info) is either a valid ipc_ns
@@ -543,7 +544,7 @@ static void mqueue_evict_inode(struct inode *inode)
if (ipc_ns)
ipc_ns->mq_queues_count--;
spin_unlock(&mq_lock);
- free_uid(user);
+ put_ucounts(ucounts);
}
if (ipc_ns)
put_ipc_ns(ipc_ns);
diff --git a/kernel/fork.c b/kernel/fork.c
index 812b023ecdce..0a939332efcc 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -823,6 +823,7 @@ void __init fork_init(void)
init_user_ns.ucount_max[i] = max_threads/2;

init_user_ns.ucount_max[UCOUNT_RLIMIT_NPROC] = task_rlimit(&init_task, RLIMIT_NPROC);
+ init_user_ns.ucount_max[UCOUNT_RLIMIT_MSGQUEUE] = task_rlimit(&init_task, RLIMIT_MSGQUEUE);

#ifdef CONFIG_VMAP_STACK
cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "fork:vm_stack_cache",
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 2f42d2ee6e27..6fb2ebdef0bc 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -81,6 +81,7 @@ static struct ctl_table user_table[] = {
UCOUNT_ENTRY("max_inotify_instances"),
UCOUNT_ENTRY("max_inotify_watches"),
#endif
+ { },
{ },
{ }
};
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 2434b13b02e5..cc90d5203acf 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -122,6 +122,7 @@ int create_user_ns(struct cred *new)
ns->ucount_max[i] = INT_MAX;
}
ns->ucount_max[UCOUNT_RLIMIT_NPROC] = rlimit(RLIMIT_NPROC);
+ ns->ucount_max[UCOUNT_RLIMIT_MSGQUEUE] = rlimit(RLIMIT_MSGQUEUE);
ns->ucounts = ucounts;

/* Inherit USERNS_SETGROUPS_ALLOWED from our parent */
--
2.29.2

2021-02-15 12:46:31

by Alexey Gladkov

[permalink] [raw]
Subject: [PATCH v6 3/7] Reimplement RLIMIT_NPROC on top of ucounts

The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by the user. However, the value of the previous
user_namespaces cannot be exceeded.

To illustrate the impact of rlimits, let's say there is a program that
does not fork. Some service-A wants to run this program as user X in
multiple containers. Since the program never fork the service wants to
set RLIMIT_NPROC=1.

service-A
\- program (uid=1000, container1, rlimit_nproc=1)
\- program (uid=1000, container2, rlimit_nproc=1)

The service-A sets RLIMIT_NPROC=1 and runs the program in container1.
When the service-A tries to run a program with RLIMIT_NPROC=1 in
container2 it fails since user X already has one running process.

We cannot use existing inc_ucounts / dec_ucounts because they do not
allow us to exceed the maximum for the counter. Some rlimits can be
overlimited by root or if the user has the appropriate capability.

Signed-off-by: Alexey Gladkov <[email protected]>
---
fs/exec.c | 2 +-
fs/io-wq.c | 22 ++++++------
fs/io-wq.h | 2 +-
fs/io_uring.c | 2 +-
include/linux/cred.h | 2 ++
include/linux/sched/user.h | 1 -
include/linux/user_namespace.h | 13 ++++++++
kernel/cred.c | 10 +++---
kernel/exit.c | 2 +-
kernel/fork.c | 9 ++---
kernel/sys.c | 2 +-
kernel/ucount.c | 61 ++++++++++++++++++++++++++++++++++
kernel/user.c | 1 -
kernel/user_namespace.c | 3 +-
14 files changed, 103 insertions(+), 29 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 0371a3400be5..e6d7f186f33c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1874,7 +1874,7 @@ static int do_execveat_common(int fd, struct filename *filename,
* whether NPROC limit is still exceeded.
*/
if ((current->flags & PF_NPROC_EXCEEDED) &&
- atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
+ is_ucounts_overlimit(current_ucounts(), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) {
retval = -EAGAIN;
goto out_ret;
}
diff --git a/fs/io-wq.c b/fs/io-wq.c
index a564f36e260c..5b6940c90c61 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -20,6 +20,7 @@
#include <linux/blk-cgroup.h>
#include <linux/audit.h>
#include <linux/cpu.h>
+#include <linux/user_namespace.h>

#include "../kernel/sched/sched.h"
#include "io-wq.h"
@@ -120,7 +121,7 @@ struct io_wq {
io_wq_work_fn *do_work;

struct task_struct *manager;
- struct user_struct *user;
+ const struct cred *cred;
refcount_t refs;
struct completion done;

@@ -234,7 +235,7 @@ static void io_worker_exit(struct io_worker *worker)
if (worker->flags & IO_WORKER_F_RUNNING)
atomic_dec(&acct->nr_running);
if (!(worker->flags & IO_WORKER_F_BOUND))
- atomic_dec(&wqe->wq->user->processes);
+ dec_rlimit_ucounts(wqe->wq->cred->ucounts, UCOUNT_RLIMIT_NPROC, 1);
worker->flags = 0;
preempt_enable();

@@ -364,15 +365,15 @@ static void __io_worker_busy(struct io_wqe *wqe, struct io_worker *worker,
worker->flags |= IO_WORKER_F_BOUND;
wqe->acct[IO_WQ_ACCT_UNBOUND].nr_workers--;
wqe->acct[IO_WQ_ACCT_BOUND].nr_workers++;
- atomic_dec(&wqe->wq->user->processes);
+ dec_rlimit_ucounts(wqe->wq->cred->ucounts, UCOUNT_RLIMIT_NPROC, 1);
} else {
worker->flags &= ~IO_WORKER_F_BOUND;
wqe->acct[IO_WQ_ACCT_UNBOUND].nr_workers++;
wqe->acct[IO_WQ_ACCT_BOUND].nr_workers--;
- atomic_inc(&wqe->wq->user->processes);
+ inc_rlimit_ucounts(wqe->wq->cred->ucounts, UCOUNT_RLIMIT_NPROC, 1);
}
io_wqe_inc_running(wqe, worker);
- }
+ }
}

/*
@@ -707,7 +708,7 @@ static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
raw_spin_unlock_irq(&wqe->lock);

if (index == IO_WQ_ACCT_UNBOUND)
- atomic_inc(&wq->user->processes);
+ inc_rlimit_ucounts(wq->cred->ucounts, UCOUNT_RLIMIT_NPROC, 1);

refcount_inc(&wq->refs);
wake_up_process(worker->task);
@@ -838,7 +839,7 @@ static bool io_wq_can_queue(struct io_wqe *wqe, struct io_wqe_acct *acct,
if (free_worker)
return true;

- if (atomic_read(&wqe->wq->user->processes) >= acct->max_workers &&
+ if (is_ucounts_overlimit(wqe->wq->cred->ucounts, UCOUNT_RLIMIT_NPROC, acct->max_workers) &&
!(capable(CAP_SYS_RESOURCE) || capable(CAP_SYS_ADMIN)))
return false;

@@ -1074,7 +1075,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
wq->do_work = data->do_work;

/* caller must already hold a reference to this */
- wq->user = data->user;
+ wq->cred = data->cred;

ret = -ENOMEM;
for_each_node(node) {
@@ -1090,10 +1091,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
wqe->node = alloc_node;
wqe->acct[IO_WQ_ACCT_BOUND].max_workers = bounded;
atomic_set(&wqe->acct[IO_WQ_ACCT_BOUND].nr_running, 0);
- if (wq->user) {
- wqe->acct[IO_WQ_ACCT_UNBOUND].max_workers =
- task_rlimit(current, RLIMIT_NPROC);
- }
+ wqe->acct[IO_WQ_ACCT_UNBOUND].max_workers = task_rlimit(current, RLIMIT_NPROC);
atomic_set(&wqe->acct[IO_WQ_ACCT_UNBOUND].nr_running, 0);
wqe->wq = wq;
raw_spin_lock_init(&wqe->lock);
diff --git a/fs/io-wq.h b/fs/io-wq.h
index b158f8addcf3..4130e247c556 100644
--- a/fs/io-wq.h
+++ b/fs/io-wq.h
@@ -111,7 +111,7 @@ typedef void (free_work_fn)(struct io_wq_work *);
typedef struct io_wq_work *(io_wq_work_fn)(struct io_wq_work *);

struct io_wq_data {
- struct user_struct *user;
+ const struct cred *cred;

io_wq_work_fn *do_work;
free_work_fn *free_work;
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 931671082e61..389998f39843 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -8084,7 +8084,7 @@ static int io_init_wq_offload(struct io_ring_ctx *ctx,
unsigned int concurrency;
int ret = 0;

- data.user = ctx->user;
+ data.cred = ctx->creds;
data.free_work = io_free_work;
data.do_work = io_wq_submit_work;

diff --git a/include/linux/cred.h b/include/linux/cred.h
index ad160e5fe5c6..8025fe48198f 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -372,6 +372,7 @@ static inline void put_cred(const struct cred *_cred)

#define task_uid(task) (task_cred_xxx((task), uid))
#define task_euid(task) (task_cred_xxx((task), euid))
+#define task_ucounts(task) (task_cred_xxx((task), ucounts))

#define current_cred_xxx(xxx) \
({ \
@@ -388,6 +389,7 @@ static inline void put_cred(const struct cred *_cred)
#define current_fsgid() (current_cred_xxx(fsgid))
#define current_cap() (current_cred_xxx(cap_effective))
#define current_user() (current_cred_xxx(user))
+#define current_ucounts() (current_cred_xxx(ucounts))

extern struct user_namespace init_user_ns;
#ifdef CONFIG_USER_NS
diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index a8ec3b6093fc..d33d867ad6c1 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -12,7 +12,6 @@
*/
struct user_struct {
refcount_t __count; /* reference count */
- atomic_t processes; /* How many processes does this user have? */
atomic_t sigpending; /* How many pending signals does this user have? */
#ifdef CONFIG_FANOTIFY
atomic_t fanotify_listeners;
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index f71b5a4a3e74..0a27cd049404 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -50,9 +50,12 @@ enum ucount_type {
UCOUNT_INOTIFY_INSTANCES,
UCOUNT_INOTIFY_WATCHES,
#endif
+ UCOUNT_RLIMIT_NPROC,
UCOUNT_COUNTS,
};

+#define MAX_PER_NAMESPACE_UCOUNTS UCOUNT_RLIMIT_NPROC
+
struct user_namespace {
struct uid_gid_map uid_map;
struct uid_gid_map gid_map;
@@ -107,6 +110,16 @@ struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid);
struct ucounts *get_ucounts(struct ucounts *ucounts);
void put_ucounts(struct ucounts *ucounts);

+static inline long get_ucounts_value(struct ucounts *ucounts, enum ucount_type type)
+{
+ return atomic_long_read(&ucounts->ucount[type]);
+}
+
+bool inc_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v);
+bool inc_rlimit_ucounts_and_test(struct ucounts *ucounts, enum ucount_type type, long v, long max);
+void dec_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v);
+bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, long max);
+
#ifdef CONFIG_USER_NS

static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
diff --git a/kernel/cred.c b/kernel/cred.c
index 58a8a9e24347..dcfa30b337c5 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -360,7 +360,7 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
kdebug("share_creds(%p{%d,%d})",
p->cred, atomic_read(&p->cred->usage),
read_cred_subscribers(p->cred));
- atomic_inc(&p->cred->user->processes);
+ inc_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);
return 0;
}

@@ -395,8 +395,8 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
}
#endif

- atomic_inc(&new->user->processes);
p->cred = p->real_cred = get_cred(new);
+ inc_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);
alter_cred_subscribers(new, 2);
validate_creds(new);
return 0;
@@ -496,12 +496,12 @@ int commit_creds(struct cred *new)
* in set_user().
*/
alter_cred_subscribers(new, 2);
- if (new->user != old->user)
- atomic_inc(&new->user->processes);
+ if (new->user != old->user || new->user_ns != old->user_ns)
+ inc_rlimit_ucounts(new->ucounts, UCOUNT_RLIMIT_NPROC, 1);
rcu_assign_pointer(task->real_cred, new);
rcu_assign_pointer(task->cred, new);
if (new->user != old->user)
- atomic_dec(&old->user->processes);
+ dec_rlimit_ucounts(old->ucounts, UCOUNT_RLIMIT_NPROC, 1);
alter_cred_subscribers(old, -2);

/* send notifications */
diff --git a/kernel/exit.c b/kernel/exit.c
index 04029e35e69a..61c0fe902b50 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -188,7 +188,7 @@ void release_task(struct task_struct *p)
/* don't need to get the RCU readlock here - the process is dead and
* can't be modifying its own credentials. But shut RCU-lockdep up */
rcu_read_lock();
- atomic_dec(&__task_cred(p)->user->processes);
+ dec_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);
rcu_read_unlock();

cgroup_release(p);
diff --git a/kernel/fork.c b/kernel/fork.c
index 40a5da7d3d70..812b023ecdce 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -819,9 +819,11 @@ void __init fork_init(void)
init_task.signal->rlim[RLIMIT_SIGPENDING] =
init_task.signal->rlim[RLIMIT_NPROC];

- for (i = 0; i < UCOUNT_COUNTS; i++)
+ for (i = 0; i < MAX_PER_NAMESPACE_UCOUNTS; i++)
init_user_ns.ucount_max[i] = max_threads/2;

+ init_user_ns.ucount_max[UCOUNT_RLIMIT_NPROC] = task_rlimit(&init_task, RLIMIT_NPROC);
+
#ifdef CONFIG_VMAP_STACK
cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "fork:vm_stack_cache",
NULL, free_vm_stack_cache);
@@ -1962,8 +1964,7 @@ static __latent_entropy struct task_struct *copy_process(
DEBUG_LOCKS_WARN_ON(!p->softirqs_enabled);
#endif
retval = -EAGAIN;
- if (atomic_read(&p->real_cred->user->processes) >=
- task_rlimit(p, RLIMIT_NPROC)) {
+ if (is_ucounts_overlimit(task_ucounts(p), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) {
if (p->real_cred->user != INIT_USER &&
!capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
goto bad_fork_free;
@@ -2366,7 +2367,7 @@ static __latent_entropy struct task_struct *copy_process(
#endif
delayacct_tsk_free(p);
bad_fork_cleanup_count:
- atomic_dec(&p->cred->user->processes);
+ dec_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);
exit_creds(p);
bad_fork_free:
p->state = TASK_DEAD;
diff --git a/kernel/sys.c b/kernel/sys.c
index 373def7debe8..304b6b5e5942 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -474,7 +474,7 @@ static int set_user(struct cred *new)
* for programs doing set*uid()+execve() by harmlessly deferring the
* failure to the execve() stage.
*/
- if (atomic_read(&new_user->processes) >= rlimit(RLIMIT_NPROC) &&
+ if (is_ucounts_overlimit(new->ucounts, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)) &&
new_user != INIT_USER)
current->flags |= PF_NPROC_EXCEEDED;
else
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 50cc1dfb7d28..2f42d2ee6e27 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -7,6 +7,7 @@
#include <linux/hash.h>
#include <linux/kmemleak.h>
#include <linux/user_namespace.h>
+#include <linux/security.h>

struct ucounts init_ucounts = {
.ns = &init_user_ns,
@@ -80,6 +81,7 @@ static struct ctl_table user_table[] = {
UCOUNT_ENTRY("max_inotify_instances"),
UCOUNT_ENTRY("max_inotify_watches"),
#endif
+ { },
{ }
};
#endif /* CONFIG_SYSCTL */
@@ -222,6 +224,19 @@ static inline bool atomic_long_inc_below(atomic_long_t *v, int u)
}
}

+static inline long atomic_long_dec_value(atomic_long_t *v, long n)
+{
+ long c, old;
+ c = atomic_long_read(v);
+ for (;;) {
+ old = atomic_long_cmpxchg(v, c, c - n);
+ if (likely(old == c))
+ return c;
+ c = old;
+ }
+ return c;
+}
+
struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid,
enum ucount_type type)
{
@@ -255,6 +270,51 @@ void dec_ucount(struct ucounts *ucounts, enum ucount_type type)
put_ucounts(ucounts);
}

+bool inc_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v)
+{
+ struct ucounts *iter;
+ bool overlimit = false;
+
+ for (iter = ucounts; iter; iter = iter->ns->ucounts) {
+ long max = READ_ONCE(iter->ns->ucount_max[type]);
+ if (atomic_long_add_return(v, &iter->ucount[type]) > max)
+ overlimit = true;
+ }
+
+ return overlimit;
+}
+
+bool inc_rlimit_ucounts_and_test(struct ucounts *ucounts, enum ucount_type type,
+ long v, long max)
+{
+ bool overlimit = inc_rlimit_ucounts(ucounts, type, v);
+ if (!overlimit && get_ucounts_value(ucounts, type) > max)
+ overlimit = true;
+ return overlimit;
+}
+
+void dec_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v)
+{
+ struct ucounts *iter;
+ for (iter = ucounts; iter; iter = iter->ns->ucounts) {
+ long dec = atomic_long_dec_value(&iter->ucount[type], v);
+ WARN_ON_ONCE(dec < 0);
+ }
+}
+
+bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, long max)
+{
+ struct ucounts *iter;
+ if (get_ucounts_value(ucounts, type) > max)
+ return true;
+ for (iter = ucounts; iter; iter = iter->ns->ucounts) {
+ max = READ_ONCE(iter->ns->ucount_max[type]);
+ if (get_ucounts_value(iter, type) > max)
+ return true;
+ }
+ return false;
+}
+
static __init int user_namespace_sysctl_init(void)
{
#ifdef CONFIG_SYSCTL
@@ -271,6 +331,7 @@ static __init int user_namespace_sysctl_init(void)
BUG_ON(!setup_userns_sysctls(&init_user_ns));
#endif
hlist_add_ucounts(&init_ucounts);
+ inc_rlimit_ucounts(&init_ucounts, UCOUNT_RLIMIT_NPROC, 1);
return 0;
}
subsys_initcall(user_namespace_sysctl_init);
diff --git a/kernel/user.c b/kernel/user.c
index a2478cddf536..7f5ff498207a 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -98,7 +98,6 @@ static DEFINE_SPINLOCK(uidhash_lock);
/* root_user.__count is 1, for init task cred */
struct user_struct root_user = {
.__count = REFCOUNT_INIT(1),
- .processes = ATOMIC_INIT(1),
.sigpending = ATOMIC_INIT(0),
.locked_shm = 0,
.uid = GLOBAL_ROOT_UID,
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 516db53166ab..2434b13b02e5 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -118,9 +118,10 @@ int create_user_ns(struct cred *new)
ns->owner = owner;
ns->group = group;
INIT_WORK(&ns->work, free_user_ns);
- for (i = 0; i < UCOUNT_COUNTS; i++) {
+ for (i = 0; i < MAX_PER_NAMESPACE_UCOUNTS; i++) {
ns->ucount_max[i] = INT_MAX;
}
+ ns->ucount_max[UCOUNT_RLIMIT_NPROC] = rlimit(RLIMIT_NPROC);
ns->ucounts = ucounts;

/* Inherit USERNS_SETGROUPS_ALLOWED from our parent */
--
2.29.2

2021-02-15 12:46:32

by Alexey Gladkov

[permalink] [raw]
Subject: [PATCH v6 7/7] kselftests: Add test to check for rlimit changes in different user namespaces

The testcase runs few instances of the program with RLIMIT_NPROC=1 from
user uid=60000, in different user namespaces.

Signed-off-by: Alexey Gladkov <[email protected]>
---
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/rlimits/.gitignore | 2 +
tools/testing/selftests/rlimits/Makefile | 6 +
tools/testing/selftests/rlimits/config | 1 +
.../selftests/rlimits/rlimits-per-userns.c | 161 ++++++++++++++++++
5 files changed, 171 insertions(+)
create mode 100644 tools/testing/selftests/rlimits/.gitignore
create mode 100644 tools/testing/selftests/rlimits/Makefile
create mode 100644 tools/testing/selftests/rlimits/config
create mode 100644 tools/testing/selftests/rlimits/rlimits-per-userns.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 8a917cb4426a..a6d3fde4a617 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -46,6 +46,7 @@ TARGETS += proc
TARGETS += pstore
TARGETS += ptrace
TARGETS += openat2
+TARGETS += rlimits
TARGETS += rseq
TARGETS += rtc
TARGETS += seccomp
diff --git a/tools/testing/selftests/rlimits/.gitignore b/tools/testing/selftests/rlimits/.gitignore
new file mode 100644
index 000000000000..091021f255b3
--- /dev/null
+++ b/tools/testing/selftests/rlimits/.gitignore
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+rlimits-per-userns
diff --git a/tools/testing/selftests/rlimits/Makefile b/tools/testing/selftests/rlimits/Makefile
new file mode 100644
index 000000000000..03aadb406212
--- /dev/null
+++ b/tools/testing/selftests/rlimits/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+
+CFLAGS += -Wall -O2 -g
+TEST_GEN_PROGS := rlimits-per-userns
+
+include ../lib.mk
diff --git a/tools/testing/selftests/rlimits/config b/tools/testing/selftests/rlimits/config
new file mode 100644
index 000000000000..416bd53ce982
--- /dev/null
+++ b/tools/testing/selftests/rlimits/config
@@ -0,0 +1 @@
+CONFIG_USER_NS=y
diff --git a/tools/testing/selftests/rlimits/rlimits-per-userns.c b/tools/testing/selftests/rlimits/rlimits-per-userns.c
new file mode 100644
index 000000000000..26dc949e93ea
--- /dev/null
+++ b/tools/testing/selftests/rlimits/rlimits-per-userns.c
@@ -0,0 +1,161 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Alexey Gladkov <[email protected]>
+ */
+#define _GNU_SOURCE
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <sys/time.h>
+#include <sys/resource.h>
+#include <sys/prctl.h>
+#include <sys/stat.h>
+
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <sched.h>
+#include <signal.h>
+#include <limits.h>
+#include <fcntl.h>
+#include <errno.h>
+#include <err.h>
+
+#define NR_CHILDS 2
+
+static char *service_prog;
+static uid_t user = 60000;
+static uid_t group = 60000;
+
+static void setrlimit_nproc(rlim_t n)
+{
+ pid_t pid = getpid();
+ struct rlimit limit = {
+ .rlim_cur = n,
+ .rlim_max = n
+ };
+
+ warnx("(pid=%d): Setting RLIMIT_NPROC=%ld", pid, n);
+
+ if (setrlimit(RLIMIT_NPROC, &limit) < 0)
+ err(EXIT_FAILURE, "(pid=%d): setrlimit(RLIMIT_NPROC)", pid);
+}
+
+static pid_t fork_child(void)
+{
+ pid_t pid = fork();
+
+ if (pid < 0)
+ err(EXIT_FAILURE, "fork");
+
+ if (pid > 0)
+ return pid;
+
+ pid = getpid();
+
+ warnx("(pid=%d): New process starting ...", pid);
+
+ if (prctl(PR_SET_PDEATHSIG, SIGKILL) < 0)
+ err(EXIT_FAILURE, "(pid=%d): prctl(PR_SET_PDEATHSIG)", pid);
+
+ signal(SIGUSR1, SIG_DFL);
+
+ warnx("(pid=%d): Changing to uid=%d, gid=%d", pid, user, group);
+
+ if (setgid(group) < 0)
+ err(EXIT_FAILURE, "(pid=%d): setgid(%d)", pid, group);
+ if (setuid(user) < 0)
+ err(EXIT_FAILURE, "(pid=%d): setuid(%d)", pid, user);
+
+ warnx("(pid=%d): Service running ...", pid);
+
+ warnx("(pid=%d): Unshare user namespace", pid);
+ if (unshare(CLONE_NEWUSER) < 0)
+ err(EXIT_FAILURE, "unshare(CLONE_NEWUSER)");
+
+ char *const argv[] = { "service", NULL };
+ char *const envp[] = { "I_AM_SERVICE=1", NULL };
+
+ warnx("(pid=%d): Executing real service ...", pid);
+
+ execve(service_prog, argv, envp);
+ err(EXIT_FAILURE, "(pid=%d): execve", pid);
+}
+
+int main(int argc, char **argv)
+{
+ size_t i;
+ pid_t child[NR_CHILDS];
+ int wstatus[NR_CHILDS];
+ int childs = NR_CHILDS;
+ pid_t pid;
+
+ if (getenv("I_AM_SERVICE")) {
+ pause();
+ exit(EXIT_SUCCESS);
+ }
+
+ service_prog = argv[0];
+ pid = getpid();
+
+ warnx("(pid=%d) Starting testcase", pid);
+
+ /*
+ * This rlimit is not a problem for root because it can be exceeded.
+ */
+ setrlimit_nproc(1);
+
+ for (i = 0; i < NR_CHILDS; i++) {
+ child[i] = fork_child();
+ wstatus[i] = 0;
+ usleep(250000);
+ }
+
+ while (1) {
+ for (i = 0; i < NR_CHILDS; i++) {
+ if (child[i] <= 0)
+ continue;
+
+ errno = 0;
+ pid_t ret = waitpid(child[i], &wstatus[i], WNOHANG);
+
+ if (!ret || (!WIFEXITED(wstatus[i]) && !WIFSIGNALED(wstatus[i])))
+ continue;
+
+ if (ret < 0 && errno != ECHILD)
+ warn("(pid=%d): waitpid(%d)", pid, child[i]);
+
+ child[i] *= -1;
+ childs -= 1;
+ }
+
+ if (!childs)
+ break;
+
+ usleep(250000);
+
+ for (i = 0; i < NR_CHILDS; i++) {
+ if (child[i] <= 0)
+ continue;
+ kill(child[i], SIGUSR1);
+ }
+ }
+
+ for (i = 0; i < NR_CHILDS; i++) {
+ if (WIFEXITED(wstatus[i]))
+ warnx("(pid=%d): pid %d exited, status=%d",
+ pid, -child[i], WEXITSTATUS(wstatus[i]));
+ else if (WIFSIGNALED(wstatus[i]))
+ warnx("(pid=%d): pid %d killed by signal %d",
+ pid, -child[i], WTERMSIG(wstatus[i]));
+
+ if (WIFSIGNALED(wstatus[i]) && WTERMSIG(wstatus[i]) == SIGUSR1)
+ continue;
+
+ warnx("(pid=%d): Test failed", pid);
+ exit(EXIT_FAILURE);
+ }
+
+ warnx("(pid=%d): Test passed", pid);
+ exit(EXIT_SUCCESS);
+}
--
2.29.2

2021-02-15 12:46:38

by Alexey Gladkov

[permalink] [raw]
Subject: [PATCH v6 6/7] Reimplement RLIMIT_MEMLOCK on top of ucounts

The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by the user. However, the value of the previous
user_namespaces cannot be exceeded.

Changelog

v6:
* Fix bug in hugetlb_file_setup() detected by trinity.

Reported-by: kernel test robot <[email protected]>
Signed-off-by: Alexey Gladkov <[email protected]>
---
fs/hugetlbfs/inode.c | 16 ++++++++--------
include/linux/hugetlb.h | 4 ++--
include/linux/mm.h | 4 ++--
include/linux/sched/user.h | 1 -
include/linux/shmem_fs.h | 2 +-
include/linux/user_namespace.h | 1 +
ipc/shm.c | 30 +++++++++++++++--------------
kernel/fork.c | 1 +
kernel/ucount.c | 1 +
kernel/user.c | 1 -
kernel/user_namespace.c | 1 +
mm/memfd.c | 5 ++---
mm/mlock.c | 35 +++++++++++++---------------------
mm/mmap.c | 4 ++--
mm/shmem.c | 8 ++++----
15 files changed, 54 insertions(+), 60 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 21c20fd5f9ee..a8757e39cefa 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -1452,7 +1452,7 @@ static int get_hstate_idx(int page_size_log)
* otherwise hugetlb_reserve_pages reserves one less hugepages than intended.
*/
struct file *hugetlb_file_setup(const char *name, size_t size,
- vm_flags_t acctflag, struct user_struct **user,
+ vm_flags_t acctflag, const struct cred **cred,
int creat_flags, int page_size_log)
{
struct inode *inode;
@@ -1464,20 +1464,20 @@ struct file *hugetlb_file_setup(const char *name, size_t size,
if (hstate_idx < 0)
return ERR_PTR(-ENODEV);

- *user = NULL;
+ *cred = NULL;
mnt = hugetlbfs_vfsmount[hstate_idx];
if (!mnt)
return ERR_PTR(-ENOENT);

if (creat_flags == HUGETLB_SHMFS_INODE && !can_do_hugetlb_shm()) {
- *user = current_user();
- if (user_shm_lock(size, *user)) {
+ *cred = current_cred();
+ if (user_shm_lock(size, *cred)) {
task_lock(current);
pr_warn_once("%s (%d): Using mlock ulimits for SHM_HUGETLB is deprecated\n",
current->comm, current->pid);
task_unlock(current);
} else {
- *user = NULL;
+ *cred = NULL;
return ERR_PTR(-EPERM);
}
}
@@ -1504,9 +1504,9 @@ struct file *hugetlb_file_setup(const char *name, size_t size,

iput(inode);
out:
- if (*user) {
- user_shm_unlock(size, *user);
- *user = NULL;
+ if (*cred) {
+ user_shm_unlock(size, *cred);
+ *cred = NULL;
}
return file;
}
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index b5807f23caf8..de5ce8a11b5e 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -434,7 +434,7 @@ static inline struct hugetlbfs_inode_info *HUGETLBFS_I(struct inode *inode)
extern const struct file_operations hugetlbfs_file_operations;
extern const struct vm_operations_struct hugetlb_vm_ops;
struct file *hugetlb_file_setup(const char *name, size_t size, vm_flags_t acct,
- struct user_struct **user, int creat_flags,
+ const struct cred **cred, int creat_flags,
int page_size_log);

static inline bool is_file_hugepages(struct file *file)
@@ -454,7 +454,7 @@ static inline struct hstate *hstate_inode(struct inode *i)
#define is_file_hugepages(file) false
static inline struct file *
hugetlb_file_setup(const char *name, size_t size, vm_flags_t acctflag,
- struct user_struct **user, int creat_flags,
+ struct cred **cred, int creat_flags,
int page_size_log)
{
return ERR_PTR(-ENOSYS);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..30a37aef1ab9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1628,8 +1628,8 @@ extern bool can_do_mlock(void);
#else
static inline bool can_do_mlock(void) { return false; }
#endif
-extern int user_shm_lock(size_t, struct user_struct *);
-extern void user_shm_unlock(size_t, struct user_struct *);
+extern int user_shm_lock(size_t, const struct cred *);
+extern void user_shm_unlock(size_t, const struct cred *);

/*
* Parameter block passed down to zap_pte_range in exceptional cases.
diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index 8ba9cec4fb99..82bd2532da6b 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -18,7 +18,6 @@ struct user_struct {
#ifdef CONFIG_EPOLL
atomic_long_t epoll_watches; /* The number of file descriptors currently watched */
#endif
- unsigned long locked_shm; /* How many pages of mlocked shm ? */
unsigned long unix_inflight; /* How many files in flight in unix sockets */
atomic_long_t pipe_bufs; /* how many pages are allocated in pipe buffers */

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index d82b6f396588..10f50b1c4e0e 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -65,7 +65,7 @@ extern struct file *shmem_file_setup_with_mnt(struct vfsmount *mnt,
extern int shmem_zero_setup(struct vm_area_struct *);
extern unsigned long shmem_get_unmapped_area(struct file *, unsigned long addr,
unsigned long len, unsigned long pgoff, unsigned long flags);
-extern int shmem_lock(struct file *file, int lock, struct user_struct *user);
+extern int shmem_lock(struct file *file, int lock, const struct cred *cred);
#ifdef CONFIG_SHMEM
extern const struct address_space_operations shmem_aops;
static inline bool shmem_mapping(struct address_space *mapping)
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index f84b68832c56..966b0d733bb8 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -53,6 +53,7 @@ enum ucount_type {
UCOUNT_RLIMIT_NPROC,
UCOUNT_RLIMIT_MSGQUEUE,
UCOUNT_RLIMIT_SIGPENDING,
+ UCOUNT_RLIMIT_MEMLOCK,
UCOUNT_COUNTS,
};

diff --git a/ipc/shm.c b/ipc/shm.c
index febd88daba8c..9b3fbf33a7b7 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -60,7 +60,7 @@ struct shmid_kernel /* private to the kernel */
time64_t shm_ctim;
struct pid *shm_cprid;
struct pid *shm_lprid;
- struct user_struct *mlock_user;
+ const struct cred *mlock_cred;

/* The task created the shm object. NULL if the task is dead. */
struct task_struct *shm_creator;
@@ -286,10 +286,10 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
shm_rmid(ns, shp);
shm_unlock(shp);
if (!is_file_hugepages(shm_file))
- shmem_lock(shm_file, 0, shp->mlock_user);
- else if (shp->mlock_user)
+ shmem_lock(shm_file, 0, shp->mlock_cred);
+ else if (shp->mlock_cred)
user_shm_unlock(i_size_read(file_inode(shm_file)),
- shp->mlock_user);
+ shp->mlock_cred);
fput(shm_file);
ipc_update_pid(&shp->shm_cprid, NULL);
ipc_update_pid(&shp->shm_lprid, NULL);
@@ -625,7 +625,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)

shp->shm_perm.key = key;
shp->shm_perm.mode = (shmflg & S_IRWXUGO);
- shp->mlock_user = NULL;
+ shp->mlock_cred = NULL;

shp->shm_perm.security = NULL;
error = security_shm_alloc(&shp->shm_perm);
@@ -650,7 +650,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
if (shmflg & SHM_NORESERVE)
acctflag = VM_NORESERVE;
file = hugetlb_file_setup(name, hugesize, acctflag,
- &shp->mlock_user, HUGETLB_SHMFS_INODE,
+ &shp->mlock_cred, HUGETLB_SHMFS_INODE,
(shmflg >> SHM_HUGE_SHIFT) & SHM_HUGE_MASK);
} else {
/*
@@ -663,8 +663,10 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
file = shmem_kernel_file_setup(name, size, acctflag);
}
error = PTR_ERR(file);
- if (IS_ERR(file))
+ if (IS_ERR(file)) {
+ shp->mlock_cred = NULL;
goto no_file;
+ }

shp->shm_cprid = get_pid(task_tgid(current));
shp->shm_lprid = NULL;
@@ -698,8 +700,8 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
no_id:
ipc_update_pid(&shp->shm_cprid, NULL);
ipc_update_pid(&shp->shm_lprid, NULL);
- if (is_file_hugepages(file) && shp->mlock_user)
- user_shm_unlock(size, shp->mlock_user);
+ if (is_file_hugepages(file) && shp->mlock_cred)
+ user_shm_unlock(size, shp->mlock_cred);
fput(file);
ipc_rcu_putref(&shp->shm_perm, shm_rcu_free);
return error;
@@ -1105,12 +1107,12 @@ static int shmctl_do_lock(struct ipc_namespace *ns, int shmid, int cmd)
goto out_unlock0;

if (cmd == SHM_LOCK) {
- struct user_struct *user = current_user();
+ const struct cred *cred = current_cred();

- err = shmem_lock(shm_file, 1, user);
+ err = shmem_lock(shm_file, 1, cred);
if (!err && !(shp->shm_perm.mode & SHM_LOCKED)) {
shp->shm_perm.mode |= SHM_LOCKED;
- shp->mlock_user = user;
+ shp->mlock_cred = cred;
}
goto out_unlock0;
}
@@ -1118,9 +1120,9 @@ static int shmctl_do_lock(struct ipc_namespace *ns, int shmid, int cmd)
/* SHM_UNLOCK */
if (!(shp->shm_perm.mode & SHM_LOCKED))
goto out_unlock0;
- shmem_lock(shm_file, 0, shp->mlock_user);
+ shmem_lock(shm_file, 0, shp->mlock_cred);
shp->shm_perm.mode &= ~SHM_LOCKED;
- shp->mlock_user = NULL;
+ shp->mlock_cred = NULL;
get_file(shm_file);
ipc_unlock_object(&shp->shm_perm);
rcu_read_unlock();
diff --git a/kernel/fork.c b/kernel/fork.c
index 99b10b9fe4b6..76ccb000856c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -825,6 +825,7 @@ void __init fork_init(void)
init_user_ns.ucount_max[UCOUNT_RLIMIT_NPROC] = task_rlimit(&init_task, RLIMIT_NPROC);
init_user_ns.ucount_max[UCOUNT_RLIMIT_MSGQUEUE] = task_rlimit(&init_task, RLIMIT_MSGQUEUE);
init_user_ns.ucount_max[UCOUNT_RLIMIT_SIGPENDING] = task_rlimit(&init_task, RLIMIT_SIGPENDING);
+ init_user_ns.ucount_max[UCOUNT_RLIMIT_MEMLOCK] = task_rlimit(&init_task, RLIMIT_MEMLOCK);

#ifdef CONFIG_VMAP_STACK
cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "fork:vm_stack_cache",
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 2ac969fba668..b6242b77eb89 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -84,6 +84,7 @@ static struct ctl_table user_table[] = {
{ },
{ },
{ },
+ { },
{ }
};
#endif /* CONFIG_SYSCTL */
diff --git a/kernel/user.c b/kernel/user.c
index 6737327f83be..c82399c1618a 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -98,7 +98,6 @@ static DEFINE_SPINLOCK(uidhash_lock);
/* root_user.__count is 1, for init task cred */
struct user_struct root_user = {
.__count = REFCOUNT_INIT(1),
- .locked_shm = 0,
.uid = GLOBAL_ROOT_UID,
.ratelimit = RATELIMIT_STATE_INIT(root_user.ratelimit, 0, 0),
};
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index df1bed32dd48..5ef0d4b182ba 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -124,6 +124,7 @@ int create_user_ns(struct cred *new)
ns->ucount_max[UCOUNT_RLIMIT_NPROC] = rlimit(RLIMIT_NPROC);
ns->ucount_max[UCOUNT_RLIMIT_MSGQUEUE] = rlimit(RLIMIT_MSGQUEUE);
ns->ucount_max[UCOUNT_RLIMIT_SIGPENDING] = rlimit(RLIMIT_SIGPENDING);
+ ns->ucount_max[UCOUNT_RLIMIT_MEMLOCK] = rlimit(RLIMIT_MEMLOCK);
ns->ucounts = ucounts;

/* Inherit USERNS_SETGROUPS_ALLOWED from our parent */
diff --git a/mm/memfd.c b/mm/memfd.c
index 2647c898990c..473515a74b99 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -297,9 +297,8 @@ SYSCALL_DEFINE2(memfd_create,
}

if (flags & MFD_HUGETLB) {
- struct user_struct *user = NULL;
-
- file = hugetlb_file_setup(name, 0, VM_NORESERVE, &user,
+ const struct cred *cred;
+ file = hugetlb_file_setup(name, 0, VM_NORESERVE, &cred,
HUGETLB_ANONHUGE_INODE,
(flags >> MFD_HUGE_SHIFT) &
MFD_HUGE_MASK);
diff --git a/mm/mlock.c b/mm/mlock.c
index 55b3b3672977..2d49d1afd7e0 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -812,15 +812,10 @@ SYSCALL_DEFINE0(munlockall)
return ret;
}

-/*
- * Objects with different lifetime than processes (SHM_LOCK and SHM_HUGETLB
- * shm segments) get accounted against the user_struct instead.
- */
-static DEFINE_SPINLOCK(shmlock_user_lock);
-
-int user_shm_lock(size_t size, struct user_struct *user)
+int user_shm_lock(size_t size, const struct cred *cred)
{
unsigned long lock_limit, locked;
+ bool overlimit;
int allowed = 0;

locked = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
@@ -828,22 +823,18 @@ int user_shm_lock(size_t size, struct user_struct *user)
if (lock_limit == RLIM_INFINITY)
allowed = 1;
lock_limit >>= PAGE_SHIFT;
- spin_lock(&shmlock_user_lock);
- if (!allowed &&
- locked + user->locked_shm > lock_limit && !capable(CAP_IPC_LOCK))
- goto out;
- get_uid(user);
- user->locked_shm += locked;
- allowed = 1;
-out:
- spin_unlock(&shmlock_user_lock);
- return allowed;
+
+ overlimit = inc_rlimit_ucounts_and_test(cred->ucounts, UCOUNT_RLIMIT_MEMLOCK,
+ locked, lock_limit);
+
+ if (!allowed && overlimit && !capable(CAP_IPC_LOCK)) {
+ dec_rlimit_ucounts(cred->ucounts, UCOUNT_RLIMIT_MEMLOCK, locked);
+ return 0;
+ }
+ return 1;
}

-void user_shm_unlock(size_t size, struct user_struct *user)
+void user_shm_unlock(size_t size, const struct cred *cred)
{
- spin_lock(&shmlock_user_lock);
- user->locked_shm -= (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
- spin_unlock(&shmlock_user_lock);
- free_uid(user);
+ dec_rlimit_ucounts(cred->ucounts, UCOUNT_RLIMIT_MEMLOCK, (size + PAGE_SIZE - 1) >> PAGE_SHIFT);
}
diff --git a/mm/mmap.c b/mm/mmap.c
index dc7206032387..76edf28344a4 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1607,7 +1607,7 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
goto out_fput;
}
} else if (flags & MAP_HUGETLB) {
- struct user_struct *user = NULL;
+ const struct cred *cred;
struct hstate *hs;

hs = hstate_sizelog((flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
@@ -1623,7 +1623,7 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
*/
file = hugetlb_file_setup(HUGETLB_ANON_FILE, len,
VM_NORESERVE,
- &user, HUGETLB_ANONHUGE_INODE,
+ &cred, HUGETLB_ANONHUGE_INODE,
(flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
if (IS_ERR(file))
return PTR_ERR(file);
diff --git a/mm/shmem.c b/mm/shmem.c
index 7c6b6d8f6c39..de9bf6866f51 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2225,7 +2225,7 @@ static struct mempolicy *shmem_get_policy(struct vm_area_struct *vma,
}
#endif

-int shmem_lock(struct file *file, int lock, struct user_struct *user)
+int shmem_lock(struct file *file, int lock, const struct cred *cred)
{
struct inode *inode = file_inode(file);
struct shmem_inode_info *info = SHMEM_I(inode);
@@ -2237,13 +2237,13 @@ int shmem_lock(struct file *file, int lock, struct user_struct *user)
* no serialization needed when called from shm_destroy().
*/
if (lock && !(info->flags & VM_LOCKED)) {
- if (!user_shm_lock(inode->i_size, user))
+ if (!user_shm_lock(inode->i_size, cred))
goto out_nomem;
info->flags |= VM_LOCKED;
mapping_set_unevictable(file->f_mapping);
}
- if (!lock && (info->flags & VM_LOCKED) && user) {
- user_shm_unlock(inode->i_size, user);
+ if (!lock && (info->flags & VM_LOCKED) && cred) {
+ user_shm_unlock(inode->i_size, cred);
info->flags &= ~VM_LOCKED;
mapping_clear_unevictable(file->f_mapping);
}
--
2.29.2

2021-02-15 15:18:13

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v6 6/7] Reimplement RLIMIT_MEMLOCK on top of ucounts

Hi Alexey,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on kselftest/next]
[also build test ERROR on linux/master linus/master v5.11 next-20210212]
[cannot apply to hnaz-linux-mm/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/Alexey-Gladkov/Count-rlimits-in-each-user-namespace/20210215-204524
base: https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git next
config: xtensa-common_defconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/0day-ci/linux/commit/f009495a8def89a71b9e0b9025a39379d6f9097d
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Alexey-Gladkov/Count-rlimits-in-each-user-namespace/20210215-204524
git checkout f009495a8def89a71b9e0b9025a39379d6f9097d
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=xtensa

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>):

ipc/shm.c: In function 'newseg':
>> ipc/shm.c:653:5: error: passing argument 4 of 'hugetlb_file_setup' from incompatible pointer type [-Werror=incompatible-pointer-types]
653 | &shp->mlock_cred, HUGETLB_SHMFS_INODE,
| ^~~~~~~~~~~~~~~~
| |
| const struct cred **
In file included from ipc/shm.c:30:
include/linux/hugetlb.h:457:17: note: expected 'struct cred **' but argument is of type 'const struct cred **'
457 | struct cred **cred, int creat_flags,
| ~~~~~~~~~~~~~~^~~~
cc1: some warnings being treated as errors


vim +/hugetlb_file_setup +653 ipc/shm.c

592
593 /**
594 * newseg - Create a new shared memory segment
595 * @ns: namespace
596 * @params: ptr to the structure that contains key, size and shmflg
597 *
598 * Called with shm_ids.rwsem held as a writer.
599 */
600 static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
601 {
602 key_t key = params->key;
603 int shmflg = params->flg;
604 size_t size = params->u.size;
605 int error;
606 struct shmid_kernel *shp;
607 size_t numpages = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
608 struct file *file;
609 char name[13];
610 vm_flags_t acctflag = 0;
611
612 if (size < SHMMIN || size > ns->shm_ctlmax)
613 return -EINVAL;
614
615 if (numpages << PAGE_SHIFT < size)
616 return -ENOSPC;
617
618 if (ns->shm_tot + numpages < ns->shm_tot ||
619 ns->shm_tot + numpages > ns->shm_ctlall)
620 return -ENOSPC;
621
622 shp = kvmalloc(sizeof(*shp), GFP_KERNEL);
623 if (unlikely(!shp))
624 return -ENOMEM;
625
626 shp->shm_perm.key = key;
627 shp->shm_perm.mode = (shmflg & S_IRWXUGO);
628 shp->mlock_cred = NULL;
629
630 shp->shm_perm.security = NULL;
631 error = security_shm_alloc(&shp->shm_perm);
632 if (error) {
633 kvfree(shp);
634 return error;
635 }
636
637 sprintf(name, "SYSV%08x", key);
638 if (shmflg & SHM_HUGETLB) {
639 struct hstate *hs;
640 size_t hugesize;
641
642 hs = hstate_sizelog((shmflg >> SHM_HUGE_SHIFT) & SHM_HUGE_MASK);
643 if (!hs) {
644 error = -EINVAL;
645 goto no_file;
646 }
647 hugesize = ALIGN(size, huge_page_size(hs));
648
649 /* hugetlb_file_setup applies strict accounting */
650 if (shmflg & SHM_NORESERVE)
651 acctflag = VM_NORESERVE;
652 file = hugetlb_file_setup(name, hugesize, acctflag,
> 653 &shp->mlock_cred, HUGETLB_SHMFS_INODE,
654 (shmflg >> SHM_HUGE_SHIFT) & SHM_HUGE_MASK);
655 } else {
656 /*
657 * Do not allow no accounting for OVERCOMMIT_NEVER, even
658 * if it's asked for.
659 */
660 if ((shmflg & SHM_NORESERVE) &&
661 sysctl_overcommit_memory != OVERCOMMIT_NEVER)
662 acctflag = VM_NORESERVE;
663 file = shmem_kernel_file_setup(name, size, acctflag);
664 }
665 error = PTR_ERR(file);
666 if (IS_ERR(file)) {
667 shp->mlock_cred = NULL;
668 goto no_file;
669 }
670
671 shp->shm_cprid = get_pid(task_tgid(current));
672 shp->shm_lprid = NULL;
673 shp->shm_atim = shp->shm_dtim = 0;
674 shp->shm_ctim = ktime_get_real_seconds();
675 shp->shm_segsz = size;
676 shp->shm_nattch = 0;
677 shp->shm_file = file;
678 shp->shm_creator = current;
679
680 /* ipc_addid() locks shp upon success. */
681 error = ipc_addid(&shm_ids(ns), &shp->shm_perm, ns->shm_ctlmni);
682 if (error < 0)
683 goto no_id;
684
685 list_add(&shp->shm_clist, &current->sysvshm.shm_clist);
686
687 /*
688 * shmid gets reported as "inode#" in /proc/pid/maps.
689 * proc-ps tools use this. Changing this will break them.
690 */
691 file_inode(file)->i_ino = shp->shm_perm.id;
692
693 ns->shm_tot += numpages;
694 error = shp->shm_perm.id;
695
696 ipc_unlock_object(&shp->shm_perm);
697 rcu_read_unlock();
698 return error;
699
700 no_id:
701 ipc_update_pid(&shp->shm_cprid, NULL);
702 ipc_update_pid(&shp->shm_lprid, NULL);
703 if (is_file_hugepages(file) && shp->mlock_cred)
704 user_shm_unlock(size, shp->mlock_cred);
705 fput(file);
706 ipc_rcu_putref(&shp->shm_perm, shm_rcu_free);
707 return error;
708 no_file:
709 call_rcu(&shp->shm_perm.rcu, shm_rcu_free);
710 return error;
711 }
712

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]


Attachments:
(No filename) (6.12 kB)
.config.gz (10.57 kB)
Download all attachments

2021-02-15 17:52:47

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v6 6/7] Reimplement RLIMIT_MEMLOCK on top of ucounts

Hi Alexey,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on kselftest/next]
[also build test ERROR on linux/master linus/master v5.11 next-20210212]
[cannot apply to hnaz-linux-mm/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/Alexey-Gladkov/Count-rlimits-in-each-user-namespace/20210215-204524
base: https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git next
config: openrisc-randconfig-r001-20210215 (attached as .config)
compiler: or1k-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/0day-ci/linux/commit/f009495a8def89a71b9e0b9025a39379d6f9097d
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Alexey-Gladkov/Count-rlimits-in-each-user-namespace/20210215-204524
git checkout f009495a8def89a71b9e0b9025a39379d6f9097d
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=openrisc

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>):

mm/mmap.c: In function 'ksys_mmap_pgoff':
>> mm/mmap.c:1626:5: error: passing argument 4 of 'hugetlb_file_setup' from incompatible pointer type [-Werror=incompatible-pointer-types]
1626 | &cred, HUGETLB_ANONHUGE_INODE,
| ^~~~~
| |
| const struct cred **
In file included from mm/mmap.c:28:
include/linux/hugetlb.h:457:17: note: expected 'struct cred **' but argument is of type 'const struct cred **'
457 | struct cred **cred, int creat_flags,
| ~~~~~~~~~~~~~~^~~~
cc1: some warnings being treated as errors
--
mm/memfd.c: In function '__do_sys_memfd_create':
>> mm/memfd.c:301:52: error: passing argument 4 of 'hugetlb_file_setup' from incompatible pointer type [-Werror=incompatible-pointer-types]
301 | file = hugetlb_file_setup(name, 0, VM_NORESERVE, &cred,
| ^~~~~
| |
| const struct cred **
In file included from mm/memfd.c:18:
include/linux/hugetlb.h:457:17: note: expected 'struct cred **' but argument is of type 'const struct cred **'
457 | struct cred **cred, int creat_flags,
| ~~~~~~~~~~~~~~^~~~
cc1: some warnings being treated as errors
--
ipc/shm.c: In function 'newseg':
>> ipc/shm.c:653:5: error: passing argument 4 of 'hugetlb_file_setup' from incompatible pointer type [-Werror=incompatible-pointer-types]
653 | &shp->mlock_cred, HUGETLB_SHMFS_INODE,
| ^~~~~~~~~~~~~~~~
| |
| const struct cred **
In file included from ipc/shm.c:30:
include/linux/hugetlb.h:457:17: note: expected 'struct cred **' but argument is of type 'const struct cred **'
457 | struct cred **cred, int creat_flags,
| ~~~~~~~~~~~~~~^~~~
cc1: some warnings being treated as errors


vim +/hugetlb_file_setup +1626 mm/mmap.c

1590
1591 unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
1592 unsigned long prot, unsigned long flags,
1593 unsigned long fd, unsigned long pgoff)
1594 {
1595 struct file *file = NULL;
1596 unsigned long retval;
1597
1598 if (!(flags & MAP_ANONYMOUS)) {
1599 audit_mmap_fd(fd, flags);
1600 file = fget(fd);
1601 if (!file)
1602 return -EBADF;
1603 if (is_file_hugepages(file)) {
1604 len = ALIGN(len, huge_page_size(hstate_file(file)));
1605 } else if (unlikely(flags & MAP_HUGETLB)) {
1606 retval = -EINVAL;
1607 goto out_fput;
1608 }
1609 } else if (flags & MAP_HUGETLB) {
1610 const struct cred *cred;
1611 struct hstate *hs;
1612
1613 hs = hstate_sizelog((flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
1614 if (!hs)
1615 return -EINVAL;
1616
1617 len = ALIGN(len, huge_page_size(hs));
1618 /*
1619 * VM_NORESERVE is used because the reservations will be
1620 * taken when vm_ops->mmap() is called
1621 * A dummy user value is used because we are not locking
1622 * memory so no accounting is necessary
1623 */
1624 file = hugetlb_file_setup(HUGETLB_ANON_FILE, len,
1625 VM_NORESERVE,
> 1626 &cred, HUGETLB_ANONHUGE_INODE,
1627 (flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
1628 if (IS_ERR(file))
1629 return PTR_ERR(file);
1630 }
1631
1632 flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE);
1633
1634 retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);
1635 out_fput:
1636 if (file)
1637 fput(file);
1638 return retval;
1639 }
1640

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]


Attachments:
(No filename) (5.27 kB)
.config.gz (22.33 kB)
Download all attachments

2021-02-16 11:18:17

by Alexey Gladkov

[permalink] [raw]
Subject: [PATCH v7 6/7] Reimplement RLIMIT_MEMLOCK on top of ucounts

The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by the user. However, the value of the previous
user_namespaces cannot be exceeded.

Changelog

v7:
* Fix hugetlb_file_setup() declaration if CONFIG_HUGETLBFS=n. A `const'
was missing from one of the arguments.

v6:
* Fix bug in hugetlb_file_setup() detected by trinity.

Reported-by: kernel test robot <[email protected]>
Signed-off-by: Alexey Gladkov <[email protected]>
---
fs/hugetlbfs/inode.c | 16 ++++++++--------
include/linux/hugetlb.h | 4 ++--
include/linux/mm.h | 4 ++--
include/linux/sched/user.h | 1 -
include/linux/shmem_fs.h | 2 +-
include/linux/user_namespace.h | 1 +
ipc/shm.c | 30 +++++++++++++++--------------
kernel/fork.c | 1 +
kernel/ucount.c | 1 +
kernel/user.c | 1 -
kernel/user_namespace.c | 1 +
mm/memfd.c | 5 ++---
mm/mlock.c | 35 +++++++++++++---------------------
mm/mmap.c | 4 ++--
mm/shmem.c | 8 ++++----
15 files changed, 54 insertions(+), 60 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 21c20fd5f9ee..a8757e39cefa 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -1452,7 +1452,7 @@ static int get_hstate_idx(int page_size_log)
* otherwise hugetlb_reserve_pages reserves one less hugepages than intended.
*/
struct file *hugetlb_file_setup(const char *name, size_t size,
- vm_flags_t acctflag, struct user_struct **user,
+ vm_flags_t acctflag, const struct cred **cred,
int creat_flags, int page_size_log)
{
struct inode *inode;
@@ -1464,20 +1464,20 @@ struct file *hugetlb_file_setup(const char *name, size_t size,
if (hstate_idx < 0)
return ERR_PTR(-ENODEV);

- *user = NULL;
+ *cred = NULL;
mnt = hugetlbfs_vfsmount[hstate_idx];
if (!mnt)
return ERR_PTR(-ENOENT);

if (creat_flags == HUGETLB_SHMFS_INODE && !can_do_hugetlb_shm()) {
- *user = current_user();
- if (user_shm_lock(size, *user)) {
+ *cred = current_cred();
+ if (user_shm_lock(size, *cred)) {
task_lock(current);
pr_warn_once("%s (%d): Using mlock ulimits for SHM_HUGETLB is deprecated\n",
current->comm, current->pid);
task_unlock(current);
} else {
- *user = NULL;
+ *cred = NULL;
return ERR_PTR(-EPERM);
}
}
@@ -1504,9 +1504,9 @@ struct file *hugetlb_file_setup(const char *name, size_t size,

iput(inode);
out:
- if (*user) {
- user_shm_unlock(size, *user);
- *user = NULL;
+ if (*cred) {
+ user_shm_unlock(size, *cred);
+ *cred = NULL;
}
return file;
}
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index b5807f23caf8..0a19897b773b 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -434,7 +434,7 @@ static inline struct hugetlbfs_inode_info *HUGETLBFS_I(struct inode *inode)
extern const struct file_operations hugetlbfs_file_operations;
extern const struct vm_operations_struct hugetlb_vm_ops;
struct file *hugetlb_file_setup(const char *name, size_t size, vm_flags_t acct,
- struct user_struct **user, int creat_flags,
+ const struct cred **cred, int creat_flags,
int page_size_log);

static inline bool is_file_hugepages(struct file *file)
@@ -454,7 +454,7 @@ static inline struct hstate *hstate_inode(struct inode *i)
#define is_file_hugepages(file) false
static inline struct file *
hugetlb_file_setup(const char *name, size_t size, vm_flags_t acctflag,
- struct user_struct **user, int creat_flags,
+ const struct cred **cred, int creat_flags,
int page_size_log)
{
return ERR_PTR(-ENOSYS);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..30a37aef1ab9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1628,8 +1628,8 @@ extern bool can_do_mlock(void);
#else
static inline bool can_do_mlock(void) { return false; }
#endif
-extern int user_shm_lock(size_t, struct user_struct *);
-extern void user_shm_unlock(size_t, struct user_struct *);
+extern int user_shm_lock(size_t, const struct cred *);
+extern void user_shm_unlock(size_t, const struct cred *);

/*
* Parameter block passed down to zap_pte_range in exceptional cases.
diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index 8ba9cec4fb99..82bd2532da6b 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -18,7 +18,6 @@ struct user_struct {
#ifdef CONFIG_EPOLL
atomic_long_t epoll_watches; /* The number of file descriptors currently watched */
#endif
- unsigned long locked_shm; /* How many pages of mlocked shm ? */
unsigned long unix_inflight; /* How many files in flight in unix sockets */
atomic_long_t pipe_bufs; /* how many pages are allocated in pipe buffers */

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index d82b6f396588..10f50b1c4e0e 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -65,7 +65,7 @@ extern struct file *shmem_file_setup_with_mnt(struct vfsmount *mnt,
extern int shmem_zero_setup(struct vm_area_struct *);
extern unsigned long shmem_get_unmapped_area(struct file *, unsigned long addr,
unsigned long len, unsigned long pgoff, unsigned long flags);
-extern int shmem_lock(struct file *file, int lock, struct user_struct *user);
+extern int shmem_lock(struct file *file, int lock, const struct cred *cred);
#ifdef CONFIG_SHMEM
extern const struct address_space_operations shmem_aops;
static inline bool shmem_mapping(struct address_space *mapping)
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index f84b68832c56..966b0d733bb8 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -53,6 +53,7 @@ enum ucount_type {
UCOUNT_RLIMIT_NPROC,
UCOUNT_RLIMIT_MSGQUEUE,
UCOUNT_RLIMIT_SIGPENDING,
+ UCOUNT_RLIMIT_MEMLOCK,
UCOUNT_COUNTS,
};

diff --git a/ipc/shm.c b/ipc/shm.c
index febd88daba8c..9b3fbf33a7b7 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -60,7 +60,7 @@ struct shmid_kernel /* private to the kernel */
time64_t shm_ctim;
struct pid *shm_cprid;
struct pid *shm_lprid;
- struct user_struct *mlock_user;
+ const struct cred *mlock_cred;

/* The task created the shm object. NULL if the task is dead. */
struct task_struct *shm_creator;
@@ -286,10 +286,10 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
shm_rmid(ns, shp);
shm_unlock(shp);
if (!is_file_hugepages(shm_file))
- shmem_lock(shm_file, 0, shp->mlock_user);
- else if (shp->mlock_user)
+ shmem_lock(shm_file, 0, shp->mlock_cred);
+ else if (shp->mlock_cred)
user_shm_unlock(i_size_read(file_inode(shm_file)),
- shp->mlock_user);
+ shp->mlock_cred);
fput(shm_file);
ipc_update_pid(&shp->shm_cprid, NULL);
ipc_update_pid(&shp->shm_lprid, NULL);
@@ -625,7 +625,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)

shp->shm_perm.key = key;
shp->shm_perm.mode = (shmflg & S_IRWXUGO);
- shp->mlock_user = NULL;
+ shp->mlock_cred = NULL;

shp->shm_perm.security = NULL;
error = security_shm_alloc(&shp->shm_perm);
@@ -650,7 +650,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
if (shmflg & SHM_NORESERVE)
acctflag = VM_NORESERVE;
file = hugetlb_file_setup(name, hugesize, acctflag,
- &shp->mlock_user, HUGETLB_SHMFS_INODE,
+ &shp->mlock_cred, HUGETLB_SHMFS_INODE,
(shmflg >> SHM_HUGE_SHIFT) & SHM_HUGE_MASK);
} else {
/*
@@ -663,8 +663,10 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
file = shmem_kernel_file_setup(name, size, acctflag);
}
error = PTR_ERR(file);
- if (IS_ERR(file))
+ if (IS_ERR(file)) {
+ shp->mlock_cred = NULL;
goto no_file;
+ }

shp->shm_cprid = get_pid(task_tgid(current));
shp->shm_lprid = NULL;
@@ -698,8 +700,8 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
no_id:
ipc_update_pid(&shp->shm_cprid, NULL);
ipc_update_pid(&shp->shm_lprid, NULL);
- if (is_file_hugepages(file) && shp->mlock_user)
- user_shm_unlock(size, shp->mlock_user);
+ if (is_file_hugepages(file) && shp->mlock_cred)
+ user_shm_unlock(size, shp->mlock_cred);
fput(file);
ipc_rcu_putref(&shp->shm_perm, shm_rcu_free);
return error;
@@ -1105,12 +1107,12 @@ static int shmctl_do_lock(struct ipc_namespace *ns, int shmid, int cmd)
goto out_unlock0;

if (cmd == SHM_LOCK) {
- struct user_struct *user = current_user();
+ const struct cred *cred = current_cred();

- err = shmem_lock(shm_file, 1, user);
+ err = shmem_lock(shm_file, 1, cred);
if (!err && !(shp->shm_perm.mode & SHM_LOCKED)) {
shp->shm_perm.mode |= SHM_LOCKED;
- shp->mlock_user = user;
+ shp->mlock_cred = cred;
}
goto out_unlock0;
}
@@ -1118,9 +1120,9 @@ static int shmctl_do_lock(struct ipc_namespace *ns, int shmid, int cmd)
/* SHM_UNLOCK */
if (!(shp->shm_perm.mode & SHM_LOCKED))
goto out_unlock0;
- shmem_lock(shm_file, 0, shp->mlock_user);
+ shmem_lock(shm_file, 0, shp->mlock_cred);
shp->shm_perm.mode &= ~SHM_LOCKED;
- shp->mlock_user = NULL;
+ shp->mlock_cred = NULL;
get_file(shm_file);
ipc_unlock_object(&shp->shm_perm);
rcu_read_unlock();
diff --git a/kernel/fork.c b/kernel/fork.c
index 99b10b9fe4b6..76ccb000856c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -825,6 +825,7 @@ void __init fork_init(void)
init_user_ns.ucount_max[UCOUNT_RLIMIT_NPROC] = task_rlimit(&init_task, RLIMIT_NPROC);
init_user_ns.ucount_max[UCOUNT_RLIMIT_MSGQUEUE] = task_rlimit(&init_task, RLIMIT_MSGQUEUE);
init_user_ns.ucount_max[UCOUNT_RLIMIT_SIGPENDING] = task_rlimit(&init_task, RLIMIT_SIGPENDING);
+ init_user_ns.ucount_max[UCOUNT_RLIMIT_MEMLOCK] = task_rlimit(&init_task, RLIMIT_MEMLOCK);

#ifdef CONFIG_VMAP_STACK
cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "fork:vm_stack_cache",
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 2ac969fba668..b6242b77eb89 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -84,6 +84,7 @@ static struct ctl_table user_table[] = {
{ },
{ },
{ },
+ { },
{ }
};
#endif /* CONFIG_SYSCTL */
diff --git a/kernel/user.c b/kernel/user.c
index 6737327f83be..c82399c1618a 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -98,7 +98,6 @@ static DEFINE_SPINLOCK(uidhash_lock);
/* root_user.__count is 1, for init task cred */
struct user_struct root_user = {
.__count = REFCOUNT_INIT(1),
- .locked_shm = 0,
.uid = GLOBAL_ROOT_UID,
.ratelimit = RATELIMIT_STATE_INIT(root_user.ratelimit, 0, 0),
};
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index df1bed32dd48..5ef0d4b182ba 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -124,6 +124,7 @@ int create_user_ns(struct cred *new)
ns->ucount_max[UCOUNT_RLIMIT_NPROC] = rlimit(RLIMIT_NPROC);
ns->ucount_max[UCOUNT_RLIMIT_MSGQUEUE] = rlimit(RLIMIT_MSGQUEUE);
ns->ucount_max[UCOUNT_RLIMIT_SIGPENDING] = rlimit(RLIMIT_SIGPENDING);
+ ns->ucount_max[UCOUNT_RLIMIT_MEMLOCK] = rlimit(RLIMIT_MEMLOCK);
ns->ucounts = ucounts;

/* Inherit USERNS_SETGROUPS_ALLOWED from our parent */
diff --git a/mm/memfd.c b/mm/memfd.c
index 2647c898990c..473515a74b99 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -297,9 +297,8 @@ SYSCALL_DEFINE2(memfd_create,
}

if (flags & MFD_HUGETLB) {
- struct user_struct *user = NULL;
-
- file = hugetlb_file_setup(name, 0, VM_NORESERVE, &user,
+ const struct cred *cred;
+ file = hugetlb_file_setup(name, 0, VM_NORESERVE, &cred,
HUGETLB_ANONHUGE_INODE,
(flags >> MFD_HUGE_SHIFT) &
MFD_HUGE_MASK);
diff --git a/mm/mlock.c b/mm/mlock.c
index 55b3b3672977..2d49d1afd7e0 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -812,15 +812,10 @@ SYSCALL_DEFINE0(munlockall)
return ret;
}

-/*
- * Objects with different lifetime than processes (SHM_LOCK and SHM_HUGETLB
- * shm segments) get accounted against the user_struct instead.
- */
-static DEFINE_SPINLOCK(shmlock_user_lock);
-
-int user_shm_lock(size_t size, struct user_struct *user)
+int user_shm_lock(size_t size, const struct cred *cred)
{
unsigned long lock_limit, locked;
+ bool overlimit;
int allowed = 0;

locked = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
@@ -828,22 +823,18 @@ int user_shm_lock(size_t size, struct user_struct *user)
if (lock_limit == RLIM_INFINITY)
allowed = 1;
lock_limit >>= PAGE_SHIFT;
- spin_lock(&shmlock_user_lock);
- if (!allowed &&
- locked + user->locked_shm > lock_limit && !capable(CAP_IPC_LOCK))
- goto out;
- get_uid(user);
- user->locked_shm += locked;
- allowed = 1;
-out:
- spin_unlock(&shmlock_user_lock);
- return allowed;
+
+ overlimit = inc_rlimit_ucounts_and_test(cred->ucounts, UCOUNT_RLIMIT_MEMLOCK,
+ locked, lock_limit);
+
+ if (!allowed && overlimit && !capable(CAP_IPC_LOCK)) {
+ dec_rlimit_ucounts(cred->ucounts, UCOUNT_RLIMIT_MEMLOCK, locked);
+ return 0;
+ }
+ return 1;
}

-void user_shm_unlock(size_t size, struct user_struct *user)
+void user_shm_unlock(size_t size, const struct cred *cred)
{
- spin_lock(&shmlock_user_lock);
- user->locked_shm -= (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
- spin_unlock(&shmlock_user_lock);
- free_uid(user);
+ dec_rlimit_ucounts(cred->ucounts, UCOUNT_RLIMIT_MEMLOCK, (size + PAGE_SIZE - 1) >> PAGE_SHIFT);
}
diff --git a/mm/mmap.c b/mm/mmap.c
index dc7206032387..76edf28344a4 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1607,7 +1607,7 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
goto out_fput;
}
} else if (flags & MAP_HUGETLB) {
- struct user_struct *user = NULL;
+ const struct cred *cred;
struct hstate *hs;

hs = hstate_sizelog((flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
@@ -1623,7 +1623,7 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
*/
file = hugetlb_file_setup(HUGETLB_ANON_FILE, len,
VM_NORESERVE,
- &user, HUGETLB_ANONHUGE_INODE,
+ &cred, HUGETLB_ANONHUGE_INODE,
(flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
if (IS_ERR(file))
return PTR_ERR(file);
diff --git a/mm/shmem.c b/mm/shmem.c
index 7c6b6d8f6c39..de9bf6866f51 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2225,7 +2225,7 @@ static struct mempolicy *shmem_get_policy(struct vm_area_struct *vma,
}
#endif

-int shmem_lock(struct file *file, int lock, struct user_struct *user)
+int shmem_lock(struct file *file, int lock, const struct cred *cred)
{
struct inode *inode = file_inode(file);
struct shmem_inode_info *info = SHMEM_I(inode);
@@ -2237,13 +2237,13 @@ int shmem_lock(struct file *file, int lock, struct user_struct *user)
* no serialization needed when called from shm_destroy().
*/
if (lock && !(info->flags & VM_LOCKED)) {
- if (!user_shm_lock(inode->i_size, user))
+ if (!user_shm_lock(inode->i_size, cred))
goto out_nomem;
info->flags |= VM_LOCKED;
mapping_set_unevictable(file->f_mapping);
}
- if (!lock && (info->flags & VM_LOCKED) && user) {
- user_shm_unlock(inode->i_size, user);
+ if (!lock && (info->flags & VM_LOCKED) && cred) {
+ user_shm_unlock(inode->i_size, cred);
info->flags &= ~VM_LOCKED;
mapping_clear_unevictable(file->f_mapping);
}
--
2.29.2

2021-02-17 15:57:02

by kernel test robot

[permalink] [raw]
Subject: f009495a8d: BUG:KASAN:use-after-free_in_user_shm_unlock


Greeting,

FYI, we noticed the following commit (built with gcc-9):

commit: f009495a8def89a71b9e0b9025a39379d6f9097d ("Reimplement RLIMIT_MEMLOCK on top of ucounts")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git Alexey-Gladkov/Count-rlimits-in-each-user-namespace/20210215-204524


in testcase: trinity
version: trinity-x86_64-4d2343bd-1_20210105
with following parameters:

runtime: 300s

test-description: Trinity is a linux system call fuzz tester.
test-url: http://codemonkey.org.uk/projects/trinity/


on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 8G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):


+---------------------------------------------+------------+------------+
| | ebc4144c8c | f009495a8d |
+---------------------------------------------+------------+------------+
| boot_successes | 12 | 3 |
| boot_failures | 0 | 9 |
| BUG:KASAN:use-after-free_in_user_shm_unlock | 0 | 9 |
+---------------------------------------------+------------+------------+


If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>


[ 379.451460] BUG: KASAN: use-after-free in user_shm_unlock (kbuild/src/consumer/mm/mlock.c:839)
[ 379.452995] Read of size 8 at addr ffff888117ff7e90 by task trinity-c2/3961
[ 379.454626]
[ 379.455018] CPU: 0 PID: 3961 Comm: trinity-c2 Tainted: G E 5.11.0-rc7-00017-gf009495a8def #1
[ 379.457212] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 379.459153] Call Trace:
[ 379.459777] print_address_description+0x18/0x26f
[ 379.461168] ? user_shm_unlock (kbuild/src/consumer/mm/mlock.c:839)
[ 379.462171] kasan_report (kbuild/src/consumer/mm/kasan/report.c:397 kbuild/src/consumer/mm/kasan/report.c:413)
[ 379.463132] ? user_shm_unlock (kbuild/src/consumer/mm/mlock.c:839)
[ 379.464053] user_shm_unlock (kbuild/src/consumer/mm/mlock.c:839)
[ 379.464986] shmem_lock (kbuild/src/consumer/mm/shmem.c:2247)
[ 379.465741] shmctl_do_lock (kbuild/src/consumer/ipc/shm.c:1124)
[ 379.466611] ksys_shmctl+0x19b/0x1e2
[ 379.467620] ? __x32_compat_sys_shmctl (kbuild/src/consumer/ipc/shm.c:1141)
[ 379.468612] ? lock_acquire (kbuild/src/consumer/kernel/locking/lockdep.c:437 kbuild/src/consumer/kernel/locking/lockdep.c:5444)
[ 379.469427] ? find_held_lock (kbuild/src/consumer/kernel/locking/lockdep.c:4956)
[ 379.470301] ? __context_tracking_exit (kbuild/src/consumer/kernel/context_tracking.c:161)
[ 379.471508] ? lock_downgrade (kbuild/src/consumer/kernel/locking/lockdep.c:5450)
[ 379.472561] ? kvm_clock_read (kbuild/src/consumer/arch/x86/include/asm/preempt.h:84 kbuild/src/consumer/arch/x86/kernel/kvmclock.c:90)
[ 379.473521] ? account_steal_time (kbuild/src/consumer/kernel/sched/cputime.c:212)
[ 379.474581] ? account_other_time (kbuild/src/consumer/kernel/sched/cputime.c:245 kbuild/src/consumer/kernel/sched/cputime.c:262)
[ 379.475544] ? mark_held_locks (kbuild/src/consumer/kernel/locking/lockdep.c:4000 (discriminator 1))
[ 379.476491] ? lockdep_hardirqs_on_prepare (kbuild/src/consumer/kernel/locking/lockdep.c:437 kbuild/src/consumer/kernel/locking/lockdep.c:4099)
[ 379.477743] do_syscall_64 (kbuild/src/consumer/arch/x86/entry/common.c:46)
[ 379.478611] entry_SYSCALL_64_after_hwframe (kbuild/src/consumer/arch/x86/entry/entry_64.S:127)
[ 379.479768] RIP: 0033:0x7f79708ebf59
[ 379.480640] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 07 6f 0c 00 f7 d8 64 89 01 48
All code
========
0: 00 c3 add %al,%bl
2: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
9: 00 00 00
c: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
11: 48 89 f8 mov %rdi,%rax
14: 48 89 f7 mov %rsi,%rdi
17: 48 89 d6 mov %rdx,%rsi
1a: 48 89 ca mov %rcx,%rdx
1d: 4d 89 c2 mov %r8,%r10
20: 4d 89 c8 mov %r9,%r8
23: 4c 8b 4c 24 08 mov 0x8(%rsp),%r9
28: 0f 05 syscall
2a:* 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax <-- trapping instruction
30: 73 01 jae 0x33
32: c3 retq
33: 48 8b 0d 07 6f 0c 00 mov 0xc6f07(%rip),%rcx # 0xc6f41
3a: f7 d8 neg %eax
3c: 64 89 01 mov %eax,%fs:(%rcx)
3f: 48 rex.W

Code starting with the faulting instruction
===========================================
0: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax
6: 73 01 jae 0x9
8: c3 retq
9: 48 8b 0d 07 6f 0c 00 mov 0xc6f07(%rip),%rcx # 0xc6f17
10: f7 d8 neg %eax
12: 64 89 01 mov %eax,%fs:(%rcx)
15: 48 rex.W
[ 379.484875] RSP: 002b:00007ffd0b8ac428 EFLAGS: 00000246 ORIG_RAX: 000000000000001f
[ 379.486602] RAX: ffffffffffffffda RBX: 000000000000001f RCX: 00007f79708ebf59
[ 379.488077] RDX: 0000000000000004 RSI: 000000000000000c RDI: 0000000000000000
[ 379.489493] RBP: 000000000000001f R08: 0000a7fc6cf3f14d R09: 0000000008000000
[ 379.491020] R10: ffffffffffffff71 R11: 0000000000000246 R12: 0000000000000002
[ 379.492661] R13: 00007f796f2bb058 R14: 00007f79707d46c0 R15: 00007f796f2bb000
[ 379.494454]
[ 379.494871] Allocated by task 0:
[ 379.495620] (stack is not available)
[ 379.496488]
[ 379.496893] Freed by task 10:
[ 379.497655] kasan_save_stack (kbuild/src/consumer/mm/kasan/common.c:38)
[ 379.498658] kasan_set_track (kbuild/src/consumer/mm/kasan/common.c:46)
[ 379.499609] kasan_set_free_info (kbuild/src/consumer/mm/kasan/generic.c:358)
[ 379.500681] ____kasan_slab_free (kbuild/src/consumer/mm/kasan/common.c:364)
[ 379.501725] slab_free_freelist_hook (kbuild/src/consumer/mm/slub.c:1580)
[ 379.502861] kmem_cache_free (kbuild/src/consumer/mm/slub.c:3143 kbuild/src/consumer/mm/slub.c:3159)
[ 379.503731] rcu_process_callbacks (kbuild/src/consumer/include/linux/rcupdate.h:264 kbuild/src/consumer/kernel/rcu/tiny.c:99 kbuild/src/consumer/kernel/rcu/tiny.c:130)
[ 379.504755] __do_softirq (kbuild/src/consumer/include/linux/instrumented.h:71 kbuild/src/consumer/include/asm-generic/atomic-instrumented.h:27 kbuild/src/consumer/include/linux/jump_label.h:254 kbuild/src/consumer/include/linux/jump_label.h:264 kbuild/src/consumer/include/trace/events/irq.h:142 kbuild/src/consumer/kernel/softirq.c:344)
[ 379.505618]
[ 379.505979] The buggy address belongs to the object at ffff888117ff7e00
[ 379.505979] which belongs to the cache cred_jar of size 176
[ 379.508744] The buggy address is located 144 bytes inside of
[ 379.508744] 176-byte region [ffff888117ff7e00, ffff888117ff7eb0)
[ 379.511290] The buggy address belongs to the page:
[ 379.512399] page:0000000097ece402 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x117ff7
[ 379.514652] flags: 0x8000000000000200(slab)
[ 379.515652] raw: 8000000000000200 dead000000000100 dead000000000122 ffff888100372a00
[ 379.517377] raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
[ 379.519257] page dumped because: kasan: bad access detected
[ 379.520478]
[ 379.520835] Memory state around the buggy address:
[ 379.521953] ffff888117ff7d80: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
[ 379.523570] ffff888117ff7e00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 379.525357] >ffff888117ff7e80: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
[ 379.527029] ^
[ 379.527887] ffff888117ff7f00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 379.529581] ffff888117ff7f80: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
[ 379.531334] ==================================================================
[ 379.533107] Disabling lock debugging due to kernel taint
[ 379.755941] [main] kernel became tainted! (8224/8192) Last seed was 782038633
[ 379.756009]
[ 379.773617] trinity: Detected kernel tainting. Last seed was 782038633
[ 379.773690]
[ 379.789324] [main] exit_reason=7, but 3 children still running.
[ 379.789394]
[ 381.812865] [main] Bailing main loop because kernel became tainted..
[ 381.812932]
[ 382.091273] [main] Ran 93208 syscalls. Successes: 23634 Failures: 67538
[ 382.091348]
[ 405.279282] /lkp/lkp/src/tests/trinity: 45: kill: No such process
[ 405.279354]
[ 405.298590]
[ 405.298646]
[ 405.656613] /usr/bin/wget -q --timeout=1800 --tries=1 --local-encoding=UTF-8 http://internal-lkp-server:80/~lkp/cgi-bin/lkp-jobfile-append-var?job_file=/lkp/jobs/scheduled/vm-snb-124/trinity-300s-debian-10.4-x86_64-20200603.cgz-f009495a8def89a71b9e0b9025a39379d6f9097d-20210217-33540-1tuu5rt-2.yaml&job_state=post_run -O /dev/null
[ 405.656700]
[ 407.339684] kill 377 vmstat --timestamp -n 10
[ 407.339744]
[ 407.453173] kill 375 dmesg --follow --decode
[ 407.453237]
[ 407.547712] wait for background processes: 379 meminfo
[ 407.547783]
[ 415.539948] sysrq: Emergency Sync
[ 415.540999] Emergency Sync complete
[ 415.544090] sysrq: Resetting

Kboot worker: lkp-worker31
Elapsed time: 420

kvm=(
qemu-system-x86_64
-enable-kvm
-cpu SandyBridge
-kernel $kernel
-initrd initrd-vm-snb-124.cgz
-m 8192
-smp 2
-device e1000,netdev=net0
-netdev user,id=net0,hostfwd=tcp::32032-:22
-boot order=nc
-no-reboot
-watchdog i6300esb
-watchdog-action debug
-rtc base=localtime
-serial stdio
-display none
-monitor null
)

append=(
ip=::::vm-snb-124::dhcp
root=/dev/ram0


To reproduce:

# build kernel
cd linux
cp config-5.11.0-rc7-00017-gf009495a8def .config
make HOSTCC=gcc-9 CC=gcc-9 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email



Thanks,
Oliver Sang


Attachments:
(No filename) (10.13 kB)
config-5.11.0-rc7-00017-gf009495a8def (153.64 kB)
job-script (4.35 kB)
dmesg.xz (34.55 kB)
trinity (144.24 kB)
Download all attachments

2021-02-21 22:21:51

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v6 0/7] Count rlimits in each user namespace

On Mon, Feb 15, 2021 at 4:42 AM Alexey Gladkov <[email protected]> wrote:
>
> These patches are for binding the rlimit counters to a user in user namespace.

So this is now version 6, but I think the kernel test robot keeps
complaining about them causing KASAN issues.

The complaints seem to change, so I'm hoping they get fixed, but it
does seem like every version there's a new one. Hmm?

Linus

2021-02-22 00:07:47

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH v6 3/7] Reimplement RLIMIT_NPROC on top of ucounts

On 2/15/21 5:41 AM, Alexey Gladkov wrote:
> diff --git a/fs/io-wq.c b/fs/io-wq.c
> index a564f36e260c..5b6940c90c61 100644
> --- a/fs/io-wq.c
> +++ b/fs/io-wq.c
> @@ -1090,10 +1091,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
> wqe->node = alloc_node;
> wqe->acct[IO_WQ_ACCT_BOUND].max_workers = bounded;
> atomic_set(&wqe->acct[IO_WQ_ACCT_BOUND].nr_running, 0);
> - if (wq->user) {
> - wqe->acct[IO_WQ_ACCT_UNBOUND].max_workers =
> - task_rlimit(current, RLIMIT_NPROC);
> - }
> + wqe->acct[IO_WQ_ACCT_UNBOUND].max_workers = task_rlimit(current, RLIMIT_NPROC);

This doesn't look like an equivalent transformation. But that may be
moot if we merge the io_uring-worker.v3 series, as then you would not
have to touch io-wq at all.

--
Jens Axboe

2021-02-22 10:17:49

by Alexey Gladkov

[permalink] [raw]
Subject: Re: [PATCH v6 3/7] Reimplement RLIMIT_NPROC on top of ucounts

On Sun, Feb 21, 2021 at 04:38:10PM -0700, Jens Axboe wrote:
> On 2/15/21 5:41 AM, Alexey Gladkov wrote:
> > diff --git a/fs/io-wq.c b/fs/io-wq.c
> > index a564f36e260c..5b6940c90c61 100644
> > --- a/fs/io-wq.c
> > +++ b/fs/io-wq.c
> > @@ -1090,10 +1091,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
> > wqe->node = alloc_node;
> > wqe->acct[IO_WQ_ACCT_BOUND].max_workers = bounded;
> > atomic_set(&wqe->acct[IO_WQ_ACCT_BOUND].nr_running, 0);
> > - if (wq->user) {
> > - wqe->acct[IO_WQ_ACCT_UNBOUND].max_workers =
> > - task_rlimit(current, RLIMIT_NPROC);
> > - }
> > + wqe->acct[IO_WQ_ACCT_UNBOUND].max_workers = task_rlimit(current, RLIMIT_NPROC);
>
> This doesn't look like an equivalent transformation. But that may be
> moot if we merge the io_uring-worker.v3 series, as then you would not
> have to touch io-wq at all.

In the current code the wq->user is always set to current_user():

io_uring_create [1]
`- io_sq_offload_create
`- io_init_wq_offload [2]
`-io_wq_create [3]

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/io_uring.c#n9752
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/io_uring.c#n8107
[3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/io-wq.c#n1070

So, specifying max_workers always happens.

--
Rgrds, legion

2021-02-22 10:31:58

by Alexey Gladkov

[permalink] [raw]
Subject: Re: [PATCH v6 0/7] Count rlimits in each user namespace

On Sun, Feb 21, 2021 at 02:20:00PM -0800, Linus Torvalds wrote:
> On Mon, Feb 15, 2021 at 4:42 AM Alexey Gladkov <[email protected]> wrote:
> >
> > These patches are for binding the rlimit counters to a user in user namespace.
>
> So this is now version 6, but I think the kernel test robot keeps
> complaining about them causing KASAN issues.
>
> The complaints seem to change, so I'm hoping they get fixed, but it
> does seem like every version there's a new one. Hmm?

First, KASAN found an unexpected bug in the second patch (Add a reference
to ucounts for each cred). Because I missed that creed_alloc_blank() is
used wider than I found.

Now KASAN has found problems in the RLIMIT_MEMLOCK which I believe I fixed
in v7.

--
Rgrds, legion

2021-02-22 14:26:08

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH v6 3/7] Reimplement RLIMIT_NPROC on top of ucounts

On 2/22/21 3:11 AM, Alexey Gladkov wrote:
> On Sun, Feb 21, 2021 at 04:38:10PM -0700, Jens Axboe wrote:
>> On 2/15/21 5:41 AM, Alexey Gladkov wrote:
>>> diff --git a/fs/io-wq.c b/fs/io-wq.c
>>> index a564f36e260c..5b6940c90c61 100644
>>> --- a/fs/io-wq.c
>>> +++ b/fs/io-wq.c
>>> @@ -1090,10 +1091,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
>>> wqe->node = alloc_node;
>>> wqe->acct[IO_WQ_ACCT_BOUND].max_workers = bounded;
>>> atomic_set(&wqe->acct[IO_WQ_ACCT_BOUND].nr_running, 0);
>>> - if (wq->user) {
>>> - wqe->acct[IO_WQ_ACCT_UNBOUND].max_workers =
>>> - task_rlimit(current, RLIMIT_NPROC);
>>> - }
>>> + wqe->acct[IO_WQ_ACCT_UNBOUND].max_workers = task_rlimit(current, RLIMIT_NPROC);
>>
>> This doesn't look like an equivalent transformation. But that may be
>> moot if we merge the io_uring-worker.v3 series, as then you would not
>> have to touch io-wq at all.
>
> In the current code the wq->user is always set to current_user():
>
> io_uring_create [1]
> `- io_sq_offload_create
> `- io_init_wq_offload [2]
> `-io_wq_create [3]

current vs other wasn't my concern, but we're always setting ->user so
the test was pointless. So looks fine to me.

--
Jens Axboe

2021-02-23 06:22:10

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v6 0/7] Count rlimits in each user namespace

Linus Torvalds <[email protected]> writes:

> On Mon, Feb 15, 2021 at 4:42 AM Alexey Gladkov <[email protected]> wrote:
>>
>> These patches are for binding the rlimit counters to a user in user namespace.
>
> So this is now version 6, but I think the kernel test robot keeps
> complaining about them causing KASAN issues.
>
> The complaints seem to change, so I'm hoping they get fixed, but it
> does seem like every version there's a new one. Hmm?

I have been keeping an eye on this as well, and yes the issues are
getting fixed.

My current plan is to aim at getting v7 rebased onto -rc1 into a branch.
Review the changes very closely. Get some performance testing and some
other testing against it. Then to get this code into linux-next.

If everything goes smoothly I will send you a pull request next merge
window. I have no intention of shipping this (or sending you a pull
request) before it is ready.

Eric