Preface
-------
These patches are for binding the rlimit counters to a user in user namespace.
This patch set can be applied on top of:
git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git v5.11
Problem
-------
The RLIMIT_NPROC, RLIMIT_MEMLOCK, RLIMIT_SIGPENDING, RLIMIT_MSGQUEUE rlimits
implementation places the counters in user_struct [1]. These limits are global
between processes and persists for the lifetime of the process, even if
processes are in different user namespaces.
To illustrate the impact of rlimits, let's say there is a program that does not
fork. Some service-A wants to run this program as user X in multiple containers.
Since the program never fork the service wants to set RLIMIT_NPROC=1.
service-A
\- program (uid=1000, container1, rlimit_nproc=1)
\- program (uid=1000, container2, rlimit_nproc=1)
The service-A sets RLIMIT_NPROC=1 and runs the program in container1. When the
service-A tries to run a program with RLIMIT_NPROC=1 in container2 it fails
since user X already has one running process.
The problem is not that the limit from container1 affects container2. The
problem is that limit is verified against the global counter that reflects
the number of processes in all containers.
This problem can be worked around by using different users for each container
but in this case we face a different problem of uid mapping when transferring
files from one container to another.
Eric W. Biederman mentioned this issue [2][3].
Introduced changes
------------------
To address the problem, we bind rlimit counters to user namespace. Each counter
reflects the number of processes in a given uid in a given user namespace. The
result is a tree of rlimit counters with the biggest value at the root (aka
init_user_ns). The limit is considered exceeded if it's exceeded up in the tree.
[1]: https://lore.kernel.org/containers/[email protected]/
[2]: https://lists.linuxfoundation.org/pipermail/containers/2020-August/042096.html
[3]: https://lists.linuxfoundation.org/pipermail/containers/2020-October/042524.html
Changelog
---------
v8:
* Used atomic_t for ucounts reference counting. Also added counter overflow
check (thanks to Linus Torvalds for the idea).
* Fixed other issues found by lkp-tests project in the patch that Reimplements
RLIMIT_MEMLOCK on top of ucounts.
v7:
* Fixed issues found by lkp-tests project in the patch that Reimplements
RLIMIT_MEMLOCK on top of ucounts.
v6:
* Fixed issues found by lkp-tests project.
* Rebased onto v5.11.
v5:
* Split the first commit into two commits: change ucounts.count type to atomic_long_t
and add ucounts to cred. These commits were merged by mistake during the rebase.
* The __get_ucounts() renamed to alloc_ucounts().
* The cred.ucounts update has been moved from commit_creds() as it did not allow
to handle errors.
* Added error handling of set_cred_ucounts().
v4:
* Reverted the type change of ucounts.count to refcount_t.
* Fixed typo in the kernel/cred.c
v3:
* Added get_ucounts() function to increase the reference count. The existing
get_counts() function renamed to __get_ucounts().
* The type of ucounts.count changed from atomic_t to refcount_t.
* Dropped 'const' from set_cred_ucounts() arguments.
* Fixed a bug with freeing the cred structure after calling cred_alloc_blank().
* Commit messages have been updated.
* Added selftest.
v2:
* RLIMIT_MEMLOCK, RLIMIT_SIGPENDING and RLIMIT_MSGQUEUE are migrated to ucounts.
* Added ucounts for pair uid and user namespace into cred.
* Added the ability to increase ucount by more than 1.
v1:
* After discussion with Eric W. Biederman, I increased the size of ucounts to
atomic_long_t.
* Added ucount_max to avoid the fork bomb.
--
Alexey Gladkov (8):
Increase size of ucounts to atomic_long_t
Add a reference to ucounts for each cred
Use atomic_t for ucounts reference counting
Reimplement RLIMIT_NPROC on top of ucounts
Reimplement RLIMIT_MSGQUEUE on top of ucounts
Reimplement RLIMIT_SIGPENDING on top of ucounts
Reimplement RLIMIT_MEMLOCK on top of ucounts
kselftests: Add test to check for rlimit changes in different user
namespaces
fs/exec.c | 6 +-
fs/hugetlbfs/inode.c | 16 +-
fs/io-wq.c | 22 ++-
fs/io-wq.h | 2 +-
fs/io_uring.c | 2 +-
fs/proc/array.c | 2 +-
include/linux/cred.h | 4 +
include/linux/hugetlb.h | 4 +-
include/linux/mm.h | 4 +-
include/linux/sched/user.h | 7 -
include/linux/shmem_fs.h | 2 +-
include/linux/signal_types.h | 4 +-
include/linux/user_namespace.h | 26 ++-
ipc/mqueue.c | 41 ++---
ipc/shm.c | 26 +--
kernel/cred.c | 50 +++++-
kernel/exit.c | 2 +-
kernel/fork.c | 18 +-
kernel/signal.c | 57 +++----
kernel/sys.c | 14 +-
kernel/ucount.c | 140 ++++++++++++---
kernel/user.c | 3 -
kernel/user_namespace.c | 9 +-
mm/memfd.c | 4 +-
mm/mlock.c | 23 ++-
mm/mmap.c | 4 +-
mm/shmem.c | 8 +-
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/rlimits/.gitignore | 2 +
tools/testing/selftests/rlimits/Makefile | 6 +
tools/testing/selftests/rlimits/config | 1 +
.../selftests/rlimits/rlimits-per-userns.c | 161 ++++++++++++++++++
32 files changed, 512 insertions(+), 159 deletions(-)
create mode 100644 tools/testing/selftests/rlimits/.gitignore
create mode 100644 tools/testing/selftests/rlimits/Makefile
create mode 100644 tools/testing/selftests/rlimits/config
create mode 100644 tools/testing/selftests/rlimits/rlimits-per-userns.c
--
2.29.2
The current implementation of the ucounts reference counter requires the
use of spin_lock. We're going to use get_ucounts() in more performance
critical areas like a handling of RLIMIT_SIGPENDING.
Now we need to use spin_lock only if we want to change the hashtable.
Signed-off-by: Alexey Gladkov <[email protected]>
---
include/linux/user_namespace.h | 4 +--
kernel/ucount.c | 60 +++++++++++++++-------------------
2 files changed, 28 insertions(+), 36 deletions(-)
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index f71b5a4a3e74..d84cc2c0b443 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -92,7 +92,7 @@ struct ucounts {
struct hlist_node node;
struct user_namespace *ns;
kuid_t uid;
- int count;
+ atomic_t count;
atomic_long_t ucount[UCOUNT_COUNTS];
};
@@ -104,7 +104,7 @@ void retire_userns_sysctls(struct user_namespace *ns);
struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid, enum ucount_type type);
void dec_ucount(struct ucounts *ucounts, enum ucount_type type);
struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid);
-struct ucounts *get_ucounts(struct ucounts *ucounts);
+struct ucounts * __must_check get_ucounts(struct ucounts *ucounts);
void put_ucounts(struct ucounts *ucounts);
#ifdef CONFIG_USER_NS
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 50cc1dfb7d28..bb3203039b5e 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -11,7 +11,7 @@
struct ucounts init_ucounts = {
.ns = &init_user_ns,
.uid = GLOBAL_ROOT_UID,
- .count = 1,
+ .count = ATOMIC_INIT(1),
};
#define UCOUNTS_HASHTABLE_BITS 10
@@ -139,6 +139,22 @@ static void hlist_add_ucounts(struct ucounts *ucounts)
spin_unlock_irq(&ucounts_lock);
}
+/* 127: arbitrary random number, small enough to assemble well */
+#define refcount_zero_or_close_to_overflow(ucounts) \
+ ((unsigned int) atomic_read(&ucounts->count) + 127u <= 127u)
+
+struct ucounts *get_ucounts(struct ucounts *ucounts)
+{
+ if (ucounts) {
+ if (refcount_zero_or_close_to_overflow(ucounts)) {
+ WARN_ONCE(1, "ucounts: counter has reached its maximum value");
+ return NULL;
+ }
+ atomic_inc(&ucounts->count);
+ }
+ return ucounts;
+}
+
struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid)
{
struct hlist_head *hashent = ucounts_hashentry(ns, uid);
@@ -155,7 +171,7 @@ struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid)
new->ns = ns;
new->uid = uid;
- new->count = 0;
+ atomic_set(&new->count, 1);
spin_lock_irq(&ucounts_lock);
ucounts = find_ucounts(ns, uid, hashent);
@@ -163,33 +179,12 @@ struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid)
kfree(new);
} else {
hlist_add_head(&new->node, hashent);
- ucounts = new;
+ spin_unlock_irq(&ucounts_lock);
+ return new;
}
}
- if (ucounts->count == INT_MAX)
- ucounts = NULL;
- else
- ucounts->count += 1;
spin_unlock_irq(&ucounts_lock);
- return ucounts;
-}
-
-struct ucounts *get_ucounts(struct ucounts *ucounts)
-{
- unsigned long flags;
-
- if (!ucounts)
- return NULL;
-
- spin_lock_irqsave(&ucounts_lock, flags);
- if (ucounts->count == INT_MAX) {
- WARN_ONCE(1, "ucounts: counter has reached its maximum value");
- ucounts = NULL;
- } else {
- ucounts->count += 1;
- }
- spin_unlock_irqrestore(&ucounts_lock, flags);
-
+ ucounts = get_ucounts(ucounts);
return ucounts;
}
@@ -197,15 +192,12 @@ void put_ucounts(struct ucounts *ucounts)
{
unsigned long flags;
- spin_lock_irqsave(&ucounts_lock, flags);
- ucounts->count -= 1;
- if (!ucounts->count)
+ if (atomic_dec_and_test(&ucounts->count)) {
+ spin_lock_irqsave(&ucounts_lock, flags);
hlist_del_init(&ucounts->node);
- else
- ucounts = NULL;
- spin_unlock_irqrestore(&ucounts_lock, flags);
-
- kfree(ucounts);
+ spin_unlock_irqrestore(&ucounts_lock, flags);
+ kfree(ucounts);
+ }
}
static inline bool atomic_long_inc_below(atomic_long_t *v, int u)
--
2.29.2
For RLIMIT_NPROC and some other rlimits the user_struct that holds the
global limit is kept alive for the lifetime of a process by keeping it
in struct cred. Adding a pointer to ucounts in the struct cred will
allow to track RLIMIT_NPROC not only for user in the system, but for
user in the user_namespace.
Updating ucounts may require memory allocation which may fail. So, we
cannot change cred.ucounts in the commit_creds() because this function
cannot fail and it should always return 0. For this reason, we modify
cred.ucounts before calling the commit_creds().
Changelog
v6:
* Fix null-ptr-deref in is_ucounts_overlimit() detected by trinity. This
error was caused by the fact that cred_alloc_blank() left the ucounts
pointer empty.
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Alexey Gladkov <[email protected]>
---
fs/exec.c | 4 ++++
include/linux/cred.h | 2 ++
include/linux/user_namespace.h | 4 ++++
kernel/cred.c | 40 ++++++++++++++++++++++++++++++++++
kernel/fork.c | 6 +++++
kernel/sys.c | 12 ++++++++++
kernel/ucount.c | 40 +++++++++++++++++++++++++++++++---
kernel/user_namespace.c | 3 +++
8 files changed, 108 insertions(+), 3 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 5d4d52039105..0371a3400be5 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1360,6 +1360,10 @@ int begin_new_exec(struct linux_binprm * bprm)
WRITE_ONCE(me->self_exec_id, me->self_exec_id + 1);
flush_signal_handlers(me, 0);
+ retval = set_cred_ucounts(bprm->cred);
+ if (retval < 0)
+ goto out_unlock;
+
/*
* install the new credentials for this executable
*/
diff --git a/include/linux/cred.h b/include/linux/cred.h
index 18639c069263..ad160e5fe5c6 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -144,6 +144,7 @@ struct cred {
#endif
struct user_struct *user; /* real user ID subscription */
struct user_namespace *user_ns; /* user_ns the caps and keyrings are relative to. */
+ struct ucounts *ucounts;
struct group_info *group_info; /* supplementary groups for euid/fsgid */
/* RCU deletion */
union {
@@ -170,6 +171,7 @@ extern int set_security_override_from_ctx(struct cred *, const char *);
extern int set_create_files_as(struct cred *, struct inode *);
extern int cred_fscmp(const struct cred *, const struct cred *);
extern void __init cred_init(void);
+extern int set_cred_ucounts(struct cred *);
/*
* check for validity of credentials
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 0bb833fd41f4..f71b5a4a3e74 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -97,11 +97,15 @@ struct ucounts {
};
extern struct user_namespace init_user_ns;
+extern struct ucounts init_ucounts;
bool setup_userns_sysctls(struct user_namespace *ns);
void retire_userns_sysctls(struct user_namespace *ns);
struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid, enum ucount_type type);
void dec_ucount(struct ucounts *ucounts, enum ucount_type type);
+struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid);
+struct ucounts *get_ucounts(struct ucounts *ucounts);
+void put_ucounts(struct ucounts *ucounts);
#ifdef CONFIG_USER_NS
diff --git a/kernel/cred.c b/kernel/cred.c
index 421b1149c651..58a8a9e24347 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -60,6 +60,7 @@ struct cred init_cred = {
.user = INIT_USER,
.user_ns = &init_user_ns,
.group_info = &init_groups,
+ .ucounts = &init_ucounts,
};
static inline void set_cred_subscribers(struct cred *cred, int n)
@@ -119,6 +120,8 @@ static void put_cred_rcu(struct rcu_head *rcu)
if (cred->group_info)
put_group_info(cred->group_info);
free_uid(cred->user);
+ if (cred->ucounts)
+ put_ucounts(cred->ucounts);
put_user_ns(cred->user_ns);
kmem_cache_free(cred_jar, cred);
}
@@ -222,6 +225,7 @@ struct cred *cred_alloc_blank(void)
#ifdef CONFIG_DEBUG_CREDENTIALS
new->magic = CRED_MAGIC;
#endif
+ new->ucounts = get_ucounts(&init_ucounts);
if (security_cred_alloc_blank(new, GFP_KERNEL_ACCOUNT) < 0)
goto error;
@@ -284,6 +288,11 @@ struct cred *prepare_creds(void)
if (security_prepare_creds(new, old, GFP_KERNEL_ACCOUNT) < 0)
goto error;
+
+ new->ucounts = get_ucounts(new->ucounts);
+ if (!new->ucounts)
+ goto error;
+
validate_creds(new);
return new;
@@ -363,6 +372,8 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
ret = create_user_ns(new);
if (ret < 0)
goto error_put;
+ if (set_cred_ucounts(new) < 0)
+ goto error_put;
}
#ifdef CONFIG_KEYS
@@ -653,6 +664,31 @@ int cred_fscmp(const struct cred *a, const struct cred *b)
}
EXPORT_SYMBOL(cred_fscmp);
+int set_cred_ucounts(struct cred *new)
+{
+ struct task_struct *task = current;
+ const struct cred *old = task->real_cred;
+ struct ucounts *old_ucounts = new->ucounts;
+
+ if (new->user == old->user && new->user_ns == old->user_ns)
+ return 0;
+
+ /*
+ * This optimization is needed because alloc_ucounts() uses locks
+ * for table lookups.
+ */
+ if (old_ucounts && old_ucounts->ns == new->user_ns && uid_eq(old_ucounts->uid, new->euid))
+ return 0;
+
+ if (!(new->ucounts = alloc_ucounts(new->user_ns, new->euid)))
+ return -EAGAIN;
+
+ if (old_ucounts)
+ put_ucounts(old_ucounts);
+
+ return 0;
+}
+
/*
* initialise the credentials stuff
*/
@@ -719,6 +755,10 @@ struct cred *prepare_kernel_cred(struct task_struct *daemon)
if (security_prepare_creds(new, old, GFP_KERNEL_ACCOUNT) < 0)
goto error;
+ new->ucounts = get_ucounts(new->ucounts);
+ if (!new->ucounts)
+ goto error;
+
put_cred(old);
validate_creds(new);
return new;
diff --git a/kernel/fork.c b/kernel/fork.c
index d66cd1014211..40a5da7d3d70 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2957,6 +2957,12 @@ int ksys_unshare(unsigned long unshare_flags)
if (err)
goto bad_unshare_cleanup_cred;
+ if (new_cred) {
+ err = set_cred_ucounts(new_cred);
+ if (err)
+ goto bad_unshare_cleanup_cred;
+ }
+
if (new_fs || new_fd || do_sysvsem || new_cred || new_nsproxy) {
if (do_sysvsem) {
/*
diff --git a/kernel/sys.c b/kernel/sys.c
index 51f00fe20e4d..373def7debe8 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -553,6 +553,10 @@ long __sys_setreuid(uid_t ruid, uid_t euid)
if (retval < 0)
goto error;
+ retval = set_cred_ucounts(new);
+ if (retval < 0)
+ goto error;
+
return commit_creds(new);
error:
@@ -611,6 +615,10 @@ long __sys_setuid(uid_t uid)
if (retval < 0)
goto error;
+ retval = set_cred_ucounts(new);
+ if (retval < 0)
+ goto error;
+
return commit_creds(new);
error:
@@ -686,6 +694,10 @@ long __sys_setresuid(uid_t ruid, uid_t euid, uid_t suid)
if (retval < 0)
goto error;
+ retval = set_cred_ucounts(new);
+ if (retval < 0)
+ goto error;
+
return commit_creds(new);
error:
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 04c561751af1..50cc1dfb7d28 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -8,6 +8,12 @@
#include <linux/kmemleak.h>
#include <linux/user_namespace.h>
+struct ucounts init_ucounts = {
+ .ns = &init_user_ns,
+ .uid = GLOBAL_ROOT_UID,
+ .count = 1,
+};
+
#define UCOUNTS_HASHTABLE_BITS 10
static struct hlist_head ucounts_hashtable[(1 << UCOUNTS_HASHTABLE_BITS)];
static DEFINE_SPINLOCK(ucounts_lock);
@@ -125,7 +131,15 @@ static struct ucounts *find_ucounts(struct user_namespace *ns, kuid_t uid, struc
return NULL;
}
-static struct ucounts *get_ucounts(struct user_namespace *ns, kuid_t uid)
+static void hlist_add_ucounts(struct ucounts *ucounts)
+{
+ struct hlist_head *hashent = ucounts_hashentry(ucounts->ns, ucounts->uid);
+ spin_lock_irq(&ucounts_lock);
+ hlist_add_head(&ucounts->node, hashent);
+ spin_unlock_irq(&ucounts_lock);
+}
+
+struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid)
{
struct hlist_head *hashent = ucounts_hashentry(ns, uid);
struct ucounts *ucounts, *new;
@@ -160,7 +174,26 @@ static struct ucounts *get_ucounts(struct user_namespace *ns, kuid_t uid)
return ucounts;
}
-static void put_ucounts(struct ucounts *ucounts)
+struct ucounts *get_ucounts(struct ucounts *ucounts)
+{
+ unsigned long flags;
+
+ if (!ucounts)
+ return NULL;
+
+ spin_lock_irqsave(&ucounts_lock, flags);
+ if (ucounts->count == INT_MAX) {
+ WARN_ONCE(1, "ucounts: counter has reached its maximum value");
+ ucounts = NULL;
+ } else {
+ ucounts->count += 1;
+ }
+ spin_unlock_irqrestore(&ucounts_lock, flags);
+
+ return ucounts;
+}
+
+void put_ucounts(struct ucounts *ucounts)
{
unsigned long flags;
@@ -194,7 +227,7 @@ struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid,
{
struct ucounts *ucounts, *iter, *bad;
struct user_namespace *tns;
- ucounts = get_ucounts(ns, uid);
+ ucounts = alloc_ucounts(ns, uid);
for (iter = ucounts; iter; iter = tns->ucounts) {
long max;
tns = iter->ns;
@@ -237,6 +270,7 @@ static __init int user_namespace_sysctl_init(void)
BUG_ON(!user_header);
BUG_ON(!setup_userns_sysctls(&init_user_ns));
#endif
+ hlist_add_ucounts(&init_ucounts);
return 0;
}
subsys_initcall(user_namespace_sysctl_init);
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index af612945a4d0..516db53166ab 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -1281,6 +1281,9 @@ static int userns_install(struct nsset *nsset, struct ns_common *ns)
put_user_ns(cred->user_ns);
set_cred_user_ns(cred, get_user_ns(user_ns));
+ if (set_cred_ucounts(cred) < 0)
+ return -EINVAL;
+
return 0;
}
--
2.29.2
On Wed, Mar 10, 2021 at 4:01 AM Alexey Gladkov <[email protected]> wrote:
>
>
> +/* 127: arbitrary random number, small enough to assemble well */
> +#define refcount_zero_or_close_to_overflow(ucounts) \
> + ((unsigned int) atomic_read(&ucounts->count) + 127u <= 127u)
> +
> +struct ucounts *get_ucounts(struct ucounts *ucounts)
> +{
> + if (ucounts) {
> + if (refcount_zero_or_close_to_overflow(ucounts)) {
> + WARN_ONCE(1, "ucounts: counter has reached its maximum value");
> + return NULL;
> + }
> + atomic_inc(&ucounts->count);
> + }
> + return ucounts;
Side note: you probably should just make the limit be the "oh, the
count overflows into the sign bit".
The reason the page cache did that tighter thing is that it actually
has _two_ limits:
- the "try_get_page()" thing uses the sign bit as a "uhhuh, I've now
used up half of the available reference counting bits, and I will
refuse to use any more".
This is basically your "get_ucounts()" function. It's a "I want a
refcount, but I'm willing to deal with failures".
- the page cache has a _different_ set of "I need to unconditionally
get a refcount, and I can *not* deal with failures".
This is basically the traditional "get_page()", which is only used
in fairly controlled places, and should never be something that can
overflow.
And *that* special code then uses that
"zero_or_close_to_overflow()" case as a "doing a get_page() in this
situation is very very wrong". This is purely a debugging feature used
for a VM_BUG_ON() (that has never triggered, as far as I know).
For your ucounts situation, you don't have that second case at all, so
you have no reason to ever allow the count to even get remotely close
to overflowing.
A reference count being within 128 counts of overflow (when we're
talking a 32-bit count) is basically never a good idea. It means that
you are way too close to the limit, and there's a risk that lots of
concurrent people all first see an ok value, and then *all* decide to
do the increment, and then you're toast.
In contrast, if you use the sign bit as a "ok, let's stop
incrementing", the fact that your "overflow" test and the increment
aren't atomic really isn't a big deal.
(And yes, you could use a cmpxchg to *make* the overflow test atomic,
but it's often much much more expensive, so..)
Linus
On Wed, Mar 10, 2021 at 01:01:28PM +0100, Alexey Gladkov wrote:
> The current implementation of the ucounts reference counter requires the
> use of spin_lock. We're going to use get_ucounts() in more performance
> critical areas like a handling of RLIMIT_SIGPENDING.
This really looks like it should be refcount_t. I read the earlier
thread[1] on this, and it's not clear to me that this is a "normal"
condition. I think there was a bug in that version (This appeared
to *instantly* crash at boot with mnt_init() calling alloc_mnt_ns()
calling inc_ucount()). The current code looks like just a "regular"
reference counter of the allocated struct ucounts. Overflow should be
very unexpected, yes? And operating on a "0" ucounts should be a bug
too, right?
> [...]
> +/* 127: arbitrary random number, small enough to assemble well */
> +#define refcount_zero_or_close_to_overflow(ucounts) \
> + ((unsigned int) atomic_read(&ucounts->count) + 127u <= 127u)
Regardless, this should absolutely not have "refcount" as a prefix. I
realize it's only used here, but that's needlessly confusing with regard
to it being atomic_t not refcount_t.
> +struct ucounts *get_ucounts(struct ucounts *ucounts)
> +{
> + if (ucounts) {
> + if (refcount_zero_or_close_to_overflow(ucounts)) {
> + WARN_ONCE(1, "ucounts: counter has reached its maximum value");
> + return NULL;
> + }
> + atomic_inc(&ucounts->count);
> + }
> + return ucounts;
> +}
I feel like this should just be:
refcount_inc_not_zero(&ucounts->count);
Or, to address Linus's comment in the v3 series, change get_ucounts to
not return NULL first -- I can't see why that can ever happen in v8.
-Kees
[1] https://lore.kernel.org/lkml/116c7669744404364651e3b380db2d82bb23f983.1610722473.git.gladkov.alexey@gmail.com/
--
Kees Cook
On Mon, Mar 15, 2021 at 3:03 PM Kees Cook <[email protected]> wrote:
>
> On Wed, Mar 10, 2021 at 01:01:28PM +0100, Alexey Gladkov wrote:
> > The current implementation of the ucounts reference counter requires the
> > use of spin_lock. We're going to use get_ucounts() in more performance
> > critical areas like a handling of RLIMIT_SIGPENDING.
>
> This really looks like it should be refcount_t.
No.
refcount_t didn't have the capabilities required.
It just saturates, and doesn't have the "don't do this" case, which
the ucounts case *DOES* have.
In other words, refcount_t is entirely misdesigned for this - because
it's literally designed for "people can't handle overflow, so we warn
and saturate".
ucounts can never saturate, because they replace saturation with
"don't do that then".
In other words, ucounts work like the page counts do (which also don't
saturate, they just say "ok, you can't get a reference".
I know you are attached to refcounts, but really: they are not only
more expensive, THEY LITERALLY DO THE WRONG THING.
Linus
On Mon, Mar 15, 2021 at 03:19:17PM -0700, Linus Torvalds wrote:
> It just saturates, and doesn't have the "don't do this" case, which
> the ucounts case *DOES* have.
Right -- I saw that when digging through the thread. I'm honestly
curious, though, why did the 0-day bot find a boot crash? (I can't
imagine ucounts wrapped in 0.4 seconds.) So it looked like an
increment-from-zero case, which seems like it would be a bug?
> I know you are attached to refcounts, but really: they are not only
> more expensive, THEY LITERALLY DO THE WRONG THING.
Heh, right -- I'm not arguing that refcount_t MUST be used, I just didn't
see the code path that made them unsuitable: hitting INT_MAX - 128 seems
very hard to do. Anyway, I'll go study it more to try to understand what
I'm missing.
--
Kees Cook
On Tue, Mar 16, 2021 at 11:49 AM Kees Cook <[email protected]> wrote:
>
> Right -- I saw that when digging through the thread. I'm honestly
> curious, though, why did the 0-day bot find a boot crash? (I can't
> imagine ucounts wrapped in 0.4 seconds.) So it looked like an
> increment-from-zero case, which seems like it would be a bug?
Agreed. It's almost certainly a bug. Possibly a use-after-free, but
more likely just a "this count had never gotten initialized to
anything but zero, but is used by the init process (and kernel
threads) and will be incremented but never be free'd, so we never
noticed"
> Heh, right -- I'm not arguing that refcount_t MUST be used, I just didn't
> see the code path that made them unsuitable: hitting INT_MAX - 128 seems
> very hard to do. Anyway, I'll go study it more to try to understand what
> I'm missing.
So as you may have seen later in the thread, I don't like the "INT_MAX
- 128" as a limit.
I think the page count thing does the right thing: it has separate
"debug checks" and "limit checks", and the way it's done it never
really needs to worry about doing the (often) expensive cmpxchg loop,
because the limit check is _so_ far off the final case that we don't
care, and the debug checks aren't about races, they are about "uhhuh,
yoiu used this wrong".
So what the page code does is:
- try_get_page() has a limit check _and_ a debug check:
(a) the limit check is "you've used up half the refcounts, I'm not
giving you any more".
(b) the debug check is "you can't get a page that has a zero count
or has underflowed".
it's not obvious that it has both of those checks, because they are
merged into one single WARN_ON_ONCE(), but that's purely for "we
actually want that warning for the limit check, because that looks
like somebody trying an attack" and it just got combined.
So technically, the code really should do
page = compound_head(page);
/* Debug check for mis-use of the count */
if (WARN_ON_ONCE(page_ref_zero_or_close_to_overflow(page)))
return false;
/*
* Limit check - we're not incrementing the
* count (much) past the halfway point
*/
if (page_ref_count(page) <= 0)
return false;
/* The actual atomic reference - the above were done "carelessly" */
page_ref_inc(page);
return true;
because the "oh, we're not allowing you this ref" is not
_technically_ wrong, it's just traditionally wrong, if you see what I
mean.
and notice how none of the above really cares about the
"page_ref_inc()" itself being atomic wrt the checks. It's ok if we
race, and the page ref goes a bit above the half-way point. You can't
race _so_ much that you actually overflow, because our limit check is
_so_ far away from the overflow area that it's not an issue.
And similarly, the debug check with
page_ref_zero_or_close_to_overflow() is one of those things that are
trying to see underflows or bad use-cases, and trying to do that
atomically with the actual ref update doesn't really help. The
underfulow or mis-use will have happened before we increment the page
count.
So the above is very close to what the ucounts code I think really
wants to do: the "zero_or_close_to_overflow" is an error case: it
means something just underflowed, or you were trying to increment a
ref to something you didn't have a reference to in the first place.
And the "<= 0" check is just the cheap test for "I'm giving you at
most half the counter space, because I don't want to have to even
remotely worry about overflow".
Note that the above very intentionally does allow the "we can go over
the limit" case for another reason: we still have that regular
*unconditional* get_page(), that has a "I absolutely need a temporary
ref to this page, but I know it's not some long-term thing that a user
can force". That's not only our traditional model, but it's something
that some kernel code simply does need, so it's a good feature in
itself. That might be less of an issue for ucounts, but for pages, we
somethines do have "I need to take a ref to this page just for my own
use while I then drop the page lock and do something else".
The "put_page()" case then has its own debug check (in
"put_page_testzero()") which says "hey, you can't put a page that has
no refcount.
Thct could could easily use that "zero_or_close_to_overflow(()" rule
too, but if you actually do underflow for real, you'll see the zero
(again - races aren't really important because even if you have some
attack vector that depends on the race, such attack vectors will also
have to depend on doing the thing over and over and over again until
it successfully hits the race, so you'll see the zero case in
practice, and trying to be "atomic" for debug testing is thus
pointless.
So I do think out page counting this is actually pretty good.
And it's possible that "refcount_t" could use that exact same model,
and actually then offer that option that ucounts wants, of a "try to
get a refcount, but if we have too many refcounts, then never mind, I
can just return an error to user space instead".
Hmm? On x86 (and honestly, these days on arm too with the new
atomics), it's generally quite a bit cheaper to do an atomic
increment/decrement than it is to do a cmpxchg loop. That seems to
become even more true as microarchitectures optimize those atomics -
apparently AMD actually does regular locked ops by doing them
optimistically out-of-order, and verifying that the serialization
requirements hold after-the-fact. So plain simple locked ops that
historically used to be quite expensive are getting less so (because
they've obviously gotten much more important over the years).
Linus
On Tue, Mar 16, 2021 at 12:26:05PM -0700, Linus Torvalds wrote:
> Note that the above very intentionally does allow the "we can go over
> the limit" case for another reason: we still have that regular
> *unconditional* get_page(), that has a "I absolutely need a temporary
> ref to this page, but I know it's not some long-term thing that a user
> can force". That's not only our traditional model, but it's something
> that some kernel code simply does need, so it's a good feature in
> itself. That might be less of an issue for ucounts, but for pages, we
> somethines do have "I need to take a ref to this page just for my own
> use while I then drop the page lock and do something else".
Right, get_page() has a whole other set of requirements. :) I just
couldn't find the "we _must_ to get a reference to ucounts" code path,
so I was scratching my head.
> And it's possible that "refcount_t" could use that exact same model,
> and actually then offer that option that ucounts wants, of a "try to
> get a refcount, but if we have too many refcounts, then never mind, I
> can just return an error to user space instead".
Yeah, if there starts to be more of these cases, I think it'd be a
nice addition. And with the recent performance work Will Deacon did on
refcount_t, I think any general performance concerns are met now. But
I'd love to just leave refcount_t alone until we can really show a need
for an API change. :P
--
Kees Cook