2021-10-18 03:14:14

by Eric W. Biederman

Subject: [CFT][PATCH] ucounts: Fix signal ucount refcounting


In commit fda31c50292a ("signal: avoid double atomic counter
increments for user accounting") Linus made a clever optimization to
how rlimits and the struct user_struct are accounted. Unfortunately
that optimization does not work in the obvious way when moved to
nested rlimits. The problem is that the last decrement of the per
user namespace per user sigpending counter might also be the last
decrement of the sigpending counter in the parent user namespace,
which means that simply freeing the leaf ucount in __sigqueue_free is
not enough.

Maintain the optimization and handle the tricky cases by introducing
inc_rlimit_get_ucounts and dec_rlimit_put_ucounts.

By moving the entire optimization into functions that perform all of
the work it becomes possible to ensure that every level is handled
properly.

I wish we had a single user across all of the threads whose rlimit
could be charged so we did not need this complexity.
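
The failure mode above can be modeled with a minimal userspace sketch
(hypothetical types and names for illustration only; the kernel walks
iter->ns->ucounts rather than a ->parent pointer, and uses atomics):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model of the nested rlimit counters. */
struct ucount_model {
	struct ucount_model *parent; /* enclosing user namespace, NULL at root */
	long sigpending;             /* per-namespace pending-signal charge */
	long refs;                   /* reference count pinned by live charges */
};

/* Charge one pending signal at every level, pinning each ucount whose
 * counter goes 0 -> 1 (mirrors what inc_rlimit_get_ucounts must do). */
static void charge(struct ucount_model *u)
{
	for (struct ucount_model *it = u; it; it = it->parent)
		if (++it->sigpending == 1)
			it->refs++;
}

/* Uncharge at every level, dropping the pin whenever a counter hits 0
 * (mirrors dec_rlimit_put_ucounts). */
static void uncharge(struct ucount_model *u)
{
	for (struct ucount_model *it = u; it; it = it->parent)
		if (--it->sigpending == 0)
			it->refs--;
}

/* With leaf.parent == &root and both counters at 1, the last decrement
 * on the leaf is simultaneously the last decrement on the parent: both
 * levels must drop their pin, so releasing only the leaf (as the old
 * code effectively did) leaks the parent's reference. */
```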

Cc: [email protected]
Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
Signed-off-by: "Eric W. Biederman" <[email protected]>
---

With a lot of help from Alex, who found a way I could reproduce this,
I believe I have found the issue.

Could people who are seeing this issue test and verify this solves the
problem for them?

 include/linux/user_namespace.h |  2 ++
 kernel/signal.c                | 25 +++++----------------
 kernel/ucount.c                | 41 ++++++++++++++++++++++++++++++++++
 3 files changed, 49 insertions(+), 19 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index eb70cabe6e7f..33a4240e6a6f 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -127,6 +127,8 @@ static inline long get_ucounts_value(struct ucounts *ucounts, enum ucount_type t

long inc_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v);
bool dec_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v);
+long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum ucount_type type);
+void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum ucount_type type);
bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, unsigned long max);

static inline void set_rlimit_ucount_max(struct user_namespace *ns,
diff --git a/kernel/signal.c b/kernel/signal.c
index a3229add4455..762de58c6e76 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -425,22 +425,10 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags,
*/
rcu_read_lock();
ucounts = task_ucounts(t);
- sigpending = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
- switch (sigpending) {
- case 1:
- if (likely(get_ucounts(ucounts)))
- break;
- fallthrough;
- case LONG_MAX:
- /*
- * we need to decrease the ucount in the userns tree on any
- * failure to avoid counts leaking.
- */
- dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
- rcu_read_unlock();
- return NULL;
- }
+ sigpending = inc_rlimit_get_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING);
rcu_read_unlock();
+ if (sigpending == LONG_MAX)
+ return NULL;

if (override_rlimit || likely(sigpending <= task_rlimit(t, RLIMIT_SIGPENDING))) {
q = kmem_cache_alloc(sigqueue_cachep, gfp_flags);
@@ -449,8 +437,7 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags,
}

if (unlikely(q == NULL)) {
- if (dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1))
- put_ucounts(ucounts);
+ dec_rlimit_put_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING);
} else {
INIT_LIST_HEAD(&q->list);
q->flags = sigqueue_flags;
@@ -463,8 +450,8 @@ static void __sigqueue_free(struct sigqueue *q)
{
if (q->flags & SIGQUEUE_PREALLOC)
return;
- if (q->ucounts && dec_rlimit_ucounts(q->ucounts, UCOUNT_RLIMIT_SIGPENDING, 1)) {
- put_ucounts(q->ucounts);
+ if (q->ucounts) {
+ dec_rlimit_put_ucounts(q->ucounts, UCOUNT_RLIMIT_SIGPENDING);
q->ucounts = NULL;
}
kmem_cache_free(sigqueue_cachep, q);
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 3b7e176cf7a2..687d77aa66bb 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -285,6 +285,47 @@ bool dec_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v)
return (new == 0);
}

+static void do_dec_rlimit_put_ucounts(struct ucounts *ucounts,
+ struct ucounts *last, enum ucount_type type)
+{
+ struct ucounts *iter;
+ for (iter = ucounts; iter != last; iter = iter->ns->ucounts) {
+ long dec = atomic_long_add_return(-1, &iter->ucount[type]);
+ WARN_ON_ONCE(dec < 0);
+ if (dec == 0)
+ put_ucounts(iter);
+ }
+}
+
+void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum ucount_type type)
+{
+ do_dec_rlimit_put_ucounts(ucounts, NULL, type);
+}
+
+long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum ucount_type type)
+{
+ struct ucounts *iter;
+ long dec, ret = 0;
+
+ for (iter = ucounts; iter; iter = iter->ns->ucounts) {
+ long max = READ_ONCE(iter->ns->ucount_max[type]);
+ long new = atomic_long_add_return(1, &iter->ucount[type]);
+ if (new < 0 || new > max)
+ goto unwind;
+ else if (iter == ucounts)
+ ret = new;
+ if ((new == 1) && (get_ucounts(iter) != iter))
+ goto dec_unwind;
+ }
+ return ret;
+dec_unwind:
+ dec = atomic_long_add_return(1, &iter->ucount[type]);
+ WARN_ON_ONCE(dec < 0);
+unwind:
+ do_dec_rlimit_put_ucounts(ucounts, iter, type);
+ return LONG_MAX;
+}
+
bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, unsigned long max)
{
struct ucounts *iter;
--
2.20.1


2021-10-18 03:21:29

by Alexey Gladkov

Subject: Re: [CFT][PATCH] ucounts: Fix signal ucount refcounting

On Fri, Oct 15, 2021 at 05:10:58PM -0500, Eric W. Biederman wrote:
>
> In commit fda31c50292a ("signal: avoid double atomic counter
> increments for user accounting") Linus made a clever optimization to
> how rlimits and the struct user_struct. Unfortunately that
> optimization does not work in the obvious way when moved to nested
> rlimits. The problem is that the last decrement of the per user
> namespace per user sigpending counter might also be the last decrement
> of the sigpending counter in the parent user namespace as well. Which
> means that simply freeing the leaf ucount in __free_sigqueue is not
> enough.
>
> Maintain the optimization and handle the tricky cases by introducing
> inc_rlimit_get_ucounts and dec_rlimit_put_ucounts.
>
> By moving the entire optimization into functions that perform all of
> the work it becomes possible to ensure that every level is handled
> properly.
>
> I wish we had a single user across all of the threads whose rlimit
> could be charged so we did not need this complexity.
>
> Cc: [email protected]
> Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
>
> With a lot of help from Alex who found a way I could reproduce this
> I believe I have found the issue.
>
> Could people who are seeing this issue test and verify this solves the
> problem for them?
>
> include/linux/user_namespace.h | 2 ++
> kernel/signal.c | 25 +++++----------------
> kernel/ucount.c | 41 ++++++++++++++++++++++++++++++++++
> 3 files changed, 49 insertions(+), 19 deletions(-)
>
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index eb70cabe6e7f..33a4240e6a6f 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -127,6 +127,8 @@ static inline long get_ucounts_value(struct ucounts *ucounts, enum ucount_type t
>
> long inc_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v);
> bool dec_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v);
> +long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum ucount_type type);
> +void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum ucount_type type);
> bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, unsigned long max);
>
> static inline void set_rlimit_ucount_max(struct user_namespace *ns,
> diff --git a/kernel/signal.c b/kernel/signal.c
> index a3229add4455..762de58c6e76 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -425,22 +425,10 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags,
> */
> rcu_read_lock();
> ucounts = task_ucounts(t);
> - sigpending = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
> - switch (sigpending) {
> - case 1:
> - if (likely(get_ucounts(ucounts)))
> - break;
> - fallthrough;
> - case LONG_MAX:
> - /*
> - * we need to decrease the ucount in the userns tree on any
> - * failure to avoid counts leaking.
> - */
> - dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
> - rcu_read_unlock();
> - return NULL;
> - }
> + sigpending = inc_rlimit_get_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING);
> rcu_read_unlock();
> + if (sigpending == LONG_MAX)
> + return NULL;
>
> if (override_rlimit || likely(sigpending <= task_rlimit(t, RLIMIT_SIGPENDING))) {
> q = kmem_cache_alloc(sigqueue_cachep, gfp_flags);
> @@ -449,8 +437,7 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags,
> }
>
> if (unlikely(q == NULL)) {
> - if (dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1))
> - put_ucounts(ucounts);
> + dec_rlimit_put_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING);
> } else {
> INIT_LIST_HEAD(&q->list);
> q->flags = sigqueue_flags;
> @@ -463,8 +450,8 @@ static void __sigqueue_free(struct sigqueue *q)
> {
> if (q->flags & SIGQUEUE_PREALLOC)
> return;
> - if (q->ucounts && dec_rlimit_ucounts(q->ucounts, UCOUNT_RLIMIT_SIGPENDING, 1)) {
> - put_ucounts(q->ucounts);
> + if (q->ucounts) {
> + dec_rlimit_put_ucounts(q->ucounts, UCOUNT_RLIMIT_SIGPENDING);
> q->ucounts = NULL;
> }
> kmem_cache_free(sigqueue_cachep, q);
> diff --git a/kernel/ucount.c b/kernel/ucount.c
> index 3b7e176cf7a2..687d77aa66bb 100644
> --- a/kernel/ucount.c
> +++ b/kernel/ucount.c
> @@ -285,6 +285,47 @@ bool dec_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v)
> return (new == 0);
> }
>
> +static void do_dec_rlimit_put_ucounts(struct ucounts *ucounts,
> + struct ucounts *last, enum ucount_type type)
> +{
> + struct ucounts *iter;
> + for (iter = ucounts; iter != last; iter = iter->ns->ucounts) {
> + long dec = atomic_long_add_return(-1, &iter->ucount[type]);
> + WARN_ON_ONCE(dec < 0);
> + if (dec == 0)
> + put_ucounts(iter);
> + }
> +}
> +
> +void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum ucount_type type)
> +{
> + do_dec_rlimit_put_ucounts(ucounts, NULL, type);
> +}
> +
> +long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum ucount_type type)
> +{
> + struct ucounts *iter;
> + long dec, ret = 0;
> +
> + for (iter = ucounts; iter; iter = iter->ns->ucounts) {
> + long max = READ_ONCE(iter->ns->ucount_max[type]);
> + long new = atomic_long_add_return(1, &iter->ucount[type]);
> + if (new < 0 || new > max)
> + goto unwind;
> + else if (iter == ucounts)
> + ret = new;
> + if ((new == 1) && (get_ucounts(iter) != iter))

get_ucounts can do put_ucounts. Are you sure it's correct to use
get_ucounts here?

> + goto dec_unwind;
> + }
> + return ret;
> +dec_unwind:
> + dec = atomic_long_add_return(1, &iter->ucount[type]);

Should be -1 ?

> + WARN_ON_ONCE(dec < 0);
> +unwind:
> + do_dec_rlimit_put_ucounts(ucounts, iter, type);
> + return LONG_MAX;
> +}
> +
> bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, unsigned long max)
> {
> struct ucounts *iter;
> --
> 2.20.1
>

--
Rgrds, legion

2021-10-18 03:37:37

by Eric W. Biederman

Subject: Re: [CFT][PATCH] ucounts: Fix signal ucount refcounting

Alexey Gladkov <[email protected]> writes:

> On Fri, Oct 15, 2021 at 05:10:58PM -0500, Eric W. Biederman wrote:
>>
>> In commit fda31c50292a ("signal: avoid double atomic counter
>> increments for user accounting") Linus made a clever optimization to
>> how rlimits and the struct user_struct. Unfortunately that
>> optimization does not work in the obvious way when moved to nested
>> rlimits. The problem is that the last decrement of the per user
>> namespace per user sigpending counter might also be the last decrement
>> of the sigpending counter in the parent user namespace as well. Which
>> means that simply freeing the leaf ucount in __free_sigqueue is not
>> enough.
>>
>> Maintain the optimization and handle the tricky cases by introducing
>> inc_rlimit_get_ucounts and dec_rlimit_put_ucounts.
>>
>> By moving the entire optimization into functions that perform all of
>> the work it becomes possible to ensure that every level is handled
>> properly.
>>
>> I wish we had a single user across all of the threads whose rlimit
>> could be charged so we did not need this complexity.
>>
>> Cc: [email protected]
>> Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>> ---
>>
>> With a lot of help from Alex who found a way I could reproduce this
>> I believe I have found the issue.
>>
>> Could people who are seeing this issue test and verify this solves the
>> problem for them?
>>
>> include/linux/user_namespace.h | 2 ++
>> kernel/signal.c | 25 +++++----------------
>> kernel/ucount.c | 41 ++++++++++++++++++++++++++++++++++
>> 3 files changed, 49 insertions(+), 19 deletions(-)
>>
>> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
>> index eb70cabe6e7f..33a4240e6a6f 100644
>> --- a/include/linux/user_namespace.h
>> +++ b/include/linux/user_namespace.h
>> @@ -127,6 +127,8 @@ static inline long get_ucounts_value(struct ucounts *ucounts, enum ucount_type t
>>
>> long inc_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v);
>> bool dec_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v);
>> +long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum ucount_type type);
>> +void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum ucount_type type);
>> bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, unsigned long max);
>>
>> static inline void set_rlimit_ucount_max(struct user_namespace *ns,
>> diff --git a/kernel/signal.c b/kernel/signal.c
>> index a3229add4455..762de58c6e76 100644
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -425,22 +425,10 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags,
>> */
>> rcu_read_lock();
>> ucounts = task_ucounts(t);
>> - sigpending = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
>> - switch (sigpending) {
>> - case 1:
>> - if (likely(get_ucounts(ucounts)))
>> - break;
>> - fallthrough;
>> - case LONG_MAX:
>> - /*
>> - * we need to decrease the ucount in the userns tree on any
>> - * failure to avoid counts leaking.
>> - */
>> - dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
>> - rcu_read_unlock();
>> - return NULL;
>> - }
>> + sigpending = inc_rlimit_get_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING);
>> rcu_read_unlock();
>> + if (sigpending == LONG_MAX)
>> + return NULL;
>>
>> if (override_rlimit || likely(sigpending <= task_rlimit(t, RLIMIT_SIGPENDING))) {
>> q = kmem_cache_alloc(sigqueue_cachep, gfp_flags);
>> @@ -449,8 +437,7 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags,
>> }
>>
>> if (unlikely(q == NULL)) {
>> - if (dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1))
>> - put_ucounts(ucounts);
>> + dec_rlimit_put_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING);
>> } else {
>> INIT_LIST_HEAD(&q->list);
>> q->flags = sigqueue_flags;
>> @@ -463,8 +450,8 @@ static void __sigqueue_free(struct sigqueue *q)
>> {
>> if (q->flags & SIGQUEUE_PREALLOC)
>> return;
>> - if (q->ucounts && dec_rlimit_ucounts(q->ucounts, UCOUNT_RLIMIT_SIGPENDING, 1)) {
>> - put_ucounts(q->ucounts);
>> + if (q->ucounts) {
>> + dec_rlimit_put_ucounts(q->ucounts, UCOUNT_RLIMIT_SIGPENDING);
>> q->ucounts = NULL;
>> }
>> kmem_cache_free(sigqueue_cachep, q);
>> diff --git a/kernel/ucount.c b/kernel/ucount.c
>> index 3b7e176cf7a2..687d77aa66bb 100644
>> --- a/kernel/ucount.c
>> +++ b/kernel/ucount.c
>> @@ -285,6 +285,47 @@ bool dec_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v)
>> return (new == 0);
>> }
>>
>> +static void do_dec_rlimit_put_ucounts(struct ucounts *ucounts,
>> + struct ucounts *last, enum ucount_type type)
>> +{
>> + struct ucounts *iter;
>> + for (iter = ucounts; iter != last; iter = iter->ns->ucounts) {
>> + long dec = atomic_long_add_return(-1, &iter->ucount[type]);
>> + WARN_ON_ONCE(dec < 0);
>> + if (dec == 0)
>> + put_ucounts(iter);
>> + }
>> +}
>> +
>> +void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum ucount_type type)
>> +{
>> + do_dec_rlimit_put_ucounts(ucounts, NULL, type);
>> +}
>> +
>> +long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum ucount_type type)
>> +{
>> + struct ucounts *iter;
>> + long dec, ret = 0;
>> +
>> + for (iter = ucounts; iter; iter = iter->ns->ucounts) {
>> + long max = READ_ONCE(iter->ns->ucount_max[type]);
>> + long new = atomic_long_add_return(1, &iter->ucount[type]);
>> + if (new < 0 || new > max)
>> + goto unwind;
>> + else if (iter == ucounts)
>> + ret = new;
>> + if ((new == 1) && (get_ucounts(iter) != iter))
>
> get_ucounts can do put_ucounts. Are you sure it's correct to use
> get_ucounts here?

My only concern would be if inc_rlimit_get_ucounts were not safe to
call under rcu_read_lock(). But I don't see anything in get_ucounts
or put_ucounts that would not be safe under rcu_read_lock().

For get_ucounts we do need to test to see if it fails. Either by
testing for NULL or testing to see if it does not return the expected
ucount.

Does that make sense or do you have another concern?
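
The failure test Eric describes can be sketched in a hedged userspace
model (hypothetical names; the real get_ucounts atomically increments
a refcount and fails once it saturates):

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for struct ucounts. */
struct uc_model { long refs; };

/* Model of get_ucounts(): bump the refcount unless it has saturated,
 * returning NULL when the get did not happen. */
static struct uc_model *get_uc(struct uc_model *u)
{
	if (u->refs == LONG_MAX)
		return NULL;	/* saturated: caller must notice the failure */
	u->refs++;
	return u;
}

/* Both of the failure tests named above are equivalent against this
 * model: `!get_uc(iter)` and `get_uc(iter) != iter` each catch the
 * NULL return; the patch uses the second form inside its loop. */
```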


>> + goto dec_unwind;
>> + }
>> + return ret;
>> +dec_unwind:
>> + dec = atomic_long_add_return(1, &iter->ucount[type]);
>
> Should be -1 ?

Yes it should. I will fix and resend.

>> + WARN_ON_ONCE(dec < 0);
>> +unwind:
>> + do_dec_rlimit_put_ucounts(ucounts, iter, type);
>> + return LONG_MAX;
>> +}
>> +
>> bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, unsigned long max)
>> {
>> struct ucounts *iter;
>> --
>> 2.20.1
>>

Eric

2021-10-18 03:37:43

by Eric W. Biederman

Subject: Re: [CFT][PATCH] ucounts: Fix signal ucount refcounting

Hillf Danton <[email protected]> writes:

> On Fri, 15 Oct 2021 17:10:58 -0500 Eric W. Biederman wrote:
>>
>> In commit fda31c50292a ("signal: avoid double atomic counter
>> increments for user accounting") Linus made a clever optimization to
>> how rlimits and the struct user_struct. Unfortunately that
>> optimization does not work in the obvious way when moved to nested
>> rlimits. The problem is that the last decrement of the per user
>> namespace per user sigpending counter might also be the last decrement
>> of the sigpending counter in the parent user namespace as well. Which
>> means that simply freeing the leaf ucount in __free_sigqueue is not
>> enough.
>>
>> Maintain the optimization and handle the tricky cases by introducing
>> inc_rlimit_get_ucounts and dec_rlimit_put_ucounts.
>>
>> By moving the entire optimization into functions that perform all of
>> the work it becomes possible to ensure that every level is handled
>> properly.
>>
>> I wish we had a single user across all of the threads whose rlimit
>> could be charged so we did not need this complexity.
>>
>> Cc: [email protected]
>> Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>> ---
>>
>> With a lot of help from Alex who found a way I could reproduce this
>> I believe I have found the issue.
>>
>> Could people who are seeing this issue test and verify this solves the
>> problem for them?
>>
>> include/linux/user_namespace.h | 2 ++
>> kernel/signal.c | 25 +++++----------------
>> kernel/ucount.c | 41 ++++++++++++++++++++++++++++++++++
>> 3 files changed, 49 insertions(+), 19 deletions(-)
>>
>> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
>> index eb70cabe6e7f..33a4240e6a6f 100644
>> --- a/include/linux/user_namespace.h
>> +++ b/include/linux/user_namespace.h
>> @@ -127,6 +127,8 @@ static inline long get_ucounts_value(struct ucounts *ucounts, enum ucount_type t
>>
>> long inc_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v);
>> bool dec_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v);
>> +long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum ucount_type type);
>> +void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum ucount_type type);
>> bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, unsigned long max);
>>
>> static inline void set_rlimit_ucount_max(struct user_namespace *ns,
>> diff --git a/kernel/signal.c b/kernel/signal.c
>> index a3229add4455..762de58c6e76 100644
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -425,22 +425,10 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags,
>> */
>> rcu_read_lock();
>> ucounts = task_ucounts(t);
>> - sigpending = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
>> - switch (sigpending) {
>> - case 1:
>> - if (likely(get_ucounts(ucounts)))
>> - break;
>> - fallthrough;
>> - case LONG_MAX:
>> - /*
>> - * we need to decrease the ucount in the userns tree on any
>> - * failure to avoid counts leaking.
>> - */
>> - dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
>> - rcu_read_unlock();
>> - return NULL;
>> - }
>> + sigpending = inc_rlimit_get_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING);
>> rcu_read_unlock();
>> + if (sigpending == LONG_MAX)
>> + return NULL;
>>
>> if (override_rlimit || likely(sigpending <= task_rlimit(t, RLIMIT_SIGPENDING))) {
>> q = kmem_cache_alloc(sigqueue_cachep, gfp_flags);
>> @@ -449,8 +437,7 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags,
>> }
>>
>> if (unlikely(q == NULL)) {
>> - if (dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1))
>> - put_ucounts(ucounts);
>> + dec_rlimit_put_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING);
>> } else {
>> INIT_LIST_HEAD(&q->list);
>> q->flags = sigqueue_flags;
>> @@ -463,8 +450,8 @@ static void __sigqueue_free(struct sigqueue *q)
>> {
>> if (q->flags & SIGQUEUE_PREALLOC)
>> return;
>> - if (q->ucounts && dec_rlimit_ucounts(q->ucounts, UCOUNT_RLIMIT_SIGPENDING, 1)) {
>> - put_ucounts(q->ucounts);
>> + if (q->ucounts) {
>> + dec_rlimit_put_ucounts(q->ucounts, UCOUNT_RLIMIT_SIGPENDING);
>> q->ucounts = NULL;
>> }
>> kmem_cache_free(sigqueue_cachep, q);
>> diff --git a/kernel/ucount.c b/kernel/ucount.c
>> index 3b7e176cf7a2..687d77aa66bb 100644
>> --- a/kernel/ucount.c
>> +++ b/kernel/ucount.c
>> @@ -285,6 +285,47 @@ bool dec_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v)
>> return (new == 0);
>> }
>>
>> +static void do_dec_rlimit_put_ucounts(struct ucounts *ucounts,
>> + struct ucounts *last, enum ucount_type type)
>> +{
>> + struct ucounts *iter;
>> + for (iter = ucounts; iter != last; iter = iter->ns->ucounts) {
>> + long dec = atomic_long_add_return(-1, &iter->ucount[type]);
>> + WARN_ON_ONCE(dec < 0);
>> + if (dec == 0)
>> + put_ucounts(iter);
>> + }
>
> Given the kfree() in put_ucounts(), this will have difficulty
> surviving tests like KASAN if that put pairs with the get taken in
> inc_rlimit_get_ucounts() below.

I don't know if this is what you are thinking about, but there is
indeed a use-after-free bug in that loop caused by the kfree.

The problem is that iter->ns->ucounts is read after put_ucounts. It
just needs to be read beforehand.
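
The ordering fix can be illustrated with a small userspace sketch
(hypothetical types; ->next stands in for iter->ns->ucounts, and a
counter instruments the frees so the walk can be checked):

```c
#include <assert.h>
#include <stdlib.h>

static int freed_nodes;	/* instrumentation for this sketch only */

/* Hypothetical stand-in for struct ucounts. */
struct node {
	struct node *next;	/* plays the role of iter->ns->ucounts */
	long count;		/* per-level rlimit counter */
	int refs;		/* node is freed when this drops to 0 */
};

/* Walk from n toward the root, decrementing each level's counter and
 * releasing the node when the counter reaches zero. The crucial line
 * is loading the next pointer *before* the release: the release may
 * free the very node we would otherwise read it from. */
static void dec_put_all(struct node *n, struct node *last)
{
	for (struct node *it = n; it != last; ) {
		struct node *next = it->next;	/* read before possible free */
		if (--it->count == 0 && --it->refs == 0) {
			free(it);
			freed_nodes++;
		}
		it = next;
	}
}
```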


>> +}
>> +
>> +void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum ucount_type type)
>> +{
>> + do_dec_rlimit_put_ucounts(ucounts, NULL, type);
>> +}
>> +
>> +long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum ucount_type type)
>> +{
>> + struct ucounts *iter;
>> + long dec, ret = 0;
>> +
>> + for (iter = ucounts; iter; iter = iter->ns->ucounts) {
>> + long max = READ_ONCE(iter->ns->ucount_max[type]);
>> + long new = atomic_long_add_return(1, &iter->ucount[type]);
>> + if (new < 0 || new > max)
>> + goto unwind;
>> + else if (iter == ucounts)
>> + ret = new;
>> + if ((new == 1) && (get_ucounts(iter) != iter))
>> + goto dec_unwind;
>
> Add a line of comment for get to ease readers.

/* you are not expected to understand this */

I think that is the classic comment from the Sixth Edition Unix
source. Seriously, I can't think of any comment that would make the
situation more comprehensible.


> Hillf
>
>> + }
>> + return ret;
>> +dec_unwind:
>> + dec = atomic_long_add_return(1, &iter->ucount[type]);
>> + WARN_ON_ONCE(dec < 0);
>> +unwind:
>> + do_dec_rlimit_put_ucounts(ucounts, iter, type);
>> + return LONG_MAX;
>> +}
>> +
>> bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, unsigned long max)
>> {
>> struct ucounts *iter;
>> --
>> 2.20.1

Eric

2021-10-18 03:46:26

by Rune Kleveland

Subject: Re: [CFT][PATCH] ucounts: Fix signal ucount refcounting

Hi!

After applying the below patch, the 5 most problematic servers have run
without any issues for 23 hours. That never happened before the patch on
5.14, so the patch seems to have fixed the issue for me.

On Monday there will be more load on the servers, which is what made
them crash faster without the patch. I will let you know if it
happens again.

Best regards,
Rune

On 16/10/2021 00:10, Eric W. Biederman wrote:
>
> In commit fda31c50292a ("signal: avoid double atomic counter
> increments for user accounting") Linus made a clever optimization to
> how rlimits and the struct user_struct. Unfortunately that
> optimization does not work in the obvious way when moved to nested
> rlimits. The problem is that the last decrement of the per user
> namespace per user sigpending counter might also be the last decrement
> of the sigpending counter in the parent user namespace as well. Which
> means that simply freeing the leaf ucount in __free_sigqueue is not
> enough.
>
> Maintain the optimization and handle the tricky cases by introducing
> inc_rlimit_get_ucounts and dec_rlimit_put_ucounts.
>
> By moving the entire optimization into functions that perform all of
> the work it becomes possible to ensure that every level is handled
> properly.
>
> I wish we had a single user across all of the threads whose rlimit
> could be charged so we did not need this complexity.

2021-10-18 03:48:29

by Yu Zhao

Subject: Re: [CFT][PATCH] ucounts: Fix signal ucount refcounting

On Sat, Oct 16, 2021 at 11:35 AM Eric W. Biederman
<[email protected]> wrote:
>
> Alexey Gladkov <[email protected]> writes:
>
> > On Fri, Oct 15, 2021 at 05:10:58PM -0500, Eric W. Biederman wrote:
> >>
> >> In commit fda31c50292a ("signal: avoid double atomic counter
> >> increments for user accounting") Linus made a clever optimization to
> >> how rlimits and the struct user_struct. Unfortunately that
> >> optimization does not work in the obvious way when moved to nested
> >> rlimits. The problem is that the last decrement of the per user
> >> namespace per user sigpending counter might also be the last decrement
> >> of the sigpending counter in the parent user namespace as well. Which
> >> means that simply freeing the leaf ucount in __free_sigqueue is not
> >> enough.
> >>
> >> Maintain the optimization and handle the tricky cases by introducing
> >> inc_rlimit_get_ucounts and dec_rlimit_put_ucounts.
> >>
> >> By moving the entire optimization into functions that perform all of
> >> the work it becomes possible to ensure that every level is handled
> >> properly.
> >>
> >> I wish we had a single user across all of the threads whose rlimit
> >> could be charged so we did not need this complexity.
> >>
> >> Cc: [email protected]
> >> Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
> >> Signed-off-by: "Eric W. Biederman" <[email protected]>
> >> ---
> >>
> >> With a lot of help from Alex who found a way I could reproduce this
> >> I believe I have found the issue.
> >>
> >> Could people who are seeing this issue test and verify this solves the
> >> problem for them?
> >>
> >> include/linux/user_namespace.h | 2 ++
> >> kernel/signal.c | 25 +++++----------------
> >> kernel/ucount.c | 41 ++++++++++++++++++++++++++++++++++
> >> 3 files changed, 49 insertions(+), 19 deletions(-)
> >>
> >> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> >> index eb70cabe6e7f..33a4240e6a6f 100644
> >> --- a/include/linux/user_namespace.h
> >> +++ b/include/linux/user_namespace.h
> >> @@ -127,6 +127,8 @@ static inline long get_ucounts_value(struct ucounts *ucounts, enum ucount_type t
> >>
> >> long inc_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v);
> >> bool dec_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v);
> >> +long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum ucount_type type);
> >> +void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum ucount_type type);
> >> bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, unsigned long max);
> >>
> >> static inline void set_rlimit_ucount_max(struct user_namespace *ns,
> >> diff --git a/kernel/signal.c b/kernel/signal.c
> >> index a3229add4455..762de58c6e76 100644
> >> --- a/kernel/signal.c
> >> +++ b/kernel/signal.c
> >> @@ -425,22 +425,10 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags,
> >> */
> >> rcu_read_lock();
> >> ucounts = task_ucounts(t);
> >> - sigpending = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
> >> - switch (sigpending) {
> >> - case 1:
> >> - if (likely(get_ucounts(ucounts)))
> >> - break;
> >> - fallthrough;
> >> - case LONG_MAX:
> >> - /*
> >> - * we need to decrease the ucount in the userns tree on any
> >> - * failure to avoid counts leaking.
> >> - */
> >> - dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
> >> - rcu_read_unlock();
> >> - return NULL;
> >> - }
> >> + sigpending = inc_rlimit_get_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING);
> >> rcu_read_unlock();
> >> + if (sigpending == LONG_MAX)
> >> + return NULL;
> >>
> >> if (override_rlimit || likely(sigpending <= task_rlimit(t, RLIMIT_SIGPENDING))) {
> >> q = kmem_cache_alloc(sigqueue_cachep, gfp_flags);
> >> @@ -449,8 +437,7 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags,
> >> }
> >>
> >> if (unlikely(q == NULL)) {
> >> - if (dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1))
> >> - put_ucounts(ucounts);
> >> + dec_rlimit_put_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING);
> >> } else {
> >> INIT_LIST_HEAD(&q->list);
> >> q->flags = sigqueue_flags;
> >> @@ -463,8 +450,8 @@ static void __sigqueue_free(struct sigqueue *q)
> >> {
> >> if (q->flags & SIGQUEUE_PREALLOC)
> >> return;
> >> - if (q->ucounts && dec_rlimit_ucounts(q->ucounts, UCOUNT_RLIMIT_SIGPENDING, 1)) {
> >> - put_ucounts(q->ucounts);
> >> + if (q->ucounts) {
> >> + dec_rlimit_put_ucounts(q->ucounts, UCOUNT_RLIMIT_SIGPENDING);
> >> q->ucounts = NULL;
> >> }
> >> kmem_cache_free(sigqueue_cachep, q);
> >> diff --git a/kernel/ucount.c b/kernel/ucount.c
> >> index 3b7e176cf7a2..687d77aa66bb 100644
> >> --- a/kernel/ucount.c
> >> +++ b/kernel/ucount.c
> >> @@ -285,6 +285,47 @@ bool dec_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v)
> >> return (new == 0);
> >> }
> >>
> >> +static void do_dec_rlimit_put_ucounts(struct ucounts *ucounts,
> >> + struct ucounts *last, enum ucount_type type)
> >> +{
> >> + struct ucounts *iter;
> >> + for (iter = ucounts; iter != last; iter = iter->ns->ucounts) {
> >> + long dec = atomic_long_add_return(-1, &iter->ucount[type]);
> >> + WARN_ON_ONCE(dec < 0);
> >> + if (dec == 0)
> >> + put_ucounts(iter);
> >> + }
> >> +}
> >> +
> >> +void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum ucount_type type)
> >> +{
> >> + do_dec_rlimit_put_ucounts(ucounts, NULL, type);
> >> +}
> >> +
> >> +long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum ucount_type type)
> >> +{
> >> + struct ucounts *iter;
> >> + long dec, ret = 0;
> >> +
> >> + for (iter = ucounts; iter; iter = iter->ns->ucounts) {
> >> + long max = READ_ONCE(iter->ns->ucount_max[type]);
> >> + long new = atomic_long_add_return(1, &iter->ucount[type]);
> >> + if (new < 0 || new > max)
> >> + goto unwind;
> >> + else if (iter == ucounts)
> >> + ret = new;
> >> + if ((new == 1) && (get_ucounts(iter) != iter))
> >
> > get_ucounts can do put_ucounts. Are you sure it's correct to use
> > get_ucounts here?
>
> My only concern would be whether inc_rlimit_get_ucounts is safe to
> call under rcu_read_lock(). I don't see anything in get_ucounts or
> put_ucounts that would not be safe under rcu_read_lock().
>
> For get_ucounts we do need to test to see if it fails. Either by
> testing for NULL or testing to see if it does not return the expected
> ucount.
>
> Does that make sense or do you have another concern?
>
>
> >> + goto dec_unwind;
> >> + }
> >> + return ret;
> >> +dec_unwind:
> >> + dec = atomic_long_add_return(1, &iter->ucount[type]);
> >
> > Should be -1 ?
>
> Yes it should. I will fix and resend.

Or just atomic_long_dec_return().

> >> + WARN_ON_ONCE(dec < 0);
> >> +unwind:
> >> + do_dec_rlimit_put_ucounts(ucounts, iter, type);
> >> + return LONG_MAX;
> >> +}
> >> +
> >> bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, unsigned long max)
> >> {
> >> struct ucounts *iter;
> >> --
> >> 2.20.1
> >>
>
> Eric

2021-10-18 06:28:17

by Yu Zhao

[permalink] [raw]
Subject: Re: [CFT][PATCH] ucounts: Fix signal ucount refcounting

On Sun, Oct 17, 2021 at 10:47 AM Rune Kleveland
<[email protected]> wrote:
>
> Hi!
>
> After applying the below patch, the 5 most problematic servers have run
> without any issues for 23 hours. That never happened before the patch on
> 5.14, so the patch seems to have fixed the issue for me.

Confirm. I couldn't reproduce the problem on 5.14 either.

> On Monday there will be more load on the servers, which caused them to
> crash faster without the patch. I will let you know if it happens again.
>
> Best regards,
> Rune
>
> On 16/10/2021 00:10, Eric W. Biederman wrote:
> >
> > In commit fda31c50292a ("signal: avoid double atomic counter
> > increments for user accounting") Linus made a clever optimization to
> > how rlimits and the struct user_struct. Unfortunately that
> > optimization does not work in the obvious way when moved to nested
> > rlimits. The problem is that the last decrement of the per user
> > namespace per user sigpending counter might also be the last decrement
> > of the sigpending counter in the parent user namespace as well. Which
> > means that simply freeing the leaf ucount in __free_sigqueue is not
> > enough.
> >
> > Maintain the optimization and handle the tricky cases by introducing
> > inc_rlimit_get_ucounts and dec_rlimit_put_ucounts.
> >
> > By moving the entire optimization into functions that perform all of
> > the work it becomes possible to ensure that every level is handled
> > properly.
> >
> > I wish we had a single user across all of the threads whose rlimit
> > could be charged so we did not need this complexity.

2021-10-18 10:34:04

by Jordan Glover

[permalink] [raw]
Subject: Re: [CFT][PATCH] ucounts: Fix signal ucount refcounting

On Monday, October 18th, 2021 at 6:25 AM, Yu Zhao <[email protected]> wrote:

> On Sun, Oct 17, 2021 at 10:47 AM Rune Kleveland
> <[email protected]> wrote:
>
> > Hi!
> >
> > After applying the below patch, the 5 most problematic servers have run
> > without any issues for 23 hours. That never happened before the patch on
> > 5.14, so the patch seems to have fixed the issue for me.
>
> Confirm. I couldn't reproduce the problem on 5.14 either.
>

I'm also unable to reproduce the crash as for now. Thx for the patch.

Jordan

> > On Monday there will be more load on the servers, which caused them to
> > crash faster without the patch. I will let you know if it happens again.
> >
> > Best regards,
> > Rune
> >
> > On 16/10/2021 00:10, Eric W. Biederman wrote:
> > > In commit fda31c50292a ("signal: avoid double atomic counter
> > > increments for user accounting") Linus made a clever optimization to
> > > how rlimits and the struct user_struct. Unfortunately that
> > > optimization does not work in the obvious way when moved to nested
> > > rlimits. The problem is that the last decrement of the per user
> > > namespace per user sigpending counter might also be the last decrement
> > > of the sigpending counter in the parent user namespace as well. Which
> > > means that simply freeing the leaf ucount in __free_sigqueue is not
> > > enough.
> > >
> > > Maintain the optimization and handle the tricky cases by introducing
> > > inc_rlimit_get_ucounts and dec_rlimit_put_ucounts.
> > >
> > > By moving the entire optimization into functions that perform all of
> > > the work it becomes possible to ensure that every level is handled
> > > properly.
> > >
> > > I wish we had a single user across all of the threads whose rlimit
> > > could be charged so we did not need this complexity.

2021-10-18 15:43:39

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [CFT][PATCH] ucounts: Fix signal ucount refcounting

Yu Zhao <[email protected]> writes:

> On Sat, Oct 16, 2021 at 11:35 AM Eric W. Biederman
> <[email protected]> wrote:
>>
>> Alexey Gladkov <[email protected]> writes:
>>
>> > On Fri, Oct 15, 2021 at 05:10:58PM -0500, Eric W. Biederman wrote:
>> >> + goto dec_unwind;
>> >> + }
>> >> + return ret;
>> >> +dec_unwind:
>> >> + dec = atomic_long_add_return(1, &iter->ucount[type]);
>> >
>> > Should be -1 ?
>>
>> Yes it should. I will fix and resend.
>
> Or just atomic_long_dec_return().

It would have to be atomic_long_sub_return().

Even then I would want to change all of kernel/ucount.c to use
the same helper function so discrepancies can easily be spotted.

It is a good idea, just not, I think, for this particular patch.

Eric

2021-10-18 16:09:16

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH v2] ucounts: Fix signal ucount refcounting


In commit fda31c50292a ("signal: avoid double atomic counter
increments for user accounting") Linus made a clever optimization to
how rlimits and struct user_struct are handled. Unfortunately that
optimization does not work in the obvious way when moved to nested
rlimits. The problem is that the last decrement of the per user
namespace per user sigpending counter might also be the last decrement
of the sigpending counter in the parent user namespace as well. Which
means that simply freeing the leaf ucount in __free_sigqueue is not
enough.

Maintain the optimization and handle the tricky cases by introducing
inc_rlimit_get_ucounts and dec_rlimit_put_ucounts.

By moving the entire optimization into functions that perform all of
the work it becomes possible to ensure that every level is handled
properly.

The new function inc_rlimit_get_ucounts returns 0 on failure to
increment the ucount. This is different from inc_rlimit_ucounts, which
increments the ucounts and returns LONG_MAX if the ucount counter has
exceeded its maximum or has wrapped (to indicate the counter needs to
be decremented).

I wish we had a single user to account all pending signals to across
all of the threads of a process, so this complexity would not be necessary.

Cc: [email protected]
Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
v1: https://lkml.kernel.org/r/87mtnavszx.fsf_-_@disp2133
Tested-by: Rune Kleveland <[email protected]>
Tested-by: Yu Zhao <[email protected]>
Tested-by: Jordan Glover <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
---

This is my version of the fix with all of the feedback rolled in.
I have tested it and believe this is ready to send out.

If folks could give this a once-over and see if I have missed anything.

For the people who are testing or have tested this I have added your
Tested-by's; please let me know if you mind.

Eric


include/linux/user_namespace.h | 2 ++
kernel/signal.c | 25 +++++------------
kernel/ucount.c | 49 ++++++++++++++++++++++++++++++++++
3 files changed, 57 insertions(+), 19 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index eb70cabe6e7f..33a4240e6a6f 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -127,6 +127,8 @@ static inline long get_ucounts_value(struct ucounts *ucounts, enum ucount_type t

long inc_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v);
bool dec_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v);
+long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum ucount_type type);
+void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum ucount_type type);
bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, unsigned long max);

static inline void set_rlimit_ucount_max(struct user_namespace *ns,
diff --git a/kernel/signal.c b/kernel/signal.c
index a3229add4455..13d2505a14a0 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -425,22 +425,10 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags,
*/
rcu_read_lock();
ucounts = task_ucounts(t);
- sigpending = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
- switch (sigpending) {
- case 1:
- if (likely(get_ucounts(ucounts)))
- break;
- fallthrough;
- case LONG_MAX:
- /*
- * we need to decrease the ucount in the userns tree on any
- * failure to avoid counts leaking.
- */
- dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
- rcu_read_unlock();
- return NULL;
- }
+ sigpending = inc_rlimit_get_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING);
rcu_read_unlock();
+ if (!sigpending)
+ return NULL;

if (override_rlimit || likely(sigpending <= task_rlimit(t, RLIMIT_SIGPENDING))) {
q = kmem_cache_alloc(sigqueue_cachep, gfp_flags);
@@ -449,8 +437,7 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags,
}

if (unlikely(q == NULL)) {
- if (dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1))
- put_ucounts(ucounts);
+ dec_rlimit_put_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING);
} else {
INIT_LIST_HEAD(&q->list);
q->flags = sigqueue_flags;
@@ -463,8 +450,8 @@ static void __sigqueue_free(struct sigqueue *q)
{
if (q->flags & SIGQUEUE_PREALLOC)
return;
- if (q->ucounts && dec_rlimit_ucounts(q->ucounts, UCOUNT_RLIMIT_SIGPENDING, 1)) {
- put_ucounts(q->ucounts);
+ if (q->ucounts) {
+ dec_rlimit_put_ucounts(q->ucounts, UCOUNT_RLIMIT_SIGPENDING);
q->ucounts = NULL;
}
kmem_cache_free(sigqueue_cachep, q);
diff --git a/kernel/ucount.c b/kernel/ucount.c
index bb51849e6375..eb03f3c68375 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -284,6 +284,55 @@ bool dec_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v)
return (new == 0);
}

+static void do_dec_rlimit_put_ucounts(struct ucounts *ucounts,
+ struct ucounts *last, enum ucount_type type)
+{
+ struct ucounts *iter, *next;
+ for (iter = ucounts; iter != last; iter = next) {
+ long dec = atomic_long_add_return(-1, &iter->ucount[type]);
+ WARN_ON_ONCE(dec < 0);
+ next = iter->ns->ucounts;
+ if (dec == 0)
+ put_ucounts(iter);
+ }
+}
+
+void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum ucount_type type)
+{
+ do_dec_rlimit_put_ucounts(ucounts, NULL, type);
+}
+
+long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum ucount_type type)
+{
+ /* Caller must hold a reference to ucounts */
+ struct ucounts *iter;
+ long dec, ret = 0;
+
+ for (iter = ucounts; iter; iter = iter->ns->ucounts) {
+ long max = READ_ONCE(iter->ns->ucount_max[type]);
+ long new = atomic_long_add_return(1, &iter->ucount[type]);
+ if (new < 0 || new > max)
+ goto unwind;
+ if (iter == ucounts)
+ ret = new;
+ /*
+ * Grab an extra ucount reference for the caller when
+ * the rlimit count was previously 0.
+ */
+ if (new != 1)
+ continue;
+ if (!get_ucounts(iter))
+ goto dec_unwind;
+ }
+ return ret;
+dec_unwind:
+ dec = atomic_long_add_return(-1, &iter->ucount[type]);
+ WARN_ON_ONCE(dec < 0);
+unwind:
+ do_dec_rlimit_put_ucounts(ucounts, iter, type);
+ return 0;
+}
+
bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, unsigned long max)
{
struct ucounts *iter;
--
2.20.1

2021-10-18 17:22:40

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 0/3] ucounts: misc fixes


While digging into the previous ucount kernel crashes I found some minor
bugs in the ucount code. When hit, these bugs all result in a ucount
either being counted in the wrong location or in a leak of a struct ucounts.

Nothing particularly serious but certainly things that should be fixed.

Eric W. Biederman (3):
ucounts: Pair inc_rlimit_ucounts with dec_rlimit_ucounts in commit_creds
ucounts: Proper error handling in set_cred_ucounts
ucounts: Move get_ucounts from cred_alloc_blank to key_change_session_keyring

kernel/cred.c | 9 ++++-----
security/keys/process_keys.c | 8 ++++++++
2 files changed, 12 insertions(+), 5 deletions(-)

Eric

2021-10-18 17:24:28

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 1/3] ucounts: Pair inc_rlimit_ucounts with dec_rlimit_ucounts in commit_creds


The purpose of inc_rlimit_ucounts and dec_rlimit_ucounts in commit_creds
is to change which rlimit counter is used to track a process when the
credentials changes.

Use the same test for both to guarantee the tracking is correct.

Cc: [email protected]
Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/cred.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cred.c b/kernel/cred.c
index f784e08c2fbd..3d163bfd64a9 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -501,7 +501,7 @@ int commit_creds(struct cred *new)
inc_rlimit_ucounts(new->ucounts, UCOUNT_RLIMIT_NPROC, 1);
rcu_assign_pointer(task->real_cred, new);
rcu_assign_pointer(task->cred, new);
- if (new->user != old->user)
+ if (new->user != old->user || new->user_ns != old->user_ns)
dec_rlimit_ucounts(old->ucounts, UCOUNT_RLIMIT_NPROC, 1);
alter_cred_subscribers(old, -2);

--
2.20.1

2021-10-18 17:24:57

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 2/3] ucounts: Proper error handling in set_cred_ucounts


Instead of leaking the ucounts in new if alloc_ucounts fails, store
the result of alloc_ucounts into a temporary variable, which is later
assigned to new->ucounts.

Cc: [email protected]
Fixes: 905ae01c4ae2 ("Add a reference to ucounts for each cred")
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/cred.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/cred.c b/kernel/cred.c
index 3d163bfd64a9..16c05dfbec4d 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -669,7 +669,7 @@ int set_cred_ucounts(struct cred *new)
{
struct task_struct *task = current;
const struct cred *old = task->real_cred;
- struct ucounts *old_ucounts = new->ucounts;
+ struct ucounts *new_ucounts, *old_ucounts = new->ucounts;

if (new->user == old->user && new->user_ns == old->user_ns)
return 0;
@@ -681,9 +681,10 @@ int set_cred_ucounts(struct cred *new)
if (old_ucounts && old_ucounts->ns == new->user_ns && uid_eq(old_ucounts->uid, new->euid))
return 0;

- if (!(new->ucounts = alloc_ucounts(new->user_ns, new->euid)))
+ if (!(new_ucounts = alloc_ucounts(new->user_ns, new->euid)))
return -EAGAIN;

+ new->ucounts = new_ucounts;
if (old_ucounts)
put_ucounts(old_ucounts);

--
2.20.1

2021-10-18 17:26:37

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 3/3] ucounts: Move get_ucounts from cred_alloc_blank to key_change_session_keyring


Setting cred->ucounts in cred_alloc_blank does not make sense. The
uid and user_ns are deliberately not set in cred_alloc_blank but
instead the setting is delayed until key_change_session_keyring.

So move dealing with ucounts into key_change_session_keyring as well.

Unfortunately that movement of get_ucounts adds a new failure mode to
key_change_session_keyring. I do not see anything stopping the parent
process from calling setuid and changing the relevant part of its
cred while keyctl_session_to_parent is running, making it fundamentally
necessary to call get_ucounts in key_change_session_keyring. Which
means that the new failure mode cannot be avoided.

A failure of key_change_session_keyring results in a single threaded
parent keeping its existing credentials. Which results in the parent
process not being able to access the session keyring and whichever
keys are in the new keyring.

Further get_ucounts is only expected to fail if the number of bits in
the reference count for the structure is too small.

Since the code has no other way to report the failure of get_ucounts
and because such failures are not expected to be common add a WARN_ONCE
to report this problem to userspace.

Between the WARN_ONCE and the parent process not having access to
the keys in the new session keyring I expect any failure of get_ucounts
will be noticed and reported and we can find another way to handle this
condition. (Possibly by just making ucounts->count an atomic_long_t).

Cc: [email protected]
Fixes: 905ae01c4ae2 ("Add a reference to ucounts for each cred")
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/cred.c | 2 --
security/keys/process_keys.c | 8 ++++++++
2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/cred.c b/kernel/cred.c
index 16c05dfbec4d..1ae0b4948a5a 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -225,8 +225,6 @@ struct cred *cred_alloc_blank(void)
#ifdef CONFIG_DEBUG_CREDENTIALS
new->magic = CRED_MAGIC;
#endif
- new->ucounts = get_ucounts(&init_ucounts);
-
if (security_cred_alloc_blank(new, GFP_KERNEL_ACCOUNT) < 0)
goto error;

diff --git a/security/keys/process_keys.c b/security/keys/process_keys.c
index e3d79a7b6db6..20cc5a9a05fd 100644
--- a/security/keys/process_keys.c
+++ b/security/keys/process_keys.c
@@ -918,6 +918,13 @@ void key_change_session_keyring(struct callback_head *twork)
return;
}

+ /* If get_ucounts fails more bits are needed in the refcount */
+ if (unlikely(!get_ucounts(old->ucounts))) {
+ WARN_ONCE(1, "In %s get_ucounts failed\n", __func__);
+ put_cred(new);
+ return;
+ }
+
new-> uid = old-> uid;
new-> euid = old-> euid;
new-> suid = old-> suid;
@@ -927,6 +934,7 @@ void key_change_session_keyring(struct callback_head *twork)
new-> sgid = old-> sgid;
new->fsgid = old->fsgid;
new->user = get_uid(old->user);
+ new->ucounts = old->ucounts;
new->user_ns = get_user_ns(old->user_ns);
new->group_info = get_group_info(old->group_info);

--
2.20.1

2021-10-18 17:57:54

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 0/4] ucounts: misc cleanups


The following changes are a set of miscellaneous fixes that make
the ucount code a little bit easier to read. These are all things
that I ran into while helping hunt down the crazy reference count
bug.

I am aiming these at the next merge window and 5.16 rather than bug
fixes to get into the current 5.15.

Eric W. Biederman (4):
ucounts: In set_cred_ucounts assume new->ucounts is non-NULL
ucounts: Remove unnecessary test for NULL ucount in get_ucounts
ucounts: Add get_ucounts_or_wrap for clarity
ucounts: Use atomic_long_sub_return for clarity

kernel/cred.c | 5 ++---
kernel/ucount.c | 20 +++++++++++++-------
2 files changed, 15 insertions(+), 10 deletions(-)

Eric

2021-10-18 17:58:45

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 3/4] ucounts: Add get_ucounts_or_wrap for clarity



Add a helper function get_ucounts_or_wrap that is a trivial
wrapper around atomic_add_negative, making it clear
how atomic_add_negative is used in the context of ucounts.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/ucount.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/kernel/ucount.c b/kernel/ucount.c
index 708d05164a7d..133b6044fda4 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -150,9 +150,15 @@ static void hlist_add_ucounts(struct ucounts *ucounts)
spin_unlock_irq(&ucounts_lock);
}

+static inline bool get_ucounts_or_wrap(struct ucounts *ucounts)
+{
+ /* Returns true on a successful get, false if the count wraps. */
+ return !atomic_add_negative(1, &ucounts->count);
+}
+
struct ucounts *get_ucounts(struct ucounts *ucounts)
{
- if (atomic_add_negative(1, &ucounts->count)) {
+ if (!get_ucounts_or_wrap(ucounts)) {
put_ucounts(ucounts);
ucounts = NULL;
}
@@ -163,7 +169,7 @@ struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid)
{
struct hlist_head *hashent = ucounts_hashentry(ns, uid);
struct ucounts *ucounts, *new;
- long overflow;
+ bool wrapped;

spin_lock_irq(&ucounts_lock);
ucounts = find_ucounts(ns, uid, hashent);
@@ -188,9 +194,9 @@ struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid)
return new;
}
}
- overflow = atomic_add_negative(1, &ucounts->count);
+ wrapped = !get_ucounts_or_wrap(ucounts);
spin_unlock_irq(&ucounts_lock);
- if (overflow) {
+ if (wrapped) {
put_ucounts(ucounts);
return NULL;
}
--
2.20.1

2021-10-18 17:59:30

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 4/4] ucounts: Use atomic_long_sub_return for clarity


Decrement ucounts using atomic_long_sub_return to make it
clear the point is for the ucount to decrease.

Not a big deal but it should make it easier to catch bugs.

Suggested-by: Yu Zhao <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/ucount.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/ucount.c b/kernel/ucount.c
index 133b6044fda4..4f5613dac227 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -282,7 +282,7 @@ bool dec_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v)
struct ucounts *iter;
long new = -1; /* Silence compiler warning */
for (iter = ucounts; iter; iter = iter->ns->ucounts) {
- long dec = atomic_long_add_return(-v, &iter->ucount[type]);
+ long dec = atomic_long_sub_return(v, &iter->ucount[type]);
WARN_ON_ONCE(dec < 0);
if (iter == ucounts)
new = dec;
@@ -295,7 +295,7 @@ static void do_dec_rlimit_put_ucounts(struct ucounts *ucounts,
{
struct ucounts *iter, *next;
for (iter = ucounts; iter != last; iter = next) {
- long dec = atomic_long_add_return(-1, &iter->ucount[type]);
+ long dec = atomic_long_sub_return(1, &iter->ucount[type]);
WARN_ON_ONCE(dec < 0);
next = iter->ns->ucounts;
if (dec == 0)
@@ -332,7 +332,7 @@ long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum ucount_type type)
}
return ret;
dec_unwind:
- dec = atomic_long_add_return(-1, &iter->ucount[type]);
+ dec = atomic_long_sub_return(1, &iter->ucount[type]);
WARN_ON_ONCE(dec < 0);
unwind:
do_dec_rlimit_put_ucounts(ucounts, iter, type);
--
2.20.1

2021-10-18 22:27:43

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v2] ucounts: Fix signal ucount refcounting

On Mon, Oct 18, 2021 at 10:06 AM Eric W. Biederman
<[email protected]> wrote:
>
>
> In commit fda31c50292a ("signal: avoid double atomic counter
> increments for user accounting") Linus made a clever optimization to
> how rlimits and the struct user_struct. Unfortunately that
> optimization does not work in the obvious way when moved to nested
> rlimits. The problem is that the last decrement of the per user
> namespace per user sigpending counter might also be the last decrement
> of the sigpending counter in the parent user namespace as well. Which
> means that simply freeing the leaf ucount in __free_sigqueue is not
> enough.
>
> Maintain the optimization and handle the tricky cases by introducing
> inc_rlimit_get_ucounts and dec_rlimit_put_ucounts.
>
> By moving the entire optimization into functions that perform all of
> the work it becomes possible to ensure that every level is handled
> properly.
>
> The new function inc_rlimit_get_ucounts returns 0 on failure to
> increment the ucount. This is different than inc_rlimit_ucounts which
> increments the ucounts and returns LONG_MAX if the ucount counter has
> exceeded it's maximum or it wrapped (to indicate the counter needs to
> decremented).
>
> I wish we had a single user to account all pending signals to across
> all of the threads of a process so this complexity was not necessary
>
> Cc: [email protected]
> Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
> v1: https://lkml.kernel.org/r/87mtnavszx.fsf_-_@disp2133
> Tested-by: Rune Kleveland <[email protected]>
> Tested-by: Yu Zhao <[email protected]>
> Tested-by: Jordan Glover <[email protected]>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
>
> This is my version of the fix with all of the feedback rolled in.
> I have tested it and believe this is ready to send out.
>
> If folks code take a once over and see if I have spotted things.
>
> For the people who are testing or have tested this I have added your
> tested-by's please let me know if you mind.
>
> Eric

Retested on the latest 5.15-rc6. This patch fixes the following crash:

[ 3307.621443] ==================================================================
[ 3307.628890] BUG: KASAN: use-after-free in dec_ucount+0x50/0xd8
[ 3307.634903] Write of size 8 at addr ffffff80a5e4c520 by task kworker/7:3/201
[ 3307.642149]
[ 3307.643695] CPU: 7 PID: 201 Comm: kworker/7:3 Not tainted 5.15.0-rc6-lockdep+ #7
[ 3307.651291] Hardware name: Google Lazor (rev3+) with KB Backlight (DT)
[ 3307.658001] Workqueue: events free_user_ns
[ 3307.662231] Call trace:
[ 3307.664750] dump_backtrace+0x0/0x42c
[ 3307.668522] show_stack+0x24/0x30
[ 3307.671935] dump_stack_lvl+0xd0/0x100
[ 3307.675797] print_address_description+0x30/0x304
[ 3307.680646] kasan_report+0x190/0x1d8
[ 3307.684419] kasan_check_range+0x1ac/0x1bc
[ 3307.688630] __kasan_check_write+0x44/0x54
[ 3307.692852] dec_ucount+0x50/0xd8
[ 3307.696266] free_user_ns+0x1b0/0x288
[ 3307.700036] process_one_work+0x7b4/0x1130
[ 3307.704251] worker_thread+0x800/0xcf4
[ 3307.708111] kthread+0x2a8/0x358
[ 3307.711437] ret_from_fork+0x10/0x20
[ 3307.715121]
[ 3307.716664] Allocated by task 6564:
[ 3307.720253] kasan_save_stack+0x38/0x68
[ 3307.724206] __kasan_kmalloc+0x9c/0xb8
[ 3307.728068] kmem_cache_alloc_trace+0x260/0x32c
[ 3307.732729] alloc_ucounts+0x150/0x374
[ 3307.736589] set_cred_ucounts+0x178/0x240
[ 3307.740714] __sys_setresuid+0x31c/0x4f8
[ 3307.744754] __arm64_sys_setresuid+0x84/0x98
[ 3307.749153] invoke_syscall+0xcc/0x2bc
[ 3307.753012] el0_svc_common+0x118/0x1ec
[ 3307.756961] do_el0_svc_compat+0x50/0x60
[ 3307.761005] el0_svc_compat+0x5c/0xec
[ 3307.764774] el0t_32_sync_handler+0xc0/0xf0
[ 3307.769083] el0t_32_sync+0x1a4/0x1a8
[ 3307.772852]
[ 3307.774399] The buggy address belongs to the object at ffffff80a5e4c500
[ 3307.774399] which belongs to the cache kmalloc-256 of size 256
[ 3307.787250] The buggy address is located 32 bytes inside of
[ 3307.787250] 256-byte region [ffffff80a5e4c500, ffffff80a5e4c600)
[ 3307.799216] The buggy address belongs to the page:
[ 3307.804141] page:fffffffe02979200 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffffff80a5e4c100 pfn:0x125e48
[ 3307.815127] head:fffffffe02979200 order:3 compound_mapcount:0 compound_pincount:0
[ 3307.822808] flags: 0x8000000000010200(slab|head|zone=2)
[ 3307.828187] raw: 8000000000010200 fffffffe0250ba08 fffffffe00f04a08 ffffff808000c980
[ 3307.836148] raw: ffffff80a5e4c100 0000000000200001 00000001ffffffff 0000000000000000
[ 3307.844104] page dumped because: kasan: bad access detected
[ 3307.849837]
[ 3307.851381] Memory state around the buggy address:
[ 3307.856307] ffffff80a5e4c400: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 3307.863729] ffffff80a5e4c480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 3307.871146] >ffffff80a5e4c500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 3307.878562]                    ^
[ 3307.882956] ffffff80a5e4c580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 3307.890375] ffffff80a5e4c600: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 3307.897800] ==================================================================

> include/linux/user_namespace.h | 2 ++
> kernel/signal.c | 25 +++++------------
> kernel/ucount.c | 49 ++++++++++++++++++++++++++++++++++
> 3 files changed, 57 insertions(+), 19 deletions(-)
>
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index eb70cabe6e7f..33a4240e6a6f 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -127,6 +127,8 @@ static inline long get_ucounts_value(struct ucounts *ucounts, enum ucount_type t
>
> long inc_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v);
> bool dec_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v);
> +long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum ucount_type type);
> +void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum ucount_type type);
> bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, unsigned long max);
>
> static inline void set_rlimit_ucount_max(struct user_namespace *ns,
> diff --git a/kernel/signal.c b/kernel/signal.c
> index a3229add4455..13d2505a14a0 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -425,22 +425,10 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags,
> */
> rcu_read_lock();
> ucounts = task_ucounts(t);
> - sigpending = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
> - switch (sigpending) {
> - case 1:
> - if (likely(get_ucounts(ucounts)))
> - break;
> - fallthrough;
> - case LONG_MAX:
> - /*
> - * we need to decrease the ucount in the userns tree on any
> - * failure to avoid counts leaking.
> - */
> - dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
> - rcu_read_unlock();
> - return NULL;
> - }
> + sigpending = inc_rlimit_get_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING);
> rcu_read_unlock();
> + if (!sigpending)
> + return NULL;
>
> if (override_rlimit || likely(sigpending <= task_rlimit(t, RLIMIT_SIGPENDING))) {
> q = kmem_cache_alloc(sigqueue_cachep, gfp_flags);
> @@ -449,8 +437,7 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags,
> }
>
> if (unlikely(q == NULL)) {
> - if (dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1))
> - put_ucounts(ucounts);
> + dec_rlimit_put_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING);
> } else {
> INIT_LIST_HEAD(&q->list);
> q->flags = sigqueue_flags;
> @@ -463,8 +450,8 @@ static void __sigqueue_free(struct sigqueue *q)
> {
> if (q->flags & SIGQUEUE_PREALLOC)
> return;
> - if (q->ucounts && dec_rlimit_ucounts(q->ucounts, UCOUNT_RLIMIT_SIGPENDING, 1)) {
> - put_ucounts(q->ucounts);
> + if (q->ucounts) {
> + dec_rlimit_put_ucounts(q->ucounts, UCOUNT_RLIMIT_SIGPENDING);
> q->ucounts = NULL;
> }
> kmem_cache_free(sigqueue_cachep, q);
> diff --git a/kernel/ucount.c b/kernel/ucount.c
> index bb51849e6375..eb03f3c68375 100644
> --- a/kernel/ucount.c
> +++ b/kernel/ucount.c
> @@ -284,6 +284,55 @@ bool dec_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v)
> return (new == 0);
> }
>
> +static void do_dec_rlimit_put_ucounts(struct ucounts *ucounts,
> + struct ucounts *last, enum ucount_type type)
> +{
> + struct ucounts *iter, *next;
> + for (iter = ucounts; iter != last; iter = next) {
> + long dec = atomic_long_add_return(-1, &iter->ucount[type]);
> + WARN_ON_ONCE(dec < 0);
> + next = iter->ns->ucounts;
> + if (dec == 0)
> + put_ucounts(iter);
> + }
> +}
> +
> +void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum ucount_type type)
> +{
> + do_dec_rlimit_put_ucounts(ucounts, NULL, type);
> +}
> +
> +long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum ucount_type type)
> +{
> + /* Caller must hold a reference to ucounts */
> + struct ucounts *iter;
> + long dec, ret = 0;
> +
> + for (iter = ucounts; iter; iter = iter->ns->ucounts) {
> + long max = READ_ONCE(iter->ns->ucount_max[type]);
> + long new = atomic_long_add_return(1, &iter->ucount[type]);
> + if (new < 0 || new > max)
> + goto unwind;
> + if (iter == ucounts)
> + ret = new;
> + /*
> + * Grab an extra ucount reference for the caller when
> + * the rlimit count was previously 0.
> + */
> + if (new != 1)
> + continue;
> + if (!get_ucounts(iter))
> + goto dec_unwind;
> + }
> + return ret;
> +dec_unwind:
> + dec = atomic_long_add_return(-1, &iter->ucount[type]);
> + WARN_ON_ONCE(dec < 0);
> +unwind:
> + do_dec_rlimit_put_ucounts(ucounts, iter, type);
> + return 0;
> +}
> +
> bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, unsigned long max)
> {
> struct ucounts *iter;
> --
> 2.20.1
>
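To illustrate the nesting problem this patch addresses, here is a hedged, userspace-only sketch. All names (`toy_ucounts`, `toy_inc_get`, `toy_dec_put`) are hypothetical; the real kernel code uses atomics (`atomic_long_add_return`) and `get_ucounts`/`put_ucounts`, which are deliberately simplified away here. The point being modeled: the first charge at a level pins that level with a reference, the last uncharge drops it, and a failed charge must unwind every level already charged.

```c
#include <stddef.h>

/* Toy model of nested per-namespace rlimit accounting.
 * One counter per user namespace level, plus a reference
 * count that keeps the level alive while its counter is
 * non-zero.  Purely illustrative; not kernel code. */
struct toy_ucounts {
	struct toy_ucounts *parent; /* NULL at the initial namespace */
	long count;                 /* models ucount[type] */
	long max;                   /* models ns->ucount_max[type] */
	long refs;                  /* models the ucounts refcount */
};

/* Uncharge from uc up to (but not including) last, dropping the
 * "counter is non-zero" reference at each level whose counter
 * returns to zero -- the tricky case: the last decrement in a
 * child may also be the last decrement in its parent. */
static void toy_dec_put(struct toy_ucounts *uc, struct toy_ucounts *last)
{
	struct toy_ucounts *iter, *next;
	for (iter = uc; iter != last; iter = next) {
		next = iter->parent;
		if (--iter->count == 0)
			iter->refs--;
	}
}

/* Charge one unit at every level; on failure at any level,
 * unwind everything charged so far and return 0. */
static long toy_inc_get(struct toy_ucounts *uc)
{
	struct toy_ucounts *iter;
	long ret = 0;

	for (iter = uc; iter; iter = iter->parent) {
		long new = ++iter->count;
		if (new > iter->max)
			goto dec_unwind;
		if (iter == uc)
			ret = new;
		if (new == 1)   /* first charge pins this level */
			iter->refs++;
	}
	return ret;
dec_unwind:
	iter->count--;               /* undo the failing increment */
	toy_dec_put(uc, iter);       /* unwind the levels below it */
	return 0;
}
```

With a parent limit of 1 and a child limit of 10, the first charge succeeds at both levels, a second charge fails at the parent and is fully unwound, and the final uncharge releases the pinning reference at both levels in one pass.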

2021-10-18 22:31:19

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH 0/3] ucounts: misc fixes

On Mon, Oct 18, 2021 at 11:21 AM Eric W. Biederman
<[email protected]> wrote:
>
>
> While digging into the previous ucount kernel crashes I found some minor
> bugs in the ucount code. When hit these bugs all result in a ucount
> either being counted in the wrong location or leak of a struct ucounts.
>
> Nothing particularly serious but certainly things that should be fixed.
>
> Eric W. Biederman (3):
> ucounts: Pair inc_rlimit_ucounts with dec_rlimit_ucounts in commit_creds
> ucounts: Proper error handling in set_cred_ucounts
> ucounts: Move get_ucounts from cred_alloc_blank to key_change_session_keyring
>
> kernel/cred.c | 9 ++++-----
> security/keys/process_keys.c | 8 ++++++++
> 2 files changed, 12 insertions(+), 5 deletions(-)

Thanks for the fixes. Tested the whole series on the latest 5.15-rc6.

Tested-by: Yu Zhao <[email protected]>

2021-10-18 22:32:11

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH 0/4] ucounts: misc cleanups

On Mon, Oct 18, 2021 at 11:55 AM Eric W. Biederman
<[email protected]> wrote:
>
>
> The following changes are a set of miscellaneous fixes that make
> the ucount code a little bit easier to read. These are all things
> that I ran into while helping hunt down the crazy reference count
> bug.
>
> I am aiming these at the next merge window and 5.16 rather than bug
> fixes to get into the current 5.15.
>
> Eric W. Biederman (4):
> ucounts: In set_cred_ucounts assume new->ucounts is non-NULL
> ucounts: Remove unnecessary test for NULL ucount in get_ucounts
> ucounts: Add get_ucounts_or_wrap for clarity
> ucounts: Use atomic_long_sub_return for clarity
>
> kernel/cred.c | 5 ++---
> kernel/ucount.c | 20 +++++++++++++-------
> 2 files changed, 15 insertions(+), 10 deletions(-)

Thanks for the cleanups. Tested the whole series on the latest 5.15-rc6.

Tested-by: Yu Zhao <[email protected]>