2021-06-29 20:37:52

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] ucounts: Count rlimits in each user namespace

On Tue, Jun 29, 2021 at 1:20 PM Alexey Gladkov <[email protected]> wrote:
>
> Waaaait. task_ucounts() is a different thing. This function only gets a
> field from the task structure without any reference counting. But the
> get_ucounts() is more like get_user_ns() or get_uid(), but does not ignore
> counter overflow.

Alexey, that code cannot be right.

Look here:

rcu_read_lock();
ucounts = task_ucounts(t);
sigpending = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
if (sigpending == 1)
ucounts = get_ucounts(ucounts);
rcu_read_unlock();

so now we've done that inc_rlimit_ucounts() unconditionally on that
task_ucounts() thing.

And then if the allocation fails (or the limit is hit) the code does

if (ucounts && dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1))
put_ucounts(ucounts);

ie now it does the dec_rlimit_ucounts _conditionally_.

See what I'm complaining about? This is not logical, AND IT CANNOT
POSSIBLY BE RIGHT.

My argument is that

(a) the dec_rlimit_ucounts() has to pair up with
inc_rlimit_ucounts(), or you're leaking counts

(b) get_ucounts() has to pair up with put_ucounts().

Note that (a) has to be REGARDLESS of whether get_ucounts() was
successful or not.

> Earlier I tried to use refcount_t which never returns errors [1]. We
> talked and you said that ignoring counter overflow errors is bad
> design for this case.

You can't ignore counter overflow errors, no. But that's exactly what
that code is doing.

If get_ucount() fails due to overflow, you don't return an error. You
just miscount the end result!

So yeah, its' "testing" the overflow condition, but that's not an
argument, when it then DOES EXPLICITLY THE WRONG THING.

At that point, the test is actively harmful and wrong. See?

Linus


2021-06-29 21:36:10

by Alexey Gladkov

[permalink] [raw]
Subject: Re: [GIT PULL] ucounts: Count rlimits in each user namespace

On Tue, Jun 29, 2021 at 01:33:39PM -0700, Linus Torvalds wrote:
> On Tue, Jun 29, 2021 at 1:20 PM Alexey Gladkov <[email protected]> wrote:
> >
> > Waaaait. task_ucounts() is a different thing. This function only gets a
> > field from the task structure without any reference counting. But the
> > get_ucounts() is more like get_user_ns() or get_uid(), but does not ignore
> > counter overflow.
>
> Alexey, that code cannot be right.
>
> Look here:
>
> rcu_read_lock();
> ucounts = task_ucounts(t);
> sigpending = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
> if (sigpending == 1)
> ucounts = get_ucounts(ucounts);
> rcu_read_unlock();
>
> so now we've done that inc_rlimit_ucounts() unconditionally on that
> task_ucounts() thing.
>
> And then if the allocation fails (or the limit is hit) the code does
>
> if (ucounts && dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1))
> put_ucounts(ucounts);
>
> ie now it does the dec_rlimit_ucounts _conditionally_.
>
> See what I'm complaining about? This is not logical, AND IT CANNOT
> POSSIBLY BE RIGHT.
>
> My argument is that
>
> (a) the dec_rlimit_ucounts() has to pair up with
> inc_rlimit_ucounts(), or you're leaking counts
>
> (b) get_ucounts() has to pair up with put_ucounts().
>
> Note that (a) has to be REGARDLESS of whether get_ucounts() was
> successful or not.
>
> > Earlier I tried to use refcount_t which never returns errors [1]. We
> > talked and you said that ignoring counter overflow errors is bad
> > design for this case.
>
> You can't ignore counter overflow errors, no. But that's exactly what
> that code is doing.
>
> If get_ucount() fails due to overflow, you don't return an error. You
> just miscount the end result!
>
> So yeah, its' "testing" the overflow condition, but that's not an
> argument, when it then DOES EXPLICITLY THE WRONG THING.
>
> At that point, the test is actively harmful and wrong. See?

Yes. Please, give me some time to fix it.

--
Rgrds, legion

2021-07-02 17:58:29

by Alexey Gladkov

[permalink] [raw]
Subject: [PATCH] ucounts: Fix UCOUNT_RLIMIT_SIGPENDING counter leak

If get_ucounts() could not increase the reference counter, then in case
of an error it will be impossible to decrease the counter.

In case of an error, the ucounts variable cannot be set to NULL. Also
dec_rlimit_ucounts() has to be REGARDLESS of whether get_ucounts() was
successful or not.

Signed-off-by: Alexey Gladkov <[email protected]>
---
kernel/signal.c | 25 ++++++++++++++++++-------
1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index de0920353d30..87a64b3307a8 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -412,7 +412,7 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags,
int override_rlimit, const unsigned int sigqueue_flags)
{
struct sigqueue *q = NULL;
- struct ucounts *ucounts = NULL;
+ struct ucounts *ucounts, *ucounts_new;
long sigpending;

/*
@@ -424,10 +424,10 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags,
* changes from/to zero.
*/
rcu_read_lock();
- ucounts = task_ucounts(t);
+ ucounts = ucounts_new = task_ucounts(t);
sigpending = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
if (sigpending == 1)
- ucounts = get_ucounts(ucounts);
+ ucounts_new = get_ucounts(ucounts);
rcu_read_unlock();

if (override_rlimit || (sigpending < LONG_MAX && sigpending <= task_rlimit(t, RLIMIT_SIGPENDING))) {
@@ -437,13 +437,24 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags,
}

if (unlikely(q == NULL)) {
- if (ucounts && dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1))
- put_ucounts(ucounts);
+ ucounts_new = NULL;
} else {
INIT_LIST_HEAD(&q->list);
q->flags = sigqueue_flags;
- q->ucounts = ucounts;
+ q->ucounts = ucounts_new;
}
+
+ /*
+ * In case it failed to allocate sigqueue or ucounts reference counter
+ * overflow, we decrement UCOUNT_RLIMIT_SIGPENDING to avoid counter
+ * leaks.
+ */
+ if (unlikely(ucounts_new == NULL)) {
+ dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
+ if (sigpending == 1)
+ put_ucounts(ucounts);
+ }
+
return q;
}

@@ -451,7 +462,7 @@ static void __sigqueue_free(struct sigqueue *q)
{
if (q->flags & SIGQUEUE_PREALLOC)
return;
- if (q->ucounts && dec_rlimit_ucounts(q->ucounts, UCOUNT_RLIMIT_SIGPENDING, 1)) {
+ if (dec_rlimit_ucounts(q->ucounts, UCOUNT_RLIMIT_SIGPENDING, 1)) {
put_ucounts(q->ucounts);
q->ucounts = NULL;
}
--
2.29.3

2021-07-02 22:14:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] ucounts: Fix UCOUNT_RLIMIT_SIGPENDING counter leak

On Fri, Jul 2, 2021 at 10:55 AM Alexey Gladkov <[email protected]> wrote:
>
> @@ -424,10 +424,10 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags,
> * changes from/to zero.
> */
> rcu_read_lock();
> - ucounts = task_ucounts(t);
> + ucounts = ucounts_new = task_ucounts(t);
> sigpending = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
> if (sigpending == 1)
> - ucounts = get_ucounts(ucounts);
> + ucounts_new = get_ucounts(ucounts);
> rcu_read_unlock();

I think this is still problematic.

If get_ucounts() fails, we can't just drop the RCU lock and (later)
use "ucounts" that we hold no reference to.

Or am I missing something? I'm not entirely sure about the lifetime of
that RCU protection, and I do note that "task_ucounts()" uses
"task_cred_xxx()", which already does
rcu_read_lock()/rcu_read_unlock() in the actual access.

So I'm thinking the code could/should be written something like this instead:

diff --git a/kernel/signal.c b/kernel/signal.c
index f6371dfa1f89..40781b197227 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -422,22 +422,33 @@ __sigqueue_alloc(int sig, struct task_struct
*t, gfp_t gfp_flags,
* NOTE! A pending signal will hold on to the user refcount,
* and we get/put the refcount only when the sigpending count
* changes from/to zero.
+ *
+ * And if the ucount rlimit overflowed, we do not get to use it at all.
*/
rcu_read_lock();
ucounts = task_ucounts(t);
sigpending = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
- if (sigpending == 1)
- ucounts = get_ucounts(ucounts);
+ switch (sigpending) {
+ case 1:
+ if (likely(get_ucounts(ucounts)))
+ break;
+
+ dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
+ fallthrough;
+ case LONG_MAX:
+ rcu_read_unlock();
+ return NULL;
+ }
rcu_read_unlock();

- if (override_rlimit || (sigpending < LONG_MAX && sigpending <=
task_rlimit(t, RLIMIT_SIGPENDING))) {
+ if (override_rlimit || sigpending <= task_rlimit(t,
RLIMIT_SIGPENDING)) {
q = kmem_cache_alloc(sigqueue_cachep, gfp_flags);
} else {
print_dropped_signal(sig);
}

if (unlikely(q == NULL)) {
- if (ucounts && dec_rlimit_ucounts(ucounts,
UCOUNT_RLIMIT_SIGPENDING, 1))
+ if (dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1))
put_ucounts(ucounts);
} else {
INIT_LIST_HEAD(&q->list);

(and no, I'm not sure it's a good idea to make that use a "switch()" -
maybe the LONG_MAX case should be a "if (unlikely())" thing after the
rcu_read_ulock() instead?

Hmm?

The alternate thing is to say "No, Linus, you're a nincompoop and
wrong, that RCU protection is a non-issue because we hold a reference
to the task, and task_ucounts() will not change, so the RCU read lock
doesn't do anything".

Although then I would think the rcu_read_lock/rcu_read_unlock here is
entirely pointless.

Linus

2021-07-07 17:40:51

by Alexey Gladkov

[permalink] [raw]
Subject: Re: [PATCH] ucounts: Fix UCOUNT_RLIMIT_SIGPENDING counter leak

On Fri, Jul 02, 2021 at 03:13:18PM -0700, Linus Torvalds wrote:
> On Fri, Jul 2, 2021 at 10:55 AM Alexey Gladkov <[email protected]> wrote:
> >
> > @@ -424,10 +424,10 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags,
> > * changes from/to zero.
> > */
> > rcu_read_lock();
> > - ucounts = task_ucounts(t);
> > + ucounts = ucounts_new = task_ucounts(t);
> > sigpending = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
> > if (sigpending == 1)
> > - ucounts = get_ucounts(ucounts);
> > + ucounts_new = get_ucounts(ucounts);
> > rcu_read_unlock();
>
> I think this is still problematic.
>
> If get_ucounts() fails, we can't just drop the RCU lock and (later)
> use "ucounts" that we hold no reference to.
>
> Or am I missing something? I'm not entirely sure about the lifetime of
> that RCU protection, and I do note that "task_ucounts()" uses
> "task_cred_xxx()", which already does
> rcu_read_lock()/rcu_read_unlock() in the actual access.
>
> So I'm thinking the code could/should be written something like this instead:
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index f6371dfa1f89..40781b197227 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -422,22 +422,33 @@ __sigqueue_alloc(int sig, struct task_struct
> *t, gfp_t gfp_flags,
> * NOTE! A pending signal will hold on to the user refcount,
> * and we get/put the refcount only when the sigpending count
> * changes from/to zero.
> + *
> + * And if the ucount rlimit overflowed, we do not get to use it at all.
> */
> rcu_read_lock();
> ucounts = task_ucounts(t);
> sigpending = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
> - if (sigpending == 1)
> - ucounts = get_ucounts(ucounts);
> + switch (sigpending) {
> + case 1:
> + if (likely(get_ucounts(ucounts)))
> + break;
> +
> + dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
> + fallthrough;
> + case LONG_MAX:

I think that the counter should be decreased in this case too.
inc_rlimit_ucounts() increments the counter in all parent userns. If we
don't decrease the counter then the parent userns will have a counter
leak.

> + rcu_read_unlock();
> + return NULL;
> + }
> rcu_read_unlock();
>
> - if (override_rlimit || (sigpending < LONG_MAX && sigpending <=
> task_rlimit(t, RLIMIT_SIGPENDING))) {
> + if (override_rlimit || sigpending <= task_rlimit(t,
> RLIMIT_SIGPENDING)) {
> q = kmem_cache_alloc(sigqueue_cachep, gfp_flags);
> } else {
> print_dropped_signal(sig);
> }
>
> if (unlikely(q == NULL)) {
> - if (ucounts && dec_rlimit_ucounts(ucounts,
> UCOUNT_RLIMIT_SIGPENDING, 1))
> + if (dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1))
> put_ucounts(ucounts);
> } else {
> INIT_LIST_HEAD(&q->list);
>
> (and no, I'm not sure it's a good idea to make that use a "switch()" -
> maybe the LONG_MAX case should be a "if (unlikely())" thing after the
> rcu_read_ulock() instead?
>
> Hmm?

You are absolutely right. This fixes a BUG I just got during internal
testing.

> The alternate thing is to say "No, Linus, you're a nincompoop and
> wrong, that RCU protection is a non-issue because we hold a reference
> to the task, and task_ucounts() will not change, so the RCU read lock
> doesn't do anything".
>
> Although then I would think the rcu_read_lock/rcu_read_unlock here is
> entirely pointless.
>
> Linus
>

[ 84.670919] ==================================================================
[ 84.673326] BUG: KASAN: use-after-free in put_ucounts+0x17/0xa0
[ 84.674983] Write of size 4 at addr ffff8880045f031c by task swapper/2/0
[ 84.676702]
[ 84.677119] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.13.0+ #19
[ 84.678920] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-alt4 04/01/2014
[ 84.683128] Call Trace:
[ 84.683950] <IRQ>
[ 84.684635] dump_stack+0x8a/0xb5
[ 84.685751] print_address_description.constprop.0+0x18/0x130
[ 84.687633] ? put_ucounts+0x17/0xa0
[ 84.688831] ? put_ucounts+0x17/0xa0
[ 84.690050] kasan_report.cold+0x7f/0x111
[ 84.691413] ? put_ucounts+0x17/0xa0
[ 84.692618] kasan_check_range+0xf9/0x1e0
[ 84.694039] put_ucounts+0x17/0xa0
[ 84.695237] put_cred_rcu+0xd5/0x190
[ 84.696998] rcu_core+0x3bf/0xcb0
[ 84.699692] ? note_gp_changes+0x90/0x90
[ 84.700938] ? recalibrate_cpu_khz+0x10/0x10
[ 84.702575] ? lapic_timer_set_periodic+0x30/0x30
[ 84.704152] ? clockevents_program_event+0xd3/0x130
[ 84.705754] ? hrtimer_interrupt+0x418/0x440
[ 84.707198] __do_softirq+0xe3/0x341
[ 84.708412] irq_exit_rcu+0xbe/0xe0
[ 84.709572] sysvec_apic_timer_interrupt+0x6a/0x90
[ 84.711235] </IRQ>
[ 84.711972] asm_sysvec_apic_timer_interrupt+0x12/0x20
[ 84.713661] RIP: 0010:default_idle+0xb/0x10
[ 84.715070] Code: ff f0 80 63 02 df 5b 41 5c c3 0f ae f0 0f ae 3b 0f ae f0 eb 90 66 2e 0f 1f 84 00 00 00 00 00 eb 07 0f 00 2d 37 7d 5a 5
[ 84.721579] RSP: 0018:ffffc900000dfe80 EFLAGS: 00000202
[ 84.723330] RAX: ffffffffad279280 RBX: ffff8880013f3e00 RCX: ffffffffad269c56
[ 84.725724] RDX: 000000000001a8c6 RSI: 0000000000000004 RDI: ffff8880361322a0
[ 84.728085] RBP: 0000000000000002 R08: 0000000000000001 R09: ffff8880361322a3
[ 84.730485] R10: ffffed1006c26454 R11: 0000000000000001 R12: 0000000000000002
[ 84.732814] R13: 0000000000000000 R14: 0000000000000000 R15: 1ffff9200001bfd6
[ 84.736325] ? mwait_idle+0xc0/0xc0
[ 84.737488] ? rcu_eqs_enter.constprop.0+0x86/0xa0
[ 84.739085] default_idle_call+0x53/0x130
[ 84.740410] do_idle+0x311/0x3c0
[ 84.741512] ? _raw_spin_lock_irqsave+0x7b/0xd0
[ 84.743060] ? arch_cpu_idle_exit+0x30/0x30
[ 84.744442] ? swake_up_locked+0x6d/0x80
[ 84.745673] cpu_startup_entry+0x14/0x20
[ 84.746657] secondary_startup_64_no_verify+0xc2/0xcb
[ 84.747746]
[ 84.748073] Allocated by task 127:
[ 84.748742] kasan_save_stack+0x1b/0x40
[ 84.750483] __kasan_kmalloc+0x7c/0x90
[ 84.751359] alloc_ucounts+0x169/0x2b0
[ 84.752270] set_cred_ucounts+0xbb/0x170
[ 84.753266] ksys_unshare+0x24c/0x4e0
[ 84.754194] __x64_sys_unshare+0x16/0x20
[ 84.755332] do_syscall_64+0x37/0x70
[ 84.756503] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 84.758106]
[ 84.758625] Freed by task 0:
[ 84.759589] kasan_save_stack+0x1b/0x40
[ 84.760841] kasan_set_track+0x1c/0x30
[ 84.762082] kasan_set_free_info+0x20/0x30
[ 84.763421] __kasan_slab_free+0xeb/0x120
[ 84.764736] kfree+0xaa/0x460
[ 84.765722] put_cred_rcu+0xd5/0x190
[ 84.766925] rcu_core+0x3bf/0xcb0
[ 84.768074] __do_softirq+0xe3/0x341
[ 84.769262]
[ 84.769765] The buggy address belongs to the object at ffff8880045f0300
[ 84.769765] which belongs to the cache kmalloc-192 of size 192
[ 84.773675] The buggy address is located 28 bytes inside of
[ 84.773675] 192-byte region [ffff8880045f0300, ffff8880045f03c0)
[ 84.778876] The buggy address belongs to the page:
[ 84.780338] page:000000008de0a388 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff8880045f0000 pfn:0x45f0
[ 84.782844] flags: 0x100000000000200(slab|node=0|zone=1)
[ 84.784640] raw: 0100000000000200 ffffea00000f4640 0000000a0000000a ffff888001042a00
[ 84.787229] raw: ffff8880045f0000 000000008010000d 00000001ffffffff 0000000000000000
[ 84.789685] page dumped because: kasan: bad access detected
[ 84.791511]
[ 84.792037] Memory state around the buggy address:
[ 84.793590] ffff8880045f0200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 84.795920] ffff8880045f0280: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[ 84.798252] >ffff8880045f0300: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 84.800636] ^
[ 84.801953] ffff8880045f0380: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[ 84.804261] ffff8880045f0400: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 84.806622] ==================================================================
[ 84.808909] Disabling lock debugging due to kernel taint

--
Rgrds, legion

2021-07-07 18:51:24

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] ucounts: Fix UCOUNT_RLIMIT_SIGPENDING counter leak

On Wed, Jul 7, 2021 at 9:50 AM Alexey Gladkov <[email protected]> wrote:
>
> > + dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
> > + fallthrough;
> > + case LONG_MAX:
>
> I think that the counter should be decreased in this case too.
> inc_rlimit_ucounts() increments the counter in all parent userns. If we
> don't decrease the counter then the parent userns will have a counter
> leak.

Ack. So basically that patch, but move the dec_rlimit_ucounts() into
the LONG_MAX case?

Would you mind making a real patch with a commit message, and trying
whatever test-case you had for that KASAN use-after-free report?

Linus

2021-07-08 10:35:15

by Alexey Gladkov

[permalink] [raw]
Subject: [PATCH v2] Fix UCOUNT_RLIMIT_SIGPENDING counter leak

We must properly handle an errors when we increase the rlimit counter
and the ucounts reference counter. We have to this with RCU protection
to prevent possible use-after-free that could occur due to concurrent
put_cred_rcu().

The following reproducer triggers the problem:

$ cat testcase.sh
case "${STEP:-0}" in
0)
ulimit -Si 1
ulimit -Hi 1
STEP=1 unshare -rU "$0"
killall sleep
;;
1)
for i in 1 2 3 4 5; do unshare -rU sleep 5 & done
;;
esac

[ 84.670919] ==================================================================
[ 84.673326] BUG: KASAN: use-after-free in put_ucounts+0x17/0xa0
[ 84.674983] Write of size 4 at addr ffff8880045f031c by task swapper/2/0
[ 84.676702]
[ 84.677119] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.13.0+ #19
[ 84.678920] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-alt4 04/01/2014
[ 84.683128] Call Trace:
[ 84.683950] <IRQ>
[ 84.684635] dump_stack+0x8a/0xb5
[ 84.685751] print_address_description.constprop.0+0x18/0x130
[ 84.687633] ? put_ucounts+0x17/0xa0
[ 84.688831] ? put_ucounts+0x17/0xa0
[ 84.690050] kasan_report.cold+0x7f/0x111
[ 84.691413] ? put_ucounts+0x17/0xa0
[ 84.692618] kasan_check_range+0xf9/0x1e0
[ 84.694039] put_ucounts+0x17/0xa0
[ 84.695237] put_cred_rcu+0xd5/0x190
[ 84.696998] rcu_core+0x3bf/0xcb0
[ 84.699692] ? note_gp_changes+0x90/0x90
[ 84.700938] ? recalibrate_cpu_khz+0x10/0x10
[ 84.702575] ? lapic_timer_set_periodic+0x30/0x30
[ 84.704152] ? clockevents_program_event+0xd3/0x130
[ 84.705754] ? hrtimer_interrupt+0x418/0x440
[ 84.707198] __do_softirq+0xe3/0x341
[ 84.708412] irq_exit_rcu+0xbe/0xe0
[ 84.709572] sysvec_apic_timer_interrupt+0x6a/0x90
[ 84.711235] </IRQ>
[ 84.711972] asm_sysvec_apic_timer_interrupt+0x12/0x20
[ 84.713661] RIP: 0010:default_idle+0xb/0x10
[ 84.715070] Code: ff f0 80 63 02 df 5b 41 5c c3 0f ae f0 0f ae 3b 0f ae f0 eb 90 66 2e 0f 1f 84 00 00 00 00 00 eb 07 0f 00 2d 37 7d 5a 5
[ 84.721579] RSP: 0018:ffffc900000dfe80 EFLAGS: 00000202
[ 84.723330] RAX: ffffffffad279280 RBX: ffff8880013f3e00 RCX: ffffffffad269c56
[ 84.725724] RDX: 000000000001a8c6 RSI: 0000000000000004 RDI: ffff8880361322a0
[ 84.728085] RBP: 0000000000000002 R08: 0000000000000001 R09: ffff8880361322a3
[ 84.730485] R10: ffffed1006c26454 R11: 0000000000000001 R12: 0000000000000002
[ 84.732814] R13: 0000000000000000 R14: 0000000000000000 R15: 1ffff9200001bfd6
[ 84.736325] ? mwait_idle+0xc0/0xc0
[ 84.737488] ? rcu_eqs_enter.constprop.0+0x86/0xa0
[ 84.739085] default_idle_call+0x53/0x130
[ 84.740410] do_idle+0x311/0x3c0
[ 84.741512] ? _raw_spin_lock_irqsave+0x7b/0xd0
[ 84.743060] ? arch_cpu_idle_exit+0x30/0x30
[ 84.744442] ? swake_up_locked+0x6d/0x80
[ 84.745673] cpu_startup_entry+0x14/0x20
[ 84.746657] secondary_startup_64_no_verify+0xc2/0xcb
[ 84.747746]
[ 84.748073] Allocated by task 127:
[ 84.748742] kasan_save_stack+0x1b/0x40
[ 84.750483] __kasan_kmalloc+0x7c/0x90
[ 84.751359] alloc_ucounts+0x169/0x2b0
[ 84.752270] set_cred_ucounts+0xbb/0x170
[ 84.753266] ksys_unshare+0x24c/0x4e0
[ 84.754194] __x64_sys_unshare+0x16/0x20
[ 84.755332] do_syscall_64+0x37/0x70
[ 84.756503] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 84.758106]
[ 84.758625] Freed by task 0:
[ 84.759589] kasan_save_stack+0x1b/0x40
[ 84.760841] kasan_set_track+0x1c/0x30
[ 84.762082] kasan_set_free_info+0x20/0x30
[ 84.763421] __kasan_slab_free+0xeb/0x120
[ 84.764736] kfree+0xaa/0x460
[ 84.765722] put_cred_rcu+0xd5/0x190
[ 84.766925] rcu_core+0x3bf/0xcb0
[ 84.768074] __do_softirq+0xe3/0x341
[ 84.769262]
[ 84.769765] The buggy address belongs to the object at ffff8880045f0300
[ 84.769765] which belongs to the cache kmalloc-192 of size 192
[ 84.773675] The buggy address is located 28 bytes inside of
[ 84.773675] 192-byte region [ffff8880045f0300, ffff8880045f03c0)
[ 84.778876] The buggy address belongs to the page:
[ 84.780338] page:000000008de0a388 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff8880045f0000 pfn:0x45f0
[ 84.782844] flags: 0x100000000000200(slab|node=0|zone=1)
[ 84.784640] raw: 0100000000000200 ffffea00000f4640 0000000a0000000a ffff888001042a00
[ 84.787229] raw: ffff8880045f0000 000000008010000d 00000001ffffffff 0000000000000000
[ 84.789685] page dumped because: kasan: bad access detected
[ 84.791511]
[ 84.792037] Memory state around the buggy address:
[ 84.793590] ffff8880045f0200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 84.795920] ffff8880045f0280: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[ 84.798252] >ffff8880045f0300: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 84.800636] ^
[ 84.801953] ffff8880045f0380: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[ 84.804261] ffff8880045f0400: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 84.806622] ==================================================================
[ 84.808909] Disabling lock debugging due to kernel taint

Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
Cc: Eric W. Biederman <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Alexey Gladkov <[email protected]>
---
kernel/signal.c | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index de0920353d30..e1e4d6d07f08 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -426,18 +426,30 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags,
rcu_read_lock();
ucounts = task_ucounts(t);
sigpending = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
- if (sigpending == 1)
- ucounts = get_ucounts(ucounts);
+ switch (sigpending) {
+ case 1:
+ if (likely(get_ucounts(ucounts)))
+ break;
+ fallthrough;
+ case LONG_MAX:
+ /*
+ * we need to decrease the ucount in the userns tree on any
+ * failure to avoid counts leaking.
+ */
+ dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
+ rcu_read_unlock();
+ return NULL;
+ }
rcu_read_unlock();

- if (override_rlimit || (sigpending < LONG_MAX && sigpending <= task_rlimit(t, RLIMIT_SIGPENDING))) {
+ if (override_rlimit || likely(sigpending <= task_rlimit(t, RLIMIT_SIGPENDING))) {
q = kmem_cache_alloc(sigqueue_cachep, gfp_flags);
} else {
print_dropped_signal(sig);
}

if (unlikely(q == NULL)) {
- if (ucounts && dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1))
+ if (dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1))
put_ucounts(ucounts);
} else {
INIT_LIST_HEAD(&q->list);
--
2.29.3

2021-07-08 11:01:51

by Alexey Gladkov

[permalink] [raw]
Subject: Re: [PATCH] ucounts: Fix UCOUNT_RLIMIT_SIGPENDING counter leak

On Wed, Jul 07, 2021 at 10:23:35AM -0700, Linus Torvalds wrote:
> On Wed, Jul 7, 2021 at 9:50 AM Alexey Gladkov <[email protected]> wrote:
> >
> > > + dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
> > > + fallthrough;
> > > + case LONG_MAX:
> >
> > I think that the counter should be decreased in this case too.
> > inc_rlimit_ucounts() increments the counter in all parent userns. If we
> > don't decrease the counter then the parent userns will have a counter
> > leak.
>
> Ack. So basically that patch, but move the dec_rlimit_ucounts() into
> the LONG_MAX case?

Yep.

> Would you mind making a real patch with a commit message, and trying
> whatever test-case you had for that KASAN use-after-free report?

Sure. I will.

--
Rgrds, legion

2021-07-08 18:46:59

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2] Fix UCOUNT_RLIMIT_SIGPENDING counter leak

On Thu, Jul 8, 2021 at 3:34 AM Alexey Gladkov <[email protected]> wrote:
>
> We must properly handle an errors when we increase the rlimit counter
> and the ucounts reference counter. We have to this with RCU protection
> to prevent possible use-after-free that could occur due to concurrent
> put_cred_rcu().

Thanks, LGTM. Applied,

Linus