2021-10-11 17:27:21

by Michal Koutný

Subject: [PATCH] sched/fair: Use rq->lock when checking cfs_rq list presence

The removal path checks the cfs_rq->on_list flag without synchronization,
based on the reasoning that only empty cgroups can be removed and
->on_list can't switch back to 1. However, since commit a7b359fc6a37
("sched/fair: Correctly insert cfs_rq's to list on unthrottle") even the
cfs_rq of an empty cgroup may be added to the list because of non-decayed
load that transiently remains on it.
The result is that we may skip unlinking the cfs_rq from the list on the
removal path; the list will then contain a freed cfs_rq, which sooner or
later leads to a task faulting inside update_blocked_averages() while
holding rq->lock, eventually locking up the machine on all other CPUs
as well:

[ 8995.095798] BUG: kernel NULL pointer dereference, address: 0000000000000080
[ 9016.281685] NMI watchdog: Watchdog detected hard LOCKUP on cpu 21

Illustrative stack dump of a task that faulted by accessing released
cfs_rq (+unrelated deadlock on rq->lock):

PID: 0 TASK: ffff8a310a5dc000 CPU: 16 COMMAND: "swapper/16"
#0 [fffffe0000379e58] crash_nmi_callback at ffffffffba063683
#1 [fffffe0000379e60] nmi_handle at ffffffffba0377ef
#2 [fffffe0000379eb8] default_do_nmi at ffffffffba037c5e
#3 [fffffe0000379ed8] do_nmi at ffffffffba037ea7
#4 [fffffe0000379ef0] end_repeat_nmi at ffffffffbaa0178b
[exception RIP: native_queued_spin_lock_slowpath+98]
RIP: ffffffffba0fa6e2 RSP: ffffa8505932ca30 RFLAGS: 00000002
RAX: 0000000001200101 RBX: ffff8a8de9044000 RCX: ffff8a8ec0800000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8a8ec082cc80
RBP: ffff8a8ec082cc80 R8: ffff8a8ec0800000 R9: ffff8a31078058f8
R10: 0000000000000000 R11: ffffffffbb4639d8 R12: 0000000000000000
R13: ffff8a8de9044b84 R14: 0000000000000006 R15: 0000000000000010
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#5 [ffffa8505932ca30] native_queued_spin_lock_slowpath at ffffffffba0fa6e2
#6 [ffffa8505932ca30] _raw_spin_lock at ffffffffba922aab
#7 [ffffa8505932ca38] try_to_wake_up at ffffffffba0cf8f9
#8 [ffffa8505932ca98] __queue_work at ffffffffba0b9c7e
#9 [ffffa8505932cae0] queue_work_on at ffffffffba0ba7b4
#10 [ffffa8505932caf0] bit_putcs at ffffffffba541bc0
#11 [ffffa8505932cbf0] fbcon_putcs at ffffffffba53c36b
#12 [ffffa8505932cc48] vt_console_print at ffffffffba5ff032
#13 [ffffa8505932cca8] console_unlock at ffffffffba1091a2
#14 [ffffa8505932ccf0] vprintk_emit at ffffffffba10ad29
#15 [ffffa8505932cd40] printk at ffffffffba10b590
#16 [ffffa8505932cda8] no_context at ffffffffba081f66
#17 [ffffa8505932ce10] do_page_fault at ffffffffba082df0
#18 [ffffa8505932ce40] page_fault at ffffffffbaa012fe
[exception RIP: update_blocked_averages+685]
RIP: ffffffffba0d85cd RSP: ffffa8505932cef0 RFLAGS: 00010046
RAX: 0000000000000000 RBX: ffff8a8ca0510000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff8a8ca0510000 RDI: 0000000000000000
RBP: 0000000000000000 R8: 00000000eac0c6e6 R9: 0000000000000233
R10: ffffa8505932cef0 R11: 0000000000000233 R12: ffff8a8ca0510140
R13: 0000000000000000 R14: fffffffffffffec2 R15: 0000000000000080
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#19 [ffffa8505932cf50] run_rebalance_domains at ffffffffba0e2751
#20 [ffffa8505932cf68] __softirqentry_text_start at ffffffffbac000e3
#21 [ffffa8505932cfc8] irq_exit at ffffffffba0a2cf5
#22 [ffffa8505932cfd8] smp_apic_timer_interrupt at ffffffffbaa02874
#23 [ffffa8505932cff0] apic_timer_interrupt at ffffffffbaa01d9f
--- <IRQ stack> ---
#24 [ffffa850402f3dd8] apic_timer_interrupt at ffffffffbaa01d9f
[exception RIP: cpuidle_enter_state+171]
RIP: ffffffffba6fc32b RSP: ffffa850402f3e80 RFLAGS: 00000246
RAX: ffff8a8ec082cc80 RBX: ffffc7f053605e80 RCX: 000000000000001f
RDX: 0000082e557d390c RSI: 000000003d1877c2 RDI: 0000000000000000
RBP: ffffffffbb55f100 R8: 0000000000000002 R9: 000000000002c500
R10: ffffa850402f3e60 R11: 00000000000002ff R12: 0000000000000002
R13: 0000082e557d390c R14: 0000000000000002 R15: 0000000000000000
ORIG_RAX: ffffffffffffff13 CS: 0010 SS: 0018
#25 [ffffa850402f3ec0] cpuidle_enter at ffffffffba6fc6f9
#26 [ffffa850402f3ee0] do_idle at ffffffffba0d4567
#27 [ffffa850402f3f20] cpu_startup_entry at ffffffffba0d4769
#28 [ffffa850402f3f30] start_secondary at ffffffffba064e35
#29 [ffffa850402f3f50] secondary_startup_64_no_verify at ffffffffba000112

Fix this by always taking rq->lock when checking the ->on_list condition,
so that the check is synchronized with the list modifications done in
update_blocked_averages() (UBA).

Taking rq->lock on every cpu cgroup removal may pose a performance
penalty. However, this merely moves the necessary work from UBA into
unregister_fair_sched_group() and should therefore be neutral at larger
scale (assuming the given cpu cgroup was populated at least once).
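
For reference, the re-add can come from the unthrottle path that commit
a7b359fc6a37 extended. Roughly (sketch of the relevant part of
tg_unthrottle_up(), context abbreviated):

	static int tg_unthrottle_up(struct task_group *tg, void *data)
	{
		struct rq *rq = data;
		struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
		...
		/* Add cfs_rq with load or one or more already running entities to the list */
		if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running)
			list_add_leaf_cfs_rq(cfs_rq);
		...
	}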

Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")

Signed-off-by: Michal Koutný <[email protected]>
---
kernel/sched/fair.c | 7 -------
1 file changed, 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f6a05d9b5443..081c7ac02f65 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11440,13 +11440,6 @@ void unregister_fair_sched_group(struct task_group *tg)
 		if (tg->se[cpu])
 			remove_entity_load_avg(tg->se[cpu]);
 
-		/*
-		 * Only empty task groups can be destroyed; so we can speculatively
-		 * check on_list without danger of it being re-added.
-		 */
-		if (!tg->cfs_rq[cpu]->on_list)
-			continue;
-
 		rq = cpu_rq(cpu);
 
 		raw_spin_rq_lock_irqsave(rq, flags);
--
2.33.0


2021-10-11 19:15:05

by Odin Ugedal

Subject: Re: [PATCH] sched/fair: Use rq->lock when checking cfs_rq list presence

On Mon, 11 Oct 2021 at 18:24, Michal Koutný <[email protected]> wrote:
>
> The removal path checks cfs_rq->on_list flag without synchronization
> based on the reasoning that only empty cgroups can be removed and
> ->on_list can't switch back to 1. However, since the commit a7b359fc6a37
> ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") even
> cfs_rq of an empty cgroup may be added to the list because of
> non-decayed load that transiently remains there.
> The result is that we may skip unlinking the cfs_rq from the list on the
> removal path and the list will then contain free'd cfs_rq, which leads
> sooner or later to a task failing inside update_blocked_averages while
> holding rq->lock and eventually locking up the machine on all other CPUs
> as well:
>
> [... dmesg excerpt and stack dump snipped ...]
>
> Fix this by always taking rq->lock when checking the ->on_list condition
> (the modification of the list in UBA is therefore synchronized).

Hi,

Well, yeah, that statement (in the code you removed) does not hold any more,
that is definitely true. Good catch.

To be 100% clear, this can only happen when a control group is throttled
while it still has load (cfs_rq_is_decayed(cfs_rq) is false), and then its
unthrottling races with its deletion?
Is that a correct understanding, Michal?

>
> Taking the rq->lock on every cpu cgroup removal may pose a performance
> penalty. However, this should be just moving the necessary work from UBA
> into the unregister_fair_sched_group() and therefore neutral on larger
> scale (assuming given cpu cgroup was populated at least once).
>
> Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
>
> Signed-off-by: Michal Koutný <[email protected]>
> ---

Others know more about the performance impact of that lock, but in terms of
logic it looks good to me. Fixing it the other way around, without
reintroducing the issue fixed in a7b359fc6a37 ("sched/fair: Correctly insert
cfs_rq's to list on unthrottle"), would create even more brittle logic, I
think. fdaba61ef8a26 ("sched/fair: Ensure that the CFS parent is added after
unthrottling") will also complicate that even more (and that patch is also
needed to correct the issues it addresses).

In theory, we can do something like the diff below (off the top of my head),
but I'm not sure I am 100% comfortable with such a solution, especially due
to the "child_cfs_rq_on_list(cfs_rq)" check we would need in addition to
nr_running. Do you agree that this would also solve the problem, Michal, or
am I missing something obvious here?

I do think Vincent might have some more thoughts about this, so I will
defer it to him.

So in general, given that the locking thing is ok:
Acked-by: Odin Ugedal <[email protected]>

Thanks
Odin

[0]:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f6a05d9b5443..e9a104d57b59 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4805,9 +4805,13 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
 		cfs_rq->throttled_clock_task_time += rq_clock_task(rq) -
 					     cfs_rq->throttled_clock_task;
 
-		/* Add cfs_rq with load or one or more already running entities to the list */
-		if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running)
+		if (cfs_rq->nr_running || child_cfs_rq_on_list(cfs_rq)) {
 			list_add_leaf_cfs_rq(cfs_rq);
+		} else if (!cfs_rq->on_list && !cfs_rq_is_decayed(cfs_rq)) {
+			// Fully decay/detach the load of the cfs_rq
+			cfs_rq->avg.load_avg = 0;
+			update_tg_load_avg(cfs_rq);
+		}
 	}
 
 	return 0;

2021-10-12 18:51:55

by Tao Zhou

Subject: Re: [PATCH] sched/fair: Use rq->lock when checking cfs_rq list presence

Hi Odin,

On Mon, Oct 11, 2021 at 08:12:08PM +0100, Odin Ugedal wrote:
> On Mon, 11 Oct 2021 at 18:24, Michal Koutný <[email protected]> wrote:
> >
> > The removal path checks cfs_rq->on_list flag without synchronization
> > based on the reasoning that only empty cgroups can be removed and
> > ->on_list can't switch back to 1. However, since the commit a7b359fc6a37
> > ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") even
> > cfs_rq of an empty cgroup may be added to the list because of
> > non-decayed load that transiently remains there.
> > The result is that we may skip unlinking the cfs_rq from the list on the
> > removal path and the list will then contain free'd cfs_rq, which leads
> > sooner or later to a task failing inside update_blocked_averages while
> > holding rq->lock and eventually locking up the machine on all other CPUs
> > as well:
> >
> > [... dmesg excerpt and stack dump snipped ...]
> >
> > Fix this by always taking rq->lock when checking the ->on_list condition
> > (the modification of the list in UBA is therefore synchronized).
>
> Hi,
>
> Well, yeah, that statement (in the code you removed) does not hold any more,
> that is definitely true. Good catch.
>
> To be 100% clear, this can only happen when a control group is
> throttled while it has load
> (cfs_rq_is_decayed(cfs_rq) is false); and then its unthrottling race
> with its deletion?
> Is that a correct understanding Michal?
>
> >
> > Taking the rq->lock on every cpu cgroup removal may pose a performance
> > penalty. However, this should be just moving the necessary work from UBA
> > into the unregister_fair_sched_group() and therefore neutral on larger
> > scale (assuming given cpu cgroup was populated at least once).
> >
> > Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
> >
> > Signed-off-by: Michal Koutný <[email protected]>
> > ---
>
> Others know more about the performance impact of that lock, but in
> terms of logic,
> it looks good to me. Fixing it the other way around, without
> reintroducing the issue fixed in
> a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on
> unthrottle"), would create even
> more brittle logic I think. fdaba61ef8a26 ("sched/fair: Ensure that
> the CFS parent is added
> after unthrottling) will also complicate that even more (and that
> patch is also needed for
> correcting the issues it does.
>
> In theory, we can do something like the diff below (from the
> top of my head), but I'm not sure if I am 100% comfortable with such a solution.
> Especially due to the "child_cfs_rq_on_list(cfs_rq)" we need in addition
> to the nr_running. Do you agree that that will also solve the problem Michal,
> or am I missing something obvious here?
>
> I do think Vincent might have some more thoughts about this, so I will
> defer it to him.
>
> So in general, given that the locking thing is ok:
> Acked-by: Odin Ugedal <[email protected]>
>
> Thanks
> Odin
>
> [0]:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f6a05d9b5443..e9a104d57b59 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4805,9 +4805,13 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
>  		cfs_rq->throttled_clock_task_time += rq_clock_task(rq) -
>  					     cfs_rq->throttled_clock_task;
>  
> -		/* Add cfs_rq with load or one or more already running entities to the list */
> -		if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running)
> +		if (cfs_rq->nr_running || child_cfs_rq_on_list(cfs_rq)) {
>  			list_add_leaf_cfs_rq(cfs_rq);
> +		} else if (!cfs_rq->on_list && !cfs_rq_is_decayed(cfs_rq)) {
> +			// Fully decay/detach the load of the cfs_rq
> +			cfs_rq->avg.load_avg = 0;
> +			update_tg_load_avg(cfs_rq);


Er.. this seems specific to this fix, I think. A normal unthrottle (one not
racing with deletion, where the avg may still be used afterwards) also needs
the normal avg decay.

> + }
> }
>
> return 0;



Thanks,
Tao

2021-10-13 08:01:58

by Vincent Guittot

Subject: Re: [PATCH] sched/fair: Use rq->lock when checking cfs_rq list presence

On Mon, 11 Oct 2021 at 19:24, Michal Koutný <[email protected]> wrote:
>
> The removal path checks cfs_rq->on_list flag without synchronization
> based on the reasoning that only empty cgroups can be removed and
> ->on_list can't switch back to 1. However, since the commit a7b359fc6a37
> ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") even
> cfs_rq of an empty cgroup may be added to the list because of
> non-decayed load that transiently remains there.
> The result is that we may skip unlinking the cfs_rq from the list on the
> removal path and the list will then contain free'd cfs_rq, which leads

This test was there before we started to delete a cfs_rq from the list
when unused. At that time, a cfs_rq was added when the 1st entity was
enqueued and then remained in the list until the group was destroyed.
This test in unregister_fair_sched_group() was there to optimize the
removal of a cfs_rq which has never been used and as a result was never
added to the list AFAICT.

I'm not sure about your explanation above. We don't skip unlinking the
cfs_rq because it is already not linked at the moment of the test.
Furthermore, list_del_leaf_cfs_rq() starts with the same test of
cfs_rq->on_list. The problem is that the cfs_rq can be added during or
after the test. Removing the check should not be enough, because we do the
same test under the rq lock, which only ensures that the test and the
add to the list will not happen simultaneously. This seems to close
the race window in your case, but it could still happen AFAICT.
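
Roughly (illustrative interleaving, not taken from an actual trace):

  rmdir path                                  unthrottle path
  unregister_fair_sched_group()
    raw_spin_rq_lock_irqsave(rq, flags)
    list_del_leaf_cfs_rq(cfs_rq)    /* on_list still 0, nothing to do */
    raw_spin_rq_unlock_irqrestore(rq, flags)
                                              tg_unthrottle_up()   /* later, also under rq->lock */
                                                list_add_leaf_cfs_rq(cfs_rq)  /* linked again */
  ...
  free_fair_sched_group()
    kfree(cfs_rq)                   /* still linked -> UAF in update_blocked_averages() */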

What about your patchset that adds a cfs_rq to the list only when
there is a runnable task? Wouldn't this fix the problem?

> [... rest of the quoted patch snipped ...]

2021-10-13 14:29:46

by Michal Koutný

Subject: Re: [PATCH] sched/fair: Use rq->lock when checking cfs_rq list presence

On Wed, Oct 13, 2021 at 09:57:17AM +0200, Vincent Guittot <[email protected]> wrote:
> Furthermore, list_del_leaf_cfs_rq() starts with the same test on of
> cfs_rq->on_list.

Yes, the same check but synchronized with rq->lock.

> The problem is that the cfs_rq can be added during or
> after the test. Removing it should not be enough because we do the
> same test under rq lock which only ensures that both the test and the
> add on the list will not happen simultaneously.

This is what I overlooked when I was looking for an explanation of the UAF
on the leaf list.

> This seems to closes the race window in your case but this could still
> happen AFAICT.

You seem to be right.
Hopefully, I'll be able to collect more data evaluating this.

> What about your patchset about adding a cfs in the list only when
> there is a runnable task ?

The patches I had sent previously [1] avoid adding cfs_rq to the list
when it's under a throttled ancestor (namely 4/5). The runnable
condition is rather orthogonal. (Not sure it's the patchset you were
referring to.)


> Wouldn't this fix the problem ?

FWIW, the "reliable" fix so far is a revert of commit a7b359fc6a37
("sched/fair: Correctly insert cfs_rq's to list on unthrottle"). Hence my
hypothesis about the racy adding from tg_unthrottle_up(); I therefore think
the other patches won't affect the issue.

Thanks for your feedback. Let me examine the problem some more before
continuing with this patch.

Michal


[1] https://lore.kernel.org/all/[email protected]/

2021-10-13 14:43:33

by Michal Koutný

Subject: Re: [PATCH] sched/fair: Use rq->lock when checking cfs_rq list presence

On Mon, Oct 11, 2021 at 08:12:08PM +0100, Odin Ugedal <[email protected]> wrote:
> To be 100% clear, this can only happen when a control group is
> throttled while it has load
> (cfs_rq_is_decayed(cfs_rq) is false); and then its unthrottling race
> with its deletion?
> Is that a correct understanding Michal?

Yes, that's my working hypothesis but Vincent found a loophole in the
proposed fix under this assumption.


> Do you agree that that will also solve the problem Michal,
> or am I missing something obvious here?

It's not easy for me to verify this with a reproducer and as suggested
by your discomfort, let's dismiss this idea for the time being.

Michal

2021-10-13 18:50:34

by Odin Ugedal

Subject: Re: [PATCH] sched/fair: Use rq->lock when checking cfs_rq list presence

On Wed, 13 Oct 2021 at 15:39, Michal Koutný <[email protected]> wrote:
>
> On Mon, Oct 11, 2021 at 08:12:08PM +0100, Odin Ugedal <[email protected]> wrote:
>
> Yes, that's my working hypothesis but Vincent found a loophole in the
> proposed fix under this assumption.

Yes, I started to think about the same thing late yesterday afternoon as well...

Ref. your comment about reverting a7b359fc6a37 ("sched/fair: Correctly
insert cfs_rq's to list on unthrottle"): I think that is fine as long as we
revert the commit it fixes as well, to avoid a regression of that (but yeah,
that regression itself is less bad than your discovery). If we do so, we
know that the only time we remove a cfs_rq from the list is when it is fully
decayed, creating far fewer edge cases.

Regarding the race: would a simple fix be to, in addition to your patch, set
cfs_rq->on_list=2 inside that locked section, and then treat on_list=2 as
"not on the list, and do not re-add"? We could introduce constants for these
values. That way we would know that the cfs_rq will never again be inserted
into the list, even if it still has load. Would something like that work
properly?
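
Something like the below is what I have in mind (completely untested sketch,
constant names made up):

	#define CFS_RQ_OFF_LIST	0
	#define CFS_RQ_ON_LIST	1
	#define CFS_RQ_DEAD	2	/* tg is going away, never (re-)add */

	/* unregister_fair_sched_group(), under rq->lock: */
	list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
	tg->cfs_rq[cpu]->on_list = CFS_RQ_DEAD;

	/* and early in list_add_leaf_cfs_rq(): */
	if (cfs_rq->on_list == CFS_RQ_DEAD)
		return false;	/* do not re-add, even with remaining load */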

Thanks
Odin

2021-10-13 18:55:48

by Odin Ugedal

Subject: Re: [PATCH] sched/fair: Use rq->lock when checking cfs_rq list presence

Hi,

On Tue, 12 Oct 2021 at 19:31, Tao Zhou <[email protected]> wrote:
>
> Er.. this seems specific to this fix, I think. A normal unthrottle (one not
> racing with deletion, where the avg may still be used afterwards) also needs
> the normal avg decay.

Yeah, it was more meant as a way to express my idea; I should probably have
said that more explicitly. It would essentially be a revert of a7b359fc6a37
("sched/fair: Correctly insert cfs_rq's to list on unthrottle"), while
"temporarily" fixing the problem it fixed by removing the load from the tg.

But yeah, it would still need to decay the load properly to _actually_ be
fully correct and work as it does today. In most cases nr_running>0 on
unthrottle, but it is definitely not always the case.

Thanks
Odin

2021-11-02 16:32:38

by Michal Koutný

Subject: task_group unthrottling and removal race (was Re: [PATCH] sched/fair: Use rq->lock when checking cfs_rq list presence)

Hello.

(Getting back to this after some more analysis.)

On Wed, Oct 13, 2021 at 04:26:43PM +0200, Michal Koutný <[email protected]> wrote:
> On Wed, Oct 13, 2021 at 09:57:17AM +0200, Vincent Guittot <[email protected]> wrote:
> > This seems to closes the race window in your case but this could still
> > happen AFAICT.
>
> You seem to be right.
> Hopefully, I'll be able to collect more data evaluating this.

I've observed that the window between unregister_fair_sched_group() and
free_fair_sched_group() is commonly around 15 ms (based on kprobe
tracing).

I have a reproducer (attached) that can hit this window quite easily
after tuning. I can observe consequences of it even with a recent 5.15
kernel. (And I also have reports from real world workloads failing due
to a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on
unthrottle").)

My original patch was really an uninformed attempt given the length of
the window.

[snip]

On Wed, Oct 13, 2021 at 07:45:59PM +0100, Odin Ugedal <[email protected]> wrote:
> Ref. your comment about reverting a7b359fc6a37
> ("sched/fair: Correctly insert cfs_rq's to list on unthrottle"), I
> think that is fine as long as we revert the commit it fixes as well,
> to avoid a regression of that (but yeah, that regression itself is
> less bad than your discovery).

I say no to reverting 31bc6aeaab1d ("sched/fair: Optimize
update_blocked_averages()") (it solves reported performance issues, it's
way too old :-).

> set cfs_rq->on_list=2 inside that lock under your code change? If we
> then treat on_list=2
> as "not on list, and do not add"?

The possibilities for the current problem:

1) Revert a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") and its fixups.
(Not exclusive with the other suggestions, rather a stop-gap for the
time being.)

2) Don't add offlined task_groups into the undecayed list
- Your proposal with overloaded on_list=2 could serve as a mark of that,
but it's a hack IMO.
- Proper way (tm) would be to use css_tryget_online() and css_put() when
dealing with the list (my favorite at the moment, rough sketch below the
list).

3) Narrowing the race-window dramatically
- that is by moving list removal from unregister_fair_sched_group() to
free_fair_sched_group(),
- <del>or use list_empty(tg->list) as indicator whether we're working
with onlined task_group.</del> (won't work for RCU list)

4) Rework how throttled load is handled (hand waving)
- There is remove_entity_load_avg() that moves the load to parent upon
final removal. Maybe it could be generalized for temporary removals by
throttling (so that unthrottling could again add only non-empty
cfs_rqs to the list and undecayed load won't skew fairness).
- or the way of [1].

5) <your ideas>
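
For 2), very roughly what I have in mind (rough sketch only; whether the css
online/offline ordering gives a strong enough guarantee here would still need
checking):

	/* e.g. in tg_unthrottle_up() */
	if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running) {
		if (css_tryget_online(&tg->css)) {
			list_add_leaf_cfs_rq(cfs_rq);
			css_put(&tg->css);
		}
	}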

Opinions?

Thanks,
Michal

[1] https://lore.kernel.org/lkml/CAFpoUr1AO_qStNOYrFWGnFfc=uSFrXSYD8A5cQ8h0t2pioQzDA@mail.gmail.com/


Attachments:
run2.sh (1.56 kB)

2021-11-02 20:22:57

by Odin Ugedal

Subject: Re: task_group unthrottling and removal race (was Re: [PATCH] sched/fair: Use rq->lock when checking cfs_rq list presence)

On Tue, 2 Nov 2021 at 16:02, Michal Koutný <[email protected]> wrote:
>
>
> I've observed that the window between unregister_fair_sched_group() and
> free_fair_sched_group() is commonly around 15 ms (based on kprobe
> tracing).
>
> I have a reproducer (attached) that can hit this window quite easily
> after tuning. I can observe consequences of it even with a recent 5.15
> kernel. (And I also have reports from real world workloads failing due
> to a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on
> unthrottle").)
>
> My original patch was really an uninformed attempt given the length of
> the window.


Thanks for digging into this! I don't have time to run the reproducer myself
atm, but it looks good. Will look more deeply into it this weekend.

> On Wed, Oct 13, 2021 at 07:45:59PM +0100, Odin Ugedal <[email protected]> wrote:
>
> I say no to reverting 31bc6aeaab1d ("sched/fair: Optimize
> update_blocked_averages()") (it solves reported performance issues, it's
> way too old :-).

Yeah, I am fine with not reverting that one.


> 1) Revert a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") and its fixups.
> (Not exclusive with the other suggestions, rather a stop-gap for the
> time being.)

I am fine with reverting this commit for the short term. Can you post a
revert patch of it together with your bug details, as well as references to
this discussion? Feel free to add my:
Acked-by: Odin Ugedal <[email protected]>

I do not think we need to revert its fixups, as they help the
consistency of the list in other cases
as well. Might be wrong about this one, so it would probably need some
testing. They should
at least not cause _this_ race issue.

The performance regression of reverting that commit should also be
_fairly_ small, even though
it can cause quite severe fairness issues in some cases. That is
however preferable compared to
a hard lockup.

We can then look at solutions for that issue, and use some time to
figure out the best way
to mitigate it.

> 2) Don't add offlined task_groups into the undecayed list
> - Your proposal with overloaded on_list=2 could serve as mark of that,
> but it's a hack IMO.
> - Proper way (tm) would be to use css_tryget_online() and css_put() when
> dealing with the list (my favorite at the moment).

Ahh. I'm not familiar with css_tryget_online() and css_put(), but they
might be useful here. Maybe cc the cgroups list about this? But yeah, this
is my favorite medium- to long-term option as well.

> 3) Narrowing the race-window dramatically
> - that is by moving list removal from unregister_fair_sched_group() to
> free_fair_sched_group(),
> - <del>or use list_empty(tg->list) as indicator whether we're working
> with onlined task_group.</del> (won't work for RCU list)

This is beyond what I can wrap my head around this late, but I vote for a
solution that does _not_ include the phrase "narrowing a race window" if we
can eliminate the race entirely.

> 4) Rework how throttled load is handled (hand waving)
> - There is remove_entity_load_avg() that moves the load to parent upon
> final removal. Maybe it could be generalized for temporary removals by
> throttling (so that unthrottling could again add only non-empty
> cfs_rqs to the list and undecayed load won't skew fairness).
> - or the way of [1].

Yeah, we can do this if the performance impact of always grabbing the rq
lock on control group deletion (essentially your initial patch here) is
acceptable. My hunch is that the impact of the lock is essentially zero, and
that doing something like 2) is better if we can manage it. I have not
sketched up a proper solution like this, so I'm not sure how it would look.

> 5) <your ideas>

Nah, I think you found all the possible solutions I have thought of. :)

> Opinions?

Again, thanks for working on this, and for the nice script to reproduce!

> [1] https://lore.kernel.org/lkml/CAFpoUr1AO_qStNOYrFWGnFfc=uSFrXSYD8A5cQ8h0t2pioQzDA@mail.gmail.com/

Thanks
Odin

2021-11-03 09:54:29

by Mathias Krause

[permalink] [raw]
Subject: Re: task_group unthrottling and removal race (was Re: [PATCH] sched/fair: Use rq->lock when checking cfs_rq list presence)

Hi!

On 02.11.21 at 17:02, Michal Koutný wrote:
> [snip]
> I have a reproducer (attached) that can hit this window quite easily
> after tuning. I can observe consequences of it even with a recent 5.15
> kernel. (And I also have reports from real world workloads failing due
> to a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on
> unthrottle").)

Thanks for the reproducer!

To provide yet another data point, Kevin (on Cc) is running into this
bug as well, very reliably, with a production workload, so we started
looking into this too. His crashes indicate a use-after-free of a cfs_rq
in update_blocked_averages(), much like you already diagnosed in your
initial patch description -- there are live cfs_rq's (on_list=1) in an
about to be kfree()'d task group in free_fair_sched_group().

His kernel config happened to lead to a layout of struct sched_entity
that put the 'my_q' member directly into the middle of the object, which
makes it incidentally overlap with SLUB's freelist pointer. That in
combination with SLAB_FREELIST_HARDENED's freelist pointer mangling
leads to a reliable access violation in the form of a #GP, which allowed
us to make the UAF fail fast.

As the real root cause cannot be seen from the crash backtrace alone, we
tested a debug patch (attached) that unveiled that the real offender is
tg_unthrottle_up() getting called via sched_cfs_period_timer() via the
timer interrupt at an inconvenient time, namely when
unregister_fair_sched_group() unlinks all cfs_rq's from the dying task
group. The latter doesn't protect itself from getting interrupted, so if
the timer interrupt triggers while we iterate over all CPUs, or after
unregister_fair_sched_group() has finished but prior to unlinking the
task group in sched_offline_group(), sched_cfs_period_timer() will
execute and walk the list of task groups, trying to unthrottle cfs_rq's
and possibly re-add them to the dying task group. These will later -- in
free_fair_sched_group() -- be kfree()'ed while still being linked,
leading to the fireworks Kevin and you are seeing.
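
The offending call chain is roughly:

	timer IRQ
	  sched_cfs_period_timer()
	    do_sched_cfs_period_timer()
	      distribute_cfs_runtime()
	        unthrottle_cfs_rq()
	          walk_tg_tree_from(..., tg_unthrottle_up, ...)
	            tg_unthrottle_up()
	              list_add_leaf_cfs_rq()	/* re-links the dying tg's cfs_rq */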

We tried the below patch which, unfortunately, doesn't fix the issue. So
there must be something else. :(

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 978460f891a1..afee07e9faf9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9506,13 +9506,17 @@ void sched_offline_group(struct task_group *tg)
 {
 	unsigned long flags;
 
-	/* End participation in shares distribution: */
-	unregister_fair_sched_group(tg);
-
+	/*
+	 * Unlink first, to avoid walk_tg_tree_from() from finding us
+	 * (via sched_cfs_period_timer()).
+	 */
 	spin_lock_irqsave(&task_group_lock, flags);
 	list_del_rcu(&tg->list);
 	list_del_rcu(&tg->siblings);
 	spin_unlock_irqrestore(&task_group_lock, flags);
+
+	/* End participation in shares distribution: */
+	unregister_fair_sched_group(tg);
 }
 
 static void sched_change_group(struct task_struct *tsk, int type)

> [snip]
>
> The possibilities for the current problem:
>
> 1) Revert a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") and its fixups.
> (Not exclusive with the other suggestions, rather a stop-gap for the
> time being.)
>
> 2) Don't add offlined task_groups into the undecayed list
> - Your proposal with overloaded on_list=2 could serve as mark of that,
> but it's a hack IMO.

> - Proper way (tm) would be to use css_tryget_online() and css_put() when
> dealing with the list (my favorite at the moment).

That might work, as -- at least in Kevin's case -- it all gets triggered
by a dying cgroup.

> 3) Narrowing the race-window dramatically
> - that is by moving list removal from unregister_fair_sched_group() to
> free_fair_sched_group(),

That might work, too. However, the unlinking needs protection against
the timer interrupt (and other sources?) which might try to re-add
entries. Or won't that happen any more, as at least one RCU GP has
passed? Anyhow, the kfree() calls likely would need to become
kfree_rcu() to handle concurrent traversal of cfs_rq's.
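
E.g., purely illustrative, assuming struct cfs_rq grew a struct rcu_head
member named 'rcu' for this purpose:

	kfree_rcu(tg->cfs_rq[cpu], rcu);	/* instead of kfree(tg->cfs_rq[cpu]) */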

> - <del>or use list_empty(tg->list) as indicator whether we're working
> with onlined task_group.</del> (won't work for RCU list)
>
> 4) Rework how throttled load is handled (hand waving)
> - There is remove_entity_load_avg() that moves the load to parent upon
> final removal. Maybe it could be generalized for temporary removals by
> throttling (so that unthrottling could again add only non-empty
> cfs_rqs to the list and undecayed load won't skew fairness).
> - or the way of [1].
>
> 5) <your ideas>

It should be something that can be backported to stable kernels, as this
seems to affect v5.13, too.


Thanks,
Mathias


Attachments:
5.14-sched-fair-dbg.diff (2.60 kB)

2021-11-03 10:53:30

by Mathias Krause

Subject: Re: task_group unthrottling and removal race (was Re: [PATCH] sched/fair: Use rq->lock when checking cfs_rq list presence)

Heh, sometimes a good night's sleep helps unfold the knot in the head!

On 03.11.21 at 10:51, Mathias Krause wrote:
> [snip]
>
> We tried the below patch which, unfortunately, doesn't fix the issue. So
> there must be something else. :(
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 978460f891a1..afee07e9faf9 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9506,13 +9506,17 @@ void sched_offline_group(struct task_group *tg)
> {
> unsigned long flags;
>
> - /* End participation in shares distribution: */
> - unregister_fair_sched_group(tg);
> -
> + /*
> + * Unlink first, to avoid walk_tg_tree_from() from finding us
> + * (via sched_cfs_period_timer()).
> + */
> spin_lock_irqsave(&task_group_lock, flags);
> list_del_rcu(&tg->list);
> list_del_rcu(&tg->siblings);
> spin_unlock_irqrestore(&task_group_lock, flags);
> +
> + /* End participation in shares distribution: */

Adding synchronize_rcu() here will ensure all concurrent RCU "readers"
will have finished what they're doing, so we can unlink safely. That
was, apparently, the missing piece.

> + unregister_fair_sched_group(tg);
> }
>
> static void sched_change_group(struct task_struct *tsk, int type)
>

Now, synchronize_rcu() is quite a heavy hammer. So using an RCU callback
should be more appropriate. I'll hack up something and post a proper
patch, if you don't beat me to it.

Mathias

2021-11-03 11:13:26

by Michal Koutný

Subject: Re: task_group unthrottling and removal race (was Re: [PATCH] sched/fair: Use rq->lock when checking cfs_rq list presence)

On Wed, Nov 03, 2021 at 11:51:12AM +0100, Mathias Krause <[email protected]> wrote:
> Adding synchronize_rcu() here will ensure all concurrent RCU "readers"
> will have finished what they're doing, so we can unlink safely. That
> was, apparently, the missing piece.

What reader(s) are you referring to here? The
list_for_each_entry_rcu(cfs_rq, &cfs_b->throttled_cfs_rq,
throttled_list) in distribute_cfs_runtime()?

I think (not sure) that wouldn't work, since unthrottle_cfs_rq() can
still be called after this synchronize_rcu() but before
free_fair_sched_group().

(But if you considered update_blocked_averages() as the reader(s) and
synchronize_rcu() within free_fair_sched_group(), that may ensure UBA
won't step on a free'd cfs_rq (+unlinking would need to happen only in
free_fair_sched_group() too.))

Michal

2021-11-03 14:19:52

by Mathias Krause

Subject: Re: task_group unthrottling and removal race (was Re: [PATCH] sched/fair: Use rq->lock when checking cfs_rq list presence)

On 03.11.21 at 12:10, Michal Koutný wrote:
> On Wed, Nov 03, 2021 at 11:51:12AM +0100, Mathias Krause <[email protected]> wrote:
>> Adding synchronize_rcu() here will ensure all concurrent RCU "readers"
>> will have finished what they're doing, so we can unlink safely. That
>> was, apparently, the missing piece.
>
> What reader(s) are you referring to here? The
> list_for_each_entry_rcu(cfs_rq, &cfs_b->throttled_cfs_rq,
> throttled_list) in distribute_cfs_runtime()?

Let me quote a little bit more context here to explain my rationale:

>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 978460f891a1..afee07e9faf9 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -9506,13 +9506,17 @@ void sched_offline_group(struct task_group *tg)
>> {
>> unsigned long flags;
>>
>> - /* End participation in shares distribution: */
>> - unregister_fair_sched_group(tg);
>> -
>> + /*
>> + * Unlink first, to avoid walk_tg_tree_from() from finding us
>> + * (via sched_cfs_period_timer()).
>> + */

>> spin_lock_irqsave(&task_group_lock, flags);
>> list_del_rcu(&tg->list);
>> list_del_rcu(&tg->siblings);
>> spin_unlock_irqrestore(&task_group_lock, flags);

The above list_del_rcu() implies there are users that use RCU semantics
to walk the lists, namely tg->list and tg->siblings. And, in fact,
walk_tg_tree_from() does just that:

[...]
list_for_each_entry_rcu(child, &parent->children, siblings) {
[...]
}

Now, taking the task_group_lock in sched_offline_group() just prevents
multiple users from entering this code path. But walk_tg_tree_from()
does not take that lock, so it can walk the task group hierarchy concurrently.

Disabling IRQs also doesn't protect us from do_sched_cfs_period_timer()
running in parallel, as the timer interrupt might fire on a different
CPU than the one we're currently running on.

So, assuming the scenario in race.txt (sorry, my mailer is just too dumb
for bigger ASCII art), CPU2 will walk the task group list in parallel to
sched_offline_group(); specifically, it'll read the soon to be unlinked
task group entry in (1). So removing it on CPU1 in (2) won't prevent
CPU2 from still passing it on to tg_unthrottle_up(). Assuming
unregister_fair_sched_group() on CPU1 now tries to unlink all cfs_rq's
via calls to list_del_leaf_cfs_rq(), CPU2 will re-add some in (3), which
is the cause of the UAF later on. Now, if we add a synchronize_rcu()
prior to calling unregister_fair_sched_group(), we wait until the walk
on CPU2 has finished at (4). Afterwards each new walk won't see the
dying task group any more, thereby preventing the UAF from happening.
Or am I missing something?
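
In short, the attached race.txt sketches roughly this interleaving:

	CPU1 (sched_offline_group)            CPU2 (timer IRQ)
	                                        rcu_read_lock()
	                                        walk_tg_tree_from(...,tg_unthrottle_up,...)
	  list_del_rcu(&tg->list)          (1)    list_for_each_entry_rcu(...)  /* still sees tg */
	(2) list_del_rcu(&tg->siblings)
	  unregister_fair_sched_group()
	    list_del_leaf_cfs_rq(...)
	                                          tg_unthrottle_up()
	                                   (3)      list_add_leaf_cfs_rq(cfs_rq)  /* re-added */
	                                   (4)  rcu_read_unlock()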

>> +
>> + /* End participation in shares distribution: */
>
> Adding synchronize_rcu() here will ensure all concurrent RCU "readers"
> will have finished what they're doing, so we can unlink safely. That
> was, apparently, the missing piece.
>
>> + unregister_fair_sched_group(tg);
>> }
>>
>> static void sched_change_group(struct task_struct *tsk, int type)
>>

Btw, the above patch (plus the warning debug patch) has been running your
reproducer here for 3.5h without triggering any warnings. It used to
trigger them within the first few minutes.

> I'm think (not sure) that wouldn't work since the unthrottle_cfs_rq can
> still be called after this synchronize_rcu() but before
> free_fair_sched_group().

Sure, but it won't see the dying task group any more, so it won't re-add
cfs_rq's to it, preventing the UAF from happening.

> (But if you considered update_blocked_averages() as the reader(s) and
> synchronize_rcu() within free_fair_sched_group(), that may ensure UBA
> won't step on a free'd cfs_rq (+unlinking would need to happen only in
> free_fair_sched_group() too.))

I'd prefer not linking cfs_rq's to a dead tg. Trying to cover up just
before we're about to free the memory feels hacky.

Thanks,
Mathias


Attachments:
race.txt (1.91 kB)

2021-11-03 19:10:06

by Mathias Krause

Subject: [PATCH] sched/fair: Prevent dead task groups from regaining cfs_rq's

Kevin is reporting crashes which point to a use-after-free of a cfs_rq
in update_blocked_averages(). Initial debugging revealed that we've live
cfs_rq's (on_list=1) in an about to be kfree()'d task group in
free_fair_sched_group(). However, it was unclear how that can happen.

His kernel config happened to lead to a layout of struct sched_entity
that put the 'my_q' member directly into the middle of the object which
makes it incidentally overlap with SLUB's freelist pointer. That, in
combination with SLAB_FREELIST_HARDENED's freelist pointer mangling,
leads to a reliable access violation in form of a #GP which made the UAF
fail fast, e.g. like this:

general protection fault, probably for non-canonical address 0xff80f68888f69107: 0000 [#1] SMP
CPU: 5 PID: 0 Comm: swapper/5 Not tainted 5.14.13-custom+ #3
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 12/12/2018
RIP: 0010:[<ffffffff81194af8>] update_blocked_averages+0x428/0xb90
Code: 28 01 00 4c 8b 4c 24 10 41 8b 97 c4 00 00 00 85 d2 75 32 4c 89 fe 4c 89 cf e8 74 2c 01 00 49 8b 96 80 00 00 00 48 85 d2 74 0e <48> 83 ba 08 01 00 00 00 0f 85 45 01 00 00 85 c0 0f 84 34 fe ff ff
RSP: 0018:ffffc9000002bf08 EFLAGS: 00010086
RAX: 0000000000000001 RBX: ffff8880202eba00 RCX: 000000000000b686
RDX: ff80f68888f68fff RSI: 0000000000000000 RDI: 000000000000000c
RBP: ffff888006808a00 R08: ffff888006808a00 R09: 0000000000000000
R10: 0000000000000008 R11: 0000000000000000 R12: ffff8880202ebb40
R13: 0000000000000000 R14: ffff8880087e7f00 R15: ffff888006808a00
FS: 0000000000000000(0000) GS:ffff888237d40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000315b147b8000 CR3: 0000000002d70000 CR4: 00000000001606f0 shadow CR4: 00000000001606f0
Stack:
0100008237d58b80 0000000000000286 000003ae017023d5 00000000000000f0
ffff888237d5d240 0000000000000028 ffff888237d5c980 ffff888237d5c900
ffff888237d5c900 0000000000000001 0000000000000100 0000000000000007
Call Trace:
<IRQ>
[<ffffffff8119952a>] run_rebalance_domains+0x3a/0x60
[<ffffffff810030bf>] __do_softirq+0xbf/0x1fb
[<ffffffff81162e7f>] irq_exit_rcu+0x7f/0x90
[<ffffffff822d0d7e>] sysvec_apic_timer_interrupt+0x6e/0x90
</IRQ>
[<ffffffff81001d82>] asm_sysvec_apic_timer_interrupt+0x12/0x20
RIP: 0010:[<ffffffff822debe9>] acpi_idle_do_entry+0x49/0x50
Code: 8b 15 2f b3 75 01 ed c3 0f 0b e9 52 fd ff ff 65 48 8b 04 25 40 1c 01 00 48 8b 00 a8 08 75 e8 eb 07 0f 00 2d d9 e2 1d 00 fb f4 <fa> c3 0f 0b cc cc cc 41 55 41 89 d5 41 54 49 89 f4 55 53 48 89 fb
RSP: 0000:ffffc900000bbe68 EFLAGS: 00000246
RAX: 0000000000004000 RBX: 0000000000000001 RCX: ffff888237d40000
RDX: 0000000000000001 RSI: ffffffff82cdd4c0 RDI: ffff888100b7bc64
RBP: ffff888101c07000 R08: ffff888100b7bc00 R09: 00000000000000ac
R10: 0000000000000005 R11: ffff888237d5b824 R12: 0000000000000001
R13: ffffffff82cdd4c0 R14: ffffffff82cdd540 R15: 0000000000000000
[<ffffffff8118dab9>] ? sched_clock_cpu+0x9/0xa0
[<ffffffff818d26f8>] acpi_idle_enter+0x48/0xb0
[<ffffffff81ec123c>] cpuidle_enter_state+0x7c/0x2c0
[<ffffffff81ec14b4>] cpuidle_enter+0x24/0x40
[<ffffffff8118e5d4>] do_idle+0x1c4/0x210
[<ffffffff8118e79e>] cpu_startup_entry+0x1e/0x20
[<ffffffff810a8a4a>] start_secondary+0x11a/0x120
[<ffffffff81000103>] secondary_startup_64_no_verify+0xae/0xbb
---[ end trace aac4ad8b95ba31e5 ]---

Michal seems to have run into the same issue[1]. He already correctly
diagnosed that commit a7b359fc6a37 ("sched/fair: Correctly insert
cfs_rq's to list on unthrottle") is causing the preconditions for the
UAF to happen by re-adding cfs_rq's also to task groups that have no
more running tasks, i.e. also to dead ones. His analysis, however,
misses the real root cause and it cannot be seen from the crash
backtrace only, as the real offender is tg_unthrottle_up() getting
called via sched_cfs_period_timer() via the timer interrupt at an
inconvenient time.

When unregister_fair_sched_group() unlinks all cfs_rq's from the dying
task group, it doesn't protect itself from getting interrupted. If the
timer interrupt triggers while we iterate over all CPUs or after
unregister_fair_sched_group() has finished but prior to unlinking the
task group, sched_cfs_period_timer() will execute and walk the list of
task groups, trying to unthrottle cfs_rq's, i.e. re-add them to the
dying task group. These will later -- in free_fair_sched_group() -- be
kfree()'ed while still being linked, leading to the fireworks Kevin and
Michal are seeing.

To fix this race, ensure the dying task group gets unlinked first.
However, simply switching the order of unregistering and unlinking the
task group isn't sufficient, as concurrent RCU walkers might still see
it, as can be seen below:

     CPU1:                                     CPU2:
       :                                       timer IRQ:
       :                                         do_sched_cfs_period_timer():
       :                                           :
       :                                           distribute_cfs_runtime():
       :                                             rcu_read_lock();
       :                                             :
       :                                             unthrottle_cfs_rq():
     sched_offline_group():                            :
       :                                               walk_tg_tree_from(…,tg_unthrottle_up,…):
       list_del_rcu(&tg->list);                          :
(1)    :                                                 list_for_each_entry_rcu(child, &parent->children, siblings)
       :                                                   :
(2)    list_del_rcu(&tg->siblings);                        :
       :                                                   tg_unthrottle_up():
       unregister_fair_sched_group():                        struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
         :                                                   :
         list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);              :
           :                                                 :
           :                                                 if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running)
(3)        :                                                   list_add_leaf_cfs_rq(cfs_rq);
           :                                                 :
           :                                               :
           :                                             :
           :                                           :
           :                                         :
(4)        :                                         rcu_read_unlock();

CPU 2 walks the task group list in parallel to sched_offline_group(),
specifically, it'll read the soon to be unlinked task group entry at
(1). Unlinking it on CPU 1 at (2) therefore won't prevent CPU 2 from
still passing it on to tg_unthrottle_up(). CPU 1 now tries to unlink all
cfs_rq's via list_del_leaf_cfs_rq() in unregister_fair_sched_group().
Meanwhile CPU 2 will re-add some of these at (3), which is the cause of
the UAF later on.

To prevent this additional race from happening, we need to wait until
walk_tg_tree_from() has finished traversing the task groups, i.e. after
the RCU read critical section ends in (4). Afterwards we're safe to call
unregister_fair_sched_group(), as each new walk won't see the dying task
group any more.

Using synchronize_rcu() might be seen as a too heavy hammer to nail this
problem. However, the overall tear down sequence (e.g., as documented in
css_free_rwork_fn()) already relies on quite a few assumptions regarding
execution context and RCU grace periods having passed. Looking at the
autogroup code, which calls sched_destroy_group() directly after
sched_offline_group() and the apparent need to have at least one RCU
grace period expire after unlinking the task group, prior to calling
unregister_fair_sched_group(), there seems to be no better alternative.
Calling unregister_fair_sched_group() via call_rcu() will only lead to
trouble in sched_offline_group() which also relies on (yet another)
expired RCU grace period.

This patch survives Michal's reproducer[2] for 8h+ now, which used to
trigger within minutes before.

[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/

Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
Cc: Odin Ugedal <[email protected]>
Cc: Michal Koutný <[email protected]>
Reported-by: Kevin Tanguy <[email protected]>
Suggested-by: Brad Spengler <[email protected]>
Signed-off-by: Mathias Krause <[email protected]>
---
kernel/sched/core.c | 18 +++++++++++++++---
1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 978460f891a1..60125a6c9d1b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9506,13 +9506,25 @@ void sched_offline_group(struct task_group *tg)
{
unsigned long flags;

- /* End participation in shares distribution: */
- unregister_fair_sched_group(tg);
-
+ /*
+ * Unlink first, to avoid walk_tg_tree_from() from finding us (via
+ * sched_cfs_period_timer()).
+ */
spin_lock_irqsave(&task_group_lock, flags);
list_del_rcu(&tg->list);
list_del_rcu(&tg->siblings);
spin_unlock_irqrestore(&task_group_lock, flags);
+
+ /*
+ * Wait for all pending users of this task group to leave their RCU
+ * critical section to ensure no new user will see our dying task
+ * group any more. Specifically ensure that tg_unthrottle_up() won't
+ * add decayed cfs_rq's to it.
+ */
+ synchronize_rcu();
+
+ /* End participation in shares distribution: */
+ unregister_fair_sched_group(tg);
}

static void sched_change_group(struct task_struct *tsk, int type)
--
2.30.2

2021-11-03 22:06:18

by Benjamin Segall

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Prevent dead task groups from regaining cfs_rq's

Mathias Krause <[email protected]> writes:

> Kevin is reporting crashes which point to a use-after-free of a cfs_rq
> in update_blocked_averages(). Initial debugging revealed that we've live
> cfs_rq's (on_list=1) in an about to be kfree()'d task group in
> free_fair_sched_group(). However, it was unclear how that can happen.
> [...]
> Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
> Cc: Odin Ugedal <[email protected]>
> Cc: Michal Koutný <[email protected]>
> Reported-by: Kevin Tanguy <[email protected]>
> Suggested-by: Brad Spengler <[email protected]>
> Signed-off-by: Mathias Krause <[email protected]>
> ---
> kernel/sched/core.c | 18 +++++++++++++++---
> 1 file changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 978460f891a1..60125a6c9d1b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9506,13 +9506,25 @@ void sched_offline_group(struct task_group *tg)
> {
> unsigned long flags;
>
> - /* End participation in shares distribution: */
> - unregister_fair_sched_group(tg);
> -
> + /*
> + * Unlink first, to avoid walk_tg_tree_from() from finding us (via
> + * sched_cfs_period_timer()).
> + */
> spin_lock_irqsave(&task_group_lock, flags);
> list_del_rcu(&tg->list);
> list_del_rcu(&tg->siblings);
> spin_unlock_irqrestore(&task_group_lock, flags);
> +
> + /*
> + * Wait for all pending users of this task group to leave their RCU
> + * critical section to ensure no new user will see our dying task
> + * group any more. Specifically ensure that tg_unthrottle_up() won't
> + * add decayed cfs_rq's to it.
> + */
> + synchronize_rcu();

I was going to suggest that we could just clear all of avg.load_sum/etc, but
that breaks the speculative on_list read. Currently the final avg update
just races, but that's not good enough if we wanted to rely on it to
prevent UAF. synchronize_rcu() doesn't look so bad if the alternative is
taking every rqlock anyways.

I do wonder if we can move the relevant part of
unregister_fair_sched_group into sched_free_group_rcu. After all
for_each_leaf_cfs_rq_safe is not _rcu and update_blocked_averages does
in fact hold the rqlock (though print_cfs_stats thinks it is _rcu and
should be updated).
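
For context, a paraphrased sketch of the speculative on_list read mentioned
above -- not the verbatim source, locking details elided:

void unregister_fair_sched_group(struct task_group *tg)
{
        int cpu;

        for_each_possible_cpu(cpu) {
                if (tg->se[cpu])
                        remove_entity_load_avg(tg->se[cpu]);

                /*
                 * Unlocked, speculative read: relies on the assumption that a
                 * dead group's cfs_rq's never get re-added -- exactly the
                 * assumption broken by tg_unthrottle_up() in this race.
                 */
                if (!tg->cfs_rq[cpu]->on_list)
                        continue;

                /* rq->lock is taken here in the real code ... */
                list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
                /* ... and released again */
        }
}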


> +
> + /* End participation in shares distribution: */
> + unregister_fair_sched_group(tg);
> }
>
> static void sched_change_group(struct task_struct *tsk, int type)

2021-11-04 08:52:07

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Prevent dead task groups from regaining cfs_rq's

On Wed, 3 Nov 2021 at 23:04, Benjamin Segall <[email protected]> wrote:
>
> Mathias Krause <[email protected]> writes:
>
> > Kevin is reporting crashes which point to a use-after-free of a cfs_rq
> > in update_blocked_averages(). Initial debugging revealed that we've live
> > cfs_rq's (on_list=1) in an about to be kfree()'d task group in
> > free_fair_sched_group(). However, it was unclear how that can happen.
> > [...]
> > Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
> > Cc: Odin Ugedal <[email protected]>
> > Cc: Michal Koutný <[email protected]>
> > Reported-by: Kevin Tanguy <[email protected]>
> > Suggested-by: Brad Spengler <[email protected]>
> > Signed-off-by: Mathias Krause <[email protected]>
> > ---
> > kernel/sched/core.c | 18 +++++++++++++++---
> > 1 file changed, 15 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 978460f891a1..60125a6c9d1b 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -9506,13 +9506,25 @@ void sched_offline_group(struct task_group *tg)
> > {
> > unsigned long flags;
> >
> > - /* End participation in shares distribution: */
> > - unregister_fair_sched_group(tg);
> > -
> > + /*
> > + * Unlink first, to avoid walk_tg_tree_from() from finding us (via
> > + * sched_cfs_period_timer()).
> > + */
> > spin_lock_irqsave(&task_group_lock, flags);
> > list_del_rcu(&tg->list);
> > list_del_rcu(&tg->siblings);
> > spin_unlock_irqrestore(&task_group_lock, flags);
> > +
> > + /*
> > + * Wait for all pending users of this task group to leave their RCU
> > + * critical section to ensure no new user will see our dying task
> > + * group any more. Specifically ensure that tg_unthrottle_up() won't
> > + * add decayed cfs_rq's to it.
> > + */
> > + synchronize_rcu();
>
> I was going to suggest that we could just clear all of avg.load_sum/etc, but
> that breaks the speculative on_list read. Currently the final avg update
> just races, but that's not good enough if we wanted to rely on it to
> prevent UAF. synchronize_rcu() doesn't look so bad if the alternative is
> taking every rqlock anyways.
>
> I do wonder if we can move the relevant part of
> unregister_fair_sched_group into sched_free_group_rcu. After all
> for_each_leaf_cfs_rq_safe is not _rcu and update_blocked_averages does
> in fact hold the rqlock (though print_cfs_stats thinks it is _rcu and
> should be updated).

I was wondering the same thing.
we would have to move unregister_fair_sched_group() completely in
sched_free_group_rcu() and probably in cpu_cgroup_css_free() too.

remove_entity_load_avg(tg->se[cpu]); has to be called only once we are
sure that se->my_q will not be updated any more, which means it is no
longer on the leaf_list

>
>
> > +
> > + /* End participation in shares distribution: */
> > + unregister_fair_sched_group(tg);
> > }
> >
> > static void sched_change_group(struct task_struct *tsk, int type)

2021-11-04 15:15:11

by Mathias Krause

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Prevent dead task groups from regaining cfs_rq's

Am 04.11.21 um 09:50 schrieb Vincent Guittot:
> On Wed, 3 Nov 2021 at 23:04, Benjamin Segall <[email protected]> wrote:
>>
>> Mathias Krause <[email protected]> writes:
>>
>>> Kevin is reporting crashes which point to a use-after-free of a cfs_rq
>>> in update_blocked_averages(). Initial debugging revealed that we've live
>>> cfs_rq's (on_list=1) in an about to be kfree()'d task group in
>>> free_fair_sched_group(). However, it was unclear how that can happen.
>>> [...]
>>> Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
>>> Cc: Odin Ugedal <[email protected]>
>>> Cc: Michal Koutný <[email protected]>
>>> Reported-by: Kevin Tanguy <[email protected]>
>>> Suggested-by: Brad Spengler <[email protected]>
>>> Signed-off-by: Mathias Krause <[email protected]>
>>> ---
>>> kernel/sched/core.c | 18 +++++++++++++++---
>>> 1 file changed, 15 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index 978460f891a1..60125a6c9d1b 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -9506,13 +9506,25 @@ void sched_offline_group(struct task_group *tg)
>>> {
>>> unsigned long flags;
>>>
>>> - /* End participation in shares distribution: */
>>> - unregister_fair_sched_group(tg);
>>> -
>>> + /*
>>> + * Unlink first, to avoid walk_tg_tree_from() from finding us (via
>>> + * sched_cfs_period_timer()).
>>> + */
>>> spin_lock_irqsave(&task_group_lock, flags);
>>> list_del_rcu(&tg->list);
>>> list_del_rcu(&tg->siblings);
>>> spin_unlock_irqrestore(&task_group_lock, flags);
>>> +
>>> + /*
>>> + * Wait for all pending users of this task group to leave their RCU
>>> + * critical section to ensure no new user will see our dying task
>>> + * group any more. Specifically ensure that tg_unthrottle_up() won't
>>> + * add decayed cfs_rq's to it.
>>> + */
>>> + synchronize_rcu();
>>
>> I was going to suggest that we could just clear all of avg.load_sum/etc, but
>> that breaks the speculative on_list read. Currently the final avg update
>> just races, but that's not good enough if we wanted to rely on it to
>> prevent UAF. synchronize_rcu() doesn't look so bad if the alternative is
>> taking every rqlock anyways.
>>
>> I do wonder if we can move the relevant part of
>> unregister_fair_sched_group into sched_free_group_rcu. After all
>> for_each_leaf_cfs_rq_safe is not _rcu and update_blocked_averages does
>> in fact hold the rqlock (though print_cfs_stats thinks it is _rcu and
>> should be updated).
>
> I was wondering the same thing.
> we would have to move unregister_fair_sched_group() completely in
> sched_free_group_rcu() and probably in cpu_cgroup_css_free() too.

Well, the point is, print_cfs_stats() pretty much relies on the list to
be stable, i.e. safe to traverse. It doesn't take locks while walking it
(besides the RCU lock), so if we modified it concurrently, bad things
would happen.

Now, all manipulations of &cfs_rq->leaf_cfs_rq_list happen via the
list_*_rcu() variants, so that assumption in print_cfs_stats() is
actually sound -- the list chain can be walked without fear of
dereferencing bad pointers. But if we moved the unchaining of
cfs_rq's down to sched_free_group_rcu() (or free_fair_sched_group(), as
that gets called more or less directly from sched_free_group_rcu()), we
would have the unchaining and kfree() of cfs_rq's in the same RCU GP.
Or, ignoring RCU for a moment, print_cfs_stats() may see a cfs_rq on one
CPU we're about to kfree() on the other. So, at least, the
kfree(tg->cfs_rq[i]) would need to become a kfree_rcu(). But likely all
the others as well, as print_cfs_stats() looks at cfs_rq->tg and
cfs_rq->tg->se[cpu], too.

> remove_entity_load_avg(tg->se[cpu]); has to be called only once we are
> sure that se->my_q will not be updated which means no more in the
> leaf_list

And that would only be after the tg was unlinked in sched_offline_group(),
so walk_tg_tree_from() can't find it any more (to prevent re-linking the
cfs_rq), and after unregister_fair_sched_group() has run. But between the
two an RCU GP is needed. Sure, we can use the existing one by moving
unregister_fair_sched_group() to sched_free_group_rcu() /
cpu_cgroup_css_free(). But then we also need another RCU GP after that,
prior to releasing the involved objects, to ensure print_cfs_stats()
won't be the new UAF subject.

Thanks,
Mathias

2021-11-04 16:51:23

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Prevent dead task groups from regaining cfs_rq's

On Thu, 4 Nov 2021 at 16:13, Mathias Krause <[email protected]> wrote:
>
> Am 04.11.21 um 09:50 schrieb Vincent Guittot:
> > On Wed, 3 Nov 2021 at 23:04, Benjamin Segall <[email protected]> wrote:
> >>
> >> Mathias Krause <[email protected]> writes:
> >>
> >>> Kevin is reporting crashes which point to a use-after-free of a cfs_rq
> >>> in update_blocked_averages(). Initial debugging revealed that we've live
> >>> cfs_rq's (on_list=1) in an about to be kfree()'d task group in
> >>> free_fair_sched_group(). However, it was unclear how that can happen.
> >>> [...]
> >>> Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
> >>> Cc: Odin Ugedal <[email protected]>
> >>> Cc: Michal Koutný <[email protected]>
> >>> Reported-by: Kevin Tanguy <[email protected]>
> >>> Suggested-by: Brad Spengler <[email protected]>
> >>> Signed-off-by: Mathias Krause <[email protected]>
> >>> ---
> >>> kernel/sched/core.c | 18 +++++++++++++++---
> >>> 1 file changed, 15 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >>> index 978460f891a1..60125a6c9d1b 100644
> >>> --- a/kernel/sched/core.c
> >>> +++ b/kernel/sched/core.c
> >>> @@ -9506,13 +9506,25 @@ void sched_offline_group(struct task_group *tg)
> >>> {
> >>> unsigned long flags;
> >>>
> >>> - /* End participation in shares distribution: */
> >>> - unregister_fair_sched_group(tg);
> >>> -
> >>> + /*
> >>> + * Unlink first, to avoid walk_tg_tree_from() from finding us (via
> >>> + * sched_cfs_period_timer()).
> >>> + */
> >>> spin_lock_irqsave(&task_group_lock, flags);
> >>> list_del_rcu(&tg->list);
> >>> list_del_rcu(&tg->siblings);
> >>> spin_unlock_irqrestore(&task_group_lock, flags);
> >>> +
> >>> + /*
> >>> + * Wait for all pending users of this task group to leave their RCU
> >>> + * critical section to ensure no new user will see our dying task
> >>> + * group any more. Specifically ensure that tg_unthrottle_up() won't
> >>> + * add decayed cfs_rq's to it.
> >>> + */
> >>> + synchronize_rcu();
> >>
> >> I was going to suggest that we could just clear all of avg.load_sum/etc, but
> >> that breaks the speculative on_list read. Currently the final avg update
> >> just races, but that's not good enough if we wanted to rely on it to
> >> prevent UAF. synchronize_rcu() doesn't look so bad if the alternative is
> >> taking every rqlock anyways.
> >>
> >> I do wonder if we can move the relevant part of
> >> unregister_fair_sched_group into sched_free_group_rcu. After all
> >> for_each_leaf_cfs_rq_safe is not _rcu and update_blocked_averages does
> >> in fact hold the rqlock (though print_cfs_stats thinks it is _rcu and
> >> should be updated).
> >
> > I was wondering the same thing.
> > we would have to move unregister_fair_sched_group() completely in
> > sched_free_group_rcu() and probably in cpu_cgroup_css_free() too.
>
> Well, the point is, print_cfs_stats() pretty much relies on the list to
> be stable, i.e. safe to traverse. It doesn't take locks while walking it
> (beside the RCU lock), so if we would modify it concurrently, bad things
> would happen.
>
> Now, all manipulations of &cfs_rq->leaf_cfs_rq_list happen via the
> list_*_rcu() variants, so that assumption in print_cfs_stats() is
> actually sound -- the list chain can be walked without fearing to
> dereference bad pointers. But if we would move the unchaining of
> cfs_rq's down to sched_free_group_rcu() (or free_fair_sched_group(), as
> that gets called more or less directly from sched_free_group_rcu()), we
> would have the unchaining and kfree() of cfs_rq's in the same RCU GP.
> Or, ignoring RCU for a moment, print_cfs_stats() may see a cfs_rq on one
> CPU we're about to kfree() on the other. So, at least, the
> kfree(tg->cfs_rq[i]) would need to become a kfree_rcu(). But likely all
> the others as well, as print_cfs_stats() looks at cfs_rq->tg and
> cfs_rq->tg->se[cpu], too.
>
> > remove_entity_load_avg(tg->se[cpu]); has to be called only once we are
> > sure that se->my_q will not be updated which means no more in the
> > leaf_list
>
> And that would be only be after tq was unlinked in sched_offline_group()
> so walk_tg_tree_from() can't find it any more (to prevent re-linking the
> cfs_rq) and after unregister_fair_sched_group() has run. But between the
> two an RCU GP is needed. We can, sure, use the existing one by moving
> unregister_fair_sched_group() to sched_free_group_rcu() /
> cpu_cgroup_css_free(). But then we also need another RCU GP after that,
> prior to releasing the involved objects, to ensure print_cfs_stats()
> won't be the new UAF subject.

Ok so we must have 2 GPs:

list_del_rcu(&tg->siblings);
GP to wait for the end of ongoing walk_tg_tree_from : synchronize_rcu
in your patch
list_del_leaf_cfs_rq(tg->cfs_rq[cpu]); if on_list
remove_entity_load_avg(tg->se[cpu]);
GP to wait for the end of ongoing for_each_leaf_cfs_rq_safe (print_cfs_stats)
kfree everything
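
Illustrative mapping onto the entry points discussed in this thread (a
sketch of the ordering only, not a patch):

/*
 *  sched_offline_group(tg):
 *      list_del_rcu(&tg->list);
 *      list_del_rcu(&tg->siblings);
 *
 *  --- GP #1: waits out rcu_read_lock() sections around
 *      walk_tg_tree_from() / tg_unthrottle_up() ---
 *
 *  unregister_fair_sched_group(tg):
 *      list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);   (if on_list)
 *      remove_entity_load_avg(tg->se[cpu]);
 *
 *  --- GP #2: waits out for_each_leaf_cfs_rq_safe() walkers,
 *      e.g. print_cfs_stats() ---
 *
 *  kfree() of the cfs_rq's, se's and the task_group itself
 */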


>
> Thanks,
> Mathias

2021-11-04 18:47:17

by Mathias Krause

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Prevent dead task groups from regaining cfs_rq's

Am 04.11.21 um 17:49 schrieb Vincent Guittot:
> [snip]
>
> Ok so we must have 2 GPs:
>
> list_del_rcu(&tg->siblings);
> GP to wait for the end of ongoing walk_tg_tree_from : synchronize_rcu
> in your patch
> list_del_leaf_cfs_rq(tg->cfs_rq[cpu]); if on_list
> remove_entity_load_avg(tg->se[cpu]);
> GP to wait for the end of ongoing for_each_leaf_cfs_rq_safe (print_cfs_stats)
> kfree everything

Basically yes, but with my patch we already have these two, as there's
at least one RCU GP between the time sched_offline_group() finishes and
the time sched_free_group() / cpu_cgroup_css_free() starts.

So we either use my patch as-is or move unregister_fair_sched_group() to
free_fair_sched_group() and use kfree_rcu() instead of kfree(). Both
approaches have pros and cons.

Pro for my version is the early unlinking of cfs_rq's for dead task
groups, so no surprises later on. Con is the explicit synchronize_rcu().

Pro for the kfree_rcu() approach is the lack of the explicit
synchronize_rcu() call, so no explicit blocking operation. Con is that
we have cfs_rq's re-added to dead task groups, which feels wrong, and that
we need to find a suitable member to overlap with the rcu_head in each involved
data type.

Which one do you prefer?

Thanks,
Mathias

2021-11-04 18:52:12

by Michal Koutný

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Prevent dead task groups from regaining cfs_rq's

Hi.

On Wed, Nov 03, 2021 at 08:06:13PM +0100, Mathias Krause <[email protected]> wrote:
> When unregister_fair_sched_group() unlinks all cfs_rq's from the dying
> task group, it doesn't protect itself from getting interrupted. If the
> timer interrupt triggers while we iterate over all CPUs or after
> unregister_fair_sched_group() has finished but prior to unlinking the
> task group, sched_cfs_period_timer() will execute and walk the list of
> task groups, trying to unthrottle cfs_rq's, i.e. re-add them to the
> dying task group. These will later -- in free_fair_sched_group() -- be
> kfree()'ed while still being linked, leading to the fireworks Kevin and
> Michal are seeing.

[...]

> CPU1: CPU2:
> : timer IRQ:
> : do_sched_cfs_period_timer():
> : :
> : distribute_cfs_runtime():
> : rcu_read_lock();
> : :
> : unthrottle_cfs_rq():
> sched_offline_group(): :
> : walk_tg_tree_from(…,tg_unthrottle_up,…):
> list_del_rcu(&tg->list); :
> (1) : list_for_each_entry_rcu(child, &parent->children, siblings)
> : :
> (2) list_del_rcu(&tg->siblings); :
> : tg_unthrottle_up():
> unregister_fair_sched_group(): struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
> : :
> list_del_leaf_cfs_rq(tg->cfs_rq[cpu]); :
> : :
> : if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running)
> (3) : list_add_leaf_cfs_rq(cfs_rq);
> : :
> : :
> : :
> : :
> : :
> (4) : rcu_read_unlock();

The list traversal (1) may happen in some scenarios (quota on non-leaf
task_group) but in the presented reproducer, the quota is set on the
leaf task_group. That means it has no children and this list iteration
is irrelevant.
The cause is that walk_tg_tree_from() includes the `from` task_group and
calls tg_unthrottle_up() on it too.
What I mean is that the unlinking of tg->list and tg->siblings is
irrelevant in this case.
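
For illustration, a paraphrased recursive rendering of the traversal (the
kernel's walk_tg_tree_from() is an iterative, goto-based walk; this is not
the verbatim source). It shows why even a childless `from` still has
tg_unthrottle_up() invoked on it:

static int walk_tg_tree_from_sketch(struct task_group *from,
                                    tg_visitor down, tg_visitor up, void *data)
{
        struct task_group *child;
        int ret;

        ret = (*down)(from, data);      /* 'down' visits 'from' itself, too */
        if (ret)
                return ret;

        list_for_each_entry_rcu(child, &from->children, siblings) {
                ret = walk_tg_tree_from_sketch(child, down, up, data);
                if (ret)
                        return ret;
        }

        return (*up)(from, data);       /* 'up' -- tg_unthrottle_up() here -- runs on 'from' as well */
}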

The timer can still fire after
sched_offline_group()/unregister_fair_sched_group() have finished (i.e.
after synchronize_rcu()).


> This patch survives Michal's reproducer[2] for 8h+ now, which used to
> trigger within minutes before.

Note that the reproducer is sensitive to the sleep between the last task
exit and the cgroup rmdir. I assume that the added synchronize_rcu() before
list_del_leaf_cfs_rq() shifted the list removal to after the last timer
callback and prevented re-adding of the offlined task_group in
unthrottle_cfs_rq().

(Of course, it'd be more convincing if I backed this theory with results
from the reproducer crashing again with the increased interval. I may get
down to that later.)

Does your patch fix the crashes also in your real workload?

Michal

2021-11-04 21:54:40

by Benjamin Segall

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Prevent dead task groups from regaining cfs_rq's

Mathias Krause <[email protected]> writes:

> Am 04.11.21 um 09:50 schrieb Vincent Guittot:
>> On Wed, 3 Nov 2021 at 23:04, Benjamin Segall <[email protected]> wrote:
>>>
>>> Mathias Krause <[email protected]> writes:
>>>
>>>> Kevin is reporting crashes which point to a use-after-free of a cfs_rq
>>>> in update_blocked_averages(). Initial debugging revealed that we've live
>>>> cfs_rq's (on_list=1) in an about to be kfree()'d task group in
>>>> free_fair_sched_group(). However, it was unclear how that can happen.
>>>> [...]
>>>> Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
>>>> Cc: Odin Ugedal <[email protected]>
>>>> Cc: Michal Koutný <[email protected]>
>>>> Reported-by: Kevin Tanguy <[email protected]>
>>>> Suggested-by: Brad Spengler <[email protected]>
>>>> Signed-off-by: Mathias Krause <[email protected]>
>>>> ---
>>>> kernel/sched/core.c | 18 +++++++++++++++---
>>>> 1 file changed, 15 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>> index 978460f891a1..60125a6c9d1b 100644
>>>> --- a/kernel/sched/core.c
>>>> +++ b/kernel/sched/core.c
>>>> @@ -9506,13 +9506,25 @@ void sched_offline_group(struct task_group *tg)
>>>> {
>>>> unsigned long flags;
>>>>
>>>> - /* End participation in shares distribution: */
>>>> - unregister_fair_sched_group(tg);
>>>> -
>>>> + /*
>>>> + * Unlink first, to avoid walk_tg_tree_from() from finding us (via
>>>> + * sched_cfs_period_timer()).
>>>> + */
>>>> spin_lock_irqsave(&task_group_lock, flags);
>>>> list_del_rcu(&tg->list);
>>>> list_del_rcu(&tg->siblings);
>>>> spin_unlock_irqrestore(&task_group_lock, flags);
>>>> +
>>>> + /*
>>>> + * Wait for all pending users of this task group to leave their RCU
>>>> + * critical section to ensure no new user will see our dying task
>>>> + * group any more. Specifically ensure that tg_unthrottle_up() won't
>>>> + * add decayed cfs_rq's to it.
>>>> + */
>>>> + synchronize_rcu();
>>>
>>> I was going to suggest that we could just clear all of avg.load_sum/etc, but
>>> that breaks the speculative on_list read. Currently the final avg update
>>> just races, but that's not good enough if we wanted to rely on it to
>>> prevent UAF. synchronize_rcu() doesn't look so bad if the alternative is
>>> taking every rqlock anyways.
>>>
>>> I do wonder if we can move the relevant part of
>>> unregister_fair_sched_group into sched_free_group_rcu. After all
>>> for_each_leaf_cfs_rq_safe is not _rcu and update_blocked_averages does
>>> in fact hold the rqlock (though print_cfs_stats thinks it is _rcu and
>>> should be updated).
>>
>> I was wondering the same thing.
>> we would have to move unregister_fair_sched_group() completely in
>> sched_free_group_rcu() and probably in cpu_cgroup_css_free() too.
>
> Well, the point is, print_cfs_stats() pretty much relies on the list to
> be stable, i.e. safe to traverse. It doesn't take locks while walking it
> (beside the RCU lock), so if we would modify it concurrently, bad things
> would happen.


Yeah, my idea was that print_cfs_stats is just debug info so it wouldn't
be a disaster to hold the lock, but I forgot that we probably can't hold
it over all that writing.

2021-11-05 14:27:19

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Prevent dead task groups from regaining cfs_rq's

On Thu, 4 Nov 2021 at 18:37, Mathias Krause <[email protected]> wrote:
>
> Am 04.11.21 um 17:49 schrieb Vincent Guittot:
> > [snip]
> >
> > Ok so we must have 2 GPs:
> >
> > list_del_rcu(&tg->siblings);
> > GP to wait for the end of ongoing walk_tg_tree_from : synchronize_rcu
> > in your patch
> > list_del_leaf_cfs_rq(tg->cfs_rq[cpu]); if on_list
> > remove_entity_load_avg(tg->se[cpu]);
> > GP to wait for the end of ongoing for_each_leaf_cfs_rq_safe (print_cfs_stats)
> > kfree everything
>
> Basically yes, but with my patch we already have these two, as there's
> at least one RCU GP between after sched_offline_group() finishes and
> sched_free_group() / cpu_cgroup_css_free() starts.
>
> So we either use my patch as-is or move unregister_fair_sched_group() to
> free_fair_sched_group() and use kfree_rcu() instead of kfree(). Both
> approaches have pros and cons.
>
> Pro for my version is the early unlinking of cfs_rq's for dead task
> groups, so no surprises later on. Con is the explicit synchronize_rcu().

which blocks the caller and could be problematic

It seems that LKP has reported such an issue:
20211104145128.GC6499@xsang-OptiPlex-9020

>
> Pro for the kfree_rcu() approach is the lack of the explicit
> synchronize_rcu() call, so no explicit blocking operation. Con is that
> we have cfs_rq's re-added to dead task groups which feels wrong and need
> to find a suitable member to overlap with the rcu_head in each involved
> data type.
>
> Which one do you prefer?
>
> Thanks,
> Mathias

2021-11-05 14:48:02

by Mathias Krause

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Prevent dead task groups from regaining cfs_rq's

Am 05.11.21 um 15:25 schrieb Vincent Guittot:
> On Thu, 4 Nov 2021 at 18:37, Mathias Krause <[email protected]> wrote:
>>
>> Am 04.11.21 um 17:49 schrieb Vincent Guittot:
>>> [snip]
>>>
>>> Ok so we must have 2 GPs:
>>>
>>> list_del_rcu(&tg->siblings);
>>> GP to wait for the end of ongoing walk_tg_tree_from : synchronize_rcu
>>> in your patch
>>> list_del_leaf_cfs_rq(tg->cfs_rq[cpu]); if on_list
>>> remove_entity_load_avg(tg->se[cpu]);
>>> GP to wait for the end of ongoing for_each_leaf_cfs_rq_safe (print_cfs_stats)
>>> kfree everything
>>
>> Basically yes, but with my patch we already have these two, as there's
>> at least one RCU GP between after sched_offline_group() finishes and
>> sched_free_group() / cpu_cgroup_css_free() starts.
>>
>> So we either use my patch as-is or move unregister_fair_sched_group() to
>> free_fair_sched_group() and use kfree_rcu() instead of kfree(). Both
>> approaches have pros and cons.
>>
>> Pro for my version is the early unlinking of cfs_rq's for dead task
>> groups, so no surprises later on. Con is the explicit synchronize_rcu().
>
> which blocks the caller and could be problematic
>
> It seems that LKP has reported such issue:
> 20211104145128.GC6499@xsang-OptiPlex-9020

Heh, indeed.

>>
>> Pro for the kfree_rcu() approach is the lack of the explicit
>> synchronize_rcu() call, so no explicit blocking operation. Con is that
>> we have cfs_rq's re-added to dead task groups which feels wrong and need
>> to find a suitable member to overlap with the rcu_head in each involved
>> data type.
>>
>> Which one do you prefer?

Looks like it needs to be the kfree_rcu() one in this case. I'll prepare
a patch.

Thanks,
Mathias

2021-11-05 16:42:44

by Mathias Krause

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Prevent dead task groups from regaining cfs_rq's

Am 04.11.21 um 19:49 schrieb Michal Koutný:
> On Wed, Nov 03, 2021 at 08:06:13PM +0100, Mathias Krause <[email protected]> wrote:
>> When unregister_fair_sched_group() unlinks all cfs_rq's from the dying
>> task group, it doesn't protect itself from getting interrupted. If the
>> timer interrupt triggers while we iterate over all CPUs or after
>> unregister_fair_sched_group() has finished but prior to unlinking the
>> task group, sched_cfs_period_timer() will execute and walk the list of
>> task groups, trying to unthrottle cfs_rq's, i.e. re-add them to the
>> dying task group. These will later -- in free_fair_sched_group() -- be
>> kfree()'ed while still being linked, leading to the fireworks Kevin and
>> Michal are seeing.
>
> [...]
>
>> CPU1: CPU2:
>> : timer IRQ:
>> : do_sched_cfs_period_timer():
>> : :
>> : distribute_cfs_runtime():
>> : rcu_read_lock();
>> : :
>> : unthrottle_cfs_rq():
>> sched_offline_group(): :
>> : walk_tg_tree_from(…,tg_unthrottle_up,…):
>> list_del_rcu(&tg->list); :
>> (1) : list_for_each_entry_rcu(child, &parent->children, siblings)
>> : :
>> (2) list_del_rcu(&tg->siblings); :
>> : tg_unthrottle_up():
>> unregister_fair_sched_group(): struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
>> : :
>> list_del_leaf_cfs_rq(tg->cfs_rq[cpu]); :
>> : :
>> : if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running)
>> (3) : list_add_leaf_cfs_rq(cfs_rq);
>> : :
>> : :
>> : :
>> : :
>> : :
>> (4) : rcu_read_unlock();
>
> The list traversal (1) may happen in some scenarios (quota on non-leaf
> task_group) but in the presented reproducer, the quota is set on the
> leaf task_group. That means it has no children and this list iteration
> is irrelevant.
> The cause is that walk_tg_tree_from includes `from` task_group and
> calls tg_unthrottle_up() on it too.
> What I mean is that the unlinking of tg->list and tg->siblings is
> irrelevant in this case.

Interesting.

> The timer can still fire after
> sched_offline_group()/unregister_fair_sched_group() finished (i.e. after
> synchronize_rcu())

Yeah, I also noticed the timer gets disabled rather late, in
free_fair_sched_group() via destroy_cfs_bandwidth(). But as I saw no
more warnings from my debug patch, I was under the impression that
do_sched_cfs_period_timer() won't see this task group any more.
Apparently, this is not true?

Anyhow, see below.

>> This patch survives Michal's reproducer[2] for 8h+ now, which used to
>> trigger within minutes before.
>
> Note that the reproducer is sensitive to the sleep between last task
> exit and cgroup rmdir. I assume that the added synchronize_rcu() before
> list_del_leaf_cfs_rq() shifted the list removal after the last timer
> callback and prevented re-adding of the offlined task_group in
> unthrottle_cfs_rq().

As Vincent reported in the other thread, synchronize_rcu() is actually
problematic, as we're not allowed to block here. :( So I'd go for the
kfree_rcu() route and move unregister_fair_sched_group() to
free_fair_sched_group(), after disabling the timers.

> (Of course, it'd more convincing if I backed this theory by results from
> the reproducer with the increased interval to crash again. I may get
> down to that later.)
>
> Does your patch fix the crashes also in your real workload?

I haven't heard back from Kevin since. But he might just be busy.

Thanks,
Mathias

2021-11-05 16:48:34

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Prevent dead task groups from regaining cfs_rq's


TJ, long email below, but TL;DR: is it ok to do synchronize_rcu() from
a css_released callback? We can't use the css_offline callback because of
commit 2f5177f0fd7e ("sched/cgroup: Fix/cleanup cgroup teardown/init"),
but we do need a second GP, as per the below.

On Wed, Nov 03, 2021 at 08:06:13PM +0100, Mathias Krause wrote:
> Kevin is reporting crashes which point to a use-after-free of a cfs_rq
> in update_blocked_averages(). Initial debugging revealed that we've live
> cfs_rq's (on_list=1) in an about to be kfree()'d task group in
> free_fair_sched_group(). However, it was unclear how that can happen.
>
> His kernel config happened to lead to a layout of struct sched_entity
> that put the 'my_q' member directly into the middle of the object which
> makes it incidentally overlap with SLUB's freelist pointer. That, in
> combination with SLAB_FREELIST_HARDENED's freelist pointer mangling,
> leads to a reliable access violation in form of a #GP which made the UAF
> fail fast, e.g. like this:
>
> general protection fault, probably for non-canonical address 0xff80f68888f69107: 0000 [#1] SMP
> CPU: 5 PID: 0 Comm: swapper/5 Not tainted 5.14.13-custom+ #3
> Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 12/12/2018
> RIP: 0010:[<ffffffff81194af8>] update_blocked_averages+0x428/0xb90
> Code: 28 01 00 4c 8b 4c 24 10 41 8b 97 c4 00 00 00 85 d2 75 32 4c 89 fe 4c 89 cf e8 74 2c 01 00 49 8b 96 80 00 00 00 48 85 d2 74 0e <48> 83 ba 08 01 00 00 00 0f 85 45 01 00 00 85 c0 0f 84 34 fe ff ff
> RSP: 0018:ffffc9000002bf08 EFLAGS: 00010086
> RAX: 0000000000000001 RBX: ffff8880202eba00 RCX: 000000000000b686
> RDX: ff80f68888f68fff RSI: 0000000000000000 RDI: 000000000000000c
> RBP: ffff888006808a00 R08: ffff888006808a00 R09: 0000000000000000
> R10: 0000000000000008 R11: 0000000000000000 R12: ffff8880202ebb40
> R13: 0000000000000000 R14: ffff8880087e7f00 R15: ffff888006808a00
> FS: 0000000000000000(0000) GS:ffff888237d40000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000315b147b8000 CR3: 0000000002d70000 CR4: 00000000001606f0 shadow CR4: 00000000001606f0
> Stack:
> 0100008237d58b80 0000000000000286 000003ae017023d5 00000000000000f0
> ffff888237d5d240 0000000000000028 ffff888237d5c980 ffff888237d5c900
> ffff888237d5c900 0000000000000001 0000000000000100 0000000000000007
> Call Trace:
> <IRQ>
> [<ffffffff8119952a>] run_rebalance_domains+0x3a/0x60
> [<ffffffff810030bf>] __do_softirq+0xbf/0x1fb
> [<ffffffff81162e7f>] irq_exit_rcu+0x7f/0x90
> [<ffffffff822d0d7e>] sysvec_apic_timer_interrupt+0x6e/0x90
> </IRQ>
> [<ffffffff81001d82>] asm_sysvec_apic_timer_interrupt+0x12/0x20
> RIP: 0010:[<ffffffff822debe9>] acpi_idle_do_entry+0x49/0x50
> Code: 8b 15 2f b3 75 01 ed c3 0f 0b e9 52 fd ff ff 65 48 8b 04 25 40 1c 01 00 48 8b 00 a8 08 75 e8 eb 07 0f 00 2d d9 e2 1d 00 fb f4 <fa> c3 0f 0b cc cc cc 41 55 41 89 d5 41 54 49 89 f4 55 53 48 89 fb
> RSP: 0000:ffffc900000bbe68 EFLAGS: 00000246
> RAX: 0000000000004000 RBX: 0000000000000001 RCX: ffff888237d40000
> RDX: 0000000000000001 RSI: ffffffff82cdd4c0 RDI: ffff888100b7bc64
> RBP: ffff888101c07000 R08: ffff888100b7bc00 R09: 00000000000000ac
> R10: 0000000000000005 R11: ffff888237d5b824 R12: 0000000000000001
> R13: ffffffff82cdd4c0 R14: ffffffff82cdd540 R15: 0000000000000000
> [<ffffffff8118dab9>] ? sched_clock_cpu+0x9/0xa0
> [<ffffffff818d26f8>] acpi_idle_enter+0x48/0xb0
> [<ffffffff81ec123c>] cpuidle_enter_state+0x7c/0x2c0
> [<ffffffff81ec14b4>] cpuidle_enter+0x24/0x40
> [<ffffffff8118e5d4>] do_idle+0x1c4/0x210
> [<ffffffff8118e79e>] cpu_startup_entry+0x1e/0x20
> [<ffffffff810a8a4a>] start_secondary+0x11a/0x120
> [<ffffffff81000103>] secondary_startup_64_no_verify+0xae/0xbb
> ---[ end trace aac4ad8b95ba31e5 ]---
>
> Michal seems to have run into the same issue[1]. He already correctly
> diagnosed that commit a7b359fc6a37 ("sched/fair: Correctly insert
> cfs_rq's to list on unthrottle") is causing the preconditions for the
> UAF to happen by re-adding cfs_rq's also to task groups that have no
> more running tasks, i.e. also to dead ones. His analysis, however,
> misses the real root cause and it cannot be seen from the crash
> backtrace only, as the real offender is tg_unthrottle_up() getting
> called via sched_cfs_period_timer() via the timer interrupt at an
> inconvenient time.
>
> When unregister_fair_sched_group() unlinks all cfs_rq's from the dying
> task group, it doesn't protect itself from getting interrupted. If the
> timer interrupt triggers while we iterate over all CPUs or after
> unregister_fair_sched_group() has finished but prior to unlinking the
> task group, sched_cfs_period_timer() will execute and walk the list of
> task groups, trying to unthrottle cfs_rq's, i.e. re-add them to the
> dying task group. These will later -- in free_fair_sched_group() -- be
> kfree()'ed while still being linked, leading to the fireworks Kevin and
> Michal are seeing.
>
> To fix this race, ensure the dying task group gets unlinked first.
> However, simply switching the order of unregistering and unlinking the
> task group isn't sufficient, as concurrent RCU walkers might still see
> it, as can be seen below:
>
> CPU1: CPU2:
> : timer IRQ:
> : do_sched_cfs_period_timer():
> : :
> : distribute_cfs_runtime():
> : rcu_read_lock();
> : :
> : unthrottle_cfs_rq():
> sched_offline_group(): :
> : walk_tg_tree_from(…,tg_unthrottle_up,…):
> list_del_rcu(&tg->list); :
> (1) : list_for_each_entry_rcu(child, &parent->children, siblings)
> : :
> (2) list_del_rcu(&tg->siblings); :
> : tg_unthrottle_up():
> unregister_fair_sched_group(): struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
> : :
> list_del_leaf_cfs_rq(tg->cfs_rq[cpu]); :
> : :
> : if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running)
> (3) : list_add_leaf_cfs_rq(cfs_rq);
> : :
> : :
> : :
> : :
> : :
> (4) : rcu_read_unlock();
>
> CPU 2 walks the task group list in parallel to sched_offline_group(),
> specifically, it'll read the soon to be unlinked task group entry at
> (1). Unlinking it on CPU 1 at (2) therefore won't prevent CPU 2 from
> still passing it on to tg_unthrottle_up(). CPU 1 now tries to unlink all
> cfs_rq's via list_del_leaf_cfs_rq() in unregister_fair_sched_group().
> Meanwhile CPU 2 will re-add some of these at (3), which is the cause of
> the UAF later on.
>
> To prevent this additional race from happening, we need to wait until
> walk_tg_tree_from() has finished traversing the task groups, i.e. after
> the RCU read critical section ends in (4). Afterwards we're safe to call
> unregister_fair_sched_group(), as each new walk won't see the dying task
> group any more.
>
> Using synchronize_rcu() might be seen as a too heavy hammer to nail this
> problem. However, the overall tear down sequence (e.g., as documented in
> css_free_rwork_fn()) already relies on quite a few assumptions regarding
> execution context and RCU grace periods from passing. Looking at the
> autogroup code, which calls sched_destroy_group() directly after
> sched_offline_group() and the apparent need to have at least one RCU
> grace period expire after unlinking the task group, prior to calling
> unregister_fair_sched_group(), there seems to be no better alternative.
> Calling unregister_fair_sched_group() via call_rcu() will only lead to
> trouble in sched_offline_group() which also relies on (yet another)
> expired RCU grace period.
>
> This patch survives Michal's reproducer[2] for 8h+ now, which used to
> trigger within minutes before.
>
> [1] https://lore.kernel.org/lkml/[email protected]/
> [2] https://lore.kernel.org/lkml/[email protected]/
>
> Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
> Cc: Odin Ugedal <[email protected]>
> Cc: Michal Koutný <[email protected]>
> Reported-by: Kevin Tanguy <[email protected]>
> Suggested-by: Brad Spengler <[email protected]>
> Signed-off-by: Mathias Krause <[email protected]>
> ---
> kernel/sched/core.c | 18 +++++++++++++++---
> 1 file changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 978460f891a1..60125a6c9d1b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9506,13 +9506,25 @@ void sched_offline_group(struct task_group *tg)
> {
> unsigned long flags;
>
> - /* End participation in shares distribution: */
> - unregister_fair_sched_group(tg);
> -
> + /*
> + * Unlink first, to avoid walk_tg_tree_from() from finding us (via
> + * sched_cfs_period_timer()).
> + */
> spin_lock_irqsave(&task_group_lock, flags);
> list_del_rcu(&tg->list);
> list_del_rcu(&tg->siblings);
> spin_unlock_irqrestore(&task_group_lock, flags);
> +
> + /*
> + * Wait for all pending users of this task group to leave their RCU
> + * critical section to ensure no new user will see our dying task
> + * group any more. Specifically ensure that tg_unthrottle_up() won't
> + * add decayed cfs_rq's to it.
> + */
> + synchronize_rcu();
> +
> + /* End participation in shares distribution: */
> + unregister_fair_sched_group(tg);
> }
>
> static void sched_change_group(struct task_struct *tsk, int type)
> --
> 2.30.2
>

2021-11-05 17:14:25

by Mathias Krause

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Prevent dead task groups from regaining cfs_rq's

> Looks like it needs to be the kfree_rcu() one in this case. I'll prepare
> a patch.

Testing the below patch right now. Looking good so far. Will prepare a
proper patch later, if we all can agree that this covers all cases.

But the basic idea is to defer the kfree()'s to after the next RCU GP,
which also means we need to free the tg object itself later. Slightly
ugly. :/

Thanks,
Mathias

--8<--

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 978460f891a1..8b4c849bc892 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9439,12 +9439,19 @@ static inline void alloc_uclamp_sched_group(struct task_group *tg,
#endif
}

+void tg_free(struct task_group *tg)
+{
+ kmem_cache_free(task_group_cache, tg);
+}
+
static void sched_free_group(struct task_group *tg)
{
- free_fair_sched_group(tg);
+ bool delayed_free;
+ delayed_free = free_fair_sched_group(tg);
free_rt_sched_group(tg);
autogroup_free(tg);
- kmem_cache_free(task_group_cache, tg);
+ if (!delayed_free)
+ tg_free(tg);
}

/* allocate runqueue etc for a new task group */
@@ -9506,9 +9513,19 @@ void sched_offline_group(struct task_group *tg)
{
unsigned long flags;

- /* End participation in shares distribution: */
- unregister_fair_sched_group(tg);
-
+ /*
+ * Unlink first, to avoid walk_tg_tree_from() from finding us (via
+ * sched_cfs_period_timer()).
+ *
+ * For this to be effective, we have to wait for all pending users of
+ * this task group to leave their RCU critical section to ensure no new
+ * user will see our dying task group any more. Specifically ensure
+ * that tg_unthrottle_up() won't add decayed cfs_rq's to it.
+ *
+ * We therefore defer calling unregister_fair_sched_group() to
+ * sched_free_group() which is guaranteed to get called only after the
+ * current RCU grace period has expired.
+ */
spin_lock_irqsave(&task_group_lock, flags);
list_del_rcu(&tg->list);
list_del_rcu(&tg->siblings);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 567c571d624f..54c1f7b571e4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11287,12 +11287,11 @@ static void task_change_group_fair(struct task_struct *p, int type)
}
}

-void free_fair_sched_group(struct task_group *tg)
+static void free_tg(struct rcu_head *rcu)
{
+ struct task_group *tg = container_of(rcu, struct task_group, rcu);
int i;

- destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
-
for_each_possible_cpu(i) {
if (tg->cfs_rq)
kfree(tg->cfs_rq[i]);
@@ -11302,6 +11301,19 @@ void free_fair_sched_group(struct task_group *tg)

kfree(tg->cfs_rq);
kfree(tg->se);
+ tg_free(tg);
+}
+
+bool free_fair_sched_group(struct task_group *tg)
+{
+ destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
+ unregister_fair_sched_group(tg);
+ /*
+ * We have to wait for yet another RCU grace period to expire, as
+ * print_cfs_stats() might run concurrently.
+ */
+ call_rcu(&tg->rcu, free_tg);
+ return true;
}

int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
@@ -11459,7 +11471,10 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
}
#else /* CONFIG_FAIR_GROUP_SCHED */

-void free_fair_sched_group(struct task_group *tg) { }
+bool free_fair_sched_group(struct task_group *tg)
+{
+ return false;
+}

int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
{
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index eaca971a3ee2..b45ba45d8bdc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -437,6 +437,8 @@ struct task_group {

};

+extern void tg_free(struct task_group *tg);
+
#ifdef CONFIG_FAIR_GROUP_SCHED
#define ROOT_TASK_GROUP_LOAD NICE_0_LOAD

@@ -470,7 +472,7 @@ static inline int walk_tg_tree(tg_visitor down, tg_visitor up, void *data)

extern int tg_nop(struct task_group *tg, void *data);

-extern void free_fair_sched_group(struct task_group *tg);
+extern bool free_fair_sched_group(struct task_group *tg);
extern int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent);
extern void online_fair_sched_group(struct task_group *tg);
extern void unregister_fair_sched_group(struct task_group *tg);

2021-11-05 17:27:57

by Mathias Krause

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Prevent dead task groups from regaining cfs_rq's

Am 05.11.21 um 17:58 schrieb Peter Zijlstra:
> On Fri, Nov 05, 2021 at 05:29:14PM +0100, Mathias Krause wrote:
>>> Looks like it needs to be the kfree_rcu() one in this case. I'll prepare
>>> a patch.
>>
>> Testing the below patch right now. Looking good so far. Will prepare a
>> proper patch later, if we all can agree that this covers all cases.
>>
>> But the basic idea is to defer the kfree()'s to after the next RCU GP,
>> which also means we need to free the tg object itself later. Slightly
>> ugly. :/
>
> Can't we add an rcu_head to struct task_group and simply call_rcu() the
> thing with a free function?

There is already one and this patch makes use of it for the second RCU
GP requirement. Basically, the patch is doing what you request, no? See
the new free_fair_sched_group().

Thanks,
Mathias

2021-11-05 18:59:08

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Prevent dead task groups from regaining cfs_rq's

On Fri, Nov 05, 2021 at 05:29:14PM +0100, Mathias Krause wrote:
> > Looks like it needs to be the kfree_rcu() one in this case. I'll prepare
> > a patch.
>
> Testing the below patch right now. Looking good so far. Will prepare a
> proper patch later, if we all can agree that this covers all cases.
>
> But the basic idea is to defer the kfree()'s to after the next RCU GP,
> which also means we need to free the tg object itself later. Slightly
> ugly. :/

Can't we add an rcu_head to struct task_group and simply call_rcu() the
thing with a free function?

2021-11-05 19:28:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Prevent dead task groups from regaining cfs_rq's

On Fri, Nov 05, 2021 at 06:14:33PM +0100, Mathias Krause wrote:
> Am 05.11.21 um 17:58 schrieb Peter Zijlstra:
> > On Fri, Nov 05, 2021 at 05:29:14PM +0100, Mathias Krause wrote:
> >>> Looks like it needs to be the kfree_rcu() one in this case. I'll prepare
> >>> a patch.
> >>
> >> Testing the below patch right now. Looking good so far. Will prepare a
> >> proper patch later, if we all can agree that this covers all cases.
> >>
> >> But the basic idea is to defer the kfree()'s to after the next RCU GP,
> >> which also means we need to free the tg object itself later. Slightly
> >> ugly. :/
> >
> > Can't we add an rcu_head to struct task_group and simply call_rcu() the
> > thing with a free function?
>
> There is already one and this patch makes use of it for the second RCU
> GP requirement. Basically, the patch is doing what you request, no? See
> the new free_fair_sched_group().

For some reason I thought you still did kfree_rcu(), I suppose reading
is hard. I'll give it another go after dinner.

2021-11-05 21:21:28

by Mathias Krause

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Prevent dead task groups from regaining cfs_rq's

Am 05.11.21 um 18:27 schrieb Peter Zijlstra:
> On Fri, Nov 05, 2021 at 06:14:33PM +0100, Mathias Krause wrote:
>> Am 05.11.21 um 17:58 schrieb Peter Zijlstra:
>>> On Fri, Nov 05, 2021 at 05:29:14PM +0100, Mathias Krause wrote:
>>>>> Looks like it needs to be the kfree_rcu() one in this case. I'll prepare
>>>>> a patch.
>>>>
>>>> Testing the below patch right now. Looking good so far. Will prepare a
>>>> proper patch later, if we all can agree that this covers all cases.
>>>>
>>>> But the basic idea is to defer the kfree()'s to after the next RCU GP,
>>>> which also means we need to free the tg object itself later. Slightly
>>>> ugly. :/
>>>
>>> Can't we add an rcu_head to struct task_group and simply call_rcu() the
>>> thing with a free function?
>>
>> There is already one and this patch makes use of it for the second RCU
>> GP requirement. Basically, the patch is doing what you request, no? See
>> the new free_fair_sched_group().
>
> For some reason I thought you still did kfree_rcu(), I suppose reading
> is hard. I'll give it another go after dinner.

I wanted to go for kfree_rcu() initially, true. But after realizing
that adding an rcu_head to struct cfs_rq, sched_entity and task_group
(which already has one) might be too much for what's needed, I went the
call_rcu() route instead and re-used the rcu_head of task_group.

Actually re-using the rcu_head in task_group is safe, as we'll use it
only sequentially: first in sched_destroy_group() to schedule
sched_free_group_rcu() and then, when it's executing, in
free_fair_sched_group() to schedule free_tg().

rcu_head's get unlinked prior to getting their callback function
invoked, which makes the above a valid use case.
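
Spelled out (names taken from the draft patch above; illustrative ordering
only, not the final code):

/*
 *  sched_destroy_group(tg):
 *      call_rcu(&tg->rcu, sched_free_group_rcu);    (first use of tg->rcu)
 *
 *  ... GP #1; tg->rcu is unlinked before the callback runs ...
 *
 *  sched_free_group_rcu() -> sched_free_group() -> free_fair_sched_group(tg):
 *      destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
 *      unregister_fair_sched_group(tg);
 *      call_rcu(&tg->rcu, free_tg);                 (second, sequential use)
 *
 *  ... GP #2; concurrent print_cfs_stats() walkers have finished ...
 *
 *  free_tg(): kfree() the cfs_rq's, se's and finally the tg itself
 */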

Thanks,
Mathias

2021-11-06 18:39:05

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Prevent dead task groups from regaining cfs_rq's

On Fri, Nov 05, 2021 at 05:29:14PM +0100, Mathias Krause wrote:
> > Looks like it needs to be the kfree_rcu() one in this case. I'll prepare
> > a patch.
>
> Testing the below patch right now. Looking good so far. Will prepare a
> proper patch later, if we all can agree that this covers all cases.
>
> But the basic idea is to defer the kfree()'s to after the next RCU GP,
> which also means we need to free the tg object itself later. Slightly
> ugly. :/

How's this then?

---
diff --git a/kernel/sched/autogroup.c b/kernel/sched/autogroup.c
index 2067080bb235..8629b37d118e 100644
--- a/kernel/sched/autogroup.c
+++ b/kernel/sched/autogroup.c
@@ -31,7 +31,7 @@ static inline void autogroup_destroy(struct kref *kref)
ag->tg->rt_se = NULL;
ag->tg->rt_rq = NULL;
#endif
- sched_offline_group(ag->tg);
+ sched_release_group(ag->tg);
sched_destroy_group(ag->tg);
}

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9cb81ef8acc8..22528bd61ba5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9715,6 +9715,21 @@ static void sched_free_group(struct task_group *tg)
kmem_cache_free(task_group_cache, tg);
}

+static void sched_free_group_rcu(struct rcu_head *rcu)
+{
+ sched_free_group(container_of(rcu, struct task_group, rcu_head));
+}
+
+static void sched_unregister_group(struct task_group *tg)
+{
+ unregister_fair_sched_group(tg);
+ /*
+ * We have to wait for yet another RCU grace period to expire, as
+ * print_cfs_stats() might run concurrently.
+ */
+ call_rcu(&tg->rcu, sched_free_group_rcu);
+}
+
/* allocate runqueue etc for a new task group */
struct task_group *sched_create_group(struct task_group *parent)
{
@@ -9735,7 +9750,7 @@ struct task_group *sched_create_group(struct task_group *parent)
return tg;

err:
- sched_free_group(tg);
+ sched_unregister_group(tg);
return ERR_PTR(-ENOMEM);
}

@@ -9758,25 +9773,35 @@ void sched_online_group(struct task_group *tg, struct task_group *parent)
}

/* rcu callback to free various structures associated with a task group */
-static void sched_free_group_rcu(struct rcu_head *rhp)
+static void sched_unregister_group_rcu(struct rcu_head *rhp)
{
/* Now it should be safe to free those cfs_rqs: */
- sched_free_group(container_of(rhp, struct task_group, rcu));
+ sched_unregister_group(container_of(rhp, struct task_group, rcu));
}

void sched_destroy_group(struct task_group *tg)
{
/* Wait for possible concurrent references to cfs_rqs complete: */
- call_rcu(&tg->rcu, sched_free_group_rcu);
+ call_rcu(&tg->rcu, sched_unregister_group_rcu);
}

-void sched_offline_group(struct task_group *tg)
+void sched_release_group(struct task_group *tg)
{
unsigned long flags;

- /* End participation in shares distribution: */
- unregister_fair_sched_group(tg);
-
+ /*
+ * Unlink first, to avoid walk_tg_tree_from() from finding us (via
+ * sched_cfs_period_timer()).
+ *
+ * For this to be effective, we have to wait for all pending users of
+ * this task group to leave their RCU critical section to ensure no new
+ * user will see our dying task group any more. Specifically ensure
+ * that tg_unthrottle_up() won't add decayed cfs_rq's to it.
+ *
+ * We therefore defer calling unregister_fair_sched_group() to
+ * sched_unregister_group() which is guaranteed to get called only after the
+ * current RCU grace period has expired.
+ */
spin_lock_irqsave(&task_group_lock, flags);
list_del_rcu(&tg->list);
list_del_rcu(&tg->siblings);
@@ -9895,7 +9920,7 @@ static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
{
struct task_group *tg = css_tg(css);

- sched_offline_group(tg);
+ sched_release_group(tg);
}

static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
@@ -9905,7 +9930,7 @@ static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
/*
* Relies on the RCU grace period between css_released() and this.
*/
- sched_free_group(tg);
+ sched_unregister_group(tg);
}

/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f0b249ec581d..20038274c57b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -504,7 +504,7 @@ extern struct task_group *sched_create_group(struct task_group *parent);
extern void sched_online_group(struct task_group *tg,
struct task_group *parent);
extern void sched_destroy_group(struct task_group *tg);
-extern void sched_offline_group(struct task_group *tg);
+extern void sched_release_group(struct task_group *tg);

extern void sched_move_task(struct task_struct *tsk);

2021-11-08 14:22:39

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Prevent dead task groups from regaining cfs_rq's

On Mon, Nov 08, 2021 at 11:27:57AM +0100, Mathias Krause wrote:

> The timers need to be destroyed prior to unregister_fair_sched_group()
> via destroy_cfs_bandwidth(tg_cfs_bandwidth(tg)), i.e. move it from
> free_fair_sched_group() to here, as I did in my patch. Otherwise the tg
> might still be messed with and we don't want that.

Oh, duh, yes. For consistency's sake, I've also added an unregister_*
for the rt class, also destroying the bandwidth timer.

> Beside that, looks good to me. Will you create a new proper patch or
> should I do it?

Something like so good?

(I stripped the #PF splat, because I don't think it adds anything not
covered by the text).

---
Subject: sched/fair: Prevent dead task groups from regaining cfs_rq's
From: Mathias Krause <[email protected]>
Date: Wed, 3 Nov 2021 20:06:13 +0100

From: Mathias Krause <[email protected]>

Kevin is reporting crashes which point to a use-after-free of a cfs_rq
in update_blocked_averages(). Initial debugging revealed that we have
live cfs_rq's (on_list=1) in an about to be kfree()'d task group in
free_fair_sched_group(). However, it was unclear how that could happen.

His kernel config happened to lead to a layout of struct sched_entity
that put the 'my_q' member directly into the middle of the object
which makes it incidentally overlap with SLUB's freelist pointer.
That, in combination with SLAB_FREELIST_HARDENED's freelist pointer
mangling, leads to a reliable access violation in form of a #GP which
made the UAF fail fast.

Michal seems to have run into the same issue[1]. He already correctly
diagnosed that commit a7b359fc6a37 ("sched/fair: Correctly insert
cfs_rq's to list on unthrottle") is causing the preconditions for the
UAF to happen by re-adding cfs_rq's also to task groups that have no
more running tasks, i.e. also to dead ones. His analysis, however,
misses the real root cause and it cannot be seen from the crash
backtrace only, as the real offender is tg_unthrottle_up() getting
called via sched_cfs_period_timer() via the timer interrupt at an
inconvenient time.

When unregister_fair_sched_group() unlinks all cfs_rq's from the dying
task group, it doesn't protect itself from getting interrupted. If the
timer interrupt triggers while we iterate over all CPUs or after
unregister_fair_sched_group() has finished but prior to unlinking the
task group, sched_cfs_period_timer() will execute and walk the list of
task groups, trying to unthrottle cfs_rq's, i.e. re-add them to the
dying task group. These will later -- in free_fair_sched_group() -- be
kfree()'ed while still being linked, leading to the fireworks Kevin
and Michal are seeing.

To fix this race, ensure the dying task group gets unlinked first.
However, simply switching the order of unregistering and unlinking the
task group isn't sufficient, as concurrent RCU walkers might still see
it, as can be seen below:

    CPU1:                                          CPU2:
      :                                            timer IRQ:
      :                                              do_sched_cfs_period_timer():
      :                                                :
      :                                                distribute_cfs_runtime():
      :                                                  rcu_read_lock();
      :                                                  :
      :                                                  unthrottle_cfs_rq():
    sched_offline_group():                                 :
      :                                                    walk_tg_tree_from(…,tg_unthrottle_up,…):
        list_del_rcu(&tg->list);                             :
  (1)     :                                                  list_for_each_entry_rcu(child, &parent->children, siblings)
          :                                                    :
  (2)     list_del_rcu(&tg->siblings);                         :
          :                                                    tg_unthrottle_up():
          unregister_fair_sched_group():                         struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
            :                                                      :
            list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);                 :
              :                                                    :
              :                                                    if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running)
  (3)         :                                                      list_add_leaf_cfs_rq(cfs_rq);
                :                                                  :
                :                                                :
                :                                              :
                :                                            :
                :                                          :
  (4)           :                                        rcu_read_unlock();

CPU 2 walks the task group list in parallel to sched_offline_group(),
specifically, it'll read the soon to be unlinked task group entry at
(1). Unlinking it on CPU 1 at (2) therefore won't prevent CPU 2 from
still passing it on to tg_unthrottle_up(). CPU 1 now tries to unlink
all cfs_rq's via list_del_leaf_cfs_rq() in
unregister_fair_sched_group(). Meanwhile CPU 2 will re-add some of
these at (3), which is the cause of the UAF later on.

To prevent this additional race from happening, we need to wait until
walk_tg_tree_from() has finished traversing the task groups, i.e.
after the RCU read critical section ends in (4). Afterwards we're safe
to call unregister_fair_sched_group(), as each new walk won't see the
dying task group any more.

Using synchronize_rcu() might be seen as a too heavy hammer to nail
this problem. However, the overall tear down sequence (e.g., as
documented in css_free_rwork_fn()) already relies on quite a few
assumptions regarding execution context and RCU grace periods from
passing. Looking at the autogroup code, which calls
sched_destroy_group() directly after sched_offline_group() and the
apparent need to have at least one RCU grace period expire after
unlinking the task group, prior to calling
unregister_fair_sched_group(), there seems to be no better
alternative. Calling unregister_fair_sched_group() via call_rcu()
will only lead to trouble in sched_offline_group() which also relies
on (yet another) expired RCU grace period.

This patch survives Michal's reproducer[2] for 8h+ now, which used to
trigger within minutes before.

[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/

Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
Reported-by: Kevin Tanguy <[email protected]>
Signed-off-by: Mathias Krause <[email protected]>
[peterz: shuffle code around a bit]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
kernel/sched/autogroup.c | 2 +-
kernel/sched/core.c | 46 ++++++++++++++++++++++++++++++++++++----------
kernel/sched/fair.c | 4 ++--
kernel/sched/rt.c | 12 +++++++++---
kernel/sched/sched.h | 3 ++-
5 files changed, 50 insertions(+), 17 deletions(-)

--- a/kernel/sched/autogroup.c
+++ b/kernel/sched/autogroup.c
@@ -31,7 +31,7 @@ static inline void autogroup_destroy(str
ag->tg->rt_se = NULL;
ag->tg->rt_rq = NULL;
#endif
- sched_offline_group(ag->tg);
+ sched_release_group(ag->tg);
sched_destroy_group(ag->tg);
}

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9716,6 +9716,22 @@ static void sched_free_group(struct task
kmem_cache_free(task_group_cache, tg);
}

+static void sched_free_group_rcu(struct rcu_head *rcu)
+{
+ sched_free_group(container_of(rcu, struct task_group, rcu));
+}
+
+static void sched_unregister_group(struct task_group *tg)
+{
+ unregister_fair_sched_group(tg);
+ unregister_rt_sched_group(tg);
+ /*
+ * We have to wait for yet another RCU grace period to expire, as
+ * print_cfs_stats() might run concurrently.
+ */
+ call_rcu(&tg->rcu, sched_free_group_rcu);
+}
+
/* allocate runqueue etc for a new task group */
struct task_group *sched_create_group(struct task_group *parent)
{
@@ -9736,7 +9752,7 @@ struct task_group *sched_create_group(st
return tg;

err:
- sched_free_group(tg);
+ sched_unregister_group(tg);
return ERR_PTR(-ENOMEM);
}

@@ -9759,25 +9775,35 @@ void sched_online_group(struct task_grou
}

/* rcu callback to free various structures associated with a task group */
-static void sched_free_group_rcu(struct rcu_head *rhp)
+static void sched_unregister_group_rcu(struct rcu_head *rhp)
{
/* Now it should be safe to free those cfs_rqs: */
- sched_free_group(container_of(rhp, struct task_group, rcu));
+ sched_unregister_group(container_of(rhp, struct task_group, rcu));
}

void sched_destroy_group(struct task_group *tg)
{
/* Wait for possible concurrent references to cfs_rqs complete: */
- call_rcu(&tg->rcu, sched_free_group_rcu);
+ call_rcu(&tg->rcu, sched_unregister_group_rcu);
}

-void sched_offline_group(struct task_group *tg)
+void sched_release_group(struct task_group *tg)
{
unsigned long flags;

- /* End participation in shares distribution: */
- unregister_fair_sched_group(tg);
-
+ /*
+ * Unlink first, to avoid walk_tg_tree_from() finding us (via
+ * sched_cfs_period_timer()).
+ *
+ * For this to be effective, we have to wait for all pending users of
+ * this task group to leave their RCU critical section to ensure no new
+ * user will see our dying task group any more. Specifically ensure
+ * that tg_unthrottle_up() won't add decayed cfs_rq's to it.
+ *
+ * We therefore defer calling unregister_fair_sched_group() to
+ * sched_unregister_group() which is guaranteed to get called only after the
+ * current RCU grace period has expired.
+ */
spin_lock_irqsave(&task_group_lock, flags);
list_del_rcu(&tg->list);
list_del_rcu(&tg->siblings);
@@ -9896,7 +9922,7 @@ static void cpu_cgroup_css_released(stru
{
struct task_group *tg = css_tg(css);

- sched_offline_group(tg);
+ sched_release_group(tg);
}

static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
@@ -9906,7 +9932,7 @@ static void cpu_cgroup_css_free(struct c
/*
* Relies on the RCU grace period between css_released() and this.
*/
- sched_free_group(tg);
+ sched_unregister_group(tg);
}

/*
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11456,8 +11456,6 @@ void free_fair_sched_group(struct task_g
{
int i;

- destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
-
for_each_possible_cpu(i) {
if (tg->cfs_rq)
kfree(tg->cfs_rq[i]);
@@ -11551,6 +11549,8 @@ void unregister_fair_sched_group(struct
list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
raw_spin_rq_unlock_irqrestore(rq, flags);
}
+
+ destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
}

void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -137,13 +137,17 @@ static inline struct rq *rq_of_rt_se(str
return rt_rq->rq;
}

-void free_rt_sched_group(struct task_group *tg)
+void unregister_rt_sched_group(struct task_group *tg)
{
- int i;
-
if (tg->rt_se)
destroy_rt_bandwidth(&tg->rt_bandwidth);

+}
+
+void free_rt_sched_group(struct task_group *tg)
+{
+ int i;
+
for_each_possible_cpu(i) {
if (tg->rt_rq)
kfree(tg->rt_rq[i]);
@@ -250,6 +254,8 @@ static inline struct rt_rq *rt_rq_of_se(
return &rq->rt;
}

+void unregister_rt_sched_group(struct task_group *tg) { }
+
void free_rt_sched_group(struct task_group *tg) { }

int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -488,6 +488,7 @@ extern void __refill_cfs_bandwidth_runti
extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);

+extern void unregister_rt_sched_group(struct task_group *tg);
extern void free_rt_sched_group(struct task_group *tg);
extern int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent);
extern void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
@@ -503,7 +504,7 @@ extern struct task_group *sched_create_g
extern void sched_online_group(struct task_group *tg,
struct task_group *parent);
extern void sched_destroy_group(struct task_group *tg);
-extern void sched_offline_group(struct task_group *tg);
+extern void sched_release_group(struct task_group *tg);

extern void sched_move_task(struct task_struct *tsk);

2021-11-08 15:10:17

by Mathias Krause

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Prevent dead task groups from regaining cfs_rq's

On 06.11.21 at 11:48, Peter Zijlstra wrote:
> On Fri, Nov 05, 2021 at 05:29:14PM +0100, Mathias Krause wrote:
>>> Looks like it needs to be the kfree_rcu() one in this case. I'll prepare
>>> a patch.
>>
>> Testing the below patch right now. Looking good so far. Will prepare a
>> proper patch later, if we all can agree that this covers all cases.
>>
>> But the basic idea is to defer the kfree()'s to after the next RCU GP,
>> which also means we need to free the tg object itself later. Slightly
>> ugly. :/
>
> How's this then?

Well, slightly more code churn, but looks cleaner indeed -- no tg_free()
hack. Just one bit's missing IMHO, see below.

>
> ---
> diff --git a/kernel/sched/autogroup.c b/kernel/sched/autogroup.c
> index 2067080bb235..8629b37d118e 100644
> --- a/kernel/sched/autogroup.c
> +++ b/kernel/sched/autogroup.c
> @@ -31,7 +31,7 @@ static inline void autogroup_destroy(struct kref *kref)
> ag->tg->rt_se = NULL;
> ag->tg->rt_rq = NULL;
> #endif
> - sched_offline_group(ag->tg);
> + sched_release_group(ag->tg);
> sched_destroy_group(ag->tg);
> }
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9cb81ef8acc8..22528bd61ba5 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9715,6 +9715,21 @@ static void sched_free_group(struct task_group *tg)
> kmem_cache_free(task_group_cache, tg);
> }
>
> +static void sched_free_group_rcu(struct rcu_head *rcu)
> +{
> + sched_free_group(container_of(rcu, struct task_group, rcu_head));
^^^^^^^^
This should be 'rcu'.

> +}
> +
> +static void sched_unregister_group(struct task_group *tg)
> +{

The timers need to be destroyed prior to unregister_fair_sched_group()
via destroy_cfs_bandwidth(tg_cfs_bandwidth(tg)), i.e. move it from
free_fair_sched_group() to here, as I did in my patch. Otherwise the tg
might still be messed with and we don't want that.
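
For context, destroy_cfs_bandwidth() is the function that cancels the bandwidth
hrtimers; roughly, as of the 5.15-era kernel/sched/fair.c, it looks like this
(shown here only as a reference sketch):

    static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
    {
            /* nothing to do if init_cfs_bandwidth() never ran for this tg */
            if (!cfs_b->throttled_cfs_rq.next)
                    return;

            hrtimer_cancel(&cfs_b->period_timer);
            hrtimer_cancel(&cfs_b->slack_timer);
    }

Until both hrtimer_cancel() calls have returned, sched_cfs_period_timer() can
still fire and poke at the tg, which is exactly what has to be ruled out before
the cfs_rq's are unlinked.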

> + unregister_fair_sched_group(tg);
> + /*
> + * We have to wait for yet another RCU grace period to expire, as
> + * print_cfs_stats() might run concurrently.
> + */
> + call_rcu(&tg->rcu, sched_free_group_rcu);
> +}
> +
> /* allocate runqueue etc for a new task group */
> struct task_group *sched_create_group(struct task_group *parent)
> {
> @@ -9735,7 +9750,7 @@ struct task_group *sched_create_group(struct task_group *parent)
> return tg;
>
> err:
> - sched_free_group(tg);
> + sched_unregister_group(tg);
> return ERR_PTR(-ENOMEM);
> }
>
> @@ -9758,25 +9773,35 @@ void sched_online_group(struct task_group *tg, struct task_group *parent)
> }
>
> /* rcu callback to free various structures associated with a task group */
> -static void sched_free_group_rcu(struct rcu_head *rhp)
> +static void sched_unregister_group_rcu(struct rcu_head *rhp)
> {
> /* Now it should be safe to free those cfs_rqs: */
> - sched_free_group(container_of(rhp, struct task_group, rcu));
> + sched_unregister_group(container_of(rhp, struct task_group, rcu));
> }
>
> void sched_destroy_group(struct task_group *tg)
> {
> /* Wait for possible concurrent references to cfs_rqs complete: */
> - call_rcu(&tg->rcu, sched_free_group_rcu);
> + call_rcu(&tg->rcu, sched_unregister_group_rcu);
> }
>
> -void sched_offline_group(struct task_group *tg)
> +void sched_release_group(struct task_group *tg)
> {
> unsigned long flags;
>
> - /* End participation in shares distribution: */
> - unregister_fair_sched_group(tg);
> -
> + /*
> + * Unlink first, to avoid walk_tg_tree_from() from finding us (via
> + * sched_cfs_period_timer()).
> + *
> + * For this to be effective, we have to wait for all pending users of
> + * this task group to leave their RCU critical section to ensure no new
> + * user will see our dying task group any more. Specifically ensure
> + * that tg_unthrottle_up() won't add decayed cfs_rq's to it.
> + *
> + * We therefore defer calling unregister_fair_sched_group() to
> + * sched_unregister_group() which is guarantied to get called only after the
> + * current RCU grace period has expired.
> + */
> spin_lock_irqsave(&task_group_lock, flags);
> list_del_rcu(&tg->list);
> list_del_rcu(&tg->siblings);
> @@ -9895,7 +9920,7 @@ static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
> {
> struct task_group *tg = css_tg(css);
>
> - sched_offline_group(tg);
> + sched_release_group(tg);
> }
>
> static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
> @@ -9905,7 +9930,7 @@ static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
> /*
> * Relies on the RCU grace period between css_released() and this.
> */
> - sched_free_group(tg);
> + sched_unregister_group(tg);
> }
>
> /*
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index f0b249ec581d..20038274c57b 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -504,7 +504,7 @@ extern struct task_group *sched_create_group(struct task_group *parent);
> extern void sched_online_group(struct task_group *tg,
> struct task_group *parent);
> extern void sched_destroy_group(struct task_group *tg);
> -extern void sched_offline_group(struct task_group *tg);
> +extern void sched_release_group(struct task_group *tg);
>
> extern void sched_move_task(struct task_struct *tsk);
>

Beside that, looks good to me. Will you create a new proper patch or
should I do it?

Thanks,
Mathias

2021-11-08 20:29:55

by Mathias Krause

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Prevent dead task groups from regaining cfs_rq's

On 08.11.21 at 12:40, Peter Zijlstra wrote:
> On Mon, Nov 08, 2021 at 11:27:57AM +0100, Mathias Krause wrote:
>
>> The timers need to be destroyed prior to unregister_fair_sched_group()
>> via destroy_cfs_bandwidth(tg_cfs_bandwidth(tg)), i.e. move it from
>> free_fair_sched_group() to here, as I did in my patch. Otherwise the tg
>> might still be messed with and we don't want that.
>
> Oh, duh, yes.

Well, still slightly wrong: "prior to unregister_fair_sched_group()"
means calling destroy_cfs_bandwidth() before, not after ;)

> For consistency's sake, I've also added an unregister_*
> for the rt class, also destroying the bandwidth timer.

Looks good to me. Not strictly needed by the code as of now, but
shouldn't hurt either, to defer the final kfree() to the next RCU GP.

>> Beside that, looks good to me. Will you create a new proper patch or
>> should I do it?
>
> Something like so good?
>
> (I stripped the #PF splat, because I don't think it adds anything not
> covered by the text).

Well, I personally always search for parts of an Oops splat first. That
sometimes finds related discussions or, even better, commits fixing the
issue. So I'd prefer keeping it. But, in this case, a search should find this
email thread and, in turn, this patch. So I'm fine with dropping it.

> ---
> Subject: sched/fair: Prevent dead task groups from regaining cfs_rq's
> From: Mathias Krause <[email protected]>
> Date: Wed, 3 Nov 2021 20:06:13 +0100
>
> From: Mathias Krause <[email protected]>
>
> Kevin is reporting crashes which point to a use-after-free of a cfs_rq
> in update_blocked_averages(). Initial debugging revealed that we've
> live cfs_rq's (on_list=1) in an about to be kfree()'d task group in
> free_fair_sched_group(). However, it was unclear how that can happen.
>
> His kernel config happened to lead to a layout of struct sched_entity
> that put the 'my_q' member directly into the middle of the object
> which makes it incidentally overlap with SLUB's freelist pointer.
> That, in combination with SLAB_FREELIST_HARDENED's freelist pointer
> mangling, leads to a reliable access violation in form of a #GP which
> made the UAF fail fast.
>
> Michal seems to have run into the same issue[1]. He already correctly
> diagnosed that commit a7b359fc6a37 ("sched/fair: Correctly insert
> cfs_rq's to list on unthrottle") is causing the preconditions for the
> UAF to happen by re-adding cfs_rq's also to task groups that have no
> more running tasks, i.e. also to dead ones. His analysis, however,
> misses the real root cause and it cannot be seen from the crash
> backtrace only, as the real offender is tg_unthrottle_up() getting
> called via sched_cfs_period_timer() via the timer interrupt at an
> inconvenient time.
>
> When unregister_fair_sched_group() unlinks all cfs_rq's from the dying
> task group, it doesn't protect itself from getting interrupted. If the
> timer interrupt triggers while we iterate over all CPUs or after
> unregister_fair_sched_group() has finished but prior to unlinking the
> task group, sched_cfs_period_timer() will execute and walk the list of
> task groups, trying to unthrottle cfs_rq's, i.e. re-add them to the
> dying task group. These will later -- in free_fair_sched_group() -- be
> kfree()'ed while still being linked, leading to the fireworks Kevin
> and Michal are seeing.
>
> To fix this race, ensure the dying task group gets unlinked first.
> However, simply switching the order of unregistering and unlinking the
> task group isn't sufficient, as concurrent RCU walkers might still see
> it, as can be seen below:
>
> CPU1: CPU2:
> : timer IRQ:
> : do_sched_cfs_period_timer():
> : :
> : distribute_cfs_runtime():
> : rcu_read_lock();
> : :
> : unthrottle_cfs_rq():
> sched_offline_group(): :
> : walk_tg_tree_from(…,tg_unthrottle_up,…):
> list_del_rcu(&tg->list); :
> (1) : list_for_each_entry_rcu(child, &parent->children, siblings)
> : :
> (2) list_del_rcu(&tg->siblings); :
> : tg_unthrottle_up():
> unregister_fair_sched_group(): struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
> : :
> list_del_leaf_cfs_rq(tg->cfs_rq[cpu]); :
> : :
> : if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running)
> (3) : list_add_leaf_cfs_rq(cfs_rq);
> : :
> : :
> : :
> : :
> : :
> (4) : rcu_read_unlock();
>
> CPU 2 walks the task group list in parallel to sched_offline_group(),
> specifically, it'll read the soon to be unlinked task group entry at
> (1). Unlinking it on CPU 1 at (2) therefore won't prevent CPU 2 from
> still passing it on to tg_unthrottle_up(). CPU 1 now tries to unlink
> all cfs_rq's via list_del_leaf_cfs_rq() in
> unregister_fair_sched_group(). Meanwhile CPU 2 will re-add some of
> these at (3), which is the cause of the UAF later on.
>
> To prevent this additional race from happening, we need to wait until
> walk_tg_tree_from() has finished traversing the task groups, i.e.
> after the RCU read critical section ends in (4). Afterwards we're safe
> to call unregister_fair_sched_group(), as each new walk won't see the
> dying task group any more.
Replace the following paragraph, which is outdated by now,...:

> Using synchronize_rcu() might be seen as a too heavy hammer to nail
> this problem. However, the overall tear down sequence (e.g., as
> documented in css_free_rwork_fn()) already relies on quite a few
> assumptions regarding execution context and RCU grace periods from
> passing. Looking at the autogroup code, which calls
> sched_destroy_group() directly after sched_offline_group() and the
> apparent need to have at least one RCU grace period expire after
> unlinking the task group, prior to calling
> unregister_fair_sched_group(), there seems to be no better
> alternative. Calling unregister_fair_sched_group() via call_rcu()
> will only lead to trouble in sched_offline_group() which also relies
> on (yet another) expired RCU grace period.

...with something like this (already mentioned in the code, btw):

On top of that, we need to wait yet another RCU grace period after
unregister_fair_sched_group() to ensure print_cfs_stats(), which might
run concurrently, always sees valid objects, i.e. not already free'd ones.
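
For reference, print_cfs_stats() only holds rcu_read_lock() while walking the
leaf list; roughly, as of the 5.15-era kernel/sched/fair.c (CONFIG_SCHED_DEBUG):

    void print_cfs_stats(struct seq_file *m, int cpu)
    {
            struct cfs_rq *cfs_rq, *pos;

            rcu_read_lock();
            for_each_leaf_cfs_rq_safe(cpu_rq(cpu), cfs_rq, pos)
                    print_cfs_rq(m, cpu, cfs_rq);
            rcu_read_unlock();
    }

so the cfs_rq's have to stay valid for one more grace period after
unregister_fair_sched_group() has unlinked them.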

>
> This patch survives Michal's reproducer[2] for 8h+ now, which used to
> trigger within minutes before.
>
> [1] https://lore.kernel.org/lkml/[email protected]/
> [2] https://lore.kernel.org/lkml/[email protected]/
>
> Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
> Reported-by: Kevin Tanguy <[email protected]>
> Signed-off-by: Mathias Krause <[email protected]>
> [peterz: shuffle code around a bit]
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> kernel/sched/autogroup.c | 2 +-
> kernel/sched/core.c | 46 ++++++++++++++++++++++++++++++++++++----------
> kernel/sched/fair.c | 4 ++--
> kernel/sched/rt.c | 12 +++++++++---
> kernel/sched/sched.h | 3 ++-
> 5 files changed, 50 insertions(+), 17 deletions(-)
>
> --- a/kernel/sched/autogroup.c
> +++ b/kernel/sched/autogroup.c
> @@ -31,7 +31,7 @@ static inline void autogroup_destroy(str
> ag->tg->rt_se = NULL;
> ag->tg->rt_rq = NULL;
> #endif
> - sched_offline_group(ag->tg);
> + sched_release_group(ag->tg);
> sched_destroy_group(ag->tg);
> }
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9716,6 +9716,22 @@ static void sched_free_group(struct task
> kmem_cache_free(task_group_cache, tg);
> }
>
> +static void sched_free_group_rcu(struct rcu_head *rcu)
> +{
> + sched_free_group(container_of(rcu, struct task_group, rcu));
> +}
> +
> +static void sched_unregister_group(struct task_group *tg)
> +{
> + unregister_fair_sched_group(tg);
> + unregister_rt_sched_group(tg);
> + /*
> + * We have to wait for yet another RCU grace period to expire, as
> + * print_cfs_stats() might run concurrently.
> + */
> + call_rcu(&tg->rcu, sched_free_group_rcu);
> +}
> +
> /* allocate runqueue etc for a new task group */
> struct task_group *sched_create_group(struct task_group *parent)
> {
> @@ -9736,7 +9752,7 @@ struct task_group *sched_create_group(st
> return tg;
>
> err:
> - sched_free_group(tg);
> + sched_unregister_group(tg);

This can stay sched_free_group(), as neither have the bandwidth timers
been started yet, nor has this tg been made visible outside of this
function. So omitting the calls to destroy_{cfs,rt}_bandwidth() isn't a
problem -- the timers aren't running yet.
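
For illustration, the error path in question looks roughly like this
(simplified sketch of the 5.15-era sched_create_group(); uclamp setup omitted):

    struct task_group *sched_create_group(struct task_group *parent)
    {
            struct task_group *tg;

            tg = kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
            if (!tg)
                    return ERR_PTR(-ENOMEM);

            if (!alloc_fair_sched_group(tg, parent))
                    goto err;

            if (!alloc_rt_sched_group(tg, parent))
                    goto err;

            return tg;

    err:
            /* tg was never made visible and no bandwidth timer was started */
            sched_free_group(tg);
            return ERR_PTR(-ENOMEM);
    }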

> return ERR_PTR(-ENOMEM);
> }
>
> @@ -9759,25 +9775,35 @@ void sched_online_group(struct task_grou
> }
>
> /* rcu callback to free various structures associated with a task group */
> -static void sched_free_group_rcu(struct rcu_head *rhp)
> +static void sched_unregister_group_rcu(struct rcu_head *rhp)
> {
> /* Now it should be safe to free those cfs_rqs: */
> - sched_free_group(container_of(rhp, struct task_group, rcu));
> + sched_unregister_group(container_of(rhp, struct task_group, rcu));
> }
>
> void sched_destroy_group(struct task_group *tg)
> {
> /* Wait for possible concurrent references to cfs_rqs complete: */
> - call_rcu(&tg->rcu, sched_free_group_rcu);
> + call_rcu(&tg->rcu, sched_unregister_group_rcu);
> }
>
> -void sched_offline_group(struct task_group *tg)
> +void sched_release_group(struct task_group *tg)
> {
> unsigned long flags;
>
> - /* End participation in shares distribution: */
> - unregister_fair_sched_group(tg);
> -
> + /*
> + * Unlink first, to avoid walk_tg_tree_from() from finding us (via
> + * sched_cfs_period_timer()).
> + *
> + * For this to be effective, we have to wait for all pending users of
> + * this task group to leave their RCU critical section to ensure no new
> + * user will see our dying task group any more. Specifically ensure
> + * that tg_unthrottle_up() won't add decayed cfs_rq's to it.
> + *
> + * We therefore defer calling unregister_fair_sched_group() to
> + * sched_unregister_group() which is guarantied to get called only after the
> + * current RCU grace period has expired.
> + */
> spin_lock_irqsave(&task_group_lock, flags);
> list_del_rcu(&tg->list);
> list_del_rcu(&tg->siblings);
> @@ -9896,7 +9922,7 @@ static void cpu_cgroup_css_released(stru
> {
> struct task_group *tg = css_tg(css);
>
> - sched_offline_group(tg);
> + sched_release_group(tg);
> }
>
> static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
> @@ -9906,7 +9932,7 @@ static void cpu_cgroup_css_free(struct c
> /*
> * Relies on the RCU grace period between css_released() and this.
> */
> - sched_free_group(tg);
> + sched_unregister_group(tg);
> }
>
> /*
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11456,8 +11456,6 @@ void free_fair_sched_group(struct task_g
> {
> int i;
>
> - destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
> -
> for_each_possible_cpu(i) {
> if (tg->cfs_rq)
> kfree(tg->cfs_rq[i]);
> @@ -11551,6 +11549,8 @@ void unregister_fair_sched_group(struct
> list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
> raw_spin_rq_unlock_irqrestore(rq, flags);
> }

> +
> + destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));

Move that hunk to the beginning of unregister_fair_sched_group() and
we're good.
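
The end result would then look roughly like this (a sketch that simply hoists
the hunk from the patch above to the top of the function):

    void unregister_fair_sched_group(struct task_group *tg)
    {
            unsigned long flags;
            struct rq *rq;
            int cpu;

            /* Stop the period/slack timers before touching the cfs_rq lists. */
            destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));

            for_each_possible_cpu(cpu) {
                    if (tg->se[cpu])
                            remove_entity_load_avg(tg->se[cpu]);

                    /*
                     * Only empty task groups can be destroyed; so we can
                     * speculatively check on_list without danger of it being
                     * re-added.
                     */
                    if (!tg->cfs_rq[cpu]->on_list)
                            continue;

                    rq = cpu_rq(cpu);

                    raw_spin_rq_lock_irqsave(rq, flags);
                    list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
                    raw_spin_rq_unlock_irqrestore(rq, flags);
            }
    }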

> }
>
> void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -137,13 +137,17 @@ static inline struct rq *rq_of_rt_se(str
> return rt_rq->rq;
> }
>
> -void free_rt_sched_group(struct task_group *tg)
> +void unregister_rt_sched_group(struct task_group *tg)
> {
> - int i;
> -
> if (tg->rt_se)
> destroy_rt_bandwidth(&tg->rt_bandwidth);
>
> +}
> +
> +void free_rt_sched_group(struct task_group *tg)
> +{
> + int i;
> +
> for_each_possible_cpu(i) {
> if (tg->rt_rq)
> kfree(tg->rt_rq[i]);
> @@ -250,6 +254,8 @@ static inline struct rt_rq *rt_rq_of_se(
> return &rq->rt;
> }
>
> +void unregister_rt_sched_group(struct task_group *tg) { }
> +
> void free_rt_sched_group(struct task_group *tg) { }
>
> int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent)
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -488,6 +488,7 @@ extern void __refill_cfs_bandwidth_runti
> extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
> extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);
>
> +extern void unregister_rt_sched_group(struct task_group *tg);
> extern void free_rt_sched_group(struct task_group *tg);
> extern int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent);
> extern void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
> @@ -503,7 +504,7 @@ extern struct task_group *sched_create_g
> extern void sched_online_group(struct task_group *tg,
> struct task_group *parent);
> extern void sched_destroy_group(struct task_group *tg);
> -extern void sched_offline_group(struct task_group *tg);
> +extern void sched_release_group(struct task_group *tg);
>
> extern void sched_move_task(struct task_struct *tsk);
>
>

Thanks,
Mathias

2021-11-10 00:13:31

by Michal Koutný

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Prevent dead task groups from regaining cfs_rq's

Hello.

Let me add a little summary (and I'm CCing cgroups ML too).

The cgroup API provides the following callbacks (simplified explanation):

- .css_offline called after killing main reference
- .css_released called when css.refcnt hits zero
- .css_free called after css_released and RCU GP

(The hidden catch is that they're not called directly but through workqueues,
see css_free_work_fn() for details. .css_free() is queued after RCU GP though.)

How is this currently used in the cpu controller:

- .css_offline not used
- .css_released sched_offline_group / unregister_fair_sched_group
- .css_free sched_free_group / free_fair_sched_group
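
For reference, these hooks are wired up in kernel/sched/core.c roughly like so
(abbreviated; only the callbacks relevant here are shown):

    struct cgroup_subsys cpu_cgrp_subsys = {
            .css_alloc      = cpu_cgroup_css_alloc,
            .css_online     = cpu_cgroup_css_online,
            .css_released   = cpu_cgroup_css_released,
            .css_free       = cpu_cgroup_css_free,
            /* ... remaining callbacks, cftypes and flags omitted ... */
    };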


On Mon, Nov 08, 2021 at 12:40:19PM +0100, Peter Zijlstra <[email protected]> wrote:
> Something like so good?

Thanks for putting this together; in short, I understand it reshuffles things as follows:

.css_offline    not used
.css_released   cpu_cgroup_css_released
                  sched_release_group(tg)
                    // unlinking from tg tree
.css_free       cpu_cgroup_css_free
                  sched_unregister_group(tg)
                    unregister_fair_sched_group(tg)
                      for_each(cpu)
                        remove_entity_load_avg(cfs_rq)
                        list_del_leaf_cfs_rq(cfs_rq)    (1)
                      destroy_cfs_bandwidth(tg)         (2)
                    call_rcu(sched_free_group_rcu)      (3)
                  [rcu] sched_free_group

I see the following two issues with this:

The cfs_bandwidth.period_timer is still active between (1) and (2) and can
fire, so tg_unthrottle_up() may add a cfs_rq back to the leaf list (a
subsequent GP won't help).
(Admittedly, this window is shorter than the original window between
cpu_cgroup_css_released() and cpu_cgroup_css_free().)
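
As a reminder, the re-add happens in tg_unthrottle_up(); a simplified sketch of
the 5.15-era code, with the throttled-clock accounting elided:

    static int tg_unthrottle_up(struct task_group *tg, void *data)
    {
            struct rq *rq = data;
            struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];

            cfs_rq->throttle_count--;
            if (!cfs_rq->throttle_count) {
                    /* ... throttled clock accounting elided ... */

                    /* re-add a cfs_rq with load or runnable entities to the leaf list */
                    if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running)
                            list_add_leaf_cfs_rq(cfs_rq);
            }

            return 0;
    }

so any period timer firing in that window can re-link a cfs_rq of the dying tg.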

Having another call_rcu() at (3) seems overly complex given the cgroup API
provides a grace period for free (pun intended :-).


Therefore, the symbolic ordering should be:

.css_offline    not used
.css_released   cpu_cgroup_css_released
                  sched_unregister_group(tg)
                    unregister_fair_sched_group(tg)
                      destroy_cfs_bandwidth(tg)
                      for_each(cpu)
                        remove_entity_load_avg(cfs_rq)
                        list_del_leaf_cfs_rq(cfs_rq)
                  sched_release_group(tg)
                    // unlinking from tg tree

.css_free       cpu_cgroup_css_free
                  sched_free_group

That is, destroy_cfs_bandwidth() is called first to make sure
tg_unthrottle_up() won't undo some of the cleanups, and the respective
structures are only freed after an RCU grace period.

(Considering Vincent's remark, remove_entity_load_avg() and
list_del_leaf_cfs_rq() should be swapped but I'd keep that for a different
patch.)

To continue the discussion, here is the suggestion above in the form of a diff
(I stripped the commit message for now).

--8<--
diff --git a/kernel/sched/autogroup.c b/kernel/sched/autogroup.c
index 2067080bb235..8629b37d118e 100644
--- a/kernel/sched/autogroup.c
+++ b/kernel/sched/autogroup.c
@@ -31,7 +31,7 @@ static inline void autogroup_destroy(struct kref *kref)
ag->tg->rt_se = NULL;
ag->tg->rt_rq = NULL;
#endif
- sched_offline_group(ag->tg);
+ sched_release_group(ag->tg);
sched_destroy_group(ag->tg);
}

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 523fd602ea90..a5ca3ae6837b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9771,12 +9771,12 @@ void sched_destroy_group(struct task_group *tg)
call_rcu(&tg->rcu, sched_free_group_rcu);
}

-void sched_offline_group(struct task_group *tg)
+void sched_release_group(struct task_group *tg)
{
unsigned long flags;

- /* End participation in shares distribution: */
unregister_fair_sched_group(tg);
+ unregister_rt_sched_group(tg);

spin_lock_irqsave(&task_group_lock, flags);
list_del_rcu(&tg->list);
@@ -9896,7 +9896,7 @@ static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
{
struct task_group *tg = css_tg(css);

- sched_offline_group(tg);
+ sched_release_group(tg);
}

static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 13950beb01a2..6e476f6d9435 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11456,8 +11456,6 @@ void free_fair_sched_group(struct task_group *tg)
{
int i;

- destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
-
for_each_possible_cpu(i) {
if (tg->cfs_rq)
kfree(tg->cfs_rq[i]);
@@ -11534,6 +11532,8 @@ void unregister_fair_sched_group(struct task_group *tg)
struct rq *rq;
int cpu;

+ destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
+
for_each_possible_cpu(cpu) {
if (tg->se[cpu])
remove_entity_load_avg(tg->se[cpu]);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index bb945f8faeca..b48baaba2fc2 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -137,13 +137,17 @@ static inline struct rq *rq_of_rt_se(struct sched_rt_entity *rt_se)
return rt_rq->rq;
}

-void free_rt_sched_group(struct task_group *tg)
+void unregister_rt_sched_group(struct task_group *tg)
{
- int i;
-
if (tg->rt_se)
destroy_rt_bandwidth(&tg->rt_bandwidth);

+}
+
+void free_rt_sched_group(struct task_group *tg)
+{
+ int i;
+
for_each_possible_cpu(i) {
if (tg->rt_rq)
kfree(tg->rt_rq[i]);
@@ -250,6 +254,8 @@ static inline struct rt_rq *rt_rq_of_se(struct sched_rt_entity *rt_se)
return &rq->rt;
}

+void unregister_rt_sched_group(struct task_group *tg) { }
+
void free_rt_sched_group(struct task_group *tg) { }

int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7f1612d26c18..0e66749486e7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -488,6 +488,7 @@ extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);

+extern void unregister_rt_sched_group(struct task_group *tg);
extern void free_rt_sched_group(struct task_group *tg);
extern int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent);
extern void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
@@ -503,7 +504,7 @@ extern struct task_group *sched_create_group(struct task_group *parent);
extern void sched_online_group(struct task_group *tg,
struct task_group *parent);
extern void sched_destroy_group(struct task_group *tg);
-extern void sched_offline_group(struct task_group *tg);
+extern void sched_release_group(struct task_group *tg);

extern void sched_move_task(struct task_struct *tsk);


2021-11-10 15:17:30

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Prevent dead task groups from regaining cfs_rq's

On Mon, 8 Nov 2021 at 16:06, Mathias Krause <[email protected]> wrote:
>
> Am 08.11.21 um 12:40 schrieb Peter Zijlstra:
> > On Mon, Nov 08, 2021 at 11:27:57AM +0100, Mathias Krause wrote:
> >
> >> The timers need to be destroyed prior to unregister_fair_sched_group()
> >> via destroy_cfs_bandwidth(tg_cfs_bandwidth(tg)), i.e. move it from
> >> free_fair_sched_group() to here, as I did in my patch. Otherwise the tg
> >> might still be messed with and we don't want that.
> >
> > Oh, duh, yes.
>
> Well, still slightly wrong: "prior to unregister_fair_sched_group()"
> means calling destroy_cfs_bandwidth() before not after ;)
>
> > For consistency's sake, I've also added an unregister_*
> > for the rt class, also destroying the bandwidth timer.
>
> Looks good to me. Not strictly needed by the code as of now, but
> shouldn't hurt either, to defer the final kfree() to the next RCU GP.
>
> >> Beside that, looks good to me. Will you create a new proper patch or
> >> should I do it?
> >
> > Something like so good?
> >
> > (I stripped the #PF splat, because I don't think it adds anything not
> > covered by the text).
>
> Well, me, personally, always searches for parts of a Oops splat first.
> It sometimes finds related discussions or, even better, commits fixing
> an issue. So I prefer keeping it. But, in this case, it should find this
> Email thread and, in turn, this patch. So I'm fine with dropping it.
>
> > ---
> > Subject: sched/fair: Prevent dead task groups from regaining cfs_rq's
> > From: Mathias Krause <[email protected]>
> > Date: Wed, 3 Nov 2021 20:06:13 +0100
> >
> > From: Mathias Krause <[email protected]>
> >
> > Kevin is reporting crashes which point to a use-after-free of a cfs_rq
> > in update_blocked_averages(). Initial debugging revealed that we've
> > live cfs_rq's (on_list=1) in an about to be kfree()'d task group in
> > free_fair_sched_group(). However, it was unclear how that can happen.
> >
> > His kernel config happened to lead to a layout of struct sched_entity
> > that put the 'my_q' member directly into the middle of the object
> > which makes it incidentally overlap with SLUB's freelist pointer.
> > That, in combination with SLAB_FREELIST_HARDENED's freelist pointer
> > mangling, leads to a reliable access violation in form of a #GP which
> > made the UAF fail fast.
> >
> > Michal seems to have run into the same issue[1]. He already correctly
> > diagnosed that commit a7b359fc6a37 ("sched/fair: Correctly insert
> > cfs_rq's to list on unthrottle") is causing the preconditions for the
> > UAF to happen by re-adding cfs_rq's also to task groups that have no
> > more running tasks, i.e. also to dead ones. His analysis, however,
> > misses the real root cause and it cannot be seen from the crash
> > backtrace only, as the real offender is tg_unthrottle_up() getting
> > called via sched_cfs_period_timer() via the timer interrupt at an
> > inconvenient time.
> >
> > When unregister_fair_sched_group() unlinks all cfs_rq's from the dying
> > task group, it doesn't protect itself from getting interrupted. If the
> > timer interrupt triggers while we iterate over all CPUs or after
> > unregister_fair_sched_group() has finished but prior to unlinking the
> > task group, sched_cfs_period_timer() will execute and walk the list of
> > task groups, trying to unthrottle cfs_rq's, i.e. re-add them to the
> > dying task group. These will later -- in free_fair_sched_group() -- be
> > kfree()'ed while still being linked, leading to the fireworks Kevin
> > and Michal are seeing.
> >
> > To fix this race, ensure the dying task group gets unlinked first.
> > However, simply switching the order of unregistering and unlinking the
> > task group isn't sufficient, as concurrent RCU walkers might still see
> > it, as can be seen below:
> >
> > CPU1: CPU2:
> > : timer IRQ:
> > : do_sched_cfs_period_timer():
> > : :
> > : distribute_cfs_runtime():
> > : rcu_read_lock();
> > : :
> > : unthrottle_cfs_rq():
> > sched_offline_group(): :
> > : walk_tg_tree_from(…,tg_unthrottle_up,…):
> > list_del_rcu(&tg->list); :
> > (1) : list_for_each_entry_rcu(child, &parent->children, siblings)
> > : :
> > (2) list_del_rcu(&tg->siblings); :
> > : tg_unthrottle_up():
> > unregister_fair_sched_group(): struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
> > : :
> > list_del_leaf_cfs_rq(tg->cfs_rq[cpu]); :
> > : :
> > : if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running)
> > (3) : list_add_leaf_cfs_rq(cfs_rq);
> > : :
> > : :
> > : :
> > : :
> > : :
> > (4) : rcu_read_unlock();
> >
> > CPU 2 walks the task group list in parallel to sched_offline_group(),
> > specifically, it'll read the soon to be unlinked task group entry at
> > (1). Unlinking it on CPU 1 at (2) therefore won't prevent CPU 2 from
> > still passing it on to tg_unthrottle_up(). CPU 1 now tries to unlink
> > all cfs_rq's via list_del_leaf_cfs_rq() in
> > unregister_fair_sched_group(). Meanwhile CPU 2 will re-add some of
> > these at (3), which is the cause of the UAF later on.
> >
> > To prevent this additional race from happening, we need to wait until
> > walk_tg_tree_from() has finished traversing the task groups, i.e.
> > after the RCU read critical section ends in (4). Afterwards we're safe
> > to call unregister_fair_sched_group(), as each new walk won't see the
> > dying task group any more.
> Replace the following paragraph, which is outdated by now,...:
>
> > Using synchronize_rcu() might be seen as a too heavy hammer to nail
> > this problem. However, the overall tear down sequence (e.g., as
> > documented in css_free_rwork_fn()) already relies on quite a few
> > assumptions regarding execution context and RCU grace periods from
> > passing. Looking at the autogroup code, which calls
> > sched_destroy_group() directly after sched_offline_group() and the
> > apparent need to have at least one RCU grace period expire after
> > unlinking the task group, prior to calling
> > unregister_fair_sched_group(), there seems to be no better
> > alternative. Calling unregister_fair_sched_group() via call_rcu()
> > will only lead to trouble in sched_offline_group() which also relies
> > on (yet another) expired RCU grace period.
>
> ...with something like this (already mentioned in the code, btw):
>
> On top of that, we need to wait yet another RCU grace period after
> unregister_fair_sched_group() to ensure print_cfs_stats(), which might
> run concurrently, always sees valid objects, i.e. not already free'd ones.
>
> >
> > This patch survives Michal's reproducer[2] for 8h+ now, which used to
> > trigger within minutes before.
> >
> > [1] https://lore.kernel.org/lkml/[email protected]/
> > [2] https://lore.kernel.org/lkml/[email protected]/
> >
> > Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
> > Reported-by: Kevin Tanguy <[email protected]>
> > Signed-off-by: Mathias Krause <[email protected]>
> > [peterz: shuffle code around a bit]
> > Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> > ---
> > kernel/sched/autogroup.c | 2 +-
> > kernel/sched/core.c | 46 ++++++++++++++++++++++++++++++++++++----------
> > kernel/sched/fair.c | 4 ++--
> > kernel/sched/rt.c | 12 +++++++++---
> > kernel/sched/sched.h | 3 ++-
> > 5 files changed, 50 insertions(+), 17 deletions(-)
> >
> > --- a/kernel/sched/autogroup.c
> > +++ b/kernel/sched/autogroup.c
> > @@ -31,7 +31,7 @@ static inline void autogroup_destroy(str
> > ag->tg->rt_se = NULL;
> > ag->tg->rt_rq = NULL;
> > #endif
> > - sched_offline_group(ag->tg);
> > + sched_release_group(ag->tg);
> > sched_destroy_group(ag->tg);
> > }
> >
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -9716,6 +9716,22 @@ static void sched_free_group(struct task
> > kmem_cache_free(task_group_cache, tg);
> > }
> >
> > +static void sched_free_group_rcu(struct rcu_head *rcu)
> > +{
> > + sched_free_group(container_of(rcu, struct task_group, rcu));
> > +}
> > +
> > +static void sched_unregister_group(struct task_group *tg)
> > +{
> > + unregister_fair_sched_group(tg);
> > + unregister_rt_sched_group(tg);
> > + /*
> > + * We have to wait for yet another RCU grace period to expire, as
> > + * print_cfs_stats() might run concurrently.
> > + */
> > + call_rcu(&tg->rcu, sched_free_group_rcu);
> > +}
> > +
> > /* allocate runqueue etc for a new task group */
> > struct task_group *sched_create_group(struct task_group *parent)
> > {
> > @@ -9736,7 +9752,7 @@ struct task_group *sched_create_group(st
> > return tg;
> >
> > err:
> > - sched_free_group(tg);
> > + sched_unregister_group(tg);
>
> This can stay sched_free_group() as neither have the bandwidth timers
> been started yet, nor was this tg made visible outside of this function.
> So omitting the calls to destroy_{cfs,rt}_bandwidth() isn't a problem --
> timers aren't running yet.
>
> > return ERR_PTR(-ENOMEM);
> > }
> >
> > @@ -9759,25 +9775,35 @@ void sched_online_group(struct task_grou
> > }
> >
> > /* rcu callback to free various structures associated with a task group */
> > -static void sched_free_group_rcu(struct rcu_head *rhp)
> > +static void sched_unregister_group_rcu(struct rcu_head *rhp)
> > {
> > /* Now it should be safe to free those cfs_rqs: */
> > - sched_free_group(container_of(rhp, struct task_group, rcu));
> > + sched_unregister_group(container_of(rhp, struct task_group, rcu));
> > }
> >
> > void sched_destroy_group(struct task_group *tg)
> > {
> > /* Wait for possible concurrent references to cfs_rqs complete: */
> > - call_rcu(&tg->rcu, sched_free_group_rcu);
> > + call_rcu(&tg->rcu, sched_unregister_group_rcu);
> > }
> >
> > -void sched_offline_group(struct task_group *tg)
> > +void sched_release_group(struct task_group *tg)
> > {
> > unsigned long flags;
> >
> > - /* End participation in shares distribution: */
> > - unregister_fair_sched_group(tg);
> > -
> > + /*
> > + * Unlink first, to avoid walk_tg_tree_from() from finding us (via
> > + * sched_cfs_period_timer()).
> > + *
> > + * For this to be effective, we have to wait for all pending users of
> > + * this task group to leave their RCU critical section to ensure no new
> > + * user will see our dying task group any more. Specifically ensure
> > + * that tg_unthrottle_up() won't add decayed cfs_rq's to it.
> > + *
> > + * We therefore defer calling unregister_fair_sched_group() to
> > + * sched_unregister_group() which is guarantied to get called only after the
> > + * current RCU grace period has expired.
> > + */
> > spin_lock_irqsave(&task_group_lock, flags);
> > list_del_rcu(&tg->list);
> > list_del_rcu(&tg->siblings);
> > @@ -9896,7 +9922,7 @@ static void cpu_cgroup_css_released(stru
> > {
> > struct task_group *tg = css_tg(css);
> >
> > - sched_offline_group(tg);
> > + sched_release_group(tg);
> > }
> >
> > static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
> > @@ -9906,7 +9932,7 @@ static void cpu_cgroup_css_free(struct c
> > /*
> > * Relies on the RCU grace period between css_released() and this.
> > */
> > - sched_free_group(tg);
> > + sched_unregister_group(tg);
> > }
> >
> > /*
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -11456,8 +11456,6 @@ void free_fair_sched_group(struct task_g
> > {
> > int i;
> >
> > - destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
> > -
> > for_each_possible_cpu(i) {
> > if (tg->cfs_rq)
> > kfree(tg->cfs_rq[i]);
> > @@ -11551,6 +11549,8 @@ void unregister_fair_sched_group(struct
> > list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
> > raw_spin_rq_unlock_irqrestore(rq, flags);
> > }
>
> > +
> > + destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
>
> Move that hunk to the beginning of unregister_fair_sched_group() and
> we're good.

With Mathias' comments, your proposal looks OK to me as well. I don't
have a reproducer, so it's hard to test it.
>
> > }
> >
> > void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -137,13 +137,17 @@ static inline struct rq *rq_of_rt_se(str
> > return rt_rq->rq;
> > }
> >
> > -void free_rt_sched_group(struct task_group *tg)
> > +void unregister_rt_sched_group(struct task_group *tg)
> > {
> > - int i;
> > -
> > if (tg->rt_se)
> > destroy_rt_bandwidth(&tg->rt_bandwidth);
> >
> > +}
> > +
> > +void free_rt_sched_group(struct task_group *tg)
> > +{
> > + int i;
> > +
> > for_each_possible_cpu(i) {
> > if (tg->rt_rq)
> > kfree(tg->rt_rq[i]);
> > @@ -250,6 +254,8 @@ static inline struct rt_rq *rt_rq_of_se(
> > return &rq->rt;
> > }
> >
> > +void unregister_rt_sched_group(struct task_group *tg) { }
> > +
> > void free_rt_sched_group(struct task_group *tg) { }
> >
> > int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent)
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -488,6 +488,7 @@ extern void __refill_cfs_bandwidth_runti
> > extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
> > extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);
> >
> > +extern void unregister_rt_sched_group(struct task_group *tg);
> > extern void free_rt_sched_group(struct task_group *tg);
> > extern int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent);
> > extern void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
> > @@ -503,7 +504,7 @@ extern struct task_group *sched_create_g
> > extern void sched_online_group(struct task_group *tg,
> > struct task_group *parent);
> > extern void sched_destroy_group(struct task_group *tg);
> > -extern void sched_offline_group(struct task_group *tg);
> > +extern void sched_release_group(struct task_group *tg);
> >
> > extern void sched_move_task(struct task_struct *tsk);
> >
> >
>
> Thanks,
> Mathias

2021-11-10 15:20:54

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Prevent dead task groups from regaining cfs_rq's

On Tue, 9 Nov 2021 at 19:47, Michal Koutný <[email protected]> wrote:
>
> Hello.
>
> Let me add a little summary (and I'm CCing cgroups ML too).
>
> The cgroup API allows following callbacks (simplified explanation):
>
> - .css_offline called after killing main reference
> - .css_released called when css.refcnt hits zero
> - .css_free called after css_released and RCU GP
>
> (The hidden catch is that they're not called directly but through workqueues,
> see css_free_work_fn() for details. .css_free() is queued after RCU GP though.)
>
> How is this currently used in the cpu controller:
>
> - .css_offline not used
> - .css_released sched_offline_group / unregister_fair_sched_group
> - .css_free sched_free_group / free_fair_sched_group
>
>
> On Mon, Nov 08, 2021 at 12:40:19PM +0100, Peter Zijlstra <[email protected]> wrote:
> > Something like so good?
>
> Thanks for putting this together, in short I understand it reshuffles as follows:
>
> .css_offline not used
> .css_released cpu_cgroup_css_released
> sched_released_group(tg)
> // unlinking from tg tree
> .css_free cpu_cgroup_css_free
> sched_unregister_group(tg)
> unregister_fair_sched_group(tg)
> for_each(cpu)
> remove_entity_load_avg(cfs_rq)
> list_del_leaf_cfs_rq(cfs_rq) (1)
> destroy_cfs_bandwidth(tg) (2)
> call_rcu(sched_free_group_rcu) (3)
> [rcu] sched_free_group
>
> I see two following issues with this:
>
> The cfs_bandwidth.period_timer is still active between (1) and (2) and can be
> fired and tg_unthrottle_up() may add a cfs_rq back to the leaf list (subsequent
> GP won't help).
> (Admittedly, this window is shorter than the original window between
> cpu_cgroup_css_released() and cpu_cgroup_css_released().)
>
> Having another call_rcu() at (3) seems overly complex given the cgroup API
> provides a grace period for free (pun intended :-).
>
>
> Therefore, symbolically, the symbolic ordering should be:
>
> .css_offline not used
> .css_released cpu_cgroup_css_released
> sched_unregister_group(tg)
> unregister_fair_sched_group(tg)
> destroy_cfs_bandwidth(tg)
> for_each(cpu)
> remove_entity_load_avg(cfs_rq)
> list_del_leaf_cfs_rq(cfs_rq)

As we are not yet unlinked from the tg tree, the parent can
walk_tg_tree_from() and add it back.
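
For illustration, the traversal boils down to an RCU-protected walk over the
children lists. A simplified recursive sketch (the real walk_tg_tree_from() in
kernel/sched/core.c is iterative; the function name below is made up):

    /* tg_visitor as declared in kernel/sched/sched.h */
    typedef int (*tg_visitor)(struct task_group *, void *);

    static int walk_tg_tree_sketch(struct task_group *from,
                                   tg_visitor down, tg_visitor up, void *data)
    {
            struct task_group *child;
            int ret;

            ret = (*down)(from, data);
            if (ret)
                    return ret;

            list_for_each_entry_rcu(child, &from->children, siblings) {
                    ret = walk_tg_tree_sketch(child, down, up, data);
                    if (ret)
                            return ret;
            }

            return (*up)(from, data);
    }

So as long as the tg is still reachable through the children lists -- or a
reader entered its RCU section before the unlinking -- tg_unthrottle_up() can
still run against it.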

> sched_released_group(tg)
> // unlinking from tg tree
>
> .css_free cpu_cgroup_css_free
> sched_free_group
>
> That is, the destroy_cfs_bandwidth() is called first to make sure the
> tg_unthrottle_up() won't undo some of the cleanups and the respective
> structures are only freed after RCU grace period.
>
> (Considering Vincent's remark, remove_entity_load_avg() and
> list_del_leaf_cfs_rq() should be swapped but I'd keep that for a different
> patch.)
>
> To continue the discussion, the suggestion above in a form of diff (I stripped
> the commit message for now).
>
> --8<--
> diff --git a/kernel/sched/autogroup.c b/kernel/sched/autogroup.c
> index 2067080bb235..8629b37d118e 100644
> --- a/kernel/sched/autogroup.c
> +++ b/kernel/sched/autogroup.c
> @@ -31,7 +31,7 @@ static inline void autogroup_destroy(struct kref *kref)
> ag->tg->rt_se = NULL;
> ag->tg->rt_rq = NULL;
> #endif
> - sched_offline_group(ag->tg);
> + sched_release_group(ag->tg);
> sched_destroy_group(ag->tg);
> }
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 523fd602ea90..a5ca3ae6837b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9771,12 +9771,12 @@ void sched_destroy_group(struct task_group *tg)
> call_rcu(&tg->rcu, sched_free_group_rcu);
> }
>
> -void sched_offline_group(struct task_group *tg)
> +void sched_release_group(struct task_group *tg)
> {
> unsigned long flags;
>
> - /* End participation in shares distribution: */
> unregister_fair_sched_group(tg);
> + unregister_rt_sched_group(tg);
>
> spin_lock_irqsave(&task_group_lock, flags);
> list_del_rcu(&tg->list);
> @@ -9896,7 +9896,7 @@ static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
> {
> struct task_group *tg = css_tg(css);
>
> - sched_offline_group(tg);
> + sched_release_group(tg);
> }
>
> static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 13950beb01a2..6e476f6d9435 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11456,8 +11456,6 @@ void free_fair_sched_group(struct task_group *tg)
> {
> int i;
>
> - destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
> -
> for_each_possible_cpu(i) {
> if (tg->cfs_rq)
> kfree(tg->cfs_rq[i]);
> @@ -11534,6 +11532,8 @@ void unregister_fair_sched_group(struct task_group *tg)
> struct rq *rq;
> int cpu;
>
> + destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
> +
> for_each_possible_cpu(cpu) {
> if (tg->se[cpu])
> remove_entity_load_avg(tg->se[cpu]);
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index bb945f8faeca..b48baaba2fc2 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -137,13 +137,17 @@ static inline struct rq *rq_of_rt_se(struct sched_rt_entity *rt_se)
> return rt_rq->rq;
> }
>
> -void free_rt_sched_group(struct task_group *tg)
> +void unregister_rt_sched_group(struct task_group *tg)
> {
> - int i;
> -
> if (tg->rt_se)
> destroy_rt_bandwidth(&tg->rt_bandwidth);
>
> +}
> +
> +void free_rt_sched_group(struct task_group *tg)
> +{
> + int i;
> +
> for_each_possible_cpu(i) {
> if (tg->rt_rq)
> kfree(tg->rt_rq[i]);
> @@ -250,6 +254,8 @@ static inline struct rt_rq *rt_rq_of_se(struct sched_rt_entity *rt_se)
> return &rq->rt;
> }
>
> +void unregister_rt_sched_group(struct task_group *tg) { }
> +
> void free_rt_sched_group(struct task_group *tg) { }
>
> int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 7f1612d26c18..0e66749486e7 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -488,6 +488,7 @@ extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
> extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
> extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);
>
> +extern void unregister_rt_sched_group(struct task_group *tg);
> extern void free_rt_sched_group(struct task_group *tg);
> extern int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent);
> extern void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
> @@ -503,7 +504,7 @@ extern struct task_group *sched_create_group(struct task_group *parent);
> extern void sched_online_group(struct task_group *tg,
> struct task_group *parent);
> extern void sched_destroy_group(struct task_group *tg);
> -extern void sched_offline_group(struct task_group *tg);
> +extern void sched_release_group(struct task_group *tg);
>
> extern void sched_move_task(struct task_struct *tsk);
>
>