2021-09-28 12:29:48

by Thomas Gleixner

Subject: [patch 4/5] sched: Delay task stack freeing on RT

From: Sebastian Andrzej Siewior <[email protected]>

Anything which is done on behalf of a dead task at the end of
finish_task_switch() is preventing the incoming task from doing useful
work. While it is beneficial for fork heavy workloads to recycle the task
stack quickly, this is a latency source for real-time tasks.

Therefore delay the stack cleanup on RT enabled kernels.

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
---
kernel/exit.c | 5 +++++
kernel/fork.c | 5 ++++-
kernel/sched/core.c | 8 ++++++--
3 files changed, 15 insertions(+), 3 deletions(-)

--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -172,6 +172,11 @@ static void delayed_put_task_struct(stru
 	kprobe_flush_task(tsk);
 	perf_event_delayed_put(tsk);
 	trace_sched_process_free(tsk);
+
+	/* RT enabled kernels delay freeing the VMAP'ed task stack */
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		put_task_stack(tsk);
+
 	put_task_struct(tsk);
 }

--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -289,7 +289,10 @@ static inline void free_thread_stack(str
 			return;
 		}
 
-		vfree_atomic(tsk->stack);
+		if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+			vfree_atomic(tsk->stack);
+		else
+			vfree(tsk->stack);
 		return;
 	}
 #endif
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4846,8 +4846,12 @@ static struct rq *finish_task_switch(str
 	if (prev->sched_class->task_dead)
 		prev->sched_class->task_dead(prev);
 
-	/* Task is done with its stack. */
-	put_task_stack(prev);
+	/*
+	 * Release the VMAP'ed task stack immediately for reuse. On RT
+	 * enabled kernels this is delayed for latency reasons.
+	 */
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		put_task_stack(prev);
 
 	put_task_struct_rcu_user(prev);
 }


2021-09-29 12:02:27

by Peter Zijlstra

Subject: Re: [patch 4/5] sched: Delay task stack freeing on RT

On Tue, Sep 28, 2021 at 02:24:30PM +0200, Thomas Gleixner wrote:

> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -172,6 +172,11 @@ static void delayed_put_task_struct(stru
>  	kprobe_flush_task(tsk);
>  	perf_event_delayed_put(tsk);
>  	trace_sched_process_free(tsk);
> +
> +	/* RT enabled kernels delay freeing the VMAP'ed task stack */
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT))
> +		put_task_stack(tsk);
> +
>  	put_task_struct(tsk);
>  }

> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4846,8 +4846,12 @@ static struct rq *finish_task_switch(str
>  	if (prev->sched_class->task_dead)
>  		prev->sched_class->task_dead(prev);
> 
> -	/* Task is done with its stack. */
> -	put_task_stack(prev);
> +	/*
> +	 * Release the VMAP'ed task stack immediately for reuse. On RT
> +	 * enabled kernels this is delayed for latency reasons.
> +	 */
> +	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> +		put_task_stack(prev);
> 
>  	put_task_struct_rcu_user(prev);
>  }


Having this logic split across two files seems unfortunate and prone to
'accidents'. Is there a real down-side to unconditionally doing it in
delayed_put_task_struct() ?

/me goes out for lunch... meanwhile tglx points at: 68f24b08ee89.

Bah.. Andy?

2021-10-01 17:43:44

by Thomas Gleixner

Subject: Re: [patch 4/5] sched: Delay task stack freeing on RT

On Fri, Oct 01 2021 at 09:12, Andy Lutomirski wrote:
> On Wed, Sep 29, 2021 at 4:54 AM Peter Zijlstra <[email protected]> wrote:
>> Having this logic split across two files seems unfortunate and prone to
>> 'accidents'. Is there a real down-side to unconditionally doing it in
>> delayed_put_task_struct() ?
>>
>> /me goes out for lunch... meanwhile tglx points at: 68f24b08ee89.
>>
>> Bah.. Andy?
>
> Could we make whatever we do here unconditional?

Sure. I just was unsure about your reasoning in 68f24b08ee89.

> And what actually causes the latency? If it's vfree, shouldn't the
> existing use of vfree_atomic() in free_thread_stack() handle it? Or
> is it the accounting?

The accounting muck because it can go into the allocator and sleep in
the worst case, which is nasty even on !RT kernels.

But thinking some more, there is actually a way nastier issue on RT in
the following case:

CPU 0                                   CPU 1
T1
 spin_lock(L1)
  rt_mutex_lock()
   schedule()

                                        T2
                                         do_exit()
                                          do_task_dead()
 spin_unlock(L1)
  wake(T1)
                                          __schedule()
                                          switch_to(T1)
                                          finish_task_switch()
                                            put_task_stack()
                                              account()
                                              ....
                                              spin_lock(L2)

So if L1 == L2 or L1 and L2 have a reverse dependency then this can just
deadlock.

We've never observed that, but the above case is obviously hard to
hit. Nevertheless it's there.

Thanks,

tglx

2021-10-01 18:47:31

by Andy Lutomirski

Subject: Re: [patch 4/5] sched: Delay task stack freeing on RT

On Wed, Sep 29, 2021 at 4:54 AM Peter Zijlstra <[email protected]> wrote:
>
> On Tue, Sep 28, 2021 at 02:24:30PM +0200, Thomas Gleixner wrote:
>
> > --- a/kernel/exit.c
> > +++ b/kernel/exit.c
> > @@ -172,6 +172,11 @@ static void delayed_put_task_struct(stru
> >  	kprobe_flush_task(tsk);
> >  	perf_event_delayed_put(tsk);
> >  	trace_sched_process_free(tsk);
> > +
> > +	/* RT enabled kernels delay freeing the VMAP'ed task stack */
> > +	if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > +		put_task_stack(tsk);
> > +
> >  	put_task_struct(tsk);
> >  }
>
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4846,8 +4846,12 @@ static struct rq *finish_task_switch(str
> >  	if (prev->sched_class->task_dead)
> >  		prev->sched_class->task_dead(prev);
> > 
> > -	/* Task is done with its stack. */
> > -	put_task_stack(prev);
> > +	/*
> > +	 * Release the VMAP'ed task stack immediately for reuse. On RT
> > +	 * enabled kernels this is delayed for latency reasons.
> > +	 */
> > +	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> > +		put_task_stack(prev);
> > 
> >  	put_task_struct_rcu_user(prev);
> >  }
>
>
> Having this logic split across two files seems unfortunate and prone to
> 'accidents'. Is there a real down-side to unconditionally doing it in
> delayed_put_task_struct() ?
>
> /me goes out for lunch... meanwhile tglx points at: 68f24b08ee89.
>
> Bah.. Andy?

Could we make whatever we do here unconditional? And what actually
causes the latency? If it's vfree, shouldn't the existing use of
vfree_atomic() in free_thread_stack() handle it? Or is it the
accounting?


--
Andy Lutomirski
AMA Capital Management, LLC

2021-10-01 19:16:23

by Andy Lutomirski

Subject: Re: [patch 4/5] sched: Delay task stack freeing on RT

On Fri, Oct 1, 2021 at 10:24 AM Thomas Gleixner <[email protected]> wrote:
>
> On Fri, Oct 01 2021 at 09:12, Andy Lutomirski wrote:
> > On Wed, Sep 29, 2021 at 4:54 AM Peter Zijlstra <[email protected]> wrote:
> >> Having this logic split across two files seems unfortunate and prone to
> >> 'accidents'. Is there a real down-side to unconditionally doing it in
> >> delayed_put_task_struct() ?
> >>
> >> /me goes out for lunch... meanwhile tglx points at: 68f24b08ee89.
> >>
> >> Bah.. Andy?
> >
> > Could we make whatever we do here unconditional?
>
> Sure. I just was unsure about your reasoning in 68f24b08ee89.

Mmm, right. The reasoning is that there are a lot of workloads that
frequently wait for a task to exit and immediately start a new task --
most shell scripts, for example. I think I tested this with the
following amazing workload:

while true; do true; done

and we want to reuse the same stack each time from the cached stack
lookaside list instead of vfreeing and vmallocing a stack each time.
Deferring the release to the lookaside list breaks it. Although I
suppose the fact that it works well right now is a bit fragile --
we're waking the parent (sh, etc) before releasing the stack, but
nothing gets to run until the stack is released.

>
> > And what actually causes the latency? If it's vfree, shouldn't the
> > existing use of vfree_atomic() in free_thread_stack() handle it? Or
> > is it the accounting?
>
> The accounting muck because it can go into the allocator and sleep in
> the worst case, which is nasty even on !RT kernels.

Wait, unaccounting memory can go into the allocator? That seems quite nasty.

>
> But thinking some more, there is actually a way nastier issue on RT in
> the following case:
>
> CPU 0                                 CPU 1
> T1
>  spin_lock(L1)
>   rt_mutex_lock()
>    schedule()
>
>                                       T2
>                                        do_exit()
>                                         do_task_dead()
>  spin_unlock(L1)
>   wake(T1)
>                                         __schedule()
>                                         switch_to(T1)
>                                         finish_task_switch()
>                                           put_task_stack()
>                                             account()
>                                             ....
>                                             spin_lock(L2)
>
> So if L1 == L2 or L1 and L2 have a reverse dependency then this can just
> deadlock.
>
> We've never observed that, but the above case is obviously hard to
> hit. Nevertheless it's there.

Hmm.

ISTM it would be conceptually cleaner for do_exit() to handle its own freeing
in its own preemptible context. Obviously that can't really work,
since we can't free a task_struct or a task stack while we're running
on it. But I wonder if we could approximate it by putting this work
in a workqueue so that it all runs in a normal schedulable context.
To make the shell script case work nicely, we want to release the task
stack before notifying anyone waiting for the dying task to exit, but
maybe that's doable. It could involve some nasty exit_signal hackery,
though.

2021-10-01 21:24:34

by Andy Lutomirski

Subject: Re: [patch 4/5] sched: Delay task stack freeing on RT

On Fri, Oct 1, 2021 at 11:48 AM Andy Lutomirski <[email protected]> wrote:
>
> On Fri, Oct 1, 2021 at 10:24 AM Thomas Gleixner <[email protected]> wrote:
> >

> ISTM it would be conceptually cleaner for do_exit() to handle its own freeing
> in its own preemptible context. Obviously that can't really work,
> since we can't free a task_struct or a task stack while we're running
> on it. But I wonder if we could approximate it by putting this work
> in a workqueue so that it all runs in a normal schedulable context.
> To make the shell script case work nicely, we want to release the task
> stack before notifying anyone waiting for the dying task to exit, but
> maybe that's doable. It could involve some nasty exit_signal hackery,
> though.

I'm making this way more complicated than it needs to be. How about
we unaccount the task stack in do_exit and release it for real in
finish_task_switch()? Other than accounting, free_thread_stack
doesn't take any locks.
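A rough sketch of that split (illustrative pseudocode, not a tested patch;
exit_task_stack_account() is a made-up helper name here, and the
account_kernel_stack() signature is assumed from kernel/fork.c):

```
/* Called from do_exit(), still fully preemptible: dropping the
 * stack accounting may sleep here without harm. The pages stay
 * alive -- we are still executing on this stack. */
static void exit_task_stack_account(struct task_struct *tsk)
{
	account_kernel_stack(tsk, -1);	/* may enter the allocator */
}

/* finish_task_switch() then only frees or caches the pages.
 * With the accounting already gone, put_task_stack() no longer
 * takes accounting locks from this non-preemptible region. */
	put_task_stack(prev);
```

Splitting it this way keeps the sleepable work in a context that is allowed to
sleep, addressing both the RT latency concern and the lock-ordering scenario
tglx describes, while preserving the stack-cache reuse for the fork-heavy case.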

--Andy

2021-10-01 21:45:20

by Thomas Gleixner

Subject: Re: [patch 4/5] sched: Delay task stack freeing on RT

On Fri, Oct 01 2021 at 12:02, Andy Lutomirski wrote:
> On Fri, Oct 1, 2021 at 11:48 AM Andy Lutomirski <[email protected]> wrote:
>>
>> On Fri, Oct 1, 2021 at 10:24 AM Thomas Gleixner <[email protected]> wrote:
>> >
>
>> ISTM it would be conceptually cleaner for do_exit() to handle its own freeing
>> in its own preemptible context. Obviously that can't really work,
>> since we can't free a task_struct or a task stack while we're running
>> on it. But I wonder if we could approximate it by putting this work
>> in a workqueue so that it all runs in a normal schedulable context.
>> To make the shell script case work nicely, we want to release the task
>> stack before notifying anyone waiting for the dying task to exit, but
>> maybe that's doable. It could involve some nasty exit_signal hackery,
>> though.
>
> I'm making this way more complicated than it needs to be. How about
> we unaccount the task stack in do_exit and release it for real in
> finish_task_switch()? Other than accounting, free_thread_stack
> doesn't take any locks.

Right.