LinuxLists.cc - [PATCH] sched/pi: Reweight fair_policy() tasks when inheriting prio

2024-04-03 01:00:06

Subject: [PATCH] sched/pi: Reweight fair_policy() tasks when inheriting prio

For fair tasks, inheriting the priority (nice) without reweighting is
a NOP as the task's share won't change.

This is visible when running with PTHREAD_PRIO_INHERIT where fair tasks
with low priority values are susceptible to starvation, leading to a
priority-inversion-like impact on lock contention.

The logic in rt_mutex will reset these low priority fair tasks into nice
0, but without the additional reweight operation to actually update the
weights, it doesn't have the desired impact of boosting them to allow
them to run sooner/longer to release the lock.

Apply the reweight for fair_policy() tasks to achieve the desired boost
for those low nice values tasks. Note that boost here means resetting
their nice to 0; as this is what the current logic does for fair tasks.

Handling of idle_policy() requires more code refactoring and is not
handled yet. idle_policy() are treated specially and only run when the
CPU is idle and get a hardcoded low weight value. Changing weights won't
be enough without a promotion first to SCHED_OTHER.

Tested with a test program that creates three threads.

1. main thread that spawns high prio and low prio task and busy
loops

2. low priority thread that holds a pthread_mutex() with
PTHREAD_PRIO_INHERIT protocol. Runs at nice +10. Busy loops
after holding the lock.

3. high priority thread that holds a pthread_mutex() with
PTHREAD_PRIO_INHERIT, but made to start after the low
priority thread. Runs at nice 0. Should remain blocked by the
low priority thread.

All tasks are pinned to CPU0.

Without the patch I can see the low priority thread running only for
~10% of the time which is what is expected without it being boosted.

With the patch the low priority thread runs for ~50% which is what is
expected if it gets boosted to nice 0.

I modified the test program logic afterwards to ensure that after
releasing the lock the low priority thread goes back to running for 10%
of the time, and it does.

Reported-by: Yabin Cui <[email protected]>
Signed-off-by: Qais Yousef <[email protected]>
---
kernel/sched/core.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0621e4ee31de..b90a541810da 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7242,8 +7242,10 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
 	} else {
 		if (dl_prio(oldprio))
 			p->dl.pi_se = &p->dl;
-		if (rt_prio(oldprio))
+		else if (rt_prio(oldprio))
 			p->rt.timeout = 0;
+		else if (!task_has_idle_policy(p))
+			reweight_task(p, prio - MAX_RT_PRIO);
 	}
 
 	__setscheduler_prio(p, prio);
--
2.34.1
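
The second argument in the added reweight_task() call, prio - MAX_RT_PRIO, reads like an index into the kernel's nice-to-weight table. The following is a minimal userspace sketch of that arithmetic, not kernel code. It assumes MAX_RT_PRIO is 100 and DEFAULT_PRIO is 120, copies the values of sched_prio_to_weight[], and ignores the kernel's scale_load() scaling. The helper names nice_to_prio() and weight_for_prio() are hypothetical.

```c
#include <assert.h>

#define MAX_RT_PRIO	100	/* assumed, as in include/linux/sched/prio.h */
#define NICE_WIDTH	40
#define DEFAULT_PRIO	(MAX_RT_PRIO + NICE_WIDTH / 2)	/* 120 == nice 0 */

/* Copy of the kernel's sched_prio_to_weight[] table, nice -20 .. +19. */
static const int prio_to_weight[NICE_WIDTH] = {
	/* -20 */ 88761, 71755, 56483, 46273, 36291,
	/* -15 */ 29154, 23254, 18705, 14949, 11916,
	/* -10 */  9548,  7620,  6100,  4904,  3906,
	/*  -5 */  3121,  2501,  1991,  1586,  1277,
	/*   0 */  1024,   820,   655,   526,   423,
	/*   5 */   335,   272,   215,   172,   137,
	/*  10 */   110,    87,    70,    56,    45,
	/*  15 */    36,    29,    23,    18,    15,
};

/* Hypothetical helper: prio of a fair task with the given nice value. */
static int nice_to_prio(int nice)
{
	return DEFAULT_PRIO + nice;
}

/* Hypothetical helper: weight selected by an index of prio - MAX_RT_PRIO. */
static int weight_for_prio(int prio)
{
	int idx = prio - MAX_RT_PRIO;

	assert(idx >= 0 && idx < NICE_WIDTH);	/* fair priority range only */
	return prio_to_weight[idx];
}
```

Under these assumptions, a nice +10 task (prio 130) maps to index 30 and a nice 0 task (prio 120) maps to index 20, so weight_for_prio(nice_to_prio(10)) and weight_for_prio(nice_to_prio(0)) pick the two weights that the boost switches between.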

2024-04-03 13:11:26

by Vincent Guittot

Subject: Re: [PATCH] sched/pi: Reweight fair_policy() tasks when inheriting prio

On Wed, 3 Apr 2024 at 02:59, Qais Yousef <[email protected]> wrote:
>
> For fair tasks inheriting the priority (nice) without reweighting is
> a NOP as the task's share won't change.

AFAICT, there is no nice priority inheritance with rt_mutex; All nice
tasks are sorted with the same "default prio" in the rb waiter tree.
This means that the rt top waiter is not the cfs with highest prio but
the 1st cfs waiting for the mutex.
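
The collapsing of fair waiters onto one key can be modelled in userspace. This is only a sketch: waiter_prio_collapsed() mirrors the __waiter_prio() body that the later patch in this thread removes, and waiter_prio_raw() mirrors its replacement. Both names are hypothetical, and the constants are assumed to match the kernel's (MAX_RT_PRIO 100, DEFAULT_PRIO 120, MAX_PRIO 140).

```c
#include <assert.h>

#define MAX_RT_PRIO	100	/* assumed kernel value */
#define DEFAULT_PRIO	120	/* assumed kernel value, nice 0 */
#define MAX_PRIO	140	/* assumed kernel value */

/* Mirrors the kernel's rt_prio(): a lower number is a higher priority. */
static int rt_prio(int prio)
{
	return prio < MAX_RT_PRIO;
}

/* Model of the current behaviour: every non-RT waiter gets the same key. */
static int waiter_prio_collapsed(int task_prio)
{
	assert(task_prio < MAX_PRIO);

	if (!rt_prio(task_prio))
		return DEFAULT_PRIO;

	return task_prio;
}

/* Model of the proposed behaviour: the waiter key is the task's own prio. */
static int waiter_prio_raw(int task_prio)
{
	return task_prio;
}
```

With the collapsed key, a nice -10 waiter (prio 110) and a nice +19 waiter (prio 139) both sort as 120, so the tree cannot tell them apart by priority. With the raw key they keep 110 and 139.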

>
> This is visible when running with PTHREAD_PRIO_INHERIT where fair tasks
> with low priority values are susceptible to starvation leading to PI
> like impact on lock contention.
>
> The logic in rt_mutex will reset these low priority fair tasks into nice
> 0, but without the additional reweight operation to actually update the
> weights, it doesn't have the desired impact of boosting them to allow
> them to run sooner/longer to release the lock.
>
> Apply the reweight for fair_policy() tasks to achieve the desired boost
> for those low nice values tasks. Note that boost here means resetting
> their nice to 0; as this is what the current logic does for fair tasks.

But conversely, you can also decrease the cfs prio of a task,
and it gets even worse given the comment:
/* XXX used to be waiter->prio, not waiter->task->prio */

we use the prio of the top cfs waiter (ie the one waiting for the
lock) not the default 0 so it can be anything in the range [-20:19]

Then, a task with low prio (i.e. nice > 0) can get a prio boost even
if this task and the waiter are low priority tasks

>
> Handling of idle_policy() requires more code refactoring and is not
> handled yet. idle_policy() are treated specially and only run when the
> CPU is idle and get a hardcoded low weight value. Changing weights won't
> be enough without a promotion first to SCHED_OTHER.
>
> Tested with a test program that creates three threads.
>
> 1. main thread that spawns high prio and low prio task and busy
> loops
>
> 2. low priority thread that holds a pthread_mutex() with
> PTHREAD_PRIO_INHERIT protocol. Runs at nice +10. Busy loops
> after holding the lock.
>
> 3. high priority thread that holds a pthread_mutex() with
> PTHREAD_PRIO_INHERIT, but made to start after the low
> priority thread. Runs at nice 0. Should remain blocked by the
> low priority thread.
>
> All tasks are pinned to CPU0.
>
> Without the patch I can see the low priority thread running only for
> ~10% of the time which is what is expected without it being boosted.
>
> With the patch the low priority thread runs for ~50% which is what is
> expected if it gets boosted to nice 0.
>
> I modified the test program logic afterwards to ensure that after
> releasing the lock the low priority thread goes back to running for 10%
> of the time, and it does.
>
> Reported-by: Yabin Cui <[email protected]>
> Signed-off-by: Qais Yousef <[email protected]>
> ---
> kernel/sched/core.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0621e4ee31de..b90a541810da 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7242,8 +7242,10 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
>  	} else {
>  		if (dl_prio(oldprio))
>  			p->dl.pi_se = &p->dl;
> -		if (rt_prio(oldprio))
> +		else if (rt_prio(oldprio))
>  			p->rt.timeout = 0;
> +		else if (!task_has_idle_policy(p))
> +			reweight_task(p, prio - MAX_RT_PRIO);
>  	}
>
>  	__setscheduler_prio(p, prio);
> --
> 2.34.1
>

2024-04-03 13:40:13

by Steven Rostedt

Subject: Re: [PATCH] sched/pi: Reweight fair_policy() tasks when inheriting prio

On Wed, 3 Apr 2024 15:11:06 +0200
Vincent Guittot <[email protected]> wrote:

> On Wed, 3 Apr 2024 at 02:59, Qais Yousef <[email protected]> wrote:
> >
> > For fair tasks inheriting the priority (nice) without reweighting is
> > a NOP as the task's share won't change.
>
> AFAICT, there is no nice priority inheritance with rt_mutex; All nice
> tasks are sorted with the same "default prio" in the rb waiter tree.
> This means that the rt top waiter is not the cfs with highest prio but
> the 1st cfs waiting for the mutex.

I think the issue here is that the running process doesn't update its
weight and if there are other tasks that are not contending on this mutex,
they can still starve the lock owner.

IIUC (it's been ages since I looked at the code), high nice values (low
priority) turn to at least nice 0 when they are "boosted". It doesn't
improve their chances of getting the lock though.
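
The effect of the stale weight on the lock owner's CPU share can be worked through with the numbers from the test in the original patch. This is a back-of-the-envelope sketch that assumes only two tasks are runnable on CPU0 (the busy-looping main thread at nice 0 and the lock owner, with the waiter blocked), that the nice 0 and nice +10 weights are 1024 and 110 as in sched_prio_to_weight[], and that a task's share is simply its weight over the total runnable weight. The helper share_permille() is hypothetical.

```c
#include <assert.h>

#define WEIGHT_NICE_0	1024	/* sched_prio_to_weight[] entry for nice 0 */
#define WEIGHT_NICE_10	110	/* sched_prio_to_weight[] entry for nice +10 */

/*
 * Hypothetical helper: share of CPU time, in parts per thousand, that a
 * task of the given weight gets when total_weight is runnable on the CPU.
 */
static int share_permille(int weight, int total_weight)
{
	assert(total_weight > 0);
	return weight * 1000 / total_weight;
}
```

Under these assumptions, share_permille(WEIGHT_NICE_10, WEIGHT_NICE_10 + WEIGHT_NICE_0) gives 97, i.e. roughly 10% for the unboosted owner, and share_permille(WEIGHT_NICE_0, 2 * WEIGHT_NICE_0) gives 500, i.e. 50% once the owner's weight follows its boosted nice 0. Both figures are in line with the ~10% and ~50% reported in the patch description.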

>
> >
> > This is visible when running with PTHREAD_PRIO_INHERIT where fair tasks
> > with low priority values are susceptible to starvation leading to PI
> > like impact on lock contention.
> >
> > The logic in rt_mutex will reset these low priority fair tasks into nice
> > 0, but without the additional reweight operation to actually update the
> > weights, it doesn't have the desired impact of boosting them to allow
> > them to run sooner/longer to release the lock.
> >
> > Apply the reweight for fair_policy() tasks to achieve the desired boost
> > for those low nice values tasks. Note that boost here means resetting
> > their nice to 0; as this is what the current logic does for fair tasks.
>
> But you can at the opposite decrease the cfs prio of a task
> and even worse with the comment :
> /* XXX used to be waiter->prio, not waiter->task->prio */
>
> we use the prio of the top cfs waiter (ie the one waiting for the
> lock) not the default 0 so it can be anything in the range [-20:19]
>
> Then, a task with low prio (i.e. nice > 0) can get a prio boost even
> if this task and the waiter are low priority tasks

Yeah, I'm all confused as to exactly how the inheritance works with
SCHED_OTHER. I know John Stultz worked on this for a bit recently. He's
Cc'ed. But may not be paying attention ;-)

-- Steve

2024-04-03 13:54:31

by Vincent Guittot

Subject: Re: [PATCH] sched/pi: Reweight fair_policy() tasks when inheriting prio

On Wed, 3 Apr 2024 at 15:40, Steven Rostedt <[email protected]> wrote:
>
> On Wed, 3 Apr 2024 15:11:06 +0200
> Vincent Guittot <[email protected]> wrote:
>
> > On Wed, 3 Apr 2024 at 02:59, Qais Yousef <[email protected]> wrote:
> > >
> > > For fair tasks inheriting the priority (nice) without reweighting is
> > > a NOP as the task's share won't change.
> >
> > AFAICT, there is no nice priority inheritance with rt_mutex; All nice
> > tasks are sorted with the same "default prio" in the rb waiter tree.
> > This means that the rt top waiter is not the cfs with highest prio but
> > the 1st cfs waiting for the mutex.
>
> I think the issue here is that the running process doesn't update its
> weight and if there are other tasks that are not contending on this mutex,
> they can still starve the lock owner.

But I think it's on purpose: we don't boost cfs tasks and we never
have. Boosting them could be a good thing to do, but I think the
current code was not written for that and doing so might raise other
problems. I don't think it's an oversight.

>
> IIUC (it's been ages since I looked at the code), high nice values (low
> priority) turn to at least nice 0 when they are "boosted". It doesn't
> improve their chances of getting the lock though.
>
> >
> > >
> > > This is visible when running with PTHREAD_PRIO_INHERIT where fair tasks
> > > with low priority values are susceptible to starvation leading to PI
> > > like impact on lock contention.
> > >
> > > The logic in rt_mutex will reset these low priority fair tasks into nice
> > > 0, but without the additional reweight operation to actually update the
> > > weights, it doesn't have the desired impact of boosting them to allow
> > > them to run sooner/longer to release the lock.
> > >
> > > Apply the reweight for fair_policy() tasks to achieve the desired boost
> > > for those low nice values tasks. Note that boost here means resetting
> > > their nice to 0; as this is what the current logic does for fair tasks.
> >
> > But you can at the opposite decrease the cfs prio of a task
> > and even worse with the comment :
> > /* XXX used to be waiter->prio, not waiter->task->prio */
> >
> > we use the prio of the top cfs waiter (ie the one waiting for the
> > lock) not the default 0 so it can be anything in the range [-20:19]
> >
> > Then, a task with low prio (i.e. nice > 0) can get a prio boost even
> > if this task and the waiter are low priority tasks
>
>
> Yeah, I'm all confused to exactly how the inheritance works with
> SCHED_OTHER. I know John Stultz worked on this for a bit recently. He's
> Cc'ed. But may not be paying attention ;-)
>
> -- Steve

2024-04-04 22:09:40

On 04/10/24 16:47, Qais Yousef wrote:
> On 04/10/24 11:13, Vincent Guittot wrote:
>
> > > > Without cgroup, the solution could be straightforward but android uses
> > > > extensively cgroup AFAICT and update_cfs_group() makes impossible to
> > > > track the top cfs waiter and its "prio"
> > >
> > > :(
> > >
> > > IIUC the issue is that we can't easily come up with a single number of
> > > 'effective prio' for N level hierarchy and compare it with another M level
> > > hierarchy..
> >
> > And then how do you apply it on the hierarchy ?
>
> (I am not disagreeing with you, just trying to state the reasons more
> explicitly)
>
> I think the application is easy, attach to the leaf cfs_rq? Which IIUC
> correctly what should happen with proxy execution, but by consuming the context
> of the donor directly without having explicitly to move the lock owner.
>
> Finding out which hierarchy actually has the highest effective share is not
> straightforward I agree. And if we combine a potential operation of something
> that could move any waiting task to a different hierarchy at anytime, this gets
> even more complex.
>
> I need to go and find more info, but seems Windows has some kind of boost
> mechanism to help the lock owner to release the lock faster. I wonder if
> something like that could help as interim solution. What we could do is move
> the task to root group as a boost with the simple reweight operation proposed
> here applied. As soon as it releases the lock we should restore it.
>
> From what I heard in Windows this boost happens randomly (don't quote me on
> this). I am not sure could be our trigger mechanism. We sure don't want to do
> this unconditionally otherwise we break fairness.
>
> Maybe there are easier ways to introduce a simple such boost mechanism..

FWIW, I tried to find the top-fair-waiter and that wasn't trivial. I needed
to refactor a fair bit of the code that expects the top-waiter to be the
leftmost..

And I haven't looked at that temporary boost mechanism for cfs. Maybe I'll try
that if I get a chance. For the time being, I got this bandaid if anybody is
interested in a temporary 'solution'

--->8---

From 7169519792f11a73f861a41dd7d5c9151dc44dd7 Mon Sep 17 00:00:00 2001
From: Qais Yousef <[email protected]>
Date: Mon, 1 Apr 2024 03:04:00 +0100
Subject: [PATCH] sched/pi: Reweight fair_policy() tasks when inheriting prio

For fair tasks, inheriting the priority (nice) without reweighting is
a NOP as the task's share won't change.

This is visible when running with PTHREAD_PRIO_INHERIT where fair tasks
with low priority values are susceptible to starvation, leading to a
priority-inversion-like impact on lock contention.

The logic in rt_mutex will reset these low priority fair tasks into nice
0, but without the additional reweight operation to actually update the
weights, it doesn't have the desired impact of boosting them to allow
them to run sooner/longer to release the lock.

Apply the reweight for fair_policy() tasks to achieve the desired boost
for those low nice values tasks. Note that boost here means resetting
their nice to 0; as this is what the current logic does for fair tasks.

We need to re-instate ordering fair tasks by their priority order on the
waiter tree to ensure we inherit the top_waiter properly.

Handling of idle_policy() requires more code refactoring and is not
handled yet. idle_policy() are treated specially and only run when the
CPU is idle and get a hardcoded low weight value. Changing weights won't
be enough without a promotion first to SCHED_OTHER.

Tested with a test program that creates three threads.

1. main thread that spawns high prio and low prio task and busy
loops

2. low priority thread that holds a pthread_mutex() with
PTHREAD_PRIO_INHERIT protocol. Runs at nice +10. Busy loops
after holding the lock.

3. high priority thread that holds a pthread_mutex() with
PTHREAD_PRIO_INHERIT, but made to start after the low
priority thread. Runs at nice 0. Should remain blocked by the
low priority thread.

All tasks are pinned to CPU0.

Without the patch I can see the low priority thread running only for
~10% of the time which is what is expected without it being boosted.

With the patch the low priority thread runs for ~50% which is what is
expected if it gets boosted to nice 0.

I modified the test program logic afterwards to ensure that after
releasing the lock the low priority thread goes back to running for 10%
of the time, and it does.

Reported-by: Yabin Cui <[email protected]>
Signed-off-by: Qais Yousef <[email protected]>
---
kernel/locking/rtmutex.c | 7 +------
kernel/sched/core.c | 4 +++-
2 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 88d08eeb8bc0..4e155862aba1 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -345,12 +345,7 @@ static __always_inline bool unlock_rt_mutex_safe(struct rt_mutex_base *lock,

 static __always_inline int __waiter_prio(struct task_struct *task)
 {
-	int prio = task->prio;
-
-	if (!rt_prio(prio))
-		return DEFAULT_PRIO;
-
-	return prio;
+	return task->prio;
 }
 
 /*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1a914388144a..f22db270b0d9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7242,8 +7242,10 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
 	} else {
 		if (dl_prio(oldprio))
 			p->dl.pi_se = &p->dl;
-		if (rt_prio(oldprio))
+		else if (rt_prio(oldprio))
 			p->rt.timeout = 0;
+		else if (!task_has_idle_policy(p))
+			reweight_task(p, prio - MAX_RT_PRIO);
 	}
 
 	__setscheduler_prio(p, prio);
--
2.34.1