For fair tasks inheriting the priority (nice) without reweighting is
a NOP as the task's share won't change.
This is visible when running with PTHREAD_PRIO_INHERIT where fair tasks
with low priority values are susceptible to starvation leading to PI
like impact on lock contention.
The logic in rt_mutex will reset these low priority fair tasks into nice
0, but without the additional reweight operation to actually update the
weights, it doesn't have the desired impact of boosting them to allow
them to run sooner/longer to release the lock.
Apply the reweight for fair_policy() tasks to achieve the desired boost
for those low nice values tasks. Note that boost here means resetting
their nice to 0; as this is what the current logic does for fair tasks.
Handling of idle_policy() requires more code refactoring and is not
handled yet. idle_policy() are treated specially and only run when the
CPU is idle and get a hardcoded low weight value. Changing weights won't
be enough without a promotion first to SCHED_OTHER.
Tested with a test program that creates three threads.
1. main thread that spanws high prio and low prio task and busy
loops
2. low priority thread that holds a pthread_mutex() with
PTHREAD_PRIO_INHERIT protocol. Runs at nice +10. Busy loops
after holding the lock.
3. high priority thread that holds a pthread_mutex() with
PTHREADPTHREAD_PRIO_INHERIT, but made to start after the low
priority thread. Runs at nice 0. Should remain blocked by the
low priority thread.
All tasks are pinned to CPU0.
Without the patch I can see the low priority thread running only for
~10% of the time which is what expected without it being boosted.
With the patch the low priority thread runs for ~50% which is what
expected if it gets boosted to nice 0.
I modified the test program logic afterwards to ensure that after
releasing the lock the low priority thread goes back to running for 10%
of the time, and it does.
Reported-by: Yabin Cui <[email protected]>
Signed-off-by: Qais Yousef <[email protected]>
---
kernel/sched/core.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0621e4ee31de..b90a541810da 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7242,8 +7242,10 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
} else {
if (dl_prio(oldprio))
p->dl.pi_se = &p->dl;
- if (rt_prio(oldprio))
+ else if (rt_prio(oldprio))
p->rt.timeout = 0;
+ else if (!task_has_idle_policy(p))
+ reweight_task(p, prio - MAX_RT_PRIO);
}
__setscheduler_prio(p, prio);
--
2.34.1
On Wed, 3 Apr 2024 at 02:59, Qais Yousef <[email protected]> wrote:
>
> For fair tasks inheriting the priority (nice) without reweighting is
> a NOP as the task's share won't change.
AFAICT, there is no nice priority inheritance with rt_mutex; All nice
tasks are sorted with the same "default prio" in the rb waiter tree.
This means that the rt top waiter is not the cfs with highest prio but
the 1st cfs waiting for the mutex.
>
> This is visible when running with PTHREAD_PRIO_INHERIT where fair tasks
> with low priority values are susceptible to starvation leading to PI
> like impact on lock contention.
>
> The logic in rt_mutex will reset these low priority fair tasks into nice
> 0, but without the additional reweight operation to actually update the
> weights, it doesn't have the desired impact of boosting them to allow
> them to run sooner/longer to release the lock.
>
> Apply the reweight for fair_policy() tasks to achieve the desired boost
> for those low nice values tasks. Note that boost here means resetting
> their nice to 0; as this is what the current logic does for fair tasks.
But you can at the opposite decrease the cfs prio of a task
and even worse with the comment :
/* XXX used to be waiter->prio, not waiter->task->prio */
we use the prio of the top cfs waiter (ie the one waiting for the
lock) not the default 0 so it can be anything in the range [-20:19]
Then, a task with low prio (i.e. nice > 0) can get a prio boost even
if this task and the waiter are low priority tasks
>
> Handling of idle_policy() requires more code refactoring and is not
> handled yet. idle_policy() are treated specially and only run when the
> CPU is idle and get a hardcoded low weight value. Changing weights won't
> be enough without a promotion first to SCHED_OTHER.
>
> Tested with a test program that creates three threads.
>
> 1. main thread that spanws high prio and low prio task and busy
> loops
>
> 2. low priority thread that holds a pthread_mutex() with
> PTHREAD_PRIO_INHERIT protocol. Runs at nice +10. Busy loops
> after holding the lock.
>
> 3. high priority thread that holds a pthread_mutex() with
> PTHREADPTHREAD_PRIO_INHERIT, but made to start after the low
> priority thread. Runs at nice 0. Should remain blocked by the
> low priority thread.
>
> All tasks are pinned to CPU0.
>
> Without the patch I can see the low priority thread running only for
> ~10% of the time which is what expected without it being boosted.
>
> With the patch the low priority thread runs for ~50% which is what
> expected if it gets boosted to nice 0.
>
> I modified the test program logic afterwards to ensure that after
> releasing the lock the low priority thread goes back to running for 10%
> of the time, and it does.
>
> Reported-by: Yabin Cui <[email protected]>
> Signed-off-by: Qais Yousef <[email protected]>
> ---
> kernel/sched/core.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0621e4ee31de..b90a541810da 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7242,8 +7242,10 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
> } else {
> if (dl_prio(oldprio))
> p->dl.pi_se = &p->dl;
> - if (rt_prio(oldprio))
> + else if (rt_prio(oldprio))
> p->rt.timeout = 0;
> + else if (!task_has_idle_policy(p))
> + reweight_task(p, prio - MAX_RT_PRIO);
> }
>
> __setscheduler_prio(p, prio);
> --
> 2.34.1
>
On Wed, 3 Apr 2024 15:11:06 +0200
Vincent Guittot <[email protected]> wrote:
> On Wed, 3 Apr 2024 at 02:59, Qais Yousef <[email protected]> wrote:
> >
> > For fair tasks inheriting the priority (nice) without reweighting is
> > a NOP as the task's share won't change.
>
> AFAICT, there is no nice priority inheritance with rt_mutex; All nice
> tasks are sorted with the same "default prio" in the rb waiter tree.
> This means that the rt top waiter is not the cfs with highest prio but
> the 1st cfs waiting for the mutex.
I think the issue here is that the running process doesn't update its
weight and if there are other tasks that are not contending on this mutex,
they can still starve the lock owner.
IIUC (it's been ages since I looked at the code), high nice values (low
priority) turn to at lease nice 0 when they are "boosted". It doesn't
improve their chances of getting the lock though.
>
> >
> > This is visible when running with PTHREAD_PRIO_INHERIT where fair tasks
> > with low priority values are susceptible to starvation leading to PI
> > like impact on lock contention.
> >
> > The logic in rt_mutex will reset these low priority fair tasks into nice
> > 0, but without the additional reweight operation to actually update the
> > weights, it doesn't have the desired impact of boosting them to allow
> > them to run sooner/longer to release the lock.
> >
> > Apply the reweight for fair_policy() tasks to achieve the desired boost
> > for those low nice values tasks. Note that boost here means resetting
> > their nice to 0; as this is what the current logic does for fair tasks.
>
> But you can at the opposite decrease the cfs prio of a task
> and even worse with the comment :
> /* XXX used to be waiter->prio, not waiter->task->prio */
>
> we use the prio of the top cfs waiter (ie the one waiting for the
> lock) not the default 0 so it can be anything in the range [-20:19]
>
> Then, a task with low prio (i.e. nice > 0) can get a prio boost even
> if this task and the waiter are low priority tasks
Yeah, I'm all confused to exactly how the inheritance works with
SCHED_OTHER. I know John Stultz worked on this for a bit recently. He's
Cc'ed. But may not be paying attention ;-)
-- Steve
On Wed, 3 Apr 2024 at 15:40, Steven Rostedt <[email protected]> wrote:
>
> On Wed, 3 Apr 2024 15:11:06 +0200
> Vincent Guittot <[email protected]> wrote:
>
> > On Wed, 3 Apr 2024 at 02:59, Qais Yousef <[email protected]> wrote:
> > >
> > > For fair tasks inheriting the priority (nice) without reweighting is
> > > a NOP as the task's share won't change.
> >
> > AFAICT, there is no nice priority inheritance with rt_mutex; All nice
> > tasks are sorted with the same "default prio" in the rb waiter tree.
> > This means that the rt top waiter is not the cfs with highest prio but
> > the 1st cfs waiting for the mutex.
>
> I think the issue here is that the running process doesn't update its
> weight and if there are other tasks that are not contending on this mutex,
> they can still starve the lock owner.
But I think it's on purpose because we don't boost cfs tasks and we
never boost them. That could be a good thing to do but I think that
the current code has not been done for that and this might raise other
problem. I don't think it's an oversight
>
> IIUC (it's been ages since I looked at the code), high nice values (low
> priority) turn to at lease nice 0 when they are "boosted". It doesn't
> improve their chances of getting the lock though.
>
> >
> > >
> > > This is visible when running with PTHREAD_PRIO_INHERIT where fair tasks
> > > with low priority values are susceptible to starvation leading to PI
> > > like impact on lock contention.
> > >
> > > The logic in rt_mutex will reset these low priority fair tasks into nice
> > > 0, but without the additional reweight operation to actually update the
> > > weights, it doesn't have the desired impact of boosting them to allow
> > > them to run sooner/longer to release the lock.
> > >
> > > Apply the reweight for fair_policy() tasks to achieve the desired boost
> > > for those low nice values tasks. Note that boost here means resetting
> > > their nice to 0; as this is what the current logic does for fair tasks.
> >
> > But you can at the opposite decrease the cfs prio of a task
> > and even worse with the comment :
> > /* XXX used to be waiter->prio, not waiter->task->prio */
> >
> > we use the prio of the top cfs waiter (ie the one waiting for the
> > lock) not the default 0 so it can be anything in the range [-20:19]
> >
> > Then, a task with low prio (i.e. nice > 0) can get a prio boost even
> > if this task and the waiter are low priority tasks
>
>
> Yeah, I'm all confused to exactly how the inheritance works with
> SCHED_OTHER. I know John Stultz worked on this for a bit recently. He's
> Cc'ed. But may not be paying attention ;-)
>
> -- Steve
On 04/03/24 15:11, Vincent Guittot wrote:
> On Wed, 3 Apr 2024 at 02:59, Qais Yousef <[email protected]> wrote:
> >
> > For fair tasks inheriting the priority (nice) without reweighting is
> > a NOP as the task's share won't change.
>
> AFAICT, there is no nice priority inheritance with rt_mutex; All nice
Hmm from what I see there is
> tasks are sorted with the same "default prio" in the rb waiter tree.
> This means that the rt top waiter is not the cfs with highest prio but
> the 1st cfs waiting for the mutex.
This is about the order on which tasks contending for the lock more than the
effective priority the task holding the lock should run at though, no?
>
> >
> > This is visible when running with PTHREAD_PRIO_INHERIT where fair tasks
> > with low priority values are susceptible to starvation leading to PI
> > like impact on lock contention.
> >
> > The logic in rt_mutex will reset these low priority fair tasks into nice
> > 0, but without the additional reweight operation to actually update the
> > weights, it doesn't have the desired impact of boosting them to allow
> > them to run sooner/longer to release the lock.
> >
> > Apply the reweight for fair_policy() tasks to achieve the desired boost
> > for those low nice values tasks. Note that boost here means resetting
> > their nice to 0; as this is what the current logic does for fair tasks.
>
> But you can at the opposite decrease the cfs prio of a task
> and even worse with the comment :
> /* XXX used to be waiter->prio, not waiter->task->prio */
>
> we use the prio of the top cfs waiter (ie the one waiting for the
> lock) not the default 0 so it can be anything in the range [-20:19]
>
> Then, a task with low prio (i.e. nice > 0) can get a prio boost even
> if this task and the waiter are low priority tasks
I don't see this effect. The only change I am doing here
is that when we set the prio that we are supposed to be inheriting, instead of
simply changing prio, I also ensure we reweight so that we run at the inherited
nice value. I am not changing how the waiter logic works.
Here's my test app FWIW
https://github.com/qais-yousef/pi_test
When I run
pi_test --lp-nice 0 --lp-nice 10
the lp thread runs at 0 still
If I do
pi_test --lp-nice 10 --lp-nice 5
low priority thread runs at 5
What combination are you worried about? I can give it a try. I use
sched-analyzer-pp [1] to see the division of runnable/running or you can
monitor them on top
#!/bin/bash
set -eux
sudo sched-analyzer &
./pi_test --lp-nice ${1:-10} --hp-nice ${2:-0} --affine-cpu ${3:-0} &
sleep 10
pkill -SIGKILL pi_test
sudo pkill -SIGINT sched-analyzer
sched-analyzer-pp --sched-states pi_test sched-analyzer.perfetto-trace
Picutres of output is attached for before and after
pi_test --lp-nice 10 --hp-nice 0
[1] https://github.com/qais-yousef/sched-analyzer
On 04/03/24 09:42, Steven Rostedt wrote:
> On Wed, 3 Apr 2024 15:11:06 +0200
> Vincent Guittot <[email protected]> wrote:
>
> > On Wed, 3 Apr 2024 at 02:59, Qais Yousef <[email protected]> wrote:
> > >
> > > For fair tasks inheriting the priority (nice) without reweighting is
> > > a NOP as the task's share won't change.
> >
> > AFAICT, there is no nice priority inheritance with rt_mutex; All nice
> > tasks are sorted with the same "default prio" in the rb waiter tree.
> > This means that the rt top waiter is not the cfs with highest prio but
> > the 1st cfs waiting for the mutex.
>
> I think the issue here is that the running process doesn't update its
> weight and if there are other tasks that are not contending on this mutex,
> they can still starve the lock owner.
>
> IIUC (it's been ages since I looked at the code), high nice values (low
> priority) turn to at lease nice 0 when they are "boosted". It doesn't
> improve their chances of getting the lock though.
The main issue is that if this low nice value is holding the lock and the cpus
are busy, it can take a long time to release the lock.
In today's systems the amount of waiting time that use cases 'permitted' (for
lack of better word) keeps shrinking.
And the way userspace is coded these days with all these layers, app writers
might not be aware there's this dependency where a low priority task might
contend for a lock required by something important. Being able to rely on
PTHREAD_PRIO_INHERIT to ensure they get boosted is useful to protect against
a thread that has a low priority inadvertently holds this lock from causing
massive delays to waiters as it might not get enough RUNNING time to release
it.
To my knowledge windows has some sort of boost mechanism (that maybe Joel knows
more about than me) for the lock holder to help release the lock faster that
app developers rely on to fix similar issues.
>
> >
> > >
> > > This is visible when running with PTHREAD_PRIO_INHERIT where fair tasks
> > > with low priority values are susceptible to starvation leading to PI
> > > like impact on lock contention.
> > >
> > > The logic in rt_mutex will reset these low priority fair tasks into nice
> > > 0, but without the additional reweight operation to actually update the
> > > weights, it doesn't have the desired impact of boosting them to allow
> > > them to run sooner/longer to release the lock.
> > >
> > > Apply the reweight for fair_policy() tasks to achieve the desired boost
> > > for those low nice values tasks. Note that boost here means resetting
> > > their nice to 0; as this is what the current logic does for fair tasks.
> >
> > But you can at the opposite decrease the cfs prio of a task
> > and even worse with the comment :
> > /* XXX used to be waiter->prio, not waiter->task->prio */
> >
> > we use the prio of the top cfs waiter (ie the one waiting for the
> > lock) not the default 0 so it can be anything in the range [-20:19]
> >
> > Then, a task with low prio (i.e. nice > 0) can get a prio boost even
> > if this task and the waiter are low priority tasks
>
>
> Yeah, I'm all confused to exactly how the inheritance works with
> SCHED_OTHER. I know John Stultz worked on this for a bit recently. He's
> Cc'ed. But may not be paying attention ;-)
>
> -- Steve
On Fri, 5 Apr 2024 at 00:05, Qais Yousef <[email protected]> wrote:
>
> On 04/03/24 15:11, Vincent Guittot wrote:
> > On Wed, 3 Apr 2024 at 02:59, Qais Yousef <[email protected]> wrote:
> > >
> > > For fair tasks inheriting the priority (nice) without reweighting is
> > > a NOP as the task's share won't change.
> >
> > AFAICT, there is no nice priority inheritance with rt_mutex; All nice
>
> Hmm from what I see there is
>
> > tasks are sorted with the same "default prio" in the rb waiter tree.
> > This means that the rt top waiter is not the cfs with highest prio but
> > the 1st cfs waiting for the mutex.
>
> This is about the order on which tasks contending for the lock more than the
> effective priority the task holding the lock should run at though, no?
No, they are ordered by priority in the rb tree so you can get the
priority of the top waiter and apply it to the owner of the lock
>
> >
> > >
> > > This is visible when running with PTHREAD_PRIO_INHERIT where fair tasks
> > > with low priority values are susceptible to starvation leading to PI
> > > like impact on lock contention.
> > >
> > > The logic in rt_mutex will reset these low priority fair tasks into nice
> > > 0, but without the additional reweight operation to actually update the
> > > weights, it doesn't have the desired impact of boosting them to allow
> > > them to run sooner/longer to release the lock.
> > >
> > > Apply the reweight for fair_policy() tasks to achieve the desired boost
> > > for those low nice values tasks. Note that boost here means resetting
> > > their nice to 0; as this is what the current logic does for fair tasks.
> >
> > But you can at the opposite decrease the cfs prio of a task
> > and even worse with the comment :
> > /* XXX used to be waiter->prio, not waiter->task->prio */
> >
> > we use the prio of the top cfs waiter (ie the one waiting for the
> > lock) not the default 0 so it can be anything in the range [-20:19]
> >
> > Then, a task with low prio (i.e. nice > 0) can get a prio boost even
> > if this task and the waiter are low priority tasks
>
> I don't see this effect. The only change I am doing here
> is that when we set the prio that we are supposed to be inheriting, instead of
> simply changing prio, I also ensure we reweight so that we run at the inherited
> nice value. I am not changing how the waiter logic works.
But if you look more deeply in the code, you will see that all cfs are
sorted with the same default prio so cfs tasks are not sorted and are
considered to be the same.
All that to say that I think the weight is not applied on purpose.
This might work for your particular case but there are more changes to
be done if you want to apply prio inheritance between cfs tasks.
As an example, what about the impact of cgroup on the actual weight
and the inherited priority of a task ? If the owner and the waiter
don't belong to the same cgroup their own prio is meaningless... task
nice -20 in a group with a weight equal to nice 19 vs a task nice 19
in a group with a weight equals to nice -20
>
> Here's my test app FWIW
>
> https://github.com/qais-yousef/pi_test
>
> When I run
>
> pi_test --lp-nice 0 --lp-nice 10
>
> the lp thread runs at 0 still
>
> If I do
>
> pi_test --lp-nice 10 --lp-nice 5
>
> low priority thread runs at 5
>
> What combination are you worried about? I can give it a try. I use
> sched-analyzer-pp [1] to see the division of runnable/running or you can
> monitor them on top
>
> #!/bin/bash
> set -eux
>
> sudo sched-analyzer &
>
> ./pi_test --lp-nice ${1:-10} --hp-nice ${2:-0} --affine-cpu ${3:-0} &
>
> sleep 10
>
> pkill -SIGKILL pi_test
>
> sudo pkill -SIGINT sched-analyzer
>
> sched-analyzer-pp --sched-states pi_test sched-analyzer.perfetto-trace
>
> Picutres of output is attached for before and after
>
> pi_test --lp-nice 10 --hp-nice 0
>
> [1] https://github.com/qais-yousef/sched-analyzer
On 04/05/24 14:15, Vincent Guittot wrote:
> On Fri, 5 Apr 2024 at 00:05, Qais Yousef <[email protected]> wrote:
> >
> > On 04/03/24 15:11, Vincent Guittot wrote:
> > > On Wed, 3 Apr 2024 at 02:59, Qais Yousef <[email protected]> wrote:
> > > >
> > > > For fair tasks inheriting the priority (nice) without reweighting is
> > > > a NOP as the task's share won't change.
> > >
> > > AFAICT, there is no nice priority inheritance with rt_mutex; All nice
> >
> > Hmm from what I see there is
> >
> > > tasks are sorted with the same "default prio" in the rb waiter tree.
> > > This means that the rt top waiter is not the cfs with highest prio but
> > > the 1st cfs waiting for the mutex.
> >
> > This is about the order on which tasks contending for the lock more than the
> > effective priority the task holding the lock should run at though, no?
>
> No, they are ordered by priority in the rb tree so you can get the
> priority of the top waiter and apply it to the owner of the lock
I think I see what you're getting at now. There's no guarantee the top waiter
is the higher priority fair task. Yes.
>
> >
> > >
> > > >
> > > > This is visible when running with PTHREAD_PRIO_INHERIT where fair tasks
> > > > with low priority values are susceptible to starvation leading to PI
> > > > like impact on lock contention.
> > > >
> > > > The logic in rt_mutex will reset these low priority fair tasks into nice
> > > > 0, but without the additional reweight operation to actually update the
> > > > weights, it doesn't have the desired impact of boosting them to allow
> > > > them to run sooner/longer to release the lock.
> > > >
> > > > Apply the reweight for fair_policy() tasks to achieve the desired boost
> > > > for those low nice values tasks. Note that boost here means resetting
> > > > their nice to 0; as this is what the current logic does for fair tasks.
> > >
> > > But you can at the opposite decrease the cfs prio of a task
> > > and even worse with the comment :
> > > /* XXX used to be waiter->prio, not waiter->task->prio */
> > >
> > > we use the prio of the top cfs waiter (ie the one waiting for the
> > > lock) not the default 0 so it can be anything in the range [-20:19]
> > >
> > > Then, a task with low prio (i.e. nice > 0) can get a prio boost even
> > > if this task and the waiter are low priority tasks
> >
> > I don't see this effect. The only change I am doing here
> > is that when we set the prio that we are supposed to be inheriting, instead of
> > simply changing prio, I also ensure we reweight so that we run at the inherited
> > nice value. I am not changing how the waiter logic works.
>
> But if you look more deeply in the code, you will see that all cfs are
> sorted with the same default prio so cfs tasks are not sorted and are
> considered to be the same.
Yes I saw that. We can potentially revert 715f7f9ece46 ("locking/rtmutex:
Squash !RT tasks to DEFAULT_PRIO") ;-)
/hides
>
> All that to say that I think the weight is not applied on purpose.
> This might work for your particular case but there are more changes to
> be done if you want to apply prio inheritance between cfs tasks.
>
> As an example, what about the impact of cgroup on the actual weight
> and the inherited priority of a task ? If the owner and the waiter
> don't belong to the same cgroup their own prio is meaningless... task
> nice -20 in a group with a weight equal to nice 19 vs a task nice 19
> in a group with a weight equals to nice -20
That is on my mind actually. But I thought it's a separate problem. That has to
do with how we calculate the effective priority of the pi_task. And probably
the sorting order to if we agree we need to revert the above. If that is done
appropriately, I hope the current reweight approach could be used as-is. Hmm
but but as I write this I realize the compound weight will still be different.
Maybe we should inherit the weight rather than the prio, which IIUC should
already be the effective weight taking cgroup into account?
Just to put it out on the clear, you don't think the issue is wrong, but just
that we need to look further for a proper fix, right? ie: it is a problem we
should fix, but we need to nail down more details IIUC.
If that's the case it'd be good to know what else beside sorting order and
handling cgroup I need to take into account to make this more correct.
There's the obvious SCHED_IDLE case of course that needs a policy promotion,
beside weight adjustment.
>
>
> >
> > Here's my test app FWIW
> >
> > https://github.com/qais-yousef/pi_test
> >
> > When I run
> >
> > pi_test --lp-nice 0 --lp-nice 10
> >
> > the lp thread runs at 0 still
> >
> > If I do
> >
> > pi_test --lp-nice 10 --lp-nice 5
> >
> > low priority thread runs at 5
> >
> > What combination are you worried about? I can give it a try. I use
> > sched-analyzer-pp [1] to see the division of runnable/running or you can
> > monitor them on top
> >
> > #!/bin/bash
> > set -eux
> >
> > sudo sched-analyzer &
> >
> > ./pi_test --lp-nice ${1:-10} --hp-nice ${2:-0} --affine-cpu ${3:-0} &
> >
> > sleep 10
> >
> > pkill -SIGKILL pi_test
> >
> > sudo pkill -SIGINT sched-analyzer
> >
> > sched-analyzer-pp --sched-states pi_test sched-analyzer.perfetto-trace
> >
> > Picutres of output is attached for before and after
> >
> > pi_test --lp-nice 10 --hp-nice 0
> >
> > [1] https://github.com/qais-yousef/sched-analyzer
On 04/05/24 18:16, Qais Yousef wrote:
> >
> > All that to say that I think the weight is not applied on purpose.
> > This might work for your particular case but there are more changes to
> > be done if you want to apply prio inheritance between cfs tasks.
> >
> > As an example, what about the impact of cgroup on the actual weight
> > and the inherited priority of a task ? If the owner and the waiter
> > don't belong to the same cgroup their own prio is meaningless... task
> > nice -20 in a group with a weight equal to nice 19 vs a task nice 19
> > in a group with a weight equals to nice -20
>
> That is on my mind actually. But I thought it's a separate problem. That has to
> do with how we calculate the effective priority of the pi_task. And probably
> the sorting order to if we agree we need to revert the above. If that is done
Thinking more about it the revert is not the right thing to do. We want fair
tasks to stay ordered in FIFO for better fairness and avoid potential
starvation issues. It's just the logic for searching the top_waiter need to be
different. If the top_waiter is fair, then we need to traverse the tree to find
the highest nice value. We probably can keep track of this while adding items
to the tree to avoid the search.
For cgroup; is it reasonable (loosely speaking) to keep track of pi_cfs_rq and
detach_attach_task_cfs_rq() before the reweight? This seems the most
straightforward solution and will contain the complexity to keeping track of
cfs_rq. But it'll have similar issue to proxy execution where a task that
doesn't belong to the cgroup will consume its share..
Can we treat the two as separate problems? Or you think any solution must
address the two? Both must be fixed of course.
Thanks!
--
Qais Yousef
On Fri, 5 Apr 2024 at 19:16, Qais Yousef <[email protected]> wrote:
>
> On 04/05/24 14:15, Vincent Guittot wrote:
> > On Fri, 5 Apr 2024 at 00:05, Qais Yousef <[email protected]> wrote:
> > >
> > > On 04/03/24 15:11, Vincent Guittot wrote:
> > > > On Wed, 3 Apr 2024 at 02:59, Qais Yousef <[email protected]> wrote:
> > > > >
> > > > > For fair tasks inheriting the priority (nice) without reweighting is
> > > > > a NOP as the task's share won't change.
> > > >
> > > > AFAICT, there is no nice priority inheritance with rt_mutex; All nice
> > >
> > > Hmm from what I see there is
> > >
> > > > tasks are sorted with the same "default prio" in the rb waiter tree.
> > > > This means that the rt top waiter is not the cfs with highest prio but
> > > > the 1st cfs waiting for the mutex.
> > >
> > > This is about the order on which tasks contending for the lock more than the
> > > effective priority the task holding the lock should run at though, no?
> >
> > No, they are ordered by priority in the rb tree so you can get the
> > priority of the top waiter and apply it to the owner of the lock
>
> I think I see what you're getting at now. There's no guarantee the top waiter
> is the higher priority fair task. Yes.
>
> >
> > >
> > > >
> > > > >
> > > > > This is visible when running with PTHREAD_PRIO_INHERIT where fair tasks
> > > > > with low priority values are susceptible to starvation leading to PI
> > > > > like impact on lock contention.
> > > > >
> > > > > The logic in rt_mutex will reset these low priority fair tasks into nice
> > > > > 0, but without the additional reweight operation to actually update the
> > > > > weights, it doesn't have the desired impact of boosting them to allow
> > > > > them to run sooner/longer to release the lock.
> > > > >
> > > > > Apply the reweight for fair_policy() tasks to achieve the desired boost
> > > > > for those low nice values tasks. Note that boost here means resetting
> > > > > their nice to 0; as this is what the current logic does for fair tasks.
> > > >
> > > > But you can at the opposite decrease the cfs prio of a task
> > > > and even worse with the comment :
> > > > /* XXX used to be waiter->prio, not waiter->task->prio */
> > > >
> > > > we use the prio of the top cfs waiter (ie the one waiting for the
> > > > lock) not the default 0 so it can be anything in the range [-20:19]
> > > >
> > > > Then, a task with low prio (i.e. nice > 0) can get a prio boost even
> > > > if this task and the waiter are low priority tasks
> > >
> > > I don't see this effect. The only change I am doing here
> > > is that when we set the prio that we are supposed to be inheriting, instead of
> > > simply changing prio, I also ensure we reweight so that we run at the inherited
> > > nice value. I am not changing how the waiter logic works.
> >
> > But if you look more deeply in the code, you will see that all cfs are
> > sorted with the same default prio so cfs tasks are not sorted and are
> > considered to be the same.
>
> Yes I saw that. We can potentially revert 715f7f9ece46 ("locking/rtmutex:
> Squash !RT tasks to DEFAULT_PRIO") ;-)
>
> /hides
>
> >
> > All that to say that I think the weight is not applied on purpose.
> > This might work for your particular case but there are more changes to
> > be done if you want to apply prio inheritance between cfs tasks.
> >
> > As an example, what about the impact of cgroup on the actual weight
> > and the inherited priority of a task ? If the owner and the waiter
> > don't belong to the same cgroup their own prio is meaningless... task
> > nice -20 in a group with a weight equal to nice 19 vs a task nice 19
> > in a group with a weight equals to nice -20
>
> That is on my mind actually. But I thought it's a separate problem. That has to
> do with how we calculate the effective priority of the pi_task. And probably
> the sorting order to if we agree we need to revert the above. If that is done
> appropriately, I hope the current reweight approach could be used as-is. Hmm
> but but as I write this I realize the compound weight will still be different.
> Maybe we should inherit the weight rather than the prio, which IIUC should
> already be the effective weight taking cgroup into account?
>
> Just to put it out on the clear, you don't think the issue is wrong, but just
> that we need to look further for a proper fix, right? ie: it is a problem we
> should fix, but we need to nail down more details IIUC.
Yes, I agree about your problem but your current proposal is not
correct because there are more things to consider and fix
>
> If that's the case it'd be good to know what else beside sorting order and
> handling cgroup I need to take into account to make this more correct.
>
> There's the obvious SCHED_IDLE case of course that needs a policy promotion,
> beside weight adjustment.
>
> >
> >
> > >
> > > Here's my test app FWIW
> > >
> > > https://github.com/qais-yousef/pi_test
> > >
> > > When I run
> > >
> > > pi_test --lp-nice 0 --lp-nice 10
> > >
> > > the lp thread runs at 0 still
> > >
> > > If I do
> > >
> > > pi_test --lp-nice 10 --lp-nice 5
> > >
> > > low priority thread runs at 5
> > >
> > > What combination are you worried about? I can give it a try. I use
> > > sched-analyzer-pp [1] to see the division of runnable/running or you can
> > > monitor them on top
> > >
> > > #!/bin/bash
> > > set -eux
> > >
> > > sudo sched-analyzer &
> > >
> > > ./pi_test --lp-nice ${1:-10} --hp-nice ${2:-0} --affine-cpu ${3:-0} &
> > >
> > > sleep 10
> > >
> > > pkill -SIGKILL pi_test
> > >
> > > sudo pkill -SIGINT sched-analyzer
> > >
> > > sched-analyzer-pp --sched-states pi_test sched-analyzer.perfetto-trace
> > >
> > > Picutres of output is attached for before and after
> > >
> > > pi_test --lp-nice 10 --hp-nice 0
> > >
> > > [1] https://github.com/qais-yousef/sched-analyzer
On Sun, 7 Apr 2024 at 14:27, Qais Yousef <[email protected]> wrote:
>
> On 04/05/24 18:16, Qais Yousef wrote:
>
> > >
> > > All that to say that I think the weight is not applied on purpose.
> > > This might work for your particular case but there are more changes to
> > > be done if you want to apply prio inheritance between cfs tasks.
> > >
> > > As an example, what about the impact of cgroup on the actual weight
> > > and the inherited priority of a task ? If the owner and the waiter
> > > don't belong to the same cgroup their own prio is meaningless... task
> > > nice -20 in a group with a weight equal to nice 19 vs a task nice 19
> > > in a group with a weight equals to nice -20
> >
> > That is on my mind actually. But I thought it's a separate problem. That has to
> > do with how we calculate the effective priority of the pi_task. And probably
> > the sorting order to if we agree we need to revert the above. If that is done
>
> Thinking more about it the revert is not the right thing to do. We want fair
> tasks to stay ordered in FIFO for better fairness and avoid potential
> starvation issues. It's just the logic for searching the top_waiter need to be
> different. If the top_waiter is fair, then we need to traverse the tree to find
> the highest nice value. We probably can keep track of this while adding items
> to the tree to avoid the search.
>
> For cgroup; is it reasonable (loosely speaking) to keep track of pi_cfs_rq and
> detach_attach_task_cfs_rq() before the reweight? This seems the most
> straightforward solution and will contain the complexity to keeping track of
> cfs_rq. But it'll have similar issue to proxy execution where a task that
> doesn't belong to the cgroup will consume its share..
That's a good point, Would proxy execution be the simplest way to fix all this ?
>
> Can we treat the two as separate problems? Or you think any solution must
> address the two? Both must be fixed of course.
>
>
> Thanks!
>
> --
> Qais Yousef
On Mon, Apr 8, 2024 at 12:17 AM Vincent Guittot
<[email protected]> wrote:
>
> On Sun, 7 Apr 2024 at 14:27, Qais Yousef <[email protected]> wrote:
> >
> > On 04/05/24 18:16, Qais Yousef wrote:
> >
> > > >
> > > > All that to say that I think the weight is not applied on purpose.
> > > > This might work for your particular case but there are more changes to
> > > > be done if you want to apply prio inheritance between cfs tasks.
> > > >
> > > > As an example, what about the impact of cgroup on the actual weight
> > > > and the inherited priority of a task ? If the owner and the waiter
> > > > don't belong to the same cgroup their own prio is meaningless... task
> > > > nice -20 in a group with a weight equal to nice 19 vs a task nice 19
> > > > in a group with a weight equals to nice -20
> > >
> > > That is on my mind actually. But I thought it's a separate problem. That has to
> > > do with how we calculate the effective priority of the pi_task. And probably
> > > the sorting order to if we agree we need to revert the above. If that is done
> >
> > Thinking more about it the revert is not the right thing to do. We want fair
> > tasks to stay ordered in FIFO for better fairness and avoid potential
> > starvation issues. It's just the logic for searching the top_waiter need to be
> > different. If the top_waiter is fair, then we need to traverse the tree to find
> > the highest nice value. We probably can keep track of this while adding items
> > to the tree to avoid the search.
> >
> > For cgroup; is it reasonable (loosely speaking) to keep track of pi_cfs_rq and
> > detach_attach_task_cfs_rq() before the reweight? This seems the most
> > straightforward solution and will contain the complexity to keeping track of
> > cfs_rq. But it'll have similar issue to proxy execution where a task that
> > doesn't belong to the cgroup will consume its share..
>
> That's a good point, Would proxy execution be the simplest way to fix all this ?
So, at the moment, in part. It ought to resolve the issue for
in-kernel mutexes (blocked tasks stay on rq, if blocked tasks are
selected to run we will instead run the runnable lock owner - thus it
works across scheduling classes), but it isn't tied into userland
futexes the way rt_mutexes are at this point.
Review and feedback on the series would be greatly appreciated!
(Nudge! Nudge! :)
-john
On 04/08/24 12:51, John Stultz wrote:
> On Mon, Apr 8, 2024 at 12:17 AM Vincent Guittot
> <[email protected]> wrote:
> >
> > On Sun, 7 Apr 2024 at 14:27, Qais Yousef <[email protected]> wrote:
> > >
> > > On 04/05/24 18:16, Qais Yousef wrote:
> > >
> > > > >
> > > > > All that to say that I think the weight is not applied on purpose.
> > > > > This might work for your particular case but there are more changes to
> > > > > be done if you want to apply prio inheritance between cfs tasks.
> > > > >
> > > > > As an example, what about the impact of cgroup on the actual weight
> > > > > and the inherited priority of a task ? If the owner and the waiter
> > > > > don't belong to the same cgroup their own prio is meaningless... task
> > > > > nice -20 in a group with a weight equal to nice 19 vs a task nice 19
> > > > > in a group with a weight equals to nice -20
> > > >
> > > > That is on my mind actually. But I thought it's a separate problem. That has to
> > > > do with how we calculate the effective priority of the pi_task. And probably
> > > > the sorting order to if we agree we need to revert the above. If that is done
> > >
> > > Thinking more about it the revert is not the right thing to do. We want fair
> > > tasks to stay ordered in FIFO for better fairness and avoid potential
> > > starvation issues. It's just the logic for searching the top_waiter need to be
> > > different. If the top_waiter is fair, then we need to traverse the tree to find
> > > the highest nice value. We probably can keep track of this while adding items
> > > to the tree to avoid the search.
> > >
> > > For cgroup; is it reasonable (loosely speaking) to keep track of pi_cfs_rq and
> > > detach_attach_task_cfs_rq() before the reweight? This seems the most
> > > straightforward solution and will contain the complexity to keeping track of
> > > cfs_rq. But it'll have similar issue to proxy execution where a task that
> > > doesn't belong to the cgroup will consume its share..
> >
> > That's a good point, Would proxy execution be the simplest way to fix all this ?
Is it? Over 4.5 years ago Unity reported to me about performance inversion
problem and that's when proxy execution work was revived as simplest way to fix
all of this. But still no end in sight from what I see. I was and still think
an interim solution in rt_mutex could help a lot of use cases already without
being too complex. Not as elegant and comprehensive like proxy execution, but
given the impact on both userspace and out of tree kernel hacks are growing
waiting for this to be ready, the cost of waiting is high IMHO.
FWIW, I already heard several feedbacks that PTHREAD_PRIO_INHERIT does nothing.
I think this reweight issue is more serious problem and likely why I heard this
feedback. I could be underestimating the complexity of the fix though. So I'll
trust your judgement and ditch further effort if you think it's more effort
than helping proxy execution patchset - for the list at least. I'm likely still
to pursue something out of tree to get into as many android LTS kernels. And
will be happy to share this work if there's desire to try to pick something up
for mainline to make the problem less severe at least.
Note that binder has already performance, latency (out of tree) and priority
inheritance and without those performance is impacted in many corner cases and
considered indispensable part.
>
> So, at the moment, in part. It ought to resolve the issue for
> in-kernel mutexes (blocked tasks stay on rq, if blocked tasks are
> selected to run we will instead run the runnable lock owner - thus it
> works across scheduling classes), but it isn't tied into userland
> futexes the way rt_mutexes are at this point.
>
> Review and feedback on the series would be greatly appreciated!
> (Nudge! Nudge! :)
I am guilty of this, sorry.
On Tue, 9 Apr 2024 at 08:19, Qais Yousef <[email protected]> wrote:
>
> On 04/08/24 12:51, John Stultz wrote:
> > On Mon, Apr 8, 2024 at 12:17 AM Vincent Guittot
> > <[email protected]> wrote:
> > >
> > > On Sun, 7 Apr 2024 at 14:27, Qais Yousef <[email protected]> wrote:
> > > >
> > > > On 04/05/24 18:16, Qais Yousef wrote:
> > > >
> > > > > >
> > > > > > All that to say that I think the weight is not applied on purpose.
> > > > > > This might work for your particular case but there are more changes to
> > > > > > be done if you want to apply prio inheritance between cfs tasks.
> > > > > >
> > > > > > As an example, what about the impact of cgroup on the actual weight
> > > > > > and the inherited priority of a task ? If the owner and the waiter
> > > > > > don't belong to the same cgroup their own prio is meaningless.. task
> > > > > > nice -20 in a group with a weight equal to nice 19 vs a task nice 19
> > > > > > in a group with a weight equals to nice -20
> > > > >
> > > > > That is on my mind actually. But I thought it's a separate problem. That has to
> > > > > do with how we calculate the effective priority of the pi_task. And probably
> > > > > the sorting order to if we agree we need to revert the above. If that is done
> > > >
> > > > Thinking more about it the revert is not the right thing to do. We want fair
> > > > tasks to stay ordered in FIFO for better fairness and avoid potential
> > > > starvation issues. It's just the logic for searching the top_waiter need to be
> > > > different. If the top_waiter is fair, then we need to traverse the tree to find
> > > > the highest nice value. We probably can keep track of this while adding items
> > > > to the tree to avoid the search.
> > > >
> > > > For cgroup; is it reasonable (loosely speaking) to keep track of pi_cfs_rq and
> > > > detach_attach_task_cfs_rq() before the reweight? This seems the most
> > > > straightforward solution and will contain the complexity to keeping track of
> > > > cfs_rq. But it'll have similar issue to proxy execution where a task that
> > > > doesn't belong to the cgroup will consume its share..
> > >
> > > That's a good point, Would proxy execution be the simplest way to fix all this ?
>
> Is it? Over 4.5 years ago Unity reported to me about performance inversion
> problem and that's when proxy execution work was revived as simplest way to fix
> all of this. But still no end in sight from what I see. I was and still think
> an interim solution in rt_mutex could help a lot of use cases already without
> being too complex. Not as elegant and comprehensive like proxy execution, but
> given the impact on both userspace and out of tree kernel hacks are growing
> waiting for this to be ready, the cost of waiting is high IMHO.
>
> FWIW, I already heard several feedbacks that PTHREAD_PRIO_INHERIT does nothing.
> I think this reweight issue is more serious problem and likely why I heard this
> feedback. I could be underestimating the complexity of the fix though. So I'll
Without cgroup, the solution could be straightforward but android uses
extensively cgroup AFAICT and update_cfs_group() makes impossible to
track the top cfs waiter and its "prio"
> trust your judgement and ditch further effort if you think it's more effort
> than helping proxy execution patchset - for the list at least. I'm likely still
> to pursue something out of tree to get into as many android LTS kernels. And
> will be happy to share this work if there's desire to try to pick something up
> for mainline to make the problem less severe at least.
>
> Note that binder has already performance, latency (out of tree) and priority
> inheritance and without those performance is impacted in many corner cases and
> considered indispensable part.
>
> >
> > So, at the moment, in part. It ought to resolve the issue for
> > in-kernel mutexes (blocked tasks stay on rq, if blocked tasks are
> > selected to run we will instead run the runnable lock owner - thus it
> > works across scheduling classes), but it isn't tied into userland
> > futexes the way rt_mutexes are at this point.
> >
> > Review and feedback on the series would be greatly appreciated!
> > (Nudge! Nudge! :)
>
> I am guilty of this, sorry.
me too
On 04/09/24 14:35, Vincent Guittot wrote:
> On Tue, 9 Apr 2024 at 08:19, Qais Yousef <[email protected]> wrote:
> >
> > On 04/08/24 12:51, John Stultz wrote:
> > > On Mon, Apr 8, 2024 at 12:17 AM Vincent Guittot
> > > <[email protected]> wrote:
> > > >
> > > > On Sun, 7 Apr 2024 at 14:27, Qais Yousef <[email protected]> wrote:
> > > > >
> > > > > On 04/05/24 18:16, Qais Yousef wrote:
> > > > >
> > > > > > >
> > > > > > > All that to say that I think the weight is not applied on purpose.
> > > > > > > This might work for your particular case but there are more changes to
> > > > > > > be done if you want to apply prio inheritance between cfs tasks.
> > > > > > >
> > > > > > > As an example, what about the impact of cgroup on the actual weight
> > > > > > > and the inherited priority of a task ? If the owner and the waiter
> > > > > > > don't belong to the same cgroup their own prio is meaningless... task
> > > > > > > nice -20 in a group with a weight equal to nice 19 vs a task nice 19
> > > > > > > in a group with a weight equals to nice -20
> > > > > >
> > > > > > That is on my mind actually. But I thought it's a separate problem. That has to
> > > > > > do with how we calculate the effective priority of the pi_task. And probably
> > > > > > the sorting order to if we agree we need to revert the above. If that is done
> > > > >
> > > > > Thinking more about it the revert is not the right thing to do. We want fair
> > > > > tasks to stay ordered in FIFO for better fairness and avoid potential
> > > > > starvation issues. It's just the logic for searching the top_waiter need to be
> > > > > different. If the top_waiter is fair, then we need to traverse the tree to find
> > > > > the highest nice value. We probably can keep track of this while adding items
> > > > > to the tree to avoid the search.
> > > > >
> > > > > For cgroup; is it reasonable (loosely speaking) to keep track of pi_cfs_rq and
> > > > > detach_attach_task_cfs_rq() before the reweight? This seems the most
> > > > > straightforward solution and will contain the complexity to keeping track of
> > > > > cfs_rq. But it'll have similar issue to proxy execution where a task that
> > > > > doesn't belong to the cgroup will consume its share..
> > > >
> > > > That's a good point, Would proxy execution be the simplest way to fix all this ?
> >
> > Is it? Over 4.5 years ago Unity reported to me about performance inversion
> > problem and that's when proxy execution work was revived as simplest way to fix
> > all of this. But still no end in sight from what I see. I was and still think
> > an interim solution in rt_mutex could help a lot of use cases already without
> > being too complex. Not as elegant and comprehensive like proxy execution, but
> > given the impact on both userspace and out of tree kernel hacks are growing
> > waiting for this to be ready, the cost of waiting is high IMHO.
> >
> > FWIW, I already heard several feedbacks that PTHREAD_PRIO_INHERIT does nothing.
> > I think this reweight issue is more serious problem and likely why I heard this
> > feedback. I could be underestimating the complexity of the fix though. So I'll
>
> Without cgroup, the solution could be straightforward but android uses
> extensively cgroup AFAICT and update_cfs_group() makes impossible to
> track the top cfs waiter and its "prio"
:(
IIUC the issue is that we can't easily come up with a single number of
'effective prio' for N level hierarchy and compare it with another M level
hierarchy..
Does proxy execution fix this problem then? If we can't find the top waiter,
I can't see how proxy execution would work here too. To my understanding it's
more about how we apply inheritance (by donating execution context of the top
waiter) instead of manually applying inheritance like we're doing now.
But the logic of finding the top waiter should remain the same, no?
Need to extend my test case to cover this scenario.
On Wed, 10 Apr 2024 at 08:59, Qais Yousef <[email protected]> wrote:
>
> On 04/09/24 14:35, Vincent Guittot wrote:
> > On Tue, 9 Apr 2024 at 08:19, Qais Yousef <[email protected]> wrote:
> > >
> > > On 04/08/24 12:51, John Stultz wrote:
> > > > On Mon, Apr 8, 2024 at 12:17 AM Vincent Guittot
> > > > <[email protected]> wrote:
> > > > >
> > > > > On Sun, 7 Apr 2024 at 14:27, Qais Yousef <[email protected]> wrote:
> > > > > >
> > > > > > On 04/05/24 18:16, Qais Yousef wrote:
> > > > > >
> > > > > > > >
> > > > > > > > All that to say that I think the weight is not applied on purpose.
> > > > > > > > This might work for your particular case but there are more changes to
> > > > > > > > be done if you want to apply prio inheritance between cfs tasks.
> > > > > > > >
> > > > > > > > As an example, what about the impact of cgroup on the actual weight
> > > > > > > > and the inherited priority of a task ? If the owner and the waiter
> > > > > > > > don't belong to the same cgroup their own prio is meaningless... task
> > > > > > > > nice -20 in a group with a weight equal to nice 19 vs a task nice 19
> > > > > > > > in a group with a weight equals to nice -20
> > > > > > >
> > > > > > > That is on my mind actually. But I thought it's a separate problem. That has to
> > > > > > > do with how we calculate the effective priority of the pi_task. And probably
> > > > > > > the sorting order to if we agree we need to revert the above. If that is done
> > > > > >
> > > > > > Thinking more about it the revert is not the right thing to do. We want fair
> > > > > > tasks to stay ordered in FIFO for better fairness and avoid potential
> > > > > > starvation issues. It's just the logic for searching the top_waiter need to be
> > > > > > different. If the top_waiter is fair, then we need to traverse the tree to find
> > > > > > the highest nice value. We probably can keep track of this while adding items
> > > > > > to the tree to avoid the search.
> > > > > >
> > > > > > For cgroup; is it reasonable (loosely speaking) to keep track of pi_cfs_rq and
> > > > > > detach_attach_task_cfs_rq() before the reweight? This seems the most
> > > > > > straightforward solution and will contain the complexity to keeping track of
> > > > > > cfs_rq. But it'll have similar issue to proxy execution where a task that
> > > > > > doesn't belong to the cgroup will consume its share..
> > > > >
> > > > > That's a good point, Would proxy execution be the simplest way to fix all this ?
> > >
> > > Is it? Over 4.5 years ago Unity reported to me about performance inversion
> > > problem and that's when proxy execution work was revived as simplest way to fix
> > > all of this. But still no end in sight from what I see. I was and still think
> > > an interim solution in rt_mutex could help a lot of use cases already without
> > > being too complex. Not as elegant and comprehensive like proxy execution, but
> > > given the impact on both userspace and out of tree kernel hacks are growing
> > > waiting for this to be ready, the cost of waiting is high IMHO.
> > >
> > > FWIW, I already heard several feedbacks that PTHREAD_PRIO_INHERIT does nothing.
> > > I think this reweight issue is more serious problem and likely why I heard this
> > > feedback. I could be underestimating the complexity of the fix though So I'll
> >
> > Without cgroup, the solution could be straightforward but android uses
> > extensively cgroup AFAICT and update_cfs_group() makes impossible to
> > track the top cfs waiter and its "prio"
>
> :(
>
> IIUC the issue is that we can't easily come up with a single number of
> 'effective prio' for N level hierarchy and compare it with another M level
> hierarchy..
And then how do you apply it on the hierarchy ?
>
> Does proxy execution fix this problem then? If we can't find the top waiter,
I haven't dug enough in the patchset. John ?
> I can't see how proxy execution would work here too. To my understanding it's
> more about how we apply inheritance (by donating execution context of the top
> waiter) instead of manually applying inheritance like we're doing now.
>
> But the logic of finding the top waiter should remain the same, no?
>
> Need to extend my test case to cover this scenario.
On 04/10/24 11:13, Vincent Guittot wrote:
> > > Without cgroup, the solution could be straightforward but android uses
> > > extensively cgroup AFAICT and update_cfs_group() makes impossible to
> > > track the top cfs waiter and its "prio"
> >
> > :(
> >
> > IIUC the issue is that we can't easily come up with a single number of
> > 'effective prio' for N level hierarchy and compare it with another M level
> > hierarchy..
>
> And then how do you apply it on the hierarchy ?
(I am not disagreeing with you, just trying to state the reasons more
explicitly)
I think the application is easy, attach to the leaf cfs_rq? Which IIUC
correctly what should happen with proxy execution, but by consuming the context
of the donor directly without having explicitly to move the lock owner.
Finding out which hierarchy actually has the highest effective share is not
straightforward I agree. And if we combine a potential operation of something
that could move any waiting task to a different hierarchy at anytime, this gets
even more complex.
I need to go and find more info, but seems Windows has some kind of boost
mechanism to help the lock owner to release the lock faster. I wonder if
something like that could help as interim solution. What we could do is move
the task to root group as a boost with the simple reweight operation proposed
here applied. As soon as it releases the lock we should restore it.
From what I heard in Windows this boost happens randomly (don't quote me on
this). I am not sure could be our trigger mechanism. We sure don't want to do
this unconditionally otherwise we break fairness.
Maybe there are easier ways to introduce a simple such boost mechanism..
On 04/10/24 10:30, John Stultz wrote:
> On Tue, Apr 9, 2024 at 11:59 PM Qais Yousef <[email protected]> wrote:
> >
> > On 04/09/24 14:35, Vincent Guittot wrote:
> > > On Tue, 9 Apr 2024 at 08:19, Qais Yousef <[email protected]> wrote:
> > > >
> > > > On 04/08/24 12:51, John Stultz wrote:
> > > > > On Mon, Apr 8, 2024 at 12:17 AM Vincent Guittot
> > > > > <[email protected]> wrote:
> > > > > >
> > > > > > On Sun, 7 Apr 2024 at 14:27, Qais Yousef <[email protected]> wrote:
> > > > > > >
> > > > > > > On 04/05/24 18:16, Qais Yousef wrote:
> > > > > > >
> > > > > > > > >
> > > > > > > > > All that to say that I think the weight is not applied on purpose.
> > > > > > > > > This might work for your particular case but there are more changes to
> > > > > > > > > be done if you want to apply prio inheritance between cfs tasks.
> > > > > > > > >
> > > > > > > > > As an example, what about the impact of cgroup on the actual weight
> > > > > > > > > and the inherited priority of a task ? If the owner and the waiter
> > > > > > > > > don't belong to the same cgroup their own prio is meaningless... task
> > > > > > > > > nice -20 in a group with a weight equal to nice 19 vs a task nice 19
> > > > > > > > > in a group with a weight equals to nice -20
> > > > > > > >
> > > > > > > > That is on my mind actually. But I thought it's a separate problem. That has to
> > > > > > > > do with how we calculate the effective priority of the pi_task. And probably
> > > > > > > > the sorting order to if we agree we need to revert the above. If that is done
> > > > > > >
> > > > > > > Thinking more about it the revert is not the right thing to do. We want fair
> > > > > > > tasks to stay ordered in FIFO for better fairness and avoid potential
> > > > > > > starvation issues. It's just the logic for searching the top_waiter need to be
> > > > > > > different. If the top_waiter is fair, then we need to traverse the tree to find
> > > > > > > the highest nice value. We probably can keep track of this while adding items
> > > > > > > to the tree to avoid the search.
> > > > > > >
> > > > > > > For cgroup; is it reasonable (loosely speaking) to keep track of pi_cfs_rq and
> > > > > > > detach_attach_task_cfs_rq() before the reweight? This seems the most
> > > > > > > straightforward solution and will contain the complexity to keeping track of
> > > > > > > cfs_rq. But it'll have similar issue to proxy execution where a task that
> > > > > > > doesn't belong to the cgroup will consume its share..
> > > > > >
> > > > > > That's a good point, Would proxy execution be the simplest way to fix all this ?
> > > >
> > > > Is it? Over 4.5 years ago Unity reported to me about performance inversion
> > > > problem and that's when proxy execution work was revived as simplest way to fix
> > > > all of this. But still no end in sight from what I see. I was and still think
> > > > an interim solution in rt_mutex could help a lot of use cases already without
> > > > being too complex. Not as elegant and comprehensive like proxy execution, but
> > > > given the impact on both userspace and out of tree kernel hacks are growing
> > > > waiting for this to be ready, the cost of waiting is high IMHO.
> > > >
> > > > FWIW, I already heard several feedbacks that PTHREAD_PRIO_INHERIT does nothing.
> > > > I think this reweight issue is more serious problem and likely why I heard this
> > > > feedback. I could be underestimating the complexity of the fix though. So I'll
> > >
> > > Without cgroup, the solution could be straightforward but android uses
> > > extensively cgroup AFAICT and update_cfs_group() makes impossible to
> > > track the top cfs waiter and its "prio"
> >
> > :(
> >
> > IIUC the issue is that we can't easily come up with a single number of
> > 'effective prio' for N level hierarchy and compare it with another M level
> > hierarchy..
> >
> > Does proxy execution fix this problem then? If we can't find the top waiter,
> > I can't see how proxy execution would work here too. To my understanding it's
> > more about how we apply inheritance (by donating execution context of the top
> > waiter) instead of manually applying inheritance like we're doing now.
>
> So, while proxy provides a sort of generalized inheritance, it isn't
> deep enough in the class scheduler logic to need to really think about
> priority/cgroups.
>
> It just looks at what gets selected to run. That's the most important
> task at that moment. It doesn't really need to care about how/why,
> that's left to pick_next_task().
>
> Since it leaves mutex blocked tasks on the RQ, it allows the class
> scheduler logic to pick the most important task (mutex-blocked or not)
> to run. Then if a mutex-blocked task gets selected, we will then find
> the mutex owner and run it instead so it can release the lock. When
> locks are released, if the owner has a "donor" task, the lock is
> handed off to the donor. So, this basically uses the
> pick_next_task()'s evaluation of what it wanted to run to effectively
> provide the "top waiter".
Thanks John. So there's no top waiter and all tasks are left runnable, makes
sense.
On Tue, Apr 9, 2024 at 11:59 PM Qais Yousef <[email protected]> wrote:
>
> On 04/09/24 14:35, Vincent Guittot wrote:
> > On Tue, 9 Apr 2024 at 08:19, Qais Yousef <[email protected]> wrote:
> > >
> > > On 04/08/24 12:51, John Stultz wrote:
> > > > On Mon, Apr 8, 2024 at 12:17 AM Vincent Guittot
> > > > <[email protected]> wrote:
> > > > >
> > > > > On Sun, 7 Apr 2024 at 14:27, Qais Yousef <[email protected]> wrote:
> > > > > >
> > > > > > On 04/05/24 18:16, Qais Yousef wrote:
> > > > > >
> > > > > > > >
> > > > > > > > All that to say that I think the weight is not applied on purpose.
> > > > > > > > This might work for your particular case but there are more changes to
> > > > > > > > be done if you want to apply prio inheritance between cfs tasks.
> > > > > > > >
> > > > > > > > As an example, what about the impact of cgroup on the actual weight
> > > > > > > > and the inherited priority of a task ? If the owner and the waiter
> > > > > > > > don't belong to the same cgroup their own prio is meaningless... task
> > > > > > > > nice -20 in a group with a weight equal to nice 19 vs a task nice 19
> > > > > > > > in a group with a weight equals to nice -20
> > > > > > >
> > > > > > > That is on my mind actually. But I thought it's a separate problem. That has to
> > > > > > > do with how we calculate the effective priority of the pi_task. And probably
> > > > > > > the sorting order to if we agree we need to revert the above. If that is done
> > > > > >
> > > > > > Thinking more about it the revert is not the right thing to do. We want fair
> > > > > > tasks to stay ordered in FIFO for better fairness and avoid potential
> > > > > > starvation issues. It's just the logic for searching the top_waiter need to be
> > > > > > different. If the top_waiter is fair, then we need to traverse the tree to find
> > > > > > the highest nice value. We probably can keep track of this while adding items
> > > > > > to the tree to avoid the search.
> > > > > >
> > > > > > For cgroup; is it reasonable (loosely speaking) to keep track of pi_cfs_rq and
> > > > > > detach_attach_task_cfs_rq() before the reweight? This seems the most
> > > > > > straightforward solution and will contain the complexity to keeping track of
> > > > > > cfs_rq. But it'll have similar issue to proxy execution where a task that
> > > > > > doesn't belong to the cgroup will consume its share..
> > > > >
> > > > > That's a good point, Would proxy execution be the simplest way to fix all this ?
> > >
> > > Is it? Over 4.5 years ago Unity reported to me about performance inversion
> > > problem and that's when proxy execution work was revived as simplest way to fix
> > > all of this. But still no end in sight from what I see. I was and still think
> > > an interim solution in rt_mutex could help a lot of use cases already without
> > > being too complex. Not as elegant and comprehensive like proxy execution, but
> > > given the impact on both userspace and out of tree kernel hacks are growing
> > > waiting for this to be ready, the cost of waiting is high IMHO.
> > >
> > > FWIW, I already heard several feedbacks that PTHREAD_PRIO_INHERIT does nothing.
> > > I think this reweight issue is more serious problem and likely why I heard this
> > > feedback. I could be underestimating the complexity of the fix though So I'll
> >
> > Without cgroup, the solution could be straightforward but android uses
> > extensively cgroup AFAICT and update_cfs_group() makes impossible to
> > track the top cfs waiter and its "prio"
>
> :(
>
> IIUC the issue is that we can't easily come up with a single number of
> 'effective prio' for N level hierarchy and compare it with another M level
> hierarchy..
>
> Does proxy execution fix this problem then? If we can't find the top waiter,
> I can't see how proxy execution would work here too. To my understanding it's
> more about how we apply inheritance (by donating execution context of the top
> waiter) instead of manually applying inheritance like we're doing now.
So, while proxy provides a sort of generalized inheritance, it isn't
deep enough in the class scheduler logic to need to really think about
priority/cgroups.
It just looks at what gets selected to run. That's the most important
task at that moment. It doesn't really need to care about how/why,
that's left to pick_next_task().
Since it leaves mutex blocked tasks on the RQ, it allows the class
scheduler logic to pick the most important task (mutex-blocked or not)
to run. Then if a mutex-blocked task gets selected, we will then find
the mutex owner and run it instead so it can release the lock. When
locks are released, if the owner has a "donor" task, the lock is
handed off to the donor. So, this basically uses the
pick_next_task()'s evaluation of what it wanted to run to effectively
provide the "top waiter".
thanks
-john
On 04/10/24 16:47, Qais Yousef wrote:
> On 04/10/24 11:13, Vincent Guittot wrote:
>
> > > > Without cgroup, the solution could be straightforward but android uses
> > > > extensively cgroup AFAICT and update_cfs_group() makes impossible to
> > > > track the top cfs waiter and its "prio"
> > >
> > > :(
> > >
> > > IIUC the issue is that we can't easily come up with a single number of
> > > 'effective prio' for N level hierarchy and compare it with another M level
> > > hierarchy..
> >
> > And then how do you apply it on the hierarchy ?
>
> (I am not disagreeing with you, just trying to state the reasons more
> explicitly)
>
> I think the application is easy, attach to the leaf cfs_rq? Which IIUC
> correctly what should happen with proxy execution, but by consuming the context
> of the donor directly without having explicitly to move the lock owner.
>
> Finding out which hierarchy actually has the highest effective share is not
> straightforward I agree. And if we combine a potential operation of something
> that could move any waiting task to a different hierarchy at anytime, this gets
> even more complex.
>
> I need to go and find more info, but seems Windows has some kind of boost
> mechanism to help the lock owner to release the lock faster. I wonder if
> something like that could help as interim solution. What we could do is move
> the task to root group as a boost with the simple reweight operation proposed
> here applied. As soon as it releases the lock we should restore it.
>
> From what I heard in Windows this boost happens randomly (don't quote me on
> this). I am not sure could be our trigger mechanism. We sure don't want to do
> this unconditionally otherwise we break fairness.
>
> Maybe there are easier ways to introduce a simple such boost mechanism..
FWIW, trying to find the top-fair-waiter and that wasn't as trivial. I needed
to refactor a fair bit of the code that expects the top-waiter to be the
leftmost..
And I haven't looked at that temporary boost mechanism for cfs. Maybe I'll try
that if I get a chance. For the time being, I got this bandaid if anybody is
interested in a temporary 'solution'
--->8---
From 7169519792f11a73f861a41dd7d5c9151dc44dd7 Mon Sep 17 00:00:00 2001
From: Qais Yousef <[email protected]>
Date: Mon, 1 Apr 2024 03:04:00 +0100
Subject: [PATCH] sched/pi: Reweight fair_policy() tasks when inheriting prio
For fair tasks inheriting the priority (nice) without reweighting is
a NOP as the task's share won't change.
This is visible when running with PTHREAD_PRIO_INHERIT where fair tasks
with low priority values are susceptible to starvation leading to PI
like impact on lock contention.
The logic in rt_mutex will reset these low priority fair tasks into nice
0, but without the additional reweight operation to actually update the
weights, it doesn't have the desired impact of boosting them to allow
them to run sooner/longer to release the lock.
Apply the reweight for fair_policy() tasks to achieve the desired boost
for those low nice values tasks. Note that boost here means resetting
their nice to 0; as this is what the current logic does for fair tasks.
We need to re-instate ordering fair tasks by their priority order on the
waiter tree to ensure we inherit the top_waiter properly.
Handling of idle_policy() requires more code refactoring and is not
handled yet. idle_policy() are treated specially and only run when the
CPU is idle and get a hardcoded low weight value. Changing weights won't
be enough without a promotion first to SCHED_OTHER.
Tested with a test program that creates three threads.
1. main thread that spawns high prio and low prio task and busy
loops
2. low priority thread that holds a pthread_mutex() with
PTHREAD_PRIO_INHERIT protocol. Runs at nice +10. Busy loops
after holding the lock.
3. high priority thread that holds a pthread_mutex() with
PTHREADPTHREAD_PRIO_INHERIT, but made to start after the low
priority thread. Runs at nice 0. Should remain blocked by the
low priority thread.
All tasks are pinned to CPU0.
Without the patch I can see the low priority thread running only for
~10% of the time which is what expected without it being boosted.
With the patch the low priority thread runs for ~50% which is what
expected if it gets boosted to nice 0.
I modified the test program logic afterwards to ensure that after
releasing the lock the low priority thread goes back to running for 10%
of the time, and it does.
Reported-by: Yabin Cui <[email protected]>
Signed-off-by: Qais Yousef <[email protected]>
---
kernel/locking/rtmutex.c | 7 +------
kernel/sched/core.c | 4 +++-
2 files changed, 4 insertions(+), 7 deletions(-)
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 88d08eeb8bc0..4e155862aba1 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -345,12 +345,7 @@ static __always_inline bool unlock_rt_mutex_safe(struct rt_mutex_base *lock,
static __always_inline int __waiter_prio(struct task_struct *task)
{
- int prio = task->prio;
-
- if (!rt_prio(prio))
- return DEFAULT_PRIO;
-
- return prio;
+ return task->prio;
}
/*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1a914388144a..f22db270b0d9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7242,8 +7242,10 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
} else {
if (dl_prio(oldprio))
p->dl.pi_se = &p->dl;
- if (rt_prio(oldprio))
+ else if (rt_prio(oldprio))
p->rt.timeout = 0;
+ else if (!task_has_idle_policy(p))
+ reweight_task(p, prio - MAX_RT_PRIO);
}
__setscheduler_prio(p, prio);
--
2.34.1